Vowpal Wabbit - precision recall f-measure - machine-learning

How do you usually get precision, recall and f-measure from a model created in Vowpal Wabbit on a classification problem?
Are there any available scripts or programs that are commonly used for this with vw's output?
To make a minimal example, I use the following data in playtennis.txt:
2 | sunny 85 85 false
2 | sunny 80 90 true
1 | overcast 83 78 false
1 | rain 70 96 false
1 | rain 68 80 false
2 | rain 65 70 true
1 | overcast 64 65 true
2 | sunny 72 95 false
1 | sunny 69 70 false
1 | rain 75 80 false
1 | sunny 75 70 true
1 | overcast 72 90 true
1 | overcast 81 75 false
2 | rain 71 80 true
I create the model with:
vw playtennis.txt --oaa 2 -f playtennis.model --loss_function logistic
Then, I get predictions and raw predictions of the trained model on the training data itself with:
vw -t -i playtennis.model playtennis.txt -p playtennis.predict -r playtennis.rawp
Going from here, what scripts or programs do you usually use to get precision, recall and f-measure, given training data playtennis.txt and the predictions on the training data in playtennis.predict?
Also, if this were a multi-label classification problem (each instance can have more than one target label, which vw can also handle), would your proposed scripts or programs be capable of processing it?

Given that you have a pair of 'predicted vs actual' value for each example, you can use Rich Caruana's KDD perf utility to compute these (and many other) metrics.
In the case of multi-class, you should simply consider every correctly classified case a success and every class-mismatch a failure to predict correctly.
Here's a more detailed recipe for the binary case:
# get the labels into *.actual (correct) file
$ cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ cleanup trailing zeros)
$ paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert original (1,2) classes to binary (0,1):
$ perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
$ perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
Note that as Martin mentioned, vw uses the {-1, +1} convention for binary classification, whereas perf uses the {0, 1} convention so you may have to translate back and forth when switching between the two.
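If you'd rather stay in Python than install perf, here is a minimal sketch of my own (not part of the recipe above) using scikit-learn on the playtennis.actual and playtennis.predict files produced by the commands above:
# Minimal sketch: per-class precision/recall/F1 from the vw label and
# prediction files (one class label per line, classes 1/2).
from sklearn.metrics import precision_recall_fscore_support

def read_labels(path):
    # vw writes predictions like "2.000000"; int(float(...)) handles both.
    with open(path) as f:
        return [int(float(line.split()[0])) for line in f]

actual = read_labels('playtennis.actual')
predicted = read_labels('playtennis.predict')

prec, rec, f1, _ = precision_recall_fscore_support(
    actual, predicted, labels=[1, 2], average=None)
print('precision:', prec, 'recall:', rec, 'F1:', f1)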

For binary classification, I would recommend using labels +1 (play tennis) and -1 (don't play tennis) and --loss_function=logistic (although --oaa 2 with labels 1 and 2 can be used as well). VW then reports the logistic loss, which may be a more informative/useful evaluation measure than accuracy/precision/recall/F1 (depending on the application). If you want the 0/1 loss (i.e. "one minus accuracy"), add --binary.
For precision, recall, f1-score, auc and other measures, you can use the perf tool as suggested in arielf's answer.
For standard multi-class classification (one correct class for each example), use --oaa N --loss_function=logistic and VW will report the 0/1 loss.
For multi-label multi-class classification (more than one correct label allowed per example), you can use --multilabel_oaa N (or convert each original example into N binary-classification examples).
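For the multi-label case, here is a hedged sketch along the same lines: it assumes actual and predicted labels are stored one example per line as comma-separated lists (e.g. "1,3"); the file names are placeholders.
# Sketch: micro-averaged precision/recall/F1 for multi-label predictions.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_recall_fscore_support

def read_label_sets(path):
    with open(path) as f:
        return [set(line.strip().split(',')) if line.strip() else set()
                for line in f]

actual = read_label_sets('multilabel.actual')      # placeholder file name
predicted = read_label_sets('multilabel.predict')  # placeholder file name

mlb = MultiLabelBinarizer().fit(actual + predicted)
prec, rec, f1, _ = precision_recall_fscore_support(
    mlb.transform(actual), mlb.transform(predicted), average='micro')
print('micro precision:', prec, 'recall:', rec, 'F1:', f1)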

Related

machine learning model different inputs

I have a dataset that consists of: date, id (the id of the event), number_of_activities, and running_sum (the running sum of activities by id).
This is a part of my data:
date | id (id of the event) | number_of_activities | running_sum |
2017-01-06 | 156 | 1 | 1 |
2017-04-26 | 156 | 1 | 2 |
2017-07-04 | 156 | 2 | 4 |
2017-01-19 | 175 | 1 | 1 |
2017-03-17 | 175 | 3 | 4 |
2017-04-27 | 221 | 3 | 3 |
2017-05-05 | 221 | 7 | 10 |
2017-05-09 | 221 | 10 | 20 |
2017-05-19 | 221 | 1 | 21 |
2017-09-03 | 221 | 2 | 23 |
The goal is to predict the future number of activities for a given event. My question: can I train my model on the whole dataset (all the events) to predict the next one? If so, how, given that the number of inputs is unequal (the number of rows per event differs)? And is it possible to exploit the date data as well?
Sure you can. But a lot more information is needed, which you yourself know best.
I guess we are talking about time series here, since you want to predict the future.
You might want to have a look at recurrent neural nets and LSTMs:
A recurrent layer takes a time series as input and outputs a vector that contains the compressed information about the whole time series. So let's take event 156, which has 3 steps:
The event is your feature sequence, with 3 timesteps, and each timestep has a different number of activities (or features). To solve this, take the maximum sequence length that occurs and add a padding value (most often simply zero) so all sequences have the same length. Then you have a shape that is suitable for a recurrent neural net (where LSTMs are currently a good choice).
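As a concrete illustration of the padding step, here is a sketch I'm adding, using the per-event activity counts from the question as stand-in sequences:
# Zero-pad variable-length event sequences to a common length (pure numpy).
import numpy as np

sequences = [[1, 1, 2],            # event 156: 3 timesteps
             [1, 3],               # event 175: 2 timesteps
             [3, 7, 10, 1, 2]]     # event 221: 5 timesteps

max_len = max(len(s) for s in sequences)
padded = np.zeros((len(sequences), max_len))
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq     # left-aligned; zeros pad the tail

print(padded.shape)                # (3, 5) -- ready for an RNN/LSTM layer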
Update
You said in the comments that padding is not an option for you; let me try to convince you. LSTMs are good in situations where sequence lengths differ. However, for this to work you also need longer sequences from which the model can learn its patterns. What I want to say is: when some of your sequences have only a few timesteps, like 3, but others have 50 or more, the model may have difficulty predicting these correctly, as you have to specify which timesteps you want to use. So either you prepare your data differently to answer a clearer question, or you dig deeper into the topic using sequence-to-sequence learning, which is very good at handling sequences of different lengths. For this you will need to set up an encoder-decoder network.
The encoder squashes the whole sequence, whatever its length, into one vector that contains the compressed information of the entire sequence.
The decoder then learns to use this vector to predict the next outputs of the sequence. This is a well-known technique for machine translation, but it is suitable for any kind of sequence-to-sequence task. So I would recommend you create such an encoder-decoder network, which will surely improve your results. Have a look at this tutorial, which might help you further.
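Here is my rough sketch of such an encoder-decoder in Keras (not from the linked tutorial; the latent dimension and feature count are illustrative assumptions):
# LSTM encoder-decoder sketch: the encoder state seeds the decoder.
from keras.models import Model
from keras.layers import Input, LSTM, Dense

n_features, latent_dim = 1, 32

# Encoder: squash a variable-length sequence into one state vector.
encoder_inputs = Input(shape=(None, n_features))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: predict the next outputs, seeded with the encoder's state.
decoder_inputs = Input(shape=(None, n_features))
decoder_seq = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
outputs = Dense(n_features)(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='mse')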

Vowpal Wabbit: Cannot retrieve latent factors with gd_mf_weights from a trained --rank model

I trained a rank 40 model on the movielens data, but cannot retrieve the weights from the trained model with gd_mf_weights. I'm following the syntax from the VW matrix factorization example but it is giving me errors. Please advise.
Model training call:
vw --rank 40 -q ui --l2 0.1 --learning_rate 0.015 --decay_learning_rate 0.97 --power_t 0 --passes 50 --cache_file movielens.cache -f movielens.reg -d train.vw
Weights generating call:
library/gd_mf_weights -I train.vw -O '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '-q ui --rank 40 -i movielens.reg'
Error:
WARNING: model file has set of {-q, --cubic, --interactions} settings stored, but they'll be OVERRIDEN by set of {-q, --cubic, --interactions} settings from command line.
creating quadratic features for pairs: ui
finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
total feature number = 0
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::multiple_occurrences> >'
what(): option '--rank' cannot be specified more than once
Aborted (core dumped)
If I just run it without specifying the rank and interaction options, it doesn't return the same trained model, since the parameters displayed are different from before.
library/gd_mf_weights -I train.vw -O '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '-i movielens.reg'
creating quadratic features for pairs: ui
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
using no cache
Reading datafile =
num sources = 0
Segmentation fault (core dumped)
If I run weights generation with the entire set of model training parameters, it just ignores my extra parameters (and finishes much faster than 50 passes would take) and returns the same weights as a randomly initialized rank 40 model.
library/gd_mf_weights -I train.vw -O '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '--rank 40 -q ui --l2 0.1 --learning_rate 0.015 --decay_learning_rate 0.97 --power_t 0 --passes 50 --cache_file movielens.cache -f movielens.reg -d train.vw'

SVM Machine Learning: Feature representation in LibSVM

I'm working with LibSVM to classify written text (gender classification).
I'm having problems understanding how to create LibSVM training data with multiple features.
Training data in LibSVM is built like this:
label index1:value1 index2:value2
Let's say I want these features:
Top_k words: the k most-used words per label
Top_k bigrams: the k most-used bigrams
So, for example, the counts would look like this:
Word count:
|-------|-------|-----|-----|
| index | text  | +1  | -1  |
|-------|-------|-----|-----|
| 1     | this  | 3   | 3   |
| 2     | forum | 1   | 0   |
| 3     | is    | 10  | 12  |
| ...   | ...   | ..  | ..  |
|-------|-------|-----|-----|

Bigram count:
|-------|------|-----|-----|
| index | text | +1  | -1  |
|-------|------|-----|-----|
| 4     | bi   | 6   | 2   |
| 5     | gr   | 10  | 3   |
| 6     | am   | 8   | 10  |
| ...   | ...  | ..  | ..  |
|-------|------|-----|-----|
Let's say k = 2. Is this how a training instance would look? (These counts are not related to the tables above.)
Label Top_kWords1:33 Top_kWords2:27 Top_kBigrams1:30 Top_kBigrams2:25
Or does it look like this (does it matter if the features are mixed up)?
Label Top_kWords1:33 Top_kBigrams1:30 Top_kWords2:27 Top_kBigrams2:25
I just want to know what the feature vector looks like with multiple, different features, and how to build it.
EDIT:
With the updated tables above, is this training data correct?
Example
1 1:3 2:1 3:10 4:6 5:10 6:8
-1 1:3 2:0 3:12 4:2 5:3 6:10
The LibSVM representation is purely numeric, so
label index1:value1 index2:value2
means that each "label", "index" and "value" has to be a number. In your case you have to enumerate your features, for example:
1 1:23 2:47 3:0 4:1
If a feature has value 0, you can omit it:
1 1:23 2:47 4:1
Remember to list features in increasing index order.
In general, LibSVM is not designed to work with text, and I would not recommend doing it this way; rather, use an existing library that makes working with text easy and wraps around libsvm (such as NLTK or scikit-learn).
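As a small sketch of my own (with invented counts), building such lines in Python, keeping indices increasing and omitting zeros:
# Turn per-document feature counts into a libSVM line.
doc_counts = {1: 23, 2: 47, 3: 0, 4: 1}   # feature index -> count
label = 1

parts = ['%d:%g' % (idx, val)
         for idx, val in sorted(doc_counts.items()) if val != 0]
print(' '.join([str(label)] + parts))      # -> "1 1:23 2:47 4:1"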
Whatever k most-used words/bigrams you use for training may not be the most popular in your test set. If you use the most popular words in the English language, you will end up with the, and, and so on. Maybe beer and football are more suitable for classifying males, even if they are less popular. This preprocessing step is called feature selection and has nothing to do with the SVM itself. Once you have found selective features (beer, botox, ...), you enumerate them and feed them into SVM training.
For bigrams you could maybe omit feature selection, as there are at most 26*26 = 676 bigrams, making 676 features. But again, I suspect a bigram like be is not selective, as the selective match in beer is completely buried in the many matches in to be. But that is speculation; you have to measure the quality of your features.
Also, if you use word/bigram counts, you should normalize them, i.e. divide them by the overall word/bigram count of your document. Otherwise shorter documents in your training set will have less weight than longer ones.
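A quick sketch of that normalization (again with invented counts):
# Divide each count by the document's total so document length cancels out.
counts = {1: 23, 2: 47, 4: 1}
total = float(sum(counts.values()))
normalized = {idx: val / total for idx, val in counts.items()}
print(normalized)   # {1: 0.3239..., 2: 0.6619..., 4: 0.0140...}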

Backpropagation: when to update weights?

Could you please help me with a neural network?
If I have an arbitrary dataset:
+---+---------+---------+--------------+--------------+--------------+--------------+
| i | Input 1 | Input 2 | Exp.Output 1 | Exp.Output 2 | Act.output 1 | Act.output 2 |
+---+---------+---------+--------------+--------------+--------------+--------------+
| 1 | 0.1 | 0.2 | 1 | 2 | 2 | 4 |
| 2 | 0.3 | 0.8 | 3 | 5 | 8 | 10 |
+---+---------+---------+--------------+--------------+--------------+--------------+
Let's say I have x hidden layers with different numbers of neurons and different types of activation functions each.
When running backpropagation (especially iRprop+), when do I update the weights? Do I update them after calculating each line from the dataset?
I've read that batch learning is often not as efficient as "on-line" training. That means that it is better to update the weights after each line, right?
And do I understand it correctly: an epoch is when you have looped through each line in the input dataset? If so, that would mean that in one epoch, the weights will be updated twice?
Then, where does the total network error (see below) come into play?
[image: the total network error formula, linked from the original post]
tl;dr:
Please help me understand how backprop works.
Typically, you would update the weights after each example in the data set (I assume that's what you mean by each line). So, for each example, you would see what the neural network thinks the output should be (storing the outputs of each neuron in the process) and then backpropagate the error. So, starting with the final output, compare the ANN's output with the actual output (what the data set says it should be) and update the weights according to a learning rate.
The learning rate should be a small constant, since you are correcting weights after each and every example. An epoch is one iteration through every example in the data set, so yes, with your two rows the weights are updated twice per epoch.
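Here is a toy sketch of my own (using the two rows from the question) of on-line updates for a single linear neuron; note the weights change twice per epoch, once per example:
# Per-example ("on-line") weight updates for a single linear neuron.
import numpy as np

X = np.array([[0.1, 0.2], [0.3, 0.8]])   # Input 1, Input 2
t = np.array([1.0, 3.0])                  # Exp.Output 1 for each example
w = np.zeros(2)
lr = 0.1                                  # small constant learning rate

for epoch in range(1000):                 # one epoch = one pass over the data
    for x_i, t_i in zip(X, t):            # update after EACH example
        error = t_i - w.dot(x_i)          # forward pass + error
        w += lr * error * x_i             # gradient step for this example

print(w)   # tends toward the exact solution w = [10, 0]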

OpenCV 2.4.3 Haar Classifier Error AdaBoost misclass

I am using OpenCV 2.4.3 on Ubuntu 12.10 64-bit, and when I run opencv_haartraining I get the error message shown below. The training continues, so I don't think it is a critical error, but it nonetheless blatantly says 'Error'. I can't seem to find any solutions for this. What does it mean (what is AdaBoost), why is it complaining about a 'misclass', and how can I fix it? Anything I found on Google referred to this as simply a 'warning' and basically said to forget about it. Thanks!
cd dots ; nice -20 opencv_haartraining -data dots_haarcascade -vec samples.vec -bg negatives.dat -nstages 20 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 13 -nneg 10 -w 10 -h 10 -nonsym -mem 4000 -mode ALL
Data dir name: dots_w10_h10_haarcascade
Vec file name: samples.vec
BG file name: negatives.dat, is a vecfile: no
Num pos: 13
Num neg: 10
Num stages: 20
Num splits: 2 (tree as weak classifier)
Mem: 4000 MB
Symmetric: FALSE
Min hit rate: 0.999000
Max false alarm rate: 0.500000
Weight trimming: 0.950000
Equal weights: FALSE
Mode: ALL
Width: 10
Height: 10
Applied boosting algorithm: GAB
Error (valid only for Discrete and Real AdaBoost): misclass
Max number of splits in tree cascade: 0
Min number of positive samples per cluster: 500
Required leaf false alarm rate: 9.53674e-07
Stage 0 loaded
Stage 1 loaded
Stage 2 loaded
Stage 3 loaded
Stage 4 loaded
Stage 5 loaded
Stage 6 loaded
Stage 7 loaded
Tree Classifier
Stage
+---+---+---+---+---+---+---+---+
| 0| 1| 2| 3| 4| 5| 6| 7|
+---+---+---+---+---+---+---+---+
0---1---2---3---4---5---6---7
Number of features used : 7544
Parent node: 7
*** 1 cluster ***
POS: 13 96 0.135417
I don't think this is an error message; rather, it is a printout describing how the algorithm will measure its internal error rate. In this case it is using misclassification of the examples. Real and Discrete AdaBoost map input samples onto the output range [0, 1], so there is a meaningful way of measuring the inaccuracy of the algorithm. If a different variant of AdaBoost is being used, this error measure might cease to be meaningful.
