Vowpal Wabbit Matrix Factorization on one label

What I'm after is a recommender system for the web, something like "related products". Given the items a user has bought, I want to find related items based on what other users have bought. I've followed the MovieLens tutorial (https://github.com/JohnLangford/vowpal_wabbit/wiki/Matrix-factorization-example) for making a recommender system.
In the example above the users gave the movies a score (1-5). The model can then predict the score a user will give a specific item.
My data, on the other hand, only knows what the user likes. I don't know what they dislike or how much they like something. So I've tried sending 1 as the value on all my entries, but that only gives me a model that returns 1 on every prediction.
Any ideas on how I can structure my data so that I get a prediction, between 0 and 1, of how likely the user is to like an item?
Example data:
1.0 |user 1 |item 1
1.0 |user 1 |item 2
1.0 |user 2 |item 2
1.0 |user 2 |item 3
1.0 |user 3 |item 1
1.0 |user 3 |item 3
Training command:
cat test.vw | vw /dev/stdin -b 18 -q ui --rank 10 --l2 0.001 --learning_rate 0.015 --passes 20 --decay_learning_rate 0.97 --power_t 0 -f test.reg --cache_file test.cache

Short answer to the question:
To get a prediction resembling "probabilities" you could use --loss_function logistic --link logistic. Be aware that in this single-label setting your probabilities risk tending to 1.0 quickly (i.e. become meaningless).
Additional notes:
Working with a single label is problematic because there is nothing for the learner to separate: every example has the same target, so eventually the learner will peg all predictions to 1.0. To counter that, it is recommended to use --noconstant, use strong regularization, decrease the learning rate, avoid multiple passes, etc. (in other words, anything that avoids over-fitting to the single label).
Even better: add examples where the user hasn't bought/clicked, they should be plentiful, this will make your model much more robust and meaningful.
There's a better implementation of matrix factorization in vw (much faster and lighter on IO for big models). Check the --lrq option and the full demo under demo/movielens in the source tree.
You should pass the training set directly to vw (as a file argument, rather than through cat and /dev/stdin) to avoid a "useless use of cat".
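As a rough illustration of the advice above to add examples where the user hasn't bought an item, here is a minimal Python sketch that emits VW-format lines with both positive and negative examples. It assumes that any item a user has not bought can be treated as a (weak) negative, which is only an approximation. The function name add_negative_examples, the negatives_per_positive parameter, and the choice of +1/-1 labels (suitable for --loss_function logistic) are illustrative assumptions, not part of vw.

```python
import random


def add_negative_examples(purchases, negatives_per_positive=1, seed=0):
    """Build VW lines: label 1 for bought pairs, -1 for sampled unbought pairs.

    purchases maps a user id to the set of item ids that user bought.
    Treating unbought items as negatives is an assumption of this sketch.
    """
    rng = random.Random(seed)
    all_items = sorted({item for items in purchases.values() for item in items})
    lines = []
    for user in sorted(purchases):
        bought = purchases[user]
        not_bought = [item for item in all_items if item not in bought]
        # Positive examples: items the user actually bought.
        for item in sorted(bought):
            lines.append(f"1 |user {user} |item {item}")
        # Negative examples: a random sample of items the user did not buy.
        k = min(len(not_bought), negatives_per_positive * len(bought))
        for item in rng.sample(not_bought, k):
            lines.append(f"-1 |user {user} |item {item}")
    return lines


# The three users and three items from the example data in the question.
purchases = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}}
lines = add_negative_examples(purchases)
for line in lines:
    print(line)
```

With real data there are usually far more unbought items than bought ones, so the sampling ratio is a knob worth experimenting with.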

Related

vowpal-wabbit: use of multiple passes, holdout, & holdout-period to avoid overfitting?

I would like to train a binary sigmoidal feedforward network for category classification with the following command, using the awesome Vowpal Wabbit tool:
vw --binary --nn 4 train.vw -f category.model
And test it:
vw --binary -t -i category.model -p test.vw
But I got very bad results (compared to my linear SVM estimator).
I found a comment saying that I should use the number-of-training-passes argument (--passes arg).
So my question is: how do I choose the number of training passes so that the model does not overfit?
P.S. Should I use the holdout_period argument, and if so, how?

The test command in the question is incorrect. It has no input (-p ... indicates output predictions). Also it is not clear if you want to test or predict because it says test but the command used has -p ...
Test means you have labeled-data and you're evaluating the quality of your model. Strictly speaking: predict means you don't have labels, so you can't actually know how good your predictions are. Practically, you may also predict on held-out, labeled data (pretending it has no labels by ignoring them) and then evaluate how good these predictions are, since you actually have labels.
Generally:
if you want to do binary classification, you should use labels in {-1, 1} and use --loss_function logistic. --binary is an independent option meaning you want the predictions themselves to be binary (which gives you less information).
if you already have a separate test-set with labels, you don't need to holdout.
The holdout mechanism in vw was designed to replace the test-set and avoid over-fitting. It is only relevant when multiple passes are used, because in a single pass all examples are effectively held out: each next (yet unseen) example is treated 1) as unlabeled for prediction, and then 2) as labeled for testing and model update. In other words, your train-set is effectively also your test-set.
So you can either do multiple passes on the train-set with no holdout:
vw --loss_function logistic --nn 4 -c --passes 2 --holdout_off train.vw -f model
and then test the model with a separate and labeled, test-set:
vw -t -i model test.vw
or do multiple passes on the same train-set with some hold-out as a test set.
vw --loss_function logistic --nn 4 -c --passes 20 --holdout_period 7 train.vw -f model
If you don't have a test-set, and you want to fit-stronger by using multiple-passes, you can ask vw to hold-out every Nth example (the default N is 10, but you may override it explicitly using --holdout_period <N> as seen above). In this case, you can specify a higher number of passes because vw will automatically do early-termination when the loss on the held-out set starts growing.
You'd notice you hit early termination since vw will print something like:
passes used = 5
...
average loss = 0.06074 h
This indicates that only 5 out of N passes were actually used before early stopping, and that the error on the held-out subset of examples is 0.06074 (the trailing h indicates this is held-out loss).
As you can see, the number of passes, and the holdout-period are completely independent options.
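To make the two ideas concrete, here is a small Python sketch of holding out every Nth example and of stopping once the held-out loss stops improving. This is not vw's actual code: exactly which examples vw holds out and the precise stopping rule it applies are assumptions here, and the function names split_holdout and passes_before_stopping are hypothetical. The loss values in the usage lines are made-up numbers for illustration only.

```python
def split_holdout(examples, holdout_period=10):
    """Hold out every holdout_period-th example; train on the rest."""
    train, heldout = [], []
    for index, example in enumerate(examples, start=1):
        if index % holdout_period == 0:
            heldout.append(example)
        else:
            train.append(example)
    return train, heldout


def passes_before_stopping(heldout_losses):
    """Return the last pass (1-based) before the held-out loss stops improving."""
    best_pass = 1
    for pass_number in range(2, len(heldout_losses) + 1):
        # Stop at the first pass whose held-out loss is not strictly better.
        if heldout_losses[pass_number - 1] >= heldout_losses[best_pass - 1]:
            break
        best_pass = pass_number
    return best_pass


# 20 dummy examples, holding out every 10th one.
train, heldout = split_holdout(list(range(1, 21)), holdout_period=10)
print(len(train), heldout)

# Made-up held-out losses per pass: improvement stops after pass 3.
print(passes_before_stopping([0.5, 0.3, 0.2, 0.25, 0.1]))
```

The sketch also shows why the two options are independent: the holdout period only decides which examples are set aside, while the number of passes is an upper bound that early stopping may cut short.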
To improve and get more confidence in your model, you could use other optimizations, vary the holdout_period, try other --nn args. You may also want to check the vw-hypersearch utility (in the utl subdirectory) to help find better hyper-parameters.
Here's an example of using vw-hypersearch on one of the test-sets included with the source:
$ vw-hypersearch 1 20 vw --loss_function logistic --nn % -c --passes 20 --holdout_period 11 test/train-sets/rcv1_small.dat --binary
trying 13 ............. 0.133333 (best)
trying 8 ............. 0.122222 (best)
trying 5 ............. 0.088889 (best)
trying 3 ............. 0.111111
trying 6 ............. 0.1
trying 4 ............. 0.088889 (best)
loss(4) == loss(5): 0.088889
5 0.08888
Indicating that either 4 or 5 should be good parameters for --nn yielding a loss of 0.08888 on a hold-out subset of 1 in 11 examples.

LibSVM - Multi class classification with unbalanced data

I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have the number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category 7. 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm.
As a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
Using 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (CV is about 30-40 and my accuracy is about 50%).
Then I thought about my data and saw that it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has a weight option, so I would now like to set up good weights.
So far I'm doing this:
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification and not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me?
Thank you in advance for your help.

I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use an approximately equal number of samples from each class. You can use all category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps!
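The subsampling described above can be sketched in Python as follows. This is a minimal illustration, not libsvm functionality: the function name undersample and its parameters are hypothetical, the samples are stand-in integer ids rather than real descriptors, and the cap of 164 (instead of "about 150") is an assumption chosen so that categories 4 and 6 are kept whole.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, max_per_class, seed=0):
    """Keep at most max_per_class randomly chosen samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) > max_per_class:
            members = rng.sample(members, max_per_class)
        balanced.extend((sample, label) for sample in members)
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Class sizes taken from the question; sample ids are placeholders.
counts = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}
labels = [category for category, n in counts.items() for _ in range(n)]
samples = list(range(len(labels)))

balanced = undersample(samples, labels, max_per_class=164)
print(Counter(label for _, label in balanced))
```

Only the training split should be subsampled this way; evaluating on a test set with the original class proportions gives a more honest picture of accuracy.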

Vowpal Wabbit not predicting binary values, maybe overtraining?

I am trying to use Vowpal Wabbit to do a binary classification, i.e. given feature values vw will classify it either 1 or 0. This is how I have the training data formatted.
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
-1 'name2 | feature1:1 feature2:0 feature3:5 feature4:2565 ...
etc
I have about 30,000 data points labeled 1 and about 3,000 labeled 0. I have 100 data points labeled 1 and 100 labeled 0 that I'm using to test on after I create the model. These test data points are labeled 1 by default. Here is how I format the prediction set:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
From my understanding of the VW documentation, I need to use either the logistic or hinge loss_function for binary classifications. This is how I've been creating the model:
vw -d ../training_set.txt --loss_function logistic/hinge -f model
And this is how I try the predictions:
vw -d ../test_set.txt --loss_function logistic/hinge -i model -t -p /dev/stdout
However, this is where I'm running into problems. If I use the hinge loss function, all the predictions are -1. When I use the logistic loss function, I get arbitrary values between 5 and 11. There is a general trend for data points that should be 0 to be lower values, 5-7, and for data points that should be 1 to be from 6-11. What am I doing wrong? I've looked around the documentation and checked a bunch of articles about VW to see if I can identify what my problem is, but I can't figure it out. Ideally I would get a 0,1 value, or a value between 0 and 1 which corresponds to how strong VW thinks the result is. Any help would be appreciated!

If the output should be just -1 and +1 labels, use the --binary option (when testing).
If the output should be a real number between 0 and 1, use --loss_function=logistic --link=logistic. The loss_function=logistic is needed when training, so the number can be interpreted as probability.
If the output should be a real number between -1 and 1, use --link=glf1.
If your training data is unbalanced, e.g. 10 times more positive examples than negative, but your test data is balanced (and you want to get the best loss on this test data), set the importance weight of the positive examples to 0.1 (because there are 10 times more positive examples).
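To illustrate the options above, here is a short Python sketch of what the two link functions do to a raw score, plus a helper that adds an importance weight to positive examples. The formulas are the standard logistic function and its rescaling to (-1, 1); that they match vw's implementation exactly is an assumption. The helper with_importance is hypothetical and assumes input lines of the form "label 'tag | features" with no importance weight already present.

```python
import math


def logistic_link(raw):
    """Map a raw score to (0, 1) using the logistic function."""
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    # Equivalent form that avoids overflow for very negative scores.
    e = math.exp(raw)
    return e / (1.0 + e)


def glf1_link(raw):
    """Map a raw score to (-1, 1) by rescaling the logistic function."""
    return 2.0 * logistic_link(raw) - 1.0


def with_importance(line, weight, positive_label="1"):
    """Insert an importance weight after the label of positive examples."""
    label, rest = line.split(" ", 1)
    if label == positive_label:
        return f"{label} {weight} {rest}"
    return line


# Raw scores in the 5 to 11 range, as reported in the question.
for raw in (5.0, 7.0, 11.0):
    print(raw, logistic_link(raw), glf1_link(raw))

print(with_importance("1 'name | feature1:0 feature2:1", 0.1))
print(with_importance("-1 'name2 | feature1:1 feature2:0", 0.1))
```

Note that a raw score of 5 already maps to roughly 0.993 under the logistic link, so raw scores between 5 and 11 all correspond to confident positive predictions, which is consistent with a training set dominated by positive examples.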

Independently of your tool and/or specific algorithm, you can use "learning curves" and train/cross-validation/test splitting to diagnose your algorithm and determine what your problem is. After diagnosing the problem you can adjust your algorithm; for example, if you find you are over-fitting, you can take actions like:
Add regularization
Get more training data
Reduce the complexity of your model
Eliminate redundant features.
You can refer to Andrew Ng's "Advice for machine learning" videos on YouTube for more details on this subject.

why normalizing feature values doesn't change the training output much?

I have 3113 training examples, over a dense feature vector of size 78. The magnitude of features is different: some around 20, some 200K. For example, here is one of the training examples, in vowpal-wabbit input format.
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:9.670000 F1:0.130000 F2:0.320000 F3:0.570000 F4:9.837000 F5:9.593000 F6:9.238150 F7:9.646667 F8:9.631333 F9:8.338904 F10:9.748000 F11:10.227667 F12:10.253667 F13:9.800000 F14:0.010000 F15:0.030000 F16:-0.270000 F17:10.015000 F18:9.726000 F19:9.367100 F20:9.800000 F21:9.792667 F22:8.457452 F23:9.972000 F24:10.394833 F25:10.412667 F26:9.600000 F27:0.090000 F28:0.230000 F29:0.370000 F30:9.733000 F31:9.413000 F32:9.095150 F33:9.586667 F34:9.466000 F35:8.216658 F36:9.682000 F37:10.048333 F38:10.072000 F39:9.780000 F40:0.020000 F41:-0.060000 F42:-0.560000 F43:9.898000 F44:9.537500 F45:9.213700 F46:9.740000 F47:9.628000 F48:8.327233 F49:9.924000 F50:10.216333 F51:10.226667 F52:127925000.000000 F53:-15198000.000000 F54:-72286000.000000 F55:-196161000.000000 F56:143342800.000000 F57:148948500.000000 F58:118894335.000000 F59:119027666.666667 F60:181170133.333333 F61:89209167.123288 F62:141400600.000000 F63:241658716.666667 F64:199031688.888889 F65:132549.000000 F66:-16597.000000 F67:-77416.000000 F68:-205999.000000 F69:144690.000000 F70:155022.850000 F71:122618.450000 F72:123340.666667 F73:187013.300000 F74:99751.769863 F75:144013.200000 F76:237918.433333 F77:195173.377778
The training result was not good, so I thought I would normalize the features to bring them to the same magnitude. I calculated the mean and standard deviation of each feature across all examples, then computed newValue = (oldValue - mean) / stddev, so that each feature's new mean is 0 and its new stddev is 1. For the same example, here are the feature values after normalization:
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:-0.660690 F1:0.226462 F2:0.383638 F3:0.398393 F4:-0.644898 F5:-0.670712 F6:-0.758233 F7:-0.663447 F8:-0.667865 F9:-0.960165 F10:-0.653406 F11:-0.610559 F12:-0.612965 F13:-0.659234 F14:0.027834 F15:0.038049 F16:-0.201668 F17:-0.638971 F18:-0.668556 F19:-0.754856 F20:-0.659535 F21:-0.663001 F22:-0.953793 F23:-0.642736 F24:-0.606725 F25:-0.609946 F26:-0.657141 F27:0.173106 F28:0.310076 F29:0.295814 F30:-0.644357 F31:-0.678860 F32:-0.764422 F33:-0.658869 F34:-0.674367 F35:-0.968679 F36:-0.649145 F37:-0.616868 F38:-0.619564 F39:-0.649498 F40:0.041261 F41:-0.066987 F42:-0.355693 F43:-0.638604 F44:-0.676379 F45:-0.761250 F46:-0.653962 F47:-0.668194 F48:-0.962591 F49:-0.635441 F50:-0.611600 F51:-0.615670 F52:-0.593324 F53:-0.030322 F54:-0.095290 F55:-0.139602 F56:-0.652741 F57:-0.675629 F58:-0.851058 F59:-0.642028 F60:-0.648002 F61:-0.952896 F62:-0.629172 F63:-0.592340 F64:-0.682273 F65:-0.470121 F66:-0.045396 F67:-0.128265 F68:-0.185295 F69:-0.510251 F70:-0.515335 F71:-0.687727 F72:-0.512749 F73:-0.471032 F74:-0.789335 F75:-0.491188 F76:-0.400105 F77:-0.505242
However, this yields basically the same testing result (if not exactly the same, since I shuffle the examples before each training).
Wondering why there is no change in the result?
Here is my training and testing commands:
rm -f cache
cat input.feat | vw -f model --passes 20 --cache_file cache
cat input.feat | vw -i model -t -p predictions --invert_hash readable_model
(Yes, I'm testing on the training data right now since I have only very few data examples to train on.)
More context:
Some of the features are "tier 2" - they were derived by manipulating or doing cross products on "tier 1" features (e.g. moving average, 1-3 order of derivatives, etc). If I normalize the tier 1 features before calculating the tier 2 features, it would actually improve the model significantly.
So I'm puzzled as to why normalizing tier 1 features (before generating tier 2 features) helps a lot, while normalizing all features (after generating tier 2 features) doesn't help at all.
BTW, since I'm training a regressor, I'm using SSE as the metric to judge the quality of the model.
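For reference, the per-feature standardization described in the question, newValue = (oldValue - mean) / stddev, can be sketched in Python as below. The function name standardize_columns is hypothetical, the use of the population standard deviation is an assumption (the question does not say which variant was used), and the tiny input matrix is made up for illustration.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Return rows with each column shifted to mean 0 and scaled to stddev 1."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stddevs = [pstdev(column) for column in columns]
    return [
        # A constant column (stddev 0) is mapped to 0.0 to avoid dividing by zero.
        [(value - m) / s if s > 0 else 0.0
         for value, m, s in zip(row, means, stddevs)]
        for row in rows
    ]


# Two made-up features on very different scales.
rows = [[1.0, 100000.0], [2.0, 200000.0], [3.0, 300000.0]]
standardized = standardize_columns(rows)
for row in standardized:
    print(row)
```

After standardization the two columns carry the same values, which shows that this transform removes scale differences only; it does not change how the features relate to each other.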

vw normalizes feature values for scale as it goes, by default.
This is part of the online algorithm. It is done gradually during runtime.
In fact it does more than that: vw's enhanced SGD algorithm also keeps separate (per-feature) learning rates, so the learning rates of rarer features don't decay as fast as those of common ones (--adaptive). Finally, there's an importance-aware update, controlled by a third option (--invariant).
The 3 separate SGD enhancement options (which are all turned on by default) are:
--adaptive
--invariant
--normalized
The last option is the one that adjusts values for scale (discounts large values vs. small ones). You may disable all these SGD enhancements by using the option --sgd. You may also partially enable any subset by explicitly specifying it.
All in all you have 2^3 = 8 SGD option combinations you can use.

A possible reason is that whatever training algorithm you used to get the result already did the normalization for you. In fact, many algorithms normalize the data before working on it. Hope it helps you :)

Can Vowpal Wabbit handle datasize ~ 90 GB?

We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of a few hours. In addition, VW uses a hash function which takes almost no RAM. But when we run logistic regression using VW on our data, within a few minutes it uses up all of the RAM and then stalls.
This is the command we use:
vw -d train_output --power_t 1 --cache_file train.cache -f data.model
--compressed --loss_function logistic --adaptive --invariant
--l2 0.8e-8 --invert_hash train.model
train_output is the input file we want to train VW on, and train.model is the expected model obtained after training
Any help is welcome!

I've found the --invert_hash option to be extremely costly; try running without that option. You can also try turning on the --l1 regularization option to reduce the number of coefficients in the model.
How many features do you have in your model? How many features per row are there?
