Training a logistic regression in Mahout: strange behaviour

I'm trying to train a logistic regression model in Mahout. The command I use is this:
mahout trainlogistic --input /home/cloudera/Desktop/final.csv --output /home/cloudera/Desktop/model/model --target Action --predictors Open High Close --types word --features 20 --passes 100 --rate 50 --categories 2
The files I use actually exist. I'm reading a book that says that I should expect an output that looks like
Action ~ 647.186*Close+-44.975*High+3.269*Intercept term +-601.454*Open
and then a 4x2 matrix.
What I actually get is a terminal being filled with calculations, no Action ~, and a 5x4 matrix.
What am I doing wrong?

It turned out that my predictors were numeric, so they should have been declared with --types numeric rather than --types word. Why the book I referenced treated them as words, I have no idea.
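To illustrate why declaring a numeric column as a word changes the model, here is a minimal Python sketch. It is not Mahout's actual encoder: the functions bucket, encode_word and encode_numeric are hypothetical, and crc32 is only a stand-in hash. The point is that a "word" turns every distinct string into its own indicator feature, while a "numeric" column keeps one feature per column that carries the magnitude.

```python
import zlib

NUM_FEATURES = 20  # mirrors --features 20 in the command above


def bucket(name):
    """Map a string to a slot in the feature vector (stand-in hash, an assumption)."""
    return zlib.crc32(name.encode("utf-8")) % NUM_FEATURES


def encode_word(column, value):
    """Treat the value as a category: each distinct string sets one indicator slot."""
    vec = [0.0] * NUM_FEATURES
    vec[bucket(column + "=" + value)] = 1.0
    return vec


def encode_numeric(column, value):
    """Treat the value as a number: one slot per column, holding the magnitude."""
    vec = [0.0] * NUM_FEATURES
    vec[bucket(column)] = float(value)
    return vec


# Two nearly identical prices: as words they are unrelated categories,
# as numbers they share a slot and differ only slightly in value.
a = encode_word("Close", "101.25")
b = encode_word("Close", "101.26")
c = encode_numeric("Close", "101.25")
d = encode_numeric("Close", "101.26")
```

With price-like columns such as Open, High and Close, almost every row holds a distinct string, so the word encoding gives the model little to generalize from.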

Related

vowpal-wabbit: use of multiple passes, holdout, & holdout-period to avoid overfitting?

I would like to train a binary sigmoidal feedforward network for category classification with the following command, using the awesome vowpal wabbit tool:
vw --binary --nn 4 train.vw -f category.model
And test it:
vw --binary -t -i category.model -p test.vw
But I got very bad results (compared to my linear SVM estimator).
I found a comment saying that I should use the number-of-training-passes argument (--passes arg).
So my question is: how do I choose the number of training passes so that the model does not become overtrained?
P.S. Should I use the holdout_period argument, and if so, how?
The test command in the question is incorrect. It has no input (-p ... indicates output predictions). Also it is not clear if you want to test or predict because it says test but the command used has -p ...
Test means you have labeled-data and you're evaluating the quality of your model. Strictly speaking: predict means you don't have labels, so you can't actually know how good your predictions are. Practically, you may also predict on held-out, labeled data (pretending it has no labels by ignoring them) and then evaluate how good these predictions are, since you actually have labels.
Generally:
if you want to do binary classification, you should use labels in {-1, 1} and use --loss_function logistic. --binary is an independent option meaning you want predictions to be binary (giving you less info).
if you already have a separate test-set with labels, you don't need to holdout.
The holdout mechanism in vw was designed to replace the test-set and avoid over-fitting. It is only relevant when multiple passes are used, because in a single pass all examples are effectively held-out: each next (yet unseen) example is treated as 1) unlabeled for prediction, and as 2) labeled for testing and model-update. In other words, your train-set is effectively also your test-set.
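The single-pass behaviour described above (predict first, then learn) can be sketched in a few lines of Python. This is an illustrative online logistic regression, not vw's implementation; the function name progressive_validation, the learning rate and the toy data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def progressive_validation(examples, learning_rate=0.5):
    """One online pass: score each example before learning from it.

    examples: list of (features, label) pairs with label in {-1, +1}.
    Returns (weights, average logistic loss measured on not-yet-seen examples).
    """
    w = [0.0] * len(examples[0][0])
    total_loss = 0.0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        # 1) test: the example is treated as unlabeled and only scored
        total_loss += math.log(1.0 + math.exp(-y * score))
        # 2) train: the label is revealed and one gradient step is taken
        grad_scale = -y * sigmoid(-y * score)
        w = [wi - learning_rate * grad_scale * xi for wi, xi in zip(w, x)]
    return w, total_loss / len(examples)


# Toy data: first feature is a constant bias term, second carries the signal.
data = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 1.0], 1), ([1.0, -2.0], -1)] * 5
weights, avg_loss = progressive_validation(data)
```

Because every loss term is computed before the corresponding update, the reported average is an honest out-of-sample estimate for a single pass. That guarantee disappears on the second pass, which is where the holdout comes in.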
So you can either do multiple passes on the train-set with no holdout:
vw --loss_function logistic --nn 4 -c --passes 2 --holdout_off train.vw -f model
and then test the model with a separate and labeled, test-set:
vw -t -i model test.vw
or do multiple passes on the same train-set with some hold-out as a test set.
vw --loss_function logistic --nn 4 -c --passes 20 --holdout_period 7 train.vw -f model
If you don't have a test-set, and you want to fit-stronger by using multiple-passes, you can ask vw to hold-out every Nth example (the default N is 10, but you may override it explicitly using --holdout_period <N> as seen above). In this case, you can specify a higher number of passes because vw will automatically do early-termination when the loss on the held-out set starts growing.
You'd notice you hit early termination since vw will print something like:
passes used = 5
...
average loss = 0.06074 h
Indicating that only 5 out of N passes were actually used before early stopping, and the error on the held-out subset of examples is 0.06074 (the trailing h indicates this is held-out loss).
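As a rough sketch of the idea (not vw's code), the following Python holds out every Nth example, trains for several passes on the rest, and stops once the held-out loss no longer improves. The function train_with_holdout, its defaults and the synthetic data are hypothetical, and the stopping rule is simplified: it halts at the first non-improving pass, whereas vw's own rule may tolerate a few such passes.

```python
import math
import random


def logistic_loss(w, x, y):
    score = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * score))


def train_with_holdout(examples, passes=20, holdout_period=10, learning_rate=0.1):
    """Multi-pass SGD where every holdout_period-th example is used only for testing.

    Returns (best weights, number of passes used, best held-out loss).
    """
    train = [ex for i, ex in enumerate(examples, 1) if i % holdout_period != 0]
    held_out = [ex for i, ex in enumerate(examples, 1) if i % holdout_period == 0]
    w = [0.0] * len(examples[0][0])
    best_w, best_loss, passes_used = list(w), float("inf"), 0
    for p in range(1, passes + 1):
        for x, y in train:
            score = sum(wi * xi for wi, xi in zip(w, x))
            g = -y / (1.0 + math.exp(y * score))  # gradient of the logistic loss w.r.t. score
            w = [wi - learning_rate * g * xi for wi, xi in zip(w, x)]
        loss = sum(logistic_loss(w, x, y) for x, y in held_out) / len(held_out)
        if loss >= best_loss:
            break  # held-out loss stopped improving: early termination
        best_w, best_loss, passes_used = list(w), loss, p
    return best_w, passes_used, best_loss


# Synthetic, noisy, linearly separable-ish data (an assumption for the demo).
random.seed(0)
examples = []
for _ in range(200):
    x1 = random.uniform(-1, 1)
    label = 1 if x1 + random.gauss(0, 0.3) > 0 else -1
    examples.append(([1.0, x1], label))

w, used, held_out_loss = train_with_holdout(examples, passes=20, holdout_period=7)
```

Here `used` plays the role of "passes used" and `held_out_loss` the role of the loss printed with the trailing h.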
As you can see, the number of passes, and the holdout-period are completely independent options.
To improve and get more confidence in your model, you could use other optimizations, vary the holdout_period, try other --nn args. You may also want to check the vw-hypersearch utility (in the utl subdirectory) to help find better hyper-parameters.
Here's an example of using vw-hypersearch on one of the test-sets included with the source:
$ vw-hypersearch 1 20 vw --loss_function logistic --nn % -c --passes 20 --holdout_period 11 test/train-sets/rcv1_small.dat --binary
trying 13 ............. 0.133333 (best)
trying 8 ............. 0.122222 (best)
trying 5 ............. 0.088889 (best)
trying 3 ............. 0.111111
trying 6 ............. 0.1
trying 4 ............. 0.088889 (best)
loss(4) == loss(5): 0.088889
5 0.08888
Indicating that either 4 or 5 should be good parameters for --nn yielding a loss of 0.08888 on a hold-out subset of 1 in 11 examples.

Vowpal Wabbit not predicting binary values, maybe overtraining?

I am trying to use Vowpal Wabbit to do a binary classification, i.e. given feature values vw will classify it either 1 or 0. This is how I have the training data formatted.
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
-1 'name2 | feature1:1 feature2:0 feature3:5 feature4:2565 ...
etc
I have about 30,000 1 data points, and about 3,000 0 data points. I have 100 1 and 100 0 data points that I'm using to test on, after I create the model. These test data points are classified by default as 1. Here is how I format the prediction set:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
From my understanding of the VW documentation, I need to use either the logistic or hinge loss_function for binary classifications. This is how I've been creating the model:
vw -d ../training_set.txt --loss_function logistic/hinge -f model
And this is how I try the predictions:
vw -d ../test_set.txt --loss_function logistic/hinge -i model -t -p /dev/stdout
However, this is where I'm running into problems. If I use the hinge loss function, all the predictions are -1. When I use the logistic loss function, I get arbitrary values between 5 and 11. There is a general trend for data points that should be 0 to be lower values, 5-7, and for data points that should be 1 to be from 6-11. What am I doing wrong? I've looked around the documentation and checked a bunch of articles about VW to see if I can identify what my problem is, but I can't figure it out. Ideally I would get a 0,1 value, or a value between 0 and 1 which corresponds to how strong VW thinks the result is. Any help would be appreciated!
If the output should be just -1 and +1 labels, use the --binary option (when testing).
If the output should be a real number between 0 and 1, use --loss_function=logistic --link=logistic. The loss_function=logistic is needed when training, so the number can be interpreted as probability.
If the output should be a real number between -1 and 1, use --link=glf1.
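The three output options above differ only in how the raw score is transformed. A small Python sketch, assuming that the logistic link is the standard sigmoid and that glf1 is the sigmoid rescaled to the range (-1, 1); the function names are hypothetical:

```python
import math


def link_logistic(raw):
    """Squash a raw score into (0, 1); readable as a probability after logistic-loss training."""
    return 1.0 / (1.0 + math.exp(-raw))


def link_glf1(raw):
    """Squash a raw score into (-1, 1) (assumed form of the glf1 link)."""
    return 2.0 / (1.0 + math.exp(-raw)) - 1.0


def binary(raw):
    """Keep only the sign of the raw score, as a -1/+1 label."""
    return 1 if raw > 0 else -1


# Raw scores in the 5 to 11 range, like those reported in the question.
raw_scores = [5.0, 7.0, 11.0]
as_probabilities = [link_logistic(r) for r in raw_scores]
as_labels = [binary(r) for r in raw_scores]
```

Under these assumptions, raw scores between 5 and 11 all map to probabilities very close to 1, which matches the complaint that everything is classified as positive.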
If your training data is unbalanced, e.g. 10 times more positive examples than negative, but your test data is balanced (and you want to get the best loss on this test data), set the importance weight of the positive examples to 0.1 (because there are 10 times more positive examples).
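A minimal Python sketch of that reweighting, assuming training lines shaped like the ones in the question (label, tag, bar, features) with no importance weight yet. In vw's input format the importance weight is the optional number right after the label. The helper names balance_weight and add_importance are hypothetical.

```python
def balance_weight(n_positive, n_negative):
    """Weight for positive examples so both classes carry equal total importance."""
    return n_negative / n_positive


def add_importance(line, positive_weight):
    """Insert an importance weight after the label of positive examples only.

    Assumes the line starts with the label and has no importance weight yet.
    """
    label, rest = line.split(" ", 1)
    if float(label) > 0:
        return f"{label} {positive_weight} {rest}"
    return line


weight = balance_weight(30000, 3000)  # class counts taken from the question
positive = add_importance("1 'name | feature1:0 feature2:1", weight)
negative = add_importance("-1 'name2 | feature1:1 feature2:0", weight)
```

Negative lines are left untouched, so they keep the default importance of 1.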
Independently of your tool and/or specific algorithm, you can use "learning curves" and train/cross-validation/test splitting to diagnose your algorithm and determine what your problem is. After diagnosing the problem you can adjust your algorithm accordingly; for example, if you find that you are over-fitting, you can take actions such as:
Add regularization
Get more training data
Reduce the complexity of your model
Eliminate redundant features.
For more details on this subject, see Andrew Ng's "Advice for machine learning" videos on YouTube.

Vowpal Wabbit multiple class classification predict probabilities

I am trying to do multiple classification problem with Vowpal Wabbit.
I have a train file that look like this:
1 |feature_space
2 |feature_space
3 |feature_space
As an output I want to get probabilities of test item belonging to each class, like this:
1: 0.13 2:0.57 3:0.30
think of sklearn classifiers predict_proba methods, for example.
I've tried the following:
1) vw --oaa 3 train.file -f model.file --loss_function logistic --link logistic
vw -p predict.file -t test.file -i model.file --raw_predictions pred.txt
but the pred.txt file is empty (contains no records, but is created). Predict.file contains only the final class, and no probabilities.
2) vw --csoaa 3 train.file -f model.file --link logistic
I've modified the input files accordingly to fit the cs format. csoaa rejects --loss_function logistic with the following error message: "You are using a label not -1 or 1 with a loss function expecting that!"
If used with default square loss function, and similar output command, I get pred.txt with raw predictions for each class per item, for example:
2.33 1.67 0.55
I believe it's the resulting square distance.
Is there a way to get VW to output class probabilities or somehow convert these distances into probabilities?
The empty raw predictions file was caused by a bug in VW version 7.9.0, which was fixed in 7.10.0.
Since November 2015, the easiest way to obtain probabilities is to use --oaa=N --loss_function=logistic --probabilities -p probs.txt. (Or, if you need label-dependent features: --csoaa_ldf=mc --loss_function=logistic --probabilities -p probs.txt.)
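For intuition, here is one plausible way per-class one-against-all scores can be turned into a probability vector: apply the logistic function to each class's raw score, then normalize so the values sum to 1. This is a sketch under that assumption, not a statement of exactly what --probabilities does internally; the function name and the example scores are made up.

```python
import math


def oaa_probabilities(raw_scores):
    """Turn per-class raw scores (logistic-loss margins) into values that sum to 1."""
    per_class = [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]
    total = sum(per_class)
    return [p / total for p in per_class]


# Hypothetical raw scores for classes 1, 2 and 3.
probs = oaa_probabilities([-1.2, 0.8, -0.3])
predicted_class = probs.index(max(probs)) + 1  # classes are numbered from 1
```

Note that this only makes sense for scores produced with the logistic loss. The csoaa numbers in the question come from squared loss on costs, where a lower value is better, so they should not be fed through this function.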

Can Vowpal Wabbit handle datasize ~ 90 GB?

We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of a few hours. In addition, VW uses a hash function, which takes almost no RAM. But when we run logistic regression using VW on our data, within a few minutes it uses up all of the RAM and then stalls.
This is the command we use-
vw -d train_output --power_t 1 --cache_file train.cache -f data.model
--compressed --loss_function logistic --adaptive --invariant
--l2 0.8e-8 --invert_hash train.model
train_output is the input file we want to train VW on, and train.model is the expected model obtained after training
Any help is welcome!
I've found the --invert_hash option to be extremely costly; try running without that option. You can also try turning on the --l1 regularization option to reduce the number of coefficients in the model.
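A likely reason for the cost, sketched in Python: with feature hashing the weight table has a fixed size no matter how many distinct feature names appear, but mapping weights back to readable names requires remembering every name seen. The code below is an illustration only; crc32 stands in for vw's real hash function, and slot, touch and the generated feature names are hypothetical.

```python
import zlib

BITS = 18  # vw's default for -b, i.e. 2**18 weight slots


def slot(feature_name, bits=BITS):
    """Hash a feature name into the fixed-size weight table (stand-in hash)."""
    return zlib.crc32(feature_name.encode("utf-8")) & ((1 << bits) - 1)


weights = [0.0] * (1 << BITS)  # fixed size, independent of the amount of data
inverse = {}                   # only needed when readable names must be reported


def touch(feature_name, keep_names=False):
    """Look up a feature's slot, optionally remembering its name."""
    i = slot(feature_name)
    if keep_names:
        inverse.setdefault(i, set()).add(feature_name)
    return i


# 100,000 distinct feature names: weights stays the same size, inverse grows.
for n in range(100_000):
    touch(f"query_term_{n}", keep_names=True)

names_remembered = sum(len(names) for names in inverse.values())
```

With a 90 GB feature file full of distinct query-derived features, the name table is the part that can exhaust RAM, while the weights themselves stay bounded by the number of hash bits.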
How many features do you have in your model? How many features per row are there?

How to use WEKA Machine Learning for a Bayes Neural Network and J48 Decision Tree

I am trying to figure out WEKA and perform some experiments with data that I have.
Basically, what I want to do is take Data Set 1 and use it as a training set to build a J48 decision tree. Then I want to take Data Set 2 and run the trained tree on it, producing as output the original data set with an extra column containing the prediction.
Then do the same thing again with the Bayes Neural Network.
Can someone point me to detailed instructions on how exactly I would accomplish this? I seem to be missing some steps and cannot get the output of the original data set with the extra column.
Here is one way to do it with the command-line. This information is found in Chapter 1 ("A command-line primer") of the Weka manual that comes with the software.
java weka.classifiers.trees.J48 -t training_data.arff -T test_data.arff -p 1-N
where:
-t <training_data.arff> specifies the training data in ARFF format
-T <test_data.arff> specifies the test data in ARFF format
-p 1-N specifies that you want to output the feature vector and the prediction,
where N is the number of features in your feature vector.
For example, here I am using soybean.arff for both training and testing. There are 35 features in the feature vector:
java weka.classifiers.trees.J48 -t soybean.arff -T soybean.arff -p 1-35
The first few lines of the output look like:
=== Predictions on test data ===
inst# actual predicted error prediction (date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,plant-growth,leaves,leafspots-halo,leafspots-marg,leafspot-size,leaf-shread,leaf-malf,leaf-mild,stem,lodging,stem-cankers,canker-lesion,fruiting-bodies,external-decay,mycelium,int-discolor,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots)
1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,no,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
2 1:diaporth 1:diaporth 0.952 (august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,yes,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
The columns are: (1) data instance number; (2) ground truth label; (3) predicted label; (4) error; (5) prediction confidence; and (6) feature vector.
