Mahout FPG algorithm always uses a single reducer - machine-learning

I'm using Mahout's (v0.7) parallel FPG algorithm, in CLI mode, to generate frequent patterns. The algorithm works fine and generates the frequent patterns correctly.
The problem I'm having is that the algorithm always uses one reducer in the second stage of generating the patterns. The algorithm creates one mapper for each input split, but in the second stage, all the mappers send their output to one reducer which significantly slows down the algorithm.
I even tried to set the -Dmapred.reduce.tasks parameter to override the default number of reducers, and it did not work.
I would like to split the work of the second stage to multiple reducers, if possible.
The Mahout FPG command I use:
mahout fpg \
-i /path/to/input \
-o /path/to/output \
-s 5 \
-k 100 \
-method mapreduce

You can change the number of mappers and reducers by adding this at the end of your command:
-Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
For me, I couldn't change the number of mappers with this parameter, but I was always able to control the number of reducers.

It's possible that your data currently fits into a single mapper split.
You can make the split size smaller with something like:
-Dmapred.max.split.size=1048576
(That would reduce the split size to 1024*1024==1MB, but I've gone even smaller with Mahout before, such as -Dmapred.max.split.size=131072 for 128KB splits on very CPU-intensive jobs.)
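Putting the two suggestions together (reusing the command from the question; the -D values below are just the examples from above and should be tuned to your data and cluster, and the placement at the end follows the suggestion above), the combined invocation would look something like:
mahout fpg \
-i /path/to/input \
-o /path/to/output \
-s 5 \
-k 100 \
-method mapreduce \
-Dmapred.reduce.tasks=1000 -Dmapred.max.split.size=1048576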

Related

vowpal-wabbit: use of multiple passes, holdout, & holdout-period to avoid overfitting?

I would like to train a binary sigmoidal feedforward network for category classification with the following command, using the awesome vowpal wabbit tool:
vw --binary --nn 4 train.vw -f category.model
And test it:
vw --binary -t -i category.model -p test.vw
But I got very bad results (compared to my linear SVM estimator).
I found a comment saying that I should use the Number of Training Passes argument (--passes arg).
So my question is: how do I know how many training passes to use so that I don't end up with an overfitted model?
P.S. Should I use the holdout_period argument? And how?
The test command in the question is incorrect: it has no input file (-p ... specifies where to write output predictions). It is also not clear whether you want to test or predict, because the question says test but the command uses -p ...
Test means you have labeled-data and you're evaluating the quality of your model. Strictly speaking: predict means you don't have labels, so you can't actually know how good your predictions are. Practically, you may also predict on held-out, labeled data (pretending it has no labels by ignoring them) and then evaluate how good these predictions are, since you actually have labels.
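For completeness, a corrected version of the test command from the question might look like this (assuming test.vw is your labeled test data; predictions.txt is just a hypothetical name for the prediction output file):
vw --binary -t -i category.model test.vw -p predictions.txt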
Generally:
if you want to do binary classification, you should use labels in {-1, 1} and use --loss_function logistic. --binary is an independent option meaning you want the predictions themselves to be binary (which gives you less information).
if you already have a separate test-set with labels, you don't need a holdout.
The holdout mechanism in vw was designed to replace the test-set and avoid over-fitting, it is only relevant when multiple passes are used because in a single pass all examples are effectively held-out; each next (yet unseen) example is treated as 1) unlabeled for prediction, and as 2) labeled for testing and model-update. IOW: your train-set is effectively also your test-set.
So you can either do multiple passes on the train-set with no holdout:
vw --loss_function logistic --nn 4 -c --passes 2 --holdout_off train.vw -f model
and then test the model with a separate and labeled, test-set:
vw -t -i model test.vw
or do multiple passes on the same train-set with some hold-out as a test set:
vw --loss_function logistic --nn 4 -c --passes 20 --holdout_period 7 train.vw -f model
If you don't have a test-set, and you want to fit-stronger by using multiple-passes, you can ask vw to hold-out every Nth example (the default N is 10, but you may override it explicitly using --holdout_period <N> as seen above). In this case, you can specify a higher number of passes because vw will automatically do early-termination when the loss on the held-out set starts growing.
You'd notice you hit early termination since vw will print something like:
passes used = 5
...
average loss = 0.06074 h
Indicating that only 5 of the requested passes were actually used before early stopping, and the error on the held-out subset of examples is 0.06074 (the trailing h indicates this is held-out loss).
As you can see, the number of passes, and the holdout-period are completely independent options.
To improve and get more confidence in your model, you could use other optimizations, vary the holdout_period, try other --nn args. You may also want to check the vw-hypersearch utility (in the utl subdirectory) to help find better hyper-parameters.
Here's an example of using vw-hypersearch on one of the test-sets included with the source:
$ vw-hypersearch 1 20 vw --loss_function logistic --nn % -c --passes 20 --holdout_period 11 test/train-sets/rcv1_small.dat --binary
trying 13 ............. 0.133333 (best)
trying 8 ............. 0.122222 (best)
trying 5 ............. 0.088889 (best)
trying 3 ............. 0.111111
trying 6 ............. 0.1
trying 4 ............. 0.088889 (best)
loss(4) == loss(5): 0.088889
5 0.08888
Indicating that either 4 or 5 should be good parameters for --nn yielding a loss of 0.08888 on a hold-out subset of 1 in 11 examples.

Can Vowpal Wabbit handle datasize ~ 90 GB?

We have extracted features from search engine query log data and the feature file (as per the input format of Vowpal Wabbit) amounts to 90.5 GB. The reason for this huge size is the necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of a few hours. In addition to that, VW uses a hash function which takes almost no RAM. But when we run logistic regression using VW on our data, it uses up all of the RAM within a few minutes and then stalls.
This is the command we use:
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
--compressed --loss_function logistic --adaptive --invariant \
--l2 0.8e-8 --invert_hash train.model
train_output is the input file we want to train VW on, and train.model is the expected model obtained after training.
Any help is welcome!
I've found the --invert_hash option to be extremely costly; try running without that option. You can also try turning on the --l1 regularization option to reduce the number of coefficients in the model.
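As a rough sketch, the revised command would drop --invert_hash and add an --l1 term (the --l1 value of 1e-6 below is only a hypothetical starting point to tune):
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
--compressed --loss_function logistic --adaptive --invariant \
--l2 0.8e-8 --l1 1e-6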
How many features do you have in your model? How many features per row are there?

Vowpal Wabbit -- Active Learning, predictions always 0 even with seed data

Using Vowpal Wabbit and its Python interactor for active learning, I've got to the point of being able to send messages back and forth between the client and the server, but I am having problems with seeding.
When I seed the model with the following command:
python active_interactor.py --verbose -o labelled.txt --seed data.seed localhost 12345 unlabelled.txt
The interactor sends the examples to the server (and I know this because the server updates the model and produces debug information), but when it feeds in the unlabelled examples and asks for a label in response, the predictions are always 0.
My question is: is the model not being seeded? If not, why are the predictions always 0 even though there is a model?
It should be noted that the same data can be successfully used to create a passive model that gives non-0 predictions, so I do not think the problem is with the training data.
---UPDATE---
After looking at the tests, we changed the vw server invocation to match the test, focusing on two parameters that had previously been left at their defaults, namely initial_t and -l.
vw -f final.model --active_learning --active_mellowness 0.000001 --daemon --port 12345 --initial_t 10 -l 10
Once we did this, predictions were produced. This also works when -l is left at its default. We will now do a grid search to find the best possible parameters. One question though: why did low values of initial_t lead to no predictions?

How to deal with frequent classes?

I'm working on a classification task in Weka and have the problem that the class I want to predict has one value that is very frequent (about 85%). This leads to a lot of learning algorithms just predicting this frequent value for any new data.
How can I deal with this problem? Does it just mean that I didn't find features that work well enough in predicting something better? Or is there something specific I can do to solve this problem?
I guess this is a pretty common problem, but I was not able to find a solution to it here.
You need to "SMOTE" your data. First figure out how many more instances of the minority case you need. In my case I wanted to get around a 50/50 ratio so I needed to over sample by 1300 percent. This tutorial will help if you are using the GUI: http://www.youtube.com/watch?v=w14ha2Fmg6U If you are doing this from the command line using Weka, the following command will get you going:
#Weka 3.7.7
java weka.Run -no-scan weka.filters.supervised.instance.SMOTE \
-c last -K 25 -P 1300.0 -S 1 -i input.arff -o output.arff
The -K option is the number of neighbors to take into account when smoting the data. The default is 5, but 25 worked best for my dataset.
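As a rough sanity check on the -P value (assuming Weka's SMOTE -P is the percentage of additional minority instances to create), reaching an approximately 50/50 split requires:
P = (majority / minority - 1) * 100
So for the 85/15 split mentioned in the question, P = (85 / 15 - 1) * 100 ≈ 467, i.e. roughly -P 467.0; the 1300 percent used above corresponds to a more skewed class distribution.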

How to use WEKA Machine Learning for a Bayes Neural Network and J48 Decision Tree

I am trying to figure out WEKA and perform some experiments with data that I have.
Basically what I want to do is take Data Set 1 and use it as a training set. Run a J48 decision tree on it. Then take Data Set 2 and run the trained tree on it, with the output being the original data set plus an extra column for what the prediction was.
Then do the same thing again with the Bayes Neural Network.
Can someone point me to a link with detailed instructions on how exactly I would accomplish this? I seem to be missing some steps and cannot get the output of the original data set with the extra column.
Here is one way to do it with the command-line. This information is found in Chapter 1 ("A command-line primer") of the Weka manual that comes with the software.
java weka.classifiers.trees.J48 -t training_data.arff -T test_data.arff -p 1-N
where:
-t <training_data.arff> specifies the training data in ARFF format
-T <test_data.arff> specifies the test data in ARFF format
-p 1-N specifies that you want to output the feature vector and the prediction,
where N is the number of features in your feature vector.
For example, here I am using soybean.arff for both training and testing. There are 35 features in the feature vector:
java weka.classifiers.trees.J48 -t soybean.arff -T soybean.arff -p 1-35
The first few lines of the output look like:
=== Predictions on test data ===
inst# actual predicted error prediction (date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,plant-growth,leaves,leafspots-halo,leafspots-marg,leafspot-size,leaf-shread,leaf-malf,leaf-mild,stem,lodging,stem-cankers,canker-lesion,fruiting-bodies,external-decay,mycelium,int-discolor,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots)
1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,no,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
2 1:diaporth 1:diaporth 0.952 (august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,yes,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
The columns are: (1) data instance number; (2) ground truth label; (3) predicted label; (4) error; (5) prediction confidence; and (6) feature vector.
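The question also asks about a Bayes classifier. The same -t/-T/-p pattern works with any Weka classifier; for example, assuming weka.classifiers.bayes.NaiveBayes (or weka.classifiers.bayes.BayesNet) is the "Bayes" classifier you have in mind:
java weka.classifiers.bayes.NaiveBayes -t training_data.arff -T test_data.arff -p 1-N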
