training for classification using libsvm - machine-learning

I want to classify using libsvm. I have 9 training sets, each with 144,000 labelled instances, and each instance has a variable number of features. Training one set takes about 12 hours (./svm-train with probability estimates). As I don't have much time, I would like to train more than one set at a time, but I'm not sure whether I can. Can I run all 9 processes simultaneously in different terminals?
./svm-train -b 1 feat1.txt
./svm-train -b 1 feat2.txt
.
.
.
./svm-train -b 1 feat9.txt
(I'm using Fedora Core 5.)

You can tell libsvm to use OpenMP for parallelization. Look at this libsvm FAQ entry: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432

As Adam said, it depends on how many cores and processors your system has available. If that's insufficient, why not spin up a few EC2 instances to run on?
The Infochimps MachetEC2 public AMI comes with most of the tools you'll need: http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/

Yes. But unless you have a multi-core or multi-processor system it may not save you that much time.
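A minimal sketch of launching the nine training jobs from one script instead of nine terminals, while capping the number of simultaneous processes at the core count. The file names follow the question (feat1.txt to feat9.txt); the helper names build_commands and run_jobs are hypothetical, and the sketch assumes ./svm-train sits in the working directory.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(n_sets=9, binary="./svm-train"):
    """Return one svm-train command (as an argument list) per training set."""
    return [[binary, "-b", "1", f"feat{i}.txt"] for i in range(1, n_sets + 1)]


def run_jobs(commands, max_workers=None):
    """Run the commands as separate processes, at most max_workers at a time.

    Returns the exit code of each command, in the same order as `commands`.
    """
    # Default to one process per core; extra processes would only compete for CPU.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each thread only waits on its child process, so threads are sufficient here.
        results = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(results)


# Usage (not executed here, since it needs svm-train and the feature files):
# exit_codes = run_jobs(build_commands())
```

On a single-core machine this degrades to running the sets one after another, which matches the point above that parallel runs only save time when there are spare cores.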

Related

Random Forest - Verbose and Speed

I am trying to build a random forest on a data set with 120k rows and 518 columns.
I have two questions:
1. I want to see the progress and logs of building the forest. Is the verbose option deprecated in the randomForest function?
2. How can I increase the speed? Right now it takes more than 6 hours to build a random forest with 1000 trees.
The H2O cluster is initialized with the settings below:
hadoop jar h2odriver.jar -Dmapreduce.job.queuename=devclinical \
-output temp3p -nodes 20 -nthreads -1 -mapperXmx 32g
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE,
nthreads=-1,max_mem_size = "64G", min_mem_size="4G" )
Depending on the congestion of your network and how busy your Hadoop nodes are, it may finish faster with fewer nodes. For example, if 1 of the 20 nodes you requested is totally slammed by some other jobs, then that node may lag, and the work from that node is not rebalanced to other nodes.
A good way to see what is going on is to connect to H2O Flow in a browser and run the WaterMeter. This will show you CPU activity in your cluster.
You can compare the activity before you start your RF and after you start your RF.
If even before you start your RF the nodes are extremely busy then you may be out of luck and just have to wait. If even after you start your RF the nodes are not busy at all, then the network communication may be too high and fewer nodes would be better.
You'll also want to look at the H2O logs and see how the dataset got parsed, datatype-wise, and the speed at which individual trees are built. And if your response column is categorical and you're doing multinomial classification, each tree is really N trees, where N is the number of levels in the response column.
[ Unfortunately, the "it's too slow" complaint is way too generic to say much more. ]
That sounds like a long time to train a Random Forest on a dataset of only 120k rows x 518 columns. As Tom said above, it might have to do with congestion on your Hadoop cluster, and possibly with the cluster being way too big for this task. You should be able to train on a dataset that size on a single machine (no multi-node cluster necessary).
If possible, try training the model on your laptop for a comparison. If there is nothing you can do to improve the Hadoop environment, this may be a better option for training.
For your other question about a verbose option: I don't remember there ever being such an option in H2O's Random Forest. You can view the progress of models as they build in H2O Flow, the GUI. When you click on a model to view it, there is a "Refresh" button that will allow you to check on the progress of the model as it trains.

Why doesn't normalizing feature values change the training output much?

I have 3113 training examples, over a dense feature vector of size 78. The magnitude of features is different: some around 20, some 200K. For example, here is one of the training examples, in vowpal-wabbit input format.
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:9.670000 F1:0.130000 F2:0.320000 F3:0.570000 F4:9.837000 F5:9.593000 F6:9.238150 F7:9.646667 F8:9.631333 F9:8.338904 F10:9.748000 F11:10.227667 F12:10.253667 F13:9.800000 F14:0.010000 F15:0.030000 F16:-0.270000 F17:10.015000 F18:9.726000 F19:9.367100 F20:9.800000 F21:9.792667 F22:8.457452 F23:9.972000 F24:10.394833 F25:10.412667 F26:9.600000 F27:0.090000 F28:0.230000 F29:0.370000 F30:9.733000 F31:9.413000 F32:9.095150 F33:9.586667 F34:9.466000 F35:8.216658 F36:9.682000 F37:10.048333 F38:10.072000 F39:9.780000 F40:0.020000 F41:-0.060000 F42:-0.560000 F43:9.898000 F44:9.537500 F45:9.213700 F46:9.740000 F47:9.628000 F48:8.327233 F49:9.924000 F50:10.216333 F51:10.226667 F52:127925000.000000 F53:-15198000.000000 F54:-72286000.000000 F55:-196161000.000000 F56:143342800.000000 F57:148948500.000000 F58:118894335.000000 F59:119027666.666667 F60:181170133.333333 F61:89209167.123288 F62:141400600.000000 F63:241658716.666667 F64:199031688.888889 F65:132549.000000 F66:-16597.000000 F67:-77416.000000 F68:-205999.000000 F69:144690.000000 F70:155022.850000 F71:122618.450000 F72:123340.666667 F73:187013.300000 F74:99751.769863 F75:144013.200000 F76:237918.433333 F77:195173.377778
The training result was not good, so I thought I would normalize the features to bring them to the same magnitude. I calculated the mean and standard deviation of each feature across all examples, then computed newValue = (oldValue - mean) / stddev, so that each feature's new mean is 0 and its new stddev is 1. For the same example, here are the feature values after normalization:
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:-0.660690 F1:0.226462 F2:0.383638 F3:0.398393 F4:-0.644898 F5:-0.670712 F6:-0.758233 F7:-0.663447 F8:-0.667865 F9:-0.960165 F10:-0.653406 F11:-0.610559 F12:-0.612965 F13:-0.659234 F14:0.027834 F15:0.038049 F16:-0.201668 F17:-0.638971 F18:-0.668556 F19:-0.754856 F20:-0.659535 F21:-0.663001 F22:-0.953793 F23:-0.642736 F24:-0.606725 F25:-0.609946 F26:-0.657141 F27:0.173106 F28:0.310076 F29:0.295814 F30:-0.644357 F31:-0.678860 F32:-0.764422 F33:-0.658869 F34:-0.674367 F35:-0.968679 F36:-0.649145 F37:-0.616868 F38:-0.619564 F39:-0.649498 F40:0.041261 F41:-0.066987 F42:-0.355693 F43:-0.638604 F44:-0.676379 F45:-0.761250 F46:-0.653962 F47:-0.668194 F48:-0.962591 F49:-0.635441 F50:-0.611600 F51:-0.615670 F52:-0.593324 F53:-0.030322 F54:-0.095290 F55:-0.139602 F56:-0.652741 F57:-0.675629 F58:-0.851058 F59:-0.642028 F60:-0.648002 F61:-0.952896 F62:-0.629172 F63:-0.592340 F64:-0.682273 F65:-0.470121 F66:-0.045396 F67:-0.128265 F68:-0.185295 F69:-0.510251 F70:-0.515335 F71:-0.687727 F72:-0.512749 F73:-0.471032 F74:-0.789335 F75:-0.491188 F76:-0.400105 F77:-0.505242
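A minimal sketch of the per-feature standardization described above, using only the standard library. The function name standardize_columns is hypothetical, and the use of the population standard deviation is an assumption, since the question does not say which variant was used. After this transform each column has mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column: (value - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation (an assumption); a constant column
    # would give 0, so fall back to 1 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two features on very different scales, as in the question (values are illustrative).
rows = [[9.67, 127925000.0], [10.20, 90000000.0], [8.50, 150000000.0]]
scaled = standardize_columns(rows)
```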
However, this yields basically the same testing result (if not exactly the same, since I shuffle the examples before each training).
Why is there no change in the result?
Here are my training and testing commands:
rm -f cache
cat input.feat | vw -f model --passes 20 --cache_file cache
cat input.feat | vw -i model -t -p predictions --invert_hash readable_model
(Yes, I'm testing on the training data right now, since I have very few examples to train on.)
More context:
Some of the features are "tier 2": they were derived by manipulating or taking cross products of "tier 1" features (e.g. moving averages, first- to third-order derivatives, etc.). If I normalize the tier 1 features before calculating the tier 2 features, the model actually improves significantly.
So I'm puzzled as to why normalizing tier 1 features (before generating tier 2 features) helps a lot, while normalizing all features (after generating tier 2 features) doesn't help at all.
BTW, since I'm training a regressor, I'm using SSE as the metric to judge the quality of the model.
vw normalizes feature values for scale as it goes, by default.
This is part of the online algorithm. It is done gradually during runtime.
In fact, it does more than that. vw's enhanced SGD algorithm also keeps separate learning rates (per feature), so the learning rates of rarer features don't decay as fast as those of common ones (--adaptive). Finally, there's an importance-aware update, controlled by a third option (--invariant).
The 3 separate SGD enhancement options (which are all turned on by default) are:
--adaptive
--invariant
--normalized
The last option is the one that adjusts values for scale (discounts large values vs small). You may disable all these SGD enhancements by using the option --sgd. You may also partially enable any subset by explicitly specifying it.
All in all you have 2^3 = 8 SGD option combinations you can use.
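The eight combinations can be enumerated mechanically, for example to compare them in a small experiment. This is only a sketch: the helper name sgd_flag_combinations is hypothetical, input.feat is the file name from the question, and mapping the empty subset to --sgd follows the description above rather than any check against a particular vw version.

```python
from itertools import combinations

SGD_FLAGS = ("--adaptive", "--invariant", "--normalized")


def sgd_flag_combinations():
    """List every subset of the three SGD enhancement flags (2^3 = 8 subsets)."""
    combos = []
    for size in range(len(SGD_FLAGS) + 1):
        for subset in combinations(SGD_FLAGS, size):
            # The empty subset means "no enhancements", which the text says is --sgd.
            combos.append(list(subset) if subset else ["--sgd"])
    return combos


# Print one candidate vw command line per combination.
for flags in sgd_flag_combinations():
    print("vw -d input.feat", " ".join(flags))
```

The subset containing all three flags is the same as the default behaviour described above.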
A possible reason is that the training algorithm you used already normalized the features for you. In fact, many algorithms normalize their input before working on it. Hope it helps you :)

Can Vowpal Wabbit handle datasize ~ 90 GB?

We have extracted features from search engine query log data, and the feature file (in Vowpal Wabbit's input format) amounts to 90.5 GB. The reason for this huge size is necessary redundancy in our feature construction. Vowpal Wabbit claims to be able to handle TBs of data in a matter of a few hours. In addition, VW uses a hash function which takes almost no RAM. But when we run logistic regression using VW on our data, it uses up all of the RAM within a few minutes and then stalls.
This is the command we use:
vw -d train_output --power_t 1 --cache_file train.cache -f data.model \
--compressed --loss_function logistic --adaptive --invariant \
--l2 0.8e-8 --invert_hash train.model
train_output is the input file we want to train VW on, and train.model is the expected model obtained after training.
Any help is welcome!
I've found the --invert_hash option to be extremely costly; try running without that option. You can also try turning on the --l1 regularization option to reduce the number of coefficients in the model.
How many features do you have in your model? How many features per row are there?

Vowpal Wabbit -- Active Learning, predictions always 0 even with seed data

Using Vowpal Wabbit and its Python interactor for active learning, I've got to the point of being able to send messages back and forth between the client and the server, but I am having problems with seeding.
When I seed the model with the current command:
python active_interactor.py --verbose -o labelled.txt --seed data.seed localhost 12345 unlabelled.txt
The interactor sends the examples to the server (and I know this because the server updates the models and the debug information is produced) but when it feeds the unlabelled examples and asks for a label as a response, the predictions are always 0.
My question is: is the model not being seeded? If not, why are the predictions always 0 even though there is a model?
It should be noted that the same data can be successfully used to create a passive model that gives non-0 predictions, so I do not think the problem is with the training data.
---UPDATE---
After looking at the tests, we changed the vw server command to match them, setting two parameters that had previously been left at their defaults: initial_t and l.
vw -f final.model --active_learning --active_mellowness 0.000001 --daemon --port 12345 --initial_t 10 -l 10
After doing this, predictions are produced. This also works when -l is left at its default. We will now do a grid search to find the best parameters. One question, though: why did low values of initial_t lead to no predictions?

How to deal with frequent classes?

I'm working on a classification task in Weka and have the problem that the class I want to predict has one value that is very frequent (about 85%). This leads many learning algorithms to just predict this frequent value for a new dataset.
How can I deal with this problem? Does it just mean that I didn't find features that work well enough in predicting something better? Or is there something specific I can do to solve this problem?
I guess this is a pretty common problem, but I was not able to find a solution to it here.
You need to "SMOTE" your data. First figure out how many more instances of the minority class you need. In my case I wanted to get around a 50/50 ratio, so I needed to oversample by 1300 percent. This tutorial will help if you are using the GUI: http://www.youtube.com/watch?v=w14ha2Fmg6U. If you are doing this from the command line using Weka, the following command will get you going:
#Weka 3.7.7
java weka.Run -no-scan weka.filters.supervised.instance.SMOTE \
-c last -K 25 -P 1300.0 -S 1 -i input.arff -o output.arff
The -K option is the number of neighbors to take into account when smoting the data. The default is 5, but 25 worked best for my dataset.
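As a rough sketch of how to pick the -P value for your own data: assuming -P is the percentage of synthetic instances to create relative to the current minority count (so 100 doubles the minority class), the percentage needed for a given balance follows from the class counts. The function name smote_percentage and the counts below are illustrative only.

```python
def smote_percentage(majority_count, minority_count, target_ratio=1.0):
    """Percentage of synthetic minority instances needed so that
    minority / majority reaches target_ratio (1.0 means a 50/50 split).

    Assumes the percentage is relative to the current minority count.
    """
    needed = target_ratio * majority_count - minority_count
    # Never return a negative percentage if the classes are already balanced.
    return max(0.0, 100.0 * needed / minority_count)


# An 85% / 15% split, as in the question: about 467 percent.
print(round(smote_percentage(8500, 1500), 1))

# A majority class 14 times larger than the minority gives the 1300 percent used above.
print(smote_percentage(1400, 100))
```

Under this assumption, the 85/15 split from the question would need a much smaller -P than 1300 to reach roughly 50/50.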
