vowpal wabbit for binary text classification setup

I am using vw-8.20170116 for a binary text classification problem. The text strings are concatenated from several short (5-20 words) strings. The input looks like
-1 1.0 |aa .... ... ..... |bb ... ... .... .. |cc ....... .. ...
1 5.0 |aa .... ... ..... |bb ..... .. .... . |cc .... .. ...
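For illustration, lines in this format could be produced from three text fields with a small shell helper. This is only a sketch: the function name vw_line and the sample strings are hypothetical, and it assumes exactly three namespaces (aa, bb, cc) as in the sample above. The characters | and : have special meaning in VW input, so the helper replaces them with spaces.

```shell
# Hypothetical helper: build one VW example line from a label, an importance
# weight and three text fields (namespaces aa, bb, cc as in the sample above).
vw_line() {
    label=$1; weight=$2; shift 2
    line="$label $weight"
    for ns in aa bb cc; do
        # '|' starts a namespace and ':' separates a feature from its value,
        # so both are replaced with spaces inside the free text.
        text=$(printf '%s' "$1" | tr ':|' '  ')
        line="$line |$ns $text"
        shift
    done
    printf '%s\n' "$line"
}

# Sample call with made-up text fields.
vw_line -1 1.0 "first short string" "second short string" "third short string"
```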
The command that I am using for training is
./vw-8.20170116 -d train_feat.txt -k -c -f model.vw --ngram 2 --skips 2 --nn 10 --loss_function logistic --passes 100 --l2 1e-8 --holdout_off --threads --ignore bc
and for test
./vw-8.20170116 -d test_feat.txt -t --loss_function logistic --link logistic -i model.vw -p test_pred.txt
Question: How can I get vw to run (train) in parallel on my 8-core machine? I thought --threads should help but I am not seeing any speedups. And how do I control the number of cores used?
Using this link for reference.

Related

GNU parallel arguments

From the example
seq 1 100 | parallel -I ## \
> 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
How do -X, ## and {} work? Also, what is the behavior when '1' or '.' is passed inside {}? Is \ > used for redirection here?
I was going through the tutorial at https://www.youtube.com/watch?v=P40akGWJ_gY&list=PL284C9FF2488BC6D1&index=2 and reading the man parallel page. I have picked up some basics, but not exactly how to use these options.

Let's do the easy stuff first.
The backslash (\) is just telling the shell that the following line is a continuation of the current one, and the greater than sign (>) is the shell prompting for the continuation line. It is no different from typing:
echo \
hi
where you will actually see this:
echo \
> hi
hi
So, I am saying you can ignore \> and just run the command on a single line.
Next, the things in {}. These are described in the GNU Parallel manual page, but essentially:
{1} refers to the first parameter
{2} refers to the second parameter, and so on
Test this with the following where the column separator is set to a space but we use the parameters in the reverse order:
echo A B | parallel --colsep ' ' echo {2} {1}
B A
{.} refers to a parameter, normally a filename, with its extension removed
Test this with:
echo fred.dat | parallel echo {.}
fred
Now let's come to the actual question, with the continuation line removed as described above and with everything on a single line:
seq 1 100 | parallel -I ## 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
So, this is essentially running:
seq 1 100 | parallel -I ## 'ANOTHER COMMAND'
Ole has used ## in place of {} in this outer command so that its substitutions don't get confused with those of the second, inner, parallel command. So, wherever you see ##, just replace it with a value from the first seq 1 100.
The second parallel command is pretty much the same as the first one, but here Ole has used -X. If you watch the video you link to, you will see that he previously shows how it works. It passes "as many parameters as possible" to a command, according to the system's ARG_MAX. So, if you want 10,000 directories created, instead of this:
seq 1 10000 | parallel mkdir {}
which will start 10,000 separate processes, each one running mkdir, you will start only a handful of mkdir processes (as few as one), each with as many parameters as will fit:
seq 1 10000 | parallel -X mkdir
That avoids the need to create 10,000 separate processes and speeds things up.
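To see the batching effect without creating any directories, here is a minimal sketch that counts how many times a command is started. It uses xargs purely as a stand-in for parallel -X (the idea of packing many arguments into one invocation is the same, but the two tools choose batch sizes differently), and the function name count_calls is made up for this example.

```shell
# xargs is used here only as a stand-in for "parallel -X".
# count_calls TOTAL PER_CALL prints how many commands were started when
# TOTAL arguments are handed out PER_CALL at a time.
count_calls() {
    # Each invocation of "sh -c" prints exactly one line; wc -l counts them.
    seq 1 "$1" | xargs -n "$2" sh -c 'echo "$# args"' sh | wc -l | tr -d ' '
}

count_calls 1000 1      # one process per argument, like "parallel mkdir {}"
count_calls 1000 1000   # all arguments packed into a single invocation
```

The first call reports 1000 started commands and the second reports 1, which is the saving that -X is after.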
Let's now look at the outer parallel invocation and do a dry run to see what it would do, without actually doing anything:
seq 1 100 | parallel -k --dry-run -I ## 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
Output
mkdir top-1;seq 1 100 | parallel -X mkdir top-1/sub-{}
mkdir top-2;seq 1 100 | parallel -X mkdir top-2/sub-{}
mkdir top-3;seq 1 100 | parallel -X mkdir top-3/sub-{}
mkdir top-4;seq 1 100 | parallel -X mkdir top-4/sub-{}
mkdir top-5;seq 1 100 | parallel -X mkdir top-5/sub-{}
mkdir top-6;seq 1 100 | parallel -X mkdir top-6/sub-{}
mkdir top-7;seq 1 100 | parallel -X mkdir top-7/sub-{}
mkdir top-8;seq 1 100 | parallel -X mkdir top-8/sub-{}
...
...
mkdir top-99;seq 1 100 | parallel -X mkdir top-99/sub-{}
mkdir top-100;seq 1 100 | parallel -X mkdir top-100/sub-{}
So, now you can see it is going to start 100 processes, each of which will make a top-level directory and then run an inner parallel that creates that directory's 100 subdirectories, passing as many names as possible to each mkdir.
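If it helps to picture the end result, the same directory layout can be sketched in plain shell without GNU Parallel. This is not what parallel does internally, just a sequential equivalent of the result; the function name make_tree and the small size of 3 are chosen only for the example.

```shell
# Plain-shell sketch of the same layout, without GNU Parallel.
# make_tree ROOT N creates ROOT/top-1 .. ROOT/top-N, each containing
# sub-1 .. sub-N.
make_tree() {
    root=$1; n=$2
    for i in $(seq 1 "$n"); do
        mkdir -p "$root/top-$i"
        # A single mkdir call receives all N subdirectory names at once,
        # which is the effect -X aims for in the inner parallel.
        (cd "$root/top-$i" && mkdir $(seq 1 "$n" | sed 's/^/sub-/'))
    done
}

tmp=$(mktemp -d)
make_tree "$tmp" 3
find "$tmp" -mindepth 2 -type d | wc -l   # counts the sub-* directories
rm -rf "$tmp"
```

With N set to 100 this gives the 100 top-level directories and 10,000 subdirectories of the original command.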

How do I get the raw predictions (-r) from Vowpal Wabbit when running in daemon mode?

Using the below, I'm able to get both the raw predictions and the final predictions as a file:
cat train.vw.txt | vw -c -k --passes 30 --ngram 5 -b 28 --l1 0.00000001 --l2 0.0000001 --loss_function=logistic -f model.vw --compressed --oaa 3
cat test.vw.txt | vw -t -i model.vw --link=logistic -r raw.txt -p predictions.txt
However, I'm unable to get the raw predictions when I run VW as a daemon:
vw -t -i model.vw --daemon --port 26542 --link=logistic
Do I have to pass a specific argument or parameter to get the raw predictions? I prefer the raw predictions, not the final predictions. Thanks

On systems supporting /dev/stdout (and /dev/stderr), you may try this:
vw -t -i model.vw --daemon --port 26542 --link=logistic -r /dev/stdout
The daemon will write raw predictions into standard output which in this case end up in the same place as localhost port 26542.
The relative order of lines is guaranteed because the code dealing with the different prints within each example (e.g. non-raw vs. raw) is always serial.

Since November 2015, the easiest way to obtain probabilities is to use --oaa=N --loss_function=logistic --probabilities -p probs.txt. (Or, if you need label-dependent features: --csoaa_ldf=mc --loss_function=logistic --probabilities -p probs.txt.)
--probabilities works with --daemon as well. There should be no more need for --raw_predictions.

--raw_predictions is a kind of hack (its semantics depend on the reductions used) and it is not supported in --daemon mode. (Something like --output_probabilities would be useful and not difficult to implement, and it would work in daemon mode, but so far no one has had time to implement it.)
As a workaround, you can run VW in a pipe, so it reads stdin and writes the probabilities to stdout:
cat test.data | vw -t -i model.vw --link=logistic -r /dev/stdout | script.sh
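As a rough idea of what script.sh could do with that stream, here is a minimal sketch that picks the highest-scoring class per line. It assumes each raw line has the form "class:score class:score ..." (for example "1:-1.2 2:0.4 3:-2.0"), which may differ depending on the reductions used, so check the actual output first. The function name best_class is made up for this example.

```shell
# Assumption: each raw prediction line looks like "1:-1.2 2:0.4 3:-2.0",
# i.e. space-separated class:score pairs. Verify against real vw output.
best_class() {
    awk '{
        best = ""; max = 0
        for (i = 1; i <= NF; i++) {
            split($i, kv, ":")            # kv[1] = class, kv[2] = score
            if (best == "" || kv[2] + 0 > max) {
                best = kv[1]; max = kv[2] + 0
            }
        }
        print best                        # class with the highest score
    }'
}

# Sample input standing in for the vw output stream.
printf '1:-1.2 2:0.4 3:-2.0\n' | best_class
```

In the pipeline above, best_class would take the place of script.sh and read the vw output from stdin.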

According to https://github.com/VowpalWabbit/vowpal_wabbit/issues/1118 you can try adding the --scores option on the command line:
vw --scores -t -i model.vw --daemon --port 26542
It helped me with my oaa model.

Verifying the Model generated by the classifier

I am using the Mahout naive Bayes classification algorithm to classify input documents into known categories.
I am able to build the model using mahout commands.
mahout seq2sparse
mahout split
mahout trainnb
mahout testnb
The test results look good.
Now, I would like to verify my model with real data.
I am trying the command below to verify the output:
mahout org.apache.mahout.classifier.Classify \
-m /data/model/ \
--classify /data/input.txt \
--encoding UTF-8 \
--analyzer org.apache.mahout.vectorizer.DefaultAnalyzer \
--defaultCat unknown \
-ng 1 \
-type bayes \
-source hdfs
This command is failing with "java.lang.ClassNotFoundException: org.apache.mahout.classifier.Classify".
I have set the mahout core jar and other mahout jars in the classpath. I am using Mahout 0.9.
How do I run the classifier in Mahout 0.9?

Run cvb in mahout 0.8

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) implementation for topic modeling and has removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately, the only documentation on how to run an example and generate meaningful output is for lda.
Thus, I want to:
preprocess some texts correctly
run the cvb0_local version of cvb
inspect the results by looking at the top n words in each of the generated topics

Here are the Mahout commands I had to run, in order, in a Linux shell to do it.
$MAHOUT_HOME points to my mahout/bin folder.
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out

Thanks to JoKnopp for the detailed commands.
If you get:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command line option "maxIterations":
--maxIterations (-m) maxIterations
I use -m 20 and it works.
See:
https://issues.apache.org/jira/browse/MAHOUT-1141

naivebayes Mahout 0.7

I am working on sentiment analysis of tweets.
I am using the Mahout naive Bayes classifier for it. I created a directory "data" and, inside it, three more directories named "positive", "negative" and "uncertain". I then put 151 files (151 MB in total) in each of the positive, negative and uncertain directories and copied the data directory to HDFS. Below are the commands I ran to generate the model and label index from it.
bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wt tfidf
bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c
I am getting the confusion matrix after testing on the same set of data using "testnb" command as given below:
bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
151 0 0 | 151 a = negative
0 151 0 | 151 b = positive
0 0 151 | 151 c = uncertain
Then I created another directory, "data2", in the same way and put some random data (a subset of the training data: 30 files, 30 MB in total, per class) in the positive, negative and uncertain directories inside it. Then I created vectors from it using the "seq2sparse" command given below:
bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wt tfidf
I then ran "testnb" with the model and labelindex created from the previous set of data, using the command given below:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
I am getting confusion matrix like this.
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
0 30 0 | 30 a = negative
0 30 0 | 30 b = positive
0 30 0 | 30 c = uncertain
Can anyone tell me why this is happening? Am I testing the model the correct way, or is this a bug in Mahout 0.7? If it is not the correct way, please suggest an alternative.
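To put numbers on the two matrices, here is a small sketch that computes accuracy as the diagonal sum divided by the total. The function name accuracy is made up for this example, and it expects only the numeric part of each row (one row per true class, one column per predicted class).

```shell
# accuracy: reads the numeric rows of a confusion matrix from stdin
# (row = true class, column = predicted class) and prints correct/total.
accuracy() {
    awk '{
        for (i = 1; i <= NF; i++) {
            total += $i
            if (i == NR) correct += $i    # diagonal entry of this row
        }
    }
    END { printf "%d/%d = %.3f\n", correct, total, correct / total }'
}

printf '151 0 0\n0 151 0\n0 0 151\n' | accuracy   # first matrix
printf '0 30 0\n0 30 0\n0 30 0\n' | accuracy      # second matrix
```

The first matrix gives 453/453, but it was measured on the training vectors themselves, so it says little about generalization. The second gives 30/90: every document was assigned to "positive", regardless of its true class.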

Can you try this:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
(remove the "part-r-00000")
