naivebayes Mahout 0.7 - mahout

I am working on sentiment analysis of tweets, using the Mahout naive Bayes classifier.
I created a directory "data" containing three subdirectories named "positive", "negative" and "uncertain", and put 151 files (151 MB in total) in each of them. I then copied the data directory to HDFS. Below are the commands I ran to generate the model and label index from it.
bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wttfidf
bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c
Testing on the same data with the "testnb" command below gives me this confusion matrix:
bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
151 0 0 | 151 a = negative
0 151 0 | 151 b = positive
0 0 151 | 151 c = uncertain
Then I created another directory, "data2", in the same way and put a random subset of the training data (30 files, 30 MB in total, per class) in its positive, negative and uncertain directories. I then created vectors from it using the "seq2sparse" command given below:
bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wttfidf
I then ran "testnb" with the model and label index created from the previous data set, using the command below:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
I am getting a confusion matrix like this:
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
0 30 0 | 30 a = negative
0 30 0 | 30 b = positive
0 30 0 | 30 c = uncertain
Can anyone tell me why this is happening? Am I testing the model the correct way, or is this a bug in Mahout 0.7? If this is not the correct way, please suggest an alternative.

Can you try this:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
(That is, remove "part-r-00000" and pass the tfidf-vectors directory itself as the input.)


GNU parallel arguments

From the example
seq 1 100 | parallel -I ## \
> 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
How do -X, ## and {} work? Also, what is the behavior when '1' or '.' is placed inside {}? Is \ > used for redirection here?
I was trying to follow the tutorial at https://www.youtube.com/watch?v=P40akGWJ_gY&list=PL284C9FF2488BC6D1&index=2 and read through the parallel man page. I picked up some basic knowledge, but not enough to understand exactly how to use it.
Let's do the easy stuff first.
The backslash (\) is just telling the shell that the following line is a continuation of the current one, and the greater than sign (>) is the shell prompting for the continuation line. It is no different from typing:
echo \
hi
where you will actually see this:
echo \
> hi
hi
So, I am saying you can ignore \> and just run the command on a single line.
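As a quick check of the point above, here is a minimal sketch (the variable names are just for illustration) showing that a command split with a trailing backslash produces the same result as the single-line form. In an interactive shell, the "> " you see at the start of the second line is the continuation prompt (PS2), not part of the command.

```shell
# Two-line form: the trailing backslash joins the lines before the shell parses them.
multi=$(echo hello \
world)

# Single-line form of the same command.
single=$(echo hello world)

# Both lines of output should be identical.
printf '%s\n%s\n' "$multi" "$single"
```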
Next, the things in {}. These are described in the GNU Parallel manual page, but essentially:
{1} refers to the first parameter
{2} refers to the second parameter, and so on
Test this with the following where the column separator is set to a space but we use the parameters in the reverse order:
echo A B | parallel --colsep ' ' echo {2} {1}
B A
{.} refers to a parameter, normally a filename, with its extension removed
Test this with:
echo fred.dat | parallel echo {.}
fred
Now let's come to the actual question, with the continuation line removed as described above and with everything on a single line:
seq 1 100 | parallel -I ## 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
So, this is essentially running:
seq 1 100 | parallel -I ## 'ANOTHER COMMAND'
Ole has used ## in place of {} in this command so that the substitutions for the outer and the inner parallel commands don't get confused with each other. So, where you see ## you just need to replace it with the values from the first seq 1 100.
The second parallel command is pretty much the same as the first one, but here Ole has used -X. If you watch the video you link to, you will see that he shows how it works earlier on. It passes "as many parameters as possible" to each command, within the system's ARG_MAX limit on command-line length. So, if you want 10,000 directories created, instead of this:
seq 1 10000 | parallel mkdir {}
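which will start 10,000 separate processes, each one running mkdir, you will start only a few mkdir processes, each with many parameters (parallel spreads the arguments evenly over its job slots; add -j1 to stop it splitting them across jobs):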
seq 1 10000 | parallel -X mkdir
That avoids the need to create 10,000 separate processes and speeds things up.
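The batching idea can be seen without GNU Parallel installed. The sketch below uses xargs purely as a stand-in (it is not what the video uses): xargs -n 1 runs one process per argument, while plain xargs packs as many arguments as fit into each command line, which is comparable to what -X does. Each echo invocation prints exactly one line, so counting lines counts invocations.

```shell
# One process per argument (comparable to: parallel echo {}).
per_arg=$(seq 1 1000 | xargs -n 1 echo | wc -l | tr -d ' ')

# As many arguments as fit per process (comparable in spirit to: parallel -X echo).
batched=$(seq 1 1000 | xargs echo | wc -l | tr -d ' ')

echo "one argument per process: $per_arg invocations"
echo "batched arguments:        $batched invocation(s)"
```

The exact number of batched invocations depends on the system's command-line length limit, but it should be far smaller than the number of arguments.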
Let's now look at the outer parallel invocation and do a dry run to see what it would do, without actually doing anything:
seq 1 100 | parallel -k --dry-run -I ## 'mkdir top-##;seq 1 100 | parallel -X mkdir top-##/sub-{}'
Output
mkdir top-1;seq 1 100 | parallel -X mkdir top-1/sub-{}
mkdir top-2;seq 1 100 | parallel -X mkdir top-2/sub-{}
mkdir top-3;seq 1 100 | parallel -X mkdir top-3/sub-{}
mkdir top-4;seq 1 100 | parallel -X mkdir top-4/sub-{}
mkdir top-5;seq 1 100 | parallel -X mkdir top-5/sub-{}
mkdir top-6;seq 1 100 | parallel -X mkdir top-6/sub-{}
mkdir top-7;seq 1 100 | parallel -X mkdir top-7/sub-{}
mkdir top-8;seq 1 100 | parallel -X mkdir top-8/sub-{}
...
...
mkdir top-99;seq 1 100 | parallel -X mkdir top-99/sub-{}
mkdir top-100;seq 1 100 | parallel -X mkdir top-100/sub-{}
So, now you can see it is going to start 100 jobs, each of which makes one top-level directory and then runs an inner parallel that creates 100 subdirectories inside it, passing many directory names to each mkdir thanks to -X.

vowpal wabbit for binary text classification setup

I am using vw-8.20170116 for a binary text classification problem. The text strings are concatenated from several short (5-20 words) strings. The input looks like
-1 1.0 |aa .... ... ..... |bb ... ... .... .. |cc ....... .. ...
1 5.0 |aa .... ... ..... |bb ..... .. .... . |cc .... .. ...
The command that I am using for training is
./vw-8.20170116 -d train_feat.txt -k -c -f model.vw --ngram 2 --skips 2 --nn 10 --loss_function logistic --passes 100 --l2 1e-8 --holdout_off --threads --ignore bc
and for test
./vw-8.20170116 -d test_feat.txt -t --loss_function logistic --link logistic -i model.vw -p test_pred.txt
Question: How can I get vw to run (train) in parallel on my 8-core machine? I thought --threads should help but I am not seeing any speedups. And how do I control the number of cores used?
Using this link for reference.

How do I get the raw predictions (-r) from Vowpal Wabbit when running in daemon mode?

Using the commands below, I'm able to write both the raw predictions and the final predictions to files:
cat train.vw.txt | vw -c -k --passes 30 --ngram 5 -b 28 --l1 0.00000001 --l2 0.0000001 --loss_function=logistic -f model.vw --compressed --oaa 3
cat test.vw.txt | vw -t -i model.vw --link=logistic -r raw.txt -p predictions.txt
However, I'm unable to get the raw predictions when I run VW as a daemon:
vw -t -i model.vw --daemon --port 26542 --link=logistic
Do I have to pass a specific argument or parameter to get the raw predictions? I prefer the raw predictions, not the final predictions. Thanks
On systems supporting /dev/stdout (and /dev/stderr), you may try this:
vw -t -i model.vw --daemon --port 26542 --link=logistic -r /dev/stdout
The daemon will write raw predictions into standard output which in this case end up in the same place as localhost port 26542.
The relative order of lines is guaranteed because the code dealing with different prints within each example (e.g. non-raw vs. raw) is always serial.
Since November 2015, the easiest way to obtain probabilities is to use --oaa=N --loss_function=logistic --probabilities -p probs.txt. (Or, if you need label-dependent features: --csoaa_ldf=mc --loss_function=logistic --probabilities -p probs.txt.)
--probabilities works with --daemon as well. There should be no more need for using --raw_predictions.
--raw_predictions is a kind of hack (the semantic depends on the reductions used) and it is not supported in --daemon mode. (Something like --output_probabilities would be useful and not difficult to implement and it would work in daemon mode, but so far no one had time to implement it.)
As a workaround, you can run VW in a pipe, so it reads stdin and writes the probabilities to stdout:
cat test.data | vw -t -i model.vw --link=logistic -r /dev/stdout | script.sh
According to https://github.com/VowpalWabbit/vowpal_wabbit/issues/1118 you can try adding the --scores option on the command line:
vw --scores -t -i model.vw --daemon --port 26542
It helped me with my oaa model.

Find files starting with NULs

How do I efficiently find all the files in the system whose contents start with \x0000000000 (5 NUL bytes)?
I tried the following:
$ find . -type f -exec grep -m 1 -ovP "[^\x00]" {} \;
$ find . -type f -exec grep -m 1 -vP "^\00{5}" {} \;
but the first variant works only for all-NUL files, and the last one searches through the whole file, not only the first 5 bytes, which makes it very slow and gives many false positives.
Try this:
grep -r '^\\x0000000000' * | cut -d ":" -f 1
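If the goal is to look only at the first five bytes of each file, as the question asks, another option is to read just those bytes and compare them. The following is a minimal sketch, not a tuned solution: find_nul_files is a hypothetical helper name, and it assumes head -c and od are available (both are on common Linux and BSD systems). The -size +4c test skips files shorter than five bytes before anything is read.

```shell
# find_nul_files DIR: print regular files under DIR whose first five bytes are all NUL.
# (Hypothetical helper name; only the first five bytes of each file are read.)
find_nul_files() {
  find "$1" -type f -size +4c -exec sh -c '
    for f do
      # Dump the first 5 bytes as hex and strip od whitespace: five NULs give 0000000000.
      if [ "$(head -c 5 "$f" | od -An -tx1 | tr -d " \n")" = "0000000000" ]; then
        printf "%s\n" "$f"
      fi
    done' sh {} +
}

# Usage: find_nul_files /some/dir
```

Because find passes many file names to each sh invocation via {} +, this avoids starting a new shell for every file, though head and od are still run once per file.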

Delete certain line while using iperf

I run the iperf command like this:
iperf -c 10.0.0.1 -t 2 -f m -w 1K | grep -Po '[0-9.]*(?= Mbits/sec)'
I want to display only the throughput, such as 0.32, but because I use a 1K window here, a warning is printed and the output becomes
WARNING: TCP window size set to 1024 bytes. A small window size will give poor performance. See the Iperf documentation.
0.32
How can I suppress this warning so that I get only "0.32"?
The warning is written to stderr, which is why it gets past grep. Send stderr to /dev/null and only the throughput is left in the output.
So your command would be:
iperf -c 10.0.0.1 -t 2 -f m -w 1K 2> /dev/null | grep -Po '[0-9.]*(?= Mbits/sec)'
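The effect of 2> /dev/null can be tried without a running iperf server. In the sketch below, fake_iperf is a hypothetical stand-in that mimics the situation described: a warning on stderr and a report line on stdout (the report line is made up for illustration, apart from the 0.32 figure from the question). awk is used here instead of grep -P only to keep the sketch portable.

```shell
# fake_iperf: hypothetical stand-in for iperf; warning goes to stderr, report to stdout.
fake_iperf() {
  echo "WARNING: TCP window size set to 1024 bytes." >&2
  echo "[  3]  0.0- 2.0 sec  0.08 MBytes  0.32 Mbits/sec"
}

# Discard stderr, then print the field just before "Mbits/sec".
throughput=$(fake_iperf 2>/dev/null | awk '/Mbits\/sec/ { print $(NF-1) }')
echo "$throughput"
```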
