I am training a Mahout classifier on my data. I issued the following commands to create the Mahout model:
./bin/mahout seqdirectory -i /tmp/mahout-work-root/MyData-all -o /tmp/mahout-work-root/MyData-seq
./bin/mahout seq2sparse -i /tmp/mahout-work-root/MyData-seq -o /tmp/mahout-work-root/MyData-vectors -lnorm -nv -wt tfidf
./bin/mahout split -i /tmp/mahout-work-root/MyData-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-root/MyData-train-vectors --testOutput /tmp/mahout-work-root/MyData-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
./bin/mahout trainnb -i /tmp/mahout-work-root/Mydata-train-vectors -el -o /tmp/mahout-work-root/model -li /tmp/mahout-work-root/labelindex -ow
When I try to create the model using the trainnb command, I get the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:119)
    at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:152)
What could be the problem here?
Note: the original example mentioned here works fine.
I think the problem might be how your training files are laid out. The files should be organized as follows:
MyData-all
  classA/
    file1
    file2
    ...
  classB/
    filex
    ...
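For example (a minimal sketch; the class names, file names, and source paths are placeholders), you could build that layout before running seqdirectory:
mkdir -p /tmp/mahout-work-root/MyData-all/classA /tmp/mahout-work-root/MyData-all/classB
cp ~/data/classA/* /tmp/mahout-work-root/MyData-all/classA/
cp ~/data/classB/* /tmp/mahout-work-root/MyData-all/classB/
With -el, trainnb extracts the label index from the document keys that seqdirectory derives from these subdirectory names, which is presumably why a flat directory of files ends in the ArrayIndexOutOfBoundsException above.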
I'm using lcov to generate coverage reports. I have a tracefile (broker.info) with this content (relevant fragment shown):
$ lcov -l broker.info
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/orionTypes/]
EntityTypeResponse_test.cpp | 100% 11| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/parse/]
CompoundValueNode_test.cpp | 100% 82| 100% 18| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/rest/]
OrionError_test.cpp |92.1% 38| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/serviceRoutines/]
badVerbAllFour_test.cpp | 100% 24| 100% 7| - 0
...
I want to remove all the info corresponding to the test/unittests files.
I have attempted to use the -r option which, according to the man page, is:
-r tracefile pattern
--remove tracefile pattern
Remove data from tracefile. Use this switch if you want to remove coverage data for a particular set of files from a tracefile. Additional command line parameters will be interpreted as shell wildcard patterns (note that they may need to be escaped accordingly to prevent the shell from expanding them first). Every file entry in tracefile which matches at least one of those patterns will be removed.
The result of the remove operation will be written to stdout or the tracefile specified with -o.
Only one of -z, -c, -a, -e, -r, -l, --diff or --summary may be specified at a time.
Thus, I'm using
$ lcov -r broker.info 'test/unittests/*' -o broker.info2
As far as I understand, test/unittests/* matches the files under test/unittests. However, it's not working (note "Deleted 0 files" below):
Reading tracefile broker.info
Deleted 0 files
Writing data to broker.info2
Summary coverage rate:
lines......: 92.6% (58313 of 62978 lines)
functions..: 96.0% (6451 of 6718 functions)
branches...: no data found
I have also tried these variants (same result):
$ lcov -r broker.info "test/unittests/*" -o broker.info2
$ lcov -r broker.info "test/unittests/\*" -o broker.info2
$ lcov -r broker.info "test/unittests" -o broker.info2
So, maybe I'm doing something wrong?
I'm using lcov version 1.13 (just in case it's relevant).
Thanks!
I have been testing other options, and the following one seems to work, using the wildcard in the prefix as well:
$ lcov -r broker.info "*/test/unittests/*" -o broker.info2
Maybe it is something new in version 1.13, because version 1.11 seems to work without the wildcard in the prefix...
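A quick way to see why the leading wildcard matters (a sketch, assuming a Unix shell; lcov records each source file as an absolute path on an SF: line of the tracefile) is to inspect the tracefile directly:
$ grep "^SF:" broker.info | head
SF:/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/orionTypes/EntityTypeResponse_test.cpp
...
Since the entries are absolute paths, a pattern without a leading *, such as test/unittests/*, matches none of them.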
The lcov command below works fine, even with wildcard characters (lcov 1.14):
lcov --remove meson-logs/coverage.info '/home/builduser/external/*' '/home/builduser/unittest/*' -o meson-logs/sourcecoverage.info
I am using Weka 3.6.13 and trying to use a model to classify data:
java -cp weka-stable-3.6.13.jar weka.classifiers.Evaluation weka.classifiers.trees.RandomForest -l Parking.model -t Data_features_class_ques-2.arff
java.lang.Exception: training and test set are not compatible
The model works when we use the GUI, though: Explorer -> Classify -> Supplied test set (load the ARFF file) -> right-click on the result list and load the model -> right-click again -> re-evaluate model on current test set...
Any pointers would help.
If your data contains "String" attributes, first run StringToWordVector in batch mode, i.e. on both datasets in a single command, so that the training and test sets end up with identical attributes (command 1); then use command 2 and command 3.
Command 1.
java weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff
Command 2.
java weka.classifiers.trees.RandomForest -t training_s2w.arff -d model.model
Command 3.
java weka.classifiers.trees.RandomForest -T test_s2w.arff -l model.model -p 0 > result.txt
PS: Adjust the classpath for weka.jar accordingly.
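For example (hypothetical file locations; the jar name comes from the question above), command 1 with the classpath spelled out would look like:
java -cp weka-stable-3.6.13.jar weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff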
My test.csv file:
==================
1,54,1341775056478
2,1568,1341775056478
1,1622,1341775056498
2,3136,1341775056498
1,3190,1341775056671
2,4704,1341775056671
1,4758,1341775056693
2,6272,1341775056693
1,6326,1341775056714
2,7840,1341775056714
1,7894,1341775056735
2,9408,1341775056735
1,9462,1341775056951
2,10976,1341775056951
1,11030,1341775056972
2,12544,1341775056972
1,12598,1341775056994
2,14112,1341775056994
1,14166,1341775057014
2,15680,1341775057014
1,15734,1341775057065
2,17248,1341775057065
1,17302,1341775057087
2,18816,1341775057087
1,18870,1341775057119
2,20384,1341775057119
....
....
I am trying to cluster this data using Mahout's k-means algorithm.
I followed these steps:
1) Create a sequence file from the test.csv file:
mahout seqdirectory -c UTF-8 -i /user/mahout/input/test.csv -o /user/sample/out_seq -chunk 64
2) Create sparse vectors from the sequence file:
mahout seq2sparse -i /user/mahout/out_seq/ -o /user/mahout/sparse_dir --maxDFPercent 85 --namedVector
3) Perform k-means clustering:
mahout kmeans -i /user/mahout/sparse_dir/tfidf-vectors/ -c /user/mahout/cluster -o /user/mahout/kmeans_out \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure --maxIter 10 --numClusters 20 --ow --clustering
At step 3, I'm facing this error:
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/mahout/text/cluster/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
....
....
How can I overcome this error? I ran the clustering example successfully using the Reuters dataset, but with my dataset it shows this issue. Is there a problem with the dataset, or am I facing this error due to something else?
Any suggestions would be appreciated.
Thanks in advance
So I'm using Weka 3.7.11 on a Windows machine (and running bash scripts with Cygwin), and I found an inconsistency regarding the AODE classifier (which, in this version of Weka, comes from an add-on package).
Using Averaged N-Dependence Estimators from the GUI, I get the following configuration (from an example that worked fine in the Weka Explorer):
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" -W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
So I modified this to get the following command in my bash script:
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
But this gives me the error:
Weka exception: No value given for -S option.
Which is weird, since this was not a problem in the GUI. In the GUI, the information box says that -S is just a flag ("Subsumption Resolution can be achieved by using -S option"), so it shouldn't expect any number at all, which is consistent with what I got using the Explorer.
So then, what's the deal with the -S option when using the command line? Looking at the error text given by weka, I found this:
Options specific to classifier weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE:
-D
Output debugging information
-F <int>
Impose a frequency limit for superParents (default is 1)
-M <double>
Specify a weight to use with m-estimate (default is 1)
-S <int>
Specify a critical value for specialization-generalization SR (default is 100)
-W
Specify if to use weighted AODE
So it seems that this class works in two different ways, depending on which method I use (GUI vs. command line).
The solution I found, at least for the meantime, was to write -S 100 in my script. Is this really the same as just putting -S in the GUI?
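That is (just the workaround spelled out, using the default critical value of 100 from the help text above), the last line of my command becomes:
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S 100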
Thanks in advance.
JM
I've had a play with this classifier, and can confirm that what you are experiencing on your end is consistent with what I see here. From the GUI, the -S option (subsumption resolution) requires no parameters, while the command prompt does (specialization-generalization SR).
They don't sound like the same parameter, so you may need to raise this issue with the developer of the third-party package if you would like more information on these parameters. You can find their contact details via Tools -> Package Manager -> AnDE, which will point you to the maintainers of the library.
I tried to run the k-means example in Mahout 0.5, but failed! I found in kmeans.props that it requires a strange parameter, -c, which means path_to_initial_clusters.
What is this? How can I prepare it?
kmeans.props:
The following parameters must be specified
i|input = /path/to/input
c|clusters = /path/to/initial/clusters
Mahout needs input in a specific format to carry out its clustering algorithms.
So have a look at:
seqdirectory: Generate sequence files (of Text) from a directory
seq2sparse: Sparse vector generation from Text sequence files
As an example, take the Reuters-21578 dataset. The steps are as follows:
1. mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
2. mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
3. mahout kmeans -i reuters-vectors/tfidf-vectors/ \
   -c reuters-initial-clusters \
   -o reuters-kmeans-clusters \
   -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
   -cd 1.0 -k 20 -x 20 -cl
Hope it helps
K-means requires initial clusters in order to iteratively update the centroids (the centers of the clusters) until they converge.
-c (path_to_initial_clusters) just asks you for a directory where Mahout can store its initial clusters.
You can specify any path, and Mahout will compute the initial clusters and store them in that directory. Or you can compute the initial clusters yourself, by canopy clustering or some other method, and give Mahout the directory of the initial clusters you computed to initialize k-means clustering.
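As a concrete sketch (reusing the Reuters paths from the answer above; the canopy thresholds -t1/-t2 are placeholder values you would tune for your data), the two ways of seeding look roughly like this:
# Let Mahout sample k random seed vectors into the -c directory:
mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -c reuters-initial-clusters \
  -o reuters-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -k 20 -x 20 -cl
# Or compute seeds yourself, e.g. with canopy clustering, and pass that directory as -c:
mahout canopy -i reuters-vectors/tfidf-vectors/ \
  -o reuters-canopy-centroids \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -t1 2000 -t2 1500
mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -c reuters-canopy-centroids/clusters-0 \
  -o reuters-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -x 20 -cl
When -k is given, Mahout overwrites whatever is in the -c directory with its random seeds, so omit -k only when you are supplying your own initial clusters.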