No input clusters found in /user/mahout/cluster/part-randomSeed. Check your -c argument - mahout

My test.csv file:
==================
1,54,1341775056478
2,1568,1341775056478
1,1622,1341775056498
2,3136,1341775056498
1,3190,1341775056671
2,4704,1341775056671
1,4758,1341775056693
2,6272,1341775056693
1,6326,1341775056714
2,7840,1341775056714
1,7894,1341775056735
2,9408,1341775056735
1,9462,1341775056951
2,10976,1341775056951
1,11030,1341775056972
2,12544,1341775056972
1,12598,1341775056994
2,14112,1341775056994
1,14166,1341775057014
2,15680,1341775057014
1,15734,1341775057065
2,17248,1341775057065
1,17302,1341775057087
2,18816,1341775057087
1,18870,1341775057119
2,20384,1341775057119
....
....
I am trying to cluster this data using Mahout's k-means algorithm.
I followed these steps:
1) Create a sequence file from the test.csv file:
mahout seqdirectory -c UTF-8 -i /user/mahout/input/test.csv -o /user/sample/out_seq -chunk 64
2) Create sparse vectors from the sequence file:
mahout seq2sparse -i /user/mahout/out_seq/ -o /user/mahout/sparse_dir --maxDFPercent 85 --namedVector
3) Perform k-means clustering:
mahout kmeans -i /user/mahout/sparse_dir/tfidf-vectors/ -c /user/mahout/cluster -o /user/mahout/kmeans_out
-dm org.apache.mahout.common.distance.CosineDistanceMeasure --maxIter 10 --numClusters 20 --ow --clustering
At step 3, I'm facing this error:
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/mahout/text/cluster/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
....
....
How can I overcome this error? I ran the clustering example successfully with the Reuters dataset, but with my own dataset it shows this issue. Is there a problem with the dataset, or is some other issue causing this error?
Can anyone suggest anything regarding this issue?
Thanks in advance
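One thing worth checking, sketched against the paths above: seqdirectory treats each file as one document, so a single CSV becomes a one-document corpus, and seq2sparse's --maxDFPercent filter can then prune every term (each term occurs in 100% of the documents), leaving tfidf-vectors empty and the random seed generator with nothing to sample. Dumping the vectors would confirm or rule this out (flag names as in recent Mahout releases):
mahout seqdumper -i /user/mahout/sparse_dir/tfidf-vectors/
Also note that step 1 writes to /user/sample/out_seq while step 2 reads /user/mahout/out_seq/, so step 2 may not even be seeing step 1's output.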

Related

Unable to classify using Weka from the command line

I am using Weka 3.6.13 and trying to use a saved model to classify data:
java -cp weka-stable-3.6.13.jar weka.classifiers.Evaluation weka.classifiers.trees.RandomForest -l Parking.model -t Data_features_class_ques-2.arff
java.lang.Exception: training and test set are not compatible
The model works, though, when we use the GUI: Explorer -> Classify -> Supplied test set (load the ARFF file) -> right-click on the result list and load the model -> right-click again -> re-evaluate model on current test set...
Any pointers would be appreciated.
If your data contains "string" attributes, first run StringToWordVector in batch mode, i.e. on both datasets in a single command (command 1), then use command 2 and command 3.
Command 1.
java weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff
Command 2.
java weka.classifiers.trees.RandomForest -t training_s2w.arff -d model.model
Command 3.
java weka.classifiers.trees.RandomForest -T test_s2w.arff -l model.model -p 0 > result.txt
PS: add the path to weka.jar to each command's classpath accordingly.
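For context, this is the kind of data that needs the filter: a minimal hypothetical ARFF whose text attribute is a string (names invented for illustration):
@relation docs
@attribute text string
@attribute class {yes,no}
@data
'first example document',yes
'second example document',no
Without the batch StringToWordVector pass, the training and test files end up with different word-derived attributes, which is exactly what the "training and test set are not compatible" exception complains about.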

What is the truly correct usage of the -S parameter on the Weka classifier A1DE?

So I'm using Weka 3.7.11 on a Windows machine (running bash scripts with Cygwin), and I found an inconsistency regarding the AODE classifier (which, in this version of Weka, comes from an add-on package).
Using Averaged N-Dependencies Estimators from the GUI, I get the following configuration (from an example that worked alright in the Weka Explorer):
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" -W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
So I modified this to get the following command in my bash script:
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
But this gives me the error:
Weka exception: No value given for -S option.
Which is weird, since this was not a problem in the GUI. There, the information box says that -S is just a flag ("Subsumption Resolution can be achieved by using -S option"), so it shouldn't expect any value at all, which is consistent with what I got using the Explorer.
So then, what's the deal with the -S option when using the command line? Looking at the error text given by weka, I found this:
Options specific to classifier weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE:
-D
Output debugging information
-F <int>
Impose a frequency limit for superParents (default is 1)
-M <double>
Specify a weight to use with m-estimate (default is 1)
-S <int>
Specify a critical value for specialization-generalization SR (default is 100)
-W
Specify if to use weighted AODE
So it seems that this class works in two different ways, depending on whether I use the GUI or the command line.
The solution I found, at least for the time being, was to write -S 100 in my script. Is this really the same as just ticking -S in the GUI?
Thanks in advance.
JM
I've had a play with this classifier and can confirm that what you are experiencing on your end is consistent with what I have here. In the GUI, the -S option (Subsumption Resolution) takes no value, while on the command line it expects one (the critical value for specialization-generalization SR).
They don't sound like the same parameter, so you may need to raise this with the developer of the third-party package if you would like to know more about them. You can find the contact details for the library under Tools -> Package Manager -> AnDE.
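As a sketch, the asker's command with the workaround applied; the only change is the explicit -S 100 (the documented default):
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S 100
Whether an explicit -S 100 exactly reproduces the GUI's bare -S flag is the open question above, so it is worth confirming that the GUI and command-line runs agree.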

What does the /path/to/initial/clusters parameter mean in the Mahout 0.5 kmeans example?

I tried to run the kmeans example in Mahout 0.5, but failed! I found in kmeans.props that it requires a strange parameter, -c, which means path_to_initial_clusters.
What is this, and how do I prepare it?
kmeans.props:
The following parameters must be specified
i|input = /path/to/input
c|clusters = /path/to/initial/clusters
Mahout needs input in a specific format to carry out its clustering algorithms.
So have a look at:
seq2sparse: Sparse vector generation from Text sequence files
seqdirectory: Generate sequence files (of Text) from a directory
As an example, take the Reuters-21578 dataset.
The steps are as follows:
1. mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
2. mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
3. mahout kmeans -i reuters-vectors/tfidf-vectors/ \
-c reuters-initial-clusters \
-o reuters-kmeans-clusters \
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
-cd 1.0 -k 20 -x 20 -cl
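(For reference, in that last command -cd is the convergence delta, -k the number of clusters to seed by random sampling, -x the maximum number of iterations, and -cl asks Mahout to also assign the input points to the final clusters.)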
Hope it helps
K-means requires initial clusters so that it can iteratively update the centroids (the centers of the clusters) until they converge.
The -c (path_to_initial_clusters) option just asks you for a directory in which Mahout can store its initial clusters.
You can specify any path: when you also pass -k, Mahout computes the initial clusters itself (by random sampling) and stores them in that directory. Alternatively, you can compute the initial clusters by canopy clustering or some other method, and point Mahout at the directory holding them to initialize k-means.
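For the canopy route, a sketch (the -t1/-t2 distance thresholds are placeholders to tune, and the name of the canopy output subdirectory can vary between Mahout versions):
mahout canopy -i reuters-vectors/tfidf-vectors -o reuters-canopy-centroids -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -t1 2.0 -t2 1.0
mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-canopy-centroids/clusters-0 -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -x 20 -cl
Note that -k is omitted in the kmeans call: when -k is present, Mahout ignores whatever is in -c and overwrites it with randomly sampled seeds.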

Error while creating mahout model

I am training a Mahout classifier on my data.
I issued the following commands to create the Mahout model:
./bin/mahout seqdirectory -i /tmp/mahout-work-root/MyData-all -o /tmp/mahout-work-root/MyData-seq
./bin/mahout seq2sparse -i /tmp/mahout-work-root/MyData-seq -o /tmp/mahout-work-root/MyData-vectors -lnorm -nv -wt tfidf
./bin/mahout split -i /tmp/mahout-work-root/MyData-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-root/MyData-train-vectors --testOutput /tmp/mahout-work-root/MyData-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
./bin/mahout trainnb -i /tmp/mahout-work-root/Mydata-train-vectors -el -o /tmp/mahout-work-root/model -li /tmp/mahout-work-root/labelindex -ow
When I try to create the model using the trainnb command, I get the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:119)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:152)
What could be the problem here?
Note: the original example mentioned here works fine.
I think it might be a problem with how your training files are laid out.
The files should be organized as follows:
MyData-all
  /classA
    file1
    file2
    ...
  /classB
    filex
    ...
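A quick sketch of preparing that layout (paths hypothetical): seqdirectory keys each document by its path, and trainnb -el then extracts the class label from the directory component of the key.
mkdir -p /tmp/mahout-work-root/MyData-all/classA /tmp/mahout-work-root/MyData-all/classB
cp ~/corpus/classA/*.txt /tmp/mahout-work-root/MyData-all/classA/
cp ~/corpus/classB/*.txt /tmp/mahout-work-root/MyData-all/classB/
If the files sit directly under MyData-all with no class subdirectories, there is no label component to extract, which would produce exactly the ArrayIndexOutOfBoundsException thrown from writeLabelIndex.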

Vectorizing a solr index with mahout using lucene.vector

I'm trying to run a clustering job on Amazon EMR using Mahout.
I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout's lucene.vector (this is the first step in the job flow).
The parameters for the step are:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.driver.MahoutDriver
Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
The error in the log is:
Unknown program 'lucene.vector' chosen.
I've done the same process locally with Hadoop and Mahout and it worked fine.
How should I call the lucene.vector function on EMR?
The program name, lucene.vector, should come immediately after bin/mahout, for example:
/homes/cuneyt/trunk/bin/mahout lucene.vector --dir /homes/cuneyt/lucene/index --field 0 --output lda/vector --dictOut /homes/cuneyt/lda/dict.txt
I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of
org.apache.mahout.driver.MahoutDriver
I should have used:
org.apache.mahout.utils.vectors.lucene.Driver
Therefore, the correct step parameters should have been:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.utils.vectors.lucene.Driver
Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
