Running Mahout kMeans in local - mahout

I am running Mahout kmeans with MAHOUT_LOCAL="True".
Below is the command I have in my shell script:
mahout kmeans --input ./seq_input/ --output ./output --numClusters 4 --maxIter 10 --convergenceDelta .0001 --clustering --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure --overwrite --clusters ./centroid_vectors
While running the script, I get the error below:
Unknown program 'kmeans' chosen.
Below is the log information:
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/mahout/mahout-examples-0.7-cdh4.3.0-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/mahout/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
log4j:WARN No appenders could be found for logger (org.apache.mahout.driver.MahoutDriver).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Unknown program 'kmeans' chosen.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
cat: : Print a file or resource as the logistic regression models would see it
hmmpredict: : Generate random sequence of observations by given HMM
lucene.vector: : Generate Vectors from a Lucene index
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seqwiki: : Wikipedia xml dump to sequence file
svd: : Lanczos Singular Value Decomposition
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainlogistic: : Train a logistic regression using stochastic gradient descent
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
viterbi: : Viterbi decoding of hidden states from given output states sequence
What is causing this error?

Related

Training a logistic regression in Mahout strange behaviour

I'm trying to train a logistic regression model in Mahout. The command I use is this:
mahout trainlogistic --input /home/cloudera/Desktop/final.csv --output /home/cloudera/Desktop/model/model --target Action --predictors Open High Close --types word --features 20 --passes 100 --rate 50 --categories 2
The files I use actually exist. I'm reading a book that says that I should expect an output that looks like
Action ~ 647.186*Close+-44.975*High+3.269*Intercept term +-601.454*Open
and then a 4x2 matrix.
What I actually get is a terminal being filled with calculations, no Action ~, and a 5x4 matrix.
What am I doing wrong?
Well, it turned out that my predictors were numeric, so --types should have been numeric rather than word. I have no idea why the book I referenced called them words.
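For readers wondering what the `Action ~ ...` line from the book represents: it is the fitted linear form of the model, and logistic regression turns that linear score into a probability with the logistic (sigmoid) function. The following is only a minimal sketch of that arithmetic in Python, not Mahout's code. The coefficients are the ones quoted from the book above, and the input row is a made-up, hypothetical example.

```python
import math

# Coefficients copied from the book's expected output quoted above.
COEFFICIENTS = {"Close": 647.186, "High": -44.975, "Open": -601.454}
INTERCEPT = 3.269


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(row):
    """Probability for one row of numeric predictors."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * row[name] for name in COEFFICIENTS)
    return sigmoid(z)


# Hypothetical input values, chosen only for illustration.
example_row = {"Open": 10.0, "High": 11.0, "Close": 10.5}
print(score(example_row))
```

The arithmetic only makes sense when the predictors are treated as numbers, which is consistent with the fix described above.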

Weka: Multiclass classification for Text documents giving abnormal result

I am new to Weka. I am trying to classify text documents after an OCR process. The training corpus contains 286 mortgage documents and 57 note documents. The test dataset contains 1-100 text pages. Each line of the training and test datasets contains a few paragraphs of text. After classification, each text document should be labeled correctly as either mortgage or note.
I am doing a StringToWordVector operation combining both Training and Test dataset with missing values from Test dataset i.e. "?".
Steps are as follows:
Create training Arff file using following command line:
java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir <text directory>
This creates a training dataset with known classes i.e. mortgage, note
Create test Arff file with missing classes i.e "?"
Combine both training and test dataset
Run the classifier with following command line:
java -cp weka.jar weka.classifiers.meta.FilteredClassifier -t train.arff -T test.arff -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F weka.filters.unsupervised.attribute.Standardize" -d trained.model -p 0
I am running the above example from both Weka GUI and from command line as well. Everything works fine as far as commands are concerned. The results are abnormal. Not at all correct.
I have also tried to run StringToWordVector operation separately and tested through NaiveBayes, NaiveBayesMultiNomial, J48 and other multiclass classifiers on the dataset but classification prediction is not correct. Always giving abnormal results.
Please help me to get the proper prediction result. Let me know if the above steps are correct and if I am doing anything wrong.

How to use WEKA Machine Learning for a Bayes Neural Network and J48 Decision Tree

I am trying to figure out WEKA and perform some experiments with data that I have.
Basically, what I want to do is take Data Set 1 and use it as a training set to build a J48 Decision Tree. Then I want to take Data Set 2 and run the trained tree on it, so that the output is the original data set with an extra column holding the prediction.
Then do the same thing again with the Bayes Neural Network.
Can someone point me to detailed instructions on how exactly I would accomplish this? I seem to be missing some steps and cannot get the output of the original data set with the extra column.
Here is one way to do it from the command line. This information is found in Chapter 1 ("A command-line primer") of the Weka manual that comes with the software.
java weka.classifiers.trees.J48 -t training_data.arff -T test_data.arff -p 1-N
where:
-t <training_data.arff> specifies the training data in ARFF format
-T <test_data.arff> specifies the test data in ARFF format
-p 1-N specifies that you want to output the feature vector and the prediction,
where N is the number of features in your feature vector.
For example, here I am using soybean.arff for both training and testing. There are 35 features in the feature vector:
java weka.classifiers.trees.J48 -t soybean.arff -T soybean.arff -p 1-35
The first few lines of the output look like:
=== Predictions on test data ===
inst# actual predicted error prediction (date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,plant-growth,leaves,leafspots-halo,leafspots-marg,leafspot-size,leaf-shread,leaf-malf,leaf-mild,stem,lodging,stem-cankers,canker-lesion,fruiting-bodies,external-decay,mycelium,int-discolor,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots)
1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,no,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
2 1:diaporth 1:diaporth 0.952 (august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,yes,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
The columns are: (1) data instance number; (2) ground truth label; (3) predicted label; (4) error; (5) prediction confidence; and (6) feature vector.
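If you want those columns in a form you can process further, the prediction lines can be split with a small script. This is a minimal sketch in Python that assumes the output layout shown above (instance number, `index:label` pairs, an optional `+` in the error column, the confidence, then the feature vector in parentheses). The names `Prediction` and `parse_prediction` are made up for this example, and other Weka versions or options may format the output differently.

```python
import re
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    inst: int
    actual: str
    predicted: str
    error: bool
    confidence: float
    features: List[str]


# inst#, actual, predicted, optional "+" error marker, confidence, (features)
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\+\s+)?([0-9.]+)\s+\((.*)\)\s*$"
)


def parse_prediction(line: str) -> Optional[Prediction]:
    """Parse one prediction line; return None for headers and other lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    inst, actual, predicted, err, conf, feats = m.groups()
    # Labels look like "1:diaporth"; keep only the part after the colon.
    actual_label = actual.partition(":")[2] or actual
    predicted_label = predicted.partition(":")[2] or predicted
    return Prediction(int(inst), actual_label, predicted_label,
                      err is not None, float(conf), feats.split(","))


sample = "1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm)"
print(parse_prediction(sample))
```

Lines that do not match, such as the `=== Predictions on test data ===` banner and the column header, come back as `None` and can simply be skipped.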

How to re-evaluate a model in WEKA?

I am trying to solve a numeric classification problem with numeric attributes in WEKA using linear regression, and then I want to test my model on the existing dataset with "re-evaluate model on current test dataset".
As a result of the evaluation I am getting the summary:
Correlation coefficient 0.9924
Mean absolute error 1.1017
Root mean squared error 1.2445
Total Number of Instances 17
But I don't have results as it is shown here: http://weka.wikispaces.com/Making+predictions
How to bring WEKA to the result I need?
Thank you.
To answer my own question: for a trained and tested model, right-click on the model and go to "visualize classifier error". There, use the save option to save the actual and predicted values.
Are you using the command-line interface (CLI) or the GUI?
If CLI, the command given in the above link works fine:
java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0
So when you train the model you save it as *.model (j48.model) and later use it to evaluate on test data (unclassified.arff)
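The two steps described here (train and save, then load and predict) can be scripted. Below is a minimal sketch in Python that only builds the two command lines. It assumes Weka is on the Java classpath, as in the command above, and the file name `training.arff` and the helper function names are hypothetical. The options used (`-t`, `-d`, `-T`, `-l`, `-p`) are Weka's standard classifier options.

```python
J48 = "weka.classifiers.trees.J48"


def weka_train_cmd(train_arff, model_path, classifier=J48):
    """Command that trains on train_arff and saves the model (-d)."""
    return ["java", classifier, "-t", train_arff, "-d", model_path]


def weka_predict_cmd(test_arff, model_path, classifier=J48):
    """Command that loads a saved model (-l) and prints predictions (-p 0)."""
    return ["java", classifier, "-T", test_arff, "-l", model_path, "-p", "0"]


# Hypothetical file names, for illustration only.
train_cmd = weka_train_cmd("training.arff", "j48.model")
predict_cmd = weka_predict_cmd("unclassified.arff", "j48.model")
print(" ".join(train_cmd))
print(" ".join(predict_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the ARFF files exist; the sketch deliberately does not run anything itself.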

Training Output of LIBLINEAR

I am trying the libSVM package, playing with RBF and Linear classification, and I followed (I think) all recommendations in their README files.
I have a big file to train on, (70K) so I am trying to use liblinear instead of RBF.
The only problem is that I am unable to get the model after the training phase, my command line looks like this :
./train -c 4 -v 5 -s 6 TrainingSet.scal TrainingSet.scal.Model
After the training is done, I have the accuracy estimation but then when I look at the *.model file to use it against my test set, I simply don't find it.
Do you think it is a bug in the package or is there something I am missing here?
Thanks
Rad
Option -v 5 means that you are doing 5-fold cross-validation on the training set. If this option is enabled, liblinear estimates the error using 5-fold cross-validation and doesn't output a model.
If you want to output a model, then don't use -v 5. It doesn't output the training error in that case, but you can use liblinear's predict program to estimate the error on a test set.
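To see why cross-validation yields an accuracy figure but no model, here is a minimal sketch of k-fold evaluation in Python. It is not liblinear's implementation: the "model" is a trivial majority-class predictor and the fold assignment is a simple round-robin, both chosen only to keep the example short. The point is that k temporary models are trained and thrown away, and only the aggregate accuracy survives.

```python
from collections import Counter


def k_fold_accuracy(labels, k):
    """Cross-validated accuracy of a trivial majority-class 'model'."""
    n = len(labels)
    correct = 0
    for fold in range(k):
        # Every k-th instance, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, n, k))
        train = [y for i, y in enumerate(labels) if i not in test_idx]
        # "Train" a temporary model on the remaining folds.
        majority = Counter(train).most_common(1)[0][0]
        correct += sum(1 for i in test_idx if labels[i] == majority)
        # The temporary model is discarded here; nothing is saved.
    return correct / n


labels = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(k_fold_accuracy(labels, 5))
```

A usable model comes from a separate training run on the full training set, which is what running train without -v does.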
I normally use the library directly in code, but I think in your case the training is not being performed because you are using the option -s 6, which I think is undefined.
This is the svm-train usage:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC (multi-class classification)
1 -- nu-SVC (multi-class classification)
2 -- one-class SVM
3 -- epsilon-SVR (regression)
4 -- nu-SVR (regression)
You are also omitting the kernel type
-t kernel_type : set type of kernel function (default 2)
0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
4 -- precomputed kernel (kernel values in training_set_file)
Hopefully that does the trick.
