Weka: Multiclass classification for Text documents giving abnormal result

I am new to Weka. I am trying to classify text documents produced by an OCR process. The training corpus contains 286 mortgage documents and 57 note documents. The test dataset contains between 1 and 100 text pages. Each line of the training and test datasets contains a few paragraphs of text. After classification, each document should be labelled as either mortgage or note.
I run a StringToWordVector operation on the combined training and test datasets, with the class values of the test instances set to missing, i.e. "?".
Steps are as follows:
1) Create the training ARFF file using the following command line:
java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir <text directory>
This creates a training dataset with known classes, i.e. mortgage and note.
2) Create the test ARFF file with missing classes, i.e. "?".
3) Combine the training and test datasets.
4) Run the classifier with the following command line:
java -cp weka.jar weka.classifiers.meta.FilteredClassifier -t train.arff -T test.arff -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F weka.filters.unsupervised.attribute.Standardize" -d trained.model -p 0
I have run the above example both from the Weka GUI and from the command line. The commands themselves run without errors, but the predictions are not correct at all.
I have also tried running StringToWordVector separately and testing the result with NaiveBayes, NaiveBayesMultinomial, J48 and other multiclass classifiers, but the predictions are still wrong.
Please help me get proper prediction results. Let me know whether the above steps are correct or whether I am doing something wrong.

Related

k-fold cross validation in RankLib

I want to do 5-fold cross-validation on the MQ2008 dataset. I am using RankLib to apply an ML algorithm to the dataset, and I am confused about the -kcv option that RankLib provides for cross-validation.
Command used:
java -jar RankLib.jar -ranker 0 -train train.txt -test test.txt -validate vali.txt -kcv 5
Here we specify separate files for training, testing and validation. How, then, does it divide the data for 5-fold cross-validation?
To do k-fold cross-validation using RankLib, you only need one dataset.
The program itself randomly divides the data into training, test and validation sets.
With 5-fold cross-validation, the program repeats the process 5 times and reports the average of the 5 runs as the final result.
You need to choose a metric for evaluating the learner. See [ -metric2t <metric> ] on the RankLib "How to use" page.
For example, see the command below. I have only one dataset to feed my algorithm. I used NDCG@10 as my evaluation metric. I also used -kcvmd to save the models in a directory and -kcvmn to name them.
java -jar RankLib-2.1-patched.jar -train trainingData.txt -ranker 8 -kcv 5 -kcvmd kcvModels/ -kcvmn txt -metric2t NDCG@10 -metric2T NDCG@10 -save Models/model.txt
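To make the idea of splitting a single dataset into folds concrete, here is a minimal Python sketch. It is not RankLib's code, and RankLib's actual partitioning may differ; it only assumes LETOR-style lines of the form "<label> qid:<id> <feature>:<value> ...", and the function names and toy data are made up for illustration. Folds are built per query so that all documents of one query stay together.

```python
import random
from collections import defaultdict


def read_queries(lines):
    """Group LETOR-style lines ('<label> qid:<id> 1:<v> ...') by query id."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qid = line.split()[1]  # second token is 'qid:<id>'
        groups[qid].append(line)
    return groups


def k_fold_by_query(groups, k, seed=0):
    """Yield (train_qids, test_qids); every query is in a test fold exactly once."""
    qids = sorted(groups)
    random.Random(seed).shuffle(qids)
    folds = [qids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test


# Toy data (hypothetical): 10 queries with 3 documents each.
toy = [f"{d % 3} qid:{q} 1:0.{d}" for q in range(1, 11) for d in range(3)]
groups = read_queries(toy)
splits = list(k_fold_by_query(groups, k=5))
for train, test in splits:
    print(len(train), "training queries,", len(test), "test queries")
```

The final cross-validation score is then the average of the metric measured on each of the k test folds.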

Vowpal Wabbit multiple class classification predict probabilities

I am trying to do multiple classification problem with Vowpal Wabbit.
I have a train file that look like this:
1 |feature_space
2 |feature_space
3 |feature_space
As an output I want to get probabilities of test item belonging to each class, like this:
1:0.13 2:0.57 3:0.30
Think of the predict_proba method of sklearn classifiers, for example.
I've tried the following:
1) vw --oaa 3 train.file -f model.file --loss_function logistic --link logistic
vw -p predict.file -t test.file -i model.file --raw_predictions pred.txt
but the pred.txt file is empty (it is created but contains no records). predict.file contains only the final class and no probabilities.
2) vw --csoaa 3 train.file -f model.file --link logistic
I've modified the input files to fit the cost-sensitive (cs) format. csoaa doesn't accept --loss_function logistic and fails with the following error message: "You are using a label not -1 or 1 with a loss function expecting that!"
If I use the default squared loss function and a similar output command, I get a pred.txt with raw predictions for each class per item, for example:
2.33 1.67 0.55
I believe these are the resulting squared distances.
Is there a way to get VW to output class probabilities, or to somehow convert these distances into probabilities?
The empty raw predictions file was caused by a bug in VW version 7.9.0, which was fixed in 7.10.0.
Since November 2015, the easiest way to obtain probabilities is to use --oaa=N --loss_function=logistic --probabilities -p probs.txt. (Or, if you need label-dependent features: --csoaa_ldf=mc --loss_function=logistic --probabilities -p probs.txt.)
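If only raw per-class scores are available, one simple way to turn them into probabilities is to apply the logistic function to each score and then normalise so the values sum to 1. The Python sketch below shows that conversion. It is an illustration rather than VW's own code: it assumes the scores come from a one-against-all model trained with logistic loss (it does not apply to the squared-loss cost predictions shown in the question), the "label:score" line format is assumed, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def parse_raw_line(line):
    """Parse a line such as '1:-1.2 2:0.4 3:-0.3' into {1: -1.2, 2: 0.4, 3: -0.3}."""
    scores = {}
    for token in line.split():
        label, value = token.split(":", 1)
        scores[int(label)] = float(value)
    return scores


def scores_to_probabilities(raw_scores):
    """Squash each one-against-all score, then normalise so the values sum to 1."""
    squashed = {label: sigmoid(s) for label, s in raw_scores.items()}
    total = sum(squashed.values())
    return {label: p / total for label, p in squashed.items()}


# Hypothetical raw scores for one test item with three classes.
example = parse_raw_line("1:-1.2 2:0.4 3:-0.3")
probs = scores_to_probabilities(example)
print(" ".join(f"{label}:{p:.2f}" for label, p in sorted(probs.items())))
```

The class with the highest raw score remains the most probable one; the normalisation only rescales the values so that they can be read as a distribution over the classes.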

Training a logistic regression in Mahout strange behaviour

I'm trying to train a logistic regression model in Mahout. The command I use is this:
mahout trainlogistic --input /home/cloudera/Desktop/final.csv --output /home/cloudera/Desktop/model/model --target Action --predictors Open High Close --types word --features 20 --passes 100 --rate 50 --categories 2
The files I use actually exist. I'm reading a book that says that I should expect an output that looks like
Action ~ 647.186*Close+-44.975*High+3.269*Intercept term +-601.454*Open
and then a 4x2 matrix.
What I actually get is a terminal being filled with calculations, no Action ~, and a 5x4 matrix.
What am I doing wrong?
Well, my predictors were numeric, so --types should have been numeric rather than word. I have no idea why the book I referenced called them words.

How to use WEKA Machine Learning for a Bayes Neural Network and J48 Decision Tree

I am trying to figure out WEKA and perform some experiments with data that I have.
Basically, what I want to do is take Data Set 1 and use it as a training set to build a J48 decision tree. Then I want to run the trained tree on Data Set 2 and output that data set with an extra column containing the prediction.
Then I want to do the same thing with the Bayes Neural Network.
Can someone point me to detailed instructions on how exactly to accomplish this? I seem to be missing some steps and cannot get the output of the original data set with the extra column.
Here is one way to do it with the command-line. This information is found in Chapter 1 ("A command-line primer") of the Weka manual that comes with the software.
java weka.classifiers.trees.J48 -t training_data.arff -T test_data.arff -p 1-N
where:
-t <training_data.arff> specifies the training data in ARFF format
-T <test_data.arff> specifies the test data in ARFF format
-p 1-N specifies that you want to output the feature vector and the prediction,
where N is the number of features in your feature vector.
For example, here I am using soybean.arff for both training and testing. There are 35 features in the feature vector:
java weka.classifiers.trees.J48 -t soybean.arff -T soybean.arff -p 1-35
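The value of N for -p 1-N can also be derived from the ARFF header instead of being counted by hand. The Python sketch below does that under two assumptions: each attribute is declared on a line starting with @attribute, and the class is the last attribute (Weka's default when no class index is given). The function names and the toy header are hypothetical.

```python
def count_features(arff_lines):
    """Count @attribute declarations in the header, excluding the class attribute."""
    n_attributes = 0
    for line in arff_lines:
        token = line.strip().lower()
        if token.startswith("@data"):
            break  # the header ends where the data section starts
        if token.startswith("@attribute"):
            n_attributes += 1
    return max(n_attributes - 1, 0)  # assumption: the last attribute is the class


def prediction_range(arff_lines):
    """Build the argument for Weka's -p option, e.g. '1-35'."""
    return f"1-{count_features(arff_lines)}"


# Toy header (hypothetical): four features plus a class attribute.
toy_header = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
""".splitlines()

print("-p", prediction_range(toy_header))
```

For a real file, pass an open file object instead of the toy list, for example prediction_range(open("soybean.arff")).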
The first few lines of the output look like:
=== Predictions on test data ===
inst# actual predicted error prediction (date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,plant-growth,leaves,leafspots-halo,leafspots-marg,leafspot-size,leaf-shread,leaf-malf,leaf-mild,stem,lodging,stem-cankers,canker-lesion,fruiting-bodies,external-decay,mycelium,int-discolor,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots)
1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,no,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
2 1:diaporth 1:diaporth 0.952 (august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,abnorm,abnorm,absent,dna,dna,absent,absent,absent,abnorm,yes,above-sec-nde,brown,present,firm-and-dry,absent,none,absent,norm,dna,norm,absent,absent,norm,absent,norm)
The columns are: (1) data instance number; (2) ground truth label; (3) predicted label; (4) error; (5) prediction confidence; and (6) feature vector.
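To get from this output to "the original data with an extra prediction column", each line can be split into the columns listed above. The Python sketch below is one way to do that. It assumes the layout shown in the sample output, and additionally assumes that a "+" appears in the error column for misclassified instances; the function name and the shortened sample lines are made up for illustration.

```python
def parse_prediction(line):
    """Split one line of Weka's '-p' output into its columns."""
    head, _, tail = line.partition("(")
    tokens = head.split()
    is_error = "+" in tokens  # assumption: '+' marks a misclassified instance
    if is_error:
        tokens.remove("+")
    inst, actual, predicted, confidence = tokens[:4]
    tail = tail.rstrip()
    if tail.endswith(")"):
        tail = tail[:-1]
    return {
        "instance": int(inst),
        "actual": actual.split(":", 1)[1],       # '1:diaporth' -> 'diaporth'
        "predicted": predicted.split(":", 1)[1],
        "error": is_error,
        "confidence": float(confidence),
        "features": tail.split(",") if tail else [],
    }


# Shortened, hypothetical sample lines in the same layout as the output above.
ok = parse_prediction("1 1:diaporth 1:diaporth 0.952 (october,normal,gt-norm)")
bad = parse_prediction("3 2:charcoal 1:diaporth + 0.6 (august,normal,norm)")

# Original feature values with the prediction appended as an extra column.
print(",".join(ok["features"] + [ok["predicted"]]))
```

Writing the joined line for every instance gives a CSV-like file with the original feature values followed by the predicted label.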

How to re-evaluate a model in WEKA?

I am trying to solve a numeric prediction problem with numeric attributes in WEKA using linear regression, and then I want to test my model on the existing dataset with the "Re-evaluate model on current test set" option.
As a result of the evaluation I get the following summary:
Correlation coefficient 0.9924
Mean absolute error 1.1017
Root mean squared error 1.2445
Total Number of Instances 17
But I don't get per-instance results like those shown here: http://weka.wikispaces.com/Making+predictions
How can I get WEKA to produce the output I need?
Thank you.
To answer my own question: for a trained and tested model, right-click the model and choose "Visualize classifier errors". There, use the Save option to save the actual and predicted values.
Are you using the command-line interface (CLI) or the GUI?
If you are using the CLI, the command given in the above link works fine:
java weka.classifiers.trees.J48 -T unclassified.arff -l j48.model -p 0
So when you train the model, you save it as a *.model file (j48.model) and later load it to evaluate the test data (unclassified.arff).
