How do I compare two classifiers in Weka using the Paired T-test - machine-learning

In Weka I can go to the Experimenter. In the set-up I can load an .arff file and get Weka to create a classifier (e.g. J48), then I can run it, and finally I can go to the analyze tab. This tab gives me an option for 'testing with Paired T-Test', but I cannot figure out how to create a second classifier (e.g. J48 unpruned) and do a T-Test on the two results.
Google does not lead me to any tutorial or answers.
How can I get Weka to do a T-Test on the results of two different classifiers, made from the same data?

Please follow the steps in http://fiji.sc/Advanced_Weka_Segmentation_-_How_to_compare_classifiers.
In the screenshot on that page, the author sets the significance level to 0.05. As I understand it, such a test always compares against a baseline classifier (here NaiveBayes). The output uses the annotations v and * to indicate that a specific result is statistically better (v) or worse (*) than the baseline scheme at the specified significance level (currently 0.05). Note that the baseline might not be the one you expected.
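To make the statistic itself concrete, here is a minimal Python sketch of a plain paired t-test on per-fold accuracies. The accuracy values are made up for illustration, and paired_t_statistic is a hypothetical helper name. As far as I know, the Experimenter also offers a corrected variant of this test, which the sketch does not implement.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for two paired samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical per-fold accuracies of two classifiers on the SAME folds.
pruned = [0.90, 0.92, 0.88, 0.91, 0.89]
unpruned = [0.88, 0.91, 0.87, 0.88, 0.86]

t, df = paired_t_statistic(pruned, unpruned)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

With 4 degrees of freedom the two-sided critical value at the 0.05 level is about 2.776, so a larger absolute t would be flagged as a significant difference.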

Related

Result verification with Weka Experiment tab with individual classifier models

I ran different classifiers on the same dataset and got some statistical values after running them.
This is the summary of all classifiers:
I am using Weka to train the models. Weka itself has a way to compare different algorithms, for which we need to use the Experiment tab. I have used this option as well for the same dataset.
Weka gave me this result for the Kappa statistic when using the Experiment tab:
Root mean squared error:
Relative absolute error:
and so on.
Now I am unable to understand how the values I got from the Experiment tab relate to the values that I shared in table format in the first picture.
I presume that the initial table was populated with statistics obtained from cross-validation runs in the Weka Explorer.
The Explorer aggregates the predictions across a single cross-validation run so that it appears that you had a single test set of that size. It is only to be used as an explorative tool, hence the name.
The Experimenter records the metrics (such as accuracy, RMSE, etc.) generated from each fold pair across the number of runs that you perform during your experiment. The metrics collected across multiple classifiers and/or datasets can then be analyzed using significance tests. By default, 10 runs of 10-fold CV are used, which is recommended for such comparisons. This results in 100 individual values for each metric, from which the mean and standard deviation are generated. The markers * and v indicate whether there is a statistically significant loss or win, respectively.
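The difference between the two ways of summarising can be sketched in a few lines of Python. The fold counts are invented for illustration; the point is only that pooling all predictions gives one number, while per-fold values give a mean and a spread, and the two need not agree exactly.

```python
from statistics import mean, stdev

# Hypothetical (correct, total) counts for each test fold of one CV run.
folds = [(9, 10), (8, 10), (10, 10), (7, 9), (9, 9)]

# Explorer-style: pool all predictions as if they came from one test set.
pooled_accuracy = sum(c for c, _ in folds) / sum(n for _, n in folds)

# Experimenter-style: one accuracy value per fold, then mean and std dev.
per_fold = [c / n for c, n in folds]

print(f"pooled: {pooled_accuracy:.4f}")
print(f"per-fold mean: {mean(per_fold):.4f}, std dev: {stdev(per_fold):.4f}")
```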

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16,000 products in my DB, but I know the category of only about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow; I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then applying it.
After that I train a DL4J learner with DeepMLP. (I don't really understand it all; it was the one that I thought gave me the best results.) Then I try to apply the model to the same data set.
I thought I would get a result with the predicted classes. But I'm getting a column named output_activations that appears to hold a pair of doubles. When sorting by this column I get some related data close to each other. But I was expecting to get the classes.
Here is a print of the result table; here you can see the output with the input.
In the column selection, only converted_document is included, and des_categoria is selected as the Label Column (learner node config). And in the Predictor node I checked the "Append SoftMax Predicted Label?" option.
nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm a real newbie at DM and DL. If you could give me some help to solve what I'm trying to do, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business, an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter about similarity (chapter 6);
(3) Machine Learning with R, which has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter; however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
be careful not to use any of the following without testing their impact first: N Chars Filter ("g" may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
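One possible reading of such a lookup table, sketched in Python outside KNIME: map every token seen in the labelled product names to the category it most often appears with, then let the tokens of a new name vote. This token-level design, the helper names build_lookup and predict, and the sample products are all my assumptions for illustration, not the exact method from the books.

```python
from collections import Counter, defaultdict

def build_lookup(names, categories):
    """Map each token to the category it co-occurs with most often."""
    counts = defaultdict(Counter)
    for name, category in zip(names, categories):
        for token in set(name.lower().split()):
            counts[token][category] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def predict(lookup, name, default=None):
    """Let the known tokens of a product name vote for a category."""
    votes = Counter(lookup[t] for t in name.lower().split() if t in lookup)
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical labelled products standing in for the 700 known instances.
names = ["whole milk 1l", "skim milk 1l", "rye bread", "white bread sliced"]
categories = ["dairy", "dairy", "bakery", "bakery"]

lookup = build_lookup(names, categories)
print(predict(lookup, "organic milk"))   # known token "milk" votes
print(predict(lookup, "unknown item"))   # no known token, falls back to default
```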
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
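To show what the node does conceptually, here is a small Python sketch of k-NN over product names with Levenshtein distance. It is not the KNIME implementation; knn_predict and the sample data are made up for illustration.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def knn_predict(train, name, k=3):
    """Majority category among the k training names closest to `name`."""
    nearest = sorted(train, key=lambda pair: levenshtein(name, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical labelled product names.
train = [("whole milk", "dairy"), ("skim milk", "dairy"), ("soy milk", "dairy"),
         ("rye bread", "bakery"), ("white bread", "bakery")]

print(knn_predict(train, "oat milk", k=3))
```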
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you optimize k. You can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance for each value of k instead of only a single point estimate. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives of the performance metric(s) per iteration, and finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
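For reference, Cohen's Kappa compares the observed agreement between actual and predicted labels with the agreement expected by chance from the label frequencies. A minimal Python sketch, with invented labels and a hypothetical function name:

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """Kappa = (observed - expected agreement) / (1 - expected agreement)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_freq = Counter(actual)
    predicted_freq = Counter(predicted)
    # Chance agreement: product of the marginal frequencies per class.
    expected = sum(actual_freq[c] * predicted_freq[c] for c in actual_freq) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical labels: one of six predictions is wrong.
actual = ["a", "a", "a", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "b", "c"]

kappa = cohens_kappa(actual, predicted)
print(f"kappa = {kappa:.4f}")
```

Unlike plain accuracy, the value drops towards 0 when a model merely reproduces the class frequencies, which matters when there are many unevenly sized categories.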
Don't forget to test its performance against the lookup table.
What next?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should either of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, the training set may be too small, so you could manually classify another few hundred or thousand instances.
If, after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes, but you'll find the resources above useful for that purpose.
One last note. Personally, I find that KNIME's Naive Bayes node performs poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
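To illustrate why smoothing matters, here is a rough Python sketch of a multinomial Naive Bayes over a bag of words with Laplace (add-alpha) smoothing. It is neither KNIME's node nor e1071; the function names and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, doc, alpha=1.0):
    """Pick the class with the highest smoothed log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in doc.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no information
            # Laplace smoothing: unseen (class, word) pairs get a small
            # non-zero probability instead of zeroing out the whole class.
            score += math.log((word_counts[label][word] + alpha)
                              / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled product names.
docs = ["whole milk", "skim milk", "rye bread", "white bread"]
labels = ["dairy", "dairy", "bakery", "bakery"]
model = train_nb(docs, labels)

print(predict_nb(model, "rye milk"))
```

Without smoothing, "rye milk" would get probability zero for both classes, because each class has never seen one of the two words.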

Classification accuracy on Weka

I am using the Weka GUI for classification. I am new to Weka and am getting confused by the options
Use training set
Supplied test set
Cross-validation
for training my classification algorithm (for example J48). I trained with 10-fold cross-validation and the accuracy is pretty good (97%). When I test my classifier, the accuracy drops to about 72%. I am so confused. Any tips please? This is how I did it:
I train my model on the training data (for example: train.arff)
In the Results list, I right-click the model I want to save
I select Save model and save it, for example as j48tree.model
and then
I load the test data (for example: test.arff) via the Supplied test set button
I right-click in the Results list, select Load model and choose j48tree.model
I select Re-evaluate model on current test set
Is the way I do it wrong? Why is the accuracy dropping so badly, from 97% to 72%? Or is doing only the 10-fold cross-validation enough to train and test the classifier?
Note: my training and testing datasets have the same attributes and labels. The only difference is, I have more data on the testing set which I don't think will be a problem.
I don't think there is any issue with how you use WEKA.
You mentioned that your test set is larger than the training set? What is the split? The usual rule of thumb is that the test set should be about 1/4 of the whole dataset, i.e. 3 times smaller than the training set, and definitely not larger. This alone could explain the drop from 97% to 72%, which by the way is not so bad for a real-life case.
It would also be helpful to build the learning curve (https://weka.wikispaces.com/Learning+curves), as it will show whether you have a bias or variance issue. Judging by your values, it sounds like you have high variance (i.e. too many parameters for your dataset), so adding more examples or changing your split between training and test set will likely help.
Update
I ran a quick analysis of the dataset in question with a random forest, and my performance was similar to the one posted by the author. Details and code are available at http://omdv.github.io/2016/03/10/WEKA-stackoverflow

How to output resultant documents from Weka text-classification

So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new arff file to use as our training set. We repeat this process with another set of 5k tweets and re-evaluate the test set using the same model derived from our training set.
What we would like to do is to output each sentence that Weka classified in the test set along with its classification. We can see the general information (precision, recall, F-score) on the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka based on our classifier. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this however as:
All of the data we have been working with has been classified manually, both the training and test sets...
however the data we will be getting from the professor will be UNclassified... How can we reevaluate our model on the unclassified data if Weka requires that the attribute information must be the same as the set used to form the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You will always keep the original training set (unprocessed text), and you apply the classifier to new (unprocessed) tweets using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
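The reason this works can be shown with a tiny Python sketch of the underlying idea (this is not Weka's API; fit_vocabulary and vectorize are hypothetical helper names). The vocabulary is derived once from the training text, and every new tweet is mapped onto those same attributes, so the model always sees vectors of the same shape.

```python
def fit_vocabulary(texts):
    """Derive a word -> column index mapping from the TRAINING text only."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def vectorize(text, vocab):
    """Count words using the fixed training vocabulary; unseen words are dropped."""
    vec = [0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

# Hypothetical training tweets and one new, unlabelled tweet.
vocab = fit_vocabulary(["good day", "bad day"])
new_vector = vectorize("good good night", vocab)

print(vocab)       # column order fixed by the training data
print(new_vector)  # same length as the vocabulary, "night" is ignored
```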

Weka: Results of each fold in 10-fold CV

For Weka Explorer (GUI), when we do a 10-fold CV for any given ARFF file, then what Weka Explorer provides (as far as I can see) is the average result for all the 10 folds.
Q. Is there any way to get the results of each fold? For instance, I need the error rates (incorrectly identified instances) for each fold.
Help appreciated.
I think this is possible using Weka's GUI. You need to use the Experimenter though instead of the Explorer. Here are the steps:
Open the Experimenter from the GUI Chooser
Create a new experiment (New button, top-right)
[optional] Enter a filename and location in the Results Destination to save the results to
Set the Number of (cross-validation) folds to your liking (start experimenting with 2 folds for easy results)
Add your dataset (if your dataset needs preprocessing then you should do this in the Explorer first and then save the preprocessed dataset)
Set the Number of repetitions (I recommend 1 to start off with)
Add the algorithm(s) you want to test (again start easy, start with one algorithm)
Go to the Run tab and Start the experiment and wait till it finishes
Go to the Analyse tab and import the experiment results by clicking Experiment (top-right)
For Row select: Fold
For Column select: Percent_incorrect or Number_incorrect (or any other measure you want to see)
You now see the specified results for each fold
Weka Explorer does not have an option to give the results for individual folds when using the cross-validation option, but there are some workarounds. If you explicitly don't want to change any code, you need to do some manual fiddling, but I think this gives more or less what you want:
Instead of Cross-validation, select Percentage split and set it to 90%
Start classifier
Click More options... and change the Random seed for XVal / % Split value to something you haven't used before.
Repeat ten times.
This is not exactly equivalent to 10-fold crossvalidation though, since the pseudo-folds you make this way might overlap.
An alternative that is equivalent to crossvalidation, but more cumbersome, would be to make 10 folds manually by using the unsupervised instance filter RemoveFolds or RemoveRange.
Generate and save 10 training sets and 10 test sets. Then for every fold, load the training set, select Supplied test set in the classify tab, and select the appropriate test fold.
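The manual fold procedure can be sketched outside Weka in plain Python. The folds below are disjoint and together cover every instance, which is what makes this equivalent to cross-validation. The majority-class baseline only stands in for a real classifier, and the function names and labels are made up for illustration.

```python
import random
from collections import Counter

def make_folds(n_instances, k=10, seed=1):
    """Shuffle instance indices and deal them into k disjoint test folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def per_fold_error(labels, folds):
    """Error rate per fold of a majority-class baseline (stand-in classifier)."""
    errors = []
    for test_idx in folds:
        test_set = set(test_idx)
        # Train on everything that is NOT in the current test fold.
        train_labels = [y for i, y in enumerate(labels) if i not in test_set]
        majority = Counter(train_labels).most_common(1)[0][0]
        wrong = sum(labels[i] != majority for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors

# Hypothetical dataset: 70 "yes" and 30 "no" instances.
labels = ["yes"] * 70 + ["no"] * 30
folds = make_folds(len(labels), k=10)
errors = per_fold_error(labels, folds)

for fold_number, error in enumerate(errors, 1):
    print(f"fold {fold_number}: error rate {error:.2f}")
```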

Resources