I am using the GUI version of WEKA and I am classifying with Random Forest. I'm trying to find out which instances are misclassified.
I know that earlier versions of WEKA had an option called "Output additional attributes" where I could add an instance ID and get around this problem, but now with WEKA 3.8 I can't see this option.
Answering my own question: on the Preprocess tab you need to apply the AddID filter (or add your own ID attribute as a string). Then use the FilteredClassifier: click on it, set its filter to Remove with the index of the attribute that holds the ID, choose your base classifier (here Random Forest), and start the classification.
To see the misclassified instances, right-click on the result in the Result List, choose Visualize classifier errors, then save the data as an ARFF file. The saved file lists all the instances together with their predictions.
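For completeness, the same workflow can be scripted with the Weka Java API. The following is only a minimal sketch; the file name "mydata.arff" and the position of the class attribute are assumptions. It adds an ID with the AddID filter, wraps Random Forest in a FilteredClassifier that removes the ID before training, and prints the ID of every misclassified training instance.

```java
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddID;
import weka.filters.unsupervised.attribute.Remove;

public class MisclassifiedWithIds {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- replace with your own dataset.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1); // assumes the class is the last attribute

        // Add an instance ID as the first attribute (same as the AddID filter on the Preprocess tab).
        AddID addId = new AddID();
        addId.setInputFormat(data);
        Instances dataWithId = Filter.useFilter(data, addId);
        dataWithId.setClassIndex(dataWithId.numAttributes() - 1); // class is still the last attribute

        // FilteredClassifier: remove the ID (attribute 1) before Random Forest sees the data.
        Remove removeId = new Remove();
        removeId.setAttributeIndices("1");
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(removeId);
        fc.setClassifier(new RandomForest());
        fc.buildClassifier(dataWithId);

        // Report the training instances the model gets wrong, identified by their ID.
        for (int i = 0; i < dataWithId.numInstances(); i++) {
            double predicted = fc.classifyInstance(dataWithId.instance(i));
            double actual = dataWithId.instance(i).classValue();
            if (predicted != actual) {
                System.out.println("Misclassified ID " + dataWithId.instance(i).value(0)
                        + ": predicted " + dataWithId.classAttribute().value((int) predicted)
                        + ", actual " + dataWithId.classAttribute().value((int) actual));
            }
        }
    }
}
```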
Related
I can’t seem to apply the ID3 classification algorithm to the Mushroom.arff dataset. This dataset consists of nominal attributes only. I think I need to preprocess it in order for this to work, but I don’t know how. How do I proceed?
The ID3 algorithm is an unpruned decision tree generation algorithm with the following properties:
It can only deal with nominal attributes.
It fails to handle missing values.
Empty leaves may result in unclassified instances.
The Mushroom dataset consists of 22 nominal attributes, so it satisfies the first condition. However, upon inspection you’ll find that the attribute 'stalk-root' has 2480 (31%) missing values. This is why ID3 is unselectable in Weka by default when you try to classify.
To fix this, you can proceed with either of these two solutions.
You may remove the attribute.
Open the .arff file in the Preprocess tab, tick the stalk-root attribute in the Attributes list, and click Remove.
You’ll now see that ID3 is available. I was able to get an F-score of 1.0.
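If you prefer to script this step rather than click through the Explorer, a minimal sketch with the Weka Java API could look like the following; the file name is an assumption, and note that in Weka 3.8 the Id3 classifier itself is provided by the optional simpleEducationalLearningSchemes package, so the sketch only prepares the data.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropStalkRoot {
    public static void main(String[] args) throws Exception {
        // Assumed path to the Mushroom dataset.
        Instances data = DataSource.read("mushroom.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Remove the 'stalk-root' attribute by its index (Remove uses 1-based indices).
        Remove remove = new Remove();
        int stalkRoot = data.attribute("stalk-root").index() + 1;
        remove.setAttributeIndices(String.valueOf(stalkRoot));
        remove.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, remove);

        // 'cleaned' now has no missing values and can be fed to Id3
        // (available via the simpleEducationalLearningSchemes package in Weka 3.8).
        System.out.println("Attributes after removal: " + cleaned.numAttributes());
    }
}
```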
You may use techniques to handle missing values.
In situations where you do not want to lose information (in this case the “stalk-root” attribute), you can use one of these techniques:
Use a measure of central tendency for the attribute, such as the mean or median (or, for nominal attributes like this one, the mode), to replace the missing values.
Use the attribute mean or median for all samples belonging to the same class as the given tuple.
Fill in the missing value with the most probable value, using inference-based tools such as a Bayesian formalism.
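For the first technique, Weka ships the unsupervised ReplaceMissingValues filter, which substitutes the mean for numeric attributes and the mode for nominal ones (the relevant case for stalk-root). A minimal sketch, with the file name assumed:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FillStalkRoot {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mushroom.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);

        // Replace missing nominal values with the attribute's mode
        // (and missing numeric values with the mean).
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(data);
        Instances filled = Filter.useFilter(data, fill);

        System.out.println("Remaining missing values for stalk-root: "
                + filled.attributeStats(filled.attribute("stalk-root").index()).missingCount);
    }
}
```

As far as I know, the class-conditional and Bayesian variants are not available as ready-made filters, so they would need some extra code.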
The instances in my dataset have multiple numeric attributes and a binary class. In Weka is there a way to use a clusterer and pass the result to a classifier (say SMO) to improve the results of classification?
One way to add cluster information to your data is the following method (in the Weka Explorer):
Load your Favourite Dataset
Choose your Cluster Model (In my case, I used SimpleKMeans)
Modify the Parameters of the Clusterer as Required
Select 'Use training set' as the Cluster mode
Start the Clustering Process
Once the Clusters have been generated, Right-Click on the Result List and select 'Visualize Cluster Assignments'
Select Y to be the Cluster, then hit the Save button
Save the Data to a nominated location.
You should then be able to load this file and use the cluster information in your classifier just like any other attribute. Just make sure that the Class is set to the right attribute and you should be good to go.
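If you would rather script this than use the GUI, here is a rough sketch of the same idea with the Weka Java API; the file name, the assumption that the class is the last attribute, and the choice of 2 clusters are all placeholders. It clusters on the predictor attributes only and prints each instance's cluster next to its class, which is essentially what Visualize Cluster Assignments followed by Save gives you.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterAssignments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // assumed file name
        int classIdx = data.numAttributes() - 1;         // assumes the class is the last attribute

        // Cluster on the predictor attributes only: drop the class column first,
        // mirroring the way the Cluster tab ignores the class attribute.
        Remove dropClass = new Remove();
        dropClass.setAttributeIndices(String.valueOf(classIdx + 1)); // 1-based index
        dropClass.setInputFormat(data);
        Instances predictorsOnly = Filter.useFilter(data, dropClass);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2); // placeholder; tune for your data
        kmeans.setSeed(42);
        kmeans.buildClusterer(predictorsOnly);

        // Print each instance's cluster next to its class label.
        for (int i = 0; i < data.numInstances(); i++) {
            int cluster = kmeans.clusterInstance(predictorsOnly.instance(i));
            System.out.println("instance " + i + " -> cluster" + cluster
                    + " (class = " + data.instance(i).stringValue(classIdx) + ")");
        }
    }
}
```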
NOTE: When I ran these tests I used J48 to evaluate the class, and it seemed that J48 used only the values of the clusters to estimate the class. The accuracy of the model was also surprisingly high, so either the dataset was too simple or I may have missed a step somewhere in the clustering process.
Hope this Helps!
In the Weka Explorer, after loading your dataset:
choose the Preprocess tab,
click "Choose..." Button,
add the unsupervised-attribute-filter "AddCluster".
click next to button, to open the Clusterer Selection field, choose a clusterer,
configure/parameterize the clusterer
close all modal dialog boxes
Click "Apply" button to apply the filter. It will add another attribute called "cluster" as the rightmost one in your attribute list.
Then continue with your classification experiments.
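The AddCluster route can also be scripted. Below is a rough sketch with the Weka Java API; the file name, the number of clusters, the use of SimpleKMeans and SMO, and the assumption that the class is the last attribute are all placeholders, so treat it as an illustration rather than the exact Explorer behaviour.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // assumed file name
        // Do not set the class index yet: like in the Preprocess tab, the filter
        // is applied before any class attribute is selected.

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3); // assumed; tune for your data

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(kmeans);
        addCluster.setIgnoredAttributeIndices("last"); // keep the class column out of the clustering
        addCluster.setInputFormat(data);
        Instances withCluster = Filter.useFilter(data, addCluster);
        // 'withCluster' now has an extra nominal attribute "cluster" as the rightmost attribute.

        // The original class (previously last) is now second to last.
        withCluster.setClassIndex(withCluster.numAttributes() - 2);
        Evaluation eval = new Evaluation(withCluster);
        eval.crossValidateModel(new SMO(), withCluster, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```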
So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and re-evaluate the test set using the same model derived from our training set.
What we would like to do is to output each sentence that Weka classified in the test set along with its classification... We can see the general information (precision, recall, F-score) about the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka based on our classifier... Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this, however, as all of the data we have been working with has been classified manually, both the training and test sets, whereas the data we will be getting from the professor will be UNclassified. How can we evaluate our model on the unclassified data if Weka requires that the attribute information be the same in the set used to form the model and in the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text) and apply the classifier to new, unprocessed tweets by using the vocabulary derived by the StringToWordVector filter from the training data.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
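To make this concrete, here is a rough Java sketch in the spirit of those tutorials. The file names are placeholders, NaiveBayesMultinomial stands in for the multinomial naive Bayes mentioned in the question, and the new tweets are assumed to be in an ARFF file with the same header as the training data but with '?' as the class value. It prints every tweet with its predicted label, which covers both parts of the question (per-instance output and classifying unlabelled data).

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TweetClassifier {
    public static void main(String[] args) throws Exception {
        // Training tweets: a string attribute with the text plus a nominal class.
        Instances train = DataSource.read("tweets_train.arff");   // assumed file name
        train.setClassIndex(train.numAttributes() - 1);

        // New tweets with the same header; the class values may simply be '?'.
        Instances unlabeled = DataSource.read("tweets_new.arff"); // assumed file name
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        // The FilteredClassifier applies StringToWordVector internally, so both
        // the training and the new data stay in their raw text representation.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        // Print each tweet with its predicted label.
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double pred = fc.classifyInstance(unlabeled.instance(i));
            System.out.println(unlabeled.instance(i).stringValue(0)   // assumes the text is attribute 0
                    + " -> " + unlabeled.classAttribute().value((int) pred));
        }
    }
}
```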
In Weka I can go to the Experimenter. In the Setup tab I can load an .arff file and get Weka to create a classifier (e.g. J48), then I can run it, and finally I can go to the Analyse tab. This tab gives me the option of testing with a Paired T-Test, but I cannot figure out how to create a second classifier (e.g. J48 unpruned) and do a T-Test on the two results.
Google does not lead me to any tutorial or answers.
How can I get Weka to do a T-Test on the results of two different classifiers, made from the same data?
Please follow the steps in http://fiji.sc/Advanced_Weka_Segmentation_-_How_to_compare_classifiers.
In the screenshots there, the author sets the significance level to 0.05. In my understanding, in such a test you always compare against a baseline classifier (there it is NaiveBayes); the output uses the annotation v or * to indicate that a specific result is statistically better (v) or worse (*) than the baseline scheme at the specified significance level (here 0.05). The result might not be the one you expected, though.
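If you want to see what such a comparison involves outside the Experimenter, the sketch below cross-validates J48 pruned and unpruned on the same folds and computes an ordinary paired t statistic over the per-fold error rates. Note that this is not the corrected resampled t-test that the Experimenter's Paired T-Tester (corrected) uses, and the file name and fold count are assumptions, so treat it purely as an illustration.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PairedTTestSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        int folds = 10;
        data.randomize(new Random(1));
        data.stratify(folds);

        double[] diffs = new double[folds];
        for (int i = 0; i < folds; i++) {
            Instances train = data.trainCV(folds, i, new Random(1));
            Instances test = data.testCV(folds, i);

            J48 pruned = new J48();        // default J48 is pruned
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);    // second scheme: unpruned J48

            pruned.buildClassifier(train);
            unpruned.buildClassifier(train);

            Evaluation evalA = new Evaluation(train);
            evalA.evaluateModel(pruned, test);
            Evaluation evalB = new Evaluation(train);
            evalB.evaluateModel(unpruned, test);

            diffs[i] = evalA.errorRate() - evalB.errorRate(); // per-fold difference
        }

        // Plain paired t statistic: mean difference over its standard error.
        double mean = 0;
        for (double d : diffs) mean += d;
        mean /= folds;
        double var = 0;
        for (double d : diffs) var += (d - mean) * (d - mean);
        var /= (folds - 1);
        double t = mean / Math.sqrt(var / folds);
        System.out.println("paired t statistic (df=" + (folds - 1) + "): " + t);
        // Compare |t| against the critical value at your chosen significance level (e.g. 0.05).
    }
}
```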
In the Weka Explorer (GUI), when we do a 10-fold CV on any given ARFF file, what the Explorer provides (as far as I can see) is the average result over all 10 folds.
Q. Is there any way to get the results of each fold? For instance, I need the error rates (incorrectly identified instances) for each fold.
Help appreciated.
I think this is possible using Weka's GUI. You need to use the Experimenter though instead of the Explorer. Here are the steps:
Open the Experimenter from the GUI Chooser
Create a new experiment (New button at the top-right)
[optional] Enter a filename and location in the Results Destination to save the results to
Set the Number of (cross-validation) folds to your liking (start experimenting with 2 folds for easy results)
Add your dataset (if your dataset needs preprocessing then you should do this in the Explorer first and then save the preprocessed dataset)
Set the Number of repetitions (I recommend 1 to start off with)
Add the algorithm(s) you want to test (again start easy, start with one algorithm)
Go to the Run tab and Start the experiment and wait till it finishes
Go to the Analyse tab and import the experiment results by clicking Experiment (top-right)
For Row select: Fold
For Column select: Percent_incorrect or Number_incorrect (or any other measure you want to see)
You now see the specified results for each fold
The Weka Explorer does not have an option to give the results for individual folds when using the cross-validation option, but there are some workarounds. If you explicitly don't want to write any code, you need to do some manual fiddling, but I think the following gives more or less what you want:
Instead of Cross-validation, select Percentage split and set it to 90%
Start classifier
Click More options... and change the Random seed for XVal / % Split value to something you haven't used before.
Repeat ten times.
This is not exactly equivalent to 10-fold cross-validation though, since the pseudo-folds you make this way might overlap.
An alternative that is equivalent to cross-validation, but more cumbersome, would be to make 10 folds manually by using the unsupervised instance filter RemoveFolds or RemoveRange.
Generate and save 10 training sets and 10 test sets. Then for every fold, load the training set, select Supplied test set in the classify tab, and select the appropriate test fold.
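If you end up preferring code over manual fiddling after all, a rough sketch of the RemoveFolds approach with the Weka Java API could look like this; the file name and the choice of J48 are assumptions. Without -V the filter outputs the selected fold (the test set), and with -V it outputs everything else (the training set).

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveFolds;

public class PerFoldErrors {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42)); // shuffle once, as the Explorer's CV also randomizes

        int folds = 10;
        for (int f = 1; f <= folds; f++) {
            // Training set: everything except fold f (the -V option inverts the selection).
            RemoveFolds keepTrain = new RemoveFolds();
            keepTrain.setOptions(Utils.splitOptions("-N " + folds + " -F " + f + " -V"));
            keepTrain.setInputFormat(data);
            Instances train = Filter.useFilter(data, keepTrain);

            // Test set: fold f only.
            RemoveFolds keepTest = new RemoveFolds();
            keepTest.setOptions(Utils.splitOptions("-N " + folds + " -F " + f));
            keepTest.setInputFormat(data);
            Instances test = Filter.useFilter(data, keepTest);

            J48 tree = new J48(); // any classifier will do here
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println("fold " + f + ": " + (int) eval.incorrect()
                    + " incorrectly classified (" + (100 * eval.errorRate()) + "%)");
        }
    }
}
```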