Extract SVM assigned values against each instance in WEKA - machine-learning

Is there any way to extract, after training an SVM model, the value the model assigns to each instance when deciding whether it belongs to the positive or the negative class? I am looking for a way to get all of these SVM-assigned values for each instance in the WEKA tool.
I have been using the LibSVM and LibLinear classifiers. I need those values for ranking.

Click the Preprocess tab, then under Filter click the "Choose" button.
Select weka / filters / supervised / attribute / AddClassification.
In its configuration dialog, set "outputClassification" to "True".
Click on the classifier label to open a second dialog box and configure LibSVM as the classifier.
Click Apply.
A new attribute, "classification", will be added to your dataset. Note that this does not perform cross-validation: it uses the entire dataset for training, so the resulting predictions will be overly optimistic.
Alternative (for predictions on cross-validated output): go to the Classify tab, click the "More options..." button, and under "Output predictions" choose "PlainText"; the per-instance predictions will then appear in the "Classifier output" text panel.
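The value used for ranking here is the signed distance of each instance to the SVM decision boundary. As a point of comparison outside Weka, here is a minimal scikit-learn sketch of the same idea (the dataset is synthetic, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for a real ARFF dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = LinearSVC(random_state=0).fit(X, y)

# One signed value per instance: the distance to the separating hyperplane.
# Positive values favour class 1, negative values favour class 0.
scores = clf.decision_function(X)

# Rank instances from most confidently positive to most confidently negative
ranking = np.argsort(-scores)
```

LibSVM's probability estimates would give a comparable per-instance score; the raw decision value is usually what you want for ranking, since it preserves the margin ordering.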

Related

How to perform classification on training and test dataset in Weka

I am using the Weka software to build a classification model, and I am confused about the training/testing dataset partition. I split the whole dataset 60/40, saving 60% to my hard disk as the training set and the remaining 40% to another file as the test set. The data I am using is imbalanced, so I applied SMOTE to my training set. After that, in Weka's Classify tab I selected the "Use training set" test option and ran the Random Forest classifier on the training set. After getting the result, I chose the "Supplied test set" option, loaded my test set from disk, and ran the classifier again.
I tried to find a tutorial on how to load training and test sets in Weka but could not find one, so I did the above based on my own understanding.
Therefore, I would like to know: is this the right way to perform classification on training and test datasets?
Thank you.
There is no need to evaluate your classifier on the training set (this will be overly optimistic, since the classifier has already seen this data). Just use the Supplied test set option, then your classifier will get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap the filter (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you ensure that the training and test data get transformed correctly. This also avoids leaking information into the test set, which happens when you transform the full dataset with a supervised filter and only split it into train/test afterwards. Finally, it documents nicely what preprocessing is applied to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.
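Weka's FilteredClassifier is the same idea as a pipeline in other toolkits: the preprocessing step is refit on each training fold, so nothing leaks into the test data. A scikit-learn sketch of the pattern (SMOTE itself lives in the separate imbalanced-learn package, so a plain scaler stands in for the filter here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, random_state=0)

# Wrapping filter + classifier together means the filter is fit on each
# training fold only, never on the corresponding test fold.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
scores = cross_val_score(pipe, X, y, cv=5)
```

Fitting the filter on the full dataset before splitting would be the leakage the answer warns about; wrapping prevents it by construction.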

I don't understand how parameter sweep is done in the Azure Machine Learning?

In Azure ML, if we select a training algorithm (for example, "Two-Class Logistic Regression"), we can then specify a set of parameters for a parameter sweep during training. But how can I find out how the parameter values are varied during training?
In Azure ML Studio, go to Experiments -> Samples and find "Model Parameter Optimization : Sweep parameters" experiment. This sample shows how parameter sweeping works. Basically:
In the algorithm module ("Two-Class Support Vector Machine"), set "Create trainer mode" to "Parameter Range" and specify the range of parameters you want to sweep over. This can be either a min/max range or a comma-separated list of values such as 1,2,4,8.
In Tune Model Hyperparameters, specify the sweeping strategy "Entire Grid" (expensive), "Random Sweep" (random points within min,max range) or "Random Grid" (random sampling of points from a grid).
The left output of Tune Model Hyperparameters should show a table of metrics for each combination of parameters that was swept over. The right output should contain the best model given the metric you selected.
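The three strategies map onto standard sweep patterns. As an illustration only (using scikit-learn's sweep helpers, not the Azure ML API; the parameter names are made up):

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Discrete ranges, e.g. the comma-separated lists like 1,2,4,8 from the dialog
grid = {"C": [1, 2, 4, 8], "gamma": [0.1, 1.0]}

# "Entire Grid": every combination is tried (4 * 2 = 8 candidates here)
entire_grid = list(ParameterGrid(grid))

# "Random Grid": a random subset of points is sampled from the same grid
random_grid = list(ParameterSampler(grid, n_iter=3, random_state=0))

# "Random Sweep" is the same sampling idea applied to continuous min/max
# ranges (e.g. scipy.stats distributions) instead of fixed grid points.
```

Each candidate parameter set is then used to train and evaluate one model, which is where the per-combination metrics table on the left output comes from.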
Hope this helps,
Roope

Extract WHY label was chosen on classification?

I currently have a system set up where I train on old posts/categories and try to predict which category a new post belongs to. I use a pipeline with TfidfVectorizer and LinearSVC to train on the dataset and store the fitted model in a pickle; I then process new posts by loading that pickle and calling predict on it to classify them. Currently, I am struggling with a few labels and I don't know why.
I am looking to provide some output on what words were triggered in the new post for each classification label so that I can see why a certain label was chosen when classifying new data against a training set, but I cannot find a way to do this.
I know that I can output the top features in my vectorizer when I am training, but how can I output essentially the reason why a certain label was chosen over another one?
During the training phase, the SVM learns a weight for each word of the corpus vocabulary, for each class.
Then, during inference, it calculates the dot product between the class weights and the vector representation of the instance to be classified, and returns the class that yields the highest dot product score. Hence, you can get an idea of why a label was chosen by examining those weights (the coef_ attribute) against your instance's features.
I agree however that other methods like trees are more interpretable.

Clustering before classification in Weka

The instances in my dataset have multiple numeric attributes and a binary class. In Weka is there a way to use a clusterer and pass the result to a classifier (say SMO) to improve the results of classification?
One way that you could add cluster information to your data is using the below method (in Weka Explorer):
Load your Favourite Dataset
Choose your Cluster Model (In my case, I used SimpleKMeans)
Modify the Parameters of the Clusterer as Required
Select "Use training set" as the cluster mode
Start the Clustering Process
Once the Clusters have been generated, Right-Click on the Result List and select 'Visualize Cluster Assignments'
Select Y to be the cluster attribute, then hit the Save button.
Save the Data to a nominated location.
You should then be able to load this file and use the cluster information in your classifier just like any other attribute. Just make sure that the class is set to the right attribute and you should be good to go.
NOTE: When I ran these tests I used J48 to evaluate the class, and it seemed that J48 used only the cluster values to estimate the class. The accuracy of the model was also surprisingly high, so either the dataset was too simple or I may have missed a step somewhere in the clustering process.
Hope this Helps!
In Weka Explorer, after loading your dataset
choose the Preprocess tab,
click "Choose..." Button,
select the unsupervised attribute filter "AddCluster",
click the field next to the button to open the clusterer selection, choose a clusterer,
configure/parameterize the clusterer
close all modal dialog boxes
Click "Apply" button to apply the filter. It will add another attribute called "cluster" as the rightmost one in your attribute list.
Then continue with your classification experiments.
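Both recipes above amount to appending the cluster assignment as one extra attribute before classification. The same idea as a scikit-learn sketch (synthetic data; SVC stands in for Weka's SMO):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the clusterer on the training set only, then append each instance's
# cluster id as an extra attribute (Weka's AddCluster does the equivalent).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
X_tr_aug = np.column_stack([X_tr, km.predict(X_tr)])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])

clf = SVC().fit(X_tr_aug, y_tr)  # SVC here plays the role of SMO
accuracy = clf.score(X_te_aug, y_te)
```

Whether the extra attribute actually helps depends on how well the cluster structure aligns with the class boundary, which is worth checking against a baseline without it.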

How to output resultant documents from Weka text-classification

So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break each tweet into a vector of word features using Weka's StringToWordVector filter. We then save the results to a new ARFF file to use as our training set. We repeat this process with another set of 5k tweets and evaluate that test set using the model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification. We can see the general performance information (precision, recall, F-score) for the algorithm, but we cannot see the individual sentences that our classifier labelled. Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this, however, as all of the data we have worked with so far was classified manually, both the training and test sets, while the data we will get from the professor will be unclassified. How can we evaluate our model on the unclassified data if Weka requires that the attribute information be the same as in the set used to build the model?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter to the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text), and the classifier can be applied to new, unprocessed tweets using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".
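For comparison, the same pattern expressed in scikit-learn terms (tiny invented tweets; CountVectorizer plays the StringToWordVector role, and the pipeline keeps raw text as its input so unlabeled tweets can be classified directly):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = ["great game today", "election results are in",
         "the team won again", "new policy announced"]
train_labels = ["sports", "politics", "sports", "politics"]

# Vectorizer + classifier are wrapped together, so the vocabulary learned on
# the training tweets is reused automatically; words never seen in training
# are simply ignored, which answers the attribute-mismatch worry.
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train, train_labels)

# New, unlabeled tweets: print each one next to its predicted class
new_tweets = ["the game was great", "results of the election"]
for tweet, label in zip(new_tweets, model.predict(new_tweets)):
    print(f"{label}\t{tweet}")
```

The per-instance loop at the end is the analogue of Weka's "Output predictions" option: one line per classified sentence.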
