Test single instance in weka which has no class label - machine-learning

This question has already been asked, but I didn't understand the answer, so I am posting it again. Please do reply.
I have a Weka model (e.g. J48) that I have trained on my dataset, and now I have to test the model with a single instance, for which it should return the class label. How do I do it?
I have tried these ways:
1) When I give my test instance as a,b,c,? (with ? as the class value), it shows "Problem evaluating classifier: train and test are not compatible".
2) When I list all the class labels and put ? as the class label of the test instance, like this:
@attribute class {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27}
@data
1,2,............,?
it shows no results, only this:
=== Evaluation on test set ===
=== Summary ===
Total Number of Instances 0
Ignored Class Unknown Instances 1
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0 0 0 0 0 ? 1
0 0 0 0 0 ? 2
0 0 0 0 0 ? 3
Weighted Avg. NaN NaN NaN NaN NaN NaN
confusion matrix is null
What to do?

Given the incomplete information from the OP, here is what probably happened:
You used
the Weka GUI Chooser
selected the Weka Explorer
loaded your training data on the Preprocess tab
selected the Classify tab
selected the J48 classifier
selected Supplied test set under test options and supplied your aforementioned test set
clicked on Start
Now to your problem:
"Evaluation on test set" should have given it away: you are evaluating the classifier, or better, the trained model. But for evaluation, Weka needs to compare the predicted class with the actual class, which you didn't supply. Hence, the instance with the missing class label is ignored.
Since you don't have any other test instances WITH a class label, the confusion matrix is empty. There simply is not enough information available to build one. (And just as a side note: a confusion matrix for only one instance is kinda worthless anyway.)
To see the actual prediction
You have to go to More options ..., click on Choose next to Output predictions and select an output format, e.g. PlainText, and you will see something like:
inst# actual predicted error prediction
1 1:? 1:0 0.757
2 1:? 1:0 0.824
3 1:? 1:0 0.807
4 1:? 1:0 0.807
5 1:? 1:0 0.79
6 1:? 2:1 0.661
This output lists the classified instances in the order they occur in the test file. The example was taken from the Weka documentation on "Making predictions", which gives the following explanation:
In this case, taken directly from a test dataset where all class
attributes were marked by "?", the "actual" column, which can be
ignored, simply states that each class belongs to an unknown class.
The "predicted" column shows that instances 1 through 5 are predicted
to be of class 1, whose value is 0, and instance 6 is predicted to be
of class 2, whose value is 1. The error field is empty; if predictions
were being performed on a labeled test set, each instance where the
prediction failed to match the label would contain a "+". The
probability that instance 1 actually belongs to class 0 is estimated
at 0.757.
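To make the distinction concrete, here is a hedged sketch of the same idea in scikit-learn rather than the Weka GUI (DecisionTreeClassifier stands in for J48, and the toy data is mine): `predict()` never looks at the class column, so a single unlabeled instance can be classified just fine; only *evaluation* needs actual labels.

```python
from sklearn.tree import DecisionTreeClassifier

# toy training data: three features plus a class label
X_train = [[1, 2, 3], [2, 3, 4], [8, 9, 10], [9, 10, 11]]
y_train = [1, 1, 2, 2]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# a single test instance WITHOUT a class label: prediction still works,
# we just cannot compute accuracy or a confusion matrix from it
x_test = [[2, 3, 5]]
print(model.predict(x_test)[0])        # predicted class label
print(model.predict_proba(x_test)[0])  # class membership probabilities
```

The probabilities printed here correspond to the "prediction" column of the Weka output above.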

Related

Binary classification: why use +1/0 as labels? What's the difference between +1/-1 or even +100/-100?

In binary classification problems, we usually use +1 for the positive label and 0 for the negative label. Why is that? In particular, why use 0 rather than -1 for the negative label?
What's the difference if we use -1 for the negative label? Or, more generally, can we use +100 for the positive label and -100 for the negative label?
As the name suggests, labeling is used to differentiate the classes. You can use 0/1, +1/-1, cat/dog, etc. (any name that fits your problem).
For example:
If you want to distinguish between cat and dog images, then use cat and dog labels.
If you want to detect spam, then labels will be spam/genuine.
However, because ML algorithms mostly work with numbers, labels are transformed into a numeric format before training.
Using labels of 0 and 1 arises naturally from some of the historically first methods used for binary classification. For example, logistic regression directly models the probability of an event happening, the event in this case being that an object belongs to the positive or the negative class. When we use training data with labels 0 and 1, it basically means that objects with label 0 have probability 0 of belonging to the given class, and objects with label 1 have probability 1 of belonging to it. For spam classification, emails that are not spam would have label 0, meaning they have probability 0 of being spam, and emails that are spam would have label 1, because their probability of being spam is 1.
So using labels of 0 and 1 makes perfect sense mathematically. When a binary classification model outputs e.g. 0.4 for some input, we can usually interpret this as the probability of belonging to class 1 (although strictly speaking that's not always the case, as pointed out for example here).
There are classification methods that don't make use of convenient properties of labels 0 and 1, such as support vector machines or linear discriminant analysis, but in their case no other labels would provide more convenience than 0 and 1, so using 0 and 1 is still okay.
Even the encoding of classes for multiclass classification makes use of probabilities of belonging to a given class. For example, in classification with three classes, objects from the first class would be encoded as [1 0 0], from the second class as [0 1 0], and from the third class as [0 0 1], which again can be interpreted as probabilities. (This is called one-hot encoding.) The output of a multiclass classification model is often a vector of the form [0.1 0.6 0.3], which can conveniently be interpreted as a vector of class probabilities for the given object.
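The one-hot convention and its probabilistic reading can be shown in a few lines of plain Python (the helper name is mine, for illustration):

```python
def one_hot(label, n_classes):
    """Encode class index `label` as a probability-style indicator vector."""
    return [1 if i == label else 0 for i in range(n_classes)]

print(one_hot(0, 3))  # first class  -> [1, 0, 0]
print(one_hot(2, 3))  # third class  -> [0, 0, 1]

# A model output like [0.1, 0.6, 0.3] is read back by taking the most
# probable class:
probs = [0.1, 0.6, 0.3]
predicted = max(range(len(probs)), key=lambda i: probs[i])
print(predicted)  # -> 1 (the second class)
```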

How can I do a stratified downsampling?

I need to build a classification model for protein sequences using machine learning techniques. Each observation can be classified as either a 0 or a 1. However, I noticed that my training set contains a total of 170 000 observations, of which only 5000 are labeled as 1. Therefore, I wish to downsample the number of observations labeled as 0 to 5000.
One of the features I am currently using in the model is the length of the sequence. How can I downsample the data for my class 0 while making sure the distribution of length_sequence remains similar to the one in my class 1?
Here is the histogram of length_sequence for class 1:
Here is the histogram of length_sequence for class 0:
You can see that in both cases the lengths go from 2 to 255 characters. However, class 0 has many more observations, and they also tend to be significantly longer than the ones seen in class 1.
How can I downsample class 0 and make the new histogram look similar to the one for class 1?
I am trying to do stratified downsampling with scikit-learn, but I'm stuck.
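One plain-Python way to sketch this (the function name, bin width, and toy data are my own choices): bin the class-1 lengths, then sample class-0 rows within each bin to match the class-1 bin counts, so the downsampled class-0 length histogram mirrors the class-1 one.

```python
import random
from collections import defaultdict

def matched_downsample(lengths0, lengths1, bin_width=10, seed=0):
    """Return indices into lengths0 whose length histogram mirrors lengths1."""
    rng = random.Random(seed)
    target = defaultdict(int)   # class-1 count per length bin
    for length in lengths1:
        target[length // bin_width] += 1
    pool = defaultdict(list)    # class-0 indices grouped into the same bins
    for i, length in enumerate(lengths0):
        pool[length // bin_width].append(i)
    keep = []
    for b, n in target.items():
        # sample per bin, capped by what class 0 actually has there
        keep.extend(rng.sample(pool[b], min(n, len(pool[b]))))
    return sorted(keep)

# toy data: class 0 has mostly short sequences, class 1 a short/long mix
lengths0 = [3, 4, 5, 6, 7, 23, 24, 26, 27]
lengths1 = [5, 6, 27]
kept = matched_downsample(lengths0, lengths1)
print(kept)  # 3 indices: two from the 0-9 bin, one from the 20-29 bin
```

With real data you would use the `length_sequence` column as `lengths0`/`lengths1` and then select the class-0 rows by the returned indices.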

Convert probabilities to a score

I have a project to measure the sentiment level of a customer as 0 (happy), 1 (neutral), or 2 (unhappy) from the text data supplied in customer comments. I have trained a classifier model in TensorFlow and it predicts the sentiment level of a new document. There is no problem up to that point: I can get the classifier's prediction probabilities indicating which level a new document belongs to. After predicting which class a new document belongs to, I get probabilities like these:
Level - Probability
0 (happy) ---> 0.17
1 (neutral) ---> 0.41
2 (unhappy) ---> 0.42
This result indicates that the predicted document belongs to class 2. However, I need precise sentiment scores, not just labels. Say I divide the interval [0,1] into 3 parts, each corresponding to a label: [0-0.33], [0.33-0.66], [0.66-1]. For the above case I need a score between 0.66 and 1, and it should be closer to 0.66, something like 0.68.
Here are other examples:
EX-I:
Level - Probability
0:[0-0.33] --> 0
1:[0.33-0.66] --> 1
2:[0.66-1] --> 0
For EX-I score should be 0.5
EX-II:
Level - Probability
0:[0-0.33] --> 0.51
1:[0.33-0.66] --> 0.49
2:[0.66-1] --> 0
For EX-II the score should be less than 0.33, but very close to it.
What is the exact terminology for this in math, and is there an equation to calculate such a fuzzy score from the probabilities?
Thanks for your help.
Instead of doing classification, you should turn to regression.
During your training step, you may convert class happy to 0, class neutral to 0.5, and class unhappy to 1.
Then your TensorFlow model will predict values between 0 and 1 that correspond to what you want.
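As a complementary sketch that keeps the existing classifier: the terminology the question is after is essentially the *expected value*. Representing each interval by its midpoint (my choice of representative points) and averaging those midpoints weighted by the predicted probabilities gives a continuous score:

```python
def expected_score(probs, midpoints=(1/6, 1/2, 5/6)):
    """Probability-weighted average of the interval midpoints."""
    return sum(p * m for p, m in zip(probs, midpoints))

print(expected_score([0, 1, 0]))           # EX-I  -> exactly 0.5
print(expected_score([0.51, 0.49, 0]))     # EX-II -> ~0.33, at the boundary
print(expected_score([0.17, 0.41, 0.42]))  # original case -> ~0.58
```

Note that the expected value compresses scores toward the middle when the probabilities are spread out, which is why the original case lands below the asker's intuition of ~0.68.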

Train and Test with 'one class classifier' using Weka

Suppose I have the following train-set:
f1,f2,f3, label
1,2,3, 0
1.2,2.3,3.3, 0
1.25,2.25,3.25, 0
and I want to get the classification for the following test-set:
f1,f2,f3, label
6,7,8, ?
1.1,2.1,3.1, ?
9,10,11, ?
When I'm using Weka and OneClassClassifier, first I load the train set and classify using the "use training set" option in the test options; after that I choose the "supplied test set" option and load the above test set. The problem is that I get the same classification for all the test-set instances, and I get a warning: "train and test set are not compatible, do you want to wrap with InputMappedClassifier?". The above is just a simple example; I got the same problems with a huge dataset with injected anomalies.
What do I do wrong?
I think, since you are performing one-class classification, your test data should be as follows (the assumption here is that none of the test rows are outliers):
f1,f2,f3, label
6,7,8, 0
1.1,2.1,3.1, 0
9,10,11, 0
and if you enable predictions on test data, you may get:
=== Predictions on test set ===
inst# actual predicted error prediction
1 1:true 1:true 1
2 1:true ? ?
3 1:true 1:true 1:true
which means that, in the test data:
a) Instance 1 is not an outlier
b) Instance 2 is an outlier
c) Instance 3 is not an outlier
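For comparison, here is a hedged scikit-learn analogue of the same one-class setup (OneClassSVM instead of Weka's OneClassClassifier; the nu/gamma values are illustrative choices of mine): the model is trained on the target class alone, and `predict()` returns +1 for inliers and -1 for outliers, so the test data needs no class column at all.

```python
from sklearn.svm import OneClassSVM

# train set: only the target class, taken from the question
X_train = [[1, 2, 3], [1.2, 2.3, 3.3], [1.25, 2.25, 3.25]]
# test set: no labels needed
X_test = [[6, 7, 8], [1.1, 2.1, 3.1], [9, 10, 11]]

ocsvm = OneClassSVM(nu=0.1, gamma="scale")
ocsvm.fit(X_train)
print(ocsvm.predict(X_test))  # +1 = target class, -1 = outlier
```

Here the second test row, which sits right next to the training data, comes back as an inlier while the two distant rows come back as outliers.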

Weka - semi supervised learning - how to label data and get back the result?

I have just started to use Weka and I would like to ask you something. I have installed the collective classification package and I have simple training data:
X Y Label
--------------------
1 2 Class 1
3 2 Class 1
3 3 Unknown class
4 2 Unknown class
11 12 Unknown class
15 20 Unknown class
Is it possible to somehow get the data back from Weka labeled? Maybe I don't understand the semi-supervised method, because in my opinion it's used to label other data when I have labeled a small subset.
In my case, I would like to annotate several normal instances, have Weka label the other similar instances, and in the end detect anomalous instances.
Thank you for your advice.
My understanding is that you would like to fill your missing labels with the labels predicted by your model.
What you could do is right-click on the Model after training, then select 'Visualize Classifier Errors'. In this visualization screen, set Y as the predicted class and then save the new ARFF. This datafile should then contain the predicted and class labels.
From there, you could try to replace the Missing Values with the predicted labels.
I hope this assists in the problem that you are experiencing.
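Outside of Weka, the same idea can be sketched with scikit-learn's LabelPropagation, which uses -1 as the "unknown class" marker and writes the recovered labels into `transduction_`. (The data mirrors the question, except that I have hypothetically labeled the last row as a second class and hand-picked the kernel settings for this coordinate scale; propagation is only informative with at least two distinct known labels.)

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1, 2], [3, 2], [3, 3], [4, 2], [11, 12], [15, 20]])
y = np.array([0, 0, -1, -1, -1, 1])  # -1 marks "Unknown class"

# rbf kernel with a small gamma, chosen by hand for this toy scale
lp = LabelPropagation(kernel="rbf", gamma=0.05, max_iter=2000)
lp.fit(X, y)
print(lp.transduction_)  # labels for all six rows, unknowns filled in
```

The points near the labeled class-0 rows pick up label 0, and the far point near the class-1 row picks up label 1, which is exactly the "annotate a few, label the rest" workflow the question describes.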
