I have a project to measure the sentiment of a customer as 0 (happy), 1 (neutral), or 2 (unhappy) from the text of customer comments. I have trained a classifier model in TensorFlow, and it predicts the sentiment level of a new document; there is no problem up to that point. The classifier gives me the probability that a new document belongs to each level. After predicting which class a new document belongs to, I get probabilities like these:
Level - Probability
0 (happy) ---> 0.17
1 (neutral) ---> 0.41
2 (unhappy) ---> 0.42
This result indicates that the predicted document belongs to class 2. However, I need precise sentiment scores, not just labels. Suppose I divide the interval [0, 1] into 3 parts, each corresponding to a label: [0-0.33], [0.33-0.66], [0.66-1]. For the case above I need a score between 0.66 and 1, and it should be closer to 0.66, something like 0.68.
Other examples are shown below:
EX-I:
Level - Probability
0:[0-0.33] --> 0
1:[0.33-0.66] --> 1
2:[0.66-1] --> 0
For EX-I the score should be 0.5.
EX-II:
Level - Probability
0:[0-0.33] --> 0.51
1:[0.33-0.66] --> 0.49
2:[0.66-1] --> 0
For EX-II the score should be less than 0.33 but very close to it.
What is the exact mathematical terminology for this, and is there an equation to calculate such a fuzzy score from the probabilities?
Thanks for your help.
Instead of doing classification, you should turn to regression.
During your training step, you can convert class happy to 0, class neutral to 0.5, and class unhappy to 1.
Then your TensorFlow model will predict values between 0 and 1, which corresponds to what you want to do.
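A minimal Keras sketch of that idea (the layer sizes, feature shapes, and random toy data are placeholders, not part of the original model):

import numpy as np
import tensorflow as tf

# Hypothetical toy data: 100 documents, 20 features each, class labels 0/1/2.
X = np.random.rand(100, 20).astype("float32")
y_class = np.random.randint(0, 3, size=100)

# Map the classes to regression targets: 0 -> 0.0, 1 -> 0.5, 2 -> 1.0.
y = y_class / 2.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output bounded to [0, 1]
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

score = model.predict(X[:1], verbose=0)  # a continuous sentiment score in [0, 1]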
I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of disease A, B, C, D), where the distinction between the 4 grades is not trivial, so the labels I have for the training data are not one class per image. Each image instead has a set of 4 probabilities, one per class, e.g.
0.7 0.1 0.05 0.15
0.35 0.2 0.45 0.0
...
... would basically mean that
The first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%
etc., I'm sure you get the idea.
I don't understand how to fit a model with these labels, because scikit-learn classifiers expect only 1 label per training sample. Using just the class with the highest probability leads to miserable results.
Can I train my model with scikit-learn multilabel classification (and how)?
Please note:
Feature extraction is not the problem.
Prediction is not the problem.
Can I handle this somehow with the multilabel classification framework?
For predict_proba to return the probability for each class A, B, C, D the classifier needs to be trained with one label per image.
If yes: How?
Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:
F1 F2 F3 F4 Y
1 0 1 0 A
0 1 1 1 B
1 0 0 0 C
0 0 0 1 D
(...)
where F# are the features per each image and Y is the class as classified by doctors.
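A minimal scikit-learn sketch of this setup (the RandomForestClassifier and the toy rows are my own choices, picked only to illustrate the shape of the data):

from sklearn.ensemble import RandomForestClassifier

# Toy data mirroring the table above: four binary features per image,
# one doctor-assigned class per image.
X = [[1, 0, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 0, 0],
     [0, 0, 0, 1]]
y = ["A", "B", "C", "D"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
# With single labels at training time, predict_proba returns one
# probability per class (A, B, C, D) for each new image.
print(clf.predict_proba([[1, 0, 1, 1]]))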
If no: Any other approaches?
For the case where you have more than one label per image, that is, multiple potential classes or their respective probabilities, multilabel models might be a more appropriate choice, as documented in Multiclass and multilabel algorithms.
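A sketch of what that might look like, assuming you first binarize the probabilistic labels with some cutoff (the 0.3 threshold, the extra toy rows, and OneVsRestClassifier are illustrative assumptions, not prescribed by the documentation):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy probabilistic labels in the style of the question; mark a class as
# present when its probability exceeds an (assumed) cutoff of 0.3.
probs = np.array([[0.70, 0.10, 0.05, 0.15],
                  [0.35, 0.20, 0.45, 0.00],
                  [0.10, 0.60, 0.20, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])
Y = (probs > 0.3).astype(int)  # binary indicator matrix, one column per class

X = np.random.rand(4, 5)       # placeholder features
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(np.random.rand(1, 5)))  # one 0/1 flag per class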
I am having trouble with a classification problem.
I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is heavily imbalanced: 95% of the samples have label 1, and the rest have label 0. The accuracy grows as training progresses and plateaus at 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with probability 0.5, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with imbalanced data is to use class weights. For example, you can weight your classes such that the sum of the weights for each class is equal.
import pandas as pd

# Toy dataset: 2 samples of class 0 and 5 samples of class 1.
df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})

# Weight each sample inversely to its class frequency, so that the total
# weight of each class comes out equal (len(df) / 2 = 3.5 per class).
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg(samples='size', weight='sum'))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
samples weight
y
0 2.0 3.5
1 5.0 3.5
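These per-sample weights can then be passed to the training step. Exactly where depends on the library; as a sketch (LogisticRegression here is just an illustrative stand-in, since scikit-learn's MLPClassifier does not accept sample weights, while e.g. Keras accepts model.fit(..., sample_weight=...)):

from sklearn.linear_model import LogisticRegression

# Hypothetical usage of the weights computed above.
clf = LogisticRegression()
clf.fit(df[['x']], df['y'], sample_weight=df['weight'])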
You could try another classifier on a subset of the examples. SVMs may work well with small datasets, so you could take, say, only 10k examples with a 5/1 class proportion.
You could also oversample the small class and under-sample the other.
You can also simply weight your classes.
Also think about a proper metric. It's good that you noticed that your model predicts only one label; that is, however, not easy to see from accuracy alone.
Some nice ideas about imbalanced datasets here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, as in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
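A minimal sketch of that loss (TF 2.x; the toy labels/logits and the pos_weight value are assumptions, with pos_weight > 1 penalizing errors on the rare class more heavily):

import tensorflow as tf

# Hypothetical 0/1 labels and raw logits, just for illustration.
labels = tf.constant([[1.0], [0.0], [0.0], [0.0]])
logits = tf.constant([[2.0], [-1.5], [0.3], [-0.8]])

# With a 95/5 split, pos_weight around 0.95 / 0.05 = 19 is a common
# starting point (an assumption, to be tuned on validation data).
loss = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=19.0))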
But I should say that getting more data to balance both classes (if that's possible) will always help.
Example: I have a sentence in a job description: "Java senior engineer in UK".
I want to use a deep learning model to predict 2 categories for it: English and IT jobs. If I use a traditional classification model, it can only predict 1 label, via the softmax function at the last layer. I could use 2 separate neural networks to predict "yes"/"no" for each category, but with more categories that becomes too expensive. So is there any deep learning or machine learning model that can predict 2 or more categories at the same time?
Edit: with 3 labels, the traditional approach would encode a sentence as [1,0,0], but in my case it should be encoded as [1,1,0] or [1,1,1].
Example: if we have 3 labels, a sentence may fit all of them. So if the output of the softmax function is [0.45, 0.35, 0.2], should we classify it into 3 labels, 2 labels, or maybe just one?
The main problem is: what is a good threshold for deciding between 1, 2, or 3 labels?
If you have n different categories which can be true at the same time, have n outputs in your output layer with a sigmoid activation function. This will give each output a value between 0 and 1 independently.
Your loss function should be the mean of the negative log-likelihood of the outputs. In TensorFlow, this is:
linear_output = ...  # the output layer before applying the activation function
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=linear_output, labels=correct_outputs))  # correct_outputs: 0/1 per category
output = tf.sigmoid(linear_output)  # 0 to 1 for each category
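As for the threshold question: with independent sigmoid outputs, a common default (my suggestion, not part of the answer above) is to accept every category whose probability exceeds 0.5, tuning the cutoff per category on validation data if needed:

probs = tf.constant([0.81, 0.62, 0.10])     # hypothetical sigmoid outputs
predicted = tf.cast(probs > 0.5, tf.int32)  # -> [1, 1, 0]: first two categories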
First I read this: How to interpret weka classification?
but it didn't help me.
Then, to set up the background: I am trying to learn through Kaggle competitions, where models are evaluated by ROC area.
I built two models, and the data about them is presented in this way:
Correctly Classified Instances 10309 98.1249 %
Incorrectly Classified Instances 197 1.8751 %
Kappa statistic 0.7807
K&B Relative Info Score 278520.5065 %
K&B Information Score 827.3574 bits 0.0788 bits/instance
Class complexity | order 0 3117.1189 bits 0.2967 bits/instance
Class complexity | scheme 948.6802 bits 0.0903 bits/instance
Complexity improvement (Sf) 2168.4387 bits 0.2064 bits/instance
Mean absolute error 0.0465
Root mean squared error 0.1283
Relative absolute error 46.7589 %
Root relative squared error 57.5625 %
Total Number of Instances 10506
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.998 0.327 0.982 0.998 0.99 0.992 0
0.673 0.002 0.956 0.673 0.79 0.992 1
Weighted Avg. 0.981 0.31 0.981 0.981 0.98 0.992
Apart from the K&B Relative Info Score, Relative absolute error, and Root relative squared error, which are respectively lower, higher, and higher in the model that the ROC curves rate as best,
all the figures are the same.
I built a third model with similar behavior (TP rate and so on), but again the K&B Relative Info Score, Relative absolute error, and Root relative squared error varied. That still did not let me predict whether this third model was superior to the first two (the variations went in the same direction as in the best model, so in theory it should have been superior, but it wasn't).
What should I do to predict if a model will perform well given such details about it?
Thanks in advance.
This question has already been asked, but I didn't understand the answer, so I am posting it again; please do reply.
I have a Weka model, e.g. J48. I have trained the model on my dataset, and now I have to test it with a single instance, for which it should return the class label. How do I do that?
I have tried these ways:
1) When I give my test instance as a,b,c,? (with ? for the class), it shows "Problem evaluating classifier: train and test are not compatible".
2) When I list all the class labels and put ? for the class label of the test instance, like this:
@attribute class {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27}
@data
1,2,............,?
it does not show any results; the output looks like this:
=== Evaluation on test set ===
=== Summary ===
Total Number of Instances 0
Ignored Class Unknown Instances 1
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0 0 0 0 0 ? 1
0 0 0 0 0 ? 2
0 0 0 0 0 ? 3
Weighted Avg. NaN NaN NaN NaN NaN NaN
The confusion matrix is null.
What should I do?
Given the incomplete information from the OP, here is what probably happened:
You used
the Weka GUI Chooser
selected the Weka Explorer
loaded your training data on the Preprocess tab
selected the Classify tab
selected the J48 classifier
selected Supplied test set under test options and supplied your aforementioned test set
clicked on Start
Now to your problem:
"Evaluation on test set" should have given it away: you are evaluating the classifier, or better, the trained model. But for evaluation, you need to compare the predicted class with the actual class, which you didn't supply. Hence, the instance with the missing class label is ignored.
Since you don't have any other test instances WITH class label, the confusion matrix is empty. There simply is not enough information available to build one. (And just as a side note: A confusion matrix for only one instance is kinda worthless.)
To see the actual prediction
You have to go to More options ..., click on Choose next to Output predictions, and select an output format, e.g. PlainText. You will then see something like:
inst# actual predicted error prediction
1 1:? 1:0 0.757
2 1:? 1:0 0.824
3 1:? 1:0 0.807
4 1:? 1:0 0.807
5 1:? 1:0 0.79
6 1:? 2:1 0.661
This output lists the classified instances in the order they occur in the test file. This example was taken from the Weka site about "Making predictions", which gives the following explanation:
In this case, taken directly from a test dataset where all class
attributes were marked by "?", the "actual" column, which can be
ignored, simply states that each class belongs to an unknown class.
The "predicted" column shows that instances 1 through 5 are predicted
to be of class 1, whose value is 0, and instance 6 is predicted to be
of class 2, whose value is 1. The error field is empty; if predictions
were being performed on a labeled test set, each instance where the
prediction failed to match the label would contain a "+". The
probability that instance 1 actually belongs to class 0 is estimated
at 0.757.
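If you prefer the command line over the Explorer, the same "Making predictions" page documents an equivalent invocation (a sketch; the jar path and file names are placeholders):

# -l loads the trained model, -T supplies the (unlabeled) test file,
# -p 0 prints the predictions without extra attributes.
java -cp weka.jar weka.classifiers.trees.J48 -l j48.model -T unlabeled.arff -p 0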