I've used scikit-learn to build a random forest model to predict insurance renewals. This is tricky because, in my data set, 96.24% of customers renew while only 3.76% do not. After running the model, I evaluated its performance with a confusion matrix, a classification report, and a ROC curve.
[[  2448   8439]
 [     3 278953]]
             precision    recall  f1-score   support

          0       1.00      0.22      0.37     10887
          1       0.97      1.00      0.99    278956

avg / total       0.97      0.97      0.96    289843
My ROC curve looks like this:
The model predicts renewals at just a hair under 100% recall (rounded to 1.00 in the recall column) and non-renewals at only about 22% recall. Given those numbers, I would expect the ROC curve to show an area under the curve much greater than the value reported in the bottom-right corner of the plot (area = 0.61).
Does anyone understand why this is happening?
Thank you!
In cases where the classes are highly imbalanced, ROC turns out to be an inappropriate metric. A better choice is average precision, or the area under the precision-recall (PR) curve.
This supporting Kaggle link talks about the exact same issue in a similar problem setting.
This answer and the linked paper explain that optimizing for the best area under the PR curve will also give the best ROC.
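For reference, here is a minimal sketch of that recommendation in scikit-learn. It uses synthetic data with a similarly skewed class ratio (not the asker's renewal data) and compares ROC AUC against average precision and the PR curve:

```python
# Minimal sketch on synthetic, heavily imbalanced data (roughly a 96/4 split,
# mimicking the renewal ratio); not the asker's actual pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

X, y = make_classification(n_samples=20000, weights=[0.96, 0.04], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]   # probability of the (minority) positive class

print("ROC AUC:          ", roc_auc_score(y_test, proba))
print("Average precision:", average_precision_score(y_test, proba))

# The full precision-recall curve, if you want to inspect individual thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, proba)
```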
I am developing a scikit-learn model on an imbalanced dataset (binary classification). Looking at the confusion matrix and the F1 score, I would expect a lower average precision score, but I get an almost perfect score and I can't figure out why. This is the output I am getting:
Confusion matrix on the test set:
[[6792  199]
 [   0  173]]
F1 score:
0.63
Test AVG precision score:
0.99
I am passing predicted probabilities to scikit-learn's average precision score function, which is what the documentation says to use. I was wondering where the problem could be.
The confusion matrix and f1 score are based on a hard prediction, which in sklearn is produced by cutting predictions at a probability threshold of 0.5 (for binary classification, and assuming the classifier is really probabilistic to begin with [so not SVM e.g.]). The average precision in contrast is computed using all possible probability thresholds; it can be read as the area under the precision-recall curve.
So a high average_precision_score and a low f1_score suggest that your model does extremely well at some threshold that is not 0.5.
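To make that concrete, here is a small synthetic sketch (invented scores, not the asker's model) where the classes separate well but almost no score crosses 0.5, so F1 at the default cut is poor while average precision stays high:

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
# 1000 negatives scored around 0.2 and 50 positives scored around 0.4:
# the classes are well separated, but almost nothing crosses 0.5.
y_true = np.concatenate([np.zeros(1000), np.ones(50)])
scores = np.concatenate([rng.normal(0.20, 0.05, 1000),
                         rng.normal(0.40, 0.05, 50)]).clip(0, 1)

print("F1 at threshold 0.5:", f1_score(y_true, scores >= 0.5))          # near 0
print("Average precision:  ", average_precision_score(y_true, scores))  # high

# The PR curve reveals the threshold where the model actually does well.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
f1s = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1s)
print("Best threshold by F1:", thresholds[best])
print("F1 at that threshold:", f1_score(y_true, scores >= thresholds[best]))
```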
I'm training an FCN (Fully Convolutional Network) and using "Sigmoid Cross Entropy" as a loss function.
My evaluation metrics are the F-measure and MAE.
The train/dev loss vs. iteration graph looks something like the one below:
Although the dev loss increases slightly after iteration 2200, my metrics on the dev set keep improving up to around iteration 10000. I want to know whether this is possible in machine learning at all. If the F-measure improves, shouldn't the loss also decrease? How do you explain it?
Any answer would be appreciated.
Short answer: yes, it's possible.
I would explain it by reasoning about the cross-entropy loss and how it differs from the metrics. Loss functions for classification are, generally speaking, computed on predicted probabilities (e.g. 0.1/0.9), while metrics usually use the predicted labels (0/1).
If the model is strongly confident (probabilities close to 0 or 1), a wrong prediction will greatly increase the loss while causing only a small decrease in the F-measure.
Conversely, a wrong prediction made with low confidence (e.g. 0.49/0.51) has a small impact on the loss (from a numerical perspective) and a greater impact on the metrics.
Plotting the distribution of your predictions would help to confirm this hypothesis.
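As a toy illustration (made-up probabilities, not the asker's FCN), the same set of mistakes made with higher confidence inflates the cross-entropy loss considerably while the thresholded F-measure does not move at all:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([1] * 90 + [0] * 10)

# Scenario A: the 10 negatives are misclassified, but with low confidence (p = 0.55).
p_low = np.array([0.9] * 90 + [0.55] * 10)
# Scenario B: the same 10 mistakes, now made with high confidence (p = 0.99).
p_high = np.array([0.9] * 90 + [0.99] * 10)

for name, p in [("low-confidence mistakes ", p_low),
                ("high-confidence mistakes", p_high)]:
    print(name,
          "| cross-entropy:", round(log_loss(y_true, p), 3),
          "| F1:", round(f1_score(y_true, p >= 0.5), 3))
# The loss roughly triples between the two scenarios while F1 is identical,
# because the thresholded predictions (p >= 0.5) never change.
```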
I wonder why our objective is to maximize AUC when maximizing accuracy would yield the same result?
I think that if the primary goal is to maximize accuracy, the AUC will automatically be large as well.
I guess we use AUC because it explains how well our method is able to separate the data independently of a threshold.
For some applications, we don't want false positives or false negatives. And when we use accuracy, we already make an a priori choice of the best threshold to separate the data, regardless of specificity and sensitivity.
In binary classification, accuracy is a performance metric of a single model at a single threshold, whereas the AUC (area under the ROC curve) summarizes the performance of that model across the whole range of thresholds.
Thanks to this question, I have learnt quite a bit about comparing AUC and accuracy. I don't think there is a simple correlation between the two, and I think this is still an open problem. At the end of this answer, I've added some links that I think would be useful.
One scenario where accuracy fails:
Example Problem
Let's consider a binary classification problem where you evaluate the performance of your model on a data set of 100 samples (98 of class 0 and 2 of class 1).
Take out your sophisticated machine learning model and replace the whole thing with a dumb system that always outputs 0 for whatever input it receives.
What is the accuracy now?
Accuracy = Correct predictions/Total predictions = 98/100 = 0.98
We got a stunning 98% accuracy on the "Always 0" system.
Now you convert your system into a cancer diagnosis system and start predicting (0 - no cancer, 1 - cancer) on a set of patients. Assuming there are only a few cases that correspond to class 1, you will still achieve high accuracy.
Despite the high accuracy, what is the point of the system if it fails to do well on class 1 (identifying patients with cancer)?
This observation suggests that accuracy is not a good evaluation metric for every type of machine learning problem. The above is known as a class imbalance problem, and there are plenty of practical problems of this nature.
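As a quick sanity check, the "Always 0" system from the example can be scored directly, e.g. with scikit-learn: it reaches 98% accuracy but has zero recall on class 1 and only a chance-level AUC:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

y_true = np.array([0] * 98 + [1] * 2)   # 98 samples of class 0, 2 of class 1
y_pred = np.zeros(100, dtype=int)       # the dumb system: always predict 0
scores = np.zeros(100)                  # its "probability" of class 1 is always 0

print("Accuracy:        ", accuracy_score(y_true, y_pred))   # 0.98
print("Recall (class 1):", recall_score(y_true, y_pred))     # 0.0
print("ROC AUC:         ", roc_auc_score(y_true, scores))    # 0.5 (chance level)
```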
As for the comparison of accuracy and AUC, here are some links I think would be useful:
An introduction to ROC analysis
Area under curve of ROC vs. overall accuracy
Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
What does AUC stand for and what is it?
Understanding ROC curve
ROC vs. Accuracy vs. AROC
I have to deal with a class imbalance problem and do binary classification of an input test data-set, where the majority class label in the training data-set is 1 (the other class label is 0).
For example, the following is part of the training data:
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class label (0 or 1). The actual data-set is very skewed, with a 10:1 class ratio: around 700 samples have 0 as their class label, while the remaining 6800 have 1 as their class label.
The rows above are only a few of the samples in the given data-set, but the actual data-set contains about 90% of samples with class label 1 and the rest with class label 0, despite the fact that more or less all the samples look very similar.
Which classifier would be best for handling this kind of data-set?
I have already tried logistic regression as well as SVM with the class_weight parameter set to "balanced", but got no significant improvement in accuracy.
but got no significant improvement in accuracy.
Accuracy isn't the way to go (e.g. see the accuracy paradox). With a 10:1 ratio of classes you can easily get 90% accuracy just by always predicting the majority class label, 1.
Some good starting points are:
try a different performance metric, e.g. the F1-score or the Matthews correlation coefficient (see the sketch after this list)
"resample" the dataset: add examples from the under-represented class (over-sampling) / delete instances from the over-represented class (under-sampling; you should have a lot of data)
a different point of view: anomaly detection is worth a try for an imbalanced dataset
a different algorithm is another possibility, but not a silver bullet. You should probably start with decision trees (they often perform well on imbalanced datasets)
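Here is a rough sketch of the first two bullets, under assumptions about your data (six comma-separated feature columns with the label last, and a hypothetical file name train.csv):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.utils import resample

data = np.loadtxt("train.csv", delimiter=",")     # hypothetical file name
X, y = data[:, :-1], data[:, -1].astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Metric bullet: judge the model with F1 on the minority class and with MCC.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("F1 (minority class 0):", f1_score(y_te, pred, pos_label=0))
print("Matthews corrcoef:    ", matthews_corrcoef(y_te, pred))

# Resampling bullet: simple over-sampling of the minority class (label 0).
X_min, y_min = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_up, y_up = resample(X_min, y_min, n_samples=int((y_tr == 1).sum()), random_state=0)
X_bal = np.vstack([X_tr[y_tr == 1], X_up])
y_bal = np.concatenate([y_tr[y_tr == 1], y_up])
clf_bal = LogisticRegression().fit(X_bal, y_bal)
```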
EDIT (now knowing you're using scikit-learn)
The weights from the class_weight parameter (scikit-learn) are used to train the classifier (so "balanced" is fine), but accuracy is a poor choice for judging how well it's performing.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
Have you tried plotting the ROC curve and computing its AUC to check your parameters across different thresholds? If not, that should give you a good starting point.
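Continuing the sketch above (it assumes clf, X_te and y_te from the previous snippet), the ROC curve and its AUC could be plotted like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

proba = clf.predict_proba(X_te)[:, 1]          # probability of class 1
fpr, tpr, thresholds = roc_curve(y_te, proba)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)

plt.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```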
I am using Stanford NLP for sentiment analysis, but after training the model for roughly 24 hours, the session ended because the maximum training time was exceeded.
After evaluating the resulting models, I found that the accuracy results are far worse than the ones reported in the Stanford paper.
These are the results of the Evaluation:
Tested 82600 labels
65166 correct
17434 incorrect
0.788935 accuracy
Tested 2210 roots
828 correct
1382 incorrect
0.374661 accuracy
Approximate Negative label accuracy: 0.595578
Approximate Positive label accuracy: 0.663263
Combined approximate label accuracy: 0.634001
Approximate Negative root label accuracy: 0.665570
Approximate Positive root label accuracy: 0.601760
Combined approximate root label accuracy: 0.633718
I decided to retrain the model with the maximum training time (MaximumTrainTimeSeconds) set to 3 days, hoping to get better accuracy.
Has anyone encountered the same issue?
Do you think that retraining the algorithm for a longer period would make me achieve the expected accuracy?
Moreover, I am not entirely sure how the score reported for the model (e.g. 79.30 for the model in the picture) relates to the model's best accuracy.
I'm very new to NLP so if I am missing any required information or anything at all please let me know! Thank you!