I'm trying to tackle a binary classification problem with a custom random forest implementation.
The goal is to predict the likelihood that an item belongs to class A. The evaluation strategy is defined such that false positives (a high likelihood for A while the actual class is B) are penalized more heavily than false negatives (a low likelihood for A while the actual class is A).
How should the standard algorithm be adapted to take advantage of this to get a higher evaluation score?
If you haven't already, try using the package rfUtilities: https://cran.r-project.org/web/packages/rfUtilities/rfUtilities.pdf
It was designed to deal with class imbalance by predicting the likelihood of occurrence for a single category.
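As a generic illustration of the kind of adaptation the question asks about (this is scikit-learn in Python, not rfUtilities, and every number below is an arbitrary placeholder), asymmetric costs are often approximated with class weights plus a raised decision threshold on the predicted likelihood:

    # A minimal sketch: penalize false positives for class A via asymmetric
    # class weights and a raised decision threshold. Data and numbers are toys.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)   # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Treat label 1 as "class A". Up-weighting class 0 (class B) makes it more
    # costly for the trees to mislabel a true B as A, discouraging false positives.
    rf = RandomForestClassifier(n_estimators=300,
                                class_weight={0: 3.0, 1: 1.0},
                                random_state=0).fit(X_tr, y_tr)

    proba_a = rf.predict_proba(X_te)[:, 1]
    # Only report a high likelihood of A when the forest is confident;
    # the 0.7 cut-off is arbitrary and would be tuned against the evaluation metric.
    pred_is_a = proba_a >= 0.7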
I am currently building a binary classification model to predict stock price movements (trend prediction). More specifically, the model predicts the probability that a stock outperforms the daily median return:
>Class 0: return >= median
>
>Class 1: return < median
Accordingly, I am (or at least should be) dealing with a balanced prediction problem.
The ten stocks with the highest predicted probability will be bought, and the ten stocks with the lowest probability will be shorted, daily. So, ideally, the model performs well on both classes (I use a softmax output, so the model must commit to exactly one class).
I am wondering whether I should use Accuracy, the F1 score, or ROC AUC when choosing the optimal model under these circumstances?
My understanding is that all of these are suitable metrics when the two classes are equally important. This StackExchange answer recommends AUC over Accuracy because it will "strongly discourage people going for models that are representative, but not discriminative (...) and [only] select models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy". In contrast, this answer recommends the F1 score because it combines precision and recall.
I guess what's confusing me is that I will make use of both classes based on the probability assigned by the model. Also, I do not have an imbalanced dataset, which is what usually calls for AUC-ROC.
Which evaluation metric should I choose to find the optimal model on validation data?
Thanks a lot for any thoughts or recommendations.
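For reference, here is a minimal sketch (the arrays are made up; in practice they would be your validation labels and predicted probabilities) of how the three candidate metrics are computed with scikit-learn:

    # A minimal sketch, assuming you already have validation labels and
    # predicted probabilities for class 1; the arrays here are illustrative only.
    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    y_val = np.array([0, 1, 0, 1, 1, 0, 1, 0])                        # true classes
    proba_val = np.array([0.2, 0.8, 0.4, 0.6, 0.9, 0.3, 0.7, 0.55])   # P(class 1)

    pred_val = (proba_val >= 0.5).astype(int)    # hard labels at the 0.5 cut-off

    print("Accuracy:", accuracy_score(y_val, pred_val))
    print("F1      :", f1_score(y_val, pred_val))
    print("ROC AUC :", roc_auc_score(y_val, proba_val))  # uses probabilities directly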
I'm working on a binary classification problem. I used a logistic regression and a support vector machine model imported from sklearn. Both models were fit on the same imbalanced training data with class weights adjusted, and they achieved comparable performance. When I used these two pre-trained models to predict on a new dataset, the LR and SVM models predicted a similar number of instances as positives, and the predicted instances largely overlap.
However, when I looked at the probability scores of being classified as positive, the LR scores range from 0.5 to 1 while the SVM scores start from around 0.1. I called model.predict(prediction_data) to find the instances predicted as each class, and model.predict_proba(prediction_data) to get the probability scores of being classified as 0 (negative) and 1 (positive), assuming both use a default threshold of 0.5.
There is no error in my code, and I have no idea why the SVM also predicted instances with probability scores < 0.5 as positive. Any thoughts on how to interpret this situation?
That's a known behaviour of sklearn's SVC() on binary classification problems, reported, for instance, in these github issues (here and here). Moreover, it is also documented in the User Guide, where it is said that:
> In addition, the probability estimates may be inconsistent with the scores: the “argmax” of the scores may not be the argmax of the probabilities; in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
or directly in the libsvm FAQ, where it is said that:
> Let's just consider two-class classification here. After probability information is obtained in training, we do not have prob >= 0.5 if and only if decision value >= 0.
All in all, the point is that:
On one side, predictions are based on decision_function values: if the decision value computed on a new instance is positive, the predicted class is the positive class, and vice versa.
On the other side, as stated within one of the github issues, np.argmax(self.predict_proba(X), axis=1) != self.predict(X), which is where the inconsistency comes from. In other terms, in order to always have consistency on binary classification problems, you would need a classifier whose predictions are based on the output of predict_proba() (which is, by the way, what you get when considering calibrators), like so:
    def predict(self, X):
        # Derive the label from the argmax of the class probabilities, so that
        # predict() and predict_proba() can never disagree.
        y_proba = self.predict_proba(X)
        return np.argmax(y_proba, axis=1)
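To see the mismatch concretely, here is a small reproduction sketch (toy data, not part of the original answer) that compares SVC's predict() output with the argmax of predict_proba():

    # A minimal sketch reproducing the predict vs. predict_proba mismatch with
    # sklearn's SVC on toy data; the dataset parameters are arbitrary.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
    clf = SVC(probability=True, class_weight="balanced", random_state=0).fit(X, y)

    labels_from_predict = clf.predict(X)                         # based on decision_function
    labels_from_proba = np.argmax(clf.predict_proba(X), axis=1)  # based on Platt-scaled probabilities

    # The two label sets may disagree on some samples, which is exactly the
    # inconsistency described above.
    print(np.sum(labels_from_predict != labels_from_proba), "inconsistent predictions")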
I'd also suggest this post on the topic.
I am doing tensorflow object detection and I find that there are a lot of false positives. One of the main reasons I see for this is overfitting. But my doubt is: how do false positives result from overfitting? Overfitting happens when the model learns overly complex patterns in the data; in short, it memorises the training data.
If it were memorisation, wouldn't it show more false negatives, since it has only memorised the training data and is unable to detect new cases? How can it classify other objects as belonging to a trained class? Isn't that counter-intuitive?
One reason I could think of would be outliers in your training data:
Say you have some strong outliers of class A in your training data which, in some dimension, lie more in the domain of the other class B. Overfitting will then shift the class boundary in the direction of these outliers. This can effectively produce a lot of false positives, because the shifted boundary of class A now partially lies in an area that should belong to class B.
For an extreme example, an overfitted boundary might bulge out to wrap around a single class-A outlier deep inside class B's territory.
Here, due to overfitting, we keep the outlier in the positive class at the cost of also taking in 2 false positives. A well-generalized boundary between the 2 classes would discard the outlier (as a false negative) but would still have a higher accuracy, because it does not also include those 2 false positives.
The same could happen for false negatives due to outliers, by the way; that's why overfitting is generally considered bad.
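As a rough, self-contained illustration of that effect (not from the original answer; the data and models are made up), a 1-nearest-neighbour classifier carves out a positive region around class-A outliers planted inside class B's territory, while a smoother model ignores them:

    # A minimal sketch: an overfitting model (1-NN) vs a smoother one (15-NN) on
    # 2-D blobs with a few class-A outliers placed deep inside class B's region.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))      # class A around (0, 0)
    X_b = rng.normal(loc=4.0, scale=1.0, size=(200, 2))      # class B around (4, 4)
    X_a_out = rng.normal(loc=4.0, scale=0.3, size=(3, 2))    # class-A outliers in B's territory

    X_train = np.vstack([X_a, X_a_out, X_b])
    y_train = np.array([1] * 203 + [0] * 200)                # 1 = class A (positive)

    # Fresh, clean test data drawn from the same two blobs (no outliers).
    X_test = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(4.0, 1.0, (500, 2))])
    y_test = np.array([1] * 500 + [0] * 500)

    for k in (1, 15):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        pred = knn.predict(X_test)
        fp = np.sum((pred == 1) & (y_test == 0))
        # The 1-NN model tends to produce extra false positives near the planted outliers.
        print(f"k={k:>2}  false positives on clean test data: {fp}")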
Let's say I have three classes that represent the cervix type to be classified by my model. The overall goal is to predict the correct cervix class so that health providers can give the patient the most appropriate treatment for their cervical cancer. Misclassifying the cervix type would cost diagnosis time for the health providers and treatment fees for the patients. In this case, is precision more important than recall?
Just as a reminder:
Recall: the ratio of true positives to the sum of true positives and false negatives, Recall = TP / (TP + FN). It measures the fraction of all the samples of a class that are correctly predicted as that class. It matters most when your goal is to identify every sample of a class (for example, to find all people who have cancer).
Precision: the ratio of true positives to the sum of true positives and false positives, Precision = TP / (TP + FP). It measures the fraction of the samples predicted as a class that really belong to that class. It matters most when your goal is to avoid false positives (for example, to avoid predicting that a woman is pregnant when she actually isn't).
In this case, I think it is important to know more about the consequences of a mistake. Is there a life risk if a person is of type A but is predicted as type B? Is there a type that is the safest (even if a person is misclassified, there is no life risk)? And one that is the most dangerous (where there is a life risk)?
Depending on the answers, you will be able to choose the best metric.
Assuming there is a safest and a most dangerous type, recall on the most dangerous one would be the more relevant metric, because you must identify as many of those cases as you possibly can.
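As a rough, hypothetical illustration of that advice (not from the original answer; the label arrays below are made up), per-class recall can be inspected directly so the recall of the most dangerous type is monitored on its own:

    # A minimal sketch assuming a 3-class cervix-type problem; the arrays are
    # illustrative only, with class 2 standing in for the "most dangerous" type.
    from sklearn.metrics import precision_score, recall_score

    y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 1, 2]

    # average=None returns one score per class, so the recall of the dangerous
    # class (index 2) can be tracked separately from the others.
    print(recall_score(y_true, y_pred, average=None))
    print(precision_score(y_true, y_pred, average=None))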
I have the F1 and AUC scores below for two different models:

| Model | Precision | Recall | F1 | AUC |
|---|---|---|---|---|
| Model 1 | 85.11 | 99.04 | 91.55 | 69.94 |
| Model 2 | 85.10 | 98.73 | 91.41 | 71.69 |

The main motive of my problem is to predict the positive cases correctly, i.e., to reduce the number of False Negatives (FN). Should I use the F1 score and choose Model 1, or use AUC and choose Model 2? Thanks.
Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 Score, think about it as if you are comparing your model performance based on:
[Sensitivity vs (1-Specificity)] VS [Precision vs Recall]
Note that Sensitivity is the Recall (they are the same exact metric).
Now we need to understand, intuitively, what Specificity, Precision and Recall (Sensitivity) are!
Background
Specificity is given by the following formula:

    Specificity = TN / (TN + FP)

Intuitively speaking, if we have a 100% specific model, that means it did NOT miss any True Negative; in other words, there were NO False Positives (i.e. negative instances falsely labeled as positive). Yet, there is still a risk of having a lot of False Negatives!
Precision is given by the following formula:

    Precision = TP / (TP + FP)

Intuitively speaking, if we have a 100% precise model, that means every instance it labeled as positive really is positive; in other words, there were NO False Positives. Yet, it may still have missed many actual positives (False Negatives)!
Recall is given by the following formula:

    Recall = TP / (TP + FN)

Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any True Positive; in other words, there were NO False Negatives (i.e. positive instances falsely labeled as negative). Yet, there is still a risk of having a lot of False Positives!
As you can see, the three concepts are very close to each other!
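As a small worked example (the confusion-matrix counts here are made up purely for illustration): with TP = 40, FP = 10, FN = 5 and TN = 45, we get Precision = 40 / (40 + 10) = 0.80, Recall = 40 / (40 + 5) ≈ 0.89, and Specificity = 45 / (45 + 10) ≈ 0.82.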
As a rule of thumb, if the cost of a False Negative is high, we want to increase the model's sensitivity and recall (which are the exact same formula)!
For instance, in fraud detection or sick-patient detection, we don't want to label/predict a fraudulent transaction (an actual positive) as non-fraudulent (a False Negative). Also, we don't want to label/predict a contagious sick patient (an actual positive) as not sick (a False Negative).
This is because the consequences would be worse than those of a False Positive (incorrectly labeling a harmless transaction as fraudulent, or a non-contagious patient as contagious).
On the other hand, if the cost of a False Positive is high, then we want to increase the model's specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (an actual negative) as spam (a False Positive). On the other hand, failing to label a spam email as spam (a False Negative) is less costly.
F1 Score
It's given by the following formula:

    F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

The F1 Score keeps a balance between Precision and Recall. We use it when there is an uneven class distribution, since accuracy alone can then be misleading!
So we use the F1 Score as a single indicator that summarizes both the Precision and Recall numbers!
Area Under the Receiver Operating Characteristic curve (AUROC)
It compares Sensitivity vs (1 - Specificity); in other words, it plots the True Positive Rate against the False Positive Rate over all classification thresholds.
So, the bigger the AUROC, the better the model is at distinguishing the positive class from the negative class!
AUROC vs F1 Score (Conclusion)
In general, the ROC curve is traced over many different threshold levels, so it corresponds to many different F1 scores, whereas an F1 score applies to one particular point (threshold) on the ROC curve.
You may think of the F1 score as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the whole ROC curve. For the F1 score to be high, both precision and recall should be high.
Consequently, when you have a data imbalance between positive and negative samples, you should prefer the F1 score, because ROC AUC averages over all possible thresholds!
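To make that contrast concrete, here is a minimal sketch (not part of the original answer; the dataset and model are arbitrary toys) that computes both metrics on skewed data:

    # A minimal sketch contrasting ROC AUC and F1 on an imbalanced toy dataset;
    # the dataset parameters and classifier choice are arbitrary.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, roc_auc_score

    X, y = make_classification(n_samples=10_000, weights=[0.97, 0.03], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]

    # ROC AUC is threshold-free; F1 is evaluated here at the default 0.5 threshold.
    print("ROC AUC:", roc_auc_score(y_te, proba))
    print("F1 @0.5:", f1_score(y_te, proba >= 0.5))

    # On skewed data the ROC AUC can look strong while the F1 at the default
    # threshold is much weaker, which is the gap the answer above points at.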
Further read:
Credit Card Fraud: Handling highly imbalanced classes, and why the Receiver Operating Characteristic curve (ROC curve) should not be used and the Precision/Recall curve should be preferred in highly imbalanced situations
If you look at the definitions, you can see that both AUC and the F1 score optimize "something" together with the fraction of the truly positive samples that get labeled positive (i.e. the recall, or sensitivity).
This "something" is:
For the AUC, it is the specificity: the fraction of the truly negative samples that are correctly labeled as negative. With AUC you are not looking at the purity of your positively labeled sample (the precision).
For the F1 score, it is the precision: the fraction of the positively labeled samples that are correctly labeled. With the F1 score you don't consider how well you do on the negative samples (the specificity).
The difference becomes important when you have highly imbalanced (skewed) classes, for example when there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more "negative" people than "positive" ones, and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the positively labeled sample to include all positives if possible, but you also don't want it to be huge as a result of a high false positive rate. So in this case you use the F1 score.
Conversely, if both classes make up about 50% of your dataset, or both make up a sizable fraction, and you care equally about performance on each class, then you should use the AUC, which optimizes for both classes, positive and negative.
just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case, comparing the effectiveness of drugs on patients, it was easy to learn which drugs are generally strong and which are weak. The big question was whether the model can hit the outliers (the few positives for a weak drug, or the few negatives for a strong drug). To answer that, you have to explicitly weight the outliers up when using F1, which you don't need to do with AUC.
> to predict the positive cases correctly

One can restate your goal a bit: when a case is really positive, you want to classify it as positive too. The probability of that event, p(predicted_label = positive | true_label = positive), is the recall by definition. If you want to maximize this property of your model, you should choose Model 1.