I have the following F1 and AUC scores for two different models:
Model 1: Precision: 85.11 Recall: 99.04 F1: 91.55 AUC: 69.94
Model 2: Precision: 85.1 Recall: 98.73 F1: 91.41 AUC: 71.69
The main goal of my problem is to predict the positive cases correctly, i.e., reduce the False Negative (FN) cases. Should I use the F1 score and choose Model 1, or use AUC and choose Model 2? Thanks
Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 Score, think about it as if you are comparing your model performance based on:
[Sensitivity vs (1-Specificity)] VS [Precision vs Recall]
Note that Sensitivity is the Recall (they are the same exact metric).
Now we need to understand, intuitively, what Specificity, Precision and Recall (Sensitivity) are!
Background
Specificity is given by the following formula:

    Specificity = TN / (TN + FP)

Intuitively speaking, if we have a 100% specific model, that means it did NOT miss any True Negative; in other words, there were NO False Positives (i.e. negative cases that are falsely labeled as positive). Yet, there is a risk of having a lot of False Negatives!
Precision is given by the following formula:

    Precision = TP / (TP + FP)

Intuitively speaking, if we have a 100% precise model, that means every case it labeled as positive really is positive; there were NO False Positives. Yet, there is a risk of having a lot of False Negatives (missed positives)!
Recall is given by the following formula:

    Recall = TP / (TP + FN)

Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any True Positive; in other words, there were NO False Negatives (i.e. positive cases that are falsely labeled as negative). Yet, there is a risk of having a lot of False Positives!
As you can see, the three concepts are very close to each other!
As a rule of thumb, if the cost of having a False Negative is high, we want to increase the model's sensitivity and recall (which are exactly the same metric)!
For instance, in fraud detection or sick patient detection, we don't want to label/predict a fraudulent transaction (True Positive) as non-fraudulent (False Negative). Also, we don't want to label/predict a contagious sick patient (True Positive) as not sick (False Negative).
This is because the consequences will be worse than a False Positive (incorrectly labeling a harmless transaction as fraudulent, or a non-contagious patient as contagious).
On the other hand, if the cost of having a False Positive is high, then we want to increase the model's specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (True Negative) as spam (False Positive). On the other hand, failing to label a spam email as spam (False Negative) is less costly.
F1 Score
It's given by the following formula:

    F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
F1 Score keeps a balance between Precision and Recall. We use it if there is uneven class distribution, as precision and recall may give misleading results!
So we use F1 Score as a comparison indicator between Precision and Recall Numbers!
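To make the formulas above concrete, here is a minimal Python sketch (the confusion-matrix counts are made up purely for illustration) that computes specificity, precision, recall and F1 directly from the counts:

    # Hypothetical confusion-matrix counts, for illustration only.
    tp, fp, fn, tn = 85, 15, 1, 20

    specificity = tn / (tn + fp)                                 # purity of the negatives
    precision   = tp / (tp + fp)                                 # purity of the positive predictions
    recall      = tp / (tp + fn)                                 # share of true positives we caught
    f1          = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

    print(f"specificity={specificity:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")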
Area Under the Receiver Operating Characteristic curve (AUROC)
It compares Sensitivity vs (1-Specificity); in other words, it compares the True Positive Rate against the False Positive Rate.
So, the bigger the AUROC, the better the model distinguishes between the positive and the negative class!
AUROC vs F1 Score (Conclusion)
In general, the ROC curve spans many different threshold levels, and thus it has many F score values; the F1 score is computed at one particular point on the ROC curve (i.e. at one particular threshold).
You may think of F1 as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the whole ROC curve. For the F score to be high, both precision and recall should be high.
Consequently, when you have a data imbalance between positive and negative samples, you should always use F1-score because ROC averages over all possible thresholds!
Further read:
Credit Card Fraud: Handling highly imbalanced classes and why the Receiver Operating Characteristic curve (ROC curve) should not be used, and the Precision/Recall curve should be preferred, in highly imbalanced situations
If you look at the definitions, you can see that both AUC and F1-score optimize "something" together with the fraction of the truly positive samples that is correctly labeled positive (the recall/sensitivity).
This "something" is:
For the AUC, it's the specificity: the fraction of the truly negative samples that is correctly labeled negative. You're not looking at the fraction of your positively labeled samples that is correctly labeled.
Using the F1 score, it's the precision: the fraction of the positively labeled samples that is correctly labeled. And using the F1-score you don't consider how well the truly negative samples are handled (the specificity).
The difference becomes important when you have highly unbalanced or skewed classes: For example there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more people "negative" than "positive", and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the set of samples labeled positive to include all the true positives if possible, and you don't want it to be huge; with so many negatives, even a low false positive rate can flood it with false positives. So in this case you use the F1 score.
Conversely if both classes make up 50% of your dataset, or both make up a sizable fraction, and you care about your performance in identifying each class equally, then you should use the AUC, which optimizes for both classes, positive and negative.
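As a rough illustration of this point (my own sketch, not part of the original answers; the synthetic data and the 5% positive rate are arbitrary choices), the following scikit-learn snippet fits a logistic regression on an imbalanced dataset and reports ROC AUC, the area under the precision-recall curve, and F1 at the default 0.5 cutoff, so you can see how differently they behave:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

    # Synthetic data with roughly 5% positives to mimic a skewed problem.
    X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]        # continuous scores (for ROC/PR)
    labels = (scores >= 0.5).astype(int)          # hard labels at the default cutoff

    print("ROC AUC:", roc_auc_score(y_te, scores))            # uses both classes, all thresholds
    print("PR  AUC:", average_precision_score(y_te, scores))  # focuses on the positive class
    print("F1     :", f1_score(y_te, labels))                 # positive class, one threshold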
just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case, comparing the effectiveness of drugs on patients, it's easy to learn which drugs are generally strong and which are weak. The big question is whether you can hit the outliers (the few positives for a weak drug, or the few negatives for a strong drug). To answer that, you have to explicitly up-weight the outliers when using F1, which you don't need to do with AUC.
to predict the positive cases correctly
One can rewrite your goal a bit and get: when a case is really positive, you want to classify it as positive too. The probability of such an event, p(predicted_label = positive | true_label = positive), is the recall by definition. If you want to maximize this property of your model, you'd choose Model 1.
Related
I'm using XGBoost for binary classification. The standard/default loss function (binary logistic) considers all classifications (both in the positive and negative classes) for performance.
All I care about is precision. I don't mind if it makes a very small number of classifications, as long as it maximises its strike rate of getting them right. So I'd like a loss function/evaluation metric combination that doesn't care about missed opportunities at all (i.e. false negatives or true negatives), and only seeks to maximise true positives (and minimise false positives).
I have a relatively balanced panel.
Is there a straightforward way to do this in xgboost (either through existing hyperparameters, or through a new loss function)? If there is a better loss/objective function (and gradient/hessian), is there a paper or reference for this?
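No answer is recorded above, but one common approach is to keep the standard binary:logistic objective, monitor a custom precision metric during training, and experiment with scale_pos_weight (values below 1 down-weight the positive class, which typically trades recall for precision). The sketch below uses the native xgboost API and made-up synthetic data; the keyword for the custom metric is custom_metric in recent versions and feval in older ones, so check the docs for your version:

    import numpy as np
    import xgboost as xgb

    # Made-up, roughly balanced synthetic data, just so the sketch runs end to end.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
    dtrain = xgb.DMatrix(X[:1500], label=y[:1500])
    dvalid = xgb.DMatrix(X[1500:], label=y[1500:])

    def precision_eval(preds, dmat):
        # With the built-in binary:logistic objective, preds here are probabilities.
        y_true = dmat.get_label()
        y_hat = (preds >= 0.5).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        return "precision", tp / max(tp + fp, 1)   # guard against zero positive predictions

    params = {"objective": "binary:logistic",
              "scale_pos_weight": 0.5}             # <1: fewer positive calls, higher precision
    booster = xgb.train(params, dtrain, num_boost_round=200,
                        evals=[(dvalid, "valid")],
                        custom_metric=precision_eval,   # feval= on older xgboost versions
                        maximize=True, early_stopping_rounds=20)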
Let's say I have three classes that represent the cervix type to be classified by my model. The overall goal is to predict the correct cervix class so the health provider can give the patient the most appropriate treatment for their cervical cancer. Misclassifying the cervix type would cost diagnosis time for health providers and treatment fees for the patients. In this case, is precision more important than recall?
Just to remember:
Recall: the ratio of the true positives to all actual positives, TP / (TP + FN). So, it measures how much of a class was correctly predicted out of all the samples of that class. It's good when your goal is to identify all the samples of a class (for example, predict all people who have cancer).
Precision: the ratio of the true positives to all predicted positives, TP / (TP + FP). So, it measures how much of what was predicted as a class really belongs to it. It's good when your goal is to avoid false positives (for example, avoid predicting a woman is pregnant when actually she isn't).
In this case, I think it is important to know more about the consequences of a mistake. Is there some life risk if a person is of type A but is predicted as type B? Is there a type that is the safest (even if a person is misclassified, there is no life risk)? And a most dangerous one (there is a life risk)?
According to the answers, you will be able to choose the best metric.
Assuming there is a safest and a most dangerous type, the recall for the most dangerous one would be the most relevant metric, because you must identify as many of those cases as you can.
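A hedged sketch of how one might check this in practice with scikit-learn: compute per-class recall and precision (the label names and predictions below are invented) and watch the recall of the type you consider most dangerous:

    from sklearn.metrics import precision_score, recall_score

    # Invented labels/predictions for the three cervix types, for illustration only.
    y_true = ["type1", "type2", "type3", "type2", "type3", "type3", "type1", "type2"]
    y_pred = ["type1", "type2", "type2", "type2", "type3", "type1", "type1", "type3"]

    classes = ["type1", "type2", "type3"]
    recalls = recall_score(y_true, y_pred, labels=classes, average=None)
    precisions = precision_score(y_true, y_pred, labels=classes, average=None)

    for cls, r, p in zip(classes, recalls, precisions):
        print(f"{cls}: recall={r:.2f} precision={p:.2f}")
    # If, say, "type3" were the most dangerous type, its recall is the number to maximize.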
I don't have much knowledge about precision and recall. I have designed a recommender system. It gives me
precision value = 0.409
and recall value = 0.067
We know that precision and recall are inversely related, though I am not sure about that. So what does this say about my system?
Is it OK if I increase the precision value and decrease the recall value?
Precision is the percentage of correct decisions among the cases the model predicts as positive; it depends only on the model's positive predictions. Recall, on the other side, measures the percentage of correct decisions within the positive class itself (i.e. of all the truly positive cases, what percentage does the model get right).
Given a balanced dataset (the sizes of both classes are the same), fitting an SVM model to it I get a high AUC value (~0.9) but a low accuracy (~0.5).
I have totally no idea why would this happen, can anyone explain this case for me?
The ROC curve is biased towards the positive class. The described situation with high AUC and low accuracy can occur when your classifier achieves good performance on the positive class (high AUC) at the cost of a high false positive rate (i.e. a low number of true negatives).
The question of why the training process resulted in a classifier with poor predictive performance is very specific to your problem/data and the classification methods used.
The ROC analysis tells you how well the samples of the positive class can be separated from the other class, while the prediction accuracy hints on the actual performance of your classifier.
About ROC analysis
The general context for ROC analysis is binary classification, where a classifier assigns elements of a set into two groups. The two classes are usually referred to as "positive" and "negative". Here, we assume that the classifier can be reduced to the following functional behavior:
    def classifier(observation, t):
        # score_function (described below) measures the affinity of the observation
        # to the positive class; t is the decision threshold.
        if score_function(observation) <= t:
            return "negative"
        else:
            return "positive"
The core of a classifier is the scoring function that converts observations into a numeric value measuring the affinity of the observation to the positive class. Here, the scoring function incorporates the set of rules, the mathematical functions, the weights and parameters, and all the ingenuity that makes a good classifier. For example, in logistic regression classification, one possible choice for the scoring function is the logistic function that estimates the probability p(x) of an observation x belonging to the positive class.
In a final step, the classifier converts the computed score into a binary class assignment by comparing the score against a decision threshold (or prediction cutoff) t.
Given the classifier and a fixed decision threshold t, we can compute actual class predictions y_p for given observations x. To assess the capability of a classifier, the class predictions y_p are compared with the true class labels y_t of a validation dataset. If y_p and y_t match, we refer to as true positives TP or true negatives TN, depending on the value of y_p and y_t; or false positives FP or false negatives FN if y_p and y_t do not match.
We can apply this to the entire validation dataset and count the total number of TPs, TNs, FPs and FNs, as well as the true positive rate (TPR) and false positive rate (FPR), which are defined as follows:
TPR = TP / P = TP / (TP+FN) = number of true positives / number of positives
FPR = FP / N = FP / (FP+TN) = number of false positives / number of negatives
Note that the TPR is often referred to as the sensitivity, and the FPR is equivalent to 1-specificity.
In comparison, the accuracy is defined as the ratio of all correctly labeled cases and the total number of cases:
accuracy = (TP+TN)/(Total number of cases) = (TP+TN)/(TP+FP+TN+FN)
Given a classifier and a validation dataset, we can evaluate the true positive rate TPR(t) and false positive rate FPR(t) for varying decision thresholds t. And here we are: Plotting FPR(t) against TPR(t) yields the receiver-operator characteristic (ROC) curve. Below are some sample ROC curves, plotted in Python using roc-utils*.
Think of the decision threshold t as a final free parameter that can be tuned at the end of the training process. The ROC analysis offers means to find an optimal cutoff t* (e.g., Youden index, concordance, distance from optimal point).
Furthermore, we can examine with the ROC curve how well the classifier can discriminate between samples from the "positive" and the "negative" class:
Try to understand how the FPR and TPR change for increasing values of t. In the first extreme case (with some very small value for t), all samples are classified as "positive". Hence, there are no true negatives (TN=0), and thus FPR=TPR=1. By increasing t, both FPR and TPR gradually decrease, until we reach the second extreme case, where all samples are classified as negative, and none as positive: TP=FP=0, and thus FPR=TPR=0. In this process, we start in the top right corner of the ROC curve and gradually move to the bottom left.
In the case where the scoring function is able to separate the samples perfectly, leading to a perfect classifier, the ROC curve passes through the optimal point FPR(t)=0 and TPR(t)=1 (see the left figure below). In the other extreme case where the distributions of scores coincide for both classes, resulting in a random coin-flipping classifier, the ROC curve travels along the diagonal (see the right figure below).
Unfortunately, it is very unlikely that we can find a perfect classifier that reaches the optimal point (0,1) in the ROC curve. But we can try to get as close to it as possible.
The AUC, or the area under the ROC curve, tries to capture this characteristic. It is a measure of how well a classifier can discriminate between the two classes. It varies between 0 and 1. In the case of a perfect classifier, the AUC is 1. A classifier that assigns a random class label to input data would yield an AUC of 0.5.
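Since the roc-utils figures are not reproduced here, below is a minimal scikit-learn sketch (the scores are made up: negatives drawn around 0, positives around 1, with deliberate overlap) that computes the FPR(t)/TPR(t) pairs and the AUC for such a scoring function:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(42)
    y_true = np.concatenate([np.zeros(500), np.ones(500)])
    scores = np.concatenate([rng.normal(0.0, 1.0, 500),   # scores of the negatives
                             rng.normal(1.0, 1.0, 500)])  # scores of the positives

    fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR(t) and TPR(t) for all cutoffs t
    print("AUC =", roc_auc_score(y_true, scores))     # roughly 0.76 for this amount of overlap

    # Optional plot of the ROC curve:
    # import matplotlib.pyplot as plt
    # plt.plot(fpr, tpr); plt.plot([0, 1], [0, 1], "--"); plt.show()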
* Disclaimer: I'm the author of roc-utils
I guess you are misreading the correct class when calculating the ROC curve...
That would explain the low accuracy and the high (wrongly calculated) AUC.
It is easy to see that AUC can be misleading when used to compare two classifiers if their ROC curves cross. Classifier A may produce a higher AUC than B, while B performs better for a majority of the thresholds with which you may actually use the classifier. And in fact empirical studies have shown that it is indeed very common for ROC curves of common classifiers to cross. There are also deeper reasons why AUC is incoherent and therefore an inappropriate measure (see references below).
http://sandeeptata.blogspot.com/2015/04/on-dangers-of-auc.html
Another simple explanation for this behaviour is that your model is actually very good - just its final threshold to make predictions binary is bad.
I came across this problem with a convolutional neural network on a binary image classification task. Consider, e.g., that you have 4 samples with labels 0, 0, 1, 1. Let's say your model produces continuous predictions for these four samples: 0.7, 0.75, 0.9 and 0.95.
We would consider this to be a good model, since high values (> 0.8) predict class 1 and low values (< 0.8) predict class 0. Hence, the ROC-AUC would be 1. Note how I used a threshold of 0.8. However, if you use a fixed and badly-chosen threshold for these predictions, say 0.5, which is what we sometimes force upon our model output, then all 4 sample predictions would be class 1, which leads to an accuracy of 50%.
Note that most models optimize not for accuracy, but for some sort of loss function. In my CNN, training for just a few epochs longer solved the problem.
Make sure that you know what you are doing when you transform a continuous model output into a binary prediction. If you do not know what threshold to use for a given ROC curve, have a look at Youden's index or find the threshold value that represents the "most top-left" point in your ROC curve.
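A minimal sketch of the threshold-selection step mentioned above, applying Youden's J (TPR - FPR) to the four-sample example from this answer (labels 0, 0, 1, 1 with predictions 0.7, 0.75, 0.9, 0.95):

    import numpy as np
    from sklearn.metrics import roc_curve, accuracy_score

    y_true = np.array([0, 0, 1, 1])
    scores = np.array([0.70, 0.75, 0.90, 0.95])

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    j = tpr - fpr                               # Youden's J statistic at each cutoff
    best_t = thresholds[np.argmax(j)]           # the "most top-left" point of the ROC curve

    print("best threshold:", best_t)                                                    # 0.9 here
    print("accuracy at 0.5:", accuracy_score(y_true, (scores >= 0.5).astype(int)))      # 0.5
    print("accuracy at best:", accuracy_score(y_true, (scores >= best_t).astype(int)))  # 1.0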
If this is happening every single time, maybe your model is not correct.
Try changing the kernel and refitting the model on new data splits.
Look at the confusion matrix every time and check the TN and TP counts. The model is probably inadequate at detecting one of them.
When we calculate the F-Measure considering both Precision and Recall, we take the harmonic mean of the two measures instead of a simple arithmetic mean.
What is the intuitive reason behind taking the harmonic mean and not a simple average?
To explain, consider for example what the average of 30 mph and 40 mph is. If you drive for 1 hour at each speed, the average speed over the 2 hours is indeed the arithmetic average, 35 mph.
However if you drive for the same distance at each speed -- say 10 miles -- then the average speed over 20 miles is the harmonic mean of 30 and 40, about 34.3mph.
The reason is that for the average to be valid, you really need the values to be in the same scaled units. Miles per hour need to be compared over the same number of hours; to compare over the same number of miles you need to average hours per mile instead, which is exactly what the harmonic mean does.
Precision and recall both have true positives in the numerator, and different denominators. To average them it really only makes sense to average their reciprocals, thus the harmonic mean.
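A throwaway numeric check of the speed example, and of how the same logic carries over to precision and recall (the 0.8/0.4 values are arbitrary):

    # Average speed over equal *times*: arithmetic mean.
    print((30 + 40) / 2)                # 35.0 mph

    # Average speed over equal *distances*: harmonic mean.
    print(2 * 30 * 40 / (30 + 40))      # ~34.29 mph

    # Same idea for precision and recall (shared numerator, different denominators):
    precision, recall = 0.8, 0.4
    print(2 * precision * recall / (precision + recall))   # ~0.533, pulled toward the smaller value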
Because it punishes extreme values more.
Consider a trivial method (e.g. always returning class A). There are infinite data elements of class B, and a single element of class A:
Precision: 0.0
Recall: 1.0
Taking the arithmetic mean of these two numbers would give 0.5, i.e. 50%, despite this being close to the worst possible outcome! With the harmonic mean, the F1-measure is 0.
Arithmetic mean: 0.5
Harmonic mean: 0.0
In other words, to have a high F1, you need to both have a high precision and recall.
The above answers are well explained. This is just a quick reference for understanding the nature of the arithmetic mean and the harmonic mean with plots. Consider the X and Y axes as precision and recall, and the Z axis as the F1 score. From the plot of the harmonic mean, both precision and recall have to contribute evenly for the F1 score to rise, unlike with the arithmetic mean.
(One plot shows the surface of the arithmetic mean, the other the surface of the harmonic mean; see the 3D-plot file in the references.)
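The original images are not reproduced here, but a small matplotlib sketch along the same lines regenerates the two surfaces (axes: precision and recall; height: the mean):

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection on older matplotlib)

    p = np.linspace(0.01, 1, 100)
    P, R = np.meshgrid(p, p)

    arithmetic = (P + R) / 2
    harmonic = 2 * P * R / (P + R)       # this is exactly the F1 score

    fig = plt.figure(figsize=(10, 4))
    for i, (Z, title) in enumerate([(arithmetic, "Arithmetic mean"),
                                    (harmonic, "Harmonic mean (F1)")], start=1):
        ax = fig.add_subplot(1, 2, i, projection="3d")
        ax.plot_surface(P, R, Z, cmap="viridis")
        ax.set_xlabel("Precision"); ax.set_ylabel("Recall"); ax.set_title(title)
    plt.show()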
The harmonic mean is the equivalent of the arithmetic mean for reciprocals of quantities that should be averaged by the arithmetic mean. More precisely, with the harmonic mean, you transform all your numbers to the "averageable" form (by taking the reciprocal), you take their arithmetic mean and then transform the result back to the original representation (by taking the reciprocal again).
Precision and recall are "naturally" suited to this reciprocal treatment because their numerator is the same and their denominators are different; fractions are more sensibly averaged by the arithmetic mean when they have the same denominator.
For more intuition, suppose that we keep the number of true positive items constant. Then by taking the harmonic mean of the precision and the recall, you implicitly take the arithmetic mean of the false positives and the false negatives. It basically means that false positives and false negatives are equally important to you when the true positives stay the same. If an algorithm has N more false positives but N fewer false negatives (while having the same true positives), the F-measure stays the same.
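A quick throwaway check of that swap-invariance claim (the counts are made up; trading 5 false positives for 5 false negatives with the true positives fixed):

    def f1(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    print(f1(tp=50, fp=10, fn=20))   # 0.7692...
    print(f1(tp=50, fp=15, fn=15))   # 0.7692..., unchanged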
In other words, the F-measure is suitable when:
mistakes are equally bad, whether they are false positives or false negatives
the number of mistakes is measured relative to the number of true positives
true negatives are uninteresting
Point 1 may or may not be true, there are weighted variants of the F-measure that can be used if this assumption isn't true. Point 2 is quite natural since we can expect the results to scale if we just classify more and more points. The relative numbers should stay the same.
Point 3 is quite interesting. In many applications negatives are the natural default and it may even be hard or arbitrary to specify what really counts as a true negative. For example a fire alarm is having a true negative event every second, every nanosecond, every time a Planck time has passed etc. Even a piece of rock has these true negative fire-detection events all the time.
Or in a face detection case, most of the time you "correctly don't return" billions of possible areas in the image but this is not interesting. The interesting cases are when you do return a proposed detection or when you should return it.
By contrast the classification accuracy cares equally about true positives and true negatives and is more suitable if the total number of samples (classification events) is well-defined and rather small.
Here we already have some elaborate answers, but I thought some more information would be helpful for those who want to delve deeper (especially into why we use the F measure).
According to the theory of measurement, the composite measure should satisfy the following 6 definitions:
Connectedness (any two pairs can be ordered) and transitivity (if e1 >= e2 and e2 >= e3, then e1 >= e3).
Independence: two components contribute their effects independently to the effectiveness.
Thomsen condition: Given that at a constant recall (precision) we find a difference in effectiveness for two values of precision (recall), then this difference cannot be removed or reversed by changing the constant value.
Restricted solvability.
Each component is essential: Variation in one while leaving the other constant gives a variation in effectiveness.
Archimedean property for each component. It merely ensures that the intervals on a component are comparable.
We can then derive the effectiveness function (with P = precision, R = recall and a weight alpha in [0, 1]):

    E = 1 - 1 / (alpha * (1/P) + (1 - alpha) * (1/R))

Normally we don't use the effectiveness E but the much simpler F score, because F is just 1 - E:

    F = 1 / (alpha * (1/P) + (1 - alpha) * (1/R))

Now we take the general formula of the F measure:

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)

where we can place more emphasis on recall or precision by setting beta, because beta is defined as follows:

    beta^2 = (1 - alpha) / alpha

If we weight recall as more important than precision (all relevant items should be selected), we can set beta to 2 and we get the F2 measure. And if we do the reverse and weight precision higher than recall (as many selected items as possible should be relevant, for instance in some grammatical error correction scenarios like CoNLL), we just set beta to 0.5 and get the F0.5 measure. And obviously, we can set beta to 1 to get the most used F1 measure (the harmonic mean of precision and recall).
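To make the beta weighting concrete, here is a small sketch (a plain implementation of the formula above, with scikit-learn's fbeta_score as a cross-check on made-up labels):

    from sklearn.metrics import fbeta_score, f1_score

    def f_beta(precision, recall, beta):
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    precision, recall = 0.6, 0.9
    print(f_beta(precision, recall, beta=1))    # 0.72, the usual F1
    print(f_beta(precision, recall, beta=2))    # ~0.818, recall weighted higher
    print(f_beta(precision, recall, beta=0.5))  # ~0.643, precision weighted higher

    # Cross-check on made-up hard labels:
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
    print(f1_score(y_true, y_pred), fbeta_score(y_true, y_pred, beta=2))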
I think to some extent I have already answered why we do not use the arithmetic mean.
References:
https://en.wikipedia.org/wiki/F1_score
The truth of the F-measure
Information Retrieval
File:Harmonic mean 3D plot from 0 to 100.png
The harmonic mean has the least value compared to the geometric mean and the arithmetic mean: min <= Harmonic Mean <= Geometric Mean <= Arithmetic Mean <= max.
Secondly, precision and recall are ratios. The harmonic mean is the best measure for averaging ratios. (The arithmetic mean is suitable for additive/linear data, the geometric mean for multiplicative data, and the harmonic mean for rates and ratios.)