Are accuracy, precision and recall measured for negative cases as well?

Most examples I have seen of accuracy, precision and recall focus on the positive class, e.g.:
Accuracy: the ratio of correct predictions (true positives + true negatives) to the total number of predictions.
Precision: the fraction of the cases classified as positive that are actually positive (the number of true positives divided by the number of true positives plus false positives).
Recall: the fraction of positive cases correctly identified (the number of true positives divided by the number of true positives plus false negatives).
Is it better to measure these metrics for both positive and negative values?
For example, consider these two scenarios.
If these metrics are computed only for the positive class, then:
Accuracy - the model is 80% accurate in scenario 1 and 99.9% accurate in scenario 2.
Precision - if the model predicts that something is positive, it is 75% correct in scenario 1 and 99.9% correct in scenario 2.
Recall - the model identifies 90% of positive cases in scenario 1 and 100% of positive cases in scenario 2. So recall looks fine in both.
However, the model in scenario 2 doesn't actually work, since it fails 100% of the time on negative cases.
In practice, is it better to measure the metrics for both positive and negative cases?
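(In practice, standard tooling does report these metrics per class. A minimal sketch assuming scikit-learn is available, with made-up labels: `classification_report` prints precision, recall and F1 once per class, treating each class in turn as the "positive" one.)

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# One precision/recall/F1 row per class: for the "negative" row, the same
# formulas are applied with the class roles swapped (TN plays the role of TP).
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```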

Related

Precision of a model in training (balanced dataset) versus production (imbalanced dataset)

I have a balanced dataset used for model training purposes. There are two classes. My model has a precision of 50%, meaning that out of 100 samples it predicts that 50 are positive, and of those 50 only 25 are actually positive. The model is basically as good as flipping a coin.
Now in production the data is highly unbalanced, say only 4 out of 100 samples are positive. Will my model still have the same precision?
The way I understand it, my coin-flip model would then label 50 samples as positive, of which only 2 would actually be positive, so precision would be 4% (2/50) in production.
Is it true that a model that was trained on a balanced dataset would have a different precision in production?
That depends: of those 50 samples classified as positive, are all 25 true positive samples correctly classified?
If your model correctly predicts every positive sample as positive and also labels negative samples as positive (high sensitivity, low specificity), your precision would be around 8% (4/50). Nevertheless, you should revisit your training, since for 50% precision you don't need an ML model but rather a one-liner generating a random variable between 0 and 1.
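To make the dependence on the base rate explicit, here is a minimal sketch in plain Python; the rate values are the ones discussed above (the coin-flip model and the perfect-sensitivity variant):

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    # Expected population fractions of true and false positives.
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

# Coin-flip model (sensitivity 0.5, specificity 0.5):
print(precision_from_rates(0.5, 0.5, 0.50))  # 0.50 on the balanced training set
print(precision_from_rates(0.5, 0.5, 0.04))  # 0.04 in production, i.e. 4%

# The variant above (every positive caught, specificity still 0.5):
print(precision_from_rates(1.0, 0.5, 0.04))  # ~0.077, i.e. around 8%
```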

What is the meaning of high precision and very low recall in a recommender system?

I don't have much knowledge about precision and recall. I have designed a recommender system. It gives me
precision value = 0.409
and recall value = 0.067
We are told that precision and recall are inversely related, though I am not sure about that. So what about my system? Is it OK if I increase the precision value and decrease the recall value?
Precision is the percentage of correct decisions among the cases the model predicts as positive (it depends only on the model's positive predictions). Recall, on the other hand, measures the percentage of correct decisions within the actual positive class (i.e. of all truly positive cases, what fraction the model identifies).
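For a recommender, these definitions are usually applied to the top-k recommendation list. A minimal sketch, assuming a single user's recommendations and relevant items (all IDs are made up):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's recommendation list."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k, hits / len(relevant)

# 10 recommendations measured against 30 relevant items (hypothetical IDs).
recommended = ["i3", "i7", "i1", "i9", "i5", "i2", "i8", "i4", "i6", "i0"]
relevant = ["i1", "i5", "i9"] + [f"r{n}" for n in range(27)]  # 30 relevant items
print(precision_recall_at_k(recommended, relevant, k=10))  # (0.3, 0.1)
```

A short list measured against a large relevant set naturally produces exactly the pattern in the question: moderate precision with very low recall.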

F1 Score vs ROC AUC

I have the following F1 and AUC scores for two different models:
Model 1: Precision: 85.11, Recall: 99.04, F1: 91.55, AUC: 69.94
Model 2: Precision: 85.1, Recall: 98.73, F1: 91.41, AUC: 71.69
The main motive of my problem is to predict the positive cases correctly, i.e. reduce the false negative (FN) cases. Should I use the F1 score and choose Model 1, or use AUC and choose Model 2? Thanks.
Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 score, think of it as comparing your model performance based on:
[Sensitivity vs (1 - Specificity)] VS [Precision vs Recall]
Note that sensitivity is recall (they are the exact same metric).
Now we need to understand, intuitively, what specificity, precision and recall (sensitivity) are!
Background
Specificity is given by the following formula:
Specificity = TN / (TN + FP)
Intuitively speaking, if we have a 100% specific model, that means it did NOT miss any true negative; in other words, there were NO false positives (i.e. negative results falsely labeled as positive). Yet, there is a risk of having a lot of false negatives!
Precision is given by the following formula:
Precision = TP / (TP + FP)
Intuitively speaking, if we have a 100% precise model, that means every case it labeled as positive is a true positive; there were NO false positives. Yet, there is a risk of missing true positives (i.e. false negatives)!
Recall is given by the following formula:
Recall = TP / (TP + FN)
Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any true positive; in other words, there were NO false negatives (i.e. positive results falsely labeled as negative). Yet, there is a risk of having a lot of false positives!
As you can see, the three concepts are very close to each other!
As a rule of thumb, if the cost of having a false negative is high, we want to increase the model's sensitivity/recall (which are the exact same formula)!
For instance, in fraud detection or sick-patient detection, we don't want to label/predict a fraudulent transaction (true positive) as non-fraudulent (false negative). Also, we don't want to label/predict a contagious sick patient (true positive) as not sick (false negative).
This is because the consequences are worse than those of a false positive (incorrectly labeling a harmless transaction as fraudulent, or a non-contagious patient as contagious).
On the other hand, if the cost of having a false positive is high, then we want to increase the model's specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (true negative) as spam (false positive). On the other hand, failing to label a spam email as spam (false negative) is less costly.
F1 Score
It's given by the following formula:
F1 = 2 * Precision * Recall / (Precision + Recall)
The F1 score keeps a balance between precision and recall. We use it when there is an uneven class distribution, as looking at precision or recall alone may give misleading results!
So we use the F1 score as a single indicator that summarizes the precision and recall numbers!
Area Under the Receiver Operating Characteristic curve (AUROC)
It plots Sensitivity against (1 - Specificity); in other words, it compares the True Positive Rate vs the False Positive Rate.
So, the bigger the AUROC, the better the model distinguishes the positive class from the negative class!
AUROC vs F1 Score (Conclusion)
In general, the ROC curve covers many different threshold levels and thus has many F1 score values, whereas the F1 score applies to one particular point on the ROC curve.
You may think of F1 as a measure of precision and recall at a particular threshold value, whereas AUC is the area under the whole ROC curve. For the F score to be high, both precision and recall should be high.
Consequently, when you have a data imbalance between positive and negative samples, you should always use the F1 score, because ROC averages over all possible thresholds!
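To make the threshold point concrete, a sketch assuming scikit-learn and a synthetic imbalanced dataset: `roc_auc_score` is computed from the raw scores over all thresholds, while `f1_score` changes with the single threshold you pick to binarize them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, roughly 90% negatives / 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# AUC is a single number over ALL thresholds of the score ...
print("ROC AUC:", roc_auc_score(y_te, scores))

# ... while F1 changes with the threshold used to binarize the score.
for threshold in (0.3, 0.5, 0.7):
    print(f"F1 @ {threshold}:", f1_score(y_te, scores >= threshold))
```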
Further read:
Credit Card Fraud: Handling highly imbalance classes and why Receiver Operating Characteristics Curve (ROC Curve) should not be used, and Precision/Recall curve should be preferred in highly imbalanced situations
If you look at the definitions, you can see that both AUC and the F1-score optimize "something" together with the fraction of the truly positive sample that is correctly labeled "positive" (the recall).
This "something" is:
For the AUC, it's the specificity: the fraction of the truly negative sample that is correctly labeled. You're not looking at the fraction of your positively labeled samples that is correctly labeled.
Using the F1 score, it's the precision: the fraction of the positively labeled sample that is correctly labeled. Using the F1-score, you don't consider the purity of the sample labeled as negative (the specificity).
The difference becomes important when you have highly unbalanced or skewed classes, for example when there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more people "negative" than "positive", and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the sample labeled positive to include all positives if possible, but you don't want it to become huge due to a high false positive rate. So in this case you use the F1 score.
Conversely, if both classes make up 50% of your dataset, or both make up a sizable fraction, and you care equally about performance on each class, then you should use the AUC, which optimizes for both classes, positive and negative.
Just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case, comparing the effectiveness of drugs on patients, it was easy to learn which drugs are generally strong and which are weak. The big question was whether you can hit the outliers (the few positives for a weak drug, or the few negatives for a strong drug). To answer that, you have to explicitly up-weight the outliers with F1, which you don't need to do with AUC.
"to predict the positive cases correctly"
One can restate your goal a bit: when a case is really positive, you want to classify it as positive too. The probability of that event, p(predicted_label = positive | true_label = positive), is recall by definition. If you want to maximize this property of your model, you'd choose Model 1.

Why is the F-Measure a harmonic mean and not an arithmetic mean of the Precision and Recall measures?

When we calculate the F-Measure considering both Precision and Recall, we take the harmonic mean of the two measures instead of a simple arithmetic mean.
What is the intuitive reason behind taking the harmonic mean and not a simple average?
To explain, consider this example: what is the average of 30 mph and 40 mph? If you drive for 1 hour at each speed, the average speed over the 2 hours is indeed the arithmetic mean, 35 mph.
However, if you drive the same distance at each speed -- say 10 miles -- then the average speed over the 20 miles is the harmonic mean of 30 and 40, about 34.3 mph.
The reason is that for the average to be valid, you really need the values to be in the same scaled units. Miles per hour need to be compared over the same number of hours; to compare over the same number of miles you need to average hours per mile instead, which is exactly what the harmonic mean does.
Precision and recall both have true positives in the numerator, and different denominators. To average them it really only makes sense to average their reciprocals, thus the harmonic mean.
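A quick numeric check of the driving example, in plain Python (numbers from the answer above):

```python
# Same time at each speed -> arithmetic mean applies.
arith = (30 + 40) / 2  # 35.0 mph over the two hours

# Same distance at each speed -> harmonic mean applies:
# 10 miles at 30 mph takes 1/3 h; 10 miles at 40 mph takes 1/4 h.
harmonic = 2 / (1 / 30 + 1 / 40)  # ~34.29 mph
check = 20 / (10 / 30 + 10 / 40)  # total distance / total time, same value
print(arith, harmonic, check)
```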
Because it punishes extreme values more.
Consider a trivial method (e.g. always returning class A). There are infinitely many data elements of class B and a single element of class A:
Precision: 0.0
Recall: 1.0
Taking the arithmetic mean, this method would score 50%, despite being the worst possible outcome! With the harmonic mean, the F1-measure is 0.
Arithmetic mean: 0.5
Harmonic mean: 0.0
In other words, to have a high F1, you need both high precision and high recall.
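The same point as a two-line check (precision 0, recall 1, as in the degenerate classifier above):

```python
precision, recall = 0.0, 1.0
print((precision + recall) / 2)                       # 0.5 -- arithmetic mean
print(2 * precision * recall / (precision + recall))  # 0.0 -- harmonic mean (F1)
```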
The above answers are well explained; this is just a quick reference for understanding the nature of the arithmetic mean and the harmonic mean with plots. Consider the X and Y axes as precision and recall and the Z axis as the F1 score: the plot of the harmonic mean shows that both precision and recall must contribute evenly for the F1 score to rise, unlike with the arithmetic mean.
(Plot of the arithmetic mean.)
(Plot of the harmonic mean.)
The harmonic mean is the equivalent of the arithmetic mean for reciprocals of quantities that should be averaged by the arithmetic mean. More precisely, with the harmonic mean, you transform all your numbers to the "averageable" form (by taking the reciprocal), you take their arithmetic mean and then transform the result back to the original representation (by taking the reciprocal again).
Precision and recall are "naturally" reciprocals because their numerators are the same and their denominators differ. Fractions are more sensibly averaged by the arithmetic mean when they have the same denominator.
For more intuition, suppose that we keep the number of true positive items constant. Then by taking the harmonic mean of the precision and the recall, you implicitly take the arithmetic mean of the false positives and the false negatives. It basically means that false positives and false negatives are equally important to you when the true positives stay the same. If an algorithm has N more false positives but N fewer false negatives (while having the same true positives), the F-measure stays the same.
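A quick check of this invariance: written in raw counts, F1 = 2*TP / (2*TP + FP + FN), so only the sum FP + FN matters (the counts below are made up):

```python
def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# Trading 20 false negatives for 20 false positives leaves F1 unchanged:
print(f1_from_counts(tp=50, fp=10, fn=30))  # ~0.714
print(f1_from_counts(tp=50, fp=30, fn=10))  # same value
```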
In other words, the F-measure is suitable when:
mistakes are equally bad, whether they are false positives or false negatives
the number of mistakes is measured relative to the number of true positives
true negatives are uninteresting
Point 1 may or may not be true, there are weighted variants of the F-measure that can be used if this assumption isn't true. Point 2 is quite natural since we can expect the results to scale if we just classify more and more points. The relative numbers should stay the same.
Point 3 is quite interesting. In many applications negatives are the natural default and it may even be hard or arbitrary to specify what really counts as a true negative. For example a fire alarm is having a true negative event every second, every nanosecond, every time a Planck time has passed etc. Even a piece of rock has these true negative fire-detection events all the time.
Or in a face detection case, most of the time you "correctly don't return" billions of possible areas in the image but this is not interesting. The interesting cases are when you do return a proposed detection or when you should return it.
By contrast the classification accuracy cares equally about true positives and true negatives and is more suitable if the total number of samples (classification events) is well-defined and rather small.
Here we already have some elaborate answers, but I thought some more information would be helpful for those who want to delve deeper (especially into why the F-measure is defined the way it is).
According to the theory of measurement, the composite measure should satisfy the following 6 definitions:
Connectedness (two pairs can be ordered) and transitivity (if e1 >= e2 and e2 >= e3, then e1 >= e3).
Independence: two components contribute their effects independently to the effectiveness.
Thomsen condition: Given that at a constant recall (precision) we find a difference in effectiveness for two values of precision (recall), then this difference cannot be removed or reversed by changing the constant value.
Restricted solvability.
Each component is essential: Variation in one while leaving the other constant gives a variation in effectiveness.
Archimedean property for each component. It merely ensures that the intervals on a component are comparable.
We can then derive the effectiveness function:
E = 1 - 1 / (alpha/P + (1 - alpha)/R)
Normally we don't use the effectiveness E but the much simpler F score, because F is just 1 - E:
F = 1 / (alpha/P + (1 - alpha)/R)
Now take the general formula of the F measure:
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
where we can place more emphasis on recall or precision by setting beta, which is defined by:
beta^2 = (1 - alpha) / alpha
If we weight recall as more important than precision (all relevant items should be selected), we can set beta to 2 and get the F2 measure. If we do the reverse and weight precision higher than recall (as many selected elements as possible should be relevant, for instance in some grammatical-error-correction scenarios like CoNLL), we set beta to 0.5 and get the F0.5 measure. And obviously, we can set beta to 1 to get the most used F1 measure (the harmonic mean of precision and recall).
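A minimal sketch of the general formula (it matches what scikit-learn's `fbeta_score` computes, written here directly in terms of precision and recall; the P/R values are made up):

```python
def f_beta(precision, recall, beta):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.8  # made-up precision and recall
print(f_beta(p, r, beta=2.0))  # F2   ~0.714, recall weighted higher
print(f_beta(p, r, beta=1.0))  # F1   ~0.615, the harmonic mean
print(f_beta(p, r, beta=0.5))  # F0.5 ~0.541, precision weighted higher
```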
I think to some extent I have already answered why we do not use the arithmetic mean.
References:
https://en.wikipedia.org/wiki/F1_score
The truth of the F-measure
Information Retrieval
File:Harmonic mean 3D plot from 0 to 100.png
The harmonic mean always has the least value compared to the geometric mean and the arithmetic mean: min < harmonic mean < geometric mean < arithmetic mean < max.
Secondly, precision and recall are ratios, and the harmonic mean is the best measure for averaging ratios. (The arithmetic mean is suitable for additive/linear data, the geometric mean for multiplicative data.)
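A quick numeric check of that ordering for a precision/recall pair (the values are made up):

```python
import math

p, r = 0.4, 0.9
harmonic = 2 / (1 / p + 1 / r)
geometric = math.sqrt(p * r)
arithmetic = (p + r) / 2
# min <= harmonic <= geometric <= arithmetic <= max:
print(min(p, r) <= harmonic <= geometric <= arithmetic <= max(p, r))  # True
print(harmonic, geometric, arithmetic)  # ~0.554, 0.6, 0.65
```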

Why KNN has low accuracy but high precision?

I classified the 20NG dataset with k-NN, with 200 instances in each category and an 80-20 train-test split, and found the following results.
Accuracy is quite low here, but how can precision be high when accuracy is that low? Isn't the precision formula TP/(TP + FP)? If so, a highly accurate classifier needs to generate many true positives, which results in high precision, so how is k-NN producing high precision with such a low true positive rate?
Recall is equivalent to the true positive rate. Text classification tasks (especially information retrieval, but text categorization as well) show a trade-off between recall and precision: when precision is very high, recall tends to be low, and vice versa. This is because you can tune the classifier to classify more or fewer instances as positive. The fewer instances you classify as positive, the higher the precision and the lower the recall.
To ensure that the effectiveness measure correlates with accuracy, you should focus on the F-measure, which averages recall and precision (F-measure = 2*r*p / (r+p)).
Non-lazy classifiers follow a training process in which they try to optimize accuracy or error. k-NN, being lazy, has no training process and, in consequence, does not try to optimize any effectiveness measure. You can play with different values of K; intuitively, the bigger the K, the higher the recall and the lower the precision, and vice versa.
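A hedged sketch of that last suggestion, assuming scikit-learn and synthetic data standing in for 20NG (real text would first need vectorizing, e.g. with `TfidfVectorizer`): vary k and watch the per-class precision and recall move.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized text data.
X, y = make_classification(n_samples=400, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for k in (1, 5, 25):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)
    print(f"--- k = {k} ---")
    print(classification_report(y_te, pred))
```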
