ML model validation on skewed data - machine-learning

Say I'm building an ML model to predict whether a patient has flu or not. I know that, on average, only 2 out of 100 patients in the population have flu.
Usually, to estimate a model's accuracy I just calculate what percentage of new data the model labels correctly:
accuracy rate = (correctly identified patients / total number of patients)
But in this case, I can write a model that labels all patients as not having flu and it will be accurate 98% of the time.
So the estimator should probably consider not only how many patients the model labeled correctly but also how many of the sick patients it actually found, something like
accuracy rate = (correctly identified patients / total number of patients) *
(correctly identified patients with flu / total number of patient with flu)
But this estimator has no real-world interpretation.
Is this the right way to think about it, and how would you calculate the accuracy rate of a model on such skewed data? Thanks!

If you want a balanced model, the long answer is "it depends"; the short answer is to look into something called the Matthews Correlation Coefficient (MCC), also known as the Phi coefficient.
As you saw, accuracy is a really bad metric when facing imbalanced datasets. The MCC takes the size of the classes into account and corrects for that. It delivers the same result for the same model performance, no matter what the makeup of the dataset is.
TP = Number of true positives
TN = Number of true negatives
FP = Number of false positives
FN = Number of false negatives
MCC = (TP * TN - FP * FN) / sqrt((TP + FP)*(TP + FN)*(TN + FP)*(TN + FN))
MCC = 1 -> Perfect prediction
MCC = 0 -> No correlation
MCC = -1 -> Absolute contradiction
Just from experience (in my field, therefore with a huuuuge grain of salt):
reasonable models for the companies I work with usually start around MCC >= 0.75
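To make the formula concrete, here is a minimal sketch (assuming scikit-learn is installed; the confusion-matrix counts are invented for an imbalanced, roughly 2%-positive dataset) that computes the MCC both by hand and with sklearn.metrics.matthews_corrcoef:
from math import sqrt
from sklearn.metrics import matthews_corrcoef

# Invented confusion-matrix counts for an imbalanced (~2% positive) dataset.
TP, TN, FP, FN = 10, 960, 20, 10

# MCC straight from the formula above.
mcc_manual = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# The same value via scikit-learn, using label vectors that reproduce those counts.
y_true = [1] * TP + [0] * TN + [0] * FP + [1] * FN
y_pred = [1] * TP + [0] * TN + [1] * FP + [0] * FN
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(mcc_manual, mcc_sklearn)  # both ~0.39 here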

I think you should use MAP (mean average precision). For that you need to calculate recall and precision:
Recall = (True Positive) / (True Positive + False Negative)
Precision = (True Positive) / (True Positive + False Positive)
Positive: patient has flu
Negative: patient does not have flu
True: correctly identified
False: incorrectly identified
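For concreteness, here is a minimal sketch (assuming scikit-learn; the label and score vectors are invented) that computes precision, recall, and average precision, the single-query form of MAP, for a binary flu problem where 1 = has flu:
from sklearn.metrics import precision_score, recall_score, average_precision_score

# Invented ground truth and predictions: 1 = patient has flu, 0 = no flu.
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred  = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]                      # hard labels
y_score = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1]  # model scores

print(precision_score(y_true, y_pred))           # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))              # TP / (TP + FN) = 2/3
print(average_precision_score(y_true, y_score))  # area under the precision-recall curve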

Related

In a classification problem, why is the F1-score more suitable than accuracy when the classes are unbalanced?

I understand that the F1 score matters more when false positives/false negatives are crucial for judging a classifier. I read on a site that "F1 Score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution". The claim that the F1 score is more suitable for uneven or unbalanced classes also appears on other sites, but what is the reason for this?
Let's say you have class A with 1000 samples and class B with 100 samples.
Here is what happens when you use accuracy as the evaluation metric,
where
Accuracy = (correct predictions for class A + correct predictions for class B) / total predictions
Let's say that out of the 1000 samples from class A, 950 are predicted correctly, and for class B, 10 out of 100 are predicted correctly.
Then, as per the accuracy formula,
Accuracy = (950 (class A correct predictions) + 10 (class B correct predictions)) / 1100
Accuracy = 0.8727272727272727 (87%)
In this imbalanced case we got 87% accuracy, which looks good, but notice that for class B we only predicted 10 records correctly out of 100. Our model is essentially unable to predict class B, yet the accuracy metric says the model is very good (87% accuracy).
So for cases like this we use the f1-score, which handles evaluation of imbalanced problems.
F1 = 2 * (precision * recall) / (precision + recall)
The f1-score takes precision and recall into consideration, hence it is important to evaluate a model with the f1-score in the case of imbalanced data. If you still want to use accuracy as a metric, use class-wise accuracy instead, i.e. accuracy for class A and accuracy for class B separately (see the sketch below).
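As a rough sketch of that point (assuming a plain binary A-vs-B setup, so every misclassified sample is assigned to the other class), here is the 1000/100 example above worked out in Python; the accuracy looks fine while the F1 for the minority class B is very low:
# Class A: 1000 samples, 950 correct; class B: 100 samples, 10 correct.
# Treat B (the minority class) as the positive class.
TP = 10    # B correctly predicted as B
FN = 90    # B wrongly predicted as A
FP = 50    # A wrongly predicted as B
TN = 950   # A correctly predicted as A

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 960 / 1100 ~ 0.87
precision = TP / (TP + FP)                                  # 10 / 60  ~ 0.17
recall    = TP / (TP + FN)                                  # 10 / 100 = 0.10
f1        = 2 * precision * recall / (precision + recall)   # ~ 0.125

print(accuracy, precision, recall, f1)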

How to improve precision-recall scores as the number of negative labels in the test set increases

Consider the below scenario:
I have batches of data whose features and labels have similar distribution.
Say something like 4000000 negative labels and 25000 positive labels.
As it's a highly imbalanced set, I have undersampled the negative labels so that my training set (taken from one of the batches) now contains 25000 positive labels and 500000 negative labels.
Now I am trying to measure the precision and recall on a test set (generated from a different batch) after training.
I am using XGBoost with 30 estimators.
Now if I use all of the 40000000 negative labels, I get a worse precision-recall score (0.1 precision and 0.1 recall at a 0.7 threshold) than if I use a subset of, say, just 500000 negative labels (0.4 precision with 0.1 recall at a 0.3 threshold).
What could be a potential reason that this could happen?
A few of the thoughts that I had:
The features of the 500000 negative labels are vastly different from the rest of the overall 40000000 negative labels.
But when I plot the individual features, their central tendencies closely match those of the subset.
Are there any other ways to identify why I get worse precision-recall scores when the number of negative labels increases so much?
Are there any ways to compare the distributions?
Is my undersampled training set a cause of this?
To understand this, we first need to understand how precision and recall are calculated. For this I will use the following variables:
P - total number of positives
N - total number of negatives
TP - number of true positives
TN - number of true negatives
FP - number of false positives
FN - number of false negatives
It is important to note that:
P = TP + FN
N = TN + FP
Now, precision is TP/(TP + FP)
recall is TP/(TP + FN), therefore TP/P.
Accuracy is (TP + TN)/(TP + TN + FP + FN), hence (TP + TN)/(P + N).
In your case, where the data is imbalanced, we have N >> P.
Now imagine some random model. We can usually say that for such a model the accuracy is around 50%, but that is only if the data is balanced. In your case, there will tend to be more FPs and TNs than TPs and FNs, because a random selection of the data has a higher likelihood of returning a negative sample.
So we can establish that the higher the proportion of negative samples N/(P+N), the more FPs and TNs we get. That is, whenever your model is not able to select the correct label, it will pick a random label out of P and N, and that pick is mostly going to be N.
Remember that FP appears in the denominator of precision? This means that precision also decreases with increasing N/(P+N).
For recall, we have neither FP nor TN in its formula, so it will likely not change much with increasing N/(P+N). As can be seen in your example, it clearly stays the same.
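To make this concrete, here is a small numeric sketch (invented numbers, assuming the model keeps the same recall and the same false-positive rate as N grows) showing precision shrinking while recall stays put:
P = 25_000        # positives in the test set
recall = 0.1      # fraction of positives the model finds (fixed)
fp_rate = 0.01    # fraction of negatives wrongly flagged positive (fixed)

for N in (500_000, 40_000_000):
    TP = recall * P             # does not depend on N
    FP = fp_rate * N            # grows linearly with N
    precision = TP / (TP + FP)
    print(N, round(precision, 3))   # 0.333 for 500k negatives, ~0.006 for 40M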
Therefore, I would try to make the data balanced to get better results. A ratio of 1:1.5 should do.
You can also use a different metric like the F1 score that combines precision and recall to get a better understanding of the performance.
Also check some of the other points made here on how to combat imbalanced data.

Calculate precision and recall for a text mining result

I am doing a project to find disease-associated genes using text mining. I am using 1000 articles for this. I got around 129 gene names. The actual dataset contains around 1000 entries. Now I would like to calculate the precision and recall of my method. When I did the comparison, out of the 129 genes, 72 were found to be correct. So the
precision = 72/129.
Is it correct?
Now how can I calculate the recall? Please help
The Wikipedia Article on Precision and Recall might help.
The definitions are:
Precision: tp / (tp+fp)
Recall: tp / (tp + fn)
Where tp are the true positives (genes which are associated with disease and you found them), fp are the false positives (genes you found but they actually aren't associated with the disease) and fn are the false negatives (genes which are actually associated with the disease but you didn't find them).
I am not quite sure what the numbers you have posted represent. Do you know the genes which are truly associated with the disease?
You have most likely computed the accuracy:
Accuracy = (tp + tn) / (Total Number)
The main issue is that the articles I am considering might not contain all of the originally listed gene names, since it's a small dataset. So while calculating the recall, instead of using 1000 as the denominator, I can compare the original database of genes against the articles to find out how many of the originally associated genes are actually present in the literature. I.e., if there are 1000 associated genes, I will check how many of those 1000 appear in the dataset I am considering. If it is 300, I will set the denominator to 300 instead of 1000. That will give the recall.
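A small sketch of that calculation (assuming, as in the example above, that 300 of the database genes actually appear in the 1000 articles):
found_genes     = 129   # genes returned by the text-mining method
correct_genes   = 72    # of those, genes that are truly disease-associated
reachable_genes = 300   # database genes actually present in the articles (assumed)

precision = correct_genes / found_genes      # 72 / 129 ~ 0.56
recall    = correct_genes / reachable_genes  # 72 / 300 = 0.24
print(precision, recall)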

What is the f-measure for each class in Weka?

When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us 3 f-measures: the f-measure for class 1, for class 2, and the weighted f-measure.
I'm so confused! I thought the f-measure is a balanced measure that shows balanced performance across multiple classes, so what do the f-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, in which a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of how many elements are in each class.
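A minimal sketch (assuming scikit-learn; the labels are invented, not Weka output) of the same per-class and weighted f-scores that Weka reports:
from sklearn.metrics import f1_score, classification_report

# Invented ground truth and predictions for a 2-class problem (classes 1 and 2).
y_true = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
y_pred = [1, 1, 2, 1, 2, 2, 2, 1, 2, 2]

print(f1_score(y_true, y_pred, average=None))        # one f-score per class
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class sizes
print(classification_report(y_true, y_pred))         # per-class precision/recall/f1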
I am confused too.
I used the same equation for the f-score of each class, based on their precision and recall, but the results are different!
Example:
(image: f-score differing from Weka's calculation)

What does recall mean in Machine Learning?

What's the meaning of the recall of a classifier, e.g. a Bayes classifier? Please give an example.
For example, precision = correct / (correct + wrong) docs for the test data. How should I understand recall?
Recall literally is how many of the true positives were recalled (found), i.e. how many of the correct hits were also found.
Precision (your formula is incorrect) is how many of the returned hits were true positives, i.e. how many of the found hits were correct.
I found the explanation of Precision and Recall from Wikipedia very useful:
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 dogs identified, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
Precision in ML is the same as in Information Retrieval.
recall = TP / (TP + FN)
precision = TP / (TP + FP)
(Where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative).
It makes sense to use these notations for a binary classifier, where usually the "positive" class is the less common one. Note that the precision/recall metrics are actually the specific #classes=2 form of the more general confusion matrix.
Also, your notation of "precision" is actually accuracy, which is (TP + TN) / ALL.
Let me give you an example. Imagine we have a machine learning model which can detect cat vs dog. The actual label, which is provided by a human, is called the ground truth.
Again, the output of your model is called the prediction. Now look at the following table:
ExampleNo   Ground-truth   Model's Prediction
0           Cat            Cat
1           Cat            Dog
2           Cat            Cat
3           Dog            Cat
4           Dog            Dog
Say we want to find the recall for the class cat. By definition, recall means the percentage of a certain class correctly identified (out of all the given examples of that class). So for the class cat, the model correctly identified it 2 times (in examples 0 and 2). But does that mean there are actually only 2 cats? No! In reality there are 3 cats in the ground truth (human labeled). So what is the percentage of correct identifications of this class? 2 out of 3, that is (2/3) * 100% = 66.67%, or 0.667 if you normalize it to 1. There is another prediction of cat in example 3, but it is not a correct prediction, hence we do not count it.
Now coming to the mathematical formulation. First understand two terms:
TP (True positive): Predicting something as positive when it is actually positive. If cat is our positive class, then predicting something is a cat when it actually is a cat.
FN (False negative): Predicting something as negative when it is actually positive.
Now for a certain class, this classifier's output can be of two types: Cat or Dog (Not Cat). The number of correct identifications is the number of true positives (TP). The total number of examples of that class in the ground truth is TP + FN, because out of all the cats, the model either detected them correctly (TP) or didn't (FN, i.e. the model falsely said negative (non-cat) when the example was actually positive (cat)). So for a certain class, TP + FN denotes the total number of examples available in the ground truth for that class. So the formula is:
Recall = TP / (TP + FN)
Similarly, recall can be calculated for Dog as well. In that case, treat Dog as the positive class and Cat as the negative class.
So for any number of classes, to find the recall of a certain class, take that class as the positive class and the rest of the classes as the negative class, then use the formula. Repeat the process for each class to find the recall of all of them.
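Here is a short sketch (plain Python, using the five-row table above) that computes the recall for each class exactly this way:
# Ground truth and predictions from the table above (examples 0-4).
ground_truth = ['Cat', 'Cat', 'Cat', 'Dog', 'Dog']
prediction   = ['Cat', 'Dog', 'Cat', 'Cat', 'Dog']

def recall_for(cls):
    # TP: truly cls and predicted cls; FN: truly cls but predicted something else.
    tp = sum(1 for t, p in zip(ground_truth, prediction) if t == cls and p == cls)
    fn = sum(1 for t, p in zip(ground_truth, prediction) if t == cls and p != cls)
    return tp / (tp + fn)

print(recall_for('Cat'))  # 2 / 3 ~ 0.667
print(recall_for('Dog'))  # 1 / 2 = 0.5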
If you want to learn about precision as well then go here: https://stackoverflow.com/a/63121274/6907424
In very simple language: for example, in a series of photos showing politicians, how many times was the photo of politician XY correctly recognised as showing A. Merkel and not some other politician?
Precision is hurt by how many times ANOTHER person was recognised as her (false positives): (correct hits) / ((correct hits) + (false positives)).
Recall is hurt by how many photos of her were missed, i.e. not 'recalled' (false negatives): (correct hits) / ((correct hits) + (false negatives)).
