Precision Versus Recall for Multiclass classification problem


Let's say I have three classes that represent the cervix type my model needs to classify. The overall goal is to predict the correct cervix class so the health provider can give the patient the most appropriate treatment for their cervical cancer. Misclassifying the cervix type would cost diagnosis time for health providers and treatment fees for the patients. In this case, is precision more important than recall?

As a reminder:
Recall: the ratio of true positives to all actual samples of the class, TP / (TP + FN). So, it measures how many of the samples of a class were correctly predicted. It's the metric to watch when your goal is to identify all the samples of a class (for example, to find all the people who have cancer).
Precision: the ratio of true positives to all samples predicted as the class, TP / (TP + FP). So, it measures how many of the samples predicted as a class really belong to it. It's the metric to watch when your goal is to avoid false positives (for example, to avoid predicting that a woman is pregnant when she actually isn't).
In this case, I think it is important to know more about the consequences of a mistake. Is there a risk to life if a person is of type A but is predicted as type B? Is there a type that is the safest (even if a person is misclassified, there is no risk to life)? One that is the most dangerous (there is a risk to life)?
Depending on the answers, you will be able to choose the best metric.
Assuming there is a safest and a most dangerous type, recall on the dangerous one would be the more relevant metric, because you must identify as many of those cases as you can.
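As a minimal sketch of how precision and recall can be read off per class in a three-class problem: the labels and predictions below are made up for illustration, and the helper name `per_class_precision_recall` is hypothetical rather than taken from any library.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)}, treating each class one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter()  # predicted as c and truly c
    fp = Counter()  # predicted as c but truly another class
    fn = Counter()  # truly c but predicted as another class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        scores[c] = (precision, recall)
    return scores

# Made-up labels for three cervix types A, B and C.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "C", "C", "C", "C", "A", "C"]

scores = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"type {label}: precision={precision:.2f} recall={recall:.2f}")
```

If one type turns out to be the dangerous one, its recall is the number to look at first; the other classes' scores then tell you what that choice costs elsewhere.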

Related

Accuracy, AUC, or F1 for Binary Classification without threshold

I am currently building a binary classification model to predict stock price movements (trend prediction). More specifically, the model predicts the probability that a stock outperforms the daily median return:
>Class 0: return >= median
>
>Class 1: return < median return
Accordingly, I (should) be dealing with a balanced prediction problem.
The ten stocks with the highest probability will be bought, and the ten stocks with the lowest probability will be shorted daily. So, ideally, the model performs well on both classes (I use softmax, so the model must exclusively decide).
I am wondering whether I should use the Accuracy, F1 or AUC-ROC when choosing the optimal model under these circumstances?
My understanding is that both are suitable metrics when the two classes are equally important. This StackExchange-Answer recommends the AUC over Accuracy because it will "strongly discourage people going for models that are representative, but not discriminative (...) and [only] select models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy". In contrast, this answer recommends the F1-Score because it is the combination of accuracy and AUC score.
I guess what's confusing me is that I will make use of both classes based on the probability assigned by the model. Also, I do not have an imbalanced dataset, which is what usually calls for using the AUC-ROC.
Which evaluation metric should I choose to find the optimal model on validation data?
Thanks a lot for any thoughts or recommendations.

Random Forest classifier class_weight

I have an unbalanced dataset of 200000 descriptions being class 0, and something like 10000 being class 1. However, in my training dataset I have equal number of 'positive' and 'negative' samples, about 8000 each. So now I am confused about how I should properly use the "class_weight" option of the classifier. It seems that it works only if the number of the 'positive' and 'negative' samples in the training data is the same as in the whole dataset. In this case it would be 8000 'positive' and 160000 of 'negative' ones, which is not really feasible. And reducing the number of the 'positive' samples doesn't seem to be a good idea either. Or am I wrong?

The class_weight option does nothing more than increase the weight of an error made on the under-represented class. In other words, misclassifying the rare class is penalized more heavily.
The classifier is likely to perform better on your test set (where both classes are represented equally, so both are equally important), but that is something you can easily verify yourself.
A side-effect is that predict_proba returns probabilities which are far away from the actual probabilities. (If you want to understand why, plot the simple average chance and the distribution of predicted scores without and with different class_weight=. How do the predicted scores shift?). Depending on your final use-case (classification, ranking, probability estimation) you should consider the choices in your model.
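To make the weighting concrete, here is a small sketch of the heuristic that scikit-learn documents for class_weight="balanced", namely n_samples / (n_classes * count_of_class). It is written in plain Python so the arithmetic is visible; the helper name `balanced_class_weights` is made up, and the counts are the ones from the question.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count), as documented for class_weight="balanced"."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}

# Full dataset from the question: 200000 negatives, about 10000 positives.
full_data = balanced_class_weights({0: 200000, 1: 10000})
# Balanced training set from the question: about 8000 of each.
training = balanced_class_weights({0: 8000, 1: 8000})

print(full_data)
print(training)
```

On the balanced training set both weights come out equal, so "balanced" weighting changes nothing there; any weight that should reflect the 20:1 ratio of the full dataset would have to be passed explicitly as a dictionary.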

Strictly speaking, from the point of view of your training set, you don't face a class imbalance issue, so you could very well leave class_weight to its default None value.
The real issue here, and in imbalanced datasets in general (about which you don't provide any info), is whether the cost of misclassification is the same for both classes. And this is a "business" decision (i.e. not a statistical/algorithmic one).
Usually, imbalanced datasets go hand-in-hand with problems with different misclassification costs; medical diagnosis is a textbook example here, since:
The datasets are almost by default imbalanced, since healthy people vastly outnumber infected ones
We would prefer a false alarm (misclassifying someone as having the disease, while he/she doesn't) rather than a missed detection (misclassifying an infected person as healthy, hence risking his/her life)
So, this is the actual problem you should be thinking about (i.e. even before building your training set).
If, for the business problem you are trying to address, there is not any difference between misclassifying a "0" for "1" and a "1" for "0", and given that your training set is balanced, you can proceed without worrying about assigning different class weights...

F1 Score vs ROC AUC

I have the below F1 and AUC scores for 2 different cases
Model 1: Precision: 85.11 Recall: 99.04 F1: 91.55 AUC: 69.94
Model 2: Precision: 85.1 Recall: 98.73 F1: 91.41 AUC: 71.69
The main motive of my problem is to predict the positive cases correctly, i.e. to reduce the False Negative cases (FN). Should I use the F1 score and choose Model 1, or use the AUC and choose Model 2? Thanks

Introduction
As a rule of thumb, every time you want to compare ROC AUC vs F1 Score, think about it as if you are comparing your model performance based on:
[Sensitivity vs (1-Specificity)] VS [Precision vs Recall]
Note that Sensitivity is the Recall (they are the same exact metric).
Now we need to understand what are: Specificity, Precision and Recall (Sensitivity) intuitively!
Background
Specificity: is given by the following formula:
$$
\text{Specificity}=\frac{TN}{TN+FP}
$$
Intuitively speaking, if we have 100% specific model, that means it did NOT miss any True Negative, in other words, there were NO False Positives (i.e. negative result that is falsely labeled as positive). Yet, there is a risk of having a lot of False Negatives!
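Precision: is given by the following formula:
$$
\text{Precision}=\frac{TP}{TP+FP}
$$
Intuitively speaking, if we have a 100% precise model, that means every case it labeled as positive really was positive, in other words, there were NO False Positives. Yet, it may still have missed some positives (False Negatives)!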
Recall: is given by the following formula:
$$
\text{Recall}=\frac{TP}{TP+FN}
$$
Intuitively speaking, if we have a 100% recall model, that means it did NOT miss any True Positive, in other words, there were NO False Negatives (i.e. a positive result that is falsely labeled as negative). Yet, there is a risk of having a lot of False Positives!
As you can see, the three concepts are very close to each other!
As a rule of thumb, if the cost of having a False Negative is high, we want to increase the model sensitivity and recall (which are exactly the same in regard to their formula)!
For instance, in fraud detection or sick patient detection, we don't want to label/predict a fraudulent transaction (True Positive) as non-fraudulent (False Negative). Also, we don't want to label/predict a contagious sick patient (True Positive) as not sick (False Negative).
This is because the consequences will be worse than those of a False Positive (incorrectly labeling a harmless transaction as fraudulent or a non-contagious patient as contagious).
On the other hand, if the cost of having a False Positive is high, then we want to increase the model specificity and precision!
For instance, in email spam detection, we don't want to label/predict a non-spam email (True Negative) as spam (False Positive). On the other hand, failing to label a spam email as spam (False Negative) is less costly.
F1 Score
It's given by the following formula:
$$
F_1=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}
$$
F1 Score keeps a balance between Precision and Recall. We use it if there is uneven class distribution, as precision and recall may give misleading results!
So we use F1 Score as a comparison indicator between Precision and Recall Numbers!
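As a quick check of the harmonic-mean definition of F1, the sketch below recomputes the F1 values from the precision and recall figures quoted in the question. The function name `f1_score` is just a local helper here, and the last line uses a made-up lopsided pair to show how strongly a low recall pulls the score down.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the question, converted to fractions.
model_1 = f1_score(0.8511, 0.9904)
model_2 = f1_score(0.8510, 0.9873)
print(f"Model 1 F1: {100 * model_1:.2f}")
print(f"Model 2 F1: {100 * model_2:.2f}")

# Made-up lopsided case: perfect precision, poor recall.
lopsided = f1_score(1.0, 0.1)
print(f"Lopsided F1: {lopsided:.3f} (arithmetic mean would be 0.550)")
```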
Area Under the Receiver Operating Characteristic curve (AUROC)
It plots the Sensitivity against (1-Specificity), in other words, the True Positive Rate against the False Positive Rate, as the decision threshold varies.
So, the bigger the AUROC, the greater the distinction between True Positives and True Negatives!
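One way to see what the AUROC measures is its equivalent ranking form: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it by brute force over all pairs; the scores are invented, and the quadratic loop is only meant for small illustrations.

```python
def roc_auc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correctly ranked pair."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Invented model scores for three positives and three negatives.
scores_pos = [0.9, 0.8, 0.4]
scores_neg = [0.7, 0.3, 0.2]

auc = roc_auc(scores_pos, scores_neg)
print(f"AUROC = {auc:.3f}")  # 8 of the 9 pairs are ranked correctly
```

Note that no threshold appears anywhere in this computation, which is why the AUROC describes the ranking quality of the scores rather than the quality of any single set of predicted labels.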
AUROC vs F1 Score (Conclusion)
In general, the ROC is for many different levels of thresholds and thus it has many F score values. F1 score is applicable for any particular point on the ROC curve.
You may think of it as a measure of precision and recall at a particular threshold value whereas AUC is the area under the ROC curve. For F score to be high, both precision and recall should be high.
Consequently, when you have a data imbalance between positive and negative samples, you should always use F1-score because ROC averages over all possible thresholds!
Further read:
Credit Card Fraud: Handling highly imbalance classes and why Receiver Operating Characteristics Curve (ROC Curve) should not be used, and Precision/Recall curve should be preferred in highly imbalanced situations

If you look at the definitions, you can see that both AUC and F1-score optimize "something" together with the fraction of the sample labeled "positive" that is actually true positive.
This "something" is:
For the AUC, the specificity, which is the fraction of the negatively labeled sample that is correctly labeled. You're not looking at the fraction of your positively labeled samples that is correctly labeled.
Using the F1 score, it's precision: the fraction of the positively labeled sample that is correctly labeled. And using the F1-score you don't consider the purity of the sample labeled as negative (the specificity).
The difference becomes important when you have highly unbalanced or skewed classes: For example there are many more true negatives than true positives.
Suppose you are looking at data from the general population to find people with a rare disease. There are far more people "negative" than "positive", and trying to optimize how well you are doing on the positive and the negative samples simultaneously, using AUC, is not optimal. You want the positive sample to include all positives if possible and you don't want it to be huge, due to a high false positive rate. So in this case you use the F1 score.
Conversely if both classes make up 50% of your dataset, or both make up a sizable fraction, and you care about your performance in identifying each class equally, then you should use the AUC, which optimizes for both classes, positive and negative.
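The effect of skewed classes can be sketched with a little arithmetic. Below, the same ROC operating point (true positive rate and false positive rate) is applied to a balanced population and to one where positives are rare. All the numbers are invented for illustration.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a given TPR/FPR operating point and class sizes."""
    tp = tpr * n_pos   # positives correctly flagged
    fp = fpr * n_neg   # negatives wrongly flagged
    return tp / (tp + fp) if tp + fp else 0.0

# Same operating point on the ROC curve in both cases: TPR 0.90, FPR 0.05.
balanced = precision_at(0.90, 0.05, n_pos=1000, n_neg=1000)
rare = precision_at(0.90, 0.05, n_pos=100, n_neg=10000)

print(f"balanced classes: precision = {balanced:.3f}")
print(f"rare positives:   precision = {rare:.3f}")
```

The ROC curve, and therefore the AUC, is identical in the two cases, but with rare positives most of the flagged cases are false alarms. Precision, and with it F1, picks up that difference.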

just adding my 2 cents here:
AUC does an implicit weighting of the samples, which F1 does not.
In my last use case comparing the effectiveness of drugs on patients, it's easy to learn which drugs are generally strong, and which are weak. The big question is whether you can hit the outliers (the few positives for a weak drug or the few negatives for a strong drug). To answer that, you have to specifically weigh the outliers up using F1, which you don't need to do with AUC.

to predict the positive cases correctly
One can rephrase your goal a bit: when a case is really positive, you want to classify it as positive too. The probability of that event, p(predicted_label = positive | true_label = positive), is recall by definition. If you want to maximize this property of your model, you'd choose Model 1.

training set with only one label, missing the other

Hi I've been doing a machine learning project about predicting if a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). But the problem is, in the training set, all the items are labelled with 1. So I got confused because I don't think the training set has strong discriminative power. To be more specific, now I could extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi-supervised learning (I have never studied it, so I have no idea if it will work)? But with such a training set I cannot even do validation....

Actually, you can train a model on only positive examples; 1-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" affected mainly by nu (the allowed error rate) and the kernel parameters (such as gamma for an RBF kernel, or the degree of a polynomial kernel).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only obvious examples: throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
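The filtering step described above might look roughly like this. It assumes each (query, answer) pair already carries some match score in [0, 1]; the threshold, the margin, the pair identifiers and the function name are all hypothetical.

```python
def keep_clear_examples(scored_pairs, threshold=0.5, margin=0.2):
    """Keep only pairs whose score is at least `margin` away from `threshold`,
    labelling them 1 (clear match) or 0 (clear non-match)."""
    training = []
    for pair_id, score in scored_pairs:
        if score >= threshold + margin:
            training.append((pair_id, 1))
        elif score <= threshold - margin:
            training.append((pair_id, 0))
        # Pairs inside the "alley" around the threshold are left out.
    return training

# Invented (pair id, match score) data.
scored_pairs = [("q1-a1", 0.95), ("q1-a2", 0.72), ("q2-a1", 0.55),
                ("q2-a2", 0.48), ("q3-a1", 0.31), ("q3-a2", 0.05)]

training = keep_clear_examples(scored_pairs)
print(training)
```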

Does prior distribution matter in classification?

Currently I have a classification problem with two classes. What I want to do is, given a bunch of candidates, find out who will more likely to be the class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my prediction quite inaccurate.
For training the dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results?

Indeed, a very imbalanced dataset can cause problems in classification, because by defaulting to the majority class 0 you can already get a very low error rate.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class more heavily), oversampling the rare class (i.e. learning each instance multiple times), or producing slight variations of the rare objects to restore balance (SMOTE and similar methods).
You really should grab a classification or machine learning book and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)

If you're forced to pick exactly one from a group, then the prior distribution over classes won't matter because it will be constant for all members of that group. If you must look at each in turn and make an independent decision as to whether they're class one or class two, the prior will potentially change the decision, depending on which method you choose to do the classification. I would suggest you get hold of as many examples of the rare class as possible, but beware that feeding a 50-50 split to a classifier as training blindly may make it implicitly fit a model that assumes this is the distribution at test time.
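If a classifier was trained on a 50-50 split but the real prevalence is about 1%, its predicted probabilities can be rescaled afterwards. The sketch below applies the standard prior-shift correction. It assumes the classifier's outputs are calibrated on the training distribution and that only the class proportions differ between training and deployment; the function name is made up.

```python
def correct_for_prior(p_train, train_prior, true_prior):
    """Rescale a class-1 probability from a model trained at `train_prior`
    so that it reflects the deployment prevalence `true_prior`."""
    w1 = true_prior / train_prior              # reweighting for class 1
    w0 = (1 - true_prior) / (1 - train_prior)  # reweighting for class 0
    return p_train * w1 / (p_train * w1 + (1 - p_train) * w0)

# Model trained on a 50-50 split, deployed where class 1 is about 1%.
for p in (0.5, 0.9, 0.99):
    adjusted = correct_for_prior(p, train_prior=0.5, true_prior=0.01)
    print(f"score {p:.2f} on balanced training -> {adjusted:.4f} at 1% prevalence")
```

The correction is monotonic, so it does not change the ranking of candidates. It only matters when the probabilities themselves, or a fixed threshold on them, are used for the decision.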

Sampling your two classes evenly doesn't change assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is "given a bunch of candidates, find out who will more likely to be the class 1". I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is a constant. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
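As a minimal one-dimensional sketch of the likelihood-ratio detector, assume both classes are Gaussian with known parameters. The means, standard deviations and candidate values below are invented for illustration. With equal variances the log-likelihood ratio reduces to a linear function of $x_i$, which is the linear-discriminant case mentioned above.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_likelihood_ratio(x, mean1, std1, mean0, std0):
    """log Lambda = log p(x | c_1) - log p(x | c_0)."""
    return gaussian_log_pdf(x, mean1, std1) - gaussian_log_pdf(x, mean0, std0)

# Assumed class models: c_1 ~ N(2, 1) is the rare target, c_0 ~ N(0, 1) the background.
candidates = [-0.5, 0.3, 1.2, 2.4, 0.9]
scores = [log_likelihood_ratio(x, 2.0, 1.0, 0.0, 1.0) for x in candidates]

# The candidate most likely to belong to class 1 has the largest log Lambda.
best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"best candidate: x = {candidates[best]}, log Lambda = {scores[best]:.2f}")
```

No class prior appears in the computation, so resampling the training data does not affect which candidate is picked as long as the class-conditional densities are estimated well.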

You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then it's possible that the optimal decision approach is to always declare "A". Many classification algorithms (implicitly) assume this.
If instead we declare that the cost of declaring "A" when, in fact, the instance was "B" is much bigger than the cost of the opposite error, then the decision logic becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection: faults in the monitored system will be rare, but you want to be sure that if you see any data that points to an error condition, action is taken (even if it is just reviewing the data).
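A minimal expected-cost decision rule along these lines could be sketched as follows. The cost values are assumptions chosen to mimic a fault-detection setting in which missing the rare class "B" is far more expensive than raising a false alarm, and `decide` is a hypothetical helper.

```python
def decide(p_b, cost_miss_b, cost_false_alarm):
    """Pick the declaration with the lower expected cost.
    p_b is the posterior probability that the instance is class B."""
    expected_cost_a = p_b * cost_miss_b             # declare A, but it was B
    expected_cost_b = (1 - p_b) * cost_false_alarm  # declare B, but it was A
    return "B" if expected_cost_b < expected_cost_a else "A"

# With equal costs, a 5% chance of B is not enough to declare it.
print(decide(0.05, cost_miss_b=1, cost_false_alarm=1))

# If a missed B is assumed to cost 100 times a false alarm, the same 5% is enough.
print(decide(0.05, cost_miss_b=100, cost_false_alarm=1))

# The implied threshold on p_b is cost_false_alarm / (cost_false_alarm + cost_miss_b).
print(decide(0.005, cost_miss_b=100, cost_false_alarm=1))
```

The priors still enter through the posterior p_b, but the costs move the decision threshold away from 0.5, so the 99% class no longer wins by default.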
