I have read plenty of articles about ROC and AUC, and I found out that we need to measure the TPR and FPR at different classification thresholds. Does that mean ROC and AUC can only be measured for probabilistic classifiers and not for discrete ones (like trees)?
Yes, in order to calculate AUC you need predicted probabilities. AUC is the area under the ROC curve. To make a ROC curve you need to calculate the true positive rate and false positive rate at different decision thresholds, and in order to use different decision thresholds you need probabilities as your model's output (it makes no sense to apply a threshold to a binary label of 0 or 1). For more information about how to calculate AUC, when to use AUC, and the strengths and weaknesses of AUC as a performance metric, you can read this article.
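For illustration, here is a minimal sketch (my own example with a synthetic dataset and logistic regression, not from the original answer) of computing AUC from predicted probabilities rather than from hard 0/1 predictions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic binary classification problem
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC needs the continuous score (here: the predicted probability of class 1),
# not the hard 0/1 predictions
proba = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))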
Related
I am using several U-Net variants for a brain tumor segmentation task. I get the following values for the performance measures, including Dice, IOU, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC), otherwise called the average precision (AP), computed for IOU thresholds in the range [0.5:0.95] in steps of 0.05.
From the above table, I observe that Model-2 gives better values for the IOU and Dice metrics. I understand that the Dice coefficient gives more weight to the TPs. However, Model-1 gives superior values for the AUC and AP@[0.5:0.95] metrics. Which metrics should be given higher importance in model selection under these circumstances?
What are the underlying assumptions of the ROC curve?
What part of an ROC curve impacts the PR curve more?
ROC curves summarize the trade-off between the true positive rate and the false positive rate across different probability thresholds.
Precision-recall curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) across different probability thresholds.
ROC curves are appropriate when the classes are roughly balanced, whereas precision-recall curves are better suited to imbalanced datasets.
Here is a good article for deeper understanding.
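As an illustrative sketch (my own example, using scikit-learn on an imbalanced synthetic dataset; the linked article is the better reference), both curves are computed from the same predicted probabilities:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# imbalanced problem: roughly 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)                        # TPR vs FPR
precision, recall, _ = precision_recall_curve(y_test, proba)  # PPV vs TPR

print("ROC AUC:", auc(fpr, tpr))
print("Average precision:", average_precision_score(y_test, proba))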
As you probably know, in K-NN the decision is usually taken by a "majority vote" and not by applying a threshold to some score - i.e. there is no parameter to base a ROC curve on.
Note that in my implementation of the K-NN classifier, the votes do not have equal weights. The weight of each neighbor is e^(-d), where d is the distance between the tested sample and that neighbor, so the nearer of the K neighbors get a larger say in the vote.
My current decision rule is: if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative neighbors, the classifier says POSITIVE; otherwise it says NEGATIVE.
But there is still no threshold.
Then, I thought about the following idea:
Deciding on the class with the higher sum of votes can be described more generally as using a threshold of 0 on the score computed by (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES).
So I thought about changing my decision rule to use a threshold on that measure, and plotting a ROC curve based on thresholds over the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
Does it sound like a good approach for this task?
Yes, that is more or less what is typically done. If you take a look at scikit-learn, it supports weights in k-NN, and it also has predict_proba, which gives you a clear decision threshold. Typically, however, you do not want to threshold the difference, but rather the ratio
votes_positive / (votes_positive + votes_negative) > T
This way you know that you just have to "move" the threshold from 0 to 1, rather than over arbitrary values. It also has a clear interpretation - as an internal probability estimate that you consider "sure enough". By default T = 0.5: if the probability is above 50% you classify as positive, but, as said before, you can do anything with it.
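A minimal sketch of that idea, assuming scikit-learn's KNeighborsClassifier with a custom e^(-d) weight function rather than the asker's own implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weight each of the K neighbors by exp(-distance), as described in the question
knn = KNeighborsClassifier(n_neighbors=15, weights=lambda d: np.exp(-d))
knn.fit(X_train, y_train)

# predict_proba returns the weighted vote ratio
#   votes_positive / (votes_positive + votes_negative)
# which is exactly the score to threshold between 0 and 1
scores = knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))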
The following image definitely makes sense to me.
Say you have a few trained binary classifiers A and B (with B not much better than random guessing, etc.) and a test set of n samples to evaluate them on. Since precision and recall are computed over all n samples, the dots corresponding to those classifiers make sense to me.
Now sometimes people talk about ROC curves, and I understand that precision is expressed as a function of recall, or simply plotted as Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick subsets of the test set and compute precision and recall on each of them, in order to get many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
import numpy as np

def binary_clf_curve(y_true, y_score):
    # sort the scores (and the matching true labels) in decreasing order
    desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
    y_score = y_score[desc_score_indices]
    y_true = y_true[desc_score_indices]
    # accumulate the true positives with decreasing threshold
    tps = np.cumsum(y_true)
    # everything predicted positive so far that is not a true positive
    # is a false positive
    fps = 1 + np.arange(len(y_true)) - tps
    return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order of the score the classifier assigned to them, and then their cumulative sum is computed, giving the true positive count (and, after normalizing by the total number of positives, the true positive rate) as a function of the score threshold.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
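As a quick sanity check, here is a small sketch (my own, reusing the binary_clf_curve helper reconstructed above and some made-up scores) that turns the cumulative counts into rates and compares the result with scikit-learn's roc_curve:

import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45, 0.9, 0.05])

fps, tps, scores = binary_clf_curve(y_true, y_score)
# normalize the counts into rates
tpr = tps / tps[-1]
fpr = fps / fps[-1]
print("AUC from the snippet:", auc(fpr, tpr))

# compare with scikit-learn's own implementation
fpr_sk, tpr_sk, _ = roc_curve(y_true, y_score)
print("AUC from roc_curve:  ", auc(fpr_sk, tpr_sk))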
The ROC curve just shows how much sensitivity you will gain if you allow the FPR to increase by some amount: it is the trade-off between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but that is just a probability, not a class label. In general you have to specify a probability threshold above which you classify an object as positive. You can train a logistic regression, obtain the probability of the positive class for each object in your set, and then vary this threshold in small steps from 0 to 1. Thresholding the probabilities at each value gives you class labels for every object, from which you compute the TPR and FPR. You thus get a (TPR, FPR) pair for every threshold; mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them.
Also, for linear binary classifiers you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you enlarged the region assigned to the positive class), and at the same time you will increase the FPR by some amount (because the region assigned to the negative class shrank).
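For concreteness, here is a rough sketch (my own, on a synthetic dataset) of that threshold-sweeping procedure for logistic regression:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

points = []
for threshold in np.linspace(0, 1, 101):
    y_pred = (proba >= threshold).astype(int)   # turn probabilities into class labels
    tp = np.sum((y_pred == 1) & (y_test == 1))
    fp = np.sum((y_pred == 1) & (y_test == 0))
    fn = np.sum((y_pred == 0) & (y_test == 1))
    tn = np.sum((y_pred == 0) & (y_test == 0))
    tpr = tp / (tp + fn)                        # sensitivity
    fpr = fp / (fp + tn)                        # 1 - specificity
    points.append((fpr, tpr))

# sorted by FPR, the (FPR, TPR) pairs trace out the ROC curve
points.sort()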
How do I generate a ROC curve for cross-validation?
For a single test set I think I should threshold the classification scores of the SVM to generate the ROC curve.
But I am unclear about how to generate one for cross-validation.
After a complete round of cross-validation, every observation has been classified exactly once (although by different models) and has been given an estimated probability of belonging to the class of interest, or a similar statistic. These probabilities can be used to generate a ROC curve in exactly the same way as probabilities obtained on an external test set: just calculate the class-wise error rates as you vary the classification threshold from 0 to 1 and you are all set.
However, you would typically want to perform more than one round of cross-validation, since the performance varies depending on how the folds are divided. It is not obvious to me how to calculate the mean ROC curve over all rounds; I suggest plotting them all and calculating the mean AUC.
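A minimal sketch of the pooling approach (my own example, assuming scikit-learn's cross_val_predict and a probabilistic SVM, which may differ from the asker's setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# out-of-fold probabilities: every observation is scored exactly once,
# by a model that never saw it during training
clf = SVC(probability=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
print("Pooled cross-validated AUC:", auc(fpr, tpr))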
As a follow-up to Backlin:
The variation in the results across different runs of k-fold or leave-n-out cross-validation shows the instability of the models. This is valuable information.
Of course you can pool the results and just generate one ROC.
But you can also plot the set of curves (see e.g. the R package ROCR), or calculate e.g. the median and IQR at different thresholds and construct a band depicting this variation (a rough sketch of the latter follows after the example below).
Here's an example: the shaded areas are the interquartile ranges observed over 125 iterations of 8-fold cross-validation. The thin black areas contain half of the observed specificity-sensitivity pairs for one particular threshold, with the median marked by x (ignore the + marks).
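For a Python take on the per-fold-curves idea, here is a rough sketch (my own code; interpolating each fold's curve onto a common FPR grid is just one way to summarize them, not necessarily the procedure behind the example above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, random_state=0)
grid = np.linspace(0, 1, 101)            # common FPR grid for all folds
tprs = []

cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[test_idx], proba)
    tprs.append(np.interp(grid, fpr, tpr))   # one curve per fold, on the common grid

tprs = np.array(tprs)
median_tpr = np.median(tprs, axis=0)
q1, q3 = np.percentile(tprs, [25, 75], axis=0)   # IQR band across folds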