receiver operating characteristic (ROC) on a test set - machine-learning

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (with B not much better than random guessing, etc.) and a test set of n samples to evaluate all of those classifiers on. Since precision and recall are computed over all n samples, the dots corresponding to the classifiers make sense.
Now sometimes people talk about ROC curves and I understand that precision is expressed as a function of recall or simply plotted Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and compute precision and recall on them, and hence get many discrete values (or an interpolated line)?

The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
# sort the scores (and the corresponding true labels) in descending order
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# accumulate the true positives with decreasing threshold
tps = y_true.cumsum()
# everything ranked at or above each threshold that is not a true positive is a false positive
fps = 1 + np.arange(len(y_true)) - tps
return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of having weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order by the score assigned to them by the classifier, and then their cumulative sum is computed, giving the true-positive count (and, after normalizing by the total number of positives, the true-positive rate) as a function of the score threshold.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
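If the public API is easier to follow, here is a minimal sketch (with made-up labels and scores) of the same computation through sklearn.metrics.roc_curve:

import numpy as np
from sklearn.metrics import roc_curve, auc

# toy ground-truth labels and classifier scores (illustrative values only)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55])

# one (FPR, TPR) point per distinct score threshold, obtained by sweeping the score
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)
print("AUC:", auc(fpr, tpr))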

The ROC curve just shows how much sensitivity (TPR) you will obtain if you allow the FPR to increase by some amount: it is the tradeoff between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but that is just a probability, not a class label. So in the general case you have to specify a threshold on the probability, above which you classify an object as positive. You can fit a logistic regression, obtain from it the positive-class probability for each object in your set, and then vary this threshold parameter with some step from 0 to 1. By thresholding the probabilities (computed in the previous step) you get class labels for every object, and from those labels you compute TPR and FPR. Thus you get a (TPR, FPR) pair for every threshold; you can mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them (sketched in code below).
Also, for linear binary classifiers, you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you enlarged the positive-class region), and at the same time you will increase the FPR by some amount (because the negative-class region shrank).
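Here is a minimal sketch of that threshold-sweeping procedure, assuming you already have positive-class probabilities from some classifier (the numbers below are made up):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                     # ground-truth labels (illustrative)
probs = np.array([0.9, 0.6, 0.8, 0.4, 0.3, 0.55, 0.7, 0.1])     # P(positive) from e.g. logistic regression

for threshold in np.arange(0.0, 1.01, 0.1):
    y_pred = (probs >= threshold).astype(int)    # classify as positive above the threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)    # sensitivity (true positive rate)
    fpr = fp / (fp + tn)    # false positive rate
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

Each printed (FPR, TPR) pair is one point of the ROC curve; connecting them gives the plot described above.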

Related

How to choose the best segmentation model using the area under the precision recall curve, IOU and Dice metrics?

I am using several U-Net variants for a brain tumor segmentation task. I get values for several performance measures, including Dice, IoU, the area under the receiver operating characteristic curve (AUC), and the area under the precision-recall curve (AUPRC), otherwise called the average precision (AP), computed at IoU thresholds in the range [0.5:0.95] in steps of 0.05.
From these values I observe that Model-2 gives better values for the IoU and Dice metrics. I understand that the Dice coefficient gives more weight to the TPs. However, Model-1 gives superior values for the AUC and AP@[0.5:0.95] metrics. Which metrics should be given higher importance in model selection under these circumstances?

Values greater than 1 in svm prediction file

I am using svm_light to train a model for binary classification. Using the model, I tested some examples. I was surprised to see the output of the prediction file: it contains values greater than 1 as well as less than -1. I thought the range was [-1, 1]. Am I doing something wrong?
It makes sense that the values are not bounded to the interval [-1, 1] if you understand how an SVM works. An SVM tries to draw the line that separates the negative and positive data points while maximizing the margin, i.e. the distance of the closest points to that line.
The values in the prediction file represent the distances of the data points from the SVM's optimal hyperplane, where positive values lie on the positive-class side of the hyperplane and negative values on the negative-class side. These distances can be arbitrarily large in magnitude and are not bounded.
I've seen some SVM implementations, such as Weka's implementation of Platt's SMO, which normalize the values so that they are confidence values for the positive class bounded to the interval [0, 1], but both representations work fine for judging how confident an SVM is in a classification, since a data point farther from the hyperplane is a more certain classification than one lying close to the hyperplane.
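For illustration, here is a sketch with scikit-learn rather than svm_light, on made-up toy data: decision_function returns the signed decision value, which grows with the distance from the hyperplane and is not bounded, while predict_proba (when probability estimation is enabled) returns Platt-scaled confidences bounded to [0, 1]:

import numpy as np
from sklearn.svm import SVC

# toy 1-D, linearly separable data (illustrative only)
X = np.array([[-3.0], [-2.5], [-2.0], [-1.5], [-1.0],
              [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = SVC(kernel="linear", probability=True).fit(X, y)

new_points = np.array([[-10.0], [0.0], [10.0]])
print(clf.decision_function(new_points))   # signed decision values: not restricted to [-1, 1]
print(clf.predict_proba(new_points))       # Platt-scaled probabilities: each row in [0, 1], sums to 1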

Computing ROC curve for K-NN classifier

As you probably know, in K-NN the decision is usually taken by a "majority vote", not according to some threshold, i.e. there is no parameter to base an ROC curve on.
Note that in my K-NN implementation the votes do not have equal weights: the weight of each "neighbor" is e^(-d), where d is the distance between the tested sample and that neighbor. This gives higher weight to the votes of the nearer neighbors among the K neighbors.
My current decision rule is: if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative neighbors, my classifier says POSITIVE; otherwise it says NEGATIVE.
But - There is no threshold.
Then, I thought about the following idea:
Deciding on the class with the higher sum of votes can be described more generally as using a threshold of 0 on the score computed by (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES).
So I thought about changing my decision rule to use a threshold on that measure, and plotting an ROC curve based on thresholds on the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
Does it sound like a good approach for this task?
Yes, it is more or less what is typically used. If you take a look at scikit-learn, its k-NN implementation supports vote weights, and it also has predict_proba, which gives you a clear decision threshold. Typically, however, you do not want to threshold the difference, but rather the ratio
votes_positive / (votes_negative + votes_positive)
classifying as positive when this ratio exceeds a threshold T. This way you know that you only have to move the threshold T from 0 to 1, rather than over arbitrary values, and it now has a clear interpretation: an internal probability estimate that you consider "sure enough". By default T = 0.5, i.e. if the probability is above 50% you classify as positive, but as said before, you can do anything with it.
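A sketch of that idea with scikit-learn, using a callable weight e^(-d) as in the question and feeding the resulting probability estimate to roc_curve (the toy data is made up, and the curve is computed on the training set purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

rng = np.random.RandomState(0)

# toy 2-D data: two Gaussian blobs (illustrative only)
X = np.vstack([rng.normal(2.0, 1.0, size=(50, 2)),
               rng.normal(0.0, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [0] * 50)

# exponentially decaying vote weights, as in the question: w = e^(-d)
knn = KNeighborsClassifier(n_neighbors=5, weights=lambda d: np.exp(-d))
knn.fit(X, y)

# predict_proba returns the normalized weighted vote, i.e. the ratio discussed above;
# sweeping a threshold T over it from 0 to 1 traces the ROC curve
scores = knn.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC:", auc(fpr, tpr))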

how to interpret the "soft" and "max" in the SoftMax regression?

I know the form of the softmax regression, but I am curious about why it has such a name? Or just for some historical reasons?
The maximum of two numbers max(x,y) could have sharp corners / steep edges which sometimes is an unwanted property (e.g. if you want to compute gradients).
To soften the edges of max(x,y), one can use a variant with softer edges: the softmax function. It's still a max function at its core (well, to be precise it's an approximation of it) but smoothed out.
If it's still unclear, here's a good read.
Let's say you have a set of scalars xi and you want to calculate a weighted sum of them, giving a weight wi to each xi such that the weights sum to 1 (like a discrete probability distribution). One way to do it is to set wi = exp(a*xi) for some positive constant a, and then normalize the weights to one. If a = 0 you get just a regular sample average. On the other hand, for a very large value of a you get the max operator, that is, the weighted sum will be just the largest xi. Therefore, varying the value of a gives you a "soft", or continuous, way to go from regular averaging to selecting the max. The functional form of this weighted average should look familiar to you if you already know what a SoftMax regression is.
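A small numerical sketch of that interpolation (the sharpness parameter a and the values of x below are made up):

import numpy as np

def soft_weighted_average(x, a):
    # weights w_i proportional to exp(a * x_i), normalized to sum to 1
    # (subtracting max(x) before exponentiating avoids overflow and does not change the normalized weights)
    w = np.exp(a * (x - x.max()))
    w = w / w.sum()
    return np.sum(w * x)

x = np.array([1.0, 2.0, 5.0])

print(soft_weighted_average(x, 0.0))     # a = 0: plain average, 2.666...
print(soft_weighted_average(x, 1.0))     # moderate a: average pulled toward the largest value
print(soft_weighted_average(x, 100.0))   # very large a: essentially max(x) = 5.0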

Precision/recall for multiclass-multilabel classification

I'm wondering how to calculate precision and recall measures for multiclass multilabel classification, i.e. classification where there are more than two labels, and where each instance can have multiple labels?
For multi-label classification you have two ways to go
First consider the following notation.
$n$ is the number of examples.
$Y_i$ is the ground truth label assignment of the $i^{th}$ example.
$x_i$ is the $i^{th}$ example.
$h(x_i)$ is the set of predicted labels for the $i^{th}$ example.
Example based
The metrics are computed in a per-datapoint manner: for each datapoint the score is computed from its predicted and ground-truth label sets, and then these scores are aggregated over all the datapoints.
Precision $= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|h(x_i)|}$, the ratio of how much of the predicted is correct. The numerator counts how many of the predicted labels are in common with the ground truth, and the ratio measures how many of the predicted labels are actually in the ground truth.
Recall $= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|Y_i|}$, the ratio of how many of the actual labels were predicted. The numerator counts how many of the predicted labels are in common with the ground truth (as above), and the ratio to the number of actual labels gives the fraction of the actual labels that were predicted.
There are other metrics as well.
Label based
Here things are done label-wise. For each label the metrics (e.g. precision, recall) are computed, and then these label-wise metrics are aggregated. Hence, in this case, you end up computing the precision/recall for each label over the entire dataset, as you would for a binary classification problem (since each label has a binary assignment), and then aggregate them.
The easy way is to present the general form, which is just an extension of the standard multi-class equivalent.
Macro averaged: $\frac{1}{q}\sum_{j=1}^{q} B(tp_j, fp_j, tn_j, fn_j)$
Micro averaged: $B(\sum_{j=1}^{q} tp_j, \sum_{j=1}^{q} fp_j, \sum_{j=1}^{q} tn_j, \sum_{j=1}^{q} fn_j)$
Here $tp_j$, $fp_j$, $tn_j$ and $fn_j$ are the true positive, false positive, true negative and false negative counts for the $j^{th}$ label only, and $q$ is the number of labels.
$B$ stands for any confusion-matrix based metric; in your case you would plug in the standard precision and recall formulas. For the macro average you apply the metric to each label's counts and then average over the labels; for the micro average you aggregate the counts over all labels first and then apply the metric once.
You might be interested to have a look at the code for the multi-label metrics here, which is part of the mldr package in R. You might also want to look into the Java multi-label library MULAN.
This is a nice paper to get into the different metrics: A Review on Multi-Label Learning Algorithms
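If it helps, both flavours are available in scikit-learn through the average argument of precision_score and recall_score when the labels are passed as binary indicator matrices; the toy matrices below are made up:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# binary indicator matrices: rows = examples, columns = labels (illustrative)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# example based: score per datapoint, then averaged over the datapoints
print(precision_score(Y_true, Y_pred, average="samples"))
print(recall_score(Y_true, Y_pred, average="samples"))

# label based: per-label counts, aggregated label-wise
print(precision_score(Y_true, Y_pred, average="macro"))   # apply the metric per label, then average
print(precision_score(Y_true, Y_pred, average="micro"))   # sum the counts over labels, then apply the metric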
The answer is that you have to compute precision and recall for each class, then average them together. E.g. if you have classes A, B, and C, then your precision is:
(precision(A) + precision(B) + precision(C)) / 3
Same for recall.
I'm no expert, but this is what I have determined based on the following sources:
https://list.scms.waikato.ac.nz/pipermail/wekalist/2011-March/051575.html
http://stats.stackexchange.com/questions/21551/how-to-compute-precision-recall-for-multiclass-multilabel-classification
Let us assume that we have a three-class classification problem with labels A, B and C.
The first thing to do is to generate a confusion matrix. Note that the values on the diagonal are always the true positives (TP).
Now, to compute recall for label A you can read off the values from the confusion matrix and compute:
Recall_A = TP_A / (TP_A + FN_A)
         = TP_A / (total gold labels for A)
Similarly, to compute precision for label A, you can read off the values from the confusion matrix and compute:
Precision_A = TP_A / (TP_A + FP_A)
            = TP_A / (total predicted as A)
You just need to do the same for the remaining labels B and C. This applies to any multi-class classification problem.
Here is the full article that talks about how to compute precision and recall for any multi-class classification problem, including examples.
In python using sklearn and numpy:
from sklearn.metrics import confusion_matrix
import numpy as np
labels = ...
predictions = ...
cm = confusion_matrix(labels, predictions)
recall = np.diag(cm) / np.sum(cm, axis = 1)       # per-class recall: rows of cm are the true classes
precision = np.diag(cm) / np.sum(cm, axis = 0)    # per-class precision: columns of cm are the predicted labels
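A self-contained run of that snippet with made-up labels for three classes A, B and C, including the macro averages suggested in the earlier answer:

from sklearn.metrics import confusion_matrix
import numpy as np

labels      = ["A", "B", "A", "C", "B", "A", "C", "B"]   # ground truth (made up)
predictions = ["A", "B", "C", "C", "A", "A", "C", "B"]   # classifier output (made up)

cm = confusion_matrix(labels, predictions)       # classes sorted alphabetically: A, B, C
recall = np.diag(cm) / np.sum(cm, axis=1)        # per-class recall
precision = np.diag(cm) / np.sum(cm, axis=0)     # per-class precision

print(recall, precision)
print("macro recall:", recall.mean(), "macro precision:", precision.mean())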
Simple averaging will do if the classes are balanced.
Otherwise, the recall for each real class needs to be weighted by the prevalence of that class, and the precision for each predicted label needs to be weighted by the bias (probability) of that label. Either way you get Rand accuracy.
A more direct way is to make a normalized contingency table (divide by N so the table sums to 1 over all combinations of label and class) and add up the diagonal to get Rand accuracy.
But if the classes aren't balanced, the bias remains, and a chance-corrected method such as kappa is more appropriate, or better still ROC analysis or a chance-corrected measure such as informedness (the height above the chance line in ROC).
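A quick numerical check of that claim on a small made-up confusion matrix: weighting each class's recall by its prevalence (or each predicted label's precision by its bias) recovers the overall (Rand) accuracy:

import numpy as np

# a small made-up confusion matrix (rows = true classes, columns = predicted labels)
cm = np.array([[2, 0, 1],
               [1, 2, 0],
               [0, 0, 2]])

recall = np.diag(cm) / cm.sum(axis=1)        # per-class recall
precision = np.diag(cm) / cm.sum(axis=0)     # per-label precision

prevalence = cm.sum(axis=1) / cm.sum()       # fraction of the data in each true class
bias = cm.sum(axis=0) / cm.sum()             # fraction of the data assigned to each predicted label

print(np.sum(prevalence * recall))           # prevalence-weighted recall
print(np.sum(bias * precision))              # bias-weighted precision
print(np.trace(cm) / cm.sum())               # overall (Rand) accuracy: all three coincide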
