For performance measurement I am trying to draw a ROC curve. In a ROC curve I have to plot the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis. As we know,
FPR = FP/(FP+TN)
So in the following picture, how can I detect True Negatives (TN)? I have used a HOG classifier to detect humans. I marked rectangles 1, 2, 3, 4, 5, 6 (or should it be 7?) to show the objects that should be ignored and not classified as human, and I think those are True Negatives.
For this picture, here are my assumptions. As we know,
False negative: Result should have been positive, but is negative.
False positive: Result should have been negative, but is positive.
True positive: Result should have been positive and is positive.
True negative: Result should have been negative and is negative.
So I think in this frame FP = 0, TP = 0, FN = 0, but I am not sure about TN: is it 6 or 7 or something else? Please also correct me about FP, TP, and FN if I am wrong. I saw the question "How to categorize True Negatives in sliding window object detection?", which was really helpful, but I still have to calculate the FPR for this scenario.
You cannot calculate these values from such an image alone; you need more data (knowledge of what is actually happening). What you need is probably just the total number of analyzed windows, which is some constant N. Now, it seems like all of these marked windows are wrong (none is on a human), thus:
FP = 6 (your method claims there are 6 people, but none of these claims is valid since the boxes are completely off; however, if this is just a visualization issue and the method actually captured valid people, this 6 should be moved to TP instead)
TP = 0 (it does not correctly mark any human)
FN = 10 (if I counted correctly there are 10 people in this image, and all of them are missed)
TN = N - 16, where N is the number of all analyzed windows, since all of them are correctly classified as "no human" except for the 10 FNs and 6 FPs, which add up to 16.
In general
FP = how many actual not humans are marked "human"
TP = how many actual humans are marked "human"
FN = how many actual humans are incorrectly ignored (not marked "human")
TN = how many actual not humans are correctly ignored (not marked "human")
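To make the arithmetic concrete, here is a minimal Python sketch of the FPR/TPR calculation from these counts; the total window count N is an assumed value purely for illustration, since the real number depends entirely on your sliding-window configuration:

# Minimal sketch: FPR/TPR from the counts above, with an assumed
# total of N analyzed sliding windows (N is not known from the image alone).
N = 10_000                # assumed number of windows scanned by the detector
TP, FP, FN = 0, 6, 10
TN = N - (TP + FP + FN)   # everything else was correctly left unmarked

FPR = FP / (FP + TN)      # x-axis of the ROC curve
TPR = TP / (TP + FN)      # y-axis of the ROC curve (a.k.a. recall)

print(f"FPR = {FPR:.4f}, TPR = {TPR:.4f}")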
Related
How can I calculate the false positive rate for an object detection algorithm, where I can have multiple objects per image?
In my data, a given image may have many objects. I am counting a predicted box as a true positive if its IOU with a truth box is above a certain threshold, and as a false positive otherwise. For example:
I have 2 prediction bounding boxes and 2 ground-truth bounding boxes:
I computed IoU for each pair of prediction and ground-truth bounding boxes:
IoU = 0.00, 0.60, 0.10, 0.05
threshold = 0.50
In this case do I have a TP or not? Could you explain it?
Summary, specific: Yes, you have a TP; you also have a FP and a FN.
Summary, detailed: Your prediction model correctly identified one GT (ground truth) box. It missed the other. It incorrectly identified a third box.
Classification logic:
At the very least, your IoU figures should be a matrix, not a linear sequence. For M predictions and N GT boxes, you will have an NxM matrix. Yours looks like this:
0.00 0.60
0.10 0.05
Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix. This leaves you with a rather boring matrix:
0.10
Since this value is below the threshold, you are out of matches. You have one prediction and one GT remaining. With the one "hit", you have three objects in your classification set: two expected objects, and a third created by the predictor. You code your gt and pred lists like this:
gt   = [1, 1, 0]   # The first two objects are valid; the third is a phantom.
pred = [1, 0, 1]   # Identified one actual box and the phantom.
Is that clear enough?
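If it helps, here is a rough Python sketch of that greedy matching logic, written from the description above rather than taken from any particular library (the 0.5 threshold matches the question):

import numpy as np

def greedy_match(iou, threshold=0.5):
    # Greedy matching sketch: repeatedly take the largest IoU, declare a match
    # if it clears the threshold, and remove that GT row / prediction column.
    iou = iou.astype(float).copy()
    n_gt, n_pred = iou.shape
    tp = 0
    while iou.size and iou.max() >= threshold:
        r, c = np.unravel_index(np.argmax(iou), iou.shape)
        tp += 1
        iou = np.delete(np.delete(iou, r, axis=0), c, axis=1)
    fn = n_gt - tp      # GT boxes never matched
    fp = n_pred - tp    # predictions never matched
    return tp, fp, fn

# The 2x2 example from the question (rows = GT boxes, columns = predictions).
iou = np.array([[0.00, 0.60],
                [0.10, 0.05]])
print(greedy_match(iou))   # -> (1, 1, 1): one TP, one FP, one FN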
You can use an algorithm (e.g. Hungarian algorithm aka Kuhn–Munkres algorithm aka Munkres algorithm) to assign detections to ground truths. You might incorporate the ability to not assign a detection to ground truth & vice versa (e.g. allow for false alarms and missed detections).
After assigning the detections to ground truths, just use the definition of TPR from the Wikipedia page for Sensitivity (aka TPR) and Specificity (aka TNR).
I provide this answer because I think @Prune's answer uses a Greedy algorithm to perform the assignment of detections to ground truths (i.e. "Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix."). This Greedy assignment method will not work well in all scenarios. For example, imagine the following matrix of IoU values between detections and ground-truth bounding boxes:
        det1   det2
pred1   0.4    0.0
pred2   0.6    0.4
The Greedy algorithm would assign pred2 to det1 and pred1 to det2 (or pred1 to nothing if accounting for possibility of false alarms). However, the Hungarian algorithm would assign pred1 to det1 and pred2 to det2, which might be better in some cases.
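For reference, here is a minimal sketch using scipy.optimize.linear_sum_assignment, which solves the same optimal assignment problem the Hungarian algorithm solves; the 0.3 threshold for discarding weak matches is just an assumed value for illustration:

import numpy as np
from scipy.optimize import linear_sum_assignment

# IoU matrix from the example above: rows = predictions, columns = ground truths.
iou = np.array([[0.4, 0.0],    # pred1 vs det1, det2
                [0.6, 0.4]])   # pred2 vs det1, det2
threshold = 0.3                # assumed IoU threshold for a valid match

# Maximize the total IoU of the assignment (optimal, not greedy).
pred_idx, gt_idx = linear_sum_assignment(iou, maximize=True)

# Keep only assignments that clear the threshold; the rest become FP/FN.
matches = [(int(p), int(g)) for p, g in zip(pred_idx, gt_idx)
           if iou[p, g] >= threshold]
print(matches)   # -> [(0, 0), (1, 1)], i.e. pred1->det1 and pred2->det2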
Consider the below scenario:
I have batches of data whose features and labels have similar distribution.
Say something like 4000000 negative labels and 25000 positive labels
As it's a highly imbalanced set, I have undersampled the negative labels so that my training set (taken from one of the batches) now contains 25000 positive labels and 500000 negative labels.
Now I am trying to measure the precision and recall from a test set after training (generated from a different batch)
I am using XGBoost with 30 estimators.
Now if I use all of the 40000000 negative labels, I get a worse precision-recall score (0.1 precision and 0.1 recall at a 0.7 threshold) than if I use a subset of, say, just 500000 negative labels (0.4 precision with 0.1 recall at a 0.3 threshold).
What could be a potential reason that this could happen?
A few of the thoughts that I had:
The features of the 500000 negative labels are vastly different from the rest of the overall 40000000 negative labels.
But when I plot the individual features, their central tendencies closely match those of the subset.
Are there any other ways to identify why I get a worse precision-recall score when the number of negative labels increases so much?
Are there any ways to compare the distributions?
Is my undersampled training a cause for this?
To understand this, we first need to understand how precision and recall are calculated. For this I will use the following variables:
P - total number of positives
N - total number of negatives
TP - number of true positives
TN - number of true negatives
FP - number of false positives
FN - number of false negatives
It is important to note that:
P = TP + FN
N = TN + FP
Now, precision is TP/(TP + FP)
recall is TP/(TP + FN), therefore TP/P.
Accuracy is (TP + TN)/(TP + TN + FP + FN), hence (TP + TN)/(P + N).
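To make these definitions concrete, here is a tiny sketch with made-up counts (the numbers are purely illustrative):

# Quick numeric check of the formulas above, using made-up counts.
TP, FN = 80, 20      # P = TP + FN = 100 positives
TN, FP = 700, 300    # N = TN + FP = 1000 negatives

P, N = TP + FN, TN + FP
precision = TP / (TP + FP)
recall    = TP / (TP + FN)          # = TP / P
accuracy  = (TP + TN) / (P + N)

print(precision, recall, accuracy)  # 0.2105..., 0.8, 0.7090...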
In your case, where the data is imbalanced, we have N >> P.
Now imagine some random model. We can usually say that for such a model accuracy is around 50%, but that is only true if the data is balanced. In your case there will tend to be more FPs and TNs than TPs and FNs, because a random selection of the data is more likely to return a negative sample.
So we can establish that the higher the fraction of negative samples N/(P + N), the more FPs and TNs we get. That is, whenever your model is not able to select the correct label, it picks a label more or less at random, and that label is mostly going to be negative.
Recall that FP appears in the denominator of precision. This means that precision also decreases with increasing N/(P + N).
For recall, neither FP nor TN appears in its formula, so it will likely not change much with increasing N/(P + N). As can be seen in your example, it clearly stays the same.
Therefore, I would try to make the data more balanced to get better results. A ratio of 1:1.5 should do.
You can also use a different metric like the F1 score that combines precision and recall to get a better understanding of the performance.
Also check some of the other points made here on how to combat imbalanced data.
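To illustrate the mechanism, here is a toy simulation with assumed per-sample detection rates (the TPR and FPR values are made up, not taken from your model); it shows precision collapsing as the negative pool grows while recall barely moves:

import numpy as np

rng = np.random.default_rng(0)

def eval_counts(n_pos, n_neg, tpr=0.10, fpr=0.001):
    # Assumed behaviour: the classifier flags a positive sample with
    # probability tpr and a negative sample with probability fpr.
    tp = rng.binomial(n_pos, tpr)
    fp = rng.binomial(n_neg, fpr)
    fn = n_pos - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    return precision, recall

for n_neg in (500_000, 40_000_000):
    p, r = eval_counts(25_000, n_neg)
    print(f"negatives={n_neg:>10,}  precision={p:.3f}  recall={r:.3f}")
# More negatives -> more false positives -> lower precision, while recall
# (which involves neither FP nor TN) stays roughly the same.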
I am currently learning Information Retrieval and I am rather stuck with an example of recall and precision.
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assume there are known to be 10 relevant documents in the search engine's index.
So... there are 20 results altogether, of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them, let's first define the types of results. A document in the collection can either be
classified correctly
a true positive (TP): a document which is relevant (positive) that was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) that was indeed NOT returned (true)
misclassified
a false positive (FP): a document which is not relevant (negative) but was returned (positive)
a false negative (FN): a document which is relevant (positive) but was NOT returned (negative)
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an Information Retrieval system it only makes sense to consider both precision and recall together. You can push precision toward 100% by returning only the few results you are most certain about (almost no spurious returned instance => almost no FP), or get a recall of 100% by returning every instance (no relevant document is missed => no FN).)
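Plugging the numbers from this example into the formulas:

# The example above: 20 returned documents, 10 of them relevant,
# and 10 relevant documents in the index in total.
returned, relevant_returned, relevant_total = 20, 10, 10

tp = relevant_returned
fp = returned - relevant_returned        # returned but not relevant
fn = relevant_total - relevant_returned  # relevant but not returned

precision = tp / (tp + fp)   # 0.5
recall    = tp / (tp + fn)   # 1.0
print(precision, recall)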
Well, this is an extension of my answer on recall at https://stackoverflow.com/a/63120204/6907424. First read about precision here and then go on to read about recall. Here I am only explaining precision, using the same example:
ExampleNo   Ground-truth   Model's Prediction
0           Cat            Cat
1           Cat            Dog
2           Cat            Cat
3           Dog            Cat
4           Dog            Dog
For now I am calculating precision for Cat. So Cat is our positive class and the rest of the classes (here only Dog) are the negative classes. Precision means: what percentage of the positive detections were actually positive? Here for Cat there are 3 detections by the model. But are all of them correct? No! Only 2 of them are correct (in examples 0 and 2) and one is wrong (in example 3). So the percentage of correct detections is 2 out of 3, which is (2 / 3) * 100% = 66.67%.
Now coming to the formulation, here:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive example then predicting something a cat when it is actually a cat.
FP (False positive): Predicting something as positive when it is not actually positive, i.e., saying "positive" falsely.
The number of correct detections of a certain class is the number of TPs for that class. But apart from them, the model also predicted some other examples as positive which were not actually positive, and these are the false positives (FP). So, correct or wrong, the total number of positive-class detections made by the model is TP + FP. The percentage of correct detections among all detections of that class is therefore TP / (TP + FP), which is the precision for that class.
As with recall, we can generalize this formula to any number of classes: take one class at a time, consider it the positive class and the rest of the classes the negative classes, and repeat the same process for every class to calculate its precision.
You can also calculate precision and recall in another way (basically another way of thinking about the same formulae). Say for Cat: first count how many examples have Cat in both the ground truth and the model's prediction (i.e., count the number of TPs). If you are calculating precision, divide this count by the number of "Cat"s in the model's predictions; for recall, divide by the number of "Cat"s in the ground truth. This works out to the same formulae for precision and recall. If you can't see why, think it over for a while and review what TP, FP, TN and FN actually mean.
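Here is that counting procedure sketched in Python for the table above (a minimal illustration, not a library function):

# Counting approach from the paragraph above, applied to the table.
gt   = ["Cat", "Cat", "Cat", "Dog", "Dog"]
pred = ["Cat", "Dog", "Cat", "Cat", "Dog"]

def precision_recall(cls, gt, pred):
    tp = sum(g == cls and p == cls for g, p in zip(gt, pred))
    predicted = sum(p == cls for p in pred)   # "cls" in the predictions
    actual    = sum(g == cls for g in gt)     # "cls" in the ground truth
    return tp / predicted, tp / actual

print(precision_recall("Cat", gt, pred))   # (0.666..., 0.666...)
print(precision_recall("Dog", gt, pred))   # (0.5, 0.5)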
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc
I made a network that predicts either 1 or 0. I'm now working on the ROC curve of that network, for which I have to find the TN, FN, TP, FP. When the output of my network is >= 0.5 with a desired output of 1, I classify it as a True Positive, and when it's >= 0.5 with a desired output of 0, I classify it as a False Positive. Is that the right thing to do? I just want to make sure my understanding is correct.
It all depends on how you are using your network, as True/False Positive/Negative is just a way of analysing the results of your classification, not the internals of the network. From what you have written, I assume that you have a network with one output node which can yield values in [0, 1]. If you use your model in such a way that a value bigger than 0.5 is treated as output 1 and otherwise as output 0, then yes, you are correct. In general, you should consider what the "interpretation" of your output is and simply use the definitions of TP, FN, etc., which can be summarized as follows:
               your network
                1        0
truth    1     TP       FN
         0     FP       TN
I referred to "interpretation" because in fact you are always using some function g(output) which returns the predicted class number. In your case it is simply g(output) = 1 iff output >= 0.5, but in a multi-class problem it would probably be g(output) = argmax(output), though it does not have to be; in particular, consider "draws" (when two or more neurons have the same value). For calculating True/False Positives/Negatives you should always consider only the final classification. As a result, you are measuring the quality of the model and the learning process as well as of this "interpretation" g.
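As a small illustration of this g(output) >= 0.5 interpretation and the resulting tallies, with made-up outputs and desired labels purely for illustration:

# Sketch of the g(output) = 1 iff output >= 0.5 interpretation.
outputs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]   # assumed network outputs
labels  = [1,   0,   1,   0,   1,   0  ]   # assumed desired outputs

preds = [1 if o >= 0.5 else 0 for o in outputs]   # g(output)

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
print(tp, fp, fn, tn)   # 2 1 1 2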
It should also be noted that the concept of a "positive" and a "negative" class is often ambiguous. In problems like detection of some object or event it is quite clear that "occurrence" is positive and "lack of occurrence" is negative, but in many others, like for example gender classification, there is no clear interpretation. In such cases one should carefully choose the metrics used, as some of them are biased towards positive (or negative) examples (for example, precision considers neither true nor false negatives).
What's the meaning of the recall of a classifier, e.g. a Bayes classifier? Please give an example.
For example, precision = correct / (correct + wrong) docs for the test data. How should I understand recall?
Recall literally is how many of the true positives were recalled (found), i.e. how many of the correct hits were also found.
Precision (your formula is incorrect) is how many of the returned hits were true positives, i.e. how many of the found hits were correct.
I found the explanation of Precision and Recall from Wikipedia very useful:
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 dogs identified, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
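The arithmetic behind both parts of that quote:

# The two examples from the quoted Wikipedia passage.
# Dogs in a photo: 8 identified, 5 of them real dogs, 12 dogs present.
dog_precision = 5 / 8           # 0.625
dog_recall    = 5 / 12          # 0.4166...

# Search engine: 30 pages returned, 20 relevant, 40 relevant pages missed.
se_precision = 20 / 30          # 2/3
se_recall    = 20 / (20 + 40)   # 1/3
print(dog_precision, dog_recall, se_precision, se_recall)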
Precision in ML is the same as in Information Retrieval.
recall = TP / (TP + FN)
precision = TP / (TP + FP)
(Where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative).
It makes sense to use these notations for a binary classifier; usually the "positive" is the less common class. Note that the precision/recall metrics are actually the specific #classes = 2 form of the more general confusion matrix.
Also, what you wrote as "precision" is actually accuracy, which is (TP + TN) / ALL.
Let me give you an example. Imagine we have a machine learning model which can detect cats vs dogs. The actual label, which is provided by a human, is called the ground truth.
Again the output of your model is called the prediction. Now look at the following table:
ExampleNo   Ground-truth   Model's Prediction
0           Cat            Cat
1           Cat            Dog
2           Cat            Cat
3           Dog            Cat
4           Dog            Dog
Say we want to find the recall for the class Cat. By definition, recall means the percentage of a certain class correctly identified (out of all the given examples of that class). So for the class Cat the model identified it correctly 2 times (in examples 0 and 2). But does that mean there are actually only 2 cats? No! In reality there are 3 cats in the ground truth (human labeled). So what is the percentage of correct identifications of this class? 2 out of 3, that is (2/3) * 100% = 66.67%, or 0.667 if you normalize it to within 1. There is another prediction of Cat in example 3, but it is not a correct prediction and hence we are not considering it.
Now coming to the mathematical formulation. First understand two terms:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive example then predicting something a cat when it is actually a cat.
FN (False negative): Predicting something as negative when it is actually positive.
Now for a certain class this classifier's output can be of two types: Cat or Dog (Not Cat). The number of correct identifications is the number of true positives (TP). The total number of examples of that class in the ground truth is TP + FN, because out of all the cats the model either detected them correctly (TP) or did not (FN, i.e. the model falsely said Negative (Not Cat) when it was actually positive (Cat)). So for a certain class, TP + FN denotes the total number of examples available in the ground truth for that class. So the formula is:
Recall = TP / (TP + FN)
Similarly, recall can be calculated for Dog as well. In that case, think of Dog as the positive class and Cat as the negative class.
So, for any number of classes, to find the recall of a certain class take that class as the positive class and the rest of the classes as the negative classes, and use the formula to find its recall. Repeat the process for each of the classes to find the recall for all of them.
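A minimal sketch of that per-class procedure for the table above:

# Per-class recall for the table above, following the procedure just described.
gt   = ["Cat", "Cat", "Cat", "Dog", "Dog"]
pred = ["Cat", "Dog", "Cat", "Cat", "Dog"]

for cls in ("Cat", "Dog"):
    tp = sum(g == cls and p == cls for g, p in zip(gt, pred))   # correct hits
    fn = sum(g == cls and p != cls for g, p in zip(gt, pred))   # missed ones
    print(cls, tp / (tp + fn))
# Cat 0.666..., Dog 0.5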
If you want to learn about precision as well then go here: https://stackoverflow.com/a/63121274/6907424
In very simple language: for example, in a series of photos showing politicians, how many times was the photo of politician XY correctly recognised as showing A. Merkel and not some other politician?
Precision accounts for how often ANOTHER person was wrongly recognised as her (false positives): (correct hits) / ((correct hits) + (false positives)).
Recall accounts for how many of the photos actually showing her were recognised ('recalled') at all: (correct hits) / ((correct hits) + (missed photos, i.e. false negatives)).