What's the meaning of recall of a classifier, e.g. bayes classifier? please give an example.
for example, the Precision = correct/correct+wrong docs for test data. how to understand recall?
Recall literally is how many of the true positives were recalled (found), i.e. how many of the correct hits were also found.
Precision (your formula is incorrect) is how many of the returned hits were true positive i.e. how many of the found were correct hits.
I found the explanation of Precision and Recall from Wikipedia very useful:
Suppose a computer program for recognizing dogs in photographs identifies 8 dogs in a picture containing 12 dogs and some cats. Of the 8 dogs identified, 5 actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20 of which were relevant while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3.
So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".
Precision in ML is the same as in Information Retrieval.
recall = TP / (TP + FN)
precision = TP / (TP + FP)
(Where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative).
It makes sense to use these notations for binary classifier, usually the "positive" is the less common classification. Note that the precision/recall metrics is actually the specific form where #classes=2 for the more general confusion matrix.
Also, your notation of "precision" is actually accuracy, and is (TP+TN)/ ALL
Giving you an example. Imagine we have a machine learning model which can detect cat vs dog. The actual label which is provided by human is called the ground-truth.
Again the output of your model is called the prediction. Now look at the following table:
ExampleNo Ground-truth Model's Prediction
0 Cat Cat
1 Cat Dog
2 Cat Cat
3 Dog Cat
4 Dog Dog
Say we want to find recall for the class cat. By definition recall means the percentage of a certain class correctly identified (from all of the given examples of that class). So for the class cat the model correctly identified it for 2 times (in example 0 and 2). But does it mean actually there are only 2 cats? No! In reality there are 3 cats in the ground truth (human labeled). So what is the percentage of correct identification of this certain class? 2 out of 3 that is (2/3) * 100 % = 66.67% or 0.667 if you normalize it within 1. Here is another prediction of cat in example 3 but it is not a correct prediction and hence, we are not considering it.
Now coming to mathematical formulation. First understand two terms:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive example then predicting something a cat when it is actually a cat.
FN (False negative): Predicting something negative when it is not actually negative.
Now for a certain class this classifier's output can be of two types: Cat or Dog (Not Cat). So the number correct identification is the number of True positive (TP). Again total number of examples of that class in ground-truth will be TP + FN. Because out of all cats the model either detected them correctly (TP) or didn't detect them correctly (FN i.e, the model falsely said Negative (Non Cat) when it was actually positive (Cat)). So For a certain class TP + FN denotes the total number of examples available in the ground truth of that class. So the formula is:
Recall = TP / (TP + FN)
Similarly recall can be calculated for Dog as well. At that time think the Dog as the positive class and the Cat as negative classes.
So for any number of classes to find recall of a certain class take the class as the positive class and take the rest of the classes as the negative classes and use the formula to find recall. Continue the process for each of the classes to find recall for all of them.
If you want to learn about precision as well then go here: https://stackoverflow.com/a/63121274/6907424
In very simple language: For example, in a series of photos showing politicians, how many times was the photo of politician XY was correctly recognised as showing A. Merkel and not some other politician?
precision is the ratio of how many times ANOTHER person was recognized (false positives) : (Correct hits) / (Correct hits) + (false positives)
recall is the ratio of how many times the name of the person shown in the photos was incorrectly recognized ('recalled'): (Correct calls) / (Correct calls) + (false calls)
Related
I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.
So I've got the following results from Naïves Bayes classification on my data set:
I am stuck however on understanding how to interpret the data. I am wanting to find and compare the accuracy of each class (a-g).
I know accuracy is found using this formula:
However, lets take the class a. If I take the number of correctly classified instances - 313 - and divide it by the total number of 'a' (4953) from the row a, this gives ~6.32%. Would this be the accuracy?
EDIT: if we use the column instead of the row, we get 313/1199 which gives ~26.1% which seems a more reasonable number.
EDIT 2: I have done a calculation of the accuracy of a in excel which gives me 84% as the accuracy, using the accuracy calculation shown above:
This doesn't seem right, as the overall accuracy of classification successfully is ~24%
No -- all you've calculated is tp/(tp+fn), the total correct identifications of class a, divided by the total of actual a examples. This is recall, not accuracy. You need to include the other two figures.
fp is the rest of the a column; tn is all of the other figures in the non-a rows and columns, the 6x6 sub-matrix. This will reduce all 35K+ trials to a 2x2 matrix with labels a and not a, the 2x2 confusion matrix with which you're already familiar.
Yes, you get to repeat that reduction for each of the seven features. I recommend doing it programmatically.
RESPONSE TO OP UPDATE
Your accuracy is that high: you have a huge quantity of true negatives, not-a samples that were properly classified as not-a.
Perhaps it doesn't feel right because our experience focuses more on the class in question. There are [other statistics that handle that focus.
Recall is tp / (tp+fn) -- of all items actually in class a, what percentage did we properly identify? This is the 6.32% figure.
Precision is tp / (tp + fp) -- of all items identified as class a, what percentage were actually in that class. This is the 26.1% figure you calculated.
I am doing a project to find out the disease associated genes using text mining. I am using 1000 articles for this. I got around 129 gene names. The actual dataset contains around 1000 entries. Now I would like to calculate the precision and recall of my method. When i did the comparison, out of the 129 genes, 72 were found to be correct. So the
precision = 72/129.
Is it correct?
Now how can I calculate the recall? Please help
The Wikipedia Article on Precision and Recall might help.
The definitions are:
Precision: tp / (tp+fp)
Recall: tp / (tp + fn)
Where tp are the true positives (genes which are associated with disease and you found them), fp are the false positives (genes you found but they actually aren't associated with the disease) and fn are the false negatives (genes which are actually associated with the disease but you didn't find them).
I am not quite sure what the numbers you have posted represent. Do you know the genes which are truly associated with the disease?
You have most likely computed the accuracy:
Accuracy = (tp + fp) / (Total Number)
The main issue was that the articles that i am considering might not contain all the originally listed gene names since its a small dataset. So while calculating the recall, instead of considering the denominator as 1000, I can compare the original database of genes with the articles to find out how many of the originally associated genes are present in the literature. i.e., if there are 1000 associated genes, I will check out of 1000 how many are there in the dataset I am considering. If it is 300, i will set the denominator as 300 instead of 1000. That will give recall.
I am currently learning Information retrieval and i am rather stuck with an example of recall and precision
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assuming there is known to be 10 relevant documents in the search engines index.
Soo... there is 20 searches all together of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them let's first define the types of results. A document in your returned list can either be
classified correctly
a true positive (TP): a document which is relevant (positive) that was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) that was indeed NOT returned (true)
misclassified
a false positive (FP): a document which is not relevant but was returned positive
a false negative (FN): a document which is relevant but was not returned negative
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an Information Retrieval system it only makes sense to consider both precision and recall. You can easily get a precision of 100% by returning no result at all (i.e. no spurious returned instance => no FP) or a recall of 100% by returning every instance (i.e. no relevant document was missed => no FN). )
Well, this is an extension of my answer on recall at: https://stackoverflow.com/a/63120204/6907424. First read about precision here and than go to read recall. Here I am only explaining Precision using the same example:
ExampleNo Ground-truth Model's Prediction
0 Cat Cat
1 Cat Dog
2 Cat Cat
3 Dog Cat
4 Dog Dog
For now I am calculating precision for Cat. So Cat is our Positive Class and the rest of the classes (Here Dog only) are the Negative Classes. Precision means what the percentage of positive detection was actually positive. So here for Cat there are 3 detection by the model. But are all of them correct? No! Out of them only 2 are correct (in example 0 and 2) and another is wrong (in example 3). So the percentage of correct detection is 2 out of 3 which is (2 / 3) * 100 % = 66.67%.
Now coming to the formulation, here:
TP (True positive): Predicting something positive when it is actually positive. If cat is our positive example then predicting something a cat when it is actually a cat.
FP (False positive): Predicting something as positive but which is not actually positive, i.e, saying something positive "falsely".
Now the number of correct detection of a certain class is the number of TP of that class. But apart from them the model also predicted some other examples as positives but which were not actually positives and so these are the false positives (FP). So irrespective of correct or wrong the total number of positive class detected by the model is TP + FP. So the percentage of correct detection of positive class among all detection of that class will be: TP / (TP + FP) which is the precision of the detection of that class.
Like recall we can also generalize this formula for any number of classes. Just take one class at a time and consider it as the positive class and the rest of the classes as negative classes and continue the same process for all of the classes to calculate precision for each of them.
You can calculate precision and recall in another way (basically the other way of thinking the same formulae). Say for Cat, first count how many examples at the same time have Cat in both Ground-truth and Model's prediction (i.e, count the number of TP). Therefore if you are calculating precision then divide this count by the number of "Cat"s in the Model's Prediction. Otherwise for recall divide by the number of "Cat"s in the Ground-truth. This works as the same as the formulae of precision and recall. If you can't understand why then you should think for a while and review what actually TP, FP, TN and FN means.
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc
I made a network that predicts either 1 or 0. I'm now working on the ROC Curve of that network where I have to find the TN, FN, TP, FP. When the output of my network is >= 0.5 with desired output of 1, I classified it under True Positive. And when it's >=0.5 with desired output of 0, I classified it under False Positive. Is that the right thing to do? Just wanna make sure if my understanding is correct.
It all depends on how you are using your network as the True/False Positive/Negative is just a form of analysing results of your classification, not the internals of the network. From what you have written I assume, that you have a network with one output node, which can yield values in the [0,1]. If you use your model in the way, that if this value is bigger then 0.5 then you assume the 1 output and 0 otherwise, then yes, you are correct. In general, you should consider what is the "interpretation" of your output and simply use the definition of TP, FN, etc. which can be summarized as follows:
your network
truth 1 0
1 TP FN
0 FP TN
I refered to "interpretation" as in fact you are always using some function g( output ), which returns the predicted class number. In your case, it is simply g( output ) = 1 iff output >= 0.5. but in multi class problem it would be probably g( output ) = argmax( output ), yet it does not have to, in particular - what about "draws" (when two or more neurons have the same value). For calculating True/False Positives/Negatives you should always only consider the final classification. And as a result, you are measuring the quality of the model, learning process as well as this "interpretation" g.
It should also be noted, that concept of "positive" and "negative" class is often ambiguous. In problems like detection of some object/event it is quite clear, that "occurence" is a positive event and "lack of" is negative, but in many others - like for example gender classification there is no clear interpretation. In such cases one should carefully choose used metrics, as some of them are biased towards positive (or negative) examples (for example precision do not consider neither true nor false negatives).