I couldn't find an answer to this question anywhere, so I hope someone here can help me, and also other people with the same problem.
Suppose that I have 1000 Positive samples and 1500 Negative samples.
Now, suppose that there are 950 True Positives (positive samples correctly classified as positive) and 100 False Positives (negative samples incorrectly classified as positive).
Should I use these raw numbers to compute the Precision, or should I consider the different group sizes?
In other words, should my precision be:
TruePositive / (TruePositive + FalsePositive) = 950 / (950 + 100) = 90.476%
OR should it be:
(TruePositive / 1000) / [(TruePositive / 1000) + (FalsePositive / 1500)] = 0.95 / (0.95 + 0.067) = 93.44%
In the first calculation, I used the raw counts without any consideration of the number of samples in each group, while in the second calculation, I used the proportion of each count relative to its group size, to remove the bias caused by the groups' different sizes.
Answering the stated question: by definition, precision is computed by the first formula: TP/(TP+FP).
However, that doesn't mean you have to use this formula, i.e. the precision measure. There are many other measures; look at the table on this wiki page and choose the one best suited for your task.
For example, positive likelihood ratio seems to be the most similar to your second formula.
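For concreteness, here is a minimal plain-Python sketch of both computations, using the numbers from the question:

tp, fp = 950, 100            # true positives, false positives from the example
n_pos, n_neg = 1000, 1500    # group sizes

precision = tp / (tp + fp)   # TP / (TP + FP) = 950 / 1050, about 0.90476

# The "group-size corrected" quantity from the second formula: it rescales TP and
# FP by their group sizes, i.e. it combines the true-positive rate with the
# false-positive rate, which is closer in spirit to the positive likelihood ratio.
tpr = tp / n_pos             # true-positive rate (recall) = 0.95
fpr = fp / n_neg             # false-positive rate, about 0.0667
corrected = tpr / (tpr + fpr)   # about 0.9344

print(precision, corrected)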
I'm currently developing a question plugin for an LMS that auto-grades answers based on the similarity between the answer and the answer key, using cosine similarity. Recently I found an algorithm called TS-SS that is claimed to be more accurate. However, its result ranges from 0 to infinity. Not being a machine learning person, I assumed the result is a distance, much like Euclidean distance, but I'm not sure. It may be some kind of geometric measure, because the algorithm calculates a triangle and a sector, but I'm not sure about that either.
So I took some examples from my notes and tried the conversion people usually suggest, S = 1 / (1 + D), but the result was not what I was looking for: with cosine similarity I got 0.77, but with TS-SS plus that equation I got 0.4. Then I found an SO answer that uses S = 1 / (1.1 ** D). When I tried that equation, sure enough it gave me a "relevant" result, 0.81. That is not far from the cosine similarity, and in my opinion it is better suited for auto-grading against the answer key than the 0.77 one.
But unfortunately, I don't know where that equation comes from, and googling it turned up nothing, which is why I'm asking this question.
How do I convert the TS-SS result to a similarity measure the right way? Is S = 1 / (1.1 ** D) enough, or...?
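For reference, here is a minimal sketch of the two conversions I tried (the base 1.1 is just the constant from that SO answer, not something derived from TS-SS):

def sim_reciprocal(d):
    # S = 1 / (1 + D): maps D = 0 to 1 and large D towards 0
    return 1.0 / (1.0 + d)

def sim_exponential(d, base=1.1):
    # S = 1 / (1.1 ** D): same idea, but decays much more slowly for small D
    return 1.0 / (base ** d)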
Edit:
When calculating TS-SS, cosine similarity is actually used as part of the calculation. So if the cosine similarity is 1, the TS-SS will be 0. But if the cosine similarity is 0, the TS-SS is not infinity. So I think it is reasonable to compare results from the two in order to decide which conversion formula to use:
TS-SS     Cosine similarity
38.19     0
7.065     0.45
3.001     0.66
1.455     0.77
0.857     0.81
0.006     0.80
0         1
Another random comparison, from multiple answer keys:
TS-SS     Cosine similarity
36.89     0
9.818     0.42
7.581     0.45
3.910     0.63
2.278     0.77
2.935     0.75
1.329     0.81
0.494     0.84
0.053     0.75
0.011     0.80
0.003     0.98
0         1
Comparison from the same answer key:
TS-SS     Cosine similarity
38.11     0.71
4.293     0.33
1.448     0
1.203     0.17
0.527     0.62
Thank you in advance
With these new figures, the answer is simply that we can't give you an answer. The two functions give you distance measures based on metrics that appear to be different enough that we can't simply transform between TS-SS and CS. In fact, even if the two functions are continuous (which they should be, for comfortable use), the mapping between them is not a bijection (a one-to-one, invertible mapping).
For a smooth translation between the two, we need at least for the functions to be continuous and differentiable over the entire interval of application, so that a small change in the document results in a small change in the metric. We also need them to be monotonically related over the interval, such that a rise in TS-SS always corresponds to a drop in CS.
Your data tables show that we can't even craft such a transformation for a single document, let alone for the metrics in general.
The cited question was a much simpler problem: there, the OP already had a transformation with all of the desired properties; they needed only to alter the slope of the change and ensure the boundary behaviour.
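If you want to check this on your own data, here is a small sketch: sort the (TS-SS, CS) pairs by TS-SS and test whether CS only moves downwards. With the third table above it does not, which is exactly why no single function S = f(D) can reproduce CS:

def cs_decreases_with_tsss(pairs):
    # pairs: list of (ts_ss, cosine_similarity) tuples
    pairs = sorted(pairs)                      # sort by TS-SS, ascending
    cs = [c for _, c in pairs]
    # for a single mapping S = f(D) to exist, CS should only go down as TS-SS goes up
    return all(a >= b for a, b in zip(cs, cs[1:]))

# third table from the question ("comparison from the same answer key")
pairs = [(38.11, 0.71), (4.293, 0.33), (1.448, 0.0), (1.203, 0.17), (0.527, 0.62)]
print(cs_decreases_with_tsss(pairs))           # False, so no single f(D) can fit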
I implemented a cosine-theta function which calculates the similarity between two articles. If two articles are very similar, their words should overlap quite a bit. However, a cosine-theta score of 0.54 does not by itself mean "related" or "not related". I want to end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, but I would have to find the optimal parameters to give to such functions, and I do not know whether they are satisfactory solutions. I was thinking: I have the cosine-theta score, I can also calculate the percentage of overlap between the two articles (e.g. the number of overlapping words divided by the number of words in the article), and maybe some more interesting features. Then, with the data, I could maybe fit a function (what type of function I do not know, and that is part of the question!), after which I can minimize the error via the SciPy library. This means that I would be doing some sort of supervised learning, and I am willing to label article pairs (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings (self.word_count returns a word -> count dict).
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Ratio between the overlap and the (shorter) article length
# (since 1 overlapping word out of 2 words is more important
# than 4 overlapping words in articles of 492 words).
p = len(v3) / min(len(v1), len(v2)) if v3 else 0.0
numerator = sum(v1[w] * v2[w] for w in v3)
w1 = sum(v1[w] ** 2 for w in v1)
w2 = sum(v2[w] ** 2 for w in v2)
denominator = math.sqrt(w1) * math.sqrt(w2)  # requires `import math`
# Calculate the cosine similarity.
if not denominator:
    return 0.0
return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings "match", unless you have a labelled data set. If you have a labelled data set (i.e. a set of string pairs, each with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or an SVM due to the potentially high-dimensional, categorical nature of your problem.
Even the optimisation, however, involves a subjective choice of metric. For example, suppose you have a model which, out of 100 samples, only makes 1 prediction (giving 99 unknowns). Technically, if that one prediction is correct, that is a model with 100% precision, but very low recall. Generally in machine learning you will find a trade-off between precision and recall.
Some people like to use metrics which combine the two (the most famous being the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I only want to target consumers who are likely to buy my product. If, however, we are testing for a deadly disease or markers for bank fraud, then it is acceptable for the test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as the distance to a centroid, to decide which cluster (either the "related" or the "unrelated" cluster) a point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
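If you do end up labelling pairs, a minimal supervised sketch could look like the following. scikit-learn's logistic regression is used here purely as a simple stand-in for the SVM or neural net mentioned above, and the feature values and labels are made up:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labelled data: each row is [cosine_score, overlap_ratio_p]; label 1 = related.
X = np.array([[0.91, 0.40], [0.77, 0.30], [0.25, 0.10], [0.12, 0.05]])
y = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X, y)

# New, unlabelled article pairs:
new_pairs = np.array([[0.54, 0.20], [0.85, 0.35]])
print(clf.predict(new_pairs))        # hard 0/1 labels
print(clf.predict_proba(new_pairs))  # probabilities, if you prefer picking a threshold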
Consider the below scenario:
I have batches of data whose features and labels have similar distribution.
Say something like 40000000 negative labels and 25000 positive labels.
As it's a highly imbalanced set, I have undersampled the negative labels so that my training set (taken from one of the batches) now contains 25000 positive labels and 500000 negative labels.
Now I am trying to measure the precision and recall on a test set (generated from a different batch) after training.
I am using XGBoost with 30 estimators.
Now if I use all 40000000 negative labels, I get a worse precision-recall score (0.1 precision and 0.1 recall at a 0.7 threshold) than if I use a subset of just 500000 negative labels (0.4 precision with 0.1 recall at a 0.3 threshold).
What could be a potential reason that this could happen?
A few of the thoughts that I had:
The features of the 500000 negative labels are vastly different from the rest of the overall 40000000 negative labels.
But when I plot the individual features, their central tendencies closely match those of the subset.
Are there any other ways to identify why I get a lower and worse precision-recall score when the number of negative labels increases so much?
Are there any ways to compare the distributions?
Is my undersampled training a cause for this?
To understand this, we first need to understand how precision and recall are calculated. For this I will use the following variables:
P - total number of positives
N - total number of negatives
TP - number of true positives
TN - number of true negatives
FP - number of false positives
FN - number of false negatives
It is important to note that:
P = TP + FN
N = TN + FP
Now, precision is TP/(TP + FP)
recall is TP/(TP + FN), therefore TP/P.
Accuracy is (TP + TN)/(TP + TN + FP + FN), hence (TP + TN)/(P + N).
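As a sanity check, these definitions translate directly into code (a plain-Python sketch, independent of any library):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)                      # equivalently TP / P

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)     # equivalently (TP + TN) / (P + N)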
In your case, where the data is imbalanced, we have that N >> P.
Now imagine some random model. We can usually say that for such a model the accuracy is around 50%, but that is only if the data is balanced. In your case, there will tend to be more FPs and TNs than TPs and FNs, because a random selection of the data is more likely to return a negative sample.
So we can establish that the larger the proportion of negative samples N/(P + N), the more FP and TN we get. That is, whenever your model is not able to select the correct label, it will pick a random label out of P and N, and that is mostly going to be N.
Recall that FP appears in the denominator of precision. This means that precision also decreases with increasing N/(P + N).
Recall, on the other hand, has neither FP nor TN in its formula, so it will likely not change much with increasing N/(P + N). As can be seen in your example, it indeed stays the same.
Therefore, I would try to make the data more balanced to get a better result. A ratio of 1:1.5 should do.
You can also use a different metric like the F1 score, which combines precision and recall, to get a better understanding of the performance.
Also check some of the other points made here on how to combat imbalanced data.
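To make the effect concrete, here is a small simulated sketch. The per-class rates are hypothetical and held fixed while the number of negatives grows; only precision collapses:

# Hypothetical fixed per-class rates: the model finds 80% of positives (recall)
# and wrongly flags 1% of negatives. Only the number of negatives N changes.
P = 25000
tpr, fpr = 0.80, 0.01

for N in (500000, 4000000, 40000000):
    tp = tpr * P
    fp = fpr * N
    fn = P - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(N, round(precision, 3), round(recall, 3), round(f1, 3))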
I am doing a project to find disease-associated genes using text mining. I am using 1000 articles for this. I got around 129 gene names. The actual dataset contains around 1000 entries. Now I would like to calculate the precision and recall of my method. When I did the comparison, out of the 129 genes, 72 were found to be correct. So the
precision = 72/129.
Is it correct?
Now how can I calculate the recall? Please help
The Wikipedia Article on Precision and Recall might help.
The definitions are:
Precision: tp / (tp+fp)
Recall: tp / (tp + fn)
Where tp are the true positives (genes which are associated with disease and you found them), fp are the false positives (genes you found but they actually aren't associated with the disease) and fn are the false negatives (genes which are actually associated with the disease but you didn't find them).
I am not quite sure what the numbers you have posted represent. Do you know the genes which are truly associated with the disease?
If you do, then your 72/129 is indeed the precision: tp = 72 and tp + fp = 129 (the genes you found). Note that the accuracy would instead be (tp + tn) / (total number considered), which is a different quantity.
The main issue was that the articles I am considering might not contain all of the originally listed gene names, since it is a small dataset. So, while calculating the recall, instead of using 1000 as the denominator, I can compare the original database of genes with the articles to find out how many of the originally associated genes are actually present in the literature. That is, if there are 1000 associated genes, I will check how many of them appear in the dataset I am considering; if it is 300, I will set the denominator to 300 instead of 1000. That will give the recall.
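In code, that adjusted calculation might look like this (a sketch; 300 is the hypothetical count of ground-truth genes that actually appear in the articles, as discussed above):

found = 129            # genes extracted by the method
correct = 72           # of those, genes that really are disease-associated (TP)
in_articles = 300      # hypothetical: ground-truth genes that occur in the 1000 articles

precision = correct / found          # 72 / 129, about 0.558
recall = correct / in_articles       # 72 / 300 = 0.24 (instead of 72 / 1000)
print(precision, recall)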
I am currently learning information retrieval and I am rather stuck on an example of recall and precision:
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assume there are known to be 10 relevant documents in the search engine's index.
So there are 20 results altogether, of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them, let's first define the types of results. A document in your returned list can either be
classified correctly:
a true positive (TP): a document which is relevant (positive) and was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) and was indeed NOT returned (true)
or misclassified:
a false positive (FP): a document which is not relevant (negative) but was returned anyway (false)
a false negative (FN): a document which is relevant (positive) but was NOT returned (false)
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an Information Retrieval system, it only makes sense to consider precision and recall together. You could push precision towards 100% by returning only the single result you are most certain about (hardly any spurious returns => almost no FP), or get a recall of 100% by returning every document in the index (no relevant document is missed => no FN).)
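As a small sketch of the same arithmetic, using made-up document IDs:

retrieved = set(range(1, 21))     # the 20 returned documents (IDs are made up)
relevant = set(range(1, 11))      # the 10 relevant documents, all among the retrieved

tp = len(retrieved & relevant)    # 10
precision = tp / len(retrieved)   # 10 / 20 = 0.5
recall = tp / len(relevant)       # 10 / 10 = 1.0
print(precision, recall)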
Well, this is an extension of my answer on recall at https://stackoverflow.com/a/63120204/6907424. First read about precision here, and then go there to read about recall. Here I am only explaining precision, using the same example:
Example No.    Ground truth    Model's prediction
0              Cat             Cat
1              Cat             Dog
2              Cat             Cat
3              Dog             Cat
4              Dog             Dog
For now I am calculating precision for Cat, so Cat is our positive class and the rest of the classes (here only Dog) are the negative classes. Precision means: what percentage of the positive detections were actually positive? Here the model makes 3 detections of Cat. But are all of them correct? No! Only 2 of them are correct (examples 0 and 2) and one is wrong (example 3). So the percentage of correct detections is 2 out of 3, which is (2 / 3) * 100% = 66.67%.
Now coming to the formulation, here:
TP (True positive): predicting something as positive when it is actually positive. If Cat is our positive class, then predicting that something is a cat when it actually is a cat.
FP (False positive): predicting something as positive when it is not actually positive, i.e. saying something is positive "falsely".
Now the number of correct detections of a certain class is the number of TP of that class. But apart from those, the model also predicted some other examples as positive which were not actually positive; these are the false positives (FP). So, correct or not, the total number of detections of the positive class is TP + FP, and the percentage of correct detections among all detections of that class is TP / (TP + FP), which is the precision for that class.
Like recall, we can generalize this formula to any number of classes: just take one class at a time, consider it the positive class and the rest of the classes the negative classes, and repeat the same process for every class to calculate its precision.
You can also calculate precision and recall in another way (basically the other way of thinking about the same formulae). Say for Cat: first count how many examples have Cat in both the ground truth and the model's prediction at the same time (i.e. count the number of TP). Then, for precision, divide this count by the number of "Cat"s in the model's predictions; for recall, divide it by the number of "Cat"s in the ground truth. This works out to exactly the same formulae for precision and recall. If you can't see why, think for a while and review what TP, FP, TN and FN actually mean.
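That counting view translates directly into code; here is a small sketch using the Cat/Dog table above:

ground_truth = ["Cat", "Cat", "Cat", "Dog", "Dog"]   # from the table above
predictions  = ["Cat", "Dog", "Cat", "Cat", "Dog"]

def precision_for(cls, truth, pred):
    tp = sum(t == cls and p == cls for t, p in zip(truth, pred))
    detected = sum(p == cls for p in pred)           # TP + FP for this class
    return tp / detected if detected else 0.0

def recall_for(cls, truth, pred):
    tp = sum(t == cls and p == cls for t, p in zip(truth, pred))
    actual = sum(t == cls for t in truth)            # TP + FN for this class
    return tp / actual if actual else 0.0

print(precision_for("Cat", ground_truth, predictions))   # 2 / 3, about 0.667
print(recall_for("Cat", ground_truth, predictions))      # 2 / 3, about 0.667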
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc