I have looked at the questions already asked about how to calculate precision and recall, but I am still confused. There are specific questions that come up in interviews that I am not sure how to approach.
Question-1:
A binary classifier correctly filters 90% of spam emails but misclassifies 5% of non-spam emails as spam. Give the classifier's:
- recall
- precision
- accuracy,
or explain why there's not enough information to know.
For the above example, my solution would be:
TP = 90%
FP = 5%
Recall = TP / (TP + FN) => cannot be calculated without knowing FN
Precision = TP / (TP + FP) => 0.90 / (0.90 + 0.05) ≈ 0.95
Accuracy = (TP + TN) / (TP + TN + FP + FN) => cannot be calculated without knowing TN & FN
Question-2:
1 percent of Uber transactions are fraud
We have a model that:
When a transaction is fraud, it classifies it as fraud 99% of the time
When a transaction is not fraud, it classifies it as not fraud 99% of the time
What is the precision and recall of this model?
For the above example, my solution would be:
Let's say we have 1,000 daily transactions
number of positive classes (number of frauds): 10
number of negative classes (number of non-frauds): 990
TP = 10 * 99% = 9.9
TN = 990 * 99% = 980.1
FN = 10 * 1% = 0.1
FP = 990 * 1% = 9.9
Then use these values in the formulas below:
recall = TP / (TP + FN)
precision = TP / (FP + TP)
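For concreteness, the same arithmetic as a small Python sketch (1,000 daily transactions is just the assumed count from above):
# Rough sketch: build the confusion-matrix counts from the base rate and
# the per-class accuracy, then apply the formulas.
n = 1000            # assumed number of transactions
fraud_rate = 0.01
tp = n * fraud_rate * 0.99        # frauds caught:       9.9
fn = n * fraud_rate * 0.01        # frauds missed:       0.1
tn = n * (1 - fraud_rate) * 0.99  # non-frauds passed:   980.1
fp = n * (1 - fraud_rate) * 0.01  # non-frauds flagged:  9.9

recall = tp / (tp + fn)           # 0.99
precision = tp / (tp + fp)        # 9.9 / 19.8 = 0.5
print(recall, precision)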
Can you please help me clarify an approach for these kinds of questions?
Thank you all!
I am trying to do a multiclass classification problem (with 3 labels) using softmax regression.
This is my first rough implementation with gradient descent and backpropagation (without regularization or any advanced optimization algorithm), containing only 1 layer.
Also, when the learning rate is big (>0.003) the cost becomes NaN; on decreasing the learning rate the cost function works fine.
Can anyone explain what I'm doing wrong?
import numpy as np

# X is (13,177) dimensional
# y is (3,177) dimensional with labels 0/1
m = X.shape[1]  # 177
W = np.random.randn(3, X.shape[0]) * 0.01  # (3,13)
b = 0
cost = 0
alpha = 0.0001  # seems too small to me, but for bigger values the cost becomes NaN
for i in range(100):
    Z = np.dot(W, X) + b
    t = np.exp(Z)
    add = np.sum(t, axis=0)
    A = t / add
    loss = -np.multiply(y, np.log(A))
    cost += np.sum(loss) / m
    print('cost after iteration', i+1, 'is', cost)
    dZ = A - y
    dW = np.dot(dZ, X.T) / m
    db = np.sum(dZ) / m
    W = W - alpha * dW
    b = b - alpha * db
This is what I get:
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
.............................................................
...............*up to 100 iterations*.................
.............................................................
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well, after some time I figured it out.
First of all, the cost was increasing because of this:
cost += np.sum(loss)/m
The plus sign is not needed here: it keeps adding in the cost computed at every previous iteration, which is not what we want. Accumulating like this is generally only needed in mini-batch gradient descent, to compute the total cost over each epoch.
Secondly, the learning rate was too big for this problem, which is why the cost was overshooting the minimum and becoming NaN.
I looked at my code and found that my features had very different ranges (one was from -1 to 1 and another from -5000 to 5000), which prevented the algorithm from using larger learning rates.
So I applied feature scaling:
var = np.var(X, axis=1, keepdims=True)
X = X / var
Now the learning rate can be much bigger (<= 0.001).
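Putting the two fixes together, a minimal corrected version of the loop might look like this. The random data is just a stand-in for the real X and y (which I don't have), and I've also subtracted the row-wise max before exponentiating, a standard softmax stabilization that is not in the original code:
import numpy as np

# Stand-in data with the same shapes as in the question: 13 features, 177 examples, 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((13, 177))
y = np.eye(3)[rng.integers(0, 3, 177)].T          # (3,177) one-hot labels

X = X / np.var(X, axis=1, keepdims=True)          # feature scaling, as described above

m = X.shape[1]
W = rng.standard_normal((3, X.shape[0])) * 0.01
b = 0.0
alpha = 0.001

for i in range(100):
    Z = np.dot(W, X) + b
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # subtract max for numerical stability
    A = t / np.sum(t, axis=0)
    cost = np.sum(-y * np.log(A)) / m             # '=' not '+=': cost of this iteration only
    dZ = A - y
    W -= alpha * np.dot(dZ, X.T) / m
    b -= alpha * np.sum(dZ) / m
print('final cost:', cost)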
I am looking at the equations for PPV and Sensitivity
and I got this:
PPV = TP / (TP+FN)
and
Sensitivity = TP / (TP+FN)
Which means both are the same!
So do we have two names for the same thing?
And how come the F1 Score is
F1 Score = 2*PPV*S / (PPV+S)
Can we rewrite the F1 Score as
F1 Score = 2*PPV*PPV / (PPV+PPV) = 2*PPV*PPV / (2*PPV) = PPV !!
Are they all the same?
It seems there is some condition or something I am missing here.
Can someone please explain what I am missing?
Using medical diagnosis as an example:
sensitivity is the proportion testing positive among all those who actually have the disease.
Sensitivity = TP/(TP+FN) = TPR
While,
PPV is the probability that a patient actually has the disease given a positive test result.
PPV = TP/(TP+FP) which is definitely NOT equal to TP/(TP+FN)!
Regarding F1:
F1 is the harmonic mean of precision and sensitivity. In the confusion matrix, one is normalized by column and the other by row. Precision is synonymous with PPV, while sensitivity is synonymous with TPR.
F1 = 2*PPV*TPR / (PPV+TPR)
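A tiny numeric check with made-up counts shows they are different quantities:
# Hypothetical confusion-matrix counts, just to show the formulas differ.
TP, FP, FN = 80, 20, 40

sensitivity = TP / (TP + FN)   # TPR = 80/120 ≈ 0.67
ppv = TP / (TP + FP)           # precision = 80/100 = 0.80
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
print(sensitivity, ppv, f1)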
Suppose we have a sinusoid with frequency 100 Hz and a sampling frequency of 1000 Hz. That means our signal has 100 periods per second and we take 1000 samples per second. Therefore, in order to select one complete period I have to take fs/f = 10 samples. Right?
What if the sampling frequency is not an integer multiple of the signal frequency (say 550 Hz)? Do I have to find the least common multiple M of f and fs, and then take M samples?
My goal is to select an integer number of periods in order to be able to replicate them without changes.
You have f periods a second, and fs samples a second.
If you take M samples, it would cover M/fs of a second, or P = f * (M/fs) periods. You want this number to be an integer.
So you need to take M = fs / gcd(f, fs) samples.
For your example, M = 1000 / gcd(100, 1000) = 1000 / 100 = 10.
If you have a 60 Hz frequency and an 80 Hz sampling frequency, it gives M = 80 / gcd(60, 80) = 80 / 20 = 4 -- 4 samples will cover 4 * 1/80 = 1/20 of a second, and that will be 3 periods.
If you have 113 Hz frequency and 512 Hz sampling frequency, you are out of luck, since gcd(113, 512) = 1 and you'll need 512 samples, covering the whole second and 113 periods.
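If it helps, the same calculation as a small Python sketch (just the gcd arithmetic described above):
from math import gcd

def samples_for_whole_periods(f, fs):
    # Smallest number of samples M such that M/fs covers an integer number of periods of f.
    M = fs // gcd(f, fs)
    periods = f * M // fs
    return M, periods

print(samples_for_whole_periods(100, 1000))  # (10, 1)
print(samples_for_whole_periods(60, 80))     # (4, 3)
print(samples_for_whole_periods(113, 512))   # (512, 113)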
In general, an arbitrary frequency will not give an integer number of periods in an integer number of samples, and a frequency that is an irrational fraction of the sample rate will never repeat exactly. So some means other than concatenating buffers one period in length will be needed to synthesize exactly periodic waveforms of arbitrary frequencies. Approximation by interpolation for fractional phase offsets is one possibility.
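One common alternative (not necessarily the interpolation approach mentioned above) is a phase-accumulator oscillator, sketched below; it synthesizes any frequency without concatenating period-length buffers:
import numpy as np

def synth_sine(f, fs, n_samples, phase0=0.0):
    # Accumulate phase sample by sample; works for any f, whether or not an
    # integer number of periods fits in an integer number of samples.
    phase = phase0 + 2 * np.pi * f / fs * np.arange(n_samples)
    return np.sin(phase)

x = synth_sine(113, 512, 2048)   # 113 Hz at fs = 512 Hz, no period-length buffer needed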
Someone will answer a series of questions and will mark each as important (I), very important (V), or extremely important (E). I'll then match their answers with answers given by everyone else, compute the percentage of answers in each bucket that are the same, and then combine the percentages to get a final score.
For example, I answer 10 questions, marking 3 as extremely important, 5 as very important, and 2 as important. I then match my answers with someone else's, and they answer the same to 2/3 extremely important questions, 4/5 very important questions, and 2/2 important questions. This results in percentages of 66.66 (extremely important), 80.00 (very important), and 100.00 (important). I then combine these 3 percentages to get a final score, but I first weigh each percentage to reflect the importance of each bucket. So the result would be something like: score = E * 66.66 + V * 80.00 + I * 100.00. The values of E, V, and I (the weights) are what I'm trying to figure out how to calculate.
The following are the constraints present:
1 + X + X^2 = X^3
E >= X * V >= X^2 * I > 0
E + V + I = 1
E + 0.9 * V >= 0.9
0.9 > 0.9 * E + 0.75 * V >= 0.75
E + I < 0.75
When combining the percentages, I could give important a weight of 0.0749, very important a weight of 0.2501, and extremely important a weight of 0.675, but this seems arbitrary, so I'm wondering how to calculate the optimal value for each weight. Also, how do I calculate the optimal weights if I ignore all constraints?
As for what I mean by optimal: while adhering to the last 4 constraints, I want the weight of each bucket to be the maximum possible value, while keeping the weights as far apart as possible (extremely important questions weighted maximally more than very important questions, and very important questions weighted maximally more than important questions).
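For what it's worth, here is a small Python sketch that only checks the example weights above against the last four constraints (the compound one is split into two checks) and computes the combined score; it does not solve for optimal weights:
# Candidate weights from the example above.
I, V, E = 0.0749, 0.2501, 0.675

checks = {
    "E + V + I = 1":            abs(E + V + I - 1) < 1e-9,
    "E + 0.9*V >= 0.9":         E + 0.9 * V >= 0.9,
    "0.9 > 0.9*E + 0.75*V":     0.9 > 0.9 * E + 0.75 * V,
    "0.9*E + 0.75*V >= 0.75":   0.9 * E + 0.75 * V >= 0.75,
    "E + I < 0.75":             E + I < 0.75,
}
for name, ok in checks.items():
    print(name, ok)

# Combining the per-bucket match percentages into a final score:
score = E * 66.66 + V * 80.00 + I * 100.00
print(score)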
I am familiar with determining the extent to which documents in our knowledge base match a search query document (based on cosine distance) once the features are available: we map both onto the vector space defined by the features.
How do I handle the reverse? I have been given a set of documents and their match scores against multiple query documents, and I have to determine the features (or the decision criteria that determine a match). This would be the training data, and the model would be used to identify matches in our knowledge base for new search queries.
Our current approach is to think up a set of features and see which combinations get the best match scores on the training set, but we will end up trying many combinations. Is there a better way to do this?
Here is a simple and straightforward approach (a linear model) that should work.
If you are working with documents and queries, the features you are using are probably tokens (words), n-grams, or topics. Let's assume the features are words for simplicity.
Suppose you have a query document:
apple iphone6
and you have a set of documents and their corresponding match scores against the above query:
(Assume the documents are the contents of the URLs.)
www.apple.com (Apple - iPhone 6) score: 0.8
www.cnet.com/products/apple-iphone-6 (Apple iPhone 6 review), score: 0.75
www.stuff.tv/apple/apple-iphone-6/review (Apple iPhone 6 review), score: 0.7
....
Per-query model
First you need to extract word features from the matching URLs. Suppose we get words and their L1-normalized TF-IDF scores:
www.apple.com
apple 0.5
iphone 0.4
ios8 0.1
www.cnet.com/products/apple-iphone-6
apple 0.4
iphone 0.2
review 0.2
cnet 0.2
www.stuff.tv/apple/apple-iphone-6/review
apple 0.4
iphone 0.4
review 0.1
cnet 0.05
stuff 0.05
Second, you can combine the feature scores and match scores and aggregate them on a per-feature basis (each term is a document's feature score times that document's match score):
w(apple) = 0.5 * 0.8 + 0.4 * 0.75 + 0.4 * 0.7 = 0.98
w(iphone) = 0.4 * 0.8 + 0.2 * 0.75 + 0.4 * 0.7 = 0.75
w(ios8) = 0.1 * 0.8 = 0.08
w(review) = 0.2 * 0.75 + 0.1 * 0.7 = 0.22
w(cnet) = 0.2 * 0.75 + 0.05 * 0.7 = 0.185
w(stuff) = 0.05 * 0.7 = 0.035
You might want to add a normalization step that divides each w by the number of documents. You then get the features below, ordered by relevance in descending order:
w(apple) = 0.98 / 3
w(iphone) = 0.75 / 3
w(review) = 0.22 / 3
w(cnet) = 0.185 / 3
w(ios8) = 0.08 / 3
w(stuff) = 0.035 / 3
You even get a linear scoring model by using those weights:
score = w(apple) * tf-idf(apple) + w(iphone) * tf-idf(iphone) + ... + w(stuff) * tf-idf(stuff)
Suppose now you have a new URL with these features detected:
ios8: 0.5
cnet: 0.3
iphone:0.2
You can then calculate its match score with respect to the query "apple iphone6":
score = w(ios8)*0.5 + w(cnet)*0.3 + w(iphone)*0.2
      = (0.08*0.5 + 0.185*0.3 + 0.75*0.2) / 3
The match score can then be used to rank documents by their relevance to the same query.
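If it helps to see the mechanics, here is a minimal sketch of this per-query model in Python; the dictionaries simply mirror the toy numbers above, and the helper names are mine:
# Toy training data for one query: per-document TF-IDF features and match scores.
docs = {
    "www.apple.com":                            {"apple": 0.5, "iphone": 0.4, "ios8": 0.1},
    "www.cnet.com/products/apple-iphone-6":     {"apple": 0.4, "iphone": 0.2, "review": 0.2, "cnet": 0.2},
    "www.stuff.tv/apple/apple-iphone-6/review": {"apple": 0.4, "iphone": 0.4, "review": 0.1,
                                                 "cnet": 0.05, "stuff": 0.05},
}
match_scores = {
    "www.apple.com": 0.8,
    "www.cnet.com/products/apple-iphone-6": 0.75,
    "www.stuff.tv/apple/apple-iphone-6/review": 0.7,
}

# Aggregate per-feature weights: sum of (feature score * match score), normalized by #docs.
w = {}
for url, feats in docs.items():
    for feat, tfidf in feats.items():
        w[feat] = w.get(feat, 0.0) + tfidf * match_scores[url]
n_docs = len(docs)

def score(new_doc_feats):
    # Linear model: weighted sum of the new document's TF-IDF features.
    return sum(w.get(f, 0.0) * v for f, v in new_doc_feats.items()) / n_docs

print(score({"ios8": 0.5, "cnet": 0.3, "iphone": 0.2}))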
Any-query model
You do the same thing to construct a linear model for each query. If you have k such queries and their matching documents in your training data, you end up with k such models, each constructed from one query.
model(apple iphone6) = (0.98*apple + 0.75*iphone + 0.22*review + ...) / 3
model(android apps) = (0.77*google + 0.5*android + ...) / 5
model(samsung phone) = (0.5*samsung + 0.2*galaxy + ...) / 10
Note that in the example models above, 3, 5, and 10 are the normalizers (the total number of documents matched to each query).
Now a new query comes in; suppose it is:
samsung android release
The remaining tasks are to:
find the relevant queries q1, q2, ..., qm
use the query models to score new documents and aggregate the results.
You first need to extract features from this query; suppose also that you have already cached the features for each query you have learned. Using any nearest-neighbor approach (e.g., locality-sensitive hashing), you can find the top-k queries most similar to "samsung android release"; they might be:
similarity(samsung phone, samsung android release) = 0.2
similarity(android apps, samsung android release) = 0.2
Overall Ranker
Thus we get our final ranker as:
0.2*model(samsung phone) + 0.2*model(android apps) =
0.2 * (0.5*samsung + 0.2*galaxy + ...) / 10 +
0.2 * (0.77*google + 0.5*android + ...) / 5
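A sketch of this combination step, reusing hypothetical per-query weight dictionaries of the kind built above (the similarity values are the ones assumed in the example):
# Hypothetical per-query models: (weights, normalizer = number of matched docs).
models = {
    "samsung phone": ({"samsung": 0.5, "galaxy": 0.2}, 10),
    "android apps":  ({"google": 0.77, "android": 0.5}, 5),
}
# Similarities of the stored queries to the new query "samsung android release".
similarity = {"samsung phone": 0.2, "android apps": 0.2}

def final_score(doc_feats):
    # Weighted sum of the per-query model scores, weighted by query similarity.
    total = 0.0
    for q, (w, n_docs) in models.items():
        q_score = sum(w.get(f, 0.0) * v for f, v in doc_feats.items()) / n_docs
        total += similarity[q] * q_score
    return total

print(final_score({"samsung": 0.6, "android": 0.4}))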
Usually in these information-retrieval applications you will have built an inverted index from features (words) to documents, so the final ranker can be evaluated very efficiently over the top documents.
Reference
For details, please refer to the IND algorithm in Omid Madani et al. Learning When Concepts Abound.