I want to classify faults and no-fault conditions for a device. Label A for fault and label B for no-fault.
scikit-learn gives me the following classification report:
   precision    recall  f1-score   support
A       0.82      0.18      0.30      2565
B       0.96      1.00      0.98     45100
Now, which of the A or B results should I use to describe how the model performs?
Introduction
There is no single score that universally describes a model; it all depends on your objective. In your case you are dealing with fault detection, so you are interested in finding faults among a much greater number of non-fault cases. The same logic applies to, say, screening a population for individuals carrying a pathogen.
In such cases it is typically very important to have high recall (also known as sensitivity) for the "fault" cases (or, in the medical analogy, for "you might be ill"). In this kind of screening it is usually acceptable to flag as "fault" something that actually works fine - that is your false positive. Why? Because the cost of missing a faulty part in an engine, or a tumor, is much greater than the cost of asking an engineer or a doctor to verify the case.
Solution
Assuming that this holds in your case (recall for faults is the most important metric), you should be looking at the recall for label A (faults). By that standard your model is doing rather poorly: it finds only 18% of the faults. Much of this likely stems from the fact that the number of faults is roughly 18x smaller than the number of non-faults, which introduces a heavy bias that needs to be tackled.
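If it helps to pin the numbers down, here is a minimal sketch (with made-up toy labels, not your data) showing how the per-class recall in that report corresponds to scikit-learn's recall_score:

from sklearn.metrics import recall_score

# Toy labels standing in for your data: 'A' = fault, 'B' = no-fault.
y_true = ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']
y_pred = ['A', 'B', 'B', 'B', 'B', 'B', 'B', 'B']

# Recall for the fault class only - the number your report shows as 0.18.
print(recall_score(y_true, y_pred, pos_label='A'))  # 1 of 3 real faults found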
I can think of a number of scenarios where this score would not actually be bad. If you can detect 18% of all engine faults (on top of other systems) without introducing false alarms, that can be really useful - you don't want to fire an alarm at the driver too often while everything is fine. At the same time, you likely don't want to use the same logic for, say, cancer detection and tell a patient "everything's OK" when there is a very high risk that the diagnosis is wrong.
Metrics
For the sake of completeness, I will explain the terms. Consider these definitions:
tp - true positive (a real fault, correctly detected)
tn - true negative (correctly recognized as not a fault)
fp - false positive (detected as a fault, while it's actually OK)
fn - false negative (classified as OK, while it's actually a fault)
Here is one article that attempts to nicely explain what precision, recall and F1 are.
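To show how those counts turn into the numbers in the report, here is a small sketch; the tp/fp/fn values are made up, chosen only so that they roughly reproduce the class-A row above:

# Precision, recall and F1 computed from illustrative counts.
tp, fp, fn = 462, 101, 2103   # roughly consistent with precision 0.82, recall 0.18 for class A

precision = tp / (tp + fp)    # of everything flagged as a fault, how much really is one
recall    = tp / (tp + fn)    # of all real faults, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # ~0.82, ~0.18, ~0.30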
Related
So I know what Precision and Recall stand for.
Precision penalizes false positives, while recall penalizes false negatives. In the end it is the cost objective of the business that should be taken into account. For a hospital, for example, you might want an algorithm with high recall (low false negatives), because the cost of missing a malignant tumor is higher than the cost of doing more investigation on the false alarms.
But what are still considered decent precision/recall values? Say I have a binary classification algorithm with a precision of 0.34 but a recall of 0.98. Even if the business objective favors avoiding false negatives (high recall), is it OK to accept an algorithm that achieves high recall but such poor precision?
Note: I have a severe class imbalance problem, with around 99% of observations in class 0 and just under 1% in class 1.
This is highly dependent on the context, but let's assume this classifier detects malignant tumors at very early stages, where they are quite hard to detect.
For the purpose of this analysis, let's consider two scenarios with two different assumptions.
Scenario 1: The system will be used as a quick filtering phase on a huge number of people, to quickly dismiss those who are not suspected of having a tumor
In that case, this model with 0.98 recall will rarely let a person with a tumor slip by undetected, and that is the main purpose of the system: it is merely a quick filtering phase meant to dismiss a considerable portion of the population, since the follow-up inspections are quite costly and time consuming.
I would say this system would do pretty well in that scenario.
Scenario 2: The system will be used to diagnose people with tumors who will be directly enrolled in costly treatment programs
In this fictional scenario, the system is meant to be very confident and precise about those it classifies as having a tumor, since there is no post-filtering phase after this one and the treatments are both costly and potentially quite harmful for those who don't actually have cancer.
In that case, this model would perform terribly for the purpose it was meant for.
So it is totally dependent on the case. In scenario 1 it is totally fine to go with low precision as long as the recall is very high; of course, the higher the precision the better, but only provided the recall does not fall below a certain threshold.
For scenario 2, on the other hand, a very high precision is expected even if the recall is low: 0.99 precision with 0.05 recall is totally fine in that scenario.
UPDATE 1
Regarding the class imbalance your dataset suffers from: this could be a direct cause of the poor precision for the under-sampled class. Have you tried using a weighted loss, where the under-sampled class gets a higher weight? That should help balance the class effect during training.
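For example, most scikit-learn classifiers accept a class_weight argument; here is a minimal sketch (the LogisticRegression model and the synthetic data are placeholders, not a statement about what you should use):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Placeholder data with roughly 1% positives, mirroring a heavy imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (rng.random(5000) < 0.01).astype(int)

# 'balanced' reweights each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
clf = LogisticRegression(class_weight='balanced').fit(X, y)

# The same weights, computed explicitly:
print(compute_class_weight('balanced', classes=np.array([0, 1]), y=y))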
There are many techniques that can be used to handle imbalanced datasets; you can read more about them here.
I read that when using CNNs, we should have an approximately equal number of samples per class. I am doing binary classification, detecting pedestrians against background, so the two classes are pedestrian and background (anything that is not a pedestrian, really).
If I were to incorporate hard negative mining in my training, I would end up with more negative samples than positive if I am getting a lot of false positives.
1) Would this be okay?
2) If not then how do I solve this issue?
3) And what are the consequences of training a CNN with more negative than positive samples?
4) If it is okay to have more negative than positive samples, is there a maximum ratio that I should not exceed? For example, should I avoid having more than 3x as many negative samples as positive ones?
5) I can augment my positive samples by jittering, but how many additional samples per image should I create? Is there such a thing as too many? For example, if I start off with 2,000 positive samples, how many additional samples is too many? Is generating a total of 100k samples from the 2k via jittering too much?
It depends on which cost function you use, but if it is the log loss then I can show you intuitively how an unbalanced dataset may harm your training and what the possible solutions to this problem are:
a. If you don't change the distribution of your classes and leave them unbalanced, then, provided your model is able to achieve a relatively small loss value, it will not only be a good detector of pedestrians in an image but will also learn that pedestrian detection is a relatively rare event, which may save you from a lot of false positives. So if you are able to spend a lot more time training a bigger model, this may bring you really good results.
b. If you change the distribution of your classes, then you can probably achieve relatively good results with a much smaller model in a shorter time, but on the other hand, because your classifier will have learned a different distribution, you may get a lot of false positives.
But if the training phase of your classifier doesn't take too long, you may find a good compromise between these two approaches. You can treat the multiplication factor (e.g. whether you increase the number of positive samples by 2, 3 or n times) as a meta-parameter and optimise its value, e.g. with a grid search, as sketched below.
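To make that last point concrete, here is a rough sketch of treating the duplication factor for the rare class as a tunable knob; a small scikit-learn model and synthetic data stand in for your CNN and your real training/validation split:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data just to exercise the loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=4000) > 3).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

for factor in (1, 2, 4, 8):                 # the "multiplication factor" meta-parameter
    pos = y_tr == 1
    X_bal = np.concatenate([X_tr] + [X_tr[pos]] * (factor - 1))
    y_bal = np.concatenate([y_tr] + [y_tr[pos]] * (factor - 1))
    model = LogisticRegression().fit(X_bal, y_bal)
    # Evaluate on the untouched validation split and keep the best factor.
    print(factor, log_loss(y_va, model.predict_proba(X_va)[:, 1]))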
I am reading about precision and recall in machine learning.
Question 1: When are precision and recall inversely related? That is, when does the situation occur where you can improve your precision but at the cost of lower recall, and vice versa? The Wikipedia article states:
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an obvious example of the tradeoff.
However, I have seen research experiment results where both precision and recall increase simultaneously (for example, as you use different or more features).
In what scenarios does the inverse relationship hold?
Question 2: I'm familiar with the precision and recall concept in two fields: information retrieval (e.g. "return the 100 most relevant pages out of a 1MM-page corpus") and binary classification (e.g. "classify each of these 100 patients as having the disease or not"). Are precision and recall inversely related in both of these fields, or only in one of them?
The inverse relation only holds when you have some parameter in the system that you can vary in order to get more or fewer results. Then there's a straightforward relationship: you lower the threshold to get more results, and among them some are TPs and some are FPs. This doesn't automatically mean that precision falls every time recall rises; the real relationship can be mapped using the ROC curve. As for Q2: likewise, in both of those tasks precision and recall are not necessarily inversely related.
So how do you increase recall or precision without hurting the other? Usually by improving the algorithm or model. That is, when you only change the parameters of a given model, the inverse relationship will usually hold, although you should keep in mind that it is usually also non-linear. But if you, for example, add more descriptive features to the model, you can increase both metrics at once.
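As a small sketch of what "varying a parameter" means in practice, here is how sweeping a decision threshold traces out the trade-off, using made-up labels and scores and scikit-learn's precision_recall_curve:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and a noisy score that is correlated with them.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
scores = y_true * 0.3 + rng.random(1000) * 0.7

# Each threshold on the score gives one (precision, recall) pair;
# sweeping the threshold traces out the trade-off curve.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for t, p, r in list(zip(thresholds, precision, recall))[::100]:
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')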
Regarding the first question, I interpret these concepts in terms of how restrictive your results must be.
If you're more restrictive, that is, more "demanding on the correctness" of the results, you want them to be more precise. For that, you might be willing to reject some correct results as long as everything you get is correct. Thus you are raising your precision and lowering your recall. Conversely, if you don't mind getting some incorrect results as long as you get all the correct ones, you are raising your recall and lowering your precision.
As for the second question, looking at it from the point of view of the paragraph above, I would say that yes, they are inversely related.
To the best of my knowledge, in order to increase both precision and recall, you will need either a better model (one more suitable for your problem) or better data (or, actually, both).
I am reading "Foundations of Statistical Natural Language Processing". It has the following statement about the relationship between information entropy and language models:
...The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models...
But what about this example?
Suppose we have a machine that spits out two characters, A and B, one at a time. The designer of the machine gave A and B equal probability.
I am not the designer, and I try to model the machine through experiments.
In an initial experiment, I see the machine spit out the following character sequence:
A, B, A
So I model the machine as P(A)=2/3 and P(B)=1/3. And the entropy of this model is:
-(2/3)*log2(2/3) - (1/3)*log2(1/3) = 0.918 bits
But then the designer tells me about his design, so I refine my model with this additional information. The new model looks like this:
P(A)=1/2, P(B)=1/2
And the entropy of this new model is:
-(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1 bit
The second model is obviously better than the first one. But the entropy increased.
My point is, due to the arbitrariness of the model being tried, we cannot blindly say a smaller entropy indicates a better model.
Could anyone shed some light on this?
ADD 1
(Many thanks to Rob Neuhaus!)
Yes, after re-digesting the NLP book mentioned above, I think I can explain it now.
What I calculated is actually the entropy of the language model distribution. It cannot be used to evaluate the effectiveness of a language model.
To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real symbol encountered, the language model gives a probability p, and we use -log(p) to quantify the surprise. We then average the total surprise over a long enough sequence. So, for a 1000-character sequence with 500 A's and 500 B's,
the surprise given by the 2/3-1/3 model (P(A)=2/3, P(B)=1/3) is:
[-500*log2(2/3) - 500*log2(1/3)] / 1000 = (1/2)*log2(9/2) ≈ 1.085 bits
while the correct 1/2-1/2 model gives:
[-500*log2(1/2) - 500*log2(1/2)] / 1000 = (1/2)*log2(8/2) = 1 bit
So we can see that the 2/3-1/3 model gives more surprise, which indicates it is worse than the correct model.
Only when the sequence is long enough will this average mimic the expectation over the 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.
I didn't mention cross-entropy by name here since I think the jargon is intimidating and doesn't do much to reveal the root cause.
If you had a larger sample of data, it's very likely that the model that assigns 2/3 to A and 1/3 to B would do worse than the true model, which gives 1/2 to each. The problem is that your training set is too small, so you were misled into thinking the wrong model was better. I encourage you to experiment: generate a random string of length 10000 where each character is equally likely, then measure the cross entropy of the 2/3-1/3 model vs. the 1/2-1/2 model on that much longer string. I am sure you will see that the latter performs better. Here is some sample Python code demonstrating the fact.
from math import log
import random

def cross_entropy(prediction_probability_seq):
    # Average surprise, -log2(p), over the probabilities the model assigned
    # to the symbols that actually occurred.
    probs = list(prediction_probability_seq)
    return -sum(log(p, 2) for p in probs) / len(probs)

def predictions(seq, model):
    # Yield the probability the model gives to each observed symbol.
    for item in seq:
        yield model[item]

# A long random string in which 'a' and 'b' are equally likely.
rand_char_seq = [random.choice(['a', 'b']) for _ in range(10000)]

def print_ent(m):
    print('cross entropy of', m, cross_entropy(predictions(rand_char_seq, m)))

print_ent({'a': 0.5, 'b': 0.5})
print_ent({'a': 2.0 / 3, 'b': 1.0 / 3})
Notice that if you add an extra 'a' to the choices (so that 'a' really is more likely), then the second model (which is now closer to the true distribution) gets a lower cross entropy than the first.
However, one other thing to consider is that you really want to measure the likelihood on held out data that you didn't observe during training. If you do not do this, more complicated models that memorize the noise in the training data will have an advantage over smaller/simpler models that don't have as much ability to memorize noise.
One real problem with likelihood as measuring language model quality is that it sometimes doesn't perfectly predict the actual higher level application error rate. For example, language models are often used in speech recognition systems. There have been improved language models (in terms of entropy) that didn't drive down the overall system's word error rate, which is what the designers really care about. This can happen if the language model improves predictions where the entire recognition system is already confident enough to get the right answer.
I created a heuristic (an ANN, but that's not important) to estimate the probabilities of an event (the results of sports games, but that's not important either). Given some inputs, this heuristic tells me the probabilities of the event - something like: given these inputs, team B has a 65% chance to win.
I have a large set of input data for which I now know the result (games previously played). Which formula/metric could I use to quantify the accuracy of my estimator?
The problem I see is this: if the estimator says the event has a probability of 20% and the event actually does occur, I have no way to tell whether my estimator is right or wrong. Maybe it's wrong and the event was more likely than that. Maybe it's right: the event had about a 20% chance of occurring and did occur. Maybe it's wrong and the event really had a very low chance of occurring, say 1 in 1000, but happened to occur this time.
Fortunately I have lots of this actual test data, so there is probably a way to use it to qualify my heuristic.
Anybody got an idea?
There are a number of measurements that you could use to quantify the performance of a binary classifier.
Do you care whether or not your estimator (ANN, e.g.) outputs a calibrated probability or not?
If not, i.e. all that matters is the rank ordering, then the area under the ROC curve (AUROC) is a pretty good summary of the model's performance. Others are the KS statistic and lift. There are many in use, and they emphasize different facets of performance.
If you care about calibrated probabilities, then the most common metrics are the "cross entropy" (also known as the Bernoulli likelihood, the typical measure used in logistic regression) and the "Brier score". The Brier score is nothing other than the mean squared error between the continuous predicted probabilities and the binary actual outcomes.
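Both are one-liners in scikit-learn; a small sketch with made-up outcomes and predicted probabilities:

from sklearn.metrics import brier_score_loss, log_loss

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # actual outcomes (made up)
p_hat  = [0.9, 0.2, 0.65, 0.8, 0.3, 0.1, 0.55, 0.4]    # predicted win probabilities (made up)

print('Brier score  :', brier_score_loss(y_true, p_hat))  # mean squared error of the probabilities
print('Cross entropy:', log_loss(y_true, p_hat))          # a.k.a. negative log-likelihood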
Which is the right thing to use depends on the ultimate application of the classifier. For example, your classifier may estimate probability of blowouts really well, but be substandard on close outcomes.
Usually the true metric you are trying to optimize is "dollars made". That's often hard to represent mathematically, but starting from it is your best shot at coming up with an appropriate and computationally tractable metric.
In a way it depends on the decision function you are using.
In the case of a binary classification task (predicting whether an event occurred or not [ex: win]), a simple implementation is to predict 1 if the probability is greater than 50%, 0 otherwise.
If you have a multiclass problem (predicting which one of K events occurred [ex: win/draw/lose]), you can predict the class with the highest probability.
And the way to evaluate your heuristic is to compute the prediction error by comparing the actual class of each input with the prediction of your heuristic for that instance.
Note that you would usually divide your data into train/test parts to get better (unbiased) estimates of the performance.
Other evaluation tools exist, such as ROC curves, which are a way to depict performance with regard to true/false positives.
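As a sketch of that decision rule and the resulting prediction error (the probabilities and labels below are made up):

import numpy as np

p_win  = np.array([0.65, 0.20, 0.55, 0.80, 0.35])    # predicted probability of a win
actual = np.array([1, 0, 0, 1, 1])                    # what really happened

pred = (p_win > 0.5).astype(int)       # binary decision rule: predict "win" above 50%
print(pred, np.mean(pred != actual))   # prediction error = fraction of mismatches

# Multiclass (win/draw/lose): pick the class with the highest probability.
probs = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
print(probs.argmax(axis=1))            # -> [0 2]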
As you stated, if you predict that an event has a 20% chance of happening - and an 80% chance of not happening - observing a single isolated event will not tell you how good or poor your estimator is. However, if you had a large sample of events for which you predicted 20% success, but observed that 30% of them succeeded over that sample, you could begin to suspect that your estimator is off.
One approach is to group your events by predicted probability of occurrence, observe the actual frequency within each group, and measure the difference. For instance, depending on how much data you have, group all events where you predict a 20% to 25% chance of occurrence, compute the actual frequency of occurrence in that group, and do the same for each group, measuring the difference each time. This should give you a good idea of whether your estimator is biased, and possibly for which ranges it's off.
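That grouping is exactly what a reliability (calibration) curve does; here is a rough sketch using scikit-learn's calibration_curve on simulated labels and probabilities:

import numpy as np
from sklearn.calibration import calibration_curve

# Simulated predictions, with outcomes drawn so that they are well calibrated.
rng = np.random.default_rng(0)
p_hat = rng.random(5000)
y_true = (rng.random(5000) < p_hat).astype(int)

# Bin the predictions and compare mean predicted probability vs. observed frequency per bin.
frac_pos, mean_pred = calibration_curve(y_true, p_hat, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f'predicted ~{mp:.2f}  observed {fp:.2f}')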