I am training my own Haar cascade classifier. I have 2139 positive images but 16000 negative images. Is this right? With these numbers I get a negative numPos, because:
numPos <= (positive samples - negative samples) / (1 + (number of stages - 1) * (1 - minHitRate))
so:
(2139 - 16000) / (1 + (17 - 1) * (1 - 0.995)) = -12834
Is this normal?
No, numPos has nothing to do with your negative samples. numPos is the number of positives you want to use in each stage. This must be a bit lower than your total number of positive samples, because you lose all the false negatives (= positive samples which are falsely no longer detected by the classifier) in each stage.
For example, if you set numPos to 1000 and minHitRate to 0.999, you lose up to 1 positive sample (1000 - 1000*0.999) in each stage. So if you want to train 2 stages you'll need up to 1001 samples when choosing numPos = 1000.
For 20 stages I roughly choose numPos to be 90% of my positive samples, although that is too pessimistic for minHitRate 0.999 (it fits 0.995 quite well, as far as I recall). There is a formula in the OpenCV Q&A if you want to compute the best/maximum safe value; a rough sketch of that calculation is shown below.
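Here is a minimal sketch of that calculation in Python, assuming the commonly cited relation vec-file samples >= numPos + (numStages - 1) * (1 - minHitRate) * numPos + skipped, and ignoring the skipped-sample count (which you cannot know in advance):

    # Rough upper bound for numPos given the total number of samples in the
    # .vec file; ignores the samples skipped by earlier stages.
    def max_num_pos(total_positives, num_stages, min_hit_rate):
        return int(total_positives / (1 + (num_stages - 1) * (1 - min_hit_rate)))

    print(max_num_pos(2139, 17, 0.995))  # ~1980 with the numbers from the question

In practice you would pick a somewhat smaller value than this bound to leave room for skipped samples.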
I am wondering why the number of images has no influence on the number of iterations when training. Here is an example to make my question clearer:
Suppose we have 6400 images for a training to recognize 4 classes. Based on AlexeyAB's explanations, we keep batch = 64 and subdivisions = 16, and write max_batches = 8000, since max_batches is determined by #classes x 2000.
Since we have 6400 images, a complete epoch requires 100 iterations. Therefore this training ends after 80 epochs.
Now suppose that we have 12800 images. In that case, an epoch needs 200 iterations, so the training ends after 40 epochs.
Since an epoch refers to one cycle through the full training dataset, I'm wondering why we don't increase the number of iterations when our dataset grows, in order to keep the number of epochs constant.
Said differently, I'm asking for a simple explanation of why the number of epochs seems to be irrelevant to the quality of the training. I suspect it's a consequence of YOLO's construction, but I am not knowledgeable enough to understand how.
Why does the number of images have no influence on the number of iterations when training?
In darknet YOLO, the number of iterations depends on the max_batches parameter in the .cfg file. After running for max_batches iterations, darknet saves the final weights.
In each epoch, all the data samples are passed through the network, so if you have many images, the training time for one epoch will be higher because an epoch contains more iterations; you can test that by adding images to your data.
The subdivisions parameter sets the number of mini-batches per batch. Say you have 100 images in your dataset, your batch size is 10, subdivisions is 2, and max_batches is 20.
So in each iteration, 10 images are passed to the network in two mini-batches (each having 5 samples); once you have done 20 batches (20*10 data samples), the training is complete. (The details can differ a little; I'm using a slightly modified darknet by the original author pjreddie.)
The instructions have since been updated: max_batches is equal to classes*2000, but not less than the number of training images and not less than 6000. Please find it at this link.
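To make the relationship concrete, here is a small sketch in plain Python (using the numbers from the question; the function name is just for illustration) of how the epoch count falls out of max_batches, batch and dataset size:

    # In darknet the number of iterations is fixed by max_batches;
    # the number of epochs is a consequence, not a .cfg parameter.
    def epochs_trained(num_images, batch, max_batches):
        iterations_per_epoch = num_images / batch
        return max_batches / iterations_per_epoch

    print(epochs_trained(6400, 64, 8000))   # 80.0 epochs
    print(epochs_trained(12800, 64, 8000))  # 40.0 epochs

Doubling the dataset halves the number of epochs, because the iteration budget stays at max_batches.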
Consider the below scenario:
I have batches of data whose features and labels have similar distribution.
Say something like 4000000 negative labels and 25000 positive labels
As it is a highly imbalanced set, I have undersampled the negative labels so that my training set (taken from one of the batches) now contains 25000 positive labels and 500000 negative labels.
Now I am trying to measure the precision and recall on a test set (generated from a different batch) after training.
I am using XGBoost with 30 estimators.
Now if I use all 40000000 negative labels, I get a worse precision-recall score (0.1 precision and 0.1 recall at a 0.7 threshold) than if I use a subset of just 500000 negative labels (0.4 precision with 0.1 recall at a 0.3 threshold).
What could be a potential reason that this could happen?
A few of the thoughts that I had:
The features of the 500000 negative labels are vastly different from the rest of the overall 40000000 negative labels.
But when I plot the individual features, their central tendencies closely match those of the subset.
Are there any other ways to identify why I get a lower, worse precision-recall when the number of negative labels increases so much?
Are there any ways to compare the distributions?
Is my undersampled training set a cause of this?
To understand this, we first need to understand how precision and recall are calculated. For this I will use the following variables:
P - total number of positives
N - total number of negatives
TP - number of true positives
TN - number of true negatives
FP - number of false positives
FN - number of false negatives
It is important to note that:
P = TP + FN
N = TN + FP
Now, precision is TP/(TP + FP)
recall is TP/(TP + FN), therefore TP/P.
Accuracy is (TP + TN)/(TP + TN + FP + FN), hence (TP + TN)/(P + N).
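As a quick sanity check, here is a minimal sketch of these definitions in Python (the counts are made up purely to illustrate the formulas):

    # Metric definitions written out directly from the counts above.
    def metrics(tp, tn, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)                     # = TP / P
        accuracy = (tp + tn) / (tp + tn + fp + fn)  # = (TP + TN) / (P + N)
        return precision, recall, accuracy

    print(metrics(tp=50, tn=900, fp=40, fn=10))  # (0.556, 0.833, 0.95)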
In your case, where the data is imbalanced, we have that N >> P.
Now imagine some random model. We can usually say that for such a model accuracy is around 50%, but that holds only if the data is balanced. In your case, there will tend to be more FPs and TNs than TPs and FNs, because a random selection of the data is more likely to return a negative sample.
So we can establish that the higher the fraction of negative samples N/(P+N), the more FPs and TNs we get. That is, whenever your model is not able to select the correct label, it will pick a random label out of P and N, and that is mostly going to be N.
Recall that FP appears in the denominator of precision. This means that precision also decreases with increasing N/(P+N).
Recall, on the other hand, involves neither FP nor TN, so it will likely not change much with increasing N/(P+N). As can be seen in your example, it indeed stays roughly the same.
Therefore, I would try to make the data balanced to get a better result. A ratio of 1:1.5 should do.
You can also use a different metric like the F1 score that combines precision and recall to get a better understanding of the performance.
Also check some of the other points made here on how to combat imbalanced data.
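As an illustration of the rebalancing suggestion, here is a minimal sketch (the 1:1.5 ratio comes from the answer above; the use of NumPy and a 0/1 label encoding are assumptions):

    import numpy as np

    def undersample_negatives(X, y, ratio=1.5, seed=0):
        """Keep all positives (y == 1) and a random 1:ratio subset of negatives (y == 0)."""
        rng = np.random.default_rng(seed)
        pos_idx = np.where(y == 1)[0]
        neg_idx = np.where(y == 0)[0]
        keep_neg = rng.choice(neg_idx, size=int(len(pos_idx) * ratio), replace=False)
        keep = np.concatenate([pos_idx, keep_neg])
        rng.shuffle(keep)
        return X[keep], y[keep]

    # After retraining on the rebalanced set, judge the model with
    # sklearn.metrics.f1_score rather than plain accuracy.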
If I have a trained binary classifier, what is the probability of making a correct prediction by chance?
For example, lets say that I want to make 5 predictions. What is the probability of getting all 5 predictions correct by chance?
Is it: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = 0.0313 ?
You are correct, but only under the assumption that both classes are equally probable.
As a similar thought experiment: if you have a model with 99% accuracy (meaning that for any randomly chosen sample it will provide the correct label 99% of the time), it still does not have a high probability of getting all samples correct. For 100 samples it is just about 36%, for 300 it is less than 5%, and for 1000 it is about 0.004%.
In general, the probability of many independent events all happening falls very quickly (exponentially) with the number of events if the probability of each individual success is constant.
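A short sketch in Python to reproduce those numbers:

    # Probability of getting every one of n independent predictions right
    # when each individual prediction is right with probability p.
    for p, n in [(0.5, 5), (0.99, 100), (0.99, 300), (0.99, 1000)]:
        print(f"p={p}, n={n}: {p ** n:.6f}")

    # p=0.5, n=5: 0.031250
    # p=0.99, n=100: 0.366032
    # p=0.99, n=300: 0.049041
    # p=0.99, n=1000: 0.000043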
I have a binary classification problem and I'm trying to get a precision-recall curve for my classifier. I use libsvm with the RBF kernel and the probability estimate option.
To get the curve I'm changing the decision threshold from 0 to 1 in steps of 0.1. But on every run I get high precision, even though recall decreases as the threshold increases. My false positive count always seems low compared to the true positives.
My results are these:
Threshold   TP    FP   FN    Precision   Recall
0.1         393    1    49   0.997462    0.889140
0.2         393    5    70   0.987437    0.848812
0.3         354    4    78   0.988827    0.819444
0.4         377    9   104   0.976684    0.783784
0.5         377    5   120   0.986911    0.758551
0.6         340    4   144   0.988372    0.702479
0.7         316    5   166   0.984424    0.655602
0.8         253    2   227   0.992157    0.527083
0.9         167    2   354   0.988166    0.320537
Does this mean I have a good classifier, or do I have a fundamental mistake somewhere?
One of the reasons for this could be that your training data has many more negative samples than positive ones. Hence, almost all examples get classified as negative except a few, which gives you high precision (few false positives) and low recall (many false negatives).
Edit:
Now that we know you have more negative samples than positive ones:
If you look at the results, as you increase the threshold the number of false negatives increases, i.e. more of your positive samples get classified as negative, which is not a good thing. Again, it depends on your problem: some problems prefer high precision over recall, others prefer high recall over precision. If you want both precision and recall to be high, you may need to address the class imbalance, for example by oversampling (repeating positive samples so that the ratio becomes 1:1), undersampling (taking a random subset of negative samples in proportion to the positives), or something more sophisticated like the SMOTE algorithm (which adds synthetic samples similar to the existing positives).
Also, I am sure there must be a "class_weight" parameter in the classifier, which gives more importance to errors in the class with fewer training examples. You might want to try giving more weight to the positive class than to the negative one.
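For instance, if you call libsvm through scikit-learn (an assumption; standalone libsvm exposes the same idea via its -wi weight options), a class-weighted RBF classifier and a threshold sweep could look roughly like this:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import precision_recall_curve

    # Small synthetic imbalanced problem just to make the sketch runnable.
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # class_weight='balanced' penalizes errors on the rare positive class more.
    clf = SVC(kernel="rbf", probability=True, class_weight="balanced")
    clf.fit(X_train, y_train)

    scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    precision, recall, thresholds = precision_recall_curve(y_test, scores)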
High precision can also simply mean that your data has a pattern your model grasps easily, in which case it really is a good classifier.
But maybe your measures are computed incorrectly, or, most probably, your model is overfitting; that is, your model is not learning but rather memorizing.
This can happen, for instance, if you test your model on your training set.
I am trying to build a classifier to detect faces in thermal images, so I tried training Haar, LBP and HOG classifiers. I am working with OpenCV 2.4.8 on Windows.
opencv_traincascade.exe -data haarcascades -vec pos.vec -bg neg.txt -numPos 250 -numStages 24 -numNeg 900 -w 24 -h 24
I have 307 positive samples in total. The negative samples are of size 75x75. In each of the three cases the training gets stuck at a particular stage: earlier for Haar (stage 12) and later for LBP (stage 14/15). I reduced the number of negatives (down to 200), but that only means the training gets stuck at a later stage. The training hasn't progressed for 2 days. No negatives are being consumed, and the command window looks like this:
===== TRAINING 14-stage =====
<BEGIN
POS count : consumed 255 : 262
Also
What do POS count consumed and NEG count consumed signify?
When I reduce the minHitRate to, say, 0.7, why does the number of POS consumed increase?
Please let me know what I am doing wrong.
Thanks.
I had a similar problem myself. The thing is that the classifier at each stage takes those negative examples which are classified as positive by the previous stages. So what happens is that none of the negative samples are classified as positive any more, and the code goes into an infinite loop trying to find one. I solved this by changing the source code so that the algorithm terminates when it can't find any such negative example and just uses the stages trained so far for the classifier.
If you don't want to change the code, try adding more negative examples or reducing the number of stages.
Count consumed is the number of positive and negative images that are used in each stage. And you need to use more positive and negative images, around 1000 positives and 2000 negatives, to get a good result.