I would like to use the Haar classifier to detect the presence of vehicles in a scene (trying with only cars so far). Since I have not found many trained XML files online, I decided to generate my own.
I found some image sets of vehicles that have been used for similar purposes (training computer vision algorithms) and used these to create my own XML files. It has been almost a week and some of them have finished, so I tried using them but the results were terrible. The classifiers I found online worked decently, at least it appears they are trying to detect vehicles and work fast enough for real-time application (maybe 5-10 FPS or so).
Whereas mine can take several minutes to analyze a frame using detectMultiScale() using the same parameters, and if I pass different parameters (e.g. increase min size, decrease max size, increase scaling factor) it will work faster (maybe 1 FPS) but detects absolutely nothing of note, never detects any vehicle and randomly detects some spots of asphalt as a vehicle.
Where did I go wrong in generating my files? I have limited time to complete this task and these classifiers can take a whole week to train so I have very few attempts remaining. For reference, my methodology is (following this tutorial):
-Take all positive and negative images; if no negative images supplied, take negative images from another data set, at least as many negatives as positives
-Generate as many samples as the number of positives
-Use same parameters as suggested, except image size (set to the size of images in a given data set), and nstages (set to 10 because 20 takes far too long)
-For the npos parameter, I use 1/10th the number of samples, using the full number of samples resulted in "assertion failed" after a few hours, apparently the number of samples cannot be the same as the npos according to this so I gave myself a safety margin.
TL;DR Haar classifier I trained myself performs much worse than one found online (in terms of time and accuracy), need advice on how to improve it and not waste another week training it.
There are two problems here. One, the accuracy of the classifier is low. The other, the classifier runs too slow.
There seems to be no problem with the reference that you used. The steps seem accurate, and I have personally tried them in that order and managed to get good results.
As #Micka mentions, nPos around 90% of the original sample count is good enough. minHitRate is a parameter that you can change. Did you observe the numbers that are displayed while training? How was the accuracy improving, and did your classifier stop training (or are you using the trained parameters before learning ends?)?
For the low speed in detection, the most likely reason is that your training data did not have simple features to learn quickly. Did you trying detection on the data that you used for training? How were the results in that case? Compiler settings or high image resolution can be a problem too, but if you tried the same inputs and settings with other classifiers, this is unlikely.
If you like tor try a different approach (and have a GPU), YOLO V2 should be much faster and more accurate for this task.
Related
Deep learning has been a revolution recently and its success is related with the huge amount of data that we can currently manage and the generalization of the GPUs.
So here is the problem I'm facing. I know that deep neural nets have the best performance, there is no doubt about it. However, they have a good performance when the number of training examples is huge. If the number of training examples is low it is better to use a SVM or decision trees.
But what is huge? what is low? In this paper of face recognition (FaceNet by Google) they show the performance vs the flops (which can be related with the number of training examples)
They used between 100M and 200M training examples, which is huge.
My question is:
Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.
My question is: Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.
The short answer is no. You do not have this kind of knowledge, furthermore you will never have. These kind of problems are impossible to solve, ever.
What you can have are just some general heuristics/empirical knowledge, which will say if it is probable that DL will not work well (as it is possible to predict fail of the method, while nearly impossible to predict the success), nothing more. In current research, DL rarely works well for datasets smaller than hundreads thousands/milions of samples (I do not count MNIST because everything works well on MNIST). Furthermore, DL is heavily studied actually in just two types of problems - NLP and image processing, thus you cannot really extraplate it to any other kind of problems (no free lunch theorem).
Update
Just to make it a bit more clear. What you are asking about is to predit whether given estimator (or set of estimators) will yield a good results given a particular training set. In fact you even restrict just to the size.
The simpliest proof (based on your simplification) is as follows: for any N (sample size) I can construct N-mode (or N^2 to make it even more obvious) distribution which no estimator can reasonably estimate (including deep neural network) and I can construct trivial data with just one label (thus perfect model requires just one sample). End of proof (there are two different answers for the same N).
Now let us assume that we do have access to the training samples (without labels for now) and not just sample size. Now we are given X (training samples) of size N. Again I can construct N-mode labeling yielding impossible to estimate distribution (by anything) and trivial labeling (just a single label!). Again - two different answers for the exact same input.
Ok, so maybe given training samples and labels we can predict what will behave well? Now we cannot manipulate samples nor labels to show that there are no such function. So we have to get back to statistics and what we are trying to answer. We are asking about expected value of loss function over whole probability distribution which generated our training samples. So now again, the whole "clue" is to see, that I can manipulate the underlying distributions (construct many different ones, many of which impossible to model well by deep neural network) and still expect that my training samples come from them. This is what statisticians call the problem of having non-representible sample from a pdf. In particular, in ML, we often relate to this problem with curse of dimensionality. In simple words - in order to estimate the probability well we need enormous number of samples. Silverman shown that even if you know that your data is just a normal distribution and you ask "what is value in 0?" You need exponentialy many samples (as compared to space dimensionality). In practise our distributions are multi-modal, complex and unknown thus this amount is even higher. We are quite safe to say that given number of samples we could ever gather we cannot ever estimate reasonably well distributions with more than 10 dimensions. Consequently - whatever we do to minimize the expected error we are just using heuristics, which connect the empirical error (fitting to the data) with some kind of regularization (removing overfitting, usually by putting some prior assumptions on distributions families). To sum up we cannot construct a method able to distinguish if our model will behave good, because this would require deciding which "complexity" distribution generated our samples. There will be some simple cases when we can do it - and probably they will say something like "oh! this data is so simple even knn will work well!". You cannot have generic tool, for DNN or any other (complex) model though (to be strict - we can have such predictor for very simple models, because they simply are so limited that we can easily check if your data follows this extreme simplicity or not).
Consequently, this boils down nearly to the same question - to actually building a model... thus you will need to try and validate your approach (thus - train DNN to answer if DNN works well). You can use cross validation, bootstraping or anything else here, but all essentialy do the same - build multiple models of your desired type and validate it.
To sum up
I do not claim we will not have a good heuristics, heuristic drive many parts of ML quite well. I only answer if there is a method able to answer your question - and there is no such thing and cannot exist. There can be many rules of thumb, which for some problems (classes of problems) will work well. And we already do have such:
for NLP/2d images you should have ~100,000 samples at least to work with DNN
having lots of unlabeled instances can partially substitute the above number (thus you can have like 30,000 labeled ones + 70,000 unlabeled) with pretty reasonable results
Furthermore this does not mean that given this size of data DNN will be better than kernelized SVM or even linear model. This is exactly what I was refering to earlier - you can easily construct counterexamples of distributions where SVM will work the same or even better despite number of samples. The same applies for any other technique.
Yet still, even if you are just interested if DNN will work well (and not better than others) these are just empirical, trivial heuristics, which are based on at most 10 (!) types of problems. This could be very harmfull to treat these as rules or methods. This are just rough, first intuitions gained through extremely unstructured, random research that happened in last decade.
Ok, so I am lost now... when should I use DL? And the answer is exteremly simple:
Use deep learning only if:
You already tested "shallow" techniques and they do not work well
You have large amounts of data
You have huge computational resources
You have experience with neural networks (this are very tricky and ungreatful models, really)
You have great amount of time to spare, even if you will just get a few % better results as an effect.
I have been using cascade classifier to train some kind of plants. Here is a sample image for what I want to detect
I sampled the little green plants for positives, and made negatives out of images with similar background and no green plants (as suggested by many sources). Used many images similar to this one for sampling.
I did not have a lot of training data so of course I did not expect some idealistic classification results.
I have set the usual parameters min_hit_rate 0.95 max_false_alarm 0.5 etc. I have tried training with 5,6,7,8,9 and 10 stages. The strange thing that happens to me is that during the training process I get hit rate of 1 during all stages, and after 5 stages I get good acceptance ratio 0.004 (similar for later stages 6,7,8...).
I tried testing my classifier on the same image which I used for the training samples and there is very illogical behavior:
the classifier detects almost everything BUT the positive samples i took from it (the same samples in the training with HIT RATION EQUAL TO 1).
the classifier is really but really slow it took over an hour for single input image (down-sampled scale factor 1.1).
I do not get it how could the same samples be classified as positives during training (through all the stages) and then NONE of it as positive on the image (there are a lot of false positives around it).
I checked everything a million times (I thought that I somehow mixed positives and negatives but I did not).
Can someone help me with this issue?
I can try and help but of course I can't train this thing for you unless you send me your images.
In my experience if you aren't getting the desired results, you are simply giving traincascade the wrong or not enough images (either or both positives or negatives).
I did not get great results until I created an annotation file using the built-in opencv_annotation tool. Have you done that? How many positives?
Did your negatives contain the background that you are attempting to detect your object in? This is key and can't be overlooked.
Also, I would use LBP, it's much faster.
If you or anyone is still stuck and have some positives created, send them to me and I'll see if I can train this thing.
And also, I have written hopefully a one-stop tutorial about this stuff after my experiences with it:
http://johnallen.github.io/opencv-object-detection-tutorial/
I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know if you're talking about a couple of repetitions in million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- You can't just assume your going to be evaluated on a datapoint you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be very highly questionable to train and test with the same picture of a one subject. In fact I wouldn't be comfortable training and testing with frames of the same video (not even the same frame).
EDIT:
There are two questions:
The distribution permits that these repetitions naturally happen. I
believe you, you know your experiment, you know your data, you're
the expert.
The issue that you're getting a boost by doing this and that this
boost is possibly unfair. One possible way to address your advisor's
concerns is to evaluate how significant a leverage you're getting
from the repeated data points. Generate 20 test cases 10 in which
you train with 1000 and test on 33 making sure there are not
repetitions in the 33, and another 10 cases in which you train with
1000 and test on 33 with repetitions allowed as they occur
naturally. Report the mean and standard deviation of both
experiments.
It depends... Your adviser suggested the common practice. You usually test a classifier on samples which have not been used for training. If the samples of the test set matching the training set are very few, your results are not going to have statistical difference because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to prove that the reappearance of the same vectors has no statistical significance on the testing process. If you proved this theoretically, I would accept your logic. See this ebook on statistics in general, and this chapter as a start point on statistical significance and null hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).
If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a 1 vs. many SVM classifier requires retraining all classes. In case of large data sets, this might turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to a NB classifier requires training of the new class only).
However, according to your comment, which states the data has tens of dimensions and up to 1000s of samples, the problem is relatively small, so practically, SVM retrain can be performed very fast (probably, in the order of seconds to tens of seconds).
You need to give us more details about your problem, since there are too many different scenarios where SVM can be trained fairly quickly (I could train it in real time in a third person shooting game and not have any latency) or it could last several minutes (I have a case for a face detector that training took an hour long)
As a thumb rule, the training time is proportional to the number of samples and the dimension of each vector.
I'm using the random forest algorithm as the classifier of my thesis project.
The training set consists of thousands of images, and for each image about 2000
pixels get sampled. For each pixel, I've hundred of thousands of features. With
my current hardware limitations (8G of ram, possibly extendable to 16G) I'm able
to fit in memory the samples (i.e. features per pixel) for only one image. My
questions is: is it possible to call multiple times the train method, each time
with a different image's samples, and get the statistical model automatically
updated at each call? I'm particularly interested in the variable importance since, after I
train the full training set with the whole features set, my idea is to reduce
the number of features from hundred of thousands to about 2000, keeping only the
most important ones.
Thank you for any advice,
Daniele
I dont think the algorithm supports incremental training. You could consider reducing the size of your descriptors prior to training, using other feature reduction method. Or estimate the variable importance on a random subset of pixels taken among all your training images, as much as you can stuff into your memory...
See my answer to this post. There are incremental versions of random forests, and they will let you train on much larger data.