I have some questions about training a cascade classifier:
In some of my pictures only half of the object is visible. Should I mark the visible part as the region of interest, use the picture as a negative sample, or sort it out completely?
Is the classifier able to detect objects that are just partly visible (using Haar features)?
What should be the ratio of negative and positive samples? Often I read that you should use more negative samples. But for example in this thread it is mentioned that the ratio should be 2:1 (more positive samples).
My current classifier detects too many false positives. According to this tutorial you can either increase the number of stages or decrease the false alarm rate per stage. But I can't increase the number of stages without increasing the false alarm rate. If I just increase the number of stages, the training stops at some point because the classifier runs out of samples. Is increasing the number of samples the only way to reduce the false positives?
Would be glad if someone could answer one of my questions :)
For a cascade classifier I would suggest throwing away the "half" objects. They are not positive samples, since they don't contain the entire object, and they are not negative samples either, since they are not unrelated to the object.
In my experience, I started training with an almost equal number of negative and positive images and had a similar problem. Increasing the number of samples was the first step. You should probably increase the number of negative samples, but note that you need genuinely different images: 100 similar background images are worth about as much as 5-10 images. In my case the best ratio was positive:negative = 2:1. You still need to experiment, though, because it depends on the classifier you are trying to build. If your object is simple and comes in few shapes and sizes (like a company logo, a coin, or an orange), you don't need too many samples. If you are trying to build a classifier for a complicated object (like a chair; yes, a chair is a serious object, since it comes in many different shapes and sizes), then you will need a lot of samples.
Hope this helps.
Why does opencv_traincascade.exe load negative samples (40*40) so slowly? It may take 1-2 minutes to load a negative sample.
It is not loading a single negative sample in that time. It is collecting negative samples that are still classified as positive, so that the new training stage can find features that distinguish your positive samples from those negatives. With each stage your classifier gets better, more negative samples are already classified correctly, and it gets harder to find negative samples that are usable in that training stage. After enough negative samples have been collected, you'll see a value (acceptanceRatio) that shows the rate of usable negative samples found.
For example, here is a stage preparation from a training I had once:
The acceptance ratio is 3.03652e-005, which means the negative sample collection had to test on average 32932.4 negative samples to find a single USABLE negative sample. Yes, this takes a long time (especially as the classifier gets more "complicated"). The more varied your negative sample pictures are, the easier it typically gets to find usable samples. If you've chosen, for example, a max false alarm rate of 0.5, in theory you'll have to double the number of negative images tested in each stage. This is also an indicator of how well your training is working. If the acceptance ratio doesn't go down in each stage, the training is probably not working well and the classifier does not seem to be generalizing.
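The arithmetic above can be checked with a few lines of Python. This is only a sketch: it uses the acceptance ratio quoted in this answer and assumes, as an idealization, that every stage lets through exactly the max false alarm rate of 0.5 of the remaining negatives. Real training runs will deviate from that.

```python
import math

# Acceptance ratio quoted in the answer above.
acceptance_ratio = 3.03652e-05

# Average number of candidate negatives tested per usable negative.
windows_per_usable = 1.0 / acceptance_ratio
print(f"{windows_per_usable:.1f}")  # 32932.4

# Idealized assumption: each stage passes exactly this fraction of negatives.
max_false_alarm = 0.5

def expected_acceptance(stage_count):
    """Expected acceptance ratio after `stage_count` idealized stages."""
    return max_false_alarm ** stage_count

# How many idealized stages correspond to the observed ratio?
equivalent_stages = math.log(acceptance_ratio) / math.log(max_false_alarm)
print(f"roughly {equivalent_stages:.1f} stages at a 0.5 false alarm rate")
```

Under that idealization the work per usable negative doubles with each stage, which is why the later stages take so much longer to prepare.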
I have decided to train Haar classifier for 102 flower categories given here:(The dataset)
http://www.robots.ox.ac.uk/~vgg/data/flowers/102/categories.html
In the link you can see several categories. I am posting a few images of an individual flower to explain the question.
This flower belongs to a single class. I have 250 images as positives. There is considerable variation among the images of this flower (color, brightness, orientation, etc.). I am hunting for negative images right now. As you might have guessed, I didn't take these pictures myself, so I can't go to the places where they were taken to collect a negative dataset. Instead, I have decided to extract frames from a video. Here is the link:
https://www.youtube.com/watch?v=x3zT1mJE0W0
Here are the images from the video:
It is a video of general garden with bushes and plants background.
My question is: will this video (and other similar videos) suffice as negative samples for successful detection? Is it safe to train the classifier for these flowers at all, given the large variation in the background? I also plan to use the images of the other flower categories as negatives, i.e. everything except the flower I am trying to detect, which in this case is the Passion Flower.
This is my first training run, and I am asking because the training is going to take a whole day and night. I would like to know beforehand whether it is likely to work.
The trick with negative images is to use whatever you have, and as many as possible. The more varied and numerous your negative images are, the more robust your classifier will be.
As for your specific question about whether the bushes are a good negative data set compared to the flowers I would say they will be ok. The background behind the bushes is relatively similar and you have quite a distinct flower pattern for your positive samples.
Viola-Jones' AdaBoost method is very popular for face detection. We need lots of positive and negative samples to train a face detector.
The rule for collecting positive samples is simple: images that contain faces. But the rule for collecting negative samples is not very clear: images that do not contain faces.
But there are so many scenes that do not contain faces (sky, rivers, houses, animals, etc.). Which should I collect? How can I know that I have collected enough negative samples?
One suggested idea for negative samples: take the positive samples, crop out the face region, and use the remaining part as negative samples. Does this work?
You have asked many questions inside your thread.
1. Amount of samples. As a rule of thumb, training a detector takes roughly a few thousand positive and negative examples per stage. A typical detector has 10-20 stages, and each stage reduces the number of remaining negatives by a factor of about 2. So you will need roughly 3,000 to 10,000 positive examples and about 5,000,000 to 100,000,000 negative examples.
2. Which negatives to take. A rule of thumb: you need to find a face in a given environment, so take that environment as negative examples. For instance, if you are trying to detect the faces of students sitting in a classroom, then take images from the classroom as negative examples (walls, windows, human bodies, clothes, etc.). Taking images of the moon or of the sky will probably not help you. If you don't know your environment, then just take as many different natural images as possible (under different lighting conditions).
3. Should you take facial parts (like an eye or a nose) as negatives? You can, but taking only those negatives is definitely not enough. The real strength of the detector comes from negative images that represent the typical background of the faces.
4. How to collect/generate negative samples. You don't actually need many negative images. You can take 1000 images and generate 10,000,000 negative samples from them. Here is how you do it. Suppose you take a photo of a car at 1 megapixel resolution, 1000x1000 pixels. Suppose you then want to train a face detector to work at a resolution of 20x20 pixels (as OpenCV did). You take your big 1000x1000 image and cut it into pieces of 20x20. You get 2,500 pieces (50x50). That is how a single big image yields 2,500 negative examples. Now you can take the same big image and cut it into pieces of 10x10 pixels. You will then have an additional 10,000 negative examples. Each of these is 10x10 pixels, and you can enlarge it by a factor of 2 so that all samples have the same size. You can repeat this process as often as you want (cutting the input image into pieces of different sizes). Mathematically speaking, if your image is of size NxN, you can generate O(N^4) negative examples from it by taking each possible rectangle inside it.
5. In step 4, I described how to take a single big image and cut it into a large number of negative examples. I must warn you that the negative examples should not be highly correlated with each other, so I don't recommend taking only one image and generating 1 million negative examples from it. As a rule of thumb, create a library of 1000 images (or download random images from Google). Verify that none of the images contains faces. Crop about 10,000 negative examples from each image and you have a decent 10,000,000 negative examples. Train your detector. In the next step you can cut each image into about 50,000 (partially overlapping) pieces and thus enlarge your set of negatives to 50 million. You will start getting very good results with it.
6. Final enhancement step of the detector. When you already have a rather good detector, run it on many images. It will produce false detections (detecting a face where there is no face). Gather all those false detections and add them to your negative set. Now retrain the detector. The more such iterations you do, the better your detector becomes.
7. Real numbers. The best face detectors today (like Facebook's) use hundreds of millions of positive examples and billions of negatives. As positive examples they take not only frontal faces but faces in many orientations, with different facial expressions (smiling, shouting, angry, ...), different age groups, different genders, different races and ethnicities, with or without glasses, hats, sunglasses, make-up, etc. You will not be able to compete with the best, so don't get angry if your detector misses some faces.
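The tiling described in step 4 can be sketched in Python with NumPy. This is a minimal illustration, not a full data pipeline: the 1000x1000 image is random noise standing in for a real face-free photo, the names `tile_crops` and `count_rectangles` are made up for this example, and the resize is a plain nearest-neighbour index lookup rather than a proper image resize.

```python
import numpy as np

def tile_crops(image, tile, out_size=20):
    """Cut `image` into non-overlapping tile x tile pieces and
    rescale each piece to out_size x out_size (nearest neighbour)."""
    h, w = image.shape[:2]
    # Index map from output pixels back to source pixels inside a tile.
    idx = (np.arange(out_size) * tile) // out_size
    crops = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            patch = image[r:r + tile, c:c + tile]
            crops.append(patch[np.ix_(idx, idx)])
    return crops

def count_rectangles(n):
    """Number of axis-aligned rectangles inside an n x n pixel grid.
    Grows like n**4 / 4, i.e. O(N^4)."""
    return (n * (n + 1) // 2) ** 2

# Stand-in for a real 1000x1000 photo that contains no faces.
rng = np.random.default_rng(0)
big_image = rng.integers(0, 256, size=(1000, 1000), dtype=np.uint8)

crops_20 = tile_crops(big_image, tile=20)  # 2,500 pieces (50x50)
crops_10 = tile_crops(big_image, tile=10)  # 10,000 pieces, enlarged by 2
print(len(crops_20), len(crops_10), crops_10[0].shape)
print(count_rectangles(1000))
```

Adding a stride smaller than the tile size would give the partially overlapping pieces mentioned in step 5, at the cost of more correlated samples.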
Good luck
I am trying to understand how haar classifiers work. I am reading the opencv documentation here: http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html and it seems like you basically train a set of data to get something like a template. Then you lay the template over the actual image that you want to check and you go through and check each pixel to see if it is likely or not likely to be what you are looking for. So, assuming that is right, I got up to the point where I was looking at the photo below and I did not understand. Are the blocks supposed to represent regions of "likely" and "unlikely"? Thanks in advance
These patterns are the features that are evaluated for your training images. For example, for feature 1a the training process finds square regions in all your training images where the left half is usually brighter than the right half (or vice versa). For feature 3a, the training finds square regions where the center is darker than the surroundings.
The particular features you depicted were chosen for the Haar cascade not because they are particularly good features, but mainly because they are extremely fast to evaluate.
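To illustrate why such features are cheap, here is a minimal NumPy sketch of the usual trick, an integral image (summed-area table), which lets any rectangle sum be read off with four lookups. The function names and the tiny 4x4 test patch are made up for this example; this is not OpenCV's implementation.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] holds the sum of img[:r, :c]; the extra zero row and
    column avoid special cases at the image border."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h x w rectangle with top-left corner (r, c),
    using four lookups regardless of the rectangle size."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def left_minus_right(ii, r, c, h, w):
    """Two-rectangle feature in the spirit of feature 1a:
    brightness of the left half minus the right half (w must be even)."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)

# Tiny example: left two columns bright, right two columns dark.
patch = np.zeros((4, 4), dtype=np.uint8)
patch[:, :2] = 10
ii = integral_image(patch)
value = left_minus_right(ii, 0, 0, 4, 4)
print(value)  # 80
```

The integral image is computed once per picture, after which every feature at every position and size costs only a handful of additions.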
More specifically, the training of a Haar cascade finds the one feature that helps best at differentiating your positive and negative training images (roughly, the feature that is most often true for the positive images and most often false for the negative images). That feature will be the first stage of the resulting Haar cascade. The second best feature will be the second stage, and so on.
After training, a haar cascade consists of a series of rules, or stages, like this:
evaluate feature 1a for region (x1;y1)-(x2;y2). Is the result greater than threshold z1?
(meaning: is the left half of that region brighter than the right half by a certain amount?)
if yes, return 'not a match'
if no, execute the next stage
In the simplest form of a Haar cascade, each such rule, involving only a single feature at a single location with a single threshold, represents a stage of the cascade. OpenCV actually uses a boosted cascade, meaning each stage consists of a combination of several of these simple features.
The principle remains: each stage is a very weak classifier that by itself is just barely better than wild guessing. The threshold for each stage is chosen so that the chance of false negatives is very low (So a stage will almost never wrongly reject a good match, but will quite frequently wrongly accept a bad match).
When the haar cascade is executed, all stages are executed in order; only a picture that passes the first AND the second AND the third ... stage will be accepted.
During training the first stage is also trained first. Then the second stage is trained only with the training images that would pass the first stage, and so on.
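The execution order described above can be sketched as follows. This is a simplified illustration with one feature per stage, using the same rule form as the example stage (reject when the feature value exceeds the threshold). The two features, the thresholds, and the test windows are invented for the example; a real OpenCV cascade combines several features per stage.

```python
import numpy as np

def left_minus_right(window):
    """Mean brightness of the left half minus the right half."""
    mid = window.shape[1] // 2
    return window[:, :mid].mean() - window[:, mid:].mean()

def top_minus_bottom(window):
    """Mean brightness of the top half minus the bottom half."""
    mid = window.shape[0] // 2
    return window[:mid, :].mean() - window[mid:, :].mean()

def run_cascade(stages, window):
    """Evaluate stages in order and stop at the first rejection.
    Returns (accepted, number_of_stages_evaluated)."""
    for evaluated, (feature, threshold) in enumerate(stages, start=1):
        if feature(window) > threshold:
            return False, evaluated  # 'not a match': later stages are skipped
    return True, len(stages)

# Hypothetical two-stage cascade with made-up thresholds.
stages = [(left_minus_right, 20.0), (top_minus_bottom, 20.0)]

flat = np.full((20, 20), 100.0)
left_bright = flat.copy()
left_bright[:, :10] = 200.0
top_bright = flat.copy()
top_bright[:10, :] = 200.0

print(run_cascade(stages, flat))         # (True, 2)
print(run_cascade(stages, left_bright))  # (False, 1)
print(run_cascade(stages, top_bright))   # (False, 2)
```

Because most windows in a picture are rejected by the first few stages, the average cost per window stays low even when the full cascade has many stages.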
Trying to write some code that deals with this task:
As a starting point, I have around 20 "profiles" (imagine a landscape profile), i.e. one-dimensional arrays of around 1000 real values.
Each profile has a real-valued desired outcome, the "effective height".
The effective height is some sort of average but height, width and position of peaks play a particular role.
My aim is to generalize from the input data so as to calculate the effective height for further profiles.
Is there a machine learning algorithm or principle that could help?
Principle 1: Extract the most important features instead of feeding in everything
As you said, "The effective height is some sort of average but height, width and position of peaks play a particular role." So you have a strong prior assumption that these measures are the most important for learning. If I were you, I would calculate these measures first and use them as the input for learning, instead of the raw data.
Principle 2: When choosing a learning algorithm, the first thing to consider is linearity
Suppose the height is a function of those measures; then you have to think about the extent to which that function is linear. For example, if the function is almost linear, then a very simple Perceptron would be perfect. If it is far from linear, you might want to pick a multi-layer neural network. If it is very far from linear, please go back to principle 1 and check whether you are extracting the right features.
Principle 3: More data helps
As you said, you have around 20 "profiles" for training. Generally speaking, that's not enough. Almost all machine learning algorithms were designed for fairly large data sets. Even when an algorithm is claimed to be good at learning from small samples, "small" usually does not mean as small as 20. Get more data!
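Principle 1 could look something like the following NumPy sketch, which reduces a profile to the height, width and position of its tallest peaks. Everything here is an assumption made for illustration: the function name `peak_features`, the choice of two peaks, measuring width at half the peak height with a baseline of zero, and the synthetic two-bump profile. The real definition of the relevant measures has to come from your own knowledge of the data.

```python
import numpy as np

def peak_features(profile, n_peaks=2):
    """Return [height, width, position] for the n_peaks tallest local
    maxima of a 1-D profile, padded with zeros if fewer are found.
    Width is measured at half the peak height (baseline assumed 0)."""
    y = np.asarray(profile, dtype=float)
    # Interior local maxima: higher than the left neighbour and
    # at least as high as the right neighbour.
    idx = np.flatnonzero((y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])) + 1
    # Keep the tallest ones, tallest first.
    idx = idx[np.argsort(y[idx])[::-1][:n_peaks]]
    feats = []
    for i in idx:
        half = y[i] / 2.0
        left = i
        while left > 0 and y[left] > half:
            left -= 1
        right = i
        while right < len(y) - 1 and y[right] > half:
            right += 1
        feats.extend([y[i], float(right - left), float(i)])
    feats.extend([0.0] * (3 * n_peaks - len(feats)))
    return np.array(feats)

# Synthetic profile: a tall narrow bump at 200 and a lower wide bump at 700.
x = np.arange(1000)
profile = (5.0 * np.exp(-((x - 200) / 20.0) ** 2)
           + 3.0 * np.exp(-((x - 700) / 40.0) ** 2))
features = peak_features(profile)
print(features)
```

Each profile of around 1000 values then becomes a short feature vector, which is a much more reasonable input for a model trained on about 20 examples.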
Maybe multivariate linear regression suffices?
I would probably start from what you said about which features play the most important role, and then train a regression on those. Basically, you need at least one coefficient for each feature, and you need substantially more data points than coefficients. So I would pick something like the heights and widths of the two biggest peaks. You've now reduced every profile to just 4 numbers. Now do this trick: divide the data into 5 groups of 4. Pick the first 4 groups. Reduce all those profiles to 4 numbers, and then use the desired outcomes to fit a regression. Once you have trained the regression, try it on the 4 profiles in the group you left out and see how well it works. Repeat this procedure 5 times, each time leaving out a different group. This is called cross-validation, and it's very handy.
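The procedure can be sketched with NumPy alone. The data below is synthetic and only stands in for 20 profiles that have already been reduced to 4 numbers each; the "true" weights, the noise level, and the helper names `fit_linear` and `predict` are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 profiles, each reduced to 4 features
# (e.g. heights and widths of the two biggest peaks).
X = rng.uniform(1.0, 10.0, size=(20, 4))
made_up_weights = np.array([0.5, 0.1, 0.3, 0.05])
y = X @ made_up_weights + 2.0 + rng.normal(0.0, 0.1, size=20)

def fit_linear(X, y):
    """Least-squares fit of y ~ X @ w + b; returns [w..., b]."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# 5 groups of 4, assigned at random.
folds = np.array_split(rng.permutation(20), 5)

fold_rmse = []
for k, test_idx in enumerate(folds):
    # Train on the other 4 groups (16 profiles), test on the held-out 4.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    coef = fit_linear(X[train_idx], y[train_idx])
    err = predict(coef, X[test_idx]) - y[test_idx]
    fold_rmse.append(float(np.sqrt(np.mean(err ** 2))))

print("RMSE per fold:", np.round(fold_rmse, 3))
print("mean RMSE:", round(float(np.mean(fold_rmse)), 3))
```

If the per-fold errors vary wildly, that is a sign the model has too many coefficients for the amount of data, or that the chosen features do not capture what drives the effective height.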
Obviously getting more data would help.