I am trying to understand how Haar classifiers work. I am reading the OpenCV documentation here: http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html and it seems like you basically train on a set of data to get something like a template. Then you lay the template over the actual image that you want to check, and you go through and check each pixel to see if it is likely or unlikely to be what you are looking for. So, assuming that is right, I got up to the point where I was looking at the photo below and I did not understand it. Are the blocks supposed to represent regions of "likely" and "unlikely"? Thanks in advance.
These patterns are the features that are evaluated for your training image. For example, for feature 1a the training process finds square regions in all your training images where the left half is usually brighter than the right half (or vice versa). For feature 3a, the training finds square regions where the center is darker than the surrounding.
These particular features you depicted were chosen for the Haar cascade not because they are particularly good features, but mainly because they are extremely fast to evaluate.
More specifically, the training of a Haar cascade finds the one feature that helps best at differentiating your positive and negative training images (roughly, the feature that is most often true for the positive images and most often false for the negative images). That feature will be the first stage of the resulting Haar cascade. The second best feature will be the second stage, and so on.
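To make the "pick the single most discriminative feature" idea concrete, here is a minimal sketch (not OpenCV's actual training code) that searches for the best threshold on one feature's responses; the feature values and labels are made-up toy data.

    import numpy as np

    # Toy data: one Haar feature's response on 6 positive and 6 negative samples.
    # The values and labels are illustrative, not taken from a real training run.
    feature_values = np.array([0.9, 0.8, 0.7, 0.75, 0.85, 0.6,   # positives
                               0.2, 0.3, 0.1, 0.4, 0.55, 0.25])  # negatives
    labels = np.array([1] * 6 + [0] * 6)

    def best_threshold(values, labels):
        """Return the threshold that best separates positives from negatives."""
        best_t, best_err = None, float("inf")
        for t in np.unique(values):
            predictions = (values >= t).astype(int)   # classify by thresholding
            err = np.mean(predictions != labels)      # misclassification rate
            if err < best_err:
                best_t, best_err = t, err
        return best_t, best_err

    t, err = best_threshold(feature_values, labels)
    print(f"chosen threshold: {t:.2f}, training error: {err:.2f}")

In real training this search runs over every candidate feature (and both comparison directions), with sample weights coming from boosting, and the winner becomes the stage's weak classifier.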
After training, a haar cascade consists of a series of rules, or stages, like this:
evaluate feature 1a for region (x1;y1)-(x2;y2). Is the result greater than threshold z1?
(meaning: is the left half of that region brighter than the right half by a certain amount?)
if yes, return 'not a match'
if no, execute the next stage
In the classical haar cascade, each such rule, involving only a single feature at a single location with a single threshold, represents a stage of the cascade. OpenCV actually uses a boosted cascade, meaning each stage consists of a combination of several of these simple features.
The principle remains: each stage is a very weak classifier that by itself is only barely better than random guessing. The threshold for each stage is chosen so that the chance of false negatives is very low (so a stage will almost never wrongly reject a good match, but will quite frequently wrongly accept a bad match).
When the haar cascade is executed, all stages are executed in order; only a picture that passes the first AND the second AND the third ... stage will be accepted.
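As a rough sketch of that evaluation order (the stage objects here are hypothetical stand-ins for "evaluate the feature and compare it to the learned threshold"):

    # Minimal sketch of how a trained cascade is applied to one candidate window.
    # Each entry in `stages` is a hypothetical callable wrapping "evaluate the
    # stage's feature(s) and compare to the learned threshold".
    def passes_cascade(window, stages):
        for stage_passes in stages:
            if not stage_passes(window):
                return False   # rejected early
        return True            # passed every stage -> accepted as a match

The early exit is what makes the cascade fast: the vast majority of windows contain no object and are rejected after the first one or two stages.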
During training the first stage is also trained first. Then the second stage is trained only with the training images that would pass the first stage, and so on.
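In practice you rarely write that loop yourself. OpenCV ships pretrained cascades and runs the whole sliding-window pipeline for you; a minimal sketch, assuming the opencv-python package (which bundles the cascade XML files) and a placeholder image path:

    import cv2

    # Load a pretrained Haar cascade shipped with OpenCV (standard install assumed).
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread("photo.jpg")                    # placeholder image path
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # cascades run on grayscale

    # detectMultiScale slides windows at several scales and returns accepted boxes.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detected.jpg", img)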
Related
As we know, faster-RCNN has two main parts: one is region proposal network(RPN), and another one is fast-RCNN.
My question is, now that region proposal network(RPN) can output class scores and bounding boxes and is trainable, why do we need Fast-RCNN?
Am I right in thinking that the RPN alone is enough for detection (red circle) and that Fast-RCNN has become redundant (blue circle)?
Short answer: no they are not redundant.
The R-CNN article and its variants popularized the use of what we used to call a cascade.
Back then, it was fairly common to chain several detectors, often very similar in structure, because of their complementary power.
If the detections are partly orthogonal, this makes it possible to remove false positives along the way.
Furthermore, by definition the two parts of R-CNN have different roles: the first discriminates objects from the background, and the second discriminates fine-grained categories of objects from each other (and from the background as well).
But you are right: if there is only one class versus the background, one could use only the RPN part to do detection. Even in that case, though, chaining two different classifiers would probably improve the result (or not; see e.g. this article).
PS: I answered because I wanted to, but this question is definitely unsuited for Stack Overflow.
If you just added a class head to the RPN, you would indeed get detections, with scores and class estimates.
However, the second stage is used mainly to obtain more accurate detection boxes.
Faster-RCNN is a two-stage detector, like Fast R-CNN.
In Fast R-CNN, Selective Search was used to generate rough estimates of the object locations, and the second stage then refined or rejected them.
Now why is this second stage still necessary with an RPN? In other words, why are the RPN's estimates only rough?
One reason is the limited receptive field:
The input image is transformed via a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate if the features at that position correspond to an object and the heads regress the detection box.
The box regression is done based on the final feature map of the CNN. In particular, it may happen that the correct bounding box in the image is larger than the corresponding receptive field of the CNN.
Example: Let's say we have an image depicting a person, and the features at one position of the feature map indicate a high probability for a person. Now, if the corresponding receptive field contains only part of the body, the regressor has to estimate a box enclosing the entire person, although it "sees" only that body part.
Therefore, the RPN creates only a rough estimate of the bounding box. The second stage of Faster R-CNN uses all the features contained in the predicted bounding box and can correct the estimate.
In the example, the RPN creates a bounding box that is too large but still encloses the person (since it cannot see the person's full extent), and the second stage uses all the information inside this box to reshape it so that it is tight. This can be done much more accurately, since more of the object's content is accessible to the network.
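To see what "correcting the estimate" means mechanically: both the RPN and the second stage predict offsets (dx, dy, dw, dh) relative to a reference box, in the usual R-CNN box parameterization. A small sketch of applying such deltas (the boxes and numbers are made up):

    import math

    def apply_deltas(box, deltas):
        """Refine an (x1, y1, x2, y2) box with (dx, dy, dw, dh) regression deltas,
        using the standard R-CNN parameterization (shift the center, rescale the size)."""
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
        dx, dy, dw, dh = deltas
        cx, cy = cx + dx * w, cy + dy * h          # move the center
        w, h = w * math.exp(dw), h * math.exp(dh)  # grow or shrink the box
        return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

    # Hypothetical example: the RPN proposal is too large; the second stage shrinks it.
    proposal = (100, 80, 300, 400)
    refined = apply_deltas(proposal, (0.02, -0.05, -0.20, -0.10))
    print(refined)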
Faster R-CNN is a two-stage method, compared to one-stage methods like YOLO and SSD. The reason Faster R-CNN is accurate is its two-stage architecture: the RPN is the first stage, for proposal generation, and the second classification-and-localisation stage learns more precise results based on the coarse-grained proposals from the RPN.
So yes, you can use the RPN alone, but the performance will not be good enough.
I think the blue circle is completely redundant; just adding a classification layer (which gives a class for each bounding box containing an object) should work just fine, and that is what single-shot detectors do, at the cost of some accuracy.
According to my understanding, the RPN just does a binary check of whether there is an object in the bounding box or not, and the final detector part classifies the actual classes, e.g. car, human, phone, etc.
I got some questions about the training of a cascade classifier:
In some of my pictures only half of the object is visible. Should I mark the visible part as the region of interest, use the picture as a negative sample, or sort it out completely?
Is the classifier able to detect objects that are just partly visible (using Haar features)?
What should be the ratio of negative and positive samples? Often I read that you should use more negative samples. But for example in this thread it is mentioned that the ratio should be 2:1 (more positive samples).
My current classifier detects too many false positives. According to this tutorial you can either increase the number of stages or decrease the false alarm rate per stage. But I can't increase the number of stages without increasing the false alarm rate; if I just increase the number of stages, the training stops at some point because the classifier runs out of samples. Is increasing the number of samples the only way to reduce the false positives?
Would be glad if someone could answer one of my questions :)
In the case of a cascade classifier I would suggest throwing away the "half" objects. Are they positive samples? No, since they don't contain the object entirely. Are they negative samples? No, because they still contain part of the object, whereas negative samples should have nothing to do with it.
In my experience, I started training with almost equal numbers of negative and positive images and ran into a similar problem. Increasing the number of samples was the first step. You should probably increase the number of negative samples; note that you need genuinely different images, since having 100 similar background images is almost the same as having only 5-10 images. In my case the best ratio was positive:negative = 2:1. You still need to experiment, though, as it depends on the classifier you are trying to build. If your object is not too fancy and comes in simple shapes and sizes (like a company logo, a coin, or an orange) you don't need too many samples, but if you are trying to build a classifier that checks for some complicated object (like a chair; yes, a chair is a serious object, since it comes in many different shapes and sizes) then you will need a lot of samples.
Hope this helps.
I want to count the number of cars in aerial images of parking lots. After some research I believe that Haar cascade classifiers might be an option for this. An example of an image I will be using would be something similar to a zoomed-in image of a parking lot from Google Maps.
My current plan to accomplish this is to train a custom Haar Classifier using cars that I crop out of images in only one orientation (up and down), and then attempt recognition multiple times while rotating the image in 15 degree increments. My specific questions are:
Is using a Haar Classifier a good approach here or is there something better?
Assuming this is a good approach, when cropping cars from larger images for training data would it be better to crop a larger area that could possibly contain small portions of cars in adjacent parking spaces (although some training images would obviously include solo cars, cars with only one car next to them, etc.) or would it be best to crop the cars as close to their outline as possible?
Again assuming I am taking this approach, how could I avoid double counting cars? If a car was recognized in one orientation, I don't want it to be counted again. Is there some way that I could mark a car as counted and have it ignored?
I think in your case I would not go for Haar features; you should look for something that is rotation invariant.
I would recommend to approach this task in the following order:
Create a solid training / testing data set and have a good look into papers about getting good negative samples. In my experience good negative samples have a great deal of influence on the resulting quality of your classifier. It makes your life a lot easier if all your samples are of the same image size. Add different types of negative samples, half cars, just pavement, grass, trees, people etc...
Before starting your search for a classifier, make sure that you have your evaluation pipeline in order: do a 10-fold cross-validation with the simplest Haar classifier possible. Now you have a baseline. Try to keep the software for all the features you have tested in working order, in case you find out that your data set needs adjustment. Ideally you can just execute a script and rerun your whole evaluation on the new data set automatically.
The problem of counting cars multiple times will not be of such importance once you find a feature that is rotation invariant. Still, non-maximum suppression will be in order, because you might not get a clean detection with simple thresholding.
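For reference, a minimal sketch of greedy non-maximum suppression over (x1, y1, x2, y2, score) detections; this is the textbook greedy variant, not tied to any particular library:

    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def non_max_suppression(detections, iou_threshold=0.3):
        """detections: list of (x1, y1, x2, y2, score); keep the best box per cluster."""
        remaining = sorted(detections, key=lambda d: d[4], reverse=True)
        kept = []
        while remaining:
            best = remaining.pop(0)
            kept.append(best)
            remaining = [d for d in remaining if iou(best[:4], d[:4]) < iou_threshold]
        return kept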
As a tip, you might consider HOG features; I did get some good results on cars with them.
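If you want to try HOG, OpenCV can compute the descriptor directly. A short sketch, assuming a placeholder image path and an arbitrary 64 x 64 window size; you would still need to train a classifier (e.g. an SVM) on top of these vectors:

    import cv2

    img = cv2.imread("car.png", cv2.IMREAD_GRAYSCALE)   # placeholder training crop
    img = cv2.resize(img, (64, 64))                     # fixed window size for the descriptor

    # HOGDescriptor(winSize, blockSize, blockStride, cellSize, nbins)
    hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)
    features = hog.compute(img)                         # one feature vector per training crop
    print(features.shape)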
I'm doing a project which needs to detect/classify some simple sign language.
I'm new to OpenCV. I have tried to use contours and hulls, but they seem very hard to apply...
I googled and found the method called "Haar cascade", which seems to be about taking pictures and creating an .xml file.
So, I decided to go with Haar cascades.
Here are some example of the sign language that I want to detect/classify
Set1 : http://www.uppic.org/image-B600_533D7A09.jpg
Set2 : http://www.uppic.org/image-0161_533D7A09.jpg
The result I want here is to classify these 2 set.
Any suggestions on whether I could use the Haar cascade method for this?
*I'm using Xcode with my webcam, but soon I'm going to port this onto an iOS device. Is that possible?
First of all: I would not use haar features for learning on whole images.
Let's see what Haar features look like:
Let me point out how learning works. We're building a classifier that consists of many 'weak' classifiers. In approximation, every 'weak' classifier is built in such a way as to extract information about several Haar features. To simplify, let's pick one of them for consideration, the first of the edge features. During learning, we compute a threshold value by sliding this feature over the whole input training image, using the feature as a mask: we sum the pixels 'under' the white part of the feature, sum the pixels 'under' the black part, and subtract one value from the other. In our case, the threshold value tells us whether a vertical edge feature exists in the training image. After training a weak classifier, you repeat the process with different Haar features. Every weak classifier gives information about different features.
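To make the "sum the white part, subtract the black part" step concrete, here is a small sketch that evaluates one vertical-edge Haar feature using an integral image, so each rectangle sum costs only four lookups (the window size and feature coordinates are arbitrary examples):

    import numpy as np

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the w x h rectangle at (x, y), using an integral image ii
        that has an extra leading row and column of zeros."""
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    img = np.random.randint(0, 256, (24, 24)).astype(np.float64)  # toy 24x24 window
    ii = np.zeros((25, 25))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)                # integral image

    # Vertical two-rectangle edge feature over an 8x8 patch at (4, 4):
    # left half minus right half (the sign convention is learned during training).
    left = rect_sum(ii, 4, 4, 4, 8)
    right = rect_sum(ii, 8, 4, 4, 8)
    feature_value = left - right
    print(feature_value)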
What is important: I summarized how training works to describe what kind of objects are well suited to being trained this way. Let's pick the most powerful application, detecting a human face. A face has two important properties:
It has landmarks which are high in contrast (they differ from the background, i.e. the skin)
The landmarks' locations are correlated with each other in every face (e.g. the distance between them is approximately some factor of the face size)
That makes Haar features powerful in that case. As you can see, one can easily point out Haar features which are useful for face detection, e.g. the first and second line features are good for detecting a nose.
Back to your problem: ask yourself whether it has properties 1 and 2. In the case of the whole image there is too much unnecessary data (background, folds in the person's shirt), and we don't want to add that noise to the classifier.
Secondly, I would not use haar features from some cropped regions.
I think the difference between the palms is too small for a Haar classifier. You can derive that from the description above. The palms are not different enough, so the computed threshold levels will be too similar. The most significant features for Haar on the given palms will be the 'edges' between the fingers and the palm's outer edges. You can't rely on the palm's outer edges, since they depend on the background (walls, clothes, etc.), and the edges between the fingers carry too little information. I claim this because I have experience training a Haar classifier for palms. It started to work only once we cropped a palm region containing the fingers.
I've read a lot about the Viola-Jones method, but I still do not understand the terms "weak classifier", "strong classifier" and "sub-window" in the context of rectangle features. What do they mean? And what about the "threshold"? How can I know the threshold value?
Can anyone help me? Thanks in advance.
Aim of the Viola-Jones algorithm: detection of faces in an image. This algorithm works on frontal, upright faces: in order to be detected, the entire face must point towards the camera and should not be tilted to either side. The algorithm partitions the face image based on a physical estimation of the position of the eyes, nose and mouth on the face.
Stages of the algorithm: This algorithm works in the following four stages:
1. Haar features
2. Integral image
3. AdaBoost
4. Cascading
All these stages are discussed below. Before that, I will answer a simple question: **why Haar**?
Haar wavelets are preferred because they work better than Fourier bases for this kind of feature extraction.
Now, we will discuss the stages involved in this algorithm.
Haar features: A 24 x 24 base window slides over the given input image, and at each position the Haar features are computed (essentially convolving the window with the rectangular feature masks). You can study the different Haar features here.
The output of this phase will be the detection of light and dark parts of an image.
Integral image: The number of Haar features extracted in the above phase is very large, which makes the computation very expensive. To make this computation simple and fast, the feature values are computed from an integral image, which gives any rectangle sum with only a few additions. You can learn this calculation in the link provided above.
AdaBoost: As there are so many features, most of them are not useful for telling faces apart from non-faces. We only need the features that indicate a face, and selecting them is the job of AdaBoost. It helps separate faces from everything else using weak classifiers and a cascade; the overall process is an ensemble method. A weighted combination of the selected features is used to evaluate and decide whether any given window contains a face or not, and redundant features are eliminated.
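In code, a "strong" classifier is just such a weighted vote of weak classifiers compared against a stage threshold. A minimal sketch with made-up weak classifiers, weights and threshold (each weak classifier is itself a thresholded Haar-like comparison returning 0 or 1):

    import numpy as np

    def strong_classifier(window, weak_classifiers, alphas, stage_threshold):
        """Weighted vote of weak classifiers; each weak classifier returns 0 or 1."""
        score = sum(alpha * weak(window)
                    for weak, alpha in zip(weak_classifiers, alphas))
        return score >= stage_threshold   # accepted by this stage if the vote is high enough

    # Made-up weak classifiers on a toy 24x24 grayscale window.
    window = np.random.randint(0, 256, (24, 24))
    weaks = [lambda w: int(w[:, :12].mean() > w[:, 12:].mean()),  # "left brighter than right"
             lambda w: int(w[8:16, :].mean() < w.mean())]         # "middle band darker"
    alphas = [0.7, 0.3]
    print(strong_classifier(window, weaks, alphas, stage_threshold=0.5))

The stage threshold is chosen during training so that almost all true faces pass the vote, which keeps the false-negative rate of each stage low.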
Cascading: Weak classifiers are cascaded to make one strong classifier while the window slides over the whole image. This process is also known as boosting the weak classifiers. A sub-window classified as a face is passed on to the next stage in the cascade. It follows that the more stages a given sub-window passes, the higher the chance that the sub-window really contains a face.
Next: the model is then tested on real images and faces are detected.
Use case of Viola-Jones: this model can be run on a CPU, so it is easy to experiment with for learning purposes.