I'm trying to understand Viola Jones method, and I've mostly got it.
It uses simple Haar like features boosted into strong classifiers and organized into layers /cascade in order to accomplish better performances (not bother with obvious 'non object' regions).
I think I understand integral image and I understand how are computed values for the features.
The only thing I can't figure out is how is algorithm dealing with the face size variations.
As far as I know they use 24x24 subwindow that slides over the image, and within it algorithm goes through classifiers and tries to figure out is there a face/object on it, or not.
And my question is - what if one face is 10x10 size, and other 100x100? What happens then?
And I'm dying to know what are these first two features (in first layer of the cascade), how do they look like (keeping in mind that these two features, acording to Viola&Jones, will almost never miss a face, and will eliminate 60% of the incorrect ones) ? How??
And, how is possible to construct these features to work with these statistics for different face sizes in image?
Am I missing something, or maybe I've figured it all wrong?
If I'm not clear enough, I'll try to explain better my confusion.
Training
The Viola-Jones classifier is trained on 24*24 images. Each of the face images contains a similarly scaled face. This produces a set of feature detectors built out of two, three, or four rectangles optimised for a particular sized face.
Face size
Different face sizes are detected by repeating the classification at different scales. The original paper notes that good results are obtained by trying different scales a factor of 1.25 apart.
Note that the integral image means that it is easy to compute the rectangular features at any scale by simply scaling the coordinates of the corners of the rectangles.
Best features
The original paper contains pictures of the first two features selected in a typical cascade (see page 4).
The first feature detects the wide dark rectangle of the eyes above a wide brighter rectangle of the cheeks.
----------
----------
++++++++++
++++++++++
The second feature detects the bright thin rectangle of the bridge of the nose between the darker rectangles on either side containing the eyes.
---+++---
---+++---
---+++---
Related
I am trying to build a solution where I could differentiate between a 3D textured surface with the height of around 200 micron and a regular text print.
The following image is a textured surface. The black color here is the base surface.
Regular text print will be the 2D print of the same 3D textured surface.
[EDIT]
Initial thought about solving this problem, could look like this:
General idea here would be, images shot at different angles of a 3D object would be less related to each other than the images shot for a 2D object in the similar condition.
One of the possible way to verify could be: 1. Take 2 images, with enough light around (flash of the camera). These images should be shot at as far angle from the object plane as possible. Say, one taken at camera making 45 degree at left side and other with the same angle on the right side.
Extract the ROI, perspective correct them.
Find GLCM of the composite of these 2 images. If the contrast of the GLCM is low, then it would be a 3D image, else a 2D.
Please pardon the language, open for edit suggestion.
General idea here would be, images shot at different angles of a 3D object would be less related to each other than the images shot for a 2D object in the similar condition.
One of the possible way to verify could be:
1. Take 2 images, with enough light around (flash of the camera). These images should be shot at as far angle from the object plane as possible. Say, one taken at camera making 45 degree at left side and other with the same angle on the right side.
Extract the ROI, perspective correct them.
Find GLCM of composite of these 2 images. If contrast of the GLCM is low, then it would be a 3D image, else a 2D.
Please pardon the language, open for edit suggestion.
If you can get another image which
different angle or
sharper angle or
different lighting condition
you may get result. However, using two image with different angle with calibrate camera can get stereo vision image which solve your problem easily.
This is a pretty complex problem and there is no plug-in-and-go solution for this. Using light (structured or laser) or shadow to detect a height of 0.2 mm will almost surely not work with an acceptable degree of confidence, no matter of how much "photos" you take. (This is just my personal intuition, in computer vision we verify if something works by actually testing).
GLCM is a nice feature to describe texture, but it is, as far as I know, used to verify if there is a pattern in the texture, so, I believe it would output a positive value for 2D print text if there is some kind of repeating pattern.
I would let the computer learn what is text, what is texture. Just extract a large amount of 3D and 2D data, and use a machine learning engine to learn which is what. If the feature space is rich enough, it may be able to find a way to differentiate one from another, in a way our human mind wouldn't be able to. The feature space should consist of edge and colour features.
If the system environment is stable and controlled, this approach will work specially well, since the training data will be so similar to the testing data.
For this problem, I'd start by computing colour and edge features (local image pixel sums over different edge and colour channels) and try a boosted classifier. Boosted classifiers aren't the state of the art when it comes to machine learning, but they are good at not overfitting (meaning you can just insert as much data as you want), and will most likely work in a stable environment.
Hope this helps,
Good luck.
I am planning on making a cascade detector for a white cup, a red ball, and a blue puck. With how simple these objects are in their shape, I was wondering if there are any parameter differences I will have to have in the training vs finding complex objects such as cars / faces? Also, within the training pos images I have the objects in different lighting conditions and instances where the objects are under shadow.
For training negative images I noticed the image sizes may vary. However, for positive images they MUST be a fixed size.
I plan on using 100x100 pos images to help detect the objects from 20-30 feet, the 200x200 pos images to detect the objects when I am within 5ft / am directly overhead of the object (3 ft off the ground appx). Does this mean that I will have to train 6 different XMLs? 2 for each object as it is trained for 100x100 and 200x200?
Short answer: Yes
Long Answer: Probably:
You have to think about it like this, the classifier is going to build up a set of features for the positive images and then use these to determine whether your detection image is the same or not. If you are drastically moving the angle of your detection, then you are going to need a different classifier.
Let me example with pictures:
If at 20ft away your cup looks like this:
with associated background/lighting etc, then it is going to be a very different classifier if your cup looks like this(maybe 5ft away but different angle):
Now, with all that being said, if you only have larger and smaller versions of your cup, then you may only need one. However you will need a different classifier for each object (cup/ball/puck)
Images not mine - Taken from Google
Is there a robust way to detect the water line, like the edge of a river in this image, in OpenCV?
(source: pequannockriver.org)
This task is challenging because a combination of techniques must be used. Furthermore, for each technique, the numerical parameters may only work correctly for a very narrow range. This means either a human expert must tune them by trial-and-error for each image, or that the technique must be executed many times with many different parameters, in order for the correct result to be selected.
The following outline is highly-specific to this sample image. It might not work with any other images.
One bit of advice: As usual, any multi-step image analysis should always begin with the most reliable step, and then proceed down to the less reliable steps. Whenever possible, the less reliable step should make use of the result of more-reliable steps to augment its own accuracy.
Detection of sky
Convert image to HSV colorspace, and find the cyan located at the upper-half of the image.
Keep this HSV image, becuase it could be handy for the next few steps as well.
Detection of shrubs
Run Canny edge detection on the grayscale version of image, with suitably chosen sigma and thresholds. This will pick up the branches on the shrubs, which would look like a bunch of noise. Meanwhile, the water surface would be relatively smooth.
Grayscale is used in this technique in order to reduce the influence of reflections on the water surface (the green and yellow reflections from the shrubs). There might be other colorspaces (or preprocessing techniques) more capable of removing that reflection.
Detection of water ripples from a lower elevation angle viewpoint
Firstly, mark off any image parts that are already classified as shrubs or sky. Since shrub detection would be more reliable than water detection, shrub detection's result should be used to inform the less-reliable water detection.
Observation
Because of the low elevation angle viewpoint, the water ripples appear horizontally elongated. In fact, every image feature appears stretched horizontally. This is called Anisotropy. We could make use of this tendency to detect them.
Note: I am not experienced in anisotropy detection. Perhaps you can get better ideas from other people.
Idea 1:
Use maximally-stable extremal regions (MSER) as a blob detector.
The Wikipedia introduction appears intimidating, but it is really related to connected-component algorithms. A naive implementation can be done similar to Dijkstra's algorithm.
Idea 2:
Notice that the image features are horizontally stretched, a simpler approach is to just sum up the absolute values of horizontal gradients and compare that to the sum of absolute values of vertical gradients.
I have images of mosquitos similar to these ones and I would like to automatically circle around the head of each mosquito in the images. They are obviously in different orientations and there are random number of them in different images. some error is fine. Any ideas of algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to recreate your training set. For this you would like to extract small images with examples of what is a mosquito head or what is not.
Then you can use those images to train a classification algorithm, be careful to have a balanced training set, since if your data is skewed to one class it would hit the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could make the trick too. If you decide to use support vector machines I strongly recommend you to check libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. Also they have a very easy to follow guide targeted to beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should be able to avoid tolerance to orientation. If you don't have enough data, then you could create more training rows with some samples rotated, so you would have a more representative training set.
As for the prediction what you could do is given an image, cut it using a grid where each cell has the same dimension that the ones you used on your training set. Then you pass each of this image to the classifier and mark those squares where the classifier gave you a positive output. If you really need circles then take the center of the given square and the radius would be the half of the square side size (sorry for stating the obvious).
So after you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others) , since we are not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos in the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we will need to repeat this procedure (grid cut and predict) rescaling the given image to different sizes. How many sizes? well here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using, that is also something I would recommend you to experiment with.
There are some research may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a white unmarked underground, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example here i did a quick try taking the red channel, then substracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be moquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
I am currently helping a friend working on a geo-physical project, I'm not by any means a image processing pro, but its fun to play
around with these kinds of problems. =)
The aim is to estimate the height of small rocks sticking out of water, from surface to top.
The experimental equipment will be a ~10MP camera mounted on a distance meter with a built in laser pointer.
The "operator" will point this at a rock, press a trigger which will register a distance along of a photo of the rock, which
will be in the center of the image.
The eqipment can be assumed to always be held at a fixed distance above the water.
As I see it there are a number of problems to overcome:
Lighting conditions
Depending on the time of day etc., the rock might be brighter then the water or opposite.
Sometimes the rock will have a color very close to the water.
The position of the shade will move throughout the day.
Depending on how rough the water is, there might sometimes be a reflection of the rock in the water.
Diversity
The rock is not evenly shaped.
Depending on the rock type, growth of lichen etc., changes the look of the rock.
Fortunateness, there is no shortage of test data. Pictures of rocks in water is easy to come by. Here are some sample images:
I've run a edge detector on the images, and esp. in the fourth picture the poor contrast makes it hard to see the edges:
Any ideas would be greatly appreciated!
I don't think that edge detection is best approach to detect the rocks. Other objects, like the mountains or even the reflections in the water will result in edges.
I suggest that you try a pixel classification approach to segment the rocks from the background of the image:
For each pixel in the image, extract a set of image descriptors from a NxN neighborhood centered at that pixel.
Select a set of images and manually label the pixels as rock or background.
Use the labeled pixels and the respective image descriptors to train a classifier (eg. a Naive Bayes classifier)
Since the rocks tends to have similar texture, I would use texture image descriptors to train the classifier. You could try, for example, to extract a few statistical measures from each color chanel (R,G,B) like the mean and standard deviation of the intensity values.
Pixel classification might work here, but will never yield a 100% accuracy. The variance in the data is really big, rocks have different colours (which are also "corrupted" with lighting) and different texture. So, one must account for global information as well.
The problem you deal with is foreground extraction. There are two approaches I am aware of.
Energy minimization via graph cuts, see e.g. http://en.wikipedia.org/wiki/GrabCut (there are links to the paper and OpenCV implementation). Some initialization ("seeds") should be done (either by a user or by some prior knowledge like the rock is in the center while water is on the periphery). Another variant of input is an approximate bounding rectangle. It is implemented in MS Office 2010 foreground extraction tool.
The energy function of possible foreground/background labellings enforces foreground to be similar to the foreground seeds, and a smooth boundary. So, the minimum of the energy corresponds to the good foreground mask. Note that with pixel classification approach one should pre-label a lot of images to learn from, then segmentation is done automatically, while with this approach one should select seeds on each query image (or they are chosen implicitly).
Active contours a.k.a. snakes also requre some user interaction. They are more like Photoshop Magic Wand tool. They also try to find a smooth boundary, but do not consider the inner area.
Both methods might have problems with the reflections (pixel classification will definitely have). If it is the case, you may try to find an approximate vertical symmetry, and delete the lower part, if any. You can also ask a user to mark the reflaction as a background while collecting stats for graph cuts.
Color segmentation to find the rock, together with edge detection to find the top.
To find the water level I would try and find all the water-rock boundaries, and the horizon (if possible) then fit a plane to the surface of the water.
That way you don't need to worry about reflections of the rock.
Easier if you know the pitch angle between the camera and the water and if the camera is is leveled horizontally (roll).
ps. This is a lot harder than I thought - you don't know the distance to all the rocks so fitting a plane is difficult.
It occurs that the reflection is actually the ideal way of finding the level, look for symetric path edges in the rock edge detection and pick the vertex?