Poor object recognition performance with OpenCV

I am trying to do object recognition using algorithms such as SURF, Ferns and FREAK in OpenCV 2.4.2.
I am using the programs from the OpenCV samples without modification: find_obj.cpp, find_obj_ferns.cpp and freak_demo.cpp.
I tried changing the parameters of the algorithms, which didn't help.
I have my training images, test images and the result of FREAK recognition here
As you can see, the result is pretty bad.
No feature descriptors are detected for one of the training images - image here
Feature descriptors are detected outside the object boundary for the other - image here
I have a few questions:
Why do these algorithms work only with grayscale images? It is apparent that, for my training images above, the object could be detected easily if RGB were included. Is there any technique that takes color into account?
Is there any other way to improve performance? I tried fiddling with the feature parameters, which didn't work well.

The first thing I noticed in your image is that the object is planar and has almost no texture variation. All the feature detectors you used look for corner-like keypoints that are meant to be view invariant, i.e. points in the image that have a unique neighborhood and large x and y derivatives. I have uploaded my analysis; see the figures.
How can you check that what I am saying is correct?
Look at the descriptor values of a keypoint found on your object: you will see that most of them are zero. That is because a descriptor describes how the edges vary around a corner point in specific directions (see the SURF documentation for more details).
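To make the point about near-zero descriptors concrete, here is a minimal sketch. It is not the real SURF descriptor (which sums Haar wavelet responses over a grid of sub-regions); `gradient_summary` is a hypothetical helper that just sums simple finite-difference gradients over one patch, which is enough to show why a textureless patch yields zeros.

```python
def gradient_summary(patch):
    """Return (sum dx, sum |dx|, sum dy, sum |dy|) over the patch interior.

    A simplified, SURF-like summary for illustration only.
    """
    height, width = len(patch), len(patch[0])
    sum_dx = sum_abs_dx = sum_dy = sum_abs_dy = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the x and y derivatives.
            dx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            dy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sum_dx += dx
            sum_abs_dx += abs(dx)
            sum_dy += dy
            sum_abs_dy += abs(dy)
    return (sum_dx, sum_abs_dx, sum_dy, sum_abs_dy)


# A uniform patch, like the plain back of a phone.
flat = [[128] * 8 for _ in range(8)]
# A patch containing a bright corner in its lower-right quadrant.
corner = [[255 if (x >= 4 and y >= 4) else 0 for x in range(8)] for y in range(8)]

print(gradient_summary(flat))    # all zeros: nothing to describe
print(gradient_summary(corner))  # non-zero entries in both directions
```

With a real detector the values are rarely exactly zero because of noise, but on a uniform surface they stay too small to be distinctive.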
The object you are trying to detect looks like a mobile phone, so turn it over and repeat the experiment; you should get much better results, because the front side of an object usually has more texture, such as buttons, logos, etc.
Here is a result I uploaded:


How does multiscale feature matching work? ORB, SIFT, etc

When reading about classic computer vision, I am confused about how multiscale feature matching works.
Suppose we use an image pyramid:
(1) How do you deal with the same feature being detected at multiple scales? How do you decide which one to make a descriptor for?
(2) How do you connect features between scales? For example, let's say you have a feature detected and matched to a descriptor at scale 0.5. Is this location then translated to its location at the initial scale?
I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remain identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT keypoint as a circle whose size reflects its scale. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
To my knowledge, SIFT doesn't really care about this issue. If scaling the image enough ends up creating multiple descriptors from "the same feature", then those are simply distinct descriptors to SIFT.
During descriptor matching, you brute-force compare your lists of descriptors, regardless of the scale each one was generated at, and try to find the closest match.
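As a rough illustration of that brute-force step, the sketch below matches each query descriptor to its nearest neighbor by Euclidean distance. The function name and the tiny 2-D "descriptors" are made up for the example (real SIFT descriptors have 128 dimensions); note that scale plays no role in the comparison.

```python
import math


def match_brute_force(query, train):
    """For every query descriptor, find the closest train descriptor.

    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    for qi, q in enumerate(query):
        best_index, best_dist = None, math.inf
        for ti, t in enumerate(train):
            d = math.dist(q, t)  # Euclidean distance between descriptors
            if d < best_dist:
                best_index, best_dist = ti, d
        matches.append((qi, best_index, best_dist))
    return matches


# Toy descriptors; which pyramid level produced them is irrelevant here.
query = [(0.0, 1.0), (5.0, 5.0)]
train = [(5.0, 4.0), (0.0, 1.5), (9.0, 9.0)]

for qi, ti, dist in match_brute_force(query, train):
    print(f"query {qi} -> train {ti} (distance {dist:.2f})")
```

In practice you would also reject ambiguous matches, for example by comparing the best distance against the second best, but that is beyond this sketch.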
The whole point of SIFT, viewed as a function, is to take in some image feature under different transformations and produce a similar numerical output each time.
So if you do end up with multiple descriptors of the same feature, you will just have to do more computational work, but you will still essentially match the same pair of features across the two images.
If you are asking how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 to that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled the image down to 100x100 pixels, then the scaled keypoint coordinates would be something like (12,46). Naively extrapolating back to the original image would give the coordinates (120,460).
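The arithmetic of this example can be reproduced in a few lines. The helper names are hypothetical, and the sketch assumes the keypoint is simply rounded to the nearest pixel in the downscaled image.

```python
def to_scaled(point, factor):
    """Map original-image coordinates to the nearest pixel after downscaling."""
    return tuple(round(c / factor) for c in point)


def to_original(point, factor):
    """Naively map downscaled pixel coordinates back to the original image."""
    return tuple(c * factor for c in point)


original = (123, 456)
factor = 10  # 1000x1000 -> 100x100

scaled = to_scaled(original, factor)
recovered = to_original(scaled, factor)
error = tuple(r - o for r, o in zip(recovered, original))

print(scaled)     # (12, 46)
print(recovered)  # (120, 460)
print(error)      # (-3, 4)
```

The error is bounded by half a pixel of the downscaled image, which is why it grows with the downscaling factor.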
Instead, SIFT fits a Taylor expansion of the Difference-of-Gaussians function to locate the original keypoint with sub-pixel accuracy, and that refined position is what you extrapolate back to the original image coordinates.
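A one-dimensional version of the idea is easy to sketch. Lowe's paper works with a 3-D quadratic in (x, y, scale), where the offset is minus the inverse Hessian times the gradient; the hypothetical helper below does the same thing along a single axis using finite differences. The response values are invented so that the true peak sits at 12.3 in the downscaled image.

```python
def subpixel_offset(left, center, right):
    """Offset of the extremum of the parabola through three samples.

    The samples are taken at positions -1, 0 and +1 around the detected pixel.
    Setting the derivative of the second-order Taylor expansion to zero gives
    offset = -f'(0) / f''(0).
    """
    first = (right - left) / 2.0            # finite-difference f'(0)
    second = right - 2.0 * center + left    # finite-difference f''(0)
    if second == 0:
        return 0.0  # flat response: no refinement possible
    return -first / second


def response(x):
    """Hypothetical detector response with its true peak at x = 12.3."""
    return 1.0 - (x - 12.3) ** 2


pixel = 12  # integer location of the maximum in the 100x100 image
offset = subpixel_offset(response(pixel - 1), response(pixel), response(pixel + 1))
refined_original = (pixel + offset) * 10  # back to 1000x1000 coordinates

print(round(offset, 6), round(refined_original, 6))
```

Because the toy response is exactly quadratic, the fit recovers the offset of 0.3 and maps back to about 123 rather than 120; on real data the fit is only an approximation.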
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math and C programming and want to know specifically how SIFT is implemented, I suggest you dive into Rob Hess' SIFT implementation; lines 467 through 648 are probably the most detailed you can get.

Is image segmentation required with the SURF algorithm?

I was reading about image segmentation, and I understood that it is the first step in image analysis. But I also read that if I am using SURF or SIFT to detect and extract features there is no need for segmentation. Is that true? Is there a need for segmentation if I am using SURF?
The dependency between segmentation and recognition is a bit more complex. Clearly, knowing which pixels of the image belong to your object makes recognition easier. But the relationship also works in the other direction: knowing what is in the image makes segmentation easier. For simplicity, I will only speak about a simple pipeline where segmentation is performed first (for instance, based on some simple color model) and each of the segments is then processed.
Your question specifically asks about the SURF features. However, in this context, what is important is that SURF is a local descriptor, i.e. it describes small image patches around detected keypoints. Keypoints should be points in the image where information relevant to your recognition problem can be found (interesting parts of the image), but also points that can reliably be detected in a repeatable fashion on all images of objects belonging to the class of interest. As a result, a local descriptor only cares about the pixels around points selected by the keypoint detector and for each such keypoint extracts a small feature vector. On the other hand a global descriptor will consider all pixels within some area, typically a segment, or the whole image.
Therefore, to perform recognition in an image using a global descriptor, you first need to select the area (segment) from which you want your features to be extracted. These features would then be used to recognize the content of the segment. The situation is a bit different with a local descriptor, since it describes local patches that the keypoint detector determines to be relevant. As a result, you get multiple feature vectors for multiple points in the image, even if you do not perform segmentation. Each of these feature vectors tells you something about the content of the image, and you can try to assign each such local feature vector to a "class" and gather their statistics to understand the content of the image. Such a simple model is called the bag-of-words model.
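A minimal sketch of that last step, assuming a vocabulary of "visual words" has already been obtained somehow (typically by clustering training descriptors, which is not shown). The function name and the 2-D toy descriptors are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(descriptors, vocabulary):
    """Histogram of nearest visual words for one image.

    Each local descriptor is assigned to the closest vocabulary entry, and
    the image is summarized by how often each word occurs.
    """
    counts = Counter()
    for d in descriptors:
        word = min(range(len(vocabulary)), key=lambda i: math.dist(d, vocabulary[i]))
        counts[word] += 1
    return [counts[i] for i in range(len(vocabulary))]


# Hypothetical vocabulary with two visual words.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
# Local descriptors extracted from one image, no segmentation involved.
descriptors = [(1.0, 0.0), (0.0, 2.0), (2.0, 1.0), (9.0, 11.0)]

print(bag_of_words(descriptors, vocabulary))  # [3, 1]
```

The resulting histogram is a fixed-length vector, so it can be fed to any ordinary classifier regardless of how many keypoints the image produced.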

Object Recognition by Outlines vs Features

I have the RGB-D video from a Kinect, which is aimed straight down at a table. There is a library of around 12 objects I need to identify, alone or several at a time. I have been working with SURF extraction and detection from the RGB image, preprocessing by downscaling to 320x240, grayscale, stretching the contrast and balancing the histogram before applying SURF. I built a lasso tool to choose among detected keypoints in a still of the video image. Then those keypoints are used to build object descriptors which are used to identify objects in the live video feed.
SURF examples show successful identification of objects with a decent amount of text-like feature detail, e.g. logos and patterns. The objects I need to identify are relatively plain but have distinctive geometry. The SURF features found in my stills are sometimes consistent but mostly correspond to unimportant surface features. For instance, say I have a wooden cube. SURF detects a few bits of grain on one face, then fails on other faces. I need to detect (something like) that there are four corners at equal distances and right angles. None of my objects has much of a pattern but all have distinctive symmetric geometry and color. Think cellphone, lollipop, knife, bowling pin. My thought was that I could build object descriptors for each significantly different-looking orientation of the object, e.g. two descriptors for a bowling pin: one standing up and one lying down. For a cellphone, one lying on its front and one on its back. My recognizer needs rotational invariance and some degree of scale invariance in case objects are stacked. The ability to deal with some occlusion is preferable (SURF behaves well enough) but not the most important characteristic. Skew invariance would be preferable, and SURF does well with paper printouts of my objects held by hand at a skew.
Am I using the wrong SURF parameters to find features at the wrong scale? Is there a better algorithm for this kind of object identification? Is there something as readily usable as SURF that uses the depth data from the Kinect along with or instead of the RGB data?
I was doing something similar for a project, and ended up using a super simple method for object recognition, which was using OpenCV blob detection, and recognizing objects based on their areas. Obviously, there needs to be enough variance for this method to work.
You can see my results here: http://portfolio.jackkalish.com/Secondhand-Stories
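For what it's worth, the classification step of the area-based approach can be as small as the sketch below. It assumes the blob areas (in pixels) have already been measured by some blob or contour detector; the reference areas, labels and tolerance are invented for illustration.

```python
def classify_by_area(area, reference_areas, tolerance=0.15):
    """Return the label whose reference area is closest to `area`.

    A label is only accepted if the relative difference is within
    `tolerance`; otherwise None is returned.
    """
    best_label, best_error = None, tolerance
    for label, ref in reference_areas.items():
        rel_error = abs(area - ref) / ref
        if rel_error <= best_error:
            best_label, best_error = label, rel_error
    return best_label


# Hypothetical blob areas measured once per object from the overhead camera.
reference_areas = {"cellphone": 7000, "lollipop": 1200, "knife": 3500}

print(classify_by_area(6800, reference_areas))  # cellphone
print(classify_by_area(5000, reference_areas))  # None (too far from any reference)
```

This only works when the objects' areas differ by clearly more than the tolerance, and it breaks down when objects overlap or are stacked.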
I know there are other methods out there, one possible solution for you could be approxPolyDP, which is described here:
How to detect simple geometric shapes using OpenCV
Would love to hear about your progress on this!

How can HOG be used to detect individual body parts

I would like to use OpenCV's HOG detection to identify objects that can be seen in a variety of orientations. The only problem is that I can't seem to find a reasonable feature detector or classifier to detect them in a rotation- and scale-invariant way (as is needed for objects such as forearms).
Prior Work:
Let's focus on forearms for this discussion. A forearm can have multiple orientations, and its primary distinctive features are probably its contour edges. A forearm can point in any direction in an image, hence the complexity. So far I have done some in-depth research on using HOG descriptors to solve this problem, but I am finding that the variety of forearm poses in my positive training set produces very low detection scores on actual images. I suspect the issue is that the gradients of the positive images do not accumulate consistently in the histogram. I have reviewed many research papers on the topic while trying to resolve or improve this, including the original from Dalal & Triggs (http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). It also seems that the assumptions made for detecting whole humans do not necessarily apply to detecting individual body parts (in particular, the assumption that all humans are standing upright suggests HOG is not a good route for rotation-invariant detection of something like a forearm).
If possible, I would like to steer clear of any non-free solutions such as those pertaining to SIFT, SURF, or Haar.
What is a good solution to detecting rotation and scale invariant objects in an image? Particularly for this example, what would be a good solution to detecting all orientations of forearms in an image?
I use HOG to detect human heads and shoulders. To train a detector for a particular part, you have to give its location. If you use OpenCV, you can crop samples so that they contain only the part you want to train on, and make sure all training samples share the same size. For example, I crop images to contain only the head and shoulders and resize them all to 64x64. Other open-source code may require you to pass the location as an input parameter, which is essentially the same thing.
Have you tried the discriminatively trained deformable part model? http://www.cs.berkeley.edu/~rbg/latent/
You may find answers there.
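As a side note on the rotation problem raised in the question, a stripped-down orientation histogram shows why a single HOG template is tied to one pose. This is only a sketch of the histogram step (one cell, 9 unsigned bins over 0 to 180 degrees as in Dalal & Triggs, hard binning, no block normalization), and the patches are synthetic.

```python
import math


def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Unsigned orientation: fold the angle into [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# The same edge in two poses: dark-to-bright left to right, then top to bottom.
upright = [[0 if x < 4 else 255 for x in range(8)] for y in range(8)]
rotated = [[0 if y < 4 else 255 for x in range(8)] for y in range(8)]

h_upright = orientation_histogram(upright)
h_rotated = orientation_histogram(rotated)
print(h_upright.index(max(h_upright)))  # 0
print(h_rotated.index(max(h_rotated)))  # 4
```

The same edge lands in a different bin after a 90 degree rotation, so a classifier trained on one pose sees a different feature vector for the other. That is consistent with the suggestion above to train on tightly cropped, consistently sized samples of one part, and it suggests keeping the orientation of those crops consistent too.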

What exactly do size and response represent in a SURF keypoint?

I'm using OpenCV 2.3 for keypoints detection and matching. But I am a bit confused with the size and response parameters given by the detection algorithm. What do they exactly mean?
Based on the OpenCV manual, I can't figure it out:
float size: diameter of the meaningful keypoint neighborhood
float response: the response by which the most strong keypoints have been selected. Can be used for further sorting or subsampling
I thought the best point to track would be the one with the highest response, but it seems that this is not the case. So how could I subsample the set of keypoints returned by the SURF detector to keep only the best ones in terms of trackability?
Size and response
SURF is a blob detector; in short, the size of a feature is the size of the blob. To be more precise, the size returned by OpenCV is half the length of the approximated Hessian operator. The size is also known as the scale. This is due to the way blob detectors work: they are functionally equivalent to first blurring the image with a Gaussian filter at several scales, then downsampling the images, and finally detecting blobs of a fixed size. See the image below showing the size of the SURF features. The size of each feature is the radius of the drawn circle. The lines going out from the center of each feature to the circumference show the angles or orientations. In this image, the response strength of the blob detection filter is color coded. You can see that the majority of the detected features have a weak response. (see the full size image here)
This histogram shows the distribution of the response strengths of the features in the above image:
What features to track?
The most robust feature tracker tracks all the detected features. The more features, the more robust the tracking. But it is impractical to track a large number of features, as we often want to limit the computation time. The number of features to track should usually be tuned empirically for each application. Often the image is divided into regular sub-regions, and in each one the n strongest features are kept for tracking. n is usually chosen such that about 500 to 1000 features in total are kept per frame.
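A minimal sketch of that grid-based selection, assuming each keypoint is available as an (x, y, response) tuple (with OpenCV these values would come from each keypoint's position and response fields). The function name, cell size and sample data are illustrative.

```python
from collections import defaultdict


def strongest_per_cell(keypoints, cell_size, n):
    """Keep the n keypoints with the highest response in each grid cell."""
    cells = defaultdict(list)
    for kp in keypoints:
        x, y, _response = kp
        # Integer cell index of the sub-region containing this keypoint.
        cells[(int(x // cell_size), int(y // cell_size))].append(kp)

    kept = []
    for members in cells.values():
        members.sort(key=lambda kp: kp[2], reverse=True)
        kept.extend(members[:n])
    return kept


keypoints = [
    (5, 5, 0.9), (8, 2, 0.1), (3, 7, 0.5),  # three keypoints in cell (0, 0)
    (55, 60, 0.2),                          # a lone, weak keypoint in cell (1, 1)
]

print(sorted(strongest_per_cell(keypoints, cell_size=50, n=2)))
```

Note that the weak keypoint in the second cell survives while a stronger one in the crowded cell is dropped: the goal is to spread the tracked features over the image, not just to take the globally strongest responses.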
Reading the journal paper describing SURF will definitely give you a good idea of how it works. Just try not to get stuck in the details, especially if your background isn't in machine/computer vision or image processing. The SURF detector may seem extremely novel at first glance, but the whole idea is to estimate the Hessian operator (a well-established filter) using integral images (which had been used by other methods long before SURF). If you want to understand SURF very well and you're not familiar with image processing, you need to go back and read some introductory material. Recently I came across a new and free book, whose chapter 13 has a good and brief introduction to feature detection. Not everything said in there is technically correct, but it's a good starting point. Here you can find another good description of SURF with several images showing how each step works. On that page you see this image:
You can see the white and black blobs, these are the blobs that SURF detects at several scales and estimates their sizes (radius in the OpenCV code).
"size" is the size of the area covered by the descriptor in the original image (it is obtained by downsampling the original image in the scale space, hence it varies from key point to key point based on their scale).
"response" is indeed an indicator of "how good" (roughly speaking, in terms of corner-ness) a point is.
Good points are stable for static scene retrieval (this is the main purpose of SIFT/SURF descriptors). In the case of tracking, good points can appear because the tracked object is in front of a well-structured background, or half in shadow, and then disappear because this condition has changed (change of light, occlusion...). So for tracking tasks there is no guarantee that a good point will always be there.
