I want to generate a SURF descriptor (a vector of length 64).
In the original paper, here it says:
The region is split up regularly into smaller 4x4 square sub-regions. For each sub-region, we compute Haar wavelet responses at 5x5 regularly spaced sample points.
With OpenCV's help, I can get a keypoint and its corresponding region (by using KeyPoint::size and angle), but my questions are:
1. how do I compute Haar wavelet responses?
2. what does "at 5x5 regularly spaced sample points" mean?
I've read the wiki and introduction about the Haar wavelet but still have no idea how to write the code.
I know how to use the OpenCV SurfDescriptorExtractor, but I cannot use it because I need to enlarge or shrink the original region and get a new descriptor.
Thanks for any hint about how to generate a SURF descriptor.
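Edit: to make question 1 concrete, here is my rough guess at what a single Haar wavelet response looks like when computed from an integral image. This is only a sketch: the filter size 2*s and the sample point are placeholder values, and the rotation to the keypoint orientation and the Gaussian weighting of the real descriptor are ignored.

#include <opencv2/opencv.hpp>

// Sum of pixels in the rectangle [x, x+w) x [y, y+h), using an integral image
// produced by cv::integral (which has size (rows+1) x (cols+1)).
static double boxSum(const cv::Mat& ii, int x, int y, int w, int h)
{
    return ii.at<double>(y + h, x + w) - ii.at<double>(y, x + w)
         - ii.at<double>(y + h, x)     + ii.at<double>(y, x);
}

int main()
{
    cv::Mat gray = cv::imread("image.png", cv::IMREAD_GRAYSCALE); // placeholder file name
    cv::Mat ii;
    cv::integral(gray, ii, CV_64F);

    int s = 4;            // keypoint scale: the Haar filter is 2s x 2s in the paper
    int x = 100, y = 100; // one of the regularly spaced sample points (placeholder,
                          // assumed far enough from the image border)

    // dx response: right half minus left half of a 2s x 2s window centred on (x, y)
    double dx = boxSum(ii, x,     y - s, s, 2 * s)
              - boxSum(ii, x - s, y - s, s, 2 * s);

    // dy response: bottom half minus top half of the same window
    double dy = boxSum(ii, x - s, y,     2 * s, s)
              - boxSum(ii, x - s, y - s, 2 * s, s);

    // Each 4x4 sub-region would then accumulate sum(dx), sum(dy), sum(|dx|), sum(|dy|)
    // over its 5x5 sample points, giving the 4*4*4 = 64 values of the descriptor.
    return 0;
}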
Related
When reading about classic computer vision, I am confused about how multiscale feature matching works.
Suppose we use an image pyramid.
How do you deal with the same feature being detected at multiple scales? How do you decide which one to make a descriptor for?
How do you connect features between scales? For example, let's say you have a feature detected and matched to a descriptor at scale 0.5. Is this location then translated to its location at the initial scale?
I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2), though, so could you please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remain identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT descriptor as a circle of a different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.
During descriptor matching, you simply brute-force compare your list of descriptors, regardless of the scale each one was generated from, and try to find the closest match.
The whole point of SIFT, as a function, is to take in some image feature under different transformations and produce a similar numerical output at the end.
So if you do end up with multiple descriptors of the same feature, you'll just end up doing more computational work, but you will still essentially match the same pair of features across the two images regardless.
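With OpenCV, that brute-force step might look like the sketch below. It is a minimal example; desc1 and desc2 are assumed to be float descriptor matrices from any SIFT-style extractor, one row per keypoint.

#include <opencv2/opencv.hpp>
#include <vector>

// Find, for every descriptor in desc1, the closest descriptor in desc2.
std::vector<cv::DMatch> matchDescriptors(const cv::Mat& desc1, const cv::Mat& desc2)
{
    cv::BFMatcher matcher(cv::NORM_L2);   // brute-force, Euclidean distance for float descriptors
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches); // one DMatch per row of desc1
    return matches;                       // DMatch stores queryIdx, trainIdx and distance
}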
Edit:
If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 to that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you scale the image down to 100x100 pixels, the scaled keypoint coordinates would be something like (12,46). Extrapolating back to the original coordinates naively would give (120,460).
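In code, the naive mapping is just a multiplication by the ratio of the image sizes; a small sketch using the numbers from the example above:

#include <opencv2/opencv.hpp>

// Naively map a keypoint found in a scaled pyramid level back to the original image.
cv::Point2f toOriginalCoords(const cv::Point2f& scaledPt,
                             const cv::Size& scaledSize,
                             const cv::Size& originalSize)
{
    float sx = static_cast<float>(originalSize.width)  / scaledSize.width;
    float sy = static_cast<float>(originalSize.height) / scaledSize.height;
    return { scaledPt.x * sx, scaledPt.y * sy };
}

// toOriginalCoords({12, 46}, {100, 100}, {1000, 1000}) gives (120, 460),
// several pixels away from the true location (123, 456).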
So SIFT fits a Taylor expansion to the Difference of Gaussians function to try and locate the original keypoint of interest with sub-pixel accuracy, which you can then use to extrapolate back to the original image coordinates.
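For reference, the key expression from section 4 of the paper, quoted here rather than derived: the DoG function D is expanded around the detected sample point, and the sub-pixel offset is where the derivative of that expansion vanishes,

D(\mathbf{x}) \approx D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x}
                      + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}}

where \mathbf{x} = (x, y, \sigma)^{T} is the offset from the sample point, in position and scale.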
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math and C programming, and want to know specifically how SIFT is implemented, I suggest you dive into Rob Hess' SIFT implementation; lines 467 through 648 are probably the most detailed you can get.
I have started my first project in the field of image recognition using feature point detectors and descriptors. I had no prior knowledge of image recognition techniques before starting this project, so I researched the available detectors and descriptors and came to know the differences between them. Finally, I opted to work with the ORB detector and descriptor for image recognition (if it doesn't work according to my requirements, I would like to switch to BRISK later).
As of now, I am at the stage of getting results for image recognition using ORB. At this point, I was thinking of using Gaussian filters in my code so that I can get better results even if the input image is a bit blurred.
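For reference, this is roughly what I have in mind, sketched with OpenCV's cv::GaussianBlur and cv::ORB; the kernel size and sigma are placeholder values I chose, not recommendations.

#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat img = cv::imread("query.png", cv::IMREAD_GRAYSCALE);  // placeholder file name

// Smooth the (possibly noisy or blurred) input before feature detection.
cv::Mat smoothed;
cv::GaussianBlur(img, smoothed, cv::Size(5, 5), 1.0);

// nfeatures = 500; the other parameters stay at their defaults, where scoreType
// is cv::ORB::HARRIS_SCORE, i.e. FAST corners re-ranked by the Harris measure.
cv::Ptr<cv::ORB> orb = cv::ORB::create(500);

std::vector<cv::KeyPoint> keypoints;
cv::Mat descriptors;
orb->detectAndCompute(smoothed, cv::noArray(), keypoints, descriptors);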
My questions:
1) Is it possible to use Gaussian filters with ORB to get much better results for Image recognition?
2) When I read the paper on ORB, I came across the lines below:
FAST does not produce a measure of cornerness, and we have found that it has large responses along edges. We employ a Harris corner measure [11] to order the FAST keypoints. For a target number N of keypoints, we first set the threshold low enough to get more than N keypoints, then order them according to the Harris measure, and pick the top N points. FAST does not produce multi-scale features. We employ a scale pyramid of the image, and produce FAST features (filtered by Harris) at each level in the pyramid.
So ORB uses the Harris corner measure to order the corners detected in an image. Is it worthwhile for me to use Gaussian filters along with ORB?
3) Does ORB use only the Harris corner measure to detect corners, or something else as well?
Please let me know about this and just enlighten me on the above mentioned questions.
I often get confused about the meaning of the term descriptor in the context of image features. Is a descriptor the description of the local neighborhood of a point (e.g. a float vector), or is a descriptor the algorithm that outputs the description? And what exactly is then the output of a feature extractor?
I have been asking myself this question for a long time, and the only explanation I have come up with is that a descriptor is both: the algorithm and the description. A feature detector is used to detect distinctive points. The term feature extractor, however, then does not seem to make any sense.
So, is a feature descriptor the description or the algorithm that produces the description?
A feature detector is an algorithm which takes an image and outputs locations (i.e. pixel coordinates) of significant areas in your image. An example of this is a corner detector, which outputs the locations of corners in your image but does not tell you any other information about the features detected.
A feature descriptor is an algorithm which takes an image and outputs feature descriptors/feature vectors. Feature descriptors encode interesting information into a series of numbers and act as a sort of numerical "fingerprint" that can be used to differentiate one feature from another. Ideally this information would be invariant under image transformation, so we can find the feature again even if the image is transformed in some way. An example would be SIFT, which encodes information about the local neighbourhood image gradients into the numbers of the feature vector. Other examples you can read about are HOG and SURF.
EDIT: When it comes to feature detectors, the "location" might also include a number describing the size or scale of the feature. This is because things that look like corners when "zoomed in" may not look like corners when "zoomed out", and so specifying scale information is important. So instead of just using an (x,y) pair as a location in "image space", you might have a triple (x,y,scale) as location in "scale space".
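In OpenCV, the two roles map onto separate calls; a minimal sketch below, with FAST as the detector and ORB as the descriptor extractor, a pairing chosen only for illustration.

#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat img = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);  // placeholder file name

// Detector: outputs only the locations (plus size/angle metadata) of interesting points.
cv::Ptr<cv::FastFeatureDetector> detector = cv::FastFeatureDetector::create();
std::vector<cv::KeyPoint> keypoints;
detector->detect(img, keypoints);

// Descriptor extractor: turns each location into a numerical "fingerprint".
cv::Ptr<cv::ORB> extractor = cv::ORB::create();
cv::Mat descriptors;
extractor->compute(img, keypoints, descriptors);
// After compute, descriptors.rows == keypoints.size() and each row is one
// descriptor (32 bytes per keypoint for ORB).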
As for the descriptor, I understand it as the description of the neighborhood of a point in the image. In other words, it is a vector computed from the image (a description of the visual features of the content around that point).
For example, there is a method in HOG (Histogram of Oriented Gradients) based on image gradients and spatial/orientation binning. The extractHOGFeatures function in Matlab and "Classification using HOG" have visual examples for better understanding.
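OpenCV has a corresponding class, cv::HOGDescriptor; a minimal sketch, where resizing the input to the default 64x128 detection window is my own simplification.

#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat img = cv::imread("person.png", cv::IMREAD_GRAYSCALE);  // placeholder file name
cv::resize(img, img, cv::Size(64, 128));                       // default HOG window size

cv::HOGDescriptor hog;          // defaults: 8x8 cells, 16x16 blocks, 9 orientation bins
std::vector<float> descriptor;
hog.compute(img, descriptor);   // gradient histograms concatenated into one long vector
// descriptor.size() is 3780 for the default parameters.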
I have a set of 100K 64x64 gray patches (that are already aligned, meaning they all have the same orientation) and I would like to extract a SIFT descriptor from each one using OpenCV.
It is clear to me that all I need to do is define a vector with one keypoint kp such that: kp.x=32, kp.y=32.
However, I don't know how to set the kp.size parameter. Going over SIFT's code, it looks as if it does some non-trivial calculations with that parameter instead of just assuming that it's the size of the patch.
Question 1: what should be the kp.size parameter when extracting SIFT descriptors from patches of size 64x64?
Question 2: what should be the kp.size parameter when extracting SURF descriptors from patches of size 64x64?
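Edit: for concreteness, this is the call I am trying to make, assuming OpenCV >= 4.4 where SIFT lives in the main module; the size value of 32 below is only a placeholder, and is exactly the value I am asking about.

#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat patch = cv::imread("patch_0001.png", cv::IMREAD_GRAYSCALE); // 64x64, placeholder name

std::vector<cv::KeyPoint> kps;
kps.emplace_back(32.f, 32.f, /*size=*/32.f); // size is the parameter in question
kps.back().angle = 0.f;                      // patches are already aligned, so fix the orientation

cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
cv::Mat descriptor;                          // one 1x128 row if the keypoint is kept
sift->compute(patch, kps, descriptor);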
If you look at the original SIFT publication, the scale of the keypoint is used to weight the histogram of gradient magnitudes and orientations (paragraph 6, "The local image descriptor"). So in your case, since the grey patches are aligned, it is up to you to decide whether you want to weight the contributions of the pixels further from the patch center or not, and to select the scale (i.e. the width of the Gaussian weighting window) accordingly.
For SURF, it's basically the same principle, except that instead of gradient magnitudes the responses to Haar wavelets are used, but you could still weight those responses with a Gaussian window.
Also, since you are working with those aligned patches, I would advise you not to use the high-level functions of OpenCV, but to simply use/recode the descriptor extraction part and apply any weighting you want to compute your patch representation. One reason to do so is that, in the SIFT case, the computation of SIFT descriptors might "add new keypoints" to the ones you provided: if the algorithm is "not happy" with the keypoint orientation, it duplicates the keypoint at the same location but with a different orientation.
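A minimal sketch of building such a Gaussian weighting window, assuming cv::getGaussianKernel; the sigma of 16 (a quarter of the patch width) is an arbitrary illustrative choice.

#include <opencv2/opencv.hpp>

// Build a 64x64 Gaussian weight map centred on the patch, to down-weight
// contributions (gradients or Haar responses) far from the patch centre.
cv::Mat g = cv::getGaussianKernel(64, 16.0, CV_32F); // 64x1 column vector
cv::Mat weights = g * g.t();                         // separable -> 64x64 window

// Example use: given a per-pixel response map of the same size and type,
// weight it before accumulating it into the histograms:
// cv::Mat weighted = responses.mul(weights);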
Okay. So the SIFT descriptor usually uses a neighbourhood of 4x4 grids, each grid usually being 4x4 pixels, so the neighbourhood in pixels is usually 16x16. The scale/size parameter determines the amount of downsampling/blurring and the radius of the keypoint. So I would think that in your case this would be 4.
You probably also know that SIFT keypoints are located with sub-pixel precision. (32,32) would not be the exact center of your image patch; it would actually be (32.5, 32.5) if your image coordinates (x,y) start from 1. If they start from 0, it would be (31.5, 31.5), as is the case in OpenCV.
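Putting the two remarks together, the manually constructed keypoint would look something like this; the size of 4 follows the reasoning above but remains a judgement call.

#include <opencv2/opencv.hpp>

// Sub-pixel centre of a 0-indexed 64x64 patch, with the size suggested above.
cv::KeyPoint kp(31.5f, 31.5f, /*size=*/4.f);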
I have some conceptual issues in understanding the SURF and SIFT algorithms (All about SURF). As far as my understanding goes, SURF finds the Laplacian of Gaussians and SIFT operates on the Difference of Gaussians. It then constructs a 64-variable vector around the keypoint to extract the features. I have applied this CODE.
(Q1) So, what forms the features?
(Q2) We initialize the algorithm using SurfFeatureDetector detector(500). So, does this mean that the size of the feature space is 500?
(Q3) The output of SURF Good_Matches gives matches between Keypoint1 and Keypoint2, and by tuning the number of matches we can conclude whether the object has been found/detected or not. What is meant by KeyPoints? Do these store the features?
(Q4) I need to build an object recognition application. In the code, it appears that the algorithm can recognize the book, so it can be applied to object recognition. I was under the impression that SURF can be used to differentiate objects based on color and shape. But SURF and SIFT are based on corner/edge detection, so there is no point in using color images as training samples since they will be converted to gray scale. There is no option of using colors or HSV in these algorithms, unless I compute the keypoints for each channel separately, which is a different area of research (Evaluating Color Descriptors for Object and Scene Recognition).
So, how can I detect and recognize objects based on their color and shape? I think I can use SURF to differentiate objects based on their shape. Say, for instance, I have two books and a bottle, and I need to recognize only a single book out of all the objects. But as soon as there are other similarly shaped objects in the scene, SURF gives lots of false positives. I would appreciate suggestions on what methods to apply for my application.
The local maxima (responses of the DoG which are greater (or smaller) than the responses of the neighbouring pixels around the point in the current, upper and lower images of the pyramid -- a 3x3x3 neighbourhood) form the coordinates of the feature (circle) center. The radius of the circle is the level of the pyramid.
It is the Hessian threshold. It means that you only keep the maxima (see 1) with values bigger than the threshold. A bigger threshold leads to fewer features, but the stability of the features is better, and vice versa.
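In OpenCV this is the hessianThreshold argument; a sketch assuming the opencv_contrib xfeatures2d module is available (the two threshold values are arbitrary, chosen only to show the effect).

#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

cv::Mat img = cv::imread("book.png", cv::IMREAD_GRAYSCALE);  // placeholder file name

// Higher threshold -> fewer but more stable keypoints; lower -> more but less stable ones.
cv::Ptr<cv::xfeatures2d::SURF> surfStrict  = cv::xfeatures2d::SURF::create(1000.0);
cv::Ptr<cv::xfeatures2d::SURF> surfLenient = cv::xfeatures2d::SURF::create(100.0);

std::vector<cv::KeyPoint> kpsStrict, kpsLenient;
surfStrict->detect(img, kpsStrict);
surfLenient->detect(img, kpsLenient);
// kpsLenient.size() >= kpsStrict.size() in general.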
Keypoint == feature. In OpenCV, KeyPoint is the structure used to store features.
No, SURF is good for comparing textured objects but not for shape and color. For shape, I recommend using MSER (but not the OpenCV one) or the Canny edge detector rather than local features. This presentation might be useful.