OpenCV - Extracting SIFT/SURF descriptor from pre-cropped patches - opencv

I have a set of 100K 64x64 gray patches (that are already aligned, meaning they all have the same orientation) and I would like to extract a SIFT descriptor from each one using OpenCV.
It is clear to me all I need to do is to define a vector with one keypoint kp such that: kp.x=32, kp.y=32.
However, I don't know how to set the kp.size parameter. From going over SIFT's code, it looks as if it's doing some non-trivial calculations with that parameter instead of just assuming that it's the size of the patch.
Question 1: what should be the kp.size parameter when extracting SIFT descriptors from patches of size 64x64?
Question 2: what should be the kp.size parameter when extracting SURF descriptors from patches of size 64x64?

If you look at the original SIFT publication, the scale of the keypoint is used to weight the histogram of gradient magnitudes and orientations (Section 6, "The local image descriptor"). So in your case, since the grey patches are aligned, it is up to you to decide whether you want to down-weight the contributions of the pixels further from the patch center, and to select the scale (i.e. the width of the Gaussian weighting window) accordingly.
For SURF, it's basically the same principle, except that the responses to Haar wavelets are used instead of the gradient magnitude; you could still weight those responses with a Gaussian window.
Also, since you are working with those aligned patches, I would advise you not to use the high-level functions of OpenCV, but simply to use/recode the descriptor extraction part and apply any weighting you want when computing your patch representation. One reason to do so is that, in the SIFT example, the computation of SIFT descriptors might "add new keypoints" to the one you provided: if the algorithm is "not happy" with the keypoint orientation, it duplicates the keypoint at the same location but with a different orientation.
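To make the "recode the descriptor extraction part" suggestion concrete, here is a minimal pure-Python sketch of a SIFT-style descriptor computed over a whole aligned patch. It is not OpenCV's implementation: the function name patch_descriptor and its parameters are hypothetical, it uses hard binning instead of the trilinear interpolation of the original method, and it assumes the patch fills the entire 4x4 descriptor grid. The sigma argument is the width of the Gaussian weighting window discussed above; passing None switches the weighting off.

```python
import math

def patch_descriptor(patch, sigma=None, cells=4, bins=8):
    """Simplified SIFT-style descriptor for a square, pre-aligned gray patch.

    patch: list of rows of numbers (e.g. 64 rows of 64 values).
    sigma: width (in pixels) of the Gaussian weighting window centred on
           the patch; None disables the weighting.
    Returns a list of cells * cells * bins values (128 by default).
    """
    n = len(patch)
    centre = (n - 1) / 2.0            # 31.5 for a 64x64 patch, 0-based
    cell_size = n / cells
    hist = [0.0] * (cells * cells * bins)

    # Skip the one-pixel border so central differences are defined.
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            dy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            if sigma is not None:
                d2 = (x - centre) ** 2 + (y - centre) ** 2
                mag *= math.exp(-d2 / (2.0 * sigma ** 2))
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * bins) % bins
            cx = int(x / cell_size)
            cy = int(y / cell_size)
            hist[(cy * cells + cx) * bins + b] += mag

    def normalise(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v] if norm > 0 else v

    # Normalise, clip large entries at 0.2, then renormalise.
    hist = normalise(hist)
    hist = [min(c, 0.2) for c in hist]
    return normalise(hist)

# Usage: a synthetic horizontal ramp stands in for a real 64x64 patch.
ramp = [[float(x) for x in range(64)] for _ in range(64)]
descriptor = patch_descriptor(ramp, sigma=32.0)
print(len(descriptor))  # 128
```

Because the patches are already aligned, the sketch skips orientation assignment altogether, so no duplicate keypoints can appear. A sigma of half the patch width (32 here) mirrors the original paper's choice of half the descriptor window width.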

Okay. So the SIFT descriptor uses a neighbourhood of 4x4 grids usually, each grid usually being 4x4 pixels. Therefore the neighbourhood in pixels is usually 16x16. The scale/size is the parameter to determine the amount of downsampling/blurring/radius of keypoint. So I would think in your case, this would be 4.
You probably also know that SIFT keypoints work at sub-pixel precision. (32,32) would not be the exact center of your image patch, which would actually be (32.5, 32.5) if your image coordinates (x,y) start from 1. If they start from 0, as is the case in OpenCV, it would be (31.5, 31.5).
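The centre arithmetic above can be checked with a small helper. The name patch_centre is hypothetical, and the sketch assumes that pixel centres sit on integer coordinates.

```python
def patch_centre(width, height, zero_based=True):
    """Geometric centre of a width x height pixel grid.

    Assumes pixel centres lie on integer coordinates, starting at 0
    (zero_based=True) or at 1 (zero_based=False).
    """
    offset = 0.0 if zero_based else 1.0
    return ((width - 1) / 2.0 + offset, (height - 1) / 2.0 + offset)

print(patch_centre(64, 64))                    # (31.5, 31.5)
print(patch_centre(64, 64, zero_based=False))  # (32.5, 32.5)
```

OpenCV keypoint coordinates are floating point, so a half-pixel centre such as 31.5 can be used directly.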


How does multiscale feature matching work? ORB, SIFT, etc

When reading about classic computer vision I am confused on how multiscale feature matching works.
Suppose we use an image pyramid,
How do you deal with the same feature being detected at multiple scales? How do you decide which to make a descriptor for?
How do you connect features between scales? For example, let's say you have a feature detected and matched to a descriptor at scale .5. Is this location then translated to its location in the initial scale?
I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remain identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT descriptor as a circle of different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.
During descriptor matching, you simply brute-force compare your list of descriptors, regardless of what scale it was generated from, and try to find the closest match.
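A minimal sketch of that brute-force comparison, assuming descriptors are plain lists of numbers and using Euclidean distance. The function name brute_force_match is hypothetical, and the scale at which each descriptor was found is deliberately ignored.

```python
import math

def brute_force_match(query, train):
    """Nearest-neighbour matching between two lists of descriptors.

    Returns one (query_index, train_index, distance) tuple per query
    descriptor, comparing every query against every train descriptor.
    """
    matches = []
    for qi, q in enumerate(query):
        best_idx, best_dist = -1, float("inf")
        for ti, t in enumerate(train):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, t)))
            if dist < best_dist:
                best_idx, best_dist = ti, dist
        matches.append((qi, best_idx, best_dist))
    return matches

# Toy 2-D "descriptors"; real SIFT descriptors have 128 values each.
query = [[0.0, 0.0], [1.0, 1.0]]
train = [[1.0, 1.1], [0.0, 0.1]]
for qi, ti, dist in brute_force_match(query, train):
    print(qi, "->", ti)
```

If the same feature produced several descriptors at different scales, each of them simply takes part in this search as an independent entry, which costs extra comparisons but does not change the procedure.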
The whole point of SIFT as a function, is to take in some image feature under different transformations, and produce a similar numerical output at the end.
So if you do end up with multiple descriptors of the same feature, you'll just end up having to do more computational work, but you will still essentially match the same pair of features across two images regardless.
If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 to that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled down the image to 100x100 pixel, then the scaled keypoint coordinate would be something like (12,46). Extrapolating back to the original coordinates naively would give the coordinates (120,460).
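The example above can be reproduced in a few lines. The helper names and the integer downsampling factor are assumptions made for illustration only.

```python
def to_scaled(pt, factor):
    """Map original-image coordinates to the nearest pixel of the downsampled image."""
    return (round(pt[0] / factor), round(pt[1] / factor))

def to_original(pt, factor):
    """Naively map downsampled pixel coordinates back to the original image."""
    return (pt[0] * factor, pt[1] * factor)

factor = 10                      # 1000x1000 image reduced to 100x100
original = (123, 456)
scaled = to_scaled(original, factor)
recovered = to_original(scaled, factor)
error = (original[0] - recovered[0], original[1] - recovered[1])

print(scaled)     # (12, 46)
print(recovered)  # (120, 460)
print(error)      # (3, -4)
```

The error comes from rounding to whole pixels in the small image and can be as large as half the downsampling factor along each axis, which is why it grows as the image is scaled down further.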
So SIFT fits a Taylor expansion of the Difference of Gaussian function, to try and locate the original interesting keypoint down to sub-pixel levels of accuracy; which you can then use to extrapolate back to the original image coordinates.
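To give an idea of what that fit does, here is a one-dimensional reduction of it. Lowe's method fits a quadratic in x, y and scale simultaneously; this sketch only fits a parabola through three neighbouring samples along a single axis and returns the offset of its extremum, which is minus the first derivative divided by the second. The function name subpixel_offset is hypothetical.

```python
def subpixel_offset(left, centre, right):
    """Offset of the extremum of a parabola through three equally spaced samples.

    The samples are taken at positions -1, 0 and +1 around a detected
    extremum. Derivatives are estimated with finite differences, and the
    returned offset is -f'(0) / f''(0), as in a second-order Taylor fit.
    """
    first = (right - left) / 2.0
    second = right - 2.0 * centre + left
    if second == 0.0:          # flat neighbourhood: nothing to refine
        return 0.0
    return -first / second

# Samples of a response whose true peak lies at x = 0.3.
def response(x):
    return -(x - 0.3) ** 2

offset = subpixel_offset(response(-1), response(0), response(1))
print(round(offset, 6))  # 0.3
```

Adding the offset to the integer pixel position before multiplying by the downsampling factor removes most of the rounding error of the naive approach.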
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math and C programming, and want to know specifically how SIFT is implemented, I suggest you dive into Rob Hess' SIFT implementation; lines 467 through 648 are probably the most detailed you can get.

SIFT descriptor for corner pixels in the image

I want to write a function for calculating SIFT descriptors that takes two arguments as input: the first is the image and the second is a vector of certain points in the image. The output would be rows of 128-dimensional SIFT descriptors, with each row corresponding to the descriptor of one of those points.
To calculate SIFT descriptors, one needs to crop a patch around the keypoint (say 32 x 32) and do some image processing (histograms of orientations and so on). My question is how to deal with the corner pixels in the image. For example, if I give pixel location (1,1) as input to the function, how do I crop the patch or compute the descriptor for such a location in the image?
As others pointed out in the comments, in image processing, it's a common practice to mirror the image at the boundaries, or sometimes, you can also pad zeros at the boundaries.
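A minimal sketch of the mirroring approach, assuming the image is a list of rows and that the edge pixel is repeated when reflecting (other reflection conventions exist). The names reflect_index and crop_patch are hypothetical.

```python
def reflect_index(i, n):
    """Map any integer index onto 0..n-1 by mirroring at the borders.

    The edge pixel is repeated, so -1 maps to 0 and n maps to n - 1.
    """
    period = 2 * n
    i %= period                      # Python's % is non-negative for n > 0
    return i if i < n else period - 1 - i

def crop_patch(image, cx, cy, size):
    """Crop a size x size patch around (cx, cy), mirroring outside the image."""
    height, width = len(image), len(image[0])
    half = size // 2
    return [[image[reflect_index(y, height)][reflect_index(x, width)]
             for x in range(cx - half, cx - half + size)]
            for y in range(cy - half, cy - half + size)]

# A 4x4 test image whose values encode (row, column) as 10 * row + column.
image = [[10 * r + c for c in range(4)] for r in range(4)]
for row in crop_patch(image, 0, 0, 4):   # patch around the top-left corner
    print(row)
```

Zero padding would instead return 0 for every coordinate outside the image; mirroring avoids creating an artificial strong edge at the border, which would otherwise dominate the gradient histograms.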
If the "vector of certain points" that you're referring to are the "keypoints", then you can use opencv to extract SIFT features, which returns 128 dimensional feature vector for each keypoint.
You do not have to compute the SIFT descriptors explicitly yourself, since OpenCV provides methods for that, as described in "SIFT descriptors".

What are keypoints in image processing?

When using OpenCV for example, algorithms like SIFT or SURF are often used to detect keypoints. My question is what actually are these keypoints?
I understand that they are some kind of "points of interest" in an image. I also know that they are scale invariant and are circular.
Also, I found out that they have orientation but I couldn't understand what this actually is. Is it an angle, but between the radius and something? Can you give some explanation? I think what I need first is something simpler, and after that it will be easier to understand the papers.
Let's tackle each point one by one:
My question is what actually are these keypoints?
Keypoints are the same thing as interest points. They are spatial locations, or points in the image that define what is interesting or what stands out in the image. Interest point detection is actually a subset of blob detection, which aims to find interesting regions or spatial areas in an image. The reason why keypoints are special is because no matter how the image changes... whether the image rotates, shrinks/expands, is translated (all of these would be an affine transformation by the way...) or is subject to distortion (i.e. a projective transformation or homography), you should be able to find the same keypoints in this modified image when comparing with the original image. Here's an example from a post I wrote a while ago:
Source: "'module' object has no attribute 'drawMatches' opencv python"
The image on the right is a rotated version of the left image. I've also only displayed the top 10 matches between the two images. If you take a look at the top 10 matches, these are points that we probably would want to focus on that would allow us to remember what the image was about. We would want to focus on the face of the cameraman as well as the camera, the tripod and some of the interesting textures on the buildings in the background. You see that these same points were found between the two images and these were successfully matched.
Therefore, what you should take away from this is that these are points in the image that are interesting and that they should be found no matter how the image is distorted.
I understand that they are some kind of "points of interest" of an image. I also know that they are scale invariant and I know they are circular.
You are correct. Scale invariant means that no matter how you scale the image, you should still be able to find those points.
Now we are going to venture into the descriptor part. What makes keypoints different between frameworks is the way you describe these keypoints. These are what are known as descriptors. Each keypoint that you detect has an associated descriptor that accompanies it. Some frameworks only do a keypoint detection, while other frameworks are simply a description framework and they don't detect the points. There are also some that do both - they detect and describe the keypoints. SIFT and SURF are examples of frameworks that both detect and describe the keypoints.
Descriptors are primarily concerned with both the scale and the orientation of the keypoint. We've nailed down the concept of keypoints, but we need the descriptor part if our purpose is to match keypoints between different images. Now, what you mean by "circular"... that correlates with the scale that the point was detected at. Take for example this image that is taken from the VLFeat Toolbox tutorial:
You see that any points that are yellow are interest points, but some of these points have a different circle radius. These deal with scale. How interest points work in a general sense is that we decompose the image into multiple scales. We check for interest points at each scale, and we combine all of these interest points together to create the final output. The larger the "circle", the larger the scale was that the point was detected at. Also, there is a line that radiates from the centre of the circle to the edge. This is the orientation of the keypoint, which we will cover next.
Also I found out that they have orientation but I couldn't understand what actually it is. It is an angle but between the radius and something?
Basically, if you want to detect keypoints regardless of scale and orientation, what they really mean by the orientation of a keypoint is that they search a pixel neighbourhood that surrounds the keypoint and figure out how this neighbourhood is oriented, or what direction this patch is oriented in. It depends on what descriptor framework you look at, but the general gist is to detect the most dominant orientation of the gradient angles in the patch. This is important for matching so that you can match keypoints together. Take a look at the first figure I have with the two cameramen - one rotated while the other isn't. If you take a look at some of those points, how do we figure out how one point matches with another? We can easily identify that the interest point at the top of the cameraman matches its rotated version because we take a look at the points that surround the keypoint and see what orientation all of these points are in... and from there, that's how the orientation is computed.
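The idea of a dominant gradient orientation can be sketched as a magnitude-weighted histogram of gradient angles over the patch. This is a simplified illustration and not the code of any particular framework: the function name dominant_orientation is hypothetical, 36 bins of 10 degrees follow the SIFT paper, and the Gaussian weighting and peak interpolation used by SIFT are left out.

```python
import math

def dominant_orientation(patch, bins=36):
    """Dominant gradient direction of a gray patch, in degrees.

    Builds a histogram of gradient angles weighted by gradient magnitude
    and returns the centre of the strongest bin. Angles use image
    coordinates (x to the right, y downwards).
    """
    n_rows, n_cols = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, n_rows - 1):
        for x in range(1, n_cols - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            hist[round(angle * bins / 360.0) % bins] += mag
    best = max(range(bins), key=hist.__getitem__)
    return best * 360.0 / bins

# Intensity increasing from top to bottom: the gradient points along +y.
vertical_ramp = [[float(y) for _ in range(16)] for y in range(16)]
print(dominant_orientation(vertical_ramp))  # 90.0
```

Rotating the patch rotates this angle by the same amount, so describing the neighbourhood relative to it is what lets descriptors of a rotated image line up with the original.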
Usually when we want to detect keypoints, we just take a look at the locations. However, if you want to match keypoints between images, then you definitely need the scale and the orientation to facilitate this.
I'm not as familiar with SURF, but I can tell you about SIFT, which SURF is based on. I provided a few notes about SURF at the end, but I don't know all the details.
SIFT aims to find highly-distinctive locations (or keypoints) in an image. The locations are not merely 2D locations on the image, but locations in the image's scale space, meaning they have three coordinates: x, y, and scale. The process for finding SIFT keypoints is:
1. blur and resample the image with different blur widths and sampling rates to create a scale-space
2. use the difference of gaussians method to detect blobs at different scales; the blob centers become our keypoints at a given x, y, and scale
3. assign every keypoint an orientation by calculating a histogram of gradient orientations for every pixel in its neighborhood and picking the orientation bin with the highest number of counts
4. assign every keypoint a 128-dimensional feature vector based on the gradient orientations of pixels in 16 local neighborhoods
Step 2 gives us scale invariance, step 3 gives us rotation invariance, and step 4 gives us a "fingerprint" of sorts that can be used to identify the keypoint. Together they can be used to match occurrences of the same feature at any orientation and scale in multiple images.
SURF aims to accomplish the same goals as SIFT but uses some clever tricks in order to increase speed.
For blob detection, it uses the determinant of Hessian method. The dominant orientation is found by examining the horizontal and vertical responses to Haar wavelets. The feature descriptor is similar to SIFT, looking at orientations of pixels in 16 local neighborhoods, but results in a 64-dimensional vector.
SURF features can be calculated up to 3 times faster than SIFT features, yet are just as robust in most situations.
For reference:
A good SIFT tutorial
An introduction to SURF

Whether the SIFT is rotation invariant feature or not opencv

I want to write code in OpenCV that shows whether or not SIFT is a rotation-invariant feature.
Assume that the image has one keypoint, which is the center of the image. I want to calculate the keypoint descriptor (magnitude and direction). I want to ask: what is the keypoint? Is it a location in the image?
I searched for simple tutorial or code to know what to do but I didn't find something simple.
A keypoint is an interesting point in your image. These points are usually found when you have a change in intensity, for example, at the edges between two objects in the image. A keypoint encodes, among other things, the location of the point in the image. SIFT will then extract a local feature descriptor for your keypoint which you can then use for image matching.
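Scale Invariant Feature Transform (SIFT) is scale invariant, as the name says. It is also designed to be rotation invariant, because each descriptor is computed relative to the keypoint's dominant gradient orientation, although in practice the invariance is only approximate. SURF is an alternative, but it is a little problematic for real-time applications.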
Example code: Trying to match two images using sift in OpenCv, but too many matches
To test your SIFT code out, you could create a black 512x512 image in OpenCV with three equally spaced white points along its width. Then rotate the image by small rotation angles, measure the angle, and check the feature matches. As you do this, you may find that for large rotations the feature matches are thrown off.
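To check the matches in such an experiment, it helps to know where each point should land after the rotation. The sketch below is pure geometry and makes several assumptions: the point positions (128, 256, 384 along the middle row), the rotation centre (256, 256) and the pixel tolerance are all hypothetical, and the sign of the angle must agree with the convention of whatever routine rotates the image.

```python
import math

def rotate_point(pt, centre, angle_deg):
    """Rotate pt about centre by angle_deg using the standard rotation matrix."""
    a = math.radians(angle_deg)
    dx, dy = pt[0] - centre[0], pt[1] - centre[1]
    return (centre[0] + dx * math.cos(a) - dy * math.sin(a),
            centre[1] + dx * math.sin(a) + dy * math.cos(a))

def match_is_consistent(pt_original, pt_matched, centre, angle_deg, tol=2.0):
    """True if the matched point lies within tol pixels of the expected position."""
    ex, ey = rotate_point(pt_original, centre, angle_deg)
    return math.hypot(pt_matched[0] - ex, pt_matched[1] - ey) <= tol

centre = (256.0, 256.0)
points = [(128.0, 256.0), (256.0, 256.0), (384.0, 256.0)]
for angle in (5, 45, 90):
    expected = [rotate_point(p, centre, angle) for p in points]
    print(angle, [(round(x, 1), round(y, 1)) for x, y in expected])
```

Counting how many matches pass match_is_consistent at each angle gives a simple measure of how well the descriptors survive the rotation.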

Confusion regarding Object recognition and features using SURF

I have some conceptual issues in understanding the SURF and SIFT algorithms (All about SURF). As far as my understanding goes, SURF finds Laplacian of Gaussians and SIFT operates on difference of Gaussians. It then constructs a 64-variable vector around it to extract the features. I have applied this CODE.
(Q1 ) So, what forms the features?
(Q2) We initialize the algorithm using SurfFeatureDetector detector(500). So, does this mean that the size of the feature space is 500?
(Q3) The output of SURF Good_Matches gives matches between Keypoint1 and Keypoint2 and by tuning the number of matches we can conclude that if the object has been found/detected or not. What is meant by KeyPoints ? Do these store the features ?
(Q4) I need to do object recognition application. In the code, it appears that the algorithm can recognize the book. So, it can be applied for object recognition. I was under the impression that SURF can be used to differentiate objects based on color and shape. But, SURF and SIFT find the corner edge detection, so there is no point in using color images as training samples since they will be converted to gray scale. There is no option of using colors or HSV in these algorithms, unless I compute the keypoints for each channel separately, which is a different area of research (Evaluating Color Descriptors for Object and Scene Recognition).
So, how can I detect and recognize objects based on their color and shape? I think I can use SURF for differentiating objects based on their shape. Say, for instance, I have 2 books and a bottle. I need to recognize only a single book out of all the objects. But, as soon as there are other similarly shaped objects in the scene, SURF gives lots of false positives. I would appreciate suggestions on what methods to apply for my application.
1. The local extrema (points where the response of the DoG is greater (or smaller) than the responses of the neighbouring pixels around the point and in the upper and lower images in the pyramid, i.e. a 3x3x3 neighbourhood) form the coordinates of the feature (circle) center. The radius of the circle is the level of the pyramid.
2. It is the Hessian threshold. It means that you take only the maxima (see 1) with values bigger than the threshold. A bigger threshold leads to fewer features, but the stability of the features is better, and vice versa.
3. Keypoint == feature. In OpenCV, KeyPoint is the structure used to store features.
4. No, SURF is good for comparing textured objects but not for shape and color. For shape, I recommend using MSER (but not the OpenCV one) or the Canny edge detector, not local features. This presentation might be useful.
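The first two points of the answer above (a 3x3x3 neighbourhood test followed by a threshold) can be sketched as follows. This is an illustration only and not OpenCV's detector: the function names are hypothetical, the response stack is assumed to hold one 2-D response map per scale level at the same resolution, and only maxima are handled, not minima.

```python
def is_local_maximum(stack, s, y, x, threshold):
    """True if stack[s][y][x] exceeds threshold and all 26 neighbours."""
    value = stack[s][y][x]
    if value <= threshold:
        return False
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (ds, dy, dx) == (0, 0, 0):
                    continue
                if stack[s + ds][y + dy][x + dx] >= value:
                    return False
    return True

def detect(stack, threshold):
    """Return (x, y, level) for every thresholded local maximum in the stack."""
    points = []
    for s in range(1, len(stack) - 1):
        for y in range(1, len(stack[s]) - 1):
            for x in range(1, len(stack[s][y]) - 1):
                if is_local_maximum(stack, s, y, x, threshold):
                    points.append((x, y, s))
    return points

# Three 5x5 response maps with a weak and a strong peak on the middle level.
stack = [[[0.0] * 5 for _ in range(5)] for _ in range(3)]
stack[1][1][1] = 100.0
stack[1][3][3] = 600.0

print(detect(stack, 50.0))   # [(1, 1, 1), (3, 3, 1)]
print(detect(stack, 500.0))  # [(3, 3, 1)]
```

Raising the threshold from 50 to 500 drops the weaker peak, which is the trade-off described in point 2: fewer features, but the ones that remain have stronger responses.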
