How does multiscale feature matching work? ORB, SIFT, etc - image-processing

When reading about classic computer vision I am confused on how multiscale feature matching works.
Suppose we use an image pyramid,
How do you deal with the same feature being detected at multiple scales? How do you decide which to make a deacriptor for?
How do you connected features between scales? For example let's say you have a feature detected and matched to a descriptor at scale .5. Is this location then translated to its location in the initial scale?

I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remains identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT descriptor as a circle of different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.
During descriptor matching, you simply brute-force compare your list of descriptors, regardless of what scale it was generated from, and try to find the closest match.
The whole point of SIFT as a function, is to take in some image feature under different transformations, and produce a similar numerical output at the end.
So if you do end up with multiple descriptors of the same feature, you'll just end up having to do more computational work, but you will still essentially match the same pair of feature across two images regardless.
Edit:
If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 on that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled down the image to 100x100 pixel, then the scaled keypoint coordinate would be something like (12,46). Extrapolating back to the original coordinates naively would give the coordinates (120,460).
So SIFT fits a Taylor expansion of the Difference of Gaussian function, to try and locate the original interesting keypoint down to sub-pixel levels of accuracy; which you can then use to extrapolate back to the original image coordinates.
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math, C programming, and want to know specifically how SIFT is implemented; I suggest you dive into Rob Hess' SIFT implementation, lines 467 through 648 is probably the most detailed you can get.

Related

What are keypoints in image processing?

When using OpenCV for example, algorithms like SIFT or SURF are often used to detect keypoints. My question is what actually are these keypoints?
I understand that they are some kind of "points of interest" in an image. I also know that they are scale invariant and are circular.
Also, I found out that they have orientation but I couldn't understand what this actually is. Is it an angle but between the radius and something? Can you give some explanation? I think I need what I need first is something simpler and after that it will be easier to understand the papers.
Let's tackle each point one by one:
My question is what actually are these keypoints?
Keypoints are the same thing as interest points. They are spatial locations, or points in the image that define what is interesting or what stand out in the image. Interest point detection is actually a subset of blob detection, which aims to find interesting regions or spatial areas in an image. The reason why keypoints are special is because no matter how the image changes... whether the image rotates, shrinks/expands, is translated (all of these would be an affine transformation by the way...) or is subject to distortion (i.e. a projective transformation or homography), you should be able to find the same keypoints in this modified image when comparing with the original image. Here's an example from a post I wrote a while ago:
Source: module' object has no attribute 'drawMatches' opencv python
The image on the right is a rotated version of the left image. I've also only displayed the top 10 matches between the two images. If you take a look at the top 10 matches, these are points that we probably would want to focus on that would allow us to remember what the image was about. We would want to focus on the face of the cameraman as well as the camera, the tripod and some of the interesting textures on the buildings in the background. You see that these same points were found between the two images and these were successfully matched.
Therefore, what you should take away from this is that these are points in the image that are interesting and that they should be found no matter how the image is distorted.
I understand that they are some kind of "points of interest" of an image. I also know that they are scale invariant and I know they are circular.
You are correct. Scale invariant means that no matter how you scale the image, you should still be able to find those points.
Now we are going to venture into the descriptor part. What makes keypoints different between frameworks is the way you describe these keypoints. These are what are known as descriptors. Each keypoint that you detect has an associated descriptor that accompanies it. Some frameworks only do a keypoint detection, while other frameworks are simply a description framework and they don't detect the points. There are also some that do both - they detect and describe the keypoints. SIFT and SURF are examples of frameworks that both detect and describe the keypoints.
Descriptors are primarily concerned with both the scale and the orientation of the keypoint. The keypoints we've nailed that concept down, but we need the descriptor part if it is our purpose to try and match between keypoints in different images. Now, what you mean by "circular"... that correlates with the scale that the point was detected at. Take for example this image that is taken from the VLFeat Toolbox tutorial:
You see that any points that are yellow are interest points, but some of these points have a different circle radius. These deal with scale. How interest points work in a general sense is that we decompose the image into multiple scales. We check for interest points at each scale, and we combine all of these interest points together to create the final output. The larger the "circle", the larger the scale was that the point was detected at. Also, there is a line that radiates from the centre of the circle to the edge. This is the orientation of the keypoint, which we will cover next.
Also I found out that they have orientation but I couldn't understand what actually it is. It is an angle but between the radius and something?
Basically if you want to detect keypoints regardless of scale and orientation, when they talk about orientation of keypoints, what they really mean is that they search a pixel neighbourhood that surrounds the keypoint and figure out how this pixel neighbourhood is oriented or what direction this patch is oriented in. It depends on what descriptor framework you look at, but the general jist is to detect the most dominant orientation of the gradient angles in the patch. This is important for matching so that you can match keypoints together. Take a look at the first figure I have with the two cameramen - one rotated while the other isn't. If you take a look at some of those points, how do we figure out how one point matches with another? We can easily identify that the top of the cameraman as an interest point matches with the rotated version because we take a look at points that surround the keypoint and see what orientation all of these points are in... and from there, that's how the orientation is computed.
Usually when we want to detect keypoints, we just take a look at the locations. However, if you want to match keypoints between images, then you definitely need the scale and the orientation to facilitate this.
I'm not as familiar with SURF, but I can tell you about SIFT, which SURF is based on. I provided a few notes about SURF at the end, but I don't know all the details.
SIFT aims to find highly-distinctive locations (or keypoints) in an image. The locations are not merely 2D locations on the image, but locations in the image's scale space, meaning they have three coordinates: x, y, and scale. The process for finding SIFT keypoints is:
blur and resample the image with different blur widths and sampling rates to create a scale-space
use the difference of gaussians method to detect blobs at different scales; the blob centers become our keypoints at a given x, y, and scale
assign every keypoint an orientation by calculating a histogram of gradient orientations for every pixel in its neighborhood and picking the orientation bin with the highest number of counts
assign every keypoint a 128-dimensional feature vector based on the gradient orientations of pixels in 16 local neighborhoods
Step 2 gives us scale invariance, step 3 gives us rotation invariance, and step 4 gives us a "fingerprint" of sorts that can be used to identify the keypoint. Together they can be used to match occurrences of the same feature at any orientation and scale in multiple images.
SURF aims to accomplish the same goals as SIFT but uses some clever tricks in order to increase speed.
For blob detection, it uses the determinant of Hessian method. The dominant orientation is found by examining the horizontal and vertical responses to Haar wavelets. The feature descriptor is similar to SIFT, looking at orientations of pixels in 16 local neighborhoods, but results in a 64-dimensional vector.
SURF features can be calculated up to 3 times faster than SIFT features, yet are just as robust in most situations.
For reference:
A good SIFT tutorial
An introduction to SURF

What is a feature descriptor in image processing (algorithm or description)?

I get often confused with the meaning of the term descriptor in the context of image features. Is a descriptor the description of the local neighborhood of a point (e.g. a float vector), or is a descriptor the algorithm that outputs the description? Also, what exactly is then the output of a feature-extractor?
I have been asking myself this question for a long time, and the only explanation I came up with is that a descriptor is both, the algorithm and the description. A feature detector is used to detect distinctive points. A feature-extractor, however, does then not seem to make any sense.
So, is a feature descriptor the description or the algorithm that produces the description?
A feature detector is an algorithm which takes an image and outputs locations (i.e. pixel coordinates) of significant areas in your image. An example of this is a corner detector, which outputs the locations of corners in your image but does not tell you any other information about the features detected.
A feature descriptor is an algorithm which takes an image and outputs feature descriptors/feature vectors. Feature descriptors encode interesting information into a series of numbers and act as a sort of numerical "fingerprint" that can be used to differentiate one feature from another. Ideally this information would be invariant under image transformation, so we can find the feature again even if the image is transformed in some way. An example would be SIFT, which encodes information about the local neighbourhood image gradients the numbers of the feature vector. Other examples you can read about are HOG and SURF.
EDIT: When it comes to feature detectors, the "location" might also include a number describing the size or scale of the feature. This is because things that look like corners when "zoomed in" may not look like corners when "zoomed out", and so specifying scale information is important. So instead of just using an (x,y) pair as a location in "image space", you might have a triple (x,y,scale) as location in "scale space".
For the descriptor, I understand as the description of the neighborhood of a point on the image. In other words, it is a vector in the image (descriptions of the visual features of the contents in images).
For example, there is method in the HOG (Histogram of Oriented Gradients) called Image Gradients and Spatial/Orientation Binning. The extractHOGFeatures in Matlab and Classification using HOG had visual examples for better understanding.

Comparing similar images as photographs -- detecting difference, image diff

The situation is kind of unique from anything I have been able to find asked already, and is as follows: If I took a photo of two similar images, I'd like to be able to highlight the differing features in the two images. For example the following two halves of a children's spot the difference game:
The differences in the images will be bits missing/added and/or colour change, and the type of differences which would be easily detectable from the original image files by doing nothing cleverer than a pixel-by-pixel comparison. However the fact that they're subject to the fluctuations of light and imprecision of photography, I'll need a far more lenient/clever algorithm.
As you can see, the images won't necessarily line up perfectly if overlaid.
This question is tagged language-agnostic as I expect answers that point me towards relevant algorithms, however I'd also be interested in current implementations if they exist, particularly in Java, Ruby, or C.
The following approach should work. All of these functionalities are available in OpenCV. Take a look at this example for computing homographies.
Detect keypoints in the two images using a corner detector.
Extract descriptors (SIFT/SURF) for the keypoints.
Match the keypoints and compute a homography using RANSAC, that aligns the second image to the first.
Apply the homography to the second image, so that it is aligned with the first.
Now simply compute the pixel-wise difference between the two images, and the difference image will highlight everything that has changed from the first to the second.
My general approach would be to use an optical flow to align both images and perform a pixel by pixel comparison once they are aligned.
However, for the specifics, standard optical flows (OpenCV etc.) are likely to fail if the two images differ significantly like in your case. If that indeed fails, there are recent optical flow techniques that are supposed to work even if the images are drastically different. For instance, you might want to look at the paper about SIFT flows by Ce Liu et al that addresses this problem with sparse correspondences.

Image processing - Match curves from one image to another

I am doing something similar to this problem:
Matching a curve pattern to the edges of an image
Basically, I have the same curve in two images, but with some affine transform between the two. Here is an example of two images:
Image1
Image2
So in order to get to Image2, you can apply some translation, rotation, scale, etc. to Image1.
Does anyone know how to solve for this transform?
Phase correlation doesn't work because it's not a translation only. Optical flow doesn't work since there's not enough detail to resolve translation, rotation, scale (It's pretty much a binary image). I'm not sure if Hough Transforms will give me good data.
I think some sort of keypoint matching algorithm like sift or surf would work with this kind of data as well.
The basic idea would be to find a limited number of "interesting" keypoints in each image, then match these keypoints pairwise.
Here is a quick test of your image with an online ASIFT demo:
http://demo.ipol.im/demo/my_affine_sift/result?key=BF9F4E4E006AB5168497709836C39C74#
It is probably more suited for normal greyscale images, but nevertheless it seems to work for this data. It looks like the lines connect roughly the same points around both of the curves; plugging all these pairs into something like the FindHomography function in OpenCv, the small discrepancies should even themselves out and you get the affine transformation matrix between the two images.
For your particular data you might be able to come up with better keypoint descriptors; perhaps something to detect the line ends, line crossings and sharp corners.
Or how about this: It is a little more work, but if you can vectorize your paths into a bezier or b-spline, you can get some natural keypoints from the spline descriptors.
I do not know any vectorisation library, but Inkscape has a basic implementation with which you could test the approach.
Once you have a small set of descriptors instead of a large 2d bitmap, you only need to match these descriptors between the two images, as per FindHomography.
answer to comment:
The points of interest are merely small areas that have certain properties. So the center of those areas might be black or white; the algorithm does not specifically look for white pixels or large-scale shapes such as the curve. What matter is that the lines connect roughly the same points on both curves, at least at first glance.

What does size and response exactly represent in a SURF keypoint?

I'm using OpenCV 2.3 for keypoints detection and matching. But I am a bit confused with the size and response parameters given by the detection algorithm. What do they exactly mean?
Based on the OpenCV manual, I can't figure it out:
float size: diameter of the meaningful keypoint neighborhood
float response: the response by which the most strong keypoints have
been selected. Can be used for further sorting or subsampling
I thought the best point to track would be the one with the highest response but it seems that it is not the case. So how could I subsample the set of key points returned by the surf detector to keep only the best one in term of trackability?
Size and response
SURF is a blob detector, in short, the size of a feature is the size of the blob. To be more precise, the returned size by OpenCV is half the length of the approximated Hessian operator. The size is also known as scale, this is due to the way the blob detectors work, i.e., being functionally equal to first blurring the image with the Gaussian filter at several scales and then downsampling the images and finally detecting blobs with a fixed size. See the image below showing the the size of the SURF features. The size of each feature is the radius of the drawn circle. The lines going out from the center of the features to the circumference show the angles or orientations. In this image, the response strength of the blob detection filter is color coded. You can see the majority of the detected features have a weak response. (see the full size image here)
This histogram shows the distribution of the response strengths of the features in the above image:
What features to track?
The most robust feature tracker tracks all the detected features. The more features the more robustness. But it's impractical to track a large number of features as often we want to limit the computation time. The number of features to track often should be empirically tuned for each application. Often the image is divided into regular sub-regions and in each one the n strongest features are kept to be tracked. n is usually chosen such that in total about 500~1000 features are detected per frame.
References
Reading the journal paper describing SURF definitely will give you a good idea of how it works. Just try not to get stuck in the details, specially if your background isn't in machine/computer vision or image processing. The SURF detector may seem extremely novel at the first glance but the whole idea is estimating the Hessian operator (a well established filter) using integral images (which have been used by other methods long before SURF). If you want to understand SURF very well and you're not familiar with image processing, you need to go back and read some introductory material. Recently I came across a new and free book, whose chapter 13 has a good and brief introduction to feature detection. Not everything said in there is technically correct, but it's a good starting point. Here you can find another good description of SURF with several images showing how each step works. On that page you see this image:
You can see the white and black blobs, these are the blobs that SURF detects at several scales and estimates their sizes (radius in the OpenCV code).
"size" is the size of the area covered by the descriptor in the original image (it is obtained by downsampling the original image in the scale space, hence it varies from key point to key point based on their scale).
"reponse" is indeed an indicator of "how good" (roughly speaking, in terms of corner-ness) a point is.
Good points are stable for static scene retrieval (this is the main purpose of SIFT/SURF descriptors). In the case of tracking, you can have good points appearing because the tracked object is on a well formed background, of half in the shadow... then disappearing because this condition has changed (change of light, occlusion...). So there is no guarantee for tracking tasks that a good point will always be there.

Resources