What does size and response exactly represent in a SURF keypoint? - opencv

I'm using OpenCV 2.3 for keypoints detection and matching. But I am a bit confused with the size and response parameters given by the detection algorithm. What do they exactly mean?
Based on the OpenCV manual, I can't figure it out:
float size: diameter of the meaningful keypoint neighborhood
float response: the response by which the most strong keypoints have
been selected. Can be used for further sorting or subsampling
I thought the best point to track would be the one with the highest response but it seems that it is not the case. So how could I subsample the set of key points returned by the surf detector to keep only the best one in term of trackability?

Size and response
SURF is a blob detector, in short, the size of a feature is the size of the blob. To be more precise, the returned size by OpenCV is half the length of the approximated Hessian operator. The size is also known as scale, this is due to the way the blob detectors work, i.e., being functionally equal to first blurring the image with the Gaussian filter at several scales and then downsampling the images and finally detecting blobs with a fixed size. See the image below showing the the size of the SURF features. The size of each feature is the radius of the drawn circle. The lines going out from the center of the features to the circumference show the angles or orientations. In this image, the response strength of the blob detection filter is color coded. You can see the majority of the detected features have a weak response. (see the full size image here)
This histogram shows the distribution of the response strengths of the features in the above image:
What features to track?
The most robust feature tracker tracks all the detected features. The more features the more robustness. But it's impractical to track a large number of features as often we want to limit the computation time. The number of features to track often should be empirically tuned for each application. Often the image is divided into regular sub-regions and in each one the n strongest features are kept to be tracked. n is usually chosen such that in total about 500~1000 features are detected per frame.
References
Reading the journal paper describing SURF definitely will give you a good idea of how it works. Just try not to get stuck in the details, specially if your background isn't in machine/computer vision or image processing. The SURF detector may seem extremely novel at the first glance but the whole idea is estimating the Hessian operator (a well established filter) using integral images (which have been used by other methods long before SURF). If you want to understand SURF very well and you're not familiar with image processing, you need to go back and read some introductory material. Recently I came across a new and free book, whose chapter 13 has a good and brief introduction to feature detection. Not everything said in there is technically correct, but it's a good starting point. Here you can find another good description of SURF with several images showing how each step works. On that page you see this image:
You can see the white and black blobs, these are the blobs that SURF detects at several scales and estimates their sizes (radius in the OpenCV code).

"size" is the size of the area covered by the descriptor in the original image (it is obtained by downsampling the original image in the scale space, hence it varies from key point to key point based on their scale).
"reponse" is indeed an indicator of "how good" (roughly speaking, in terms of corner-ness) a point is.
Good points are stable for static scene retrieval (this is the main purpose of SIFT/SURF descriptors). In the case of tracking, you can have good points appearing because the tracked object is on a well formed background, of half in the shadow... then disappearing because this condition has changed (change of light, occlusion...). So there is no guarantee for tracking tasks that a good point will always be there.

Related

How does multiscale feature matching work? ORB, SIFT, etc

When reading about classic computer vision I am confused on how multiscale feature matching works.
Suppose we use an image pyramid,
How do you deal with the same feature being detected at multiple scales? How do you decide which to make a deacriptor for?
How do you connected features between scales? For example let's say you have a feature detected and matched to a descriptor at scale .5. Is this location then translated to its location in the initial scale?
I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remains identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT descriptor as a circle of different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.
During descriptor matching, you simply brute-force compare your list of descriptors, regardless of what scale it was generated from, and try to find the closest match.
The whole point of SIFT as a function, is to take in some image feature under different transformations, and produce a similar numerical output at the end.
So if you do end up with multiple descriptors of the same feature, you'll just end up having to do more computational work, but you will still essentially match the same pair of feature across two images regardless.
Edit:
If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 on that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled down the image to 100x100 pixel, then the scaled keypoint coordinate would be something like (12,46). Extrapolating back to the original coordinates naively would give the coordinates (120,460).
So SIFT fits a Taylor expansion of the Difference of Gaussian function, to try and locate the original interesting keypoint down to sub-pixel levels of accuracy; which you can then use to extrapolate back to the original image coordinates.
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math, C programming, and want to know specifically how SIFT is implemented; I suggest you dive into Rob Hess' SIFT implementation, lines 467 through 648 is probably the most detailed you can get.

Image segmentation with maxflow

I have to do a foreground/background segmentation using maxflow algorithm in C++. (http://wiki.icub.org/iCub/contrib/dox/html/poeticon_2src_2objSeg_2src_2maxflow-v3_802_2maxflow_8cpp_source.html). I get an array of pixels from a png file according to their RBG but what are the next steps. How could I use this algorithm for my problem?
I recognize that source very well. That's the Boykov-Kolmogorov Graph Cuts library. What I would recommend you do first is read their paper.
Graph Cuts is an interactive image segmentation algorithm. You mark pixels in your image on what you believe belong to the object (a.k.a. foreground) and what don't belong to the object (a.k.a the background). That's what you need first. Once you do this, the Graph Cuts algorithm best guesses what the labels of the other pixels are in the image. It basically goes through each of the other pixels that are not labeled and figures out whether or not they belong to foreground and background.
The whole premise behind Graph Cuts is that image segmentation is akin to energy minimization. Image segmentation can be formulated as a cost function with a summation of two terms:
Self-Penalty: This is the cost of assigning each pixel as either foreground or background. This is also known as a data cost.
Neighbouring Penalties: This enforces that neighbouring pixels more or less should share the same classification label. This is also known as a smoothness cost.
This kind of formulation is well known as the Maximum A Posteriori Markov Random Field classification problem (MAP-MRF). The goal is to minimize that cost function so that you achieve the best image segmentation possible. This is actually an NP-Hard problem, and is actually one of the problems that is up for money from the Clay Math Institute.
Boykov and Kolmogorov theoretically proved that the MAP-MRF problem can be translated into graph theory, and solving the MAP-MRF problem is akin to taking your image and forming it into a graph with source and sink links, as well as links that connect neighbouring pixels together. To solve the MAP-MRF, you perform the maximum-flow/minimum-cut algorithm. There are many ways to do this, but Boykov / Kolmogorov find a more efficient way that is much faster than more established algorithms, such as Push-Relabel, Ford-Fulkenson, etc.
The self penalties are what are known as t links, while the neighbouring penalties are what are known as n links. You should read up the paper to figure out how these are computed, but the t links describe the classification penalty. Basically, how much it would cost to classify each pixel as belonging to the foreground or the background. These are usually based on the negative log probability distributions of the image. What you do is you create a histogram of the distribution of what was classified as foreground and a histogram of what was classified as background.
Usually, a uniform quanitization of each colour channel for both foreground and background suffices. You then turn these into PDFs but dividing by the total number of elements in each histogram, then when you calculate the t-links for each pixel, you access the colour, then see where it lies in the histogram, then take the negative log. This will tell you how much it will cost to classify that pixel to be either foreground or background.
The neighbouring pixel costs are more intuitive. People usually just take the Euclidean distance between one pixel and a neighbouring pixel and apply this distance to a Gaussian. To make things simple, a 4 pixel neighbourhood is what is usually used (North, South, East and West).
Once you figure out how to compute the cost, you follow this procedure:
Mark pixels as foreground or background.
Create a graph structure using their library
Compute the histograms of the foreground and background pixels
Calculate t-links and add to the graph
Calculate n-links and add to the graph
Invoke the maxflow routine on the graph to segment the image
Go through each pixel and figure out whether or not the pixel belongs to foreground or background.
Create a binary map that reflects this, then copy over image pixels where the binary map is true, and don't do this when it's false.
The original source of maxflow can be found here: http://pub.ist.ac.at/~vnk/software/maxflow-v3.03.src.zip
It also has a README so you can see how the library is supposed to work given some example images.
You have a lot to digest, but Graph Cuts is one of the most powerful interactive segmentation tools out there.
Good luck!

Water Edge Detection

Is there a robust way to detect the water line, like the edge of a river in this image, in OpenCV?
(source: pequannockriver.org)
This task is challenging because a combination of techniques must be used. Furthermore, for each technique, the numerical parameters may only work correctly for a very narrow range. This means either a human expert must tune them by trial-and-error for each image, or that the technique must be executed many times with many different parameters, in order for the correct result to be selected.
The following outline is highly-specific to this sample image. It might not work with any other images.
One bit of advice: As usual, any multi-step image analysis should always begin with the most reliable step, and then proceed down to the less reliable steps. Whenever possible, the less reliable step should make use of the result of more-reliable steps to augment its own accuracy.
Detection of sky
Convert image to HSV colorspace, and find the cyan located at the upper-half of the image.
Keep this HSV image, becuase it could be handy for the next few steps as well.
Detection of shrubs
Run Canny edge detection on the grayscale version of image, with suitably chosen sigma and thresholds. This will pick up the branches on the shrubs, which would look like a bunch of noise. Meanwhile, the water surface would be relatively smooth.
Grayscale is used in this technique in order to reduce the influence of reflections on the water surface (the green and yellow reflections from the shrubs). There might be other colorspaces (or preprocessing techniques) more capable of removing that reflection.
Detection of water ripples from a lower elevation angle viewpoint
Firstly, mark off any image parts that are already classified as shrubs or sky. Since shrub detection would be more reliable than water detection, shrub detection's result should be used to inform the less-reliable water detection.
Observation
Because of the low elevation angle viewpoint, the water ripples appear horizontally elongated. In fact, every image feature appears stretched horizontally. This is called Anisotropy. We could make use of this tendency to detect them.
Note: I am not experienced in anisotropy detection. Perhaps you can get better ideas from other people.
Idea 1:
Use maximally-stable extremal regions (MSER) as a blob detector.
The Wikipedia introduction appears intimidating, but it is really related to connected-component algorithms. A naive implementation can be done similar to Dijkstra's algorithm.
Idea 2:
Notice that the image features are horizontally stretched, a simpler approach is to just sum up the absolute values of horizontal gradients and compare that to the sum of absolute values of vertical gradients.

A suitable workflow to detect and classify blurs in images? [duplicate]

I had asked this on photo stackexchange but thought it might be relevant here as well, since I want to implement this programatically in my implementation.
I am trying to implement a blur detection algorithm for my imaging pipeline. The blur that I want to detect is both -
1) Camera Shake: Pictures captured using hand which moves/shakes when shutter speed is less.
2) Lens focussing errors - (Depth of Field) issues, like focussing on a incorrect object causing some blur.
3) Motion blur: Fast moving objects in the scene, captured using a not high enough shutter speed. E.g. A moving car a night might show a trail of its headlight/tail light in the image as a blur.
How can one detect this blur and quantify it in some way to make some decision based on that computed 'blur metric'?
What is the theory behind blur detection?
I am looking of good reading material using which I can implement some algorithm for this in C/Matlab.
thank you.
-AD.
Motion blur and camera shake are kind of the same thing when you think about the cause: relative motion of the camera and the object. You mention slow shutter speed -- it is a culprit in both cases.
Focus misses are subjective as they depend on the intent on the photographer. Without knowing what the photographer wanted to focus on, it's impossible to achieve this. And even if you do know what you wanted to focus on, it still wouldn't be trivial.
With that dose of realism aside, let me reassure you that blur detection is actually a very active research field, and there are already a few metrics that you can try out on your images. Here are some that I've used recently:
Edge width. Basically, perform edge detection on your image (using Canny or otherwise) and then measure the width of the edges. Blurry images will have wider edges that are more spread out. Sharper images will have thinner edges. Google for "A no-reference perceptual blur metric" by Marziliano -- it's a famous paper that describes this approach well enough for a full implementation. If you're dealing with motion blur, then the edges will be blurred (wide) in the direction of the motion.
Presence of fine detail. Have a look at my answer to this question (the edited part).
Frequency domain approaches. Taking the histogram of the DCT coefficients of the image (assuming you're working with JPEG) would give you an idea of how much fine detail the image has. This is how you grab the DCT coefficients from a JPEG file directly. If the count for the non-DC terms is low, it is likely that the image is blurry. This is the simplest way -- there are more sophisticated approaches in the frequency domain.
There are more, but I feel that that should be enough to get you started. If you require further info on either of those points, fire up Google Scholar and look around. In particular, check out the references of Marziliano's paper to get an idea about what has been tried in the past.
There is a great paper called : "analysis of focus measure operators for shape-from-focus" (https://www.researchgate.net/publication/234073157_Analysis_of_focus_measure_operators_in_shape-from-focus) , which does a comparison about 30 different techniques.
Out of all the different techniques, the "Laplacian" based methods seem to have the best performance. Most image processing programs like : MATLAB or OPENCV have already implemented this method . Below is an example using OpenCV : http://www.pyimagesearch.com/2015/09/07/blur-detection-with-opencv/
One important point to note here is that an image can have some blurry areas and some sharp areas. For example, if an image contains portrait photography, the image in the foreground is sharp whereas the background is blurry. In sports photography, the object in focus is sharp and the background usually has motion blur. One way to detect such a spatially varying blur in an image is to run a frequency domain analysis at every location in the image. One of the papers which addresses this topic is "Spatially-Varying Blur Detection Based on Multiscale Fused and Sorted Transform Coefficients of Gradient Magnitudes" (cvpr2017).
the authors look at multi resolution DCT coefficients at every pixel. These DCT coefficients are divided into low, medium, and high frequency bands, out of which only the high frequency coefficients are selected.
The DCT coefficients are then fused together and sorted to form the multiscale-fused and sorted high-frequency transform coefficients
A subset of these coefficients are selected. the number of selected coefficients is a tunable parameter which is application specific.
The selected subset of coefficients are then sent through a max pooling block to retain the highest activation within all the scales. This gives the blur map as the output, which is then sent through a post processing step to refine the map.
This blur map can be used to quantify the sharpness in various regions of the image. In order to get a single global metric to quantify the bluriness of the entire image, the mean of this blur map or the histogram of this blur map can be used
Here are some examples results on how the algorithm performs:
The sharp regions in the image have a high intensity in the blur_map, whereas blurry regions have a low intensity.
The github link to the project is: https://github.com/Utkarsh-Deshmukh/Spatially-Varying-Blur-Detection-python
The python implementation of this algorithm can be found on pypi which can easily be installed as shown below:
pip install blur_detector
A sample code snippet to generate the blur map is as follows:
import blur_detector
import cv2
if __name__ == '__main__':
img = cv2.imread('image_name', 0)
blur_map = blur_detector.detectBlur(img, downsampling_factor=4, num_scales=4, scale_start=2, num_iterations_RF_filter=3)
cv2.imshow('ori_img', img)
cv2.imshow('blur_map', blur_map)
cv2.waitKey(0)
For detecting blurry images, you can tweak the approach and add "Region of Interest estimation".
In this github link: https://github.com/Utkarsh-Deshmukh/Blurry-Image-Detector , I have used local entropy filters to estimate a region of interest. In this ROI, I then use DCT coefficients as feature extractors and train a simple multi-layer perceptron. On testing this approach on 20000 images in the "BSD-B" dataset (http://cg.postech.ac.kr/research/realblur/) I got an average accuracy of 94%
Just to add on the focussing errors, these may be detected by comparing the psf of the captured blurry images (wider) with reference ones (sharper). Deconvolution techniques may help correcting them but leaving artificial errors (shadows, rippling, ...). A light field camera can help refocusing to any depth planes since it captures the angular information besides the traditional spatial ones of the scene.

EMGU OpenCV disparity only on certain pixels

I'm using the EMGU OpenCV wrapper for c#. I've got a disparity map being created nicely. However for my specific application I only need the disparity values of very few pixels, and I need them in real time. The calculation is taking about 100 ms now, I imagine that by getting disparity for hundreds of pixel values rather than thousands things would speed up considerably. I don't know much about what's going on "under the hood" of the stereo solver code, is there a way to speed things up by only calculating the disparity for the pixels that I need?
First of all, you fail to mention what you are really trying to accomplish, and moreover, what algorithm you are using. E.g. StereoGC is a really slow (i.e. not real-time), but usually far more accurate) compared to both StereoSGBM and StereoBM. Those last two can be used real-time, providing a few conditions are met:
The size of the input images is reasonably small;
You are not using an extravagant set of parameters (for instance, a larger value for numberOfDisparities will increase computation time).
Don't expect miracles when it comes to accuracy though.
Apart from that, there is the issue of "just a few pixels". As far as I understand, the algorithms implemented in OpenCV usually rely on information from more than 1 pixel to determine the disparity value. E.g. it needs a neighborhood to detect which pixel from image A map to which pixel in image B. As a result, in general it is not possible to just discard every other pixel of the image (by the way, if you already know the locations in both images, you would not need the stereo methods at all). So unless you can discard a large border of your input images for which you know that you'll never find your pixels of interest there, I'd say the answer to this part of your question would be "no".
If you happen to know that your pixels of interest will always be within a certain rectangle of the input images, you can specify the input image ROIs (regions of interest) to this rectangle. Assuming OpenCV does not contain a bug here this should speedup the computation a little.
With a bit of googling you can to find real-time examples of finding stereo correspondences using EmguCV (or plain OpenCV) using the GPU on Youtube. Maybe this could help you.
Disclaimer: this may have been a more complete answer if your question contained more detail.

Resources