I'm trying to use OpenCV to detect and extract ORB features from images.
However, the images I'm getting are not normalized (different sizes, different resolutions, etc.).
I was wondering if I need to normalize my images before extracting ORB features to be able to match them across images?
I know the feature detection is scale invariant, but I'm not sure what that means for image resolution. For example, two images of the same size, one with an object close to the camera and one with it far away, should produce a match even though the object appears at a different scale. But what if the images themselves don't have the same size?
Should I adapt ORB's patchSize to the image size? For example, if I use a patchSize of 20px for an 800px image, should I use a patchSize of 10px for a 400px image?
Thank you.
Update:
I tested different algorithms (ORB, SURF and SIFT) with high and low resolution images to see how they behave. In this image, objects are the same size, but image resolution is different:
We can see that SIFT is pretty stable, although it finds few features. SURF is also pretty stable in terms of keypoints and feature scale. So my guess is that feature matching between low-res and high-res images would work with SIFT and SURF, but ORB produces much larger features in the low-res image, so its descriptors won't match those in the high-res image.
(The same parameters were used for high-res and low-res feature extraction.)
So my guess is that it would be better to use SIFT or SURF if we want to match images with different resolutions.
According to the OpenCV documentation, ORB also uses a pyramid to produce multiscale features, although the details are not clear on that page.
If we look at the ORB paper itself, section 6.1 mentions that images at five different scales are used. But that still doesn't tell us whether you need to compute descriptors on differently scaled images manually, or whether this is already implemented in OpenCV's ORB.
Finally, from the source code (line 1063 as I write this answer) we see that images at different resolutions are computed for keypoint/descriptor extraction. If you track the variables, you see that the ORB class has a scale factor, which you can read with the getScaleFactor method.
In short, ORB already extracts keypoints and descriptors at multiple scales itself.
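To get a feel for what that pyramid covers, here is a rough sketch that lists the scale and approximate image size of each level. It assumes OpenCV's documented ORB defaults (scaleFactor=1.2, nlevels=8); the helper name orb_pyramid_levels and the rounding are my own illustration, not OpenCV's exact code.

```python
def orb_pyramid_levels(width, height, scale_factor=1.2, n_levels=8):
    """Return (level, scale, (w, h)) for each level of an ORB-style pyramid.

    Hypothetical helper: level k is assumed to be the input image
    shrunk by scale_factor ** k.
    """
    levels = []
    for level in range(n_levels):
        scale = scale_factor ** level
        size = (round(width / scale), round(height / scale))
        levels.append((level, scale, size))
    return levels


levels = orb_pyramid_levels(800, 600)
for level, scale, (w, h) in levels:
    print(f"level {level}: scale {scale:.3f}, about {w}x{h} px")
```

With these defaults the coarsest level is about 3.6 times smaller than the input, which gives an idea of the range of scale change a single ORB run spans.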
When reading about classic computer vision I am confused on how multiscale feature matching works.
Suppose we use an image pyramid,
1) How do you deal with the same feature being detected at multiple scales? How do you decide which one to make a descriptor for?
2) How do you connect features between scales? For example, say you have a feature that was detected and matched to a descriptor at scale 0.5. Is its location then translated back to the corresponding location at the initial scale?
I can share something about SIFT that might answer question (1) for you.
I'm not really sure what you mean in your question (2) though, so please clarify?
SIFT (Scale-Invariant Feature Transform) was designed specifically to find features that remain identifiable across different image scales, rotations, and transformations.
When you run SIFT on an image of some object (e.g. a car), SIFT will try to create the same descriptor for the same feature (e.g. the license plate), no matter what image transformation you apply.
Ideally, SIFT will only produce a single descriptor for each feature in an image.
However, this obviously doesn't always happen in practice, as you can see in an OpenCV example here:
OpenCV illustrates each SIFT descriptor as a circle of different size. You can see many cases where the circles overlap. I assume this is what you meant in question (1) by "the same feature being detected at multiple scales".
And to my knowledge, SIFT doesn't really care about this issue. If by scaling the image enough you end up creating multiple descriptors from "the same feature", then those are distinct descriptors to SIFT.
During descriptor matching, you simply brute-force compare your list of descriptors, regardless of what scale it was generated from, and try to find the closest match.
The whole point of SIFT, viewed as a function, is to take in an image feature under different transformations and produce a similar numerical output each time.
So if you do end up with multiple descriptors of the same feature, you'll just have to do more computational work, but you will still essentially match the same pair of features across the two images.
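A minimal sketch of that brute-force comparison, in plain Python with toy 4-dimensional descriptors (real SIFT descriptors have 128 dimensions). The function name brute_force_match and the data are made up for illustration; in OpenCV you would normally let cv2.BFMatcher do this work.

```python
import math


def brute_force_match(query, train):
    """For each query descriptor, find the nearest train descriptor.

    Returns a list of (query_index, train_index, distance) tuples.
    The scale a descriptor came from plays no role here.
    """
    matches = []
    for qi, q in enumerate(query):
        best_ti, best_dist = -1, math.inf
        for ti, t in enumerate(train):
            dist = math.dist(q, t)  # Euclidean (L2) distance
            if dist < best_dist:
                best_ti, best_dist = ti, dist
        matches.append((qi, best_ti, best_dist))
    return matches


# Toy descriptors standing in for the output of two images.
image_a = [(0.9, 0.1, 0.0, 0.2), (0.1, 0.8, 0.7, 0.0)]
image_b = [(0.0, 0.9, 0.6, 0.1), (0.5, 0.5, 0.5, 0.5), (1.0, 0.0, 0.1, 0.2)]

matches = brute_force_match(image_a, image_b)
for qi, ti, dist in matches:
    print(f"descriptor {qi} in A matches descriptor {ti} in B (distance {dist:.3f})")
```

Duplicate descriptors of the same feature only add rows to these lists, so the cost grows but the nearest neighbour of each descriptor is still found.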
Edit:
If you are asking about how to convert coordinates from the scaled images in the image pyramid back into original image coordinates, then David Lowe's SIFT paper dedicates section 4 to that topic.
The naive approach would be to simply calculate the ratios of the scaled coordinates vs the scaled image dimensions, then extrapolate back to the original image coordinates and dimensions. However, this is inaccurate, and becomes increasingly so as you scale down an image.
Example: You start with a 1000x1000 pixel image, where a feature is located at coordinates (123,456). If you had scaled down the image to 100x100 pixel, then the scaled keypoint coordinate would be something like (12,46). Extrapolating back to the original coordinates naively would give the coordinates (120,460).
So SIFT fits a Taylor expansion of the Difference of Gaussian function, to try and locate the original interesting keypoint down to sub-pixel levels of accuracy; which you can then use to extrapolate back to the original image coordinates.
Unfortunately, the math for this part is quite beyond me. But if you are fluent in math and C programming and want to know specifically how SIFT is implemented, I suggest you dive into Rob Hess' SIFT implementation; lines 467 through 648 are probably the most detailed you can get.
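To make the naive round trip from the 1000x1000 example concrete, here is a small sketch. It only reproduces the rounding error of the naive mapping; it does not implement SIFT's sub-pixel refinement, and the helper names are my own.

```python
def to_scaled(point, scale):
    """Map original-image coordinates to integer pixel coordinates in the scaled image."""
    x, y = point
    return (round(x * scale), round(y * scale))


def to_original(point, scale):
    """Naively map scaled-image pixel coordinates back to the original image."""
    x, y = point
    return (round(x / scale), round(y / scale))


original = (123, 456)
scale = 100 / 1000  # 1000x1000 image scaled down to 100x100

scaled = to_scaled(original, scale)
recovered = to_original(scaled, scale)
error = (recovered[0] - original[0], recovered[1] - original[1])

print(scaled, recovered, error)
```

The recovered point is a few pixels away from where the feature really is, and the coarser the pyramid level, the larger that error can get.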
Given an image of many items, with the bounding boxes of all items known in pixel coordinates, I am trying to extract a region around each item and compute its keypoints and descriptors using AKAZE, so that the items can be compared with one another.
However I realised that this might be too slow, since it involves:
1) cropping each and every single item to generate many images then,
2) detecting and computing on each image to generate the keypoints and descriptors.
Alternatively, to speed things up, I was thinking of:
1) Resizing the entire image, then detecting and computing the keypoints once.
2) Then, to obtain the keypoints of a particular object, simply retrieving the precalculated keypoints that fall within the object's location.
My question is whether this method is functionally sound, and whether it has any consequences.
Yes, this second strategy is a fine way to go. To do this efficiently you should supply a mask argument in the call to OpenCV's detectAndCompute (or detect if you're using that). Your mask should be the same size as your image. Each mask pixel should be zero if it does not lie within at least one detection region, and positive otherwise (255 for a uchar mask).
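Once the keypoints have been computed on the full image, each one still has to be assigned to its item. A minimal sketch of that lookup in plain Python follows; the points stand in for the pt attribute of OpenCV KeyPoint objects, and the box names, coordinates, and helper name are invented for illustration.

```python
def keypoints_in_box(points, box):
    """Return the indices of the points that lie inside box = (x0, y0, x1, y1).

    The box is treated as half-open: x0 <= x < x1 and y0 <= y < y1.
    """
    x0, y0, x1, y1 = box
    return [i for i, (x, y) in enumerate(points)
            if x0 <= x < x1 and y0 <= y < y1]


# Hypothetical keypoint locations (x, y) detected once on the whole image.
points = [(12.5, 40.0), (130.2, 55.7), (128.0, 300.9), (15.0, 48.3)]

# Hypothetical bounding boxes of two items.
boxes = {"item_a": (0, 0, 100, 100), "item_b": (100, 0, 200, 100)}

per_item = {name: keypoints_in_box(points, box) for name, box in boxes.items()}
print(per_item)
```

Because detectAndCompute returns one descriptor row per keypoint in the same order, the same indices can be used to pick each item's descriptors.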
In fact with the first strategy you can have a problem at the borders of your detection regions, where feature points can be missed. This is because feature detection and descriptor computation require processing a small window of pixels around each pixel (which are not available at the borders). To correctly handle this you would need to enlarge the detection regions before cropping.
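A small sketch of that enlargement, clamped to the image bounds. The margin of 20 pixels is an arbitrary assumption; a suitable value depends on the window size your detector and descriptor actually need.

```python
def expand_box(box, margin, image_width, image_height):
    """Grow box = (x0, y0, x1, y1) by margin pixels on every side, staying inside the image."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(image_width, x1 + margin),
        min(image_height, y1 + margin),
    )


# Hypothetical 640x480 image and a box close to the left border.
padded = expand_box((10, 50, 110, 150), margin=20, image_width=640, image_height=480)
print(padded)
```

After detection on the padded crop you would keep only the keypoints that fall inside the original box.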
Concerning efficiency, you should be aware that there is an overhead with the second approach, which is that the full image will undergo some image pre-processing before feature detection. For AKAZE this is nonlinear diffusion, and for others such as SIFT and SURF it is image convolution. These are needed to build so-called image pyramids. In situations where you only have a few detections, the first strategy can be more efficient (the overhead of image cropping is tiny relative to the image pre-processing).
I'm using a simple neural network (similar to AlexNet) to classify images into categories. As a preprocessing stage, input images are resized to 256x256 before being fed into the network.
Lately, I have run into the following problem: Many of the images I deal with are of very high resolution (say, 2000x2000). In this case, doing a "hard resize" results in a severe loss of information. For example, a small 100x100 face, easily recognisable in the original image, would be unrecognisable in the resized version. In such cases, I may prefer taking several crops of the 2000x2000 image and run the classification on each crop.
I'm looking for a method to automatically determine which type of pre-processing is most adequate. Ideally, it would be able to recognize, for example, that a high resolution image of a single face should be resized, whereas a high resolution image of a crowd should be cropped several times. The basic requirements, on my part:
As computationally efficient as possible. Hence, something like a "sliding window" would probably be ruled out (it is computationally cheaper to just crop all the images).
Ability to balance between recall and precision
What I considered thus far:
"Low-level" (image processing) approach: Implement an algorithm that uses local image information (like gradients) to distinguish between high resolution and low resolution images.
"High-level" (semantic) approach: Run the images through a pre-trained network for segmentation of some sort, and use its output to determine the appropriate pre-processing.
I want to try the first option first, but I'm not exactly sure how to go about it. Is there anything I can do in the Fourier domain? Something in OpenCV I can try? Does anyone have any suggestions/thoughts? Other ideas would be very welcome too. Thanks!
I am working on a project that identifies objects from images captured on the Android platform. For this, I extracted features from sample images, such as compactness, rectangularity, elongation, eccentricity, roundness, sphericity, lobation, and Hu moments. A random tree is then used as the classifier. Because I built my classifier from pictures gathered from Google, which are not high resolution, captured images of size 1280x720 give 19/20 correct results when the image is cropped.
However, when I capture larger images, of about 5 megapixels, and crop them for identification, the number of correct results decreases dramatically.
Do I need to extract features from high-resolution images and train on them in order to get accurate results when high-resolution pictures are captured? Is there a way of adjusting the extracted features according to the image resolution?
Some feature descriptors are sensitive to scaling. Others, like SIFT and SURF, are not. If you expect the resolution (or scale) of your images to change, it's best to use scale-invariant feature descriptors.
If you use feature descriptors that are not scale-invariant, you can still get decent results by normalizing the resolution of your images. Try scaling the 5 megapixel images to 1280x720 -- do the classification results improve?
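A small sketch of the size computation for that experiment, keeping the aspect ratio. The 2592x1944 capture size is an assumption (a common 5 megapixel format, not stated in the question), and fit_within is a made-up helper; the resulting size is what you would pass to OpenCV's resize.

```python
def fit_within(width, height, max_width, max_height):
    """Return ((new_width, new_height), scale) so the image fits in the target box.

    The aspect ratio is preserved and images are never enlarged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return (round(width * scale), round(height * scale)), scale


# Assumed 5 megapixel capture, scaled to fit the 1280x720 training resolution.
new_size, scale = fit_within(2592, 1944, 1280, 720)
print(new_size, round(scale, 3))
```

A 4:3 capture does not fill a 16:9 target, so the result is narrower than 1280 pixels rather than stretched.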
What's the most sensible algorithm, or combination of algorithms, to be using from OpenCV for the following problem:
I have a set of small 2D images. I want to detect the locations of these subimages in a larger image.
The subimages are usually around 32x32 pixels, and the larger image is around 400x400.
The subimages are not always square, and as such contain an alpha channel.
Optionally - the larger image may be grainy, compressed, rotated in 3D, or otherwise slightly distorted
I have tried cvMatchTemplate, with very poor results (difficult to match correctly, and large numbers of false positives, with all match methods). Some of the problems come from the fact OpenCV can't seem to deal with alpha channel template matching.
I have tried a manual search, which seems to work better, and can include the alpha channel, but is very slow.
Thanks for any help.
cvMatchTemplate uses an MSE-like metric (SQDIFF/SQDIFF_NORMED) for the matching. This kind of metric will penalize different alpha values severely (due to the square in the equation). Have you tried normalized cross-correlation? It is known to handle linear variations in pixel intensities better.
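To illustrate the difference, here is a toy comparison of the two metrics on flattened pixel lists, in plain Python. It uses the zero-mean variant of NCC, which as far as I know corresponds to the TM_CCOEFF_NORMED method in OpenCV; the pixel values are invented.

```python
import math


def ssd(a, b):
    """Sum of squared differences (the SQDIFF idea)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def zero_mean_ncc(a, b):
    """Zero-mean normalized cross-correlation, in the range [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # flat patch: correlation is undefined, report 0
    return sum(x * y for x, y in zip(da, db)) / denom


template = [10, 50, 90, 30, 70, 20]
brighter = [2 * v + 15 for v in template]  # same pattern, gain 2 and offset 15
shuffled = [90, 10, 20, 70, 30, 50]        # same values, different pattern

print(ssd(template, brighter), zero_mean_ncc(template, brighter))
print(ssd(template, shuffled), zero_mean_ncc(template, shuffled))
```

The squared-difference score of the brightened copy is large even though the pattern is identical, while its NCC score stays at 1 up to floating-point error.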
If NCC does not do the job, you will need to transform the images to a space where the intensity differences do not have much effect. For example, compute an edge-strength image (Canny, Sobel, etc.) and run cvMatchTemplate on these images.
Considering the large difference in scale between the images (~10x), an image pyramid will have to be employed to figure out the correct scale for the matching. I recommend you start with a scale (2^1/x: x being the correct scale) and propagate the estimate up the pyramid.
What you need is something like SIFT or SURF.