Why are SIFT descriptors scale invariant?

My understanding: the SIFT descriptor uses a histogram of oriented gradients calculated from a 16x16 neighbourhood of pixels.
A 16x16 area in a large image can be a very small region, e.g. 1/10 of one hair on a cat's paw.
When you resize the target image to a small size, the 16x16 neighbourhood around the same keypoint
can cover a large part of the image, e.g. the whole paw of the cat.
So it doesn't make sense to me to compare the original image with the resized image using SIFT descriptors.
Can anyone tell me what's wrong with my understanding?

This is a rough description, but should give you an understanding of the approach.
One of the stages of SIFT is to build a pyramid of scales of the image: it repeatedly smooths the image with a low-pass (Gaussian) filter and scales it down.
The feature detector then finds features that have a peak response not only in image space but also in scale space. This means it finds the scale at which the feature produces the highest response.
The descriptor is then calculated at that scale, over a neighbourhood whose size is proportional to the detected scale rather than a fixed 16x16 pixels of the original image. So when you use a smaller or larger version of the image, it should still find the corresponding scale for the feature, and the descriptor covers the same part of the object.
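A toy 1D version of this scale selection may make the idea concrete. The sketch below is an illustration under stated assumptions, not SIFT's actual code: it uses a scale-normalized second derivative of a Gaussian (which SIFT's difference of Gaussians approximates), a "feature" that is simply a bright Gaussian blob, and a geometric grid of 8 scales per octave. All helper names are hypothetical. Doubling the blob's width (as if the image were twice as large) should double the selected scale.

```python
import math

def second_derivative_of_gaussian(u, sigma):
    """d^2/du^2 of a unit-area Gaussian with standard deviation sigma."""
    g = math.exp(-u * u / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
    return (u * u / sigma ** 4 - 1 / sigma ** 2) * g

def blob(n, center, width):
    """A 1D 'image' of n samples containing one bright Gaussian blob."""
    return [math.exp(-((x - center) ** 2) / (2 * width ** 2)) for x in range(n)]

def normalized_response(signal, center, sigma):
    """Scale-normalized (sigma^2 times) second-derivative response at `center`."""
    r = sum(v * second_derivative_of_gaussian(center - x, sigma)
            for x, v in enumerate(signal))
    return abs(sigma * sigma * r)

def characteristic_scale(signal, center, sigmas):
    """Pick the sigma whose normalized response is largest (peak in scale space)."""
    return max(sigmas, key=lambda s: normalized_response(signal, center, s))

# Geometric grid of scales: 8 steps per octave, like levels of a scale-space pyramid.
SIGMAS = [0.5 * 2 ** (k / 8) for k in range(48)]

small = blob(400, 200, 4.0)   # the feature in the "resized" image
large = blob(400, 200, 8.0)   # the same feature in an image twice as large
s_small = characteristic_scale(small, 200, SIGMAS)
s_large = characteristic_scale(large, 200, SIGMAS)
print(round(s_small, 2), round(s_large, 2), round(s_large / s_small, 2))
```

In this 1D setting the continuous response for a blob of width s peaks at sigma = sqrt(2)·s, so both blobs were chosen to land on grid points (about 5.66 and 11.31). The peak response value itself is the same for both, which is what lets the detector treat them as the same feature.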

Related

Does scaling images up or down affect image information?

I'm working on a graduation project on image forgery detection using a CNN. Most of the papers I have read downscale the images before feeding the dataset to the network. I want to know how this process affects the image information.
Images are resized/rescaled to a specific size for a few reasons:
(1) It allows the user to set the input size to their network. When designing a CNN you need to know the shape (dimensions) of your data at each step; so, having a static input size is an easy way to make sure your network gets data of the shape it was designed to take.
(2) Using a full-resolution image as the input to the network is very inefficient (super slow to compute).
(3) In most cases the features to be extracted/learned from an image are still present after downsampling. So, in a way, resizing an image to a smaller size denoises it, filtering out many of the unimportant details for you.
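A minimal sketch of point (3), assuming the image is downscaled by averaging each 2x2 block (one common resizing method; the helper names are hypothetical and images are plain lists of rows). For independent per-pixel noise, averaging 4 pixels should roughly halve the noise standard deviation, at the price of also discarding any detail finer than 2 pixels.

```python
import random
import statistics

def downsample_by_averaging(img, f):
    """Shrink img (list of rows) by an integer factor f, averaging each f x f block."""
    h, w = len(img) // f, len(img[0]) // f
    return [[sum(img[y * f + dy][x * f + dx] for dy in range(f) for dx in range(f)) / (f * f)
             for x in range(w)]
            for y in range(h)]

def pixel_std(img):
    """Standard deviation over all pixel values."""
    return statistics.pstdev([v for row in img for v in row])

random.seed(0)
# A flat grey image (value 128) corrupted by independent Gaussian noise with sigma = 10.
noisy = [[128 + random.gauss(0, 10) for _ in range(64)] for _ in range(64)]
small = downsample_by_averaging(noisy, 2)
print(len(small), len(small[0]), round(pixel_std(noisy), 1), round(pixel_std(small), 1))
```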
Well, you change the image's size. Of course that changes its information.
You cannot reduce the image size without omitting information. The simplest case: throw away every second pixel to scale the image to 50%.
Scaling up adds new pixels. In its simplest form you duplicate pixels, creating redundant information.
More sophisticated methods compute the new pixels (fewer or more of them) by averaging neighbouring pixels or interpolating between them.
Scaling up is reversible: it neither creates nor destroys information (with pixel duplication, for example, keeping every second pixel gives you the original back).
Scaling down divides the amount of information by the square of the downscaling factor*. Upscaling after downscaling results in a blurred image.
(*This is true to a first approximation. If the image contains no high frequencies, there is nothing to lose when they are discarded, hence no loss of information.)
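The following pure-Python sketch illustrates these points with the two simplest methods mentioned above, pixel duplication and keeping every second pixel. Images are plain lists of rows and the helper names are hypothetical; real resizers interpolate or average instead.

```python
def upscale_duplicate(img, f):
    """Nearest-neighbour upscaling: every pixel becomes an f x f block."""
    return [[v for v in row for _ in range(f)] for row in img for _r in range(f)]

def downscale_decimate(img, f):
    """Crude downscaling: keep every f-th pixel in each direction."""
    return [row[::f] for row in img[::f]]

original = [[1, 2], [3, 4]]
# Upscaling then downscaling gives back the original exactly: nothing was destroyed.
round_trip = downscale_decimate(upscale_duplicate(original, 2), 2)

# A pattern with the highest possible frequency (a checkerboard) ...
checker = [[(x + y) % 2 * 255 for x in range(4)] for y in range(4)]
# ... does not survive downscaling then upscaling: it collapses to a flat image.
lossy = upscale_duplicate(downscale_decimate(checker, 2), 2)

# A pattern with no detail finer than the new pixel size survives the same round trip.
blocky = upscale_duplicate([[10, 20], [30, 40]], 2)
kept = upscale_duplicate(downscale_decimate(blocky, 2), 2)
print(round_trip, lossy == checker, kept == blocky)
```

The last case corresponds to the footnote: when there are no high frequencies to discard, downscaling loses nothing.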

Cropping keypoints vs. cropping image and finding keypoints

I have an image of many items, and all of their bounding boxes are known in pixel coordinates.
I am trying to extract a region around each of the items and compute its keypoints and descriptors using AKAZE, to compare them with one another.
However, I realised that this might be too slow, since it involves:
1) cropping every single item to generate many images, then
2) detecting and computing on each image to generate the keypoints and descriptors.
Alternatively, to speed things up, I was thinking of:
1) Resizing the entire image, then detecting and computing the keypoints once.
2) Then, to obtain the keypoints of a particular object, simply retrieving the set of precalculated keypoints corresponding to the object's location.
My question: is this method functionally sound, and are there any consequences to it?
Yes, this second strategy is a fine way to go. To do this efficiently, you should supply a mask argument in the call to OpenCV's detectAndCompute (or detect, if you're using that). The mask should be the same size as your image. Each pixel of the mask should be zero if that pixel does not lie within at least one detection region; otherwise its value is positive (255 for a uchar mask).
In fact, with the first strategy you can have a problem at the borders of your detection regions, where feature points can be missed. This is because feature detection and descriptor computation require processing a small window of pixels around each pixel (and those pixels are not available at the borders of a crop). To handle this correctly you would need to enlarge the detection regions before cropping.
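A minimal pure-Python sketch of the bookkeeping for the second strategy. Assumptions: boxes are hypothetical `(x0, y0, x1, y1)` tuples with exclusive upper bounds, the mask is enlarged by a margin as suggested for crops, and keypoint positions are `(x, y)` pairs such as OpenCV's `kp.pt`. To use the mask with OpenCV you would convert it to an 8-bit single-channel array of the image's size (e.g. `numpy.asarray(mask, dtype=numpy.uint8)`) and pass it as the `mask` argument of `detectAndCompute`; that part is not shown here.

```python
def box_mask(height, width, boxes, margin=0):
    """Build a detection mask: 255 inside any (enlarged) box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Enlarge each box by `margin` pixels and clamp to the image bounds,
        # so windows around keypoints near the box border still see real pixels.
        x0, y0 = max(0, x0 - margin), max(0, y0 - margin)
        x1, y1 = min(width, x1 + margin), min(height, y1 + margin)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 255
    return mask

def group_keypoints(points, boxes):
    """Assign each (x, y) keypoint position to every original box that contains it."""
    groups = {i: [] for i in range(len(boxes))}
    for p in points:
        x, y = p
        for i, (x0, y0, x1, y1) in enumerate(boxes):
            if x0 <= x < x1 and y0 <= y < y1:
                groups[i].append(p)
    return groups

boxes = [(2, 2, 5, 5), (10, 0, 14, 4)]
mask = box_mask(16, 16, boxes, margin=1)
# Hypothetical positions, e.g. [kp.pt for kp in keypoints] after one detection call.
points = [(3.2, 3.7), (11.5, 1.0), (8.0, 8.0)]
groups = group_keypoints(points, boxes)
print(groups)
```

Keypoints falling outside every box (here `(8.0, 8.0)`) are simply not assigned; with the mask applied they would not be detected in the first place.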
Concerning efficiency, be aware that there is an overhead with the second approach: the full image undergoes some pre-processing before feature detection. For AKAZE this is nonlinear diffusion, and for others such as SIFT and SURF it is image convolution. These steps are needed to build the so-called image pyramids. In situations where you only have a few detections, the first strategy can be more efficient (the overhead of image cropping is tiny relative to the image pre-processing).

OpenCV - Extracting SIFT/SURF descriptor from pre-cropped patches

I have a set of 100K 64x64 gray patches (that are already aligned, meaning they all have the same orientation) and I would like to extract a SIFT descriptor from each one using OpenCV.
It is clear to me that all I need to do is define a vector with one keypoint kp such that kp.x=32, kp.y=32.
However, I don't know how to set the kp.size parameter. From going over SIFT's code, it looks as if it does some non-trivial calculations with that parameter instead of simply assuming it is the size of the patch.
Question 1: what should be the kp.size parameter when extracting SIFT descriptors from patches of size 64x64?
Question 2: what should be the kp.size parameter when extracting SURF descriptors from patches of size 64x64?
If you look at the original SIFT publication, the scale of the keypoint is used to weight the histogram of gradient magnitudes and orientations (section 6, "The local image descriptor"). So in your case, since the grey patches are aligned, it is up to you to decide whether you want to down-weight the contributions of pixels further from the patch centre, and to select the scale (i.e. the width of the Gaussian weighting window) accordingly.
For SURF it's basically the same principle, except that the response to Haar wavelets is used instead of the gradient magnitude; you could still weight those responses with a Gaussian window.
Also, since you are working with aligned patches, I would advise you not to use OpenCV's high-level functions, but to use/re-code only the descriptor extraction part and apply whatever weighting you want to compute your patch representation. One reason to do so is that, in the SIFT case, computing SIFT descriptors might "add new keypoints" to the ones you provided: if the algorithm is "not happy" with the keypoint orientation, it duplicates the keypoint at the same location with a different orientation.
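Re-coding the descriptor part for aligned patches could look roughly like the sketch below. It is a simplified SIFT-style descriptor, not OpenCV's implementation: 4x4 cells with 8 orientation bins each, an optional Gaussian weighting window of width `sigma` (your choice of scale), and Lowe's normalise/clip at 0.2/renormalise step, but no trilinear interpolation between bins and no orientation assignment. The function and parameter names are hypothetical.

```python
import math

def sift_like_descriptor(patch, sigma=None, cells=4, bins=8):
    """Simplified SIFT-style descriptor for an already-aligned square patch.

    patch: list of rows of grey values; side length must be divisible by `cells`.
    sigma: width of the Gaussian weighting window in pixels (None = no weighting).
    """
    n = len(patch)
    if n != len(patch[0]) or n % cells:
        raise ValueError("patch must be square with side divisible by cells")
    cell = n // cells
    c = (n - 1) / 2.0  # patch centre in 0-based pixel coordinates (31.5 for 64x64)
    hist = [0.0] * (cells * cells * bins)
    for y in range(1, n - 1):          # central differences need both neighbours
        for x in range(1, n - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue
            if sigma is not None:      # down-weight pixels far from the centre
                mag *= math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
            angle = math.atan2(dy, dx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            hist[((y // cell) * cells + x // cell) * bins + b] += mag
    return _normalize(hist)

def _normalize(v, clip=0.2):
    """L2-normalise, clip large entries, then renormalise."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return v
    v = [min(a / norm, clip) for a in v]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

ramp = [[float(x) for x in range(16)] for _ in range(16)]  # horizontal intensity ramp
desc = sift_like_descriptor(ramp, sigma=8.0)
print(len(desc))
```

For the horizontal ramp every gradient points along +x, so only the first orientation bin of each cell is non-zero; for your 64x64 patches each cell simply covers 16x16 pixels.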
Okay. The SIFT descriptor usually uses a 4x4 grid of cells, each cell usually covering 4x4 pixels, so the neighbourhood is usually 16x16 pixels. The scale/size is the parameter that determines the amount of downsampling/blurring and the radius of the keypoint. So I would think that in your case it would be 4.
You probably also know that SIFT keypoints use sub-pixel coordinates. (32,32) would not be the exact centre of your image patch: that would be (32.5, 32.5) if your image coordinates (x,y) start from 1, or (31.5, 31.5) if they start from 0, as is the case in OpenCV.

How to calculate whether an image has noise and geometric distortion?

I need to make an iPhone application that calculates noise, geometric deformation and other distortions in an image. How do I do this? I have done some image processing with OpenCV on the iPhone, but I don't know how to calculate these parameters.
1) How do I calculate the noise in an image?
2) What is geometric deformation, and how do I calculate the geometric deformation of an image?
3) Are geometric deformation and distortion the same parameter in terms of image filters? Or are there other distortions I can calculate to decide whether an image is of good quality?
Input: my image is a face image from a live video stream.
I advise you to read some literature on image processing, for example Gonzalez & Woods.
1) The simplest method of estimating noise from a single image is to compute the standard deviation of the difference between the image and a smoothed copy of it. For smoothing I recommend a simple median filter with a 3x3 window (or larger). The median is insensitive to outliers, so noise like "salt and pepper" won't skew the statistics.
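A minimal pure-Python sketch of this estimate, assuming a grey image stored as a list of rows and a 3x3 median (helper names are hypothetical; border pixels are skipped rather than padded). A smooth ramp gives an estimate of exactly zero, while added Gaussian noise with sigma = 10 gives a value close to, but not exactly, 10, since the median-filtered copy still contains a little of the noise.

```python
import random
import statistics

def median3x3(img, y, x):
    """Median of the 3x3 neighbourhood centred on (y, x)."""
    return statistics.median(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))

def estimate_noise(img):
    """Std. deviation of (image - 3x3 median-filtered image) over interior pixels."""
    h, w = len(img), len(img[0])
    residual = [img[y][x] - median3x3(img, y, x)
                for y in range(1, h - 1) for x in range(1, w - 1)]
    return statistics.pstdev(residual)

random.seed(1)
clean = [[float(x + y) for x in range(40)] for y in range(40)]       # smooth ramp
noisy = [[v + random.gauss(0, 10) for v in row] for row in clean]    # sigma = 10
print(estimate_noise(clean), round(estimate_noise(noisy), 1))
```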
For overexposed or underexposed images this method can give bad results; in that case you can compute the FFT of the image and use its high-frequency components for noise estimation.
2), 3) Geometric deformation can be calculated only if you know what should be in the image. For example, if you use a test chart (optical reference target) with a square grid, you can find the lines in your image (for example with the Canny edge detector) and compute distortion, astigmatism and some other aberrations. This can also be done if you are sure that the image contains some straight lines.
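One simple way to quantify "how straight is a line that should be straight" is sketched below. Assumptions: the edge points belonging to one grid line have already been extracted (e.g. from a Canny edge map), and the measure is the RMS perpendicular distance of those points from their best-fit (total least squares) line. This is a hypothetical helper, not a full lens-distortion calibration; it only grows as the line bends.

```python
import math

def straightness_rms(points):
    """RMS perpendicular distance of (x, y) points from their best-fit line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # direction of the best-fit line
    nx, ny = -math.sin(theta), math.cos(theta)      # unit normal to that line
    return math.sqrt(sum(((x - mx) * nx + (y - my) * ny) ** 2 for x, y in points) / n)

straight = [(x, 2 * x + 1) for x in range(20)]                   # undistorted line
bowed = [(x, 0.002 * (x - 50) ** 2) for x in range(0, 101, 5)]  # barrel-like bend
print(straightness_rms(straight), round(straightness_rms(bowed), 2))
```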
Defocus can be estimated by analysing the edges in the image, or with the help of a wavelet transform of the image.
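As a swapped-in illustration of the edge-based approach (not the wavelet method), the sketch below uses a simple gradient-energy focus measure: the mean squared gradient magnitude. Defocus spreads each edge over more pixels, which lowers the value. The helpers are hypothetical, the "defocus" is a crude horizontal box blur, and such a measure is only meaningful when comparing images of similar content.

```python
def gradient_energy(img):
    """Mean squared central-difference gradient magnitude over interior pixels."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            total += gx * gx + gy * gy
            count += 1
    return total / count

def box_blur_rows(img, r):
    """Horizontal moving average of radius r (edges clamped): a crude 'defocus'."""
    w = len(img[0])
    return [[sum(row[min(max(x + k, 0), w - 1)] for k in range(-r, r + 1)) / (2 * r + 1)
             for x in range(w)] for row in img]

sharp = [[0.0] * 10 + [255.0] * 10 for _ in range(20)]   # vertical step edge
blurred = box_blur_rows(sharp, 3)
print(gradient_energy(sharp), round(gradient_energy(blurred), 1))
```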
There are many other methods of image analysis. For example, by analysing a colour image you can estimate chromatic aberration, and so on.
But I repeat: in the general case these operations are impossible. Each of them applies only to particular cases.
Read about image quality: there is no standard definition of this term; in each particular case you can use one or more simple characteristics to decide whether an image is good or not.
In your case I'd advise you to take a lot of photos with different kinds of artefacts and quality, then do a simple analysis of their statistics, wavelet decompositions and R-G-B component correlations. BTW, to make the analysis of a colour image less sensitive to its brightness, I recommend working in the HSV colour space (but to estimate chromatic aberration you need to work directly with the RGB components).
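A small sketch of why HSV helps, using Python's standard `colorsys` module. Assumption: a brightness change is modelled as a pure scaling of the RGB values (ignoring clipping and gamma). Under that model hue and saturation stay the same and only the value (V) channel changes. The pixel values are hypothetical.

```python
import colorsys

def hsv_of(rgb8):
    """Convert an 8-bit (r, g, b) tuple to colorsys HSV (all components in [0, 1])."""
    r, g, b = (c / 255.0 for c in rgb8)
    return colorsys.rgb_to_hsv(r, g, b)

skin = (180, 120, 90)                       # a hypothetical skin-tone pixel
darker = tuple(c // 2 for c in skin)        # same surface at half the exposure
h1, s1, v1 = hsv_of(skin)
h2, s2, v2 = hsv_of(darker)
print(round(h1, 4), round(h2, 4), round(s1, 4), round(s2, 4), round(v1, 4), round(v2, 4))
```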

2D subimage detection in OpenCV

What's the most sensible algorithm, or combination of algorithms, to be using from OpenCV for the following problem:
I have a set of small 2D images. I want to detect the locations of these subimages in a larger image.
The subimages are usually around 32x32 pixels, and the larger image is around 400x400.
The subimages are not always square, and so they contain an alpha channel.
Optionally, the larger image may be grainy, compressed, rotated in 3D, or otherwise slightly distorted.
I have tried cvMatchTemplate, with very poor results (difficult to match correctly, and large numbers of false positives, with all match methods). Some of the problems come from the fact that OpenCV can't seem to handle template matching with an alpha channel.
I have tried a manual search, which seems to work better, and can include the alpha channel, but is very slow.
Thanks for any help.
cvMatchTemplate uses an MSE-like metric (SQDIFF/SQDIFF_NORMED) for the matching. This kind of metric penalizes different alpha values severely (due to the square in the equation). Have you tried normalized cross-correlation (NCC)? It is known to model linear variations in pixel intensities better.
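To illustrate both points (NCC's tolerance of linear intensity changes, and ignoring transparent template pixels), here is a slow pure-Python sketch of zero-mean normalized cross-correlation computed only over template pixels whose mask is non-zero (e.g. alpha > 0). It is a hypothetical helper for illustration, not cvMatchTemplate. The template below has halved contrast, a +40 offset and junk values in its transparent corners, yet it should still be found at its true position with a score of about 1.

```python
import math
import random

def masked_ncc_match(image, template, mask):
    """Exhaustive zero-mean NCC over masked template pixels.

    Returns (best_score, (row, col)) of the best top-left position.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    pts = [(y, x) for y in range(th) for x in range(tw) if mask[y][x]]
    t = [template[y][x] for y, x in pts]
    t_mean = sum(t) / len(t)
    t0 = [v - t_mean for v in t]
    t_norm = math.sqrt(sum(v * v for v in t0))
    best = (-2.0, None)   # NCC lies in [-1, 1], so any real score beats this
    for oy in range(ih - th + 1):
        for ox in range(iw - tw + 1):
            w = [image[oy + y][ox + x] for y, x in pts]
            w_mean = sum(w) / len(w)
            w0 = [v - w_mean for v in w]
            w_norm = math.sqrt(sum(v * v for v in w0))
            if t_norm == 0.0 or w_norm == 0.0:
                continue   # flat window or flat template: correlation undefined
            score = sum(a * b for a, b in zip(t0, w0)) / (t_norm * w_norm)
            if score > best[0]:
                best = (score, (oy, ox))
    return best

random.seed(2)
image = [[random.uniform(0, 255) for _ in range(24)] for _ in range(24)]
# Template: the 6x6 region at (row 7, col 11) with halved contrast and a +40 offset.
template = [[image[7 + y][11 + x] * 0.5 + 40 for x in range(6)] for y in range(6)]
# Hypothetical alpha mask: the four corners are transparent and hold junk values.
mask = [[0 if (y in (0, 5) and x in (0, 5)) else 1 for x in range(6)] for y in range(6)]
for y, x in [(0, 0), (0, 5), (5, 0), (5, 5)]:
    template[y][x] = 255.0
score, loc = masked_ncc_match(image, template, mask)
print(loc, round(score, 6))
```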
If NCC does not do the job, you will need to transform the images into a space where intensity differences do not have much effect, e.g. compute an edge-strength image (Canny, Sobel, etc.) and run cvMatchTemplate on those images.
Considering the large difference in scale between the images (~10x), an image pyramid will have to be employed to figure out the correct scale for the matching. I recommend you start with a scale (2^1/x: x being the correct scale) and propagate the estimate up the pyramid.
What you need is something like SIFT or SURF.

Resources