What is meant by local minima and maxima of an image in this context?

I'm reading through this paper about image retrieval, and the following algorithm is used:
First you extract keypoints from the grayscale image; a keypoint is defined as the maximum-valued pixel of a W1xW1 window as the window is swept over the image. After that, the paper says:
Let N_maxW1(p) and N_minW1(p) be the two sets of local maximum and local minimum pixels inside the W2xW2 window around the keypoint under study p, and the following features involving color, spatial and gradient information are extracted for each set.
What is meant by local minima and maxima in this context? Is it every pixel that is above/below a threshold? If so, what is that threshold: the keypoint's pixel value?
The paper in question is the following
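One plausible reading (an assumption, not confirmed by the paper's text quoted above) is that no threshold is involved: a local maximum (minimum) is a pixel whose value is the largest (smallest) in its own W1xW1 neighbourhood, i.e. the same sweep used to find the keypoints, and N_max(p) and N_min(p) collect the local extrema that fall inside the W2xW2 window around p. A minimal pure-Python sketch of that reading; the function names are hypothetical:

```python
def local_extrema(img, w1):
    """Pixels equal to the max (resp. min) of the w1 x w1 window centred on them.

    img is a list of rows of grey levels. Ties count, so flat regions produce
    many extrema; windows are clipped at the image border.
    """
    h, w = len(img), len(img[0])
    r = w1 // 2
    maxima, minima = set(), set()
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            if img[y][x] == max(window):
                maxima.add((y, x))
            if img[y][x] == min(window):
                minima.add((y, x))
    return maxima, minima


def extrema_around(p, maxima, minima, w2):
    """Hypothetical N_max(p), N_min(p): the local extrema inside the w2 x w2 window around p.

    p itself is kept if it is a local maximum; drop it if the paper excludes it.
    """
    py, px = p
    r = w2 // 2

    def inside(q):
        return abs(q[0] - py) <= r and abs(q[1] - px) <= r

    return {q for q in maxima if inside(q)}, {q for q in minima if inside(q)}


img = [[1, 2, 3],
       [4, 9, 5],
       [6, 7, 8]]
maxima, minima = local_extrema(img, 3)
print(maxima, minima)  # {(1, 1)} {(0, 0)}
```

Under this reading, the keypoint's own value plays no role as a threshold; it only fixes the centre of the W2xW2 window.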


Difference between contrast stretching and histogram equalization

I would like to know the difference between contrast stretching and histogram equalization.
I have tried both using OpenCV and observed the results, but I still have not understood the main differences between the two techniques. Any insights would be much appreciated.
Let's define contrast first.
Contrast is a measure of the “range” of an image, i.e. how spread out its intensities are. It has many formal definitions; a well-known one is Michelson's:
Michelson contrast = (I_max − I_min) / (I_max + I_min)
Contrast is strongly tied to an image’s overall visual quality. Ideally, we’d like images to use the entire range of values available to them.
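As a worked example of Michelson's formula: for an image whose intensities run from 84 to 153, the contrast is (153 − 84) / (153 + 84) = 69/237 ≈ 0.29; once that range is stretched to 0..255 it becomes (255 − 0) / (255 + 0) = 1. A tiny helper (name hypothetical) that computes it from a flat list of intensities:

```python
def michelson_contrast(pixels):
    """Michelson contrast (I_max - I_min) / (I_max + I_min) of a flat list of intensities."""
    i_max, i_min = max(pixels), min(pixels)
    if i_max + i_min == 0:
        return 0.0  # all-black image: define contrast as 0 to avoid dividing by zero
    return (i_max - i_min) / (i_max + i_min)


print(michelson_contrast([84, 100, 153]))  # about 0.291
print(michelson_contrast([0, 128, 255]))   # 1.0
```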
Contrast stretching and histogram equalisation have the same goal: making images use the entire range of values available to them.
But they use different techniques.
Contrast stretching works as a mapping:
it maps the minimum intensity in the image to the minimum value of the range (84 ==> 0 in the example above).
In the same way, it maps the maximum intensity in the image to the maximum value of the range (153 ==> 255 in the example above).
This is why contrast stretching is unreliable: if even two pixels already have intensities 0 and 255, it is totally useless, because the mapping becomes the identity.
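A minimal sketch of min-max contrast stretching on a flat list of 8-bit intensities (function name hypothetical). With the 84..153 range, 84 maps to 0, 153 to 255, and a pixel at 118 to round(34 · 255 / 69) = 126; an image that already contains 0 and 255 comes back unchanged, which is the failure case just described.

```python
def contrast_stretch(pixels, out_min=0, out_max=255):
    """Linearly map [min(pixels), max(pixels)] onto [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)  # flat image: nothing to stretch
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (v - lo) * scale) for v in pixels]


print(contrast_stretch([84, 118, 153]))  # [0, 126, 255]
print(contrast_stretch([0, 100, 255]))   # [0, 100, 255], i.e. no effect
```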
A better approach, however, is histogram equalisation, which uses the probability distribution of the intensities. You can learn the steps here
I came across the following points after some reading.
Contrast stretching is all about increasing the difference between the maximum intensity value in an image and the minimum one. All the rest of the intensity values are spread out between this range.
Histogram equalization is about modifying the intensity values of all the pixels in the image such that the histogram is "flattened" (in reality, the histogram can't be exactly flattened, there would be some peaks and some valleys, but that's a practical problem).
In contrast stretching, there exists a one-to-one relationship of the intensity values between the source image and the target image i.e., the original image can be restored from the contrast-stretched image.
However, once histogram equalization is performed, there is no way of getting back the original image.
In Histogram equalization, you want to flatten the histogram into a uniform distribution.
In contrast stretching, you manipulate the entire range of intensity values, as you do in normalization.
Contrast stretching is a linear normalization that stretches an arbitrary interval of the intensities of an image and fits it to another arbitrary interval (usually the target interval spans the minimum and maximum values the image type allows, like 0 and 255).
Histogram equalization is a nonlinear normalization that stretches the area of histogram with high abundance intensities and compresses the area with low abundance intensities.
I think that contrast stretching broadens the histogram of the image intensity levels, so the intensity around the range of input may be mapped to the full intensity range.
Histogram equalization, on the other hand, maps all of the pixels to the full range according to the cumulative distribution function or probability.
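A minimal sketch of the CDF-based mapping just described, for 8-bit levels stored in a flat list (function name hypothetical). The lookup table uses the common textbook formula round((cdf(v) − cdf_min) / (N − cdf_min) · (L − 1)); library implementations such as OpenCV's may differ in rounding details. Because the table depends on the pixel counts, not just on the minimum and maximum, the mapping is nonlinear and reshapes the histogram.

```python
def equalize(pixels, levels=256):
    """Histogram equalization: send each level through the normalised cumulative histogram."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)  # CDF value at the darkest occupied level
    if n == cdf_min:
        return list(pixels)  # a single intensity: nothing to spread out
    lut = [round((c - cdf_min) / (n - cdf_min) * (levels - 1)) for c in cdf]
    return [lut[v] for v in pixels]


print(equalize([52, 52, 53, 60]))  # [0, 0, 128, 255]
```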
Contrast is the difference between maximum and minimum pixel intensity.
Both methods are used to enhance contrast, more precisely, adjusting image intensities to enhance contrast.
During histogram equalization the overall shape of the histogram changes, whereas in contrast stretching the overall shape of the histogram remains the same.

Soft binning in SIFT

According to Lowe, David G. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60.2 (2004): 91-110:
"It is important to avoid all boundary affects in which the descriptor
abruptly changes as a sample shifts smoothly from being within one
histogram to another or from one orientation to another. Therefore,
trilinear interpolation is used to distribute the value of each
gradient sample into adjacent histogram bins. In other words, each
entry into a bin is multiplied by a weight of 1−d for each dimension,
where d is the distance of the sample from the central value of the
bin as measured in units of the histogram bin spacing."
I compute the orientation t and the location (x, y) of each gradient sample, all of which are floating-point values. Currently I just add the gradient magnitude to the 3D histogram bin values[t][x][y], where t, x and y are the floored (lower-bound) integer values. But according to the paper, I have to distribute the gradient magnitude over adjacent bins, and I am not sure how to do that.
I found my answer at the following link:
HOG Trilinear Interpolation of Histogram Bins
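A minimal sketch of that soft binning, assuming t, x and y have already been converted to continuous bin coordinates with bin centres at integer positions (a convention chosen here for illustration; Lowe's exact offsets and Gaussian weighting are not reproduced). Each of the 2 × 2 × 2 neighbouring bins receives magnitude · (1 − d_t)(1 − d_x)(1 − d_y), the "1 − d per dimension" rule from the quote; the orientation axis wraps around and spatial bins outside the grid are dropped. The function name is hypothetical.

```python
import math


def add_trilinear(hist, t, x, y, magnitude):
    """Spread `magnitude` over hist[t][x][y] and its neighbours with trilinear weights.

    t, x, y are continuous bin coordinates (bin centres at integers, an assumed
    convention). Orientation bins wrap; out-of-range spatial bins are skipped.
    """
    n_t, n_x, n_y = len(hist), len(hist[0]), len(hist[0][0])
    t0, x0, y0 = math.floor(t), math.floor(x), math.floor(y)
    dt, dx, dy = t - t0, x - x0, y - y0
    # The nearer bin gets 1 - d, the farther bin gets d, independently per dimension.
    for it, wt in ((t0, 1 - dt), (t0 + 1, dt)):
        for ix, wx in ((x0, 1 - dx), (x0 + 1, dx)):
            if not 0 <= ix < n_x:
                continue
            for iy, wy in ((y0, 1 - dy), (y0 + 1, dy)):
                if not 0 <= iy < n_y:
                    continue
                hist[it % n_t][ix][iy] += magnitude * wt * wx * wy


# 8 orientation bins x 4 x 4 spatial bins, as in the standard SIFT descriptor.
hist = [[[0.0] * 4 for _ in range(4)] for _ in range(8)]
add_trilinear(hist, 7.5, 1.25, 2.0, 1.0)  # split between orientation bins 7 and 0
```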

Matlab Camera Calibration - Correct lens distortion

In the Computer Vision System Toolbox for Matlab there are three interpolation methods available when correcting lens distortion:
Interpolation method for the function to use on the input image. The interp input interpolation method can be the string, 'nearest', 'linear', or 'cubic'.
My question is: what is the difference between 'nearest', 'linear' and 'cubic'? And which one is implemented in the "Zhang" and "Heikkila, J, and O. Silven" methods?
I can't access the page at the link you wrote in your question (it asks for a username and password), so I assume your linked page has the same contents as the page http://www.mathworks.it/it/help/vision/ref/undistortimage.html, which I quote here:
J = undistortImage(I,cameraParameters,interp) removes lens distortion from the input image, I and specifies the
interpolation method for the function to use on the input image.
Input Arguments
I — Input image
cameraParameters — Object for storing camera parameters
interp — Interpolation method
'linear' (default) | 'nearest' | 'cubic'
Interpolation method for the function to use on
the input image. The interp input interpolation method can be the
string, 'nearest', 'linear', or 'cubic'.
Furthermore, I assume you are referring to these papers:
ZHANG, Zhengyou. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2000, 22.11: 1330-1334.
HEIKKILA, Janne; SILVEN, Olli. A four-step camera calibration procedure with implicit image correction. In: Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on. IEEE, 1997. p. 1106-1112.
I have searched for the word "interpolation" in the two PDF documents (Zhang, and Heikkila and Silven) and did not find any direct statement about the interpolation method they used.
To my knowledge, a camera calibration method is in general concerned with how to estimate the intrinsic, extrinsic and lens distortion parameters (all of these parameters are inside the input argument cameraParameters of Matlab's undistortImage function); the interpolation method is part of a different problem, i.e. the problem of "Geometric Image Transformations".
I quote from OpenCV's page Geometric Image Transformations (I have slightly modified the original, omitting some details and adding some definitions; I assume you are working with a grey-level image):
The functions in this section perform various geometrical
transformations of 2D images. They do not change the image content but
deform the pixel grid and map this deformed grid to the destination
image. In fact, to avoid sampling artifacts, the mapping is done in
the reverse order, from destination to the source. That is, for each
pixel (x, y) of the destination image, the functions compute
coordinates of the corresponding “donor” pixel in the source image and
copy the pixel value:
dst(x,y) = src(f_x(x,y), f_y(x,y))
where
dst(x,y) is the grey value of the pixel located at row x and column y in the destination image
src(x,y) is the grey value of the pixel located at row x and column y in the source image
f_x is a function that maps the row x and the column y to a new row, it just uses coordinates and not the grey level.
f_y is a function that maps the row x and the column y to a new column, it just uses coordinates and not the grey level.
The actual implementations of the geometrical transformations, from
the most generic remap() and to the simplest and the fastest resize(),
need to solve two main problems with the above formula:
• Extrapolation of non-existing pixels. Similarly to the filtering
functions described in the previous section, for some (x,y), either
one of f_x(x,y), or f_y(x,y), or both of them may fall outside of
the image. In this case, an extrapolation method needs to be used.
OpenCV provides the same selection of extrapolation methods as in the
filtering functions. In addition, it provides the method
BORDER_TRANSPARENT. This means that the corresponding pixels in the
destination image will not be modified at all.
• Interpolation of pixel
values. Usually f_x(x,y) and f_y(x,y) are floating-point numbers. This
means that <f_x, f_y> can be either an affine or
perspective transformation, or radial lens distortion correction, and
so on. So, a pixel value at fractional coordinates needs to be
retrieved. In the simplest case, the coordinates can be just rounded
to the nearest integer coordinates and the corresponding pixel can be
used. This is called a nearest-neighbor interpolation. However, a
better result can be achieved by using more sophisticated
interpolation methods, where a polynomial function is fit into some
neighborhood of the computed pixel (f_x(x,y), f_y(x,y)), and then the
value of the polynomial at (f_x(x,y), f_y(x,y)) is taken as the
interpolated pixel value. In OpenCV, you can choose between several
interpolation methods. See resize() for details.
For a "soft" introduction, see for example Cambridge in Colour: Digital Image Interpolation.
So let's say you need the grey level of the pixel at x=20.2, y=14.7. Since x and y have a nonzero fractional part, you need to "invent" (compute) the grey level in some way. In the simplest case ('nearest' interpolation) you just say that the grey level at (20.2,14.7) is the grey level you retrieve at (20,15); it is called "nearest" because 20 is the nearest integer to 20.2 and 15 is the nearest integer to 14.7.
In the (bi)'linear' interpolation you will compute the value at (20.2,14.7) with a combination of the grey levels of the four pixels at (20,14), (20,15), (21,14), (21,15); for the details on how to compute the combination see the Wikipedia page which has a numeric example.
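A minimal sketch of 'nearest' and (bi)'linear' at (20.2, 14.7), using the same x = row, y = column convention. The 32 × 32 test image whose grey level is 10 · row + column is a synthetic assumption chosen so the result is easy to check: bilinear interpolation reproduces such a linear ramp exactly, giving 10 · 20.2 + 14.7 = 216.7. Function names are hypothetical.

```python
import math


def nearest(img, x, y):
    """'nearest': grey level of the pixel at the rounded coordinates (x = row, y = column)."""
    return img[math.floor(x + 0.5)][math.floor(y + 0.5)]


def bilinear(img, x, y):
    """'linear': combine the four surrounding pixels, each weighted by (1 - distance) per axis.

    Assumes (x, y) lies inside the image so that row x0 + 1 and column y0 + 1 exist.
    """
    x0, y0 = math.floor(x), math.floor(y)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[x0][y0]
            + (1 - dx) * dy * img[x0][y0 + 1]
            + dx * (1 - dy) * img[x0 + 1][y0]
            + dx * dy * img[x0 + 1][y0 + 1])


# Synthetic 32 x 32 test image whose grey level is 10 * row + column.
img = [[10 * r + c for c in range(32)] for r in range(32)]
print(nearest(img, 20.2, 14.7))   # 215, the value at (20, 15)
print(bilinear(img, 20.2, 14.7))  # about 216.7, mixing (20,14), (20,15), (21,14), (21,15)
```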
The (bi)'cubic' interpolation considers the combination of sixteen pixels in order to compute the value at (20.2,14.7), see the Wikipedia page.
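A sketch of bicubic interpolation over those sixteen pixels. Which cubic kernel Matlab uses is not stated in this thread, so the code assumes the Keys cubic-convolution kernel with a = −0.5, a common choice; the function names are hypothetical. On the same synthetic ramp image (grey level 10 · row + column) this kernel also returns about 216.7 at (20.2, 14.7), because it reproduces linear functions exactly.

```python
import math


def keys_weight(s, a=-0.5):
    """Keys cubic-convolution kernel; a = -0.5 is an assumed, commonly used value."""
    s = abs(s)
    if s <= 1:
        return (a + 2) * s ** 3 - (a + 3) * s ** 2 + 1
    if s < 2:
        return a * s ** 3 - 5 * a * s ** 2 + 8 * a * s - 4 * a
    return 0.0


def bicubic(img, x, y):
    """'cubic': combine the 4 x 4 = 16 pixels around (x, y) with separable cubic weights.

    Assumes rows x0 - 1 .. x0 + 2 and columns y0 - 1 .. y0 + 2 exist in the image.
    """
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for i in range(x0 - 1, x0 + 3):          # four rows around x
        wx = keys_weight(x - i)
        for j in range(y0 - 1, y0 + 3):      # four columns around y
            total += wx * keys_weight(y - j) * img[i][j]
    return total


img = [[10 * r + c for c in range(32)] for r in range(32)]
print(bicubic(img, 20.2, 14.7))  # about 216.7
```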
I suggest you try all three methods on the same input image and compare the output images.
The interpolation method is actually independent of the camera calibration. Any time you apply a geometric transformation to an image, such as rotation, resizing, or distortion compensation, the pixels in the new image will correspond to points between the pixels of the old image, so you have to interpolate their values somehow.
'nearest' means you simply use the value of the nearest pixel.
'linear' means you use bilinear interpolation. The new pixel's value is a weighted sum of the values of the four neighboring pixels in the input image, where each weight decreases linearly with distance (a weight of 1 - d along each axis), so closer pixels count more.
'cubic' means you use a bi-cubic interpolation, which is more complicated than bi-linear, but may give you a smoother image.
A good description of these interpolation methods is given in the documentation for the interp2 function.
And finally, just to clarify, the undistortImage function is in the Computer Vision System Toolbox.

How to apply box filter on integral image? (SURF)

Assume that I have a grayscale (8-bit) image and an integral image created from that same image.
The image resolution is 720x576. According to the SURF algorithm, each octave is composed of 4 box filters, which are defined by the number of pixels on their side. The first octave uses filters with 9x9, 15x15, 21x21 and 27x27 pixels. The second octave uses filters with 15x15, 27x27, 39x39 and 51x51 pixels. The third octave uses filters with 27x27, 51x51, 75x75 and 99x99 pixels. If the image is sufficiently large (and I guess 720x576 is big enough, right?), a fourth octave is added: 51x51, 99x99, 147x147 and 195x195. These octaves partially overlap one another to improve the quality of the interpolated results.
// so, we have:
//
// 9x9 15x15 21x21 27x27
// 15x15 27x27 39x39 51x51
// 27x27 51x51 75x75 99x99
// 51x51 99x99 147x147 195x195
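The sizes in the table follow a simple pattern: for octave o and filter index i (both starting at 1), the side length is 3 · (2^o · i + 1). This formula is inferred from the table itself (it reproduces all sixteen entries) rather than quoted from the paper, so treat it as a convenient way to compute the sizes instead of hardcoding them. The function name is hypothetical.

```python
def surf_filter_sizes(octaves=4, intervals=4):
    """Box-filter side lengths per octave: 3 * (2**o * i + 1), with o and i starting at 1."""
    return [[3 * (2 ** o * i + 1) for i in range(1, intervals + 1)]
            for o in range(1, octaves + 1)]


for row in surf_filter_sizes():
    print(row)
# [9, 15, 21, 27]
# [15, 27, 39, 51]
# [27, 51, 75, 99]
# [51, 99, 147, 195]
```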
The questions are: What are the values in each of these filters? Should I hardcode these values, or should I calculate them? How exactly (numerically) do I apply the filters to the integral image?
Also, for calculating the Hessian determinant I found two approximations:
det(HessianApprox) = DxxDyy − (0.9Dxy)^2 and det(HessianApprox) = DxxDyy − (0.81Dxy)^2. Which one is correct?
(Dxx, Dyy, and Dxy are Gaussian second-order derivatives.)
I had to go back to the original paper to find the precise answers to your questions.
Some background first
SURF leverages a common Image Analysis approach for regions-of-interest detection that is called blob detection.
The typical approach for blob detection is a difference of Gaussians.
There are several reasons for this, the first one being to mimic what happens in the visual cortex of the human brain.
The drawback of difference of Gaussians (DoG) is its computation time, which is too high for it to be applied to large image areas.
In order to bypass this issue, SURF takes a simple approach. A DoG is simply the computation of two Gaussian averages (equivalently, two Gaussian blurs) followed by taking their difference.
A quick-and-dirty approximation (not so dirty for small regions) is to approximate the Gaussian blur by a box blur.
A box blur is the average value of all the image values in a given rectangle. It can be computed efficiently via integral images.
Using integral images
Inside an integral image, each pixel value is the sum of all the pixels that were above it and on its left in the original image.
With this convention the integral image has one more row and one more column than the original: its top-left value is thus 0, and its bottom-right value is the sum of all the original pixels.
Then you just need to notice that the box blur is the sum of all the pixels inside a given rectangle (not necessarily starting at the top-leftmost pixel of the image) divided by the rectangle's area, and apply the following simple geometric reasoning.
If you have a rectangle with corners ABCD (top left, top right, bottom left, bottom right), then the value of the box filter is given by:
boxFilter(ABCD) = A + D - B - C,
where A, B, C, D is a shortcut for IntegralImagePixelAt(A) (B, C, D respectively).
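A minimal sketch of both steps in pure Python, assuming the padded convention (one extra row and column, so ii[r][c] holds the sum of all pixels strictly above row r and left of column c). With that convention the four corner lookups A, B, C, D sit just outside the rectangle, and the box sum is A + D − B − C regardless of the rectangle's size. Function names are hypothetical.

```python
def integral_image(img):
    """(h+1) x (w+1) table: ii[r][c] = sum of img[0:r][0:c] (pixels above and to the left)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height][left:left+width] from four lookups: A + D - B - C."""
    a = ii[top][left]                    # A: above-left corner
    b = ii[top][left + width]            # B: above-right corner
    c = ii[top + height][left]           # C: below-left corner
    d = ii[top + height][left + width]   # D: below-right corner
    return a + d - b - c


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(ii[3][3])                 # 45, the sum of all pixels
print(box_sum(ii, 1, 0, 2, 2))  # 24 = 4 + 5 + 7 + 8
```

Dividing a box sum by height · width gives the box blur; the cost is four lookups whatever the filter size.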
Integral images in SURF
SURF is not using box blurs of sizes 9x9, etc. directly.
What it uses instead is several orders of Gaussian derivatives, or Haar-like features.
Let's take an example. Suppose you are to compute the output of the 9x9 filters. This corresponds to a given sigma, hence a fixed scale/octave.
The sigma being fixed, you center your 9x9 window on the pixel of interest. Then, you compute the output of the 2nd order Gaussian derivative in each direction (horizontal, vertical, diagonal). The Fig. 1 in the paper gives you an illustration of the vertical and diagonal filters.
The Hessian determinant
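There is a relative weighting factor that compensates for the difference between the Gaussian kernels and their box-filter approximations. Let's trust the paper that the determinant is equal to:
Det = DxxDyy - (0.9 * Dxy)^2.
Expanding the square, the determinant is given by: Det = DxxDyy - 0.81*Dxy^2. So the first formula in the question is the correct one; (0.81*Dxy)^2 would instead give 0.6561*Dxy^2.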
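A sketch of how the filter responses can be computed numerically from an integral image. The lobe layout is an assumption taken from common open-source SURF implementations such as OpenSURF: with lobe length l = size/3, Dxx and Dyy use three adjacent lobes weighted +1, −2, +1, each l pixels long in the derivative direction and 2l − 1 pixels wide, and Dxy uses four l × l lobes weighted ±1 in the quadrants, separated by the centre row and column (the sign convention of Dxy does not affect the determinant). Dividing each response by the filter area is likewise an assumed normalisation, and the helper names are hypothetical. The integral-image helpers are repeated so the block runs on its own.

```python
def integral(img):
    """Padded integral image: ii[r][c] = sum of img[0:r][0:c]."""
    ii = [[0] * (len(img[0]) + 1) for _ in range(len(img) + 1)]
    for r, row in enumerate(img):
        acc = 0
        for c, v in enumerate(row):
            acc += v
            ii[r + 1][c + 1] = ii[r][c + 1] + acc
    return ii


def box(ii, top, left, height, width):
    """Sum of img[top:top+height][left:left+width] from four integral-image lookups."""
    return (ii[top][left] + ii[top + height][left + width]
            - ii[top][left + width] - ii[top + height][left])


def hessian_response(ii, r, c, size=9):
    """Box-filter Dxx, Dyy, Dxy at pixel (r, c) and det = Dxx*Dyy - (0.9*Dxy)^2.

    (r, c) must lie at least (size - 1) // 2 pixels inside the image.
    """
    l = size // 3           # lobe length: 3 for the 9x9 filter
    b = (size - 1) // 2     # half the filter side: 4 for the 9x9 filter
    # Lobes side by side along x (+1, -2, +1): whole strip minus 3 * middle lobe.
    dxx = box(ii, r - l + 1, c - b, 2 * l - 1, size) - 3 * box(ii, r - l + 1, c - l // 2, 2 * l - 1, l)
    # Same layout rotated by 90 degrees for y.
    dyy = box(ii, r - b, c - l + 1, size, 2 * l - 1) - 3 * box(ii, r - l // 2, c - l + 1, l, 2 * l - 1)
    # Four l x l lobes in the quadrants around (r, c).
    dxy = (box(ii, r - l, c + 1, l, l) + box(ii, r + 1, c - l, l, l)
           - box(ii, r - l, c - l, l, l) - box(ii, r + 1, c + 1, l, l))
    area = size * size      # assumed normalisation by filter area
    dxx, dyy, dxy = dxx / area, dyy / area, dxy / area
    return dxx, dyy, dxy, dxx * dyy - 0.81 * dxy * dxy


# A bowl-shaped test image centred at (10, 10) gives a positive determinant there.
bowl = [[(r - 10) ** 2 + (c - 10) ** 2 for c in range(21)] for r in range(21)]
print(hessian_response(integral(bowl), 10, 10))
```

For larger filters (15x15, 21x21, ...) the same code applies with a different size argument, since l and b scale with it.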
Look at page 17 of this document
http://www.sci.utah.edu/~fletcher/CS7960/slides/Scott.pdf
If you already have code for ordinary 2D Gaussian convolution, just use the box filter as the kernel and apply it to the original image rather than the integral image. The results will be the same as those of the integral-image method you asked about; the integral image only makes them cheaper to compute.

OpenCV - Dynamically find HSV ranges for color

When given an image such as this:
And not knowing the color of the object in the image, I would like to be able to automatically find the best H, S and V ranges to threshold the object itself, in order to get a result such as this:
In this example, I manually found the values and thresholded the image using cv::inRange. The output I'm looking for is the best H, S and V ranges (min and max value each, a total of 6 integer values) to threshold the given object in the image, without knowing in advance what color the object is. I need to use these values later on in my code.
Keypoints to remember:
- All given images will be of the same size.
- All given images will have the same dark background.
- All the objects I'll put in the images will be of full color.
I could brute-force over all possible permutations of the 6 HSV range values, threshold with each one and find a clever way to figure out when the best blob was found (blob size maybe?). That seems like a very cumbersome, slow and highly inefficient solution, though.
What would be good way to approach this? I did some research, and found that OpenCV has some machine learning capabilities, but I need to have the actual 6 values at the end of the process, and not just a thresholded image.
You could create a small 2 layer neural network for the task of dynamic HSV masking.
Steps:
create/generate ground truth annotations for image and its HSV range for the required object
design a small neural network with at least 1 conv layer and 1 fcn layer.
Input: mask of the image after applying the HSV range from the ground truth (m x n)
Output: binary m x n mask of the image
Post-processing: multiply the mask with the original image to get the required object highlighted
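The network above outputs a mask, while the question asks for six numbers. One way to bridge that (an addition, not part of the original answer) is to read the H, S and V extremes off the pixels the mask selects. A pure-Python sketch using the standard colorsys module, assuming RGB pixels in 0..255 and rescaling to OpenCV's 8-bit HSV ranges (H 0..179, S and V 0..255); the function name is hypothetical. Red hues wrap around 0, so a plain min/max can give a misleadingly wide hue range for red objects.

```python
import colorsys


def hsv_range_from_mask(pixels, mask):
    """Return (h_min, h_max, s_min, s_max, v_min, v_max) over the pixels where mask is 1.

    pixels: list of (R, G, B) tuples in 0..255; mask: list of 0/1 of the same length.
    Output uses OpenCV's 8-bit HSV scale: H in 0..179, S and V in 0..255.
    """
    hs, ss, vs = [], [], []
    for (r, g, b), m in zip(pixels, mask):
        if not m:
            continue
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)  # all in [0, 1]
        hs.append(round(h * 180) % 180)  # hue in degrees / 2, wrapped into 0..179
        ss.append(round(s * 255))
        vs.append(round(v * 255))
    if not hs:
        raise ValueError("mask selects no pixels")
    return min(hs), max(hs), min(ss), max(ss), min(vs), max(vs)


pixels = [(255, 0, 0), (128, 0, 0), (0, 255, 0)]
print(hsv_range_from_mask(pixels, [1, 1, 0]))  # (0, 0, 255, 255, 128, 255)
```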
