What are Octaves and Sub levels in KAZE and AKAZE? - opencv

I am having a hard time trying to understand the Octaves and Sub-levels in the Non-Linear Scale space (KAZE and AKAZE). For the SIFT- Octaves is a collection of same sized images and sub-level is number of gaussian blurred images to be generated in the octave. How can this be explained in KAZE and AKAZE?
I want to tune the parameters and so need to understand this properly.

The octaves are the number of scale levels. 4 octaves means scales of 1, 1/2, 1/4, 1/8.
The octave layers are the number of diffusion layers (sub-levels) per octave. Diffusion models how the luminance of an image changes as it is viewed from farther away, i.e. at a coarser scale. It can be (and it is) approximated by some Gaussian blur.
See the original AKAZE paper for more details:
http://www.robesafe.com/personal/pablo.alcantarilla/papers/Alcantarilla13bmvc.pdf
See figure 2 in this paper, for a representation of the image pyramid:
http://tulipp.eu/wp-content/uploads/2019/03/2017_TUD_HEART_kalms.pdf
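As a rough illustration of how the two parameters interact: the KAZE paper arranges the scale levels as sigma(o, s) = sigma0 * 2^(o + s/S), where o is the octave index, s the sub-level index and S the number of sub-levels per octave, and converts each sigma to an evolution time t = sigma^2 / 2. The sketch below only lists those values. The function name and the default sigma0 = 1.6 are assumptions chosen for illustration; this is not OpenCV code.

```python
def evolution_scales(n_octaves, n_sublevels, sigma0=1.6):
    """Return (octave, sublevel, sigma, evolution_time) for every level.

    sigma0 = 1.6 is an assumed base scale, used here only for illustration.
    """
    levels = []
    for o in range(n_octaves):
        for s in range(n_sublevels):
            # Scale doubles once per octave; sub-levels split each doubling.
            sigma = sigma0 * 2.0 ** (o + s / n_sublevels)
            # Evolution time whose Gaussian equivalent has std. dev. sigma.
            t = 0.5 * sigma ** 2
            levels.append((o, s, sigma, t))
    return levels


for o, s, sigma, t in evolution_scales(n_octaves=4, n_sublevels=4):
    print(f"octave {o}, sub-level {s}: sigma = {sigma:.3f}, t = {t:.3f}")
```

Under this reading, more octaves extend the range toward coarser structures, while more sub-levels sample each doubling of scale more finely.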

Related

Workflow to clean badly scanned sheet music

I am looking for a workflow that would clean (and possibly straighten) old and badly scanned images of musical scores (like the one below).
I've tried to use denoise, Hough filters, and ImageMagick geometry filters, but I am struggling to identify a series of filters that removes the scanner noise/bias.
Just some quick ideas:
Remove grayscale noise: Do a low pass filter (darks), since the music is darker than a lot of the noise. Remaining noise is mostly vertical lines.
Rotate image: Sum grayscale values for each column of the image. You'll get a vector with the total pixel lightness in that column. Use gradient descent or search on the rotation of the image (within some bounds like +/-15deg rotation) to maximize the variance of that vector. Idea here is that the vertical noise lines indicate vertical alignment, and so we want the columns of the image to align with these noise lines (= maximized variance).
Remove vertical line noise: After rotation, take median value of each column. The greater the distance (squared difference) a pixel is from that median darkness, the more confident we are it is its true color (e.g. a pure white or black pixel when vertical noise was gray). Since noise is non-white, you could try blending this distance by the whiteness of the median for an alternative confidence metric. Ideally, I think here you'd train some 7x7x2 convolution filter (2 channels being pixel value and distance from median) to estimate true value of the pixel. That would be the most minimal machine learning approach, not using some full-fledged NN. However, given your lack of training data, we'll have to come up with our own heuristic for what the true pixel value is. You likely will need to play around with it, but here's what I think might work:
Set some threshold of confidence; above that threshold we take the value as is. Below the threshold, set to white (the binary expected pixel value for the entire page).
For all values below threshold, take the max confidence value within a +/-2 pixels L1 distance (e.g. 5x5 convolution) as that pixel's value. Seems like features are separated by at least 2 pixels, but for lower resolutions that window size may need to be adjusted. Since white pixels may end up being more confident overall, you could experiment with prioritizing darker pixels (increase their confidence somehow).
Clamp the image contrast and maybe run another low pass filter.
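A minimal sketch of two of the steps above, assuming the page is a plain list of rows of grayscale integers (0 = black, 255 = white). The function names and the threshold value are made up for illustration, and the rotation itself is left to whatever image library is in use; only the score that the rotation search would maximize is shown.

```python
from statistics import median, pvariance


def column_sum_variance(img):
    """Score for the rotation search: variance of the per-column sums."""
    width = len(img[0])
    return pvariance([sum(row[x] for row in img) for x in range(width)])


def suppress_column_noise(img, threshold, white=255):
    """Keep pixels far from their column median; set the rest to white.

    Confidence is the squared difference between a pixel and the median
    of its column, as described in the steps above.
    """
    width = len(img[0])
    col_medians = [median(row[x] for row in img) for x in range(width)]
    cleaned = []
    for row in img:
        cleaned.append([
            px if (px - col_medians[x]) ** 2 >= threshold else white
            for x, px in enumerate(row)
        ])
    return cleaned


# Toy page: a gray vertical noise line in column 1 with one black "note" pixel.
page = [
    [255, 128, 255],
    [255, 128, 255],
    [255, 0, 255],
    [255, 128, 255],
    [255, 128, 255],
]
for row in suppress_column_noise(page, threshold=2500):
    print(row)
```

The threshold would need tuning per scan; the neighbourhood step (taking the most confident value within a small window) is not included here.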

How does image digitalization differ from sound digitalization (PCM)?

I am trying to understand digitalization of sound and images.
As far as I know, they both need to convert analog signal to digital signal. Both should be using sampling and quantization.
Sound: We have amplitudes on axis y and time on axis x. What is on axis x and y during image digitalization?
What is the standard sample rate for image digitalization? 44 kHz is used for CDs (sound digitalization). How exactly is a sample rate used for images?
Quantization: For sound we use bit depth, which means the number of amplitude levels. For images we also use bit depth, but does it mean how many intensities we are able to distinguish?
What are other differences between sound and image digitalization?
Acquisition of images can be summarized as a spatial sampling and conversion/quantization steps. The spatial sampling on (x,y) is due to the pixel size. The data (on the third axis, z) is the number of electrons generated by photoelectric effect on the chip. These electrons are converted to ADU (analog digital unit) and then to bits. What is quantized is the light intensity in level of greys, for example data on 8 bits would give 2^8 = 256 levels of gray.
An image loses information both due to the spatial sampling (resolution) and the intensity quantization (levels of gray).
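The quantization step can be sketched as follows, assuming the intensity has already been normalized to the range [0, 1]. The function names are hypothetical; this only illustrates how a bit depth translates into a number of gray levels and a bounded rounding error.

```python
def quantize(intensity, bits):
    """Map an intensity in [0.0, 1.0] to an integer gray level."""
    levels = 2 ** bits  # e.g. 8 bits -> 256 levels
    return min(int(intensity * levels), levels - 1)


def dequantize(level, bits):
    """Return the centre of the quantization bin as a float in [0, 1]."""
    return (level + 0.5) / 2 ** bits


for bits in (1, 4, 8):
    level = quantize(0.3, bits)
    error = abs(dequantize(level, bits) - 0.3)
    print(f"{bits} bits: level {level} of {2 ** bits}, error {error:.4f}")
```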
Unless you are talking about videos, images won't have sampling in units of Hz (1/time) but in 1/distance. What is important is to satisfy the Shannon-Nyquist sampling criterion to avoid aliasing. The spatial frequencies you are able to get depend directly on the optical design. The pixel size must be chosen to match this design to avoid aliasing.
EDIT: On the example below I plotted a sine function (white/black stripes). On the left part the signal is correctly sampled, on the right it is undersampled by a factor of 4. It is the same signal, but due to bigger pixels (a lower sampling rate) you get aliasing of your data. Here the stripes are horizontal, but you also have the same effect for vertical ones.
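The same effect can be reproduced numerically. In this sketch the stripe frequency (0.3 cycles per unit length) and the pixel pitches are made-up values: with a pitch of 1 the pattern is sampled at 0.3 cycles per sample, below the Nyquist limit of 0.5, while with a pitch of 4 it is sampled at 1.2 cycles per sample and becomes indistinguishable from a much coarser pattern at 0.2 cycles per sample.

```python
import math


def sample_stripes(cycles_per_unit, pitch, n_samples):
    """Sample cos(2*pi*f*x) at x = 0, pitch, 2*pitch, ..."""
    return [
        math.cos(2 * math.pi * cycles_per_unit * pitch * n)
        for n in range(n_samples)
    ]


fine = sample_stripes(0.3, pitch=1, n_samples=8)    # correctly sampled
coarse = sample_stripes(0.3, pitch=4, n_samples=8)  # undersampled by 4
alias = sample_stripes(0.2, pitch=1, n_samples=8)   # the pattern it turns into

# The undersampled stripes match the alias up to floating-point error.
print(max(abs(a - b) for a, b in zip(coarse, alias)))
```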
There is no common standard for the spatial axis for image sampling. A 20 megapixel sensor or camera will produce images at a completely different spatial resolution in pixels per mm, or pixels per degree angle of view than a 2 megapixel sensor or camera. These images will typically be rescaled to yet another non-common-standard resolution for viewing (72 ppi, 300 ppi, "Retina", SD/HDTV, CCIR-601, "4k", etc.)
For audio, 48k is starting to become more common than 44.1ksps. (on iPhones, etc.)
("a nice thing about standards is that there are so many of them")
Amplitude scaling in raw format also has no single standard. When converted or requantized to storage format, 8-bit, 10-bit, and 12-bit quantizations are the most common for RGB color separations. (JPEG, PNG, etc. formats)
Channel formats are different between audio and image.
X, Y, where X is time and Y is amplitude is only good for mono audio. Stereo usually needs T,L,R for time, left, and right channels. Images are often in X,Y,R,G,B, or 5 dimensional tensors, where X,Y are spatial location coordinates, and RGB are color intensities at that location. The image intensities can be somewhat related (depending on gamma corrections, etc.) to the number of incident photons per shutter duration in certain visible EM frequency ranges per incident solid angle to some lens.
A low-pass filter for audio, and an optical low-pass (anti-aliasing) filter in front of the sensor for images, are commonly used to make the signal closer to bandlimited so it can be sampled with less aliasing noise/artifacts.

Difference between contrast stretching and histogram equalization

I would like to know the difference between contrast stretching and histogram equalization.
I have tried both using OpenCV and observed the results, but I still have not understood the main differences between the two techniques. Insights would be of much needed help.
Let's define contrast first.
Contrast is a measure of the “range” of an image, i.e. how spread out its intensities are. It has many formal definitions; a well-known one is Michelson’s:
contrast = (Imax - Imin) / (Imax + Imin)
Contrast is strongly tied to an image’s overall visual quality.
Ideally, we’d like images to use the entire range of values available to them.
Contrast stretching and histogram equalisation have the same goal: making the image use the entire range of values available to it.
But they use different techniques.
Contrast stretching works as a mapping:
it maps the minimum intensity in the image to the minimum value in the range (84 ==> 0 in the example above).
In the same way, it maps the maximum intensity in the image to the maximum value in the range (153 ==> 255 in the example above).
This is why contrast stretching is unreliable: if the image already contains even two pixels with intensities 0 and 255, it has no effect at all.
However a better approach is Histogram Equalisation which uses probability distribution. You can learn the steps here
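To make the difference concrete, here is a minimal sketch of both techniques on a flat list of 8-bit pixel values. The function names are hypothetical, and the equalization uses the common CDF-based mapping; it is not a copy of any particular library's implementation.

```python
from collections import Counter


def contrast_stretch(pixels, out_min=0, out_max=255):
    """Linearly map [min(pixels), max(pixels)] onto [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [out_min] * len(pixels)
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (p - lo) * scale) for p in pixels]


def equalize_histogram(pixels, levels=256):
    """Map each value through the cumulative distribution of the image."""
    counts = Counter(pixels)
    cdf, running = {}, 0
    for value in sorted(counts):
        running += counts[value]
        cdf[value] = running
    cdf_min = min(cdf.values())
    total = len(pixels)
    if total == cdf_min:  # only one gray level present
        return list(pixels)
    return [
        round((cdf[p] - cdf_min) / (total - cdf_min) * (levels - 1))
        for p in pixels
    ]


narrow = [84, 100, 153]
print(contrast_stretch(narrow))        # 84 maps to 0, 153 maps to 255

outliers = [0, 100, 110, 120, 255]     # already contains 0 and 255
print(contrast_stretch(outliers))      # unchanged by stretching
print(equalize_histogram(outliers))    # mid-tones are spread apart
```

With the outlier pixels present, stretching leaves the mid-tones at 100 to 120, whereas equalization spreads them across the range because it depends on how many pixels have each value, not only on the extremes.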
I came across the following points after some reading.
Contrast stretching is all about increasing the difference between the maximum intensity value in an image and the minimum one. All the rest of the intensity values are spread out between this range.
Histogram equalization is about modifying the intensity values of all the pixels in the image such that the histogram is "flattened" (in reality, the histogram can't be exactly flattened, there would be some peaks and some valleys, but that's a practical problem).
In contrast stretching, there exists a one-to-one relationship of the intensity values between the source image and the target image i.e., the original image can be restored from the contrast-stretched image.
However, once histogram equalization is performed, there is no way of getting back the original image.
In Histogram equalization, you want to flatten the histogram into a uniform distribution.
In contrast stretching, you manipulate the entire range of intensity values. Like what you do in Normalization.
Contrast stretching is a linear normalization that stretches an arbitrary interval of the intensities of an image and fits the interval to an another arbitrary interval (usually the target interval is the possible minimum and maximum of the image, like 0 and 255).
Histogram equalization is a nonlinear normalization that stretches the area of histogram with high abundance intensities and compresses the area with low abundance intensities.
I think that contrast stretching broadens the histogram of the image intensity levels, so the intensity around the range of input may be mapped to the full intensity range.
Histogram equalization, on the other hand, maps all of the pixels to the full range according to the cumulative distribution function or probability.
Contrast is the difference between maximum and minimum pixel intensity.
Both methods are used to enhance contrast, more precisely, adjusting image intensities to enhance contrast.
During histogram equalization the overall shape of the histogram changes, whereas in contrast stretching the overall shape of the histogram remains the same.

How to apply box filter on integral image? (SURF)

Assuming that I have a grayscale (8-bit) image and assume that I have an integral image created from that same image.
Image resolution is 720x576. According to SURF algorithm, each octave is composed of 4 box filters, which are defined by the number of pixels on their side. The first octave uses filters with 9x9, 15x15, 21x21 and 27x27 pixels. The second octave uses filters with 15x15, 27x27, 39x39 and 51x51 pixels. The third octave uses filters with 27x27, 51x51, 75x75 and 99x99 pixels. If the image is sufficiently large and I guess 720x576 is big enough (right??!!), a fourth octave is added, 51x51, 99x99, 147x147 and 195x195. These octaves partially overlap one another to improve the quality of the interpolated results.
// so, we have:
//
// 9x9 15x15 21x21 27x27
// 15x15 27x27 39x39 51x51
// 27x27 51x51 75x75 99x99
// 51x51 99x99 147x147 195x195
The questions are: What are the values in each of these filters? Should I hardcode these values, or should I calculate them? How exactly (numerically) to apply filters to the integral image?
Also, for calculating the Hessian determinant I found two approximations:
det(HessianApprox) = DxxDyy − (0.9Dxy)^2 and
det(HessianApprox) = DxxDyy − (0.81Dxy)^2
Which one is correct?
(Dxx, Dyy, and Dxy are Gaussian second order derivatives).
I had to go back to the original paper to find the precise answers to your questions.
Some background first
SURF leverages a common Image Analysis approach for regions-of-interest detection that is called blob detection.
The typical approach for blob detection is a difference of Gaussians.
There are several reasons for this, the first one being to mimic what happens in the visual cortex of the human brain.
The drawback of difference of Gaussians (DoG) is its computation time, which is too expensive to be applied to large image areas.
In order to bypass this issue, SURF takes a simple approach. A DoG is simply the computation of two Gaussian averages (or equivalently, apply a Gaussian blur) followed by taking their difference.
A quick-and-dirty approximation (not so dirty for small regions) is to approximate the Gaussian blur by a box blur.
A box blur is the average value of all the images values in a given rectangle. It can be computed efficiently via integral images.
Using integral images
Inside an integral image, each pixel value is the sum of all the pixels that were above it and on its left in the original image.
The top-left pixel value in the integral image is thus 0, and the bottom-rightmost pixel of the integral image has thus the sum of all the original pixels for value.
Then, you just need to note that the box blur is the sum of all the pixels inside a given rectangle (not necessarily originating at the top-leftmost pixel of the image), divided by the rectangle's area, and apply the following simple geometric reasoning.
If you have a rectangle with corners ABCD (top left, top right, bottom left, bottom right), then the value of the box filter is given by:
boxFilter(ABCD) = A + D - B - C,
where A, B, C, D is a shortcut for IntegralImagePixelAt(A) (B, C, D respectively).
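The corner formula can be sketched in a few lines. This assumes the convention described above, where each integral-image entry excludes its own row and column, so the table has one extra row and column and its top-left entry is 0. The function names are hypothetical.

```python
def integral_image(img):
    """Build an (h+1) x (w+1) table where ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    a = ii[top][left]      # A: top left
    b = ii[top][right]     # B: top right
    c = ii[bottom][left]   # C: bottom left
    d = ii[bottom][right]  # D: bottom right
    return a + d - b - c


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(img)
total = box_sum(ii, 1, 1, 3, 3)  # bottom-right 2x2 block: 5 + 6 + 8 + 9
print(total, total / 4)          # sum, then box-blur average over 4 pixels
```

Whatever the rectangle size, the sum costs four lookups, which is why large box filters are as cheap as small ones.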
Integral images in SURF
SURF is not using box blurs of sizes 9x9, etc. directly.
What it uses instead is several orders of Gaussian derivatives, or Haar-like features.
Let's take an example. Suppose you are to compute the 9x9 filters output. This corresponds to a given sigma, hence a fixed scale/octave.
The sigma being fixed, you center your 9x9 window on the pixel of interest. Then, you compute the output of the 2nd order Gaussian derivative in each direction (horizontal, vertical, diagonal). The Fig. 1 in the paper gives you an illustration of the vertical and diagonal filters.
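On whether to hardcode the filter sizes: the sizes listed in the question follow a regular pattern, so they can also be generated. The rule below (each octave starts at the second size of the previous one and doubles the step) is inferred from that table and should be read as an assumption; the function name and parameters are hypothetical.

```python
def surf_filter_sizes(n_octaves=4, filters_per_octave=4,
                      first_size=9, first_step=6):
    """Reproduce the side lengths of the box filters listed per octave."""
    octaves = []
    size, step = first_size, first_step
    for _ in range(n_octaves):
        sizes = [size + i * step for i in range(filters_per_octave)]
        octaves.append(sizes)
        size = sizes[1]  # next octave starts at the second filter size
        step *= 2        # and the spacing between sizes doubles
    return octaves


for sizes in surf_filter_sizes():
    print(" ".join(f"{s}x{s}" for s in sizes))
```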
The Hessian determinant
There is a weighting factor to take into account the scale differences. Trusting the paper, the determinant is equal to:
Det = DxxDyy - (0.9 * Dxy)^2.
Since 0.9^2 = 0.81, this is the same as: Det = DxxDyy - 0.81*Dxy^2.
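A quick numerical check, using made-up filter responses, that the two expressions in this answer agree, while the second variant quoted in the question (with 0.81 inside the square) gives a different value. The function name is hypothetical.

```python
import math


def hessian_det(dxx, dyy, dxy, weight=0.9):
    """Approximate Hessian determinant with a weighted Dxy term."""
    return dxx * dyy - (weight * dxy) ** 2


dxx, dyy, dxy = 12.0, 7.5, 3.0  # made-up responses, for illustration only

weighted = hessian_det(dxx, dyy, dxy)          # Dxx*Dyy - (0.9*Dxy)^2
expanded = dxx * dyy - 0.81 * dxy ** 2         # Dxx*Dyy - 0.81*Dxy^2
misplaced = dxx * dyy - (0.81 * dxy) ** 2      # Dxx*Dyy - (0.81*Dxy)^2

print(math.isclose(weighted, expanded))   # same quantity
print(math.isclose(weighted, misplaced))  # a different quantity
```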
Look at page 17 of this document
http://www.sci.utah.edu/~fletcher/CS7960/slides/Scott.pdf
If you already have code for ordinary 2D Gaussian convolution, just use the box filter as the kernel in place of the Gaussian kernel, and use the original image as input rather than the integral image. The results from this method will be the same as the ones you asked for.

GPU-based Laplacian Pyramid

I have implemented an image blending method for seamless blending using plain C++. Now I want to convert this code for GPU (using OpenGL ES 2 Shaders for mobile devices). Basically the method creates Gaussian and Laplacian pyramids for each image which are then combined from low-resolution to top (see also the paper "The Laplacian Pyramid as a Compact Image Code" from Burt et al., 1983).
My problem is that the Laplacian pyramid levels can have negative values but my devices do not support float or integer type textures (using the ARB_texture_float extension e.g.).
I already looked for papers dealing with GPU-based pyramids but without finding something really useful.
How can I implement such a pyramid efficiently for a GPU?
Is it possible to calculate a Gaussian/Laplacian pyramid level without iterating through the preceding levels?
Regards,
EDIT
It seems as if there is no "good" way to calculate Laplacian pyramids completely on GPUs that support neither signed types (for instance via ARB_texture_float) nor types larger than a byte when the image's data range is [0..255], except using two passes (one for signs, one for values). My Laplacian pyramid runs perfectly on GPUs with the ARB_texture_float extension, but without the extension (and with some adjustments to compress the range) the pyramid gets "wrong" due to range compression.
The safest way for you to implement a Laplacian pyramid if your textures are unsigned integers is to store two pyramids - one pyramid that contains the gradient magnitude of the Laplacian and another pyramid that stores the sign of the pixel at that location.
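The two-pyramid idea amounts to a sign/magnitude encoding per texel. The sketch below shows it on the CPU in Python purely to illustrate the arithmetic; a fragment shader would apply the same split per pixel. The function names and the choice of 255 as the "negative" marker are assumptions.

```python
def encode_signed(value):
    """Split a Laplacian coefficient in [-255, 255] into two unsigned bytes."""
    magnitude = min(abs(int(value)), 255)  # goes into the magnitude texture
    sign = 255 if value < 0 else 0         # goes into the sign texture
    return magnitude, sign


def decode_signed(magnitude, sign):
    """Recombine the two bytes into the signed coefficient."""
    return -magnitude if sign >= 128 else magnitude


for coefficient in (-200, -1, 0, 37, 255):
    magnitude, sign = encode_signed(coefficient)
    print(coefficient, "->", (magnitude, sign), "->",
          decode_signed(magnitude, sign))
```

Unlike squeezing [-255, 255] into a single byte, this keeps the full 8-bit precision of the magnitude, at the cost of a second texture.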
Yes. Any level in a Gaussian or Laplacian pyramid has a closed form solution based on the sigma value that you want to compute. Consider the base case of a LoG pyramid computed at intervals of sigma = (2/3). The first level of the pyramid has sigma 2/3 and is produced simply by convolving with a 5x5 LoG filter with sigma 2/3. The second convolution with the same filter produces an LoG image with sigma 4/3, and finally the third has sigma 6/3, or 2, so we subsample the image to produce the next integer level of the pyramid. If you want to compute the LoG of an image at sigma 2, the levels at sigma 2/3 and 4/3 are not necessary - simply subsample the image one time and convolve with an LoG filter with sigma 1.
If you want to compute the LoG at sigma = 20, quad-subsample the image (16 pixel blocks become 1 pixel) to give you a sigma 16 image, then convolve once with a sigma 4/3 LoG filter.
