Final vector in Histogram of Oriented Gradients (HOG) - OpenCV

The dimensions of the image are 64 x 128, i.e. 8192 pixels, which gives 8192 magnitude values and 8192 direction values. After the binning stage we are left with 1152 values, since each 8x8 cell (64 pixels) is converted into 9 bins based on orientation. Can you please explain how, after L2 normalization, we end up with a 3780-dimensional vector?

Assumption: You have the gradients of the 64 x 128 patch.
Calculate Histogram of Gradients in 8x8 cells
This is where it starts to get interesting. The image is divided into 8x8 cells and a HOG is calculated for each 8x8 cell. One of the reasons we use 8x8 cells is that they provide a compact representation. An 8x8 image patch contains 8x8x3 = 192 pixel values (for a color image). The gradient of this patch contains 2 values (magnitude and direction) per pixel, which adds up to 8x8x2 = 128 values. These 128 numbers are represented using a 9-bin histogram, which can be stored as an array of 9 numbers. This makes the representation more compact, and calculating histograms over a patch also makes it more robust to noise.
The histogram is essentially a vector of 9 bins corresponding to the angles 0, 20, 40, ..., 160, since we use unsigned gradients (0 to 180 degrees).
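As a rough sketch of how one such 9-bin, magnitude-weighted histogram could be computed for a single 8x8 cell (assuming you already have per-pixel magnitude and direction arrays, e.g. from cv2.cartToPolar; real implementations such as OpenCV's additionally split each vote between the two neighbouring bins):

import numpy as np

def cell_histogram(mag, ang, bins=9):
    # mag, ang: 8x8 arrays of gradient magnitude and direction (degrees)
    hist = np.zeros(bins)
    bin_width = 180.0 / bins                 # 20 degrees per bin
    ang = ang % 180.0                        # unsigned gradients
    idx = (ang // bin_width).astype(int) % bins
    for b in range(bins):
        hist[b] = mag[idx == b].sum()        # magnitude-weighted votes
    return hist

# toy usage on a random 8x8 cell
rng = np.random.default_rng(0)
print(cell_histogram(rng.random((8, 8)), rng.random((8, 8)) * 360))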
16 x 16 Block Normalization
After creating the histograms from the gradients of the image, we want the descriptor to be independent of lighting variations, so we normalize the histogram. Consider an RGB color vector [128, 64, 32]. Its L2 norm is sqrt(128*128 + 64*64 + 32*32) = 146.64. Dividing each element of this vector by 146.64 gives the normalized vector [0.87, 0.44, 0.22]. If we were to multiply each element of the original vector by 2, its norm would double as well, so the normalized vector would remain the same as before; normalization removes that scaling.
We could simply normalize each 9x1 histogram, but normalizing over a bigger 16 x 16 block works better. A 16 x 16 block contains 4 histograms, which are concatenated to form a 36 x 1 vector and normalized the same way as the 3 x 1 vector in the example. The window is then moved by 8 pixels, a normalized 36 x 1 vector is calculated over the new window, and the process is repeated.
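A minimal sketch of this block normalization (variable names are illustrative; OpenCV's HOGDescriptor actually defaults to a clipped variant, L2-Hys):

import numpy as np

def normalize_block(cell_hists, eps=1e-6):
    # concatenate four 9-bin cell histograms into a 36 x 1 block descriptor
    v = np.concatenate(cell_hists)
    return v / (np.linalg.norm(v) + eps)     # L2 normalization

# the 3-element example from the text
v = np.array([128.0, 64.0, 32.0])
print(v / np.linalg.norm(v))                 # ~[0.87, 0.44, 0.22]
print((2 * v) / np.linalg.norm(2 * v))       # identical: the scale factor cancels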
Calculate the HOG feature vector
This is where your question comes in.
To calculate the final feature vector for the entire image patch, the 36 x 1 vectors are concatenated into one giant vector. Let us calculate its size:
How many positions of the 16 x 16 block do we have? There are 7 horizontal and 15 vertical positions, which gives 7 x 15 = 105 positions.
Each 16 x 16 block is represented by a 36 x 1 vector. So when we concatenate them all into one giant vector we obtain a 36 x 105 = 3780 dimensional vector.
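You can also verify the 3780 figure with OpenCV itself, since the default HOGDescriptor uses exactly these parameters (64x128 window, 16x16 block, 8x8 block stride, 8x8 cell, 9 bins):

import cv2
import numpy as np

hog = cv2.HOGDescriptor()                    # default people-detector geometry
print(hog.getDescriptorSize())               # 3780

patch = np.zeros((128, 64), dtype=np.uint8)  # rows x cols = 128 x 64
print(hog.compute(patch).shape)              # (3780,) or (3780, 1) depending on version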
For more details, have a look at the tutorial I learned this from.
Hope it helps!

Related

Find eigenvalues without computing the distance matrix in Ncuts implementation

I want to implement the Ncuts algorithm for an image of size 1248 x 378 x 1, but the adjacency matrix will be (1248 x 378) x (1248 x 378), which needs about 800 GB of RAM. Even if most of it is zero, it still needs too much memory, and I do need this matrix to compute the normalized cut. Is there any way I can find the eigenvalues without actually calculating the whole matrix?
If most of the matrix is zero, then don't use a dense format.
Instead, use a sparse matrix.
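A rough SciPy sketch of the pattern (purely illustrative: the affinity matrix here is random, and for Ncuts you would actually solve the generalized eigenproblem (D - W)x = lambda * Dx, but the point is that the eigenvalues are computed iteratively and the dense matrix is never formed):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# for the full image this would be n = 1248 * 378 nodes; a small toy size
# keeps the sketch quick to run
n = 5000

# store only the non-zero affinities (weights between nearby pixels)
W = sp.random(n, n, density=1e-3, format='csr', random_state=0)
W = 0.5 * (W + W.T)                          # affinities are symmetric

d = np.asarray(W.sum(axis=1)).ravel()
L = sp.diags(d) - W                          # Laplacian D - W used by Ncuts

vals, vecs = eigsh(L, k=3, which='SA')       # a few smallest eigenpairs
print(vals)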

How best to use Eigen to filter & resize an image?

I have an image, a 2D array of uint8_ts. I want to resize the image using a separable filter. Consider shrinking the width first. Because original & target sizes are unrelated, we'll use a different set of coefficients for every destination pixel. For a particular in & out size, for all y, we might have:
out(500, y) = in(673, y) * 12 + in(674, y) * 63 + in(675, y) * 25
out(501, y) = in(674, y) * 27 + in(675, y) * 58 + in(676, y) * 15
How can I use Eigen to speed this up, e.g. vectorize it for me?
This can be expressed as a matrix multiply with a sparse matrix, of dimension in_width * out_width, where in each row, only 3 out of the in_width values are non-zero. In the actual use case, 4 to 8 will typically be non-zero. But those non-zero values will be contiguous, and it would be nice to use SSE or whatever to make it fast.
Note that the source matrix has 8 bit scalars. The final result, after scaling both width & height will be 8 bits as well. It might be nice for the intermediate matrix (image) and filter to be higher precision, say 16 bits. But even if they're 8 bits, when multiplying, we'll need to take the most significant bits of the product, not the least significant.
Is this too far outside of what Eigen can do? This sort of convolution, with a kernel that is different at every pixel (only because the output size isn't an integral multiple of the input size), seems common enough.

How to represent the magnitude of a Fourier transform of an image in 8-bit format?

I have computed the Fourier transform of an 8-bit (256-level) grayscale image, but I'm not sure how to represent the output in a visible format.
This matrix represents the original image:
0 127 127 195
0 255 255 195
While this matrix represents the Fourier transform of the image:
1154 + 0j -382 + 8j -390 + 0j -382 - 8j
-256 + 0j 128 + 128j 0 + 0j 128 - 128j
From what I know, the magnitude can be computed as sqrt((r)^2+(i)^2) where r is the real component and i is the imaginary component. However, this yields values outside of the range that can be represented in 8 bits. How do I correct this?
Typically, one takes the log magnitude of each complex FFT result value (ignoring ones with magnitude zero) and then scales the result so that the maximum expected value maps to 255 (the scale factor will depend on the dimensions and input gain of the 2D image).
Since the dynamic range of the spectrum is quite different from that of the original spatial signal, it is difficult to display it directly in the original 8-bit format. You can use log(1 + x) to shrink the range and then scale into the 8-bit range.
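A minimal NumPy sketch of that log-and-rescale step, using the 2 x 4 example above:

import numpy as np

img = np.array([[0, 127, 127, 195],
                [0, 255, 255, 195]], dtype=float)

F = np.fft.fft2(img)                         # complex spectrum
mag = np.abs(F)                              # sqrt(r^2 + i^2)

log_mag = np.log1p(mag)                      # log(1 + x) compresses the range
scaled = (255 * log_mag / log_mag.max()).astype(np.uint8)
print(scaled)                                # displayable as an 8-bit image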

histogram equalization transformation function

Suppose that a given 3-bit image (L = 8) of size 64 x 64 pixels (MN = 4096) has the intensity distribution shown below. How do I obtain the histogram equalization transformation function
and then compute the equalized histogram of the image?
Rk nk
0 800
1 520
2 970
3 660
4 330
5 450
6 260
7 106
"Histogram Equalization is the process of obtaining transformation function automatically. So you need not have to worry about shape and nature of transformation function"
So in histogram equalization, the transformation function is calculated using the cumulative frequency approach, and this process is automatic. From the histogram of the image we determine the cumulative histogram, c, rescaling the values as we go so that they occupy the available intensity range (here 0 to L-1 = 7). In this way, c becomes a look-up table that can subsequently be applied to the image in order to carry out equalization.
rk   nk    c      sk = c/MN   (L-1)*sk   rounded value
0    800   800    0.195       1.365      1
1    520   1320   0.322       2.254      2
2    970   2290   0.559       3.913      4
3    660   2950   0.720       5.040      5
4    330   3280   0.801       5.601      6
5    450   3730   0.911       6.377      6
6    260   3990   0.974       6.818      7
7    106   4096   1.000       7.000      7
Now the equalized histogram is therefore
rk nk
0 0
1 800
2 520
3 0
4 970
5 660
6 330 + 450 = 780
7 260 + 106 = 366
The algorithm for equalization can be given as
Compute a scaling factor, α= 255 / number of pixels
Calculate histogram of the image
Create a look up table c with
c[0] = α * histogram[0]
for all remaining grey levels, i, do
c[i] = c[i-1] + α * histogram[i]
end for
for all pixel coordinates, x and y, do
g(x, y) = c[f(x, y)]
end for
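A small NumPy sketch of the same procedure, reproducing the 3-bit example above (array names are illustrative):

import numpy as np

L = 8                                            # 3-bit image, grey levels 0..7
nk = np.array([800, 520, 970, 660, 330, 450, 260, 106])
MN = nk.sum()                                    # 4096 pixels

c = np.cumsum(nk)                                # cumulative histogram
sk = c / MN                                      # normalized CDF
lut = np.round((L - 1) * sk).astype(int)         # look-up table: [1 2 4 5 6 6 7 7]
print(lut)

eq = np.zeros(L, dtype=int)                      # equalized histogram
np.add.at(eq, lut, nk)                           # accumulate counts at mapped levels
print(eq)                                        # [  0 800 520   0 970 660 780 366]

# applying the LUT to an actual image f is then simply: g = lut[f]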
But there is a problem with histogram equalization and that is mainly because it is a completely automatic technique, with no parameters to set. At times, it can improve our ability to interpret an image dramatically. However, it is difficult to predict how beneficial equalization will be for any given image; in fact, it may not be of any use at all. This is because the improvement in contrast is optimal statistically, rather than perceptually. In images with narrow histograms and relatively few grey levels, a massive increase in contrast due to histogram equalisation can have the adverse effect of reducing perceived image quality. In particular, sampling or quantisation artefacts and image noise may become more prominent.
The alternative to obtaining the transformation (mapping) function automatically is histogram specification. In histogram specification, instead of requiring a flat histogram, we specify a particular shape explicitly. We might wish to do this in cases where it is desirable for a set of related images to have the same histogram, in order, perhaps, that a particular operation produces the same results for all images.
Histogram specification can be visualised as a two-stage process. First, we transform the input image by equalisation into a temporary image with a flat histogram. Then we transform this equalised, temporary image into an output image possessing the desired histogram. The mapping function for the second stage is easily obtained. Since a rescaled version of the cumulative histogram can be used to transform a histogram of any shape into a flat histogram, it follows that the inverse of the cumulative histogram will perform the inverse transformation from a flat histogram to one with a specified shape.
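A rough sketch of that two-stage idea for discrete images, where a closest-CDF lookup plays the role of the inverse mapping (function and array names are illustrative):

import numpy as np

def specify_histogram(image, target_hist):
    # map `image` so its histogram approximates `target_hist`
    L = len(target_hist)
    src_hist = np.bincount(image.ravel(), minlength=L)
    src_cdf = np.cumsum(src_hist) / image.size               # stage 1: equalize
    tgt_cdf = np.cumsum(target_hist) / np.sum(target_hist)
    # stage 2: for each source level, pick the target level whose CDF is closest
    lut = np.array([np.argmin(np.abs(tgt_cdf - s)) for s in src_cdf])
    return lut[image]

img = np.random.randint(0, 8, size=(64, 64))                 # toy 3-bit image
flat = np.full(8, 64 * 64 // 8)                              # desired flat histogram
out = specify_histogram(img, flat)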
For more details about histogram equalization and mapping functions with C and C++ code
https://programming-technique.blogspot.com/2013/01/histogram-equalization-using-c-image.html

How does Huffman encoding construct the image (JPEG) from DCT coefficients?

I have a 512x512 image and I tried to recompress it. Here are the steps for recompressing an image to a JPEG file:
1) convert RGB to YCrCb
2) perform downsampling on Cr and Cb
3) convert YCrCb to DCT and quantize according to the chosen quality
4) perform Huffman encoding on the quantized DCT
But before Huffman encoding I counted the number of DCT coefficients and it is 393216. Dividing it by 64 tells me the number of 8x8 DCT blocks, which is 6144.
Then I tried to count the number of 8x8 blocks in the pixel domain: 512/8 = 64, which gives me 64 blocks horizontally and 64 blocks vertically. 64 x 64 = 4096, which is not equal to the number of DCT blocks, while the number of pixels is 512 x 512 = 262144.
My question is: how does Huffman encoding magically transform 393216 coefficients into 262144 pixels, recover each pixel value, and arrive at the dimensions (512x512) of the compressed JPEG image?
Thank you in advance. :D
If your image was encoded with no color subsampling, then there would be a 1:1 ratio of 8x8 coefficient blocks to 8x8 color component blocks. Each MCU (minimum coded unit) would be 8x8 pixels and have 3 8x8 coefficient blocks. 512x512 pixels = 64x64 8x8 blocks x 3 (one each for Y, Cr and Cb) = 12288 coefficient blocks.
Since you said you subsampled the color (I assume in both directions), you will now have 6 8x8 blocks for each MCU. Without subsampling, an MCU covers 8x8 pixels and needs 3 coefficient blocks; with subsampling in both directions, the MCU size becomes 16x16 pixels, and each 16x16 block of pixels needs 6 8x8 coefficient blocks to define it (4 Y, 1 Cr, 1 Cb). If you divide the image into 16x16 MCUs, you get 32x32 MCUs, each with 6 8x8 blocks, which gives 32 x 32 x 6 = 6144 coefficient blocks. So, to answer your question, the Huffman encoding is not what changes the number of coefficients; it's the color subsampling. Part of the compression that comes from using color subsampling in JPEG images exploits a feature of the human visual system: our eyes are more sensitive to changes in luminance than in chrominance.
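The block accounting can be checked with a few lines (assuming 4:2:0, i.e. chroma halved in both directions):

width = height = 512
mcu = 16                                     # MCU is 16x16 pixels with 4:2:0
mcus = (width // mcu) * (height // mcu)      # 32 * 32 = 1024 MCUs
blocks_per_mcu = 4 + 1 + 1                   # 4 Y + 1 Cb + 1 Cr 8x8 blocks
coeff_blocks = mcus * blocks_per_mcu
print(coeff_blocks, coeff_blocks * 64)       # 6144 blocks, 393216 coefficients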
Huffman encoding doesn't transform coefficients to pixels or anything like that, at least not the Huffman encoding I'm thinking of. All Huffman encoding does is take a list of tokens and represent them with fewer bits based on the frequency of those tokens.
An example: you have tokens a, b, c, and d.
Uncompressed, each of your tokens would require 2 bits (00, 01, 10, and 11).
Let's say a=00, b=01, c=10, and d=11.
Then aabaccda would be represented as 0000010010101100 (16 bits).
But with Huffman encoding you'd represent a with fewer bits because it's more common, and you'd represent b and d with more because they're less common, something to the effect of:
a=0, b=110, c=10, d=111, and then
aabaccda would be represented as 00110010101110 (14 bits).
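A compact sketch of building such a code with a heap (plain textbook Huffman, not JPEG's particular table format; for this input it happens to reproduce the code above):

import heapq
from collections import Counter

def huffman_code(tokens):
    freq = Counter(tokens)
    # heap entries: (frequency, tie-breaker, {token: code-so-far})
    heap = [(f, i, {t: ''}) for i, (t, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {t: '0' + code for t, code in c1.items()}
        merged.update({t: '1' + code for t, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, n, merged))
        n += 1
    return heap[0][2]

tokens = "aabaccda"
codes = huffman_code(tokens)
encoded = ''.join(codes[t] for t in tokens)
print(codes)                                     # {'a': '0', 'c': '10', 'b': '110', 'd': '111'}
print(encoded, len(encoded), "bits")             # 00110010101110 14 bits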
Your image is 512x512 pixels
The Y component is 512x512 hence 262144 pixels turned into 262144 DCT coefficients
The Cb and Cr components are downsampled by 2 hence 256x256 pixels turned into 65536 DCT coefficients each.
The sum of all DCT coefficients is 262144+65536+65536 = 393216.
Huffman has nothing to do with this.
