I am learning about HOG from here, a well-explained page with an example. What I don't understand is how this part works:
A 16×16 block has 4 histograms which can be concatenated to form a 36
x 1 element vector and it can be normalized just the way a 3×1 vector
is normalized.
How did this 36 x 1 vector come about, and how is it calculated? Also, is it compulsory to always use a 9-bin vector? Is that a fixed size for HOG?
Is it compulsory that we always need 9 bin vector?
Not necessarily. Dalal and Triggs stated in their original HOG paper that accuracy for their application (pedestrian detection) increased when using up to 9 bins; beyond that, accuracy did not improve any further. That is why 9 bins are commonly used.
How this 36*1 came and how we calculated it?
As already pointed out in the comments:
You have 9 bins per histogram (each bin will be a scalar value in your feature vector). In your example, a histogram was calculated for each 8 x 8 cell, meaning that a 16 x 16 block contains 4 histograms. Each of those histograms yields a 9 x 1 feature vector, so:
4 (histograms) * 9 (bins) = 36 x 1 feature vector.
You basically just concatenate your results into one vector.
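As a minimal sketch of that concatenation in plain Python (the histogram values are made up for illustration, and names such as `cell_histograms` are hypothetical):

```python
import math

# Four hypothetical 9-bin cell histograms from one 16 x 16 block
# (the numbers are invented purely for illustration).
cell_histograms = [
    [1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 1.0, 0.0],
    [2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 4.0],
    [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

# Concatenate: 4 histograms * 9 bins = 36 values.
block_vector = [v for hist in cell_histograms for v in hist]

# L2-normalize the 36 x 1 vector, exactly as one would a 3 x 1 vector.
norm = math.sqrt(sum(v * v for v in block_vector))
normalized_block = [v / norm for v in block_vector]

print(len(block_vector))  # 36
```

With a different number of bins, say 12, the same code would give a 48 x 1 block vector, which is why 36 is a consequence of the 9-bin choice rather than a fixed size.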
I have an image, a 2D array of uint8_ts. I want to resize the image using a separable filter. Consider shrinking the width first. Because original & target sizes are unrelated, we'll use a different set of coefficients for every destination pixel. For a particular in & out size, for all y, we might have:
out(500, y) = in(673, y) * 12 + in(674, y) * 63 + in(675, y) * 25
out(501, y) = in(674, y) * 27 + in(675, y) * 58 + in(676, y) * 15
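For reference, a plain scalar version of these two lines might look like the following Python sketch. It is not Eigen and not vectorized; the names `taps` and `filter_row` are hypothetical, and dividing by 100 is an assumption based only on the fact that the example weights happen to sum to 100.

```python
# Each output column maps to (first input column, run of contiguous integer weights).
taps = {
    500: (673, [12, 63, 25]),
    501: (674, [27, 58, 15]),
}

def filter_row(in_row, taps, weight_sum=100):
    """Apply a different short kernel at every output pixel of one row."""
    out = {}
    for x_out, (start, weights) in taps.items():
        # Accumulate in a wide integer, then scale back to the 8-bit range.
        acc = sum(in_row[start + k] * w for k, w in enumerate(weights))
        out[x_out] = acc // weight_sum  # assumed normalization
    return out

# A constant row should stay constant after filtering.
flat_row = [200] * 1000
result = filter_row(flat_row, taps)
```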
How can I use Eigen to speed this up, e.g. vectorize it for me?
This can be expressed as a matrix multiply with a sparse matrix, of dimension in_width * out_width, where in each row, only 3 out of the in_width values are non-zero. In the actual use case, 4 to 8 will typically be non-zero. But those non-zero values will be contiguous, and it would be nice to use SSE or whatever to make it fast.
Note that the source matrix has 8 bit scalars. The final result, after scaling both width & height will be 8 bits as well. It might be nice for the intermediate matrix (image) and filter to be higher precision, say 16 bits. But even if they're 8 bits, when multiplying, we'll need to take the most significant bits of the product, not the least significant.
Is this too far outside of what Eigen can do? This sort of convolution, with a kernel that's different at every pixel (only because the output size isn't an integral multiple of the input size), seems common enough.
The dimension of the image is 64 x 128. That is 8192 pixels, each with a gradient magnitude and direction. After the binning stage, we are left with 1152 values, since each 8 x 8 cell of 64 pixels is converted into 9 bins based on orientation. Can you please explain how, after L2 normalization, we get a 3780-dimensional vector?
Assumption: You have the gradients of the 64 x 128 patch.
Calculate Histogram of Gradients in 8x8 cells
This is where it starts to get interesting. The image is divided into 8x8 cells and a HOG is calculated for each 8x8 cells. One of the reasons why we use 8x8 cells is that it provides a compact representation. An 8x8 image patch contains 8x8x3 = 192 pixel values (color image). The gradient of this patch contains 2 values (magnitude and direction) per pixel which adds up to 8x8x2 = 128 values. These 128 numbers are represented using a 9-bin histogram which can be stored as an array of 9 numbers. This makes it more compact and calculating histograms over a patch makes this representation more robust to noise.
The histogram is essentially a vector of 9 bins corresponding to angles 0, 20, 40, 60 ... 160 for unsigned gradients.
16 x 16 Block Normalization
After creating the histogram based on the gradient of the image, we want our descriptor to be independent of lighting variations. Hence, we normalize the histogram. The vector norm for an RGB color [128, 64, 32] is sqrt(128*128 + 64*64 + 32*32) = 146.64, which is the well-known L2 norm. Dividing each element of this vector by 146.64 gives us a normalized vector [0.87, 0.44, 0.22]. If we were to multiply each element of the original vector by 2, the normalized vector would remain the same as before.
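The RGB example can be checked with a few lines of plain Python. This is only a sketch; the helper name `l2_normalize` is hypothetical.

```python
import math

def l2_normalize(vec):
    """Divide every element by the vector's L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

rgb = [128, 64, 32]
norm = math.sqrt(sum(v * v for v in rgb))

unit = l2_normalize(rgb)
# Doubling every element (e.g. brighter lighting) gives the same unit vector.
unit_doubled = l2_normalize([2 * v for v in rgb])

print(round(norm, 2))
print([round(v, 2) for v in unit])
print([round(v, 2) for v in unit_doubled])
```

The last two printed lists are identical, which is the lighting invariance the normalization is meant to provide.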
Although simply normalizing the 9 x 1 histogram is appealing, normalizing a bigger block of 16 x 16 is better. A 16 x 16 block has 4 histograms, which can be concatenated to form a 36 x 1 element vector, and it can be normalized the same way as the 3 x 1 vector in the example. The window is then moved by 8 pixels, a normalized 36 x 1 vector is calculated over this window, and the process is repeated.
Calculate the HOG feature vector
This is where your question comes in.
To calculate the final feature vector for the entire image patch, the 36 x 1 vectors are concatenated into one giant vector. Let us calculate the size:
How many positions of the 16 x 16 blocks do we have? There are 7 horizontal and 15 vertical positions, which gives 7 x 15 = 105 positions.
Each 16 x 16 block is represented by a 36 x 1 vector. So when we concatenate them all into one giant vector we obtain a 36 x 105 = 3780 dimensional vector.
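The same arithmetic as a short Python sketch, using the sizes from this thread (64 x 128 window, 8 x 8 cells, 16 x 16 blocks moved by 8 pixels, 9 bins). The variable names are just for illustration.

```python
window_w, window_h = 64, 128  # detection window in pixels
cell = 8                      # cell side in pixels
block = 16                    # block side in pixels
stride = 8                    # block step in pixels
bins = 9                      # histogram bins per cell

# Values after binning but before block normalization: one histogram per cell.
values_after_binning = (window_w // cell) * (window_h // cell) * bins

# Each block holds 2 x 2 cells, so 4 * 9 = 36 values per block.
block_len = (block // cell) ** 2 * bins

# Overlapping block positions along each axis.
blocks_x = (window_w - block) // stride + 1
blocks_y = (window_h - block) // stride + 1

descriptor_len = blocks_x * blocks_y * block_len

print(values_after_binning, blocks_x, blocks_y, descriptor_len)  # 1152 7 15 3780
```

The jump from 1152 to 3780 comes from the overlap: because the blocks move by 8 pixels, most cells belong to several blocks and are counted once per block.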
For more details, look at the tutorial where I learned.
Hope it helps!
There are 9 parameters in the fundamental matrix to relate the pixel co-ordinates of left and right images but only 7 degrees of freedom (DOF).
The reasoning given for this on several pages that I've found is:
Homogeneous equations mean we lose a degree of freedom
The determinant of F = 0, therefore we lose another degree of freedom.
I don't understand why those 2 reasons mean we lose 2 DOF - can someone explain it?
We initially have 9 DOF because the fundamental matrix is composed of 9 parameters, which implies that we need 9 corresponding points to compute the fundamental matrix (F). But because of the following two reasons, we only need 7 corresponding points.
Reason 1
We lose 1 DOF because we are using homogeneous coordinates. This is basically a way to represent nD points in vector form by adding an extra dimension, i.e., a 2D point (0,2) can be represented as [0,2,1], or in general [x,y,1]. There are useful properties when using homogeneous coordinates with 2D/3D transformations, but I'm going to assume you know that.
Now given the expression p and p' representing pixel coordinates:
p'=[u',v',1] and p=[u,v,1]
the fundamental matrix:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
and fundamental matrix equation:
p'^T F p = 0
when we multiply this expression out in algebraic form, we get the following:
uu'f1 + vu'f2 + u'f3 + uv'f4 + vv'f5 + v'f6 + uf7 + vf8 + f9 = 0.
In a homogeneous system of linear equation form Af=0 (basically the factorization of the above formula), we get two components A and f.
A:
[uu',vu',u', uv',vv',v',u,v,1]
f (f is essentially the fundamental matrix in vector form):
[f1,f2,f3,f4,f5,f6,f7,f8,f9]
Now if we look at the components of vector A, we have 8 unknowns, but one known value 1 because of homogeneous coordinates, and therefore we only need 8 equations now.
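To make the Af = 0 form concrete, here is a small plain-Python sketch. The matrix `F` and the two points are arbitrary made-up numbers (not a real fundamental matrix or a real correspondence), and the function names are hypothetical. It checks that the row of A dotted with f equals p'^T F p, and that multiplying F by a constant only rescales that value, so a zero stays zero.

```python
def epipolar_residual(F, p, p_prime):
    """Compute p'^T F p for 3x3 F and homogeneous points p, p'."""
    Fp = [sum(F[r][c] * p[c] for c in range(3)) for r in range(3)]
    return sum(p_prime[r] * Fp[r] for r in range(3))

def design_row(p, p_prime):
    """Row of A for one correspondence: [uu', vu', u', uv', vv', v', u, v, 1]."""
    u, v, _ = p
    up, vp, _ = p_prime
    return [u * up, v * up, up, u * vp, v * vp, vp, u, v, 1.0]

# Arbitrary illustrative values.
F = [[0.0, -1.0, 2.0],
     [1.0, 0.0, -3.0],
     [-2.0, 3.0, 0.0]]
f = [x for row in F for x in row]  # F flattened row by row: f1..f9
p = [10.0, 20.0, 1.0]
p_prime = [12.0, 19.0, 1.0]

lhs = epipolar_residual(F, p, p_prime)
rhs = sum(a * b for a, b in zip(design_row(p, p_prime), f))

# Scaling F by any non-zero constant scales the residual by the same constant.
F_scaled = [[5.0 * x for x in row] for row in F]
lhs_scaled = epipolar_residual(F_scaled, p, p_prime)

print(lhs, rhs, lhs_scaled)
```

Since F and 5F satisfy exactly the same set of equations p'^T F p = 0, the overall scale of F cannot be determined, which is the degree of freedom lost here.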
Reason 2
det F = 0.
A determinant is a value that can be obtained from a square matrix.
I'm not entirely sure about the mathematical details of this property but I can still infer the basic idea, and, hopefully, you can as well.
Basically given some matrix A
A = [a,b,c]
[d,e,f]
[g,h,i]
The determinant can be computed using this formula:
det A = aei+bfg+cdh-ceg-bdi-afh
If we look at the determinant using the fundamental matrix, the algebra would look something like this:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8)
Now we know the determinant of the fundamental matrix is zero:
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8) = 0
So, with the overall scale already fixed by Reason 1 (leaving 8 parameters), once we work out 7 of them we can obtain the last one from the determinant equation above.
Therefore the fundamental matrix has 7 DOF.
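A small plain-Python sketch of the determinant constraint follows. The matrix is an arbitrary singular example (its third row is the sum of the first two), not a real fundamental matrix, and `solve_f9` is a hypothetical helper that assumes f1*f5 - f2*f4 is non-zero.

```python
def det3(m):
    """Determinant of a 3x3 matrix: aei + bfg + cdh - ceg - bdi - afh."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * e * i + b * f * g + c * d * h - c * e * g - b * d * i - a * f * h

def solve_f9(f1, f2, f3, f4, f5, f6, f7, f8):
    """Recover f9 from det F = 0, assuming f1*f5 - f2*f4 != 0.

    det F is linear in f9: det F = f9 * (f1*f5 - f2*f4) + rest.
    """
    cofactor = f1 * f5 - f2 * f4
    rest = f2 * f6 * f7 + f3 * f4 * f8 - f3 * f5 * f7 - f1 * f6 * f8
    return -rest / cofactor

# Arbitrary singular matrix: row 3 = row 1 + row 2, so the rank is 2.
F = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [5.0, 7.0, 9.0]]

print(det3(F))                                            # 0.0
print(solve_f9(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 5.0, 7.0))   # 9.0
```

Given the first eight entries, the determinant equation alone returns the ninth, which is the sense in which det F = 0 removes one degree of freedom.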
The reasons why F has only 7 degrees of freedom are
F is a 3x3 homogeneous matrix. Homogeneous means there is a scale ambiguity in the matrix, so the scale doesn't matter (as shown in @Curator Corpus's example). This drops one degree of freedom.
F is a matrix with rank 2. It is not a full rank matrix, so it is singular and its determinant is zero (Proof here). The reason why F is a matrix with rank 2 is that it is mapping a 2D plane (image1) to all the lines (in image 2) that pass through the epipole (of image 2).
Hope it helps.
As for the highest-voted answer by nbro, I think it can be interpreted this way: by reason two, the matrix F has rank 2, so its determinant is zero, which acts as a constraint on the entries of f. So we only need 7 points to determine the remaining variables (f1-f8), together with that constraint. That gives 8 equations for 8 variables, leaving only one solution. So there are 7 DOF.
I set up an experiment with Gaussian blur on real-world and MR images. I printed some blurred test images and compare them with augmented images that are blurred too.
What is the best way to express how much blurring I applied in real-world coordinates?
The image is 2560x1440 pixels, corresponding to 533x300 cm in the real world. If this image is blurred with a Gaussian with standard deviation n (filter size is ceil(3 * n) * 2 + 1), how can this be expressed in centimeters? Is it reasonable to express it as the real size of the filter in centimeters?
In short, yes, it is perfectly reasonable to express the size of the kernel in real-world coordinates.
In your case, you have 533 cm == 2560 pixels horizontally, which is 0.2082 cm per pixel. (Please edit if the question has a mistake and this should be mm instead of cm.) Vertically you have approximately the same, so we can assume isotropic sampling and leave it at 0.208 cm/px.
Given that pixel size, a standard deviation of the Gaussian of n is equivalent to a standard deviation of 0.208*n cm in the real world.
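As a minimal Python sketch of that conversion, using the numbers from the question (the function name `blur_in_cm` and the example sigma of 5 pixels are only for illustration):

```python
import math

width_px, width_cm = 2560, 533.0
cm_per_px = width_cm / width_px  # about 0.2082 cm per pixel

def blur_in_cm(sigma_px):
    """Return (sigma in cm, full filter size in cm) for a sigma given in pixels."""
    # Filter size in pixels, as defined in the question.
    size_px = math.ceil(3 * sigma_px) * 2 + 1
    return sigma_px * cm_per_px, size_px * cm_per_px

sigma_cm, kernel_cm = blur_in_cm(5.0)
print(round(cm_per_px, 4), round(sigma_cm, 3), round(kernel_cm, 3))
```

Reporting the standard deviation in centimeters is usually the more meaningful of the two numbers, since the filter size depends on the truncation rule chosen (here 3 standard deviations on each side).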