Convolution kernel - opencv

In the following link http://homepages.inf.ed.ac.uk/rbf/HIPR2/linedet.htm it says that for detecting lines we need to specify the width and angle of the line - "to detect the presence of lines of a particular width n, at a particular orientation theta". The example convolution kernels are given for orientations of 0, 90, 45 and 135 degrees and a width of a single pixel.
What I don't understand is how the convolution kernel changes if I want thicker lines, i.e. a width of 3, 5 or 7 pixels at 0, 45, 90 or 135 degrees. And what if I want to change the angles as well - how do I change the convolution kernel?
I am new to image processing so my understanding is limited. A tutorial or any other help would be appreciated.

For thicker lines, you need a larger kernel in the conventions of your link: you will need more rows of 2's to detect lines of the width you are looking for. For a 3-pixel-wide horizontal line, you will need the following kernel.
-1 -1 -1 -1 -1
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
-1 -1 -1 -1 -1
and so on, depending on angles and widths.
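If it helps, here is a minimal Python/OpenCV sketch of applying such a kernel (the input file name is just a placeholder; note that cv2.filter2D computes correlation rather than true convolution, which makes no difference for this kernel because it is symmetric under a 180-degree flip):

import cv2
import numpy as np

# 5x5 kernel for a 3-pixel-wide horizontal line (same values as above)
kernel = np.array([[-1, -1, -1, -1, -1],
                   [ 2,  2,  2,  2,  2],
                   [ 2,  2,  2,  2,  2],
                   [ 2,  2,  2,  2,  2],
                   [-1, -1, -1, -1, -1]], dtype=np.float32)

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
response = cv2.filter2D(img, cv2.CV_32F, kernel)
# strong positive responses mark 3-pixel-wide horizontal lines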
If you want a kernel for orientations other than 0, 45, 90 and 135 degrees, it gets more complicated than the kernels for those four orientations. There are other methods you can use instead, for example the Hough transform: http://en.wikipedia.org/wiki/Hough_transform.

Related

Derive Sobel's operators?

The two operators for detecting and smoothing horizontal and vertical edges are shown below:
[-1 0 1]
[-2 0 2]
[-1 0 1]
and
[-1 -2 -1]
[ 0 0 0]
[ 1 2 1]
But after much Googling, I still have no idea where these operators come from. I would appreciate it if someone can show me how they are derived.
The formulation was proposed by Irwin Sobel a long time ago, I think around 1974. There is a great page on the subject here.
The main advantage of convolving the 9 pixels surrounding the one at which the gradient is to be estimated is that this simple operator is really fast and can be implemented with shifts and adds in low-cost hardware.
They are not the greatest edge detectors in the world - Google "Canny edge detector" for something better - but they are fast and suitable for a lot of simple applications.
So spatial filters, like the Sobel kernels, are applied by "sliding" the kernel over the image (this is called convolution). If we take this kernel:
[-1 0 1]
[-2 0 2]
[-1 0 1]
After applying the Sobel operator, each result pixel gets a:
high (positive) value if the pixels on the right side are bright and pixels on the left are dark
low (negative) value if the pixels on the right side are dark and pixels on the left are bright.
This is because in discrete 2D convolution, the result is the sum of each kernel value multiplied by the corresponding image pixel. Thus a vertical edge causes the value to have a large negative or positive value, depending on the direction of the edge gradient. We can then take the absolute value and scale to interval [0, 1], if we want to display the edges as white and don't care about the edge direction.
This works identically for the other kernel, except it finds horizontal edges.
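Not part of the original answer, but here is a small sketch (SciPy/NumPy, with a random array standing in for a real grayscale image) of what that sliding/convolution looks like in code:

import numpy as np
from scipy.ndimage import convolve

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T                   # the horizontal-edge kernel is just the transpose

img = np.random.rand(64, 64)          # stand-in for a real grayscale image
gx = convolve(img, sobel_x)           # responds to vertical edges
gy = convolve(img, sobel_y)           # responds to horizontal edges
magnitude = np.hypot(gx, gy)
magnitude /= magnitude.max()          # scale to [0, 1] for display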

Why does the Sobel operator look that way?

For image derivative computation, the Sobel operator looks like this:
[-1 0 1]
[-2 0 2]
[-1 0 1]
I don't quite understand 2 things about it:
1. Why is the centre pixel 0? Can't I just use an operator like the one below?
[-1 1]
[-1 1]
[-1 1]
2. Why is the centre row 2 times the other rows?
I googled my questions but didn't find any answer that convinced me. Please help me.
In computer vision, there's very often no perfect, universal way of doing something. Most often, we just try an operator, see its results and check whether they fit our needs. It's true for gradient computation too: the Sobel operator is one of many ways of computing an image gradient, which has proved its usefulness in many use cases.
In fact, the simplest gradient operator we could think of is even simpler than the one you suggest above:
[-1 1]
Despite its simplicity, this operator has a first problem: when you use it, you compute the gradient between two positions and not at one position. If you apply it to 2 pixels (x,y) and (x+1,y), have you computed the gradient at position (x,y) or (x+1,y)? In fact, what you have computed is the gradient at position (x+0.5,y), and working with half pixels is not very handy. That's why we add a zero in the middle:
[-1 0 1]
Applying this one to pixels (x-1,y), (x,y) and (x+1,y) will clearly give you a gradient for the center pixel (x,y).
This one can also be seen as the sum of two shifted [-1 1] filters: [-1 1 0], which computes the gradient at position (x-0.5,y), at the left of the pixel, and [0 -1 1], which computes the gradient at position (x+0.5,y), at the right of the pixel.
Now this filter still has another disadvantage: it's very sensitive to noise. That's why we decide not to apply it on a single row of pixels, but on 3 rows: this allows us to get an average gradient over these 3 rows, which will soften possible noise:
[-1 0 1]
[-1 0 1]
[-1 0 1]
But this one tends to average things a little too much: when applied to one specific row, we lose much of what makes the detail of this specific row. To fix that, we want to give a little more weight to the center row, which will allow us to get rid of possible noise by taking into account what happens in the previous and next rows, but still keeping the specificity of that very row. That's what gives the Sobel filter:
[-1 0 1]
[-2 0 2]
[-1 0 1]
Tampering with the coefficients can lead to other gradient operators such as the Scharr operator, which gives just a little more weight to the center row:
[-3 0 3 ]
[-10 0 10]
[-3 0 3 ]
There are also mathematical reasons for this, such as the separability of these filters... but I prefer seeing it as an experimental discovery which proved to have interesting mathematical properties, as experiment is in my opinion at the heart of computer vision.
Only your imagination is the limit to create new ones, as long as it fits your needs...
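To make the noise argument above concrete, here is a little experiment one could run (NumPy/SciPy; the noisy step-edge image and the scaling by 4 are my own choices, not part of the answer):

import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
# a vertical step edge plus noise
img = np.hstack([np.zeros((50, 25)), np.ones((50, 25))]) + 0.2 * rng.standard_normal((50, 50))

row_diff = np.array([[-1, 0, 1]], dtype=float)   # single-row central difference
sobel_x  = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]], dtype=float)

g1 = convolve(img, row_diff)
g2 = convolve(img, sobel_x) / 4.0    # divide by 4 so both estimates are on the same scale

# away from the edge the response should be ~0; the averaged (Sobel) estimate is less noisy
print(g1[:, :20].std(), g2[:, :20].std())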
EDIT: The true reason that the Sobel operator looks that way can be found by reading an interesting article by Sobel himself. My quick reading of this article indicates that Sobel's idea was to get an improved estimate of the gradient by averaging the horizontal, vertical and diagonal central differences. Now when you break the gradient into vertical and horizontal components, the diagonal central differences are included in both, while the vertical and horizontal central differences are only included in one. To avoid double counting, the diagonals should therefore have half the weights of the vertical and horizontal. The actual weights of 1 and 2 are just convenient for fixed point arithmetic (and actually include a scale factor of 16).
I agree with @mbrenon mostly, but there are a couple of points too hard to make in a comment.
Firstly in computer vision, the "Most often, we just try an operator" approach just wastes time and gives poor results compared to what might have been achieved. (That said, I like to experiment too.)
It is true that a good reason to use [-1 0 1] is that it centres the derivative estimate at the pixel. But another good reason is that it is the central difference formula, and you can prove mathematically that it gives a lower error in its estimate of the true derivative than [-1 1].
[1 2 1] is used to filter noise, as mbrenon said. The reason these particular numbers work well is that they are an approximation of a Gaussian, which is the only filter that does not introduce artifacts (although from Sobel's article, this seems to be a coincidence). Now if you want to reduce noise while finding a horizontal derivative, you want to filter in the vertical direction so as to least affect the derivative estimate. Convolving transpose([1 2 1]) with [-1 0 1] we get the Sobel operator, i.e.:
[1]                [-1 0 1]
[2] * [-1 0 1]  =  [-2 0 2]
[1]                [-1 0 1]
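A quick numerical check of that factorisation (NumPy/SciPy, just for illustration):

import numpy as np
from scipy.signal import convolve2d

smooth = np.array([[1], [2], [1]])      # column vector: vertical smoothing
diff   = np.array([[-1, 0, 1]])         # row vector: horizontal central difference

sobel_x = convolve2d(smooth, diff)      # same as np.outer([1, 2, 1], [-1, 0, 1])
print(sobel_x)
# [[-1  0  1]
#  [-2  0  2]
#  [-1  0  1]]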
For a 2D image you need a mask. Say this mask is:
[ a11 a12 a13;
a21 a22 a23;
a31 a32 a33 ]
Df_x (the gradient along x) should be produced from Df_y (the gradient along y) by a rotation of 90 degrees, i.e. the mask should be:
[ a11 a12 a11;
a21 a22 a21;
a31 a32 a31 ]
Now if we want to subtract the signal in front of the middle pixel from the signal behind it (that's what differentiation is in the discrete case: subtraction), we want to allocate the same weights to both sides of the subtraction, i.e. our mask becomes:
[ a11 a12 a11;
a21 a22 a21;
-a11 -a12 -a11 ]
Next, the sum of the weights should be zero, because when we have a smooth image (e.g. all 255s) we want a zero response, i.e. we get:
[ a11 a12 a11;
a21 -2a21 a21;
-a11 -a12 -a11 ]
In the case of a smooth image we also expect the differentiation along the X-axis to produce zero, i.e.:
[ a11 a12 a11;
0 0 0;
-a11 -a12 -a11 ]
Finally if we normalize we get:
[ 1 A 1;
0 0 0;
-1 -A -1 ]
and you can set A to anything you want experimentally. Setting A = 2 gives the original Sobel filter.
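As a sanity check of the derivation (my own sketch, not part of the answer): with A = 2 the mask above is the Sobel kernel for the vertical gradient (up to sign), and it produces a zero response on a flat image:

import numpy as np
from scipy.ndimage import correlate

A = 2
mask = np.array([[ 1,  A,  1],
                 [ 0,  0,  0],
                 [-1, -A, -1]], dtype=float)   # A = 2: Sobel vertical-gradient kernel, up to sign

flat = np.full((10, 10), 255.0)          # perfectly smooth image
print(correlate(flat, mask))             # all zeros: no response on a constant image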

Where to center the kernel when using FFTW for image convolution?

I am trying to use FFTW for image convolution.
At first just to test if the system is working properly, I performed the fft, then the inverse fft, and could get the exact same image returned.
Then, as a small step forward, I used the identity kernel (i.e., kernel[0][0] = 1 whereas all the other components equal 0). I took the component-wise product between the image and the kernel (both in the frequency domain), then did the inverse fft. Theoretically I should be able to get the identical image back. But the result I got is not even close to the original image. I suspect this has something to do with where I center my kernel before I fft it into the frequency domain (since I put the "1" at kernel[0][0], it basically means that I centered the positive part at the top left). Could anyone enlighten me about what goes wrong here?
For each dimension, the indexes of samples should be from -n/2 ... 0 ... n/2 -1, so if the dimension is odd, center around the middle. If the dimension is even, center so that before the new 0 you have one sample more than after the new 0.
E.g. -4, -3, -2, -1, 0, 1, 2, 3 for a width/height of 8 or -3, -2, -1, 0, 1, 2, 3 for a width/height of 7.
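For example, NumPy's fftshift shows exactly this ordering (NumPy is used here just for illustration; the same indexing applies to FFTW):

import numpy as np

# centred sample indexes for even and odd lengths, matching the ranges above
print(np.fft.fftshift(np.fft.fftfreq(8, d=1/8)))   # [-4. -3. -2. -1.  0.  1.  2.  3.]
print(np.fft.fftshift(np.fft.fftfreq(7, d=1/7)))   # [-3. -2. -1.  0.  1.  2.  3.]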
The FFT is relative to the middle; on its scale there are negative points.
In memory the points are 0...n-1, but the FFT treats them as -ceil(n/2)...floor(n/2), where 0 is -ceil(n/2) and n-1 is floor(n/2).
The identity matrix is a matrix of zeros with 1 in the 0,0 location (the center - according to the numbering above). (In the spatial domain.)
In the frequency domain the identity matrix should be a constant (all real values 1 or 1/(N*M) and all imaginary values 0).
If you do not receive this result, then the identity matrix might need padding differently (to the left and down instead of around all sides) - this may depend on the FFT implementation.
Center each dimension separately (this is an index centering, no change in actual memory).
You will probably need to pad the image (after centering) to a whole power of 2 in each dimension (2^n * 2^m where n doesn't have to equal m).
Pad relative to FFT's 0,0 location (to center, not corner) by copying existing pixels into a new larger image, using center-based-indexes in both source and destination images (e.g. (0,0) to (0,0), (0,1) to (0,1), (1,-2) to (1,-2))
Assuming your FFT uses regular floating point cells and not complex cells, the complex image has to be of size 2*ceil(n/2) * 2*ceil(m/2) even if you don't need a whole power of 2 (since it has half the samples, but the samples are complex).
If your image has more than one color channel, you will first have to reshape it, so that the channels are the most significant in the sub-pixel ordering, instead of the least significant. You can reshape and pad in one go to save time and space.
Don't forget the FFTSHIFT after the IFFT. (To swap the quadrants.)
The result of the IFFT is 0...n-1. You have to take pixels floor(n/2)+1..n-1 and move them before 0...floor(n/2).
This is done by copying pixels to a new image, copying floor(n/2)+1 to memory-location 0, floor(n/2)+2 to memory-location 1, ..., n-1 to memory-location floor(n/2), then 0 to memory-location ceil(n/2), 1 to memory-location ceil(n/2)+1, ..., floor(n/2) to memory-location n-1.
When you multiply in the frequency domain, remember that the samples are complex (one cell real then one cell imaginary) so you have to use a complex multiplication.
The result might need dividing by N^2*M^2 where N is the size of n after padding (and likewise for M and m). - You can tell this by (a. looking at the frequency domain's values of the identity matrix, b. comparing result to input.)
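Here is a compact sketch of the whole pipeline using NumPy's FFT instead of FFTW (my own addition; the indexing issues are the same, but note that NumPy's inverse FFT is already normalised, whereas FFTW's unnormalised transforms would need the division described above):

import numpy as np

img = np.random.rand(64, 64)

# spatial identity kernel with the 1 at the centre, padded to the image size
kernel = np.zeros_like(img)
kernel[kernel.shape[0] // 2, kernel.shape[1] // 2] = 1.0

# ifftshift moves the kernel centre to index (0, 0) before the forward FFT
K = np.fft.fft2(np.fft.ifftshift(kernel))
F = np.fft.fft2(img)
result = np.real(np.fft.ifft2(F * K))    # element-wise complex multiplication
# NumPy's ifft2 already includes the 1/(N*M) factor; with FFTW you divide yourself

print(np.allclose(result, img))          # True: the identity kernel returns the image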
I think that your understanding of the identity kernel may be off. An identity kernel should have the 1 at the center of the 2D kernel, not at the 0, 0 position.
For example, for a 3 x 3, you have yours set up as follows:
1, 0, 0
0, 0, 0
0, 0, 0
It should be
0, 0, 0
0, 1, 0
0, 0, 0
Check this out also:
What is the "do-nothing" convolution kernel
Also look here, at the bottom of page 3:
http://www.fmwconcepts.com/imagemagick/digital_image_filtering.pdf
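For instance, a quick check with SciPy (my own sketch, not part of the answer) confirms that the centred "do-nothing" kernel leaves the image untouched under spatial convolution:

import numpy as np
from scipy.ndimage import convolve

img = np.random.rand(16, 16)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                                # the 1 sits at the centre of the kernel

print(np.allclose(convolve(img, identity), img))    # True: a "do-nothing" kernel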
I took the component-wise product between the image and kernel in frequency domain, then did the inverse fft. Theoretically I should be able to get the identical image back.
I don't think that doing a forward transform with a non-fft kernel, and then an inverse fft transform should lead to any expectation of getting the original image back, but perhaps I'm just misunderstanding what you were trying to say there...

XNA/Directx Handedness (orientation)?

I'm using XNA (which uses DirectX) for some graphical programming. I had a box rotating around a point, but the rotations are a bit odd.
Everything seems like someone took a compass and rotated it 180 degrees, so that N is 180, W is 90, etc.
I can't quite seem to find a source that states the orientation, so I'm probably just not using the right keywords.
Can someone help me find what XNA/DirectX's orientation is, and a page that states this too?
DirectX uses a left-handed coordinate system.
XNA uses a right-handed coordinate system.
Forward is -Z, backward is +Z. Forward points into the screen.
Right is +X, left is -X. Right points to the right-side of the screen.
Up is +Y, down is -Y. Up points to the top of the screen.
The matrix layout is as follows (using an identity matrix in this example). XNA uses a row layout for its matrices. The first three rows represent orientation. The first three columns of the last row ([4, 1], [4, 2], and [4, 3]) represent translation/position. Here is documentation on XNA's Matrix Structure.
In the case of a transform matrix (a transform combines position and rotation):
Right 1 0 0 0
Up 0 1 0 0
Forward 0 0 -1 0
Pos 0 0 0 1
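If it helps to see that layout concretely, here is a small NumPy sketch (not XNA code; the translation values are made up) that pulls the basis vectors and position out of such a row-layout matrix, following the table above:

import numpy as np

# row-layout matrix as in the table above (translation values are hypothetical)
M = np.array([[ 1, 0,  0, 0],    # Right
              [ 0, 1,  0, 0],    # Up
              [ 0, 0, -1, 0],    # Forward (XNA's forward points along -Z)
              [ 5, 2, -3, 1]],   # translation in the first three columns of the last row
             dtype=float)

right, up, forward = M[0, :3], M[1, :3], M[2, :3]
position = M[3, :3]
print(right, up, forward, position)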

Convolution theory vs implementation

I study convolution in image processing as it is a part of the curriculum, I understand the theory and the formula but I am confused about its implementation.
The formula is:
g(x, y) = sum over (s, t) of h(s, t) * f(x - s, y - t)
What I understand
The convolution kernel is flipped both horizontally and vertically then the values in the kernel are multiplied by the corresponding pixel values, the results are summed, divided by "row x column" to get the average, and then finally this result is the value of the pixel at the center of the kernel location.
Confusion in implementation
When I run the example convolution program from my course material and insert as input a 3x3 convolution kernel where:
1st row: (0, 1, 0)
2nd row: (0, 0, 0)
3rd row: (0, 0, 0)
The processed image is shifted down by one pixel, where I expected it to shift upwards by one pixel. This result indicates that no horizontal or vertical flipping is done before calculating (as if it is doing correlation).
I thought there might be a fault in the program so I looked around and found that Adobe Flex 3 and Gimp are doing this as well.
I don't understand, is there something that I missed to notice?
Appreciate any help or feedback.
I guess the programs you tried implement correlation instead of convolution.
I've tried your filter in Mathematica using the ImageFilter function; the result is shifted upwards as expected.
I've also tried it in Octave (an open source Matlab clone):
imfilter([1,1,1,1,1;
          2,2,2,2,2;
          3,3,3,3,3;
          4,4,4,4,4;
          5,5,5,5,5],
         [0,1,0;
          0,0,0;
          0,0,0], "conv")
("conv" means convolution - imfilter's default is correlation). Result:
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
0 0 0 0 0
Note that the last row is different. That's because different implementations use different padding (by default). Mathematica uses constant padding for ImageConvolve, no padding for ListConvolve. Octave's imfilter uses zero padding.
Also note that (as belisarius mentioned) the result of a convolution can be smaller, same size or larger than the source image. (I've read the terms "valid", "same size" and "full" convolution in the Matlab and IPPI documentation, but I'm not sure if that's standard terminology). The idea is that the summation can either be performed
only over the source image pixels where the kernel is completely inside the image. In that case, the result is smaller than the source image.
over every source pixel. In that case, the result has the same size as the source image. This requires padding at the borders
over every pixel where any part of the kernel is inside the source image. In that case, the result image is larger than the source image. This also requires padding at the borders.
Please also note that a full convolution produces a result larger than the source image, so the "shifting" is not real, as the dimensions are affected.
