Image retrieval - edge histogram - image-processing

My lecturer has slides on edge histograms for image retrieval, whereby he states that one must first divide the image into 4x4 blocks, and then check for edges at the horizontal, vertical, +45°, and -45° orientations. He then states that this is then represented in a 14x1 histogram. I have no idea how he came about deciding that a 14x1 histogram must be created. Does anyone know how he came up with this value, or how to create an edge histogram?
Thanks.

The thing you are referring to is called a Histogram of Oriented Gradients (HoG). However, the math doesn't work out for your example. Normally you choose spatial binning parameters (the 4x4 blocks). For each block, you compute the gradient magnitude at some number of different directions (in your case, four: horizontal, vertical, +45°, and -45°). So in each block you'll have N_{directions} measurements. Multiply this by the number of blocks (16 for you), and you see that you have 16*N_{directions} total measurements.
To form the histogram, you simply concatenate these measurements into one long vector. Any way to do the concatenation is fine as long as you keep track of the way you map the bin/direction combo into a slot in the 1-D histogram. This long histogram of concatenations is then most often used for machine learning tasks, like training a classifier to recognize some aspect of images based upon the way their gradients are oriented.
But in your case, the professor must be doing something special, because if you have 16 different image blocks (a 4x4 grid of image blocks), then you'd need to compute less than 1 measurement per block to end up with a total of 14 measurements in the overall histogram.
Alternatively, the professor might mean that you take the range of angles between [-45°, +45°] and divide it into 14 different values: -45, -45 + 90/14, -45 + 2*90/14, and so on.
If that is what the professor means, then in that case you get 14 orientation bins within a single block. Once everything is concatenated, you'd have one very long 14*16 = 224-component vector describing the whole image overall.
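Either way, the mechanics are the same: compute a per-block histogram of gradient orientations and concatenate the blocks. Here is a minimal NumPy sketch, assuming a 2-D grayscale array, a 4x4 grid of blocks, and an arbitrary number of orientation bins (the function name and defaults are just illustrative):

import numpy as np

def block_edge_histogram(gray, grid=(4, 4), n_bins=4):
    # Per-block histogram of gradient orientations, weighted by gradient
    # magnitude, concatenated into one vector of grid_y * grid_x * n_bins bins.
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                       # orientation in [-pi, pi]
    bh, bw = gray.shape[0] // grid[0], gray.shape[1] // grid[1]
    histograms = []
    for by in range(grid[0]):
        for bx in range(grid[1]):
            block = (slice(by * bh, (by + 1) * bh), slice(bx * bw, (bx + 1) * bw))
            hist, _ = np.histogram(angle[block], bins=n_bins,
                                   range=(-np.pi, np.pi),
                                   weights=magnitude[block])
            histograms.append(hist)
    return np.concatenate(histograms)                # e.g. 16 * 4 = 64 bins

With n_bins=14, each of the 16 blocks contributes 14 bins and you get the 224-component vector described above; with 4 orientation bins per block you get 64.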
Incidentally, I have done a lot of testing with Python implementations of Histograms of Oriented Gradients, so you can see some of the work linked here or here. There is also some example code at that site, though a more well-supported version of HoG appears in scikits.image.
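If you go that route, usage looks roughly like this. This is a minimal sketch against the current scikit-image API (the package has since been renamed from scikits.image to scikit-image); the parameter values shown are just common defaults, not a recommendation:

from skimage import data
from skimage.feature import hog

image = data.camera()                  # any 2-D grayscale array works
features = hog(image,
               orientations=9,         # number of orientation bins per cell
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)    # one long concatenated histogram
print(features.shape)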

Related

How does background removal help reduce computation in CNN?

I read in many papers that a background removal preprocessing step helps reduce the amount of computation. But why is this the case? My understanding is that the CNN works on a rectangular window no matter how it is filled, whether with zeros or positive values.
See this for an example.
In the paper you provide, it seems that they do not pass the entire image to the network. Instead, they seem to be selecting smaller patches from the non-white background. This makes sense because it reduces the noise in their data, but it also reduces computational complexity, because of the effect it has on fully connected layers.
Suppose the input image is of size h*w. In your CNN, the image passes through a series of convolutions and max-poolings, and as a result, right before the first fully connected layer, you end up with a feature map of size
sz=m*(h/k)*(w/d)
where m is the number of feature planes, and where k and d depend on the number of layers and on the parameters of the convolution and max-pooling modules (e.g. the size of the convolution kernel). Usually we'll have d == k. Now, assume you feed this to a fully connected layer that produces a vector of q outputs. What this layer does is basically a matrix multiplication
A*x
where A is a matrix of size q*sz, and x is just your feature map written as a vector.
Now, assume you pass a patch of size (h/t)*(w/t) to the network. You end up with a feature map of size
sz/(t^2)
Given the size of the images in their dataset, this is a considerable reduction in the number of parameters. Also, smaller patches mean larger batches, and that too can accelerate training (better gradient approximation).
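To make the arithmetic concrete, here is a toy calculation; all the numbers below are made up for illustration and are not taken from the paper:

# Illustrative numbers only, not the paper's values.
m, q = 256, 1024            # feature planes before the FC layer, FC output size
h, w = 2048, 2048           # full input image
k = d = 32                  # overall downsampling factor of the conv/pool stack

sz_full = m * (h // k) * (w // d)    # feature-map size for the whole image
weights_full = q * sz_full           # entries in the FC weight matrix A (q*sz)

t = 8                                # use patches of size (h/t) x (w/t)
sz_patch = sz_full // (t ** 2)
weights_patch = q * sz_patch

print(sz_full, weights_full)         # 1048576 and ~1.1e9 weights
print(sz_patch, weights_patch)       # 16384 and ~1.7e7 weights

The fully connected weight matrix shrinks by the same factor t^2 as the feature map, which is where most of the savings come from.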
I hope this helps.
Edit, following #wlnirvana's comment: Yes, patch size is a hyperparameter. In the example I gave, it is set by choosing t. Given the size of the images in the dataset, I'd say something like t >= 6 would be realistic. As for how this relates to background removal, to quote the paper (section 3.1):
"To reduce computation time and to focus our analysis on regions of the slide most likely to contain cancer metastasis..."
This means that they select patches only around areas that are not background. This makes sense, since passing a completely white patch to the network would just be a waste of time (in figure 1, you can see how many white/gray/useless patches you would get if you selected them randomly, without removing the background). I didn't find any explanation of how patch selection is done in their paper, but I assume something like selecting a number of pixels p_1, ..., p_n in the non-background regions and considering n patches of size (h/t)*(w/t) around each of them would make sense.

Haar-like features to detect objects

I was studying the Viola-Jones paper to better understand their object detection algorithm and to produce a working program. In the last paragraph of the features section, the authors talk about the base resolution of the detector, which is 24x24, and say that the exhaustive set of rectangle features is quite large, over 180,000, and that, unlike the Haar basis, the set of rectangle features is overcomplete. Does this mean that every single rectangle feature is 24 by 24, or does it simply mean that we divide a given image into 24x24 blocks? Is 180,000 the result of computing several types of Haar-like features for every 24x24 block? I also couldn't understand the last part, which states that the set of rectangle features is overcomplete. What does being overcomplete mean when we are talking about rectangle features? Thanks.
Every rectangle feature in the 24x24 window gives you only one number, as stated earlier in the same paragraph: "The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions" and "A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally, a four-rectangle feature computes the difference between diagonal pairs of rectangles."
An explanation of the number 180,000 can be found in:
Viola-Jones' face detection claims 180k features
An overcomplete set means that you have some features that are linear combinations of other features. In the case of 24x24 rectangle features, we can build a linear basis for this space by taking all the images that have value 1 at a single pixel and zero everywhere else. If we count how many such basis elements there are, we get 24*24 = 576, which is much less than 180,000. This means that the set of 180,000 contains rectangles that can be obtained as combinations of other rectangles in the set.
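To get a feel for where a number of that order comes from, here is a small sketch that enumerates the positions and scales of the five standard rectangle-feature shapes inside a 24x24 window. With these five shapes it counts 162,336 features, the same order as the ~180,000 quoted in the paper (the exact figure depends on which shapes are counted), and far more than the 576 needed for a basis, which is the overcompleteness described above:

def count_rectangle_features(W=24, H=24):
    # (sx, sy): how many unit rectangles each feature type is built from
    # horizontally and vertically (two-, three- and four-rectangle features).
    shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
    total = 0
    for sx, sy in shapes:
        for w in range(sx, W + 1, sx):         # feature widths, multiples of sx
            for h in range(sy, H + 1, sy):     # feature heights, multiples of sy
                total += (W - w + 1) * (H - h + 1)   # possible top-left corners
    return total

print(count_rectangle_features())   # 162336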

GPUImage Taking sum of columns of image

I'm using GPUImage in my project and I need an efficient way of taking column sums. The naive way would obviously be to retrieve the raw data and add up the values of every column. Can anybody suggest a faster way to do that?
One way to do this would be to use the approach I take with the GPUImageAverageColor class (as described in this answer), only instead of reducing the total size of each frame at each step, only do this for one dimension of the image.
The average color filter determines the average color of the overall image by stepping down in a factor of four in both X and Y, averaging 16 pixels into one at each step. If operating in a single direction, you should be able to use hardware interpolation to get an 18X reduction in a single direction per step with good performance. Your final step might either require a quick CPU-based iteration on the much smaller image or a tweaked version of this shader that pulls the last few pixels in a column together into the final result pixel for that column.
You'll notice that I've been talking about averaging here, because the output values for any OpenGL ES operation need to be in terms of colors, which only have a 0-255 range per channel. A sum would easily overflow this, but you could use an average as an approximation of your sum, with a more limited dynamic range.
If you only care about one color channel, you could possibly encode a larger value into the RGBA channels and maintain a 32-bit sum that way.
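If you go that route, the idea is just to split a 32-bit value across the four 8-bit channels. Here is a CPU-side sketch of the bit layout only; in a real GPUImage filter this packing would of course live in the GLSL shader:

def pack_rgba(value):
    # Split a 32-bit unsigned sum into four 8-bit "color" channels.
    value &= 0xFFFFFFFF
    return (value & 0xFF,           # R: lowest byte
            (value >> 8) & 0xFF,    # G
            (value >> 16) & 0xFF,   # B
            (value >> 24) & 0xFF)   # A: highest byte

def unpack_rgba(r, g, b, a):
    return r | (g << 8) | (b << 16) | (a << 24)

column_sum = 1234567
assert unpack_rgba(*pack_rgba(column_sum)) == column_sum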
Beyond what I describe above, you could look at performing this sum with the help of the Accelerate framework. While probably not quite as fast as doing a shader-based reduction, it might be good enough for your needs.

How to match texture similarity in images?

What are the ways in which to quantify the texture of a portion of an image? I'm trying to detect areas that are similar in texture in an image, sort of a measure of "how closely similar are they?"
So the question is what information about the image (edge, pixel value, gradient etc.) can be taken as containing its texture information.
Please note that this is not based on template matching.
Wikipedia doesn't give much detail on actually implementing any of the texture analyses.
Do you want to find two distinct areas in the image that look the same (same texture), or match a texture in one image to another?
The second is harder due to different radiometry.
Here is a basic scheme for measuring the similarity of two areas:
1. Write a function that takes an area of the image as input and calculates a scalar value, like average brightness. This scalar is called a feature.
2. Write more such functions to obtain about 8-30 features, which together form a vector that encodes information about the area of the image.
3. Calculate this vector for both areas that you want to compare.
4. Define a similarity function that takes two vectors and outputs how alike they are.
You need to focus on steps 2 and 4.
Step 2: Use the following features: std() of brightness, some kind of corner detector, an entropy filter, a histogram of edge orientations, a histogram of FFT frequencies (x and y directions). Use color information if available.
Step 4: You can use cosine similarity, min-max, or weighted cosine.
After you implement about 4-6 such features and a similarity function, start running tests. Look at the results and try to understand why or where it doesn't work. Then add a specific feature to cover that case.
For example, if you see that a texture with big blobs is regarded as similar to a texture with tiny blobs, then add a morphological filter that calculates the density of objects with size > 20 sq. pixels.
Iterate this process (identify a problem, design a specific feature for it) about 5 times and you will start to get very good results.
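A minimal sketch of steps 1-4 above, assuming grayscale patches as NumPy arrays and using only a few of the suggested features (brightness statistics, an edge-orientation histogram, a rough entropy estimate); the function names are just illustrative:

import numpy as np

def texture_features(patch):
    # Steps 1-2: a handful of scalar features describing a grayscale patch.
    patch = patch.astype(float)
    feats = [patch.mean(), patch.std()]                  # brightness statistics
    gy, gx = np.gradient(patch)
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=8, range=(-np.pi, np.pi),
                           weights=np.hypot(gx, gy))
    feats.extend(hist / (hist.sum() + 1e-9))             # edge-orientation histogram
    counts, _ = np.histogram(patch, bins=32)
    p = counts[counts > 0] / counts.sum()
    feats.append(-(p * np.log2(p)).sum())                # rough entropy estimate
    return np.array(feats)

def cosine_similarity(u, v):
    # Step 4: how alike two feature vectors are (1.0 = same direction).
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# score = cosine_similarity(texture_features(area1), texture_features(area2))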
I'd suggest using wavelet analysis. Wavelets are localized in both time and frequency and, through multiresolution analysis, give a better signal representation than the Fourier transform does.
There is a paper explaining a wavelet approach to texture description. There is also a comparison method.
You might need to slightly modify an algorithm to process images of arbitrary shape.
An interesting approach for this is to use Local Binary Patterns.
Here is a basic example and some explanations: http://hanzratech.in/2015/05/30/local-binary-patterns.html
See that method as one of the many different ways to get features from your pictures. It corresponds to the 2nd step of DanielHsH's method.
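A short scikit-image sketch of that idea; the resulting LBP histogram of a patch can then be compared with the same kind of similarity function as in the answer above (the function name and parameter values are just illustrative):

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_patch, P=8, R=1):
    # Histogram of "uniform" local binary patterns for one grayscale patch;
    # the uniform method yields integer codes in the range [0, P + 1].
    lbp = local_binary_pattern(gray_patch, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2))
    return hist / (hist.sum() + 1e-9)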

Fast and quick pixel matching algorithm

I am stuck on a pixel matching algorithm for finding symbols in an image. I have two images of symbols that I intend to find in a very high-resolution image.
Instead of a pixel-by-pixel matching algorithm, is there a fast algorithm that gives the same result as the pixel matching algorithm? The result should be similar to: (percentage of pixels matched) divided by (total pixels).
My problem is that I wish to find certain symbols in a 1-bit image. The symbols appear with exact similarity in the target image, and 95% of the total pixels match the target block in the image, but it takes hours to do the iterations. The image is 10k x 10k and the symbol size is 20 x 20, so it would take on the order of 10^10 calculations, which is too much to handle. Is there any filter/NN combination or any other algorithm that can give the same results as pixel matching in a few minutes?
The point here is that the pixels are almost the same, but the problem is that the size is very large. I do not want complex features for noise handling, edges, fuzziness, etc., just a simple algorithm that does pixel matching quickly, with a result similar to: (percentage of pixels matched) divided by (total pixels).
Object recognition is tricky in that any simple algorithm is generally going to be way too slow, as you've apparently realized.
Luckily, if you have a rather large collection of these images on hand that are already correctly labeled, then I have a very simple solution for you.
Simply make a 3-layer feedforward network with one input unit per pixel, all of which connect to a much smaller hidden layer, which in turn connects to 1 output unit (representing which symbol is present in the image). Then just run the backpropagation algorithm on your dataset until the network learns to identify the symbols.
Unfortunately, this doesn't scale very well, so you might have to look into convolutional NNs for better performance.
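If you do have labeled 20x20 symbol patches, here is a sketch of that idea using scikit-learn's MLPClassifier as a stand-in for the hand-rolled backpropagation network described above; the layer size and the placeholder data are arbitrary:

import numpy as np
from sklearn.neural_network import MLPClassifier

# X: n labeled 20x20 binary patches flattened to 400 inputs, y: symbol labels.
# Random placeholder data here; substitute your real labeled patches.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 400)).astype(float)
y = rng.integers(0, 2, size=1000)              # e.g. symbol A vs. symbol B

net = MLPClassifier(hidden_layer_sizes=(64,),  # one small hidden layer
                    activation="logistic",
                    max_iter=500)
net.fit(X, y)                                  # backpropagation under the hood
# predictions = net.predict(candidate_patches.reshape(-1, 400))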
Additionally, if you don't have any training data (i.e. labeled examples), then your best bet is probably to decompose your symbols into features and then sweep the image for those. If you can decompose them into lines, then a Hough transform can do this quite rapidly.
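For the line-based route, a short scikit-image sketch (assuming a binary input image; hough_line_peaks returns the dominant line angles and offsets, which you could then match against the lines of your symbol):

import numpy as np
from skimage.transform import hough_line, hough_line_peaks

def dominant_lines(binary_image, num_angles=180):
    # Return (angle, distance) pairs of the strongest straight lines.
    angles = np.linspace(-np.pi / 2, np.pi / 2, num_angles, endpoint=False)
    accumulator, thetas, dists = hough_line(binary_image, theta=angles)
    _, peak_thetas, peak_dists = hough_line_peaks(accumulator, thetas, dists)
    return list(zip(peak_thetas, peak_dists))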
Maybe an ART-1 (Adaptive Resonance Theory) network could help.
The algorithm can also be written so that all prototypes are checked in parallel at the same time, and it can be blazingly fast because it essentially relies heavily on binary math.
