Conceptual queries on retrieving 'visually similar' images: Dense SIFT or other descriptor? - image-processing

I am posting 3 images from my dataset to show what my images look like:
http://s1306.photobucket.com/user/Bidisha_Chakraborty/library/?page=1
I am using the VLFeat DSIFT implementation with 4 orientation bins per descriptor instead of 8, so each descriptor is a 64-dimensional vector instead of 128. I use the original image scale, since my images are all taken from a fixed distance. I compute descriptors densely at intervals of 4/8 pixels. I have run several experiments varying the window size from 80*80 pixels down to 20*20 pixels, clustered the descriptors with various numbers of cluster centers, and finally used the Earth Mover's Distance as the similarity metric.
After tuning the window size and the number of words in various ways, I see that even for nearly similar images like 1 and 3, the distance metric says image 1 is more similar to image 2 than to image 3.
I ran Principal Component Analysis to look at the variance of the data. I expected images 1 and 2 to form separated clusters and images 1 and 3 to overlap. Since the first 3 dimensions I plotted account for less than 30% of the variance, I am fairly sure that including all dimensions (which I of course cannot visualize) would give even worse results.
Should I conclude that SIFT is not the best choice for my application, or am I missing something? I already used GLCM for these images and did not get good results.
Any suggestion for any other feature space is most welcome.
Thanks for any kind of insight.
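For reference, here is a rough sketch of the pipeline I described, with OpenCV's SIFT computed on a dense keypoint grid standing in for VLFeat's DSIFT (the grid step, patch size, and cluster count below are illustrative rather than my exact settings):

import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(gray, step=8, size=40.0):
    # Stand-in for VLFeat DSIFT: plain SIFT descriptors on a regular grid.
    # Note: OpenCV descriptors are 128-dimensional (8 orientation bins),
    # whereas my VLFeat setup with 4 orientation bins gives 64 dimensions.
    sift = cv2.SIFT_create()
    grid = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(0, gray.shape[0], step)
            for x in range(0, gray.shape[1], step)]
    _, descriptors = sift.compute(gray, grid)
    return descriptors

def bow_signature(descriptors, kmeans):
    # Visual-word histogram paired with the cluster centers, in the
    # (weight, coordinates...) layout that cv2.EMD expects.
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float32)
    hist /= hist.sum()
    return np.hstack([hist[:, None], kmeans.cluster_centers_.astype(np.float32)])

imgs = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ("img1.png", "img2.png", "img3.png")]
descs = [dense_sift(im) for im in imgs]
kmeans = KMeans(n_clusters=50, n_init=10).fit(np.vstack(descs))
sig1, sig3 = bow_signature(descs[0], kmeans), bow_signature(descs[2], kmeans)
emd, _, _ = cv2.EMD(sig1, sig3, cv2.DIST_L2)
print("EMD(image 1, image 3):", emd)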

Related

Rescaling input to CNNs

What is the general consensus on rescaling images that have different sizes? I have read that one approach is to rescale the largest side of an image to a fixed size. It's not clear to me how rescaling only one of the dimensions would lead to uniform image shapes across the dataset.
Are there other approaches, e.g. would it work to take the average size of the two dimensions and then rescale the dimensions of each image to the mean of each dimension across the dataset?
Is it important which interpolation method is used in the rescaling?
Would it make sense to simply take an n x m part of each image and cut off the rest?
Is there a list of approaches people have used and how they perform in different scenarios?
It depends on the target application of the CNN. For object detection/classification, a sliding-window approach or cropping is usually used. In the sliding-window case, a window is moved around the image and a prediction is made for every patch (with different overlap criteria). These predictions are then filtered with pooling or other filtering strategies.
For image segmentation (aka semantic segmentation), similar approaches are used: 1) scale the image, segment it, and scale the result back to its original size; 2) segment different image patches separately; or 3) sliding-window segmentation + max pooling. With option (3) each pixel gets N = H x W votes (where H x W is the size of the sliding window). These N predictions are then aggregated by a maximum-voting classifier (similar to ensembling in Random Forests and other classifiers).
So, in short, I believe there is no short or unique answer to this question. The decision you take will depend on the goal you are trying to achieve with the CNN, and of course the quality of your preprocessing will have an impact on the CNN's performance. I don't know of any systematic study of this, though.
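As a concrete illustration of the two most common options (uniform rescaling versus resize-and-crop), here is a minimal sketch with OpenCV; the 224x224 target size is just an example:

import cv2

TARGET = 224  # example network input size

def rescale(img, size=TARGET):
    # Option 1: ignore the aspect ratio and resize both dimensions to the target.
    # INTER_AREA is usually preferred for shrinking, INTER_LINEAR/INTER_CUBIC for enlarging.
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)

def center_crop(img, size=TARGET):
    # Option 2: resize the shorter side to the target, then cut out a central n x m window.
    h, w = img.shape[:2]
    scale = size / min(h, w)
    resized = cv2.resize(img, (round(w * scale), round(h * scale)),
                         interpolation=cv2.INTER_AREA)
    h, w = resized.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return resized[top:top + size, left:left + size]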

What is the difference between layers and octaves in SIFT/SURF?

I read both of Lowe's papers ('99 & '04) and I would say I understood most of them. I have watched all the SIFT-related lectures on YouTube, but none explicitly explains why we use both octaves and layers.
I understand that you get more layers within the same octave by ~calculating the ~Laplacian for different sigmas, then you resample to half the resolution to get the next octave and again ~calculate the ~Laplacian for the same sigmas as in the first octave, and you repeat this as many times as you like.
Initially, I thought that you use the layers (multiple sigmas) to find features of different sizes in one image, and that you then resample so that descriptors are calculated at every octave (resampling level) for every feature, giving descriptors at different scales that might better match descriptors from the other image at a similar scale. Apparently I was wrong: only one descriptor is calculated for each feature, and since it is calculated from gradient orientations it is ~invariant to scale anyway.
But this leaves me wondering: why do we need to resample at all? Why can't or shouldn't we just use a high number of layers and a single octave (no resampling)? Is it just because resampling is cheaper? And if so, why don't we rely on resampling alone?
I did an experiment using OpenCV to see how and what is detected. Here are my observations:
1 octave, 1 layer => all features have the exact same size, as expected; 263 matches found.
1 octave, 2 layers => all features from the 1o1l test are found, plus some other features that are about 1.35x larger than the small ones; 326 matches found.
2 octaves, 1 layer => most features from the 1o1l test are found (maybe all), plus some other features that are exactly 2x as big, which is again expected since I resampled at half resolution; 318 matches found.
2 octaves, 2 layers => features come in sizes of 1x, 1.35x, or 2x; I couldn't find any at 2.7x as I would have expected. Only 299 matches found. I suppose that with multiple closer layers, more things looked too much alike and failed the ratio test, so more layers might actually decrease the number of tie points.
Note: the ~ sign means 'sort of'. I use it when I know it is not the exact explanation, but the exact one would be longer and wouldn't add any value to the question.
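A sketch of how such match counts can be reproduced with OpenCV's current Python bindings, which expose the layers-per-octave parameter directly (the number of octaves is derived from the image size, so it is not set here); the file names and the 0.75 ratio are illustrative:

import cv2

def ratio_test_matches(path1, path2, n_octave_layers=3, ratio=0.75):
    # Count SIFT matches that survive Lowe's ratio test for a given layer count.
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create(nOctaveLayers=n_octave_layers)
    _, des1 = sift.detectAndCompute(img1, None)
    _, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = 0
    for m, n in matcher.knnMatch(des1, des2, k=2):
        if m.distance < ratio * n.distance:  # Lowe's ratio test
            good += 1
    return good

for layers in (1, 2, 3):
    print(layers, "layers:", ratio_test_matches("left.png", "right.png", layers))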

Image retrieval - edge histogram

My lecturer has slides on edge histograms for image retrieval, in which he states that one must first divide the image into 4x4 blocks and then check for edges at the horizontal, vertical, +45°, and -45° orientations. He then states that this is represented in a 14x1 histogram. I have no idea how he decided that a 14x1 histogram must be created. Does anyone know how he came up with this value, or how to create an edge histogram?
Thanks.
The thing you are referring to is called the Histogram of Oriented Gradients (HoG). However, the math doesn't work out for your example. Normally you choose spatial binning parameters (the 4x4 blocks). For each block, you compute the gradient magnitude at some number of different directions (in your case, 4 directions: horizontal, vertical, +45°, and -45°). So in each block you have N_directions measurements. Multiply this by the number of blocks (16 for you), and you see that you have 16*N_directions total measurements.
To form the histogram, you simply concatenate these measurements into one long vector. Any order of concatenation is fine as long as you keep track of how you map each block/direction combination to a slot in the 1-D histogram. This long concatenated histogram is then most often used for machine learning tasks, such as training a classifier to recognize some aspect of images based on the way their gradients are oriented.
But in your case, the professor must be doing something special, because if you have 16 image blocks (a 4x4 grid of blocks), then you would need fewer than one measurement per block to end up with a total of only 14 measurements in the overall histogram.
Alternatively, the professor might mean that you take the range of angles between [-45°, +45°] and divide it into 14 values: -45, -45 + 90/14, -45 + 2*90/14, and so on.
If that is what the professor means, then you get 14 orientation bins within a single block, and once everything is concatenated you have one very long 14*16 = 224-component vector describing the whole image.
Incidentally, I have done a lot of testing with Python implementations of the Histogram of Oriented Gradients, so you can see some of the work linked here or here. There is also some example code at that site, though a more well-supported version of HoG appears in scikits.image (now scikit-image).
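To make the bookkeeping concrete, here is a small sketch with scikit-image's hog function; a 4x4 grid of cells with 4 orientation bins gives the 16*N_directions vector described above, and 14 bins per cell gives the 224-component variant (the random image is just a placeholder):

import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 128)  # placeholder grayscale image
cell = (image.shape[0] // 4, image.shape[1] // 4)  # a 4x4 grid of cells

# 4 orientation bins per cell -> 16 * 4 = 64 values
features_4 = hog(image, orientations=4, pixels_per_cell=cell,
                 cells_per_block=(1, 1), feature_vector=True)
print(features_4.shape)  # (64,)

# 14 orientation bins per cell -> 16 * 14 = 224 values
features_14 = hog(image, orientations=14, pixels_per_cell=cell,
                  cells_per_block=(1, 1), feature_vector=True)
print(features_14.shape)  # (224,)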

Fast and quick pixel matching algorithm

I am stuck on a pixel-matching algorithm for finding symbols in an image. I have two symbol images that I intend to find in a high-resolution target image.
Instead of a pixel-by-pixel matching algorithm, is there a fast algorithm that gives the same result? The result should be: (number of matching pixels) divided by (total pixels).
My problem is that I wish to find certain symbols in a 1-bit image. The symbols appear almost exactly in the target image, with about 95% of their pixels matching the corresponding block, but the search takes hours of iteration. The image is 10k x 10k and the symbol size is 20 x 20, so a brute-force scan needs on the order of 10^10 calculations, which is too much to handle. Is there any filter/NN combination or any other algorithm that can give the same results as pixel matching in a few minutes?
The point here is that the pixels are almost the same; the problem is only that the image is very large. I do not want complex features for noise handling, edges, fuzzy matching, etc., just a simple algorithm that does the pixel matching quickly, with the result being (number of matching pixels) divided by (total pixels).
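One way to get exactly this score for every placement of the symbol, without an explicit per-pixel loop, is OpenCV's template matching: for a 0/1 image, TM_SQDIFF is simply the count of mismatching pixels at each offset. A minimal sketch (variable names are illustrative):

import cv2
import numpy as np

def match_fraction_map(image, symbol):
    # Both inputs are binary (0/1) arrays. For 0/1 data, the squared difference
    # at each pixel is 0 or 1, so TM_SQDIFF equals the number of mismatching
    # pixels, and 1 - sqdiff / (symbol area) is (matched pixels) / (total pixels).
    sqdiff = cv2.matchTemplate(image.astype(np.float32),
                               symbol.astype(np.float32), cv2.TM_SQDIFF)
    return 1.0 - sqdiff / symbol.size

# Usage sketch: report every location where at least 95% of the pixels match.
# scores = match_fraction_map(big_binary_image, symbol_20x20)
# ys, xs = np.where(scores >= 0.95)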
Object recognition is tricky in that any simple algorithm is generally going to be way too slow, as you've apparently realized.
Luckily, if you have a rather large collection of these images on hand that are already correctly labeled, then I have a very simple solution for you.
Simply make a 3-layer feedforward network with one input unit per pixel, all of which connect to a much smaller hidden layer, which in turn connects to one output unit (representing which symbol is present in the image). Then just run the backpropagation algorithm on your dataset until the network learns to identify the symbols.
Unfortunately, this doesn't scale very well, so you might have to look into convolutional NNs for better performance.
Additionally, if you don't have any training data (i.e. labeled examples), then your best bet is probably to decompose your symbols into features and then sweep the image for those. If you can decompose them into lines, a Hough transform can do this quite rapidly.
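A minimal sketch of the small feedforward classifier described above, using scikit-learn's MLPClassifier on flattened 20x20 patches (the patch size, hidden-layer width, and the random stand-in data are illustrative):

import numpy as np
from sklearn.neural_network import MLPClassifier

# X: labeled 20x20 binary patches flattened to 400-dimensional rows,
# y: 1 if the patch contains the symbol, 0 otherwise (random stand-ins here).
X = np.random.randint(0, 2, size=(1000, 400)).astype(np.float32)
y = np.random.randint(0, 2, size=1000)

# One input unit per pixel, a much smaller hidden layer, one output unit;
# fit() runs backpropagation until convergence or max_iter is reached.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, y)

# Scanning the big image then means scoring each 20x20 window:
# prob = clf.predict_proba(window.reshape(1, -1))[0, 1]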
Maybe an ART-1 (Adaptive Resonance Theory) network could help.
The algorithm can also be written so that all prototypes are checked in parallel at the same time, and it can be blazingly fast because it essentially relies heavily on binary math.

Determine if an image needs contrasting automatically in OpenCV

OpenCV has a handy cvEqualizeHist() function that works great on faded/low-contrast images.
However, when an already high-contrast image is given, the result is a low-contrast one. I understand the reason: the histogram gets redistributed evenly.
The question is: how do I tell the difference between a low-contrast and a high-contrast image?
I'm operating on grayscale images and setting their contrast properly so that thresholding them won't delete the text I'm supposed to extract (that's a different story).
Suggestions are welcome, especially on how to find out whether the majority of the pixels in the image are light gray (which would mean histogram equalization should be performed).
Please help!
EDIT: thanks everyone for the many informative answers. The standard deviation calculation was sufficient for my requirements, so I'm taking that as the answer to my query.
You can probably just use a simple statistical measure of the image to determine whether it has sufficient contrast. The variance of the image would be a good starting point: if the variance is below a certain threshold (to be determined empirically), you can consider the image to be "low contrast".
If you're adjusting contrast just so you can threshold later on, you may be able to avoid the contrast-adjustment step altogether by setting your threshold adaptively using Otsu's method.
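For instance, with OpenCV both ideas come down to a couple of lines; the cut-off value and the file name below are placeholders:

import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Global statistic: call the image low contrast if its standard deviation is small.
LOW_CONTRAST_STD = 25.0  # empirically determined cut-off (illustrative value)
low_contrast = gray.std() < LOW_CONTRAST_STD

# Otsu's method picks the binarization threshold from the histogram itself,
# which often makes a separate contrast-adjustment step unnecessary.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)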
If you're still interested in finding out the image contrast, then read on.
There are a number of different ways to calculate "contrast". Often these metrics are applied locally rather than to the entire image, to make the result more sensitive to image content:
Divide the image into adjacent, non-overlapping neighborhoods.
Pick neighborhood sizes that are approximately the size of the features in your image (e.g. if your main feature is horizontal text, make the neighborhoods tall enough to capture two lines of text, and about as wide).
Apply the metric to each neighborhood individually.
Threshold the metric results to separate low- and high-variance blocks. This prevents things like large blank areas of the page from skewing your contrast estimate.
From there, you can use a number of features to determine contrast:
The proportion of high-metric blocks to low-metric blocks
The mean of the high-metric blocks
The intensity distance between the high- and low-metric blocks (using means, modes, etc.)
This may serve as a better indication of image contrast than global image variance alone. Here's why:
[Example image 1: stddev 50.6]
[Example image 2: stddev 7.9]
Both images have perfect contrast (the grey background is just there to make it obvious each is an image), yet their standard deviations (and thus variances) are completely different.
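A rough sketch of the block-wise procedure above (the block size and the variance cut-off are things to tune to your content):

import numpy as np

def block_stats(gray, block=(32, 32)):
    # Standard deviation and mean of each non-overlapping block of a grayscale image.
    h, w = gray.shape
    bh, bw = block
    trimmed = gray[:h - h % bh, :w - w % bw].astype(np.float32)
    blocks = trimmed.reshape(h // bh, bh, w // bw, bw).swapaxes(1, 2)
    return blocks.std(axis=(2, 3)), blocks.mean(axis=(2, 3))

def local_contrast_features(gray, block=(32, 32), std_cutoff=10.0):
    # Proportion of "busy" blocks and the intensity gap between busy and flat blocks.
    stds, means = block_stats(gray, block)
    busy = stds > std_cutoff
    proportion_busy = busy.mean()
    # Guard against images that are entirely busy or entirely flat.
    gap = abs(means[busy].mean() - means[~busy].mean()) if 0 < busy.sum() < busy.size else 0.0
    return proportion_busy, gap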
Calculate the cumulative histogram of the image.
Fit a linear regression to the cumulative histogram, in the form y(x) = A*x + B.
Calculate the RMSE of real_cumulative_frequency(x) - y(x).
If that RMSE is close to zero, the image is already equalized. (For an equalized image, the cumulative histogram must be linear.)
The idea is taken from here.
EDIT:
I've illustrated this approach in my blog (C example code included).
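A minimal Python sketch of the same check (the near-zero cut-off is something to calibrate on your own images):

import numpy as np

def equalization_rmse(gray):
    # RMSE between the cumulative histogram and its best straight-line fit.
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    cumulative = np.cumsum(hist).astype(np.float64)
    x = np.arange(256)
    a, b = np.polyfit(x, cumulative, 1)  # linear regression y(x) = A*x + B
    return np.sqrt(np.mean((cumulative - (a * x + b)) ** 2))

# Close to zero: the cumulative histogram is already nearly linear, i.e. the image
# is effectively equalized; a large value means equalization will change it a lot.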
skimage provides support for this: skimage.exposure.is_low_contrast (see the skimage reference documentation).
Example:
>>> import numpy as np
>>> from skimage.exposure import is_low_contrast
>>> image = np.linspace(0, 0.04, 100)
>>> is_low_contrast(image)
True
>>> image[-1] = 1
>>> is_low_contrast(image)
True
>>> is_low_contrast(image, upper_percentile=100)
False
