Content Based Image Retrieval (CBIR): Bag of Features or Descriptors Matching? - image-processing

I've read a lot of papers about the Nearest Neighbor problem, and it seems that indexing techniques like randomized kd-trees or LSH has been successfully used for Content Based Image Retrieval (CBIR), which can operate in high dimensional space. One really common experiment is given a SIFT query vector, find the most similar SIFT descriptor in the dataset. If we repeat the process with all the detected SIFT descriptors we can find the most similar image.
However, another popular approach is using Bag of Visual Words and convert all the SIFT descriptors detected into an huge sparse vector, which can be indexed with the same text techniques (e.g. inverted index).
My question is: these two different approaches ( matching the SIFT descriptors through Nearest Neighbor technique VS Bag of Features on SIFT descriptors + invert index) are extremely different and I don't understand which one is better.
If the second approach is better, what is the application of Nearest Neighbor in Computer Vision / Image Processing?

Oh boy, you are asking a question that even the papers can't answer, I think. In order to compare, one should take the state-of-the-art technologies of both approaches and compare them, measure speed, accuracy and recall. The one with the best characteristics is better than the other.
Personally, I hadn't heard much of the Bag of Visual Words, I had used the bag of words model only in text related projects, not images-relevant ones. Moreover, I am pretty sure I have seen many people use the 1st approach (including me and our research).
That's the best I got, so if I were you I would search for a paper that compares these two approaches, and if I couldn't find one, I would find the best representative of both approaches (the link you posted has a paper of 2009, that's old I guess), and check their experiments.
But be careful! In order to compare the approaches by the best representatives, you need to make sure that the experiments of each paper are super-relevant, the machines used are of the same "powerness", the data used are of the same nature and size, and so on.

Related

Are there any ways to build an ML model using CBIR and SIFT for image comparison in my case?

I have this project I'm working on. A part of the project involves multiple test runs during which screenshots of an application window are taken. Now, we have to ensure that screenshots taken between consecutive runs match (barring some allowable changes). These changes could be things like filenames, dates, different logos, etc. within the application window that we're taking a screenshot of.
I had the bright idea to automate the process of doing this checking. Essentially my idea was this. If I could somehow mathematically quantify the difference between a screenshot from the N-1th run and the Nth run, I could create a binary labelled dataset that mapped feature vectors of some sort to a label (0 for pass or 1 for fail if the images do not adequately match up). The reason for all of this was so that my labelled data would help make the model understand what scale of changes are acceptable, because there are so many kinds that are acceptable.
Now lets say I have access to lots of data that I have meticulously labelled, in the thousands. So far I have tried using SIFT in opencv using keypoint matching to determine a similarity score between images. But this isn't an intelligent, learning process. Is there some way I could take some information from SIFT and use it as my x-value in my dataset?
Here are my questions:
what would that be the information I need as my x-value? It needs to be something that represents the difference between two images. So maybe the difference between feature vectors from SIFT? What do I do when those vectors are of slightly different dimensions?
Am I on the right track with thinking about using SIFT? Should I look elsewhere and if so where?
Thanks for your time!
The approach that is being suggested in the question goes like this -
Find SIFT features of two consecutive images.
Use those to somehow quantify the similarity between two images (sounds reasonable)
Use this metric to first classify the images into similar and non-similar.
Use this dataset to train a NN do to the same job.
I am not completely convinced if this is a good approach. Let's say that you created the initial classifier with SIFT features. You are then using this data to train a NN. But this data will definitely have a lot of wrong labels. Because if it didn't have a lot of wrong labels, what's stopping you from using your original SIFT based classifier as your final solution?
So if your SIFT based classification is good, why even train a NN? On the other hand, if it's bad, you are giving a lot of wrong labeled data to the NN for training. I think the latter is a probably a bad idea. I say probably because there is a possibility that maybe the wrong labels just encourage the NN to generalize better, but that would require a lot of data, I imagine.
Another way to look at this is, let's say that your initial classifier is 90% accurate. That's probably the upper limit of the performance for the NN that you are looking at when talking about training it with this data.
You said that the issue that you have with your first approach is that 'it's not a an intelligent, learning process'. I think it's the wrong approach to think that the former approach is always inferior to the latter. SIFT is a powerful tool that can solve a lot of problems without all the 'black-boxness' of an NN. If this problem can be solved with sufficient accuracy using SIFT, I think going after a learning based approach is not the way to go, because again, a learning based approach isn't necessarily superior.
However, if the SIFT approach isn't giving you good enough results, definitely start thinking of NN stuff, but at that point, using the "bad" method to label the data is probably a bad idea.
Also in relation, I think you could potentially be underestimating the amount of data that is needed for this. You mentioned data in the thousands, but that's honestly, not a lot. You would need a lot more, I think.
One way I would think about instead doing this -
Do SIFT keyponits detection for a sample reference image.
Manually filter out keypoints that does not belong to the things in the image that are invariant. That is, just take keypoints at the locations in the image that is guaranteed (or very likely) to be always present.
When you get a new image, compute the keypoints and do matching with the reference image.
Set some threshold of the ratio of good matches to the total number of matches.
Depending on your application, this might give you good enough results.
If not, and if you really want your solution to be NN based, I would say you need to manually label the dataset as opposed to using SIFT.

need advise on sift feature - is there such thing as a good feature?

I am trying out vlfeat, got huge amount of features from an image database, and I am testing with the ground truth for mean average precision (MAp). Overall, I got roughly 40%. I see that some of the papers got higher MAp, while using techniques very similar to mine; the standard bag of word.
I am currently looking for an answer for obtaining higher MAp for the standard bag of word technique. While I see that there are other implementation such as SURF and what not, let's stick to the standard Lowe's SIFT and the standard bag of word in this question.
So the thing is this, I see that vl_sift got thresholding to allow you to be more strict on feature selection. Currently, I understand that going for higher threshold might net you smaller and more meaningful "good" features list, and possibly reduce some noisy features. "Good" features mean, given the same images with different variation, very similar features are also detected on other images.
However, how high should we go for this thresholding? Sometimes, I see that an image returns no features at all with higher threshold. At first, I was thinking of keep on adjusting the threshold, until I get better MAp. But again, I think it's a bad idea to keep on adjusting just to find the best MAp for the respective database. So my questions are:
While adjusting threshold may decrease numbers of features, does increasing threshold always return a lesser number yet better features?
Are there better approaches to obtain the good features?
What are other factors that can increase the rate of obtaining good features?
Have a look into some of the papers put out in response to the Pascal challenge in recent years. The impression they seem to give me is that standard 'feature detection' methods don't work very well with the Bag of Words technique. This makes sense when you think about it - BoW works by pulling together lots of weak, often unrelated features. It's less about detecting a specific object, but instead recognizing classes of objects and scenes. As such, putting too much emphasis on normal 'key features' can harm more than help.
As such, we see folks using dense grids and even random points as their features. From experience, using one of these methods over Harris corners, LoG, SIFT, MSER, or any of the like, has a great positive impact on performance.
To answer your questions directly:
Yes. From the SIFT api:
Keypoints are further refined by eliminating those that are likely to be unstable, either because they are selected nearby an image edge, rather than an image blob, or are found on image structures with low contrast. Filtering is controlled by the follow:
Peak threshold. This is the minimum amount of contrast to accept a keypoint. It is set by configuring the SIFT filter object by vl_sift_set_peak_thresh().
Edge threshold. This is the edge rejection threshold. It is set by configuring the SIFT filter object by vl_sift_set_edge_thresh().
You can see examples of the two thresholds in action in the 'Detector parameters' section here.
Research suggests features densely selected from the scene yield more descriptive 'words' than those selected using more 'intelligent' methods (eg: SIFT, Harris, MSER). Try your Bag of Words pipeline with vl_feat's DSIFT or PHOW implementation. You should see a great improvement in performance (assuming your 'word' selection and classification steps are tuned well).
After a dense set of feature points, the biggest breakthrough in this field seems to have been the 'Spatial Pyramid' approach. This increases the number of words produced for an image, but provides a location aspect to the features - something inherently lacking in Bag of Words. After that, make sure your parameters are well tuned (which feature descriptor you're using (SIFT, HOG, SURF, etc), how many words are in your vocabulary, what classifier are you using ect.) Then.. you're in active research land. Enjoy =)

Image classification/recognition open source library

I have a set of reference images (200) and a set of photos of those images (tens of thousands). I have to classify each photo in a semi-automated way. Which algorithm and open source library would you advise me to use for this task? The best thing for me would be to have a similarity measure between the photo and the reference images, so that I would show to a human operator the images ordered from the most similar to the least one, to make her work easier.
To give a little more context, the reference images are branded packages, and the photos are of the same packages, but with all kinds of noises: reflections from the flash, low light, imperfect perspective, etc. The photos are already (manually) segmented: only the package is visible.
Back in my days with image recognition (like 15 years ago) I would have probably tried to train a neural network with the reference images, but I wonder if now there are better ways to do this.
I recommend that you use Python, and use the NumPy/SciPy libraries for your numerical work. Some helpful libraries for handling images are the Mahotas library and the scikits.image library.
In addition, you will want to use scikits.learn, which is a Python wrapper for Libsvm, a very standard SVM implementation.
The hard part is choosing your descriptor. The descriptor will be the feature you compute from each image, intended to compute a similarity distance with the set of reference images. A good set of things to try would be Histogram of Oriented Gradients, SIFT features, and color histograms, and play around with various ways of binning the different parts of the image and concatenating such descriptors together.
Next, set aside some of your data for training. For these data, you have to manually label them according to the true reference image they belong to. You can feed these labels into built-in functions in scikits.learn and it can train a multiclass SVM to recognize your images.
After that, you may want to look at MPI4Py, an implementation of MPI in Python, to take advantage of multiprocessors when doing the large descriptor computation and classification of the tens of thousands of remaining images.
The task you describe is very difficult and solving it with high accuracy could easily lead to a research-level publication in the field of computer vision. I hope I've given you some starting points: searching any of the above concepts on Google will hit on useful research papers and more details about how to use the various libraries.
The best thing for me would be to have a similarity measure between the photo and the reference images, so that I would show to a human operator the images ordered from the most similar to the least one, to make her work easier.
One way people do this is with the so-called "Earth mover's distance". Briefly, one imagines each pixel in an image as a stack of rocks with height corresponding to the pixel value and defines the distance between two images as the minimal amount of work needed to transfer one arrangement of rocks into the other.
Algorithms for this are a current research topic. Here's some matlab for one: http://www.cs.huji.ac.il/~ofirpele/FastEMD/code/ . Looks like they have a java version as well. Here's a link to the original paper and C code: http://ai.stanford.edu/~rubner/emd/default.htm
Try Radpiminer (one of the most widely used data-mining platform, http://rapid-i.com) with IMMI (Image Mining Extension, http://www.burgsys.com/mumi-image-mining-community.php), AGPL licence.
It currently implements several similarity measurement methods (not only trivial pixel by pixel comparison). The similarity measures can be input for a learning algorithm (e.g. neural network, KNN, SVM, ...) and it can be trained in order to give better performance. Some information bout the methods is given in this paper:
http://splab.cz/wp-content/uploads/2012/07/artery_detection.pdf
Now-a-days Deep Learning based framworks like Torch , Tensorflow, Theano, Keras are the best open source tool/library for object classification/recognition tasks.

How to calculate distance when we have sparse dataset in K nearest neighbour

I am implementing K nearest neighbour algorithm for a very sparse data. I want to calculate the distance between a test instance and each sample in the training set, but I am confused.
Because most of the features in training samples don't exist in test instance or vice versa (missing features).
How can I compute the distance in this situation?
To make sure I'm understanding the problem correctly: each sample forms a very sparsely filled vector. The missing data is different between samples, so it's hard to use any Euclidean or other distance metric to gauge similarity of samples.
If that is the scenario, I have seen this problem show up before in machine learning - in the Netflix prize contest, but not specifically applied to KNN. The scenario there was quite similar: each user profile had ratings for some movies, but almost no user had seen all 17,000 movies. The average user profile was quite sparse.
Different folks had different ways of solving the problem, but the way I remember was that they plugged in dummy values for the missing values, usually the mean of the particular value across all samples with data. Then they used Euclidean distance, etc. as normal. You can probably still find discussions surrounding this missing value problem on that forums. This was a particularly common problem for those trying to implement singular value decomposition, which became quite popular and so was discussed quite a bit if I remember right.
You may wish to start here:
http://www.netflixprize.com//community/viewtopic.php?id=1283
You're going to have to dig for a bit. Simon Funk had a little different approach to this, but it was more specific to SVDs. You can find it here: http://www.netflixprize.com//community/viewtopic.php?id=1283
He calls them blank spaces if you want to skip to the relevant sections.
Good luck!
If you work in very high dimension space. It is better to do space reduction using SVD, LDA, pLSV or similar on all available data and then train algorithm on trained data transformed that way. Some of those algorithms are scalable therefor you can find implementation in Mahout project. Especially I prefer using more general features then such transformations, because it is easier debug and feature selection. For such purpose combine some features, use stemmers, think more general.

machine learning - svm feature fusion techique

for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])

Resources