I'm implementing a cache for virtual reality applications: given an input image query, return the result associated to the most visually similar cached image (so a previously processed query) if the distance between the query representation and the cached image representation is lower than a certain threshold. Our cache is relatively small and contains 10k images representations.
We use VLAD codes [1] as image representation since they are very compact and incredibly fast to compute (around 1 ms).
However, it has been shown in [2] that the the distance between the query code and the images in the dataset (the cache in this case) is very different from query to query, so it's not trivial to find an absolute threshold. In the same work it's proposed a method for object detection applications, which is not relevant in this context (we return just the most similar image, not all and only the images containing the query subject).
[3] offers a very precise method, but at the same time it's very expensive and returns short lists. It's based on spatial feature matching re-ranking, and if you want to know more details the quoted section is at the end of this question. I'm not an expert in computer vision, but this step sounds to me a lot like using a Feature Matcher on the short-list of the top-k elements according to the image representation and re-rank them based on the number of features matched. My first question is: is that correct?
In our case this approach is not a problem, since most of the times the top-10 most similar VLAD codes contains the query subject, and so we should do this spatial matching step only on 10 images.
However, at this point I have a second question: if we had the problem of deciding an absolute threshold for image representations (as VLAD codes), does this problem still persists with this approach? In the first case, the threshold was "the L2 distance between the query VLAD code and the closest VLAD code", here instead the threshold value would represent "the number of features matched between the query image and the image closest image using VLAD codes".
Of course my second question makes sense if the first question is positive.
The approach of [3]:
Geometrical Re-ranking verifies the global geometrical consistency
between matches (Lowe 2004; Philbin et al. 2007) for a short-list of
database images returned by the image search system. Here we implement
the approach of Lowe (2004) and apply it to a short-list of 200
images. We first obtain a set of matches, i.e., each descriptor of the
query image is matched to the 10 closest ones in all the short-list
images. We then estimate an affine 2D transforma- tion in two steps.
First, a Hough scheme estimates a trans- formation with 4 degrees of
freedom. Each pair of matching regions generates a set of parameters
that “vote” in a 4D histogram. In a second step, the sets of matches
from the largest bins are used to estimate a finer 2D affine
transform. The images for which the geometrical estimation succeeds
are returned in first positions and ranked with a score based on the
number of inliers. The images for which the estima- tion failed are
appended to the geometrically matched ones, with their order
unchanged.
Related
I am trying to write a function in OpenCv for comparing two images - imageA and imageB, to check to what extent they are similar.
I want to arrive at three comparison scores(0 to 100 value) as shown below.
1. Histograms - compareHist() : OpenCV method
2. Template Matching - matchTemplate() : OpenCV method
3. Feature Matching - BFMatcher() : OpenCV method
Above on the scores derived from the above calculations I want to arrive at a conclusion regarding the matching.
I was successful in getting this functions to work, but not at getting a comparison score for it. I would be great if someone could help me with that. Also, any other advice regarding this sort of image matching is also welcome.
I know there are different kind of algorithms that can be used for the above functions. So, just clarifying on the kind of images that I will be using.
1. As mentioned above it will be a one-to-one comparison.
2. Its all images taken by a human using a mobile camera.
3. The images that match will be taken of the same object/place from the same spot mostly. (Accoding to the time of the day, the lighting could differ)
4. If the images doesn't match the user will be asked to click another one, till it matches.
5. The kind of images compared could include - corridor, office table, computer screen(content on the screen to be compared), pepper document etc.
1- With histogram you can get a comparison score using histogram intersection. If you divide the intersection of two histograms to the union of the two histogram, will give you a score between 0 (no match at all) and 1 (complete match) like the example in the below graph:
You can compute the intersection for histogram with a simple For loop.
2- In template matching, the score you get is different for each method of comparing. In this link you can see the details of each method. In some methods highest score means the best match, but in some others, the lowest score means the most matched. For defining a score between 0 and 1, you should consider 2 scores: one for matching an image with itself (most match score) and two, matching two completely different images (lowest match) and then normalize the scores by the number of pixels in the image (height*width).
3- Feature matching is different than the last two methods. You may have two similar image with poor features (which fail in matching) or having two conceptually different images and have many matched features. Although if the images are feature-rich we can define something as a score. For this purpose, consider this example:
Img1 has 200 features
Img2 has 170 features
These two images have 100 matched features
Consider 0.5 (100/200) as the whole image matching score
You can also involve the distances between the matched pairs of features into the scoring but I think that's enough.
Regarding the comparison score. Have you tried implementing a weighted average to get a final comparison metric? Weight the 3 matching methods you are implementing according to their accuracy, the best method gets the “heaviest” weight.
Also, if you want to explore additional matching methods, give FFT-based matching a try: http://machineawakening.blogspot.com/2015/12/fft-based-cosine-similarity-for-fast.html
I'm quiet new to OpenCV and image processing, so my questions to the feature matching approach a a bit general. I read something about the theory, but i have problems to arrange the very specific theory in this steps;
As i understand it i would group the sequence in the following steps:
Feature detection: Special points from image are found in a very
Feature description: Information about the near neighborhood is collected and a per featurepoint one vector is created
->(1) is this always in the form of an histogram?
Matching: A distance between the descriptors is calculated
->(2) can I determine what kind of distance is used? I read about χ^2 and EMD, even if they are not implemented, are these the right keywords in this place
Corresponding matches are determined
->(3) I guess the Hungarian method would be one method?
Transformation estimation: In an optimization problem the best position is estimated
It would be nice if someone could clarify the italic marked question
(1): is this always in the form of an histogram?
No, for example there are binary descriptors for ORB features. In Theory, descriptors can be anything. Often they are normalized and often they are either binary or floating points. But: Histograms have some properties which can make them good descriptors.
(2) can I determine what kind of distance is used?
For floating point descriptors, sum of squared distances might be the most used metric to measure the distance. For binary descriptor afaik, hamming distance is used?
(3) I guess the Hungarian method would be one method?
Could be used, I guess, but this might or might not lead to some problems. Typically nearest neighbor approaches are used. Often just brute-force (which is O(n^2) instead of O(n^3) of hungarian). The "problem", that multiple descriptors of one set might have the same nearest neighbor in the second set, is in fact another feature, because if that happens, you might be able to filter out some "uncertain" matches (often the ratio of the best n matches is used to filter out even more). You must assume that many descriptors in a set will have no fitting correspondence in the second set and you must assume, that the matching itself won't produce perfect matches. Typically some additional steps like homography computation are used to make the matching more robust and filter out outliers.
PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more expert on Information retrieval), so please be kind with this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture thas was previously taken by some other user (the dataset element). Time performances are crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor problem (1-NN). LSH is a popular, fast and relatively precise solution for this problem. In particular, my LSH impelementation is about using Kernalized Locality Sensitive Hashing to achieve a good precision to translate a d-dimension vector to a s-dimension binary vector (where s<<d) and then use Fast Exact Search in Hamming Space
with Multi-Index Hashing to quickly find the exact nearest neighbor between all the vectors in the dataset (transposed to hamming space).
In addition, I'm going to use SIFT since I want to use a robust keypoint detector&descriptor for my application.
WHAT DOES IT MISS IN THIS PROCESS?
Well, it seems that I already decided everything, right? Actually NO: in my linked question I face the problem about how to represent the set descriptor vectors of a single image into a vector. Why do I need it? Because a query/dataset element in LSH is vector, not a matrix (while SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is using the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I have still some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is k-means used in the BoF algorithm the best choice for such an application? What are alternative clustering algorithms?
The dimension of the codeword vector obtained by the BoF is equal to the number of clusters (so k parameter in the k-means approach)?
If 2. is correct, bigger is k then more precise is the BoF vector obtained?
There is any "dynamic" k-means? Since the query image must added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries) the cluster can change in time.
Given a query image, is the process to obtain the codebook vector the same as the one for a dataset image, e.g. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building codebook from a set of keypoint features generated by SIFT.
You can try "mixture of gaussians" model. K-means assumes that each dimension of a keypoint is independent while "mixture of gaussians" can model the correlation between each dimension of the keypoint feature.
I can't answer this question. But I remember that the SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters like 50 clusters.
N/A
You can try Infinite Gaussian Mixture Model or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
Not sure if I understand this question.
Hope this help!
I can create time-series graphs from data (charts) as images in C#. One might be moving average of a measured value, say 100 pixels by 100 pixels, time on X, value on Y.
I only train with graphs of values that give a desired (or undesired) result. This means I have lots of 10k images of success that I can use for training a NN.
The idea is to look at a current graph and establish a % match against the training data (many successful images either compiled/summed, averaged, etc. A high % match is likely the same situation exists now as with previous successes.
But I cannot figure out:
Q: How to compare images, or more basically, how to load a current image to test against in a trained NN. Do I really need 10,000 input nodes?!
There has to be a better way.
Right now I'm trying to make Encog/C# work for the image recognition/matching. There seems to be a lot of research in OCR, where a hard yes/no is made on input data, but not much at all about a 'fuzzy' match to the training data...
Backstory , in my country there is a picture of its founding father in every bank denomination :
.
I want to find the similarity between these two images via surf detectors .The system will be trained by both images. The user will present the bottom picture or the top picture via a webcam and will use the similarity score between them to find its denomination value .
My pseudocode:
1.Detect keypoints and the corresponding descriptors of both the images via surf detector and descriptor .
2.a.Calculate the matching vector between the query and each of the trained example .Find number of good matches / total number of matches for each image .
2.b.OR Apply RANSAC algorithm and find the highest number of closest pair between query and training algorithm
3.The one having the higher value will have higher score and better similarity.
Is my method sound enough , or is there any other method to find similarity between two images in which the query image will undergo various transformations . I have looked for solutions for this such as finding Manhattan distance , or finding correlation , however none of them are adequate for this problem.
Yes, you are doing it the right way
1) You create a training set and store all its feature-points .
2) Perform ratio test for matches with the query and train feature-points.
3) Apply ransac test and draw matches (apply homograpghy if you want highlight the detected note).
This paper might be helpful, they are doing similar thing using SIFT
Your algorithm looks fine, but you have much more information with you which you can make use of. I will give you a list of information which you can use to further improve your results:
1. Location of the part where denominations are written on the image.
2. Information about how denominations are written - Script knowledge.
3. Homo-graphic information as you know the original image and The observed image
Make use all the above information to improve the result.