How to evaluate Feature descriptors with a Matching Algorithm - opencv

I'm trying to evaluate feature detectors and descriptors with the FLANN algorithm, based on this tutorial.
I want to build a ROC curve for the evaluation, so I have to get the TP, FN, FP and TN counts. The thing is, I don't know how to get these values! I have read a lot of papers, but none of them explains, for instance, how they get the false positives. In the given tutorial you can set a certain threshold so that you can count the good and the bad matches, but that's no guarantee that everything was matched correctly. Should I count them for every image pair by hand, or is there a common technique to solve this automatically?
Thanks in advance for any help!

You need so-called "ground truth": manually checked correspondences, or a transformation matrix (fundamental matrix or homography) between the two images. Correspondences which are consistent with this matrix are counted as correct.
Check the approach used in the classical papers by Mikolajczyk et al., "A comparison of affine region detectors" and "A performance evaluation of local descriptors", and by Moreels and Perona, "Evaluation of Features Detectors and Descriptors based on 3D Objects".
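For a concrete picture, here is a minimal sketch (an assumed setup, not taken from those papers) of counting true positives against a ground-truth homography H (a 3x3 CV_64F matrix relating image 1 to image 2); the 3-pixel reprojection tolerance is an illustrative choice:
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>
int countTruePositives(const std::vector<cv::DMatch>& matches,
                       const std::vector<cv::KeyPoint>& kpts1,
                       const std::vector<cv::KeyPoint>& kpts2,
                       const cv::Mat& H, double tolPx = 3.0)
{
    int truePositives = 0;
    for (const cv::DMatch& m : matches) {
        // Project the keypoint from image 1 into image 2 via the ground truth.
        cv::Mat p = (cv::Mat_<double>(3, 1) << kpts1[m.queryIdx].pt.x,
                                               kpts1[m.queryIdx].pt.y, 1.0);
        cv::Mat q = H * p;
        double px = q.at<double>(0) / q.at<double>(2); // dehomogenize
        double py = q.at<double>(1) / q.at<double>(2);
        // A match consistent with the ground truth counts as a true positive.
        if (std::hypot(px - kpts2[m.trainIdx].pt.x,
                       py - kpts2[m.trainIdx].pt.y) <= tolPx)
            ++truePositives;
    }
    return truePositives; // false positives = matches.size() - truePositives
}
Matched pairs beyond the tolerance are your false positives; ground-truth correspondences that no match recovers give the false negatives.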


Bag of Features / Visual Words + Locality Sensitive Hashing

PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more experienced in Information Retrieval), so please be kind to this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture that was previously taken by some other user (the dataset element). Time performance is crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor (1-NN) problem. LSH is a popular, fast and relatively precise solution for it. In particular, my LSH implementation uses Kernelized Locality Sensitive Hashing to map a d-dimensional vector to an s-dimensional binary vector (where s << d) with good precision, and then Fast Exact Search in Hamming Space with Multi-Index Hashing to quickly find the exact nearest neighbor among all the vectors in the dataset (mapped to Hamming space).
In addition, I'm going to use SIFT, since I want a robust keypoint detector and descriptor for my application.
WHAT IS MISSING IN THIS PROCESS?
Well, it seems that I have already decided everything, right? Actually NO: in my linked question I face the problem of how to represent the set of descriptor vectors of a single image as a single vector. Why do I need that? Because a query/dataset element in LSH is a vector, not a matrix (while a SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I still have some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is k-means used in the BoF algorithm the best choice for such an application? What are alternative clustering algorithms?
Is the dimension of the codeword vector obtained through BoF equal to the number of clusters (i.e., the k parameter in the k-means approach)?
If 2. is correct, does a bigger k give a more precise BoF vector?
Is there any "dynamic" k-means? Since the query image must be added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries), the clusters can change over time.
Given a query image, is the process to obtain the codebook vector the same as the one for a dataset image, i.e., we assign each descriptor to a cluster, and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building a codebook from a set of keypoint features generated by SIFT.
You can try a "mixture of Gaussians" model (see the sketch after this answer). K-means assumes that each dimension of a keypoint is independent, while a mixture of Gaussians can model the correlation between the dimensions of the keypoint feature.
I can't answer this question, but I remember that a SIFT keypoint descriptor, by default, has 128 dimensions. You probably want a smaller number of clusters, such as 50.
N/A
You can try the Infinite Gaussian Mixture Model, or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan.
Not sure if I understand this question; a sketch of the quantization step it seems to describe follows below.
Hope this helps!
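Two sketches to make the above concrete (both are assumed setups, not tested recipes). First, the mixture-of-Gaussians alternative, using OpenCV's cv::ml::EM in place of k-means; the 50-component choice follows the answer above, and descriptors is an n x 128 CV_32F matrix of stacked SIFT descriptors:
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
cv::Mat gmmVocabulary(const cv::Mat& descriptors, int nClusters = 50)
{
    cv::Ptr<cv::ml::EM> em = cv::ml::EM::create();
    em->setClustersNumber(nClusters);
    // A generic (full) covariance per component can model correlations
    // between descriptor dimensions, which plain k-means cannot.
    em->setCovarianceMatrixType(cv::ml::EM::COV_MAT_GENERIC);
    em->trainEM(descriptors);
    return em->getMeans(); // nClusters x 128 component means ("visual words")
}
Second, the quantization step that the last question appears to describe: each descriptor votes for its nearest vocabulary center, and the histogram of votes is the image's codeword vector (vocabulary holds the k x 128 cluster centers):
cv::Mat bofHistogram(const cv::Mat& descriptors, const cv::Mat& vocabulary)
{
    cv::Mat hist = cv::Mat::zeros(1, vocabulary.rows, CV_32F);
    for (int i = 0; i < descriptors.rows; ++i) {
        int best = 0;
        double bestDist = cv::norm(descriptors.row(i), vocabulary.row(0), cv::NORM_L2);
        for (int c = 1; c < vocabulary.rows; ++c) {
            double d = cv::norm(descriptors.row(i), vocabulary.row(c), cv::NORM_L2);
            if (d < bestDist) { bestDist = d; best = c; }
        }
        hist.at<float>(0, best) += 1.0f; // the i-th bin counts descriptors in cluster i
    }
    return hist; // L1- or L2-normalizing it before hashing is a common extra step
}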

How to match features when the number of features of both objects is unequal?

I have 2 objects. I get n features from object 1 & m features from object 2.
n!=m
I have to measure the probability that object 1 is similar to object 2.
How can I do this?
There is a nice tutorial on the OpenCV website that does this. Check it out.
The idea is to compute the distances between all those descriptors with a FlannBasedMatcher, keep the closest ones, and run RANSAC to find a set of consistent features between the two objects. You don't get a probability, but rather the number of consistent features, from which you can score how good your detection is; that part is up to you.
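A minimal sketch of that pipeline (an assumed setup: desc1/desc2 and kpts1/kpts2 come from a float descriptor such as SIFT or SURF; the 0.75 ratio and the 3-pixel RANSAC threshold are illustrative choices):
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>
int consistentMatches(const cv::Mat& desc1, const cv::Mat& desc2,
                      const std::vector<cv::KeyPoint>& kpts1,
                      const std::vector<cv::KeyPoint>& kpts2)
{
    cv::FlannBasedMatcher matcher;
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2); // two nearest neighbors per descriptor
    std::vector<cv::Point2f> pts1, pts2;
    for (const auto& m : knn) {
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance) { // ratio test
            pts1.push_back(kpts1[m[0].queryIdx].pt);
            pts2.push_back(kpts2[m[0].trainIdx].pt);
        }
    }
    if (pts1.size() < 4) return 0; // findHomography needs at least 4 pairs
    std::vector<unsigned char> inlierMask;
    cv::findHomography(pts1, pts2, cv::RANSAC, 3.0, inlierMask);
    return cv::countNonZero(inlierMask); // geometrically consistent matches
}
Note this works regardless of n != m: the matcher pairs each of the n query descriptors with its nearest among the m train descriptors, and RANSAC keeps only the geometrically consistent subset.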
You can group the features in the region of the image where they are densest and build a vector from them. There may be multiple matches among those, from which you can choose the highest-scoring one.
Are you talking about point feature descriptors, like SIFT, SURF, or FREAK?
In that case there are several strategies. In all cases you need a distance measure. For SIFT or SURF you can use the Euclidean distance between the descriptors, or the L1 norm, or the dot product (correlation). For binary features, like FREAK or BRISK, you typically use the Hamming distance.
Then one approach is to simply pick a threshold on the distance. This is likely to give you many-to-many matches. Another way is to use bipartite graph matching to find the minimum-cost or maximum-weight assignment between the two sets. A very practical approach, described by David Lowe, uses a ratio test to discard ambiguous matches.
Many of these strategies are implemented in the matchFeatures function in the Computer Vision System Toolbox for MATLAB.
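For the binary-descriptor case mentioned above, a minimal OpenCV sketch (an assumed setup: desc1/desc2 are CV_8U matrices from FREAK, BRISK, or ORB); cross-checking is one simple way to discard ambiguous matches:
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>
std::vector<cv::DMatch> matchBinary(const cv::Mat& desc1, const cv::Mat& desc2)
{
    // crossCheck = true keeps a match only if it is mutual, enforcing
    // one-to-one correspondences instead of many-to-many.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);
    return matches;
}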

Estimating parameters in multivariate classification

Newbie here typesetting my question, so excuse me if this doesn't work.
I am trying to build a Bayesian classifier for a multivariate classification problem where the input is assumed to have a multivariate normal distribution. I chose to use a discriminant function defined as log(likelihood * prior).
However, from the likelihood of the N training samples of class i under that distribution,
$$f(x_1,\dots,x_N \mid \mu,\Sigma) = (2\pi)^{-Nd/2}\det(\Sigma)^{-N/2}\exp\Big[-\tfrac{1}{2}\sum_{n=1}^{N}(x_n-\mu)'\Sigma^{-1}(x_n-\mu)\Big]$$
I encounter a term $-\log(\det(S_i))$, where $S_i$ is my sample covariance matrix for a specific class $i$. Since my input actually represents square image data, my $S_i$ picks up quite some correlation, resulting in $\det(S_i)$ being zero. Then my discriminant functions all turn infinite, which is disastrous for me.
I know there must be a lot of things going wrong here; is anyone willing to help me out?
UPDATE: Can anyone help me get the formula working?
I will not analyze the concept, as it is not very clear to me what you are trying to accomplish here, and I do not know the dataset, but regarding the problem with the covariance matrix:
The most obvious solution for data where you need a covariance matrix and its determinant, and for numerical reasons this is not feasible, is to use some kind of dimensionality reduction technique to capture the most informative dimensions and simply discard the rest. One such method is Principal Component Analysis (PCA), which, applied to your data and truncated after for example 5-20 dimensions, yields a reduced covariance matrix with non-zero determinant.
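A minimal sketch of that fix with OpenCV's cv::PCA (the function name and the 20-dimension cut are illustrative assumptions; samples holds one observation per row):
#include <opencv2/core.hpp>
cv::Mat reducedCovariance(const cv::Mat& samples, int keepDims = 20)
{
    // Project onto the leading principal components...
    cv::PCA pca(samples, cv::Mat(), cv::PCA::DATA_AS_ROW, keepDims);
    cv::Mat reduced = pca.project(samples); // N x keepDims
    // ...then estimate the covariance in the reduced space, where the
    // determinant is no longer (numerically) zero.
    cv::Mat cov, mean;
    cv::calcCovarMatrix(reduced, cov, mean,
                        cv::COVAR_NORMAL | cv::COVAR_ROWS | cv::COVAR_SCALE);
    return cov; // keepDims x keepDims, generically full-rank
}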
PS. It may be a good idea to post this question on Cross Validated
Probably you do not have enough data to infer parameters in a space of dimension d. Typically, the way to get around this is to use a MAP estimate instead of an ML one.
For the multivariate normal, the conjugate prior is a normal-inverse-Wishart distribution. The MAP estimate adds the matrix parameter of the inverse-Wishart distribution to the ML covariance matrix estimate and, if chosen correctly, will get rid of the singularity problem.
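For concreteness (a sketch treating the mean as known; the hyperparameters $\nu_0$ and $\Psi$ are yours to choose): with an inverse-Wishart prior $IW(\nu_0, \Psi)$ on $\Sigma$, the posterior is again inverse-Wishart, and its mode gives the MAP estimate
$$\hat{\Sigma}_{\mathrm{MAP}} = \frac{\Psi + \sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)'}{\nu_0 + N + d + 1}$$
Choosing $\Psi$ as a small multiple of the identity keeps the estimate full-rank however few samples you have.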
If you are actually trying to create a classifier for normally distributed data, and not just doing an experiment, then a better way to do this would be with a discriminative method. The decision boundary for a multivariate normal is quadratic, so just use a quadratic kernel in conjunction with an SVM.
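A minimal sketch of that route with OpenCV's cv::ml::SVM (the function name and the C value are illustrative assumptions; trainData is N x d CV_32F, labels is N x 1 CV_32S):
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
cv::Ptr<cv::ml::SVM> trainQuadraticSVM(const cv::Mat& trainData, const cv::Mat& labels)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::POLY);
    svm->setDegree(2.0); // quadratic kernel, matching the quadratic boundary
    svm->setC(1.0);      // illustrative regularization constant
    svm->train(trainData, cv::ml::ROW_SAMPLE, labels);
    return svm;
}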

facial expression classification using k-means

My method for classifying facial expressions using k-means is:
Use OpenCV to detect the face in the image
Use ASM and Stasm to get the facial feature points
Calculate the distances between facial features (as shown in the picture). There will be 5 distances.
Calculate the centroid of each distance for each facial expression (e.g., for distance D1 there are 7 centroids, one per expression: happy, angry, ...).
Run 5 k-means, one per distance; each k-means outputs the expression whose centroid (calculated in the previous step) is closest to the measured distance.
The final expression is the one that appears most often among the 5 k-means results.
However, using that method my results are wrong.
Is my method correct or is it wrong somewhere?
K-means is not a classification algorithm. Once run, it simply finds the centroids of K groups, so it splits the data into K parts, but in most cases those parts won't have anything to do with the desired classes. This algorithm (like all clustering methods) should be used when you want to explore data and find some distinguishable objects, distinguishable in any sense. If your task is to build a system which recognizes some given classes, then it is a classification problem, not clustering. One of the simplest methods, easy to both implement and understand, is KNN (K-nearest neighbours), which roughly does what you are trying to accomplish: it checks which class's objects are the closest to some predefined ones (see the sketch after this answer).
To better see the difference, let us consider your case: you are trying to detect emotional state based on face features. Running k-means on such data can split your face photos into many different groups:
if you use photos of different people, it can cluster the photos of each particular person together (as their distances differ from others');
it can split the data into, for example, men and women, as there are gender-specific differences in such features;
it can even split your data based on the distance from the camera, as perspective changes your features, creating "clusters";
etc.
As you can see, there are dozens of possible "reasonable" splits (and even more completely uninterpretable ones), and k-means (like any other clustering algorithm) will simply find one of them, in most cases an uninterpretable one. Classification methods are used to overcome this issue, to "explain" to the algorithm what you are expecting.
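A minimal sketch of that KNN route with OpenCV's cv::ml::KNearest (an assumed setup: trainData holds one 5-distance feature vector per row, CV_32F, and labels the corresponding expression ids, CV_32S; k = 3 is an illustrative choice):
#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>
int classifyExpression(const cv::Mat& trainData, const cv::Mat& labels,
                       const cv::Mat& querySample /* 1 x 5, CV_32F */)
{
    cv::Ptr<cv::ml::KNearest> knn = cv::ml::KNearest::create();
    knn->setDefaultK(3);
    knn->train(trainData, cv::ml::ROW_SAMPLE, labels);
    cv::Mat response;
    knn->findNearest(querySample, knn->getDefaultK(), response);
    return (int)response.at<float>(0, 0); // predicted expression id
}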

Implementing Vocabulary Tree in OpenCV

I am trying to implement image search based on the paper "Scalable Recognition with a Vocabulary Tree". I am using SURF for extracting the features and keypoints. For example, for an image I get, say, 300 keypoints, and each keypoint has 128 descriptor values. My question is: how can I apply the k-means clustering algorithm to this data? Do I need to apply the clustering algorithm to all the points, i.e., the 300 x 128 values, or do I need to compute the distances between consecutive descriptors, store those, and apply the clustering algorithm to them? I am confused, and any help would be appreciated.
Thanks,
Rocky.
From your question I would say you are quite confused. The vocabulary tree technique is based on the use of hierarchical k-means clustering and a TF-IDF weighting scheme for the leaf nodes.
In a nutshell, the clustering algorithm employed for vocabulary tree construction runs k-means once over all the d-dimensional data (d = 128 in the case of SIFT) and then runs k-means again over each of the obtained clusters, until some depth level is reached. Hence the two main parameters for vocabulary tree construction are the branching factor k and the tree depth L. Some improvements consider only the branching factor, while the depth is automatically determined by cutting the tree to fulfill a minimum variance measure.
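For intuition, a minimal sketch of that hierarchical scheme (an assumed implementation that keeps only the centers; a real vocabulary tree would also attach TF-IDF weights to the leaves). descriptors must be CV_32F, one descriptor per row:
#include <opencv2/core.hpp>
#include <vector>
void buildVocabTree(const cv::Mat& descriptors, int k, int depth, int maxDepth,
                    std::vector<cv::Mat>& centersOut)
{
    if (depth >= maxDepth || descriptors.rows < k)
        return;
    cv::Mat labels, centers;
    cv::kmeans(descriptors, k, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 20, 1e-3),
               3, cv::KMEANS_PP_CENTERS, centers);
    centersOut.push_back(centers); // the k centers of this node
    // Recurse: gather each cluster's descriptors and cluster them again.
    for (int c = 0; c < k; ++c) {
        cv::Mat subset;
        for (int i = 0; i < descriptors.rows; ++i)
            if (labels.at<int>(i) == c)
                subset.push_back(descriptors.row(i));
        buildVocabTree(subset, k, depth + 1, maxDepth, centersOut);
    }
}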
As for the implementation, cv::BOWTrainer from OpenCV is a good starting point, though it is not very well generalized to the hierarchical BoW case: it forces the centers to be stored in a single cv::Mat, while a vocabulary tree is typically unbalanced, and mapping it to a matrix in a level-wise fashion may be inefficient memory-wise when the number of nodes n is much lower than the theoretical number of nodes of a balanced tree with depth L and branching factor k, that is:
n << (1 - k^L)/(1 - k)
As far as I know, you have to store all the descriptors in a cv::Mat and then pass it to a "k-means trainer", so that you can finally apply the clustering algorithm. Here is a snippet that can give you an idea of what I am talking about:
cv::BOWKMeansTrainer bowtrainer(1000);     // number of clusters (vocabulary size)
bowtrainer.add(training_descriptors);      // add the descriptors, one per row
cv::Mat vocabulary = bowtrainer.cluster(); // run k-means and get the cluster centers
And this may be interesting to you: http://www.morethantechnical.com/2011/08/25/a-simple-object-classifier-with-bag-of-words-using-opencv-2-3-w-code/
Good luck!!
Check out the code in libvot, in src/vocab_tree/clustering.*, where you can find a detailed implementation of the clustering algorithm.
