I'm using the OpenCV implementation of ORB along with the BFMatcher. The OpenCV documentation states that NORM_HAMMING should be used with ORB.
Why is this? What advantages does NORM_HAMMING offer over other options such as the Euclidean distance, NORM_L1, etc.?
When comparing descriptors in computer vision, the Euclidean distance is usually understood as the square root of the sum of the squared differences between the two vectors' elements.
The ORB descriptors are vectors of binary values. If you apply the Euclidean distance to binary vectors, the squared result of a single element comparison is always 1 or 0, which is not informative when it comes to estimating the difference between the elements. The overall Euclidean distance would then be the square root of a sum of ones and zeroes, again not a good estimator of the difference between the vectors.
That's why the Hamming distance is used. Here the distance is the number of elements that are not the same. As noted by Catree, you can calculate it with a simple boolean operation on the vectors, as in the example below: D1 is a single 4-bit descriptor that we are comparing with 4 descriptors collected in D2, and H holds the Hamming distance for each row of D2.
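A minimal NumPy sketch of that computation, standing in for the original figure (the 4-bit values are made up for illustration):

import numpy as np

# D1: a single 4-bit descriptor; D2: four candidate descriptors, one per row
D1 = np.array([1, 0, 1, 1], dtype=np.uint8)
D2 = np.array([[1, 0, 1, 1],
               [0, 0, 1, 1],
               [1, 1, 0, 0],
               [0, 1, 0, 0]], dtype=np.uint8)

# XOR marks the differing positions; summing each row gives the Hamming distance
H = np.bitwise_xor(D1, D2).sum(axis=1)
print(H)  # [0 1 3 4]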
ORB (ORB: an efficient alternative to SIFT or SURF) is a binary descriptor.
It should be more efficient (in terms of computation) to use the Hamming distance rather than the L1/L2 distance, as the Hamming distance can be implemented with a XOR followed by a bit count (see BRIEF: Binary Robust Independent Elementary Features):
Furthermore, comparing strings can be done by computing the Hamming distance, which can be done extremely fast on modern CPUs that often provide a specific instruction to perform a XOR or bit count operation, as is the case in the latest SSE [10] instruction set.
Of course, with a classical descriptor like SIFT, you cannot use the HAMMING distance.
You can test yourself:
D1=01010110
D2=10011010
L2_dist(D1,D2)=sqrt(4)=2
XOR(D1,D2)=11001100 ; bit_count(11001100)=4
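The same numbers can be checked quickly with OpenCV's norm function (a sketch using the Python bindings; each bit is stored in its own byte here, so NORM_HAMMING counts exactly the differing positions):

import numpy as np
import cv2

D1 = np.array([0, 1, 0, 1, 0, 1, 1, 0], dtype=np.uint8)
D2 = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)

print(cv2.norm(D1, D2, cv2.NORM_HAMMING))  # 4.0
print(cv2.norm(D1, D2, cv2.NORM_L2))       # 2.0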
The L1/L2 distance is used for real-valued descriptors such as SIFT and SURF, while the Hamming distance is used for binary descriptors (AKAZE, ORB, BRIEF, etc.).
I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering: what might be the motivation for this combination of L2 normalization and cosine (both in terms of text2vec and in general)? I tried to read up on the L2 norm, but mostly it comes up in the context of normalizing before using the Euclidean distance. I could not find (surprisingly) anything on whether the L2 norm is recommended for or against in the case of cosine similarity on word vector spaces/embeddings, and I don't quite have the math skills to work out the analytic differences.
So here is the question, meant in the context of word vector spaces learned from textual data (either plain co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe), and calculating word similarity (with the goal, of course, being a vector space + metric that best reflects real-world word similarities). Is there, in simple words, any reason to (not) use the L2 norm on a word-feature matrix/term-co-occurrence matrix before calculating cosine similarity between the vectors/words?
If you want cosine similarity, you don't need to L2-normalize first and then compute it: cosine similarity already normalizes the vectors, since it divides their dot product by the product of their lengths.
If you are calculating the Euclidean distance, then you do need to normalize if distance or vector length is not an important distinguishing factor. If vector length is a distinguishing factor, then don't normalize and calculate the Euclidean distance as it is.
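A quick NumPy illustration of that equivalence (made-up vectors, not text2vec code):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# cosine similarity computed directly
cos = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# plain dot product after L2-normalizing both vectors gives the same value
cos_normed = (a / np.linalg.norm(a)).dot(b / np.linalg.norm(b))

print(np.isclose(cos, cos_normed))  # True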
text2vec handles everything automatically: it will make the rows have unit L2 norm and then call the dot product to calculate cosine similarity.
But if the matrix already has rows with unit L2 norm, the user can specify norm = "none" and sim2 will skip the normalization step (which saves some computation).
I understand the confusion; I should probably remove the norm option (it doesn't take much time to normalize the matrix).
I am new to image processing, and I want to extract image features in order to do some classification. I am having problems understanding the pipeline.
As far as I understand, I have images and I run the SIFT algorithm on them. This gives me a set of descriptors for each image; the number varies, but each descriptor has a fixed length of 128.
I then proceed to cluster them, since it is not possible to apply algorithms to a varying number of features. For this, I stack up all the descriptors of all images and run the k-means algorithm with the desired number of clusters. What I get are k cluster centres of length 128.
Here is where I am confused: I now have these new descriptors, so what do I do with them? I don't understand how I can plug them into a classifier if they represent all images. Should each image have its own separate features to be fed into a classifier?
I am sure I did not understand the concept, but can anybody please clarify what happens after I get a k*128 sized matrix? What is fed into, for example, an SVM classifier, and how? How does this k-means result suffice to train a classifier?
Thanks!
EDIT: I might have confused keypoints and descriptors; sorry, I'm new to image processing!
You should look into the image classification/image retrieval approach known as 'bag of visual words' - it is extremely relevant. A bag of visual words is a fixed-length feature vector v which summarises the occurrences of the features in an image. This makes use of what is called a codebook (also called a dictionary, from its historical use in text retrieval), which in your case is built from your k-means clustering. To make v for a given image, the simplest approach is to assign v[j] the proportion of SIFT descriptors that are closest to the jth cluster centroid. This means the length of v is K, so it is independent of the number of SIFT features that are detected in the image.
Concretely, suppose you've done k-means clustering with K = 100. Let's use ci to denote the ith cluster centre; for SIFT, this would be a vector of size 128. Now, for a given input image, you make the vector v, which is of size 100 and initialized with zeros. You then extract features from the image and their corresponding descriptors. Let's say there are N descriptors, and we will call them d0, d1, ..., d(N-1), where dj is the jth descriptor. For each dj you compute the vector distance between it and the cluster centres c0, c1, ..., c99. You then take the cluster index k with the lowest distance to dj and increment: v[k] += 1. Note that this process can be parallelised very well, particularly on GPUs. It can also be made faster by using Approximate Nearest Neighbours, e.g. with the FLANN library.
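A rough scikit-learn/NumPy sketch of the assignment step just described (random arrays stand in for real SIFT descriptors, and the helper name bow_vector is just for illustration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# stand-in for the SIFT descriptors of all training images, stacked (M x 128)
all_descriptors = rng.random((5000, 128))

K = 100
kmeans = KMeans(n_clusters=K, n_init=10).fit(all_descriptors)  # the codebook

def bow_vector(descriptors, kmeans, K=100):
    # descriptors: (N, 128) SIFT descriptors of a single image
    nearest = kmeans.predict(descriptors)             # closest centroid per descriptor
    v = np.bincount(nearest, minlength=K).astype(float)
    return v / v.sum()                                # proportions, length K

# one fixed-length vector per image; these are what you feed to the SVM
image_descriptors = rng.random((300, 128))            # stand-in for one image's SIFT output
v = bow_vector(image_descriptors, kmeans)
print(v.shape)                                        # (100,) regardless of N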
Looks like I can't use this similarity metric with sklearn's KDTree, for example, but I need it because I am measuring word vector similarity. What is a fast, robust alternative for this case? I know about Locality-Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.
The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points first. So you can use a KD-tree to find the k nearest neighbours, but you will need to recompute the cosine similarity values themselves if you need them.
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. If you do that, you can then use other structures like ball trees to do accelerated nearest-neighbour search with cosine similarity directly. I've implemented this in the JSAT library, in case you are interested in a Java implementation.
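A minimal sketch of the normalize-then-Euclidean route described above, using sklearn's KDTree (random stand-in vectors; for unit vectors d^2 = 2 - 2*cos, so the cosine similarity can be recovered from the returned distances):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                    # stand-in word vectors
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row

tree = KDTree(Xn)                                  # Euclidean metric by default
query = Xn[:1]                                     # an already-normalized query vector
dist, idx = tree.query(query, k=5)

# for unit vectors, d^2 = 2 - 2*cos, so the similarity is recoverable
cos_sim = 1 - dist**2 / 2
print(idx[0], cos_sim[0])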
According to the table at the end of this page, cosine support with the k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; and the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work.
Indexing support in sklearn often is not very complete (albeit improving), unfortunately; so don't take that as a reference.
While the k-d-tree can theoretically support cosine, either by
- transforming the data so that cosine becomes Euclidean distance, or
- working with the bounding boxes and the minimum angle to the bounding box (which appears to be what ELKI is doing for the R-tree),
you should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree always looks at only one dimension per split. If you want all d dimensions to be used at least once, you need O(2^d) data points. For high d, there is no way all attributes get used.
The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem.
So in essence, don't use either for high dimensional data.
But also don't assume that cosine magically improves your results, in particular for high-dimensional data. It's very much overrated. As the transformation above indicates, there cannot be a systematic benefit of cosine over Euclidean: cosine is a special case of Euclidean.
For sparse data, inverted lists (cf. Lucene, Xapian, Solr, ...) are the way to index for cosine.
I've read in numerous papers that the Hamming distance should be used for feature matching with ORB features. I have been playing around with the BoW model in OpenCV (C++) and find that I get better classification accuracy with the default BruteForce matcher (which uses L2) than with the BruteForce matcher with Hamming or Hamming(2).
Why is this?
I was under the impression that you can't use the L2 norm with binary descriptors, yet it is giving better classification accuracy than the Hamming distance.
Imagine you have two 3-bit ORB-Descriptors:
A = [101]
B = [011]
The Hamming distance is the number of positions at which the corresponding bits differ:
hamming = 2
The L2 distance is the euclidean distance:
L2 = sqrt(2)
For a binary descriptor like ORB you usually take the Hamming distance, because it can be computed much more efficiently (a XOR followed by a bit count).
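For completeness, a minimal sketch of ORB matching with the Hamming norm, using the OpenCV Python bindings (the image file names are hypothetical; the equivalent calls exist in the C++ API):

import cv2

# hypothetical file names; any pair of grayscale images will do
img1 = cv2.imread("box.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("box_in_scene.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# NORM_HAMMING = XOR + bit count over the 256-bit ORB descriptors
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
print(len(matches))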
I'm a software engineer working on DSP for the first time.
I'm successfully using an FFT library that produces frequency spectra. I also understand how the FFT works in terms of its inputs and outputs, in particular the contents of the two output arrays (the real and the imaginary parts of the result).
Now, my problem is that I'm reading some new research reports that suggest that I extract: "the energy, variance, and sum of FFT coefficients".
What are the 'FFT coefficients'? Are they the values of the real and imaginary arrays mentioned above, which (from my understanding) correspond to the amplitudes of the constituent cosine and sine waves?
What is the 'energy' of the FFT coefficients? Is that terminology from statistics or from DSP?
You are correct. The FFT coefficients are the signal values in the frequency domain.
"Energy" is the squared modulus of the coefficients. The total energy (the sum of the squared moduli of all values) is the same in either the time or the frequency domain (see Parseval's theorem).
The real and imaginary arrays, when put together, represent a complex array. Every element of that complex array in the frequency domain can be considered a frequency coefficient and has a magnitude sqrt(R*R + I*I). Parseval's theorem says that the sum of all the frequency-domain magnitudes squared is equal to the energy of the time-domain signal (possibly up to a scaling factor involving the FFT length, depending on your particular DFT/FFT library implementation).
One example of a time domain signal is voltage on a wire, which when measured in Volts times Amps into Ohms represents power, or over time, energy. Probably the word "energy" in the strictly numerical case is derived from historical usage from physics or engineering, where the numbers meant something that could burn your fingers.
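A small NumPy sketch that ties these terms together (np.fft.fft is unnormalized, so the frequency-domain sum is divided by the length N; that is the scaling factor mentioned above):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1024)                     # a real time-domain signal

X = np.fft.fft(x)                             # the complex FFT coefficients
real, imag = X.real, X.imag                   # the two "output arrays"
magnitude = np.sqrt(real**2 + imag**2)        # same as np.abs(X)

energy_time = np.sum(x**2)
energy_freq = np.sum(magnitude**2) / len(x)   # divide by N for an unnormalized FFT

print(np.isclose(energy_time, energy_freq))   # True: Parseval's theorem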