Is there any reason to (not) L2-normalize vectors before using cosine similarity? - normalization

I was reading the paper "Improving Distributional Similarity
with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering what the motivation might be for normalizing before taking the cosine (both in terms of text2vec and in general). I tried to read up on the L2 norm, but it mostly comes up in the context of normalizing before using the Euclidean distance. Surprisingly, I could not find anything on whether the L2 norm is recommended for or against when using cosine similarity on word vector spaces/embeddings, and I don't quite have the math skills to work out the analytic differences myself.
So here is the question, meant in the context of word vector spaces learned from textual data (either plain co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe) and of calculating word similarity (the goal of course being a vector space + metric that best reflects real-world word similarities). In simple terms: is there any reason to (not) apply the L2 norm to a word-feature/term-co-occurrence matrix before calculating cosine similarity between the vectors/words?

If you want cosine similarity, you DON'T need to L2-normalize first and then compute it: cosine similarity normalizes the vectors anyway before taking their dot product, so the result is the same either way.
If you are calculating Euclidean distance, you DO need to normalize when distance/vector length should not be a distinguishing factor. If vector length is a meaningful distinguishing factor, don't normalize and calculate the Euclidean distance as it is.
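To see this concretely, here is a minimal NumPy sketch (the vectors are made up) showing that cosine similarity is unchanged by rescaling or L2-normalizing the inputs, while Euclidean distance is not:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.5, 1.0])

# Cosine similarity already divides by the vector lengths,
# so rescaling an input does not change it:
print(cosine(x, y), cosine(10 * x, y))            # identical values

# Euclidean distance does change with vector length:
print(np.linalg.norm(x - y), np.linalg.norm(10 * x - y))

# After L2 normalization, a plain dot product *is* the cosine similarity:
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(xn @ yn, cosine(x, y))                      # same number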

text2vec handles everything automatically: sim2 will scale the rows to unit L2 norm and then use a dot product to compute cosine similarity.
But if the matrix already has rows with unit L2 norm, the user can specify norm = "none" and sim2 will skip the normalization step (which saves some computation).
I understand the confusion; I should probably remove the norm option, since normalizing the matrix doesn't take much time anyway.
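For intuition, this is roughly the computation described above, sketched in Python/NumPy rather than R (a sketch of the idea, not the actual text2vec code): row-normalize, then a single matrix product gives all pairwise cosine similarities.

import numpy as np

def cosine_sim_matrix(X, Y):
    # L2-normalize each row; all pairwise cosine similarities then
    # reduce to one matrix product, as sim2 does with norm = "l2".
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

X = np.random.rand(5, 10)    # 5 "word" vectors of dimension 10
Y = np.random.rand(3, 10)
print(cosine_sim_matrix(X, Y).shape)   # (5, 3) matrix of similarities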

Related

TfidfVectorizer for Word Embedding/Vectorization

I want to compare two bodies of text (A and B) and check the similarity between them.
Here's my current approach:
Turn both bodies of text into vectors
Compare these vectors using a cosine similarity measure
Return the result
The very first step is what is giving me pause. How would I do this with TfidfVectorizer? Is it enough to put both bodies of text in a list, fit_transform them, and then feed the resulting matrices into my cosine similarity measure?
Is there some training process with TfidfVectorizer, i.e. a vocabulary learned via fit()? If so, how do I turn A and B into vectors so that I can put them into a cosine similarity measure?
P.S. I understand what other options exist; I'm curious specifically about TfidfVectorizer.
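For what it's worth, the approach described in the question (put both texts in a list, fit_transform them, then compare the two rows) would look roughly like this with scikit-learn; the texts below are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text_a = "the cat sat on the mat"          # placeholder for body of text A
text_b = "a cat was sitting on a mat"      # placeholder for body of text B

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from both texts
# and returns one TF-IDF row per text.
tfidf = vectorizer.fit_transform([text_a, text_b])

# Cosine similarity between the two rows.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])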

ORB/BFMatcher - why norm_hamming distance?

I'm using the OpenCV implementation of ORB along with the BFMatcher. The OpenCV documentation states that NORM_HAMMING should be used with ORB.
Why is this? What advantages does NORM_HAMMING offer over other methods such as Euclidean distance, NORM_L1, etc.?
When comparing descriptors in computer vision, the Euclidean distance is usually understood as the square root of the sum of the squared differences between the two vectors' elements.
The ORB descriptors are vectors of binary values. If you apply the Euclidean distance to binary vectors, the squared difference for a single element is always 1 or 0, which says nothing about the magnitude of the difference between the elements. The overall Euclidean distance is then just the square root of the sum of those ones and zeroes, again not a good estimator of the difference between the vectors.
That's why the Hamming distance is used: it is simply the number of elements (bits) that differ. As noted by Catree, it can be computed with a simple Boolean operation on the vectors: XOR the two descriptors and count the set bits (see the worked example below).
ORB (ORB: an efficient alternative to SIFT or SURF) is a binary descriptor.
It should be more efficient (in terms of computation) to use the Hamming distance rather than the L1/L2 distance, since the Hamming distance can be implemented with an XOR followed by a bit count (see BRIEF: Binary Robust Independent Elementary Features):
Furthermore, comparing strings can be done by computing the Hamming distance, which can be done extremely fast on modern CPUs that often provide a specific instruction to perform a XOR or bit count operation, as is the case in the latest SSE [10] instruction set.
Of course, with a classical descriptor like SIFT, you cannot use the HAMMING distance.
You can check this yourself:
D1=01010110
D2=10011010
L2_dist(D1,D2)=sqrt(4)=2
XOR(D1,D2)=11001100 ; bit_count(11001100)=4
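The same check can be reproduced in a few lines; here is a small NumPy sketch using the descriptor bytes from the example above (real ORB descriptors are 32-byte uint8 arrays):

import numpy as np

d1 = np.array([0b01010110], dtype=np.uint8)
d2 = np.array([0b10011010], dtype=np.uint8)

xor = np.bitwise_xor(d1, d2)             # -> 0b11001100
hamming = int(np.unpackbits(xor).sum())  # count of differing bits
print(bin(int(xor[0])), hamming)         # 0b11001100 4

# cv2.norm(d1, d2, cv2.NORM_HAMMING) computes the same value, and
# BFMatcher(cv2.NORM_HAMMING) uses this distance for ORB descriptors.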
L1/L2 distance is used for float-valued descriptors (e.g. SIFT, SURF), and the Hamming distance is used for binary descriptors (AKAZE, ORB, BRIEF, etc.).

Relation between max-margin and vector support in SVM

I am reading the mathematical formulation of SVM, and in many sources I have found the idea that the "max-margin hyperplane is completely determined by those \vec{x}_i which lie nearest to it. These \vec{x}_i are called support vectors."
Could an expert explain this consequence mathematically, please?
This is a consequence of the representer theorem.
The separating hyperplane's normal vector w can be represented as w = SUM_{i=1..m} alpha_i * phi(x_i), where m is the number of examples in your training sample, x_i is example i, and phi is the function that maps x_i into some feature space.
So the SVM algorithm finds a vector Alpha = (alpha_1, ..., alpha_m), one alpha_i per x_i. Every training example x_i whose alpha_i is NOT zero is a support vector of w.
Hence the name: Support Vector Machine.
If your data is separable, only the support vectors that lie close to the separating margin are needed, and the rest of the training set can be discarded (their alphas are 0). How many support vectors the algorithm ends up using depends on the complexity of the data and on the kernel you are using.
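A small scikit-learn sketch (synthetic, arbitrary toy data) makes this visible: after training, only the points with non-zero alpha are retained as support vectors, and they alone determine the hyperplane.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2],    # class +1 cluster
               rng.randn(50, 2) - [2, 2]])   # class -1 cluster
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only a handful of the 100 training points get a non-zero alpha.
print(len(clf.support_), "support vectors out of", len(X))
print(clf.dual_coef_)    # stores y_i * alpha_i for the support vectors only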

Why does Apache Mahout ItemSimilarity use LP-Space normalization

Why is L_p-space normalization used in Mahout's VectorNormMapper for item similarity? I have also read that a norm power of 2 works well for CosineSimilarity.
Is there an intuitive explanation of why it is used, and how can the best value of the power be determined for a given Similarity class?
Vector norms can be defined for any L_p metric. Different norms have different properties, depending on the problem you are working on. Common values of p include 1 and 2, with 0 used occasionally.
Certain similarity functions in Mahout are closely related to a particular norm. Your example of cosine similarity is a good one. The cosine similarity is computed by scaling both input vectors to unit L_2 length and then taking their dot product. This value equals the cosine of the angle between the vectors when they are expressed in Cartesian coordinates. It is also equal to 1 - d^2/2, where d is the L_2 norm of the difference between the normalized vectors.
This means that there is an intimate connection between cosine similarity and L_2 distance.
Does that answer your question?
These questions are likely to get answered more quickly on the Apache Mahout mailing lists, btw.
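To make the cosine/L_2 connection concrete, here is a quick numerical check (a NumPy sketch with made-up vectors) of the relation cos = 1 - d^2/2:

import numpy as np

u = np.array([3.0, 1.0, 2.0])
v = np.array([1.0, 4.0, 0.5])

cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

un, vn = u / np.linalg.norm(u), v / np.linalg.norm(v)
d = np.linalg.norm(un - vn)    # L_2 distance between the normalized vectors

print(cos, 1 - d**2 / 2)       # the two values coincide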

Text Classification - how to find the features that most affected the decision

When using SVMlight or LIBSVM to classify phrases as positive or negative (sentiment analysis), is there a way to determine which words were most influential in the algorithm's decision? For example, finding that the word "good" helped classify a phrase as positive, etc.
If you use the linear kernel then yes - simply compute the weight vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which has dimensions 1 x d where d is your data dimension (the number of words in the bag-of-words/tf-idf representation), simply select the dimensions with the highest absolute values (regardless of sign) to find the most important features (words).
If you use some other kernel (like RBF) then the answer is no: there is no direct way of extracting the most important features, as the classification is performed in a completely different way.
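As a sketch of how the weight computation above looks in practice (shown here with scikit-learn's SVC on made-up data rather than SVMlight):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(40, 6)                          # 40 "documents", 6 "words"
y = np.where(X[:, 0] + X[:, 3] > 1, 1, -1)   # toy labels

clf = SVC(kernel="linear").fit(X, y)

# w = SUM_i y_i alpha_i sv_i ; in scikit-learn dual_coef_ already
# stores y_i * alpha_i, so the sum is a single matrix product.
w = clf.dual_coef_ @ clf.support_vectors_    # shape (1, d)
print(np.allclose(w, clf.coef_))             # matches the fitted weights

# Most influential features = largest absolute weights.
print(np.argsort(-np.abs(w.ravel())))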
As @lejlot mentioned, with a linear kernel in SVM one feature-ranking strategy is based on the absolute values of the weights in the model. Another simple and effective strategy is based on the F-score; it considers each feature separately and therefore cannot reveal mutual information between features. You can also gauge how important a feature is by removing it and observing the change in classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, feature ranking is not as straightforward, yet it is still feasible. You can construct an orthogonal set of basis vectors in the kernel space and calculate the weights with kernel Relief; the implicit feature ranking can then be done based on the absolute values of those weights. Finally, the data is projected into the learned subspace.
