Clustering based on pairwise similarity? - machine-learning

Suppose I have a list of element pairs and corresponding similarity scores for each one. I want to be able to cluster the elements in this list based on their similarity to one-another. Is there an established method for doing this?

You can use a density-based clustering algorithm such as DBSCAN or HDBSCAN. For example, if you want to find the neighbors of a point p that lie inside a circle of radius epsilon around p, you can find them by checking 1 - sim(pi, p) < epsilon: if sim(pi, p) is the similarity between pi and p, then 1 - sim(pi, p) serves as the distance between the two points.
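A minimal sketch of that idea with scikit-learn's DBSCAN, assuming the pairs come as (i, j, similarity) tuples and that any unobserved pair is treated as maximally distant (both of these are assumptions on my part, not part of the question):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical input: element pairs with similarity scores in [0, 1].
pairs = [(0, 1, 0.90), (0, 2, 0.80), (1, 2, 0.85), (3, 4, 0.95)]
n = 5  # number of elements

# Turn similarities into distances: distance = 1 - similarity.
# Pairs that were never scored are treated as maximally distant (distance 1).
D = np.ones((n, n))
np.fill_diagonal(D, 0.0)
for i, j, sim in pairs:
    D[i, j] = D[j, i] = 1.0 - sim

# DBSCAN accepts a precomputed distance matrix; eps plays the role of epsilon.
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)  # one cluster label per element, -1 means noise
```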

Related

TfidfVectorizer for Word Embedding/Vectorization

I want to compare two bodies of text (A and B) and check the similarity between them.
Here's my current approach:
1. Turn both bodies of text into vectors
2. Compare these vectors using a cosine similarity measure
3. Return the result
The very first step is what is giving me pause. How would I do this with TfidfVectorizer? Is it enough to put both bodies of text in a list, fit_transform them, and then put the resulting matrices into my cosine similarity measure?
Is there some training process with TfidfVectorizer, a vocabulary matrix (fit())? If so, how do I turn A and B into vectors so that I can put them into a cosine similarity measure?
P.S. I understand what other options exist; I'm curious specifically about TfidfVectorizer.
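For reference, a rough sketch of what that approach could look like with scikit-learn (the texts A and B below are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

A = "the cat sat on the mat"
B = "a cat lay on a rug"

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from both texts
# and returns one TF-IDF row vector per text.
tfidf = vectorizer.fit_transform([A, B])   # shape: (2, vocabulary size)

similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(similarity)
```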

Compute the similarity of two graphs of different sizes

I have two graphs G and G' (of different sizes) and I want to check how similar they are. I have read that the Wasserstein distance is used in this case.
How can I use it?
In scipy there is the function:
scipy.stats.wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None)
How can I pass G and G' as u_values and v_values?
EDIT:
I got the idea from this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0228728&type=printable
Where they write:
Inspired by the rich connections between graph theory and geometry, one can define a notion of distance between any two graphs by extending the notion of distance between metric spaces [58]. The construction proceeds as follows: each graph is represented as a metric space, wherein the metric is simply the shortest distance on the graph. Two graphs are equivalent if there exists an isomorphism between the graph represented as metric spaces. Finally, one can define a distance between two graphs G1 and G2 (or rather between the two classes of graph isometric to G1 and G2 respectively) by considering standard notions of distances between isometry classes of metric spaces [59]. Examples of such distances include the Gromov-Hausdorff distance [59], the Kantorovich-Rubinstein distance and the Wasserstein distance [60], which both require that the metric spaces be equipped with probability measures.
It is not clear to me though how to do this.
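For what it's worth, one much-simplified reading of the quoted construction is to represent each graph by the distribution of its shortest-path distances and to compare those two distributions with scipy.stats.wasserstein_distance. The sketch below does only that (the example graphs are placeholders); it is not the full Gromov-Hausdorff / Gromov-Wasserstein machinery described in the paper:

```python
import networkx as nx
from scipy.stats import wasserstein_distance

# Placeholder graphs of different sizes.
G1 = nx.erdos_renyi_graph(30, 0.2, seed=0)
G2 = nx.erdos_renyi_graph(50, 0.1, seed=1)

def path_length_distribution(graph):
    """Flatten all pairwise shortest-path lengths into a single list."""
    return [d for _, dists in nx.shortest_path_length(graph)
            for d in dists.values()]

dist = wasserstein_distance(path_length_distribution(G1),
                            path_length_distribution(G2))
print(dist)
```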

Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering what the motivation might be for normalizing and then using cosine similarity (both in terms of text2vec and in general). I tried to read up on the L2 norm, but mostly it comes up in the context of normalizing before using the Euclidean distance. I could not find (surprisingly) anything on whether the L2 norm would be recommended for or against in the case of cosine similarity on word vector spaces/embeddings. And I don't quite have the math skills to work out the analytic differences.
So here is the question, meant in the context of word vector spaces learned from textual data (either plain co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe), and calculating word similarity (with the goal, of course, of using a vector space plus metric that best reflects real-world word similarities). Is there, in simple words, any reason to (not) use the L2 norm on a word-feature matrix/term-co-occurrence matrix before calculating cosine similarity between the vectors/words?
If you want cosine similarity, you DON'T need to L2-normalize first and then calculate cosine similarity: cosine similarity already normalizes the vectors and then takes the dot product of the two.
If you are calculating Euclidean distance, then you NEED to normalize if distance or vector length is not an important distinguishing factor. If vector length is a distinguishing factor, then don't normalize and calculate the Euclidean distance as it is.
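A quick numeric check of that point (the vectors are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# Cosine similarity computed directly.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the L2-normalized vectors.
dot_of_normalized = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))

print(np.isclose(cosine, dot_of_normalized))  # True
```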
text2vec handles everything automatically: it will give the rows unit L2 norm and then call the dot product to calculate cosine similarity.
But if the matrix already has rows with unit L2 norm, the user can specify norm = "none" and sim2 will skip the first normalization step (which saves some computation).
I understand the confusion; probably I need to remove the norm option (it doesn't take much time to normalize the matrix).

Which machine learning algorithm to use for high dimensional matching?

Let's say I can define a person in 1,000 different ways, so I have 1,000 features for a given person.
PROBLEM: How can I run a machine learning algorithm to determine the best possible match, or the closest/most similar person, given the 1,000 features?
I have attempted k-means, but it appears to be better suited to a couple of features than to high dimensions.
You are basically after some kind of k-nearest-neighbors (k-NN) algorithm.
Since your data has high dimension, you should explore the following:
Dimensionality reduction - You may have 1,000 features, but probably some of them are better than others, so it would be a wise move to apply some kind of dimensionality reduction. The easiest place to start is Principal Component Analysis (PCA): keep enough eigenvectors to preserve ~90% of the energy, as measured by their matching eigenvalues. I would assume you'll see a significant reduction from this.
Accelerated k-nearest neighbors - There are many methods out there to accelerate the k-NN search in the high-dimensional case. The k-d tree algorithm would be a good start.
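A short sketch tying these two points together, assuming the people are the rows of an (n_people, 1000) array; the data and parameter values below are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 1000)          # hypothetical person-feature matrix

# Dimensionality reduction: keep enough components for ~90% of the variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

# Accelerated k-NN search on the reduced vectors (k-d tree index).
nn = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced[:1])  # most similar people to person 0
print(indices)
```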
Distance metrics
You can try to apply a distance metric (e.g. cosine similarity) directly.
Supervised
If you know how similar the people are, you can try the following:
Neural networks, Approach #1
Input: 2x the person feature vector (hence 2000 features)
Output: 1 float (similarity of the two people)
Scalability: Linear with the number of people
See neuralnetworksanddeeplearning.com for a nice introduction and Keras for a simple framework
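A minimal sketch of Approach #1, assuming each training example is two concatenated 1000-dimensional person vectors with a known similarity label; the architecture and the random training data are purely illustrative:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 1000  # features per person

model = keras.Sequential([
    keras.Input(shape=(2 * n_features,)),  # two person vectors, concatenated
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # similarity score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")

# Hypothetical training data: X_pairs is (n_pairs, 2000), y_sim is (n_pairs,).
X_pairs = np.random.rand(500, 2 * n_features)
y_sim = np.random.rand(500)
model.fit(X_pairs, y_sim, epochs=5, batch_size=32)
```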
Neural networks, Approach #2
A more advanced approach is called metric learning.
Input: the person feature vector (hence 1000 features)
Output: k floats (you choose k, but it should be lower than 1000)
For training, you first give the network one person and store the result, then give it the second person and store that result, apply a distance metric of your choice (e.g. Euclidean distance) to the two results, and then backpropagate the error.
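A rough sketch of this metric-learning idea, assuming pairs labelled as similar (1) or dissimilar (0) and a contrastive-style loss; the encoder architecture, k, and the margin are illustrative choices, not part of the original answer:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

n_features, k = 1000, 32  # k-dimensional embedding, k < 1000

# Shared encoder: maps one person vector to a k-dimensional embedding.
encoder = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(k),
])

in_a = keras.Input(shape=(n_features,))
in_b = keras.Input(shape=(n_features,))
emb_a, emb_b = encoder(in_a), encoder(in_b)

# Euclidean distance between the two embeddings.
distance = layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True)),
    output_shape=(1,),
)([emb_a, emb_b])

model = keras.Model([in_a, in_b], distance)

# Contrastive loss: pull similar pairs (label 1) together,
# push dissimilar pairs (label 0) at least `margin` apart.
def contrastive_loss(y_true, d, margin=1.0):
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d)
                          + (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

model.compile(optimizer="adam", loss=contrastive_loss)
```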

Latent semantic analysis (LSA) singular value decomposition (SVD) understanding

Bear with me through my modest understanding of LSI (Mechanical Engineering background):
After performing SVD in LSI, you have 3 matrices:
U, S, and V transpose.
U compares words with topics and S is a sort of measure of strength of each feature. Vt compares topics with documents.
U dot S dot Vt
returns the original matrix before SVD. Without doing much (any) in-depth algebra, it seems that:
U dot S dot **Ut**
returns a term-by-term matrix, which provides a comparison between the terms, i.e. how related one term is to other terms - a DSM (design structure matrix) of sorts that compares words instead of components. I could be completely wrong, but I tried it on a sample data set and the results seemed to make sense. It could just be bias, though (I wanted it to work, so I saw what I wanted). I can't post the results, as the documents are protected.
My question though is: Does this make any sense? Logically? Mathematically?
Thanks for any time/responses.
If you want to know how related one term is to another, you can just compute
(U dot S)
The terms are represented by the row vectors. You can then compute the distance matrix by applying a distance function such as the Euclidean distance. Once you build the distance matrix by computing the distance between all pairs of vectors, the resulting matrix should be symmetric with a zero diagonal and all other distances > 0. If the distance A[i,j] is small, then the two terms are related; otherwise they are not.
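A small sketch of that computation, assuming a term-document matrix with terms as rows; the matrix and the number of topics k are placeholders:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(20, 50)        # hypothetical term-document matrix (terms x docs)
k = 5                             # number of latent topics to keep

U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :k] * S[:k]   # rows of U·S: one k-dimensional vector per term

# Pairwise Euclidean distances between term vectors: the result is symmetric
# with a zero diagonal; small entries indicate closely related terms.
D = squareform(pdist(term_vectors, metric="euclidean"))
print(D.shape)                    # (20, 20)
```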
