How to measure similarity between code snippets written in programming language - machine-learning

I am trying to solve the following problem.
Given a particular code snippet I need to give back the top review comments for the code snippet, here we want to give all the comments that were given to similar code snippets.
I am trying to form it as a machine learning problem.I think we can use KNN algorithm, but here I am not sure how should I measure the similarity between two code snippet ? Is there any pre-existing similarity measure for it ? I tried to search in google but not found any useful link
Kindly help

Edit distance between the two strings containing the considered comments could be a useful measure of similarity.
Also, n-gram cosine distance could be useful, that is, you extract the n-grams (e.g. 3-char-segments), build the vectors counting these n-grams and calculate cosine distance.
Another one would be Jaccard similarity between n-gram vectors (as above).

Related

Get sentence vector for a K-means clustering task

I am working on a project which groups jobs posted on various job portals into clusters based on the description of the jobs using K-means.
I found the work vector using Word2Vec, but i guess this will not serve the purpose as I will need a vector of the whole job description.
I know that I can average out the word vector of a sentence to get the sentence vector but worried about the accuracy as this will loose the ordering of the words.
Is there any other way I can get the vectors ?
The most using approaches for text vectorization:
Pure TF-IDF, still can be useful, especially using n-grams.
Using Word2Vec to get vectors for the words. For the whole text using the mean value of all vectors.
Combine the first two methods: get a weighted mean of all words in the text using the coefficients from the TF-IDF.
I would suggest trying each and pick what is performed better in your case. The results can be slightly different depends on the nature of the data.
You can facilitate transfer learning by very useful sentence embedding methods such as Bert-as-service or SentenceBert or even Universal Sentence encoding. All of them are easy to use and full of tutorials on the web. They will work better then TF-IDF in most cases.
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html

Comparison of two normal distribution

I have two normally distributed samples. I want to know how close or similar it is. I tried few methods to find the similarity, like z-score and bhattacharyya distance.
Bhattacharyya distance didn't work for me. It gives the same distance if the standard deviation of two samples is same. It doesn't change with change in mean.
I want to know whether any method is available that take the samples or its mean and standard deviation to find the similarity or similarity rank something like this.
I am not from mathematics background, so please ignore the terminology mistakes and let me know if any clarification is required.
I assume you're not looking for a relationship between the two samples, where a correlation coefficient would be appropriate?
I've been investigating a similar question for my current data and am looking at the Mahalanobis distance and the Earthmovers distance.
I found this post from a different forum which gave me a few ideas

Recent methods for finding semantic similarity between two short sentences or articles (on a concept level)

I'm working on finding similarities between short sentences and articles. I used many existing methods such as tf-idf, word2vec etc but the results are just okay. The most relevant measure which I found was word moving distance, however, its results are not that better than the other measures. I know it's a challenging problem, however, I am wondering if there are any new methods to find an approximate similarity more on a higher or concept level than just matching words. Especially, any alternative new methods like word moving distance which looks at slightly higher semantic of a sentence or article?
This is the most recent basing on a paper published 4 months ago.
Step 1:
Load the suitable model using gensim and calculate the word vectors for words in the sentence and store them as a word list
Step 2 : Computing the sentence vector
The calculation of semantic similarity between sentences was difficult before but recently a paper named "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS" was proposed which suggests a simple approach by computing the weighted average of word vectors in the sentence and then remove the projections of the average vectors on their first principal component.Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency called smooth inverse frequency.this method performing significantly better.
A simple code to calculate the sentence vector using SIF(smooth inverse frequency) the method proposed in the paper has been given here
Step 3: using sklearn cosine_similarity load two vectors for the sentences and compute the similarity.
This is the most simple and efficient method to compute the semantic similarity of sentences.
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
To be honest the best thing I know to use for this at the moment is AMR:
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
To have a quick look you can enter bird-noun-1 and chair-noun-1 and select wordnet at http://labs.fc.ul.pt/dishin/ it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn

Document similarity: Vector embedding versus Tf-Idf performance?

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:
A vector embedding (word2vec, GloVe or fasttext), averaging over word vectors in a document, and using cosine similarity.
Bag-of-Words: tf-idf or its variations such as BM25.
Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaging word2vec for document similarity?
Is there another approach, that allows to dynamically refine the document's vectors as more text is added?
doc2vec or word2vec ?
According to article, the performance of doc2vec or paragraph2vec is poor for short-length documents. [Learning Semantic Similarity for Very Short Texts, 2015, IEEE]
Short-length documents ...?
If you want to compare the similarity between short documents, you might want to vectorize the document via word2vec.
how construct ?
For example, you can construct a document vector with a weighted average vector using tf-idf.
similarity measure
In addition, I recommend using ts-ss rather than cosine or euclidean for similarity.
Please refer to the following article or the summary in github below.
"A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
thank you
You have to try it: the answer may vary based on your corpus and application-specific perception of 'similarity'. Effectiveness may especially vary based on typical document lengths, so if "rapidly growing with time" also means "growing arbitrarily long", that could greatly affect what works over time (requiring adaptations for longer docs).
Also note that 'Paragraph Vectors' – where a vector is co-trained like a word vector to represent a range-of-text – may outperform a simple average-of-word-vectors, as an input to similarity/classification tasks. (Many references to 'Doc2Vec' specifically mean 'Paragraph Vectors', though the term 'Doc2Vec' is sometimes also used for any other way of turning a document into a single vector, like a simple average of word-vectors.)
You may also want to look at "Word Mover's Distance" (WMD), a measure of similarity between two texts that uses word-vectors, though not via any simple average. (However, it can be expensive to calculate, especially for longer documents.) For classification, there's a recent refinement called "Supervised Word Mover's Distance" which reweights/transforms word vectors to make them more sensitive to known categories. With enough evaluation/tuning data about which of your documents should be closer than others, an analogous technique could probably be applied to generic similarity tasks.
You also might consider trying Jaccard similarity, which uses basic set algebra to determine the verbal overlap in two documents (although it is somewhat similar to a BOW approach). A nice intro on it can be found here.

Clustering of news articles

My scenario is pretty straightforwrd: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I now would like to group these articles based on shared story/topic, i.e., based on their similarity.
What I did so far is to apply basic NLP techniques including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with this can also calculate the, e.g., cosine similarity based on these tf-idf-vectors. But now with the grouping of the articles I struggles a bit. I see two principle ways -- probably related -- to do it:
1) Machine Learning / Clustering: I already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive to specify (for me that is).
2) Graph algorithms: I can represent my data as a graph with the articles being the nodes and weighted adges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then might apply graph algorithms to look for strongly-connected subgraphs.
In short, I'm not sure where best to go from here -- I'm still pretty new in this area. I wonder if there some best practices for that, or some kind of guidelines which methods / algorithms can (not) be applied in certain scenarios.
(EDIT: forgot to link to related question of mine)
Try the class of Hierarchical Agglomerative Clustering HAC algorithms with Single and Complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stop based on a threshold criteria. A closely related class is the Divisive clustering algorithms which first builds up the minimal spanning tree and then prunes off a branch of the tree based on inter-cluster similarity ratios.
You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.
Since you can model your dataset as a graph you could apply stochastic clustering based on markov models. Here are link for resources on MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)

Resources