I want to implement a sentence similarity algorithm. Is it possible to implement it using a sequence prediction algorithm? If so, what kind of approach should I go forward with? Or is there another method that is more suitable for sentence similarity? Please share your views.
You could treat your sentences as separate documents and then use a traditional approach for finding similarity between documents. This was answered here using sklearn:
Similarity between two text documents
If you want, you could also implement the same approach in TensorFlow.
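A minimal scikit-learn sketch of that TF-IDF plus cosine-similarity idea (the example sentences are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Treat each sentence as its own "document".
sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
]

# Fit TF-IDF on the sentences and vectorize them.
tfidf = TfidfVectorizer().fit_transform(sentences)

# Pairwise cosine similarity between all sentence vectors.
print(cosine_similarity(tfidf))
```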
I also strongly recommend reading this answer, which covers more sophisticated approaches: https://stackoverflow.com/a/15173821/3633250
You could consider using Doc2Vec. Each sentence (document) is mapped to an n-dimensional vector space. To find the most similar document:
model.most_similar("documentID")
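As a rough illustration with gensim (the saved-model path and the "documentID" tag are placeholders; in gensim 4.x the document vectors live under model.dv, in older versions under model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec

# Load a Doc2Vec model previously trained on documents tagged with string IDs.
model = Doc2Vec.load("my_doc2vec.model")  # placeholder path

# The 5 documents most similar to the one tagged "documentID".
print(model.dv.most_similar("documentID", topn=5))
```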
I am working on a project which groups jobs posted on various job portals into clusters based on the description of the jobs using K-means.
I found the word vectors using Word2Vec, but I guess this will not serve the purpose, as I will need a vector for the whole job description.
I know that I can average the word vectors of a sentence to get a sentence vector, but I am worried about accuracy, as this loses the ordering of the words.
Is there any other way I can get the vectors?
The most commonly used approaches for text vectorization are:
Pure TF-IDF, which can still be useful, especially with n-grams.
Using Word2Vec to get vectors for the words, then representing the whole text by the mean of all its word vectors.
A combination of the first two: a weighted mean of all word vectors in the text, using the TF-IDF coefficients as weights (see the sketch below).
I would suggest trying each and picking whichever performs better in your case. The results can differ depending on the nature of the data.
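A rough sketch of the third option (TF-IDF-weighted mean of Word2Vec vectors); the toy corpus, the tiny Word2Vec model, and the hyperparameters are placeholders for your own data and trained vectors:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "software engineer python machine learning",
    "data scientist statistics python",
]
tokenized = [d.split() for d in docs]

# Toy Word2Vec model, trained here only for the sketch.
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=50)

# TF-IDF weight of each word in each document.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # word -> column index

def doc_vector(doc_idx, tokens):
    """TF-IDF-weighted mean of the word vectors in one document."""
    vecs, weights = [], []
    for w in tokens:
        if w in w2v.wv and w in vocab:
            vecs.append(w2v.wv[w])
            weights.append(X[doc_idx, vocab[w]])
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(vecs, axis=0, weights=weights)

doc_vecs = [doc_vector(i, toks) for i, toks in enumerate(tokenized)]
```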
You can take advantage of transfer learning with sentence embedding methods such as bert-as-service, Sentence-BERT, or the Universal Sentence Encoder. All of them are easy to use, and there are plenty of tutorials on the web. They will work better than TF-IDF in most cases.
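For example, with the sentence-transformers package (the checkpoint name below is just one commonly used Sentence-BERT model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

sentences = ["How old are you?", "What is your age?"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))
```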
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html
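A minimal sketch of training it on a few tagged documents and querying with a new one (the toy corpus and hyperparameters are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "machine learning engineer with python experience",
    "frontend developer familiar with javascript and react",
    "data analyst working with sql and statistics",
]
corpus = [TaggedDocument(words=d.split(), tags=[str(i)])
          for i, d in enumerate(raw_docs)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

# Infer a vector for an unseen document and find the closest training docs.
new_vec = model.infer_vector("python machine learning job".split())
print(model.dv.most_similar([new_vec], topn=2))
```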
Is there a way to compare two vectors that do not follow any ordering semantics among their elements, using an ML algorithm?
Example: compare (1,3,5) with (9,7,5) and arrive at some result, then use that result to check how close or far apart they are. Then, when I see (2,6,4), determine whether it is closer to (1,3,5) or to (9,7,5) under some notion of similarity that takes each element into account.
While I can use my own custom algorithms, I am trying to check if there's any known, standard or established ML algorithm for this kind of use case.
Thanks
Found the answer - cosine similarity
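For completeness, a quick check of those example vectors with scikit-learn's cosine_similarity (higher values mean more similar):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 3, 5]])
b = np.array([[9, 7, 5]])
c = np.array([[2, 6, 4]])

print(cosine_similarity(c, a))  # similarity of (2,6,4) to (1,3,5)
print(cosine_similarity(c, b))  # similarity of (2,6,4) to (9,7,5)
```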
I'm working on finding similarities between short sentences and articles. I used many existing methods such as TF-IDF and word2vec, but the results are just okay. The most relevant measure I found was Word Mover's Distance; however, its results are not much better than those of the other measures. I know it's a challenging problem, but I am wondering if there are any new methods that find an approximate similarity at a higher, more conceptual level than just matching words, especially alternatives to Word Mover's Distance that look at the higher-level semantics of a sentence or article.
This is the most recent approach, based on a paper published four months ago.
Step 1:
Load a suitable model using gensim, compute the word vectors for the words in the sentence, and store them in a list.
Step 2: Computing the sentence vector
Computing semantic similarity between sentences used to be difficult, but recently the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" proposed a straightforward approach: compute the weighted average of the word vectors in the sentence, and then remove the projections of the average vectors onto their first principal component. Here the weight of a word w is a/(a + p(w)), where a is a parameter and p(w) is the (estimated) word frequency; this weighting is called smooth inverse frequency (SIF). This method performs significantly better than an unweighted average.
A simple implementation that calculates the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here.
Step 3: Load the two sentence vectors and compute their similarity using sklearn's cosine_similarity.
This is the simplest and most efficient way to compute the semantic similarity of sentences.
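A rough sketch of the three steps, assuming pre-trained word vectors loadable with gensim and some estimate of unigram frequencies; the model path, the frequency table, and the sentences are placeholders, and a = 1e-3 is the value suggested in the paper:

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: load word vectors (placeholder path).
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Estimated unigram probabilities p(w); compute these from a large corpus.
word_freq = {"the": 0.05, "a": 0.04, "on": 0.02, "cat": 0.001,
             "dog": 0.001, "sat": 0.0005, "mat": 0.0004}
a = 1e-3  # SIF parameter from the paper

def sif_vector(sentence):
    """Step 2a: smooth-inverse-frequency weighted average of word vectors."""
    words = [w for w in sentence.lower().split() if w in wv]
    weights = [a / (a + word_freq.get(w, 1e-5)) for w in words]
    return np.average([wv[w] for w in words], axis=0, weights=weights)

sentences = ["the cat sat on the mat", "a dog sat on a mat"]
emb = np.vstack([sif_vector(s) for s in sentences])

# Step 2b: remove the projection onto the first principal component.
pc = TruncatedSVD(n_components=1).fit(emb).components_  # shape (1, dim)
emb = emb - emb @ pc.T @ pc

# Step 3: cosine similarity between the two sentence vectors.
print(cosine_similarity(emb[0:1], emb[1:2]))
```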
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
To be honest, the best thing I know of for this at the moment is AMR (Abstract Meaning Representation):
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
For a quick look, you can enter bird-noun-1 and chair-noun-1 and select WordNet at http://labs.fc.ul.pt/dishin/ ; it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn
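If you would rather stay in Python, NLTK exposes the same three measures over WordNet; the exact values will differ from DiShIn's, since they depend on the information-content corpus used (Brown here):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Requires the 'wordnet' and 'wordnet_ic' NLTK data packages.
brown_ic = wordnet_ic.ic("ic-brown.dat")

bird = wn.synset("bird.n.01")
chair = wn.synset("chair.n.01")

print("Resnik        ", bird.res_similarity(chair, brown_ic))
print("Lin           ", bird.lin_similarity(chair, brown_ic))
print("Jiang&Conrath ", bird.jcn_similarity(chair, brown_ic))
```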
I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:
A vector embedding (word2vec, GloVe or fasttext), averaging over word vectors in a document, and using cosine similarity.
Bag-of-Words: tf-idf or its variations such as BM25.
Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaging word2vec for document similarity?
Is there another approach that allows the document vectors to be refined dynamically as more text is added?
Doc2vec or word2vec?
According to the article, the performance of doc2vec (paragraph2vec) is poor for short documents [Learning Semantic Similarity for Very Short Texts, 2015, IEEE].
Short-length documents...?
If you want to compare the similarity of short documents, you might want to vectorize each document via word2vec.
How to construct the vector?
For example, you can construct a document vector as a weighted average of its word vectors, using TF-IDF weights.
Similarity measure
In addition, I recommend using TS-SS rather than cosine or Euclidean similarity.
Please refer to the following article or the summary on GitHub below.
"A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
thank you
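For reference, a small NumPy sketch of TS-SS as I read it from the paper and the repository above (triangle area times sector area; lower values mean more similar, and you should double-check the details against the repo before relying on it):

```python
import math
import numpy as np

def ts_ss(v1, v2):
    """Triangle Similarity * Sector Similarity between two vectors."""
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Angle between the vectors plus 10 degrees, as in the paper.
    theta_deg = math.degrees(math.acos(np.clip(cos, -1.0, 1.0))) + 10.0
    # Triangle area spanned by the two vectors.
    ts = (np.linalg.norm(v1) * np.linalg.norm(v2)
          * math.sin(math.radians(theta_deg))) / 2.0
    # Sector area built from Euclidean distance and magnitude difference.
    ed = np.linalg.norm(v1 - v2)
    md = abs(np.linalg.norm(v1) - np.linalg.norm(v2))
    ss = math.pi * (ed + md) ** 2 * theta_deg / 360.0
    return ts * ss

print(ts_ss([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```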
You have to try it: the answer may vary based on your corpus and application-specific perception of 'similarity'. Effectiveness may especially vary based on typical document lengths, so if "rapidly growing with time" also means "growing arbitrarily long", that could greatly affect what works over time (requiring adaptations for longer docs).
Also note that 'Paragraph Vectors' – where a vector is co-trained like a word vector to represent a range-of-text – may outperform a simple average-of-word-vectors, as an input to similarity/classification tasks. (Many references to 'Doc2Vec' specifically mean 'Paragraph Vectors', though the term 'Doc2Vec' is sometimes also used for any other way of turning a document into a single vector, like a simple average of word-vectors.)
You may also want to look at "Word Mover's Distance" (WMD), a measure of similarity between two texts that uses word-vectors, though not via any simple average. (However, it can be expensive to calculate, especially for longer documents.) For classification, there's a recent refinement called "Supervised Word Mover's Distance" which reweights/transforms word vectors to make them more sensitive to known categories. With enough evaluation/tuning data about which of your documents should be closer than others, an analogous technique could probably be applied to generic similarity tasks.
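In gensim, WMD is exposed directly on a set of word vectors via wmdistance (it needs an optimal-transport backend such as pyemd or POT, depending on the gensim version); the vector file below is a placeholder for any pre-trained KeyedVectors:

```python
from gensim.models import KeyedVectors

# Any pre-trained word vectors work; the path/format is a placeholder.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# Word Mover's Distance: lower means more similar.
print(wv.wmdistance(doc1, doc2))
```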
You also might consider trying Jaccard similarity, which uses basic set algebra to determine the verbal overlap in two documents (although it is somewhat similar to a BOW approach). A nice intro on it can be found here.
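Jaccard similarity is simple enough to compute directly on token sets; a minimal sketch:

```python
def jaccard_similarity(text_a, text_b):
    """Intersection over union of the two documents' token sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard_similarity("the cat sat on the mat",
                         "the cat lay on the rug"))
```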
I would like to use a Restricted Boltzmann Machine for pattern recognition.
It has come to my attention that they are actually used for finding distributions in patterns rather than for pattern recognition. I looked at the following paper: http://www.cs.toronto.edu/~hinton/absps/uai_crbms.pdf , which uses an extension of the RBM called a Conditional RBM. I would like to implement that. I have already used Contrastive Divergence to implement an RBM, and I would like to stick with it for the CRBM, for simplicity. The paper focuses on replacing Contrastive Divergence with more accurate algorithms.
From what I see in the paper, I now need to create three weight matrices (since I also have to include the classification vectors; see Figure 1 in the paper), and I am not sure how to update each of them (i.e., how to form the vectors that drive the update of each matrix).
Could someone please clarify this for me or suggest an algorithm for classification using simple RBM, which I already implemented?
Thanks.
I found the following paper which clarifies the issue: http://uai.sis.pitt.edu/papers/11/p463-louradour.pdf . The poster here is also very helpful, especially for implementation: http://www.dmi.usherb.ca/~larocheh/publications/drbm-mitacs-poster.pdf . Instead of using 3 weight matrices it is enough to use 2, one for classification vectors and one for the actual patterns.
The formulas for the activation probabilities change, but the idea is the same.
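For completeness, a rough NumPy sketch of CD-1 updates for such a classification RBM with two weight matrices, W for the pattern units and U for the one-hot class units; the sizes, learning rate, and data are placeholders, and this follows the standard joint (pattern, class) setup rather than the exact notation of the poster:

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_classes, n_hidden = 784, 10, 128   # placeholder sizes
lr = 0.01

W = rng.normal(0, 0.01, (n_hidden, n_visible))  # hidden-to-pattern weights
U = rng.normal(0, 0.01, (n_hidden, n_classes))  # hidden-to-class weights
b = np.zeros(n_visible)   # pattern (visible) bias
d = np.zeros(n_classes)   # class bias
c = np.zeros(n_hidden)    # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cd1_step(x, y):
    """One CD-1 update for a single (binary pattern, one-hot label) pair."""
    global W, U, b, d, c
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(W @ x + U @ y + c)
    h0 = (rng.random(n_hidden) < h0_prob).astype(float)
    # Negative phase: reconstruct pattern and class, then hidden again.
    x1_prob = sigmoid(W.T @ h0 + b)
    y1_prob = softmax(U.T @ h0 + d)
    h1_prob = sigmoid(W @ x1_prob + U @ y1_prob + c)
    # Updates: positive statistics minus negative statistics.
    W += lr * (np.outer(h0_prob, x) - np.outer(h1_prob, x1_prob))
    U += lr * (np.outer(h0_prob, y) - np.outer(h1_prob, y1_prob))
    b += lr * (x - x1_prob)
    d += lr * (y - y1_prob)
    c += lr * (h0_prob - h1_prob)

# Toy usage: one random binary pattern labelled with class 3.
x = (rng.random(n_visible) < 0.5).astype(float)
y = np.eye(n_classes)[3]
cd1_step(x, y)
```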