I'm researching viable algorithms/solutions to implement in order to solve the following problem:
match users based on their common interests
Example:
U1: skiing, asian culture, meditation, java, crypto
U2: yoga, meditation, management, travel tips USA
U3: programming, travelling, oriental cuisine
I'm considering three dimensions based on word similarity:
1. Dictionary synonyms: WordNet synsets.
2. Close semantic similarity (programming > java, travelling > travel tips USA): so far I have considered Levenshtein distance.
3. Loose semantic similarity (asian culture >> oriental cuisine, programming >> crypto, asian culture >> yoga, yoga >> meditation): not sure at all here; I have played with word2vec.
Based on these approaches I would like to calculate a relevancy score and match users accordingly.
Thanks for the input!
Levenshtein distance was not very useful for capturing semantic similarity in my experiments.
WordNet worked well but was slow for a large set of words.
Word2Vec is a good approximation of WordNet, but not as comprehensive at capturing all the related words.
I also suggest you look at the graph embedding algorithm used in StarSpace from Facebook, especially the use case around Facebook page likes and recommendations.
I'm working on finding similarities between short sentences and articles. I have used many existing methods such as tf-idf and word2vec, but the results are just okay. The most relevant measure I found was Word Mover's Distance; however, its results are not much better than the other measures. I know it's a challenging problem, but I am wondering if there are any new methods that find approximate similarity at a higher, concept level rather than just matching words. In particular, are there any newer alternatives like Word Mover's Distance that look at the slightly higher-level semantics of a sentence or article?
This is the most recent approach, based on a paper published 4 months ago.
Step 1:
Load a suitable model using gensim, calculate the word vectors for the words in the sentence, and store them as a list.
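A minimal sketch of this step, assuming you have downloaded a pre-trained word2vec model such as the Google News vectors (the file path and the example sentence are placeholders):

```python
from gensim.models import KeyedVectors

# Load a pre-trained model; the file path is an assumption and must point to a
# word2vec-format file you have downloaded (here: the Google News vectors).
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

sentence = "there is a man and a lion"
# Keep only in-vocabulary words; each entry is a 300-dimensional vector.
word_vectors = [model[w] for w in sentence.split() if w in model]
```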
Step 2: Computing the sentence vector
Computing semantic similarity between sentences used to be difficult, but a recent paper, "A Simple but Tough-to-Beat Baseline for Sentence Embeddings", proposes a simple approach: compute the weighted average of the word vectors in the sentence, then remove the projections of the average vectors onto their first principal component. Here the weight of a word w is a/(a + p(w)), with a a parameter and p(w) the (estimated) word frequency; this weighting is called smooth inverse frequency (SIF). This method performs significantly better.
Simple code to calculate the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here.
Step 3: Using sklearn's cosine_similarity, load the two sentence vectors and compute the similarity.
This is the simplest and most efficient method to compute the semantic similarity of sentences.
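A rough sketch of Steps 2 and 3 under stated assumptions: `model` is the gensim KeyedVectors object from Step 1, the unigram probabilities p(w) are supplied as a plain dict (here a tiny illustrative one), and a = 1e-3 as suggested in the paper:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Assumed inputs: `model` is the KeyedVectors object loaded in Step 1, and
# `word_freq` maps each word to its estimated unigram probability p(w).
word_freq = {"he": 0.02, "is": 0.03, "a": 0.05,
             "clever": 0.0001, "smart": 0.0002, "guy": 0.001}

def sif_embeddings(sentences, model, word_freq, a=1e-3):
    vecs = []
    for s in sentences:
        words = [w for w in s.split() if w in model]
        # Weight each word by a / (a + p(w)), then take the weighted average.
        weights = np.array([a / (a + word_freq.get(w, 1e-5)) for w in words])
        vecs.append(np.average([model[w] for w in words], axis=0, weights=weights))
    X = np.vstack(vecs)
    # Remove the projection of each sentence vector onto the first principal component.
    pc = TruncatedSVD(n_components=1).fit(X).components_[0]
    return X - np.outer(X.dot(pc), pc)

emb = sif_embeddings(["he is a clever guy", "he is a smart guy"], model, word_freq)
print(cosine_similarity(emb[0:1], emb[1:2])[0, 0])
```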
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
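For the unsupervised route, here is a minimal Doc2Vec (Paragraph Vectors) sketch with gensim; the three-document toy corpus is purely illustrative, and a real corpus would be far larger:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "there is a man and a lion",
    "the lion will chase the man",
    "proficient in ms office and fda regulations",
]
docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# Train paragraph vectors (gensim 4.x API assumed).
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen sentence and find the closest training documents.
vec = model.infer_vector("experience with ms office".split())
print(model.dv.most_similar([vec], topn=2))
```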
To be honest, the best thing I know of for this at the moment is AMR (Abstract Meaning Representation):
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
For a quick look, you can enter bird-noun-1 and chair-noun-1 and select WordNet at http://labs.fc.ul.pt/dishin/; it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn
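If you prefer to stay in Python without the web tool, NLTK's WordNet interface exposes the same three measures; this is only a sketch and the numbers will not match DiShIn's exactly:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # needs nltk.download("wordnet") and nltk.download("wordnet_ic")

bird = wn.synset("bird.n.01")
chair = wn.synset("chair.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")  # information content from the Brown corpus

print("Resnik:", bird.res_similarity(chair, brown_ic))
print("Lin:", bird.lin_similarity(chair, brown_ic))
print("Jiang&Conrath:", bird.jcn_similarity(chair, brown_ic))
```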
I am trying to learn Natural Language Processing and am stuck with an open-ended question. How do I club together sentences that mean the same thing? There can be a finite set of sentences that have the same meaning. What kind of algorithms do I use to club them together?
For example: Consider the following sentences:
There is a man. There is a lion. The lion will chase the man on seeing him. If the lion catches the man he dies.
There is a man and a lion. If the lion catches the man he dies. The lion will chase the man if he sees him.
You have a lion that chases men on seeing them. There is one man. If the lion catches the man he dies.
Basically what all these sentences say is this:
1 Lion. 1 Man. Lions chase men. If lion catches men the man dies.
I am unable to zero in on one category of Machine Learning or Deep Learning algorithm that would help me achieve something similar. Please guide me in the right direction or point me to some algorithms that are good enough to achieve this.
Another important factor is having a scalable solution. There could be lots of such sentences out there. What happens then?
One possible solution is:
Use the parts of speech and the relations between words in a sentence as features for some Machine Learning algorithm. But will this be practical for a large set of sentences? Do we need to consider more things?
One deep-learning-based solution is to use word embeddings, which represent a word by a fixed-dimensional vector such that similar words lie close together in the embedding space and even vector operations like Germany - Berlin ~= Italy - Rome may hold; two famous word embedding techniques are Word2Vec and GloVe. Another option is to represent a whole sentence by a fixed-dimensional vector such that similar sentences lie close in that embedding space; check Skip-Thought vectors.

So far we have only represented text (words/sentences) in a more semantic, numerical way. The next step is to capture the meaning of the current context (paragraphs, documents). A very naive approach is to just average the word/sentence embeddings (you have to try this to see whether it works); a better way is to use some kind of sequence model like an RNN (in practice an LSTM or GRU) to capture whatever has been said before. The problem with sequence models is that they need supervision (labelled data, which I guess you don't have), so instead use the sequence model in a language-modelling setting and take the hidden representation of the RNN/GRU/LSTM at the last time step, i.e. after reading the last word, or use the aggregated word embeddings if you go with the naive approach.

Once you have this representation you can apply any clustering technique to cluster different paragraphs (you have to find an appropriate distance metric), or you can apply a distance metric manually and define or learn a threshold for similar paragraphs to be categorized together. A rough sketch of the naive averaging-plus-clustering route is given below.
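This is only an illustration under stated assumptions: pre-trained GloVe vectors fetched through gensim's downloader (an internet connection is needed on first run), averaged per paragraph, then clustered with k-means; the texts and the number of clusters are placeholders.

```python
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

# Small pre-trained GloVe model via gensim's downloader (an assumption; any
# word-vector model loadable by gensim would do).
model = api.load("glove-wiki-gigaword-50")

paragraphs = [
    "there is a man and a lion and the lion chases the man",
    "a lion chases men and if it catches one the man dies",
    "the applicant is proficient in ms office and fda regulations",
]

def avg_vector(text):
    # Naive paragraph representation: mean of in-vocabulary word vectors.
    words = [w for w in text.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0)

X = np.vstack([avg_vector(p) for p in paragraphs])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # the two lion paragraphs should fall into the same cluster
```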
I have about 5000 terms in a table and I want to group them into categories that make sense.
For example some terms are:
Nissan
Ford
Arrested
Jeep
Court
The result should be that Nissan, Ford, and Jeep get grouped into one category and that Arrested and Court are in another category. I looked at the Stanford NLP Classifier. Am I right to assume that this is the right tool to do this for me?
I would suggest using NLTK if there weren't so many proper nouns. You can use the semantic similarity measures from WordNet as features and try to cluster the words. Here's a discussion about how to do that.
To use the Stanford Classifier, you need to know how many buckets (classes) of words you want. Besides, I think it is meant for documents rather than single words.
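A hedged sketch of the WordNet-plus-clustering idea; note that brand names like Nissan have no WordNet synsets, so the terms below are common nouns chosen only to illustrate the mechanics:

```python
import numpy as np
from nltk.corpus import wordnet as wn
from sklearn.cluster import AgglomerativeClustering

terms = ["car", "truck", "jeep", "arrest", "court"]
synsets = [wn.synsets(t)[0] for t in terms]  # first (most common) synset per term

# Pairwise distances from Wu-Palmer similarity (1 - similarity).
n = len(terms)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim = synsets[i].wup_similarity(synsets[j]) or 0.0
        dist[i, j] = 1.0 - sim

# Older scikit-learn versions use affinity="precomputed" instead of metric=.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average").fit_predict(dist)
print(dict(zip(terms, labels)))  # vehicle terms vs. legal terms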
That's an interesting problem that the word2vec model that Google released may help with.
In a nutshell, a word is represented by an N-dimensional vector generated by a model. Google provides a great pre-trained model that returns a 300-dimensional vector per word, trained on about 100 billion words from Google News.
The interesting thing is that there are semantics encoded in these vectors. Suppose you have the vectors for the words King, Man, and Woman. A simple expression (King - Man) + Woman will yield a vector that is exceedingly close to the vector for Queen.
This is done via a distance calculation (cosine distance is their default, but you can use your own on the vectors) to determine similarity between words.
For your example, the distance between Jeep and Ford would be much smaller than between Jeep and Arrested. Through this you could group terms 'logically'.
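A hedged illustration with gensim, assuming you have downloaded the Google News vectors file; exact scores vary, and token casing matters in this model:

```python
from gensim.models import KeyedVectors

# The file path is an assumption; it must point to the downloaded Google News model.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# (King - Man) + Woman lands very close to Queen.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Jeep should be far closer to Ford than to Arrested, which is what lets you
# group terms 'logically'. Swap in lowercase tokens if a cased one is missing.
print(model.similarity("Jeep", "Ford"))
print(model.similarity("Jeep", "Arrested"))
```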
Usually one gets features from a text by using the bag-of-words approach, counting the words and calculating different measures, for example tf-idf values, as in: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not be implementing this in English, so I can't use existing (English) databases.
Hmm, feature extraction (e.g. tf-idf) on text data is based on statistics. On the other hand, you are looking for sense (semantics). Therefore a method like tf-idf will not work for you.
In NLP there are 3 basic levels:
1. morphological analysis
2. syntactic analysis
3. semantic analysis
(the higher the number, the bigger the problems :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like identifying what is a verb or a noun in a sentence). Semantic analysis is the most challenging, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.
As far as I understand, you want to know some relationships between words; this can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and for English there are also some, but for many 'less-covered' languages it can be a problem to find one...
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so Y = U*D*transpose(V), where U is IxI, D is diagonal IxJ, and V is JxJ.
You can use (V_1j, V_2j, V_3j, V_4j) as a feature vector in R^4 for the jth word.
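A small numpy sketch of this idea; the three toy documents are only illustrative, and real use would need many more documents:

```python
import numpy as np

docs = [["lion", "chases", "man"], ["man", "dies"], ["court", "arrest"]]
vocab = sorted({w for d in docs for w in d})
col = {w: j for j, w in enumerate(vocab)}

# x_ij = count of word j in document i;  y_ij = ln(1 + x_ij)
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d:
        X[i, col[w]] += 1
Y = np.log1p(X)

# Y = U * diag(D) * V^T; numpy returns V^T directly.
U, D, Vt = np.linalg.svd(Y)

# First four singular directions of word j, i.e. (V_1j, ..., V_4j).
word_features = {w: Vt[:4, col[w]] for w in vocab}
print(word_features["lion"])
```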
I am surprised the previous answers haven't mentioned word embeddings. Word embedding algorithms can produce word vectors for each word in a given dataset; they infer the vectors from context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" are somehow related, because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied via the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective at detecting similar words, and it is completely unsupervised.
Implementations can be found in gensim and TensorFlow.
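A minimal gensim Word2Vec sketch of this idea; the two-sentence corpus is only illustrative, and a real corpus needs far more text before the neighbours become meaningful:

```python
from gensim.models import Word2Vec

sentences = [
    ["he", "is", "a", "clever", "guy"],
    ["he", "is", "a", "smart", "guy"],
]

# Skip-gram model (sg=1); gensim 4.x parameter names assumed.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# With enough training text, words used in similar contexts end up close together.
print(model.wv.similarity("clever", "smart"))
print(model.wv.most_similar("clever", topn=3))
```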
I have two arrays of sentences. As you can see, I'm trying to match applicant abilities with job requirements.
Array A
-Must be able to use MS Office
-Applicant should be prepared to work 40 to 50 hours a week
-Must know FDA Regulations, FCC Regulations
-Must be willing to work in groups
Array B
-Proficient in MS Office
-Experience with FDA Regulations
-Willing to work long hours
-Has experience with math applications.
Is there any way to compare the two arrays and determine how many similarities there are? Preferably on a sentence-by-sentence basis (not just picking out words that are similar), returning a percentage of similarity.
Any suggestions?
What you are asking for is pretty difficult and it is the buzz of natural language processing today.
NLTK is the toolkit of choice, but it's in Python. There are lots of academic papers in this field. Most use corpora to train a model where the hypothesis is that words that are similar tend to appear in similar contexts (i.e. surrounded by similar words). This is very computationally expensive.
You can come up with a rudimentary solution by using the nltk library with this plan in mind:
1. Remove filler words (a, the, and).
2. Use the part-of-speech tagger to label verbs, nouns, etc. (I'd remove anything other than nouns and verbs).
3. For any two nouns (or verbs), use the WordNet library to get the synonyms of each word; if there is a match, count it.
There are lots of other papers on this that use corpora to build lexicons which use word frequencies to measure word similarities. The latter method is preferred because you are likely to relate words that are similar but do not have synonyms in common.
You can then give a relative measure of sentence similarity based on the word similarity.
Other methods consider the syntactic structure of the sentence, but you don't get that much benefit from this. Unfortunately, the above method is not very good, because of the nature of WordNet. A rough sketch of this plan is given below.
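The sketch below uses NLTK under stated assumptions: the standard data packages (punkt, stopwords, the POS tagger, wordnet) are downloaded, and the scoring function is intentionally crude.

```python
import nltk
from nltk.corpus import stopwords, wordnet as wn

def content_words(sentence):
    # Tokenize, drop filler/stop words, keep only nouns and verbs.
    tokens = [t.lower() for t in nltk.word_tokenize(sentence) if t.isalpha()]
    tokens = [t for t in tokens if t not in stopwords.words("english")]
    return {w for w, tag in nltk.pos_tag(tokens) if tag.startswith(("NN", "VB"))}

def lemma_set(word):
    # All WordNet lemma names across the word's synsets, plus the word itself.
    return {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()} | {word}

def sentence_similarity(s1, s2):
    # Fraction of content words in s1 that share a synonym with some word in s2.
    w1, w2 = content_words(s1), content_words(s2)
    if not w1:
        return 0.0
    hits = sum(1 for a in w1
               if any(not lemma_set(a).isdisjoint(lemma_set(b)) for b in w2))
    return hits / len(w1)

print(sentence_similarity("Must be able to use MS Office", "Proficient in MS Office"))
print(sentence_similarity("Must know FDA Regulations", "Willing to work long hours"))
```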