rails - comparison of arrays of sentences - ruby-on-rails

I have two arrays of sentences. As you can see, I'm trying to match applicant abilities with job requirements.
Array A
-Must be able to use MS Office
-Applicant should be prepared to work 40 to 50 hours a week
-Must know FDA Regulations, FCC Regulations
-Must be willing to work in groups
Array B
-Proficient in MS Office
-Experience with FDA Regulations
-Willing to work long hours
-Has experience with math applications.
Is there any way to compare the two arrays and determine how many similarities there are? Preferably on a sentence-by-sentence basis (not just picking out words that are similar), returning a similarity percentage.
Any suggestions?

What you are asking for is pretty difficult, and it is a hot topic in natural language processing today.
NLTK is the toolkit of choice, but it's in Python. There are lots of academic papers in this field. Most use corpora to train a model, on the hypothesis that similar words tend to appear in similar contexts (i.e. surrounded by similar words). This is very computationally expensive.
You can come up with a rudimentary solution using the NLTK library with this plan in mind:
-Remove filler words (a, the, and).
-Use the part-of-speech tagger to label verbs, nouns, etc. (I'd remove anything other than nouns and verbs).
-For any two nouns (or verbs), use the WordNet library to get the synonyms of each word, and count a match if they share one. There are lots of other papers that instead use corpora to build lexicons, which can use word frequencies to measure word similarity. The latter method is preferred because it is likely to relate words that are similar but do not have synonyms in common.
-You can then give a relative measure of sentence similarity based on the word similarity.
Other methods consider the syntactic structure of the sentence, but you don't get much benefit from this. Unfortunately, the above method is not very good, because of the nature of WordNet.
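The plan above can be sketched without any external library. The following is a minimal illustration and not the NLTK approach itself: it swaps the part-of-speech tagger and WordNet for a small hand-written stop-word list and a hypothetical synonym table (`STOP_WORDS`, `SYNONYMS`, and all function names are assumptions made for this example), and it scores each sentence pair by the Jaccard overlap of the remaining words.

```python
import re

# Hypothetical, hand-picked lists; a real system would use a tagger and WordNet.
STOP_WORDS = {"a", "an", "the", "and", "to", "be", "in", "with", "of",
              "must", "should", "has", "is"}
SYNONYMS = {"proficient": "use", "experience": "know", "prepared": "willing"}

def content_words(sentence):
    """Lower-case, tokenize, drop stop words, map words to a canonical synonym."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return {SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS}

def similarity(a, b):
    """Jaccard overlap of the two content-word sets, as a percentage."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return 0.0
    return 100.0 * len(wa & wb) / len(wa | wb)

def best_matches(requirements, abilities):
    """For each requirement, return (requirement, best ability, score)."""
    results = []
    for req in requirements:
        best = max(abilities, key=lambda ab: similarity(req, ab))
        results.append((req, best, similarity(req, best)))
    return results
```

Calling `best_matches(array_a, array_b)` with the two arrays from the question gives one percentage per requirement. The quality depends entirely on the synonym table, which is exactly the weakness of the WordNet approach noted above.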

Related

Find translations of a given word in the corpus e.g. by machine learning, word2vec, text mining

I am using this thread to get some ideas and find some possibilities.
I have about 1000 sermons and their translations into another language. The sermons differ in length. These are religious texts, and because of the domain, many words can be used in different ways depending on the context. The same word can take on a different meaning.
Is there a way to get the translations of a given word in the target language programmatically?
x1 -> [y2,z2,a2,b2,c2]
where x is the word in the language 1
and the returned array contains translations in the language 2
This would be the best case. Maybe it would be possible by training a translation model on domain data, but I don't have a lot of data.
Could it be done with word2vec? By creating a vector space for each text (language 1 and language 2) and using a transformation matrix, would it be possible to bring the semantic meanings together?
Do you know other ways or have other ideas? Does such work already exist, and what is this kind of research called? I was not able to find anything like it. I hope you have some ideas on how this could be achieved.
The general purpose is to create a tool for researchers in this specific domain that can be used to analyse the quality of sermon translations. If you have other ideas for how the (semantic) quality of a translation can be analysed, I would be very thankful.
To get the translation for a specific word in a sentence, you can use what’s called word alignment.
To get the quality of the translation, you can use what’s called quality estimation.
machinetranslate.org/quality-estimation
A solution based on word vectors (FastText vectors are typically better than Word2Vec) is certainly possible. The task that you are looking for is bilingual dictionary induction. The most frequently used tool for that is VecMap, which can align two embedding spaces from two languages. It either uses a small seed dictionary to align all the words, or it can even work in a completely unsupervised fashion.
Another solution is doing word alignment, i.e., statistically aligning words in the translations. Then you can get a dictionary based on the frequencies of how often the words are mapped to each other (note there might be problems when the languages differ morphologically). In this case, you can easily show examples of how the translations are used in sentences. If the languages you are interested in are covered by the XLM-R model, I recommend using SimAlign (a neural solution). If not, you can use Eflomal (a statistical solution).
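Once an aligner has produced word-level links, turning them into a frequency-based dictionary is simple counting. The sketch below assumes the alignment is already available as a list of (source_index, target_index) pairs per sentence pair; this input format and the names `build_dictionary` and `translations` are assumptions for illustration, not the documented output of SimAlign or Eflomal.

```python
from collections import Counter, defaultdict

def build_dictionary(sentence_pairs, alignments):
    """Count how often each source word is aligned to each target word.

    sentence_pairs: list of (source_tokens, target_tokens)
    alignments:     list of lists of (source_index, target_index) links,
                    one list per sentence pair (assumed aligner output)
    """
    counts = defaultdict(Counter)
    for (src, tgt), links in zip(sentence_pairs, alignments):
        for i, j in links:
            counts[src[i].lower()][tgt[j].lower()] += 1
    return counts

def translations(counts, word, top_n=5):
    """Return the most frequent aligned target words with relative frequencies."""
    aligned = counts.get(word.lower())
    if not aligned:
        return []
    total = sum(aligned.values())
    return [(t, n / total) for t, n in aligned.most_common(top_n)]
```

Keeping the sentence index alongside each count would also let you show the example sentences in which a given translation was used.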

How to seek for bigram similarity in gensim word2vec model

Here I have a word2vec model, suppose I use the google-news-300 model
import gensim.downloader as api
word2vec_model300 = api.load('word2vec-google-news-300')
I want to find the similar words for "AI" or "artifical intelligence", so I want to write
word2vec_model300.most_similar("artifical intelligence")
and I got errors
KeyError: "word 'artifical intelligence' not in vocabulary"
So what is the right way to extract similar words for bigram words?
Thanks in advance!
At one level, when a word-token isn't in a fixed set of word-vectors, the creators of that set of word-vectors chose not to train/model that word. So, anything you do will only be a crude workaround for its absence.
Note, though, that when Google prepared those vectors – based on a dataset of news articles from before 2012 – they also ran some statistical multigram-combinations on it, creating multigrams with connecting _ characters. So, first check if a vector for 'artificial_intelligence' might be present.
If it isn't, you could try other rough workarounds like averaging together the vectors for 'artificial' and 'intelligence' – though of course that won't really be what people mean by the distinct combination of those words, just meanings suggested by the independent words.
The Gensim .most_similar() method can take either raw vectors you've created by operations such as averaging, or a list of multiple words which it will average for you, as arguments via its explicit keyword parameter positive. For example:
word2vec_model300.most_similar(positive=[average_vector])
...or...
word2vec_model300.most_similar(positive=['artificial', 'intelligence'])
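The order of fallbacks described above (try the underscore-joined token first, then average the individual word vectors) can be sketched as a small helper. This is an illustration only: `phrase_vector` is a hypothetical function, and it is written against a plain dictionary of invented toy vectors so that it runs without downloading a model. Gensim's `KeyedVectors` also supports `in` and `[]` lookups, so the same logic should carry over, but treat that as an assumption to check against your Gensim version.

```python
def phrase_vector(vectors, phrase):
    """Return a vector for a multi-word phrase, or None if no word is known.

    vectors: mapping from token to a sequence of floats (toy stand-in for a model)
    """
    joined = "_".join(phrase.split())
    if joined in vectors:                 # a multigram token exists in the set
        return list(vectors[joined])
    known = [vectors[w] for w in phrase.split() if w in vectors]
    if not known:
        return None
    # Crude fallback: element-wise mean of the individual word vectors.
    return [sum(column) / len(known) for column in zip(*known)]

# Toy vectors, invented for the example.
toy = {
    "artificial": [1.0, 0.0, 2.0],
    "intelligence": [3.0, 2.0, 0.0],
    "machine_learning": [0.5, 0.5, 0.5],
}
```

With a real model, the resulting vector would then be passed to most_similar(positive=[...]), probably after converting it to a NumPy array.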
Finally, though Google's old vectors are handy, they're a bit old now, and from a particular domain (popular news articles) where senses may not match those used in other domains (or more recently). So you may want to seek alternate vectors, or train your own if you have sufficient data from your area of interest, to have appropriate meanings, including vectors for any particular multigrams you choose to tokenize in your data.

Classification of single sentence

I have 4 different categories and I also have around 3000 words which belong to each of these categories. Now if a new sentence comes, I am able to break the sentence into words and get more words related to it. So say for each new sentence I can get 20-30 words generated from the sentence.
Now what is the best way to classify this sentence in above mentioned category? I know bag of words works well.
I also looked at LDA, but it works with documents, whereas I have a list of words as a training corpus. LDA looks at the position of a word in the document, so I could not get meaningful results from it.
I'm not sure if I fully understand what your question is exactly.
Bag of words works well for some purposes, but in a lot of cases it throws away a lot of potentially useful information (which could be taken from word order, for example).
And assuming that you get a grammatical sentence as input, why not use your sentence as a document and still use LDA? The position of a word in your sentence can still be very meaningful.
There are plenty of classification methods available. Which one is best depends largely on your purpose. If you're new to this area, this may be interesting to have a look at: https://www.coursera.org/course/ml
Like Igor, I am also a bit confused regarding your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization, in some form. You can find the most relevant terms of each category and use this knowledge to do a better classification of the new sentences. For example, take the sentence "There is a stray dog near our layout which bites everyone who goes near to it". If you take the useful keywords from this sentence, removing stopwords, they are few in number (stray, dog, layout, bites, near). You can categorize it into a bucket, "animals_issue". If you train your system with a larger set of examples, this bag-of-words model can help. Otherwise, you can go for LDA or other topic modelling approaches.
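The keyword-bucket idea above can be sketched as a simple overlap count between the sentence and each category's word list. The category names, word lists, stop words, and the `classify` function below are all hypothetical and far smaller than the roughly 3000 words per category mentioned in the question; a trained classifier would normally replace the raw count.

```python
import re
from collections import Counter

# Hypothetical category word lists, invented for the example.
CATEGORY_WORDS = {
    "animals_issue": {"dog", "stray", "bites", "cat", "cattle"},
    "roads_issue": {"pothole", "road", "traffic", "signal"},
}
STOP_WORDS = {"there", "is", "a", "our", "which", "who", "to", "it", "the", "on"}

def classify(sentence):
    """Return (best_category, scores), scoring by number of matching keywords."""
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower())
              if t not in STOP_WORDS]
    scores = Counter()
    for category, words in CATEGORY_WORDS.items():
        scores[category] = sum(1 for t in tokens if t in words)
    best, best_score = scores.most_common(1)[0]
    # Report no category at all when nothing matched.
    return (best if best_score > 0 else None), dict(scores)
```

Returning the full score dictionary makes it easy to spot sentences that match several categories about equally well.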

Cutting down on Stanford parser's time-to-parse by pruning the sentence

We are already aware that the parsing time of the Stanford Parser increases as the length of a sentence increases. I am interested in finding creative ways to prune the sentence such that the parsing time decreases without compromising on accuracy. For example, we can replace known noun phrases with one-word nouns. Similarly, are there other smart ways of guessing a subtree beforehand, say, using the POS tag information? We have a huge corpus of unstructured text at our disposal, so we wish to learn some common patterns that can ultimately reduce the parsing time. References to publicly available literature in this regard would also be highly appreciated.
P.S. We already are aware of how to multi-thread using Stanford Parser, so we are not looking for answers from that point of view.
You asked for 'creative' approaches - the Cell Closure pruning method might be worth a look. See the series of publications by Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. Papers: 1 2 3. The basic intuition is:
Each cell in the CYK parse chart 'covers' a certain span (e.g. the first 4 words of the sentence, or words 13-18, etc.)
Some words - particularly in certain contexts - are very unlikely to begin a multi-word syntactic constituent; others are similarly unlikely to end a constituent. For example, the word 'the' almost always precedes a noun phrase, and it's almost inconceivable that it would end a constituent.
If we can train a machine-learned classifier to identify such words with very high precision, we can thereby identify cells which would only participate in parses placing said words in highly improbable syntactic positions. (Note that this classifier might make use of a linear-time POS tagger, or other high-speed preprocessing steps.)
By 'closing' these cells, we can reduce both the asymptotic and average-case complexities considerably - in theory, from cubic complexity all the way to linear; practically, we can achieve approximately n^1.5 without loss of accuracy.
In many cases, this pruning actually increases accuracy slightly vs. an exhaustive search, because the classifier can incorporate information that isn't available to the PCFG. Note that this is a simple, but very effective form of coarse-to-fine pruning, with a single coarse stage (as compared to the 7-stage CTF approach in the Berkeley Parser).
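To make the effect of closing cells concrete, here is a minimal sketch of the bookkeeping only. It assumes some classifier has already produced two per-word flags (whether the word may begin, and whether it may end, a multi-word constituent); the flags below are set by hand, and `open_cells` is a hypothetical helper, not code from the cited papers or from any parser.

```python
def open_cells(can_begin, can_end):
    """List chart cells (start, end) that remain open after closure.

    can_begin[i]: False if word i is judged unable to begin a multi-word constituent
    can_end[i]:   False if word i is judged unable to end a multi-word constituent
    Spans are half-open: (start, end) covers words start .. end-1.
    """
    n = len(can_begin)
    cells = []
    for start in range(n):
        for end in range(start + 1, n + 1):
            single_word = (end - start == 1)
            # Single-word cells always stay open; longer spans need both flags.
            if single_word or (can_begin[start] and can_end[end - 1]):
                cells.append((start, end))
    return cells

# Hand-set flags for "the dog chased the cat" (assumed classifier output).
words = ["the", "dog", "chased", "the", "cat"]
can_begin = [True, False, True, True, False]
can_end = [False, True, False, False, True]
```

For this five-word sentence, the full chart has 15 cells, and `open_cells(can_begin, can_end)` keeps 9 of them. The savings grow with sentence length, since the number of multi-word cells grows quadratically.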
To my knowledge, the Stanford Parser doesn't currently implement this pruning technique; I suspect you'd find it quite effective.
Shameless plug
The BUBS Parser implements this approach, as well as a few other optimizations, and thus achieves throughput of around 2500-5000 words per second, usually with accuracy at least equal to that I've measured with the Stanford Parser. Obviously, if you're using the rest of the Stanford pipeline, the built-in parser is already well integrated and convenient. But if you need improved speed, BUBS might be worth a look, and it does include some example code to aid in embedding the engine in a larger system.
Memoizing Common Substrings
Regarding your thoughts on pre-analyzing known noun phrases or other frequently-observed sequences with consistent structure: I did some evaluation of a similar idea a few years ago (in the context of sharing common substructures across a large corpus, when parsing on a massively parallel architecture). The preliminary results weren't encouraging. In the corpora we looked at, there just weren't enough repeated substrings of substantial length to make it worthwhile. And the aforementioned cell closure methods usually make those substrings really cheap to parse anyway.
However, if your target domains involved a lot of repetition, you might come to a different conclusion (maybe it would be effective on legal documents with lots of copy-and-paste boilerplate? Or news stories that are repeated from various sources or re-published with edits?)
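A quick way to check whether your own corpus has enough repetition is to measure how many n-gram occurrences belong to an n-gram seen more than once. The function below is a rough diagnostic written for this purpose; its name and the choice of metric are assumptions, not the evaluation used in the work described above.

```python
from collections import Counter

def repeated_ngram_coverage(sentences, n):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total
```

If the coverage stays close to zero for the span lengths that dominate parsing time, memoization is unlikely to pay off.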

Feature extraction from a single word

Usually one wants to get a feature from a text by using the bag of words approach, counting the words and calculate different measures, for example tf-idf values, like this: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words just by looking at a large number of documents?
I will not make the implementation in English, so I can't use databases.
Feature extraction on text data (e.g. tf-idf) is based on statistics. You, on the other hand, are looking for sense (semantics), so a method like tf-idf will not work for you.
NLP has three basic levels of analysis:
1. morphological analysis
2. syntactic analysis
3. semantic analysis
(a higher number represents bigger problems :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like which word is a verb or a noun in a sentence). Semantic analysis poses the most challenges, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.
As far as I understand, you want to know some relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and there are some for English as well, but for many less-covered languages it can be a problem to find one.
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so that Y = U*D*transpose(V), where U is IxI, D is diagonal IxJ, and V is JxJ.
Since the jth word corresponds to the jth row of V, you can use (V_j1, V_j2, V_j3, V_j4) as a feature vector in R^4 for the jth word.
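The counting and log-scaling step can be sketched in plain Python. This covers only the construction of Y; the decomposition itself is left to a linear algebra library. The function name `log_count_matrix` and the alphabetical column order are assumptions made for the example.

```python
import math
from collections import Counter

def log_count_matrix(documents):
    """Build Y with y_ij = ln(1 + x_ij); rows are documents, columns are words."""
    vocabulary = sorted({w for doc in documents for w in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)            # x_ij for this document
        matrix.append([math.log(1 + counts[w]) for w in vocabulary])
    return vocabulary, matrix
```

The matrix can then be handed to an SVD routine. Note that `numpy.linalg.svd` returns the transpose of V as its third result, so with that routine the features of the jth word are the first four entries of the jth column of the returned array.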
I am surprised the previous answers haven't mentioned word embeddings. A word embedding algorithm can produce a vector for each word in a given dataset. These algorithms infer word vectors from the context. For instance, by looking at the context in the following sentences, we can say that "clever" and "smart" are somehow related, because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and completely unsupervised.
Implementations can be found in gensim and TensorFlow.
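The co-occurrence idea mentioned above can be illustrated on the two example sentences. This is a toy sketch of context counting, not Word2Vec: the window size and the function names `cooccurrence_vectors` and `cosine` are choices made for the example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sentences = [s.lower().split() for s in ("He is a clever guy", "He is a smart guy")]
vectors = cooccurrence_vectors(sentences)
```

Here "clever" and "smart" get identical context counts, so `cosine(vectors["clever"], vectors["smart"])` is higher than the similarity between "clever" and "he". On a real vocabulary the count vectors become very large and sparse, which is the inefficiency that Word2Vec avoids by learning dense vectors.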

Resources