Find translations of a given word in the corpus e.g. by machine learning, word2vec, text mining - machine-learning

I am using this thread to get some ideas and find some possibilities.
I have about 1000 sermons and their translations into another language. The lengths of the sermons are different. These are religious sermon texts. Because of the domain (religious), there are a lot of words that can be used in different ways based on the context. The same word can become a different meaning.
Is there a way, where I can get "programmatically" the translations of a given word in the aim language?
x1 -> [y2,z2,a2,b2,c2]
where x is the word in the language 1
and the returned array contains translations in the language 2
This would be the best case. Maybe this could be possible by training a translation model by using domain data, but I don't have a lot of data.
Could it be possible by using word2vec? By creating a vector space of both texts (language 1 and language 2) and by using a transformation matrix would it be possible to put the semantical meanings together?
Do you know other ways or have other ideas? Is there maybe such works already and what is these kinds of research called? I was not able to find something like this. I hope you guys have some ideas on how this could be reached.
The general purpose is "to create a tool" for researchers in this specific domain, that can be used to analyse sermons translation quality. If you have another ideas how the quality of a translation (semantically) can be analysed, I would be very thankful.

To get the translation for a specific word in a sentence, you can use what’s called word alignment.
To get the quality of the translation, you can use what’s called quality estimation.
machinetranslate.org/quality-estimation

A solution based on word vectors (FastText vectors are typically better than Word2Vec) is certainly possible. The task that you are looking for is bilingual dictionary induction. The most frequently used tool for that is VecMap that can align two embeddings spaces from two languages. It either uses a small seed dictionary to align all the words or it even can work in a completely unsupervised fashion.
Another solution is doing word alignment, i.e., statistically aligning words in the translations. Then you can get a dictionary based on the frequencies of how often the words are mapped to each other (note there might be problems when the languages differ morphologically). In this case, you can easily show examples of how the translations are used in sentences. If the languages you are interested in are covered by the XLM-R model, I recommend using SimAlign (a neural solution). If not, you can use Eflomal (a statistical solution).

Related

How to seek for bigram similarity in gensim word2vec model

Here I have a word2vec model, suppose I use the google-news-300 model
import gensim.downloader as api
word2vec_model300 = api.load('word2vec-google-news-300')
I want to find the similar words for "AI" or "artifical intelligence", so I want to write
word2vec_model300.most_similar("artifical intelligence")
and I got errors
KeyError: "word 'artifical intelligence' not in vocabulary"
So what is the right way to extract similar words for bigram words?
Thanks in advance!
At one level, when a word-token isn't in a fixed set of word-vectors, the creators of that set of word-vectors chose not to train/model that word. So, anything you do will only be a crude workaround for its absence.
Note, though, that when Google prepared those vectors – based on a dataset of news articles from before 2012 – they also ran some statistical multigram-combinations on it, creating multigrams with connecting _ characters. So, first check if a vector for 'artificial_intelligence' might be present.
If it isn't, you could try other rough workarounds like averaging together the vectors for 'artificial' and 'intelligence' – though of course that won't really be what people mean by the distinct combination of those words, just meanings suggested by the independent words.
The Gensim .most_similar() method can take either a raw vectors you've created by operations such as averaging, or even a list of multiple words which it will average for you, as arguments via its explicit keyword positive parameter. For example:
word2vec_model300.most_similar(positive=[average_vector])
...or...
word2vec_model300.most_similar(positive=['artificial', 'intelligence'])
Finally, though Google's old vectors are handy, they're a bit old now, & from a particular domain (popular news articles) where senses may not match tose used in other domains (or more recently). So you may want to seek alternate vectors, or train your own if you have sufficient data from your area of interest, to have apprpriate meanings – including vectors for any particular multigrams you choose to tokenize in your data.

Natural Language Processing techniques for understanding contextual words

Take the following sentence:
I'm going to change the light bulb
The meaning of change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary api or something similar. However, the following sentences
I need to go the bank to change some currency
You need to change your screen brightness
The first sentence does not mean replace anymore, it means Exchangeand the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most peoples computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ in length which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print answer
Synset('depository_financial_institution.n.01')
>>> print answer.definition()
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
I don't know how useful this is but from my POV, word vector embeddings are naturally separated and the position in the sample space is closely related to different uses of the word. However like you said often a word may be used in several contexts.
To Solve this purpose, generally encoding techniques that utilise the context like continuous bag of words, or continous skip gram models are used for classification of the usage of word in a particular context like change for either exchange or adjust. This very idea is applied in LSTM based architectures as well or RNNs where the context is preserved over input sequences.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
In fact there's a neural network that utilises bidirectional language model to first predict the upcoming word then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the paper.ELMo Paper and this blog
Naturally the model learns from representative examples. So the better training set you give with the diverse uses of the same word, the better model can learn to utilise context to attach meaning to the word. Often this is what people use to solve their specific cases by using domain centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in
Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782

Is it a good idea to use word2vec for encoding of categorical features?

I am facing a binary prediction task and have a set of features of which all are categorical. A key challenge is therefore to encode those categorical features to numbers and I was looking for smart ways to do so.
I stumbled over word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.
However, I am not sure, whether it is a good idea since, the context words, which serve as the input features in word2vec are in my case more or less random, in contrast to real sentences which word2vec was originially made for.
Do you guys have any advice, thoughts, recommendations on this?
You should look into entity embedding if you are searching for a way to utilize embeddings for categorical variables.
google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
this is a good paper on arxiv written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737
It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.
Whether it's a good idea in your case will depend on your data & goals – the only way to know for sure is to try it, and evaluate the results versus your alternatives. (For example, if the number of categories is modest from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could also otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)
That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.
In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.
Recent versions of the Python gensim Word2Vec allow changing a parameter named ns_exponent – which was fixed at 0.75 in many early implementations, but at least one paper has suggested can usefully vary far from that value for certain corpus data and recommendation-like applications.

Feature extraction from a single word

Usually one wants to get a feature from a text by using the bag of words approach, counting the words and calculate different measures, for example tf-idf values, like this: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not make the implementation in English, so I can't use databases.
hmm,feature extraction (e.g. tf-idf) on text data are based on statistics. On the other hand, you are looking for sense (semantics). Therefore no such a method like tf-idef will work for you.
In NLP exists 3 basic levels:
morphological analyses
syntactic analyses
semantic analyses
(higher number represents bigger problems :)). Morphology is known for majority languages. Syntactic analyses is a bigger problem (it deals with things like what is verb, noun in some sentence,...). Semantic analyses has the most challenges, since it deals with meaning which is quite difficult to represent in machines, have many exceptions and are language-specific.
As far as I understand you want to know some relationships between words, this can be done via so-called dependency tree banks, (or just treebank): http://en.wikipedia.org/wiki/Treebank . It is a database/graph of sentences where a word can be considered as a node and relationship as arc. There is good treebank for czech language and for english there will be also some, but for many 'less-covered' languages it can be a problem to find one ...
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y. So Y = U*D*transpose(V)), U is IxI, D is diagonal IxJ, and V is JxJ.
You can use (V_1j, V_2j, V_3j, V_4j) as a feature vector in R^4 for the jth word.
I am surprised the previous answers haven't mentioned word embedding. Word embedding algorithm can produce word vectors for each word a given dataset. These algorithms can nfer word vectors from the context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" is somehow related. Because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and completely unsupervised.
A package could be find in gensim and tensorflow.

Binarization in Natural Language Processing

Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by it's ID giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.
Now for my question, what is the best binarization method when it comes to describe features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack en the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words-method.
With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example I don't want a word bigram to use 400 million features for 20000 unique words, if only 2% the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
Binarization is the act of
transforming colorful features of
an entity into vectors of numbers,
most often binary vectors, to make
good examples for classifier
algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are try to parse and then process, but for general short human phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very acurate for smallish vocubularies

Resources