Seq2Seq / NLP / Translation: After generating the target sentence, does the last decoder hidden state carry any residual meaning?

I am studying machine translation right now and I am interested in a question probing a bit more deeply into the internals of sentence representations.
Suppose we train an encoder-decoder Seq2Seq En-Fr translation system on parallel corpora, starting with pre-trained English and French word vectors. The system can use anything to form the sentence embedding (Transformers, LSTMs, etc.). The job of the Seq2Seq translation system is then to learn to build English sentence representations from English word vectors, to build French sentence representations from French word vectors, and, through the linking of the encoder and decoder, to learn those two sentence representations in the same space.
After training the model and encoding some English sentence with it (say, "This is not a pipe."), the sentence embedding in the joint representation space has some idea of the words 'this', 'is', 'not', 'a', 'pipe', etc. and all their associations, as well as the sequence in which they appear. (1)
When the decoder is run on the encoding, it is able to extract that information, thanks to the large corpora it was fed during training and the statistical associations between words, and output, correspondingly, 'Ceci', 'n', ''', 'est', 'pas', 'une', 'pipe', '(EOS)'. At each step, it extracts and outputs the next French word from the decoder hidden state and transforms that state so that the heuristically "most prominent" word to be decoded next can be found, and so on, until '(EOS)'.
My question is this: is there any interpretation of the last decoder hidden state after (EOS) has been output? Is it useful for anything else? Of course, an easy answer is "no; the model was trained to capture millions of lines of English text and process them until some word, in conjunction with the hidden state, produces (EOS). The last decoder hidden state is simply that, and everything it was not explicitly trained for is just noise, not signal."
But I'm wondering if there's anything more to this. What I'm trying to get at is: if you have a sentence embedding generated from English, and the decoder has dumped its meaning out into French, does any residual meaning remain that is not translatable from English to French? Certainly, the last hidden state for any particular sentence's translation would be very hard to interpret, but what about in the aggregate? For example, some aggregation of the last hidden states of every translated sentence containing the word 'French', which means something slightly different in English because it can be paired with 'fries', etc. This is a silly example, but you can probably think of others exploiting cultural ambiguities that turn up in language. Might this last embedding capture some statistical "uncertainty" or ambiguity about the translation (perhaps of the possible English "meanings" and associations that could have ended up in the French but didn't), or some other structural aspect of the language that might help us understand, say, how English differs from French?
What category do you think the answer to this falls into?
"There is no signal",
"There probably is some signal, but it would be very hard to extract because it depends on the mechanics of how the model was trained", or
"There is a signal that can be reliably extracted, even if we have to aggregate over millions of examples"?
I'm not sure if this question is sensical at all, but I'm curious about the answer and whether any research has been done on this front. I ask out of plain, simple curiosity.
Notes:
I am aware that the last hidden state exists because it generates (EOS) in conjunction with the last word. That is its purpose; nothing else (?) makes it special. I'm wondering if we can get any more meaning out of it (even if that means transforming it, e.g. applying the decoder step to it one more time).
(1) (Of course, the ML model has no rich idea of 'concepts' as a human would, with all their associations to thoughts, experiences and feelings; to the ML model the 'concept' only has associations with other words seen in the monolingual corpus during word-vector training and the bilingual corpus during translation training.)

Answering my own question, but still interested in thoughts. I have a hunch the answer is "no", because the hidden state embedding is generated with only two properties in mind: (1) to be 'closest' by cosine distance to the next output token out of all tokens in French, and (2) to produce the hidden state corresponding to the next word when the decoder transformation is applied to it. To give the last hidden state an interpretation other than 'it is the point on the 300-d (or whatever dimension embedding we are using) unit sphere closest by cosine distance to the French (EOS) token', we would have to apply (2) to it. But the training data never had any examples of anything following (EOS), so whatever we get if we apply the decoder transformation to the last hidden state was never learned and is simply random, depending on our model initialisation.
If we wanted to get some sort of idea about how good a 'match' the English and French joint embedding space is, we should be looking at and comparing the test loss of various translations, not looking into the last hidden state. But I'm still interested in people's thoughts on the matter if anyone thinks differently.
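If anyone wants to poke at this empirically, here is a minimal sketch of the kind of setup I have in mind, using a toy PyTorch GRU decoder (ToyDecoder and all the sizes below are made up for illustration, not from any particular translation system): grab the hidden state left over after (EOS) and, over many sentences, feed those states to a simple probe.

import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_size=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, h0):
        # tokens: (batch, seq_len) generated target tokens, ending with (EOS)
        emb = self.embed(tokens)
        states, h_last = self.gru(emb, h0)   # h_last: hidden state after emitting (EOS)
        return self.out(states), h_last

decoder = ToyDecoder()
h0 = torch.zeros(1, 2, 128)                  # stand-in for the encoder summary of 2 sentences
tokens = torch.randint(0, 1000, (2, 7))      # stand-in for the generated French tokens + (EOS)
_, h_last = decoder(tokens, h0)
print(h_last.squeeze(0).shape)               # torch.Size([2, 128])

# To look for aggregate "residual" signal, one could collect h_last over many
# sentences sharing some property (e.g. containing 'French') and train a simple
# probe such as logistic regression to predict that property from the states.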

Related

Find translations of a given word in the corpus e.g. by machine learning, word2vec, text mining

I am using this thread to get some ideas and find some possibilities.
I have about 1000 sermons and their translations into another language. The lengths of the sermons are different. These are religious sermon texts. Because of the domain (religious), there are a lot of words that can be used in different ways depending on the context. The same word can take on a different meaning.
Is there a way to get, programmatically, the translations of a given word in the target language?
x1 -> [y2,z2,a2,b2,c2]
where x1 is the word in language 1
and the returned array contains translations in language 2
This would be the best case. Maybe this could be achieved by training a translation model on domain data, but I don't have a lot of data.
Could it be done using word2vec? By creating a vector space for both texts (language 1 and language 2), would it be possible to bring the semantic meanings together using a transformation matrix?
Do you know other ways or have other ideas? Does such work already exist, and what is this kind of research called? I was not able to find anything like this. I hope you have some ideas on how this could be achieved.
The general purpose is to create a tool for researchers in this specific domain that can be used to analyse the quality of sermon translations. If you have other ideas on how the (semantic) quality of a translation can be analysed, I would be very thankful.
To get the translation for a specific word in a sentence, you can use what’s called word alignment.
To get the quality of the translation, you can use what’s called quality estimation.
machinetranslate.org/quality-estimation
A solution based on word vectors (FastText vectors are typically better than Word2Vec) is certainly possible. The task that you are looking for is bilingual dictionary induction. The most frequently used tool for that is VecMap, which can align two embedding spaces from two languages. It either uses a small seed dictionary to align all the words, or it can even work in a completely unsupervised fashion.
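For intuition, here is a rough sketch of the idea behind such embedding-space alignment (not VecMap's actual implementation): given a small seed dictionary, learn an orthogonal map from the source space to the target space, the Procrustes solution via SVD. The vectors below are random stand-ins.

import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
X = rng.normal(size=(n_pairs, dim))   # source-language vectors of the seed dictionary pairs
Y = rng.normal(size=(n_pairs, dim))   # target-language vectors of the same pairs

# W minimises ||X W - Y||_F subject to W being orthogonal (Procrustes solution).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W                        # source vectors mapped into the target space
# Nearest neighbours of `mapped` rows among target-language vectors would then
# give candidate translations for the corresponding source words.
print(np.allclose(W @ W.T, np.eye(dim)))   # True: the learned map is orthogonal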
Another solution is doing word alignment, i.e., statistically aligning words in the translations. Then you can get a dictionary based on the frequencies of how often the words are mapped to each other (note that there might be problems when the languages differ morphologically). In this case, you can easily show examples of how the translations are used in sentences. If the languages you are interested in are covered by the XLM-R model, I recommend using SimAlign (a neural solution). If not, you can use Eflomal (a statistical solution).
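Here is a toy illustration of that second route, independent of any particular aligner's output format (the tokens and alignments below are made up): once you have word alignments for each sentence pair, count which target words each source word is linked to and keep the most frequent ones.

from collections import Counter, defaultdict

# Hypothetical aligned data: (source tokens, target tokens, alignment index pairs)
aligned_corpus = [
    (["the", "grace", "of", "god"], ["die", "gnade", "gottes"], [(0, 0), (1, 1), (3, 2)]),
    (["grace", "and", "peace"], ["gnade", "und", "friede"], [(0, 0), (1, 1), (2, 2)]),
]

counts = defaultdict(Counter)
for src_tokens, tgt_tokens, alignment in aligned_corpus:
    for i, j in alignment:
        counts[src_tokens[i]][tgt_tokens[j]] += 1

def translations(word, k=5):
    # Up to k most frequently aligned target words for a given source word.
    return [t for t, _ in counts[word].most_common(k)]

print(translations("grace"))   # ['gnade']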

Natural Language Processing techniques for understanding contextual words

Take the following sentence:
I'm going to change the light bulb
Here change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go to the bank to change some currency
You need to change your screen brightness
In the first sentence change no longer means replace, it means exchange, and in the second sentence change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most people's computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight into this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be 500+ dimensions long, which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
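If pywsd is awkward to install, a comparable Lesk baseline ships with NLTK (this assumes the WordNet data has been downloaded; the chosen sense may well differ from pywsd's, since plain Lesk is only a heuristic):

import nltk
nltk.download("wordnet", quiet=True)   # WordNet data is needed once
from nltk.wsd import lesk

sentence = "I went to the bank to deposit my money".split()
sense = lesk(sentence, "bank", "n")    # pick the WordNet sense that best overlaps the context
print(sense)
print(sense.definition())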
I don't know how useful this is, but from my point of view, word vector embeddings are naturally separated, and a word's position in the embedding space is closely related to its different uses. However, as you said, a word may often be used in several contexts.
To address this, encoding techniques that make use of context, such as continuous bag-of-words or continuous skip-gram models, are generally used to classify the usage of a word in a particular context, e.g. change as either exchange or adjust. The same idea is applied in LSTM-based architectures and RNNs, where the context is preserved over the input sequence.
Interpreting word vectors isn't practical from a visualisation point of view, only from a 'relative distance' point of view with respect to other words in the embedding space. Another way is to maintain a matrix over the corpus in which the contextual uses of the words are represented.
In fact, there's a neural network that uses a bidirectional language model to first predict the upcoming word and then, at the end of the sentence, go back and try to predict the previous word. It's called ELMo. You should go through the ELMo paper and this blog.
Naturally, the model learns from representative examples. So the better the training set you give it, with diverse uses of the same word, the better the model can learn to use context to attach meaning to the word. This is often how people solve their specific cases: by using domain-centric training data.
I think this could be helpful: Efficient Estimation of Word Representations in Vector Space.
Pretrained language models like BERT could be useful for this, as mentioned in another answer. Those models generate a representation based on the context.
Recent pretrained language models use wordpieces, but spaCy has an implementation that aligns those to natural-language tokens. This makes it possible, for example, to check the similarity of different tokens based on their context. An example from https://explosion.ai/blog/spacy-transformers:
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782
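That spaCy model name comes from an older spacy-transformers release; roughly the same comparison can be done directly with the Hugging Face transformers library. The sketch below assumes bert-base-uncased and that "Apple" is the first wordpiece after [CLS], which holds for these sentences:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def first_word_vector(text):
    # Contextual embedding of the first wordpiece after [CLS].
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden[0, 1]

cos = torch.nn.CosineSimilarity(dim=0)
a = first_word_vector("Apple shares rose on the news.")
b = first_word_vector("Apple sold fewer iPhones this quarter.")
c = first_word_vector("Apple pie is delicious.")
print(cos(a, b).item())   # company vs. company: higher
print(cos(a, c).item())   # company vs. fruit: lower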

How to obtain the decomposition of a Chinese character

I'm a complete beginner in character recognition as well as machine learning in general.
I want to write a program which is able to process the following input:
A Chinese character (in either pixel or vector format), for example: [image of a character]
The decomposition of the previous character, i.e. for the example above: [images of its two component characters] and the information that they are aligned horizontally.
The decomposition of a Chinese character is always 3 things: 2 other characters and the pattern describing how the 2 characters form the initial character (this is called the composition kind). In the example above the composition kind is "aligned horizontally".
Given such an input, I want my program to tell which pixels or which contours in the initial character belong to which subcharacter in its decomposition.
Where to start?
Well, I can't say that I can provide a full answer, but think about the following:
1) Read the papers on how the Google Translate app works. You know, when you point your iPhone's camera at text and it instantly translates it (even preserving the fonts!). It supports Chinese, so it would be interesting for you to see whether they solved a similar task and how they did it.
2) Another big question to answer is how to prepare your input data. You will need to provide at least some input data, i.e. the decomposition of at least some characters. Try to do this manually for a couple of characters and try to formalise what exactly you are doing; this will help you to better formulate what exactly you want your algorithm to do.
3) Try to use some deep neural net with your data from #2. Use something with convolution layers. Pre-train it with an RBM (restricted Boltzmann machine). After that, just take a really close look at the resulting neural network. Don't expect to get any good results, but looking into the ANN layers will help you understand what the net has learned from the data and might provide some insight into where to move next (see the rough sketch below).
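As a very rough illustration of point 3 (the shapes and data are placeholders, and a real system would need the labelled decompositions from point 2): a small convolutional net could label each pixel of a rendered character as belonging to one of the two sub-characters.

import torch
import torch.nn as nn

class PixelLabeler(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, kernel_size=1),   # 2 classes: which sub-character a pixel belongs to
        )

    def forward(self, x):                      # x: (batch, 1, H, W) grayscale glyph images
        return self.net(x)                     # (batch, 2, H, W) per-pixel logits

model = PixelLabeler()
fake_glyphs = torch.rand(4, 1, 64, 64)         # stand-in for rendered character images
labels = model(fake_glyphs).argmax(dim=1)      # per-pixel sub-character assignment
print(labels.shape)                            # torch.Size([4, 64, 64])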

Classification of single sentence

I have 4 different categories and around 3000 words which belong to each of these categories. Now if a new sentence comes in, I can break it into words and get more words related to them. So, say, for each new sentence I can get 20-30 words generated from the sentence.
Now what is the best way to classify this sentence into the above-mentioned categories? I know bag of words works well.
I also looked at LDA, but it works with documents, whereas I have a list of words as a training corpus. In LDA it looks at the position of a word in the document, so I could not get meaningful results from LDA.
I'm not sure if I fully understand what your question is exactly.
Bag of words works well for some purposes, but in a lot of cases it throws away a lot of potentially useful information (which could be taken from word order, for example).
And assuming that you get a grammatical sentence as input, why not use your sentence as a document and still use LDA? The position of a word in your sentence can still be very meaningful.
There are plenty of classification methods available. Which one is best depends largely on your purpose. If you're new to this area, this may be interesting to have a look at: https://www.coursera.org/course/ml
Like Igor, I am also a bit confused about your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization in some form. You can find the most relevant terms of each category and, using this knowledge, do a better classification of new sentences. For example, take the sentence "There is a stray dog near our layout which bites everyone who goes near it". If you take the useful keywords from this sentence, removing stopwords, they are few in number (stray, dog, layout, bites, near). You can categorize it into a bucket, "animals_issue". If you train your system with a larger set of examples, this bag-of-words model can help. Otherwise, you can go for LDA or other topic modelling approaches.
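To make the bag-of-words suggestion concrete, here is a minimal scikit-learn sketch (the category names and training sentences are invented for illustration; with real data you would want far more examples per category):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "A stray dog near our layout bites everyone who goes near it",
    "The new traffic signal reduced accidents on the highway",
    "The school announced new admission dates for this year",
    "Heavy rain flooded several streets in the city",
]
train_labels = ["animals_issue", "traffic", "education", "weather"]

# Bag-of-words features (word counts, stopwords removed) + a Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train_sentences, train_labels)

print(clf.predict(["A dog keeps biting people near the park"]))   # expected to lean towards 'animals_issue'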

Binarization in Natural Language Processing

Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace each word by its ID, giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots and setting the slot corresponding to a specific word to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, commonly referred to as the bag-of-words method.
Now for my question: what is the best binarization method when it comes to describing features for natural language processing in general, and transition-based dependency parsing (with Nivre's algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack and the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words method.
By best, I am referring to the method that makes the data the most intelligible for the classifier without using up unnecessary memory. For example, I don't want a word bigram to use 400 million features for 20,000 unique words if only 2% of the bigrams actually exist.
Since the answer also depends on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
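As a tiny illustration of these two options (reusing the IDs from the question; everything here is just toy code):

vocab = {"cat": 1, "ate": 2, "the": 3, "dog": 4}
sentence = "the cat ate the dog".split()

# Token-vector: one ID per token, order preserved.
token_vector = [vocab[w] for w in sentence]

# Positional bag-of-words: for each word type, store the (1-based) positions where
# it occurs instead of just a 0/1 presence flag.
positional_bow = {w: [i + 1 for i, t in enumerate(sentence) if t == w] for w in vocab}

print(token_vector)     # [3, 1, 2, 3, 4]
print(positional_bow)   # {'cat': [2], 'ate': [3], 'the': [1, 4], 'dog': [5]}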
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
"Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms."
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example of this is tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
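A quick sketch of such tf-idf vectors with scikit-learn (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat ate the dog",
    "the dog chased the cat",
    "a parser builds a parse tree",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # one row per document, one column per vocabulary term

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # each value reflects the term's relevance to that document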
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are trying to parse and then process, but for general short human-phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very accurate for smallish vocabularies.

Resources