I was wondering how useful the encoder's hidden state is for an attention network. When I looked into the structure of an attention model, this is what I found a model generally looks like:
x: Input.
h: Encoder's hidden state, which feeds forward to the next encoder's hidden state.
s: Decoder's hidden state, which has a weighted sum of all the encoder's hidden states as input and feeds forward to the next decoder's hidden state.
y: Output.
With a process like translation, why is it important for the encoder's hidden states to feed forward, or to exist in the first place? We already know what the next x is going to be. So the order of the input isn't necessarily important for the order of the output, and neither is what has been memorized from the previous input, since the attention model looks at all inputs simultaneously. Couldn't you just use attention directly on the embedding of x?
Thank you!
You can easily try it and see that you will get quite bad results. Even if you add some positional encoding to the input embeddings, the results will be pretty bad.
The order matters. The sentences:
John loves Mary.
Mary loves John.
indeed have different meanings. Also, the order is not the only information you get from the encoder. The encoder also disambiguates the input: words can be homonyms, such as "train" (see https://arxiv.org/pdf/1908.11771.pdf). Also, probing of trained neural networks shows that the encoder develops a pretty abstract representation of the input sentence (see https://arxiv.org/pdf/1911.00317.pdf) and that a large part of the translation actually already happens in the encoder (see https://arxiv.org/pdf/2003.09586.pdf).
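To make the point about order concrete, here is a minimal sketch (not taken from any of the linked papers) of attention applied directly to word embeddings with no positional information at all. The embeddings are random stand-ins and the attention has no learned weights; both are assumptions for illustration only. Each word ends up with exactly the same vector in both sentences, so such a model cannot tell who loves whom.

```python
import numpy as np

def self_attention(x):
    """Plain scaled dot-product self-attention with no learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[-1])           # (n, n) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ x                                # (n, d) context vectors

rng = np.random.default_rng(0)
# Hypothetical toy embeddings; a real model would learn these.
embedding = {w: rng.normal(size=8) for w in ("John", "loves", "Mary")}

def encode(sentence):
    """Stack the embedding of each word, in sentence order."""
    return np.stack([embedding[w] for w in sentence.split()])

a = self_attention(encode("John loves Mary"))
b = self_attention(encode("Mary loves John"))

# Each word gets the same vector in both sentences; only the row order differs.
same_per_word = (
    np.allclose(a[0], b[2])      # John
    and np.allclose(a[1], b[1])  # loves
    and np.allclose(a[2], b[0])  # Mary
)
print(same_per_word)
```

This only illustrates the order argument. It says nothing about the disambiguation and abstraction that a trained encoder adds on top.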
Related
I've been learning about the new popular Transformer model, which can be used for sequence-to-sequence language applications. I am considering an application to time-series modeling, which is not necessarily language modeling. So the output layer of my model may not be a probability distribution; it could instead be a prediction of the next value of the time series.
If I consider the original language model presented in the paper (see Figure 1), positional encodings are applied to the embedded input data, but there is no indication of a position in the output. The output simply gives probabilities for the value at the "next" time step. To me it seems like something is being lost here. The output assumes an iterative process, where the "next" output is next only because it comes next. In the input, however, we feel the need to insert some positional information with the positional encodings. I would think we should be interested in the positional encodings of the output as well. Is there a way to recover them?
This problem becomes more pronounced if we consider non-uniformly sampled time series data. This is really what I am interested in. It would be interesting to use non-uniformly sampled time series as input and predict the "next" value of the time series, where we also get the time position of that prediction. This comes down to somehow recovering the positional information from that output value. Since the positional encoding of the input is added to the input, it is not trivial how to extract this positional information from the output, perhaps it should be called "positional decoding".
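For reference, here is a rough sketch of how a sinusoidal positional encoding (the form used in the original Transformer paper) gets added to the input. Feeding real-valued, irregular timestamps into it instead of integer positions is my own assumption for the non-uniform case, as are the sample values and the tiled stand-in for a learned input projection.

```python
import numpy as np

def sinusoidal_encoding(positions, d_model):
    """Sinusoidal encoding; d_model must be even. Positions may be real-valued."""
    positions = np.asarray(positions, dtype=float)[:, None]   # (n, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)   # (n, d_model / 2)
    enc = np.empty((positions.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions
    enc[:, 1::2] = np.cos(angles)   # odd dimensions
    return enc

d_model = 8
timestamps = [0.0, 0.7, 1.9, 4.2]                  # hypothetical irregular sample times
values = np.array([[1.0], [1.3], [0.8], [2.1]])    # hypothetical series values

embedded = np.tile(values, (1, d_model))           # stand-in for a learned projection
encoding = sinusoidal_encoding(timestamps, d_model)
model_input = embedded + encoding                  # position is summed in, not concatenated
```

Because the encoding is summed with the embedded value, the model only ever sees the mixture, which is why the position cannot simply be read back off an output vector.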
To summarize, my question is: what happens to the positional information in the output? Is it still there and I am just missing it? Also, does anyone see a straightforward way of recovering this information if the model does not make it immediately available?
Thanks
I am studying machine translation right now and I am interested in a question probing a bit more deeply into the internals of sentence representations.
Suppose we train an encoder-decoder Seq2Seq En-Fr translation system on parallel corpora, starting with pre-trained Eng and Fr word vectors. The system can use anything to form the sentence embedding (Transformers, LSTMs, etc). Then the job of the Seq2Seq translation system is to learn to build Eng sentence representations from Eng word vectors and learn to build French sentence representations from French word vectors and by the linking of the encoder and decoder, learn those two sentence representations in the same space.
After training the model, and encoding some English sentence with the model (Say, "This is not a pipe."), the sentence embedding in the joint representation space has some idea of the words 'this', 'is', 'not', 'a', 'pipe', etc and all their associations as well as the sequence in which they appear. (1)
When the decoder is run on the encoding, it is able to take out the aforementioned information, thanks to the large corpora fed to it during training and the statistical associations between words, and output, correspondingly, 'Ceci', 'n', ''', 'est', 'pas', 'une', 'pipe', '(EOS)'. At each step, it extracts and outputs the next French word from the decoder hidden state and transforms that state so that the heuristically "most prominent" word to be decoded next can be found by the decoder, and so on, until '(EOS)'.
My question is this: Is there any interpretation of the last decoder hidden state after (EOS) is the output? Is it useful for anything else? Of course, an easy answer is "no, the model was trained to capture millions of lines of English text and process them until some word in conjunction with the hidden state produces (EOS) and last decoder hidden state is simply that, everything else not explicitly trained on is just noise and not signal".
But I'm wondering if there's anything more to this. What I'm trying to get at is: if you have a sentence embedding generated in English, and have the meaning dumped out of it in French by the decoder model, does any residual meaning remain that is not translatable from English to French? Certainly, the last hidden state for any particular sentence's translation would be very hard to interpret, but how about in the aggregate? (For example, some aggregation of the last hidden states of every sentence to be translated that contains the word 'French', which means something slightly different in English because it can be paired with 'fries', etc. This is a silly example, but you can probably think of others exploiting cultural ambiguities and the like that turn up in language.) Might this last embedding capture some statistical "uncertainty" or ambiguity about the translation (perhaps the possible English "meanings" and associations that could have ended up in French but didn't), or some other structural aspect of the language that might help us understand, say, how English is different from French?
What category do you think the answer to this falls into?
"There is no signal",
"There probably is some signal, but it would be very hard to extract because it depends on the mechanics of how the model was trained",
"There is a signal that can be reliably extracted, even if we have to aggregate over millions of examples"?
I'm not sure if this question makes sense at all, but I'm curious about the answer and whether any research has been done on this front. I ask out of plain simple curiosity.
Notes:
I am aware that the last hidden state exists because it generates (EOS) in conjunction with the last word. That is its purpose, nothing else (?) makes it special. I'm wondering if we can get any more meaning out of it (even if it means transforming it like applying the decoder step one more time to it or something).
(1) (Of course, the ML model has no rich idea of 'concepts' as a human would, with all their associations to thoughts and experiences and feelings; to the ML model the 'concept' only has associations with other words seen in the monolingual corpus for the word vector training and the bilingual corpus for translation training.)
Answering my own question, but still interested in thoughts. I have a hunch the answer is "no", because the hidden state embedding is generated with only two properties in mind: (1) to be 'closest' by cosine distance to the next output token out of all tokens in French, and (2) to produce the hidden state corresponding to the next word when the decoder transformation is applied to it. To give the last hidden state an interpretation other than 'it is the point on the 300-d (or whatever embedding dimension we are using) unit sphere closest by cosine distance to the French (EOS) token', we would have to apply (2) to it. But the training data never had any examples of anything following (EOS), so what we get if we apply the decoder transformation to the last hidden state was never learned and is simply random, depending on our model initialisation.
If we wanted to get some sort of idea about how good a 'match' the English and French joint embedding space is, we should be looking at and comparing the test loss of various translations, not looking into the last hidden state. But I'm still interested in people's thoughts on the matter if anyone thinks differently.
I am coming from the Google BERT context (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say it is bidirectional by nature, and that to make the attention unidirectional some mask has to be applied.
Basically a transformer takes key, values and queries as input; uses encoder decoder architecture; and applies attention to these keys, queries and values. What I understood is we need to pass tokens explicitly rather than transformer understanding this by nature.
Can someone please explain what makes a Transformer bidirectional by nature?
Bidirectional is actually a carry-over term from RNN/LSTM. Transformer is much more than that.
Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.
A classic RNN only has access to the hidden state and the last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector, which is theoretically possible but a severe limitation in practice. A bidirectional RNN/LSTM is slightly better. Memory networks are another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight behind the Transformer is that you want full memory access and don't need the RNN at all!
Another piece of history: an important ingredient that let us deal with sequence structure without using RNN is positional encoding, which comes from CNN seq2seq model. It would not have been possible without this. Turns out, you don't need the CNN either, as CNN doesn't have full random access, but each convolution filter can only look at a number of neighboring words at a time.
Hence, Transformer is more like FFN, where encoding of word1 = f1(word1, word2, word3), and encoding of word3 = f2(word1, word2, word3). All positions available all the time.
You might also appreciate the beauty which is that the authors made it possible to compute attention for all positions in parallel, through the use of Q, K, V matrices. It's quite magical!
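As a rough illustration of computing attention for all positions at once, here is a minimal single-head scaled dot-product attention in NumPy. The random projection matrices w_q, w_k and w_v stand in for learned weights, and the sizes N and d are arbitrary; all of these are assumptions for the sketch.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention for every position in one pair of matrix products."""
    scores = q @ k.T / np.sqrt(k.shape[-1])                # (N, N): every word vs every word
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over each row
    return weights @ v, weights

rng = np.random.default_rng(0)
N, d = 5, 16                       # hypothetical sequence length and model width
x = rng.normal(size=(N, d))        # stand-in for the embedded input sequence

# Hypothetical random projections; in a real model these are learned.
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(output.shape, weights.shape)
```

There is no loop over positions: row i of weights says how much position i attends to every other position, and all rows are produced by the same matrix product.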
But once you understand this, you'll also appreciate the limitations of the Transformer: it requires O(N^2 * d) computation, where N is the sequence length, precisely because we're doing N*N attention of all words with all other words. An RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation. Here d is the dimension of the model's hidden state.
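To get a feel for these two growth rates, here is a back-of-the-envelope sketch. The functions only evaluate the leading terms N^2 * d and N * d^2 and ignore all constants, and the width d = 512 is just an assumed example value.

```python
def attention_cost(n, d):
    """Leading-order cost of self-attention: N^2 * d."""
    return n * n * d

def rnn_cost(n, d):
    """Leading-order cost of a recurrent layer: N * d^2."""
    return n * d * d

d = 512  # assumed hidden size, for illustration only
for n in (64, 512, 4096):
    ratio = attention_cost(n, d) / rnn_cost(n, d)
    print(f"N={n:5d}  attention={attention_cost(n, d):>12d}  "
          f"rnn={rnn_cost(n, d):>12d}  ratio={ratio:.3f}")
```

The ratio is simply N / d, so under these assumptions attention is the cheaper of the two while the sequence is shorter than the hidden size and becomes the more expensive one beyond that.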
Transformer just won't write a novel anytime soon!
In the following picture you will see very clearly why BERT is bidirectional.
This is crucial, since it forces the model to use information from the entire sentence simultaneously, regardless of position, to make good predictions.
BERT has been a clear breakthrough, made possible by the famous "Attention Is All You Need" paper and architecture.
This Bidirectional idea (masked) is different from classic LSTM cells which till now used the forward or the backward method or both but not at the same time.
Edit:
This is done by the Transformer. The "Attention Is All You Need" paper presents an encoder-decoder system implementing a sequence-to-sequence framework. BERT uses this Transformer (a sequence-to-sequence bidirectional network) to do other NLP tasks, and this is done with a masked approach.
The important thing is: BERT uses attention, but attention was designed for translation and as such does not care about being bidirectional. But remove a word and you have bidirectionality.
So why BERT now?
Well, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. This means the model allows a far more effective sentence embedding than before. In fact, RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. So the breakthrough in architecture, together with the idea of training a network by masking a word (or more), leads to BERT.
Edit of Edit:
Forget about the scaled dot product; it sits inside the attention, which is inside the multi-head attention, which is itself inside the Transformer: you are looking too deep. The Transformer uses the entire sequence every time to find the other sequence (in the case of BERT, it's the masked 15% of the sentence) and that's it. The use of BERT as a language model is really transfer learning (see this)
As stated in your post, unidirectional attention can be done with a certain type of mask, but bidirectional is better. It is used because the model goes from a full sentence to a full sentence, but not the way a classic seq2seq model does (with LSTM and RNN), and as such it can be used for LM.
BERT is a bidirectional transformer, whereas the decoder of the original Transformer (Vaswani et al., 2017) is unidirectional. This can be shown by comparing the masks in the code.
Original Transformer
Tensorflow's tutorial is a good reference. The look_ahead_mask is what makes the Transformer unidirectional.
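import tensorflow as tf

def create_look_ahead_mask(size):
    # band_part(..., -1, 0) keeps the lower triangle, so 1 marks "future" positions
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)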
If you trace the code, you can find that the look_ahead_mask is applied to the attention scores in the decoder (the values that become attention_weights after the softmax). Basically, each row in the attention_weights represents an attention query for the token at a certain position (the first row -> first token position; the second row -> second token position, etc.). The look_ahead_mask blacks out the tokens that appear after this position in the decoder, so it does not see the "future". In that sense, the decoder is unidirectional, analogous to unidirectional in an RNN.
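Here is a small NumPy sketch of what such a mask does to the attention weights. It is not the tutorial's code: the helper names, the all-zero scores and the -1e9 fill value are assumptions chosen to keep the example short.

```python
import numpy as np

def look_ahead_mask(size):
    """1 above the diagonal (future positions), 0 elsewhere."""
    return 1 - np.tril(np.ones((size, size)))

def masked_softmax(scores, mask):
    """Push masked scores towards -inf so they get zero weight after softmax."""
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # hypothetical uniform scores for 4 tokens
weights = masked_softmax(scores, look_ahead_mask(4))
print(weights)
```

Row i spreads its weight evenly over positions 0..i and gives exactly zero to everything after it, which is the "cannot see the future" behaviour described above.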
BERT
On the other hand, if you check the original BERT implementation (also in TensorFlow), there is only an optional input_mask applied to the entire BertModel. If you follow the README on pre-training the model and run create_pretraining_data.py, you will observe that the input_mask is only used to mark the padding of the input sequence, so that for short sentences the unused token positions are ignored. Thus, attention in BERT can be applied to both the "past" and the "future" of a given token position. In that sense, the encoder in BERT is bidirectional, analogous to bidirectional in an RNN.
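For comparison, here is a minimal NumPy sketch of a padding-only mask. It is not BERT's actual code: the helper name, the all-zero scores and the -1e9 fill value are assumptions, and input_mask is taken to be 1 for real tokens and 0 for padding.

```python
import numpy as np

def padding_masked_attention(scores, input_mask):
    """Block only padded key positions; every real token stays visible to all queries."""
    blocked = (1 - np.asarray(input_mask))[None, :]   # (1, seq_len), 1 where padded
    scores = scores + blocked * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))    # hypothetical uniform scores
input_mask = [1, 1, 1, 0]    # three real tokens followed by one padding slot
weights = padding_masked_attention(scores, input_mask)
print(weights)
```

The first token still puts weight on the second and third tokens, which come after it, so nothing about this mask hides the "future"; only the padding column is zeroed.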
I know this is an old post but for anyone coming back to this:
Adding to what Hai-Anh Trinh said: Transformers aren't 'bi-directional'; it would be better to call them "omni-directional". Because of their self-attention mechanism, they are able to consider every single word at the same time.
BERT on the other hand is "deeply bidirectional". This is because of the masked language model(MLM) pre-training objective that is used in BERT. (there are a lot of resources online, I can link some if need be)
It's easy to get confused so don't worry about it.
(https://arxiv.org/pdf/1810.04805.pdf; link to the original BERT paper)
(https://arxiv.org/pdf/1706.03762.pdf; link to the original Transformer paper)
For some self-study, I'm trying to implement a simple sequence-to-sequence model using Keras. While I get the basic idea and there are several tutorials available online, I still struggle with some basic concepts when looking at these tutorials:
Keras Tutorial: I've tried to adopt this tutorial. Unfortunately, it is for character sequences, but I'm aiming for word sequences. There is a block explaining what is required for word sequences, but this is currently throwing "wrong dimension" errors -- but that's OK, probably some data preparation errors on my side. More importantly, in this tutorial I can clearly see the 2 types of input and 1 type of output: encoder_input_data, decoder_input_data, decoder_target_data
MachineLearningMastery Tutorial: Here the network model looks very different, completely sequential with 1 input and 1 output. From what I can tell, here the decoder gets just the output of the encoder.
Is it correct to say that these are indeed two different approaches towards Seq2Seq? Which one is maybe better and why? Or do I read the 2nd tutorial wrongly? I already got an understanding in sequence classification and sequences labeling, but with sequence-to-sequence it hasn't properly clicked yet.
Yes, those two are different approaches, and there are other variations as well. MachineLearningMastery simplifies things a bit to make them accessible. I believe the Keras method might perform better, and it is what you will need if you want to advance to seq2seq with attention, which is almost always the case.
MachineLearningMastery has a hacky workaround that allows it to work without handing in decoder inputs. It simply repeats the last hidden state and passes that as the input at each timestep. This is not a flexible solution.
model.add(RepeatVector(tar_timesteps))
On the other hand, the Keras tutorial has several other concepts, like teacher forcing (using targets as inputs to the decoder), embeddings (or the lack of them) and a lengthier inference process, but it should set you up for attention.
I would also recommend the PyTorch tutorial, which I feel uses the most appropriate method.
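To show what teacher forcing means for the data, here is a small sketch of how the decoder input and decoder target relate: the target is the same sequence shifted one step ahead. The function name, the start and end markers and the example tokens are all hypothetical, not taken from the Keras tutorial.

```python
def make_decoder_pairs(target_tokens, start="<s>", end="</s>"):
    """Build teacher-forcing pairs: the decoder sees token t and must predict token t+1."""
    decoder_input = [start] + target_tokens    # what the decoder is fed at each step
    decoder_target = target_tokens + [end]     # what it should output at each step
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pairs(["le", "chat", "dort"])
for fed, expected in zip(dec_in, dec_out):
    print(f"{fed:>6} -> {expected}")
```

This is why that style of model needs two inputs (the source sentence and the shifted target) and one output during training, while the RepeatVector approach gets by with a single input.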
Edit:
I don't know your task, but what you would want for word embeddings is
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
Before that, you need to map every word in the vocabulary to an integer, turn every sentence into a sequence of integers and pass that sequence of integers to the model (an embedding layer with latent_dim of maybe 120). Each of your words is then represented by a vector of size 120. Also, your input sentences must all be the same length. So find an appropriate maximum sentence length, turn every sentence into that length, and pad with zeros if a sentence is shorter than the maximum, where 0 represents a null word.
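The preparation steps above can be sketched in plain Python as follows. The helper names, the "<pad>" entry and the example sentences are made up for illustration, and splitting on whitespace is a simplifying assumption; a real pipeline would use a proper tokenizer.

```python
def build_vocab(sentences):
    """Map every word to an integer, reserving 0 for the padding (null) word."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_and_pad(sentences, vocab, max_len):
    """Turn sentences into equal-length integer sequences, zero-padded on the right."""
    rows = []
    for sentence in sentences:
        ids = [vocab[word] for word in sentence.lower().split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

sentences = ["the cat sleeps", "a dog barks loudly today"]   # hypothetical data
vocab = build_vocab(sentences)
padded = encode_and_pad(sentences, vocab, max_len=5)
print(padded)   # [[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]]
```

The resulting integer matrix is what gets passed to the embedding layer, with len(vocab) as the number of tokens.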
I'm a complete beginner in character recognition as well as machine learning in general.
I want to write a program which is able to process the following input:
A Chinese character (in either pixel or vector format), for example:
The decomposition of the previous character, ie for the example above:
and and the information that they are aligned horizontally.
The decomposition of a Chinese character is always 3 things: 2 other characters and the pattern describing how the 2 characters form the initial character (it is called the composition kind). In the example above the composition kind is "aligned horizontally".
Given such an input, I want my program to tell which pixels or which contours in the initial character belong to which subcharacter in its decomposition.
Where to start?
Well, I can't say that I provide a full answer but think about:
1) Reading the papers on how the Google Translate app works. You know, when you point your iPhone's camera at text and it instantly translates the text (even preserving the fonts!). It supports the Chinese language, so it would be interesting for you to see whether they solved a similar task and how they did it.
2) Another big question to answer: how to prepare your input data. You will need to provide at least some input data, i.e. the decomposition of at least some characters. Try to do this manually for a couple of characters and try to formalize what exactly you are doing. This will help you to better formulate what exactly you want your algorithm to do.
3) Try to use some deep neural net with your data from #2. Use something with convolution layers. Pre-train it with an RBM (restricted Boltzmann machine). After that, just take a really close look at the resulting neural network. Don't expect to get any good results, but looking into the ANN layers will help you to understand what the net has learned from the data and might provide some insight into where to move next.