How the Transformer is Bidirectional - Machine Learning

I am coming from the Google BERT context (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say the Transformer is bidirectional by nature, and that to make its attention unidirectional a mask has to be applied.
Basically a Transformer takes keys, values and queries as input, uses an encoder-decoder architecture, and applies attention to these keys, queries and values. What I understood is that we need to pass the tokens explicitly rather than the Transformer understanding this by nature.
Can someone please explain what makes the Transformer bidirectional by nature?

Bidirectional is actually a carry-over term from RNN/LSTM. The Transformer is much more than that.
Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.
A classic RNN only has access to the hidden state and the last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector; theoretically possible, but a severe limitation in practice. A bidirectional RNN/LSTM is slightly better. Memory networks are another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight behind the Transformer is that you want full memory access and don't need the RNN at all!
Another piece of history: an important ingredient that lets us deal with sequence structure without using an RNN is positional encoding, which comes from the CNN seq2seq model. It would not have been possible without this. It turns out you don't need the CNN either, since a CNN doesn't have full random access: each convolution filter can only look at a number of neighboring words at a time.
Hence, the Transformer is more like an FFN, where encoding of word1 = f1(word1, word2, word3) and encoding of word3 = f2(word1, word2, word3). All positions are available all the time.
You might also appreciate the beauty of how the authors made it possible to compute attention for all positions in parallel, through the use of the Q, K, V matrices. It's quite magical!
But once you understand this, you'll also appreciate the limitations of the Transformer: it requires O(N^2 * d) computation, where N is the sequence length, precisely because we're doing N*N attention of all words with all other words. An RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation, where d is the dimension of the model's hidden state.
Transformer just won't write a novel anytime soon!
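To make the "all positions available" point concrete, here is a minimal single-head numpy sketch of the parallel Q, K, V computation described above (toy sizes, no multi-head, no masking; just the idea, not the implementation from the paper):

import numpy as np

N, d = 5, 8                                   # sequence length, model dimension
X = np.random.randn(N, d)                     # one embedding per position
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # all positions computed at once
scores = Q @ K.T / np.sqrt(d)                 # (N, N): every position vs. every other position
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
Z = weights @ V                               # each output row has seen the whole sequence
print(Z.shape)                                # (5, 8)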

Masked word prediction is what shows, in a really clear way, why BERT is bidirectional.
This is crucial, since it forces the model to use information from the entire sentence simultaneously, regardless of position, to make good predictions.
BERT has been a clear breakthrough, made possible by the architecture of the famous "Attention Is All You Need" paper.
This bidirectional (masked) idea is different from classic LSTM cells, which until now used the forward or the backward direction, or both, but each direction separately rather than jointly.
Edit:
This is done by the Transformer. The "Attention Is All You Need" paper presents an encoder-decoder system implementing a sequence-to-sequence framework. BERT uses this Transformer (a sequence-to-sequence bidirectional network) to do other NLP tasks, and this has been done by using a masked approach.
The important thing is: BERT uses attention, but attention was developed for translation, and translation as such does not care about bidirectionality. But remove a word and you have bidirectionality.
So why BERT now?
Well, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. This means the model allows a sentence embedding far more effective than before. In fact, RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. So: a breakthrough in architecture AND the use of this idea to train a network by masking a word (or more) leads to BERT.
Edit of Edit:
Forget about the scaled dot product; that is inside the attention, which is inside a multi-head attention, which is itself inside the Transformer: you are looking too deep. The Transformer uses the entire sequence every time to find the other sequence (in the case of BERT, the missing 15% of the sentence), and that's it. The use of BERT as a language model is really transfer learning (see this).
As stated in your post, unidirectional can be done with a certain type of mask; bidirectional is better. And it is used because they go from a full sentence to a full sentence, rather than the way classic seq2seq is done (with LSTM and RNN), and as such it can be used for LM.
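As a rough illustration of that masking idea, here is a sketch of the ~15% masking only; it is not BERT's exact create_pretraining_data.py recipe, which also sometimes keeps or randomly replaces the chosen tokens:

import random

tokens = "the cat sat on the mat".split()
masked, targets = list(tokens), {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:    # roughly 15% of the positions
        targets[i] = tok          # what the model has to predict
        masked[i] = "[MASK]"
print(masked, targets)
# Predicting each [MASK] may use tokens on BOTH sides of it -> bidirectional.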

BERT is a bidirectional transformer, whereas the decoder of the original Transformer (Vaswani et al., 2017) is unidirectional. This can be shown by comparing the masks in the code.
Original Transformer
Tensorflow's tutorial is a good reference. The look_ahead_mask is what makes the Transformer decoder unidirectional.
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
If you trace the code, you will find that the look_ahead_mask is applied to the attention_weights in the decoder. Basically, each row in the attention_weights represents an attention query for the token at a certain position (the first row -> the first token position; the second row -> the second token position, etc.). The look_ahead_mask blacks out the tokens that appear after this position in the decoder, so it does not see the "future". In that sense, the decoder is unidirectional, analogous to a unidirectional RNN.
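For example, for a sequence of length 3 the mask marks exactly the "future" positions of each row with a 1 (the output below is roughly what TF2 prints, assuming the function above and import tensorflow as tf):

print(create_look_ahead_mask(3))
# tf.Tensor(
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]], shape=(3, 3), dtype=float32)
# Row i (query position i) has 1s only at positions j > i, i.e. the tokens to be hidden.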
BERT
On the other hand, if you check the original BERT implementation (also in Tensorflow), there is only an optional input_mask applied to the entire BertModel. If you follow the README on pre-training the model and run create_pretraining_data.py, you will observe that the input_mask is only used for padding the input sequence, so for short sentences the unused tokens are ignored. Thus, attention in BERT can be applied to both the "past" and the "future" of a given token position. In that sense, the encoder in BERT is bidirectional, analogous to a bidirectional RNN.
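To contrast with the look-ahead mask above, here is a small numpy sketch of what a padding-only mask allows (hypothetical token ids, not the actual modeling.py code): every real token may attend to every other real token, before and after it.

import numpy as np

input_ids = np.array([101, 2054, 2003, 102, 0, 0])  # made-up ids, 0 = [PAD]
input_mask = (input_ids != 0).astype(int)           # [1, 1, 1, 1, 0, 0]
attn_allowed = np.outer(np.ones_like(input_mask), input_mask)
print(attn_allowed)  # each row (query) may attend to all non-pad columns, past and future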

I know this is an old post but for anyone coming back to this:
Adding to what Hai-Anh Trinh said, Transformers aren't 'bidirectional'; it would be better to call them "omni-directional". Because of their self-attention mechanism, they are able to consider every single word at the same time, simultaneously.
BERT, on the other hand, is "deeply bidirectional". This is because of the masked language model (MLM) pre-training objective that is used in BERT. (There are a lot of resources online; I can link some if need be.)
It's easy to get confused so don't worry about it.
(https://arxiv.org/pdf/1810.04805.pdf; link to the original BERT paper)
(https://arxiv.org/pdf/1706.03762.pdf; link to the original Transformer paper)

Related

How does nn.Embedding work when developing an encoder-decoder model?

This tutorial teaches how to develop a simple encoder-decoder model with attention using PyTorch.
However, in the encoder or decoder, self.embedding = nn.Embedding(input_size, hidden_size) (or similar) is defined. In pytorch documents, nn.Embedding is defined as "A simple lookup table that stores embeddings of a fixed dictionary and size."
So I am confused: in the initialization, where does this lookup table come from? Does it initialize some random embeddings for the indices which are then trained? Is it really necessary for it to be in the encoder/decoder part?
Thanks in advance.
Answering the last bit first: Yes, we do need Embedding or an equivalent. At least when dealing with discrete inputs (e.g. letters or words of a language), because these tokens come encoded as integers (e.g. 'a' -> 1, 'b' -> 2, etc.), but those numbers do not carry meaning: The letter 'b' is not "like 'a', but more", which its original encoding would suggest. So we provide the Embedding so that the network can learn how to represent these letters by something useful, e.g. making vowels similar to one another in some way.
During initialization, the embedding vectors are sampled randomly, in the same fashion as the other weights in the model, and they are optimized together with the rest of the model. It is also possible to initialize them from some pretrained embeddings (e.g. from word2vec, GloVe, FastText), but caution must then be exercised not to destroy them by backpropagating through a randomly initialized model.
Embeddings are not strictly necessary, but it would be very wasteful to force the network to learn that 13314 ('items') is very similar to 89137 ('values') but completely different from 13315 ('japan'). And it would probably not even remotely converge anyway.
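To make that concrete, here is a minimal PyTorch sketch with toy sizes: the lookup table is just a randomly initialized weight matrix that is trained like any other parameter.

import torch
import torch.nn as nn

vocab_size, hidden_size = 10, 4                # toy sizes
emb = nn.Embedding(vocab_size, hidden_size)    # weight of shape (10, 4), randomly initialized
print(emb.weight.requires_grad)                # True -> updated by the optimizer like any weight

token_ids = torch.tensor([1, 2, 1])            # e.g. 'a' -> 1, 'b' -> 2
vectors = emb(token_ids)                       # lookup, shape (3, 4)
print(vectors.shape)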

Natural Language Processing techniques for understanding contextual words

Take the following sentence:
I'm going to change the light bulb
The meaning of change here is replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go the bank to change some currency
You need to change your screen brightness
The first sentence no longer means replace; it means exchange, and in the second sentence change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based on the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most peoples computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight into this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ elements long, which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation (WSD). It's been a subject of research for many years, and while it's probably not the most popular problem, it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
I don't know how useful this is, but from my point of view, word vector embeddings are naturally separated, and the position in the sample space is closely related to the different uses of the word. However, as you said, a word may often be used in several contexts.
To address this, encoding techniques that utilise the context, such as continuous bag-of-words or continuous skip-gram models, are generally used to classify the usage of a word in a particular context, like change meaning either exchange or adjust. The same idea is applied in LSTM-based architectures or RNNs, where the context is preserved over the input sequences.
The interpretation of word vectors isn't practical from a visualisation point of view, only from a 'relative distance' point of view with respect to the other words in the sample space. Another way is to maintain a matrix of the corpus in which the contextual uses of the words are represented.
In fact, there is a neural network that utilises a bidirectional language model to first predict the upcoming word and then, at the end of the sentence, go back and try to predict the previous word. It's called ELMo. You should go through the paper: ELMo paper and this blog.
Naturally, the model learns from representative examples. So the more diverse the uses of the same word in your training set, the better the model can learn to utilise context to attach meaning to the word. Often this is how people solve their specific cases, by using domain-centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces, but spaCy has an implementation that aligns those to natural-language tokens. It is then possible, for example, to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers:
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782

How does Beam Search operate on the output of The Transformer?

According to my understanding (please correct me if I'm wrong), beam search is BFS where it only explores the "graph" of possibilities down the b most likely options, where b is the beam size.
To calculate/score each option, especially for the work that I'm doing, which is in the field of NLP, we basically calculate the score of a possibility by calculating the probability of a token given everything that comes before it.
This makes sense in a recurrent architecture, where you simply run the model with your decoder on the best b first tokens to get the probabilities of the second tokens for each of the first tokens. Eventually, you get sequences with probabilities and you just pick the one with the highest probability.
However, in the Transformer architecture, where the model doesn't have that recurrence, the output is the entire probability distribution over the vocabulary for each position in the sequence (batch size, max sequence length, vocab size). How do I interpret this output for beam search? I can get the encodings for the input sequence, but since there isn't that recurrence of using the previous output as input for the next token's decoding, how do I go about calculating the probability of all the possible sequences that stem from the best b tokens?
Beam search works exactly the same as with the recurrent models. The decoder is not recurrent (it's self-attentive), but it is still auto-regressive, i.e., generating a token is conditioned on the previously generated tokens.
At training time, the self-attention is masked such that each position only attends to the words to the left of the word that is currently being generated. This simulates the setup you have at inference time, when you indeed only have the left context (because the right context has not been generated yet).
The only difference is that in an RNN decoder you only use the last RNN state in every beam search step. With the Transformer, you always need to keep the entire hypothesis and do the self-attention over the entire left context.
Adding more information for your later question and for people who have the same question:
I guess what I really want to ask is this: with an RNN architecture, in the decoder, I can feed it the b tokens that are highest in probability to get the conditional probabilities of subsequent tokens. However, as I understand from this tutorial (tensorflow.org/beta/tutorials/text/…), I can't really do that for the Transformer architecture. Is that right? The decoder takes in the encoder outputs, the two masks and the target; what would I input for the parameter target?
The tutorial on the website you mentioned uses teacher forcing in the training stage, and it is possible to apply beam search to the decoder of a Transformer in the testing stage.
Using beam search for a modern architecture like the Transformer in the training stage is not so popular (check this link for more info),
while teacher forcing, as the tutorial mentions, offers parallel computation and speeds up training when you are dealing with a large-vocabulary task.
As for testing such a decoder, you could try the following steps to do beam search (just offering a possibility based on my understanding; there may be better solutions; a greedy sketch of the loop is given after these steps):
First, instead of taking the entire ground-truth sequence as input for the decoder, you could provide only "[SOS]" and pad the rest of the positions.
Although the output of your decoder is still [batch_size, max_sequence_len, vocab_size], only (batch_size, 0, vocab_size) gives you useful information, and that is the first token your model generated. Select the top b tokens and add them to your "[SOS]" sequence. Now you have the sequences "[SOS] token(1,1)", ..., "[SOS] token(1,b)".
Second, use the above sequences as input for the decoder and search for the top b tokens among the b * vocab_size options. Add them to their corresponding sequences.
Repeat until the sequences meet some stopping condition (max_output_length or [EOS]).
P.S.: 1) [SOS] or [EOS] means the start or the end of the sequence.
2) token(i,j) means the j-th token among the top b tokens for the i-th position in the sequence.
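A minimal greedy version of that loop might look like the sketch below (beam search would keep the top b partial sequences and their scores instead of a single one). Here decoder, enc_out, sos_id and eos_id are placeholders, and decoder(enc_out, tgt) is assumed to return logits of shape (batch, len(tgt), vocab_size).

import torch

def greedy_decode(decoder, enc_out, sos_id, eos_id, max_len=50):
    tgt = torch.tensor([[sos_id]])                 # start with [SOS]
    for _ in range(max_len):
        logits = decoder(enc_out, tgt)             # re-run the decoder on the whole prefix
        next_id = logits[:, -1, :].argmax(dim=-1)  # only the last position is new information
        tgt = torch.cat([tgt, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == eos_id:               # stop at [EOS]
            break
    return tgt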

What is used to train a self-attention mechanism?

I've been trying to understand self-attention, but everything I found doesn't explain the concept very well at a high level.
Let's say we use self-attention in a NLP task, so our input is a sentence.
Then self-attention can be used to measure how "important" each word in the sentence is for every other word.
The problem is that I do not understand how that "importance" is measured. Important for what?
What exactly is the goal vector the weights in the self-attention algorithm are trained against?
Connecting language with its underlying meaning is called grounding. A sentence like "The ball is on the table" results in an image which can be reproduced with multimodal learning. Multimodal means that different kinds of words are available, for example events, action words, subjects and so on. A self-attention mechanism works by mapping input vectors to output vectors, and between them is a neural network. The output vector of the neural network references the grounded situation.
Let us take a short example. We need a 300x200 pixel image, a sentence in natural language, and a parser. The parser works in both directions: it can convert text to an image, meaning the sentence "The ball is on the table" gets converted into the 300x200 image, but it is also possible to parse a given image and extract the natural sentence back. Self-attention learning is a bootstrapping technique to learn and use this grounded relationship, that is, to verify existing language models, to learn new ones and to predict future system states.
This question is old now, but I came across it, so I figured I should update others as my own understanding has increased.
Attention simply refers to some operation that takes the output and combines it with some other information. Typically this happens by taking the dot product of the output with some other vector so it can "attend" to it in some way.
Self-attention combines the output with other parts of the input (hence the self part). Again, the combination usually occurs via the dot product between the vectors.
Finally how is attention (or self-attention) trained?
Let's call Z our output, W our weight matrix and X our input (we'll use # as the matrix multiplication symbol):
Z = X^T # W^T # X
In NLP we compare Z to whatever we want the resulting output to be. In machine translation, for example, it is the sentence in the other language. We can compare the two with an average cross-entropy loss over each predicted word. Finally, we can update W with backpropagation.
How do we see what is important? We can look at the magnitudes of Z to see which words were most "attended" to after the attention.
This is a slightly simplified example, as it only has one weight matrix and typically the inputs are embedded, but I think it still highlights some of the necessary details concerning attention.
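A toy PyTorch version of that training story (one weight matrix, made-up shapes, cross-entropy against some target tokens) might look like this; it is only meant to show where the gradient for W comes from, not to reproduce a real Transformer layer.

import torch
import torch.nn as nn

N, d, vocab = 4, 8, 20
X = torch.randn(N, d)                    # embedded input sentence (toy)
W = nn.Parameter(torch.randn(d, d))      # the attention weight matrix
out_proj = nn.Linear(d, vocab)           # maps attended vectors to word scores
target = torch.randint(0, vocab, (N,))   # whatever we want the resulting output to be

scores = X @ W @ X.T                     # (N, N), roughly the Z formula above
attn = torch.softmax(scores, dim=-1)     # how strongly each word attends to the others
Z = attn @ X                             # attended representation
loss = nn.functional.cross_entropy(out_proj(Z), target)
loss.backward()                          # gradients flow back into W (and out_proj)
print(W.grad.shape)                      # torch.Size([8, 8])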
Here is a useful resource with visualizations for more information about attention.
Here is another resource with visualizations for more about attention in transformers specifically self-attention.

How does Fine-tuning Word Embeddings work?

I've been reading some NLP-with-deep-learning papers and found that fine-tuning seems to be a simple yet confusing concept. The same question has been asked here, but it is still not quite clear to me.
Fine-tuning pre-trained word embeddings into task-specific word embeddings, as mentioned in papers like Y. Kim, "Convolutional Neural Networks for Sentence Classification," and K. S. Tai, R. Socher, and C. D. Manning, "Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks," is only mentioned briefly, without any details.
My question is:
Word embeddings generated using word2vec or GloVe as pretrained word vectors are used as input features (X) for downstream tasks like parsing or sentiment analysis, meaning those input vectors are plugged into a new neural network model for some specific task. While training this new model, somehow we can get updated task-specific word embeddings.
But as far as I know, during training, what back-propagation does is update the weights (W) of the model; it does not change the input features (X). So how exactly do the original word embeddings get fine-tuned, and where do these fine-tuned vectors come from?
Yes, if you feed the embedding vectors as your input, you can't fine-tune the embeddings (at least not easily). However, all the frameworks provide some sort of EmbeddingLayer that takes as input an integer that is the class ordinal of the word/character/other input token and performs an embedding lookup. Such an embedding layer is very similar to a fully connected layer that is fed a one-hot encoded class, but it is far more efficient, as it only needs to fetch/change one row of the matrix on both the forward and backward passes. More importantly, it allows the weights of the embedding to be learned.
So the classic way is to feed the actual classes to the network instead of embeddings, and to prepend the entire network with an embedding layer that is initialized with word2vec / GloVe weights and that continues learning them. It might also be reasonable to freeze the embeddings for several iterations at the beginning, until the rest of the network starts doing something reasonable with them, before you start fine-tuning them.
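In PyTorch, for example, that looks roughly like the sketch below; pretrained_vectors stands in for a word2vec/GloVe matrix you have loaded yourself.

import torch
import torch.nn as nn

pretrained_vectors = torch.randn(1000, 300)  # placeholder for a loaded (vocab, dim) matrix
emb = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)  # frozen at first
# ... train the rest of the network for a few epochs, then unfreeze the embeddings:
emb.weight.requires_grad_(True)
# From now on the optimizer also updates the embedding rows, i.e. they get fine-tuned.
# (If you built the optimizer with only the trainable parameters, make sure emb.weight
#  is included in its parameter list once you unfreeze it.)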
One-hot encoding is the basis for constructing the initial layer for embeddings. Once you train the network, the one-hot encoding essentially serves as a table lookup. In the fine-tuning step you can select data for the specific task and specify the variables that need to be fine-tuned when you define the optimizer, using something like this:
# collect only the embedding weights (TF1-style graph collection)
embedding_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="embedding_variables/kernel")
# a separate optimizer step that updates just those variables
ft_optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name='FineTune')
ft_op = ft_optimizer.minimize(mean_loss, var_list=embedding_variables)
where "embedding_variables/kernel" is the name of the layer right after the one-hot encoding.
