Positional encoding is added to the input before it is passed into the transformer model, because otherwise the attention mechanism would be order-invariant. However, both the encoder and decoder are layered, with attention used at each layer. So if order is important for the attention mechanism, shouldn't the positional encoding be added to the input of each multi-head attention block, instead of just once at the input to the model?
The transformer uses residual connections, and hence the positional encodings carry over through multiple layers in the encoder and decoder.
The transformer has the same number of outputs as input tokens, and it learns to carry positional information from the inputs up to the outputs whenever that information turns out to be important in subsequent layers.
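To see why the residual connections matter here, below is a minimal sketch (a toy encoder sub-layer in PyTorch, not the actual implementation from the paper). Because each sub-layer computes LayerNorm(x + Sublayer(x)), the positional signal added at the very bottom is carried upward rather than overwritten:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # hypothetical simplified encoder sub-layer, for illustration only
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x already contains token embeddings + positional encodings
        attn_out, _ = self.attn(x, x, x)
        # the "x +" residual is what lets the positional signal survive into deeper layers
        return self.norm(x + attn_out)

x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
print(ResidualBlock(64, 8)(x).shape)     # torch.Size([2, 10, 64])

If a deeper layer finds the positional information useful, the model only has to avoid destroying it along this residual path.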
I've been learning about the popular new Transformer model, which can be used for sequence-to-sequence language applications. I am considering an application to time-series modelling, which is not necessarily language modelling, so the output layer may not be a probability but could instead be a prediction of the next value of the time series.
If I consider the original language model presented in the paper (see Figure 1), positional encodings are applied to the embedded input data, but there is no indication of a position in the output. The output simply gives probabilities for the value at the "next" time step. To me it seems like something is being lost here: the output assumes an iterative process, where the "next" output is just next because it is next. Yet on the input side we feel the need to inject positional information with the positional encodings, so I would think we should be interested in the positional encoding of the output as well. Is there a way to recover it?
This problem becomes more pronounced if we consider non-uniformly sampled time series data. This is really what I am interested in. It would be interesting to use a non-uniformly sampled time series as input and predict the "next" value of the time series, where we also get the time position of that prediction. This comes down to somehow recovering the positional information from that output value. Since the positional encoding is added to the input, it is not obvious how to extract this positional information from the output; perhaps it should be called "positional decoding".
To summarize, my question is: what happens to the positional information in the output? Is it still there and I am just missing it? Also, does anyone see a straightforward way of recovering this information if the model does not provide it directly?
Thanks
A module's parameters get changed during training; that is, they are what is learnt when training a neural network. But what is a buffer, and is it learnt during neural network training?
The PyTorch doc for the register_buffer() method reads:
This is typically used to register a buffer that should not to be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the persistent state.
As you already observed, model parameters are learned and updated using SGD during the training process.
However, sometimes there are other quantities that are part of a model's "state" and should be
- saved as part of state_dict.
- moved to cuda() or cpu() with the rest of the model's parameters.
- cast to float/half/double with the rest of the model's parameters.
Registering these quantities as the model's buffers allows PyTorch to track them and save them like regular parameters, but prevents PyTorch from updating them via the SGD mechanism.
An example of a buffer can be found in the _BatchNorm module, where running_mean, running_var and num_batches_tracked are registered as buffers and updated by accumulating statistics of the data forwarded through the layer. This is in contrast to the weight and bias parameters, which learn an affine transformation of the data using regular SGD optimization.
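A rough sketch of the same pattern (a toy module, not the real _BatchNorm code): weight below is a parameter and is updated by the optimizer, while running_mean is a buffer that we update by hand, yet both end up in the state_dict and move to the GPU together.

import torch
import torch.nn as nn

class RunningMeanScaler(nn.Module):
    # toy module for illustration: one learned parameter, one manually-updated buffer
    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))             # learned by SGD
        self.register_buffer('running_mean', torch.zeros(num_features))  # tracked and saved, but not optimized

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=0).detach()
            # manual, gradient-free update of the buffer
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return (x - self.running_mean) * self.weight

m = RunningMeanScaler(4)
print(list(m.state_dict().keys()))   # ['weight', 'running_mean'] -- the buffer is saved too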
Both parameters and buffers are things you create for a module (nn.Module).
Say you have a linear layer nn.Linear. You already have weight and bias parameters. But if you need a new parameter you use register_parameter() to register a new named parameter that is a tensor.
When you register a new parameter it will appear inside the module.parameters() iterator, but when you register a buffer it will not.
The difference:
Buffers are named tensors that, unlike parameters, are not updated by gradients at every step.
For buffers, you write your own update logic (it is entirely up to you).
The good thing is that when you save the model, all parameters and buffers are saved, and when you move the model onto or off the GPU, parameters and buffers go along with it.
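Here is a small illustration of that difference (the module and its extra tensors are made up for the example):

import torch
import torch.nn as nn

class MyLinear(nn.Linear):
    def __init__(self, in_features, out_features):
        super().__init__(in_features, out_features)
        # extra learnable tensor -> appears in .parameters(), updated by the optimizer
        self.register_parameter('scale', nn.Parameter(torch.ones(out_features)))
        # extra non-learnable tensor -> appears in .buffers() and state_dict(), not in .parameters()
        self.register_buffer('calls_seen', torch.zeros(1))

    def forward(self, x):
        self.calls_seen += 1                 # custom buffer logic, entirely up to you
        return super().forward(x) * self.scale

layer = MyLinear(3, 2)
print([n for n, _ in layer.named_parameters()])  # ['weight', 'bias', 'scale']
print([n for n, _ in layer.named_buffers()])     # ['calls_seen']
# layer.cuda() / layer.state_dict() treat scale and calls_seen alike; only the optimizer treats them differently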
According to my understanding (please correct me if I'm wrong), Beam Search is a breadth-first search that only explores the "graph" of possibilities down the b most likely options at each step, where b is the beam size.
To score each option, especially for the work I'm doing in NLP, we basically score a candidate by computing the probability of a token given everything that comes before it.
This makes sense in a recurrent architecture, where you simply run the model you have with your decoder through the best b first tokens, to get the probabilities of the second tokens, for each of the first tokens. Eventually, you get sequences with probabilities and you just pick the one with the highest probability.
However, in the Transformer architecture, where the model doesn't have that recurrence, the output is the full probability distribution for each word in the vocabulary, for each position in the sequence (batch size, max sequence length, vocab size). How do I interpret this output for beam search? I can get the encodings for the input sequence, but since there isn't that recurrence of using the previous output as input for the next token's decoding, how do I go about calculating the probability of all the possible sequences that stem from the best b tokens?
Beam search works exactly the same way as with recurrent models. The decoder is not recurrent (it's self-attentive), but it is still auto-regressive, i.e., generating a token is conditioned on previously generated tokens.
At training time, the self-attention is masked so that it only attends to words to the left of the word currently being generated. This simulates the setup you have at inference time, when you indeed only have the left context (because the right context has not been generated yet).
The only difference is that in the RNN decoder, you only use the last RNN state in every beam search step. With the Transformer, you always need to keep the entire hypothesis and do the self-attention over the entire left context.
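As a sketch of what that looks like in code (greedy decoding for simplicity, with a stand-in decoder returning random logits so the snippet runs; in practice you would call your trained Transformer decoder together with the encoder output and masks), note that the entire prefix generated so far is fed back in at every step:

import torch

vocab_size = 100
SOS, EOS = 1, 2

def decoder(prefix):
    # stand-in for a trained Transformer decoder: logits for every position of the prefix
    return torch.randn(1, prefix.size(1), vocab_size)

prefix = torch.tensor([[SOS]])                    # the hypothesis starts with just [SOS]
for _ in range(20):
    logits = decoder(prefix)                      # the whole prefix is re-encoded every step
    next_token = logits[:, -1, :].argmax(dim=-1)  # only the last position predicts the next token
    prefix = torch.cat([prefix, next_token.unsqueeze(1)], dim=1)
    if next_token.item() == EOS:
        break
print(prefix.tolist())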
Adding more information for your later question and for people who have the same question:
I guess what I really want to ask is: with an RNN architecture, in the decoder, I can feed it the b tokens that are highest in probability to get the conditional probabilities of subsequent tokens. However, as I understand from this tutorial here: tensorflow.org/beta/tutorials/text/…, I can't really do that for the Transformer architecture. Is that right? The decoder takes in the encoder outputs, the two masks and the target -- what would I pass in for the target parameter?
The tutorial on the website you mentioned uses teacher forcing in the training stage, and it is possible to apply beam search to the Transformer decoder in the testing stage.
Using beam search for a modern architecture like the Transformer in the training stage is not so popular (check this link for more info).
Teacher forcing, as the tutorial uses in the training stage, offers parallel computation and speeds up training, which matters once you are dealing with a task with a large vocabulary.
As for testing such a decoder, you could try the following steps to do beam search (just offering one possibility based on my understanding; there may be better solutions):
First, instead of taking the entire ground-truth sequence as input for the decoder, you only provide "[SOS]" and pad the remaining positions.
Although the output of your decoder is still [batch_size, max_sequence_len, vocab_size], only the slice (batch_size, 0, vocab_size) gives you useful information, and that is the first token your model generates. Select the top b tokens and append them to your "[SOS]" sequence. Now you have the sequences "[SOS] token(1,1)", ... , "[SOS] token(1,b)".
Second, use the above sequences as input for the decoder and search for the top b tokens among the b * vocab_size options. Append them to their corresponding sequences.
Repeat until the sequences meet some stopping condition (max_output_length or [EOS]).
P.S.: 1) [SOS] and [EOS] mean the start and the end of the sequence.
2) token(i,j) means the j-th of the top b candidate tokens at the i-th position in the sequence.
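Putting those steps together, here is a compact sketch of the beam search loop (again with a stand-in decoder so the snippet runs; scores are summed log-probabilities, and expanding each beam by its top b tokens before keeping the best b overall is equivalent to searching the b * vocab_size options mentioned above):

import torch
import torch.nn.functional as F

vocab_size, b, max_output_length = 50, 3, 10
SOS, EOS = 1, 2

def decoder(seq):
    # stand-in for a trained Transformer decoder: logits for every position of `seq`
    torch.manual_seed(int(seq.sum()))
    return torch.randn(1, seq.size(1), vocab_size)

beams = [(torch.tensor([[SOS]]), 0.0)]           # (token sequence, cumulative log-probability)
for _ in range(max_output_length):
    candidates = []
    for seq, score in beams:
        if seq[0, -1].item() == EOS:             # finished hypotheses are carried over unchanged
            candidates.append((seq, score))
            continue
        log_probs = F.log_softmax(decoder(seq)[:, -1, :], dim=-1)   # next-token distribution
        top_lp, top_ix = log_probs.topk(b)       # expand this beam by its best b continuations
        for lp, ix in zip(top_lp[0], top_ix[0]):
            candidates.append((torch.cat([seq, ix.view(1, 1)], dim=1), score + lp.item()))
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]   # keep the best b overall

best_seq, best_score = beams[0]
print(best_seq.tolist(), best_score)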
I am coming from a Google BERT context (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and code. People say it is bidirectional by nature; to make the attention unidirectional, some mask has to be applied.
Basically a Transformer takes keys, values and queries as input; uses an encoder-decoder architecture; and applies attention to these keys, queries and values. What I understood is that we need to pass tokens explicitly rather than the Transformer understanding this by nature.
Can someone please explain what makes the Transformer bidirectional by nature?
Bidirectional is actually a carry-over term from RNN/LSTM. Transformer is much more than that.
Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.
A classic RNN has access only to the hidden state and the last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector, which is theoretically possible but a severe limitation in practice. A bidirectional RNN/LSTM is slightly better. Memory networks are another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight behind the Transformer is that you want full memory access and don't need the RNN at all!
Another piece of history: an important ingredient that lets us deal with sequence structure without using an RNN is positional encoding, which comes from the CNN seq2seq model. It would not have been possible without this. It turns out you don't need the CNN either, as the CNN doesn't have full random access; each convolution filter can only look at a few neighboring words at a time.
Hence, the Transformer is more like an FFN, where encoding of word1 = f1(word1, word2, word3) and encoding of word3 = f2(word1, word2, word3). All positions are available all the time.
You might also appreciate the beauty which is that the authors made it possible to compute attention for all positions in parallel, through the use of Q, K, V matrices. It's quite magical!
Once you understand this, you'll also appreciate the limitation of the Transformer: it requires O(N^2 * d) computation, where N is the sequence length, precisely because we're computing N*N attention of all words with all other words. An RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation, where d is the dimension of the model's hidden state.
Transformer just won't write a novel anytime soon!
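If you want to see where the Q, K, V matrices and the O(N^2 * d) cost come from, here is a minimal scaled dot-product attention in PyTorch (a sketch, not the authors' code); the (N, N) score matrix is exactly the all-words-with-all-words comparison described above.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (N, d) -- one query/key/value vector per position, all computed in parallel
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (N, N): every position attends to every other
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                # (N, d)

N, d = 6, 8
x = torch.randn(N, d)                                 # token embeddings (+ positional encodings)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))    # projection matrices (learned in a real model)
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                      # torch.Size([6, 8])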
The following picture shows very clearly why BERT is bidirectional.
This is crucial since it forces the model to use information from the entire sentence simultaneously, regardless of position, to make good predictions.
BERT has been a clear breakthrough enabled by the famous "Attention Is All You Need" paper and architecture.
This bidirectional (masked) idea is different from classic LSTM cells, which until now used the forward or the backward direction, or both, but not at the same time.
Edit:
This is done by the Transformer. The "Attention Is All You Need" paper presents an encoder-decoder system implementing a sequence-to-sequence framework. BERT uses this Transformer (a sequence-to-sequence bidirectional network) to do other NLP tasks, and this has been done by using a masked approach.
The important thing is: BERT uses attention, but attention was originally developed for translation and as such does not care about bidirectionality. But remove (mask) a word and you have a bidirectional task.
So why BERT now?
Well, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. This means the model allows sentence embeddings that are far more effective than before. In fact, RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. So: a breakthrough in architecture AND the use of this idea to train a network by masking a word (or more) leads to BERT.
Edit of Edit:
Forget about the scaled dot product; it is inside the attention, which is inside the multi-head attention, which is itself inside the Transformer: you are looking too deep. The Transformer uses the entire sequence every time to predict the other sequence (in the case of BERT it is the masked 15 percent of the sentence), and that's it. The use of BERT as a language model is really transfer learning (see this).
As stated in your post, unidirectional attention can be done with a certain type of mask; bidirectional is better. And it is used because BERT goes from a full sentence to a full sentence, but not the way classic seq2seq does it (with LSTM and RNN), and as such it can be used for language modelling.
BERT is a bidirectional transformer whereas the original transformer (Vaswani et al., 2017) is unidirectional. This can be shown by comparing the masks in the code.
Original Transformer
TensorFlow's tutorial is a good reference. The look_ahead_mask is what makes the Transformer decoder unidirectional.
import tensorflow as tf

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
If you trace the code, you will find that the look_ahead_mask is applied to the attention_weights in the decoder. Basically, each row in the attention_weights represents an attention query for the token at a certain position (the first row -> the first token position; the second row -> the second token position, etc.). The look_ahead_mask blacks out the tokens that appear after this position in the decoder so it does not see the "future". In that sense, the decoder is unidirectional, analogous to a unidirectional RNN.
BERT
On the other hand, if you check the original BERT implementation (also in TensorFlow), there is only an optional input_mask applied to the entire BertModel. If you follow the README on pre-training the model and run create_pretraining_data.py, you will observe that the input_mask is only used for padding the input sequence, so for short sentences the unused tokens are ignored. Thus, attention in BERT can be applied to both the "past" and the "future" of a given token position. In that sense, the encoder in BERT is bidirectional, analogous to a bidirectional RNN.
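For contrast, here is a simplified version of that kind of padding mask (the tutorial's real create_padding_mask also reshapes it for broadcasting over the attention heads; 1 means "masked", as in the look-ahead mask above). It hides only the pad positions, for every query position, instead of hiding everything to the right:

import tensorflow as tf

def create_padding_mask(seq):
    # 1 where the token id is 0 (padding), 0 everywhere else, independent of position
    return tf.cast(tf.math.equal(seq, 0), tf.float32)   # (batch, seq_len)

seq = tf.constant([[7, 6, 5, 0, 0]])        # a short sentence padded to length 5
print(create_padding_mask(seq))             # [[0. 0. 0. 1. 1.]] -- only the padding is hidden
# compare with create_look_ahead_mask(5) above, which hides everything after the current position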
I know this is an old post but for anyone coming back to this:
Adding to what Hai-Anh Trinh said, Transformers aren't 'bi-directional'; it would be better to call them "omni-directional". Because of their self-attention mechanism, they are able to attend to every single word simultaneously.
BERT, on the other hand, is "deeply bidirectional". This is because of the masked language model (MLM) pre-training objective used in BERT. (There are a lot of resources online; I can link some if need be.)
It's easy to get confused so don't worry about it.
(https://arxiv.org/pdf/1810.04805.pdf; link to the original BERT paper)
(https://arxiv.org/pdf/1706.03762.pdf; link to the original Transformer paper)
In a machine learning task, you are given a set of input parameters (features) and an output parameter (target). Based on a set of input+output pairs, you train a model and later use that model to predict the output given the input.
My problem is somewhat different: I am given a set of input and output parameters (that part is identical) that have been recorded during a manufacturing process. (Actually, the input parameters are input values to a machine that produces some piece of equipment.) I should suggest to the operators of the machine a set of input parameters that will most likely yield the best output parameters.
Q1: Is this type of problem also called machine learning?
Q2: If not, what are these types of problems called?
It can be classed as Machine Learning ... but it would be better classified as Neural Networks. However, these both come under the umbrella term of Artificial Intelligence.