Initializing Decoder States in Sequence To Sequence Models - machine-learning

I'm writing my first neural machine translator in TensorFlow. I am using an encoder/decoder mechanism with attention. My encoder and decoder are LSTM stacks with residual connections, but the encoder has an initial bidirectional layer and the decoder does not.
It is common practice in the code that I have seen to initialize the state of the decoder cells with the last state of the encoder cells. However, this is only a clean solution if your encoder and decoder architectures are the same, as is the case in many of the seq2seq tutorials. In many other systems, such as this model by Google, the encoder and decoder architectures differ.
What are some of the alternative strategies used for initializing decoder state in these circumstances?
I have seen cases where the encoder's last hidden state is passed through a trained weight vector to create the initial decoder state, for all decoder layers. I have also seen more inventive ideas such as the one presented here, but I would like to develop an intuition as to why people pick certain strategies.
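To make the learned-projection strategy concrete, here is a minimal NumPy sketch of one way it could look. Everything in it is an assumption for illustration: the name `init_decoder_states`, the choice of a separate projection matrix per decoder layer, the tanh nonlinearity, and the random weights standing in for trained ones. For an LSTM you would need the same kind of mapping for the cell state as well.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_decoder_states(enc_final, bridge_weights, bridge_biases):
    """Map one encoder summary vector to an initial state per decoder layer.

    enc_final:      (batch, enc_dim) final encoder state, e.g. the concatenated
                    forward/backward states of a bidirectional layer.
    bridge_weights: list of (enc_dim, dec_dim) matrices, one per decoder layer.
    bridge_biases:  list of (dec_dim,) vectors, one per decoder layer.
    """
    return [np.tanh(enc_final @ W + b)
            for W, b in zip(bridge_weights, bridge_biases)]

# Hypothetical sizes: the encoder state is wider than the decoder state.
batch, enc_dim, dec_dim, num_dec_layers = 2, 8, 4, 3
enc_final = rng.normal(size=(batch, enc_dim))

# Random stand-ins for weights that would normally be learned with the model.
weights = [rng.normal(scale=0.1, size=(enc_dim, dec_dim))
           for _ in range(num_dec_layers)]
biases = [np.zeros(dec_dim) for _ in range(num_dec_layers)]

dec_init = init_decoder_states(enc_final, weights, biases)
print([state.shape for state in dec_init])
```

The point of the projection is that the encoder and decoder no longer need matching shapes or layer counts, since the mapping absorbs the mismatch.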

Related

How Transformer is Bidirectional - Machine Learning

I am coming from the context of Google's BERT (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say it is bidirectional by nature, and that a mask has to be applied to make the attention unidirectional.
Basically, a transformer takes keys, values, and queries as input, uses an encoder-decoder architecture, and applies attention to these keys, queries, and values. What I understood is that we need to pass the tokens explicitly, rather than the transformer understanding this by nature.
Can someone please explain what makes a transformer bidirectional by nature?
Bidirectional is actually a carry-over term from RNN/LSTM. Transformer is much more than that.
Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.
A classic RNN only has access to the hidden state and the last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector. This is theoretically possible but a severe limitation in practice. A bidirectional RNN/LSTM is slightly better. Memory networks are another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight behind the Transformer is that you want full memory access and don't need the RNN at all!
Another piece of history: an important ingredient that lets us deal with sequence structure without an RNN is positional encoding, which comes from the CNN seq2seq model. The Transformer would not have been possible without it. It turns out you don't need the CNN either: a CNN doesn't have full random access, since each convolution filter can only look at a limited number of neighboring words at a time.
Hence, the Transformer is more like a feed-forward network (FFN), where encoding of word1 = f1(word1, word2, word3), and encoding of word3 = f2(word1, word2, word3). All positions are available all the time.
You might also appreciate the elegance of how the authors made it possible to compute attention for all positions in parallel through the use of the Q, K, V matrices. It's quite magical!
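As a rough illustration of "all positions in parallel", here is a minimal single-head self-attention sketch in NumPy. The function name `self_attention`, the toy sizes, and the random projection matrices are assumptions for the example; a real Transformer adds multiple heads, positional encodings, and learned weights.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a whole sequence.

    X is (N, d): one row per token. Every output row is computed from all
    N input rows at once, with no recurrence over positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (N, N) token-to-token scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 3, 4                                    # three "words", toy model dimension
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.round(2))   # every row has non-zero weight on every position
```

The (N, N) `weights` matrix is exactly the "N*N attention of all words with all other words": row i says how much word i draws from each word in the sequence, before and after it.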
Once you understand this, you'll also appreciate the limitation of the Transformer: it requires O(N^2 * d) computation, where N is the sequence length, precisely because we're doing N*N attention of all words with all other words. An RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation, where d is the dimension of the model's hidden state.
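A quick back-of-the-envelope comparison of the two growth rates, ignoring constant factors. The helper names and the choice d = 512 are just assumptions for illustration.

```python
def self_attention_cost(n, d):
    """Order-of-magnitude cost of one self-attention layer: N^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Order-of-magnitude cost of one recurrent layer: N * d^2."""
    return n * d * d

d = 512  # assumed hidden size
for n in (64, 512, 4096):
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(f"N={n}: attention/recurrent cost ratio = {ratio}")
```

The ratio is simply N/d, so self-attention is cheaper while the sequence is shorter than the hidden size and becomes more expensive once N exceeds d.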
Transformer just won't write a novel anytime soon!
In the following picture you can see very clearly why BERT is bidirectional.
This is crucial since it forces the model to use information from the entire sentence simultaneously, regardless of position, to make good predictions.
BERT has been a clear breakthrough, made possible by the well-known "Attention Is All You Need" paper and architecture.
This (masked) bidirectional idea is different from classic LSTM cells, which until now used the forward method, the backward method, or both, but not at the same time.
Edit:
This is done by the Transformer. The "Attention Is All You Need" paper presents an encoder-decoder system implementing a sequence-to-sequence framework. BERT uses this Transformer (a bidirectional sequence-to-sequence network) to do other NLP tasks, and it does so with a masked approach.
The important thing is: BERT uses attention, but attention was designed for translation and as such does not care about bidirectionality. But remove a word and you have bidirectionality.
So why BERT now?
Well, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. This means the model allows a far more effective sentence embedding than before. In fact, RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. So a breakthrough in architecture and the use of this idea to train a network by masking a word (or more) lead to BERT.
Edit of Edit:
Forget about the scaled dot product; it sits inside the attention, which sits inside multi-head attention, which itself sits inside the Transformer: you are looking too deep. The Transformer uses the entire sequence every time to find the other sequence (in the case of BERT, the masked 15% of the sentence), and that's it. The use of BERT as a language model is really transfer learning (see this).
As stated in your post, unidirectional attention can be obtained with a certain type of mask, but bidirectional is better. It is used because they go from a full sentence to a full sentence, though not the way a classic seq2seq model (with LSTMs and RNNs) does, and as such it can be used for language modeling.
BERT is a bidirectional transformer, whereas the decoder of the original Transformer (Vaswani et al., 2017) is unidirectional. This can be shown by comparing the masks in the code.
Original Transformer
TensorFlow's tutorial is a good reference. The look_ahead_mask is what makes the Transformer's decoder unidirectional.
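    import tensorflow as tf

    def create_look_ahead_mask(size):
        # band_part(..., -1, 0) keeps the lower triangle, so 1 marks a "future" position to hide
        mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
        return mask  # (seq_len, seq_len)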
If you trace the code, you will find that the look_ahead_mask is applied to the attention_weights in the decoder. Each row in the attention_weights represents an attention query for the token at a certain position (the first row -> first token position; the second row -> second token position, etc.). The look_ahead_mask blacks out the tokens that appear after this position in the decoder, so it does not see the "future". In that sense, the decoder is unidirectional, analogous to a unidirectional RNN.
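To see the effect concretely, here is a small NumPy sketch (not the tutorial's code; `look_ahead_mask` and `masked_softmax` are illustrative names). It assumes the mask is turned into a large negative bias on the scores before the softmax, and uses uniform scores so that only the mask shapes the result.

```python
import numpy as np

def look_ahead_mask(size):
    """1.0 marks a 'future' position that the query must not see."""
    return 1.0 - np.tril(np.ones((size, size)))

def masked_softmax(scores, mask):
    """Row-wise softmax after pushing masked positions towards -infinity."""
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))            # uniform scores for a 4-token sequence
weights = masked_softmax(scores, look_ahead_mask(4))
print(weights)
```

Row i of `weights` spreads its attention evenly over positions 0..i and puts nothing on later positions, which is the "cannot see the future" behaviour described above.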
BERT
On the other hand, if you check the original BERT implementation (also in TensorFlow), there is only an optional input_mask applied to the entire BertModel. If you follow the README on pre-training the model and run create_pretraining_data.py, you will observe that the input_mask is only used for padding the input sequence, so that for short sentences the unused tokens are ignored. Thus, attention in BERT can be applied to both the "past" and the "future" of a given token position. In that sense, the encoder in BERT is bidirectional, analogous to a bidirectional RNN.
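For contrast, a padding-only mask can be sketched in NumPy as below. This is an illustration, not BERT's actual code: `padding_bias`, `softmax_rows`, and the constant -1e9 are assumptions, and the scores are uniform so that only the mask matters.

```python
import numpy as np

def padding_bias(input_mask):
    """input_mask has 1 for real tokens and 0 for padding.

    Returns a (1, seq_len) bias that is added to every query row, so padded
    key positions are hidden from all queries alike.
    """
    mask = np.asarray(input_mask, dtype=float)
    return (1.0 - mask)[None, :] * -1e9

def softmax_rows(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

input_mask = [1, 1, 1, 0, 0]          # three real tokens, two padding slots
scores = np.zeros((5, 5))             # uniform scores
weights = softmax_rows(scores + padding_bias(input_mask))
print(weights.round(3))
```

Here the first token still attends to the second and third tokens (its "future"), and only the padded positions get zero weight. There is no triangular structure, which is why the attention is bidirectional.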
I know this is an old post but for anyone coming back to this:
Adding to what Hai-Anh Trinh said, Transformers aren't 'bi-directional'; it would be better to call them "omni-directional". Because of their self-attention mechanism, they are able to consider every single word at the same time.
BERT, on the other hand, is "deeply bidirectional". This is because of the masked language model (MLM) pre-training objective that is used in BERT. (There are a lot of resources online; I can link some if need be.)
It's easy to get confused so don't worry about it.
(https://arxiv.org/pdf/1810.04805.pdf; link to the original BERT paper)
(https://arxiv.org/pdf/1706.03762.pdf; link to the original Transformer paper)

What kind of data is stored in a pre-trained model, such as those in the Caffe Model Zoo?

I came across this question while reading the SqueezeNet paper. The authors state that they use Deep Compression to compress the pre-trained model. The algorithm includes Huffman coding, etc.
I infer that the pre-trained model consists entirely of parameters, and I know these parameters are generated when training the network, but I have no idea how they are generated. What role do the parameters of the pre-trained model play when doing prediction?
It sounds like black magic to me.
The pre-trained model consists of the weights for all of the layer connections to/from every kernel of every layer. That's the "heavy lifting" from the first 40-80 epochs of training. It should be ready to do predictions, or continue with whatever fine-tuning you'd care to apply.
It's not really black magic. Each framework has a facility to dump (back-up) the parameter values at specified intervals and at completion of training. Granted, these are relatively large files -- hence the use of compression. Each framework has a facility to read in such a dump file in order to bootstrap a model.
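Here is a toy sketch of that dump-and-reload cycle, and of the role the parameters play at prediction time. It is only an illustration under stated assumptions: the model is a tiny linear function, the parameter values are made up rather than trained, and JSON stands in for whatever binary format a real framework uses.

```python
import json
import os
import tempfile

def predict(params, x):
    """A tiny linear model: the stored numbers fully determine the output."""
    return sum(w * xi for w, xi in zip(params["weights"], x)) + params["bias"]

# Pretend these values came out of training (made up for the example).
trained = {"weights": [0.5, -1.0, 2.0], "bias": 0.25}

# "Dump" the parameters to a file ...
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(trained, f)

# ... and later read the dump back to bootstrap a model without retraining.
with open(path) as f:
    restored = json.load(f)

x = [1.0, 2.0, 3.0]
print(predict(restored, x))
```

A real pre-trained network is the same idea at scale: millions of such numbers, which is why the files are large and compression is attractive.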

How to apply RNN to sequence-to-sequence NLP task?

I'm quite confused about sequence-to-sequence RNNs for NLP tasks. Previously, I have implemented some neural models for classification tasks. In those tasks, the models take word embeddings as input and use a softmax layer at the end of the network to do classification. But how do neural models do seq2seq tasks? If the input is word embeddings, then what is the output of the neural model? Examples of these tasks include question answering, dialogue systems, and machine translation.
You can use an encoder-decoder architecture. The encoder part encodes your input into a fixed-length vector, and then the decoder decodes this vector to your output sequence, whatever this would be. Encoding and decoding layers can be learned jointly against your objective function (which can still involve a soft-max). Check out this paper which shows how this model can be used in neural machine translation. The decoder here emits words one by one in order to generate a correct translation.
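To show what "emits words one by one" means mechanically, here is a toy NumPy sketch of an untrained encoder-decoder with greedy decoding. All sizes, the special token ids `BOS` and `EOS`, and the random weights are assumptions for illustration; a real system would learn these weights and typically use LSTM/GRU cells and attention.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid_dim = 6, 4, 5
BOS, EOS = 0, 1                      # hypothetical special token ids

# Random stand-ins for parameters that would be learned jointly.
E_src = rng.normal(size=(vocab_size, emb_dim))        # source embeddings
E_tgt = rng.normal(size=(vocab_size, emb_dim))        # target embeddings
W_enc = rng.normal(size=(emb_dim + hid_dim, hid_dim))
W_dec = rng.normal(size=(emb_dim + hid_dim, hid_dim))
W_out = rng.normal(size=(hid_dim, vocab_size))

def rnn_step(x, h, W):
    """One step of a plain tanh RNN cell."""
    return np.tanh(np.concatenate([x, h]) @ W)

def encode(src_ids):
    """Read the source tokens and return a fixed-length summary vector."""
    h = np.zeros(hid_dim)
    for tok in src_ids:
        h = rnn_step(E_src[tok], h, W_enc)
    return h

def greedy_decode(h, max_len=10):
    """Emit one token per step, feeding each prediction back as the next input."""
    out, tok = [], BOS
    for _ in range(max_len):
        h = rnn_step(E_tgt[tok], h, W_dec)
        logits = h @ W_out           # a softmax over these gives word probabilities
        tok = int(np.argmax(logits))
        if tok == EOS:
            break
        out.append(tok)
    return out

translation = greedy_decode(encode([2, 3, 4]))
print(translation)                   # token ids; meaningless here since nothing is trained
```

So the output of the model at each step is still a softmax over the vocabulary, just like in classification. The difference is that it is applied repeatedly, once per output word.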

How to use stacked autoencoders for pretraining

Let's say I wish to use stacked autoencoders as a pretraining step.
Let's say my full autoencoder is 40-30-10-30-40.
My steps are:
1. Train a 40-30-40 autoencoder using the original 40-feature data set in both the input and output layers.
2. Using only the trained encoder part of the above, i.e. the 40-30 encoder, derive a new 30-feature representation of the original 40 features.
3. Train a 30-10-30 autoencoder using the new 30-feature data set (derived in step 2) in both the input and output layers.
4. Take the trained encoder from step 1 (40-30) and feed it into the encoder from step 3 (30-10), giving a 40-30-10 encoder.
5. Take the 40-30-10 encoder from step 4 and use it as the input to the NN.
a) Is that correct?
b) Do I freeze the weights in the 40-30-10 encoder when training the NN which would be the same as pregenerating the 10 feature representation from the original 40 feature data set and training on the new 10 feature representation data set.
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder
a) Is that correct?
This is one of the typical approaches. You could also try to fit the autoencoder directly, as a "raw" autoencoder with that many layers should be possible to fit right away. As an alternative, you might consider fitting stacked denoising autoencoders instead, which might benefit more from "stacked" training.
b) Do I freeze the weights in the 40-30-10 encoder when training the NN which would be the same as pregenerating the 10 feature representation from the original 40 feature data set and training on the new 10 feature representation data set.
When you train the whole NN, you do not freeze anything. Pretraining is only a kind of preconditioning for the optimization process: you show your method where to start, but you do not want to limit the fitting procedure of the actual supervised learning.
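The greedy layer-wise procedure from the question can be sketched in NumPy roughly as follows. This is a minimal sketch under assumptions: random data stands in for the 40-feature data set, each autoencoder uses a tanh encoder with a linear decoder, training is plain full-batch gradient descent, and `train_autoencoder` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden, epochs=300, lr=0.2):
    """Fit an n-hidden-n autoencoder (tanh encoder, linear decoder) by
    full-batch gradient descent on mean squared reconstruction error.
    Returns the encoder parameters and the loss recorded at each epoch."""
    n = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n, hidden))
    b_enc = np.zeros(hidden)
    W_dec = rng.normal(scale=0.1, size=(hidden, n))
    b_dec = np.zeros(n)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)               # encode
        err = H @ W_dec + b_dec - X                  # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        d_out = 2.0 * err / err.size                 # gradient of MSE w.r.t. output
        d_hid = (d_out @ W_dec.T) * (1.0 - H ** 2)   # backprop through tanh
        W_dec -= lr * (H.T @ d_out)
        b_dec -= lr * d_out.sum(axis=0)
        W_enc -= lr * (X.T @ d_hid)
        b_enc -= lr * d_hid.sum(axis=0)
    return (W_enc, b_enc), losses

X = rng.normal(size=(100, 40))                       # stand-in for the 40-feature data

(W1, b1), loss1 = train_autoencoder(X, 30)           # step 1: 40-30-40
H1 = np.tanh(X @ W1 + b1)                            # step 2: 30-feature representation
(W2, b2), loss2 = train_autoencoder(H1, 10)          # step 3: 30-10-30

def encoder(inputs):                                 # step 4: stacked 40-30-10 encoder
    return np.tanh(np.tanh(inputs @ W1 + b1) @ W2 + b2)

codes = encoder(X)                                   # step 5: what the NN starts from
print(codes.shape, loss1[0], loss1[-1])
```

In the supervised phase, W1, b1, W2, and b2 would serve only as the initial values of the first two layers and keep being updated together with the new layers on top, rather than being frozen.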
PS. I already have a question out asking about whether I need to tie the weights of the encoder and decoder
No, you do not have to tie the weights, especially since you throw away your decoder anyway. Tying the weights is important for some more probabilistic models in order to make the minimization procedure possible (as in the case of RBMs), but for an autoencoder there is no point.

Detect hidden unknown patterns when visualization fails

I have a vast set of multi-dimensional, time-based data which I suspect contains patterns. I simplified the data set to create a custom visualization.
Humans see patterns in the visualization, but the result of a pattern cannot be explained by the visualization. This is because the simplification step hides data which is important.
I cannot put all my data in the visualization, because then humans can no longer see the possible patterns: too much data and too many dimensions are visualized.
Is there a technique that can detect hidden unknown patterns in a data set (without using visualization, and without me teaching the technique the patterns)?
One optional extra would be that the technique should somehow be able to "explain the patterns" to me so that I can check whether they make sense.
[edit] I can give the technique a collection of small data sets (extracted from the big data set; still very multi-dimensional) that I know contain patterns (from my visualization). The technique then needs to analyze under what conditions a pattern produces result a or result b.
First off, how did you "simplify" the data? If you did it without any heuristics, you might go ahead and perform PCA. The very idea of PCA is to solve your problem: reducing the dimensionality without losing "important" data. You can visualize your principal components so that patterns can be detected by the human eye as well as by algorithms.
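As a rough idea of what that looks like, here is a minimal PCA sketch via the singular value decomposition in NumPy. The synthetic data (10 observed dimensions driven by 2 hidden factors plus a little noise) and the helper name `pca` are assumptions for the example, not your data.

```python
import numpy as np

def pca(X, k):
    """Project X (samples x features) onto its first k principal components."""
    Xc = X - X.mean(axis=0)                            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                                # (k, features) directions
    explained = (S ** 2) / (S ** 2).sum()              # variance fraction per component
    return Xc @ components.T, components, explained[:k]

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                     # two hidden factors
mixing = rng.normal(size=(2, 10))                      # spread them over 10 dimensions
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, components, explained = pca(X, 2)
print(Z.shape, explained.round(4))
```

The two columns of `Z` can be plotted directly, and the rows of `components` show how strongly each original dimension contributes to each axis, which gives a first, limited form of "explaining" the pattern.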
To your second question: yes, there are techniques that can detect hidden unknown patterns in data. However, this is a huge field (machine learning), and which algorithm you'd use depends on your problem structure, so it's impossible to give a specific model name at this point. From what you specified, neural networks in general seem fit to do the job. After you have trained a network, you can visualize the activations or weights (Hinton diagram) to analyze which input data is treated "similarly".
