Decoder Input in Seq2Seq - Time Series Analysis - time-series

I'm working with MXNet and trying to figure out the Seq2Seq model. Let's suppose that every batch handles 32 sequences and that every sequence is 20 timesteps long. To set up a seq2seq architecture we split every sequence into two parts. The split point is arbitrary, but let's say we divide each sequence in half. The first part is the 'encoder input' and will, indeed, be the input to the encoder: a sequence of 10 timesteps, so this input consists of N variables of length 10. In other words, we have x1, ..., x10 for every encoder input sequence, multiplied by the number of features, which results in the feature vector of encoder inputs Xt. Now, since the decoder output will be the second half of the sequence, what should the decoder input be? I'm setting the decoder input equal to the encoder input and the model works fairly well. This is the forward function:
def forward(self, encoder_input, *args):
    # Initial hidden state for this batch
    state = self.encoder.begin_state(batch_size=encoder_input.shape[0], ctx=mx.cpu())
    # Encode the first half of the sequence
    encoder_output, encoder_state = self.encoder(encoder_input, state)
    # Decoder reuses the encoder input, seeded with the encoder's final state
    decoder_output, decoder_state = self.decoder(encoder_input, encoder_state)
    # Project the decoder output to the target dimension
    output = self.dense(decoder_output)
    return output
Is there any error in using the encoder input as the decoder input? I've seen examples in Keras where they initialize the decoder input as an np.array with the shape of the decoder output. I've tried setting the decoder input to an array of zeros, but the results (in terms of accuracy) decay really badly.

I've found this in 'Hands-On Machine Learning':
In other words, the decoder is given as input the word that it should have output at the previous step (regardless of what it actually output). For the very first word, it is given the start-of-sequence (SOS) token. The decoder is expected to end the sentence with an end-of-sequence (EOS) token.
Therefore, I suppose that if the encoder input is composed of the first n observations of the z features, then no matter what the encoder outputs, we should feed the decoder with the encoder states together with a decoder input that is the expected output shifted by one step, in other words the first n observations of the label (teacher forcing). Despite this, in my experiments in Python there is no evidence of better results. Maybe feeding the decoder with the shifted label sequence only helps when there are a lot of features.
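For reference, this is roughly how a decoder input with teacher forcing can be built for a time-series seq2seq model; this is only a sketch with assumed shapes and an arbitrary start value, not the code from my model:

import numpy as np

def make_decoder_input(target_seq, start_value=0.0):
    # target_seq: (batch, timesteps, features), the second half of each sequence.
    # Returns an array of the same shape: the target shifted right by one step,
    # with an SOS-like start value in the first position.
    decoder_input = np.zeros_like(target_seq)
    decoder_input[:, 0, :] = start_value
    decoder_input[:, 1:, :] = target_seq[:, :-1, :]
    return decoder_input

# Example: batch of 32 sequences, decoder covers the last 10 timesteps, 3 features
target = np.random.randn(32, 10, 3)
dec_in = make_decoder_input(target)
print(dec_in.shape)  # (32, 10, 3)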

Related

Difference between One hot encoding and Label Encoding of target/output label

I have a problem where there are 20 classes. I have designed a neural network and using the loss as categorical_crossentropy.
When dealing with categorical cross entropy the output label must be one hot encoded.
So, when I one-hot encoded the output labels, the label in every row was one-hot encoded, giving a matrix, while with the label encoder I got the encoding as a single array.
from sklearn.preprocessing import OneHotEncoder
import numpy as np

oht = OneHotEncoder()
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1, 1))
Below is the snippet for label encoding:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)
[sample output: one hot encoding]
[sample output: label encoding]
I find that one-hot encoding gives a matrix while label encoding gives an array. Can I please know, when one-hot encoding does the same job, why we also have a label encoder? What kind of optimization does the label encoder bring?
If using the label encoder happens to be more optimal, then why do we not use the label encoder to encode categorical input data instead of one-hot encoding?
Label encoding imposes an artificial order: if you label-encode your pet target as 'Dog': 0, 'Cat': 1, 'Turtle': 2, 'Golden Fish': 3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
In the case of predictor features (not the target), this is a problem, since your random forest can end up learning something like "if it is less than 'Turtle', then ...".
Also, you may have categories in the test set (or even worse, in new data during deployment) that were not present in training, and the transformer doesn't know what to do with them, so it throws an error. Whether this matters depends on the particular problem and the particular feature you are encoding, and obviously it does not apply to the target variable.
When one-hot encoding, if a category that was absent in training shows up at prediction time, it just gets encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error. Your model still has the other features to make a reasonable guess.
As a general rule, you want to use label encoding for target variables and OHE for predictor features. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical also (A forest will choose a number, not a range of numbers; a network will have one activation unit per category...)
I don't think optimization should be part of the discussion here since they are used for different scenarios demanding different outputs: surely it's more efficient to use the OHE transformer than trying to hack it by performing label encoding and then some pandas trickery to create the same result as with one hot encoding.
Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.
Here there's an example on why label encoding is a bad practice for input features.
And let's not forget that the goal of the model is to make predictions, so at the end what's important is not just the output of <transformer>.fit_transform, but also the fitted transformer itself that's going to be applied to the new observations. OHE will deal with new cases differently than label-encoder (e.g. when the value of the feature in the observation was not present in the training set). That's in my opinion enough reason to have different methods, even when they act in a way similar enough so, for some inputs, you may be able to force them to give similar outputs.
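To make the practical difference concrete, here is a small sketch of how the two transformers behave on a category unseen during training; the categories and the handle_unknown='ignore' setting are illustrative assumptions (by default sklearn's OneHotEncoder also raises an error on unseen categories):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = np.array([['Dog'], ['Cat'], ['Turtle']])
test = np.array([['Golden Fish']])            # category never seen during fit

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)
print(ohe.transform(test).toarray())          # [[0. 0. 0.]] -> all zeros, no error

le = LabelEncoder()
le.fit(train.ravel())
# le.transform(['Golden Fish'])               # would raise ValueError: unseen label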

LSTM, multi-variate, multi-feature in pytorch

I'm having trouble understanding the data format for an LSTM in PyTorch. Let's say I have a CSV file with 4 features, laid out in timestamps one after the other (a classic time series):
time1 feature1 feature2 feature3 feature4
time2 feature1 feature2 feature3 feature4
time3 feature1 feature2 feature3 feature4
time4 feature1 feature2 feature3 feature4, label
However, this entire set of 4 sequences only has a single label. The thing we're trying to classify started at time1, but we don't know how to label it until time 4.
My question is, can a typical PyTorch LSTM support this? All of the tutorials I've read, watched, or walked through involve looking at a time sequence of a single feature, or a word model, which is still a dataset with a single dimension.
If it can support it, does the data need to be flattened in some way?
Pytorch's LSTM reference states:
input: tensor of shape (L, N, H_in) when batch_first=False or (N, L, H_in) when batch_first=True, containing the features of the input sequence. The input can also be a packed variable length sequence.
Does this mean that it cannot support any input that contains multiple sequences? Or is there another name for this?
I'm really lost here, and could use any advice, pointers, help, so on. Maybe some disambiguation too.
I've posted a couple times here but gotten no responses at all. If this post is misplaced, could someone kindly direct me towards the correct place to post it?
Edit: Following Daniel's advice, do I understand correctly that the four features should be put together like this:
[(feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4), label] when given to the LSTM?
If that's correct, is the input size (16) in this case?
Finally, I was under the impression that the output of the LSTM Would be the predicted label. Do I have that wrong?
As you show, the LSTM layer's input shape is (batch_size, sequence_length, feature_size). This means that the feature at each timestep is assumed to be a 1D vector.
So to use it in your case you need to stack your four features into one vector per timestep (if they are more than 1D themselves, flatten them first) and use that vector as the layer's input.
Regarding the label: it is definitely supported to have a label only after a few timesteps. The LSTM will output a sequence with the same length as the input sequence, but when training the LSTM you can choose to use any part of that sequence in the loss function. In your case you will want to use the last element only.
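A minimal PyTorch sketch of that setup, with illustrative sizes (4 features per timestep, one label per 4-step sequence, only the last timestep fed to the classifier head):

import torch
import torch.nn as nn

batch_size, seq_len, n_features, hidden_size, n_classes = 8, 4, 4, 32, 2

lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
head = nn.Linear(hidden_size, n_classes)

x = torch.randn(batch_size, seq_len, n_features)  # (N, L, H_in): 4 timesteps, 4 features each
out, (h_n, c_n) = lstm(x)                         # out: (N, L, hidden_size)
logits = head(out[:, -1, :])                      # use only the last timestep for the single label
print(logits.shape)                               # torch.Size([8, 2])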

What is Sequence length in LSTM?

The dimensions for the input data for LSTM are [Batch Size, Sequence Length, Input Dimension] in tensorflow.
What is the meaning of Sequence Length & Input Dimension ?
How do we assign the values to them if my input data is of the form :
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied on sequential data, which without loss of generality means data samples that change over a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t-time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7,1). Then, your full data can be described by a 3rd order tensor of shape (num_samples, 7, 1) which can be accepted by a LSTM.
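As a small sketch of that reshaping, assuming the single sample from the question and the (num_samples, t, N) convention above:

import numpy as np

x = np.array([1.23, 2.24, 5.68, 9.54, 6.90, 7.74, 3.26])  # a scalar evolving over 7 timesteps

x = x.reshape(1, 7, 1)  # (num_samples, seq_len, input_dim) = (1, 7, 1)
print(x.shape)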
Simply put, seq_len is the number of timesteps that will be fed into the LSTM network. Let's understand this with an example.
Suppose you are doing sentiment classification with an LSTM.
Your input sentence to the network is ["I hate to eat apples"]. Every single token is fed as input at each timestep, so the seq_len here is the total number of tokens in the sentence, which is 5.
Coming to input_dim: as you might know, we can't feed words directly to the network; you need to encode those words as numbers. In PyTorch/TensorFlow, embedding layers are used, where you have to specify the embedding dimension.
Suppose your embedding dimension is 50. That means the embedding layer takes the index of each token and converts it into a vector representation of size 50, so the input_dim of the LSTM becomes 50.
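A brief PyTorch sketch of those shapes; the vocabulary size and token indices are made up for illustration:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 100, 50, 64
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, batch_first=True)

tokens = torch.tensor([[11, 42, 7, 3, 55]])  # "I hate to eat apples" -> 5 token indices, seq_len = 5
embedded = embedding(tokens)                 # (batch=1, seq_len=5, input_dim=50)
out, _ = lstm(embedded)                      # (1, 5, hidden_size)
print(embedded.shape, out.shape)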

How to train a neural network in forward manner and using it in backward manner

I have a neural network with an input layer having 10 nodes, some hidden layers and an output layer with only 1 node. Then I put a pattern in the input layer, and after some processing, it outputs the value in the output neuron which is a number from 1 to 10. After the training this model is able to get the output , provided the input pattern.
Now, my question is whether it is possible to calculate the inverse model: that is, I provide a number on the output side (i.e. use the output side as input) and then get a (random) pattern back from those 10 input neurons (i.e. use the input side as output).
I want to do this because I will first train a network on basis of difficulty of pattern (input is the pattern and output is difficulty to understand the pattern). Then I want to feed the network with a number so it creates the random patterns on basis of difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Supposing that is correct, there is at least one way I know of to do this approximately. It is very easy to implement, but it might take a while to compute a value; there are probably better ways, but I am not sure (I needed this technique a few weeks ago in the context of reinforcement learning and did not find anything better). Let's assume your model M maps an input x to an output y. We now construct a new model I, which will later compute the (approximate) inverse of M, i.e. give you an input which yields a specific output. To build I, we create a new model consisting of one plain Dense layer with the same dimension m as the input of M, and connect this layer to the input of M. Next, you make all weights of M non-trainable (this is very important!).
Now we are set up to find an inverse value: assume you want to find the input corresponding to the output y ("corresponding" here means it produces that output, but it is not necessarily unique). You create a new input vector v which is the all-ones (unity) vector of dimension m, and then create a single input-output training pair (v, y). Now you use any optimizer you wish to train on this pair until the error converges to zero. Once this has happened, you can compute the actual input that yields the output y as follows: if the weights of the new input layer are w and the bias is b, the desired input u is u = w*1 + b (where 1 denotes the all-ones input vector).
You might ask why this equation holds, so let me try to explain: the model will try to learn the weights of the new input layer so that the unity input produces the given output. Since only the newly added input layer is trainable, only these weights change. Therefore, each weight in this layer ends up representing the corresponding component of the desired input vector. By using an optimizer to minimize the L2 distance between the wanted output and the output of our inverse model I, we finally obtain a set of weights that gives a good approximation of the input vector.
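A minimal PyTorch sketch of the same idea: instead of prepending a trainable input layer, the candidate input itself is made trainable while the trained model's weights stay frozen, which is the same trick in a slightly more direct form. The architecture and target value below are illustrative assumptions:

import torch
import torch.nn as nn

# Stand-in for the already trained model: 10 inputs -> 1 output (difficulty score)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
for p in model.parameters():
    p.requires_grad = False                   # freeze the model, like making M non-trainable

target = torch.tensor([[7.0]])                # desired output we want to invert
u = torch.zeros(1, 10, requires_grad=True)    # candidate input pattern to optimize

opt = torch.optim.Adam([u], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = ((model(u) - target) ** 2).mean()  # L2 distance between wanted and produced output
    loss.backward()
    opt.step()

print(model(u))  # close to 7.0; u is one (non-unique) input that produces that output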

Input of LSTM seq2seq network - Tensorflow

Using the Tensorflow seq2seq tutorial code I am creating a character-based chatbot. I don't use word embeddings. I have an array of characters (the alphabet and some punctuation marks) and special symbols like the GO, EOS and UNK symbol.
Because I'm not using word embeddings, I use the standard tf.nn.seq2seq.basic_rnn_seq2seq() seq2seq model. However, I am confused about what shape encoder_inputs and decoder_inputs should have. Should they be an array of integers, corresponding to the index of the characters in the alphabet-array, or should I turn those integers into one-hot vectors first?
How many input nodes does one LSTM cell have? Can you specify that? Because I guess in my case an LSTM cell should have an input neuron for each letter in the alphabet (therefore the one-hot vectors?).
Also, what is the LSTM "size" you have to pass in the constructor tf.nn.rnn_cell.BasicLSTMCell(size)?
Thank you.
Appendix: these are the bugs I am trying to fix.
When I use the following code, according to the tutorial:
for i in xrange(buckets[-1][0]):  # Last bucket is the biggest one.
    self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
for i in xrange(buckets[-1][1] + 1):
    self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    self.target_weights.append(tf.placeholder(dtype, shape=[None], name="weight{0}".format(i)))
And run the self_test() function, I get the error:
ValueError: Linear is expecting 2D arguments: [[None], [None, 32]]
Then, when I change the shapes in the above code to shape=[None, 32] I get this error:
TypeError: Expected int32, got -0.21650635094610965 of type 'float' instead.
The number of inputs of an LSTM cell is the dimension of whatever tensor you pass as inputs to the tf.rnn function when instantiating things.
The size argument is the number of hidden units in your LSTM (so a bigger number is slower but can lead to more accurate models).
I'd need a bigger stack trace to understand these errors.
It turns out the size argument passed to BasicLSTMCell represents both the size of the hidden state of the LSTM and the size of the input layer. So if you want a different hidden size than input size, you can first propagate your inputs through an additional projection layer or use the built-in seq2seq word embeddings function.
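For the indices vs. one-hot question, here is a small framework-agnostic sketch of the two representations for a character sequence; the alphabet and special symbols below are assumptions for illustration:

import numpy as np

alphabet = ['_GO', '_EOS', '_UNK', 'a', 'b', 'c', '.', ' ']
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}

text = "ab c."
indices = [char_to_idx.get(ch, char_to_idx['_UNK']) for ch in text]  # one integer id per character

one_hot = np.zeros((len(indices), len(alphabet)), dtype=np.float32)
one_hot[np.arange(len(indices)), indices] = 1.0                      # each character becomes a one-hot row

print(indices)        # e.g. [3, 4, 7, 5, 6]
print(one_hot.shape)  # (5, 8): sequence length x alphabet size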
