In the Keras documentation, both stateful and unroll are set to False. So how is the recurrent done in Keras if it's neither of these?
Keras RNN documentaion
I have checked the source code for RNN in Keras, it seems that the default action is to initialize the LSTM at every time step. Am I worng?
if initial_state is not None:
pass
elif self.stateful:
initial_state = self.states
else:
initial_state = self.get_initial_state(inputs)
If I was correct, does it mean that, for time series analysis, it would be better to set unroll=True ?
Neither unrolled nor stateful.
Remember that "stateful" in Keras means only that "two consecutive batches will be interpreted as two parts of the same sequences". Nothing else. (Batch 2 is a sequel of batch 1)
All LSTM's, of course, have states (it's impossible not to).
Be careful with the expression "initialize the LSTM". A stateful=False layer will "reset states" for every batch. The practical result is: "each batch is a group of individual sequences from start to end". (Batch 2 is not a sequel of batch 1)
"States" are information about the "history of a sequence up to the current step". They are completely different from "weights", which are what the layer actually learned from all sequences.
"Unroll" is a way to transform the recurrent calculations into a single graph without recurrency. It's meant only for short sequences, it gets faster processing at the expense of using more memory.
Related
I'm having trouble understanding the format of data for an LSTM in pytorch. Lets say i have a CSV file with 4 features, laid out in timestamps one after the other ( a classic time series)
time1 feature1 feature2 feature3 feature4
time2 feature1 feature2 feature3 feature4
time3 feature1 feature2 feature3 feature4
time4 feature1 feature2 feature3 feature4, label
However, this entire set of 4 sequences only has a single label. The thing we're trying to classify started at time1, but we don't know how to label it until time 4.
My question is, can a typical pytorch LSTM support this? All of the tutorials i've read, watched, walked through, involve looking at a time sequence of a single feature, or a word model, which is still a dataset with a single dimension.
If it can support it, does the data need to be flattened in some way?
Pytorch's LSTM reference states:
input: tensor of shape (L,N,Hin)(L, N, H_{in})(L,N,Hin) when batch_first=False or (N,L,Hin)(N, L, H_{in})(N,L,Hin) when batch_first=True containing the features of the input sequence. The input can also be a packed variable length sequence.
Does this mean that it cannot support any input that contains multiple sequences? Or is there another name for this?
I'm really lost here, and could use any advice, pointers, help, so on. Maybe some disambiguation too.
I've posted a couple times here but gotten no responses at all. If this post is misplaced, could someone kindly direct me towards the correct place to post it?
Edit: Following Daniel's advice, do i understand correctly that the four features should be put together like this:
[(feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4), label] when given to the LSTM?
If that's correct, is the input size (16) in this case?
Finally, I was under the impression that the output of the LSTM Would be the predicted label. Do I have that wrong?
As you show, the LSTM layer's input size is (batch_size, Sequence_length, feature_size). This means that the feature is assumed to be a 1D vector.
So to use it in your case you need to stack your four features into one vector (if they are more then 1D themselves then flatten them first) and use that vector as the layer's input.
Regarding the label. It is defiantly supported to have a label only after a few iterations. The LSTM will output a sequence with the same length as the input sequence, but when training the LSTM you can choose to use any part of that sequence in the loss function. In your case you will want to use the last element only.
Since two operations Conv2DBackpropFilter and Conv2DBackpropInput count most of the time for lots of applications(AlexNet/VGG/GAN/Inception, etc.), I am analyzing the complexity of these two operations (back-propagation) in TensorFlow and I found out that there are three implementation versions (custom, fast and slot) for Conv2DBackpropFilter (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_grad_filter_ops.cc ) and Conv2DBackpropInput (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_grad_input_ops.cc). While I profile, all computations are passed to "custom" version instead of "fast" or "slow" which directly calls Eigen function SpatialConvolutionBackwardInput to do that.
The issue is:
Conv2DBackpropFilter uses Eigen:“TensorMap.contract" to do the tensor contraction and Conv2DBackpropInput uses Eigen:"MatrixMap.transpose" to do the matrix transposition in the Compute() function. Beside these two functions, I didn't see any convolutional operations which are needed for back-propagation theoretically. Beside convolutions, what else would be run inside these two operations for back-propagation? Does anyone know how to analyze the computation complexity of "back propagation" operation in TensorFlow?
I am looking for any advise/suggestion. Thank you!
In addition to the transposition and contraction, the gradient op for the filter and the gradient op for the input must transform their input using Im2Col and Col2Im respectively. Approximately speaking, these transformations enable the convolution operation to be implemented using tensor contraction. For more information, see the CS231n page on Convolutional Networks (specifically, the paragraphs titled "Implementation as Matrix Multiplication" and "Backpropagation").
mrry, I got it. It means that Conv2D, Conv2DBackpropFilter and Conv2DBackpropInput use the same way by using "GEMM" to work for convolution by Im2Col/Col2Im. An other issue is that while I do the profile of GAN in TensorFlow, the execution time of Conv2DBackpropInput and Conv2DBackpropFilter are around 4-6 times slower than Conv2D with the same input size. Why?
I've built an LSTM In Keras with the goal of predicting future values of a time-series from a high-dimensional, time-index input.
However, there's a unique requirement: for certain time points in the future, we know with certainty what some values of the input series will be. For example:
model = SomeLSTM()
trained_model = model.train(train_data)
known_data = [(24, {feature: 2, val: 7.0}), (25, {feature: 2, val: 8.0})]
predictions = trained_model(look_ahead=48, known_data=known_data)
Which would train the model up to time t (the end of training), and predict forward 48 time periods from time t, but substituting known_data values for feature 2 at times 24 and 25.
How exactly can I explicitly inject this into the LSTM at some time?
For reference, here's the model:
model = Sequential()
model.add(LSTM(hidden, input_shape=(look_back, num_features)))
model.add(Dropout(dropout))
model.add(Dense(look_ahead))
model.add(Activation('linear'))
This may be a result of my un-intuitive grasp of LSTMs, and I'd appreciate any clarification. I've dived into the Keras source code, and my first guess is to inject it right into the LSTM state variable, but I'm unsure how to do that at time t (or even if that is correct.)
I think a clean way of doing this is to introduce 2*look_ahead new features, where for each 0 <= i < look_ahead 2*i-th feature is an indicator whether the value of the i-th time step is known and (2*i+1)-th is the value itself (0 if not known). Accordingly, you can generate training data with these features to make your model take into account these known values.
I am not exactly sure what you are trying to do, but maybe create your own layer to go at the end that sets the data to the known values, similar to how dropout sets random values to zero. As a side note, I have had better results with pooling than dropout, so maybe try switching that out and training it. Here is a good guide on how to do it. https://www.tutorialspoint.com/keras/keras_customized_layer.htm
When looking at the RNN example at Tensorflow im having an issue with how the initial state is constructed. At build time of the graph we limit the graph to only handle input of one batch size. This is an issue for me since I want to be able feed in a single example and get a prediction for that single example.
The part of the code that restricts this is:
initial_state = state = tf.zeros([batch_size, lstm.state_size])
So my question is how can I expand the example so that I can use a variable batch size so that I can use the same model for training with batch size and then use single example for predictions?
This is how I'm doing this. You can pass the batch_size as a variable like this:
batch_size = tf.placeholder(tf.int32)
init_state = cell.zero_state(batch_size, tf.float32)
where cell is one of RNN cells (BasicLSTMCell, BasicGRUCell, MultiRNNCell, etc). However, if you're preserving the state over multiple batches that won't work since its' size has to be constant.
The Tensorflow text generation tutorial explains how to do this (now TF 2.0). It seems that the batch_size becomes part of the built model, so you have to rebuild/reload from the saved weights with a new batch size:
https://www.tensorflow.org/tutorials/text/text_generation#restore_the_latest_checkpoint
To keep this prediction step simple, use a batch size of 1.
Because of the way the RNN state is passed from timestep to timestep,
the model only accepts a fixed batch size once built.
To run the model with a different batch_size, we need to rebuild the
model and restore the weights from the checkpoint.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
I don't know for sure why you have to do this, but I always assumed it's because batching for recurrent layers requires management of multiple, parallel hidden state pipelines, so it preallocates them.
This is rather a weird problem.
A have a code of back propagation which works perfectly, like this:
Now, when I do batch learning I get wrong results even if it concerns just a simple scalar function approximation.
After training the network produces almost the same output for all input patterns.
By this moment I've tried:
Introduced bias weights
Tried with and without updating of input weights
Shuffled the patterns in batch learning
Tried to update after each pattern and accumulating
Initialized weights in different possible ways
Double-checked the code 10 times
Normalized accumulated updates by the number of patterns
Tried different layer, neuron numbers
Tried different activation functions
Tried different learning rates
Tried different number of epochs from 50 to 10000
Tried to normalize the data
I noticed that after a bunch of back propagations for just one pattern, the network produces almost the same output for large variety of inputs.
When I try to approximate a function, I always get just line (almost a line). Like this:
Related question: Neural Network Always Produces Same/Similar Outputs for Any Input
And the suggestion to add bias neurons didn't solve my problem.
I found a post like:
When ANNs have trouble learning they often just learn to output the
average output values, regardless of the inputs. I don't know if this
is the case or why it would be happening with such a simple NN.
which describes my situation closely enough. But how to deal with it?
I am coming to a conclusion that the situation I encounter has the right to be. Really, for each net configuration, one may just "cut" all the connections up to the output layer. This is really possible, for example, by setting all hidden weights to near-zero or setting biases at some insane values in order to oversaturate the hidden layer and make the output independent from the input. After that, we are free to adjust the output layer so that it just reproduces the output as is independently from the input. In batch learning, what happens is that the gradients get averaged and the net reproduces just the mean of the targets. The inputs do not play ANY role.
My answer can not be fully precise because you have not posted the content of the functions perceptron(...) and backpropagation(...).
But from what I guess, you train your network many times on ONE data, then completely on ONE other in a loop for data in training_data, which leads that your network will only remember the last one. Instead, try training your network on every data once, then do that again many times (invert the order of your nested loops).
In other word, the for I = 1:number of patterns loop should be inside the backpropagation(...) function's loop, so this function should contain two loops.
EXAMPLE (in C#):
Here are some parts of a backpropagation function, I simplified it here. At each update of the weights and biases, the entire network is "propagated". The following code can be found at this URL: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx
public double[] Train(double[][] trainData, int maxEpochs, double learnRate, double momentum)
{
//...
Shuffle(sequence); // visit each training data in random order
for (int ii = 0; ii < trainData.Length; ++ii)
{
//...
ComputeOutputs(xValues); // copy xValues in, compute outputs
//...
// Find new weights and biases
// Update weights and biases
//...
} // each training item
}
Maybe what is not working is just that you want to enclose everything after this comment (in Batch learn as an example) with a secondary for loop to do multiple epochs of learning:
%--------------------------------------------------------------------------
%% Get all updates