Training LSTMs in Keras with time series of different length - machine-learning

I'm new to Keras and wondering how to train an LTSM with (interrupted) time series of different lengths. Consider, for example, a continuous series from day 1 to day 10 and another continuous series from day 15 to day 20. Simply concatenating them to a single series might yield wrong results. I see basically two options to bring them to shape (batch_size, timesteps, output_features):
Extend the shorter series by some default value (0), i.e. for the above example we would have the following batch:
d1, ..., d10
d15, ..., d20, 0, 0, 0, 0, 0
Compute the GCD of the lengths, cut the series into pieces, and use a stateful LSTM, i.e.:
d1, ..., d5
d6, ..., d10
reset_state
d15, ..., d20
Are there any other / better solutions? Is training a stateless LSTM with a complete sequence equivalent to training a stateful LSTM with pieces?

Have you tried feeding the LSTM layer with inputs of different length? The input time-series can be of different length when LSTM is used (even the batch sizes can be different from one batch to another, but obvisouly the dimension of features should be the same). Here is an example in Keras:
from keras import models, layers
n_feats = 32
latent_dim = 64
lstm_input = layers.Input(shape=(None, n_feats))
lstm_output = layers.LSTM(latent_dim)(lstm_input)
model = models.Model(lstm_input, lstm_output)
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, None, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 64) 24832
=================================================================
Total params: 24,832
Trainable params: 24,832
Non-trainable params: 0
As you can see the first and second axis of Input layer is None. It means they are not pre-specified and can be any value. You can think of LSTM as a loop. No matter the input length, as long as there are remaining data vectors of same length (i.e. n_feats), the LSTM layer processes them. Therefore, as you can see above, the number of parameters used in a LSTM layer does not depend on the batch size or time-series length (it only depends on input feature vector's length and the latent dimension of LSTM).
import numpy as np
# feed LSTM with: batch_size=10, timestamps=5
model.predict(np.random.rand(10, 5, n_feats)) # This works
# feed LSTM with: batch_size=5, timestamps=100
model.predict(np.random.rand(5, 100, n_feats)) # This also works
However, depending on the specific problem you are working on, this may not work; though I don't have any specific examples in my mind now in which this behavior may not be suitable and you should make sure all the time-series have the same length.

Related

Keras: Is there any workaround to split the output of an intermediate layer without using Lamda layer?

Say, I have a 10x10x4 intermediate output of a convolution layer, which I need to split into 100 1x1x4 volume and apply softmax on each to get 100 outputs from the network. Is there any way to accomplish this without using the Lambda layer? The issue with the Lambda layer in this case is this simple task of splitting takes 100 passes through the lambda layer during forward pass, which makes the network performance very slow for my practical use. Please suggest a quicker way of doing this.
Edit: I had already tried the Softmax+Reshape approach before asking the question. With that approach, I would be getting a 10x10x4 matrix reshaped to a 100x4 Tensor with use of Reshape as the output. What I really need is a multi output network with 100 different outputs. In my application, it is not possible to jointly optimize over the 10x10 matrix, but I get good results by using a network with 100 different outputs with the Lambda layer.
Here are code snippets of my approach using the Keras functional API:
With Lambda layer (slow, gives 100 Tensors of shape (None, 4) as desired):
# Assume conv_output is output from a convolutional layer with shape (None, 10, 10,4)
preds = []
for i in range(10):
for j in range(10):
y = Lambda(lambda x, i,j: x[:, i, j,:], arguments={'i': i,'j':j})(conv_output)
preds.append(Activation('softmax',name='predictions_' + str(i*10+j))(y))
model = Model(inputs=img, outputs=preds, name='model')
model.compile(loss='categorical_crossentropy',
optimizer=Adam(),
metrics=['accuracy']
With Softmax+Reshape (fast, but gives Tensor of shape (None, 100, 4))
# Assume conv_output is output from a convolutional layer with shape (None, 10, 10,4)
y = Softmax(name='softmax', axis=-1)(conv_output)
preds = Reshape([100, 4])(y)
model = Model(inputs=img, outputs=preds, name='model')
model.compile(loss='categorical_crossentropy',
optimizer=Adam(),
metrics=['accuracy']
I don't think in the second case it is possible to individually optimize over each of the 100 outputs (probably one can think of it as learning the joint distribution, whereas I need to learn the marginals as in the first case). Please let me know if there is any way to accomplish what I am doing with the Lambda layer in the first code snippet in a faster way
You can use the Softmax layer and set the axis argument to the last axis (i.e. -1) to apply softmax over that axis:
from keras.layers import Softmax
soft_out = Softmax(axis=-1)(conv_out)
Note that the axis argument by default is set to -1, so you may not even need to pass that.

RNN Encoder Decoder using keras

I am trying to build an architecture which will be used for machine language translation (from English to French)
model = Sequential()
model.add(LSTM(256, input_shape =(15,1)))
model.add(RepeatVector(output_sequence_length))
model.add(LSTM(21,return_sequences=True))
model.add(TimeDistributed(Dense(french_vocab_size, activation='sigmoid'))
Max length of English sentence is 15 and that of French is 21. Max number of English words is 199 and that of French is 399. output_sequence_length is 21.
This model throws me an error
Error when checking input: expected lstm_40_input to have shape (None, 15, 1) but got array with shape (137861, 21, 1)
I am stuck with the understanding of the LSTM in keras.
1.The first argument according to documentation must be 'dimensionality of output space'. I did not understand what that means. Also,
what exactly happens return_sequences is set to True
Please let me know.
What Kind of data are you trying to feed your network ? Because it seems to me that you didn't convert your words to vectors (binary vectors or encoded vectors).
Anyway, a LSTM Netword need a 3 dimensional entry, the dimensions correspond to that : (samples , timesteps , features).
In your case, samples correspond to the numebr of your sentences, I guess 137861. Timesteps correspond to the length of each sequence, which In your case is 15, and features is the size of each encoded word ( Depending on which type of encoding you choose. If you choose OneHotEncoding, it will be 199).
The error that you got shows that you fed your network sequences with 21 timesteps instead of 15.
For your second question, when return_sequences is set to False, it returns only one output per LSTM layer, which in your case will be (256, ) for your first LSTM layer. when it's set to True, it will have one output per timestep, giving you an overall output of shape (15 , 256). When you want to stack two or more LSTM layers, you always have to set the first layers to return_sequences = True.
Also, what you are building is called a Many to Many architecture, with different timestep lengths for the input and the output (15 vs 21). As far as I know, it's not that easy to implement in keras.

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard first year of data, as the first time steps of the hygrothermal response of the wall is influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise test set analogously using mean an variance from training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
Next, I build and train the LSTM model as follows:
model = Sequential()
model.add(LSTM(128,
batch_input_shape=(batch_size,T_after_cut,features),
return_sequences=True,
stateful=True,
))
model.addTimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but add reset states after epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
use CuDNNLSTM instead of LSTM, this speed up the training time x4!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). Also I found that the GRU is as accurate as the LSTM tough converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is build and trained as follows:
def define_reset_states_batch(nb_cuts):
class ResetStatesCallback(Callback):
def __init__(self):
self.counter = 0
def on_batch_begin(self, batch, logs={}):
# reset states when nb_cuts batches are completed
if self.counter % nb_cuts == 0:
self.model.reset_states()
self.counter += 1
def on_epoch_end(self, epoch, logs={}):
# reset states after each epoch
self.model.reset_states()
return(ResetStatesCallback)
model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size,T_after_cut ,features),
return_sequences=True,
stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False, validation_data=(X_eval,y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very statisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into scrolling window sequences (i.e. for vector l).
np.array([l[i:i+window_size] for i in range( len(l) - window_size)])
The target vector is similarly windowed so the neural net output is a prediction of the target vector for the next window_size number of time steps. All the data is normalized with a min-max scaler. It is fed into the neural network as a shape=(nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot. Target vector in blue, predictions in red (plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, number of epochs, activation functions, short and fat networks, skinny, tall. This is my current one (it's a little out there).
Conv1D(64,4, input_shape=(None,3)) ->
Conv1d(32,4) ->
Dropout(24) ->
LSTM(32) ->
Dense(window_size)
But nothing I try will affect the neural net from outputting this repeated pattern. I must be misunderstanding something about time-series or LSTMs in Keras. But I'm very lost at this point so any help is greatly appreciated. I've attached the full code at this repository.
https://github.com/jaybutera/dat-toy
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards, in that you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8) so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line in common.py from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
np.random.seed(11798)
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
len(y) - OUTPUT_WINDOW)])
print(np.shape(X)) # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y)) # (990, 5, 1) - Five timesteps of one features
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
print(m.summary())
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))

Keras LSTM multi-dimension input

my input time series data is of the shape (nb_samples, 75, 32).
75 is the timesteps and 32 is the input dimension.
model = Sequential()
model.add(LSTM(4, input_shape=(75, 32)))
model.summary()
The LSTM weight vectors,[W_i, W_c, W_f, W_o] are all 32 dimensions, but the output is just a single value. the output shape of the above model is (1,4). But in LSTM the output is also a vector so should not it be (32,4) for many to one implementation as above? why is it giving a single value for multi-dimension input also?
As you can read in the Keras doc for reccurent layers
For an input of shape (nb_sample, timestep, input_dim), you have two possible outputs:
if you set return_sequence=True in your LSTM (which is not your case), you return every hidden state, so the intermediate steps when the LSTM 'reads' your sequence. You get an output of shape (nb_sample, timestep, output_dim).
if you set return_sequence=False (which is the default), it will only output the last state. So you will get an output of shape (nb_sample, output_dim).
So if you define your LSTM layer like this :
model.add(LSTM(4, return_sequence=True, input_shape=(75, 32)))
you will have an output of shape (None, 75, 4). If 32 is your time dimension, you will have to transpose the data before feeding it to the LSTM. The first dimension is the temporal one.
I hope this helps :)

Resources