LSTM training pattern - machine-learning

I'm fairly new to NNs and I'm doing my own "Hello World" with LSTMs instead of copying something. I have chosen a simple logic as follows:
The input has 3 timesteps: the first one is either 1 or 0, the other 2 are random numbers. The expected output is the same as the first timestep of the input. The data feed looks like:
_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this.
This simple(?) logic can be trained to 100% accuracy. I ran many tests and the most efficient model I found was 3 LSTM layers, each with 15 hidden units. This reached 100% accuracy after 22 epochs.
However, I noticed something that I struggle to understand: in the first 12 epochs the model makes no progress at all as measured by accuracy (it stays at 0.5) and only marginal progress as measured by categorical crossentropy (it goes from 0.69 to 0.65). Then from epoch 12 through epoch 22 it trains very fast to an accuracy of 1.0. The question is: why does training happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?
Here is my entire code:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, LSTM
from keras.models import Model
import helper
from keras.utils.np_utils import to_categorical

x_, y_ = helper.rnn_csv_toXY("LSTM_hello.csv", 3, "target")
y_binary = to_categorical(y_)

model = Sequential()
model.add(LSTM(15, input_shape=(3, 1), return_sequences=True))
model.add(LSTM(15, return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)

It is hard to give a specific answer to this as it depends on many factors. One major factor that comes into play when training neural networks is the learning rate of the optimizer you choose.
In your code you have no specific learning rate set. The default learning rate of Adam in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step, defined as
lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t)) .
The values of beta_2 and beta_1 are commonly left at their default values of 0.999 and 0.9, respectively. If you plot this learning rate over the training steps, you get a curve like the one produced by the sketch below.
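Here is that sketch, using the Keras 2.0.3 defaults quoted above (purely illustrative):
import numpy as np
import matplotlib.pyplot as plt

# Adam defaults in Keras 2.0.3: lr=0.001, beta_1=0.9, beta_2=0.999
lr, beta_1, beta_2 = 0.001, 0.9, 0.999
t = np.arange(1, 2001)
lr_t = lr * np.sqrt(1. - beta_2**t) / (1. - beta_1**t)

plt.plot(t, lr_t)
plt.xlabel('step t')
plt.ylabel('lr_t')
plt.show()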
It might just be that this is the sweet spot for updating your weights to find a local (possibly a global) minimum. A learning rate that is too high often makes no difference, as it just 'skips' over the regions that would lower your error, whereas lower learning rates take smaller steps in your error landscape and let you find regions where the error is lower.
I suggest that you use an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), and that you test this hypothesis by using a lower learning rate.
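For example, a minimal sketch of that experiment, reusing the model from the question (the 0.01 learning rate is only an illustrative starting point, not a recommendation):
from keras.optimizers import SGD

# Recompile the same model with plain SGD and an explicit, fixed learning rate
# to see whether the long initial plateau disappears.
model.compile(optimizer=SGD(lr=0.01),
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)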

Related

training loss increases over time despite a tiny learning rate

I'm playing with the CIFAR-10 dataset using ResNet-50 in Keras with the TensorFlow backend, but I ran into a very strange training pattern: the model loss decreased first, and then started to increase until it plateaued/got stuck at a single value once the learning rate had decayed to almost 0. Correspondingly, the model accuracy increased first, and then started to decrease until it plateaued at 10% (i.e. random guessing). I wonder what is going wrong?
Typically this U-shaped pattern happens with a learning rate that is too large (like in this post), but that is not the case here. The pattern also doesn't look like classic over-fitting, as both the training and validation loss increase over time. In the answer to the post linked above, someone mentioned that if the Adam optimizer is used, the loss may explode under a small learning rate when a local minimum is overshot; I'm not sure I can follow what is said there, and in any case I'm using SGD with weight decay instead of Adam.
Specifically, for the training setup I used ResNet-50 with random initialization, an SGD optimizer with 0.9 momentum and a weight decay of 0.0001 using decoupled weight decay regularization, batch size 64, and an initial learning rate of 0.01 that is reduced by a factor of 0.5 after every 10 epochs without a decrease in validation loss.
import tensorflow as tf
import tensorflow_addons as tfa

# train_batches / validation_batches are the CIFAR-10 tf.data pipelines prepared elsewhere
base_model = tf.keras.applications.ResNet50(include_top=False,
                                             weights=None, pooling='avg',
                                             input_shape=(32, 32, 3))
prediction_layer = tf.keras.layers.Dense(10)
model = tf.keras.Sequential([base_model, prediction_layer])

# SGD with decoupled weight decay (SGDW)
SGDW = tfa.optimizers.extend_with_decoupled_weight_decay(tf.keras.optimizers.SGD)
optimizer = SGDW(weight_decay=0.0001, learning_rate=0.01, momentum=0.9)

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_batches, epochs=250,
          validation_data=validation_batches,
          callbacks=[reduce_lr])

Keras model accuracy not improving

I'm trying to train a neural network to predict the ratings of players in FIFA 18 by EA Sports (ratings are between 64 and 99). I'm using their players database (https://easports.com/fifa/ultimate-team/api/fut/item?page=1) and I've processed the data into training_x, testing_x, training_y, testing_y. Each of the training samples is a numpy array containing 7 values: the first 6 are the different stats of the player (shooting, passing, dribbling, etc.) and the last value is the position of the player (which I mapped to 1-8, depending on the position). Each of the target values is a single integer between 64 and 99, representing the rating of that player.
I've tried many different hyperparameters, including changing the activation functions to tanh and relu. I've tried adding a batch normalization layer after the first dense layer (I thought it might be useful since one of my features is very small and the others are between 50 and 99), played around with the SGD optimizer (changed the learning rate and momentum, even tried changing the optimizer to Adam), tried different loss functions, added/removed dropout layers, and tried different regularizers for the weights of the model.
model = Sequential()
model.add(Dense(64, input_shape=(7,),
                kernel_regularizer=regularizers.l2(0.01)))
# batch normalization?
model.add(Activation('sigmoid'))
model.add(Dense(64, kernel_regularizer=regularizers.l2(0.01),
                activation='sigmoid'))
model.add(Dropout(0.3))
model.add(Dense(32, kernel_regularizer=regularizers.l2(0.01),
                activation='sigmoid'))
model.add(Dense(1, activation='linear'))
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_absolute_error', metrics=['accuracy'],
              optimizer=sgd)
model.fit(training_x, training_y, epochs=50, batch_size=128, shuffle=True)
When I train the model, the loss is always nan and the accuracy is always 0, even though I've tried adjusting a lot of different parameters. However, if I remove the last feature from my data, the position of the players, and update the input shape of the first dense layer, the model actually "trains" and ends up with around 6% accuracy no matter what parameters I change. In that case, I've found that the model only predicts 79 to be the player's rating. What am I doing inherently wrong?
You can try the following steps (a minimal sketch combining them follows the list):
Use the mean squared error loss function.
Use Adam, which will help you converge faster with a low learning rate like 0.0001 or 0.001. Otherwise, try the RMSprop optimizer.
Use the default regularizers, that is, none at all.
Since this is a regression task, use an activation function like ReLU in all layers (including the first hidden layer) except the output layer. Use a linear activation in the output layer.
As mentioned in the comments by @pooyan, normalize the features. See here. You can even try standardizing the features; use whichever suits best.
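Here is that sketch, assuming the training_x/training_y arrays from the question (the hyperparameter values are illustrative, not tuned):
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

# Normalize the features; the same fitted scaler should also be applied to testing_x.
scaler = MinMaxScaler()
training_x = scaler.fit_transform(training_x)

model = Sequential()
model.add(Dense(64, input_shape=(7,), activation='relu'))  # ReLU, no regularizers
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='linear'))                   # linear output for regression

model.compile(loss='mean_squared_error',                   # MSE instead of MAE
              optimizer=optimizers.Adam(lr=0.001))         # Adam with a low learning rate
model.fit(training_x, training_y, epochs=50, batch_size=128, shuffle=True)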

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and test set. The training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
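For illustration, here is a hedged sketch of that cutting step (this is not the referenced stateful_cut() implementation, just the general idea of splitting a single long sequence into non-overlapping chunks of T_after_cut timesteps):
import numpy as np

T_after_cut = 24

def cut_sequence(a, t):
    # For a single sample of shape (1, n_steps, n_feats), split the time axis
    # into consecutive chunks of t timesteps: (1, n_steps, f) -> (n_steps // t, t, f).
    n_samples, n_steps, n_feats = a.shape
    n_cuts = n_steps // t
    return a[:, :n_cuts * t, :].reshape(n_samples * n_cuts, t, n_feats)

# X_train has shape (1, 61320, 12) and y_train (1, 61320, 10) as described above.
print(cut_sequence(X_train, T_after_cut).shape)  # (2555, 24, 12)
print(cut_sequence(y_train, T_after_cut).shape)  # (2555, 24, 10)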
Next, I build and train the LSTM model as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results, not even for the training set; thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differencing the input feature time-series (subtracting the previous value from the current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
A Dropout layer after the LSTM layer (though dropout is usually used to reduce variance and my model has high bias) -> slightly better results, but the difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter grid search...
In the end, I managed to solve this in the following way:
Using more samples to train instead of only 1 (I used 18 samples for training and 6 for testing)
Keeping the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardising both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Using stateful LSTM as described here, but resetting states after each epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Using CuDNNLSTM instead of LSTM; this sped up training by about 4x!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that GRU is as accurate as LSTM, though it converged faster for the same number of units.
Monitoring validation loss during training and using early stopping
The LSTM model is built and trained as follows:
from keras import layers
from keras.models import Sequential
from keras.callbacks import Callback, EarlyStopping
from keras.optimizers import RMSprop

# batch_size, T_after_cut, features, targets, nb_cuts, n_epochs, n_batch and the
# data arrays (X_dev, y_dev, X_eval, y_eval) are defined as described above.

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0

        def on_batch_begin(self, batch, logs={}):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs={}):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)

earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R² over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

Why should we normalize data for deep learning in Keras?

I was testing some network architectures in Keras for classifying the MNIST dataset. I have implemented one that is similar to the LeNet.
I have seen that in the examples that I have found on the internet, there is a step of data normalization. For example:
X_train /= 255
I have performed a test without this normalization and I have seen that the performance (accuracy) of the network decreased (keeping the same number of epochs). Why did this happen?
If I increase the number of epochs, can the accuracy reach the same level reached by the model trained with normalization?
So, does normalization affect the accuracy, or only the training speed?
The complete source code of my training script is below:
from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dense
from keras.datasets import mnist
from keras.utils import np_utils
from keras.optimizers import SGD, RMSprop, Adam
import numpy as np
import matplotlib.pyplot as plt
from keras import backend as k

def build(input_shape, classes):
    model = Sequential()
    model.add(Conv2D(20, kernel_size=5, padding="same", activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Conv2D(50, kernel_size=5, padding="same", activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(500))
    model.add(Activation("relu"))
    model.add(Dense(classes))
    model.add(Activation("softmax"))
    return model

NB_EPOCH = 4            # number of epochs
BATCH_SIZE = 128        # size of the batch
VERBOSE = 1             # set the training phase as verbose
OPTIMIZER = Adam()      # optimizer
VALIDATION_SPLIT = 0.2  # percentage of the training data used for evaluating the loss function
IMG_ROWS, IMG_COLS = 28, 28  # input image dimensions
NB_CLASSES = 10         # number of outputs = number of digits
INPUT_SHAPE = (1, IMG_ROWS, IMG_COLS)  # shape of the input

(X_train, y_train), (X_test, y_test) = mnist.load_data()
k.set_image_dim_ordering("th")

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
X_train = X_train[:, np.newaxis, :, :]
X_test = X_test[:, np.newaxis, :, :]
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

y_train = np_utils.to_categorical(y_train, NB_CLASSES)
y_test = np_utils.to_categorical(y_test, NB_CLASSES)

model = build(input_shape=INPUT_SHAPE, classes=NB_CLASSES)
model.compile(loss="categorical_crossentropy",
              optimizer=OPTIMIZER, metrics=["accuracy"])
history = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

model.save("model2")
score = model.evaluate(X_test, y_test, verbose=VERBOSE)
print('Test accuracy:', score[1])
Normalization is a generic concept not limited only to deep learning or to Keras.
Why to normalize?
Let me take a simple logistic regression example which will be easy to understand and to explain normalization.
Assume we are trying to predict whether a customer should be given a loan or not. Among the many available independent variables, let's just consider Age and Income.
Let the equation be of the form:
Y = weight_1 * (Age) + weight_2 * (Income) + some_constant
Just for the sake of explanation, let Age usually be in the range [0, 120] and assume Income is in the range [10000, 100000]. The scales of Age and Income are very different. If you use them as-is, the weights weight_1 and weight_2 may end up biased: weight_2 might give Income far more importance as a feature than weight_1 gives to Age, simply because of its larger scale. To bring them to a common level, we can normalize them. For example, we can bring all ages into the range [0, 1] and all incomes into the range [0, 1], as the sketch below illustrates. Now we can say that Age and Income are given equal importance as features.
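As a toy illustration of that min-max rescaling (the numbers are made up, not from any real dataset):
import numpy as np

age = np.array([25.0, 40.0, 60.0, 80.0])                 # roughly in [0, 120]
income = np.array([20000.0, 45000.0, 70000.0, 95000.0])  # roughly in [10000, 100000]

def min_max(x):
    # Rescale a feature linearly into the range [0, 1].
    return (x - x.min()) / (x.max() - x.min())

age_scaled = min_max(age)        # now in [0, 1]
income_scaled = min_max(income)  # now in [0, 1]
print(age_scaled, income_scaled)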
Does Normalization always increase the accuracy?
Apparently, no. It is not necessary that normalization always increases accuracy. It may or may not; you never really know until you try. Again, it depends on at which stage of your training you apply normalization, whether you apply normalization after every activation, etc.
As normalization narrows the range of the feature values down to a particular interval, it is easier to perform computations over that smaller range of values, so the model usually trains a bit faster.
Regarding the number of epochs, accuracy usually increases with number of epochs provided that your model doesn't start over-fitting.
A very good explanation for Normalization/Standardization and related terms is here.
In a nutshell, normalization reduces the complexity of the problem your network is trying to solve. This can potentially increase the accuracy of your model and speed up the training. You bring the data on the same scale and reduce variance. None of the weights in the network are wasted on doing a normalization for you, meaning that they can be used more efficiently to solve the actual task at hand.
As @Shridhar R Kulkarni says, normalization is a general concept and doesn't only apply to Keras.
It's often applied as part of data preparation for ML models to change numeric values in the dataset to a standard scale without distorting the differences in their ranges. As such, normalization keeps the features within a model consistent with each other.
However, not every dataset and use case requires normalization; it's primarily necessary when features have different ranges. You may use it when:
-You want to improve your model's convergence and make optimization feasible
-You want training to be less sensitive to the scale of the features, so the coefficients can be estimated more reliably
-You want to improve comparisons across multiple models
Normalization is not recommended when:
-You are using decision-tree models or ensembles based on them
-Your data is not normally distributed; you may have to use other data pre-processing techniques
-Your dataset already comprises scaled variables
In some cases, normalization can improve performance. However, it is not always necessary.
The critical thing is to understand your dataset and scenario first, then you’ll know whether you need it or not. Sometimes, you can experiment to see if it gives you good performance or not.
Check out deepchecks to see how to deal with important data-related checks you come across in ML.
For example, to check for duplicated data in your set, you can use the following code:
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd
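A hedged completion of that snippet might look like the following (the exact deepchecks API can differ between versions, and df is an assumed pandas DataFrame standing in for your data):
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset
import pandas as pd

# Toy stand-in data with one duplicated row.
df = pd.DataFrame({"age": [25, 40, 40], "income": [20000, 45000, 45000]})

dataset = Dataset(df, cat_features=[])   # wrap the DataFrame for deepchecks
result = DataDuplicates().run(dataset)   # run the duplicate-data check
print(result.value)                      # fraction of duplicated rows found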
I think there are some issues with the convergence of the optimizer too. Here I show a simple linear regression. Three examples:
First, an array with small values, and it works as expected.
Second, an array with bigger values, where the loss function explodes toward infinity, suggesting the need to normalize. Finally, in model 3, the same array as in case two, but normalized, and we get convergence.
GitHub Colab-enabled IPython notebook
I've used MSE as the loss function; I don't know if other optimizer/loss combinations suffer from the same issue.
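This is not the author's notebook, but a rough sketch of those three cases under stated assumptions (synthetic data, a single-unit Keras linear model, plain SGD with a fixed learning rate):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def final_loss(x, y):
    # Fit a one-unit linear model and return the last training loss.
    model = Sequential([Dense(1, input_shape=(1,))])
    model.compile(optimizer=SGD(lr=0.01), loss='mse')
    history = model.fit(x, y, epochs=20, verbose=0)
    return history.history['loss'][-1]

x_small = np.linspace(0, 1, 100).reshape(-1, 1)
x_big = np.linspace(0, 1000, 100).reshape(-1, 1)
y_small, y_big = 3 * x_small + 2, 3 * x_big + 2
x_norm = (x_big - x_big.mean()) / x_big.std()
y_norm = (y_big - y_big.mean()) / y_big.std()

print(final_loss(x_small, y_small))  # case 1: small values, trains stably
print(final_loss(x_big, y_big))      # case 2: large raw values, loss typically blows up to inf/NaN
print(final_loss(x_norm, y_norm))    # case 3: normalized values, converges again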

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into scrolling window sequences (i.e. for vector l).
np.array([l[i:i+window_size] for i in range( len(l) - window_size)])
The target vector is similarly windowed, so the neural net output is a prediction of the target vector for the next window_size time steps. All the data is normalized with a min-max scaler. It is fed into the neural network with shape (nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot. Target vector in blue, predictions in red (plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, number of epochs, activation functions, short and fat networks, skinny, tall. This is my current one (it's a little out there).
Conv1D(64,4, input_shape=(None,3)) ->
Conv1d(32,4) ->
Dropout(24) ->
LSTM(32) ->
Dense(window_size)
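For reference, a literal Keras translation of that sketch might look like the following (window_size and the dropout rate are assumptions on my part; the Dropout(24) above is presumably meant as a rate such as 0.24):
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, LSTM, Dense

window_size = 50  # placeholder value

model = Sequential()
model.add(Conv1D(64, 4, input_shape=(None, 3)))
model.add(Conv1D(32, 4))
model.add(Dropout(0.24))
model.add(LSTM(32))
model.add(Dense(window_size))
model.compile(optimizer='adam', loss='mse')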
But nothing I try stops the neural net from outputting this repeated pattern. I must be misunderstanding something about time series or LSTMs in Keras, but I'm very lost at this point, so any help is greatly appreciated. I've attached the full code at this repository.
https://github.com/jaybutera/dat-toy
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards, in that you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8) so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line in common.py from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
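For clarity, here is a hypothetical sketch of what such a windowfy() helper could look like (the real one lives in the linked repository; this just mirrors the windowing expression from the question):
import numpy as np

def windowfy(X, winsize):
    # Stack overlapping windows of winsize consecutive timesteps, turning an
    # array of shape (n_samples, n_features) into (n_samples - winsize, winsize, n_features).
    return np.array([X[i:i + winsize] for i in range(len(X) - winsize)])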
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
np.random.seed(11798)
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
                                                    len(y) - OUTPUT_WINDOW)])
print(np.shape(X)) # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y)) # (990, 5, 1) - Five timesteps of one feature
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
print(m.summary())
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))
