Validation Loss Much Higher Than Training Loss

Validation Loss Much Higher Than Training Loss - machine-learning

I'm very new to deep learning models, and trying to train a multiple time series model using LSTM with Keras Sequential. There are 25 observations per year for 50 years = 1250 samples, so not sure if this is even possible to use LSTM for such small data. However, I have thousands of feature variables, not including time lags. I'm trying to predict a sequence of the next 25 time steps of data. The data is normalized between 0 and 1. My problem is that, despite trying many obvious adjustments, I cannot get the LSTM validation loss anywhere close to the training loss (overfitting dramatically, I think).
I have tried adjusting number of nodes per hidden layer (25-375), number of hidden layers (1-3), dropout (0.2-0.8), batch_size (25-375), and train/ test split (90%:10% - 50%-50%). Nothing really makes much of a difference on the validation loss/ training loss disparity.
# SPLIT INTO TRAIN AND TEST SETS
# 25 observations per year; Allocate 5 years (2014-2018) for Testing
n_test = 5 * 25
test = values[:n_test, :]
train = values[n_test:, :]
# split into input and outputs
train_X, train_y = train[:, :-25], train[:, -25:]
test_X, test_y = test[:, :-25], test[:, -25:]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 5, newdf.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 5, newdf.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(Masking(mask_value=-99, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(LSTM(375, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(125, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(25))
model.add(Dense(25))
model.compile(loss='mse', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=20, batch_size=25, validation_data=(test_X, test_y), verbose=2, shuffle=False)
Epoch 19/20
14s - loss: 0.0512 - val_loss: 188.9568
Epoch 20/20
14s - loss: 0.0510 - val_loss: 188.9537
I assume I must be doing something obvious wrong, but can't realize it since I'm a newbie. I am hoping to either get some useful validation loss achieved (compared to training), or know that my data observations are simply not large enough for useful LSTM modeling. Any help or suggestions is much appreciated, thanks!

Overfitting
In general, if you're seeing much higher validation loss than training loss, then it's a sign that your model is overfitting - it learns "superstitions" i.e. patterns that accidentally happened to be true in your training data but don't have a basis in reality, and thus aren't true in your validation data.
It's generally a sign that you have a "too powerful" model, too many parameters that are capable of memorizing the limited amount of training data. In your particular model you're trying to learn almost a million parameters (try printing model.summary()) from a thousand datapoints - that's not reasonable, learning can extract/compress information from data, not create it out of thin air.
What's the expected result?
The first question you should ask (and answer!) before building a model is about the expected accuracy. You should have a reasonable lower bound (what's a trivial baseline? For time series prediction, e.g. linear regression might be one) and an upper bound (what could an expert human predict given the same input data and nothing else?).
Much depends on the nature of the problem. You really have to ask, is this information sufficient to get a good answer? For many real life time problems with time series prediction, the answer is no - the future state of such a system depends on many variables that can't be determined by simply looking at historical measurements - to reasonably predict the next value, you need to bring in lots of external data other than the historical prices. There's a classic quote by Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."

Related

Is this LSTM underfitting?

I am trying to create a model that predicts if it will rain in the next 5 days (multi-step) or not, so I dont need the precipitation value, just a "yes" or "no". I've been testing with some different tools/algorithms and I guess the big challenge here is dealing with the zero skewed data.
The dataset consists of hourly data that has columns such as precipitation, temperature, pressure, wind speed, humidity. It has around 1 milion rows. There is no requisite to use a multivariate approach.
Rain occurs mostly on months 1,2,3,11 and 12.
So I tried using a univariate LSTM on the data, and with hourly sample I had the best results. I used the following architecture:
model=Sequential()
model.add(LSTM(150,return_sequences=True,input_shape=(1,look_back)))
model.add(LSTM(50,return_sequences=True))
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(trainX, trainY, epochs=15, batch_size=4096, validation_data=(testX, testY), shuffle=False)
I'm using a lookback value of 24*60, which should mean 2 months.
Train/Validation Loss:
https://i.stack.imgur.com/CjDbR.png
Final result:
https://i.stack.imgur.com/p6SnD.png
So I read that this train/validation loss means the model is underfitting, is it? What could I do to prevent this?
Before using LSTM I tried using Prophet, which rendered really bad results and tried used autoarima, but it couldn't handle a yearly seasonality (365 days).

In case of underfitting what you can do is icreasing the learning rate, increasing training duration and number of training data.
It is also worth having some external metric such as the F1 score because loss isn't a good metrics for human evaluation.
Just looking at your example I would start with experimenting a bit with the loss function, it seems like your data is binary so it would be wiser to use a binary loss instead of a regression loss

Validation and training loss per batch and epoch

I am using Pytorch to run some deep learning models. I am currently keeping track of training and validation loss per epoch, which is pretty standard. However, what is the best way of going about keeping track of training and validation loss per batch/iteration?
For training loss, I could just keep a list of the loss after each training loop. But, validation loss is calculated after a whole epoch, so I’m not sure how to go about the validation loss per batch. The only thing I can think of is to run the whole validation step after each training batch and keeping track of those, but that seems overkill and a lot of computation.
For example, the training is like this:
for epoch in range(2): # loop over the dataset multiple times
running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
And for validation loss:
with torch.no_grad():
for data in testloader:
images, labels = data
outputs = net(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
# validation loss
batch_loss = error(outputs.float(), labels.long()).item()
loss_test += batch_loss
loss_test /= len(testloader)
The validation loss/test part is done per epoch. I’m looking for a way to get the validation loss per batch, which is my point above.
Any tips?

Well, you're right that's the way to do it "run the whole validation step after each training batch and keeping track of those" and also as you've thought it's pretty time-consuming and would be overkill. However, If that's something you really need then there's a way you can do it. What you can do is, let's say you've 1000 batches in your data. Now to calculate per batch val_loss you can choose not to run the validation step for each of the batch (then you'd have to do it 1000 times!) but for a small subset of those batches, let's say 50/100 (choose as you please or find feasible). Now, you can use some statistical power so that your calculation for 50/100 batches becomes very very close to that of 1000 batches (meaning this val_loss for a small number of batches must be as close as to those of 1000 batches if you had calculated that), so to achieve it you can introduce some randomness in your batch selection.
This means you randomly select 100 batches from your 1000 batches for which you'll run the validation step.

An epoch is the process of making the model go through the entire training set - which is, generally, divided into batches. Also, it tends to be shuffled. The validation set, on the other hand is used to tune the hyper-parameters of your training and find out what's your model's behavior towards new data. In that respect, to me, evaluating at epoch=1/2 doesn't make much sense. Because the question is - whatever the performance on the evaluation set at epoch=1/2 - what can you do about it? Since, you don't know which data it has been going through in the first half of the epoch, there's no way to take advantage of 'a first half being better'... And remember your data will likely be shuffled into batches.
Therefore, I would stick with the classic approach: train on the entire set then, and only then, evaluate on another set. In some cases, you won't even allow yourself to evaluate once per epoch, because of the computation time. Instead you would evaluate every n epochs. But then again it will depend on your dataset size, your sampling from that dataset, the batch size, and the computation cost.
For the training loss, you can keep track of its value per-update-step vs. per-epoch. This will give you much more control over whether or not your model is learning independently from the validation phase.
Edit - As an alternative for not having to run the entire evaluation set per train batch you could do the following: shuffle your validation and set the same batch size as your trainset.
len(trainset)//batch_size is the number of updates per epoch
len(validset)//batch_size is the number of allowed evaluation per epoch
Every len(trainset)//len(validset) train updates you can evaluate on 1
batch
This allows you to get a feedback len(trainset)//len(validset) times per epoch.
If you set your train/valid ratio as 0.1, then len(validset)=0.1*len(trainset), that's ten partial evaluations per epoch.

How many hours of training does it take to get decent error in House Prices Dataset using Neural Network

I'm new to Machine Learning and I'm trying to implement linear regression using keras on this dataset https://www.kaggle.com/harlfoxem/housesalesprediction . Although I think classical machine learning will be more suited to this problem, I want to use Neural Network to learn about it. I have done feature selection and removed some features with high correlation with each other, and now have 8 features left. I have hnormalized my features, but not the labels. I have read and know that Neural Networks generally take time to train, I just want to ask this question to prevent me from investing further time on a model that might won't work. Right now, I am training a model with this design:
model = Sequential()
model.add(Dense(10, inputshape = (10, ) , activation =LeakyReLU()))
model.add(Dense(7, activation=LeakyReLU()))
model.add(Dense(1))
model.compile(optimizer ="adam", loss = "meansquarederror", metrics = ["meansquared_error"])
and right now, it's been 13,000 epochs and 8 hours, and I'm still getting :
loss: 66127403415.9417 - meansquarederror: 66127421440.0000 - valloss: 75086529026.4872 - valmeansquarederror: 75086495744.0000
Although I can see that the loss has been slowly improving (It started at about 300 billion) . So how many hours of training does it take to get decent error on this dataset? Am I on the right track?

What does initial_epoch in Keras mean?

I'm a little bit confused about initial_epoch value in fit and fit_generator methods. Here is the doc:
initial_epoch: Integer. Epoch at which to start training (useful for resuming a previous training run).
I understand, it is not useful if you start training from scratch. It is useful if you trained your dataset and want to improve accuracy or other values (correct me if I'm wrong). But I'm not sure what it really does.
So after all this, I have 2 questions:
What does initial_epoch do and what is it for?
When can I use initial_epoch?
When I change my dataset?
When I change the learning rate, optimizer or loss function?
Both of them?

Since in some of the optimizers, some of their internal values (e.g. learning rate) are set using the current epoch value, or even you may have (custom) callbacks that depend on the current value of epoch, the initial_epoch argument let you specify the initial value of epoch to start from when training.
As stated in the documentation, this is mostly useful when you have trained your model for some epochs, say 10, and then saved it and now you want to load it and resume the training for another 10 epochs without disrupting the state of epoch-dependent objects (e.g. optimizer). So you would set initial_epoch=10 (i.e. we have trained the model for 10 epochs) and epochs=20 (not 10, since the total number of epochs to reach is 20) and then everything resume as if you were initially trained the model for 20 epochs in one single training session.
However, note that when using built-in optimizers of Keras you don't need to use initial_epoch, since they store and update their state internally (without considering the value of current epoch) and also when saving a model the state of the optimizer will be stored as well.

The answer above is correct however it is important to note that if you have trained for 10 epochs and set initial_epoch=10 and epochs=20 you train for 10 more epochs until you reach a total of 20 epochs. For example I trained for 2 epochs, then set initial_epoch=2 and epochs=4. The result is it trains for 4-2=2 more epochs. The new data in the history object starts at epoch 3. So the returned history object does start from epoch 1 as you might expect. Another words the state of the history object is not preserved from the initial training epochs. If you do not set initial_epoch and you train for 2 epochs, then rerun the fit_generator with epochs=4 it will train for 4 more epochs starting from the state preserved at the end of the second epoch (provided you use the built in optimizers). Again the history object state is NOT preserved from the initial training and only contains the data for the last 4 epochs. I noticed this because I plot the validation loss versus epochs.

Here is an example of how to integrate the initial_epoch in your code
#Training first 4 Epcohs and saving
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32, epochs=4)
model.save("partial.h5")
#loading the model, training another 4 Epochs and then saving the updated model.
from keras.models import load_model
new_model = load_model('partial.h5')
new_model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32, initial_epoch=4,epochs=8)
new_model.save("updated.h5")
Also don't forget to specify a particular random_state value while splitting the data into train and test, so that it encounters the same set of training data each time you reinitiate the training process, so that there is no data leakage of test data entering the training data.

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard first year of data, as the first time steps of the hygrothermal response of the wall is influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise test set analogously using mean an variance from training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
Next, I build and train the LSTM model as follows:
model = Sequential()
model.add(LSTM(128,
batch_input_shape=(batch_size,T_after_cut,features),
return_sequences=True,
stateful=True,
))
model.addTimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...

In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but add reset states after epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
use CuDNNLSTM instead of LSTM, this speed up the training time x4!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). Also I found that the GRU is as accurate as the LSTM tough converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is build and trained as follows:
def define_reset_states_batch(nb_cuts):
class ResetStatesCallback(Callback):
def __init__(self):
self.counter = 0
def on_batch_begin(self, batch, logs={}):
# reset states when nb_cuts batches are completed
if self.counter % nb_cuts == 0:
self.model.reset_states()
self.counter += 1
def on_epoch_end(self, epoch, logs={}):
# reset states after each epoch
self.model.reset_states()
return(ResetStatesCallback)
model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size,T_after_cut ,features),
return_sequences=True,
stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False, validation_data=(X_eval,y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very statisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart