I'm getting some weird results while training a model. My loss starts very low and gets higher after every epoch, whilst my val_loss starts very high, then gradually gets lower. val_acc is also going up a bit too. I don't get how this could happen. Is this just overfitting?
Here's my model:
num_units=96
batch_size=120
sequence_length=10
input_classes=9
output_classes=2
dropout_rate=0.6
model.add(LSTM(num_units, stateful=False, batch_input_shape=(batch_size, sequence_length, input_classes)))
model.add(Dropout(dropout_rate))
model.add(Dense(output_classes, activation='softmax'))
model.compile(loss='binary_crossentropy',optimizer=optimizers.Adam(),metrics=['accuracy'])
Related
I'm trying to train a small CNN from scratch to classify images of 10 different animal species. The images have different dimensions, but I'd say around 300x300. Anyway, every image is resized to 224x224 before going into the model.
Here is the network I'm training:
# Convolution 1
self.cnn1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=0)
self.relu1 = nn.ReLU()
# Max pool 1
self.maxpool1 = nn.MaxPool2d(kernel_size=2)
# Convolution 2
self.cnn2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=0)
self.relu2 = nn.ReLU()
# Max pool 2
self.maxpool2 = nn.MaxPool2d(kernel_size=2)
# Fully connected 1
self.fc1 = nn.Linear(32 * 54 * 54, 10)
I'm using a SGD optimizer with fixed learning rate = 0.005 and weight decay = 0.01. I'm using a cross entropy function.
The accuracy of the model is good (around 99% after the 43-th epoch). However:
in some epoch I get a 'nan' as training loss
in some other epoch the accuracy drops significantly (sometimes the two happen in the same epoch). However, in the next epoch the accuracy comes back to a normal level.
If I understood it correctly a nan in training loss most of the times is caused by gradient values getting too small (underflow) or too big (overflow). Could this be the case?
Should I try by increasing the weight decay to 0.05? Or should I do gradient clipping to avoid exploding gradients? If so which would be a reasonable bound?
Still I don't understand the second issue.
I'm playing with CIFAR-10 dataset using ResNet-50 on Keras with Tensorflow backend, but I ran into a very strange training pattern, where the model loss decreased first, and then started to increase until it plateaued/stuck at a single value due to almost 0 learning rate. Correspondingly, the model accuracy increased first, and then started to decrease until it plateaued at 10% (aka random guess). I wonder what is going wrong?
Typically this U shaped pattern happens with a learning rate that is too large (like this post), but it is not the case here. This pattern also doesn't' look like a classic "over-fitting" as both the training and validation loss increase over time. In the answer to the above linked post, someone mentioned that if Adam optimizer is used, loss may explode under small learning rate when local minimum is exceeded, I'm not sure I can follow what is said there, and also I'm using SGD with weight decay instead of Adam.
Specifically for the training set up, I used resent50 with random initialization, SGD optimizer with 0.9 momentum and a weight decay of 0.0001 using decoupled weight decay regularization, batch-size 64, initial learning_rate 0.01 which declined by a factor of 0.5 with each 10 epochs of non-decreasing validation loss.
base_model = tf.keras.applications.ResNet50(include_top=False,
weights=None,pooling='avg',
input_shape=(32,32,3))
prediction_layer = tf.keras.layers.Dense(10)
model = tf.keras.Sequential([base_model,
prediction_layer])
SGDW = tfa.optimizers.extend_with_decoupled_weight_decay(tf.keras.optimizers.SGD)
optimizer = SGDW(weight_decay=0.0001, learning_rate=0.01, momentum=0.9)
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=["accuracy"])
reduce_lr= tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.5, patience=10)
model.compile(optimizer=optimizer,
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=["accuracy"])
model.fit(train_batches, epochs=250,
validation_data=validation_batches,
callbacks=[reduce_lr])
I'm very new to deep learning models, and trying to train a multiple time series model using LSTM with Keras Sequential. There are 25 observations per year for 50 years = 1250 samples, so not sure if this is even possible to use LSTM for such small data. However, I have thousands of feature variables, not including time lags. I'm trying to predict a sequence of the next 25 time steps of data. The data is normalized between 0 and 1. My problem is that, despite trying many obvious adjustments, I cannot get the LSTM validation loss anywhere close to the training loss (overfitting dramatically, I think).
I have tried adjusting number of nodes per hidden layer (25-375), number of hidden layers (1-3), dropout (0.2-0.8), batch_size (25-375), and train/ test split (90%:10% - 50%-50%). Nothing really makes much of a difference on the validation loss/ training loss disparity.
# SPLIT INTO TRAIN AND TEST SETS
# 25 observations per year; Allocate 5 years (2014-2018) for Testing
n_test = 5 * 25
test = values[:n_test, :]
train = values[n_test:, :]
# split into input and outputs
train_X, train_y = train[:, :-25], train[:, -25:]
test_X, test_y = test[:, :-25], test[:, -25:]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 5, newdf.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 5, newdf.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# design network
model = Sequential()
model.add(Masking(mask_value=-99, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(LSTM(375, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(125, return_sequences=True))
model.add(Dropout(0.8))
model.add(LSTM(25))
model.add(Dense(25))
model.compile(loss='mse', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=20, batch_size=25, validation_data=(test_X, test_y), verbose=2, shuffle=False)
Epoch 19/20
14s - loss: 0.0512 - val_loss: 188.9568
Epoch 20/20
14s - loss: 0.0510 - val_loss: 188.9537
I assume I must be doing something obvious wrong, but can't realize it since I'm a newbie. I am hoping to either get some useful validation loss achieved (compared to training), or know that my data observations are simply not large enough for useful LSTM modeling. Any help or suggestions is much appreciated, thanks!
Overfitting
In general, if you're seeing much higher validation loss than training loss, then it's a sign that your model is overfitting - it learns "superstitions" i.e. patterns that accidentally happened to be true in your training data but don't have a basis in reality, and thus aren't true in your validation data.
It's generally a sign that you have a "too powerful" model, too many parameters that are capable of memorizing the limited amount of training data. In your particular model you're trying to learn almost a million parameters (try printing model.summary()) from a thousand datapoints - that's not reasonable, learning can extract/compress information from data, not create it out of thin air.
What's the expected result?
The first question you should ask (and answer!) before building a model is about the expected accuracy. You should have a reasonable lower bound (what's a trivial baseline? For time series prediction, e.g. linear regression might be one) and an upper bound (what could an expert human predict given the same input data and nothing else?).
Much depends on the nature of the problem. You really have to ask, is this information sufficient to get a good answer? For many real life time problems with time series prediction, the answer is no - the future state of such a system depends on many variables that can't be determined by simply looking at historical measurements - to reasonably predict the next value, you need to bring in lots of external data other than the historical prices. There's a classic quote by Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."
While training a convolutional neural network following this article, the accuracy of the training set increases too much while the accuracy on the test set settles.
Below is an example with 6400 training examples, randomly chosen at each epoch (so some examples might be seen at the previous epochs, some might be new), and 6400 same test examples.
For a bigger data set (64000 or 100000 training examples), the increase in training accuracy is even more abrupt, going to 98 on the third epoch.
I also tried using the same 6400 training examples each epoch, just randomly shuffled. As expected, the result is worse.
epoch 3 loss 0.54871 acc 79.01
learning rate 0.1
nr_test_examples 6400
TEST epoch 3 loss 0.60812 acc 68.48
nr_training_examples 6400
tb 91
epoch 4 loss 0.51283 acc 83.52
learning rate 0.1
nr_test_examples 6400
TEST epoch 4 loss 0.60494 acc 68.68
nr_training_examples 6400
tb 91
epoch 5 loss 0.47531 acc 86.91
learning rate 0.05
nr_test_examples 6400
TEST epoch 5 loss 0.59846 acc 68.98
nr_training_examples 6400
tb 91
epoch 6 loss 0.42325 acc 92.17
learning rate 0.05
nr_test_examples 6400
TEST epoch 6 loss 0.60667 acc 68.10
nr_training_examples 6400
tb 91
epoch 7 loss 0.38460 acc 95.84
learning rate 0.05
nr_test_examples 6400
TEST epoch 7 loss 0.59695 acc 69.92
nr_training_examples 6400
tb 91
epoch 8 loss 0.35238 acc 97.58
learning rate 0.05
nr_test_examples 6400
TEST epoch 8 loss 0.60952 acc 68.21
This is my model (I'm using RELU activation after each convolution):
conv 5x5 (1, 64)
max-pooling 2x2
dropout
conv 3x3 (64, 128)
max-pooling 2x2
dropout
conv 3x3 (128, 256)
max-pooling 2x2
dropout
conv 3x3 (256, 128)
dropout
fully_connected(18*18*128, 128)
dropout
output(128, 128)
What could be the cause?
I'm using Momentum Optimizer with learning rate decay:
batch = tf.Variable(0, trainable=False)
train_size = 6400
learning_rate = tf.train.exponential_decay(
0.1, # Base learning rate.
batch * batch_size, # Current index into the dataset.
train_size*5, # Decay step.
0.5, # Decay rate.
staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
0.9).minimize(cost, global_step=batch)
This is very much expected. This problem is called over-fitting. This is when your model starts "memorizing" the training examples without actually learning anything useful for the Test set. In fact, this is exactly why we use a test set in the first place. Since if we have a complex enough model we can always fit the data perfectly, even if not meaningfully. The test set is what tells us what the model has actually learned.
Its also useful to use a Validation set which is like a test set, but you use it to find out when to stop training. When the Validation error stops lowering you stop training. why not use the test set for this? The test set is to know how well your model would do in the real world. If you start using information from the test set to choose things about your training process, than its like your cheating and you will be punished by your test error no longer representing your real world error.
Lastly, convolutional neural networks are notorious for their ability to over-fit. It has been shown the Conv-nets can get zero training error even if you shuffle the labels and even random pixels. That means that there doesn't have to be a real pattern for the Conv-net to learn to represent it. This means that you have to regularize a conv-net. That is, you have to use things like Dropout, batch normalization, early stopping.
I'll leave a few links if you want to read more:
Over-fitting, validation, early stopping
https://elitedatascience.com/overfitting-in-machine-learning
Conv-nets fitting random labels:
https://arxiv.org/pdf/1611.03530.pdf
(this paper is a bit advanced, but its interresting to skim through)
P.S. to actually improve your test accuracy you will need to change your model or train with data augmentation. You might want to try transfer learning as well.
How do you compute for the training accuracy for SGD? Do you compute it using the batch data you trained your network with? Or using the entire dataset? (for each batch optimization iteration)
I tried computing the training accuracy for each iteration using the batch data I trained my network with. And it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration gave me 100%). Is this because I am computing the accuracy on the same batch data I trained it with for that iteration? Or is my model overfitting that it gave me 100% instantly, but the validation accuracy is low? (this is the main question here, if this is acceptable, or there is something wrong with the model)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during the training process is correct. If the number of the accuracy is always multiple of 10%, then most likely it is because your batch size is 10. For example, if 8 of the training outputs match the labels, then your training accuracy will be 80%. If the training accuracy number goes up and down, there are two main possibilities:
1. If you print out the accuracy numbers multiple time over one epoch, it is normal, especially at the early stage of training, because the model is predicting over different data samples;
2. If you print out the accuracy once each epoch, and if you see the training accuracy goes up and down during the later stage of the training, that means your learning rate is too big. You need to decease that overtime during the training.
If these do not answer your question, please provider more details so that we can help.