Keras RNN loss does not decrease over epoch - machine-learning

I built a RNN using Keras. The RNN is used to solve a regression problem:
def RNN_keras(feat_num, timestep_num=100):
model = Sequential()
model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512, activation='relu', return_sequences=True))
model.add(BatchNormalization())
model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
model.add(BatchNormalization())
model.add(TimeDistributed(Dense(output_dim=1, activation='relu'))) # sequence labeling
rmsprop = RMSprop(lr=0.00001, rho=0.9, epsilon=1e-08)
model.compile(loss='mean_squared_error',
optimizer=rmsprop,
metrics=['mean_squared_error'])
return model
The whole process looks fine. But the loss stays the exact same over epochs.
61267 in the training set
6808 in the test set
Building training input vectors ...
888 unique feature names
The length of each vector will be 888
Using TensorFlow backend.
Build model...
# Each batch has 1280 examples
# The training data are shuffled at the beginning of each epoch.
****** Iterating over each batch of the training data ******
Epoch 1/3 : Batch 1/48 | loss = 11011073.000000 | root_mean_squared_error = 3318.232910
Epoch 1/3 : Batch 2/48 | loss = 620.271667 | root_mean_squared_error = 24.904161
Epoch 1/3 : Batch 3/48 | loss = 620.068665 | root_mean_squared_error = 24.900017
......
Epoch 1/3 : Batch 47/48 | loss = 618.046448 | root_mean_squared_error = 24.859678
Epoch 1/3 : Batch 48/48 | loss = 652.977051 | root_mean_squared_error = 25.552946
****** Epoch 1: RMSD(training) = 24.897174
Epoch 2/3 : Batch 1/48 | loss = 607.372620 | root_mean_squared_error = 24.644049
Epoch 2/3 : Batch 2/48 | loss = 599.667786 | root_mean_squared_error = 24.487448
Epoch 2/3 : Batch 3/48 | loss = 621.368103 | root_mean_squared_error = 24.926300
......
Epoch 2/3 : Batch 47/48 | loss = 620.133667 | root_mean_squared_error = 24.901398
Epoch 2/3 : Batch 48/48 | loss = 639.971924 | root_mean_squared_error = 25.297264
****** Epoch 2: RMSD(training) = 24.897174
Epoch 3/3 : Batch 1/48 | loss = 651.519836 | root_mean_squared_error = 25.523636
Epoch 3/3 : Batch 2/48 | loss = 673.582581 | root_mean_squared_error = 25.952084
Epoch 3/3 : Batch 3/48 | loss = 613.930054 | root_mean_squared_error = 24.776562
......
Epoch 3/3 : Batch 47/48 | loss = 624.460327 | root_mean_squared_error = 24.988203
Epoch 3/3 : Batch 48/48 | loss = 629.544250 | root_mean_squared_error = 25.090448
****** Epoch 3: RMSD(training) = 24.897174
I do NOT think it is normal. Do I miss something?
UPDATE:
I find that all predictions are always zero after all epochs. This is the reason why all RMSDs are all the same because the predictions are all the same, i.e. 0. I checked the training y. It only contains just a few zeros. So it is not due to data imbalance.
So now I am thinking if it is because of the layers and activation that I am using.

Your RNN functions seems to be ok.
The speed of reduction in loss depends on optimizer and learning rate.
Any how you are using decay rate 0.9. try with bigger learning rate, any how it is going to decrease with 0.9 rate.
Try out other optimizers with different learning rates
Other optimizers available with keras: https://keras.io/optimizers/
Many times, some optimizers work well on some data sets while some may fails.

Have you tried changing activation function from relu to softmax?
Relu activation has the tendency to diverge. However, if initializing the weight with eigenmatrix may result in a better convergence.

Since you are using RNNs for regression problem (not for classification), you should use 'linear' activation at the last layer.
In your code,
model.add(TimeDistributed(Dense(output_dim=1, activation='relu'))) # sequence labeling
change to activation='linear' instead of 'relu'.
If it doesn't work, remove activation='relu' in second layer.
Also learning rate for rmsprop usually ranges from 0.1 to 0.0001.

Related

LSTM regression model flat prediction

This is a time series regression problem for the battery capacity as output and a single input variable as voltage; the relation is non-linear.
LSTM Model prediction of the test data always returns a semi-flat line, probably the mean of the output variable in the training data.
This is an example of predicted vs test set output values, with the following model parameters:
(Window size: 10, batch site: 256, LSTM nodes: 16)
Prediction of the test data
Data had been normalized, down-sampled to 1 sec and later to 3 sec, original sampling was 10 Hz.
I was suspecting the voltage fluctuation is the problem, but sampling at 3 seconds hadn't resulted into noticeable improvement.
Here are the data after being down-sampled to 3 seconds:
Normalized Training Data ; Y:SOC, X: Voltage
Normalized Test Data ; Y:SOC, X: Voltage
I've tried many changes in the model and learning parameters as follows, but still the behavior is the same.
That's why i think it's not a parameter tuning issue, rather the model is not learning at all.
LSTM layer: always single, followed by Dense with no options.
LSTM nodes: [4,8,16,32]
Epoch: : [16,32,64,128]
window size (input vector depth): [8,32,64,128]
Batch size: [32,64,128,256]
learning rate: [.0005,.0001,.001]
optimizer : ADAM, options:[ none, clipnorm=1, clipvalue=0.5]
Model specification Code:
backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16,input_shape=(win_sz, features_cnt) )) # stateless
model1.add(layers.Dense(1))
model1.summary()
Model training and validation Code:
n_epochs = 12
iterations = tr_samples_sh_cnt // batch_sz_tr
loss = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.optimizers.Adam(learning_rate = 0.001)
loss_history = []
#tf.function
def train_model_on_batch():
start = epoch * batch_sz_tr
X_batch = df_feat_tr_3D[start:start+batch_sz_tr, :, :]
y_batch = df_SOC_tr_2D[start:start+batch_sz_tr, :]
with tf.GradientTape() as tape:
current_loss = loss(model1(X_batch), y_batch)
gradients = tape.gradient(current_loss, model1.trainable_variables)
optimizer.apply_gradients(zip(gradients, model1.trainable_variables))
return current_loss
for epoch in range(n_epochs+1):
for iteration in range(iterations):
current_loss = train_model_on_batch()
if epoch % 1 == 0:
loss_history.append(current_loss.numpy())
print("{}. \t\tLoss: {}".format(
epoch, loss_history[-1]))
print('\nTraining complete.')
P_test = model1.predict(df_feat_test_3D)
After adding sigmoid activation function in both LSTM and Dense layers, a very small change observed, but far from reasonable fit.
Prediction of the test data after adding activation function
The problem was the activation function as #Dr. Snoopy recommended

Horrible forecasting KPI scores (RMSE, MAE and MAPE) using RNN with three flavors ("LSTM", "GRU" and "RNN")

My project is trying to forecast COVID 19 total confirmed case for 2021. This is the overlook of the confirm case data, which I use to train my RNN model.
enter image description here
The confirmed number though doesn't show any repeating pattern. But there has been research studies using RNN and LSTM on the same data (in fact, I use the same data source as them). This is the research study I drew inspiration from:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8523546/
This is the result I got:
beginning the training of the LSTM RNN:
Epoch 99: 100%|██████████| 29/29 [00:00<00:00, 33.10it/s, loss=0.115, v_num=logs, train_loss=0.0933, val_loss=0.466]
training of the LSTM RNN completed: 105.23 sec
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 6.48it/s]
LSTM :
MAPE : 199.3029
RMSPE : 0.6659
RMSE : 0.6763
-R squared : 15366746993394475597824.0000
se : 0.0871
beginning the training of the GRU RNN:
Epoch 99: 100%|██████████| 29/29 [00:00<00:00, 37.86it/s, loss=0.11, v_num=logs, train_loss=0.0911, val_loss=0.474]
training of the GRU RNN completed: 90.99 sec
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 3.91it/s]
GRU :
MAPE : 209.6602
RMSPE : 0.6771
RMSE : 0.6877
-R squared : 467975998695823638528.0000
se : 0.0871
beginning the training of the Vanilla RNN:
Epoch 99: 100%|██████████| 29/29 [00:00<00:00, 43.54it/s, loss=0.114, v_num=logs, train_loss=0.109, val_loss=0.461]
training of the Vanilla RNN completed: 79.32 sec
Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00, 6.08it/s]
Vanilla :
MAPE : 200.5710
RMSPE : 0.6673
RMSE : 0.6778
-R squared : 115942252717688101559392358367232.0000
se : 0.0871
Also, here is my prediction plot. Three of the plot looks exactly the same, different MAPE score:
enter image description here
For the package i use, i use darts (u8darts[all] when importing the packages).
for methodology, I set the parameters for my model as this. I used this article by Heiko Oinen for methodology (detailed code is below the page):
https://medium.com/towards-data-science/temporal-loops-intro-to-recurrent-neural-networks-for-time-series-forecasting-in-python-b0398963dc1f:
#Set up the models, run the models, plot and evaluate
EPOCH = 100
def run_RNN(flavor, ts, train, val):
#set the model up
model_RNN = RNNModel(
model = flavor,
model_name = flavor + str(" RNN"),
input_chunk_length = 12,
training_length = 20,
hidden_dim = 20,
batch_size = 16,
n_epochs = EPOCH,
dropout = 0,
optimizer_kwargs = {'lr': 1e-3},
log_tensorboard = True,
random_state = 42,
force_reset = True
)
if flavor == "RNN": flavor = "Vanilla"
#fit the model
fit_it(model_RNN, train, val, flavor)
#compute N predictions
pred = model_RNN.predict(n = FC_N,future_covariates = covariates)
#plot predictions vs actual
plot_fitted(pred, ts, flavor)
#print accuracy metrics
res_acc = accuracy_metrics(pred, ts)
I have even tries epoch to 300, but the train loss and loss during training didn't decrease further.
I haven't gotten experience in asking questions here, I'll try to articulate and provide more detail if you have questions. Thank you so much for your help!

Conv nets accuracy not changing while loss decreases

I'm training several CNNs to do image classification in TensorFlow. The training losses decrease normally. However the test accuracy never changed throughout the whole training procedure, plus the accuracy is very low (0.014) where the accuracy for randomly guessing would be 0.003 (There are around 300 classes). One thing I've noticed is that only those models that I applied batch norm to showed such a weird behavior. What can possibly be wrong to cause this issue? The training set has 80000 samples, in case you might figure this was caused by overfitting. Below is part of the code for evaluation:
Accuracy function:
correct_prediction = tf.equal(tf.argmax(Model(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
the test_image is a batch with only one sample in it while the test_image_label is a scalar.
Session:
with tf.Session() as sess:
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord, start=True)
print('variables initialized')
step = 0
for epoch in range(epochs):
sess.run(enqueue_train)
print('epoch: %d' %epoch)
if epoch % 5 == 0:
save_path = saver.save(sess, savedir + "/Model")
for batch in range(num_batch):
if step % 400 == 0:
summary_str = cost_summary.eval(feed_dict={phase: True})
file_writer.add_summary(summary_str, step)
else:
sess.run(train_step, feed_dict={phase: True})
step += 1
sess.run(train_close)
sess.run(enqueue_test)
accuracy_vector = []
for num in range(len(testnames)):
accuracy_vector.append(sess.run(accuracy, feed_dict={phase: False}))
mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
print("test accuracy %g"%mean_accuracy)
sess.run(test_close)
save_path = saver.save(sess, savedir + "/Model_final")
coord.request_stop()
coord.join(threads)
file_writer.close()
The phase above is to indicate if it is training or testing for the batch norm layer.
Note that I tried to calculate the accuracy with the training set, which led to the minimal loss. However it gives the same poor accuracy. Please help me, I really appreciate it!

Tensorflow: loss becomes 'NaN'

I was doing CIFAR-10 training on CPU with Tensorflow. During the first few rounds, the loss seemed alright. But after the step 10210 the loss varies and ends up becoming NaN.
My network model the CIFAR-10 CNN model from their website. Here is my setting,
image_size = 32
num_channels = 3
num_classes = 10
num_batches_to_run = 50000
batch_size = 128
eval_batch_size = 64
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.1
num_epochs_per_decay = 350.0
moving_average_decay = 0.9999
and the result is shown as below.
2017-05-12 21:53:05.125242: step 10210, loss = 4.99 (124.9 examples/sec; 1.025 sec/batch)
2017-05-12 21:53:13.960001: step 10220, loss = 7.55 (139.5 examples/sec; 0.918 sec/batch)
2017-05-12 21:53:23.491228: step 10230, loss = 6.63 (149.5 examples/sec; 0.856 sec/batch)
2017-05-12 21:53:33.355805: step 10240, loss = 8.08 (113.3 examples/sec; 1.129 sec/batch)
2017-05-12 21:53:43.007007: step 10250, loss = 7.18 (126.7 examples/sec; 1.010 sec/batch)
2017-05-12 21:53:52.650118: step 10260, loss = 16.61 (138.0 examples/sec; 0.928 sec/batch)
2017-05-12 21:54:02.537279: step 10270, loss = 9.60 (137.6 examples/sec; 0.930 sec/batch)
2017-05-12 21:54:12.390117: step 10280, loss = 46526.25 (145.5 examples/sec; 0.880 sec/batch)
2017-05-12 21:54:22.060741: step 10290, loss = 133479743509972411931057146822656.00 (130.4 examples/sec; 0.982 sec/batch)
2017-05-12 21:54:31.691058: step 10300, loss = nan (115.8 examples/sec; 1.105 sec/batch)
Any idea about the NaN loss?
This happens a lot in practice when your learning rate is too high, I tend to start at 0.001 and move from there, 0.1 is on the very high side on most datasets, especially if you aren't dividing your loss by your batch size.
You can clip the gradients, if you are using Keras with Tensorflow backend, you could do as follows,
The parameters clipnorm and clipvalue can be used with all optimizers to control gradient clipping:
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)
or
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)
You might have the cross entropy loss and take log(0). Just add a small constant within the log.
(you might also want to look into gradient clipping)

Difference between Tensorflow and Scikitlearn log_loss function implementation

Hi I am trying to get into tensorflow and feeling a bit dumb.
Does log_loss in TF differ from sklearn's one?
Here are some lines from my code, how I am calculating:
from sklearn.metrics import log_loss
tmp = np.array(y_test)
y_test_t = np.array([tmp, -(tmp-1)]).T[0]
tf_log_loss = tf.losses.log_loss(predictions=tf.nn.softmax(logits), labels=tf_y)
with tf.Session() as sess:
# training
a = sess.run(tf.nn.softmax(logits), feed_dict={tf_x: xtest, keep_prob: 1.})
print(" sk.log_loss: ", log_loss(y_test, a,eps=1e-7 ))
print(" tf.log_loss: ", sess.run(tf_log_loss, feed_dict={tf_x: xtest, tf_y: y_test_t, keep_prob: 1.}))
Output I get
Epoch 7, Loss: 0.4875 Validation Accuracy: 0.818981
sk.log_loss: 1.76533018874
tf.log_loss: 0.396557
Epoch 8, Loss: 0.4850 Validation Accuracy: 0.820738
sk.log_loss: 1.77217639627
tf.log_loss: 0.393351
Epoch 9, Loss: 0.4835 Validation Accuracy: 0.823374
sk.log_loss: 1.78479079656
tf.log_loss: 0.390572
Seems like while tf.log_loss converges sk.log_loss diverges.
I had the same problem. After looking up the source code of tf.losses.log_loss, its key lines show wat is going on:
losses = - math_ops.multiply(labels, math_ops.log(predictions + epsilon))
- math_ops.multiply((1 - labels), math_ops.log(1 - predictions + epsilon))
It is binary log-loss (i.e. every class is considered non-exclusive) rather than multi-class log-loss.
As I worked with probabilities (rather than logits), I couldn't use tf.nn.softmax_cross_entropy_with_logits (though, I could have applied logarithm).
My solution was to implement log-loss by hand:
loss = tf.reduce_sum(tf.multiply(- labels, tf.log(probs))) / len(probs)
See also:
https://github.com/tensorflow/tensorflow/issues/2462
difference between tensorflow tf.nn.softmax and tf.nn.softmax_cross_entropy_with_logits

Resources