Training loss increases over time despite a tiny learning rate - machine-learning

I'm playing with CIFAR-10 dataset using ResNet-50 on Keras with Tensorflow backend, but I ran into a very strange training pattern, where the model loss decreased first, and then started to increase until it plateaued/stuck at a single value due to almost 0 learning rate. Correspondingly, the model accuracy increased first, and then started to decrease until it plateaued at 10% (aka random guess). I wonder what is going wrong?
Typically this U-shaped pattern happens with a learning rate that is too large (like this post), but that is not the case here. This pattern also doesn't look like classic over-fitting, as both the training and validation loss increase over time. In an answer to the above-linked post, someone mentioned that with the Adam optimizer the loss may explode under a small learning rate once a local minimum is overshot. I'm not sure I follow that argument, and in any case I'm using SGD with weight decay instead of Adam.
For the training setup, I used ResNet50 with random initialization, an SGD optimizer with 0.9 momentum and a weight decay of 0.0001 using decoupled weight decay regularization, a batch size of 64, and an initial learning_rate of 0.01 that is halved after every 10 epochs of non-decreasing validation loss.
import tensorflow as tf
import tensorflow_addons as tfa

# ResNet50 backbone trained from scratch on 32x32 CIFAR-10 images
base_model = tf.keras.applications.ResNet50(include_top=False,
                                            weights=None, pooling='avg',
                                            input_shape=(32, 32, 3))
prediction_layer = tf.keras.layers.Dense(10)  # outputs logits, no softmax
model = tf.keras.Sequential([base_model,
                             prediction_layer])
# SGD with decoupled weight decay (SGDW)
SGDW = tfa.optimizers.extend_with_decoupled_weight_decay(tf.keras.optimizers.SGD)
optimizer = SGDW(weight_decay=0.0001, learning_rate=0.01, momentum=0.9)
# halve the learning rate after 10 epochs without val_loss improvement
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(train_batches, epochs=250,
          validation_data=validation_batches,
          callbacks=[reduce_lr])
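To see how quickly the schedule described above can push the learning rate toward zero, the halving rule can be worked out by hand. The sketch below is plain arithmetic, not Keras code: lr_after_reductions is a hypothetical helper, and it assumes the callback runs with no cooldown and a minimum learning rate of 0.

```python
def lr_after_reductions(initial_lr, factor, n_reductions, min_lr=0.0):
    """Learning rate after the plateau callback has fired n_reductions times."""
    return max(initial_lr * factor ** n_reductions, min_lr)

# Values from the question: initial rate 0.01, factor 0.5.
for k in (0, 5, 10, 20):
    print(k, lr_after_reductions(0.01, 0.5, k))
```

After 10 reductions the rate is already about 1e-5, and with patience=10 a 250-epoch run can fit more than 20 reductions if the validation loss never improves.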

Related

Why is train loss and validate loss both in a straight line?

I am using Conv-LSTM for training, and the input features have been proven to be effective in some papers, and I can use CNN+FC networks to extract features and classify them. I change the task to regression here, and I can also achieve model convergence with Conv+FC. Later, I tried to use Conv-LSTM for processing to consider the timing characteristics of the corresponding data. Specifically: return the output of the current moment based on multiple historical inputs and the input of the current moment. The Conv-LSTM code I used: https://github.com/ndrplz/ConvLSTM_pytorch. My Loss is L1-Loss and optimizer is Adam.
A loss curve is below:
Example loss value:
Epoch:1/500 AVG Training Loss:16.40108 AVG Valid Loss:22.40100
Best validation loss: 22.400997797648113
Saving best model for epoch 1
Epoch:2/500 AVG Training Loss:16.42522 AVG Valid Loss:22.40100
Epoch:3/500 AVG Training Loss:16.40599 AVG Valid Loss:22.40100
Epoch:4/500 AVG Training Loss:16.40175 AVG Valid Loss:22.40100
Epoch:5/500 AVG Training Loss:16.42198 AVG Valid Loss:22.40101
Epoch:6/500 AVG Training Loss:16.41907 AVG Valid Loss:22.40101
Epoch:7/500 AVG Training Loss:16.42531 AVG Valid Loss:22.40101
My attempt:
Reduced the dataset to only a few samples and verified that the network can overfit them, so the network code should be fine.
Adjusted the learning rate: I tried 1e-3, 1e-4, 1e-5 and 1e-6, but the loss curve stayed as flat as before, and even the loss values barely changed.
Replaced the optimizer with SGD; the training result showed the same problem.
My data is wireless (I-Q) data, which is neither a CV nor an NLP input type, so I am asking here about deep learning training in general.
After some testing, I finally found that my initial learning rate was too small. In my previous single-point training, a learning rate of 1e-3 was large enough, so I assumed the same here and only tuned downward from 1e-3. In fact, 1e-3 was too small, and the network did not learn at all. After I raised the learning rate to 1e-2, both the train loss and the validation loss dropped rapidly (still with the Adam optimizer). When tuning the learning rate, start from 1 and work downward instead of relying on a preconceived range.
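The advice to sweep from large to small can be illustrated on a toy problem. This is a minimal sketch under stated assumptions: it runs plain gradient descent on a one-dimensional quadratic, not the Conv-LSTM from the question, and final_loss and the candidate list are hypothetical names chosen for illustration.

```python
def final_loss(lr, steps=50, w0=0.0, target=3.0):
    """Run plain gradient descent on (w - target)**2 and return the last loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return (w - target) ** 2

# Sweep from large to small, as suggested above.
candidates = [1.0, 1e-1, 1e-2, 1e-3, 1e-4]
results = {lr: final_loss(lr) for lr in candidates}
best_lr = min(results, key=results.get)
for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.4g}")
```

On this toy problem the smallest rates leave the loss almost at its starting value of 9, which looks like the flat curve in the question, while the largest rate oscillates without improving.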

CatBoost precision imbalanced classes

I use a CatBoostClassifier and my classes are highly imbalanced. I applied a scale_pos_weight parameter to account for that. While training with an evaluation dataset (test) CatBoost shows a high precision on test. However, when I make predictions on test using a predict method, I only get a low precision score (calculated using the sklearn.metrics).
I think this might be related to class weights that I applied. However, I don't quite understand how a precision score is affected by this.
import numpy as np
from catboost import CatBoostClassifier
from frozendict import frozendict

params = frozendict({
    'task_type': 'CPU',
    'loss_function': 'Logloss',
    'eval_metric': 'F1',
    'custom_metric': ['F1', 'Precision', 'Recall'],
    'iterations': 100,
    'random_seed': 20190128,
    'scale_pos_weight': 56.88657244809081,
    'learning_rate': 0.5412829495147387,
    'depth': 7,
    'l2_leaf_reg': 9.526905230698302
})
model = CatBoostClassifier(**params)
model.fit(
    X_train, y_train,
    # column indices of object-typed (categorical) features
    cat_features=np.where(X_train.dtypes == object)[0],
    eval_set=(X_test, y_test),
    verbose=False,
    plot=True
)
model.get_best_score()
{'learn': {'Recall': 0.9243007537531925,
'Logloss': 0.15892360013680026,
'F1': 0.9416723809244181,
'Precision': 0.9640191600545249},
'validation_0': {'Recall': 0.914252301192093,
'Logloss': 0.1714387314107052,
'F1': 0.9357892623978286,
'Precision': 0.9642642597943112}}
y_test_pred = model.predict(data=X_test)
from sklearn.metrics import balanced_accuracy_score, recall_score, precision_score, f1_score
print('Balanced accuracy: {:.2f}'.format(balanced_accuracy_score(y_test, y_test_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_test_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_test_pred)))
print('F1: {:.2f}'.format(f1_score(y_test, y_test_pred)))
Balanced accuracy: 0.94
Precision: 0.29
Recall: 0.91
F1: 0.44
I expected to get the same precision that CatBoost shows during training, but I don't. What am I doing wrong?
By default use_weights is set to True, which means the class weights are applied to the evaluation metrics, e.g. Precision:use_weights=True.
To make your own precision calculation match CatBoost's, change the metric to Precision:use_weights=False.
Also, get_best_score gives the highest score over the iterations, so you need to specify which iteration is used in prediction. You can set use_best_model=True in model.fit to automatically choose the iteration.
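The effect of weighting on precision can be checked with a little arithmetic. The sketch below assumes that scale_pos_weight acts as a per-sample weight on the positive class inside the metric; the precision helper and the counts 29 and 71 are hypothetical, chosen only so that the unweighted value matches the 0.29 reported by sklearn.

```python
def precision(tp, fp, pos_weight=1.0):
    """Precision where every positive sample counts pos_weight times.

    True positives are positive samples, so they are scaled; false
    positives are negative samples and keep weight 1.
    """
    return pos_weight * tp / (pos_weight * tp + fp)

# Hypothetical counts chosen to give the 0.29 precision reported by sklearn.
tp, fp = 29, 71
print(round(precision(tp, fp), 2))                     # unweighted
print(round(precision(tp, fp, 56.88657244809081), 2))  # weighted
```

Under that assumption the weighted value comes out at about 0.96, close to the 0.964 that CatBoost reported during training, while the unweighted value stays at 0.29.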
The predict function uses a standard threshold of 0.5 to convert the predicted probabilities into a binary value. When you are dealing with an imbalanced problem, a threshold of 0.5 is not always the best value, which is why you are getting poor precision on the test set.
To find a better threshold, CatBoost has some methods that help, such as get_roc_curve, get_fpr_curve and get_fnr_curve. These three methods help you visualize how the true positive, false positive and false negative rates change with the prediction threshold.
Besides these visualization methods, CatBoost has a method called select_threshold, which gives you the best threshold by optimizing one of the curves.
You can check this in their documentation.
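The idea of moving the threshold can be sketched without any library. This is not CatBoost's select_threshold: precision_recall_at and lowest_threshold_for_precision are hypothetical helpers, and the labels and probabilities are made up. With a real model, the probabilities would come from something like model.predict_proba(X_test)[:, 1].

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def lowest_threshold_for_precision(y_true, y_prob, min_precision):
    """Smallest candidate threshold whose precision reaches min_precision, or None."""
    for threshold in sorted(set(y_prob)):
        precision, _ = precision_recall_at(y_true, y_prob, threshold)
        if precision >= min_precision:
            return threshold
    return None

# Made-up labels and probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9, 0.95]
print(precision_recall_at(y_true, y_prob, 0.5))
print(lowest_threshold_for_precision(y_true, y_prob, 0.75))
```

In this made-up data the default threshold of 0.5 gives full recall but low precision, and raising the threshold trades some of that margin for fewer false positives.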
In addition to setting use_best_model=True, ensure that the class balance in both datasets is the same, or use balanced accuracy metrics to account for different class balance.
If you've done both of these and you still see much worse accuracy metrics on the test set than on the train set, it is a sign of overfitting. I'd recommend you take advantage of CatBoost's overfitting detector. The most common first method is to set early_stopping_rounds to an integer like 10, which will stop training once an improvement in the selected loss function isn't achieved after that number of training rounds (see early_stopping_rounds documentation).
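The stopping rule described above can be sketched in a few lines. This is an illustration of the general patience idea, not CatBoost's implementation; stopping_iteration is a hypothetical helper and the loss values are made up.

```python
def stopping_iteration(eval_losses, patience):
    """Index of the iteration at which training would stop, or None if it never does.

    Training stops once `patience` consecutive iterations pass without
    improving on the best loss seen so far.
    """
    best = float("inf")
    since_best = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return None

losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]  # made-up values
print(stopping_iteration(losses, patience=3))
```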

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and testing set. The training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
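The shapes above follow from simple division, and the row order matters once there is more than one sample. The sketch below is not the stateful_cut() function from the linked post; stateful_cut_order is a hypothetical helper that only computes how many cuts there are and in which order (sample, cut) pairs would need to be stacked so that each batch continues the previous one.

```python
def stateful_cut_order(n_samples, timesteps, t_after_cut):
    """Return (nb_cuts, order), where order lists (sample, cut) pairs.

    Pairs are grouped so that each consecutive block of n_samples rows
    holds the same cut index for every sample. With batch_size equal to
    n_samples, row i of one batch then continues row i of the previous one.
    """
    if timesteps % t_after_cut != 0:
        raise ValueError("timesteps must be divisible by t_after_cut")
    nb_cuts = timesteps // t_after_cut
    order = [(s, c) for c in range(nb_cuts) for s in range(n_samples)]
    return nb_cuts, order

# Shapes from the question: one sample of 61320 hourly steps, cut into days.
nb_cuts, order = stateful_cut_order(n_samples=1, timesteps=61320, t_after_cut=24)
print(nb_cuts, len(order))  # number of cuts and rows after cutting
```

With a single sample the order is trivial, which is why a batch_size of 1 works here. With several samples, the interleaving is what lets a stateful layer carry each sample's state from one batch to the next.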
Next, I build and train the LSTM model as follows:
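from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
# apply the same Dense layer at every time step
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)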
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use a stateful LSTM as described here, but add a reset of the states after each epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up training by a factor of 4!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that the GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is built and trained as follows:
from keras import layers
from keras.callbacks import Callback, EarlyStopping
from keras.models import Sequential
from keras.optimizers import RMSprop

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            super().__init__()
            self.counter = 0

        def on_batch_begin(self, batch, logs=None):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs=None):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
# a stateful model needs the same batch size here as in batch_input_shape
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False, validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.

Training accuracy on SGD

How do you compute the training accuracy for SGD? Do you compute it on the batch you just trained on, or on the entire dataset (for each batch optimization iteration)?
I tried computing the training accuracy for each iteration using the batch data I trained my network with. And it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration gave me 100%). Is this because I am computing the accuracy on the same batch data I trained it with for that iteration? Or is my model overfitting that it gave me 100% instantly, but the validation accuracy is low? (this is the main question here, if this is acceptable, or there is something wrong with the model)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during training is correct. If the accuracy is always a multiple of 10%, then most likely it is because your batch size is 10. For example, if 8 of the 10 training outputs match the labels, then your training accuracy is 80%. If the training accuracy goes up and down, there are two main possibilities:
1. If you print the accuracy multiple times within one epoch, this is normal, especially at the early stage of training, because the model is predicting on different data samples each time;
2. If you print the accuracy once per epoch and see the training accuracy go up and down during the later stage of training, your learning rate is too big. You need to decrease it over time during training.
If these do not answer your question, please provide more details so that we can help.
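The link between batch size and the granularity of batch accuracy can be made concrete. In this small sketch, possible_accuracies is a hypothetical helper that lists every accuracy value a single batch can produce. It is shown for a batch of 10, as assumed in the answer, and for the batch_size of 64 listed in the question.

```python
def possible_accuracies(batch_size):
    """All accuracy values (in percent) a single batch of this size can produce."""
    return [100.0 * correct / batch_size for correct in range(batch_size + 1)]

print(possible_accuracies(10))      # steps of 10%
print(possible_accuracies(64)[:4])  # steps of 1.5625%
```

If the observed values really are always multiples of 10%, it is worth checking how many samples actually enter the accuracy computation at each iteration.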

LSTM training pattern

I'm fairly new to NNs and I'm doing my own "Hello World" with LSTMs instead of copying something. I have chosen a simple logic as follows:
Input with 3 timesteps. First one is either 1 or 0, the other 2 are random numbers. Expected output is same as the first timestep of input. The data feed looks like:
_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this.
This simple(?) logic can be trained for 100% accuracy. I ran many tests and the most efficient model I found was 3 LSTM layers, each of them with 15 hidden units. This returned 100% accuracy after 22 epochs.
However, I noticed something that I struggle to understand: in the first 12 epochs the model makes no progress at all as measured by accuracy (acc. stays at 0.5) and only marginal progress as measured by categorical crossentropy (it goes from 0.69 to 0.65). Then from epoch 12 through epoch 22 it trains very fast to accuracy 1.0. The question is: why does training happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?
Here is my entire code:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, LSTM
from keras.models import Model
import helper
from keras.utils.np_utils import to_categorical
x_,y_ = helper.rnn_csv_toXY("LSTM_hello.csv",3,"target")
y_binary = to_categorical(y_)
model = Sequential()
model.add(LSTM(15, input_shape=(3,1),return_sequences=True))
model.add(LSTM(15,return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['acc'])
model.fit(x_, y_binary, epochs=100)
It is hard to give a specific answer to this as it depends on many factors. One major factor that comes into play when training neural networks is the learning rate of the optimizer you choose.
In your code you have no specific learning rate set. The default learning rate of Adam in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step, defined as
lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t)) .
The values of beta_2 and beta_1 are commonly left at their default values of 0.999 and 0.9 respectively. If you plot this learning rate you get a picture of something like this:
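The formula above can also be evaluated directly instead of plotted. This is a minimal sketch using the default values quoted in the answer; adam_lr_t is a hypothetical helper name, and t is assumed to start at 1.

```python
import math

def adam_lr_t(t, lr=0.001, beta_1=0.9, beta_2=0.999):
    """Effective step size lr_t at time step t (t starts at 1)."""
    return lr * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)

for t in (1, 10, 100, 1000, 10000):
    print(t, adam_lr_t(t))
```

As t grows, both correction terms approach 1, so lr_t approaches the base rate lr.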
It might just be that this is the sweet spot for updating your weights to find a local (possibly a global) minimum. A learning rate that is too high often makes no progress, as it just 'skips' over the regions that would lower your error, whereas lower learning rates take smaller steps in your error landscape and let you find regions where the error is lower.
I suggest that you use an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), and test this hypothesis by using a lower learning rate.
