What is the learning rate status when applying keras model fit() iteratively?

I am applying keras model fitting iteratively (within a for loop) due to a large dataset. My goal is to split the dataset into 100 parts, read one part at a time and apply the fit() method to it.
My question: in each iteration, does the fit() method begin from the initial learning rate (lr=0.1) that I set during model compilation? Or does it remember the last updated learning rate and apply it directly on the next call of fit()?
My code sample is as follows:
# Define model
# Set the optimizer
sgd = SGD(lr=0.1, decay=1e-08, momentum=0.9, nesterov=False)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Fit model and train
for j in range(100):
    print('Data extracting from big matrix ...')
    X_train = HDF5Matrix(path_train, 'X', start=st, end=ed)
    Y_train = HDF5Matrix(path_train, 'y', start=st, end=ed)
    print('Fitting model ...')
    model.fit(X_train, Y_train, batch_size=100, shuffle='batch', nb_epoch=1,
              validation_data=(X_test, Y_test))

The updated learning rate is remembered in the optimizer object model.optimizer, which is just the sgd variable in your example.
In callbacks such as LearningRateScheduler, the learning rate variable model.optimizer.lr is updated (some lines are removed for clarity).
def on_epoch_begin(self, epoch, logs=None):
    lr = self.schedule(epoch)
    K.set_value(self.model.optimizer.lr, lr)
However, when decay is used (as in your example), the learning rate variable is not directly updated, but the variable model.optimizer.iterations is updated. This variable records how many batches have been used in model fitting, and the learning rate with decay is computed in SGD.get_updates() by:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations,
                                          K.dtype(self.decay))))
So in either case, as long as the model is not re-compiled, it will use the updated learning rate in the new fit() calls.
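To see what this means for the loop in the question, the decay formula can be evaluated by hand. This is only a sketch in plain Python, not Keras code: decayed_lr is a made-up helper that mirrors the formula above, and batches_per_fit is a hypothetical number, since the size of each data chunk is not given in the question.

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay: lr = lr0 / (1 + decay * iterations)."""
    return initial_lr / (1.0 + decay * iterations)


initial_lr, decay = 0.1, 1e-8   # values from the SGD(...) call in the question
batches_per_fit = 1000          # hypothetical: batches processed by each fit() call

iterations = 0                  # plays the role of model.optimizer.iterations
lrs = []
for call in range(100):
    # Learning rate in effect at the start of this fit() call
    lrs.append(decayed_lr(initial_lr, decay, iterations))
    # The counter keeps growing across fit() calls; it is not reset
    iterations += batches_per_fit

print(lrs[0], lrs[1], lrs[-1])
```

Because the iteration counter carries over between calls, each fit() starts from a slightly smaller learning rate than the previous one. With a decay as small as 1e-08 the change is tiny, but it is never reset to 0.1.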


Validation accuracy fluctuating while training accuracy increase?

I have a multiclass classification problem that depends on historical data. I am trying an LSTM with loss='sparse_categorical_crossentropy'. The training accuracy increases and the training loss decreases, as expected, but my test accuracy starts to fluctuate wildly.
What am I doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=False, random_state=seed)
cvscores = []
for train, test in kfold.split(X, y):
    regressor = Sequential()
    # Units = the number of LSTM units that we want to have in this first layer -> we want very high dimensionality, so we need a high number
    # return_sequences = True because we are adding another layer after this
    # input shape = the last two dimensions and the indicator
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
    # Extra LSTM layer
    regressor.add(LSTM(units=50, return_sequences=True))
    # 3rd
    regressor.add(LSTM(units=50, return_sequences=True))
    # output layer
    regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
    # Compile the RNN
    regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Set callback functions to early stop training and save the best model so far
    callbacks = [EarlyStopping(monitor='val_loss', patience=9),
                 ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
    history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
                            validation_data=(X[test], y[test]))
    # plot train and validation loss
    pyplot.title('model train vs validation loss')
    pyplot.legend(['train', 'validation'], loc='upper right')
    # evaluate the model
    scores = regressor.evaluate(X[test], y[test], verbose=0)
    print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
What you are describing here is overfitting. This means your model keeps learning about your training data and doesn't generalize; in other words, it is learning the exact features of your training set. This is one of the main problems you have to deal with in deep learning, and there is no single solution. You have to try out different architectures, different hyperparameters and so on.
You can start with a small model that underfits (that is, both training and validation accuracy are low) and keep increasing its size until it overfits. Then you can play around with the optimizer and other hyperparameters.
By smaller model I mean one with fewer hidden units or fewer layers.
You seem to have too many LSTM layers stacked on top of each other, which eventually leads to overfitting. You should probably decrease the number of layers.
Your model seems to be overfitting, since the training error keeps decreasing while the validation error does not. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch size; a larger batch gives less noisy gradient estimates, which can reduce the fluctuations in the loss.
You can also consider varying the learning rate.

How do loss functions know for which model to compute gradients in PyTorch?

I am unsure how PyTorch manages to link the loss function to the model I want it to be computed for. There is never an explicit reference between the loss and the model, such as the one between the model's parameters and the optimizer.
Say, for example, I want to train 2 networks on the same dataset, so I want to use a single pass through the dataset. How would PyTorch link the appropriate loss functions to the appropriate models? Here's code for reference:
import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import shap
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# ReLU/LogSoftmax layers are assumed here: NLLLoss expects log-probabilities as input
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))
model2 = nn.Sequential(nn.Linear(784, 128),
                       nn.ReLU(),
                       nn.Linear(128, 10),
                       nn.LogSoftmax(dim=1))
# Define the loss
criterion = nn.NLLLoss()
criterion2 = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)
optimizer2 = optim.SGD(model2.parameters(), lr=0.003)
epochs = 5
for e in range(epochs):
    running_loss = 0
    running_loss_2 = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)  # batch_size x total_pixels
        # Training pass: clear old gradients before each backward pass
        optimizer.zero_grad()
        optimizer2.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        output2 = model2(images)
        loss2 = criterion2(output2, labels)
        loss2.backward()
        optimizer2.step()
        running_loss += loss.item()
        running_loss_2 += loss2.item()
    print(f"Training loss 1: {running_loss/len(trainloader)}")
    print(f"Training loss 2: {running_loss_2/len(trainloader)}")
So once again, how does pytorch know how to compute the appropriate gradients for the appropriate models when loss.backward() and loss2.backward() are called?
Whenever you perform forward operations using one of your model parameters (or any torch.Tensor that has the attribute requires_grad == True), PyTorch builds a computational graph. When you operate on descendants in this graph, the graph is extended. In your case, you have an nn.Module called model which has some trainable model.parameters(), so PyTorch builds a graph from your model.parameters() all the way to the loss as you perform the forward operations. The graph is then traversed in reverse during the backward pass to propagate the gradients back to the parameters. For loss in your code above, the graph is something like
model.parameters() --> [intermediate variables in model] --> output --> loss
                                       ^                                 ^
                                       |                                 |
                                    images                            labels
When you call loss.backward(), PyTorch traverses this graph in reverse to reach all trainable parameters (only the model.parameters() in this case) and accumulates the gradient into param.grad for each of them. The optimizer then relies on this information gathered during the backward pass to update the parameters.
For loss2 the story is similar.
The official pytorch tutorials are a good resource for more in-depth information on this.
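To make the graph idea concrete, here is a toy scalar autograd in plain Python. It is only a sketch of the bookkeeping and not PyTorch's actual implementation: the Value class, its fields, and the two single-weight "models" w1 and w2 are all made up for illustration. Each result remembers which values it was computed from, so calling backward() on one loss can only reach the parameters that produced it.

```python
class Value:
    """A scalar that remembers the values it was computed from."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # edges of the computational graph
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Collect every node reachable from this one, parents first
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk the graph in reverse and accumulate gradients (chain rule)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


w1 = Value(2.0)    # "parameter" of the first model
w2 = Value(-3.0)   # "parameter" of the second model
x = Value(5.0)     # shared input

loss1 = w1 * x     # graph: w1, x -> loss1
loss2 = w2 * x     # graph: w2, x -> loss2

loss1.backward()
grads_after_first = (w1.grad, w2.grad)   # w2 is not in loss1's graph

loss2.backward()
grads_after_second = (w1.grad, w2.grad)  # now w2 has a gradient too
```

After the first backward() only w1 has a non-zero gradient, because w2 never took part in computing loss1. The same reasoning explains why loss.backward() and loss2.backward() in the question each touch only their own model.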

How to check the predicted output during fitting of the model in Keras?

I am new to Keras and I have learned how to fit and evaluate a model.
After evaluating the model, one can see the actual predictions made by the model.
I am wondering whether it is also possible to see the predictions during fitting in Keras. So far I can't find any code doing this.
Since this question doesn't specify "epochs", and since using callbacks may represent extra computation, I don't think it's exactly a duplicate.
With tensorflow, you can use a custom training loop with eager execution turned on. A simple tutorial for creating a custom training loop: https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough
Basically you will:
# transform your data into a Dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(some_buffer).batch(batchSize)
# the above is buggy in some versions regarding shuffling, you may need to shuffle
# again between each epoch
# create an optimizer
optimizer = tf.keras.optimizers.Adam()
# create an epoch loop:
for e in range(epochs):
    # create a batch loop
    for i, (x, y_true) in enumerate(dataset):
        # create a tape to record actions
        with tf.GradientTape() as tape:
            # take the model's predictions
            y_pred = model(x)
            # calculate loss
            loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        # calculate gradients
        gradients = tape.gradient(loss, model.trainable_weights)
        # apply gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
You can use the y_pred variable for anything inside the loop, including getting its NumPy value with numpy_pred = y_pred.numpy().
The tutorial gives some more details about metrics and validation loop.

Using cross-validation to select optimal threshold: binary classification in Keras

I have a Keras model that takes a transformed vector x as input and outputs probabilities that each input value is 1.
I would like to take the predictions from this model and find an optimal threshold. That is, maybe the cutoff value for "this value is 1" should be 0.23, or maybe it should be 0.78, or something else. I know cross-validation is a good tool for this.
My question is how to work this in to training. For example, say I have the following model (taken from here):
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
I train the model and get some output probabilities:
model.fit(train_x, train_y)
predictions = model.predict(train_x)
Now I want to learn the threshold for the value of each entry in predictions that would give the best accuracy, for example. How can I learn this parameter, instead of just choosing one after training is complete?
EDIT: For example, say I have this:
def fake_model(self):
    # Model that returns probability that each of 10 values is 1
    a_input = Input(shape=(2, 10), name='a_input')
    dense_1 = Dense(5)(a_input)
    outputs = Dense(10, activation='sigmoid')(dense_1)
    def hamming_loss(y_true, y_pred):
        return tf.to_float(tf.reduce_sum(abs(y_true - y_pred))) / tf.to_float(tf.size(y_pred))
    fakemodel = Model(a_input, outputs)
    # Use the outputs of the model; find the threshold value that minimizes the Hamming loss
    # Record the final confusion matrix.
How can I train a model like this end-to-end?
If an ROC curve isn't what you are looking for, you could create a custom Keras Layer that takes in the outputs of your original model and tries to learn an optimal threshold given the true outputs and the predicted probabilities.
This layer subtracts the threshold from the predicted probability, multiplies by a relatively large constant (in this case 100) and then applies the sigmoid function.
[Figure: the function plotted at three different thresholds (.3, .5, .7)]
Below is the code defining this layer and creating a model composed solely of it. After fitting your original model, feed its output probabilities to this model and train it to find an optimal threshold.
class ThresholdLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ThresholdLayer, self).__init__(**kwargs)
    def build(self, input_shape):
        # A single trainable scalar: the threshold itself
        self.kernel = self.add_weight(name="threshold", shape=(1,), initializer="uniform",
                                      trainable=True)
        super(ThresholdLayer, self).build(input_shape)
    def call(self, x):
        return keras.backend.sigmoid(100*(x-self.kernel))
    def compute_output_shape(self, input_shape):
        return input_shape
# Assumption: one predicted probability per sample, matching the Dense(1) output above
input_layer = keras.layers.Input(shape=(1,))
out = ThresholdLayer()(input_layer)
threshold_model = keras.Model(inputs=input_layer, outputs=out)
threshold_model.compile(optimizer="sgd", loss="mse")
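As a quick sanity check of why sigmoid(100 * (x - threshold)) behaves like a nearly hard cutoff, the function can be evaluated by hand. This is a plain-Python sketch, not part of the Keras layer: soft_threshold and its steepness argument are names made up here, with steepness=100 taken from the constant used in the layer.

```python
import math


def soft_threshold(p, threshold, steepness=100.0):
    """Smooth, differentiable stand-in for the step function p >= threshold."""
    return 1.0 / (1.0 + math.exp(-steepness * (p - threshold)))


# Evaluate the soft step at the three thresholds mentioned in the answer
for threshold in (0.3, 0.5, 0.7):
    row = [round(soft_threshold(p / 10, threshold), 3) for p in range(0, 11)]
    print(threshold, row)
```

Probabilities a little below the threshold map to values near 0 and those a little above map to values near 1, while the function stays differentiable, which is what lets gradient descent move the threshold.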
First, here's a direct answer to your question. You're thinking of an ROC curve. For example, assuming some data X_test and y_test:
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
y_pred = model.predict(X_test).ravel()
# thresholds[i] is the cutoff that produces the point (fpr[i], tpr[i])
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
my_auc = auc(fpr, tpr)
# Full ROC curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
# Close-up of the top-left corner
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve close-up')
plt.legend(loc='best')
plt.show()
Second, regarding my comment, here's an example of one attempt. It can be done in Keras, or TF, or anywhere, although he does it with XGBoost.
Hope that helps!
The first idea I have is kind of brute force.
You compute a metric on a test set separately for each of your inputs and its corresponding predicted output.
Then, for each of them, iterate over values for the threshold between 0 and 1 until the metric is optimized for the given input/prediction pair.
For many of the popular metrics of classification quality (accuracy, precision, recall, etc) you just cannot learn the optimal threshold while training your neural network.
This is because these metrics are not differentiable, so gradient updates cannot set the threshold (or any other parameter) correctly. You are therefore forced to optimize a nice smooth loss (like negative log-likelihood) when training most of the parameters, and then tune the threshold by grid search.
Of course, you can come up with a smoothed version of your metric and optimize it (and sometimes people do this). But in most cases it is OK to optimize log-likelihood, get a nice probabilistic classifier, and tune the thresholds on top of it. E.g. if you want to optimize accuracy, then you should first estimate class probabilities as accurately as possible (to get close to the perfect Bayes classifier), and then just choose their argmax.
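A grid search over thresholds can be as simple as the sketch below. It is plain Python with made-up validation data: val_probs stands in for the output of model.predict on a held-out set, val_labels for the true labels, and the helper names accuracy_at and best_threshold are hypothetical. Accuracy is used as the metric, but any other metric could be swapped in.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def best_threshold(probs, labels, candidates):
    """Return the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))


# Made-up held-out predictions and labels, for illustration only
val_probs = [0.05, 0.20, 0.30, 0.35, 0.60, 0.90]
val_labels = [0, 0, 1, 1, 1, 1]

candidates = [i / 100 for i in range(1, 100)]   # 0.01, 0.02, ..., 0.99
threshold = best_threshold(val_probs, val_labels, candidates)

print(threshold, accuracy_at(threshold, val_probs, val_labels))
print(accuracy_at(0.5, val_probs, val_labels))  # the default cutoff, for comparison
```

To keep the estimate honest, the threshold should be chosen on data the model was not trained on (for example inside each cross-validation fold) and then checked on a separate test set.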

How to Use Keras Model to Predict Output After Unpacking the Model

I can unpack my RNN model onto my website, but I am having trouble getting it to predict a numpy array of predictions using a list as input. The list contains only one string, called text, but from what I've gathered it needs to be a list for preprocessing. I am running into this error:
ValueError: Error when checking : expected embedding_1_input to have shape (None, 72)
but got array with shape (1, 690)
Here is how I am currently preprocessing and predicting with the model:
tokenizer = Tokenizer(num_words = 5000, split=' ')
X = tokenizer.texts_to_sequences([text])
X = pad_sequences(X)
prediction = loadedModel.predict(X)
And this is how I trained my model:
HIDDEN_LAYER_SIZE = 195 # Number of nodes in a hidden layer.
TOP_WORDS = 5000 # Most-used words in the dataset.
MAX_REVIEW_LENGTH = 500 # Char length of each text being sent in (necessary).
EMBEDDING_VECTOR_LENGTH = 128 # The Embedding layer will have 128-length vectors to
# represent each word.
BATCH_SIZE = 32 # Takes 32 sentences at a time and continually retrains RNN.
NUMBER_OF_EPOCHS = 10 # Fits RNN to more accurately guess the data's political bias.
DROPOUT = 0.2 # Helps slow down overfitting of data (slower convergence rate)
# Define the model
model = Sequential()
model.add(Dense(2, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Fit the model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
How can I fix my preprocessing code in the codebox starting with "tokenizer" to stop getting the ValueError?
Thank you, and I can definitely provide more code or expand upon the purpose of the project.
So there are two problems here:
Set maxlen in pad_sequences: it seems that all of your training sequences were padded to length 72, so you need to change the following line:
X = pad_sequences(X, maxlen=72)
Use the training Tokenizer: this is a subtle problem. You are creating and fitting a totally new Tokenizer, so it could differ from the one you used for training. This can cause problems, because the same word could get a different index, and that will make your model perform poorly. Try to pickle your training Tokenizer and load it during deployment, so that sentences are transformed into the data points your model expects.
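Both fixes can be illustrated without Keras. The sketch below is a plain-Python stand-in, not the Keras API: pad_pre is a made-up helper that mimics the default behaviour of pad_sequences (pad on the left, truncate from the front), and a small hypothetical word_index dict plays the role of the fitted Tokenizer that gets pickled at training time and reloaded at prediction time.

```python
import pickle


def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


def encode(text, index):
    """Map known words to their training indices; unknown words are dropped."""
    return [index[word] for word in text.lower().split() if word in index]


# Training time: save the vocabulary that the model was trained with.
word_index = {"the": 1, "senate": 2, "voted": 3}   # hypothetical vocabulary
saved = pickle.dumps(word_index)                   # in practice, write this to a file

# Deployment time: reload the same vocabulary instead of building a new one.
restored = pickle.loads(saved)
x = pad_pre(encode("The senate voted", restored), maxlen=72)

# A 690-token input is cut down to the length the model expects.
long_input = pad_pre(list(range(690)), maxlen=72)
print(len(x), len(long_input))
```

With the real library the same idea applies: pickle the Tokenizer object that was fitted on the training texts, load it on the website, and always pass maxlen=72 to pad_sequences so every input has the shape the embedding layer was built for.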
