TensorFlow: save the model with the smallest validation error

I ran a training job with TensorFlow and got the following curve for the loss on the validation set. The net starts to overfit after the 6000th iteration, so I'd like to get the model from before the overfitting starts.
My training code is something like below:
train_step = ......
summary = tf.scalar_summary(l1_loss.op.name, l1_loss)
summary_writer = tf.train.SummaryWriter("checkpoint", sess.graph)
saver = tf.train.Saver()

for i in xrange(20000):
    batch = get_next_batch(batch_size)
    sess.run(train_step, feed_dict={x: batch.x, y: batch.y})
    if (i + 1) % 100 == 0:
        saver.save(sess, "checkpoint/net", global_step=i+1)
        summary_str = sess.run(summary, feed_dict=validation_feed_dict)
        summary_writer.add_summary(summary_str, i+1)
        summary_writer.flush()
After training finishes, only five checkpoints are saved (19600, 19700, 19800, 19900, 20000). Is there any way to make TensorFlow save checkpoints according to the validation error?
P.S. I know that tf.train.Saver has a max_to_keep argument, which in principle could keep all the checkpoints. But that's not what I want (unless it's the only option). I want the saver to keep only the checkpoint with the smallest validation loss so far. Is that possible?

You need to calculate the classification accuracy (or loss) on the validation set yourself, keep track of the best value seen so far, and write a checkpoint only when the validation metric improves.
If the data set and/or model is large, you may have to split the validation set into batches to fit the computation in memory.
This tutorial shows exactly how to do what you want:
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/04_Save_Restore.ipynb
It is also available as a short video:
https://www.youtube.com/watch?v=Lx8JUJROkh0
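A minimal sketch of this pattern, re-using the TF1 Saver and the names from your question (best_loss is a new placeholder variable; this is an illustration, not the tutorial's exact code):
saver = tf.train.Saver(max_to_keep=1)  # keep only the single best checkpoint on disk
best_loss = float('inf')               # best validation loss seen so far

for i in xrange(20000):
    batch = get_next_batch(batch_size)
    sess.run(train_step, feed_dict={x: batch.x, y: batch.y})
    if (i + 1) % 100 == 0:
        # evaluate the validation loss explicitly
        val_loss = sess.run(l1_loss, feed_dict=validation_feed_dict)
        if val_loss < best_loss:
            best_loss = val_loss
            # write a checkpoint only when the validation loss improves
            saver.save(sess, "checkpoint/best_net", global_step=i + 1)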

This can be done with a ModelCheckpoint callback. In TensorFlow 1 (using the standalone Keras package):
# you should import other functions/libs as needed to build the model
from keras.callbacks import ModelCheckpoint

# add a checkpoint callback to save the model with the lowest validation loss
filepath = 'tf1_mnist_cnn.hdf5'
save_checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                                  save_best_only=True, save_weights_only=False,
                                  mode='auto', period=1)
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[save_checkpoint])
TensorFlow 2:
# import other libs as needed for building the model
from tensorflow.keras.callbacks import ModelCheckpoint

# add a checkpoint callback to save the model with the lowest validation loss
filepath = 'tf2_mnist_model.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, save_weights_only=False,
                             mode='auto', save_freq='epoch')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[checkpoint])
Complete demo files are here: https://github.com/nateGeorge/slurm_gpu_ubuntu/tree/master/demo_files.
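Since save_best_only=True and save_weights_only=False, the file at filepath ends up holding the full best model. As a quick usage sketch (the evaluate call is just an example), you can load it back afterwards:
from tensorflow.keras.models import load_model

# restore the model that achieved the lowest validation loss
best_model = load_model(filepath)
best_model.evaluate(x_test, y_test)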

In your sess.run you'll need to explicitly ask for the loss value. Then keep a list of your last evaluation losses, and create the checkpoint only if the current evaluation loss is smaller than, e.g., the last two saved losses.

Related

Cannot save best weights using Keras during training

I'm new to Keras. I want to save the model with the best weights, like this:
model1.compile(loss="mean_squared_error", optimizer="RMSprop")
model1.summary()

mcp_save = ModelCheckpoint('best_model.h5', save_best_only=True, monitor='val_accuracy', mode='auto', verbose=2)
callbacks_list = [mcp_save]

epochs = 5000
batch_size = 50

# fit the model
history = model1.fit(x_train, y_train,
                     batch_size=batch_size,
                     epochs=epochs,
                     callbacks=callbacks_list,
                     validation_data=(x_test, y_test),
                     verbose=2)
I don't see any warning or error message in PyCharm 2019 Community Edition, but I cannot find 'best_model.h5' in the project folder or anywhere else on my computer after training finishes. Could you give me some advice? What am I doing wrong?
Your code looks fine to me. I use this callback often. All I can suggest is that you use a full path to designate where to save the model rather than a relative path.
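For example, a minimal sketch using an absolute path built from the current working directory (just an illustration of the suggestion above):
import os

# an unambiguous absolute location for the checkpoint file
filepath = os.path.join(os.getcwd(), 'best_model.h5')
mcp_save = ModelCheckpoint(filepath, save_best_only=True, monitor='val_accuracy', mode='auto', verbose=2)
print(filepath)  # check where the file is expected to appear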

How do loss functions know for which model to compute gradients in PyTorch?

I am unsure how PyTorch manages to link the loss function to the model I want it to be computed for. There is never an explicit reference between the loss and the model, such as the one between the model's parameters and the optimizer.
Say, for example, I want to train two networks on the same dataset, using a single pass through the data. How would PyTorch link the appropriate loss functions to the appropriate models? Here's code for reference:
import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import shap

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1))

model2 = nn.Sequential(nn.Linear(784, 128),
                       nn.ReLU(),
                       nn.Linear(128, 10),
                       nn.LogSoftmax(dim=1))

# Define the loss
criterion = nn.NLLLoss()
criterion2 = nn.NLLLoss()

optimizer = optim.SGD(model.parameters(), lr=0.003)
optimizer2 = optim.SGD(model2.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    running_loss_2 = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1)  # batch_size x total_pixels

        # Training pass
        optimizer.zero_grad()
        optimizer2.zero_grad()

        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        output2 = model2(images)
        loss2 = criterion2(output2, labels)
        loss2.backward()
        optimizer2.step()

        running_loss += loss.item()
        running_loss_2 += loss2.item()

    print(f"Training loss 1: {running_loss/len(trainloader)}")
    print(f"Training loss 2: {running_loss_2/len(trainloader)}")
    print()
So once again, how does pytorch know how to compute the appropriate gradients for the appropriate models when loss.backward() and loss2.backward() are called?
Whenever you perform forward operations using one of your model parameters (or any torch.Tensor with the attribute requires_grad == True), PyTorch builds a computational graph. When you operate on descendants in this graph, the graph is extended. In your case, you have an nn.Module called model with some trainable model.parameters(), so PyTorch builds a graph from your model.parameters() all the way to the loss as you perform the forward operations. The graph is then traversed in reverse during the backward pass to propagate the gradients back to the parameters. For loss in your code above, the graph is something like
model.parameters() --> [intermediate variables in model] --> output --> loss
                                     ^                                   ^
                                     |                                   |
                                   images                              labels
When you call loss.backward(), PyTorch traverses this graph in reverse to reach all trainable parameters (only the model.parameters() in this case) and updates param.grad for each of them. The optimizer then relies on this information, gathered during the backward pass, to update the parameters.
For loss2 the story is similar.
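As a quick sanity check (a sketch, not part of the original code; it assumes it runs before any training step has been taken), you can see that calling backward on one loss only populates gradients for the parameters that took part in that loss's graph:
images, labels = next(iter(trainloader))
images = images.view(images.shape[0], -1)

loss = criterion(model(images), labels)
loss.backward()

# only model's parameters received gradients; model2's are still untouched
print(all(p.grad is not None for p in model.parameters()))  # True
print(all(p.grad is None for p in model2.parameters()))     # True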
The official pytorch tutorials are a good resource for more in-depth information on this.

What is the learning rate status when applying keras model fit() iteratively?

I am applying Keras model fitting iteratively (within a for loop) because of a large dataset. My goal is to split the dataset into 100 parts, read one part at a time, and apply the fit() method to it.
My question: in each iteration, does the fit() method begin from the initial learning rate (lr=0.1) that I set during model compilation? Or does it remember the last updated learning rate and apply it directly in the new call of the fit() method?
My code sample is as follows:
# Define model
my_model()

# Set the optimizer
sgd = SGD(lr=0.1, decay=1e-08, momentum=0.9, nesterov=False)

# Compile model
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Fit model and train
for j in range(100):
    print('Data extracting from big matrix ...')
    X_train = HDF5Matrix(path_train, 'X', start=st, end=ed)
    Y_train = HDF5Matrix(path_train, 'y', start=st, end=ed)

    print('Fitting model ...')
    model.fit(X_train, Y_train, batch_size=100, shuffle='batch', nb_epoch=1,
              validation_data=(X_test, Y_test))
The updated learning rate is remembered in the optimizer object model.optimizer, which is just the sgd variable in your example.
In callbacks such as LearningRateScheduler, the learning rate variable model.optimizer.lr is updated (some lines are removed for clarity).
def on_epoch_begin(self, epoch, logs=None):
    lr = self.schedule(epoch)
    K.set_value(self.model.optimizer.lr, lr)
However, when decay is used (as in your example), the learning rate variable is not directly updated, but the variable model.optimizer.iterations is updated. This variable records how many batches have been used in model fitting, and the learning rate with decay is computed in SGD.get_updates() by:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations,
                                          K.dtype(self.decay))))
So in either case, as long as the model is not re-compiled, it will use the updated learning rate in the new fit() calls.
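A quick way to check this yourself (a sketch using the Keras backend; it assumes the sgd optimizer and model from your question):
from keras import backend as K

# the batch counter keeps growing across successive fit() calls,
# so the decayed effective learning rate is not reset
print(K.get_value(model.optimizer.iterations))
model.fit(X_train, Y_train, batch_size=100, nb_epoch=1)
print(K.get_value(model.optimizer.iterations))  # larger than before

# the base learning rate variable stays at the compiled value when only decay is used
print(K.get_value(model.optimizer.lr))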

Classification with Keras Autoencoders

I'm trying to take a vanilla autoencoder in Keras (with a TensorFlow backend) and stop training when the loss value converges to a specific value. After the last epoch, I want to use a sigmoid function to perform classification. Would you know how to go about doing this (or at least point me in the right direction)?
The below code is quite similar to the vanilla autoencoder at http://wiseodd.github.io/techblog/2016/12/03/autoencoders/. (I'm using my own data, but feel free to use the MNIST example in the link to demonstrate what you are talking about.)
NUM_ROWS = len(x_train)
NUM_COLS = len(x_train[0])

inputs = Input(shape=(NUM_COLS, ))
h = Dense(64, activation='sigmoid')(inputs)
outputs = Dense(NUM_COLS)(h)

# trying to add last sigmoid layer
outputs = Dense(1)
outputs = Activation('sigmoid')

model = Model(input=inputs, output=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch,
          epochs=epochs,
          validation_data=(x_test, y_test))
I have an interpretation of what you are aiming at; however, you don't seem to have a very clear picture of it yourself. I guess it will become clearer once you prepare the necessary dataset yourself.
One possible solution would be the following:
NUM_ROWS = len(x_train)
NUM_COLS = len(x_train[0])

inputs = Input(shape=(NUM_COLS, ))
encoder = Dense(64, activation='sigmoid')
h = encoder(inputs)
outputs = Dense(NUM_COLS)(h)

model = Model(input=inputs, output=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# the autoencoder reconstructs its own input
model.fit(x_train, x_train,
          batch_size=batch,
          epochs=epochs,
          validation_data=(x_test, x_test))

# freeze the trained encoder and add the last sigmoid classification layer on top of it
encoder.trainable = False
outputs = Activation('sigmoid')(Dense(1)(h))

model2 = Model(input=inputs, output=outputs)
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model2.fit(x_train, y_train,
           batch_size=batch,
           epochs=epochs,
           validation_data=(x_test, y_test))

How to avoid overfitting on a simple feed forward network

Using the Pima Indians diabetes dataset, I'm trying to build an accurate model with Keras. I've written the following code:
# Visualize training history
from keras import callbacks
from keras.layers import Dropout
tb = callbacks.TensorBoard(log_dir='/.logs', histogram_freq=10, batch_size=32,
                           write_graph=True, write_grads=True, write_images=False,
                           embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None)
# Visualize training history
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu', name='first_input'))
model.add(Dense(500, activation='tanh', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
# Compile model
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
# Fit the model
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb])
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
After several tries, I added dropout layers to avoid overfitting, but with no luck. The following graph shows that the validation loss and the training loss diverge at one point.
What else could I do to optimize this network?
UPDATE:
Based on the comments I got, I've tweaked the code like so:
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01), activation='relu',
                name='first_input'))  # added regularizers
model.add(Dense(8, activation='relu', name='first_hidden')) # reduced to 8 neurons
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(5, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
Here are the graphs for 500 epochs
The first example gave a validation accuracy > 75% and the second one gave an accuracy < 65%. If you compare the losses for epochs below 100, it is less than 0.5 for the first one and greater than 0.6 for the second one. So how is the second case better?
To me, the second one is a case of under-fitting: the model doesn't have enough capacity to learn. The first case has a problem of over-fitting, because its training was not stopped when overfitting started (early stopping). If training had been stopped at, say, epoch 100, it would have been a far better model than either of the two.
The goal should be to obtain a small prediction error on unseen data, and for that you increase the capacity of the network up to the point beyond which overfitting starts to happen.
So how do you avoid over-fitting in this particular case? Adopt early stopping.
CODE CHANGES: To include early stopping and input scaling.
from sklearn.preprocessing import StandardScaler
from keras.callbacks import EarlyStopping

# input scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Early stopping
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=1, mode='auto')

# create model - almost the same code
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu', name='first_input'))
model.add(Dense(500, activation='relu', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

# compile as before
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb, early_stop])
The Accuracy and loss graphs:
First, try adding some regularization (https://keras.io/regularizers/), like this:
model.add(Dense(12, input_dim=8,  # input_dim matches the 8 features of the dataset
                kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01)))
Also, make sure to decrease your network size: you don't need a hidden layer of 500 neurons - try taking that out to reduce the representation power, and maybe even remove another layer if it's still overfitting. Also, use only relu activations. Maybe also try increasing your dropout rate to something like 0.75 (although it's already high). You probably also don't need to run it for so many epochs - given long enough, it will just begin to overfit.
For a dataset like the diabetes one, you can use a much simpler network. Try reducing the number of neurons in your second layer. (Is there a specific reason why you chose tanh as the activation there?)
In addition, you can simply add an EarlyStopping callback to your training: https://keras.io/callbacks/
