Constantly separated validation & training losses - machine-learning

I've worked with autoencoders for some weeks now, but I seem to have hit a wall when it comes to my understanding of losses overall. The issue I'm facing is that when I add BatchNorm and Dropout layers to my model, I get losses that don't converge and awful reconstructions. In a typical loss plot (image not shown), the training and validation curves stay separated throughout training.
The loss I use is MSE with an L1 regularization term, and looks something like this:
import torch
import torch.nn as nn
import torch.nn.functional as F
def L1_loss_fcn(model_children, true_data, reconstructed_data, reg_param=0.1, validate=False):
    # MSE reconstruction term
    mse = nn.MSELoss()
    mse_loss = mse(reconstructed_data, true_data)
    l1_loss = 0
    values = true_data
    if validate == False:
        # L1 penalty on the ReLU activations of each child layer
        for i in range(len(model_children)):
            values = F.relu((model_children[i](values)))
            l1_loss += torch.sum(torch.abs(values))
        loss = mse_loss + reg_param * l1_loss
        return loss, mse_loss, l1_loss
    else:
        return mse_loss
with my training loop written as:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_run_loss = 0
val_run_loss = 0
for epoch in range(epochs):
    print(f"Epoch {epoch + 1} of {epochs}")
    # TRAINING
    model.train()
    for data in tqdm(train_dl):
        x, _ = data
        reconstructions = model(x)
        optimizer.zero_grad()
        train_loss, mse_loss, l1_loss = L1_loss_fcn(model_children=model_children, true_data=x, reg_param=regular_param, reconstructed_data=reconstructions, validate=False)
        train_loss.backward()
        optimizer.step()
        train_run_loss += train_loss.item()
    # VALIDATING
    model.eval()
    with torch.no_grad():
        for data in tqdm(test_dl):
            x, _ = data
            reconstructions = model(x)
            val_loss = L1_loss_fcn(model_children=model_children, true_data=x, reg_param=regular_param, reconstructed_data=reconstructions, validate=True)
            val_run_loss += val_loss.item()
    epoch_loss_train = train_run_loss / len(train_dl)
    epoch_loss_val = val_run_loss / len(test_dl)
where I've tried different hyper-parameter values without luck. My model looks something like this,
encoder = nn.Sequential(nn.Linear(), nn.Dropout(p=0.5), nn.LeakyReLU(), nn.BatchNorm1d(),
nn.Linear(), nn.Dropout(p=0.4), nn.LeakyReLU(), nn.BatchNorm1d(),
nn.Linear(), nn.Dropout(p=0.3), nn.LeakyReLU(), nn.BatchNorm1d(),
nn.Linear(), nn.Dropout(p=0.2), nn.LeakyReLU(), nn.BatchNorm1d(),
)
decoder = nn.Sequential(nn.Linear(), nn.Dropout(p=0.2), nn.LeakyReLU(),
nn.Linear(), nn.Dropout(p=0.3), nn.LeakyReLU(),
nn.Linear(), nn.Dropout(p=0.4), nn.LeakyReLU(),
nn.Linear(), nn.Dropout(p=0.5), nn.ReLU(),
)
What I expect to find is a converging train & validation loss, and thereby a lot better reconstructions overall, but I think that I'm missing something quite grave I'm afraid. Some help would be greatly appreciated!

You are not comparing apples to apples. Your code reads:
l1_loss = 0
values = true_data
if validate == False:
    for i in range(len(model_children)):
        values = F.relu((model_children[i](values)))
        l1_loss += torch.sum(torch.abs(values))
    loss = mse_loss + reg_param * l1_loss
    return loss, mse_loss, l1_loss
else:
    return mse_loss
So your validation loss is just the MSE, while your training loss is the MSE plus a non-negative regularisation term, so the training loss will naturally be higher. If you want to compare the two curves, log the training MSE on its own, without the regulariser.
Also, do not start with regularisation. Always start with a model with no regularisation at all and get training to converge: remove all extra losses and remove your dropouts. These things make it harder to fit the training data (though they might improve generalisation). Once training converges, reintroduce them one at a time.
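To make the comparison concrete, here is a minimal pure-Python sketch in which plain lists stand in for tensors. The helper names `mse` and `l1_penalty` and the toy numbers are made up for illustration; the point is only that the optimised objective and the quantity worth plotting against the validation curve are two different numbers.

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def l1_penalty(activations):
    """Sum of absolute activations over all layers (a list of lists)."""
    return sum(abs(a) for layer in activations for a in layer)


# Toy values, purely illustrative.
target = [1.0, 0.0, 2.0]
reconstruction = [0.9, 0.1, 1.8]
activations = [[0.5, 0.0, 1.5], [0.2, 0.3]]
reg_param = 0.1

# The value the optimiser minimises during training.
train_mse = mse(reconstruction, target)
train_objective = train_mse + reg_param * l1_penalty(activations)

# Only train_mse is on the same scale as a validation loss that is pure MSE.
print(f"optimised objective: {train_objective:.4f}")
print(f"comparable MSE:      {train_mse:.4f}")
```

In the question's training loop, this would amount to accumulating `mse_loss.item()` alongside `train_loss.item()` and plotting the former against the validation loss.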

Related

Why is a simple embedding+linear layer outperforming this LSTM classifier?

I'm running into a roadblock in my learning about NLP. I'm working on a beginner's Kaggle competition classifying tweets as "disaster" or "not disaster". I started out by repurposing a simple network from a PyTorch tutorial comprised of nn.EmbeddingBag and nn.Linear layers and saw decent results during both training and inference:
self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
The loss function is BCEWithLogits, by the way.
I decided to up my game and throw an LSTM into the mix. I took a deep dive into padded/packed sequences and think I understand them pretty well. After perusing around and thinking about it, I came to the conclusion that I should be grabbing the final non-padded hidden state of each sequence's output from the LSTM. That's what I tried below:
My attempt at upping my game:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)
    def forward(self, padded_seq, lengths):
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
        # lstm layer
        output, _ = self.lstm(packed_output)
        padded_output, lengths = pad_packed_sequence(output, batch_first=True)
        # get hidden state of final non-padded sequence element:
        h_n = []
        for seq, length in zip(padded_output, lengths):
            h_n.append(seq[length - 1, :])
        lstm_out = torch.stack(h_n)
        # linear layers
        out = self.fc1(lstm_out)
        return out
This morning, I ported my notebook over to an IDE and ran the debugger and confirmed that h_n is indeed the final hidden state of each sequence, not including padding.
So everything runs/trains without error but my loss never decreases when I use batch size > 1.
With batch_size = 8:
With batch_size = 1:
My Question
I would have expected this LSTM setup to perform much better on this simple task. So I'm wondering "Where have I gone wrong?"
Additional Information: Training Code
def train_one_epoch(model, opt, criterion, lr, trainloader):
    model.to(device)
    model.train()
    running_tl = 0
    for (label, data, lengths) in trainloader:
        opt.zero_grad()
        label = label.reshape(label.size()[0], 1)
        output = model(data, lengths)
        loss = criterion(output, label)
        running_tl += loss.item()
        loss.backward()
        opt.step()
    return running_tl
def validate_one_epoch(model, opt, criterion, lr, validloader):
    running_vl = 0
    model.eval()
    with torch.no_grad():
        for (label, data, lengths) in validloader:
            label = label.reshape(label.shape[0], 1)
            output = model(data, lengths)
            loss = criterion(output, label)
            running_vl += loss.item()
    return running_vl
def train_model(model, opt, criterion, epochs, trainload, testload=None, lr=1e-3):
    avg_tl_per_epoch = []
    avg_vl_per_epoch = []
    for e in trange(epochs):
        running_tl = train_one_epoch(model, opt, criterion, lr, trainload)
        avg_tl_per_epoch.append(running_tl / len(trainload))
        if testload:
            # validate on the loader passed in as testload
            running_vl = validate_one_epoch(model, opt, criterion, lr, testload)
            avg_vl_per_epoch.append(running_vl / len(testload))
    return avg_tl_per_epoch, avg_vl_per_epoch
I think your model should look like this:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)
    def forward(self, padded_seq, lengths):
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_input = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
        # lstm layer: the second return value holds the final hidden and cell states
        _, (h_n, _) = self.lstm(packed_input)
        # h_n has shape (num_layers, batch, hidden_size); take the last layer
        out = self.fc1(h_n[-1])
        return out
The LSTM returns the final hidden state of each sequence as h_n, alongside the per-step output. With a packed sequence as input, h_n is taken at the last non-padded step, so there is no need to unpack the output and index it manually.
Also, depending on the number of examples, the simple embedding + linear model might work better, as it needs less data to converge. Since your data are tweets (very short texts), the sequential aspect of the text might not be that important.
You have not provided the code for preprocessing your data. With text, good preprocessing is crucial, and I recommend taking a look at the PyTorch tutorial "NLP From Scratch: Translation with a Sequence to Sequence Network and Attention".
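For reference, picking the last non-padded time step of each sequence (what the loop in the question does) can be sketched without any framework. Nested lists stand in for the `(batch, time, hidden)` tensor, and `last_valid_step` is a hypothetical helper name used only here.

```python
def last_valid_step(padded_output, lengths):
    """Return, for each sequence, the vector at index length - 1.

    padded_output: list of sequences, each a list of per-step vectors,
                   padded to a common number of steps.
    lengths:       true (unpadded) length of each sequence.
    """
    return [seq[length - 1] for seq, length in zip(padded_output, lengths)]


PAD = [0.0, 0.0]
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # length 3, no padding
    [[0.7, 0.8], PAD, PAD],                # length 1, two padded steps
]
lengths = [3, 1]

final_states = last_valid_step(batch, lengths)
print(final_states)
```

In PyTorch, for a single-layer, unidirectional `nn.LSTM` fed a packed sequence, the same vectors come back as `h_n[-1]` in the `(h_n, c_n)` tuple, so the manual indexing and the built-in final state should agree.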

Training Loss When Resuming From a Checkpoint Explodes

I am trying to implement a function in my algorithm that allows me to resume training from a checkpoint. The problem is that when I resume training, my loss explodes by many orders of magnitude, from the order of 0.001 to the order of 1000. I suspect that the problem may be that when training is resumed, the learning rate is not being set properly.
Here is my training function:
def train_gray(epoch, data_loader, device, model, criterion, optimizer, i, path):
    train_loss = 0.0
    for data in data_loader:
        img, _ = data
        img = img.to(device)
        stand_dev = 0.0392
        noisy_img = add_noise(img, stand_dev, device)
        output = model(noisy_img, stand_dev)
        output = output[:,0:1,:,:]
        loss = criterion(output, img)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()*img.size(0)
    train_loss = train_loss/len(data_loader)
    print('Epoch: {} Complete \tTraining Loss: {:.6f}'.format(
        epoch,
        train_loss
        ))
    return train_loss
And here is my main function that initialises my variables, loads a checkpoint, calls my training function, and saves a checkpoint after an epoch of training:
def main():
    now = datetime.now()
    current_time = now.strftime("%H_%M_%S")
    path = "/home/bledc/my_remote_folder/denoiser/models/{}_sigma_10_session2".format(current_time)
    os.mkdir(path)
    width = 256
    # height = 256
    num_epochs = 25
    batch_size = 4
    learning_rate = 0.0001
    data_loader = load_dataset(batch_size, width)
    model = UNetWithResnet50Encoder().to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=learning_rate, weight_decay=1e-5)
    ############################################################################################
    # UNCOMMENT CODE BELOW TO RESUME TRAINING FROM A MODEL
    model_path = "/home/bledc/my_remote_folder/denoiser/models/resnet_sigma_10/model_epoch_10.pt"
    save_point = torch.load(model_path)
    model.load_state_dict(save_point['model_state_dict'])
    optimizer.load_state_dict(save_point['optimizer_state_dict'])
    epoch = save_point['epoch']
    train_loss = save_point['train_loss']
    model.train()
    ############################################################################################
    for i in range(epoch, num_epochs+1):
        train_loss = train_gray(i, data_loader, device, model, criterion, optimizer, i, path)
        checkpoint(i, train_loss, model, optimizer, path)
    print("end")
Lastly, here is my function to save checkpoints:
def checkpoint(epoch, train_loss, model, optimizer, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'train_loss': train_loss
        }, path+"/model_epoch_{}.pt".format(epoch))
    print("Epoch saved")
If my problem is that I am not saving my learning rate, how would I do this?
Any help would be greatly appreciated,
Clement
Update: I'm fairly certain that the problem lies in my pretrained model. I am saving the optimiser every epoch, but the optimiser only holds information for the trainable layers. I hope to solve this soon and post a more thorough answer once I figure out how to save and load the entire model.
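On the learning-rate question itself: `optimizer.state_dict()` already includes the optimizer's `param_groups` (and with them `lr`), so `optimizer.load_state_dict(...)` restores the learning rate. A separate detail in `main` is that `range(epoch, num_epochs+1)` starts at the saved epoch, so that epoch is trained a second time after resuming. A minimal sketch of the bookkeeping, assuming the checkpoint is written after the epoch finishes (as `checkpoint()` does); `epochs_to_run` is a hypothetical helper name:

```python
def epochs_to_run(saved_epoch, num_epochs):
    """Epoch indices still to train after loading a checkpoint.

    Assumes the checkpoint was written *after* `saved_epoch` finished,
    so training resumes with the following epoch.
    """
    return list(range(saved_epoch + 1, num_epochs + 1))


# Resuming from model_epoch_10.pt with num_epochs = 25.
remaining = epochs_to_run(saved_epoch=10, num_epochs=25)
print(remaining[0], remaining[-1], len(remaining))
```

This is unlikely to explain a jump of several orders of magnitude on its own, but it keeps the epoch numbering in the saved file names consistent.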

Error when defining a simple neural network in MXNet

I am building a simple NN using MXNet, but I am having a problem with the step() method:
x1.shape=(64, 1, 1000)
y1.shape=(64, 1, 10)
net = nn.Sequential()
net.add(nn.Dense(H, activation='relu'), nn.Dense(90, activation='relu'), nn.Dense(D_out))
for t in range(500):
    #y_pred = net(x1)
    #loss = loss_fn(y_pred, y)
    #for i in range(len(x1)):
    with autograd.record():
        output = net(x1)
        loss = loss_fn(output, y1)
    loss.backward()
    trainer.step(64)
    if t % 100 == 99:
        print(t, loss)
    #optimizer.zero_grad()
UserWarning: Gradient of Parameter dense30_weight on context cpu(0)
has not been updated by backward since last step. This could mean a
bug in your model that made it only use a subset of the Parameters
(Blocks) for this iteration. If you are intentionally only using a
subset, call step with ignore_stale_grad=True to suppress this warning
and skip updating of Parameters with stale gradient
The warning indicates that you are passing parameters to your trainer that are not in your computational graph.
You need to initialize the parameters of your model and define the trainer. Unlike PyTorch, you don't need to call zero_grad in MXNet, because by default new gradients are written in rather than accumulated. The following code shows a simple neural network implemented using MXNet's Gluon API:
import mxnet as mx
from mxnet import nd, autograd, gluon
model_ctx = mx.cpu()
# Illustrative sizes (assumed here); real_fn below reads two input columns
num_inputs = 2
num_examples = 10000
# Define model
net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.0001})
# Create random input and labels
def real_fn(X):
    return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
X = nd.random_normal(shape=(num_examples, num_inputs))
noise = 0.01 * nd.random_normal(shape=(num_examples,))
y = real_fn(X) + noise
# Define Dataloader
batch_size = 4
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=batch_size, shuffle=True)
num_batches = num_examples / batch_size
for e in range(10):
    # Reset the running loss at the start of every epoch
    cumulative_loss = 0
    # Iterate over training batches
    for i, (data, label) in enumerate(train_data):
        # Load data on the CPU
        data = data.as_in_context(mx.cpu())
        label = label.as_in_context(mx.cpu())
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        # Backpropagation
        loss.backward()
        trainer.step(batch_size)
        cumulative_loss += nd.mean(loss).asscalar()
    print("Epoch %s, loss: %s" % (e, cumulative_loss / num_examples))
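As a side note on `trainer.step(batch_size)`: Gluon's `Trainer.step` normalises the gradients by `1 / batch_size` before applying the update, which is why the batch size is passed in. A framework-free sketch of that update rule for plain SGD follows; the function name `sgd_step` and the numbers are made up for illustration, and this is not MXNet's actual implementation.

```python
def sgd_step(weights, grads, learning_rate, batch_size):
    """One SGD update with gradients rescaled by 1 / batch_size.

    `grads` is assumed to be the gradient of the *summed* per-sample
    losses, which is what backward() on a vector of losses produces.
    """
    return [w - learning_rate * g / batch_size for w, g in zip(weights, grads)]


weights = [1.0, 2.0]
summed_grads = [4.0, 8.0]  # summed over a batch of 4 samples
updated = sgd_step(weights, summed_grads, learning_rate=0.1, batch_size=4)
print(updated)
```

Passing the true batch size keeps the effective step size independent of how many samples contributed to the summed gradient.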

TensorFlow learning rate decay - how to properly supply the step number for decay?

I am training my deep network in TensorFlow and I am trying to use a learning rate decay with it. As far as I see I should use train.exponential_decay function for that - it will calculate the proper learning rate value for current training step using various parameters. I just need to provide it with a step which is performed right now. I suspected I should use tf.placeholder(tf.int32) as usual when I need to provide something into the network, but seems like I am wrong. When I do this I get the below error:
TypeError: Input 'ref' of 'AssignAdd' Op requires l-value input
What am I doing wrong? Unfortunately, I haven't managed to find a good example of network training with decay. My whole code is below. The network has 2 hidden ReLU layers, an L2 penalty on the weights, and dropout on both hidden layers.
#We try the following - 2 ReLU layers
#Dropout on both of them
#Also L2 regularization on them
#and learning rate decay also
#batch size for SGD
batch_size = 128
#beta parameter for L2 loss
beta = 0.001
#that's how many hidden neurons we want
num_hidden_neurons = 1024
#learning rate decay
#starting value, number of steps decay is performed,
#size of the decay
start_learning_rate = 0.05
decay_steps = 1000
decay_size = 0.95
#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
# Input data. For the training data, we use a placeholder that will be fed
# at run time with a training minibatch.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, image_size * image_size))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
#now let's build our first hidden layer
#its weights
hidden_weights_1 = tf.Variable(
tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
hidden_biases_1 = tf.Variable(tf.zeros([num_hidden_neurons]))
#now the layer 1 itself. It multiplies data by weights, adds biases
#and takes ReLU over result
hidden_layer_1 = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights_1) + hidden_biases_1)
#add dropout on hidden layer 1
#we pick up the probability of switching off the activation
#and perform the switch off of the activations
keep_prob = tf.placeholder("float")
hidden_layer_drop_1 = tf.nn.dropout(hidden_layer_1, keep_prob)
#now let's build our second hidden layer
#its weights
hidden_weights_2 = tf.Variable(
tf.truncated_normal([num_hidden_neurons, num_hidden_neurons]))
hidden_biases_2 = tf.Variable(tf.zeros([num_hidden_neurons]))
#now the layer 2 itself. It multiplies data by weights, adds biases
#and takes ReLU over result
hidden_layer_2 = tf.nn.relu(tf.matmul(hidden_layer_drop_1, hidden_weights_2) + hidden_biases_2)
#add dropout on hidden layer 2
#we pick up the probability of switching off the activation
#and perform the switch off of the activations
hidden_layer_drop_2 = tf.nn.dropout(hidden_layer_2, keep_prob)
#time to go for output linear layer
#out weights connect hidden neurons to output labels
#biases are added to output labels
out_weights = tf.Variable(
tf.truncated_normal([num_hidden_neurons, num_labels]))
out_biases = tf.Variable(tf.zeros([num_labels]))
#compute output
#notice that upon training we use the switched off activations
#i.e. the variation of hidden_layer with the dropout active
out_layer = tf.matmul(hidden_layer_drop_2,out_weights) + out_biases
#our real output is a softmax of prior result
#and we also compute its cross-entropy to get our loss
#Notice - we introduce our L2 here
loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
out_layer, tf_train_labels) +
beta*tf.nn.l2_loss(hidden_weights_1) +
beta*tf.nn.l2_loss(hidden_biases_1) +
beta*tf.nn.l2_loss(hidden_weights_2) +
beta*tf.nn.l2_loss(hidden_biases_2) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)))
#variable to count number of steps taken
global_step = tf.placeholder(tf.int32)
#compute current learning rate
learning_rate = tf.train.exponential_decay(start_learning_rate, global_step, decay_steps, decay_size)
#use it in optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
#nice, now let's calculate the predictions on each dataset for evaluating the
#performance so far
# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(out_layer)
valid_relu_1 = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights_1) + hidden_biases_1)
valid_relu_2 = tf.nn.relu( tf.matmul(valid_relu_1, hidden_weights_2) + hidden_biases_2)
valid_prediction = tf.nn.softmax( tf.matmul(valid_relu_2, out_weights) + out_biases)
test_relu_1 = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights_1) + hidden_biases_1)
test_relu_2 = tf.nn.relu( tf.matmul( test_relu_1, hidden_weights_2) + hidden_biases_2)
test_prediction = tf.nn.softmax(tf.matmul(test_relu_2, out_weights) + out_biases)
#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps
#number of steps we will train our ANN
num_steps = 3001
#actual training
with tf.Session(graph=graph) as session:
tf.initialize_all_variables().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5, global_step: step}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 500 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(
valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
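Instead of using a placeholder for global_step, use a Variable. minimize(loss, global_step=global_step) increments global_step in place after every update (the AssignAdd op in the error message), and a placeholder cannot be assigned to.
global_step = tf.Variable(0, trainable=False)
You will also have to remove global_step from the feed_dict. Note that you don't have to increment global_step manually; TensorFlow will do it automatically for you.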
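To see what the decayed learning rate looks like for the values in the question, here is a pure-Python sketch of the formula documented for `tf.train.exponential_decay`, namely `learning_rate * decay_rate ** (global_step / decay_steps)`. The function below is a stand-in written for this example, not TensorFlow code.

```python
def exponential_decay(start_lr, global_step, decay_steps, decay_rate, staircase=False):
    """Decayed learning rate: start_lr * decay_rate ** (global_step / decay_steps).

    With staircase=True the exponent uses integer division, so the rate
    drops in discrete jumps every `decay_steps` steps.
    """
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return start_lr * decay_rate ** exponent


# Values from the question: start 0.05, decay 0.95 every 1000 steps.
for step in (0, 500, 1000, 3000):
    print(step, exponential_decay(0.05, step, 1000, 0.95))
```

Because the optimizer advances `global_step` once per training step, the schedule follows the step count without anything being fed in.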

How does one do Inference with Batch Normalization with Tensor Flow?

I was reading the original paper on BN and the Stack Overflow question "How could I use Batch Normalization in TensorFlow?", which provides a very useful piece of code for inserting a batch normalization block into a neural network but does not provide enough guidance on how to actually use it during training, inference, and model evaluation.
For example, I would like to track the train error and the test error during training to make sure I don't overfit. It's clear that the batch normalization block should be off during test, but when evaluating the error on the training set, should the batch normalization block be turned off too? My main questions are:
During inference and error evaluation, should the batch normalization block be turned off regardless of the data set?
Does that mean that the batch normalization block should only be on during the training step then?
To make it very clear, I will provide a (simplified) extract of the code I have been using to run batch normalization with TensorFlow, according to my understanding of the right thing to do:
## TRAIN
if phase_train is not None:
#DO BN
feed_dict_train = {x:X_train, y_:Y_train, phase_train: False}
feed_dict_cv = {x:X_cv, y_:Y_cv, phase_train: False}
feed_dict_test = {x:X_test, y_:Y_test, phase_train: False}
else:
#Don't do BN
feed_dict_train = {x:X_train, y_:Y_train}
feed_dict_cv = {x:X_cv, y_:Y_cv}
feed_dict_test = {x:X_test, y_:Y_test}
def get_batch_feed(X, Y, M, phase_train):
mini_batch_indices = np.random.randint(M,size=M)
Xminibatch = X[mini_batch_indices,:] # ( M x D^(0) )
Yminibatch = Y[mini_batch_indices,:] # ( M x D^(L) )
if phase_train is not None:
#DO BN
feed_dict = {x: Xminibatch, y_: Yminibatch, phase_train: True}
else:
#Don't do BN
feed_dict = {x: Xminibatch, y_: Yminibatch}
return feed_dict
with tf.Session() as sess:
sess.run( tf.initialize_all_variables() )
for iter_step in xrange(steps):
feed_dict_batch = get_batch_feed(X_train, Y_train, M, phase_train)
# Collect model statistics
if iter_step%report_error_freq == 0:
train_error = sess.run(fetches=l2_loss, feed_dict=feed_dict_train)
cv_error = sess.run(fetches=l2_loss, feed_dict=feed_dict_cv)
test_error = sess.run(fetches=l2_loss, feed_dict=feed_dict_test)
do_stuff_with_errors(train_error, cv_error, test_error)
# Run Train Step
sess.run(fetches=train_step, feed_dict=feed_dict_batch)
and the code I am using to produce batch normalization blocks is:
def standard_batch_norm(l, x, n_out, phase_train, scope='BN'):
"""
Batch normalization on feedforward maps.
Args:
x: Vector
n_out: integer, depth of input maps
phase_train: boolean tf.Variable, true indicates training phase
scope: string, variable scope
Return:
normed: batch-normalized maps
"""
with tf.variable_scope(scope+l):
#beta = tf.Variable(tf.constant(0.0, shape=[n_out], dtype=tf.float64 ), name='beta', trainable=True, dtype=tf.float64 )
#gamma = tf.Variable(tf.constant(1.0, shape=[n_out],dtype=tf.float64 ), name='gamma', trainable=True, dtype=tf.float64 )
init_beta = tf.constant(0.0, shape=[n_out], dtype=tf.float64)
init_gamma = tf.constant(1.0, shape=[n_out],dtype=tf.float64)
beta = tf.get_variable(name='beta'+l, dtype=tf.float64, initializer=init_beta, regularizer=None, trainable=True)
gamma = tf.get_variable(name='gamma'+l, dtype=tf.float64, initializer=init_gamma, regularizer=None, trainable=True)
batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
ema = tf.train.ExponentialMovingAverage(decay=0.5)
def mean_var_with_update():
ema_apply_op = ema.apply([batch_mean, batch_var])
with tf.control_dependencies([ema_apply_op]):
return tf.identity(batch_mean), tf.identity(batch_var)
mean, var = tf.cond(phase_train, mean_var_with_update, lambda: (ema.average(batch_mean), ema.average(batch_var)))
normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
return normed
I found that there is an 'official' batch_norm layer in TensorFlow. Try it out:
https://github.com/tensorflow/tensorflow/blob/b826b79718e3e93148c3545e7aa3f90891744cc0/tensorflow/contrib/layers/python/layers/layers.py#L100
Most likely it is not mentioned in the docs because it is only included in an RC or 'beta' version.
I haven't looked deeply into this yet, but as far as I can see from the documentation, you just use the boolean parameter is_training of this batch_norm layer and set it to true only for the training phase.
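To illustrate what the `is_training` switch changes, here is a framework-free sketch for a single scalar feature. The class name `BatchNorm1Feature`, the momentum value and the toy batches are assumptions made for this example, not TensorFlow's implementation, and the learnable scale and shift are left out.

```python
class BatchNorm1Feature:
    """Batch norm for one scalar feature, without learnable scale/shift."""

    def __init__(self, momentum=0.9, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, is_training):
        if is_training:
            # Training: normalise with the statistics of the current batch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and update the moving averages for later use at inference.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference/evaluation: use the stored moving averages, unchanged.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


bn = BatchNorm1Feature()
for _ in range(200):
    bn([2.0, 4.0, 6.0], is_training=True)

train_out = bn([2.0, 4.0, 6.0], is_training=True)
eval_out = bn([4.0], is_training=False)  # a batch of one works at inference
print(train_out, eval_out)
```

Under this scheme, any evaluation pass (including measuring the error on the training set) would use `is_training=False`, so that the reported error reflects the statistics the model uses at inference time.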
UPDATE: Below is the code to load data, build a network with one hidden ReLU layer and L2 regularization, and introduce batch normalization on both the input and the hidden layer. This runs fine and trains fine.
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'
with open(pickle_file, 'rb') as f:
save = pickle.load(f)
train_dataset = save['train_dataset']
train_labels = save['train_labels']
valid_dataset = save['valid_dataset']
valid_labels = save['valid_labels']
test_dataset = save['test_dataset']
test_labels = save['test_labels']
del save # hint to help gc free up memory
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
image_size = 28
num_labels = 10
def reformat(dataset, labels):
dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
# Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
/ predictions.shape[0])
#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization.
#Feel free to play with it - the best one I found is 0.001
#notice, we introduce L2 for both biases and weights of all layers
batch_size = 128
beta = 0.001
#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
# Input data. For the training data, we use a placeholder that will be fed
# at run time with a training minibatch.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, image_size * image_size))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
#introduce batchnorm
tf_train_dataset_bn = tf.contrib.layers.batch_norm(tf_train_dataset)
#now let's build our new hidden layer
#that's how many hidden neurons we want
num_hidden_neurons = 1024
#its weights
hidden_weights = tf.Variable(
tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))
#now the layer itself. It multiplies data by weights, adds biases
#and takes ReLU over result
hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset_bn, hidden_weights) + hidden_biases)
#adding the batch normalization layer
hidden_layer_bn = tf.contrib.layers.batch_norm(hidden_layer)
#time to go for output linear layer
#out weights connect hidden neurons to output labels
#biases are added to output labels
out_weights = tf.Variable(
tf.truncated_normal([num_hidden_neurons, num_labels]))
out_biases = tf.Variable(tf.zeros([num_labels]))
#compute output
out_layer = tf.matmul(hidden_layer_bn,out_weights) + out_biases
#our real output is a softmax of prior result
#and we also compute its cross-entropy to get our loss
#Notice - we introduce our L2 here
loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
out_layer, tf_train_labels) +
beta*tf.nn.l2_loss(hidden_weights) +
beta*tf.nn.l2_loss(hidden_biases) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)))
#now we just minimize this loss to actually train the network
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
#nice, now let's calculate the predictions on each dataset for evaluating the
#performance so far
# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(out_layer)
valid_relu = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases)
test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)
#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps
#number of steps we will train our ANN
num_steps = 3001
#actual training
with tf.Session(graph=graph) as session:
tf.initialize_all_variables().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
# Note: we could use better randomization across epochs.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
# The key of the dictionary is the placeholder node of the graph to be fed,
# and the value is the numpy array to feed to it.
feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
_, l, predictions = session.run(
[optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 500 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Validation accuracy: %.1f%%" % accuracy(
valid_prediction.eval(), valid_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
