BERT Pre-training accuracy not increasing

I am trying to pretrain BERT on a dataset (wiki103) that contains 150k sentences. After 12 epochs, the NSP (next sentence prediction) task reaches an accuracy of around 0.76 (it overfits if I continue for more epochs), while the MLM (masked language modeling) task starts at an accuracy of 0.01 and reaches at most 0.2. What is wrong here? Can I stop NSP at some point and continue MLM for a longer period? My train loader length is 2486 (2486 training steps per epoch), which means 40*2486=99440 training steps in total.
Here is the model config and training config:
class Train_Config():
    """ Hyperparameters for training """
    seed: int = 391275 # random seed
    batch_size: int = 64
    lr: float = 1e-5 # learning rate
    n_epochs: int = 40 # the number of epochs
    # `warm up` period = warmup(0.1)*total_steps
    # linearly increasing learning rate from zero to the specified value (lr)
    warmup: float = 0.1
    is_dibert: bool = False
class Model_Config():
    vocab_size: int = 30522 # Size of Vocabulary
    hidden_size: int = 768 # Dimension of Hidden Layer in Transformer Encoder
    num_hidden_layers: int = 8 # Number of Hidden Layers
    num_attention_heads: int = 8 # Number of Heads in Multi-Headed Attention Layers
    intermediate_size: int = 768 * 4 # Dimension of Intermediate Layers in Positionwise Feedforward Net
    # activ_fn: str = "gelu" # Non-linear Activation Function Type in Hidden Layers
    max_len: int = 312 # Maximum Length for Positional Embeddings
    n_segments: int = 2 # Number of Sentence Segments
    attention_probs_dropout_prob: float = 0.1
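The warm-up comment in the config can be made concrete with a small sketch. It uses the 2486 steps per epoch quoted in the question and assumes the rate stays constant once warm-up ends (the question does not say what happens afterwards); `lr_at_step` is a hypothetical helper, not part of the asker's code.

```python
# Sketch of the linear warm-up described in Train_Config.
# Assumption: the rate stays at PEAK_LR after warm-up; the real
# training script may decay it instead.
STEPS_PER_EPOCH = 2486   # train loader length quoted in the question
N_EPOCHS = 40            # Train_Config.n_epochs
PEAK_LR = 1e-5           # Train_Config.lr
WARMUP = 0.1             # Train_Config.warmup

total_steps = STEPS_PER_EPOCH * N_EPOCHS
warmup_steps = round(WARMUP * total_steps)


def lr_at_step(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to PEAK_LR
        return PEAK_LR * step / warmup_steps
    return PEAK_LR


print(total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps - 1):
    print(step, lr_at_step(step))
```

With these numbers the first four epochs (9944 of 99440 steps) are spent ramping up, so early-epoch accuracy is measured while the rate is still below its peak.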

Related

Loss for Multi-label Classification

I am working on a multi-label classification problem. My ground-truth (gt) labels have shape 14 x 10 x 128, where 14 is the batch size, 10 is the sequence length, and 128 is the length of a vector whose entries are 1 if the item in the sequence belongs to the object and 0 otherwise.
My output has the same shape: 14 x 10 x 128. Since my input sequences were of varying length, I had to pad them to a fixed length of 10. I'm trying to compute the loss of the model as follows:
total_loss = 0.0
unpadded_seq_lengths = [3, 4, 5, 7, 9, 3, 2, 8, 5, 3, 5, 7, 7, ...] # true lengths of sequences
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()
for data in training_dataloader:
    optimizer.zero_grad()
    # shape of input 14 x 10 x 128
    output = model(data)
    batch_loss = 0.0
    for batch_idx, sequence in enumerate(output):
        # sequence shape is 10 x 128
        true_seq_len = unpadded_seq_lengths[batch_idx]
        # only keep unpadded gt and predicted labels since we don't want loss to be influenced by padded values
        predicted_labels = sequence[:true_seq_len, :] # for example, 3 x 128
        gt_labels = gt_labels_padded[batch_idx, :true_seq_len, :] # same shape as above, gt_labels_padded has shape 14 x 10 x 128
        # loop through unpadded predicted and gt labels and calculate loss
        for item_idx, predicted_labels_seq_item in enumerate(predicted_labels):
            # predicted_labels_seq_item and gt_labels_seq_item are 1D vectors of length 128
            gt_labels_seq_item = gt_labels[item_idx]
            current_loss = criterion(predicted_labels_seq_item, gt_labels_seq_item)
            total_loss += current_loss
            batch_loss += current_loss
    batch_loss.backward()
    optimizer.step()
Can anybody please check whether I'm calculating the loss correctly? Thanks.
Update:
Is this the correct approach for calculating accuracy metrics?
# batch size: 14
# seq length: 10
for epoch in range(10):
    TP = FP = TN = FN = 0.
    for x, y, mask in tr_dl:
        # mask shape: (10,)
        out = model(x) # out shape: (14, 10, 128)
        y_pred = (torch.sigmoid(out) >= 0.5).float().type(torch.int64) # consider all predictions above 0.5 as 1, rest 0
        y_pred = y_pred[mask] # y_pred shape: (14, 10, 10, 128)
        y_labels = y[mask] # y_labels shape: (14, 10, 10, 128)
        # do I flatten y_pred and y_labels?
        y_pred = y_pred.flatten()
        y_labels = y_labels.flatten()
        for idx, prediction in enumerate(y_pred):
            if prediction == 1 and y_labels[idx] == 1:
                # calculate IOU (overlap of prediction and gt bounding box)
                iou = 0.78 # assume we get this iou value for objects at idx
                if iou >= 0.5:
                    TP += 1
                else:
                    FP += 1
            elif prediction == 1 and y_labels[idx] == 0:
                FP += 1
            elif prediction == 0 and y_labels[idx] == 1:
                FN += 1
            else:
                TN += 1
    EPOCH_ACC = (TP + TN) / (TP + TN + FP + FN)

It is usually recommended to stick with batch-wise operations and avoid single-element processing steps in the main training loop. One way to handle this case is to make your dataset return padded inputs and labels, together with a mask that will be useful for the loss computation. In other words, to compute the loss term for sequences of varying sizes, we use a mask instead of slicing each sequence individually.
Dataset
The way to proceed is to make sure you build the mask in the dataset and not in the inference loop. Here I am showing a minimal implementation that you should be able to transfer to your dataset without much hassle:
import random
import torch
from torch.utils import data
# Illustrative sizes (assumed values; adjust to your data)
SEQ_LEN, EMB_SIZE, N_CLASSES = 10, 128, 128
class Dataset(data.Dataset):
    def __init__(self):
        super().__init__()
    def __len__(self):
        return 100
    def __getitem__(self, index):
        i = random.randint(5, SEQ_LEN) # for demo purposes, generate x with a random length
        x = torch.rand(i, EMB_SIZE)
        y = torch.randint(0, 2, (i, N_CLASSES)).float() # multi-hot targets: one 0/1 entry per class
        # pad data to fit in batch
        n_pad = SEQ_LEN - len(x)
        x_padded = torch.cat((torch.zeros(n_pad, EMB_SIZE), x))
        y_padded = torch.cat((torch.zeros(n_pad, N_CLASSES), y))
        # construct tensor to mask loss: 0 for padded positions, 1 for real ones
        mask = torch.cat((torch.zeros(n_pad), torch.ones(len(x))))
        return x_padded, y_padded, mask
Essentially, in __getitem__ we not only pad the input x and target y with zeros, we also construct a simple mask that marks the positions of the padded values in the currently processed element.
Notice how:
x_padded, shaped (SEQ_LEN, EMB_SIZE)
y_padded, shaped (SEQ_LEN, N_CLASSES)
mask, shaped (SEQ_LEN,)
are three tensors whose shapes are invariant across the dataset, yet mask contains the padding information we need to compute the loss function appropriately.
Inference
The loss you've used, nn.BCEWithLogitsLoss, is the correct one, since it's a multi-dimensional loss used for binary classification. In other words, you can use it in this multi-label classification task by treating each of the 128 logits as an individual binary prediction. Do not use nn.CrossEntropyLoss, as suggested elsewhere, since the softmax will push a single logit (i.e. class), which is the behaviour required for single-label classification tasks.
Therefore, in the training loop, we simply have to apply the mask to our loss.
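for x, y, mask in dl:
    y_pred = model(x)
    # bce must be nn.BCEWithLogitsLoss(reduction='none') so that one loss value per logit is kept
    # mask is (batch, SEQ_LEN); unsqueeze broadcasts it over the class dimension
    loss = mask.unsqueeze(-1)*bce(y_pred, y)
    # backpropagation, loss postprocessing, logs, etc.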
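To make the masking step concrete, here is a minimal, self-contained sketch. The tensor sizes, the random stand-in for the model output, and the choice to average over the unpadded entries are all assumptions made for illustration; only `nn.BCEWithLogitsLoss` with `reduction="none"` is standard PyTorch API.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the question's data)
BATCH, SEQ_LEN, N_CLASSES = 4, 10, 128

torch.manual_seed(0)
logits = torch.randn(BATCH, SEQ_LEN, N_CLASSES)  # stand-in for model(x)
targets = torch.randint(0, 2, (BATCH, SEQ_LEN, N_CLASSES)).float()

# 1 marks a real time step, 0 marks padding (padding sits at the front,
# matching the dataset sketch in this answer)
lengths = torch.tensor([3, 10, 5, 7])
positions = torch.arange(SEQ_LEN).unsqueeze(0)                 # (1, SEQ_LEN)
mask = (positions >= SEQ_LEN - lengths.unsqueeze(1)).float()   # (BATCH, SEQ_LEN)

# reduction="none" keeps one loss value per logit so the mask can be applied
bce = nn.BCEWithLogitsLoss(reduction="none")
per_element = bce(logits, targets)                             # (BATCH, SEQ_LEN, N_CLASSES)

# Broadcast the mask over the class dimension, then average over real entries only
masked = per_element * mask.unsqueeze(-1)
loss = masked.sum() / (mask.sum() * N_CLASSES)
print(loss.item())
```

Dividing by the number of unpadded entries, rather than calling `.mean()` on the masked tensor, keeps the loss scale independent of how much padding a batch happens to contain.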

This is what you need for the first part of the question. There are already loss functions implemented in TensorFlow for this: https://medium.com/@aadityaura_26777/the-loss-function-for-multi-label-and-multi-class-f68f95cae525. Yours is just tf.nn.weighted_cross_entropy_with_logits, but you need to set the weight.
The second part of the question is not straightforward because of the conditioning on the IOU. In general, machine-learning code should rely heavily on vectorized (matrix) operations. In your case, you probably need to pre-calculate the IOU condition as a vector of 1s and 0s and multiply it element-wise with y_pred, which gives you a modified y_pred. After that, you can use any available accuracy function to calculate the final result.
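As a rough illustration of replacing the per-element Python loop with vectorized operations, the sketch below reproduces the counting rules from the question's update using boolean masks. The data is random, and `iou_ok` is a hypothetical precomputed tensor (True where the IoU for that entry is at least 0.5); how the IoU itself is obtained is outside the scope of this sketch.

```python
import torch

torch.manual_seed(0)
N = 1000  # number of unpadded (item, class) entries after masking and flattening

y_pred = torch.randint(0, 2, (N,)).bool()   # thresholded predictions
y_true = torch.randint(0, 2, (N,)).bool()   # ground-truth labels
iou_ok = torch.rand(N) >= 0.5               # hypothetical: IoU >= 0.5, precomputed per entry

# Same rules as the loop in the question:
# prediction 1 and label 1 counts as TP only when the IoU test passes, else FP
both = y_pred & y_true
tp = (both & iou_ok).sum().item()
fp = ((both & ~iou_ok) | (y_pred & ~y_true)).sum().item()
fn = (~y_pred & y_true).sum().item()
tn = (~y_pred & ~y_true).sum().item()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fp, fn, tn, accuracy)
```

Note that multiplying y_pred by the IoU vector, as suggested above, would turn a failed-IoU match into a false negative rather than a false positive, so the two variants give slightly different counts.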

If you can use CrossEntropyLoss instead of BCEWithLogitsLoss, it has an option called ignore_index that you can use to exclude the padded positions. The difference between the two losses is the activation function used (softmax vs. sigmoid), but I think you can still use CrossEntropyLoss for binary classification as well.

Facing this error while classifying Images, containing 10 classes in pytorch, in ResNet50. My code is:

This is the code I am implementing: I am using a subset of the CalTech256 dataset to classify images of 10 different kinds of animals. We will go over the dataset preparation, data augmentation and then steps to build the classifier.
def train_and_validate(model, loss_criterion, optimizer, epochs=25):
    '''
    Function to train and validate
    Parameters
        :param model: Model to train and validate
        :param loss_criterion: Loss Criterion to minimize
        :param optimizer: Optimizer for computing gradients
        :param epochs: Number of epochs (default=25)
    Returns
        model: Trained Model with best validation accuracy
        history: (dict object): Having training loss, accuracy and validation loss, accuracy
    '''
    start = time.time()
    history = []
    best_acc = 0.0
    for epoch in range(epochs):
        epoch_start = time.time()
        print("Epoch: {}/{}".format(epoch+1, epochs))
        # Set to training mode
        model.train()
        # Loss and Accuracy within the epoch
        train_loss = 0.0
        train_acc = 0.0
        valid_loss = 0.0
        valid_acc = 0.0
        for i, (inputs, labels) in enumerate(train_data_loader):
            inputs = inputs.to(device)
            labels = labels.to(device)
            # Clean existing gradients
            optimizer.zero_grad()
            # Forward pass - compute outputs on input data using the model
            outputs = model(inputs)
            # Compute loss
            loss = loss_criterion(outputs, labels)
            # Backpropagate the gradients
            loss.backward()
            # Update the parameters
            optimizer.step()
            # Compute the total loss for the batch and add it to train_loss
            train_loss += loss.item() * inputs.size(0)
            # Compute the accuracy
            ret, predictions = torch.max(outputs.data, 1)
            correct_counts = predictions.eq(labels.data.view_as(predictions))
            # Convert correct_counts to float and then compute the mean
            acc = torch.mean(correct_counts.type(torch.FloatTensor))
            # Compute total accuracy in the whole batch and add to train_acc
            train_acc += acc.item() * inputs.size(0)
            #print("Batch number: {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}".format(i, loss.item(), acc.item()))
        # Validation - No gradient tracking needed
        with torch.no_grad():
            # Set to evaluation mode
            model.eval()
            # Validation loop
            for j, (inputs, labels) in enumerate(valid_data_loader):
                inputs = inputs.to(device)
                labels = labels.to(device)
                # Forward pass - compute outputs on input data using the model
                outputs = model(inputs)
                # Compute loss
                loss = loss_criterion(outputs, labels)
                # Compute the total loss for the batch and add it to valid_loss
                valid_loss += loss.item() * inputs.size(0)
                # Calculate validation accuracy
                ret, predictions = torch.max(outputs.data, 1)
                correct_counts = predictions.eq(labels.data.view_as(predictions))
                # Convert correct_counts to float and then compute the mean
                acc = torch.mean(correct_counts.type(torch.FloatTensor))
                # Compute total accuracy in the whole batch and add to valid_acc
                valid_acc += acc.item() * inputs.size(0)
                #print("Validation Batch number: {:03d}, Validation: Loss: {:.4f}, Accuracy: {:.4f}".format(j, loss.item(), acc.item()))
        # Find average training loss and training accuracy
        avg_train_loss = train_loss/train_data_size
        avg_train_acc = train_acc/train_data_size
        # Find average validation loss and validation accuracy
        avg_valid_loss = valid_loss/valid_data_size
        avg_valid_acc = valid_acc/valid_data_size
        history.append([avg_train_loss, avg_valid_loss, avg_train_acc, avg_valid_acc])
        epoch_end = time.time()
        print("Epoch : {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}%, \n\t\tValidation : Loss : {:.4f}, Accuracy: {:.4f}%, Time: {:.4f}s".format(epoch, avg_train_loss, avg_train_acc*100, avg_valid_loss, avg_valid_acc*100, epoch_end-epoch_start))
        # Save if the model has best accuracy till now
        torch.save(model, dataset+'_model_'+str(epoch)+'.pt')
    return model, history
# Load pretrained ResNet50 Model
resnet50 = models.resnet50(pretrained=True)
#resnet50 = resnet50.to('cuda:0')
# Freeze model parameters
for param in resnet50.parameters():
    param.requires_grad = False
# Change the final layer of ResNet50 Model for Transfer Learning
fc_inputs = resnet50.fc.in_features
resnet50.fc = nn.Sequential(
    nn.Linear(fc_inputs, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, num_classes), # Since 10 possible outputs
    nn.LogSoftmax(dim=1) # For using NLLLoss()
)
# Convert model to be used on GPU
# resnet50 = resnet50.to('cuda:0')
Error is this:
RuntimeError                              Traceback (most recent call last)
 in ()
      6 # Train the model for 25 epochs
      7 num_epochs = 30
----> 8 trained_model, history = train_and_validate(resnet50, loss_func, optimizer, num_epochs)
      9
     10 torch.save(history, dataset+'_history.pt')
 in train_and_validate(model, loss_criterion, optimizer, epochs)
     43
     44         # Compute loss
---> 45         loss = loss_criterion(outputs, labels)
     46
     47         # Backpropagate the gradients
~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)
~\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
    202
    203     def forward(self, input, target):
--> 204         return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
    205
    206
~\Anaconda3\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1836                          .format(input.size(0), target.size(0)))
   1837     if dim == 2:
-> 1838         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   1839     elif dim == 4:
   1840         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at C:\Users\builder\AppData\Local\Temp\pip-req-build-0i480kur\aten\src\THNN/generic/ClassNLLCriterion.c:97

This happens when there are either incorrect labels in your dataset or the labels are 1-indexed (instead of 0-indexed). As the error message says, cur_target must be non-negative and smaller than the total number of classes (10). To verify the issue, check the maximum and minimum label in your dataset. If the data is indeed 1-indexed, just subtract one from every label and you should be fine.
Note that another possible reason is that some labels in the data are -1. Some (especially older) datasets use -1 to indicate a wrong or dubious label. If you find such labels, just discard those samples.
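The check described above can be sketched in a few lines. The helper names `check_labels` and `shift_to_zero_based` are hypothetical, and the toy label list stands in for the real dataset's labels.

```python
from collections import Counter

NUM_CLASSES = 10


def check_labels(labels, num_classes=NUM_CLASSES):
    """Return (min, max, invalid) for an iterable of integer class labels.

    `invalid` maps each out-of-range label to how often it occurs.
    """
    labels = [int(label) for label in labels]
    invalid = Counter(label for label in labels if not 0 <= label < num_classes)
    return min(labels), max(labels), dict(invalid)


def shift_to_zero_based(labels):
    """Map 1-indexed labels (1..K) to 0-indexed labels (0..K-1)."""
    return [int(label) - 1 for label in labels]


# Toy labels standing in for the real dataset: 1-indexed, so 10 is out of range
toy = [1, 2, 10, 5, 10, 3]
print(check_labels(toy))
print(check_labels(shift_to_zero_based(toy)))
```

If the data is loaded with torchvision's ImageFolder, the integer labels are usually available as `dataset.targets` and can be passed to the check directly.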

Pytorch model stuck at 0.5 though loss decreases consistently

This is using PyTorch
I have been trying to train a UNet model on my images; however, my model accuracy is always exactly 0.5, even though the loss does decrease.
I have also checked for class imbalance and tried different learning rates. The learning rate affects the loss but not the accuracy.
My architecture is below (from here):
""" `UNet` class is based on https://arxiv.org/abs/1505.04597
The U-Net is a convolutional encoder-decoder neural network.
Contextual spatial information (from the decoding,
expansive pathway) about an input tensor is merged with
information representing the localization of details
(from the encoding, compressive pathway).
Modifications to the original paper:
(1) padding is used in 3x3 convolutions to prevent loss
    of border pixels
(2) merging outputs does not require cropping due to (1)
(3) residual connections can be used by specifying
    UNet(merge_mode='add')
(4) if non-parametric upsampling is used in the decoder
    pathway (specified by up_mode='upsample'), then an
    additional 1x1 2d convolution occurs after upsampling
    to reduce channel dimensionality by a factor of 2.
    This channel halving happens with the convolution in
    the transpose convolution (specified by up_mode='transpose')
Arguments:
    in_channels: int, number of channels in the input tensor.
        Default is 3 for RGB images. Our SPARCS dataset is 13 channel.
    depth: int, number of MaxPools in the U-Net. During training, input size needs to be
        (depth-1) times divisible by 2
    start_filts: int, number of convolutional filters for the first conv.
    up_mode: string, type of upconvolution. Choices: 'transpose' for transpose convolution
"""
class UNet(nn.Module):
    def __init__(self, num_classes, depth, in_channels, start_filts=16, up_mode='transpose', merge_mode='concat'):
        super(UNet, self).__init__()
        if up_mode in ('transpose', 'upsample'):
            self.up_mode = up_mode
        else:
            raise ValueError("\"{}\" is not a valid mode for upsampling. Only \"transpose\" and \"upsample\" are allowed.".format(up_mode))
        if merge_mode in ('concat', 'add'):
            self.merge_mode = merge_mode
        else:
            raise ValueError("\"{}\" is not a valid mode for merging up and down paths. Only \"concat\" and \"add\" are allowed.".format(merge_mode))
        # NOTE: up_mode 'upsample' is incompatible with merge_mode 'add'
        if self.up_mode == 'upsample' and self.merge_mode == 'add':
            raise ValueError("up_mode \"upsample\" is incompatible with merge_mode \"add\" at the moment "
                             "because it doesn't make sense to use nearest neighbour to reduce depth channels (by half).")
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.start_filts = start_filts
        self.depth = depth
        self.down_convs = []
        self.up_convs = []
        # create the encoder pathway and add to a list
        for i in range(depth):
            ins = self.in_channels if i == 0 else outs
            outs = self.start_filts*(2**i)
            pooling = True if i < depth-1 else False
            down_conv = DownConv(ins, outs, pooling=pooling)
            self.down_convs.append(down_conv)
        # create the decoder pathway and add to a list
        # - careful! decoding only requires depth-1 blocks
        for i in range(depth-1):
            ins = outs
            outs = ins // 2
            up_conv = UpConv(ins, outs, up_mode=up_mode, merge_mode=merge_mode)
            self.up_convs.append(up_conv)
        self.conv_final = conv1x1(outs, self.num_classes)
        # add the list of modules to current module
        self.down_convs = nn.ModuleList(self.down_convs)
        self.up_convs = nn.ModuleList(self.up_convs)
        self.reset_params()
    @staticmethod
    def weight_init(m):
        if isinstance(m, nn.Conv2d):
            #https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
            ##Doc: https://pytorch.org/docs/stable/nn.init.html?highlight=xavier#torch.nn.init.xavier_normal_
            init.xavier_normal_(m.weight)
            init.constant_(m.bias, 0)
    def reset_params(self):
        for i, m in enumerate(self.modules()):
            self.weight_init(m)
    def forward(self, x):
        encoder_outs = []
        # encoder pathway, save outputs for merging
        for i, module in enumerate(self.down_convs):
            x, before_pool = module(x)
            encoder_outs.append(before_pool)
        for i, module in enumerate(self.up_convs):
            before_pool = encoder_outs[-(i+2)]
            x = module(before_pool, x)
        # No softmax is used. This means we need to use
        # nn.CrossEntropyLoss in your training script,
        # as this module includes a softmax already.
        x = self.conv_final(x)
        return x
Parameters are :
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x,y = train_sequence[0] ; batch_size = x.shape[0]
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
optim = torch.optim.Adam(model.parameters(),lr=0.01, weight_decay=1e-3)
criterion = nn.BCEWithLogitsLoss() #has sigmoid internally
epochs = 1000
The function for training is :
import torch.nn.functional as f
def train_model(epoch,train_sequence):
    """Train the model and report validation error with training error
    Args:
        model: the model to be trained
        criterion: loss function
        data_train (DataLoader): training dataset
    """
    model.train()
    for idx in range(len(train_sequence)):
        X, y = train_sequence[idx]
        images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
        masks = Variable(torch.from_numpy(y)).to(device)
        outputs = model(images)
        print(masks.shape, outputs.shape)
        loss = criterion(outputs, masks)
        optim.zero_grad()
        loss.backward()
        # Update weights
        optim.step()
    # total_loss = get_loss_train(model, data_train, criterion)
My function for calculating loss and accuracy is below:
def get_loss_train(model, train_sequence):
    """
    Calculate loss over train set
    """
    model.eval()
    total_acc = 0
    total_loss = 0
    for idx in range(len(train_sequence)):
        with torch.no_grad():
            X, y = train_sequence[idx]
            images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
            masks = Variable(torch.from_numpy(y)).to(device)
            outputs = model(images)
            loss = criterion(outputs, masks)
            preds = torch.argmax(outputs, dim=1).float()
            acc = accuracy_check_for_batch(masks.cpu(), preds.cpu(), images.size()[0])
            total_acc = total_acc + acc
            total_loss = total_loss + loss.cpu().item()
    return total_acc/(len(train_sequence)), total_loss/(len(train_sequence))
Edit : Code which runs (calls) the functions:
for epoch in range(epochs):
    train_model(epoch, train_sequence)
    train_acc, train_loss = get_loss_train(model,train_sequence)
    print("Train Acc:", train_acc)
    print("Train loss:", train_loss)
Can someone help me identify why the accuracy is always exactly 0.5?
Edit-2:
As asked accuracy_check_for_batch function is here:
def accuracy_check_for_batch(masks, predictions, batch_size):
    total_acc = 0
    for index in range(batch_size):
        total_acc += accuracy_check(masks[index], predictions[index])
    return total_acc/batch_size
and
def accuracy_check(mask, prediction):
    ims = [mask, prediction]
    np_ims = []
    for item in ims:
        if 'str' in str(type(item)):
            item = np.array(Image.open(item))
        elif 'PIL' in str(type(item)):
            item = np.array(item)
        elif 'torch' in str(type(item)):
            item = item.numpy()
        np_ims.append(item)
    compare = np.equal(np_ims[0], np_ims[1])
    accuracy = np.sum(compare)
    return accuracy/len(np_ims[0].flatten())

I found the mistake.
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
should be
model = UNet(num_classes = 1, depth=5, in_channels=5, merge_mode='concat').to(device)
because I am using BCEWithLogitsLoss, which expects one logit per binary prediction.
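One related point that may be worth checking after this change: with a single output channel, `torch.argmax(outputs, dim=1)` (as used in get_loss_train above) always returns index 0, so predictions would need to come from thresholding the sigmoid instead. The sketch below uses random tensors as stand-ins for the model output and the masks, and assumes the masks have shape (batch, 1, H, W).

```python
import torch

torch.manual_seed(0)
# Stand-in for model(images) with num_classes=1: (batch, 1, H, W) logits
outputs = torch.randn(2, 1, 4, 4)
# Stand-in binary masks with the same shape (an assumption about the data)
masks = (torch.rand(2, 1, 4, 4) > 0.5).float()

# argmax over a single channel is always index 0, whatever the logits are
argmax_preds = torch.argmax(outputs, dim=1)
print(argmax_preds.unique())  # tensor([0])

# Threshold the sigmoid instead (probability 0.5 corresponds to logit 0)
preds = (torch.sigmoid(outputs) > 0.5).float()
accuracy = (preds == masks).float().mean().item()
print(accuracy)
```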

High bias convolutional neural network not improving with more layers/filters

I am training a convolutional neural network using TensorFlow to classify images of buildings into 5 classes.
Training dataset:
Class 1 - 3000 images
Class 2 - 3000 images
Class 3 - 3000 images
Class 4 - 3000 images
Class 5 - 3000 images
I started out with a very simple architecture:
Input image - 256 x 256 x 3
Convolutional layer 1 - 128 x 128 x 16 (3x3 filters, 16 filters, stride=2)
Convolutional layer 2 - 64 x 64 x 32 (3x3 filters, 32 filters, stride=2)
Convolutional layer 3 - 32 x 32 x 64 (3x3 filters, 64 filters, stride=2)
Max-pooling layer - 16 x 16 x 64 (2x2 pooling)
Fully-connected layer 1 - 1 x 1024
Fully-connected layer 2 - 1 x 64
Output - 1 x 5
Other details of my network:
Cost-function: tf.nn.softmax_cross_entropy_with_logits
Optimizer: Adam optimizer (Learning rate=0.01, Epsilon=0.1)
Mini-batch size: 5
My cost-function has a high starting value of around 10^10 and then drops rapidly to a value of about 1.6 (after a few hundred iterations) and saturates at that value (no matter how long I train the network for). The cost-function value on the test set is the same. This value is equivalent to predicting approximately equal probability for each class and it makes the same predictions for all images. My predictions look something like this:
[0.191877 0.203651 0.194455 0.200043 0.203081]
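The plateau value can be checked by hand: for K classes, a uniform prediction has cross-entropy -ln(1/K) = ln(K), which for K = 5 is about 1.609. A quick sketch of that arithmetic, using the prediction vector quoted above:

```python
import math

NUM_CLASSES = 5

# Cross-entropy of a uniform prediction: -ln(1/K) = ln(K)
uniform_loss = -math.log(1.0 / NUM_CLASSES)
print(round(uniform_loss, 5))  # 1.60944

# Cross-entropy for the quoted prediction, for each possible true class
probs = [0.191877, 0.203651, 0.194455, 0.200043, 0.203081]
losses = [-math.log(p) for p in probs]
print([round(loss, 3) for loss in losses])
```

Whichever class is the true one, the loss for that near-uniform prediction stays close to ln(5), which is why the cost settles at roughly 1.6.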
A high error on both the training and test set indicate high bias i.e. underfitting. I increased the complexity of my network by adding layers and increasing the number of filters and my latest network is this (the number of layers and filter sizes are similar to AlexNet):
Input image - 256 x 256 x 3
Convolutional layer 1 - 64 x 64 x 64 (11x11 filters, 64 filters, stride=4)
Convolutional layer 2 - 32 x 32 x 128 (5x5 filters, 128 filters, stride=2)
Convolutional layer 3 - 16 x 16 x 256 (3x3 filters, 256 filters, stride=2)
Convolutional layer 4 - 8 x 8 x 512 (3x3 filters, 512 filters, stride=2)
Convolutional layer 5 - 8 x 8 x 256 (3x3 filters, 256 filters, stride=1)
Fully-connected layer 1 - 1 x 4096
Fully-connected layer 2 - 1 x 4096
Fully-connected layer 3 - 1 x 4096
Dropout layer (0.5 probability)
Output - 1 x 5
However, my cost-function is still saturating at approximately 1.6 and making the same predictions.
My questions are:
What other solutions should I try to fix a high bias network? I have (and still am) trying different learning rates and initialisation of weights - but to no avail.
Is it because my training set is too small? Wouldn't a small training set lead to a high variance network? It would overfit to the training images and have low training error, but high test error.
Is it possible that there are no distinguishable features in these images? However, considering the fact that other CNNs can distinguish between breeds of dogs, this does not seem likely.
As a sanity check, I am training my network on a very small dataset (50 images) and I am expecting it to overfit. However, it doesn't look like it is going to; it looks like the same problem is going to occur.
Code:
import tensorflow as tf
sess = tf.Session()
BATCH_SIZE = 50
MAX_CAPACITY = 300
TRAINING_STEPS = 3001
# To get the list of image filenames and labels from the text file
def read_labeled_image_list(list_filename):
    f = open(list_filename,'r')
    filenames = []
    labels = []
    for line in f:
        filename, label = line[:-1].split(' ')
        filenames.append(filename)
        labels.append(int(label))
    return filenames,labels
# To get images and labels in batches
def add_to_batch(image,label):
    image_batch,label_batch = tf.train.batch([image,label],batch_size=BATCH_SIZE,num_threads=1,capacity=MAX_CAPACITY)
    return image_batch, tf.reshape(label_batch,[BATCH_SIZE])
# To decode a single image and its label
def read_image_with_label(input_queue):
    """ Image """
    # Read
    file_contents = tf.read_file(input_queue[0])
    example = tf.image.decode_png(file_contents)
    # Reshape
    my_image = tf.cast(example,tf.float32)
    my_image = tf.reshape(my_image,[256,256,3])
    # Normalisation
    my_image = my_image/255
    my_mean = tf.reduce_mean(my_image)
    # Centralisation
    my_image = my_image - my_mean
    """ Label """
    label = input_queue[1]-1
    return add_to_batch(my_image,label)
# Network
def inference(x):
    """ Layer 1: Convolutional """
    # Initialise variables
    W_conv1 = tf.Variable(tf.truncated_normal([11,11,3,64],stddev=0.0001),name='W_conv1')
    b_conv1 = tf.Variable(tf.constant(0.1,shape=[64]),name='b_conv1')
    # Convolutional layer
    h_conv1 = tf.nn.relu(tf.nn.conv2d(x,W_conv1,strides=[1,4,4,1],padding='SAME') + b_conv1)
    """ Layer 2: Convolutional """
    # Initialise variables
    W_conv2 = tf.Variable(tf.truncated_normal([5,5,64,128],stddev=0.0001),name='W_conv2')
    b_conv2 = tf.Variable(tf.constant(0.1,shape=[128]),name='b_conv2')
    # Convolutional layer
    h_conv2 = tf.nn.relu(tf.nn.conv2d(h_conv1,W_conv2,strides=[1,2,2,1],padding='SAME') + b_conv2)
    """ Layer 3: Convolutional """
    # Initialise variables
    W_conv3 = tf.Variable(tf.truncated_normal([3,3,128,256],stddev=0.0001),name='W_conv3')
    b_conv3 = tf.Variable(tf.constant(0.1,shape=[256]),name='b_conv3')
    # Convolutional layer
    h_conv3 = tf.nn.relu(tf.nn.conv2d(h_conv2,W_conv3,strides=[1,2,2,1],padding='SAME') + b_conv3)
    """ Layer 4: Convolutional """
    # Initialise variables
    W_conv4 = tf.Variable(tf.truncated_normal([3,3,256,512],stddev=0.0001),name='W_conv4')
    b_conv4 = tf.Variable(tf.constant(0.1,shape=[512]),name='b_conv4')
    # Convolutional layer
    h_conv4 = tf.nn.relu(tf.nn.conv2d(h_conv3,W_conv4,strides=[1,2,2,1],padding='SAME') + b_conv4)
    """ Layer 5: Convolutional """
    # Initialise variables
    W_conv5 = tf.Variable(tf.truncated_normal([3,3,512,256],stddev=0.0001),name='W_conv5')
    b_conv5 = tf.Variable(tf.constant(0.1,shape=[256]),name='b_conv5')
    # Convolutional layer
    h_conv5 = tf.nn.relu(tf.nn.conv2d(h_conv4,W_conv5,strides=[1,1,1,1],padding='SAME') + b_conv5)
    """ Layer X: Pooling
    # Pooling layer
    h_pool1 = tf.nn.max_pool(h_conv3,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME')"""
    """ Layer 6: Fully-connected """
    # Initialise variables
    W_fc1 = tf.Variable(tf.truncated_normal([8*8*256,4096],stddev=0.0001),name='W_fc1')
    b_fc1 = tf.Variable(tf.constant(0.1,shape=[4096]),name='b_fc1')
    # Multiplication layer
    h_conv5_reshaped = tf.reshape(h_conv5,[-1,8*8*256])
    h_fc1 = tf.nn.relu(tf.matmul(h_conv5_reshaped, W_fc1) + b_fc1)
    """ Layer 7: Fully-connected """
    # Initialise variables
    W_fc2 = tf.Variable(tf.truncated_normal([4096,4096],stddev=0.0001),name='W_fc2')
    b_fc2 = tf.Variable(tf.constant(0.1,shape=[4096]),name='b_fc2')
    # Multiplication layer
    h_fc2 = tf.nn.relu(tf.matmul(h_fc1, W_fc2) + b_fc2)
    """ Layer 8: Fully-connected """
    # Initialise variables
    W_fc3 = tf.Variable(tf.truncated_normal([4096,4096],stddev=0.0001),name='W_fc3')
    b_fc3 = tf.Variable(tf.constant(0.1,shape=[4096]),name='b_fc3')
    # Multiplication layer
    h_fc3 = tf.nn.relu(tf.matmul(h_fc2, W_fc3) + b_fc3)
    """ Layer 9: Dropout layer """
    # Keep/drop nodes with 50% chance
    h_dropout = tf.nn.dropout(h_fc3,0.5)
    """ Readout layer: Softmax """
    # Initialise variables
    W_softmax = tf.Variable(tf.truncated_normal([4096,5],stddev=0.0001),name='W_softmax')
    b_softmax = tf.Variable(tf.constant(0.1,shape=[5]),name='b_softmax')
    # Multiplication layer
    y_conv = tf.nn.relu(tf.matmul(h_dropout,W_softmax) + b_softmax)
    """ Summaries """
    tf.histogram_summary('W_conv1',W_conv1)
    tf.histogram_summary('W_conv2',W_conv2)
    tf.histogram_summary('W_conv3',W_conv3)
    tf.histogram_summary('W_conv4',W_conv4)
    tf.histogram_summary('W_conv5',W_conv5)
    tf.histogram_summary('W_fc1',W_fc1)
    tf.histogram_summary('W_fc2',W_fc2)
    tf.histogram_summary('W_fc3',W_fc3)
    tf.histogram_summary('W_softmax',W_softmax)
    tf.histogram_summary('b_conv1',b_conv1)
    tf.histogram_summary('b_conv2',b_conv2)
    tf.histogram_summary('b_conv3',b_conv3)
    tf.histogram_summary('b_conv4',b_conv4)
    tf.histogram_summary('b_conv5',b_conv5)
    tf.histogram_summary('b_fc1',b_fc1)
    tf.histogram_summary('b_fc2',b_fc2)
    tf.histogram_summary('b_fc3',b_fc3)
    tf.histogram_summary('b_softmax',b_softmax)
    return y_conv
# Training
def cost_function(y_label,y_conv):
    # Reshape y_label to one-hot vectors
    sparse_labels = tf.reshape(y_label,[BATCH_SIZE,1])
    indices = tf.reshape(tf.range(BATCH_SIZE),[BATCH_SIZE,1])
    concated = tf.concat(1,[indices,sparse_labels])
    dense_labels = tf.sparse_to_dense(concated,[BATCH_SIZE,5],1.0,0.0)
    # Cross-entropy
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_conv,dense_labels))
    # Accuracy
    y_prob = tf.nn.softmax(y_conv)
    correct_prediction = tf.equal(tf.argmax(dense_labels,1), tf.argmax(y_prob,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    # Add to summary
    tf.scalar_summary('loss',cost)
    tf.scalar_summary('accuracy',accuracy)
    return cost, accuracy
def main ():
    # To get list of filenames and labels
    filename = '/labels/filenames_with_labels_server.txt'
    image_list, label_list = read_labeled_image_list(filename)
    images = tf.convert_to_tensor(image_list, dtype=tf.string)
    labels = tf.convert_to_tensor(label_list,dtype=tf.int32)
    # To create the queue
    input_queue = tf.train.slice_input_producer([images,labels],shuffle=True,capacity=MAX_CAPACITY)
    # To train network
    image,label = read_image_with_label(input_queue)
    y_conv = inference(image)
    loss,acc = cost_function(label,y_conv)
    train_step = tf.train.AdamOptimizer(learning_rate=0.001,epsilon=0.1).minimize(loss)
    # To write and merge summaries
    writer = tf.train.SummaryWriter('/SummaryLogs/log', sess.graph)
    merged = tf.merge_all_summaries()
    # To save variables
    saver = tf.train.Saver()
    """ Run session """
    sess.run(tf.initialize_all_variables())
    tf.train.start_queue_runners(sess=sess)
    print('Running...')
    for step in range(1,TRAINING_STEPS):
        loss_val,acc_val,_,summary_str = sess.run([loss,acc,train_step,merged])
        writer.add_summary(summary_str,step)
        print "Step %d, Loss %g, Accuracy %g"%(step,loss_val,acc_val)
        if(step == 1):
            save_path = saver.save(sess,'/SavedVariables/model',global_step=step)
            print "Initial model saved: %s"%save_path
    save_path = saver.save(sess,'/SavedVariables/model-final')
    print "Final model saved: %s"%save_path
    """ Close session """
    print('Finished')
    sess.close()
if __name__ == '__main__':
    main()
EDIT:
After making some changes, I managed to get the network to overfit to a small training set of 50 images.
Changes:
Initialization of weights using Xavier initialization
Initialization of bias to zero
No normalisation of images i.e. no division by 255
Centred the images by subtracting the mean pixel value (calculated over the whole training set). In this case, the mean was 114.
Encouraged by this, I proceeded to train my network on the whole training set, only to encounter the SAME issue again. These are the outputs:
Step 1, Loss 1.37815, Accuracy 0.4
y_conv (before softmax):
[[ 0.30913264 0. 1.20176554 0. 0. ]
[ 0. 0. 1.23200822 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 1.65852785 0.01910716 0. ]
[ 0. 0. 0.94612855 0. 0.10457891]]
y_prob (after softmax):
[[ 0.1771856 0.130069 0.43260741 0.130069 0.130069 ]
[ 0.13462381 0.13462381 0.46150482 0.13462381 0.13462381]
[ 0.2 0.2 0.2 0.2 0.2 ]
[ 0.1078648 0.1078648 0.56646001 0.1099456 0.1078648 ]
[ 0.14956713 0.14956713 0.38524282 0.14956713 0.16605586]]
Very quickly it becomes:
Step 39, Loss 1.60944, Accuracy 0.2
y_conv (before softmax):
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
y_prob (after softmax):
[[ 0.2 0.2 0.2 0.2 0.2]
[ 0.2 0.2 0.2 0.2 0.2]
[ 0.2 0.2 0.2 0.2 0.2]
[ 0.2 0.2 0.2 0.2 0.2]
[ 0.2 0.2 0.2 0.2 0.2]]
Clearly a y_conv of all zeros is not a good sign. Looking at the histograms, the weight variables do not change after initialization; only the bias variables change.
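One way to read the Step 39 numbers: when every logit is zero, softmax assigns 0.2 to each of the five classes, and the cross-entropy is ln 5 ≈ 1.60944, which matches the reported loss. So the network is predicting a uniform distribution regardless of input. The following is a minimal pure-Python check of that arithmetic; `softmax` and `cross_entropy` are illustrative helpers written for this example, not functions from the original code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross-entropy for one example with an integer class label."""
    return -math.log(probs[true_class])

zero_logits = [0.0] * 5          # one row of y_conv at step 39
probs = softmax(zero_logits)     # uniform distribution over 5 classes
loss = cross_entropy(probs, 2)   # same value whichever class is correct

print(probs)
print(round(loss, 5))            # 1.60944
```

The same helper reproduces the Step 1 rows as well, for example the first row of y_conv gives roughly 0.13 for each zero logit.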

This is not so much a "complete" answer but rather a "things you can try if you are facing a similar problem" answer.
I managed to get my network to start to learn something with the following changes:
Xavier initialization of weights
Zero initialization of bias
No normalization of images to [0,1]
Subtracting the mean pixel value (calculated over the whole training set) from the images
No ReLU in the final layer that calculates y_conv
After 3000 iterations of training with a batch-size of 50 images (approximately 10 epochs):
On the testing set it does not perform so well, because my training set is very small and my network was over-fitting; this was expected, so I am not surprised. At least now I know that I have to focus on getting a larger training set, adding more regularization, or simplifying my network.
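As a rough illustration of the first four changes listed above, here is a minimal pure-Python sketch. It assumes the Glorot/Xavier uniform scheme, with weights drawn from [-limit, limit] where limit = sqrt(6 / (fan_in + fan_out)). The function names, the layer size of 256, and the sample pixel values are hypothetical; only the mean of 114 and the five output classes come from the question. In practice a framework's built-in initializer would normally be used instead of hand-rolled code like this.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def zero_bias(fan_out):
    """Bias vector initialized to zero."""
    return [0.0] * fan_out

def center_pixels(pixels, mean_pixel):
    """Subtract the training-set mean pixel value; no division by 255."""
    return [p - mean_pixel for p in pixels]

rng = random.Random(0)                                    # fixed seed for repeatability
weights = xavier_uniform(fan_in=256, fan_out=5, rng=rng)  # hypothetical layer size
bias = zero_bias(5)
centered = center_pixels([0, 114, 255], mean_pixel=114)
print(centered)  # [-114, 0, 141]
```

The fifth change, dropping the ReLU on the final layer, matters because a ReLU there clamps every negative logit to zero, so the output layer cannot express a preference against a class.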

Tensorflow cross-entropy NaN, and changing learning rate doesn't seem to have an impact

TL;DR
Trying to build a bidirectional RNN for sequence tagging using TensorFlow.
The goal is to take the input "I like New York" and produce the output "O O LOC_START LOC".
The graph compiles and runs, but the loss becomes NaN after 1 or 2 batches. I understand this could be a problem with the learning rate, but changing the learning rate seems to have no impact. Using AdamOptimizer at the moment.
Any help would be appreciated.
Here is my code:
# The input and output: a sequence of words, embedded, and a sequence of word classifications, one-hot
self.input_x = tf.placeholder(tf.float32, [None, n_sequence_length, n_embedding_dim], name="input_x")
self.input_y = tf.placeholder(tf.float32, [None, n_sequence_length, n_output_classes], name="input_y")
# New shape: [sequence_length, batch_size (None), embedding_dim]
inputs = tf.transpose(self.input_x, [1, 0, 2])
# New shape: [sequence_length * batch_size (None), embedding_dim]
inputs = tf.reshape(inputs, [-1, n_embedding_dim])
# Define weights
w_hidden = tf.Variable(tf.random_normal([n_embedding_dim, 2 * n_hidden_states]))
b_hidden = tf.Variable(tf.random_normal([2 * n_hidden_states]))
w_out = tf.Variable(tf.random_normal([2 * n_hidden_states, n_output_classes]))
b_out = tf.Variable(tf.random_normal([n_output_classes]))
# Linear activation for the input; this will make it fit to the hidden size
inputs = tf.nn.xw_plus_b(inputs, w_hidden, b_hidden)
# Split up the batches into a Python list
inputs = tf.split(0, n_sequence_length, inputs)
# Now we define our cell. It takes one word as input, a vector of embedding_size length
cell_forward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
cell_backward = rnn_cell.BasicLSTMCell(n_hidden_states, forget_bias=0.0)
# And we add a Dropout Wrapper as appropriate
if is_training and prob_keep < 1:
    cell_forward = rnn_cell.DropoutWrapper(cell_forward, output_keep_prob=prob_keep)
    cell_backward = rnn_cell.DropoutWrapper(cell_backward, output_keep_prob=prob_keep)
# And we make it a few layers deep
cell_forward_multi = rnn_cell.MultiRNNCell([cell_forward] * n_layers)
cell_backward_multi = rnn_cell.MultiRNNCell([cell_backward] * n_layers)
# returns outputs = a list T of tensors [batch, 2*hidden]
outputs = rnn.bidirectional_rnn(cell_forward_multi, cell_backward_multi, inputs, dtype=dtypes.float32)
# [sequence, batch, 2*hidden]
outputs = tf.pack(outputs)
# [batch, sequence, 2*hidden]
outputs = tf.transpose(outputs, [1, 0, 2])
# [batch * sequence, 2 * hidden]
outputs = tf.reshape(outputs, [-1, 2 * n_hidden_states])
# [batch * sequence, output_classes]
self.scores = tf.nn.xw_plus_b(outputs, w_out, b_out)
# [batch * sequence, output_classes]
inputs_y = tf.reshape(self.input_y, [-1, n_output_classes])
# [batch * sequence]
self.predictions = tf.argmax(self.scores, 1, name="predictions")
# Now calculate the cross-entropy
losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, inputs_y)
self.loss = tf.reduce_mean(losses, name="loss")
if not is_training:
    return
# Training
self.train_op = tf.train.AdamOptimizer(1e-4).minimize(self.loss)
# Evaluate model
correct_pred = tf.equal(self.predictions, tf.argmax(inputs_y, 1))
self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name="accuracy")

Could there be an example in the training data where something is wrong with the labels? Then, when training hits that example, the cost becomes NaN. I'm suggesting this because it seems to still happen when the learning rate is zero, and after just a few batches.
Here is how I would debug:
Set the batch size to 1.
Set the learning rate to 0.0.
When you run a batch, have TensorFlow output the intermediate values, not just the cost.
Run until you get a NaN, then check what the input was and examine the intermediate outputs to determine at which point the NaN first appears.
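The last debugging step can be sketched in plain Python. This is only an illustration under assumptions: `first_nan` is a hypothetical helper, and the fetched values below are made-up numbers standing in for whatever intermediate tensors you choose to evaluate for a single-example batch.

```python
import math

def first_nan(named_values):
    """Return the name of the first intermediate value containing a NaN, or None.

    `named_values` is a list of (name, list_of_floats) pairs, ordered the
    same way the values are computed in the graph.
    """
    for name, values in named_values:
        if any(math.isnan(v) for v in values):
            return name
    return None

# Hypothetical intermediate outputs for one batch of size 1.
nan = float("nan")
fetched = [
    ("inputs", [0.5, -1.2, 0.3]),
    ("rnn_outputs", [0.1, nan, 0.4]),
    ("scores", [nan, nan, nan]),
    ("loss", [nan]),
]
print(first_nan(fetched))  # rnn_outputs
```

With the code in the question, the values could be obtained by passing several tensors (for example the scores and the loss) to a single `sess.run` call and flattening each result into a list. If the very first entry, the input itself, already contains a NaN, the problem is in the data rather than in the graph.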
