I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.
First I define a BERT-tokenizer and then tokenize my text:
import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import DistilBertTokenizer, RobertaTokenizer

distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)
def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [], [], []
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True,
                                       return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)
Next, I load my integer labels, which are 0, 1, 2, 3:
d_y = train['label']
0    0
1    1
2    0
3    2
4    0
5    0
6    0
7    0
8    3
9    1
Name: label, dtype: int64
Then I load the pretrained Transformer model and put layers on top of it. I use sparse categorical cross-entropy loss when compiling the model:
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel

distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)
config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config=config)
input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
for layer in model.layers[:3]:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
Finally, I run the model using the previously tokenized input_ids and input_masks as inputs and get a NaN loss after the first epoch:
model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)
Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>
EDIT: The model computes a loss on the first epoch but starts returning NaNs at the second epoch. What could be causing that problem?
Does anyone have any ideas about what I am doing wrong?
All suggestions are welcome!
The problem is here:
X = tf.keras.layers.Dense(1, activation='softmax')(X)
At the end of the network, you only have a single neuron, corresponding to a single class. The output probability is always 100% for class 0. If you have classes 0, 1, 2, 3, you need to have 4 outputs at the end.
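For reference, a minimal sketch of the corrected final layer, keeping everything else in the model unchanged:
# One output unit per class; labels 0-3 therefore require 4 units
X = tf.keras.layers.Dense(4, activation='softmax')(X)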
The problem likely occurred because num_labels was not specified.
At the final output layer, the number of labels K defaults to 1, and as mentioned, softmax is
\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}
so when fine-tuning for multi-class classification we need to provide num_labels.
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
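If you want to stay with DistilBERT as in the question, the equivalent sequence-classification setup would look along these lines (a sketch, not tied to the asker's exact pipeline):
# Hypothetical DistilBERT variant of the same idea: 4 labels for classes 0-3
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)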
I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.
train = pd.read_csv('train_dataset.csv')
d = train['text']
d = d.dropna()
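A slightly safer variant drops rows where either column is missing, so the texts and labels stay aligned (a sketch assuming the column names 'text' and 'label' from the question):
train = pd.read_csv('train_dataset.csv')
train = train.dropna(subset=['text', 'label'])  # drop rows missing either the text or the label
d = train['text']
d_y = train['label']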
I had a similar problem where my model produced NaN losses only during the last batch of an epoch. All the other batches resulted in typical loss values. In my case, the problem was that the batches were not all the same size, which led to NaN losses. After I made all batches equally sized, the NaNs were gone. It might be worth checking whether this is also true in your case.
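If you want to rule this out, one way to force equally sized batches in TensorFlow is a tf.data pipeline that drops the last incomplete batch; a rough sketch, assuming the arrays from the question:
ds = tf.data.Dataset.from_tensor_slices(((input_ids, input_masks), d_y.values))
ds = ds.batch(32, drop_remainder=True)  # the final, smaller batch is discarded
model.fit(ds, epochs=3)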
Related
I am trying to calculate the F1 score for a multi-class classification problem using the Cifar10 dataset. I am importing f1_score from the sklearn library. However, I keep getting the following error message:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
Below is my function for testing the model on my validation set. Would someone be able to explain how to calculate F1 when performing multi-class classification? I am getting quite confused.
@torch.no_grad()
def valid_function(model, optimizer, val_loader):
    model.eval()
    val_loss = 0.0
    val_accu = 0.0
    f_one = []
    for i, (x_val, y_val) in enumerate(val_loader):
        x_val, y_val = x_val.to(device), y_val.to(device)
        val_pred = model(x_val)
        loss = criterion(val_pred, y_val)
        val_loss += loss.item()
        val_accu += accuracy(val_pred, y_val)
        f_one.append(f1_score(y_val.cpu(), val_pred.cpu()))
    val_loss /= len(val_loader)
    val_accu /= len(val_loader)
    print('Val Loss: %.3f | Val Accuracy: %.3f' % (val_loss, val_accu))
    return val_loss, val_accu
The problem is here:
val_pred = model(x_val)
The model returns continuous scores (one value per class), while f1_score expects discrete class labels, so you need to convert the outputs to predicted class indices first. For example, in your case:
val_pred = torch.argmax(model(x_val), dim=1)
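Note also that with multi-class targets sklearn's f1_score needs an averaging strategy, otherwise it raises a ValueError; a minimal sketch of the adjusted call (macro averaging is just one reasonable choice):
f_one.append(f1_score(y_val.cpu().numpy(), val_pred.cpu().numpy(), average='macro'))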
I am running an RNN model with the PyTorch library to do sentiment analysis on movie reviews, but the training loss and validation loss remain constant throughout training. I have looked at different online sources but am still stuck.
Can someone please help and take a look at my code?
Some parameters are specified by the assignment:
embedding_dim = 64
n_layers = 1
n_hidden = 128
dropout = 0.5
batch_size = 32
My main code
txt_field = data.Field(tokenize=word_tokenize, lower=True, include_lengths=True, batch_first=True)
label_field = data.Field(sequential=False, use_vocab=False, batch_first=True)

train = data.TabularDataset(path=part2_filepath+"train_Copy.csv", format='csv',
                            fields=[('label', label_field), ('text', txt_field)], skip_header=True)
validation = data.TabularDataset(path=part2_filepath+"validation_Copy.csv", format='csv',
                                 fields=[('label', label_field), ('text', txt_field)], skip_header=True)

txt_field.build_vocab(train, min_freq=5)
label_field.build_vocab(train, min_freq=2)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, validation, test),
    batch_size=32,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device)

n_vocab = len(txt_field.vocab)
embedding_dim = 64
n_hidden = 128
n_layers = 1
dropout = 0.5

model = Text_RNN(n_vocab, embedding_dim, n_hidden, n_layers, dropout)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.BCELoss().to(device)

N_EPOCHS = 15
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    train_loss, train_acc = RNN_train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, criterion)
My Model
class Text_RNN(nn.Module):
    def __init__(self, n_vocab, embedding_dim, n_hidden, n_layers, dropout):
        super(Text_RNN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.emb = nn.Embedding(n_vocab, embedding_dim)
        self.rnn = nn.RNN(
            input_size=embedding_dim,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        self.sigmoid = nn.Sigmoid()
        self.linear = nn.Linear(n_hidden, 2)

    def forward(self, sent, sent_len):
        sent_emb = self.emb(sent)
        outputs, hidden = self.rnn(sent_emb)
        prob = self.sigmoid(self.linear(hidden.squeeze(0)))
        return prob
The training function
def RNN_train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        batch.label = batch.label.type(torch.FloatTensor).squeeze()
        predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
        loss = criterion(predictions, batch.label)
        loss.requires_grad = True
        acc = binary_accuracy(predictions, batch.label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
The output when I run it on 10 test reviews + 5 validation reviews:
Epoch [1/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [2/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [3/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [4/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
...
I'd appreciate it if someone could point me in the right direction. I believe it is something in the training code, since for the most part I followed this article:
https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
In your training loop you are using the indices from the max operation, which is not differentiable, so you cannot track gradients through it. Because it is not differentiable, everything afterwards does not track the gradients either, and calling loss.backward() would fail.
# The indices of the max operation are not differentiable
predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
loss = criterion(predictions, batch.label)
# Setting requires_grad to True to make .backward() work, although incorrectly.
loss.requires_grad = True
Presumably you wanted to fix that by setting requires_grad, but that does not do what you expect, because no gradients are propagated to your model, since the only thing in your computational graph would be the loss itself, and there is nowhere to go from there.
You used the indices to get either 0 or 1, since the output of your model is essentially two classes, and you wanted the one with the higher probability. For the Binary Cross Entropy loss, you only need one class that has a value between 0 and 1 (continuous), which you get by applying the sigmoid function.
So you need to change the output channels of the final linear layer to 1:
self.linear = nn.Linear(n_hidden, 1)
and in your training loop you can remove the torch.max call and also the requires_grad.
# Squeeze the model's output to get rid of the single class dimension
predictions = model(text, text_lengths).squeeze()
batch.label = batch.label.type(torch.FloatTensor).squeeze()
loss = criterion(predictions, batch.label)
acc = binary_accuracy(predictions, batch.label)
optimizer.zero_grad()
loss.backward()
Since you have only 1 output at the end, an actual prediction would be either 0 or 1 (nothing in between). To achieve that, you can simply use 0.5 as the threshold: everything below is considered a 0 and everything above is considered a 1. If you are using the binary_accuracy function from the article you were following, that is done automatically for you; they do it by rounding with torch.round.
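For completeness, a rounding-based accuracy in the spirit of that article's binary_accuracy could look like this (a sketch; the article's exact function may differ):
def binary_accuracy(preds, y):
    rounded_preds = torch.round(preds)        # sigmoid outputs >= 0.5 become 1, otherwise 0
    correct = (rounded_preds == y).float()    # 1.0 where the prediction matches the label
    return correct.mean()                     # fraction of correct predictions in the batch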
This is the code I am implementing: I am using a subset of the CalTech256 dataset to classify images of 10 different kinds of animals, following a tutorial that goes over dataset preparation, data augmentation and the steps to build the classifier.
def train_and_validate(model, loss_criterion, optimizer, epochs=25):
    '''
    Function to train and validate
    Parameters
        :param model: Model to train and validate
        :param loss_criterion: Loss Criterion to minimize
        :param optimizer: Optimizer for computing gradients
        :param epochs: Number of epochs (default=25)
    Returns
        model: Trained Model with best validation accuracy
        history: (dict object): Having training loss, accuracy and validation loss, accuracy
    '''
    start = time.time()
    history = []
    best_acc = 0.0

    for epoch in range(epochs):
        epoch_start = time.time()
        print("Epoch: {}/{}".format(epoch+1, epochs))

        # Set to training mode
        model.train()

        # Loss and Accuracy within the epoch
        train_loss = 0.0
        train_acc = 0.0
        valid_loss = 0.0
        valid_acc = 0.0

        for i, (inputs, labels) in enumerate(train_data_loader):
            inputs = inputs.to(device)
            labels = labels.to(device)

            # Clean existing gradients
            optimizer.zero_grad()

            # Forward pass - compute outputs on input data using the model
            outputs = model(inputs)

            # Compute loss
            loss = loss_criterion(outputs, labels)

            # Backpropagate the gradients
            loss.backward()

            # Update the parameters
            optimizer.step()

            # Compute the total loss for the batch and add it to train_loss
            train_loss += loss.item() * inputs.size(0)

            # Compute the accuracy
            ret, predictions = torch.max(outputs.data, 1)
            correct_counts = predictions.eq(labels.data.view_as(predictions))

            # Convert correct_counts to float and then compute the mean
            acc = torch.mean(correct_counts.type(torch.FloatTensor))

            # Compute total accuracy in the whole batch and add to train_acc
            train_acc += acc.item() * inputs.size(0)

            #print("Batch number: {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}".format(i, loss.item(), acc.item()))

        # Validation - No gradient tracking needed
        with torch.no_grad():
            # Set to evaluation mode
            model.eval()

            # Validation loop
            for j, (inputs, labels) in enumerate(valid_data_loader):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Forward pass - compute outputs on input data using the model
                outputs = model(inputs)

                # Compute loss
                loss = loss_criterion(outputs, labels)

                # Compute the total loss for the batch and add it to valid_loss
                valid_loss += loss.item() * inputs.size(0)

                # Calculate validation accuracy
                ret, predictions = torch.max(outputs.data, 1)
                correct_counts = predictions.eq(labels.data.view_as(predictions))

                # Convert correct_counts to float and then compute the mean
                acc = torch.mean(correct_counts.type(torch.FloatTensor))

                # Compute total accuracy in the whole batch and add to valid_acc
                valid_acc += acc.item() * inputs.size(0)

                #print("Validation Batch number: {:03d}, Validation: Loss: {:.4f}, Accuracy: {:.4f}".format(j, loss.item(), acc.item()))

        # Find average training loss and training accuracy
        avg_train_loss = train_loss/train_data_size
        avg_train_acc = train_acc/train_data_size

        # Find average validation loss and validation accuracy
        avg_valid_loss = valid_loss/valid_data_size
        avg_valid_acc = valid_acc/valid_data_size

        history.append([avg_train_loss, avg_valid_loss, avg_train_acc, avg_valid_acc])

        epoch_end = time.time()
        print("Epoch : {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}%, \n\t\tValidation : Loss : {:.4f}, Accuracy: {:.4f}%, Time: {:.4f}s".format(epoch, avg_train_loss, avg_train_acc*100, avg_valid_loss, avg_valid_acc*100, epoch_end-epoch_start))

        # Save if the model has best accuracy till now
        torch.save(model, dataset+'_model_'+str(epoch)+'.pt')

    return model, history
# Load pretrained ResNet50 Model
resnet50 = models.resnet50(pretrained=True)
#resnet50 = resnet50.to('cuda:0')

# Freeze model parameters
for param in resnet50.parameters():
    param.requires_grad = False

# Change the final layer of ResNet50 Model for Transfer Learning
fc_inputs = resnet50.fc.in_features
resnet50.fc = nn.Sequential(
    nn.Linear(fc_inputs, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, num_classes),  # Since 10 possible outputs
    nn.LogSoftmax(dim=1)          # For using NLLLoss()
)

# Convert model to be used on GPU
# resnet50 = resnet50.to('cuda:0')
Error is this:
RuntimeError                              Traceback (most recent call last)
in ()
      6 # Train the model for 25 epochs
      7 num_epochs = 30
----> 8 trained_model, history = train_and_validate(resnet50, loss_func, optimizer, num_epochs)
      9
     10 torch.save(history, dataset+'_history.pt')

in train_and_validate(model, loss_criterion, optimizer, epochs)
     43
     44     # Compute loss
---> 45     loss = loss_criterion(outputs, labels)
     46
     47     # Backpropagate the gradients

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

~\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
    202
    203     def forward(self, input, target):
--> 204         return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
    205
    206

~\Anaconda3\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   1836                          .format(input.size(0), target.size(0)))
   1837     if dim == 2:
-> 1838         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   1839     elif dim == 4:
   1840         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at C:\Users\builder\AppData\Local\Temp\pip-req-build-0i480kur\aten\src\THNN/generic/ClassNLLCriterion.c:97
This happens when there are either incorrect labels in your dataset, or the labels are 1-indexed (instead of 0-indexed). As the error message says, cur_target must be smaller than the total number of classes (10). To verify the issue, check the maximum and minimum label in your dataset. If the data is indeed 1-indexed, just subtract one from all labels and you should be fine.
Note that another possible reason is that there exist some -1 labels in the data. Some (especially older) datasets use -1 as an indication of a wrong/dubious label. If you find such labels, just discard them.
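A quick way to check the label range before training (a sketch; it assumes the dataset behind train_data_loader yields (image, label) pairs as in the code above):
all_labels = [int(label) for _, label in train_data_loader.dataset]
print(min(all_labels), max(all_labels))  # should print 0 and num_classes - 1 (here 0 and 9)
# If the labels turn out to be 1-indexed, subtract 1 from each label when building the dataset.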
I am following the blog for transfer learning:
## First I compute and save the bottleneck features, then build a new model and train it on the bottleneck features:
input_layer = Input(shape=base_model.output_shape[1:])
x = GlobalAveragePooling2D()(input_layer)
x = Dense(512, activation='relu',name='fc_new_1')(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu',name='fc_new_2')(x)
x = Dense(num_classes, activation='softmax',name='logit_new')(x)
Add_layers = Model(inputs=input_layer, outputs=x,name='Add_layers')
## Then I put this new model on top of the pretrained model:
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(img_shape[0], img_shape[1], 3))
x = base_model.output
predictions = Add_layers(x)
model = Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
## Then, I evaluate the model:
score = model.evaluate_generator(train_generator, nb_train_samples // batch_size_finetuning)
print('The evaluation of the entire model before fine tuning : ')
print(score)
score = model.evaluate_generator(validation_generator, nb_validation_samples // batch_size_evaluation)
print('The evaluation of the entire model before fine tuning : ')
print(score)
And get training loss and accuracy : [0.015362062912073827, 1.0]
validation loss and accuracy : [0.89740632474422455, 0.75]
## Just one line below it, I train the new model:
model.fit_generator(train_generator,
                    steps_per_epoch=nb_train_samples // batch_size_finetuning,
                    epochs=finetuning_epoch,
                    validation_data=validation_generator,
                    validation_steps=nb_validation_samples // batch_size_evaluation,
                    callbacks=[checkpointer_finetuning, history_finetuning, TB_finetuning,
                               lrate_finetuning, Eartly_Stopping_finetuning])
Then the output is :
31/31 [==============================] - 35s - loss: 3.4004 - acc: 0.3297 - val_loss: 0.9591 - val_acc: 0.7083
The weird thing is that this problem only happens if I use ResNet50 or InceptionV3, but not with VGG16. I am pretty sure that changing the pretrained model is the only difference. I understand that dropout might make the two numbers differ, but the gap should not be this large, and VGG16 shows no obvious problem at all.
Another weird thing: if I set every layer to .trainable = False and compile, the validation accuracy still decreases dramatically. I even checked the weights of every layer: with .trainable = False the weights do not change, and with .trainable = True they do.
Any help is appreciated! Thanks!
I am building an RNN in Keras.
def RNN_keras(max_timestep_len, feat_num):
    model = Sequential()
    model.add(Masking(mask_value=-1.0, input_shape=(max_timestep_len, feat_num)))
    model.add(SimpleRNN(input_dim=feat_num, output_dim=128, activation='relu', return_sequences=True))
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(output_dim=1, activation='relu')))

    sgd = SGD(lr=0.1, decay=1e-6)
    model.compile(loss='mean_squared_error',
                  optimizer=sgd,
                  metrics=['mean_squared_error'])
    return model
for epoch in range(1, NUM_EPOCH+1):
    batch_index = 0
    for X_batch, y_batch in mig.Xy_gen(mig.X_train, mig.y_train, batch_size=BATCH_SIZE):
        batch_index += 1
        X_train_pad = sequence.pad_sequences(X_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        y_train_pad = sequence.pad_sequences(y_batch, maxlen=mig.ttb.MAX_SEQ_LEN, padding='pre', value=-1.0)
        loss = rnn.train_on_batch(X_train_pad, y_train_pad)
        print("Epoch", epoch, ": Batch", batch_index, "-",
              rnn.metrics_names[0], "=", loss[0], "-", rnn.metrics_names[1], "=", loss[1])
The output:
Epoch 1 : Batch 1 - loss = 715.478 - mean_squared_error = 178.191
Epoch 1 : Batch 2 - loss = 1.32964e+12 - mean_squared_error = 2.7457e+11
Epoch 1 : Batch 3 - loss = 2880.08 - mean_squared_error = 594.089
Epoch 1 : Batch 4 - loss = 4065.16 - mean_squared_error = 1031.27
Epoch 1 : Batch 5 - loss = 3489.96 - mean_squared_error = 695.302
Epoch 1 : Batch 6 - loss = 546.395 - mean_squared_error = 147.439
Epoch 1 : Batch 7 - loss = 1353.35 - mean_squared_error = 241.043
Epoch 1 : Batch 8 - loss = 1962.75 - mean_squared_error = 426.699
Epoch 1 : Batch 9 - loss = 2680.85 - mean_squared_error = 504.812
My questions:
Is it normal that the loss does not decrease from batch to batch?
I set both the loss and the metric to 'mean_squared_error'. Why are the reported loss and mean_squared_error different? Are they calculated on different subsets of the training data?
How should I decide whether to use 'pre' padding or 'post' padding? 'Pre' is like adding 'START', while 'post' is like adding 'END'. But based on my understanding, both 'START' and 'END' are important in sequence labeling, right?
In the TimeDistributed layer, is y_t also determined by y_{t-1}, y_{t-2}, ...? Or is it just a sequence version of the Dense layer, where the outputs at all time steps are independent?