I am running an RNN model with the PyTorch library to do sentiment analysis on movie reviews, but somehow the training loss and validation loss remain constant throughout training. I have looked at different online sources but am still stuck.
Can someone please help and take a look at my code?
Some parameters are specified by the assignment:
embedding_dim = 64
n_layers = 1
n_hidden = 128
dropout = 0.5
batch_size = 32
My main code
txt_field = data.Field(tokenize=word_tokenize, lower=True, include_lengths=True, batch_first=True)
label_field = data.Field(sequential=False, use_vocab=False, batch_first=True)
train = data.TabularDataset(path=part2_filepath+"train_Copy.csv", format='csv',
fields=[('label', label_field), ('text', txt_field)], skip_header=True)
validation = data.TabularDataset(path=part2_filepath+"validation_Copy.csv", format='csv',
fields=[('label', label_field), ('text', txt_field)], skip_header=True)
txt_field.build_vocab(train, min_freq=5)
label_field.build_vocab(train, min_freq=2)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
(train, validation, test),
batch_size=32,
sort_key=lambda x: len(x.text),
sort_within_batch=True,
device=device)
n_vocab = len(txt_field.vocab)
embedding_dim = 64
n_hidden = 128
n_layers = 1
dropout = 0.5
model = Text_RNN(n_vocab, embedding_dim, n_hidden, n_layers, dropout)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.BCELoss().to(device)
N_EPOCHS = 15
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
train_loss, train_acc = RNN_train(model, train_iter, optimizer, criterion)
valid_loss, valid_acc = evaluate(model, valid_iter, criterion)
My Model
class Text_RNN(nn.Module):
def __init__(self, n_vocab, embedding_dim, n_hidden, n_layers, dropout):
super(Text_RNN, self).__init__()
self.n_layers = n_layers
self.n_hidden = n_hidden
self.emb = nn.Embedding(n_vocab, embedding_dim)
self.rnn = nn.RNN(
input_size=embedding_dim,
hidden_size=n_hidden,
num_layers=n_layers,
dropout=dropout,
batch_first=True
)
self.sigmoid = nn.Sigmoid()
self.linear = nn.Linear(n_hidden, 2)
def forward(self, sent, sent_len):
sent_emb = self.emb(sent)
outputs, hidden = self.rnn(sent_emb)
prob = self.sigmoid(self.linear(hidden.squeeze(0)))
return prob
The training function
def RNN_train(model, iterator, optimizer, criterion):
epoch_loss = 0
epoch_acc = 0
model.train()
for batch in iterator:
text, text_lengths = batch.text
predictions = model(text, text_lengths)
batch.label = batch.label.type(torch.FloatTensor).squeeze()
predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
loss = criterion(predictions, batch.label)
loss.requires_grad = True
acc = binary_accuracy(predictions, batch.label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
return epoch_loss / len(iterator), epoch_acc / len(iterator)
The output I get when running on 10 testing reviews + 5 validation reviews:
Epoch [1/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [2/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [3/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
Epoch [4/15]: Train Loss: 15.351 | Train Acc: 44.44% Val. Loss: 11.052 | Val. Acc: 60.00%
...
I'd appreciate it if someone could point me in the right direction. I believe it is something in the training code, since for the most part I followed this article:
https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
In your training loop you are using the indices from the max operation, which is not differentiable, so you cannot track gradients through it. Because it is not differentiable, everything afterwards does not track gradients either, and calling loss.backward() would fail.
# The indices of the max operation are not differentiable
predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
loss = criterion(predictions, batch.label)
# Setting requires_grad to True to make .backward() work, although incorrectly.
loss.requires_grad = True
Presumably you wanted to fix that by setting requires_grad, but that does not do what you expect: no gradients are propagated to your model, because the only thing in your computational graph would be the loss itself, and there is nowhere to go from there.
You used the indices to get either 0 or 1, since the output of your model is essentially two classes, and you wanted the one with the higher probability. For the Binary Cross Entropy loss, you only need one class that has a value between 0 and 1 (continuous), which you get by applying the sigmoid function.
So you need to change the output size of the final linear layer to 1:
self.linear = nn.Linear(n_hidden, 1)
and in your training loop you can remove the torch.max call as well as the requires_grad assignment:
# Squeeze the model's output to get rid of the single class dimension
predictions = model(text, text_lengths).squeeze()
batch.label = batch.label.type(torch.FloatTensor).squeeze()
loss = criterion(predictions, batch.label)
acc = binary_accuracy(predictions, batch.label)
optimizer.zero_grad()
loss.backward()
Since you have only 1 output at the end, an actual prediction should be either 0 or 1 (nothing in between). To achieve that you can simply use 0.5 as the threshold, so everything below is considered a 0 and everything above a 1. If you are using the binary_accuracy function from the article you were following, that is done for you automatically: it rounds the predictions with torch.round.
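In case you are implementing binary_accuracy yourself, a minimal sketch of such a helper (assuming the 0.5 thresholding convention described above; the article's exact implementation may differ):
import torch

def binary_accuracy(preds, labels):
    # Round the sigmoid outputs to 0 or 1 and compare with the ground-truth labels
    rounded_preds = torch.round(preds)
    correct = (rounded_preds == labels).float()
    return correct.sum() / len(correct)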
Related
I'm attempting to save and load the best model with torch, where I've defined my training function as follows:
def train_model(model, train_loader, test_loader, device, learning_rate=1e-1, num_epochs=200):
# The training configurations were not carefully selected.
criterion = nn.CrossEntropyLoss()
model.to(device)
# It seems that SGD optimizer is better than Adam optimizer for ResNet18 training on CIFAR10.
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[65, 75], gamma=0.75, last_epoch=-1)
# optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
# Evaluation
model.eval()
eval_loss, eval_accuracy = evaluate_model(model=model, test_loader=test_loader, device=device, criterion=criterion)
print("Epoch: {:02d} Eval Loss: {:.3f} Eval Acc: {:.3f}".format(-1, eval_loss, eval_accuracy))
load_model = input('Load a model?')
for epoch in range(num_epochs):
if epoch//2 == 0:
write_checkpoint(model=model, epoch=epoch, scheduler=scheduler, optimizer=optimizer)
model, optimizer, epoch, scheduler = load_checkpoint(model=model, scheduler=scheduler, optimizer=optimizer)
for state in optimizer.state.values():
for k, v in state.items():
if isinstance(v, torch.Tensor):
state[k] = v.to(device)
# Training
model.train()
running_loss = 0
running_corrects = 0
for inputs, labels in train_loader:
inputs = torch.FloatTensor(inputs)
inputs = inputs.to(device)
labels = labels.to(device)
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = model(inputs)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
train_loss = running_loss / len(train_loader.dataset)
train_accuracy = running_corrects / len(train_loader.dataset)
# Evaluation
model.eval()
eval_loss, eval_accuracy = evaluate_model(model=model, test_loader=test_loader, device=device, criterion=criterion)
# Set learning rate scheduler
scheduler.step()
print("Epoch: {:03d} Train Loss: {:.3f} Train Acc: {:.3f} Eval Loss: {:.3f} Eval Acc: {:.3f}".format(epoch, train_loss, train_accuracy, eval_loss, eval_accuracy))
return model
I'd like to be able to load a model and start training from the epoch where the model was saved.
So far I have methods to save the model, optimizer, and scheduler states and the epoch via:
def write_checkpoint(model, optimizer, epoch, scheduler):
state = {'epoch': epoch + 1, 'state_dict': model.state_dict(),
'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict(), }
filename = '/content/model_'
torch.save(state, filename + f'CP_epoch{epoch + 1}.pth')
def load_checkpoint(model, optimizer, scheduler, filename='/content/checkpoint.pth'):
# Note: Input model & optimizer should be pre-defined. This routine only updates their states.
start_epoch = 0
if os.path.isfile(filename):
print("=> loading checkpoint '{}'".format(filename))
checkpoint = torch.load(filename)
start_epoch = checkpoint['epoch']
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
scheduler = checkpoint['scheduler']
print("=> loaded checkpoint '{}' (epoch {})"
.format(filename, checkpoint['epoch']))
else:
print("=> no checkpoint found at '{}'".format(filename))
return model, optimizer, start_epoch, scheduler
But I can't seem to come up with the logic of how I'd update the epoch to start at the correct one. Looking for hints or ideas on how to implement just that.
If I understand correctly, you are trying to resume training from the last saved progress with the correct epoch number.
Before calling train_model, load the checkpoint values, including start_epoch. Then use start_epoch as the starting point of the loop:
for epoch in range(start_epoch, num_epochs):
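A minimal sketch of that flow using your own helpers (train_model would need to accept the loaded objects and start_epoch instead of always starting at 0; the checkpoint path defaults to the one in load_checkpoint):
# Restore the saved states before entering the training loop
model, optimizer, start_epoch, scheduler = load_checkpoint(
    model=model, optimizer=optimizer, scheduler=scheduler)
# Resume from the saved epoch rather than from 0
for epoch in range(start_epoch, num_epochs):
    # ... training and evaluation exactly as in train_model above ...
    scheduler.step()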
I am trying to calculate the F1 score for a multi-class classification problem using the CIFAR-10 dataset. I am importing f1_score from the sklearn library. However, I keep getting the following error message:
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
Below is my function for testing the model on my validation set. Could someone explain how to calculate F1 when performing multi-class classification? I am getting quite confused.
@torch.no_grad()
def valid_function(model, optimizer, val_loader):
model.eval()
val_loss = 0.0
val_accu = 0.0
f_one = []
for i, (x_val, y_val) in enumerate(val_loader):
x_val, y_val = x_val.to(device), y_val.to(device)
val_pred = model(x_val)
loss = criterion(val_pred, y_val)
val_loss += loss.item()
val_accu += accuracy(val_pred, y_val)
f_one.append(f1_score(y_val.cpu(), val_pred.cpu()))
val_loss /= len(val_loader)
val_accu /= len(val_loader)
print('Val Loss: %.3f | Val Accuracy: %.3f'%(val_loss,val_accu))
return val_loss, val_accu
The problem is here:
val_pred = model(x_val)
The model returns continuous scores (one value per class), but f1_score expects discrete class labels, so you need to convert the output to predicted class indices before computing the metric. For example, in your case:
val_pred_classes = torch.argmax(val_pred, dim=1)
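Inside your validation loop that would look roughly like this (a sketch; note that for a multi-class problem sklearn's f1_score also needs an average argument such as 'macro', since its default binary averaging cannot handle more than two classes):
val_pred = model(x_val)                                  # raw scores, shape (batch, num_classes)
val_pred_classes = torch.argmax(val_pred, dim=1)         # discrete class indices
f_one.append(f1_score(y_val.cpu(), val_pred_classes.cpu(), average='macro'))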
I need to create a model that takes as input a 351x351x11 Tensor and gives as output a 351x351x11 Tensor (it is an Autoencoder). The two tensors are made of 0s and 1s.
This is the model:
class AutoEncoder(nn.Module):
def __init__(self):
super(AutoEncoder, self).__init__()
self.down_layers=nn.ModuleList()
self.up_layers=nn.ModuleList()
self.n_layers = 1
self.down_layers.append(nn.Conv3d(5,1,(3,3,1)))
self.up_layers.append(nn.ConvTranspose3d(1,5,(3,3,1)))
for d_l in self.down_layers:
torch.nn.init.normal_(d_l.weight, mean=0.5, std=0.7)
for u_l in self.up_layers:
torch.nn.init.normal_(u_l.weight, mean=0.5, std=0.7)
def encode(self, x):
# Encoder
for i in range(len(self.down_layers)):
x = self.down_layers[i](x)
x = torch.sigmoid(x)
return x
def forward(self, x):
# Decoder
x = self.encode(x)
for i in range(len(self.up_layers)):
x = self.up_layers[i](x)
x = torch.sigmoid(x)
if(i==(len(self.up_layers)-1)):
x = torch.round(x)
return x
This is the training function:
max_e,max_p = 351,11 #tensor dimensions
DEVICE = get_device() #device is cpu
EPOCHS = 100
BATCHSIZE=5
try:
print("Start model",flush=True)
# Generate the model.
model = AutoEncoder().to(DEVICE)
lr = 0.09
optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)
# Training of the model.
for epoch in range(EPOCHS):
model.train()
"""
I have to create 25 dataloaders for 50000 training samples (each of 2000 samples) to avoid memory congestion.
"""
for i in range(25):
train_loader,X_train_shape=get_dataset(i)
N_TRAIN_EXAMPLES = X_train_shape
for batch_idx, (data, target) in enumerate(train_loader):
if batch_idx * BATCHSIZE >= N_TRAIN_EXAMPLES:
break
data, target = data[None, ...].to(DEVICE, dtype=torch.float), target[None, ...].to(DEVICE, dtype=torch.float)
optimizer.zero_grad()
output = model(data)
loss = torch.nn.BCELoss()
loss = loss(output, target)
loss.backward()
optimizer.step()
#remove train data loader from memory
del train_loader
print("VALIDATION",flush=True)
# Validation of the model.
model.eval()
correct = 0
tot = 0
with torch.no_grad():
"""
Same with the training, 10 data loaders for 20000 samples
"""
for i in range(25,35):
valid_loader,X_valid_shape=get_dataset(i)
N_VALID_EXAMPLES = X_valid_shape
for batch_idx, (data, target) in enumerate(valid_loader):
# Limiting validation data.
if batch_idx * BATCHSIZE >= N_VALID_EXAMPLES:
break
data, target = data[None, ...].to(DEVICE, dtype=torch.float), target[None, ...].to(DEVICE, dtype=torch.float)
output = model(data)
# count the number of 1s and 0s predicted correctly
newCorrect= output(target.view_as(output)).sum().item()
correct += newCorrect
tot +=max_e*max_e*max_p*BATCHSIZE
del valid_loader
accuracy = correct*100 / tot
print('Epoch: {} Loss: {} Accuracy: {} %'.format(epoch, loss.data, accuracy),flush=True)
The function that returns the data loader is:
def get_dataset(i):
X_train=[]
Y_train=[]
for j in range(i*2000,(i+1)*2000):
t = torch.load("/home/ubuntu/data/home/ubuntu/deeplogic/el_dataset/x/scene{}.pt".format(j))
X_train.append(t)
t = torch.load("/home/ubuntu/data/home/ubuntu/deeplogic/el_dataset/y/scene{}.pt".format(j))
Y_train.append(t)
train_x = torch.from_numpy(np.array(X_train)).float()
train_y = torch.from_numpy(np.array(Y_train)).float()
batch_size = 1
train = torch.utils.data.TensorDataset(train_x,train_y)
# data loader
train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = True)
return train_loader,len(X_train)
The prints I got are:
Epoch: 1 Loss: 99.80729675292969 Accuracy: 0.19852701903983955 %
Epoch: 2 Loss: 99.80729675292969 Accuracy: 0.19852701903983955 %
Epoch: 3 Loss: 99.80729675292969 Accuracy: 0.19852701903983955 %
Epoch: 4 Loss: 99.80729675292969 Accuracy: 0.19852701903983955 %
x = torch.round(x) prevents you from updating your model because it's non-differentiable. More importantly, x = torch.round(x) is redundant for BCELoss; you should move it to the validation step only. Also, newCorrect in your validation loop does not actually compare the predictions with the target values (I added the missing eq() call to your code):
# in validation loop
preds = torch.round(output)
newCorrect= preds.eq(target.view_as(preds)).sum().item()
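For reference, the forward pass would then end with the sigmoid only (a sketch of the change; everything else in the model stays the same):
def forward(self, x):
    # Decoder: no rounding here, so gradients can flow during training
    x = self.encode(x)
    for i in range(len(self.up_layers)):
        x = self.up_layers[i](x)
        x = torch.sigmoid(x)
    return x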
I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.
First I define a BERT-tokenizer and then tokenize my text:
from transformers import DistilBertTokenizer, RobertaTokenizer
distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
max_length=128, pad_to_max_length=True)
def tokenize(sentences, tokenizer):
input_ids, input_masks, input_segments = [],[],[]
for sentence in tqdm(sentences):
inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True,
return_attention_mask=True, return_token_type_ids=True)
input_ids.append(inputs['input_ids'])
input_masks.append(inputs['attention_mask'])
input_segments.append(inputs['token_type_ids'])
return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)
Next, I load my integer labels which are: 0, 1, 2, 3.
d_y = train['label']
0 0
1 1
2 0
3 2
4 0
5 0
6 0
7 0
8 3
9 1
Name: label, dtype: int64
Then I load the pretrained Transformer model and put layers on top of it. I use SparseCategoricalCrossEntropy Loss when compiling the model:
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel
distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)
config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)
input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
for layer in model.layers[:3]:
layer.trainable = False
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'],
)
Finally, I run the model using previously tokenized input_ids and input_masks as inputs to the model and get a NAN Loss after the first epoch:
model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)
Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>
EDIT: The model computes losses during the first epoch but starts returning NaNs at the second epoch. What could be causing that problem?
Does anyone have any ideas about what I am doing wrong? All suggestions are welcome!
The problem is here:
X = tf.keras.layers.Dense(1, activation='softmax')(X)
At the end of the network, you only have a single neuron, corresponding to a single class. The output probability is always 100% for class 0. If you have classes 0, 1, 2, 3, you need to have 4 outputs at the end.
The problem may have occurred because num_labels was not specified. At the final output layer K = 1 by default (K being the number of labels), and since softmax is defined as
\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}
a single-label softmax output is always 1. So when fine-tuning for multi-class classification we need to provide num_labels, for example:
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.
train = pd.read_csv('train_dataset.csv')
d = train['text']
d = d.dropna()
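One caveat (assuming the column names from your snippets above): dropping NAs from the text series alone can leave the labels misaligned with the remaining texts, so it is safer to drop at the DataFrame level:
train = pd.read_csv('train_dataset.csv')
train = train.dropna(subset=['text', 'label'])
d = train['text']
d_y = train['label']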
I had a similar problem where my model produced NaN losses only during the last batch of an epoch. All the other batches resulted in typical loss values. In my case, the problem was that the batches were not always the same size, which is what produced the NaN losses. After I made all batches equally sized, the NaNs were gone. It might also be worth investigating whether this is true in your case.
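If you want to rule that out, one way (a sketch using tf.data, not your exact pipeline) is to batch with drop_remainder=True so every batch has the same size:
import tensorflow as tf

# Build a dataset that yields ((input_ids, input_masks), label) tuples
dataset = tf.data.Dataset.from_tensor_slices(((input_ids, input_masks), d_y.values))
dataset = dataset.batch(32, drop_remainder=True)  # drop the short final batch
model.fit(dataset, epochs=3)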
I tried to implement a no-hidden-layer neural network to hack on the MNIST dataset.
I used sigmoid as the activation function and cross-entropy as the loss function.
For simplicity my network has no hidden layer, just input and output.
X = trainImage
label = trainLabel
w1 = np.random.normal(np.zeros([28 * 28, 10]))
b1 = np.random.normal(np.zeros([10]))
def sigm(x):
return 1 / (1 + np.exp(-x))
def acc():
y = sigm(np.matmul(X, w1))
return sum(np.argmax(label, 1) == np.argmax(y, 1)) / y.shape[0]
def loss():
y = sigm(np.matmul(X, w1))
return sum((-label * np.log(y)).flatten())
a = np.matmul(X[0:1], w1)
y = sigm(a)
dy = - label[0:1] / y
ds = dy * y * (1 - y)
dw = np.matmul(X[0:1].transpose(), ds)
db = ds
def bp(lr, i):
global w1, b1, a, y, dy, ds, dw, db
a = np.matmul(X[i:i+1], w1)
y = sigm(a)
dy = - label[0:1] / y
ds = dy * y * (1 - y)
dw = np.matmul(X[i:i+1].transpose(), ds)
db = ds
w1 = w1 - lr * dw
b1 = b1 - lr * db
for i in range(100 * 60000):
bp(1, i % 60000)
if i % 60000 == 0:
print("#", int(i / 60000), "loss:", loss(), "acc:", acc())
This is part of my implementation of the backpropagation algorithm, but it doesn't work as expected. The descent of the loss function is extremely slow (I tried learning rates varying from 0.001 to 1), and the accuracy never grows above 0.1.
The output is like this:
# 0 loss: 279788.368245 acc: 0.0903333333333
# 1 loss: 279788.355211 acc: 0.09035
# 2 loss: 279788.350629 acc: 0.09035
# 3 loss: 279788.348228 acc: 0.09035
# 4 loss: 279788.346736 acc: 0.09035
# 5 loss: 279788.345715 acc: 0.09035
# 6 loss: 279788.344969 acc: 0.09035
# 7 loss: 279788.3444 acc: 0.09035
# 8 loss: 279788.343951 acc: 0.09035
# 9 loss: 279788.343587 acc: 0.09035
# 10 loss: 279788.343286 acc: 0.09035
# 11 loss: 279788.343033 acc: 0.09035
From what I see here, there are a few possible factors that are preventing this from working.
Firstly, you have to randomly initialize your weights and biases. From what I am seeing, you are trying to update a value that doesn't yet have a tangible value (e.g. w1 = 0).
Secondly, your optimizer is not suited for the MNIST dataset. The optimizer is what changes the weights based on the backpropagated gradients, so it is very important that you choose the right one. Gradient Descent is better suited for this dataset, and from what I can see, you are not using Gradient Descent. If you are attempting to implement a Gradient Descent optimizer with your code, you are most likely doing it wrong: Gradient Descent involves the partial derivatives of the loss, such as the SSE (Sum of Squared Errors), which I do not see in this code.
If you want to go with the Gradient Descent optimizer, you will have to make a few changes to your code besides implementing the mathematics behind Gradient Descent. You will have to use the ReLU activation function (which I suggest you do anyway) instead of the sigmoid function; this helps make sure the vanishing-gradient problem doesn't occur. Also, you should make your loss function the reduced mean of the cross entropy. The optimizer will be much more effective this way.
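As a rough illustration only (a sketch assuming softmax outputs, one-hot labels, and the mean cross-entropy loss, not your exact code), a plain gradient-descent update for a no-hidden-layer network looks like this:
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gd_step(X, labels, w, b, lr=0.1):
    # Forward pass: X is (N, 784), w is (784, 10), b is (10,)
    y = softmax(np.matmul(X, w) + b)
    # Mean cross-entropy over the batch
    loss = -np.mean(np.sum(labels * np.log(y + 1e-12), axis=1))
    # Gradient of the mean cross-entropy w.r.t. the logits is (y - labels) / N
    grad = (y - labels) / X.shape[0]
    w -= lr * np.matmul(X.T, grad)   # gradient-descent weight update
    b -= lr * grad.sum(axis=0)       # gradient-descent bias update
    return loss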
Hope this helps.