Training seq2seq LM over multiple iterations in PyTorch, seems like lack of connection between encoder and decoder - machine-learning

My seq2seq model seems to only learn to produce sequences of popular words like:
"i don't . i don't . i don't . i don't . i don't"
I think that might be due to a lack of actual data flow between encoder and decoder.
That happens whether I use encoder.init_hidden() or encoder_hidden.detach().
If I use neither, I get an error:
"RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward."
If I try to use retain_graph=True, I get another error:
"RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 768]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)."
This seems to be a very common use case, but from all similar questions and all the documentation and experiments, I cannot solve it.
Am I missing something obvious?
encoder = Encoder(embedding_dim, hidden_size, max_seq_len, num_layers, vocab.len(), word_embeddings).to(device)
decoder = Decoder(embedding_dim, hidden_size, num_layers, vocab.len()).to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.SGD(params=encoder.parameters() + decoder.parameters(), lr=learn_rate)
encoder.train()
decoder.train()
encoder_hidden = encoder.init_hidden()
for epoch in range(num_epochs):
epoch_loss = 0
num_samples = 0
j = 0
for prompts, responses in train_data_loader:
#encoder_hidden = encoder.init_hidden() # new tensor of zeroes
encoder_hidden = encoder_hidden.detach()
optimizer.zero_grad()
encoder_output, encoder_hidden = encoder(prompts, encoder_hidden)
decoder_hidden = encoder.transform_hidden(encoder_hidden)
batch_size = responses.size(0)
decoder_input = torch.tensor([[SOS_TOKEN]] * batch_size, device=device)
decoder_outputs = []
sequence_length = responses.shape[1]
for i in range(sequence_length):
word_index = responses[:, i:i+1]
decoder_output, _ = decoder(decoder_input, decoder_hidden)
decoder_outputs.append(decoder_output)
decoder_input = word_index
decoder_outputs_t = torch.cat(decoder_outputs, dim=1)
decoder_outputs_t = decoder_outputs_t.permute(0, 2, 1)
loss = loss_function(decoder_outputs_t, responses)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
num_samples += 1
j += 1
mean_loss = epoch_loss / num_samples

Related

How to register a dynamic backward hook on tensors in Pytorch?

I'm trying to register a backward hook on each neuron's weights in a network. By dynamic I mean that it will take a value and multiply the associated gradients by that value.
From here it seem like it's possible to register a hook on a tensor with a fixed value (though note that I need it to take a value that will change). From here it also seems like it's possible to register a hook on all of the parameters -- they use it to do gradients clipping (though note that I'm trying to only do it on each neuron's weights).
If my network is as follows:
class Model(nn.Module):
def __init__(self):
super(Model, self).__init__()
self.fc1 = nn.Linear(3,5)
self.fc2 = nn.Linear(5,10)
self.fc3 = nn.Linear(10,1)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = torch.relu(self.fc2(x))
x = torch.relu(self.fc3(x))
return x
The first layer has 5 neurons with 3 associated weights for each. Hence, this layer should have 5 hooks that modifies (i.e change the current gradient by multiplying it) their 3 associated weights gradients during the backward step.
Training pseudo-code example:
net = Model()
for epoch in epochs:
out = net(data)
loss = criterion(out, target)
optimizer.zero_grad()
loss.backward()
for hook in list_of_hooks: #not sure if there's a more "pytorch" way of doing this without a for loop
hook(random_value)
optimizer.step()
What about exploiting lambdas closure over names?
A short example:
import torch
net_params = torch.rand(5, 3, requires_grad=True)
msg = "Hello!"
t.register_hook(lambda g: print(msg))
out1 = net_params * 2.
loss = out1.sum()
loss.backward() # Activates the hook and prints "Hello!"
msg = "How are you?" # The lambda is affected by this change
out2 = t ** 4.
loss2 = out2.sum()
loss2.backward() # Activates the hook again and prints "How are you?"
So a possible solution to your problem:
net = Model()
# Replace it with your computed values
rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)
net.fc1.weight.register_hook(lambda g: g * rand_values)
for epoch in epochs:
out = net(data)
loss = criterion(out, target)
optimizer.zero_grad()
loss.backward() # fc1 gradients are multiplied by rand_values
optimizer.step()
# Update rand_values. The lambda computation will change accordingly
rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)
Edit
To make things clearer, if you specifically want to multiply each set of weights i by a single value vi you can exploit broadcasting semantic and define values = torch.tensor([v0, v1, v2, v3, v4]).reshape(5, 1), then the lambda becomes lambda g: g * values

Why is a simple embedding+linear layer outperforming this LSTM classifier?

I'm running into a roadblock in my learning about NLP. I'm working on a beginner's Kaggle competition classifying tweets as "disaster" or "not disaster". I started out by repurposing a simple network from a PyTorch tutorial comprised of nn.EmbeddingBag and nn.Linear layers and saw decent results during both training and inference:
self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
The loss function is BCEWithLogits, by the way.
I decided to up my game and throw an LSTM into the mix. I took a deep dive into padded/packed sequences and think I understand them pretty well. After perusing around and thinking about it, I came to the conclusion that I should be grabbing the final non-padded hidden state of each sequence's output from the LSTM. That's what I tried below:
My attempt at upping my game:
class TextClassificationModel(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
super(TextClassificationModel, self).__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
self.fc1 = nn.Linear(hidden_size, num_class)
def forward(self, padded_seq, lengths):
# embedding layer
embedded_padded = self.embedding(padded_seq)
packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
# lstm layer
output, _ = self.lstm(packed_output)
padded_output, lengths = pad_packed_sequence(output, batch_first=True)
# get hidden state of final non-padded sequence element:
h_n = []
for seq, length in zip(padded_output, lengths):
h_n.append(seq[length - 1, :])
lstm_out = torch.stack(h_n)
# linear layers
out = self.fc1(lstm_out)
return out
This morning, I ported my notebook over to an IDE and ran the debugger and confirmed that h_n is indeed the final hidden state of each sequence, not including padding.
So everything runs/trains without error but my loss never decreases when I use batch size > 1.
With batch_size = 8:
With batch_size = 1:
My Question
I would have expected this LSTM setup to perform much better on this simple task. So I'm wondering "Where have I gone wrong?"
Additional Information: Training Code
def train_one_epoch(model, opt, criterion, lr, trainloader):
model.to(device)
model.train()
running_tl = 0
for (label, data, lengths) in trainloader:
opt.zero_grad()
label = label.reshape(label.size()[0], 1)
output = model(data, lengths)
loss = criterion(output, label)
running_tl += loss.item()
loss.backward()
opt.step()
return running_tl
def validate_one_epoch(model, opt, criterion, lr, validloader):
running_vl = 0
model.eval()
with torch.no_grad():
for (label, data, lengths) in validloader:
label = label.reshape(label.shape[0], 1)
output = model(data, lengths)
loss = criterion(output, label)
running_vl += loss.item()
return running_vl
def train_model(model, opt, criterion, epochs, trainload, testload=None, lr=1e-3):
avg_tl_per_epoch = []
avg_vl_per_epoch = []
for e in trange(epochs):
running_tl = train_one_epoch(model, opt, criterion, lr, trainload)
avg_tl_per_epoch.append(running_tl / len(trainload))
if testload:
running_vl = validate_one_epoch(model, opt, criterion, lr, validloader)
avg_vl_per_epoch.append(running_vl / len(testload))
return avg_tl_per_epoch, avg_vl_per_epoch
I think your model should look like that :
class TextClassificationModel(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
super(TextClassificationModel, self).__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
self.fc1 = nn.Linear(hidden_size, num_class)
def forward(self, padded_seq, lengths):
# embedding layer
embedded_padded = self.embedding(padded_seq)
packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)
# lstm layer
output, _ = self.lstm(packed_output)
out = self.fc1(output)
return out
As, by default, the LSTM will just output the last hidden state as an output when provided with a sequence.
Also depending on the number of examples, the simple embedding + linear model might work better as it needs fewer data to converge. Your data being tweets (very short text) the sequential aspect of the text might not be so important.
You have not provided the code for preprocessing your data. With text a good preprocessing is crucial and I recommend you to take a look to the pytorch tutorial called NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION.

PyTorch gives incorrect results due to broadcasting

I want to run some neural net experiments with PyTorch, but a minimal test case is giving wrong answers. The test case sets up a simple neural network with two input variables and an output variable that is just the sum of the inputs, and tries learning it as a regression problem; I expect it to converge on zero mean squared error, but it actually converges on 0.165. It's probably because of the issue alluded to in the warning message; how can I fix it?
Code:
import torch
import torch.nn as nn
# data
Xs = []
ys = []
n = 10
for i in range(n):
i1 = i / n
for j in range(n):
j1 = j / n
Xs.append([i1, j1])
ys.append(i1 + j1)
# torch tensors
X_tensor = torch.tensor(Xs)
y_tensor = torch.tensor(ys)
# hyperparameters
in_features = len(Xs[0])
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
self.L0 = nn.Linear(in_features, hidden_size)
self.N0 = nn.ReLU()
self.L1 = nn.Linear(hidden_size, 1)
def forward(self, x):
x = self.L0(x)
x = self.N0(x)
x = self.L1(x)
return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
print("training")
for epoch in range(1, epochs + 1):
# forward
output = model(X_tensor)
cost = criterion(output, y_tensor)
# backward
optimizer.zero_grad()
cost.backward()
optimizer.step()
# print progress
if epoch % (epochs // 10) == 0:
print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())
Output:
training
C:\Users\russe\Anaconda3\envs\torch2\lib\site-packages\torch\nn\modules\loss.py:445: UserWarning: Using a target size (torch.Size([100])) that is different to the input size (torch.Size([100, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
50 0.167574
100 0.165108
150 0.165070
200 0.165052
250 0.165039
300 0.165028
350 0.165020
400 0.165013
450 0.165009
500 0.165006
mean squared error: 0.1650056540966034
And the message:
UserWarning: Using a target size (torch.Size([100])) that is different to the input size (torch.Size([100, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
You're going to be a bit more specific on which tensors (X, or Y), but we can can reshape our tensors by using the torch.view function.
For example:
Y_tensor = torch.tensor(Ys)
print(Y_tensor.shape)
>> torch.Size([5])
new_shape = (len(Ys), 1)
Y_tensor = Y_tensor.view(new_shape)
print(Y_tensor.shape)
>> torch.Size([5, 1])
However, I'm skeptical that this broadcasting behavior is why you're having accuracy issues.

Pytorch model stuck at 0.5 though loss decreases consistently

This is using PyTorch
I have been trying to implement UNet model on my images, however, my model accuracy is always exact 0.5. Loss does decrease.
I have also checked for class imbalance. I have also tried playing with learning rate. Learning rate affects loss but not the accuracy.
My architecture below ( from here )
""" `UNet` class is based on https://arxiv.org/abs/1505.04597
The U-Net is a convolutional encoder-decoder neural network.
Contextual spatial information (from the decoding,
expansive pathway) about an input tensor is merged with
information representing the localization of details
(from the encoding, compressive pathway).
Modifications to the original paper:
(1) padding is used in 3x3 convolutions to prevent loss
of border pixels
(2) merging outputs does not require cropping due to (1)
(3) residual connections can be used by specifying
UNet(merge_mode='add')
(4) if non-parametric upsampling is used in the decoder
pathway (specified by upmode='upsample'), then an
additional 1x1 2d convolution occurs after upsampling
to reduce channel dimensionality by a factor of 2.
This channel halving happens with the convolution in
the tranpose convolution (specified by upmode='transpose')
Arguments:
in_channels: int, number of channels in the input tensor.
Default is 3 for RGB images. Our SPARCS dataset is 13 channel.
depth: int, number of MaxPools in the U-Net. During training, input size needs to be
(depth-1) times divisible by 2
start_filts: int, number of convolutional filters for the first conv.
up_mode: string, type of upconvolution. Choices: 'transpose' for transpose convolution
"""
class UNet(nn.Module):
def __init__(self, num_classes, depth, in_channels, start_filts=16, up_mode='transpose', merge_mode='concat'):
super(UNet, self).__init__()
if up_mode in ('transpose', 'upsample'):
self.up_mode = up_mode
else:
raise ValueError("\"{}\" is not a valid mode for upsampling. Only \"transpose\" and \"upsample\" are allowed.".format(up_mode))
if merge_mode in ('concat', 'add'):
self.merge_mode = merge_mode
else:
raise ValueError("\"{}\" is not a valid mode for merging up and down paths.Only \"concat\" and \"add\" are allowed.".format(up_mode))
# NOTE: up_mode 'upsample' is incompatible with merge_mode 'add'
if self.up_mode == 'upsample' and self.merge_mode == 'add':
raise ValueError("up_mode \"upsample\" is incompatible with merge_mode \"add\" at the moment "
"because it doesn't make sense to use nearest neighbour to reduce depth channels (by half).")
self.num_classes = num_classes
self.in_channels = in_channels
self.start_filts = start_filts
self.depth = depth
self.down_convs = []
self.up_convs = []
# create the encoder pathway and add to a list
for i in range(depth):
ins = self.in_channels if i == 0 else outs
outs = self.start_filts*(2**i)
pooling = True if i < depth-1 else False
down_conv = DownConv(ins, outs, pooling=pooling)
self.down_convs.append(down_conv)
# create the decoder pathway and add to a list
# - careful! decoding only requires depth-1 blocks
for i in range(depth-1):
ins = outs
outs = ins // 2
up_conv = UpConv(ins, outs, up_mode=up_mode, merge_mode=merge_mode)
self.up_convs.append(up_conv)
self.conv_final = conv1x1(outs, self.num_classes)
# add the list of modules to current module
self.down_convs = nn.ModuleList(self.down_convs)
self.up_convs = nn.ModuleList(self.up_convs)
self.reset_params()
#staticmethod
def weight_init(m):
if isinstance(m, nn.Conv2d):
#https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
##Doc: https://pytorch.org/docs/stable/nn.init.html?highlight=xavier#torch.nn.init.xavier_normal_
init.xavier_normal_(m.weight)
init.constant_(m.bias, 0)
def reset_params(self):
for i, m in enumerate(self.modules()):
self.weight_init(m)
def forward(self, x):
encoder_outs = []
# encoder pathway, save outputs for merging
for i, module in enumerate(self.down_convs):
x, before_pool = module(x)
encoder_outs.append(before_pool)
for i, module in enumerate(self.up_convs):
before_pool = encoder_outs[-(i+2)]
x = module(before_pool, x)
# No softmax is used. This means we need to use
# nn.CrossEntropyLoss is your training script,
# as this module includes a softmax already.
x = self.conv_final(x)
return x
Parameters are :
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x,y = train_sequence[0] ; batch_size = x.shape[0]
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
optim = torch.optim.Adam(model.parameters(),lr=0.01, weight_decay=1e-3)
criterion = nn.BCEWithLogitsLoss() #has sigmoid internally
epochs = 1000
The function for training is :
import torch.nn.functional as f
def train_model(epoch,train_sequence):
"""Train the model and report validation error with training error
Args:
model: the model to be trained
criterion: loss function
data_train (DataLoader): training dataset
"""
model.train()
for idx in range(len(train_sequence)):
X, y = train_sequence[idx]
images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
masks = Variable(torch.from_numpy(y)).to(device)
outputs = model(images)
print(masks.shape, outputs.shape)
loss = criterion(outputs, masks)
optim.zero_grad()
loss.backward()
# Update weights
optim.step()
# total_loss = get_loss_train(model, data_train, criterion)
My function for calculating loss and accuracy is below:
def get_loss_train(model, train_sequence):
"""
Calculate loss over train set
"""
model.eval()
total_acc = 0
total_loss = 0
for idx in range(len(train_sequence)):
with torch.no_grad():
X, y = train_sequence[idx]
images = Variable(torch.from_numpy(X)).to(device) # [batch, channel, H, W]
masks = Variable(torch.from_numpy(y)).to(device)
outputs = model(images)
loss = criterion(outputs, masks)
preds = torch.argmax(outputs, dim=1).float()
acc = accuracy_check_for_batch(masks.cpu(), preds.cpu(), images.size()[0])
total_acc = total_acc + acc
total_loss = total_loss + loss.cpu().item()
return total_acc/(len(train_sequence)), total_loss/(len(train_sequence))
Edit : Code which runs (calls) the functions:
for epoch in range(epochs):
train_model(epoch, train_sequence)
train_acc, train_loss = get_loss_train(model,train_sequence)
print("Train Acc:", train_acc)
print("Train loss:", train_loss)
Can someone help me identify as why is accuracy always exact 0.5?
Edit-2:
As asked accuracy_check_for_batch function is here:
def accuracy_check_for_batch(masks, predictions, batch_size):
total_acc = 0
for index in range(batch_size):
total_acc += accuracy_check(masks[index], predictions[index])
return total_acc/batch_size
and
def accuracy_check(mask, prediction):
ims = [mask, prediction]
np_ims = []
for item in ims:
if 'str' in str(type(item)):
item = np.array(Image.open(item))
elif 'PIL' in str(type(item)):
item = np.array(item)
elif 'torch' in str(type(item)):
item = item.numpy()
np_ims.append(item)
compare = np.equal(np_ims[0], np_ims[1])
accuracy = np.sum(compare)
return accuracy/len(np_ims[0].flatten())
I found the mistake.
model = UNet(num_classes = 2, depth=5, in_channels=5, merge_mode='concat').to(device)
should be
model = UNet(num_classes = 1, depth=5, in_channels=5, merge_mode='concat').to(device)
because I am using BCELosswithLogits.

Defining a simple neural netwok in mxnet error

I am doing making simple NN using MXnet , but having some problem in step() method
x1.shape=(64, 1, 1000)
y1.shape=(64, 1, 10)
net =nm.Sequential()
net.add(nn.Dense(H,activation='relu'),nn.Dense(90,activation='relu'),nn.Dense(D_out))
for t in range(500):
#y_pred = net(x1)
#loss = loss_fn(y_pred, y)
#for i in range(len(x1)):
with autograd.record():
output=net(x1)
loss =loss_fn(output,y1)
loss.backward()
trainer.step(64)
if t % 100 == 99:
print(t, loss)
#optimizer.zero_grad()
UserWarning: Gradient of Parameter dense30_weight on context cpu(0)
has not been updated by backward since last step. This could mean a
bug in your model that made it only use a subset of the Parameters
(Blocks) for this iteration. If you are intentionally only using a
subset, call step with ignore_stale_grad=True to suppress this warning
and skip updating of Parameters with stale gradient
The error indicates that you are passing parameters in your trainer that are not in your computational graph.
You need to initialize the parameters of your model and define the trainer. Unlike Pytorch, you don't need to call zero_grad in MXNet because by default new gradients are written in and not accumulated. Following code shows a simple neural network implemented using MXNet's Gluon API:
# Define model
net = gluon.nn.Dense(1)
net.collect_params().initialize(mx.init.Normal(sigma=1.), ctx=model_ctx)
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.0001})
# Create random input and labels
def real_fn(X):
return 2 * X[:, 0] - 3.4 * X[:, 1] + 4.2
X = nd.random_normal(shape=(num_examples, num_inputs))
noise = 0.01 * nd.random_normal(shape=(num_examples,))
y = real_fn(X) + noise
# Define Dataloader
batch_size = 4
train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=batch_size, shuffle=True)
num_batches = num_examples / batch_size
for e in range(10):
# Iterate over training batches
for i, (data, label) in enumerate(train_data):
# Load data on the CPU
data = data.as_in_context(mx.cpu())
label = label.as_in_context(mx.cpu())
with autograd.record():
output = net(data)
loss = square_loss(output, label)
# Backpropagation
loss.backward()
trainer.step(batch_size)
cumulative_loss += nd.mean(loss).asscalar()
print("Epoch %s, loss: %s" % (e, cumulative_loss / num_examples))

Resources