Adam Optimizer not Updating Values - machine-learning

I am trying to use the Adam optimizer to obtain certain values outside of a neural network. My approach wasn't working, so I created a simple example to see whether it works at all:
import numpy as np
import torch

a = np.array([[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0, 4.0]])
b = np.array([[0.1, 0.2, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0, 0.0]])

a = torch.from_numpy(a)
b = torch.from_numpy(b)

a.requires_grad = True
b.requires_grad = True

optimizer = torch.optim.Adam(
    [b],
    lr=0.01,
    weight_decay=0.001
)

iterations = 200
for i in range(iterations):
    loss = torch.sqrt(((a.detach() - b.detach()) ** 2).sum(1)).mean()
    loss.requires_grad = True

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if i % 10 == 0:
        print(b)
        print("loss:", loss)
My intuition was that b should get as close to a as possible to reduce the loss. But I see no change in any of the values of b, and the loss stays exactly the same. What am I missing here? Thanks.

You are detaching b, meaning the gradient won't flow all the way back to b when backpropagating, i.e. b won't change! Additionally, you don't need to set requires_grad = True on the loss, as this is done automatically since one of the operands already has the requires_grad flag on.
loss = torch.sqrt(((a.detach() - b) ** 2).sum(1)).mean()
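For reference, a minimal sketch of the corrected loop under the same setup (the detach on b and the requires_grad line on the loss are simply removed; a needs no gradients since it is a constant target, so the detach on it isn't necessary either):

import numpy as np
import torch

a = torch.from_numpy(np.array([[0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0, 4.0]]))
b = torch.from_numpy(np.array([[0.1, 0.2, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0, 0.0]]))
b.requires_grad = True  # only b is being optimized

optimizer = torch.optim.Adam([b], lr=0.01, weight_decay=0.001)

for i in range(200):
    loss = torch.sqrt(((a - b) ** 2).sum(1)).mean()  # b stays attached to the graph
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % 10 == 0:
        print("loss:", loss.item())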

Related

Training seq2seq LM over multiple iterations in PyTorch, seems like lack of connection between encoder and decoder

My seq2seq model seems to only learn to produce sequences of popular words like:
"i don't . i don't . i don't . i don't . i don't"
I think that might be due to a lack of actual data flow between encoder and decoder.
That happens whether I use encoder.init_hidden() or encoder_hidden.detach().
If I use neither, I get an error:
"RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward."
If I try to use retain_graph=True, I get another error:
"RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 768]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)."
This seems to be a very common use case, but despite all the similar questions, the documentation, and my experiments, I cannot solve it.
Am I missing something obvious?
encoder = Encoder(embedding_dim, hidden_size, max_seq_len, num_layers, vocab.len(), word_embeddings).to(device)
decoder = Decoder(embedding_dim, hidden_size, num_layers, vocab.len()).to(device)

loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.SGD(params=list(encoder.parameters()) + list(decoder.parameters()), lr=learn_rate)

encoder.train()
decoder.train()

encoder_hidden = encoder.init_hidden()

for epoch in range(num_epochs):
    epoch_loss = 0
    num_samples = 0
    j = 0
    for prompts, responses in train_data_loader:
        #encoder_hidden = encoder.init_hidden() # new tensor of zeroes
        encoder_hidden = encoder_hidden.detach()

        optimizer.zero_grad()

        encoder_output, encoder_hidden = encoder(prompts, encoder_hidden)
        decoder_hidden = encoder.transform_hidden(encoder_hidden)

        batch_size = responses.size(0)
        decoder_input = torch.tensor([[SOS_TOKEN]] * batch_size, device=device)
        decoder_outputs = []

        sequence_length = responses.shape[1]
        for i in range(sequence_length):
            word_index = responses[:, i:i+1]
            decoder_output, _ = decoder(decoder_input, decoder_hidden)
            decoder_outputs.append(decoder_output)
            decoder_input = word_index

        decoder_outputs_t = torch.cat(decoder_outputs, dim=1)
        decoder_outputs_t = decoder_outputs_t.permute(0, 2, 1)

        loss = loss_function(decoder_outputs_t, responses)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        num_samples += 1
        j += 1

    mean_loss = epoch_loss / num_samples
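For context, here is a minimal standalone sketch (with a toy nn.RNN instead of the Encoder/Decoder above; names and sizes are illustrative) of why carrying the hidden state across batches without detaching it produces the "backward through the graph a second time" error:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

hidden = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden_size)

for step in range(2):
    x = torch.randn(2, 5, 4)      # (batch, seq_len, input_size)
    out, hidden = rnn(x, hidden)  # hidden still references the previous batch's graph
    loss = out.sum()

    optimizer.zero_grad()
    loss.backward()               # fails on the second step: the first graph was already freed
    optimizer.step()

    # hidden = hidden.detach()    # uncommenting this cuts the graph between batches and avoids the error

Detaching (or re-initializing) the hidden state each batch keeps each backward call confined to the current batch's graph.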

How to register a dynamic backward hook on tensors in Pytorch?

I'm trying to register a backward hook on each neuron's weights in a network. By dynamic I mean that it will take a value and multiply the associated gradients by that value.
From here it seems like it's possible to register a hook on a tensor with a fixed value (though note that I need it to take a value that will change). From here it also seems like it's possible to register a hook on all of the parameters -- they use it to do gradient clipping (though note that I'm trying to only do it on each neuron's weights).
If my network is as follows:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3, 5)
        self.fc2 = nn.Linear(5, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return x
The first layer has 5 neurons, each with 3 associated weights. Hence, this layer should have 5 hooks that modify (i.e. change the current gradient by multiplying it) their 3 associated weight gradients during the backward step.
Training pseudo-code example:
net = Model()

for epoch in epochs:
    out = net(data)
    loss = criterion(out, target)

    optimizer.zero_grad()
    loss.backward()

    for hook in list_of_hooks:  # not sure if there's a more "pytorch" way of doing this without a for loop
        hook(random_value)

    optimizer.step()
What about exploiting lambdas' closure over names?
A short example:
import torch

net_params = torch.rand(5, 3, requires_grad=True)
msg = "Hello!"
net_params.register_hook(lambda g: print(msg))

out1 = net_params * 2.
loss = out1.sum()
loss.backward()  # Activates the hook and prints "Hello!"

msg = "How are you?"  # The lambda is affected by this change
out2 = net_params ** 4.
loss2 = out2.sum()
loss2.backward()  # Activates the hook again and prints "How are you?"
So a possible solution to your problem:
net = Model()

# Replace it with your computed values
rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)
net.fc1.weight.register_hook(lambda g: g * rand_values)

for epoch in epochs:
    out = net(data)
    loss = criterion(out, target)

    optimizer.zero_grad()
    loss.backward()  # fc1 gradients are multiplied by rand_values
    optimizer.step()

    # Update rand_values. The lambda computation will change accordingly
    rand_values = torch.rand(net.fc1.out_features, net.fc1.in_features)
Edit
To make things clearer, if you specifically want to multiply each set of weights i by a single value v_i, you can exploit broadcasting semantics: define values = torch.tensor([v0, v1, v2, v3, v4]).reshape(5, 1), and the lambda becomes lambda g: g * values. A short sketch of this variant follows.
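A minimal sketch of the per-neuron variant (the multiplier values here are placeholders, not values from the question):

import torch
import torch.nn as nn

fc1 = nn.Linear(3, 5)

# one multiplier per output neuron, broadcast across that neuron's 3 weights
values = torch.tensor([0.5, 1.0, 2.0, 0.1, 3.0]).reshape(5, 1)
fc1.weight.register_hook(lambda g: g * values)

out = fc1(torch.randn(4, 3)).sum()
out.backward()  # each row of fc1.weight.grad is scaled by its corresponding value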

Reinforcement learning converges for mean loss but not for each training data

Here I show a dummy example that represents my actual problem.
My neural network (NN) receives one input and gives the probabilities for two output nodes. The code for the NN is:
class Net(torch.nn.Module):
    def __init__(self, N, M):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(N, 4)
        self.fc2 = torch.nn.Linear(4, 4)
        self.fc3 = torch.nn.Linear(4, M)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = torch.softmax(self.fc3(x), 0)
        return x
The ABM class is our model. It iteratively calls Net::forward and, based on the probabilities, draws an action; if the action is the first index, it increments agent_count. The inputs xx are stored in states, which will later be used for the backward pass.
class ABM:
    def __init__(self, _nn, _t_data):
        self.nn = _nn
        self.iteration_n = _t_data.iteration_n
        self.target_value = _t_data.target_value

    def run(self):
        for jj in range(self.iteration_n):
            xx = self.generate_input()
            self.states.append(xx)  # store inputs
            ys = self.nn.forward(xx)
            action = self.draw(ys)
            if action == 0:
                self.agent_count += 1
        loss = self.calculateReward()
        return loss

    def generate_input(self):
        return torch.ones((1), requires_grad=True)
--some other attributes--
When the run is over, the error is calculated as error = (target_value - agent_count) / target_value, which is a value between -1 and 1.
In order to train the model, the error is applied to the probability of the first output node of the NN. This is meant to correct the NN so that it predicts the right probability for the first output. The code is:
class ABM:
    def calculateReward(self):
        error = (self.target_value - self.agent_count) / self.target_value
        reward = torch.tensor((-error), requires_grad=True)
        # since all states are the same, we just choose the first one
        state = self.states[0]
        ys = self.nn.forward(state)
        actionProb = ys[0]
        action_reward = actionProb * reward
        return action_reward
--some other members--
Two parameters of iteration_n and target_value used in the ABM are defined in the training data class as:
class Train:
def __init__(self,tt , tv):
self.iteration_n = tt
self.target_value = tv
target_value =0;
iteration_n=0;
The different parts of the code are tied together as follows:
#### start optimization ####
nn = Net(1, 2)
optimizer = optim.Adam(nn.parameters(), lr=0.01)

# create training data values
training_items = []
training_items.append(Train(1000, 800))
training_items.append(Train(500, 200))

error_record = []
for ii in range(100):
    print("############ start iteration #%d ################" % ii)
    for t_item in training_items:
        model = ABM(nn, t_item)
        loss = model.run()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        error_record.append(loss.item())
Now let's present the problem.
If I only define one training item, Train(1000, 800) (iteration number 1000, target value 800), the NN is optimized as expected.
However, with two training items defined, although the average error declines to zero, the error on each individual training item stays high.
Any idea how to solve this issue?
I have omitted some parts of the code here to make it more readable. The full running code is available on minimal ABM

I want to calculate cross entropy at all outputs of LSTM

I am writing a program for a classification problem using an LSTM.
However, I do not know how to calculate cross entropy over all the outputs of the LSTM.
Here is a part of my program:
cell_fw = tf.nn.rnn_cell.LSTMCell(num_hidden)
cell_bw = tf.nn.rnn_cell.LSTMCell(num_hidden)

outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs=inputs3,
                                             dtype=tf.float32, sequence_length=seq_len)
outputs = tf.concat(outputs, axis=2)
# outputs: [batch_size, max_timestep, num_hidden*2]
outputs = tf.reshape(outputs, [-1, num_hidden*2])

W = tf.Variable(tf.truncated_normal([num_hidden*2, num_classes], stddev=0.1))
b = tf.Variable(tf.constant(0., shape=[num_classes]))

logits = tf.matmul(outputs, W) + b
How can I apply a cross-entropy loss to this?
Should I repeat each batch's class label max_timestep times and calculate the error against that?
Have you looked at the softmax_cross_entropy documentation: https://www.tensorflow.org/api_docs/python/tf/losses/softmax_cross_entropy ?
The expected shape of onehot_labels should answer your question.
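As an illustration only (the labels tensor and its shape are assumptions, not taken from the question), one way to line the shapes up with the logits above could be:

# labels: int32 tensor of shape [batch_size, max_timestep], one class id per timestep
labels_flat = tf.reshape(labels, [-1])                      # [batch_size * max_timestep]
onehot_labels = tf.one_hot(labels_flat, depth=num_classes)  # [batch_size * max_timestep, num_classes]

# logits already has shape [batch_size * max_timestep, num_classes]
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)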

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations, so that the function converges (the slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

# The above functions are called below:
learning_rate = 0.0001
initial_b = 0  # initial y-intercept guess
initial_m = 0  # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b and m continue to update for all the iterations? After a certain number of iterations, the function will converge, i.e. we find the values of b and m that give slope = 0.
So why do we keep iterating after that point and keep updating b and m?
Aren't we losing the 'correct' b and m values this way? How does the learning rate help the convergence process if we continue to update values after converging? And why is there no check for convergence at all, so how does this actually work?
In practice, you will most likely never reach a slope of exactly 0. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. Conversely, if the learning rate is too low, learning becomes too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why, in machine learning, the learning rate is an important hyperparameter to tune.
Also, once we do reach a slope of 0, b_gradient and m_gradient become 0; thus, in
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m keep the old (correct) values, since nothing is subtracted from them.
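If one does want an explicit convergence check, a minimal sketch (the tolerance value and the early return are additions for illustration, not part of the original code) could look like:

def gradient_descent_runner(points, starting_b, starting_m, learning_rate,
                            num_iterations, tolerance=1e-9):
    b, m = starting_b, starting_m
    for i in range(num_iterations):
        new_b, new_m = step_gradient(b, m, array(points), learning_rate)
        # stop early once the parameters barely move, i.e. the gradients are ~0
        if abs(new_b - b) < tolerance and abs(new_m - m) < tolerance:
            return [new_b, new_m]
        b, m = new_b, new_m
    return [b, m]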
