Here is a toy model. I print the model parameters, call backward exactly once, then print the parameters again; they are unchanged. If I add the line model:updateParameters(<learning_rate>) after calling backward, I do see the parameters update.
But in the example code I've seen, for instance https://github.com/torch/demos/blob/master/train-a-digit-classifier/train-on-mnist.lua, nobody actually calls updateParameters. It also doesn't look like optim.sgd, optim.adam, or nn.StochasticGradient ever call updateParameters. What am I missing? How do the parameters get updated automatically? If I must call updateParameters, why does no example do that?
require 'nn'
require 'optim'
local model = nn.Sequential()
model:add(nn.Linear(4, 1, false))
local params, grads = model:getParameters()
local criterion = nn.MSECriterion()
local inputs = torch.randn(1, 4)
local labels = torch.Tensor{1}
print(params)
model:zeroGradParameters()
local output = model:forward(inputs)
local loss = criterion:forward(output, labels)
local dfdw = criterion:backward(output, labels)
model:backward(inputs, dfdw)
-- With the line below uncommented, the parameters are updated:
-- model:updateParameters(1000)
print(params)
backward() is not supposed to change the parameters; it merely computes the derivatives of the error function with respect to all of the parameters of the network.
In general, training is a sequence of steps:
repeat
   local output = model:forward(input)                  -- see what the model predicts
   local loss = criterion:forward(output, answer)       -- see how wrong it is
   local loss_grad = criterion:backward(output, answer) -- see where it is most wrong
   model:backward(input, loss_grad)     -- see how much each parameter is responsible for the error
   model:updateParameters(learningRate) -- fix the parameters based on their wrongness
   model:zeroGradParameters()           -- the parameters are different now, so the old gradients are of no use
until is_user_satisfied()
updateParameters implements the simplest optimization algorithm here (plain gradient descent).
If so inclined, you can use your own update function instead. In theory, you could loop explicitly over the network's parameter storages and update their values.
In practice, you usually call getParameters():
local model_parameters, model_parameters_gradient = model:getParameters()
which yields flattened tensors of all the parameter values and all the gradients. These tensors are views into the network's storage, so changes to them affect the network.
You may not know which entry of these tensors corresponds to which part of the network, but most optimizers do not care about that.
A sketch of optim.sgd usage looks like this:
local feval = function(new_parameters)
   -- compute the loss and fill model_parameters_gradient for the current parameters,
   -- then return both
   return loss, model_parameters_gradient
end
optim.sgd(feval, model_parameters, {learningRate = 0.01})
The specifics are covered in the linked demo, but what matters here is that the optimizer receives model_parameters as an argument, which gives it write access to the network. It is not explicitly stated in the documentation, but the source code shows that the optimizer changes the values of its input tensor in place (note also that it returns the same tensor it received).
Related
I am confused about how the gradient update works for the SimOTA label assignment part of YOLOX.
In Megvii's implementation, yolo_head.py has a get_losses function.
Part of that function calls the get_assignments function, which implements the SimOTA label assignment strategy described in the original YOLOX paper:
try:
    (
        gt_matched_classes,
        fg_mask,
        pred_ious_this_matching,
        matched_gt_inds,
        num_fg_img,
    ) = self.get_assignments(  # noqa
        batch_idx,
        num_gt,
        total_num_anchors,
        gt_bboxes_per_image,
        gt_classes,
        bboxes_preds_per_image,
        expanded_strides,
        x_shifts,
        y_shifts,
        cls_preds,
        bbox_preds,
        obj_preds,
        labels,
        imgs,
    )
My understanding is:
The get_assignments function has the @torch.no_grad() decorator, which prevents gradient calculation from taking place in this function during backpropagation.
(I believe) this means that the return values of get_assignments are treated as pre-computed constants, except that they vary for each image and ground-truth input.
The points above suggest that the network would be trying to learn from (paradoxically) ever-changing pre-computed "constants" for every image input, which does not seem to make much sense. Intuition tells me that any calculation that can vary across inputs and contributes to the loss should be differentiable and backpropagated through.
Is there something inaccurate in my understanding of the YOLOX architecture / how BP works?
Upon thinking over my question, I realized that the matching cost matrix obtained from dynamic_k_matching() (inside get_assignments) serves merely as another proxy ground-truth target. There is no reason to compute gradients within a function that creates a ground-truth target.
Since I’m a beginner in ML, this question or the design overall may sound silly, sorry about that. I’m open to any suggestions.
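For intuition, here is a minimal, self-contained PyTorch sketch (not the YOLOX code) showing that targets produced inside torch.no_grad() behave like constants while the loss stays differentiable with respect to the predictions:
import torch
import torch.nn.functional as F

preds = torch.randn(4, 2, requires_grad=True)

with torch.no_grad():
    # stand-in for get_assignments(): build targets from the current
    # predictions without recording the computation in the autograd graph
    targets = torch.softmax(preds, dim=1)

loss = F.mse_loss(preds, targets)
loss.backward()

print(preds.grad is not None)   # True  -> gradients still flow into the predictions
print(targets.requires_grad)    # False -> the assignment acts as a constant target
So even though the assignments change from image to image, backpropagation only ever sees them as momentarily fixed targets, just like ordinary ground-truth labels.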
I have a simple network with three linear layers, one of which is the output layer.
self.fc1 = nn.Linear(in_features=2, out_features=12)
self.fc2 = nn.Linear(in_features=12, out_features=16)
self.out = nn.Linear(in_features=16, out_features=4)
My states consist of two values, the coordinates x and y. That's why the input layer has two features.
In main.py I sample experiences from the ReplayMemory class, extract tensors from them, and pass them to the get_current function:
experiences = memory.sample(batch_size)
states, actions, rewards, next_states = qvalues.extract_tensors(experiences)
current_q_values = qvalues.QValues.get_current(policy_net, states, actions)
Since a single state consists of two values, the length of the states tensor is batch_size x 2, while the length of the actions tensor is batch_size. (Maybe that's the problem?)
When I pass “states” to my network in the get_current function to obtain the predicted Q-values for the states, I get this error:
size mismatch, m1: [1x16], m2: [2x12]
It looks like the network treats the states tensor as if it were a single state tensor. I don't want that. In the tutorials that I follow, they pass a states tensor which is a stack of multiple states, and there is no problem. What am I doing wrong? :)
This is how I store an experience:
memory.push(dqn.Experience(state, action, next_state, reward))
This is my extract tensors function:
def extract_tensors(experiences):
    # Convert batch of Experiences to Experience of batches
    batch = dqn.Experience(*zip(*experiences))
    state_batch = torch.cat(tuple(d[0] for d in experiences))
    action_batch = torch.cat(tuple(d[1] for d in experiences))
    reward_batch = torch.cat(tuple(d[2] for d in experiences))
    nextState_batch = torch.cat(tuple(d[3] for d in experiences))
    print(action_batch)
    return (state_batch, action_batch, reward_batch, nextState_batch)
The tutorial I follow belongs to this project:
https://github.com/nevenp/dqn_flappy_bird/blob/master/dqn.py
Look between lines 148 and 169, especially line 169, where the states batch is passed to the network.
SOLVED. It turned out that I didn't know how to properly create a 2D tensor.
A 2D tensor must be created like this:
states = torch.tensor([[1, 1], [2,2]], dtype=torch.float)
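In other words (a minimal sketch, not the tutorial's code), each stored state needs an explicit batch dimension, or the states have to be stacked, so that concatenation yields a [batch_size, 2] tensor rather than a flat one:
import torch

# states stored with shape [1, 2] concatenate into a proper batch
state_a = torch.tensor([[1.0, 1.0]])
state_b = torch.tensor([[2.0, 2.0]])
batch_cat = torch.cat((state_a, state_b))              # shape [2, 2]

# equivalent if states are kept as 1-D tensors of shape [2]
batch_stack = torch.stack((torch.tensor([1.0, 1.0]),
                           torch.tensor([2.0, 2.0])))  # shape [2, 2]

print(batch_cat.shape, batch_stack.shape)  # torch.Size([2, 2]) torch.Size([2, 2])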
In an autoregressive continuous problem, when zeros take up too much place, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of fitting f(x) directly, we want to fit g(x)*f(x), where f(x) is the function we want to approximate, i.e. y, and g(x) is a function that outputs a value between 0 and 1 depending on whether the value is zero or non-zero.
Currently, I have two models: one model which gives me g(x) and another model which fits g(x)*f(x).
The first model gives me a set of weights. This is where I need your help. I could use the sample_weight argument with model.fit(), but as I work with a tremendous amount of data, I need to use model.fit_generator(), and fit_generator() does not have a sample_weight argument.
Is there a workaround to use sample weights with fit_generator()? Otherwise, how can I fit g(x)*f(x), given that I already have a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
    while True:
        for i in range(num_batches):
            # get the i-th batch data
            inputs = ...
            targets = ...
            # get the sample weights
            weights = g.predict(inputs)
            yield inputs, targets, weights

model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)
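If you prefer keras.utils.Sequence (mentioned in the documentation quoted above, and safer with multiprocessing), the same idea applies: __getitem__ may return a three-element tuple. A rough sketch, where x_batches, y_batches, and the trained weighting model g are assumed to exist:
from keras.utils import Sequence

class WeightedSequence(Sequence):
    """Yields (inputs, targets, sample_weights) batches for fit_generator."""
    def __init__(self, x_batches, y_batches, g):
        self.x_batches = x_batches  # list/array of input batches (assumed)
        self.y_batches = y_batches  # list/array of target batches (assumed)
        self.g = g                  # trained model producing the sample weights

    def __len__(self):
        return len(self.x_batches)

    def __getitem__(self, i):
        inputs = self.x_batches[i]
        targets = self.y_batches[i]
        weights = self.g.predict(inputs)  # sample weights from g(x)
        return inputs, targets, weights

# model.fit_generator(WeightedSequence(x_batches, y_batches, g), ...)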
I was using TensorFlow input pipelines, like the CIFAR-10 model in TensorFlow, and tried to use tf.cond to do validation. I wrote something like this:
train_data = model.input(istrain=True)
val_data = model.input(istrain=False)
# This selects which stream to use.
select_val = tf.placeholder(dtype=bool,shape=[],name='select_test')
data = tf.cond(
    select_val,
    lambda: val_data,
    lambda: train_data
)
# Here is the model.
loss = ...
train_op = ...
...
with tf.Session():
    ...
If I delete the cond and just use the training data, the speed is 4000 samples/s; with the code above, it decreases to 2300 samples/s. The validation pipeline's capacity is set really small so it won't take too much GPU memory, and validation runs only rarely.
I'm not sure what is going wrong; please help me out.
tf.cond is not fully lazy. Any operation required by either branch of the cond will be run even if the branch that requires it is not the branch that gets executed. So in your case, both model.input(istrain=True) and model.input(istrain=False) are executed every time your data op is evaluated; the result of one of them is simply ignored.
The documentation for cond gives a minimal code example:
Note that the conditional execution applies only to the operations defined in fn1 and fn2. Consider the following simple program:
z = tf.multiply(a, b)
result = tf.cond(x < y, lambda: tf.add(x, z), lambda: tf.square(y))
If x < y, the tf.add operation will be executed and the tf.square operation will not be executed. Since z is needed for at least one branch of the cond, the tf.multiply operation is always executed, unconditionally. Although this behavior is consistent with the dataflow model of TensorFlow, it has occasionally surprised some users who expected lazier semantics.
Also note that this means that if your model.input pulls a set of data from a larger pool (say, a batch from an entire dataset), then each time the cond runs, data gets pulled from both the validation and the training pipelines, and one set is simply thrown away. This can cause problems more serious than inefficiency in some cases. For example, if you intend to process only a certain number of epochs, with this code you are not actually processing that number of epochs, because data was being pulled that was never used.
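To make the behaviour concrete, here is a small self-contained sketch (TF1-style graph mode, matching the code in the question) in which a stateful op defined outside the branches runs on every call, no matter which branch is selected:
import tensorflow as tf

counter = tf.Variable(0, dtype=tf.int32)
bump = tf.assign_add(counter, 1)          # defined OUTSIDE the cond branches
pred = tf.placeholder(tf.bool, shape=[])

# The true branch only references `bump`; the increment itself is not defined
# inside a branch, so it executes unconditionally whenever `result` is evaluated.
result = tf.cond(pred, lambda: bump, lambda: tf.constant(0))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(result, feed_dict={pred: False})   # the False branch is taken...
    print(sess.run(counter))                    # ...yet this prints 1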
I am trying to implement a neural network with multiple layers. I am trying to understand whether what I have done is correct and, if not, how to debug it. I define my neural network in the following manner (I initialise the LookupTable layer with some previously learned embeddings):
lookupTableLayer = nn.LookupTable(vector:size()[1], d)
for i = 1, vector:size()[1] do
   lookupTableLayer.weight[i] = vector[i]
end

mlp = nn.Sequential()
mlp:add(lookupTableLayer)
mlp:add(nn.TemporalConvolution(d, H, K, dw))
mlp:add(nn.Tanh())
mlp:add(nn.Max(1))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(H, d))
Now, to train the network, I loop over every training example and for each example call gradUpdate(), which has this code (taken straight from the examples):
function gradUpdate(mlp, x, indexY, learningRate)
   local pred = mlp:forward(x)
   local gradCriterion = findGrad(pred, indexY)
   mlp:zeroGradParameters()
   mlp:backward(x, gradCriterion)
   mlp:updateParameters(learningRate)
end
The findGrad function is just an implementation of the WARP loss, which returns the gradient with respect to the output. Is this all I need? I assume this will backpropagate through and update the parameters of all the layers. To check, I trained this network and saved the model. Then I loaded the model and did:
{load saved mlp after training}
lookuptable = mlp:findModules('nn.LookupTable')[1]
Now, I checked vector[1] and lookuptable.weight[1] and they were the same. I can't understand why the weights in the lookup table layer did not get updated. What am I missing here?
Looking forward to your replies!