How to handle gradients when training two sub-graphs simultaneously

The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the models repository, based on the seq2seq class). This trains well.
Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state into a further sub-graph which I call "prices" (see below). The training gradients of this sub-graph backpropagate not only through this additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).
The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now during training when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I train by executing the training in the following way (pseudo-code):
if global_step % 10 == 0:
    execute_the_price_training_code
else:
    execute_the_decoder_training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder part converges MUCH more slowly than if I ONLY train this part and never train the prices sub-graph.
My question is: I should be able to train both sub-graphs simultaneously. But I probably have to rescale the gradients flowing back into the hidden state at end of encoding, since there we get gradients from the prices sub-graph AND from the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training part of the code.
This is the (almost original) training-op preparation:
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.AdadeltaOptimizer(self.learning_rate)
    for bucket_id in xrange(len(buckets)):
        tf.scalar_summary("seq2seq loss", self.losses[bucket_id])
        gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq),
                                                global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
    # First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.relu(self.output_price_first_layer)

    # Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")
    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price), B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    # Remember the prices trainables
    var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all = tf.trainable_variables()

    # Backprop
    self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)
Thanks a bunch.

I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
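For illustration, here is a minimal sketch of that per-bucket training op with the two losses combined, in the same TF 1.x style as the question; price_loss_weight is a hypothetical scaling factor to tune, not something taken from the question:
for bucket_id in xrange(len(buckets)):
    # hypothetical weight balancing the prices loss against the seq2seq loss
    price_loss_weight = 0.1
    total_loss = self.losses[bucket_id] + price_loss_weight * self.loss_price_scalar
    gradients = tf.gradients(total_loss, var_list_all)
    clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
    self.gradient_norms.append(norm)
    self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_all),
                                            global_step=self.global_step))
Running self.updates[bucket_id] would then apply one consistent set of gradients to the encoder, decoder, and prices variables in a single step.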

Related

Is loss.backward() meant to be called on each sample or on each batch?

I have a training dataset which contains features of different sizes. I understand the implications of this in terms of network architecture and have designed my network accordingly to handle these heterogeneous shapes. When it comes to my training loop, though, I'm confused as to the order/placement of optimizer.zero_grad(), loss.backward(), and optimizer.step().
Because of the unequal feature sizes, I cannot do a forward pass on the features of a whole batch at once. So my training loop iterates over the samples of a batch manually, like this:
for epoch in range(NUM_EPOCHS):
    for bidx, batch in enumerate(train_loader):
        optimizer.zero_grad()
        batch_loss = 0
        for sample in batch:
            feature1 = sample['feature1']
            feature2 = sample['feature2']
            label1 = sample['label1']
            label2 = sample['label2']
            pred_l1, pred_l2 = model(feature1, feature2)
            sample_loss = compute_loss(label1, pred_l1)
            sample_loss += compute_loss(label2, pred_l2)
            sample_loss.backward()  # CHOICE 1
            batch_loss += sample_loss.item()
            # batch_loss.backward()  # CHOICE 2
        optimizer.step()
I'm wondering if it makes sense here that backward is called upon each sample_loss with the optimizer step called every BATCH_SIZE samples (CHOICE 1). The alternative, I think, would be to call backward upon batch_loss (CHOICE 2) and I'm not so sure which is the right choice.
Differentiation is a linear operation, so in theory it should not matter whether you first differentiate the different losses and add their derivatives or whether you first add the losses and then compute the derivative of their sum.
So for practical purposes both of them should lead to the same results (disregarding the usual floating-point issues).
You might get slightly different memory requirements and computation speeds (I'd guess the second version might be slightly faster), but that is hard to predict and something you can easily find out by timing the two versions.
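As a side note, for CHOICE 2 to actually run, batch_loss has to remain a tensor; in the question's loop, .item() turns it into a plain float, which has no backward(). A minimal sketch of that variant, reusing the names from the question:
optimizer.zero_grad()
batch_loss = 0.0
for sample in batch:
    pred_l1, pred_l2 = model(sample['feature1'], sample['feature2'])
    sample_loss = compute_loss(sample['label1'], pred_l1) + compute_loss(sample['label2'], pred_l2)
    batch_loss = batch_loss + sample_loss   # keep it a tensor; do not call .item() here
batch_loss.backward()                       # one backward pass through the accumulated graph
optimizer.step()
Note that this keeps every sample's computation graph in memory until the single backward call, so it can use more memory than CHOICE 1.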

Using optim.step() with PyTorch's DataLoader

Usually the learning cycle contains:
optim.zero_grad()
loss(m, op).backward()
optim.step()
But what should the cycle be when the data does not fit on the graphics card?
First option:
for ip, op in DataLoader(TensorDataset(inputs, outputs),
                         batch_size=int(1e4), pin_memory=True):
    m = model(ip.to(dev))
    op = op.to(dev)
    optim.zero_grad()
    loss(m, op).backward()
    optim.step()
Second option:
optim.zero_grad()
for ip, op in DataLoader(TensorDataset(inputs, outputs),
                         batch_size=int(1e4), pin_memory=True):
    m = model(ip.to(dev))
    op = op.to(dev)
    loss(m, op).backward()
optim.step()
The third option:
Accumulate gradients after calling backward().
The first option is correct and corresponds to mini-batch gradient descent: the gradients are zeroed, computed, and applied once per batch.
The second option zeroes the gradients only once, while backward() is still called for every batch, so the gradients of all batches accumulate and the single optim.step() at the end applies just one update per pass over the data. That is effectively full-batch gradient descent with one update per epoch, which typically converges far more slowly than updating after every batch.
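The third option from the question, accumulating gradients and stepping only every few batches, would look roughly like this sketch (ACCUM_STEPS is a hypothetical accumulation window, not something from the question):
ACCUM_STEPS = 4  # hypothetical number of batches to accumulate before each update
optim.zero_grad()
for i, (ip, op) in enumerate(DataLoader(TensorDataset(inputs, outputs),
                                        batch_size=int(1e4), pin_memory=True)):
    m = model(ip.to(dev))
    loss(m, op.to(dev)).backward()   # gradients keep accumulating in .grad
    if (i + 1) % ACCUM_STEPS == 0:
        optim.step()                 # update from the accumulated gradients
        optim.zero_grad()
This behaves like training with an effective batch size of ACCUM_STEPS times the loader's batch_size.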
The proper way to train a model using Stochastic Gradient Descent (SGD) is to follow these steps:
1. Instantiate a model and randomly init its weights. This is done only once.
2. Instantiate the dataset and the dataloader, defining an appropriate batch_size.
3. Iterate over all examples, batch by batch. At each iteration:
3.a Compute a stochastic estimate of the loss using only a batch, rather than the entire set (aka "forward pass")
3.b Compute the gradient of the loss w.r.t. the model's parameters (aka "backward pass")
3.c Update the weights based on the current gradient
This is how the code should look:
model = MyModel(...)  # instantiate a model once
dl = DataLoader(TensorDataset(inputs, outputs), batch_size=int(1e4), pin_memory=True)
for ei in range(num_epochs):
    for ip, op in dl:
        optim.zero_grad()
        predict = model(ip.to(dev))            # forward pass
        loss = criterion(predict, op.to(dev))  # estimate current loss
        loss.backward()                        # backward pass - propagate gradients
        optim.step()                           # update the weights based on current batch
Note that during training you iterate several times over the entire training set. Each such iteration is usually referred to as an "epoch".

Low confidence score in SVC for example from training set

Here is my code for the SVC classifier.
vectorizer = TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(training_data)
classifier_linear = svm.LinearSVC()
clf = CalibratedClassifierCV(classifier_linear)
linear_svc_model = clf.fit(train_vectors, train_labels)
training_data here is a list of English sentences and train_labels are the labels associated with them. I do the usual stopword removal and some preprocessing before creating the final version of training_data. Here is my testing code:
test_lables = ["no"]
test_vectors = vectorizer.transform(test_lables)
prediction_linear = clf.predict_proba(test_vectors)
counter = 0
class_probability = {}
lables = []
for item in train_labels:
    if item in lables:
        continue
    else:
        lables.append(item)
for val in np.nditer(prediction_linear):
    new_val = val.item(0)
    class_probability[lables[counter]] = new_val
    counter = counter + 1
sorted_class_probability = sorted(class_probability.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_class_probability)
Now, when I run the code with a phrase that is already in the training set (the word 'no' in this case), it is identified correctly, but the confidence score is still below 0.9. The output is as follows:
[('no', 0.8474342514152964), ('hi', 0.06830103628879058), ('thanks', 0.03070201906552546), ('confused', 0.02647134535600733), ('ok', 0.015857384248465656), ('yes', 0.005961945963546264), ('bye', 0.005272017662368208)]
From what I have read online, the confidence score for data already in the training set is usually close to 1, and the rest are negligible. What can I do to get a better confidence score? Should I be worried that if I add more classes, the confidence score will dip further and it will become difficult to point out one standout class with certainty?
As long as your scores help you classify your inputs correctly, you shouldn't worry at all. If anything, if your confidence on an input already in your training data is too high, that probably means your method has overfit to the data and cannot generalize to unseen data.
However, you can tune the complexity of your method by changing the penalization parameters. In the case of a LinearSVC, you have both the penalty and the C parameter. Try different values of those two and observe the effect. Make sure you also observe the effect on an unseen test set.
Just a note that the values of C should be spaced exponentially, e.g. [0.001, 0.01, 0.1, 1, 10, 100, 1000], for you to see meaningful effects.
The SGDClassifier may be relevant to your case if you're interested in such linear models and tuning your parameters.
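If it helps, here is a rough sketch of such a sweep using scikit-learn's GridSearchCV (the calibration wrapper from the question is left out for brevity; train_vectors and train_labels are the objects from the question):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Try C over an exponential grid and score on held-out folds.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(LinearSVC(), param_grid, cv=5)
search.fit(train_vectors, train_labels)
print(search.best_params_, search.best_score_)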

How the generator is trained with the output of the discriminator in Generative Adversarial Networks

Recently I have learned about Generative Adversarial Networks.
For training the Generator, I am somewhat confused about how it learns. Here is an implementation of GANs:
# train generator
z = Variable(xp.random.uniform(-1, 1, (batchsize, nz), dtype=np.float32))
x = gen(z)
yl = dis(x)
L_gen = F.softmax_cross_entropy(yl, Variable(xp.zeros(batchsize, dtype=np.int32)))
L_dis = F.softmax_cross_entropy(yl, Variable(xp.ones(batchsize, dtype=np.int32)))

# train discriminator
x2 = Variable(cuda.to_gpu(x2))
yl2 = dis(x2)
L_dis += F.softmax_cross_entropy(yl2, Variable(xp.zeros(batchsize, dtype=np.int32)))
#print "forward done"

o_gen.zero_grads()
L_gen.backward()
o_gen.update()

o_dis.zero_grads()
L_dis.backward()
o_dis.update()
So it computes a loss for the Generator, as mentioned in the paper.
However, it calls the Generator's backward function based on the Discriminator's output, and the Discriminator's output is just a number (not an array).
But we know that, in general, to train a network we compute a loss function at the last layer (a loss between the last layer's output and the target output) and then compute the gradients. So for example, if the output is 64*64, we compare it with a 64*64 image, compute the loss, and do the backpropagation.
However, in the code I see for Generative Adversarial Networks, the Generator's loss is computed from the Discriminator's output (which is just a number) and then backpropagation is called for the Generator. The Generator's last layer is, for example, 64*64 pixels, but the loss is 1*1 (which is different from the usual networks). So I do not understand how this causes the Generator to learn and be trained.
I thought that if we attach the two networks (the Generator and the Discriminator), call backpropagation, and then update only the Generator's parameters, it would make sense and it should work. But what I see in the code is totally different.
So I am asking how this is possible?
Thanks
You say 'However, it calls the Generator backward function based on the Discriminator output. The discriminator output is just a number (not an array)', but the loss is always a scalar value; when we compute the mean squared error of two images, that is also a scalar value.
L_adversarial = E[log D(x)] + E[log(1 − D(G(z)))]
x is drawn from the real data distribution.
z is drawn from the latent distribution and is transformed by the Generator.
Coming back to your actual question: the Discriminator network has a sigmoid activation in its last layer, which means it outputs values in the range [0,1]. The Discriminator tries to maximize this objective by maximizing both terms in the sum. The maximum value of the first term is 0 and occurs when D(x) is 1, and the maximum value of the second term is also 0 and occurs when 1−D(G(z)) is 1, which means D(G(z)) is 0. So the Discriminator performs a binary classification by maximizing this objective: it tries to output 1 when it is fed x (real data) and 0 when it is fed G(z) (generated fake data).
The Generator, on the other hand, tries to minimize this objective; in other words, it tries to fool the Discriminator by generating fake samples that are similar to real samples. Over time both the Generator and the Discriminator get better and better. This is the intuition behind GANs.
The code below is in PyTorch:
bce_loss = nn.BCELoss()  # bce_loss = -y*log(y_hat) - (1-y)*log(1-y_hat) [similar to L_adversarial]
Discriminator = .....               # some network
Generator = .....                   # some network
optimizer_generator = .......       # some optimizer for the generator network
optimizer_discriminator = .......   # some optimizer for the discriminator network
z = ......                          # some latent data distribution that is transformed by the generator
real = .....                        # real data distribution

#####################
#Update Discriminator
#####################
optimizer_discriminator.zero_grad()
fake = Generator(z).detach()        # detach so this step does not touch the generator's gradients
fake_prediction = Discriminator(fake)
real_prediction = Discriminator(real)
discriminator_loss = bce_loss(fake_prediction, torch.zeros(batch_size)) \
                   + bce_loss(real_prediction, torch.ones(batch_size))
discriminator_loss.backward()
optimizer_discriminator.step()

#################
#Update Generator
#################
optimizer_generator.zero_grad()
fake = Generator(z)
fake_prediction = Discriminator(fake)
generator_loss = bce_loss(fake_prediction, torch.ones(batch_size))  # the generator wants D(G(z)) -> 1
generator_loss.backward()
optimizer_generator.step()

How do I perform a differentiable operation selection in TensorFlow?

I am trying to produce a mathematical-operation-selection NN model which is based on a scalar input. The operation is selected based on the softmax result produced by the NN. This operation then has to be applied to the scalar input in order to produce the final output. So far I have come up with applying argmax and one-hot to the softmax output in order to produce a mask, which is then applied to the concatenated values matrix of all the possible operations (as shown in the pseudo-code below). The issue is that neither argmax nor one-hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention, described in this paper. Hard attention is used in image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable, but there are two ways to work around this:
1- Use Reinforcement Learning (RL): RL is made for training models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can treat the loss as a penalty and send to the node with the maximum value in the softmax layer a policy gradient proportional to the penalty, in order to decrease the score of the decision if it was bad (i.e. it resulted in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:
output = values * mask
Use:
output = values * softmax
Now the contribution of each operation is scaled toward zero according to how little the softmax selects it. This is easier to train than RL, but it won't work if you must completely remove the non-selected operations from the final result (i.e. set them to exactly zero).
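For completeness, a small sketch of the soft variant, assuming values and softmax both have shape [batch_size, num_of_operations] and that a single combined value per example is wanted:
# Soft selection: weight every operation's output by its softmax score and sum.
weighted = values * softmax              # [batch_size, num_of_operations]
output = tf.reduce_sum(weighted, 1)      # [batch_size]
output = tf.expand_dims(output, 1)       # [batch_size, 1], if a 2-D output is needed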
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
