Tweaking the Loss before the Optimizer Step

Tweaking the Loss before the Optimizer Step - machine-learning

I want to add an extra operation before running the AdamOptimizer operation on my loss, so as to help the model deal with repetitions in my data. The relevant code snippet looks something like this:
loss = tf.nn.softmax_cross_entropy_with_logits(logits=predLogits, labels=actLabels)
loss = tf.reshape(loss, [batchsize, -1])
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
lossPost = loss - repMask
train_step = tf.train.AdamOptimizer(LR).minimize(lossPost)
So, in other words, instead of minimizing loss, I want AdamOptimizer to minimize its slightly tweaked version, which is lossPost. I then train the model in the usual way:
_ = sess.run([train_step], feed_dict=feed_dict)
I noticed that adding this workaround of minimizing lossPost instead of loss has no impact on the accuracy of the model. The model produces the exact same output with or without this workaround. It seems that it continues to optimize the original, unmodified loss. Why is this the case?
My original approach was to perform this tweak at the softmax_cross_entropy_with_logits step, by using the weighted_cross_entropy_with_logits instead, but I have an extra complication there, since there is an extra dimension of Vocabulary (this is a character-level-style model). So I thought it would be easier to do this afterwords, and as long as it's done prior to the optimization step it should be doable?

In your model it seems like X and Y are constants (that is, they depend only on the data). In this case repMask is also constant, as it is defined by
repMask = tf.sqrt(tf.cast(tf.abs(tf.subtract(tf.cast(Y, tf.int64), tf.cast(X, tf.int64))), tf.float32))
Hence loss and lossPost differ by constant value, and this has no effect on the minimization process (it is like finding x that minimizes x^2-1 vs x that minimizes x^2-5. Both x are the same).

Related

Force a neural network to have 0-sum outputs

I have a pytorch neural net with n-dimensional output which I want to have 0-sum during training (my training data, i.e. the true outputs, have 0 sum). Of course I could just add a line computing the sum s and then subtract s/n from each element of the output. But this way, the network would be driven even less to actually finding outputs with zero sum, as this would get taken care of anyways (I've been getting worse test results with this approach). Also, as the true outputs in the training data have 0 sum, obviously the network converges to having almost 0 sum outputs, but not quite. Hence, I was wondering whether there is a smart way to force the network to have outputs that sum to 0, without just brute-force subtracting the sum in the end (which would corrupt learning outputs to have sum 0)? I.e. some sort of solution directly incorporated in the network? (Probably there isn't, at least I couldn't think of any...)

Your approach with "explicitly substracting the mean" is the correct way. The same way we use softmax to nicely parametrise distributions, and you could complain that "this makes the network not learn about probability even more!", but in fact it does, it simply does so in its own, unnormalised space. Same in your case - by subtracting the mean you make sure that you match the target variable while allowing your network to focus on hard problems, and not waste its compute on having to learn that the sum is zero. If you do anything else your network will literally have to learn to compute the mean somewhere and subtract it. There are some potential corner cases where there might be some deep representational reason for mean to be zero that could be argues for, but these cases are rare enough that chances that this is actually happening "magically" in the network are zero (and if you knew it was happening there would be better ways of targeting it than by zero ensuring).

What happens if you add an explicit loss?
pred = model(input)
original_loss = criterion(pred, target)
# add this loss
zero_sum_loss = pred.mean() ** 2
loss = original_loss + weight * zero_sum_loss
loss.backward()
optim.step()
# ...

Is loss.backward() meant to be called on each sample or on each batch?

I have a training dataset which contains features of different sizes. I understand the implications of this in terms of network architecture and have designed my network accordingly to handle these heterogeneous shapes. When it comes to my training loop, though, I'm confused as to the order/placement of optimizer.zero_grad(), loss.backward(), and optimizer.step().
Because of the unequal feature sizes, I cannot do forward pass upon features of a batch at the same time. So, my training loop loops through samples of a batch manually, like this:
for epoch in range(NUM_EPOCHS):
for bidx, batch in enumerate(train_loader):
optimizer.zero_grad()
batch_loss = 0
for sample in batch:
feature1 = sample['feature1']
feature2 = sample['feature2']
label1 = sample['label1']
label2 = sample['label2']
pred_l1, pred_l2 = model(feature1, feature2)
sample_loss = compute_loss(label1, pred_l1)
sample_loss += compute_loss(label2, pred_l2)
sample_loss.backward() # CHOICE 1
batch_loss += sample_loss.item()
# batch_loss.backward() # CHOICE 2
optimizer.step()
I'm wondering if it makes sense here that backward is called upon each sample_loss with the optimizer step called every BATCH_SIZE samples (CHOICE 1). The alternative, I think, would be to call backward upon batch_loss (CHOICE 2) and I'm not so sure which is the right choice.

Differentiation is a linear operation, so in theory it should not matter whether you first differentiate the different losses and add their derivatives or whether you first add the losses and then compute the derivative of their sum.
So for practical purposes both of them should lead to the same results (disregarding to the usual floating point issues).
You might get a slightly different memory requirements and computation speeds (I'd guess the second version might be slightly faster.), but that is hard to predict but something that you can easily find out by timing the two versions.

How do I perform a differentiable operation selection in TensorFlow?

I am trying to produce a mathematical operation selection nn model, which is based on the scalar input. The operation is selected based on the softmax result which is produce by the nn. Then this operation has to be applied to the scalar input in order to produce the final output. So far I’ve come up with applying argmax and onehot on the softmax output in order to produce a mask which then is applied on the concated values matrix from all the possible operations to be performed (as show in the pseudo code below). The issue is that neither argmax nor onehot appears to be differentiable. I am new to this, so any would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_2(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask

I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in Image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable but there are 2 ways to go around this:
1- Use Reinforcement Learning (RL): RL is made to train models that makes decisions. Even though, the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as penalty, and send to the node, with the maximum value in the softmax layer, a policy gradient proportional to the penalty in order to decrease the score of the decision if it was bad (results in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. so instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations will converge down to zero based on how much the softmax will not select them. This is easier to train compared to RL but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290

How does batching interact with the loss function in TensorFlow?

I'm training a multi-objective neural net in TensorFlow with my own loss function and can't find documentation regarding how batching interacts with that functionality.
For example, I have snippet of my loss function below, which takes the tensor/list of predictions and makes sure that their absolute value sums to no more than one:
def fitness(predictions,actual):
absTensor = tf.abs(predictions)
sumTensor = tf.reduce_sum(absTensor)
oneTensor = tf.constant(1.0)
isGTOne = tf.greater(sumTensor,oneTensor)
def norm(): return predictions/sumTensor
def unchanged(): return predictions
predictions = tf.cond(isGTOne,norm,unchanged)
etc...
But when I'm passing in a batch of estimates I feel like this loss function is normalising the whole set of inputs to sum to 1 at this point, rather than each individual set summing to 1. I.e.
[[.8,.8],[.8,.8]] -> [[.25,.25],[.25,25]]
rather than the desired
[[.8,.8],[.8,.8]] -> [[.5,.5],[.5,.5]]
Can anybody clarify or put to rest my suspicions? If this is how my function is currently working, how do I change that?

You must specify a reduction axis for reduction ops, otherwise all axes will be reduces. Traditionally this is the first dimension of your tensor. So, line 2 should look like this:
sumTensor = tf.reduce_sum(absTensor, 0)
After you make that change you will run into another problem. sumTensor will no longer be a scalar and will thus no longer make sense as a condition for tf.cond (i.e. what does it mean to branch per entry of a batch?). What you really want is tf.select since you don't really want to branch logic per batch entry. Like this:
isGTOne = tf.greater(sumTensor,oneTensor)
norm = predictions/sumTensor
predictions = tf.select(isGTOne,norm,predictions)
But, looking at this now, I wouldn't even bother conditionally normalizing the entries. Since you are operating at the granularity of a batch now, I don't think you can gain performance from normalizing an entry of a batch one at a time. Especially, since dividing is not really an expensive side effect. Might as well just do:
def fitness(predictions,actual):
absTensor = tf.abs(predictions)
sumTensor = tf.reduce_sum(absTensor, 0)
predictions = predictions/sumTensor
etc...
Hope that helps!

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
temp_X = X(1:i,:);
temp_y = y(1:i);
% Initialize Theta
initial_theta = zeros(size(X, 2), 1);
% Create "short hand" for the cost function to be minimized
costFunction = #(t) linearRegCostFunction(X, y, t, lambda);
% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 50, 'GradObj', 'on');
% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);
[J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
error_train = [error_train; [i J]];
[J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
error_val = [error_val; [i J]];
fprintf('%s %6i examples \r', S, i);
fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.

Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obviously it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart