Why is my network changing without me calling the optimizer? - save

I've got a network in a reinforcement learning program where it plays against itself and updates to improve a valuation and policy. If the game is a draw or longer than a threshold the game is thrown out and the loss is never calculated and the optimizer is never called. Still, even if it does not update it still saves the network. It does this and loads the data with these functions
def load_net(i):
if not os.path.exists('C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i)):
return None
nn = net.Net()
nn.load_state_dict(torch.load('C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i)))
nn.double()
nn.train()
print('loaded net {}'.format(i))
return nn
def load_latest_net():
i = 0
while os.path.exists('C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i)):
i += 1
i -= 1
nn = net.Net()
nn.load_state_dict(torch.load('C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i)))
nn.double()
nn.train()
print('loaded latest net {}'.format(i))
return nn
def save_net_as_latest(net):
i = 0
while os.path.exists('C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i)):
i += 1
print('saving net as {}'.format(i))
torch.save(net.state_dict(), 'C:\\Users\\user\\Desktop\\python\\refactor\\nets\\net{}.pth'.format(i))
If I compare the output of Networks 18 and 19, say, that have not been updated inbetween, I will get a slightly different result. By the order of like 4 decimal places. I have no idea why. Besides the training which im positive is not happening here there are only calls such as
output = model(input)
I may have some redundant or unnecessary .double() calls. Otherwise I'm perplexed. Here is some output that shows what im talking about
loaded net 31
loaded net 24
0.03457238742588508
{(256, 0): 0.1320085660843816, (256, 1): 0.137342432039808, (16384, 0): 0.17418085554435797, (16384, 1): 0.10271324890344476, (1048576, 0): 0.19074605973319944, (1048576, 1): 0.16643130291875174, (67108864, 0): 0.09657753477605639}
-0.04351635692025382
{(256, 0): 0.12781107650911216, (256, 1): 0.1366737843176031, (16384, 0): 0.17335968033226215, (16384, 1): 0.10237753363808048, (1048576, 0): 0.19300295476929272, (1048576, 1): 0.1690421767620045, (67108864, 0): 0.09773279367164493}

Related

How to handle imbalanced multi-label dataset?

I am currently trying to train an image classification model using Pytorch densenet121 with 4 labels (A, B, C, D). I have 224000 images and each image is labeled in the form of [1, 0, 0, 1] (Label A and D are present in the image). I have replaced the last dense layer of densenet121. The model is trained using Adam optimizer, LR of 0.0001 (with the decay of a factor of 10 per epoch), and trained for 4 epochs. I will try more epochs after I am confident that the class imbalanced issue is resolved.
The estimated number of positive classes is [19000, 65000, 38000, 105000] respectively. When I trained the model without class balancing and weights (with BCELoss), I have a very low recall for label A and C (in fact the True Positive TP and False Positive FP is less than 20)
I have tried 3 approaches to deal with the class imbalance after an extensive search on Google and Stackoverflow.
Approach 1: Class weights
I have tried to implement class weights by using the ratio of negative samples to positive samples.
y = train_df[CLASSES];
pos_weight = (y==0).sum()/(y==1).sum()
pos_weight = torch.Tensor(pos_weight)
if torch.cuda.is_available():
pos_weight = pos_weight.cuda()
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
The resultant class weights are [10.79, 2.45, 4.90, 1.13]. I am getting the opposite effect; having too many positive predictions which result in low precision.
Approach 2: Changing logic for class weights
I have also tried to get class weights by getting the proportion of the positive samples in the dataset and getting the inverse. The resultant class weights are [11.95, 3.49, 5.97, 2.16]. I am still getting too many positive predictions.
class_dist = y.apply(pd.Series.value_counts)
class_dist_norm = class_dist.loc[1.0]/class_dist.loc[1.0].sum()
pos_weight = 1/class_dist_norm
Approach 3: Focal Loss
I have also tried Focal Loss with the following implementation (but still getting too many positive predictions). I have used the class weights for the alpha parameter. This is referenced from https://gist.github.com/f1recracker/0f564fd48f15a58f4b92b3eb3879149b but I made some modifications to suit my use case better.
class FocalLoss(nn.CrossEntropyLoss):
''' Focal loss for classification tasks on imbalanced datasets '''
def __init__(self, alpha=None, gamma=1.5, ignore_index=-100, reduction='mean', epsilon=1e-6):
super().__init__(weight=alpha, ignore_index=ignore_index, reduction='mean')
self.reduction = reduction
self.gamma = gamma
self.epsilon = epsilon
self.alpha = alpha
def forward(self, input_, target):
# cross_entropy = super().forward(input_, target)
# Temporarily mask out ignore index to '0' for valid gather-indices input.
# This won't contribute final loss as the cross_entropy contribution
# for these would be zero.
target = target * (target != self.ignore_index).long()
# p_t = p if target = 1, p_t = (1-p) if target = 0, where p is the probability of predicting target = 1
p_t = input_ * target + (1 - input_) * (1 - target)
# Loss = -(alpha)( 1 - p_t)^gamma log(p_t), where -log(p_t) is cross entropy => loss = (alpha)(1-p_t)^gamma * cross_entropy (Epsilon added to prevent error with log(0) when class probability is 0)
if self.alpha != None:
loss = -1 * self.alpha * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)
else:
loss = -1 * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)
if self.reduction == 'mean':
return torch.mean(loss)
elif self.reduction == 'sum':
return torch.sum(loss)
else:
return loss
One thing to note is that the loss using stagnant after the first epoch, but the metrics varied between epochs.
I have considered undersampling and oversampling but I am unsure of how to proceed due to the fact that each image can have more than 1 label. One possible method is to oversample images with only 1 label by replicating them. But I am concerned that the model will only generalize on images with 1 label but perform poorly on images with multiple labels.
Therefore I would like to ask if there are methods that I should try, or did I make any mistakes in my approaches.
Any advice will be greatly appreciated.
Thank you!

is binary cross entropy an additive function?

I am trying to train a machine learning model where the loss function is binary cross entropy, because of gpu limitations i can only do batch size of 4 and i'm having lot of spikes in the loss graph. So I'm thinking to back-propagate after some predefined batch size(>4). So it's like i'll do 10 iterations of batch size 4 store the losses, after 10th iteration add the losses and back-propagate. will it be similar to batch size of 40.
TL;DR
f(a+b) = f(a)+f(b) is it true for binary cross entropy?
f(a+b) = f(a) + f(b) doesn't seem to be what you're after. This would imply that BCELoss is additive which it clearly isn't. I think what you really care about is if for some index i
# false
f(x, y) == f(x[:i], y[:i]) + f([i:], y[i:])
is true?
The short answer is no, because you're missing some scale factors. What you probably want is the following identity
# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])
where b is the total batch size.
This identity is used as motivation for the gradient accumulation method (see below). Also, this identity applies to any objective function which returns an average loss across each batch element, not just BCE.
Caveat/Pitfall: Keep in mind that batch norm will not behave exactly the same when using this approach since it updates its internal statistics based on batch size during the forward pass.
We can actually do a little better memory-wise than just computing the loss as a sum followed by backpropagation. Instead we can compute the gradient of each component in the equivalent sum individually and allow the gradients to accumulate. To better explain I'll give some examples of equivalent operations
Consider the following model
import torch
import torch.nn as nn
import torch.nn.functional as F
class MyModel(nn.Module):
def __init__(self):
super().__init__()
num_outputs = 5
# assume input shape is 10x10
self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
self.fc_layer = nn.Linear(10*5*5, num_outputs)
def forward(self, x):
x = self.conv_layer(x)
x = F.max_pool2d(x, 2, 2, 0, 1, False, False)
x = F.relu(x)
x = self.fc_layer(x.flatten(start_dim=1))
x = torch.sigmoid(x) # or omit this and use BCEWithLogitsLoss instead of BCELoss
return x
# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()
# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()
# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
and lets say our data and targets for a single batch are
torch.manual_seed(1) # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
targets = torch.randint(0, 1, (batch_size, 20)).float()
Full batch
The body of our training loop for an entire batch may look something like this
# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Weighted sum of loss on sub-batches
We could have computed this using the sum of multiple loss functions using
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
loss = 0
for sub_batch_idx in range(num_sub_batches):
start_idx = sub_batch_size * sub_batch_idx
end_idx = start_idx + sub_batch_size
sub_input = input_data[start_idx:end_idx]
sub_targets = targets[start_idx:end_idx]
sub_output = model(sub_input)
# add loss component for sub_batch
loss = loss + objective(sub_output, sub_targets) / num_sub_batches
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Gradient accumulation
The problem with the previous approach is that in order to apply back-propagation, pytorch needs to store intermediate results of layers in memory for every sub-batch. This ends up requiring a relatively large amount of memory and you may still run into memory consumption issues.
To alleviate this problem, instead of computing a single loss and performing back-propagation once, we could perform gradient accumulation. This gives equivalent results of the previous version. The difference here is that we instead perform a backward pass on each component of
the loss, only stepping the optimizer once all of them have been backpropagated. This way the computation graph is cleared after each sub-batch which will help with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients to the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not during or after.
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
# Important! zero the gradients before the loop
optimizer.zero_grad()
loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
start_idx = sub_batch_size * sub_batch_idx
end_idx = start_idx + sub_batch_size
sub_input = input_data[start_idx:end_idx]
sub_targets = targets[start_idx:end_idx]
sub_output = model(sub_input)
# compute loss component for sub_batch
sub_loss = objective(sub_output, sub_targets) / num_sub_batches
# accumulate gradients
sub_loss.backward()
loss_value += sub_loss.item()
optimizer.step()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
I think 10 iterations of batch size 4 is same as one iteration of batch size 40, only here the time taken will be more. Across different training examples losses are added before backprop. But that doesn't make the function linear. BCELoss has a log component, and hence it is not a linear function. However what you said is correct. It will be similar to batch size 40.

cost becoming NaN after certain iterations

I am trying to do a multiclass classification problem (containing 3 labels) with softmax regression.
This is my first rough implementation with gradient descent and back propagation (without using regularization and any advanced optimization algorithm) containing only 1 layer.
Also when learning-rate is big (>0.003) cost becomes NaN, on decreasing learning-rate the cost function works fine.
Can anyone explain what I'm doing wrong??
# X is (13,177) dimensional
# y is (3,177) dimensional with label 0/1
m = X.shape[1] # 177
W = np.random.randn(3,X.shape[0])*0.01 # (3,13)
b = 0
cost = 0
alpha = 0.0001 # seems too small to me but for bigger values cost becomes NaN
for i in range(100):
Z = np.dot(W,X) + b
t = np.exp(Z)
add = np.sum(t,axis=0)
A = t/add
loss = -np.multiply(y,np.log(A))
cost += np.sum(loss)/m
print('cost after iteration',i+1,'is',cost)
dZ = A-y
dW = np.dot(dZ,X.T)/m
db = np.sum(dZ)/m
W = W - alpha*dW
b = b - alpha*db
This is what I get :
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
.............................................................
...............*upto 100 iterations*.................
.............................................................
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well after some time i figured it out.
First of all the cost was increasing due to this :
cost += np.sum(loss)/m
Here plus sign is not needed as it will add all the previous cost computed on every epoch which is not what we want. This implementation is generally required during mini-batch gradient descent for computing cost over each epoch.
Secondly the learning rate is too big for this problem that's why cost was overshooting the minimum value and becoming NaN.
I looked in my code and find out that my features were of very different range (one was from -1 to 1 and other was -5000 to 5000) which was limiting my algorithm to use greater values for learning rate.
So I applied feature scaling :
var = np.var(X, axis=1)
X = X/var
Now learning rate can be much bigger (<=0.001).

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is list of size [1, batch_size] that is output by a neural network. I check to see whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network in the range, so the correct predictions is a tensor of all 1, which I call correct_predictions.
values_list = []
for j in range(batch_size):
a = results[0, j] >= valid_range[0]
b = result[0, j] <= valid_range[1]
c = tf.logical_and(a, b)
if (c == 1):
values_list.append(1)
else:
values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0

Gradient Boost Decision Tree (GBDT) or Multiple Additive Regression Tree(MART): Calculating gradient/pseudo-response

I'm implementing MART from (http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf) algorithm 5,
My algorithm "works" for say less data(3000 training data file, 22 features) and J=5,10,20 (# of leaf nodes) and T = 10, 20. It gives me good result (R-Precision is 0.30 to 0.5 for training) but when I try to run on some what large training data (70K records) it gives me runtime underflow error - which I think it should be - just don't know how workaround this problem?
Underflow err comes here, calculating gradient of cost (or pseudo-response):
here y_i are {1,-1} labels so if I just try: 2/exp(5000) its overflow in denominator!
Just wondering if I can "normalize" this or "threshold" this, but then I'm using this pseudo-response in calculating "label" (gamma in that pdf), and then those gamma to calculate model scores.
You can wrap that expression with an if.
exp_arg = 2 * y_i * F_m_minus_1
if (exp_arg > 700) {
// assume exp() overflow, result of exp() ~= inf, 2 / inf = 0
y_tilda_i = 0
else // standard calculation
I haven't implemented gradient boosting in particular, but I needed to do that trick in some neural network computation.
#rrenaud is close, what I did is: if exp_arg > 16 or exp_arg < -16 make my exp_arg = 16(or -16) and it works! (For 1.2GB data and 700 features too!)

Resources