is binary cross entropy an additive function? - machine-learning

I am trying to train a machine learning model where the loss function is binary cross entropy, because of gpu limitations i can only do batch size of 4 and i'm having lot of spikes in the loss graph. So I'm thinking to back-propagate after some predefined batch size(>4). So it's like i'll do 10 iterations of batch size 4 store the losses, after 10th iteration add the losses and back-propagate. will it be similar to batch size of 40.
TL;DR
f(a+b) = f(a)+f(b) is it true for binary cross entropy?

f(a+b) = f(a) + f(b) doesn't seem to be what you're after. This would imply that BCELoss is additive which it clearly isn't. I think what you really care about is if for some index i
# false
f(x, y) == f(x[:i], y[:i]) + f([i:], y[i:])
is true?
The short answer is no, because you're missing some scale factors. What you probably want is the following identity
# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])
where b is the total batch size.
This identity is used as motivation for the gradient accumulation method (see below). Also, this identity applies to any objective function which returns an average loss across each batch element, not just BCE.
Caveat/Pitfall: Keep in mind that batch norm will not behave exactly the same when using this approach since it updates its internal statistics based on batch size during the forward pass.
We can actually do a little better memory-wise than just computing the loss as a sum followed by backpropagation. Instead we can compute the gradient of each component in the equivalent sum individually and allow the gradients to accumulate. To better explain I'll give some examples of equivalent operations
Consider the following model
import torch
import torch.nn as nn
import torch.nn.functional as F
class MyModel(nn.Module):
def __init__(self):
super().__init__()
num_outputs = 5
# assume input shape is 10x10
self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
self.fc_layer = nn.Linear(10*5*5, num_outputs)
def forward(self, x):
x = self.conv_layer(x)
x = F.max_pool2d(x, 2, 2, 0, 1, False, False)
x = F.relu(x)
x = self.fc_layer(x.flatten(start_dim=1))
x = torch.sigmoid(x) # or omit this and use BCEWithLogitsLoss instead of BCELoss
return x
# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()
# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()
# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
and lets say our data and targets for a single batch are
torch.manual_seed(1) # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
targets = torch.randint(0, 1, (batch_size, 20)).float()
Full batch
The body of our training loop for an entire batch may look something like this
# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Weighted sum of loss on sub-batches
We could have computed this using the sum of multiple loss functions using
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
loss = 0
for sub_batch_idx in range(num_sub_batches):
start_idx = sub_batch_size * sub_batch_idx
end_idx = start_idx + sub_batch_size
sub_input = input_data[start_idx:end_idx]
sub_targets = targets[start_idx:end_idx]
sub_output = model(sub_input)
# add loss component for sub_batch
loss = loss + objective(sub_output, sub_targets) / num_sub_batches
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Gradient accumulation
The problem with the previous approach is that in order to apply back-propagation, pytorch needs to store intermediate results of layers in memory for every sub-batch. This ends up requiring a relatively large amount of memory and you may still run into memory consumption issues.
To alleviate this problem, instead of computing a single loss and performing back-propagation once, we could perform gradient accumulation. This gives equivalent results of the previous version. The difference here is that we instead perform a backward pass on each component of
the loss, only stepping the optimizer once all of them have been backpropagated. This way the computation graph is cleared after each sub-batch which will help with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients to the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not during or after.
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
# Important! zero the gradients before the loop
optimizer.zero_grad()
loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
start_idx = sub_batch_size * sub_batch_idx
end_idx = start_idx + sub_batch_size
sub_input = input_data[start_idx:end_idx]
sub_targets = targets[start_idx:end_idx]
sub_output = model(sub_input)
# compute loss component for sub_batch
sub_loss = objective(sub_output, sub_targets) / num_sub_batches
# accumulate gradients
sub_loss.backward()
loss_value += sub_loss.item()
optimizer.step()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))

I think 10 iterations of batch size 4 is same as one iteration of batch size 40, only here the time taken will be more. Across different training examples losses are added before backprop. But that doesn't make the function linear. BCELoss has a log component, and hence it is not a linear function. However what you said is correct. It will be similar to batch size 40.

Related

LSTM regression model flat prediction

This is a time series regression problem for the battery capacity as output and a single input variable as voltage; the relation is non-linear.
LSTM Model prediction of the test data always returns a semi-flat line, probably the mean of the output variable in the training data.
This is an example of predicted vs test set output values, with the following model parameters:
(Window size: 10, batch site: 256, LSTM nodes: 16)
Prediction of the test data
Data had been normalized, down-sampled to 1 sec and later to 3 sec, original sampling was 10 Hz.
I was suspecting the voltage fluctuation is the problem, but sampling at 3 seconds hadn't resulted into noticeable improvement.
Here are the data after being down-sampled to 3 seconds:
Normalized Training Data ; Y:SOC, X: Voltage
Normalized Test Data ; Y:SOC, X: Voltage
I've tried many changes in the model and learning parameters as follows, but still the behavior is the same.
That's why i think it's not a parameter tuning issue, rather the model is not learning at all.
LSTM layer: always single, followed by Dense with no options.
LSTM nodes: [4,8,16,32]
Epoch: : [16,32,64,128]
window size (input vector depth): [8,32,64,128]
Batch size: [32,64,128,256]
learning rate: [.0005,.0001,.001]
optimizer : ADAM, options:[ none, clipnorm=1, clipvalue=0.5]
Model specification Code:
backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16,input_shape=(win_sz, features_cnt) )) # stateless
model1.add(layers.Dense(1))
model1.summary()
Model training and validation Code:
n_epochs = 12
iterations = tr_samples_sh_cnt // batch_sz_tr
loss = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.optimizers.Adam(learning_rate = 0.001)
loss_history = []
#tf.function
def train_model_on_batch():
start = epoch * batch_sz_tr
X_batch = df_feat_tr_3D[start:start+batch_sz_tr, :, :]
y_batch = df_SOC_tr_2D[start:start+batch_sz_tr, :]
with tf.GradientTape() as tape:
current_loss = loss(model1(X_batch), y_batch)
gradients = tape.gradient(current_loss, model1.trainable_variables)
optimizer.apply_gradients(zip(gradients, model1.trainable_variables))
return current_loss
for epoch in range(n_epochs+1):
for iteration in range(iterations):
current_loss = train_model_on_batch()
if epoch % 1 == 0:
loss_history.append(current_loss.numpy())
print("{}. \t\tLoss: {}".format(
epoch, loss_history[-1]))
print('\nTraining complete.')
P_test = model1.predict(df_feat_test_3D)
After adding sigmoid activation function in both LSTM and Dense layers, a very small change observed, but far from reasonable fit.
Prediction of the test data after adding activation function
The problem was the activation function as #Dr. Snoopy recommended

How to handle imbalanced multi-label dataset?

I am currently trying to train an image classification model using Pytorch densenet121 with 4 labels (A, B, C, D). I have 224000 images and each image is labeled in the form of [1, 0, 0, 1] (Label A and D are present in the image). I have replaced the last dense layer of densenet121. The model is trained using Adam optimizer, LR of 0.0001 (with the decay of a factor of 10 per epoch), and trained for 4 epochs. I will try more epochs after I am confident that the class imbalanced issue is resolved.
The estimated number of positive classes is [19000, 65000, 38000, 105000] respectively. When I trained the model without class balancing and weights (with BCELoss), I have a very low recall for label A and C (in fact the True Positive TP and False Positive FP is less than 20)
I have tried 3 approaches to deal with the class imbalance after an extensive search on Google and Stackoverflow.
Approach 1: Class weights
I have tried to implement class weights by using the ratio of negative samples to positive samples.
y = train_df[CLASSES];
pos_weight = (y==0).sum()/(y==1).sum()
pos_weight = torch.Tensor(pos_weight)
if torch.cuda.is_available():
pos_weight = pos_weight.cuda()
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
The resultant class weights are [10.79, 2.45, 4.90, 1.13]. I am getting the opposite effect; having too many positive predictions which result in low precision.
Approach 2: Changing logic for class weights
I have also tried to get class weights by getting the proportion of the positive samples in the dataset and getting the inverse. The resultant class weights are [11.95, 3.49, 5.97, 2.16]. I am still getting too many positive predictions.
class_dist = y.apply(pd.Series.value_counts)
class_dist_norm = class_dist.loc[1.0]/class_dist.loc[1.0].sum()
pos_weight = 1/class_dist_norm
Approach 3: Focal Loss
I have also tried Focal Loss with the following implementation (but still getting too many positive predictions). I have used the class weights for the alpha parameter. This is referenced from https://gist.github.com/f1recracker/0f564fd48f15a58f4b92b3eb3879149b but I made some modifications to suit my use case better.
class FocalLoss(nn.CrossEntropyLoss):
''' Focal loss for classification tasks on imbalanced datasets '''
def __init__(self, alpha=None, gamma=1.5, ignore_index=-100, reduction='mean', epsilon=1e-6):
super().__init__(weight=alpha, ignore_index=ignore_index, reduction='mean')
self.reduction = reduction
self.gamma = gamma
self.epsilon = epsilon
self.alpha = alpha
def forward(self, input_, target):
# cross_entropy = super().forward(input_, target)
# Temporarily mask out ignore index to '0' for valid gather-indices input.
# This won't contribute final loss as the cross_entropy contribution
# for these would be zero.
target = target * (target != self.ignore_index).long()
# p_t = p if target = 1, p_t = (1-p) if target = 0, where p is the probability of predicting target = 1
p_t = input_ * target + (1 - input_) * (1 - target)
# Loss = -(alpha)( 1 - p_t)^gamma log(p_t), where -log(p_t) is cross entropy => loss = (alpha)(1-p_t)^gamma * cross_entropy (Epsilon added to prevent error with log(0) when class probability is 0)
if self.alpha != None:
loss = -1 * self.alpha * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)
else:
loss = -1 * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)
if self.reduction == 'mean':
return torch.mean(loss)
elif self.reduction == 'sum':
return torch.sum(loss)
else:
return loss
One thing to note is that the loss using stagnant after the first epoch, but the metrics varied between epochs.
I have considered undersampling and oversampling but I am unsure of how to proceed due to the fact that each image can have more than 1 label. One possible method is to oversample images with only 1 label by replicating them. But I am concerned that the model will only generalize on images with 1 label but perform poorly on images with multiple labels.
Therefore I would like to ask if there are methods that I should try, or did I make any mistakes in my approaches.
Any advice will be greatly appreciated.
Thank you!

How to learn a convolution kernel

I have an image matrix A. I want to learn a convolution kernel H that does following operations:
A*H gives a tensor "Intermediate" and
Intermediate * H gives "A"
Here * represents convolution operation (possibly using FFT). I only have the image. I started with a random H matrix. I want to minimise the loss between the final output [(A*H)*H] and A; and using that to get the optimised H. Can someone suggest how should I proceed using Torch?
N.B: I've written a function that does the convolution operations and returns a tensor that I want to be Like A.
Does this code match your requirement?
import torch
A = torch.randn([1, 1, 4, 4])
conv = torch.nn.Conv2d(1, 1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(conv.parameters(), lr=0.001)
for i in range(1000):
optimizer.zero_grad()
out = conv(conv(A))
loss = criterion(out, A)
loss.backward()
optimizer.step()
if i % 100 == 0:
print(i, loss.item())
And of course, the weight of convolution will converge to 1

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to have the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample multiple z K times for each x.
So its computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0), you should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculation of the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call w_norm. So I the main problem is the absence of this stop_gradient operator.

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is list of size [1, batch_size] that is output by a neural network. I check to see whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network in the range, so the correct predictions is a tensor of all 1, which I call correct_predictions.
values_list = []
for j in range(batch_size):
a = results[0, j] >= valid_range[0]
b = result[0, j] <= valid_range[1]
c = tf.logical_and(a, b)
if (c == 1):
values_list.append(1)
else:
values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0

Resources