theano - how to have many of the same function - machine-learning

The input is a variable sized array. I can only process a given example one sample at a time in train_model. I want to accumulate the sum of objectives for the elements in the batch then apply regularization and gradient descent.
Currently, this is the training stage where updates are done for each element xi.
for epoch in range(n_epochs):
    minibatch_avg_cost = 0
    for xi in dataset.get_next_xi(batch_size):
        minibatch_avg_cost += train_model(xi)
How can I get results from train_model(xi) for the number of elements in the batch and then do the updates?

Just use all the elements in dataset.get_next_xi(batch_size) as a single input, and create a Theano function that calculates the average cost over the batch (instead of the cost of one sample) and does the updates using that average cost. You can see the example code from here
They build the Theano function for train_model like this:
train_model = theano.function(
    inputs=[index], outputs=cost, updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]})
where cost is the average cost over the minibatch.


is binary cross entropy an additive function?

I am trying to train a machine learning model where the loss function is binary cross entropy. Because of GPU limitations I can only use a batch size of 4, and I'm seeing a lot of spikes in the loss graph. So I'm thinking of back-propagating only after some larger predefined batch size (>4): I'd do 10 iterations of batch size 4, store the losses, and after the 10th iteration add the losses and back-propagate. Will that be similar to a batch size of 40?
Is f(a+b) = f(a) + f(b) true for binary cross entropy?
f(a+b) = f(a) + f(b) doesn't seem to be what you're after. This would imply that BCELoss is additive which it clearly isn't. I think what you really care about is if for some index i
# false
f(x, y) == f(x[:i], y[:i]) + f(x[i:], y[i:])
is true?
The short answer is no, because you're missing some scale factors. What you probably want is the following identity
# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])
where b is the total batch size.
This identity is used as motivation for the gradient accumulation method (see below). Also, this identity applies to any objective function which returns an average loss across each batch element, not just BCE.
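The identity can be checked numerically. The following is a minimal sketch in plain Python (no PyTorch required); mean_bce is a hypothetical helper standing in for any objective that averages over batch elements, and the predictions and labels are made up for illustration.

```python
import math

def mean_bce(probs, targets):
    """Binary cross entropy averaged over the batch (hypothetical helper)."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

# Made-up predictions and labels for a batch of b = 5 elements.
x = [0.9, 0.2, 0.7, 0.4, 0.6]
y = [1, 0, 1, 1, 0]
b = len(x)
i = 2  # split point

full = mean_bce(x, y)
naive = mean_bce(x[:i], y[:i]) + mean_bce(x[i:], y[i:])
weighted = (i / b) * mean_bce(x[:i], y[:i]) + (1.0 - i / b) * mean_bce(x[i:], y[i:])

print(math.isclose(full, weighted))  # the weighted identity holds
print(math.isclose(full, naive))     # the unweighted sum does not
```

The weights i / b and 1 - i / b simply undo the per-sub-batch averaging, which is why the same check works for any mean-reduced loss.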
Caveat/Pitfall: Keep in mind that batch norm will not behave exactly the same when using this approach, since during the forward pass it computes and updates its internal statistics from whichever (sub-)batch it is given, not from the full batch.
We can actually do a little better memory-wise than just computing the loss as a sum followed by backpropagation. Instead we can compute the gradient of each component in the equivalent sum individually and allow the gradients to accumulate. To better explain I'll give some examples of equivalent operations
Consider the following model
import torch
import torch.nn as nn
import torch.nn.functional as F
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # must run before sub-modules are assigned
        num_outputs = 5
        # assume input shape is 10x10
        self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
        self.fc_layer = nn.Linear(10*5*5, num_outputs)
    def forward(self, x):
        x = self.conv_layer(x)
        x = F.max_pool2d(x, 2, 2, 0, 1, False, False)
        x = F.relu(x)
        x = self.fc_layer(x.flatten(start_dim=1))
        x = torch.sigmoid(x) # or omit this and use BCEWithLogitsLoss instead of BCELoss
        return x
# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()
# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()
# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
and let's say our data and targets for a single batch are
torch.manual_seed(1) # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
targets = torch.randint(0, 2, (batch_size, 5)).float() # random 0/1 labels; 5 matches the model's num_outputs
Full batch
The body of our training loop for an entire batch may look something like this
# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Weighted sum of loss on sub-batches
We could have computed the same loss as a weighted sum of the losses on several sub-batches:
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
loss = 0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # add loss component for sub_batch
    loss = loss + objective(sub_output, sub_targets) / num_sub_batches
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Gradient accumulation
The problem with the previous approach is that in order to apply back-propagation, PyTorch needs to store the intermediate results of every layer in memory for every sub-batch. This ends up requiring a relatively large amount of memory, and you may still run into memory consumption issues.
To alleviate this problem, instead of computing a single loss and performing back-propagation once, we could perform gradient accumulation. This gives results equivalent to the previous version. The difference here is that we instead perform a backward pass on each component of the loss, only stepping the optimizer once all of them have been backpropagated. This way the computation graph is cleared after each sub-batch, which helps with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients to the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not during or after.
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)
# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size
# Important! zero the gradients before the loop
optimizer.zero_grad()
loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # compute loss component for sub_batch
    sub_loss = objective(sub_output, sub_targets) / num_sub_batches
    # accumulate gradients
    sub_loss.backward()
    loss_value += sub_loss.item()
optimizer.step()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
I think 10 iterations of batch size 4 is the same as one iteration of batch size 40, except that it will take more time. Across different training examples, the losses are added before backprop, but that doesn't make the function linear: BCELoss has a log component, and hence it is not a linear function. However, what you said is correct: it will be similar to a batch size of 40.
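The claim that 10 sub-batches of 4 behave like one batch of 40 can be checked by hand on a tiny model. The sketch below is plain Python and assumes a one-weight logistic model with made-up data; mean_bce_grad is a hypothetical helper that returns the analytic gradient of the batch-averaged BCE. Each sub-batch gradient is scaled by 1 / num_sub_batches before being accumulated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_bce_grad(w, xs, ys):
    """Gradient of the batch-averaged BCE of sigmoid(w * x) with respect to w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data: 40 scalar inputs with 0/1 labels, and a single weight.
xs = [0.1 * k - 2.0 for k in range(40)]
ys = [1 if k % 3 == 0 else 0 for k in range(40)]
w = 0.5

# Gradient from one batch of 40.
full_grad = mean_bce_grad(w, xs, ys)

# Gradient accumulated over 10 sub-batches of 4, each scaled by 1/10.
sub_batch_size = 4
num_sub_batches = len(xs) // sub_batch_size
accumulated_grad = 0.0
for s in range(num_sub_batches):
    lo, hi = s * sub_batch_size, (s + 1) * sub_batch_size
    accumulated_grad += mean_bce_grad(w, xs[lo:hi], ys[lo:hi]) / num_sub_batches

print(math.isclose(full_grad, accumulated_grad, rel_tol=1e-9, abs_tol=1e-12))
```

If the sub-batch losses were simply added without the 1 / num_sub_batches factor, the accumulated gradient would be 10 times larger, which acts like a 10 times larger learning rate.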

cost becoming NaN after certain iterations

I am trying to do a multiclass classification problem (containing 3 labels) with softmax regression.
This is my first rough implementation with gradient descent and back propagation (without using regularization and any advanced optimization algorithm) containing only 1 layer.
Also when learning-rate is big (>0.003) cost becomes NaN, on decreasing learning-rate the cost function works fine.
Can anyone explain what I'm doing wrong??
# X is (13,177) dimensional
# y is (3,177) dimensional with label 0/1
m = X.shape[1] # 177
W = np.random.randn(3,X.shape[0])*0.01 # (3,13)
b = 0
cost = 0
alpha = 0.0001 # seems too small to me but for bigger values cost becomes NaN
for i in range(100):
    Z = np.dot(W,X) + b
    t = np.exp(Z)
    add = np.sum(t,axis=0)
    A = t/add
    loss = -np.multiply(y,np.log(A))
    cost += np.sum(loss)/m
    print('cost after iteration',i+1,'is',cost)
    dZ = A-y
    dW = np.dot(dZ,X.T)/m
    db = np.sum(dZ)/m
    W = W - alpha*dW
    b = b - alpha*db
This is what I get :
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
...............*upto 100 iterations*.................
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well, after some time I figured it out.
First of all, the cost was increasing because of this:
cost += np.sum(loss)/m
The plus sign is not needed here, as it adds up the costs computed on all previous iterations, which is not what we want. Accumulating like this is generally only needed in mini-batch gradient descent, to compute the cost over each epoch.
Secondly, the learning rate was too big for this problem, which is why the cost was overshooting the minimum and becoming NaN.
I looked at my data and found out that my features had very different ranges (one went from -1 to 1 and another from -5000 to 5000), which was preventing my algorithm from using larger learning rates.
So I applied feature scaling:
var = np.var(X, axis=1, keepdims=True) # per-feature variance, shape (13,1) so that it broadcasts over the 177 samples
X = X/var
Now the learning rate can be much bigger (<=0.001).
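A complementary remedy, which is not part of the original post, is to make the softmax itself robust to large scores by subtracting the maximum score before exponentiating; this leaves the result unchanged but keeps exp() from overflowing. The sketch below is plain Python for a single column of scores. stable_softmax and standardize are hypothetical helper names, and standardize uses the standard deviation (the more common choice) instead of the variance used above.

```python
import math

def stable_softmax(z):
    """Softmax of a list of scores, shifted by max(z) so exp() cannot overflow."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def standardize(feature):
    """Scale one feature to zero mean and unit standard deviation."""
    mean = sum(feature) / len(feature)
    var = sum((v - mean) ** 2 for v in feature) / len(feature)
    return [(v - mean) / math.sqrt(var) for v in feature]

# Scores this large would overflow a naive exp(); math.exp(5000.0) raises OverflowError.
probs = stable_softmax([5000.0, 4999.0, -5000.0])
print(probs)

# A feature that ranges over thousands ends up on a unit scale.
scaled = standardize([-5000.0, 0.0, 2500.0, 5000.0])
print(scaled)
```

With NumPy the same shift would be applied per column of Z before calling np.exp.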

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to use the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0). You should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm (in PyTorch, ws_norm.detach()). So I think the main problem is the absence of this stop_gradient operator.
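The equivalence in the EDIT rests on one fact: the derivative of the IW-ELBO with respect to each log weight is exactly the corresponding normalized weight. So a weighted sum of log weights whose weights are treated as constants has the same gradient. Below is a small plain-Python sketch that checks this with finite differences; the log weights are made up, and log_mean_exp and normalized_weights are hypothetical helper names.

```python
import math

def log_mean_exp(log_w):
    """Numerically stable log((1/K) * sum_k exp(log_w[k]))."""
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w)) - math.log(len(log_w))

def normalized_weights(log_w):
    """Self-normalized importance weights, i.e. softmax(log_w)."""
    m = max(log_w)
    exps = [math.exp(v - m) for v in log_w]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up log importance weights for K = 4 samples of one datapoint.
log_w = [-120.0, -118.5, -125.0, -119.2]
iwelbo = log_mean_exp(log_w)
ws_norm = normalized_weights(log_w)

# Finite-difference gradient of the IW-ELBO with respect to each log weight.
eps = 1e-6
fd_grad = []
for k in range(len(log_w)):
    bumped = list(log_w)
    bumped[k] += eps
    fd_grad.append((log_mean_exp(bumped) - iwelbo) / eps)

print(iwelbo)
print(ws_norm)
print(fd_grad)  # close to ws_norm
```

Without the stop-gradient, the weights themselves contribute extra gradient terms, so the objective being optimized is no longer the IW-ELBO.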

What would be a good loss function to penalize the magnitude and sign difference

I'm in a situation where I need to train a model to predict a scalar value, and it's important that the predicted value has the same sign as the true value, while keeping the squared error as small as possible.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss function to be built entirely from differentiable operations, so that you can back-propagate through it. This means that any old arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry textbook: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Compute a rotated elliptic paraboloid.
    t = np.pi / 4
    x_rot = (x * np.cos(t)) + (y * np.sin(t))
    y_rot = (x * -np.sin(t)) + (y * np.cos(t))
    z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
    return z
c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
    for j, b_j in enumerate(b):
        loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') no longer works in recent Matplotlib
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
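To see the effect without plotting, the function can be evaluated at two points of equal magnitude, one where the signs agree and one where they differ. This is a small sketch that re-implements the same formula with the standard math module so that it runs without NumPy; the sample points are arbitrary.

```python
import math

def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Same rotated elliptic paraboloid as above, written with math instead of NumPy.
    t = math.pi / 4
    x_rot = (x * math.cos(t)) + (y * math.sin(t))
    y_rot = (x * -math.sin(t)) + (y * math.cos(t))
    return ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)

same_sign = elliptic_paraboloid_loss(2.0, 2.0, 4, 2)   # both positive
diff_sign = elliptic_paraboloid_loss(-2.0, 2.0, 4, 2)  # opposite signs
print(same_sign, diff_sign)
```

With these constants the opposite-sign point costs about twice as much as the same-sign point. Note that the value is zero only at the origin, so a perfect prediction with x == y != 0 still receives a nonzero loss.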
From what I understand, your current loss function is something like:
loss = mean_squared_error(y, y_pred)
What you could do is add one more component to your loss: a component that penalizes negative numbers and does nothing with positive numbers. You can choose a coefficient for how much you want to penalize it. For that, we can use a negative-shaped ReLU, i.e. a function that returns -x for negative x and 0 otherwise.
Let's call this component "Neg_ReLU". Then, your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if the true value is 1 and your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, 3) + 0
(Note that Neg_ReLU(3) = 0 and Neg_ReLU(-1) = 1.)
If you want to penalize the negative values more, then you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU(y_pred)
Now the negative values are penalized more.
The negative ReLU can be built by applying a regular ReLU to the negated prediction. So, summarizing, in the end your total loss will be:
coeff = 1
Neg_ReLU = tf.nn.relu(tf.math.negative(y_pred))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU
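The worked numbers above can be reproduced with a framework-free sketch for a single example. neg_relu and total_loss mirror the answer; sign_aware_loss is a hypothetical variant, not from the answer, that penalizes a sign mismatch in either direction, since the penalty above implicitly assumes that the true value is positive.

```python
def neg_relu(value):
    """Returns -value for negative inputs and 0 otherwise."""
    return max(0.0, -value)

def total_loss(y_true, y_pred, coeff=1.0):
    """Squared error plus a penalty on negative predictions (single example)."""
    return (y_true - y_pred) ** 2 + coeff * neg_relu(y_pred)

def sign_aware_loss(y_true, y_pred, coeff=1.0):
    """Hypothetical variant: penalize only when y_true and y_pred disagree in sign."""
    return (y_true - y_pred) ** 2 + coeff * max(0.0, -y_true * y_pred)

print(total_loss(1.0, -1.0))  # squared error 4 plus penalty 1
print(total_loss(1.0, 3.0))   # squared error 4 plus penalty 0
print(sign_aware_loss(-1.0, 1.0), sign_aware_loss(-1.0, -3.0))
```

Both terms are piecewise differentiable, so the same expressions carry over to TensorFlow or PyTorch tensors with the corresponding relu and mean reductions.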

Gradient Boost Decision Tree (GBDT) or Multiple Additive Regression Tree(MART): Calculating gradient/pseudo-response

I'm implementing MART from ( algorithm 5,
My algorithm "works" for a smaller dataset (a 3000-record training file with 22 features) with J = 5, 10, 20 (number of leaf nodes) and T = 10, 20. It gives me good results (R-Precision is 0.30 to 0.5 on training), but when I try to run it on a somewhat larger training set (70K records) it gives me a runtime underflow error, which I suppose is to be expected; I just don't know how to work around it.
The underflow error occurs here, when calculating the gradient of the cost (or pseudo-response):
Here the y_i are {1,-1} labels, so if I just try 2/exp(5000), it overflows in the denominator!
I'm wondering whether I can "normalize" or "threshold" this, but I then use this pseudo-response to calculate the "label" (gamma in that pdf), and then use those gammas to calculate the model scores.
You can wrap that expression with an if.
exp_arg = 2 * y_i * F_m_minus_1
if (exp_arg > 700) {
    // assume exp() overflow, result of exp() ~= inf, 2 / inf = 0
    y_tilda_i = 0
} else {
    // standard calculation
}
I haven't implemented gradient boosting in particular, but I needed to do that trick in some neural network computation.
@rrenaud is close. What I did is: if exp_arg > 16 or exp_arg < -16, I set exp_arg to 16 (or -16), and it works! (For 1.2GB of data and 700 features too!)
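The clamping trick can be sketched in a few lines of Python. The formula for the pseudo-response is missing from the post, so the sketch assumes the usual two-class logistic form, y_tilda_i = 2 * y_i / (1 + exp(2 * y_i * F_{m-1}(x_i))), which is consistent with the exp_arg used in the answer above. pseudo_response and its clip parameter are hypothetical names.

```python
import math

def pseudo_response(y_i, f_prev, clip=16.0):
    """2*y / (1 + exp(2*y*F)) with the exponent clamped to [-clip, clip]."""
    exp_arg = 2.0 * y_i * f_prev
    exp_arg = max(-clip, min(clip, exp_arg))
    return 2.0 * y_i / (1.0 + math.exp(exp_arg))

print(pseudo_response(1, 5000.0))    # confidently correct: close to 0
print(pseudo_response(1, -5000.0))   # confidently wrong: close to 2
print(pseudo_response(-1, 0.0))      # undecided: -1
```

Clamping at 16 loses very little, because exp(16) is already about 8.9 million, so the pseudo-response beyond that point differs from its limit by less than one part in a million.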
