I'm trying to solve the PCA problem:
For some number k and a dataset X, I'm trying to find w (the PCA matrix) such that:
w = argmax_w tr(w^T X X^T w)  subject to  w^T w = I
(I might be wrong with the formulation of the optimization goal. Please correct me.)
I want to solve the optimization goal with gradient descent rather than with the usual SVD decomposition.
I'm basing my code on this Stats SE post:
How can one implement PCA using gradient descent?
But my code doesn't seem to find an optimal solution.
def get_gd_pca(X, w):
    k = w.shape[-1]
    LEARNING_RATE = 0.1
    EPOCHS = 200
    cov = torch.cov(X)
    lam = torch.rand(1, requires_grad=True)
    optimizer = torch.optim.SGD([w, lam], lr=LEARNING_RATE, maximize=True)
    for epoch in range(EPOCHS):
        optimizer.zero_grad()
        left_side_loss = torch.matmul(w.T, torch.matmul(cov, w))
        right_side_loss = torch.matmul((lam * torch.eye(k)), (torch.matmul(w.T, w) - 1))
        loss = torch.sum(left_side_loss - right_side_loss)
        loss.backward()
        optimizer.step()
        # Normalizing current_P
        w.data = w.data / torch.norm(w.data, dim=0)
        print('current_P', w)
I have two problems:
If I don't use normalization (the last line of the for loop), the w values just explode.
If I do use it, it only works with k=1. After that, the returned vectors are all the same as the first one.
Any ideas what I have done wrong? I think it's connected to not using the Lagrangian correctly, but I don't understand why.
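For reference, here is a minimal sketch of projected gradient ascent for PCA (my own sketch, not from the linked post): instead of a Lagrangian term, it re-orthonormalizes the columns of w with a QR step after each update, which prevents all k columns from collapsing onto the leading eigenvector. It assumes X is laid out so that torch.cov(X) gives the feature covariance (rows are variables).

import torch

def gd_pca_sketch(X, k, lr=0.1, epochs=200):
    # torch.cov treats rows of X as variables, columns as observations
    cov = torch.cov(X)
    w = torch.randn(cov.shape[0], k)
    for _ in range(epochs):
        # gradient of tr(w^T cov w) with respect to w is 2 * cov @ w
        w = w + lr * 2 * cov @ w
        # re-orthonormalize the columns (QR retraction) so they stay
        # orthogonal instead of all converging to the top eigenvector
        w, _ = torch.linalg.qr(w)
    return w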
Related
I have an image matrix A. I want to learn a convolution kernel H that does the following operations:
A * H gives a tensor "Intermediate", and
Intermediate * H gives "A".
Here * represents the convolution operation (possibly using FFT). I only have the image. I started with a random H matrix. I want to minimise the loss between the final output (A*H)*H and A, and use that to get the optimised H. Can someone suggest how I should proceed using Torch?
N.B.: I've written a function that does the convolution operations and returns a tensor that I want to be like A.
Does this code match your requirement?
import torch

A = torch.randn([1, 1, 4, 4])
conv = torch.nn.Conv2d(1, 1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(conv.parameters(), lr=0.001)
for i in range(1000):
    optimizer.zero_grad()
    out = conv(conv(A))
    loss = criterion(out, A)
    loss.backward()
    optimizer.step()
    if i % 100 == 0:
        print(i, loss.item())
And of course, the weight of the convolution will converge to ±1, so that applying it twice acts as the identity.
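If the kernel needs to be larger than 1x1, the same loop works as long as padding keeps the spatial size so that (A*H)*H matches A's shape. A hedged variation on the sketch above (the 3x3 kernel, padding=1, and bias=False are my assumptions; convergence to a clean identity-like kernel isn't guaranteed here):

import torch

A = torch.randn([1, 1, 4, 4])
# 3x3 kernel with padding=1 keeps the spatial size, so conv(conv(A)) matches A
conv = torch.nn.Conv2d(1, 1, 3, padding=1, bias=False)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(conv.parameters(), lr=0.001)
for i in range(1000):
    optimizer.zero_grad()
    loss = criterion(conv(conv(A)), A)
    loss.backward()
    optimizer.step()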
Question
Please forgive me for asking a Coursera ML course-specific question. I hope someone who did the course can answer.
In the Coursera ML Week 4 Multi-class Classification and Neural Networks assignment, why does the weight (theta) gradient add (plus) the derivative instead of subtracting it?
% Calculate the gradients of Weight2
% Derivative of the loss function J = L(Z): dJ/dZ = (oi - yi) / (oi * (1 - oi))
% Derivative of the sigmoid activation function: dZ/dY = oi * (1 - oi)
delta_theta2 = oi - yi; % <--- (dJ/dZ) * (dZ/dY)
% Using +/plus NOT -/minus
Theta2_grad = Theta2_grad + <-------- Why plus(+)?
    bsxfun(@times, hi, transpose(delta_theta2));
Code Excerpt
for i = 1:m
    % i is the training set index of X (including bias). X(i, :) is 401 data.
    xi = X(i, :);
    yi = Y(i, :);
    % hi is the i-th output of the hidden layer. H(i, :) is 26 data.
    hi = H(i, :);
    % oi is the i-th output layer. O(i, :) is 10 data.
    oi = O(i, :);
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta2
    %------------------------------------------------------------------------
    delta_theta2 = oi - yi;
    Theta2_grad = Theta2_grad + bsxfun(@times, hi, transpose(delta_theta2));
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta1
    %------------------------------------------------------------------------
    % Derivative of g(z): g'(z) = g(z)(1 - g(z)) where g(z) is sigmoid(H_NET).
    dgz = (hi .* (1 - hi));
    delta_theta1 = dgz .* sum(bsxfun(@times, Theta2, transpose(delta_theta2)));
    % There is no input into H0, hence there is no theta for H0. Remove H0.
    delta_theta1 = delta_theta1(2:end);
    Theta1_grad = Theta1_grad + bsxfun(@times, xi, transpose(delta_theta1));
end
I thought it should be subtracting the derivative.
Since the gradients are calculated by averaging the gradients across all training examples, we first "accumulate" the gradients while looping over the training examples; we do this by summing the gradient across all training examples. So the line you highlighted with the plus is not the gradient update step. (Notice that alpha is not there either.) The update step is most likely somewhere else, outside of the loop from 1 to m.
Also, I am not sure when you will learn about this (I'm sure it's somewhere in the course), but you could also vectorize the code, as sketched below :)
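Here is a hedged numpy sketch of what the vectorized version might look like (numpy rather than the course's Octave; the array names and shapes follow the excerpt above, with random stand-ins so the snippet runs on its own):

import numpy as np

# Shapes follow the excerpt: X is m x 401, H is m x 26 (hidden activations
# incl. bias), O is m x 10 (outputs), Y is m x 10 (targets), Theta2 is 10 x 26.
m = 5
X = np.random.randn(m, 401)
H = np.random.rand(m, 26)
O = np.random.rand(m, 10)
Y = np.eye(10)[np.random.randint(10, size=m)]
Theta2 = np.random.randn(10, 26)

delta2 = O - Y                      # m x 10, delta_theta2 for all rows at once
Theta2_grad = delta2.T @ H          # 10 x 26, sums over all m examples
dgz = H * (1 - H)                   # sigmoid derivative g'(z) = g(z)(1 - g(z))
delta1 = (delta2 @ Theta2) * dgz    # m x 26
delta1 = delta1[:, 1:]              # no theta into the bias unit H0
Theta1_grad = delta1.T @ X          # 25 x 401
# dividing by m and subtracting alpha * grad happens elsewhere, as noted above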
I am trying to train a machine learning model where the loss function is binary cross entropy. Because of GPU limitations I can only use a batch size of 4, and I'm seeing a lot of spikes in the loss graph. So I'm thinking of back-propagating only after some predefined number of batches (> 4): I'd do 10 iterations of batch size 4, store the losses, and after the 10th iteration add the losses and back-propagate. Would that be similar to a batch size of 40?
TL;DR
Is f(a+b) = f(a) + f(b) true for binary cross entropy?
f(a+b) = f(a) + f(b) doesn't seem to be what you're after. This would imply that BCELoss is additive, which it clearly isn't. I think what you really care about is if, for some index i,
# false
f(x, y) == f(x[:i], y[:i]) + f(x[i:], y[i:])
is true?
The short answer is no, because you're missing some scale factors. What you probably want is the following identity
# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])
where b is the total batch size.
This identity is used as motivation for the gradient accumulation method (see below). Also, this identity applies to any objective function that returns the average loss over batch elements, not just BCE.
Caveat/Pitfall: Keep in mind that batch norm will not behave exactly the same when using this approach since it updates its internal statistics based on batch size during the forward pass.
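A quick way to convince yourself of the identity (a sketch of my own, assuming binary_cross_entropy with the default mean reduction; the shapes and seed are arbitrary):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, i = 8, 3
x = torch.rand(b, 4)                     # predictions in [0, 1)
y = torch.randint(0, 2, (b, 4)).float()  # binary targets

full = F.binary_cross_entropy(x, y)
split = (i / b) * F.binary_cross_entropy(x[:i], y[:i]) \
      + (1 - i / b) * F.binary_cross_entropy(x[i:], y[i:])
print(torch.allclose(full, split))       # expected: True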
We can actually do a little better memory-wise than computing the loss as a single sum followed by backpropagation. Instead, we can compute the gradient of each component of the equivalent sum individually and let the gradients accumulate. To better explain, I'll give some examples of equivalent operations.
Consider the following model
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        num_outputs = 5
        # assume input shape is 10x10
        self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
        self.fc_layer = nn.Linear(10 * 5 * 5, num_outputs)

    def forward(self, x):
        x = self.conv_layer(x)
        x = F.max_pool2d(x, 2, 2, 0, 1, False, False)
        x = F.relu(x)
        x = self.fc_layer(x.flatten(start_dim=1))
        x = torch.sigmoid(x)  # or omit this and use BCEWithLogitsLoss instead of BCELoss
        return x

# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()
# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()
# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
and let's say our data and targets for a single batch are
torch.manual_seed(1)  # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
# targets must match the model's output shape (batch_size, 5) and contain
# 0s and 1s; note that randint's upper bound is exclusive
targets = torch.randint(0, 2, (batch_size, 5)).float()
Full batch
The body of our training loop for an entire batch may look something like this
# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Weighted sum of loss on sub-batches
We could have computed the same thing using a weighted sum of multiple loss components:
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
# for this to work properly the batch_size must be divisible by sub_batch_size
assert batch_size % sub_batch_size == 0
num_sub_batches = batch_size // sub_batch_size

loss = 0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # add loss component for sub_batch
    loss = loss + objective(sub_output, sub_targets) / num_sub_batches

optimizer.zero_grad()
loss.backward()
optimizer.step()

loss_value = loss.item()
print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
Gradient accumulation
The problem with the previous approach is that in order to apply back-propagation, pytorch needs to store intermediate results of layers in memory for every sub-batch. This ends up requiring a relatively large amount of memory and you may still run into memory consumption issues.
To alleviate this problem, instead of computing a single loss and performing back-propagation once, we can perform gradient accumulation. This gives results equivalent to the previous version. The difference is that we perform a backward pass on each component of the loss, only stepping the optimizer once all of them have been backpropagated. This way the computation graph is cleared after each sub-batch, which helps with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients to the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not during or after.
# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
# for this to work properly the batch_size must be divisible by sub_batch_size
assert batch_size % sub_batch_size == 0
num_sub_batches = batch_size // sub_batch_size

# Important! zero the gradients before the loop
optimizer.zero_grad()

loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # compute loss component for sub_batch
    sub_loss = objective(sub_output, sub_targets) / num_sub_batches
    # accumulate gradients
    sub_loss.backward()
    loss_value += sub_loss.item()

optimizer.step()

print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
I think 10 iterations of batch size 4 are the same as one iteration of batch size 40; only the time taken will be more. Across the different training examples, the losses are added before backprop. That doesn't make the function linear: BCELoss has a log component, and hence it is not a linear function. However, what you said is correct: it will be similar to batch size 40 (up to the scale factors discussed above).
I've been implementing VAE and IWAE models on the Caltech Silhouettes dataset and am having an issue where the VAE outperforms the IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and the experiments produced here.
I'm hoping someone can find some issue in my implementation that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K, 0)  # generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std)  # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z)  # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1) - .5 * z.shape[1] * torch.log(torch.tensor(2 * np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std)  # definitions below
log_p_x = compute_log_probability_bernoulli(decoded, data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type == 'vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1) * 1 / K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint)  # returned to the training function; loss.backward() gets called on this value

Here are the likelihood functions. I had to fuss with the Bernoulli LL in order to not get NaNs during training:

def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs - mu) / torch.exp(logstd)) ** 2 - logstd, axis) - .5 * obs.shape[1] * torch.log(torch.tensor(2 * np.pi))

def compute_log_probability_bernoulli(theta, obs, axis=1):  # add 1e-18 to avoid NaNs appearing in training
    return torch.sum(obs * torch.log(theta + 1e-18) + (1 - obs) * torch.log(1 - theta + 1e-18), axis)
In this code there's a "shortcut" being used: in the model_type == 'iwae' case the row-wise importance weights are calculated over the K=50 samples in each row, while in the model_type == 'vae' case they are calculated over the single value left in each row, so it just ends up computing a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
I should add that the training scheme is the same as that in the paper linked above. That is, for each round i = 0, ..., 7, train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7).
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k)\, p(z_k)}{q(z_k; x)} $$
For the IWAE there are K samples originating from each datapoint x, so you want the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but to sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K, 0); you should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. Now repeat your datapoints with data_k_vec = data.repeat_interleave(K, 0) and use the resulting tensor to efficiently evaluate the conditional p(x | z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance-weighted estimate of the standard single-sample gradient, so the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
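In PyTorch the stop_gradient operator is .detach(). A minimal sketch of the fix (my own, with a random stand-in for the log-weight matrix so it runs on its own; the names mirror the snippet in the question):

import torch

K = 50
log_w_matrix = torch.randn(4, K, requires_grad=True)  # stand-in for the OP's log weights
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
# .detach() is the stop_gradient: no gradient flows through the normalized weights
ws_norm = (ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)).detach()
loss = -torch.sum(torch.sum(log_w_matrix * ws_norm, 1))
loss.backward()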
I know XGBoost needs the first and second gradients (grad and hess), but has anybody else used "mae" as the objective function?
A little bit of theory first, sorry! You asked for the grad and hessian for the MAE; however, the MAE is not continuously twice differentiable, so trying to calculate the first and second derivatives becomes tricky. The "kink" in |x| at x = 0 prevents the MAE from being continuously differentiable.
Moreover, the second derivative is zero at all the points where it is well behaved. In XGBoost, the second derivative is used as a denominator in the leaf weights, and when it is zero it creates serious math errors.
Given these complexities, our best bet is to approximate the MAE using some other, nicely behaved function. Let's take a look.
There are several functions that approximate the absolute value. Clearly, for very small values, the Squared Error (MSE) is a fairly good approximation of the MAE. However, I assume that this is not sufficient for your use case.
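As a quick numerical check (my own sketch, standing in for the plot in the original answer), here is how the three approximations discussed below compare against |x| on a grid; the Fair formula is taken from the docstring further down:

import numpy as np

x = np.linspace(-3, 3, 7)
h, c = 1.0, 1.0                                       # Pseudo-Huber delta and Fair constant
mae = np.abs(x)
pseudo_huber = h**2 * (np.sqrt(1 + (x / h)**2) - 1)
fair = c * np.abs(x) - c**2 * np.log(np.abs(x) / c + 1)
log_cosh = np.log(np.cosh(x))
print(np.c_[x, mae, pseudo_huber, fair, log_cosh])    # columns: x and the four losses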
Huber Loss is a well-documented loss function. However, it is not twice continuously differentiable, so we cannot guarantee well-behaved second derivatives. We can approximate it using the Pseudo-Huber function, which can be implemented in Python XGBoost as follows:
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
param = {'max_depth': 5}
num_round = 10

def huber_approx_obj(preds, dtrain):
    d = preds - dtrain.get_label()  # remove .get_label() for the sklearn API
    h = 1  # h is the delta parameter of the Pseudo-Huber loss
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess

bst = xgb.train(param, dtrain, num_round, obj=huber_approx_obj)
Other functions can be used by replacing obj=huber_approx_obj.
Fair Loss is not well documented at all, but it seems to work rather well. The fair loss function is c * |x| - c**2 * log(|x|/c + 1). It can be implemented as such:
def fair_obj(preds, dtrain):
    """fair loss: y = c * abs(x) - c**2 * np.log(abs(x)/c + 1)"""
    x = preds - dtrain.get_label()
    c = 1
    den = abs(x) + c
    grad = c * x / den
    hess = c * c / den ** 2
    return grad, hess
This code is taken and adapted from the second place solution in the Kaggle Allstate Challenge.
The Log-Cosh loss is another smooth approximation of the MAE:
def log_cosh_obj(preds, dtrain):
    x = preds - dtrain.get_label()
    grad = np.tanh(x)
    hess = 1 / np.cosh(x) ** 2
    return grad, hess
Finally, you can create your own custom loss functions using the above functions as templates.
Warning: Due to API changes, newer versions of XGBoost may require loss functions of the form:

def custom_objective(y_true, y_pred):
    ...
    return grad, hess
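For example, here is a hedged sketch of what the Pseudo-Huber objective might look like under the newer sklearn-style signature (the function name is my own; the wrapper passes labels and predictions directly, so there is no DMatrix lookup):

import numpy as np
import xgboost as xgb

def huber_approx_obj_sk(y_true, y_pred):
    # same Pseudo-Huber grad/hess as above, without the DMatrix lookup
    d = y_pred - y_true
    h = 1.0
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    return d / scale_sqrt, 1 / (scale * scale_sqrt)

# usage with the sklearn wrapper, which accepts a callable objective
reg = xgb.XGBRegressor(objective=huber_approx_obj_sk, n_estimators=10)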
For the Huber loss above, I think the gradient is missing a negative sign up front. It should be:
grad = -d / scale_sqrt
I am running the Huber/Fair objectives from above on ~normally distributed Y, but for some reason with alpha < 0 (and all the time for Fair) the resulting predictions all equal zero...