Calculating Gradient Update - machine-learning

Lets say I want to manually calculate the gradient update with respect to the Kullback-Liebler divergence loss, say on a VAE (see an actual example from pytorch sample documentation here):
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
where the logvar is (for simplicitys sake, ignoring activation functions and multiple layers etc.) basically a single layer transformation from a 400 dim feature vector into a 20 dim one:
self.fc21 = nn.Linear(400, 20)
logvar = fc21(x)
I'm just not mathematically understanding how you take the gradient of this, with respect to the weight vector for fc21. Mathematically I thought this would look like:
KL = -.5sum(1 + Wx + b - m^2 - e^{Wx + b})
dKL/dW = -.5 (x - e^{Wx + b}x)
where W is the weight matrix of the fc21 layer. But here this result isn't in the same shape as W (20x400). Like, x is just a 400 feature vector. So how would I perform SGD on this? Does x just broadcast to the second term, and if so why? I feel like I'm just missing some mathematical understanding here...

Let's simplify the example a bit and assume a fully connected layer of input shape 3 and output shape 2, then:
W = [[w1, w2, w3], [w4, w5, w6]]
x = [x1, x2, x3]
y = [w1*x1 + w2*x2 + w3*x3, w4*x1 + w5*x2 + w6*x3]
D_KL = -0.5 * [ 1 + w1*x1 + w2*x2 + w3*x3 + w4*x1 + w5*x2 + w6*x3 + b - m^2 + e^(..)]
grad(D_KL, w1) = -0.5 * [x1 + x1* e^(..)]
grad(D_KL, w2) = -0.5 * [x2 + x2* e^(..)]
...
grad(D_KL, W) = [[grad(D_KL, w1), grad(D_KL, w2), grad(D_KL,w3)],
[grad(D_KL, w4), grad(D_KL, w5), grad(D_KL,w6)]
]
This generalizes for higher order tensors of any dimensionality. Your differentiation is wrong in treating x and W as scalars rather than taking element-wise partial derivatives.

Related

How to debug if weight keep increasing. Pytorch program

I m having some doubt when practicing Pytorch program.
I have function like y = m1x1 + m2x2 + c (just 2 weights to learn here). The expected values of weight should be 16,-14 and bias should be 36. But in every epoch the learned wight goes very big. Can any one help me to debug and understand this 20 lines of code, what going wrong here.
import torch
x = torch.randint(size = (1,2), high = 10)
w = torch.Tensor([16,-14])
b = 36
#Compute Ground Truth
y = w * x + b
#Find weights by program
epoch = 20
learning_rate = 30
#initialize random
w1 = torch.rand(size= (1,2), requires_grad= True)
b1 = torch.ones(size = [1], requires_grad= True)
for i in range(epoch):
y1 = w1 * x + b1
#loss function RMSQ
loss = torch.sum((y1-y)**2)
#Find gradient
loss.backward()
with torch.no_grad():
#update parameters
w1 -= (learning_rate * w1.grad)
b1 -= (learning_rate * b1.grad)
w1.grad.zero_()
b1.grad.zero_()
print("B ", b1)
print("W ", w1)
Thanks,
Ganesh
You have a very large learning rate.
This is an illustration from Jeremy Jordan's blog that explains exactly what is going on in your case.

Denormalizing thetas after a linear regression with gradient descent

I have the following set of data:
km,price
240000,3650
139800,3800
150500,4400
185530,4450
176000,5250
114800,5350
166800,5800
89000,5990
144500,5999
84000,6200
82029,6390
63060,6390
74000,6600
97500,6800
67000,6800
76025,6900
48235,6900
93000,6990
60949,7490
65674,7555
54000,7990
68500,7990
22899,7990
61789,8290
After normalizing them, I'm performing a gradient descent that gives me the following thetas:
θ0 = 0.9362124793084768
θ1 = -0.9953762249792935
I can correctly predict the price if I feed a normalized mileage, and then denormalize the predicted price, ie:
Asked price for a mileage of 50000km:
normalized mileage: 0.12483129971764294
normalized price: (mx + c) = 0.8119583714362707
real price: 7417.486843464296
What I'm looking for is to revert my thetas back to their non-normalized values, but I've been unable to, no matter which equation I tried. Is there a way to do so ?
As usual, it takes me asking a question on stackoverflow to manage to solve it by myself moments later....
It was simply a two variables equation to solve, as you can see here (excuse the handwriting): https://ibb.co/178qWcQ.
Here is the python code that does the computation:
x0, x1 = self.training_set[0][0], self.training_set[1][0]
x0n, x1n = self.normalized_training_set[0][0], self.normalized_training_set[1][0]
y0n, y1n = self.hypothesis(x0n), self.hypothesis(x1n)
p_diff = self.max_price - self.min_price
theta0 = (x1 / (x1 - x0)) * (y0n * p_diff + self.min_price - (x0 / x1 * (y1n * p_diff + self.min_price)))
y0 = self.training_set[0][1]
theta1 = (y0 - theta0) / x0
print(theta0, theta1) //RESULT: 8481.172796984529 -0.020129886654102203

Gradient Descent failing for multiple variables, results in NaN

I am trying to implement gradient descent algorithm to minimize a cost function for multiple linear algorithm. I am using the concepts explained in the machine learning class by Andrew Ng. I am using Octave. However when I try to execute the code it seems to fail to provide the solution as my theta values computes to "NaN". I have attached the cost function code and the gradient descent code. Can someone please help.
Cost function :
function J = computeCostMulti(X, y, theta)
m = length(y); % number of training examples
J = 0;
h=(X*theta);
s= sum((h-y).^2);
J= s/(2*m);
Gradient Descent Code:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
a= X*theta -y;
b = alpha*(X'*a);
theta = theta - (b/m);
J_history(iter) = computeCostMulti(X, y, theta);
end
I implemented this algorithm in GNU Octave and I separated this into 2 different functions, first you need to define a gradient function
function [thetaNew] = compute_gradient (X, y, theta, m)
thetaNew = (X'*(X*theta'-y))*1/m;
end
then to compute the gradient descent algorithm use a different function
function [theta] = gd (X, y, alpha, num_iters)
theta = zeros(1,columns(X));
for iter = 1:num_iters,
theta = theta - alpha*compute_gradient(X,y,theta,rows(y))';
end
end
Edit 1
This algorithm works for both multiple linear regression (multiple independent variable) and linear regression of 1 independent variable, I tested this with this dataset
age height weight
41 62 115
21 62 140
31 62 125
21 64 125
31 64 145
41 64 135
41 72 165
31 72 190
21 72 175
31 66 150
31 66 155
21 64 140
For this example we want to predict
predicted weight = theta0 + theta1*age + theta2*height
I used these input values for alpha and num_iters
alpha=0.00037
num_iters=3000000
The output of runing gradient descent for this experiment is as follows:
theta =
-170.10392 -0.40601 4.99799
So the equation is
predicted weight = -170.10392 - .406*age + 4.997*height
This is almost absolute minimum of the gradient, since the true results for
this problem if using PSPP (open source alternative of SPSS) are
predicted weight = -175.17 - .40*age + 5.07*height
Hope this helps to confirm the gradient descent algorithm works same for multiple linear regression and standard linear regression
I did found the bug and it was not either in the logic of the cost function or gradient descent function. But indeed in the feature normilization logic and I was accidentally returning the wrong varible and hence it was cauing the output to be "NaN"
It is dumb mistake :
What I was doing previously
mu= mean(a);
sigma = std(a);
b=(X.-mu);
X= b./sigma;
Instead what I shoul be doing
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
% FEATURENORMALIZE(X) returns a normalized version of X where
% the mean value of each feature is 0 and the standard deviation
% is 1. This is often a good preprocessing step to do when
% working with learning algorithms.
% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
% ====================== YOUR CODE HERE ======================
mu= mean(X);
sigma = std(X);
a=(X.-mu);
X_norm= a./sigma;
% ============================================================
end
So clearly I should be using X_norm insated of X and that is what cauing the code to give wrong output

How do I implement the optimization function in tensorflow?

minΣ(||xi-Xci||^2+ λ||ci||),
s.t cii = 0,
where X is a matrix of shape d * n and C is of the shape n * n, xi and ci means a column of X and C separately.
X is known here and based on X we want to find C.
Usually with a loss like that you need to vectorize it, instead of working with columns:
loss = X - tf.matmul(X, C)
loss = tf.reduce_sum(tf.square(loss))
reg_loss = tf.reduce_sum(tf.square(C), 0) # L2 loss for each column
reg_loss = tf.reduce_sum(tf.sqrt(reg_loss))
total_loss = loss + lambd * reg_loss
To implement the zero constraint on the diagonal of C, the best way is to add it to the loss with another constant lambd2:
reg_loss2 = tf.trace(tf.square(C))
total_loss = total_loss + lambd2 * reg_loss2

Gradient in continuous regression using a neural network

I'm trying to implement a regression NN that has 3 layers (1 input, 1 hidden and 1 output layer with a continuous result). As a basis I took a classification NN from coursera.org class, but changed the cost function and gradient calculation so as to fit a regression problem (and not a classification one):
My nnCostFunction now is:
function [J grad] = nnCostFunctionLinear(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
J = 1/(2*m)*sum(sum((a3 - Y).^2))
th1 = Theta1;
th1(:,1) = 0; %set bias = 0 in reg. formula
th2 = Theta2;
th2(:,1) = 0;
t1 = th1.^2;
t2 = th2.^2;
th = sum(sum(t1)) + sum(sum(t2));
th = lambda * th / (2*m);
J = J + th; %regularization
del_3 = a3 - Y;
t1 = del_3'*a2;
Theta2_grad = 2*(t1)/m + lambda*th2/m;
t1 = del_3 * Theta2;
del_2 = t1 .* a2;
del_2 = del_2(:,2:end);
t1 = del_2'*a1;
Theta1_grad = 2*(t1)/m + lambda*th1/m;
grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Then I use this func in fmincg algorithm, but in firsts iterations fmincg end it's work. I think my gradient is wrong, but I can't find the error.
Can anybody help?
If I understand correctly, your first block of code (shown below) -
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
is to get the output a(3) at the output layer.
Ng's slides about NN has the below configuration to calculate a(3). It's different from what your code presents.
in the middle/output layer, you are not doing the activation function g, e.g., a sigmoid function.
In terms of the cost function J without regularization terms, Ng's slides has the below formula:
I don't understand why you can compute it using:
J = 1/(2*m)*sum(sum((a3 - Y).^2))
because you are not including the log function at all.
Mikhaill, I´ve been playing with a NN for continuous regression as well, and had a similar issues at some point. The best thing to do here would be to test gradient computation against a numerical calculation before running the model. If that´s not correct, fmincg won´t be able to train the model. (Btw, I discourage you of using numerical gradient as the time involved is much bigger).
Taking into account that you took this idea from Ng´s Coursera class, I´ll implement a possible solution for you to try using the same notation for Octave.
% Cost function without regularization.
J = 1/2/m^2*sum((a3-Y).^2);
% In case it´s needed, regularization term is added (i.e. for Training).
if (reg==true);
J=J+lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
endif;
% Derivatives are computed for layer 2 and 3.
d3=(a3.-Y);
d2=d3*Theta2(:,2:end);
% Theta grad is computed without regularization.
Theta1_grad=(d2'*a1)./m;
Theta2_grad=(d3'*a2)./m;
% Regularization is added to grad computation.
Theta1_grad(:,2:end)=Theta1_grad(:,2:end)+(lambda/m).*Theta1(:,2:end);
Theta2_grad(:,2:end)=Theta2_grad(:,2:end)+(lambda/m).*Theta2(:,2:end);
% Unroll gradients.
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Note that, since you have taken out all the sigmoid activation, the derivative calculation is quite simple and results in a simplification of the original code.
Next steps:
1. Check this code to understand if it makes sense to your problem.
2. Use gradient checking to test gradient calculation.
3. Finally, use fmincg and check you get different results.
Try to include sigmoid function to compute second layer (hidden layer) values and avoid sigmoid in calculating the target (output) value.
function [J grad] = nnCostFunction1(nnParams, ...
inputLayerSize, ...
hiddenLayerSize, ...
numLabels, ...
X, y, lambda)
Theta1 = reshape(nnParams(1:hiddenLayerSize * (inputLayerSize + 1)), ...
hiddenLayerSize, (inputLayerSize + 1));
Theta2 = reshape(nnParams((1 + (hiddenLayerSize * (inputLayerSize + 1))):end), ...
numLabels, (hiddenLayerSize + 1));
Theta1Grad = zeros(size(Theta1));
Theta2Grad = zeros(size(Theta2));
m = size(X,1);
a1 = [ones(m, 1) X]';
z2 = Theta1 * a1;
a2 = sigmoid(z2);
a2 = [ones(1, m); a2];
z3 = Theta2 * a2;
a3 = z3;
Y = y';
r1 = lambda / (2 * m) * sum(sum(Theta1(:, 2:end) .* Theta1(:, 2:end)));
r2 = lambda / (2 * m) * sum(sum(Theta2(:, 2:end) .* Theta2(:, 2:end)));
J = 1 / ( 2 * m ) * (a3 - Y) * (a3 - Y)' + r1 + r2;
delta3 = a3 - Y;
delta2 = (Theta2' * delta3) .* sigmoidGradient([ones(1, m); z2]);
delta2 = delta2(2:end, :);
Theta2Grad = 1 / m * (delta3 * a2');
Theta2Grad(:, 2:end) = Theta2Grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
Theta1Grad = 1 / m * (delta2 * a1');
Theta1Grad(:, 2:end) = Theta1Grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
grad = [Theta1Grad(:) ; Theta2Grad(:)];
end
Normalize the inputs before passing it in nnCostFunction.
In accordance with Week 5 Lecture Notes guideline for a Linear System NN you should make following changes in the initial code:
Remove num_lables or make it 1 (in reshape() as well)
No need to convert y into a logical matrix
For a2 - replace sigmoid() function to tanh()
In d2 calculation - replace sigmoidGradient(z2) with (1-tanh(z2).^2)
Remove sigmoid from output layer (a3 = z3)
Replace cost function in the unregularized portion to linear one: J = (1/(2*m))*sum((a3-y).^2)
Create predictLinear(): use predict() function as a basis, replace sigmoid with tanh() for the first layer hypothesis, remove second sigmoid for the second layer hypothesis, remove the line with max() function, use output of the hidden layer hypothesis as a prediction result
Verify your nnCostFunctionLinear() on the test case from the lecture note

Resources