I am trying to implement gradient descent for linear regression using this resource: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
My problem is that my weights are exploding (increasing exponentially) and essentially doing the opposite of what is intended.
First I created a data set:
import numpy as np
import matplotlib.pyplot as plt

def y(x, a):
    # the true line 2*x plus uniform noise drawn from [-a/2, a/2)
    return 2*x + a*np.random.random_sample(len(x)) - a/2

x = np.arange(20)
y_true = y(x, 10)
Which looks like this:
And the linear function to be optimized:
def y_predict(x, m, b):
    return m*x + b
So for some randomly chosen parameters, this is the result:
m0 = 1
b0 = 1
a = y_predict(x, m0, b0)
plt.scatter(x, y_true)
plt.plot(x, a)
plt.show()
Now the cost would look like this:
cost = (1/2)* np.sum((y_true - a) ** 2)
The partial derivative of the cost with respect to the prediction (dc_da):
dc_da = (a - y_true) # still a vector
The partial derivative of the cost with respect to the slope parameter (dc_dm):
dc_dm = dc_da.dot(x) # now a constant
And the partial derivative of the cost with respect to the y-intercept parameter (dc_db):
dc_db = np.sum(dc_da) # also a constant
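For reference, the chain rule behind those two lines, with $a_i = m x_i + b$:

$$C = \frac{1}{2}\sum_i (y_i - a_i)^2,\qquad \frac{\partial C}{\partial m} = \sum_i (a_i - y_i)\,x_i,\qquad \frac{\partial C}{\partial b} = \sum_i (a_i - y_i).$$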
And finally the implementation of gradient descent:
iterations = 10
m0 = 1
b0 = 1
learning_rate = 0.1
N = len(x)

for i in range(iterations):
    a = y_predict(x, m0, b0)
    cost = (1/2) * np.sum((y_true - a) ** 2)
    dc_da = (a - y_true)
    mgrad = dc_da.dot(x)
    bgrad = np.sum(dc_da)
    m0 -= learning_rate * (2 / N) * mgrad
    b0 -= learning_rate * (2 / N) * bgrad
    if i % 2 == 0:
        print("Iteration {}".format(i))
        print("Cost: {}, m: {}, b: {}\n".format(cost, m0, b0))
For which the result is:
Iteration 0
Cost: 1341.5241150881411, m: 26.02473879743261, b: 2.8683883457327797
Iteration 2
Cost: 409781757.38124645, m: 13657.166910552878, b: 1053.5831308528543
Iteration 4
Cost: 132510115599264.75, m: 7765058.4350503925, b: 598610.1166795876
Iteration 6
Cost: 4.284947676217907e+19, m: 4415631880.089208, b: 340401694.5610262
Iteration 8
Cost: 1.3856132043127762e+25, m: 2510967578365.3584, b: 193570850213.62192
Clearly, something is wrong. But I do not know what is wrong with my implementation.
Thanks for reading
The problem is the learning rate.
At a learning rate of 0.1, the steps were too large, so each update overshot the minimum and the parameters diverged instead of descending the gradient.
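For this particular x you can even compute how small the step has to be: gradient descent on a quadratic cost diverges once the effective step size exceeds 2 divided by the largest eigenvalue of the Hessian. A minimal sketch of that bound for the data above (the 2/N factor matches the update in the loop):

import numpy as np

x = np.arange(20)
N = len(x)

# Hessian of (1/2) * sum((m*x + b - y)**2) with respect to (m, b)
H = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    N]])

# the loop multiplies the gradient by learning_rate * (2/N), so stability
# requires learning_rate * (2/N) * lam_max < 2, i.e. learning_rate < N / lam_max
lam_max = np.linalg.eigvalsh(H).max()
print("stable for learning_rate <", N / lam_max)  # roughly 0.008

This explains why 0.1 explodes while 0.001 converges.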
At a learning rate of 0.001, this is the result:
Iteration 0
Cost: 1341.5241150881411, m: 1.250247387974326, b: 1.0186838834573277
Iteration 20
Cost: 74.23350734398517, m: 2.0054600094398487, b: 1.0648169455682297
Iteration 40
Cost: 74.14854910310204, m: 2.00886824141609, b: 1.0531220375231194
Iteration 60
Cost: 74.07892801481468, m: 2.0097830838155835, b: 1.0413622803654885
Iteration 80
Cost: 74.01078231057598, m: 2.0106800645568503, b: 1.0297271562539492
Which looks like:
plt.scatter(x,y_true)
plt.plot(x, a)
plt.show()
I wrote an RNN with an LSTM cell in PyTorch. The peculiarity of this network is that the output of the RNN is fed into an integration operation, computed with Runge-Kutta.
The integration takes some input and propagates it one step ahead in time. In order to do so I need to slice the feature tensor X along the batch dimension and pass each slice to the Runge-Kutta routine.
import torch

class MyLSTM(torch.nn.Module):
    def __init__(self, ni, no, sampling_interval, nh=10, nlayers=1):
        super(MyLSTM, self).__init__()
        self.device = torch.device("cpu")
        self.dtype = torch.float
        self.ni = ni
        self.no = no
        self.nh = nh
        self.nlayers = nlayers
        self.lstms = torch.nn.ModuleList(
            [torch.nn.LSTMCell(self.ni, self.nh)]
            + [torch.nn.LSTMCell(self.nh, self.nh) for i in range(nlayers - 1)])
        self.out = torch.nn.Linear(self.nh, self.no)
        self.do = torch.nn.Dropout(p=0.2)
        self.actfn = torch.nn.Sigmoid()
        self.sampling_interval = sampling_interval
        self.scaler_states = None

    # Options
    # description of the whole block
    def forward(self, x, h0, train=False, integrate_ode=True):
        x0 = x.clone().requires_grad_(True)
        hs = x  # initiate hidden state
        if h0 is None:
            h = torch.zeros(hs.shape[0], self.nh, device=self.device)
            c = torch.zeros(hs.shape[0], self.nh, device=self.device)
        else:
            (h, c) = h0

        # LSTM cells
        for i in range(self.nlayers):
            h, c = self.lstms[i](hs, (h, c))
            if train:
                hs = self.do(h)
            else:
                hs = h

        # Output layer
        # y = self.actfn(self.out(hs))
        y = self.out(hs)

        if integrate_ode:
            p = y
            y = self.integrate(x0, p)
        return y, (h, c)

    def integrate(self, x0, p):
        # RK4 steps per interval
        M = 4
        DT = self.sampling_interval / M
        X = x0
        # X = self.scaler_features.inverse_transform(x0)
        X_all = torch.zeros_like(X)  # output buffer (missing from the original snippet)
        for b in range(X.shape[0]):
            xx = X[b, :]
            for j in range(M):
                k1 = self.ode(xx, p[b, :])
                k2 = self.ode(xx + DT / 2 * k1, p[b, :])
                k3 = self.ode(xx + DT / 2 * k2, p[b, :])
                k4 = self.ode(xx + DT * k3, p[b, :])
                xx = xx + DT / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            X_all[b, :] = xx
        return X_all

    def ode(self, x0, y):
        # here I have a dynamic model (omitted)
        pass
I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor []], which is output 0 of SelectBackward, is at version 64; expected version 63 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
The problem is in the operations xx = X[b, :] and p[b, :]. I know that because if I choose a batch dimension of 1 I can replace those two expressions with xx = X and p, and then it works. How can I split the tensor without losing the gradient?
I had the same question and, after a lot of searching, I added a .detach() call on "h" and "c" in the RNN cell.
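For illustration, a minimal self-contained sketch of that workaround (the cell, input, and sizes here are made up); note that detaching also truncates backpropagation through the detached states:

import torch

cell = torch.nn.LSTMCell(4, 8)   # hypothetical input/hidden sizes
x = torch.randn(2, 4)
h = torch.zeros(2, 8)
c = torch.zeros(2, 8)

for step in range(5):
    h, c = cell(x, (h, c))
    # detach the recurrent state so the next step does not keep
    # references to earlier versions of the autograd graph
    h, c = h.detach(), c.detach()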
I am having some doubts while practicing a PyTorch program.
I have a function like y = m1*x1 + m2*x2 + c (just two weights to learn here). The expected values of the weights are 16 and -14, and the bias should be 36. But in every epoch the learned weights grow very large. Can anyone help me debug these 20 lines of code and understand what is going wrong?
import torch

x = torch.randint(size=(1, 2), high=10)
w = torch.Tensor([16, -14])
b = 36

# Compute ground truth
y = w * x + b

# Find weights by program
epoch = 20
learning_rate = 30

# initialize randomly
w1 = torch.rand(size=(1, 2), requires_grad=True)
b1 = torch.ones(size=[1], requires_grad=True)

for i in range(epoch):
    y1 = w1 * x + b1
    # loss function: sum of squared errors
    loss = torch.sum((y1 - y) ** 2)
    # find gradients
    loss.backward()
    with torch.no_grad():
        # update parameters
        w1 -= (learning_rate * w1.grad)
        b1 -= (learning_rate * b1.grad)
        w1.grad.zero_()
        b1.grad.zero_()
    print("B ", b1)
    print("W ", w1)
Thanks,
Ganesh
You have a very large learning rate.
This is an illustration from Jeremy Jordan's blog that explains exactly what is going on in your case.
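As a sketch of the same loop with a much smaller rate (the seed, the rate of 0.01, and the iteration count are arbitrary choices; also, with a single data point the fit is underdetermined, so the learned w1 and b1 need not equal 16, -14 and 36 exactly even though the loss goes to zero):

import torch

torch.manual_seed(0)  # hypothetical seed, for reproducibility
x = torch.randint(size=(1, 2), high=10).float()
w = torch.Tensor([16, -14])
b = 36
y = w * x + b  # ground truth, as in the question

w1 = torch.rand(size=(1, 2), requires_grad=True)
b1 = torch.ones(size=[1], requires_grad=True)

learning_rate = 0.01  # much smaller than 30
for i in range(20000):
    y1 = w1 * x + b1
    loss = torch.sum((y1 - y) ** 2)
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        b1 -= learning_rate * b1.grad
        w1.grad.zero_()
        b1.grad.zero_()

print("W", w1, "B", b1, "loss", loss.item())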
$$\min_C \sum_i \Big( \lVert x_i - X c_i \rVert^2 + \lambda \lVert c_i \rVert \Big) \quad \text{s.t. } c_{ii} = 0,$$
where X is a matrix of shape d × n, C is of shape n × n, and x_i and c_i denote the i-th columns of X and C respectively.
X is known here, and based on X we want to find C.
Usually with a loss like that you need to vectorize it, instead of working with columns:
loss = X - tf.matmul(X, C)
loss = tf.reduce_sum(tf.square(loss))
reg_loss = tf.reduce_sum(tf.square(C), 0) # L2 loss for each column
reg_loss = tf.reduce_sum(tf.sqrt(reg_loss))
total_loss = loss + lambd * reg_loss
To implement the zero constraint on the diagonal of C, the best way is to add it to the loss with another constant lambd2:
reg_loss2 = tf.trace(tf.square(C))
total_loss = total_loss + lambd2 * reg_loss2
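Putting the pieces together, a minimal end-to-end sketch with the same TF1-style API (the sizes, lambda values, optimizer, and step count here are illustrative assumptions, not from the question):

import numpy as np
import tensorflow as tf  # TF1-style API, matching the snippets above

d, n = 5, 10                                  # hypothetical sizes
X_np = np.random.randn(d, n).astype(np.float32)

X = tf.constant(X_np)
C = tf.Variable(tf.random_normal([n, n]))

loss = tf.reduce_sum(tf.square(X - tf.matmul(X, C)))
# sum of column-wise L2 norms (small epsilon keeps sqrt differentiable at 0)
reg_loss = tf.reduce_sum(tf.sqrt(tf.reduce_sum(tf.square(C), 0) + 1e-12))
reg_loss2 = tf.trace(tf.square(C))            # pushes the diagonal toward zero

lambd, lambd2 = 0.1, 10.0                     # illustrative constants
total_loss = loss + lambd * reg_loss + lambd2 * reg_loss2

train_op = tf.train.AdamOptimizer(0.01).minimize(total_loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)
    print(sess.run(total_loss))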
I am trying to implement gradient descent in Python; the implementation works when I try it with training_set1, but it returns NaN (not a number) when I try it with training_set. Any idea why my code is broken?
from collections import namedtuple

TrainingInstance = namedtuple("TrainingInstance", ['X', 'Y'])

training_set1 = [TrainingInstance(0, 4), TrainingInstance(1, 7),
                 TrainingInstance(2, 7), TrainingInstance(3, 8),
                 TrainingInstance(8, 12)]

training_set = [TrainingInstance(60, 3.1), TrainingInstance(61, 3.6),
                TrainingInstance(62, 3.8), TrainingInstance(63, 4),
                TrainingInstance(65, 4.1)]

def grad_desc(x, x1):
    # minimize a cost function of two variables using gradient descent
    training_rate = 0.1
    iterations = 5000
    #while sqrd_error(x, x1) > 0.0000001:
    while iterations > 0:
        #print sqrd_error(x, x1)
        x, x1 = x - (training_rate * deriv(x, x1)), x1 - (training_rate * deriv1(x, x1))
        iterations -= 1
    return x, x1

def sqrd_error(x, x1):
    sum = 0.0
    for inst in training_set:
        sum += ((x + x1 * inst.X) - inst.Y)**2
    return sum / (2.0 * len(training_set))

def deriv(x, x1):
    sum = 0.0
    for inst in training_set:
        sum += ((x + x1 * inst.X) - inst.Y)
    return sum / len(training_set)

def deriv1(x, x1):
    sum = 0.0
    for inst in training_set:
        sum += ((x + x1 * inst.X) - inst.Y) * inst.X
    return sum / len(training_set)

if __name__ == "__main__":
    print(grad_desc(2, 2))
Reduce training_rate so that the objective decreases at each iteration.
See Figure 6 in this paper: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
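To see why training_set in particular produces NaN: its X values sit around 60, so the x1 update is scaled by numbers of that magnitude and a 0.1 step overshoots further each iteration until the floats overflow. Besides shrinking training_rate, centering the inputs makes the original rate safe again; a minimal sketch of that idea (variable names are mine, and x0 becomes the intercept in centered coordinates):

xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4, 4.1]
mean_x = sum(xs) / len(xs)

x0, x1 = 2.0, 2.0       # intercept, slope
rate = 0.1
for _ in range(5000):
    d0 = sum((x0 + x1 * (x - mean_x)) - y for x, y in zip(xs, ys)) / len(xs)
    d1 = sum(((x0 + x1 * (x - mean_x)) - y) * (x - mean_x)
             for x, y in zip(xs, ys)) / len(xs)
    x0, x1 = x0 - rate * d0, x1 - rate * d1
print(x0, x1)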
I'm trying to implement a regression NN that has 3 layers (1 input, 1 hidden, and 1 output layer with a continuous result). As a basis I took a classification NN from a coursera.org class, but changed the cost function and the gradient calculation to fit a regression problem (not a classification one):
My nnCostFunction now is:
function [J grad] = nnCostFunctionLinear(nn_params, ...
                                         input_layer_size, ...
                                         hidden_layer_size, ...
                                         num_labels, ...
                                         X, y, lambda)

Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;

J = 1/(2*m)*sum(sum((a3 - Y).^2))

th1 = Theta1;
th1(:,1) = 0; % set bias = 0 in reg. formula
th2 = Theta2;
th2(:,1) = 0;
t1 = th1.^2;
t2 = th2.^2;
th = sum(sum(t1)) + sum(sum(t2));
th = lambda * th / (2*m);

J = J + th; % regularization

del_3 = a3 - Y;
t1 = del_3'*a2;
Theta2_grad = 2*(t1)/m + lambda*th2/m;

t1 = del_3 * Theta2;
del_2 = t1 .* a2;
del_2 = del_2(:,2:end);
t1 = del_2'*a1;
Theta1_grad = 2*(t1)/m + lambda*th1/m;

grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Then I use this function in the fmincg algorithm, but after the first few iterations fmincg ends its work. I think my gradient is wrong, but I can't find the error.
Can anybody help?
If I understand correctly, your first block of code (shown below) -
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
is to get the output a(3) at the output layer.
Ng's slides about NNs have the below configuration to calculate a(3); it's different from what your code presents:
in the middle/output layers, you are not applying the activation function g, e.g., a sigmoid function.
In terms of the cost function J without regularization terms, Ng's slides have the below (classification) formula:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)} \log\big(h_\theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\log\Big(1-\big(h_\theta(x^{(i)})\big)_k\Big) \Big]$$

I don't understand why you can compute it using:
J = 1/(2*m)*sum(sum((a3 - Y).^2))
because you are not including the log function at all.
Mikhaill, I've been playing with a NN for continuous regression as well, and had similar issues at some point. The best thing to do here would be to test the gradient computation against a numerical calculation before running the model. If it's not correct, fmincg won't be able to train the model. (By the way, I discourage you from using the numerical gradient for training, as the time involved is much bigger.)
Taking into account that you took this idea from Ng's Coursera class, I'll implement a possible solution for you to try, using the same notation for Octave.
% Cost function without regularization (note: 1/(2*m), not 1/(2*m^2),
% so that it matches the ./m scaling in the gradients below).
J = 1/(2*m)*sum((a3-Y).^2);
% In case it´s needed, regularization term is added (i.e. for Training).
if (reg==true);
J=J+lambda/2/m*(sum(sum(Theta1(:,2:end).^2))+sum(sum(Theta2(:,2:end).^2)));
endif;
% Derivatives are computed for layer 2 and 3.
d3=(a3.-Y);
d2=d3*Theta2(:,2:end);
% Theta grad is computed without regularization.
Theta1_grad=(d2'*a1)./m;
Theta2_grad=(d3'*a2)./m;
% Regularization is added to grad computation.
Theta1_grad(:,2:end)=Theta1_grad(:,2:end)+(lambda/m).*Theta1(:,2:end);
Theta2_grad(:,2:end)=Theta2_grad(:,2:end)+(lambda/m).*Theta2(:,2:end);
% Unroll gradients.
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Note that, since you have taken out all the sigmoid activation, the derivative calculation is quite simple and results in a simplification of the original code.
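Spelling that simplification out: with identity activations, $a_3 = a_2\,\Theta_2^{\top}$ and $J = \frac{1}{2m}\sum (a_3 - Y)^2$, so

$$\delta_3 = a_3 - Y,\qquad \nabla_{\Theta_2} J = \frac{1}{m}\,\delta_3^{\top} a_2,\qquad \delta_2 = \delta_3\,\Theta_2,\qquad \nabla_{\Theta_1} J = \frac{1}{m}\,\delta_2^{\top} a_1,$$

with no $g'(z)$ factor in $\delta_2$ (and the bias column of $\Theta_2$ dropped when forming $\delta_2$, which is what d2=d3*Theta2(:,2:end) does).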
Next steps:
1. Check this code to understand if it makes sense to your problem.
2. Use gradient checking to test gradient calculation.
3. Finally, use fmincg and check you get different results.
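For step 2, a minimal numerical gradient check, sketched in Python/NumPy for brevity (the quadratic cost below is a stand-in; plug in your own unrolled-parameter cost function):

import numpy as np

def numerical_grad(cost, theta, eps=1e-4):
    # central differences: (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (cost(theta + e) - cost(theta - e)) / (2 * eps)
    return grad

# sanity check on a cost whose gradient is known to be 2*theta
theta = np.array([1.0, -2.0, 3.0])
cost = lambda t: np.sum(t ** 2)
print(np.max(np.abs(numerical_grad(cost, theta) - 2 * theta)))  # ~1e-9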
Try to include a sigmoid function to compute the second-layer (hidden-layer) values, and avoid the sigmoid when calculating the target (output) value.
function [J grad] = nnCostFunction1(nnParams, ...
                                    inputLayerSize, ...
                                    hiddenLayerSize, ...
                                    numLabels, ...
                                    X, y, lambda)

Theta1 = reshape(nnParams(1:hiddenLayerSize * (inputLayerSize + 1)), ...
                 hiddenLayerSize, (inputLayerSize + 1));
Theta2 = reshape(nnParams((1 + (hiddenLayerSize * (inputLayerSize + 1))):end), ...
                 numLabels, (hiddenLayerSize + 1));

Theta1Grad = zeros(size(Theta1));
Theta2Grad = zeros(size(Theta2));

m = size(X,1);

a1 = [ones(m, 1) X]';
z2 = Theta1 * a1;
a2 = sigmoid(z2);
a2 = [ones(1, m); a2];
z3 = Theta2 * a2;
a3 = z3;

Y = y';

r1 = lambda / (2 * m) * sum(sum(Theta1(:, 2:end) .* Theta1(:, 2:end)));
r2 = lambda / (2 * m) * sum(sum(Theta2(:, 2:end) .* Theta2(:, 2:end)));

J = 1 / (2 * m) * (a3 - Y) * (a3 - Y)' + r1 + r2;

delta3 = a3 - Y;
delta2 = (Theta2' * delta3) .* sigmoidGradient([ones(1, m); z2]);
delta2 = delta2(2:end, :);

Theta2Grad = 1 / m * (delta3 * a2');
Theta2Grad(:, 2:end) = Theta2Grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
Theta1Grad = 1 / m * (delta2 * a1');
Theta1Grad(:, 2:end) = Theta1Grad(:, 2:end) + lambda / m * Theta1(:, 2:end);

grad = [Theta1Grad(:) ; Theta2Grad(:)];
end
Normalize the inputs before passing them to nnCostFunction.
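A short sketch of that normalization (NumPy, with made-up data):

import numpy as np

X = np.random.randn(100, 3) * 50 + 10    # hypothetical unnormalized features
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma                # zero mean, unit variance per column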
In accordance with the Week 5 Lecture Notes guideline for a Linear System NN, you should make the following changes in the initial code (a sketch of the resulting forward pass follows the list):
1. Remove num_labels or set it to 1 (in reshape() as well).
2. There is no need to convert y into a logical matrix.
3. For a2, replace the sigmoid() function with tanh().
4. In the d2 calculation, replace sigmoidGradient(z2) with (1 - tanh(z2).^2).
5. Remove the sigmoid from the output layer (a3 = z3).
6. Replace the cost function in the unregularized portion with a linear one: J = (1/(2*m))*sum((a3-y).^2).
7. Create predictLinear(): use the predict() function as a basis, replace sigmoid with tanh() for the first-layer hypothesis, remove the second sigmoid for the second-layer hypothesis, remove the line with the max() function, and use the output of the hidden-layer hypothesis as the prediction result.
8. Verify your nnCostFunctionLinear() on the test case from the lecture notes.
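As a sketch of the forward pass and cost after those changes, translated to NumPy for illustration (the real version stays in Octave):

import numpy as np

def forward_linear(Theta1, Theta2, X):
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.tanh(a1 @ Theta1.T)               # tanh hidden layer (change 3)
    a2 = np.hstack([np.ones((m, 1)), a2])
    return a2 @ Theta2.T                      # linear output, no sigmoid (change 5)

def cost_linear(a3, y):
    m = y.shape[0]
    return np.sum((a3 - y) ** 2) / (2 * m)    # linear cost (change 6)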