My question is based on the data from Coursera course - https://www.coursera.org/learn/machine-learning/, but after a search is appears to be a common problem.
The gradient descent works perfectly on normalize data (pic.1), but goes in wrong direction on original data(pic.2) with J(cost function) growing very fast toward infinity. The difference between the parameters values is about 10^3.
I thought that normalization is required for better execution speed, I really can't see a reason of this growth in the cost function, even after a lot of search. Decreasing 'alpha', e.g. making it 0.001 or 0.0001 doesn't help either.
Please post if you have any ideas!
P.S. (I had manually provided matrices to the functions, where X_buf - normalized version and X_basic - original; Y - vector of all examles Q - theta vector, alpha - leaning rate).
function [theta, J_history] = gradientDescentMulti(X, Y, theta, alpha, num_iters)
m = length(Y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
theta = theta - (alpha/m)*X'*(X*theta-Y);
J_history(iter) = computeCostMulti(X, Y, theta);
end
end
And the second function:
function J = computeCostMulti(X, Y, theta)
m = length(Y); % number of training examples
J = 0;
J = (1/(2*rows(X)))*(X*theta-Y)'*(X*theta-Y);
end
Screenshots
Related
I tried implementing my own linear regression model in octave with some sample data but the theta does not seem to be correct and does not match the one provided by the normal equation which gives the correct values of theta. But running my model(with different alpha and iterations) on the data from Andrew Ng's machine learning course gives the proper theta for the hypothesis. I have tweaked alpha and iterations so that the cost function decreases. This is the image of cost function against iterations.. As you can see the cost decreases and plateaus but not to a low enough cost. Can somebody help me understand why this is happening and what I can do to fix it?
Here is the data (The first column is the x values, and the second column is the y values):
20,48
40,55.1
60,56.3
80,61.2
100,68
Here is the graph of the data and the equations plotted by gradient descent(GD) and by the normal equation(NE).
Code for the main script:
clear , close all, clc;
%loading the data
data = load("data1.txt");
X = data(:,1);
y = data(:,2);
%Plotting the data
figure
plot(X,y, 'xr', 'markersize', 7);
xlabel("Mass in kg");
ylabel("Length in cm");
X = [ones(length(y),1), X];
theta = ones(2, 1);
alpha = 0.000001; num_iter = 4000;
%Running gradientDescent
[opt_theta, J_history] = gradientDescent(X, y, theta, alpha, num_iter);
%Running Normal equation
opt_theta_norm = pinv(X' * X) * X' * y;
%Plotting the hypothesis for GD and NE
hold on
plot(X(:,2), X * opt_theta);
plot(X(:,2), X * opt_theta_norm, 'g-.', "markersize",10);
legend("Data", "GD", "NE");
hold off
%Plotting values of previous J with each iteration
figure
plot(1:numel(J_history), J_history);
xlabel("iterations"); ylabel("J");
Function for finding gradientDescent:
function [theta, J_history] = gradientDescent (X, y, theta, alpha, num_iter)
m = length(y);
J_history = zeros(num_iter,1);
for iter = 1:num_iter
theta = theta - (alpha / m) * (X' * (X * theta - y));
J_history(iter) = computeCost(X, y, theta);
endfor
endfunction
Function for computing cost:
function J = computeCost (X, y, theta)
J = 0;
m = length(y);
errors = X * theta - y;
J = sum(errors .^ 2) / (2 * m);
endfunction
Try alpha = 0.0001 and num_iter = 400000. This will solve your problem!
Now, the problem with your code is that the learning rate is way too less which is slowing down the convergence. Also, you are not giving it enough time to converge by limiting the training iterations to 4000 only which is very less given the learning rate.
Summarising, the problem is: less learning rate + less iterations.
I am enrolled in Andrew Ng's Machine Learning course. And there is an assignment, which I did complete of course... but the code I was using earlier wasn't working... and in the end I had to look for the right answer on Internet.
But anyways, from "wasn't working" I mean that it was working for that particular question, but thing is when they check my code then they have a different training set so my code should work with all types of training sets in order to be considered a right code. But the problem was that my code wasn't working with other training sets.
Now, to make it a bit more clear... all other training sets were supposed to have only one feature... that is we only needed to implement linear regression with one variable.
So I did everything right till the part where we have to calculate cost function. But when I had to compute the gradient descent, in that function I had to just write the code for the gradient descent steps.
So here's my code -
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
% First Step
Prediction = X*theta;
A = (Prediction - y);
theta(1) = theta(1) - ((alpha * (1/m)) * sum(A' * X(:,1)));
% Second Step
Prediction = X*theta;
A = (Prediction - y);
theta(2) = theta(2) - ((alpha * (1/m)) * sum(A' * X(:,2)));
J_history(iter) = computeCost(X, y, theta);
end
end
Here is the code which is working, which I found on Internet... from one of the threads in Stack Overflow only -
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
Prediction = X*theta;
A = (Prediction - y);
theta = theta - ((alpha * (1/m)) * (A' * X)');
J_history(iter) = computeCost(X, y, theta);
end
end
From what I can see from my code, that it should work properly... but dunno why it isn't working with other training sets? Because no matter how many training examples there are, it is always gonna be a column vector, or m*2 if we add x0. So I don't see how my code won't work?
Vectorized implementation of gradient descent
for iter = 1:num_iters
theta = theta - (alpha / m) * X' * (X * theta - y);
J_history(iter) = computeCostMulti(X, y, theta);
end
Implementation of computeCostMulti()
function J = computeCostMulti(X, y, theta)
m = length(y);
J = 0;
J = 1 / (2 * m) * (X * theta - y)' * (X * theta - y);
Normal equation implementation
theta = pinv(X' * X) * X' * y;
These two implementations converge to different values of theta for the same values of X and y. The Normal Equation gives the right answer but Gradient descent gives a wrong answer.
Is there anything wrong with the implementation of Gradient Descent?
I suppose that when you use gradient descent, you first process your input using feature scaling. That is not done with the normal equation method (as feature scaling is not required), and that should result in a different theta. If you use your models to make predictions they should come up with the same result.
It doesn't matter. As you're not making feature scaling to use the normal equation, you'll discover that the prediction is the same
Nobody promised you that gradient descent with fixed step size will converge by num_iters iterations even to a local optimum. You need to iterate until some well defined convergency criteria are met (e.g. gradient is close to zero).
If you have normalized the training data before gradient descent, you should also do it with your input data for the prediction. Concretely, your new input data should be like:
[1, (x-mu)/sigma]
where:
- 1 is the bias term
- mu is the mean of the training data
- sigma is the standard deviation of the training data
I am trying to move on from simple linear single-variable gradient descent into something more advanced: best polynomial fit for a set of points. I created a simple octave test script which allows me to visually set the points in a 2D space, then start the gradient dsecent algorithm and see how it gradually approaches the best fit.
Unfortunately, it doesn't work as good as it did with the simple single-variable linear regression: the results I get ( when I get them ) are inconsistent with the polynome I expect!
Here is the code:
dim=5;
h = figure();
axis([-dim dim -dim dim]);
hold on
index = 1;
data = zeros(1,2);
while(1)
[x,y,b] = ginput(1);
if( length(b) == 0 )
break;
endif
plot(x, y, "b+");
data(index, :) = [x y];
index++;
endwhile
y = data(:, 2);
m = length(y);
X = data(:, 1);
X = [ones(m, 1), data(:,1), data(:,1).^2, data(:,1).^3 ];
theta = zeros(4, 1);
iterations = 100;
alpha = 0.001;
J = zeros(1,iterations);
for iter = 1:iterations
theta -= ( (1/m) * ((X * theta) - y)' * X)' * alpha;
plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "g-");
J(iter) = sum( (1/m) * ((X * theta) - y)' * X);
end
plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "r-");
figure()
plot(1:iter, J);
I continuously get wrong results, even though it would seem that J is minimized correctly. I checked the plotting function with the normal equation ( which works correctly of course, and although I believe the error lies somewhere in the theta equation, I cannot figure out what it.
i implemented your code and it seems to be just fine, the reason that you do not have the results that you want is that Linear regression or polynomial regression in your case suffers from local minimum when you try to minimize the objective function. The algorithm traps in local minimum during execution. i implement your code changing the step (alpha) and i saw that with smaller step it fits the data better but still you are trapping in local minimum.
Choosing random initialization point of thetas every time i am trapping in a different local minimum. If you are lucky you will find a better initial points for theta and fit the data better. I think that there are some algorithms that finds the best initial points.
Below i attach the results for random initial points and the results with Matlab's polyfit.
In the above plot replace "Linear Regression with Polynomial Regression", type error.
If you observe better the plot, you will see that by chance (using rand() ) i chose some initial points that leaded me to the best data fitting comparing the other initial points.... i am showing that with a pointer.
I implemented a gradient descent algorithm to minimize a cost function in order to gain a hypothesis for determining whether an image has a good quality. I did that in Octave. The idea is somehow based on the algorithm from the machine learning class by Andrew Ng
Therefore I have 880 values "y" that contains values from 0.5 to ~12. And I have 880 values from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail, after some iterations the value for theta is so small, that theta0 and theta1 become "NaN". And my linear regression curve has strange values...
here is the code for the gradient descent algorithm:
(theta = zeros(2, 1);, alpha= 0.01, iterations=1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
tmp_j1=0;
for i=1:m,
tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
end
tmp_j2=0;
for i=1:m,
tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
end
tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1))
tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2))
theta(1,1)=tmp1
theta(2,1)=tmp2
% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCost(X, y, theta);
end
end
And here is the computation for the costfunction:
function J = computeCost(X, y, theta) %
m = length(y); % number of training examples
J = 0;
tmp=0;
for i=1:m,
tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; %differenzberechnung
end
J= (1/(2*m)) * tmp
end
If you are wondering how the seemingly complex looking for loop can be vectorized and cramped into a single one line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation for how we arrive at this vectorized expression using gradient descent algorithm:
This is the gradient descent algorithm to fine tune the value of θ:
Assume that the following values of X, y and θ are given:
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features+1)
X = m x n matrix
y = m x 1 vector matrix
θ = n x 1 vector matrix
xi is the ith training example
xj is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)
whole objective of machine learning is to minimize Errors in predictions. Based on the above corollary, our Errors matrix is m x 1 vector matrix as follows:
To calculate new value of θj, we have to get a summation of all errors (m rows) multiplied by jth feature value of the training set X. That is, take all the values in E, individually multiply them with jth feature of the corresponding training example, and add them all together. This will help us in getting the new (and hopefully better) value of θj. Repeat this process for all j or the number of features. In matrix form, this can be written as:
This can be simplified as:
[E]' x [X] will give us a row vector matrix, since E' is 1 x m matrix and X is m x n matrix. But we are interested in getting a column matrix, hence we transpose the resultant matrix.
More succinctly, it can be written as:
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
i vectorized the theta thing...
may could help somebody
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended NG's class last year and I have the following implementation (vectorized):
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize them.
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you are setting the temporary thetas (here called theta_1 and theta_2) correctly back to the "real" theta.
Generally it is more useful to vectorize instead of loops, it is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
While not scalable like a vectorized version, a loop-based computation of a gradient descent should generate the same results. In the example above, the most probably case of the gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a set of data similar with the one described in the question, theta ends up with NaN values just after a few iterations if alpha = 0.01. However, when set as alpha = 0.000001, the gradient descent works as expected, even after 100 iterations.
Using only vectors here is the compact implementation of LR with Gradient Descent in Mathematica:
Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
Theta = Theta -
alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
Jhist[[k]] =
Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course one assumes that X is a n * 2 matrix, with X[[,1]] containing only 1s'
This should work:-
theta(1,1) = theta(1,1) - (alpha*(1/m))*((X*theta - y)'* X(:,1) );
theta(2,1) = theta(2,1) - (alpha*(1/m))*((X*theta - y)'* X(:,2) );
its cleaner this way, and vectorized also
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first Pdf file for Gradient Descent form machine Learning course, you would take care of learning rate. Here is the note from the mentioned pdf.
Implementation Note: If your learning rate is too large, J(theta) can di-
verge and blow up', resulting in values which are too large for computer
calculations. In these situations, Octave/MATLAB will tend to return
NaNs. NaN stands fornot a number' and is often caused by undened
operations that involve - infinity and +infinity.