Cost Function, Linear Regression, trying to avoid hard coding theta. Octave. - machine-learning

I'm in the second week of Professor Andrew Ng's Machine Learning course through Coursera. We're working on linear regression and right now I'm dealing with coding the cost function.
The code I've written solves the problem correctly but does not pass the submission process and fails the unit test because I have hard coded the values of theta and not allowed for more than two values for theta.
Here's the code I've got so far
function J = computeCost(X, y, theta)
m = length(y);
J = 0;
for i = 1:m,
h = theta(1) + theta(2) * X(i)
a = h - y(i);
b = a^2;
J = J + b;
end;
J = J * (1 / (2 * m));
end
the unit test is
computeCost( [1 2 3; 1 3 4; 1 4 5; 1 5 6], [7;6;5;4], [0.1;0.2;0.3])
and should produce ans = 7.0175
So I need to add another for loop to iterate over theta, therefore allowing for any number of values for theta, but I'll be damned if I can wrap my head around how/where.
Can anyone suggest a way I can allow for any number of values for theta within this function?
If you need more information to understand what I'm trying to ask, I will try my best to provide it.

You can use vectorize of operations in Octave/Matlab.
Iterate over entire vector - it is really bad idea, if your programm language let you vectorize operations.
R, Octave, Matlab, Python (numpy) allow this operation.
For example, you can get scalar production, if theta = (t0, t1, t2, t3) and X = (x0, x1, x2, x3) in the next way:
theta * X' = (t0, t1, t2, t3) * (x0, x1, x2, x3)' = t0*x0 + t1*x1 + t2*x2 + t3*x3
Result will be scalar.
For example, you can vectorize h in your code in the next way:
H = (theta'*X')';
S = sum((H - y) .^ 2);
J = S / (2*m);

Above answer is perfect but you can also do
H = (X*theta);
S = sum((H - y) .^ 2);
J = S / (2*m);
Rather than computing
(theta' * X')'
and then taking the transpose you can directly calculate
(X * theta)
It works perfectly.

The below line return the required 32.07 cost value while we run computeCost once using θ initialized to zeros:
J = (1/(2*m)) * (sum(((X * theta) - y).^2));
and is similar to the original formulas that is given below.

It can be also done in a line-
m- # training sets
J=(1/(2*m)) * ((((X * theta) - y).^2)'* ones(m,1));

J = sum(((X*theta)-y).^2)/(2*m);
ans = 32.073
Above answer is perfect,I thought the problem deeply for a day and still unfamiliar with Octave,so,Just study together!

If you want to use only matrix, so:
temp = (X * theta - y); % h(x) - y
J = ((temp')*temp)/(2 * m);
clear temp;

This would work just fine for you -
J = sum((X*theta - y).^2)*(1/(2*m))
This directly follows from the Cost Function Equation

Python code for the same :
def computeCost(X, y, theta):
m = y.size # number of training examples
J = 0
H = (X.dot(theta))
S = sum((H - y)**2);
J = S / (2*m);
return J

function J = computeCost(X, y, theta)
m = length(y);
J = 0;
% Hypothesis h(x)
h = X * theta;
% Error function (h(x) - y) ^ 2
squaredError = (h-y).^2;
% Cost function
J = sum(squaredError)/(2*m);
end

I think we needed to use iteration for much general solution for cost rather one iteration, also the result shows in the PDF 32.07 may not be correct answer that grader is looking for reason being its a one case out of many training data.
I think it should loop through like this
for i in 1:iteration
theta = theta - alpha*(1/m)(theta'*x-y)*x
j = (1/(2*m))(theta'*x-y)^2

Related

Gradient function not able to find optimal theta but normal equation does

I tried implementing my own linear regression model in octave with some sample data but the theta does not seem to be correct and does not match the one provided by the normal equation which gives the correct values of theta. But running my model(with different alpha and iterations) on the data from Andrew Ng's machine learning course gives the proper theta for the hypothesis. I have tweaked alpha and iterations so that the cost function decreases. This is the image of cost function against iterations.. As you can see the cost decreases and plateaus but not to a low enough cost. Can somebody help me understand why this is happening and what I can do to fix it?
Here is the data (The first column is the x values, and the second column is the y values):
20,48
40,55.1
60,56.3
80,61.2
100,68
Here is the graph of the data and the equations plotted by gradient descent(GD) and by the normal equation(NE).
Code for the main script:
clear , close all, clc;
%loading the data
data = load("data1.txt");
X = data(:,1);
y = data(:,2);
%Plotting the data
figure
plot(X,y, 'xr', 'markersize', 7);
xlabel("Mass in kg");
ylabel("Length in cm");
X = [ones(length(y),1), X];
theta = ones(2, 1);
alpha = 0.000001; num_iter = 4000;
%Running gradientDescent
[opt_theta, J_history] = gradientDescent(X, y, theta, alpha, num_iter);
%Running Normal equation
opt_theta_norm = pinv(X' * X) * X' * y;
%Plotting the hypothesis for GD and NE
hold on
plot(X(:,2), X * opt_theta);
plot(X(:,2), X * opt_theta_norm, 'g-.', "markersize",10);
legend("Data", "GD", "NE");
hold off
%Plotting values of previous J with each iteration
figure
plot(1:numel(J_history), J_history);
xlabel("iterations"); ylabel("J");
Function for finding gradientDescent:
function [theta, J_history] = gradientDescent (X, y, theta, alpha, num_iter)
m = length(y);
J_history = zeros(num_iter,1);
for iter = 1:num_iter
theta = theta - (alpha / m) * (X' * (X * theta - y));
J_history(iter) = computeCost(X, y, theta);
endfor
endfunction
Function for computing cost:
function J = computeCost (X, y, theta)
J = 0;
m = length(y);
errors = X * theta - y;
J = sum(errors .^ 2) / (2 * m);
endfunction
Try alpha = 0.0001 and num_iter = 400000. This will solve your problem!
Now, the problem with your code is that the learning rate is way too less which is slowing down the convergence. Also, you are not giving it enough time to converge by limiting the training iterations to 4000 only which is very less given the learning rate.
Summarising, the problem is: less learning rate + less iterations.

Which is the correct implementation of regularization in octave?

I'm currently taking Andrew Ng's machine learning course and I try implementing the stuff as I learn so as not to forget them, I just finished regularization (chapter 7). I know that theta 0 is updated normally, separate from other parameters, however, I am not sure which of these is the correct implementation.
Implementation 1: in my gradient function, after computing the regularization vector, change theta 0 part to 0 so when it is added to the total, it is as if theta 0 was never regularized.
Implementation 2: store theta in a temp variable: _theta, update it with a reg_step of 0 (so it's as if there's no regularization), store the new theta 0 in a temp variable: t1, then update the original theta value with my desired reg_step and replace theta 0 with t1 (value from non-regularized update).
below is my code for the first implementation, it's not meant to be advanced, I'm just practicing:
I'm using octave which is 1-index, so theta(1) is theta(0)
function ret = gradient(X,Y,theta,reg_step),
H = theta' * X;
dif = H-Y;
mul = dif .* X;
total = sum(mul,2);
m=(size(Y)(1,1));
regular = (reg_step/m)*theta;
regular(1)=0;
ret = (total/m)+regular,
endfunction
Thanks in advance.
A slight tweak to the first implementation worked for me.
First, calculate regularization for every theta. Then go on to perform gradient step and later you can change the first entry of the matrix containing gradients manually to ignore regularization for theta_0.
% Calculate regularization
regularization = (reg_step / m) * theta;
% Gradient Step
gradients = (1 / m) * (X' * (predictions - y)) + regularization;
% Ignore regularization in theta_0
gradients(1) = (1 / m) * (X(:, 1)' * (predictions - y));

Not getting accurate result when using normalized data for gradient descent

I am currently on week 2 of Andrew NG's Machine Learning course on Coursera, and I came across an issue that I cannot sort out.
Based on a data set, where the first column is the house size, the second the number of bedrooms in it, and the third column is the price of it, I need to use linear regression and gradient descent after normalizing the data to predict new house prices.
However, I am getting a gigantic number for my prediction and I cannot find where is the error on my calculations.
I am using the following:
alpha = 0.03;
num_iters = 400;
Code to normalize the features (X is the data set matrix):
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
for i = 1:size(X, 2);
mu(1, i) = mean(X(:, i)), % Getting the mean of each row.
sigma(1, i) = std(X(:, i)), % Getting the standard deviation of each row.
for j = 1:size(X, 1);
X_norm(j, i) = (X(j, i) .- mu(1, i)) ./ sigma(1, i);
end;
end;
Code to calculate current cost:
m = length(y);
J = 0;
predictions = X * theta;
sqErrors = (predictions - y).^2;
J = (1/(2*m)) * sum(sqErrors);
Code to calculate gradient descent:
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
% Getting the predictions for our firstly chosen theta values.
predictions = X * theta;
% Getting the error difference of the hypothesis(h(x)) and real results(y).
diff = predictions - y;
% Getting the number of features.
features_num = size(X, 2);
% Applying gradient descent for each feature.
for i = 1:features_num;
theta(i, 1) = theta(i, 1) - (alpha / m) * sum(diff .* X(:, i))
end;
% Saving the cost J in every iteration
J_history(iter) = computeCostMulti(X, y, theta);
The resulting price I am getting when predicting a house with 1650 squared feet and 3 bedrooms:
182329818.366117

In case of logistic regression, how should I interpret this learning curve between cost and number of examples?

I have obtained the following learning curve on plotting the learning curves for training and cross validation sets between the error cost, and number of training examples (in 100s in the graph). Can someone please tell me if this learning curve is ever possible? Because I am of the impression that the Cross validation error should decrease as the number of training examples increase.
Learning Curve. Note that the x axis denotes the number of training examples in 100s.
EDIT :
This is the code which I use to calculate the 9 values for plotting the learning curves.
X is the 2D matrix of the training set examples. It is of dimensions m x (n+1). y is of dimensions m x 1, and each element has value 1 or 0.
for j=1:9
disp(j)
[theta,J] = trainClassifier(X(1:(j*100),:),y(1:(j*100)),lambda);
[error_train(j), grad] = costprediciton_train(theta , X(1:(j*100),:), y(1:(j*100)));
[error_cv(j), grad] = costfunction_test2(theta , Xcv(1:(j*100),:),ycv(1:(j*100)));
end
The code I use for finding the optimal value of Theta from the training set.
% Train the classifer. Return theta
function [optTheta, J] = trainClassifier(X,y,lambda)
[m,n]=size(X);
initialTheta = zeros(n, 1);
options=optimset('GradObj','on','MaxIter',100);
[optTheta, J, Exit_flag ] = fminunc(#(t)(regularizedCostFunction(t, X, y, lambda)), initialTheta, options);
end
%regularized cost
function [J, grad] = regularizedCostFunction(theta, X, y,lambda)
[m,n]=size(X);
h=sigmoid( X * theta);
temp1 = -1 * (y .* log(h));
temp2 = (1 - y) .* log(1 - h);
thetaT = theta;
thetaT(1) = 0;
correction = sum(thetaT .^ 2) * (lambda / (2 * m));
J = sum(temp1 - temp2) / m + correction;
grad = (X' * (h - y)) * (1/m) + thetaT * (lambda / m);
end
The code I use for calculating the error cost for prediction of results for training set: (similar is the code for error cost of CV set)
Theta is of dimensions (n+1) x 1 and consists of the coefficients of the features in the hypothesis function.
function [J,grad] = costprediciton_train(theta , X, y)
[m,n]=size(X);
h=sigmoid(X * theta);
temp1 = y .* log(h);
temp2 = (1-y) .* log(1- h);
J = -sum (temp1 + temp2)/m;
t=h-y;
grad=(X'*t)*(1/m);
end
function [J,grad] = costfunction_test2(theta , X, y)
m= length(y);
h=sigmoid(X*theta);
temp1 = y .* log(h);
temp2 = (1-y) .* log(1- h);
J = -sum (temp1 + temp2)/m ;
grad = (X' * (h - y)) * (1/m) ;
end
The Sigmoid function:
function g = sigmoid(z)
g= zeros(size(z));
den=1 + exp(-1*z);
g = 1 ./ den;
end

gradient descent seems to fail

I implemented a gradient descent algorithm to minimize a cost function in order to gain a hypothesis for determining whether an image has a good quality. I did that in Octave. The idea is somehow based on the algorithm from the machine learning class by Andrew Ng
Therefore I have 880 values "y" that contains values from 0.5 to ~12. And I have 880 values from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail, after some iterations the value for theta is so small, that theta0 and theta1 become "NaN". And my linear regression curve has strange values...
here is the code for the gradient descent algorithm:
(theta = zeros(2, 1);, alpha= 0.01, iterations=1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
tmp_j1=0;
for i=1:m,
tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
end
tmp_j2=0;
for i=1:m,
tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
end
tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1))
tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2))
theta(1,1)=tmp1
theta(2,1)=tmp2
% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCost(X, y, theta);
end
end
And here is the computation for the costfunction:
function J = computeCost(X, y, theta) %
m = length(y); % number of training examples
J = 0;
tmp=0;
for i=1:m,
tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; %differenzberechnung
end
J= (1/(2*m)) * tmp
end
If you are wondering how the seemingly complex looking for loop can be vectorized and cramped into a single one line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation for how we arrive at this vectorized expression using gradient descent algorithm:
This is the gradient descent algorithm to fine tune the value of θ:
Assume that the following values of X, y and θ are given:
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features+1)
X = m x n matrix
y = m x 1 vector matrix
θ = n x 1 vector matrix
xi is the ith training example
xj is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)
whole objective of machine learning is to minimize Errors in predictions. Based on the above corollary, our Errors matrix is m x 1 vector matrix as follows:
To calculate new value of θj, we have to get a summation of all errors (m rows) multiplied by jth feature value of the training set X. That is, take all the values in E, individually multiply them with jth feature of the corresponding training example, and add them all together. This will help us in getting the new (and hopefully better) value of θj. Repeat this process for all j or the number of features. In matrix form, this can be written as:
This can be simplified as:
[E]' x [X] will give us a row vector matrix, since E' is 1 x m matrix and X is m x n matrix. But we are interested in getting a column matrix, hence we transpose the resultant matrix.
More succinctly, it can be written as:
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
i vectorized the theta thing...
may could help somebody
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended NG's class last year and I have the following implementation (vectorized):
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize them.
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you are setting the temporary thetas (here called theta_1 and theta_2) correctly back to the "real" theta.
Generally it is more useful to vectorize instead of loops, it is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
While not scalable like a vectorized version, a loop-based computation of a gradient descent should generate the same results. In the example above, the most probably case of the gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a set of data similar with the one described in the question, theta ends up with NaN values just after a few iterations if alpha = 0.01. However, when set as alpha = 0.000001, the gradient descent works as expected, even after 100 iterations.
Using only vectors here is the compact implementation of LR with Gradient Descent in Mathematica:
Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
Theta = Theta -
alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
Jhist[[k]] =
Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course one assumes that X is a n * 2 matrix, with X[[,1]] containing only 1s'
This should work:-
theta(1,1) = theta(1,1) - (alpha*(1/m))*((X*theta - y)'* X(:,1) );
theta(2,1) = theta(2,1) - (alpha*(1/m))*((X*theta - y)'* X(:,2) );
its cleaner this way, and vectorized also
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first Pdf file for Gradient Descent form machine Learning course, you would take care of learning rate. Here is the note from the mentioned pdf.
Implementation Note: If your learning rate is too large, J(theta) can di-
verge and blow up', resulting in values which are too large for computer
calculations. In these situations, Octave/MATLAB will tend to return
NaNs. NaN stands fornot a number' and is often caused by undened
operations that involve - infinity and +infinity.

Resources