Octave Logistic regression cost function error - machine-learning

I am doing Andrew Ng's ML course on Coursera. Week3 logistic regression cost function using Octave is giving me some errors. I think it's because of incorrect matrix multiplication. Can someone point out my mistakes please?
Data file for training data is located in File ex2Data1.txt which is available from here https://upscfever.com/upsc-fever/en/data/images/ex2.zip
data = load('ex2data1.txt');
X = data(:, [1, 2]); y = data(:, 3);
[m, n] = size(X);
% Add intercept term to x and X_test
X = [ones(m, 1) X];
% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);
% Compute and display initial cost and gradient
[cost, grad] = costFunction(initial_theta, X, y);
Code for my costFunction is as follows;
function [J, grad] = costFunction(theta, X, y)
% Initialize some useful values
m = length(y); % number of training examples
J = 0;
grad = zeros(size(theta));
%calculate hofX --> sigmoid theta'*X
hfX=sigmoid(theta'*X');
%cost --> bring '-' outside
J=-(1/m)*(y'*(log(hfX))')+(1-y)'*(log(1-hfX))';
%gradiant
fifth=(hfX-y)';
grad=(1/m)*(X'*fifth);
end
Code for the sigmoid function is as follows;
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
g = zeros(size(z));
g = (1./(1+e.^(-1*z)));
end

Related

How to modify cost function algorithm in case of SGD in logistic regression algorithm

In my case I am implementing stochastic gradient descent to train a logistic classifier in a classical binary classification problem.
The algorithm basically is similar to that of GD with the exception of selecting one random observation at a time and iterating over the loss function. The code for this basic SGD is given as:
step_size = 0.01;
iter_max = 10000;
for iter = 1 : iter_max
r = randi([1 n]); % produces a random integer r between 1 and n.
[J,grad] = costfunction(theta,X(r,:),y(r));
w = w - step_size * grad;
end
The following code is for the cost function that I used for computing cost function based on Gradient Descent algorithm.
function [J, grad] = costfunction(theta, X, y)
m = length(y);
J = 0;
grad = zeros(size(theta));
sig = 1./(1 + (exp(-(X * theta))));
J = -(1/m)*sum(y.*log(sig) + (1-y).*log(1-sig));
grad = (sum((sig - y).*X))'/m;
end
However, my main suspect is that in SGD this cost function is producing error results. I suspect that this is due to the input of the costfunction now being a row-vector X(r,:) and a scalar y(r) instead of a matrix X and a vector y as in the case of gradient descent.
Question: How can I modify the costfunction code to make it produce correct values for J and grad in this case of SGD.

Linear Regression with Gradient Descent, Octave

So I am trying to solve the first programming exercise from Andrew Ng's ML Coursera course. I have a little bit of trouble implementing linear gradient descent in octave. The code below shows what I am trying to implement, per the equation posted in the picture, but I am getting a different value from the expected value. I'm not sure what I am missing, I'm hoping someone can parse through this.
%GRADIENTDESCENT Performs gradient descent to learn theta
% theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
% taking num_iters gradient steps with learning rate alpha
% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
theta0 = theta(1);
theta1 = theta(2);
temp0 = 0;
temp1 = 0;
errFunc = 0;
for iter = 1:num_iters
h = X * theta;
errFunc = h - y;
temp0 = temp0 + (alpha/m).*sum(errFunc'*X(:, 1));
temp1 = temp1 + (alpha/m).*sum(errFunc'*X(:, 2));
theta0 = theta0 - temp0;
theta1 = theta1 - temp1;
theta = [theta0; theta1];
% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCost(X, y, theta);
end
end
code
My expected results are [ -3.6303; 1.1664], but I am getting [-1.361798; 0.931592]. This is the equation I am working with. results

Gradient function not able to find optimal theta but normal equation does

I tried implementing my own linear regression model in octave with some sample data but the theta does not seem to be correct and does not match the one provided by the normal equation which gives the correct values of theta. But running my model(with different alpha and iterations) on the data from Andrew Ng's machine learning course gives the proper theta for the hypothesis. I have tweaked alpha and iterations so that the cost function decreases. This is the image of cost function against iterations.. As you can see the cost decreases and plateaus but not to a low enough cost. Can somebody help me understand why this is happening and what I can do to fix it?
Here is the data (The first column is the x values, and the second column is the y values):
20,48
40,55.1
60,56.3
80,61.2
100,68
Here is the graph of the data and the equations plotted by gradient descent(GD) and by the normal equation(NE).
Code for the main script:
clear , close all, clc;
%loading the data
data = load("data1.txt");
X = data(:,1);
y = data(:,2);
%Plotting the data
figure
plot(X,y, 'xr', 'markersize', 7);
xlabel("Mass in kg");
ylabel("Length in cm");
X = [ones(length(y),1), X];
theta = ones(2, 1);
alpha = 0.000001; num_iter = 4000;
%Running gradientDescent
[opt_theta, J_history] = gradientDescent(X, y, theta, alpha, num_iter);
%Running Normal equation
opt_theta_norm = pinv(X' * X) * X' * y;
%Plotting the hypothesis for GD and NE
hold on
plot(X(:,2), X * opt_theta);
plot(X(:,2), X * opt_theta_norm, 'g-.', "markersize",10);
legend("Data", "GD", "NE");
hold off
%Plotting values of previous J with each iteration
figure
plot(1:numel(J_history), J_history);
xlabel("iterations"); ylabel("J");
Function for finding gradientDescent:
function [theta, J_history] = gradientDescent (X, y, theta, alpha, num_iter)
m = length(y);
J_history = zeros(num_iter,1);
for iter = 1:num_iter
theta = theta - (alpha / m) * (X' * (X * theta - y));
J_history(iter) = computeCost(X, y, theta);
endfor
endfunction
Function for computing cost:
function J = computeCost (X, y, theta)
J = 0;
m = length(y);
errors = X * theta - y;
J = sum(errors .^ 2) / (2 * m);
endfunction
Try alpha = 0.0001 and num_iter = 400000. This will solve your problem!
Now, the problem with your code is that the learning rate is way too less which is slowing down the convergence. Also, you are not giving it enough time to converge by limiting the training iterations to 4000 only which is very less given the learning rate.
Summarising, the problem is: less learning rate + less iterations.

Not getting accurate result when using normalized data for gradient descent

I am currently on week 2 of Andrew NG's Machine Learning course on Coursera, and I came across an issue that I cannot sort out.
Based on a data set, where the first column is the house size, the second the number of bedrooms in it, and the third column is the price of it, I need to use linear regression and gradient descent after normalizing the data to predict new house prices.
However, I am getting a gigantic number for my prediction and I cannot find where is the error on my calculations.
I am using the following:
alpha = 0.03;
num_iters = 400;
Code to normalize the features (X is the data set matrix):
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
for i = 1:size(X, 2);
mu(1, i) = mean(X(:, i)), % Getting the mean of each row.
sigma(1, i) = std(X(:, i)), % Getting the standard deviation of each row.
for j = 1:size(X, 1);
X_norm(j, i) = (X(j, i) .- mu(1, i)) ./ sigma(1, i);
end;
end;
Code to calculate current cost:
m = length(y);
J = 0;
predictions = X * theta;
sqErrors = (predictions - y).^2;
J = (1/(2*m)) * sum(sqErrors);
Code to calculate gradient descent:
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
% Getting the predictions for our firstly chosen theta values.
predictions = X * theta;
% Getting the error difference of the hypothesis(h(x)) and real results(y).
diff = predictions - y;
% Getting the number of features.
features_num = size(X, 2);
% Applying gradient descent for each feature.
for i = 1:features_num;
theta(i, 1) = theta(i, 1) - (alpha / m) * sum(diff .* X(:, i))
end;
% Saving the cost J in every iteration
J_history(iter) = computeCostMulti(X, y, theta);
The resulting price I am getting when predicting a house with 1650 squared feet and 3 bedrooms:
182329818.366117

In case of logistic regression, how should I interpret this learning curve between cost and number of examples?

I have obtained the following learning curve on plotting the learning curves for training and cross validation sets between the error cost, and number of training examples (in 100s in the graph). Can someone please tell me if this learning curve is ever possible? Because I am of the impression that the Cross validation error should decrease as the number of training examples increase.
Learning Curve. Note that the x axis denotes the number of training examples in 100s.
EDIT :
This is the code which I use to calculate the 9 values for plotting the learning curves.
X is the 2D matrix of the training set examples. It is of dimensions m x (n+1). y is of dimensions m x 1, and each element has value 1 or 0.
for j=1:9
disp(j)
[theta,J] = trainClassifier(X(1:(j*100),:),y(1:(j*100)),lambda);
[error_train(j), grad] = costprediciton_train(theta , X(1:(j*100),:), y(1:(j*100)));
[error_cv(j), grad] = costfunction_test2(theta , Xcv(1:(j*100),:),ycv(1:(j*100)));
end
The code I use for finding the optimal value of Theta from the training set.
% Train the classifer. Return theta
function [optTheta, J] = trainClassifier(X,y,lambda)
[m,n]=size(X);
initialTheta = zeros(n, 1);
options=optimset('GradObj','on','MaxIter',100);
[optTheta, J, Exit_flag ] = fminunc(#(t)(regularizedCostFunction(t, X, y, lambda)), initialTheta, options);
end
%regularized cost
function [J, grad] = regularizedCostFunction(theta, X, y,lambda)
[m,n]=size(X);
h=sigmoid( X * theta);
temp1 = -1 * (y .* log(h));
temp2 = (1 - y) .* log(1 - h);
thetaT = theta;
thetaT(1) = 0;
correction = sum(thetaT .^ 2) * (lambda / (2 * m));
J = sum(temp1 - temp2) / m + correction;
grad = (X' * (h - y)) * (1/m) + thetaT * (lambda / m);
end
The code I use for calculating the error cost for prediction of results for training set: (similar is the code for error cost of CV set)
Theta is of dimensions (n+1) x 1 and consists of the coefficients of the features in the hypothesis function.
function [J,grad] = costprediciton_train(theta , X, y)
[m,n]=size(X);
h=sigmoid(X * theta);
temp1 = y .* log(h);
temp2 = (1-y) .* log(1- h);
J = -sum (temp1 + temp2)/m;
t=h-y;
grad=(X'*t)*(1/m);
end
function [J,grad] = costfunction_test2(theta , X, y)
m= length(y);
h=sigmoid(X*theta);
temp1 = y .* log(h);
temp2 = (1-y) .* log(1- h);
J = -sum (temp1 + temp2)/m ;
grad = (X' * (h - y)) * (1/m) ;
end
The Sigmoid function:
function g = sigmoid(z)
g= zeros(size(z));
den=1 + exp(-1*z);
g = 1 ./ den;
end

Resources