Logistic Regression using Gradient Descent with OCTAVE - machine-learning

I've gone through a few of Professor Andrew Ng's machine learning lectures and viewed the transcript for Logistic Regression using Newton's method. However, when implementing logistic regression using gradient descent I run into an issue.
The graph generated is not convex.
My code goes as follows:
I am using the vectorized implementation of the equation.
%1. Load the data files from disk into Octave's memory
x=load('ex4x.dat');
y=load('ex4y.dat');
%2. Now we want to add a column x0 with all rows set to the value 1 to the matrix.
%First take the number of training examples
m=length(y);
x=[ones(m,1),x];
alpha=0.1;
max_iter=100;
g=inline('1.0 ./ (1.0 + exp(-z))');
theta = zeros(size(x(1,:)))'; % theta has to be a 3x1 vector so that it can be multiplied by x, which is an m x 3 matrix
j=zeros(max_iter,1); % j is a zero vector used to store the cost function J(theta) at each iteration
for num_iter=1:max_iter
    % Calculate the hypothesis h(x). It is computed inside the loop because it has to be
    % recalculated with the new theta on every iteration.
    z=x*theta;
    h=g(z); % the inline sigmoid function defined above is applied here
    j(num_iter)=(1/m)*(-y'* log(h) - (1 - y)'*log(1-h)) ; % vectorized form of the cost function J(theta)
    j
    grad=(1/m) * x' * (h-y); % vectorized gradient of the cost function with respect to theta
    theta=theta - alpha .* grad; % gradient descent update for theta
    theta
end
The code per se doesn't give any error, but it does not produce a proper convex graph.
I'd be glad if anybody could point out the mistake or share insight into what's causing the problem.
Thanks.

Two things you need to look into:
Machine learning involves learning patterns from data. If your files ex4x.dat and ex4y.dat are randomly generated, they won't have patterns that you can learn.
You have used variables like g, h, i, j, which makes debugging difficult. Since it's a very small program, it might be a better idea to rewrite it.
Here's my code that gives the convex plot
clc; clear; close all;
load q1x.dat;
load q1y.dat;
X = [ones(size(q1x, 1),1) q1x];
Y = q1y;
m = size(X,1);
n = size(X,2)-1;
%initialize
theta = zeros(n+1,1);
thetaold = ones(n+1,1);
while ( ((theta-thetaold)'*(theta-thetaold)) > 0.0000001 )
    %calculate dellltheta (the gradient of the log-likelihood)
    dellltheta = zeros(n+1,1);
    for j=1:n+1,
        for i=1:m,
            dellltheta(j,1) = dellltheta(j,1) + (Y(i,1) - (1/(1 + exp(-theta'*X(i,:)'))))*X(i,j);
        end;
    end;
    %calculate the Hessian
    H = zeros(n+1, n+1);
    for j=1:n+1,
        for k=1:n+1,
            for i=1:m,
                H(j,k) = H(j,k) - (1/(1 + exp(-theta'*X(i,:)')))*(1-(1/(1 + exp(-theta'*X(i,:)'))))*X(i,j)*X(i,k);
            end;
        end;
    end;
    thetaold = theta;
    theta = theta - inv(H)*dellltheta;
    (theta-thetaold)'*(theta-thetaold)
end
I get the following values of error after iterations:
2.8553
0.6596
0.1532
0.0057
5.9152e-06
6.1469e-12
Which, when plotted against the iteration count, shows the error dropping steeply towards zero.
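One way to collect and plot these values would be to store them in a vector as the loop runs; a small sketch (err_history is an illustrative name, not part of the script above):
err_history = [];                 % before the while loop
% inside the loop, in place of the bare expression on its last line:
err_history(end+1) = (theta - thetaold)' * (theta - thetaold);
% after the loop:
semilogy(err_history); xlabel('iteration'); ylabel('squared change in theta');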

Related

Backpropagation (Coursera ML by Andrew Ng) gradient descent clarification

Question
Please forgive me for asking a Coursera ML course specific question; I hope someone who did the course can answer.
In the Coursera ML Week 4 "Multi-class Classification and Neural Networks" assignment, why is the weight (theta) gradient adding (plus) the derivative instead of subtracting it?
% Calculate the gradients of Theta2
% Derivative of the loss function J = L(Z): dJ/dZ = (oi - yi) / (oi * (1 - oi))
% Derivative of the sigmoid activation function: dZ/dY = oi * (1 - oi)
delta_theta2 = oi - yi; % <--- (dJ/dZ) * (dZ/dY)
# Using +/plus NOT -/minus
Theta2_grad = Theta2_grad + ...          % <-------- Why plus(+)?
    bsxfun(@times, hi, transpose(delta_theta2));
Code Excerpt
for i = 1:m
    % i is the training set index of X (including bias). X(i, :) is 401 data points.
    xi = X(i, :);
    yi = Y(i, :);
    % hi is the i-th output of the hidden layer. H(i, :) is 26 data points.
    hi = H(i, :);
    % oi is the i-th output layer. O(i, :) is 10 data points.
    oi = O(i, :);
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta2
    %------------------------------------------------------------------------
    delta_theta2 = oi - yi;
    Theta2_grad = Theta2_grad + bsxfun(@times, hi, transpose(delta_theta2));
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta1
    %------------------------------------------------------------------------
    % Derivative of g(z): g'(z) = g(z)(1 - g(z)) where g(z) is sigmoid(H_NET).
    dgz = (hi .* (1 - hi));
    delta_theta1 = dgz .* sum(bsxfun(@times, Theta2, transpose(delta_theta2)));
    % There is no input into H0, hence there is no theta for H0. Remove H0.
    delta_theta1 = delta_theta1(2:end);
    Theta1_grad = Theta1_grad + bsxfun(@times, xi, transpose(delta_theta1));
end
I thought it should be subtracting the derivative.
Since the gradients are calculated by averaging across all training examples, we first "accumulate" them by summing the per-example gradient while looping over the training set. So the line you highlighted with the plus is not the gradient update step (notice that alpha is not there either); the actual update is most likely somewhere else, outside of the loop from 1 to m.
Also, I am not sure when you will learn about this (I'm sure it's somewhere in the course), but you could also vectorize the code :)
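For example, a rough vectorized equivalent of the Theta2 accumulation (assuming O, Y and H are the full m-by-10 output, m-by-10 label and m-by-26 hidden-activation matrices that the loop indexes into) could look like:
Delta2 = O - Y;               % one delta_theta2 per row, m x 10
Theta2_grad = Delta2' * H;    % 10 x 26; sums all the per-example outer products in one step
% The averaging over m and the actual update with alpha would still happen afterwards.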

"Z" Variable is undefined when used to represent a matrix for sigmoid function

I'm a high school student and I just started getting into machine learning to further my knowledge of coding. I tried out Octave and have been working with neural networks, or at least trying to. In my first program, however, I already found myself at an impasse with my sigmoid gradient function: when I try to make the function work for each value within a matrix, I have no idea how to do so. I tried placing z as the parameter of the function, but it says that "z" itself is undefined. I have no knowledge of C or C++, and I'm still an amateur in this area, so sorry if I take some time to understand. Thanks to anyone who offers to help!
I'm running Octave 4.4.1, and I haven't tried any other solution yet, as I don't really have any.
% Main Code
g = sigGrad([-2 -1 0 1 2]);
% g is supposed to be my sigmoid gradient for each value of theta, i.e. the matrix passed as its parameter.
% Sigmoid Gradient function
function g = sigGrad(z)
g = zeros(size(z));
% This is where the code tells me that z is undefined
g = sigmoid(z).*(1.-sigmoid(z));
% I began by initializing a matrix of zeroes with the size of z
% It should later do the Gradient Equation, but it marks z as undefined before that
% Sigmoid function
g = sigmoid(z)
g = 1.0 ./ (1.0 + exp(-z));
From what I see, you are making simple syntax mistakes; I'd recommend getting a grasp of Octave basics before diving into the code head on. That being said, you have to declare your functions with the proper syntax and use them as shown below:
function g = sigmoid(z)
% SIGMOID Compute sigmoid function
% J = SIGMOID(z) computes the sigmoid of z.
g = 1.0 ./ (1.0 + exp(-z));
end
And the other piece of code should be
function g = sigGrad(z)
% sigGrad returns the gradient of the sigmoid function evaluated at z
% g = sigGrad(z) computes the gradient of the sigmoid function evaluated at z.
% This should work regardless if z is a matrix or a vector.
% In particular, if z is a vector or matrix, you should return the gradient for each element.
g = zeros(size(z));
g = sigmoid(z).*(1 - sigmoid(z));
end
And then finally call the above implemented functions using:
g = sigGrad([1 -0.5 0 0.5 1]);
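With those definitions in place, the call above should return roughly [0.1966 0.2350 0.2500 0.2350 0.1966], since the sigmoid gradient peaks at 0.25 for z = 0 and falls off symmetrically on either side.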

Gradient descent on linear regression not converging

I have implemented a very simple linear regression with gradient descent in JavaScript, but after consulting multiple sources and trying several things, I cannot get it to converge.
The data is absolutely linear: it's just the numbers 0 to 30 as inputs, with x * 3 as the correct outputs to learn.
This is the logic behind the gradient descent:
train(input, output) {
  const predictedOutput = this.predict(input);
  const delta = output - predictedOutput;
  this.m += this.learningRate * delta * input;
  this.b += this.learningRate * delta;
}

predict(x) {
  return x * this.m + this.b;
}
I took the formulas from different places, including:
Exercises from Udacity's Deep Learning Foundations Nanodegree
Andrew Ng's course on Gradient Descent for Linear Regression (also here)
Stanford's CS229 Lecture Notes
some other PDF slides I found from Carnegie Mellon
I have already tried:
normalizing input and output values to the [-1, 1] range
normalizing input and output values to the [0, 1] range
normalizing input and output values to have mean = 0 and stddev = 1
reducing the learning rate (1e-7 is as low as I went)
having a linear data set with no bias at all (y = x * 3)
having a linear data set with non-zero bias (y = x * 3 + 2)
initializing the weights with random non-zero values between -1 and 1
Still, the weights (this.b and this.m) do not approach any of the data values, and they diverge to infinity.
I'm obviously doing something wrong, but I cannot figure out what it is.
Update: Here's a little bit more context that may help figure out what my problem is exactly:
I'm trying to model a simple approximation to a linear function, with online learning by a linear regression pseudo-neuron. With that, my parameters are:
weights: [this.m, this.b]
inputs: [x, 1]
activation function: identity function z(x) = x
As such, my net will be expressed by y = this.m * x + this.b * 1, simulating the data-driven function that I want to approximate (y = 3 * x).
What I want is for my network to "learn" the parameters this.m = 3 and this.b = 0, but it seems I get stuck at a local minimum.
My error function is the mean-squared error:
error(allInputs, allOutputs) {
  let error = 0;
  for (let i = 0; i < allInputs.length; i++) {
    const x = allInputs[i];
    const y = allOutputs[i];
    const predictedOutput = this.predict(x);
    const delta = y - predictedOutput;
    error += delta * delta;
  }
  return error / allInputs.length;
}
My logic for updating my weights will be (according to the sources I've checked so far) wi -= alpha * dError/dwi
For the sake of simplicity, I'll call my weights this.m and this.b, so we can relate it back to my JavaScript code. I'll also call y^ the predicted value.
From here:
error = y - y^
= y - this.m * x + this.b
dError/dm = -x
dError/db = 1
And so, applying that to the weight correction logic:
this.m += alpha * x
this.b -= alpha * 1
But this doesn't seem correct at all.
I finally found what's wrong, and I'm answering my own question in hopes it will help beginners in this area too.
First, as Sascha said, I had some theoretical misunderstandings. It may be correct that your adjustment includes the input value verbatim, but as he said, it should already be part of the gradient. This all depends on your choice of the error function.
Your error function is the measure of how far off you were from the real value, and that measurement needs to be consistent. I was using mean squared error as the measurement tool (as you can see in my error method), but I was using a pure absolute error (y^ - y) inside the training method. Your gradient depends on this choice of error function, so choose only one and stick with it.
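For reference, a consistent mean-squared-error gradient for the single-example update in train() looks roughly like this (using the question's names this.m and this.b, and writing y^ for the prediction):
error = (y - y^)^2, where y^ = this.m * x + this.b
dError/dm = -2 * (y - y^) * x
dError/db = -2 * (y - y^)
so the updates m -= alpha * dError/dm and b -= alpha * dError/db become
this.m += alpha * 2 * (y - y^) * x
this.b += alpha * 2 * (y - y^)
which, up to the constant factor of 2 (which can be absorbed into the learning rate), matches what the original train() method already does.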
Second, simplify your assumptions in order to test what's wrong. In this case, I had a very good idea what the function to approximate was (y = x * 3) so I manually set the weights (this.b and this.m) to the right values and I still saw the error diverge. This means that weight initialization was not the problem in this case.
After searching some more, I found my error was somewhere else: the function that was feeding data into the network was mistakenly passing a hardcoded value of 3 as the predicted output (it was using a wrong index into an array). So the oscillation I saw was the network trying to approximate y = 0 * x + 3 (this.b = 3 and this.m = 0), but because of the small learning rate and the error in the error-function derivative, this.b wasn't going to get near the right value, which made this.m take wild jumps to adjust to it.
Finally, keep track of the error measurement as your network trains, so you can have some insight into what's going on. This helps a lot in telling apart simple overfitting, an overly large learning rate, and plain simple mistakes.

Vectorized gradient descent basics

I'm implementing simple gradient descent in Octave but it's not working. Here is the data I'm using:
X = [1 2 3
1 4 5
1 6 7]
y = [10
11
12]
theta = [0
0
0]
alpha = 0.001 and itr = 50
This is my gradient descent implementation:
function theta = Gradient(X,y,theta,alpha,itr)
  m = length(y);
  for i = 1:itr,
    th1 = theta(1) - alpha * (1/m) * sum((X * theta - y) .* X(:, 1));
    th2 = theta(2) - alpha * (1/m) * sum((X * theta - y) .* X(:, 2));
    th3 = theta(3) - alpha * (1/m) * sum((X * theta - y) .* X(:, 3));
    theta(1) = th1;
    theta(2) = th2;
    theta(3) = th3;
  end
end
Questions are:
It produces some values of theta, which I then use in theta * [1 2 3] and expect an output of about 10 (from y). Is that the correct way to test the hypothesis? [h(x) = theta' * x]
How can I determine how many times it should iterate? If I give it 1500 iterations, theta becomes extremely small (in e-notation).
If I use double-digit numbers in X, theta becomes too small again, even with fewer than 5 iterations.
I've been struggling with these things for a long time now. Unable to resolve it myself.
Sorry for bad formatting.
Your batch gradient descent implementation seems perfectly fine to me. Can you be more specific about the error you are facing? That said, regarding your question "Is that the correct way to test the hypothesis? [h(x) = theta' * x]":
Based on the dimensions of your test set here, you should test it as h(x) = X*theta.
For your second question, the number of iterations depends on the data set provided. To decide on a good number of iterations, plot your cost function against the number of iterations; as the iterations increase, the value of the cost function should decrease, and from that curve you can decide how many iterations you need. You might also consider increasing the value of alpha in steps of 0.001, 0.003, 0.01, 0.03, 0.1, ... to find the best possible alpha value.
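For example, a minimal vectorized sketch (the function and variable names here are illustrative) that records the cost at every iteration so it can be plotted:
function [theta, J_history] = GradientVectorized(X, y, theta, alpha, itr)
  m = length(y);
  J_history = zeros(itr, 1);
  for i = 1:itr
    theta = theta - (alpha / m) * X' * (X * theta - y);        % update all thetas at once
    J_history(i) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % squared-error cost after the update
  end
end
Plotting J_history against 1:itr then shows whether the chosen alpha makes the cost decrease.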
For your third question, I guess you are directly trying to model the data you have in this question. This data set is very small; it contains just 3 training examples. You might be trying to implement the linear regression algorithm, and for that you need a proper training set that contains enough data to train your model. Then you can test your model with your test data.
Refer to Andrew Ng's Machine Learning course on www.coursera.org. You will find more information there.

Perceptron training - delta rule

according to Wikipedia, with the delta rule we adjust the weight by:
dw = alpha * (ti - yi) * g'(hj) * xi
where alpha is the learning constant, ti the true answer, yi the perceptron's guess, g' the derivative of the activation function g with respect to the weighted sum of the perceptron's inputs, and xi the input.
The part that I don't understand in this formula is the multiplication by the derivative g'. Let g = sign(x) (the sign of the weighted sum); then g' is always 0, so dw = 0. However, in code examples I've seen on the internet, the writers just omitted the g' and used the formula:
dw = alpha * (ti - yi) * (hj) * xi
I will be glad to read a proper explanation!
thank you in advance.
You're correct that if you use a step function for your activation function g, the gradient is always zero (except at 0), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn't work well with gradient descent. :)
For a linear perceptron, you'd have g'(x) = 1, for dw = alpha * (t_i - y_i) * x_i.
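As a minimal sketch (in Octave, with made-up values) of one such update for a linear unit:
alpha = 0.1;                  % learning constant
x = [1; 0.5; -0.2];           % input x_i
w = zeros(3, 1);              % weights
t = 1;                        % true answer t_i
y = w' * x;                   % linear unit: g is the identity, so g'(h_j) = 1
w = w + alpha * (t - y) * x;  % dw = alpha * (t_i - y_i) * x_i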
You've seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what's going on here: apparently g'(h_j) = h_j, which, remembering our calculus, means we must have g(x) = e^x + constant. So apparently the code sample you found uses an exponential activation function.
This must mean that the neuron outputs are constrained to be on (0, infinity) (or I guess (a, infinity) for any finite a, for g(x) = e^x + a). I haven't run into this before, but I see some references online. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).
