Implement Rectified Linear Unit (ReLU) in Octave - machine-learning

function g = relu(z)
  a = z > 0;    % logical mask: 1 where z is positive, 0 elsewhere
  g = z .* a;   % zero out the negative entries
end
z can be a scalar, vector, or matrix. So, is the above implementation correct, or is there a better way of implementing the Rectified Linear Unit (ReLU) in Octave?
Also, kindly say whether the derivative is correct:
function g = relugradient(z)
  g = (z >= 0);   % subgradient of ReLU: 1 for z >= 0, 0 otherwise
end

I would use
function r = relu (z)
  r = max (0, z);
endfunction
But your version should return the same. Try benchmarking both with big vectors and matrices...
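For instance, a minimal benchmark sketch (the matrix size is an arbitrary choice):
z = randn(2000);                 % 2000 x 2000 test matrix
tic; g1 = z .* (z > 0); toc      % mask-and-multiply version
tic; g2 = max(0, z); toc         % max-based version
assert(isequal(g1, g2));         % both should produce identical results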
The derivative is fine (g = z > 0; would be sufficient, since the value at exactly z = 0 is a matter of convention).
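One quick way to sanity-check the derivative numerically (a sketch; the sample points are arbitrary, and z = 0 is skipped because ReLU is not differentiable there):
z = linspace(-3, 3, 7)(:);       % a few sample points as a column vector
z(z == 0) = [];                  % drop 0: the kink has no derivative
h = 1e-6;
num_grad = (relu(z + h) - relu(z - h)) / (2*h);   % central differences
max(abs(num_grad - relugradient(z)))              % should be ~0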

Related

How to modify the cost function in the case of SGD in a logistic regression algorithm

In my case I am implementing stochastic gradient descent (SGD) to train a logistic classifier on a classical binary classification problem.
The algorithm is basically similar to that of GD, except that it selects one random observation at a time and iterates over the loss function. The code for this basic SGD is given as:
step_size = 0.01;
iter_max = 10000;
for iter = 1:iter_max
  r = randi([1 n]);   % produces a random integer r between 1 and n
  [J, grad] = costfunction(theta, X(r,:), y(r));
  theta = theta - step_size * grad;   % update the parameters
end
The following code is the cost function that I used for the gradient descent algorithm:
function [J, grad] = costfunction(theta, X, y)
  m = length(y);
  J = 0;
  grad = zeros(size(theta));
  sig = 1 ./ (1 + exp(-(X * theta)));   % sigmoid hypothesis
  J = -(1/m) * sum(y .* log(sig) + (1 - y) .* log(1 - sig));
  grad = (sum((sig - y) .* X))' / m;
end
However, my main suspicion is that in SGD this cost function produces erroneous results. I suspect this is because the inputs to costfunction are now a row vector X(r,:) and a scalar y(r), instead of a matrix X and a vector y as in the case of gradient descent.
Question: How can I modify the costfunction code so that it produces correct values for J and grad in this case of SGD?
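One plausible fix, sketched here rather than taken from an accepted answer: with a single sample, sum over the 1 x n row vector (sig - y) .* X collapses to a scalar, so grad loses its shape. Writing the gradient as a matrix product keeps it n x 1 for any number of rows, including one:
function [J, grad] = costfunction(theta, X, y)
  m = size(X, 1);                       % row count works for m = 1 too
  sig = 1 ./ (1 + exp(-(X * theta)));   % m x 1 predictions
  J = -(1/m) * sum(y .* log(sig) + (1 - y) .* log(1 - sig));
  grad = (X' * (sig - y)) / m;          % always n x 1
end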

Octave Logistic regression cost function error

I am doing Andrew Ng's ML course on Coursera. The Week 3 logistic regression cost function in Octave is giving me some errors. I think it's because of incorrect matrix multiplication. Can someone point out my mistakes, please?
The training data is in the file ex2data1.txt, available here: https://upscfever.com/upsc-fever/en/data/images/ex2.zip
data = load('ex2data1.txt');
X = data(:, [1, 2]);
y = data(:, 3);
[m, n] = size(X);

% Add intercept term to X and X_test
X = [ones(m, 1) X];

% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);

% Compute and display initial cost and gradient
[cost, grad] = costFunction(initial_theta, X, y);
The code for my costFunction is as follows:
function [J, grad] = costFunction(theta, X, y)
  % Initialize some useful values
  m = length(y);   % number of training examples
  J = 0;
  grad = zeros(size(theta));
  % calculate h(x) --> sigmoid of theta' * X'
  hfX = sigmoid(theta' * X');
  % cost --> bring '-' outside
  J = -(1/m) * (y' * (log(hfX))') + (1 - y)' * (log(1 - hfX))';
  % gradient
  fifth = (hfX - y)';
  grad = (1/m) * (X' * fifth);
end
The code for the sigmoid function is as follows:
function g = sigmoid(z)
  %SIGMOID Compute sigmoid function
  g = zeros(size(z));
  g = 1 ./ (1 + e.^(-z));
end
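Two likely mistakes, with a sketch of a fix (my reading of the code above, not an accepted answer): first, in the J line the -(1/m) factor only multiplies the first term because of the parentheses, so the second term enters with the wrong sign and scale; second, hfX is a 1 x m row vector while y is m x 1, so hfX - y broadcasts to an m x m matrix. Computing the hypothesis as a column vector avoids both:
function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);   % m x 1 column of predictions
  % -(1/m) now scales BOTH log-likelihood terms
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
  grad = (1/m) * (X' * (h - y));   % (n+1) x 1 gradient
end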

Gradient descent not working without normalization, why?

My question is based on data from the Coursera course https://www.coursera.org/learn/machine-learning/, but after a search it appears to be a common problem.
Gradient descent works perfectly on normalized data (pic. 1), but goes in the wrong direction on the original data (pic. 2), with J (the cost function) growing very quickly toward infinity. The difference between the parameter values is about 10^3.
I thought that normalization was only required for better execution speed, and I really can't see a reason for this growth in the cost function, even after a lot of searching. Decreasing alpha, e.g. to 0.001 or 0.0001, doesn't help either.
Please post if you have any ideas!
P.S. I manually provided the matrices to the functions, where X_buf is the normalized version and X_basic the original; Y is the vector of all examples, Q the theta vector, and alpha the learning rate.
function [theta, J_history] = gradientDescentMulti(X, Y, theta, alpha, num_iters)
  m = length(Y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - (alpha/m) * X' * (X*theta - Y);   % vectorized update
    J_history(iter) = computeCostMulti(X, Y, theta);
  end
end
And the second function:
function J = computeCostMulti(X, Y, theta)
  m = length(Y);   % number of training examples
  J = (1/(2*m)) * (X*theta - Y)' * (X*theta - Y);
end
Screenshots: pic. 1 shows the cost on normalized data; pic. 2 shows the cost diverging on the original data.
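For reference, a hedged sketch of the mean normalization taught in the course, assuming X_basic holds the raw feature columns without the intercept; rescaling the features to comparable ranges is what lets a single alpha work for every parameter:
mu = mean(X_basic);                   % 1 x n row of column means
sigma = std(X_basic);                 % 1 x n row of column standard deviations
X_buf = (X_basic - mu) ./ sigma;      % broadcasting: zero mean, unit variance
X = [ones(rows(X_basic), 1), X_buf];  % prepend the intercept column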

Represent Linear Regression features in Gradient Descent numerically

The following piece of Python code works well for performing gradient descent:
import numpy as np

def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient
    return theta
Here, x is an m x n feature matrix (m = number of samples, n = total number of features).
However, if my features are non-numerical (say, director and genre) for 2 movies, then my feature matrix may look like:
['Peter Jackson', 'Action'
 'Sergio Leone', 'Comedy']
In such a case, how can I map these features to numerical values and apply gradient descent?
You can map your features to numerical values of your choice and then apply gradient descent the usual way.
In Python you can use pandas to do that easily:
import pandas as pd

df = pd.DataFrame(X, columns=['director', 'genre'])
df.director = df.director.map({'Peter Jackson': 0, 'Sergio Leone': 1})
df.genre = df.genre.map({'Action': 0, 'Comedy': 1})
As you can see, hand-written mappings can become pretty tedious, and it might be better to write a piece of code that builds them dynamically (for instance, pandas.factorize assigns an integer code to each distinct value of a column).

gradient descent seems to fail

I implemented a gradient descent algorithm to minimize a cost function, in order to obtain a hypothesis for determining whether an image has good quality. I did that in Octave. The idea is somewhat based on the algorithm from the machine learning class by Andrew Ng.
I have 880 values "y", containing values from 0.5 to ~12, and 880 values from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail: after some iterations the value for theta gets so small that theta0 and theta1 become "NaN", and my linear regression curve has strange values...
here is the code for the gradient descent algorithm:
(theta = zeros(2, 1), alpha = 0.01, iterations = 1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);   % number of training examples
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    % accumulate the gradient term for theta(1) (the intercept)
    tmp_j1 = 0;
    for i = 1:m
      tmp_j1 = tmp_j1 + ((theta(1,1) + theta(2,1)*X(i,2)) - y(i));
    end
    % accumulate the gradient term for theta(2) (the slope)
    tmp_j2 = 0;
    for i = 1:m
      tmp_j2 = tmp_j2 + (((theta(1,1) + theta(2,1)*X(i,2)) - y(i)) * X(i,2));
    end
    % simultaneous update via temporaries
    tmp1 = theta(1,1) - (alpha * ((1/m) * tmp_j1));
    tmp2 = theta(2,1) - (alpha * ((1/m) * tmp_j2));
    theta(1,1) = tmp1;
    theta(2,1) = tmp2;
    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
  end
end
And here is the computation of the cost function:
function J = computeCost(X, y, theta)
  m = length(y);   % number of training examples
  J = 0;
  tmp = 0;
  for i = 1:m
    tmp = tmp + (theta(1,1) + theta(2,1)*X(i,2) - y(i))^2;   % squared difference for example i
  end
  J = (1/(2*m)) * tmp;
end
If you are wondering how the seemingly complex-looking for loop can be vectorized and compressed into a single one-line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation of how we arrive at this vectorized expression using the gradient descent algorithm.
This is the gradient descent update rule for fine-tuning the value of θ (applied simultaneously for every j):
θ_j := θ_j - (α/m) * Σ_{i=1..m} (h(x^i) - y^i) * x_j^i
Assume that the following values of X, y, and θ are given:
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features + 1)
X = m x n matrix
y = m x 1 column vector
θ = n x 1 column vector
x^i is the ith training example
x_j is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (the m x 1 vector of predicted values for our training set)
h(x) - y = ([X] * [θ] - [y]) (the m x 1 vector of errors in our predictions)
The whole objective of machine learning is to minimize the errors in predictions. Based on the above, our error matrix E is an m x 1 column vector:
E = [X] * [θ] - [y] = [e^1; e^2; ...; e^m]
To calculate the new value of θ_j, we take the summation of all errors (m rows) multiplied by the jth feature value of the training set X. That is, take all the values in E, individually multiply them by the jth feature of the corresponding training example, and add them all together. This will help us get the new (and hopefully better) value of θ_j. Repeat this process for all j, i.e. for the number of features. In matrix form, this can be written as:
θ_j := θ_j - (α/m) * (e^1 * x_j^1 + e^2 * x_j^2 + ... + e^m * x_j^m)
This can be simplified as:
[θ]' = [θ]' - (α/m) * ([E]' * [X])
[E]' * [X] will give us a row vector, since E' is a 1 x m matrix and X is an m x n matrix. But we are interested in getting a column vector, hence we transpose the result.
More succinctly, it can be written as:
[θ] = [θ] - (α/m) * ([E]' * [X])'
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as:
[θ] = [θ] - (α/m) * ([X]' * [E])
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
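A quick numerical sketch (sizes and values are arbitrary) confirming that the per-feature summation form and the vectorized form produce the same update:
% Sketch: compare the summation form against the vectorized update
% on small random data.
m = 5;  n = 4;
X = [ones(m, 1), rand(m, n - 1)];   % intercept column plus random features
y = rand(m, 1);  theta = rand(n, 1);  alpha = 0.1;
E = X * theta - y;                           % m x 1 error vector
theta_vec = theta - (alpha/m) * (X' * E);    % vectorized update
theta_loop = theta;
for j = 1:n
  theta_loop(j) = theta(j) - (alpha/m) * sum(E .* X(:, j));
end
assert(max(abs(theta_vec - theta_loop)) < 1e-12);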
I vectorized the theta computation... maybe it can help somebody:
theta = theta - (alpha/m * (X * theta - y)' * X)';
I think that your computeCost function is wrong.
I attended Ng's class last year and I have the following (vectorized) implementation:
m = length(y);
J = 0;
predictions = X * theta;            % hypothesis for all examples
sqrErrors = (predictions - y).^2;   % squared errors
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could vectorize it as well.
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you set the temporary thetas (here called theta_1 and theta_2) back to the "real" theta correctly.
Generally it is more useful to vectorize instead of using loops; it is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = pinv(X' * X) * X' * y;
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
While not scalable like a vectorized version, a loop-based computation of gradient descent should generate the same results. In the example above, the most probable cause of the gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and data similar to that described in the question, theta ends up with NaN values just after a few iterations if alpha = 0.01. However, with alpha = 0.000001, the gradient descent works as expected, even after 100 iterations.
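A small sketch of how to probe this, reusing the question's own gradientDescent (the alpha values are the ones discussed above):
% Try several learning rates and inspect the final cost; a diverging
% run shows J_history growing (or turning NaN) instead of shrinking.
for alpha = [0.01, 0.001, 0.0001, 0.000001]
  [theta, J_history] = gradientDescent(X, y, zeros(2, 1), alpha, 100);
  printf('alpha = %g -> final J = %g\n', alpha, J_history(end));
end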
Using only vectors, here is a compact implementation of LR with gradient descent in Mathematica:
Theta = {0, 0};
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
  Theta = Theta - alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
  Jhist[[k]] = Total[(Dot[X, Theta] - Y)^2]/(2*m);
  Theta,
  {k, iteration}]
Note: of course this assumes that X is an n x 2 matrix, with X[[All, 1]] containing only 1s.
This should work:
theta(1,1) = theta(1,1) - (alpha*(1/m)) * ((X*theta - y)' * X(:,1));
theta(2,1) = theta(2,1) - (alpha*(1/m)) * ((X*theta - y)' * X(:,2));
It's cleaner, and also vectorized, this way:
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first PDF file for gradient descent from the Machine Learning course, you would take care of the learning rate. Here is the note from the mentioned PDF:
Implementation Note: If your learning rate is too large, J(theta) can diverge and 'blow up', resulting in values which are too large for computer calculations. In these situations, Octave/MATLAB will tend to return NaNs. NaN stands for 'not a number' and is often caused by undefined operations that involve -infinity and +infinity.
