Octave fminunc doesn't converge - machine-learning

I'm trying to use fminunc in Octave for a logistic regression problem, but it doesn't work. It says that I didn't define variables, but actually I did. If I define the variables directly in the cost function, and not in the main script, it doesn't complain, but the function doesn't really work. In fact the exitFlag is equal to -3 and it doesn't converge at all.
Here's my function:
function [jVal, gradient] = cost(theta, X, y)
  X = [1,0.14,0.09,0.58,0.39,0,0.55,0.23,0.64;1,-0.57,-0.54,-0.16,0.21,0,-0.11,-0.61,-0.35;1,0.42,0.45,-0.41,-0.6,0,-0.44,0.38,-0.29];
  y = [1;0;1];
  theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
  jVal = 0;
  jVal = costFunction2(X, y, theta); %this is another function that gives me jVal. I'm quite sure it is
                                     %correct because I use it also with other algorithms and it
                                     %works perfectly
  m = length(y);
  xSize = size(X, 2);
  gradient = zeros(xSize, 1);
  sig = X * theta;
  h = 1 ./ (1 + exp(-sig));
  for i = 1:m
    for j = 1:xSize
      gradient(j) = (1/m) * sum(h(i) - y(i)) .* X(i, j);
    end
  end
end
Here's my main:
theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@cost, theta, options)
When I run it, I get:
optTheta =
0.80000
0.20000
0.60000
0.30000
0.40000
0.50000
0.60000
0.20000
0.40000
functionVal = 0.15967
exitFlag = -3
How can I resolve this problem?

You are not in fact using fminunc correctly. From the documentation:
-- fminunc (FCN, X0)
-- fminunc (FCN, X0, OPTIONS)
FCN should accept a vector (array) defining the unknown variables,
and return the objective function value, optionally with gradient.
'fminunc' attempts to determine a vector X such that 'FCN (X)' is a
local minimum.
What you are passing is not a handle to a function that accepts a single vector argument. Instead, what you are passing (i.e. @cost) is a handle to a function that takes three arguments.
You need to 'convert' this into a handle to a function that takes only one input, and does what you want under the hood. The easiest way to do this is by 'wrapping' your cost function into an anonymous function that only takes one argument, and calls the cost function in the appropriate way, e.g.
fminunc(@(t) cost(t, X, y), theta, options)
Note: This assumes X and y are defined in the scope where you do this 'wrapping' business
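To make this concrete, here is a minimal sketch of the whole corrected setup, with the hardcoded X, y, and theta removed from the cost function. The inline logistic cost below is only a stand-in for your costFunction2, which I can't see:

% cost.m -- uses only its inputs, no redefinitions inside.
function [jVal, gradient] = cost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                         %sigmoid hypothesis
  jVal = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)); %stand-in for costFunction2
  gradient = (1/m) * (X' * (h - y));                        %vectorized gradient
end

% main script -- the data lives here, not inside cost().
X = [1,0.14,0.09,0.58,0.39,0,0.55,0.23,0.64; ...
     1,-0.57,-0.54,-0.16,0.21,0,-0.11,-0.61,-0.35; ...
     1,0.42,0.45,-0.41,-0.6,0,-0.44,0.38,-0.29];
y = [1;0;1];
theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@(t) cost(t, X, y), theta, options)

With this arrangement fminunc is free to vary theta between calls, which is exactly what the hardcoded assignments inside the cost function were preventing.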

Related

Training NN with Julia's Flux - Loss function with derivative of output and functions of output

I want to run this NN in which the input is time over some interval. There's no label, and the loss function requires the derivative of the outputs and a specified function (H in my code), which is also a function of the outputs. I believe my loss function is not properly set up yet. I would also like to see how the loss decreases, to see how close to the actual function I am getting, but I can't seem to find a way to monitor how the loss progresses.
Here is my new code:
using Flux, Zygote, ForwardDiff
##Data
t=vcat(0:0.1:4)
##Problem parameters
α = 2; C = 1; β = 0.5; P = 1; π₀ = 0.5
#Initial and final conditions
x₀ = 0.5
p₄ = 1
t₀ = 0
t𝔣 = 4
#Hidden layer length
len_hidden=5
X = Chain(Dense(1,len_hidden),Dense(len_hidden,1,relu))
x(t) = (t - t₀)*X([t])[1] + x₀
dxdt(t) = ForwardDiff.derivative(x,t)
Ρ = Chain(Dense(1,len_hidden),Dense(len_hidden,1,relu))
p(t) = p₄ + (t - t𝔣)*Ρ([t])[1]
dpdt(t) = ForwardDiff.derivative(p,t)
U = Chain(Dense(1,len_hidden),Dense(len_hidden,1,relu))
u(t) = U([t])[1]
Θ = Flux.params(X,Ρ,U)
H(x,p,u) = α*u*x - C*u^2 + p*β*x*(1 - x)*(P*u - π₀)
#Partials
dHdx(t) = α*u(t) + p(t)*(1 - x(t))*β*(P*u(t) - π₀) - p(t)*x(t)*(P*u(t) - π₀)
dHdp(t) = (1 - x(t))*x(t)*β*(P*u(t) - π₀)
dHdu(t) = α*x(t) - 2*C*u(t) + P*p(t)*β*x(t)*(1 - x(t))
#Loss function
function loss(t)
  return (-dxdt(t) + dHdp(t))^2 + (dpdt(t) + dHdx(t))^2 + (dHdu(t))^2
end
opt=Descent()
parameters=Θ
data=t
Flux.train!(loss, parameters, data, opt, cb = () -> println("Training"))
Is the way I wrote the loss function correct? For each time instant (which is my data vector), am I computing the loss with the updated value of each function in loss()? So far, x(0) takes the imposed initial condition, but it stays constant for all other time instants, which makes me think the loss is not being evaluated and minimized over time taking into account the evolution of all the other functions.
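For reference, the three squared terms in loss() penalize deviation from the system (assuming the Hamiltonian optimality conditions are what's intended here):

$$\dot{x} = \frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial x}, \qquad \frac{\partial H}{\partial u} = 0,$$

so the loss can only vanish where all three conditions hold at the sampled time instants.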

How can I adapt the multivariate normal for batch operations?

I am implementing the multivariate normal probability density function from scratch in Python. The formula for it is as follows:

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
I was able to code the version where $\mathbf{x}$ is an input vector (a single sample). However, I could make good use of NumPy's matrix operations and extend it to the case of $\mathbf{X}$ (a set of samples), returning all the samples' probabilities at once, equivalent to SciPy's implementation.
This is the code i have made:
def multivariate_normal(X, center, cov):
    k = X.shape[0]
    det_cov = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    o = 1 / np.sqrt((2 * np.pi) ** k * det_cov)
    p = np.exp(-.5 * (np.dot(np.dot((X - center).T, inv_cov), (X - center))))
    return o * p
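For reference, the batched version in matrix form keeps only the diagonal of the quadratic form. With the $n$ samples stacked as rows of $\mathbf{X} \in \mathbb{R}^{n \times k}$:

$$d_i = \left[(\mathbf{X}-\boldsymbol{\mu})\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{X}-\boldsymbol{\mu})^\top\right]_{ii}, \qquad p_i = \frac{\exp(-d_i/2)}{\sqrt{(2\pi)^k \det(\boldsymbol{\Sigma})}},$$

where the diagonal entries can be computed without forming the full $n \times n$ product, as the row-wise sum of $(\mathbf{X}-\boldsymbol{\mu})\boldsymbol{\Sigma}^{-1}$ multiplied element-wise with $(\mathbf{X}-\boldsymbol{\mu})$.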
Thanks in advance.

How can I fix this issue with my Mandelbrot fractal generator?

I've been working on a project that renders a Mandelbrot fractal. For those of you who know, it is generated by iterating through the following function where c is the point on a complex plane:
function f(c, z) return z^2 + c end
Iterating through that function produces the following fractal (ignore the color):
When you change the function to this, (z raised to the third power)
function f(c, z) return z^3 + c end
the fractal should render like so (again, the color doesn't matter):
However, when I raised z to the power of 3, I got an image extremely similar to the one produced when z is raised to the power of 2. How can I make the fractal render correctly? This is the code where the iterations are done (the variables real and imaginary simply scale the screen from -2 to 2):
--loop through each pixel; col = column, row = row
local real = (col - zoomCol) * 4 / width
local imaginary = (row - zoomRow) * 4 / width
local z, c, iter = 0, 0, 0
while math.sqrt(z^2 + c^2) <= 2 and iter < maxIter do
  local zNew = z^2 - c^2 + real
  c = 2*z*c + imaginary
  z = zNew
  iter = iter + 1
end
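For reference, expanding the complex powers shows what those two update lines compute for each exponent:

$$(z_r + z_i i)^2 = z_r^2 - z_i^2 + 2 z_r z_i\, i, \qquad (z_r + z_i i)^3 = z_r^3 - 3 z_r z_i^2 + \left(3 z_r^2 z_i - z_i^3\right) i,$$

so the power-3 fractal needs its own real and imaginary update expressions; keeping the squared-power updates reproduces the power-2 image.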
So I recently decided to remake a Mandelbrot fractal generator, and it was MUCH more successful than my last attempt, as my programming skills have increased with practice.
I decided to generalize the Mandelbrot function using recursion, for anyone who wants it. So, for example, you can do f(z, c) = z^2 + c or f(z, c) = z^3 + c.
Here it is for anyone that may need it:
function raise(r, i, cr, ci, pow, zr, zi)
  -- (r, i) holds the original z; (zr, zi) accumulates the running power
  zr = zr or r
  zi = zi or i
  if pow == 1 then
    return zr + cr, zi + ci
  end
  -- multiply the running power by the original z once per remaining exponent
  return raise(r, i, cr, ci, pow - 1, zr*r - zi*i, zr*i + zi*r)
end
and it's used like this:
r, i = raise(r, i, CONSTANT_REAL_PART, CONSTANT_IMAG_PART, POWER)

Backpropagation, all outputs tend to 1

I have this backpropagation implementation in MATLAB, and have an issue with training it. Early on in the training phase, all of the outputs go to 1. I have normalized the input data (except the desired class, which is used to generate a binary target vector) to the interval [0, 1]. I have been referring to the implementation in Artificial Intelligence: A Modern Approach, Norvig et al.
Having checked the pseudocode against my code (and studied the algorithm for some time), I cannot spot the error. I have not been using MATLAB for that long, so I have been trying to use the documentation where needed.
I have also tried different numbers of nodes in the hidden layer and different learning rates (ALPHA).
The target data encodings are as follows: when the target is to classify as, say, 2, the target vector would be [0, 1, 0]; were it 1, [1, 0, 0]; and so on and so forth. I have also tried using different values for the target, such as (for class 1, for example) [0.5, 0, 0].
I noticed that some of my weights go above 1, resulting in large net values.
%Topological constants
NUM_HIDDEN = 8+1; %written as n+1 so it is clear bias is used
NUM_OUT = 3;
%Training constants
ALPHA = 0.01;
TARG_ERR = 0.01;
MAX_EPOCH = 50000;
%Read and normalize data file.
X = normdata(dlmread('iris.data'));
X = shuffle(X);
%X_test = normdata(dlmread('iris2.data'));
%epocherrors = fopen('epocherrors.txt', 'w');
%Weight matrices.
%Features constitute size(X, 2)-1, however size is (X, 2) to allow for
%appending bias.
w_IH = rand(size(X, 2), NUM_HIDDEN)-(0.5*rand(size(X, 2), NUM_HIDDEN));
w_HO = rand(NUM_HIDDEN+1, NUM_OUT)-(0.5*rand(NUM_HIDDEN+1, NUM_OUT)); %+1 for bias
%Layer nets
net_H = zeros(NUM_HIDDEN, 1);
net_O = zeros(NUM_OUT, 1);
%Layer outputs
out_H = zeros(NUM_HIDDEN, 1);
out_O = zeros(NUM_OUT, 1);
%Layer deltas
d_H = zeros(NUM_HIDDEN, 1);
d_O = zeros(NUM_OUT, 1);
%Control variables
error = inf;
epoch = 0;
%Run the algorithm.
while error > TARG_ERR && epoch < MAX_EPOCH
  for n=1:size(X, 1)
    x = [X(n, 1:size(X, 2)-1) 1]'; %Add bias for hiddens & transpose to column vector.
    o = X(n, size(X, 2));
    %Forward propagate.
    net_H = w_IH'*x; %Transposed w.
    out_H = [sigmoid(net_H); 1]; %Append 1 for bias to outputs
    net_O = w_HO'*out_H;
    out_O = sigmoid(net_O); %Again, transposed w.
    %Calculate output deltas.
    d_O = ((targetVec(o, NUM_OUT)-out_O) .* (out_O .* (1-out_O)));
    %Calculate hidden deltas.
    for i=1:size(w_HO, 1)
      delta_weight = 0;
      for j=1:size(w_HO, 2)
        delta_weight = delta_weight + d_O(j)*w_HO(i, j);
      end
      d_H(i) = (out_H(i)*(1-out_H(i)))*delta_weight;
    end
    %Update hidden-output weights
    for i=1:size(w_HO, 1)
      for j=1:size(w_HO, 2)
        w_HO(i, j) = w_HO(i, j) + (ALPHA*out_H(i)*d_O(j));
      end
    end
    %Update input-hidden weights.
    for i=1:size(w_IH, 1)
      for j=1:size(w_IH, 2)
        w_IH(i, j) = w_IH(i, j) + (ALPHA*x(i)*d_H(j));
      end
    end
    out_O
    o
    %out_H
    %w_IH
    %w_HO
    %d_O
    %d_H
  end
end

function outs = sigmoid(nets)
  outs = zeros(size(nets, 1), 1);
  for i=1:size(nets, 1)
    if nets(i) < -45
      outs(i) = 0;
    elseif nets(i) > 45
      outs(i) = 1;
    else
      outs(i) = 1/1+exp(-nets(i));
    end
  end
end
From what we've established in the comments, the only things that come to my mind are the recipes written down together in this great NN archive:
ftp://ftp.sas.com/pub/neural/FAQ2.html#questions
First things you could try are:
1) How to avoid overflow in the logistic function? That's probably the problem - many times when I've implemented NNs, the problem was such an overflow.
2) How should categories be encoded?
And more generally:
3) How does ill-conditioning affect NN training?
4) Help! My NN won't learn! What should I do?
After the discussion it turns out the problem lies within the sigmoid function:
function outs = sigmoid(nets)
  %...
  outs(i) = 1/1+exp(-nets(i)); % parenthesis missing!!!!!!
  %...
end
It should be:
function outs = sigmoid(nets)
  %...
  outs(i) = 1/(1+exp(-nets(i)));
  %...
end
The missing parentheses caused the sigmoid output to sometimes be larger than 1. That made the gradient calculation incorrect (because it wasn't the gradient of this function) and caused the gradient to be negative, which in turn meant the delta for the output layer was in the wrong direction most of the time. After the fix (and after correctly maintaining the error variable, which seems to be missing in your code), all seems to work fine.
Besides that, there are two other main problems with this code:
1) No bias. Without a bias, each neuron can only represent a line which crosses the origin. If the data is normalized (i.e. values are between 0 and 1), some configurations are inseparable.
2) Lack of guarding against high gradient values (point 1 in my previous answer).
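As an aside, once the parentheses are fixed, the whole loop-and-branch sigmoid can be collapsed into a vectorized form; a minimal sketch with the same saturation guards as the original:

function outs = sigmoid(nets)
  %Element-wise logistic function; the parentheses around (1+exp(..)) matter.
  outs = 1 ./ (1 + exp(-nets));
  %Same +/-45 saturation guards as the original loop version.
  outs(nets < -45) = 0;
  outs(nets > 45) = 1;
end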

Gradient in continuous regression using a neural network

I'm trying to implement a regression NN that has 3 layers (1 input, 1 hidden and 1 output layer with a continuous result). As a basis I took a classification NN from the coursera.org class, but changed the cost function and gradient calculation so as to fit a regression problem (and not a classification one):
My nnCostFunction now is:
function [J grad] = nnCostFunctionLinear(nn_params, ...
                                         input_layer_size, ...
                                         hidden_layer_size, ...
                                         num_labels, ...
                                         X, y, lambda)
  Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                   hidden_layer_size, (input_layer_size + 1));
  Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                   num_labels, (hidden_layer_size + 1));
  m = size(X, 1);
  a1 = X;
  a1 = [ones(m, 1) a1];
  a2 = a1 * Theta1';
  a2 = [ones(m, 1) a2];
  a3 = a2 * Theta2';
  Y = y;
  J = 1/(2*m)*sum(sum((a3 - Y).^2))
  th1 = Theta1;
  th1(:,1) = 0; %set bias = 0 in reg. formula
  th2 = Theta2;
  th2(:,1) = 0;
  t1 = th1.^2;
  t2 = th2.^2;
  th = sum(sum(t1)) + sum(sum(t2));
  th = lambda * th / (2*m);
  J = J + th; %regularization
  del_3 = a3 - Y;
  t1 = del_3'*a2;
  Theta2_grad = 2*(t1)/m + lambda*th2/m;
  t1 = del_3 * Theta2;
  del_2 = t1 .* a2;
  del_2 = del_2(:,2:end);
  t1 = del_2'*a1;
  Theta1_grad = 2*(t1)/m + lambda*th1/m;
  grad = [Theta1_grad(:) ; Theta2_grad(:)];
end
Then I use this function in the fmincg algorithm, but in the first iterations fmincg ends its work. I think my gradient is wrong, but I can't find the error.
Can anybody help?
If I understand correctly, your first block of code (shown below) -
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];
a2 = a1 * Theta1';
a2 = [ones(m, 1) a2];
a3 = a2 * Theta2';
Y = y;
is to get the output a(3) at the output layer.
Ng's slides about NNs have the below configuration to calculate a(3), which is different from what your code presents:

$$z^{(2)} = \Theta^{(1)} a^{(1)}, \quad a^{(2)} = g(z^{(2)}), \quad z^{(3)} = \Theta^{(2)} a^{(2)}, \quad a^{(3)} = g(z^{(3)})$$

In the middle/output layer, you are not applying the activation function g, e.g., a sigmoid function.
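In Octave, a sketch of that forward pass, reusing the question's X, Theta1, and Theta2 and assuming the course's usual sigmoid activation:

sigmoid = @(z) 1 ./ (1 + exp(-z));        %logistic activation
a1 = [ones(m, 1) X];
a2 = [ones(m, 1) sigmoid(a1 * Theta1')];  %g applied at the hidden layer
a3 = sigmoid(a2 * Theta2');               %and again at the output layer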
In terms of the cost function J without regularization terms, Ng's slides have the below formula:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\left(1 - h_\Theta(x^{(i)})\right)_k \right]$$
I don't understand why you can compute it using:
J = 1/(2*m)*sum(sum((a3 - Y).^2))
because you are not including the log function at all.
Mikhaill, I've been playing with a NN for continuous regression as well, and had similar issues at some point. The best thing to do here would be to test the gradient computation against a numerical calculation before running the model. If that's not correct, fmincg won't be able to train the model. (Btw, I discourage you from using the numerical gradient for training, as the time involved is much bigger.)
Taking into account that you took this idea from Ng's Coursera class, I'll implement a possible solution for you to try, using the same notation for Octave.
% Cost function without regularization.
J = 1/2/m^2*sum((a3-Y).^2);
% In case it's needed, the regularization term is added (i.e. for training).
if (reg==true);
  J = J + lambda/2/m*(sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
endif;
% Derivatives are computed for layers 2 and 3.
d3 = (a3.-Y);
d2 = d3*Theta2(:,2:end);
% Theta grad is computed without regularization.
Theta1_grad = (d2'*a1)./m;
Theta2_grad = (d3'*a2)./m;
% Regularization is added to the grad computation.
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + (lambda/m).*Theta1(:,2:end);
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + (lambda/m).*Theta2(:,2:end);
% Unroll gradients.
grad = [Theta1_grad(:) ; Theta2_grad(:)];
Note that, since you have taken out all the sigmoid activation, the derivative calculation is quite simple and results in a simplification of the original code.
Next steps:
1. Check this code to understand if it makes sense to your problem.
2. Use gradient checking to test the gradient calculation (a minimal checker follows this list).
3. Finally, use fmincg and check that you get different results.
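For step 2, here is a minimal central-difference gradient checker; it is a standard sketch rather than anything specific to this code, and costFunc stands for any handle of the form @(p) nnCostFunctionLinear(p, ...):

function numgrad = computeNumericalGradient(costFunc, params)
  %Approximate each partial derivative with a central difference.
  numgrad = zeros(size(params));
  perturb = zeros(size(params));
  e = 1e-4;
  for k = 1:numel(params)
    perturb(k) = e;
    loss1 = costFunc(params - perturb);
    loss2 = costFunc(params + perturb);
    numgrad(k) = (loss2 - loss1) / (2*e);
    perturb(k) = 0;
  end
end

Comparing numgrad against the analytic grad on a tiny network (e.g. via norm(numgrad-grad)/norm(numgrad+grad)) quickly tells you whether the backpropagation is at fault.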
Try including the sigmoid function when computing the second (hidden) layer values, and avoid the sigmoid when calculating the target (output) value.
function [J grad] = nnCostFunction1(nnParams, ...
                                    inputLayerSize, ...
                                    hiddenLayerSize, ...
                                    numLabels, ...
                                    X, y, lambda)
  Theta1 = reshape(nnParams(1:hiddenLayerSize * (inputLayerSize + 1)), ...
                   hiddenLayerSize, (inputLayerSize + 1));
  Theta2 = reshape(nnParams((1 + (hiddenLayerSize * (inputLayerSize + 1))):end), ...
                   numLabels, (hiddenLayerSize + 1));
  Theta1Grad = zeros(size(Theta1));
  Theta2Grad = zeros(size(Theta2));
  m = size(X,1);
  a1 = [ones(m, 1) X]';
  z2 = Theta1 * a1;
  a2 = sigmoid(z2);
  a2 = [ones(1, m); a2];
  z3 = Theta2 * a2;
  a3 = z3;
  Y = y';
  r1 = lambda / (2 * m) * sum(sum(Theta1(:, 2:end) .* Theta1(:, 2:end)));
  r2 = lambda / (2 * m) * sum(sum(Theta2(:, 2:end) .* Theta2(:, 2:end)));
  J = 1 / (2 * m) * (a3 - Y) * (a3 - Y)' + r1 + r2;
  delta3 = a3 - Y;
  delta2 = (Theta2' * delta3) .* sigmoidGradient([ones(1, m); z2]);
  delta2 = delta2(2:end, :);
  Theta2Grad = 1 / m * (delta3 * a2');
  Theta2Grad(:, 2:end) = Theta2Grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
  Theta1Grad = 1 / m * (delta2 * a1');
  Theta1Grad(:, 2:end) = Theta1Grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
  grad = [Theta1Grad(:) ; Theta2Grad(:)];
end
Normalize the inputs before passing them to nnCostFunction.
In accordance with the Week 5 Lecture Notes guideline for a linear-system NN, you should make the following changes to the initial code (a sketch of the result follows this list):
Remove num_labels or make it 1 (in reshape() as well)
No need to convert y into a logical matrix
For a2 - replace the sigmoid() function with tanh()
In the d2 calculation - replace sigmoidGradient(z2) with (1-tanh(z2).^2)
Remove the sigmoid from the output layer (a3 = z3)
Replace the cost function in the unregularized portion with a linear one: J = (1/(2*m))*sum((a3-y).^2)
Create predictLinear(): use the predict() function as a basis; replace sigmoid with tanh() for the first layer hypothesis, remove the second sigmoid for the second layer hypothesis, remove the line with the max() function, and use the output of the hidden layer hypothesis as the prediction result
Verify your nnCostFunctionLinear() on the test case from the lecture notes
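A minimal sketch of how those edits could come together, assuming a single continuous output and keeping the original variable names (an illustration of the list above, not the lecture's reference solution):

function [J grad] = nnCostFunctionLinear(nn_params, input_layer_size, ...
                                         hidden_layer_size, X, y, lambda)
  %Single-output network: Theta2 has one row (num_labels removed).
  Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                   hidden_layer_size, input_layer_size + 1);
  Theta2 = reshape(nn_params(1 + hidden_layer_size * (input_layer_size + 1):end), ...
                   1, hidden_layer_size + 1);
  m = size(X, 1);
  a1 = [ones(m, 1) X];
  z2 = a1 * Theta1';
  a2 = [ones(m, 1) tanh(z2)];                       %tanh in the hidden layer
  a3 = a2 * Theta2';                                %linear output, no sigmoid
  J = (1/(2*m)) * sum((a3 - y).^2) ...
      + lambda/(2*m) * (sum(sum(Theta1(:,2:end).^2)) + sum(Theta2(2:end).^2));
  d3 = (a3 - y) / m;
  d2 = (d3 * Theta2(:,2:end)) .* (1 - tanh(z2).^2); %tanh gradient
  Theta2_grad = d3' * a2;
  Theta1_grad = d2' * a1;
  Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + (lambda/m) * Theta1(:,2:end);
  Theta2_grad(2:end) = Theta2_grad(2:end) + (lambda/m) * Theta2(2:end);
  grad = [Theta1_grad(:); Theta2_grad(:)];
end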
