Background and my thought process:
I wanted to see if I could use logistic regression to create a hypothesis function that predicts recessions in the US economy by looking at a date and its corresponding leading economic indicators. Leading economic indicators are known to be good predictors of the economy.
To do this, I got data from the OECD on the composite leading (economic) indicators from January 1970 to July 2021, in addition to finding when recessions occurred from 1970 to 2021. The formatted data that I use for training can be found further below.
Knowing the relationship between a recession and the Date/LEI wouldn't be a simple linear one, I decided to add more parameters to each datapoint so I could fit a polynomial to the data. Thus, each datapoint has the following parameters: Date, LEI, LEI^2, LEI^3, LEI^4, and LEI^5.
The Problem:
When I attempt to train my hypothesis function, I get a very strange cost history that seems to indicate that either my cost function or my gradient descent is implemented incorrectly. Below is the image of my cost history:
I have tried implementing the suggestions from this post to fix my cost history, as originally I had the same NaN and Inf issues described there. While those suggestions fixed the NaN and Inf issues, I couldn't find anything to stop the cost history from oscillating. Some of the other fixes I've tried are adjusting the learning rate, double-checking my cost function and gradient descent, and introducing more parameters per datapoint (to see whether a higher-degree polynomial would help).
My Code
The main file is predictor.m.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Program: Predictor.m
% Author: Hasec Rainn
% Desc: Predictor.m uses logistic regression
% to predict when economic recessions will occur
% in the United States. The data it uses is from the past 50 years.
%
% In particular, it uses dates and their corresponding economic leading
% indicators to learn a non-linear hypothesis function to fit to the data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
LI_Data = dlmread("leading_indicators_formatted.csv"); %Get LI data
RD_Data = dlmread("recession_dates_formatted.csv"); %Get RD data
%our datapoints of interest: Dates and their corresponding
%leading Indicator values.
%We are going to increase the number of parameters per datapoint to allow
%for a non-linear hypothesis function. Specifically, let the 3rd, 4th
%5th, and 6th columns represent LI^2, LI^3, LI^4, and LI^5 respectively
X = LI_Data; %datapoints of interest (row = 1 datapoint)
X = [X, X(:,2).^2]; %Adding LI^2
X = [X, X(:,2).^3]; %Adding LI^3
X = [X, X(:,2).^4]; %Adding LI^4
X = [X, X(:,2).^5]; %Adding LI^5
%normalize data
X(:,1) = normalize( X(:,1) );
X(:,2) = normalize( X(:,2) );
X(:,3) = normalize( X(:,3) );
X(:,4) = normalize( X(:,4) );
X(:,5) = normalize( X(:,5) );
X(:,6) = normalize( X(:,6) );
%What we want to predict: if a recession happens or doesn't happen
%for a corresponding year
Y = RD_Data(:,2); %row = 1 datapoint
%defining a few useful variables:
nIter = 4000; %how many iterations we want to run gradient descent for
ndp = size(X, 1); %number of data points we have to work with
nPara = size(X,2); %number of parameters per data point
alpha = 1; %set the learning rate to 1
%Defining Theta
Theta = ones(1, nPara); %initialize the weights of Theta to 1
%Make a cost history so we can see if gradient descent is implemented
%correctly
costHist = zeros(nIter, 1);
for i = 1:nIter
    costHist(i, 1) = cost(Theta, Y, X);
    Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X));
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Cost
% Author: Hasec Rainn
% Parameters: Theta (vector), Y (vector), X (matrix)
% Desc: Uses Theta, Y, and X to determine the cost of our current
% hypothesis function H_theta(X). Uses manual loop approach to
% avoid errors that arrise from log(0).
% Additionally, limits the range of H_Theta to prevent Inf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function expense = cost(Theta, Y, X)
    m = size(X, 1);              %number of data points
    hTheta = sigmoid(X*Theta'); %hypothesis function
    %limit the range of hTheta to [10^-50, 0.9999999999999]
    for i=1:size(hTheta, 1)
        if (hTheta(i) <= 10^(-50))
            hTheta(i) = 10^(-50);
        endif
        if (hTheta(i) >= 0.9999999999999)
            hTheta(i) = 0.9999999999999;
        endif
    endfor
    expense = 0;
    for i = 1:m
        if Y(i) == 1
            expense = expense + -log(hTheta(i));
        endif
        if Y(i) == 0
            expense = expense + -log(1-hTheta(i));
        endif
    endfor
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: normalization
% Author: Hasec Rainn
% Parameters: vector
% Desc: Takes in an input and normalizes its value(s)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function n = normalize(data)
    dMean = mean(data);
    dStd = std(data);
    n = (data - dMean) ./ dStd;
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Sigmoid
% Author: Hasec Rainn
% Parameters: scalar, vector, or matrix
% Desc: Takes an input and forces its value(s) to be between
% 0 and 1. If a matrix or vector, sigmoid is applied to
% each element.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function result = sigmoid(z)
    result = 1 ./ ( 1 + e .^(-z) );
endfunction
The data I used for my learning process can be found here: formatted LI data and recession dates data.
The problem you're running into here is your gradient descent update.
While you correctly calculate the error term (hTheta - Y, i.e. sigmoid(X * Theta') - Y), the update line Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X)) is not a complete gradient descent step.
The derivative of the cost with respect to each parameter j is the error of each datapoint (found in the vector hTheta - Y) multiplied by that datapoint's j-th feature value, summed over all datapoints and scaled by 1/m: grad_j = (1/m) * sum_i (hTheta_i - Y_i) * X_ij. The update then has to be scaled by the learning rate: Theta_j = Theta_j - alpha * grad_j. Your line omits both alpha and the 1/m averaging, so with several hundred datapoints every step is enormous, which is why the cost oscillates. For more information, check out this article.
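As a rough sketch (using the variable names from your script; the exact learning rate is an assumption on my part and something you would need to tune), the training loop could look like this:
alpha = 0.01;                          % much smaller than 1; tune as needed
for i = 1:nIter
    costHist(i, 1) = cost(Theta, Y, X);
    grad = (1 / ndp) * sum((sigmoid(X * Theta') - Y) .* X);  % 1 x nPara gradient
    Theta = Theta - alpha * grad;                            % scaled gradient step
end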
Related
Hello everyone,
I am a beginner in machine learning and have just started learning about gradient descent. However, I've run into one big problem. The question is as follows:
Given the points [0,0], [1,1], [1,2], [2,1] and
the equation f = (a2)*x^2 + (a1)*x + a0,
solving by hand I got the answer [-1, 5/2, 0],
but it is hard to reach that solution with Python code using gradient descent on this data.
In my case, I tried to write the code using the simplest and fastest gradient descent approach I could, along these lines:
learningRate = 0.1
make a series of x values
initialize a2, a1, a0 to 1, 1, 1
partial derivatives for a2, a1, a0 (a2_p: 2x, a1_p: x, a0_p: 1)
gradient descent step: e.g. a2 = a2 - (learningRate)*( y - [(a2)*x^2 + (a1)*x + a0] )*(a2_p)
P.S. Honestly, I do not know what I should put in for 'x' and 'y', or for a2, a1, a0.
However, I get a wrong answer, with a different result each time.
So, I would like a hint for the correct equation or code sequence.
Thank you for reading my very basic question.
There are a few errors in your equations
For the function f(x) = a2*x^2 + a1*x + a0, the partial derivatives with respect to a2, a1 and a0 are x^2, x and 1, respectively.
Suppose the cost function is (1/2)*(y-f(x))^2.
The partial derivative of the cost function with respect to ai is -(y-f(x)) * (partial derivative of f(x) with respect to ai), where i belongs to [0,2].
So the gradient descent update is:
ai = ai + learning_rate*(y-f(x)) * (partial derivative of f(x) with respect to ai), where i belongs to [0,2]
I hope this code helps
#Training sample
sample = [(0,0),(1,1),(1,2),(2,1)]
#Our function => a2*x^2+a1*x+a0
class Function():
    def __init__(self, a2, a1, a0):
        self.a2 = a2
        self.a1 = a1
        self.a0 = a0
    def eval(self, x):
        return self.a2*x**2+self.a1*x+self.a0
    def partial_a2(self, x):
        return x**2
    def partial_a1(self, x):
        return x
    def partial_a0(self, x):
        return 1
#Initialise function
f = Function(1,1,1)
#To Calculate loss from the sample
def loss(sample, f):
    return sum([(y-f.eval(x))**2 for x,y in sample])/len(sample)
epochs = 100000
lr = 0.0005
#To record the best values
best_values = (0,0,0)
for epoch in range(epochs):
    min_loss = 100
    for x, y in sample:
        #Gradient descent
        f.a2 = f.a2+lr*(y-f.eval(x))*f.partial_a2(x)
        f.a1 = f.a1+lr*(y-f.eval(x))*f.partial_a1(x)
        f.a0 = f.a0+lr*(y-f.eval(x))*f.partial_a0(x)
        #Storing the best values
        epoch_loss = loss(sample, f)
        if min_loss > epoch_loss:
            min_loss = epoch_loss
            best_values = (f.a2, f.a1, f.a0)
print("Loss:", min_loss)
print("Best values (a2,a1,a0):", best_values)
Output:
Loss: 0.12500004789165717
Best values (a2,a1,a0): (-1.0001922562970325, 2.5003368582261487, 0.00014521557599919338)
I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges (the slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]
#The above functions are called below:
learning_rate = 0.0001
initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b & m continue to update for all the iterations? After a certain number of iterations, the function will converge, i.e. we find the values of b & m where the slope = 0.
So why do we continue iterating after that point and keep updating b & m?
This way, aren't we losing the 'correct' b & m values? How does the learning rate help the convergence process if we continue to update values after converging? So why is there no check for convergence, and how is this actually working?
In practice, you will most likely not reach a slope of exactly 0. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. On the contrary, if the learning rate is too low, learning becomes too slow and won't reach the lowest point of the bowl before all the iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope of 0, b_gradient and m_gradient become 0; thus, in:
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will keep the old, correct values, since nothing will be subtracted from them.
I am trying to implement a custom Keras objective function described in 'Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression', Narihira et al.
It is the sum of equations (4) and (6) from the previous picture. Y* is the ground truth, Y a prediction map, and y = Y* - Y.
This is my code:
def custom_objective(y_true, y_pred):
    #Eq. (4) Scale invariant L2 loss
    y = y_true - y_pred
    h = 0.5 # lambda
    term1 = K.mean(K.sum(K.square(y)))
    term2 = K.square(K.mean(K.sum(y)))
    sca = term1-h*term2
    #Eq. (6) Gradient L2 loss
    gra = K.mean(K.sum((K.square(K.gradients(K.sum(y[:,1]), y)) + K.square(K.gradients(K.sum(y[1,:]), y)))))
    return (sca + gra)
However, I suspect that the equation (6) is not correctly implemented because the results are not good. Am I computing this right?
Thank you!
Edit:
I am trying to approximate (6) by convolving with Prewitt filters. It works when my input is a chunk of images, i.e. y[batch_size, channels, row, cols], but not with y_true and y_pred (which are of type TensorType(float32, 4D)).
My code:
def cconv(image, g_kernel, batch_size):
    g_kernel = theano.shared(g_kernel)
    M = T.dtensor3()
    conv = theano.function(
        inputs=[M],
        outputs=conv2d(M, g_kernel, border_mode='full'),
    )
    accum = 0
    for curr_batch in range (batch_size):
        accum = accum + conv(image[curr_batch])
    return accum/batch_size
def gradient_loss(y_true, y_pred):
    y = y_true - y_pred
    batch_size = 40
    # Direction i
    pw_x = np.array([[-1,0,1],[-1,0,1],[-1,0,1]]).astype(np.float64)
    g_x = cconv(y, pw_x, batch_size)
    # Direction j
    pw_y = np.array([[-1,-1,-1],[0,0,0],[1,1,1]]).astype(np.float64)
    g_y = cconv(y, pw_y, batch_size)
    gra_l2_loss = K.mean(K.square(g_x) + K.square(g_y))
    return (gra_l2_loss)
The crash is produced in:
accum = accum + conv(image[curr_batch])
...and the error description is the following:
*** TypeError: ('Bad input argument to theano function with name "custom_models.py:836" at index 0 (0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
How can I use y (y_true - y_pred) as a numpy array, or how can I solve this issue?
SIL2
term1 = K.mean(K.square(y))
term2 = K.square(K.mean(y))
[...]
One mistake spread across the code: when you see (1/n) * sum() in the equations, it is a mean, not the mean of a sum.
Gradient
After reading your comment and giving it more thought, I think there is a confusion about the gradient. At least I got confused.
There are two ways of interpreting the gradient symbol:
The gradient of a vector, where y should be differentiated with respect to the parameters of your model (usually the weights of the neural net). In previous edits I started to write in this direction because that's the sort of approach used to train the model (e.g. gradient descent). But I think I was wrong.
The pixel intensity gradient in a picture, as you mentioned in your comment. The diff of each pixel with its neighbor in each direction. In which case I guess you have to translate the example you gave into Keras.
To sum up, K.gradients() and numpy.gradient() are not used in the same way. Because numpy implicitly considers (i, j) (the row and column indices) as the two input variables, while when you feed a 2D image to a neural net, every single pixel is an input variable. Hope I'm clear.
I am trying to implement logistic regression with gradient descent.
I compute my cost function j_theta for each iteration, and fortunately my j_theta is decreasing when plotted against the number of iterations.
The data set I use is given below:
x=
1 20 30
1 40 60
1 70 30
1 50 50
1 50 40
1 60 40
1 30 40
1 40 50
1 10 20
1 30 40
1 70 70
y= 0
1
1
1
0
1
0
0
0
0
1
The code that I managed to write for logistic regression using Gradient descent is:
%1. The below code would load the data present in your desktop to the octave memory
x=load('stud_marks.dat');
%y=load('ex4y.dat');
y=x(:,3);
x=x(:,1:2);
%2. Now we want to add a column x0 with all the rows as value 1 into the matrix.
%First take the length
[m,n]=size(x);
x=[ones(m,1),x];
X=x;
% Now we limit the x1 and x2 we need to leave or skip the first column x0 because they should stay as 1.
mn = mean(x);
sd = std(x);
x(:,2) = (x(:,2) - mn(2))./ sd(2);
x(:,3) = (x(:,3) - mn(3))./ sd(3);
% We will not use vectorized technique, Because its hard to debug, We shall try using many for loops rather
max_iter=50;
theta = zeros(size(x(1,:)))';
j_theta=zeros(max_iter,1);
for num_iter=1:max_iter
    % We calculate the cost Function
    j_cost_each=0;
    alpha=1;
    theta
    for i=1:m
        z=0;
        for j=1:n+1
            % theta(j)
            z=z+(theta(j)*x(i,j));
            z
        end
        h= 1.0 ./(1.0 + exp(-z));
        j_cost_each=j_cost_each + ( (-y(i) * log(h)) - ((1-y(i)) * log(1-h)) );
        % j_cost_each
    end
    j_theta(num_iter)=(1/m) * j_cost_each;
    for j=1:n+1
        grad(j) = 0;
        for i=1:m
            z=(x(i,:)*theta);
            z
            h=1.0 ./ (1.0 + exp(-z));
            h
            grad(j) += (h-y(i)) * x(i,j);
        end
        grad(j)=grad(j)/m;
        grad(j)
        theta(j)=theta(j)- alpha * grad(j);
    end
end
figure
plot(0:1999, j_theta(1:2000), 'b', 'LineWidth', 2)
hold off
figure
%3. In this step we will plot the graph for the given input data set just to see how is the distribution of the two class.
pos = find(y == 1); % This will take the postion or array number from y for all the class that has value 1
neg = find(y == 0); % Similarly this will take the position or array number from y for all class that has value 0
% Now we plot the graph column x1 Vs x2 for y=1 and y=0
plot(x(pos, 2), x(pos,3), '+');
hold on
plot(x(neg, 2), x(neg, 3), 'o');
xlabel('x1 marks in subject 1')
ylabel('y1 marks in subject 2')
legend('pass', 'Failed')
plot_x = [min(x(:,2))-2, max(x(:,2))+2]; % This min and max decides the length of the decision graph.
% Calculate the decision boundary line
plot_y = (-1./theta(3)).*(theta(2).*plot_x +theta(1));
plot(plot_x, plot_y)
hold off
%%%%%%% The only difference is that in the last plot I used X, whereas now I use x, whose features are feature-scaled %%%%%%%%%%%
If you view the graph of x1 vs x2, it would look like this:
After I run my code, I get a decision boundary. The shape of the decision line seems to be okay, but it is a bit displaced. The graph of x1 vs x2 with the decision boundary is given below:
Please suggest where I am going wrong.
Thanks :)
The new graph:
If you look at the new graph, the coordinates of the x axis have changed. That's because I use x (feature-scaled) instead of X.
The problem lies in your cost function and/or gradient calculation; your plotting function is fine. I ran your dataset through the logistic regression algorithm I implemented, but using the vectorized technique, because in my opinion it is easier to debug.
The final values I got for theta were
theta =
[-76.4242,
0.8214,
0.7948]
I also used alpha = 0.3
I plotted the decision boundary and it looks fine. I would recommend using the vectorized form, as it is easier to implement and to debug in my opinion.
I also think your implementation of gradient descent is not quite correct: 50 iterations is just not enough, and the cost at the last iteration is not good enough. Maybe you should run it for more iterations, with a stopping condition.
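For reference, a vectorized cost and gradient step (just a sketch using the variable names from your script, not the exact code I ran) would look something like this inside the iteration loop:
h = 1.0 ./ (1.0 + exp(-(x * theta)));                            % m x 1 predictions
j_theta(num_iter) = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));
grad = (1/m) * (x' * (h - y));                                   % (n+1) x 1 gradient
theta = theta - alpha * grad;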
Also check this lecture for optimization techniques.
https://class.coursera.org/ml-006/lecture/37
I have this backpropagation implementation in MATLAB, and have an issue with training it. Early on in the training phase, all of the outputs go to 1. I have normalized the input data (except the desired class, which is used to generate a binary target vector) to the interval [0, 1]. I have been referring to the implementation in Artificial Intelligence: A Modern Approach, Norvig et al.
Having checked the pseudocode against my code (and studied the algorithm for some time), I cannot spot the error. I have not been using MATLAB for that long, so I have been trying to use the documentation where needed.
I have also tried different amounts of nodes in the hidden layer and different learning rates (ALPHA).
The target data encodings are as follows: when the target is to classify as, say, 2, the target vector would be [0,1,0]; if it were 1, [1,0,0]; and so on. I have also tried using different values for the target, such as (for class 1, for example) [0.5, 0, 0].
I noticed that some of my weights go above 1, resulting in large net values.
%Topological constants
NUM_HIDDEN = 8+1;%written as n+1 so is clear bias is used
NUM_OUT = 3;
%Training constants
ALPHA = 0.01;
TARG_ERR = 0.01;
MAX_EPOCH = 50000;
%Read and normalize data file.
X = normdata(dlmread('iris.data'));
X = shuffle(X);
%X_test = normdata(dlmread('iris2.data'));
%epocherrors = fopen('epocherrors.txt', 'w');
%Weight matrices.
%Features constitute size(X, 2)-1, however size is (X, 2) to allow for
%appending bias.
w_IH = rand(size(X, 2), NUM_HIDDEN)-(0.5*rand(size(X, 2), NUM_HIDDEN));
w_HO = rand(NUM_HIDDEN+1, NUM_OUT)-(0.5*rand(NUM_HIDDEN+1, NUM_OUT));%+1 for bias
%Layer nets
net_H = zeros(NUM_HIDDEN, 1);
net_O = zeros(NUM_OUT, 1);
%Layer outputs
out_H = zeros(NUM_HIDDEN, 1);
out_O = zeros(NUM_OUT, 1);
%Layer deltas
d_H = zeros(NUM_HIDDEN, 1);
d_O = zeros(NUM_OUT, 1);
%Control variables
error = inf;
epoch = 0;
%Run the algorithm.
while error > TARG_ERR && epoch < MAX_EPOCH
    for n=1:size(X, 1)
        x = [X(n, 1:size(X, 2)-1) 1]'; %Add bias for hiddens & transpose to column vector.
        o = X(n, size(X, 2));
        %Forward propagate.
        net_H = w_IH'*x; %Transposed w.
        out_H = [sigmoid(net_H); 1]; %Append 1 for bias to outputs
        net_O = w_HO'*out_H;
        out_O = sigmoid(net_O); %Again, transposed w.
        %Calculate output deltas.
        d_O = ((targetVec(o, NUM_OUT)-out_O) .* (out_O .* (1-out_O)));
        %Calculate hidden deltas.
        for i=1:size(w_HO, 1);
            delta_weight = 0;
            for j=1:size(w_HO, 2)
                delta_weight = delta_weight + d_O(j)*w_HO(i, j);
            end
            d_H(i) = (out_H(i)*(1-out_H(i)))*delta_weight;
        end
        %Update hidden-output weights
        for i=1:size(w_HO, 1)
            for j=1:size(w_HO, 2)
                w_HO(i, j) = w_HO(i, j) + (ALPHA*out_H(i)*d_O(j));
            end
        end
        %Update input-hidden weights.
        for i=1:size(w_IH, 1)
            for j=1:size(w_IH, 2)
                w_IH(i, j) = w_IH(i, j) + (ALPHA*x(i)*d_H(j));
            end
        end
        out_O
        o
        %out_H
        %w_IH
        %w_HO
        %d_O
        %d_H
    end
end
function outs = sigmoid(nets)
    outs = zeros(size(nets, 1), 1);
    for i=1:size(nets, 1)
        if nets(i) < -45
            outs(i) = 0;
        elseif nets(i) > 45
            outs(i) = 1;
        else
            outs(i) = 1/1+exp(-nets(i));
        end
    end
end
From what we've established in the comments, the only thing that comes to my mind is the set of recipes written down together in this great NN archive:
ftp://ftp.sas.com/pub/neural/FAQ2.html#questions
First things you could try are:
1) How to avoid overflow in the logistic function? Probably that's the problem; many times when I've implemented NNs, the problem was with such an overflow.
2) How should categories be encoded?
And more generally:
3) How does ill-conditioning affect NN training?
4) Help! My NN won't learn! What should I do?
After the discussion it turns out the problem lies within the sigmoid function:
function outs = sigmoid(nets)
    %...
    outs(i) = 1/1+exp(-nets(i)); % parenthesis missing!!!!!!
    %...
end
It should be:
function outs = sigmoid(nets)
    %...
    outs(i) = 1/(1+exp(-nets(i)));
    %...
end
The missing parentheses caused the sigmoid output to sometimes be larger than 1. That made the gradient calculation incorrect (because it wasn't the gradient of this function), which in turn made the gradient negative, so the delta for the output layer was most of the time in the wrong direction. After the fix (and after correctly maintaining the error variable, which seems to be missing in your code), all seems to work fine.
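For the error bookkeeping, one possibility (an assumption on my part that you want error to be the mean squared output error per epoch) is something like:
epoch_error = 0;                        % reset at the start of each pass over X
% ...inside the for-n loop, after the weight updates:
epoch_error = epoch_error + sum((targetVec(o, NUM_OUT) - out_O).^2);
% ...at the end of the while-loop body:
error = epoch_error / size(X, 1);       % checked by "while error > TARG_ERR"
epoch = epoch + 1;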
Besides that, there are two other main problems with this code:
1) No bias. Without the bias each neuron can only represent a line which crosses the origin. If data is normalized (i.e. values are between 0 and 1), some configurations are inseparable.
2) Lack of guarding against high gradient values (point 1 in my previous answer).
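For point 2, one simple guard against oversized updates (just a sketch; the clipping threshold is an arbitrary choice) is to clip each weight update before applying it, e.g. for the hidden-output weights:
max_step = 0.5;                                    % arbitrary clipping threshold
step = ALPHA * out_H(i) * d_O(j);
w_HO(i, j) = w_HO(i, j) + max(min(step, max_step), -max_step);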