Two layer neural network performs worse than single layer

I'm learning TensorFlow, and trying to create a simple two layer neural network.
The tutorial code https://www.tensorflow.org/get_started/mnist/pros starts with this simple network, to get 92% accuracy:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
I tried replacing it with this very simple network, adding a new layer, but accuracy now drops to 84%!!!
layer1_len = 10
w1 = weight_var([784, layer1_len])
b1 = bias_var([layer1_len])
o1 = tf.nn.relu(tf.matmul(x, w1) + b1)
w2 = weight_var([layer1_len, 10])
b2 = bias_var([10])
y = tf.nn.softmax(tf.matmul(o1, w2) + b2)
I get that result with several different values for layer1_len as well as different numbers of training steps. (Note that if I omit the weight_var and bias_var random initialization, and keep everything at zero, accuracy drops to close to 10%, essentially no better than guessing.)
What am I doing wrong?

There is nothing wrong. The problem is that increasing the number of layers does not automatically mean higher accuracy (otherwise machine learning would be kind of solved: if you needed better accuracy in an image classifier, you would just add one more layer to an Inception model and claim victory).
To show that this is not only your problem, take a look at this high-profile paper: Deep Residual Learning for Image Recognition, where the authors observe that increasing the number of layers decreases the score (which is not the important part) and propose an architecture to overcome this problem (which is the important part). Here is a small excerpt from it:
The deeper network has higher training error and thus test error.
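For reference, here is a rough sketch of the question's two-layer network with a wider hidden layer and small random initialisation (the helper definitions and the value of layer1_len are illustrative assumptions, not taken from the question or the answer); a 10-unit hidden layer is a very tight bottleneck for 784-dimensional MNIST inputs, so widening it typically recovers and exceeds the single-layer accuracy:
import tensorflow as tf  # TF 1.x API, as in the tutorial

def weight_var(shape):
    # small random init to break symmetry (assumed implementation)
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_var(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))

x = tf.placeholder(tf.float32, [None, 784])

layer1_len = 128  # illustrative; much wider than the 10 units in the question

w1 = weight_var([784, layer1_len])
b1 = bias_var([layer1_len])
o1 = tf.nn.relu(tf.matmul(x, w1) + b1)

w2 = weight_var([layer1_len, 10])
b2 = bias_var([10])
logits = tf.matmul(o1, w2) + b2
y = tf.nn.softmax(logits)  # keep logits separate for a numerically stable loss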

Related

In backpropagation, what does it mean when the error of a neural network converges to 0.5?

I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations which include bias terms.
Back-propagation equations matrix form:
Visual representation of the problem and Network:
clear; clc; close all;

# Initialize weights and bias from input to hidden layer
W1 = rand(3,4)
b1 = ones(3,1)

# Initialize weights and bias from hidden to output layer
W2 = rand(2,3)
b2 = ones(2,1)

# Define the sigmoid function and its derivative (anonymous function handles)
s  = @(z) 1./(1 + exp(-z));
ds = @(z) s(z).*(1 - s(z));

data = csvread("data.txt");

for j = 1 : 100
  for i = 1 : length(data)
    x0 = data(i,2:5)';

    # Find the truth (one-hot target)
    if data(i,6) == 1
      t = [1;0];
    else
      t = [0;1];
    end

    # Forward propagate
    x1 = s(W1*x0 + b1);
    x2 = s(W2*x1 + b2);

    iter = (j-1)*length(data) + i;
    E(iter) = norm(x2 - t)^2;
    E(length(E))                       # display the current error

    # Back propagate
    delta2 = (x2 - t).*ds(W2*x1 + b2);
    delta1 = (W2'*delta2).*ds(W1*x0 + b1);
    dedw2 = delta2*x1';
    dedw1 = delta1*x0';

    # Gradient descent step with a linearly decaying learning rate
    alpha = 0.001*(40000 - iter)/40000;
    W2 = W2 - alpha*dedw2;
    W1 = W1 - alpha*dedw1;
    b2 = b2 - alpha*delta2;
    b1 = b1 - alpha*delta1;
  end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')
When I run this, I converge on weights that give a constant error of 0.5 rather than 0.0. The error plot looks something like this, depending on the initial samples of W1 and W2:
The resulting weights W1 and W2 yield an output of ~[0.5, 0.5] for the whole set rather than [1,0] (isStairs = true) or [0,1] (isStairs = false).
Other information:
If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case (in 20 iterations or so), so I assume my derivatives are correct?
For the model to converge the learning rate has to be insanely small. Not sure what this means.
Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?
The NN learns from data. If there is only one example, it will learn that example by heart and you get zero error. But if you have more examples, they will likely not lie on a nice curve; they are noisy instead. So it is harder for the network to learn the data by heart (it also depends on the number of free parameters the NN has, but you get the idea)... However, you don't want the NN to learn everything in detail. You want it to learn the overall trend (so not the noise). But this also means that your error won't converge to zero, since there is noise that your NN should not learn... So don't worry if you have a (small) error at the end.
But what about the learning rate? Well, imagine you have 10 examples. Eight of them describe a perfect line, but two exhibit noise: one slightly to the right (let's say +1) and the other slightly to the left (-1). If the NN estimates one of those points and updates to minimize the error drawn from it, the update will jump from + to - or vice versa. Depending on your learning rate, this jumping may eventually converge to the middle (which is the correct function) or may go on forever... This is essentially what the learning rate does: it determines how much impact an estimation error has on the update/learning of the network. So a good idea is to choose a larger learning rate at the beginning (where the network performs really badly due to its random initialization) and decrease the rate once it has already learned something. You can achieve the same thing with a small learning rate, but it will take longer ;)
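As a rough illustration of that idea, here is a tiny sketch of a linearly decaying learning rate (a Python sketch with illustrative numbers, not taken from the answer):
def learning_rate(step, total_steps, lr_start=0.01, lr_end=0.0005):
    # interpolate linearly from lr_start down to lr_end over the whole run
    frac = min(float(step) / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)

# e.g. inside the training loop: alpha = learning_rate(iteration, 40000)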

How to find if a data set can train a neural network?

I'm a newbie to machine learning and this is one of the first real-world ML tasks I've been challenged with.
Some experimental data contains 512 independent boolean features and a boolean result.
There are about 1e6 real experiment records in the provided data set.
In the classic XOR example, all 4 out of 4 possible states are required to train the NN. In my case, only about 2^20 of the 2^512 possible states are covered, i.e. a fraction of roughly 2^(20-512) = 2^-492, which is essentially zero.
I have no more information about the data nature, just these (512 + 1) * 1e6 bits.
I tried an NN with 1 hidden layer on the available data. The outputs of the trained NN, even on samples from the training set, are always close to 0; not a single one is close to 1. I played with weight initialization and the gradient descent learning rate.
My code uses TensorFlow 1.3 and Python 3. Model excerpt:
with tf.name_scope("Layer1"):
    #W1 = tf.Variable(tf.random_uniform([512, innerN], minval=-2/512, maxval=2/512), name="Weights_1")
    W1 = tf.Variable(tf.zeros([512, innerN]), name="Weights_1")
    b1 = tf.Variable(tf.zeros([1]), name="Bias_1")
    Out1 = tf.sigmoid(tf.matmul(x, W1) + b1)

with tf.name_scope("Layer2"):
    W2 = tf.Variable(tf.random_uniform([innerN, 1], minval=-2/512, maxval=2/512), name="Weights_2")
    #W2 = tf.Variable(tf.zeros([innerN, 1]), name="Weights_2")
    b2 = tf.Variable(tf.zeros([1]), name="Bias_2")
    y = tf.nn.sigmoid(tf.matmul(Out1, W2) + b2)

with tf.name_scope("Training"):
    y_ = tf.placeholder(tf.float32, [None, 1])
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=y_, logits=y)
    )
    train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)

with tf.name_scope("Testing"):
    # Test trained model
    correct_prediction = tf.equal(tf.round(y), tf.round(y_))
    # ...

# Train
for step in range(500):
    batch_xs, batch_ys = Datasets.train.next_batch(300, shuffle=False)
    _, my_y, summary = sess.run([train_step, y, merged_summaries],
                                feed_dict={x: batch_xs, y_: batch_ys})
I suspect two cases:
my fault – bad NN implementation, wrong architecture;
bad data. Compared to the XOR example, incomplete training data would result in a failing NN. However, the training examples fed to the trained NN should at least give the right predictions, shouldn't they?
How can I evaluate whether it is possible at all to train a neural network (a 2-layer perceptron) on the provided data to forecast the result? An acceptable data set would be something like the XOR example, as opposed to pure random noise.
There are only ad hoc ways to know if it is possible to learn a function with a differentiable network from a dataset. That said, these ad hoc ways do usually work. For example, the network should be able to overfit the training set without any regularisation.
A common technique to gauge this is to only fit the network on a subset of the full dataset. Check that the network can overfit to that, then increase the size of the subset, and increase the size of the network as well. Unfortunately, deciding whether to add extra layers or add more units in a hidden layer is an arbitrary decision you'll have to make.
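For example, a minimal sketch of that check, using scikit-learn's MLPClassifier for brevity (an assumption; the question uses raw TensorFlow, but the idea is framework-independent):
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.load("features.npy")  # placeholder path: (1e6, 512) boolean features as 0/1
y = np.load("labels.npy")    # placeholder path: (1e6,) boolean targets as 0/1

for n in (1000, 10000, 100000):
    clf = MLPClassifier(hidden_layer_sizes=(64,), alpha=0.0,  # no regularisation
                        max_iter=2000)
    clf.fit(X[:n], y[:n])
    # training accuracy on the subset should approach 1.0 before growing n
    print(n, clf.score(X[:n], y[:n]))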
However, looking at your code, there are a few things that could be going wrong here:
Are your outputs balanced? By that I mean, do you have the same number of 1s as 0s in the dataset targets?
Your initialisation in the first layer is all zeros; the gradient into that layer will then be zero, so it can't learn anything (although you do have a real initialisation commented out above it).
Sigmoid nonlinearities are more difficult to optimise than simpler nonlinearities, such as ReLUs.
I'd recommend using the built-in definitions for layers in Tensorflow to not worry about initialisation, and switching to ReLUs in any hidden layers (you need sigmoid at the output for your boolean target).
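A minimal sketch of that suggestion, assuming the TF 1.x API and the x, y_, and innerN from the question; it also feeds raw logits into a sigmoid cross-entropy, which matches the single boolean output:
hidden = tf.layers.dense(x, innerN, activation=tf.nn.relu)  # built-in layer with sensible default init
logits = tf.layers.dense(hidden, 1, activation=None)        # raw logits for the single output
y = tf.nn.sigmoid(logits)                                   # probability of the positive class
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)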
Finally, deep learning isn't actually very good at most "bag of features" machine learning problems because they lack structure. For example, the order of the features doesn't matter. Other methods often work better, but if you really want to use deep learning then you could look at this recent paper, showing improved performance by just using a very specific nonlinearity and weight initialisation (change 4 lines in your code above).

LSTM vs. Hidden Layer Training in Tensorflow

I am messing around with LSTMs and have a conceptual question. I created a matrix of bogus data on the following rules:
For each 1-D list in the matrix:
If previous element is less than 10, then this next element is the previous one plus 1.
Else, this element is sin(previous element)
This way, it is a sequence that is pretty simply based on the previous information. I set up an LSTM to learn the recurrence and ran it to train on the lists one at a time. I have an LSTM layer followed by a fully connected feed-forward layer. It learns the +1 step very easily, but has trouble with the sin step. It will seemingly pick a random number between -1 and 1 when making the next element when the previous one was greater than 10. My question is this: is the training only modifying the variables in my fully connected feed forward layer? Is that why it can't learn the non-linear sin function?
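For concreteness, a sequence following those rules could be generated like this (a sketch; the length and starting values are illustrative assumptions):
import numpy as np

def make_sequence(start, length=50):
    seq = [start]
    for _ in range(length - 1):
        prev = seq[-1]
        seq.append(prev + 1.0 if prev < 10 else np.sin(prev))
    return seq

# one row per 1-D list, as described above
data = np.array([make_sequence(s) for s in np.random.uniform(0, 5, size=100)])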
Here's the code snippet in question:
lstm = rnn_cell.LSTMCell(lstmSize)
y_ = tf.placeholder(tf.float32, [None, OS])
outputs, state = rnn.rnn(lstm, x, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
weights = tf.Variable(tf.truncated_normal([lstmSize, OS]))
bias = tf.Variable(tf.constant(0.1, shape=[OS]))
y = tf.nn.elu(tf.matmul(last, weights) + bias)
error = tf.reduce_mean(tf.square(tf.sub(y_, y)))
train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(error)
The error and shape organization seems to be correct, at least in the sense that it does learn the +1 step quickly without crashing. Shouldn't the LSTM be able to handle the non-linear sin function? It seems almost trivially easy, so my guess is that I set something up wrong and the LSTM isn't learning anything.

Time Series Ahead Prediction in Neural Network (N Point Ahead Prediction) Large Scale Iterative Training

(N=90) Point ahead Prediction using Neural Network:
I am trying to predict 3 minutes ahead, i.e. 180 points ahead. Because I compressed my time series data by taking the mean of every 2 points as one, I have to make an (N=90) step-ahead prediction.
My time series data is given in seconds. The values are between 30 and 90. They usually move from 30 to 90 and from 90 to 30, as seen in the example below.
My data can be reached at: https://www.dropbox.com/s/uq4uix8067ti4i3/17HourTrace.mat
I am having trouble implementing a neural network to predict N points ahead. My only feature is the previous values. I used an Elman recurrent neural network and also newff.
In my scenario I need to predict 90 points ahead. Here is how I separated my input and target data manually:
For Example:
data_in = [1,2,3,4,5,6,7,8,9,10]; % imagine 1:10 just denotes the array index values
N = 90; %predicted second ahead.
P(:, :)        T(:)     |   (it could also be done with a stride of 2 time steps)   P(:, :)         T(:)
[1,2,3,4,5]    [5+N]    |                                                           [1,3,5,7,9]     [9+N]
[2,3,4,5,6]    [6+N]    |                                                           [2,4,6,8,10]    [10+N]
...
until it reaches the end of the data
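In code, that windowing could look roughly like this (a Python sketch; the window length and stride are illustrative assumptions):
import numpy as np

def make_windows(series, window=5, horizon=90, stride=1):
    # inputs are `window` past values; the target is `horizon` steps after the last input
    P, T = [], []
    last_start = len(series) - (window - 1) * stride - horizon
    for start in range(last_start):
        idx = range(start, start + window * stride, stride)
        P.append([series[k] for k in idx])
        T.append(series[start + (window - 1) * stride + horizon])
    return np.array(P), np.array(T)

# e.g. P, T = make_windows(values, window=100, horizon=90)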
I have 100 input points and 90 output points in my Elman recurrent neural network. What would be the most efficient hidden node size?
input_layer_size = 90;
NodeNum1 =90;
net = newelm(threshold,[NodeNum1 ,prediction_ahead],{'tansig', 'purelin'});
net.trainParam.lr = 0.1;
net.trainParam.goal = 1e-3;
% At the beginning of my training I filter the data with a Kalman filter, normalize it into the range [0,1], and then shuffle it.
1) I wasn't able to train on my complete data. First I tried to train on the complete data, which is around 900,000 points, and that didn't give me a solution.
2) Secondly, I tried iterative training, but in each iteration the newly added data is merged with the already trained data. After 20,000 trained points the accuracy starts to decrease. The first 1000 trained points fit perfectly in training, but as I iteratively merge new data and continue training, the training accuracy drops very rapidly, from 90 to 20.
For example.
P = P_test(1:1000); T = T_test(1:1000); counter = 1;
while(1)
    net = train(net, P, T, [], []);           % train until it reaches the minimum error
    [normTrainOutput] = sim(net, P, [], []);
    P = [P P(counter*1000:counter*2000)];     % iteratively append a new training portion of the data
    counter = counter + 1;
end
This approach is very slow, and after a point it doesn't give any good results.
My third approach was also iterative training; it was similar to the previous one, but in each iteration I train on only a 1000-point portion of the data, without merging it with the previously trained data. For example, I train on the first 1000 points until it reaches the minimum error, with >95% accuracy. Once it has been trained and I do the same for the second 1000-point portion, it overwrites the weights and the predictor mainly behaves like the latest trained portion of the data.
P = P_test(1:1000); T = T_test(1:1000); counter = 1;
while(1)
    net = train(net, P, T, [], []);           % I also tried adapt()
    [normTrainOutput] = sim(net, P, [], []);
    P = [P(counter*1000:counter*2000)];       % iteratively only a 1000-point portion of the data is used
    counter = counter + 1;
end
Trained data: this figure is a snapshot from my training set; the blue line is the original time series and the red line is the values predicted by the trained neural network. The MSE is around 50.
Tested data: in the picture below you can see my prediction for the test data with the neural network, which was trained with 20,000 input points while keeping the MSE below 50 on the training data set. It is able to catch a few patterns, but mostly it doesn't give really good accuracy.
I wasn't able to make any of these approaches succeed. In each iteration I also observe that a slight change in alpha completely overwrites what was already learned and shifts the focus onto the currently trained portion of the data.
I haven't been able to come up with a solution to this problem. In iterative training, should I keep the learning rate and the number of epochs small?
And I couldn't find an efficient way to predict 90 points ahead in a time series. Any suggestions on what I should do in order to predict N points ahead, or any tutorial or link with more information, would be appreciated.
What is the best way to do iterative training? With my second approach, when I reach 15,000 trained points, the training accuracy suddenly starts to drop. Should I change alpha at run time as training proceeds?
==========
Any suggestion or the things I am doing wrong would be very appreciated.
I also implemented a recurrent neural network, but when training on large data I faced the same problems. Is it possible to do adaptive learning (online learning) in recurrent neural networks (newelm)? The weights won't update themselves and I didn't see any improvement.
If yes, how is it possible, and which functions should I use?
net = newelm(threshold, [6, 8, 90], {'tansig', 'tansig', 'purelin'});
net.trainFcn = 'trains';
batch_size = 10;
while(1)
    net = train(net, Pt(:, k:k+batch_size), Tt(:, k:k+batch_size));
end
Have a look at Echo State Networks (ESNs) or other forms of Reservoir Computing. They are perfect for time series prediction, very easy to use and converge fast. You don't need to worry about the structure of the network at all (every neuron in the mid-layer has random weights which do not change). You only learn the output weights.
If I understood the problem correctly, with Echo State Networks, I would just train the network to predict the next point AND 90 points ahead. This can be done by simply forcing the desired output in the output neurons and then performing ridge regression to learn the output weights.
When running the network after having trained it, at every step n it would output the next point (n+1), which you feed back to the network as input (to continue the iteration), and the point 90 steps ahead (n+90), which you can do whatever you want with - e.g. you could also feed it back to the network so that it affects the next outputs.
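To make that training scheme concrete, here is a minimal NumPy sketch of an echo state network trained with ridge regression (the reservoir size, scaling, and the variable series are illustrative assumptions, not taken from the answer):
import numpy as np

n_res, n_in, n_out = 500, 1, 2                 # outputs: next point and 90 steps ahead
rng = np.random.default_rng(0)
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))   # fixed random input weights
W = rng.uniform(-0.5, 0.5, (n_res, n_res))     # fixed random reservoir weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # keep the spectral radius below 1

def run_reservoir(u_seq):
    x = np.zeros(n_res)
    states = []
    for u in u_seq:                             # drive the reservoir with the input series
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)

u = np.asarray(series, dtype=float)             # `series` is the raw time series (placeholder)
X = run_reservoir(u[:-90])                      # reservoir state at each time step
Y = np.column_stack([u[1:-89], u[90:]])         # teacher outputs: u[t+1] and u[t+90]
ridge = 1e-6                                    # regularisation strength
W_out = Y.T @ X @ np.linalg.inv(X.T @ X + ridge * np.eye(n_res))  # only the output weights are learned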
Sorry if the answer is not very clear. It's hard to explain how reservoir computing works in a short answer, but if you just read the article in the link, you will find it very easy to understand the principles.
If you do decide to use ESNs, also read this paper to understand the most important property of ESNs and really know what you're doing.
EDIT: Depending on how "predictable" your system is, predicting 90 points ahead may still be very difficult. For example if you're trying to predict a chaotic system, noise would introduce very large errors if you're predicting far ahead.
Use fuzzy logic with membership functions to predict the future data; it will be an efficient method.

How to update the bias in neural network backpropagation?

Could someone please explain to me how to update the bias throughout backpropagation?
I've read quite a few books, but can't find bias updating!
I understand that bias is an extra input of 1 with a weight attached to it (for each neuron). There must be a formula.
Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)
∂E/∂w[i,j] = delta[j] * o[i]
where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.
These values can then be used in weight updates, e.g.
// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]
where gamma is the learning rate.
The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is
bias[j] -= gamma_bias * 1 * delta[j]
where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
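As a small numeric illustration of these update rules (a sketch; the values are made up):
import numpy as np

gamma = 0.1                                # learning rate
o_prev = np.array([0.2, 0.7])              # outputs o[i] of the previous layer
delta = np.array([0.05, -0.03, 0.08])      # delta[j] of the current layer
W = np.zeros((2, 3))                       # W[i, j]: weight from neuron i to neuron j
b = np.zeros(3)                            # bias weights of the current layer

W -= gamma * np.outer(o_prev, delta)       # w[i,j] -= gamma * o[i] * delta[j]
b -= gamma * delta                         # bias[j] -= gamma * 1 * delta[j]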
The amount you change each individual weight and bias will be the partial derivative of your cost function in relation to each individual weight and each individual bias.
∂C/∂b[j]  (the partial derivative of the cost with respect to the j-th bias in the network)
Since your cost function probably doesn't explicitly depend on individual weights and values (Cost might equal (network output - expected output)^2, for example), you'll need to relate the partial derivatives of each weight and bias to something you know, i.e. the activation values (outputs) of neurons. Here's a great guide to doing this:
https://medium.com/@erikhallstrm/backpropagation-from-the-beginning-77356edf427d
This guide states how to do these things clearly, but can sometimes be lacking on explanation. I found it very helpful to read chapters 1 and 2 of this book as I read the guide linked above:
http://neuralnetworksanddeeplearning.com/chap1.html
(provides essential background for the answer to your question)
http://neuralnetworksanddeeplearning.com/chap2.html
(answers your question)
Basically, biases are updated in the same way that weights are updated: a change is determined based on the gradient of the cost function at a multi-dimensional point.
Think of the problem your network is trying to solve as being a landscape of multi-dimensional hills and valleys (gradients). This landscape is a graphical representation of how your cost changes with changing weights and biases. The goal of a neural network is to reach the lowest point in this landscape, thereby finding the smallest cost and minimizing error. If you imagine your network as a traveler trying to reach the bottom of these gradients (i.e. gradient descent), then the amount you will change each weight (and bias) by is related to the slope of the incline (gradient of the function) that the traveler is currently climbing down. The exact location of the traveler is given by a multi-dimensional coordinate point (weight1, weight2, weight3, ..., weight_n), where the bias can be thought of as another kind of weight. Thinking of the weights/biases of a network as the variables of the network's cost function makes it clear that ∂C/∂b[j] must be used.
I understand that the function of bias is to level-adjust the input values. Below is what happens inside the neuron. The activation function of course produces the final output, but it is left out here for clarity.
O = W1 I1 + W2 I2 + W3 I3
In a real neuron, something happens already at the synapses: the input data is level-adjusted by the average of the samples and scaled by their deviation. Thus the input data is normalized, and with equal weights the inputs all have the same effect. The normalized In is calculated from the raw data in (n is the index):
Bn = average(in);   Sn = 1/stdev(in);   In = (in - Bn) * Sn
However, this does not need to be performed separately, because the neuron weights and bias can do the same job. When you substitute In with in, you get the new formula
O = w1 i1 + w2 i2 + w3 i3 + wbs
where the last term wbs is the bias and the new weights wn are
wn = Wn * Sn
wbs = -(W1 B1 S1 + W2 B2 S2 + W3 B3 S3)
So there exists a bias, and it will/should be adjusted automagically by backpropagation.
