Neural Network Diverging instead of converging

I have implemented a neural network (using CUDA) with 2 layers. (2 Neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, it is diverging (the output is becoming infinity).
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I randomized the initial weights.
I read that a neural network might diverge if the learning rate is too high, so I reduced the learning rate to 0.000001.
The two functions I am trying to get it to learn are 3*i + 7*j + 9 and j*j + i*i + 24 (I am giving the layer i and j as input).
I had implemented it as a single layer previously and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network but I'm not sure it would help it learn
I am using a linear (as in, no) activation function.
There is oscillation in the beginning, but the output starts diverging the moment any of the weights becomes greater than 1.
I have checked and rechecked my code, but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointer will be appreciated.

If the problem you are trying to solve is a classification problem, try a 3-layer network (3 is enough according to Kolmogorov). Connections from inputs A and B to hidden node C (C = A*wa + B*wb) represent a line in AB space. That line divides the correct and incorrect half-spaces. The connections from the hidden layer to the output put the hidden-layer values in correlation with each other, giving you the desired output.
Depending on your data, the error function may look like a hair comb, so implementing momentum should help. Keeping the learning rate at 1 proved optimal for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of a few subsequent sessions. If a session exceeds the maximum iterations, or the amplitude is too high, or the error is obviously high, the session has failed; start another.
At the beginning of each session, reinitialize your weights with random values in (-0.5, +0.5).
It really helps to chart your error descent. You will get that "Aha!" factor.
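For illustration, here is a toy momentum update on a single weight with a quadratic error (the 0.9 momentum coefficient is my assumption; the answer only fixes the learning rate at 1):

lr, beta = 1.0, 0.9          # learning rate 1 as suggested; beta is assumed
w, velocity = 0.4, 0.0       # random-style init in (-0.5, +0.5)

def grad(w):
    # gradient of a toy quadratic error E(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

for _ in range(200):
    velocity = beta * velocity - lr * grad(w)   # momentum carries past ripples
    w += velocity

print(w)   # approaches the minimum at w = 3; with beta = 0 this step size
           # just oscillates between 0.4 and 5.6 forever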

The most common reason for neural network code to diverge is that the coder has forgotten to put the negative sign in the weight-update expression.
Another reason could be a problem with the error expression used for calculating the gradients.
If neither of these holds, then we would need to see the code to answer.
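To make the sign issue concrete, here is a toy sketch (all numbers made up): with the minus sign the weight converges, and flipping it makes the very same loop blow up.

lr, w, x, t = 0.1, 0.0, 2.0, 6.0   # one weight, one sample: y = w*x, target t
for _ in range(20):
    y = w * x
    grad = (y - t) * x             # dE/dw for squared error E = (y - t)^2 / 2
    w -= lr * grad                 # correct: descends toward w = 3
    # w += lr * grad               # wrong sign: every step increases E, diverges
print(w)   # ~3.0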

Related

Force a neural network to have 0-sum outputs

I have a PyTorch neural net with n-dimensional output which I want to have 0-sum during training (my training data, i.e. the true outputs, have 0 sum). Of course I could just add a line computing the sum s and then subtracting s/n from each element of the output. But this way, the network would be driven even less to actually find outputs with zero sum, as this would get taken care of anyway (I've been getting worse test results with this approach). Also, since the true outputs in the training data have 0 sum, the network naturally converges to having almost 0-sum outputs, but not quite. Hence, I was wondering whether there is a smart way to force the network to have outputs that sum to 0, without just brute-force subtracting the sum at the end (which would undermine learning to output zero-sum values in the first place)? I.e. some sort of solution directly incorporated in the network? (Probably there isn't, at least I couldn't think of any...)
Your approach of explicitly subtracting the mean is the correct way. In the same way we use softmax to nicely parametrise distributions, you could complain that "this makes the network not learn about probability even more!", but in fact it does; it simply does so in its own, unnormalised space. Same in your case: by subtracting the mean you make sure that you match the target variable while allowing your network to focus on the hard problems, and not waste its compute on having to learn that the sum is zero. If you do anything else, your network will literally have to learn to compute the mean somewhere and subtract it. There are some potential corner cases where there might be some deep representational reason for the mean to be zero that could be argued for, but these cases are rare enough that the chances of this actually happening "magically" in the network are zero (and if you knew it was happening, there would be better ways of targeting it than by zeroing the sum).
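As a sketch of that idea (the model here is a made-up stand-in, not the asker's network), the mean subtraction can live inside the network as its final layer, so the zero-sum constraint holds exactly by construction:

import torch
import torch.nn as nn

class ZeroSum(nn.Module):
    def forward(self, x):
        # project onto the zero-sum subspace by subtracting the per-sample mean
        return x - x.mean(dim=-1, keepdim=True)

# hypothetical toy network standing in for the questioner's model
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 4), ZeroSum())
out = model(torch.randn(2, 4))
print(out.sum(dim=-1))   # ~0 for every sample, up to floating-point error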
What happens if you add an explicit loss?
pred = model(input)                       # forward pass
original_loss = criterion(pred, target)   # the usual task loss
# add this auxiliary loss to penalize outputs whose sum deviates from 0
zero_sum_loss = pred.mean() ** 2
loss = original_loss + weight * zero_sum_loss
optim.zero_grad()                         # clear old gradients first
loss.backward()
optim.step()
# ...

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested in implementing a spiking neuron model.
I've read a fair number of tutorials, but most of them seem to be about generating pulses, and I haven't found any application of them to a given input train.
Say for example I got input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, is the input multiplied by a weight, or does it only make use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = a*(b*v[n] - u[n])
where v[n] is the membrane potential and u[n] is a general recovery variable.
Are there any texts on implementations of Izhikevich or similar spiking neuron models for a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same for Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself does not include weights.
The two equations you mentioned model the membrane potential (v[]) over time of a point neuron. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike-detection mechanism on the source (pre-synaptic) cell, and a synaptic-current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term, and then become part of the I term (in the first equation above) for the target cell.
As a very simple example of a two-cell network: at every time step, you could check whether the pre- cell's v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony as in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes for how to sum them. The simplest one would be a plain sum; more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission
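Here is a minimal sketch of that two-cell scheme (my illustration; the parameters a, b, c, d are Izhikevich's classic "regular spiking" values, and the current magnitudes are chosen just to make the toy model fire, not taken from the answer):

a, b, c, d = 0.02, 0.2, -65.0, 8.0   # regular-spiking parameters
dt = 1.0                             # time step, ms

def izh_step(v, u, I):
    # Euler step of the two equations from the question, with spike reset
    v_new = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    u_new = u + dt * a * (b * v - u)
    spiked = v_new >= 30.0
    if spiked:
        v_new, u_new = c, u_new + d
    return v_new, u_new, spiked

v_pre, u_pre = c, b * c              # pre-synaptic cell state
v_post, u_post = c, b * c            # post-synaptic cell state
weight_pre_post = 0.5                # synaptic weight in [0, 1]
pre_spiked = False

for n in range(1000):
    # weighted synaptic current into the post- cell when the pre- cell fired
    I_post = 10.0 * weight_pre_post if pre_spiked else 0.0
    v_pre, u_pre, pre_spiked = izh_step(v_pre, u_pre, 10.0)  # constant drive
    v_post, u_post, _ = izh_step(v_post, u_post, I_post)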

Perceptron learns to reproduce just one pattern all the time

This is rather a weird problem.
I have backpropagation code which works perfectly.
Now, when I do batch learning I get wrong results, even for a simple scalar function approximation.
After training, the network produces almost the same output for all input patterns.
So far I've tried:
Introduced bias weights
Tried with and without updating of the input weights
Shuffled the patterns in batch learning
Tried updating after each pattern and also accumulating updates
Initialized the weights in different possible ways
Double-checked the code 10 times
Normalized the accumulated updates by the number of patterns
Tried different numbers of layers and neurons
Tried different activation functions
Tried different learning rates
Tried different numbers of epochs, from 50 to 10000
Tried to normalize the data
I noticed that after a bunch of backpropagation passes for just one pattern, the network produces almost the same output for a large variety of inputs.
When I try to approximate a function, I always get just a line (almost a line).
Related question: Neural Network Always Produces Same/Similar Outputs for Any Input
And the suggestion to add bias neurons didn't solve my problem.
I found a post which says:
When ANNs have trouble learning they often just learn to output the average output values, regardless of the inputs. I don't know if this is the case or why it would be happening with such a simple NN.
which describes my situation closely enough. But how to deal with it?
I am coming to the conclusion that the situation I encounter is actually to be expected. Really, for each net configuration, one may just "cut" all the connections up to the output layer. This is really possible, for example, by setting all hidden weights to near-zero, or by setting the biases to some insane values in order to oversaturate the hidden layer and make the output independent of the input. After that, we are free to adjust the output layer so that it just reproduces the output as-is, independently of the input. In batch learning, what happens is that the gradients get averaged and the net reproduces just the mean of the targets. The inputs do not play ANY role.
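A tiny numeric check of that last claim: if the output is forced to be a single constant regardless of the input, the squared error is minimized by the mean of the targets, which is exactly the flat line I keep getting.

targets = [1.0, 2.0, 6.0]              # toy target values
# for a constant output c, d/dc sum((c - t)^2) = 0 gives c = mean(targets)
c = sum(targets) / len(targets)
print(c)   # 3.0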
My answer cannot be fully precise because you have not posted the contents of the functions perceptron(...) and backpropagation(...).
But from what I can guess, you train your network many times on ONE pattern, then completely on ONE other, in a loop for data in training_data, which means your network will only remember the last one. Instead, try training your network on every pattern once, then do that again many times (invert the order of your nested loops).
In other words, the for I = 1:number of patterns loop should be inside the backpropagation(...) function's loop, so this function should contain two loops.
EXAMPLE (in C#):
Here are some parts of a backpropagation function; I have simplified it here. At each update of the weights and biases, the entire network is "propagated". The following code can be found at this URL: https://visualstudiomagazine.com/articles/2015/04/01/back-propagation-using-c.aspx
public double[] Train(double[][] trainData, int maxEpochs, double learnRate, double momentum)
{
    // ...
    Shuffle(sequence); // visit each training item in random order
    for (int ii = 0; ii < trainData.Length; ++ii)
    {
        // ...
        ComputeOutputs(xValues); // copy xValues in, compute outputs
        // ...
        // find new weights and biases
        // update weights and biases
        // ...
    } // each training item
}
Maybe what is not working is just that you need to enclose everything after this comment (in your batch-learning code, for example) in a secondary for loop that does multiple epochs of learning:
%--------------------------------------------------------------------------
%% Get all updates
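For what it's worth, here is a tiny runnable version of that nesting in Python (a toy one-neuron model, purely illustrative): the epoch loop is the outer one, and every pattern is visited once per epoch.

import random

# toy model: one linear neuron y = w*x + bias, squared-error gradient steps
w, bias, lr = random.uniform(-0.5, 0.5), 0.0, 0.01
training_data = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

for epoch in range(2000):               # outer loop: many epochs
    random.shuffle(training_data)       # visit patterns in random order
    for x, target in training_data:     # inner loop: every pattern once
        y = w * x + bias                # forward pass
        error = y - target
        w -= lr * error * x             # gradient step
        bias -= lr * error

print(w, bias)   # approaches 3.0 and 1.0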

Backpropagation neural network - error not converging

I am using the backpropagation algorithm for my model. It works perfectly fine for a simple XOR case, and it also worked when I tested it on a smaller subset of my actual data.
There are 3 inputs in total and a single output (0, 1 or 2).
I have split the data set into a training set (80%, amounting to approx. 5.5k examples) and the remaining 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below:
from sklearn import preprocessing
# scale each input feature into the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer, with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, Epochs = 100,000. Any higher trainingRate shoots the error up to NaN. Momentum values between 0.5 and 0.9 work fine; any other value makes the error NaN.
I tried various numbers of nodes in the hidden layer, such as 3, 6, 9 and 10, and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with a Gaussian activation function, but I cannot reduce the error whatsoever.
Is it because of outliers? Do I need to clean those values from the training data?
Any suggestion would be of great help, be it about the activation function, hidden layers, etc. I have been trying to get this working for quite some time and I am sort of stuck now.
Well, I'm having a similar kind of problem and still haven't fixed it, but I can tell you a couple of things I have found. I think the net is overfitting: my error at some point goes down and then starts going up again, and the same happens on the validation set. Is this your case too?
Check that you are implementing the "early stopping" algorithm correctly. Most of the time, the problem is not the backpropagation but the error analysis or the validation analysis.
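As a skeleton of that check (train_epoch and validate are hypothetical stand-ins for your own training pass and validation-set evaluation; here they are stubbed so the loop runs):

import random

def train_epoch():      # stand-in: replace with one backprop pass over the training set
    return random.random()

def validate():         # stand-in: replace with the error on the 20% validation split
    return random.random()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(100000):
    train_epoch()
    val_err = validate()
    if val_err < best_val:            # validation improved: keep training
        best_val, bad_epochs = val_err, 0
        # this is also the point to snapshot the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:    # validation stopped improving
            break                     # likely overfitting from here on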
Hope this helps!

Does it make any sense that weights and threshold are growing proportionally when training my perceptron?

I am taking my first steps in neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here weights is an n-length vector which also contains the weight from the bias neuron (the negative threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result, and {input,1} is the input augmented with a 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logical AND, the weights don't converge for a long time; instead they keep growing proportionally and maintain a ratio of circa -1.5 with the threshold. For instance, the three weights are, in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation for this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very large positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high; try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it helps to also look at correct and result.
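To see the first point concretely, here is a runnable toy version (my sketch) of the same on-line rule with a step activation learning logical AND; once every pattern is classified correctly, (correct - result) is exactly zero and the weights stop growing:

def step_fn(x):
    return 1.0 if x > 0 else 0.0

# patterns: (input1, input2, fixed bias input 1) -> AND target
data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
weights = [0.0, 0.0, 0.0]    # last entry is the bias weight (-threshold)
lr = 0.1

for epoch in range(100):
    changed = False
    for inputs, correct in data:
        result = step_fn(sum(w * x for w, x in zip(weights, inputs)))
        if result != correct:
            weights = [w + lr * (correct - result) * x
                       for w, x in zip(weights, inputs)]
            changed = True
    if not changed:          # all four patterns correct: updates are zero
        break

print(weights)   # converges to a finite separator, e.g. [0.2, 0.1, -0.2]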
