I'm trying to implement a multilayer perceptron with backpropagation and a single hidden layer in MATLAB. The objective is to replicate a function with two inputs and one output.
The problem I'm having is that the error decreases with every epoch but then reaches a plateau and doesn't seem to improve, as seen in the chart.
This is an image of all the errors during a single epoch:
As you can see, there are some extreme cases that are not being handled correctly.
I'm using:
Weights initialized from -1 to 1
Mean Square Error
Variable number of hidden neurons
Momentum
Randomized input order
No bias
Tanh activation function for the hidden layer
Identity as the activation function of the output layer
Inputs in range of -3 to 3
Min-Max normalization of inputs
I have tried changing the number of neurons in the hidden layer and lowering the learning rate to very small values, but nothing seems to help.
Here is the MATLAB code:
clc
clear
%%%%%%% DEFINITIONS %%%%%%%%
i=0;
S=0;
X=rand(1000,2)*6-3; %generate inputs between -3,+3
Xval=rand(200,2)*6-3; %validation inputs
Number_Neurons=360;
Wh=rand(Number_Neurons,2)*2-1; %hidden weights
Wo=rand(Number_Neurons,1)*2-1; %output weights
Learn=.001;% learning factor
momentumWh=0; %momentums
momentumWo=0;
a=.01;%momentum factor
WoN=Wo; %new weight
fxy=@(x,y) (3.*(1-x).^2).*(exp(-x.^2-(y+1).^2))-10.*(x./5-x.^3-y.^5).*(exp(-x.^2-y.^2))-(exp(-(x+1).^2-y.^2))./3; %function to be replicated (MATLAB's "peaks")
fh=@(x) tanh(x); %hidden layer activation function
dfh=@(x) 1-tanh(x).^2; %derivative
fo=@(x) x; %output layer activation function
dfo=@(x) 1; %derivative
%%GRAPH FUNCTION
%[Xg,Yg]=meshgrid(X(:,1),X(:,2));
% Y=fxy(Xg,Yg);
% surf(Xg,Yg,Y)
%%%%%%%%%
Yr=fxy(X(:,1),X(:,2)); %Y real
Yval=fxy(Xval(:,1),Xval(:,2)); %validation Y
Epoch=1;
Xn=(X+3)/6;%%%min max normalization
Xnval=(Xval+3)/6;
E=ones(1,length(Yr));% error
Eval=ones(1,length(Yval));%validation error
MSE=1;
%%%%% ITERATION %%%%%
while 1
N=1;
perm=randperm(size(X,1)); %%%permute the input order
Yrand=Yr(perm); %permuted outputs
Xrand=Xn(perm,:); %permuted normalized inputs
while N<=length(Yr) %epoch
%%%%%% forward pass %%%%%
S=Wh*Xrand(N,:)'; %input multiplied by hidden weights
Z=fh(S); %activation function of hidden layer
Yin=Z.*Wo; %output of hidden layer multiplied by output weights
Yins=sum(Yin); %sum the weighted hidden outputs
Yc=fo(Yins);% activation function of output layer, Predicted Y
E(N)=Yrand(N)-Yc; %error
%%%%%%%% back propagation %%%%%%%%%%%%%
do=E(N).*dfo(Yins); %delta of output layer
DWo=Learn*(do.*Z)+a*momentumWo; %Gradient of output layer
WoN=Wo+DWo;%New output weight
momentumWo=DWo; %store momentum
dh=do.*Wo.*dfh(S); %delta of hidden layer
DWh1=Learn.*dh.*Xrand(N,1); %Gradient of hidden layer
DWh2=Learn.*dh.*Xrand(N,2);
DWh=[DWh1 DWh2]+a*momentumWh;%Gradient of hidden layer
Wh=Wh+DWh; %new hidden layer weights
momentumWh=DWh; %store momentum
Wo=WoN; %update output weight
N=N+1; %next value
end
MSET(Epoch)=(sum(E.^2))/length(E); %Mean Square Error Training
N=1;
%%%%%% validation %%%%%%%
while N<=length(Yval)
S=Wh*Xnval(N,:)';
Z=fh(S);
Yin=Z.*Wo;
Yins=sum(Yin);
Yc=fo(Yins);
Eval(N)=Yc-Yval(N);
N=N+1;
end
MSE(Epoch)=(sum(Eval.^2))/length(Eval); %validation Mean Square Error
if MSE(Epoch)<=1 %stop condition
break
end
disp(MSET(Epoch))
disp(MSE(Epoch))
Epoch=Epoch+1; %next epoch
end
There are a number of factors that can come into play for the particular problem that you are trying to solve:
The Complexity of the Problem: Is the problem considered easy for a neural network to solve (If using a standard dataset, have you compared the results to other studies?)
The Inputs: Are the inputs strongly related to the output? Are there more inputs that you can add to the NN? Are they preprocessed correctly?
Local Minima vs Global Minima: Are you sure that the problem has stopped in a local minima (A place where the NN gets stuck in learning that stops the NN from reaching a more optimal solution)?
Outputs: Are the output samples skewed in some way? Is this a binary output kind of problem, and are there enough samples on both sides?
Activation Function: Is there another appropriate Activation Function for the problem?
Then there are the hidden layers, neurons, learning rate, momentum, epochs, etc., which you appear to have trialled.
Based on the chart, this is roughly the kind of learning performance you would expect from a BPNN; however, trial and error is sometimes required to optimise the result from there.
I would try to work on the above options (particularly pre-processing of data) and see if this helps in your case.
Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work, for a few reasons. Essentially, computing all inputs this way is like running the network with a batch size equal to the number of inputs: the weights are not updated between inputs but all at once, so each input no longer influences training step by step. However, with a reasonable batch size you can still use 2D matrix multiplications, where the input is batch_size by input_size, to speed up training.
In addition, when predicting on many inputs (in the test stage, for example), no weights are updated, so a single matrix multiplication of num_inputs by input_size can be run to compute all outputs in parallel.
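For anyone who wants something concrete, here is a minimal numpy sketch of the mini-batch version. The toy data, batch_size, and the squared-error-style gradient are my own assumptions, not from my original network:

import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 10))   # 100 inputs of size 10, as in the question
targets = rng.normal(size=(100, 2))   # 2 outputs per input
weights = rng.normal(size=(10, 2))    # 10x2 weight matrix
learning_rate = 0.01
batch_size = 10                       # assumed mini-batch size

for start in range(0, len(inputs), batch_size):
    x = inputs[start:start + batch_size]   # batch_size x 10
    y = targets[start:start + batch_size]  # batch_size x 2
    output = x @ weights                    # batch_size x 2
    output_err = output - y                 # error per sample
    weights_err = x.T @ output_err          # 10x2, summed over the batch
    weights -= learning_rate * weights_err / len(x)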
How does the gradient descent algorithm update the batch weights in the back-propagation method?
Thanks in advance ...
It is really easy once you understand the algorithm.
New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters
Let's consider a neural network with two inputs, two hidden neurons, two output neurons.
First, introduce weights and bias to your network. Then, compute total net input for hidden layer, as such
net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1
Do the same for the other hidden neurons.
Next, we can now calculate the error for each output neuron using the squared error function and sum them to get the total error.
From there, you will have to calculate the partial derivative of the total network error with respect to each of the previous weights, to find out how each weight affects the network. I have included a visual to help with your understanding.
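To make the update rule concrete, here is a minimal numpy sketch of one gradient-descent step for the 2-2-2 network above, using sigmoid activations and the squared-error loss. The inputs, targets, and initial weights are illustrative assumptions, not required values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

i = np.array([0.05, 0.10])            # two inputs (assumed toy values)
W1 = np.array([[0.15, 0.20],          # hidden weights: net_h = W1 @ i + b1
               [0.25, 0.30]])
W2 = np.array([[0.40, 0.45],          # output weights: net_o = W2 @ h + b2
               [0.50, 0.55]])
b1, b2 = 0.35, 0.60                   # biases
target = np.array([0.01, 0.99])
lr = 0.5

# Forward pass: net_h1 = w1*i1 + w2*i2 + b1*1, etc.
h = sigmoid(W1 @ i + b1)
o = sigmoid(W2 @ h + b2)
E_total = 0.5 * np.sum((target - o) ** 2)  # sum of squared errors

# Backward pass: the chain rule gives the partial derivatives.
delta_o = (o - target) * o * (1 - o)       # dE/dnet_o
delta_h = (W2.T @ delta_o) * h * (1 - h)   # dE/dnet_h

# New weights = old weights - learning rate * partial derivatives
W2 -= lr * np.outer(delta_o, h)
W1 -= lr * np.outer(delta_h, i)
print("total error after forward pass:", E_total)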
I strongly suggest you go through this beginner-friendly introduction to back-propagation to get a firm grasp of the concept. I hope my beginner post helps you get started on your Machine Learning journey!
I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with Python's OpenCV module. Training error is ~3% and test error is ~12%.
In order to improve the accuracy/generalisation of my NN, I decided to proceed by implementing model selection (of the number of hidden units and the learning rate) to get accurate hyperparameter values, and by plotting learning curves to determine if more data is needed (I currently have 2.5k samples).
Having read some sources on NN training and model selection, I'm very confused about the following:
1) In order to perform model selection, I know the following needs to be done-
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total set of Tr + Va with some split e.g. 80/20
foreach ele in possibleHiddenUnits
(*) compute weights for the NN using backpropagation and an iterative optimisation algorithm like Gradient Descent (where we provide the termination criteria in the form of number of iterations / epsilon)
compute Validation set error using these trained weights
select the number of hidden units which minimises the Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. How do you decide what the number of iterations/epsilon for GD should be?
b. Does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost w.r.t. weights through backprop) constitute an 'epoch'?
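For concreteness, here is a rough Python sketch of the selection loop I describe in 1); train_nn and validation_error are hypothetical stand-ins for my actual backprop/GD routine:

import numpy as np

possible_hidden_units = [4, 8, 16, 32, 64]

def split_train_val(X, y, val_frac=0.2, seed=0):
    # Random 80/20 split of the combined Tr + Va set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    return X[idx[n_val:]], y[idx[n_val:]], X[idx[:n_val]], y[idx[:n_val]]

X_tr, y_tr, X_va, y_va = split_train_val(X, y)   # X, y: the full labelled set
best_units, best_err = None, float("inf")
for units in possible_hidden_units:
    weights = train_nn(X_tr, y_tr, hidden_units=units)  # hypothetical: backprop + GD
    err = validation_error(weights, X_va, y_va)         # hypothetical helper
    if err < best_err:
        best_units, best_err = units, err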
2) Sources ("What is the difference between train, validation and test set, in neural networks?" and "How to use k-fold cross validation in a neural network") mention that the training for a NN is done in the following way, as it prevents over-fitting (a Python sketch of this loop follows my questions below):
for each epoch
for each training data instance
propagate error through the network
adjust the weights
calculate the accuracy over training data
for each validation data instance
calculate the accuracy over the validation data
if the threshold validation accuracy is met
exit training
else
continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting of the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitutes one iteration of training, where weights are calculated using the entire Tr set through GD + backprop, and that GD involves x (>1) iterations over the entire Tr set to calculate the weights?
Also, out of 1b and 2b, which is correct?
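Here is the promised Python rendering of the early-stopping loop from 2); again, train_one_epoch and accuracy are hypothetical placeholders for my own routines:

threshold_val_accuracy = 0.95   # assumed stopping threshold
max_epochs = 500

for epoch in range(max_epochs):
    train_one_epoch(model, X_tr, y_tr)    # propagate errors, adjust weights (hypothetical)
    tr_acc = accuracy(model, X_tr, y_tr)  # accuracy over training data (hypothetical)
    va_acc = accuracy(model, X_va, y_va)  # accuracy over validation data
    if va_acc >= threshold_val_accuracy:
        break                             # exit training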
This is more of a comment, but since I can't make comments yet I write it here. Have you tried other methods like L2 regularization or dropout? I don't know a lot about model selection, but dropout has a very similar effect to taking lots of models and averaging them. Normally dropout should do the trick and you won't have problems with overfitting anymore.
I can't understand why dropout works like this in TensorFlow. The CS231n notes say that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the picture (taken from the same site).
From the TensorFlow site: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."
Now, why is the input element scaled up by 1/keep_prob? Why not keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
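In code, the pattern looks roughly like this (TensorFlow 1.x style, where tf.nn.dropout took a keep_prob argument; the shapes here are my own example):

import tensorflow as tf  # TF 1.x style API

x = tf.placeholder(tf.float32, shape=[None, 128])
keep_prob = tf.placeholder(tf.float32)       # fed a different value for train vs. eval
dropped = tf.nn.dropout(x, keep_prob)        # zeros units, scales survivors by 1/keep_prob

# Training:  sess.run(dropped, feed_dict={x: batch, keep_prob: 0.5})
# Inference: sess.run(dropped, feed_dict={x: batch, keep_prob: 1.0})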
Let's say the network has n neurons and we apply a dropout rate of 1/2.
Training phase: we would be left with n/2 neurons. So if you were expecting output x with all the neurons, now you will get x/2. So for every batch, the network weights are trained according to this x/2.
Testing/Inference/Validation phase: we don't apply any dropout, so the output is x. In this case the output would be x and not x/2, which would give you an incorrect result. So what you can do is scale it to x/2 during testing.
Rather than applying that scaling only in the testing phase, TensorFlow's dropout layer scales the output, with or without dropout (training or testing), so that the sum stays constant.
Here is a quick experiment to dispel any remaining confusion.
Statistically, the weights of an NN layer follow a distribution that is usually close to normal (but not necessarily), and even when trying to sample a perfect normal distribution, in practice there are always sampling errors.
Then consider the following experiment:
import numpy as np
from collections import defaultdict

DIM = 1_000_000 # set our dims for weights and input
x = np.ones((DIM, 1)) # our input vector
#x = np.random.rand(DIM, 1)*2 - 1.0 # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3] # define dropout probs
W = np.random.normal(size=(DIM, 1)) # sample normally distributed weights
print("x-mean = ", x.mean())
print("W-mean = ", W.mean()) # note the mean is not perfect --> sampling error!

# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
    for p in probs:
        M = np.random.rand(DIM, 1)   # random mask
        M = (M < p).astype(int)      # keep each weight with probability p
        Wp = W * M                   # dropped-out weights
        a = np.dot(Wp.T, x)          # linear activation
        h[str(p)].append(a)

for k, v in h.items():
    print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of the next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the expected value E(aL) is probability_of_keeping * aL + probability_of_not_keeping * 0, which equals p*aL + (1-p)*0 = p*aL. In the same network, during testing there is no dropout, so layer L+1 receives aL directly, but its weights were trained to expect p*aL as input. Therefore, during testing you would have to multiply the activations by p. Instead of doing this, you can multiply the activations by 1/p during training only. This is called inverted dropout.
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.
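For reference, a minimal numpy sketch of the inverted-dropout behaviour described above:

import numpy as np

def inverted_dropout(a, keep_prob, training=True, rng=np.random.default_rng(0)):
    # Zero each activation with probability 1 - keep_prob and scale the
    # survivors by 1/keep_prob, so E[output] == a at both train and test time.
    if not training or keep_prob >= 1.0:
        return a                            # test time: forward pass untouched
    mask = rng.random(a.shape) < keep_prob  # keep with probability keep_prob
    return a * mask / keep_prob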
I am taking my first steps with neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron that uses a sigmoidal activation function. I am updating my weights online each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here, weights is an n-length vector that also contains the weight from the bias neuron (the negative threshold), result is the result computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result, and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logical AND, the weights don't converge for a long time; instead they keep growing similarly, maintaining a ratio of roughly -1.5 with the threshold. For instance, the three weights are, in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like it is connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't find an explanation for this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it also helps to look at correct and result.
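Putting those suggestions together, here is a minimal Python sketch of the perceptron on logical AND with a step activation and a decaying learning rate; the initial weights and the decay schedule are illustrative assumptions:

import numpy as np

# AND truth table, inputs augmented with the fixed bias input 1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

step = lambda s: 1.0 if s > 0 else 0.0   # f(x) = 1 for x > 0, 0 otherwise
weights = np.zeros(3)
base_lr = 0.5

for epoch in range(100):
    lr = base_lr / (1 + epoch)           # decay the rate so the weights settle
    errors = 0
    for xi, correct in zip(X, y):
        result = step(weights @ xi)
        weights += lr * (correct - result) * xi
        errors += result != correct
    if errors == 0:
        break                            # step output is exact, so updates stop
print(weights)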