How to update batch weights in back propagation - machine-learning

How gradient descent algorithm updates the batch weights in the back propagation method?
Thanks in advance ...

It is really easy once you understand the algorithm.
New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters
Let's consider a neural network with two inputs, two hidden neurons, two output neurons.
First, introduce weights and bias to your network. Then, compute total net input for hidden layer, as such
net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1
Do the same for all other hidden layers.
Next, we can now calculate the error for each output neuron using the squared error function and sum them to get the total error.
Hereinafter, you will have to calculate the partial derivative of the total network error with respect to the previous weights, to find out how each weights affects the network. I have included a visual to help with your understanding.
I strongly suggest you go through this beginner friendly introduction to back-propagation to have a firm grasp of the concept. I hope my beginner post help you get started in the journey of Machine Learning!

Related

Mini batch stochastic gradient descent

My question is what changes should be made to SGD algorithm to implement mini-batch SGD algorithm.
In the book, Machine Learning by Tom Mitchell, the GD, and SGD algorithms are explained very well. Here is a snippet of the book for SGD backpropagation algorithm:
I know that the difference between SGD and mini-batch SGD is that in the former we use one training example to update weights in each iteration (of the outer while-loop), while in the latter, a batch of training examples should be used in each iteration. But I still can't figure how the below algorithm should be changed to account for this change.
Here is what I think it should look like, but can't confirm this from several tutorials I followed on the web.
Until the termination condition is met, Do
batch <- get next batch
For each <x,t> in batch, Do
1- Propagate the input forward through the network.
2- d_k += o_k(1 - o_k)(t_k - o_k)
3- d_h += o_h(1 - o_h)sum(w_kh * d_k)
For each network weight w_ij, Do
w_ji += etha * d_j * x_ij
Any help is much appreciated!

understanding test and validation set usage for early stop and model selection

I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with pythons OpenCV module. Train error ~ 3% and test error ~ 12%
In order to improve the accruacy/ generalisation of my NN I decided to proceed by- implementing model selection (of #hidden units and learning rate) to get an accurate value of hyperparameters and plotting learning curves to determine if more data is needed (currently have 2.5k).
Having read some sources regarding NN training and model selection, I'm very confused on the following matter -
1) In order to perform model selection, I know the following needs to be done-
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total set of Tr + Va with some split e.g. 80/20
foreach ele in possibleHiddenUnits
(*) compute weights for the NN using backpropagation and an iterative optimisation algorithm like Gradient Descent (where we provide the termination criteria in the form of number of iterations / epsilon)
compute Validation set error using these trained weights
select the number of hidden units which min Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. how do you decide what the number of iterations/ epsilon for GD should be?
b. does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost wrt weights through backprop) constitute an 'epoch'?
2) Sources (whats is the difference between train, validation and test set, in neural networks? and How to use k-fold cross validation in a neural network) mention that the training for a NN is done in the following way as it prevents over-fitting
for each epoch
for each training data instance
propagate error through the network
adjust the weights
calculate the accuracy over training data
for each validation data instance
calculate the accuracy over the validation data
if the threshold validation accuracy is met
exit training
else
continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting of the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitues one iteration of training where weights are calculated using the entire Tr set through GD + backprop and GD involves x (>1) iters over the entire Tr set to calculate the weights ?
Also, out off 1b and 2b which is correct?
This is more of a comment but since I cant make comments yet I write it here. Have you tried other methods like l2 regularization or dropout? I dont know a lot about model selection but dropout has a very similiar effect like taking lots of models and averaging them. Normaly dropout should do the trick and you wont have problems with overfitting anymore.

Multilayer Perceptron - Error plateau

I'm trying to implement a multilayer perceptron with backpropagation with only one hidden layer on Matlab. The objective is to replicate a function with two I'm trying to implement a multilayer perceptron with backpropagation with only one hidden layer on Matlab. The objective is to replicate a function with two inputs and one output.
The problem I'm having is that the error starts decreasing with every epoch but it just reaches a plateau and doesn't seems to improve as seen in:
This is an image of all the errors during a single Epoch:
as you can see there are some extreme cases that are not being handled correctly
Im using:
Weights initialized from -1 to 1
Mean Square Error
Variable number of hidden neurons
Momentum
Randomized input order
no bias
tanh activation function for the hidden layer
identity as the activation function of the output layer
Inputs in range of -3 to 3
Min-Max normalization of inputs
I have tried changing the number of neurons on the hidden layers, tried to lower the learning rate to really small amounts and nothing seems to help.
Here is the Matlab code:
clc
clear
%%%%%%% DEFINITIONS %%%%%%%%
i=0;
S=0;
X=rand(1000,2)*6-3; %generate inputs between -3,+3
Xval=rand(200,2)*6-3; %validation inputs
Number_Neurons=360;
Wh=rand(Number_Neurons,2)*2-1; %hidden weights
Wo=rand(Number_Neurons,1)*2-1; %output weights
Learn=.001;% learning factor
momentumWh=0; %momentums
momentumWo=0;
a=.01;%momentum factor
WoN=Wo; %new weight
fxy=#(x,y) (3.*(1-x).^2).*(exp(-x.^2-(y+1).^2))-10.*(x./5-x.^3-y.^5).*(exp(-x.^2-y.^2))-(exp(-(x+1).^2-y.^2))./3; %function to be replicated
fh=#(x) tanh(x); %hidden layer activation function
dfh= #(x) 1-tanh(x).^2; %derivative
fo=#(x) x; %output layer activation function
dfo= #(x) 1; %derivative
%%GRAPH FUNCTION
%[Xg,Yg]=meshgrid(X(:,1),X(:,2));
% Y=fxy(Xg,Yg);
% surf(Xg,Yg,Y)
%%%%%%%%%
Yr=fxy(X(:,1),X(:,2)); %Y real
Yval=fxy(Xval(:,1),Xval(:,2)); %validation Y
Epoch=1;
Xn=(X+3)/6;%%%min max normalization
Xnval=(Xval+3)/6;
E=ones(1,length(Yr));% error
Eval=ones(1,length(Yval));%validation error
MSE=1;
%%%%% ITERATION %%%%%
while 1
N=1;
perm=randperm(length(X(:,:))); %%%permutate inputs
Yrand=Yr(perm); %permutate outputs
Xrand=Xn(perm,:);
while N<=length(Yr) %epoch
%%%%%%foward pass %%%%%
S=Wh*Xrand(N,:)'; %input multiplied by hidden weights
Z=fh(S); %activation function of hidden layer
Yin=Z.*Wo; %output of hidden layer multiplied by output weights
Yins=sum(Yin); %sum all the inputs
Yc=fo(Yins);% activation function of output layer, Predicted Y
E(N)=Yrand(N)-Yc; %error
%%%%%%%% back propagation %%%%%%%%%%%%%
do=E(N).*dfo(Yins); %delta of output layer
DWo=Learn*(do.*Z)+a*momentumWo; %Gradient of output layer
WoN=Wo+DWo;%New output weight
momentumWo=DWo; %store momentum
dh=do.*Wo.*dfh(S); %delta of hidden layer
DWh1=Learn.*dh.*Xrand(N,1); %Gradient of hidden layer
DWh2=Learn.*dh.*Xrand(N,2);
DWh=[DWh1 DWh2]+a*momentumWh;%Gradient of hidden layer
Wh=Wh+DWh; %new hidden layer weights
momentumWh=DWh; %store momentum
Wo=WoN; %update output weight
N=N+1; %next value
end
MSET(Epoch)=(sum(E.^2))/length(E); %Mean Square Error Training
N=1;
%%%%%% validation %%%%%%%
while N<=length(Yval)
S=Wh*Xnval(N,:)';
Z=fh(S);
Yin=Z.*Wo;
Yins=sum(Yin);
Yc=fo(Yins);
Eval(N)=Yc-Yval(N);
N=N+1;
end
MSE(Epoch)=(sum(Eval.^2))/length(Eval); %Mean Square Error de validacion
if MSE(Epoch)<=1 %stop condition
break
end
disp(MSET(Epoch))
disp(MSE(Epoch))
Epoch=Epoch+1; %next epoch
end
There are a number of factors that can come into play for the particular problem that you are trying to solve:
The Complexity of the Problem: Is the problem considered easy for a neural network to solve (If using a standard dataset, have you compared the results to other studies?)
The Inputs: Are the inputs strongly related to the output? Are there more inputs that you can add to the NN? Are they preprocessed correctly?
Local Minima vs Global Minima: Are you sure that the problem has stopped in a local minima (A place where the NN gets stuck in learning that stops the NN from reaching a more optimal solution)?
Outputs: Are the output samples skewed in some way? Is this a binary output kind of problem, and are there enough samples on both sides?
Activation Function: Is there another appropriate Activation Function for the problem?
Then there is the Hidden Layers, Neurons, Learning Rate, Momentum, Epochs etc. which you appear to have trialled.
Based on the chart, this is the kind of learning performance that would roughly be expected for a BPNN, however trial and error is sometimes required to optimise the result from there.
I would try to work on the above options (particularly pre-processing of data) and see if this helps in your case.

Neural Network Diverging instead of converging

I have implemented a neural network (using CUDA) with 2 layers. (2 Neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, the it is diverging (the output is becoming infinity)
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I have randomized the initial weights
I read that a neural network might diverge if the learning rate is too high so I reduced the learning rate to 0.000001
The two functions I am trying to get it to add are: 3 * i + 7 * j+9 and j*j + i*i + 24 (I am giving the layer i and j as input)
I had implemented it as a single layer previously and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network but I'm not sure it would help it learn
I am using a linear (as in no) activation function
There is oscillation in the beginning but the output starts diverging the moment any of weights become greater than 1
I have checked and rechecked my code but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointer will be appreciated.
If the problem you are trying to solve is of classification type, try 3 layer network (3 is enough accordingly to Kolmogorov) Connections from inputs A and B to hidden node C (C = A*wa + B*wb) represent a line in AB space. That line divides correct and incorrect half-spaces. The connections from hidden layer to ouput, put hidden layer values in correlation with each other giving you the desired output.
Depending on your data, error function may look like a hair comb, so implementing momentum should help. Keeping learning rate at 1 proved optimum for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of a few subsequent sessions. If session exceeds max iterations or amplitude is too high, or error is obviously high - the session has failed, start another.
At the beginning of each, reinitialize your weights with random (-0.5 - +0.5) values.
It really helps to chart your error descent. You will get that "Aha!" factor.
The most common reason for a neural network code to diverge is that the coder has forgotten to put the negative sign in the change in weight expression.
another reason could be that there is a problem with the error expression used for calculating the gradients.
if these don't hold, then we need to see the code and answer.

Does it makes any sense that weights and threshold are growing proportionally when training my perceptron?

I am moving my first steps in neural networks and to do so I am experimenting with a very simple single layer, single output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time, instead they keep growing similarly and they maintain a ratio of circa -1.5 with the threshold, for instance the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like connected to some missing stopping condition in the learning, if I try to use the identity function as activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perception, it helps to also look at correct and result.

Resources