I am trying to make a digit recognition program. I shall feed in a black-and-white image of a digit, and my output layer will fire the corresponding digit (one neuron shall fire, out of the 0 -> 9 neurons in the Output Layer). I finished implementing a Two-dimensional BackPropagation Neural Network. My topology sizes are [5][3] -> [3][3] -> 1[10]. So it's one 2-D Input Layer, one 2-D Hidden Layer and one 1-D Output Layer. However, I am getting weird and wrong results (Average Error and Output Values).
Debugging at this stage is kind of time consuming. Therefore, I would love to hear if this is the correct design so I continue debugging. Here are the flow steps of my implementation:
Build the Network: One Bias on each Layer except on the Output Layer (No Bias). A Bias's output value is always = 1.0, however its Connections Weights get updated on each pass like all other neurons in the network. All Weights range 0.000 -> 1.000 (no negatives)
Get Input data (0 | OR | 1) and set nth value as the nth Neuron Output Value in the input layer.
Feed Forward: On each Neuron 'n' in every Layer (except the Input Layer):
Get result of SUM (Output Value * Connection Weight) of connected Neurons
from previous layer towards this nth Neuron.
Get TanHyperbolic - Transfer Function - of this SUM as Results
Set Results as the Output Value of this nth Neuron
Get Results: Take Output Values of Neurons in the Output Layer
BackPropagation:
Calculate Network Error: on the Output Layer, get SUM Neurons' (Target Values - Output Values)^2. Divide this SUM by the size of the Output Layer. Get its SquareRoot as Result. Compute Average Error = (OldAverageError * SmoothingFactor * Result) / (SmoothingFactor + 1.00)
Calculate Output Layer Gradients: for each Output Neuron 'n', nth Gradient = (nth Target Value - nth Output Value) * nth Output Value TanHyperbolic Derivative
Calculate Hidden Layer Gradients: for each Neuron 'n', get SUM (TanHyperbolic Derivative of a weight going from this nth Neuron * Gradient of the destination Neuron) as Results. Assign (Results * this nth Output Value) as the Gradient.
Update all Weights: Starting from the hidden Layer and back to the Input Layer, for nth Neuron: Compute NewDeltaWeight = (NetLearningRate * nth Output Value * nth Gradient + Momentum * OldDeltaWeight). Then assign New Weight as (OldWeight + NewDeltaWeight)
Repeat process.
Here is my attempt for digit number seven. The outputs shown are Neuron # zero and Neuron # 6. Neuron six should be carrying 1 and Neuron # zero should be carrying 0. In my results, all Neurons other than six are carrying the same value (# zero is a sample).
Sorry for the long post. If you know this then you probably know how cool it is and how large it is to be in a single post. Thank you in advance
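For comparison with the flow steps listed above, here is a minimal NumPy sketch of one textbook backpropagation step for a tanh network. It is not the poster's implementation: the layers are flattened to vectors (15 inputs, 9 hidden, 10 outputs), momentum is left out, the weights start in a small symmetric range, and the names `train_step`, `W1`, `W2` and the target index are all hypothetical. Note that in this standard form the hidden gradient multiplies the weighted sum of downstream gradients by the tanh derivative of the hidden neuron's own output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 15, 9, 10  # assumed: 5x3 input and 3x3 hidden layer, flattened

# Assumed initialisation: small symmetric weights; the last column is the bias weight.
W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in + 1))
W2 = rng.uniform(-0.5, 0.5, size=(n_out, n_hidden + 1))

def train_step(x, target, W1, W2, learning_rate=0.1):
    """One forward/backward pass; updates W1 and W2 in place, returns the mean squared error."""
    # Feed forward: append the constant bias output 1.0 to each layer's outputs.
    x_b = np.append(x, 1.0)
    hidden = np.tanh(W1 @ x_b)
    hidden_b = np.append(hidden, 1.0)
    output = np.tanh(W2 @ hidden_b)

    # Output gradient: (target - output) * tanh'(net), where tanh'(net) = 1 - output^2.
    grad_out = (target - output) * (1.0 - output ** 2)
    # Hidden gradient: sum of (weight * destination gradient), times this neuron's tanh'.
    grad_hidden = (W2[:, :-1].T @ grad_out) * (1.0 - hidden ** 2)

    # Weight update: learning rate * source neuron's output * destination gradient.
    W2 += learning_rate * np.outer(grad_out, hidden_b)
    W1 += learning_rate * np.outer(grad_hidden, x_b)
    return float(np.mean((target - output) ** 2))

# Train repeatedly on a single made-up pattern to check that the error goes down.
x = rng.integers(0, 2, size=n_in).astype(float)
target = np.zeros(n_out)
target[7] = 1.0  # hypothetical index of the neuron that should fire
errors = [train_step(x, target, W1, W2) for _ in range(200)]
print(errors[0], errors[-1])
```

If the error on a single repeated pattern does not shrink in a check like this, the gradient or update step is the first place to look.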
Softmax with log-loss is typically used as the output-layer activation function for multiclass problems. Yours is a multiclass/multinomial problem: the 10 possible digits are the 10 classes.
So you can try changing your output layer activation function to softmax
http://en.wikipedia.org/wiki/Softmax_function
Artificial neural networks
In neural network simulations, the
softmax function is often implemented at the final layer of a network
used for classification. Such networks are then trained under a log
loss (or cross-entropy) regime, giving a non-linear variant of
multinomial logistic regression.
Let us know what effect that has.
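As a rough illustration of the suggestion, here is a minimal NumPy sketch of a softmax output layer with log-loss for 10 classes. The logits and the target index are made-up values, and the function names are hypothetical. A convenient property of this pairing is that the gradient of the loss with respect to the pre-activation values is simply the probabilities minus the one-hot target.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def log_loss(probs, target_index):
    """Cross-entropy for a single example with an integer class label."""
    return -np.log(probs[target_index])

# Made-up pre-activation values (logits) for the 10 output neurons.
logits = np.array([0.5, -1.2, 0.1, 0.0, 2.0, -0.3, 3.0, 0.2, -0.5, 1.0])
target = 6  # hypothetical index of the correct class

probs = softmax(logits)
loss = log_loss(probs, target)

# Gradient of the loss w.r.t. the logits: probabilities minus one-hot target.
one_hot = np.eye(10)[target]
grad_logits = probs - one_hot
print(probs.round(3), loss)
```

The outputs now sum to 1 and can be read as class probabilities, which is not the case with ten independent tanh outputs.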
Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work, for a few reasons. Essentially, computing all inputs in this way is like running a network with a batch size equal to the number of inputs. The weights do not get updated between inputs, but rather all at once. And so while it seems that calculating together would be valid, it makes it so that each input does not individually influence the training step by step. However, with a reasonable batch size, you can do 2d matrix multiplications, where the input is batch_size by input_size in order to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
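A small NumPy sketch may make the point concrete. With the shapes from the question (100 inputs of size 10, a 10x2 weight matrix), one matrix multiplication gives the same outputs as looping over the inputs, and the batched weight gradient is the sum of the per-input gradients computed against the same, not-yet-updated weights. The random data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 10))      # 100 inputs, 10 elements each
weights = rng.normal(size=(10, 2))       # single dense layer, no bias
output_err = rng.normal(size=(100, 2))   # stand-in for the error at the output

# Forward pass: one matmul versus one input at a time.
batched_output = inputs @ weights
looped_output = np.stack([x @ weights for x in inputs])

# Backward pass: the batched gradient equals the sum of per-input outer products,
# all evaluated with the same weights (no update between inputs).
batched_weights_err = inputs.T @ output_err
summed_weights_err = sum(np.outer(x, e) for x, e in zip(inputs, output_err))

print(np.allclose(batched_output, looped_output))
print(np.allclose(batched_weights_err, summed_weights_err))
```

So the batched form is one large step per 100 inputs rather than 100 small steps, which is why a smaller batch size (slicing `inputs` into chunks) changes the training dynamics while still using 2-D multiplications.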
I'm trying to implement a multilayer perceptron with backpropagation with only one hidden layer in Matlab. The objective is to replicate a function with two inputs and one output.
The problem I'm having is that the error starts decreasing with every epoch, but it then reaches a plateau and doesn't seem to improve, as seen in:
This is an image of all the errors during a single Epoch:
as you can see there are some extreme cases that are not being handled correctly
I'm using:
Weights initialized from -1 to 1
Mean Square Error
Variable number of hidden neurons
Momentum
Randomized input order
no bias
tanh activation function for the hidden layer
identity as the activation function of the output layer
Inputs in range of -3 to 3
Min-Max normalization of inputs
I have tried changing the number of neurons on the hidden layers, tried to lower the learning rate to really small amounts and nothing seems to help.
Here is the Matlab code:
clc
clear
%%%%%%% DEFINITIONS %%%%%%%%
i=0;
S=0;
X=rand(1000,2)*6-3; %generate inputs between -3,+3
Xval=rand(200,2)*6-3; %validation inputs
Number_Neurons=360;
Wh=rand(Number_Neurons,2)*2-1; %hidden weights
Wo=rand(Number_Neurons,1)*2-1; %output weights
Learn=.001;% learning factor
momentumWh=0; %momentums
momentumWo=0;
a=.01;%momentum factor
WoN=Wo; %new weight
fxy=@(x,y) (3.*(1-x).^2).*(exp(-x.^2-(y+1).^2))-10.*(x./5-x.^3-y.^5).*(exp(-x.^2-y.^2))-(exp(-(x+1).^2-y.^2))./3; %function to be replicated
fh=@(x) tanh(x); %hidden layer activation function
dfh= @(x) 1-tanh(x).^2; %derivative
fo=@(x) x; %output layer activation function
dfo= @(x) 1; %derivative
%%GRAPH FUNCTION
%[Xg,Yg]=meshgrid(X(:,1),X(:,2));
% Y=fxy(Xg,Yg);
% surf(Xg,Yg,Y)
%%%%%%%%%
Yr=fxy(X(:,1),X(:,2)); %Y real
Yval=fxy(Xval(:,1),Xval(:,2)); %validation Y
Epoch=1;
Xn=(X+3)/6;%%%min max normalization
Xnval=(Xval+3)/6;
E=ones(1,length(Yr));% error
Eval=ones(1,length(Yval));%validation error
MSE=1;
%%%%% ITERATION %%%%%
while 1
N=1;
perm=randperm(length(X(:,:))); %%%permutate inputs
Yrand=Yr(perm); %permutate outputs
Xrand=Xn(perm,:);
while N<=length(Yr) %epoch
%%%%%%forward pass %%%%%
S=Wh*Xrand(N,:)'; %input multiplied by hidden weights
Z=fh(S); %activation function of hidden layer
Yin=Z.*Wo; %output of hidden layer multiplied by output weights
Yins=sum(Yin); %sum all the inputs
Yc=fo(Yins);% activation function of output layer, Predicted Y
E(N)=Yrand(N)-Yc; %error
%%%%%%%% back propagation %%%%%%%%%%%%%
do=E(N).*dfo(Yins); %delta of output layer
DWo=Learn*(do.*Z)+a*momentumWo; %Gradient of output layer
WoN=Wo+DWo;%New output weight
momentumWo=DWo; %store momentum
dh=do.*Wo.*dfh(S); %delta of hidden layer
DWh1=Learn.*dh.*Xrand(N,1); %Gradient of hidden layer
DWh2=Learn.*dh.*Xrand(N,2);
DWh=[DWh1 DWh2]+a*momentumWh;%Gradient of hidden layer
Wh=Wh+DWh; %new hidden layer weights
momentumWh=DWh; %store momentum
Wo=WoN; %update output weight
N=N+1; %next value
end
MSET(Epoch)=(sum(E.^2))/length(E); %Mean Square Error Training
N=1;
%%%%%% validation %%%%%%%
while N<=length(Yval)
S=Wh*Xnval(N,:)';
Z=fh(S);
Yin=Z.*Wo;
Yins=sum(Yin);
Yc=fo(Yins);
Eval(N)=Yc-Yval(N);
N=N+1;
end
MSE(Epoch)=(sum(Eval.^2))/length(Eval); %Mean Square Error Validation
if MSE(Epoch)<=1 %stop condition
break
end
disp(MSET(Epoch))
disp(MSE(Epoch))
Epoch=Epoch+1; %next epoch
end
There are a number of factors that can come into play for the particular problem that you are trying to solve:
The Complexity of the Problem: Is the problem considered easy for a neural network to solve (If using a standard dataset, have you compared the results to other studies?)
The Inputs: Are the inputs strongly related to the output? Are there more inputs that you can add to the NN? Are they preprocessed correctly?
Local Minima vs Global Minima: Are you sure that the problem has stopped in a local minima (A place where the NN gets stuck in learning that stops the NN from reaching a more optimal solution)?
Outputs: Are the output samples skewed in some way? Is this a binary output kind of problem, and are there enough samples on both sides?
Activation Function: Is there another appropriate Activation Function for the problem?
Then there is the Hidden Layers, Neurons, Learning Rate, Momentum, Epochs etc. which you appear to have trialled.
Based on the chart, this is the kind of learning performance that would roughly be expected for a BPNN, however trial and error is sometimes required to optimise the result from there.
I would try to work on the above options (particularly pre-processing of data) and see if this helps in your case.
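As one concrete example of the pre-processing option, here is a small NumPy sketch of min-max scaling. The question maps inputs from [-3, 3] to [0, 1]; a common alternative is a zero-centred range such as [-1, 1]. Whether this helps in this particular case is an assumption to test, not something established here, and the helper name `min_max_scale` is hypothetical.

```python
import numpy as np

def min_max_scale(values, low, high, new_low=0.0, new_high=1.0):
    """Linearly map values from [low, high] to [new_low, new_high]."""
    return (values - low) / (high - low) * (new_high - new_low) + new_low

X = np.array([[-3.0, 3.0],
              [0.0, 1.5]])

unit = min_max_scale(X, -3.0, 3.0)                 # same mapping as (X+3)/6 in the question
centered = min_max_scale(X, -3.0, 3.0, -1.0, 1.0)  # zero-centred alternative
print(unit)
print(centered)
```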
I can't understand why dropout works like this in TensorFlow. The CS231n blog says that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the picture (taken from the same site).
From the TensorFlow site: With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0.
Now, why is the input element scaled up by 1/keep_prob? Why not keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
Let's say the network had n neurons and we applied a dropout rate of 1/2.
In the training phase, we would be left with about n/2 neurons. So if you were expecting output x with all the neurons, now you will get only x/2. So for every batch, the network weights are trained according to this x/2.
In the testing/inference/validation phase, we don't apply any dropout, so the output is x. So, in this case, the output would be x and not x/2, which would give you an incorrect result. So what you can do is scale it to x/2 during testing.
Rather than applying the above scaling only in the testing phase, TensorFlow's dropout layer scales the output during training (by 1/keep_prob), so that the expected sum stays the same with or without dropout.
Here is a quick experiment to disperse any remaining confusion.
Statistically the weights of a NN-layer follow a distribution that is usually close to normal (but not necessarily), but even in the case when trying to sample a perfect normal distribution in practice, there are always computational errors.
Then consider the following experiment:
import numpy as np
from collections import defaultdict

DIM = 1_000_000  # set our dims for weights and input
x = np.ones((DIM,1))  # our input vector
#x = np.random.rand(DIM,1)*2-1.0  # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3]  # define dropout probs
W = np.random.normal(size=(DIM,1))  # sample normally distributed weights
print("x-mean = ", x.mean())
print("W-mean = ", W.mean())  # note the mean is not perfect --> sampling error!
# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
    for p in probs:
        M = np.random.rand(DIM,1)  # uniform random numbers in [0, 1)
        M = (M < p).astype(int)    # keep each weight with probability p
        Wp = W * M                 # masked weights
        a = np.dot(Wp.T, x)        # linear activation with dropout applied
        h[str(p)].append(a)
for k,v in h.items():
    print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the Expected value, E(aL), will be probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0 which will be equal to p*aL + (1-p)*0 = p*aL. In the same network, during testing time there will be no dropout. Hence the layer L+1 will receive aL simply. But its weights were trained to expect p*aL as input. Therefore, during testing time you will have to multiply the activations with p. But instead of doing this, you can multiply the activations with 1/p during training only. This is called inverted dropout.
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.
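To make the inverted-dropout idea concrete, here is a minimal NumPy sketch. It is not TensorFlow's implementation; the function name `inverted_dropout` and the all-ones activations are assumptions chosen so the effect on the mean is easy to see.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob and scale survivors by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(100_000)  # assumed activations aL, all equal to 1 for clarity

train_time = inverted_dropout(activations, keep_prob=0.5, rng=rng)
test_time = activations  # no dropout and no rescaling at test time

# Survivors are scaled to 2.0, so the average stays close to the test-time average of 1.0.
print(train_time.mean(), test_time.mean())
```

Because the expected activation during training already matches the test-time activation, the forward pass at test time needs no extra scaling step.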
I am taking my first steps in neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time, instead they keep growing similarly and they maintain a ratio of circa -1.5 with the threshold, for instance the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it helps to also look at correct and result.
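The difference between the two activation functions can be checked with a short NumPy sketch that applies the update rule from the question to logic AND. The learning rate of 0.5, the zero initial weights and the helper name `train_and` are assumptions; the helper also reports how much the weights moved during the final epoch.

```python
import numpy as np

def train_and(activation, epochs, learning_rate=0.5):
    """Online training on logic AND; returns the weights and the total weight change in the last epoch."""
    inputs = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)  # last column: bias input
    targets = np.array([0.0, 0.0, 0.0, 1.0])
    weights = np.zeros(3)
    last_epoch_change = 0.0
    for _ in range(epochs):
        last_epoch_change = 0.0
        for x, correct in zip(inputs, targets):
            result = activation(weights @ x)
            delta = learning_rate * (correct - result) * x  # update rule from the question
            weights += delta
            last_epoch_change += np.abs(delta).sum()
    return weights, last_epoch_change

step = lambda s: 1.0 if s > 0 else 0.0
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

step_weights, step_change = train_and(step, epochs=50)
sig_short, sig_change = train_and(sigmoid, epochs=200)
sig_long, _ = train_and(sigmoid, epochs=2000)

print("step:   ", step_weights, "change in last epoch:", step_change)
print("sigmoid:", sig_short, "->", sig_long, "change in last epoch:", sig_change)
```

With the step function the updates stop once all four examples are classified correctly, whereas with the sigmoid (correct - result) never reaches zero, so the weights keep growing.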
I am trying to approximate the sine() function using a neural network I wrote myself. I have tested my neural network on a simple OCR problem already and it worked, but I am having trouble applying it to approximate sine(). My problem is that during training my error converges on exactly 50%, so I'm guessing it's completely random.
I am using one input neuron for the input (0 to PI), and one output neuron for the result. I have a single hidden layer in which I can vary the number of neurons but I'm currently trying around 6-10.
I have a feeling the problem is because I am using the sigmoid transfer function (which is a requirement in my application) which only outputs between 0 and 1, while the output for sine() is between -1 and 1. To try to correct this I tried multiplying the output by 2 and then subtracting 1, but this didn't fix the problem. I'm thinking I have to do some kind of conversion somewhere to make this work.
Any ideas?
Use a linear output unit.
Here is a simple example using R:
set.seed(1405)
x <- sort(10*runif(50))
y <- sin(x) + 0.2*rnorm(x)
library(nnet)
nn <- nnet(x, y, size=6, maxit=40, linout=TRUE)
plot(x, y)
plot(sin, 0, 10, add=TRUE)
x1 <- seq(0, 10, by=0.1)
lines(x1, predict(nn, data.frame(x=x1)), col="green")
When you train the network, you should normalize the target (the sin function) to the range [0,1], then you can keep the sigmoid transfer function.
sin(x) in [-1,1] => 0.5*(sin(x)+1) in [0,1]
Train data:
input target target_normalized
------------------------------------
0 0 0.5
pi/4 0.70711 0.85355
pi/2 1 1
...
Note that we mapped the target before training. Once you train and simulate the network, you can map the output of the net back.
The following MATLAB code illustrates this:
%% input and target
input = linspace(0,4*pi,200);
target = sin(input) + 0.2*randn(size(input));
% mapping
[targetMinMax,mapping] = mapminmax(target,0,1);
%% create network (one hidden layer with 6 nodes)
net = newfit(input, targetMinMax, [6], {'tansig' 'tansig'});
net.trainParam.epochs = 50;
view(net)
%% training
net = init(net); % init
[net,tr] = train(net, input, targetMinMax); % train
output = sim(net, input); % predict
%% view prediction
plot(input, mapminmax('reverse', output, mapping), 'r', 'linewidth',2), hold on
plot(input, target, 'o')
plot(input, sin(input), 'g')
hold off
legend({'predicted' 'target' 'sin()'})
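The target mapping on its own, independent of any toolbox, can be sketched in a few lines of NumPy. The helper names `to_unit_range` and `from_unit_range` are hypothetical; the three sample inputs are the ones from the table above.

```python
import numpy as np

def to_unit_range(target):
    """Map sine targets from [-1, 1] to [0, 1] so a sigmoid output can reach them."""
    return 0.5 * (target + 1.0)

def from_unit_range(output):
    """Map network outputs from [0, 1] back to [-1, 1]."""
    return 2.0 * output - 1.0

x = np.array([0.0, np.pi / 4, np.pi / 2])
target = np.sin(x)
target_normalized = to_unit_range(target)
recovered = from_unit_range(target_normalized)

print(target_normalized)
print(recovered)
```

The network is trained against `target_normalized`, and `from_unit_range` is applied to its predictions afterwards.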
There is no reason your network shouldn't work, although 6 is definitely on the low side for approximating a sine wave. I'd try at least 10 maybe even 20.
If that doesn't work then I think you need to give more detail about your system. i.e. the learning algorithm (back-propagation?), the learning rate etc.
I get the same behavior if I use vanilla gradient descent. Try using a different training algorithm.
As far as the Java applet is concerned, I did notice something interesting: it does converge if I use a "bipolar sigmoid" and I start with some non-random weights (such as results from a previous training using a Quadratic function).