In my CNN I am using Leaky ReLU after the BN layer. Leaky ReLU addresses the dying ReLU problem by using f(y) = ay for negative values. BN produces zero mean and unit variance. So does BN remove the negative part, i.e. does it map all values onto a 0-to-1 scale? My choice of Leaky ReLU depends on this, because if BN removes the negative part, Leaky ReLU behaves the same as ReLU. I am using Keras.
The BN layer tries to zero-mean its output by subtracting an expectation over inputs. So we can expect some of its output values to be negative.
So the LeakyReLU following the BN layer will still receive negative values.
Also, to add to the answer by protagonist, BN actually learns an affine transform, i.e., after normalizing it scales and shifts the result so that the output has a mean of beta (not necessarily 0) and a standard deviation of gamma (not necessarily 1), where gamma and beta are the learnable parameters of the BN layer.
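To make the point concrete, here is a minimal NumPy sketch (not the Keras implementation) of training-mode batch norm followed by ReLU and Leaky ReLU. The input data, the `eps` value, and the slope `alpha=0.3` are assumptions chosen for illustration; `gamma` (scale) and `beta` (shift) follow the usual BN naming.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-BN activations: all positive on purpose, shape (batch, features)
x = rng.uniform(1.0, 5.0, size=(64, 8))

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Training-mode batch norm: standardize per feature, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.3):
    # alpha is an illustrative slope for the negative side
    return np.where(z > 0, z, alpha * z)

z = batch_norm(x)
frac_negative = float((z < 0).mean())

print("min/max after BN:", z.min(), z.max())
print("fraction of negative values:", frac_negative)
print("ReLU and Leaky ReLU differ:", not np.allclose(relu(z), leaky_relu(z)))
```

Even though every input is positive, roughly half of the normalized values are negative, so Leaky ReLU and ReLU give different outputs after BN.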
I have read that "He weight initialization" (He et al., 2015) builds on LeCun weight initialization and suggests a zero-mean Gaussian distribution whose standard deviation is
sqrt(2 / n_l), where n_l is the number of inputs to layer l,
and that this initialization should be used with ReLU to address the vanishing/exploding gradient problem. To me this does not make sense, because the way ReLU is built should make it unaffected by the vanishing/exploding gradient problem: if the input is less than 0 the derivative is zero, otherwise the derivative is one. So, whatever the variance is, the gradient would be zero or one, and He weight initialization would therefore be useless. I know that I am missing something, which is why I am asking whether anyone can explain the usefulness of this weight initialization.
Weight initialization applies, in general terms, to layers that have learnable (trainable) parameters, such as dense and convolutional layers. ReLU is an activation function; it is fully deterministic and has no parameters to initialize.
Regarding the vanishing gradient problem: the backpropagation step is based on computing the gradient for each weight via the chain rule (partial derivatives) (see here):
(...) each of the neural network's weights receive an update
proportional to the partial derivative of the error function with
respect to the current weight in each iteration of training.
The deeper a network is, the smaller these gradients tend to get, and when a network becomes deep enough, the backprop step becomes less effective (in the worst case, learning stops). This becomes a problem:
This has the effect of multiplying n of these small numbers to compute
gradients of the "front" layers in an n-layer network, meaning that
the gradient (error signal) decreases exponentially with n while the
front layers train very slowly.
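As a rough numerical illustration of the quote above, the sketch below uses the sigmoid, whose derivative never exceeds 0.25, and multiplies that bound n times. It deliberately ignores the weight matrices, which also enter the chain rule, so it is only an upper bound on the activation-derivative factor, not a full gradient computation.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0
max_sigmoid_grad = sigmoid_grad(0.0)

for n in (5, 10, 20):
    # Best-case product of n sigmoid derivatives in an n-layer chain
    print(n, max_sigmoid_grad ** n)
```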
Choosing a proper activation function, like ReLU, helps avoid this, as you mentioned in the OP, by keeping the partial derivatives of the activation from becoming too small:
Rectifiers such as ReLU suffer less from the vanishing gradient
problem, because they only saturate in one direction.
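To connect this back to He initialization: the signal passing through each layer is multiplied by the weight matrix as well as by the 0/1 ReLU derivative, so the scale of the weights still matters. The sketch below is an illustration under assumed settings (width 256, depth 20, Gaussian inputs, no biases); the helper `final_rms` and the two comparison values 0.01 and 0.2 are arbitrary choices, not taken from the paper.

```python
import numpy as np

def final_rms(weight_std, width=256, depth=20, batch=512, seed=0):
    """Push random inputs through `depth` ReLU layers; return the RMS of the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * weight_std
        a = np.maximum(a @ w, 0.0)
    return float(np.sqrt(np.mean(a ** 2)))

width = 256
rms_he = final_rms(np.sqrt(2.0 / width))  # He: std = sqrt(2 / fan_in)
rms_small = final_rms(0.01)               # arbitrary std below the He value
rms_large = final_rms(0.2)                # arbitrary std above the He value

print("He init:  ", rms_he)
print("std=0.01: ", rms_small)
print("std=0.2:  ", rms_large)
```

With the He standard deviation the activation scale stays close to that of the input, while the smaller and larger values shrink or grow it geometrically with depth. The same weight matrices appear in the backward pass, which is why the gradient is affected in a similar way.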
Hope this helps!
I am a little confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training data consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are two options for pre-processing the images:
- normalization
- standardization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs, then stalls at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much further.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note, however, that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
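The preprocessing suggestions above could look roughly like this in NumPy. The image shapes, the noise model, and the helper names `quantile_scale` and `z_score` are assumptions made for the example; the only points taken from the answer are that the statistics come from the noisy images and that the same transform is applied to both inputs and targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: clean images in [0, 255] plus additive Gaussian noise
clean = rng.uniform(0.0, 255.0, size=(16, 32, 32))
noisy = clean + rng.normal(0.0, 25.0, size=clean.shape)

# Robust scaling: 5% and 95% quantiles computed on the noisy training images
lo, hi = np.percentile(noisy, [5, 95])

def quantile_scale(img):
    return (img - lo) / (hi - lo)

# z-score: mean and standard deviation also computed on the noisy images
mu, sigma = noisy.mean(), noisy.std()

def z_score(img):
    return (img - mu) / sigma

# Apply the same transform to network inputs (noisy) and targets (clean)
noisy_q, clean_q = quantile_scale(noisy), quantile_scale(clean)
noisy_z, clean_z = z_score(noisy), z_score(clean)

# z-scored targets contain negative values, which a sigmoid output cannot produce
print("z-score target range:", clean_z.min(), clean_z.max())
```

Quantile scaling also leaves the values beyond the 5% and 95% quantiles outside [0, 1], so the same caveat about a sigmoid output applies there unless the targets are clipped.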
I have implemented a neural network with 3 layers: an input layer, a hidden layer with 30 neurons (ReLU activation), and a softmax output layer. I am using the cross-entropy cost function. No outside libraries are being used. It is trained on the MNIST dataset, so there are 784 input neurons and 10 output neurons.
I have got about 96% accuracy with hyperbolic tangent as my hidden layer activation.
When I try to switch to ReLU activation, my activations grow very fast, which causes my weights to grow unbounded as well until training blows up!
Is this a common problem to have when using relu activation?
I have tried L2 regularization with minimal success. I end up having to set the learning rate lower by a factor of ten compared to the tanh activation, and I have tried adjusting the weight decay rate accordingly, but the best accuracy I have gotten is still about 90%. The weight decay is still outpaced in the end by the updates to certain weights in the network, which leads to an explosion.
It seems everyone is just replacing their activation functions with relu and they experience better results, so I keep looking for bugs and validating my implementation.
Is there more that goes into using ReLU as an activation function? Maybe I have problems in my implementation; can someone validate the accuracy with the same neural net structure?
As you can see, the ReLU function is unbounded for positive values, which allows the activations, and with them the weights, to grow.
In fact, that is why hyperbolic tangent and similar functions are used in those cases: they bound the output value to a fixed range (-1 to 1 or 0 to 1 in most cases).
Another approach to deal with this phenomenon is called weight decay.
The basic motivation is to get a more generalised model (avoid overfitting) and to make sure the weights won't blow up: when updating the weights, you add a regularization term that depends on the weight itself,
meaning that bigger weights get a bigger penalty.
You can read further about it here.
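A minimal sketch of what such an update can look like, assuming plain SGD with an L2 penalty. The function name `sgd_step` and the values of `lr` and `weight_decay` are illustrative, not a recommendation for the network in the question.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2."""
    # The decay term is proportional to the weight itself,
    # so larger weights are pulled towards zero more strongly.
    return w - lr * (grad + weight_decay * w)

w = np.array([0.1, -2.0, 10.0])

# With a zero data gradient, only the decay term acts
w_next = sgd_step(w, grad=np.zeros_like(w))
shrink = w - w_next

print("before:", w)
print("after: ", w_next)
print("pull towards zero per weight:", shrink)
```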
In ANN, we know that to make it "learn", we need to adjust the weights of the inputs to a particular neuron.
total_input=summation(w(j,i).a(j))
During adjustment, some weights are to be reduced while others to be increased.
Should the weights of all j inputs to the i-th neuron sum to 1?
There's absolutely no reason for the weights in the linear layer (a.k.a. dense or fully-connected layer) to sum up to anything specific, such as 1.0. They are usually initialized with small random numbers (so initial sum is unlikely to be 1.0) and then get tweaked somehow (not completely independently, but at least differently).
If the neural network doesn't use any regularization, it's often possible to train the network to large weight values, much larger than 1.0 (see also this question).
There are particular cases where an analogous condition holds, for example the softmax layer, which mathematically guarantees that the sum of its outputs is 1.0. But the linear layer doesn't guarantee anything like that.
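The difference can be checked with a few lines of NumPy. This is a sketch with made-up sizes (5 inputs, 3 neurons) and a made-up initialization scale of 0.1; `w[j, i]` plays the role of w(j,i) from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.standard_normal(5)             # activations a(j) of the previous layer
w = rng.standard_normal((5, 3)) * 0.1  # w[j, i]: weight from input j to neuron i

# total_input[i] = sum over j of w[j, i] * a[j]
total_input = a @ w

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

weight_sums = w.sum(axis=0)  # sum of incoming weights for each neuron
probs = softmax(total_input)

print("incoming weight sums per neuron:", weight_sums)
print("softmax outputs sum to:", probs.sum())
```

The incoming weights of each neuron sum to arbitrary values, while the softmax outputs sum to 1 by construction.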
I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to fit linear models (regression, for example)? Andrew Ng's excellent Machine Learning course on Coursera does exactly that for linear regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets and why not for GLMs (Generalized Linear Models). They all seem to be doing the same thing- what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, in that linear case, having multiple hidden layers is equivalent to having a single hidden layer: it's still linear combinations from input to output.
A "modern" NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropouts (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
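The ReLU and 2x2 max pooling described above can be sketched in NumPy as follows. The function names and the 4x4 example input are made up for illustration, and the pooling helper assumes a single 2-D feature map with even height and width.

```python
import numpy as np

def relu(x):
    # f(x) = x if x > 0, 0 otherwise
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array; H and W must be even."""
    h, w = x.shape
    # Split into 2x2 blocks, then keep the maximum of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([
    [ 1.0, -2.0,  3.0,  0.0],
    [-1.0,  5.0, -6.0,  2.0],
    [ 0.0, -3.0,  4.0, -4.0],
    [ 7.0, -8.0, -9.0,  6.0],
])

activated = relu(x)
pooled = max_pool_2x2(activated)

print(activated)
print(pooled)  # [[5. 3.] [7. 6.]]
```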
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) back propagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating the derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, no matter what the gradient was then used for (e.g. Gradient Descent, or maybe Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
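To illustrate the distinction between computing the gradient and using it, here is a small sketch on a made-up network (3 inputs, 4 tanh hidden units, one linear output, squared error). The network, data, and step size are assumptions for the example. `backprop` computes the derivatives with the chain rule, `numerical_grad_W1` computes the same derivatives by finite differences, and the gradient descent step at the end only consumes the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # one input example
y = 0.5                                 # its target
W1 = rng.standard_normal((4, 3)) * 0.5  # hidden-layer weights
w2 = rng.standard_normal(4) * 0.5       # output-layer weights

def loss(W1, w2):
    h = np.tanh(W1 @ x)
    return 0.5 * (w2 @ h - y) ** 2

def backprop(W1, w2):
    """Gradients of the loss via the chain rule, from the output back to W1."""
    h = np.tanh(W1 @ x)
    err = w2 @ h - y                   # dL/d(output)
    grad_w2 = err * h
    delta = err * w2 * (1.0 - h ** 2)  # error propagated back through tanh
    grad_W1 = np.outer(delta, x)
    return grad_W1, grad_w2

def numerical_grad_W1(W1, w2, eps=1e-6):
    """Central finite differences, one weight at a time (slow, for checking only)."""
    g = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (loss(Wp, w2) - loss(Wm, w2)) / (2 * eps)
    return g

grad_W1, grad_w2 = backprop(W1, w2)
max_diff = float(np.abs(grad_W1 - numerical_grad_W1(W1, w2)).max())
print("max difference between backprop and numerical gradient:", max_diff)

# Gradient descent is a separate step that merely uses the gradients
lr = 0.01
loss_before = loss(W1, w2)
loss_after = loss(W1 - lr * grad_W1, w2 - lr * grad_w2)
print("loss before/after one GD step:", loss_before, loss_after)
```

Replacing `np.tanh` with the identity would turn this model into linear regression, and the same gradient descent step would still apply.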