What is the interpretation of prelu weights, if weights of prelu in a layer are close to 1, and in some other layer they are close 0?
Not much prelu literature around, any help would be really helpful!
PRelu formula is this:
As you can see, if a is learned to be around 0, then f(x) is almost equal to ordinary relu, and the gradient from the negative activations doesn't change the network. Put it simply, the network doesn't "want" to tweak inactive neurons in any direction. Practically, this also means that you can probably speed up the training by using relu in this layer. Also this non-linearity really matters.
On the contrary, when a is approximately 1, f(x) is almost x, i.e., it's like there is no non-linearity. This means that this layer is probably redundant and the network has enough freedom to make a decision boundary without it.
Related
I have read that the " He weight Initialization" (He et al., 2015) built on the Lecun weight initialization and suggested a zero-mean Gaussian distribution where the standard deviation is
enter image description here
and this function should be used with ReLU to solve the vanishing/exploding gradient problem. For me, it does make sense because the way ReLu was built makes it no bothered with vanishing/exploding gradient problem. Since, if the input is less than 0 the derivative would be zero otherwise the derivative would be one. So, whatever the variance is, the gradient would be zero or one. Therefore, the He weight Initialization is useless. I know that I am missing something, that's why I am asking if anyone would tell me the usefulness of that weight initialization?
Weight initialization is applied, in general terms, to weights of layers that have learnable / trainable parameters, just like dense layers, convolutional layers, and other layers. ReLU is an activation function, fully deterministic, and has no initialization.
Regarding to the vanishing gradient problem, the backpropagation step is funded by computing the gradients by the chain rule (partial derivatives) for each weight (see here):
(...) each of the neural network's weights receive an update
proportional to the partial derivative of the error function with
respect to the current weight in each iteration of training.
The more deep a network is, the smaller these gradients get, and when a network becomes deep enough, the backprop step is less effective (in the worst case, it stops learning) and this becomes a problem:
This has the effect of multiplying n of these small numbers to compute
gradients of the "front" layers in an n-layer network, meaning that
the gradient (error signal) decreases exponentially with n while the
front layers train very slowly.
Choosing a proper activation function, like ReLU, help avoiding this to happen, as you mentioned in the OP, by making partial derivatives of this activation not too small:
Rectifiers such as ReLU suffer less from the vanishing gradient
problem, because they only saturate in one direction.
Hope this helps!
I am currently training an LSTM RNN for time-series forecasting. I understand that it is common practice to clip the gradients of the RNN when it crosses a certain threshold. However, I am not completely clear on whether or not this includes the output layer.
If we call the hidden layer of an RNN h, then the output is sigmoid(connected_weights*h + bias). I know that the gradients for the weights for determining the hidden layer are clipped, but does the same go for the output layer?
In other words, are the gradients for the connected_weights also clipped in gradient clipping?
While nothing prevents you from clipping them as well, there is no reason to do so. A nice paper with reasons is here, I'll try to give you an overview.
The problem we're trying to solve by gradient clipping is that of exploding gradients: Let's assume that your RNN layer is computed like this:
h_t = sigmoid(U * x + W * h_tm1 + b)
So forgetting about the nonlinearity for a while, you could say that a current state h_t depends on some earlier state h_{t-T} as h_t = W^T * h_tmT + input. So if the matrix W inflates the hidden state, the influence of that old hidden state is growing exponentially with time. And the same happens as you backpropagate the gradient, resulting in gradients that will most likely get you to to some useless point in the parameter space.
On the other hand, the output layer is applied just once during both forward and backward pass, so while it may complicate the learning, it will only be by a 'constant' factor, independent of the unrolling in time.
To get a bit more technical: The crucial quantity which determines whether you get exploding gradient is the largest eigenvalue of W. If it is larger than one (or smaller than -1, then it's real fun :-)), then you get exploding gradients. Conversely, if it's smaller than one, you'll suffer from vanishing gradients, making it difficult to learn long-term dependencies. You can find a nice discussion of these phenomena here, with pointers to classical literature.
If we take the sigmoid back into the picture, it becomes more difficult to get exploding gradients, as the gradients get dampened by at least a factor of 4 when being backpropagated through it. But still, have an eigenvalue larger than 4 and you'll have adventures :-) It's rather important to initialize carefully, the second paper gives some hints. With tanh, there is little dampening around zero and ReLU just propagates the gradient through, so these are rather prone to gradient explodions and thus sensitive to initialization and gradient clipping.
Overall, LSTMs have better learning properties than vanilla RNNs, esp. with regard to the vanishing gradients. Though from my experience, gradient clipping is usually necessary with them as well.
EDIT: When to clip?
Right before the update of the weights, i.e. you do the backprop unaltered. The thing is that gradient clipping is kind of a dirty hack. You still want your gradient as precise as possible, so you better don't distort it in the middle of the backprop. Just that if you see the gradient become very large, you say Nah, this smells. I better make a tiny step. and clipping is an easy way to do it (it may be that only some elements of the gradient are exploded while the others are still well behaved and informative). With most of the toolkits, you don't have the choice anyway, because the backpropagation happens atomically.
I understand this decision depends on the task, but let me explain.
I'm designing a model that predicts steering angles from a given dashboard video frame using a convolutional neural network with dense layers at end. In my final dense layer, I only have a single unit that predicts a steering angle.
My question here is, for my task would either option below show a boost in performance?
a. Get ground truth steering angles, convert to radians, and squash them using tanh so they are between -1 and 1. In the final dense layer of my network, use a tanh activation function.
b. Get ground truth steering angles. These raw angles are between -420 and 420 degrees. In the final layer, use a linear activation.
I'm trying to think about it logically, where in option A the loss will likely be much smaller since the network is dealing with much smaller numbers. This would lead to smaller changes in weights.
Let me know your thoughts!
There are two types of variables in neural networks: weights and biases (mostly, there are additional variables, e.g. the moving mean and moving variance required for batchnorm). They behave a bit differently, for instance biases are not penalized by a regularizer as a result they don't tend to get small. So an assumption that the network is dealing only with small numbers is not accurate.
Still, biases need to be learned, and as can be seen from ResNet performance, it's easier to learn smaller values. In this sense, I'd rather pick [-1, 1] target range over [-420, 420]. But tanh is probably not an optimal activation function:
With tahn (just like with sigmoid), a saturated neuron kills the gradient during backprop. Choosing tahn with no specific reason is likely to hurt your training.
Forward and backward passes with tahn need to compute exp, which is also relatively expensive.
My option would be (at least initially, until some other variant proves to work better) to squeeze the ground truth values and have no activation at all (I think that's what you mean by a linear activation): let the network learn [-1, 1] range by itself.
In general, if you have any activation functions in the hidden layers, ReLu has proven to work better than sigmoid, though other modern functions have been proposed recently, e.g. leaky ReLu, PRelu, ELU, etc. You might try any of those.
I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process to solve for Linear Models (Regression for e.g.)? Andrew Ng's excellent course on Coursera for Machine Learning does exactly that for Linear Regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets and why not for GLMs (Generalized Linear Models). They all seem to be doing the same thing- what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend to question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, having multiple hidden layers is equivalent to a single hidden layer: it's still linear combinations from input to output.
A "modern' NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropouts (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) back propagation of errors meant less than what you describe: the term "backpropagation of errors" only refered to the method of calculating derivatives of the loss function, instead of e.g. automatic differentiation, symbolic differentiation, or numerical differentiation. No matter what the gradient was then used for (e.g. Gradient Descent, or maybe Levenberg/Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
I think I read somewhere that convolutional neural networks do not suffer from the vanishing gradient problem as much as standard sigmoid neural networks with increasing number of layers. But I have not been able to find a 'why'.
Does it truly not suffer from the problem or am I wrong and it depends on the activation function?
[I have been using Rectified Linear Units, so I have never tested the Sigmoid Units for Convolutional Neural Networks]
Convolutional neural networks (like standard sigmoid neural networks) do suffer from the vanishing gradient problem. The most recommended approaches to overcome the vanishing gradient problem are:
Layerwise pre-training
Choice of the activation function
You may see that the state-of-the-art deep neural network for computer vision problem (like the ImageNet winners) have used convolutional layers as the first few layers of the their network, but it is not the key for solving the vanishing gradient. The key is usually training the network greedily layer by layer. Using convolutional layers have several other important benefits of course. Especially in vision problems when the input size is large (the pixels of an image), using convolutional layers for the first layers are recommended because they have fewer parameters than fully-connected layers and you don't end up with billions of parameters for the first layer (which will make your network prone to overfitting).
However, it has been shown (like this paper) for several tasks that using Rectified linear units alleviates the problem of vanishing gradients (as oppose to conventional sigmoid functions).
Recent advances had alleviate the effects of vanishing gradients in deep neural networks. Among contributing advances include:
Usage of GPU for training deep neural networks
Usage of better activation functions. (At this point rectified linear units (ReLU) seems to work the best.)
With these advances, deep neural networks can be trained even without layerwise pretraining.
Source:
http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
we do not use Sigmoid and Tanh as Activation functions which causes vanishing Gradient Problems. Mostly nowadays we use RELU based activation functions in training a Deep Neural Network Model to avoid such complications and improve the accuracy.
It’s because the gradient or slope of RELU activation if it’s over 0, is 1. Sigmoid derivative has a maximum slope of .25, which means that during the backward pass, you are multiplying gradients with values less than 1, and if you have more and more layers, you are multiplying it with values less than 1, making gradients smaller and smaller. RELU activation solves this by having a gradient slope of 1, so during backpropagation, there isn’t gradients passed back that are progressively getting smaller and smaller. but instead they are staying the same, which is how RELU solves the vanishing gradient problem.
One thing to note about RELU however is that if you have a value less than 0, that neuron is dead, and the gradient passed back is 0, meaning that during backpropagation, you will have 0 gradient being passed back if you had a value less than 0.
An alternative is Leaky RELU, which gives some gradient for values less than 0.
The first answer is from 2015 and a bit of age.
Today, CNNs typically also use batchnorm - while there is some debate why this helps: the inventors mention covariate shift: https://arxiv.org/abs/1502.03167
There are other theories like smoothing the loss landscape: https://arxiv.org/abs/1805.11604
Either way, it is a method that helps to deal significantly with vanishing/exploding gradient problem that is also relevant for CNNs. In CNNs you also apply the chain rule to get gradients. That is the update of the first layer is proportional to the product of N numbers, where N is the number of inputs. It is very likely that this number is either relatively big or small compared to the update of the last layer. This might be seen by looking at the variance of a product of random variables that quickly grows the more variables are being multiplied: https://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables
For recurrent networks that have long sequences of inputs, ie. of length L, the situation is often worse than for CNN, since there the product consists of L numbers. Often the sequence length L in a RNN is much larger than the number of layers N in a CNN.