I am about to start coding an LSTM using Keras/Theano.
As I understand it, using an LSTM (as opposed to a vanilla RNN) avoids vanishing and exploding gradients.
Could someone explain clearly (or provide their favorite link to a clear explanation of) how an LSTM, with its forget gate, memory cell input, and memory cell output gate, prevents both vanishing and exploding gradients?
Are there any other variants of RNNs that also avoid these issues? Or is LSTM the only RNN variant that does this?
Does anyone know the difference between Backpropagation and Levenberg–Marquardt in neural networks training? Sometimes I see that LM is considered as a BP algorithm and sometimes I see the opposite.
Your help will be highly appreciated.
Thank you.
Those are two completely unrelated concepts.
Levenberg-Marquardt (LM) is an optimization method, while backprop is just the recursive application of the chain rule for derivatives.
What LM intuitively does is this: when it is far from a local minimum, it ignores the curvature of the loss and acts as gradient descent. However, as it gets closer to a local minimum it pays more and more attention to the curvature by switching from gradient descent to a Gauss-Newton like approach.
The LM method needs both the gradient and the Hessian, as it solves variants of (H + coeff*Identity) dx = -g, with H and g the Hessian and the gradient respectively. You can obtain the gradient via backpropagation. The Hessian is usually not as simple to obtain, although in least squares you can approximate it from first derivatives alone as 2 J^T J, where J is the Jacobian of the residuals (the Gauss-Newton approximation), which means that in that case you can also obtain it easily at the end of the initial backprop.
For neural networks LM usually isn't really useful as you can't construct such a huge Hessian, and even if you do, it lacks the sparse structure needed to invert it efficiently.
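A minimal sketch of the damped step described above, applied to a tiny curve fit rather than a neural network. The model a*exp(b*x), the noise-free data, the helper names, and the "multiply or divide the damping by 10" schedule are all illustrative assumptions; the Hessian is replaced by the Gauss-Newton approximation J^T J, with the common factor of 2 dropped from both sides of the linear system.

```python
import numpy as np

def residuals(params, x, y):
    # Hypothetical model for illustration: y ~ a * exp(b * x)
    a, b = params
    return a * np.exp(b * x) - y

def jacobian(params, x):
    # Derivatives of each residual with respect to a and b
    a, b = params
    e = np.exp(b * x)
    return np.stack([e, a * x * e], axis=1)

def levenberg_marquardt(params, x, y, lam=1e-2, steps=50):
    params = np.asarray(params, dtype=float)
    loss = np.sum(residuals(params, x, y) ** 2)
    for _ in range(steps):
        r = residuals(params, x, y)
        J = jacobian(params, x)
        g = J.T @ r   # half the gradient of the squared error
        H = J.T @ J   # half the Gauss-Newton Hessian approximation
        # Large lam: step is close to (scaled) gradient descent.
        # Small lam: step is close to a Gauss-Newton step.
        dx = np.linalg.solve(H + lam * np.eye(len(params)), -g)
        new_loss = np.sum(residuals(params + dx, x, y) ** 2)
        if new_loss < loss:
            params, loss = params + dx, new_loss
            lam /= 10.0   # trust the curvature more
        else:
            lam *= 10.0   # fall back towards gradient descent
    return params, loss

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * x)   # synthetic data generated with a=2, b=1.5
fitted, final_loss = levenberg_marquardt([1.0, 1.0], x, y)
print(fitted, final_loss)
```

With only two parameters the linear solve is trivial; for a network with millions of weights the same matrix would be millions by millions, which is the practical obstacle mentioned above.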
I am currently training an LSTM RNN for time-series forecasting. I understand that it is common practice to clip the gradients of the RNN when they exceed a certain threshold. However, I am not completely clear on whether or not this includes the output layer.
If we call the hidden layer of an RNN h, then the output is sigmoid(connected_weights*h + bias). I know that the gradients for the weights for determining the hidden layer are clipped, but does the same go for the output layer?
In other words, are the gradients for the connected_weights also clipped in gradient clipping?
While nothing prevents you from clipping them as well, there is no reason to do so. A nice paper with reasons is here; I'll try to give you an overview.
The problem we're trying to solve by gradient clipping is that of exploding gradients: Let's assume that your RNN layer is computed like this:
h_t = sigmoid(U * x + W * h_tm1 + b)
So forgetting about the nonlinearity for a while, you could say that the current state h_t depends on an earlier state h_{t-T} as h_t = W^T * h_{t-T} + input, where W^T here means W raised to the power T (not the transpose). So if the matrix W inflates the hidden state, the influence of that old hidden state grows exponentially with time. The same happens as you backpropagate the gradient, resulting in gradients that will most likely take you to some useless point in the parameter space.
On the other hand, the output layer is applied just once during both forward and backward pass, so while it may complicate the learning, it will only be by a 'constant' factor, independent of the unrolling in time.
To get a bit more technical: The crucial quantity which determines whether you get exploding gradient is the largest eigenvalue of W. If it is larger than one (or smaller than -1, then it's real fun :-)), then you get exploding gradients. Conversely, if it's smaller than one, you'll suffer from vanishing gradients, making it difficult to learn long-term dependencies. You can find a nice discussion of these phenomena here, with pointers to classical literature.
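A small numerical sketch of the eigenvalue argument, ignoring the nonlinearity as above. The recurrent matrix is assumed to be a scaled 2x2 rotation, so every eigenvalue has magnitude exactly `scale`; the helper name `backprop_norm` is made up for this illustration.

```python
import numpy as np

def backprop_norm(scale, steps, angle=0.3):
    # W = scale * rotation, so all eigenvalues of W have magnitude `scale`
    c, s = np.cos(angle), np.sin(angle)
    W = scale * np.array([[c, -s], [s, c]])
    grad = np.ones(2)
    for _ in range(steps):
        # Backprop through one linear recurrent step multiplies by W transposed
        grad = W.T @ grad
    return np.linalg.norm(grad)

exploding = backprop_norm(1.1, 50)   # grows like 1.1 ** 50
vanishing = backprop_norm(0.9, 50)   # shrinks like 0.9 ** 50
print(exploding, vanishing)
```

The output layer, by contrast, contributes a single multiplication regardless of how many time steps are unrolled.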
If we take the sigmoid back into the picture, it becomes more difficult to get exploding gradients, as the gradients get dampened by at least a factor of 4 when being backpropagated through it. But still, have an eigenvalue larger than 4 and you'll have adventures :-) It's rather important to initialize carefully; the second paper gives some hints. With tanh, there is little dampening around zero, and ReLU just propagates the gradient through, so these are rather prone to gradient explosions and thus sensitive to initialization and gradient clipping.
Overall, LSTMs have better learning properties than vanilla RNNs, esp. with regard to the vanishing gradients. Though from my experience, gradient clipping is usually necessary with them as well.
EDIT: When to clip?
Right before the update of the weights, i.e. you do the backprop unaltered. The thing is that gradient clipping is kind of a dirty hack. You still want your gradient to be as precise as possible, so you had better not distort it in the middle of the backprop. It is just that if you see the gradient become very large, you say "Nah, this smells. I'd better make a tiny step." Clipping is an easy way to do that (it may be that only some elements of the gradient have exploded while the others are still well behaved and informative). With most toolkits you don't have the choice anyway, because the backpropagation happens atomically.
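As a rough sketch of "clip right before the update": the two helpers below take the finished gradients (here just a list of NumPy arrays standing in for whatever the toolkit returns) and either rescale them by their global norm or clip each element. The function names and the example values are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together so their joint L2 norm is at most max_norm
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def clip_by_value(grads, limit):
    # Clip each element independently; well-behaved elements are left untouched
    return [np.clip(g, -limit, limit) for g in grads]

# Pretend these came out of an unaltered backprop pass
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
elementwise = clip_by_value(grads, limit=5.0)

# The weight update would follow here, e.g. w -= learning_rate * clipped_gradient
print(norm_before, norm_after, elementwise)
```

Norm clipping keeps the direction of the gradient and only shortens the step, while element-wise clipping changes the direction but leaves small components exactly as they were.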
I'm working on a (high energy physics related) problem using CNNs.
For understanding the problem, let's consider these examples here.
The left-hand side is the input to the CNN, the right-hand side the desired output. So the network is supposed to cluster the input. The actual algorithm behind this clustering (i.e. how we got the desired output for training) is really complex and we want the CNN to learn this.
I've tried different CNN architectures, for example one similar to the U-net architecture (https://arxiv.org/abs/1505.04597) but also various concatenations of convolutional layers, etc.
The outputs are always really similar (for all architectures).
Here you can see some CNN predictions.
In principle the network is performing quite well, but as you can see, in most cases the CNN output consists of several filled pixels that are directly next to each other, which will never (!) happen in the true cases.
I've been using mean squared error as the loss function in all of the networks.
Do you have any suggestions how one could avoid this problem and improve the network's performance?
Or is this a general limitation to CNNs and in practice it is not possible to solve such a problem using CNNs?
Thank you very much!
My suggestion would be to split up the work. First use a U-Shaped NN to find the activations in a binary segmentation task (like in your paper) and then regress on the found activations to find their final values. In my experience this works way better than doing regression on large images, because the MSE will result in blurry outputs, as you have observed.
The CNN does not know that you want a sharp result. As mentioned by @Thomas, MSE tends to give blurry results; that is the nature of the loss function, since a blurry prediction does not incur a large MSE.
An easy modification would be to use an L1 loss (absolute difference instead of squared error). Its gradient has constant magnitude, unlike MSE, whose gradient shrinks as the error shrinks.
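A toy illustration of why MSE favours blurry outputs, under a strong simplifying assumption: a single pixel that is lit (value 1) in 3 out of 10 equally plausible training targets and dark otherwise. The best constant prediction is found by brute-force search over a grid.

```python
import numpy as np

# Hypothetical targets for one pixel: lit in 3 of 10 cases
targets = np.array([0.0] * 7 + [1.0] * 3)
candidates = np.linspace(0.0, 1.0, 101)

mse = np.array([np.mean((c - targets) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(c - targets)) for c in candidates])

best_mse = candidates[np.argmin(mse)]   # the mean of the targets: a "grey" pixel
best_mae = candidates[np.argmin(mae)]   # the median of the targets: a crisp value
print(best_mse, best_mae)
```

Under MSE the optimum is the mean of the targets, so uncertainty about which of two neighbouring pixels is lit gets spread over both; under L1 the optimum is the median, which commits to one value.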
If you really want a sharp result, it would be easier to add a manual post-processing step: non-maximum suppression (NMS). In practice, a 3x3 box-max filter might do.
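One possible reading of the 3x3 box-max idea, sketched with plain NumPy: keep a pixel only if it equals the maximum of its 3x3 neighbourhood and lies above a threshold. The function names, the threshold, and the tiny example image are assumptions for illustration.

```python
import numpy as np

def local_max_3x3(img):
    # Maximum over each pixel's 3x3 neighbourhood (borders padded with -inf)
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="constant", constant_values=-np.inf)
    h, w = img.shape
    shifted = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(np.stack(shifted), axis=0)

def non_max_suppression(img, threshold=0.0):
    img = np.asarray(img, dtype=float)
    # Note: exact ties between neighbours are all kept by this simple rule
    keep = (img == local_max_3x3(img)) & (img > threshold)
    return np.where(keep, img, 0.0)

# Two adjacent filled pixels, as in the blurry predictions described above
pred = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
sharp = non_max_suppression(pred)
print(sharp)
```

Only the stronger of the two adjacent activations survives, which matches the constraint that true outputs never contain directly neighbouring filled pixels.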
In Andrew Ng's lecture notes, they use LBFGS and get some hidden features. Can I use gradient descent instead and produce the same hidden features? All the other parameters are the same, just change the optimization algorithm.
I ask because when I use LBFGS, my autoencoder produces the same hidden features as in the lecture notes, but when I use gradient descent, the features in the hidden layer are gone and look totally random.
To be specific, in order to optimize the cost function, I implement 1) the cost function and 2) the gradient for each weight and bias, and pass them to the scipy optimize toolbox. This setup gives me reasonable hidden features.
But when I change to gradient descent, I tried updating with "Weight - gradient of the Weight" and "Bias - gradient of the Bias", and the resulting hidden features look totally random.
Can somebody help me understand the reason? Thanks.
Yes, you can use SGD instead; in fact, it is the most popular choice in practice. L-BFGS-B is not a typical method for training neural networks. However:
you will have to tweak the hyperparameters of the training method; you cannot just reuse the ones that were used for LBFGS, as this is a completely different method (ok, not completely, but it uses first-order optimization instead of second-order)
you should include momentum in your SGD; it is an extremely easy way to get a kind of second-order approximation, and is known (when carefully tuned) to perform as well as actual second-order methods in practice
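To make the two points above concrete, here is a minimal sketch of gradient descent with a step size and optional momentum on a badly conditioned quadratic. The quadratic stands in for the autoencoder cost, and the learning rate, momentum value, and step counts are assumptions chosen for this toy problem, not recommended settings.

```python
import numpy as np

# Toy cost: f(w) = 0.5 * (1 * w0**2 + 100 * w1**2), minimum at the origin
CURVATURE = np.array([1.0, 100.0])

def grad_fn(w):
    return CURVATURE * w

def gradient_descent(grad_fn, w0, lr, momentum=0.0, steps=500):
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Classical (heavy-ball) momentum; momentum=0 gives plain gradient descent
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w

start = [1.0, 1.0]
naive = gradient_descent(grad_fn, start, lr=1.0, steps=20)         # "weight - gradient"
plain = gradient_descent(grad_fn, start, lr=0.01)                   # small step, no momentum
with_momentum = gradient_descent(grad_fn, start, lr=0.01, momentum=0.9)

print(np.linalg.norm(naive), np.linalg.norm(plain), np.linalg.norm(with_momentum))
```

Subtracting the raw gradient (a step size of 1) diverges along the steep direction, a small step size converges slowly along the flat one, and adding momentum gets much closer to the minimum in the same number of steps.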
I think I read somewhere that convolutional neural networks do not suffer from the vanishing gradient problem as much as standard sigmoid neural networks with increasing number of layers. But I have not been able to find a 'why'.
Does it truly not suffer from the problem or am I wrong and it depends on the activation function?
[I have been using Rectified Linear Units, so I have never tested the Sigmoid Units for Convolutional Neural Networks]
Convolutional neural networks (like standard sigmoid neural networks) do suffer from the vanishing gradient problem. The most recommended approaches to overcome the vanishing gradient problem are:
Layerwise pre-training
Choice of the activation function
You may see that state-of-the-art deep neural networks for computer vision problems (like the ImageNet winners) use convolutional layers as the first few layers of their network, but that is not the key to solving the vanishing gradient problem. The key is usually training the network greedily, layer by layer. Using convolutional layers has several other important benefits, of course. Especially in vision problems where the input size is large (the pixels of an image), convolutional layers are recommended for the first layers because they have fewer parameters than fully-connected layers, so you don't end up with billions of parameters in the first layer (which would make your network prone to overfitting).
However, it has been shown (like this paper) for several tasks that using rectified linear units alleviates the problem of vanishing gradients (as opposed to conventional sigmoid functions).
Recent advances have alleviated the effects of vanishing gradients in deep neural networks. Contributing advances include:
Usage of GPU for training deep neural networks
Usage of better activation functions. (At this point rectified linear units (ReLU) seem to work the best.)
With these advances, deep neural networks can be trained even without layerwise pretraining.
Source:
http://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/
We avoid sigmoid and tanh as activation functions because they cause vanishing gradient problems. Nowadays we mostly use ReLU-based activation functions when training deep neural network models to avoid such complications and improve accuracy.
This is because the gradient (slope) of the ReLU activation is 1 for inputs greater than 0. The sigmoid derivative has a maximum slope of 0.25, which means that during the backward pass you multiply gradients by values less than 1; the more layers you have, the more such factors you multiply, making the gradients smaller and smaller. ReLU avoids this by having a slope of 1, so during backpropagation the gradients passed back do not get progressively smaller but stay the same, which is how ReLU addresses the vanishing gradient problem.
One thing to note about ReLU, however, is that for an input less than 0 the gradient passed back is 0, so no gradient flows through that unit during backpropagation; a neuron whose input is negative for every example is "dead" and stops learning.
An alternative is Leaky ReLU, which gives some gradient for values less than 0.
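The slopes mentioned above can be checked numerically. This sketch only multiplies activation derivatives across layers and deliberately ignores the weight matrices, which is a simplifying assumption; the depth of 20 and the leaky slope of 0.01 are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))
    return s * (1.0 - s)

def relu_grad(z):
    return np.where(np.asarray(z) > 0, 1.0, 0.0)

def leaky_relu_grad(z, slope=0.01):
    return np.where(np.asarray(z) > 0, 1.0, slope)

depth = 20
# Best case for sigmoid: every unit sits at z = 0, where the slope peaks at 0.25
best_case_sigmoid = sigmoid_grad(0.0) ** depth
# An active ReLU path passes the gradient through unchanged
active_relu = relu_grad(1.0) ** depth

print(best_case_sigmoid, active_relu)
print(relu_grad(-1.0), leaky_relu_grad(-1.0))
```

Even in the most favourable case the sigmoid factor after 20 layers is 0.25 ** 20, while an active ReLU path keeps a factor of 1; for negative inputs ReLU passes nothing back and Leaky ReLU passes a small fraction.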
The first answer is from 2015 and is showing its age.
Today, CNNs typically also use batchnorm. There is some debate about why this helps; the inventors mention covariate shift: https://arxiv.org/abs/1502.03167
There are other theories like smoothing the loss landscape: https://arxiv.org/abs/1805.11604
Either way, it is a method that helps significantly with the vanishing/exploding gradient problem, which is also relevant for CNNs. In CNNs you also apply the chain rule to get gradients. That is, the update of the first layer is proportional to a product of N numbers, where N is the number of layers. It is very likely that this product is either very large or very small compared to the update of the last layer. This can be seen by looking at the variance of a product of random variables, which grows quickly as more variables are multiplied: https://stats.stackexchange.com/questions/52646/variance-of-product-of-multiple-random-variables
For recurrent networks with long input sequences, i.e. of length L, the situation is often worse than for CNNs, since there the product consists of L numbers. Often the sequence length L in an RNN is much larger than the number of layers N in a CNN.