Should the Gradients for the Output Layer of an RNN Be Clipped? - machine-learning

I am currently training an LSTM RNN for time-series forecasting. I understand that it is common practice to clip the gradients of the RNN when their norm crosses a certain threshold. However, I am not completely clear on whether or not this includes the output layer.
If we call the hidden layer of an RNN h, then the output is sigmoid(connected_weights * h + bias). I know that the gradients for the weights that determine the hidden layer are clipped, but does the same go for the output layer?
In other words, are the gradients for the connected_weights also clipped in gradient clipping?

While nothing prevents you from clipping them as well, there is no reason to do so. A nice paper explaining the reasons is here; I'll try to give you an overview.
The problem we're trying to solve by gradient clipping is that of exploding gradients: Let's assume that your RNN layer is computed like this:
h_t = sigmoid(U * x_t + W * h_{t-1} + b)
So, forgetting about the nonlinearity for a while, you could say that the current state h_t depends on some earlier state h_{t-T} as h_t = W^T * h_{t-T} + input terms. So if the matrix W inflates the hidden state, the influence of that old hidden state grows exponentially with time. The same happens as you backpropagate the gradient, resulting in gradients that will most likely take you to some useless point in the parameter space.
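A small numpy sketch of that effect (ignoring the input and the nonlinearity, as above; the matrix, the scaling factors, and the number of steps are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
h0 = rng.normal(size=n)

W = rng.normal(size=(n, n))
radius = max(abs(np.linalg.eigvals(W)))   # largest |eigenvalue| of W

for factor in (1.1, 0.9):                 # spectral radius just above / just below 1
    M = factor * W / radius
    h = h0.copy()
    for t in range(100):                  # unroll 100 steps, no input, no nonlinearity
        h = M @ h
    print(factor, np.linalg.norm(h))      # blows up vs. shrinks exponentially with t
```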
On the other hand, the output layer is applied just once during both forward and backward pass, so while it may complicate the learning, it will only be by a 'constant' factor, independent of the unrolling in time.
To get a bit more technical: the crucial quantity which determines whether you get exploding gradients is the largest eigenvalue of W. If it is larger than one (or smaller than -1; then it's real fun :-)), then you get exploding gradients. Conversely, if it's smaller than one, you'll suffer from vanishing gradients, making it difficult to learn long-term dependencies. You can find a nice discussion of these phenomena here, with pointers to the classical literature.
If we take the sigmoid back into the picture, it becomes more difficult to get exploding gradients, as the gradients get dampened by at least a factor of 4 when being backpropagated through it. But still, have an eigenvalue larger than 4 and you'll have adventures :-) It's rather important to initialize carefully; the second paper gives some hints. With tanh, there is little dampening around zero, and ReLU just propagates the gradient through, so these are rather prone to gradient explosions and thus sensitive to initialization and gradient clipping.
Overall, LSTMs have better learning properties than vanilla RNNs, esp. with regard to the vanishing gradients. Though from my experience, gradient clipping is usually necessary with them as well.
EDIT: When to clip?
Right before the update of the weights, i.e. you do the backprop unaltered. The thing is that gradient clipping is kind of a dirty hack. You still want your gradient to be as precise as possible, so you'd better not distort it in the middle of the backprop. Just that if you see the gradient become very large, you say "Nah, this smells. I'd better make a tiny step," and clipping is an easy way to do it (it may be that only some elements of the gradient have exploded while the others are still well behaved and informative). With most toolkits you don't have the choice anyway, because the backpropagation happens atomically.
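In PyTorch, for example, the usual pattern looks like this (a minimal sketch with a toy LSTM and synthetic data; max_norm=1.0 is just a common default, not a recommendation):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(4, 20, 8)   # (batch, time, features), synthetic data
y = torch.randn(4, 20, 1)

for step in range(5):
    optimizer.zero_grad()
    out, _ = model(x)
    loss = nn.functional.mse_loss(head(out), y)
    loss.backward()                                        # full, unaltered backprop
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # clip only here, right before the update
    optimizer.step()
```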

Related

If we can clip gradient in WGAN, why bother with WGAN-GP?

I am working on WGAN and would like to implement WGAN-GP.
In its original paper, WGAN-GP is implemented with a gradient penalty because of the 1-Lipschitz constraint. But packages out there like Keras can clip the gradient norm at 1 (which by definition is equivalent to the 1-Lipschitz constraint), so why do we bother to penalize the gradient? Why don't we just clip the gradient?
The reason is that clipping is in general a pretty hard constraint in a mathematical sense, not in the sense of implementation complexity. If you check the original WGAN paper, you'll notice that the clip procedure takes the model's weights and some hyperparameter c, which controls the range for clipping.
If c is small, then the weights are severely clipped to a tiny range of values. The question is how to determine an appropriate value of c. It depends on your model, the dataset in question, the training procedure, and so on and so forth. So why not try soft penalizing instead of hard clipping? That's why the WGAN-GP paper introduces an additional term in the loss function that forces the gradient's norm to be as close to 1 as possible, avoiding a hard collapse to predefined values.
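A rough sketch of that penalty term in PyTorch (the tiny critic and the random batches are placeholders for a real model and data; lambda_gp = 10 is the coefficient the paper suggests):

```python
import torch
from torch import nn

torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
real = torch.randn(32, 10)
fake = torch.randn(32, 10)
lambda_gp = 10.0

# Interpolate between real and fake samples
eps = torch.rand(real.size(0), 1)
interp = (eps * real + (1 - eps) * fake).requires_grad_(True)

# Gradient of the critic's output w.r.t. its input
scores = critic(interp)
grads, = torch.autograd.grad(
    outputs=scores, inputs=interp,
    grad_outputs=torch.ones_like(scores),
    create_graph=True)

# Soft penalty pushing the gradient norm towards 1, instead of hard weight clipping
gp = lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()
critic_loss = critic(fake).mean() - critic(real).mean() + gp
```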
The answer by CaptainTrunky is correct, but I also wanted to point out one really important aspect.
Citing the original WGAN-GP paper:
Implementing k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in [Corollary 1], the optimal WGAN critic has unit gradient norm almost everywhere under Pr and Pg; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.
So, as you can see, weight clipping may lead to undesired behaviour (it depends on the data you want to generate; the authors of the article state that it doesn't always behave like that). When you try to train a WGAN to generate more complex data, the task has a high chance of failure.

Is it normal for gradients to be extremely large in a deep convnet?

I just finished implementing a convolutional neural network from scratch. This is the first time I've done this. When testing my backpropagation algorithm, the outputted delta values for the weights are extremely large compared to what the original values were. For example, all my weights are initialized to a random number between -0.1 and 0.1, but the delta values outputted are around 75000. This is obviously much too big of a change, and it requires a very small learning rate to even be near functional. A learning rate like 0.01 seems to be the convention, but mine needs to be at least 0.0000001, leading me to believe I'm doing something wrong. The thing is, I don't see how the deltas couldn't be large. To get the derivative of the weights with respect to the cost function, I convolve the activations of the previous layer (mostly positive due to leaky ReLU) with the previous errors (all either 0.1 or 1 due to the derivative of leaky ReLU). Obviously the sum of all these positive numbers will get very large as it propagates through the layers. Did I skip a step somewhere? Is this an exploding gradient problem? Should I use gradient clipping or batch normalization?
Depending on the size of the convolutions, -0.1 to 0.1 seems extremely large. Try something like 0.01 or even less.
If you want to do a more insightful initialization you can take a look at Glorot (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) or He (https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initialization.
The crux is to initialize with either uniform or Gaussian values with mean 0 and a standard deviation on the order of one over the square root of the number of inputs (the fan-in).
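A minimal numpy sketch of both schemes (the formulas are the ones from the linked papers; for a conv layer, the fan-in is kernel_height * kernel_width * input_channels):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He initialization: zero-mean Gaussian with std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def glorot_uniform(fan_in, fan_out):
    # Glorot/Xavier: uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

# e.g. a 3x3 conv with 16 input channels has fan_in = 3 * 3 * 16 = 144
w = he_normal(fan_in=3 * 3 * 16, fan_out=32)
print(w.std())   # roughly sqrt(2 / 144), i.e. about 0.12 -- far below 0.1..0.1 uniform
```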

Is location-dependent convolution filter possible in PyTorch or TensorFlow?

Let's pretend that in addition to having an image, I also have a gradient from left to right on the X axis of the image, and another gradient from top to bottom on the Y axis. Those two gradients are of the same size as the image, and could both range from -0.5 to 0.5.
Now, I'd like to make the convolution kernel (a.k.a. convolution filter, or convolution weights) depend on the (x, y) location in the gradient. So the kernel is a function of the gradient, as if the kernel were the output of a nested mini neural net. This would make the weights of the filter different at every position, but slightly similar to those of their neighbors. How do I do that within PyTorch or TensorFlow?
Sure, I could compute a Toeplitz matrix (a.k.a. diagonal-constant matrix) by myself, but the matrix multiplication would take O(n^3) operations if we pretend x == y == n, whereas convolutions can normally be implemented in O(n^2). Or I could maybe iterate over every element myself and do the multiplications in an unvectorized fashion.
Any better ideas? I'd like to see creativity here, thinking about how this could be implemented neatly. I believe coding this would be an interesting way to build a network layer capable of doing things similar to a simplified version of a Spatial Transformer Network, but whose spatial transformation would be independent of the image.
Here is a solution I thought of for a simplified version of this problem, where a linear combination of weights would be used rather than truly using a nested mini neural network:
It may be possible to do 4 different convolution passes so as to have 4 feature maps, then to multiply those 4 maps with the gradients (2 vertical and 2 horizontal gradients), and add them together so that only 1 map remains, as in the sketch below. However, that would be a linear combination of the different maps, which is simpler than truly using a nested neural network that would alter the kernel in the first place.
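One way to read that idea in PyTorch (a sketch only; the shapes, the single-channel maps, and the choice of the four weighting maps are illustrative assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)
img = torch.randn(1, 3, 64, 64)                     # (batch, channels, H, W), synthetic image

# Four separate convolutions producing four single-channel feature maps
convs = nn.ModuleList([nn.Conv2d(3, 1, kernel_size=3, padding=1) for _ in range(4)])
maps = torch.cat([c(img) for c in convs], dim=1)    # (1, 4, 64, 64)

# Two coordinate gradients in [-0.5, 0.5] and their negations: 2 horizontal + 2 vertical
ys, xs = torch.meshgrid(torch.linspace(-0.5, 0.5, 64),
                        torch.linspace(-0.5, 0.5, 64), indexing="ij")
weights = torch.stack([xs, -xs, ys, -ys]).unsqueeze(0)   # (1, 4, 64, 64)

# Location-dependent linear combination of the four maps -> one remaining map
out = (maps * weights).sum(dim=1, keepdim=True)     # (1, 1, 64, 64)
```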
Thinking more about it, here is a solution to an equivalent question. The thing with this solution is that it flips the problem around by placing the "mini neural net" after rather than before, and in quite a different way. So it solves the problem, but offers a much different optimization space and convergence behavior, which is less natural for me to think about than how I formulated the problem.
In a sense, a solution to the problem could be very similar to simply concatenating the two gradients to a regular feature map (from a regular convolution), such as having a depth of d_2 = d_1 + 2 after the concatenation, and then performing more convolutions on top of this. I won't prove why this is a valid solution to an equivalent problem, but I thought it through and it seems provable.
The optimization space (for the weights) would be here very different and I think it wouldn't converge with the same behavior. I'd like to know what you people think about this solution in terms of optimization convergence.
The reason why convolutions are more efficient than fully connected layers is that they are translation invariant. If you wish to have convolutions which are dependent on location, you would need to add two extra parameters to the convolution, i.e. having N + 2 input channels where the x, y coordinates are the values of the two additional channels (as in e.g. CoordConv).
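A minimal CoordConv-style sketch of that concatenation in PyTorch (channel counts and the coordinate range are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
img = torch.randn(1, 3, 64, 64)                      # (batch, channels, H, W)

# Build the two coordinate channels, same spatial size as the image, range [-0.5, 0.5]
ys, xs = torch.meshgrid(torch.linspace(-0.5, 0.5, 64),
                        torch.linspace(-0.5, 0.5, 64), indexing="ij")
coords = torch.stack([xs, ys]).unsqueeze(0)          # (1, 2, 64, 64)

# Concatenate along the channel dimension (N + 2 input channels), then convolve as usual
x = torch.cat([img, coords], dim=1)                  # (1, 5, 64, 64)
conv = nn.Conv2d(in_channels=5, out_channels=16, kernel_size=3, padding=1)
out = conv(x)                                        # (1, 16, 64, 64)
```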
As for alternative solutions: is the gradient meaningful? If it is not, and it is uniform across all images, it might be better to just manually remove it in the pre-processing stage (similar to orientation correction, cropping, etc.). If it is not uniform (e.g. differences in lighting, shadows), then including other layers, under the assumption that they would learn invariance to different lightings, is a common hands-off approach.

Is there any benefit to having linear activation functions at the last layer vs an activation function like tanh?

I understand this decision depends on the task, but let me explain.
I'm designing a model that predicts steering angles from a given dashboard video frame using a convolutional neural network with dense layers at the end. In my final dense layer, I have a single unit that predicts a steering angle.
My question here is, for my task would either option below show a boost in performance?
a. Get ground truth steering angles, convert to radians, and squash them using tanh so they are between -1 and 1. In the final dense layer of my network, use a tanh activation function.
b. Get ground truth steering angles. These raw angles are between -420 and 420 degrees. In the final layer, use a linear activation.
I'm trying to think about it logically: in option (a) the loss will likely be much smaller, since the network is dealing with much smaller numbers. This would lead to smaller changes in the weights.
Let me know your thoughts!
There are two types of variables in neural networks: weights and biases (mostly; there are additional variables, e.g. the moving mean and moving variance required for batchnorm). They behave a bit differently, for instance biases are not penalized by a regularizer and as a result don't tend to get small. So the assumption that the network is dealing only with small numbers is not accurate.
Still, biases need to be learned, and as can be seen from ResNet performance, it's easier to learn smaller values. In this sense, I'd rather pick the [-1, 1] target range over [-420, 420]. But tanh is probably not an optimal activation function:
With tanh (just like with sigmoid), a saturated neuron kills the gradient during backprop. Choosing tanh for no specific reason is likely to hurt your training.
Forward and backward passes with tanh need to compute exp, which is also relatively expensive.
My choice would be (at least initially, until some other variant proves to work better) to squash the ground truth values and have no activation at all (I think that's what you mean by a linear activation): let the network learn the [-1, 1] range by itself.
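A minimal sketch of that setup (PyTorch; the feature size, the hidden layer, and the sample angles are placeholders for your CNN and data, and MAX_ANGLE = 420 comes from the question):

```python
import torch
from torch import nn

MAX_ANGLE = 420.0                                    # degrees, from the question

def to_target(angle_deg):
    return angle_deg / MAX_ANGLE                     # scale targets into [-1, 1]

def to_angle(prediction):
    return prediction * MAX_ANGLE                    # map predictions back to degrees

head = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),                   # hidden layer with ReLU
    nn.Linear(64, 1))                                # single output unit, no activation

features = torch.randn(8, 128)                       # stand-in for the CNN's flattened features
angles = torch.tensor([[-350.0], [10.0], [420.0], [-5.0],
                       [100.0], [0.0], [-90.0], [270.0]])
loss = nn.functional.mse_loss(head(features), to_target(angles))
```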
In general, if you have any activation functions in the hidden layers, ReLU has proven to work better than sigmoid, though other modern functions have been proposed recently, e.g. leaky ReLU, PReLU, ELU, etc. You might try any of those.

Regarding the backward pass of the convolution layer in deep learning

I understand how to compute the forward pass in deep learning. Now, I want to understand the backward pass. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where does dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) come from in the formula? How is it computed in the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name) -- and the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * contribution * learning_rate", where the contribution is the activation that flowed through that weight.
When you complete that for layer N, you back up to layer N-1 and repeat.
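A tiny numeric sketch of those two factors for a single weight (purely illustrative numbers, ignoring the nonlinearity):

```python
learning_rate = 0.1
x = 0.8            # activation flowing into this weight (its "contribution")
w = 0.5            # current X-to-Y weight
y_true = 1.0       # ground-truth label
y_pred = w * x     # prediction made on naive faith in the weight

off = y_pred - y_true              # how far off was this prediction?
w -= learning_rate * off * x       # adjust by "off * contribution * learning_rate"
print(w)                           # 0.5 - 0.1 * (-0.6) * 0.8 = 0.548
```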
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".
