learning rate decay in LSTM - machine-learning

I am currently reproducing the char-RNN code described in http://karpathy.github.io/2015/05/21/rnn-effectiveness/. There are existing TensorFlow implementations, and the one I am referring to is https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py. I have a question about the learning rate decay. In the code the optimizer is defined as an AdamOptimizer, yet when I went through the code I saw the following lines:
    for e in range(args.num_epochs):
        sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** e)))
which adjusts the learning rate by a decay constant.
My question is: doesn't the Adam optimizer already adapt the learning rate for us? Why do we still apply a decay rate to the learning rate here?

I think you mean RMSProp and not Adam; both pieces of code you linked use RMSProp. RMSProp only rescales the gradients so that they do not have too large or too small norms; it does not shrink the global step size over time. So it is still important to decay the learning rate when training has to slow down after several epochs.
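To make the schedule concrete, here is a minimal sketch (plain Python; the initial learning rate, decay rate, and epoch count are hypothetical example values, not necessarily the repository's defaults) of what that assignment computes before each epoch. In current TensorFlow the same effect can be obtained with tf.keras.optimizers.schedules.ExponentialDecay.

    # Minimal sketch of the per-epoch exponential decay performed in train.py.
    # The values below are hypothetical examples, not the repository's defaults.
    initial_learning_rate = 0.002   # plays the role of args.learning_rate
    decay_rate = 0.97               # plays the role of args.decay_rate
    num_epochs = 5                  # plays the role of args.num_epochs

    for e in range(num_epochs):
        lr = initial_learning_rate * (decay_rate ** e)
        # This is the value assigned to model.lr at the start of epoch e.
        print(f"epoch {e}: learning rate = {lr:.6f}")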

Related

Learning rate & gradient descent difference?

What is the difference between the two? Both serve to reach the minimum point (lower loss) of a function, for example.
I understand (I think) that the learning rate is multiplied by the gradient (slope) to take the gradient descent step, but is that so? Am I missing something?
What is the difference between the learning rate and the gradient?
Thanks
Deep learning neural networks are trained using the stochastic gradient descent algorithm.
Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.”
Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
The learning rate controls how quickly the model is adapted to the problem. Smaller learning rates require more training epochs given the smaller changes made to the weights each update, whereas larger learning rates result in rapid changes and require fewer training epochs.
A learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
The challenge of training deep learning neural networks involves carefully selecting the learning rate. It may be the most important hyperparameter for the model.
The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
— Page 429, Deep Learning, 2016.
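To tie this back to the question above: the gradient gives the direction (and raw magnitude) of the change, while the learning rate scales how far the weight actually moves in that direction. A minimal sketch in plain Python, using a made-up one-parameter loss:

    # One gradient descent step for the toy loss f(w) = (w - 3)**2.
    # The starting weight and learning rate are arbitrary example values.
    w = 0.0
    learning_rate = 0.1

    grad = 2 * (w - 3)            # gradient (slope) of the loss at w
    w = w - learning_rate * grad  # step size = learning_rate * gradient

    print(w)  # 0.6 -- the weight moved toward the minimum at w = 3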
For more on what the learning rate is and how it works, see the post:
How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks
You can also refer to: Understand the Impact of Learning Rate on Neural Network Performance

Why do different batch sizes give different accuracy in Keras?

I was using a CNN in Keras to classify the MNIST dataset. I found that using different batch sizes gave different accuracies. Why is that?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the effect of mini-batch gradient descent during the training process. You can find a good explanation here; I quote some notes from that link below:
Batch size is a slider on the learning process. Small values give a learning process that converges quickly at the cost of noise in the training process. Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Another important note from that link is:
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller.
This is the result of this paper.
EDIT
I should mention two more points here:
Because of the inherent randomness in machine learning algorithms, you should generally not expect machine learning algorithms (such as deep learning algorithms) to produce the same results on different runs. You can find more details here.
On the other hand, your two results are very close and effectively equal, so in your case we can say that the batch size has no effect on your network's results, given the numbers you reported.
This is not specific to Keras. The batch size and the learning rate are critical hyperparameters for training neural networks with mini-batch stochastic gradient descent (SGD); they strongly affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them in the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimate, which helps regularize the model, and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
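As a rough illustration of that noisy estimation, here is a small sketch (NumPy, on a made-up linear regression problem): gradients computed on random mini-batches are noisy estimates of the full-batch gradient, and smaller batches are noisier.

    import numpy as np

    # Made-up regression data: y = 2x + noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

    w = 0.0  # current weight estimate

    def gradient(X_batch, y_batch, w):
        # Gradient of the mean squared error 0.5 * mean((x*w - y)^2) wrt w.
        residual = X_batch[:, 0] * w - y_batch
        return np.mean(residual * X_batch[:, 0])

    full = gradient(X, y, w)
    for batch_size in (10, 100, 1000):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        est = gradient(X[idx], y[idx], w)
        print(f"batch={batch_size:4d}  estimate={est:+.3f}  full-batch={full:+.3f}")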
I want to add two points:
1) With special treatments, it is possible to achieve similar performance with a very large batch size while speeding up the training process tremendously. For example,
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest over-reading these numbers. The difference is so subtle that it could be caused by noise. I bet if you evaluate models saved at a different epoch, you will see a different result.

Weights becoming "NaN" in implementation of Neural Networks

I am trying to implement a neural network for classification with 5 hidden layers and softmax cross entropy in the output layer. The implementation is in Java.
For optimization, I used mini-batch gradient descent (batch size = 100, learning rate = 0.01).
However, after a couple of iterations, the weights become "NaN" and the predicted values turn out to be the same for every test case.
I am unable to debug the source of this error.
Here is the GitHub link to the code (with the test/training files):
https://github.com/ahana204/NeuralNetworks
In my case, I forgot to normalize the training data (by subtracting the mean). This was causing the denominator of my softmax equation to be 0. Hope this helps.
Assuming the code you implemented is correct, one reason would be a large learning rate. If the learning rate is large, the weights may not converge and may become very small or very large, which can show up as NaN. Try lowering the learning rate to see if anything changes.
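One common source of such NaNs, related to the first answer, is an unstable softmax: exponentiating large logits overflows, and very negative logits can make the denominator underflow to 0. A standard remedy is to subtract the maximum logit before exponentiating. A minimal sketch in Python (the same fix translates directly to Java):

    import numpy as np

    def softmax_stable(logits):
        # Subtract the max logit so the largest exponent is exp(0) = 1.
        # This prevents overflow and keeps the denominator away from 0.
        shifted = logits - np.max(logits)
        exps = np.exp(shifted)
        return exps / np.sum(exps)

    # Without the shift, exp(1000) overflows to inf and the result is NaN;
    # with the shift, the probabilities come out correctly.
    print(softmax_stable(np.array([1000.0, 1001.0, 1002.0])))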

Is tuning batch size or epochs necessary for linear regression with TensorFlow?

I am working on an article where I focus on a simple problem – linear regression over a large data set in the presence of standard normal or uniform noise. I chose the Estimator API from TensorFlow as the modeling framework.
I am finding that hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameters I mean the batch size or the number of epochs in the training data stream.
Is there any paper/article with formal proof of this?
I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.
In short, the idea is this: mini-batch optimization wrt (x1, ..., xn) is equivalent to consecutive optimization steps wrt the x1, ..., xn inputs, because the gradient is a linear operator. This means that the mini-batch update equals the sum of its individual updates. An important note here: I assume that the NN doesn't apply batch norm or any other layer that adds explicit variation to the inference model (in that case the math is a bit hairier).
So the batch size can be seen as a purely computational idea that speeds up the optimization through vectorization and parallel computing. Assuming that one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to any value. But this isn't automatically true for all hyperparameters; for example, a very high learning rate can easily force the optimization to diverge, so don't make the mistake of thinking hyperparameter tuning isn't important in general.
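A quick numerical check of that linearity argument, on a tiny made-up linear regression batch in NumPy: the gradient of the summed batch loss equals the sum of the per-example gradients.

    import numpy as np

    # Tiny made-up batch for linear regression with squared loss.
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 examples, 2 features
    y = np.array([1.0, 2.0, 3.0])
    w = np.array([0.1, -0.2])

    def grad_one(x_i, y_i, w):
        # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
        return (x_i @ w - y_i) * x_i

    # Gradient of the summed batch loss ...
    batch_grad = X.T @ (X @ w - y)
    # ... equals the sum of the individual per-example gradients.
    sum_of_grads = sum(grad_one(X[i], y[i], w) for i in range(len(X)))

    print(np.allclose(batch_grad, sum_of_grads))  # True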

Machine Learning: Is it better to retrain a model if the loss stagnates at a high value?

That is to say, if during training you set your learning rate too high and unfortunately ended up in a local minimum where the loss is too high, is it better to retrain with a lower learning rate, or should you continue from the poorly performing model with a higher learning rate, in the hope that the loss will escape the local minimum?
In the strict sense, you don't have to retrain, as you can continue training with a lower learning rate (this is called a learning rate schedule). A very common approach is to lower the learning rate (usually by dividing by 10) each time the loss stagnates or becomes constant.
Another approach is to use an optimizer that scales the learning rate with the gradient magnitude, so the learning rate naturally decays as you get closer to the minima. Examples of this are ADAM, Adagrad and RMSProp.
In any case, make sure to find the optimal learning rate on a validation set; this will considerably improve performance and make learning faster. This applies both to plain SGD and to any other optimizer.
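For concreteness, the "divide the learning rate by 10 when the loss stagnates" schedule mentioned above can be implemented in Keras with the ReduceLROnPlateau callback. A minimal sketch with example values; the model and data below are placeholders, not part of the original question:

    import tensorflow as tf

    # Placeholder model and data; substitute your own.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")

    # Divide the learning rate by 10 whenever the validation loss has not
    # improved for 5 consecutive epochs (example values).
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.1, patience=5, min_lr=1e-6)

    x = tf.random.normal((256, 10))
    y = tf.random.normal((256, 1))
    model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[reduce_lr], verbose=0)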

Resources