What will happen if I multiply the loss function by a constant? I think I will get a larger gradient, right? Is that equivalent to having a larger learning rate?
Basically, it depends on several things:
If you use a classic stochastic / batch / full batch learning with an update rule, where:
new_weights = old_weights - learning_rate * gradient
then, due to the commutativity of scalar multiplication, your claim is true: multiplying the loss by a constant c multiplies the gradient by c, which is exactly equivalent to multiplying the learning rate by c.
If you are using a learning method with an adaptive learning rate (like Adam or RMSprop), then things change a little. Your gradients are still scaled by the multiplication, but the effective step size may barely change at all: with Adam, for example, both the first and second moment estimates scale with the gradient, so the constant largely cancels out of the update (up to the epsilon constant). In general it depends on how the new value of the cost function interacts with the learning algorithm.
If you use a learning method with an adaptive gradient but not an adaptive learning rate (e.g. momentum methods), the effect is usually the same as in point 1: the constant acts like a change of learning rate.
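To make points 1 and 2 concrete, here is a minimal numpy sketch; the quadratic loss and all constants are made up for illustration:

import numpy as np

# Toy 1-D example: loss L(w) = (w - 3)^2, so dL/dw = 2*(w - 3).
def grad(w, scale=1.0):
    return scale * 2.0 * (w - 3.0)  # scaling the loss scales the gradient

w0, lr, c = 10.0, 0.1, 5.0

# Point 1 (plain SGD): scaling the loss by c gives exactly the same step
# as scaling the learning rate by c.
print(w0 - lr * grad(w0, scale=c))  # 3.0
print(w0 - (c * lr) * grad(w0))     # 3.0

# Point 2 (Adam-style update): the step is roughly m / (sqrt(v) + eps);
# scaling the gradient by c scales m by c and sqrt(v) by c, so c cancels
# (up to eps) and the effective step barely changes.
g = grad(w0)
for scale in (1.0, 5.0):
    m, v = scale * g, (scale * g) ** 2  # one-step moment estimates
    print(m / (np.sqrt(v) + 1e-8))      # ~1.0 in both cases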
Yes, you are right. It is equivalent to changing the learning rate.
A few weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in a polynomial fit of the data but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth-order polynomial, which seemed to be the best option after I tried others. The algorithm always converges to the same minimum no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or the steepest descent direction depending on the iteration
Central differences and forward differences for the finite difference method
I don't have experience in nonlinear least squares, so I don't know if there is a way to reduce the residual further or if there is no more room for improvement with this method. I attach below an image of the behavior of the polynomial over the last iterations. If I run the code for more iterations, the curve ends up not changing from iteration to iteration. As can be seen, there is a good fit from time = 0 to time = 12, but I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help will be much appreciated.
Fitting a polynomial does not seem like the best idea. Your data set looks like an exponential transient with a horizontal asymptote; forcing a polynomial onto that will work very poorly.
I'd rather try a simple model, such as
A (1 - e^(-at)).
By eye, A ≈ 15. You should have a look at the values of log(15 - y): if the model is right, they should be roughly linear in t.
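If it helps, here is a Python/scipy sketch of fitting that model (not the asker's Matlab code; the data and the constants A = 15, a = 0.4 are synthetic stand-ins):

import numpy as np
from scipy.optimize import curve_fit

# Synthetic data standing in for the asker's time series (time 0..20).
t = np.linspace(0, 20, 50)
y = 15.0 * (1.0 - np.exp(-0.4 * t)) + np.random.normal(0.0, 0.2, t.size)

def model(t, A, a):
    return A * (1.0 - np.exp(-a * t))

# Initial guess: A from the apparent asymptote, a of order one.
(A_fit, a_fit), _ = curve_fit(model, t, y, p0=(15.0, 0.5))
print(A_fit, a_fit)

# Quick check of the model form: if y ~ A*(1 - e^(-a t)), then
# log(A - y) ~ log(A) - a*t, i.e. roughly linear in t.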
When I increase or decrease the batch size of the mini-batch used in SGD, should I change the learning rate? If so, how?
For reference, I was discussing this with someone, and they said that when the batch size is increased, the learning rate should be decreased by some extent.
My understanding is that when I increase the batch size, the computed average gradient will be less noisy, so I would either keep the same learning rate or increase it.
Also, if I use an adaptive learning rate optimizer, like Adam or RMSProp, then I guess I can leave learning rate untouched.
Please correct me if I am mistaken and give any insight on this.
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient estimate constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e. multiply your learning rate by k when using a mini-batch size of kN.
See P.Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour https://arxiv.org/abs/1706.02677
I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate may remain the same if the batch size does not change substantially.
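As a quick illustration of the two rules above, here is a small hypothetical helper (the function name and the example numbers are my own, not from either paper):

def scale_learning_rate(base_lr, base_batch, new_batch, rule="linear"):
    # Hypothetical helper implementing the two scaling rules discussed above.
    k = new_batch / base_batch
    if rule == "linear":   # Goyal et al. (2017)
        return base_lr * k
    if rule == "sqrt":     # Krizhevsky (2014)
        return base_lr * k ** 0.5
    raise ValueError(rule)

# E.g. a learning rate tuned to 0.1 at batch size 256, moving to 1024:
print(scale_learning_rate(0.1, 256, 1024, rule="linear"))  # 0.4
print(scale_learning_rate(0.1, 256, 1024, rule="sqrt"))    # 0.2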
Learning Rate Scaling for Dummies
I've always found the heuristics, which seem to vary somewhere between scaling with the square root of the batch size and scaling with the batch size itself, to be a bit hand-wavy and fluffy, as is often the case in Deep Learning. Hence I devised my own theoretical framework to answer this question.
EDIT: Since posting this answer, my paper on this topic has been published in the Journal of Machine Learning Research (https://www.jmlr.org/papers/volume23/20-1258/20-1258.pdf). I want to thank the stackoverflow community for believing in my ideas and engaging with and probing me, at a time when the research community dismissed me out of hand.
Learning Rate is a function of the Largest Eigenvalue
Let me start with two small sub-questions, which together answer the main question.
Are there any cases where we can a priori know the optimal learning rate?
Yes: for the convex quadratic, the optimal learning rate is given as 2/(λ+μ), where λ and μ represent the largest and smallest eigenvalues of the Hessian (the Hessian = the second derivative of the loss, ∇∇L, which is a matrix), respectively.
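As a tiny illustration, here is a numpy snippet computing that optimal rate for a made-up 2x2 positive definite Hessian:

import numpy as np

# Convex quadratic L(w) = 0.5 * w^T H w with an illustrative Hessian.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
eigs = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
mu, lam = eigs[0], eigs[-1]    # smallest and largest eigenvalues
print(2.0 / (lam + mu))        # the optimal learning rate 2/(lambda + mu)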
How do we expect these eigenvalues (which represent how much the loss changes along an infinitesimal move in the direction of the eigenvectors) to change as a function of batch size?
This is actually a little more tricky to answer (it is what I made the theory for in the first place), but it goes something like this.
Let us imagine that we have all the data; that would give us the full Hessian H. But instead we only sub-sample this Hessian, so we use a batch Hessian B. We can simply re-write B = H + (B - H) = H + E, where E is now some error or fluctuation matrix.
Under some technical assumptions on the nature of the elements of E, we can assume these fluctuations to form a zero-mean random matrix, so the batch Hessian becomes a fixed matrix plus a random matrix.
For this model, the change in the eigenvalues (which determines how large the learning rate can be) is known. In my paper there is another, fancier model, but the answer is more or less the same.
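Here is a toy numpy simulation of this fixed-matrix-plus-random-matrix picture; the 1/sqrt(batch) noise scale and the chosen spectrum are assumptions for illustration, not the paper's exact model:

import numpy as np

rng = np.random.default_rng(0)
d = 200
# Fixed "full-data" Hessian with one large outlying eigenvalue (= 10).
H = np.diag([10.0] + [1.0] * (d - 1))

for batch in (8, 32, 128, 512):
    sigma = 1.0 / np.sqrt(batch)   # assumed: fluctuations shrink like 1/sqrt(batch)
    E = rng.normal(0.0, sigma, (d, d))
    E = (E + E.T) / 2.0            # symmetric zero-mean fluctuation matrix
    # Largest eigenvalue of the batch Hessian shrinks toward 10 as batch grows,
    # so larger learning rates become usable.
    print(batch, np.linalg.eigvalsh(H + E)[-1])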
What actually happens? Experiments and Scaling Rules
I attach a plot of what happens when the largest eigenvalue of the full-data matrix is far outside that of the noise matrix (usually the case). As we increase the mini-batch size, the size of the noise matrix decreases, so the largest eigenvalue also decreases and larger learning rates can be used. This effect is initially proportional and continues to be approximately proportional until a threshold, after which no appreciable decrease happens.
How well does this hold in practice? The answer, as shown below in my plot for VGG-16 without batch norm (see the paper for batch normalisation and resnets), is: very well.
I would hasten to add that for adaptive gradient methods, if you use a small numerical stability constant (epsilon for Adam), the argument is a little different, because you have an interplay of the eigenvalues, the estimated eigenvalues and your stability constant! So you actually end up with a square-root rule up to a threshold. Quite why nobody is discussing this or has published this result is honestly a little beyond me.
But if you want my practical advice: stick with SGD, scale the learning rate proportionally to the increase in batch size while the batch size is small, and don't increase it beyond a certain point.
Apart from the papers mentioned in Dmytro's answer, you can refer to: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. The authors give a mathematical and empirical foundation for the idea that the ratio of learning rate to batch size influences the generalization capacity of a DNN. They show that this ratio plays a major role in the width of the minima found by SGD: the higher the ratio, the wider the minima and the better the generalization.
If gradient descent already quantitatively suggests by how much the biases and weights should be changed, what is the learning rate doing? I am a beginner; someone please enlighten me on this.
The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea in terms of making sure that we do not miss any local minima, it can also mean that we will take a long time to converge, especially if we get stuck on a plateau region.
new_weight = existing_weight - learning_rate * gradient
If the learning rate is too small, gradient descent can be slow.
If the learning rate is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
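A minimal sketch that makes both failure modes visible on L(w) = w^2 (the step counts and rates are arbitrary):

# Gradient descent on L(w) = w^2 (gradient 2w, minimum at w = 0).
def run_gd(lr, steps=20, w=5.0):
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

print(run_gd(0.01))  # too small: still far from the minimum after 20 steps
print(run_gd(0.5))   # well chosen: reaches the minimum immediately here
print(run_gd(1.1))   # too large: the iterates overshoot and diverge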
Adaptive stochastic optimization algorithms like Adam, RMSProp and Adagrad are known for adaptively changing the effective step size in the course of learning the weights.
However, when working with such algorithms, Keras provides the option to set the learning rate. Why would you do this if a proper value is found adaptively otherwise?
This option for Adam does not let you manually fix the effective step size itself, only its base (initial) value; the optimizer will adjust its effective learning rate accordingly.
I'm not 100% sure if the same goes for the other optimizers, but I would assume so.
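For concreteness, this is how you pass the value in TF2-style Keras (the tiny model below is just a placeholder):

import tensorflow as tf

# learning_rate sets Adam's base step size (0.001 is the Keras default);
# the per-parameter adaptation then happens around this value.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")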
I am implementing a generic module for Stochastic Gradient Descent. It takes as arguments: the training dataset, loss(x, y), and dw(x, y), i.e. the per-sample loss and per-sample gradient.
Now, for the convergence criteria, I have thought of:
a) Checking the loss function after every 10% of dataset.size, averaged over some window
b) Checking the norm of the differences between weight vectors, after every 10-20% of dataset size
c) Stabilization of the error on the training set
d) Change in the sign of the gradient (again, checked at fixed intervals)
I have noticed that these checks (the precision of the check, etc.) also depend on other things, like the step size and learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind about what the generic stopping criterion should be, regardless of the training set and the f(x), df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector? That is, given dw_i and dw_(i+1), how do I detect the change of sign, and does it even have a meaning in more than 2 dimensions?
P.S. Apologies for the non-math/latex symbols... still getting used to the stuff.
First, stochastic gradient descent is the online version of the gradient descent method; its update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example. The stopping criterion of SGD for an N-dimensional weight vector is usually that the norm of the gradient falls below some tolerance ε, i.e.:
||∇f(x)|| ≤ ε
See [1] or [2] for details.
Second, there is a further twist on stochastic gradient descent using so-called "minibatches". It works identically to SGD, except that it uses more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers. See [3].
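For what it's worth, here is a minimal sketch of such a module with a gradient-norm stopping criterion; the interface mirrors the question's dataset/dw arguments, but the names and thresholds are my own assumptions:

import numpy as np

def sgd(data, dw, w0, lr=0.01, eps=1e-4, max_epochs=100):
    # data: list of (x, y) samples; dw(w, x, y): per-sample gradient.
    # Stops when the gradient norm, averaged over an epoch, drops below eps.
    w = np.asarray(w0, dtype=float)
    for _ in range(max_epochs):
        norms = []
        for x, y in data:
            g = dw(w, x, y)
            w = w - lr * g
            norms.append(np.linalg.norm(g))
        if np.mean(norms) < eps:
            break
    return w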