I learnt gradient descent through online resources (namely the Machine Learning course on Coursera). However, the information provided only said to repeat gradient descent until it converges.
Their definition of convergence was to use a graph of the cost function relative to the number of iterations and watch when the graph flattens out. Therefore I assume that I would do the following:
if (change_in_costfunction > precisionvalue) {
    repeat gradient_descent
}
Alternatively, I was wondering if another way to determine convergence is to watch each coefficient approach its true value:
if (change_in_coefficient_j > precisionvalue) {
    repeat gradient_descent_for_j
}
...repeat for all coefficients
So is convergence based on the cost function or on the coefficients? And how do we determine the precision value? Should it be a percentage of the coefficient or of the total cost function?
You can picture how Gradient Descent (GD) works by imagining that you throw a marble inside a bowl and start taking photos. The marble will oscillate until friction stops it at the bottom. Now imagine an environment where friction is so small that the marble takes a long time to stop completely, so we can assume that when the oscillations are small enough the marble has reached the bottom (although it could keep oscillating). In the following image you can see the first eight steps (photos of the marble) of GD.
If we keep taking photos, the marble makes no appreciable movements; you would have to zoom into the image to see them:
We could keep taking photos and the movements would become ever more negligible.
So reaching a point where GD makes only very small changes to your objective function is called convergence. That doesn't mean it has reached the optimal result (but it is usually very near it, if not on it).
The precision value can be chosen as the threshold at which consecutive iterations of GD are almost the same:
grad(i) = 0.0001
grad(i+1) = 0.000099989 <-- grad has changed by only about 0.01% => STOP
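For concreteness, here is a minimal Python sketch (my own illustration, not code from the course) of stopping gradient descent when the relative change in the cost function drops below a precision value; cost() and grad() are stand-ins for a simple linear-regression model.

import numpy as np

def cost(theta, X, y):
    # Mean squared error with the conventional 1/2 factor.
    return np.mean((X @ theta - y) ** 2) / 2

def grad(theta, X, y):
    # Gradient of the cost above with respect to theta.
    return X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, alpha=0.01, precision=1e-4, max_iters=10000):
    theta = np.zeros(X.shape[1])
    prev = cost(theta, X, y)
    for _ in range(max_iters):
        theta -= alpha * grad(theta, X, y)
        current = cost(theta, X, y)
        # Stop when the relative change in the cost is below `precision`,
        # mirroring the ~0.01% rule of thumb above. One could instead test
        # the change in the coefficients, as in the question's second snippet.
        if abs(prev - current) / max(prev, 1e-12) < precision:
            break
        prev = current
    return theta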
I think I understand your question. Based on my understanding, GD is driven by the cost function: it iterates until the cost function converges.
Imagine plotting a graph of the cost function (y-axis) against the number of iterations of GD (x-axis).
Now, if GD works properly, the curve is decreasing and concave up (similar to that of 1/x). Since the curve is decreasing, the decrease in the cost function becomes smaller and smaller, and there comes a point where the curve is almost flat. Around that point, we say GD has more or less converged (again, where the cost function decreases by less than the precision_value per iteration).
So, I would say your first approach is what you need:
if (change_in_costFunction > precision_value)
    repeat GD;
As far as I know, to calculate the weights and bias for simple linear regression, we follow the gradient descent algorithm, which works by finding the global minimum of the cost function (curve). And that cost function is calculated by randomly choosing a set of weights and then calculating the mean error over all the records. In that way we get a point on the cost curve. Then another set of weights is randomly chosen and the mean error is calculated again. So all these points make up the cost curve.
My doubt is: how many times are the weights randomly chosen to get the points before the cost curve (the cost function) is determined?
Thanks in advance.
The gradient descent algorithm iterates until convergence.
By convergence, we mean that the global minimum of the convex cost function has been found.
There are basically two ways people use to check for convergence.
Automatic convergence test: declare convergence if the cost function decreases by less than e in an iteration, where e is some small value such as 10^-3. However, it is difficult to choose this threshold value in practice.
Plot the cost function against iterations: plotting the cost function against the iteration number can give you a fair idea about convergence. It can also be used for debugging (the cost function must decrease on every iteration).
For example, in this figure I can deduce that I need around 300-400 iterations of gradient descent.
This also lets you compare different learning rates (alpha) against the number of iterations, as in the sketch below.
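A minimal sketch (with made-up toy data, not taken from the answer) of that debugging plot: the cost per iteration of gradient descent on a simple linear regression, drawn for several learning rates.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]     # bias column + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

def run_gd(alpha, iters=400):
    theta = np.zeros(2)
    costs = []
    for _ in range(iters):
        err = X @ theta - y
        costs.append(np.mean(err ** 2) / 2)       # should decrease every iteration
        theta -= alpha * X.T @ err / len(y)
    return costs

for alpha in (0.001, 0.01, 0.1):
    plt.plot(run_gd(alpha), label="alpha = %g" % alpha)
plt.xlabel("iterations")
plt.ylabel("cost J(theta)")
plt.legend()
plt.show()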
Some weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in fitting a polynomial to the data, but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth-order polynomial; I tried other polynomials and it seemed to be the best option. The algorithm always converges to the same minimum no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or the steepest descent direction depending on the iteration.
Central differences and forward differences for the finite difference method
I don't have experience in nonlinear least squares, so I don't know if there is a way to minimize the residual even more or if there isn't more room for improvement with this method. I attach below an image of the behavior of the polynomial for the last iterations. If I run the code for more iterations, the curve ends up not changing from iteration to iteration. As can be observed, there is a good fit from time = 0 to time = 12, but I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help would be much appreciated.
Fitting a polynomial does not seem to be the best idea. Your data set looks like an exponential transient with a horizontal asymptote; forcing a polynomial onto that will work very poorly.
I'd rather try with a simple model, such as
A (1 - e^(-at)).
To the naked eye, A ~ 15. You should have a look at the values of log(15 - y).
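A minimal sketch (in Python/SciPy rather than Matlab, with made-up data standing in for the poster's) of fitting the suggested model with scipy.optimize.curve_fit:

import numpy as np
from scipy.optimize import curve_fit

# The two-parameter exponential transient suggested above.
def transient(t, A, a):
    return A * (1.0 - np.exp(-a * t))

# Hypothetical data with the shape described in the question (a rise towards ~15).
t = np.linspace(0, 20, 200)
y = 15.0 * (1.0 - np.exp(-0.4 * t)) + np.random.normal(0.0, 0.2, t.size)

# Initial guesses: A ~ 15 from the plateau; a from the apparent rise time.
popt, pcov = curve_fit(transient, t, y, p0=[15.0, 0.5])
print("A = %.3f, a = %.3f" % tuple(popt))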
I just finished implementing a convolutional neural network from scratch; this is the first time I've done this. When testing my backpropagation algorithm, the output delta values for the weights are extremely large compared to the original values. For example, all my weights are initialized to a random number between -0.1 and 0.1, but the delta values that come out are around 75000. This is obviously much too big a change, and it requires a very small learning rate to even be near functional. A learning rate like 0.01 seems to be the convention, but mine needs to be at least 0.0000001, leading me to believe I'm doing something wrong.
The thing is, I don't see how the deltas couldn't be large. To get the derivative of the weights with respect to the cost function, I convolve the activations of the previous layer (mostly positive due to leaky ReLU) with the previous errors (all either 0.1 or 1 due to the derivative of leaky ReLU). Obviously the sum of all these positive numbers will get very large as it propagates through the layers. Did I skip a step somewhere? Is this an exploding gradient problem? Should I use gradient clipping or batch normalization?
Depending on the size of the convolutions, -0.1 to 0.1 seems extremely large. Try something like 0.01 or even less.
If you want a more principled initialization, you can take a look at the Glorot (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi) or He (https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initializations.
The crux is to initialize with either uniform or Gaussian values with mean 0 and a standard deviation scaled by the inverse square root of the number of input units (fan-in), roughly as in the sketch below.
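A short sketch (my own, not the poster's code) of He- and Glorot-style initialization for a conv layer stored as (out_channels, in_channels, kh, kw); the shapes are illustrative.

import numpy as np

def he_init(out_c, in_c, kh, kw, rng=np.random.default_rng(0)):
    fan_in = in_c * kh * kw                      # inputs feeding each output unit
    std = np.sqrt(2.0 / fan_in)                  # He: suited to ReLU-family activations
    return rng.normal(0.0, std, size=(out_c, in_c, kh, kw))

def glorot_init(out_c, in_c, kh, kw, rng=np.random.default_rng(0)):
    fan_in = in_c * kh * kw
    fan_out = out_c * kh * kw
    limit = np.sqrt(6.0 / (fan_in + fan_out))    # Glorot uniform
    return rng.uniform(-limit, limit, size=(out_c, in_c, kh, kw))

W = he_init(16, 3, 3, 3)                         # 16 filters, 3 input channels, 3x3 kernels
print(W.std())                                   # roughly sqrt(2 / 27) ~= 0.27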
Some TensorFlow examples calculate the cost function like this:
cost = tf.reduce_sum((pred-y)**2 / (2*n_samples))
So the divisor is the number of samples multiplied by two.
Is the reason for the extra factor of 2 so that, when the cost function is differentiated for backpropagation, the 1/2 cancels the 2 coming from the derivative of the square and saves an operation?
If so, is it still recommended, and does it actually provide a significant performance improvement?
It's convenient in math, because one doesn't need to carry the 0.5 all along. But in code, it doesn't make a big difference, because this change makes the gradients (and, correspondingly, the updates of trainable variables) two times bigger or smaller. Since the updates are multiplied by the learning rate, this factor of 2 can be undone by a minor change of the hyperparameter. I say minor, because it's common to try the learning rates in log-scale during model selection anyway: 0.1, 0.01, 0.001, ....
As a result, no matter what particular formula is used in the loss function, its effect is negligible and doesn't lead to any training speed up. The choice of the right learning rate is more important.
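A quick numeric check (my own illustration, using a plain linear model in NumPy rather than TensorFlow) that the factor of 2 only rescales the gradient and can therefore be absorbed into the learning rate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
n = len(y)

err = X @ w - y
grad_with_half = X.T @ err / n             # gradient of sum((pred - y)**2) / (2*n)
grad_without_half = 2.0 * X.T @ err / n    # gradient of mean((pred - y)**2)

print(np.allclose(2.0 * grad_with_half, grad_without_half))   # True

# An update lr * grad_without_half equals (2*lr) * grad_with_half, so the
# choice of formula is just a relabelling of the learning rate.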
The question is how the learning rate influences the convergence rate and convergence itself.
If the learning rate is constant, will the Q function converge to the optimal one, or must the learning rate necessarily decay to guarantee convergence?
The learning rate sets the magnitude of the step that is taken towards the solution.
It should not be too big, as the updates may continuously oscillate around the minimum, and it should not be too small, or else it will take a lot of time and iterations to reach the minimum.
The reason decay is advised for the learning rate is that initially, when we are at a totally random point in the solution space, we need to take big leaps towards the solution; later, when we come close to it, we make small jumps and hence small refinements to finally reach the minimum.
An analogy can be made with golf: when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later, when he reaches the flagged area, he chooses a different club for an accurate short shot.
It's not that he couldn't get the ball into the hole without choosing the short-shot club; he may just send the ball past the target two or three times. But it would be best if he played optimally and used the right amount of power to reach the hole. The same goes for a decayed learning rate.
The learning rate must decay but not too fast.
The conditions for convergence are the following (sorry, no latex):
sum(alpha(t), 1, inf) = inf
sum(alpha(t)^2, 1, inf) < inf
Something like alpha = k/(k+t) can work well.
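For illustration, a minimal tabular Q-learning sketch (the environment, sizes, and the constant k are made-up placeholders) using the alpha = k/(k+t) schedule, which satisfies both conditions above:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
gamma = 0.9
k = 100.0                                  # assumed constant; tune per problem

for t in range(1, 10001):
    alpha = k / (k + t)                    # sum(alpha) = inf, sum(alpha^2) < inf

    # Hypothetical transition (s, a, r, s'); replace with a real environment step.
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    r = rng.normal()
    s_next = rng.integers(n_states)

    # Standard Q-learning update with the decaying learning rate.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])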
This paper discusses exactly this topic:
http://www.jmlr.org/papers/volume5/evendar03a/evendar03a.pdf
It should decay; otherwise there will be fluctuations that keep provoking small changes in the policy.