How the gradient descent algorithm behaves when it doesn't reach the minimum

Sorry for this ugly image. When the gradient descent algorithm reaches the coefficient value where the error is minimal, as shown in the figure, how does the algorithm move forward to reach the global minimum? I am new to machine learning, so please bear with me for such a basic question.

Related

Determining the starting point of gradient descent

I have just learned that the starting point of gradient descent determines the ending point. So I wonder: how do we determine the right starting point so that we reach the global minimum and get the lowest cost?
Yes, for a general objective function, the starting point of gradient descent determines the ending point. This is a real complication: gradient descent may get stuck in suboptimal local minima. What can we do about that?
Convex optimization: Things are better if the objective is a convex function optimized over a convex domain: then any local minimum is also a global minimum, so gradient descent on a convex function won't get trapped in suboptimal local minima. Better yet, if the objective is strictly convex, there is at most one global minimum. For these reasons, optimization-based methods are frequently formulated as convex optimizations when possible. Logistic regression, for instance, is a convex optimization problem.
As Tarik said, a good meta-strategy is to run gradient descent multiple times from different random starting positions and keep the best result. This is sometimes called a "random restart" or "shotgun" gradient descent approach; a minimal sketch is shown below.
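Here is a minimal sketch of that idea in plain NumPy, on a made-up bumpy 1-D objective (all function names and constants are illustrative, not from any particular library):

    import numpy as np

    def gradient_descent(grad, x0, lr=0.01, steps=500):
        """Plain gradient descent from a given starting point."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - lr * grad(x)
        return x

    def random_restart_descent(f, grad, dim, n_restarts=20, scale=2.0, rng=None):
        """Run gradient descent from several random starts, keep the best end point."""
        rng = np.random.default_rng(rng)
        best_x, best_val = None, np.inf
        for _ in range(n_restarts):
            x0 = rng.uniform(-scale, scale, size=dim)   # random starting point
            x = gradient_descent(grad, x0)
            if f(x) < best_val:
                best_x, best_val = x, f(x)
        return best_x, best_val

    # A 1-D objective with two minima; only one of them is the global minimum.
    f = lambda x: (x**4 - 3 * x**2 + x).sum()
    grad = lambda x: 4 * x**3 - 6 * x + 1
    x_best, val_best = random_restart_descent(f, grad, dim=1)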
Twists on the basic gradient descent idea can also help avoid local minima. Stochastic gradient descent (SGD) (and, similarly, simulated annealing) takes noisier steps. This noise has a cumulative effect somewhat like optimizing a smoothed version of the objective, hopefully smoothing over smaller valleys. Another idea is to add a momentum term to gradient descent or SGD, with the intention that momentum will let the method roll through and escape shallow local minima; a sketch follows.
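A minimal momentum variant, again just a sketch with illustrative names and reusing the bumpy 1-D objective from the previous snippet:

    import numpy as np

    def momentum_descent(grad, x0, lr=0.01, beta=0.9, steps=500):
        """Gradient descent with a classical (heavy-ball) momentum term.

        v accumulates an exponentially decaying sum of past gradients, so the
        update can carry the iterate through small bumps in the loss surface.
        """
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(steps):
            v = beta * v - lr * grad(x)   # momentum update
            x = x + v
        return x

    grad = lambda x: 4 * x**3 - 6 * x + 1   # gradient of x**4 - 3*x**2 + x
    x_final = momentum_descent(grad, x0=[1.5])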
Finally, an interesting and practical attitude is simply to surrender and accept that gradient descent's solution may be suboptimal. A local minimum solution may yet be useful. For instance, if that solution represents trained weights for a neural network, what really counts is that the network generalizes and performs well on the test set, not that it is optimal on the training set.

Back propagation vs Levenberg Marquardt

Does anyone know the difference between backpropagation and Levenberg–Marquardt in neural network training? Sometimes I see LM described as a BP algorithm and sometimes I see the opposite.
Your help will be highly appreciated.
Thank you.
Those are two completely unrelated concepts.
Levenberg-Marquardt (LM) is an optimization method, while backprop is just the recursive application of the chain rule for derivatives.
What LM intuitively does is this: when it is far from a local minimum, it ignores the curvature of the loss and acts as gradient descent. However, as it gets closer to a local minimum it pays more and more attention to the curvature by switching from gradient descent to a Gauss-Newton like approach.
The LM method needs both the gradient and the Hessian, as it solves variants of (H + coeff*I) dx = -g, with H and g respectively the Hessian and the gradient. You can obtain the gradient via backpropagation. The Hessian is usually not as simple to obtain, although for least-squares losses you can use the Gauss-Newton approximation H ≈ 2 J^T J (with J the Jacobian of the residuals), which can be assembled from the same kind of backward passes.
For neural networks LM usually isn't really useful, as you can't construct such a huge Hessian, and even if you could, it lacks the sparse structure needed to invert it efficiently.
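To make the (H + coeff*I) dx = -g system concrete, here is a minimal sketch of one LM-style (damped Gauss-Newton) step on a toy least-squares problem. It is pure NumPy with illustrative names, no neural network involved, and constant factors are absorbed into the damping coefficient:

    import numpy as np

    def lm_step(jacobian, residuals, lam):
        """One damped Gauss-Newton (Levenberg-Marquardt) step.

        Solves (J^T J + lam * I) dx = -J^T r, i.e. the (H + coeff*I) dx = -g
        system above with H approximated by J^T J and g = J^T r.
        """
        J, r = np.asarray(jacobian), np.asarray(residuals)
        H_approx = J.T @ J
        g = J.T @ r
        n = H_approx.shape[0]
        return np.linalg.solve(H_approx + lam * np.eye(n), -g)

    # Tiny illustrative use: fit y ~ a*x + b with damped Gauss-Newton steps.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])
    params = np.zeros(2)                              # [a, b], arbitrary start
    for _ in range(20):
        r = params[0] * x + params[1] - y             # residuals
        J = np.stack([x, np.ones_like(x)], axis=1)    # d r / d params
        params = params + lm_step(J, r, lam=1e-3)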

Short Definition of Backpropagation and Gradient Descent

I need to write a very short definition of backpropagation and gradient descent, and I'm a bit confused about what the difference is.
Is the following definition correct?:
To calculate the weights of a neural network, the backpropagation algorithm is used. It is an optimization process for reducing the model error. The technique is based on a gradient descent method. Specifically, the contribution of each weight to the total error is calculated backwards, from the output layer across all hidden layers to the input layer. For this, the partial derivative of the error function E with respect to w is calculated. The resulting gradient is used to adjust the weights in the direction of steepest descent:
w_new = w_old - learning_rate * (∂E / ∂w_old)
Any suggestions or corrections?
Thanks!
First of all, gradient descent is just one of the optimization methods you can use when training with backpropagation; backpropagation itself is how the gradients are computed. Other than that, your definition is correct. We compare the generated output with the desired value and adjust the weights assigned to each edge so as to make the error as low as possible (some variants also check whether the error increased after an update and revert to the previous weights if it did). The learning rate you choose should not be very low or very high: if it is too low, training will be extremely slow; if it is too high, the updates will overshoot or even diverge, and you won't be able to reach the minimum error.
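As a concrete, purely illustrative example of the update rule from the question, here is a toy NumPy sketch of a single linear neuron trained with squared error (all names and constants are made up):

    import numpy as np

    # Minimal sketch of w_new = w_old - learning_rate * dE/dw
    # for a single linear neuron with mean squared error.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 3))            # 100 samples, 3 input features
    true_w = np.array([2.0, -1.0, 0.5])
    y = x @ true_w                            # targets for this toy problem

    w = np.zeros(3)                           # initial weights
    learning_rate = 0.1
    for _ in range(200):
        y_hat = x @ w                         # forward pass
        error = y_hat - y
        grad = 2 * x.T @ error / len(x)       # dE/dw for mean squared error
        w = w - learning_rate * grad          # the gradient descent step
    # w is now close to true_w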

Why do we need to normalize the input to zero mean and unit variance before feeding it to a network?

In deep learning, I have seen many papers apply normalization as a pre-processing step: the input is normalized to zero mean and unit variance before being fed to the convolutional network (which uses BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important to bring features onto the same scale, so that the network behaves much better. Assume there are two features, where one is measured on a scale of 1 to 10 and the second on a scale of 1 to 10,000. In terms of a squared error function, the network will be busy optimizing the weights according to the larger errors coming from the second feature.
Therefore it is better to normalize.
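A minimal sketch of what that pre-processing step typically looks like (plain NumPy; the shapes and constants are illustrative):

    import numpy as np

    # Zero-mean / unit-variance ("standardization") pre-processing.
    # Statistics are computed on the training set only and reused for test data.
    x_train = np.random.rand(1000, 10) * 1000      # hypothetical raw features
    x_test = np.random.rand(200, 10) * 1000

    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8               # avoid division by zero

    x_train_norm = (x_train - mean) / std
    x_test_norm = (x_test - mean) / std            # same statistics as training data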
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have very different scales and will slow down gradient descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The contour plot of these will look something like this (scaled for easier visibility).
[Figure: contour plot of Feature1 and Feature2]
When you perform gradient descent, you calculate the gradient of the loss with respect to each feature's weight in order to move the model weights closer to minimizing the loss. As is evident from the contour plot above, those two gradient components differ by orders of magnitude, so even if you choose a reasonably medium learning rate, you will zig-zag along the steep direction and may even miss the minimum.
[Figure: gradient descent path with a medium learning rate]
To avoid this, if you instead choose a very small learning rate, gradient descent will take a very long time to converge and you may stop training before reaching the minimum.
[Figure: gradient descent path with a very small learning rate]
So, as you can see from the above examples, not scaling your features leads to inefficient gradient descent and can keep you from finding the optimal model.
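Here is a small, purely illustrative NumPy experiment in the spirit of the Feature1/Feature2 example above (ranges, targets, and learning rates are made up): with raw features, the largest stable learning rate is tiny and the loss barely improves, while the same budget on standardized features converges comfortably.

    import numpy as np

    def descend(x, y, lr, steps=500):
        """Plain gradient descent for least squares y ~ x @ w; returns final loss."""
        w = np.zeros(x.shape[1])
        for _ in range(steps):
            grad = 2 * x.T @ (x @ w - y) / len(x)
            w -= lr * grad
        return np.mean((x @ w - y) ** 2)

    rng = np.random.default_rng(0)
    n = 500
    feature1 = rng.uniform(10, 10000, n)          # large-scale feature
    feature2 = rng.uniform(1e-5, 1e-3, n)         # tiny-scale feature
    x_raw = np.stack([feature1, feature2], axis=1)
    y = 3e-4 * feature1 + 2e3 * feature2 + rng.normal(0, 0.01, n)
    y = y - y.mean()                               # no bias term in this toy model

    x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)

    # Raw features: lr must be tiny to stay stable, so the loss stays high.
    # Standardized features: a normal lr converges within the same step budget.
    print("raw:   ", descend(x_raw, y, lr=1e-8))
    print("scaled:", descend(x_scaled, y, lr=0.1))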

How learning rate influences gradient descent?

When gradient descent quantitatively suggests by how much the biases and weights should be reduced, what is the learning rate doing? I am a beginner, so could someone please enlighten me on this?
The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While this might be a good idea (using a low learning rate) in terms of making sure we do not miss any local minima, it also means we will take a long time to converge, especially if we get stuck on a plateau region.
new_weight = existing_weight - learning_rate * gradient
If the learning rate is too small, gradient descent can be slow.
If the learning rate is too large, gradient descent can overshoot the minimum. It may fail to converge, or it may even diverge.
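A tiny sketch of that behaviour on a one-dimensional quadratic (the function and learning rates here are made up purely for illustration):

    # The same quadratic loss minimized with three different learning rates.
    def minimize_quadratic(learning_rate, steps=50):
        """Gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3."""
        w = 0.0
        for _ in range(steps):
            gradient = 2 * (w - 3)                  # d/dw of (w - 3)^2
            w = w - learning_rate * gradient        # new_weight = existing_weight - lr * gradient
        return w

    print(minimize_quadratic(0.001))  # too small: still far from 3 after 50 steps
    print(minimize_quadratic(0.1))    # reasonable: very close to 3
    print(minimize_quadratic(1.1))    # too large: overshoots and diverges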
