How does learning rate influence gradient descent? - machine-learning

When gradient descent quantitatively suggests by how much the biases and weights should be reduced, what is the learning rate doing? I am a beginner; someone please enlighten me on this.

Learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While this might be a good idea (using a low learning rate) in terms of making sure that we do not miss any local minima, it could also mean that we'll take a long time to converge, especially if we get stuck on a plateau region.
new_weight = existing_weight - learning_rate * gradient
If the learning rate is too small, gradient descent can be slow.
If the learning rate is too large, gradient descent can overshoot the minimum; it may fail to converge, or it may even diverge.
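To make the update rule concrete, here is a minimal sketch (my own toy example, not from the original answer) that applies it to the one-dimensional loss f(w) = (w - 3)^2 with a small, a moderate, and a deliberately too-large learning rate; all specific numbers are arbitrary.

def gradient_descent(lr, steps=50, w0=0.0):
    """Minimize f(w) = (w - 3)**2 with the rule new_w = w - lr * gradient."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)**2
        w = w - lr * grad    # new_weight = existing_weight - learning_rate * gradient
    return w

for lr in (0.01, 0.1, 1.1):  # too small, reasonable, too large
    print(f"lr={lr}: final w = {gradient_descent(lr):.4f} (minimum is at w = 3)")
# lr=0.01 crawls slowly towards 3, lr=0.1 reaches it, lr=1.1 overshoots and diverges.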

Related

How is the cost function curve actually calculated for gradient descent, i.e. how many times does the model choose the weights randomly?

As far as I know, to calculate the weights and bias for simple linear regression, we follow the gradient descent algorithm, which works by finding the global minimum of the cost function (curve). And that cost function is calculated by randomly choosing a set of weights and then calculating the mean error over all the records. In that way we get a point on the cost curve. Then another set of weights is chosen and the mean error is calculated again. All these points make up the cost curve.
My doubt is: how many times are the weights randomly chosen to get these points before the cost curve is obtained?
Thanks in advance.
The gradient descent algorithm iterates until convergence.
By convergence, we mean that the global minimum of the convex cost function has been found.
There are basically two ways people check for convergence.
Automatic convergence test: declare convergence if the cost function decreases by less than ε in an iteration, where ε is some small value such as 10^-3. However, it is difficult to choose this threshold value in practice.
Plot the cost function against iterations: plotting the cost against the iteration number gives you a fair idea about convergence. It can also be used for debugging (the cost must decrease on every iteration).
For example, from the figure I can deduce that I need around 300-400 iterations of Gradient Descent.
Also, this enables you to check different learning rates (alpha) against the number of iterations.
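As an illustration of the automatic convergence test, here is a rough sketch (my own, with an arbitrary toy dataset and learning rate, and the threshold ε = 10^-3 mentioned above) for simple linear regression; the recorded costs can then be plotted against the iteration number as described.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0
alpha, eps = 0.1, 1e-3            # learning rate and convergence threshold
costs = []

for it in range(10_000):
    pred = w * X[:, 0] + b
    err = pred - y
    cost = np.mean(err ** 2) / 2   # one-half mean squared error
    costs.append(cost)
    # automatic convergence test: stop when the cost decreases by less than eps
    if it > 0 and costs[-2] - costs[-1] < eps:
        break
    # gradient descent update on w and b
    w -= alpha * np.mean(err * X[:, 0])
    b -= alpha * np.mean(err)

print(f"stopped after {it} iterations, cost = {cost:.4f}")
# 'costs' can be plotted against the iteration number to visually check convergence.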

Learning rate too large: how does this affect the loss function for logistic regression using batch gradient descent?

Question: if the learning rate (α) is too large, what happens to the graph, and how does this affect the loss function over iterations?
I've read somewhere that the graph may not converge, or that there could be many fluctuations in it; I would just like to be clear on this. I'm also unsure how this would look when the loss function is plotted.
With a well-chosen learning rate your loss function should decrease over time, iteration by iteration and epoch by epoch; if the learning rate is too large, the loss can instead fluctuate from iteration to iteration or fail to converge at all.
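To make this concrete, here is a small sketch (my own toy setup; data, noise level and learning rates are arbitrary) of logistic regression trained with batch gradient descent under a reasonable and an excessively large learning rate.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
# a noisy, non-separable toy problem so the optimum is finite
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=200) > 0).astype(float)

def train(lr, epochs=30):
    w = np.zeros(2)
    losses = []
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))        # sigmoid predictions
        # cross-entropy loss on the full batch of 200 samples
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
        grad = X.T @ (p - y) / len(y)       # batch gradient
        w -= lr * grad
    return losses

for lr in (0.1, 50.0):                       # reasonable vs. far too large
    print(f"lr={lr}: first losses {np.round(train(lr)[:5], 3)}")
# With lr=0.1 the loss curve typically decreases smoothly; with lr=50 the iterate
# overshoots, so the plotted loss jumps around instead of settling down.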

Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply normalization as a pre-processing step: the input is normalized to zero mean and unit variance before being fed to the convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important to bring the features onto the same scale, so that the network behaves much better. Let's assume there are two features, one measured on a scale of 1 to 10 and the second on a scale of 1 to 10,000. In terms of the squared error function, the network will be busy optimizing the weights according to the larger error on the second feature.
Therefore it is better to normalize.
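For concreteness, here is a minimal NumPy sketch (my own example, with arbitrary feature ranges) of the zero-mean, unit-variance normalization being discussed, applied per feature.

import numpy as np

rng = np.random.default_rng(0)
# two features on very different scales, as in the example above
feature1 = rng.uniform(1, 10, size=1000)
feature2 = rng.uniform(1, 10_000, size=1000)
X = np.column_stack([feature1, feature2])

# standardize each feature to zero mean and unit variance
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std

print("before:", X.mean(axis=0).round(1), X.std(axis=0).round(1))
print("after: ", X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))
# At test time the *training* mean and std would be reused on new data.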
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The Contour plot of these will look something like this (scaled for easier visibility).
Contour plot of Feature1 and Feature2
When you perform Gradient Descent, you will calculate d(Feature1) and d(Feature2), where "d" denotes the differential, in order to move the model weights closer to minimizing the loss. As is evident from the contour plot above, d(Feature1) is going to be significantly smaller than d(Feature2), so even if you choose a reasonable, medium learning rate you will be zig-zagging around because of the relatively large values of d(Feature2), and you may even miss the global minimum.
Medium value of learning rate
In order to avoid this, if you choose a very small learning rate, Gradient Descent will take a very long time to converge and you may stop training even before reaching the global minimum.
Very small Gradient Descent
So, as you can see from the examples above, not scaling your features leads to inefficient Gradient Descent, which can leave you short of the most optimal model.
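Here is a rough sketch (my own toy experiment; the feature ranges mimic the example above, and the learning rates and tolerance are arbitrary choices) comparing how many batch gradient descent iterations a plain linear regression needs with and without standardization.

import numpy as np

rng = np.random.default_rng(0)
# feature scales roughly as in the answer above
X = np.column_stack([rng.uniform(10, 10_000, size=500),      # Feature1
                     rng.uniform(1e-5, 1e-3, size=500)])      # Feature2
y = 0.002 * X[:, 0] + 3000.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

def iters_to_converge(A, lr, tol=1e-6, max_iter=100_000):
    """Batch gradient descent on the squared error; stop once the loss is
    within tol of the best loss achievable with this design matrix."""
    w_opt = np.linalg.lstsq(A, y, rcond=None)[0]
    best = np.mean((A @ w_opt - y) ** 2)
    w = np.zeros(A.shape[1])
    for it in range(1, max_iter + 1):
        err = A @ w - y
        if np.mean(err ** 2) - best < tol:
            return it
        w -= lr * (A.T @ err) / len(y)
    return max_iter  # hit the iteration cap without converging

# Unscaled: the large feature forces a tiny learning rate, and the small
# feature's weight then moves extremely slowly (here it just hits the cap).
print("unscaled:", iters_to_converge(X, lr=1e-8), "iterations")

# Standardized to zero mean / unit variance: one moderate rate works for both.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled:  ", iters_to_converge(Xs, lr=0.1), "iterations")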

Why would a neural network's validation loss and accuracy fluctuate at first?

I am training a neural network, and at the beginning of training my network's loss and accuracy on the validation data fluctuate a lot, but towards the end of training they stabilize. I am using reduce-learning-rate-on-plateau for this network. Could it be that the network starts with a high learning rate, and as the learning rate decreases both accuracy and loss stabilize?
For SGD, the amount of change in the parameters is the learning rate times the gradient of the loss with respect to the parameter values.
θ = θ − α ∇θ E[J(θ)]
Every step it takes will be in a sub-optimal direction (i.e. slightly wrong), as the optimiser has usually only seen some of the values. At the start of training you are relatively far from the optimal solution, so the gradient ∇θ E[J(θ)] is large, and therefore each sub-optimal step has a large effect on your loss and accuracy.
Over time, as you (hopefully) get closer to the optimal solution, the gradient is smaller, so the steps become smaller, meaning that the effects of being slightly wrong are diminished. Smaller errors on each step make your loss decrease more smoothly, which reduces the fluctuations.
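A tiny sketch (my own one-parameter toy problem with a made-up noise level) of why the fluctuations shrink even at a fixed learning rate: the update is α times the gradient, and the gradient itself shrinks as θ approaches the optimum.

import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 10.0, 0.1           # start far from the optimum, fixed learning rate

for step in range(1, 41):
    # noisy estimate of the gradient of J(theta) = theta**2 / 2, as a mini-batch would give
    noisy_grad = theta + rng.normal(scale=0.5)
    theta -= alpha * noisy_grad    # theta = theta - alpha * grad
    if step == 1 or step % 10 == 0:
        print(f"step {step:2d}: |gradient| ~ {abs(noisy_grad):5.2f}, |update| ~ {abs(alpha * noisy_grad):.3f}")
# Early on the gradient (and hence each update) is large, so the noise in every
# sub-optimal step moves the parameters, loss and accuracy a lot; as theta nears
# the optimum both shrink, and the fluctuations die down even at a fixed alpha.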

Are there any fixed relationships between mini-batch gradient descent and gradient descent?

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size equal to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as with one batch gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes avoid some local optima. But are there any fixed relationships between them?
When we say batch gradient descent, it updates the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, -gamma is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (I assume the total training data size >> 10), this method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data for one update). Mini-batch GD performs better than true stochastic gradient descent because, when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD. In this online-learning setting, each iteration of the update consists of choosing a random training instance (z_t) from the outside world and updating the parameter w_t.
The two figures I included here are from this paper.
From wiki:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates α decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
Regarding your question:
[convex case] Can I get the same result with one batch gradient descent update?
If by "same result" you mean "converging" to the global minimum, then YES. This was proved by Léon Bottou in his paper. That is, either SGD or mini-batch SGD converges to the global minimum almost surely. Note that when we say almost surely:
It is obvious however that any online learning algorithm can be misled by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.
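As a sanity check of the convex case, here is a small sketch (my own toy logistic regression; the data, learning rate and epoch count are arbitrary) comparing full-batch gradient descent with one epoch's worth of mini-batch updates of size 10 on 100 samples.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)

def grad(w, Xb, yb):
    """Gradient of the average logistic loss on the (mini-)batch (Xb, yb)."""
    p = 1 / (1 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

lr, epochs = 0.5, 2000
w_batch = np.zeros(2)
w_mini = np.zeros(2)

for _ in range(epochs):
    # one batch-GD update uses all 100 samples at once
    w_batch -= lr * grad(w_batch, X, y)
    # one epoch of mini-batch SGD: ten updates on shuffled batches of 10
    idx = rng.permutation(100)
    for start in range(0, 100, 10):
        b = idx[start:start + 10]
        w_mini -= lr * grad(w_mini, X[b], y[b])

print("batch GD     :", np.round(w_batch, 3))
print("mini-batch GD:", np.round(w_mini, 3))
# After one epoch the two are generally *not* equal (ten small updates compound
# differently from one averaged update), but for this convex problem both end up
# near the same global minimum after many epochs; the mini-batch iterate keeps
# jittering slightly around it because each step only sees 10 samples.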
