Gradient Descent global minimum? - machine-learning

Consider the gradient descent algorithm that minimizes the average squared error by finding the coefficients of a linear predictor (the algorithm I am referring to is this one). The coefficients it finds converge to the global minimum if the learning rate is small enough; we know a global minimum exists because the average squared error is a convex function of the weights.
What about the error as a function of the learning rate (aka alpha in the linked video)? Consider two methods for choosing the learning rate:
METHOD 1
Iterate over all i in the range -15 to 2.
For each i, let the learning rate be 3^i.
Run gradient descent for 20000 iterations.
Measure your training error.
Choose the learning rate 3^i for the i that had the lowest training error.
METHOD 2
Iterate over all i in the range -15 to 2.
For each i, let the learning rate be 3^i.
Run gradient descent for 20000 iterations.
Measure your training error.
If the error is higher than in the previous iteration, choose the i from the previous iteration and break out of the loop.
Is Method 2 correct in assuming that once the error increases for some choice of learning rate, all larger learning rates will be even worse?
In Method 1 we went over every value of the learning rate in the range. In Method 2 we said we don't need to go over all of them, only until we see an increase in error.
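Here is a minimal sketch of Method 1 in Python/NumPy. The 3^i grid, the 20000-iteration budget and the squared-error objective come from the question; the toy data, the divergence guard and the helper names are assumptions made purely for illustration.

import numpy as np

def gradient_descent(X, y, lr, iters=20000):
    # Minimize the average squared error of a linear predictor X @ w.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the mean squared error
        w = w - lr * grad
        if not np.all(np.isfinite(w)):           # this learning rate diverged
            return None
    return w

def training_error(X, y, w):
    return np.inf if w is None else np.mean((X @ w - y) ** 2)

# toy data (assumed): 100 samples, 3 features plus a bias column
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# METHOD 1: try every learning rate 3**i for i in -15..2 and keep the best
errors = {i: training_error(X, y, gradient_descent(X, y, 3.0 ** i)) for i in range(-15, 3)}
best_i = min(errors, key=errors.get)
print("best learning rate: 3^%d (training error %.6f)" % (best_i, errors[best_i]))

Method 2 would simply break out of the same loop as soon as errors[i] exceeds errors[i-1], which is exactly the assumption the question is asking about.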

Quoting you,
...and measure the error after some fixed number of iterations and when you see an increase in error...
Well, according to the video, that is exactly how convergence is detected: you stop once the change in the cost function between iterations is <= 0.001 or some other small value, so the bound you have already set will not allow further iterations once the cost stops changing.
There is only one local/global minimum of the convex cost function when the hypothesis is a linear predictor, so gradient descent will naturally bring you down to that minimum.

Related

How is the cost function curve actually calculated for gradient descent, i.e. how many times does the model choose the weights randomly?

As far as I know, to calculate the weights and bias for simple linear regression we follow the gradient descent algorithm, which works by finding the global minimum of the cost function (curve). The cost function is evaluated by randomly choosing a set of weights and then calculating the mean error over all the records; that gives one point on the cost curve. Then another set of weights is chosen and the mean error is calculated again, and all of these points make up the cost curve.
My doubt is: how many times are the weights randomly chosen to get these points before the cost curve is determined?
Thanks in advance.
The gradient descent algorithm iterates until convergence.
By convergence we mean that the global minimum of the convex cost function has been found.
There are basically two ways people check for convergence.
Automatic convergence test: declare convergence if the cost function decreases by less than e in an iteration, where e is some small value such as 10^-3. However, it is difficult to choose this threshold value in practice.
Plot the cost function against iterations: plotting the cost function against the iteration number gives you a fair idea of convergence. It can also be used for debugging (the cost function must decrease on every iteration).
For example, from such a plot I can deduce that I need roughly 300-400 iterations of gradient descent.
This also lets you compare different learning rates (alpha) against iterations.
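A small sketch of both checks, assuming batch gradient descent on a linear model with NumPy and matplotlib; the 10^-3 threshold is the one mentioned above, while the data, the learning rate and the iteration cap are invented for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])  # two features plus a bias column
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

w, alpha, eps = np.zeros(3), 0.05, 1e-3
costs = []
for it in range(1000):
    residual = X @ w - y
    costs.append(np.mean(residual ** 2) / 2)   # cost J(w) at this iteration
    w -= alpha / len(y) * (X.T @ residual)     # one gradient descent step
    # automatic convergence test: stop once the cost decreases by less than eps
    if it > 0 and costs[-2] - costs[-1] < eps:
        break

# plot the cost against iterations to eyeball convergence (and to debug: it must keep decreasing)
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost J(w)")
plt.show()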

Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply a normalization pre-processing step that transforms the input to zero mean and unit variance before feeding it to a convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still apply the normalization step? Thanks
Normalization is important to bring the features onto the same scale so that the network behaves much better. Assume there are two features where one is measured on a scale of 1 to 10 and the second on a scale of 1 to 10,000. In terms of the squared error function, the network will be busy optimizing the weights according to the larger error on the second feature.
Therefore it is better to normalize.
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The contour plot of these two features will look something like this (scaled for easier visibility):
[Figure: contour plot of Feature1 and Feature2]
When you perform Gradient Descent, you will calculate d(Feature1) and d(Feature2), where "d" denotes the differential of the loss with respect to each feature's weight, in order to move the model weights closer to minimizing the loss. As is evident from the contour plot above, d(Feature1) is going to be significantly larger than d(Feature2) because Feature1 spans a much larger range, so even if you choose a reasonably medium value of learning rate you will be zig-zagging around due to the relatively large values of d(Feature1), and you may even miss the global minimum.
[Figure: gradient descent with a medium learning rate]
To avoid this, if you instead choose a very small learning rate, Gradient Descent will take a very long time to converge and you may stop training before it reaches the global minimum.
[Figure: gradient descent with a very small learning rate]
So, as you can see from the above examples, not scaling your features leads to inefficient Gradient Descent, which in turn means you may not find the optimal model.
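A minimal sketch of the usual standardization step (zero mean, unit variance per feature) with NumPy. Computing the statistics on the training split only and reusing them on the test split is an assumption about the pipeline, not something stated in the question.

import numpy as np

def standardize(train, test):
    # Per-feature mean and standard deviation computed on the training data only,
    # then the same transform is applied to both splits.
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8        # small epsilon guards against constant features
    return (train - mean) / std, (test - mean) / std

# two features on wildly different scales, as in the contour-plot example above
rng = np.random.default_rng(0)
raw_train = np.column_stack([rng.uniform(10, 10000, 500), rng.uniform(1e-5, 1e-3, 500)])
raw_test = np.column_stack([rng.uniform(10, 10000, 100), rng.uniform(1e-5, 1e-3, 100)])

train, test = standardize(raw_train, raw_test)
print(train.mean(axis=0), train.std(axis=0))  # approximately [0, 0] and [1, 1]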

Are there any fixed relationships between mini-batch gradient descent and gradient descent

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as with one (full) gradient descent update?
For non-convex optimization, such as neural networks:
I know mini-batch gradient descent can sometimes avoid some local optima. But are there any fixed relationships between the two?
When we say batch gradient descent, we are updating the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, -gamma is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (stochastic GD).
When you set the batch size to 10 (I assume the total training data size is >> 10), this method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data for each update). Mini-batches perform better than true stochastic gradient descent because when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD. In this online-learning setting, each iteration of the update consists of choosing a random training instance (z_t) from the outside world and updating the parameter w_t.
The two figures I included here are from this paper.
From the Wikipedia article:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates α decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
Regarding your question:
[convex case] Can I get the same result as with one full gradient descent update?
If by "same result" you mean converging to the global minimum, then YES. This was proved by Léon Bottou in his paper: both SGD and mini-batch SGD converge to a global minimum almost surely. Note what "almost surely" means here:
It is obvious however that any online learning algorithm can be mislead by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic and mini-batch gradient descent converge to a local minimum almost surely.
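To make the convex-case question concrete, here is a small NumPy sketch on made-up logistic regression data: ten mini-batch updates of size 10 and one full-batch update over the same 100 samples generally do not produce identical weights, even though both procedures converge to the global minimum in the limit. The data, learning rate and batch layout are all assumptions for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(X, y, w):
    # gradient of the average logistic loss over the rows of X
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
y = (X @ np.array([1.0, -1.0, 0.5, 0.2]) + rng.normal(scale=0.3, size=100) > 0).astype(float)

alpha = 0.5
w_batch = np.zeros(4) - alpha * grad(X, y, np.zeros(4))   # one full-batch update

w_mini = np.zeros(4)
for start in range(0, 100, 10):                           # ten mini-batch updates of size 10
    Xb, yb = X[start:start + 10], y[start:start + 10]
    w_mini = w_mini - alpha * grad(Xb, yb, w_mini)

print("full batch:", w_batch)
print("mini batch:", w_mini)   # close to the batch update, but not identical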

Gradient Descent: Do we iterate over ALL of the training set with each step in GD, or do we run a GD step for each training example?

I've taught myself machine learning with some online resources but I have a question about gradient descent that I couldn't figure out.
The gradient descent formula for logistic regression is the following:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is the real value and xj is the value of variable j. m is the number of training examples, and hθ(x), y and xj are evaluated for each training example (that's what the summation sign is for).
This is where I get confused.
It's not clear to me whether the summation runs over my entire training set or over however many iterations I have done up to that point.
For example, imagine I have 10 training examples. If I perform gradient descent after each training example, my coefficients will be very different than if I perform gradient descent after all 10 training examples.
See below how the First Way differs from the Second Way:
First way
Step 1: Since the coefficients are initialized to 0, hθ(x)=0
Step 2: Perform gradient descent on the first training example. The summation term only includes 1 training example.
Step 3: Now use the new coefficients for training examples 1 & 2... the summation term includes the first 2 training examples.
Step 4: Perform gradient descent again.
Step 5: Now use the new coefficients for training examples 1, 2 & 3... the summation term includes the first 3 training examples.
Continue until convergence or until all training examples have been used.
Second way
Step 1: Since the coefficients are initialized to 0, hθ(x)=0 for all 10 training examples.
Step 2: Perform 1 step of gradient descent using all 10 training examples. The coefficients will be different from the First Way because the summation term includes all 10 training examples.
Step 3: Use the new coefficients on all 10 training examples again; the summation term includes all 10 training examples.
Step 4: Perform gradient descent and keep using the coefficients on all examples until convergence.
I hope that explains my confusion. Does anyone know which way is correct?
Edit: Adding cost function and hypothesis function
cost function = −1/m ∑[y log(hθ(x)) + (1−y) log(1−hθ(x))]
hθ(x) = 1/(1 + e^−z)
z = θ0 + θ1X1 + θ2X2 + θ3X3 + ... + θnXn
The second way you describe is the correct way to perform Gradient Descent. The true gradient depends on the whole data set, so one iteration of gradient descent requires using all of the data set. (This is true for any learning algorithm where you can take the gradient.)
The "first way" is close to something called Stochastic Gradient Descent. The idea here is that using the whole data set for one update might be overkill, especially if some of the data points are redundant. In that case we pick a random point from the data set, essentially setting m=1, and update based on successive selections of single points from the data set. This way we can do m updates at about the same cost as one update of batch Gradient Descent, but each update is a bit noisy, which can make convergence to the final solution harder.
The compromise between these approaches is called "MiniBatch". Taking the gradient of the whole data set is one full round of "batch" processing, as we need the whole data set on hand. Instead we will do a mini batch, selecting only a small subset of the whole data set. In this case we set k, 1 < k < m, where k is the number of points in the mini batch. We select k random data points to create the gradient from at every iteration, and then perform the update. Repeat until convergence. Obviously, increasing / decreasing k is a tradeoff between speed and accuracy.
Note: For both stochastic & mini batch gradient descent, it is important to shuffle / select randomly the next data point. If you use the same iteration order for each data point, you can get really weird / bad results - often diverging away from the solution.
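Using the same notation as the question (hypothesis hθ(x) = 1/(1+e^-z) with z = θᵀx, learning rate α, m training examples), here is a rough sketch of the Second Way (batch gradient descent) next to the mini-batch step described above; the toy data, the fixed iteration budget and the function names are assumptions, not a prescription.

import numpy as np

def h(theta, X):
    return 1.0 / (1.0 + np.exp(-X @ theta))        # hθ(x) = 1 / (1 + e^-z), z = θᵀx

def batch_step(theta, X, y, alpha):
    # Second Way: the summation runs over ALL m training examples at every step
    m = len(y)
    return theta - alpha / m * (X.T @ (h(theta, X) - y))

def minibatch_step(theta, X, y, alpha, k, rng):
    # pick k random examples (random selection matters, as noted above) and update on those
    idx = rng.choice(len(y), size=k, replace=False)
    return theta - alpha / k * (X[idx].T @ (h(theta, X[idx]) - y[idx]))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 3))])  # 10 examples: bias plus 3 features
y = (rng.random(10) > 0.5).astype(float)

theta = np.zeros(4)                      # coefficients initialized to 0
for _ in range(5000):                    # "repeat until convergence" (fixed budget here)
    theta = batch_step(theta, X, y, alpha=0.1)
print(theta)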
In the case of batch gradient descent (taking all samples), your solution will converge faster. In the case of stochastic gradient descent (taking one sample at a time), convergence will be slower.
When the training set is not huge, use batch gradient descent. But there are situations where the training set is not fixed, e.g. the training happens on the fly: you keep getting more and more samples and update your weight vector accordingly. In that case you have to update per sample.

What should be a generic enough convergence criterion for Stochastic Gradient Descent

I am implementing a generic module for Stochastic Gradient Descent. It takes as arguments: the training dataset, loss(x,y) and dw(x,y), i.e. the per-sample loss and the per-sample gradient.
Now, for the convergence criterion, I have thought of:
a) Checking the loss function after every 10% of the dataset size, averaged over some window
b) Checking the norm of the differences between weight vectors after every 10-20% of the dataset size
c) Stabilization of the error on the training set
d) Change in the sign of the gradient (again, checked at fixed intervals)
I have noticed that these checks (the precision of the check, etc.) also depend on other things, like the step size/learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind on what a generic stopping criterion should be, regardless of the training set, f(x) and df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector? That is, given dw_i and dw_i+1, how do I detect a change of sign, and does that even have a meaning in more than 2 dimensions?
P.S. Apologies for the non-math/LaTeX symbols; still getting used to the stuff.
First, stochastic gradient descent is the online version of the gradient descent method. The update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example. The stopping criterion of SGD for an N-dimensional weight vector is usually a threshold test, e.g. stopping once the norm of the gradient (or of the change in the weight vector between checks) falls below a small epsilon.
See this1 or this2 for details.
Second, there is a further twist on stochastic gradient descent using so-called "minibatches". It works identically to SGD, except that it uses more than one training example to compute each estimate of the gradient. This technique reduces variance in the estimate of the gradient and often makes better use of the hierarchical memory organization in modern computers. See this3.
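For what it is worth, here is a rough sketch of what such a generic SGD loop with a windowed-loss stopping test could look like. The interface (loss(w, x, y), dw(w, x, y)), the 10% checking interval, the tolerance and the least-squares usage example are all assumptions chosen to mirror options (a)/(c) from the question, not a prescription.

import numpy as np

def sgd(data, loss, dw, w0, lr=0.01, tol=1e-4, max_epochs=100, seed=0):
    # data: list of (x, y) pairs; loss(w, x, y) and dw(w, x, y) are the per-sample
    # loss and per-sample gradient, matching the module interface in the question.
    # Stopping rule: after every ~10% of the dataset, compare the windowed average
    # loss with the previous window and stop once the improvement drops below tol.
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    check_every = max(1, len(data) // 10)
    window, prev_avg, seen = [], np.inf, 0
    for _ in range(max_epochs):
        for i in rng.permutation(len(data)):      # shuffle the samples each epoch
            x, y = data[i]
            w = w - lr * dw(w, x, y)              # one SGD step on a single sample
            window.append(loss(w, x, y))
            seen += 1
            if seen % check_every == 0:
                avg = float(np.mean(window))
                window.clear()
                if prev_avg - avg < tol:          # windowed loss has stabilized
                    return w
                prev_avg = avg
    return w

# usage on least squares: loss = 0.5*(w.x - y)^2, gradient = (w.x - y)*x
rng = np.random.default_rng(1)
data = [(x, x @ np.array([2.0, -3.0])) for x in rng.normal(size=(500, 2))]
w = sgd(data,
        loss=lambda w, x, y: 0.5 * (w @ x - y) ** 2,
        dw=lambda w, x, y: (w @ x - y) * x,
        w0=np.zeros(2), lr=0.05)
print(w)   # should approach [2, -3]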
