Iteration counts for gradient descent (GD) and stochastic gradient descent (SGD)

I'm curious about the difference in computational complexity between gradient descent (GD) and stochastic gradient descent (SGD) that was mentioned in this paper.
I found a simpler set of lecture notes which says that if we want to reach an accuracy of $$ \epsilon $$, then
the number of iterations for GD will be $$ O(\log(1/\epsilon)) $$
the number of iterations for SGD will be $$ O(1/\epsilon) $$
How are these two iteration counts obtained? Thank you!
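Not a derivation, but a small numerical sketch (my own toy setup, not the paper's) can show the two rates side by side on a strongly convex least-squares problem: GD with a constant step contracts the error geometrically, while SGD with a $1/t$ step size shrinks it only like $O(1/t)$:

```python
import numpy as np

# Illustrative sketch (not the paper's exact setting): minimize
# f(w) = ||Xw - y||^2 / (2n), a strongly convex least-squares problem.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_opt = np.linalg.lstsq(X, y, rcond=None)[0]   # the unique minimizer

# Full gradient descent: linear (geometric) convergence,
# so O(log(1/e)) iterations to reach accuracy e.
w = np.zeros(d)
gd_err = []
for t in range(200):
    w -= 0.5 * (X.T @ (X @ w - y) / n)          # full gradient, constant step
    gd_err.append(np.linalg.norm(w - w_opt))

# SGD, one random example per step, step size ~ 1/t:
# error shrinks like O(1/t), so O(1/e) iterations to reach accuracy e.
w = np.zeros(d)
sgd_err = []
for t in range(1, 2001):
    i = rng.integers(n)
    w -= (1.0 / t) * X[i] * (X[i] @ w - y[i])   # one-example gradient
    sgd_err.append(np.linalg.norm(w - w_opt))
```

With these scalings, halving the target accuracy adds only a constant number of GD iterations but roughly doubles the SGD iteration count.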

Related

What is difference between Gradient Descent and Grid Search in Machine Learning?

Hyperparameter tuning uses techniques such as Grid Search or Random Search.
Gradient Descent is mostly used to minimize the loss function.
The question is: when do we use Grid Search and when do we use Gradient Descent?
Gradient Descent is used to optimize the model, meaning its weights and biases, to minimize the loss. It tries to reach a minimum of the loss function and thereby generalise the model to a good extent. It optimizes the model based on the hyperparameters given to it.
For example, the learning rate is used like
W = W - ( learning_rate * gradient )
Here, the learning rate hyperparameter affects the weights W.
In order to choose a better value for a hyperparameter, GridSearch and RandomSearch algorithms are used. Hyperparameters are constant during training but need to be fine-tuned so that the model converges to something good.
Gradient Descent optimizes the model based on the given hyperparameters, whereas GridSearch and RandomSearch are used to fine-tune those hyperparameters.
Gradient descent is used for the optimization of the model (weights and biases).
Hyperparameter tuning algorithms fine-tune the hyperparameters, which in turn affect gradient descent.
The usage could be followed in this way.
Train the model on some chosen hyperparameters.
Evaluate the model for its loss and accuracy.
Run hyperparameter tuning to get better values for hyperparameters.
Train the model again with updated hyperparameters.
Follow this routine until the model reaches considerably high accuracy and low loss.
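The routine above can be sketched as a minimal grid search over the learning rate (all data, grid values, and the tiny least-squares model here are made up for illustration):

```python
import numpy as np

# Hypothetical sketch: grid-search the learning rate for gradient descent
# on a toy least-squares model. The grid values are illustrative.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

def train(learning_rate, steps=200):
    """Run plain gradient descent and return the final training loss."""
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= learning_rate * grad       # W = W - (learning_rate * gradient)
    return np.mean((X @ w - y) ** 2)

# Grid search: train once per candidate value, keep the one with lowest loss.
grid = [1e-3, 1e-2, 1e-1, 0.5]
losses = {lr: train(lr) for lr in grid}
best_lr = min(losses, key=losses.get)
```

In practice the candidates are evaluated on a held-out validation set rather than the training loss, and the search can cover several hyperparameters at once.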

Stochastic gradient descent Vs Mini-batch size 1

Is stochastic gradient descent basically the name given to mini-batch training with batch size = 1 and randomly selected training rows? i.e., it is the same as 'normal' gradient descent; it's just the manner in which the training data is supplied that makes the difference?
One thing that confuses me is that I've seen people say that even with SGD you can supply more than one data point and have larger batches, so wouldn't that just make it 'normal' mini-batch gradient descent?
On Optimization Terminology
Optimization algorithms that use only a single example at a time are sometimes called stochastic, as you mentioned. Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.
Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods.
Hope that makes the terminology clearer:
Deeplearningbook by Goodfellow p.275-276
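To make the terminology concrete, here is a sketch (my own illustration, not from the book) in which a single batch_size knob spans all three regimes with the same update rule:

```python
import numpy as np

# One epoch of (mini-batch) gradient descent on least squares.
# The only knob is batch_size:
#   batch_size = 1        -> stochastic gradient descent
#   1 < batch_size < n    -> mini-batch (stochastic) gradient descent
#   batch_size = n        -> batch / deterministic gradient descent
def sgd_epoch(w, X, y, batch_size, lr=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(y))               # random order each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        grad = X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
        w = w - lr * grad
    return w
```

The algorithm never changes; only how many examples feed each gradient estimate does, which is why "stochastic" is now commonly used for the whole family.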

Why random input is recommended for Stochastic Gradient Descent

Correct me if I am wrong:
1) For batch gradient descent, the coefficients of the target function are updated only after all instances have been processed. For example: if I have 100 images to train on, the cost is evaluated and the coefficients updated only after the 100th image.
2) For stochastic gradient descent, with the same 100 images, the coefficients are updated after each image.
Question:
For stochastic gradient descent, it is claimed that the input images need to be randomized in order to avoid getting stuck. I cannot picture this problem. Could someone help?
Stochastic gradient descent performs one update per training example, so the order of the examples shapes the trajectory. If the training set is presented in the same (possibly sorted or correlated) order every epoch, the updates repeat the same pattern and can cycle or drift in a biased direction instead of converging. Shuffling the training set each epoch prevents repeating the same sequence of updates.
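A minimal sketch of the usual fix (illustrative data and step size): reshuffle the training set at the start of every epoch so the same update order is never repeated:

```python
import numpy as np

# Toy SGD loop on noise-free least squares, reshuffling each epoch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true
w = np.zeros(4)

for epoch in range(20):
    order = rng.permutation(len(y))         # fresh random order each epoch
    for i in order:
        grad = X[i] * (X[i] @ w - y[i])     # gradient from a single example
        w -= 0.01 * grad
```

Without the `permutation` call the loop still runs, but every epoch applies the identical sequence of single-example updates, which is exactly the repeated pattern the answer above warns about.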

Is the gradient descent algorithm guaranteed to converge after infinite iterations for a convex function?

Whatever the choice of step size for the algorithm?
We talk about gradient descent not converging if we take a large step size, but I am not sure: can the algorithm fail to converge even after infinitely many iterations?
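A one-dimensional convex example makes this concrete (toy sketch, assuming a constant step size $\gamma$). For $f(x) = x^2$ the update is $x \leftarrow x - \gamma \cdot 2x = (1 - 2\gamma)x$, which converges iff $|1 - 2\gamma| < 1$, i.e. $0 < \gamma < 1$; for $\gamma \ge 1$ the iterates oscillate or blow up forever, so convexity alone does not guarantee convergence:

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.
# Update: x <- x - step * 2x = (1 - 2*step) * x, so the error is
# multiplied by the constant factor (1 - 2*step) at every iteration.
def gd(step, x0=1.0, iters=50):
    x = x0
    for _ in range(iters):
        x = x - step * 2 * x
    return x

small = gd(0.4)   # |1 - 0.8| = 0.2 < 1: geometric convergence to 0
large = gd(1.1)   # |1 - 2.2| = 1.2 > 1: the iterates diverge
```

So the answer is no: for a fixed step size larger than the stability threshold (here $\gamma \ge 1$, in general $2/L$ for an $L$-smooth function), gradient descent never converges, no matter how many iterations are run.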

Are there any fixed relationships between mini-batch gradient descent and gradient descent

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as one full gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes avoid local optima. But is there any fixed relationship between the two?
When we say batch gradient descent, we mean updating the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, $-\gamma$ is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (assuming the total training data size is much larger than 10), the method is called mini-batch stochastic gradient descent, a compromise between true stochastic GD and batch GD (which uses all the training data in one update). Mini-batches perform better than true stochastic gradient descent because, when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD. In this online-learning setting, each iteration of the update consists of choosing a random training instance $z_t$ from the outside world and updating the parameter $w_t$.
The two figures I included here are from this paper.
From wiki:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\alpha$ decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.
Regarding your question:
[convex case] Can I get the same result with one full gradient descent update?
If by "same result" you mean "converging" to the global minimum, then YES. This is proved by Léon Bottou in his paper: both SGD and mini-batch SGD converge to a global minimum almost surely. Note what "almost surely" means:
It is obvious however that any online learning algorithm can be misled by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.
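As a quick numerical check of the convex question (toy least-squares data, illustrative only): ten sequential mini-batch updates do not reproduce one full-batch update, even though the ten mini-batch gradients evaluated at a fixed point average exactly to the full gradient. The trajectories differ because each mini-batch step moves the parameter before the next gradient is evaluated:

```python
import numpy as np

# Toy data: 100 samples, 3 features, noise-free least squares.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -1.0, 2.0])
lr = 0.1

def grad(w, idx):
    """Average gradient of the squared error over the rows in idx."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

w0 = np.zeros(3)

# One full-batch update.
w_batch = w0 - lr * grad(w0, np.arange(100))

# Ten sequential mini-batch updates (batch size 10) over the same data.
w_mini = w0.copy()
for start in range(0, 100, 10):
    w_mini = w_mini - lr * grad(w_mini, np.arange(start, start + 10))

same = np.allclose(w_batch, w_mini)   # False: the trajectories differ
```

So there is no fixed pointwise equivalence between the two; the fixed relationship is the asymptotic one above, namely that both converge (almost surely) to the same global minimum in the convex case.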
