Stochastic gradient descent Vs Mini-batch size 1 - machine-learning

Is stochastic gradient descent basically the name given to mini-batch training where batch size = 1 and training rows are selected at random? I.e. it is the same as 'normal' gradient descent; it's just the manner in which the training data is supplied that makes the difference?
One thing that confuses me is that I've seen people say that even with SGD you can supply more than one data point and have larger batches, so won't that just make it 'normal' mini-batch gradient descent?

On Optimization Terminology
Optimization algorithms that use only a single example at a time are sometimes called stochastic, as you mentioned. Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.
Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods.
Hope that makes the terminology clearer.
(Deep Learning by Goodfellow, Bengio, and Courville, pp. 275-276)

Related

Why doesn't stochastic gradient descent fluctuate?

In batch gradient descent the parameters are updated based on the total/average loss of all the points.
In stochastic gradient descent (SGD), we update the parameters after every point instead of once per epoch.
So let's say the final point is an outlier: wouldn't that cause the whole fitted line to fluctuate drastically?
How is it reliable, and how does it converge on a contour like this? (SGD contour)
While it is true that in its most pristine form SGD operates on just one sample point, in reality this is not the dominant practice. In practice, we use a mini-batch of, say, 256, 128, or 64 samples rather than operating on the full batch containing all the samples in the database, which might be well over 1 million. Operating on a mini-batch of 256 is clearly much faster than operating on 1 million points, and at the same time it helps curb the variability caused by using just one sample point.
A second point is that there is no final point; one simply keeps iterating over the dataset. The learning rate for SGD is generally quite small, say 1e-3. So even if a sample point happens to be an outlier, the wrong gradients will be scaled by 1e-3, and hence SGD will not stray too far from the correct trajectory. As it iterates over the upcoming sample points, which are not outliers, it will head back in the correct direction.
So altogether using a medium-sized mini-batch and using a small learning rate helps SGD to not digress a lot from the correct trajectory.
Now, the word stochastic in SGD can also cover various other measures. For example, some practitioners also use gradient clipping, i.e. they clamp the calculated gradient to a maximum value whenever it exceeds a chosen threshold. You can find more on gradient clipping in this post. This is just one trick amongst dozens of other techniques, and if you are interested you can read the source code of popular implementations of SGD in PyTorch or TensorFlow.
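Putting the answer's ingredients together (mini-batches, a small learning rate, and norm-based gradient clipping), a rough NumPy sketch might look like the following. This is an illustrative toy, not the PyTorch or TensorFlow implementation; the data, clip threshold, and learning rate are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data with one deliberate outlier (all values invented).
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)
y[0] += 100.0  # the outlier

def minibatch_sgd(X, y, batch_size=64, lr=0.05, epochs=100, clip=5.0):
    """Mini-batch SGD with simple norm-based gradient clipping."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            norm = np.linalg.norm(grad)
            if norm > clip:               # clamp overly large gradients
                grad *= clip / norm
            w -= lr * grad                # small step, so one outlier barely moves w
    return w

w = minibatch_sgd(X, y)
```

Even with the outlier in the data, the averaging within each mini-batch, the clipping, and the small step size keep the estimate close to the true weights.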

What is difference between Gradient Descent and Grid Search in Machine Learning?

Hyperparameter tuning uses techniques like grid search or random search.
Gradient descent is mostly used to minimize the loss function.
My question is: when do we use grid search and when gradient descent?
Gradient descent is used to optimize the model, meaning its weights and biases, to minimize the loss. It tries to reach a minimum of the loss function and thereby generalise the model to a good extent. It optimizes the model based on the hyperparameters given to it.
For example, the learning rate is used like
W = W - ( learning_rate * gradient )
Here, the hyperparameter of learning rate affects W which are the weights.
In order to choose a better value of a hyperparameter, grid search and random search algorithms are used. Hyperparameters are constant during training but need to be fine-tuned so that the model converges to something good.
Gradient descent optimizes the model given fixed hyperparameters, whereas grid search and random search are used to fine-tune the hyperparameters themselves.
Gradient descent is used for the optimization of the model (weights and biases).
Hyperparameter tuning algorithms fine-tune the hyperparameters, which affect gradient descent.
A typical workflow looks like this:
1. Train the model with some chosen hyperparameters.
2. Evaluate the model for its loss and accuracy.
3. Run hyperparameter tuning to get better hyperparameter values.
4. Train the model again with the updated hyperparameters.
5. Repeat until the model reaches considerably high accuracy and low loss.
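The division of labour described above can be sketched on a toy least-squares problem; the data, grid values, and step counts below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=200)

def train(lr, steps=200):
    """Inner loop: plain gradient descent optimizes the weights."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad          # W = W - (learning_rate * gradient)
    return w

def loss(w):
    return np.mean((X @ w - y) ** 2)

# Outer loop: grid search tunes the hyperparameter that drives the inner loop.
grid = [1e-3, 1e-2, 1e-1]
best_lr = min(grid, key=lambda lr: loss(train(lr)))
```

Gradient descent never sees the grid; it only uses whichever learning rate the outer search hands it, which is exactly the separation the answer describes.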

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to solve linear models (e.g. regression). Andrew Ng's excellent Machine Learning course on Coursera does exactly that for linear regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets and why not for GLMs (Generalized Linear Models). They all seem to be doing the same thing- what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, having multiple hidden layers is equivalent to a single hidden layer: it's still linear combinations from input to output.
A "modern" NN has non-linear layers: ReLUs (which change negative values to 0), pooling (taking the max, min, or mean of several values), dropout (randomly removing some values), and other methods break our ability to smoothly apply gradient descent (GD) to the model as a whole. Instead, we take many of the same principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
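As a rough sketch (not any framework's actual implementation), the three layers just described might look like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # f(x) = x if x > 0, else 0
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # Keep only the maximum of each non-overlapping 2x2 square,
    # halving the input's height and width.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def dropout(x, p=0.5):
    # Randomly zero values; the scaling keeps the expected activation unchanged.
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)

x = np.array([[1.0, -2.0],
              [3.0, 4.0]])
relu(x)          # [[1., 0.], [3., 4.]]
max_pool_2x2(x)  # [[4.]]
```

Each of these is either non-differentiable at some points (ReLU, max-pooling) or stochastic (dropout), which is why the gradient has to be propagated through them layer by layer rather than written as one smooth closed-form update.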
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) "backpropagation of errors" meant less than what you describe: the term referred only to the method of calculating derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, no matter what the gradient was then used for (e.g. gradient descent, or perhaps Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
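That last point is easy to check numerically: stacking linear layers (the weight shapes below are arbitrary, chosen for illustration) is just a matrix product, so a purely linear network collapses to a single linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no non-linearity in between...
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def two_layer_linear(x):
    return W2 @ (W1 @ x)

# ...collapse to a single linear layer with weights W2 @ W1.
W = W2 @ W1
x = rng.normal(size=3)
assert np.allclose(two_layer_linear(x), W @ x)
```

With a non-linearity between the layers the product no longer collapses, which is precisely where backpropagation earns its keep.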

Are there any fixed relationships between mini-batch gradient descent and gradient descent?

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size to 10.
So after 10 mini-batch updates, can I get the same result as one batch gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes avoid local optima. But are there any fixed relationships between the two?
When we say batch gradient descent, we mean updating the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, -gamma is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (assuming the total training data size >> 10), the method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data for each update). Mini-batches perform better than true stochastic gradient descent because, when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD: in this online-learning setting, each iteration of the update consists of choosing a random training instance z_t from the outside world and updating the parameter w_t.
The two figures I included here are from this paper.
From wiki:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates α decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
Regarding your question:
[convex case] Can I get the same result as one batch gradient descent update?
If the meaning of "same result" is "converging" to the global minimum, then YES. This is proved by Léon Bottou in his paper: either SGD or mini-batch SGD converges to a global minimum almost surely. Note what we mean by almost surely:
It is obvious however that any online learning algorithm can be mislead by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.
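On a small noiseless convex example, the "same destination, different route" behaviour is easy to see; the data and step sizes below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.0, -3.0])
y = X @ true_w                    # noiseless, so the minimum is exact

def solve(batch_size, lr=0.05, epochs=300):
    """Gradient descent on mean-squared error with a given batch size."""
    w = np.zeros(2)
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for s in range(0, len(y), batch_size):
            b = idx[s:s + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w_batch = solve(batch_size=100)   # full-batch gradient descent
w_mini  = solve(batch_size=10)    # mini-batch SGD
```

The intermediate iterates differ (mini-batch updates are noisier), so 10 mini-batch steps are not equal to one full-batch step; but on a convex objective both sequences converge to the same global minimum, which is the fixed relationship the answer describes.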

What should be a generic enough convergence criterion for Stochastic Gradient Descent?

I am implementing a generic module for stochastic gradient descent. It takes as arguments: the training dataset, loss(x, y), and dw(x, y), i.e. the per-sample loss and per-sample gradient.
Now, for the convergence criterion, I have thought of:
a) Checking the loss function after every 10% of dataset.size, averaged over some window
b) Checking the norm of the difference between weight vectors, after every 10-20% of the dataset size
c) Stabilization of the error on the training set
d) Change in the sign of the gradient (again, checked at fixed intervals)
I have noticed that these checks (the precision of the check, etc.) also depend on other things, like the step size and learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind on what the generic stopping criterion should be, regardless of the training set, f(x), and df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector? That is, given dw_i and dw_i+1, how do I detect a change of sign? Does it even have a meaning in more than 2 dimensions?
P.S. Apologies for non-math/latex symbols..still getting used to the stuff.
First, stochastic gradient descent is the online version of the gradient descent method: the update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example. The stopping criterion of SGD for an N-dimensional vector is usually that the norm of the gradient becomes small enough, i.e. ||∇f(x)|| < ε for some small ε.
See this or this for details.
Second, there is a further twist on stochastic gradient descent using so-called "mini-batches". It works identically to SGD, except that it uses more than one training example to make each estimate of the gradient. This technique reduces the variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers. See this.
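A hedged sketch of how criteria (a)/(c) and a gradient-norm test might be combined in such a module (the window sizes and tolerances below are arbitrary choices, not standard values):

```python
import numpy as np

def converged(loss_history, grad, window=10, rel_tol=1e-4, grad_tol=1e-3):
    """Hypothetical combined stopping test for an SGD loop.

    Stops when the gradient norm is near zero, or when the windowed
    average loss has stabilised (criteria (a)/(c) from the question).
    The gradient norm is the usual n-dimensional substitute for a
    'change of sign' test, since per-coordinate signs are noisy under SGD.
    """
    if np.linalg.norm(grad) < grad_tol:
        return True
    if len(loss_history) < 2 * window:
        return False            # not enough history to compare windows yet
    prev = np.mean(loss_history[-2 * window:-window])
    curr = np.mean(loss_history[-window:])
    return abs(prev - curr) <= rel_tol * max(abs(prev), 1.0)
```

Averaging over a window rather than comparing single losses is what makes the test usable under SGD's per-sample noise; the relative tolerance keeps it roughly scale-independent across training problems.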
