What is the difference between Gradient Descent and Grid Search in Machine Learning? - machine-learning

Hyperparameter tuning uses techniques such as Grid Search or Random Search.
Gradient Descent is mostly used to minimize the loss function.
My question is: when do we use Grid Search, and when do we use Gradient Descent?

Gradient Descent is used to optimize the model, meaning its weights and biases, to minimize the loss. It tries to reach a minimum of the loss function and thereby generalize the model to a good extent. It optimizes the model based on the hyperparameters given to it.
For example, the learning rate enters the weight update as
W = W - ( learning_rate * gradient )
Here the learning-rate hyperparameter controls how strongly each update changes the weights W.
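A minimal sketch of this update rule, assuming a linear model with a squared-error loss; the data, learning rate, and number of steps are illustrative choices:

import numpy as np

# Sketch of gradient descent on a linear model y_hat = X @ W with MSE loss.
# X, y, learning_rate, and the step count are illustrative, not prescriptive.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 features
true_W = np.array([1.5, -2.0, 0.5])
y = X @ true_W + 0.1 * rng.normal(size=100)

W = np.zeros(3)                                  # start from arbitrary weights
learning_rate = 0.1                              # the hyperparameter in question

for step in range(200):
    gradient = 2 * X.T @ (X @ W - y) / len(y)    # d(MSE)/dW
    W = W - learning_rate * gradient             # the update rule from the text

print(W)                                         # ends up close to true_W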
In order to choose a better value for a hyperparameter, Grid Search and Random Search algorithms are used. Hyperparameters are constant during training but need to be fine-tuned so that the model converges to something good.
Gradient Descent optimizes the model based on the hyperparameters, whereas Grid Search and Random Search are used to fine-tune the hyperparameters themselves.
Gradient descent is used for the optimization of the model (weights and biases).
Hyperparameter tuning algorithms fine-tune the hyperparameters that affect gradient descent.
The usage could follow this routine (a sketch follows the list):
Train the model with some chosen hyperparameters.
Evaluate the model's loss and accuracy.
Run hyperparameter tuning to get better values for the hyperparameters.
Train the model again with the updated hyperparameters.
Repeat until the model reaches a sufficiently high accuracy and low loss.
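A hedged sketch of this routine using scikit-learn's GridSearchCV; the estimator, parameter grid, and synthetic data are illustrative choices, and RandomizedSearchCV would be used the same way with sampled values:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Each candidate setting is held constant while gradient descent (inside
# SGDClassifier.fit) optimizes the weights; the search only compares outcomes.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],   # regularization strength
    "eta0": [0.01, 0.1, 1.0],      # initial learning rate
}
search = GridSearchCV(
    SGDClassifier(learning_rate="constant", max_iter=1000, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)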

Related

Reducing (Versus Delaying) Overfitting in Neural Network

In neural nets, regularization (e.g. L2, dropout) is commonly used to reduce overfitting. For example, the plot below shows typical loss vs epoch, with and without dropout. Solid lines = Train, dashed = Validation, blue = baseline (no dropout), orange = with dropout. Plot courtesy of Tensorflow tutorials.
Weight regularization behaves similarly.
Regularization delays the epoch at which validation loss starts to increase, but regularization apparently does not decrease the minimum value of validation loss (at least in my models and the tutorial from which the above plot is taken).
If we use early stopping to stop training when validation loss is at its minimum (to avoid overfitting), and if regularization only delays that minimum-validation-loss point (rather than decreasing the minimum validation loss itself), then it seems that regularization does not yield a network with better generalization but merely slows down training.
How can regularization be used to reduce the minimum validation loss (to improve model generalization) as opposed to just delaying it? If regularization is only delaying minimum validation loss and not reducing it, then why use it?
Over-generalizing from a single tutorial plot is arguably not a good idea; here is a relevant plot from the original dropout paper:
Clearly, if the only effect of dropout were to delay convergence, it would not be of much use. But of course it does not always work (as your plot clearly suggests), hence it should not be used by default (which is arguably the lesson here)...
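For concreteness, a hedged sketch of the setup being discussed, assuming tf.keras: dropout and L2 weight regularization combined with early stopping that restores the weights from the minimum-validation-loss epoch. The layer sizes, rates, and random data are illustrative:

import numpy as np
import tensorflow as tf

# Illustrative data; any real training/validation split works the same way.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping at the minimum validation loss, keeping the best weights,
# so a "delayed" minimum costs nothing beyond extra training time.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=200,
          callbacks=[early_stop], verbose=0)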

Stochastic gradient descent Vs Mini-batch size 1

Is stochastic gradient descent basically the name given to mini-batch training where batch size = 1 and selecting random training rows? i.e. it is the same as 'normal' gradient descent, it's just the manner in which the training data is supplied that makes the difference?
One thing that confuses me is I've seen people say that even with SGD you can supply more than 1 data point, and have larger batches, so won't that just make it 'normal' mini-batch gradient descent?
On Optimization Terminology
Optimization algorithms that use only a single example at a time are sometimes called stochastic, as you mentioned. Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.
Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods.
Hope that makes the terminology clearer:
Deep Learning book by Goodfellow et al., pp. 275-276
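As a hedged illustration of the terminology, the update rule below is identical in all three cases; only the number of examples drawn per step changes. The quadratic loss and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)

def grad(w, Xb, yb):
    # gradient of mean squared error on the (mini)batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, steps=500, lr=0.05):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w -= lr * grad(w, X[idx], y[idx])
    return w

w_stochastic = train(batch_size=1)         # "stochastic" in the strict sense
w_minibatch  = train(batch_size=32)        # commonly also called SGD in practice
w_batch      = train(batch_size=len(X))    # batch / deterministic gradient descent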

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to solve linear models (e.g. regression)? Andrew Ng's excellent Machine Learning course on Coursera does exactly that for linear regression.
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function... and if not, why is it only referenced in the case of Neural Nets and not for GLMs (Generalized Linear Models)? They all seem to be doing the same thing; what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll capture most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, having multiple hidden layers is equivalent to a single hidden layer: it's still linear combinations from input to output.
A "modern' NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropouts (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
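A hedged sketch of what "layer by layer" looks like for a single ReLU hidden layer with a squared-error loss; the shapes, data, and learning rate are illustrative. The chain rule pushes the gradient back through each layer in turn, and the result feeds the same gradient-descent update as before:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X.sum(axis=1, keepdims=True))        # some non-linear target

W1 = rng.normal(scale=0.5, size=(4, 8))         # input -> hidden
W2 = rng.normal(scale=0.5, size=(8, 1))         # hidden -> output
lr = 0.05

for _ in range(1000):
    # forward pass
    pre = X @ W1
    h = np.maximum(pre, 0)                      # ReLU: f(x) = max(0, x)
    y_hat = h @ W2
    # backward pass (backpropagation = chain rule, applied layer by layer)
    d_yhat = 2 * (y_hat - y) / len(y)           # dLoss/dy_hat for MSE
    dW2 = h.T @ d_yhat
    d_h = d_yhat @ W2.T
    d_pre = d_h * (pre > 0)                     # ReLU derivative: 1 where pre > 0
    dW1 = X.T @ d_pre
    # gradient descent update on both layers
    W2 -= lr * dW2
    W1 -= lr * dW1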
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function... and if not, why is it only referenced in the case of Neural Nets
I think (at least originally) backpropagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, regardless of what the gradient was then used for (e.g. Gradient Descent, or maybe Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.

Non-linear classifier against a linearly separable training set

I was thinking about the risks hidden in training a non-linear classifier on a labelled (large enough) dataset which is linearly separable.
What are the main ways the classification could be misled? Any examples?
In the bias-variance tradeoff, a non-linear classifier has, in general, a larger variance than the linear one. If the dataset is generated by a linearly-separable process but the measurements are noisy, then it will be more susceptible to overfitting.
However, if the dataset is large enough and the classifier is unbiased, then a non-linear classifier would eventually produce effectively a separating hyperplane.
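A hedged sketch of that bias-variance point: a linear classifier versus a higher-variance non-linear one on data that is close to linearly separable underneath but has noisy labels. The dataset generator and models are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# class_sep makes the underlying classes (nearly) linearly separable;
# flip_y adds label noise that a high-variance model can overfit.
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, class_sep=2.0, flip_y=0.05,
                           random_state=0)

linear = LogisticRegression(max_iter=1000)
nonlinear = DecisionTreeClassifier(random_state=0)   # unconstrained depth -> high variance

print("linear   :", cross_val_score(linear, X, y, cv=5).mean())
print("nonlinear:", cross_val_score(nonlinear, X, y, cv=5).mean())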

Are there any fixed relationships between mini-batch gradient descent and gradient descent

For convex optimization, such as logistic regression:
For example, I have 100 training samples. In mini-batch gradient descent I set the batch size equal to 10.
So after 10 mini-batch gradient descent updates, can I get the same result as one full-batch gradient descent update?
For non-convex optimization, such as a neural network:
I know mini-batch gradient descent can sometimes avoid local optima, but are there any fixed relationships between the two?
When we say batch gradient descent, it means updating the parameters using all the data. Below is an illustration of batch gradient descent. Note that each iteration of batch gradient descent involves computing the average of the gradients of the loss function over the entire training data set. In the figure, -gamma is the negative of the learning rate.
When the batch size is 1, it is called stochastic gradient descent (SGD).
When you set the batch size to 10 (I assume the total training data size >> 10), the method is called mini-batch stochastic GD, which is a compromise between true stochastic GD and batch GD (which uses all the training data in one update). Mini-batches perform better than true stochastic gradient descent because when the gradient computed at each step uses more training examples, we usually see smoother convergence. Below is an illustration of SGD. In this online-learning setting, each iteration of the update consists of choosing a random training instance (z_t) from the outside world and updating the parameter w_t.
The two figures I included here are from this paper.
From Wikipedia:
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates α decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.
Regarding your question:
[convex case] Can I get the same result as one full-batch gradient descent update?
If the meaning of "same result" is "converging" to the global minimum, then YES. This is proved by Léon Bottou in his paper: either SGD or mini-batch SGD converges to a global minimum almost surely. Note what we mean by "almost surely":
It is obvious however that any online learning algorithm can be misled by a consistent choice of very improbable examples. There is therefore no hope to prove that this algorithm always converges. The best possible result then is the almost sure convergence, that is to say that the algorithm converges towards the solution with probability 1.
For the non-convex case, it is also proved in the same paper (section 5) that stochastic or mini-batch gradient descent converges to a local minimum almost surely.
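A hedged numerical check of the convex question above: with 100 samples and a batch size of 10, ten mini-batch updates are generally not identical to one full-batch update (each mini-batch gradient is evaluated at already-updated weights), even though both procedures converge on a convex loss. The quadratic loss and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0])
lr = 0.1

def grad(w, Xb, yb):
    # gradient of mean squared error on the (mini)batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# one full-batch gradient descent update from w = 0
w_batch = np.zeros(3) - lr * grad(np.zeros(3), X, y)

# ten mini-batch updates of size 10 (one pass over the data)
w_mini = np.zeros(3)
for i in range(0, 100, 10):
    w_mini -= lr * grad(w_mini, X[i:i+10], y[i:i+10])

print(np.allclose(w_batch, w_mini))   # generally False: the update paths differ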
