What is the exact difference between stochastic gradient descent, mini-batch gradient descent and gradient descent? - machine-learning

I am new to AI. I just learnt GD and about batches for gradient descent. I am confused about what the exact difference between them is. Any clarification would be appreciated.
Thanks in advance

All of those methods are first-order optimization methods: they only require knowledge of gradients to minimize finite-sum functions. This means that we minimize a function F that is written as the sum of N functions f_{i}, and we can compute the gradient of each of those functions at any given point.
The GD method consists in using the gradient of F, which is equal to the sum of the gradients of all f_{i}, to do one update, i.e.
x <- x - alpha * grad(F)
Stochastic GD consists in randomly selecting one function f_{i} and doing an update using its gradient, i.e.
x <- x - alpha * grad(f_{i})
So each update is faster, but we need more updates to find the optimum.
Mini-batch GD lies in between those two strategies: it randomly selects m functions f_{i} to do one update.
For more information look at this link
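To make the three update rules concrete, here is a minimal sketch on an assumed toy finite-sum problem (the data, step size and mini-batch size m = 16 are arbitrary choices for illustration, not a definitive recipe):

```python
import numpy as np

# Toy finite-sum problem: F(x) = (1/N) * sum_i f_i(x), with f_i(x) = (a_i . x - b_i)^2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))   # rows a_i
b = rng.normal(size=100)        # targets b_i
N, alpha = len(b), 0.02

def grad_f(x, idx):
    """Average gradient of the selected functions f_i at x."""
    return 2 * A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

x_gd = x_sgd = x_mb = np.zeros(5)
for _ in range(200):
    x_gd = x_gd - alpha * grad_f(x_gd, np.arange(N))                           # GD: all N functions
    x_sgd = x_sgd - alpha * grad_f(x_sgd, rng.integers(N, size=1))             # SGD: one random f_i
    x_mb = x_mb - alpha * grad_f(x_mb, rng.choice(N, size=16, replace=False))  # mini-batch: m = 16
```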

Check this.
In both gradient descent (GD) and stochastic gradient descent (SGD), you iteratively update a set of parameters to minimize an error function.
While in GD you have to run through all the samples in your training set to do a single update of a parameter in a particular iteration, in SGD you use only one training sample (or a subset of them) from your training set to do the update in a particular iteration. If you use a subset, it is called mini-batch stochastic gradient descent.
Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long because in every iteration when you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster because you use only one training sample and it starts improving itself right away from the first sample.
SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, though, the close approximation of the parameter values that you get with SGD is enough, because they reach values close to the optimum and keep oscillating there.
Hope this will help you.

Related

Batch gradient descent in scikit-learn

How do we set parameters for sklearn.linear_model.SGDRegressor to make it perform Batch gradient descent?
I want to solve a linear-regression problem using Batch gradient descent. I need to make SGD act like batch gradient descent, and this should be done (I think) by making it modify the model at the end of an epoch. Can it be somehow parameterized to behave like that?
I need to make SGD act like batch gradient descent, and this should be done (I think) by making it modify the model at the end of an epoch.
You cannot do that; it is clear from the documentation that:
the gradient of the loss is estimated each sample at a time and the model is updated along the way
And although in the SGDClassifier docs it is mentioned that
SGD allows minibatch (online/out-of-core) learning
which presumably holds also for SGDRegressor, what is actually meant is that you can use the partial_fit method for providing the data in different batches; the computations (and updates), however, are always performed per sample.
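For instance, a hedged sketch (with hypothetical toy data) of feeding the data in batches via partial_fit; note that the parameter updates inside each call are still performed per sample, as described above:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical toy data, just to show the calling pattern
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1000)

reg = SGDRegressor(learning_rate="constant", eta0=0.01)
batch_size = 100
for start in range(0, len(X), batch_size):
    # Each call receives one "batch", but the updates inside it are per sample
    reg.partial_fit(X[start:start + batch_size], y[start:start + batch_size])
```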
If you really need to perform linear regression with GD, you could do it easily in Keras or TensorFlow, assembling an LR model and using a batch size equal to the whole of your training samples.
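As a minimal sketch of that idea (toy data and hyperparameters assumed): a single Dense unit is an ordinary linear-regression model, and batch_size=len(X) makes every epoch perform exactly one full-batch gradient descent update:

```python
import numpy as np
from tensorflow import keras

# Toy regression data, assumed purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=200)

# A single Dense unit with no activation is an ordinary linear-regression model
model = keras.Sequential([keras.layers.Input(shape=(3,)), keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")

# batch_size=len(X): the gradient is computed on the whole training set,
# so each epoch performs exactly one (full-batch) parameter update
model.fit(X, y, epochs=100, batch_size=len(X), verbose=0)
```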

How does Lightgbm (or other boosted trees implementations with 2nd order approximations of the loss) work for L1 losses?

I've been trying to understand how LightGBM handles L1 losses (MAE, MAPE, Huber).
According to this article, the gain during a split should depend only on the first and second derivatives of the loss function. This is due to the fact that LightGBM uses a second-order Taylor approximation of the loss, so the loss can be approximated using each sample's gradient and hessian.
For L1 losses, however, the gradient of the loss has constant absolute value (it is the sign of the error) and its hessian is 0. I've also read that, to deal with this, for loss functions with hessian = 0 we should rather use 1 as the hessian:
"For these objective function with first_order_gradient is constant, LightGBM has a special treatment for them: (...) it will use the constant gradient for the tree structure learning, but use the residual for the leaf output calculation, with percentile function, e.g. 50% for MAE. This solution is from sklearn, and is proven to work in many benchmarks."
However, even using a constant hessian doesn't make sense to me: if, for instance, when using MAE the gradient is the sign of the error, the squared gradient doesn't give us any information. Does it mean that when the gradient is constant, LightGBM does not use the second-order approximation and defaults to traditional gradient boosting?
On the other hand, when reading about GOSS boosting in the original LightGBM paper, I see that for the GOSS boosting strategy the authors consider the square of the sum of the gradients. I see the same problem as above: if the gradient of the MAE is the sign of the error, how does taking the square of the gradient reflect a gain? Does it mean that GOSS also won't work with loss functions with a constant gradient?
Thanks in advance,
I've asked this in the Lightgbm repo and got this answer:
Before this version, we use the second-order approximation, but its performance actually is not good.
And we switch back to 1) use first-order gradient to find split point; 2) then use the median of residuals for leaf outputs, as shown in the above code.
So it seems LightGBM treats the already implemented L1 losses using only the first-order gradient (plus the median-based leaf outputs). For custom loss functions, it will still try to do the second-order approximation.
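For comparison, this is roughly what the usual workaround for a custom L1-style loss looks like through the sklearn API (a hedged sketch with hypothetical toy data, not LightGBM's built-in treatment): the gradient is the sign of the error, and a constant 1 stands in for the zero hessian:

```python
import numpy as np
import lightgbm as lgb

def l1_objective(y_true, y_pred):
    """Custom MAE-style objective: gradient = sign of the error, hessian = 1."""
    residual = y_pred - y_true
    grad = np.sign(residual)        # d|r|/d(pred) is the sign of the residual
    hess = np.ones_like(residual)   # the true hessian is 0; a constant 1 stands in
    return grad, hess

# Hypothetical toy data with heavy-tailed noise, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.standard_cauchy(size=500)

model = lgb.LGBMRegressor(objective=l1_objective, n_estimators=100)
model.fit(X, y)
```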

What is the difference between gradient descent and cost function J(theta)?

I am learning machine learning from Coursera. But I'm a little confused about the difference between gradient descent and the cost function. When and where should I use them?
J(ϴ) could be minimized by a trial-and-error approach, i.e. trying lots of values and then checking the output. In practice this means the work is done by hand and is time consuming.
Gradient descent does essentially the same thing, minimizing J(ϴ), but in an automated way: change the theta values, or parameters, bit by bit, until we hopefully arrive at a minimum. This is an iterative method where the model moves in the direction of steepest descent, i.e. towards the optimal value of theta.
Why use gradient descent? It is easy to implement and is a generic optimization technique, so it will work even if you change your model. It is also better to use GD if you have a lot of features, because in this case solving for ϴ directly (e.g. via the normal equation) becomes very expensive.
Gradient descent requires a cost function (there are many types of cost functions). One common choice is the mean squared error, which measures the difference between the actual values (from the dataset) and the estimated values (the predictions).
We need this cost function because we want to minimize it. Minimizing any function means finding the deepest valley in that function. Keep in mind that, the cost function is used to monitor the error in predictions of an ML model. So minimizing this, basically means getting to the lowest error value possible or increasing the accuracy of the model. In short, We increase the accuracy by iterating over a training data set while tweaking the parameters(the weights and biases) of our model.
In short, the whole point of Gradient descent is to minimize the cost function
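To make the relationship concrete, here is a minimal sketch (toy one-feature data assumed) of gradient descent minimizing the mean-squared-error cost J(theta) for a linear model:

```python
import numpy as np

# Toy one-feature dataset: y is roughly 3x + 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=100)

theta0, theta1 = 0.0, 0.0   # parameters to learn
alpha = 0.5                 # learning rate

for _ in range(1000):
    error = theta0 + theta1 * X - y      # prediction minus actual
    # J(theta) = mean(error**2); gradient of J w.r.t. each parameter:
    theta0 -= alpha * 2 * error.mean()
    theta1 -= alpha * 2 * (error * X).mean()

print(theta0, theta1)       # should end up close to 1.0 and 3.0
```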

Why do we need epochs?

In courses there is nothing about epochs, but in practice they are used everywhere.
Why do we need them if the optimizer finds the best weights in one pass? Why does the model improve?
Generally, whenever you want to optimize, you use gradient descent. Gradient descent has a parameter called the learning rate. In one iteration alone you cannot guarantee that the gradient descent algorithm will converge to a local minimum with the specified learning rate. That is why you iterate again, so that gradient descent converges better.
It's also good practice to change the learning rate per epoch, by observing the learning curves, for better convergence.
Why do we need [to train several epochs] if the optimizer finds the best weight in one pass?
That's wrong in most cases. Gradient descent methods (see a list of them) do not usually find the optimal parameters (weights) in one pass. In fact, I have never seen any case where the optimal parameters were even reached (except for constructed cases).
One epoch consists of many weight-update steps. One epoch means that the optimizer has used every training example once. Why do we need several epochs? Because gradient descent is an iterative algorithm. It improves, but it gets there only in tiny steps. It uses only tiny steps because it can only use local information: it has no idea of the function beyond the current point it is at.
You might want to read the gradient descent part of my optimization basics blog post.
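As a small illustration of the point above (toy data and hyperparameters assumed), running SGD for several epochs and printing the loss after each pass shows that a single pass rarely reaches the minimum:

```python
import numpy as np

# Toy linear problem, assumed for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

w = np.zeros(2)
alpha = 0.002

for epoch in range(5):
    for i in rng.permutation(len(X)):        # one epoch = every training example used once
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of this sample's squared error
        w -= alpha * grad                    # one tiny weight-update step
    print(f"epoch {epoch}: mse = {np.mean((X @ w - y) ** 2):.4f}")
```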

Can someone explain to me the difference between a cost function and the gradient descent equation in logistic regression?

I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function, and then there are places where they don't and just go with the gradient-descent update w := w - alpha * grad(f(w)).
What is the difference between the two if any?
Whenever you train a model with your data, you are actually producing some new values (predicted) for a specific feature. However, that specific feature already has some values which are real values in the dataset. We know the closer the predicted values to their corresponding real values, the better the model.
Now, we are using cost function to measure how close the predicted values are to their corresponding real values.
We also should consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1, the predicted value is nothing but (0.9*X+0.1) for different Xs.
[0.9 and 0.1 in the equation are just random values to understand.]
So, by considering Y as the real value corresponding to this X, the cost function measures how close (0.9*X+0.1) is to Y.
We are responsible for finding better weights (the 0.9 and 0.1) for our model, ones that produce the lowest cost (i.e. predicted values closer to the real ones).
Gradient descent is an optimization algorithm (there are other optimization algorithms too), and its responsibility is to find the minimum cost value in the process of trying the model with different weights, or indeed, updating the weights.
We first run our model with some initial weights; gradient descent then updates the weights and evaluates the cost of the model with those weights over thousands of iterations, in order to find the minimum cost.
One point is that gradient descent is not minimizing the weights; it is just updating them. This algorithm is looking for the minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
It's strange to think about it, but there is more than one measure for how "accurately" a line fits to data points.
To assess how accurately a line fits the data, we have a "cost" function which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_function(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff and move TOWARD a point that is sensitive, and AWAY from a point that is NOT sensitive. In that scenario, you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take the example of a logistic regression model for binary classification. The output (predicted value) of the model for any given input will be offset (deviate) from the actual output (expected value) during training. So the model needs to be trained to have minimal error (loss), so that it can perform well with high accuracy.
The function used to find the parameter values (m and c in the case of the linear equation y = mx + c) at which the minimal error (loss) occurs is called the cost function / loss function. "Loss function" is the term used for the loss of a single row/record of the training sample, and "cost function" is the term used for the loss over the entire training dataset.
Now, how do we find the parameter values (m and c in our case) at which the minimum loss occurs? By using the gradient descent algorithm, which helps us find the point at which the minimum loss occurs; the parameter values at that point are used to build the model (say y = 0.5x + 2, where m = 0.5 and c = 2 are the values at which the loss is minimal).
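Tying the two together with a hedged sketch (toy data assumed): the cross-entropy cost measures how wrong the current weights are, while the gradient-descent step is what actually changes the weights to reduce that cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Cross-entropy cost over the whole training set."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy binary-classification data, assumed for illustration
rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]   # bias column + 2 features
y = (rng.uniform(size=200) < sigmoid(X @ np.array([0.5, 2.0, -1.0]))).astype(float)

w = np.zeros(3)
alpha = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the cost w.r.t. w
    w -= alpha * grad                            # the gradient-descent update
    # cost(w, X, y) only monitors the error; the update above is what changes w

print(cost(w, X, y), w)
```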
The cost function expresses at what cost you are building your model; for a good model, that cost should be minimal. To find the minimum of the cost function, we use the gradient descent method, which gives the values of the coefficients that minimize the cost function.
