Convex figure in machine learning

I have been learning ML for the past few days and I am stuck on the gradient descent convex figure. From which values is the convex figure drawn in order to apply gradient descent? Is that convex figure made from the hypothesis and the cost function?

Related

How is the cost function curve actually calculated for gradient descent, i.e. how many times does the model choose the weights randomly?

As far as I know, to calculate the weights and bias for simple linear regression, we follow the gradient descent algorithm, which works by finding the global minimum of the cost function (curve). That cost function is evaluated by randomly choosing a set of weights and then calculating the mean error over all the records; in that way we get one point on the cost curve. Then another set of weights is chosen and the mean error is calculated again, so all these points make up the cost curve.
My doubt is: how many times are the weights randomly chosen to get these points before the cost curve is found?
Thanks in advance.
The gradient descent algorithm iterates until convergence.
By convergence, we mean that the global minimum of the convex cost function has been found.
There are basically two ways people check for convergence.
Automatic convergence test: declare convergence if the cost function decreases by less than e in an iteration, where e is some small value such as 10^-3. However, it is difficult to choose this threshold value in practice.
Plot the cost function against iterations: plotting the cost function against the iteration number gives a fair idea about convergence. It can also be used for debugging (the cost function must decrease on every iteration).
For example, from such a plot I could deduce that I need roughly 300-400 iterations of gradient descent.
This also lets you compare different learning rates (alpha) against iterations.
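To make the two convergence checks concrete, here is a minimal sketch (my own illustration, not code from the thread) of gradient descent for simple linear regression with a mean-squared-error cost: it applies the automatic convergence test above (stop when the cost drops by less than eps) and records the cost at every iteration so it can be plotted against the iteration number. The names gradient_descent, alpha, eps and max_iters are illustrative choices, not anything specified in the answer.

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, eps=1e-3, max_iters=10000):
        # Start from arbitrary weights; the illustrative cost is mean squared error.
        w, b = 0.0, 0.0
        cost_history = []
        prev_cost = float("inf")
        for _ in range(max_iters):
            pred = w * x + b
            cost = np.mean((pred - y) ** 2) / 2
            cost_history.append(cost)
            if prev_cost - cost < eps:        # automatic convergence test
                break
            prev_cost = cost
            dw = np.mean((pred - y) * x)      # gradient of the cost w.r.t. w
            db = np.mean(pred - y)            # gradient of the cost w.r.t. b
            w -= alpha * dw
            b -= alpha * db
        return w, b, cost_history

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.1, 5.9, 8.2])
    w, b, history = gradient_descent(x, y)
    # Plotting history against its index is the "cost vs. iterations" check above.

One caveat: with a threshold as large as 10^-3 the loop stops quite early on this toy data; in practice the threshold has to be tuned, which is exactly the difficulty mentioned above.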

How does the learning rate influence gradient descent?

When gradient descent quantitatively suggests by how much the biases and weights should be reduced, what is the learning rate doing? I am a beginner; could someone please enlighten me on this?
The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While this might be a good idea (using a low learning rate) in terms of making sure that we do not miss any local minima, it could also mean that we'll take a long time to converge, especially if we get stuck on a plateau region.
new_weight = existing_weight - learning_rate * gradient
If the learning rate is too small, gradient descent can be slow.
If the learning rate is too large, gradient descent can overshoot the minimum. It may fail to converge, and it may even diverge.
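As a small, hedged illustration of the update rule above, consider gradient descent on the one-dimensional cost f(w) = w^2, whose gradient is 2w. The three values of alpha below are made up for the example, but they show the three behaviours described: too small is slow, a suitable rate converges, and too large overshoots and diverges.

    def run(alpha, steps=20, w=1.0):
        for _ in range(steps):
            gradient = 2 * w              # derivative of f(w) = w**2
            w = w - alpha * gradient      # new_weight = existing_weight - learning_rate * gradient
        return w

    print(run(alpha=0.01))   # ~0.67: a small learning rate is still far from the minimum at 0
    print(run(alpha=0.5))    # 0.0: a suitable learning rate reaches the minimum
    print(run(alpha=1.1))    # ~38: too large a rate overshoots more on each step, i.e. it diverges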

How does the gradient descent algorithm behave when it doesn't reach the minimum?

When the gradient descent algorithm reaches the coefficient value where the error is minimal, how does the algorithm move forward to reach the global minimum? I am new to machine learning, so please bear with me for such a basic question.

Is the gradient descent algorithm guaranteed to converge after infinite iterations for a convex function?

Whatever the choice of step size for the algorithm?
We talk about gradient descent not converging if we take a large step size, but I am not sure whether the algorithm will never converge at all.

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to solve linear models (e.g. regression). Andrew Ng's excellent Machine Learning course on Coursera does exactly that for linear regression.
So, I'm trying to understand whether backpropagation does anything more than gradient descent on the loss function, and if not, why it is only referenced in the case of neural nets and not for GLMs (Generalized Linear Models). They all seem to be doing the same thing; what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative and you'll capture most of the difference.
First of all, take note of one basic principle of neural nets (NNs): an NN with linear weights and linear dependencies is a GLM. Also, having multiple hidden layers is then equivalent to a single hidden layer: it's still linear combinations from input to output.
A "modern' NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropouts (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0, and 0 otherwise.
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to force the model to retrain those weights from "primary sources". Each of these is a headache for GD, so we have to work layer by layer.
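To make the "work backward, layer by layer" idea concrete, here is a minimal sketch (my own, assuming a one-hidden-layer ReLU network with a squared-error loss; none of the sizes or names come from the answer) of a manual backward pass that applies the chain rule from the output back to the first layer's weights, then takes an ordinary gradient descent step with the resulting gradients.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))        # 4 examples, 3 features (toy data)
    y = rng.normal(size=(4, 1))
    W1 = rng.normal(size=(3, 5))       # first-layer weights
    W2 = rng.normal(size=(5, 1))       # second-layer weights
    alpha = 0.01

    for _ in range(100):
        # Forward pass.
        z1 = x @ W1
        a1 = np.maximum(0, z1)         # ReLU: f(z) = z if z > 0, else 0
        y_hat = a1 @ W2
        loss = np.mean((y_hat - y) ** 2)

        # Backward pass (backpropagation): chain rule, layer by layer.
        d_yhat = 2 * (y_hat - y) / len(y)
        dW2 = a1.T @ d_yhat
        d_a1 = d_yhat @ W2.T
        d_z1 = d_a1 * (z1 > 0)         # ReLU passes the gradient only where z1 > 0
        dW1 = x.T @ d_z1

        # Gradient descent step using the gradients backprop produced.
        W2 -= alpha * dW2
        W1 -= alpha * dW1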
So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets
I think that (at least originally) back propagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, regardless of what the gradient was then used for (e.g. gradient descent, or perhaps Levenberg-Marquardt).
They all seem to be doing the same thing- what might I be missing?
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
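As a quick sketch of that last point (assuming a single linear neuron, no bias, and a mean-squared-error loss), the gradient descent update for such a "network" is exactly the linear regression update:

    import numpy as np

    x = np.array([[1.0], [2.0], [3.0]])
    y = np.array([[2.0], [4.0], [6.0]])
    w = np.zeros((1, 1))               # one linear neuron, no activation
    alpha = 0.05

    for _ in range(200):
        y_hat = x @ w
        grad = x.T @ (y_hat - y) / len(y)   # same gradient as linear regression with MSE
        w -= alpha * grad

    print(w)   # approaches the least-squares solution, here w ≈ 2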
