Determining the starting point of gradient descent [closed] - machine-learning

I have just learned that the starting point of gradient descent determines the ending point. So how do we choose the right starting point so that we reach the global minimum and get the lowest possible value of the cost function?

Yes, for a general objective function, the starting point of gradient descent determines the ending point. This is a complication: gradient descent may get stuck in suboptimal local minima. What can we do about that?
Convex optimization: Things are better if the objective is a convex function optimized over a convex domain, because then any local minimum is also a global minimum. So gradient descent on a convex function won't get trapped in a suboptimal local minimum. Better yet, if the objective is strictly convex, there is at most one global minimum. For these reasons, optimization-based methods are formulated as convex optimizations whenever possible. Logistic regression, for instance, is a convex optimization problem.
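As a quick illustration, here is a minimal sketch (using a made-up strictly convex quadratic, not any particular model's cost) showing that gradient descent reaches the same minimum from any starting point when the objective is convex:

```python
import numpy as np

# Minimal sketch: gradient descent on a convex quadratic f(x) = (x - 3)^2 + 1.
# Because f is strictly convex, every starting point converges to the same
# global minimum at x = 3.
def grad(x):
    return 2.0 * (x - 3.0)

for x0 in [-100.0, 0.0, 42.0]:
    x = x0
    for _ in range(1000):
        x -= 0.1 * grad(x)                          # fixed learning rate of 0.1
    print(f"start={x0:7.1f}  ->  end={x:.6f}")      # all runs end near 3.0
```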
As Tarik said, a good meta-strategy is to do gradient descent multiple times from different random starting positions. This is sometimes called a "random restart" or "shotgun" gradient descent approach.
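A minimal sketch of the random-restart idea, on a made-up non-convex one-dimensional objective (the function and all constants are illustrative assumptions):

```python
import numpy as np

# Sketch of "random restart" gradient descent on a non-convex toy objective.
# f(x) = sin(3x) + 0.1*x^2 has several local minima; restarting from many
# random points and keeping the best result raises the chance of finding
# the global minimum.
def f(x):
    return np.sin(3 * x) + 0.1 * x ** 2

def grad(x):
    return 3 * np.cos(3 * x) + 0.2 * x

rng = np.random.default_rng(0)
best_x, best_f = None, np.inf
for _ in range(20):                      # 20 random restarts
    x = rng.uniform(-5, 5)               # random starting point
    for _ in range(500):
        x -= 0.01 * grad(x)              # plain gradient descent step
    if f(x) < best_f:
        best_x, best_f = x, f(x)
print(f"best minimum found: x={best_x:.4f}, f(x)={best_f:.4f}")
```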
Twists on the basic gradient descent idea can also be helpful in avoiding local minima. Stochastic gradient descent (SGD) (and similarly, simulated annealing) makes noisier steps. This noise has a cumulative effect like optimizing a smoothed version of the objective, hopefully smoothing over smaller valleys. Another idea is to add a momentum term to gradient descent or SGD, with the intention that momentum will allow the method to roll through and escape local minima.
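For the stochastic part, a minimal sketch of minibatch SGD on a made-up least-squares problem; the noise comes from using a small random minibatch for each gradient step rather than the full data set:

```python
import numpy as np

# Minimal sketch of stochastic gradient descent for least squares: each step
# uses the gradient of a small random minibatch instead of the full data set,
# which makes the steps noisier than full-batch gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.05
for step in range(2000):
    idx = rng.integers(0, len(X), size=32)          # random minibatch of 32
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / len(idx)         # minibatch MSE gradient
    w -= lr * g
print("estimated weights:", np.round(w, 3))         # close to [2, -1, 0.5]
```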
Finally, an interesting and practical attitude is simply to surrender and accept that gradient descent's solution may be suboptimal. A local minimum solution may yet be useful. For instance, if that solution represents trained weights for a neural network, what really counts is that the network generalizes and performs well on the test set, not that it is optimal on the training set.

Related

Back propagation vs Levenberg Marquardt

Does anyone know the difference between Backpropagation and Levenberg–Marquardt in neural network training? Sometimes I see LM described as a BP algorithm, and sometimes I see the opposite.
Your help will be highly appreciated.
Thank you.
Those are two completely unrelated concepts.
Levenberg-Marquardt (LM) is an optimization method, while backprop is just the recursive application of the chain rule for derivatives.
What LM intuitively does is this: when it is far from a local minimum, it ignores the curvature of the loss and acts as gradient descent. However, as it gets closer to a local minimum it pays more and more attention to the curvature by switching from gradient descent to a Gauss-Newton like approach.
The LM method needs both the gradient and the Hessian, as it solves variants of (H + coeff*Identity)*dx = -g, with H and g respectively the Hessian and the gradient. You can obtain the gradient via backpropagation. The Hessian is usually not as simple to obtain, although in least squares you can use the Gauss-Newton approximation built from the per-residual gradients (roughly 2*J^T*J, with J the Jacobian of the residuals), which means that in that case you can also obtain it cheaply from the same backward passes.
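A minimal sketch of one Levenberg-Marquardt flavour on a tiny made-up curve-fitting problem (the model y ≈ a*exp(b*x), the data, and the damping schedule are all illustrative assumptions, not the poster's network):

```python
import numpy as np

# Sketch of Levenberg-Marquardt for a small least-squares fit of y ~ a*exp(b*x).
# Residuals r_i = a*exp(b*x_i) - y_i; the Gauss-Newton Hessian approximation is
# J^T J, and each step solves (J^T J + lam * I) delta = -J^T r.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * x) + 0.05 * rng.normal(size=x.size)

def residuals(p):
    a, b = p
    return a * np.exp(b * x) - y

def jacobian(p):
    a, b = p
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])      # d r_i / d a, d r_i / d b

p = np.array([1.0, 1.0])                        # initial guess
lam = 1e-2                                      # damping coefficient
for _ in range(50):
    r, J = residuals(p), jacobian(p)
    g = J.T @ r                                 # gradient of 0.5*||r||^2
    H = J.T @ J                                 # Gauss-Newton Hessian approx
    delta = np.linalg.solve(H + lam * np.eye(2), -g)
    if np.sum(residuals(p + delta) ** 2) < np.sum(r ** 2):
        p, lam = p + delta, lam * 0.5           # accept step, trust curvature more
    else:
        lam *= 2.0                              # reject step, act more like GD
print("fitted (a, b):", np.round(p, 3))         # close to (2.0, 1.5)
```

Note how a large damping coefficient makes the step behave like gradient descent, while a small one makes it behave like Gauss-Newton, which is exactly the switching behaviour described above.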
For neural networks LM usually isn't really useful as you can't construct such a huge Hessian, and even if you do, it lacks the sparse structure needed to invert it efficiently.

What is momentum in machine learning? [closed]

I'm new to the field of machine learning, and recently I heard about this term. I tried to read some articles on the internet, but I still don't understand the idea behind it. Can someone give me some examples?
During back-propagation, we're adjusting the weights of the model to adapt to the most recent training results. On a nicely-behaved surface, we would simply use Newton's method and converge to the optimum solution with no problem. However, reality is rarely well-behaved, especially in the initial chaos of a randomly-initialized model. We need to traverse the space with something less haphazard than a full-scale attempt to hit the optimum on the next iteration (as Newton's method does).
Instead, we make two amendments to Newton's approach. The first is the learning rate: Newton adjusted weights by using the local gradient to compute where the solution should be, and going straight to that new input value for the next iteration. Learning rate scales this down quite a bit, taking smaller steps in the indicated direction. For instance, a learning rate of 0.1 says to go only 10% of the computed distance. From that new value, we again compute the gradient, "sneaking up" on the solution. This gives us a better chance of finding the optimum on a varied surface, rather than overshooting or oscillating past it in all directions.
Momentum is a similar attempt to maintain a consistent direction. If we're taking smaller steps, it also makes sense to maintain a somewhat consistent heading through our space. We take a linear combination of the previous heading vector and the newly-computed gradient vector, and adjust in that direction. For instance, with a momentum of 0.90, we take 90% of the previous direction plus 10% of the new direction, and adjust the weights accordingly, multiplying that direction vector by the learning rate.
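A minimal sketch of exactly that update rule on a made-up elongated quadratic bowl (the objective, learning rate, and momentum value are illustrative assumptions):

```python
import numpy as np

# Sketch of the update described above: keep a running "heading" vector that is
# a linear combination of the previous heading and the new gradient, then step
# along it scaled by the learning rate. Toy 2-D objective f = x^2 + 10*y^2.
def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # gradient of the elongated bowl

w = np.array([5.0, 5.0])
v = np.zeros(2)                                  # previous heading
lr, momentum = 0.05, 0.90
for _ in range(200):
    g = grad(w)
    v = momentum * v + (1.0 - momentum) * g      # 90% old direction + 10% new
    w -= lr * v                                  # scale heading by learning rate
print("final weights:", np.round(w, 5))          # close to the minimum at (0, 0)
```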
Does that help?
Momentum is a term used in the gradient descent algorithm.
Gradient descent is an optimization algorithm that finds the direction of steepest descent at its current position and updates the position by moving in that direction; with a suitably small step size, each step decreases the value of the function being minimized. The problem is that this direction can change sharply at some points of the function, while the best path to follow usually does not contain many turns. So it is desirable to make the algorithm keep the direction it has already been moving along for some time before it changes direction. Momentum is introduced to do exactly that.
A way of thinking about this is to imagine a stone rolling down a hill until it stops in a flat area (a local minimum). If the rolling stone happens to pass a point where the steepest direction changes for just a moment, we don't expect it to change its direction entirely (since its physical momentum keeps it going). But if the direction of the slope changes entirely, the stone will gradually turn towards the direction of steepest descent again.
Here is a detailed link; you might want to check out the math behind it or just see the effect of momentum in action:
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d

Feature scaling (normalization) in multiple regression analysis with normal equation method? [closed]

I am doing linear regression with multiple features. I decided to use the normal equation method to find the coefficients of the linear model. If we use gradient descent for linear regression with multiple variables, we typically do feature scaling to speed up gradient descent convergence. For now, I am going to use the normal equation formula:
theta = (X^T * X)^(-1) * X^T * y
I have two contradictory sources of information. The first states that no feature scaling is required for the normal equation. The other says that feature normalization has to be done.
Sources:
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex3/ex3.html
http://puriney.github.io/numb/2013/07/06/normal-equations-gradient-descent-and-linear-regression/
At the end of each of these two articles, information concerning feature scaling with the normal equation is presented.
The question is: do we need to do feature scaling before a normal equation analysis?
You may indeed not need to scale your features, and from a theoretical point of view you get the solution in just one "step". In practice, however, things might be a bit different.
Notice the matrix inversion in your formula. Inverting a matrix is not a trivial computational operation. In fact, there is a measure of how hard it is to invert a matrix (and to perform some other computations), called the condition number:
If the condition number is not too much larger than one, the matrix is well conditioned, which means its inverse can be computed with good accuracy. If the condition number is very large, then the matrix is said to be ill-conditioned. Practically, such a matrix is almost singular, and the computation of its inverse, or of the solution of a linear system of equations, is prone to large numerical errors. A matrix that is not invertible has a condition number equal to infinity.
P.S. A large condition number is actually the same problem that slows down gradient descent's convergence.
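A small sketch of that point, using numpy and made-up features on very different scales (the feature names and values are illustrative assumptions):

```python
import numpy as np

# Sketch: compare the condition number of X^T X with and without feature
# scaling, for made-up features on very different scales (house size in
# square feet vs. number of rooms).
rng = np.random.default_rng(0)
size_sqft = rng.uniform(500, 4000, size=200)           # scale of thousands
n_rooms = rng.integers(1, 8, size=200).astype(float)   # scale of units
X = np.column_stack([np.ones(200), size_sqft, n_rooms])

X_scaled = X.copy()
X_scaled[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

print("cond(X^T X), raw:   ", np.linalg.cond(X.T @ X))                 # very large
print("cond(X^T X), scaled:", np.linalg.cond(X_scaled.T @ X_scaled))   # much smaller
```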
You don't need to perform feature scaling when using the normal equation. It's useful only for the gradient descent method, to speed up its convergence. The article from Stanford University provides the correct information.
Of course you can scale the features in this case as well, but it will not bring you any advantages (and will cost you some additional calculations).
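To make that concrete, a small sketch (toy data, made-up target) showing that the normal equation solved on raw and on standardized features produces the same fitted values:

```python
import numpy as np

# Sketch: the normal equation theta = (X^T X)^(-1) X^T y solved on raw and on
# standardized features gives the same fitted values (toy data, made-up target).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100),
                     rng.uniform(500, 4000, 100),      # large-scale feature
                     rng.uniform(1, 8, 100)])          # small-scale feature
y = 50.0 + 0.3 * X[:, 1] + 10.0 * X[:, 2] + rng.normal(size=100)

def normal_eq(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)           # avoids an explicit inverse

Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

theta_raw = normal_eq(X, y)
theta_scaled = normal_eq(Xs, y)
print(np.allclose(X @ theta_raw, Xs @ theta_scaled))   # True: same predictions
```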

Why do we use gradient descent in linear regression? [closed]

In some machine learning classes I took recently, I've covered gradient descent to find the best fit line for linear regression.
In some statistics classes, I have learnt that we can compute this line using statistical analysis, using the mean and standard deviation - this page covers this approach in detail. Why is this seemingly simpler technique not used in machine learning?
My question is, is gradient descent the preferred method for fitting linear models? If so, why? Or did the professor simply use gradient descent in a simpler setting to introduce the class to the technique?
The example you gave is one-dimensional, which is not usually the case in machine learning, where you have multiple input features.
In that case, you need to invert a matrix to use that simple approach, which can be expensive, and the matrix can be ill-conditioned.
Usually the problem is formulated as a least-squares problem, which is slightly easier. There are standard least-squares solvers that could be used instead of gradient descent (and often are). If the number of data points is very high, using a standard least-squares solver might be too expensive, and (stochastic) gradient descent might give you a solution that is as good in terms of test-set error as a more precise solution, with a run time that is orders of magnitude smaller (see this great chapter by Leon Bottou).
If your problem is small enough that it can be efficiently solved by an off-the-shelf least-squares solver, you should probably not use gradient descent.
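For illustration, a minimal sketch of the off-the-shelf route using numpy's least-squares solver on made-up data:

```python
import numpy as np

# Sketch: for a small problem, an off-the-shelf least-squares solver gives the
# best-fit coefficients directly, with no learning rate or iterations to tune.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=500)

coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print("fitted coefficients:", np.round(coef, 3))   # close to [1, -2, 0.5, 3]
```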
Basically, the 'gradient descent' algorithm is a general optimization technique and can be used to optimize ANY cost function. It is often used when the optimum point cannot be obtained in closed form.
So let's say we want to minimize a cost function. What ends up happening in gradient descent is that we start from some random initial point and we try to move in the 'gradient direction' in order to decrease the cost function. We move step by step until there is no decrease in the cost function. At this time we are at the minimum point. To make it easier to understand, imagine a bowl and a ball. If we drop the ball from some initial point on the bowl it will move until it is settled at the bottom of the bowl.
As gradient descent is a general algorithm, one can apply it to any problem that requires optimizing a cost function. In the regression problem, the cost function that is often used is the mean squared error (MSE). Finding a closed-form solution requires inverting a matrix that is often ill-conditioned (its determinant is very close to zero, so it does not give a robust inverse). To circumvent this problem, people often take the gradient descent approach, which does not suffer from this ill-conditioning problem.
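A minimal sketch of that approach, gradient descent on the MSE cost for a made-up linear-regression problem, with no matrix inversion involved:

```python
import numpy as np

# Sketch: gradient descent on the MSE cost for linear regression. No matrix is
# inverted; we only follow the gradient downhill.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    g = 2 * X.T @ (X @ w - y) / len(y)      # gradient of the mean squared error
    w -= lr * g                             # step in the direction of descent
print("weights from gradient descent:", np.round(w, 3))   # close to [1.5, -0.5, 2.0]
```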

Binary classification using radial basis kernel SVM with a single feature [closed]

Is there any interpretation (graphical or otherwise) of a radial basis kernel SVM being trained with a single feature? I can visualize the effect in two dimensions, where the result is a separating boundary that is curved rather than a straight line (e.g. http://en.wikipedia.org/wiki/File:Kernel_Machine.png).
I'm having trouble imagining what this would be like if the original data had only a single feature. What would the boundary look like in that case?
In one dimension your data points are just numbers, and the decision boundary is simply a finite set of numbers: thresholds that split the line into a finite set of intervals classified as one class and a finite set of intervals classified as the other.
In fact, the decision boundary in R^2 is the set of points for which the weighted sum of Gaussians centered at the support vectors (where the alpha_i are the weights) equals the threshold b (the intercept term). You can actually draw this function (as a surface in 3D). Similarly, in 1D you get an analogous function, which can be drawn in 2D, and the decision is based on whether this function is bigger or smaller than b.
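A minimal sketch with scikit-learn on made-up one-dimensional data (the cluster locations, gamma, and the evaluation grid are illustrative assumptions), showing that the 1D decision boundary is just a finite set of threshold points:

```python
import numpy as np
from sklearn.svm import SVC

# Sketch: an RBF-kernel SVM on made-up 1-D data. The decision function is a
# weighted sum of Gaussians centered at the support vectors plus the intercept;
# the "boundary" is just the finite set of points where it crosses zero.
rng = np.random.default_rng(0)
x_neg = rng.normal(-2, 0.5, 50)                  # class 0 cluster
x_pos = rng.normal(2, 0.5, 50)                   # class 1 cluster
X = np.concatenate([x_neg, x_pos]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

grid = np.linspace(-5, 5, 1001).reshape(-1, 1)
scores = clf.decision_function(grid)             # weighted Gaussian sum + intercept
crossings = grid[:-1][np.sign(scores[:-1]) != np.sign(scores[1:])]
print("approximate boundary points:", crossings.ravel())   # finite set of thresholds
```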
This video shows what happens in a kernel mapping. It does not use the RBF kernel, but the idea is the same:
http://www.youtube.com/watch?v=3liCbRZPrZA
As for the 1D case, there is not much difference: it would look like a line that switches back and forth between two different colors (one color for each class). Nothing special happens in 1D, besides SVMs being overkill.
