I'm new to the field of machine learning and recently I heard about this term. I tried to read some articles on the internet, but I still don't understand the idea behind it. Can someone give me some examples?
During back-propagation, we're adjusting the weights of the model to adapt to the most recent training results. On a nicely-behaved surface, we would simply use Newton's method and converge to the optimum solution with no problem. However, reality is rarely well-behaved, especially in the initial chaos of a randomly-initialized model. We need to traverse the space with something less haphazard than a full-scale attempt to hit the optimum on the next iteration (as Newton's method does).
Instead, we make two amendments to Newton's approach. The first is the learning rate: Newton adjusted weights by using the local gradient to compute where the solution should be, and going straight to that new input value for the next iteration. Learning rate scales this down quite a bit, taking smaller steps in the indicated direction. For instance, a learning rate of 0.1 says to go only 10% of the computed distance. From that new value, we again compute the gradient, "sneaking up" on the solution. This gives us a better chance of finding the optimum on a varied surface, rather than overshooting or oscillating past it in all directions.
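To make the contrast concrete, here is a minimal sketch in Python; the toy function f and its derivatives are made up purely for illustration:

```python
def f(x):    # toy loss surface, chosen only for illustration
    return x**4 - 3*x**2 + x

def df(x):   # first derivative (gradient)
    return 4*x**3 - 6*x + 1

def d2f(x):  # second derivative (curvature), needed for Newton's method
    return 12*x**2 - 6

x_newton, x_gd = 2.0, 2.0
lr = 0.1     # learning rate: take only 10% of the computed step

for _ in range(50):
    # Newton: jump straight to where the local quadratic model puts the optimum
    x_newton -= df(x_newton) / d2f(x_newton)
    # Scaled gradient step: "sneak up" on the solution instead
    x_gd -= lr * df(x_gd)

print(x_newton, x_gd)   # each settles into a local minimum of f
```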
Momentum is a similar attempt to maintain a consistent direction. If we're taking smaller steps, it also makes sense to maintain a somewhat consistent heading through our space. We take a linear combination of the previous heading vector and the newly-computed gradient vector, and adjust in that direction. For instance, if we have a momentum of 0.90, we will take 90% of the previous direction plus 10% of the new direction, and adjust weights accordingly -- multiplying that direction vector by the learning rate.
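A minimal sketch of one momentum update in that spirit; `grad` is a placeholder for whatever computes the gradient at the current weights, and the quadratic bowl below is made-up data:

```python
import numpy as np

def momentum_step(w, direction, grad, lr=0.1, momentum=0.9):
    # Blend 90% of the previous heading with 10% of the new gradient,
    # then move by that direction scaled by the learning rate.
    direction = momentum * direction + (1.0 - momentum) * grad(w)
    w = w - lr * direction
    return w, direction

# Toy usage: minimize ||w||^2, whose gradient is 2w
grad = lambda w: 2.0 * w
w = np.array([5.0, -3.0])
direction = np.zeros_like(w)
for _ in range(200):
    w, direction = momentum_step(w, direction, grad)
print(w)   # close to the minimum at the origin
```

Many libraries use the closely related form v = momentum * v - lr * grad(w); w += v, which folds the scaling into the velocity, but the idea of blending the old heading with the new gradient is the same.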
Does that help?
Momentum is a term used in the gradient descent algorithm.
Gradient descent is an optimization algorithm that works by finding the direction of steepest descent at its current point and updating its state by moving in that direction. As a result, the value of the function being minimized is guaranteed to decrease at each step. The problem is that this direction can change greatly at some points of the function, while the best path usually does not contain a lot of turns. So it is desirable for the algorithm to keep the direction it has already been moving along for some time before it changes direction. To achieve that, momentum is introduced.
A way of thinking about this is to imagine a stone rolling down a hill until it stops in a flat area (a local minimum). If the stone happens to pass a point where the steepest direction changes for just a moment, we don't expect it to change its direction entirely (since its physical momentum keeps it going). But if the direction of the slope changes entirely, the stone will gradually change its direction towards the steepest descent again.
Here is a detailed link; you might want to check out the math behind it or just see the effect of momentum in action:
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
I am in the process of learning Reinforcement Learning. I understood the general idea of RL.
However, so far I have yet to understand the impact of 𝜖 in RL. What is the recommended value?
When 𝜖 = 1, does it imply purely random exploration? If this intuition is correct, then the agent will not learn anything with 𝜖 = 1.
On the other hand, if we set 𝜖 = 0, that implies no exploration, and in this case the agent will not learn anything either.
I am wondering whether my intuition is correct or not.
What does epsilon control?
From https://en.wikipedia.org/wiki/Reinforcement_learning
with probability epsilon, exploration is chosen, and the action is chosen uniformly at random.
This means that with a probability equal to epsilon the system will ignore what it has learnt so far and effectively stumble or jerk blindly to the next state. That kind of sounds ridiculous as a strategy, and certainly if I saw a pedestrian sometimes flailing randomly as they walked down the street, I wouldn't say they were 'exploring'.
So, how is this exploring in something like Q-learning or RL? If, instead of small actions like the flailing in the pedestrian example, we are talking about larger logical moves, then trying something new once in a while, like walking north first and then east instead of always east first and then north, may take us past a nice ice cream shop, or a game reward.
Depending on what kinds of actions your q-learning or RL system controls, adding various levels (epsilon) of noisy actions will help.
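For reference, a minimal sketch of the epsilon-greedy rule being described, assuming a list of estimated action values for the current state (the names are placeholders):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    # q_values: estimated value of each action in the current state
    if random.random() < epsilon:
        # With probability epsilon, ignore what has been learnt and act at random
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```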
Let's take a look at two scenarios:
An 8x8 grid, some reward squares, a final goal square, and some 'evil' squares.
A Call-of-Duty-like game: an FPS in a huge open world with few rewards other than after several minutes of play, and with each second of game play involving several controller movements.
In the first one, if we have an epsilon of .1, that means 10% of the time we just move at random. If we are right beside an 'evil' square, then even if the RL has learnt that it must not move to the 'evil' square, it has a 10% chance of moving randomly, and if it does, a 1/4 chance of moving to the 'evil' square... so a 2.5% chance of just dying for no reason whenever it is beside an 'evil' square. It will have to navigate across the 8x8 grid, and if it is set up like a maze between 'evil' squares with the start opposite the end, there will be about 20 or so moves. With a 10% error rate, that will be about 20 to 24 moves once it has reached mastery. When it just starts out, its informed moves are no better than random, and it will have a 1/4^20 chance of making it the first time. Alternatively, if it learns some sub-optimal path, the only way it can learn the optimal path is for its random (epsilon) moves to happen at the right time and in the right direction. That can take a really, really long time if we need 5 correct random moves to happen in a row (1/10^5 x 1/4^5).
Now let's look at the FPS game with millions of moves. Each move isn't so critical, so an epsilon of .1 (10%) isn't so bad. On the flip side, the number of random moves that need to be chained together to make a meaningful new move or follow a path is astronomically large. Setting a higher epsilon (like .5) will certainly increase the chance, but we still have the issue of the branching factor. If going all the way down a new alleyway takes 12 seconds of game play at 5 actions per second, that's 60 actions in a row. There is a 1/2^60 chance of getting 60 random new moves in a row to go down this alleyway when the RL believes it should not. That doesn't really sound like an exploration factor to me.
But it does work for toy problems.
Lots of possibilities to improve it
If we frame the RL problem just a little differently, we can get really different results.
Let's say that instead of epsilon meaning we do something random for one time step, it is instead the chance of switching between two states: using RL to guide our actions, or doing one thing continuously (go north, chop wood). From either state we switch to the other with probability epsilon.
In the FPS game example, if we had an epsilon of 1/15, then for about 15 steps (at 5 steps per second, that is 3 seconds on average) RL would control our moves, then for about 3 seconds it would do just one thing (run north, shoot the sky, ...), then back to RL. This would be a lot less like a pedestrian twitching as they walk, and instead more like once in a while running north and then west.
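A rough sketch of that mode-switching idea, assuming `rl_action` is whatever the learnt policy would do this step and `actions` is the list of available actions (all names here are hypothetical):

```python
import random

class SwitchingExplorer:
    # Instead of acting randomly for a single step, switch between two modes:
    # "rl" (let the learnt policy act) and "fixed" (repeat one random action).
    # From either mode we switch to the other with probability epsilon per step.
    def __init__(self, actions, epsilon):
        self.actions = actions
        self.epsilon = epsilon
        self.mode = "rl"
        self.held_action = None

    def act(self, rl_action):
        if random.random() < self.epsilon:
            self.mode = "fixed" if self.mode == "rl" else "rl"
            if self.mode == "fixed":
                self.held_action = random.choice(self.actions)
        return rl_action if self.mode == "rl" else self.held_action
```

With epsilon = 1/15 the agent stays in each mode for about 15 steps on average, which gives the 3-second bursts described above.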
Some weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in the polynomial fitting of the data but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth order polynomial after I tried other polynomials and it seemed to be the best option. The algorithm always converges to the same function minimization no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or the steepest descent direction depending on the iteration
Central differences and forward differences for the finite difference method
I don't have experience in nonlinear least squares, so I don't know if there is a way to minimize the residual even more or if there isn't more room for improvement with this method. I attach below an image of the behavior of the polynomial for the last iterations. If I run the code for more iterations, the curve ends up not changing from iteration to iteration. As it is observed, there is a good fit from time = 0 to time = 12. But I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help will be very appreciated.
Fitting a polynomial does not seem to be the best idea. Your data set looks like an exponential transient with a horizontal asymptote; forcing a polynomial onto that will work very poorly.
I'd rather try with a simple model, such as
A (1 - e^(-at)).
By eye, A ~ 15. You should have a look at the values of log(15 - y).
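A sketch of fitting that model instead of a polynomial, using scipy's curve_fit; the arrays t and y below are stand-in data, so substitute your own:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, A, a):
    # Exponential transient approaching a horizontal asymptote at A
    return A * (1.0 - np.exp(-a * t))

# Stand-in data: replace t and y with your measured time series
t = np.linspace(0, 20, 50)
y = model(t, 15.0, 0.4) + 0.2 * np.random.randn(t.size)

# Initial guess A ~ 15 comes from eyeballing the asymptote
params, _ = curve_fit(model, t, y, p0=[15.0, 0.5])
print(params)   # fitted A and a
```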
I understand how to compute the forward pass in deep learning. Now I want to understand the backward pass. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where is dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) in the formula? How do we compute it at the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name) -- and the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * weight * learning_rate".
When you complete that for layer N, you back up to layer N-1 and repeat.
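A schematic of that update for one linear layer, not any particular framework; the names are illustrative, with x as the layer input, w its weights, and y_true the targets seen at the last layer:

```python
import numpy as np

def backprop_layer(x, w, y_true, lr=0.01):
    y_pred = x @ w              # forward: prediction from the current weights
    error = y_pred - y_true     # how far off was this prediction?
    grad_w = x.T @ error        # how heavily did each weight contribute?
    w_new = w - lr * grad_w     # adjust by "off * contribution * learning_rate"
    grad_x = error @ w.T        # the error signal passed back to layer N-1
    return w_new, grad_x
```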
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".
The question is how the learning rate influences the convergence rate and convergence itself.
If the learning rate is constant, will the Q function converge to the optimal one, or must the learning rate necessarily decay to guarantee convergence?
The learning rate tells the magnitude of the step that is taken towards the solution.
It should not be too big, or the updates may continuously oscillate around the minimum, and it should not be too small, or it will take a lot of time and iterations to reach the minimum.
The reason decay is advised for the learning rate is that initially, when we are at a totally random point in the solution space, we need to take big leaps towards the solution; later, when we come close to it, we make small jumps and hence small improvements to finally reach the minimum.
An analogy can be made with golf: when the ball is far away from the hole, the player hits it very hard to get as close as possible. Later, when he reaches the flagged area, he chooses a different club to get an accurate short shot.
It's not that he couldn't put the ball in the hole without the short-shot club; he might just send the ball past the target two or three times. But it is best if he plays optimally and uses the right amount of power to reach the hole. The same goes for a decayed learning rate.
The learning rate must decay but not too fast.
The conditions for convergence are the following (sorry, no latex):
sum(alpha(t), 1, inf) = inf
sum(alpha(t)^2, 1, inf) < inf
Something like alpha = k/(k+t) can work well.
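A sketch of a tabular Q-learning update using that kind of schedule; Q, reward, and the states here are placeholders, and the constant k just controls how fast alpha decays:

```python
def q_update(Q, s, a, reward, s_next, t, gamma=0.99, k=100.0):
    alpha = k / (k + t)   # decays slowly: sum(alpha) diverges, sum(alpha^2) converges
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```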
This paper discusses exactly this topic:
http://www.jmlr.org/papers/volume5/evendar03a/evendar03a.pdf
It should decay, otherwise there will be some fluctuations provoking small changes in the policy.
In some machine learning classes I took recently, I've covered gradient descent to find the best fit line for linear regression.
In some statistics classes, I have learnt that we can compute this line using statistical analysis, using the mean and standard deviation - this page covers this approach in detail. Why is this seemingly simpler technique not used in machine learning?
My question is, is gradient descent the preferred method for fitting linear models? If so, why? Or did the professor simply use gradient descent in a simpler setting to introduce the class to the technique?
The example you gave is one-dimensional, which is not usually the case in machine learning, where you have multiple input features.
In that case, you need to invert a matrix to use that simple approach, which can be hard or ill-conditioned.
Usually the problem is formulated as a least squares problem, which is slightly easier. There are standard least squares solvers which could be used instead of gradient descent (and often are). If the number of data points is very high, using a standard least squares solver might be too expensive, and (stochastic) gradient descent might give you a solution that is as good in terms of test-set error as a more precise solution, with a run-time that is orders of magnitude smaller (see this great chapter by Leon Bottou).
If your problem is small enough that it can be efficiently solved by an off-the-shelf least squares solver, you should probably not do gradient descent.
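For instance, the small-problem case can be handled directly by an off-the-shelf solver; the data below is made up just to show the call:

```python
import numpy as np

# Made-up design matrix X (n samples, d features) and targets y
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

# Off-the-shelf least squares solver: no gradient descent needed
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)
```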
Basically, 'gradient descent' is a general optimization technique and can be used to optimize ANY cost function. It is often used when the optimum point cannot be estimated in a closed-form solution.
So let's say we want to minimize a cost function. What happens in gradient descent is that we start from some random initial point and try to move in the gradient direction in order to decrease the cost function. We move step by step until there is no further decrease in the cost function. At that point, we are at the minimum. To make it easier to understand, imagine a bowl and a ball: if we drop the ball from some initial point on the bowl, it will roll until it settles at the bottom of the bowl.
As gradient descent is a general algorithm, one can apply it to any problem that requires optimizing a cost function. In the regression problem, the cost function that is often used is the mean squared error (MSE). Finding a closed-form solution requires inverting a matrix that is, most of the time, ill-conditioned (its determinant is very close to zero and therefore it does not give a robust inverse matrix). To circumvent this problem, people often take the gradient descent approach to find a solution, which does not suffer from the ill-conditioning problem.
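A small sketch contrasting the two approaches on made-up data; with well-conditioned features both land in the same place, but the closed form relies on the matrix inversion discussed above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=500)

# Closed form: requires (pseudo-)inverting X^T X, which may be ill-conditioned
w_closed = np.linalg.pinv(X.T @ X) @ X.T @ y

# Gradient descent on the MSE cost, no matrix inversion needed
w = np.zeros(3)
lr = 0.01
for _ in range(2000):
    grad = (2.0 / len(y)) * X.T @ (X @ w - y)   # gradient of the mean squared error
    w -= lr * grad

print(w_closed, w)   # both should be close to [2, -1, 0.5]
```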