I am in the process of learning Reinforcement Learning. I understood the general idea of RL.
However, so far I have not understood the impact of ε in RL. What is the recommended value?
When ε = 1, does that mean the agent explores purely at random? If that intuition is correct, then it will not learn anything with ε = 1.
On the other hand, if we set ε = 0, that means the agent never explores, and in that case it will not learn anything either.
I am wondering whether my intuition is correct or not.
What does epsilon control?
From https://en.wikipedia.org/wiki/Reinforcement_learning
with probability epsilon, exploration is chosen, and the action is chosen uniformly at random.
This means that with probability epsilon the system ignores what it has learnt so far and effectively stumbles blindly to the next state. That kind of sounds ridiculous as a strategy, and certainly if I saw a pedestrian occasionally flailing randomly as they walked down the street I wouldn't say they were 'exploring'.
So how is this exploring in something like Q-learning or RL? If, instead of small actions like the flailing in the pedestrian example, we are talking about larger logical moves, then trying something new once in a while, like walking north first then east instead of always east first then north, may take us past a nice ice cream shop, or a game reward.
Depending on what kinds of actions your q-learning or RL system controls, adding various levels (epsilon) of noisy actions will help.
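To make that concrete, here is a minimal epsilon-greedy selection sketch (plain Python; the list-of-Q-values representation is just an assumption for illustration):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick an action for the current state.

    q_values: list of estimated action values for this state (index = action).
    epsilon:  probability of ignoring the estimates and acting at random.
    """
    if random.random() < epsilon:
        # Exploration: uniform random action, regardless of what was learnt.
        return random.randrange(len(q_values))
    # Exploitation: greedy action under the current estimates.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: 4 actions (N, E, S, W) in a grid world, epsilon = 0.1
# action = epsilon_greedy_action([0.2, 1.3, -0.5, 0.0], epsilon=0.1)
```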
Let's take a look at two scenarios:
An 8x8 grid with some reward squares, a final goal square, and some 'evil' squares.
A Call-of-Duty-like game: an FPS in a huge open world with few rewards other than after several minutes of play, and with each second of game play involving several controller movements.
In the first one, if we have an epsilon of 0.1, that means 10% of the time we just move at random. If we are right beside an 'evil' square, then even if the RL has learnt that it must not move onto the 'evil' square, it has a 10% chance of moving randomly, and if it does, a 1/4 chance of moving onto the 'evil' square... so a 2.5% chance of just dying for no reason whenever it is beside an 'evil' square. It has to navigate across the 8x8 grid, and if the grid is set up like a maze between 'evil' squares with the start opposite the end, then the path is about 20 or so moves. With a 10% error rate, that works out to about 20 to 24 moves once it has reached mastery. When it is just starting out, its informed moves are no better than random, and it has roughly a 1/4^20 chance of making it the first time. Alternatively, if it learns some sub-optimal path, the only way it can discover the optimal path is for its random moves (epsilon) to happen at the right time and in the right direction. That can take a really, really long time if we need 5 correct random moves to happen in a row (1/10^5 x 1/4^5).
Now let's look at the FPS game with millions of moves. Each move isn't so critical, so an epsilon of 0.1 (10%) isn't so bad. On the flip side, the number of random moves that need to be chained together to make a meaningful new move or follow a new path is astronomically large. Setting a higher epsilon (like 0.5) will certainly increase the chance, but we still have the issue of branching factor. If going all the way down a new alleyway requires 12 seconds of game play at 5 actions per second, that's 60 actions in a row. There is a 1/2^60 chance of getting 60 random moves in a row to go down this alleyway when the RL's current belief says not to. That doesn't really sound like an exploration factor to me.
But it does work for toy problems.
Lots of possibilities to improve it
If we frame the RL problem just a little differently, we can get really different results.
Let's say that instead of epsilon meaning we do something random for one time step, it is instead the chance of switching between two modes: using RL to guide our actions, or doing one thing continuously (go north, chop wood). In either mode we switch to the other with probability epsilon.
In the FPS game example, if we had an epsilon of 1/15, then on average RL would control our moves for about 15 steps (at 5 steps per second, that's about 3 seconds), then for about 3 seconds the agent would do just one thing (run north, shoot at the sky, ...), then it would go back to RL. This is a lot less like a pedestrian twitching as they walk, and more like once in a while running north and then west.
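A rough sketch of that switching idea (the class name, action set, and default epsilon are placeholders, not a standard API):

```python
import random

class SwitchingExplorer:
    """Exploration where epsilon is the per-step chance of switching between
    'the RL policy picks the action' and 'keep repeating one fixed action'.
    With epsilon = 1/15 and 5 steps per second, each mode lasts ~3 s on average."""

    def __init__(self, actions, epsilon=1.0 / 15):
        self.actions = actions
        self.epsilon = epsilon
        self.exploring = False          # start under RL control
        self.persistent_action = None   # the one action repeated while exploring

    def act(self, rl_action):
        # With probability epsilon, flip to the other mode this step.
        if random.random() < self.epsilon:
            self.exploring = not self.exploring
            if self.exploring:
                self.persistent_action = random.choice(self.actions)
        return self.persistent_action if self.exploring else rl_action
```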
Related
I’m making a chess engine using machine learning, and I’m experiencing problems debugging it. I need help figuring out what is wrong with my program, and I would appreciate any help.
I did my research and borrowed ideas from multiple successful projects. The idea is to use reinforcement learning to teach a NN to differentiate between strong and weak positions.
I collected 3 million games with Elo over 2000 and used my own method to label them. After examining hundreds of games, I found that it's safe to assume that in the last 10 turns of any game the balance doesn't change, and the winning side has a strong advantage. So I picked positions from the last 10 turns and used two labels: one for a white win and zero for a black win. I didn't include any drawn positions. To avoid bias, I picked equal numbers of positions labeled as wins for each side, and equal numbers of positions where each side has the next move.
I represented each position by a vector of 773 elements. Every piece on every square of the chess board, together with castling rights and the side to move, is coded with ones and zeros. My sequential model has an input layer with 773 neurons and an output layer with a single neuron. I used a three-hidden-layer MLP with 1546, 500 and 50 hidden units for layers 1, 2, and 3 respectively, with a dropout rate of 20% on each. Hidden layers use the non-linear activation function ReLU, while the final output layer has a sigmoid output. I used the binary cross-entropy loss function and the Adam optimizer with all default parameters, except for the learning rate, which I set to 0.0001.
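A sketch of that architecture, assuming a Keras Sequential model (the framework is not named above, so this is only illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Dense(1546, activation="relu", input_shape=(773,)),  # hidden layer 1
    Dropout(0.2),
    Dense(500, activation="relu"),                       # hidden layer 2
    Dropout(0.2),
    Dense(50, activation="relu"),                        # hidden layer 3
    Dropout(0.2),
    Dense(1, activation="sigmoid"),                      # white-win probability
])
model.compile(loss="binary_crossentropy",
              optimizer=Adam(learning_rate=0.0001),
              metrics=["accuracy"])
```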
I used 3 percent of the positions for validation. During the first 10 epochs, validation accuracy gradually went up from 90 to 92%, just one percent behind training accuracy. Further training led to overfitting, with training accuracy going up, and validation accuracy going down.
I tested the trained model on multiple positions by hand and got pretty bad results. Overall, the model can predict which side is winning if that side has more pieces or pawns close to a promotion square. It also gives the side to move a small advantage (0.1). But overall it doesn't make much sense. In most cases it heavily favors black (by ~0.3) and doesn't properly take the setup into account. For instance, it labels the starting position as ~0.0001, as if black had almost a 100% chance to win. Sometimes an irrelevant transformation of a position results in an unpredictable change of the evaluation. One king and one queen per side is usually evaluated as a lost position for white (0.32), unless the black king is on certain squares, even though that doesn't really change the balance on the board.
What I did to debug the program:
To make sure I had not made any mistakes, I analyzed step by step how each position is recorded. Then I picked a dozen positions from the final numpy array, right before training, and converted them back to analyze them on a regular chess board.
I used various numbers of positions from the same game (1 and 6) to make sure that using too many similar positions is not the cause of the fast overfitting. By the way, even one position per game in my database resulted in a data set of 3 million positions, which should be sufficient according to some research papers.
To make sure that the positions I use are not too simple, I analyzed them: 1.3 million of them had 36 points' worth of pieces (knights, bishops, rooks, and queens; pawns were not included in the count), 1.4 million had 19 points, and only 0.3 million had less.
Some things you could try:
Add unit tests and asserts wherever possible. E.g. if you know that some value is never supposed to get negative, add an assert to check that this condition really holds (see the sketch after this list).
Print shapes of all tensors to check that you have really created the architecture you intended.
Check if your model outperforms some simple baseline model.
You say your model overfits, so maybe simplify it / add regularization?
Check how your model performs on the simplest positions. E.g. can it recognize a checkmate?
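For the first two points, a couple of cheap sanity checks might look like this (the encoding layout below is hypothetical; adapt the asserts to the actual 773-element layout):

```python
import numpy as np

def check_encoded_position(x):
    """Cheap invariants for one encoded position.
    Hypothetical layout: 12 piece planes * 64 squares = 768 values,
    plus 4 castling rights and 1 side-to-move flag = 773."""
    assert x.shape == (773,), f"unexpected shape {x.shape}"
    assert np.isin(x, [0, 1]).all(), "encoding should be strictly binary"
    pieces = x[:768].sum()
    assert 2 <= pieces <= 32, f"implausible piece count {pieces}"

# Baseline check: a model that always predicts the majority class, or one that
# just counts material, gives a floor the NN should clearly beat.
```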
I'm new to the field of machine learning and I recently heard about this term (momentum). I tried to read some articles on the internet, but I still don't understand the idea behind it. Can someone give me some examples?
During back-propagation, we're adjusting the weights of the model to adapt to the most recent training results. On a nicely-behaved surface, we would simply use Newton's method and converge to the optimum solution with no problem. However, reality is rarely well-behaved, especially in the initial chaos of a randomly-initialized model. We need to traverse the space with something less haphazard than a full-scale attempt to hit the optimum on the next iteration (as Newton's method does).
Instead, we make two amendments to Newton's approach. The first is the learning rate: Newton adjusted weights by using the local gradient to compute where the solution should be, and going straight to that new input value for the next iteration. Learning rate scales this down quite a bit, taking smaller steps in the indicated direction. For instance, a learning rate of 0.1 says to go only 10% of the computed distance. From that new value, we again compute the gradient, "sneaking up" on the solution. This gives us a better chance of finding the optimum on a varied surface, rather than overshooting or oscillating past it in all directions.
Momentum is a similar attempt to maintain a consistent direction. If we're taking smaller steps, it also makes sense to maintain a somewhat consistent heading through our space. We take a linear combination of the previous heading vector and the newly-computed gradient vector, and adjust in that direction. For instance, if we have a momentum of 0.90, we will take 90% of the previous direction plus 10% of the new direction, and adjust weights accordingly -- multiplying that direction vector by the learning rate.
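In code, one common formulation of that update (matching the 90% / 10% blend described above; other variants exist) looks roughly like this:

```python
import numpy as np

def momentum_step(weights, grad, velocity, lr=0.1, momentum=0.9):
    """One gradient step with momentum.

    velocity: running 'heading' vector, carried over between iterations.
    Blend 90% of the previous heading with 10% of the new gradient,
    then move the weights along that direction, scaled by the learning rate.
    """
    velocity = momentum * velocity + (1.0 - momentum) * grad
    weights = weights - lr * velocity
    return weights, velocity

# Usage sketch: start with velocity = np.zeros_like(weights) and call
# momentum_step once per batch, feeding the returned velocity back in.
```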
Does that help?
Momentum is a term used in the gradient descent algorithm.
Gradient descent is an optimization algorithm that works by finding the direction of steepest descent at its current point and updating its state by moving in that direction. As a result, at each step the value of the function being minimized decreases. The problem is that this direction can change greatly at some points of the function, while the best path usually does not contain a lot of turns. So it's desirable to make the algorithm keep the direction it has already been going along for some time before it changes direction. To do that, momentum is introduced.
A way of thinking about this is to imagine a stone rolling down a hill until it stops in a flat area (a local minimum). If the stone happens to pass a point where the steepest direction changes for just a moment, we don't expect it to change its direction entirely (since its physical momentum keeps it going). But if the direction of the slope changes entirely, the stone will gradually turn towards the direction of steepest descent again.
Here is a detailed link; you might want to check out the math behind it or just see the effect of momentum in action:
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
I am building a robot (2 powered wheels and one ball bearing). The problem is that I can't seem to make it drive straight. I literally find it impossible; I have been trying for weeks.
Currently I can rely on the rotation counts (of both motors) or the gyro readings (I also have two gyros, one near each tyre).
Is there a way I can fuse those together, giving me a more accurate way to determine which motor I need to speed up?
My motors accept a value from 0 to 900 (although the speed should be determined by me and not fixed). Also, if such an algorithm exists, I'd like some direction on what I'd need to swap if I make the motors go backwards.
The robot doesn't go straight because the rotation speeds of the two motors might differ slightly; e.g. one might be completing 30 rotations per second and the other 31. Trivially this is solved using wheel encoders: with those you get the difference in rotation between the two wheels and then change the motor speeds accordingly. The gyro can give you the angular error per unit time; e.g. if the robot has drifted 5 degrees to the left after 2 seconds, then the right motor's speed needs to be decreased. But how much it needs to be decreased is not a trivial problem, because ultimately the drift comes from hardware error. So what you can do is collect some gyro angle data, maybe 10 readings per second, and analyze it to find just how much you need to decrease the speed of one motor.
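One way to put both signals together is a simple proportional correction. This is only a sketch; every gain, sign convention, and function name here is an assumption to be tuned on the actual robot:

```python
def straight_drive_speeds(left_ticks, right_ticks, gyro_heading_deg,
                          base_speed=600, k_enc=2.0, k_gyro=8.0):
    """Fuse wheel-rotation counts and gyro heading into per-motor speeds.

    left_ticks / right_ticks: cumulative encoder counts (or motor rotations).
    gyro_heading_deg: heading relative to the desired straight line,
                      positive = drifted to the right (clockwise).
    Returns (left_speed, right_speed) clipped to the motors' 0-900 range.
    """
    # Both terms are positive when the robot is veering right.
    error = k_enc * (left_ticks - right_ticks) + k_gyro * gyro_heading_deg
    left_speed = max(0, min(900, base_speed - error))    # slow the faster side
    right_speed = max(0, min(900, base_speed + error))   # speed up the slower side
    return left_speed, right_speed

# Driving backwards reverses which way a speed difference turns the robot,
# so flip the sign of the correction (swap which motor gets +error and -error).
```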
The question is how the learning rate influences the convergence rate and convergence itself.
If the learning rate is constant, will the Q function converge to the optimal one, or does the learning rate necessarily have to decay to guarantee convergence?
The learning rate tells you the magnitude of the step that is taken towards the solution.
It should not be too big, or the algorithm may continuously oscillate around the minimum, and it should not be too small, or it will take a lot of time and iterations to reach the minimum.
The reason learning rate decay is advised is that initially, when we are at a totally random point in the solution space, we need to take big leaps towards the solution; later, when we get close to it, we make small jumps and hence small refinements to finally reach the minimum.
An analogy can be made with golf: when the ball is far from the hole, the player hits it very hard to get as close to the hole as possible. Later, when he reaches the flagged area, he chooses a different club to get an accurate short shot.
It's not that he couldn't put the ball in the hole without choosing the short-shot club; he might just send the ball past the target two or three times. But it is best if he plays optimally and uses the right amount of power to reach the hole. The same goes for a decayed learning rate.
The learning rate must decay but not too fast.
The conditions for convergence are the following:
$\sum_{t=1}^{\infty} \alpha_t = \infty$
$\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
Something like $\alpha_t = k/(k+t)$ can work well.
This paper discusses exactly this topic:
http://www.jmlr.org/papers/volume5/evendar03a/evendar03a.pdf
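As a small illustration (tabular Q-learning assumed; k is just a tunable constant), a schedule satisfying both conditions could look like:

```python
def learning_rate(t, k=100.0):
    """alpha_t = k / (k + t): sum_t alpha_t diverges (harmonic-like),
    while sum_t alpha_t^2 converges, so both conditions above hold."""
    return k / (k + t)

# Usage sketch in a tabular Q-learning update (all names are placeholders):
# alpha = learning_rate(step)
# Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
```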
It should decay; otherwise there will be some fluctuations provoking small changes in the policy.
The problem:
I've trained an agent to perform a simple task in a grid world (go to the top of the grid while not hitting obstacles), but the following situation always seems to occur. It finds itself in an easy part of the state space (no obstacles), and so continually gets a strong positive reinforcement signal. Then, when it does find itself in a difficult part of the state space (wedged next to two obstacles), it simply chooses the same action as before, to no effect (it goes up and hits the obstacle). Eventually the Q value for this action matches the negative reward, but by this time the other actions have even lower Q values from being useless in the easy part of the state space, so the error signal drops to zero and the incorrect action is still always chosen.
How can I prevent this from happening? I've thought of a few solutions, but none seem viable:
Use a policy that is always exploration heavy. As the obstacles take ~5 actions to get around, a single random action every now and then seems ineffective.
Make the reward function such that bad actions are worse when they are repeated. This makes the reward function break the Markov property. Maybe this isn't a bad thing, but I simply don't have a clue.
Only reward the agent for completing the task. The task takes over a thousand actions to complete, so the training signal would be way too weak.
Some background on the task:
So I've made a little testbed for trying out RL algorithms -- something like a more complex version of the grid-world described in the Sutton book. The world is a large binary grid (300 by 1000) populated by 1's in the form of randomly sized rectangles on a backdrop of 0's. A band of 1's surrounds the edges of the world.
An agent occupies a single space in this world and observes only a fixed window around it (a 41 by 41 window with the agent in the center). The agent's actions consist of moving by 1 space in any of the four cardinal directions. The agent can only move through spaces marked with a 0; 1's are impassable.
The current task to be performed in this environment is to make it to the top of the grid world starting from a random position along the bottom. A reward of +1 is given for successfully moving upwards. A reward of -1 is given for any move that would hit an obstacle or the edge of the world. All other states receive a reward of 0.
The agent uses the basic SARSA algorithm with a neural net value function approximator (as discussed in the Sutton book). For policy decisions I've tried both e-greedy and softmax.
The typical way of teaching such tasks is to give the agent a negative reward each step and then a big payout on completion. You can compensate for the long delay by using eligibility traces and by placing the agent close to the goal initially, and then close to the area it has explored.
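For reference, here is a minimal tabular SARSA(λ) step showing how eligibility traces push a delayed reward back along the visited path. With a neural-net approximator the same idea applies to gradient traces, so this tabular sketch is only an illustration of the mechanism:

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA(lambda) update with accumulating eligibility traces.

    Q: defaultdict(float) mapping (state, action) -> value estimate
    E: defaultdict(float) mapping (state, action) -> eligibility trace
    """
    delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]    # TD error for this transition
    E[(s, a)] += 1.0                               # mark the pair we just visited
    for key in list(E.keys()):
        Q[key] += alpha * delta * E[key]           # credit flows back along the trace
        E[key] *= gamma * lam                      # traces decay every step

# Reset E to an empty defaultdict(float) at the start of each episode.
```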