I'm trying to understand Q-Learning
The basic update formula:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(st, at)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
My agents are playing checkers, so I am focusing on model-free algorithms.
All the agent knows is the current state it is in.
I understand that when it performs an action, you update the utility, but how does it know to take that action in the first place?
At the moment I have:
Check each move you could make from that state.
Pick whichever move has the highest utility.
Update the utility of the move made.
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy, so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action. This is the best policy once you have learned the values. The most common policy during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1-ε and a random action with probability ε.
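As a rough illustration of ε-greedy selection, here is a minimal sketch. The function name `epsilon_greedy`, the dict layout, and the action names are made up for the example; they are not taken from the question.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict mapping action -> estimated value.

    With probability epsilon a uniformly random action is returned
    (exploration); otherwise the highest-valued action (exploitation).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=q_values.get)    # exploit

# Hypothetical action values for a single state.
q = {"left": 0.1, "right": 0.7, "jump": 0.3}
greedy_action = epsilon_greedy(q, epsilon=0.0)   # never explores
random_action = epsilon_greedy(q, epsilon=1.0)   # always explores
```

With all values initialised to the same number, the greedy branch has nothing to distinguish the actions, so early behaviour is driven almost entirely by the random branch. That is what provides the "initial policy".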
How do people deal with problems where the legal actions differ from state to state? In my case I have about 10 actions in total, and the sets of legal actions do not overlap: in certain states the same 3 actions are always legal, and those actions are never legal in other types of states.
I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping.
For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e. instead of choosing the max, I choose the max among legal actions).
For Policy-Gradient type of methods I'm less sure of what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?
Two closely related works have appeared in the last two years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.
Currently this problem does not seem to have one universal, straightforward answer. Maybe because it is not that much of an issue?
Your suggestion of choosing the best Q value among legal actions is actually one of the proposed ways to handle this. For policy gradient methods you can achieve a similar result by masking the illegal actions and rescaling the probabilities of the remaining actions so that they sum to 1.
Another approach would be to give negative rewards for choosing an illegal action, or to ignore the choice and make no change in the environment, returning the same reward as before. In one of my own projects (using Q-learning) I chose the latter, and the agent learned what it had to learn, but from time to time it used the illegal actions as a 'no action' action. It wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
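To make the two masking ideas concrete, here is a small sketch in plain Python. The helper names `masked_max_q` and `masked_softmax`, and the example numbers, are illustrative assumptions; a real setup would apply the same idea to the network's output tensor. Both helpers assume at least one action is legal.

```python
import math

def masked_max_q(q_values, legal):
    """Max over legal actions only, for use in the Q-learning target."""
    return max(q for q, ok in zip(q_values, legal) if ok)

def masked_softmax(logits, legal):
    """Softmax restricted to legal actions; illegal ones get probability 0.

    Dividing by the sum over legal actions is the 'rescaling' step, so the
    remaining probabilities still sum to 1.
    """
    top = max(l for l, ok in zip(logits, legal) if ok)  # for numerical stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical network outputs for 4 actions; action 1 is illegal here.
q_values = [0.5, 2.0, -1.0, 0.3]
legal = [True, False, True, True]

best_legal = masked_max_q(q_values, legal)   # ignores the illegal 2.0
probs = masked_softmax(q_values, legal)      # probs[1] is exactly 0.0
```

The same code works whether or not the legal action sets overlap between states, since the mask is computed per state.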
As you see, these solutions don't change or differ when the actions are 'overlapping'.
Answering what you've asked in the comments - I don't believe you can train the agent under the described conditions without it learning the rules for legal/illegal actions. That would need, for example, something like separate networks for each set of legal actions, which doesn't sound like the best idea (especially if there are lots of possible legal action sets).
But is the learning of these rules hard?
You have to answer some questions yourself: is the condition that makes an action illegal hard to express? It is, of course, environment-specific, but I would say that most of the time it is not that hard to express, and agents just learn it during training. If it is hard, does your environment provide enough information about the state?
Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you simply reflect that in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in those specific states.
In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.
I recently built a DDQ agent for connect-four and had to address this. Whenever a column was chosen that was already full of tokens, I set the reward equivalent to losing the game. This was -100 in my case and it worked well.
In connect four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing and not a smaller negative number.
So if you set the negative reward greater than losing, you'll have to consider what the implications of allowing illegal moves during exploration are in your domain.
I am trying to understand Q-Learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see if it is contained in the lookup table and initialise it if not (With a default utility of 0).
3. Choose an action to take with a probability of:
(ϵ, with 0 < ϵ < 1, is the probability of taking a random action)
1-ϵ = Choosing the state-action pair with the highest utility.
ϵ = Choosing a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
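For comparison, the four steps above could be sketched roughly as follows. This is only an illustration under assumptions: the constants `ALPHA` and `GAMMA` (learning rate and discount factor), the decay schedule, and the toy state and action names are all made up, and states are assumed to be hashable.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9   # assumed learning rate and discount factor

# Steps 1 and 2: Q[state][action] -> utility; unseen entries default to 0.
Q = defaultdict(lambda: defaultdict(float))

def choose_action(state, legal_actions, epsilon):
    """Step 3: epsilon-greedy choice among the legal actions."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[state][a])

def update(state, action, reward, next_state, next_legal_actions):
    """Step 4: one Q-learning update for the move just made."""
    # A terminal state has no legal actions, so its future value is 0.
    best_next = max((Q[next_state][a] for a in next_legal_actions), default=0.0)
    td_error = reward + GAMMA * best_next - Q[state][action]
    Q[state][action] += ALPHA * td_error

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """One possible way to make epsilon decrease over time."""
    return max(end, start * decay ** episode)

# Toy transition: reward 1.0 for taking "a0" in "s0", landing in "s1".
update("s0", "a0", 1.0, "s1", ["a0", "a1"])
```

After the toy update, `Q["s0"]["a0"]` has moved from 0 to 0.1, i.e. one tenth of the way towards the observed target of 1.0.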
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor. Even after a couple hundred games, the Q-Learning agent is losing a lot more than it is winning. Furthermore, the change in win-rate is almost non-existent, especially after reaching a couple hundred games.
Am I missing something? I have implemented a couple agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similar, disappointing, results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a simplified state representation, like a neural network, to solve this kind of problem using reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.
I am learning about SARSA algorithm implementation and had a question. I understand that the general "learning" step takes the form of:
Robot (r) is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
such that the list of Actions,
a = {n,w,e,s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L[r + DQ(a',s1) - Q(a,s)]
Where L is the learning rate, r is the reward associated with (a,s), Q(a',s1) is the expected reward from an action a' in the new state s1, and D is the discount factor.
Firstly, I don't understand the role of the term - Q(a,s). Why are we re-subtracting the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA all possible Q(a',s1) are taken into account and the highest value is picked. (I believe this is Epsilon-Greedy?) Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to one-step lookahead? Why not, say, also look into a hypothetical Q(a'',s'')?
I guess overall my questions boil down to: what makes SARSA better than a breadth-first or depth-first search algorithm?
Why do we subtract Q(a,s)? r + DQ(a',s1) is the reward that we got on this run-through from getting to state s by taking action a. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after getting to state s from action a, and the rewards associated with going to future states will change in the future. So we can't just set Q(a,s) equal to r + DQ(a',s1). Instead, we just want to push it in the right direction so that it will eventually converge on the right value. So we look at the error in prediction, which requires subtracting Q(a,s) from r + DQ(a',s1). This is the amount that we would need to change Q(a,s) by in order to make it perfectly match the reward that we just observed. Since we don't want to do that all at once (we don't know if this is always going to be the best option), we multiply this error term by the learning rate, L, and add this value to Q(a,s) for a more gradual convergence on the correct value.
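A small numeric sketch may make the "push it in the right direction" idea clearer. The values of L, D, the reward, and the next Q-value below are made up, and for simplicity the same transition is repeated with a fixed Q(a',s1).

```python
def sarsa_update(q_sa, reward, q_next, L=0.5, D=0.9):
    """One SARSA step: move Q(a,s) part of the way toward r + D*Q(a',s1)."""
    target = reward + D * q_next   # what this sample says Q(a,s) should be
    error = target - q_sa          # prediction error (why Q(a,s) is subtracted)
    return q_sa + L * error        # close only a fraction L of the gap

# Repeat the same hypothetical transition (r = 1.0, Q(a',s1) = 2.0) three times.
q = 0.0
history = []
for _ in range(3):
    q = sarsa_update(q, reward=1.0, q_next=2.0)
    history.append(q)
# The target is 1.0 + 0.9 * 2.0 = 2.8; each update halves the remaining gap.
```

With L = 1 the first update would jump straight to the target, which is exactly the "all at once" behaviour the paragraph above argues against.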
Why do we pick actions randomly? The reason to not always pick the next state or action in a deterministic way is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something that we haven't explored. Maybe it is. But maybe the thing that we haven't explored yet is actually way better than we've already seen. This is called the exploration vs exploitation problem - if we just keep doing things that we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
Why can't we just take all possible actions from a given state? This will force us to basically look at the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this for in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.
I'm looking over a sample exam and there is a question on Q-learning, which I have included below. In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)? It appears the Q value to go back up to A2 would be 0.18, and the Q value to go right would be 0.09. So why wouldn't the agent go back to A2 instead of going to B3?
[Image: Maze & Q-Table]
Edit: Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q value?
Edit2: Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So then for this question I would just pick the best move since it's possible the agent would pick it?
Edit3: Would it be true to say the agent doesn't return to the state he previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
You seem to assume that you should look at the values of the state in the next time step. This is incorrect. The Q-function answers the question:
If I'm in state x, which action should I take?
In non-deterministic environments you don't even know what the next state will be, so it would be impossible to determine which action to take in your interpretation.
The learning part of Q-learning does indeed act on two subsequent timesteps, but only after they are already known, and they are used to update the values of the Q-function. This has nothing to do with how these samples (state, action, reinforcement, next state) are collected. In this case, samples are collected by the agent interacting with the environment. In the Q-learning setting, the agent interacts with the environment according to a policy, which here is based on the current values of the Q-function. Conceptually, a policy works by answering the question I quoted above.
In steps 1 and 2, the Q-function is modified only for states 1,A and 2,A. In step 3 the agent is in state 3,A, so that's the only part of the Q-function that's relevant.
In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2).
In state 3,A the action that has the highest Q-value is "right" (0.2). All other actions have value 0.0.
Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q value?
As I see it, there is no wall to the right from 2,C. Nevertheless, the Q-function is given and it's irrelevant in this task whether it's possible to reach such Q-function using Q-learning. And you can always start Q-learning from an arbitrary Q-function anyway.
In Q-learning your only knowledge is the Q-function, so you don't know anything about "walls" and other things - you act according to Q-function, and that's the whole beauty of this algorithm.
Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point does the agent just pick randomly? So then for this question I would just pick the best move since it's possible the agent would pick it?
Again, you should look at the values for the state the agent is currently in, so for 1,B "right" is optimal - it has 0.1 and other actions are 0.0.
To answer the last question, even though it's irrelevant here: yes, if the agent is taking the greedy step and multiple actions seem optimal, it chooses one at random in most common policies.
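As an illustration of that tie-breaking rule, here is a short sketch. The function name is invented, and the values for state 1,C are placeholders, since the actual table is in the image; only the 3,A values (0.2 for "right", 0.0 otherwise) come from the discussion above.

```python
import random

def greedy_with_random_ties(q_values, rng=random):
    """Return a highest-valued action, choosing uniformly among ties."""
    best = max(q_values.values())
    ties = [a for a, q in q_values.items() if q == best]
    return rng.choice(ties)

# State 3,A as described above: "right" is the unique best action.
state_3A = {"up": 0.0, "down": 0.0, "left": 0.0, "right": 0.2}

# State 1,C with placeholder numbers where "down" and "right" are tied.
state_1C = {"up": 0.0, "down": 0.5, "left": 0.0, "right": 0.5}

move_3A = greedy_with_random_ties(state_3A)   # always "right"
move_1C = greedy_with_random_ties(state_1C)   # "down" or "right"
```

Note that the function only looks at the Q-values of the current state, which matches the point made earlier: the next state's values play no part in choosing the action.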
Would it be true to say the agent doesn't return to the state he previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?
No. As I've stated above, the only guideline the agent uses in pure Q-learning is the Q-function. It's not aware that it has been in a particular state before.
I'm trying to get into machine learning, and decided to try things out for myself. I wrote a small tic-tac-toe game. So far, the computer plays against itself using random moves.
Now, I want to apply reinforcement learning by writing an agent that will explore or exploit based on the knowledge it has on the current state of the board.
The part I don't understand is this:
What does the agent use to train itself for the current state? Let's say an RNG bot (player o) does this:
Now the agent has to decide what the best move should be. A well-trained one would pick 1st, 3rd, 7th or 9th. Does it look up a similar state in the DB that led it to a win? Because if so, I think I will need to save every single move into the DB up to its eventual end state (win/lose/draw), and that would be quite a lot of data for a single play?
If I'm thinking this through wrong, I would like to know how to do this correctly.
Learning
1) Observe the current board state s;
2) Make a next move based on the distribution of all available V(s') of next moves. Strictly, the choice is often based on the Boltzmann distribution of V(s'), but it can be simplified to a maximum-value move (greedy) or, with some probability epsilon, a random move, as you are using;
3) Record s' in a sequence;
4) If the game has finished, update the values of the visited states in the sequence and start over again; otherwise, go to 1).
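Step 4 does not say how the values are updated, so the following is only one possible sketch. It assumes a simple backward pass in which each visited state's value is nudged toward the value of the state that followed it, starting from the final outcome. The function name, `alpha`, the default value of 0.5, and the state labels are all assumptions for illustration.

```python
def backup_values(V, visited, final_reward, alpha=0.1, default=0.5):
    """Update V for the recorded states after one finished game.

    Walks the sequence backwards: the last state moves toward the final
    reward, and each earlier state moves toward its successor's new value.
    """
    target = final_reward
    for state in reversed(visited):
        old = V.get(state, default)
        V[state] = old + alpha * (target - old)
        target = V[state]
    return V

# Hypothetical game: three recorded states, ending in a win (reward 1.0).
V = {}
backup_values(V, ["s1", "s2", "s3"], final_reward=1.0)
```

Only one number per distinct state is stored, so the table does not grow with the number of games played, only with the number of distinct board states seen.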
Game Playing
1) Observe the current board state s;
2) Make a next move based on the distribution of all available V(s') of next moves;
3) If the game is over, start over again; otherwise, go to 1).
Regarding your question: yes, the look-up table used in the Game Playing phase is built up in the Learning phase. Each time, the state is chosen from among all the V(s), of which there are at most 3^9 = 19683. Here is some sample code written in Python that runs 10000 games in training.