I am implementing Q-learning for a simple task: a robot moving to a target position in a continuous coordinate system. Each episode has a fixed length, and the rewards are sparse: a single reward is given on the final transition of the episode, and it is a function of the final distance between the robot and the target.
The problem I am wondering about is that, when computing the Q-values for a particular state, there may be conflicting target Q-values. For example, say that in episode A the robot ends up near the target on the final step of the episode and receives a reward of 0.9. Then in episode B, the robot moves right through the target in the middle of the episode and finishes the episode far away from it.
My issue is with the problem state where the two episodes overlap. If I am storing my transitions in a replay buffer and I sample the transition from episode A, then the target Q-value for that action will be equal to discount_factor * max_q(next_state) + reward; but if the transition from episode B is sampled, then the target Q-value is discount_factor * max_q(next_state) + 0, because the reward only appears in the final state of the episode. (I am assuming here that, at the problem state, both episodes take the same action.)
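To make this concrete, here is roughly the target computation I mean, sketched in Python (discount_factor and q_values are just my own illustrative names, not code from any library):

    # Sketch of the target for a sampled transition (state, action, reward, next_state);
    # q_values(next_state) stands in for whatever returns the Q-values of all
    # actions in next_state.
    def td_target(reward, next_state, q_values, discount_factor=0.99):
        return reward + discount_factor * max(q_values(next_state))

    # Episode A, final transition through the problem state: reward = 0.9
    # Episode B, mid-episode transition through the same state: reward = 0.0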
This means that my optimisation has two different targets for the same state-action pair, which will be impossible to learn.
Have I misunderstood the problem, or is this a real issue? Or should I change the way my rewards are assigned?
Thanks!
First of all, in these two episodes either the states or the actions differ. First, let's assume the robot is omni-directional (there is no "facing" direction; it can move in any direction). Then, in the "overlapping" state, a different action is executed in episode A than in episode B (one goes up and to the right, the other goes left), and since Q-values are of the form Q(s, a) there is no "conflict": you fit Q(s, a_A) and Q(s, a_B) separately. Now the second option: the robot does have a heading direction, so in both episodes it might have executed the same action (such as "forward"), but then the state s actually includes the heading direction, so you have Q(s_A, a) and Q(s_B, a) (again different objects, simply with the same action).
In general, it is also not true that you cannot learn when you obtain two different target values for the same state-action pair: you will learn the expected value, which is the typical situation in any stochastic environment anyway.
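As a small illustration of that last point (the numbers are made up): repeatedly nudging a single tabular Q-value toward two conflicting targets settles near their average, i.e. the expected value under the sampling distribution.

    import random

    # One Q(s, a) entry updated toward two conflicting targets (0.9 from
    # episode A, 0.0 from episode B), sampled with equal probability: the
    # estimate settles near their mean, about 0.45.
    q = 0.0
    alpha = 0.05  # learning rate
    for _ in range(10000):
        target = random.choice([0.9, 0.0])
        q += alpha * (target - q)
    print(q)  # hovers around 0.45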
A simple question on a hello-world example of MCTS for tic-tac-toe.
Let's assume we are given a board and we want to make an optimal decision. As I understand it, the choice of consecutive nodes during simulation (until a leaf is met) is determined by an exploration/exploitation trade-off function (as described on Wikipedia). I really wonder what the intuition is behind the first component (exploitation) of the function here, especially for games between two players with opposite goals. The meaning of "the most promising" then changes depending on who makes a move. Shouldn't this function change depending on who makes the next move (especially its first component)?
Yes, that exploitation part of the equation should be implemented to take into account the evaluations from the perspective of the agent/player who gets to select an action in that node.
For single-agent settings, the implementation is straightforward; simply always maximize.
For zero-sum, turn-based, two-player settings, you'd want to alternate between maximizing or minimizing that exploitation part of the equation (note: always maximize the exploration term!). This can also be implemented by simply multiplying that term by -1 in nodes where the opponent gets to move.
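A minimal sketch of that selection rule for the zero-sum, two-player case (the UCB1-style formula and the node fields are assumptions for illustration, not code from any particular library):

    import math

    # 'children' are assumed to carry visit counts and value sums accumulated
    # from player 1's perspective; sign is +1 in nodes where player 1 is to
    # move and -1 in nodes where the opponent is to move.
    def select_child(children, parent_visits, sign, c=1.4):
        def uct(child):
            exploitation = sign * (child.value_sum / child.visits)
            exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
            return exploitation + exploration  # the exploration term is always maximised
        return max(children, key=uct)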
Other settings are possible too, but require slightly more implementation effort (e.g. keeping separate average scores per player in settings that are not zero-sum or have more than two players).
I am trying to understand Q-Learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see if it is contained in the lookup table and initialise it if not (With a default utility of 0).
3. Choose an action to take with a probability of:
(ϵ, with 0 < ϵ < 1, is the probability of taking a random action)
With probability 1-ϵ: choose the state-action pair with the highest utility.
With probability ϵ: choose a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(s_t, a_t) += a * [r_{t+1} + d * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
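In code, the action selection and update I am describing look roughly like this (a minimal sketch; Q, alpha, gamma and the helper names are just my own notation):

    import random

    Q = {}  # lookup table: state -> {action: utility}

    def ensure_state(state, actions):
        # Step 2: initialise unseen states with a default utility of 0
        if state not in Q:
            Q[state] = {a: 0.0 for a in actions}

    def choose_action(state, actions, epsilon):
        # Step 3: epsilon-greedy selection
        ensure_state(state, actions)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(Q[state], key=Q[state].get)

    def update(state, action, reward, next_state, actions, alpha, gamma):
        # Step 4: Q(s_t, a_t) += alpha * [r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
        ensure_state(state, actions)
        ensure_state(next_state, actions)
        target = reward + gamma * max(Q[next_state].values())
        Q[state][action] += alpha * (target - Q[state][action])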
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor: even after a couple of hundred games, the Q-Learning agent is losing far more than it is winning. Furthermore, the change in win-rate is almost non-existent, especially after the first couple of hundred games.
Am I missing something? I have implemented a couple of agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similar, disappointing, results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will take a very, very long time to get meaningful action values this way. Generally, you'd want a simplified state representation or a generalizing function approximator, such as a neural network, to solve this kind of problem with reinforcement learning.
Also, a couple of caveats:
Ideally, you should update only one value per game, because the moves within a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.
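For example, the default initialization in step 2 of your algorithm could be replaced with something like this (the ±0.01 range is an arbitrary illustrative choice):

    import random

    def ensure_state(Q, state, actions):
        # Initialize unseen states with small random utilities instead of exactly 0
        if state not in Q:
            Q[state] = {a: random.uniform(-0.01, 0.01) for a in actions}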
The search algorithm is a breadth-first search. I'm not sure how to store terms from an equation in an open list. The function f(x) has the form ax^e1 + bx^e2 + cx^e3 + k, where a, b, c are coefficients and k is a constant. All exponents, coefficients, and constants are integers between 0 and 5.
Initial state: the initial state of the problem-solving process should be any single term from ax^e1, bx^e2, cx^e3, k.
The algorithm gradually expands the number of terms at each level of the list.
I am not sure how to add the terms to an equation from an open queue. That is the question.
The general problem you are dealing with belongs to the area of regression analysis, and several techniques are available to find a function that fits a given data set, including the popular least-squares method for finding the line of best fit (a brief starting point is the related page on Wikipedia, but if you want to go deeper into this topic, you should look at the research papers out there).
If you want to stick with the breadth-first search algorithm, although this kind of approach is not common for such a problem, you first need to define all the elements of a search problem, namely (for more information, see Chapter 3 of Russell and Norvig, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case, an action would be a change to one of the terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check to recognize whether a state is a goal state or not, and so to terminate the search. There are different ways to define this test in a regression problem. One way is to set a threshold for the sum of the square errors.
Step cost: the cost of an action. In such an abstract problem, you can probably consider the unweighted distance from the initial state in the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you have defined all of the elements of the search problem, you basically have to implement:
a Node structure, which contains information about the parent, the state, and the current cost;
a function to expand a given node, returning its successor nodes (according to the transition function, the actions, and the step cost);
the goal test;
the actual search algorithm. At the beginning, the queue contains only the node with the initial state; afterwards, it is updated with the successor nodes.
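A rough sketch of how those pieces could fit together for this specific problem, assuming the goal is to fit f(x) = a*x^e1 + b*x^e2 + c*x^e3 + k to a given dataset (all names, the dataset, and the error threshold are illustrative assumptions, not a definitive implementation):

    from collections import deque

    VALUES = range(6)  # all coefficients, exponents and the constant lie in 0..5

    def f(params, x):
        a, e1, b, e2, c, e3, k = params
        return a * x**e1 + b * x**e2 + c * x**e3 + k

    def sse(params, data):
        # Sum of squared errors over the dataset, used for the goal test
        return sum((f(params, x) - y) ** 2 for x, y in data)

    def successors(params):
        # Actions: change a single term by +1 or -1, staying inside 0..5
        for i in range(len(params)):
            for delta in (-1, 1):
                v = params[i] + delta
                if v in VALUES:
                    yield params[:i] + (v,) + params[i + 1:]

    def bfs_fit(data, start=(0, 0, 0, 0, 0, 0, 0), threshold=1.0):
        # Open list (FIFO queue) initialised with the node containing the initial state
        frontier = deque([start])
        visited = {start}
        while frontier:
            state = frontier.popleft()
            if sse(state, data) <= threshold:   # goal test
                return state
            for nxt in successors(state):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append(nxt)
        return None

    # Example usage with a tiny made-up dataset sampled from f(x) = 2x^2 + 1:
    # bfs_fit([(0, 1), (1, 3), (2, 9)])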
I am learning about the SARSA algorithm's implementation and have a question. I understand that the general "learning" step takes the following form:
A robot is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
such that the list of Actions,
a = {n,w,e,s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L * [r + D * Q(a',s') - Q(a,s)]
Where L is the learning rate, r is the reward associated with (a,s), Q(a',s') is the expected reward for taking action a' in the new state s', and D is the discount factor.
Firstly, I don't understand the role of the term -Q(a,s): why do we subtract the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA, all possible Q(s',a') values are taken into account and the highest value is picked (I believe this is epsilon-greedy?). Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to a one-step lookahead? Why not, say, also look at a hypothetical Q(s'',a'')?
I guess overall my questions boil down to: what makes SARSA better than a breadth-first or depth-first search algorithm?
Why do we subtract Q(a,s)? r + D*Q(a',s') is the reward that we got on this run through from getting to state s by taking action a. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after getting to state s from action a, and the rewards associated with going to future states will change in the future. So we can't just set Q(a,s) equal to r + D*Q(a',s'). Instead, we just want to push it in the right direction so that it will eventually converge on the right value. So we look at the error in prediction, which requires subtracting Q(a,s) from r + D*Q(a',s'). This is the amount that we would need to change Q(a,s) by in order to make it perfectly match the reward that we just observed. Since we don't want to do that all at once (we don't know if this is always going to be the best option), we multiply this error term by the learning rate, L, and add the result to Q(a,s), for a more gradual convergence on the correct value.
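As a small sketch of that arithmetic, keeping the question's notation (L is the learning rate, D the discount factor; the dict layout is just for illustration):

    # q maps (action, state) pairs to values.
    def sarsa_update(q, state, action, reward, next_state, next_action, L=0.1, D=0.9):
        # Prediction error: the observed estimate r + D*Q(a',s')
        # minus the current prediction Q(a,s).
        error = reward + D * q[(next_action, next_state)] - q[(action, state)]
        # Move the estimate a fraction L of the way toward the observation.
        q[(action, state)] += L * error

    # E.g. with Q(a,s) = 0.5, r = 1, Q(a',s') = 0.2:
    # error = 1 + 0.9*0.2 - 0.5 = 0.68, so Q(a,s) becomes 0.5 + 0.1*0.68 = 0.568.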
Why do we pick actions randomly? The reason to not always pick the next state or action in a deterministic way is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something that we haven't explored. Maybe it is. But maybe the thing that we haven't explored yet is actually way better than we've already seen. This is called the exploration vs exploitation problem - if we just keep doing things that we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
Why can't we just take all possible actions from a given state? This would force us to basically look at the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.
I'm looking over a sample exam and there is a question on Q-learning, which I have included below. In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)? It appears the Q-value to go back up to A2 would be 0.18, and the Q-value to go right would be 0.09. So why wouldn't the agent go back to A2 instead of going to B3?
[Image: Maze & Q-Table]
[Image: Solution]
Edit: Also, how come 2,C has a reward value of 2 for the action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q-value?
Edit 2: Then in step 6, the Q-values for going 'down' and 'right' in state 1,C are equal. At that point, does the agent just pick randomly? So for this question, would I just pick the best move, since it's possible the agent would pick it?
Edit 3: Would it be true to say the agent doesn't return to the state it previously came from? Will an agent ever explore the same state more than once (not counting starting a new instance of the maze)?
You seem to assume that you should look at the values of the state in the next time step. This is incorrect. The Q-function answers the question:
If I'm in state x, which action should I take?
In non-deterministic environments you don't even know what the next state will be, so it would be impossible to determine which action to take in your interpretation.
The learning part of Q-learning indeed acts on two subsequent timesteps, but only after they are already known, and they are used to update the values of the Q-function. This has nothing to do with how these samples (state, action, reinforcement, next state) are collected. In this case, samples are collected by the agent interacting with the environment. And in the Q-learning setting, the agent interacts with the environment according to a policy, which here is based on the current values of the Q-function. Conceptually, a policy works by answering the question I quoted above.
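In other words, acting with the Q-function looks roughly like this (a sketch; q_table and its (state, action) key layout are assumptions for illustration):

    import random

    # Only the Q-values of the current state are consulted when choosing an
    # action; ties between equally good actions are broken at random.
    def greedy_action(q_table, state, actions):
        best = max(q_table[(state, a)] for a in actions)
        return random.choice([a for a in actions if q_table[(state, a)] == best])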
In steps 1 and 2, the Q-function is modified only for states 1,A and 2,A. In step 3 the agent is in state 3,A, so that's the only part of the Q-function that's relevant.
In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)?
In state 3,A the action that has the highest Q-value is "right" (0.2). All other actions have value 0.0.
Also, how come 2,C has a reward value of 2 for the action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q-value?
As I see it, there is no wall to the right of 2,C. Nevertheless, the Q-function is given, and it's irrelevant for this task whether it's possible to reach such a Q-function using Q-learning. And you can always start Q-learning from an arbitrary Q-function anyway.
In Q-learning, your only knowledge is the Q-function, so you don't know anything about "walls" and other such things: you act according to the Q-function, and that's the whole beauty of this algorithm.
Then in step 6, the Q-values for going 'down' and 'right' in state 1,C are equal. At that point, does the agent just pick randomly? So for this question, would I just pick the best move, since it's possible the agent would pick it?
Again, you should look at the values for the state the agent is currently in, so for 1,B "right" is optimal - it has 0.1 and other actions are 0.0.
To answer the last question, even though it's irrelevant here: yes, if the agent is taking the greedy step and multiple actions seem optimal, it chooses one at random in most common policies.
Would it be true to say the agent doesn't return to the state it previously came from? Will an agent ever explore the same state more than once (not counting starting a new instance of the maze)?
No. As I've stated above, the only guideline the agent uses in pure Q-learning is the Q-function. It is not aware that it has been in a particular state before.