In a perfect-information environment, where we are able to know the state after an action (like playing chess), is there any reason to use Q-learning rather than TD (temporal difference) learning?
As far as I understand, TD learning tries to learn V(state) values, while Q-learning learns Q(state, action) values. Does that mean Q-learning learns more slowly (since there are more state-action combinations than states alone)? Is that correct?
Q-Learning is a TD (temporal difference) learning method.
I think you are trying to refer to TD(0) vs Q-learning.
I would say it depends on whether your actions are deterministic or not. Even if you have the transition function, it can be expensive to decide which action to take with TD(0), because you need to compute the expected value of each action at every step. In Q-learning, that computation is already summarized in the Q-value.
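To make that concrete: acting greedily from a state-value function requires a one-step lookahead through the model, while acting greedily from an action-value function is a direct lookup (standard notation, with $p(s'\mid s,a)$ the transition function and $r(s,a,s')$ the reward):

$$a^* = \arg\max_a \sum_{s'} p(s'\mid s,a)\big[r(s,a,s') + \gamma V(s')\big] \quad\text{vs.}\quad a^* = \arg\max_a Q(s,a)$$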
Given a deterministic environment (or, as you say, a "perfect" environment in which you are able to know the state after performing an action), I guess you can simulate the effect of all possible actions in a given state (i.e., compute all possible next states) and choose the action that reaches the next state with the maximum value V(state).
However, it should be taken into account that both the value function V(state) and the Q-function Q(state, action) are defined for a given policy. In a sense, the value function can be considered an average of the Q-function, because V(s) "evaluates" the state s over all possible actions. So, to compute a good estimate of V(s), the agent still needs to try all the possible actions in s.
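In symbols, for a fixed policy $\pi$ this "average" is the standard relation

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a).$$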
In conclusion, I think that although V(s) is simpler than Q(s,a), they likely need a similar amount of experience (or time) to reach a stable estimate.
You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.
Q-learning is a TD control algorithm; this means it tries to give you an optimal policy, as you said. TD learning is more general, in the sense that it includes control algorithms as well as prediction-only methods that estimate V for a fixed policy.
Actually, Q-learning is TD learning applied to state-action pairs instead of just states. That doesn't mean Q-learning is different from TD. In TD(0), the agent takes one step (over states, or over state-action pairs) and then updates its value estimate; likewise, in n-step TD the agent takes n steps and then updates. Comparing TD and Q-learning isn't quite the right framing; compare Q-learning with SARSA, or TD with Monte Carlo, instead.
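To make the comparison concrete, here are the standard one-step updates ($\alpha$ is the step size, $\gamma$ the discount factor):

TD(0) prediction: $V(s) \leftarrow V(s) + \alpha\big[r + \gamma V(s') - V(s)\big]$

SARSA (on-policy control): $Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma Q(s',a') - Q(s,a)\big]$

Q-learning (off-policy control): $Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]$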
I was trying to understand the proof of why the policy improvement theorem can be applied to an epsilon-greedy policy.
The proof starts with the mathematical definition -
I am confused on the very first line of the proof.
This equation is the Bellman expectation equation for Q(s,a), while V(s) and Q(s,a) follow the relation -
So how can we ever derive the first line of the proof?
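(For reference, the relation in question is presumably $v_\pi(s) = \sum_a \pi(a\mid s)\, q_\pi(s,a)$, and the first line of the proof in Sutton & Barto, Section 5.4, is roughly

$$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s,a) = \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a),$$

where $\pi'$ is the $\epsilon$-greedy policy and the left-hand side abuses notation slightly to mean the expected action value when the action is drawn from $\pi'$ in state $s$.)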
The optimal control problem was first introduced in the 1950s. The problem was to design a controller to maximize or minimize an objective function. Richard Bellman approached this optimal control problem by introducing the Bellman Equation:
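(In standard notation, presumably the value written as a discounted sum of rewards:)

$$V(s_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right]$$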
Where the value is equivalent to the discounted sum of the rewards. If we take the first time step out, we get the following:
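(Presumably the recursive, one-step form:)

$$V(s_t) = \mathbb{E}\big[\,r_t + \gamma V(s_{t+1})\,\big]$$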
Subsequently, classic reinforcement learning is based on the Markov Decision Process, and assumes all state transitions are known. Thus the equation becomes the following:
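(Presumably something like:)

$$V(s_t) = \sum_{s_{t+1}} P(s_{t+1}\mid s_t, a_t)\big[\,r(s_t,a_t,s_{t+1}) + \gamma V(s_{t+1})\,\big]$$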
That is, the summation is equivalent to the summation over all possible transitions from that state, multiplied by the reward for reaching the new state.
The above equations are written in the value form. Sometimes, we want the value to also be a function of the action, thus creating the action-value. The conversion of the above equation to the action value form is:
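(Presumably:)

$$Q(s_t,a_t) = \sum_{s_{t+1}} P(s_{t+1}\mid s_t, a_t)\big[\,r(s_t,a_t,s_{t+1}) + \gamma V(s_{t+1})\,\big]$$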
The biggest issue with this equation is that in real life, the transitional probabilities are in fact not known. It is impossible to know the transitional probabilities of every single state unless the problem is extremely simple. To solve this problem, we usually just take the max of the future discounted portion. That is, we assume we behave optimally in the future, rather than taking the average of all possible scenarios.
However, the environment can be heavily stochastic in a real scenario. Therefore, the best estimate of the action-value function in any state is just that, an estimate, and the best probabilistic estimate is the expected value. Thus, giving you:
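(Presumably:)

$$Q(s_t,a_t) = \mathbb{E}\big[\,r_t + \gamma \max_{a_{t+1}} Q(s_{t+1},a_{t+1})\,\big]$$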
The reward notation is t+1 in your equation. This is mainly because of different interpretations. The proof above still holds for your notation. It is simply saying you won't know your reward until you get to your next sampling time.
I am trying to implement the Q-learning algorithm, but I do not have enough time to implement epsilon-greedy action selection. For simplicity I am choosing a random action, without any proper justification. Will this work?
Yes, random action selection will allow Q-learning to learn an optimal policy. The goal of epsilon-greedy exploration is to ensure that all the state-action pairs are (asymptotically) visited infinitely often, which is a convergence requirement [Sutton & Barto, Section 6.5]. Obviously, a purely random action selection process also satisfies this requirement.
The main drawback is that your agent will act poorly throughout the learning stage. The convergence speed may also suffer, but I guess this last point is very application dependent.
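A minimal sketch of what this looks like in code (assuming a hypothetical environment exposed through a `step(s, a)` function that returns `(next_state, reward)`, with `n_states` and `n_actions` known):

```python
import numpy as np

def q_learning_random_behavior(step, n_states, n_actions,
                               episodes=1000, steps_per_episode=100,
                               alpha=0.1, gamma=0.99, seed=0):
    """Off-policy Q-learning driven by a uniformly random behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = rng.integers(n_states)            # arbitrary start state
        for _ in range(steps_per_episode):
            a = rng.integers(n_actions)       # random action instead of epsilon-greedy
            s_next, r = step(s, a)            # feedback from the environment
            # The Q-learning target uses the greedy value of the next state,
            # so it still learns the optimal policy despite the random behavior.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q                                  # greedy policy: argmax_a Q[s, a]
```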
I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations of DP models (the sheer amount of time and space they require), which hopefully will reduce the computation time (it currently takes quite a few hours for similar research), and the reduced space will allow adding more complexity to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.
The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the former assumes the environment's dynamics are known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this, RL methods are useful when a model is difficult or costly to construct. However, it should be noted that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original form (i.e., tabular form), DP and RL algorithms cannot be implemented for general problems. They can only be implemented when the state and action spaces consist of a finite number of discrete elements, because (among other reasons) they require the exact representation of value functions or policies, which is generally impossible for state spaces with an infinite number of elements (or too costly when the number of states is very high).

Even when the states and actions take finitely many values, the cost of representing value functions and policies grows exponentially with the number of state variables (and action variables, for Q-functions). This problem is called the curse of dimensionality, and makes the classical DP and RL algorithms impractical when there are many state and action variables. To cope with these problems, versions of the classical algorithms that approximately represent value functions and/or policies must be used. Since most problems of practical interest have large or continuous state and action spaces, approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on either DP or RL. In the RL case, the known transitions are simply used to generate the next state given a state and an action.
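For instance, a known transition matrix can simply play the role of the simulator that SARSA samples from. A minimal sketch (assuming `P` is an array of shape `(n_states, n_actions, n_states)` holding the transition probabilities and `R` holds the corresponding rewards):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(P, R, s, a):
    """Sample the next state from the known probabilities P[s, a, :]
    and return it with the associated reward."""
    s_next = rng.choice(P.shape[2], p=P[s, a])
    return s_next, R[s, a, s_next]
```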
Is it better to use DP or RL? Honestly, I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems safer, but maybe a big part of your state space is irrelevant for finding an optimal policy. In such a case, sampling a set of trajectories (RL) may be computationally more effective. In any case, if both methods are applied correctly, they should reach a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when a nonlinear approximator (such as an artificial neural network) is combined with RL.
If you have access to the transition probabilities, I would suggest not using methods based on a Q-value. These would require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.
I'm trying to figure out how to implement Q learning in a gridworld example. I believe I understand the basics of how Q learning works but it doesn't seem to be giving me the correct values.
This example is from Sutton and Barto's book on reinforcement learning.
The gridworld is specified such that the agent can take actions {N, E, W, S} in any given state with equal probability, and the reward for all actions is 0, except when the agent attempts to move off the grid, in which case it is -1. There are two special states, A and B, from which the agent deterministically moves to A' and B' with rewards +10 and +5 respectively.
My question is about how I would go about implementing this through Q learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, not knowing anything and then takes actions selected by an epsilon-greedy algorithm and gets rewards that we can simulate since we know how the rewards are distributed.
This leads me to my question. Can I build a transition probability matrix each time the agent transitions from some state S -> S' where the probabilities are computed based on the frequency with which the agent took a particular action and did a particular transition?
For Q-learning you don't need a "model" of the environment (i.e., a transition probability matrix) to estimate the value function, because it is a model-free method. Evaluation via matrix inversion belongs to dynamic programming (model-based), where you use a transition matrix. You can think of the Q-learning algorithm as a kind of trial-and-error algorithm: you select an action and receive feedback from the environment. Contrary to model-based methods, you don't have any knowledge of how the environment works (no transition matrix and no reward matrix). Eventually, after enough sampled experience, the Q-function will converge to the optimal one.
For the implementation of the algorithm, you can start from an initial state after initializing your Q-function for all states and actions (so you keep a table of size $|S| \times |A|$). Then you select an action according to your policy. Here you should implement a step function. The step function returns the new state $s'$ and the reward. Consider the step function as the feedback of the environment to your action.
Eventually you just need to update your Q-function according to:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma\max_{a'} Q(s',a')-Q(s,a)\right]$
Set $s=s'$ and repeat the whole process till convergence.
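For the gridworld in the question, the step function might look like the following minimal sketch (state encoded as `row * 5 + col`; the exact coordinates of A, B, A', B' are taken from Example 3.5 in Sutton & Barto, so treat them as an assumption):

```python
N = 5                                            # 5x5 grid, state = row * N + col
A, A_PRIME = (0, 1), (4, 1)                      # A -> A', reward +10
B, B_PRIME = (0, 3), (2, 3)                      # B -> B', reward +5
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}   # N, E, S, W

def step(s, a):
    """Return (next_state, reward) for taking action a in state s."""
    row, col = divmod(s, N)
    if (row, col) == A:
        return A_PRIME[0] * N + A_PRIME[1], 10.0
    if (row, col) == B:
        return B_PRIME[0] * N + B_PRIME[1], 5.0
    dr, dc = MOVES[a]
    nr, nc = row + dr, col + dc
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc, 0.0
    return s, -1.0                               # off the grid: stay put, reward -1
```

With this in place, the loop is: pick an action $\epsilon$-greedily from $Q[s]$, call `step`, apply the update above, and set $s = s'$.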
Hope this helps
Not sure if this helps, but here is a write up explaining Q learning through an example of a robot. There is also some R code in there if you want to try it out yourself.
I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning-based policy on it.
I am a tourist manager.
I have n hotels, each can contain a different number of persons.
for each person I put in a hotel I get a reward, based on which room I have chosen.
If I want I can also murder the person, so it goes in no hotel but it gives me a different reward.
(OK, that's a joke... but it's to say that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
my state is a vector containing the number of persons in each hotel.
my action is a vector of zeroes and ones which tells me where I put the new person.
my reward matrix is formed by the rewards I get for each transition between states (even the self-transition one).
Now, since I can take an unlimited number of people (i.e., I can fill the hotels but keep going by murdering guests), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I seeing wrongly? Should I choose a random state as the final one? Have I missed the point entirely?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode, with a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma, less than one, that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
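In symbols (standard definitions): for the continuing case the return is $G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$ with $\gamma < 1$, whereas for the episodic case with terminal time $T$ one can take $\gamma = 1$ and $G_t = \sum_{k=t+1}^{T} R_k$.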
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of XP with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you would get a good start and learn something about your problem by limiting the scope (finite number of people and hotels/rooms) of the problem and turning Q-learning loose on the smaller state matrix.
Or, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe this isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward per time step (the average reward). Learning Q and rho jointly results in a better strategy.
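For reference, a common formulation of the R-learning updates (roughly as in Schwartz's original algorithm and the first edition of Sutton & Barto; check those sources for the exact form) is: after observing $(s, a, r, s')$,

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r - \rho + \max_{a'} Q(s',a') - Q(s,a)\big],$$

and, if the action taken was greedy with respect to $Q$,

$$\rho \leftarrow \rho + \beta\big[r - \rho + \max_{a'} Q(s',a') - \max_{a} Q(s,a)\big].$$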
To build on the above response: with an infinite state space, you should definitely consider some sort of generalization for your Q-function; you will get more value out of it in an infinite space. You could experiment with several different function approximators, whether that is simple linear regression or a neural network.
Like Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to compare policies whose returns all equal infinity, which means you would not be able to identify the optimal policy.
The main thing I wanted to add here though for anyone reading this later is the significance of reward shaping. In an infinite problem, where there isn't that final large reward, sub-optimal reward loops can occur, where the agent gets "stuck", since maybe a certain state has a reward higher than any of its neighbors in a finite horizon (which was defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.