Q learning transition matrix - machine-learning

I'm trying to figure out how to implement Q learning in a gridworld example. I believe I understand the basics of how Q learning works but it doesn't seem to be giving me the correct values.
This example is from Sutton and Barto's book on reinforcement learning.
The gridworld is specified such that the agent can take actions {N,E,W,S} in any given state with equal probability, and the reward for every action is 0, except when the agent attempts to move off the grid, in which case it is -1. There are two special states, A and B, from which the agent deterministically moves to A' and B' with rewards +10 and +5 respectively.
My question is about how I would go about implementing this through Q learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, knowing nothing, and then takes actions selected by an epsilon-greedy algorithm, receiving rewards that we can simulate since we know how the rewards are distributed.
This leads me to my question. Can I build a transition probability matrix each time the agent transitions from some state S -> S', where the probabilities are computed based on the frequency with which the agent took a particular action and made a particular transition?
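To make this concrete, here is roughly what I have in mind (a minimal sketch; the state/action encoding and the `counts` array are just illustrative): count every observed (s, a, s') triple and normalize the counts into an empirical transition matrix.

```python
import numpy as np

n_states, n_actions = 25, 4          # 5x5 gridworld, actions {N, E, W, S}

# Raw visit counts for each (state, action, next_state) triple.
counts = np.zeros((n_states, n_actions, n_states))

def record_transition(s, a, s_next):
    """Update the counts after observing s --a--> s_next."""
    counts[s, a, s_next] += 1

def empirical_transition_matrix():
    """Normalize counts into P[s, a, s'] = estimated Pr(s' | s, a)."""
    totals = counts.sum(axis=2, keepdims=True)
    # Avoid division by zero for (s, a) pairs that were never visited.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```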

For Q-learning you don't need a "model" of the environment (i.e., a transition probability matrix) to estimate the value function, since it is a model-free method. Evaluation by matrix inversion belongs to dynamic programming (model-based), where you do use a transition matrix. You can think of the Q-learning algorithm as a kind of trial-and-error algorithm where you select an action and receive feedback from the environment. However, contrary to model-based methods, you don't have any knowledge about how your environment works (no transition matrix and no reward matrix). Eventually, after enough sampled experience, the Q function will converge to the optimal one.
For the implementation of the algorithm, you can start from an initial state, after initializing your Q function for all states and actions (so you keep a table of size $S \times A$). Then you select an action according to your policy. Here you should implement a step function. The step function returns the new state $s'$ and the reward; think of it as the feedback of the environment to your action.
Eventually you just need to update your Q-function according to:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma\max_{a'}Q(s',a')-Q(s,a)\right]$
Set $s=s'$ and repeat the whole process till convergence.
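As a minimal sketch of that loop (the special-state coordinates follow the usual 5x5 layout from the book, and the `step` and `epsilon_greedy` helpers here are only illustrative, not a reference implementation):

```python
import numpy as np

# 5x5 gridworld in the style of Sutton & Barto; adjust the layout if yours differs.
SIZE = 5
A, A_PRIME = (0, 1), (4, 1)   # special state A teleports to A' with reward +10
B, B_PRIME = (0, 3), (2, 3)   # special state B teleports to B' with reward +5
ACTIONS = [(-1, 0), (0, 1), (0, -1), (1, 0)]   # N, E, W, S as (row, col) moves

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((SIZE * SIZE, len(ACTIONS)))      # Q table of size S x A

def to_index(cell):
    return cell[0] * SIZE + cell[1]

def step(cell, a):
    """Environment feedback: returns (next_cell, reward)."""
    if cell == A:
        return A_PRIME, 10.0
    if cell == B:
        return B_PRIME, 5.0
    r, c = cell[0] + ACTIONS[a][0], cell[1] + ACTIONS[a][1]
    if 0 <= r < SIZE and 0 <= c < SIZE:
        return (r, c), 0.0
    return cell, -1.0                          # tried to leave the grid: stay put, reward -1

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise act greedily w.r.t. Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(Q[s]))

cell = (2, 2)                                  # arbitrary start state
for _ in range(200_000):
    s = to_index(cell)
    a = epsilon_greedy(s)
    next_cell, r = step(cell, a)
    s_next = to_index(next_cell)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    cell = next_cell
```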
Hope this helps

Not sure if this helps, but here is a write up explaining Q learning through an example of a robot. There is also some R code in there if you want to try it out yourself.

Related

Confusion in understanding Q(s,a) formula for Reinforcement Learning MDP?

I was trying to understand the proof of why the policy improvement theorem can be applied to an epsilon-greedy policy.
The proof starts with the mathematical definition -
I am confused on the very first line of the proof.
This equation is the Bellman expectation equation for Q(s,a), while V(s) and Q(s,a) follow the relation $V_\pi(s)=\sum_{a}\pi(a\mid s)\,Q_\pi(s,a)$.
So how can we ever derive the first line of the proof?
The optimal control problem was first introduced in the 1950s. The problem was to design a controller to maximize or minimize an objective function. Richard Bellman approached this optimal control problem by introducing the Bellman Equation:
Where the value is equivalent to the discounted sum of the rewards. If we take the first time step out, we get the following:
Subsequently, classic reinforcement learning is based on the Markov Decision Process, and assumes all state transitions are known. Thus the equation becomes the following:
That is, the summation is over all possible transitions from that state, multiplied by the reward for reaching the new state.
The above equations are written in the value form. Sometimes, we want the value to also be a function of the action, thus creating the action-value. The conversion of the above equation to the action value form is:
The biggest issue with this equation is that in real life, the transitional probabilities are in fact not known. It is impossible to know the transitional probabilities of every single state unless the problem is extremely simple. To solve this problem, we usually just take the max of the future discounted portion. That is, we assume we behave optimally in the future, rather than taking the average of all possible scenarios.
However, the environment can be heavily stochastic in a real scenario. Therefore, the best estimate of the action-value function in any state is simply an estimate, and in the probabilistic case that estimate is the expected value. Thus, giving you:
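In standard notation (a sketch; the symbols follow Sutton & Barto's conventions), the chain of steps described above reads roughly:
$$V(s)=\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\Big|\,s_0=s\Big] \quad\Rightarrow\quad V(s)=\mathbb{E}\big[r_0+\gamma V(s_1)\,\big|\,s_0=s\big]$$
$$V(s)=\sum_{s'}P(s'\mid s,a)\big[R(s,a,s')+\gamma V(s')\big] \quad\Rightarrow\quad Q(s,a)=\sum_{s'}P(s'\mid s,a)\big[R(s,a,s')+\gamma\max_{a'}Q(s',a')\big]$$
$$Q(s,a)=\mathbb{E}\big[r+\gamma\max_{a'}Q(s',a')\big]$$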
The reward notation is t+1 in your equation. This is mainly because of different interpretations. The proof above still holds for your notation. It is simply saying you won't know your reward until you get to your next sampling time.

ELI5 score function and softmax policy for policy gradient

Hi, I am following David Silver's lecture on policy gradients but have trouble getting some of the points he is making when introducing the score function.
At time 33:44 he is justifying the use of the likelihood ratio trick as follows: "By rewriting the gradient in this way we are able to take expectations. Computing expectation of this thing is hard, but computing expectation of this thing is easy because we have this policy here and it is the policy we are following".
So, my questions for this slide are the following:
What kind of expectation are we computing? Is it over $\pi(s,a)$, the probability of taking action $a$ in state $s$?
Why do we need an expectation of it at all?
Why is computing the expectation of $\nabla_\theta \log \pi_\theta(s,a)$ (the rewritten form) easier? (maybe an example please)
And then, as we transition to the next slide, he shows the score function for the softmax policy. I don't get how he derived it at all... is it just calculus? Could you maybe show the steps?
Thanks :)
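For what it's worth, here is a sketch of the calculus for the last part, assuming the linear-in-features softmax policy used on that slide, $\pi_\theta(s,a)\propto e^{\phi(s,a)^{\top}\theta}$:
$$\log\pi_\theta(s,a)=\phi(s,a)^{\top}\theta-\log\sum_{b}e^{\phi(s,b)^{\top}\theta}$$
$$\nabla_\theta\log\pi_\theta(s,a)=\phi(s,a)-\frac{\sum_{b}\phi(s,b)\,e^{\phi(s,b)^{\top}\theta}}{\sum_{b}e^{\phi(s,b)^{\top}\theta}}=\phi(s,a)-\mathbb{E}_{b\sim\pi_\theta(s,\cdot)}\big[\phi(s,b)\big]$$
The likelihood-ratio trick itself is just $\nabla_\theta\pi_\theta(s,a)=\pi_\theta(s,a)\,\nabla_\theta\log\pi_\theta(s,a)$, which is what lets the gradient of the objective be written as an expectation under the policy you are already following.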

Will Q Learning algorithm produce the same result if I do not use e-greedy?

I am trying to implement the Q-learning algorithm but I do not have enough time to select the action by e-greedy. For simplicity I am choosing a random action, without any proper justification. Will this work?
Yes, random action selection will allow Q-learning to learn an optimal policy. The goal of e-greedy exploration is to ensure that all the state-action pairs are (asymptotically) visited infinitely often, which is a convergence requirement [Sutton & Barto, Section 6.5]. Obviously, a purely random action selection process also satisfies this requirement.
The main drawback is that your agent will act poorly throughout the learning stage. Convergence speed may also suffer, but I guess this last point is very application dependent.
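A minimal sketch of the point above (the function names here are illustrative): the behavior policy only decides which transitions you sample; the off-policy Q-learning update itself is unchanged.

```python
import numpy as np

def random_action(Q, s, n_actions, rng):
    # Pure exploration: every state-action pair keeps getting visited.
    return int(rng.integers(n_actions))

def epsilon_greedy_action(Q, s, n_actions, rng, epsilon=0.1):
    # Mostly exploit the current Q estimate, explore with probability epsilon.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Off-policy target: max over next actions, regardless of how `a` was chosen.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

Either selection rule can feed `q_update`; only the distribution of sampled transitions (and hence how quickly interesting states get visited) differs.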

TD learning vs Q learning

In a perfect information environment, where we are able to know the state after an action, like playing chess, is there any reason to use Q learning rather than TD (temporal difference) learning?
As far as I understand, TD learning tries to learn the state value V(state), while Q learning learns the action value Q(state, action), which means Q learning learns more slowly (as there are more state-action combinations than states alone). Is that correct?
Q-Learning is a TD (temporal difference) learning method.
I think you are trying to refer to TD(0) vs Q-learning.
I would say it depends on whether your actions are deterministic or not. Even if you have the transition function, it can be expensive to decide which action to take with TD(0), as you need to calculate the expected value for each of the actions at each step. In Q-learning that is already summarized in the Q-value.
Given a deterministic environment (or, as you say, a "perfect" environment in which you are able to know the state after performing an action), I guess you can simulate the effect of all possible actions in a given state (i.e., compute all possible next states), and choose the action that reaches the next state with the maximum value V(state).
However, it should be taken into account that both the value function V(state) and the Q function Q(state, action) are defined for a given policy. In some way, the value function can be considered an average of the Q function, in the sense that V(s) "evaluates" the state s over all possible actions. So, to compute a good estimate of V(s) the agent still needs to perform all the possible actions in s.
In conclusion, I think that although V(s) is simpler than Q(s,a), they likely need a similar amount of experience (or time) to reach a stable estimate.
You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.
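To illustrate the practical difference (a sketch with made-up names): picking the greedy action from V(s) needs a model for a one-step lookahead, while with Q(s,a) it is a direct argmax over the table.

```python
import numpy as np

def greedy_from_V(V, s, actions, model, gamma=0.99):
    """Needs a model: simulate each action to see which next state it reaches.
    `model(s, a)` is assumed to return (next_state, reward)."""
    def lookahead(a):
        s_next, r = model(s, a)
        return r + gamma * V[s_next]
    return max(actions, key=lookahead)

def greedy_from_Q(Q, s):
    """No model needed: the Q table already summarizes the one-step lookahead."""
    return int(np.argmax(Q[s]))
```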
Q learning is a TD control algorithm, which means it tries to give you an optimal policy, as you said. TD learning is more general in the sense that it can include control algorithms as well as prediction-only methods that estimate V for a fixed policy.
Actually, Q-learning is the process of using state-action pairs instead of just states. But that doesn't mean Q learning is different from TD. In TD(0) our agent takes one step (which could be one step over state-action pairs or just states) and then updates its value estimate, and the same holds in n-step TD, where the agent takes n steps and then updates. Comparing TD and Q-learning isn't the right way to frame it; you can compare TD and SARSA instead, or TD and Monte Carlo.

is Q-learning without a final state even possible?

I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning based policy on it.
I am a tourist manager.
I have n hotels, each of which can hold a different number of persons.
For each person I put in a hotel I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go into no hotel but give me a different reward.
(OK, that's a joke... but it's to say that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
My state is a vector containing the number of persons in each hotel.
My action is a vector of zeroes and ones which tells me where I put the new person.
My reward matrix is formed by the rewards I get for each transition between states (even the self-transition one).
Now, since I can get an unlimited number of people (i.e., I can fill the hotels but I can go on killing them), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I seeing wrongly? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode, and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is defined as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
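A minimal sketch of what that looks like in practice (the `env_step` and `behavior_action` names are placeholders for your own hotel simulator and exploratory behavior policy): there is no terminal state, just one endless loop with $\gamma < 1$.

```python
gamma, alpha = 0.9, 0.1                    # gamma < 1 keeps the infinite discounted sum finite

Q = {}                                     # dict keyed by (state, action); grows as states appear

def q(s, a):
    return Q.get((s, a), 0.0)

def learn_forever(env_step, behavior_action, actions, s0, n_steps=1_000_000):
    s = s0
    for _ in range(n_steps):               # no episode boundary: we just keep interacting
        a = behavior_action(s)             # any sufficiently exploratory behavior policy
        s_next, r = env_step(s, a)
        # Off-policy Q-learning target for a continuing task
        target = r + gamma * max(q(s_next, b) for b in actions)
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
        s = s_next
```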
RL problems don't need a final state per se. What they need is a reward signal. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with a discrete approach, you would get a good start, and learn something about your problem, by limiting the scope of the problem (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state matrix.
Or, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe it isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward per time step. Joint learning of Q and rho results in a better strategy.
To iterate on the above response: with an infinite state space, you definitely should consider some form of generalization for your Q function; you will get more value out of it in an infinite space. You could experiment with several different function approximations, whether that is simple linear regression or a neural network.
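As a sketch of the linear case (the feature function `phi` here is a hypothetical encoding you would pick for a (state, action) pair, e.g. occupancy counts per hotel):

```python
import numpy as np

def linear_q(w, phi, s, a):
    """Q(s, a) approximated as a dot product of weights and features."""
    return float(w @ phi(s, a))

def semi_gradient_update(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One semi-gradient Q-learning step for a linear approximator."""
    target = r + gamma * max(linear_q(w, phi, s_next, b) for b in actions)
    td_error = target - linear_q(w, phi, s, a)
    w += alpha * td_error * phi(s, a)      # gradient of w . phi(s, a) w.r.t. w is phi(s, a)
    return w
```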
Like Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to rank policies whose returns are all infinite, which means you would not be able to identify the optimal policy.
The main thing I wanted to add here, for anyone reading this later, is the significance of reward shaping. In an infinite problem, where there isn't a final large reward, sub-optimal reward loops can occur in which the agent gets "stuck", since a certain state may have a higher reward than any of its neighbors within the effective horizon (which is set by gamma). To account for that, you want to penalize the agent for landing in the same state multiple times, to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
