In reinforcement learning, what is the difference between policy iteration and value iteration?
As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward (value) of that policy.
My doubt is: if you are selecting a random policy π in PI, how is it guaranteed to be the optimal policy, even if we choose several random policies?
Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.
Key points:
Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
Value iteration includes: finding the optimal value function + one policy extraction. There is no repetition of the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e. it has converged).
Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
The algorithms for policy evaluation and for finding the optimal value function are highly similar except for a max operation (as highlighted).
Similarly, the key steps of policy improvement and policy extraction are identical, except that the former includes a stability check.
In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.
I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.
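To make the two loops described above concrete, here is a minimal tabular sketch of both algorithms (the dense `P[s, a, s']` / `R[s, a, s']` arrays and the tiny toy MDP at the bottom are my own illustrative assumptions, not taken from the book):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iteratively evaluate a deterministic policy (array of action indices)."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.9):
    """Policy evaluation + policy improvement, repeated until the policy is stable."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(P, R, policy, gamma)       # full evaluation
        Q = np.einsum('san,san->sa', P, R + gamma * V)   # one-step lookahead Q[s, a]
        new_policy = np.argmax(Q, axis=1)                # greedy improvement
        if np.array_equal(new_policy, policy):           # stability check
            return policy, V
        policy = new_policy

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Truncated evaluation and improvement folded into a single max-backup."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('san,san->sa', P, R + gamma * V)
        V_new = Q.max(axis=1)                            # the extra max operation
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new               # one policy extraction
        V = V_new

if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP, purely for illustration.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.7, 0.3]]])   # P[s, a, s']
    R = np.array([[[1.0, 0.0], [0.0, 2.0]],
                  [[0.0, 0.0], [3.0, 0.0]]])   # R[s, a, s']
    print(policy_iteration(P, R))
    print(value_iteration(P, R))
```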
In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.
In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can derive easily the optimal policy from the optimal value function. This process is based on the optimality Bellman operator.
In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the optimality Bellman operator contains a max operator, which is non-linear and therefore has different properties. In addition, it is possible to use hybrid methods between pure value iteration and pure policy iteration.
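For concreteness, the two operators contrasted here are, in standard notation for a finite MDP (this is not from the original post):

$$(\mathcal{T}^{\pi} V)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V(s')\bigr]$$

$$(\mathcal{T}^{*} V)(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V(s')\bigr]$$

Policy evaluation iterates the first (linear) operator for a fixed policy, while value iteration iterates the second, whose max makes it non-linear.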
The basic difference is:
In policy iteration, you randomly select a policy and find the value function corresponding to it, then find a new (improved) policy based on the previous value function, and so on; this leads to the optimal policy.
In value iteration, you randomly select a value function, then find a new (improved) value function in an iterative process until reaching the optimal value function, and then derive the optimal policy from that optimal value function.
Policy iteration works on the principle of "policy evaluation → policy improvement".
Value iteration works on the principle of "optimal value function → optimal policy".
As far as I am concerned, contrary to @zyxue's idea, VI is generally much faster than PI.
The reason is straightforward: as you already know, the Bellman equation is used to solve the value function for a given policy. Since we can solve the value function for the optimal policy directly, solving the value function for the current policy is obviously a waste of time.
As for your question about the convergence of PI, I think you might be overlooking the fact that if you improve the strategy for each information state, then you improve the strategy for the whole game. This is also easy to prove if you are familiar with Counterfactual Regret Minimization: the sum of the regrets over all information states forms an upper bound on the overall regret, so minimizing the regret for each state minimizes the overall regret, which leads to the optimal policy.
The main difference in speed is due to the max operation in every iteration of value iteration (VI).
In VI, each state will use just one action (with the max utility value) for calculating the updated utility value, but it first has to calculate the value of all possible actions in order to find this action via the Bellman Equation.
In policy iteration (PI), this max operation is omitted in step 1 (policy evaluation) by just following the intermediate policy to choose the action.
If there are N possible actions, VI has to calculate the Bellman equation N times for each state and then take the max, whereas PI calculates it only once (for the action dictated by the current policy).
However, PI has a policy improvement step that still uses the max operator and is as slow as a step in VI; but since PI converges in fewer iterations, this step won't happen as often as in VI.
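A rough sketch of the two backups being compared, assuming the MDP is given as dense arrays `P[s, a, s']` and `R[s, a, s']` (my own illustrative representation, not from the answer):

```python
import numpy as np

def vi_backup(V, P, R, gamma):
    # Value iteration: evaluate the Bellman equation for every action, then take the max.
    Q = np.einsum('san,san->sa', P, R + gamma * V)   # N Bellman evaluations per state
    return Q.max(axis=1)

def pi_evaluation_backup(V, P, R, gamma, policy):
    # Policy evaluation: only the single action dictated by the current policy.
    return np.array([P[s, policy[s]] @ (R[s, policy[s]] + gamma * V)
                     for s in range(len(V))])
```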
Watching this lecture, I heard that policy iteration is faster than value iteration.
The reasons are:
Value iteration runs in O(S^2 * A) per iteration, whereas policy iteration runs in O(S^2) per iteration while computing the values, and only the policy extraction step runs in O(S^2 * A).
Usually, value iteration reaches the optimal policy well before the values converge, so the rest of the computation is wasted. In contrast, policy iteration stops as soon as the policy no longer changes.
Reading this article, it says that value iteration is faster than policy iteration. Quoting from the article:
As we’ve seen, Policy Iteration evaluates a policy and then uses these values to improve that policy. This process is repeated until eventually the optimal policy is reached. As a result, at each iteration prior to the optimal policy, a sub-optimal policy has to be fully evaluated. Consequently, there is potentially a lot of wasted effort when trying to find the optimal policy.
In Policy Iteration, at each step, policy evaluation is run until convergence, then the policy is updated and the process repeats.
In contrast, Value Iteration only does a single iteration of policy evaluation at each step. Then, for each state, it takes the maximum action value to be the estimated state value. Once these state values have converged to the optimal state values, the optimal policy can be obtained. In practice this performs much better than Policy Iteration and finds the optimal state value function in much fewer steps.
I've read a couple of other articles and posts supporting the argument that policy iteration performs better, but I'm not quite sure.
In the context of Double Q or Dueling Q networks, I am not sure if I fully understand the difference, especially with V. What exactly is V(s)? How can a state have an inherent value?
If we are considering this in the context of trading stocks, say, then how would we define these three variables?
No matter what network we talk about, the reward is an inherent part of the environment. It is the signal (in fact, the only signal) that an agent receives throughout its life after taking actions. For example, an agent that plays chess gets only one reward at the end of the game, either +1 or -1; at all other times the reward is zero.
Here you can see a problem in this example: the reward is very sparse and is given just once, but the states in a game are obviously very different. If an agent is in a state where it has its queen while the opponent has just lost theirs, the chances of winning are very high (simplifying a little bit, but you get the idea). This is a good state and an agent should strive to get there. If, on the other hand, an agent has lost all its pieces, it is in a bad state and will likely lose the game.
We would like to quantify what good and bad states actually are, and here comes the value function V(s). Given any state, it returns a number, big or small. Usually, the formal definition is the expectation of the discounted future rewards, given a particular policy to act (for a discussion of a policy see this question). This makes perfect sense: a good state is one in which a future +1 reward is very probable; a bad state is the opposite, one in which a future -1 is very probable.
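For concreteness, the standard textbook definition (not quoted from the original post) is

$$v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]$$

i.e. the expected discounted sum of future rewards when starting in state s and following policy π.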
Important note: the value function depends on the rewards, and not just the reward of one state but of many of them. Remember that in our example the reward for almost all states is 0. The value function takes into account all future states along with their probabilities.
Another note: strictly speaking the state itself doesn't have a value. But we have assigned one to it, according to our goal in the environment, which is to maximize the total reward. There can be multiple policies and each will induce a different value function. But there is (usually) one optimal policy and the corresponding optimal value function. This is what we'd like to find!
Finally, the Q-function Q(s, a), or the action-value function, is the assessment of a particular action in a particular state for a given policy. When we talk about an optimal policy, the action-value function is tightly related to the value function via the Bellman optimality equations. This makes sense: the value of an action is fully determined by the values of the possible states after this action is taken (in the game of chess the state transition is deterministic, but in general it is probabilistic as well, which is why we talk about all possible states here).
Once again, the action-value function is derived from the future rewards, not just the current reward. Some actions can be much better or much worse than others even though the immediate reward is the same.
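For reference, the Bellman optimality equations alluded to above take the standard form (again, not quoted from the original post):

$$v_{*}(s) = \max_{a} q_{*}(s, a), \qquad q_{*}(s, a) = \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma\, v_{*}(s')\bigr]$$

which expresses exactly the point above: the value of an action is determined by the values of the states it can lead to.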
Speaking of the stock trading example, the main difficulty is to define a policy for the agent. Let's imagine the simplest case. In our environment, a state is just a tuple (current price, position). In this case:
The reward is non-zero only when an agent actually holds a position; when it's out of the market, there is no reward, i.e. it's zero. This part is more or less easy.
But the value and action-value functions are very non-trivial (remember they account only for the future rewards, not the past). Say the price of AAPL is at $100: is that good or bad considering future rewards? Should you rather buy or sell it? The answer depends on the policy...
For example, an agent might somehow learn that every time the price suddenly drops to $40, it will recover soon (sounds too silly, it's just an illustration). Now if an agent acts according to this policy, the price around $40 is a good state and its value is high. Likewise, the action-value Q around $40 is high for "buy" and low for "sell". Choose a different policy and you'll get different value and action-value functions. Researchers try to analyze the stock history and come up with sensible policies, but no one knows an optimal policy. In fact, no one even knows the state probabilities, only their estimates. This is what makes the task truly difficult.
I was trying to understand the proof of why the policy improvement theorem can be applied to an epsilon-greedy policy.
The proof starts with the mathematical definition -
I am confused on the very first line of the proof.
This equation is the Bellman expectation equation for Q(s,a), while V(s) and Q(s,a) follow the relation -
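(The equation images are missing from this post; the relation referred to is presumably the standard one, $v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)$.)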
So how can we ever derive the first line of the proof?
The optimal control problem was first introduced in the 1950s. The problem was to design a controller to maximize or minimize an objective function. Richard Bellman approached this optimal control problem by introducing the Bellman Equation:
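(The original equation was an image; presumably the standard discounted-return definition:)

$$V(s_0) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0\right]$$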
Where the value is equivalent to the discounted sum of the rewards. If we take the first time step out, we get the following:
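(Again an image in the original; presumably the recursive form:)

$$V(s_0) = \mathbb{E}\!\left[r_0 + \gamma\, V(s_1) \,\middle|\, s_0\right]$$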
Subsequently, classic reinforcement learning is based on the Markov Decision Process, and assumes all state transitions are known. Thus the equation becomes the following:
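(Presumably the transition-probability form:)

$$V(s) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]$$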
That is, the summation is equivalent to the summation over all possible transitions from that state, multiplied by the reward for reaching the new state.
The above equations are written in the value form. Sometimes, we want the value to also be a function of the action, thus creating the action-value. The conversion of the above equation to the action value form is:
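(Presumably something like:)

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V(s')\bigr]$$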
The biggest issue with this equation is that in real life, the transitional probabilities are in fact not known. It is impossible to know the transitional probabilities of every single state unless the problem is extremely simple. To solve this problem, we usually just take the max of the future discounted portion. That is, we assume we behave optimally in the future, rather than taking the average of all possible scenarios.
However, the environment can be heavily stochastic in a real scenario. Therefore, the best estimate of the action-value function in any state is simply an estimate, and the proper probabilistic treatment is the expected value. This gives you:
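(Presumably the familiar Q-learning target:)

$$Q(s_t, a_t) = \mathbb{E}\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a')\right]$$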
The reward is indexed t+1 in your equation. This is mainly a difference in convention; the derivation above still holds for your notation. It is simply saying that you won't know your reward until you get to your next sampling time.
In reinforcement learning, I'm trying to understand the difference between policy iteration, and value iteration. There are some general answers to this out there, but I have two specific queries which I cannot find an answer to.
1) I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean? I thought that both methods just take each state, then look at all the other states it can reach and compute the value from this -- either by marginalising over the policy's action distribution (policy iteration) or by taking the max over the action values (value iteration). So why is there any notion of the "direction" in which each method "moves"?
2) Policy iteration requires an iterative process during policy evaluation to find the value function. However, value iteration just requires one step. Why is this different? Why does value iteration converge in just one step?
Thank you!
The answer provided by @Nick Walker is right and quite complete; however, I would like to add a graphical explanation of the difference between value iteration and policy iteration, which may help to answer the second part of your question.
Both methods, PI and VI, follow the same working principle based on generalized policy iteration. This basically means that they alternate between improving the policy (which requires knowing its value function) and computing the value function of the new, improved policy.
At the end of this iterative process, both the value and the policy converge to the optimal.
However, it was noticed that it is not necessary to compute the full value function exactly; instead, one sweep is enough to allow convergence. In the following figure, (b) represents the operations performed by policy iteration, where the full value function is computed, while (d) shows how value iteration works.
Obviously this representation of both methods is simplistic, but it highlights the difference between the key idea behind each algorithm.
I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean?
I can't find anything online that describes policy iteration and value iteration in terms of direction, and to my knowledge this is not a common way to explain the difference between them.
One possibility is that someone was referring to the visual impression of values propagating in value iteration. After the first sweep, values are correct on a 1-timestep horizon. Each value correctly tells you what to do to maximize your cumulative reward if you have 1 timestep to live. This means that states that transition to the terminal state and receive a reward have positive values, while most everything else is 0. Each sweep, the values become correct for a one-timestep-longer horizon. So the values creep backwards from the terminal state towards the start state as the horizon expands. In policy iteration, instead of just propagating values back one step, you calculate the complete value function for the current policy. Then you improve the policy and repeat. I can't say that this has a forward connotation to it, but it certainly lacks the backwards appearance. You may want to see Pablo's answer to a similar question for another explanation of the differences that may help you contextualize what you have heard.
It's also possible that you heard about this forwards-backwards contrast in regard to something related but different: implementations of temporal difference learning algorithms. In this case, the direction refers to the direction in which you look when making an update to state-action values; forwards means you need information about the results of future actions, while backwards means you only need information about things that happened previously. You can read about this in chapter 12 of Reinforcement Learning: An Introduction, 2nd edition.
Why does policy iteration have to do a bunch of value function calculations when value iteration seemingly just does one that ends up being optimal? Why does value iteration converge in just one step?
In policy evaluation, we already have a policy and we're just calculating the value of taking actions as it dictates. It repeatedly looks at each state and moves the state's value towards the values of the states that the policy's action will transition to (until the values stop changing and we consider it converged). This value function is not optimal. It's only useful because we can use it in combination with the policy improvement theorem to improve the policy. The expensive process of extracting a new policy, which requires us to maximize over actions in a state, happens infrequently, and policies seem to converge pretty quickly. So even though the policy evaluation step looks like it would be time consuming, PI is actually pretty fast.
Value iteration is just policy iteration where you do exactly one iteration of policy evaluation and extract a new policy at the same time (maximizing over actions is the implicit policy extraction). Then you repeat this iterate-extract procedure over and over until the values stop changing. The fact that these steps are merged together makes it look more straightforward on paper, but maximizing at each sweep is expensive and means that value iteration is often slower than policy iteration.
I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning-based policy on it.
I am a tourist manager.
I have n hotels, each can contain a different number of persons.
For each person I put in a hotel I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go in no hotel, but that gives me a different reward.
(OK, that's a joke... but it's to say that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
My state is a vector containing the number of persons in each hotel.
My action is a vector of zeroes and ones which tells me where I put the new person.
My reward matrix is formed by the rewards I get for each transition between states (even the self-transition).
Now, since I can get an unlimited number of people (i.e. I can fill the hotels but keep going by killing them), how can I build the Q matrix? Without the Q matrix I can't get a policy and so I can't evaluate it...
What am I seeing wrongly? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think it merits an answer.
One of the issues is that there is not necessarily a notion of an episode, and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma, less than one, that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
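For reference, a minimal sketch of the tabular, off-policy Q-learning update described above, treating the problem as a continuing task; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward)`) and the hyperparameter values are assumptions for illustration only:

```python
import numpy as np

def q_learning(env, n_states, n_actions, steps=100_000,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning for a continuing task (no terminal state)."""
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy behavior policy supplies the sample transitions.
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # Off-policy update toward the greedy (max) target estimates the optimal Q.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```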
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you could get a good start and learn something about your problem by limiting its scope (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state matrix.
Or, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe it isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward over time. Joint learning of Q and rho results in a better strategy.
To iterate on the above response: with an infinite state space, you should definitely consider some sort of generalization for your Q-function. You will get more value out of your Q-function in an infinite space. You could experiment with several different function approximations, whether simple linear regression or a neural network.
Like Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to compare the fitness of N policies whose returns all equal infinity, which means you will not be able to identify the optimal policy.
The main thing I wanted to add here, though, for anyone reading this later, is the significance of reward shaping. In an infinite problem, where there is no final large reward, sub-optimal reward loops can occur, where the agent gets "stuck" because a certain state has a higher reward than any of its neighbors within the finite horizon (defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times, to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.