Can state in Proximal Policy Optimization contain history?

Can state in Proximal Policy Optimization contain history? - machine-learning

For example can the State at timestep t actually be made of the state at t and t-1.
S_t = [s_t, s_t-1]
i.e. Does Proximal Policy Optimization already incorporate the state history, or can it be implicit in the State (or neither).

You could concatenate your observations. This is very common to do it RL. Usually in atari domain the last four frames are joined into a single observation. This makes it possible for the agent to understand change in the environment.
a basic PPO algorithm does not by default implicitly keep track of state history. You could make this possible though by adding a recurrent layer.

Related

What exactly is the difference between Q, V (value function) , and reward in Reinforcement Learning?

In the context of Double Q or Deuling Q Networks, I am not sure if I fully understand the difference. Especially with V. What exactly is V(s)? How can a state have an inherent value?
If we are considering this in the context of trading stocks lets say, then how would we define these three variables?

No matter what network can talk about, the reward is an inherent part of the environment. This is the signal (in fact, the only signal) that an agent receives throughout its life after making actions. For example: an agent that plays chess gets only one reward at the end of the game, either +1 or -1, all other times the reward is zero.
Here you can see a problem in this example: the reward is very sparse and is given just once, but the states in a game are obviously very different. If an agent is in a state when it has the queen while the opponent has just lost it, the chances of winning are very high (simplifying a little bit, but you get an idea). This is a good state and an agent should strive to get there. If on the other hand, an agent lost all the pieces, it is a bad state, it will likely lose the game.
We would like to quantify what actually good and bad states are, and here comes the value function V(s). Given any state, it returns a number, big or small. Usually, the formal definition is the expectation of the discounted future rewards, given a particular policy to act (for the discussion of a policy see this question). This makes perfect sense: a good state is such one, in which the future +1 reward is very probable; the bad state is quite the opposite -- when the future -1 is very probable.
Important note: the value function depends on the rewards and not just for one state, for many of them. Remember that in our example the reward for almost all states is 0. Value function takes into account all future states along with their probabilities.
Another note: strictly speaking the state itself doesn't have a value. But we have assigned one to it, according to our goal in the environment, which is to maximize the total reward. There can be multiple policies and each will induce a different value function. But there is (usually) one optimal policy and the corresponding optimal value function. This is what we'd like to find!
Finally, the Q-function Q(s, a) or the action-value function is the assessment of a particular action in a particular state for a given policy. When we talk about an optimal policy, action-value function is tightly related to the value function via Bellman optimality equations. This makes sense: the value of an action is fully determined by the value of the possible states after this action is taken (in the game of chess the state transition is deterministic, but in general it's probabilistic as well, that's why we talk about all possible states here).
Once again, action-value function is a derivative of the future rewards. It's not just a current reward. Some actions can be much better or much worse than others even though the immediate reward is the same.
Speaking of the stock trading example, the main difficulty is to define a policy for the agent. Let's imagine the simplest case. In our environment, a state is just a tuple (current price, position). In this case:
The reward is non-zero only when an agent actually holds a position; when it's out of the market, there is no reward, i.e. it's zero. This part is more or less easy.
But the value and action-value functions are very non-trivial (remember it accounts only for the future rewards, not the past). Say, the price of AAPL is at $100, is it good or bad considering future rewards? Should you rather buy or sell it? The answer depends on the policy...
For example, an agent might somehow learn that every time the price suddenly drops to $40, it will recover soon (sounds too silly, it's just an illustration). Now if an agent acts according to this policy, the price around $40 is a good state and it's value is high. Likewise, the action-value Q around $40 is high for "buy" and low for "sell". Choose a different policy and you'll get a different value and action-value functions. The researchers try to analyze the stock history and come up with sensible policies, but no one knows an optimal policy. In fact, no one even knows the state probabilities, only their estimates. This is what makes the task truly difficult.

Policy Iteration vs Value Iteration

In reinforcement learning, I'm trying to understand the difference between policy iteration, and value iteration. There are some general answers to this out there, but I have two specific queries which I cannot find an answer to.
1) I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean? I thought that both methods just take each state, then look at all the other states it can reach, and compute the value from this -- either by marginalising over the policy's action distribution (policy iteration) or by taking that argmax with regards to the action values (value iteration). So why is there any notion of the "direction" in which each method "moves"?
2) Policy iteration requires an iterative process during policy evaluation, to find the value function -- by However, value iteration just requires one step. Why is this different? Why does value iteration converge in just one step?
Thank you!

The answer provided by #Nick Walker is right and quite complete, however I would like to add a graphical explanation of the difference between Value iteration and Policy iteration, which maybe help to answer the second part of your question.
Both methods, PI and VI, follow the same working principle based in Generalized Policy Iteration. This basically means that they alternate between improving the policy (which requires knowing its value function), and computing the value function of the new, improved, policy.
At the end of this iterative process, both, the value and the policy, converge to the optimal.
However, it was notice that is not necessary to compute exactly the full value function, instead, one step is necessary to allow the convergence. In the following figure, (b) represents the operations performed by Policy Iteration, where the full value function is computed. While (d) shows how Value iteration works.
Obviously this representation of both methods are simplistic, but it highlights the difference between the key idea behind each algorithm.

I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean?
I can't find anything online that describes policy iteration and value iteration in terms of direction, and to my knowledge this is not a common way to explain the difference between them.
One possibility is that someone was referring to visual impression of values propagating in value iteration. After the first sweep, values are correct on a 1 timestep horizon. Each value correctly tells you what to do to maximize your cumulative reward if you have 1 tilmestep to live. This means that that states that transition to the terminal state and receive a reward have positive values while most everything else is 0. Each sweep, the values become correct for one timestep longer of a horizon. So the values creep backwards from the terminal state towards the start state as the horizon expands. In policy iteration, instead of just propagating values back one step, you calculate the complete value function for the current policy. Then you improve the policy and repeat. I can't say that this has a forward connotation to it, but it certainly lacks the backwards appearance. You may want to see Pablo's answer to a similar question for another explanation of the differences that may help you contextualize what you have heard.
It's also possible that you heard about this forwards-backwards contrast in regards to something related, but different; implementations of temporal difference learning algorithms. In this case, the direction refers to the direction in which you look when making an update to state-action values; forwards means you need to have information about the results of future actions, while backwards means you only need information about things that happened previously. You can read about this in chapter 12 of Reinforcement Learning: An Introduction 2nd edition.
Why does policy iteration have to do a bunch of value function calculations when value iteration seemingly just does one that ends up being optimal? Why does value iteration converge in just one step?
In policy evaluation, we already have a policy and we're just calculating the value of taking actions as it dictates. It repeatedly looks at each state and moves the state's value towards the values of the states that the policy's action will transition to (until the values stop changing and we consider it converged). This value function is not optimal. It's only useful because we can use it in combination with the policy improvement theorem to improve the policy. The expensive process of extracting a new policy, which requires us to maximize over actions in a state, happens infrequently, and policies seem to converge pretty quickly. So even though the policy evaluation step looks like it would be time consuming, PI is actually pretty fast.
Value iteration is just policy iteration where you do exactly one iteration of policy evaluation and extract a new policy at the same time (maximizing over actions is the implicit policy extraction). Then you repeat this iterate-extract procedure over and over until the values stop changing. The fact that these steps are merged together makes it look more straightforward on paper, but maximizing at each sweep is expensive and means that value iteration is often slower than policy iteration.

Incorporating Transition Probabilities in SARSA

I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations (the sheer amount of time and space DP models require) of DP models, which hopefully will reduce the computation time (takes quite a few hours atm for similar research) and less space will allow adding more complexion to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.

The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the first assumes that environment's dynamics is known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be notice that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original
form (i.e., tabular form), DP and RL algorithms cannot be implemented
for general problems. They can only be implemented when the state and
action spaces consist of a finite number of discrete elements, because
(among other reasons) they require the exact representation of value
functions or policies, which is generally impossible for state spaces
with an infinite number of elements (or too costly when the number of
states is very high).
Even when the states and actions take finitely many values, the cost
of representing value functions and policies grows exponentially with
the number of state variables (and action variables, for Q-functions).
This problem is called the curse of dimensionality, and makes the
classical DP and RL algorithms impractical when there are many state
and action variables. To cope with these problems, versions of the
classical algorithms that approximately represent value functions
and/or policies must be used. Since most problems of practical
interest have large or continuous state and action spaces,
approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on DP or RL. In the case of RL, transitions are simply used to compute the next state given an action.
Whether is better to use DP or RL? Actually I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems more safe, but maybe a big part of your state space is irrelevant to find an optimal pocliy. In such a case, sampling a set of trajectories (RL) maybe is more effective computationally. In any case, if both methods are rightly applied, should achive a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when the approximator is non linear (such as an artificial neural network) combined with RL.

If you have access to the transition probabilities, I would suggest not to use methods based on a Q-value. This will require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.

Hidden markov model next state only depends on previous one state? What about previous n states?

I am working on a prototype framework.
Basically I need to generate a model or profile for each individual's lifestyle based on some sensor data about him/her, such as GPS, motions, heart rate, surrounding environment readings, temperature etc.
The proposed model or profile is a knowledge representation of an individual's lifestyle pattern. Maybe a graph with probabilities.
I am thinking to use Hidden Markov Model to implement this. As the states in HMM can be Working, Sleeping, Leisure, Sport and etc. Observations can be a set of various sensor data.
My understanding of HMM is that next state S(t) is only depends on previous one state S(t-1). However in reality, a person's activity might depends on previous n states. Is it still a good idea to use HMM? Or should I use some other more appropriate models? I have seen some work on second order and multiple order of Markov Chains, does it also apply HMM?
I really appreciate if you can give me a detailed explanation.
Thanks!!

What you are talking about is a First Order HMM in which your model would only have knowledge of the previous history State. In case of an Order-n Markov Model, the next state would be dependent on the previous 'n' States and may be this is what you are looking for right?
You are right that as far as simple HMMs are considered, the next state is dependent only upon the current state. However, it is also possible to achieve a mth Order HMM by defining the transition probabilities as shown in this link. However, as the order increases, so does the overall complexity of your matrices and hence your model, so it's really upto you if your up for the challenge and willing to put the requisite effort.

Reinforcement Learning With Variable Actions

All the reinforcement learning algorithms I've read about are usually applied to a single agent that has a fixed number of actions. Are there any reinforcement learning algorithms for making a decision while taking into account a variable number of actions? For example, how would you apply a RL algorithm in a computer game where a player controls N soldiers, and each soldier has a random number of actions based its condition? You can't formulate fixed number of actions for a global decision maker (i.e. "the general") because the available actions are continually changing as soldiers are created and killed. And you can't formulate a fixed number of actions at the soldier level, since the soldier's actions are conditional based on its immediate environment. If a soldier sees no opponents, then it might only be able to walk, whereas if it sees 10 opponents, then it has 10 new possible actions, attacking 1 of the 10 opponents.

What you describe is nothing unusual. Reinforcement learning is a way of finding the value function of a Markov Decision Process. In an MDP, every state has its own set of actions. To proceed with reinforcement learning application, you have to clearly define what the states, actions, and rewards are in your problem.

If you have a number of actions for each soldier that are available or not depending on some conditions, then you can still model this as selection from a fixed set of actions. For example:
Create a "utility value" for each of the full set of actions for each soldier
Choose the highest valued action, ignoring those actions that are not available at a given time
If you have multiple possible targets, then the same principle applies, except this time you model your utility function to take the target designation as an additional parameter, and run the evaluation function multiple times (one for each target). You pick the target that has the highest "attack utility".

In continuous domain action spaces, the policy NN often outputs the mean and/or the variance, from which you, then, sample the action, assuming it follows a certain distribution.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart