Different rewards for same state in reinforcement learning - machine-learning

I want to implement Q-Learning for the Chrome dinosaur game (the one you can play when you are offline).
I defined my state as: distance to next obstacle, speed and the size of the next obstacle.
For the reward I wanted to use the number of successfully passed obstacles, but it could happen that the same state has different immediate rewards. The same type of obstacle could reappear later in the game, but the reward for passing it would be higher because more obstacles have already been passed.
My question now is: Is this a problem, or would Q-Learning still work? If not, is there a better way?

The definition of an MDP says that the reward r(s,a,s') is the expected reward for taking action a in state s and reaching state s'. This means that a given (s,a,s') can have a constant reward, or some distribution of rewards, as long as it has a well-defined expectation. As you've defined it, the reward is proportional to the number of obstacles passed. Because the game can continue forever, the reward for some (s,a,s') begins to look like the sum of the natural numbers. This series diverges, so the reward does not have an expectation. In practice, if you ran Q-learning you would probably see the value function diverge (NaN values), but the policy in the middle of learning might be okay because the values that grow the fastest will belong to the best state-action pairs.
To avoid this, you should choose a different reward function. You could reward the agent with whatever its score is when it dies (big reward at the end, zero otherwise). You would also be fine giving a living reward (small reward each time step) as long as the agent has no choice but to move forward. As long as the highest total rewards are assigned to the longest runs (and the expectation of the reward for a (s,a,s') tuple is well defined) it's good.
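As a rough illustration of the difference, the sketch below compares the reward scheme from the question with a constant living reward. The function names and the per-step value of 0.1 are hypothetical and chosen only for the example.

```python
def cumulative_reward(passed_before, cleared_obstacle):
    """Scheme from the question: reward is the running count of cleared obstacles."""
    return float(passed_before + 1) if cleared_obstacle else 0.0


def living_reward(alive, per_step=0.1):
    """Alternative: a small constant reward for every step survived (value is an assumption)."""
    return per_step if alive else 0.0


# The same transition (same obstacle, same distance, same speed) seen five times
# during one run: the cumulative scheme keeps changing its reward ...
rewards_same_transition = [cumulative_reward(n, True) for n in range(5)]

# ... while the living reward is identical every time.
living = [living_reward(True) for _ in range(5)]

print(rewards_same_transition)
print(living)
```

With the cumulative scheme the reward depends on a counter that is not part of the state, whereas the living reward depends only on what happens in the current step.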

Related

How machine know which step can get max reward?

In my understanding, reinforcement learning will get a reward from the action.
However, when playing a video game, there is no reward (reward == 0) in most of the steps (e.g. Street Fighter). Eventually we get a reward (e.g. the player wins, reward = 1), but there are so many actions. How does the machine know which one was the key to winning the game?
In Reinforcement Learning the reward can be immediate or delayed [1]:
The immediate reward could be:
very high positive if the agent wins the game (it is the last action that defeats the opponent);
very low negative if the agent loses the game;
positive if the action damages your opponent;
negative if the agent loses health points.
The delayed reward is a future reward that becomes possible through a current action. For example, moving one step to the left could mean that in the next step the agent avoids being hit and can hit the opponent.
Reinforcement learning algorithms, such as Q-learning, choose the action with the highest expected total reward (the Q-value). This estimate is continuously updated with the current reward (r at time t) and with possible future rewards (the max Q term of the update, based on actions from time t+1 and later).
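The update described here is presumably the standard tabular Q-learning rule. The symbols for the learning rate \alpha and the discount factor \gamma are assumptions, since they are not named in the text:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

The term r_t is the immediate reward, and \gamma \max_a Q(s_{t+1}, a) carries the delayed reward back to the earlier action.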
More detailed information about (Deep) Reinforcement Learning, with some examples of applications to games, is given in A Beginner's Guide to Deep Reinforcement Learning.

What exactly is the difference between Q, V (value function) , and reward in Reinforcement Learning?

In the context of Double Q or Dueling Q Networks, I am not sure if I fully understand the difference. Especially with V. What exactly is V(s)? How can a state have an inherent value?
If we are considering this in the context of trading stocks, let's say, then how would we define these three variables?
No matter what network we talk about, the reward is an inherent part of the environment. This is the signal (in fact, the only signal) that an agent receives throughout its life after taking actions. For example: an agent that plays chess gets only one reward at the end of the game, either +1 or -1; at all other times the reward is zero.
Here you can see a problem in this example: the reward is very sparse and is given just once, but the states in a game are obviously very different. If an agent is in a state when it has the queen while the opponent has just lost it, the chances of winning are very high (simplifying a little bit, but you get an idea). This is a good state and an agent should strive to get there. If on the other hand, an agent lost all the pieces, it is a bad state, it will likely lose the game.
We would like to quantify how good or bad a state actually is, and this is where the value function V(s) comes in. Given any state, it returns a number, big or small. The usual formal definition is the expectation of the discounted future rewards, given a particular policy to act (for the discussion of a policy see this question). This makes perfect sense: a good state is one in which the future +1 reward is very probable; a bad state is quite the opposite, one in which the future -1 is very probable.
Important note: the value function depends on the rewards, and not just the reward of one state but of many of them. Remember that in our example the reward for almost all states is 0. The value function takes into account all future states along with their probabilities.
Another note: strictly speaking the state itself doesn't have a value. But we have assigned one to it, according to our goal in the environment, which is to maximize the total reward. There can be multiple policies and each will induce a different value function. But there is (usually) one optimal policy and the corresponding optimal value function. This is what we'd like to find!
Finally, the Q-function Q(s, a), or action-value function, is the assessment of a particular action in a particular state for a given policy. For an optimal policy, the action-value function is tightly related to the value function via the Bellman optimality equations. This makes sense: the value of an action is fully determined by the immediate reward and the value of the possible states after this action is taken (in the game of chess the state transition is deterministic, but in general it's probabilistic as well, which is why we talk about all possible states here).
Once again, the action-value function is derived from the future rewards. It's not just the current reward. Some actions can be much better or much worse than others even though the immediate reward is the same.
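To make the relation between reward, V and Q concrete, here is a minimal sketch on a hypothetical 4-state chain with a single +1 reward at the end, loosely mirroring the sparse chess reward. The environment, the discount factor of 0.9 and all names are assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical chain: states 0, 1, 2 are ordinary, state 3 is terminal (a "win").
# Actions: 0 = step back (or stay at state 0), 1 = step forward.
N_STATES, TERMINAL = 4, 3


def step(state, action):
    """Deterministic transition; the reward is +1 only on entering the terminal state."""
    nxt = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward


def q_from_v(v, state, action):
    """Q(s, a) = immediate reward + discounted value of the next state."""
    nxt, reward = step(state, action)
    return reward + GAMMA * v[nxt]


def value_iteration(sweeps=100):
    """Compute the optimal V by repeatedly applying V(s) = max_a Q(s, a)."""
    v = [0.0] * N_STATES  # v[TERMINAL] stays 0: no reward after the game ends
    for _ in range(sweeps):
        for s in range(TERMINAL):
            v[s] = max(q_from_v(v, s, a) for a in (0, 1))
    return v


v_star = value_iteration()
q_star = [[q_from_v(v_star, s, a) for a in (0, 1)] for s in range(TERMINAL)]

print("V*:", v_star)
print("Q*:", q_star)
```

In state 0 both actions have an immediate reward of zero, yet stepping forward has the higher Q-value because it leads to states from which the +1 is closer.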
Speaking of the stock trading example, the main difficulty is to define a policy for the agent. Let's imagine the simplest case. In our environment, a state is just a tuple (current price, position). In this case:
The reward is non-zero only when an agent actually holds a position; when it's out of the market, there is no reward, i.e. it's zero. This part is more or less easy.
But the value and action-value functions are very non-trivial (remember it accounts only for the future rewards, not the past). Say, the price of AAPL is at $100, is it good or bad considering future rewards? Should you rather buy or sell it? The answer depends on the policy...
For example, an agent might somehow learn that every time the price suddenly drops to $40, it will recover soon (sounds too silly, it's just an illustration). Now if an agent acts according to this policy, the price around $40 is a good state and its value is high. Likewise, the action-value Q around $40 is high for "buy" and low for "sell". Choose a different policy and you'll get different value and action-value functions. Researchers try to analyze the stock history and come up with sensible policies, but no one knows an optimal policy. In fact, no one even knows the state probabilities, only their estimates. This is what makes the task truly difficult.

deep reinforcement learning parameters and training time for a simple game

I want to learn how a deep reinforcement learning algorithm works and how much time it takes to train itself for a given environment.
I came up with a very simple example of environment:
There is a counter which holds an integer between 0 to 100.
counting to 100 is its goal.
there is one parameter direction whose value can be +1 or -1.
it simply shows the direction to move.
our neural network takes this direction as input and has 2 possible actions as output.
Change the direction
Do not change the direction
1st action will simply flip the direction (+1 => -1 or -1 =>+1). 2nd action will keep the direction as it is.
I am using python for backend and javascript for frontend.
It seems to take too much time, and it is still pretty random. I have used a 4-layer perceptron, a learning rate of 0.001, and memory learning with a batch of 100. The code is from a Udemy tutorial on Artificial Intelligence and is working properly.
My question is: what should the reward be for completion and for each state? And how much time is required to train a simple example like that?
In Reinforcement Learning the underlying reward function is what defines the game. Different reward functions lead to different games with different optimal strategies.
In your case there are a few different possibilities:
Give +1 for reaching 100 and only then.
Give +1 for reaching 100 and -0.001 for every time step it is not at 100.
Give +1 for going up and -1 for going down.
The third case is way too easy: there is no long-term planning involved. In the first two cases the agent will only start learning once it accidentally reaches 100 and sees that it is good. But in the first case, once it learns to go up, it doesn't matter how long it takes to get there. The second is the most interesting, since the agent needs to get there as fast as possible.
There is no right answer for what reward to use, but ultimately the reward you choose defines the game you are playing.
Note: 4 layer perceptron for this problem is Big Time Overkill. One layer should be enough (this problem is very simple). Have you tried the reinforcement learning environments at OpenAI's gym? Highly recommend it, they have all the "classical" reinforcement learning problems.
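To compare the three reward choices listed above, here is a minimal sketch. The function names and the two example trajectories are hypothetical, and the returns are undiscounted, which is an assumption not stated in the answer.

```python
def goal_only(prev, cur):
    """Choice 1: +1 for reaching 100 and only then."""
    return 1.0 if cur == 100 else 0.0


def goal_with_step_cost(prev, cur):
    """Choice 2: +1 for reaching 100, -0.001 for every step that does not end at 100."""
    return 1.0 if cur == 100 else -0.001


def up_down(prev, cur):
    """Choice 3: +1 for going up, -1 for going down."""
    return 1.0 if cur > prev else -1.0


def episode_return(counter_values, reward_fn):
    """Undiscounted sum of rewards along a trajectory of successive counter values."""
    return sum(reward_fn(p, c) for p, c in zip(counter_values, counter_values[1:]))


# Two hypothetical trajectories from 0 to 100.
direct = list(range(0, 101))  # 100 steps straight up
detour = list(range(0, 51)) + list(range(49, -1, -1)) + list(range(1, 101))  # 200 steps

for name, fn in [("goal_only", goal_only), ("goal_with_step_cost", goal_with_step_cost)]:
    print(name, episode_return(direct, fn), episode_return(detour, fn))
```

Under these assumptions `goal_only` gives both trajectories the same return, while `goal_with_step_cost` gives the direct one a higher return (roughly 0.901 against 0.801). `up_down` rewards every single step immediately, so the agent never needs to look ahead.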

Q Learning Grid World Scenario

I'm researching GridWorld from Q-learning Perspective. I have issues regarding the following question:
1) In the grid-world example, rewards are positive for goals, negative
for running into the edge of the world, and zero the rest of the time.
Are the signs of these rewards important, or only the intervals
between them?
Keep in mind that Q-values are expected values. The policy will be extracted by choosing the action that maximises the Q function for each given state.
a_best(s) = argmax_a Q(s,a)
Notice that you can add a constant value to all Q-values without affecting the policy. Shifting all the Q-values by the same constant leaves their ordering, and therefore the maximising action, unchanged.
In fact, you can apply any affine transformation with a positive scale (Q' = a*Q + b, a > 0) and your decisions will not change.
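A quick way to check this is to compare the greedy actions before and after transforming a small Q-table. The table and the constants below are made up for illustration. Note that the scale has to be positive: a negative scale reverses the ordering.

```python
def greedy_action(q_values):
    """Index of the largest Q-value (the first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-table with two states and three actions each.
q_table = {"s0": [0.2, -1.0, 0.7], "s1": [-3.0, -0.5, -2.0]}

scale, shift = 4.0, -10.0  # assumed constants; scale must be > 0
transformed = {s: [scale * q + shift for q in qs] for s, qs in q_table.items()}
negated = {s: [-q for q in qs] for s, qs in q_table.items()}

policy = {s: greedy_action(qs) for s, qs in q_table.items()}
policy_transformed = {s: greedy_action(qs) for s, qs in transformed.items()}
policy_negated = {s: greedy_action(qs) for s, qs in negated.items()}

print(policy, policy_transformed, policy_negated)
```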
Only the relative values matter. Say you have the following reward function...
Now say we add a constant C to all rewards...
We can prove that adding a constant C will add another constant K to the value of all states and thus does not affect the relative values of any state...
Where...
The values remain consistent throughout, so only the intervals between rewards matter, not their signs.
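One standard way to write out this argument, assuming a continuing task with discount factor 0 \le \gamma < 1 (the notation follows Sutton and Barto and is an assumption here), is:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
G'_t = \sum_{k=0}^{\infty} \gamma^{k} \left( R_{t+k+1} + C \right)
     = G_t + \frac{C}{1-\gamma}
```

Taking expectations under any policy \pi gives v'_\pi(s) = v_\pi(s) + K with K = C / (1 - \gamma), the same constant for every state.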
It's important to note, however, that this rule does not apply to all episodic tasks. Generally, the rule only applies if the length of the episodes is fixed. For tasks where the length of each episode is determined by actions (think board games), adding a positive constant may result in a longer learning interval, because every extra step now earns extra reward and the agent is encouraged to prolong the episode.

is Q-learning without a final state even possible?

I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning based policy on it.
I am a tourist manager.
I have n hotels, each can contain a different number of persons.
for each person I put in a hotel I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go to no hotel, but it gives me a different reward.
(OK, that's a joke... but it's to say that I can have a self transition, so the number of people in my rooms doesn't change after that action.)
my state is a vector containing the number of persons in each hotel.
my action is a vector of zeroes and ones which tells me where do I
put the new person.
my reward matrix is formed by the rewards I get for each transition
between states (even the self transition one).
Now, since I can get an unlimited number of people (i.e. I can fill the hotels but I can go on killing them), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I seeing wrongly? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
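The discounted return can be sketched in a few lines of Python. The function name and the example reward streams are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


short_stream = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
long_stream = discounted_return([1.0] * 1000, gamma=0.9)      # approaches 1 / (1 - 0.9)

print(short_stream, long_stream)
```

With gamma below one, a constant reward of 1 per step has a return that approaches 1 / (1 - gamma) instead of growing without bound, which is what keeps the return well defined in a continuing problem.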
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
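As a minimal sketch of these ideas, the code below runs tabular Q-learning with an epsilon-greedy behavior policy on a tiny continuing task that has no terminal state. The two-state environment and all hyperparameters are assumptions chosen for illustration. It is not the hotel problem from the question.

```python
import random

N_STATES, N_ACTIONS = 2, 2            # actions: 0 = stay, 1 = switch state
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2  # assumed hyperparameters


def step(state, action):
    """Hypothetical continuing environment: reward 1 whenever the next state is 1."""
    nxt = state if action == 0 else 1 - state
    reward = 1.0 if nxt == 1 else 0.0
    return nxt, reward


def q_learning(n_steps=5000, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):  # one long stream of experience, never reset
        # Behavior policy: epsilon-greedy with respect to the current Q.
        if rng.random() < EPSILON:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        nxt, reward = step(state, action)
        # Off-policy target: bootstrap from the best action in the next state.
        target = reward + GAMMA * max(q[nxt])
        q[state][action] += ALPHA * (target - q[state][action])
        state = nxt
    return q


q = q_learning()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]
print(q, greedy_policy)
```

The learned greedy policy is to switch in state 0 and stay in state 1, even though the loop never encounters a final state.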
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of XP with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you would get a good start and learn something about your problem by limiting the scope (finite number of people and hotels/rooms) of the problem and turning Q-learning loose on the smaller state matrix.
OR, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe it isn't an answer to "is it possible?", but... Read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected (average) reward over time. Joint learning of Q and rho results in a better strategy.
To iterate on the above response, with an infinite state space, you definitely should consider generalization of some sort for your Q Function. You will get more value out of your Q function response in an infinite space. You could experiment with several different function approximations, whether that is simple linear regression or a neural network.
Like Martha said, you will need to have a gamma less than one to account for the infinite horizon. Otherwise, you would be comparing policies whose returns are all infinite, which means you will not be able to tell which policy is optimal.
The main thing I wanted to add here though for anyone reading this later is the significance of reward shaping. In an infinite problem, where there isn't that final large reward, sub-optimal reward loops can occur, where the agent gets "stuck", since maybe a certain state has a reward higher than any of its neighbors in a finite horizon (which was defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
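One simple way to sketch such a penalty is a count-based wrapper around the environment's reward. The class name and the penalty size are hypothetical. Keep in mind that the shaped reward then depends on the visit history, so it is no longer a function of the current state alone.

```python
from collections import Counter


class RevisitPenalty:
    """Subtracts a penalty that grows with the number of earlier visits to a state."""

    def __init__(self, penalty=0.1):
        self.penalty = penalty
        self.visits = Counter()

    def shape(self, state, reward):
        """Return the shaped reward; `state` must be hashable (e.g. a tuple)."""
        shaped = reward - self.penalty * self.visits[state]
        self.visits[state] += 1
        return shaped


shaper = RevisitPenalty(penalty=0.5)
first = shaper.shape((1, 0), 1.0)   # first visit: no penalty
second = shaper.shape((1, 0), 1.0)  # second visit: one earlier visit is penalized
other = shaper.shape((0, 1), 1.0)   # a different state is unaffected
print(first, second, other)
```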
