I want to know how to add a constraint to Q-learning. Every action yields two reward signals (reward 1 = delivery cost, reward 2 = delivery time). I want to minimize the cost while ensuring that a maximum delivery time limit is not violated.
Is there a standard/formalized way to do this?
The easiest solution would be to create a single reward function that takes both of those signals into account.
To minimize delivery costs, you'd want to start out defining your reward function like:
R(.) = -delivery_cost
The negation is there because Reinforcement Learning is typically about rewards which should be maximized, instead of costs which should be minimized.
A straightforward way to have the agent learn not to violate a delivery time limit would be to subtract a massive constant from the reward if the delivery time limit is violated, and not add or subtract anything if it isn't violated. So, that would look something like:
R(.) =
-delivery_cost - M IF delivery_time > constraint,
-delivery_cost otherwise
The value of M has to be really big. How big "really big" is depends on how large you expect delivery_cost to become: M has to exceed the largest cost, so that no cost saving can ever compensate for violating the limit.
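The piecewise reward above can be sketched in a few lines of Python. This is only an illustration: the function name `reward`, its parameters, and the numbers (costs assumed to stay below 100, so M = 1000) are assumptions, not part of the original problem.

```python
def reward(delivery_cost: float, delivery_time: float,
           time_limit: float, penalty: float) -> float:
    """Negated cost, minus a large constant M if the time limit is violated."""
    r = -delivery_cost
    if delivery_time > time_limit:
        r -= penalty
    return r


# Hypothetical numbers: costs are assumed to stay below 100, so M = 1000 dominates.
M = 1_000.0
on_time = reward(delivery_cost=20.0, delivery_time=30.0, time_limit=45.0, penalty=M)
too_late = reward(delivery_cost=5.0, delivery_time=50.0, time_limit=45.0, penalty=M)
print(on_time, too_late)
```

Even though the late delivery is cheaper, its reward is far lower than that of the on-time one, which is what pushes the agent away from violating the limit.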
Of course it is also possible to create a smoother reward function than this, especially if you want to allow violating the delivery time constraint by a little bit if that means you get a significant reduction in costs.
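One possible smoother variant, sketched under assumptions: penalize in proportion to how far the time limit is exceeded, rather than with a fixed constant. The name `soft_reward` and the `weight` parameter are hypothetical; `weight` sets how many units of cost one unit of lateness is worth.

```python
def soft_reward(delivery_cost: float, delivery_time: float,
                time_limit: float, weight: float) -> float:
    """Negated cost minus a penalty that grows linearly with the overshoot."""
    overshoot = max(0.0, delivery_time - time_limit)
    return -delivery_cost - weight * overshoot


# Five minutes late at weight 2.0 costs as much as 10 extra units of delivery cost.
print(soft_reward(delivery_cost=10.0, delivery_time=50.0, time_limit=45.0, weight=2.0))
print(soft_reward(delivery_cost=10.0, delivery_time=40.0, time_limit=45.0, weight=2.0))
```

With this shape, a small violation can still be worthwhile if it saves enough cost, while large violations remain heavily penalized.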
If you want to look into significantly more complex solutions than what I proposed above, you'll want to look around for literature on Multi-Objective Reinforcement Learning.
The environment is a directed graph consisting of nodes, each with its own "goodness" (marked green), and edges that have prices (marked red). The environment has a price (P) constraint. The goal is to accumulate as many "goodness" points from nodes as possible while making a cycle (for example 0-->6-->5-->0) without exceeding the price constraint.
I managed to implement the Q-learning algorithm when there are no constraints, but I don't fully understand how to add hard constraints while approximating the Q-function.
For instance, the starting point is 0 and the price limit is 13. Taking the path 0-->1-->2-->3-->4-->5-->0 wouldn't be a valid choice for the agent, because at node 5 the price limit (13) was already reached, so the agent should be punished for violating the constraint. However, taking the path 0-->6-->5-->0 would be a correct choice, so the agent should be rewarded. What I do not understand is how to tell the agent that going from 5 to 0 is sometimes a perfect choice and sometimes not allowed because a constraint was violated. I tried giving a huge penalty when the price constraint was violated and ending the episode immediately, but that didn't seem to work out.
My question(s):
How to add hard constraints to RL algorithms like Q-Learning?
Is Q-Learning a valid choice for that kind of problem?
Should another algorithm, such as Monte Carlo Tree Search, be chosen instead of Q-Learning?
I assume it is a very common problem in real world scenarios, but I couldn't find any examples about that.
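One common way to make the choice at node 5 depend on history is to put the price spent so far into the state, so that "node 5 with 7 spent" and "node 5 with 11 spent" are different states with different Q-values, and to offer the agent only the moves it can still afford. A minimal sketch under assumptions: the graph `EDGES`, its prices, and the helpers `feasible_actions` and `step` are made up for illustration and are not taken from the figure in the question.

```python
# Hypothetical graph: EDGES[node] = {next_node: price}. Prices are made up.
EDGES = {
    0: {1: 2, 6: 3},
    1: {2: 3},
    2: {5: 6},
    6: {5: 4},
    5: {0: 5},
}


def feasible_actions(node: int, spent: int, limit: int) -> list[int]:
    """Moves from `node` that keep the total price within `limit`."""
    return [nxt for nxt, price in EDGES.get(node, {}).items()
            if spent + price <= limit]


def step(node: int, spent: int, action: int) -> tuple[int, int]:
    """Augmented state transition: (node, spent) -> (next node, new spent)."""
    return action, spent + EDGES[node][action]


# Reaching node 5 via 0 -> 6 -> 5 leaves enough budget to return to 0 ...
state = step(*step(0, 0, 6), 5)
print(state, feasible_actions(*state, limit=13))
# ... while reaching it via 0 -> 1 -> 2 -> 5 does not.
state = step(*step(*step(0, 0, 1), 2), 5)
print(state, feasible_actions(*state, limit=13))
```

A Q-table indexed by `(node, spent)` and restricted to `feasible_actions` never has to learn the constraint from penalties, at the cost of a larger state space.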
I am working on a problem that we aim to solve with deep Q-learning. However, training takes too long: roughly 83 hours per episode. We expect to solve the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100*10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure seems not suitable for some "distributed" system, if I understood correctly.
Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be very much appreciated. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
each step I will select one element from a candidate pool of 1000 elements. Then insert the element into the matrix one by one.
2. environment:
each step I will have an updated matrix to learn.
An oracle function F returns a value in the range 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as input, performs a very costly computation, and returns a value indicating the quality of the matrix synthesized so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", I mean that's just an empirical estimate. But it shouldn't be fewer than 100 episodes, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed by appending more and more elements in the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning.
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up the most time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
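As a rough illustration of the profiling step, here is a sketch using Python's standard `cProfile` and `pstats` modules. The functions `oracle_f` and `run_episode` are hypothetical stand-ins (the real reward function is not known here), and the `time.sleep` call merely imitates an expensive computation.

```python
import cProfile
import io
import pstats
import time


def oracle_f(matrix):
    """Hypothetical stand-in for the costly reward function F."""
    time.sleep(0.01)  # pretend this is the expensive part
    return sum(sum(row) for row in matrix)


def run_episode(steps=5):
    """Fill a 100 x 10 matrix entry by entry, calling F after each insertion."""
    matrix = [[0] * 10 for _ in range(100)]
    total = 0
    for i in range(steps):
        matrix[i // 10][i % 10] = 1
        total += oracle_f(matrix)
    return total


profiler = cProfile.Profile()
profiler.enable()
run_episode()
profiler.disable()

# Collect the report, sorted by cumulative time, into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The cumulative-time column shows which calls dominate the run time, which is the information that would let others suggest where extra computing power or caching might help.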
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected the equivalent of 900 years of experience per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't much better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, other approaches may easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning) or simple random search (Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network that has millions of parameters usually implies that you'll need a huge quantity of data, or experience.
Regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would do. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be for any number of reasons, and I perfectly understand.
You have estimated the number of episodes required to solve the problem based on some empirical studies using a smaller version with a 20*10 matrix. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the dimensionality of the state space grows, although maybe that is not your case.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.
In the context of Double Q or Dueling Q Networks, I am not sure I fully understand the difference, especially with V. What exactly is V(s)? How can a state have an inherent value?
If we are considering this in the context of, let's say, trading stocks, then how would we define these three variables?
No matter which network we are talking about, the reward is an inherent part of the environment. This is the signal (in fact, the only signal) that an agent receives throughout its life after taking actions. For example: an agent that plays chess gets only one reward at the end of the game, either +1 or -1; at all other times the reward is zero.
Here you can see a problem in this example: the reward is very sparse and is given just once, but the states in a game are obviously very different. If an agent is in a state where it still has its queen while the opponent has just lost theirs, the chances of winning are very high (simplifying a little bit, but you get the idea). This is a good state and an agent should strive to get there. If, on the other hand, an agent has lost all its pieces, it is in a bad state and will likely lose the game.
We would like to quantify what good and bad states actually are, and this is where the value function V(s) comes in. Given any state, it returns a number, big or small. Usually, the formal definition is the expectation of the discounted future rewards, given a particular policy to act (for the discussion of a policy see this question). This makes perfect sense: a good state is one in which the future +1 reward is very probable; a bad state is quite the opposite, one in which the future -1 is very probable.
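Written out, the definition just described is usually stated as follows. The symbols are the conventional ones and are assumptions here, since the text above does not fix a notation: \pi is the policy, \gamma \in [0, 1] the discount factor, and r_{t+1} the reward received after step t.

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s\right]$$

In the chess example, every r_{t+1} is zero except the last one, so V^{\pi}(s) is essentially the (discounted) probability of winning minus the probability of losing from state s under policy \pi.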
Important note: the value function depends on the rewards, and not just on the reward for one state but for many of them. Remember that in our example the reward for almost all states is 0. The value function takes into account all future states along with their probabilities.
Another note: strictly speaking the state itself doesn't have a value. But we have assigned one to it, according to our goal in the environment, which is to maximize the total reward. There can be multiple policies and each will induce a different value function. But there is (usually) one optimal policy and the corresponding optimal value function. This is what we'd like to find!
Finally, the Q-function Q(s, a) or the action-value function is the assessment of a particular action in a particular state for a given policy. When we talk about an optimal policy, action-value function is tightly related to the value function via Bellman optimality equations. This makes sense: the value of an action is fully determined by the value of the possible states after this action is taken (in the game of chess the state transition is deterministic, but in general it's probabilistic as well, that's why we talk about all possible states here).
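For reference, the relationship mentioned above is usually written like this, with V^{*} and Q^{*} denoting the optimal value and action-value functions, s' the next state, and \gamma the discount factor (conventional notation, assumed here rather than taken from the text):

$$V^{*}(s) = \max_{a} Q^{*}(s, a)$$

$$Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma\, V^{*}(s') \;\middle|\; s, a \,\right]$$

The expectation runs over the possible next states s'; in a deterministic game such as chess it collapses to a single term.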
Once again, the action-value function is derived from the future rewards. It's not just the current reward. Some actions can be much better or much worse than others even though the immediate reward is the same.
Speaking of the stock trading example, the main difficulty is to define a policy for the agent. Let's imagine the simplest case. In our environment, a state is just a tuple (current price, position). In this case:
The reward is non-zero only when an agent actually holds a position; when it's out of the market, there is no reward, i.e. it's zero. This part is more or less easy.
But the value and action-value functions are very non-trivial (remember they account only for the future rewards, not the past). Say the price of AAPL is at $100: is that good or bad considering future rewards? Should you rather buy or sell it? The answer depends on the policy...
For example, an agent might somehow learn that every time the price suddenly drops to $40, it will recover soon (this sounds silly, it's just an illustration). Now if an agent acts according to this policy, a price around $40 is a good state and its value is high. Likewise, the action-value Q around $40 is high for "buy" and low for "sell". Choose a different policy and you'll get different value and action-value functions. Researchers try to analyze the stock history and come up with sensible policies, but no one knows an optimal policy. In fact, no one even knows the state probabilities, only estimates of them. This is what makes the task truly difficult.
I have to solve this problem with Q-learning.
Well, actually I have to evaluate a Q-learning based policy on it.
I am a tourist manager.
I have n hotels, each of which can hold a different number of persons.
For each person I put in a hotel I get a reward, based on which room I have chosen.
If I want, I can also murder the person, so they go to no hotel, but that gives me a different reward.
(OK, that's a joke... but the point is that I can have a self-transition, so the number of people in my rooms doesn't change after that action.)
My state is a vector containing the number of persons in each hotel.
My action is a vector of zeroes and ones that tells me where to put the new person.
My reward matrix is formed by the rewards I get for each transition between states (even the self-transition).
Now, since I can get an unlimited number of people (i.e. I can fill the hotels but still go on killing arrivals), how can I build the Q matrix? Without the Q matrix I can't get a policy, and so I can't evaluate it...
What am I getting wrong? Should I choose a random state as final? Have I missed the point entirely?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode and a corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is a discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
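To make the role of gamma concrete, here is a small sketch that computes the return of a finite reward sequence. The function name `discounted_return` and the example rewards are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # undiscounted, as in episodic tasks
print(discounted_return(rewards, gamma=0.5))  # later rewards count for less
```

With gamma below one, an endless stream of bounded rewards still has a finite return, which is why continuing problems use it.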
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
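A minimal sketch of tabular Q-learning with an epsilon-greedy behavior policy on a continuing task (no terminal state). Everything here is an assumption made for illustration: the two-state environment `toy_step` is not the hotel problem, and the step size, discount, and exploration rate are arbitrary.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Off-policy Q-learning update toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


def toy_step(state, action):
    """Hypothetical continuing task: action 1 flips the state, reward 1 in state 1."""
    next_state = state ^ action
    return next_state, float(next_state == 1)


rng = random.Random(0)
q = defaultdict(float)  # Q-table that grows as states are visited
actions = [0, 1]
state = 0
for _ in range(5000):
    action = epsilon_greedy(q, state, actions, 0.1, rng)
    next_state, r = toy_step(state, action)
    q_update(q, state, action, r, next_state, actions, alpha=0.1, gamma=0.9)
    state = next_state

greedy = {s: max(actions, key=lambda a: q[(s, a)]) for s in (0, 1)}
print(greedy)
```

Because the table is a dictionary keyed by `(state, action)`, it only stores the states actually visited, which is one way to start on a problem whose state space cannot be enumerated up front.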
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of experience with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with a discrete approach, you would get a good start and learn something about your problem by limiting its scope (a finite number of people and hotels/rooms) and turning Q-learning loose on the smaller state matrix.
OR, you could jump right into a method that can handle an infinite state space, like a neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe this isn't an answer to "is it possible?", but... read about R-learning. To solve this particular problem you may want to learn not only the Q- or V-function, but also rho, the expected reward per time step. Jointly learning Q and rho results in a better strategy.
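For orientation, R-learning is usually presented with the following pair of updates. The step sizes \alpha and \beta are notation assumed here; s' is the next state and r the observed reward.

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[\, r - \rho + \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

$$\rho \leftarrow \rho + \beta\left[\, r - \rho + \max_{a'} Q(s', a') - \max_{a} Q(s, a) \,\right] \quad \text{(applied only when } a \text{ was the greedy action in } s\text{)}$$

There is no discount factor: rewards are measured relative to the average reward \rho, which is what makes the method a fit for continuing problems.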
To build on the above response: with an infinite state space, you definitely should consider generalization of some sort for your Q-function. A lookup table cannot enumerate infinitely many states, whereas a function approximator can generalize from visited states to unseen ones. You could experiment with several different function approximators, from simple linear regression to a neural network.
Like Martha said, you will need a gamma less than one to account for the infinite horizon. Otherwise, you would be comparing policies whose returns are all infinite, which means you could not tell which one is optimal.
The main thing I wanted to add here, though, for anyone reading this later, is the significance of reward shaping. In an infinite problem, where there isn't a final large reward, sub-optimal reward loops can occur in which the agent gets "stuck", since a certain state may have a higher reward than any of its neighbors within the effective horizon (which is set by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times, to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.
I need some help with solving a problem that uses the Q-learning algorithm.
Problem description:
I have a rocket simulator where the rocket takes random paths and also crashes sometimes. The rocket has 3 different engines that can each be either on or off. Depending on which engines are activated, the rocket flies in different directions.
Functions for turning the engines off/on are available.
The task:
Construct a Q-learning controller that will turn the rocket to face up all the time.
A sensor that reads the angle of the rocket is available as input.
My solution:
I have the following states:
I also have the following actions:
all engines off
left engine on
right engine on
middle engine on
left and right on
left and middle on
right and middle on
And the following rewards:
Angle = 0, Reward = 100
All other angles, reward = 0
Question:
Now to the question: is this a good choice of rewards and states? Can I improve my solution? Is it better to have more rewards for other angles?
Thanks in advance
16 states x 7 actions is a very small problem.
Rewards for other angles will help you learn faster, but can create odd behaviors later depending on your dynamics.
If you don't have momentum you may decrease the number of states, which will speed up learning and reduce memory usage (which is already tiny). To find the optimal number of states, try decreasing the number of states while analyzing a metric such as reward/timesteps over multiple games, or mean error (normalized by starting angle) over multiple games. Some state representations may perform much better than others. If not, choose the one which converges fastest. This should be relatively cheap with your small Q table.
If you want to learn quickly, you may also try Q(lambda) or another modified reinforcement learning algorithm that uses eligibility traces to propagate rewards back over several steps at once.
Edit: Depending on your dynamics this problem may not actually be suitable as a Markov Decision Process. For example, you may need to include the current rotation rate.
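A sketch of what including the rotation rate could look like with a table-based controller: both the angle and the rate are binned, and the pair is mapped to a single state index. The function `discretize`, the 16 angle bins, and the rate thresholds of +/- 10 degrees per second are all assumptions for illustration.

```python
def discretize(angle_deg, rate_deg_s, n_angle_bins=16, rate_edges=(-10.0, 10.0)):
    """Map (angle, rotation rate) to one integer state index."""
    bin_width = 360.0 / n_angle_bins
    # Clamp guards against float rounding at the upper edge of the last bin.
    angle_bin = min(int((angle_deg % 360.0) / bin_width), n_angle_bins - 1)
    # Rate bins: below the first edge, between the edges, above the last edge.
    rate_bin = sum(rate_deg_s > edge for edge in rate_edges)
    return angle_bin * (len(rate_edges) + 1) + rate_bin


print(discretize(0.0, 0.0))      # upright and not rotating
print(discretize(359.9, 20.0))   # almost upright but spinning fast
```

With these hypothetical settings the table has 16 x 3 = 48 states, still small, but the agent can now distinguish "upright and still" from "upright and spinning".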
Try putting smaller rewards on the states next to the desired state. This will get your agent to learn to go up quicker.
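One way to sketch such a shaped reward, assuming the angle is given in degrees with 0 meaning "facing up". The function name `shaped_reward`, the linear fall-off, and the 45-degree tolerance are assumptions; the peak value of 100 is taken from the question.

```python
def shaped_reward(angle_deg: float, tolerance_deg: float = 45.0) -> float:
    """100 when upright, falling linearly to 0 at +/- tolerance_deg."""
    # Wrap the angle into [-180, 180) so that 350 degrees counts as -10.
    error = abs((angle_deg + 180.0) % 360.0 - 180.0)
    return max(0.0, 100.0 * (1.0 - error / tolerance_deg))


for angle in (0.0, 22.5, 90.0, 350.0):
    print(angle, shaped_reward(angle))
```

As noted in the other answer, rewards for other angles can speed up learning but may also produce odd behavior, so it is worth comparing against the original sparse reward.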