Reinforcement learning with hard constraints - machine-learning

The environment is a directed graph that consists of nodes which have their own "goodness"(marked green) and edges that have prices(marked red). In this environment exists Price(P) constraint. The goal is to accumulate the most "goodness" points from nodes, as possible while making a circle(for example 0-->6-->5-->0) and not exceeding price constraint..
I managed to implement Q-Learning algorithm when there are no constraints, but I don't fully understand how to add Hard Constrains, while approximating Q-Function.
For instance, starting point is 0. Price limit is 13. Taking path 0-->1-->2-->3-->4-->5-->0 wouldn't be a valid choice for Agent, because at node 5 price(13) limit was reached, therefore, Agent should be punished for violating constraints. However, taking path 0-->6-->5-->0 would be a correct choice for the Agent and therefore, he should be rewarded. What I do not understand, how to tell Agent that sometimes going from 5 to 0 is perfect choice and sometimes it is not applicable, because some constraints where violated. I tried to give huge penalty if price constraint was violated and ended episode immediately, but that didn't seem to workout.
My question(s):
How to add hard constraints to RL algorithms like Q-Learning?
Is Q-Learning a valid choice for that kind of problem?
Should other algorithms be chosen like Monte Carlo Tree Search
instead of Q-Learning.
I assume it is a very common problem in real world scenarios, but I couldn't find any examples about that.

Related

Optimize deep Q network with long episode

I am working on a problem for which we aim to solve with deep Q learning. However, the problem is that training just takes too long for each episode, roughly 83 hours. We are envisioning to solve the problem within, say, 100 episode.
So we are gradually learning a matrix (100 * 10), and within each episode, we need to perform 100*10 iterations of certain operations. Basically we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input:
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each time we update one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure seems not suitable for some "distributed" system, if I understood correctly.
Could anyone shed some lights on how we look at the potential optimization opportunities here? Like some extra engineering efforts or so? Any suggestion and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element as empty
1. action space:
each step I will select one element from a candidate pool of 1000 elements. Then insert the element into the matrix one by one.
2. environment:
each step I will have an updated matrix to learn.
An oracle function F returns a quantitative value range from 5000 ~ 30000, the higher the better (roughly one computation of F takes 120 seconds).
This function F takes the matrix as the input and perform a very costly computation, and it returns a quantitative value to indicate the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of system, so it do takes a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be less than 100 episode, at least.
4. constraints
Ideally, like I mentioned, "All the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward by taking the whole matrix as the input rather than the latest selected element.
Indeed by appending more and more elements in the matrix, the reward could increase, or it could decrease as well.
5. goal
The synthesized matrix should let the oracle function F returns a value greater than 25000. Whenever it reaches this goal, I will terminate the learning step.
Honestly, there is no effective way to know how to optimize this system without knowing specifics such as which computations are in the reward function or which programming design decisions you have made that we can help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most time, by sharing a code excerpt or, as the standard for doing science gets higher, sharing a reproduceable code base.
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to apply Reinforcement Learning in "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI in Dota 2 game colletected the experience equivalent to 900 years per day. In the original Deep Q-network paper, in order to achieve a performance close to a typicial human, it was required hundres of millions of game frames, depending on the specific game. In other benchmarks where the input are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning or Simple random search provides a competitive approach to reinforcement learning). All these ideas a much more are discussed in this great blog post.
The previous point is specially true for deep RL. The fact of approximatting value functions or policies using a deep neural network with millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding to your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out if you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really need RL, it's not clear if you should use a deep neural network as approximator or you can use a shallow model (e.g., random trees). However, these questions an other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due a numerous reasons and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller version of size 20*10 matrix. Just a caution note: due to the curse of the dimensionality, the complexity of the problem (or the experience needed) could grow exponentially when the state space dimensionalty grows, although maybe it is not your case.
That said, I'm looking forward to see an answer that really helps you to solve your problem.

Genetic algorithm - shortest path in weighted graph

I want to make a genetic algorithm that solves a shortest path problem in weighted, connected graph. Similar to travelling salesman, but instead of fully-connected graph, it's just connected.
My idea is to randomly generate a path consisting of n-1 nodes for each chromosome in binary form, where numbers indicate nodes in a path. Then I will choose the best depending on sum of weights (if cant go from A to B i would give it penalty) and crossover/mutate bits in it. Will it work? It feels a little like smaller version of bruteforce. Is there a better way?
Thanks!
Genetic algorithm is pretty much "smaller version of bruteforce". It is just a metaheuristic, not an optimization method which has decent convergence guarantees. It basically depends on randomness to provide new solutions, thus it is a "slightly better random search".
So "will it work"? Yes, it will do something, as long as you have enough randomness in mutation it will even (eventually) converge to optimum. Will it work better than a random search? Hard to say, this depends on dozens of factors, not only your encoding, but also all the hyperparameters used etc. in general genetic algorithms are about trials and errors. In particular representation of chromosomes which does not loose any information (yours does not) does not matter, meaning that everything depends on clever implementation of crossover and mutation (as long as chromosomes do not loose any information they are all equivalent).
Edited.
You can use permutation coding GA. In permutation coding, you should give the start and end points. GA searches for the best chromosome with your fitness function. Candidate solutions (chromosomes) will be like 2-5-4-3-1 or 2-3-1-4-5 or 1-2-5-4-3 etc. So your solution depends on your fitness function. (Look at GA package for R to apply permutation GA easily.)
Connections are constraints for your problem. My best advice is create a constraint matrix like that:
FirstPoint SecondPoint Connected
A B true
A C true
A E false
... ... ...
In standard TSP, just distances are considered. In your fitness function, you have to consider this matrix and add a penalty to return value for each false.
Example chromosome: A-B-E-D-C
A-B: 1
B-E: 1
E-D: 4
D-C: 3
Fitness value: 9
.
Example chromosome: A-E-B-C-D
A-E: penalty
E-B: 1
B-C: 6
C-D: 3
Fitness value: 10 + penalty value.
Because your constraint is a hard constraint, you can use max integer value as the penalty. GA will find the best solution. :)

is Q-learning without a final state even possible?

I have to solve this problem with Q-learning.
Well, actually I have to evaluated a Q-learning based policy on it.
I am a tourist manager.
I have n hotels, each can contain a different number of persons.
for each person I put in a hotel I get a reward, based on which room I have chosen.
If I want I can also murder the person, so it goes in no hotel but it gives me a different reward.
(OK,that's a joke...but it's to say that I can have a self transition. so the number of people in my rooms doesn't change after that action).
my state is a vector containing the number of persons in each hotel.
my action is a vector of zeroes and ones which tells me where do I
put the new person.
my reward matrix is formed by the rewards I get for each transition
between states (even the self transition one).
now,since I can get an unlimited number of people (i.e. I can fill it but I can go on killing them) how can I build the Q matrix? without the Q matrix I can't get a policy and so I can't evaluate it...
What do I see wrongly? should I choose a random state as final? Do I have missed the point at all?
This question is old, but I think merits an answer.
One of the issues is that there is not necessarily the notion of an episode, and corresponding terminal state. Rather, this is a continuing problem. Your goal is to maximize your reward forever into the future. In this case, there is discount factor gamma less than one that essentially specifies how far you look into the future on each step. The return is specified as the cumulative discounted sum of future rewards. For episodic problems, it is common to use a discount of 1, with the return being the cumulative sum of future rewards until the end of an episode is reached.
To learn the optimal Q, which is the expected return for following the optimal policy, you have to have a way to perform the off-policy Q-learning updates. If you are using sample transitions to get Q-learning updates, then you will have to specify a behavior policy that takes actions in the environment to get those samples. To understand more about Q-learning, you should read the standard introductory RL textbook: "Reinforcement Learning: An Introduction", Sutton and Barto.
RL problems don't need a final state per se. What they need is reward states. So, as long as you have some rewards, you are good to go, I think.
I don't have a lot of XP with RL problems like this one. As a commenter suggests, this sounds like a really huge state space. If you are comfortable with using a discrete approach, you would get a good start and learn something about your problem by limiting the scope (finite number of people and hotels/rooms) of the problem and turning Q-learning loose on the smaller state matrix.
OR, you could jump right into a method that can handle infinite state space like an neural network.
In my experience if you have the patience of trying the smaller problem first, you will be better prepared to solve the bigger one next.
Maybe it isn't an answer on "is it possible?", but... Read about r-learning, to solve this particular problem you may want to learn not only Q- or V-function, but also rho - expected reward over time. Joint learning of Q and rho results in better strategy.
To iterate on the above response, with an infinite state space, you definitely should consider generalization of some sort for your Q Function. You will get more value out of your Q function response in an infinite space. You could experiment with several different function approximations, whether that is simple linear regression or a neural network.
Like Martha said, you will need to have a gamma less than one to account for the infinite horizon. Otherwise, you would be trying to determine the fitness of N amount of policies that all equal infinity, which means you will not be able to measure the optimal policy.
The main thing I wanted to add here though for anyone reading this later is the significance of reward shaping. In an infinite problem, where there isn't that final large reward, sub-optimal reward loops can occur, where the agent gets "stuck", since maybe a certain state has a reward higher than any of its neighbors in a finite horizon (which was defined by gamma). To account for that, you want to make sure you penalize the agent for landing in the same state multiple times to avoid these suboptimal loops. Obviously, exploration is extremely important as well, and when the problem is infinite, some amount of exploration will always be necessary.

What parameters can I play with using mcl?

I am clustering undirected graphs using mcl. To do so, I have choose a threshold under which nodes are connected, a similarity measure for each edge and the inflation parameter to tune the granularity of my graph. I have been playing around with these parameters, but so far, the clusters I have seem to be too large (I did visualizations that suggest that the largest clusters should be cut into 2 or more clusters). Therefore, I was wondering what are the other parameters I can play with to improve my clustering (I am currently working with the scheme parameter of mcl to see whether increasing the accuracy would help, but if there are other 'more specific' parameters that could help to get smaller clusters for instance, please let me know)?
There are really mainly two things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff
such that the topology of the network becomes informative; i.e. too many edges
or too few edges yield little discriminative information in the
absence/presence structure of edges. Choose it such that no edges connect
things you consider very dissimilar, and that edges connect things you consider
somewhat similar to quite similar. In the case of mcl, the dynamic range in
edge weight between 'a bit similar' and 'very similar' should be, as a rule of
a thumb, one order of magnitude, i.e. two-fold or five-fold or ten-fold, as
opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple
networks to mcl and it will just utilise the absence/presence of edges. Make sure
the network does not become very dense - a very rough rule of thumb could be to aim
for a total number of edges that is in the order of V * sqrt(V) if the number of nodes (vertcies) is V, that is, each node has, on average, in the order of sqrt(V) neighbours.
The above, network construction, is really crucial, and it is advisable
to try different approaches. Now, given a network,
there is really only one mcl parameter to vary: the inflation parameter (the -I option).
A good set of values to test with is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction,
using your knowledge of the data to make the network a meaningful representation,
and combine this with trying different mcl inflation values.

Genetic Algorithm, large population vs small one

Im wondering if there is a general rule of thumb for population sizing. Ive read in a book that 2x the chromosome length is a good starting point. Am i correct in assuming then that if i had an equation with 5 variables, i should have a population of 10?
Im also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for smaller population??
EDIT
A little additional info, say i have an equation which has 5 unknown parameters. For each parameter i have anywhere between 10-50 values i would like to try assign to each of these variables. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, ie worst case for the above would be 312,500,000 permutations (unless i have screwed up?) n!/(n-k)! where n = 50 and k = 1 => 50 * 50 * 50 * 50 * 50
unfortunately the number of parameters/range of values to check can vary alot so i was looking for some sort of rule of thumb as to how large i should set the population.
Thanks for ur help + if there is any more info you need/prefer to discuss in one of the chatrooms, just give me a shout.
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic(this will change in subsequent runs) local minimum and slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rule of thumbs is dependent on chromosome length is because that's a decent proxy for this, if I have a hundred variables, and given randomly generating dna sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
As with most genetic algorithm parameters population size is highly dependant on the problem. There are certain factors that can help to point in the direction of whether you should have a large or small population size but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process than a small one but since it can often solve the problem quicker then overall the processing time isn't necessarily longer. gain, it's highly dependant on the problem. With a smaller population size mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.

Resources