How to check if cvxpy's solve() is successful? - cvxpy

This my first time trying to use cvxpy. I have 2 very simple constrains:
x = cp.Variable((5, 5))
constrains = [cp.sum(x) == 1.0, 0 <= x]
The solution worked most of time, satisfying both constrains. But sometimes the solution only satisfied the first constrain and spit out negative values. I am wondering if there is a way to have the solver to indicate whether it has succeeded or not.

This kind of information is always part of some status-information filled by the solver. In cvxpy's case this is documented here:
So something like:
problem.solve()
if problem.status == 'optimal':
...
else:
...
is the usual route.
Remark:
The solver decides this and feasibility and optimality decisions are depending on tolerances in general (floating-point math!).
Furthermore, most solvers within cvxpy are interior-point like solvers (some even first-order based solvers) which slowly converge to some arbitrarily accurate approximate solution such that:
your simplex-constraint (sum(x) == 1) might be off (compared to 1.0) by some small epsilon like 1e-12
some non-negative variable might be negative by some small epsilon like 1e-12
This is totally normal (for these kinds of solvers; things are different when using simplex-like solvers or simplex-based crossover post-opt). The user needs to take care and the approach he is chosing usually depends on his use-case. E.g. post-clipping x = np.clip(x.value, 0.0, np.inf), rounding and so on.

For me problem.status == 'optimal' didn't work, it said it was optimal when the constraints were not met. This worked better.
result = prob.solve()
if np.isnan(result):
print('no solution found')

Related

How to bias Z3's (Python) SAT solving towards a criteria, such as 'preferring' to have more negated literals

In Z3 (Python) is there any way to 'bias' the SAT search towards a 'criteria'?
A case example: I would like Z3 to obtain a model, but not any model: if possible, give me a model that has a great amount of negated literals.
Thus, for instance, if we have to search A or B a possible model is [A = True, B = True], but I would rather have received the model [A = True, B = False] or the model [A = False, B = True], since they have more False assignments.
Of course, I guess the 'criteria' must be much more concrete (say, if possible: I prefer models with the half of literals to False ), but I think the idea is understandable.
I do not care whether the method is native or not. Any help?
There are two main ways to handle this sort of problems in z3. Either using the optimizer, or manually computing via multiple-calls to the solver.
Using the optimizer
As Axel noted, one way to handle this problem is to use the optimizing solver. Your example would be coded as follows:
from z3 import *
A, B = Bools('A B')
o = Optimize()
o.add(Or(A, B))
o.add_soft(Not(A))
o.add_soft(Not(B))
print(o.check())
print(o.model())
This prints:
sat
[A = False, B = True]
You can add weights to soft-constraints, which gives a way to associate a penalty if the constraint is violated. For instance, if you wanted to make A true if at all possible, but don't care much for B, then you'd associate a bigger penalty with Not(B):
from z3 import *
A, B = Bools('A B')
o = Optimize()
o.add(Or(A, B))
o.add_soft(Not(A))
o.add_soft(Not(B), weight = 10)
print(o.check())
print(o.model())
This prints:
sat
[A = True, B = False]
The way to think about this is as follows: You're asking z3 to:
Satisfy all regular constraints. (i.e., those you put in with add)
Satisfy as many of the soft constraints as possible (i.e., those you put in with add_soft.) If a solution isn't possible that satisfies them all, then the solver is allowed to "violate" them, trying to minimize the total cost of all violated constraints, computed by summing the weights up.
When no weights are given, you can assume it is 1. You can also group these constraints, though I doubt you need that generality.
So, in the second example, z3 violated Not(A), because doing so has a cost of 1, while violating Not(B) would've incurred a cost of 10.
Note that when you use the optimizer, z3 uses a different engine than the one it uses for regular SMT solving: In particular, this engine is not incremental. That is, if you call check twice on it (after introducing some new constraints), it'll solve the whole problem from scratch, instead of learning from the results of the first. Also, the optimizing solver is not as optimized as the regular solver (pun intended!), so it usually performs worse on straight satisfiability as well. See https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/nbjorner-nuz.pdf for details.
Manual approach
If you don't want to use the optimizer, you can also do this "manually" using the idea of tracking variables. The idea is to identify soft-constraints (i.e., those that can be violated at some cost), and associate them with tracker variables.
Here's the basic algorithm:
Make a list of your soft constraints. Int the above example, they'll be Not(A) and Not(B). (That is, you'd like these to be satisfied giving you negative literals, but obviously you want these to be satisfied only if possible.) Call these S_i. Let's say you have N of them.
For each such constraint, create a new tracker variable, which will be a boolean. Call these t_i.
Assert N regular constraints, each of the form Implies(t_i, S_i), for each soft-constraint.
Use a pseudo-boolean constraint, of the form AtMostK, to force that at most K of these tracker variables t_i are false. Then use a binary-search like schema to find the optimal value of K. Note that since you're using the regular solver, you can use it in the incremental mode, i.e., with calls to push and pop.
For a detailed description of this technique, see https://stackoverflow.com/a/15075840/936310.
Summary
Which of the above two methods will work is problem dependent. Using the optimizer is easiest from an end-user point of view, but you're pulling in heavy machinery that you may not need and you can thus suffer from a performance penalty. The second method might be faster, at the risk of more complicated (and thus error prone!) programming. As usual, do some benchmarking to see which works the best for your particular problem.
Z3py features an optimizing solver Optimize. This has a method add_soft with the following description:
Add soft constraint with optional weight and optional identifier.
If no weight is supplied, then the penalty for violating the soft constraint
is 1.
Soft constraints are grouped by identifiers. Soft constraints that are
added without identifiers are grouped by default.
A small example can be found here:
The Optimize context provides three main extensions to satisfiability checking:
o = Optimize()
x, y = Ints('x y')
o.maximize(x + 2*y) # maximizes LIA objective
u, v = BitVecs('u v', 32)
o.minimize(u + v) # minimizes BV objective
o.add_soft(x > 4, 4) # soft constraint with
# optional weight

Obtaining representation of SMT formula as SAT formula

I've come up with an SMT formula in Z3 which outputs one solution to a constraint solving problem using only BitVectors and IntVectors of fixed length. The logic I use for the IntVectors is only simple Presburger arithmetic (of the form (x[i] - x[i + 1] <=/>= z) for some x and z). I also take the sum of all of the bits in the bitvector (NOT the binary value), and set that value to be within a range of [a, b].
This works perfectly. The only problem is that, as z3 works by always taking the easiest path towards determining satisfiability, I always get the same answer back, whereas in my domain I'd like to find a variety of substantially different solutions (I know for a fact that multiple, very different solutions exist). I'd like to use this nifty tool I found https://bitbucket.org/kuldeepmeel/weightgen, which lets you uniformly sample a constrained space of possibilities using SAT. To use this though, I need to convert my SMT formula into a SAT formula.
Do you know of any resources that would help me learn how to perform Presburger arithmetic and adding the bits of a bitvector as a SAT instance? Alternatively, do you know of any SMT solver which as an intermediate step outputs a readable description of the problem as a SAT instance?
Many thanks!
[Edited to reflect the fact that I really do need the uniform sampling feature.]

What is the way to understand Proximal Policy Optimization Algorithm in RL?

I know the basics of Reinforcement Learning, but what terms it's necessary to understand to be able read arxiv PPO paper ?
What is the roadmap to learn and use PPO ?
To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".
From the original PPO paper:
We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region [TRPO] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.
1. The Clipped Surrogate Objective
The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.
For vanilla policy gradients (e.g., REINFORCE) --- which you should be familiar with, or familiarize yourself with before you read this --- the objective used to optimize the neural network looks like:
This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. Another such function, introduced in this paper, uses the probability of the action under the current policy (π(a|s)), divided by the probability of the action under your previous policy (π_old(a|s)). This looks a bit similar to importance sampling if you are familiar with that:
This r(θ) will be greater than 1 when the action is more probable for your current policy than it is for your old policy; it will be between 0 and 1 when the action is less probable for your current policy than for your old.
Now to build an objective function with this r(θ), we can simply swap it in for the log π(a|s) term. This is what is done in TRPO:
But what would happen here if your action is much more probable (like 100x more) for your current policy? r(θ) will tend to be really big and lead to taking big gradient steps that might wreck your policy. To deal with this and other issues, TRPO adds several extra bells and whistles (e.g., KL Divergence constraints) to limit the amount the policy can change and help guarantee that it is monotonically improving.
Instead of adding all these extra bells and whistles, what if we could build these stabilizing properties into the objective function? As you might guess, this is what PPO does. It gains the same performance benefits as TRPO and avoids the complications by optimizing this simple (but kind of funny looking) Clipped Surrogate Objective:
The first term (blue) inside the minimization is the same (r(θ)A) term we saw in the TRPO objective. The second term (red) is a version where the (r(θ)) is clipped between (1 - e, 1 + e). (in the paper they state a good value for e is about 0.2, so r can vary between ~(0.8, 1.2)). Then, finally, the minimization of both of these terms is taken (green).
Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; here is the relevant section in both the OpenAI baselines and anyrl-py implementations.
Great.
Next, let's see what effect the L clip function creates. Here is a diagram from the paper that plots the value of the clip objective for when the Advantage is positive and negative:
On the left half of the diagram, where (A > 0), this is where the action had an estimated positive effect on the outcome. On the right half of the diagram, where (A < 0), this is where the action had an estimated negative effect on the outcome.
Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).
On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region will similarly prevent us from updating too much to make the action much less probable after we already just took a big step to make it less probable.
So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.
But why are we letting the r(θ) grow indefinitely on the far right side of the diagram? This seems odd as first, but what would cause r(θ) to grow really large in this case? r(θ) growth in this region will be caused by a gradient step that made our action a lot more probable, and it turning out to make our policy worse. If that was the case, we would want to be able to undo that gradient step. And it just so happens that the L clip function allows this. The function is negative here, so the gradient will tell us to walk the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)
These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
Here is a diagram summarizing this:
And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed it up. One thing I didn't discuss is what is meant by PPO objective forming a "lower bound" as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.
2. Multiple epochs for policy updating
Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.
PPO runs the policy using N parallel actors each collecting data, and then it samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See full algorithm below (the approximate param values are: K = 3-15, M = 64-4096, T (horizon) = 128-2048):
The parallel actors part was popularized by the A3C paper and has become a fairly standard way for collecting data.
The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take too big of steps on the local samples and this wrecks the policy. PPO, on the other hand, has the built-in mechanism to prevent too much of an update.
For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
....
And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.
[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.
PPO, and including TRPO tries to update the policy conservatively, without affecting its performance adversely between each policy update.
To do this, you need a way to measure how much the policy has changed after each update. This measurement is done by looking at the KL divergence between the updated policy and the old policy.
This becomes a constrained optimization problem, we want change the policy in the direction of maximum performance, following the constraints that the KL divergence between my new policy and old do not exceed some pre defined (or adaptive) threshold.
With TRPO, we compute the KL constraint during update and finds the learning rate for this problem (via Fisher Matrix and conjugate gradient). This is somewhat messy to implement.
With PPO, we simplify the problem by turning the KL divergence from a constraint to a penalty term, similar to for example to L1, L2 weight penalty (to prevent a weights from growing large values). PPO makes additional modifications by removing the need to compute KL divergence all together, by hard clipping the policy ratio (ratio of updated policy with old) to be within a small range around 1.0, where 1.0 means the new policy is same as old.
PPO is a simple algorithm, which falls into policy optimization algorithms class (as opposed to value-based methods such as DQN). If you "know" RL basics (I mean if you have at least read thoughtfully some first chapters of Sutton's book for example), then a first logical step is to get familiar with policy gradient algorithms. You can read this paper or chapter 13 of Sutton's book new edition. Additionally, you may also read this paper on TRPO, which is a previous work from PPO's first author (this paper has numerous notational mistakes; just note). Hope that helps. --Mehdi
I think implementing for a discrete action space such as Cartpole-v1 is easier than for continuous action spaces. But for continuous action spaces, this is the most straight-forward implementation I found in Pytorch as you can clearly see how they get mu and std where as I could not with more renowned implementations such as Openai Baselines and Spinning up or Stable Baselines.
RL-Adventure PPO
These lines from the link above:
class ActorCritic(nn.Module):
def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
super(ActorCritic, self).__init__()
self.critic = nn.Sequential(
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, 1)
)
self.actor = nn.Sequential(
nn.Linear(num_inputs, hidden_size),
nn.ReLU(),
nn.Linear(hidden_size, num_outputs),
)
self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)
self.apply(init_weights)
def forward(self, x):
value = self.critic(x)
mu = self.actor(x)
std = self.log_std.exp().expand_as(mu)
dist = Normal(mu, std)
return dist, value
and the clipping:
def ppo_update(ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
for _ in range(ppo_epochs):
for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
dist, value = model(state)
entropy = dist.entropy().mean()
new_log_probs = dist.log_prob(action)
ratio = (new_log_probs - old_log_probs).exp()
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
I found the link above the comments on this video on Youtube:
arxiv insights PPO

Control the solving strategy of Z3

So lets assume I have a large Problem to solve in Z3 and if i try to solve it in one take, it would take too much time. So i divide this problem in parts and solve them individually.
As a toy example lets assume that my complex problem is to solve those 3 equations:
eq1: x>5
eq2: y<6
eq3: x+y = 10
So my question is whether for example it would be possible to solve eq1 and eq2 first. And then using the result solve eq3.
assert eq1
assert eq2
(check-sat)
assert eq3
(check-sat)
(get-model)
seems to work but I m not sure whether it makes sense performancewise?
Would incremental solving maybe help me out there? Or is there any other feature of z3 that i can use to partition my problem?
The problems considered are usually satisfiability problems, i.e., the goal is to find one solution (model). A solution (model) that satisfies eq1 does not necessarily satisfy eq3, thus you can't just cut the problem in half. We would have to find all solutions (models) for eq1 so that we can replace x in eq3 with that (set of) solutions. (For example, this is what happens in Gaussian elimination after the matrix is diagonal.)

Z3: Function Expansion and Encoding in QBVF

I am trying to encode formulas with functions in Z3 and I have an encoding problem. Consider the following example:
f(x) = x + 42
g(x1, x2) = f(x1) / f(x2)
h(x1, x2) = g(x1, x2) % g(x2, x1)
k(x1, x2, x3) = h(x1, x2) - h(x2, x3)
sat( k(y1, y2, y3) == 42 && k(y3, y2, y1) == 42 * 2 && ... )
I would like my encoding to be both efficient (no expression duplication) and allow Z3 to re-use lemmas about functions across subproblems. Here is what I have tried so far:
Inline the functions for every free variable instantiation y1, y2, etc. This introduces duplication and performance is not as good as I hoped for.
Assert the function declarations with universal quantifiers. This works for very specific examples - from the solving times it seems that Z3 can (?) re-use results from previous queries that involve the same functions. However, solving times vary greatly and in many cases (1) turns out to be faster.
Use function definitions (i.e., quantifiers + the MACRO_FINDER option). If my understanding of the documentation is correct, this should expand the functions and thus should be close to (1). However, in terms of performance the results were a bit surprising (">" means faster):
For problems where (1) > (2) I get: (1) > (3) > (2)
For problems where (2) > (1) I get: (2) > (1) = (3)
I have also tried tweaking the MBQI option (and others) with most of the above. However, it is not clear what is the best combination. I am using Z3 4.0.
The question is: What is the "right" way to encode the problem? Note that I only have interpreted functions (I do not really need UF). Could I use this fact for a more efficient encoding and avoid function expansion?
Thanks
I think there's no clear answer to this question. Some techniques work better for one type of benchmarks and other techniques work better for others. For the QBVF benchmarks we've looked at so far, we found macros give us the best combination of small benchmark size and small solving times, but this may not apply in this case.
Your understanding of the documentation is correct, the macro finder will identify quantifiers that look like function definitions and replace all other calls to that function with its definition. It's possible that not all of your macros are picked up or that you are using quasi-macros which aren't detected correctly, either of which could go towards explaining why the performance is sometimes worse than your (1). How much is the difference in the case that (1) > (3)? A little overhead is to be expected, but vast variations in runtime are probably due to some macros being malformed or not being detected.
In general, the is no "right" way to encode these problems. Function expansion can not always be avoided. The trade-off is essentially between expanding eagerly (1, 3), or doing it lazily (2). It may be that there is a correlation of the type SAT (1, 3 faster) and UNSAT (2 faster), but this is also not guaranteed to be the case.

Resources