I would like to improve the scalability of SMT solving. I have already implemented incremental solving, but I would like to improve it further. Are there any other general methods that do not require knowledge of the problem itself?
There's no single "trick" that can make z3 scale better for an arbitrary problem. It really depends on what the actual problem is and what sort of constraints you have. Of course, this goes for any general computing problem, but it really applies in the context of an SMT solver.
Having said that, here're some general ideas based on my experience, roughly in the order of ease of use:
Read the Programming Z3 book: This is a very nice write-up and will teach you a ton of things about how z3 is architected and what the best idioms are. You might hit something in there that directly applies to your problem: https://theory.stanford.edu/~nikolaj/programmingz3.html
Keep booleans as booleans, not integers: Never use integers to represent booleans. (That is, do not encode true as 1 and false as 0, with multiplication for and, etc. This is a terrible idea that kills the powerful SAT engine underneath.) Explicitly convert if necessary. Most problems where people tend to deploy such tricks involve counting how many booleans are true etc.: Such problems should be solved using the pseudo-boolean constraints that are built into the solver. (Look up PbEq, PbLe, PbGe etc., as they are named in z3py.)
Don't optimize unless absolutely necessary: The optimization engine is not incremental, nor is it well optimized (pun intended). It works rather slowly compared to all other engines, and for good reason: Optimization modulo theories is a very tricky thing to do. Avoid it unless you really have an optimization problem to tackle. You might also try to optimize "outside" the solver: Make a SAT call, get the results, and make subsequent calls asking for "smaller" cost values. You may not hit the optimum using this trick, but the values might be good enough after a couple of iterations. Of course, how good the results will be depends entirely on your problem.
Case split: Try reducing your constraints by case-splitting on key variables. Example: If you're dealing with floating-point constraints, do a case split on normal, denormal, zero, infinity, and NaN values separately. Depending on your particular domain, you might have such semantic categories where underlying algorithms take different paths, and mixing-and-matching them will always give the solver a hard time. Case splitting based on context can speed things up.
Use a faster machine and more memory: This goes without saying, but having plenty of memory can really speed up certain problems, especially if you have a lot of variables. Get the biggest machine you can!
Make use of your cores: You probably have a machine with many cores, and your operating system most likely provides fine-grained multi-tasking. Make use of this: Start many instances of z3 working on the same problem but with different tactics, random seeds, etc., and take the result of the first one that completes. Random seeds can play a significant role if you have a huge constraint set, so running more instances with different seed values can get you "lucky" on average.
Try to use parallel solving: Most SAT/SMT solver algorithms are sequential in nature. There have been a number of papers on how to parallelize some of the algorithms, but most engines do not have parallel counterparts. z3 has an interface for parallel solving, though it's less advertised and rather finicky. Give it a try and see if it helps out. Details are here: https://theory.stanford.edu/~nikolaj/programmingz3.html#sec-parallel-z3
Profile: Profile the z3 source code itself as it runs on your problem, and see where the hot-spots are. See if you can recommend code improvements to developers to address these. (Better yet, submit a pull request!) Needless to say, this will require a deep study of z3 itself, probably not suitable for end-users.
Bottom line: There's no free lunch. No single method will make z3 run better for your problems. But the above ideas might help improve run times. If you describe the particular problem you're working on in some detail, you'll most likely get better advice as it applies to your constraints.
Related
Consider a number of nodes with some connections between them.
My model's task is to color the nodes. One of the conditions is that the black nodes form a fully-connected set.
How do I code that?
NB: in case it matters: the connections between symbols are a precondition.
What have you tried? Stack Overflow works best if you show what you tried and where you got stuck. Depending on how you model your graph, there could be many different ways.
Here’s a hint to get you started: in programming with z3, you usually write the code that “checks” the nodes are fully connected. Then, through the magic of constraint solving, that causes the solver to provide models that satisfy those criteria. So, start with modeling your graph and how you can check that the same-colored nodes are connected.
Note that hard problems like graph coloring, clique finding, isomorphisms etc remain hard in this realm too. They’re easier to code perhaps, but you shouldn’t expect better performance than exhaustive search for large instances on average; unless your graphs have special structure that the solver can exploit. But in that case, you’re better off using a custom algorithm anyhow, instead of relying on a general purpose SMT solver. Of course, this all depends on what your main goal is. It’s best to try multiple approaches and pick the one that performs the best.
Currently, I have a somewhat superficial understanding of how SMT solvers work (the basics of algorithms like E-matching, MBQI, and CVC4/5's inductive reasoning). However, it's very frustrating to debug by trial-and-error.
Is there any guidance on how to debug SMT scripts that make heavy use of quantifiers?
A badly-written script often goes into an infinite loop, but I cannot tell if it's my mistake or if it's just taking too long to respond.
The SMT solvers tend to hide internals from users, so it's quite hard to figure out why it's stuck. Is there any way to print the "solving context"?
Or maybe I'm using SMT solvers the wrong way? I should design my own verification algorithm, only employing SMT solvers for local decisions?
Any help is appreciated!
This is a very subjective question, and largely opinion based. But a couple of general remarks:
Don't directly program in SMTLib. It is not meant for human consumption. Instead, use a higher-level API, and script the solver from a language that you're more familiar with. There are bindings available from any number of languages, including C/C++/Java/Python/OCaml/Haskell/Scala etc. Just doing this will get rid of most of the mundane mistakes you make.
Turn on the verbose output of the solver. You might be able to notice patterns in the log output. Unfortunately this is very solver specific, and can be hard to decipher; but it can also indicate if, for instance, you're stuck in an e-matching loop in the presence of quantifiers.
If there's a custom algorithm for your verification problem (Hoare triples, separation logic, abstract interpretation, ...), then you first have to apply these techniques and delegate local/sub-lemmas to an SMT solver. Do not expect the SMT solver to handle large proofs, or anything that requires actual induction, out of the box.
Try reducing complexity by putting in over-constraints and see which ones help. Based on your findings you might be able to do a case-split, for instance, if the over-constraints enumerate a reasonably small search-space.
Again, these are very general remarks and whether they'll apply for your specific problem is anyone's guess. But I'd start with coding in a higher-level API if you aren't already doing so.
I have been reading papers about Markov models, and came across a great extension called TML (Tractable Markov Logic).
It is a subset of Markov logic, and uses probabilistic class and part hierarchies to control complexity.
This model has both complex logical structure and uncertainty.
It can represent objects, classes, and relations between objects, subject to certain restrictions which ensure that inference in any model built in TML can be queried efficiently.
I am just wondering why such a good idea has not spread more widely in application areas such as activity analysis?
My understanding is that TML inference is polynomial in the size of the model, but the model needs to be compiled for a given problem and may become exponentially large. So, in the end, it's still not really tractable.
However, it may be advantageous to use it in the case that the compiled form will be used multiple times, because then the compilation is done only once for multiple queries. Also, once you obtain the compiled form, you know what to expect in terms of run-time.
However, I think the main reason you don't see TML being used more broadly is that it is just an academic idea. There is no robust, general-purpose system based on it. If you try to work on a real problem with it, you will probably find out that it lacks certain practical features. For example, there is no way to represent a normal distribution with it, and lots of problems involve normal distributions. In such cases, one may still use the ideas behind the TML paper but would have to create their own implementation that includes further features needed for the problem at hand. This is a general problem that applies to lots and lots of academic ideas. Only a few become really useful and the basis of practical systems. Most of them exert influence at the level of ideas only.
I have a huge set of linear real arithmetic constraints to solve, and I am incrementally feeding them to the solver. Z3 always seems to get stuck after a while. Is Z3 internally going to change its strategy for solving the constraints, such as moving away from the Simplex algorithm and trying others? Or do I have to explicitly instruct Z3 to do so? I am using Z3py.
Without further details it's impossible to answer this question precisely.
Generally, with no logic being set and the default tactic being run or (check-sat) being called without further options, Z3 will switch to a different solver the first time it sees a push command; prior to that it can use a non-incremental solver.
The incremental solver comes with all the positives and negatives of incremental solvers, i.e., it may be faster initially, but it may not be able to exploit previously learned lemmas after some time, and it may simply remember too many irrelevant facts. Also, the heuristics may 'remember' information that doesn't apply at a later time, e.g., a 'good' variable ordering may change into a bad one after everything is popped and a different problem over the same variables is pushed. In the past, some users found it works better for them to use the incremental solver for some number of queries, but to start from scratch when it becomes too slow.
I asked myself the question whether most people normally code the machine learning algorithms themselves or whether they are likely to use existing solutions like Weka or R packages.
Of course it depends on the problem - but let's say that I want to use a common solution like a neural network. Is there still a reason to code it myself? To understand the mechanism better and adapt it? Or is the thought of standardized solutions more important?
This is not a good question for Stackoverflow. It's an opinion question, not a programming problem.
Nevertheless, here is my take:
It depends on what you want to do.
If you want to find which algorithm works best for your data problem at hand, try ELKI, Weka, R, Matlab, SciPy, whatever. Try out all the algorithms you can find, and spend even more time on preprocessing your data.
If you know which algorithm you need and need to get it into production, many of these tools will not perform well enough or be easy enough to integrate. Instead, check if you can find low-level libraries such as libSVM that provide the functionality you need. If these don't exist, roll your own optimized code.
If you want to do research in this domain, you are best off extending the existing tools. ELKI and Weka have APIs that you can plug into to provide extensions. R doesn't really have an API (CRAN is a mess...) but people just dump their code somewhere and (hopefully) add a manual on how to use it. Extending these frameworks can save you a lot of effort: you have comparison methods ready to use, and you can re-use a lot of their code. ELKI for example has a lot of index structures to accelerate algorithms. Most of the time, the index acceleration is much harder to write than the actual algorithm. So if you can reuse the existing indexes, this will make your algorithms much faster, too (and you will also benefit from future enhancements to these frameworks).
If you want to learn about existing algorithms, you had better implement them yourself. You'll be surprised how much more there is to optimizing some algorithms than what is taught in class. E.g. APRIORI. The basic idea is quite simple. But getting all the pruning details right is another matter: I'd say 1 out of 20 students gets these details. If you implement APRIORI, then benchmark it against a known good implementation and try to understand why yours is much slower, then you'll actually discover the subtle details of the algorithm. And don't be surprised to see a factor of 100 performance difference between ELKI, R, Weka etc. - it can still be the same algorithm, just implemented more or less efficiently when it comes to actual data structures used, memory layout etc.
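To show what "the basic idea is quite simple" looks like, here is a deliberately naive APRIORI sketch in Python. It is a classroom-style version written for this answer, not how ELKI, Weka or R implement it; the quadratic join and the repeated scans over all transactions are exactly the kind of thing a good implementation avoids.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        # Count step: one full scan over the transactions per candidate (slow).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
itemsets = apriori(baskets, min_support=2)
print(sorted((sorted(s), c) for s, c in itemsets.items()))
```

Benchmarking something like this against a tuned implementation on a larger data set is a good way to see where the factor-of-100 differences come from.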