How to analyse z3 performance issues? - z3

I have 37 similar SMT2 problems, each in two equisatisfiable versions that I call compact and unrolled. The problems use incremental SMT solving and every (check-sat) returns unsat. The compact versions use the QF_AUFBV logic, the unrolled versions use QF_ABV. I ran them with z3, Yices, and Boolector (though Boolector only supports the unrolled versions). The results of this performance evaluation can be found here:
http://scratch.clifford.at/compact_smt2_enc_r1102.html
The SMT2 files for these examples can be downloaded from here (~10 MB):
http://scratch.clifford.at/compact_smt2_enc_r1102.zip
I ran each solver 5 times with different values for the :random-seed option (except Boolector, which does not support :random-seed, so I simply ran Boolector 5 times on the same input). The variation I get when running the solvers with different :random-seed values is relatively small (see the +/- values in the table for the maximum outlier).
There is a wide spread between solvers: Boolector and Yices are consistently faster than z3, in some cases by up to two orders of magnitude.
However, my question is about "z3 vs z3" performance. Consider for example the following data points:
| Test Case | Z3 Median Runtime | Max Outlier |
|-------------------|-------------------|-------------|
| insn_add unrolled | 873.35 seconds | +/-  0% |
| insn_add compact | 1837.59 seconds | +/- 1% |
| insn_sub unrolled | 4395.67 seconds | +/- 16% |
| insn_sub compact | 2199.21 seconds | +/- 5% |
The problems insn_add and insn_sub are almost identical. Both are generated from Verilog using Yosys; the only difference is that insn_add uses this Verilog module and insn_sub uses this one in its place. Here is the diff between those two source files:
--- insn_add.v 2017-01-31 15:20:47.395354732 +0100
+++ insn_sub.v 2017-01-31 15:20:47.395354732 +0100
@@ -1,6 +1,6 @@
// DO NOT EDIT -- auto-generated from generate.py
-module rvfi_insn_add (
+module rvfi_insn_sub (
input rvfi_valid,
input [ 32 - 1 : 0] rvfi_insn,
input [`RISCV_FORMAL_XLEN - 1 : 0] rvfi_pc_rdata,
@@ -29,9 +29,9 @@
wire [4:0] insn_rd = rvfi_insn[11: 7];
wire [6:0] insn_opcode = rvfi_insn[ 6: 0];
- // ADD instruction
- wire [`RISCV_FORMAL_XLEN-1:0] result = rvfi_rs1_rdata + rvfi_rs2_rdata;
- assign spec_valid = rvfi_valid && insn_funct7 == 7'b 0000000 && insn_funct3 == 3'b 000 && insn_opcode == 7'b 0110011;
+ // SUB instruction
+ wire [`RISCV_FORMAL_XLEN-1:0] result = rvfi_rs1_rdata - rvfi_rs2_rdata;
+ assign spec_valid = rvfi_valid && insn_funct7 == 7'b 0100000 && insn_funct3 == 3'b 000 && insn_opcode == 7'b 0110011;
assign spec_rs1_addr = insn_rs1;
assign spec_rs2_addr = insn_rs2;
assign spec_rd_addr = insn_rd;
But their behavior in this benchmark is very different: overall, the performance for insn_sub is much worse than for insn_add. Furthermore, for insn_add the unrolled version runs about twice as fast as the compact version, but for insn_sub the compact version runs about twice as fast as the unrolled version.
Here are the individual times from which the medians above were computed. The :random-seed setting clearly does not make much of a difference:
insn_add unrolled: 868.15 873.34 873.35 873.36 874.88
insn_add compact: 1828.70 1829.32 1837.59 1843.74 1867.13
insn_sub unrolled: 3204.06 4195.10 4395.67 4539.30 4596.05
insn_sub compact: 2003.26 2187.52 2199.21 2206.04 2209.87
Since the value of :random-seed does not seem to have much of an effect, I would assume there is something intrinsic to those .smt2 files that makes them fast or slow on z3. How would I investigate this? How would I find out what makes the fast cases fast and the slow cases slow, so I can avoid whatever makes the slow cases slow? (Yes, I know that this is a very broad question. Sorry. :)
Edit:
Here are some more concrete questions along the lines of my primary question. These questions are directly inspired by the obvious differences I can see between the (compact) insn_add and insn_sub benchmarks.
Can the order of (declare-..) and (define-..) statements in my SMT input influence performance?
Can changing the names of declared or defined functions influence performance?
If I split a BV into smaller BVs, and then concatenate them back again, can this influence performance?
If I either compare two BVs for equality, or split the BV into single bit variables and compare each of the bits individually, can this influence performance?
Also: which operations in z3 actually change when I choose a different value for :random-seed?
Making small changes to the .smt2 files without changing the semantics can be very difficult for large test cases generated by complex tools. I'm hoping there are other things I can try first, or maybe there is some existing expert knowledge about the kind of changes that might be worth investigating. Or alternatively: what kind of changes would effectively be equivalent to changing :random-seed and are therefore not worth investigating?
(Tests performed with git rev c67cf16, i.e. the current git head of z3, on an AWS EC2 c4.8xlarge instance, built with make -j40. The runtimes are CPU seconds, not wall-clock seconds.)
Edit 2:
I now have three test cases (test1.smt2, test2.smt2, and test3.smt2) that are identical except that I've renamed some of the functions I declare/define. The test cases can be found at http://svn.clifford.at/handicraft/2017/z3perf/.
This is a variation of the original problem that takes ~2 minutes to solve instead of ~1 hour. As before, changing the value of :random-seed has only a marginal effect, but renaming some of the functions without changing anything else changes the runtime by more than 2x.
I've now opened an issue on GitHub, arguing that :random-seed should be tied into whatever z3 changes internally when I rename the functions in my SMT2 code.

As you say, there can be many things creating that performance difference between add and sub.
A good start is to check whether the formulas after preprocessing are equal modulo add/sub (by the way, Z3 rewrites 'a - b' into 'a + (-1) * b'). If they are not, trace down which preprocessing step is at fault. Then track down the problem and send us a patch :)
Alternatively, the problem could be further down the line, e.g., in the bit-blaster. You can also dump the bit-blasted formulas of both of your files and check whether there is a significant difference in the number of variables and/or clauses.
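For example, here is a minimal z3py sketch of that kind of check, assuming a single-query file (the incremental benchmarks would first have to be flattened to one query, the file names are placeholders, and the 'simplify'/'bit-blast'/'tseitin-cnf' tactic chain only approximates z3's real preprocessing; the array-heavy compact versions may need array elimination first):

from z3 import *

def bitblast_stats(path):
    # Parse the SMT2 file and collect its assertions into a goal.
    goal = Goal()
    goal.add(parse_smt2_file(path))
    # Simplify, bit-blast and convert to CNF, then report rough sizes.
    pipeline = Then(Tactic('simplify'), Tactic('bit-blast'), Tactic('tseitin-cnf'))
    result = pipeline(goal)
    clauses = sum(len(subgoal) for subgoal in result)
    print('%s: %d subgoal(s), %d clauses' % (path, len(result), clauses))

# Placeholder file names, for illustration only.
bitblast_stats('insn_add_unrolled.smt2')
bitblast_stats('insn_sub_unrolled.smt2')

If the two inputs produce clearly different variable/clause counts here, preprocessing or bit-blasting is a good suspect; if the counts are similar, the difference is more likely in the SAT search itself.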
Anyway, you'll need to be prepared to invest a day or two (maybe more) to track down these issues. If you find something, let us know and/or send us a patch! :)

Related

Z3PY extremely slow with many variables?

I have been working with the optimizer in Z3PY, using only Z3 ints and (x < y)-style constraints in my project, and it has worked really well. With up to 26 variables (Z3 ints) the solver takes about 5 seconds to find a solution, even with at least 100 soft constraints. But when I tried 49 variables, it did not solve it at all (I shut it down after 1 hour).
So I made a little experiment to find out what was slowing it down: is it the number of variables or the number of soft constraints? It seems the bottleneck is the number of variables.
I created 26 Z3 ints. Then I added hard constraints that each variable should not be lower than 1 or higher than 26, and that all numbers must be unique. No other constraints were added at all.
In other words, the solution the solver will find is simply an assignment of the numbers 1, 2, 3, ... up to 26, in whatever order the solver comes up with.
I mean, this is a simple thing; there are really no constraints except those I mentioned, and the solver solves it in about 0.4 seconds, which is fast and expected. But if I increase the number of variables to 49 (and of course the constraints now say it should not be lower than 1 or higher than 49), it takes the solver about 1 minute to solve. That seems really slow for such a simple task. Should it be like this? Has the time complexity really increased that dramatically?
(I know that I can use Solver() instead of Optimize() for this particular experiment, and it would be solved within a second, but in reality I need to use Optimize since I have a lot of soft constraints to work with.)
EDIT: Adding some code for my example.
I declare an array of Z3 ints that I call "reqs".
The array consists of 26 variables in one example and 49 in the other example I am talking about.
from z3 import *

# Declaration as described above: 26 Z3 ints (49 in the slow case).
reqs = [Int('req_%d' % i) for i in range(26)]

solver = Optimize()
for i in reqs:
    solver.add(i >= 1)
for i in reqs:
    solver.add(i <= len(reqs))
d = Distinct(reqs)  # all variables must take pairwise different values
solver.add(d)
res = solver.check()
print(res)
Each benchmark is unique, and it's impossible to come up with a good strategy that applies equally well in all cases. But the scenario you describe is simple enough to deal with. The performance problem comes from the fact that Distinct creates too many inequalities (quadratic in the number of variables) for the solver, and the optimizer has a hard time dealing with them as you increase the number of variables.
As a rule of thumb, you should avoid using Distinct if you can. For this particular case, it'd suffice to impose a strict ordering on the variables. (Of course, this may not always be possible depending on your other constraints, but it seems what you're describing can benefit from this trick.) So, I'd code it like this:
from z3 import *

reqs = [Int('i_%d' % i) for i in range(50)]
solver = Optimize()
for i in reqs:
    solver.add(i >= 1, i <= len(reqs))
for i, j in zip(reqs, reqs[1:]):
    solver.add(i < j)
res = solver.check()
print(res)
print(solver.model())
When I run this, I get:
$ time python a.py
sat
[i_39 = 40,
i_3 = 4,
...
i_0 = 1,
i_2 = 3]
python a.py 0.27s user 0.09s system 98% cpu 0.365 total
which is pretty snappy. Hopefully you can generalize this to your original problem.
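To put a rough number on the quadratic blow-up mentioned above, here is a quick back-of-the-envelope count (plain Python, just counting constraints under the pairwise-disequality reading of Distinct):

# Distinct over n variables amounts to n*(n-1)/2 pairwise disequalities,
# while the chained ordering used above needs only n-1 constraints.
for n in (26, 49):
    print('%d vars: %d pairwise disequalities vs %d ordering constraints'
          % (n, n * (n - 1) // 2, n - 1))
# 26 vars: 325 pairwise disequalities vs 25 ordering constraints
# 49 vars: 1176 pairwise disequalities vs 48 ordering constraints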

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep through GNU parallel on a single machine with multiple cores, based on the size of "large_file", the size of "small_file", and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it chops large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv contains:
123456 1
234567 2
345667 22
and where small_file.csv contains:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting is done on the fly, so the file is not read into RAM first.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.

which is more important, number of variables or subexpressions?

I presume that detecting shared (sub)expressions is a technique applied in most modern SMT solvers, so performance should be very good when processing a sequence of similar expressions. However, I got unexpected results after running Z3 on input1 and input2. Instead of building one long constraint A directly as in input1, input2 defines intermediate variables that map to the sub-expressions of A. In that case input1 has fewer variables, so I expected it to be solved faster than input2. I cannot find useful information in the statistics, as they are exactly the same except for the solving time and memory consumed.
I would very much appreciate it if someone could explain what affects SMT solver performance more: the number of variables or the number of subexpressions?
I've done some profiling, and it seems that both inputs behave exactly the same in the solver; all (check-sat) commands take exactly the same time. Note that input2 is a file of size 255 KB, while input1 is a file of size 240 MB, i.e., input1 is about 1000 times larger than input2. According to my profiler, all of the additional time required to solve these queries is spent in the parser. So it simply takes a long time to read and check the input; the actual queries are all easy.
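If you want to reproduce this kind of measurement without a profiler, a minimal z3py sketch along these lines separates parse time from solve time (the file name is a placeholder, and this flattens everything into a single query, which is enough to see whether parsing dominates):

from z3 import *
import time

t0 = time.time()
formulas = parse_smt2_file('input1.smt2')  # placeholder file name
t1 = time.time()

s = Solver()
s.add(formulas)
res = s.check()
t2 = time.time()

print('parse: %.2fs, solve: %.2fs, result: %s' % (t1 - t0, t2 - t1, res))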

isolate lua from locale

I am considering embedding Lua in a C++ app (running under FreeBSD 8.2), but benchmarking revealed poor performance in some cases. Specifically, when Lua converts strings to numbers or compares strings, it becomes slower and, worse, scalability is ruined (8 cores perform worse than one!). I now think the locale is to blame, because when I avoid the automatic conversions everything works fine. But for real use I will need string comparisons and number conversions. How can I:
isolate Lua from the locale, i.e. ensure that none of Lua's functions use the locale indirectly? For instance, can I provide my own conversion and comparison functions?
or disable the locale altogether? I tried setlocale(LC_ALL, "C"); it works (the locale changes), but the bottleneck remains.
Update:
Following the suggestion by lhf, I jumped right into the Lua library code. What I found is dozens of places where (officially) locale-dependent functions are used. Removing all of them would cost too much effort; there must be a better way. So I tried to measure which of them do not scale. I also added some other commonly used functions, as well as some of my own interest (Lua interpreter creation and destruction, setting a global variable, etc.). The results follow; ideal scaling would be 700%, i.e. 7 threads performing 7 times better than 1 thread:
nop: 824% (1:106867300/7:881101495)
sprintf %f: 57% (1:2093975/7:1203949)
sprintf %.14g: 51% (1:2503818/7:1278312)
sprintf %.14lf: 73% (1:2134432/7:1576657)
sprintf %lf: 64% (1:2083480/7:1340885)
sprintf %d: 601% (1:6388005/7:38426161)
sscanf %s: 181% (1:8484822/7:15439285)
sscanf %f: 712% (1:3722659/7:26511335)
lua_cycle: 677% (1:113483/7:768936)
set_global: 715% (1:1506045/7:10780282)
set_get_global: 605% (1:2814992/7:17044081)
strcoll: 670% (1:38361144/7:257300597)
getenv: 681% (1:8526168/7:58131030)
isdigit: 695% (1:106894420/7:743529202)
isalpha: 662% (1:80771002/7:535055196)
isalpha(r): 638% (1:78232353/7:499207555)
strtol: 694% (1:16865106/7:117208528)
strtod: 749% (1:16727244/7:125323881)
time: 168% (1:727666/7:1225499)
gettimeofday: 162% (1:727549/7:1183433)
The figures change from run to run, but the big picture remains consistent: the sprintf floating-point conversions on 7 threads perform worse than on a single thread. time and gettimeofday scale badly. sscanf with %s also scales poorly, which is quite surprising, but it is not an issue in my case.
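For reference, here is a rough sketch of how such a scaling measurement can be set up (shown in Python with multiprocessing rather than the original C/pthreads harness, so the absolute numbers are not comparable; the workload inside bench is a placeholder for the locale-dependent call being tested):

import time
from multiprocessing import Pool

def bench(_):
    # Placeholder workload; the real harness called the C function under
    # scrutiny (sprintf, strcoll, strtod, ...) in a tight loop.
    count = 0
    deadline = time.time() + 1.0
    while time.time() < deadline:
        float('3.14159')  # a string-to-number conversion
        count += 1
    return count

if __name__ == '__main__':
    single = bench(None)
    with Pool(7) as pool:
        parallel = sum(pool.map(bench, range(7)))
    # Ideal scaling is 700%; much less means the call serializes across cores.
    print('%d%% (1:%d/7:%d)' % (100 * parallel // single, single, parallel))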
In the end it probably was not the locale at all. I changed Lua's number conversion from sprintf to some simplified hand-written code and everything works fine so far.
BTW, the first benchmark was run on a Linux desktop and showed nothing this strange; I was surprised by the FreeBSD behaviour.
To avoid locales in string comparison, change strcoll to strcmp in lvm.c. To avoid locales in string-to-number conversions, change the definition of lua_str2number in luaconf.h to avoid strtod. (Note however that supplying your own strtod is not an easy task.) You can also remove trydecpoint in llex.c.

Precision of reals through writeln/readln in Delphi

My client's application exports and imports quite a few variables of type real through a text file using writeln and readln. I've tried to increase the width of the written fields, so the code looks like:
writeln(file, exportRealvalue:30); //using excess width of field
....
readln(file, importRealvalue);
When I export, then import, and then export again and compare the files, I get a difference in the last two digits, e.g. (I might be off on the actual number of digits here, but you get the idea):
-1.23456789012E-0002
-1.23456789034E-0002
This actually makes a difference in the app, so the client wants to know what I can do about it. Now, I'm not sure it's only the write/read that does it, but I thought I'd throw a quick question out there before I dive into the haystack again. Do I need to go binary on this?
This is not an app dealing with currency or anything like that; I just write and read the values to/from a file. I know floating point is a bit strange sometimes, and I thought one of the routines (writeln/readln) might have some funny business going on.
You might try switching to Extended for greater precision. As was pointed out, though, floating-point numbers only have so many significant digits of precision, so it is still possible to display more digits than are accurately stored, which could result in the behavior you described.
From the Delphi help:
Fundamental Win32 real types
| Type     | Range                                         | Significant digits | Size in bytes |
|----------|-----------------------------------------------|--------------------|---------------|
| Real     | -5.0 x 10^–324 .. 1.7 x 10^308                | 15-16              | 8             |
| Real48   | -2.9 x 10^–39 .. 1.7 x 10^38                  | 11-12              | 6             |
| Single   | -1.5 x 10^–45 .. 3.4 x 10^38                  | 7-8                | 4             |
| Double   | -5.0 x 10^–324 .. 1.7 x 10^308                | 15-16              | 8             |
| Extended | -3.6 x 10^–4951 .. 1.1 x 10^4932              | 10-20              | 10            |
| Comp     | -2^63+1 .. 2^63–1                             | 10-20              | 8             |
| Currency | -922337203685477.5808 .. 922337203685477.5807 | 10-20              | 8             |
Note: The six-byte Real48 type was called Real in earlier versions of Object Pascal. If you are recompiling code that uses the older, six-byte Real type in Delphi, you may want to change it to Real48. You can also use the {$REALCOMPATIBILITY ON} compiler directive to turn Real back into the six-byte type. The following remarks apply to fundamental real types.
Real48 is maintained for backward compatibility. Since its storage format is not native to the Intel processor architecture, it results in slower performance than other floating-point types.
Extended offers greater precision than other real types but is less portable. Be careful using Extended if you are creating data files to share across platforms.
Notice that the range allows far more digits than are significant, so you can have a number with more digits than can be accurately stored. I would recommend rounding to the significant digits to prevent that from happening.
If you want to specify the precision of a real with a WriteLn, use the following:
WriteLn(RealVar:12:3);
It outputs the value RealVar with a total width of at least 12 characters and a precision of 3 decimal places.
When using floating-point types, you should be aware of the precision limitations of the specified types. A 4-byte IEEE 754 type, for instance, has only about 7.5 significant digits of precision. An 8-byte IEEE 754 type has roughly double the number of significant digits. Apparently, the Delphi real type has a precision that lies around 11 significant digits. The result of this is that any extra digits of formatting you specify are likely to be noise resulting from the conversions between base-10 formatted values and base-2 floating-point values.
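The round-trip effect itself is easy to reproduce outside Delphi. A small illustration (in Python, since the behaviour is a property of binary floating point rather than of any particular language):

# A 64-bit IEEE 754 double needs up to 17 significant decimal digits to
# round-trip exactly; fewer digits lose the trailing bits, which is what
# the two exported values in the question show.
x = -1.2345678901234567e-02

short = '%.11E' % x   # 12 significant digits, similar to the output above
full = '%.16E' % x    # 17 significant digits

print(short, float(short) == x)  # False: the text form lost information
print(full, float(full) == x)    # True: enough digits to round-trip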
First of all, I would try to see if I could get any help from using Str with different arguments or from increasing the precision of the types in your app. (Have you tried using Extended?)
As a last resort (warning: workaround!), I'd try saving the customer's string representation along with the binary representation in a sorted list. Before writing back a floating-point value, I'd check whether there already is a matching value in the table whose string representation is known and can be used instead. To make this lookup quick, sort the list on the numeric value and use binary search to find the best match.
Depending on how much processing you need to do, an alternative could be to keep the numbers in BCD format to retain original accuracy.
It's hard to answer this without knowing what type your ExportRealValue and ImportRealValue are. As others have mentioned, the real types all have different precisions.
It's worth noting that, contrary to what one might expect, Extended is not always higher precision: Extended has 10-20 significant figures, where Double has 15-16. As you are having trouble around the tenth significant figure, perhaps you are using Extended already.
To get more control over the reading and writing, you can convert the numbers to and from strings yourself and write them to a file stream. At least that way you don't have to worry whether readln and writeln are up to no good behind your back.
