I have precise and validated descriptions of the behaviors of many X86 instructions in terms amenable to encoding in QF_ABV and solving directly with the standard solver (using no special solving strategies). I wrote an SMT-LIB script whose interface matches my ultimate goal perfectly:
X86State, a record sort describing x86 machine state (registers and flags as bitvectors, and memory as an array).
X86Instr, a record sort describing x86 instructions (enumerated mnemonics, operands as an ML-like discriminated union describing registers, memory expressions, etc.)
A function x86-translate taking an X86State and an X86Instr and returning a new X86State: it decodes the instruction and produces the new state in terms of the symbolic effects of that instruction on the input state.
It's great for prototyping: the user can write x86 instructions easily and directly. After simplifying a formula built using the library, all functions and extraneous data types are eliminated, leaving a QF_ABV expression. I hoped that users could simply (set-logic QF_ABV) and #include my script (alas, neither the SMT-LIB standard nor Z3 supports #include).
Unfortunately, by defining functions and types, the script requires theories such as uninterpreted functions, thus requiring a logic other than QF_ABV (or even QF_AUFBV, due to the types). My experience with SMT solvers dictates that the lowest acceptable logic should be specified for the best solving time. Also, it is unclear whether I can reuse my SMT-LIB script in a programmatic context (e.g. OCaml, Python, C) as I desire. Finally, the script is a bit verbose given the lack of higher-order functions, and my lack of access to par leads to code duplication.
Thus, despite having accomplished my technical goals, I think that SMT-LIB might be the wrong approach. Is there a more natural avenue for interacting with Z3 to implement my x86 instruction description / QF_ABV translation scheme? Is the SMT-LIB script re-usable at all in these avenues? For example, you can build "custom OCaml top-levels", i.e. interpreters with scripts "burned into them". Something like that could be nice. Or do I have to re-implement the functionality in another language, in a program that interacts with Z3 via a theory extension (C DLL)? What's the best option here?
Well, I don't think that people write .smt2 files by hand. These are usually generated automatically by some program.
I find the Z3 Python interface quite nice, so I guess you could give it a try. But you can always write a simple .smt2 dumper from any language.
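For what it's worth, here is a minimal sketch of what the Python route could look like (my own illustration, with hypothetical names like X86State and x86_translate and toy 'add' semantics, not the poster's actual library): the record sort becomes an ordinary Python class, the translation function becomes ordinary Python code, and only plain QF_ABV terms ever reach the solver, which sidesteps the datatype/uninterpreted-function and par issues.

    from z3 import BitVec, BitVecSort, Array, SolverFor

    class X86State:
        """Hypothetical machine-state record: a few 64-bit registers plus byte-addressable memory."""
        def __init__(self, tag):
            self.regs = {r: BitVec(f'{r}_{tag}', 64) for r in ('rax', 'rbx', 'rcx')}
            self.mem = Array(f'mem_{tag}', BitVecSort(64), BitVecSort(8))

    def x86_translate(pre, instr, tag):
        """Toy decoder: return a fresh state and the QF_ABV constraints relating it to `pre`."""
        mnemonic, dst, src = instr
        post = X86State(tag)
        cs = [post.mem == pre.mem]
        for r in pre.regs:
            if mnemonic == 'add' and r == dst:
                cs.append(post.regs[r] == pre.regs[dst] + pre.regs[src])
            else:
                cs.append(post.regs[r] == pre.regs[r])
        return post, cs

    s0 = X86State('0')
    s1, cs = x86_translate(s0, ('add', 'rax', 'rbx'), '1')
    solver = SolverFor('QF_ABV')    # the goal logic, stated explicitly
    solver.add(cs)
    solver.add(s1.regs['rax'] != s0.regs['rax'] + s0.regs['rbx'])  # sanity check: should be unsat
    print(solver.check())           # unsat

Here Python itself plays the role of the missing higher-order functions and par, so nothing but bitvector and array terms is ever emitted.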
BTW, do you plan to release the specification you wrote for X86? I would be really interested!
I'm using the Z3_parse_smtlib2_string function from the Z3 C API (via Haskell's Z3 lib) to parse an SMTLIB file and apply some tactics to simplify its content. However, I notice that any push, pop, and check-sat commands appear to be swallowed by this function and do not appear in the resulting AST.
Is there any way I can parse this without losing these commands (and then apply the required tactics, once again without losing them)?
I don't think it's possible to do this with Z3_parse_smtlib2_string. As you can see in the documentation "It returns a formula comprising of the conjunction of assertions in the scope (up to push/pop) at the end of the string." See: https://z3prover.github.io/api/html/group__capi.html#ga7905ebec9289b9fe5debcad965f6267e
Note that the reason for this is not merely that it is "not implemented" or "buggy." Look at the return type of the function you're using. It returns a Z3_ast_vector, and Z3_ast only captures "expressions" in the SMTLib language. But push/pop etc. are not considered expressions by Z3, but rather commands; i.e., they are represented differently internally. (Whether this was a conscious choice or a historical one is something I'm not sure about.)
I don't think there's a function that does what you're asking, i.e., one that returns both expressions and commands. You can ask at https://github.com/Z3Prover/z3/discussions to see if the developers can provide an alternative API, or if they already have something exposed to users that achieves this.
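For what it's worth, the Python binding's parse_smt2_string wraps the same C function and behaves the same way; a tiny sketch (the exact printed form and return type vary a little between Z3 versions):

    from z3 import parse_smt2_string

    script = """
    (declare-const x Int)
    (push)
    (assert (> x 0))
    (pop)
    (assert (< x 5))
    (check-sat)
    """
    # Only the assertions in scope at the end of the string survive; push/pop
    # and check-sat are handled by the parser but not represented in the result.
    print(parse_smt2_string(script))   # expected: [x < 5]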
The SMT-LIB standard allows arbitrary attributes but prescribes only a very few, e.g. :pattern. Z3, on the other hand, currently supports only a few selected attributes, and issues a warning for unrecognised ones.
Which attributes are supported, and what is their typical use case?
Attributes
:named: named terms can be included in unsat cores
:weight: heavier quantifiers make Z3 reach its quantifier instantiation depth threshold more quickly
:qid: identify quantifiers, e.g. when obtaining instantiation statistics
:pattern: syntactic hints for when to instantiate a quantifier in e-matching
:no-pattern: prevent Z3 from using certain terms when inferring patterns
:ex-act: seems like dead code, to be removed
:skolemid: specific to VCC/Boogie use
:lblneg, :lblpos: associated with Boogie labels to track the source of counter-examples
Notes
Information partly provided by Z3's very own Nikolaj Bjorner, see also Z3 issue #4536.
See also Z3 source file smt2parser.cpp
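For concreteness, here is a small z3py sketch (my own illustration, not part of the answer above) of how two of these attributes surface through the Python API: :named via assert_and_track and unsat cores, and :pattern/:qid/:weight via the corresponding ForAll keyword arguments.

    from z3 import Function, IntSort, Ints, ForAll, Solver, unsat

    f = Function('f', IntSort(), IntSort())
    x, a = Ints('x a')

    s = Solver()
    s.set(unsat_core=True)
    # SMT-LIB: (assert (! (= (f a) 1) :named A1))
    s.assert_and_track(f(a) == 1, 'A1')
    s.assert_and_track(f(a) == 2, 'A2')

    # SMT-LIB: (assert (forall ((x Int))
    #            (! (>= (f x) 0) :pattern ((f x)) :qid |f-nonneg| :weight 2)))
    s.add(ForAll([x], f(x) >= 0, patterns=[f(x)], qid='f-nonneg', weight=2))

    if s.check() == unsat:
        print(s.unsat_core())   # named assertions show up here, e.g. [A1, A2]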
I was entirely amazed by how Coq's parser is implemented, e.g.:
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
What amazes me is that the parser seems happy to accept any new lexeme introduced by a Notation command, and the subsequent parser is then able to parse any expression using it. This means the grammar must be context-sensitive, yet it is so flexible that it goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How should it work? Any materials or pointers would help; I am just trying to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to understand the general idea, not a specific implementation.
Indeed, this notation system is very powerful, and it was probably one of the reasons for Coq's success. In practice, it is a source of much complication in the source code. I think that @ejgallego should be able to tell you more about it, but here is a quick explanation:
At the beginning, Coq documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations, and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel), but notation commands receive special handling (basically, you have to wait for them to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler takes a document, passes it through the lexer, then the parser (parsing the full document in one go), and then hands an AST to the typer or other later stages, in Coq each command is parsed and evaluated separately. So there is no need to resort to complex contextual grammars...
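To make the mechanism concrete, here is a toy Python sketch (purely illustrative; Coq's real implementation via CAMLP5 is far more sophisticated): the rule table is mutable, a Notation-like command installs a new rule, and each later sentence is parsed with whatever rules exist at that moment.

    import re

    rules = []   # (compiled regex, expansion template); grows as "commands" are evaluated

    def define_notation(pattern, expansion):
        # e.g. define_notation('{x} +++ {y}', 'add({x}, {y})')
        regex = re.escape(pattern)
        for hole in re.findall(r'\{(\w+)\}', pattern):
            regex = regex.replace(re.escape('{' + hole + '}'), r'(?P<%s>\w+)' % hole)
        rules.append((re.compile(regex), expansion))

    def parse(sentence):
        # parse with whatever rules exist *right now*
        for regex, expansion in rules:
            m = regex.fullmatch(sentence)
            if m:
                return expansion.format(**m.groupdict())
        return sentence   # no notation applies: leave the sentence unchanged

    print(parse('a +++ b'))                           # 'a +++ b'   (no rule installed yet)
    define_notation('{x} +++ {y}', 'add({x}, {y})')   # the "Notation" command runs...
    print(parse('a +++ b'))                           # 'add(a, b)' (...later sentences see the new rule)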
I'll drop my two cents to complement @Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both of them are the canonical sources for information about the parser; I'll try to summarize what I know, but note that my experience with the system is limited.
The CAMLPX system basically supports any LL(1) grammar. Coq exposes the whole set of grammar rules to the user, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module and unfolded in a later post-processing phase. And that is really it, AFAICT.
The system itself hasn't changed much over the whole 8.x series; @Zimmi48's comments relate more to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In the words of Hugo Herbelin, "the art of extensible parsing is a delicate one", and indeed it is, but Coq has achieved a pretty great implementation of it.
Is there a complete listing of all the theories/logics that z3 supports? I have consulted this SMTLIB Tutorial, which lists a number of logics, but I do not believe that the list is exhaustive. The z3 documentation itself doesn't seem to specify which logics are supported.
I ask because I have an smt file which cannot be solved under any of the logics in the SMTLIB Tutorial (when specified with 'set-logic'), but can be solved when no logic is specified.
For Z3, I have not seen such a list in the documentation, but you can find it in the source code if you really want to know. The list starts around line 65 of check_logic.cpp. I parsed out the list for you using a scary awk one-liner, and found this as of May 20, 2016 (between Z3 4.4.1 and 4.4.2):
"AUFLIA", "AUFLIRA", "AUFNIRA", "LRA", "QF_ABV", "QF_AUFBV", "QF_UFBV", "QF_AUFLIA", "QF_AX", "QF_BV", "QF_IDL", "QF_RDL", "QF_LIA", "QF_LRA", "QF_NIA", "QF_NRA", "QF_UF", "QF_UFIDL", "QF_UFLIA", "QF_UFLRA", "QF_UFNRA", "UFLRA", "UFNIA", "UFBV", "QF_S"
You can compare this to the official list of SMT-LIB 2 logics.
Probably more important for you is what the "best logic" is for your application. It sounds like you have a large and varying set of problems to which you want Z3 to apply whatever tactics it can. In that case, for now, it's best to leave the logic unspecified. The problem is that in SMT-LIB v2.0 there was no all-encompassing logic -- the largest logic by some standards was AUFNIRA, but this does not include, for example, bit vectors. As a result, CVC4 introduced a non-standard ALL_SUPPORTED logic, and Z3 performs best for some classes of problems when no logic is specified. This shortcoming of the SMT-LIB 2.0 standard is addressed in SMT-LIB 2.5 with a new logic called "ALL". However, this is not yet supported by either Z3 or CVC4.
You specify a logic in Z3 to ensure that Z3 uses a particular strategy and engine that is typically useful for the class of formulas expressed in this logic.
If no logic is specified, then Z3 falls back to a default mode. There is no logic corresponding to this default mode: it integrates multiple engines.
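The same distinction is visible from the z3py API; a minimal sketch (SolverFor and Solver are both standard z3py entry points):

    from z3 import Int, Solver, SolverFor

    x, y = Int('x'), Int('y')
    constraints = [x + y == 10, x > 0, y > 0]

    general = Solver()            # no logic: the default mode, integrating multiple engines
    general.add(constraints)
    print(general.check())        # sat

    lia = SolverFor('QF_LIA')     # like (set-logic QF_LIA): pins the QF_LIA-specific strategy
    lia.add(constraints)
    print(lia.check())            # sat, via the logic-specific strategy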
I encountered a problem while doing my student research project. I'm an electrical engineering student, but my project has something to do with theoretical computer science: I need to parse a lot of Pascal source-code files for type definitions and constants and visualize all occurrences. The type definitions are spread recursively over various files, i.e. there is a type a = byte in file x; in file y there is a record (struct) b that contains type a; and then there is even a type c in file z that is an array of type b.
My idea so far was to learn about compiler construction, since the compiler has to resolve all type definitions and break them down into the elemental types.
So I've read about compiler construction in two books (one of which is even written by Pascal's inventor), but I'm lacking so many basics of theoretical computer science that it took me one week alone to work my way halfway through. What I've learned so far is that for achieving my goal, a lexer and a parser should be sufficient. Since this software is only a really small part of the whole project, I can't spend so much time on it, so I started experimenting with flex and later with ANTLR.
My hope was that parsing for type definitions only would be such an easy task that I could manage it using only a scanner and let it do some of the parser's work: the Pascal files consist of 5 main parts, each one optional: a header with comments, a const section, a type section, a var section and (only in rare cases) a code section. Each section has a start identifier but no clear end identifier. So I started searching for the start of the type and const sections (TYPE, CONST), discarding everything else. In flex, this is fairly easy, because it allows "start conditions". They can be used as various states like "INITIAL", "TYPE-SECTION", "CONST-SECTION" and "COMMENT", with different rules for each state. I wanted to get back a string from the scanner with the following syntax: " = ". There was one thing that made this task difficult: some type definitions contain comments, as in this example: AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;. The scanner cannot extract such a type definition with a regular expression.
My next step was to do it properly with a scanner AND a parser, so I searched for a parser generator and found ANTLR. Since I am writing the tool in C# anyway, I decided to use its scanner generator too, so that I do not have to communicate between different programs. Now I encountered the following problem: AFAIK, ANTLR does not support "start conditions" as flex does. That means I have to scan the whole file (okay, comments still get discarded) and get a lot of unnecessary (and wrong) tokens. Because I don't use rules for the whole Pascal grammar, the scanner would identify most keywords of the Pascal syntax as user identifiers, and the parser would complain about all those series of tokens that do not fit the type and constant definitions.
Now, finally, my question(s): can any of you tell me which approach leads anywhere for my project? Is there a possibility to scan only parts of the source files with ANTLR? Or do I have to combine flex with ANTLR for that purpose? Can I tell ANTLR's parser to ignore every token that is not in the const or type section? Are those tools too powerful for my task, and should I write my own routines instead?
You'd be better off finding a compiler for Pascal and simply modifying it to report the information you want. Presumably there is such a compiler for your Pascal dialect, and often the source code for such compilers is available.
Otherwise you essentially need to build a parser. Building a lexer and then hacking around with the resulting lexemes is essentially building a bad parser by ad hoc methods. ANTLR is a good way to go; you can define the lexemes (including means to pick up and ignore comments) pretty easily, especially for older dialects of Pascal. You'll need good BNF rules for the type information that you want, and to translate those rules to the parser generator. What you can do to minimize work is to cheat on the rules for the parts of the language you don't care about. For instance, you could write an accurate subgrammar for assignment statements. But since you don't care about them, you can instead write a sloppy subgrammar that treats an assignment statement as anything that begins with an identifier, is followed by arbitrary other tokens, and ends with a semicolon. This kind of grammar is called an "island grammar"; it is only accurate where it needs to be accurate.
I don't know about the recursive bit. Is there a reason you can't just process each file separately? The answer may depend on what information you want to know about each type declaration, and if you go deep enough, you may need a symbol table as well as an island parser. Parser generators offer you no help for this.
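To make the island-grammar suggestion concrete, here is a rough Python sketch (illustrative only; the regex-based "grammar" and the sample input are mine, and a real tool would express the same idea as ANTLR rules): everything outside the TYPE/CONST sections is treated as water and skipped, and only name = definition; islands are parsed.

    import re

    SECTION = re.compile(r'\b(TYPE|CONST|VAR|BEGIN|IMPLEMENTATION)\b', re.IGNORECASE)
    COMMENT = re.compile(r'\{[^}]*\}|\(\*.*?\*\)', re.DOTALL)
    DECL = re.compile(r'(\w+)\s*=\s*([^;]+);')

    def extract_decls(source):
        """Return (section, name, definition) triples from TYPE/CONST sections only."""
        source = COMMENT.sub(' ', source)   # comments are water here; a fuller tool might keep them
        decls = []
        parts = SECTION.split(source)       # alternates: water, keyword, text, keyword, text, ...
        for keyword, text in zip(parts[1::2], parts[2::2]):
            if keyword.upper() in ('TYPE', 'CONST'):
                for m in DECL.finditer(text):
                    decls.append((keyword.upper(), m.group(1), m.group(2).strip()))
        return decls

    sample = """
    TYPE
      AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;
      c = array [1..10] of b;
    VAR
      x : integer;
    """
    print(extract_decls(sample))
    # [('TYPE', 'AuEingangsBool_t', 'MAX_AuEingangsFeld'), ('TYPE', 'c', 'array [1..10] of b')]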
First, there can be type and const blocks within other blocks (procedures, in later Delphi versions also classes).
Moreover, I'm not entirely sure that you can actually simply scan for a const token and then start parsing. Const is also used for other purposes in the most common (Borland) Pascal dialects. Some keywords can be reused in a different context, and if you don't parse the global block structure and only look for const and type in specific places, you will erroneously start parsing there.
A basic problem, of course, is the comments. Scanners cut out comments as early as possible and don't consider them further. You probably have to set up the scanner so that comments are attached to the adjacent tokens as a field (associate them with the preceding token, or save them up until a certain token follows).
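A rough sketch of that comment-attachment idea (my own illustration, not flex or ANTLR specifics): the scanner hangs each {...} comment off the preceding token instead of discarding it, so the {PCMON} from the example above survives into the parser.

    import re

    # Toy scanner: identifiers, '=' and ';' are tokens; {...} comments get attached.
    TOKEN = re.compile(r'\{[^}]*\}|[A-Za-z_][A-Za-z0-9_]*|[=;]')

    def scan(text):
        tokens = []
        for lexeme in TOKEN.findall(text):
            if lexeme.startswith('{'):                      # a Pascal {...} comment
                if tokens:
                    tokens[-1]['comments'].append(lexeme)   # attach to the preceding token
            else:
                tokens.append({'text': lexeme, 'comments': []})
        return tokens

    for tok in scan('AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;'):
        print(tok)
    # {'text': 'AuEingangsBool_t', 'comments': ['{PCMON}']} ... etc.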
As far as ANTLR vs. flex goes, no clue. The only parser generator I have some minor experience parsing Pascal with is Coco/R (a parser generator popular among Wirthians), but in general I (and many Pascalians) prefer hand-coded parsers.