I'm trying to write a scheme parser as an exercise, but I'm not sure how to implement the vector syntax. The specification seems to indicate that the way to input a vector is '#(1 2 3) and such a vector should evaluate to #(1 2 3). However, when I experimented in several Scheme implementations, #(1 2 3) was accepted as valid input as well. Is this a universally implemented extension to the spec?
According to the R5RS spec, vectors must be quoted:
Note that this is the external representation of a vector, not an expression evaluating to a vector. Like list constants, vector constants must be quoted:
'#(0 (2 2 2 2) "Anna")
Some implementations have chosen to allow evaluation of unquoted vectors, but any such extensions are non-standard, and a "feature" of that particular implementation.
For what it's worth, the upcoming R7RS spec explicitly requires vectors to self-evaluate:
Vector constants are self-evaluating, so they do not need to be quoted in programs.
Not universal at all. Chicken, for example, will not allow you to evaluate an unquoted vector.
#;1> #(1 2 3)
Error: illegal non-atomic object: #(1 2 3)
#;1> '#(1 2 3)
#(1 2 3)
That's the funny thing about standards and implementations: an implementation can give you lots of extra features as long as they don't collide with the spec. So some implementations chose to make #(1 2 3) self-evaluating, and that's fine because an unquoted vector can never appear in an R5RS-conforming program anyway. In R5RS, running a non-conforming program like
(define test "hello")
(string-set! test 0 #\H)
is undefined. It could fail, silently leave test unchanged, or change test. The same goes for R6RS, where an implementation might signal an error, but it still doesn't have to.
AFAIK Racket has the strictest R5RS implementation, but even it still has some extras. E.g. it allows symbols that are not permitted by the spec, and it defines extra symbols in its report-environment to bootstrap R5RS in Racket's module system. Still, like every implementation, it has small deviations when given an R5RS-conforming program.
I'm reading Crafting Interpreters. It's very readable. Now I'm reading chapter 17, Compiling Expressions, which introduces Vaughan Pratt’s “top-down operator precedence parsing”. The implementation is very brief and I don't understand why it works.
So I read Vaughan Pratt’s “top-down operator precedence parsing” paper. It's quite old and not easy to read. I also read related blogs about it and spent days on the original paper.
Related blogs:
https://abarker.github.io/typped/pratt_parsing_intro.html
https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
I am now more confident that I can write an implementation. But I still can't see the
trick behind the magic. Here are some of my questions, if I can describe them clearly:
What grammar can Pratt’s parser handle? Floyd's operator grammar? I even took a look
at Floyd's paper, but it's very abstract. Or can Pratt’s parser handle any language
as long as it meets the restrictions on page 44:
These restrictions on the language, while slightly irksome,...
On page 45, Theorem 2, Proof:
First assign even integers (to make room for the following interpolations) to the data type classes.
Then to each argument position assign an integer lying strictly (where possible) between the integers
corresponding to the classes of the argument and result types.
On page 44:
The idea is to assign data types to classes and then to totally order
the classes.
An example might be, in ascending order, Outcomes (e.g., the pseudo-result of “print”), Booleans,
Graphs (e.g. trees, lists, plexes), Strings, Algebraics (e.g. integers, complex nos, polynomials)...
I can't figure out what the term "data type" means in this paper. If it means primitive
data types in a programming language, like boolean, int, char in Java, then the following example
may be a counterexample:
1 + 2 * 3
For +, its argument type is number; say we assign the integer 2 to the number data type class. +'s result data type is
also a number, so + must get the integer 2. But the same holds for *. In this way + and * would have the same
binding power.
I guess "data type" in this paper means the AST node type. So +'s result type would be term and *'s result type would be factor,
which gets a bigger integer than +'s. But I can't be sure.
By data type, Pratt meant, roughly, "primitive data type in the programming language being parsed". (Although he included some things which are often not thought about as types, unless you program in C or C++, and I don't think he contemplated user-defined types.)
But Pratt is not really talking about a parsing algorithm on page 44. What he's talking about is language design. He wants languages to be designed in such a way that a simple semantic rule involving argument and result types can be used to determine operator precedence. He wants this rule, apparently, in order to save the programmer the task of memorizing arbitrary operator precedence tables (or to have to deal with different languages ordering operators in different ways.)
These are indeed laudable goals, but I'm afraid that the train had already left the station when Pratt was writing that paper. Half a century later, it's evident that we're never going to achieve that level of interlanguage consistency. Fortunately, we can always use parentheses, and we can even write compilers which nag at you about not using redundant parentheses until you give up and write them.
Anyway, that paragraph probably contravened SO's no-opinions policy, so let's get back to Pratt's paper. The rule he proposes is that all of a language's primitive datatypes be listed in a fixed total ordering, with the hope that all language designers will choose the same ordering. (I use the word "dominates" to describe this ordering: type A dominates another type B if A comes later in Pratt's list than B. A type does not dominate itself.)
At one end of the ordering is the null type, which is the result type of an operator which doesn't have a return value. Pratt calls this type "Outcome", since an operator which doesn't return anything must have had some side-effect --its "outcome"-- in order to not be pointless. At the other end of the ordering is what C++ calls a reference type: something which can be used as an argument to an assignment operator. And he then proposes a semantic rule: no operator can produce a result whose type dominates the type of one or more of its arguments, unless the operator's syntax unambiguously identifies the arguments.
That last exception is clearly necessary, since there will always be operators which produce types subordinate to the types of their arguments. Pratt's example is the length operator, which in his view must require parentheses because the Integer type dominates the String and Collection types, so length x, which returns an Integer given a String, cannot be legal. (You could write length(x) or |x| (provided | is not used for other purposes), because those syntaxes are unambiguous.)
It's worth noting that this rule must also apply to implicit coercions, which is equivalent to saying that the rule applies to all overloads of a single operator symbol. (C++ was still far in the future when Pratt was writing, but implicit coercions were common.)
Given that total ordering on types and the restriction (which Pratt calls "slightly irksome") on operator syntax, he then proposes a single simple syntactic rule: operator associativity is resolved by eliminating all possibilities which would violate type ordering. If that's not sufficient to resolve associativity, it can only be the case that there is only one type between the result and argument types of the two operators vying for precedence. In that case, associativity is to the left.
Pratt goes on to prove that this rule is sufficient to remove all ambiguity, and furthermore that it is possible to derive Floyd's operator precedence relation from type ordering (and the knowledge about return and argument types of every operator). So syntactically, Pratt's languages are similar to Floyd's operator precedence grammars.
But remember that Pratt is talking about language design, not parsing. (Up to that point of the paper.) Floyd, on the other hand, was only concerned with parsing, and his parsing model would certainly allow a prefix length operator. (See, for example, C's sizeof operator.) So the two models are not semantically equivalent.
This reduces the amount of memorization needed by someone learning the language: they only have to memorize the order of types. They don't need to try to memorize the precedence between, say, a concatenation operator and a division operator, because they can just work it out from the fact that Integer dominates String. [Note 1]
Unfortunately, Pratt has to deal with the fact that "left associative unless you have a type mismatch" really does not cover all the common expectations for expression parsing. Although not every language complies, most of us would be surprised to find that a*4 + b*6 was parsed as ((a * 4) + b) * 6, and would demand an explanation. So Pratt proposes that it is possible to make an exception by creating "pseudotypes". We can pretend that the argument and return types of multiplication and division are different from (and dominate) the argument and return types of addition and subtraction. Then we can allow the Product type to be implicitly coerced to the Sum type (conceptually, because the coercion does nothing), thereby forcing the desired parse.
Of course, he has now gone full circle: the programmer needs to memorise not only the type ordering rules but also the pseudotype ordering rules, which are nothing but precedence rules in disguise.
The rest of the paper describes the actual algorithm, which is quite clever. Although it is conceptually identical to Floyd's operator precedence parsing, Pratt's algorithm is a top-down algorithm; it uses the native call stack instead of requiring a separate stack algorithm, and it allows the parser to interpolate code with production parsing without waiting for the production to terminate.
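To make that control flow concrete, here is a minimal sketch in Python. It is not the book's C implementation, and the token handling and binding-power values are my own invented illustration: each operator has a left binding power, parse(rbp) builds a left operand and then keeps absorbing operators only while their binding power exceeds rbp, recursing for each right operand.

# A minimal Pratt-style parser/evaluator: numbers, +, -, *, /, unary minus,
# and parentheses. The binding-power table plays the role of the operator
# precedence knowledge, consulted as each operator is encountered.
import operator, re

LBP = {"+": 10, "-": 10, "*": 20, "/": 20}      # left binding powers
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def tokenize(src):
    for number, other in re.findall(r"\s*(?:(\d+)|(\S))", src):
        yield ("num", int(number)) if number else ("op", other)
    yield ("end", None)

class Parser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.current = next(self.tokens)

    def advance(self):
        token, self.current = self.current, next(self.tokens)
        return token

    def parse(self, rbp=0):
        # "nud": how a token behaves at the start of a (sub)expression
        kind, value = self.advance()
        if kind == "num":
            left = value
        elif (kind, value) == ("op", "-"):       # unary minus binds tightly
            left = -self.parse(30)
        elif (kind, value) == ("op", "("):
            left = self.parse(0)
            assert self.advance() == ("op", ")")
        else:
            raise SyntaxError(f"unexpected token {value!r}")
        # "led": keep absorbing operators while they bind tighter than rbp
        while self.current[0] == "op" and LBP.get(self.current[1], 0) > rbp:
            _, op = self.advance()
            left = OPS[op](left, self.parse(LBP[op]))   # same bp => left assoc
        return left

print(Parser("1 + 2 * 3").parse())        # 7
print(Parser("(1 + 2) * -3").parse())     # -9

Passing LBP[op] unchanged when parsing the right operand gives left associativity; passing LBP[op] - 1 instead would make that operator right-associative.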
I think I've already deviated sufficiently from SO's guidelines in the tone of this answer, so I'll leave it at that, with no other comment about the relative virtues of top-down and bottom-up control flows.
Notes
1. Integer dominates String means that there is no implicit coercion from a String to an Integer, which is the same reason that the length operator needs to be parenthesised. There could be an implicit coercion from Integer to String. So the expression a divide b concatenate c must be parsed as (a divide b) concatenate c; a divide (b concatenate c) is disallowed, and so the parser can ignore the possibility.
I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammar in TatSu?
Although for some languages it is possible to write grammars that verify whether the same symbol appears in different places, such grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier
With those components in mind, checking whether a symbol has been previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled while the definition parts of the input are being parsed and queried while the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need the simplest grammar that allows for semantic actions that store and query the symbols. By raising FailedSemantics from within the semantic actions, any semantic errors will be reported exactly like the lexical and syntactical ones, so the user doesn't have to think about which component flagged each error.
If you use the Python code generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.
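Here is a rough sketch of what that can look like. The grammar below is invented and much simplified (it is not the real grammar for your language); the point is the semantics class, which fills a symbol table from the declaration rule and raises FailedSemantics from the use rule:

import tatsu
from tatsu.exceptions import FailedSemantics

# A deliberately simplified grammar, just enough to show the idea.
GRAMMAR = r'''
    @@grammar::Signals

    start = signalgroup pattern $ ;

    signalgroup = 'SignalGroup' '{' { decl } '}' ';' ;
    decl        = name:ident ':' dir:('In' | 'Out') [';'] ;

    pattern     = 'Pattern' '{' { vector } '}' ';' ;
    vector      = 'V' '{' { assign } '}' ;
    assign      = name:ident '=' value:/\d+/ [','] ;

    ident       = /[A-Za-z_]\w*/ ;
'''

class Semantics:
    def __init__(self):
        self.signals = set()

    def decl(self, ast):
        self.signals.add(ast.name)          # definition: record the symbol
        return ast

    def assign(self, ast):
        if ast.name not in self.signals:    # use: query the symbol table
            raise FailedSemantics(f"signal {ast.name!r} not declared")
        return ast

parser = tatsu.compile(GRAMMAR)
text = '''
    SignalGroup { A: In; B: Out };
    Pattern {
        V {A=1, B=1 }
        V {A=1, B=0 }
    };
'''
parser.parse(text, semantics=Semantics())   # fails cleanly if a signal is undeclared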
I am getting the user to input a function, e.g. y = 2x^2 + 3, as a string. What I am looking to do is to enter that string into TChart and for TChart to graph the function.
As far as I know, TChart/TeeChart will only accept X values that are assigned values, e.g. -10 to 10 for X, so the X value would need to be calculated each time - this isn't an issue.
The issue is getting each part of the inputted function and substituting the X-values into each part. The workaround I have found is to get the user to enter the degree for each part of the function, e.g. 2 for X^2, 3 for X^3, etc. but is there a cleaner way of doing this?
If I could convert the inputted string into a Mathematical formula which TeeChart would accept, that would be the ideal outcome.
Saying that you can't use external units effectively makes your question unanswerable in the SO format, because the topic is far broader (and deeper) than can comfortably be dealt with in SO's Q&A format. So the following is at best an outline:
If you want to, or have to, write a DIY expression evaluator, one way to do it is to proceed as follows:
Write yourself a class that takes a string as input and snips it up into a series of symbols, aka "tokens", which represent the component parts of the expression, e.g. numbers, operators, parentheses, names of functions, names of variables, etc. These tokens might themselves be records or class instances and need to include a mechanism for storing values associated with particular symbols (e.g. the tokens that represent numbers in the input). This step is called "tokenisation" or "lexing". Store the resulting symbols in a list or similar structure. This class needs to implement a mechanism to retrieve the next symbol from the list (usually this method is called something like "NextToken") and to indicate whether there are any symbols left. It also needs a mechanism to "put back" a symbol (or, equivalently, to "peek" at the symbol following the current one).
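This step is language-neutral, so here is a sketch of it in Python rather than Delphi; the token kinds and class names are my own choices, just to show the shape:

from dataclasses import dataclass

@dataclass
class Token:
    kind: str               # 'number', 'ident', 'op', 'lparen', 'rparen', 'end'
    text: str
    value: float = 0.0      # numeric value when kind == 'number'

class Tokenizer:
    def __init__(self, source):
        self.tokens = self._scan(source)
        self.pos = 0

    def _scan(self, s):
        tokens, i = [], 0
        while i < len(s):
            c = s[i]
            if c.isspace():
                i += 1
            elif c.isdigit() or c == '.':
                j = i
                while j < len(s) and (s[j].isdigit() or s[j] == '.'):
                    j += 1
                tokens.append(Token('number', s[i:j], float(s[i:j])))
                i = j
            elif c.isalpha():
                j = i
                while j < len(s) and s[j].isalnum():
                    j += 1
                tokens.append(Token('ident', s[i:j]))    # variable or function name
                i = j
            elif c in '+-*/^=':
                tokens.append(Token('op', c))
                i += 1
            elif c in '()':
                tokens.append(Token('lparen' if c == '(' else 'rparen', c))
                i += 1
            else:
                raise ValueError(f"unexpected character {c!r}")
        tokens.append(Token('end', ''))
        return tokens

    def next_token(self):       # the "NextToken" mechanism
        tok = self.tokens[self.pos]
        if tok.kind != 'end':
            self.pos += 1
        return tok

    def peek(self):             # look at the next symbol without consuming it
        return self.tokens[self.pos]

print(Tokenizer("y = 2*x^2 + 3").next_token())   # Token(kind='ident', text='y', value=0.0)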
Then, write yourself a software machine which takes the tokenised symbols and "evaluates" the list of symbols to produce the (mathematical) result you're after. This step is an order of magnitude or two more difficult than the tokenisation step. There are numerous ways to do it. As I said in a comment earlier, a recursive descent parser is probably the most tractable approach if you've never done anything like this before. There are countless examples in textbooks, but here's a link to an article about a Delphi implementation that should be understandable as an intro:
http://www8.umoncton.ca/umcm-deslierres_michel/Calcs/ParsingMathExpr-1.html
That article begins by noting that there are numerous pre-existing Delphi expression evaluators but makes the point that they are not necessarily the best place to start for someone wanting to learn how to write an evaluator/parser rather than just use one. Instead it goes through the coding of an evaluator to implement this simple expression grammar:
expression : term | term + term | term − term
term : factor | factor * factor | factor / factor
factor : number | ( expression ) | + factor | − factor
(the vertical bar | denotes ‘or’)
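As a language-neutral sketch (Python here, while the linked article uses Delphi), each nonterminal of that grammar becomes one function, and the functions call each other exactly as the grammar reads:

# Direct transcription of the expression/term/factor grammar above into
# mutually recursive functions. Each nonterminal becomes one function.
import re

def parse(source):
    tokens = re.findall(r'\d+(?:\.\d+)?|[-+*/()]', source) + ['<end>']
    pos = 0

    def peek():
        return tokens[pos]

    def take(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expression():                    # term | term + term | term - term
        value = term()
        while peek() in ('+', '-'):
            op = take()
            value = value + term() if op == '+' else value - term()
        return value

    def term():                          # factor | factor * factor | factor / factor
        value = factor()
        while peek() in ('*', '/'):
            op = take()
            value = value * factor() if op == '*' else value / factor()
        return value

    def factor():                        # number | ( expression ) | + factor | - factor
        tok = peek()
        if tok == '(':
            take('(')
            value = expression()
            take(')')
            return value
        if tok == '+':
            take()
            return +factor()
        if tok == '-':
            take()
            return -factor()
        return float(take())

    result = expression()
    take('<end>')
    return result

print(parse("2*3 + 4/(1+1)"))   # 8.0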
The article has a link to a second part which shows how to add exponentiation to the evaluator - this is trickier than it might sound and involves issues of ambiguity: e.g. how to evaluate - and what does it mean to write - an expression like
x^y^z
? This relates to the issue of "associativity": most operators are "left associative", meaning that an expression like a - b - c groups as (a - b) - c. The exponentiation operator is the usual example of the reverse: it binds more tightly to what's on its right, so x^y^z groups as x^(y^z).
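A tiny sketch of the usual fix (my own toy code, not the article's): the rule for '^' recurses into itself for its right operand, so the chain groups to the right:

# Minimal illustration of right-associative '^': the power rule recurses
# into itself for its right operand, so x^y^z groups as x^(y^z).
import re

def parse_power(tokens, pos=0):
    left = float(tokens[pos])
    pos += 1
    if pos < len(tokens) and tokens[pos] == '^':
        right, pos = parse_power(tokens, pos + 1)   # recurse to the right
        left = left ** right
    return left, pos

value, _ = parse_power(re.findall(r'\d+|\^', '2 ^ 3 ^ 2'))
print(value)   # 512.0, i.e. 2 ^ (3 ^ 2), not (2 ^ 3) ^ 2 = 64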
Have fun!
By the way, you used to see suggestions to implement an evaluator using the "shunting yard algorithm"
http://en.wikipedia.org/wiki/Shunting-yard_algorithm
to convert an "infix" expression where the operators are between the operands, as in 1 + 3 * 4, to RPN (reverse Polish notation), as used on older HP calculators. The reason to do that was that RPN makes for much more efficient evaluation of an expression than the infix equivalent. YMMV, but personally I found that implementing the shunting-yard algorithm properly was actually trickier than learning how to write an evaluator in the expression/term/factor style.
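For comparison, here is a minimal sketch of that shunting-yard conversion plus an RPN evaluator (Python again; the precedence values and names are illustrative):

# Minimal shunting-yard sketch: infix tokens -> RPN, then evaluate the RPN.
import operator, re

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}
OPS  = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def to_rpn(tokens):
    output, stack = [], []
    for tok in tokens:
        if tok in PREC:
            while stack and stack[-1] in PREC and PREC[stack[-1]] >= PREC[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()                       # discard the '('
        else:
            output.append(tok)                # a number
    while stack:
        output.append(stack.pop())
    return output

def eval_rpn(rpn):
    stack = []
    for tok in rpn:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

tokens = re.findall(r'\d+|[-+*/()]', '1 + 3 * 4')
print(to_rpn(tokens))              # ['1', '3', '4', '*', '+']
print(eval_rpn(to_rpn(tokens)))    # 13.0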
Fwiw, RPN is the basis of the Forth programming language, http://en.wikipedia.org/wiki/Forth_%28programming_language%29, so you could write a Forth implementation in Delphi if you wanted!
Is this type of error (using an identifier that has not been declared) produced during type checking, or when the input is being parsed?
Under what category should the error be classified?
The way I see it, it is a semantic error, because your program parses just fine even though you are using an identifier which you haven't previously bound; i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning, e.g. bindings, scoping, or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save some time and space, but I think today such an approach is questionable unless you have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each identifier is a semantic issue. Frank DeRemer called issues like this 'static semantics'.
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language, however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable, and in that case it will be a runtime error. The difference is simply where in the program's lifecycle the "AST_walk()" routine runs (compile time vs. runtime), depending on whether the language is compiled or interpreted.
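A toy sketch of that scripting-language case (the node shapes are invented, just to show where the error surfaces):

# Toy AST walker: the undeclared variable is only noticed when the node
# that references it is actually evaluated, i.e. at runtime.

def evaluate(node, env):
    kind = node[0]
    if kind == 'num':
        return node[1]
    if kind == 'var':
        name = node[1]
        if name not in env:
            raise NameError(f"undeclared variable {name!r}")   # a runtime error
        return env[name]
    if kind == '+':
        return evaluate(node[1], env) + evaluate(node[2], env)
    raise ValueError(f"unknown node kind {kind!r}")

ast = ('+', ('num', 1), ('var', 'x'))
print(evaluate(ast, {'x': 41}))    # 42
evaluate(ast, {})                  # NameError, raised only when the walk reaches 'x'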
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols into a symbol table, so that the parser can use this information as it parses.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>
#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif
int main() {
int b = 3;
printf("%g\n", (a)-b);
return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses.)
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
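The check the parser actually performs is a stack discipline rather than a fixed CFG rule, roughly like this sketch (a deliberately naive tag matcher, not a real XML parser):

# Sketch of the "tag stack" check: push each start tag and require every
# end tag to match the most recent unclosed start tag.
import re

def check_tags(document):
    stack = []
    for close, name in re.findall(r'<(/?)([A-Za-z][\w-]*)[^>]*>', document):
        if not close:
            stack.append(name)
        elif not stack or stack.pop() != name:
            raise ValueError(f"mismatched end tag </{name}>")
    if stack:
        raise ValueError(f"unclosed tag <{stack[-1]}>")

check_tags('<block>Hello, world</block>')       # fine
try:
    check_tags('<block>Hello, world</blob>')
except ValueError as err:
    print(err)                                  # mismatched end tag </blob>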
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems there is a common misconception that because the syntactic analyzer uses a context-free grammar parser, syntactic analysis is restricted to parsing a context-free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalized SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just one part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyser needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or whether a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.
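That symbol-table feedback into the lexer/parser is sometimes called the "lexer hack". A toy sketch of the idea (not a real C front end; names are invented): the token kind handed to the grammar is looked up in a typedef table that earlier declarations have filled in.

# Toy illustration of the C case above: the token stream handed to the
# grammar depends on a typedef table, so "(a)-b" is seen either as
# "(TYPE_NAME)-b" (a cast) or "(IDENTIFIER)-b" (a subtraction).
import re

def lex(source, typedef_names):
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|[-+*/();]", source):
        if re.match(r"[A-Za-z_]", tok):
            kind = "TYPE_NAME" if tok in typedef_names else "IDENTIFIER"
        else:
            kind = tok
        tokens.append((kind, tok))
    return tokens

source = "(a)-b;"
print(lex(source, typedef_names=set()))     # 'a' is IDENTIFIER -> subtraction parse
print(lex(source, typedef_names={"a"}))     # 'a' is TYPE_NAME  -> cast parse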
A formula with quantifiers contains a declaration of a trans function. Z3 successfully finds the model and prints it. But it also prints models for functions like trans!1!4464, trans!7!4463, ... which are not used anywhere in the model. What are they? How can I disable this output?
Here is the query: http://dl.dropbox.com/u/444947/asyn_arbiter_bound_16.smt2
and here is Z3's output - http://dl.dropbox.com/u/444947/asyn_arbiter_bound_16_result.txt
Recall that the models returned by Z3 can be viewed as simple functional programs.
Your formula is in the UFBV fragment. Z3 uses several modules to decide this fragment. Each module converts a formula F into a "simpler" formula F', and produces a procedure that converts a model for F' into a model for F. We call these procedures "model converters". Model converters will, for example, eliminate interpretations for auxiliary functions and constants introduced in the F-to-F' conversion. Some auxiliary definitions can't be removed since they are used to give the interpretation of other definitions; we should view them as "auxiliary functions". Model converters also create an interpretation for symbols that have been eliminated.
In your example, the following happens: the last module produces a model that contains the trans!... and k!... symbols. This model is for a formula that does not even contain trans; the function trans has been eliminated. As we apply the model converters, the interpretation for trans is constructed from the interpretations of all the trans!... functions. At this point, the trans!... and k!... symbols are still being used, since the interpretation of trans has references to all the trans!... symbols, and the interpretations of the trans!... functions have references to the k!... function symbols. At this step, there are no unnecessary symbols in the model. However, in a later step, the interpretation of trans is simplified by unfolding the definitions of trans!... and k!.... So, after this simplification step, trans!... and k!... are essentially "dead code".
That being said, the model returned by Z3 is correct, that is, it is a model for your formula. I acknowledge that these extra symbols are annoying and unnecessary. To eliminate them, we have to apply the equivalent of a "dead code" elimination step. We are really close to the next release, so this feature will not be available in the next release, but I will add it for the release following the next one.
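Until then, one possible user-side workaround (my sketch with the Z3 Python API, not a Z3 option) is to print only the model entries whose names don't contain the '!' used by the auxiliary symbols. It assumes the query linked above is saved locally:

# Print only the model entries for symbols that appear in the original
# query, skipping the auxiliary trans!.../k!... definitions.
from z3 import Solver, parse_smt2_file, sat

solver = Solver()
solver.add(parse_smt2_file("asyn_arbiter_bound_16.smt2"))
assert solver.check() == sat

model = solver.model()
for decl in model.decls():
    if "!" not in decl.name():          # Z3's auxiliary symbols contain '!'
        print(decl.name(), "=", model[decl])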