How to understand Pratt Parsing [closed]

I'm reading Crafting Interpreters. It's very readable. Now I'm on chapter 17, Compiling Expressions, and have run into the algorithm: Vaughan Pratt’s “top-down operator precedence parsing”. The implementation is very brief and I don't understand why it works. So I read Vaughan Pratt’s “top-down operator precedence parsing” paper. It's quite old and not easy to read. I've read related blog posts about it and spent days on the original paper.
Related blog posts:
https://abarker.github.io/typped/pratt_parsing_intro.html
https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
I am now more confident that I can write an implementation, but I still can't see the trick behind the magic. Here are some of my questions, if I can describe them clearly:
What grammar can Pratt’s parser handle? Floyd's operator grammar? I even took a look at Floyd's paper, but it's very abstract. Or can Pratt’s parser handle any language as long as it meets the restrictions on page 44:
"These restrictions on the language, while slightly irksome, ..."
On page 45, Theorem 2, Proof:
"First assign even integers (to make room for the following interpolations) to the data type classes. Then to each argument position assign an integer lying strictly (where possible) between the integers corresponding to the classes of the argument and result types."
On page 44:
"The idea is to assign data types to classes and then to totally order the classes. An example might be, in ascending order: Outcomes (e.g., the pseudo-result of “print”), Booleans, Graphs (e.g. trees, lists, plexes), Strings, Algebraics (e.g. integers, complex nos, polynomials)..."
I can't figure out what the term "data type" means in this paper. If it means primitive data types in a programming language, like boolean, int, char in Java, then the following example may be a counterexample:
1 + 2 * 3
For +, its argument type is number; say we assign the integer 2 to the number class. +'s result type is also a number, so + must get the integer 2. But the same holds for *. In this way + and * would have the same binding power.
I guess "data type" in this paper means the AST node type. So +'s result type would be term and *'s result type factor, which would get a bigger integer than +'s. But I can't be sure.

By data type, Pratt meant, roughly, "primitive data type in the programming language being parsed". (Although he included some things which are often not thought of as types, unless you program in C or C++, and I don't think he contemplated user-defined types.)
But Pratt is not really talking about a parsing algorithm on page 44. What he's talking about is language design. He wants languages to be designed in such a way that a simple semantic rule involving argument and result types can be used to determine operator precedence. He wants this rule, apparently, in order to save the programmer the task of memorizing arbitrary operator precedence tables (or having to deal with different languages ordering operators in different ways).
These are indeed laudable goals, but I'm afraid that the train had already left the station when Pratt was writing that paper. Half a century later, it's evident that we're never going to achieve that level of interlanguage consistency. Fortunately, we can always use parentheses, and we can even write compilers which nag at you about not using redundant parentheses until you give up and write them.
Anyway, that paragraph probably contravened SO's no-opinions policy, so let's get back to Pratt's paper. The rule he proposes is that all of a language's primitive datatypes be listed in a fixed total ordering, with the hope that all language designers will choose the same ordering. (I use the word "dominates" to describe this ordering: type A dominates another type B if A comes later in Pratt's list than B. A type does not dominate itself.)
At one end of the ordering is the null type, which is the result type of an operator that doesn't have a return value. Pratt calls this type "Outcome", since an operator which doesn't return anything must have some side effect (its "outcome") in order not to be pointless. At the other end of the ordering is what C++ calls a reference type: something which can be used as an argument to an assignment operator. And he then proposes a semantic rule: no operator can produce a result whose type dominates the type of one or more of its arguments, unless the operator's syntax unambiguously identifies the arguments.
That last exception is clearly necessary, since there will always be operators which produce types subordinate to the types of their arguments. Pratt's example is the length operator, which in his view must require parentheses because the Integer type dominates the String and Collection types, so length x, which returns an Integer given a String, cannot be legal. (You could write length(x) or |x| (provided | is not used for other purposes), because those syntaxes are unambiguous.)
It's worth noting that this rule must also apply to implicit coercions, which is equivalent to saying that the rule applies to all overloads of a single operator symbol. (C++ was still far in the future when Pratt was writing, but implicit coercions were common.)
Given that total ordering on types and the restriction (which Pratt calls "slightly irksome") on operator syntax, he then proposes a single simple syntactic rule: operator associativity is resolved by eliminating all possibilities which would violate type ordering. If that's not sufficient to resolve associativity, it can only be the case that there is only one type between the result and argument types of the two operators vying for precedence. In that case, associativity is to the left.
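To make the elimination rule concrete, here is a toy sketch (my own illustration, not code from the paper) that decides how x OP1 y OP2 z groups, using the type ordering and the coercion direction described in Note 1 below:

# Toy illustration of Pratt's elimination rule: a candidate grouping
# is rejected if it would feed an operator an argument that cannot be
# implicitly coerced to the declared argument type.

# Ascending total ordering of types: later entries dominate earlier ones.
ORDER = ["Outcome", "Boolean", "Graph", "String", "Integer"]

def dominates(a, b):
    return ORDER.index(a) > ORDER.index(b)

def coercible(src, dst):
    # Implicit coercion only runs "downhill", e.g. Integer -> String,
    # never String -> Integer (see Note 1).
    return src == dst or dominates(src, dst)

# operator -> (argument type, result type); both arguments share a type here.
OPS = {
    "divide":      ("Integer", "Integer"),
    "concatenate": ("String",  "String"),
}

def group(op1, op2):
    # ((x op1 y) op2 z): op1's result feeds op2's left argument.
    left_ok = coercible(OPS[op1][1], OPS[op2][0])
    # (x op1 (y op2 z)): op2's result feeds op1's right argument.
    right_ok = coercible(OPS[op2][1], OPS[op1][0])
    if left_ok and right_ok:
        return "left"                       # tie-break: associate left
    return "left" if left_ok else "right" if right_ok else "illegal"

print(group("divide", "concatenate"))       # left:  (a divide b) concatenate c
print(group("concatenate", "divide"))       # right: a concatenate (b divide c)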
Pratt goes on to prove that this rule is sufficient to remove all ambiguity, and furthermore that it is possible to derive Floyd's operator precedence relation from type ordering (and the knowledge about return and argument types of every operator). So syntactically, Pratt's languages are similar to Floyd's operator precedence grammars.
But remember that Pratt is talking about language design, not parsing. (Up to that point of the paper.) Floyd, on the other hand, was only concerned with parsing, and his parsing model would certainly allow a prefix length operator. (See, for example, C's sizeof operator.) So the two models are not semantically equivalent.
This reduces the amount of memorization needed by someone learning the language: they only have to memorize the order of types. They don't need to try to memorize the precedence between, say, a concatenation operator and a division operator, because they can just work it out from the fact that Integer dominates String. [Note 1]
Unfortunately, Pratt has to deal with the fact that "left associative unless you have a type mismatch" really does not cover all the common expectations for expression parsing. Although not every language complies, most of us would be surprised to find that a*4 + b*6 was parsed as ((a * 4) + b) * 6, and would demand an explanation. So Pratt proposes that it is possible to make an exception by creating "pseudotypes". We can pretend that the argument and return types of multiplication and division are different from (and dominate) the argument and return types of addition and subtraction. Then we can allow the Product type to be implicitly coerced to the Sum type (conceptually, because the coercion does nothing), thereby forcing the desired parse.
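Continuing the toy sketch above (it reuses ORDER, OPS, and group from the earlier block), the pseudotype trick amounts to nothing more than two extra table entries:

# Pseudotypes: Product dominates Sum, and the Product -> Sum coercion
# is allowed (conceptually a no-op). Their position relative to the
# real types is my arbitrary choice; only Product > Sum matters here.
ORDER += ["Sum", "Product"]
OPS["+"] = ("Sum", "Sum")
OPS["*"] = ("Product", "Product")

print(group("*", "+"))   # left:  (a * 4) + ...
print(group("+", "*"))   # right: ... + (b * 6)
# Net effect: a*4 + b*6 parses as (a*4) + (b*6), the conventional reading.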
Of course, he has now gone full circle: the programmer needs to memorise not only the type ordering rules but also the pseudotype ordering rules, which are nothing but precedence rules in disguise.
The rest of the paper describes the actual algorithm, which is quite clever. Although it is conceptually identical to Floyd's operator precedence parsing, Pratt's algorithm is a top-down algorithm; it uses the native call stack instead of requiring a separate stack algorithm, and it allows the parser to interpolate code with production parsing without waiting for the production to terminate.
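For readers coming from Crafting Interpreters, here is a minimal Pratt-style parser sketched in Python (my own illustration, not the book's C code): each infix operator carries a binding power, and the parser keeps consuming operators while the next one binds more tightly than the current context allows, recursing on the native call stack for right-hand operands.

# Minimal Pratt parser over a token list; numbers and infix operators only.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expr(tokens, rbp=0):
    left = tokens.pop(0)                  # a number stands for itself ("nud")
    while tokens and BINDING_POWER.get(tokens[0], 0) > rbp:
        op = tokens.pop(0)                # infix operator ("led")
        right = parse_expr(tokens, BINDING_POWER[op])
        left = (op, left, right)          # build an AST node
    return left

print(parse_expr(["1", "+", "2", "*", "3"]))
# ('+', '1', ('*', '2', '3')), i.e. 1 + (2 * 3): * binds tighter than +

Using the operator's own binding power as rbp makes operators left-associative; passing BINDING_POWER[op] - 1 instead would make that operator right-associative.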
I think I've already deviated sufficiently from SO's guidelines in the tone of this answer, so I'll leave it at that, with no other comment about the relative virtues of top-down and bottom-up control flows.
Notes
1. Integer dominates String means that there is no implicit coercion from a String to an Integer, which is the same reason that the length operator needs to be parenthesised. There could be an implicit coercion from Integer to String. So the expression a divide b concatenate c must be parsed as (a divide b) concatenate c; a divide (b concatenate c) is disallowed, so the parser can ignore that possibility.

Related

Parser AST - Advantage of Binary Expression vs Function

I've written a parser for a simple in-house SQL-style language. It's a typical recursive descent parser.
Naturally, we have expressions, and two of the possible forms of expressions I model are BinaryExpression and FunctionExpression. My question is, since a binary expression can be modelled as a function with two arguments, is there any advantage in keeping the distinction?
Perhaps function invocation is not normally modelled as an expression but as a statement, but here all my functions must produce a value.
How you choose to model your language is really up to you; it completely depends on how you intend to use the AST you construct.
Certainly there is no fundamental difference between evaluation of a binary operator and evaluation of a function with two arguments. On the other hand, there is a significant difference in the presentation (in most languages). Certain operators have very well understood properties which can be of use during static analysis, such as finding optimisations.
So both styles are certainly valid, and you will have to make the choice based on your knowledge of the intended use(s) of the AST.
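For concreteness, the two models might look like this (a hypothetical Python sketch; the class names are my own):

from dataclasses import dataclass
from typing import List

@dataclass
class BinaryExpression:
    op: str              # operator identity is explicit, which makes
    left: object         # patterns like x * 1 easy to spot in analysis
    right: object

@dataclass
class FunctionExpression:
    name: str            # uniform model: an operator is just a
    args: List[object]   # function applied to two arguments

# The same source fragment, a + b, under each model:
as_binary   = BinaryExpression("+", "a", "b")
as_function = FunctionExpression("+", ["a", "b"])

A common compromise is to parse into the binary form and desugar to calls in a later pass, so operator-specific information stays available to any analysis passes that want it.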

Handling unambiguous yet breaking syntax in expression parsing

Context
I've recently run into an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using ordinary grammar rules. I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs, not mathematical expressions), this causes a scenario where the generic rule will attempt to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail because it can't find the closing >.
What is the solution to such a scenario, especially in languages like C++ where generics can also contain expressions? If I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
"For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail because it can't find the closing >."
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
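A sketch of the lookahead variant (hypothetical helper; simplified so that it does not handle nested generics like x<y<z>>):

# tokens[pos] is '<'; scan ahead for a matching '>' enclosing only
# comma-separated type names, and only then take the generic rule.
def looks_like_generic(tokens, pos):
    i = pos + 1
    expect_name = True
    while i < len(tokens):
        tok = tokens[i]
        if expect_name:
            if not tok.isidentifier():    # must be a type name
                return False
            expect_name = False
        elif tok == ",":
            expect_name = True
        elif tok == ">":
            return True
        else:
            return False
        i += 1
    return False                          # ran out of input

print(looks_like_generic(["a", "<", "T", ",", "U", ">", "("], 1))  # True
print(looks_like_generic(["a", "<", "2", ">", "x"], 1))            # False: comparison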
However, that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic, since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C#, for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages, using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve the ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary, supporting expressions as generic arguments requires you either to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive), or to generate multiple possible ASTs. Without expressions you can keep it context-free and unambiguous, but you will still need backtracking or arbitrary lookahead (meaning it's LL(*)).
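The context-sensitive option can be as simple as consulting a scope table before committing to the generic rule (hypothetical sketch):

# Only enter the generic rule when the primary names a generic
# function currently in scope; otherwise treat '<' as a comparison.
generic_functions_in_scope = {"make_pair", "identity"}   # maintained while parsing

def at_generic_call(primary_name, next_token):
    return next_token == "<" and primary_name in generic_functions_in_scope

print(at_generic_call("make_pair", "<"))  # True:  parse a generic argument list
print(at_generic_call("a", "<"))          # False: parse '<' as a comparison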
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here's a lead toward an answer, but the full answer would amount to implementing it for you completely :-)
Reify the original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined in the modules Type, ParseTree (which extends Type), and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle for now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with shorter rules at specific points. The grammar will be ambiguous, but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except that in the original the rules are longer, as in prod(E, [E,+,E],...).
Then select the trees which serve you best for the goal of completion (the ones which use the #short tag) and extract their productions ("prod"), which look like this: prod(E,[E,+],...). For example, using the / operator: [candidate : /candidate:prod(_,_,/"short") := trees]; you could use the cursor position to find candidates which are nearby instead of all short trees in there.
Use list matching to find prefixes in the original grammar, as in if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ...; prefix is your query as extracted from the #short rules, predicted is your answer, and postfix is whatever would come after.
Yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (this will pretty-print it nicely, even if it's some complex regexp type, and does the quoting right, etc.).

Converting a function (as a string) to be graphed by TChart?

I am getting the user to input a function, e.g. y = 2x^2 + 3, as a string. What I am looking to do is to enter that string into TChart and for TChart to graph the function.
As far as I know, TChart/TeeChart will only accept X values that are assigned values, e.g. -10 to 10 for X, so the X value would need to be calculated each time - this isn't an issue.
The issue is getting each part of the entered function and substituting the X values into each part. The workaround I have found is to get the user to enter the degree of each part of the function, e.g. 2 for X^2, 3 for X^3, etc., but is there a cleaner way of doing this?
If I could convert the entered string into a mathematical formula which TeeChart would accept, that would be the ideal outcome.
Saying that you can't use external units effectively makes your question unanswerable in the SO format, because the topic is far broader (and deeper) than can comfortably be dealt with in SO's Q&A format. So the following is at best an outline:
If you want to, or have to, write a DIY expression evaluator, one way to do it is to proceed as follows:
Write yourself a class that takes a string as input and snips it up into a series of symbols, aka "tokens", which represent the component parts of the expression, e.g. numbers, operators, parentheses, names of functions, names of variables, etc. These tokens might themselves be records or class instances, and they need to include a mechanism for storing values associated with particular symbols (e.g. the tokens that represent numbers in the input). This step is called "tokenisation" or "lexing". Store the resulting symbols in a list or similar structure. The class needs to implement a mechanism to retrieve the next symbol from the list (usually this method is called something like NextToken) and to indicate whether there are any symbols left. It also needs a mechanism to "put back" a symbol (or, equivalently, to "peek" at the symbol following the current one).
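Sketched in Python rather than Delphi, purely to keep the illustration short, a regex-based tokeniser for simple arithmetic might look like this:

import re

# One named alternative per token kind; tokenising turns the input
# string into a list of (kind, text) pairs.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)   # integer or decimal literal
  | (?P<NAME>[A-Za-z_]\w*)      # variable or function name
  | (?P<OP>[-+*/^()])           # operators and parentheses
  | (?P<SKIP>\s+)               # whitespace: ignored
""", re.VERBOSE)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("2*x^2 + 3"))
# [('NUMBER', '2'), ('OP', '*'), ('NAME', 'x'), ('OP', '^'),
#  ('NUMBER', '2'), ('OP', '+'), ('NUMBER', '3')]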
Then write yourself a piece of software which takes the tokenised symbols and "evaluates" the list to produce the (mathematical) result you're after. This step is an order of magnitude or two more difficult than the tokenisation step. There are numerous ways to do it. As I said in a comment earlier, a recursive descent parser is probably the most tractable approach if you've never done anything like this before. There are countless examples in textbooks, but here's a link to an article about a Delphi implementation that should be understandable as an intro:
http://www8.umoncton.ca/umcm-deslierres_michel/Calcs/ParsingMathExpr-1.html
That article begins by noting that there are numerous pre-existing Delphi expression evaluators but makes the point that they are not necessarily the best place to start for someone wanting to learn how to write an evaluator/parser rather than just use one. Instead it goes through the coding of an evaluator to implement this simple expression grammar:
expression : term | term + term | term − term
term : factor | factor * factor | factor / factor
factor : number | ( expression ) | + factor | − factor
(the vertical bar | denotes ‘or’)
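That grammar translates almost line for line into mutually recursive functions. A sketch in Python (the article's code is Delphi; this one evaluates as it parses):

# One function per grammar rule; i is a one-element list so the
# nested helpers can advance the shared position.
def parse(tokens):
    toks, i = tokens, [0]

    def peek():
        return toks[i[0]] if i[0] < len(toks) else None

    def next_tok():
        i[0] += 1
        return toks[i[0] - 1]

    def expression():                    # term | term + term | term - term
        value = term()
        while peek() in ("+", "-"):
            op, rhs = next_tok(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():                          # factor | factor * factor | ...
        value = factor()
        while peek() in ("*", "/"):
            op, rhs = next_tok(), factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():                        # number | ( expression ) | +/- factor
        tok = next_tok()
        if tok == "(":
            value = expression()
            assert next_tok() == ")", "missing closing parenthesis"
            return value
        if tok == "+":
            return factor()
        if tok == "-":
            return -factor()
        return float(tok)

    return expression()

print(parse(["2", "*", "(", "3", "+", "4", ")"]))   # 14.0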
The article has a link to a second part which shows how to add exponentiation to the evaluator. This is trickier than it might sound and involves issues of ambiguity: e.g. how do you evaluate (and what does it even mean to write) an expression like
x^y^z
? This relates to the issue of "associativity": most operators are "left associative", which means that they bind more tightly to what's on their left than to what's on their right. The exponentiation operator is an example of the reverse, where the operator binds more tightly to what's on its right.
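Right associativity drops out naturally in this style: give ^ its own rule that recurses on itself for the right operand. Extending the sketch above (power would slot in between term and factor, with term calling power instead of factor):

    def power():                         # power := factor ('^' power)?
        base = factor()
        if peek() == "^":
            next_tok()
            return base ** power()       # recurse right: x^y^z = x^(y^z)
        return base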
Have fun!
By the way, you used to see suggestions to implement an evaluator using the "shunting yard algorithm"
http://en.wikipedia.org/wiki/Shunting-yard_algorithm
to convert an "infix" expression, where the operators are between the operands, as in 1 + 3 * 4, to RPN (reverse Polish notation), as used on older HP calculators. The reason to do that was that RPN makes for much more efficient evaluation of an expression than the infix equivalent. YMMV, but personally I found that implementing the SY algorithm properly was actually trickier than learning how to write an evaluator in the expression/term/factor style.
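For comparison, a compact shunting-yard sketch (simplified: left-associative binary operators and parentheses only):

# Convert infix tokens to RPN using an operator stack.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    out, stack = [], []
    for tok in tokens:
        if tok in PREC:
            while stack and stack[-1] != "(" and PREC[stack[-1]] >= PREC[tok]:
                out.append(stack.pop())   # pop tighter-binding operators
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()                   # discard the '('
        else:
            out.append(tok)               # operand
    out.extend(reversed(stack))           # flush remaining operators
    return out

print(to_rpn(["1", "+", "3", "*", "4"]))  # ['1', '3', '4', '*', '+']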
FWIW, RPN is the basis of the Forth programming language, http://en.wikipedia.org/wiki/Forth_%28programming_language%29, so you could write a Forth implementation in Delphi if you wanted!

Parser for user defined infix operators

I am writing an interpreter for a language where functions can be used as operators. However, a function's content will only be known at runtime.
For that I considered two solutions:
Parsing is done at runtime, using the runtime information on the function
All user-defined operators use default values for precedence and associativity.
I chose the latter, as I see a number of advantages in parsing separately from execution.
Now it comes to implementation, and I am interested to see what options there are. My initial thought is a shift-reduce parser, but I have little experience in constructing parsers.
Example:
LHS op RHS : LHS * RHS /* define a binary operator 'op' */
var : 3 /* define a variable */
print 5 op var /* should print 15 */
LHS op RHS : LHS / RHS /* Re-define op */
print var op var /* Should print 1 */
In the last case, the parser will get "id id id id" from the lexer. Only at runtime do I know that the 'op' id is an operator.
(Posting the results of the comments, as requested.)
Solution #1 is definitely ugly, complex to implement, and unneeded, I agree. Solution #2 is by far easier to implement and comprehend. You can also allow custom associativity and precedence for operators, as long as those are known statically. The main thing is that these facts are known at parse time.
As for actual parsing, most parsers will work just fine, as any two expressions surrounding an id are an application of a custom infix operator (this is less true if you allow custom precedence and associativity; in that case you need an algorithm that can determine those on a per-operator basis at parse time). Either way, my personal favourite is a "Top Down Operator Precedence Parser", or Pratt parser. I found the following resources (ordered by usefulness to me, YMMV) describe it quite well:
Simple Top-Down Parsing in Python
Pratt Parsers: Expression Parsing Made Easy
Top Down Operator Precedence
Two properties of the algorithm make it suit this problem very well:
The lookup of associativity ("binding power") happens dynamically for each token, allowing the parser to let users define precedence for their own operators.
It's very simple to write by hand[*], and you'll probably have to do that, as such a degree of dynamism is beyond the scope of most (at least all I know of) parser generators.
[*] I've personally written a parser for a very large subset of Pascal (lacking only case, multidimensional arrays, and perhaps some obscure subtleties) in 500 lines of Python and 2-3 days of work; the rest is only missing because other parts of the software it's used in were more interesting at the time and I didn't have a reason to implement the rest.
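A sketch of how the dynamic binding-power lookup might work for the asker's language (hypothetical; all user-defined operators share one default power, per solution #2):

# Pratt parsing where binding powers live in a table that grows as
# operator definitions are parsed.
DEFAULT_BP = 10
user_ops = {}                            # operator name -> binding power

def define_op(name, bp_value=DEFAULT_BP):
    user_ops[name] = bp_value            # called when a definition is parsed

def bp(token):
    return user_ops.get(token, 0)        # 0: not an operator, stop the loop

def parse_expr(tokens, rbp=0):
    left = tokens.pop(0)
    while tokens and bp(tokens[0]) > rbp:
        op = tokens.pop(0)
        left = (op, left, parse_expr(tokens, bp(op)))   # left-associative
    return left

define_op("op")                          # after parsing: LHS op RHS : LHS * RHS
print(parse_expr(["5", "op", "var"]))    # ('op', '5', 'var')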
