Parser AST - Advantage of Binary Expression vs Function - parsing

I've written a parser for a simple in-house SQL-style language. It's a typical recursive descent parser.
Naturally, we have expressions, and two of the possible forms of expressions I model are BinaryExpression and FunctionExpression. My question is, since a binary expression can be modelled as a function with two arguments, is there any advantage in keeping the distinction?
Perhaps function invocation is not normally modelled as an expression but as a statement, but here all my functions must produce a value.

How you choose to model your language is really up to you; it completely depends on how you intend to use the AST you construct.
Certainly there is no fundamental difference between evaluation of a binary operator and evaluation of a function with two arguments. On the other hand, there is a significant difference in the presentation (in most languages). Certain operators have very well understood properties which can be of use during static analysis, such as finding optimisations.
So both styles are certainly valid, and you will have to make the choice based on your knowledge of the intended use(s) of the AST.
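To make the equivalence concrete, here is a minimal sketch in Python. The node names are invented for illustration, not taken from the question; it shows that a binary node can always be rewritten mechanically into a two-argument call node, so the distinction is purely one of presentation:

```python
from dataclasses import dataclass

# Hypothetical AST node types for illustration.

@dataclass
class BinaryExpression:
    op: str          # e.g. "+", "AND"
    left: object
    right: object

@dataclass
class FunctionExpression:
    name: str        # e.g. "concat", or an operator symbol
    args: list

def to_function_form(node):
    """Rewrite every BinaryExpression as the equivalent two-argument call."""
    if isinstance(node, BinaryExpression):
        return FunctionExpression(node.op,
                                  [to_function_form(node.left),
                                   to_function_form(node.right)])
    if isinstance(node, FunctionExpression):
        return FunctionExpression(node.name,
                                  [to_function_form(a) for a in node.args])
    return node  # literals, identifiers, ...

expr = BinaryExpression("+", 1, BinaryExpression("*", 2, 3))
print(to_function_form(expr))
# FunctionExpression(name='+', args=[1, FunctionExpression(name='*', args=[2, 3])])
```

The reverse direction is where the dedicated node earns its keep: a pass that checks, say, associativity or commutativity can pattern-match on `BinaryExpression` directly instead of inspecting the name and arity of every call.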


How to understand Pratt Parsing

I'm reading Crafting Interpreters. It's very readable. Now I'm on
chapter 17, Compiling Expressions, and have come across the algorithm it uses:
Vaughan Pratt's "top-down operator precedence parsing". The implementation is very brief
and I don't understand why it works.
So I read Vaughan Pratt's original paper. It's quite old
and not easy to read. I've spent days on it, along with these related blog posts:
https://abarker.github.io/typped/pratt_parsing_intro.html
https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
I am now fairly confident that I could write an implementation. But I still can't see the
trick behind the magic. Here are some of my questions, if I can describe them clearly:
What grammar can Pratt's parser handle? Floyd's operator grammars? I even took a look
at Floyd's paper, but it's very abstract. Or can Pratt's parser handle any language
as long as it meets the restrictions on page 44:
These restrictions on the language, while slightly irksome,...
On page 45, Theorem 2, Proof:
First assign even integers (to make room for the following interpolations) to the data type classes.
Then to each argument position assign an integer lying strictly (where possible) between the integers
corresponding to the classes of the argument and result types.
On page 44,
The idea is to assign data types to classes and then to totally order
the classes.
An example might be, in ascending order, Outcomes (e.g., the pseudo-result of "print"), Booleans,
Graphs (e.g. trees, lists, plexes), Strings, Algebraics (e.g. integers, complex nos, polynomials)...
I can't figure out what the term "data type" means in this paper. If it means primitive
data types in a programming language, like boolean, int, char in Java, then the following
may be a counterexample:
1 + 2 * 3
For +, its argument type is number, so say we assign the even integer 2 to the number data type class. +'s result data type is
also a number, so + must get the integer 2. But the same goes for *. In this way + and * would have the same
binding power.
I guess "data type" in this paper means the AST node type. So +'s result type is term, and *'s result type is factor,
which would get a bigger integer than +'s. But I can't be sure.
By data type, Pratt meant, roughly, "primitive data type in the programming language being parsed". (Although he included some things which are often not thought about as types, unless you program in C or C++, and I don't think he contemplated user-defined types.)
But Pratt is not really talking about a parsing algorithm on page 44. What he's talking about is language design. He wants languages to be designed in such a way that a simple semantic rule involving argument and result types can be used to determine operator precedence. He wants this rule, apparently, in order to save the programmer the task of memorizing arbitrary operator precedence tables (or to have to deal with different languages ordering operators in different ways.)
These are indeed laudable goals, but I'm afraid that the train had already left the station when Pratt was writing that paper. Half a century later, it's evident that we're never going to achieve that level of interlanguage consistency. Fortunately, we can always use parentheses, and we can even write compilers which nag at you about not using redundant parentheses until you give up and write them.
Anyway, that paragraph probably contravened SO's no-opinions policy, so let's get back to Pratt's paper. The rule he proposes is that all of a language's primitive datatypes be listed in a fixed total ordering, with the hope that all language designers will choose the same ordering. (I use the word "dominates" to describe this ordering: type A dominates another type B if A comes later in Pratt's list than B. A type does not dominate itself.)
At one end of the ordering is the null type, which is the result type of an operator which doesn't have a return value. Pratt calls this type "Outcome", since an operator which doesn't return anything must have had some side-effect --its "outcome"-- in order to not be pointless. At the other end of the ordering is what C++ calls a reference type: something which can be used as an argument to an assignment operator. And he then proposes a semantic rule: no operator can produce a result whose type dominates the type of one or more of its arguments, unless the operator's syntax unambiguously identifies the arguments.
That last exception is clearly necessary, since there will always be operators which produce types subordinate to the types of their arguments. Pratt's example is the length operator, which in his view must require parentheses because the Integer type dominates the String and Collection types, so length x, which returns an Integer given a String, cannot be legal. (You could write length(x) or |x| (provided | is not used for other purposes), because those syntaxes are unambiguous.)
It's worth noting that this rule must also apply to implicit coercions, which is equivalent to saying that the rule applies to all overloads of a single operator symbol. (C++ was still far in the future when Pratt was writing, but implicit coercions were common.)
Given that total ordering on types and the restriction (which Pratt calls "slightly irksome") on operator syntax, he then proposes a single simple syntactic rule: operator associativity is resolved by eliminating all possibilities which would violate type ordering. If that's not sufficient to resolve associativity, it can only be the case that there is only one type between the result and argument types of the two operators vying for precedence. In that case, associativity is to the left.
Pratt goes on to prove that this rule is sufficient to remove all ambiguity, and furthermore that it is possible to derive Floyd's operator precedence relation from type ordering (and the knowledge about return and argument types of every operator). So syntactically, Pratt's languages are similar to Floyd's operator precedence grammars.
But remember that Pratt is talking about language design, not parsing. (Up to that point of the paper.) Floyd, on the other hand, was only concerned with parsing, and his parsing model would certainly allow a prefix length operator. (See, for example, C's sizeof operator.) So the two models are not semantically equivalent.
This reduces the amount of memorization needed by someone learning the language: they only have to memorize the order of types. They don't need to try to memorize the precedence between, say, a concatenation operator and a division operator, because they can just work it out from the fact that Integer dominates String. [Note 1]
Unfortunately, Pratt has to deal with the fact that "left associative unless you have a type mismatch" really does not cover all the common expectations for expression parsing. Although not every language complies, most of us would be surprised to find that a*4 + b*6 was parsed as ((a * 4) + b) * 6, and would demand an explanation. So Pratt proposes that it is possible to make an exception by creating "pseudotypes". We can pretend that the argument and return types of multiplication and division are different from (and dominate) the argument and return types of addition and subtraction. Then we can allow the Product type to be implicitly coerced to the Sum type (conceptually, because the coercion does nothing), thereby forcing the desired parse.
Of course, he has now gone full circle: the programmer needs to memorise not only the type ordering rules but also the pseudotype ordering rules, which are nothing but precedence rules in disguise.
The rest of the paper describes the actual algorithm, which is quite clever. Although it is conceptually identical to Floyd's operator precedence parsing, Pratt's algorithm is a top-down algorithm; it uses the native call stack instead of requiring a separate stack algorithm, and it allows the parser to interpolate code with production parsing without waiting for the production to terminate.
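As a concrete illustration of that top-down control flow (this is not Pratt's original code, and the binding powers are made up for the example), a minimal Pratt-style expression parser fits in a few lines. Note how the recursion on the native call stack replaces Floyd's explicit operator stack:

```python
import re

# Binding powers: higher binds tighter. These numbers are arbitrary;
# only their relative order matters.
BP = {"+": 10, "*": 20}

def tokenize(s):
    return re.findall(r"\d+|[+*()]", s)

def parse(tokens, min_bp=0):
    # "nud" step: consume a prefix (here just literals and parens).
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        tokens.pop(0)            # consume the closing ")"
    else:
        left = int(tok)
    # "led" loop: keep extending `left` while the next operator
    # binds more tightly than the caller's context.
    while tokens and tokens[0] in BP and BP[tokens[0]] > min_bp:
        op = tokens.pop(0)
        right = parse(tokens, BP[op])   # recurse with op's binding power
        left = (op, left, right)
    return left

print(parse(tokenize("1 + 2 * 3")))   # ('+', 1, ('*', 2, 3))
```

Using a strict `>` in the loop condition gives left associativity by default, which matches the "associativity is to the left" rule described above; right-associative operators would recurse with a slightly lower bound.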
I think I've already deviated sufficiently from SO's guidelines in the tone of this answer, so I'll leave it at that, with no other comment about the relative virtues of top-down and bottom-up control flows.
Notes
Integer dominates String means that there is no implicit coercion from a String to an Integer, the same reason that the length operator needs to be parenthesised. There could be an implicit coercion from Integer to String. So the expression a divide b concatenate c must be parsed as (a divide b) concatenate c. a divide (b concatenate c) is disallowed and so the parser can ignore the possibility.

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:
There are several situations where the identification of lexical input
elements is sensitive to the syntactic grammar context that is
consuming the input elements. This requires multiple goal symbols for
the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
[...]
DivPunctuator
RegularExpressionLiteral
[...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).
Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
As you note, context-sensitivity is usually abstracted in a grammar by a grammar with a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then the / and the = would be separate tokens, and so would g and i.
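A toy sketch of that tokenization difference follows. Everything here is simplified for illustration: the patterns are stand-ins for the real ECMAScript lexical grammar (no escapes, character classes, or whitespace handling), and the context is a single flag for the whole input rather than a per-token decision by the parser:

```python
import re

def next_token(src, pos, regexp_allowed):
    if src[pos] == "/":
        if regexp_allowed:
            # Naive regexp literal: scan to the next "/" plus any flags.
            end = src.index("/", pos + 1)
            m = re.match(r"[a-z]*", src[end + 1:])
            stop = end + 1 + m.end()
            return src[pos:stop], stop
        if src[pos:pos + 2] == "/=":
            return "/=", pos + 2            # compound assignment operator
        return "/", pos + 1                 # plain division
    m = re.match(r"[A-Za-z_]\w*|\d+|.", src[pos:])
    return m.group(), pos + m.end()

def tokenize(src, regexp_allowed):
    pos, out = 0, []
    while pos < len(src):
        tok, pos = next_token(src, pos, regexp_allowed)
        out.append(tok)
    return out

print(tokenize("a/=4/gi;", regexp_allowed=False))  # ['a', '/=', '4', '/', 'gi', ';']
print(tokenize("/=4/gi;", regexp_allowed=True))    # ['/=4/gi', ';']
```

The same characters produce a different number of tokens depending on the goal, which is exactly why the decision cannot be postponed to the parser after a context-free tokenization.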
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly called "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
Why not just use a single goal symbol like so:
InputElement ::
...
DivPunctuator
RegularExpressionLiteral
...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se, rather they're nonterminals. And in this context, they're right-hand-sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
...
[+Div] DivPunctuator
[~Div] RegularExpressionLiteral
...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most languages don't try to define a single symbol that derives all tokens (or input elements), let alone need to divide it into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.

Can this be parsed by a LALR(1) parser?

I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When it reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above) or to keep it identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?
Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
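In bison, switching to GLR is a one-line declaration. A sketch of what the relevant productions might look like (token and nonterminal names here are invented for illustration, not taken from the question's actual grammar):

```yacc
/* Sketch only: names are invented, not the questioner's grammar. */
%glr-parser

%%
expr    : IDENT
        | '[' IDENT args ']'                /* self-dispatch  */
        | '[' expr '.' IDENT args ']'       /* dispatch       */
        | expr '[' expr ',' expr ']'        /* string slicing */
        ;
args    : %empty
        | arglist
        ;
arglist : expr
        | arglist ',' expr
        ;
%%
```

With %glr-parser, bison splits the parse at the conflict point after `[a` `[` and lets both interpretations proceed until one of them fails, so the grammar can stay in this natural form.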
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficulty in creating a correct AST springs from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
That's all a bit hand-wavey since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is not really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Normally, if you want to describe your language, you are better off using a clear CFG than trying to describe it informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description and omitting that makes it harder for others to "be helpful".

Refactoring of decision trees using automatic learning

The problem is the following:
I developed an expression evaluation engine that provides an XPath-like language to the user so they can build expressions. These expressions are then parsed and stored as an expression tree. There are many kinds of expressions, including logic (and/or/not), relational (=, !=, >, <, >=, <=), arithmetic (+, -, *, /) and if/then/else expressions.
Besides these operations, the expression can have constants (numbers, strings, dates, etc) and also access to external information by using a syntax similar to XPath to navigate in a tree of Java objects.
Given the above, we can build expressions like:
/some/value and /some/other/value
/some/value or /some/other/value
if (<evaluate some expression>) then
<evaluate some other expression>
else
<do something else>
Since the then-part and the else-part of the if-then-else expressions are expressions themselves, and everything is considered to be an expression, then anything can appear there, including other if-then-else's, allowing the user to build large decision trees by nesting if-then-else's.
As these expressions are built manually and prone to human error, I decided to build an automatic learning process capable of optimizing these expression trees based on the analysis of common external data. For example: in the first expression above (/some/value and /some/other/value), if the result of /some/other/value is false most of the times, we can rearrange the tree so this branch will be the left branch to take advantage of short-circuit evaluation (the right side of the AND is not evaluated since the left side already determined the result).
Another possible optimization is to rearrange nested if-then-else expressions (decision trees) so the most frequent path taken, based on the most common external data used, will be executed sooner in the future, avoiding unnecessary evaluation of some branches most of the times.
Do you have any ideas on what would be the best or recommended approach/algorithm to use to perform this automatic refactoring of these expression trees?
I think what you are describing are compiler optimizations, a huge subject covering everything from:
inline expansion
deadcode elimination
constant propagation
loop transformation
Basically you have a lot of rewrite rules that are guaranteed to preserve the functionality of the code/xpath.
As for rearranging the nested if-else expressions, I don't think you need to resort to machine learning.
One (I think optimal) approach would be to use Huffman coding of your branches.
Take each path as a letter and we then encode them with Huffman coding and get a so called Huffman tree. This tree will have the least evaluations running on a (large enough) sample with the same distribution you made the Huffman tree from.
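A sketch of that construction (the branch names and frequency counts are made up for illustration; the standard library's heapq does the work). A branch's depth in the resulting tree is the number of tests needed to reach it, so frequent branches end up near the root:

```python
import heapq
import itertools

def huffman(freqs):
    """Build a Huffman tree from {label: weight}; returns nested tuples."""
    counter = itertools.count()  # tie-breaker so heap entries always compare
    heap = [(w, next(counter), label) for label, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # two least frequent subtrees...
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (a, b)))  # ...merge
    return heap[0][2]

def depths(tree, d=0):
    """Depth of each leaf = number of tests to reach that branch."""
    if isinstance(tree, tuple):
        out = {}
        for child in tree:
            out.update(depths(child, d + 1))
        return out
    return {tree: d}

freqs = {"branch_a": 50, "branch_b": 30, "branch_c": 15, "branch_d": 5}
print(depths(huffman(freqs)))
# branch_a (most frequent) sits closest to the root; branch_d deepest.
```

Mapping the tree back onto nested if-then-else's is then a straightforward rewrite: each internal node becomes a test that splits the remaining branches.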
If you have restrictions on the "evaluate some expression" expressions, or if they have different computational costs, etc., you probably need another approach.
And remember, as always when it comes to optimization you should be careful and only do things that really matter.

Code quotations and Expression trees

I wonder if there is any difference in how the two features are implemented under the hood? That is, aren't code quotations just built on top of the good old expression trees?
Thanks.
The two types are quite similar, but they are represented differently.
Quotations are designed in a more functional way. For example foo a b would be represented as a series of applications App(App(foo, a), b)
Quotations can represent some constructs that are available only in F# and using expression trees would hide them. For example there is Expr.LetRecursive for let rec declarations
Expression trees were first introduced in .NET 3.5. Back then expression trees could only represent C# expressions, so it wasn't possible to easily capture all F# constructs (quotations can capture any F# expression, including imperative ones).
Quotations are also designed to be easily processible using recursion. The ExprShape module contains patterns that allow you to handle all possible quotations with just 4 cases (which is a lot easier than implementing visitor pattern with tens of methods in C#).
When you have an F# quotation, you can translate it to C# expression tree using FSharp.Quotations.Evaluator. This is quite useful if you're using some .NET API that expects expression trees from F#. As far as I know, there is no translation the other way round.