Recursive descent parsing - from LL(1) up - parsing

The following simple "calculator expression" grammar (BNF) can be easily parsed with a trivial recursive-descent parser, which is predictive LL(1):
<expr> := <term> + <term>
| <term> - <term>
| <term>
<term> := <factor> * <factor>
| <factor> / <factor>
| <factor>
<factor> := <number>
| <id>
| ( <expr> )
<number> := \d+
<id> := [a-zA-Z_]\w*
This works because seeing the next token is always enough to know which rule to pick. However, suppose that I add the following rule:
<command> := <expr>
| <id> = <expr>
For the purpose of interacting with the calculator on the command line, with variables, like this:
calc> 5+5
=> 10
calc> x = 8
calc> 6 * x + 1
=> 49
Is it true that I cannot use a simple LL(1) predictive parser to parse <command> rules? I tried to write the parser for it, but it seems that I need to look more tokens ahead. Is the solution to use backtracking, or can I just implement LL(2) and always look two tokens ahead?
How do RD parser generators handle this problem (ANTLR, for instance)?

The problem with
<command> := <expr>
| <id> = <expr>
is that when you "see" <id> you can't tell whether it is the beginning of an assignment (second rule) or a <factor>. You will only know once you read the next token.
AFAIK ANTLR is LL(*) (and is also able to generate packrat parsers if I'm not mistaken), so it will probably handle this grammar by considering two tokens at once.
If you can play with the grammar, I would suggest that you either add a keyword for the assignment (e.g. let x = 8):
<command> := <expr>
| "let" <id> "=" <expr>
or use the = to signify evaluation:
<command> := "=" <expr>
| <id> "=" <expr>

I think there are two ways to solve this with a recursive descent parser: either by using (more) lookahead or by backtracking.
Lookahead
command() {
    if (currentToken() == id && lookaheadToken() == '=') {
        return assignment();
    } else {
        return expr();
    }
}
Backtracking
command() {
    savedLocation = scanLocation();
    if (accept( id )) {
        identifier = acceptedTokenValue();
        if (!accept( '=' )) {
            setScanLocation( savedLocation );
            return expr();
        }
        return new assignment( identifier, expr() );
    } else {
        return expr();
    }
}
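A minimal runnable sketch of the lookahead variant, in Python. Everything here is an assumption made for illustration: the `Calculator` class, the `tokenize` helper, and the token tuples are hypothetical names, and `expr` and `term` use loops so that chains such as `1 + 2 + 3` also parse, which goes slightly beyond the grammar as posted.

```python
import re

# One token per match: a number, an identifier, or any single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Return (kind, value) pairs, ending with an ("end", None) marker."""
    tokens = []
    for number, ident, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("num", int(number)))
        elif ident:
            tokens.append(("id", ident))
        elif op.strip():  # skip stray whitespace
            tokens.append(("op", op))
    tokens.append(("end", None))
    return tokens


class Calculator:
    def __init__(self):
        self.env = {}  # variable name -> value

    def run(self, line):
        self.tokens = tokenize(line)
        self.pos = 0
        result = self.command()
        self.expect(("end", None))
        return result

    def peek(self, offset=0):
        index = min(self.pos + offset, len(self.tokens) - 1)
        return self.tokens[index]

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, token):
        if self.advance() != token:
            raise SyntaxError(f"expected {token!r}")

    def command(self):
        # The only decision that needs two tokens: an id followed by '='.
        if self.peek()[0] == "id" and self.peek(1) == ("op", "="):
            name = self.advance()[1]
            self.advance()  # consume '='
            self.env[name] = self.expr()
            return None
        return self.expr()

    def expr(self):
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            if self.advance()[1] == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            if self.advance()[1] == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        kind, value = self.advance()
        if kind == "num":
            return value
        if kind == "id":
            return self.env[value]
        if (kind, value) == ("op", "("):
            inner = self.expr()
            self.expect(("op", ")"))
            return inner
        raise SyntaxError(f"unexpected token {value!r}")
```

Only `command` looks two tokens ahead; every other rule still decides on a single token, so the rest of the parser stays LL(1).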

The problem is that the grammar:
<command> := <expr>
| <id> = <expr>
is not a mutually-recursive procedure. For a recursive descent parser you will need to determine a non-recursive equivalent.
rdentato's post shows how to fix this, assuming you can play with the grammar. This PowerPoint presentation spells out the problem in a bit more detail and shows how to correct it:
http://www.google.com/url?sa=t&source=web&ct=res&cd=7&url=http%3A%2F%2Fxml.cs.nccu.edu.tw%2Fcourses%2Fcompiler%2Fcp2006%2Fslides%2Flec3-Parsing%26TopDownParsing.ppt&ei=-YLaSPrWGaPwhAK5ydCqBQ&usg=AFQjCNGAFrODJxoxkgJEwDMQ8A8594vn0Q&sig2=nlYKQVfakmqy_57137XzrQ

ANTLR 3 uses an "LL(*)" parser as opposed to an LL(k) parser, so it will look ahead until it reaches the end of the input if it has to, without backtracking, using a specially optimized deterministic finite automaton (DFA).

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make multiplication and division take precedence over addition and subtraction. How should I modify my grammar to implement operator precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
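Here ()* means zero or more times. This grammar is deterministic and unambiguous. The repetition yields a flat sequence of operands, so associativity is decided by how the parser combines them; the usual loop that folds each new operand into the running result gives left-associative trees. Multiplication and division share one priority level, and addition and subtraction share a lower one.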
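As a rough illustration, the grammar above can be turned into a small recursive descent evaluator in Python, with one function per rule and a `while` loop for each `()*`. The function name `evaluate` and the regex-based tokenizer are assumptions made for this sketch, not part of the original answer.

```python
import re


def evaluate(text):
    """Evaluate an expression using the exp/add/mul/term grammar."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def add():
        # add ::= mul (("+" | "-") mul)*
        value = mul()
        while peek() in ("+", "-"):
            if take() == "+":
                value += mul()
            else:
                value -= mul()
        return value

    def mul():
        # mul ::= term (("*" | "/") term)*
        value = term()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= term()
            else:
                value /= term()
        return value

    def term():
        # term ::= number | "(" exp ")"
        token = take()
        if token == "(":
            value = add()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

    result = add()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("1 + 2 * 3 + 4"))  # prints 11
```

Because `add` calls `mul` for each operand, every product is fully evaluated before it takes part in an addition, which is where the precedence comes from.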

How to rewrite this grammar for LL(1) parsing?

I am trying to write a top-down recursive-descent parser for a small language, and I am facing some issues with the assignment statements. Here is the grammar from the language specifications:
<assign_stmt> ::= <lvalue> <l_tail> ":=" <expr> ";"
<l_tail> ::= ":=" <lvalue> <l_tail>
| ""
<expr> ::= ....
#multiple layers between <expr> and <lvalue>, like <term>, <factor>, etc.
#in the end, <expr> can be a <lvalue>
| <lvalue>
so that the assignments can look like
a := b := 3;
c := d := e := f;
The grammar does not seem to be ambiguous, but it is causing me issues because <expr> can itself be a <lvalue>. When parsing <l_tail>, both production rules are equally valid and I don't know which one to pick. I tried various left-factorizations (see below), but so far, I have not been able to find a LL(1) grammar that works for me. Is it even possible here?
<assign_stmt> ::= <lvalue> ":=" <rest>
<rest> ::= <expr> ";"
| <lvalue> ":=" <l_tail>
Note that I could get around this issue by parsing <l_tail> and then looking for the ";" token. Depending on the result, I would know whether the last <lvalue> was actually an <expr> or not (without having to backtrack). However, I am learning here, and I would like to know the "right" way to overcome this problem.
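One workaround in the spirit of the note above can be sketched in Python: parse every operand of ":=" as an expression, and only once the next token is seen decide whether it was an assignment target. This is not an LL(1) rewrite of the grammar; it moves the lvalue check out of the grammar and into the parser. The expression language here (identifiers, integers, and "+") and all names such as `parse_assign` are assumptions made to keep the sketch small.

```python
import re


def parse_assign(text):
    """Parse e.g. `a := b := 3;` into (list of targets, value expression)."""
    tokens = re.findall(r":=|[A-Za-z_]\w*|\d+|[+;]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def operand():
        token = take()
        if token.isdigit():
            return ("int", int(token))
        if token[0].isalpha() or token[0] == "_":
            return ("lvalue", token)
        raise SyntaxError(f"unexpected token {token!r}")

    def expr():
        # Stand-in for the real <expr>: operands joined by "+".
        node = operand()
        while peek() == "+":
            take()
            node = ("+", node, operand())
        return node

    targets = []
    node = expr()
    if peek() != ":=":
        raise SyntaxError("expected ':='")
    while peek() == ":=":
        # Only now do we know the expression just parsed is a target.
        if node[0] != "lvalue":
            raise SyntaxError("assignment target must be an lvalue")
        targets.append(node[1])
        take()
        node = expr()
    if take() != ";":
        raise SyntaxError("expected ';'")
    return targets, node
```

With this shape the parser never backtracks and never needs more than one token of lookahead; the cost is that an input like `a + 1 := 2;` is rejected by an explicit check rather than by the grammar.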

Converting a parse tree to AST

Let me put the question first: can I trivially convert a parse tree implementing this particular grammar to an AST?
I was given this grammar to build a parse tree:
literal := INTEGER | FLOAT | TRUE | FALSE .
designator := IDENTIFIER { "[" expression0 "]" } .
op0 := ">=" | "<=" | "!=" | "==" | ">" | "<" .
op1 := "+" | "-" | "or" .
op2 := "*" | "/" | "and" .
expression0 := expression1 [ op0 expression1 ] .
expression1 := expression2 { op1 expression2 } .
expression2 := expression3 { op2 expression3 } .
expression3 := "not" expression3
| "(" expression0 ")"
| designator
| call-expression
| literal .
For this particular example:
func main() : void {
let a = 1 + 2 + 3 + 4;
}
My parser will generate (part of) the parse tree as such
EXPRESSION1
  EXPRESSION2
    EXPRESSION3
      LITERAL
        INTEGER(1)(lineNum:2, charPos:10)
  OP1
    ADD(lineNum:2, charPos:12)
  EXPRESSION2
    EXPRESSION3
      LITERAL
        INTEGER(2)(lineNum:2, charPos:14)
  OP1
    ADD(lineNum:2, charPos:16)
  EXPRESSION2
    EXPRESSION3
      LITERAL
        INTEGER(3)(lineNum:2, charPos:18)
  OP1
    ADD(lineNum:2, charPos:20)
  EXPRESSION2
    EXPRESSION3
      LITERAL
        INTEGER(4)(lineNum:2, charPos:22)
Notice how the branches under EXPRESSION1 go:
EXPRESSION2 + EXPRESSION2 + EXPRESSION2 + EXPRESSION2
so each + operator is not grouped with its two operands. It therefore seems to me that, in the AST conversion, I can't get an AST that aids 3-address IR code generation by simply pulling up the operator to replace the non-terminal EXPRESSION1.
To achieve this goal, the grammar I would have written for this language would look like this instead:
expression1 := expression2 | expression1 + expression2 (1)
expression2 := expression3 | expression2 * expression3 (2)
expression3 := literal (3)
so that the branches under EXPRESSION1 are only
EXPRESSION1 + EXPRESSION2
However, this grammar is not LL(1) because rules (1) and (2) are left-recursive.
This raises two questions: (1) what would be the most elegant and trivial way to convert this parse tree? (2) Was constructing the parse tree for this grammar a complete waste of time, so that I should have started out coding an AST instead?
I believe you wish to produce an AST like:
  ADD
 /   \
1    ADD
    /   \
   2    ADD
       /   \
      3     4
But you have probably noticed that your parse tree is actually a flat list, which is not an easy starting point for grouping operators with their operands in a single pass. Coding a more advanced parser that recognizes tree structure and applies grammar reductions is not a trivial task.
Hence, before you start doing that, you might wish to consider existing parser generators like yacc or ANTLR. You will probably need to rewrite your grammar so that its rules are centered on the operators instead of treating them as a recursion utility. A grammar that is not classic LL(1) might not be a big problem, as modern generators (like ANTLR) can handle LL(k) grammars with bigger lookahead, predicates, etc., and simply bypass issues of that type.
If you still insist on coding it manually, think about using a stack to store symbols and converting them to an AST node once an expression has been collected, but again, it is not a simple task.
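For the flat operand/operator list in the question, one possible manual conversion is a left fold over the children of EXPRESSION1. The Python sketch below assumes the children have already been reduced to an alternating list of operands and operator names; `fold_left` and `emit` are hypothetical helpers, and the fold produces a left-nested tree (matching left-to-right evaluation) rather than the right-nested tree drawn above.

```python
def fold_left(children):
    """Turn [a, op, b, op, c, ...] into nested (op, left, right) tuples."""
    node = children[0]
    for i in range(1, len(children), 2):
        node = (children[i], node, children[i + 1])
    return node


def emit(node, code):
    """Append three-address instructions for node to code; return its name."""
    if not isinstance(node, tuple):
        return str(node)
    op, left, right = node
    left_name = emit(left, code)
    right_name = emit(right, code)
    temp = f"t{len(code)}"
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp


ast = fold_left([1, "ADD", 2, "ADD", 3, "ADD", 4])
code = []
emit(ast, code)
print(code)  # prints ['t0 = 1 ADD 2', 't1 = t0 ADD 3', 't2 = t1 ADD 4']
```

Each tuple now pairs one operator with exactly two operands, which is the shape three-address code generation needs.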

How do you write a grammar for this arithmetic expression?

This is an arithmetic expression for my language: ADD 100 MUL 5 DIV 10 SUB 7 MOD 10 4
Where ADD = addition, SUB = subtraction, MUL = multiplication, DIV = division, MOD = modulo.
The expression above can also be rewritten in standard notation as 100 + (5 * (10 / (7 - (10 % 4)))), with parentheses included to mark the order of operations.
This is quite different from the standard notation because evaluation starts with the rightmost operation, MOD 10 4; its result is then used to evaluate the next operation, SUB 7 2, where 2 is the result of the modulo operation. Parentheses are not required for this grammar.
I got hold of the grammar for the standard notation from https://ruslanspivak.com/lsbasi-part6/; here it is:
<expr> := <term> ((ADD|SUB) <term>)*
<term> := <factor> ((MUL|DIV|MOD) <factor>)*
<factor> := integer
In my language, I am clueless in writing the grammar for arithmetic operations. Are modifications needed for the grammar above? Or do I need to write a completely new grammar?
I managed to solve this by analyzing the execution of each production in my code. To my surprise, I had forgotten to include an <expr> production in <factor>. After changing my code a bit by moving certain conditions, I was able to parse the sample expression above. This is the grammar for the arithmetic expression in my language:
<expr> := ((ADD|SUB) <term> <term>)* | <term>
<term> := ((MUL|DIV|MOD) <factor> <factor>)* | <factor>
<factor> := INTEGER | <expr>
The <expr> production in <factor> makes it possible to have multiple operations because it goes back to the start to parse the next operation.
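For comparison, the sample expression can also be read as plain prefix (Polish) notation, where every operator is followed by exactly two operands. The Python sketch below assumes the simpler grammar `<expr> := OP <expr> <expr> | INTEGER`, which is not the grammar posted above, and assumes DIV means integer division; `evaluate_prefix` is a hypothetical name.

```python
import operator

OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.floordiv,  # assumption: integer division
    "MOD": operator.mod,
}


def evaluate_prefix(text):
    """Evaluate a whitespace-separated prefix expression."""
    tokens = text.split()
    pos = 0

    def expr():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token in OPS:
            left = expr()   # first operand
            right = expr()  # second operand, possibly another operation
            return OPS[token](left, right)
        return int(token)

    value = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return value


print(evaluate_prefix("ADD 100 MUL 5 DIV 10 SUB 7 MOD 10 4"))  # prints 110
```

The recursion bottoms out at MOD 10 4 first, so the rightmost operation is evaluated before the ones that enclose it, as described in the question.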

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about design of a grammar and the relation of its operator's associativity. I'm a big fan of top-down, especially recursive descent, parsers and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
According to what I have read, some regard this grammar as "wrong" because it changes the operator associativity (left to right for those four operators), as shown by the parse tree growing to the right instead of the left. For a parser implemented with an attribute grammar, this might be true, since an L-attributed value must be created first and then passed to child nodes. However, when implementing a normal recursive descent parser, it is up to me whether to construct this node first and pass it to the child nodes (top-down), or to let the child nodes be created first and then add the returned values as children of this node (passed to this node's constructor) (bottom-up). There must be something I am missing here, because I don't agree with the statement that this grammar is "wrong"; it has been used in many languages, especially Wirthian ones. Usually (or always?) the texts that say so promote LR parsing over LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this ask how E + E + E would be parsed and find that it couldn't. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term which is not possible. Equivalently, think about deriving that expression from the start symbol, again not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative because an arbitrarily long string of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
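A short Python sketch of what that looks like: each `( ... )*` becomes a loop, and the loop keeps the tree built so far as the left child of the next operator node. The tuple representation `(op, left, right)` and the function name `parse` are assumptions made for illustration.

```python
import re


def parse(text):
    """Build a tree of (op, left, right) tuples from the EBNF grammar."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, term())  # previous tree becomes the left child
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = take()
            node = (op, node, factor())
        return node

    def factor():
        token = take()
        if token == "(":
            node = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("1 - 2 - 3"))  # prints ('-', ('-', 1, 2), 3)
```

The left operand is always in scope inside the loop, so no tree fragments have to be passed between routines.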
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree ( + (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it certainly doesn't hold for "-", but the grammar as posed can't distinguish).
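A quick numeric check of that point in Python, with arbitrarily chosen values: regrouping leaves a sum unchanged but changes a difference.

```python
a, b, c = 10, 4, 3

# Addition: both groupings give the same value.
add_left = (a + b) + c
add_right = a + (b + c)

# Subtraction: the grouping changes the result.
sub_left = (a - b) - c
sub_right = a - (b - c)

print(add_left, add_right)  # prints 17 17
print(sub_left, sub_right)  # prints 3 9
```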
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.

Resources