Let me put the question first: Can I convert a parse tree implementing this particular grammar to an AST trivially.
I was given this grammar to build a parse tree:
literal := INTEGER | FLOAT | TRUE | FALSE .
designator := IDENTIFIER { "[" expression0 "]" } .
op0 := ">=" | "<=" | "!=" | "==" | ">" | "<" .
op1 := "+" | "-" | "or" .
op2 := "*" | "/" | "and" .
expression0 := expression1 [ op0 expression1 ] .
expression1 := expression2 { op1 expression2 } .
expression2 := expression3 { op2 expression3 } .
expression3 := "not" expression3
| "(" expression0 ")"
| designator
| call-expression
| literal .
For this particular example:
func main() : void {
let a = 1 + 2 + 3 + 4;
}
My parser will generate (part of) the parse tree as such
EXPRESSION1
EXPRESSION2
EXPRESSION3
LITERAL
INTEGER(1)(lineNum:2, charPos:10)
OP1
ADD(lineNum:2, charPos:12)
EXPRESSION2
EXPRESSION3
LITERAL
INTEGER(2)(lineNum:2, charPos:14)
OP1
ADD(lineNum:2, charPos:16)
EXPRESSION2
EXPRESSION3
LITERAL
INTEGER(3)(lineNum:2, charPos:18)
OP1
ADD(lineNum:2, charPos:20)
EXPRESSION2
EXPRESSION3
LITERAL
INTEGER(4)(lineNum:2, charPos:22)
Just notice how these tree branches under EXPRESSION1 go:
EXPRESSION2 + EXPRESSION2 + EXPRESSION2 + EXPRESSION2
which the operator + doesn't correspond to its two operands. So it seems to me that, in the AST conversion, I can't get an AST that aids 3-address IR code generation by simply pulling up the operator to replace the non-terminal EXPRESSION1.
To achieve this goal, the grammar I would have written for this language will be like this instead
expression1 := expression2 | expression1 + expression2 (1)
expression2 := expression3 | expression2 * expression3 (2)
expression3 := literal (3)
which the branches under EXPRESSION1 are only
EXPRESSION1 + EXPRESSION2
However, this grammar is not LL(1) since |FIRST(expression2)| = |{literal, +}| > 1.
It begs the question that (1) what would be the most elegant and trivial way to convert this parse tree? (2) is my construction of the parse tree a complete waste of time for this grammar that I should have started out coding an AST instead?
I believe you wish to produce an AST like:
ADD
/ \
1 ADD
/ \
2 ADD
/ \
3 4
But probably you noticed that your parse tree is actually a flat list not representing an easy starting point to group operators and their operands in single pass. Anyway coding a more advanced parser, recognizing tree structure and applying grammar reductions is not a trivial task.
Hence before starting doing that you might wish to consider some existing parser generators like yacc or ANTLR. Probably you will need to rewrite your grammar to create rules centered on the operators instead of treating them as a recursion utility. A grammar not being a classic LL(1) might not be a big problem as modern generators (like ANTLR) can handle LL(k) grammars with bigger lookahead, predicates etc. and just bypass issues of that type.
If you still insist on coding it manually think about using a stack to store symbols and converting them to an AST node once an expression is collected but again it is not a simple task.
Related
I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features but right now I'm not sure how to make division precede multiplication, etc.. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ()* means zero or more times. This grammar will produce right associative trees and it is deterministic and unambiguous. The multiplication and the division are with the same priority. The addition and subtraction also.
I am having quite a problem with antlr4 right now.
Whenever I try to feed antlr with this RPN grammar
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
plus : expression expression '+';
minus : expression expression '-';
mult : expression expression '*';
div : expression expression '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
antlr will throw an error because plus,minus,mult and div are mutually left recursive.
I dont know how to fix that.
(I know this occurs because with this grammar "expression" could be infinitely looped, I have had this problem before with another grammar, but i could fix that on my own)
My only solution would be to restrict the grammar in the following way
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
exp2 : plus2 | minus2 | mult2 | div2 | NUMBER;
plus : exp2 exp2'+';
minus : exp2 exp2'-';
mult: exp2 exp2'*';
div: exp2 exp2'/';
plus2 : NUMBER NUMBER '+';
minus2 : NUMBER NUMBER '-';
mult2: NUMBER NUMBER '*';
div2: NUMBER NUMBER '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
but this is not really what i want it to be, because now i could work at maximum with expressions like
2 3 + 5 4 - *
and the grammar would be more complex than it actually could be.
Hope you guys can help me
ANTLR4 only supports "direct" left recursive rules, not "indirect", as you have them.
Try something like this:
grammar RPN;
parse : expression EOF;
expression
: expression expression '+'
| expression expression '-'
| expression expression '*'
| expression expression '/'
| NUMBER
;
NUMBER : '-'? ('0'..'9')+;
SPACES : [ \t\r\n] -> skip;
Btw, 23+54-* is not a valid RPN expression: it must start with two numbers.
I have the following Antlr grammar rule:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
However if I have '3 | 1 | 2 | 6' this results in a flat tree, with 3, 1, 2, 6 all children of the BINOR node. What I really want is to be able to pattern match on either
expression2
or
^(BINOR expression2 expression2)
How can I change the rewrite so that these are the 2 patterns?
EDIT:
If I use custom rewrites, I'm thinking along the lines of
expression1
: e=expression2 (BINOR e2=expression2)*
-> {$BINOR != null}? ^(BINOR $e $e2*)
-> $e
But when I do this with '1|2|3' the resulting tree only has one BINOR node with two children which are 1 and 3, so 2 is missing.
Many thanks
You were close, this would work:
expression1
#init{boolean or = false;}
: e=expression2 (BINOR {or=true;} expression2)* -> {or}? ^(BINOR expression2+)
-> $e
;
But this is preferred since it doesn't use any custom code:
grammar T;
options {
output=AST;
}
expression1
: (e=expression2 -> $e) ((BINOR expression2)+ -> ^(BINOR expression2+))?
;
expression2
: NUMBER
;
NUMBER
: '0'..'9'+
;
BINOR
: '|'
;
The parser generated from the grammar above will parse the input "3|1|2|6" into the AST:
and the input "3" into the AST:
But your original try:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
does not produce a flat tree (assuming you have output=AST; in your options). It generates the following AST for "3|1|2|6":
If you "see" a flat tree, I guess you're using the interpreter in ANTLRWorks, which does not show the AST but the parse tree of your parse. The interpreter is also rather buggy (does not handle predicates and does not evaluate custom code), so best not use it. Use ANTLRWorks debugger instead, which works like a charm (the images from my answer are from the debugger)!
Some compiler books / articles / papers talk about design of a grammar and the relation of its operator's associativity. I'm a big fan of top-down, especially recursive descent, parsers and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor = INTEGER | "(" Expr ")"
According to what I read, some regards this grammar as being "wrong" due to the change of operator associativity (left to right for those 4 operators) proven by the growing parse tree to the right instead of left. For a parser implemented through attribute grammar, this might be true as l-attribute value requires that this value created first then passed to child nodes. however, when implementing with normal recursive descent parser, it's up to me whether to construct this node first then pass to child nodes (top-down) or let child nodes be created first then add the returned value as the children of this node (passed in this node's constructor) (bottom-up). There should be something I miss here because I don't agree with the statement saying this grammar is "wrong" and this grammar has been used in many languages esp. Wirthian ones. Usually (or all?) the reading that says it promotes LR parsing instead of LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive decent parse, you're implicitly writing the concrete syntax into it as you go along and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this ask how E + E + E would be parsed and find that it couldn't. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term which is not possible. Equivalently, think about deriving that expression from the start symbol, again not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative ebcase an arbitrarily long sting of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree ( + (+ a b) c) means the same thing as (+ a (+ b c)); its actually a semantic property (sure doesn't work for "-" but the grammar as posed can't distinguish).
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.
The following simple "calculator expression" grammar (BNF) can be easily parsed with the a trivial recursive-descent parser, which is predictive LL(1):
<expr> := <term> + <term>
| <term> - <term>
| <term>
<term> := <factor> * <factor>
<factor> / <factor>
<factor>
<factor> := <number>
| <id>
| ( <expr> )
<number> := \d+
<id> := [a-zA-Z_]\w+
Because it is always enough to see the next token in order to know the rule to pick. However, suppose that I add the following rule:
<command> := <expr>
| <id> = <expr>
For the purpose of interacting with the calculator on the command line, with variables, like this:
calc> 5+5
=> 10
calc> x = 8
calc> 6 * x + 1
=> 49
Is it true that I can not use a simple LL(1) predictive parser to parse <command> rules ? I tried to write the parser for it, but it seems that I need to know more tokens forward. Is the solution to use backtracking, or can I just implement LL(2) and always look two tokens forward ?
How to RD parser generators handle this problem (ANTLR, for instance)?
THe problem with
<command> := <expr>
| <id> = <expr>
is that when you "see" <id> you can't tell if it's the beginning of an assignement (second rule) or it's a "<factor>". You will only know when you'll read the next token.
AFAIK ANTLR is LL(*) (and is also able to generate rat-pack parsers if I'm not mistaken) so it will probably handle this grammare considering two tokens at once.
If you can play with the grammar I would suggest to either add a keyword for the assignment (e.g. let x = 8) :
<command> := <expr>
| "let" <id> "=" <expr>
or use the = to signify evaluation:
<command> := "=" <expr>
| <id> "=" <expr>
I think there are two ways to solve this with a recursive descent parser: either by using (more) lookahead or by backtracking.
Lookahead
command() {
if (currentToken() == id && lookaheadToken() == '=') {
return assignment();
} else {
return expr();
}
}
Backtracking
command() {
savedLocation = scanLocation();
if (accept( id )) {
identifier = acceptedTokenValue();
if (!accept( '=' )) {
setScanLocation( savedLocation );
return expr();
}
return new assignment( identifier, expr() );
} else {
return expr();
}
}
The problem is that the grammar:
<command> := <expr>
| <id> = <expr>
is not a mutually-recursive procedure. For a recursive decent parser you will need to determine a non-recursive equivalent.
rdentato post's shows how to fix this, assuming you can play with the grammar. This powerpoint spells out the problem in a bit more detail and shows how to correct it:
http://www.google.com/url?sa=t&source=web&ct=res&cd=7&url=http%3A%2F%2Fxml.cs.nccu.edu.tw%2Fcourses%2Fcompiler%2Fcp2006%2Fslides%2Flec3-Parsing%26TopDownParsing.ppt&ei=-YLaSPrWGaPwhAK5ydCqBQ&usg=AFQjCNGAFrODJxoxkgJEwDMQ8A8594vn0Q&sig2=nlYKQVfakmqy_57137XzrQ
ANTLR 3 uses a "LL(*)" parser as opposed to a LL(k) parser, so it will look ahead until it reaches the end of the input if it has to, without backtracking, using a specially optimized determinstic finite automata (DFA).