How to rewrite this grammar for LL(1) parsing? - parsing

I am trying to write a top-down recursive-descent parser for a small language, and I am facing some issues with the assignment statements. Here is the grammar from the language specifications:
<assign_stmt> ::= <lvalue> <l_tail> ":=" <expr> ";"
<l_tail> ::= ":=" <lvalue> <l_tail>
| ""
<expr> ::= ....
#multiple layers between <expr> and <lvalue>, like <term>, <factor>, etc.
#in the end, <expr> can be a <lvalue>
| <lvalue>
so that the assignments can look like
a := b := 3;
c := d := e := f;
The grammar does not seem to be ambiguous, but it is causing me issues because <expr> can itself be a <lvalue>. When parsing <l_tail>, both production rules look equally valid and I don't know which one to pick. I tried various left-factorizations (see below), but so far I have not been able to find an LL(1) grammar that works for me. Is it even possible here?
<assign_stmt> ::= <lvalue> ":=" <rest>
<rest> ::= <expr> ";"
| <lvalue> ":=" <l_tail>
Note that I could go around this issue by parsing <l_tail> and then looking for the ";" token. Depending on the result, I would know whether the last <lvalue> was actually an <expr> or not (without having to backtrack). However, I am learning here, and I would like to know the "right" way to overcome this problem.
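One common way to handle this in a hand-written recursive-descent parser, close to the workaround described above, is to parse an expression first and only afterwards decide whether it was an assignment target. The sketch below assumes a deliberately tiny expression language (identifiers, integers, and "+"); tokenize, parse_assign_stmt, and the tuple-shaped AST nodes are hypothetical names chosen for illustration and are not part of the language specification.

```python
import re

# Assumed token set for this sketch: ":=", identifiers, integers, "+" and ";".
TOKEN_RE = re.compile(r"\s*(:=|[A-Za-z_]\w*|\d+|[+;])")

def tokenize(text):
    """Split the input into tokens; raise on anything unrecognised."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_assign_stmt(tokens):
    """Parse `lvalue := lvalue := ... := expr ;` using one token of lookahead."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_primary():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok.isdigit():
            return ("num", int(tok))
        if tok[0].isalpha() or tok[0] == "_":
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def parse_expr():
        node = parse_primary()
        while peek() == "+":
            advance()
            node = ("add", node, parse_primary())
        return node

    targets = []
    while True:
        node = parse_expr()
        if peek() != ":=":
            break
        # Only now do we know the expression was meant as an lvalue.
        if node[0] != "var":
            raise SyntaxError("assignment target must be an lvalue")
        advance()
        targets.append(node[1])
    if peek() != ";":
        raise SyntaxError("expected ';'")
    advance()
    if not targets:
        raise SyntaxError("expected ':='")
    return ("assign", targets, node)
```

With this sketch, parse_assign_stmt(tokenize("a := b := 3;")) yields ("assign", ["a", "b"], ("num", 3)). In grammar terms it accepts the slightly larger language <expr> (":=" <expr>)* ";" and rejects non-lvalue targets with a check after the fact, so no backtracking is needed.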

Related

Understanding what makes a rule left-recursive in antlr

I've been trial-and-erroring to figure out, on an intuitive level, when a rule in antlr is left-recursive or not. For example, this (Removing left recursion) is left-recursive in theory, but works in Antlr:
// Example input: x+y+1
grammar DBParser;
expression
: expression '+' term
| term;
term
: term '*' atom
| atom;
atom
: NUMBER
| IDENTIFIER
;
NUMBER : [0-9]+ ;
IDENTIFIER : [a-zA-Z]+ ;
So what makes a rule left-recursive and a problem in antlr4, and what would be the simplest example showing that (in an actual program)? I'm trying to practice removing left-recursive productions, but I can't even figure out how to intentionally add a left-recursive rule that antlr4 can't resolve!
Antlr4 can handle left-recursion as long as it is direct and not hidden:
Hidden means that the recursion is not at the beginning of the right-hand side but it might still be at the beginning of the expansion because all previous symbols are nullable.
Indirect means that the first symbol on the right-hand side of a production for N eventually derives a sequence starting with N. Antlr describes that as "mutual recursion".
Here are some SO questions I found by searching for [antlr] left recursive:
ANTLR4 - Mutually left-recursive grammar
ANTLR4 mutually left-recursive error when parsing
ANTLR4 left-recursive error
ANTLR Grammar Mutually Left Recursive
Mutually left-recursive lexer rules on ANTL4?
Mutually left recursive with simple calculator
There are lots more, including one you asked. I'm sure you can mine some examples.
As mentioned by rici above, one of the items that Antlr4 does not support is indirect left-recursion. Here would be an example:
grammar DBParser;
expr: binop | Atom;
binop: expr '+' expr;
Atom: [a-z]+ | [0-9]+ ;
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr, binop]
Notice that Antlr4 can support the following direct left-recursion though:
grammar DBParser;
expr: expr '+' expr | Atom;
Atom: [a-z]+ | [0-9]+ ;
However, as soon as you wrap the alternatives in parentheses (for whatever reason), it no longer works:
grammar DBParser;
expr: (expr '+' expr) | Atom;
Atom: [a-z]+ | [0-9]+ ;
Or:
grammar DBParser;
expr: (expr '+' expr | Atom);
Atom: [a-z]+ | [0-9]+ ;
Both will raise:
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr]

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features but right now I'm not sure how to make division precede multiplication, etc.. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
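Here ()* means zero or more repetitions. The grammar is deterministic and unambiguous. Multiplication and division share one precedence level, as do addition and subtraction. The repetition does not by itself dictate associativity; a parser that folds each loop from left to right produces left-associative trees, which is what subtraction and division require.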
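As a rough illustration, the grammar above maps almost line for line onto a recursive-descent evaluator. This is a minimal sketch, not the asker's parser: the names tokenize, Parser, and evaluate are made up for the example, numbers are assumed to be unsigned integers, and each loop in this particular sketch combines its operands from left to right.

```python
import re

def tokenize(text):
    """Return numbers, operators, and parentheses as a list of strings."""
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def add(self):
        # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mul(self):
        # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value

    def term(self):
        # term ::= number | "(" exp ")"
        tok = self.advance()
        if tok == "(":
            value = self.add()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    """exp ::= add, followed by end of input."""
    parser = Parser(tokenize(text))
    value = parser.add()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

Because mul is only ever called from inside add, every product is fully evaluated before it takes part in a sum, so evaluate("1 + 2 * 3 + 4") gives 11.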

Parens in BNF, EBNF

I could capture a parenthetical group using something like:
expr ::= "(" <something> ")"
However, sometimes it's useful to use multiple levels of nesting, and so it's (theoretically) possible to have more than one parens as long as they match. For example:
>>> (1)+1
2
>>> (((((-1)))))+2
1
>>> ((2+2)+(1+1))
6
>>> (2+2))
SyntaxError: invalid syntax
Is there a way to specify a "matching-ness" in EBNF, or how is parenthetical-matching handled by most parsers?
In order to be able to match an arbitrary amount of anything (be it parentheses, operators, list items etc.) you need recursion (EBNF also features repetition operators that can be used instead of recursion in some cases, but not for constructs that need to be matched like parentheses).
For well-matched parentheses, the proper production is simply:
expr ::= "(" expr ")"
That's in addition to productions for other types of expressions, of course, so a complete grammar might look like this:
expr ::= "(" expr ")"
expr ::= NUMBER
expr ::= expr "+" expr
expr ::= expr "-" expr
expr ::= expr "*" expr
expr ::= expr "/" expr
Or for an unambiguous grammar:
expr ::= expr "+" multExpr
expr ::= expr "-" multExpr
expr ::= multExpr
multExpr ::= multExpr "*" primaryExpr
multExpr ::= multExpr "/" primaryExpr
multExpr ::= primaryExpr
primaryExpr ::= "(" expr ")"
primaryExpr ::= NUMBER
Also, how do you usually go about 'testing' that it is correct -- is there an online tool or something that can validate a syntax?
There are many parser generators that can accept some form of BNF- or EBNF-like notation and generate a parser from it. You can use one of those and then test whether the generated parser parses what you want it to. They're usually not available as online tools though. Also note that parser generators generally need the grammar to be unambiguous or you to add precedence declarations to disambiguate it.
Also, wouldn't this cause an infinite loop?
No. The exact mechanics depend on the parsing algorithm used of course, but if the character at the current input position is not an opening parenthesis, then clearly this isn't the right production to use and another one needs to be applied (or a syntax error raised if none of the productions apply).
Left recursion can cause infinite recursion when using top-down parsing algorithms (though in case of parser generators it's more likely that the grammar will either be rejected or in some cases automatically rewritten than that you get an actual infinite recursion or loop), but non-left recursion doesn't cause that kind of problem with any algorithm.
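To make the point about non-left recursion concrete, here is a small sketch of a top-down parser for nested parentheses. It is an illustration under assumptions: only unsigned integers and "+" are supported, the left-recursive rules from the grammars above are replaced by a repetition loop so that a recursive-descent parser can handle them, and parse, parse_expr, and parse_primary are hypothetical names.

```python
import re

def parse_expr(tokens, pos=0):
    """expr ::= primary ("+" primary)* ; returns (value, next position)."""
    value, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_primary(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_primary(tokens, pos):
    """primary ::= "(" expr ")" | NUMBER"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        # The "(" is consumed before recursing, so every recursive call
        # starts strictly further along in the input and must terminate.
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    tokens = re.findall(r"\d+|[+()]", text)
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("unmatched input")
    return value
```

Matching falls out of the recursion itself: each "(" opens a call that only returns after seeing its own ")", so parse("((2+2)+(1+1))") succeeds while parse("(2+2))") raises SyntaxError on the leftover ")".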

Is this a LL(1) grammar?

Considering the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for connectives is: -, /\, \/, ->, <->.
Associativity is also considered, for example p\/q\/r should be the same as p\/(q\/r). The same holds for the other connectives.
I intend to write a predictive top-down parser in Java. I don't see any ambiguity or direct left recursion here, but I'm not sure whether that is all I need to consider for this to be an LL(1) grammar. Maybe indirect left recursion?
If this is not an LL(1) grammar, what would be the steps required to transform it for my purposes?
It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a , do both C and D derive strings beginning with a.
This condition exists so that the parser never faces a choice it cannot resolve with one token of lookahead. Here, when the parser encounters a (, it won't know which production to use.
Your grammar violates this first rule. In each of the rules for <A> through <D>, both alternatives begin with the same non-terminal, and all of those eventually reduce to <G> and <H>, so both alternatives derive at least one string beginning with (.
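The conflict can be checked mechanically by computing FIRST sets for each alternative. The sketch below is a simplified version that is only valid because no rule in this grammar derives the empty string and there is no left recursion; the dictionary encoding, the abbreviation of <H> to three letters, and the helper names first_of_symbol and conflicts are assumptions made for the example.

```python
# The grammar from the question, one list of alternatives per non-terminal.
GRAMMAR = {
    "A": [["B", "<->", "A"], ["B"]],
    "B": [["C", "->", "B"], ["C"]],
    "C": [["D", "\\/", "C"], ["D"]],
    "D": [["E", "/\\", "D"], ["E"]],
    "E": [["F"], ["-", "F"]],
    "F": [["G"], ["H"]],
    "G": [["(", "A", ")"]],
    "H": [["p"], ["q"], ["r"]],  # abbreviated; the question lists p through z
}

def first_of_symbol(symbol, seen=frozenset()):
    """FIRST set of a single symbol (no rule here derives the empty string)."""
    if symbol not in GRAMMAR:
        return {symbol}  # a terminal starts only with itself
    if symbol in seen:
        return set()  # guard against looping on recursive rules
    result = set()
    for production in GRAMMAR[symbol]:
        result |= first_of_symbol(production[0], seen | {symbol})
    return result

def conflicts(nonterminal):
    """Terminals that can start more than one alternative of a non-terminal."""
    firsts = [first_of_symbol(p[0]) for p in GRAMMAR[nonterminal]]
    clashes = set()
    for i in range(len(firsts)):
        for j in range(i + 1, len(firsts)):
            clashes |= firsts[i] & firsts[j]
    return clashes
```

Here conflicts("A") through conflicts("D") all contain "(", while conflicts("E") and conflicts("F") are empty. One usual remedy, offered as a suggestion rather than as part of the original answer, is to left-factor the shared prefix, for example <A> ::= <B> <A'> with <A'> ::= <-> <A> | "".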

Recursive descent parsing - from LL(1) up

The following simple "calculator expression" grammar (BNF) can be easily parsed with a trivial recursive-descent parser, which is predictive LL(1):
<expr> := <term> + <term>
| <term> - <term>
| <term>
<term> := <factor> * <factor>
| <factor> / <factor>
| <factor>
<factor> := <number>
| <id>
| ( <expr> )
<number> := \d+
<id> := [a-zA-Z_]\w*
Because it is always enough to see the next token in order to know the rule to pick. However, suppose that I add the following rule:
<command> := <expr>
| <id> = <expr>
For the purpose of interacting with the calculator on the command line, with variables, like this:
calc> 5+5
=> 10
calc> x = 8
calc> 6 * x + 1
=> 49
Is it true that I cannot use a simple LL(1) predictive parser to parse <command> rules? I tried to write the parser for it, but it seems that I need to know more tokens ahead. Is the solution to use backtracking, or can I just implement LL(2) and always look two tokens ahead?
How do RD parser generators handle this problem (ANTLR, for instance)?
The problem with
<command> := <expr>
| <id> = <expr>
is that when you "see" <id> you can't tell whether it's the beginning of an assignment (second rule) or a "<factor>". You will only know once you read the next token.
AFAIK ANTLR is LL(*) (and is also able to generate packrat parsers, if I'm not mistaken), so it will probably handle this grammar by considering two tokens at once.
If you can play with the grammar I would suggest to either add a keyword for the assignment (e.g. let x = 8) :
<command> := <expr>
| "let" <id> "=" <expr>
or use the = to signify evaluation:
<command> := "=" <expr>
| <id> "=" <expr>
I think there are two ways to solve this with a recursive descent parser: either by using (more) lookahead or by backtracking.
Lookahead
command() {
if (currentToken() == id && lookaheadToken() == '=') {
return assignment();
} else {
return expr();
}
}
Backtracking
command() {
savedLocation = scanLocation();
if (accept( id )) {
identifier = acceptedTokenValue();
if (!accept( '=' )) {
setScanLocation( savedLocation );
return expr();
}
return new assignment( identifier, expr() );
} else {
return expr();
}
}
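For reference, the lookahead variant can be turned into a small runnable sketch. This is an illustration under assumptions rather than the answerer's code: the Calculator class and its method names are invented for the example, and the expression rules use repetition loops instead of the single-operator rules in the question so that inputs like 6 * x + 1 parse.

```python
import re

TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|[-+*/()=]")

class Calculator:
    def __init__(self):
        self.variables = {}

    def run(self, line):
        """command ::= id "=" expr | expr ; returns None for an assignment."""
        self.tokens = TOKEN_RE.findall(line)
        self.pos = 0
        # Two tokens of lookahead: an identifier followed by "=" is an assignment.
        if self.is_id(self.peek(0)) and self.peek(1) == "=":
            name = self.tokens[0]
            self.pos = 2
            self.variables[name] = self.expr()
            result = None
        else:
            result = self.expr()
        if self.peek(0) is not None:
            raise SyntaxError("trailing input")
        return result

    def peek(self, offset):
        index = self.pos + offset
        return self.tokens[index] if index < len(self.tokens) else None

    def advance(self):
        tok = self.peek(0)
        self.pos += 1
        return tok

    @staticmethod
    def is_id(tok):
        return tok is not None and (tok[0].isalpha() or tok[0] == "_")

    def expr(self):
        value = self.term()
        while self.peek(0) in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek(0) in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        tok = self.advance()
        if tok == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        if self.is_id(tok):
            if tok not in self.variables:
                raise NameError(f"undefined variable {tok!r}")
            return self.variables[tok]
        raise SyntaxError(f"unexpected token {tok!r}")
```

Feeding it the session from the question, run("5+5") returns 10, run("x = 8") stores the variable and returns None, and run("6 * x + 1") returns 49. Only the command rule needs the second token; everything below it still decides on a single token.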
The problem is that the grammar:
<command> := <expr>
| <id> = <expr>
is not a mutually-recursive procedure. For a recursive descent parser you will need to determine a non-recursive equivalent.
rdentato's post shows how to fix this, assuming you can play with the grammar. This PowerPoint spells out the problem in a bit more detail and shows how to correct it:
http://www.google.com/url?sa=t&source=web&ct=res&cd=7&url=http%3A%2F%2Fxml.cs.nccu.edu.tw%2Fcourses%2Fcompiler%2Fcp2006%2Fslides%2Flec3-Parsing%26TopDownParsing.ppt&ei=-YLaSPrWGaPwhAK5ydCqBQ&usg=AFQjCNGAFrODJxoxkgJEwDMQ8A8594vn0Q&sig2=nlYKQVfakmqy_57137XzrQ
ANTLR 3 uses an "LL(*)" parser as opposed to an LL(k) parser, so it will look ahead until it reaches the end of the input if it has to, without backtracking, using a specially optimized deterministic finite automaton (DFA).

Resources