Find an equivalent LR grammar - parsing

I am trying to find an LR(1) or LR(0) grammar for Pascal. Here is a part of my grammar, which is not LR(0) as it has a shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables; lowercase words, +, and - are terminals.)
As you see, EXPR --> AEXPR | AEXPR realop AEXPR causes a shift/reduce conflict in LR(0) parsing. I tried adding a new variable, among other approaches, to find an equivalent LR(0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar an LR(1) grammar?
Second: Is it possible to find an LR(0) equivalent for this grammar? What about an LR(1) equivalent?

Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
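As a minimal sketch of such a rotation (my own illustration, not from the original answer): assume the parse produced right-leaning tuples like ('-', 'a', ('-', 'b', 'c')), and a hypothetical LEVEL table groups operators of equal precedence.
LEVEL = {'+': 1, '-': 1, '*': 2, '/': 2}   # operators of equal precedence share a level

def fix_assoc(node):
    # Leaves (identifiers/numbers) are returned unchanged.
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    # Rotate (op, l, (op2, rl, rr)) into (op2, (op, l, rl), rr)
    # while the right child is an operator at the same precedence level.
    while isinstance(right, tuple) and LEVEL[right[0]] == LEVEL[op]:
        op2, rl, rr = right
        left, op, right = (op, left, rl), op2, rr
    return (op, fix_assoc(left), fix_assoc(right))

# 'a - b - c - d', parsed right-associatively, rotated to left-associative:
assert fix_assoc(('-', 'a', ('-', 'b', ('-', 'c', 'd')))) == \
       ('-', ('-', ('-', 'a', 'b'), 'c'), 'd')
Caveat: a real implementation must mark parenthesized subtrees (or rotate during parsing, within a single chain) so that the (b - c) in a - (b - c) is not rotated away; the bare tuples above cannot tell the two cases apart.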
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
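A sketch of that reconstruction, assuming the flat TERM (op TERM)* sequence is available as an alternating operand/operator list (the names and the PREC table are hypothetical):
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}   # realop would get a level below these

def build_tree(flat):
    # 'flat' alternates operands and operators, e.g. ['a', '+', 'b', '*', 'c'].
    operands, operators = [], []

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for i, tok in enumerate(flat):
        if i % 2 == 0:
            operands.append(tok)           # operand position
        else:
            # Left associativity: reduce while the stacked operator
            # has greater or equal precedence.
            while operators and PREC[operators[-1]] >= PREC[tok]:
                reduce_top()
            operators.append(tok)
    while operators:
        reduce_top()
    return operands[0]

assert build_tree(['a', '+', 'b', '*', 'c', '-', 'd']) == \
       ('-', ('+', 'a', ('*', 'b', 'c')), 'd')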
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Related

Understanding what makes a rule left-recursive in antlr

I've been trial-and-erroring to figure out, at an intuitive level, when a rule in antlr is left-recursive or not. For example, this (Removing left recursion) is left-recursive in theory, but works in Antlr:
// Example input: x+y+1
grammar DBParser;
expression
: expression '+' term
| term;
term
: term '*' atom
| atom;
atom
: NUMBER
| IDENTIFIER
;
NUMBER : [0-9]+ ;
IDENTIFIER : [a-zA-Z]+ ;
So what makes a rule left-recursive and a problem in antlr4, and what would be the simplest example of showing that (in an actual program)? I'm trying to practice remove left-recursive productions, but I can't even figure out how to intentionally add a left-recursive rule that antlr4 can't resolve!
Antlr4 can handle direct left-recursion as long as it is neither hidden nor indirect:
Hidden means that the recursion is not at the beginning of the right-hand side but it might still be at the beginning of the expansion because all previous symbols are nullable.
Indirect means that the first symbol on the right-hand side of a production for N eventually derives a sequence starting with N. Antlr describes that as "mutual recursion".
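For illustration, a hypothetical grammar with hidden left recursion might look like this: opt is nullable, so expr can still begin with its own expansion even though expr is not written first, and Antlr4's automatic left-recursion rewriting (which pattern-matches the rule name appearing literally first in an alternative) does not apply here.
grammar Hidden;
expr : opt expr '+' Atom   // hidden: 'opt' can derive the empty string
     | Atom
     ;
opt  : '-'
     |                     // empty alternative makes 'opt' nullable
     ;
Atom : [a-z]+ ;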
Here are some SO questions I found by searching for [antlr] left recursive:
ANTLR4 - Mutually left-recursive grammar
ANTLR4 mutually left-recursive error when parsing
ANTLR4 left-recursive error
ANTLR Grammar Mutually Left Recursive
Mutually left-recursive lexer rules on ANTL4?
Mutually left recursive with simple calculator
There are lots more, including one you asked. I'm sure you can mine some examples.
As mentioned by rici above, one of the things that Antlr4 does not support is indirect left-recursion. Here is an example:
grammar DBParser;
expr: binop | Atom;
binop: expr '+' expr;
Atom: [a-z]+ | [0-9]+ ;
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr, binop]
Notice that Antlr4 can support the following direct left-recursion though:
grammar DBParser;
expr: expr '+' expr | Atom;
Atom: [a-z]+ | [0-9]+ ;
However, even if you add in parentheticals (for whatever reason?) it doesn't work:
grammar DBParser;
expr: (expr '+' expr) | Atom;
Atom: [a-z]+ | [0-9]+ ;
Or:
grammar DBParser;
expr: (expr '+' expr | Atom);
Atom: [a-z]+ | [0-9]+ ;
Both will raise:
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr]

Is it possible to write a recursive-descent parser for this grammar?

From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:
top_level : expression PLUS term
| expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
expression : expression PLUS term
| expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.
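Not the asker's actual code, but a minimal PLY sketch of such a parser might look like this; the actions build tuple trees:
import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN')

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_ignore  = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    raise SyntaxError('bad character %r' % t.value[0])

# The first rule defined becomes the start symbol.
def p_top_level(p):
    '''top_level : expression PLUS term
                 | expression MINUS term
                 | term TIMES factor
                 | term DIVIDE factor
                 | NUMBER'''
    p[0] = (p[2], p[1], p[3]) if len(p) == 4 else p[1]

def p_expression(p):
    '''expression : expression PLUS term
                  | expression MINUS term
                  | term'''
    p[0] = (p[2], p[1], p[3]) if len(p) == 4 else p[1]

def p_term(p):
    '''term : term TIMES factor
            | term DIVIDE factor
            | factor'''
    p[0] = (p[2], p[1], p[3]) if len(p) == 4 else p[1]

def p_factor(p):
    '''factor : NUMBER
              | LPAREN expression RPAREN'''
    p[0] = p[1] if len(p) == 2 else p[2]

def p_error(p):
    raise SyntaxError('syntax error at %r' % (p,))

lexer = lex.lex()
parser = yacc.yacc()

print(parser.parse('(1+2)*3'))   # ('*', ('+', 1, 2), 3)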
For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:
top_level : expression top_level1
| term top_level2
| NUMBER
top_level1 : PLUS term
| MINUS term
top_level2 : TIMES factor
| DIVIDE factor
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Is it possible to simplify this grammar to ease the recursive-descent approach?
The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α).) However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Sure. It's not even that difficult.
The first symbol must either be a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error, if the next symbol is not a close parenthesis), and then either call expression1 or term1 and then expression1, depending on what the next symbol is. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.
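A sketch of that strategy in Python, with a plain token list standing in for the scanner ('NUM' is a hypothetical NUMBER token; the other tokens are literal characters):
def parse(tokens):
    # Recursive descent following the strategy described above.
    toks = list(tokens) + ['$']            # '$' marks end of input

    def eat(expected):
        if toks[0] != expected:
            raise SyntaxError('expected %r, got %r' % (expected, toks[0]))
        toks.pop(0)

    def expression():                      # expression : term expression1
        term()
        expression1()

    def expression1():                     # expression1 : PLUS term expression1 | MINUS ... | empty
        while toks[0] in ('+', '-'):
            toks.pop(0)
            term()

    def term():                            # term : factor term1
        factor()
        term1()

    def term1():                           # term1 : TIMES factor term1 | DIVIDE ... | empty
        while toks[0] in ('*', '/'):
            toks.pop(0)
            factor()

    def factor():                          # factor : NUMBER | LPAREN expression RPAREN
        if toks[0] == 'NUM':
            toks.pop(0)
        elif toks[0] == '(':
            eat('(')
            expression()
            eat(')')
        else:
            raise SyntaxError('expected NUMBER or (')

    # Top level: no top_level* productions needed.
    if toks[0] == 'NUM':
        expression()
    elif toks[0] == '(':
        eat('(')
        expression()
        eat(')')
        if toks[0] in ('*', '/'):          # the parenthesized expression was a factor
            term1()
            expression1()
        elif toks[0] in ('+', '-'):        # ... or a complete term
            expression1()
        else:                              # a bare (expr) is banned at top level
            raise SyntaxError('outer parentheses are not allowed')
    else:
        raise SyntaxError('expected NUMBER or (')
    eat('$')

parse(['(', 'NUM', '+', 'NUM', ')', '*', 'NUM'])   # accepted
# parse(['(', 'NUM', ')'])                         # raises: outer parentheses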
Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.
Is it possible to simplify this grammar to ease the recursive-descent approach?
I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.
top_level : NUMBER optional_expression_or_term_1
| LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
| expression_or_term_1
expression_or_term_1
: PLUS term expression1
| MINUS term expression1
| TIMES factor term1 expression1
| DIVIDE factor term1 expression1
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).
The first is that it seems odd to me to ban (1+2) but allow (((1)))+2 or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.)
Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.
To make the grammar LL(1) you need to finish left-factoring top_level.
You stopped at:
top_level : expression top_level1
| term top_level2
| NUMBER
expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:
top_level : NUMBER term1 expression1 top_level1
| NUMBER term1 top_level2
| NUMBER
| LPAREN expression RPAREN term1 expression1 top_level1
| LPAREN expression RPAREN term1 top_level2
which you can then left-factor to
top_level : NUMBER term1 top_level3
| LPAREN expression RPAREN term1 top_level4
top_level3 : expression1 top_level1
| top_level2
| empty
top_level4 : expression1 top_level1
| top_level2
Note that this still is not LL(1), as there are epsilon rules (term1, expression1) with overlapping FIRST and FOLLOW sets. For example, in top_level3, a lookahead of PLUS doesn't tell you whether the PLUS starts expression1 or whether expression1 derives empty and the PLUS starts top_level1. So you need to factor those out too to make the grammar LL(1).

Reduce/reduce conflict in grammar

Let's imagine I want to be able to parse values like this (each line is a separate example):
x
(x)
((((x))))
x = x
(((x))) = x
(x) = ((x))
I've written this YACC grammar:
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'
But I get a reduce/reduce conflict:
$ bison example.y
example.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Any hint as to how to solve it? I am using GNU bison 3.0.2
Reduce/reduce conflicts often mean there is a fundamental problem in the grammar.
The first step in resolving it is to get the output file (bison -v example.y produces example.output). Bison 2.3 says (in part):
state 7
4 Expr: Id .
6 Pattern: Id .
'=' reduce using rule 6 (Pattern)
')' reduce using rule 4 (Expr)
')' [reduce using rule 6 (Pattern)]
$default reduce using rule 4 (Expr)
The conflict is clear; after the parser reads an x (and reduces it to an Id) followed by a ), it doesn't know whether to reduce the expression as an Expr or as a Pattern. That presents a problem.
I think you should rewrite the grammar without one of Expr and Pattern:
%%
Line: Binding | Expr
Binding: Expr '=' Expr
Expr: Id | '(' Expr ')'
Id: 'x'
Your grammar is not LR(k) for any k. So you either need to fix the grammar or use a GLR parser.
Suppose the input starts with:
(((((((((((((x
Up to here, there is no problem, because every character has been shifted onto the parser stack.
But now what? At the next step, x must be reduced and the lookahead is ). If there is an = somewhere in the future, x is a Pattern. Otherwise, it is an Expr.
You can fix the grammar by:
getting rid of Pattern and changing Binding to Expr | Expr '=' Expr;
getting rid of all the definitions of Expr and replacing them with Expr: Pattern
The second alternative is probably better in the long run, because it is likely that in the full grammar which you are imagining (or developing), Pattern is a subset of Expr, rather than being identical to Expr. Factoring Expr into a unit production for Pattern and the non-Pattern alternatives will allow you to parse the grammar with an LALR(1) parser (if the rest of the grammar conforms).
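For this toy grammar, the second alternative might look like the following sketch (in a fuller grammar, Expr would also keep its non-Pattern alternatives):
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Pattern
Pattern: Id | '(' Pattern ')'
Id: 'x'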
Or you can use a GLR grammar, as noted above.
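A sketch of the GLR route with Bison: the grammar itself is unchanged; %glr-parser enables GLR, and %expect-rr 1 acknowledges the known reduce/reduce conflict, which the parser resolves at run time by pursuing both reductions until the presence or absence of = decides between them.
%glr-parser
%expect-rr 1
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'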

Example for LL(1) Grammar which is NOT LALR?

I am now learning about parsers in my Theory of Compilation course.
I need to find an example of a grammar which is LL(1) but not LALR.
I know one should exist. Please help me think of the simplest example of this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about the design of a grammar and its relation to operator associativity. I'm a big fan of top-down, especially recursive descent, parsers, and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor = INTEGER | "(" Expr ")"
According to what I read, some regard this grammar as "wrong" because it changes the operators' associativity (left-to-right for those 4 operators), as shown by the parse tree growing to the right instead of to the left. For a parser implemented through an attribute grammar, this might be true, as an L-attributed value requires that the value be created first and then passed down to child nodes. However, when implementing a normal recursive descent parser, it's up to me whether to construct this node first and then pass it to child nodes (top-down), or to let the child nodes be created first and then add the returned values as the children of this node, passed in the node's constructor (bottom-up). There must be something I'm missing here, because I don't agree with the statement that this grammar is "wrong", and this grammar has been used in many languages, esp. Wirthian ones. Usually (or always?) the readings that say so promote LR parsing instead of LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this, ask how E + E + E would be parsed and find that it couldn't be. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term, which is not possible. Equivalently, think about deriving that expression from the start symbol; again, not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is right-associative, because an arbitrarily long string E + E + ... + E can be derived, with the parse tree growing to the right.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing pieces around (in this case, passing the partially built tree down the recursion as an extra argument), but that just seems awkward.
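That patch-up might look like the following sketch, where each primed routine receives the tree built so far as an argument (names and the tuple representation are hypothetical):
def expr(toks):                      # Expr ::= Term Expr'
    return expr1(term(toks), toks)

def expr1(left, toks):               # Expr' receives the left subtree as an argument
    if toks and toks[0] in ('+', '-'):
        op = toks.pop(0)
        return expr1((op, left, term(toks)), toks)
    return left                      # empty case: the finished tree comes back up

def term(toks):                      # Term ::= Factor Term', handled the same way
    return term1(factor(toks), toks)

def term1(left, toks):
    if toks and toks[0] in ('*', '/'):
        op = toks.pop(0)
        return term1((op, left, factor(toks)), toks)
    return left

def factor(toks):                    # Factor ::= INTEGER | "(" Expr ")"
    tok = toks.pop(0)
    if tok == '(':
        node = expr(toks)
        assert toks.pop(0) == ')', "expected ')'"
        return node
    return int(tok)

# expr(['1', '-', '2', '-', '3']) == ('-', ('-', 1, 2), 3)
You could have chosen this instead as a grammar: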
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
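For instance, a sketch of that loop-based recursive descent, using the same tuple trees as above: the loops extend the tree leftward, so equal-precedence operators come out left-associative.
def parse_expr(toks):                # Expr ::= Term ( ("+"|"-") Term )*
    left = parse_term(toks)
    while toks and toks[0] in ('+', '-'):
        op = toks.pop(0)
        left = (op, left, parse_term(toks))   # tree grows to the left
    return left

def parse_term(toks):                # Term ::= Factor ( ("*"|"/") Factor )*
    left = parse_factor(toks)
    while toks and toks[0] in ('*', '/'):
        op = toks.pop(0)
        left = (op, left, parse_factor(toks))
    return left

def parse_factor(toks):              # Factor ::= INTEGER | "(" Expr ")"
    tok = toks.pop(0)
    if tok == '(':
        node = parse_expr(toks)
        assert toks.pop(0) == ')', "expected ')'"
        return node
    return int(tok)

# parse_expr(['1', '-', '2', '-', '3']) == ('-', ('-', 1, 2), 3)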
This doesn't really get you associativity; it just shapes the trees so that it can be expressed. Associativity means that the tree (+ (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it certainly doesn't hold for "-", but the grammar as posed can't distinguish the cases).
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.
