How to simplify this left-recursive rule?

I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case a direct application would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε simplify to selectStmt': (setOp selectStmt) | empty? And how is the ε removed here? I assume (setOp selectStmt) | empty simplifies to (setOp selectStmt)? because if it can be empty, then that is exactly what the optional ? operator means.

Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1+2+3+4 is syntactically ambiguous, with five different parse trees: (1+(2+(3+4))), (1+((2+3)+4)), ((1+2)+(3+4)), ((1+(2+3))+4), and (((1+2)+3)+4). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as - or /, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
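The count of five can be checked mechanically. Here is a small Python sketch (the helper name parenthesizations is made up for this illustration) which enumerates every way of grouping a chain of operands under a single binary operator:

```python
def parenthesizations(operands, op="+"):
    """Return every fully parenthesized grouping of `operands` under one binary operator."""
    if len(operands) == 1:
        return [str(operands[0])]
    results = []
    # Choose where the top-level operator splits the operand list.
    for split in range(1, len(operands)):
        for left in parenthesizations(operands[:split], op):
            for right in parenthesizations(operands[split:], op):
                results.append(f"({left}{op}{right})")
    return results


trees = parenthesizations([1, 2, 3, 4])
print(len(trees))  # 5
for tree in trees:
    print(tree)
```

Each string corresponds to one parse tree of the ambiguous grammar; the number of groupings grows quickly as operands are added.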
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr '+' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that + and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr '+' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1+(2+(3+4))), then the unambiguous grammar would be right-recursive:
expr: term '+' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we chose (as long as * binds more tightly than +), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr '+' term. Note that term does not derive 2 + 3; term only allows multiplicative operators. So 1 + 2 + 3 can only be produced by reducing 1 + 2 to expr and 3 to term, leaving expr '+' term, which matches the production for expr. Consequently, ((1+2)+3) is the only possible parse. (1+(2+3)) could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr '+' term, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
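The transformation itself is mechanical. As a rough sketch, assuming a grammar is represented as lists of symbol names (the function name and the representation are invented for this example, and only immediate left recursion is handled):

```python
def eliminate_left_recursion(head, alternatives):
    """Rewrite A -> A α | β as A -> β A' and A' -> α A' | ε.

    Only immediate left recursion is handled; an empty list stands for ε.
    """
    tail = head + "'"
    # The α parts: what follows the left-recursive reference to `head`.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    # The β parts: alternatives that do not start with `head`.
    base = [alt for alt in alternatives if not alt or alt[0] != head]
    if not recursive:
        return {head: alternatives}
    return {
        head: [beta + [tail] for beta in base],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }


rules = eliminate_left_recursion(
    "selectStmt",
    [["selectStmt", "setOp", "simpleSelectStmt"], ["simpleSelectStmt"]],
)
for name, alts in rules.items():
    print(name, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```

With the input above, α is setOp simpleSelectStmt and β is simpleSelectStmt, so the result is the two productions just shown.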
By back-substituting selectStmt into the first production of selectStmt', we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3 is parsed as (S1 (UNION S2 (UNION S3 ()))).
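To see that shape concretely, here is a minimal recursive-descent sketch of the back-substituted grammar. It assumes the input is already a list of tokens, that any token other than a set operator stands for a whole simpleSelectStmt, and that the tree is built from nested tuples; all of these are simplifications for illustration.

```python
SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}  # assumed token names for setOp


def parse(tokens):
    pos = 0

    def select_stmt():
        # selectStmt: simpleSelectStmt selectStmt'
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in SET_OPS:
            raise SyntaxError(f"expected a simple select at position {pos}")
        simple = tokens[pos]
        pos += 1
        return (simple, select_stmt_tail())

    def select_stmt_tail():
        # selectStmt': setOp selectStmt | ε
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in SET_OPS:
            op = tokens[pos]
            pos += 1
            # Splice the nested selectStmt into the tail, as in the notation above.
            return (op, *select_stmt())
        return ()

    tree = select_stmt()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse(["S1", "UNION", "S2", "UNION", "S3"]))
# ('S1', ('UNION', 'S2', ('UNION', 'S3', ())))
```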
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setOp selectStmt | ε), exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3) and the other one as (S1 UNION (S2 UNION S3)). With UNION, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE operator, for example.
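A quick way to convince yourself, using Python sets as a stand-in for query results (the - operator playing the role of set difference and | the role of UNION; the sample values are arbitrary):

```python
s1, s2, s3 = {1, 2, 3}, {2, 3}, {3}

# Set difference: the grouping changes the answer.
left_assoc = (s1 - s2) - s3   # {1} - {3}
right_assoc = s1 - (s2 - s3)  # {1, 2, 3} - {2}
print(left_assoc == right_assoc)  # False

# Union: both groupings give the same set.
print((s1 | s2) | s3 == s1 | (s2 | s3))  # True
```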
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt' represents zero or more repetitions of setOp simpleSelectStmt. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene * operator (zero or more repetitions), we could write selectStmt' as (setOp simpleSelectStmt)*, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp binds to the left or to the right. So the convenience does come at a small price.)
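In a hand-written recursive-descent parser, the repetition becomes a loop, and the loop body is where the associativity gets decided. A minimal sketch, assuming a pre-tokenized input in which any token other than a set operator stands for a whole simpleSelectStmt (the token names and the tuple-based tree are illustrative choices):

```python
SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}  # assumed token names for setOp


def parse_select_stmt(tokens):
    """Parse selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*"""
    pos = 0

    def simple_select_stmt():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in SET_OPS:
            raise SyntaxError(f"expected a simple select at position {pos}")
        node = tokens[pos]
        pos += 1
        return node

    tree = simple_select_stmt()
    # The Kleene star: zero or more repetitions of setOp simpleSelectStmt.
    while pos < len(tokens) and tokens[pos] in SET_OPS:
        op = tokens[pos]
        pos += 1
        right = simple_select_stmt()
        # Folding the tree built so far into the left operand
        # makes every set operator left-associative.
        tree = (tree, op, right)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse_select_stmt(["S1", "UNION", "S2", "EXCEPT", "S3"]))
# (('S1', 'UNION', 'S2'), 'EXCEPT', 'S3')
```

Collecting the operands in a list and folding them from the right afterwards would give the right-associative tree from the same grammar, which is the sense in which the grammar alone does not settle the question.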

Related

Is it possible to write a recursive-descent parser for this grammar?

From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:
top_level : expression PLUS term
| expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
expression : expression PLUS term
| expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.
For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:
top_level : expression top_level1
| term top_level2
| NUMBER
top_level1 : PLUS term
| MINUS term
top_level2 : TIMES factor
| DIVIDE factor
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Is it possible to simplify this grammar to ease the recursive-descent approach?
The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α).) However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Sure. It's not even that difficult.
The first symbol must either be a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error, if the next symbol is not a close parenthesis), and then either call expression1 or term1 and then expression1, depending on what the next symbol is. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.
Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.
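As a rough Python sketch of that strategy (the class layout, the tuple-based syntax tree, and the single-character operator tokens are my own choices; expression1 and term1 are written as loops which take the left operand as an argument, so that the resulting tree is left-associative):

```python
import re


class Parser:
    def __init__(self, text):
        # Integers and single-character symbols; None marks the end of input.
        self.tokens = re.findall(r"\d+|\S", text) + [None]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        return self.advance()

    def top_level(self):
        if self.peek() == "(":
            self.advance()
            node = self.expression()
            self.expect(")")
            # An operator must follow, otherwise the input is wrapped in outer parentheses.
            if self.peek() in ("*", "/"):
                node = self.expression1(self.term1(node))
            elif self.peek() in ("+", "-"):
                node = self.expression1(node)
            else:
                raise SyntaxError("outer parentheses are not allowed")
        else:
            node = self.expression()
        self.expect(None)
        return node

    def expression(self):
        return self.expression1(self.term())

    def expression1(self, left):
        while self.peek() in ("+", "-"):
            op = self.advance()
            left = (op, left, self.term())
        return left

    def term(self):
        return self.term1(self.factor())

    def term1(self, left):
        while self.peek() in ("*", "/"):
            op = self.advance()
            left = (op, left, self.factor())
        return left

    def factor(self):
        tok = self.advance()
        if tok == "(":
            node = self.expression()
            self.expect(")")
            return node
        if tok is not None and tok.isdecimal():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    return Parser(text).top_level()


print(parse("(1+2)*3"))  # ('*', ('+', 1, 2), 3)
print(parse("(1)+2"))    # ('+', 1, 2)
```

Calling parse("(1+2)") raises SyntaxError, since nothing follows the closing parenthesis. No token is ever examined twice, so no backtracking is involved.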
Is it possible to simplify this grammar to ease the recursive-descent approach?
I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.
top_level : NUMBER optional_expression_or_term_1
| LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
| expression_or_term_1
expression_or_term_1
: PLUS term expression1
| MINUS term expression1
| TIMES factor term1 expression1
| DIVIDE factor term1 expression1
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).
The first is that it seems odd to me to ban (1+2) but allow (((1)))+2, or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.)
Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.
To make the grammar LL(1) you need to finish left-factoring top_level.
You stopped at:
top_level : expression top_level1
| term top_level2
| NUMBER
expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:
top_level : NUMBER term1 expression1 top_level1
| NUMBER term1 top_level2
| NUMBER
| LPAREN expression RPAREN term1 expression1 top_level1
| LPAREN expression RPAREN term1 top_level2
which you can then left-factor to
top_level : NUMBER term1 top_level3
| LPAREN expression RPAREN term1 top_level4
top_level3 : expression1 top_level1
| top_level2
| empty
top_level4 : expression1 top_level1
| top_level2
Note that this still is not LL(1), as there are epsilon rules (term1, expression1) with overlapping FIRST and FOLLOW sets. So you need to factor those out too to make it LL(1).

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for Pascal. Here is a part of my grammar which is not LR(0), as it has a shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you can see, EXPR --> AEXPR | AEXPR realop AEXPR causes a shift/reduce conflict in LR(0) parsing. I tried adding a new variable, and some other approaches, to find an equivalent LR(0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar an LR(1) grammar?
Second: Is it possible to find an LR(0) equivalent for this grammar? What about an LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
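The tree rotation mentioned for the LL(1) case can be sketched as follows. This assumes a chain of operators at a single precedence level, represented as (op, left, right) tuples whose operands are plain strings, so that any tuple in the right-hand position belongs to the chain rather than to a parenthesized subexpression; a real parser would need some way to tell those two cases apart.

```python
def rotate_left(tree):
    """Turn a right-nested chain (op, a, (op2, b, rest)) into a left-nested one."""
    while isinstance(tree, tuple) and isinstance(tree[2], tuple):
        op, left, (op2, mid, rest) = tree
        # Rotate: the right child's operator becomes the new root.
        tree = (op2, (op, left, mid), rest)
    return tree


# a - b + c - d, as a right-recursive LL(1) grammar would parse it:
right_nested = ("-", "a", ("+", "b", ("-", "c", "d")))
print(rotate_left(right_nested))
# ('-', ('+', ('-', 'a', 'b'), 'c'), 'd')
```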
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Nondeterministic, unambiguous grammar?

According to Wikipedia's description of GLR parsers, they "handle nondeterministic and ambiguous grammars."
I can visualize an ambiguous grammar, like the dangling else problem, but what's a nondeterministic CF grammar which isn't ambiguous?
Pretty much any non-LR(k) grammar is non-deterministic, but not necessarily ambiguous. The obvious example is when you have some arbitrarily large construct that can be parsed two ways, and which one is correct depends on something AFTER the large construct, e.g.:
S ::= A x | B y
A ::= A a | a
B ::= B a | a
However, such non-deterministic grammars can often be reworked so as to be deterministic, if the two ways of parsing the large construct can be combined (as with S ::= A x | A y for the above grammar which is a deterministic way of parsing the same language.)
More interesting is LANGUAGES that are inherently non-deterministic; that is, there is no deterministic grammar for the language. For that, there needs to be something inside the arbitrarily large construct that needs to match what comes after, e.g.:
S ::= X x | Y y
X ::= a X a | x
Y ::= a Y a | y

Example of an LR grammar that cannot be represented by LL?

All LL grammars are LR grammars, but not the other way around, but I still struggle to deal with the distinction. I'm curious about small examples, if any exist, of LR grammars which do not have an equivalent LL representation.
Well, as far as grammars are concerned, it's easy -- any simple left-recursive grammar is LR (probably LR(1)) and not LL. So a list grammar like:
list ::= list ',' element | element
is LR(1) (assuming the production for element is) but not LL. Such grammars can be fairly easily converted into LL grammars by left-factoring and such, so this is not too interesting.
Of more interest is LANGUAGES that are LR but not LL -- that is a language for which there exists an LR(1) grammar but no LL(k) grammar for any k. An example is things that need optional trailing matches. For example, the language of any number of a symbols followed by the same number or fewer b symbols, but not more bs -- { a^i b^j | i >= j }. There's a trivial LR(1) grammar:
S ::= a S | P
P ::= a P b | \epsilon
but no LL(k) grammar. The reason is that an LL grammar needs to decide whether to match an a+b pair or an odd a when looking at an a, while the LR grammar can defer that decision until after it sees the b or the end of the input.
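The language itself is easy to recognize if you are allowed to postpone the decision. The following Python sketch is not an LR parser, just a hand-written recognizer in which a counter plays the role of the parser stack (the function name is made up): every a is pushed without deciding whether it will be paired, and only a b forces a pairing.

```python
def accepts(text):
    """Recognize { a^i b^j | i >= j }."""
    unmatched_a = 0  # stands in for the a symbols sitting on a parser stack
    seen_b = False
    for ch in text:
        if ch == "a":
            if seen_b:
                return False  # an a may not follow a b
            unmatched_a += 1
        elif ch == "b":
            seen_b = True
            if unmatched_a == 0:
                return False  # more bs than as
            unmatched_a -= 1  # pair this b with an earlier a
        else:
            return False
    # Any as left over are the unpaired ones, which the language allows.
    return True


print(accepts("aaabb"))  # True
print(accepts("aabbb"))  # False
```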
This post on cs.stackexchange.com has lots of references about this.

Example for LL(1) Grammar which is NOT LALR?

I am now learning about parsers in my Theory of Compilation course.
I need to find an example of a grammar which is LL(1) but not LALR.
I know one should exist. Please help me think of the simplest example of this.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
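The LL(1) side of that claim can be checked with a short Python sketch. The grammar encoding and function names are my own, and the check only compares the FIRST sets of the alternatives; that is enough here because no alternative of S or X can derive ε, so FOLLOW sets never come into play.

```python
GRAMMAR = {
    "S": [["(", "X"], ["E", "]"], ["F", ")"]],
    "X": [["E", ")"], ["F", "]"]],
    "E": [["A"]],
    "F": [["A"]],
    "A": [[]],  # A ::= ε
}


def compute_nullable(grammar):
    """Return the set of nonterminals which can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head not in nullable and any(
                all(sym in nullable for sym in alt) for alt in alternatives
            ):
                nullable.add(head)
                changed = True
    return nullable


def first_of_sequence(grammar, nullable, seq, visiting=frozenset()):
    """Return the terminals which can start a string derived from `seq`."""
    result = set()
    for sym in seq:
        if sym not in grammar:  # terminal
            result.add(sym)
            return result
        if sym not in visiting:  # guard against recursive nonterminals
            for alt in grammar[sym]:
                result |= first_of_sequence(grammar, nullable, alt, visiting | {sym})
        if sym not in nullable:
            return result
    return result


nullable = compute_nullable(GRAMMAR)
predict = {
    head: [first_of_sequence(GRAMMAR, nullable, alt) for alt in alts]
    for head, alts in GRAMMAR.items()
}
for head, sets in predict.items():
    disjoint = sum(len(s) for s in sets) == len(set().union(*sets))
    print(head, sets, "disjoint" if disjoint else "CONFLICT")
```

For S, the three alternatives start with '(', ']' and ')' respectively, and for X with ')' and ']', so a single token of lookahead always selects the alternative.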
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.
