Definition of Shift/Reduce conflict in LR(1) parsing - parsing

Is it correct to state:
" A shift reduce conflict occurs in an LR(1) parser if and only if there exist items:
A -> alpha .
A -> alpha . beta
such that Follow(A) is not disjoint from First(beta).
where A is a non-terminal, alpha and beta are possibly empty sequences of grammar symbols."
(Intuitively this is because there is no way to determine, based on the top of the stack and the lookahead, whether it is appropriate to shift or reduce)
Note: If you guys think this depends on any specifics of the definition of first/follow comment and I will provide.

No, that statement is not correct.
Suppose in some grammar:
the productions include both:
A → α
A → α β
FOLLOW(A) ∩ FIRST(β) ≠ ∅
That is not quite sufficient for the grammar to have a shift-reduce conflict, because it requires that the non-terminal A be reachable in a context in which the lookahead includes FOLLOW(A) ∩ FIRST(β).
So only if we require the grammar to be reduced, or at least to not contain any unreachable or useless productions, is the above sufficient to generate a shift-reduce conflict.
However, the above conditions are not necessary because there is no requirement that the shift and the reduction apply to the same non-terminal, or even "related" non-terminals. Consider the following simple grammar:
prog → stmt
prog → prog stmt
stmt → expr
stmt → func
expr → ID
expr → '(' expr ')'
func → ID '(' ')' stmt
That grammar (which is not ambiguous, as it happens) has a shift-reduce conflict in the state containing ID . ( because ID could be reduced to expr and then stmt, or ( could be shifted as part of func → ID '(' ')' stmt.
Although it's a side-point, it's worth noting the the FOLLOW set is only used in the construction of SLR(k) grammars. The canonical LR(k) construction -- and even the LALR(k) construction -- will successfully generating parsers for grammars in which the use of the FOLLOW set instead of a full lookahead computation will indicate a (non-existent) shift-reduce conflict. The classic example, taken from the Dragon book (example 4.39 in my copy), edited with slightly more meaningful non-terminal names:
stmt → lvalue '=' rvalue
stmt → rvalue
lvalue → '*' rvalue
lvalue → ID
rvalue → lvalue
Here, FOLLOW(rvalue) is {=, $}, and as a result in the state {stmt → lvalue · '=' rvalue, rvalue → lvalue ·}, it appears that a reduction to rvalue is possible, leading to a false shift-reduce conflict.

Related

How to simplify this left-recursive rule?

I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case a direct application it would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε simplify to selectStmt': (setOp selectStmt) | empty? And how is the ε removed here? I assume (setOp selectStmt) | empty simplifies to (setOp selectStmt)? because if it can be empty than it just means the optional ?.
Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1+2+3+4 is syntactically ambiguous, with five different parse trees. ((1+(2+(3+4))), (1+((2+3)+4)), ((1+2)+(3+4)), ((1+(2+3))+4), (((1+2)+3)+4)). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as - or /, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr '+' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that + and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr '+' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1+(2+(3+4))), then the unambiguous grammar would be right-recursive:
expr: term '+' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we chose (as long as * binds more tightly than +), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr '+' term. Note that term does not derive 2 + 3; term only allows multiplicative operators. So 1 + 2 + 3 can only be produced by reducing 1 + 2 to expr and 3 to term, leaving expr '+' term, which matches the production for expr. Consequently, ((1+2)+3) is the only possible parse. (1+(2+3)) could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr '+' term, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
By back-substituting selectStmt into the first production of selectStmt', we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3 is parsed as (S1 (UNION S2 (UNION S3 ()))).
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setop selectStmt | ε), exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3) and the other one as (S1 UNION (S2 UNION S3)). With UNION, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE operator, for example.
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt' represents zero or more repetitions of setOp simpleSelectStmt. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene * operator (zero or more repetitions), we could write selectStmt' as (setOp simpleSelectStmt)*, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp binds to the left or to the right. So the convenience does come at a small price.)

Compilers, finding FIRST set for grammar

I am reading the famous violet dragon book 2nd edition, and can't get the example from page 65, about creating the FIRST set:
We have the following grammar (terminals are bolded):
stmt → expr;
| if ( expr ) stmt
| for ( optexpr ; optexpr ; optexpr ) stmt
| other
optexpr → ε
| expr
And the book suggests that the following is the correct calculation of FIRST:
FIRST(stmt) → {expr, if, for, other} // agree on this
FIRST(expr ;) → {expr} // Where does this come from?
As the comment suggests, from where does the second line come from?
There is no error in the textbook.
The FIRST function is defined (on page 64, emphasis added):
Let α be a string of grammar symbols (terminals and/or nonterminals). We define FIRST(α) to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α.
In this example, expr ; is a string of grammar symbols consisting of two terminals, so it is a possible value of α. Since it includes no nonterminals, it cannot only generate itself; thus the only string of terminals generated from that value of α is precisely expr ;, and the only terminal that will appear in FIRST(α) is the first symbol in that string, expr.
This may all seem to be belabouring the obvious, but it leads up to the important note just under the example you cite:
The FIRST sets must be considered if there are two productions A → α and A → β. Ignoring ε-productions for the moment, predictive parsing requires FIRST(α) and FIRST(β) to be disjoint.
Since expr ; is one of the possible right-hand sides for stmt, we need to compute its FIRST set (even though the computation in this case is trivial) in order to test this prerequisite.

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for pascal. Here is a part of my grammar which is not LR(0) as it has shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you see , EXPR --> AEXPR | AEXPR realop AEXPR cause a shift/reduce conflict on LR(0) parsing. I tried adding a new variable , and some other ways to find an equivalent LR (0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar a LR(1) grammar?
Second: Is it possible to find a LR(0) equivalent for this grammar? what about LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Reduce/reduce conflict in grammar

Let's imagine I want to be able to parse values like this (each line is a separate example):
x
(x)
((((x))))
x = x
(((x))) = x
(x) = ((x))
I've written this YACC grammar:
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'
But I get a reduce/reduce conflict:
$ bison example.y
example.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Any hint as to how to solve it? I am using GNU bison 3.0.2
Reduce/reduce conflicts often mean there is a fundamental problem in the grammar.
The first step in resolving is to get the output file (bison -v example.y produces example.output). Bison 2.3 says (in part):
state 7
4 Expr: Id .
6 Pattern: Id .
'=' reduce using rule 6 (Pattern)
')' reduce using rule 4 (Expr)
')' [reduce using rule 6 (Pattern)]
$default reduce using rule 4 (Expr)
The conflict is clear; after the grammar reads an x (and reduces that to an Id) and a ), it doesn't know whether to reduce the expression as an Expr or as a Pattern. That presents a problem.
I think you should rewrite the grammar without one of Expr and Pattern:
%%
Line: Binding | Expr
Binding: Expr '=' Expr
Expr: Id | '(' Expr ')'
Id: 'x'
Your grammar is not LR(k) for any k. So you either need to fix the grammar or use a GLR parser.
Suppose the input starts with:
(((((((((((((x
Up to here, there is no problem, because every character has been shifted onto the parser stack.
But now what? At the next step, x must be reduced and the lookahead is ). If there is an = somewhere in the future, x is a Pattern. Otherwise, it is an Expr.
You can fix the grammar by:
getting rid of Pattern and changing Binding to Expr | Expr '=' Expr;
getting rid of all the definitions of Expr and replacing them with Expr: Pattern
The second alternative is probably better in the long run, because it is likely that in the full grammar which you are imagining (or developing), Pattern is a subset of Expr, rather than being identical to Expr. Factoring Expr into a unit production for Pattern and the non-Pattern alternatives will allow you to parse the grammar with an LALR(1) parser (if the rest of the grammar conforms).
Or you can use a GLR grammar, as noted above.

Example for LL(1) Grammar which is NOT LALR?

I am learning now about parsers on my Theory Of Compilation course.
I need to find an example for grammar which is in LL(1) but not in LALR.
I know it should be exist. please help me think of the most simple example to this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

Resources