I am learning now about parsers on my Theory Of Compilation course.
I need to find an example for grammar which is in LL(1) but not in LALR.
I know it should be exist. please help me think of the most simple example to this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.
Related
I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case a direct application it would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε simplify to selectStmt': (setOp selectStmt) | empty? And how is the ε removed here? I assume (setOp selectStmt) | empty simplifies to (setOp selectStmt)? because if it can be empty than it just means the optional ?.
Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1+2+3+4 is syntactically ambiguous, with five different parse trees. ((1+(2+(3+4))), (1+((2+3)+4)), ((1+2)+(3+4)), ((1+(2+3))+4), (((1+2)+3)+4)). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as - or /, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr '+' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that + and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr '+' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1+(2+(3+4))), then the unambiguous grammar would be right-recursive:
expr: term '+' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we chose (as long as * binds more tightly than +), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr '+' term. Note that term does not derive 2 + 3; term only allows multiplicative operators. So 1 + 2 + 3 can only be produced by reducing 1 + 2 to expr and 3 to term, leaving expr '+' term, which matches the production for expr. Consequently, ((1+2)+3) is the only possible parse. (1+(2+3)) could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr '+' term, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
By back-substituting selectStmt into the first production of selectStmt', we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3 is parsed as (S1 (UNION S2 (UNION S3 ()))).
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setop selectStmt | ε), exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3) and the other one as (S1 UNION (S2 UNION S3)). With UNION, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE operator, for example.
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt' represents zero or more repetitions of setOp simpleSelectStmt. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene * operator (zero or more repetitions), we could write selectStmt' as (setOp simpleSelectStmt)*, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp binds to the left or to the right. So the convenience does come at a small price.)
I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.
Can someone please confirm for me if the following BNF grammar is LL(1):
S ::= A B B A
A ::= a
A ::=
B ::= b
B ::=
where S is the start symbol and non terminals A and B can derive to epsilon. I know if there are 2 or more productions in a single cell in the parse table, then the grammar isn't LL(1). But if a cell already contains epsilon, can we safely replace it with the new production when constructing the parse table?
This grammar is ambiguous, and thus not LL(1), nor LL(k) for any k.
Take a single a or b as input, and see that it can be matched by either of the A or B references from S. Thus there are two different parse trees, proving that the grammar is ambiguous.
All LL grammars are LR grammars, but not the other way around, but I still struggle to deal with the distinction. I'm curious about small examples, if any exist, of LR grammars which do not have an equivalent LL representation.
Well, as far as grammars are concerned, its easy -- any simple left-recursive grammar is LR (probably LR(1)) and not LL. So a list grammar like:
list ::= list ',' element | element
is LR(1) (assuming the production for element is) but not LL. Such grammars can be fairly easily converted into LL grammars by left-factoring and such, so this is not too interesting however.
Of more interest is LANGUAGES that are LR but not LL -- that is a language for which there exists an LR(1) grammar but no LL(k) grammar for any k. An example is things that need optional trailing matches. For example, the language of any number of a symbols followed by the same number or fewer b symbols, but not more bs -- { a^i b^j | i >= j }. There's a trivial LR(1) grammar:
S ::= a S | P
P ::= a P b | \epsilon
but no LL(k) grammar. The reason is that an LL grammar needs to decide whether to match an a+b pair or an odd a when looking at an a, while the LR grammar can defer that decision until after it sees the b or the end of the input.
This post on cs.stackechange.com has lots of references about this