Example of an LR grammar that cannot be represented by LL? - parsing

All LL grammars are LR grammars, but not the other way around, and I still struggle with the distinction. I'm curious about small examples, if any exist, of LR grammars which do not have an equivalent LL representation.

Well, as far as grammars are concerned, it's easy -- any simple left-recursive grammar is LR (probably LR(1)) and not LL. So a list grammar like:
list ::= list ',' element | element
is LR(1) (assuming the production for element is) but not LL. Such grammars can be fairly easily converted into LL grammars by eliminating the left recursion (and left-factoring where needed), so this is not too interesting.
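To see concretely why the left-recursive form is a problem for a top-down parser, here is a small Python sketch. The function names are made up for illustration, and each token is assumed to be either "," or an element. The naive function transcribes the left-recursive rule directly and never consumes input; the loop version parses the converted form list ::= element (',' element)*.

```python
def parse_list_naive(tokens, pos=0):
    """Direct transcription of  list ::= list ',' element | element."""
    # The first alternative starts with `list`, so a predictive parser has to
    # call itself again before it has consumed a single token.
    return parse_list_naive(tokens, pos)


def naive_parser_terminates(tokens):
    """Return False if the naive parser exhausts the Python call stack."""
    try:
        parse_list_naive(tokens)
    except RecursionError:
        return False
    return True


def parse_list_loop(tokens):
    """Parse  list ::= element (',' element)*  and return the elements."""
    if not tokens or tokens[0] == ",":
        raise SyntaxError("expected an element")
    elements = [tokens[0]]
    pos = 1
    while pos < len(tokens):
        # Every further element must be introduced by a comma.
        if tokens[pos] != "," or pos + 1 >= len(tokens) or tokens[pos + 1] == ",":
            raise SyntaxError(f"unexpected token at position {pos}")
        elements.append(tokens[pos + 1])
        pos += 2
    return elements
```

An LR parser has no such trouble with the original rule, because it only decides that it has seen a list after the tokens are already on its stack.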
Of more interest is LANGUAGES that are LR but not LL -- that is a language for which there exists an LR(1) grammar but no LL(k) grammar for any k. An example is things that need optional trailing matches. For example, the language of any number of a symbols followed by the same number or fewer b symbols, but not more bs -- { a^i b^j | i >= j }. There's a trivial LR(1) grammar:
S ::= a S | P
P ::= a P b | \epsilon
but no LL(k) grammar. The reason is that an LL grammar needs to decide whether to match an a+b pair or an odd a when looking at an a, while the LR grammar can defer that decision until after it sees the b or the end of the input.
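The "defer the decision" idea can be sketched in Python as a one-pass recognizer. This is not an LR parser generated from the grammar above, only a hand-written stand-in: the counter plays the role of the stack of shifted a symbols, and the function name is an assumption.

```python
def accepts_a_b(s):
    """Recognise { a^i b^j | i >= j } in one left-to-right pass."""
    unmatched_as = 0   # a's shifted so far that no b has claimed yet
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False       # an a after a b is never allowed
            unmatched_as += 1      # shift; do not decide yet whether it is paired
        elif ch == "b":
            seen_b = True
            if unmatched_as == 0:
                return False       # more b's than a's
            unmatched_as -= 1      # only now is one a known to be paired
        else:
            return False
    # Any a's still unmatched at the end are the "odd" ones.
    return True
```

The point is that nothing is decided when an a is read; whether it is paired or odd is settled only when a b or the end of the input arrives, which is exactly the information an LL(k) parser does not have in time.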
This post on cs.stackexchange.com has lots of references about this.

Related

How to simplify this left-recursive rule?

I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case, a direct application would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε simplify to selectStmt': (setOp selectStmt) | empty? And how is the ε removed here? I assume (setOp selectStmt) | empty simplifies to (setOp selectStmt)? because if it can be empty, then that is just what the optional ? means.
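For reference, the textbook rule applied in the question (A → Aα | β becomes A → βA', A' → αA' | ε) can be written as a short Python function. This is a minimal sketch that handles only immediate left recursion; the representation (lists of symbol names, with the empty list standing for ε) and the function name are assumptions made for the example.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite  A -> A alpha | beta  as  A -> beta A',  A' -> alpha A' | eps.

    `alternatives` is a list of right-hand sides, each a list of symbols.
    Returns a dict from nonterminal name to its new alternatives; the empty
    list [] stands for epsilon.
    """
    # alpha parts: what follows the leading A in the left-recursive alternatives
    alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == name]
    # beta parts: the alternatives that do not start with A
    betas = [rhs for rhs in alternatives if not rhs or rhs[0] != name]
    if not alphas:
        return {name: alternatives}   # nothing to do
    tail = name + "'"
    return {
        name: [beta + [tail] for beta in betas],
        tail: [alpha + [tail] for alpha in alphas] + [[]],
    }


result = eliminate_left_recursion(
    "selectStmt",
    [["selectStmt", "setOp", "selectStmt"], ["simpleSelectStmt"]],
)
```

For the rule in the question, result maps selectStmt to simpleSelectStmt selectStmt' and selectStmt' to setOp selectStmt selectStmt' | ε, matching the two productions derived by hand.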
Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1+2+3+4 is syntactically ambiguous, with five different parse trees: (1+(2+(3+4))), (1+((2+3)+4)), ((1+2)+(3+4)), ((1+(2+3))+4), and (((1+2)+3)+4). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as - or /, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
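The count of five can be checked mechanically. The following Python sketch enumerates every way the ambiguous rule expr: expr '+' expr can group a flat list of operands; the function name is invented for the example.

```python
def parse_trees(operands, op="+"):
    """Return every fully parenthesised grouping of operands joined by op."""
    if len(operands) == 1:
        return [str(operands[0])]
    trees = []
    # The top-level operator can be any of the len(operands) - 1 occurrences.
    for split in range(1, len(operands)):
        for left in parse_trees(operands[:split], op):
            for right in parse_trees(operands[split:], op):
                trees.append(f"({left}{op}{right})")
    return trees


trees = parse_trees([1, 2, 3, 4])
```

Here len(trees) is 5, and the list contains exactly the five groupings of 1+2+3+4.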
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr '+' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that + and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr '+' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1+(2+(3+4))), then the unambiguous grammar would be right-recursive:
expr: term '+' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we choose (as long as * binds more tightly than +), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr '+' term. Note that term does not derive 2 + 3; term only allows multiplicative operators. So 1 + 2 + 3 can only be produced by reducing 1 + 2 to expr and 3 to term, leaving expr '+' term, which matches the production for expr. Consequently, ((1+2)+3) is the only possible parse. (1+(2+3)) could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr '+' term, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
By back-substituting selectStmt into the first production of selectStmt', we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3 is parsed as (S1 (UNION S2 (UNION S3 ()))).
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setOp selectStmt | ε), exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3) and the other one as (S1 UNION (S2 UNION S3)). With UNION, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE operator, for example.
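A quick Python illustration of why the grouping matters, using Python sets as stand-ins for query results (the sample sets and helper names are made up for the example):

```python
import operator


def fold_left(op, operands):
    """Group as ((x1 op x2) op x3) ..., like the left-recursive grammar."""
    result = operands[0]
    for operand in operands[1:]:
        result = op(result, operand)
    return result


def fold_right(op, operands):
    """Group as x1 op (x2 op (x3 ...)), like the right-recursive grammar."""
    result = operands[-1]
    for operand in reversed(operands[:-1]):
        result = op(operand, result)
    return result


s1, s2, s3 = {1, 2, 3}, {2, 3}, {3}
stmts = [s1, s2, s3]

# Union: both groupings give the same set.
union_agrees = fold_left(operator.or_, stmts) == fold_right(operator.or_, stmts)

# Difference: the grouping changes the answer.
difference_left = fold_left(operator.sub, stmts)    # (S1 - S2) - S3
difference_right = fold_right(operator.sub, stmts)  # S1 - (S2 - S3)
```

With these sample sets, the left grouping of the difference yields {1} while the right grouping yields {1, 3}, so a parser that silently picks the wrong associativity changes the meaning of the query.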
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt' represents zero or more repetitions of setOp simpleSelectStmt. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene * operator (zero or more repetitions), we could write selectStmt' as (setOp simpleSelectStmt)*, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp binds to the left or to the right. So the convenience does come at a small price.)
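In a hand-written recursive-descent parser, the repetition becomes a loop, and the loop body is where the associativity gets decided. The Python sketch below builds a left-associative tree; it assumes a pre-tokenized input in which every non-operator token stands for a whole simpleSelectStmt, and the operator spellings in SET_OPS are assumptions as well.

```python
SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}  # assumed operator spellings


def parse_select_stmt(tokens):
    """Parse  selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*.

    Returns nested (left, op, right) tuples grouped to the left.
    """
    def simple(pos):
        # Each non-operator token stands in for one simpleSelectStmt.
        if pos >= len(tokens) or tokens[pos] in SET_OPS:
            raise SyntaxError(f"expected simpleSelectStmt at token {pos}")
        return tokens[pos], pos + 1

    tree, pos = simple(0)
    while pos < len(tokens):
        op = tokens[pos]
        if op not in SET_OPS:
            raise SyntaxError(f"expected a set operator at token {pos}")
        right, pos = simple(pos + 1)
        # Folding the tree built so far into the left operand is what makes
        # the result left-associative.
        tree = (tree, op, right)
    return tree


tree = parse_select_stmt(["S1", "UNION", "S2", "EXCEPT", "S3"])
```

The resulting tree is (("S1", "UNION", "S2"), "EXCEPT", "S3"). Collecting the operands in a list first and folding from the right would give the other grouping from the same grammar rule.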

LALR(1) and SLR(1) parser

Our teacher gave us a hypothesis and wants us to investigate and validate it. We have SLR(1) and LALR(1) parsers. The hypothesis is:
Suppose we have a language structure called X. If we cannot provide an LALR(1) grammar for this structure, we cannot provide an SLR(1) grammar either, and maybe an LR(1) grammar could solve the problem. But if we can provide an LALR(1) grammar for this structure, we can provide an SLR(1) grammar too.
If you search the internet, you find a lot of sites which say that this grammar is not SLR(1) but is LALR(1):
S -> R
S -> L = R
L -> * R
L -> id
R -> L
("id", "*" and "=" are terminals and others are non-terminals)
If we try to construct the SLR(1) items, we will see a shift/reduce conflict. That is true, but our hypothesis says something else: it talks about the language described by the grammar, not the grammar itself! We can remove "R" and convert the grammar to LL(1), and the result is also SLR(1) and LALR(1):
S -> LM
M -> epsilon
M -> = L
L -> * L
L -> id
You can try this grammar and see that it describes the same language as the previous grammar and is both SLR(1) and LALR(1)!
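One way to try the rewritten grammar is a small recursive-descent recognizer, sketched here in Python. It assumes the input is already a list of the token strings "id", "*" and "="; the function names are invented for the example.

```python
def accepts_assignment(tokens):
    """Recognise the language of  S -> L M,  M -> epsilon | = L,  L -> * L | id."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_l():
        nonlocal pos
        # L -> * L | id : the next token alone selects the alternative.
        if peek() == "*":
            pos += 1
            parse_l()
        elif peek() == "id":
            pos += 1
        else:
            raise SyntaxError("expected '*' or 'id'")

    try:
        parse_l()                 # S -> L M
        if peek() == "=":         # M -> = L
            pos += 1
            parse_l()
        # otherwise M -> epsilon, which consumes nothing
        if pos != len(tokens):
            raise SyntaxError("trailing input")
    except SyntaxError:
        return False
    return True
```

Every choice is made from a single token of lookahead, which is what being LL(1) means in practice.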
So my problem is not finding a grammar which is LALR(1) but not SLR(1); there are a lot of them on the internet. I want to know: is there any language which has an LALR(1) grammar but no SLR(1) grammar? If our hypothesis is true, then there is no need for LALR(1), and SLR(1) could do everything for us; however, LALR(1) is easier to use, and maybe in the future some language will refute this hypothesis.
I'm sorry for bad English.
Thanks.
Every LR(k) language has an SLR(1) grammar.
There is a proof in this paper from 1976, which provides an algorithm for constructing the SLR(1) grammar, if you have an LR(k) grammar and know the value of k. Unfortunately, there is no algorithm which can tell you definitely whether a CFG is LR(k), much less provide the value of k. (If you somehow know that the grammar is LR(k), you can try successive values of k until you find one which works. But that procedure will never terminate if the grammar is not LR(k).)
The above comes from this reference question on the Computing Science StackExchange site, which is a better place for this kind of question.
LR(1) > LALR(1) > SLR(1)
LR(1) is the most powerful, LALR(1) is less powerful and SLR(1) is the least powerful.
This is a fact, because of the way the lookahead sets are computed. (1) means lookahead of one token. Here is a grammar that is LR(1), but not LALR(1) and definitely not SLR(1):
G : S... <eof>
;
S : c A1 t ';'
| c A2 n ';'
| r A2 t ';'
| r A1 n ';'
;
A1 : a
;
A2 : a
;
This grammar cannot be made LALR(1) or SLR(1). Oh sure, you can remove A1 and A2 and replace them with a, but then you have a different grammar. The problem is that an action may be attached to the rule A1 : a and a different action may be attached to A2 : a. For example:
A1 : a => X()
;
A2 : a => Y()
;
An SLR(1) parser generator will report conflicts in your grammar that are not real conflicts. I'm talking about the real world using large grammars (e.g. C11.grm).
The SLR(1) lookahead computation is simplistic, getting the lookaheads from the grammar, instead of the LR(0) state machine created by an LALR(1) parser generator.
That is why Frank DeRemer's paper, in 1969, on LALR(1) is so important.
Looking only at the grammar, A1 can be followed by either t or n (and so can A2), so SLR(1) reports a conflict. But there is an LR(1) state machine in which there is no conflict, because each state records which token can follow A1 in that particular context.
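The SLR(1) view of this grammar can be reproduced with a small FOLLOW-set computation in Python. This is a sketch under two stated assumptions: it only handles grammars without empty productions (true of the S, A1 and A2 rules above), and it leaves out the G : S... <eof> wrapper rule, using "$" as a stand-in end marker. All function and variable names are invented for the example.

```python
GRAMMAR = {
    "S": [["c", "A1", "t", ";"], ["c", "A2", "n", ";"],
          ["r", "A2", "t", ";"], ["r", "A1", "n", ";"]],
    "A1": [["a"]],
    "A2": [["a"]],
}


def first_sets(grammar):
    """FIRST sets for a grammar with no empty productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                head = rhs[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(grammar, start, end_marker="$"):
    """FOLLOW sets for a grammar with no empty productions."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(end_marker)
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue          # terminals have no FOLLOW set
                    if i + 1 < len(rhs):
                        nxt = rhs[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:
                        new = follow[nt]  # sym ends the rule: inherit FOLLOW
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


follow = follow_sets(GRAMMAR, "S")
slr_conflict_tokens = follow["A1"] & follow["A2"]
```

Both FOLLOW sets come out as {t, n}. After reading c a (or r a) the parser is in a state containing both A1 : a . and A2 : a ., and SLR(1) uses these overlapping FOLLOW sets as reduce lookaheads, hence the reported reduce-reduce conflict.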

Grammar for arithmetic expressions without alternatives?

How do I write an unambiguous grammar for arithmetic expressions, e.g. a+(b+c)*d?
For example:
E -> E + T | T
T -> T * F | F
F -> ( E ) | i
WITHOUT alternatives - in my case without |T and |F and |i
This should be possible by adding more sentences to the grammar, but I'm having a hard time figuring out how...
NOTE: this is for university... so maybe not a good real-world grammar :)
What you're asking for is impossible. If you do not have alternative productions in your grammar, then it is not possible for there to be any decisions about which productions to use. As a result, your grammar will either generate no strings, or will generate a single string. Grammars with these properties are called LL(0) grammars and are not at all practical.
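The argument can be made concrete with a short Python sketch. It assumes a grammar represented as a dict that maps each nonterminal to its single right-hand side (a list of symbols); the function name is invented for the example.

```python
def only_string(grammar, symbol, active=frozenset()):
    """Return the one string `symbol` derives, or None if it derives nothing.

    `grammar` maps each nonterminal to its single right-hand side.
    """
    if symbol not in grammar:
        return symbol            # a terminal derives itself
    if symbol in active:
        return None              # recursion with no alternative to escape it
    parts = []
    for sym in grammar[symbol]:
        part = only_string(grammar, sym, active | {symbol})
        if part is None:
            return None
        parts.append(part)
    return "".join(parts)


chain = {"E": ["T"], "T": ["F"], "F": ["i"]}
recursive = {"E": ["E", "+", "T"], "T": ["F"], "F": ["i"]}
```

Here only_string(chain, "E") is "i", the single sentence of that grammar, while only_string(recursive, "E") is None: once the | T alternative is gone, E can never stop expanding.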
Hope this helps!

Is this BNF grammar LL(1)?

Can someone please confirm for me if the following BNF grammar is LL(1):
S ::= A B B A
A ::= a
A ::=
B ::= b
B ::=
where S is the start symbol and the nonterminals A and B can derive epsilon. I know that if there are 2 or more productions in a single cell of the parse table, then the grammar isn't LL(1). But if a cell already contains epsilon, can we safely replace it with the new production when constructing the parse table?
This grammar is ambiguous, and thus not LL(1), nor LL(k) for any k.
Take a single a or b as input, and see that it can be matched by either of the A or B references from S. Thus there are two different parse trees, proving that the grammar is ambiguous.
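The ambiguity is easy to confirm by brute force. In the Python sketch below (names invented for the example), each of the four symbols in S ::= A B B A independently picks one of its two productions, so every tuple of choices corresponds to one parse tree.

```python
from itertools import product


def count_parses(target):
    """Count the parse trees of `target` under  S ::= A B B A."""
    options = {"A": ["a", ""], "B": ["b", ""]}   # "" is the empty production
    count = 0
    for choice in product(options["A"], options["B"], options["B"], options["A"]):
        if "".join(choice) == target:
            count += 1
    return count
```

count_parses("a") and count_parses("b") both return 2, so those inputs have two parse trees each, whereas count_parses("abba") returns 1.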

Example for LL(1) Grammar which is NOT LALR?

I am currently learning about parsers in my Theory of Compilation course.
I need to find an example of a grammar which is LL(1) but not LALR.
I know one should exist. Please help me think of the simplest example for this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
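The LALR(1) merge described above can be mimicked in a few lines of Python. The two LR(1) states are written out by hand from the grammar rather than computed (so they are an assumption of this sketch): one is reached after A at the very start of the input, the other after A following '('. Function and variable names are invented for the example.

```python
def reduce_reduce_conflicts(state):
    """state maps each completed item to its lookahead set."""
    items = list(state.items())
    conflicts = {}
    for i, (item1, lookahead1) in enumerate(items):
        for item2, lookahead2 in items[i + 1:]:
            shared = lookahead1 & lookahead2
            if shared:
                conflicts[(item1, item2)] = shared
    return conflicts


def merge(state1, state2):
    """Merge two LR(1) states with the same LR(0) core, as LALR(1) does."""
    return {item: state1[item] | state2[item] for item in state1}


# Hand-derived LR(1) states (items with their lookaheads).
AFTER_START = {"E ::= A .": {"]"}, "F ::= A .": {")"}}   # S context
AFTER_PAREN = {"E ::= A .": {")"}, "F ::= A .": {"]"}}   # X context

lr1_conflicts = [reduce_reduce_conflicts(AFTER_START),
                 reduce_reduce_conflicts(AFTER_PAREN)]
lalr_conflicts = reduce_reduce_conflicts(merge(AFTER_START, AFTER_PAREN))
```

Each LR(1) state on its own is conflict-free, but the merged state has both items reducible on ')' and on ']', which is the reduce-reduce conflict LALR(1) reports.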
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

Resources