So I have some grammar that doesn't work for a top-down parser due to it having left recursion:
L::= A a | B b
A::= L b | a a
B::= b B b | b a
So in order to fix this, I have to remove the left recursion. To do this, I do a little substitute-like-thing:
L::= A a | B b
A::= A a b | B b b | a a (I plugged in the possible values of "L")
A then turns to (I believe):
A::= a a A' | B b b
A'::= a b A' | ε
I'm fairly certain that I'm correct up to there (wouldn't be surprised if I'm not, though). Where I'm struggling is now removing the left recursion out of "B b b". I've tried going about this so many ways, and I don't think any of them work. Here's one that seems most logical, but ugly as well (thus saying it's probably wrong). Starting by manipulating B::= b B b | b a
B::= b B b | b a
B::= b B' (they both start with b, so maybe i can pull it out?)
B'::= B b | a
B'::= b B' b | a (substituted B's value in)
B'::= b B" | a
B"::= b B" |a B" | ε
So I guess to show what the finalized B's would be:
B::= b B'
B'::= b B" | a
B"::= b B" | a B" | ε
This seems way too ugly to be correct. Especially since I'd have to plug that into the new "A" terminals that I created.
Can someone help me out? No idea if I'm going about this the right way. I'm supposed to be able to create an LL(1) parse table afterward (should be able to do that part on my own).
Thanks.
In a parser that tries to expand nonterminals from the left, if some nonterminal can expand to a string with itself on the left, the parser may expand that nonterminal forever without actually parsing anything. This is left recursion. On the other hand, it is perfectly fine if a nonterminal expands to a string with some different nonterminal on the left, as long as no chain of expansions produces a string with the original nonterminal on the left. Similarly, it is fine if nonterminals aren't all the way to the right in the expansion, as long as they're not all the way to the left.
tl;dr There's nothing wrong with B b b or b B b. You've removed all the left recursion. You don't need to keep going.
Related
Consider the following CFG:
S := AbC
A := aA | epsilon
C := Ac
Here, FIRST(A) = FIRST(B) = FIRST(C) = {a, ε}, so all the FIRST sets are the same. However, this grammar is supposedly LL(1). How is that possible? Wouldn't that mean that there would be a bunch of FIRST/FIRST conflicts everywhere?
There's nothing fundamentally wrong about having multiple nonterminals that have the same FIRST sets. Things only become problematic if you have multiple nonterminals with overlapping FIRST or FOLLOW sets in a context where you have to choose between a number of production options.
As an example, consider this simple grammar:
A → bB | cC
B → b | c
C → b | c
Notice that all three of A, B, and C have the same FIRST set, namely {b, c}. But this grammar is also LL(1). While you can formally convince yourself of this by writing out the actual LL(1) parsing table, you can think of this in another way as well. Imagine you're reading the nonterminal A, and you see the character b. Which production should you pick: A → bB, or A → cC? Well, there's no reason to pick A → cC, because that would put c at the front of your string. So don't pick that one. Instead, pick A → bB. Similarly, suppose you're reading an A and you see the character c. Which production should you pick? You'd never want to pick A → bB, since that would put b at the front of your string. Instead, you'd pick A → cC.
Notice that in this discussion, we never stopped to think about what FIRST(B) or FIRST(C) was. It simply didn't come up because we never needed to know what characters could be produced there.
Now, let's look at your example. If you're trying to expand an S, there's only one possible production to apply, which is S → AbC. So there's no possible conflict here; when you see S, you always apply that rule. Similarly, if you're trying to expand a C, there's no choice of what to do. You have to expand C → Ac.
So now let's think about the nonterminal A, where there really is a choice of what to do next. If you see the character a, then we have to decide - do we expand out A → aA, or do we expand out A → ε? In answering that question, we have to think about the FOLLOW set of A, since the production A → ε would only make sense to pick if we saw a terminal symbol where we basically just want to get A out of the way. Here, FOLLOW(A) = {b, c}, with the b coming from the production S → AbC and the c coming from the production C → Ac. So we'd only pick A → ε if we see b or c, not if we see a. That means that
on reading a, we pick A → aA, and
on reading b o r c, we pick A → ε.
Notice that in this discussion we never needed to think about what FIRST(B) or FIRST(C) was. In fact, we never even needed to look at what FIRST(A) was either! So that's why there isn't necessarily a conflict. Were we to encounter a scenario where we needed to compare FIRST(A) against FIRST(B) or something like that, then yes, we'd definitely have a conflict. But that never came up, so no conflict exists.
I'm having problems understanding an online explanation of how to remove the left recursion in this grammar. I know how to remove direct recursion, but I'm not clear how to handle the indirect. Could anyone explain it?
A --> B x y | x
B --> C D
C --> A | c
D --> d
The way I learned to do this is to replace one of the offending non-terminal symbols with each of its expansions. In this case, we first replace B with its expansions:
A --> B x y | x
B --> C D
becomes
A --> C x y | D x y | x
Now, we do the same for non-terminal symbol C:
A --> C x y | D x y | x
C --> A | c
becomes
A --> A x y | c x y | D x y | x
The only other remaining grammar rule is
D --> d
so you can also make that replacement, leaving your entire grammar as
A --> A x y | c x y | d x y | x
There is no indirect left recursion now, since there is nothing indirect at all.
Also see here.
To eliminate left recursion altogether (not merely indirect left recursion), introduce the A' symbol from your own materials (credit to OP for this clarification and completion):
A -> x A'
A' -> xyA' | cxyA' | dxyA' | epsilon
Response to naomik's comments
Yes, grammars have interesting properties, and you can characterize certain semantic capabilities in terms of constraints on grammar rules. There are transformation algorithms to handle certain types of parsing problems.
In this case, we want to remove left-recursion: one desirable property of a grammar is that the use of any rule must consume at least one input token (terminal symbol). Left-recursion opens a door to infinite recursion in the parser.
I learned these things in my "Foundations of Computing" and "Compiler Construction" classes many years ago. Instead of writing a parser to adapt to a particular grammar, we'd transform the grammar to fit the parser style we wanted.
The following grammar generates the sentences a, a, a, b, b, b, ..., h, b. Unfortunately it is not LR(1) so cannot be used with tools such as "yacc".
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that a parser cannot decide, having seen an "a", and having lookahead "comma", whether to reduce or shift. So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases, "having seen an 'a'" and "C unchanged". You asked if there's fix that doesn't expand C. There isn't, but you could expand C "a little bit" by removing "a" from C, since that's the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.
Let's assume we have the following CFG G:
A -> A b A
A -> a
Which should produce the strings
a, aba, ababa, abababa, and so on. Now I want to remove the left recursion to make it suitable for predictive parsing. The dragon book gives the following rule to remove immediate left recursions.
Given
A -> Aa | b
rewrite as
A -> b A'
A' -> a A'
| ε
If we simply apply the rule to the grammar from above, we get grammar G':
A -> a A'
A' -> b A A'
| ε
Which looks good to me, but apparently this grammar is not LL(1), because of some ambiguity. I get the following First/Follow sets:
First(A) = { "a" }
First(A') = { ε, "b" }
Follow(A) = { $, "b" }
Follow(A') = { $, "b" }
From which I construct the parsing table
| a | b | $ |
----------------------------------------------------
A | A -> a A' | | |
A' | | A' -> b A A' | A' -> ε |
| | A' -> ε | |
and there is a conflict in T[A',b], so the grammar isn't LL(1), although there are no left recursions any more and there are also no common prefixes of the productions so that it would require left factoring.
I'm not completely sure where the ambiguity comes from. I guess that during parsing the stack would fill with S'. And you can basically remove it (reduce to epsilon), if it isn't needed any more. I think this is the case if another S' is below on on the stack.
I think the LL(1) grammar G'' that I try to get from the original one would be:
A -> a A'
A' -> b A
| ε
Am I missing something? Did I do anything wrong?
Is there a more general procedure for removing left recursion that considers this edge case? If I want to automatically remove left recursions I should be able to handle this, right?
Is the second grammar G' a LL(k) grammar for some k > 1?
The original grammar was ambiguous, so it is not surprising that the new grammar is also ambiguous.
Consider the string a b a b a. We can derive this in two ways from the original grammar:
A ⇒ A b A
⇒ A b a
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
A ⇒ A b A
⇒ A b A b A
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
Unambiguous grammars are, of course possible. Here are right- and left-recursive versions:
A ⇒ a A ⇒ a
A ⇒ a b A A ⇒ A b a
(Although these represent the same language, they have different parses: the right-recursive version associates to the right, while the left-recursive one associates to the left.)
Removing left recursion cannot introduce ambiguity. This kind of transformation preserves ambiguity. If the CFG is already ambiguous, the result will be ambiguous too, and if the original is not, the resulting neither.
Unfortunately, it is not possible for ANTLR to support direct-left recursion when the rule has parameters passed. The only viable option is to remove the left recursion. Is there a way to remove the left-recursion in the following grammar ?
a[int x]
: b a[$x] c
| a[$x - 1]
(
c a[$x - 1]
| b c
)
;
The problem is in the second alternative involving left recursion. Any kind of help would be much appreciated.
Without the parameters and easier formatting, it would look like this:
a
: b a c
| a (c a | b c)
;
When a's left recursive alternative is matched n times, it would just mean that (c a | b c) will be matched n times, pre-pended with the terminating b a c (the first alternative). That means that this rule will always start with b a c, followed by zero or more occurrences of (c a | b c):
a
: b a c (c a | b c)*
;