How to remove indirect left recursion - parsing

I need help removing the indirect left recursion from this grammar:
A -> B (sB)*
| dAd
| z
B -> <id>
| sB
| A
So you could move from A->B->A.... without consuming any characters.
I tried to fix it a couple different ways but keep running into issues because of this bit (sB)*
I am not sure if I'm doing something wrong or if the grammar is wrong in general.

Before we begin, let's number your productions, so that we have something to refer to:
1: A -> B (s B)*
2: A -> d A d
3: A -> z
4: B -> <id>
5: B -> s B
6: B -> A
Since you're trying to eliminate left recursion, I can only assume you're trying to apply LL parsing. However, this grammar is ambiguous, so it can't be an LL(1) grammar. For instance, the phrase zszsz can be (leftmost) derived from A in more than one way:
A ->+ B s B (1)
->+ A s B (6)
->+ z s B (3)
->+ z s B s B (1)
->+ z s z s z (6, 3, 6, 3)
A ->+ B s B (1)
->+ A s B (6)
->+ B s B s B (1)
->+ A s B s B (6)
->+ z s B s B (3)
->+ z s z s z (6, 3, 6, 3)
The first step would be to simplify this grammar, so that every production only has sequences of terminals and non-terminals on the "expanded" side. Rule #1 has a Kleene star, so let's get rid of it by replacing it by a non-terminal C:
1: A -> B C
2: A -> d A d
3: A -> z
4: B -> <id>
5: B -> s B
6: B -> A
7: C -> <empty>
8: C -> s B C
Now, all productions are simple.
Next, we identify indirect left recursion (if any), and turn it into direct left recursion. By looking at all productions that start with a non-terminal, we find that A and B are involved in indirect left recursion (through rules #1 and #6). We can break this loop by substituting B in rule #1 with what it can produce; we replace rule #1 with
9: A -> <id> C
10: A -> s B C
11: A -> A C
Alternatively, we could break the loop by substituting the productions #1, #2, and #3 in #6. However we do it, the resulting grammar is free of indirect left recursion.
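The substitution step can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the grammar and the helper name `substitute_leading` are assumptions made for this example, not part of any parser tool.

```python
# Grammar after the Kleene star was replaced by C (rules 1-8 above).
# Each right-hand side is a tuple of symbols; () stands for <empty>.
grammar = {
    "A": [("B", "C"), ("d", "A", "d"), ("z",)],
    "B": [("<id>",), ("s", "B"), ("A",)],
    "C": [(), ("s", "B", "C")],
}

def substitute_leading(grammar, target, inner):
    """Replace every `target` production that starts with `inner`
    by one copy per production of `inner` (hypothetical helper)."""
    result = []
    for rhs in grammar[target]:
        if rhs and rhs[0] == inner:
            for inner_rhs in grammar[inner]:
                result.append(inner_rhs + rhs[1:])
        else:
            result.append(rhs)
    return {**grammar, target: result}

new_grammar = substitute_leading(grammar, "A", "B")
for rhs in new_grammar["A"]:
    print("A ->", " ".join(rhs))
```

The first three productions printed correspond to rules #9, #10 and #11; the production `A -> A C` is the direct left recursion that the substitution exposes.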
Then we eliminate direct left recursion (if any) in our grammar. This occurs in the non-terminal A, as a result of our substitution:
2: A -> d A d
3: A -> z
...
9: A -> <id> C
10: A -> s B C
11: A -> A C
We introduce another non-terminal D, and replace these rules with
12: A -> d A d D
13: A -> z D
14: A -> <id> C D
15: A -> s B C D
17: D -> <empty>
18: D -> A D
The resulting grammar is free of left recursion:
4: B -> <id>
5: B -> s B
6: B -> A
7: C -> <empty>
8: C -> s B C
12: A -> d A d D
13: A -> z D
14: A -> <id> C D
15: A -> s B C D
17: D -> <empty>
18: D -> A D
As stated in the beginning, you can't construct an LL(1) parsing table from this grammar, because the leftmost derivation of zszsz from A is still ambiguous.

Interesting. I can't see a mechanical way to do it. Is this how the language is specified or did you end up with it by some other simplifications? Anyway, a solution for the specific issue is to "inline" B in the left-recursive part:
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | sB | A
The basic idea is to substitute the non-recursive terms into the recursive part and move the recursive term to the end.

Start with
A -> B (sB)* | dAd | z
B -> <id> | sB | A
Substitute
A -> (<id> | sB | A) (sB)* | dAd | z
Define
C -> (sB)*
Substitute
A -> (<id> | sB | A) C | dAd | z
Factor
A -> <id> C | sBC | AC | dAd | z
Define
D -> <id> C | sBC | dAd | z
So
A -> D | AC
Remove left recursion
A -> D (C)*
Substitute for C and D
A -> (<id> (sB)* | sB(sB)* | dAd | z) (sB)**
Since x** = x*
A -> (<id> (sB)* | sB(sB)* | dAd | z) (sB)*
Since x*x* = x*
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | sB | A
Same result as Sreenivasa's.
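The last two simplification steps (x** = x* and x*x* = x*) can be spot-checked by brute force. The sketch below treats A and B as opaque letters and uses "i" in place of <id>, both of which are assumptions made just for this check; it compares the expression before and after simplification on every short string.

```python
import re
from itertools import product

# A and B are treated as opaque letters here; "i" stands for <id>.
before = re.compile(r"(?:i(?:sB)*|sB(?:sB)*|dAd|z)(?:sB)*")
after = re.compile(r"(?:i|sB|dAd|z)(?:sB)*")

def same_language(max_len, alphabet="isBdAz"):
    """Compare both expressions on every string up to max_len symbols."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if bool(before.fullmatch(word)) != bool(after.fullmatch(word)):
                return False
    return True

print(same_language(6))
```

This is evidence rather than a proof, since it only covers strings up to the chosen length.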
Edit added after seeing @Rymoid's answer.
At this point the left recursion has been removed, so we are done. But as pointed out by @Rymoid, the grammar is still ambiguous and so not LL(1). Below I will try to cope with the ambiguity, but not to find an LL(1) grammar.
One problem is that, since A =>* sB, the choice sB | A is ambiguous and unneeded. Let's start by removing that choice. We have
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | A
Likewise A =>* <id>, so the choice <id> | A is ambiguous and not needed. We have
A -> (<id> | sB | dAd | z) (sB)*
B -> A
And then we don't need B anymore.
A -> (<id> | sA | dAd | z) (sA)*
The remaining problem is that, since s is in the follow set of A, there is no way to tell, with one token of lookahead, whether to stay in the (sA)* loop or exit it.
The original question did not ask for an LL(1) grammar, but since the post is tagged [JavaCC], we might assume that what is wanted is one that works with JavaCC. That's not quite the same thing as being LL(1), although being LL(1) implies that the grammar will work well with JavaCC.
I'll assume all uses of A outside of the definition of A are definitely not followed by an s. To be concrete about this, I'll assume that there is (only) one more production, which is S -> A <EOF>, and that S is the start nonterminal. But really the important thing is that you never have an A followed by an s except because of the loop in A's current definition.
We have
S -> A <EOF>
A -> (<id> | sA | dAd | z) (sA)*
When you have an ambiguous grammar but want to eliminate ambiguity, the question to ask yourself is: Which parse do I want in the ambiguous cases? Two answers are: "Stay in the loop as long as possible." and "Jump out of the loop as soon as possible." (Other answers are possible, but unlikely.)
"Stay in the loop as long as possible"
This is the JavaCC default, so there is no need to change the grammar. It might generate a warning. It might be possible to suppress that warning with LOOKAHEAD( <s> ) at the start of the loop.
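To illustrate what "stay in the loop as long as possible" means, here is a hand-written recursive-descent recognizer in Python. It is a sketch of the greedy strategy, not JavaCC output: tokens are single characters, "i" stands in for <id>, and the function name `accepts` is made up for this example.

```python
def accepts(text):
    """Recognize S -> A <EOF> with A -> (<id> | sA | dAd | z) (sA)*.

    Tokens are single characters; "i" stands in for <id> (an assumption).
    The (sA)* loop is greedy: it continues whenever the next token is "s".
    """
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r} at position {pos}")
        pos += 1

    def parse_a():
        token = peek()
        if token in ("i", "z"):
            expect(token)
        elif token == "s":
            expect("s")
            parse_a()
        elif token == "d":
            expect("d")
            parse_a()
            expect("d")
        else:
            raise SyntaxError(f"unexpected {token!r} at position {pos}")
        while peek() == "s":      # stay in the loop as long as possible
            expect("s")
            parse_a()

    try:
        parse_a()
        return pos == len(text)   # <EOF>
    except SyntaxError:
        return False

print(accepts("zszsz"), accepts("dzszd"), accepts("zs"))
```

With this strategy the innermost A always claims the next s, so in zszsz the second z owns the trailing "s z".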
"Exit the loop as soon as possible"
Make two versions of A. A0 is never followed by an s. A1 is always followed by an s. (In fact it is followed by the first s possible, so the (sA)* part is not wanted. This choice corresponds to bailing out of the loop as soon as possible.)
S -> A0 <EOF>
A0 -> (<id> | sA0 | dA0d | z) [ s (A1s)* A0 ]
A1 -> <id> | sA1 | dA0d | z
I'm fairly sure this is unambiguous and that A0 defines the same language as A. It is not LL(1) and JavaCC will give a warning that should be heeded.
To make it suitable for JavaCC we can add a syntactic lookahead of LOOKAHEAD( A1 <s> ) to the start of the loop.

Related

Can a follow-follow conflict exist in a grammar?

I know that First/First and First/Follow conflicts exist in a grammar which makes the grammar "not LL(1)". I was just wondering if Follow/Follow conflict exist in a grammar.
Yes, this is possible, but it requires an unusual configuration to make it happen. Consider the following grammar, which has been augmented with a new start symbol:
S' → S$
S → tT
T → A | B
A → ε
B → ε
Now, let's imagine trying to fill in our LL(1) parse table, which is shown here:
          $           t
     +----------+----------+
S'   |          | S' -> S$ |
     +----------+----------+
S    |          | S -> tT  |
     +----------+----------+
T    | T -> A   |          |
     | T -> B   |          |
     +----------+----------+
A    | A -> ε   |          |
     +----------+----------+
B    | B -> ε   |          |
     +----------+----------+
Notice that there are two items in the entry for (T, $). And that makes sense: if we have the active nonterminal T and see a $, we know that we need to select a production that's going to expand out to the empty string. And we have two different ways of doing this: we could use T → A or T → B, with the ultimate goal of expanding each of those nonterminals out to the empty string. This is a problem - we can't predict which route to take.
Now, what sort of conflict is this? It can't be a FIRST/FIRST conflict, because FIRST(A) = {ε} and FIRST(B) = {ε}, so neither A nor B has any terminals in its first set. It can't be a FIRST/FOLLOW conflict for the same reason.
That means that it's the rare FOLLOW/FOLLOW conflict - we know that we'd choose the production based on what's in the FOLLOW sets of A and B, and yet they're exactly identical to one another and so the parser can't choose what to do next unambiguously.
This is perhaps a simpler example:
S → A a
A → B | C
B → ε
C → ε
Here, since a is both in the FOLLOW of B and C, on (A, a) there will be a conflict between A → B and A → C. Note that there are no other conflicts.

This context-free grammar is ambiguous and I'm not sure why. The SLR(1) compiler I'm building doesn't work the way I expect it to

I'm building a syntax parser. It is supposed to be SLR(1), but I believe there are some shift/reduce conflicts, or some other kind of conflict, that make the parser reject strings too early. Here is the grammar:
Note: I did left factor the grammar to see if that was the problem, but that doesn't get rid of ambiguity. However this is the original grammar without left factoring
P'' -> P'$
P' -> P
P -> C | C;D
D -> R | RD
R -> pu{P}
C -> I | I;C
I -> h | O | A | R | Z
O -> i(V) | z(V)
Y -> u
V -> S | N
S -> u
N -> u
A -> S=s | S=S | N=X
X -> N | b | L
L -> d(X,X) | s(X,X) | m(X,X)
R -> f(B)t{C} | f(B)t{C}1{C}
B -> e(V,V) | (N<N) | (N>N) | nB | a(B,B) | o(B,B)
Z -> w(B){C} | r(N=0;N<N;N=a(N,1)){C}
I understand this grammar is quite big, but if you could help me here you would be a life saver. Thank you in advance!
Having recognized an I, and with ; as the next symbol, there's a shift-reduce conflict:
The production C -> I;C says to shift the ;.
The production P -> C;D says to reduce via C -> I.
So the grammar is not SLR(1).

Is it possible to transform this grammar to be LR(1)?

The following grammar generates the sentences "a, a", "a, b", "b, b", ..., "h, b". Unfortunately it is not LR(1) so cannot be used with tools such as "yacc".
S -> a comma a.
S -> C comma b.
C -> a | b | c | d | e | f | g | h.
Is it possible to transform this grammar to be LR(1) (or even LALR(1), LL(k) or LL(1)) without the need to expand the nonterminal C and thus significantly increase the number of productions?
Not as long as you have the nonterminal C unchanged preceding comma in some rule.
In that case it is clear that a parser cannot decide, having seen an "a", and having lookahead "comma", whether to reduce or shift. So with C unchanged, this grammar is not LR(1), as you have said.
But the solution lies in the two phrases, "having seen an 'a'" and "C unchanged". You asked if there's a fix that doesn't expand C. There isn't, but you could expand C "a little bit" by removing "a" from C, since that's the source of the problem:
S -> a comma a .
S -> a comma b .
S -> C comma b .
C -> b | c | d | e | f | g | h .
So, we did not "significantly" increase the number of productions.
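Because both grammars are finite, it is easy to confirm that the rewrite keeps the language unchanged. In this small sketch each sentence is encoded as a pair (token before the comma, token after the comma); that encoding is just a convenience for this example.

```python
# Original grammar: S -> a comma a | C comma b, with C -> a | b | ... | h
original_c = list("abcdefgh")
original = {("a", "a")} | {(c, "b") for c in original_c}

# Rewritten grammar: S -> a comma a | a comma b | C comma b, with C -> b | ... | h
rewritten_c = list("bcdefgh")
rewritten = {("a", "a"), ("a", "b")} | {(c, "b") for c in rewritten_c}

print(original == rewritten)
print(len(original))
```

Both grammars generate the same nine sentences; the rewrite costs one extra production for S and removes one alternative from C.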

Can removing left recursion introduce ambiguity?

Let's assume we have the following CFG G:
A -> A b A
A -> a
Which should produce the strings
a, aba, ababa, abababa, and so on. Now I want to remove the left recursion to make it suitable for predictive parsing. The dragon book gives the following rule to remove immediate left recursions.
Given
A -> Aa | b
rewrite as
A -> b A'
A' -> a A'
| ε
If we simply apply the rule to the grammar from above, we get grammar G':
A -> a A'
A' -> b A A'
| ε
Which looks good to me, but apparently this grammar is not LL(1), because of some ambiguity. I get the following First/Follow sets:
First(A) = { "a" }
First(A') = { ε, "b" }
Follow(A) = { $, "b" }
Follow(A') = { $, "b" }
From which I construct the parsing table
     | a          | b              | $        |
-----+------------+----------------+----------+
A    | A -> a A'  |                |          |
A'   |            | A' -> b A A'   | A' -> ε  |
     |            | A' -> ε        |          |
and there is a conflict in T[A',b], so the grammar isn't LL(1), although there are no left recursions any more and there are also no common prefixes of the productions so that it would require left factoring.
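The sets and the table above can be reproduced mechanically. The following Python sketch uses the standard fixed-point computation of FIRST and FOLLOW; the dictionary encoding of G' and all function names are assumptions made for this example.

```python
EPS = "ε"
END = "$"

# Grammar G' from above; () is the empty production.
GRAMMAR = {
    "A": [("a", "A'")],
    "A'": [("b", "A", "A'"), ()],
}
START = "A"

def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence, given FIRST sets of the nonterminals."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in first else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # every symbol was nullable (or the sequence is empty)
    return result

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no FIRST set grows any more
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:  # iterate until no FOLLOW set grows any more
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue  # terminals have no FOLLOW set
                    tail = first_of_sequence(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:
                        new |= follow[nt]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow

def build_table(grammar, first, follow):
    """Map (nonterminal, lookahead) to the list of applicable productions."""
    table = {}
    for nt, rules in grammar.items():
        for rhs in rules:
            rhs_first = first_of_sequence(rhs, first)
            lookaheads = rhs_first - {EPS}
            if EPS in rhs_first:
                lookaheads |= follow[nt]
            for terminal in lookaheads:
                table.setdefault((nt, terminal), []).append(rhs)
    return table

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first, START)
table = build_table(GRAMMAR, first, follow)
conflicts = {cell: rules for cell, rules in table.items() if len(rules) > 1}
print(conflicts)
```

The only cell with more than one production is (A', b), which holds both A' -> b A A' and the empty production.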
I'm not completely sure where the ambiguity comes from. I guess that during parsing the stack would fill with A'. And you can basically remove it (reduce to epsilon) if it isn't needed any more. I think this is the case if another A' is below it on the stack.
I think the LL(1) grammar G'' that I try to get from the original one would be:
A -> a A'
A' -> b A
| ε
Am I missing something? Did I do anything wrong?
Is there a more general procedure for removing left recursion that considers this edge case? If I want to automatically remove left recursions I should be able to handle this, right?
Is the second grammar G' an LL(k) grammar for some k > 1?
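For reference, the rewriting rule quoted in the question (A -> A alpha | beta becomes A -> beta A', A' -> alpha A' | ε) can be written as a short Python function. This is a minimal sketch: the tuple encoding of productions and the function name are made up for illustration, and it assumes the primed name is not already used in the grammar.

```python
def remove_direct_left_recursion(nonterminal, productions):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon.

    `productions` is a list of tuples of symbols; () denotes epsilon.
    Assumes the fresh name (nonterminal + "'") is not already in use.
    """
    fresh = nonterminal + "'"
    # alpha parts: what follows the leading self-reference
    alphas = [rhs[1:] for rhs in productions if rhs[:1] == (nonterminal,)]
    # beta parts: productions that do not start with the nonterminal itself
    betas = [rhs for rhs in productions if rhs[:1] != (nonterminal,)]
    if not alphas:
        return {nonterminal: productions}  # nothing to do
    return {
        nonterminal: [beta + (fresh,) for beta in betas],
        fresh: [alpha + (fresh,) for alpha in alphas] + [()],
    }

g_prime = remove_direct_left_recursion("A", [("A", "b", "A"), ("a",)])
for head, rules in g_prime.items():
    for rhs in rules:
        print(head, "->", " ".join(rhs) if rhs else "ε")
```

Applied to G, this yields exactly the grammar G' shown in the question, so the transformation itself was carried out correctly.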
The original grammar was ambiguous, so it is not surprising that the new grammar is also ambiguous.
Consider the string a b a b a. We can derive this in two ways from the original grammar:
A ⇒ A b A
⇒ A b a
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
A ⇒ A b A
⇒ A b A b A
⇒ A b A b a
⇒ A b a b a
⇒ a b a b a
Unambiguous grammars are, of course, possible. Here are right- and left-recursive versions:
A ⇒ a            A ⇒ a
A ⇒ a b A        A ⇒ A b a
(Although these represent the same language, they have different parses: the right-recursive version associates to the right, while the left-recursive one associates to the left.)
Removing left recursion cannot introduce ambiguity; this kind of transformation preserves it. If the CFG is already ambiguous, the result will be ambiguous too, and if the original is unambiguous, so is the result.
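One way to see this on the example is to count parse trees for the same string under both grammars. The Python sketch below hard-codes G (A -> A b A | a) and G' (A -> a A', A' -> b A A' | ε); the function names are invented for this illustration.

```python
from functools import lru_cache

def count_trees_g(s):
    """Number of parse trees for s under G: A -> A b A | a."""
    @lru_cache(maxsize=None)
    def a(i, j):
        total = 1 if s[i:j] == "a" else 0      # A -> a
        # A -> A b A: try every position k of the middle "b"
        for k in range(i + 1, j - 1):
            if s[k] == "b":
                total += a(i, k) * a(k + 1, j)
        return total
    return a(0, len(s))

def count_trees_g_prime(s):
    """Number of parse trees for s under G': A -> a A', A' -> b A A' | ε."""
    n = len(s)

    @lru_cache(maxsize=None)
    def a(i, j):
        # A -> a A'
        return a_prime(i + 1, j) if i < j and s[i] == "a" else 0

    @lru_cache(maxsize=None)
    def a_prime(i, j):
        total = 1 if i == j else 0             # A' -> ε
        if i < j and s[i] == "b":              # A' -> b A A'
            for k in range(i + 2, j + 1):      # A covers s[i+1:k], at least one symbol
                total += a(i + 1, k) * a_prime(k, j)
        return total

    return a(0, n)

for word in ["a", "aba", "ababa", "abababa"]:
    print(word, count_trees_g(word), count_trees_g_prime(word))
```

For ababa both functions report two parse trees, matching the two derivations shown above, and the counts stay equal for the longer strings as well.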

Transform a grammar G into LL(1)

I have the following grammar and I need to convert it to LL(1) grammar
G = (N; T; P; S) N = {S,A,B,C} T = {a, b, c, d}
P = {
S -> CbSb | adB | bc
A -> BdA | b
B -> aCd | ε
C -> Cca | bA | a
}
The point is that I know how to convert when it's just a single production, but I can't find any clear method of solving this on the internet.
Thanks in advance!
Remove left recursion, direct and indirect.
Build an LA(k) table. If there's no ambiguity, the grammar (and the language) is LL(k).
The obvious left recursion in the grammar is:
S ==> C... ==> C...

Resources