Left recursion elimination - parsing

I have this grammar
S->S+S|SS|(S)|S*|a
I want to know how to eliminate the left recursion from this grammar because the S+S is really confusing...

Let's see if we can simplify the given grammar.
S -> S*|S+S|SS|(S)|a
We can write it as:
S -> S*|SQ|SS|B|a
Q -> +S
B -> (S)
Now, you can eliminate left recursion in familiar territory.
S -> BS'|aS'
S' -> *S'|QS'|SS'|e
Q -> +S
B -> (S)
Note that e is epsilon/lambda.
We have removed the left recursion, so Q and B are no longer needed; substituting them back in gives:
S -> (S)S'|aS'
S' -> *S'|+SS'|SS'|e
You'll find this useful when dealing with left recursion elimination.
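The mechanical rule applied above is easy to put into code. Here is a minimal sketch (my own, not from the answer), assuming productions are given as lists of symbols and `[]` stands for epsilon:

```python
def eliminate_immediate(nt, prods):
    """Eliminate immediate left recursion for nonterminal nt.

    Split prods into left-recursive ones (nt followed by alpha) and the
    rest (beta), then rewrite as nt -> beta nt' and nt' -> alpha nt' | eps.
    """
    alphas = [p[1:] for p in prods if p and p[0] == nt]
    betas = [p for p in prods if not p or p[0] != nt]
    prime = nt + "'"
    return {
        nt: [b + [prime] for b in betas],
        prime: [a + [prime] for a in alphas] + [[]],  # [] is epsilon
    }

# S -> S+S | SS | (S) | S* | a, each production as a list of symbols
result = eliminate_immediate('S', [['S', '+', 'S'], ['S', 'S'],
                                   ['(', 'S', ')'], ['S', '*'], ['a']])
```

This yields S -> (S)S' | aS' and S' -> +SS' | SS' | *S' | e, the same grammar as above up to the order of alternatives.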

My answer, using theory from this reference:
How to Eliminate Left Recursion in a Context-Free Grammar.
S --> S+S | SS | S* | a | (S)
      -------------   -------
         Sα form       β form
   (left-recursive   (non-left-recursive
        rules)            rules)
We can write this as:
S ---> Sα1 | Sα2 | Sα3 | β1 | β2
Rules to convert it into an equivalent non-left-recursive grammar:
S ---> β1 | β2
Z ---> α1 | α2 | α3
Z ---> α1Z | α2Z | α3Z
S ---> β1Z | β2Z
Where
α1 = +S
α2 = S
α3 = *
And the β-productions, which do not start with S:
β1 = a
β2 = (S)
Grammar without left-recursion:
Non-left-recursive productions S --> βn:
S --> a | (S)
Introduce a new variable Z with the following productions: Z ---> αn and Z --> αnZ
Z --> +S | S | *
and
Z --> +SZ | SZ | *Z
And new S productions: S --> βnZ
S --> aZ | (S)Z
Second form (answer)
Productions Z --> +S | S | * and Z --> +SZ | SZ | *Z can be combined as Z --> +SZ | SZ | *Z | ^, where ^ is the null symbol.
Z --> ^ is used to remove Z from the production rules.
So second answer:
S --> aZ | (S)Z and Z --> +SZ | SZ | *Z| ^

Related

Cannot get LL(1) form of grammar for recursive descent parser

I have grammar:
S -> bU | ad | d
U -> Ufab | VSc | bS
V -> fad | f | Ua
To construct a recursive descent parser I need the LL(1) form.
Best I got is:
S -> bU | ad | d
U -> fY | bSX
Y -> adScX | ScX
X -> fabX | aScX | ε
Removed left recursions and done some left factoring but I am stuck.
Tried for several hours but I cannot get it...
E.g. valid strings are:
bbdfabadc
bbdfabfabfabfab
bfadadcfabfab
bbadaadc
bfbbdfabc
Obviously my grammar form is ambiguous for some of these, so I cannot make a recursive descent parser...
From answer:
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adA | bUc | dc
Z -> ε | XZ
A -> Sc | c
Still not LL(1). First and follow for Z are not disjoint.
Generally, to make a grammar LL(1) you'll need to repeatedly left factor and remove left recursion until you've managed to get rid of all the non-LL constructs. Which you do first depends on the grammar, but in this case you'll want to start with left factoring.
To left factor the rule
U -> Ufab | VSc | bS
you need to first substitute V giving
U -> Ufab | fadSc | fSc | UaSc | bS
which you then left factor into
U -> UX | fY | bS
X -> fab | aSc
Y -> adSc | Sc
now U is simple enough that you can eliminate the left recursion directly:
U -> fYZ | bSZ
Z -> ε | XZ
giving you
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adSc | Sc
Z -> ε | XZ
Now you still have a left factoring problem with Y so you need to substitute S:
Y -> adSc | bUc | adc | dc
which you left factor to
Y -> adA | bUc | dc
A -> Sc | c
giving an almost LL(1) grammar:
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adA | bUc | dc
Z -> ε | XZ
A -> Sc | c
but now things are stuck, as the epsilon rule for Z means we need FIRST(X) and FOLLOW(Z) to be disjoint (in order to decide between the two Z rules). This is generally indicative of a non-LL language, as there's some trailing context that could be associated with more than one rule (via the S -> bU -> bbSZ -> bbbUZ -> bbbbSZZ expansion chain -- trailing Zs can be recognized, but either might be empty). Often you can still recognize this language with a simple recursive-descent parser (or LL-style state table) by simply resolving the Z ambiguity/conflict in favor of the non-epsilon rule.
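When juggling candidate LL(1) forms, it helps to have an oracle for membership in the original language. The original grammar is left-recursive, so naive recursive descent loops on it, but a chart-based (Earley-style) recognizer handles it fine. A minimal sketch (my own, not from the answer), tested on strings I derived by hand rather than the ones quoted in the question:

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer (assumes no epsilon productions).

    grammar: dict mapping a nonterminal to a list of right-hand sides,
    each a tuple of symbols; a symbol is a nonterminal iff it is a key.
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))  # (lhs, rhs, dot, origin)
    for i in range(n + 1):
        added = True
        while added:
            added = False
            for (lhs, rhs, dot, origin) in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:  # predict
                        for prod in grammar[sym]:
                            if (sym, prod, 0, i) not in chart[i]:
                                chart[i].add((sym, prod, 0, i))
                                added = True
                    elif i < n and tokens[i] == sym:  # scan
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:  # complete
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            if (l2, r2, d2 + 1, o2) not in chart[i]:
                                chart[i].add((l2, r2, d2 + 1, o2))
                                added = True
    return any(it[0] == start and it[2] == len(it[1]) and it[3] == 0
               for it in chart[n])

# The original (left-recursive) grammar from the question.
g = {'S': [('b', 'U'), ('a', 'd'), ('d',)],
     'U': [('U', 'f', 'a', 'b'), ('V', 'S', 'c'), ('b', 'S')],
     'V': [('f', 'a', 'd'), ('f',), ('U', 'a')]}
```

For example, `earley_recognize(g, 'S', 'bbdfab')` accepts via S -> bU, U -> Ufab, U -> bS, S -> d.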

Designing a DFA

I want to design a DFA for the following language after fixing ambiguity.
I thought and tried a lot but couldn't get a proper answer.
S->aA|aB|lambda
A->aA|aS
B->bB|aB|b
I recommend first getting an NFA by considering this to be a regular grammar; then determinize the NFA, and then we can write down a new grammar that's equivalent to this one but unambiguous (for the same reason the determinized automaton is deterministic). Writing down the NFA for this grammar is easy: productions of the form X -> sY translate into transitions from state X to state Y on input s. Similarly, productions of the form X -> lambda mean X is an accepting state, and productions of the form X -> b imply a new accepting state that transitions to a dead state.
We need states for each nonterminal symbol S, A and B; and we will have transitions for every production. Our NFA looks like this:
/---a----\
| |
V |
----->(S)--a-->(A)<--\
| | |
a \--a-/ /--a,b--\
| | |
V V |
/--->(B)--b-->(X)-a,b->(Y)<-----/
| |
\-a,b-/
Here, states (S) and (X) are accepting, state (Y) is a dead state (we didn't really need to depict this explicitly, but bear with me) and this automaton is totally equivalent to the grammar. Now, we need to determinize this. States of the determinized automaton will correspond to subsets of states from the nondeterministic version. Our first deterministic state will correspond to the set containing just (S), and we will figure out the other required subsets (of which we can have at most 32, since we have 5 states and 2 to the power of 5 is 32) using the transitions:
Q s Q'
{(S)} a {(A),(B)}
{(S)} b empty
{(A),(B)} a {(A),(B),(S)}
{(A),(B)} b {(B),(X)}
{(A),(B),(S)} a {(A),(B),(S)}
{(A),(B),(S)} b {(B),(X)}
{(B),(X)} a {(B),(Y)}
{(B),(X)} b {(B),(X),(Y)}
{(B),(Y)} a {(B),(Y)}
{(B),(Y)} b {(B),(X),(Y)}
{(B),(X),(Y)} a {(B),(Y)}
{(B),(X),(Y)} b {(B),(X),(Y)}
We encountered six states, plus a dead state (the empty set), which we can name q1 through q6 and qD. All of the states corresponding to subsets with either (S) or (X) in them are accepting, and {(S)} is the initial state. Our DFA looks like this:
/-a,b-\
| |
V |
----->(q1)--b-->(qD)----/
|
a /--a--\
| | |
V V |
(q2)--a-->(q3)----/
| |
b |
| b
V |
/--(q4)<------/ /--b--\
| | | |
| \------b------(q6)<---+
a /--a----\ | |
| | | | |
\-->(q5)<-----+--a-/ |
| |
\---------b---------/
Finally, we can read off the unambiguous regular grammar from our DFA:
(q1) -> a(q2) | b(qD) | lambda
(qD) -> a(qD) | b(qD)
(q2) -> a(q3) | b(q4)
(q3) -> a(q3) | b(q4) | lambda
(q4) -> a(q5) | b(q6) | lambda
(q5) -> a(q5) | b(q6)
(q6) -> a(q5) | b(q6) | lambda
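The subset construction in the table above is easy to mechanize. A sketch (the encoding and names are my own), representing the NFA transitions as a dict keyed by (state, symbol):

```python
def determinize(nfa, start, accepting, alphabet):
    """Classic subset construction: DFA states are frozensets of NFA states."""
    start_set = frozenset([start])
    states, work, delta = {start_set}, [start_set], {}
    while work:
        q = work.pop()
        for sym in alphabet:
            q2 = frozenset(t for s in q for t in nfa.get((s, sym), ()))
            delta[(q, sym)] = q2
            if q2 not in states:
                states.add(q2)
                work.append(q2)
    return states, delta, {q for q in states if q & accepting}

# NFA from the answer: S->aA|aB|lambda, A->aA|aS, B->bB|aB|b
# (X is the extra accepting state for B -> b, Y is the dead state.)
nfa = {('S', 'a'): {'A', 'B'},
       ('A', 'a'): {'A', 'S'},
       ('B', 'a'): {'B'},
       ('B', 'b'): {'B', 'X'},
       ('X', 'a'): {'Y'}, ('X', 'b'): {'Y'},
       ('Y', 'a'): {'Y'}, ('Y', 'b'): {'Y'}}
states, delta, accepting = determinize(nfa, 'S', {'S', 'X'}, 'ab')

def accepts(word):
    q = frozenset({'S'})
    for ch in word:
        q = delta[(q, ch)]
    return q in accepting
```

Six non-empty subsets plus the empty (dead) subset come out, matching the transition table above.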

Normalizing structural differences in grammars

Consider the following grammar:
S → A | B
A → xy
B → xyz
This is what I think an LR(0) parser would do given the input xyz:
| xyz → shift
x | yz → shift
xy | z → reduce
A | z → shift
Az | → fail
If my assumption is correct and we changed rule B to read:
B → Az
now the grammar suddenly becomes acceptable by an LR(0) parser. I presume this new grammar describes the exact same set of strings than the first grammar in this question.
What are the differences between the first and second grammars?
How do we decouple structural differences in our grammars from the language they describe?
Through normalization?
What kind of normalization?
To further clarify:
I want to describe a language to a parser, without the structure of the grammar playing a role. I'd like to obtain the most minimal/fundamental description of a set of strings. For LR(k) grammars, I'd like to minimize the k.
I don't think your LR(0) parser is behaving like a standard one:
Source
An LR(0) parser is a shift/reduce parser that uses zero tokens of lookahead to determine what action to take (hence the 0). This means that in any configuration of the parser, the parser must have an unambiguous action to choose - either it shifts a specific symbol or applies a specific reduction. If there are ever two or more choices to make, the parser fails and we say that the grammar is not LR(0).
So, when you have:
S->A|B
A->xy
B->Az
or
S->A|B
A->xy
B->xyz
an LR(0) parser will never check the B rule, and it will fail for both of them.
State0 - Closure(S->°A):
S->°A
A->°xy
Arcs:
0 --> x --> 2
0 --> A --> 1
-------------------------
State1 - Goto(State0,A):
S->A°
Arcs:
1 --> $ --> Accept
-------------------------
State2 - Goto(State0,x):
A->x°y
Arcs:
2 --> y --> 3
-------------------------
State3 - Goto(State2,y):
A->xy°
Arcs:
-------------------------
But if you have
I->S
S->A|B
A->xy
B->xyz or B->Az
Both of them will accept xyz, but in different states. (In the listings below, the left column shows the items for B->xyz and the right column those for B->Az; the symbols after a comma appear to be FOLLOW sets.)
State0 - Closure(I->°S):
I->°S
S->°A
S->°B
A->°xy A->°xy, $z
B->°xyz B->°Az, $
Arcs:
0 --> x --> 4
0 --> S --> 1
0 --> A --> 2
0 --> B --> 3
-------------------------
State1 - Goto(State0,S):
I->S°
Arcs:
1 --> $ --> Accept
-------------------------
State2 - Goto(State0,A):
S->A° S->A°, $
B->A°z, $
Arcs: 2 --> z --> 5
-------------------------
State3 - Goto(State0,B):
S->B°
Arcs:
-------------------------
State4 - Goto(State0,x):
A->x°y A->x°y, $z
B->x°yz
Arcs:
4 --> y --> 5 4 --> y --> 6
-------------------------
State5 - Goto(State4,y): - Goto(State2,z):
A->xy° B->Az°, $
Arcs:
5 --> z --> 6 -<None>-
-------------------------
State6 - Goto(State5,z): - Goto(State4,y)
B->xyz° A->xy°, $z
Arcs:
-------------------------
You can see the Goto table and the Action table are different.
[B->xyz] [B->Az]
| Stack | Input | Action | Stack | Input | Action
--+---------+--------+---------- --+---------+--------+----------
1 | 0 | xyz$ | Shift 1 | 0 | xyz$ | Shift
2 | 0 4 | yz$ | Shift 2 | 0 4 | yz$ | Shift
3 | 0 4 5 | z$ | Shift 3 | 0 4 6 | z$ | Reduce A->xy
4 | 0 4 5 6 | $ | Reduce B->xyz 4 | 0 2 | z$ | Shift
5 | 0 3 | $ | Reduce S->B 5 | 0 2 5 | $ | Reduce B->Az
6 | 0 1 | $ | Accept 6 | 0 3 | $ | Reduce S->B
7 | 0 1 | $ | Accept
Simply put, when you change B->xyz to B->Az you add an action to your LR table. To find the differences you can check the Action table and the Goto table (see Constructing LR(0) parsing tables).
When you have A->xy and B->xyz, you have two possible bottom handles [xy or xyz]; but when you have B->Az, you have only one bottom handle [xy], which can then accept an additional z.
I think this is analogous to local optimization (c=a+b; d=a+b becomes c=a+b; d=c): when you use B->Az you make B->xyz "optimized" by reusing the already-reduced A.
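The item sets in the answer can be reproduced with the standard closure/goto construction. Here is a sketch (plain LR(0) items, without the follow-set annotations some of the listings above carry):

```python
def closure(items, grammar):
    """LR(0) closure: add (B, prod, 0) whenever some item has the dot before B."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    if (rhs[dot], prod, 0) not in items:
                        items.add((rhs[dot], prod, 0))
                        changed = True
    return frozenset(items)

def goto(state, sym, grammar):
    moved = [(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == sym]
    return closure(moved, grammar) if moved else None

def lr0_states(grammar, start):
    """Canonical collection of LR(0) item sets (start symbol pre-augmented)."""
    init = closure([(start, grammar[start][0], 0)], grammar)
    states, work = {init}, [init]
    symbols = {s for prods in grammar.values() for p in prods for s in p}
    while work:
        q = work.pop()
        for s in symbols:
            q2 = goto(q, s, grammar)
            if q2 is not None and q2 not in states:
                states.add(q2)
                work.append(q2)
    return states

g_xyz = {'I': [('S',)], 'S': [('A',), ('B',)],
         'A': [('x', 'y')], 'B': [('x', 'y', 'z')]}
g_az = {'I': [('S',)], 'S': [('A',), ('B',)],
        'A': [('x', 'y')], 'B': [('A', 'z')]}
```

Both augmented grammars yield seven states, but in the B->xyz version the state reached after xy contains both A->xy° and B->xy°z, while the B->Az version reaches a state with A->xy° alone: exactly where the two tables diverge.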

Step by step elimination of this indirect left recursion

I've seen this algorithm one should be able to use to remove all left recursion.
Yet I'm running into problems with this particular grammar:
A -> Cd
B -> Ce
C -> A | B | f
Whatever I try I end up in loops or with a grammar that is still indirect left recursive.
What are the steps to properly implement this algorithm on this grammar?
The rule is that you first establish some order for the non-terminals, and then find all paths through which indirect recursion happens.
In this case the order would be A < B < C, and the possible recursion paths for non-terminal C would be
C=> A => Cd
and
C=> B => Ce
so new rules for C would be
C=> Cd | Ce | f
now you can simply just remove direct left recursion:
C=> fC'
C'=> dC' | eC' | eps
and the resulting grammar, now free of left recursion, would be:
A => Cd
B => Ce
C => fC'
C' => dC' | eC' | eps
Figured it out already.
My confusion was that, in this order, the algorithm seemed to do nothing, so I figured that must be wrong and started replacing A -> Cd in the first iteration (ignoring that j cannot go beyond i), getting into infinite loops.
1) By reordering the rules:
C -> A | B | f
A -> Cd
B -> Ce
2) replace C in A -> Cd
C -> A | B | f
A -> Ad | Bd | fd
B -> Ce
3) B not yet in range of j, so leave that and replace direct left recursion of A
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> Ce
4) replace C in B -> Ce
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> Ae | Be | fe
5) not done yet! We also need to replace the new rule B -> Ae (the production of A is in range of j)
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> BdA'e | fdA'e | Be | fe
6) replace direct left recursion in productions of B
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsylon
B -> fdA'eB' | feB'
B'-> dA'eB' | eB' | epsylon
woohoo! left-recursion free grammar!
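The i/j bookkeeping in the steps above (substitute Aj-productions into Ai only for j < i, then remove Ai's direct left recursion) can be written down once and for all. A sketch of that algorithm (my own code, assuming productions are lists of grammar symbols):

```python
def eliminate_left_recursion(grammar, order):
    """Paull's algorithm: for each Ai in order, substitute Aj-productions
    (j < i) into Ai's rules, then remove Ai's direct left recursion."""
    g = {nt: [list(p) for p in prods] for nt, prods in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for p in g[ai]:
                if p and p[0] == aj:
                    expanded += [q + p[1:] for q in g[aj]]
                else:
                    expanded.append(p)
            g[ai] = expanded
        alphas = [p[1:] for p in g[ai] if p and p[0] == ai]
        if alphas:
            betas = [p for p in g[ai] if not p or p[0] != ai]
            prime = ai + "'"
            g[ai] = [b + [prime] for b in betas]
            g[prime] = [a + [prime] for a in alphas] + [[]]  # [] is epsilon
    return g

g = eliminate_left_recursion(
    {'A': [['C', 'd']], 'B': [['C', 'e']], 'C': [['A'], ['B'], ['f']]},
    order=['C', 'A', 'B'])
```

Running it with the order C < A < B reproduces the grammar in step 6 exactly.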

How to implement a left recursion eliminator?

How can I implement an eliminator for this?
A := AB |
AC |
D |
E ;
This is an example of so called immediate left recursion, and is removed like this:
A := DA' |
EA' ;
A' := ε |
BA' |
CA' ;
The basic idea is to first note that when parsing an A you will necessarily start with a D or an E. After the D or E you will either end (the tail is ε) or continue (if we're in an AB or AC construction).
The actual algorithm works like this:
For any left-recursive production like this: A -> A a1 | ... | A ak | b1 | b2 | ... | bm replace the production with A -> b1 A' | b2 A' | ... | bm A' and add the production A' -> ε | a1 A' | ... | ak A'.
See Wikipedia: Left Recursion for more information on the elimination algorithm (including elimination of indirect left recursion).
Another form available is:
A := (D | E) (B | C)*
The mechanics of doing it are about the same, but some parsers might handle that form better. Also consider what it will take to munge the action rules along with the grammar itself: the other form requires the factoring tool to generate a new type for the A' rule to return, whereas this form doesn't.
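The EBNF-style form maps directly onto a loop in a hand-written parser, which is part of why it composes well with semantic actions. A tiny sketch, using hypothetical single-character terminals d, e, b, c to stand in for D, E, B and C:

```python
def parse_A(s):
    """Recognize A := (D | E) (B | C)* over the toy alphabet {d, e, b, c}."""
    i = 0
    if i < len(s) and s[i] in 'de':     # mandatory (D | E) head
        i += 1
    else:
        return False
    while i < len(s) and s[i] in 'bc':  # zero or more (B | C) tails
        i += 1
    return i == len(s)
```

The loop body is the natural place to fold in an action for each (B | C) repetition, with no extra A' nonterminal (or result type) needed.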
