How to implement a left recursion eliminator? - parsing

How can I implement an eliminator for this?
A := AB |
AC |
D |
E ;

This is an example of so-called immediate left recursion, and is removed like this:
A := DA' |
EA' ;
A' := ε |
BA' |
CA' ;
The basic idea is to first note that when parsing an A you will necessarily start with a D or an E. After the D or E you will either end (the tail is ε) or continue (if we're in an AB or AC construction).
The actual algorithm works like this:
For any left-recursive production like this: A -> A a1 | ... | A ak | b1 | b2 | ... | bm, where none of the b's starts with A, replace the production with A -> b1 A' | b2 A' | ... | bm A' and add the production A' -> ε | a1 A' | ... | ak A'.
See Wikipedia: Left Recursion for more information on the elimination algorithm (including elimination of indirect left recursion).
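The replacement rule above can be sketched in a few lines of Python. This is only an illustration under some assumptions: productions are lists of symbol names, the function name `eliminate_immediate_left_recursion` is made up for this example, and the input has no ε alternatives and no A -> A production.

```python
EPSILON = "ε"

def eliminate_immediate_left_recursion(head, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... into A -> b A' and A' -> ε | a A'.

    Each alternative is a list of symbols. Assumes (for this sketch) that
    no alternative is empty and that the rule A -> A does not occur.
    """
    new_head = head + "'"
    # Alternatives that start with the head itself: keep only the tail (the a's).
    tails = [alt[1:] for alt in alternatives if alt[0] == head]
    # Alternatives that do not start with the head (the b's).
    bases = [alt for alt in alternatives if alt[0] != head]
    if not tails:
        return {head: alternatives}  # not immediately left-recursive
    return {
        head: [base + [new_head] for base in bases],
        new_head: [[EPSILON]] + [tail + [new_head] for tail in tails],
    }

rules = eliminate_immediate_left_recursion(
    "A", [["A", "B"], ["A", "C"], ["D"], ["E"]]
)
for lhs, alts in rules.items():
    print(lhs, ":=", " | ".join(" ".join(alt) for alt in alts))
```

For the grammar in the question this prints the same two rules as the hand-derived result above.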

Another form available is:
A := (D | E) (B | C)*
The mechanics of doing it are about the same, but some parsers might handle that form better. Also consider what it will take to munge the action rules along with the grammar itself; the other form requires the factoring tool to generate a new type for the A' rule to return, whereas this form doesn't.
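As a rough illustration of why the repetition form is convenient for hand-written parsers, here is a small sketch. It assumes, purely for the example, that B, C, D and E are single tokens and that the result is a left-nested tuple; `parse_A` is a hypothetical name.

```python
def parse_A(tokens):
    """Recognise (D | E) (B | C)* and build a left-nested tree.

    Returns the tree, or raises SyntaxError if the tokens do not match.
    """
    pos = 0
    if pos >= len(tokens) or tokens[pos] not in ("D", "E"):
        raise SyntaxError("expected D or E at position %d" % pos)
    tree = tokens[pos]
    pos += 1
    # The (B | C)* part is a plain loop; folding to the left gives the
    # same tree shape the original rule A := AB | AC would have produced.
    while pos < len(tokens) and tokens[pos] in ("B", "C"):
        tree = (tree, tokens[pos])
        pos += 1
    if pos != len(tokens):
        raise SyntaxError("unexpected token at position %d" % pos)
    return tree

print(parse_A(["D", "B", "C"]))  # (('D', 'B'), 'C')
```

The loop returns the A value directly, so no separate result type for an A' rule is needed.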

Related

Cannot get LL(1) form of grammar for recursive descent parser

I have this grammar:
S -> bU | ad | d
U -> Ufab | VSc | bS
V -> fad | f | Ua
To construct a recursive descent parser I need the LL(1) form.
The best I got is:
S -> bU | ad | d
U -> fY | bSX
Y -> adScX | ScX
X -> fabX | aScX | ε
I removed the left recursion and did some left factoring, but I am stuck.
I tried for several hours but I cannot get it...
E.g. valid strings are:
bbdfabadc
bbdfabfabfabfab
bfadadcfabfab
bbadaadc
bfbbdfabc
Obviously my form of the grammar is ambiguous for some of them, so I cannot build a recursive descent parser...
From answer:
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adA | bUc | dc
Z -> ε | XZ
A -> Sc | c
Still not LL(1). The FIRST and FOLLOW sets of Z are not disjoint.
Generally, to make a grammar LL(1) you'll need to repeatedly left factor and remove left recursion until you've managed to get rid of all the non-LL things. Which you do first depends on the grammar, but in this case you'll want to start with left factoring.
To left factor the rule
U -> Ufab | VSc | bS
you need to first substitute V giving
U -> Ufab | fadSc | fSc | UaSc | bS
which you then left factor into
U -> UX | fY | bS
X -> fab | aSc
Y -> adSc | Sc
Now U is simple enough that you can eliminate the left recursion directly:
U -> fYZ | bSZ
Z -> ε | XZ
giving you
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adSc | Sc
Z -> ε | XZ
Now you still have a left factoring problem with Y so you need to substitute S:
Y -> adSc | bUc | adc | dc
which you left factor to
Y -> adA | bUc | dc
A -> Sc | c
giving an almost LL(1) grammar:
S -> bU | ad | d
U -> fYZ | bSZ
X -> fab | aSc
Y -> adA | bUc | dc
Z -> ε | XZ
A -> Sc | c
but now things are stuck, as the epsilon rule for Z means we need FIRST(X) and FOLLOW(Z) to be disjoint (in order to decide between the two Z rules). This is generally indicative of a non-LL language, as there's some trailing context that could be associated with more than one rule (via the S -> bU -> bbSZ -> bbbUZ -> bbbbSZZ expansion chain: the trailing Zs can be recognized, but either might be empty). Often you can still recognize this language with a simple recursive-descent parser (or LL-style state table) by simply resolving the Z conflict in favor of the non-epsilon rule.
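To check the FIRST(X) / FOLLOW(Z) overlap mechanically, a fixed-point computation like the following sketch can help. The encoding is an assumption made for this example: every symbol is one character, uppercase letters are nonterminals, and the empty string stands for ε.

```python
GRAMMAR = {
    "S": ["bU", "ad", "d"],
    "U": ["fYZ", "bSZ"],
    "X": ["fab", "aSc"],
    "Y": ["adA", "bUc", "dc"],
    "Z": ["", "XZ"],  # "" stands for ε
    "A": ["Sc", "c"],
}
START, END = "S", "$"

def first_of(seq, first, nullable):
    """FIRST set of a symbol string, and whether the whole string can vanish."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:  # terminal
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def compute_sets():
    first = {n: set() for n in GRAMMAR}
    follow = {n: set() for n in GRAMMAR}
    nullable = set()
    follow[START].add(END)
    changed = True
    while changed:  # repeat until nothing grows any more
        changed = False
        for head, alts in GRAMMAR.items():
            for alt in alts:
                f, vanishes = first_of(alt, first, nullable)
                if not f <= first[head]:
                    first[head] |= f
                    changed = True
                if vanishes and head not in nullable:
                    nullable.add(head)
                    changed = True
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue
                    f, vanishes = first_of(alt[i + 1:], first, nullable)
                    new = f | (follow[head] if vanishes else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return first, follow, nullable

first, follow, nullable = compute_sets()
conflict = first["X"] & follow["Z"]
print("FIRST(X)  =", sorted(first["X"]))
print("FOLLOW(Z) =", sorted(follow["Z"]))
print("overlap   =", sorted(conflict))
```

A non-empty overlap means one token of lookahead cannot decide between Z -> ε and Z -> XZ.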

Designing a DFA

I want to design a DFA for the following language after fixing ambiguity.
I thought and tried a lot but couldn't get a proper answer.
S->aA|aB|lambda
A->aA|aS
B->bB|aB|b
I recommend first getting an NFA by considering this to be a regular grammar; then, determinize the NFA, and then we can write down a new grammar that's equivalent to this one but unambiguous (for the same reason the determinized automaton is deterministic). Writing down the NFA for this grammar is easy: productions of the form X -> sY translate into transitions from state X to state Y on input s. Similarly, productions of the form X -> lambda mean X is an accepting state, and productions of the form X -> b imply a transition on b to a new accepting state, which in turn transitions to a dead state.
We need states for each nonterminal symbol S, A and B; and we will have transitions for every production. Our NFA looks like this:
       /---a----\
       |        |
       V        |
----->(S)--a-->(A)<--\
       |        |    |
       a        \--a-/   /--a,b--\
       |                 |       |
       V                 V       |
 /--->(B)--b-->(X)-a,b->(Y)<-----/
 |     |
 \-a,b-/
Here, states (S) and (X) are accepting, state (Y) is a dead state (we didn't really need to depict this explicitly, but bear with me) and this automaton is totally equivalent to the grammar. Now, we need to determinize this. States of the determinized automaton will correspond to subsets of states from the nondeterministic version. Our first deterministic state will correspond to the set containing just (S), and we will figure out the other required subsets (of which we can have at most 32, since we have 5 states and 2 to the power of 5 is 32) using the transitions:
Q                s    Q'
{(S)}            a    {(A),(B)}
{(S)}            b    empty
{(A),(B)}        a    {(A),(B),(S)}
{(A),(B)}        b    {(B),(X)}
{(A),(B),(S)}    a    {(A),(B),(S)}
{(A),(B),(S)}    b    {(B),(X)}
{(B),(X)}        a    {(B),(Y)}
{(B),(X)}        b    {(B),(X),(Y)}
{(B),(Y)}        a    {(B),(Y)}
{(B),(Y)}        b    {(B),(X),(Y)}
{(B),(X),(Y)}    a    {(B),(Y)}
{(B),(X),(Y)}    b    {(B),(X),(Y)}
We encountered six states, plus a dead state (empty); we can name them q1 through q6 and qD, in the order they first appear in the table. All of the states corresponding to subsets with either (S) or (X) in them are accepting, and q1, which corresponds to {(S)}, is the initial state. Our DFA looks like this:
                  /-a,b-\
                  |     |
                  V     |
----->(q1)--b-->(qD)----/
       |
       a          /--a--\
       |          |     |
       V          V     |
      (q2)--a-->(q3)----/
       |         |
       b         |
       |         b
       V         |
   /--(q4)<------/     /--b--\
   |   |               |     |
   |   \------b------(q6)<---+
   a     /--a----\    |      |
   |     |       |    |      |
   \-->(q5)<-----+--a-/      |
         |                   |
         \---------b---------/
Finally, we can read off the unambiguous regular grammar from our DFA:
(q1) -> a(q2) | b(qD) | lambda
(qD) -> a(qD) | b(qD)
(q2) -> a(q3) | b(q4)
(q3) -> a(q3) | b(q4) | lambda
(q4) -> a(q5) | b(q6) | lambda
(q5) -> a(q5) | b(q6)
(q6) -> a(q5) | b(q6) | lambda
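The determinization step can also be done programmatically. The following is a minimal sketch of the subset construction for the NFA described above; the dictionary encoding and the names `determinize` and `accepts` are assumptions made for this example.

```python
from collections import deque

# NFA read off the grammar; X is the extra accepting state, Y the dead state.
NFA = {
    ("S", "a"): {"A", "B"},
    ("A", "a"): {"A", "S"},
    ("B", "a"): {"B"},
    ("B", "b"): {"B", "X"},
    ("X", "a"): {"Y"}, ("X", "b"): {"Y"},
    ("Y", "a"): {"Y"}, ("Y", "b"): {"Y"},
}
ACCEPTING = {"S", "X"}
ALPHABET = "ab"

def determinize(start):
    """Subset construction: explore only the subsets reachable from {start}."""
    table = {}
    queue = deque([frozenset([start])])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        table[current] = {}
        for symbol in ALPHABET:
            target = frozenset(
                t for state in current for t in NFA.get((state, symbol), ())
            )
            table[current][symbol] = target
            queue.append(target)
    return table

def accepts(table, word, start="S"):
    """Run the DFA; a subset state accepts if it contains an accepting NFA state."""
    state = frozenset([start])
    for symbol in word:
        state = table[state][symbol]
    return bool(state & ACCEPTING)

dfa = determinize("S")
for subset, moves in dfa.items():
    print(sorted(subset), {s: sorted(t) for s, t in moves.items()})
```

The empty subset plays the role of the dead state qD, so the table has seven entries in total.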

Is this grammar left recursive or factor?

I am new to the topic of left recursion and left factoring. Please help me determine whether this grammar is left recursive or left factored, and if it is, why?
S-> aAd | bBd | aBe | bAe | cA | cB
Is it Left recursive? Answer: No.
By definition, "A grammar is left-recursive if we can find some non-terminal A which will eventually derive a sentential form with itself as the left-symbol."
example:
Immediate left recursion occurs in rules of the form
A -> Aa | b
where a and b are sequences of nonterminals and terminals, and b doesn't start with A.
Clearly not the case in:
S-> aAd | bBd | aBe | bAe | cA | cB
Does it need left factoring? Answer: Yes.
Left factoring is needed when it is not clear which of two alternative productions to choose when expanding a non-terminal.
This conflict occurs when two alternative productions start with the same terminal or non-terminal.
In your case, I can see that three times, each with two alternatives:
S-> aAd | aBe
S-> bBd | bAe
S-> cA | cB
After left factoring, the grammar becomes:
S-> aA'
A'-> Ad | Be
S-> bB'
B'-> Bd | Ae
S-> cC'
C'-> A | B
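The grouping step can be sketched as follows. This is a simplified illustration that factors out only a single leading symbol and assumes no empty alternatives; the helper names of the form S_a are hypothetical and play the role of A', B' and C' above.

```python
def left_factor_once(head, alternatives):
    """Group alternatives by their first symbol and factor shared prefixes.

    Alternatives are non-empty lists of symbols. Only one symbol of
    common prefix is factored out in this sketch.
    """
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0], []).append(alt[1:])
    rules = {head: []}
    for first, rests in groups.items():
        if len(rests) == 1:
            # No conflict on this first symbol: keep the alternative as is.
            rules[head].append([first] + rests[0])
        else:
            new_head = "%s_%s" % (head, first)  # hypothetical helper name
            rules[head].append([first, new_head])
            rules[new_head] = rests
    return rules

grammar = [list(alt) for alt in ("aAd", "bBd", "aBe", "bAe", "cA", "cB")]
for lhs, alts in left_factor_once("S", grammar).items():
    print(lhs, "->", " | ".join("".join(alt) for alt in alts))
```

A full left-factoring tool would repeat this on the new rules until no two alternatives of any nonterminal share a first symbol.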

Step by step elimination of this indirect left recursion

I've seen this algorithm, which one should be able to use to remove all left recursion.
Yet I'm running into problems with this particular grammar:
A -> Cd
B -> Ce
C -> A | B | f
Whatever I try I end up in loops or with a grammar that is still indirect left recursive.
What are the steps to properly implement this algorithm on this grammar?
The rule is that you first establish some order for the non-terminals, and then find all paths where indirect recursion happens.
In this case the order would be A < B < C, and the possible paths for recursion of non-terminal C would be
C=> A => Cd
and
C=> B => Ce
so new rules for C would be
C=> Cd | Ce | f
Now you can simply remove the direct left recursion:
C=> fC'
C'=> dC' | eC' | eps
and the resulting non-recursive grammar would be:
A => Cd
B => Ce
C => fC'
C' => dC' | eC' | eps
Figured it out already.
My confusion was that in this order the algorithm seemed to do nothing, so I figured that must be wrong, and started replacing A -> Cd in the first iteration (ignoring that j cannot go beyond i), getting into infinite loops.
1) By reordering the rules:
C -> A | B | f
A -> Cd
B -> Ce
2) replace C in A -> Cd
C -> A | B | f
A -> Ad | Bd | fd
B -> Ce
3) B not yet in range of j, so leave that and replace direct left recursion of A
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> Ce
4) replace C in B -> Ce
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> Ae | Be | fe
5) not done yet! also need to replace the new rule B -> Ae (production of A is in range of j)
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> BdA'e | fdA'e | Be | fe
6) replace direct left recursion in productions of B
C -> A | B | f
A -> BdA' | fdA'
A'-> dA' | epsilon
B -> fdA'eB' | feB'
B'-> dA'eB' | eB' | epsilon
woohoo! left-recursion free grammar!
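The steps walked through in both answers follow the usual ordered-substitution scheme, which can be sketched like this. It is only an illustration: the function names are made up, productions are lists of symbols, and the input grammar is assumed to have no ε alternatives.

```python
EPSILON = "ε"

def remove_immediate(head, alts):
    """A -> A t | b  becomes  A -> b A'  and  A' -> t A' | ε."""
    tails = [alt[1:] for alt in alts if alt[0] == head]
    bases = [alt for alt in alts if alt[0] != head]
    if not tails:
        return {head: alts}
    new = head + "'"
    return {head: [b + [new] for b in bases],
            new: [t + [new] for t in tails] + [[EPSILON]]}

def eliminate_left_recursion(grammar, order):
    """Process nonterminals in `order`; substitute earlier ones at the front."""
    result = {}
    for i, head in enumerate(order):
        alts = [list(alt) for alt in grammar[head]]
        for earlier in order[:i]:  # j < i: only already processed heads
            expanded = []
            for alt in alts:
                if alt[0] == earlier:
                    expanded += [sub + alt[1:] for sub in result[earlier]]
                else:
                    expanded.append(alt)
            alts = expanded
        result.update(remove_immediate(head, alts))
    return result

def show(rules):
    return {h: " | ".join("".join(a) for a in alts) for h, alts in rules.items()}

grammar = {"A": [["C", "d"]], "B": [["C", "e"]], "C": [["A"], ["B"], ["f"]]}
by_cab = show(eliminate_left_recursion(grammar, ["C", "A", "B"]))
by_abc = show(eliminate_left_recursion(grammar, ["A", "B", "C"]))
for name, rules in (("C, A, B", by_cab), ("A, B, C", by_abc)):
    print("order", name)
    for head, body in rules.items():
        print("  ", head, "->", body)
```

With the order C, A, B this reproduces the six-step derivation, and with A, B, C it gives the shorter grammar from the first answer, which shows how much the chosen order affects the size of the result.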

difference between top down and bottom up parsing techniques?

I guess the same logic is applied in both of them, i.e., replacing the matched strings with the corresponding non-terminal elements as provided in the production rules.
Why do they categorize LL as top down and LR as bottom-up?
Bottom up parsing:
Bottom-up parsing (also known as shift-reduce parsing) is a strategy for analyzing unknown data relationships that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. It attempts to build trees upward toward the start symbol.
Top-down parsing:
Top-down parsing is a strategy of analyzing unknown data relationships by hypothesizing general parse tree structures and then considering whether the known fundamental structures are compatible with the hypothesis.
Top-down parsing involves generating the string starting from the first non-terminal.
Examples: recursive descent parsing, non-recursive descent parsing, LL parsing, etc.
Grammars with left recursion, or ones that still need left factoring, do not work as-is.
Backtracking might occur.
It uses leftmost derivation.
Things Of Interest Blog
The difference between top-down parsing and bottom-up parsing
Given a formal grammar and a string produced by that grammar, parsing is figuring out the production process for that string.
In the case of the context-free grammars, the production process takes the form of a parse tree. Before we begin, we always know two things about the parse tree: the root node, which is the initial symbol from which the string was originally derived, and the leaf nodes, which are all the characters of the string in order. What we don't know is the layout of nodes and branches between them.
For example, if the string is acddf, we know this much already:
    S
   /|\
   ???
| | | | |
a c d d f
Example grammar for use in this article
S → xyz | aBC
B → c | cd
C → eg | df
Bottom-up parsing
This approach is not unlike solving a jigsaw puzzle. We start at the bottom of the parse tree with individual characters. We then use the rules to connect the characters together into larger tokens as we go. At the end of the string, everything should have been combined into a single big S, and S should be the only thing we have left. If not, it's necessary to backtrack and try combining tokens in different ways.
With bottom-up parsing, we typically maintain a stack, which is the list of characters and tokens we've seen so far. At each step, we shift a new character onto the stack, and then reduce as far as possible by combining characters into larger tokens.
Example
String is acddf.
Steps
ε can't be reduced
a can't be reduced
ac can be reduced, as follows:
  reduce ac to aB
    aB can't be reduced
    aBd can't be reduced
    aBdd can't be reduced
    aBddf can be reduced, as follows:
      reduce aBddf to aBdC
        aBdC can't be reduced
        End of string. Stack is aBdC, not S. Failure! Must backtrack.
    aBddf can't be reduced
ac can't be reduced
acd can be reduced, as follows:
  reduce acd to aB
    aB can't be reduced
    aBd can't be reduced
    aBdf can be reduced, as follows:
      reduce aBdf to aBC
        aBC can be reduced, as follows:
          reduce aBC to S
            End of string. Stack is S. Success!
Parse trees
|
a

| |
a c

  B
| |
a c

  B
| | |
a c d

  B
| | | |
a c d d

  B
| | | | |
a c d d f

  B   C
| | | |\
a c d d f

| |
a c

| | |
a c d

    B
|  /|
a c d

    B
|  /| |
a c d d

    B
|  /| | |
a c d d f

    B C
|  /| |\
a c d d f

    S
   /|\
  / | |
 /  B C
|  /| |\
a c d d f
Example 2
If all combinations fail, then the string cannot be parsed.
String is acdg.
Steps
ε can't be reduced
a can't be reduced
ac can be reduced, as follows:
  reduce ac to aB
    aB can't be reduced
    aBd can't be reduced
    aBdg can't be reduced
    End of string. Stack is aBdg, not S. Failure! Must backtrack.
ac can't be reduced
acd can be reduced, as follows:
  reduce acd to aB
    aB can't be reduced
    aBg can't be reduced
    End of string. Stack is aBg, not S. Failure! Must backtrack.
acd can't be reduced
acdg can't be reduced
End of string. Stack is acdg, not S. No backtracking is possible. Failure!
Parse trees
|
a

| |
a c

  B
| |
a c

  B
| | |
a c d

  B
| | | |
a c d g

| |
a c

| | |
a c d

    B
|  /|
a c d

    B
|  /| |
a c d g

| | |
a c d

| | | |
a c d g
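The shift, reduce and backtrack procedure traced in the two examples above can be sketched as a short recursive function. This is an illustration only: it assumes every symbol is a single character so that the stack can be a plain string, and `parse_bottom_up` is a hypothetical name.

```python
# The example grammar from the article, as (head, body) pairs.
RULES = [
    ("S", "xyz"), ("S", "aBC"),
    ("B", "c"), ("B", "cd"),
    ("C", "eg"), ("C", "df"),
]

def parse_bottom_up(stack, remaining):
    """Return True if the stack plus the remaining input can be reduced to S.

    At every step, first try each reduction of a suffix of the stack;
    if none of them leads to success, shift the next character instead.
    Returning False to the caller is what "must backtrack" means here.
    """
    if stack == "S" and not remaining:
        return True
    for head, body in RULES:
        if stack.endswith(body):
            reduced = stack[:-len(body)] + head
            if parse_bottom_up(reduced, remaining):
                return True
    if remaining:
        return parse_bottom_up(stack + remaining[0], remaining[1:])
    return False

print(parse_bottom_up("", "acddf"))  # True
print(parse_bottom_up("", "acdg"))   # False
```

Practical bottom-up parsers such as LR parsers avoid this backtracking by using a precomputed table to decide between shifting and reducing.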
Top-down parsing
For this approach we assume that the string matches S and look at the internal logical implications of this assumption. For example, the fact that the string matches S logically implies that either (1) the string matches xyz or (2) the string matches aBC. If we know that (1) is not true, then (2) must be true. But (2) has its own further logical implications. These must be examined as far as necessary to prove the base assertion.
Example
String is acddf.
Steps
Assertion 1: acddf matches S
  Assertion 2: acddf matches xyz:
    Assertion is false. Try another.
  Assertion 2: acddf matches aBC i.e. cddf matches BC:
    Assertion 3: cddf matches cC i.e. ddf matches C:
      Assertion 4: ddf matches eg:
        False.
      Assertion 4: ddf matches df:
        False.
      Assertion 3 is false. Try another.
    Assertion 3: cddf matches cdC i.e. df matches C:
      Assertion 4: df matches eg:
        False.
      Assertion 4: df matches df:
        Assertion 4 is true.
      Assertion 3 is true.
    Assertion 2 is true.
  Assertion 1 is true. Success!
Parse trees
   S
   |

   S
  /|\
 a B C
   | |

   S
  /|\
 a B C
   | |
   c

   S
  /|\
 a B C
  /| |
 c d

   S
  /|\
 a B C
  /| |\
 c d d f
Example 2
If, after following every logical lead, we can't prove the basic hypothesis ("The string matches S") then the string cannot be parsed.
String is acdg.
Steps
Assertion 1: acdg matches S:
  Assertion 2: acdg matches xyz:
    False.
  Assertion 2: acdg matches aBC i.e. cdg matches BC:
    Assertion 3: cdg matches cC i.e. dg matches C:
      Assertion 4: dg matches eg:
        False.
      Assertion 4: dg matches df:
        False.
      False.
    Assertion 3: cdg matches cdC i.e. g matches C:
      Assertion 4: g matches eg:
        False.
      Assertion 4: g matches df:
        False.
      False.
    False.
  Assertion 1 is false. Failure!
Parse trees
   S
   |

   S
  /|\
 a B C
   | |

   S
  /|\
 a B C
   | |
   c

   S
  /|\
 a B C
  /| |
 c d
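The chain of assertions in the two top-down examples maps quite directly onto a recursive function. The sketch below is one possible rendering, assuming single-character symbols; the name `matches` and the string encoding are choices made for this example.

```python
# The example grammar from the article: nonterminal -> list of bodies.
GRAMMAR = {
    "S": ["xyz", "aBC"],
    "B": ["c", "cd"],
    "C": ["eg", "df"],
}

def matches(symbols, text):
    """Assert that `text` can be derived from the symbol string `symbols`."""
    if not symbols:
        return not text  # nothing left to match: succeed only on empty input
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Nonterminal: try each rule in turn ("Assertion is false. Try another.")
        return any(matches(body + rest, text) for body in GRAMMAR[head])
    # Terminal: it must equal the next input character.
    return text.startswith(head) and matches(rest, text[1:])

print(matches("S", "acddf"))  # True
print(matches("S", "acdg"))   # False
```

Each recursive call corresponds to one numbered assertion in the traces, and a False result sends control back to the caller, which then tries its next alternative.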
Why left-recursion is a problem for top-down parsers
If our rules were left-recursive, for example something like this:
S → Sb
Then notice how our algorithm behaves:
Steps
Assertion 1: acddf matches S:
  Assertion 2: acddf matches Sb:
    Assertion 3: acddf matches Sbb:
      Assertion 4: acddf matches Sbbb:
        ...and so on forever
Parse trees
S
|

S
|\
S b
|

S
|\
S b
|\
S b
|

S
|\
S b
|\
S b
|\
S b
|
...
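The runaway expansion can be observed safely by adding a depth limit to a naive top-down matcher. In this sketch the limit of 25, the `TooDeep` exception and the function names are all assumptions made for the illustration.

```python
class TooDeep(Exception):
    """Raised when the expansion depth passes the (assumed) safety limit."""

GRAMMAR = {"S": ["Sb"]}  # the left-recursive rule from the text

def matches(symbols, text, depth=0, max_depth=25):
    if depth > max_depth:
        raise TooDeep(symbols)
    if not symbols:
        return not text
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # S is expanded to Sb without consuming any input, so the same
        # question is asked again with a longer symbol string each time.
        return any(matches(body + rest, text, depth + 1, max_depth)
                   for body in GRAMMAR[head])
    return text.startswith(head) and matches(rest, text[1:], depth + 1, max_depth)

def runs_away(text):
    """True if the matcher hits the depth limit instead of giving an answer."""
    try:
        matches("S", text)
    except TooDeep:
        return True
    return False

print(runs_away("acddf"))  # True
```

Without the limit, the recursion would only stop when the interpreter's own recursion limit is reached, regardless of the input string.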
