Left recursion will make the parser go into an infinite loop. So why does the same not happen with right recursion?
In a recursive descent parser a grammar rule like A -> B C | D is implemented by trying to parse B at the current position and then, if that succeeds, trying to parse C at the position where B ended. If either fail, we try to parse D at the current position¹.
If C is equal to A (right recursion) that's okay. That simply means that if B succeeds, we try to parse A at the position after B, which means that we try to first parse B there and then either try A again at a new position or try D. This will continue until finally B fails and we try D.
If B is equal to A (left recursion), however, that is very much a problem. Because now to parse A, we first try to parse A at the current position, which tries to parse A at the current position ... ad infinitum. We never advance our position and never try anything except A (which just keeps trying itself), so we never get to a point where we might terminate.
¹ Assuming full back tracking. Otherwise A might fail without trying D if B and/or C consumed any tokens (or more tokens than we've got lookahead), but none of this matters for this discussion.
If you're puzzled by the lack of symmetry, another way of looking at this is that left recursion causes problems for recursive descent parsers is because we typically parse languages from left to right. That means that if a parser is left-recursive, then the recursive symbol is the first one that's tried, and it's tried in the same state as the parent rule, guaranteeing that the recursion will continue infinitely.
If you really think about it, there's no fundamental reason why we parse languages from left-to-right; it's just a convention! (There are reasons, such as that it's faster to read files from disk that way; but those are consequences of the convention.) If you wrote a right-to-left recursive descent parser, which started at the end of the file and consumed characters from the end first, working backwards to the beginning of the file, then right recursion is what would cause problems, and you would need to rewrite your right-recursive grammars to be left-recursive before you can parse them. That's because if you're handling the right symbol first, then it's the one that's parsed with the same state as the parent.
So there you are; symmetry is preserved. Just as left-to-right recursive descent parsers struggle with left recursion, similarly right-to-left recursive descent parsers struggle with right recursion.
Related
I was going through the text Compilers Principles, Techniques and Tools by Ullman et. al where I came across the excerpt where the authors try to justify why stack is the best data structure of shift reduce parsing. They said that it is so because of the fact that
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form
In case (1), A is replaced by , and then the rightmost nonterminal B in that right side is replaced by . In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration
The parser now reduces the handle to B to reach the configuration
in which is the handle, and it gets reduced to A
In case (2), in configuration
the handle is on top of the stack. After reducing the handle to B, the parser can shift the string xy to get the next handle y on top of the stack,
Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively this is how I feel that the statement in can be justified
If there is an handle on the top of the stack, then the algorithm, will first reduce it before pushing the next input symbol on top of the stack. Since before the push any possible handle is reduced, so there is no chance of an handle being on the top of the stack and then pushing a new input symbol thereby causing the handle to go inside the stack.
Moreover I could not understand the logic the authors have given in highlighted portion of the excerpt justifying that the handle cannot occur inside the stack, based on what they say about B and other facts related to it.
Please can anyone help me understand the concept.
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).
I'm delving deeper into parsing and came across an issue I don't quite understand. I made up the following grammar:
S = R | aSc
R = b | RbR
where S is the start symbol. It is possible to show that abbbc is a valid sentence based on this grammar, hopefully, that is correct but I may have completely missunderstood something. If I try to implement this using recursive descent I seem to have a problem when trying to parse abbbc, using left-derivation eg
S => aSc
aSc => aRc
at this point I would have thought that recursive descent would pick the first option in the second production because the next token is b leading to:
aRc => abc
and we're finished since there are no more non-terminals, which isn't of course abbbc. The only way to show that abbbc is valid is to pick the second option but with one lookahead I assume it would always pick b. I don't think the grammar is ambiguous unless I missed something. So what I am doing wrong?
Update: I came across this nice derivation app at https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/. I used to do a sanity check that abbbc is a valid sentence and it is.
Thinking more about this problem, is it true to say that I can't use LL(1) to parse this grammar but in fact need LL(2)? With two lookaheads I could correctly pick the second option in the second production because I now also know there are more tokens to be read and therefore picking b would prematurely terminate the derivation.
For starters, I’m glad you’re finding our CFG tool useful! A few of my TAs made that a while back and we’ve gotten a lot of mileage out of it.
Your grammar is indeed ambiguous. This stems from your R nonterminal:
R → b | RbR
Generally speaking, if you have recursive production rules with two copies of the same nonterminal in it, it will lead to ambiguities because there will be multiple options for how to apply the rule twice. For example, in this case, you can derive bbbbb by first expanding R to RbR, then either
expanding the left R to RbR and converting each R to a b, or
expanding the right R to RbR and converting each R to a b.
Because this grammar is ambiguous, it isn’t going to be LL(k) for any choice of k because all LL(k) grammars must be unambiguous. That means that stepping up the power of your parser won’t help here. You’ll need to rewrite the grammar to not be ambiguous.
The nonterminal R that you’ve described here generates strings of odd numbers of b’s in them, so we could try redesigning R to achieve this more directly. An initial try might be something like this:
R → b | bbR
This, unfortunately, isn’t LL(1), since after seeing a single b it’s unclear whether you’re supposed to apply the first production rule or the second. However, it is LL(2).
If you’d like an LL(1) grammar, you could do something like this:
R → bX
X → bbX | ε
This works by laying down a single b, then laying down as many optional pairs of b’s as you’d like.
Left recursion will make the parser go into an infinite loop. So why does the same not happen with right recursion?
In a recursive descent parser a grammar rule like A -> B C | D is implemented by trying to parse B at the current position and then, if that succeeds, trying to parse C at the position where B ended. If either fail, we try to parse D at the current position¹.
If C is equal to A (right recursion) that's okay. That simply means that if B succeeds, we try to parse A at the position after B, which means that we try to first parse B there and then either try A again at a new position or try D. This will continue until finally B fails and we try D.
If B is equal to A (left recursion), however, that is very much a problem. Because now to parse A, we first try to parse A at the current position, which tries to parse A at the current position ... ad infinitum. We never advance our position and never try anything except A (which just keeps trying itself), so we never get to a point where we might terminate.
¹ Assuming full back tracking. Otherwise A might fail without trying D if B and/or C consumed any tokens (or more tokens than we've got lookahead), but none of this matters for this discussion.
If you're puzzled by the lack of symmetry, another way of looking at this is that left recursion causes problems for recursive descent parsers is because we typically parse languages from left to right. That means that if a parser is left-recursive, then the recursive symbol is the first one that's tried, and it's tried in the same state as the parent rule, guaranteeing that the recursion will continue infinitely.
If you really think about it, there's no fundamental reason why we parse languages from left-to-right; it's just a convention! (There are reasons, such as that it's faster to read files from disk that way; but those are consequences of the convention.) If you wrote a right-to-left recursive descent parser, which started at the end of the file and consumed characters from the end first, working backwards to the beginning of the file, then right recursion is what would cause problems, and you would need to rewrite your right-recursive grammars to be left-recursive before you can parse them. That's because if you're handling the right symbol first, then it's the one that's parsed with the same state as the parent.
So there you are; symmetry is preserved. Just as left-to-right recursive descent parsers struggle with left recursion, similarly right-to-left recursive descent parsers struggle with right recursion.
I am making a parser using bison. I just wanna ask if it still necessary for a grammar to be left-factored when used in bison. I tried giving bison a non-left-factored grammar and it didn't gave any warning or error and it also accepted the example syntax I gave to the parser, but I'm worried that it the parser may not be accurate in every input.
Left factoring is how you remove LL-conflicts in a grammar. Since Bison uses LALR it has no problems with left recursion or any other LL-conflicts (indeed, left recursion is preferable as it minimizes stack requirements), so left factoring is neither necessary nor desirable.
Note that left factoring won't break anything -- bison can deal with a left-factored grammar as well as a non-left factored one, but it may require more resources (memory) to parse the left-factored grammar, so in general, don't.
edit
You seem to be confused about how LL-vs-LR parsing work and how the structure of the grammar affects each.
LL parsing is top down -- you start with just the start symbol on the parse stack, and at each step, you replace the non-terminal on top of the stack with the symbols from the right side of some rule for that non-terminal. When there is a terminal on top of the stack, it must match the next token of input, so you pop it and consume the input. The goal being to consume all the input and end up with an empty stack.
LR parsing is bottom up -- you start with an empty stack, and at each step you either copy a token from the input to the stack (consuming it), or you replace a sequence of symbols on the top of the stack corresponding to the right side of some rule with the single symbol from the left side of the rule. The goal being to consume all the input and be left with just the start symbol on the stack.
So different rules for the same non-terminal which start with the same symbols on the right side are a big problem for LL parsing -- you could replace that non-terminal with the symbols from either rule and match the next few tokens of input, so you would need more lookahead to know which to do. But for LR parsing, there's no problem -- you just shift (move) the tokens from the input to the stack and when you get to the later tokens you decide which right side it matches.
LR parsing tends to have problems with rules that end with the same tokens on the right hand side, rather than rules that start with the same tokens. In your example from John Levine's book, there are rules "cart_animal ::= HORSE" and "work_animal ::= HORSE", so after shifting a HORSE symbol, it could be reduced (replace by) either "cart_animal" or "work_animal". Since the context allows either to be followed by the "AND" token, you end up with a reduce/reduce (LR) conflict when the next token is "AND".
In fact, the opposite is true. Parsers generated by LALR(1) parser generators not only support left recursion, they in fact work better with left recursion. Ironically, you may have to refactor right recursion out of your grammar.
Right recursion works; however, it delays reduction, causing parse stack space that is proportional to the size of the recursive construct being parsed.
For instance, building a Lisp-style list like this:
list : item { $$ = cons($1, nil); }
| item list { $$ = cons($1, $2); }
means that the parser stack is proportional to the length of the list. No reduction takes place until the rightmost item is reached, and then a cascade of reductions takes place, building the list from right to left by a sequence of cons calls.
You might not encounter this issue until you start parsing data, rather than code, and the data gets large.
If you modify this for left recursion, you can build a the list in a constant amount parser stack, because the action will be "reduce as you go":
list : item { $$ = cons($1, nil); }
| list item { $$ = append($1, cons($2, nil)); }
(Now there is a performance problem with append searching for the tail of the list; for which there are various solutions, unrelated to the parsing.)
I have read this to understand more the difference between top down and bottom up parsing, can anyone explain the problems associated with left recursion in a top down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide its guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
Hope this helps!
Reasons
1)The grammar which are left recursive(Direct/Indirect) can't be converted into {Greibach normal form (GNF)}* So the Left recursion can be eliminated to Right Recuraive Format.
2)Left Recursive Grammars are also nit LL(1),So again elimination of left Recursion may result into LL(1) grammer.
GNF
A Grammer of the form A->aV is Greibach Normal Form.