I have read this to better understand the difference between top-down and bottom-up parsing. Can anyone explain the problems associated with left recursion in a top-down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide their guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
Hope this helps!
Reasons
1) A left-recursive grammar (direct or indirect) is not in Greibach normal form (GNF), and converting it to GNF requires eliminating the left recursion first, typically by rewriting it as right recursion.
2) Left-recursive grammars are also not LL(1), so eliminating left recursion may result in an LL(1) grammar.
GNF
A grammar is in Greibach normal form if every production has the form A -> aV, where a is a terminal and V is a (possibly empty) sequence of nonterminals.
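As a rough illustration of the rewrite mentioned in point 1, here is a minimal Python sketch of the textbook transformation for direct left recursion: A -> A α | β becomes A -> β A' and A' -> α A' | ε. The function name, the list-of-symbols representation and the primed helper nonterminal are all assumptions made for this example; it does not handle indirect left recursion or degenerate rules such as A -> A.

```python
def eliminate_direct_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A x | y as A -> y A', A' -> x A' | epsilon.

    Each alternative is a list of symbols; the empty list stands for epsilon.
    Hypothetical helper for illustration only (direct recursion only).
    """
    # Alternatives that start with the nonterminal itself, minus that first symbol.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # Everything else is a "base case".
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]

    if not recursive:
        return {nonterminal: alternatives}  # nothing to do
    if not others:
        raise ValueError("no non-recursive alternative; the rule derives nothing")

    tail = nonterminal + "'"  # assumed-fresh helper nonterminal
    return {
        nonterminal: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],
    }


# A -> A b | b   becomes   A -> b A',  A' -> b A' | epsilon
result = eliminate_direct_left_recursion("A", [["A", "b"], ["b"]])
print(result)
```

The rewritten grammar generates the same strings (one or more b's), but every alternative now starts by consuming a token before it recurses.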
Related
Left recursion will make the parser go into an infinite loop. So why does the same not happen with right recursion?
In a recursive descent parser, a grammar rule like A -> B C | D is implemented by trying to parse B at the current position and then, if that succeeds, trying to parse C at the position where B ended. If either fails, we try to parse D at the current position¹.
If C is equal to A (right recursion) that's okay. That simply means that if B succeeds, we try to parse A at the position after B, which means that we try to first parse B there and then either try A again at a new position or try D. This will continue until finally B fails and we try D.
If B is equal to A (left recursion), however, that is very much a problem. Because now to parse A, we first try to parse A at the current position, which tries to parse A at the current position ... ad infinitum. We never advance our position and never try anything except A (which just keeps trying itself), so we never get to a point where we might terminate.
¹ Assuming full back tracking. Otherwise A might fail without trying D if B and/or C consumed any tokens (or more tokens than we've got lookahead), but none of this matters for this discussion.
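To make the difference concrete, here is a small Python sketch of both cases for the language of one or more b's. The function names and the convention of returning the end position (or None on failure) are assumptions for this example, not anything prescribed above.

```python
def parse_right(text, pos):
    """A -> 'b' A | 'b'   (right recursion). Returns end position or None."""
    if text.startswith("b", pos):
        end = parse_right(text, pos + 1)  # recursion happens at a NEW position
        if end is not None:
            return end
    # First alternative failed, so fall back to the second one.
    if text.startswith("b", pos):
        return pos + 1
    return None


def parse_left(text, pos):
    """A -> A 'b' | 'b'   (left recursion). Never returns."""
    end = parse_left(text, pos)  # recursion at the SAME position: no progress
    if end is not None and text.startswith("b", end):
        return end + 1
    if text.startswith("b", pos):
        return pos + 1
    return None


print(parse_right("bbb", 0))  # 3

try:
    parse_left("bbb", 0)
except RecursionError:
    print("left-recursive rule recursed without consuming any input")
```

In Python the infinite recursion surfaces as a RecursionError; in a language without a recursion limit it would typically be a stack overflow.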
If you're puzzled by the lack of symmetry, another way of looking at this is that left recursion causes problems for recursive descent parsers because we typically parse languages from left to right. That means that if a rule is left-recursive, then the recursive symbol is the first one that's tried, and it's tried in the same state as the parent rule, guaranteeing that the recursion will continue infinitely.
If you really think about it, there's no fundamental reason why we parse languages from left-to-right; it's just a convention! (There are reasons, such as that it's faster to read files from disk that way; but those are consequences of the convention.) If you wrote a right-to-left recursive descent parser, which started at the end of the file and consumed characters from the end first, working backwards to the beginning of the file, then right recursion is what would cause problems, and you would need to rewrite your right-recursive grammars to be left-recursive before you can parse them. That's because if you're handling the right symbol first, then it's the one that's parsed with the same state as the parent.
So there you are; symmetry is preserved. Just as left-to-right recursive descent parsers struggle with left recursion, similarly right-to-left recursive descent parsers struggle with right recursion.
I am making a parser using bison. I just want to ask whether it is still necessary for a grammar to be left-factored when used with bison. I tried giving bison a non-left-factored grammar and it didn't give any warning or error, and it also accepted the example syntax I gave to the parser, but I'm worried that the parser may not be accurate for every input.
Left factoring is how you remove LL-conflicts in a grammar. Since Bison uses LALR it has no problems with left recursion or any other LL-conflicts (indeed, left recursion is preferable as it minimizes stack requirements), so left factoring is neither necessary nor desirable.
Note that left factoring won't break anything -- bison can deal with a left-factored grammar as well as a non-left factored one, but it may require more resources (memory) to parse the left-factored grammar, so in general, don't.
edit
You seem to be confused about how LL-vs-LR parsing work and how the structure of the grammar affects each.
LL parsing is top down -- you start with just the start symbol on the parse stack, and at each step, you replace the non-terminal on top of the stack with the symbols from the right side of some rule for that non-terminal. When there is a terminal on top of the stack, it must match the next token of input, so you pop it and consume the input. The goal being to consume all the input and end up with an empty stack.
LR parsing is bottom up -- you start with an empty stack, and at each step you either copy a token from the input to the stack (consuming it), or you replace a sequence of symbols on the top of the stack corresponding to the right side of some rule with the single symbol from the left side of the rule. The goal being to consume all the input and be left with just the start symbol on the stack.
So different rules for the same non-terminal which start with the same symbols on the right side are a big problem for LL parsing -- you could replace that non-terminal with the symbols from either rule and match the next few tokens of input, so you would need more lookahead to know which to do. But for LR parsing, there's no problem -- you just shift (move) the tokens from the input to the stack and when you get to the later tokens you decide which right side it matches.
LR parsing tends to have problems with rules that end with the same tokens on the right hand side, rather than rules that start with the same tokens. In your example from John Levine's book, there are rules "cart_animal ::= HORSE" and "work_animal ::= HORSE", so after shifting a HORSE symbol, it could be reduced (replace by) either "cart_animal" or "work_animal". Since the context allows either to be followed by the "AND" token, you end up with a reduce/reduce (LR) conflict when the next token is "AND".
In fact, the opposite is true. Parsers generated by LALR(1) parser generators not only support left recursion, they in fact work better with left recursion. Ironically, you may have to refactor right recursion out of your grammar.
Right recursion works; however, it delays reduction, causing parse stack space that is proportional to the size of the recursive construct being parsed.
For instance, building a Lisp-style list like this:
list : item      { $$ = cons($1, nil); }
     | item list { $$ = cons($1, $2); }
means that the parser stack is proportional to the length of the list. No reduction takes place until the rightmost item is reached, and then a cascade of reductions takes place, building the list from right to left by a sequence of cons calls.
You might not encounter this issue until you start parsing data, rather than code, and the data gets large.
If you modify this for left recursion, you can build the list in a constant amount of parser stack, because the action will be "reduce as you go":
list : item      { $$ = cons($1, nil); }
     | list item { $$ = append($1, cons($2, nil)); }
(Now there is a performance problem with append searching for the tail of the list; for which there are various solutions, unrelated to the parsing.)
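The stack behaviour can be mimicked with a short Python sketch. This is a hand-written simulation of the shift and reduce moves an LR parser would make for each version of the list rule, not output from bison or a real LR automaton; the function names are made up for the example, and item is treated as a single token.

```python
def max_stack_right_recursive(n):
    """list : item | item list   -- returns the deepest stack seen for n items."""
    stack, deepest = [], 0
    for _ in range(n):
        stack.append("item")          # shift; nothing can be reduced yet
        deepest = max(deepest, len(stack))
    stack[-1:] = ["list"]             # reduce  list : item  (rightmost item)
    while len(stack) > 1:
        stack[-2:] = ["list"]         # cascade of  list : item list
    return deepest


def max_stack_left_recursive(n):
    """list : item | list item   -- returns the deepest stack seen for n items."""
    stack, deepest = [], 0
    for _ in range(n):
        stack.append("item")          # shift
        deepest = max(deepest, len(stack))
        if len(stack) == 1:
            stack[-1:] = ["list"]     # reduce  list : item
        else:
            stack[-2:] = ["list"]     # reduce  list : list item  immediately
    return deepest


print(max_stack_right_recursive(1000))  # 1000
print(max_stack_left_recursive(1000))   # 2
```

The right-recursive version holds every item on the stack until the end of the list, while the left-recursive version never holds more than two symbols.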
I wanted to know why top-down parsers cannot handle left recursion, and why we need to eliminate left recursion because of this, as mentioned in the dragon book.
Think of what it's doing. Suppose we have a left-recursive production rule A -> Aa | b, and right now we try to match that rule. So we're checking whether we can match an A here, but in order to do that, we must first check whether we can match an A here. That sounds impossible, and it mostly is. Using a recursive-descent parser, that obviously represents an infinite recursion.
It is possible using more advanced techniques that are still top-down, for example see [1] or [2].
[1]: Richard A. Frost and Rahmatullah Hafiz. A new top-down parsing algorithm to accommodate ambiguity and left recursion in polynomial time. SIGPLAN Notices, 41(5):46–54, 2006.
[2]: R. Frost, R. Hafiz, and P. Callaghan. Modular and efficient top-down parsing for ambiguous left-recursive grammars. ACL-IWPT, pp. 109–120, 2007.
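For a hand-written recursive-descent parser, a much simpler workaround than the algorithms cited above is to turn the left recursion into a loop. The Python sketch below does that for A -> Aa | b; it is the common manual rewrite, not the Frost and Hafiz technique, and the function name and tuple-based tree shape are assumptions for the example.

```python
def parse_A(text, pos=0):
    """Parse A -> A 'a' | 'b' iteratively, i.e. as 'b' followed by zero or more 'a'.

    Returns (tree, end_position) or None. The tree is nested to the left,
    which is the shape the left-recursive rule would have produced.
    """
    if not text.startswith("b", pos):
        return None
    tree = "b"                      # base case: A -> b
    pos += 1
    while text.startswith("a", pos):
        tree = (tree, "a")          # one more application of A -> A a
        pos += 1
    return tree, pos


print(parse_A("baa"))  # ((('b', 'a'), 'a'), 3)
```

Because the loop consumes a token on every iteration, it always terminates, and the left-associative structure is still recovered.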
Top-down parsers cannot handle left recursion
A top-down parser cannot handle left recursive productions. To understand why not, let's take a very simple left-recursive grammar.
S → a
S → S a
There is only one token, a, and only one nonterminal, S. So the parsing table has just one entry. Both productions must go into that one table entry.
The problem is that, on lookahead a, the parser cannot know if another a comes after the lookahead. But the decision of which production to use depends on that information.
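The conflict can be seen by building the table mechanically. The following Python sketch computes FIRST sets and fills an LL(1) table for this grammar; it assumes a grammar with no ε-productions (so FOLLOW sets are not needed), and the function names and dictionary layout are invented for the example.

```python
GRAMMAR = {"S": [["a"], ["S", "a"]]}  # S -> a | S a


def first_sets(grammar):
    """FIRST sets for a grammar without epsilon productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                head = alt[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def ll1_table(grammar):
    """Map (nonterminal, lookahead) to the list of productions predicted there."""
    first = first_sets(grammar)
    table = {}
    for nt, alternatives in grammar.items():
        for alt in alternatives:
            head = alt[0]
            lookaheads = first[head] if head in grammar else {head}
            for token in lookaheads:
                table.setdefault((nt, token), []).append(alt)
    return table


table = ll1_table(GRAMMAR)
conflicts = {cell: prods for cell, prods in table.items() if len(prods) > 1}
print(conflicts)  # {('S', 'a'): [['a'], ['S', 'a']]}
```

A cell holding more than one production is exactly the situation described above: with one token of lookahead the parser has no basis for choosing between them.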
Let's take the following context-free grammar:
G = ( {Sum, Product, Number}, {decimal representations of numbers, +, *}, P, Sum)
where P is:
Sum → Sum + Product
Sum → Product
Product → Product * Number
Product → Number
Number → decimal representation of a number
I am trying to parse expressions produced by this grammar with a bottom-up parser and a look-ahead buffer (LAB) of length 1 (which supposedly should do without guessing and backtracking).
Now, given a stack and a LAB, there are often several possibilities of how to reduce the stack or whether to reduce it at all or push another token.
Currently I use this decision tree:
If any top n tokens of the stack plus the LAB are the beginning of the
right side of a rule, I push the next token onto the stack.
Otherwise, I reduce the maximum number of tokens on top of the stack.
I.e. if it is possible to reduce the topmost item and at the same time
it is possible to reduce the three top most items, I do the latter.
If no such reduction is possible, I push another token onto the stack.
Rinse and repeat.
This seems (!) to work, but it requires an awful amount of rule searching, finding matching prefixes, etc. No way this can run in O(NM).
What is the standard (and possibly only sensible) approach to decide whether to reduce or push (shift), and in case of reduction, which reduction to apply?
Thank you in advance for your comments and answers.
The easiest bottom-up parsing approach for grammars like yours (basically, expression grammars) is operator-precedence parsing.
Recall that bottom-up parsing involves building the parse tree left-to-right from the bottom. In other words, at any given time during the parse, we have a partially assembled tree with only terminal symbols to the right of where we're reading, and a combination of terminals and non-terminals to the left (the "prefix"). The only possible reduction is one which applies to a suffix of the prefix; if no reduction applies, we must be able to shift a terminal from the input to the prefix.
An operator grammar has the feature that there are never two consecutive non-terminals in any production. Consequently, in a bottom-up parse of an operator grammar, either the last symbol in the prefix is a terminal or the second-last symbol is one. (Both of them could be.)
An operator precedence parser is essentially blind to non-terminals; it simply doesn't distinguish between them. So you cannot have two productions whose right-hand sides contain exactly the same sequence of terminals, because the op-prec parser wouldn't know which of these two productions to apply. (That's the traditional view. It's actually possible to extend that a bit so that you can have two productions with the same terminals, provided that the non-terminals are in different places. That allows grammars which have unary - operators, for example, since the right hand sides <non-terminal> - <non-terminal> and - <non-terminal> can be distinguished without knowing the names of the non-terminals; only their presence.)
The other requirement is that you have to be able to build a precedence relationship between the terminals. More precisely, we define three precedence relations, usually written <·, ·> and ·=· (or some typographic variation on the theme), and insist that for any two terminals x and y, at most one of the relations x ·> y, x ·=· y and x <· y is true.
Roughly speaking, the < and > in the relations correspond to the edges of a production. In other words, if x <· y, that means that x can be followed by a non-terminal with a production whose first terminal is y. Similarly, x ·> y means that y can follow a non-terminal with a production whose last terminal is x. And x ·=· y means that there is some right-hand side where x and y are consecutive terminals, in that order (possibly with an intervening non-terminal).
If the single-relation restriction is true, then we can parse as follows:
Let x be the last terminal in the prefix (that is, either the last or second-last symbol), and let y be the lookahead symbol, which must be a terminal. If x ·> y then we reduce, and repeat the rule. Otherwise, we shift y onto the prefix.
In order to reduce, we need to find the start of the production. We move backwards over the prefix, comparing consecutive terminals (all of which must have <· or ·=· relations) until we find one with a <· relation. Then the terminals between the <· and the ·> are the right-hand side of the production we're looking for, and we can slot the non-terminals into the right-hand side as indicated.
There is no guarantee that there will be an appropriate production; if there isn't, the parse fails. But if the input is a valid sentence, and if the grammar is an operator-precedence grammar, then we will be able to find the right production to reduce.
Note that it is usually really easy to find the production, because most productions have only one (<non-terminal> * <non-terminal>) or two (( <non-terminal> )) terminals. A naive implementation might just run the terminals together into a string and use that as the key in a hash-table.
The classic implementation of operator-precedence parsing is the so-called "Shunting Yard Algorithm", devised by Edsger Dijkstra. In that algorithm, the precedence relations are modelled by providing two functions, left-precedence and right-precedence, which map terminals to integers such that x <· y is true only if right-precedence(x) < left-precedence(y) (and similarly for the other operators). It is not always possible to find such mappings, and the mappings are a cover of the actual precedence relations because it is often the case that there are pairs of terminals for which no precedence relationship applies. Nonetheless, it is often the case that these mappings can be found, and almost always the case for simple expression grammars.
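Since no code has appeared so far, here is a minimal Python sketch of a shunting-yard style evaluator for the Sum/Product grammar from the question. It is a simplification under stated assumptions: a single PRECEDENCE table with left-associative operators stands in for the separate left-precedence and right-precedence functions, the input is assumed to be well-formed (there is no error handling), and all names are made up for the example.

```python
import re

# Assumed precedence levels: '*' binds tighter than '+'.
PRECEDENCE = {"+": 1, "*": 2}


def tokenize(text):
    """Split into decimal numbers and the two operators; other characters are skipped."""
    return re.findall(r"\d+|[+*]", text)


def evaluate(text):
    """Evaluate a well-formed expression over +, * and non-negative integers."""
    values, operators = [], []

    def reduce_top():
        # Corresponds to reducing  Sum + Product  or  Product * Number.
        op = operators.pop()
        right = values.pop()
        left = values.pop()
        values.append(left + right if op == "+" else left * right)

    for token in tokenize(text):
        if token in PRECEDENCE:
            # Reduce while the operator on the stack binds at least as tightly
            # (">=" makes both operators left-associative).
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                reduce_top()
            operators.append(token)   # shift the operator
        else:
            values.append(int(token))  # shift a number
    while operators:
        reduce_top()
    return values[0]


print(evaluate("2+3*4"))  # 14
print(evaluate("2*3+4"))  # 10
```

Each token is pushed once and each operator is reduced once, so the running time is linear in the number of tokens, with no searching through rules for matching prefixes.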
I hope that's enough to get you started. I encourage you to actually read some texts about bottom-up parsing, because I think I've already written far too much for an SO answer, and I haven't yet included a single line of code. :)