I am making a parser using bison. I just want to ask whether it is still necessary for a grammar to be left-factored when used with bison. I tried giving bison a non-left-factored grammar: it didn't give any warnings or errors, and it accepted the example syntax I gave the parser, but I'm worried that the parser may not be accurate on every input.
Left factoring is how you remove LL-conflicts in a grammar. Since Bison uses LALR it has no problems with left recursion or any other LL-conflicts (indeed, left recursion is preferable as it minimizes stack requirements), so left factoring is neither necessary nor desirable.
Note that left factoring won't break anything -- bison can deal with a left-factored grammar as well as a non-left factored one, but it may require more resources (memory) to parse the left-factored grammar, so in general, don't.
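For example (a sketch with made-up rule and token names, not your grammar), bison is perfectly happy with two rules that share a common prefix:

decl : TYPE IDENTIFIER ';'
     | TYPE IDENTIFIER '=' expr ';'
     ;

An LL(1) tool would force you to left-factor the common TYPE IDENTIFIER prefix into a helper rule; bison simply shifts through both tokens and decides which rule to reduce by when it finally sees the ';' or the '='.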
Edit:
You seem to be confused about how LL vs. LR parsing works and how the structure of the grammar affects each.
LL parsing is top down -- you start with just the start symbol on the parse stack, and at each step, you replace the non-terminal on top of the stack with the symbols from the right side of some rule for that non-terminal. When there is a terminal on top of the stack, it must match the next token of input, so you pop it and consume the input. The goal being to consume all the input and end up with an empty stack.
LR parsing is bottom up -- you start with an empty stack, and at each step you either copy a token from the input to the stack (consuming it), or you replace a sequence of symbols on the top of the stack corresponding to the right side of some rule with the single symbol from the left side of the rule. The goal being to consume all the input and be left with just the start symbol on the stack.
So different rules for the same non-terminal which start with the same symbols on the right side are a big problem for LL parsing -- you could replace that non-terminal with the symbols from either rule and match the next few tokens of input, so you would need more lookahead to know which to do. But for LR parsing, there's no problem -- you just shift (move) the tokens from the input to the stack and when you get to the later tokens you decide which right side it matches.
LR parsing tends to have problems with rules that end with the same tokens on the right hand side, rather than rules that start with the same tokens. In your example from John Levine's book, there are rules "cart_animal ::= HORSE" and "work_animal ::= HORSE", so after shifting a HORSE symbol, it could be reduced to (replaced by) either "cart_animal" or "work_animal". Since the context allows either to be followed by the "AND" token, you end up with a reduce/reduce (LR) conflict when the next token is "AND".
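For reference, the shape of that example (reconstructed from memory, so the exact rule and token names may differ from the book) is roughly:

phrase      : cart_animal AND CART
            | work_animal AND PLOW
            ;
cart_animal : HORSE | GOAT ;
work_animal : HORSE | OX ;

After shifting HORSE with AND as the lookahead, the parser has to commit to reducing it to either cart_animal or work_animal before it can see whether CART or PLOW follows, hence the reduce/reduce conflict.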
In fact, the opposite is true. Parsers generated by LALR(1) parser generators not only support left recursion, they in fact work better with left recursion. Ironically, you may have to refactor right recursion out of your grammar.
Right recursion works; however, it delays reduction, requiring parse stack space proportional to the size of the recursive construct being parsed.
For instance, building a Lisp-style list like this:
list : item      { $$ = cons($1, nil); }
     | item list { $$ = cons($1, $2); }
     ;
means that the parser stack is proportional to the length of the list. No reduction takes place until the rightmost item is reached, and then a cascade of reductions takes place, building the list from right to left by a sequence of cons calls.
You might not encounter this issue until you start parsing data, rather than code, and the data gets large.
If you modify this to use left recursion, you can build the list in a constant amount of parser stack, because the action will "reduce as you go":
list : item      { $$ = cons($1, nil); }
     | list item { $$ = append($1, cons($2, nil)); }
     ;
(Now there is a performance problem with append searching for the tail of the list; for which there are various solutions, unrelated to the parsing.)
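One such solution (a sketch, using a hypothetical nreverse helper) is to keep prepending during the parse, which is O(1) per item, and reverse the result once at the top level:

list_expr : list       { $$ = nreverse($1); }
          ;
list      : item       { $$ = cons($1, nil); }
          | list item  { $$ = cons($2, $1); }
          ;

Another common approach is to carry a pointer to the tail of the list so that appending doesn't have to search for it.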
I was going through the text Compilers: Principles, Techniques and Tools by Ullman et al., where I came across an excerpt in which the authors try to justify why a stack is the best data structure for shift-reduce parsing. They said that it is so because
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form

(1)  S ⇒* αAz ⇒ αβByz ⇒ αβγyz
(2)  S ⇒* αBxAz ⇒ αBxyz ⇒ αγxyz

In case (1), A is replaced by βBy, and then the rightmost nonterminal B in that right side is replaced by γ. In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.

Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration

    STACK: $ αβγ        INPUT: yz$

The parser now reduces the handle γ to B to reach the configuration

    STACK: $ αβB        INPUT: yz$

and can then shift the string y onto the stack, giving

    STACK: $ αβBy       INPUT: z$

in which βBy is the handle, and it gets reduced to A.

In case (2), in configuration

    STACK: $ αγ         INPUT: xyz$

the handle γ is on top of the stack. After reducing the handle γ to B, the parser can shift the string xy to get the next handle y on top of the stack:

    STACK: $ αBxy       INPUT: z$

Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively, this is how I feel the statement can be justified:
If there is a handle on the top of the stack, then the algorithm will reduce it before pushing the next input symbol onto the stack. Since any possible handle is reduced before the push, there is no chance of a handle sitting on top of the stack, a new input symbol being pushed, and the handle thereby ending up inside the stack.
Moreover, I could not understand the logic the authors give in the highlighted portion of the excerpt to justify that the handle cannot occur inside the stack, based on what they say about B and the other facts related to it.
Can anyone please help me understand the concept?
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).
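A tiny worked example (mine, not the book's), using the grammar E -> E '+' T | T and T -> id to parse "id + id":

rightmost derivation:  E => E + T => E + id => T + id => id + id

read backwards as reductions:
    id + id    reduce T -> id        handle ends at token 1
    T  + id    reduce E -> T         handle ends at token 1
    E  + id    reduce T -> id        handle ends at token 3
    E  + T     reduce E -> E + T     handle ends at token 3
    E

The right end of each successive handle (tokens 1, 1, 3, 3) never moves backwards, which is exactly the non-retreating property the proof establishes.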
I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really arise when building an LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.
The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue                  remaining input             action
---------------------- --------------------------- ------
                       ELEMENT , ELEMENT , ELEMENT SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT                , ELEMENT , ELEMENT         REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:
List                   , ELEMENT , ELEMENT         SHIFT
List ,                 ELEMENT , ELEMENT           SHIFT
List , ELEMENT         , ELEMENT                   REDUCE 3
List                   , ELEMENT                   SHIFT
List ,                 ELEMENT                     SHIFT
List , ELEMENT         --                          REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List                   --                          REDUCE 1
Start                  --                          ACCEPT
and the parse is successful.
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ',', because ',' is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
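Worked out for the little list grammar above (my own computation, not part of the original answer):

FIRST(List)   = { ELEMENT }
FOLLOW(Start) = { $ }           (only end of input can follow the start symbol)
FOLLOW(List)  = { ',', $ }      (',' from rule 3; $ via rule 1, Start -> List)

So with ',' as the lookahead, reducing by rule 3 is allowed (',' is in FOLLOW(List)) but reducing by rule 1 is not (',' is not in FOLLOW(Start)), which matches the decisions made in the trace.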
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which led to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.
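The standard illustration of that situation is the pointer-assignment grammar (quoted from memory from the Dragon book, so treat it as a sketch):

S -> L '=' R
S -> R
L -> '*' R
L -> ID
R -> L

Here '=' ends up in FOLLOW(R), because L -> '*' R puts everything that can follow an L (including '=') after an R as well. So after seeing an L, an SLR parser cannot decide between shifting the '=' and reducing R -> L. An LALR parser keeps track of how it reached that state and knows that, in this context, the reduction is only valid at end of input, so there is no conflict.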
I have read this to understand more about the difference between top-down and bottom-up parsing. Can anyone explain the problems associated with left recursion in a top-down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide their guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
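For what it's worth (the standard textbook transformation, not something asked about above), the usual way to make this grammar usable by a top-down parser is to rewrite the left recursion as right recursion:

A  -> b A'
A' -> b A'
A' -> ε

This generates the same strings of one or more b's, but the parser always consumes a b before it has to decide whether to recurse again.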
Hope this helps!
Reasons
1) A grammar that is left recursive (directly or indirectly) can't be converted into Greibach Normal Form (GNF), so the left recursion has to be eliminated, i.e. rewritten into a right-recursive form.
2) Left-recursive grammars are also not LL(1), so again, eliminating the left recursion may result in an LL(1) grammar.
GNF
A grammar in which every production has the form A -> aV, where a is a terminal and V is a (possibly empty) string of nonterminals, is in Greibach Normal Form.
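For instance (my own illustration, not from the answer above), the grammar

S -> a S B
S -> b
B -> b

is in GNF, since every right-hand side starts with a terminal followed only by nonterminals, whereas a left-recursive rule such as S -> S a can never be written in that form without first eliminating the recursion.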
I wrote a parser that analyzes code and reduces it as would be done by lex and yacc (pretty much).
I am wondering about one aspect of the matter. If I have a set of rules such as the following:
unary: IDENTIFIER
     | IDENTIFIER '(' expr_list ')'
The first rule with just an IDENTIFIER can be reduced as soon as an identifier is found. However, the second rule can only be reduced if the input also includes a valid list of expressions written between parentheses.
How is a parser expected to work in a case like this one?
If I reduce the first identifier immediately, I can keep the result and throw it away if I realize that the second rule does match. If the second rule does not match, then I can return the result of the early reduction.
This also means that both reduction functions are going to be called if the second rule applies.
Are we instead expected to hold on the early reduction and apply it only if the second, longer rule applies?
For those interested, I put a more complete version of my parser grammar in this answer: https://codereview.stackexchange.com/questions/41769/numeric-expression-parser-calculator/41786#41786
Bottom-up parsers (like bison and yacc) do not reduce until they reach the end of the production. They don't have to guess which reduction they will use until they need it. So it's really not an issue to have two productions with the same prefix. In this sense, the algorithm is radically different from a top-down algorithm, used in recursive descent parsing, for example.
In order for the fragment which you provide to be parseable by an LALR(1) parser-generator -- that is, a bottom-up parser with the ability to examine only one (1) token beyond the end of a production -- the grammar must be such that there is no place in which unary might be followed by (. As long as that is true, the fact that the parser can see a ( is sufficient to prevent the unit reduction unary: IDENTIFIER from happening in a context in which it should reduce with the other unary production.
(That's a bit of an oversimplification, but I don't think that it would be correct to reproduce a standard text on LALR parsing here on SO.)
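To make that concrete (a hypothetical extension, not part of your grammar): if the grammar also allowed a call on an arbitrary unary expression, for example

postfix: unary '(' expr_list ')'

then with IDENTIFIER on the stack and '(' as the lookahead, the parser could not tell whether to reduce unary: IDENTIFIER (so that the '(' starts a postfix call) or to shift the '(' as part of unary: IDENTIFIER '(' expr_list ')', and an LALR(1) generator would report a shift/reduce conflict.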
From the Yacc introduction by Stephen C. Johnson:
With right recursive rules, such as
seq : item
    | item seq
    ;
the parser would be a bit bigger, and the items would be seen, and reduced, from right to left. More seriously, an internal stack in the parser would be in danger of overflowing if a very long sequence were read. Thus, the user should use left recursion wherever reasonable.
I know Yacc generates LR parsers, so I tried parsing some simple right recursive grammar by hand. I couldn't see the problem so far. Anyone can give an example to demonstrate these problems?
The parser size is not a serious issue (most of the time).
The runtime stack size can be an issue. The trouble is that the right-recursive rule means the stack can't be reduced until the parser has reached the end of the sequence, whereas with the left-recursive rule, each time the parser encounters a seq item, it can reduce the number of items on the stack.
Classically, the stack for tokens was fixed and limited in size. Consequently, a right recursive rule such as one to handle:
IF <cond> THEN
    <stmt-list>
ELIF <cond> THEN
    <stmt-list>
ELIF <cond> THEN
    <stmt-list>
ELSE
    <stmt-list>
ENDIF
would limit the number of terms in the chain of ELIF clauses that the grammar could accept.
Hypothetical grammar:
if_stmt: if_clause opt_elif_clause_list opt_else_clause ENDIF
;
if_clause: IF condition THEN stmt_list
;
opt_elif_clause_list: /* Nothing */
| elif_clause opt_elif_clause_list /* RR */
;
elif_clause: ELIF condition THEN stmt_list
;
opt_else_clause: /* Nothing */
| ELSE stmt_list
;
stmt_list: stmt
| stmt_list stmt /* LR */
;
I seem to remember doing some testing of this a long time ago now (a decade or more ago), and the Yacc that I was using at the time, in conjunction with the language grammar, similar to that above, meant that after about 300 ELIF clauses, the parser stopped (I think it stopped under control, recognizing it had run out of space, rather than crashing oblivious of the space exhaustion).
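For comparison (a sketch of the usual fix, not part of the original answer), rewriting just the RR rule above as left recursion keeps the parser stack bounded, because each elif_clause is reduced into the list as soon as it is complete:

opt_elif_clause_list: /* Nothing */
                    | opt_elif_clause_list elif_clause /* LR */
                    ;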
I'm not at all sure why he says the right recursive parser will be bigger -- in general it will require one fewer state in its state machine (which should if anything make it smaller), but the real issue is that the right recursive grammar requires unbounded stack space, while the left recursive grammar requires constant space (O(n) space vs O(1) space).
Now O(n) vs O(1) may sound like a big deal, but depending on what you are doing, it may be unimportant. In particular, if you're reading your entire input into memory to process it, that's O(n) space that completely overwhelms the O(n) vs O(1) distinction for right vs left recursion. If you're using a particularly old version of yacc that still has a fixed parser stack, it may be a problem, but recent versions of yacc (Berkeley yacc, bison) automatically expand their parse stack on demand, so the only limit is available memory.