I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really happen in when building a LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.
The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue remaining input action
---------------------- --------------------------- -----
ELEMENT , ELEMENT , ELEMENT SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT , ELEMENT , ELEMENT REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT REDUCE 3
List , ELEMENT SHIFT
List , ELEMENT SHIFT
List , ELEMENT -- REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List -- REDUCE 1
Start -- ACCEPT
and the parse is successful.
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ,, because , is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which lead to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.
Related
When writing a shift reduce parser, how does a shift reduce figure out what rule to apply efficiently? For example, if I have the following rules
S –> S + S
S –> id
How would the parser quickly determine the rule to apply in the following parse stacks?
$ id # id -> S
$ S # shift
$ S + # shift
$ S + id # id -> S
$ S + S # S + S -> S
$ S
All the examples I've seen just pull the correct rule out of nowhere, but what is the code behind choosing a rule? Pseudocode would be appreciated.
I've taken the examples from here, but pretty much any shift reduce parsing articles I find online just magically know what rule to use and don't show how to choose them.
The rule number is in the parsing table. In other words, it was precomputed when the parsing table was created.
An LR state is a set of LR items, where each item is a production and an index into the production, usually written with a •. When you take a transition from one state to the next one, you move the • one symbol to the right in all the qualifying items. For a shift action, an item qualifies if the symbol following the • is the token being shifted, and for a goto action, which happens at the end of a reduction, an item qualifies if the symbol following the • is the non-terminal which was just reduced.
Normally not all the items in a state qualify, unless there is just one item in the state. But it can happen that there are two or more qualifying items; that's an indication that the grammar probably wasn't LL. Anyway, it doesn't matter. The parser generator takes all the qualifying items and uses them to create a new state (or look up an already constructed state). Newly constructed states are completed by "ε-closure", which is a fancy way of saying that you add all the productions for each non-terminal which follows the • in the new state. (Recursively, which is why it's called a closure.)
When the parser reaches a state where the • is at the end of an item, it can reduce that particular item, which is precisely the production which will be reduced. Reducing an item basically means backing up the parser until you reach the beginning of the item's production, which 8s what the parser stack is used for: each stack entry is a transition, do as you pop the stack you move backwards in the parse history. Once you reach the beginning of the item, you must be in a state which has a goto action on the production's non-terminal. That must be the case because an item with the • at the beginning was added during ε-closure, which only happens when some item(s) in the state have their • before that non-terminal. Then you take the goto action, which registers the fact that an instance of that non-terminal has just been recognised, and continue from there. So there's no magic.
Each reducible item has a lookahead set, which was also computed during table construction, consisting of the possible tokens which might come next. If the actual next token --the lookahead token-- is in that set, the reduction is allowed to happen. If the lookahead token follows the • in the current state, a shift action is allowed. If a state has both a possible reduction action and a possible shift action on the same token, the table has a parsing conflict and the grammar is not LR. The same if two different items are both reducible on that state on the same lookahead. For a grammar to be LR, every state can have at most one possible action for every different lookahead token. (If it has no possible action for the current lookahead, the parse fails and a syntax error is reported.)
In my opinion, you can't really learn this algorithm by reading about, although I've tried to write it. To see how it works, you need to construct (or borrow) a parsing table and play parser, armed with a whiteboard or a big pad of paper to keep track of the parsing stack. If you can find (or build) a parsing table where the items have not been deleted, you might find it easier to follow, although it takes up a lot more space. (G2G, like many "tutorials", deleted the items, possibly making it look like magic. But there are other resources, such as the infamous Dragon Book.)
The parser itself doesn't need to look at the items; all the relevant information has been summarised in the parsing table, which I suppose is why sites like G2G don't show them. And they do create a lot of clutter. Bison can produce Graphview source for an image of the parsing automaton; you need to supply the --report=all command-line option if you want to see the ε-closure in each state.
I was going through the text Compilers Principles, Techniques and Tools by Ullman et. al where I came across the excerpt where the authors try to justify why stack is the best data structure of shift reduce parsing. They said that it is so because of the fact that
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form
In case (1), A is replaced by , and then the rightmost nonterminal B in that right side is replaced by . In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration
The parser now reduces the handle to B to reach the configuration
in which is the handle, and it gets reduced to A
In case (2), in configuration
the handle is on top of the stack. After reducing the handle to B, the parser can shift the string xy to get the next handle y on top of the stack,
Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively this is how I feel that the statement in can be justified
If there is an handle on the top of the stack, then the algorithm, will first reduce it before pushing the next input symbol on top of the stack. Since before the push any possible handle is reduced, so there is no chance of an handle being on the top of the stack and then pushing a new input symbol thereby causing the handle to go inside the stack.
Moreover I could not understand the logic the authors have given in highlighted portion of the excerpt justifying that the handle cannot occur inside the stack, based on what they say about B and other facts related to it.
Please can anyone help me understand the concept.
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).
An LL(1)-parser needs a lookahead-symbol for being able to decide which production to use. This is the reason why I always thought the term "lookahead" is used, when a parser looks at the next input token without "consuming" it (i.e. it can still be read from the input by the next action). LR(0) parsers, however, made me doubt that this is correct:
Every example of LR(0)-parsers that I've seen also uses the next input token for deciding whether to shift or to reduce.
In case of reduction the input token is not consumed.
I used the freeware tool "ParsingEmu" for generating an LR-table and performing an LR evalutation below for the word "aab". As you can see the column head contain tokens. From the evaluation you can see that the parser is deciding which column to use by looking at the next input token. But when the parser reduces in steps 4 - 6 the input doesn't change (although the parser needs to know the next input token "$" when performing a transition to the next state).
Grammar:
S -> A
A -> aA
A -> b
Table:
Evaluation:
Now I made following assumptions for the reason of my confusion:
My assumption for the definition of "lookahead" (lookahead = input token not being consumed) is wrong. Lookahead just means two different things for either LL-parsers or LR-parsers. If so, how can "lookahead" be defined then?
LR-parsers have (from the theoretical point of view when you would use push-down automaton) additional internal states where they consume the input token by putting it on the stack and therefore are able to make the shift- reduce- decision by just looking on the stack.
The evaluation shown above is LR(1). If true, what would an LR(0) evaluation look like?
Now what is correct, 1, 2 or 3 or something completely different?
It's important to be precise:
An LR(k) parser uses the curent parser state and k lookahead symbols to decide whether to reduce, and if so, by which production.
It also uses a shift transition table to decide which parsing state it should move to after shifting the next input token. The shift transition table is keyed by the current state and the (single) token being shifted, regardless of the value of k.
If in a given parser state, it would be possible to produce both a shift and a reduce action, then the parser has a shift/reduce conflict, and it is invalid. Consequently, the above two determinations could in theory be done nondeterministically.
If in a given parser state, no reduce is possible and the next input symbol cannot be shifted (that is, there is no transition for that state with that input symbol), then the parse has failed and the algorithm terminates.
If, on the other hand, the shift transition leads to the designated Accept state, then the parse succeeds and the algorithm terminates.
What all that means is that the lookahead is used to predict which, if any, reduction should be applied. In an LR(0) parser, the decision to shift (more accurately, to attempt to shift) must be made before reading the next input token, but the computation of the state to transition to do is made after reading the token, at which point it will signal an error if no shift is possible.
LL(k) parsers must predict which production replace a non-terminal with as soon as they see the non-terminal. The basic LL algorithm starts with a stack containing [S, $] (top to bottom) and does whichever of the following is applicable until done:
If the top of the stack is a non-terminal, replace the top of the stack with one of the productions for that non-terminal, using the next k input symbols to decide which one (without moving the input cursor), and continue.
If the top of the stack is a terminal, read the next input token. If it is the same terminal, pop the stack and continue. Otherwise, the parse has failed and the algorithm finishes.
If the stack is empty, the parse has succeeded and the algorithm finishes. (We assume that there is a unique EOF-marker $ at the end of the input.)
In both cases, lookahead has the same meaning: it consists of looking at input tokens without moving the input cursor.
If k is 0, then:
An LR(k) parser must decide whether or not to reduce without examining input, which means that no state can have either two different reduce actions or a reduce and a shift action.
An LL(k) parser must decide which production of a given non-terminal is appicable without examining input. In practice, this means that a each non-terminal can have only one production, which means the language must be finite.
I have read this to understand more the difference between top down and bottom up parsing, can anyone explain the problems associated with left recursion in a top down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide its guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
Hope this helps!
Reasons
1)The grammar which are left recursive(Direct/Indirect) can't be converted into {Greibach normal form (GNF)}* So the Left recursion can be eliminated to Right Recuraive Format.
2)Left Recursive Grammars are also nit LL(1),So again elimination of left Recursion may result into LL(1) grammer.
GNF
A Grammer of the form A->aV is Greibach Normal Form.
Let's take the following context-free grammar:
G = ( {Sum, Product, Number}, {decimal representations of numbers, +, *}, P, Sum)
Being P:
Sum → Sum + Product
Sum → Product
Product → Product * Number
Product → Number
Number → decimal representation of a number
I am trying to parse expressions produced by this grammar with a bottom-up-parser and a look-ahead-buffer (LAB) of length 1 (which supposingly should do without guessing and back-tracking).
Now, given a stack and a LAB, there are often several possibilities of how to reduce the stack or whether to reduce it at all or push another token.
Currently I use this decision tree:
If any top n tokens of the stack plus the LAB are the begining of the
right side of a rule, I push the next token onto the stack.
Otherwise, I reduce the maximum number of tokens on top of the stack.
I.e. if it is possible to reduce the topmost item and at the same time
it is possible to reduce the three top most items, I do the latter.
If no such reduction is possible, I push another token onto the stack.
Rinse and repeat.
This, seems (!) to work, but it requires an awful amount of rule searching, finding matching prefixes, etc. No way this can run in O(NM).
What is the standard (and possibly only sensible) approach to decide whether to reduce or push (shift), and in case of reduction, which reduction to apply?
Thank you in advance for your comments and answers.
The easiest bottom-up parsing approach for grammars like yours (basically, expression grammars) is operator-precedence parsing.
Recall that bottom-up parsing involves building the parse tree left-to-right from the bottom. In other words, at any given time during the parse, we have a partially assembled tree with only terminal symbols to the right of where we're reading, and a combination of terminals and non-terminals to the left (the "prefix"). The only possible reduction is one which applies to a suffix of the prefix; if no reduction applies, we must be able to shift a terminal from the input to the prefix.
An operator grammar has the feature that there are never two consecutive non-terminals in any production. Consequently, in a bottom-up parse of an operator grammar, either the last symbol in the prefix is a terminal or the second-last symbol is one. (Both of them could be.)
An operator precedence parser is essentially blind to non-terminals; it simply doesn't distinguish between them. So you cannot have two productions whose right-hand sides contain exactly the same sequence of terminals, because the op-prec parser wouldn't know which of these two productions to apply. (That's the traditional view. It's actually possible to extend that a bit so that you can have two productions with the same terminals, provided that the non-terminals are in different places. That allows grammars which have unary - operators, for example, since the right hand sides <non-terminal> - <non-terminal> and - <non-terminal> can be distinguished without knowing the names of the non-terminals; only their presence.
The other requirement is that you have to be able to build a precedence relationship between the terminals. More precisely, we define three precedence relations, usually written <·, ·> and ·=· (or some typographic variation on the theme), and insist that for any two terminals x and y, at most one of the relations x ·> y, x ·=· y and x <· y are true.
Roughly speaking, the < and > in the relations correspond to the edges of a production. In other words, if x <· y, that means that x can be followed by a non-terminal with a production whose first terminal is y. Similarly, x ·> y means that y can follow a non-terminal with a production whose last terminal is x. And x ·=· y means that there is some right-hand side where x and y are consecutive terminals, in that order (possibly with an intervening non-terminal).
If the single-relation restriction is true, then we can parse as follows:
Let x be the last terminal in the prefix (that is, either the last or second-last symbol), and let y be the lookahead symbol, which must be a terminal. If x ·> y then we reduce, and repeat the rule. Otherwise, we shift y onto the prefix.
In order to reduce, we need to find the start of the production. We move backwards over the prefix, comparing consecutive terminals (all of which must have <· or ·=· relations) until we find one with a <· relation. Then the terminals between the <· and the ·> are the right-hand side of the production we're looking for, and we can slot the non-terminals into the right-hand side as indicated.
There is no guarantee that there will be an appropriate production; if there isn't, the parse fails. But if the input is a valid sentence, and if the grammar is an operator-precedence grammar, then we will be able to find the right production to reduce.
Note that it is usually really easy to find the production, because most productions have only one (<non-terminal> * <non-terminal>) or two (( <non-terminal> )) terminals. A naive implementation might just run the terminals together into a string and use that as the key in a hash-table.
The classic implementation of operator-precedence parsing is the so-called "Shunting Yard Algorithm", devised by Edsger Dijkstra. In that algorithm, the precedence relations are modelled by providing two functions, left-precedence and right-precedence, which map terminals to integers such that x <· y is true only if right-precedence(x) < left-precedence(y) (and similarly for the other operators). It is not always possible to find such mappings, and the mappings are a cover of the actual precedence relations because it is often the case that there are pairs of terminals for which no precedence relationship applies. Nonetheless, it is often the case that these mappings can be found, and almost always the case for simple expression grammars.
I hope that's enough to get you started. I encourage you to actually read some texts about bottom-up parsing, because I think I've already written far too much for an SO answer, and I haven't yet included a single line of code. :)