After reduction, which state will parser go? - parsing

I use Bison output file to analyze the state (machine) transformation of parser, I find when parser deduce a rule, it goes back to a previous state, but sometimes it goes one state back, sometimes it goes two or three states back. Can anyone tell me what is the rule that determine to which state the state machine will go back, after finished a deduction?
Thanks in advance.

When an LR(k) machine performs a reduction, it pops the right-hand side of the production off the parser stack, revealing the state in which parsing of the production started. It then looks up the reduced non-terminal in the GOTO table for that state.
So the number of entries popped off the parser stack will be the number of symbols on the right-hand side of the reduced production. (In theory, an LR parser could optimize by not pushing all symbols onto the stack, which would allow it to pop fewer symbols off the stack. But as far as I know, bison doesn't do this particular optimization, because it would dramatically complicated the interface.)

Related

How does a shift reduce parser know what rule to apply?

When writing a shift reduce parser, how does a shift reduce figure out what rule to apply efficiently? For example, if I have the following rules
S –> S + S
S –> id
How would the parser quickly determine the rule to apply in the following parse stacks?
$ id # id -> S
$ S # shift
$ S + # shift
$ S + id # id -> S
$ S + S # S + S -> S
$ S
All the examples I've seen just pull the correct rule out of nowhere, but what is the code behind choosing a rule? Pseudocode would be appreciated.
I've taken the examples from here, but pretty much any shift reduce parsing articles I find online just magically know what rule to use and don't show how to choose them.
The rule number is in the parsing table. In other words, it was precomputed when the parsing table was created.
An LR state is a set of LR items, where each item is a production and an index into the production, usually written with a •. When you take a transition from one state to the next one, you move the • one symbol to the right in all the qualifying items. For a shift action, an item qualifies if the symbol following the • is the token being shifted, and for a goto action, which happens at the end of a reduction, an item qualifies if the symbol following the • is the non-terminal which was just reduced.
Normally not all the items in a state qualify, unless there is just one item in the state. But it can happen that there are two or more qualifying items; that's an indication that the grammar probably wasn't LL. Anyway, it doesn't matter. The parser generator takes all the qualifying items and uses them to create a new state (or look up an already constructed state). Newly constructed states are completed by "ε-closure", which is a fancy way of saying that you add all the productions for each non-terminal which follows the • in the new state. (Recursively, which is why it's called a closure.)
When the parser reaches a state where the • is at the end of an item, it can reduce that particular item, which is precisely the production which will be reduced. Reducing an item basically means backing up the parser until you reach the beginning of the item's production, which 8s what the parser stack is used for: each stack entry is a transition, do as you pop the stack you move backwards in the parse history. Once you reach the beginning of the item, you must be in a state which has a goto action on the production's non-terminal. That must be the case because an item with the • at the beginning was added during ε-closure, which only happens when some item(s) in the state have their • before that non-terminal. Then you take the goto action, which registers the fact that an instance of that non-terminal has just been recognised, and continue from there. So there's no magic.
Each reducible item has a lookahead set, which was also computed during table construction, consisting of the possible tokens which might come next. If the actual next token --the lookahead token-- is in that set, the reduction is allowed to happen. If the lookahead token follows the • in the current state, a shift action is allowed. If a state has both a possible reduction action and a possible shift action on the same token, the table has a parsing conflict and the grammar is not LR. The same if two different items are both reducible on that state on the same lookahead. For a grammar to be LR, every state can have at most one possible action for every different lookahead token. (If it has no possible action for the current lookahead, the parse fails and a syntax error is reported.)
In my opinion, you can't really learn this algorithm by reading about, although I've tried to write it. To see how it works, you need to construct (or borrow) a parsing table and play parser, armed with a whiteboard or a big pad of paper to keep track of the parsing stack. If you can find (or build) a parsing table where the items have not been deleted, you might find it easier to follow, although it takes up a lot more space. (G2G, like many "tutorials", deleted the items, possibly making it look like magic. But there are other resources, such as the infamous Dragon Book.)
The parser itself doesn't need to look at the items; all the relevant information has been summarised in the parsing table, which I suppose is why sites like G2G don't show them. And they do create a lot of clutter. Bison can produce Graphview source for an image of the parsing automaton; you need to supply the --report=all command-line option if you want to see the ε-closure in each state.

In shift reduce parsing why the handle always eventually appear on top of the stack and never inside?

I was going through the text Compilers Principles, Techniques and Tools by Ullman et. al where I came across the excerpt where the authors try to justify why stack is the best data structure of shift reduce parsing. They said that it is so because of the fact that
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form
In case (1), A is replaced by , and then the rightmost nonterminal B in that right side is replaced by . In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration
The parser now reduces the handle to B to reach the configuration
in which is the handle, and it gets reduced to A
In case (2), in configuration
the handle is on top of the stack. After reducing the handle to B, the parser can shift the string xy to get the next handle y on top of the stack,
Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively this is how I feel that the statement in can be justified
If there is an handle on the top of the stack, then the algorithm, will first reduce it before pushing the next input symbol on top of the stack. Since before the push any possible handle is reduced, so there is no chance of an handle being on the top of the stack and then pushing a new input symbol thereby causing the handle to go inside the stack.
Moreover I could not understand the logic the authors have given in highlighted portion of the excerpt justifying that the handle cannot occur inside the stack, based on what they say about B and other facts related to it.
Please can anyone help me understand the concept.
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).

Isn't an LR(0) parser using lookaheads as well?

An LL(1)-parser needs a lookahead-symbol for being able to decide which production to use. This is the reason why I always thought the term "lookahead" is used, when a parser looks at the next input token without "consuming" it (i.e. it can still be read from the input by the next action). LR(0) parsers, however, made me doubt that this is correct:
Every example of LR(0)-parsers that I've seen also uses the next input token for deciding whether to shift or to reduce.
In case of reduction the input token is not consumed.
I used the freeware tool "ParsingEmu" for generating an LR-table and performing an LR evalutation below for the word "aab". As you can see the column head contain tokens. From the evaluation you can see that the parser is deciding which column to use by looking at the next input token. But when the parser reduces in steps 4 - 6 the input doesn't change (although the parser needs to know the next input token "$" when performing a transition to the next state).
Grammar:
S -> A
A -> aA
A -> b
Table:
Evaluation:
Now I made following assumptions for the reason of my confusion:
My assumption for the definition of "lookahead" (lookahead = input token not being consumed) is wrong. Lookahead just means two different things for either LL-parsers or LR-parsers. If so, how can "lookahead" be defined then?
LR-parsers have (from the theoretical point of view when you would use push-down automaton) additional internal states where they consume the input token by putting it on the stack and therefore are able to make the shift- reduce- decision by just looking on the stack.
The evaluation shown above is LR(1). If true, what would an LR(0) evaluation look like?
Now what is correct, 1, 2 or 3 or something completely different?
It's important to be precise:
An LR(k) parser uses the curent parser state and k lookahead symbols to decide whether to reduce, and if so, by which production.
It also uses a shift transition table to decide which parsing state it should move to after shifting the next input token. The shift transition table is keyed by the current state and the (single) token being shifted, regardless of the value of k.
If in a given parser state, it would be possible to produce both a shift and a reduce action, then the parser has a shift/reduce conflict, and it is invalid. Consequently, the above two determinations could in theory be done nondeterministically.
If in a given parser state, no reduce is possible and the next input symbol cannot be shifted (that is, there is no transition for that state with that input symbol), then the parse has failed and the algorithm terminates.
If, on the other hand, the shift transition leads to the designated Accept state, then the parse succeeds and the algorithm terminates.
What all that means is that the lookahead is used to predict which, if any, reduction should be applied. In an LR(0) parser, the decision to shift (more accurately, to attempt to shift) must be made before reading the next input token, but the computation of the state to transition to do is made after reading the token, at which point it will signal an error if no shift is possible.
LL(k) parsers must predict which production replace a non-terminal with as soon as they see the non-terminal. The basic LL algorithm starts with a stack containing [S, $] (top to bottom) and does whichever of the following is applicable until done:
If the top of the stack is a non-terminal, replace the top of the stack with one of the productions for that non-terminal, using the next k input symbols to decide which one (without moving the input cursor), and continue.
If the top of the stack is a terminal, read the next input token. If it is the same terminal, pop the stack and continue. Otherwise, the parse has failed and the algorithm finishes.
If the stack is empty, the parse has succeeded and the algorithm finishes. (We assume that there is a unique EOF-marker $ at the end of the input.)
In both cases, lookahead has the same meaning: it consists of looking at input tokens without moving the input cursor.
If k is 0, then:
An LR(k) parser must decide whether or not to reduce without examining input, which means that no state can have either two different reduce actions or a reduce and a shift action.
An LL(k) parser must decide which production of a given non-terminal is appicable without examining input. In practice, this means that a each non-terminal can have only one production, which means the language must be finite.

Issue with left recursion in top down parsing

I have read this to understand more the difference between top down and bottom up parsing, can anyone explain the problems associated with left recursion in a top down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide its guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
Hope this helps!
Reasons
1)The grammar which are left recursive(Direct/Indirect) can't be converted into {Greibach normal form (GNF)}* So the Left recursion can be eliminated to Right Recuraive Format.
2)Left Recursive Grammars are also nit LL(1),So again elimination of left Recursion may result into LL(1) grammer.
GNF
A Grammer of the form A->aV is Greibach Normal Form.

Why can't a compiler have a "shift/shift" conflict?

I currently am studying about compilers and as I understand in LR(0) there are cases where we have "shift/reduce" or "reduce/reduce" conflicts, but it's impossible to have "shift/shift" conflicts! Why we can't have a "shift/shift" conflict?
Shift/reduce conflicts occur when the parser can't tell whether to shift (push the next input token atop the parsing stack) or reduce (pop a series of terminals and nonterminals from the parsing stack). A reduce/reduce conflict is when the parser knows to reduce, but can't tell which reduction to perform.
If you were to have a shift/shift conflict, the parser would know that it needed to push the next token onto its parsing stack, but wouldn't know how to do it. Since there is only one way to push the token onto the parsing stack, there generally cannot be any conflicts of this form.
That said, it's theoretically possible for a shift/shift conflict to exist if you had a strange setup in which there were two or more transitions leading out of a given parsing state that were labeled with the same terminal symbol. The conflict in that case would be whether to shift and go to one state or shift and go to another. This could happen if you tried to compress an automaton into fewer states and did so incorrectly, or if you were trying to build a nondeterministic parsing automaton. In practice, this would never happen.
Hope this helps!

Resources