How does a shift reduce parser know what rule to apply? - parsing

When writing a shift reduce parser, how does a shift reduce figure out what rule to apply efficiently? For example, if I have the following rules
S –> S + S
S –> id
How would the parser quickly determine the rule to apply in the following parse stacks?
$ id # id -> S
$ S # shift
$ S + # shift
$ S + id # id -> S
$ S + S # S + S -> S
$ S
All the examples I've seen just pull the correct rule out of nowhere, but what is the code behind choosing a rule? Pseudocode would be appreciated.
I've taken the examples from here, but pretty much any shift reduce parsing articles I find online just magically know what rule to use and don't show how to choose them.

The rule number is in the parsing table. In other words, it was precomputed when the parsing table was created.
An LR state is a set of LR items, where each item is a production and an index into the production, usually written with a •. When you take a transition from one state to the next one, you move the • one symbol to the right in all the qualifying items. For a shift action, an item qualifies if the symbol following the • is the token being shifted, and for a goto action, which happens at the end of a reduction, an item qualifies if the symbol following the • is the non-terminal which was just reduced.
Normally not all the items in a state qualify, unless there is just one item in the state. But it can happen that there are two or more qualifying items; that's an indication that the grammar probably wasn't LL. Anyway, it doesn't matter. The parser generator takes all the qualifying items and uses them to create a new state (or look up an already constructed state). Newly constructed states are completed by "ε-closure", which is a fancy way of saying that you add all the productions for each non-terminal which follows the • in the new state. (Recursively, which is why it's called a closure.)
When the parser reaches a state where the • is at the end of an item, it can reduce that particular item, which is precisely the production which will be reduced. Reducing an item basically means backing up the parser until you reach the beginning of the item's production, which 8s what the parser stack is used for: each stack entry is a transition, do as you pop the stack you move backwards in the parse history. Once you reach the beginning of the item, you must be in a state which has a goto action on the production's non-terminal. That must be the case because an item with the • at the beginning was added during ε-closure, which only happens when some item(s) in the state have their • before that non-terminal. Then you take the goto action, which registers the fact that an instance of that non-terminal has just been recognised, and continue from there. So there's no magic.
Each reducible item has a lookahead set, which was also computed during table construction, consisting of the possible tokens which might come next. If the actual next token --the lookahead token-- is in that set, the reduction is allowed to happen. If the lookahead token follows the • in the current state, a shift action is allowed. If a state has both a possible reduction action and a possible shift action on the same token, the table has a parsing conflict and the grammar is not LR. The same if two different items are both reducible on that state on the same lookahead. For a grammar to be LR, every state can have at most one possible action for every different lookahead token. (If it has no possible action for the current lookahead, the parse fails and a syntax error is reported.)
In my opinion, you can't really learn this algorithm by reading about, although I've tried to write it. To see how it works, you need to construct (or borrow) a parsing table and play parser, armed with a whiteboard or a big pad of paper to keep track of the parsing stack. If you can find (or build) a parsing table where the items have not been deleted, you might find it easier to follow, although it takes up a lot more space. (G2G, like many "tutorials", deleted the items, possibly making it look like magic. But there are other resources, such as the infamous Dragon Book.)
The parser itself doesn't need to look at the items; all the relevant information has been summarised in the parsing table, which I suppose is why sites like G2G don't show them. And they do create a lot of clutter. Bison can produce Graphview source for an image of the parsing automaton; you need to supply the --report=all command-line option if you want to see the ε-closure in each state.

Related

In shift reduce parsing why the handle always eventually appear on top of the stack and never inside?

I was going through the text Compilers Principles, Techniques and Tools by Ullman et. al where I came across the excerpt where the authors try to justify why stack is the best data structure of shift reduce parsing. They said that it is so because of the fact that
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form
In case (1), A is replaced by , and then the rightmost nonterminal B in that right side is replaced by . In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration
The parser now reduces the handle to B to reach the configuration
in which is the handle, and it gets reduced to A
In case (2), in configuration
the handle is on top of the stack. After reducing the handle to B, the parser can shift the string xy to get the next handle y on top of the stack,
Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively this is how I feel that the statement in can be justified
If there is an handle on the top of the stack, then the algorithm, will first reduce it before pushing the next input symbol on top of the stack. Since before the push any possible handle is reduced, so there is no chance of an handle being on the top of the stack and then pushing a new input symbol thereby causing the handle to go inside the stack.
Moreover I could not understand the logic the authors have given in highlighted portion of the excerpt justifying that the handle cannot occur inside the stack, based on what they say about B and other facts related to it.
Please can anyone help me understand the concept.
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).

Isn't an LR(0) parser using lookaheads as well?

An LL(1)-parser needs a lookahead-symbol for being able to decide which production to use. This is the reason why I always thought the term "lookahead" is used, when a parser looks at the next input token without "consuming" it (i.e. it can still be read from the input by the next action). LR(0) parsers, however, made me doubt that this is correct:
Every example of LR(0)-parsers that I've seen also uses the next input token for deciding whether to shift or to reduce.
In case of reduction the input token is not consumed.
I used the freeware tool "ParsingEmu" for generating an LR-table and performing an LR evalutation below for the word "aab". As you can see the column head contain tokens. From the evaluation you can see that the parser is deciding which column to use by looking at the next input token. But when the parser reduces in steps 4 - 6 the input doesn't change (although the parser needs to know the next input token "$" when performing a transition to the next state).
Grammar:
S -> A
A -> aA
A -> b
Table:
Evaluation:
Now I made following assumptions for the reason of my confusion:
My assumption for the definition of "lookahead" (lookahead = input token not being consumed) is wrong. Lookahead just means two different things for either LL-parsers or LR-parsers. If so, how can "lookahead" be defined then?
LR-parsers have (from the theoretical point of view when you would use push-down automaton) additional internal states where they consume the input token by putting it on the stack and therefore are able to make the shift- reduce- decision by just looking on the stack.
The evaluation shown above is LR(1). If true, what would an LR(0) evaluation look like?
Now what is correct, 1, 2 or 3 or something completely different?
It's important to be precise:
An LR(k) parser uses the curent parser state and k lookahead symbols to decide whether to reduce, and if so, by which production.
It also uses a shift transition table to decide which parsing state it should move to after shifting the next input token. The shift transition table is keyed by the current state and the (single) token being shifted, regardless of the value of k.
If in a given parser state, it would be possible to produce both a shift and a reduce action, then the parser has a shift/reduce conflict, and it is invalid. Consequently, the above two determinations could in theory be done nondeterministically.
If in a given parser state, no reduce is possible and the next input symbol cannot be shifted (that is, there is no transition for that state with that input symbol), then the parse has failed and the algorithm terminates.
If, on the other hand, the shift transition leads to the designated Accept state, then the parse succeeds and the algorithm terminates.
What all that means is that the lookahead is used to predict which, if any, reduction should be applied. In an LR(0) parser, the decision to shift (more accurately, to attempt to shift) must be made before reading the next input token, but the computation of the state to transition to do is made after reading the token, at which point it will signal an error if no shift is possible.
LL(k) parsers must predict which production replace a non-terminal with as soon as they see the non-terminal. The basic LL algorithm starts with a stack containing [S, $] (top to bottom) and does whichever of the following is applicable until done:
If the top of the stack is a non-terminal, replace the top of the stack with one of the productions for that non-terminal, using the next k input symbols to decide which one (without moving the input cursor), and continue.
If the top of the stack is a terminal, read the next input token. If it is the same terminal, pop the stack and continue. Otherwise, the parse has failed and the algorithm finishes.
If the stack is empty, the parse has succeeded and the algorithm finishes. (We assume that there is a unique EOF-marker $ at the end of the input.)
In both cases, lookahead has the same meaning: it consists of looking at input tokens without moving the input cursor.
If k is 0, then:
An LR(k) parser must decide whether or not to reduce without examining input, which means that no state can have either two different reduce actions or a reduce and a shift action.
An LL(k) parser must decide which production of a given non-terminal is appicable without examining input. In practice, this means that a each non-terminal can have only one production, which means the language must be finite.

LALR parsers and look-ahead

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really happen in when building a LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.
The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue remaining input action
---------------------- --------------------------- -----
ELEMENT , ELEMENT , ELEMENT SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT , ELEMENT , ELEMENT REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT REDUCE 3
List , ELEMENT SHIFT
List , ELEMENT SHIFT
List , ELEMENT -- REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List -- REDUCE 1
Start -- ACCEPT
and the parse is successful.
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ,, because , is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which lead to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.

Encode FIRST and FOLLOW sets into a recursive descent parser

This is a follow up to a previous question I asked How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated the FIRST and FOLLOW for all of my non-terminals, but am have trouble deciding where to logically place them in my program and what the best data-structure would be to do so.
Note: all code will be pseudo code
Option 1) Use a map, and map all non-terminals by their name to two Sets that contains their FIRST and FOLLOW sets:
class ParseConstants
Map firstAndFollowMap = #create a map .....
firstAndFollowMap.put("<program>", FIRST_SET, FOLLOW_SET)
end
This seems like a viable option, but inside of my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW and pass to error function:
#processes the <program> non-terminal
def program
List list = firstAndFollowMap.get("<program>")
Set FIRST = list.get(0)
Set FOLLOW = list.get(1)
error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
FIRST = .....
FOLLOW = ....
end
this leads to code that looks a little nicer:
#processes the <program> non-terminal
def program
error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up, I would love to hear any other suggestions for ways to encode these two sets, and also any critiques and additions to the two ways I posted would be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets. You need to compute those to get the parse table. That is a table of {<non-terminal, token> -> <action, rule>} if LL(k) (which means seeing a non-terminal in stack and token in input, which action to take and if applies, which rule to apply), or a table of {<state, token> -> <action, state>} if (C|LA|)LR(k) (which means given state in stack and token in input, which action to take and go to which state.
After you get this table, you don't need the FIRST and FOLLOWS any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors), is totally orthogonal to semantic analysis.
This means that in case of parse error, the phrase level error handler (PLEH) would try to fix the error. If it couldn't, parsing stops. If it could, the semantic analyzer shouldn't know if there was an error which was fixed, or there wasn't any error at all!
You can take a look at my compiler library for examples.
About phrase level error handling, you again don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because about LR(k) I haven't thought about much yet). After you build the grammar table, you have many entries, like I said like this:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on the stack, if it was a terminal, then you must match it with the input. If it didn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other stuff as well (for example check ahead in the input, if you have the token you want, drop one token from lexer)
If what you get is a non-terminal (T) from the stack instead, you must look at your lexer, get the lookahead and look at your table. If the entry <T, lookahead> existed, then you're good to go. Follow the action and push to/pop from the stack. If, however, no such entry existed, again, the phrase level error handler kicks in:
Role two: handle unexpected terminals - you can do many things to get passed this. What you do depends on what T and lookahead are and your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push lookahead to the stack (after pushing T back again) and have the parser continue. The parser would first match lookahead, throw it away and continues. Example: if you have an expression like this: *1+2/0.5, you can drop the unexpected * this way.
Change lookahead to something acceptable, push T back and retry. For example, an expression like this: 5id = 10; could be illegal because you don't accept ids that start with numbers. You can replace it with _5id for example to continue

Is it acceptable to store the previous state as a global variable?

One of the biggest problems with designing a lexical analyzer/parser combination is overzealousness in designing the analyzer. (f)lex isn't designed to have parser logic, which can sometimes interfere with the design of mini-parsers (by means of yy_push_state(), yy_pop_state(), and yy_top_state().
My goal is to parse a document of the form:
CODE1 this is the text that might appear for a 'CODE' entry
SUBCODE1 the CODE group will have several subcodes, which
may extend onto subsequent lines.
SUBCODE2 however, not all SUBCODEs span multiple lines
SUBCODE3 still, however, there are some SUBCODES that span
not only one or two lines, but any number of lines.
this makes it a challenge to use something like \r\n
as a record delimiter.
CODE2 Moreover, it's not guaranteed that a SUBCODE is the
only way to exit another SUBCODE's scope. There may
be CODE blocks that accomplish this.
In the end, I've decided that this section of the project is better left to the lexical analyzer, since I don't want to create a pattern that matches each line (and identifies continuation records). Part of the reason is that I want the lexical parser to have knowledge of the contents of each line, without incorporating its own tokenizing logic. That is to say, if I match ^SUBCODE[ ][ ].{71}\r\n (all records are blocked in 80-character records) I would not be able to harness the power of flex to tokenize the structured data residing in .{71}.
Given these constraints, I'm thinking about doing the following:
Entering a CODE1 state from the <INITIAL> start condition results
in calls to:
yy_push_state(CODE_STATE)
yy_push_state(CODE_CODE1_STATE)
(do something with the contents of the CODE1 state identifier, if such contents exist)
yy_push_state(SUBCODE_STATE) (to tell the analyzer to expect SUBCODE states belonging to the CODE_CODE1_STATE. This is where the analyzer begins to masquerade as a parser.
The <SUBCODE1_STATE> start condition is nested as follows: <CODE_STATE>{ <CODE_CODE1_STATE> { <SUBCODE_STATE>{ <SUBCODE1_STATE> { (perform actions based on the matching patterns) } } }. It also sets the global previous_state variable to yy_top_state(), to wit SUBCODE1_STATE.
Within <SUBCODE1_STATE>'s scope, \r\n will call yy_pop_state(). If a continuation record is present (which is a pattern at the highest scope against which all text is matched), yy_push_state(continuation_record_states[previous_state]) is called, bringing us back to the scope in 2. continuation_record_states[] maps each state with its continuation record state, which is used by the parser.
As you can see, this is quite complicated, which leads me to conclude that I'm massively over-complicating the task.
Questions
For states lacking an extremely clear token signifying the end of its scope, is my proposed solution acceptable?
Given that I want to tokenize the input using flex, is there any way to do so without start conditions?
The biggest problem I'm having is that each record (beginning with the (SUB)CODE prefix) is unique, but the information appearing after the (SUB)CODE prefix is not. Therefore, it almost appears mandatory to have multiple states like this, and the abstract CODE_STATE and SUBCODE_STATE states would act as groupings for each of the concrete SUBCODE[0-9]+_STATE and CODE[0-9]+_STATE states.
I would look at how the OMeta parser handles these things.

Resources