Meaning of YACC expression using yysindex and yyrindex - parsing

In a yacc-based project I've encountered a complex expression that I don't understand. It is repeated multiple times, so it looks like copy-and-paste. In fact, exactly the same expression occurs inside the yacc skeleton (byacc-1.9), so I'm assuming it has some particular meaning.
Here's the expression:
if (((yyn = yysindex[lastyystate]) && (yyn += tok) >= 0 &&
yyn <= YYTABLESIZE && yycheck[yyn] == tok) ||
((yyn = yyrindex[lastyystate]) && (yyn += tok) >= 0 &&
yyn <= YYTABLESIZE && yycheck[yyn] == tok)) {
If you partition this you get
((yyn = yysindex[lastyystate]) && (yyn += tok) >= 0 && yyn <= YYTABLESIZE && yycheck[yyn] == tok)
OR
((yyn = yyrindex[lastyystate]) && (yyn += tok) >= 0 && yyn <= YYTABLESIZE && yycheck[yyn] == tok)
I'm fairly familiar with parser generators and know that yacc is LALR(1). I'm guessing
yysindex is the "shift table"
yyrindex is the "reduce table"
yyn <= YYTABLESIZE is just range checking
Given the similarity of the two parts, and assuming my guesses are right, it would seem that they both look something up in the packed/coded parsing tables. I haven't dug into the details of how yacc stores the parsing tables; if someone knows more about this, that would probably help.
In this context tok is (obviously) the token number. But why the += tok, and what is the yycheck table?
This code is a part of source code completion, say using TAB, if that helps to explain things.
Extra points if you can explain in a single sentence, as in give a function name, what the intention of this complex expression is.

"Next+check" transition table compression, also commonly referred to as the comb algorithm, is described in the Dragon book with respect to lexers in Section 3.9, with a note in chapter 4 about its use with parser tables.
The algorithm stores a sparse matrix by overlaying rows so that valid entries are overlapped only with missing entries. To look up a value in the compressed table you need a vector of the starting index for each row. Then you add the column number you're looking for to that starting index. Finally, you need to know whether the entry corresponds to the row you're looking in; the check vector contains the number of the row for which each table entry is valid.
It's called comb compression from the idea that each row is like a comb with many broken tines.
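The overlaying and the three-step lookup described above can be sketched as follows. This is a minimal illustration, not byacc's packing code: the function names, the first-fit placement strategy, and the use of None for empty slots are all assumptions made for the example.

```python
def comb_compress(rows):
    """Pack sparse rows (dicts: column -> value) into base/table/check vectors.

    Hypothetical first-fit packing, for illustration only.
    """
    table, check, base = [], [], []
    for row_no, row in enumerate(rows):
        offset = 0
        # Slide the row right until none of its entries lands on a used slot.
        while any(offset + col < len(check) and check[offset + col] is not None
                  for col in row):
            offset += 1
        base.append(offset)
        for col, value in row.items():
            idx = offset + col
            while idx >= len(table):       # grow the vectors on demand
                table.append(None)
                check.append(None)
            table[idx] = value
            check[idx] = row_no            # remember which row owns this slot
    return base, table, check


def lookup(base, table, check, row_no, col):
    """Return the entry at (row_no, col), or None if that cell was empty."""
    idx = base[row_no] + col               # starting index of the row + column
    if 0 <= idx < len(table) and check[idx] == row_no:
        return table[idx]
    return None


# Three sparse rows of a 3x4 matrix share one short vector.
rows = [{0: "a", 3: "b"}, {1: "c"}, {0: "d", 2: "e"}]
base, table, check = comb_compress(rows)
print(base, table, check)
```

The check vector is what makes the overlap safe: without it, looking up column 1 of row 2 would silently return an entry that belongs to another row.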
In the parser framework, that code (or something similar to it) will be used to ascertain the parser action corresponding to a token. Possible answers are:
Shift the token and go to state S. (Push the token onto the stack and read a new lookahead).
Reduce the stack using production number P. (Pop the right-hand side off the stack, execute the reduction action, go to the state found by consulting the GOTO table from the state revealed by the pop and the number of the reduced non-terminal, push the reduced non-terminal onto the stack, and then continue with the same lookahead token.)
Signal an error.
Accept the input. (Only if the lookahead token is the end-of-input marker. This possibility is often special cased.)
I guess that you are right about there being separate shift and reduce action tables. It's possible that the rows overlap better if you compress the actions separately, although the compression would have to be a lot better to compensate for the extra check array.
Given that the code is used for constructing a completion list, I suppose that it is being used in a simulation of the parse for each possible next token, in order to decide what the valid next token candidates are. The statement returns true if the token can be acted upon (if not, the candidate can simply be removed from the completion list) and sets yyn to the next action. It will be necessary to distinguish between Shift and Reduce actions. In some parser frameworks, the sign of the action is used for this purpose but there are other possibilities.
So I'd call the function find_parse_action (or find_parse_action_if_any, if you like wordier names).
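As a rough transliteration, the expression could be read as the function below. The tiny tables are invented purely for the example, and the final read of yytable is an assumption based on how the byacc skeleton appears to continue after the same test; the expression in the question only performs the check. Note that in this layout the check vector holds the token (the column) rather than the row number.

```python
# Hypothetical packed tables, made up for illustration only.
yysindex = [0, 1]            # per-state base of the shift row (0 = no row)
yyrindex = [2, 0]            # per-state base of the reduce row (0 = no row)
yytable = [0, 0, 5, 0, 3]    # packed actions (assumed to be read after the test)
yycheck = [-1, -1, 1, -1, 2] # token that owns each slot of yytable
YYTABLESIZE = len(yytable) - 1


def find_parse_action(state, tok):
    """Mirror of the C expression: try the shift row, then the reduce row."""
    for kind, index in (("shift", yysindex), ("reduce", yyrindex)):
        yyn = index[state]               # yyn = yysindex[state]
        if yyn != 0:
            yyn += tok                   # row base + column: the "+= tok"
            if 0 <= yyn <= YYTABLESIZE and yycheck[yyn] == tok:
                return kind, yytable[yyn]
    return None                          # no action: the token is not acceptable


print(find_parse_action(1, 1))
print(find_parse_action(0, 2))
print(find_parse_action(0, 1))
```

Seen this way, `+= tok` is just the "add the column number to the row's starting index" step of comb compression, and yycheck is the validity check for the resulting slot.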
Note that in an LALR parser, the simulation will need to iteratively apply reduce actions because the acceptability of a token isn't known until a shift action is actually encountered. Simulating reductions involves popping the stack which would be destructive if applied to the real parser stack. So part of the code probably creates and manages a simulated stack. (Although it's also possible that byacc uses a spaghetti stack. I've never actually looked at its skeleton.)
Bison uses similar code to implement "Lookahead Correction" (LAC), which is used to produce more informative error messages. ("Expecting X or Y or Z", which is another completion list activity.) So you can see another possible implementation technique in the Bison skeleton. IIRC, it makes a copy of the stack in order to simulate pops.
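The simulation on a copied stack might look like the sketch below. It assumes unpacked ACTION and GOTO dictionaries, hand-built for a toy grammar (S -> S + id | id), rather than byacc's packed tables; all names are hypothetical.

```python
# Hand-built SLR(1) tables for the toy grammar (illustration only):
#   rule 1: S -> S + id      rule 2: S -> id
ACTION = {
    (0, "id"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "id"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1}
RULES = {1: ("S", 3), 2: ("S", 1)}   # rule -> (left-hand side, length of RHS)


def token_is_acceptable(stack, tok):
    """Simulate the parse on a copy of the state stack until tok is shifted."""
    sim = list(stack)                     # never touch the real parser stack
    while True:
        action = ACTION.get((sim[-1], tok))
        if action is None:
            return False                  # error entry: tok cannot come next
        kind, rule = action
        if kind in ("shift", "accept"):
            return True
        lhs, length = RULES[rule]         # reduce: pop the RHS, then take GOTO
        del sim[len(sim) - length:]
        sim.append(GOTO[(sim[-1], lhs)])


def completions(stack, tokens=("id", "+", "$")):
    """Candidate next tokens for a given state stack."""
    return [tok for tok in tokens if token_is_acceptable(stack, tok)]


print(completions([0, 2]))      # state stack after reading "id"
print(completions([0, 1, 3]))   # state stack after reading "id +"
```

After "id", the token "+" only becomes acceptable once the pending reduction S -> id has been simulated, which is why the loop has to keep applying reductions before it can answer.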

Related

How does a shift reduce parser know what rule to apply?

When writing a shift reduce parser, how does a shift reduce figure out what rule to apply efficiently? For example, if I have the following rules
S –> S + S
S –> id
How would the parser quickly determine the rule to apply in the following parse stacks?
$ id # id -> S
$ S # shift
$ S + # shift
$ S + id # id -> S
$ S + S # S + S -> S
$ S
All the examples I've seen just pull the correct rule out of nowhere, but what is the code behind choosing a rule? Pseudocode would be appreciated.
I've taken the examples from here, but pretty much any shift reduce parsing articles I find online just magically know what rule to use and don't show how to choose them.
The rule number is in the parsing table. In other words, it was precomputed when the parsing table was created.
An LR state is a set of LR items, where each item is a production and an index into the production, usually written with a •. When you take a transition from one state to the next one, you move the • one symbol to the right in all the qualifying items. For a shift action, an item qualifies if the symbol following the • is the token being shifted, and for a goto action, which happens at the end of a reduction, an item qualifies if the symbol following the • is the non-terminal which was just reduced.
Normally not all the items in a state qualify, unless there is just one item in the state. But it can happen that there are two or more qualifying items; that's an indication that the grammar probably wasn't LL. Anyway, it doesn't matter. The parser generator takes all the qualifying items and uses them to create a new state (or look up an already constructed state). Newly constructed states are completed by "ε-closure", which is a fancy way of saying that you add all the productions for each non-terminal which follows the • in the new state. (Recursively, which is why it's called a closure.)
When the parser reaches a state where the • is at the end of an item, it can reduce that particular item, which is precisely the production which will be reduced. Reducing an item basically means backing up the parser until you reach the beginning of the item's production, which is what the parser stack is used for: each stack entry is a transition, so as you pop the stack you move backwards in the parse history. Once you reach the beginning of the item, you must be in a state which has a goto action on the production's non-terminal. That must be the case because an item with the • at the beginning was added during ε-closure, which only happens when some item(s) in the state have their • before that non-terminal. Then you take the goto action, which registers the fact that an instance of that non-terminal has just been recognised, and continue from there. So there's no magic.
Each reducible item has a lookahead set, which was also computed during table construction, consisting of the possible tokens which might come next. If the actual next token --the lookahead token-- is in that set, the reduction is allowed to happen. If the lookahead token follows the • in the current state, a shift action is allowed. If a state has both a possible reduction action and a possible shift action on the same token, the table has a parsing conflict and the grammar is not LR. The same if two different items are both reducible on that state on the same lookahead. For a grammar to be LR, every state can have at most one possible action for every different lookahead token. (If it has no possible action for the current lookahead, the parse fails and a syntax error is reported.)
In my opinion, you can't really learn this algorithm by reading about it, although I've tried to write it down. To see how it works, you need to construct (or borrow) a parsing table and play parser, armed with a whiteboard or a big pad of paper to keep track of the parsing stack. If you can find (or build) a parsing table where the items have not been deleted, you might find it easier to follow, although it takes up a lot more space. (G2G, like many "tutorials", deleted the items, possibly making it look like magic. But there are other resources, such as the infamous Dragon Book.)
The parser itself doesn't need to look at the items; all the relevant information has been summarised in the parsing table, which I suppose is why sites like G2G don't show them. And they do create a lot of clutter. Bison can produce Graphviz source for an image of the parsing automaton; you need to supply the --report=all command-line option if you want to see the ε-closure in each state.
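Since the question asked for pseudocode, here is a minimal table-driven sketch of that lookup. It does not use the ambiguous grammar from the question; as an assumption, it uses the unambiguous variant S -> S + id | id with hand-built SLR(1) tables, and every name in it is hypothetical.

```python
# Hand-built SLR(1) tables (illustration only):
#   rule 1: S -> S + id      rule 2: S -> id
ACTION = {
    (0, "id"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "id"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1}
RULES = {1: ("S", 3), 2: ("S", 1)}   # rule -> (left-hand side, length of RHS)


def parse(tokens):
    """Return the rule numbers in the order the parser reduces them."""
    stack = [0]                            # stack of states, start state at bottom
    stream = list(tokens) + ["$"]          # "$" marks end of input
    pos = 0
    reductions = []
    while True:
        action = ACTION.get((stack[-1], stream[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {stream[pos]!r} at position {pos}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)              # arg is the next state
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]       # the rule number came from the table
            del stack[len(stack) - length:]
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(arg)
        else:
            return reductions              # accept


print(parse(["id", "+", "id"]))
```

The parser never searches for a matching rule: the rule number is simply the payload of the ACTION entry for the current state and lookahead.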

In shift reduce parsing why the handle always eventually appear on top of the stack and never inside?

I was going through the text Compilers: Principles, Techniques, and Tools by Ullman et al., where I came across the excerpt in which the authors try to justify why a stack is the best data structure for shift-reduce parsing. They said that it is so because of the fact that
"The handle will always eventually appear on top of the stack, never inside."
The Excerpt
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation. These two steps can be of the form
(1) S ⇒* αAz ⇒ αβByz ⇒ αβγyz
(2) S ⇒* αBxAz ⇒ αBxyz ⇒ αγxyz
In case (1), A is replaced by βBy, and then the rightmost nonterminal B in that right side is replaced by γ. In case (2), A is again replaced first, but this time the right side is a string y of terminals only. The next rightmost nonterminal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just reached the configuration
STACK: $αβγ    INPUT: yz$
The parser now reduces the handle γ to B to reach the configuration
STACK: $αβB    INPUT: yz$
The parser can now shift the string y onto the stack to reach the configuration
STACK: $αβBy    INPUT: z$
in which βBy is the handle, and it gets reduced to A.
In case (2), in configuration
STACK: $αγ    INPUT: xyz$
the handle γ is on top of the stack. After reducing the handle γ to B, the parser can shift the string xy to get the next handle y on top of the stack,
STACK: $αBxy    INPUT: z$
Now the parser reduces y to A.
In both cases, after making a reduction the parser had to shift zero or more symbols to get the next handle onto the stack. It never had to go into the stack to find the handle. It is this aspect of handle pruning that makes a stack a particularly convenient data structure for implementing a shift-reduce parser.
My reasoning and doubts
Intuitively, this is how I feel that the statement can be justified:
If there is a handle on the top of the stack, then the algorithm will first reduce it before pushing the next input symbol on top of the stack. Since any possible handle is reduced before the push, there is no chance of a handle being on the top of the stack and a new input symbol then being pushed, which would cause the handle to go inside the stack.
Moreover I could not understand the logic the authors have given in highlighted portion of the excerpt justifying that the handle cannot occur inside the stack, based on what they say about B and other facts related to it.
Can anyone please help me understand the concept?
The key to the logic expressed by the authors is in the statement at the beginning (emphasis added):
This fact becomes obvious when we consider the possible forms of two successive steps in any rightmost derivation.
It's also important to remember that a bottom-up parser traces out a right-most derivation backwards. Each reduction performed by the parser is a step in the derivation; since the derivation is rightmost the non-terminal being replaced in the derivation step must be the last non-terminal in the sentential form. So if we write down the sequence of reduction actions used by the parser and then read the list backwards, we get the derivation. Alternatively, if we write down the list of productions used in the rightmost derivation and then read it backwards, we get the sequence of parser reductions.
Either way, the point is to prove that the successive handles in the derivation steps correspond to monotonically non-retreating prefixes in the original input. The authors' proof takes two derivation steps (any two derivation steps) and shows that the end of the handle of the second derivation step is not before the end of the handle of the first step (although the ends of the two handles may be at the same point in the input).

Observer pattern usage in syntax tree parsing

I have detailed the specifications of the problem for reasons that will become clear after I ask my question, at the end. The program I am building is a parser in Java for a language with the following syntax (although this is not very relevant to the question):
<expr> ::= [<op> <expr> <expr>] | <symbol> | <value>
<symbol> ::= [a-zA-Z]+
<value> ::= [0-9]+
<op> ::= '+' | '*' | '==' | '<'
<assignment> ::= [= <symbol> <expr>]
<prog> ::= <assignment> |
[; <prog> <prog>] |
[if <expr> <prog> <prog>] |
[for <assignment> <expr> <assignment> <prog>] |
[assert <expr>] |
[return <expr>]
This is an example of code in said language:
[; [= x 0] [; [if [== x 5] [= x 7] [= x [+ x 1]]] [return x]]]
Which is equivalent to:
x = 0;
if (x == 5)
x = 7;
else
x = x + 1;
return x;
The code is guaranteed to be given in correct syntax; incorrectness of the given code is defined only by:
a) A used variable (symbol) not previously declared (declared meaning that something was assigned to it), even if the variable is used in a branch of an if or some other place that is never reached in the execution of the program;
b) Not having a "return" instruction on every path the program could take, meaning the program must not be able to end without returning on any execution path it may take.
The requirements are that the program should parse the code.
My parser must:
a) Check for said correctness;
b) Parse the code and compute what is the returned value.
My take on this is:
1) Parse the code given into a tree of instructions and expressions;
2) Check for correctness by traversing the tree and seeing if a variable was declared in an upper scope before it was used;
3) Check for correctness by traversing the tree and seeing if any execution branch ends in a "return" instruction;
4) If all previous conditions hold, evaluate the returned value of the code by traversing the tree and remembering the value of all the variables in a HashMap or some other storage.
Now, my problem is that I must implement the parser using the Visitor and Observer design patterns. This is a key requirement for the project. I am fairly new to design patterns and only have a basic grasp about these two.
My question is: where should/can I use the Observer design pattern in my parser?
It makes sense to use a Visitor to visit the nodes of the tree for steps 2, 3 and 4. I cannot figure out where I must use the Observer pattern, though.
Is there any place I can use it in my implementation? From my understanding, the Observer pattern takes care of data that can be read and modified by many "observers", the central idea being that an object modifying the data will announce the other objects that may be affected by the modification.
The main data being modified in my program is the tree and the HashMap in which I store the values for the variables. Both of these are accessed in a linear fashion, by only one thing. The tree is built one node at a time, and no other node, or object, for that matter, cares that a node is added or modified. In the evaluation phase, each node is visited and variables are added or modified in the hash table, but no object other than the current visitor from the current node cares about this. I suppose I can make each node an observer which upon observing a change does nothing, or something like that, forcing an Observer pattern, but that isn't really useful.
So, is there an obvious answer which I am completely missing? Is there a not so obvious one, but still giving a useful implementation of Observer? Can I use a half useful, slightly forced Observer pattern somewhere in my algorithms, or is a fully forced, completely useless way the only way to implement it? Is there a completely different way of approaching the problem which will allow me to use the Visitor and, more importantly, the Observer pattern?
Notes:
I am yet to implement the evaluation of the tree (steps 2, 3 and 4) with Visitor; I have only thought about how I should do it. I will implement it tomorrow and see if there is a way to use Observer somewhere, but having thought about how I could use it for a few hours, I still have no idea. I am hoping, though, that there is a way, which I haven't been able to discover but which will become clear after writing that part.
I apologize for writing so much. I couldn't summarize it any better and still give details about the situation.
Also, I apologize if my explanations are not clear. It is quite late, I have thought about this for some hours and got tired, and I can't say I have a perfect grasp of the concepts. If anything is unclear or you want further details on some matter, don't hesitate to ask. Also, don't hesitate to highlight any mistakes or wrong paths in my judgement about the problem.
Here are some ideas how you could use well-known patterns and concepts to build an interpreter for your language:
Start processing an input stream by splitting it up into tokens ([;, =, x, 0, ], etc.). This first component (a.k.a. lexer, scanner, tokenizer) strips out irrelevant detail such as whitespace and produces tokens.
You could implement this as a simple state machine that is fed one character at a time. It can therefore be implemented as an observer of the input character sequence.
In the next stage (a.k.a. parsing) you build an abstract syntax tree (AST) from the generated tokens.
The parser is an observer of the tokenizer, i.e. it gets notified of any new tokens the tokenizer produces.(*) It is fed one token at a time. In the case of a fairly simple grammar, the parser itself can also be some stack-based state machine. (For example, if it needs to match opening and closing brackets, it needs to be able to remember what context / state it was in outside the brackets, thus it'll probably use some kind of stack for context management.)
Once the parser has built an AST, perhaps almost all subsequent steps can be implemented using the visitor pattern. Every algorithm that traverses the AST, locates specific nodes or subtrees, or transforms (parts of) the AST can be modelled as a visitor. Some visitors will only model actions that do not return a value, others return a new AST node for a visited node such that a new transformed AST can be put together. For example:
Check whether the AST or any subtree of it describes a valid program fragment (visit a (sub-) tree, reduce it to a boolean or a list of errors).
Simplify / optimize the program (visit a (sub-) tree, generate a smaller or otherwise more optimal subtree, finally reassemble a new tree from the transformed subtrees).
Execute the AST (visit and interpret each node, execute some code according to the node's meaning).
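The visitor uses listed above might be sketched like this for the expression part of the language. Python is used here only for brevity (the project itself is in Java), and all class and method names are hypothetical. One visitor reduces a tree to a list of errors, the other executes it.

```python
import operator


class Num:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        return visitor.visit_num(self)


class Sym:
    def __init__(self, name):
        self.name = name

    def accept(self, visitor):
        return visitor.visit_sym(self)


class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def accept(self, visitor):
        return visitor.visit_binop(self)


class UndeclaredChecker:
    """Visitor that reduces an expression to a list of undeclared symbols."""

    def __init__(self, declared):
        self.declared = declared

    def visit_num(self, node):
        return []

    def visit_sym(self, node):
        return [] if node.name in self.declared else [node.name]

    def visit_binop(self, node):
        return node.left.accept(self) + node.right.accept(self)


class Evaluator:
    """Visitor that executes an expression against a variable environment."""

    OPS = {"+": operator.add, "*": operator.mul,
           "==": operator.eq, "<": operator.lt}

    def __init__(self, env):
        self.env = env

    def visit_num(self, node):
        return node.value

    def visit_sym(self, node):
        return self.env[node.name]

    def visit_binop(self, node):
        return self.OPS[node.op](node.left.accept(self), node.right.accept(self))


expr = BinOp("+", Sym("x"), Num(1))            # the expression [+ x 1]
print(expr.accept(UndeclaredChecker({"y"})))   # symbols used but not declared
print(expr.accept(Evaluator({"x": 4})))        # value of the expression
```

The node classes never change when a new pass is added; each new analysis is just another class with the same three visit methods.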
(*) Calling the parser an observer of the scanner is perhaps somewhat inaccurate. This Software Engineering SE post has a good summary of closely related design patterns. Could be that the scanner implements a strategy for dealing with the tokens.
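With that caveat in mind, the tokenizer-notifies-parser arrangement could be sketched as below. This is only one possible reading of the idea, in Python rather than Java for brevity; the class names, the on_token callback, and the nested-list tree representation are assumptions for the example.

```python
class Tokenizer:
    """Subject: fed one character at a time, notifies observers of each token."""

    def __init__(self):
        self._observers = []
        self._word = ""

    def subscribe(self, observer):
        self._observers.append(observer)

    def _emit(self, token):
        for observer in self._observers:
            observer.on_token(token)

    def _flush(self):
        if self._word:
            self._emit(self._word)
            self._word = ""

    def feed(self, ch):
        if ch in "[]":
            self._flush()          # a bracket ends any word in progress
            self._emit(ch)
        elif ch.isspace():
            self._flush()          # whitespace is dropped, it only separates
        else:
            self._word += ch

    def close(self):
        self._flush()              # end of input


class TreeBuilder:
    """Observer: builds nested lists from the token stream using a stack."""

    def __init__(self):
        self._stack = [[]]         # the bottom list collects top-level trees

    def on_token(self, token):
        if token == "[":
            self._stack.append([])             # enter a bracketed context
        elif token == "]":
            finished = self._stack.pop()       # leave it, attach to the parent
            self._stack[-1].append(finished)
        else:
            self._stack[-1].append(token)

    def result(self):
        return self._stack[0][0]


tokenizer = Tokenizer()
builder = TreeBuilder()
tokenizer.subscribe(builder)
for ch in "[= x [+ x 1]]":
    tokenizer.feed(ch)
tokenizer.close()
print(builder.result())
```

Because the tokenizer only knows about the on_token callback, a second observer (for example a token logger) could be subscribed without changing either class.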

Isn't an LR(0) parser using lookaheads as well?

An LL(1)-parser needs a lookahead-symbol for being able to decide which production to use. This is the reason why I always thought the term "lookahead" is used, when a parser looks at the next input token without "consuming" it (i.e. it can still be read from the input by the next action). LR(0) parsers, however, made me doubt that this is correct:
Every example of LR(0)-parsers that I've seen also uses the next input token for deciding whether to shift or to reduce.
In case of reduction the input token is not consumed.
I used the freeware tool "ParsingEmu" for generating an LR table and performing an LR evaluation below for the word "aab". As you can see, the column heads contain tokens. From the evaluation you can see that the parser is deciding which column to use by looking at the next input token. But when the parser reduces in steps 4 - 6 the input doesn't change (although the parser needs to know the next input token "$" when performing a transition to the next state).
Grammar:
S -> A
A -> aA
A -> b
Table:
Evaluation:
Now I made following assumptions for the reason of my confusion:
My assumption for the definition of "lookahead" (lookahead = input token not being consumed) is wrong. Lookahead just means two different things for either LL-parsers or LR-parsers. If so, how can "lookahead" be defined then?
LR-parsers have (from the theoretical point of view when you would use push-down automaton) additional internal states where they consume the input token by putting it on the stack and therefore are able to make the shift- reduce- decision by just looking on the stack.
The evaluation shown above is LR(1). If true, what would an LR(0) evaluation look like?
Now what is correct, 1, 2 or 3 or something completely different?
It's important to be precise:
An LR(k) parser uses the current parser state and k lookahead symbols to decide whether to reduce, and if so, by which production.
It also uses a shift transition table to decide which parsing state it should move to after shifting the next input token. The shift transition table is keyed by the current state and the (single) token being shifted, regardless of the value of k.
If in a given parser state, it would be possible to produce both a shift and a reduce action, then the parser has a shift/reduce conflict, and it is invalid. Consequently, the above two determinations could in theory be done nondeterministically.
If in a given parser state, no reduce is possible and the next input symbol cannot be shifted (that is, there is no transition for that state with that input symbol), then the parse has failed and the algorithm terminates.
If, on the other hand, the shift transition leads to the designated Accept state, then the parse succeeds and the algorithm terminates.
What all that means is that the lookahead is used to predict which, if any, reduction should be applied. In an LR(0) parser, the decision to shift (more accurately, to attempt to shift) must be made before reading the next input token, but the computation of the state to transition to is made after reading the token, at which point it will signal an error if no shift is possible.
LL(k) parsers must predict which production to replace a non-terminal with as soon as they see the non-terminal. The basic LL algorithm starts with a stack containing [S, $] (top to bottom) and does whichever of the following is applicable until done:
If the top of the stack is a non-terminal, replace the top of the stack with one of the productions for that non-terminal, using the next k input symbols to decide which one (without moving the input cursor), and continue.
If the top of the stack is a terminal, read the next input token. If it is the same terminal, pop the stack and continue. Otherwise, the parse has failed and the algorithm finishes.
If the stack is empty, the parse has succeeded and the algorithm finishes. (We assume that there is a unique EOF-marker $ at the end of the input.)
In both cases, lookahead has the same meaning: it consists of looking at input tokens without moving the input cursor.
If k is 0, then:
An LR(k) parser must decide whether or not to reduce without examining input, which means that no state can have either two different reduce actions or a reduce and a shift action.
An LL(k) parser must decide which production of a given non-terminal is applicable without examining input. In practice, this means that each non-terminal can have only one production, which means the language must be finite.
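The basic LL algorithm described above can be sketched for k = 1 as follows, using the grammar from the question (S -> A, A -> aA, A -> b). The prediction table is written out by hand and the function name is hypothetical.

```python
# Hand-written LL(1) prediction table for  S -> A,  A -> a A | b
# (non-terminal, lookahead) -> right-hand side to push. Illustration only.
TABLE = {
    ("S", "a"): ["A"],
    ("S", "b"): ["A"],
    ("A", "a"): ["a", "A"],
    ("A", "b"): ["b"],
}
NONTERMINALS = {"S", "A"}


def ll1_parse(tokens, start="S"):
    """Return True if the token sequence is derivable from the start symbol."""
    stream = list(tokens) + ["$"]      # unique end-of-input marker
    pos = 0
    stack = ["$", start]               # top of the stack is the end of the list
    while stack:
        top = stack.pop()
        lookahead = stream[pos]        # looked at, but the cursor does not move
        if top in NONTERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                return False           # no production predicted: parse fails
            stack.extend(reversed(rhs))
        elif top == lookahead:
            pos += 1                   # terminal matched: consume the token
        else:
            return False
    return pos == len(stream)


print(ll1_parse("aab"))
print(ll1_parse("aa"))
```

The only place the input cursor moves is the terminal-match branch; the prediction branch reads the lookahead without consuming it, which is the sense of "lookahead" that both LL and LR parsers share.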

LALR parsers and look-ahead

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really happen when building an LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.
The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue                   remaining input              action
----------------------  ---------------------------  --------
                        ELEMENT , ELEMENT , ELEMENT  SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT                 , ELEMENT , ELEMENT          REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:
List                    , ELEMENT , ELEMENT          SHIFT
List ,                  ELEMENT , ELEMENT            SHIFT
List , ELEMENT          , ELEMENT                    REDUCE 3
List                    , ELEMENT                    SHIFT
List ,                  ELEMENT                      SHIFT
List , ELEMENT          --                           REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List                    --                           REDUCE 1
Start                   --                           ACCEPT
and the parse is successful.
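The trace above can be reproduced with a small hand-written recogniser. This is not how a generated LALR parser works internally (it has no state table and compares the queue against the right-hand sides directly), but under that simplifying assumption it shows where the lookahead enters the decision: only in the choice of whether to reduce by production 1. The function name is hypothetical.

```python
def parse_list(tokens):
    """Return the sequence of actions taken for the comma-separated list grammar."""
    queue, rest, actions = [], list(tokens), []
    while True:
        lookahead = rest[0] if rest else None     # None means end of input
        if queue[-3:] == ["List", ",", "ELEMENT"]:
            queue[-3:] = ["List"]                 # 3. List -> List ',' ELEMENT
            actions.append("REDUCE 3")
        elif queue == ["ELEMENT"]:
            queue[:] = ["List"]                   # 2. List -> ELEMENT
            actions.append("REDUCE 2")
        elif queue == ["List"] and lookahead is None:
            queue[:] = ["Start"]                  # 1. Start -> List, only at the end
            actions.append("REDUCE 1")
        elif queue == ["Start"] and lookahead is None:
            actions.append("ACCEPT")
            return actions
        elif rest and rest[0] in ("ELEMENT", ","):
            queue.append(rest.pop(0))             # no reduction applies: shift
            actions.append("SHIFT")
        else:
            raise SyntaxError(f"cannot continue with queue {queue}")


for action in parse_list(["ELEMENT", ",", "ELEMENT", ",", "ELEMENT"]):
    print(action)
```

When the queue is just List and a comma is waiting in the input, the third branch is skipped and the parser shifts instead, which is the "decides not to" step in the trace.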
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ,, because , is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
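To make the dependency between the two sets concrete, here is a minimal fixed-point computation of FIRST and FOLLOW for the list grammar. It assumes a grammar without empty right-hand sides, which keeps the sketch short but is not true of grammars in general; the function name and the dictionary representation are hypothetical.

```python
def first_follow(grammar, start, end="$"):
    """Compute FIRST and FOLLOW sets; assumes no production has an empty RHS."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(end)

    def first_of(symbol):
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                      # FIRST: iterate until nothing is added
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of(rhs[0])
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True

    changed = True
    while changed:                      # FOLLOW needs the finished FIRST sets
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue        # only non-terminals have FOLLOW sets
                    if i + 1 < len(rhs):
                        new = first_of(rhs[i + 1])
                    else:
                        new = follow[lhs]   # at the end: inherit from the LHS
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return first, follow


grammar = {
    "Start": [["List"]],
    "List": [["ELEMENT"], ["List", ",", "ELEMENT"]],
}
first, follow = first_follow(grammar, "Start")
print(follow["Start"])
print(follow["List"])
```

For this grammar the comma ends up in FOLLOW(List) but not in FOLLOW(Start), which is exactly the fact used above to rule out reducing Start -> List when the lookahead is a comma.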
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which led to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.

Resources