Building parse trees with shift-reduce parsing

I'm experimenting with parsing in my free time, and I wanted to implement a shift-reduce parser for a very simple grammar. I've read many online articles but I'm still confused about how to create parse trees. Here's an example of what I want to do:
Grammar:
Expr -> Expr TKN_Op Expr
Expr -> TKN_Num
Here's an example input:
1 + 1 + 1 + 1
That, after tokenization, becomes:
TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
I understand that:
Shifting means pushing the next input token onto the stack and removing it from the input
Reducing means replacing one or more elements on top of the stack with the nonterminal on the left-hand side of a matching rule
So, basically, this should happen:
Step 1:
Stack:
Input: TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
What: Stack is empty. Shift.
Step 2:
Stack: TKN_Num
Input: TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
What: TKN_Num can be reduced to Expr. Reduce.
Step 3:
Stack: Expr
Input: TKN_Op TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
What: Cannot reduce. Shift.
Step 4:
Stack: Expr TKN_Op
Input: TKN_Num TKN_Op TKN_Num TKN_Op TKN_Num
What: Cannot reduce. Shift.
Step 5:
Stack: Expr TKN_Op TKN_Num
Input: TKN_Op TKN_Num TKN_Op TKN_Num
What: TKN_Num can be reduced to Expr. Reduce.
// What should I check for reduction?
// Should I try to reduce incrementally using
// only the top of the stack first,
// then adding more stack elements if I couldn't
// reduce the top alone?
Step 6:
Stack: Expr TKN_Op Expr
Input: TKN_Op TKN_Num TKN_Op TKN_Num
What: Expr TKN_Op Expr can be reduced to Expr. Reduce.
Step 7:
Stack: Expr
Input: TKN_Op TKN_Num TKN_Op TKN_Num
What: ...
// And so on...
Apart from the "what to reduce?" question, I have no clue how to correctly build a parse tree. The tree should probably look like this:
1 + o
|
1 + o
|
1 + 1
Should I create a new node on reduction?
And when should I add children to the newly created node / when should I create a new root node?

The simple and obvious thing to do is to create a tree node on every reduction, and add the tree nodes from the grammar elements that were reduced to that tree node.
This is easily managed with a node stack that runs in parallel to the "shift token" stack that the raw parser uses. For every reduction for a rule of length N, the shift-token stack is shortened by N, and a nonterminal token is pushed onto the shift stack. At the same time, shorten the node stack by removing the top N nodes, create a node for the nonterminal, attach the removed N nodes as children, and push that node onto the node stack.
This policy even works with rules that have a zero-length right-hand side: create a tree node and attach the empty set of children to it (e.g., create a leaf node).
If you think of a "shift" on a terminal node as a reduction (of the characters forming the terminal) to the terminal node, then terminal node shifts fit right in. Create a node for the terminal, and push it onto the stack.
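Here's a minimal sketch of that parallel node stack in Python; the token and node shapes are invented for illustration (leaves are (kind, text), interior nodes are (name, [children])), not taken from any particular tool:

# Node shape assumed for illustration: (name, [children]) for nonterminals,
# (kind, text) for terminal leaves.

def shift(symbol_stack, node_stack, token_kind, token_text):
    symbol_stack.append(token_kind)
    node_stack.append((token_kind, token_text))     # one leaf node per shifted terminal

def reduce(symbol_stack, node_stack, lhs, rhs_length):
    del symbol_stack[len(symbol_stack) - rhs_length:]
    children = node_stack[len(node_stack) - rhs_length:]
    del node_stack[len(node_stack) - rhs_length:]
    symbol_stack.append(lhs)
    node_stack.append((lhs, children))              # the new node adopts the popped nodes

# Following the steps in the question for "1 + 1":
symbols, nodes = [], []
shift(symbols, nodes, 'TKN_Num', '1')
reduce(symbols, nodes, 'Expr', 1)          # Expr -> TKN_Num
shift(symbols, nodes, 'TKN_Op', '+')
shift(symbols, nodes, 'TKN_Num', '1')
reduce(symbols, nodes, 'Expr', 1)          # Expr -> TKN_Num
reduce(symbols, nodes, 'Expr', 3)          # Expr -> Expr TKN_Op Expr
print(nodes)   # one Expr node whose children are [Expr, TKN_Op, Expr]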
If you do this, you get a "concrete syntax/parse tree" that matches the grammar isomorphically. (We do this for a commercial tool I offer). There are lots of folks that don't like such concrete trees, because they contain nodes for keywords, etc., which don't really add much value. True, but such trees are supremely easy to construct, and supremely easy to understand because the grammar is the tree structure. When you have 2500 rules (as we do for a full COBOL parser), this matters. This is also convenient because all the mechanism can be built completely into the parsing infrastructure. The grammar engineer simply writes rules, the parser runs, voila, a tree. It is also easy to change the grammar: just change it, voila, you still get parse trees.
However, if you don't want a concrete tree, e.g., you want an "abstract syntax tree", then what you have to do is let the grammar engineer control which reductions generate nodes, usually by adding some procedural attachment (code) to each grammar rule to be executed on a reduction step. Then if any such procedural attachment produces a node, it is retained on a node stack. Any procedural attachment which produces a node must attach the nodes produced by the right-hand-side elements. This is what YACC/Bison/... and most of the shift-reduce parser engines do. Go read about Yacc or Bison and examine a grammar. This scheme gives you a lot of control, at the price of insisting that you take that control. (For what we do, we don't want this much engineering effort in building a grammar).
In the case of producing CSTs, it is conceptually straightforward to remove "useless" nodes from trees; we do that in our tool. The result is a lot like an AST, without the manual effort to write all those procedural attachments.
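A rough sketch of that kind of cleanup, using the same invented node shape as the sketch above; which token kinds count as "useless" is an assumption here, and a real tool would decide that per grammar:

PUNCTUATION = {'TKN_LParen', 'TKN_RParen', 'TKN_Semi'}   # assumed droppable token kinds

def prune(node):
    # Drop punctuation leaves and collapse chains of single-child nonterminals,
    # turning a concrete parse tree into something closer to an AST.
    name, payload = node
    if isinstance(payload, str):                 # terminal leaf: (kind, text)
        return None if name in PUNCTUATION else node
    children = [c for c in (prune(child) for child in payload) if c is not None]
    if len(children) == 1:                       # e.g. an Expr -> Term -> Factor chain
        return children[0]
    return (name, children)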

The reason for your trouble is that you have a shift/reduce conflict in your grammar:
expr: expr OP expr
| number
You can resolve this in 2 ways:
expr: expr OP number
| number
for left associative operators, or
expr: number OP expr
| number
for right associative ones. This should also determine the shape of your tree.
Reduction is usually done when one clause is detected as complete. In the right-associative case, you would start in a state 1 that expects a number, pushes it onto the value stack, and shifts to state 2. In state 2, if the token is not an OP, you can reduce the number to an expr. Otherwise, push the operator and shift to state 1. Once state 1 is complete, you can reduce the number, operator and expression to another expression. Note, you need a mechanism to "return" after a reduction. The overall parser would then start in state 0, say, which immediately goes to state 1 and accepts after the reduction.
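Here is a small recursive sketch of the right-associative case just described, with recursion standing in for the "return after a reduction" mechanism; the token format and names are invented for the example:

def parse_expr(tokens, i=0):
    # expr: number OP expr | number  (right associative)
    # tokens is assumed to look like [('NUM', 1), ('OP', '+'), ('NUM', 2), ...]
    # Returns (tree, next_index).
    kind, value = tokens[i]                                  # "state 1": expect a number
    assert kind == 'NUM'
    if i + 1 < len(tokens) and tokens[i + 1][0] == 'OP':     # "state 2": look at the next token
        op = tokens[i + 1][1]
        right, j = parse_expr(tokens, i + 2)                 # parse the rest, then...
        return (('num', value), op, right), j                # ...reduce number OP expr -> expr
    return ('num', value), i + 1                             # reduce number -> expr

tree, _ = parse_expr([('NUM', 1), ('OP', '+'), ('NUM', 2), ('OP', '+'), ('NUM', 3)])
print(tree)   # (('num', 1), '+', (('num', 2), '+', ('num', 3))) : right-leaning, as expected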
Note that tools like yacc or bison make this kind of stuff much easier because they bring all the low level machinery and the stacks.

Related

Differences in the two different parsing methodologies

I am quite new to Antlr4, and am interested in simplifying a large grammar that I have. There are two possible approaches that I may take. The first one is more of the "in-line" approach, which often makes things simpler as I run into fewer left-recursion errors (and everything is 'there in one place', so it's a bit easier to track). Here is an example parse with it:
// inline
// input: CASE WHEN 1 THEN 1 WHEN 1 THEN 1 ELSE 1 END
grammar DBParser;
statement: expr EOF;
expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END' # caseExpressionInline
| ATOM # constantExpression
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
The second approach is to break everything out into separate small components, for example:
// separated
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' case_arg? when_clause_list case_default? 'END'
;
when_clause_list
: when_clause+
;
when_clause
: 'WHEN' expr 'THEN' expr
;
case_default
: 'ELSE' expr
;
case_arg
: expr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
A couple high-level things I've noticed are:
The inline version comes out to about half the size of the output code files of the second approach.
The in-line version is about 20% faster! I tried both parsers on a 10MB file of repetitions of the same statement, and the inline version never took over 1s while the separated version took around 1.2s and never went under 1.1s (I ran each about 20x). It seems crazy to me that just in-lining the expressions would have such a large performance difference! That was certainly not my goal, but I noticed it as I was comparing the two approaches while writing up this question.
What would be the preferable way to split things up then?
(With the caveat that SO frowns upon opinion-based questions…)
I think that you should let the programming experience of coding against the generated classes be the most important guidance on how much to break things out.
I find the first style much easier to deal with, and you’ve already pointed out performance benefits.
Like most guidance, it can be taken too far in either direction. The second is pretty extreme in the “breaking things out” category and, in my opinion, it’s going to be a pain to deal with the deeply nested structure. I think it appeals to our desire to “keep functions small” in general purpose languages, but you’ll spend a lot more time writing the rest of your code than you will getting the parser right.
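As a toy illustration of that nesting cost, here are hand-built tuples standing in for the two parse-tree shapes (these are not ANTLR's generated context objects, just nested data showing the depth difference): reaching the first WHEN clause is one step down in the inline shape and a chain of steps in the separated one.

# Rough stand-ins for the two tree shapes of
# CASE WHEN 1 THEN 1 WHEN 1 THEN 1 ELSE 1 END (hand-built, not ANTLR output).
inline = ('caseExpressionInline',
          ['CASE', ('when', '1', '1'), ('when', '1', '1'), ('else', '1'), 'END'])

separated = ('caseExpression',
             ('case_expr',
              ['CASE',
               ('when_clause_list', [('when_clause', '1', '1'), ('when_clause', '1', '1')]),
               ('case_default', '1'),
               'END']))

print(inline[1][1])              # first WHEN clause, directly under the expression node
print(separated[1][1][1][1][0])  # expr -> case_expr -> children -> when_clause_list -> clauses -> first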
Your first example comes close to going to the other extreme. I don’t think it quite crosses the line, but if it were to start getting out of hand, I might opt for something like:
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END'
;
I’ve also found that I go back and rework the grammar sometimes as I’m fleshing out the language and find I want a slightly different structure. For example, I found that, when doing code completion, it was sometimes helpful to have different parser rules for different contexts that just required an ID token. The code completion could then tell me that I could use an assignment_dest instead of an `ID`, and I could do much better completion, so that’s a case where an extra level proved useful once I started using the grammar in practice.
To me, refining the grammar to give me the most useful ParseTree is the first priority.

How to visualize every step of constructing AST

I am writing an expression parser and a visualization of it, which means every step of the recursive descent parsing or of the AST construction will be visualized, like a tiny version of VisuAlgo.
// Expression grammar
Goal -> Expr
Expr -> Expr + Term
| Expr - Term
| Term
Term -> Term * Factor
| Term / Factor
| Factor
Factor -> (Expr)
| num
| name
So I am wondering what data structure can easily be used for storing every step of constructing the AST, and how to implement the visualization of every step. Well, I have searched some similar questions and implemented a recursive descent parser before, but just can't figure this out. I would appreciate it if anyone can help me.
This SO answer shows how to parse and build a tree as you parse
You could easily just print the tree or tree fragments at each point where they are created.
To print the tree, just walk it recursively and print the nodes with indentation equal to depth of recursion.
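A small sketch of such a printer, assuming a node shape of ('name', [children]) for interior nodes and plain strings for leaves (that shape is an assumption for the example, not a requirement):

def print_tree(node, depth=0):
    # Leaves are plain strings; interior nodes are ('name', [children]).
    if isinstance(node, str):
        print('  ' * depth + node)
        return
    name, children = node
    print('  ' * depth + name)
    for child in children:
        print_tree(child, depth + 1)

# Call this (or snapshot its output) every time the parser creates a node.
# For 1 + 2 * 3 under the grammar above, the finished tree would be:
tree = ('Expr', [('Expr', [('Term', [('Factor', ['1'])])]),
                 '+',
                 ('Term', [('Term', [('Factor', ['2'])]), '*', ('Factor', ['3'])])])
print_tree(tree)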

Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL, which has some of the problems C++ parsers have to face. The context-free grammar of VHDL produces a parse forest rather than a single parse tree because of its ambiguity regarding function calls and array subscripts:
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguously if the parser carries around a hierarchical, type-aware symbol table which it would use to resolve the ambiguities somewhat on the fly. I don't want to do this, because for statements like the one above, the parse forest would not grow exponentially, but rather linearly with the number of function calls and array subscripts.
(except, of course, if one were to torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to create just one single parse tree, I used %merge attributes to collect all subtrees recursively and added those subtrees under so-called AMBIG nodes in the singleton AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file is collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bug report on the official GNU mailing list for bison, without a reply to this point though. Unfortunately, due to the restrictions for new Stack Overflow users, I can't provide a link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really a directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.
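As a sketch of what taking advantage of that sharing can look like: walk the parse DAG keeping a set of already-visited node identities, so each shared subtree (like your EXPR1) is processed exactly once. The node shapes and names below are invented for the example.

def walk(node, seen=None):
    # Walk an AST that may share subtrees (a DAG), doing the per-node work
    # exactly once; repeat visits of a shared node are skipped via its identity.
    if seen is None:
        seen = set()
    if id(node) in seen:              # already handled this physical node (it's shared)
        return
    seen.add(id(node))
    kind, children = node
    print('visiting', kind)           # stand-in for real per-node work
    for child in children:
        if isinstance(child, tuple):
            walk(child, seen)

# Shared subtree, roughly the shape of EXPR1 being referenced by both FCALL1 and ARRIDX1:
expr1 = ('EXPR', ['N'])
walk(('AMBIG', [('FCALL', [expr1]), ('ARRIDX', [expr1])]))
# prints AMBIG, FCALL, EXPR, ARRIDX : EXPR only once, despite two parents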

Left recursion parsing

Description:
While reading Compiler Design in C book I came across the following rules to describe a context-free grammar:
a grammar that recognizes a list of one or more statements, each of
which is an arithmetic expression followed by a semicolon. Statements are made up of a
series of semicolon-delimited expressions, each comprising a series of numbers
separated either by asterisks (for multiplication) or plus signs (for addition).
And here is the grammar:
1. statements ::= expression;
2. | expression; statements
3. expression ::= expression + term
4. | term
5. term ::= term * factor
6. | factor
7. factor ::= number
8. | (expression)
The book states that this recursive grammar has a major problem: the non-terminal on the left-hand side of several productions also appears at the start of the right-hand side, as in Production 3 (this property is called left recursion), and certain parsers, such as recursive-descent parsers, can't handle left-recursive productions. They just loop forever.
You can understand the problem by considering how the parser decides to apply a particular production when it is replacing a non-terminal that has more than one right-hand side. The simple case is evident in Productions 7 and 8. The parser can choose which production to apply when it's expanding a factor by looking at the next input symbol. If this symbol is a number, then the compiler applies Production 7 and replaces the factor with a number. If the next input symbol was an open parenthesis, the parser would use Production 8. The choice between Productions 5 and 6 cannot be solved in this way, however. In the case of Production 6, the right-hand side of term starts with a factor which, in turn, starts with either a number or a left parenthesis. Consequently, the parser would like to apply Production 6 when a term is being replaced and the next input symbol is a number or left parenthesis. Production 5, the other right-hand side, starts with a term, which can start with a factor, which can start with a number or left parenthesis, and these are the same symbols that were used to choose Production 6.
Question:
That second quote from the book got me completely lost. So, using an example statement such as 5 + (7*4) + 14:
What's the difference between factor and term? using the same example
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
What's the difference between factor and term? using the same example
I am not using the same example, as it won't give you a clear picture of what you are unsure about.
Given,
term ::= term * factor | factor
factor ::= number | (expression)
Now, suppose I ask you to find the factors and terms in the expression 2*3*4.
Multiplication being left associative, it will be evaluated as:
(2*3)*4
As you can see, here (2*3) is the term and the factor is 4 (a number). Similarly, you can extend this approach up to any level to get the idea of a term.
As per the given grammar, if there is a multiplication chain in the given expression, then its sub-part, leaving out a single factor, is a term, which in turn yields another sub-part (another term) leaving out another single factor, and so on. This is how such expressions are evaluated.
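A step-by-step derivation of 2*3*4 under that grammar makes the split visible; the parentheses are added only to show the grouping:
term
=> term * factor               (term ::= term * factor; this factor will be 4)
=> (term * factor) * factor    (expand the inner term the same way; its factor will be 3)
=> (factor * factor) * factor  (term ::= factor)
=> (2 * 3) * 4                 (factor ::= number)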
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
Your second statement is quite clear in its essence. A recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the productions of the grammar.
It is said so because it's clear that a recursive descent parser will go into an infinite loop if the non-terminal keeps expanding into itself.
Similarly, for a recursive descent parser, even with backtracking: when we try to expand a non-terminal, we may eventually find ourselves again trying to expand the same non-terminal without having consumed any input.
A -> Ab
Here, while expanding, the non-terminal A keeps expanding into itself:
A -> Ab -> Abb -> Abbb -> ... an infinite regress on A.
Hence, we prevent left-recursive productions while working with recursive-descent parsers.
The rule term matches the string "1*3", the rule factor does not (though it would match "(1*3)"). In essence each rule represents one level of precedence: expression contains the operators with the lowest precedence, term the second lowest, and factor the highest. If you're in factor and you want to use an operator with lower precedence, you need to add parentheses.
If you implement a recursive descent parser using recursive functions, a rule like a ::= b "*" c | d might be implemented like this:
// Takes the entire input string and the index at which we currently are
// Returns the index after the rule was matched or throws an exception
// if the rule failed
parse_a(input, index) {
try {
after_b = parse_b(input, index)
after_star = parse_string("*", input, after_b)
after_c = parse_c(input, after_star)
return after_c
} catch(ParseFailure) {
// If one of the rules b, "*" or c did not match, try d instead
return parse_d(input, index)
}
}
Something like this would work fine (in practice you might not actually want to use recursive functions, but the approach you'd use instead would still behave similarly). Now, let's consider the left-recursive rule a ::= a "*" b | c instead:
parse_a(input, index) {
try {
after_a = parse_a(input, index)
after_star = parse_string("*", input, after_a)
after_b = parse_b(input, after_star)
return after_b
} catch(ParseFailure) {
// If one of the rules a, "*" or b did not match, try c instead
return parse_c(input, index)
}
}
Now the first thing that the function parse_a does is to call itself again at the same index. This recursive call will again call itself. And this will continue ad infinitum, or rather until the stack overflows and the whole program comes crashing down. If we use a more efficient approach instead of recursive functions, we'll actually get an infinite loop rather than a stack overflow. Either way we don't get the result we want.
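For completeness, the usual fix is to rewrite the left-recursive rule a ::= a "*" b | c as a ::= c ("*" b)* and implement the repetition with a loop, which accepts the same strings without the infinite recursion. A sketch in Python, reusing the hypothetical parse_b / parse_c / parse_string helpers and the ParseFailure exception from the answer above:

def parse_a(input, index):
    # a ::= c ("*" b)*  : same language as the left-recursive rule,
    # but the repetition is a loop instead of a recursive first call.
    index = parse_c(input, index)            # a must start with a c
    while True:
        try:
            after_star = parse_string("*", input, index)
        except ParseFailure:
            return index                      # no more "* b" repetitions
        index = parse_b(input, after_star)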

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack    | Input | Operation
---------+-------+----------------------------------------------
         | 1+2*3 |
NUMBER   | +2*3  | Shift
A        | +2*3  | Reduce (looking ahead, we know to stop at A)
A+       | 2*3   | Shift
A+NUMBER | *3    | Shift (looking ahead, we know to stop at M)
A+M      | *3    | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
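To make the "look one token ahead before reducing" idea concrete, here is a tiny hand-written sketch for exactly this grammar; a generated LR(1)/LALR(1) parser does the same thing with precomputed state tables rather than ad-hoc checks like these.

# Toy shift-reduce loop for the grammar in the question, with one token of
# lookahead deciding when to stop reducing.  Hand-written and grammar-specific,
# with no error handling: a sketch of the idea only.
def parse(tokens):
    stack, i = [], 0
    tokens = tokens + ['$']                       # '$' marks end of input
    while True:
        la = tokens[i]                            # lookahead token
        if stack[-1:] == ['NUMBER']:
            stack[-1:] = ['P']                                            # P: NUMBER
        elif stack[-3:] == ['(', 'S', ')']:
            stack[-3:] = ['P']                                            # P: '(' S ')'
        elif stack[-1:] == ['P'] and stack[-3:-1] not in (['M', '*'], ['M', '/']):
            stack[-1:] = ['M']                                            # M: P
        elif stack[-3:] in (['M', '*', 'P'], ['M', '/', 'P']):
            stack[-3:] = ['M']                                            # M: M '*' P | M '/' P
        elif (stack[-1:] == ['M'] and la not in ('*', '/')
              and stack[-3:-1] not in (['A', '+'], ['A', '-'])):
            stack[-1:] = ['A']                                            # A: M, unless '*'/'/' follows
        elif stack[-3:] in (['A', '+', 'M'], ['A', '-', 'M']) and la not in ('*', '/'):
            stack[-3:] = ['A']                                            # A: A '+' M | A '-' M
        elif stack[-1:] == ['A'] and la in ('$', ')'):
            stack[-1:] = ['S']                                            # S: A
        elif la == '$':
            return stack                          # nothing left to do
        else:
            stack.append(la)                      # shift
            i += 1

print(parse(['NUMBER', '+', 'NUMBER', '*', 'NUMBER']))   # ['S'] for 1+2*3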
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing; however, I know that Bison supports LR(1) parsing, which means it has a lookahead token. This means that tokens aren't pushed onto the stack immediately. Rather, they wait until no more reductions can happen. Then, if shifting the next token makes sense, it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that it's the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M, since A + M produces an A and the expression is complete.
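Spelled out as a trace for 1 + 2 under the grammar in the question ($ marks end of input):
Stack      | Lookahead | Action
-----------+-----------+------------------------------------------
           | NUMBER    | shift
NUMBER     | +         | reduce NUMBER -> P
P          | +         | reduce P -> M
M          | +         | reduce M -> A   (only A '+' M can use the '+')
A          | +         | shift
A +        | NUMBER    | shift
A + NUMBER | $         | reduce NUMBER -> P, then P -> M
A + M      | $         | reduce A '+' M -> A
A          | $         | reduce A -> S, accept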
