How to print shift or reduce grammar rule? (flex, bison)

I'm trying to implement a C grammar parser with lex & yacc and show the reducing procedure.
I have to print the token list on the left and the shift or reduce rule on the right.
Like:
2*4+4/2 //input
2      shift 2          2
2      reduce I -> F    2
2      reduce F -> T    2
2      reduce F -> T    2
*      shift *          2 *
4      shift 4          2 * 4
4      reduce I -> F    2 * 4
2 * 4  reduce T*F -> T  2 * 4
8      reduce T -> E    2 * 4
8      reduce T -> E    2 * 4
+      shift +          2 * 4 +
4      shift 4          2 * 4 + 4
4      reduce I -> F    2 * 4 + 4
4      reduce F -> T    2 * 4 + 4
4      reduce F -> T    2 * 4 + 4
/      shift /          2 * 4 + 4 /
2      shift 2          2 * 4 + 4 / 2
2      reduce I -> F    2 * 4 + 4 / 2
4 / 2  reduce T/F -> T  2 * 4 + 4 / 2
8 + 2  reduce E+T -> E  2 * 4 + 4 / 2
end of parsing : 2 * 4 + 4 / 2 = 10
I am not familiar with lex & yacc, and I have no idea how to print the procedure out.
Any help is welcome.

You can easily enough ask Bison to show you what it is doing. But it's not going to come out looking like your chart. You'll have to read through the trace and condense it into the desired format. But that's not too hard, and you will appreciate having learned how to do it the first time you have to debug a grammar.
I'm not going to explain here how to write a grammar, nor am I going to talk much about writing scanners. If you haven't done so already, I suggest you read through the simple examples in the bison manual, and then the chapter on semantic values. That will explain a lot of the background for the following.
Bison has some very useful tools for visualising the grammar and the parse. The first is the state/transition table produced when you give bison the --report=all command-line option.
You can use -v, which is what people will usually tell you to do. But I think --report=all is worthwhile for a novice because it comes closer to what you will have seen in class. The -v listing only shows the core items in each state, so it leaves out the items with the dot at the beginning. And it doesn't show you the lookaheads. Since it does show you all the action entries, including the GOTO actions, you can figure everything else out pretty easily. But, at least at the beginning, it's probably better to see all the details.
You can ask bison to draw the state machine. It produces the drawing in Graphviz ("Dot") syntax, so you need Graphviz installed to look at the drawing. And state machines for any non-trivial grammar don't fit on an A4 sheet, or a computer screen, so they're really only useful for toy grammars. Read the manual to see how to tell Bison to output the Graphviz diagram if you want to give it a try.
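For reference, the invocation looks something like this (file names here are placeholders; check your bison and Graphviz versions for the exact options):
$ bison --graph=grammar.gv -o grammar.tab.c grammar.y
$ dot -Tpdf -o grammar.pdf grammar.gv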
You'll probably want to refer to the state machine when you're trying to understand the traces.
You could write out parsing actions by just running the state machine by hand, using the actions which Bison shows you. But there's a lot to be said for reading the bison trace, and it's really not very difficult to produce. You just need to add one more command-line option when you invoke bison, and you need to add a few lines to your grammar source file. All of the information here, and a lot more, can be found in the bison manual chapter on grammar debugging.
The option is -t or --debug. That tells Bison to generate the additional code to produce the traces. However, it does not enable tracing; you still have to do that by setting the value of the global variable yydebug to 1 (or some other non-zero value). Unfortunately, the variable yydebug is not defined unless the --debug option is specified, so if you just add yydebug = 1; to your main(), your program will no longer compile unless you run bison with the --debug option. That's annoying, so it's worth adding a few more lines to your code. The simplest version, which can go just above your definition of main, is:
#if YYDEBUG
extern int yydebug;
#else
static int yydebug = 0;
#endif
That makes sure that yydebug is defined and usable in main regardless of whether you requested a debugging parser when you ran bison.
But that still doesn't enable traces. To do that, you need one more line (at least) which you can put right at the top of main:
yydebug = 1;
Or you could be a bit more sophisticated and make it possible to run the parser with or without traces, by checking the command-line arguments. A good way to parse command-line arguments is with getopt, but for a quick-and-dirty executable which only has one command-line argument, you could use the sample code below, which sets yydebug only if the executable is invoked with -d as its first command line argument.
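For completeness, a getopt version might look something like this (just a sketch; it assumes the same yydebug declarations shown above):
#include <unistd.h>
int main(int argc, char* argv[]) {
    int opt;
    /* -d enables parser traces */
    while ((opt = getopt(argc, argv, "d")) != -1)
        if (opt == 'd') yydebug = 1;
    return yyparse();
}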
This is probably pretty similar to the grammar you were given (or wrote), except that I used longer names for non-terminals.
/* FILE: simple_expr.y */
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%token NUMBER
%printer { fprintf(yyo, "%d", $$); } NUMBER
%%
expr : term
     | expr '+' term
     | expr '-' term
     ;
term : factor
     | term '*' factor
     | term '/' factor
     ;
factor: NUMBER
      | '(' expr ')'
      ;
%%
#if YYDEBUG
extern int yydebug;
#else
static int yydebug = 0;
#endif
int main(int argc, char* argv[]) {
if (argc > 1 && strcmp(argv[1], "-d") == 0) yydebug = 1;
return yyparse();
}
void yyerror(const char* msg) {
fprintf(stderr, "%s\n", msg);
}
We also need a lexical scanner. Here's a really simple one: (See the flex manual for any details you don't understand.)
/* FILE: simple_expr.l */
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "simple_expr.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ ;
[[:digit:]]+ { yylval = atoi(yytext); return NUMBER; }
. return yytext[0];
Compile (a Makefile would be useful here. Or whatever you use for building projects):
$ bison -o simple_expr.tab.c -d --debug --report=all simple_expr.y
$ flex -o simple_expr.lex.c simple_expr.l
$ gcc -Wall -o simple_expr simple_expr.tab.c simple_expr.lex.c
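If you'd rather not retype those three commands, a minimal Makefile along these lines should do (a sketch; adjust flags to taste, and remember that recipe lines must be indented with tabs):
CFLAGS = -Wall

simple_expr: simple_expr.tab.o simple_expr.lex.o
	$(CC) -o simple_expr simple_expr.tab.o simple_expr.lex.o

simple_expr.tab.c simple_expr.tab.h: simple_expr.y
	bison -o simple_expr.tab.c -d --debug --report=all simple_expr.y

simple_expr.lex.c: simple_expr.l
	flex -o simple_expr.lex.c simple_expr.l

simple_expr.tab.o: simple_expr.tab.c simple_expr.tab.h
simple_expr.lex.o: simple_expr.lex.c simple_expr.tab.h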
You should take a look at simple_expr.output at this point. There you will find the bison state machine report.
Now we run the program with traces enabled. (<<< is what Bash calls a "here-string". It takes a single argument and provides it to the executable as its standard input. This is really handy for debugging parsers.)
The trace is quite long, because, as I said, Bison makes no attempt to compress the information. Here's how it starts:
$ ./simple_expr -d <<< '2 * 3 + 12 / 4'
Starting parse
Entering state 0
Reading a token: Next token is token NUMBER (2)
Shifting token NUMBER (2)
Entering state 1
Reducing stack by rule 7 (line 22):
$1 = token NUMBER (2)
-> $$ = nterm factor ()
Stack now 0
Entering state 5
Reducing stack by rule 4 (line 19):
$1 = nterm factor ()
-> $$ = nterm term ()
Stack now 0
Entering state 4
So, it first shifts the token 2 (which is a NUMBER). (Note: I snuck a %printer declaration into the grammar file so that bison can print out the semantic value of NUMBER tokens. If I hadn't done that, it would just have told me that it read a NUMBER, leaving me to guess which NUMBER it read. So the %printer declarations are really handy. But you need to read the manual to see how to use them properly.)
The shift action causes it to go to state 1. Bison does immediate reductions when the default reduction doesn't depend on lookahead, so the parser now immediately reduces the stack using the rule factor: NUMBER. (You need either the state machine or the code listing with line numbers to see what "rule 7" is. That's one of the reasons we produced the report.)
After the reduction, the stack contains only state 0, which is the state consulted for the GOTO action (on the non-terminal factor, which was just reduced). That action takes us to state 5. Again, an immediate reduction is possible, using rule 4 (term: factor). After the reduction, the stack has again been reduced to just the start state, and the GOTO action takes us to state 4. At this point, another token is actually necessary. You can read the rest of the trace below; hopefully, you can see what's going on.
Reading a token: Next token is token '*' ()
Shifting token '*' ()
Entering state 10
Reading a token: Next token is token NUMBER (3)
Shifting token NUMBER (3)
Entering state 1
Reducing stack by rule 7 (line 22):
$1 = token NUMBER (3)
-> $$ = nterm factor ()
Stack now 0 4 10
Entering state 15
Reducing stack by rule 5 (line 20):
$1 = nterm term ()
$2 = token '*' ()
$3 = nterm factor ()
-> $$ = nterm term ()
Stack now 0
Entering state 4
Reading a token: Next token is token '+' ()
Reducing stack by rule 1 (line 16):
$1 = nterm term ()
-> $$ = nterm expr ()
Stack now 0
Entering state 3
Next token is token '+' ()
Shifting token '+' ()
Entering state 8
Reading a token: Next token is token NUMBER (12)
Shifting token NUMBER (12)
Entering state 1
Reducing stack by rule 7 (line 22):
$1 = token NUMBER (12)
-> $$ = nterm factor ()
Stack now 0 3 8
Entering state 5
Reducing stack by rule 4 (line 19):
$1 = nterm factor ()
-> $$ = nterm term ()
Stack now 0 3 8
Entering state 13
Reading a token: Next token is token '/' ()
Shifting token '/' ()
Entering state 11
Reading a token: Next token is token NUMBER (4)
Shifting token NUMBER (4)
Entering state 1
Reducing stack by rule 7 (line 22):
$1 = token NUMBER (4)
-> $$ = nterm factor ()
Stack now 0 3 8 13 11
Entering state 16
Reducing stack by rule 6 (line 21):
$1 = nterm term ()
$2 = token '/' ()
$3 = nterm factor ()
-> $$ = nterm term ()
Stack now 0 3 8
Entering state 13
Reading a token: Now at end of input.
Reducing stack by rule 2 (line 17):
$1 = nterm expr ()
$2 = token '+' ()
$3 = nterm term ()
-> $$ = nterm expr ()
Stack now 0
Entering state 3
Now at end of input.
Shifting token $end ()
Entering state 7
Stack now 0 3 7
Cleanup: popping token $end ()
Cleanup: popping nterm expr ()
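Once you can read the trace, producing the two-column chart from the question is mostly a matter of echoing the same events yourself. One low-tech sketch (my own formatting, not part of the example above): give each rule an action that announces its reduction, using the question's notation (I for integer literal, F for factor, T for term, E for expr), and print shifts from the scanner just before each return. Note that the scanner line prints when a token is read, which is slightly earlier than when it is actually shifted; the bison trace remains the authoritative record.
expr : term            { fprintf(stderr, "reduce T -> E\n"); }
     | expr '+' term   { fprintf(stderr, "reduce E+T -> E\n"); }
     | expr '-' term   { fprintf(stderr, "reduce E-T -> E\n"); }
     ;
term : factor          { fprintf(stderr, "reduce F -> T\n"); }
     | term '*' factor { fprintf(stderr, "reduce T*F -> T\n"); }
     | term '/' factor { fprintf(stderr, "reduce T/F -> T\n"); }
     ;
factor: NUMBER         { fprintf(stderr, "reduce I -> F\n"); }
      | '(' expr ')'   { fprintf(stderr, "reduce (E) -> F\n"); }
      ;
and in the scanner:
[[:digit:]]+ { yylval = atoi(yytext); fprintf(stderr, "shift %s\n", yytext); return NUMBER; }
.            { fprintf(stderr, "shift %s\n", yytext); return yytext[0]; }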

What decides which production the parser tries?

I am trying to build a parser for a desk calculator and am using the following bison code for it.
%union{
float f;
char c;
// int
}
%token <f> NUM
%token <c> ID
%type <f> S E T F G
%%
C : S ';'
| C S ';'
;
S : ID '=' E {fprintf(debug,"13\n");printf("%c has been assigned the value %f.",$1,$3);symbolTable[$1]=$3;}
| E {fprintf(debug,"12\n");result = $$;}
;
E : E '+' T {fprintf(debug,"11\n");$$ = $1+$3;}
| E '-' T {fprintf(debug,"10\n");$$ = $1-$3;}
| T {fprintf(debug,"9\n");$$ = $1;}
;
T : T '*' F {fprintf(debug,"7\n");$$ = $1*$3;}
| T '/' F {fprintf(debug,"6\n");$$ = $1/$3;}
| F {fprintf(debug,"5\n");$$ = $1;}
;
F : G '#' F {fprintf(debug,"4\n");$$ = pow($1,$3);}
| G {fprintf(debug,"3\n");$$ = $1;}
;
G : '(' E ')' {fprintf(debug,"2\n");$$ = $2;}
| NUM {fprintf(debug,"1\n");$$ = $1;}
| ID {fprintf(debug,"0\n");$$ = symbolTable[$1];}
;
%%
My LEX rules are
digit [0-9]
num {digit}+
alpha [A-Za-z]
id {alpha}({alpha}|{digit})*
white [\ \t]
%%
let {printf("let");return LET;}
{num} {yylval.f = atoi(yytext);return NUM;}
{alpha} {yylval.c = yytext[0];return ID;}
[+\-\*/#\(\)] {return yytext[0];}
. {}
%%
The input I gave is a=2+3
When the lexer returns an ID(for 'a'), the parser is going for the production with fprintf(debug,"0\n"). But I want it to go for the production fprintf(debug,"13\n").
So, I am wondering what made my parser go for a reduction on production 0, instead of shifting = onto the stack, and how do I control it?
What you actually specified is a translation grammar, given by the following:
C → S ';' 14 | C S ';' 8
S → ID '=' E 13 | E 12
E → E '+' T 11 | E '-' T 10 | T 9
T → T '*' F 7 | T "/" F 6 | F 5
F → G '#' F 4 | G 3
G → '(' E ')' 2 | NUM 1 | ID 0
with top-level/start configuration C. (For completeness, I added in 8 and 14).
There is only one word generated from C by this translation grammar that contains ID '=' NUM '+' NUM as the subword of input tokens, and that is ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14, which is equal to the input-output pair (ID '=' NUM '+' NUM ';', 1 3 5 9 1 3 5 11 13 14). So the sequence 1 3 5 9 1 3 5 11 13 14 is the one and only translation. Provided the grammar is LALR(1), this translation will be produced as a result; and the grammar is LALR(1).
If you're not getting this result, then that can only mean that whatever you left out of your description (i.e. the lexer) was implemented incorrectly, or that your grammar processor has a bug, or your machine has a failure.
And, no; actually what you did is the better way to see what's going on - just stick in a single printf statement to the right hand side of each rule and run it that way to see what translation sequences are produced. The "trace" facility in the parser generator is superfluous for that very reason ... at least the way it is usually implemented (more on that below). In addition, you can get a direct view of everything with the -v option, which produces the LR(0) tables with LALR(1) annotations.
The kind of built-in testing facility that would actually be more helpful - especially for examples like this - is just what I described: one that echoes the inputs interleaved with the output actions. So, when you run it on "a = 2 + 3 ;", it would give you ID('a') '=' NUM('2') 1 3 5 9 '+' NUM('3') 1 3 5 11 13 ';' 14 with echo turned on, and just 1 3 5 9 1 3 5 11 13 14 with echo turned off. That would actually be more useful to have as a built-in capability, instead of the trace mode you normally see in implementations of yacc.
The POSIX specification actually leaves open the issue of how "YYDEBUG", "yydebug" and "-t" are to be implemented in a compliant implementation of yacc, to make room for alternative approaches like this.
Well, it turns out that the problem is that I am not identifying = as a token in my LEX code.
As silly as it sounds, it points out a very important concept of yacc/Bison. The question of whether to shift or reduce is answered by checking the next symbol, also called the lookahead. In this case, the lookahead was NUM (for 2) and not =, because of my faulty LEX code. Since there is no production involving ID followed by NUM, the parser goes for a reduction to G.
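For the record, the one-line fix implied by this (my reconstruction, not tested against the original project) is to add '=' to the punctuation rule in the LEX file; note that ';' has the same problem, since the grammar uses it in C : S ';':
[=;+\-*/#()] {return yytext[0];}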
And about how I figured it out: it turns out bison has a built-in trace feature. It lays out, like a diary entry, whatever it does while parsing; each and every step is written down.
To enable it:
Run bison with the -Dparse.trace option:
bison calc.y -d -Dparse.trace
In the main function of the parser, declare the extern yydebug and set it to a non-zero value:
int main(){
extern int yydebug;
yydebug = 1;
.
.
.
}

ANTLR parser StackOverflowException (for parsing regular expressions)

I made a simple grammar for parsing regular expressions. Unfortunately, when I try to test my regex compiler on large expressions I get a StackOverflowException. The problem is similar to this one except that their solution no longer works in my scenario. Here is my grammar:
union: concat | concat '|' union ;
concat: kleene_closure concat | kleene_closure;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;
Now the problem is that I have a really large file that looks like
something1 | something2 | something3 | .... | something1000
I use ANTLR's Visitor class for parsing. I know I could probably make some optimization by using +/* like this
union: (concat '|')* concat ;
concat: kleene_closure+;
kleene_closure: atomic '*' | atomic ;
atomic : '(' union ')' | LETTER ;
However, it doesn't really solve the problem, due to recursive nature of this grammar. For instance, it would now fail on the following sample that clearly requires recursion:
(...(((something1) | something2) | something3) | .... ) | something1000
How can I avoid the StackOverflowException? How do other compilers, like for instance a C compiler, deal with really large texts that have thousands of lines of code?
If you're going to use a recursive descent parser, then you will inevitably run into an input which exceeds the call stack depth. This problem is ameliorated by languages like Java which are capable of controlling their own stack depth, so that there is a controllable result like a StackOverflowException. But it's still a real problem.
Parser generators like Yacc/Bison and Java Cup use a bottom-up LALR(1) algorithm which uses an explicit stack for temporary storage, rather than using the call stack for that purpose. That means that the parsers have to manage storage for the parser stack (or use a container ADT from the host language's standard library, if there is one), which is slightly more complex. But you don't have to deal with that complexity; it's built in to the parser generator.
There are several advantages of the explicit stack for the parser generator:
It's easier to control maximum stack size;
The maximum stack size is (usually) only limited by available memory;
It's probably more memory efficient because control flow information doesn't need to be kept in stack frames.
Still, it's not a panacea. A sufficiently complicated expression will exceed any fixed stack size, and that can lead to certain programs being unparseable. Furthermore, if you take advantage of the flexibility mentioned in the second point above ("only limited by available memory"), you may well find that your compiler is terminated unceremoniously by an OOM process (or a segfault) rather than being able to respond to a more polite out-of-memory exception (depending on OS and configuration, of course).
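For what it's worth, in bison's default C skeleton both limits are plain macros that you can override in the prologue; the values below are illustrative (the documented defaults are 200 and 10000):
%{
/* let the parser stack start larger and grow further before
   yyparse() gives up with "memory exhausted" */
#define YYINITDEPTH 10000
#define YYMAXDEPTH  1000000
%}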
As to:
How do other compilers, like for instance C compiler deal with really large texts that have thousands lines of code?
Having thousands of lines of code is not a problem if you use a repetition operator in your grammar (or, in the case that you are using an LALR(1) parser, that your grammar is left-recursive). The problem arises, as you note in your question, when you have texts with thousands of nested blocks. And the answer is that many C compilers don't deal gracefully with such texts. Here's a simple experiment with gcc:
$ # A function which generates deeply-nested C programs
$ type deep
deep is a function
deep () {
n=$1;
printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
for ((i=0; i<n; ++i))
do
printf '%*s{ int a%d = a%d + 1;\n' $((i+1)) '' $((i+1)) $i;
done;
printf '%*sprintf("%%d\\n", a%d);\n' $n '' $n;
for ((i=0; i<n; ++i))
do
printf "%s" '}';
done;
printf "%s\n" '}'
}
$ deep 3
#include <stdio.h>
int main(void) {
int a0 = 0;
{ int a1 = a0 + 1;
{ int a2 = a1 + 1;
{ int a3 = a2 + 1;
printf("%d\n", a3);
}}}}
$ # For small depths, GCC is OK with that.
$ deep 3 | gcc -x c - && ./a.out
3
$ # Let's go deeper:
$ deep 10 | gcc -x c - && ./a.out
10
$ deep 100 | gcc -x c - && ./a.out
100
$ deep 1000 | gcc -x c - && ./a.out
1000
$ deep 10000 | gcc -x c - && ./a.out
10000
$ # Ka-bang. (Took quite a long time, too.)
$ deep 100000 | gcc -x c - && ./a.out
gcc: internal compiler error: Segmentation fault (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
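Incidentally, exactly where gcc falls over depends on the process stack limit, so on Linux you can often push the experiment a bit further by raising the limit first (an experiment, not a fix):
$ ulimit -s unlimited   # raise this shell's stack limit, where permitted
$ deep 100000 | gcc -x c -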
Without the nested blocks, gcc is still slow but can handle the program:
$ type big
big is a function
big ()
{
n=$1;
printf "%s\n%s\n %s\n" '#include <stdio.h>' 'int main(void) {' 'int a0 = 0;';
for ((i=0; i<n; ++i))
do
printf ' int a%d = a%d + 1;\n' $((i+1)) $i;
done;
printf ' printf("%%d\\n", a%d);\n' $n;
printf "%s\n" '}'
}
$ big 3
#include <stdio.h>
int main(void) {
int a0 = 0;
int a1 = a0 + 1;
int a2 = a1 + 1;
int a3 = a2 + 1;
printf("%d\n", a3);
}
$ big 3|gcc -x c - && ./a.out
3
$ big 10000|gcc -x c - && ./a.out
10000
$ big 100000|gcc -x c - && ./a.out
100000
You can define your grammar in ABNF syntax and give it to TGS* to parse it iteratively, without using the thread's dedicated stack for recursion. The parser generator generates parsers that run iteratively for all of their operations: lexing, parsing, tree construction, tree-to-string conversion, tree iteration and tree destruction.
At runtime, the parser can also give you the tree-building information as events only; then you can build your tree however you want (or do any calculation without any tree). In this case, when you parse with events (a deterministic parser grammar without explicit tree building), as long as you have enough memory to hold the depth of the parsing rules, you can practically "stream" any input regardless of its size.
The deterministic grammar in ABNF (RFC 5234)-like syntax is this:
alternative = concatenation *('|' concatenation)
concatenation = 1* kleene-closure
kleene-closure = atomic 0*1 '*'
atomic = '(' alternative ')' / letter
letter = 'a'-'z' / 'A'-'Z'
This grammar however has one letter per item, and for an input like "ab" you will get two atomic nodes with one letter per node. If you want to have more letters then maybe this grammar will do:
alternative = concatenation *('|' *ws concatenation)
concatenation = element *(ws 0*1 element)
element = primary 0*1 '*'
primary = '(' *ws alternative ')' / identifier
identifier = 1*('a'-'z' / 'A'-'Z')
ws = %x20 / %x9 / %xA / %xD
You can read that as: an alternative is made of one or more concatenations separated by |. A concatenation is one or more elements separated by at least one white space character. An element may end in * and can be an alternative in scopes or an identifier, which in turn is one or more letters. White space is space, tab, new line or carriage return. If you want to have more complex identifiers you may use this:
identifier = (letter / '_') *(letter / '_' / digit)
letter = 'a'-'z' / 'A'-'Z'
digit = '0'-'9'
*I work on that project.

Bison: GLR-parsing of valid expression fails without error message

I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<,>,<=,...) and boolean composition (and, or, not). Now the problem is that the language also allows to have multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is a very dumb language-design, but I can't change it.
So you can have a > b and c which is supposed to be equivalent to (a > b) and (a > c) and you can also have a > b and c > d which is supposed to be equivalent to (a > b) and (c > d)
The S/R conflict this causes is already obvious in this example: after reading a > b with lookahead and you could either reduce the a > b to a boolean expression and wait for another boolean expression or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The remaining and c tokens are apparently discarded - the debug outputs don't show any action for these tokens. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
this one strangely does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
%left '+' '-'
%left '*' '/'
%%                          %%
expr : term
     | expr '+' term        expr: expr '+' expr
     | expr '-' term            | expr '-' expr
term : factor
     | term '*' factor          | expr '*' expr
     | term '/' factor          | expr '/' expr
factor: ID                      | ID
     | '(' expr ')'             | '(' expr ')'
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input   : %empty
        | input bool '\n'   { puts($2); }
        ;
arith   : ID
        ;
maxtree : arith
        | maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
        ;
rel     : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
        ;
bool    : rel
        | bool '&' rel      { asprintf(&$$, "[AND %s %s]", $1, $3); }
        ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few additional problems here. An expression such as A > B AND C in COBOL is ambiguous until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID. EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression A > B AND C is equivalent to A > B AND W = 3. Had C been defined in a manner similar to A and B, the semantics would have been A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"

Unclear how a yacc/bison production spec can cause a stack overflow

This is not homework, but it is from a book. I'm given the following grammar:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%%
command : exp '\n' { printf("%d\n", $1); exit(0); }
| error '\n'
{
yyerrok;
printf("reenter expression: ");
}
command
;
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d\n", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
Here is the task:
The simple error recovery technique suggested for the calculator program is flawed in that it could cause stack overflow after many errors. Rewrite it to remove this problem.
I can't really figure out how a stack overflow can occur. Given that the starting production is the only one that has an error token in it, wouldn't yacc/bison pop all the elements on the stack before restarting?
When in doubt, the simplest thing is to use bison.
I modified the program slightly in order to avoid the bugs. First, since the new program relies on seeing '\n' tokens, I removed the line if (c == '\n') return 0; which would suppress sending '\n'. Second, I fixed scanf("%d\n", &yylval); to scanf("%d", &yylval);. There's no reason to swallow the whitespace following the number, particularly if the whitespace following the number is a newline. (However, scanf patterns don't distinguish between different kinds of whitespace, so the pattern "%d\n" has exactly the same semantics as "%d ". Neither of those would be correct.)
Then I added the line yydebug = 1; at the top of main and supplied the -t ("trace") option to bison when I built the calculator. That causes the parser to show its progress in detail as it processes the input.
It helps to get a state table dump in order to see what's going on. You can do that with the -v bison option. I'll leave that for readers, though.
Then I ran the program and deliberately typed a syntax error:
./error
Starting parse
Entering state 0
Reading a token: 2++3
The trace facility has already output two lines, but after I give it some input, the trace comes pouring out.
First, the parser absorbs the NUMBER 2 and the operator +: (Note: nterm below is bison's way of saying "non-terminal", while token is a "terminal"; the stack shows only state numbers.)
Next token is token NUMBER ()
Shifting token NUMBER ()
Entering state 2
Reducing stack by rule 9 (line 25):
$1 = token NUMBER ()
-> $$ = nterm factor ()
Stack now 0
Entering state 7
Reducing stack by rule 8 (line 22):
$1 = nterm factor ()
-> $$ = nterm term ()
Stack now 0
Entering state 6
Reading a token: Next token is token '+' ()
Reducing stack by rule 6 (line 18):
$1 = nterm term ()
-> $$ = nterm exp ()
Stack now 0
Entering state 5
Next token is token '+' ()
Shifting token '+' ()
Entering state 12
So far, so good. State 12 is where we get to after we've seen +; here is its definition:
State 12
4 exp: exp '+' . term
7 term: . term '*' factor
8 | . factor
9 factor: . NUMBER
10 | . '(' exp ')'
NUMBER shift, and go to state 2
'(' shift, and go to state 3
term go to state 17
factor go to state 7
(By default, bison doesn't clutter up the state table with non-core items. I added -r itemset to get the full itemset, but it would have been easy enough to do the closure by hand.)
Since in this state we're looking for the right-hand operand of +, only things which can start an expression are valid: NUMBER and (. But that's not what we've got:
Reading a token: Next token is token '+' ()
syntax error
OK, we're in State 12, and if you look at the above state description, you'll see that error is not in the lookahead set either. So:
Error: popping token '+' ()
Stack now 0 5
That puts us back in State 5, which is where an operator was expected:
State 5
1 command: exp . '\n'
4 exp: exp . '+' term
5 | exp . '-' term
'\n' shift, and go to state 11
'+' shift, and go to state 12
'-' shift, and go to state 13
So that state doesn't have a transition on error either. Onwards.
Error: popping nterm exp ()
Stack now 0
OK, back to the beginning. State 0 does have an error transition:
error shift, and go to state 1
So now we can shift the error token and enter state 1, as indicated by the transition table:
Shifting token error ()
Entering state 1
Now we need to synchronize the input by skipping input tokens until we get to a newline token. (Note that bison actually pops and pushes the error token while it's doing this. Try not to let that distract you.)
Next token is token '+' ()
Error: discarding token '+' ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token NUMBER ()
Error: discarding token NUMBER ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token '\n' ()
Shifting token '\n' ()
Entering state 8
Right, we found the newline. State 8 is command: error '\n' . $#1 command. $#1 is the name of the marker (empty production) which bison inserted in place of the mid-rule action (MRA). State 8 will reduce this marker, causing the MRA to run, which asks me for more input. Note that at this point error recovery is complete. We are now in a perfectly normal state, and the stack reflects the fact that we have, in order, the start (state 0), an error token (state 1) and a newline token (state 8):
Reducing stack by rule 2 (line 13):
-> $$ = nterm $#1 ()
Stack now 0 1 8
Entering state 15
Reading a token: Try again:
After the MRA is reduced, the corresponding action from State 8 is taken and we proceed to State 15 (to avoid clutter, I left out the non-core items):
State 15
3 command: error '\n' $#1 . command
error shift, and go to state 1
NUMBER shift, and go to state 2
'(' shift, and go to state 3
command go to state 20
exp go to state 5
term go to state 6
factor go to state 7
So now we're ready to parse a brand new command, as expected. But we have not yet reduced the error production; it's still on the stack because it can't be reduced until the command following the dot has been reduced. And we haven't even started on it yet.
But it's important to note that State 15 does have a transition on error, as you can see from the state's goto table. It has that transition because the closure includes the two productions for command:
1 command: . exp '\n'
3 | . error '\n' $#1 command
as well as the productions for exp, term and factor, which are also part of the closure.
So what happens if we now enter another error? The stack will be popped back to this point (0 1 8 15), a new error token will be pushed onto the stack (0 1 8 15 1), tokens will be discarded until a newline can be shifted (0 1 8 15 1 8) and a new MRA ($#1, as bison calls it) will be reduced onto the stack (0 1 8 15 1 8 15) at which point we're ready to start parsing yet another attempt.
Hopefully you can see where this is going.
Note that it really is not different from the effect of any other right-recursive production. Had the grammar attempted to accept a number of expressions:
prog: exp '\n'
| exp '\n' { printf("%d\n", $1); } prog
you would see the same stack build-up, which is why right-recursion is discouraged. (And also because you end up inserting MRAs to avoid seeing the results in reverse order as the stack is reduced down to prog at the end of all input.)
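For the record, here is one left-recursive rewrite of the kind the exercise is after (a sketch; the lines non-terminal is my own name and this is untested against the book's code). The error rule now reduces back onto a single lines non-terminal, so the stack stays bounded no matter how many errors occur:
command : lines exp '\n'   { printf("%d\n", $2); exit(0); }
        ;
lines   : %empty
        | lines error '\n' { yyerrok; printf("reenter expression: "); }
        ;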

specific error recovery in bison/yacc

I'm reading the book "Compiler Construction: Principles and Practice" by Kenneth Louden and trying to understand error recovery in Yacc.
The author is giving an example using the following grammar:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%%
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d\n", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
This produces a state table, referred to as Table 5.11 later on (the table itself is not reproduced here).
Numbers in the reductions correspond to the following productions:
(1) command : exp.
(2) exp : exp + term.
(3) exp : exp - term.
(4) exp : term.
(5) term : term * factor.
(6) term : factor.
(7) factor : NUMBER.
(8) factor : ( exp ).
Then Dr. Louden gives the following example:
Consider what would happen if an error production were added to the yacc definition
yacc definition
factor : NUMBER {$$ = $1;}
| '(' exp ')' {$$=$2;}
| error {$$ = 0;}
;
Consider first erroneous input 2++3 as in the previous example (We continue to use Table 5.11, although the additional error production results in a slightly different table.) As before the parser will reach the following point:
parsing stack              input
$0 exp 2 + 7               +3$
Now the error production for factor will provide that error is a legal lookahead in state 7 and error will be immediately shifted onto the stack and reduced to factor, causing the value 0 to be returned. Now the parser has reached the following point:
parsing stack              input
$0 exp 2 + 7 factor 4      +3$
This is a normal situation, and the parser will continue to execute normally to the end. The effect is to interpret the input as 2+0+3 - the 0 between the two + symbols is there because that is where the error pseudotoken is inserted, and by the action for the error production, error is viewed as equivalent to a factor with value 0.
My question is very simple:
How did he know, by looking at the grammar, that in order to recover from this specific error (2++3) he needs to add an error pseudotoken to the factor production? Is there a simple way to do it? Or is the only way to work out multiple examples with the state table, recognize that this particular error will occur in this given state, and conclude that adding an error pseudotoken to some specific production will fix it?
Any help is appreciated.
In that simple grammar, you have very few options for an error production, and all of them will allow the parse to continue.
Choosing the one at the bottom of the derivation tree makes some sense in this case, but that's not a general purpose heuristic. It's more commonly useful to put error productions at the top of the derivation tree where they can be used to resynchronize the parse. For example, suppose we'd modified the grammar to allow for multiple expressions, each on its own line (which would require modifying yylex so that it doesn't fake an EOF when it sees \n):
program: %empty
| program '\n'
| program exp '\n' { printf("%d\n", $2); }
Now, if we want to just ignore errors and continue parsing, we can add a resynchronizing error production:
| program error '\n'
The '\n' terminal in the above will cause tokens to be skipped until a newline can be shifted to reduce the error production, so that the parse can continue with the next line.
Not all languages are so easy to resynchronize, though. Statements in C-like languages are not necessarily terminated by ;, and a naive attempt to resynchronize as above would cause a certain amount of confusion if the error were, for example, a missing }. However, it would allow the parse to continue in some way, and that might be sufficient.
In my experience, getting error productions right usually requires a lot of trial and error; it is much more of an art than a science. Trying a lot of erroneous inputs and analysing the error recovery will help.
The point of an error production is to recover from an error. Producing good error messages is an unrelated but equally challenging problem. By the time the parser attempts error recovery, the error message has already been sent to yyerror. (Of course, that function could ignore the error message and leave it to the error production to print an error, but there's no obvious reason to do that.)
One possible strategy for producing good error messages is to do some kind of table lookup (or computation) on the parser stack and the lookahead token. In effect, that's what bison's built-in expanded error handling does, and that often produces pretty reasonable results, so it's a good starting place. Alternative strategies have been explored. One good reference is Clinton Jeffery's 2003 paper Generating LR Syntax Error Messages from Examples; you might also check out Russ Cox's explanation of how he applied that idea to a Go compiler.
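In bison, that built-in expansion is a one-line opt-in; the exact spelling depends on your bison version (the parse.error detailed and custom variants arrived in bison 3.6):
%define parse.error verbose
(In older bisons the same thing was spelled %error-verbose.)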