specific error recovery in bison/yacc - parsing

I'm reading a "Compiler Construction, Principles and Practice" book by Kenneth Louden and trying to understand error recovery in Yacc.
The author is giving an example using the following grammar:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%%
command : exp { printf("%d\n", $1); }
        ; /* allows printing of the result */
exp     : exp '+' term { $$ = $1 + $3; }
        | exp '-' term { $$ = $1 - $3; }
        | term { $$ = $1; }
        ;
term    : term '*' factor { $$ = $1 * $3; }
        | factor { $$ = $1; }
        ;
factor  : NUMBER { $$ = $1; }
        | '(' exp ')' { $$ = $2; }
        ;
%%
int main() {
    return yyparse();
}
int yylex() {
    int c;
    /* eliminate blanks */
    while ((c = getchar()) == ' ');
    if (isdigit(c)) {
        ungetc(c, stdin);
        scanf("%d\n", &yylval);
        return (NUMBER);
    }
    /* makes the parse stop */
    if (c == '\n') return 0;
    return (c);
}
int yyerror(char * s) {
    fprintf(stderr, "%s\n", s);
    return 0;
} /* allows for printing of an error message */
Which produces the following state table (referred to as table 5.11 later on)
Numbers in the reductions correspond to the following productions:
(1) command : exp.
(2) exp : exp + term.
(3) exp : exp - term.
(4) exp : term.
(5) term : term * factor.
(6) term : factor.
(7) factor : NUMBER.
(8) factor : ( exp ).
Then Dr. Louden gives the following example:
Consider what would happen if an error production were added to the
yacc definition
factor : NUMBER {$$ = $1;}
| '(' exp ')' {$$=$2;}
| error {$$ = 0;}
;
Consider first the erroneous input 2++3, as in the previous example. (We continue to use Table 5.11, although the additional error production results in a slightly different table.) As before, the parser will
reach the following point:
parsing stack                 input
$0 exp 2 + 7                  +3$
Now the error production for factor will provide that error is a
legal lookahead in state 7 and error will be immediately shifted
onto the stack and reduced to factor, causing the value 0 to be
returned. Now the parser has reached the following point:
parsing stack                 input
$0 exp 2 + 7 factor 4         +3$
This is a normal situation, and the parser will continue to execute
normally to the end. The effect is to interpret the input as 2+0+3
- the 0 between the two + symbols is there because that is where the error pseudotoken is inserted, and by the action for the error
production, error is viewed as equivalent to a factor with value
0.
My question is very simple:
How did he know, just by looking at the grammar, that in order to recover from this specific error (2++3) he needed to add an error pseudotoken to the factor production? Is there a simple way to do it? Or is the only way to work out multiple examples with the state table, recognize that this particular error occurs in this particular state, and conclude that adding an error pseudotoken to some specific production will fix it?
Any help is appreciated.

In that simple grammar, you have very few options for an error production, and all of them will allow the parse to continue.
Choosing the one at the bottom of the derivation tree makes some sense in this case, but that's not a general-purpose heuristic. It's more commonly useful to put error productions at the top of the derivation tree, where they can be used to resynchronize the parse. For example, suppose we modified the grammar to allow multiple expressions, each on its own line (which would also require modifying yylex so that it doesn't fake an EOF when it sees \n):
program: %empty
| program '\n'
| program exp '\n' { printf("%d\n", $2); }
Now, if we want to just ignore errors and continue parsing, we can add a resynchronizing error production:
| program error '\n'
The '\n' terminal in the above will cause tokens to be skipped until a newline can be shifted to reduce the error production, so that the parse can continue with the next line.
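Assembled into one place, the resynchronizing version might look like this (my own sketch combining the fragments above with the book's expression grammar; yylex must return '\n' as an ordinary token rather than faking EOF):

```yacc
%%
program : %empty
        | program '\n'
        | program exp '\n'   { printf("%d\n", $2); }
        | program error '\n' { yyerrok; } /* skip to end of line, resume */
        ;
exp     : exp '+' term { $$ = $1 + $3; }
        | exp '-' term { $$ = $1 - $3; }
        | term
        ;
term    : term '*' factor { $$ = $1 * $3; }
        | factor
        ;
factor  : NUMBER
        | '(' exp ')' { $$ = $2; }
        ;
```

With this shape, an error anywhere on a line discards the rest of that line, and parsing resumes cleanly at the next one.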
Not all languages are so easy to resynchronize, though. Statements in C-like languages are not necessarily terminated by ;, and a naive attempt to resynchronize as above would cause a certain amount of confusion if the error were, for example, a missing }. However, it would allow the parse to continue in some way, and that might be sufficient.
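As a sketch of what that naive attempt looks like in a C-like grammar (the non-terminal names here are hypothetical):

```yacc
statement : expr_statement
          | compound_statement
          /* ... other statement forms ... */
          | error ';'   /* discard tokens up to the next ';' and resume;
                           as noted, a missing '}' can carry this recovery
                           far past the point of the actual error */
          ;
```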
In my experience, getting error productions right usually requires a lot of trial and error; it is much more of an art than a science. Trying a lot of erroneous inputs and analysing the error recovery will help.
The point of an error production is to recover from an error. Producing good error messages is an unrelated but equally challenging problem. By the time the parser attempts error recovery, the error message has already been sent to yyerror. (Of course, that function could ignore the error message and leave it to the error production to print an error, but there's no obvious reason to do that.)
One possible strategy for producing good error messages is to do some kind of table lookup (or computation) on the parser stack and the lookahead token. In effect, that's what bison's builtin expanded error handling does, and that often produces pretty reasonable results, so it's a good starting place. Alternative strategies have been explored. One good reference is Clinton Jeffery's 2003 paper Generating LR Syntax Error Messages from Examples; you might also check out Russ Cox's explanation of how he applied that idea to a Go compiler.
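If you want to try bison's built-in machinery mentioned above, the relevant directives are as follows (names per the bison manual; parse.error detailed requires a fairly recent bison, older versions only accept verbose):

```yacc
%define parse.error detailed  /* "expected NUMBER or '('" style messages */
%define parse.lac full        /* lookahead correction: computes the exact
                                 set of tokens that could have continued
                                 the parse, improving those messages */
```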

Related

How do I get my flex/bison grammar parser to give a syntax error for unrecognized tokens

I am trying to write a grammatical recognizer using flex and bison to determine if an input string is in L(G), where the language is a union of:
L(G) = {a^i b^j c^k d^l e^m} where i,j,k,l,m > 0 and i=m and k=l
and
L(G) = {e^i d^j c^k b^l a^m} where i,j,k,l,m > 0 and i=2m, k=3l, and j=2
Right now I have it working fine, but only when using the tokens in the languages. If I include any other token it seems to get ignored and the test passes or fails based on the other allowed tokens. This is problematic because it allows for strings such as "abcdef" to pass the parse even though "f" is not in the language.
The erroneous input that I am testing now is "abcdef". The "abcde" part is correct and gives the correct output, but adding the "f" to the end causes both the syntax error message from yyerror("syntax error") and the "congratulations; parse succeeded" print statement from main to be printed.
Using "fabcde" does the same thing I described above. It gives me the error but also the success print statement. I'm using if (yyparse() == 0) to print the success statement in main, and I'm thinking that might be the culprit here, although I had the same issues when I moved the print statements into the .y file and just used yyparse() and return(1) in main.
Here is my .in file (minus includes):
%%
a return A;
b return B;
c return C;
d return D;
e return E;
. yyerror("syntax error\n\nSorry, Charlie, input string not in L(G)\n"); /* working but still prints success message too */
%%
Here is my .y file (minus includes):
%token A
%token B
%token C
%token D
%token E
%% /* Grammar Rules */
string: as bs cs ds es
{
if(($1 == $5) && ($3 == $4)) {
return(0);
}
else
{
return(-1);
}
}
;
string: es ds cs bs as
{
if (($1 == (2 * $5)) && ($3 == (3 * $4)) && ($2 == 2)) {
return(0);
}
else
{
return(-1);
}
}
;
as: A as {$$ = $2 +1;}
;
as: A {$$ = 1;}
;
bs: B bs {$$ = $2 +1;}
;
bs: B {$$ = 1;}
;
cs: C cs {$$ = $2 +1;}
;
cs: C {$$ = 1;}
;
ds: D ds {$$ = $2 +1;}
;
ds: D {$$ = 1;}
;
es: E es {$$ = $2 +1;}
;
es: E {$$ = 1;}
;
%%
my .c file is simple and just returns "congratulations; parse successful" if yyparse() == 0, and "input string is not in L(G)" otherwise.
Everything works perfectly fine when the input strings only include a, b, c, d, and e. I just need to figure out how to make the parser give a syntax error without a success statement if there's any token besides them in the input string.
Here is an image that will help show my issue:
The first two parses work as intended. The third one shows my issue.
If a (f)lex rule does not return anything, then tokens that it matches will be ignored. This is appropriate for comments, but not for tokens you want to be treated as errors. If you change your catch-all flex rule to
. return *yytext;
then all unrecognized characters in the input (except newline, which is the only character . does not match) will be returned to the parser, and will most likely provoke a syntax error message (and a failure return from yyparse). If your grammar contains literal character tokens (e.g. '#' to match that character), those characters will of course be matched as those tokens instead.
A bison/yacc generated parser expects to parse an entire correct input, up to and including the end-of-input marker, and only then return a success indication (a return value of 0).
Of course, if the input is syntactically incorrect, the parser may return early with an error indication (which is always the value 1 for syntax errors, and 2 if it runs out of memory). In this case, before the parser returns, it will clean up its internal state and free any allocated memory.
It's important that you let the parser do this. Returning from a semantic action in a bison/yacc parser is at best unwise (since it is almost certainly a memory leak) and can also produce confusion precisely because it may result in successful returns after an error message is produced.
Consider, for example, the case of the input abcdea, which is a valid string followed by an invalid a. It's likely that the semantic action for string will be run before the parser attempts to handle the last a, because of parser table compression (which defers error actions in order to save table entries). But your semantic action actually returns 0, bypassing the parser's error reporting and clean-up. If the input is abcdef and your scanner calls yyerror for the invalid token (which is not a particularly good idea either), then the sequence of actions will be:
Scanner prints an error
Parser executes the string semantic action, which returns 0.
Again, proper error handling and clean-up have been bypassed by the return statement in the semantic action.
So don't do that. If you want to report an error in a semantic action, use YYABORT, which will cleanly terminate the parse with an error return. If your top-level production is correct, on the other hand, do nothing. The parser will then verify that the next input token is the end-of-input marker and return success.
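Applied to the grammar in the question, the action could be rewritten along these lines (a sketch, not tested against the asker's full program):

```yacc
string: as bs cs ds es
        {
          /* On failure, YYABORT makes yyparse() clean up and return 1;
             on success we do nothing, and the parser goes on to check
             for the end-of-input marker before returning 0. */
          if (!($1 == $5 && $3 == $4))
            YYABORT;
        }
      ;
```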

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases the space letter makes big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespaces are being ignored by the lexer:
{WHITESPACE} { ; }
And the language says for example something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to explicitly add whitespaces to the whole grammar, but doing so the whole grammar would increase a lot in complexity:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I liked to ask whether there is any best practice on how to solve this grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"(" { yylval.id = strdup(yytext);
return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
if (!is_local(yylval.id))
BEGIN(SIGNED_NUMBERS);
return IDENT; }
[[:alpha:]_][[:alnum:]_]* { yylval.id = strdup(yytext);
return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
/* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
BEGIN(INITIAL);
return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
return INTEGER; }
/* ... */
/* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Bison: GLR-parsing of valid expression fails without error message

I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<,>,<=,...) and boolean composition (and, or, not). Now the problem is that the language also allows to have multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is a very dumb language-design, but I can't change it.
So you can have a > b and c which is supposed to be equivalent to (a > b) and (a > c) and you can also have a > b and c > d which is supposed to be equivalent to (a > b) and (c > d)
The S/R conflict this causes is already obvious in this example: after reading a > b with lookahead and you could either reduce the a > b to a boolean expression and wait for another boolean expression or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The "and c" part is apparently discarded - the debug outputs don't show any action for these tokens. Not even an error message. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
this one strangely does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
                                  %left '+' '-'
                                  %left '*' '/'
%%                                %%
expr  : term                      expr: expr '+' expr
      | expr '+' term                 | expr '-' expr
      | expr '-' term                 | expr '*' expr
term  : factor                        | expr '/' expr
      | term '*' factor               | ID
      | term '/' factor               | '(' expr ')'
factor: ID
      | '(' expr ')'
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
| input bool '\n' { puts($2); }
arith : ID
maxtree : arith
| maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool : rel
| bool '&' rel { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few
additional problems here. An expression such as A > B AND C in COBOL is ambiguous
until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID. EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression: A > B AND C is equivalent to: A > B AND W = 3. Had C
been defined in a manner similar to A and B, the semantics would
have been: A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"

Unclear how a yacc/bison production spec can cause a stack overflow

This is not homework, but it is from a book. I'm given the following grammar:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%%
command : exp '\n' { printf("%d\n", $1); exit(0); }
        | error '\n'
          {
              yyerrok;
              printf("reenter expression: ");
          }
          command
        ;
exp     : exp '+' term { $$ = $1 + $3; }
        | exp '-' term { $$ = $1 - $3; }
        | term { $$ = $1; }
        ;
term    : term '*' factor { $$ = $1 * $3; }
        | factor { $$ = $1; }
        ;
factor  : NUMBER { $$ = $1; }
        | '(' exp ')' { $$ = $2; }
        ;
%%
int main() {
    return yyparse();
}
int yylex() {
    int c;
    /* eliminate blanks */
    while ((c = getchar()) == ' ');
    if (isdigit(c)) {
        ungetc(c, stdin);
        scanf("%d\n", &yylval);
        return (NUMBER);
    }
    /* makes the parse stop */
    if (c == '\n') return 0;
    return (c);
}
int yyerror(char * s) {
    fprintf(stderr, "%s\n", s);
    return 0;
} /* allows for printing of an error message */
Here is the task:
The simple error recovery technique suggested for the calculator program is flawed in that it
could cause stack overflow after many errors. Rewrite it to remove
this problem.
I can't really figure out how a stack overflow can occur. Given the starting production is the only one that has an error token in it, wouldn't yacc/bison pop all the elements on the stack and before restarting?
When in doubt, the simplest thing is to ask bison itself.
I modified the program slightly in order to avoid the bugs. First, since the new program relies on seeing '\n' tokens, I removed the line if (c == '\n') return 0; which would suppress sending '\n'. Second, I fixed scanf("%d\n", &yylval); to scanf("%d", &yylval);. There's no reason to swallow the whitespace following the number, particularly if the whitespace following the number is a newline. (However, scanf patterns don't distinguish between different kinds of whitespace, so the pattern "%d\n" has exactly the same semantics as "%d ". Neither of those would be correct.)
Then I added the line yydebug = 1; at the top of main and supplied the -t ("trace") option to bison when I built the calculator. That causes the parser to show its progress in detail as it processes the input.
It helps to get a state table dump in order to see what's going on. You can do that with the -v bison option. I'll leave that for readers, though.
Then I ran the program and deliberately typed a syntax error:
./error
Starting parse
Entering state 0
Reading a token: 2++3
The trace facility has already output two lines, but after I give it some input, the trace comes pouring out.
First, the parser absorbs the NUMBER 2 and the operator +: (Note: nterm below is bison's way of saying "non-terminal", while token is a "terminal"; the stack shows only state numbers.)
Next token is token NUMBER ()
Shifting token NUMBER ()
Entering state 2
Reducing stack by rule 9 (line 25):
$1 = token NUMBER ()
-> $$ = nterm factor ()
Stack now 0
Entering state 7
Reducing stack by rule 8 (line 22):
$1 = nterm factor ()
-> $$ = nterm term ()
Stack now 0
Entering state 6
Reading a token: Next token is token '+' ()
Reducing stack by rule 6 (line 18):
$1 = nterm term ()
-> $$ = nterm exp ()
Stack now 0
Entering state 5
Next token is token '+' ()
Shifting token '+' ()
Entering state 12
So far, so good. State 12 is where we get to after we've seen +; here is its definition:
State 12
4 exp: exp '+' . term
7 term: . term '*' factor
8 | . factor
9 factor: . NUMBER
10 | . '(' exp ')'
NUMBER shift, and go to state 2
'(' shift, and go to state 3
term go to state 17
factor go to state 7
(By default, bison doesn't clutter up the state table with non-core items. I added -r itemset to get the full itemset, but it would have been easy enough to do the closure by hand.)
Since in this state we're looking for the right-hand operand of +, only things which can start an expression are valid: NUMBER and (. But that's not what we've got:
Reading a token: Next token is token '+' ()
syntax error
OK, we're in State 12, and if you look at the above state description, you'll see that error is not in the lookahead set either. So:
Error: popping token '+' ()
Stack now 0 5
That puts us back in State 5, which is where an operator was expected:
State 5
1 command: exp . '\n'
4 exp: exp . '+' term
5 | exp . '-' term
'\n' shift, and go to state 11
'+' shift, and go to state 12
'-' shift, and go to state 13
So that state doesn't have a transition on error either. Onwards.
Error: popping nterm exp ()
Stack now 0
OK, back to the beginning. State 0 does have an error transition:
error shift, and go to state 1
So now we can shift the error token and enter state 1, as indicated by the transition table:
Shifting token error ()
Entering state 1
Now we need to synchronize the input by skipping input tokens until we get to a newline token. (Note that bison actually pops and pushes the error token while it's doing this. Try not to let that distract you.)
Next token is token '+' ()
Error: discarding token '+' ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token NUMBER ()
Error: discarding token NUMBER ()
Error: popping token error ()
Stack now 0
Shifting token error ()
Entering state 1
Reading a token: Next token is token '\n' ()
Shifting token '\n' ()
Entering state 8
Right, we found the newline. State 8's core item is command: error '\n' . $#1 command. $#1 is the name of the marker (empty production) which bison inserted in place of the mid-rule action (MRA). State 8 will reduce this marker, causing the MRA to run, which asks me for more input. Note that at this point error recovery is complete. We are now in a perfectly normal state, and the stack reflects the fact that we have, in order, the start (state 0), an error token (state 1) and a newline token (state 8):
Reducing stack by rule 2 (line 13):
-> $$ = nterm $#1 ()
Stack now 0 1 8
Entering state 15
Reading a token: Try again:
After the MRA is reduced, the corresponding action from State 8 is taken and we proceed to State 15 (to avoid clutter, I left out the non-core items):
State 15
3 command: error '\n' $#1 . command
error shift, and go to state 1
NUMBER shift, and go to state 2
'(' shift, and go to state 3
So now we're ready to parse a brand new command, as expected. But we have not yet reduced the error production; it's still on the stack because it can't be reduced until the command following the dot has been reduced. And we haven't even started on it yet.
But it's important to note that State 15 does have a transition on error, as you can see from the state's goto table. It has that transition because the closure includes the two productions for command:
1 command: . exp '\n'
3 | . error '\n' $#1 command
as well as the productions for exp, term and factor, which are also part of the closure.
So what happens if we now enter another error? The stack will be popped back to this point (0 1 8 15), a new error token will be pushed onto the stack (0 1 8 15 1), tokens will be discarded until a newline can be shifted (0 1 8 15 1 8) and a new MRA ($#1, as bison calls it) will be reduced onto the stack (0 1 8 15 1 8 15) at which point we're ready to start parsing yet another attempt.
Hopefully you can see where this is going.
Note that it is really no different from the effect of any other right-recursive production. Had the grammar attempted to accept a number of expressions:
prog: exp '\n'
| exp '\n' { printf("%d\n", $1); } prog
you would see the same stack build-up, which is why right-recursion is discouraged. (And also because you end up inserting MRAs to avoid seeing the results in reverse order as the stack is reduced down to prog at the end of all input.)
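A left-recursive formulation avoids that build-up, because each completed line is reduced as soon as it is seen. Here is a hedged sketch (the rule names prog and line are assumed, not from the original grammar):

```yacc
/* Hypothetical left-recursive variant: each completed line is
   reduced (and printed) immediately, so the stack stays shallow
   and no mid-rule action is needed to fix the output order. */
prog : /* empty */
     | prog line
     ;
line : exp '\n'    { printf("%d\n", $1); }
     | error '\n'  { yyerrok; }   /* resynchronize on the newline */
     ;
```

With this shape the stack never accumulates one frame per command; each line is disposed of before the next one starts.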
(For reference, here are State 15's goto actions, completing the state listing shown above:)
command go to state 20
exp go to state 5
term go to state 6
factor go to state 7

How to create grammar for applying De Morgan's theorem to an expression using yacc?

I would like to apply De Morgan's theorem to an input using yacc and lex.
The input could be any expression, such as a+b or !(A+B):
The expression a+b should result in !a∙!b
The expression !(a+b) should result in a+b
I think the lex part is done but I'm having difficulty with the yacc grammar needed to apply the laws to an expression.
What I'm trying to implement is the following algorithm. Consider the following equation as input: Y = A+B
After applying De Morgan's law it becomes: !Y = !(A+B)
Finally, expanding the parentheses should result in !Y = !A∙!B
Here is my lex code:
%{
#include <stdio.h>
#include "y.tab.h"
extern int yylval;
int yywrap (void);
%}
%%
[a-zA-Z]+ {yylval = *yytext; return ALPHABET;}
"&&" return AND;
"||" return OR;
"=" return ('=');
[ \t] ;
\n return 0;
. return yytext[0];
"0exit" return 0;
%%
int yywrap (void)
{
return 1;
}
Here is my yacc code:
%{
#include <stdio.h>
#include <stdlib.h> /* for exit() */
int yylex (void);
void yyerror (char *);
extern FILE* yyin;
%}
%token ALPHABET
%left '+''*'
%right '=' '!' NOT
%left AND OR
%start check
%%
check : expr {printf("%d\n",$$);}
;
expr : plus
|plus '+' plus {$$ = $1 + $3;}
;
plus : times
|times '*' times {$$ = $1 * $3;}
;
times : and_op
|and_op AND and_op{$$ = $1 && $3;}
;
and_op : or_op
|or_op OR or_op {$$ = $1 || $3;}
;
or_op : not_op
|'!' not_op {$$ = !$2;}
;
not_op : paren
|'(' paren ')' {$$ = $2;}
;
paren :
|ALPHABET {$$ = $1;}
;
/*
E: E '+' E {$$ = $1 + $3;}
|E '*' E {$$ = $1 * $3;}
|E '=' E {$$ = $1 = $3;}
|E AND E {$$ = ($1 && $3);}
|E OR E {$$ = ($1 || $3);}
|'(' E ')' {$$ = $2;}
|'!' E %prec NOT {$$ = !$2;}
|ALPHABET {$$ = $1;}
;*/
%%
int main()
{
char filename[30];
char * line = NULL;
size_t len = 0;
printf("\nEnter filename\n");
scanf("%s",filename);
FILE *fp = fopen(filename, "r");
if(fp == NULL)
{
fprintf(stderr,"Can't read file %s\n",filename);
exit(EXIT_FAILURE);
}
yyin = fp;
// while (getline(&line, &len, fp) != -1)
// {
// printf("%s",line);
// }
// printf("Enter the expression:\n");
do
{
yyparse();
}while(!feof(yyin));
return 0;
}
void yyerror (char *s) /* the parser calls this on a syntax error */
{
fprintf(stderr, "%s\n", s);
}
You are trying to build a computer algebra system.
Your task is conceptually simple:
Define a lexer for the atoms of your "boolean" expressions
Define a parser for propositional logic in terms of the lexemes
Build a tree that stores the expressions
Define procedures that implement logical equivalences (De Morgan's theorem is one): each finds a place in the tree where its equivalence can be applied by matching the tree structure, and then modifies the tree accordingly
Run those procedures to achieve the logic rewrites you want
Prettyprint the final AST as the answer
But conceptually simple doesn't necessarily mean easy to do and get it all right.
(f)lex and yacc are designed to help you do steps 1-3 in a relatively straightforward way; their documentation contains a pretty good guide.
They won't help with steps 4-6 at all, and this is where the real work happens. (Your grammar looks like a pretty good start for this part).
(You can do 1-3 without flex and yacc by building
a recursive descent parser that also happens to build the AST as it goes).
Step 4 can be messy, because you have to decide what logical theorems you wish to use, and then write a procedure for each one to do tree matching, and tree smashing, to achieve the desired result. You can do it; it's just procedural code that walks up and down the tree comparing node types and relations to children for a match, and then delinking nodes, deleting nodes, creating nodes, and relinking them to effect the tree modification. This is just a bunch of code.
A subtlety of algebraic rewrites is now going to bite you: (boolean) algebra has associative and commutative operators. What this means is that some algebra rules will apply to parts of the tree that are arbitrarily far apart. Consider this rule:
a*(b + !a) => a*(b)
What happens when the actual term being parsed looks like:
q*(a + b + c + ... !q ... + z)
"Simple" procedural code to look at the tree now has to walk arbitrarily far down one of the subtrees to find where the rule can apply. Suddenly coding the matching logic isn't so easy, nor is the tree-smash to implement the effect.
If we ignore associative and commutative issues, for complex matches and modifications, the code might be a bit clumsy to write and hard to read; after you've done it once this will be obvious. If you only want to do DeMorgan-over-or, you can do it relatively easily by just coding it. If you want to implement lots of boolean algebra rules for simplification, this will start to be painful. What you'd ideally like to do is express the logic rules in the same notation as your boolean logic so they are easily expressed, but now you need something that can read and interpret the logic rules. That is a complex piece of code, but if done right, you can code the logic rules something like the following:
rule deMorgan_for_or(t1:boolexp, t2:boolexp):boolexp->boolexp
" ! (\t1 + \t2) " -> " !\t1 * !\t2 ";
A related problem (step 5) is: where do you want to apply the logic rules? Just because you can apply De Morgan's law in 15 places in a very big logic term doesn't mean you necessarily want to do that. So somewhere you need to have a control mechanism that decides which of your many rules should apply, and where they should apply. This gets you into metaprogramming, a whole new topic.
If your rules are "monotonic", that is, they in effect can only be applied once, you can simply run them all everywhere and get a terminating computation, if that monotonic answer is the one you want. If you have rules that are inverses (e.g., !(x+y) => !x * !y, and !a * !b => !(a+b)), then your rules may run forever repeatedly doing and undoing a rewrite. So you have to be careful to ensure you get termination.
Finally, when you have the modified tree, you'll need to print it back out in readable form (Step 6). See my SO answer on how to build a prettyprinter.
Doing all of this for one or two rules by yourself is a great learning exercise.
Doing it with the idea of producing a useful tool is a whole different animal. There what you want is a set of infrastructure that makes this easy to express: a program transformation system. You can see a complete example of what this looks like for a system doing arithmetic rather than boolean computations using surface-syntax rewrite rules, including handling of the associative and commutative rewrite issues. In another example, you can see what it looks like for boolean logic (see simplify_boolean near the end of the page), which shows a real example of rules like the one I wrote above.

Resources