Eliminating shift/reduce errors for binary ops - parsing

fsyacc is emitting shift/reduce errors for all binary ops.
I have this recursive production:
scalar_expr:
| scalar_expr binary_op scalar_expr { Binary($2, $1, $3) }
Changing it to
scalar_expr:
| constant binary_op constant { Binary($2, Constant($1), Constant($3)) }
eliminates the errors (but isn't what I want). Precedence and associativity are defined as follows:
%left BITAND BITOR BITXOR
%left ADD SUB
%left MUL DIV MOD
Here's an excerpt from the listing file showing the state that produces the errors (one other state has the same errors).
state 42:
items:
scalar_expr -> scalar_expr . binary_op scalar_expr
scalar_expr -> scalar_expr binary_op scalar_expr .
actions:
action 'EOF' (noprec): reduce scalar_expr --> scalar_expr binary_op scalar_expr
action 'MUL' (explicit left 9999): shift 8
action 'DIV' (explicit left 9999): shift 9
action 'MOD' (explicit left 9999): shift 10
action 'ADD' (explicit left 9998): shift 6
action 'SUB' (explicit left 9998): shift 7
action 'BITAND' (explicit left 9997): shift 11
action 'BITOR' (explicit left 9997): shift 12
action 'BITXOR' (explicit left 9997): shift 13
You can see the parser shifts in all cases, which is correct, I think. I haven't found a case where the behavior is incorrect, at least.
How can I restate the grammar to eliminate these errors?

Is binary_op actually a production, i.e. do you have something like:
binary_op:
| ADD { OpDU.Add }
| SUB { OpDU.Sub }
...
If so I think that is the problem, since I assume the precedence rules you defined wouldn't be honored in constant binary_op constant. You need to enumerate each scalar_expr pattern explicitly, e.g.
scalar_expr:
| scalar_expr ADD scalar_expr { Binary(OpDU.Add, $1, $3) }
| scalar_expr SUB scalar_expr { Binary(OpDU.Sub, $1, $3) }
...
(I don't think there is any way to abstract away this repetitiveness with FsYacc)
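For concreteness, a fuller sketch of the enumerated form might look like the following (the extra OpDU case names and the constant base rule are my extrapolations from the names used in the question, not something from the original post):
%left BITAND BITOR BITXOR
%left ADD SUB
%left MUL DIV MOD
%%
// one production per operator, so each production takes its precedence from that terminal
scalar_expr:
  | scalar_expr ADD scalar_expr    { Binary(OpDU.Add, $1, $3) }
  | scalar_expr SUB scalar_expr    { Binary(OpDU.Sub, $1, $3) }
  | scalar_expr MUL scalar_expr    { Binary(OpDU.Mul, $1, $3) }
  | scalar_expr DIV scalar_expr    { Binary(OpDU.Div, $1, $3) }
  | scalar_expr MOD scalar_expr    { Binary(OpDU.Mod, $1, $3) }
  | scalar_expr BITAND scalar_expr { Binary(OpDU.BitAnd, $1, $3) }
  | scalar_expr BITOR scalar_expr  { Binary(OpDU.BitOr, $1, $3) }
  | scalar_expr BITXOR scalar_expr { Binary(OpDU.BitXor, $1, $3) }
  | constant                       { Constant($1) }
Because every operator now appears as a terminal directly inside a scalar_expr production, the %left declarations resolve all of the shift/reduce conflicts the way you would expect.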

As Stephen pointed out in his answer, the precedence rules for your operators won't apply if you move them to a separate production. This is strange, since the state you posted seems to honor them correctly.
However, you should be able to declare and apply explicit precedence rules, e.g. you could define them as:
%left ADD_SUB_OP
%left MUL_DIV_OP
and apply them like this:
scalar_expr:
| scalar_expr add_sub_ops scalar_expr %prec ADD_SUB_OP { Binary($2, $1, $3) }
| scalar_expr mul_div_ops scalar_expr %prec MUL_DIV_OP { Binary($2, $1, $3) }
This way you still have to define multiple rules, but you can group all operators with the same precedence together. (Note that the names I chose are a poor example; you may want to use names that reflect the precedence level or describe the operators within, so it's clear what they indicate.)
Whether or not this solution is worth it depends on the number of operators per group I suppose.
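For what it's worth, here is one way the grouped version might be wired up end to end. This is my own sketch; in particular, listing each pseudo-token on the same %left line as the real operators it stands for is an assumption on my part, made so that the tokens being shifted carry a precedence as well as the rules being reduced (the answers to the later questions below explain why both sides matter):
%left BITAND BITOR BITXOR BITWISE_OP
%left ADD SUB ADD_SUB_OP
%left MUL DIV MOD MUL_DIV_OP
%%
add_sub_ops:
  | ADD { OpDU.Add }
  | SUB { OpDU.Sub }
mul_div_ops:
  | MUL { OpDU.Mul }
  | DIV { OpDU.Div }
  | MOD { OpDU.Mod }
bitwise_ops:
  | BITAND { OpDU.BitAnd }
  | BITOR  { OpDU.BitOr }
  | BITXOR { OpDU.BitXor }
scalar_expr:
  | scalar_expr add_sub_ops scalar_expr %prec ADD_SUB_OP { Binary($2, $1, $3) }
  | scalar_expr mul_div_ops scalar_expr %prec MUL_DIV_OP { Binary($2, $1, $3) }
  | scalar_expr bitwise_ops scalar_expr %prec BITWISE_OP { Binary($2, $1, $3) }
  | constant                                             { Constant($1) }
The %prec clause gives each scalar_expr production the level of its group's pseudo-token, and because the real operator tokens sit on the same %left lines, every shift/reduce comparison has a declared precedence on both sides.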

Related

Solve shift/reduce conflict across rules

I'm trying to learn bison by writing a simple math parser and evaluator. I'm currently implementing variables. A variable can be part of an expression; however, I'd like to do something different when one enters only a single variable name as input, which by itself is also a valid expression, hence the shift/reduce conflict. I've reduced the language to this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence TWO
%precedence ONE
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' %prec ONE
| exp '\n' %prec TWO
;
exp:
NUM
| VAR %prec TWO
| exp '+' exp { $$ = $1 + $3; }
;
%%
As you can see, I've tried solving this by adding the ONE and TWO precedences manually to some rules; however, it doesn't seem to work, and I always get the exact same conflict. The goal is to prefer the line: VAR '\n' rule for a line consisting of nothing but a variable name, and otherwise parse it as an expression.
For reference, the conflicting state:
State 4
4 line: VAR . '\n'
7 exp: VAR . ['+', '\n']
'\n' shift, and go to state 8
'\n' [reduce using rule 7 (exp)]
$default reduce using rule 7 (exp)
Precedence comparisons are always, without exception, between a production and a token. (At least, on Yacc/Bison). So you can be sure that if your precedence level list does not contain a real token, it will have no effect whatsoever.
For the same reason, you cannot resolve reduce-reduce conflicts with precedence. That doesn't matter in this case, since it's a shift-reduce conflict, but all the same it's useful to know.
To be even more specific, the precedence comparison is between a reduction (using the precedence of the production to be reduced) and that of the incoming lookahead token. In this case, the lookahead token is \n and the reduction is exp: VAR. The precedence level of that production is the precedence of VAR, since that is the last terminal symbol in the production. So if you want the shift to win out over the reduction, you need to declare your precedences so that the shift is higher:
%precedence VAR
%precedence '\n'
No pseudotokens (or %prec modifiers) are needed.
This will not change the parse, because Bison always prefers shift if there are no applicable precedence rules. But it will suppress the warning.
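Putting that together, a sketch of the reduced grammar from the question with only the declarations changed (no pseudotokens, no %prec) might look like this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence VAR /* exp: VAR takes its precedence from here */
%precedence '\n' /* declared later, so higher: shifting '\n' wins */
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' /* preferred for a line that is just a variable name */
| exp '\n'
;
exp:
NUM
| VAR
| exp '+' exp { $$ = $1 + $3; }
;
%%
With lookahead '\n' the shift now wins (giving line: VAR '\n'), while with lookahead '+' the reduction exp: VAR wins because VAR was declared with higher precedence than '+', so a variable followed by an operator is still parsed as an expression.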

Simple YACC grammar with problems

I want to implement a simple YACC parser, but I have problems with my grammar:
%union {
s string
b []byte
t *ASTNode
}
%token AND OR NOT
%token<s> STRING
%token<b> BYTES
%type<t> expr
%left AND
%left OR
%right NOT
%%
expr: '(' expr ')' {{ $$ = $2 }}
| expr AND expr {{ $$ = NewASTComplexNode(OPND_AND, $1, $3) }}
| expr AND NOT expr {{ $$ = NewASTComplexNode(OPND_AND, $1, NewASTComplexNode(OPND_NOT, $4, nil)) }}
| NOT expr AND expr {{ $$ = NewASTComplexNode(OPND_AND, NewASTComplexNode(OPND_NOT, $2, nil), $4) }}
| expr OR expr {{ $$ = NewASTComplexNode(OPND_OR, $1, $3) }}
| STRING {{ $$ = NewASTSimpleNode(OPRT_STRING, $1) }}
| BYTES {{ $$ = NewASTSimpleNode(OPRT_BYTES, $1) }}
;
Can someone explain to me why it gives these errors?
rule expr: NOT expr AND expr never reduced
1 rules never reduced
conflicts: 3 reduce/reduce
In a comment, it is clarified that the requirement is that:
The NOT operator should apply only to operands of AND and [the operands] shouldn't be both NOT.
The second part of that requirement is a little odd, since the AND operator is defined to be left-associative. That would mean that
a AND NOT b AND NOT c
would be legal, because it is grouped as (a AND NOT b) AND NOT c, in which both AND operators have one positive term. But rotating the arguments (which might not change the semantics at all) produces:
NOT b AND NOT c AND a
which would be illegal because the first grouping (NOT b AND NOT c) contains two NOT expressions.
It might be that the intention was that any conjunction (sequence of AND operators) contain at least one positive term.
Both of these constraints are possible, but they cannot be achieved using operator precedence declarations.
Operator precedence can be used to resolve ambiguity in an ambiguous grammar (and expr: expr OR expr is certainly ambiguous, since it allows OR to associate in either direction). But it cannot be used to impose additional requirements on operands, particularly not a requirement which takes both operands into account [Note 1]. In order to do that, you need to write out an unambiguous grammar. Fortunately, that's not too difficult.
The basis for the unambiguous grammar is what is sometimes called cascading precedence style; this effectively encodes the precedence into rules, so that precedence declarations are unnecessary. The basic boolean grammar looks like this:
expr: conjunction /* Cascade to the next level */
| expr OR conjunction
conjunction
: term /* Continue the cascade */
| conjunction AND term
term: atom /* NOT is a prefix operator */
| NOT term /* which is not what you want */
atom: '(' expr ')'
| STRING
Each precedence level has a corresponding non-terminal, and the levels "cascade" in the sense that each one, except the last, includes the next level as a unit production.
To adapt this to the requirement that NOT be restricted to at most one operand of an AND operator, we can write out the possibilities more or less as you did in the original grammar, but respecting the cascading style:
expr: conjunction
| expr OR conjunction
conjunction
: atom /* NOT is integrated, so no need for 'term' */
| conjunction AND atom
| conjunction AND NOT atom /* Can extend with a negative */
| NOT atom AND atom /* Special case for negative at beginning */
atom: '(' expr ')'
| STRING
The third rule for conjunction (conjunction: conjunction AND NOT atom) allows any number of NOT applications to be added at the end of a list of operands, but does not allow consecutive NOT operands at the beginning of the list. A single NOT at the beginning is allowed by the fourth rule.
If you prefer the rule that a conjunction has to have at least one positive term, you can use the following very simple adaptation:
expr: conjunction
| expr OR conjunction
conjunction
: atom /* NOT is integrated, so no need for 'term' */
| conjunction AND atom
| conjunction AND NOT atom
| negative AND atom /* Possible initial list of negatives */
negative /* This is not part of the cascade */
: NOT atom
| negative AND NOT atom
atom: '(' expr ')'
| STRING
In this variant, negative will match, for example, NOT a AND NOT b AND NOT c. But because it's not in the cascade, that sequence doesn't automatically become a valid expression. In order for it to be used as an expression, it needs to be part of conjunction: negative AND atom, which requires that the sequence include a positive.
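For reference, here is the same variant written out as a complete grammar fragment, with a few inputs traced by hand; the %token line and the BYTES alternative are my additions to make it self-contained, and the actions are omitted:
%token STRING BYTES AND OR NOT
%%
/* Traced by hand against these rules:
   a AND NOT b             accepted (conjunction AND NOT atom)
   NOT a AND b             accepted (negative AND atom)
   NOT a AND NOT b AND c   accepted (the negatives collect in 'negative', then AND atom)
   NOT a AND NOT b         rejected (a bare negative never reaches expr)           */
expr: conjunction
    | expr OR conjunction
    ;
conjunction
    : atom
    | conjunction AND atom
    | conjunction AND NOT atom
    | negative AND atom
    ;
negative
    : NOT atom
    | negative AND NOT atom
    ;
atom: '(' expr ')'
    | STRING
    | BYTES
    ;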
Notes
1. The %nonassoc precedence declaration can be used to reject chained use of operators from the same precedence level. It's a bit of a hack, and can sometimes have unexpected consequences. The expected use case is operators like < in languages like C which don't have special handling for chained comparison; using %nonassoc you can declare that chaining is illegal in the precedence level for comparison operators.
But %nonassoc only works within a single precedence level, and it only works if all the operators in the precedence level require the same treatment. If the intended grammar does not fully conform to those requirements, it will be necessary -- as with this grammar -- to abandon the use of precedence declarations and write out an unambiguous grammar.
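As a small illustration of that expected use case (my own example, not taken from the question): with the declarations below, a + b < c parses as (a + b) < c, while a < b < c is rejected with a syntax error at the second '<', because '<' cannot associate with itself.
%token NUM
%nonassoc '<' '>' /* comparisons: lowest level, and chaining them is an error */
%left '+' '-'
%%
expr: expr '+' expr
    | expr '-' expr
    | expr '<' expr
    | expr '>' expr
    | NUM
    ;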

Flex and Bison - Grammar that sometimes care about spaces

Currently I'm trying to implement a grammar which is very similar to Ruby's. To keep it simple, the lexer currently ignores space characters.
However, in some cases a space character makes a big difference:
def some_callback(arg=0)
arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently, all whitespace is ignored by the lexer:
{WHITESPACE} { ; }
And the grammar contains, for example, something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to add whitespace explicitly throughout the grammar, but doing so would increase the grammar's complexity a lot:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I'd like to ask whether there is any best practice for handling this in the grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example: in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]* { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straight-forward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. A not-fully-fleshed out implementation, building on the previous:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"(" { yylval.id = strdup(yytext);
return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
if (!is_local(yylval.id))
BEGIN(SIGNED_NUMBERS);
return IDENT; }
[[:alpha:]_][[:alnum:]_]* { yylval.id = strdup(yytext);
return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
/* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
BEGIN(INITIAL);
return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
return INTEGER; }
/* ... */
/* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
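A rough sketch of what that alternative might look like in the lexer (entirely my own illustration; the SPACED_PLUS and SPACED_MINUS token names are invented): the sign is returned as a distinct token only when it follows blank space and is directly followed by a digit, and the parser is then left to decide whether to fold it into the number.
[[:blank:]]+[+-]/[[:digit:]] { /* blanks, then a sign with a digit right behind it */
                               return yytext[yyleng-1] == '+' ? SPACED_PLUS
                                                              : SPACED_MINUS; }
[[:blank:]]+                 ; /* any other whitespace is still ignored */
Flex prefers the first rule over the plain whitespace rule because, for the purpose of choosing the longest match, the trailing context counts as matched text even though it is returned to the input.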
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

Happy Context-Dependent Operator Precedence

I have two snippets of Happy code here, one that uses normal precedence rules, and one that uses context-dependent precedence rules (both of which are described in the Happy documentation).
Normal:
%left '+'
%left '*'
%%
Exp :: { Exp }
: Exp '+' Exp { Plus $1 $3 }
| Exp '*' Exp { Times $1 $3 }
| var { Var $1 }
Context-dependent:
%left PLUS
%left TIMES
%%
Exp :: { Exp }
: Exp '+' Exp %prec PLUS { Plus $1 $3 }
| Exp '*' Exp %prec TIMES { Times $1 $3 }
| var { Var $1 }
Given the input:
a * b + c * d
The normal version gives:
Plus (Times (Var "a") (Var "b")) (Times (Var "c") (Var "d"))
whereas the context-dependent version gives:
Times (Var "a") (Plus (Var "b") (Times (Var "c") (Var "c")))
Shouldn't these both give the same output? What am I doing wrong here that's making them generate different parse trees?
"Context-dependent precedence" is a very misleading way of describing that feature. The description of the precedence algorithm in the preceding section is largely accurate, though.
As it says, a precedence comparison is always between a production (which could be reduced) and a terminal (which could be shifted). That simple fact is often clouded by the decision to design the precedence declaration syntax as though precedence were solely an attribute of the terminal.
The production's precedence is set by copying the precedence of the last terminal in the production unless there is an explicit declaration with %prec. Or to put it another way, the production's precedence is set with a %prec clause, defaulting to the precedence of the last token. Either way, you can only define the precedence of the production by saying that it is the same as that of some terminal. Since that is not always convenient, the parser generator gives you the option of using an arbitrary name which is not the name of a grammar symbol. The implementation is to treat the name as a terminal and ignore the fact that it is never actually used in any grammar rule, but logically it is the name of a precedence level which is to be assigned to that particular production.
In your first example, you let the productions default their precedence to the last (in fact, the only) terminal in each production. But in your second example you have defined two named precedence levels, PLUS and TIMES, and you use those to set the precedence of two productions. But you do not declare the precedence of either of the terminals ('+' and '*') that actually appear in the rules. So when the parser generator attempts to check the relative precedence of the production which could be reduced and the terminal which could be shifted, it finds that only one of those things has a declared precedence. And in that case, it always shifts.
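Based on that explanation, one way to make the second version behave like the first is to give the real terminals a precedence as well. A sketch (my own, keeping the %prec names and simply listing each one on the same line as the corresponding terminal so the levels stay aligned):
-- give both the real terminal and the %prec name the same level
%left '+' PLUS
%left '*' TIMES
%%
Exp :: { Exp }
  : Exp '+' Exp %prec PLUS  { Plus $1 $3 }
  | Exp '*' Exp %prec TIMES { Times $1 $3 }
  | var                     { Var $1 }
At that point the %prec annotations are redundant, since they name the same levels the defaults would pick up from '+' and '*', so dropping them and declaring %left '+' and %left '*' directly, as in the first snippet, is the simpler fix.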

Why in some cases I can't use a token as precedence marker

Assume this code works:
%left '*'
%left '+'
expr: expr '+' expr
| expr '*' expr
;
I want to define other precedence markers, like:
%left MULTIPLY
%left PLUS
expr: expr '+' expr %prec PLUS
| expr '*' expr %prec MULTIPLY
;
Yet this is not actually effective.
I suppose these two forms should be equivalent; however, they're not.
It's not a practical problem. I just want to know the reason and the principle behind this phenomenon.
Thanks.
Yacc precedence rules aren't really about the precedence of expressions, though they can be used for that. Instead, they are a way of resolving shift/reduce conflicts (and ONLY shift/reduce conflicts) explicitly.
Understanding how it works requires understanding how shift/reduce (bottom up) parsing works. The basic idea is that you read token symbols from the input and push ("shift") those tokens onto a stack. When the symbols on the top of the stack match the right hand side of some rule in the grammar, you may "reduce" the rule, popping the symbols from the stack and replacing them with a single symbol from the left side of the rule. You repeat this process, shifting tokens and reducing rules until you've read the entire input and reduced it to a single instance of the start symbol, at which point you've successfully parsed the entire input.
The essential problem with the above (and what the whole machinery of the parser generator is solving) is knowing when to reduce a rule vs when to shift a token if both are possible. The parser generator (yacc or bison) builds a state machine that tracks which symbols have been shifted and so knows what 'partially matched' rules are currently possible and limits shifts just to those tokens that can match more of such a rule. This does not work if the grammar in question is not LALR(1), and so in such cases yacc/bsion reports shift/reduce or reduce/reduce conflicts.
The way that precedence rules work to resolve shift reduce conflicts is by assigning a precedence to certain tokens and rules in the grammar. Whenever there is a shift/reduce conflict between a token to be shifted and a rule to be reduced, and BOTH have a precedence it will do whichever one has higher precedence. If they have the SAME precedence, then it looks at the %left/%right/%nonassoc flag associated with the precedence level -- %left means reduce, %right means shift, and %nonassoc means do neither and treat it as a syntax error.
The only tricky remaining bit is how tokens and rules get their precedence. Tokens get theirs from the %left/%right/%nonassoc directive they are in, which also sets the ordering. Rules get precedence from a %prec directive OR from the right-most terminal on their right-hand-side. So when you have:
%left '*'
%left '+'
expr: expr '+' expr
| expr '*' expr
;
You are setting the precedence of '*' and '+' with the %left directives, and the two rules get their precedence from those tokens.
When you have:
%left MULTIPLY
%left PLUS
expr: expr '+' expr %prec PLUS
| expr '*' expr %prec MULTIPLY
;
You are setting the precedence of the tokens MULTIPLY and PLUS and then explicitly setting the rules to have those precedences. However you are NOT SETTING ANY PRECEDENCE for the tokens '*' and '+'. So when there is a shift/reduce conflict between one of the two rules and either '*' or '+', precedence does not resolve it because the token has no precedence.
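In other words, the second form can be made effective by also giving '*' and '+' a precedence, for example (my own sketch, keeping the question's declaration order and just adding the real tokens to the same %left lines):
%left MULTIPLY '*'
%left PLUS '+'
%%
expr: expr '+' expr %prec PLUS
    | expr '*' expr %prec MULTIPLY
    ;
Once '*' and '+' carry a precedence themselves, the %prec markers add nothing here, which is why the plain form in the first snippet is the usual way to write it.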
You say you are not trying to solve a specific, practical problem. And from your question, I'm a little confused about how you are trying to use the precedence marker.
I think you will find that you don't need to use the precedence marker often. It is usually simpler, and clearer to the reader, to rewrite your grammar so that precedence is explicitly accounted for. To give multiply and divide higher precedence than add and subtract, you can do something like this (example adapted from John Levine, lex & yacc 2/e, 1992):
%token NAME NUMBER
%%
stmt : NAME '=' expr
| expr
;
expr : expr '+' term
| expr '-' term
| term
;
term : term '*' factor
| term '/' factor
| factor
;
factor : '(' expr ')'
| '-' factor
| NUMBER
;
In your example, PLUS and MULTIPLY are not real tokens; you can't use them interchangeably with '+' and '*'. Levine calls them pseudo-tokens. They are there to link your productions back to your list of precedences that you have defined with %left and %nonassoc declarations. He gives this example of how you might use %prec to give unary minus high precedence even though the '-' token has low precedence:
%token NAME NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
stmt : NAME '=' expr
| expr
;
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '-' expr %prec UMINUS
| '(' expr ')'
| NUMBER
;
To sum up, I would recommend following the pattern of my first code example rather than the second; make the grammar explicit.
A shift/reduce conflict is a conflict between trying to reduce a production and shifting a token to move to the next state. When Bison is resolving a conflict, it's not comparing two rules and choosing one of them; it's comparing one rule that it wants to reduce and the token that you want to shift in the other rule(s). This might be clearer if you have two rules to shift:
expr: expr '+' expr
| expr '*' expr
| expr '*' '*' expr
The reason this is all confusing is that the way Bison gives a precedence to the "reduce" rule is to associate it with a token (the last terminal in the rule by default, or the token from the %prec declaration) and then it uses the precedence table to compare that token to the token you are trying to shift. Basically, %prec declarations only make sense for the "reduce" part of a conflict and they are not counted for the shift part.
One way to see this is with the following grammar
command: IF '(' expr ')' command %prec NOELSE
| IF '(' expr ')' command ELSE command
In this grammar you need to choose between reducing the first rule or shifting the ELSE token. You do this either by giving precedences to the ')' token and to the ELSE token, or by using a %prec declaration and giving a precedence to NOELSE instead of ')'. If you try to put a %prec declaration on the second rule, it will be ignored, and Bison will still look for the precedence of the ELSE token in the precedence table.
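A minimal sketch of the %prec route for that example (the precedence declarations are my addition; the rule shapes are as given above):
%nonassoc NOELSE /* the else-less rule reduces with this lower precedence */
%nonassoc ELSE   /* ELSE is higher, so the parser shifts it instead of reducing */
%%
command: IF '(' expr ')' command %prec NOELSE
       | IF '(' expr ')' command ELSE command
       ;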
