Lemon Parser - How can I set different associativity for unary minus and substraction? - parsing

expr ::= expr MINUS expr.
expr ::= MINUS expr.
I need to set different associativity for the 2 MINUS tokens. But I can't twice set associativity for MINUS.
%left PLUS MINUS. // + -
%right NOT MINUS. // ! - // error!

This is answered in the Lemon documentation, which provide an example of that specific requirement:
The precedence of a grammar rule is equal to the precedence of the left-most terminal symbol in the rule for which a precedence is defined. This is normally what you want, but in those cases where you want the precedence of a grammar rule to be something different, you can specify an alternative precedence symbol by putting the symbol in square braces after the period at the end of the rule and before any C-code. For example:
expr = MINUS expr. [NOT]
This rule has a precedence equal to that of the NOT symbol, not the MINUS symbol as would have been the case by default.
The above example assumes you have a token NOT which you have placed in the correct order in your precedence list.

Related

Remove ambiguity in grammar for expression casting

I'm working on a small translator in JISON, but I've run into a problem when trying to implement the cast of expressions, since it generates an ambiguity in the grammar when trying to add the production of cast. I need to add the productions to the cast option, so in principle I should have something like this:
expr: OPEN_PAREN type CLOSE_PAREN expr
However, since in my grammar I must be able to have expressions in parentheses, I already have the following production, so the grammar is now ambiguous:
expr: '(' expr ')'
Initially I had the following grammar for expressions:
expr : expr PLUS expr
| expr MINUS expr
| expr TIMESexpr
| expr DIV expr
| expr MOD expr
| expr POWER expr
| MINUS expr %prec UMINUS
| expr LESS_THAN expr
| expr GREATER_THAN expr
| expr LESS_OR_EQUAL expr
| expr GREATER_OR_EQUAL expr
| expr EQUALS expr
| expr DIFFERENT expr
| expr OR expr
| expr AND expr
| NOT expr
| OPEN_PAREN expr CLOSE_PAREN
| INT_LITERAL
| DOUBLE_LITERAL
| BOOLEAN_LITERAL
| CHAR_LITERAL
| STRING_LTIERAL
| ID;
Ambiguity was handled by applying the following precedence and associativity rules:
%left 'ASSIGNEMENT'
%left 'OR'
%left 'AND'
%left 'XOR'
%left 'EQUALS', 'DIFFERENT'
%left 'LESS_THAN ', 'GREATER_THAN ', 'LESS_OR_EQUAL ', 'GREATER_OR_EQUAL '
%left 'PLUS', 'MINUS'
%left 'TIMES', 'DIV', 'MOD'
%right 'POWER'
%right 'UMINUS', 'NOT'
I can't find a way to write a production that allows me to add the cast without falling into an ambiguity. Is there a way to modify this grammar without having to write an unambiguous grammar? Is there a way I can resolve this issue using JISON, which I may not have been able to see?
Any ideas are welcome.
This is what I was trying, however it's still ambiguous:
expr: OPEN_PAREN type CLOSE_PAREN expr
| OPEN_PAREN expr CLOSE_PAREN
The problem is that you don't specify the precedence of the cast operator, which is effectively a unary operator whose precedence should be the same as any other unary operator, such as NOT. (See below for a discussion of UMINUS.)
The parsing conflicts you received are not related to the fact that expr: '(' expr ')' is also a production. That would prevent LL(1) parsing, because the two productions start with the same sequence, but that's not an ambiguity. It doesn't affect bottom-up parsing in any way; the two productions are unambiguously recognisable.
Rather, the conflicts are the result of the parser not knowing whether (type)a+b means ((type)a+b or (type)(a+b), which is no different from the ambiguity of unary minus (should -a/b be parsed as (-a)/b or -(a/b)?), which is resolved by putting UMINUS at the end of the precedence list.
In the case of casts, you don't need to use a %prec declaration with a pseudo-token; that's only necessary for - because - could also be a binary operator, with a different (reduction) precedence. The precedence of the production:
expr: '(' type ')' expr
is ) (at least in yacc/bison), because that's the last terminal in the production. There's no need to give ) a shift precedence, because the grammar requires it to always be shifted.
Three notes:
Assignment is right-associative. a = b = 3 means a = (b = 3), not (a = b) = 3.
In the particular case of unary minus (and, by extension, unary plus if you feel like implementing it), there's a good argument for putting it ahead of exponentiation, so that -a**b is parsed as -(a**b). But that doesn't mean you should move other unary operators up from the end; (type)a**b should be parsed as ((type)a)**b. Nothing says that all unary operators have to have the same precedence.
When you add postfix operators -- notably function calls and array subscripts -- you will want to put them after the unary prefix operators. -a[3] most certainly does not mean (-a)[3]. These postfix operators are, in a way, duals of the prefix operators. As noted above, expr: '(' type ')' expr has precedence ')', which is only used as a reduction precedence. Conversely, expr: expr '(' expr-list ')' does not require a reduction precedence; the relevant token whose shift precedence needs to be declared is (.
So, according to all the above, your precedence declarations might be:
%right ASSIGNMENT
%left OR
%left AND
%left XOR
%left EQUALS DIFFERENT
%left LESS_THAN GREATER_THAN LESS_OR_EQUAL GREATER_OR_EQUAL
%left PLUS MINUS
%left TIMES DIV MOD
%right UMINUS
%right POWER
%right NOT CLOSE_PAREN
%right OPEN_PAREN OPEN_BRACKET
I listed all the unary operators using right associativity, which is somewhat arbitrary; either %left or %right would have the same effect, since it is impossible for a unary operator to compete with another instance of the same operator for the same operand; for unary operators, only the precedence level makes any difference. But it's customary to mark unary operators with %right.
Bison allows the use of %precedence to declare precedence levels for operators which have no associativity, but Jison doesn't have that feature. Both Bison and Jison do allow the use of %nonassoc, but that's very different: it says that it is a syntax error if either operand to the operator is an application of the same operator. That restriction is, for example, sometimes applied to comparison operators, in order to make a < b < c a syntax error.
Usually the way this problem is handled is by having type names as distinct keywords that can't be expressions by themselves. That way, after seeing an (, the next token being a type means it is a cast and the next token being an identifier means it is an expression, so there is no ambiguity.
However, your grammar appears to allow type names (INT, DOUBLE, etc) as expressions. This doesn't make a lot of sense, and causes your parsing problem, as differentiating between a cast and a parenthesized expression will require more lookahead.
The easiest fix would be to remove these productions (though you should still have something like expr : CONSTANT_LITERAL for literal constants)

Solve shift/reduce conflict across rules

I'm trying to learn bison by writing a simple math parser and evaluator. I'm currently implementing variables. A variable can be part of a expression however I'd like do something different when one enters only a single variable name as input, which by itself is also a valid expression and hence the shift reduce conflict. I've reduced the language to this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence TWO
%precedence ONE
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' %prec ONE
| exp '\n' %prec TWO
;
exp:
NUM
| VAR %prec TWO
| exp '+' exp { $$ = $1 + $3; }
;
%%
As you can see, I've tried solving this by adding the ONE and TWO precedences manually to some rules, however it doesn't seem to work, I always get the exact same conflict. The goal is to prefer the line: VAR '\n' rule for a line consisting of nothing but a variable name, otherwise parse it as expression.
For reference, the conflicting state:
State 4
4 line: VAR . '\n'
7 exp: VAR . ['+', '\n']
'\n' shift, and go to state 8
'\n' [reduce using rule 7 (exp)]
$default reduce using rule 7 (exp)
Precedence comparisons are always, without exception, between a production and a token. (At least, on Yacc/Bison). So you can be sure that if your precedence level list does not contain a real token, it will have no effect whatsoever.
For the same reason, you cannot resolve reduce-reduce conflicts with precedence. That doesn't matter in this case, since it's a shift-reduce conflict, but all the same it's useful to know.
To be even more specific, the precedence comparison is between a reduction (using the precedence of the production to be reduced) and that of the incoming lookahead token. In this case, the lookahead token is \n and the reduction is exp: VAR. The precedence level of that production is the precedence of VAR, since that is the last terminal symbol in the production. So if you want the shift to win out over the reduction, you need to declare your precedences so that the shift is higher:
%precedence VAR
%precedence '\n'
No pseudotokens (or %prec modifiers) are needed.
This will not change the parse, because Bison always prefers shift if there are no applicable precedence rules. But it will suppress the warning.

operator precedence with more than 2 recursions

I am trying some combinations of operator precedence and associativity on bison. While some cases it looks odd, basic question appears that if the below rule is valid which do appear not wrong.
expr: expr OP1 expr OP5 '+' expr
According to bison info page, rule takes precedence from last terminal symbol or precedence explicitly assigned to it.
Below is a precedence and full expr rules paste from code:
%left OP4
%left OP3
%left OP2
%left OP1
%left '*'
%left '/'
%left '+'
%left '-'
expr: NUM { $$ = $1; }
| expr OP2 expr OP5 '+' expr { printf("+"); }
| expr OP1 expr OP5 '-' expr { printf("-"); }
| expr OP4 expr OP5 '*' expr { printf("*"); }
| expr OP3 expr OP5 '/' expr { printf("/"); }
;
Below is data tokens given:
1op11op5-2op22op5+3op33op5/4op44op5*5
Output on executing parser is below as expected.
-+/*
Now, once precedence is flipped between arithmetic operators and OP, result is reversed indicating that not last terminals are influencing the rule precedence.
%left '*'
%left '/'
%left '+'
%left '-'
%left OP4
%left OP3
%left OP2
%left OP1
Now output of parser is reverse which indicates last terminal precendence is not helping:
*/-+
Further if for first combination with only arithmetic operators at higher precedence than OP operators, if precedence of OP operators is removed, result is still as second combination with no precendece of rules playing. This result makes it difficult to conclude that second expr is used for precdence rather than 3rd one.
What can be concluded from above results.
What can be the precedence and associativity logic in case more than 2 recursions are used in rules?
The 'precdence' rules in bison really don't have much to do with operator precedence in the traditional sense -- they're really just a hack for resolving shift-reduce conflicts in a way that can implement simple operator precedence in an ambiguous grammar. So to understand how they work, you need to understand shift/reduce parsing and how the precedence rules are used to resolve conflicts.
The actual mechanism is actually pretty simple. Whenever bison has a shift/reduce conflict in the parser it generates for a grammar, it looks at the rule (to reduce) and the token (to shift) involved in the conflict, and if BOTH have assigned precedence, it resolves the conflict in favor of whichever one has higher precedence. That's it.
So this works pretty well for resolving the precedence of simple binary operators, but if you try to use it for anything more complex, it will likely get you into trouble. In your examples, the conflicts all come between the rules and the shifts for OP1/2/3/4, so all that matters is the relatvie precedence of those two groups -- the precedence within each group is irrelevant as there are never any conflicts between them. So when the rules are higher precedence, they'll reduce left to right, and when the OP tokens are higher precedence it will shift (resulting in eventual right to left reductions).

How to overcome shift-reduce conflict in LALR grammar

I am trying to parse positive and negative decimals.
number(N) ::= pnumber(N1).
number(N) ::= nnumber(N1).
number(N) ::= pnumber(N1) DOT pnumber(N2).
number(N) ::= nnumber(N1) DOT pnumber(N2).
pnumber(N) ::= NUMBER(N1).
nnumber(N) ::= MINUS NUMBER(N1).
The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs.
I am using the Lemon parser.
Edit: conflicts from .out file
State 79:
(56) number ::= nnumber *
number ::= nnumber * DOT pnumber
DOT shift 39
DOT reduce 56 ** Parsing conflict **
{default} reduce 56 number ::= nnumber
State 80:
(55) number ::= pnumber *
number ::= pnumber * DOT pnumber
DOT shift 40
DOT reduce 55 ** Parsing conflict **
{default} reduce 55 number ::= pnumber
State 39:
number ::= nnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 58 number ::= nnumber DOT pnumber
State 40:
number ::= pnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 57 number ::= pnumber DOT pnumber
Edit 2: Minimal grammar that causes issue
start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.
The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.
The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)
So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).
Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...
So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.
The easy fix is to let the lexer deal with it -- lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a distinct non-terminal from NUMBER (probably still without the -, so you'd end up with
number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER
It's much harder to remove the conflict by factoring the grammar but it can be done.
In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.
First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.
variable ::= number | name
We want to move the number rule "up" in the grammar to get it into the same place as rule with DOT. So we need to split the containing rules to special case when they end with a number. We add a suffix to denote the rules that correspond to the original rule with all versions that end in a number split off
variable ::= number | variable_n
variable_n ::= name
...and propagate that "up"
literal ::= number | literal_n
literal_n ::= variable_n
...and again
bodydef ::= number | bodydef_n
bodydef_n := literal_n
...and again
body ::= number | body_n
body := body CONJUNCTION number
body_n ::= bodydef_n
body_n ::= body CONJUNCTION bodydef_n
Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.
...last step
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION number DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT
Now you have the DOTs in all the same places, so expand the number rules:
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION integer DOT
rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT
and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use.
I've reduced the number of rules in this final expansion by adding
integer ::= pnumber | nnumber
You have to declare the associativity of the DOT operator token with %left or %right.
Or, another idea is to drop this intermediate reduction. The obvious feature in your grammar is that numbers grow by DOT followed by a number. That can be captured with a single rule:
number : number DOT NUMBER
A number followed by a DOT followed by a NUMBER token is still a number.
This rule doesn't require DOT to have an associativity declared, because there is no ambiguity; the rule is purely left-recursive, and the right hand of DOT is a terminal token. The parser must reduce the top of the stack to number when the state machine is at this point, and then shift DOT:
number : number DOT NUMBER
The language which you are parsing here is regular; it can be parsed by regular expressions without any recursion. That is why rules that have both left and right recursion in them and require associativity to be declared are somewhat of a "big hammer".

Why in some cases I can't use a token as precedence marker

Assume this code works:
left '*'
left '+'
expr: expr '+' expr
| expr '*' expr
;
I want to define an other precedence marker like:
left MULTIPLY
left PLUS
expr: expr '+' expr %prec PLUS
| expr '*' expr %prec MULTIPLY
;
Yet this is not actually effective.
I suppose these two forms should be equivalent, however, they're not.
It's not on practical problem. I just want to know the reason and the principle for this phenomenon.
Thanks.
Yacc precedence rules aren't really about the precedence of expressions, though they can be used for that. Instead, they are a way of resolving shift/reduce conflicts (and ONLY shift/reduce conflicts) explicitly.
Understanding how it works requires understanding how shift/reduce (bottom up) parsing works. The basic idea is that you read token symbols from the input and push ("shift") those tokens onto a stack. When the symbols on the top of the stack match the right hand side of some rule in the grammar, you may "reduce" the rule, popping the symbols from the stack and replacing them with a single symbol from the left side of the rule. You repeat this process, shifting tokens and reducing rules until you've read the entire input and reduced it to a single instance of the start symbol, at which point you've successfully parsed the entire input.
The essential problem with the above (and what the whole machinery of the parser generator is solving) is knowing when to reduce a rule vs when to shift a token if both are possible. The parser generator (yacc or bison) builds a state machine that tracks which symbols have been shifted and so knows what 'partially matched' rules are currently possible and limits shifts just to those tokens that can match more of such a rule. This does not work if the grammar in question is not LALR(1), and so in such cases yacc/bsion reports shift/reduce or reduce/reduce conflicts.
The way that precedence rules work to resolve shift reduce conflicts is by assigning a precedence to certain tokens and rules in the grammar. Whenever there is a shift/reduce conflict between a token to be shifted and a rule to be reduced, and BOTH have a precedence it will do whichever one has higher precedence. If they have the SAME precedence, then it looks at the %left/%right/%nonassoc flag associated with the precedence level -- %left means reduce, %right means shift, and %nonassoc means do neither and treat it as a syntax error.
The only tricky remaining bit is how tokens and rules get their precedence. Tokens get theirs from the %left/%right/%nonassoc directive they are in, which also sets the ordering. Rules get precedence from a %prec directive OR from the right-most terminal on their right-hand-side. So when you have:
%left '*'
%left '+'
expr: expr '+' expr
| expr '*' expr
;
You are setting the precedence of '*' and '+' with the %left directives, and the two rules get their precedence from those tokens.
When you have:
%left MULTIPLY
%left PLUS
expr: expr '+' expr %prec PLUS
| expr '*' expr %prec MULTIPLY
;
You are setting the precedence of the tokens MULTIPLY and PLUS and then explicitly setting the rules to have those precedences. However you are NOT SETTING ANY PRECEDENCE for the tokens '*' and '+'. So when there is a shift/reduce conflict between one of the two rules and either '*' or '+', precedence does not resolve it because the token has no precedence.
You say you are not trying to solve a specific, practical problem. And from your question, I'm a little confused about how you are trying to use the precedence marker.
I think you will find that you don't need to use the precedence marker often. It is usually simpler, and clearer to the reader, to rewrite your grammar so that precedence is explicitly accounted for. To give multiply and divide higher precedence than add and subtract, you can do something like this (example adapted from John Levine, lex & yacc 2/e, 1992):
%token NAME NUMBER
%%
stmt : NAME '=' expr
| expr
;
expr : expr '+' term
| expr '-' term
| term
;
term : term '*' factor
| term '/' factor
| factor
;
factor : '(' expr ')'
| '-' factor
| NUMBER
;
In your example, PLUS and MULTIPLY are not real tokens; you can't use them interchangeably with '+' and '*'. Levine calls them pseudo-tokens. They are there to link your productions back to your list of precedences that you have defined with %left and %nonassoc declarations. He gives this example of how you might use %prec to give unary minus high precedence even though the '-' token has low precedence:
%token NAME NUMBER
%left '-' '+'
%left '*' '/'
%nonassoc UMINUS
%%
stmt : NAME '=' expr
| expr
;
expr : expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '-' expr %prec UMINUS
| '(' expr ')'
| NUMBER
;
To sum up, I would recommend following the pattern of my first code example rather than the second; make the grammar explicit.
Shift-reduce conflicts are a conflict between trying to reduce a production versus shifting a token and moving to the nest state. When Bison is resolving a conflict its not comparing two rules and choosing one of them - its comparing one rule that it wants to reduce and the token that you want to shift in the other rule(s). This might be clearer if you have two rules to shift:
expr: expr '+' expr
| expr '*' expr
| expr '*' '*' expr
The reason this is all confusing is that the way Bison gives a precedence to the "reduce" rule is to associate it with a token (the last terminal in the rule by default or the token from the prec declaration) and then it uses the precedence table to compares that token to the token you are trying to shift. Basically, prec declarations only make sense for the "reduce" part of a conflict and they are not counted for the shift part.
One way to see this is with the following grammar
command: IF '(' expr ')' command %prec NOELSE
: IF '(' expr ')' command ELSE command
In this grammar you need to choose between reducing the first rule or shifting the ELSE token. You do this by either giving precedences to the ')' token and to the ELSE token or by using a prec declaration and giving a precedence for NOELSE instead of ')'. If you try to give a prec declaration to the second it will get ignored and Bison will continue trying to look for the precedence of the ELSE token in the precedence table.

Resources