Problem part of grammar:
expr_var: var_or_ID
| expr_var '[' expr ']'
| NEW expr_var '(' expr_listE ')'
| expr_var '(' expr_listE ')'
| expr_var ARROW expr_var
| expr_var ARROW '{' expr_var '}'
| expr_var DCOLON expr_var
| expr_var DCOLON '{' expr_var '}'
| '(' expr_var ')'
;
Description of that problem:
expr_var -> NEW expr_var '(' expr_listE ')' . (rule 87)
expr_var -> expr_var '(' expr_listE ')' . (rule 88)
DCOLON reduce using rule 87 (expr_var)
DCOLON [reduce using rule 88 (expr_var)]
ARROW reduce using rule 87 (expr_var)
ARROW [reduce using rule 88 (expr_var)]
'[' reduce using rule 87 (expr_var)
'[' [reduce using rule 88 (expr_var)]
'(' reduce using rule 87 (expr_var)
'(' [reduce using rule 88 (expr_var)]
$default reduce using rule 87 (expr_var)
Each operator(ARROW,DCOLON...) is left-associative. Operator NEW is non-associative.
How can i resolve that conflict?
The issue here is the ambiguity in:
new foo(1)(2)
This could be parsed in two ways:
(new foo(1))(2) // call the new object
new (foo(1))(2) // function returns class to construct
Which (if either) of these is semantically feasible depends on your language. It is certainly conceivable that both are meaningful. The grammar does not provide any hints about the semantics.
So it is necessary for you to decide which (if either) of these is the intended parse, and craft the grammar accordingly.
Precedence declarations will not help because precedence is only used to resolve shift/reduce conflicts: a precedence relationship is always between a production and a lookahead token, never between two productions.
Bison will resolve reduced/reduce conflicts in favour of the first production, as it has done in this case. So if the syntax is legal and unambiguous then you could select which parse is desired by reordering the productions if necessary. That will leave you with the warning, but you can suppress it with an %expect-rr declaration.
You could also suppress one or the other parse (or both) by explicitly modifying the grammar to exclude new expressions from being the first argument in a call and/or exclude calls from being the first argument of a new, by means of an explicit subset of expr_var.
Related
I'm working on a small translator in JISON, but I've run into a problem when trying to implement the cast of expressions, since it generates an ambiguity in the grammar when trying to add the production of cast. I need to add the productions to the cast option, so in principle I should have something like this:
expr: OPEN_PAREN type CLOSE_PAREN expr
However, since in my grammar I must be able to have expressions in parentheses, I already have the following production, so the grammar is now ambiguous:
expr: '(' expr ')'
Initially I had the following grammar for expressions:
expr : expr PLUS expr
| expr MINUS expr
| expr TIMESexpr
| expr DIV expr
| expr MOD expr
| expr POWER expr
| MINUS expr %prec UMINUS
| expr LESS_THAN expr
| expr GREATER_THAN expr
| expr LESS_OR_EQUAL expr
| expr GREATER_OR_EQUAL expr
| expr EQUALS expr
| expr DIFFERENT expr
| expr OR expr
| expr AND expr
| NOT expr
| OPEN_PAREN expr CLOSE_PAREN
| INT_LITERAL
| DOUBLE_LITERAL
| BOOLEAN_LITERAL
| CHAR_LITERAL
| STRING_LTIERAL
| ID;
Ambiguity was handled by applying the following precedence and associativity rules:
%left 'ASSIGNEMENT'
%left 'OR'
%left 'AND'
%left 'XOR'
%left 'EQUALS', 'DIFFERENT'
%left 'LESS_THAN ', 'GREATER_THAN ', 'LESS_OR_EQUAL ', 'GREATER_OR_EQUAL '
%left 'PLUS', 'MINUS'
%left 'TIMES', 'DIV', 'MOD'
%right 'POWER'
%right 'UMINUS', 'NOT'
I can't find a way to write a production that allows me to add the cast without falling into an ambiguity. Is there a way to modify this grammar without having to write an unambiguous grammar? Is there a way I can resolve this issue using JISON, which I may not have been able to see?
Any ideas are welcome.
This is what I was trying, however it's still ambiguous:
expr: OPEN_PAREN type CLOSE_PAREN expr
| OPEN_PAREN expr CLOSE_PAREN
The problem is that you don't specify the precedence of the cast operator, which is effectively a unary operator whose precedence should be the same as any other unary operator, such as NOT. (See below for a discussion of UMINUS.)
The parsing conflicts you received are not related to the fact that expr: '(' expr ')' is also a production. That would prevent LL(1) parsing, because the two productions start with the same sequence, but that's not an ambiguity. It doesn't affect bottom-up parsing in any way; the two productions are unambiguously recognisable.
Rather, the conflicts are the result of the parser not knowing whether (type)a+b means ((type)a+b or (type)(a+b), which is no different from the ambiguity of unary minus (should -a/b be parsed as (-a)/b or -(a/b)?), which is resolved by putting UMINUS at the end of the precedence list.
In the case of casts, you don't need to use a %prec declaration with a pseudo-token; that's only necessary for - because - could also be a binary operator, with a different (reduction) precedence. The precedence of the production:
expr: '(' type ')' expr
is ) (at least in yacc/bison), because that's the last terminal in the production. There's no need to give ) a shift precedence, because the grammar requires it to always be shifted.
Three notes:
Assignment is right-associative. a = b = 3 means a = (b = 3), not (a = b) = 3.
In the particular case of unary minus (and, by extension, unary plus if you feel like implementing it), there's a good argument for putting it ahead of exponentiation, so that -a**b is parsed as -(a**b). But that doesn't mean you should move other unary operators up from the end; (type)a**b should be parsed as ((type)a)**b. Nothing says that all unary operators have to have the same precedence.
When you add postfix operators -- notably function calls and array subscripts -- you will want to put them after the unary prefix operators. -a[3] most certainly does not mean (-a)[3]. These postfix operators are, in a way, duals of the prefix operators. As noted above, expr: '(' type ')' expr has precedence ')', which is only used as a reduction precedence. Conversely, expr: expr '(' expr-list ')' does not require a reduction precedence; the relevant token whose shift precedence needs to be declared is (.
So, according to all the above, your precedence declarations might be:
%right ASSIGNMENT
%left OR
%left AND
%left XOR
%left EQUALS DIFFERENT
%left LESS_THAN GREATER_THAN LESS_OR_EQUAL GREATER_OR_EQUAL
%left PLUS MINUS
%left TIMES DIV MOD
%right UMINUS
%right POWER
%right NOT CLOSE_PAREN
%right OPEN_PAREN OPEN_BRACKET
I listed all the unary operators using right associativity, which is somewhat arbitrary; either %left or %right would have the same effect, since it is impossible for a unary operator to compete with another instance of the same operator for the same operand; for unary operators, only the precedence level makes any difference. But it's customary to mark unary operators with %right.
Bison allows the use of %precedence to declare precedence levels for operators which have no associativity, but Jison doesn't have that feature. Both Bison and Jison do allow the use of %nonassoc, but that's very different: it says that it is a syntax error if either operand to the operator is an application of the same operator. That restriction is, for example, sometimes applied to comparison operators, in order to make a < b < c a syntax error.
Usually the way this problem is handled is by having type names as distinct keywords that can't be expressions by themselves. That way, after seeing an (, the next token being a type means it is a cast and the next token being an identifier means it is an expression, so there is no ambiguity.
However, your grammar appears to allow type names (INT, DOUBLE, etc) as expressions. This doesn't make a lot of sense, and causes your parsing problem, as differentiating between a cast and a parenthesized expression will require more lookahead.
The easiest fix would be to remove these productions (though you should still have something like expr : CONSTANT_LITERAL for literal constants)
I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operand). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generated GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
The language I'm writing a parser for has three constructs that are relevant here: the ord operator, represented by TOK_ORD, which casts character expressions into integer expressions, and [ ] and ., which are used for index and field access, respectively, just like in C.
Here's an excerpt from my precedence rules:
%right TOK_ORD
%left PREC_INDEX PREC_MEMBER
My grammar has a nonterminal expr, which represents expressions. Here's some relevant snippets from the grammar (TOK_IDENT is a regular expression for identifiers defined in my .l file):
expr : TOK_ORD expr { /* semantic actions */ }
| variable { /* semantic actions */ }
;
variable : expr '[' expr ']' %prec PREC_INDEX { /* semantic actions */ }
| expr '.' TOK_IDENT %prec PREC_MEMBER { /* semantic actions */ }
;
Here's information about the shift/reduce conflict from the bison output file:
state 61
42 expr: expr . '=' expr
43 | expr . TOK_EQ expr
44 | expr . TOK_NE expr
45 | expr . TOK_LT expr
46 | expr . TOK_LE expr
47 | expr . TOK_GT expr
48 | expr . TOK_GE expr
49 | expr . '+' expr
50 | expr . '-' expr
51 | expr . '*' expr
52 | expr . '/' expr
53 | expr . '%' expr
57 | TOK_ORD expr .
72 variable: expr . '[' expr ']'
73 | expr . '.' TOK_IDENT
'[' shift, and go to state 92
'.' shift, and go to state 93
'[' [reduce using rule 57 (expr)]
'.' [reduce using rule 57 (expr)]
$default reduce using rule 57 (expr)
States 92 and 93, for reference:
state 92
72 variable: expr '[' . expr ']'
TOK_FALSE shift, and go to state 14
TOK_TRUE shift, and go to state 15
TOK_NULL shift, and go to state 16
TOK_NEW shift, and go to state 17
TOK_IDENT shift, and go to state 53
TOK_INTCON shift, and go to state 19
TOK_CHARCON shift, and go to state 20
TOK_STRINGCON shift, and go to state 21
TOK_ORD shift, and go to state 22
TOK_CHR shift, and go to state 23
'+' shift, and go to state 24
'-' shift, and go to state 25
'!' shift, and go to state 26
'(' shift, and go to state 29
expr go to state 125
allocator go to state 44
call go to state 45
callargs go to state 46
variable go to state 47
constant go to state 48
state 93
73 variable: expr '.' . TOK_IDENT
I don't understand why there is a shift/reduce conflict. My grammar clearly defines index and field access to have higher precedence than ord, but that doesn't seem to be enough.
In case you're wondering, yes, this is homework, but the assignment has already been turned in and graded. I'm going back through and trying to fix shift/reduce conflicts so I can better understand what's going on.
Precedence resolution of shift/reduce conflicts works by comparing the precedence of a production (or reduction, if you prefer) with the precedence of a token (the lookahead token).
This simple fact is slightly obscured because bison sets the precedence of a production based on the precedence of the last token in the production (by default), so it appears that precedence values are only being assigned to tokens and the precedence comparison is between token precedences. But that's not accurate: the precedence comparison is always between a production (which might be reduced) and a token (which might be shifted).
As the bison manual says:
…the resolution of conflicts works by comparing the
precedence of the rule being considered with that of the lookahead
token. If the token's precedence is higher, the choice is to shift.
If the rule's precedence is higher, the choice is to reduce.
Now, you define the precedence of the two variable productions using explicit declarations:
variable : expr '[' expr ']' %prec PREC_INDEX { /* semantic actions */ }
| expr '.' TOK_IDENT %prec PREC_MEMBER { /* semantic actions */ }
But those productions never participate in shift/reduce conflicts. When the end of either of the variable rules is reached, there is no possibility of shifting. The itemsets in which those productions can be reduced have no shift actions.
In the state which contains the shift/reduce conflict, the conflict is between the potential reduction:
expr: TOK_ORD expr
and the possible shifts involving lookahead tokens . and [. These conflicts will be resolved by comparing the precedence of the TOK_ORD reduction with the precedences of the lookahead tokens . and [, respectively. If those tokens do not have declared precedences, then the shift/reduce conflict cannot be resolved using the precedence mechanism.
In that case, though, I would expect there to be a large number of shift/reduce conflicts in other states, where the options are to reduce a binary operator (such as expr: expr '+' expr) or to shift the ./[ to extend the rightmost expr. So if there are no such shift/reduce conflicts, the explanation is more complicated and I'd need to see a lot more of the grammar to understand it.
Let's imagine I want to be able to parse values like this (each line is a separate example):
x
(x)
((((x))))
x = x
(((x))) = x
(x) = ((x))
I've written this YACC grammar:
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'
But I get a reduce/reduce conflict:
$ bison example.y
example.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Any hint as to how to solve it? I am using GNU bison 3.0.2
Reduce/reduce conflicts often mean there is a fundamental problem in the grammar.
The first step in resolving is to get the output file (bison -v example.y produces example.output). Bison 2.3 says (in part):
state 7
4 Expr: Id .
6 Pattern: Id .
'=' reduce using rule 6 (Pattern)
')' reduce using rule 4 (Expr)
')' [reduce using rule 6 (Pattern)]
$default reduce using rule 4 (Expr)
The conflict is clear; after the grammar reads an x (and reduces that to an Id) and a ), it doesn't know whether to reduce the expression as an Expr or as a Pattern. That presents a problem.
I think you should rewrite the grammar without one of Expr and Pattern:
%%
Line: Binding | Expr
Binding: Expr '=' Expr
Expr: Id | '(' Expr ')'
Id: 'x'
Your grammar is not LR(k) for any k. So you either need to fix the grammar or use a GLR parser.
Suppose the input starts with:
(((((((((((((x
Up to here, there is no problem, because every character has been shifted onto the parser stack.
But now what? At the next step, x must be reduced and the lookahead is ). If there is an = somewhere in the future, x is a Pattern. Otherwise, it is an Expr.
You can fix the grammar by:
getting rid of Pattern and changing Binding to Expr | Expr '=' Expr;
getting rid of all the definitions of Expr and replacing them with Expr: Pattern
The second alternative is probably better in the long run, because it is likely that in the full grammar which you are imagining (or developing), Pattern is a subset of Expr, rather than being identical to Expr. Factoring Expr into a unit production for Pattern and the non-Pattern alternatives will allow you to parse the grammar with an LALR(1) parser (if the rest of the grammar conforms).
Or you can use a GLR grammar, as noted above.
I have a YACC grammar for parsing expressions in C++. Here is lite version:
// yacc.y
%token IDENT
%%
expr:
call_expr
| expr '<' call_expr
| expr '>' call_expr
;
call_expr:
IDENT
| '(' expr ')'
| IDENT '<' args '>' '(' args ')'
;
args:
IDENT
| args ',' IDENT
;
%%
When I want to support function call with template arguments, I got a shift/reduce conflict.
When we got input IDENT '<' IDENT, yacc doesn't know whether we should shift or reduce.
I want IDENT '<' args '>' '(' args ')' got higher precedence level than expr '<' call_expr, so I can parse the following exprs.
x < y
f<x>(a,b)
f<x,y>(a,b) < g<x,y>(c,d)
I see C++/C# both support this syntax. Is there any way to solve this problem with yacc?
How do I modify the .y file?
Thank you!
You want the -v option to yacc/bison. It will give you a .output file with all the info about the generated shift/reduce parser. With your grammar, bison gives you:
State 1 conflicts: 1 shift/reduce
:
state 1
4 call_expr: IDENT .
6 | IDENT . '<' args '>' '(' args ')'
'<' shift, and go to state 5
'<' [reduce using rule 4 (call_expr)]
$default reduce using rule 4 (call_expr)
which shows you where the problem is. After seeing an IDENT, when the next token is <, it doesn't know if it should reduce that call_expr (to ultimately match the rule expr: expr '<' call_expr) or if it should shift to match rule 6.
Parsing this with only 1 token lookahead is hard, as you have two distinct meanings of the token < (less-than or open angle bracket), and which is meant depends on the later tokens.
This case is actually even worse, as it is ambiguous, as an input like
a < b > ( c )
is probably a template call with both args lists being singletons, but it could also be
( a < b ) > ( c )
so just unfactoring the grammar won't help. Your best bet is to use a more powerful parsing method, like bison's %glr-parser option or btyacc