reduce/reduce conflict in CUP - parsing

I am implementing a parser for a subset of Java using Java CUP.
The grammar is like
vardecl ::= type ID
type ::= ID | INT | FLOAT | ...
exp ::= ID | exp LBRACKET exp RBRACKET | ...
stmt ::= ID ASSIGN exp SEMI
This works fine, but when I add
stmt ::= ID ASSIGN exp SEMI
|ID LBRACKET exp RBRACKET ASSIGN exp SEMI
CUP won't work, the warnings are:
Warning : *** Shift/Reduce conflict found in state #122
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Reduce/Reduce conflict found in state #42
between type ::= identifier (*)
and exp ::= identifier (*)
under symbols: {}
Resolved in favor of the first production.
Warning : *** Shift/Reduce conflict found in state #42
between type ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #42
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
I think there are two problems:
1. type ::= ID and exp ::= ID: when the parser sees an ID, it wants to reduce it, but it doesn't know which to reduce it to, type or exp.
2. stmt ::= ID LBRACKET exp RBRACKET ASSIGN exp SEMI is for assignment to an array element, such as arr[key] = value;
exp ::= exp LBRACKET exp RBRACKET is for an expression that gets an element from an array, such as arr[key]
So in the case arr[key], when the parser sees arr, it knows that it is an ID, but it doesn't know whether it should shift or reduce to exp.
However, I have no idea how to fix this. Any advice would be appreciated, thanks a lot.

Your analysis is correct. The grammar is LR(2) because declarations cannot be identified until the ] token is seen, which will be the second-next token from the ID which could be a type.
One simple solution is to hack the lexer to return [] as a single token when the brackets appear as consecutive tokens. (The lexer should probably allow whitespace between the brackets, too, so it's not quite trivial but it's not complicated.) If a [ is not immediately followed by a ], the lexer will return it as an ordinary [. That makes it easy for the parser to distinguish between assignment to an array (which will have a [ token) and declaration of an array (which will have a [] token).
It's also possible to rewrite the grammar, but that's a real nuisance.
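The lexer hack described above can be sketched outside of JFlex. The following is a minimal illustration in Python, not the CUP/JFlex specification itself; the token names (BRACKETS, LBRACKET and so on) and the tokenize function are assumptions made for the example. It only merges brackets separated by whitespace; comments between the brackets are not handled.

```python
import re

# Hypothetical token names; a real JFlex spec would return the constants CUP generates.
TOKEN_RE = re.compile(r"""
    (?P<BRACKETS>\[\s*\])      # open and close bracket with only whitespace between
  | (?P<LBRACKET>\[)           # any other open bracket
  | (?P<RBRACKET>\])
  | (?P<ASSIGN>=)
  | (?P<SEMI>;)
  | (?P<NUM>\d+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Return the list of token names for text, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens

print(tokenize("Foo [ ] a;"))   # declaration: contains a BRACKETS token
print(tokenize("a[i] = 3;"))    # assignment: contains LBRACKET ... RBRACKET
```

Because the BRACKETS alternative is tried first, the parser sees a different token after the ID in a declaration than in an indexed assignment, so one token of lookahead is enough.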
The second problem -- array indexing assignment versus array indexed expressions. Normally programming languages allow assignment of the form:
exp [ exp ] = exp
and not just ID [ exp ]. Making this change will delay the need to reduce until late enough for the parser to identify the correct reduction. Depending on the language, it's possible that this syntax is not semantically meaningful but checking that is in the realm of type checking (semantics) not syntax. If there is some syntax of that form which is meaningful, however, there is no obvious reason to prohibit it.
Some parser generators implement GLR parsers. A GLR parser would have no problem with this grammar because it is not ambiguous. But CUP isn't such a generator.

Related

Remove ambiguity in grammar for expression casting

I'm working on a small translator in JISON, but I've run into a problem when trying to implement the cast of expressions, since it generates an ambiguity in the grammar when trying to add the production of cast. I need to add the productions to the cast option, so in principle I should have something like this:
expr: OPEN_PAREN type CLOSE_PAREN expr
However, since in my grammar I must be able to have expressions in parentheses, I already have the following production, so the grammar is now ambiguous:
expr: '(' expr ')'
Initially I had the following grammar for expressions:
expr : expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIV expr
| expr MOD expr
| expr POWER expr
| MINUS expr %prec UMINUS
| expr LESS_THAN expr
| expr GREATER_THAN expr
| expr LESS_OR_EQUAL expr
| expr GREATER_OR_EQUAL expr
| expr EQUALS expr
| expr DIFFERENT expr
| expr OR expr
| expr AND expr
| NOT expr
| OPEN_PAREN expr CLOSE_PAREN
| INT_LITERAL
| DOUBLE_LITERAL
| BOOLEAN_LITERAL
| CHAR_LITERAL
| STRING_LITERAL
| ID;
Ambiguity was handled by applying the following precedence and associativity rules:
%left 'ASSIGNMENT'
%left 'OR'
%left 'AND'
%left 'XOR'
%left 'EQUALS', 'DIFFERENT'
%left 'LESS_THAN', 'GREATER_THAN', 'LESS_OR_EQUAL', 'GREATER_OR_EQUAL'
%left 'PLUS', 'MINUS'
%left 'TIMES', 'DIV', 'MOD'
%right 'POWER'
%right 'UMINUS', 'NOT'
I can't find a way to write a production that allows me to add the cast without falling into an ambiguity. Is there a way to modify this grammar without having to write an unambiguous grammar? Is there a way I can resolve this issue using JISON, which I may not have been able to see?
Any ideas are welcome.
This is what I was trying, however it's still ambiguous:
expr: OPEN_PAREN type CLOSE_PAREN expr
| OPEN_PAREN expr CLOSE_PAREN
The problem is that you don't specify the precedence of the cast operator, which is effectively a unary operator whose precedence should be the same as any other unary operator, such as NOT. (See below for a discussion of UMINUS.)
The parsing conflicts you received are not related to the fact that expr: '(' expr ')' is also a production. That would prevent LL(1) parsing, because the two productions start with the same sequence, but that's not an ambiguity. It doesn't affect bottom-up parsing in any way; the two productions are unambiguously recognisable.
Rather, the conflicts are the result of the parser not knowing whether (type)a+b means ((type)a)+b or (type)(a+b), which is no different from the ambiguity of unary minus (should -a/b be parsed as (-a)/b or -(a/b)?), which is resolved by putting UMINUS at the end of the precedence list.
In the case of casts, you don't need to use a %prec declaration with a pseudo-token; that's only necessary for - because - could also be a binary operator, with a different (reduction) precedence. The precedence of the production:
expr: '(' type ')' expr
is ) (at least in yacc/bison), because that's the last terminal in the production. There's no need to give ) a shift precedence, because the grammar requires it to always be shifted.
Three notes:
Assignment is right-associative. a = b = 3 means a = (b = 3), not (a = b) = 3.
In the particular case of unary minus (and, by extension, unary plus if you feel like implementing it), there's a good argument for putting it ahead of exponentiation, so that -a**b is parsed as -(a**b). But that doesn't mean you should move other unary operators up from the end; (type)a**b should be parsed as ((type)a)**b. Nothing says that all unary operators have to have the same precedence.
When you add postfix operators -- notably function calls and array subscripts -- you will want to put them after the unary prefix operators. -a[3] most certainly does not mean (-a)[3]. These postfix operators are, in a way, duals of the prefix operators. As noted above, expr: '(' type ')' expr has precedence ')', which is only used as a reduction precedence. Conversely, expr: expr '(' expr-list ')' does not require a reduction precedence; the relevant token whose shift precedence needs to be declared is (.
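Python's own expression grammar happens to follow the conventions in these notes, so its standard ast module can be used to see the groupings. This is only an illustration of the intended parse shapes, not anything Jison produces; the helper name shape is made up for the example, it handles only the few node types needed here, and it assumes Python 3.9 or later (where a subscript's index is a plain expression node).

```python
import ast

OPS = {ast.Pow: "**", ast.Div: "/"}   # only the operators used in the examples

def shape(source):
    """Return a fully parenthesized rendering of a small Python expression."""
    def show(node):
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return f"(-{show(node.operand)})"
        if isinstance(node, ast.BinOp):
            return f"({show(node.left)}{OPS[type(node.op)]}{show(node.right)})"
        if isinstance(node, ast.Subscript):
            return f"({show(node.value)}[{show(node.slice)}])"
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        raise ValueError(f"unsupported node: {type(node).__name__}")
    return show(ast.parse(source, mode="eval").body)

print(shape("-a/b"))    # unary minus binds tighter than division
print(shape("-a**b"))   # exponentiation binds tighter than unary minus
print(shape("-a[3]"))   # subscripting binds tighter than unary minus
```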
So, according to all the above, your precedence declarations might be:
%right ASSIGNMENT
%left OR
%left AND
%left XOR
%left EQUALS DIFFERENT
%left LESS_THAN GREATER_THAN LESS_OR_EQUAL GREATER_OR_EQUAL
%left PLUS MINUS
%left TIMES DIV MOD
%right UMINUS
%right POWER
%right NOT CLOSE_PAREN
%right OPEN_PAREN OPEN_BRACKET
I listed all the unary operators using right associativity, which is somewhat arbitrary; either %left or %right would have the same effect, since it is impossible for a unary operator to compete with another instance of the same operator for the same operand; for unary operators, only the precedence level makes any difference. But it's customary to mark unary operators with %right.
Bison allows the use of %precedence to declare precedence levels for operators which have no associativity, but Jison doesn't have that feature. Both Bison and Jison do allow the use of %nonassoc, but that's very different: it says that it is a syntax error if either operand to the operator is an application of the same operator. That restriction is, for example, sometimes applied to comparison operators, in order to make a < b < c a syntax error.
Usually the way this problem is handled is by having type names as distinct keywords that can't be expressions by themselves. That way, after seeing an (, the next token being a type means it is a cast and the next token being an identifier means it is an expression, so there is no ambiguity.
However, your grammar appears to allow type names (INT, DOUBLE, etc) as expressions. This doesn't make a lot of sense, and causes your parsing problem, as differentiating between a cast and a parenthesized expression will require more lookahead.
The easiest fix would be to remove these productions (though you should still have something like expr : CONSTANT_LITERAL for literal constants)
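To make the keyword approach concrete, here is a minimal hand-written sketch in Python (not Jison). It assumes type names are reserved words that cannot be expressions, so the token right after an opening parenthesis decides between a cast and a parenthesized expression. The keyword set, the tuple-based tree format and the function name parse are all assumptions for the example, and only + is supported as a binary operator.

```python
TYPE_KEYWORDS = {"int", "double", "boolean", "char"}   # hypothetical reserved type names

def parse(tokens):
    """Parse a token list into nested tuples; a cast binds like a unary operator."""
    pos = 0

    def peek(offset=0):
        index = pos + offset
        return tokens[index] if index < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        pos += 1
        return token

    def expr():
        left = unary()
        while peek() == "+":
            take("+")
            left = ("+", left, unary())
        return left

    def unary():
        # After "(", a type keyword can only start a cast.
        if peek() == "(" and peek(1) in TYPE_KEYWORDS:
            take("(")
            type_name = take()
            take(")")
            return ("cast", type_name, unary())
        return primary()

    def primary():
        if peek() == "(":
            take("(")
            inner = expr()
            take(")")
            return inner
        token = take()
        if token in TYPE_KEYWORDS or token in {"+", ")"}:
            raise SyntaxError(f"unexpected token {token!r}")
        return token

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree

print(parse(["(", "int", ")", "a", "+", "b"]))   # cast applies to a only
print(parse(["(", "a", ")", "+", "b"]))          # plain parenthesized expression
```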

YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operator). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generate GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
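The Python behaviour described above is easy to check directly. The short sketch below uses the standard ast.literal_eval function to evaluate each literal and report its type; the helper name kind is made up for the example.

```python
import ast

def kind(source):
    """Return the type name of the value a Python literal evaluates to."""
    return type(ast.literal_eval(source)).__name__

for literal in ["(5)", "()", "(5,)", "(5,6)", "(5,6,)", "[]", "[5]", "[5,]"]:
    print(literal, "is a", kind(literal))

# The trailing comma does not change the value of a multi-element tuple.
print(ast.literal_eval("(5,6)") == ast.literal_eval("(5,6,)"))
```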

Solving shift/reduce conflict in expression grammar

I am new to bison and I am trying to make a grammar parsing expressions.
I am facing a shift/reduce conflict right now that I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions, first a single variable, and second a parenthesized expression. To avoid this problem you need to have some token which separates expressions (or expression statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
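The difference between the two formulations can be seen by comparing the strings each one accepts. The sketch below is a simplification in Python: every expression is abbreviated to the single letter e, and the two regular expressions are hand-derived equivalents of the productions under that assumption, not something Bison generates.

```python
import re

# Each expression is abbreviated to "e"; "," is the separator.
ORIGINAL = re.compile(r"e?(,e)*")     # expression_list with the empty alternative
REVISED = re.compile(r"(e(,e)*)?")    # arguments: %empty | expr_list

def accepts(pattern, argument_text):
    """Return True if the whole argument text is generated by the pattern."""
    return pattern.fullmatch(argument_text) is not None

for text in ["", "e", "e,e", ",e"]:
    print(repr(text), accepts(ORIGINAL, text), accepts(REVISED, text))
```

Only the original form accepts the argument text ,e (a leading comma); both accept the empty list and ordinary comma-separated lists.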
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.

shift/reduce Error with Cup

Hi, I am writing a parser for a programming language my university uses, with JFlex and CUP.
I started with just the first basic structures, such as processes and variable declarations.
I get the following errors:
Warning : *** Shift/Reduce conflict found in state #4
between vardecls ::= (*)
and vardecl ::= (*) IDENT COLON vartyp SEMI
and vardecl ::= (*) IDENT COLON vartyp EQEQ INT SEMI
under symbol IDENT
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #2
between vardecls ::= (*)
and vardecl ::= (*) IDENT COLON vartyp SEMI
and vardecl ::= (*) IDENT COLON vartyp EQEQ INT SEMI
under symbol IDENT
Resolved in favor of shifting.
My Code in Cup looks like this :
non terminal programm;
non terminal programmtype;
non terminal vardecl;
non terminal vardecls;
non terminal processdecl;
non terminal processdecls;
non terminal vartyp;
programm ::= programmtype:pt vardecls:vd processdecls:pd
{: RESULT = new SolutionNode(pt, vd, pd); :} ;
programmtype ::= IDENT:v
{: RESULT = ProblemType.KA; :} ;
vardecls ::= vardecl:v1 vardecls:v2
{: v2.add(v1);
RESULT = v2; :}
|
{: ArrayList<VarDecl> list = new ArrayList<VarDecl>() ;
RESULT = list; :}
;
vardecl ::= IDENT:id COLON vartyp:vt SEMI
{: RESULT = new VarDecl(id, vt); :}
| IDENT:id COLON vartyp:vt EQEQ INT:i1 SEMI
{: RESULT = new VarDecl(id, vt, i1); :}
;
vartyp ::= INTEGER
{: RESULT = VarType.Integer ; :}
;
processdecls ::= processdecl:v1 processdecls:v2
{: v2.add(v1);
RESULT = v2; :}
| {: ArrayList<ProcessDecl> list = new ArrayList<ProcessDecl>() ;
RESULT = list; :}
;
processdecl ::= IDENT:id COLON PROCESS vardecls:vd BEGIN END SEMI
{: RESULT = new ProcessDecl(id, vd); :}
;
I guess I get the errors because the process declaration and the variable declaration both start with an identifier, then a ":", and then either the terminal PROCESS or a terminal like INTEGER. If so, I'd like to know how I can tell my parser to look ahead a bit more, or whatever other solution is possible.
Thanks for your answers.
Your diagnosis is absolutely correct. Because the parser cannot know whether IDENT starts a processdecl or a vardecl without two more lookahead tokens, it cannot know when it has just reduced a vardecl and is looking at an IDENT whether it is about to see another vardecl or a processdecl.
In the first case, it must just shift the IDENT as part of the following vardecl. In the second case, it needs to first reduce an empty vardecls and then successively reduce vardecls until it has constructed the complete list.
To get rid of the shift reduce conflict, you need to simplify the parser's decision-making.
The simplest solution is to allow the parser to accept declarations in any order. Then you end up with something like this:
program ::= program_type declaration_list ;
declaration_list ::=
var_declaration declaration_list
| process_declaration declaration_list
|
;
var_declaration_list ::=
var_declaration var_declaration_list
|
;
process_declaration ::=
IDENT:id COLON PROCESS var_declaration_list BEGIN END SEMI ;
(Personally, I'd make the declaration lists left-recursive rather than right-recursive, but it depends whether you prefer to append or prepend in the list's action. Left-recursion uses less parser stack.)
If you really want to insist that all variable declarations come before any process declaration, you can check for that in the action for declaration_list.
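One way such an ordering check might look is sketched below in Python, purely for illustration; in CUP the same logic would be written in Java inside the action. The VarDecl and ProcessDecl classes here are stand-ins, and the sketch assumes the check runs over the completed list in source order rather than incrementally at each reduction.

```python
from dataclasses import dataclass

@dataclass
class VarDecl:          # stand-in for the AST class built by the var_declaration action
    name: str

@dataclass
class ProcessDecl:      # stand-in for the AST class built by the process_declaration action
    name: str

def check_declaration_order(declarations):
    """Raise ValueError if a variable declaration follows a process declaration."""
    seen_process = False
    for declaration in declarations:
        if isinstance(declaration, ProcessDecl):
            seen_process = True
        elif seen_process:
            raise ValueError(
                f"variable {declaration.name!r} declared after a process declaration"
            )
    return declarations

check_declaration_order([VarDecl("x"), VarDecl("y"), ProcessDecl("p")])   # accepted
```

The grammar stays permissive and conflict-free, while the restriction is reported as an ordinary semantic error.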
Alternatively, you can start by making both types of declaration list left-recursive instead of right recursive. That will almost work, but it will still generate a shift-reduce conflict in the same state as the original grammar, this time because it needs to reduce an empty process declaration list before the first process declaration can be reduced.
Fortunately, that's easier to work around. If the process declaration list cannot be empty, there is no problem, so it's just a question of rearranging the productions:
program ::= program_type var_declaration_list process_declaration_list
| program_type var_declaration_list
;
var_declaration_list ::=
var_declaration_list var_declaration
|
;
process_declaration_list ::=
process_declaration_list process_declaration
| process_declaration
;
Finally, an ugly but possible alternative is to make the variable declaration list left-recursive and the process declaration list right-recursive. In that case, there is no empty production between the last variable declaration and the first process declaration.

Shift/reduce conflict with expression call

When I try to compile this simple parser using Lemon, I get a conflict, but I can't see which rule is wrong. The conflict disappears if I remove either the binaryexpression or the callexpression.
%left Add.
program ::= expression.
expression ::= binaryexpression.
expression ::= callexpression.
binaryexpression ::= expression Add expression.
callexpression ::= expression arguments.
arguments ::= LParenthesis argumentlist RParenthesis.
arguments ::= LParenthesis RParenthesis.
argumentlist ::= expression argumentlist.
argumentlist ::= expression.
[edit] Adding a left-side associativity to LParenthesis has solved the conflict.
However, I would like to know whether it's the correct thing to do: I've seen that some grammars (e.g. C++) have a different precedence for the construction operator '()' and the call operator '()'. So I'm not sure about the right thing to do.
The problem is that the grammar is ambiguous. It is not possible to decide between reducing to binaryexpression or callexpression without looking at all the input sequence. The ambiguity is because of the left recursion over expression, which cannot be ended because expression cannot derive a terminal.
