I have a YACC grammar for parsing expressions in C++. Here is lite version:
// yacc.y
%token IDENT
%%
expr:
call_expr
| expr '<' call_expr
| expr '>' call_expr
;
call_expr:
IDENT
| '(' expr ')'
| IDENT '<' args '>' '(' args ')'
;
args:
IDENT
| args ',' IDENT
;
%%
When I want to support function call with template arguments, I got a shift/reduce conflict.
When we got input IDENT '<' IDENT, yacc doesn't know whether we should shift or reduce.
I want IDENT '<' args '>' '(' args ')' got higher precedence level than expr '<' call_expr, so I can parse the following exprs.
x < y
f<x>(a,b)
f<x,y>(a,b) < g<x,y>(c,d)
I see C++/C# both support this syntax. Is there any way to solve this problem with yacc?
How do I modify the .y file?
Thank you!
You want the -v option to yacc/bison. It will give you a .output file with all the info about the generated shift/reduce parser. With your grammar, bison gives you:
State 1 conflicts: 1 shift/reduce
:
state 1
4 call_expr: IDENT .
6 | IDENT . '<' args '>' '(' args ')'
'<' shift, and go to state 5
'<' [reduce using rule 4 (call_expr)]
$default reduce using rule 4 (call_expr)
which shows you where the problem is. After seeing an IDENT, when the next token is <, it doesn't know if it should reduce that call_expr (to ultimately match the rule expr: expr '<' call_expr) or if it should shift to match rule 6.
Parsing this with only 1 token lookahead is hard, as you have two distinct meanings of the token < (less-than or open angle bracket), and which is meant depends on the later tokens.
This case is actually even worse, as it is ambiguous, as an input like
a < b > ( c )
is probably a template call with both args lists being singletons, but it could also be
( a < b ) > ( c )
so just unfactoring the grammar won't help. Your best bet is to use a more powerful parsing method, like bison's %glr-parser option or btyacc
Related
I'm working on a small translator in JISON, but I've run into a problem when trying to implement the cast of expressions, since it generates an ambiguity in the grammar when trying to add the production of cast. I need to add the productions to the cast option, so in principle I should have something like this:
expr: OPEN_PAREN type CLOSE_PAREN expr
However, since in my grammar I must be able to have expressions in parentheses, I already have the following production, so the grammar is now ambiguous:
expr: '(' expr ')'
Initially I had the following grammar for expressions:
expr : expr PLUS expr
| expr MINUS expr
| expr TIMESexpr
| expr DIV expr
| expr MOD expr
| expr POWER expr
| MINUS expr %prec UMINUS
| expr LESS_THAN expr
| expr GREATER_THAN expr
| expr LESS_OR_EQUAL expr
| expr GREATER_OR_EQUAL expr
| expr EQUALS expr
| expr DIFFERENT expr
| expr OR expr
| expr AND expr
| NOT expr
| OPEN_PAREN expr CLOSE_PAREN
| INT_LITERAL
| DOUBLE_LITERAL
| BOOLEAN_LITERAL
| CHAR_LITERAL
| STRING_LTIERAL
| ID;
Ambiguity was handled by applying the following precedence and associativity rules:
%left 'ASSIGNEMENT'
%left 'OR'
%left 'AND'
%left 'XOR'
%left 'EQUALS', 'DIFFERENT'
%left 'LESS_THAN ', 'GREATER_THAN ', 'LESS_OR_EQUAL ', 'GREATER_OR_EQUAL '
%left 'PLUS', 'MINUS'
%left 'TIMES', 'DIV', 'MOD'
%right 'POWER'
%right 'UMINUS', 'NOT'
I can't find a way to write a production that allows me to add the cast without falling into an ambiguity. Is there a way to modify this grammar without having to write an unambiguous grammar? Is there a way I can resolve this issue using JISON, which I may not have been able to see?
Any ideas are welcome.
This is what I was trying, however it's still ambiguous:
expr: OPEN_PAREN type CLOSE_PAREN expr
| OPEN_PAREN expr CLOSE_PAREN
The problem is that you don't specify the precedence of the cast operator, which is effectively a unary operator whose precedence should be the same as any other unary operator, such as NOT. (See below for a discussion of UMINUS.)
The parsing conflicts you received are not related to the fact that expr: '(' expr ')' is also a production. That would prevent LL(1) parsing, because the two productions start with the same sequence, but that's not an ambiguity. It doesn't affect bottom-up parsing in any way; the two productions are unambiguously recognisable.
Rather, the conflicts are the result of the parser not knowing whether (type)a+b means ((type)a+b or (type)(a+b), which is no different from the ambiguity of unary minus (should -a/b be parsed as (-a)/b or -(a/b)?), which is resolved by putting UMINUS at the end of the precedence list.
In the case of casts, you don't need to use a %prec declaration with a pseudo-token; that's only necessary for - because - could also be a binary operator, with a different (reduction) precedence. The precedence of the production:
expr: '(' type ')' expr
is ) (at least in yacc/bison), because that's the last terminal in the production. There's no need to give ) a shift precedence, because the grammar requires it to always be shifted.
Three notes:
Assignment is right-associative. a = b = 3 means a = (b = 3), not (a = b) = 3.
In the particular case of unary minus (and, by extension, unary plus if you feel like implementing it), there's a good argument for putting it ahead of exponentiation, so that -a**b is parsed as -(a**b). But that doesn't mean you should move other unary operators up from the end; (type)a**b should be parsed as ((type)a)**b. Nothing says that all unary operators have to have the same precedence.
When you add postfix operators -- notably function calls and array subscripts -- you will want to put them after the unary prefix operators. -a[3] most certainly does not mean (-a)[3]. These postfix operators are, in a way, duals of the prefix operators. As noted above, expr: '(' type ')' expr has precedence ')', which is only used as a reduction precedence. Conversely, expr: expr '(' expr-list ')' does not require a reduction precedence; the relevant token whose shift precedence needs to be declared is (.
So, according to all the above, your precedence declarations might be:
%right ASSIGNMENT
%left OR
%left AND
%left XOR
%left EQUALS DIFFERENT
%left LESS_THAN GREATER_THAN LESS_OR_EQUAL GREATER_OR_EQUAL
%left PLUS MINUS
%left TIMES DIV MOD
%right UMINUS
%right POWER
%right NOT CLOSE_PAREN
%right OPEN_PAREN OPEN_BRACKET
I listed all the unary operators using right associativity, which is somewhat arbitrary; either %left or %right would have the same effect, since it is impossible for a unary operator to compete with another instance of the same operator for the same operand; for unary operators, only the precedence level makes any difference. But it's customary to mark unary operators with %right.
Bison allows the use of %precedence to declare precedence levels for operators which have no associativity, but Jison doesn't have that feature. Both Bison and Jison do allow the use of %nonassoc, but that's very different: it says that it is a syntax error if either operand to the operator is an application of the same operator. That restriction is, for example, sometimes applied to comparison operators, in order to make a < b < c a syntax error.
Usually the way this problem is handled is by having type names as distinct keywords that can't be expressions by themselves. That way, after seeing an (, the next token being a type means it is a cast and the next token being an identifier means it is an expression, so there is no ambiguity.
However, your grammar appears to allow type names (INT, DOUBLE, etc) as expressions. This doesn't make a lot of sense, and causes your parsing problem, as differentiating between a cast and a parenthesized expression will require more lookahead.
The easiest fix would be to remove these productions (though you should still have something like expr : CONSTANT_LITERAL for literal constants)
I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operand). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generated GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
I am new to bison and I am trying to make a grammar parsing expressions.
I am facing a shift/reduce conflight right now I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions, first a single variable, and second a parenthesized expression. To avoid this problem you need to have some token which separates expression (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.
I am defining a grammar in ANTLR that will express an expression which includes logical operator and parenthesis together.
Here is the grammar
grammar simpleGrammar;
/* This will be the entry point of the parser. */
parse
:
expression EOF
;
expression
:
expression binOp expression | ID | unOp (expression) | '(' expression ')'
;
binOp
:
('AND' | 'OR')
;
unOp
:
'NOT'
;
ID :
('a'..'z' | 'A'..'Z')+
;
The defined grammar can able to express parse tree without parenthesis but when I input an example with parenthesis for example, (Apple OR Bananana)AND Orange
It is showing MismatchedTokenException
So, It will be really appreciated if someone explains how to define the grammar in order to express the parenthesis.
You forgot to tell ANTLR what to do with whitespace. For example:
WS : [ \t\r\n] -> skip;
Add this and you grammar will work.
As a side note, your grammar has the same precedence for the AND and OR operators. And these operators have higher precedence than NOT. As this goes against conventional rules, I'd advise you to write your expression rule like this instead:
expression
: '(' expression ')' # parenExp
| 'NOT' expression # notExpr
| expression 'AND' expression # andExpr
| expression 'OR' expression # orExpr
| ID # atomExpr
;
Let's imagine I want to be able to parse values like this (each line is a separate example):
x
(x)
((((x))))
x = x
(((x))) = x
(x) = ((x))
I've written this YACC grammar:
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'
But I get a reduce/reduce conflict:
$ bison example.y
example.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Any hint as to how to solve it? I am using GNU bison 3.0.2
Reduce/reduce conflicts often mean there is a fundamental problem in the grammar.
The first step in resolving is to get the output file (bison -v example.y produces example.output). Bison 2.3 says (in part):
state 7
4 Expr: Id .
6 Pattern: Id .
'=' reduce using rule 6 (Pattern)
')' reduce using rule 4 (Expr)
')' [reduce using rule 6 (Pattern)]
$default reduce using rule 4 (Expr)
The conflict is clear; after the grammar reads an x (and reduces that to an Id) and a ), it doesn't know whether to reduce the expression as an Expr or as a Pattern. That presents a problem.
I think you should rewrite the grammar without one of Expr and Pattern:
%%
Line: Binding | Expr
Binding: Expr '=' Expr
Expr: Id | '(' Expr ')'
Id: 'x'
Your grammar is not LR(k) for any k. So you either need to fix the grammar or use a GLR parser.
Suppose the input starts with:
(((((((((((((x
Up to here, there is no problem, because every character has been shifted onto the parser stack.
But now what? At the next step, x must be reduced and the lookahead is ). If there is an = somewhere in the future, x is a Pattern. Otherwise, it is an Expr.
You can fix the grammar by:
getting rid of Pattern and changing Binding to Expr | Expr '=' Expr;
getting rid of all the definitions of Expr and replacing them with Expr: Pattern
The second alternative is probably better in the long run, because it is likely that in the full grammar which you are imagining (or developing), Pattern is a subset of Expr, rather than being identical to Expr. Factoring Expr into a unit production for Pattern and the non-Pattern alternatives will allow you to parse the grammar with an LALR(1) parser (if the rest of the grammar conforms).
Or you can use a GLR grammar, as noted above.