Why doesn't Bison accept this grammar file? - parsing

When I use the command bison -d -o parser.java parser.y to generate a parser from my grammar file parser.y, Bison produces the following error:
:8.8-10: syntax error, unexpected string, expecting char or identifier or type
Here is the file parser.y:
%{
import java.util.;
import java.io.;
%}
%start PROGRAM
%token number identifier function break call if else let read return while write
%token "(" ")" "{" "}" ";" "=" "+" "-" "" "/" "%" "<" ">" " <= " " >= " "==" "!=" "&" "|" "~" "!"
%left "+" "-"
%left "" "/" "%"
%left "&" "|"
%nonassoc "!"
%type <Node> PROGRAM FUNCTION PARAMLIST BLOCK STATEMENT IF ELSE EXPR
%type <String> identifier
%type <Integer> number
%union {
Node node;
String identifier;
int number;
}
%%
PROGRAM:
| PROGRAM FUNCTION
| BLOCK
;
FUNCTION:
function identifier '(' PARAMLIST ')' BLOCK
;
PARAMLIST:
identifier
| identifier ',' PARAMLIST
|
;
BLOCK:
'{' STATEMENT '}'
;
STATEMENT:
BREAK
| CALL ';'
| IF
| LET
| READ
| RETURN
| WHILE
| WRITE
;
BREAK:
break ';'
;
CALL:
call identifier '(' ARGLIST ')'
;
ARGLIST:
EXPR
| EXPR ',' ARGLIST
|
;
IF:
if EXPR BLOCK ELSE
;
ELSE:
else BLOCK
|
;
LET:
let identifier '=' EXPR ';'
| let identifier '=' CALL ';'
;
READ:
read identifier ';'
;
RETURN:
return EXPR ';'
;
WHILE:
while EXPR BLOCK
;
WRITE:
write EXPR ';'
;
EXPR:
number
| identifier
| '(' EXPR ')'
| '!' EXPR
| '~' EXPR
| EXPR '+' EXPR
| EXPR '-' EXPR
| EXPR '*' EXPR
| EXPR '/' EXPR
| EXPR '%' EXPR
| EXPR '&' EXPR
| EXPR '|' EXPR
| EXPR '<' EXPR
| EXPR '>' EXPR
| EXPR "<=" EXPR
| EXPR ">=" EXPR
| EXPR "==" EXPR
| EXPR "!=" EXPR
;
%%
int yyerror(String s) {
System.err.println("error: " + s);
}

Bison doesn't allow you to declare quoted token names (such as "(") with the %token declaration. It knows they are tokens; they cannot be anything else.
You use the %token declaration to declare symbolic names for tokens, which you will find useful when writing your lexer. In the declaration, the symbolic name comes first, optionally followed by the double-quoted alias. You can repeat that as often as you like. For example, you could write:
%token TK_LE "<=" TK_GE ">="
You can then use either the symbolic name or the alias in your grammar, but using the alias makes your grammar more readable. Also, Bison uses the alias when constructing error messages, which is a good thing since "expecting TK_SEMIC" is not a great way to communicate with a user that a ";" was required.
Keep in mind that a single-quoted single character token, such as '(', is not the same token as the double-quoted alias. In your grammar, you use '(' but attempt to declare "(". Had you succeeded in declaring "(", you would have gotten an "unused token" warning. Since '(' doesn't require a symbolic name, you can just remove the declaration. You will only need them for multicharacter tokens like "<=". (Note that spaces are significant inside quotes. " <= " is not the same as "<=".)
Symbolic token names are used as Java values, so their names cannot conflict with variables or Java keywords. You cannot, for example, use break as a symbolic token name. Trying to do so will cause compilation errors.
For this reason, it's customary to write token names in ALL_CAPS, and non-terminals in lower case. Non-terminals names are not used in the generated code, so you can use whatever names you wish.
You reverse this convention, which will cause a variety of errors when you compile the generated parser, and which is hard to read for those of us accustomed to the standard style.
A couple of other notes:
The bison Java interface does not use a %union declaration. The %type declarations are sufficient.
You are missing precedence declarations for many operators, particularly comparison operators. That will lead to a large number of parser conflicts. Make sure you write the precedence levels in the correct order.

Related

why are these bison rules useless?

I am trying to create a simple compiler using flex and bison, however all the rules i've written down have shown to be useless "14 useless nonterminals and 66 useless rules". What makes a rule useless and is there a way to fix it?
File: Decl_Class'*' 'EOF'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Field: Variable
|Constructor
|Method
Variable: Modifier'*' Expr_type Decl_variables
Decl_variables: Decl_variable
|Decl_variable ',' Decl_variables
Decl_variable: IDENT
|IDENT '=' Expr
Constructor: Modifier'*' IDENT '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Method: Modifier'*' '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Modifier: IDENT
Expr: IDENT
Params: '(' Expr_type ')' IDENT | '(' Expr_type ')' IDENT ',' Params
Expr_type: BOOLEAN | INT | DOUBLE | IDENT | INTEGER | REAL | TRU | FALS
| THIS | NULLVAL
| '(' Expr ')'
| Access|Access '=' Expr|Access '(' L_Expr '?' ')'
| NEW IDENT '(' L_Expr '?' ')'
| '+''+'Expr | '-''-'Expr | Expr'+''+' | Expr'-''-'
| '!'Expr | '-'Expr | '+'Expr
| Expr Operator Expr
| '(' Expr_type ')' Expr_type
Operator: "==" | "!=" | "<" | "<=" | ">" | ">=" | "+" | "-" | "*" | "/" | "%" | "&&" | "||"
Access: IDENT | Expr '.' IDENT
L_Expr: Expr | Expr ',' L_Expr
Instruction: ';'
| Expr';'
| Expr_type Decl_variables';'
| IF '(' Expr ')' Instruction
| IF '(' Expr ')' Instruction ELSE Instruction
| WHILE '(' Expr ')' Instruction
| FOR '(' L_Expr '?' ';' Expr '?' ';' L_Expr '?'')' Instruction
| FOR '(' Expr ')' Decl_variables ';' Expr '?' ';' L_Expr '?'')' Instruction { Instructions'*' }
| RETURN Expr '?'';'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Here the part inside the {} is a code action (which will produce a syntax error when compiled), not a reference to the Field non-terminal. So Field is never actually used and neither are the non-terminal referenced by it. That's what makes them useless: they're never used.
PS: At various places in your grammar you're using '*' and '?' in a way that suggests the intention may be to match zero or more or zero or one items respectively. Be aware that all '*' and '?' do is to match a token with the given value. There is no syntactic shortcut to repeat something or make it optional in bison - you'll need to define separate non-terminals for that.
PPS: In most (all?) languages that have ++ and -- operators, those consist of a single token not two subsequent '+' or '-' tokens (so - -x would be double negation and only --x without the space between the -s would be a decrement). So your rules for the decrement and increment operators are unusual in that regard.

YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operand). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generated GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.

ANTLR4 parser rules with other parser rules as arguments (meta-rules)

I would like to be able to write a "meta-rule" in ANTLR4 that takes a rule as an input argument and performs a set modification to that rule. Here's an example grammar:
grammar G;
WS: [ \t\n\r] + -> skip;
CHAR: [a-z];
term: (CHAR)+;
sum: term ('+' term)+;
pterm: '(' term ')' | '(' pterm ')';
psum: '(' sum ')' | '(' psum ')';
expr: term | sum | pterm | psum;
The rules for pterm and psum perform the same action on term and sum, enclosing them in possibly nested parentheses. I would like to be able to replace the last three lines above with something like the following:
enclose[rule]: '(' rule ')' | '(' enclose(rule) ')';
expr: term | sum | enclose(term) | enclose(sum);
Is there a way to construct a meta-rule like this?
The short answer is, no.
Better to resolve by refactoring the grammar and identifying the structurally significant terms:
expr: LPAREN sum RPAREN | LPAREN expr RPAREN ;
sum : term ('+' term)* ; // changed to Kleene star
term: CHAR+ ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : [a-z] ;
WS : [ \t\n\r]+ -> skip ;
The sum rule will consume all terms, so the expr rule only needs to handle sums.

How to solve SHIFT/REDUCE conflict - in parser generator

I need help to solve this one and explanation how to deal with this SHIFT/REDUCE CONFLICTS in future.
I have some conflicts between few states in my cup file.
Grammer look like this:
I have conflicts between "(" [ActPars] ")" states.
1. Statement = Designator ("=" Expr | "++" | "‐‐" | "(" [ActPars] ")" ) ";"
2. Factor = number | charConst | Designator [ "(" [ActPars] ")" ].
I don't want to paste whole 700 lines of cup file.
I will give you the relevant states and error output.
This is code for the line 1.)
Matched ::= Designator LPAREN ActParamsList RPAREN SEMI_COMMA
ActParamsList ::= ActPars
|
/* EPS */
;
ActPars ::= Expr
|
Expr ActPComma
;
ActPComma ::= COMMA ActPars;
This is for the line 2.)
Factor ::= Designator ActParamsOptional ;
ActParamsOptional ::= LPAREN ActParamsList2 RPAREN
|
/* EPS */
;
ActParamsList2 ::= ActPars
|
/* EPS */
;
Expr ::= SUBSTRACT Term RepOptionalExpression
|
Term RepOptionalExpression
;
The ERROR output looks like this:
Warning : *** Shift/Reduce conflict found in state #182
between ActParamsOptional ::= LPAREN ActParamsList RPAREN (*)
and Matched ::= Designator LPAREN ActParamsList RPAREN (*) SEMI_COMMA
under symbol SEMI_COMMA
Resolved in favor of shifting.
Error : * More conflicts encountered than expected -- parser generation aborted
I believe the problem is that your parser won't know if it should shift to the token:
SEMI_COMMA
or reduce to the token
ActParamsOptional
since the tokens defined in both ActParamsOptional and Matched are
LPAREN ActPars RPAREN

How to fix YACC shift/reduce conflicts from post-increment operator?

I'm writing a grammar in YACC (actually Bison), and I'm having a shift/reduce problem. It results from including the postfix increment and decrement operators. Here is a trimmed down version of the grammar:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%right PREINC
%left POSTINC
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr %prec PREINC
| DEC expr %prec PREINC
| expr INC %prec POSTINC
| expr DEC %prec POSTINC
| '(' expr ')'
;
%%
Bison tells me there are 12 shift/reduce conflicts, but if I comment out the lines for the postfix increment and decrement, it works fine. Does anyone know how to fix this conflict? At this point, I'm considering moving to an LL(k) parser generator, which makes it much easier, but LALR grammars have always seemed much more natural to write. I'm also considering GLR, but I don't know of any good C/C++ GLR parser generators.
Bison/Yacc can generate a GLR parser if you specify %glr-parser in the option section.
Try this:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%nonassoc '++' '--'
%left '('
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| '++' expr
| '--' expr
| expr '++'
| expr '--'
| '(' expr ')'
;
%%
The key is to declare postfix operators as non associative. Otherwise you would be able to
++var++--
The parenthesis also need to be given a precedence to minimize shift/reduce warnings
I like to define more items. You shouldn't need the %left, %right, %prec stuff.
simple_expr: NUMBER
| INC simple_expr
| DEC simple_expr
| '(' expr ')'
;
term: simple_expr
| term '*' simple_expr
| term '/' simple_expr
;
expr: term
| expr '+' term
| expr '-' term
;
Play around with this approach.
This basic problem is that you don't have a precedence for the INC and DEC tokens, so it doesn't know how to resolve ambiguities involving a lookahead of INC or DEC. If you add
%right INC DEC
at the end of the precedence list (you want unaries to be higher precedence and postfix higher than prefix), it will fix it, and you can even get rid of all the PREINC/POSTINC stuff, as it's irrelevant.
preincrement and postincrement operators have nonassoc so define that in the precedence section and in the rules make the precedence of these operators high by using %prec

Resources