I have this grammar for a C#-like language and I want to build a parser for it, but when I feed the grammar to the parser generator it reports shift/reduce conflicts. I fixed some of them, but I can't find a way to improve the grammar further. Any help would be greatly appreciated :D Here's the grammar:
Program: Decl
| Program Decl
;
Decl: VariableDecl
| FunctionDecl
| ClassDecl
| InterfaceDecl
;
VariableDecl: Variable SEMICOLON
;
Variable: Type IDENTIFIER
;
Type: TOKINT
| TOKDOUBLE
| TOKBOOL
| TOKSTRING
| IDENTIFIER
| Type BRACKETS
;
FunctionDecl: Type IDENTIFIER OPARENS Formals CPARENS StmtBlock
| TOKVOID IDENTIFIER OPARENS Formals CPARENS StmtBlock
;
Formals: VariablePlus
| /* epsilon */
;
VariablePlus: Variable
| VariablePlus COMMA Variable
;
ClassDecl: TOKCLASS IDENTIFIER OptExtends OptImplements OBRACE ListaField CBRACE
;
OptExtends: TOKEXTENDS IDENTIFIER
| /* epsilon */
;
OptImplements: TOKIMPLEMENTS ListaIdent
| /* epsilon */
;
ListaIdent: ListaIdent COMMA IDENTIFIER
| IDENTIFIER
;
ListaField: ListaField Field
| /* epsilon */
;
Field: VariableDecl
| FunctionDecl
;
InterfaceDecl: TOKINTERFACE IDENTIFIER OBRACE ListaProto CBRACE
;
ListaProto: ListaProto Prototype
| /* epsilon */
;
Prototype: Type IDENTIFIER OPARENS Formals CPARENS SEMICOLON
| TOKVOID IDENTIFIER OPARENS Formals CPARENS SEMICOLON
;
StmtBlock: OBRACE ListaOptG CBRACE
;
ListaOptG: /* epsilon */
| VariableDecl ListaOptG
| Stmt ListaOptG
;
Stmt: OptExpr SEMICOLON
| IfStmt
| WhileStmt
| ForStmt
| BreakStmt
| ReturnStmt
| PrintStmt
| StmtBlock
;
OptExpr: Expr
| /* epsilon */
;
IfStmt: TOKIF OPARENS Expr CPARENS Stmt OptElse
;
OptElse: TOKELSE Stmt
| /* epsilon */
;
WhileStmt: TOKWHILE OPARENS Expr CPARENS Stmt
;
ForStmt: TOKFOR OPARENS OptExpr SEMICOLON Expr SEMICOLON OptExpr CPARENS Stmt
;
ReturnStmt: TOKRETURN OptExpr SEMICOLON
;
BreakStmt: TOKBREAK SEMICOLON
;
PrintStmt: TOKPRINT OPARENS ListaExprPlus CPARENS SEMICOLON
;
ListaExprPlus: Expr
| ListaExprPlus COMMA Expr
;
Expr: LValue LOCATION Expr
| Constant
| LValue
| TOKTHIS
| Call
| OPARENS Expr CPARENS
| Expr PLUS Expr
| Expr MINUS Expr
| Expr TIMES Expr
| Expr DIVIDED Expr
| Expr MODULO Expr
| MINUS Expr
| Expr LESSTHAN Expr
| Expr LESSEQUALTHAN Expr
| Expr GREATERTHAN Expr
| Expr GREATEREQUALTHAN Expr
| Expr EQUALS Expr
| Expr NOTEQUALS Expr
| Expr AND Expr
| Expr OR Expr
| NOT Expr
| TOKNEW OPARENS IDENTIFIER CPARENS
| TOKNEWARRAY OPARENS Expr COMMA Type CPARENS
| TOKREADINTEGER OPARENS CPARENS
| TOKREADLINE OPARENS CPARENS
| TOKMALLOC OPARENS Expr CPARENS
;
LValue: IDENTIFIER
| Expr PERIOD IDENTIFIER
| Expr OBRACKET Expr CBRACKET
;
Call: IDENTIFIER OPARENS Actuals CPARENS
| Expr PERIOD IDENTIFIER OPARENS Actuals CPARENS
| Expr PERIOD LibCall OPARENS Actuals CPARENS
;
LibCall: TOKGETBYTE OPARENS Expr CPARENS
| TOKSETBYTE OPARENS Expr COMMA Expr CPARENS
;
Actuals: ListaExprPlus
| /* epsilon */
;
Constant: INTCONSTANT
| DOUBLECONSTANT
| BOOLCONSTANT
| STRINGCONSTANT
| TOKNULL
;
The old Bison version on my school's server reports 241 shift/reduce conflicts for this grammar. One is the dangling if/else. Factoring the else branch out into "OptElse" does NOT solve it. Instead, write out separate IfStmt and IfElseStmt productions and use Bison's %nonassoc and %prec directives to resolve the conflict.
Your expressions cause almost all of the other 240 conflicts. You need to either force precedence rules (messy and a terrible idea) or break your arithmetic expressions into layers like:
AddSubtractExpr: AddSubtractExpr PLUS MultDivExpr | ....
;
MultDivExpr: MultDivExpr TIMES Factor | ....
;
Factor: Variable | LPAREN Expr RPAREN | call | ...
;
Since Bison produces a bottom up parser, something like this will give you correct order of operations. If you have a copy of the first edition of the Dragon Book, you should look at the grammar in Appendix A. I believe the 2nd edition also has similar rules for simple expressions.
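To see why layering the expression rules yields the right order of operations, here is a minimal sketch in Python. It is a hand-written recursive-descent parser, not the LALR parser Bison would generate, so the left recursion in AddSubtractExpr and MultDivExpr is written as a loop. The names tokenize, Parser and parse are made up for this illustration, and only integers, + - * / and parentheses are handled.

```python
import re

def tokenize(text):
    # Integers, the four operators and parentheses; whitespace is skipped.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def add_sub_expr(self):
        # AddSubtractExpr: AddSubtractExpr (PLUS | MINUS) MultDivExpr | MultDivExpr
        node = self.mult_div_expr()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.mult_div_expr())
        return node

    def mult_div_expr(self):
        # MultDivExpr: MultDivExpr (TIMES | DIVIDED) Factor | Factor
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # Factor: INTCONSTANT | '(' AddSubtractExpr ')'
        tok = self.take()
        if tok == "(":
            node = self.add_sub_expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.add_sub_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Because a multiplication can only appear inside a MultDivExpr, parse("1 + 2 * 3") groups 2 * 3 first, and the loops make both levels left-associative, which is what the left-recursive productions express in the Bison grammar.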
Conflicts (shift/reduce or reduce/reduce) mean that your grammar is not LALR(1), so bison can't handle it directly without help. There are a number of immediately obvious problems:
expression ambiguity -- there's no precedence in the grammar, so things like a + b * c are ambiguous. You can fix this by adding precedence rules, or by splitting the Expr rule into separate AdditiveExpr, MultiplicativeExpr, ConditionalExpr etc. rules.
dangling else ambiguity -- if (a) if (b) x; else y; -- the else could be matched with either if. You can either ignore this if the default shift is correct (it usually is for this specific case, but ignoring errors is always dangerous) or split the Stmt rule.
There are many books on grammars and parsing that will help with this.
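The dangling else case can be made concrete with a small Python sketch. It assumes a toy statement language with the conditions left out (stmt : 'if' stmt | 'if' stmt 'else' stmt | NAME), and parse_stmt is a hypothetical helper, not anything Bison produces. Taking the else as soon as it appears corresponds to the default shift, which attaches the else to the nearest if.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos).

    Toy grammar (conditions omitted for brevity):
        stmt : 'if' stmt | 'if' stmt 'else' stmt | NAME
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "if":
        then_branch, pos = parse_stmt(tokens, pos + 1)
        # Greedy choice: if an 'else' follows, attach it to this 'if',
        # which is the innermost one still open. This mirrors an LR
        # parser preferring shift over reduce.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", then_branch, else_branch), pos
        return ("if", then_branch, None), pos
    if tok == "else":
        raise SyntaxError("'else' without a matching 'if'")
    # Any other token is treated as a simple statement name.
    return tok, pos + 1
```

For the token sequence if if x else y, the else ends up on the inner if, and the outer if has no else branch. That is the reading most languages want, which is why the default shift usually gives the right result here.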
I have the following expression group where everything is thrown into the same expr rule:
grammar MyGrammar;
expr
: '(' expr ')'
// BoolExpressions -- cannot move these out or else get Left-Recursion
| expr ('=' | '!=') expr
| expr 'AND' expr
| expr 'OR' expr
| ATOM
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
It works, but I would like to extract the boolExpression stuff so that I can use that separately, as some other rules I have must use a boolean expression rather than any expression. However, as soon as I do that I get a left-recursion error.
What would be a good way to break this up, so that I can separate the BooleanExpression stuff? Ideally, I would like it to "look like this":
grammar MyGrammar;
expr
: '(' expr ')'
| boolExpr
| ATOM
;
boolExpr
: expr ('=' | '!=') expr
| expr 'AND' expr
| expr 'OR' expr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
// error(119): The following sets of rules are
// mutually left-recursive [expr, boolExpr]
It doesn’t quite get you a single boolExpr, but you should consider labeled alternatives:
grammar MyGrammar;
expr
// ANTLR 4 requires labeling either all alternatives of a rule or none
: '(' expr ')' # parenExpr
// BoolExpressions -- cannot move these out or else get Left-Recursion
| expr ('=' | '!=') expr # compareExpr
| expr 'AND' expr # andExpr
| expr 'OR' expr # orExpr
| ATOM # atomExpr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
This creates separate *Context classes for each alternative which significantly reduces the complexity of contexts your listeners and visitors will deal with (there will be more of them, though, obviously). Symbols are also scoped to each alternative so you can do something like:
grammar MyGrammar;
expr
// ANTLR 4 requires labeling either all alternatives of a rule or none
: '(' expr ')' # parenExpr
// BoolExpressions -- cannot move these out or else get Left-Recursion
| lhs=expr ('=' | '!=') rhs=expr # compareExpr
| lhs=expr 'AND' rhs=expr # andExpr
| lhs=expr 'OR' rhs=expr # orExpr
| ATOM # atomExpr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
I am translating the rules of my grammar into an AST.
Is it necessary to use the "and" operator in defining our AST?
For instance, I have translated my grammar thus far like so:
type program =
| Decls of typ * identifier * decls_prime
type typ =
| INT
| BOOL
| VOID
type identifier = string
(* decls_prime = vdecl decls | fdecl decls *)
type declsprime =
| Vdecl of variabledeclaration * decls
| Fdecl of functiondeclaration * decls
(*“lparen” formals_opt “rparen” “LBRACE” vdecl_list stmt_list “RBRACE”*)
type functiondeclaration =
| Fdecl of variabledeclarationlist * stmtlist
(*formals_opt = formal_list | epsilon *)
type FormalsOpt =
|FormalsOpt of formallist
(* typ “ID” formal_list_prime *)
type formalList =
| FormalList of typ * identifier * formallistprime
type formallistprime =
| FormalListPrime of formalList
type variabledeclarationlist =
| VdeclList of variabledeclaration * variabledeclarationlist
(*stmt stmt_list | epsilon*)
type stmtlist =
| StmtList of stmt * stmtlist
| StmtlistNil
(* stmt = “RETURN” stmt_prime| expr SEMI |“LBRACE” stmt_list RBRACE| IF LPAREN expr RPAREN stmt stmt_prime_prime| FOR LPAREN expr_opt SEMI expr SEMI expr_opt RPAREN stmt| WHILE LPAREN expr RPAREN stmt*)
type Stmt
| Return of stmtprime
| Expression of expr
| StmtList of stmtlist
| IF of expr * stmt * stmtprimeprime
| FOR of expropt * expr * expropt * stmt
| WHILE of expr * stmt
(*stmt_prime = SEMI| expr SEMI*)
type stmtprime
| SEMI
| Expression of expr
(*NOELSE | ELSE stmt*)
type stmtprimeprime
| NOELSE
| ELSE of stmt
(* Expr_opt = expr | epsilon *)
type expropt =
| Expression of expr
| ExprNil
type Expr
type ExprPrime
(* Actuals_opt = actuals_list | epsilon *)
type ActualsOpt=
| ActualsList of actualslist
| ActualsNil
type ActualsList =
| ActualsList of expr * actualslistprime
(*actualslistprime = COMMA expr actuals_list_prime | epsilon*)
type actualslistprime =
| ActualsListPrime of expr * actualslistprime
| ALPNil
But it looks as though this example from Illinois uses a slightly different structure:
type program = Program of (class_decl list)
and class_decl = Class of id * id * (var_decl list) * (method_decl list)
and method_decl = Method....
Is it necessary to use "and" when defining my AST? And moreover, is it wrong for me to use a StmtList type rather than (stmt list) even though I call the AST StmtList method correctly in my parser?
You only need and when your definitions are mutually recursive. That is, if a statement could contain an expression and an expression could in turn contain a statement, then Expr and Stmt would have to be connected with an and. Otherwise it is enough to define each type before the types that use it. If your code compiles without and, you don't need the and.
PS: This is unrelated to your question, but I think it would make a lot more sense to use the list and option types than to define your own versions for specific types (such as stmtlist, expropt etc.). stmtprime is another such case: You could just define Return as Return of expr option and get rid of the stmtprime type. Same with stmtprimeprime.
Currently, I've just defined simple rules in ANTLR4:
// Recognizer Rules
program : (class_dcl)+ EOF;
class_dcl: 'class' ID ('extends' ID)? '{' class_body '}';
class_body: (const_dcl|var_dcl|method_dcl)*;
const_dcl: ('static')? 'final' PRIMITIVE_TYPE ID '=' expr ';';
var_dcl: ('static')? id_list ':' type ';';
method_dcl: PRIMITIVE_TYPE ('static')? ID '(' para_list ')' block_stm;
para_list: (para_dcl (';' para_dcl)*)?;
para_dcl: id_list ':' PRIMITIVE_TYPE;
block_stm: '{' '}';
expr: <assoc=right> expr '=' expr | expr1;
expr1: term ('<' | '>' | '<=' | '>=' | '==' | '!=') term | term;
term: ('+'|'-') term | term ('*'|'/') term | term ('+'|'-') term | fact;
fact: INTLIT | FLOATLIT | BOOLLIT | ID | '(' expr ')';
type: PRIMITIVE_TYPE ('[' INTLIT ']')?;
id_list: ID (',' ID)*;
// Lexer Rules
KEYWORD: PRIMITIVE_TYPE | BOOLLIT | 'class' | 'extends' | 'if' | 'then' | 'else'
| 'null' | 'break' | 'continue' | 'while' | 'return' | 'self' | 'final'
| 'static' | 'new' | 'do';
SEPARATOR: '[' | ']' | '{' | '}' | '(' | ')' | ';' | ':' | '.' | ',';
OPERATOR: '^' | 'new' | '=' | UNA_OPERATOR | BIN_OPERATOR;
UNA_OPERATOR: '!';
BIN_OPERATOR: '+' | '-' | '*' | '\\' | '/' | '%' | '>' | '>=' | '<' | '<='
| '==' | '<>' | '&&' | '||' | ':=';
PRIMITIVE_TYPE: 'integer' | 'float' | 'bool' | 'string' | 'void';
BOOLLIT: 'true' | 'false';
FLOATLIT: [0-9]+ ((('.'[0-9]* (('E'|'e')('+'|'-')?[0-9]+)? ))|(('E'|'e')('+'|'-')? [0-9]+));
INTLIT: [0-9]+;
STRINGLIT: '"' ('\\'[bfrnt\\"]|~[\r\t\n\\"])* '"';
ILLEGAL_ESC: '"' (('\\'[bfrnt\\"]|~[\n\\"]))* ('\\'(~[bfrnt\\"]))
{if (true) throw new bkool.parser.IllegalEscape(getText());};
UNCLOSED_STRING: '"'('\\'[bfrnt\\"]|~[\r\t\n\\"])*
{if (true) throw new bkool.parser.UncloseString(getText());};
COMMENT: (BLOCK_COMMENT|LINE_COMMENT) -> skip;
BLOCK_COMMENT: '(''*'(('*')?(~')'))*'*'')';
LINE_COMMENT: '#' (~[\n])* ('\n'|EOF);
ID: [a-zA-Z_]+ [a-zA-Z_0-9]* ;
WS: [ \t\r\n]+ -> skip ;
ERROR_TOKEN: . {if (true) throw new bkool.parser.ErrorToken(getText());};
I opened the parse tree, and tried to test:
class abc
{
final integer x=1;
}
It returned errors:
BKOOL::program:3:8: mismatched input 'integer' expecting PRIMITIVE_TYPE
BKOOL::program:3:17: mismatched input '=' expecting {':', ','}
I still don't understand why. Could you please help me see why the rules and tokens weren't recognized as I expected?
Lexer rules are exclusive. The longest wins, and the tiebreaker is the grammar order.
In your case, integer is lexed as a KEYWORD rather than a PRIMITIVE_TYPE.
What you should do here:
Make one distinct token per keyword instead of an all-catching KEYWORD rule.
Turn PRIMITIVE_TYPE into a parser rule
Same for operators
Right now, your example:
class abc
{
final integer x=1;
}
Gets converted to lexemes such as:
class ID { final KEYWORD ID = INTLIT ; }
This is thanks to the implicit token typing, as you've used definitions such as 'class' in your parser rules. These get converted to anonymous tokens such as T_001 : 'class'; which get the highest priority.
If this weren't the case, you'd end up with:
KEYWORD ID SEPARATOR KEYWORD KEYWORD ID OPERATOR INTLIT ; SEPARATOR
And that's... not quite easy to parse ;-)
That's why I'm telling you to break down your tokens properly.
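The "longest match wins, grammar order breaks ties" rule can be sketched in a few lines of Python. This is only an illustration of the rule, not how ANTLR is implemented: RULES is a hand-picked subset of the grammar above (the quoted entries stand in for the implicit tokens created from literals in the parser rules, placed first), and tokenize is a hypothetical helper.

```python
import re

# Lexer rules in priority order. Quoted names model the implicit tokens
# that literals such as 'class' in parser rules turn into.
RULES = [
    ("'class'", r"class"),
    ("'final'", r"final"),
    ("'='", r"="),
    ("';'", r";"),
    ("'{'", r"\{"),
    ("'}'", r"\}"),
    ("KEYWORD", r"integer|float|bool|string|void|true|false|class|final"),
    ("PRIMITIVE_TYPE", r"integer|float|bool|string|void"),
    ("INTLIT", r"[0-9]+"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
]

def tokenize(text, rules=RULES):
    """Return the list of token type names for text."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens
```

With these rules, integer matches KEYWORD, PRIMITIVE_TYPE and ID with the same length, so the first of them, KEYWORD, wins, and the parser never sees a PRIMITIVE_TYPE. Calling tokenize with the KEYWORD entry removed from the list makes the same input come out as PRIMITIVE_TYPE.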
I'm writing an ANTLR lexer/parser for a context-free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure whether a function call should be regarded as an expression.
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : ( expression ( ',' expression )* )? ;
I am writing a simple parser in bison. The parser checks whether a program has any syntax errors with respect to my following grammar:
%{
#include <stdio.h>
void yyerror (const char *s) /* Called by yyparse on error */
{
printf ("%s\n", s);
}
%}
%token tNUM tINT tREAL tIDENT tINTTYPE tREALTYPE tINTMATRIXTYPE
%token tREALMATRIXTYPE tINTVECTORTYPE tREALVECTORTYPE tTRANSPOSE
%token tIF tENDIF tDOTPROD tEQ tNE tGTE tLTE tGT tLT tOR tAND
%left "(" ")" "[" "]"
%left "<" "<=" ">" ">="
%right "="
%left "+" "-"
%left "*" "/"
%left "||"
%left "&&"
%left "==" "!="
%% /* Grammar rules and actions follow */
prog: stmtlst ;
stmtlst: stmt | stmt stmtlst ;
stmt: decl | asgn | if;
decl: type vars "=" expr ";" ;
type: tINTTYPE | tINTVECTORTYPE | tINTMATRIXTYPE | tREALTYPE | tREALVECTORTYPE
| tREALMATRIXTYPE ;
vars: tIDENT | tIDENT "," vars ;
asgn: tIDENT "=" expr ";" ;
if: tIF "(" bool ")" stmtlst tENDIF ;
expr: tIDENT | tINT | tREAL | vectorLit | matrixLit | expr "+" expr| expr "-" expr
| expr "*" expr | expr "/" expr| expr tDOTPROD expr | transpose ;
transpose: tTRANSPOSE "(" expr ")" ;
vectorLit: "[" row "]" ;
matrixLit: "[" row ";" rows "]" ;
row: value | value "," row ;
rows: row | row ";" rows ;
value: tINT | tREAL | tIDENT ;
bool: comp | bool tAND bool | bool tOR bool ;
comp: expr relation expr ;
relation: tGT | tLT | tGTE | tLTE | tNE | tEQ ;
%%
int main ()
{
if (yyparse()) {
// parse error
printf("ERROR\n");
return 1;
}
else {
// successful parsing
printf("OK\n");
return 0;
}
}
The code may look long and complicated. I don't think my question needs all of it, but I preferred to include it anyway. I am sure my grammar is correct, but ambiguous. When I try to generate the parser by running "bison -d filename.y", I get a message saying "conflicts: 13 shift/reduce". I defined the precedence of the operators at the beginning of this file, and I tried a lot of combinations of these precedences, but I still get this message. How can I remove this ambiguity? Thank you
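tOR, tAND, and tDOTPROD need to have their precedence specified as well. The rules use these named tokens, so the %left lines for the string literals "||" and "&&" do not give them a precedence.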
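Why a token without a declared precedence leaves the conflict in place can be sketched as follows. This Python function follows the way yacc-style precedence resolves a shift/reduce conflict (compare the precedence of the rule, taken from its last terminal, with that of the lookahead token); the table contents, the level numbers and the name resolve are assumptions made for this illustration, not part of the grammar above.

```python
# Illustrative precedence table: token -> (level, associativity).
# A higher level binds tighter. The levels are assumptions for this sketch.
PRECEDENCE = {
    "+": (1, "left"),
    "-": (1, "left"),
    "*": (2, "left"),
    "/": (2, "left"),
}

def resolve(table, rule_token, lookahead):
    """Resolve a shift/reduce conflict between a rule and a lookahead token.

    rule_token is the last terminal of the rule that could be reduced,
    which is where a yacc-style rule takes its precedence from.
    """
    if rule_token not in table or lookahead not in table:
        # No precedence on one side: the conflict stays unresolved
        # and is reported by the generator.
        return "conflict"
    rule_level, _ = table[rule_token]
    look_level, look_assoc = table[lookahead]
    if look_level > rule_level:
        return "shift"
    if look_level < rule_level:
        return "reduce"
    # Same level: associativity decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[look_assoc]
```

For example, resolve(PRECEDENCE, "+", "*") gives "shift" and resolve(PRECEDENCE, "*", "+") gives "reduce", but resolve(PRECEDENCE, "+", "tDOTPROD") gives "conflict" because tDOTPROD is not in the table. Once an entry for tDOTPROD is added, every pairing with the other operators resolves to a shift or a reduce.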