I have the following bison grammar (as part of a more complex grammar):
classDeclaration : CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclarationList variableDeclaration
| /* empty */
;
variableDeclaration : type ID SEMICOLON
;
type : NATTYPE | ID
;
methodDeclarationList : methodDeclarationList methodDeclaration
| /* empty */
;
methodDeclaration : type ID LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
which is supposed to describe class declarations which look like this:
class foo extends object
{
nat number;
nat divide(nat aNumber)
{
0;
}
}
or this:
class foo extends object
{
nat divide(nat aNumber)
{
0;
}
}
or this:
class foo extends object
{
}
Problem is that there is ambiguity where variable declarations end and method declarations begin (2 shift/reduce conflicts). For example, the method declaration looks like a variable declaration until it sees the parenthesis.
How can I rewrite this grammar to eliminate this ambiguity?
To clarify: the class body can be empty, the only constraint is that variable declarations come before method declarations if there are any.
This isn't an ambiguity, its a lookahead problem. The problem is that you need 3 tokens of lookahead (up to the SEMICOLON or LPAREN of the next declaration) for the parser to figure out where the end of the variableDeclarationList is, as it needs to reduce an empty methodDeclarationList before it starts parsing more methodDeclarations.
The way to fix this is to remove the need for an empty reduction at the start of a method declaration list:
methodDeclarationList : nonEmptyMethodDeclarationList | /*empty */ ;
nonEmptyMethodDeclarationList : nonEmptyMethodDeclarationList methodDeclaration
| methodDeclaration
;
With this, the parser does not need to reduce an empty methodDeclarationList UNLESS there are no methods at all -- and in that case, only one token of lookahead is needed to see the RBRACE
Another way to do it is to not even have empty rules, and instead use multiple options, one with the nonterminal and one without.
classDeclaration : CLASS ID EXTENDS ID LBRACE RBRACE
| CLASS ID EXTENDS ID LBRACE methodDeclarationList RBRACE
| CLASS ID EXTENDS ID LBRACE variableDeclarationList RBRACE
| CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclaration
| variableDeclarationList variableDeclaration
;
variableDeclaration : type ID SEMICOLON
;
type : NATTYPE
| ID
;
methodDeclarationList : methodDeclaration
| methodDeclarationList methodDeclaration
;
methodDeclaration : type ID LPAREN RPAREN variableExpressionBlock
| type ID LPAREN parameterList RPAREN variableExpressionBlock
;
I am not familiar with bison, but have you tried making a rule for the common prefix of both rules? the "type ID" is present in both the variable and method patterns.
So if you had say:
typedId : type ID
;
and then
variableDeclaration : typedId SEMICOLON
;
methodDeclaration : typedId LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
this way the variable rule and method rule would not be looked at untill the common prefix is already pushed, and the next token whould be unambiguous.
I have not played with this sort of thing in years I hope this helps.
This slightly modified grammar works:
%token CLASS EXTENDS ID LBRACE RBRACE SEMICOLON NATTYPE LPAREN RPAREN DIGIT COMMA
%%
classDeclaration : CLASS ID EXTENDS ID LBRACE declarationList RBRACE
;
declarationList : /* Empty */
| declarationList declaration
;
declaration : variableDeclaration
| methodDeclaration
;
variableDeclaration : parameterDeclaration SEMICOLON
;
type : NATTYPE | ID
;
methodDeclaration : parameterDeclaration LPAREN parameterDeclarationList RPAREN
variableExpressionBlock
;
variableExpressionBlock : LBRACE DIGIT RBRACE
;
parameterDeclarationList : /* empty */
| parameterDeclarationList COMMA parameterDeclaration
;
parameterDeclaration : type ID
;
You probably want to rename the non-terminal 'parameterDeclaration' to something like 'singleVariableDeclaration', but by avoiding having two possibly empty rules in a row (the original 'variableDeclarationList' and 'methodDeclarationList', you avoid the ambiguity.
This does allow, syntactically, methods interleaved with variables in the class's declarationList. If that isn't acceptable for some reason, consider making that a semantic error rather than a syntactic error. If it must be a syntax error, then someone is going to have to do some thinking; I vote to make you do the thinking.
If you insist on at least one method declaration, then the grammar is unambiguous:
methodDeclarationList : methodDeclarationList methodDeclaration
| methodDeclaration /* empty */
;
If you try the same with a variable declaration list, the grammar still has two S/R conflicts.
One possibility, not to be completely ignored, is to use the Bison feature, %expect 2, to indicate that 2 shift/reduce conflicts are expected.
%expect 2
%token CLASS EXTENDS ID LBRACE RBRACE SEMICOLON NATTYPE LPAREN RPAREN DIGIT COMMA
%%
classDeclaration : CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclarationList variableDeclaration
| /* empty */
;
variableDeclaration : singleVariableDeclaration SEMICOLON
;
type : NATTYPE | ID
;
methodDeclarationList : methodDeclarationList methodDeclaration
| /* empty */
;
methodDeclaration : singleVariableDeclaration LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
variableExpressionBlock : LBRACE DIGIT RBRACE
;
parameterDeclarationList : /* empty */
| parameterDeclarationList COMMA parameterDeclaration
;
parameterDeclaration : singleVariableDeclaration
;
singleVariableDeclaration : type ID
;
Related
I am creating parser and lexer rules for Decaf programming language written in ANTLR4. I'm trying to parse a test file and keep getting an error, there must be something wrong in the grammar but i cant figure it out.
My test file looks like:
class Program {
int i[10];
}
The error is : line 2:8 mismatched input '10' expecting INT_LITERAL
And here is the full Decaf.g4 grammar file
grammar Decaf;
/*
LEXER RULES
-----------
Lexer rules define the basic syntax of individual words and symbols of a
valid Decaf program. Lexer rules follow regular expression syntax.
Complete the lexer rules following the Decaf Language Specification.
*/
CLASS : 'class';
INT : 'int';
RETURN : 'return';
VOID : 'void';
IF : 'if';
ELSE : 'else';
FOR : 'for';
BREAK : 'break';
CONTINUE : 'continue';
CALLOUT : 'callout';
TRUE : 'True' ;
FALSE : 'False' ;
BOOLEAN : 'boolean';
LCURLY : '{';
RCURLY : '}';
LBRACE : '(';
RBRACE : ')';
LSQUARE : '[';
RSQUARE : ']';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
EQ : '=';
SEMI : ';';
COMMA : ',';
AND : '&&';
LESS : '<';
GREATER : '>';
LESSEQUAL : '<=' ;
GREATEREQUAL : '>=' ;
EQUALTO : '==' ;
NOTEQUAL : '!=' ;
EXCLAMATION : '!';
fragment CHAR : (' '..'!') | ('#'..'&') | ('('..'[') | (']'..'~') | ('\\'[']) | ('\\"') | ('\\') | ('\t') | ('\n');
CHAR_LITERAL : '\'' CHAR '\'';
//STRING_LITERAL : '"' CHAR+ '"' ;
HEXMARK : '0x';
fragment HEXA : [a-fA-F];
fragment HEXDIGIT : DIGIT | HEXA ;
HEX_LITERAL : HEXMARK HEXDIGIT+;
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\';
fragment DIGIT : [0-9];
DECIMAL_LITERAL : DIGIT(DIGIT)*;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
fragment ALPHA : [a-zA-Z] | '_';
fragment ALPHA_NUM : ALPHA | DIGIT;
ID : ALPHA ALPHA_NUM*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
BOOL_LITERAL : TRUE | FALSE;
/*
PARSER RULES
------------
Parser rules are all lower case, and make use of lexer rules defined above
and other parser rules defined below. Parser rules also follow regular
expression syntax. Complete the parser rules following the Decaf Language
Specification.
*/
program : CLASS ID LCURLY field_decl* method_decl* RCURLY EOF;
field_name : ID | ID LSQUARE INT_LITERAL RSQUARE;
field_decl : datatype field_name (COMMA field_name)* SEMI;
method_decl : (datatype | VOID) ID LBRACE ((datatype ID) (COMMA datatype ID)*)? RBRACE block;
block : LCURLY var_decl* statement* RCURLY;
var_decl : datatype ID (COMMA ID)* SEMI;
datatype : INT | BOOLEAN;
statement : location assign_op expr SEMI
| method_call SEMI
| IF LBRACE expr RBRACE block (ELSE block)?
| FOR ID EQ expr COMMA expr block
| RETURN (expr)? SEMI
| BREAK SEMI
| CONTINUE SEMI
| block;
assign_op : EQ
| ADD EQ
| SUB EQ;
method_call : method_name LBRACE (expr (COMMA expr)*)? RBRACE
| CALLOUT LBRACE STRING(COMMA callout_arg (COMMA callout_arg)*) RBRACE;
method_name : ID;
location : ID | ID LSQUARE expr RSQUARE;
expr : location
| method_call
| literal
| expr bin_op expr
| SUB expr
| EXCLAMATION expr
| LBRACE expr RBRACE;
callout_arg : expr
| STRING ;
bin_op : arith_op
| rel_op
| eq_op
| cond_op;
arith_op : ADD | SUB | MUL | DIV | '%' ;
rel_op : LESS | GREATER | LESSEQUAL | GREATEREQUAL ;
eq_op : EQUALTO | NOTEQUAL ;
cond_op : AND | '||' ;
literal : INT_LITERAL | CHAR_LITERAL | BOOL_LITERAL ;
Whenever there are 2 or more lexer rules that match the same characters, the one defined first wins. In your case, these 2 rules both match 10:
DECIMAL_LITERAL : DIGIT(DIGIT)*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
and since INT_LITERAL is defined after DECIMAL_LITERAL, the lexer will never create a INT_LITERAL token. If you now try to use it in a parser rule, you get an error message you posted.
The solution: remove INT_LITERAL from your lexer and create a parser rule instead:
int_literal : DECIMAL_LITERAL | HEX_LITERAL;
and use int_literal in your parser rules instead.
The grammar below is failing to generate a parser with this error:
error(119): Sable.g4::: The following sets of rules are mutually left-recursive [expression]
1 error(s)
The error disappears when I comment out the embedded actions, but comes back when I make them effective.
Why are the actions bringing about such left-recursion that antlr4 cannot handle it?
grammar Sable;
options {}
#header {
package org.sable.parser;
import org.sable.parser.internal.*;
}
sourceFile : statement* EOF;
statement : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement : expression;
expression:
{System.out.println("w");} expression '+' expression {System.out.println("tf?");}
| IDENTIFIER
;
SEMICOLON : ';';
RPARENTH : '(';
LPARENTH : ')';
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WS : WhiteSpaceNotNewline -> channel(HIDDEN);
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
fragment DecimalDigit :'0'..'9';
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
;
fragment IdentifierCharacter:
DecimalDigit
| IdentifierHead
;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline : [\u0020\u000C\u0009u000B\u000C];
UPDATE: the following workaround kind of solves this specific situation, but not the case when init/after-like actions are needed on a lesser scope - per choice, not per rule.
expression
#init {
enable(Token.HIDDEN_CHANNEL);
}
#after {
disable(Token.HIDDEN_CHANNEL);
}:
expression '+' expression
| IDENTIFIER
;
I'm writing an ANTLR lexer/parser for context free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure if function call should be regarded as expression..
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : nothing | expression ( ',' expression )*
Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
#modifier{public}
#ctorModifier{public}
#lexer::namespace{Org.CSharp.Parsers}
#parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:
If I just add on to the following yacc file, will it turn into a parser?
/* C-Minus BNF Grammar */
%token ELSE
%token IF
%token INT
%token RETURN
%token VOID
%token WHILE
%token ID
%token NUM
%token LTE
%token GTE
%token EQUAL
%token NOTEQUAL
%%
program : declaration_list ;
declaration_list : declaration_list declaration | declaration ;
declaration : var_declaration | fun_declaration ;
var_declaration : type_specifier ID ';'
| type_specifier ID '[' NUM ']' ';' ;
type_specifier : INT | VOID ;
fun_declaration : type_specifier ID '(' params ')' compound_stmt ;
params : param_list | VOID ;
param_list : param_list ',' param
| param ;
param : type_specifier ID | type_specifier ID '[' ']' ;
compound_stmt : '{' local_declarations statement_list '}' ;
local_declarations : local_declarations var_declaration
| /* empty */ ;
statement_list : statement_list statement
| /* empty */ ;
statement : expression_stmt
| compound_stmt
| selection_stmt
| iteration_stmt
| return_stmt ;
expression_stmt : expression ';'
| ';' ;
selection_stmt : IF '(' expression ')' statement
| IF '(' expression ')' statement ELSE statement ;
iteration_stmt : WHILE '(' expression ')' statement ;
return_stmt : RETURN ';' | RETURN expression ';' ;
expression : var '=' expression | simple_expression ;
var : ID | ID '[' expression ']' ;
simple_expression : additive_expression relop additive_expression
| additive_expression ;
relop : LTE | '<' | '>' | GTE | EQUAL | NOTEQUAL ;
additive_expression : additive_expression addop term | term ;
addop : '+' | '-' ;
term : term mulop factor | factor ;
mulop : '*' | '/' ;
factor : '(' expression ')' | var | call | NUM ;
call : ID '(' args ')' ;
args : arg_list | /* empty */ ;
arg_list : arg_list ',' expression | expression ;
Heh
Its only a grammer of PL
To make it a parser you need to add some code into this.
Like there http://dinosaur.compilertools.net/yacc/index.html
Look at chapter 2. Actions
Also you'd need lexical analyzer -- 3: Lexical Analysis