ANTLR4-Mutually left-recursive grammar - parsing

i have the following antlr4 grammar and i get mutually left-recursive error , how can i fix it ?
expr : expr_prefix term ;
expr_prefix : expr_prefix term addop
| () ;
term : factor_prefix factor;
factor_prefix : factor_prefix factor mulop
| () ;
factor : primary
| call_expr ;
primary : ( expr )
| id
| INTLITERAL
| FLOATLITERAL ;

Related

Parse Rules Decaf grammar antlr4

I am creating parser and lexer rules for Decaf programming language written in ANTLR4. I'm trying to parse a test file and keep getting an error, there must be something wrong in the grammar but i cant figure it out.
My test file looks like:
class Program {
int i[10];
}
The error is : line 2:8 mismatched input '10' expecting INT_LITERAL
And here is the full Decaf.g4 grammar file
grammar Decaf;
/*
LEXER RULES
-----------
Lexer rules define the basic syntax of individual words and symbols of a
valid Decaf program. Lexer rules follow regular expression syntax.
Complete the lexer rules following the Decaf Language Specification.
*/
CLASS : 'class';
INT : 'int';
RETURN : 'return';
VOID : 'void';
IF : 'if';
ELSE : 'else';
FOR : 'for';
BREAK : 'break';
CONTINUE : 'continue';
CALLOUT : 'callout';
TRUE : 'True' ;
FALSE : 'False' ;
BOOLEAN : 'boolean';
LCURLY : '{';
RCURLY : '}';
LBRACE : '(';
RBRACE : ')';
LSQUARE : '[';
RSQUARE : ']';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
EQ : '=';
SEMI : ';';
COMMA : ',';
AND : '&&';
LESS : '<';
GREATER : '>';
LESSEQUAL : '<=' ;
GREATEREQUAL : '>=' ;
EQUALTO : '==' ;
NOTEQUAL : '!=' ;
EXCLAMATION : '!';
fragment CHAR : (' '..'!') | ('#'..'&') | ('('..'[') | (']'..'~') | ('\\'[']) | ('\\"') | ('\\') | ('\t') | ('\n');
CHAR_LITERAL : '\'' CHAR '\'';
//STRING_LITERAL : '"' CHAR+ '"' ;
HEXMARK : '0x';
fragment HEXA : [a-fA-F];
fragment HEXDIGIT : DIGIT | HEXA ;
HEX_LITERAL : HEXMARK HEXDIGIT+;
STRING : '"' (ESC|.)*? '"';
fragment ESC : '\\"' | '\\\\';
fragment DIGIT : [0-9];
DECIMAL_LITERAL : DIGIT(DIGIT)*;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
fragment ALPHA : [a-zA-Z] | '_';
fragment ALPHA_NUM : ALPHA | DIGIT;
ID : ALPHA ALPHA_NUM*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
BOOL_LITERAL : TRUE | FALSE;
/*
PARSER RULES
------------
Parser rules are all lower case, and make use of lexer rules defined above
and other parser rules defined below. Parser rules also follow regular
expression syntax. Complete the parser rules following the Decaf Language
Specification.
*/
program : CLASS ID LCURLY field_decl* method_decl* RCURLY EOF;
field_name : ID | ID LSQUARE INT_LITERAL RSQUARE;
field_decl : datatype field_name (COMMA field_name)* SEMI;
method_decl : (datatype | VOID) ID LBRACE ((datatype ID) (COMMA datatype ID)*)? RBRACE block;
block : LCURLY var_decl* statement* RCURLY;
var_decl : datatype ID (COMMA ID)* SEMI;
datatype : INT | BOOLEAN;
statement : location assign_op expr SEMI
| method_call SEMI
| IF LBRACE expr RBRACE block (ELSE block)?
| FOR ID EQ expr COMMA expr block
| RETURN (expr)? SEMI
| BREAK SEMI
| CONTINUE SEMI
| block;
assign_op : EQ
| ADD EQ
| SUB EQ;
method_call : method_name LBRACE (expr (COMMA expr)*)? RBRACE
| CALLOUT LBRACE STRING(COMMA callout_arg (COMMA callout_arg)*) RBRACE;
method_name : ID;
location : ID | ID LSQUARE expr RSQUARE;
expr : location
| method_call
| literal
| expr bin_op expr
| SUB expr
| EXCLAMATION expr
| LBRACE expr RBRACE;
callout_arg : expr
| STRING ;
bin_op : arith_op
| rel_op
| eq_op
| cond_op;
arith_op : ADD | SUB | MUL | DIV | '%' ;
rel_op : LESS | GREATER | LESSEQUAL | GREATEREQUAL ;
eq_op : EQUALTO | NOTEQUAL ;
cond_op : AND | '||' ;
literal : INT_LITERAL | CHAR_LITERAL | BOOL_LITERAL ;
Whenever there are 2 or more lexer rules that match the same characters, the one defined first wins. In your case, these 2 rules both match 10:
DECIMAL_LITERAL : DIGIT(DIGIT)*;
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
and since INT_LITERAL is defined after DECIMAL_LITERAL, the lexer will never create a INT_LITERAL token. If you now try to use it in a parser rule, you get an error message you posted.
The solution: remove INT_LITERAL from your lexer and create a parser rule instead:
int_literal : DECIMAL_LITERAL | HEX_LITERAL;
and use int_literal in your parser rules instead.

ANTLR4 precedence of operator

This is my grammar:
grammar FOOL;
#header {
import java.util.ArrayList;
}
#lexer::members {
public ArrayList<String> lexicalErrors = new ArrayList<>();
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
prog : exp SEMIC #singleExp
| let exp SEMIC #letInExp
| (classdec)+ SEMIC (let)? exp SEMIC #classExp
;
classdec : CLASS ID ( EXTENDS ID )? (LPAR (vardec ( COMMA vardec)*)? RPAR)? (CLPAR ((fun SEMIC)+)? CRPAR)?;
let : LET (dec SEMIC)+ IN ;
vardec : type ID ;
varasm : vardec ASM exp ;
fun : type ID LPAR ( vardec ( COMMA vardec)* )? RPAR (let)? exp ;
dec : varasm #varAssignment
| fun #funDeclaration
;
type : INT
| BOOL
| ID
;
exp : left=term (operator=(PLUS | MINUS) right=term)*
;
term : left=factor (operator=(TIMES | DIV) right=factor)*
;
factor : left=value (operator=(EQ | LESSEQ | GREATEREQ | GREATER | LESS | AND | OR ) right=value)*
;
value : MINUS?INTEGER #intVal
| (NOT)? ( TRUE | FALSE ) #boolVal
| LPAR exp RPAR #baseExp
| IF cond=exp THEN CLPAR thenBranch=exp CRPAR (ELSE CLPAR elseBranch=exp CRPAR)? #ifExp
| MINUS?ID #varExp
| THIS #thisExp
| funcall #funExp
| (ID | THIS) DOT funcall #methodExp
| NEW ID ( LPAR (exp (COMMA exp)* )? RPAR)? #newExp
| PRINT ( exp ) #print
;
/* PRINT LPAR exp RPAR */
funcall
: ID ( LPAR (exp (COMMA exp)* )? RPAR )
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
SEMIC : ';' ;
COLON : ':' ;
COMMA : ',' ;
EQ : '==' ;
ASM : '=' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
TRUE : 'true' ;
FALSE : 'false' ;
LPAR : '(' ;
RPAR : ')' ;
CLPAR : '{' ;
CRPAR : '}' ;
IF : 'if' ;
THEN : 'then' ;
ELSE : 'else' ;
PRINT : 'print' ;
LET : 'let' ;
IN : 'in' ;
VAR : 'var' ;
FUN : 'fun' ;
INT : 'int' ;
BOOL : 'bool' ;
CLASS : 'class' ;
EXTENDS : 'extends' ;
THIS : 'this' ;
NEW : 'new' ;
DOT : '.' ;
LESSEQ : ('<=' | '=<') ;
GREATEREQ : ('>=' | '=>') ;
GREATER: '>' ;
LESS : '<' ;
AND : '&&' ;
OR : '||' ;
NOT : '!' ;
//Numbers
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+;
//IDs
fragment CHAR : 'a'..'z' |'A'..'Z' ;
ID : CHAR (CHAR | DIGIT)* ;
//ESCAPED SEQUENCES
WS : (' '|'\t'|'\n'|'\r')-> skip;
LINECOMENTS : '//' (~('\n'|'\r'))* -> skip;
BLOCKCOMENTS : '/*'( ~('/'|'*')|'/'~'*'|'*'~'/'|BLOCKCOMENTS)* '*/' -> skip;
ERR_UNKNOWN_CHAR
: . { lexicalErrors.add("UNKNOWN_CHAR " + getText()); }
;
I think that there is a problem in the grammar concerning the precedence of operator.
In particular, this one
let
int x = (5-2)+4;
in
print x;
prints 7, while this one:
let
int x = 5-2+4;
in
print x;
prints 9.
Why the first one works? How can I make the second one working, only changing the grammar?
I think there is something to change in exp, term or factor.
This is the first parse tree http://it.tinypic.com/r/2nj8tqw/9 .
This is the second parse tree http://it.tinypic.com/r/2iv02z6/9 .
exp : left=term (operator=(PLUS | MINUS) right=exp)?
This produces parse tree that is causing it. Simply put, 5 - 2 + 4 will be parsed as:
term PLUS exp
2 term MINUS exp
2 term
4
This should help, although you'll have to change the evaluation logic:
exp : left=term (operator=(PLUS | MINUS) right=term)*
Same for factor and any other possible binary operations.

How to make certain rules mandatory in Antlr

I wrote the following grammar which should check for a conditional expression.
Examples below is what I want to achieve using this grammar:
test invalid
test = 1 valid
test = 1 and another_test>=0.2 valid
test = 1 kasd y = 1 invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) invalid (there cannot be a lonely character like 'c'. It should always be a triplet. i.e, literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is, how can I get the grammar to work for the examples mentioned above? Can we make certain words as mandatory between two triplets (literal operator literal)? In a sense I'm just trying to get a parser to validate the where clause condition but only simple condition and functions are permitted. I also want have a visitor that retrieves the values like function, parenthesis, any literal etc in Java, how to achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions -- you need to classify the functions/vars in your lexer and have a different terminal for each, which is tricky and error prone.
Instead, it is generally better to NOT do this kind of checking in the parser -- have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of operations and the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').

antlrworks with multiple equals

I'm having a problem with my grammar.
It says Decision can match input such as "{EQUAL, GREATER..GREATER_EQUAL, LOWER..LOWER_EQUAL, NOT_EQUAL}" using multiple alternatives: 1, 2, Although all the trees of rules are correct.
Anyone to help?!
grammar test;
//parser : .* EOF;
program :T_PROGRAM ID T_LEFTPAR identifier_list T_RIGHTPAR T_SEMICOLON declarations subprogram_declarations compound_statement T_DOT;
identifier_list :(ID) (',' ID)*;
declarations :() (T_VAR identifier_list COLON type T_SEMICOLON)* ;
type : standard_type|
T_ARRAY T_LEFTBRACK NUM T_TO NUM T_RIGHTBRACK T_OF standard_type ;
standard_type : INT
| FLOAT ;
subprogram_declarations :() (subprogram_declaration T_SEMICOLON)* ;
subprogram_declaration : subprogram_head declarations compound_statement;
subprogram_head :T_FUNCTION ID arguments COLON standard_type |
T_PROCEDURE ID arguments ;
arguments :T_LEFTPAR parameter_list T_RIGHTPAR | ;
parameter_list :(identifier_list COLON type) (T_SEMICOLON identifier_list COLON type)*;
compound_statement : T_BEGIN optional_statements T_END;
optional_statements :statement_list | ;
statement_list :(statement) (T_SEMICOLON statement)*;
statement :
variable ASSIGN expression
| procedure_statement
| compound_statement
| T_IF expression T_THEN statement T_ELSE statement
| T_WHILE expression T_DO statement
;
procedure_statement :ID
| ID T_LEFTPAR expression_list T_RIGHTPAR;
expression_list : (expression) (',' expression)*;
variable : ID T_LEFTBRACK expression T_RIGHTBRACK | ;
expression :( () |simple_expression) (( LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL ) simple_expression)* ;
simple_expression :
( () | sign ) term (( PLUS | MINUS | T_OR ) term)*;
term :
(factor) (( CROSS | DIVIDE | MOD | T_AND ) factor)*;
factor :
variable
|ID T_LEFTPAR expression_list T_RIGHTPAR
| NUM
| T_LEFTPAR expression T_RIGHTPAR
| T_NOT factor;
sign :
'+'
| '-';
/********/
T_PROGRAM : 'program';
T_FUNCTION : 'function';
T_PROCEDURE : 'procedure';
T_READ : 'read';
T_WRITE : 'write';
T_OF : 'of';
T_ARRAY : 'array';
T_VAR : 'var';
T_FLOAT : 'float';
T_INT : 'int';
T_CHAR : 'char';
T_STRING : 'string';
T_BEGIN : 'begin';
T_END : 'end';
T_IF : 'if';
T_THEN : 'then';
T_ELSE : 'else';
T_WHILE : 'while';
T_DO : 'do';
T_NOT : 'not';
NUM : INT
| FLOAT;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
LOWER : '<';
LOWER_EQUAL : '<=';
GREATER : '>';
GREATER_EQUAL : '>=';
EQUAL : '=';
NOT_EQUAL : '<>';
ASSIGN :
':=';
COLON :
':';
PLUS : '+';
MINUS : '-';
T_OR : 'OR';
CROSS : '*';
DIVIDE : '/';
MOD : 'MOD';
T_AND : 'AND';
T_LEFTPAR
: '(';
T_RIGHTPAR
: ')';
T_LEFTBRACK
: '[';
T_RIGHTBRACK
: ']';
T_TO
: '..';
T_DOT : '.';
T_SEMICOLON
: ';';
T_COMMA
: ',';
T_BADNUM
: (NUM)(CHAR)*;
T_BADSTRING
: '"' ( ESC_SEQ | ~('\\'|'"') )*WS;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
INT : '0'..'9'+
;
fragment
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
Problem 1
The ambiguity (the errors/warnings that certain input can be matched in more than 1 way) originate mainly from the fact that both your variable and expression rules can match empty input:
variable
: ID T_LEFTBRACK expression T_RIGHTBRACK
| /* nothing */
;
expression
: (
(/* nothing */)
| simple_expression
)
(
(LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL) simple_expression
)* /* because of the `*`, this can also match nothing */
;
(I added the /* ... */ comments and reformatted the rules to make them more readable)
Fix 1
You probably want to do it like this instead:
variable
: ID T_LEFTBRACK expression T_RIGHTBRACK
;
expression
: simple_expression ((LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL) simple_expression)*
;
Problem 2
Another problem (one that will show up once you would have resolved the ambiguities, is that you defined the tokens MOD, T_AND and T_OR after your ID rule:
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
...
MOD : 'MOD';
T_AND : 'AND';
T_OR : 'OR';
This will cause MOD, T_AND and T_OR to be never created since ID matches these characters too.
Fix 2
Place MOD, T_AND and T_OR before your ID rule:
MOD : 'MOD';
T_AND : 'AND';
T_OR : 'OR';
...
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side is has to be a variable. So I changed the grammar like this
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get following parse tree:
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat+)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:

Resources