I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as many characters as possible, and when two or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL, and input like 0 will always be a CONSTANT. An INT token will never be created, not even for input like ~1: CONSTANT matches exactly the same text as INT and is defined before it. NUM and DIGIT will never produce a token either, since the rules before them always match first. The fact that NUM and DIGIT can never become tokens on their own makes them candidates for becoming fragment rules:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
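The two tie-breaking rules described above (longest match first, then the rule defined first) can be modelled in a few lines of Python. This is only a rough sketch, not ANTLR itself: the rules are rewritten by hand as regular expressions, and the function name first_token is made up for this illustration.

```python
import re

# Lexer rules in grammar order, hand-translated to regular expressions.
# This models ANTLR's token selection; it is not ANTLR's implementation.
RULES = [
    ("LABEL", r"[1-9][0-9]*"),
    ("CONSTANT", r"~?[0-9]+"),
    ("INT", r"~?[0-9]+"),
    ("NUM", r"[0-9]+"),
    ("DIGIT", r"[0-9]"),
]

def first_token(text, rules=RULES):
    """Return (rule name, lexeme) for the token at the start of text."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if m and m.end() > 0 and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

for sample in ["1", "0", "~1", "10", "01"]:
    print(sample, first_token(sample))
```

In this model, 1 and 10 come out as LABEL, while 0, 01 and ~1 come out as CONSTANT. INT, NUM and DIGIT are never selected.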
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So a unary operator like ~ is often better handled in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to distinguish non-zero integer values so that only they can serve as a label, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
I'm trying to parse a simple integer declaration in antlr4.
The grammar I'm doing now is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.
The lexer creates tokens with as many characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
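To see the difference between the two grammars at the token level, here is a small Python model of a longest-match tokenizer. It is a sketch under assumptions: the rules are hand-written regular expressions, the literals 'int' and '=' are listed first because literals used in parser rules take precedence over named lexer rules in a combined grammar, and the tokenize helper is hypothetical.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is dropped, like '-> skip'
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# Original grammar: single-character LETTER and NUMBER tokens.
ORIGINAL = [("'int'", r"int"), ("'='", r"="), ("LETTER", r"[a-zA-Z_]"),
            ("NUMBER", r"[0-9]"), ("WS", r"[ \t\r\n]+")]

# Fixed grammar: VAR is a lexer rule, so an identifier is a single token.
FIXED = [("'int'", r"int"), ("'='", r"="), ("VAR", r"[a-zA-Z_][a-zA-Z_0-9]*"),
         ("NUMBER", r"[0-9]+"), ("WS", r"[ \t\r\n]+")]

print(tokenize("int int_A = 0", ORIGINAL))
print(tokenize("int int_A = 0", FIXED))
```

With the original rules the identifier is split into 'int', LETTER and LETTER. With the fixed rules VAR matches int_A as a whole because it is longer than the keyword, while a bare int still becomes the keyword because the literal wins the tie.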
I am trying to parse ISO 8601 period expressions like "P3M2D", using antlr4. But I am hitting some kind of roadblock and will appreciate help. I am rather new to both antlr and compilers.
My grammar is as below. I have combined the lexer and parser rules in one go here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In a test run, the lexer never produces the tokens that the iso8601_interval_d rule needs; the input always ends up as an ID token.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now I get a different kind of failure:
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit upon something like this. What is the antlr idiom here?
EDIT -- I need the ID token elsewhere in other parts of my grammar that I have omitted here, to highlight the problem I am facing.
As others have already found, the issue is the ID token: the ISO 8601 duration syntax is also a valid ID. Besides the solution figured out by @Mike, if a so-called island grammar suits your needs, you can use ANTLR's lexical modes to exclude the ID lexer rule while parsing an ISO date.
Below is an example of how it could work:
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
ISO_WS : [ \t]+ -> skip ; // renamed from WS0: a rule name cannot be defined twice, even in different modes
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
Such a grammar can tokenize input like
Pluton Planet <# now + P10Y
#>
I changed the parser rule iso a bit to demonstrate mixing IDs and periods.
Hope this helps
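The effect of the lexical modes can be sketched with a small Python model. Everything here is an illustration under assumptions: only a subset of the lexer rules is modelled, the rules are hand-translated to regular expressions, and the MODES table and tokenize function are hypothetical names, not part of ANTLR.

```python
import re

# Rules per mode, in priority order: (token name, regex, mode to switch to).
# Only a subset of the lexer grammar above is modelled.
MODES = {
    "DEFAULT": [("ISO_BEGIN", r"<#", "ISO"),
                ("ID", r"[a-zA-Z][a-zA-Z_0-9]*", None),
                ("WS", r"[ \t]+", None)],
    "ISO": [("ISO_END", r"#>", "DEFAULT"),
            ("WS", r"[ \t]+", None),
            ("NEWLINE", r"\r?\n", None),
            ("ADD", r"\+", None),
            ("P", r"P", None), ("Y", r"Y", None), ("M", r"M", None),
            ("W", r"W", None), ("D", r"D", None),
            ("DATETIME_NAME", r"today|TODAY|now|NOW", None),
            ("NUMBER_INT", r"-?(?:0|[1-9][0-9]*)", None)],
}

def tokenize(text):
    mode, pos, out = "DEFAULT", 0, []
    while pos < len(text):
        best = None
        # Only the rules of the active mode are candidates.
        for name, pattern, target in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), target)
        if best is None:
            raise ValueError(f"no rule matches at position {pos} in mode {mode}")
        name, end, target = best
        if name != "WS":  # whitespace is skipped
            out.append((name, text[pos:end]))
        pos = end
        if target:
            mode = target
    return out

print(tokenize("Pluton Planet <# now + P10Y\n#>"))
```

Outside the delimiters, Pluton and Planet are plain ID tokens. Inside them the ID rule is simply not a candidate, so P10Y is split into P, NUMBER_INT and Y.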
What you want to do is not possible as written. ID matches the same input as iso8601_interval. In such cases ANTLR4 picks the longest match, which is ID, as it can match an unlimited number of characters.
The only way you could possibly make it work in the grammar is to exclude P as a possible first character of ID, so that P can be used exclusively for durations.
Another option is a post-processing step. Parse durations like any other identifier, and in your semantic phase check all those IDs that look like a duration. This is probably the best solution.
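A post-processing check of that kind could look roughly like the following Python sketch. The function name as_duration and the returned field names are made up for this illustration, and the pattern is a simplification of iso8601_interval_d: it ignores negative numbers and treats a bare P as an ordinary identifier.

```python
import re

# Assumed shape of a date-only period, mirroring iso8601_interval_d:
# 'P' followed by optional year, month, week and day parts, in that order.
DURATION = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)W)?(?:(\d+)D)?")

def as_duration(identifier):
    """Return the period fields if the ID text is a duration, else None."""
    m = DURATION.fullmatch(identifier)
    if m is None or not any(m.groups()):
        return None  # an ordinary identifier
    names = ("years", "months", "weeks", "days")
    return {n: int(v) for n, v in zip(names, m.groups()) if v is not None}

print(as_duration("P3M2D"))   # {'months': 3, 'days': 2}
print(as_duration("Planet"))  # None
```

The semantic phase would call such a check on every ID that appears where an interval is expected and report an error when it returns None.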
Hi all. I have created this grammar (it's part of a bigger grammar) to reproduce my problem.
When I parse the string
00 */3,5 * * 5 America/New_York
I get the following exception:
line 1:4 no viable alternative at input '/'
As far as I can tell, the problem is that the substring /3,5 is not fully parsed by the "with_step_value" rule; instead, the parser only takes the first symbol. But why? As I understand it, ANTLR tries to match as long a string as it can, and in my view the substring "/3,5" satisfies the rule "with_step_value".
So why does this happen, and how do I fix it?
Regards,
Vladimir
Please see the grammars and pictures below.
/*File trigger validator lexer */
lexer grammar CronPartLexer;
INT_LIST: INTEGER (COMMA INTEGER)* ;
INTERVAL
:
INTEGER DASH INTEGER
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
SLASH
:
'/'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
ASTERISK:'*';
WS
:
[ \t\r\n]+ -> skip
;
grammar CronPartValidator;
options
{
tokenVocab = CronPartLexer;
}
cron_part
:
minutes hours days_of_month month week_days time_zone?;
minutes
:
with_step_value
;
time_zone
:
timezone_part
(
SLASH timezone_part
)?
;
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
hours
:
with_step_value
;
//
//
days_of_month
:
with_step_value
;
//
month
:
with_step_value
;
//
week_days
:
with_step_value
;
with_step_value:
INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?)
;
Parse Tree of the full string
Parse Tree of "with_step_value" "*/3,5"
The rule
with_step_value: INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?) ;
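will only match an INT_LIST, or an ASTERISK, or an INTERVAL optionally followed by SLASH INT_LIST. The optional (SLASH INT_LIST)? belongs to the INTERVAL alternative alone, so the * in */3,5 cannot be followed by /3,5.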
Perhaps this is what was intended:
with_step_value
: ( INT_LIST
| ASTERISK
| INTERVAL
) (SLASH INT_LIST)?
;
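The same precedence rule (sequence binds tighter than alternation) applies to regular expressions, so the difference can be tried out quickly in Python. This is only an analogy under an assumption: the token types are written out as a space-separated string that stands in for the lexer output of */3,5.

```python
import re

# Token types stand in for the lexer output; '*/3,5' lexes as these three.
tokens = "ASTERISK SLASH INT_LIST"

# Original rule: the optional step belongs to the INTERVAL alternative only.
original = re.compile(r"INT_LIST|ASTERISK|INTERVAL( SLASH INT_LIST)?")

# Regrouped rule: any of the three alternatives may be followed by a step.
regrouped = re.compile(r"(INT_LIST|ASTERISK|INTERVAL)( SLASH INT_LIST)?")

print(bool(original.fullmatch(tokens)))   # False
print(bool(regrouped.fullmatch(tokens)))  # True
```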
This is my first try with an ANTLR4-grammar. It should recognize a very easy statement, starting with the command 'label', followed by a colon, then an arbitrary text, ended by semicolon. But the parser does not recognize 'label' as description. Why?
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
TEXT
;
TEXT:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
An example for the code:
label:
this is an error;
wronglabel:YYY
this should be a error;
The error is:
line 1:0 mismatched input 'label: \nthis is an error;' expecting 'label'
(prog label: \nthis is an error; \n\n\nwronglabel:YYY\nthis should be a error; \n)
This works much better:
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
text
;
text:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-Z]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
It seems I mixed up lexer and parser rules:
lexer rule names have to start with an uppercase letter,
parser rule names with a lowercase letter.
So I changed the TEXT rule into a text rule.
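Why the uppercase TEXT rule broke the parse can be shown with a small Python model of the lexer's longest-match choice. It is a sketch under assumptions: only the implicit 'label' literal and the TEXT rule are modelled, as hand-written regular expressions, and first_token is a hypothetical helper.

```python
import re

# Two of the candidate lexer rules from the original grammar:
# the implicit 'label' literal and TEXT : .*? ';'
RULES = [
    ("'label'", re.compile(r"label")),
    ("TEXT", re.compile(r".*?;", re.DOTALL)),  # ANTLR's '.' also matches newlines
]

def first_token(text):
    """Pick the rule with the longest match; the earlier rule wins ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

print(first_token("label: \nthis is an error;"))
```

As a lexer rule, TEXT matches everything up to the first semicolon, which is longer than the keyword, so the parser never sees a 'label' token. That is the token shown in the "mismatched input" message. As a parser rule, text only combines tokens that the lexer has already produced.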