ANTLR4 grammar for SML choking on positive integer literals - parsing

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.

As already mentioned in the comments by Rici: the lexer tries to match as much characters as possible, and when 2 or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL. And input like 0 will always be a CONSTANT. An INT token will only be created when a ~ is encountered followed by some digits. The NUM and DIGIT will never produce a token since the rules before it will be matched. The fact that NUM and DIGIT can never become tokens on their own, makes them candidates to becoming fragment tokens:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So an unary operator like ~ is often better used in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to make a distinction between a non-zero integer value as a label, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;

Related

ANTLR4 parsing a keyword-contained variable name

I'm trying to parse a simple integer declaration in antlr4.
The grammar I'm doing now is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.
The lexer creates tokens with as much characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;

In antlr4.7 how to parse a rule like ISO 8601 interval "P3M2D" ahead of an "ID" rule

I am trying to parse ISO 8601 period expressions like "P3M2D", using antlr4. But I am hitting some kind of roadblock and will appreciate help. I am rather new to both antlr and compilers.
My grammar is as below. I have combined the lexer and parser rules in one go here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In test run, it never hits the iso8601_interval_d rule, it always goes to ID rule.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now a different kind of failure
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit upon something like this. What is the antlr idiom here?
EDIT -- I need the ID token elsewhere in other parts of my grammar that I have omitted here, to highlight the problem I am facing.
Like find out even by other, the issue is in the ID token. The fact is that the duration syntax for iso-8601 is a valid ID. Besides the solution figured out by #Mike. If something called island grammar is suitable for your needs you can use ANTLR's lexical modes to exclude the ID lexer rule while parsing an iso date.
Belove there is an examples on how it could work
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
WS0 : [ \t]+ -> skip ;
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
Such grammar can parse expressions like
Pluton Planet <% now + P10Y
%>
I changed a bit the parser rule iso to demonstrate ID and period mixing.
Hope this helps
It's not possible what you wanna do. ID matches the same input as iso8601_interval. In such cases ANTLR4 picks the longest match, which is ID as it can match an unlimited number of characters.
The only way you could possible make it work in the grammar is to exclude P as a possible ID introducer, which then can exclusively be used for the duration.
Another option is a post processing step. Parse durations like any other identifier and in your semantic phase check all those ids that look like a duration. This is probably the best solution.

ANTLR4 grammar, rule accept only part of the sentence

all. I have created the grammar(it's part of the bigger grammar) to discover my problem
When I parse the string
00 */3,5 * * 5 America/New_York
I have following exception
line 1:4 no viable alternative at input '/'
As I have discovered, problem is that substring /3,5 does fully parsed with "with_step_value" rule, instead of it parser get only first sybmbol. But why? As I understand, antlr try to parse as long string as he can and substring "/3,5" in my poind of view satisfied to rule "with_step_value"
So, why it's happends and how to fix it?
Regards,
Vladimir
Please, see grammars and pictures bellow
/*File trigger validator lexer */
lexer grammar CronPartLexer;
INT_LIST: INTEGER (COMMA INTEGER)* ;
INTERVAL
:
INTEGER DASH INTEGER
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
SLASH
:
'/'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
ASTERISK:'*';
WS
:
[ \t\r\n]+ -> skip
;
grammar CronPartValidator;
options
{
tokenVocab = CronPartLexer;
}
cron_part
:
minutes hours days_of_month month week_days time_zone?;
minutes
:
with_step_value
;
time_zone
:
timezone_part
(
SLASH timezone_part
)?
;
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
hours
:
with_step_value
;
//
//
days_of_month
:
with_step_value
;
//
month
:
with_step_value
;
//
week_days
:
with_step_value
;
with_step_value:
INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?)
;
Parse Tree of the full string
Parse Tree of "with_step_value" "*/3,5"
The rule
with_step_value: INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?) ;
will only match an INT_LIST or an ASTERISK or an INTERVAL (SLASH INT_LIST)?.
Perhaps this is what was intended:
with_step_value
: ( INT_LIST
| ASTERISK
| INTERVAL
) (SLASH INT_LIST)?
;

ANTLR4 grammar to specify parent child relationship

I've created a grammar to express a search through a Map using key and value pairs using an ANTLR4 grammar file:
START: 'SEARCH FOR';
VALUE_EXPRESSION: 'VALUE:'[a-zA-Z0-9]+;
MATCH: 'MATCHING';
COMMA: ',';
KEY_EXPRESSION: 'KEY:'[a-zA-Z0-9]*;
KEY_VALUE_PAIR: KEY_EXPRESSION MATCH VALUE_EXPRESSION;
r : START KEY_VALUE_PAIR (COMMA KEY_VALUE_PAIR)*;
WS: [ \n\t\r]+ -> skip;
The "Interpret Lexer" in ANTLRWorks produces:
And the "Parse Tree" like this:
I'm not sure if this is the correct (or even typical) way to go about parsing an input string but what I'd like to do is have each of the key/value pairs split up and placed under a parent node like such:
[SEARCH FOR] [PAIR], [PAIR]
| |
/ \ / \
/ \ / \
/ \ / \
colour red size small
My belief is that in doing this It will make like easier when I come to walk the tree.
I've searched around and tried to use the caret '^' character to specify the parent but ANTLRWorks always indicates that there is an error in my grammar.
Can anybody help with this, or possibly supply another solution (if this is an atypical approach)?
You can probably simplify this even further. You might want to have a LEXER rule for your keys to keep track of them. So below, I am simply using string as the key. But you could define a lexer rule for 'colour', 'size', etc... Also, I did away with the matching. Instead, I created a set of pairs.
grammar GRAMMAR;
start: START set ;
set
: pair (',' pair)*
;
pair: STRING ':' value ;
value
: STRING
| NUMBER
;
START: 'SEARCH FOR: ' ;
STRING : '"' [a-zA-Z_0-9]* '"' ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
#modifier{public}
#ctorModifier{public}
#lexer::namespace{Org.CSharp.Parsers}
#parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:

Resources