I'm new to ANTLR and I am trying to parse something like
ref:something title:(something else) blah ref other
and to obtain a list like
KEY = ref VALUE = something
KEY = title VALUE = something else
KEY = null VALUE = blah
KEY = null VALUE = ref // same ref string as item 1 key
KEY = null VALUE = other
The grammar I have is
searchCriteriaList
locals[List<object> s = new List<object>()]
: t+=criteriaBean (WS t+=criteriaBean)* { $s.addAll($t); }
;
criteriaBean : (KEY ':' WS* expression)
| expression ;
expression : '(' WORD (WS WORD)* ')'
| WORD ;
/*
* Lexer Rules
*/
fragment A : ('A'|'a') ;
fragment B : ('B'|'b') ;
fragment C : ('C'|'c') ;
fragment D : ('D'|'d') ;
fragment E : ('E'|'e') ;
fragment F : ('F'|'f') ;
fragment G : ('G'|'g') ;
fragment H : ('H'|'h') ;
fragment I : ('I'|'i') ;
fragment J : ('J'|'j') ;
fragment K : ('K'|'k') ;
fragment L : ('L'|'l') ;
fragment M : ('M'|'m') ;
fragment N : ('N'|'n') ;
fragment O : ('O'|'o') ;
fragment P : ('P'|'p') ;
fragment Q : ('Q'|'q') ;
fragment R : ('R'|'r') ;
fragment S : ('S'|'s') ;
fragment T : ('T'|'t') ;
fragment U : ('U'|'u') ;
fragment V : ('V'|'v') ;
fragment W : ('W'|'w') ;
fragment X : ('X'|'x') ;
fragment Y : ('Y'|'y') ;
fragment Z : ('Z'|'z') ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
TITLE : T I T L E ;
MESSAGE : M E S S A G E ;
REF : R E F ;
KEY : TITLE | MESSAGE | REF ;
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WS : [ \t\u000C\r\n] ;
When I try parsing the string I get 2 exceptions, and in the addAll method I end up with 3 elements rather than 5.
Can someone point me in the right direction? What am I doing wrong?
Thanks,
S
PS: The exception I am getting is:
Exception of type 'Antlr4.Runtime.InputMismatchException' was thrown.
InputStream: {ref:something title:(something else) blah ref other }
OffendingToken: {[#0,0:2='ref',<5>,1:0]}
The lexer tries to match as many characters as possible when constructing tokens. When two or more lexer rules match the same characters, the rule defined first "wins". With this in mind, the KEY token will never be created, since TITLE, MESSAGE and REF are defined above it:
TITLE : T I T L E ;
MESSAGE : M E S S A G E ;
REF : R E F ;
KEY : TITLE | MESSAGE | REF ;
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
So the input ref will always become a REF token, never a KEY or a WORD. What you need to do is create a parser rule from KEY instead.
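You can see this for yourself by dumping the token stream. Here is a minimal Java sketch (the C# runtime has equivalent calls); the grammar name Search is hypothetical, use whatever your combined grammar is actually called:
// imports: org.antlr.v4.runtime.*
SearchLexer lexer = new SearchLexer(CharStreams.fromString("ref:something title:(something else) blah ref other"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    // "ref" and "title" are reported as REF and TITLE, never as KEY
    System.out.printf("%-10s `%s`%n", SearchLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}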
Also, since you want a WORD to also match your keywords, you should not do this:
expression
: '(' WORD (WS WORD)* ')'
| WORD
;
but something like this instead:
expression
: '(' word (WS word)* ')'
| word
;
word
: key
| WORD
;
key
: TITLE
| MESSAGE
| REF
;
Oh, and this:
fragment Z : ('Z'|'z') ;
can be rewritten as:
fragment Z : [Zz] ;
And is there a particular reason you're littering your parser rules with WS tokens? You could just remove them during tokenisation:
WS : [ \t\u000C\r\n] -> skip;
Related
I'm working on a lexer and parser for an old object-oriented chat system (MOO, in case any readers are familiar with its language). Within this language, any of the examples below are valid floating-point numbers:
2.3
3.
.2
3e+5
The language also implements an indexing syntax for extracting one or more characters from a string or list (which is a set of comma separated expressions enclosed in curly braces). The problem arises from the fact that the language supports a range operator inside the index brackets. For example:
a = foo[1..3];
I understand that ANTLR wants to match the longest possible match first. Unfortunately this results in the lexer seeing '1..3' as two floating-point numbers (1. and .3) rather than two integers with a range operator ('..') between them. Is there any way to solve this short of using lexer modes? Given that the values inside an indexing expression can be any valid expression, I would have to duplicate a lot of token rules (essentially all but the floating-point numbers, as I understand it). Granted, I'm new to ANTLR, so I'm sure I'm missing something, and any help is much appreciated. My lexer grammar is below:
lexer grammar MooLexer;
channels { COMMENTS_CHANNEL }
SINGLE_LINE_COMMENT
: '//' INPUT_CHARACTER* -> channel(COMMENTS_CHANNEL);
DELIMITED_COMMENT
: '/*' .*? '*/' -> channel(COMMENTS_CHANNEL);
WS
: [ \t\r\n] -> channel(HIDDEN)
;
IF
: I F
;
ELSE
: E L S E
;
ELSEIF
: E L S E I F
;
ENDIF
: E N D I F
;
FOR
: F O R;
ENDFOR
: E N D F O R;
WHILE
: W H I L E
;
ENDWHILE
: E N D W H I L E
;
FORK
: F O R K
;
ENDFORK
: E N D F O R K
;
RETURN
: R E T U R N
;
BREAK
: B R E A K
;
CONTINUE
: C O N T I N U E
;
TRY
: T R Y
;
EXCEPT
: E X C E P T
;
ENDTRY
: E N D T R Y
;
IN
: I N
;
SPLICER
: '#';
UNDERSCORE
: '_';
DOLLAR
: '$';
SEMI
: ';';
COLON
: ':';
DOT
: '.';
COMMA
: ',';
BANG
: '!';
OPEN_QUOTE
: '`';
SINGLE_QUOTE
: '\'';
LEFT_BRACKET
: '[';
RIGHT_BRACKET
: ']';
LEFT_CURLY_BRACE
: '{';
RIGHT_CURLY_BRACE
: '}';
LEFT_PARENTHESIS
: '(';
RIGHT_PARENTHESIS
: ')';
PLUS
: '+';
MINUS
: '-';
STAR
: '*';
DIV
: '/';
PERCENT
: '%';
PIPE
: '|';
CARET
: '^';
ASSIGNMENT
: '=';
QMARK
: '?';
OP_AND
: '&&';
OP_OR
: '||';
OP_EQUALS
: '==';
OP_NOT_EQUAL
: '!=';
OP_LESS_THAN
: '<';
OP_GREATER_THAN
: '>';
OP_LESS_THAN_OR_EQUAL_TO
: '<=';
OP_GREATER_THAN_OR_EQUAL_TO
: '>=';
RANGE
: '..';
ERROR
: 'E_NONE'
| 'E_TYPE'
| 'E_DIV'
| 'E_PERM'
| 'E_PROPNF'
| 'E_VERBNF'
| 'E_VARNF'
| 'E_INVIND'
| 'E_RECMOVE'
| 'E_MAXREC'
| 'E_RANGE'
| 'E_ARGS'
| 'E_NACC'
| 'E_INVARG'
| 'E_QUOTA'
| 'E_FLOAT'
;
OBJECT
: '#' DIGIT+
| '#-' DIGIT+
;
STRING
: '"' ( ESC | [ !] | [#-[] | [\]-~] | [\t] )* '"';
INTEGER
: DIGIT+;
FLOAT
: DIGIT+ [.] (DIGIT*)? (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| [.] DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+
;
IDENTIFIER
: (LETTER | DIGIT | UNDERSCORE)+
;
LETTER
: LOWERCASE
| UPPERCASE
;
/*
* fragments
*/
fragment LOWERCASE
: [a-z] ;
fragment UPPERCASE
: [A-Z] ;
fragment EXPONENTNOTATION
: ('E' | 'e');
fragment EXPONENTSIGN
: ('-' | '+');
fragment DIGIT
: [0-9] ;
fragment ESC
: '\\"' | '\\\\' ;
fragment INPUT_CHARACTER
: ~[\r\n\u0085\u2028\u2029];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
No, AFAIK, there is no way to solve this using lexer modes. You'll need a predicate with a bit of target specific code. If Java is your target, that might look like this:
lexer grammar RangeTestLexer;
FLOAT
: [0-9]+ '.' [0-9]+
| [0-9]+ '.' {_input.LA(1) != '.'}?
| '.' [0-9]+
;
INTEGER
: [0-9]+
;
RANGE
: '..'
;
SPACES
: [ \t\r\n] -> skip
;
If you run the following Java code:
Lexer lexer = new RangeTestLexer(CharStreams.fromString("1 .2 3. 4.5 6..7 8 .. 9"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s `%s`\n", RangeTestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get the following output:
INTEGER `1`
FLOAT `.2`
FLOAT `3.`
FLOAT `4.5`
INTEGER `6`
RANGE `..`
INTEGER `7`
INTEGER `8`
RANGE `..`
INTEGER `9`
EOF `<EOF>`
The { ... }? is the predicate, and the embedded code must evaluate to a boolean. In my example, the Java code _input.LA(1) != '.' returns true if the character one position ahead of the current position in the input stream is not a '.' char.
I have this grammar:
grammar SearchQuery;
queryDeclaration : predicateGroupItem predicateGroupItemWithBooleanOperator* ;
predicateGroupItemWithBooleanOperator : groupOperator predicateGroupItem ;
predicateGroupItem : LEFT_BRACKET variable variableDelimiter
expression expressionWithBoolean* RIGHT_BRACKET ;
variable : VARIABLE_STRING ;
variableDelimiter : VAR_DELIMITER ;
expressionWithBoolean : boolOperator expression ;
expression : value ;
value : polygonType;
boolOperator : or
;
or : OR ;
groupOperator : AND ;
polygonType : POLYGON LEFT_BRACKET pointList (POLYGON_DELIMITER pointList)* RIGHT_BRACKET ;
longType : LONG ;
doubleType : DOUBLE ;
pointList : point
| LEFT_BRACKET point ( POLYGON_DELIMITER point)* RIGHT_BRACKET
;
point : latitude longitude ;
latitude : longType
| doubleType
;
longitude : longType
| doubleType
;
POLYGON : [pP] [oO] [lL] [yY] [gG] [oO] [nN] ;
LONG : DIGIT+ ;
DOUBLE : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
AND : [aA] [nN] [dD] ;
OR : COMMA
| [oO] [rR]
;
VARIABLE_STRING : [a-zA-Z0-9.]+ ;
COMMA : ',' ;
POLYGON_DELIMITER : ';' ;
VAR_DELIMITER : ':' ;
RIGHT_BRACKET : ')' ;
LEFT_BRACKET : '(' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
The problem is that I can't use the COMMA token in different roles at the same time: in the polygonType and pointList rules (where I want COMMA instead of POLYGON_DELIMITER) and in the boolOperator rule (where COMMA is already used).
In other words, if we change POLYGON_DELIMITER to COMMA and test the grammar with a value like this:
(polygons: polygon(20 30.4, 23.4 23),
polygon(20 30.4, 23.4 23),
polygon(20 30.4, 23.4 23))
we get this error:
mismatch input: ',' expecting {',', ')'}
I would be happy if somebody could help me understand the problem.
P.S. If we don't change the current grammar, the test value is:
(polygons: polygon(20 30.4; 23.4 23),
polygon(20 30.4; 23.4 23),
polygon(20 30.4; 23.4 23))
Because of these rules:
OR : COMMA
| [oO] [rR]
;
COMMA : ',' ;
the lexer will never produce a COMMA token since it is already matched by the OR token. And because OR is defined before COMMA, it gets precedence.
That is what the error message mismatch input: ',' expecting {',', ')'} really means. In other words: mismatch input: OR expecting {COMMA, RIGHT_BRACKET}
What you should do (if the OR operator can be either "or" or ",") is let the parser rule or match the COMMA:
or : OR
| COMMA
;
OR : [oO] [rR]
;
COMMA : ',' ;
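As a sanity check, you can dump the tokens the lexer produces after this change. A small Java sketch (the class name SearchQueryLexer follows from grammar SearchQuery; adapt the input as needed):
// imports: org.antlr.v4.runtime.*
SearchQueryLexer lexer = new SearchQueryLexer(CharStreams.fromString("(polygons: polygon(20 30.4, 23.4 23))"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    // with the fix, ',' now shows up as a COMMA token instead of OR
    System.out.printf("%-20s `%s`%n", SearchQueryLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}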
This is freaking me out; I just can't find a solution to it. I have a grammar for search queries and would like to match any search term in a query composed of printable characters, except for the special characters "(" and ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character in the commented-out NOT_SPECIAL rule, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL into the definition of searchterm in any way, it doesn't work. I have also tried putting it literally into the rule (commenting out NOT_SPECIAL) and many other things, but it just doesn't work. In most of my attempts the grammar just complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse all text after "=" in the fieldsearch rule until there is a whitespace, ")" or "("?
N.B. The STRING rule works fine, but the user ought not to be required to use quotes every time, because this is a command-line tool and they would need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
When using lexical modes, you must define your lexer and parser rules in separate files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
prints the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>
This is my grammar:
grammar FOOL;
@header {
import java.util.ArrayList;
}
@lexer::members {
public ArrayList<String> lexicalErrors = new ArrayList<>();
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
prog : exp SEMIC #singleExp
| let exp SEMIC #letInExp
| (classdec)+ SEMIC (let)? exp SEMIC #classExp
;
classdec : CLASS ID ( EXTENDS ID )? (LPAR (vardec ( COMMA vardec)*)? RPAR)? (CLPAR ((fun SEMIC)+)? CRPAR)?;
let : LET (dec SEMIC)+ IN ;
vardec : type ID ;
varasm : vardec ASM exp ;
fun : type ID LPAR ( vardec ( COMMA vardec)* )? RPAR (let)? exp ;
dec : varasm #varAssignment
| fun #funDeclaration
;
type : INT
| BOOL
| ID
;
exp : left=term (operator=(PLUS | MINUS) right=term)*
;
term : left=factor (operator=(TIMES | DIV) right=factor)*
;
factor : left=value (operator=(EQ | LESSEQ | GREATEREQ | GREATER | LESS | AND | OR ) right=value)*
;
value : MINUS?INTEGER #intVal
| (NOT)? ( TRUE | FALSE ) #boolVal
| LPAR exp RPAR #baseExp
| IF cond=exp THEN CLPAR thenBranch=exp CRPAR (ELSE CLPAR elseBranch=exp CRPAR)? #ifExp
| MINUS?ID #varExp
| THIS #thisExp
| funcall #funExp
| (ID | THIS) DOT funcall #methodExp
| NEW ID ( LPAR (exp (COMMA exp)* )? RPAR)? #newExp
| PRINT ( exp ) #print
;
/* PRINT LPAR exp RPAR */
funcall
: ID ( LPAR (exp (COMMA exp)* )? RPAR )
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
SEMIC : ';' ;
COLON : ':' ;
COMMA : ',' ;
EQ : '==' ;
ASM : '=' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
TRUE : 'true' ;
FALSE : 'false' ;
LPAR : '(' ;
RPAR : ')' ;
CLPAR : '{' ;
CRPAR : '}' ;
IF : 'if' ;
THEN : 'then' ;
ELSE : 'else' ;
PRINT : 'print' ;
LET : 'let' ;
IN : 'in' ;
VAR : 'var' ;
FUN : 'fun' ;
INT : 'int' ;
BOOL : 'bool' ;
CLASS : 'class' ;
EXTENDS : 'extends' ;
THIS : 'this' ;
NEW : 'new' ;
DOT : '.' ;
LESSEQ : ('<=' | '=<') ;
GREATEREQ : ('>=' | '=>') ;
GREATER: '>' ;
LESS : '<' ;
AND : '&&' ;
OR : '||' ;
NOT : '!' ;
//Numbers
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+;
//IDs
fragment CHAR : 'a'..'z' |'A'..'Z' ;
ID : CHAR (CHAR | DIGIT)* ;
//ESCAPED SEQUENCES
WS : (' '|'\t'|'\n'|'\r')-> skip;
LINECOMENTS : '//' (~('\n'|'\r'))* -> skip;
BLOCKCOMENTS : '/*'( ~('/'|'*')|'/'~'*'|'*'~'/'|BLOCKCOMENTS)* '*/' -> skip;
ERR_UNKNOWN_CHAR
: . { lexicalErrors.add("UNKNOWN_CHAR " + getText()); }
;
I think that there is a problem in the grammar concerning operator precedence.
In particular, this one:
let
int x = (5-2)+4;
in
print x;
prints 7, while this one:
let
int x = 5-2+4;
in
print x;
prints 9.
Why does the first one work? How can I make the second one work by changing only the grammar?
I think there is something to change in exp, term or factor.
This is the first parse tree http://it.tinypic.com/r/2nj8tqw/9 .
This is the second parse tree http://it.tinypic.com/r/2iv02z6/9 .
exp : left=term (operator=(PLUS | MINUS) right=exp)?
This produces the parse tree that is causing it. Simply put, 5 - 2 + 4 will be parsed as 5 - (2 + 4):
term MINUS exp
 5         term PLUS exp
            2        term
                      4
This should help, although you'll have to change the evaluation logic:
exp : left=term (operator=(PLUS | MINUS) right=term)*
The same goes for factor and any other binary operations.
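As for the evaluation logic: instead of reading only the last operator/right labels, fold over all terms from left to right. A minimal Java sketch, assuming the parser and a visitor base class are generated from grammar FOOL with the -visitor option (visitTerm and visitFactor need the same treatment):
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.tree.TerminalNode;

public class EvalVisitor extends FOOLBaseVisitor<Integer> {
    @Override
    public Integer visitExp(FOOLParser.ExpContext ctx) {
        int result = visit(ctx.term(0));                // first operand
        for (int i = 1; i < ctx.term().size(); i++) {
            // children alternate term, op, term, op, ... so the i-th operator
            // sits at child index 2*i - 1
            Token op = ((TerminalNode) ctx.getChild(2 * i - 1)).getSymbol();
            int rhs = visit(ctx.term(i));
            result = op.getType() == FOOLParser.PLUS ? result + rhs : result - rhs;
        }
        return result;                                  // 5-2+4 now yields 7
    }
}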
I am trying to parse ISO 8601 period expressions like "P3M2D" using ANTLR 4, but I am hitting some kind of roadblock and would appreciate help. I am rather new to both ANTLR and compilers.
My grammar is below; I have combined the lexer and parser rules in one file here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In a test run it never hits the iso8601_interval_d rule; it always goes to the ID rule.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now I get a different kind of failure:
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit something like this. What is the ANTLR idiom here?
EDIT: I need the ID token elsewhere, in other parts of my grammar that I have omitted here to highlight the problem I am facing.
As others have already pointed out, the issue is in the ID token: the ISO 8601 duration syntax is itself a valid ID. Besides the solution figured out by @Mike, if a so-called island grammar is suitable for your needs, you can use ANTLR's lexical modes to exclude the ID lexer rule while parsing an ISO date.
Below is an example of how it could work.
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
WS1 : [ \t]+ -> skip ; // lexer rule names must be unique across modes, so this one can't be WS0 again
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
Such a grammar can handle input like
Pluton Planet <# now + P10Y
#>
I changed the parser rule iso a bit to demonstrate mixing IDs and periods.
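To see the mode switching in action you can dump the token stream; a small Java sketch (the generated lexer class is named after the lexer grammar, here iso_lexer):
// imports: org.antlr.v4.runtime.*
iso_lexer lexer = new iso_lexer(CharStreams.fromString("Pluton Planet <# now + P10Y\n#>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    // "Pluton" and "Planet" come out as ID, everything between <# and #> as ISO-mode tokens
    System.out.printf("%-15s %s%n", iso_lexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}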
Hope this helps
What you want to do is not possible: ID matches the same input as iso8601_interval. In such cases ANTLR4 picks the longest match, which is ID, as it can match an unlimited number of characters.
The only way you could possibly make it work in the grammar is to exclude P as a possible ID introducer, which then could be used exclusively for the duration.
Another option is a post-processing step: parse durations like any other identifier, and in your semantic phase check all those IDs that look like a duration. This is probably the best solution.
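A minimal Java sketch of that post-processing idea; the class and method names are made up, and the regex is a simplified date-duration pattern covering only the Y/M/W/D designators used above:
import java.util.regex.Pattern;

public class DurationCheck {
    // years, months, weeks, days only; a bare "P" is not a duration
    private static final Pattern ISO_DURATION =
            Pattern.compile("P(\\d+Y)?(\\d+M)?(\\d+W)?(\\d+D)?");

    static boolean looksLikeDuration(String idText) {
        return idText.length() > 1 && ISO_DURATION.matcher(idText).matches();
    }
}
Any ID token whose text passes this check can then be re-interpreted as a period in the semantic phase, for example with something like java.time.Period.parse.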