I am trying to parse ISO 8601 period expressions like "P3M2D" using ANTLR 4, but I have hit a roadblock and would appreciate some help. I am rather new to both ANTLR and compilers.
My grammar is as below. I have combined the lexer and parser rules in one go here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In a test run, the parser never matches the iso8601_interval_d rule; the lexer always produces an ID token instead.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now I get a different kind of failure:
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit upon something like this. What is the antlr idiom here?
EDIT -- I need the ID token elsewhere in other parts of my grammar that I have omitted here, to highlight the problem I am facing.
As others have already found, the issue is the ID token: the ISO 8601 duration syntax is also a valid ID. Besides the solution described by @Mike, if an island grammar suits your needs, you can use ANTLR's lexical modes to exclude the ID lexer rule while parsing an ISO date.
Below is an example of how it could work:
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
WS1 : [ \t]+ -> skip ; // named WS1 because ANTLR 4 rejects a second rule called WS0, even in another mode
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
This grammar can parse input such as
Pluton Planet <# now + P10Y
#>
I changed the parser rule iso slightly to demonstrate mixing IDs and periods.
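To picture what the mode switch does, here is a minimal plain-Java sketch that does not use ANTLR at all. The class name ModeLexerSketch, the token labels, and the regular expressions are my own assumptions that loosely mirror the lexer rules above; newlines are simply skipped here, unlike in the grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a two-mode lexer; not generated by ANTLR.
final class ModeLexerSketch {
    private static final String[] DEFAULT_NAMES = {"ISOBEGIN", "ID", "WS"};
    private static final Pattern DEFAULT_RULES = Pattern.compile(
        "(?<ISOBEGIN><#)|(?<ID>[a-zA-Z][a-zA-Z_0-9]*)|(?<WS>[ \\t]+)");

    private static final String[] ISO_NAMES = {"ISOEND", "NAME", "UNIT", "INT", "OP", "WS"};
    private static final Pattern ISO_RULES = Pattern.compile(
        "(?<ISOEND>#>)|(?<NAME>today|TODAY|now|NOW)|(?<UNIT>[PYMWD])"
        + "|(?<INT>-?(?:0|[1-9][0-9]*))|(?<OP>[-+()])|(?<WS>[ \\t\\r\\n]+)");

    /** Splits the input into "LABEL:text" strings, switching rule sets on <# and #>. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean isoMode = false;
        int pos = 0;
        while (pos < input.length()) {
            // Only the rules of the active mode are tried at all.
            Matcher m = (isoMode ? ISO_RULES : DEFAULT_RULES).matcher(input);
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("cannot tokenize at offset " + pos);
            }
            for (String name : isoMode ? ISO_NAMES : DEFAULT_NAMES) {
                if (m.group(name) != null) {
                    if (!name.equals("WS")) tokens.add(name + ":" + m.group());
                    if (name.equals("ISOBEGIN")) isoMode = true;   // like -> mode(ISO)
                    if (name.equals("ISOEND")) isoMode = false;    // like -> mode(DEFAULT_MODE)
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Pluton <# now + P10Y #>"));
        System.out.println(tokenize("P10Y")); // outside the island this is just an ID
    }
}
```

The point of the sketch is only that the ID rule does not exist while the ISO rule set is active, so P10Y can be split into P, 10 and Y there, while the same text outside the delimiters stays a single ID.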
Hope this helps
What you want to do is not possible with this grammar. The lexer runs before the parser and knows nothing about parser rules, and ID matches the same input as iso8601_interval. In such cases ANTLR 4 picks the longest match, which is ID, as it can match an unlimited number of characters.
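The longest-match behaviour can be imitated in a few lines of plain Java. This is only a sketch under my own assumptions: the class name LongestMatchDemo and the three regular expressions are stand-ins for the 'P' literal, NUMBER_INT and ID from the question, and ties are broken in favour of the rule listed first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "longest match wins, first rule wins ties".
final class LongestMatchDemo {
    // LinkedHashMap keeps declaration order, which is used for tie-breaking.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'P'", Pattern.compile("P"));
        RULES.put("NUMBER_INT", Pattern.compile("-?(0|[1-9][0-9]*)"));
        RULES.put("ID", Pattern.compile("[a-zA-Z][a-zA-Z_0-9]*"));
    }

    /** Returns "RULE:text" for the winning rule at pos, or "null:" if no rule matches. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        String bestText = "";
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // Strictly longer matches replace the current best; equal length keeps the earlier rule.
            if (m.lookingAt() && m.group().length() > bestText.length()) {
                bestRule = rule.getKey();
                bestText = m.group();
            }
        }
        return bestRule + ":" + bestText;
    }

    public static void main(String[] args) {
        System.out.println(nextToken("P3M2D", 0)); // ID:P3M2D
        System.out.println(nextToken("P", 0));     // 'P':P
        System.out.println(nextToken("3M2D", 1));  // ID:M2D
    }
}
```

The first call mirrors the original failure, where P3M2D becomes one ID token. The last call mirrors the '#P' attempt from the question, where M2D is swallowed by ID after the 3.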
The only way you could possibly make it work in the grammar is to exclude P as a possible ID introducer, so that it can be used exclusively for the duration.
Another option is a post-processing step. Parse durations like any other identifier, and in your semantic phase check all those IDs that look like a duration. This is probably the best solution.
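A minimal sketch of such a post-processing check, assuming the text of each ID token is available as a String in the semantic phase. The class and method names (DurationCheck, asPeriod) are hypothetical; java.time.Period.parse is the standard JDK call and accepts the Y, M, W and D designators.

```java
import java.time.Period;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper for the semantic phase; not part of ANTLR.
final class DurationCheck {
    // Mirrors iso8601_interval_d: 'P' followed by optional Y, M, W, D parts in that order.
    // The lookahead demands at least one part, which is stricter than the grammar rule.
    private static final Pattern SHAPE =
        Pattern.compile("P(?=\\d)(\\d+Y)?(\\d+M)?(\\d+W)?(\\d+D)?");

    /** Returns the period if the ID text looks like a duration, otherwise empty. */
    static Optional<Period> asPeriod(String idText) {
        if (!SHAPE.matcher(idText).matches()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Period.parse(idText));
        } catch (DateTimeParseException e) {
            // For example a number too large for an int.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(asPeriod("P3M2D"));  // Optional[P3M2D]
        System.out.println(asPeriod("Pluton")); // Optional.empty
    }
}
```

Note that Period folds weeks into days, so P1W comes back as a period of 7 days. Any ID for which asPeriod returns empty is treated as an ordinary identifier.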
I'm taking my first steps with ANTLR 4 in IntelliJ. I am trying to create a simple recursive climbing parser for mathematical expressions. I get this error:
line 1:0 mismatched input '3' expecting {VARIABLE; REALNUM, INTNUM}
It seems the lexer does not turn the 3 into one of the tokens the parser expects, but I cannot find the problem.
Lexer:
lexer grammar testLexer;
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
SIN: 'sin'|'Sin'|'SIN';
COS: 'cos'|'Cos'|'COS';
TAN: 'tan'|'Tan'|'TAN';
LN: 'ln'|'LN'|'Ln';
LOG: 'Log'|'log'|'LOG';
SQRT: 'sqrt'|'Sqrt'|'SQRT';
LBRACE: '(';
RBRACE: ')';
POW: '^';
SPACE: ' ' -> skip;
EQUAL: '=';
VARIABLE: [a-zA-Z][a-zA-Z0-9]*;
INTNUM: [0-9]+;
REALNUM: [0-9]+[,|.][0-9]+;
WS: [\r\t\n]+ -> skip;
SEMICOLON: ';';
Parser:
parser grammar testParser;
expression returns [double value]
: exp=additiveExpression {$value = $exp.value;};
equalityExpression returns [double value]
: m1 = additiveExpression (EQUAL additiveExpression)* {$value = $m1.value;};
additiveExpression returns [double value]
: m2 = multiplikativeExpression {$value = $m2.value;}
(PLUS m1=multiplikativeExpression {$value += $m1.value;}
|MINUS m1=multiplikativeExpression {$value -= $m1.value;}
)* ;
multiplikativeExpression returns [double value]
: m3 = powExpression {$value = $m3.value;}
(TIMES powExpression {$value *= $m3.value;}
|DIV powExpression {$value /= $m3.value;}
)* ;
powExpression returns [double value]
: (bracedExpression)
(POW (m4=expression) {$value = Math.pow($value, $m4.value);}
)*;
bracedExpression returns [double value]
: (LBRACE m5 = expression RBRACE {$value = $m5.value;}
|LBRACE m6 = unaryExpression RBRACE {$value = $m6.value;}
| m7 =unaryExpression {$value = $m7.value;});
unaryExpression returns [double value]
: m7= atomExpression {$value = $m7.value;}
| (SIN m6=bracedExpression {$value = Math.sin($m6.value);}
|COS m6=bracedExpression {$value = Math.cos($m6.value);}
|TAN m6=bracedExpression {$value = Math.tan($m6.value);}
|LOG m6=bracedExpression {$value = Math.log($m6.value);}
|SQRT m6=bracedExpression {$value = Math.sqrt($m6.value);}
)
|EOF;
atomExpression returns [double value]
: VARIABLE {$value = 1;}
|m7 = REALNUM {$value = Double.parseDouble($m7.text);}
| m7 = INTNUM {$value = Integer.parseInt($m7.text);};
The input is just the simple term 3, but the error also occurs on longer input strings like 2+3.
Your example lexes just fine for me. Here is the TestRig output with -tokens turned on:
C:\prj\ANTLR_SO_BENCH\Bench>java org.antlr.v4.gui.TestRig Grammar1 program -tokens SOURCE.txt
[#0,0:0='3',<INTNUM>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='2',<INTNUM>,1:2]
[#3,5:4='<EOF>',<EOF>,2:0]
And REALNUM tokens work too:
C:\prj\ANTLR_SO_BENCH\Bench>java org.antlr.v4.gui.TestRig Grammar1 program -tokens SOURCE.txt
[#0,0:3='3.14',<REALNUM>,1:0]
[#1,5:5='*',<'*'>,1:5]
[#2,7:9='2.1',<REALNUM>,1:7]
[#3,12:11='<EOF>',<EOF>,2:0]
So I'm not sure anymore what to recommend based on your question.
I am trying to change the STRING part of pair in object to a CamelString, but it fails and reports "no viable alternative at input".
I have tried using my CamelString as an independent grammar, and it works well. I think this means there is an ambiguity in my grammar, but I cannot understand why.
For the test input
{
'BaaaBcccCdddd':'cc'
}
The error is
line 2:2 no viable alternative at input '{'BaaaBcccCdddd''
The following is my grammar. It is almost the same as the standard JSON grammar for ANTLR 4.
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar Json;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair : camel_string ':' value;
camel_string : '\'' (camel_body)*? '\'';
STRING
: '\'' (ESC | ~['\\])* '\'';
camel_body: CAMEL_BODY;
CAMEL_START: [a-z] ALPHA_NUM_LOWER*;
CAMEL_BODY: [A-Z] ALPHA_NUM_LOWER*;
CAMEL_END: [A-Z]+;
fragment ALPHA_NUM_LOWER : [0-9a-z];
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // recursion
| array // recursion
| 'true' // keywords
| 'false'
| 'null'
;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;
grammar TestCSharpParser;
options {
language=CSharp3;
}
@parser::namespace { Demo.Antlr }
@lexer::namespace { Demo.Antlr }
parse returns [double value]
: exp EOF {$value = $exp.value;}
;
exp returns [double value]
: addExp {$value = $addExp.value;}
;
addExp returns [double value]
: a=mulExp {$value = $a.value;}
( '+' b=mulExp {$value += $b.value;}
| '-' b=mulExp {$value -= $b.value;}
)*
;
mulExp returns [double value]
: a=unaryExp {$value = $a.value;}
( '*' b=unaryExp {$value *= $b.value;}
| '/' b=unaryExp {$value /= $b.value;}
)*
;
unaryExp returns [double value]
: '-' atom {$value = -1.0 * $atom.value;}
| atom {$value = $atom.value;}
;
atom returns [double value]
: Number {$value = Double.Parse($Number.Text, CultureInfo.InvariantCulture);}
| '(' exp ')' {$value = $exp.value;}
;
Number
: ('0'..'9')+ ('.' ('0'..'9')+)?
;
Space
: (' ' | '\t' | '\r' | '\n'){$channel = HIDDEN;}
;
The grammar won't parse the simple statement 4/5 or (4/5); I tried this using ANTLRWorks.
Does anyone have any idea why this is happening? To my mind this should work correctly.
It keeps giving me a NoViableAltException.
I see several problems related to the use of the CSharp3 target.
The CSharp2 and CSharp3 targets define the constant Hidden instead of HIDDEN, so the action in the Space rule should read {$channel = Hidden;}.
ANTLRWorks cannot be used to generate parsers for grammars targeting the CSharp2 or CSharp3 targets. The parser must be generated either by MSBuild (preferred) or by using Antlr3.exe. These are documented on the ANTLR 3 C# Releases wiki page.
ANTLRWorks cannot be used to test parsers generated for the CSharp2 or CSharp3 targets. Any results reported by the interpreter or debugger cannot be trusted.
I'm trying to write a parser in ANTLR. This is my first time using it, and I'm running into a problem that I can't find a solution for; apologies if this is a very simple problem. The error I'm getting is:
"(100): Expr.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52)"
The code I'm currently using is:
grammar Expr.g;
options{
output=AST;
}
tokens{
MAIN = 'main';
OPENBRACKET = '(';
CLOSEBRACKET = ')';
OPENCURLYBRACKET = '{';
CLOSECURLYBRACKET = '}';
COMMA = ',';
SEMICOLON = ';';
GREATERTHAN = '>';
LESSTHAN = '<';
GREATEROREQUALTHAN = '>=';
LESSTHANOREQUALTHAN = '<=';
NOTEQUAL = '!=';
ISEQUALTO = '==';
WHILE = 'while';
IF = 'if';
ELSE = 'else';
READ = 'read';
OUTPUT = 'output';
PRINT = 'print';
RETURN = 'return';
READC = 'readc';
OUTPUTC = 'outputc';
PLUS = '+';
MINUS = '-';
DIVIDE = '/';
MULTIPLY = '*';
PERCENTAGE = '%';
}
@header {
//package test;
import java.util.HashMap;
}
@lexer::header {
//package test;
}
@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}
prog: stat+ ;
stat: expr NEWLINE {System.out.println($expr.value);}
| ID '=' expr NEWLINE
{memory.put($ID.text, new Integer($expr.value));}
| NEWLINE
;
expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
;
multExpr returns [int value]
: e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
;
atom returns [int value]
: INT {$value = Integer.parseInt($INT.text);}
| ID
{
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ) $value = v.intValue();
else System.err.println("undefined variable "+$ID.text);
}
| '(' e=expr ')' {$value = $e.value;}
;
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
INT : '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;
WS : (' '|'\t')+ {skip();} ;
Thanks for any help.
EDIT: Well, I'm an idiot, it's just a formatting error. Thanks for the responses from those who helped out.
You have some illegal characters after your IDENT token:
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
The : .; are invalid there. And you're also trying to mix the tree-rewrite operator ^ inside a lexer rule, which is illegal: remove them. Lastly, you've named it IDENT while in your parser rules, you're using ID.
It should be:
ID : ('a'..'z' | 'A'..'Z')+ ;