Why is antlr4 not recognizing tokens as part of rules in grammar?

I am using antlr4 to parse .eds files. I wrote a grammar, but the parser attaches every token in a section's body directly to the body node instead of grouping the tokens into fields. It looks as if antlr4 is ignoring my grammar rules for the body.
Here is my grammar:
grammar test;
eds : section+;
section : header body;
header : '[' header_name ']';
body : field+;
field : name '=' value STMTEND;
header_name : ~(']')+;
name : Identifier;
raw_value : string
| integer
| hex
| version
| date
| time;
value : raw_value
| list;
list : raw_value list_value+;
list_value : ',' raw_value
| ',';
string : String_standard
| string_list;
string_list : String_standard string_list
| String_standard String_standard;
integer : Integer;
version : Version;
date : Date;
time : Time;
hex : Hex;
String_standard : '"' ( Escape | ~('\'' | '\\' | '\n' | '\r') | '.' | '+' + '/' | ' ') + '"';
Escape : '\\' ( '\'' | '\\' );
Integer : NUMBER+;
Hex : '0' 'x' HEX_DIGIT+;
Version : NUMBER+ '.' NUMBER+
| NUMBER+ '.' NUMBER+ '.' NUMBER+
| NUMBER+ '.' NUMBER+ '.' NUMBER+ '.' NUMBER+;
Date : NUMBER NUMBER '-' NUMBER NUMBER '-' NUMBER NUMBER NUMBER NUMBER;
Time : NUMBER NUMBER ':' NUMBER NUMBER ':' NUMBER NUMBER;
Identifier : Identifier_Char+;
HeaderID : Header_Char+;
fragment
Identifier_Char : LETTER
| NUMBER
| '_';
fragment
Header_Char : LETTER
| NUMBER
| '_'
| ' ';
fragment LETTER : [a-zA-Z];
fragment HEX_DIGIT : [a-fA-F0-9];
fragment NUMBER : [0-9];
STMTEND : SEMICOLON;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
WS: [ \t\r\n\u000C]+ -> channel(HIDDEN);
LINE_COMMENT: '$' ~[\r\n]* -> channel(HIDDEN);
Here is my input:
[File]
DescText = "EtherNet/IP EDS for ANT lite+ PLC";
CreateDate = 02-16-2018;
CreateTime = 14:13:46;
ModDate = 10-11-2019;
ModTime = 11:05:09;
Revision = 1.2;
HomeURL = "www.bluebotics.com";
1_IOC_Details_License = 0x7B457ED4;
When I visualize the parse tree with the antlr4 gui, I see that the header was parsed correctly, but the body just has a child for every token. Here is the tree output, where you can see it didn't parse the body at all:
(eds (section (header [ (header_name File) ]) (body DescText = "EtherNet/IP EDS for ANT lite+ PLC" ; CreateDate = 02 16 2018 ; CreateTime = 14 13 46 ; ModDate = 10 11 2019 ; ModTime = 11 05 09 ; Revision = 1 2 ; HomeURL = "www.bluebotics.com" ; 1_IOC_Details_License = 0x7B457ED4 ;)))
How do I alter my grammar so that antlr actually parses the body?

Place ANY : .; at the end of your grammar so that the lexer does not produce any errors/warnings. That way, it is easier to see where things go wrong. With that ANY rule added, you will see that your input is tokenised like this:
null `[`
Identifier `File`
null `]`
WS `\n `
HeaderID `DescText `
null `=`
HeaderID ` `
String_standard `"EtherNet/IP EDS for ANT lite+ PLC"`
STMTEND `;`
WS `\n `
HeaderID `CreateDate `
null `=`
HeaderID ` 02`
ANY `-`
Integer `16`
ANY `-`
Integer `2018`
STMTEND `;`
WS `\n `
HeaderID `CreateTime `
null `=`
HeaderID ` 14`
ANY `:`
Integer `13`
ANY `:`
Integer `46`
STMTEND `;`
WS `\n `
HeaderID `ModDate `
null `=`
HeaderID ` 10`
ANY `-`
Integer `11`
ANY `-`
Integer `2019`
STMTEND `;`
WS `\n `
HeaderID `ModTime `
null `=`
HeaderID ` 11`
ANY `:`
Integer `05`
ANY `:`
Integer `09`
STMTEND `;`
WS `\n `
HeaderID `Revision `
null `=`
HeaderID ` 1`
ANY `.`
Integer `2`
STMTEND `;`
WS `\n `
HeaderID `HomeURL `
null `=`
HeaderID ` `
String_standard `"www.bluebotics.com"`
STMTEND `;`
WS `\n `
HeaderID `1_IOC_Details_License `
null `=`
HeaderID ` 0x7B457ED4`
STMTEND `;`
EOF `<EOF>`
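As you can see, your HeaderID rule is messing things up: because it may contain spaces, it produces a longer match than Identifier or WS and therefore wins. It should really not contain spaces. Remove the HeaderID rule (and the ANY rule as well) and your parser will parse the input correctly.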
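The effect can be reproduced outside ANTLR with a rough model of how the lexer picks a rule: the longest match wins, and on a tie the rule listed first wins. The Python below is only a sketch under that assumption, not the ANTLR runtime. The `tokenize` helper and the regular expressions are hypothetical stand-ins for a few of the question's lexer rules, and `'='` stands in for the implicit token ANTLR creates for the literal used in the parser rule.

```python
import re

# A few rules from the question's grammar, in the order they appear there.
# ANY is the catch-all rule suggested in the answer, placed last.
RULES = [
    ("'='", r"="),
    ("Integer", r"[0-9]+"),
    ("Date", r"[0-9]{2}-[0-9]{2}-[0-9]{4}"),
    ("Identifier", r"[A-Za-z0-9_]+"),
    ("HeaderID", r"[A-Za-z0-9_ ]+"),
    ("STMTEND", r";"),
    ("WS", r"[ \t\r\n\f]+"),
    ("ANY", r"."),
]


def tokenize(rules, text):
    """Split text into (rule, lexeme) pairs: longest match wins, first rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


line = "CreateDate = 02-16-2018;"
with_header_id = tokenize(RULES, line)
without_header_id = tokenize([r for r in RULES if r[0] != "HeaderID"], line)

print(with_header_id)
print(without_header_id)
```

With HeaderID present, the model yields the same HeaderID, ANY and Integer sequence as the token dump above. Without it, the date comes out as a single Date token, which is what the field rule needs.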

Bison Reduce/Reduce conflicts Warning

While producing a pretty basic XML-like language parser, I'm stumbling upon reduce/reduce conflicts. Trying to figure out their source.
bison.y: warning: 2 reduce/reduce conflicts [-Wconflicts-rr]
bison -d bison.y error - warnings
The warning is as seen above.
Code :
%%
%token WORKBOOK_ST_TAG WORKBOOK_END_TAG
%token STYLES_ST_TAG STYLES_END_TAG
%token STYLE_ST_TAG STYLE_END_TAG
%token ID_TAG
%token WORKSHEET_ST_TAG WORKSHEET_END_TAG
%token NAME_TAG PROTECTED_TAG
%token TABLE_ST_TAG TABLE_END_TAG
%token EXP_RC_TAG EXP_CC_TAG
%token COLUMN_OP_TAG COLUMN_CL_TAG WIDTH_TAG
%token ROW_ST_TAG ROW_END_TAG
%token HEIGHT_TAG HIDDEN_TAG
%token CELL_ST_TAG CELL_END_TAG
%token MERGE_A_TAG MERGE_D_TAG
%token STYLE_ID_TAG
%token DATA_ST_TAG DATA_END_TAG
%token TYPE_TAG
%token NUM_ATTR STRING_ATTR DATETIME_ATTR BOOL_ATTR
%token INT DECIMAL STRING BOOLEAN DATETIME
%token COMM_ST_TAG COMM_END_TAG
%%
My proposed grammar:
workbook : WORKBOOK_ST_TAG wb_cont WORKBOOK_END_TAG
;
wb_cont : styles worksheet worksheets | comment wb_cont
;
worksheets : worksheet worksheets |
;
styles : STYLES_ST_TAG styles_cont STYLES_END_TAG styles | ;
;
styles_cont : style styles_cont | comment styles_cont |
;
style : style_st_tag style_cont STYLE_END_TAG
;
style_st_tag : STYLE_ST_TAG ID '>'
;
ID : ID_TAG '=' '"' STRING '"'
;
style_cont : | comment style_cont
;
worksheet : ws_st_tag ws_cont WORKSHEET_END_TAG
;
ws_st_tag : WORKSHEET_ST_TAG ws_attr '>'
;
ws_attr : name | name protected | protected name
;
name : NAME_TAG '=' '"' STRING '"'
;
protected : PROTECTED_TAG '=' '"' BOOLEAN '"'
;
ws_cont : table ws_cont | comment ws_cont |
;
table : table_st_tag table_cont TABLE_END_TAG
;
table_st_tag : TABLE_ST_TAG table_attr '>'
;
table_attr : ExpCC table_attr | ExpRC table_attr | styleID table_attr |
;
ExpCC : EXP_CC_TAG '=' '"' INT '"'
;
ExpRC : EXP_RC_TAG '=' '"' INT '"'
;
table_cont : | columns rows | comment table_cont
;
columns: | column columns
;
rows : | row rows
;
column : col_tag
;
col_tag : COLUMN_OP_TAG col_attr COLUMN_CL_TAG
;
col_attr : | height col_attr | width col_attr | styleID col_attr
;
width : WIDTH_TAG '=' '"' INT '"'
;
row : row_st_tag row_cont ROW_END_TAG
;
row_st_tag : ROW_ST_TAG row_attr '>'
;
row_attr : | height row_attr | hidden row_attr | styleID row_attr
;
height : HEIGHT_TAG '=' '"' INT '"'
;
hidden : HIDDEN_TAG '=' '"' BOOLEAN '"'
;
row_cont : | cell row_cont | comment row_cont
;
cell : cell_st_tag cell_cont CELL_END_TAG
;
cell_st_tag : CELL_ST_TAG cell_attr '>'
;
cell_attr : | mergeacross cell_attr | mergedown cell_attr | styleID cell_attr
;
mergeacross : MERGE_A_TAG '=' '"'INT'"'
;
mergedown : MERGE_D_TAG '=' '"'INT'"'
;
styleID : STYLE_ID_TAG '=' '"'STRING'"'
;
cell_cont : | data cell_cont | comment cell_cont
;
data : data_st_tag data_cont data_end_tag
;
data_st_tag : DATA_ST_TAG data_type '>'
;
data_end_tag : DATA_END_TAG
;
data_type : TYPE_TAG '"' data_attr '"'
;
data_attr : NUM_ATTR | STRING_ATTR | DATETIME_ATTR | BOOL_ATTR
;
data_cont : | STRING data_cont | INT data_cont | comment data_cont
;
comment : COMM_ST_TAG comm_cont COMM_END_TAG
;
comm_cont : | STRING comm_cont | INT comm_cont | STRING '-' comm_cont | INT '-' comm_cont
;
%%
Because comments must be allowed almost anywhere in the XML-like language, I developed this recursive grammar.
However, I cannot pinpoint where the conflicts arise.
A sample input for the language looks like this:
<ss:Workbook>
<ss:Styles>
<ss:Style ss:ID=”s123”></ss:Style>
<ss:Style ss:ID=”x123”></ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name=”sheet1”>
<ss:Table ss:ExpandedColumnCount=”2”>
<ss:Column ss:StyleID=”s123”/>
<ss:Column ss:StyleID=”s123”/>
<ss:Row>
<ss:Cell>
<ss:Data ss:Type=”Number”>1234</ss:Data>
</ss:Cell>
<ss:Cell>
<ss:Data ss:Type=”String”>string data</ss:Data>
</ss:Cell>
</ss:Row>
</ss:Table>
</ss:Worksheet>
</ss:Workbook>

antlr4 line 2:0 mismatched input 'if' expecting {'if', OTHER}

I am having a bit of difficulty in my g4 file. Below is my grammar:
// Define a grammar called Hello
grammar GYOO;
program : 'begin' block+ 'end';
block
: statement+
;
statement
: assign
| print
| add
| ifstatement
| OTHER {System.err.println("unknown char: " + $OTHER.text);}
;
assign
: 'let' ID 'be' expression
;
print
: 'print' (NUMBER | ID)
;
ifstatement
: 'if' condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
add
: (NUMBER | ID) OPERATOR (NUMBER | ID) ASSIGN ID
;
stat_block
: OBRACE block CBRACE
| statement
;
condition_block
: expression stat_block
;
expression
: NOT expression //notExpr
| expression (MULT | DIV | MOD) expression //multiplicationExpr
| expression (PLUS | MINUS) expression //additiveExpr
| expression (LTEQ | GTEQ | LT | GT) expression //relationalExpr
| expression (EQ | NEQ) expression //equalityExpr
| expression AND expression //andExpr
| expression OR expression //orExpr
| atom //atomExpr
;
atom
: (NUMBER | FLOAT) //numberAtom
| (TRUE | FALSE) //booleanAtom
| ID //idAtom
| STRING //stringAtom
| NULL //nullAtom
;
ID : [a-z]+ ;
NUMBER : [0-9]+ ;
OPERATOR : '+' | '-' | '*' | '/';
ASSIGN : '=';
WS : (' ' | '\t' | '\r' | '\n') + -> skip;
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
AND : 'and';
EQ : 'is'; //'=='
NEQ : 'is not'; //'!='
GT : 'greater'; //'>'
LT : 'lower'; //'<'
GTEQ : 'is greater'; //'>='
LTEQ : 'is lower'; //'<='
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
MOD : '%';
POW : '^';
NOT : 'not';
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(HIDDEN)
;
OTHER
: .
;
When I try to display the tree with the -gui option from antlr, it shows me this error:
line 2:3 missing OPERATOR at 'a'
This error is given from this code example:
begin
let a be true
if a is true
print a
end
Basically, it does not recognize the ifstatement beginning with IF 'if', and it shows the tree as if I were making an assignment.
How can I fix this?
P.S. I also tried to reposition my statements. I also tried to remove all statements and leave only ifstatement, and the same thing happens.
Thanks
There is at least one issue:
ID : [a-z]+ ;
...
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
...
NOT : 'not';
Because ID is defined before TRUE ... NOT, the single-word keyword tokens among them will never be created: ID matches the same text, and when two lexer rules match the same number of characters, the rule defined first wins.
Start by moving ID beneath the NOT token.
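A quick way to spot this kind of shadowing is to check, for each keyword, whether an earlier pattern rule matches the keyword's text in full. The Python below is a minimal sketch of that check, not part of ANTLR. The `shadowed_literals` helper and the tuple layout are hypothetical, and only a handful of the question's rules are listed.

```python
import re


def shadowed_literals(rules):
    """Return (literal_rule, earlier_rule) pairs where the earlier pattern matches the whole literal."""
    shadowed = []
    for index, (name, kind, text) in enumerate(rules):
        if kind != "literal":
            continue
        for prev_name, prev_kind, prev_text in rules[:index]:
            if prev_kind == "pattern" and re.fullmatch(prev_text, text):
                shadowed.append((name, prev_name))
                break
    return shadowed


# Order as in the question: ID comes before the keywords.
original_order = [
    ("ID", "pattern", r"[a-z]+"),
    ("TRUE", "literal", "true"),
    ("IF", "literal", "if"),
    ("NEQ", "literal", "is not"),
    ("NOT", "literal", "not"),
]

# Order suggested in the answer: ID moved beneath NOT.
fixed_order = original_order[1:] + original_order[:1]

print(shadowed_literals(original_order))
print(shadowed_literals(fixed_order))
```

In this sketch 'is not' is not reported, because ID cannot match the space, so the longer NEQ match still wins for that input. The single-word keywords are the ones that need ID to come after them.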

Antlr not recognizing number

I have 3 types of numbers defined, number, decimal and percentage.
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Percentage and decimal work fine but when I assign a number, unless I put a sign (+ or -) in front of the number, it doesn't recognize it as a number.
number foo = +5 // does recognize
number foo = 5; // does not recognize
It does recognize it in an evaluation expression.
if (foo == 5 ) // does recognize
Here is my language (I took out the functions and left only the language recognition).
grammar Fetal;
transaction : begin statements end;
begin : 'begin' ;
end : 'end' ;
statements : (statement)+
;
statement
: declaration ';'
| command ';'
| assignment ';'
| evaluation
| ';'
;
declaration : type var;
var returns : identifier;
type returns
: DecimalType
| NumberType
| StringType
| BooleanType
| DateType
| ObjectType
| DaoType
;
assignment
: lharg Equals rharg
| lharg unaryOP rharg
;
assignmentOp : Equals
;
unaryOP : PlusEquals
| MinusEquals
| MultiplyEquals
| DivideEquals
| ModuloEquals
| ExponentEquals
;
expressionOp : arithExpressOp
| bitwiseExpressOp
;
arithExpressOp : Multiply
| Divide
| Plus
| Minus
| Modulo
| Exponent
;
bitwiseExpressOp
: And
| Or
| Not
;
comparisonOp : IsEqualTo
| IsLessThan
| IsLessThanOrEqualTo
| IsGreaterThan
| IsGreaterThanOrEqualTo
| IsNotEqualTo
;
logicExpressOp : AndExpression
| OrExpression
| ExclusiveOrExpression
;
rharg returns
: rharg expressionOp rharg
| '(' rharg expressionOp rharg ')'
| var
| literal
| assignmentCommands
;
lharg returns : var;
identifier : Identifier;
evaluation : IfStatement '(' evalExpression ')' block (Else block)?;
block : OpenBracket statements CloseBracket;
evalExpression
: evalExpression logicExpressOp evalExpression
| '(' evalExpression logicExpressOp evalExpression ')'
| eval
| '(' eval ')'
;
eval : rharg comparisonOp rharg ;
assignmentCommands
: GetBalance '(' stringArg ')'
| GetVariableType '(' var ')'
| GetDescription
| Today
| GetDays '(' startPeriod=dateArg ',' endPeriod=dateArg ')'
| DayOfTheWeek '(' dateArg ')'
| GetCalendarDay '(' dateArg ')'
| GetMonth '(' dateArg ')'
| GetYear '(' dateArg ')'
| Import '(' stringArg ')' /* Import( path ) */
| Lookup '(' sql=stringArg ',' argumentList ')' /* Lookup( table, SQL) */
| List '(' sql=stringArg ',' argumentList ')' /* List( table, SQL) */
| invocation
;
command : Print '(' rharg ')'
| Credit '(' amtArg ',' stringArg ')'
| Debit '(' amtArg ',' stringArg ')'
| Ledger '(' debitOrCredit ',' amtArg ',' acc=stringArg ',' desc=stringArg ')'
| Alias '(' account=stringArg ',' name=stringArg ')'
| MapFile ':' stringArg
| invocation
| Update '(' sql=stringArg ',' argumentList ')'
;
invocation
: o=objectLiteral '.' m=identifier '('argumentList? ')'
| o=objectLiteral '.' m=identifier '()'
;
argumentList
: rharg (',' rharg )*
;
amtArg : rharg ;
stringArg : rharg ;
numberArg : rharg ;
dateArg : rharg ;
debitOrCredit : charLiteral ;
literal
: numericLiteral
| doubleLiteral
| booleanLiteral
| percentLiteral
| stringLiteral
| dateLiteral
;
fileName : '<' fn=Identifier ('.' ft=Identifier)? '>' ;
charLiteral : ('D' | 'C');
numericLiteral : Number ;
doubleLiteral : Decimal ;
percentLiteral : Percentage ;
booleanLiteral : Boolean ;
stringLiteral : String ;
dateLiteral : Date ;
objectLiteral : Identifier ;
daoLiteral : Identifier ;
//Below are Token definitions
// Data Types
DecimalType : 'decimal' ;
NumberType : 'number' ;
StringType : 'string' ;
BooleanType : 'boolean' ;
DateType : 'date' ;
ObjectType : 'object' ;
DaoType : 'dao' ;
/******************************************************************
* Assignment operator
******************************************************************/
Equals : '=' ;
/*****************************************************************
* Unary operators
*****************************************************************/
PlusEquals : '+=' ;
MinusEquals : '-=' ;
MultiplyEquals : '*=' ;
DivideEquals : '/=' ;
ModuloEquals : '%=' ;
ExponentEquals : '^=' ;
/*****************************************************************
* Binary operators
*****************************************************************/
Plus : '+' ;
Minus : '-' ;
Multiply : '*' ;
Divide : '/' ;
Modulo : '%' ;
Exponent : '^' ;
/***************************************************************
* Bitwise operators
***************************************************************/
And : '&' ;
Or : '|' ;
Not : '!' ;
/*************************************************************
* Comparison operators
*************************************************************/
IsEqualTo : '==' ;
IsLessThan : '<' ;
IsLessThanOrEqualTo : '<=' ;
IsGreaterThan : '>' ;
IsGreaterThanOrEqualTo : '>=' ;
IsNotEqualTo : '!=' ;
/*************************************************************
* Expression operators
*************************************************************/
AndExpression : '&&' ;
OrExpression : '||' ;
ExclusiveOrExpression : '^^' ;
// Reserve words (Assignment Commands)
GetBalance : 'getBalance';
GetVariableType : 'getVariableType' ;
GetDescription : 'getDescription' ;
Today : 'today';
GetDays : 'getDays' ;
DayOfTheWeek : 'dayOfTheWeek' ;
GetCalendarDay : 'getCalendarDay' ;
GetMonth : 'getMonth' ;
GetYear : 'getYear' ;
Import : 'import' ;
Lookup : 'lookup' ;
List : 'list' ;
// Reserve words (Commands)
Credit : 'credit';
Debit : 'debit';
Ledger : 'ledger';
Alias : 'alias' ;
MapFile : 'mapFile' ;
Update : 'update' ;
Print : 'print';
IfStatement : 'if';
Else : 'else';
OpenBracket : '{';
CloseBracket : '}';
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Boolean : 'true' | 'false';
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Date : Year '-' Month '-' Day;
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
String: '"' ( ESC | ~[\\"] )* '"';
/************************************************************
* Fragment Definitions
************************************************************/
fragment
ESC : '\\' [abtnfrv"'\\]
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
fragment
Sign : Plus | Minus;
fragment
Digits
: [-+]?[0-9]+
;
fragment
Year
: Digit Digit Digit Digit;
fragment
Month
: Digit Digit;
fragment
Day
: Digit Digit;
fragment Dot : '.';
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| SimpleEscapeSequence
| '\\\n' // Added line
| '\\\r\n' // Added line
;
fragment
CChar
: ~['\\\r\n]
| SimpleEscapeSequence
;
fragment
SimpleEscapeSequence
: '\\' ['"?abfnrtv\\]
;
ExtendedAscii
: [\x80-\xfe]+
-> skip
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I have a hunch that this use of a fragment is incorrect:
fragment Sign : Plus | Minus;
I couldn't find anything in the reference book, but I think it needs to be changed to something like this:
fragment Sign : [+-];
I found the issue. I was using version 4.5.2-1 because every attempt to upgrade to 4.7 caused more errors and I didn't want to cause more errors while trying to solve another. I finally broke down and upgraded the libraries to 4.7, fixed the errors and the number recognition issue disappeared. It was a bug in the library, all this time.

Input string is rejected by my antlr4 grammar

I have created the grammar (see below), and when I try to validate the string
Time;15 16 * * 1-5; 'muni_eval_comments_'yyyyMMdd_HHmmss'.csv';
I get the following error message:
line 1:, charPositionInLine:57, msg: extraneous input '.csv' expecting {';', '-', ''', '.', '_', ID}
Where am I going wrong, and how can I fix it?
Regards,
Vladimir
lexer grammar FileTriggerLexer;
@header {
}
STEP
:
'/' INTEGER
;
SCHEDULE
:
'Schedule'
;
SEMICOLON
:
';'
;
ASTERISK
:
'*'
;
CRON
:
'cron'
;
MARKET_CRON
:
'marketCron'
;
COMBINED
:
'combined'
;
FILE_FEED
:
'FileFeed'
;
TIME: 'Time';
LBRACKET
:
'('
;
RBRACKET
:
')'
;
PERCENT
:
'%'
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
DOUBLE_QUOTE
:
'"'
;
QUOTE
:
'\''
;
SLASH
:
'/'
;
DOT
:
'.'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
REGEX
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
)+
;
WS
:
[ \t\r\n]+ -> skip
;
/**
* Define a grammar called Hello
*/
grammar FileTriggerValidator;
options
{
tokenVocab = FileTriggerLexer;
}
r
:
(
schedule
| file_feed
| time_feed
)+
;
time_feed
:
TIME SEMICOLON cron_part SEMICOLON file_name SEMICOLON
;
file_feed
:
file_feed_name SEMICOLON source_file SEMICOLON source_host SEMICOLON
source_host SEMICOLON regEx SEMICOLON regEx
(
SEMICOLON source_host
)*
;
formatString
:
source_host
(
'%' source_host?
)* DOT source_host
;
regEx
:
REGEX
;
source_host
:
ID
(
DASH ID
)*
;
file_feed_name
:
FILE_FEED
;
source_file
:
(
ID
| DASH
| UNDERSCORE
)+
;
schedule
:
SCHEDULE SEMICOLON schedule_defining SEMICOLON file_name SEMICOLON timezone
(
SEMICOLON INTEGER
)?
;
schedule_defining
:
cron
| market_cron
| combined_cron
;
cron
:
CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE RBRACKET
;
market_cron
:
MARKET_CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE COMMA
DOUBLE_QUOTE ID DOUBLE_QUOTE RBRACKET
;
combined_cron
:
COMBINED LBRACKET cron_list_element
(
COMMA cron_list_element
)* RBRACKET
;
mic_defining
:
ID
;
file_name
:
(
ID
| DOT
| QUOTE
| DASH
| UNDERSCORE
)+
;
cron_list_element
:
cron
| market_cron
;
//
schedule_defined_string
:
cron
;
//
cron_part
:
minutes hours days_of_month month week_days
;
//
minutes
:
INTEGER
| with_step_value
;
//
hours
:
INTEGER
| with_step_value
;
//
int_list
:
INTEGER
(
COMMA INTEGER
)*
;
interval
:
INTEGER DASH INTEGER
;
//
days_of_month
:
INTEGER
| with_step_value
;
//
month
:
INTEGER
| with_step_value
;
//
week_days
:
INTEGER
| with_step_value
;
//
timezone
:
timezone_part
(
SLASH timezone_part
)?
;
//
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
//
with_step_value
:
(
int_list
| interval
| ASTERISK
) STEP?
;
The .csv fragment of your input has been recognized as a single REGEX token:
TIME SEMICOLON(;) INTEGER(15) INTEGER(16) ASTERISK(*) ASTERISK(*)
INTEGER(1) DASH(-) INTEGER(5) SEMICOLON(;) QUOTE(') ID(muni) UNDERSCORE(_)
ID(eval) UNDERSCORE(_) ID(comments) UNDERSCORE(_) QUOTE(') ID(yyyyMMdd)
UNDERSCORE(_) ID(HHmmss) QUOTE(') REGEX(.csv) QUOTE(') SEMICOLON(;) EOF(<EOF>) EOF
But the file_name rule does not accept a REGEX token:
file_name
:
(
ID
| DOT
| QUOTE
| DASH
| UNDERSCORE
)+
;
Either add REGEX to the file_name rule, or remove the REGEX lexer rule and define regEx as a parser rule instead:
regEx
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
)+
;
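To make the mismatch concrete, the token sequence from the listing above can be checked against the set of token types that file_name accepts. The Python below is only an illustrative sketch. The `first_rejected` helper is hypothetical, and the token list is copied by hand from the listing rather than produced by ANTLR.

```python
# The input line from the question.
source = "Time;15 16 * * 1-5; 'muni_eval_comments_'yyyyMMdd_HHmmss'.csv';"

# Token types accepted by the file_name parser rule.
FILE_NAME_TYPES = {"ID", "DOT", "QUOTE", "DASH", "UNDERSCORE"}

# Tokens of the file name part, copied from the listing in the answer.
file_name_tokens = [
    ("QUOTE", "'"), ("ID", "muni"), ("UNDERSCORE", "_"), ("ID", "eval"),
    ("UNDERSCORE", "_"), ("ID", "comments"), ("UNDERSCORE", "_"), ("QUOTE", "'"),
    ("ID", "yyyyMMdd"), ("UNDERSCORE", "_"), ("ID", "HHmmss"), ("QUOTE", "'"),
    ("REGEX", ".csv"), ("QUOTE", "'"),
]


def first_rejected(tokens, allowed_types):
    """Return the first token whose type is not allowed, or None if all are accepted."""
    for token_type, text in tokens:
        if token_type not in allowed_types:
            return (token_type, text)
    return None


# With the original rule, the REGEX token is the first one file_name cannot consume.
print(first_rejected(file_name_tokens, FILE_NAME_TYPES))

# Its offset in the line matches charPositionInLine in the error message.
print(source.index(".csv"))

# Adding REGEX to the accepted types (the first suggested fix) lets every token through.
print(first_rejected(file_name_tokens, FILE_NAME_TYPES | {"REGEX"}))
```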

Parsing string interpolation in ANTLR

I'm working on a simple string manipulation DSL for internal purposes, and I would like the language to support string interpolation as it is used in Ruby.
For example:
name = "Bob"
msg = "Hello ${name}!"
print(msg) # prints "Hello Bob!"
I'm attempting to implement my parser in ANTLRv3, but I'm pretty inexperienced with using ANTLR so I'm unsure how to implement this feature. So far, I've specified my string literals in the lexer, but in this case I'll obviously need to handle the interpolation content in the parser.
My current string literal grammar looks like this:
STRINGLITERAL : '"' ( StringEscapeSeq | ~( '\\' | '"' | '\r' | '\n' ) )* '"' ;
fragment StringEscapeSeq : '\\' ( 't' | 'n' | 'r' | '"' | '\\' | '$' | ('0'..'9')) ;
Moving the string literal handling into the parser seems to make everything else stop working as it should. Cursory web searches didn't yield any information. Any suggestions as to how to get started on this?
I'm no ANTLR expert, but here's a possible grammar:
grammar Str;
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '(' (Identifier | stringLiteral) ')'
;
assignment
: Identifier (Space)* '=' (Space)* stringLiteral
;
stringLiteral
: '"' (Identifier | EscapeSequence | NormalChar | Space | Interpolation)* '"'
;
Interpolation
: '${' Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
As you may notice, there are a couple of (Space)*-es inside the example grammar. This is because stringLiteral is a parser rule instead of a lexer rule. Therefore, when tokenizing the source file, the lexer cannot know whether a white space is part of a string literal or just a space in the source file that can be ignored.
I tested the example with a little Java class and all worked as expected:
/* the same grammar, but now with a bit of Java code in it */
grammar Str;
@parser::header {
package antlrdemo;
import java.util.HashMap;
}
@lexer::header {
package antlrdemo;
}
@parser::members {
HashMap<String, String> vars = new HashMap<String, String>();
}
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '('
( id=Identifier {System.out.println("> "+vars.get($id.text));}
| st=stringLiteral {System.out.println("> "+$st.value);}
)
')'
;
assignment
: id=Identifier (Space)* '=' (Space)* st=stringLiteral {vars.put($id.text, $st.value);}
;
stringLiteral returns [String value]
: '"'
{StringBuilder b = new StringBuilder();}
( id=Identifier {b.append($id.text);}
| es=EscapeSequence {b.append($es.text);}
| ch=(NormalChar | Space) {b.append($ch.text);}
| in=Interpolation {b.append(vars.get($in.text.substring(2, $in.text.length()-1)));}
)*
'"'
{$value = b.toString();}
;
Interpolation
: '${' i=Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
And a class with a main method to test it all:
package antlrdemo;
import org.antlr.runtime.*;
public class ANTLRDemo {
public static void main(String[] args) throws RecognitionException {
String source = "name = \"Bob\"; \n"+
"msg = \"Hello ${name}\"; \n"+
"print(msg); \n"+
"print(\"Bye \\${for} now!\"); ";
ANTLRStringStream in = new ANTLRStringStream(source);
StrLexer lexer = new StrLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
StrParser parser = new StrParser(tokens);
parser.parse();
}
}
which produces the following output:
> Hello Bob
> Bye \${for} now!
Again, I am no expert, but this (at least) gives you a way to solve it.
HTH.
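For comparison, the substitution that the embedded actions perform can be sketched outside ANTLR. The Python below is an illustration only, not a replacement for the grammar. The `interpolate` function and its regular expression are hypothetical: they expand `${name}` from a dictionary and leave the three escape sequences from the grammar (`\"`, `\\`, `\$`) as written, which is how the stringLiteral action treats `$es.text`.

```python
import re

# Either one of the grammar's escape sequences, or an ${identifier} interpolation.
PATTERN = re.compile(r'\\["\\$]|\$\{([A-Za-z_][A-Za-z_0-9]*)\}')


def interpolate(body, variables):
    """Expand ${name} inside a string literal body; keep escape sequences unchanged."""
    def replace(match):
        if match.group(0).startswith("\\"):
            # Escape sequence: copied through verbatim, backslash included.
            return match.group(0)
        # Unknown names are a design choice here: this sketch raises KeyError.
        return variables[match.group(1)]
    return PATTERN.sub(replace, body)


variables = {"name": "Bob"}
variables["msg"] = interpolate("Hello ${name}", variables)

print("> " + variables["msg"])
print("> " + interpolate("Bye \\${for} now!", variables))
```

The two printed lines match the output of the Java test above, including the backslash that stays in front of the escaped `${for}`.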
