ANTLR4 grammar: rule accepts only part of the sentence

Hi all. I have created this grammar (it is part of a bigger grammar) to isolate my problem.
When I parse the string
00 */3,5 * * 5 America/New_York
I get the following exception:
line 1:4 no viable alternative at input '/'
As far as I can tell, the problem is that the substring */3,5 is not fully parsed by the "with_step_value" rule; instead, the parser consumes only the first symbol. But why? As I understand it, ANTLR tries to match as long a string as it can, and from my point of view the substring "*/3,5" satisfies the rule "with_step_value".
So why does this happen, and how can I fix it?
Regards,
Vladimir
Please see the grammars and pictures below.
/*File trigger validator lexer */
lexer grammar CronPartLexer;
INT_LIST: INTEGER (COMMA INTEGER)* ;
INTERVAL
:
INTEGER DASH INTEGER
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
SLASH
:
'/'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
ASTERISK:'*';
WS
:
[ \t\r\n]+ -> skip
;
grammar CronPartValidator;
options
{
tokenVocab = CronPartLexer;
}
cron_part
:
minutes hours days_of_month month week_days time_zone?;
minutes
:
with_step_value
;
time_zone
:
timezone_part
(
SLASH timezone_part
)?
;
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
hours
:
with_step_value
;
//
//
days_of_month
:
with_step_value
;
//
month
:
with_step_value
;
//
week_days
:
with_step_value
;
with_step_value:
INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?)
;
Parse Tree of the full string
Parse Tree of "with_step_value" "*/3,5"

The rule
with_step_value: INT_LIST|ASTERISK|INTERVAL ((SLASH INT_LIST)?) ;
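matches an INT_LIST, or an ASTERISK, or an INTERVAL optionally followed by SLASH INT_LIST. The alternation operator | has the lowest precedence, so the optional (SLASH INT_LIST)? belongs only to the INTERVAL alternative. For */3,5 the parser therefore matches ASTERISK alone and then fails on '/'.
Perhaps this is what was intended: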
with_step_value
: ( INT_LIST
| ASTERISK
| INTERVAL
) (SLASH INT_LIST)?
;
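The same precedence effect can be reproduced without ANTLR. The sketch below is only an analogy: it uses java.util.regex patterns as stand-ins for the tokens (the regexes and the class name AlternationPrecedenceDemo are assumptions made for illustration, not part of the grammar), and compares the ungrouped and grouped shapes of the rule.

```java
import java.util.regex.Pattern;

public class AlternationPrecedenceDemo {
    // Regex stand-ins for the lexer tokens (an assumption of this sketch).
    static final String INT_LIST = "\\d+(?:,\\d+)*";
    static final String INTERVAL = "\\d+-\\d+";
    static final String ASTERISK = "\\*";

    // Original shape: the optional step is attached to INTERVAL only.
    static final Pattern UNGROUPED = Pattern.compile(
            INT_LIST + "|" + ASTERISK + "|" + INTERVAL + "(?:/" + INT_LIST + ")?");

    // Corrected shape: the alternatives are grouped, then the optional step follows.
    static final Pattern GROUPED = Pattern.compile(
            "(?:" + INT_LIST + "|" + ASTERISK + "|" + INTERVAL + ")(?:/" + INT_LIST + ")?");

    static boolean matchesUngrouped(String s) {
        return UNGROUPED.matcher(s).matches();
    }

    static boolean matchesGrouped(String s) {
        return GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"*/3,5", "1-5/2", "3,5", "*"}) {
            System.out.println(s + " ungrouped=" + matchesUngrouped(s)
                    + " grouped=" + matchesGrouped(s));
        }
    }
}
```

Only the grouped pattern accepts */3,5, while 1-5/2 is accepted by both, which mirrors what the two versions of with_step_value do.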

Related

ANTLR4 grammar for SML choking on positive integer literals

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as many characters as possible, and when 2 or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL, and input like 0 will always be a CONSTANT. Input such as ~5 also becomes a CONSTANT rather than an INT, because CONSTANT matches everything INT matches and is defined first. NUM and DIGIT will never produce a token either, since the rules before them will be matched. The fact that NUM and DIGIT can never become tokens on their own makes them candidates for fragment rules:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
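To see the "longest match, then first rule wins" behaviour in isolation, here is a minimal simulation in plain Java. It is not ANTLR code: the rules are approximated with regexes (an assumption of this sketch), and the class name MaxMunchDemo is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxMunchDemo {
    // Lexer rules in definition order, approximated with regexes.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LABEL", Pattern.compile("[1-9][0-9]*"));
        RULES.put("CONSTANT", Pattern.compile("~?[0-9]+"));
        RULES.put("INT", Pattern.compile("~?[0-9]+"));
        RULES.put("NUM", Pattern.compile("[0-9]+"));
        RULES.put("DIGIT", Pattern.compile("[0-9]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is skipped, like the Whitespace rule
                continue;
            }
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1 0 10 01"));
    }
}
```

With these rules, 1 and 10 come out as LABEL while 0 and 01 come out as CONSTANT; NUM and DIGIT never win, which is why they are better declared as fragments.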
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So a unary operator like ~ is often better handled in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to distinguish a label (an integer that does not start with zero) from other integers, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;

ANTLR4: How to set different context within rule for the same tag?

I have the following grammar:
grammar SearchQuery;
queryDeclaration : predicateGroupItem predicateGroupItemWithBooleanOperator* ;
predicateGroupItemWithBooleanOperator : groupOperator predicateGroupItem ;
predicateGroupItem : LEFT_BRACKET variable variableDelimiter
expression expressionWithBoolean* RIGHT_BRACKET ;
variable : VARIABLE_STRING ;
variableDelimiter : VAR_DELIMITER ;
expressionWithBoolean : boolOperator expression ;
expression : value ;
value : polygonType;
boolOperator : or
;
or : OR ;
groupOperator : AND ;
polygonType : POLYGON LEFT_BRACKET pointList (POLYGON_DELIMITER pointList)* RIGHT_BRACKET ;
longType : LONG ;
doubleType : DOUBLE ;
pointList : point
| LEFT_BRACKET point ( POLYGON_DELIMITER point)* RIGHT_BRACKET
;
point : latitude longitude ;
latitude : longType
| doubleType
;
longitude : longType
| doubleType
;
POLYGON : [pP] [oO] [lL] [yY] [gG] [oO] [nN] ;
LONG : DIGIT+ ;
DOUBLE : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
AND : [aA] [nN] [dD] ;
OR : COMMA
| [oO] [rR]
;
VARIABLE_STRING : [a-zA-Z0-9.]+ ;
COMMA : ',' ;
POLYGON_DELIMITER : ';' ;
VAR_DELIMITER : ':' ;
RIGHT_BRACKET : ')' ;
LEFT_BRACKET : '(' ;
WS : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
The problem is that I can't use the COMMA token in different rules at the same time: in the polygonType and pointList rules (where I need COMMA instead of POLYGON_DELIMITER) and in the boolOperator rule (where COMMA is already used).
In other words, if we change POLYGON_DELIMITER to COMMA and
test the grammar with a value like this
(polygons: polygon(20 30.4, 23.4 23),
polygon(20 30.4, 23.4 23),
polygon(20 30.4, 23.4 23))
we get the error
mismatch input: ',' expecting {',', ')'}
I would be grateful if somebody could help me understand the problem.
P.S. If we leave the current grammar unchanged, the value for testing it is
(poligons: polygon(20 30.4; 23.4 23),
polygon(20 30.4; 23.4 23),
polygon(20 30.4; 23.4 23))
Because of these rules:
OR : COMMA
| [oO] [rR]
;
COMMA : ',' ;
the lexer will never produce a COMMA token since it is already matched by the OR token. And because OR is defined before COMMA, it gets precedence.
That is what the error message mismatch input: ',' expecting {',', ')'} really means. In other words: mismatch input: OR expecting {COMMA, RIGHT_BRACKET}
What you should do (if the OR operator can be either "or" or ",") is let the parser rule or match the COMMA:
or : OR
| COMMA
;
OR : [oO] [rR]
;
COMMA : ',' ;

ANTLR : issue with greedy rule

I never worked with ANTLR and generative grammars, so this is my first attempt.
I have a custom language I need to parse.
Here's an example:
-- This is a comment
CMD.CMD1:foo_bar_123
CMD.CMD2
CMD.CMD4:9 of 28 (full)
CMD.NOTES:
This is an note.
A line
(1) there could be anything here foo_bar_123 & $ £ _ , . ==> BOOM
(3) same here
CMD.END_NOTES:
Briefly, there could be 4 types of lines:
1) -- comment
2) <section>.<command>
3) <section>.<command>: <arg>
4) <section>.<command>:
<arg1>
<arg2>
...
<section>.<end_command>:
<section> is the literal "CMD"
<command> is a single word (uppercase, lowercase letters, numbers, '_')
<end_command> is the same word of <command> but preceded by the literal "end_"
<arg> could be any character
Here's what I've done so far:
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line | normal_line) NEWLINE;
comment_line : COMMENT ;
command_line : section '.' command ((COLON WHITESPACE*)? arg)? ;
normal_line : TEXT ;
section : CMD ;
command : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
TEXT : ~[\r\n]* ;
This is a test for my grammar:
$antlr4 MyGrammar.g4
warning(146): MyGrammar.g4:45:0: non-fragment lexer rule TEXT can match the empty string
$javac MyGrammar*.java
$grun MyGrammar root -tokens
CMD.NEW
[#0,0:6='CMD.NEW',<TEXT>,1:0]
[#1,7:7='\n',<NEWLINE>,1:7]
[#2,8:7='<EOF>',<EOF>,2:0]
The problem is that "CMD.NEW" gets swallowed by TEXT, because that rule is greedy.
Can anyone help me with this?
Thanks
There is an ambiguity, but it is in the lexer rather than in the parser.
In the example you have provided, CMD.NEW can be matched by the CMD rule (the first three characters) and by the TEXT rule (the whole line). The lexer always prefers the longest match, so it emits a single TEXT token, as your -tokens output shows.
Thus, given the rule:
line : (comment_line | command_line | normal_line) NEWLINE;
the parser never sees the CMD, '.' and WORD tokens that command_line needs; it only sees TEXT, so it matches normal_line.
Consider rewriting your grammar so that TEXT cannot swallow input that other rules are meant to match.
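The token dump in the question can be explained by comparing how many characters each lexer rule could consume at the start of the line. The following is a rough sketch in plain Java, not ANTLR itself: the regexes simplify the original rules (WORD in particular), and the class name LongestMatchDemo is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // A simplified subset of the lexer rules, in definition order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("CMD", Pattern.compile("CMD"));
        RULES.put("COLON", Pattern.compile(":"));
        RULES.put("WORD", Pattern.compile("[a-zA-Z0-9_]+"));
        RULES.put("TEXT", Pattern.compile("[^\\r\\n]*"));
    }

    // Number of characters each rule matches at the start of the input (-1 = no match).
    static Map<String, Integer> matchLengths(String input) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            lengths.put(rule.getKey(), m.lookingAt() ? m.end() : -1);
        }
        return lengths;
    }

    // The rule that wins: longest non-empty match, earliest rule on a tie.
    static String winner(String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Integer> e : matchLengths(input).entrySet()) {
            if (e.getValue() > bestLen) {
                best = e.getKey();
                bestLen = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(matchLengths("CMD.NEW"));
        System.out.println(winner("CMD.NEW"));
    }
}
```

For CMD.NEW, CMD and WORD can each consume 3 characters but TEXT can consume all 7, so TEXT wins even though it is defined last. Rule order only matters when the lengths are equal, as for the bare input CMD.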
UPDATE:
Try something like this (I have not tested it, and TEXT may still need to be restricted so that it does not match whole lines):
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line) NEWLINE;
comment_line : COMMENT ;
command_line : CMD '.' (note_cmd | command);
command : command_name ((COLON WHITESPACE*)? arg)? ;
note_cmd : notes .*? (CMD '.' END_NOTES) ;
command_name : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
NOTES : 'NOTES';
END_NOTES : 'END_NOTES';
TEXT : ~[\r\n]* ;

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out, I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed out of printable letters except for special characters "(", ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character that is in the commented out NOT_SPECIAL, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL in any way into the definition of searchterm, it doesn't work. I have tried putting it literally into the rule, too (commenting out NOT_SPECIAL), and many other things, but it just doesn't work. In most of my attempts the grammar just complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not be required to use quotes every time, because this is a command line tool and they'd need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
When you use lexical modes, you must define your lexer and parser rules in separate files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
prints the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>
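If it helps to see the mode mechanics without generating a lexer, here is a hand-written sketch of the same idea in plain Java: a mode stack that is pushed on '=' and popped on whitespace or a parenthesis. It is a simplification and not the generated MdbLexer: keywords, STRING and DIGIT are left out, and the class name ModeStackDemo is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ModeStackDemo {
    enum Mode { DEFAULT, NOT_SPECIAL }

    static boolean isWs(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    static boolean isIdChar(char c, boolean first) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        return letter || (!first && c >= '0' && c <= '9');
    }

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean inValue = modes.peek() == Mode.NOT_SPECIAL;
            if (isWs(c)) {
                if (inValue) modes.pop();      // WS2: skip and pop
                i++;                           // WS: skip
            } else if (c == '(' || c == ')') {
                if (inValue) modes.pop();      // OPAR2 / CPAR2: retype and pop
                out.add((c == '(' ? "OPAR " : "CPAR ") + c);
                i++;
            } else if (inValue) {
                int start = i;                 // NOT_SPECIAL: everything up to ws or paren
                while (i < src.length() && !isWs(src.charAt(i))
                        && src.charAt(i) != '(' && src.charAt(i) != ')') {
                    i++;
                }
                out.add("NOT_SPECIAL " + src.substring(start, i));
            } else if (c == '=') {
                modes.push(Mode.NOT_SPECIAL);  // EQ: push the value mode
                out.add("EQ =");
                i++;
            } else if (isIdChar(c, true)) {
                int start = i;
                while (i < src.length() && isIdChar(src.charAt(i), false)) i++;
                out.add("ID " + src.substring(start, i));
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        tokenize("Person Address=^%Street%%%$^&*#^()").forEach(System.out::println);
    }
}
```

For the test string above this produces the same token sequence as the ANTLR demo, apart from the final EOF token, which the sketch does not emit.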

In antlr4.7 how to parse a rule like ISO 8601 interval "P3M2D" ahead of an "ID" rule

I am trying to parse ISO 8601 period expressions like "P3M2D", using antlr4. But I am hitting some kind of roadblock and would appreciate help. I am rather new to both antlr and compilers.
My grammar is as below. I have combined the lexer and parser rules in one go here:
grammar test_iso ;
// import testLexerRules ;
iso : ( date_expr NEWLINE)* EOF;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
///////////////////////////////////////////
// in separate file : test_lexer.g4
// lexer grammar testLexerRules ;
///////////////////////////////////////////
fragment
TODAY
: 'today' | 'TODAY'
;
fragment
NOW
: 'now' | 'NOW'
;
DATETIME_NAME
: TODAY
| NOW
;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment
DIGIT : [0-9] ;
fragment
INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
//
// identifiers
//
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
fragment
ALPHA : [a-zA-Z] ;
fragment
ALPH_NUM : [a-zA-Z_0-9] ;
fragment
ALPHA_UPPER : [A-Z] ;
fragment
ALPHA_UPPER_NUM : [A-Z_0-9] ;
//////////////////////////////////////////////
NEWLINE : '\r\n' ;
WS : [ \t]+ -> skip ;
In a test run, it never reaches the iso8601_interval_d rule; the input always goes to the ID rule.
C:\lab>java org.antlr.v4.gui.TestRig test_iso iso -tokens -tree
now + P3M2D
^Z
ID seen P3M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:10='P3M2D',<ID>,1:6]
[#3,11:12='\r\n',<'
'>,1:11]
[#4,13:12='<EOF>',<EOF>,2:0]
line 1:6 mismatched input 'P3M2D' expecting 'P'
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P3M2D))) \r\n <EOF>)
If I remove the "ID" rule and run again, it parses as desired:
now + P3M2D
^Z
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:6='P',<'P'>,1:6]
[#3,7:7='3',<NUMBER_INT>,1:7]
[#4,8:8='M',<'M'>,1:8]
[#5,9:9='2',<NUMBER_INT>,1:9]
[#6,10:10='D',<'D'>,1:10]
[#7,11:12='\r\n',<'
'>,1:11]
[#8,13:12='<EOF>',<EOF>,2:0]
ISO8601_INTERVAL DATE seen P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d P 3 M 2 D))) \r\n <EOF>)
I also tried prefixing a special character like "#" in the parser rule
iso8601_interval_d
: '#P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
but now I get a different kind of failure:
now + #P3M2D
^Z
ID seen M2D
[#0,0:2='now',<DATETIME_NAME>,1:0]
[#1,4:4='+',<'+'>,1:4]
[#2,6:7='#P',<'#P'>,1:6]
[#3,8:8='3',<NUMBER_INT>,1:8]
[#4,9:11='M2D',<ID>,1:9]
[#5,12:13='\r\n',<'
'>,1:12]
[#6,14:13='<EOF>',<EOF>,2:0]
line 1:9 no viable alternative at input '3M2D'
ISO8601_INTERVAL DATE seen #P3M2D
(iso (date_expr (date_expr now) + (iso8601_interval (iso8601_interval_d #P 3 M2D))) \r\n <EOF>)
I am sure I am not the first one to hit upon something like this. What is the antlr idiom here?
EDIT -- I need the ID token elsewhere in other parts of my grammar that I have omitted here, to highlight the problem I am facing.
As others have already found, the issue is the ID token: the ISO 8601 duration syntax is also a valid ID. Besides the solution suggested by @Mike, if a so-called island grammar suits your needs, you can use ANTLR's lexical modes to exclude the ID lexer rule while parsing an ISO date.
Below is an example of how it could work:
parser grammar iso;
options { tokenVocab=iso_lexer; }
iso : ISO_BEGIN ( date_expr NEWLINE)* ISO_END;
date_expr
: date_expr op=( '+' | '-' ) iso8601_interval #dateexpr_Interval
| date_expr op='-' date_expr #dateexpr_Diff
| DATETIME_NAME #dateexpr_Named
| '(' inner=date_expr ')' #dateexpr_Paren
;
///////////////////////////////////////////
iso8601_interval
: iso8601_interval_d
{ System.out.println("ISO8601_INTERVAL DATE seen " + $text);}
;
iso8601_interval_d
: 'P' ( y=NUMBER_INT 'Y' )? ( m=NUMBER_INT 'M' )? ( w=NUMBER_INT 'W' )? ( d=NUMBER_INT 'D' )?
;
then in the lexer
lexer grammar iso_lexer;
//
// identifiers (in DEFAULT_MODE)
//
ISO_BEGIN
: '<#' -> mode(ISO)
;
ID
: ALPHA ALPH_NUM*
{ System.out.println("ID seen " + getText()); }
;
ID_SQLFUNC
: 'h$' ALPHA_UPPER ALPHA_UPPER_NUM*
{ System.out.println("SQL FUNC seen " + getText()); }
;
WS0 : [ \t]+ -> skip ;
// all the following token are scanned only when iso mode is active
mode ISO;
ISO_END
: '#>' -> mode(DEFAULT_MODE)
;
ISO_WS : [ \t]+ -> skip ; // renamed: a rule name cannot be reused in another mode
NEWLINE : '\r'? '\n' ;
ADD : '+' ;
SUB : '-' ;
LPAREN : '(' ;
RPAREN : ')' ;
P : 'P' ;
Y : 'Y' ;
M : 'M' ;
W : 'W' ;
D : 'D' ;
DATETIME_NAME
: TODAY
| NOW
;
fragment TODAY: 'today' | 'TODAY' ;
fragment NOW : 'now' | 'NOW' ;
///////////////////////////////////////////
NUMBER_INT
: '-'? INT // -3, 45
;
fragment DIGIT : [0-9] ;
fragment INT : '0' | [1-9] DIGIT* ;
//////////////////////////////////////////////
fragment ALPHA : [a-zA-Z] ;
fragment ALPH_NUM : [a-zA-Z_0-9] ;
fragment ALPHA_UPPER : [A-Z] ;
fragment ALPHA_UPPER_NUM : [A-Z_0-9] ;
Such a grammar can parse expressions like
Pluton Planet <# now + P10Y
#>
I changed the parser rule iso a bit to demonstrate mixing IDs and periods.
Hope this helps.
What you want to do is not possible as written. ID matches the same input as iso8601_interval. In such cases ANTLR4 picks the longest match, which is ID, as it can match an unlimited number of characters.
The only way you could possibly make it work in the grammar is to exclude P as a possible ID introducer, so that P can be used exclusively for the duration.
Another option is a post-processing step: parse durations like any other identifier and, in your semantic phase, check all those IDs that look like a duration. This is probably the best solution.
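A post-processing check of that kind could look roughly like the sketch below. It assumes the ID text is available as a plain string (for example from the token's getText()), uses a regex with the same Y/M/W/D shape as iso8601_interval_d, and additionally requires at least one component so that a bare P is not treated as a duration. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurationCheck {
    // P followed by optional nY, nM, nW, nD in that order.
    // The lookahead (?=\d) demands at least one component (an assumption of this sketch).
    private static final Pattern DURATION = Pattern.compile(
            "P(?=\\d)(?:(\\d+)Y)?(?:(\\d+)M)?(?:(\\d+)W)?(?:(\\d+)D)?");

    private static final String[] UNITS = {"Y", "M", "W", "D"};

    // True if an identifier's text has the shape of a duration.
    static boolean looksLikeDuration(String idText) {
        return DURATION.matcher(idText).matches();
    }

    // Splits a duration-shaped identifier into its components, keyed by unit letter.
    static Map<String, Integer> parse(String idText) {
        Matcher m = DURATION.matcher(idText);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a duration: " + idText);
        }
        Map<String, Integer> parts = new LinkedHashMap<>();
        for (int g = 1; g <= UNITS.length; g++) {
            if (m.group(g) != null) {
                parts.put(UNITS[g - 1], Integer.parseInt(m.group(g)));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String id : new String[] {"P3M2D", "P10Y", "Planet", "P"}) {
            System.out.println(id + " -> " + looksLikeDuration(id));
        }
        System.out.println(parse("P3M2D"));
    }
}
```

In a listener or visitor, such a check would be applied to each ID that appears where a duration is allowed, leaving all other identifiers untouched.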
