antlr4 matches longest one - parsing

I am working on an ANTLR4 grammar file. When I change the definition of the ID rule from
ID :[A-Z]+;
to
ID: [A-Z][A-Za-z0-9_]* ;
I get these errors:
line 1:7 mismatched input 'E550' expecting {'W', 'I'}
line 1:12 mismatched input ';' expecting {'W', 'I'}
Actually, I know the reason: the lexer matches the longest possible token. But I have to define ID in the way that causes the error, and in my foo rule the ID must be followed by E or I and a number. How can I make this work? Any help is appreciated.
Here is the input that causes the error:
QUEST E550 ;
Here is my grammar
grammar test;
block: foo+;
foo:ID op=(WARNING|INFORMATION)INT SCOL;
SCOL :';';
WARNING :'W';
INFORMATION :'I';
ID: [A-Z]+ ;
//if I change to ID: [A-Z][A-Za-z0-9_]* ; error occurs
INT : [0-9]+;
SPACE: [ \t\r\n] -> skip;
OTHER: . ;

If your ID rule cannot start with W, I or E, then you will need to exclude those from the start:
ID: [A-DF-HJ-VX-Z] [A-Za-z0-9_]* ;
Of course, input like EEEEE will then no longer become an ID. To account for such cases, let ID match either (1) a single uppercase letter other than W, I or E followed by the rest, or (2) two uppercase letters followed by the rest:
ID
: [A-DF-HJ-VX-Z] [A-Za-z0-9_]* // (1)
| [A-Z] [A-Z] [A-Za-z0-9_]* // (2)
;
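To see why the exclusion matters, here is a minimal Python sketch of the lexer's choice at a single position. It assumes ANTLR's usual policy (the longest match wins, and ties go to the rule listed first) and approximates each lexer rule with a Python regular expression. The helper `winner` and the sample input `I550` are hypothetical and only serve as illustration.

```python
import re

def winner(rules, text):
    """Return (rule_name, matched_text) for the rule that wins at the start of text.

    Assumed policy: the longest match wins; on a tie, the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when two matches have the same length.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

HEAD = [("SCOL", r";"), ("WARNING", r"W"), ("INFORMATION", r"I")]
TAIL = [("INT", r"[0-9]+"), ("OTHER", r".")]

# ID as in the question: it can swallow the whole of "I550".
greedy = HEAD + [("ID", r"[A-Z][A-Za-z0-9_]*")] + TAIL
# ID built from alternatives (1) and (2) of the answer.
restricted = HEAD + [
    ("ID", r"[A-DF-HJ-VX-Z][A-Za-z0-9_]*|[A-Z][A-Z][A-Za-z0-9_]*")
] + TAIL

print(winner(greedy, "I550"))       # ('ID', 'I550')
print(winner(restricted, "I550"))   # ('INFORMATION', 'I')
print(winner(restricted, "QUEST"))  # ('ID', 'QUEST')
print(winner(restricted, "EEEEE"))  # ('ID', 'EEEEE')
```

With the restricted ID, the `I` in `I550` is left for INFORMATION, so INT can pick up `550` afterwards.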

Related

ANTLR grammar not working as expected. What am I doing wrong?

I have this grammar below for implementing an IN operator taking a list of numbers or strings.
grammar listFilterExpr;
entityIdNumberProperty
: 'a.Id'
| 'c.Id'
| 'e.Id'
;
entityIdStringProperty
: 'f.phone'
;
listFilterExpr
: entityIdNumberListFilter
| entityIdStringListFilter
;
listOperator
: '$in:'
;
entityIdNumberListFilter
: entityIdNumberProperty listOperator numberList
;
entityIdStringListFilter
: entityIdStringProperty listOperator stringList
;
numberList: '[' ID (',' ID)* ']';
fragment ID: [1-9][0-9]*;
stringList: '[' STRING (',' STRING)* ']';
STRING
: '"'(ESC | SAFECODEPOINT)*'"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
If I try to parse the following input:
c.Id $in: [1,1]
Then I get the following error in the parser:
mismatched input '1' expecting ID
Please help me to correct this grammar.
Update
I found the following rule much higher up in my project's huge grammar file; it might be matching '1' before ID gets a chance to:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
But if I put my ID rule before NUMBER, other things fail, because input that should have matched NUMBER is then matched as ID.
What should I do?
As mentioned by rici: ID should not be a fragment. Fragments can only be used by other lexer rules; they never become tokens on their own (and therefore cannot be used in parser rules).
Just remove the fragment keyword from it: ID: [1-9][0-9]*;
Note that you'll also have to account for spaces. You probably want to skip them:
SPACES : [ \t\r\n] -> skip;
...
mismatched input '1' expecting ID
...
This looks like there's another lexer rule, besides ID, that also matches the input 1 and is defined before ID. In that case, have a look at this Q&A: ANTLR 4.5 - Mismatched Input 'x' expecting 'x'
EDIT
Because you have the rules ordered like this:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
ID
: [1-9][0-9]*
;
the lexer will never create an ID token (only NUMBER tokens will be created). This is just how ANTLR works: when two or more lexer rules match the same number of characters, the one defined first "wins".
In the first place I think it's odd to have an ID rule that matches only digits, but, if that's the language you're parsing, OK. In your case, you could do something like this:
id : POS_NUMBER;
number : POS_NUMBER | NEG_NUMBER;
POS_NUMBER : INT ('.' [0-9] +)?;
NEG_NUMBER : '-' POS_NUMBER;
fragment INT
: '0' | [1-9] [0-9]*
;
and then use id instead of ID in your parser rules, and number instead of the NUMBER you're using now.
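The idea of giving the lexer one unambiguous token type per text and letting parser rules decide the role can be sketched in Python. This is only an illustration under assumptions: the regular expressions approximate POS_NUMBER and NEG_NUMBER, and `token_type` and `accepted_by` are hypothetical helpers, not ANTLR API.

```python
import re

INT = r"(?:0|[1-9][0-9]*)"
POS_NUMBER = re.compile(INT + r"(?:\.[0-9]+)?")
NEG_NUMBER = re.compile(r"-" + POS_NUMBER.pattern)

def token_type(text):
    """Lexer side: each text gets exactly one token type, independent of context."""
    if POS_NUMBER.fullmatch(text):
        return "POS_NUMBER"
    if NEG_NUMBER.fullmatch(text):
        return "NEG_NUMBER"
    return None

# Parser side: the rules 'id' and 'number' list which token types they accept.
PARSER_RULES = {
    "id": {"POS_NUMBER"},
    "number": {"POS_NUMBER", "NEG_NUMBER"},
}

def accepted_by(rule, text):
    return token_type(text) in PARSER_RULES[rule]

print(accepted_by("id", "1"))         # True
print(accepted_by("number", "1"))     # True
print(accepted_by("number", "-2.5"))  # True
print(accepted_by("id", "-2.5"))      # False
```

Note that in this sketch `id` also accepts `1.5` and `0`, because POS_NUMBER covers them; rejecting those would need an extra check.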

Comment conflict in HQL grammar

I am trying to create the --i; statement.
But my problem lies with the single line comment rule of HQL which states:
L_S_COMMENT : ('--' | '//') .*? '\r'? '\n' -> channel(HIDDEN) ;
I wrote the rules in the lexer:
T_SUB2 : '--' ;
T_SEMICOLON : ';' ;
Rule in parser:
dummy_rule: T_SUB2 'i' T_SEMICOLON ;
When I test the rule, it works fine and the parse tree is displayed correctly. But when I press ENTER for a new line, it shows an error and won't accept any more rules. I know the L_S_COMMENT rule is responsible, because the rule works fine when I remove it.
But deleting it is not an ideal solution. Any ideas what might cause this and how to work around it?
If the relevant statements always have to be terminated by a SEMI, then effectively exclude them from the comment definition:
COMMENT
: ( CMark .*? Vws
| DMark .*? ~[; \t\r\n\f] Hws* Vws
) -> channel(HIDDEN)
;
fragment CMark : '//' ;
fragment DMark : '--' ;
fragment Hws : [ \t] ;
fragment Vws : [\r\n]+ ;
Explanation
The first alt of the rule matches a standard // comment
The second alt will match a -- comment iff the one visible character immediately prior to the terminating whitespace is not a SEMI. The ~ is set negation, while [; \t\r\n\f] is a set of characters. Since there is no operator modifying the set, ~[; \t\r\n\f] will match only a single character that is not one of the named characters.
Hence, the comment rule will not match the terminal portion of a line of code that contains -- and terminates in a SEMI.
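The effect of the two alternatives can be tried out with a Python regular expression. This is a rough approximation, not the ANTLR rule itself: for simplicity it keeps the comment body on a single line, and `leading_comment` is a hypothetical helper name.

```python
import re

# '//' comments always match. '--' comments match only when the last
# visible character before the line break is not ';'.
COMMENT = re.compile(
    r"//[^\r\n]*[\r\n]+"
    r"|--[^\r\n]*[^; \t\r\n\f][ \t]*[\r\n]+"
)

def leading_comment(text):
    """Return the comment at the start of text, or None if there is none."""
    m = COMMENT.match(text)
    return m.group() if m else None

print(repr(leading_comment("-- a comment\n")))  # '-- a comment\n'
print(repr(leading_comment("// x;\n")))         # '// x;\n'
print(repr(leading_comment("--i;\n")))          # None
print(repr(leading_comment("--i; \n")))         # None
```

Because `--i;` is not swallowed as a comment, the lexer is free to produce T_SUB2, the identifier and T_SEMICOLON for it.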

Ordering lexer rules in a grammar using ANTLR4

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, such as in the example sentence I gave above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD" then I decided to put the punctuation in "item" instead of in both of "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
And run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference:
ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[#0,0:4='Hello',<WORD>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<WORD>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<WORD>,1:18]
[#6,22:24='are',<WORD>,1:22]
[#7,26:28='you',<WORD>,1:26]
[#8,30:31='at',<WORD>,1:30]
[#9,33:36='nine',<WORD>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule:
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[#0,0:4='Hello',<ID>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<ID>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<ID>,1:18]
[#6,22:24='are',<ID>,1:22]
[#7,26:28='you',<ID>,1:26]
[#8,30:31='at',<ID>,1:30]
[#9,33:36='nine',<ID>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
and there are no more errors. As the -gui graphic shows, you now have branches identified as word or func.
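The two token listings can be reproduced with a small Python simulation of the lexer. It is a sketch under assumptions: each lexer rule is approximated by a regular expression (WORD as "one or more characters that are not punctuation, space or one of `|{}<>`"), the longest match wins, and ties go to the rule listed first. The function `tokenize` is hypothetical, not part of ANTLR.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on equal length the rule listed first wins. WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

LITERALS = [("'<<'", r"<<"), ("'>>'", r">>")]  # implicit tokens from the func rule
WS = ("WS", r"[ \t\n\r]")
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
WORD = ("WORD", r"[^.,?! |{}<>]+")             # LETTER | DIGIT | SYMB, repeated
PUNCT = ("PUNCT", r"[.,?!]")

word_first = LITERALS + [WS, WORD, ID, PUNCT]
id_first = LITERALS + [WS, ID, WORD, PUNCT]

text = "Hello << name >>, how are you at nine o'clock?"
print([name for name, _ in tokenize(word_first, text)])
print([name for name, _ in tokenize(id_first, text)])
```

With WORD first, `name` ties between WORD and ID and becomes a WORD. With ID first it becomes an ID, while `o'clock` is still a WORD because WORD matches more characters there.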
As "500 - Internal Server Error" already mentioned in his comment, ANTLR tries lexer rules in the order they are defined in the grammar: when several rules match the same text, the topmost one wins, and once input has been matched, ANTLR won't try to match it differently.
In your case the WORD and ID rules can both match input like abc, but as WORD is declared first, abc will always be matched as a WORD and never as an ID. In fact, ID will never be matched at all, as there is no valid ID input that cannot also be matched by WORD.
If your only goal is to replace whatever is between << and >>, you'd be better off using regular expressions. If you still want to use ANTLR for it, reduce your grammar to the essentials, that is, distinguishing input between << and >> from any other input. Your grammar should then look something like this:
start: (INTERESTING | UNINTERESTING) ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;
Or you could skip the UNINTERESTING completely.
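For the regular-expression route, a minimal Python sketch could look like the following. The tag syntax (an identifier of letters and digits between `<<` and `>>`, with optional spaces) is taken from the question; the function name `fill` and the choice to leave unknown tags untouched are assumptions.

```python
import re

# An identifier (letter, then letters or digits) between << and >>.
TAG = re.compile(r"<<\s*([A-Za-z][A-Za-z0-9]*)\s*>>")

def fill(template, values):
    """Replace each << name >> tag with values[name]; unknown tags are left as they are."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), template)

print(fill("Hello << name >>, how are you?", {"name": "Alice"}))
# Hello Alice, how are you?
```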

Require newline or EOF after statement match

Just looking for a simple way of getting ANTLR4 to generate a parser that will do the following (ignore anything after the ;):
int #i ; defines an int
int #j ; see how I have to go to another line for another statement?
My grammar is as follows:
compilationUnit:
(statement END?)*
statement END?
EOF
;
statement:
intdef |
WS
;
// 10 - 1F block.
intdef:
'intdef' Identifier
;
// Lexer.
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
// Whitespace, fragments and terminals.
WS: [ \t\r\n\u000C]+ -> skip;
//COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
END: (';' ~[\r\n]*) | '\n';
In essence, any time I have a statement, I need it to REQUIRE a newline before another is entered. I don't care if there's 3 new lines and then on the second one a bunch of tabs persist, as long as there's a new line.
The issue is, the ANTLR4 Parse Tree seems to be giving me errors for inputs such as:
.
(Pretend the dot isn't there; it's literally no input)
int #i int #j
Whoops, we got two on the same line!
Any ideas on how I can achieve this? I appreciate the help.
I've simplified your grammar a bit but made it require an end-of-line sequence after each statement to parse correctly.
grammar Testnl;
program: (statement )* EOF ;
statement: 'int' Identifier EOL;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? '\r\n'
| ';' .*? '\n'
;
WS: [ \t\r\n\u000C]+ -> skip;
It parses
int #i ;
int #j;
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:9=';\r\n',<EOL>,1:7]
[#3,10:12='int',<'int'>,2:0]
[#4,14:15='#j',<Identifier>,2:4]
[#5,16:18=';\r\n',<EOL>,2:6]
[#6,19:18='<EOF>',<EOF>,3:0]
It also ignores anything after the semicolon, treating it as part of the EOL token:
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:20='; ignore this\n',<EOL>,1:7]
[#3,21:23='int',<'int'>,2:0]
[#4,25:26='#j',<Identifier>,2:4]
[#5,27:28=';\n',<EOL>,2:6]
[#6,29:28='<EOF>',<EOF>,3:0]
Either line feed or carriage return + line feed works as the line ending. Is that what you're looking for?
EDIT
Per the OP's comment, I made a small change to allow consecutive line breaks after a statement, and also moved the EOL token from statement to program to reduce repetition:
grammar Testnl;
program: ( statement EOL )* EOF ;
statement: 'int' Identifier;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? ('\r\n')+
| ';' .*? ('\n')+
;
WS: [ \t\r\n\u000C]+ -> skip;
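To check the intended behaviour quickly outside ANTLR, the same idea (a statement must end in `;`, optional trailing text, and at least one line break) can be approximated with a Python regular expression. This is a sketch, not a translation of the generated parser; `parse` is a hypothetical helper that returns the declared identifiers.

```python
import re

WS = r"[ \t\r\n\f]*"
# 'int' Identifier, then EOL: ';', any text up to the line end, one or more line breaks.
STATEMENT = re.compile(
    WS + r"int" + WS + r"(#[a-zA-Z_][a-zA-Z0-9$_]*)" + WS
    + r";[^\n]*?(?:\r?\n)+"
)

def parse(text):
    """Return the identifiers of all statements, or raise ValueError on leftover input."""
    names, pos = [], 0
    while True:
        m = STATEMENT.match(text, pos)
        if not m:
            break
        names.append(m.group(1))
        pos = m.end()
    if text[pos:].strip(" \t\r\n\f"):
        raise ValueError(f"unexpected input at offset {pos}: {text[pos:]!r}")
    return names

print(parse("int #i ; defines an int\nint #j ;\n"))  # ['#i', '#j']
print(parse(""))                                     # []
try:
    parse("int #i int #j\n")
except ValueError as err:
    print("rejected:", err)
```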

Dealing with overloaded symbols in ambiguous grammars in ANTLR4

I am trying to write a parser for a dialect of Answer Set Programming (ASP) which, in terms of grammar, looks like Prolog with some extensions.
One extension, for instance, is expansion, meaning that fact(1..3). is expanded into fact(1). fact(2). fact(3).. Notice that the language understands INT and FLOAT numbers and also uses . as a terminator.
In some cases the parser fails to distinguish between integers, floats, extensions and separators, because I reckon the language is clearly ambiguous. In those cases, I have to explicitly separate tokens with white space. Any Prolog or ASP parser, however, deals with such productions correctly. I read that ANTLR4 can disambiguate problematic productions autonomously, but it probably needs some help and I don't know how to provide it! ;-) I read something like here and here, but apparently they did not help me.
Could somebody please tell me what to do to overcome this ambiguity?
Please notice that I cannot change the language because it is quite standard.
In order to simplify the experts' work, I created a minimal working example that follows.
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum: // not needed, but helps in TestRig
FLOAT;
range: // defines an expansion
INT DOTS INT ;
DOTS: '..';
DOT: '.';
FLOAT: DIGIT+ '.' DIGIT* | '.' DIGIT+ ;
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
I use the following input:
1 .
1. .
1.5 .
.5 .
1 .. 5 .
1.
1..
1.5.
.5.
1..5.
And I get the following errors, although other tools parse this input correctly:
line 8:0 extraneous input '1.' expecting '.'
line 11:2 extraneous input '.5' expecting '.'
Many thanks in advance!
Before your DOTS rule, add a unique rule for the statement terminal dot and disambiguate the DOTS rule (and change your other rules to use the TERMINAL):
TERMINAL: DOT { isTerminal(1) }? ;
DOTS: DOT DOT { !isTerminal(2) }? ;
DOT: '.';
where the predicate method simply looks ahead on the _input character stream to see whether the character right after the current token is white space. Put something like this in a @members block in your grammar:
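// Needs org.antlr.v4.runtime.misc.Interval imported in the grammar's @header block.
public boolean isTerminal(int la) {
    // _tokenStartCharIndex is the index of the token's first character,
    // so the character following a token of length la is at start + la.
    int offset = _tokenStartCharIndex + la;
    if (offset >= _input.size()) {
        return true; // assumption: end of input also terminates a statement
    }
    String s = _input.getText(Interval.of(offset, offset));
    return Character.isWhitespace(s.charAt(0));
}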
May have to do a bit more work if whitespace is valid between a DOTS and the trailing INT.
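The lookahead idea behind the predicates, and the caveat about whitespace after `..`, can be illustrated with a small Python sketch. It assumes that the character to inspect is the one immediately after the token and that the end of input counts as a terminator; `is_terminal` and `classify_dot` are hypothetical helpers, not ANTLR API.

```python
def is_terminal(text, token_start, token_length):
    """True if the character right after the token is whitespace, or the input ends there."""
    offset = token_start + token_length
    return offset >= len(text) or text[offset].isspace()

def classify_dot(text, i):
    """Classify the dot at index i as DOTS, TERMINAL or DOT, trying the longest candidate first."""
    if text[i] != ".":
        raise ValueError("no dot at index %d" % i)
    if text[i:i + 2] == ".." and not is_terminal(text, i, 2):
        return "DOTS"
    if is_terminal(text, i, 1):
        return "TERMINAL"
    return "DOT"

print(classify_dot("1..5.", 1))     # DOTS
print(classify_dot("1..5.", 4))     # TERMINAL
print(classify_dot("1 .\n", 2))     # TERMINAL
print(classify_dot("1 .. 5 .", 2))  # DOT
```

The last call shows the case that needs more work: with a space after `..`, neither the DOTS nor the TERMINAL predicate holds for the first dot.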
I recommend shifting the work to the parser.
If the lexer can't decide whether 1..2 is 1. .2 or 1 .. 2, leave it up to the parser.
Maybe there is a context in which it can be interpreted as the first alternative and another context in which it may be interpreted as the second alternative.
Btw: 1..2. could be interpreted as 1 .. 2 . (range) or as 1. . 2 . (floatNum, intNum). How do you want to deal with this?
The following grammar should parse everything. But note that . . is treated as dots, just as 1 . 23 is treated as a floatNum! You can check for these, though, while parsing or after parsing (depending on whether it should influence the parsing or not).
grammar Test;
program:
statement* ;
statement: // DOT is the statement terminator
range DOT |
intNum DOT |
floatNum DOT ;
intNum: // not needed, but helps in TestRig
INT;
floatNum:
INT DOT INT? | DOT INT ;
range: // defines an expansion
INT dots INT ;
dots : DOT DOT;
DOT: '.';
INT: DIGIT+ ;
WS: [ \t\r\n]+ -> skip ;
fragment NONZERO : [1-9] ;
fragment DIGIT : [0] | NONZERO ;
Prolog does not accept 1. as a float. This feature makes your grammar significantly more ambiguous, so maybe try removing that feature.

Resources