ANTLR4 grammar not behaving as expected - parsing

I have some data required to be parsed. I am using ANTLR4 tool to auto generate java parsers and lexers, that I can use to form a structured data from the input data given below
Grammar:
grammar SUBDATA;
subdata:
data+;
data:
array;
array:
'[' obj (',' obj)* ']';
intarray:
'[' number (',' number)* ']';
number:
INT;
obj:
'{' pair (',' pair)* '}';
pair:
key '=' value;
key:
WORD;
value:
INT | WORD | intarray;
WORD:
[A-Za-z0-9]+;
INT:
[0-9]+;
WS:
[ \t\n\r]+ -> skip;
Test Input Data:
[
{OmedaDemographicType=1, OmedaDemographicId=100, OmedaDemographicValue=4},
{OmedaDemographicType=1, OmedaDemographicId=101, OmedaDemographicValue=26},
{
OmedaDemographicType=2, OmedaDemographicId=102, OmedaDemographicValue=[16,34]
}
]
Ouput:
line 5:79 mismatched input '16' expecting INT
line 5:82 mismatched input '34' expecting INT
Parser is failing although I have the integer value at the above expected position.

You've made the classic mistake of not ordering your lexer rules properly. You should read and understand the priority rules and their consequences.
In your case, INT will never be able to match since the WORD rule can match everything the INT rule can, and it's defined first in the grammar. These 16 and 32 from the example are WORDs.
You should remove the ambiguity by not allowing a word to start with a digit:
WORD:
[A-Za-z] [A-Za-z0-9]*;
INT:
[0-9]+;
Or by swapping the order of the rules:
INT:
[0-9]+;
WORD:
[A-Za-z0-9]+;
In this case, you can't have words that are fully numeric, but they will still be able to start with a number.

Related

Matching parentheses in ANTLR

I'm new to Antlr and I met one issue with parentheses matching recently. Each node in the parse tree has the form (Node1,W1,W2,Node2), where Node1 and Node2 are two nodes and W1 and W2 are two weights between them. Given an input file as (1,C,10,2).((2,P,2,3).(3,S,3,2))*.(2,T,2,4), the parse tree looks wrong, where the operator is not the parent of those nodes and the parentheses are not matched.
The parse file I wrote is like this:
grammar Semi;
prog
: expr+
;
expr
: expr '*'
| expr ('.'|'+') expr
| tuple
| '(' expr ')'
;
tuple
: LP NODE W1 W2 NODE RP
;
LP : '(' ;
RP : ')' ;
W1 : [PCST0];
W2 : [0-9]+;
NODE: [0-9]+;
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
COMMA: ',' -> skip;
It seems like expr| '(' expr ')' doesn't work correctly. So what should I do to make this parser detects if parentheses belong to the node or not?
Update:
There are two errors in the command:
line 1:1 no viable alternative at input '(1'
line 1:13 no viable alternative at input '(2'
So it seems like the lexer didn't detect the tuples, but why is that?
Your W2 and NODE rules are the same, so nodes you intend to be NODE are matching W2.
grun with -tokens option: (notice, no NODE tokens)
[#0,0:0='(',<'('>,1:0]
[#1,1:1='1',<W2>,1:1]
[#2,3:3='C',<W1>,1:3]
[#3,5:6='10',<W2>,1:5]
[#4,8:8='2',<W2>,1:8]
[#5,9:9=')',<')'>,1:9]
[#6,10:10='.',<'.'>,1:10]
[#7,11:11='(',<'('>,1:11]
[#8,12:12='(',<'('>,1:12]
[#9,13:13='2',<W2>,1:13]
[#10,15:15='P',<W1>,1:15]
[#11,17:17='2',<W2>,1:17]
[#12,19:19='3',<W2>,1:19]
[#13,20:20=')',<')'>,1:20]
[#14,21:21='.',<'.'>,1:21]
[#15,22:22='(',<'('>,1:22]
[#16,23:23='3',<W2>,1:23]
[#17,25:25='S',<W1>,1:25]
[#18,27:27='3',<W2>,1:27]
[#19,29:29='2',<W2>,1:29]
[#20,30:30=')',<')'>,1:30]
[#21,31:31=')',<')'>,1:31]
[#22,32:32='*',<'*'>,1:32]
[#23,33:33='.',<'.'>,1:33]
[#24,34:34='(',<'('>,1:34]
[#25,35:35='2',<W2>,1:35]
[#26,37:37='T',<W1>,1:37]
[#27,39:39='2',<W2>,1:39]
[#28,41:41='4',<W2>,1:41]
[#29,42:42=')',<')'>,1:42]
[#30,43:42='<EOF>',<EOF>,1:43]
If I replace the NODEs in your parse rule with W2s (sorry, I have no idea what this is supposed to represent), I get:
It appears that your misconception is that the recursive descent parsing starts with the parser rule and when it encounters a Lexer rule, attempts to match it.
This is not how ANTLR works. With ANTLR, your input is first run through the Lexer (aka Tokenizer) to produce a stream of tokens. This step knows absolutely nothing about your parser rules. (That's why it's so often useful to use grun to dump the stream of tokens, this gives you a picture of what your parser rules are acting upon (and you can see, in your example that there are no NODE tokens, because they all matched W2).
Also, a suggestion... It would appear that commas are an essential part of correct input (unless (1C102).((2P23).(3S32))*.(2T24) is considered valid input. On that assumption, I removed the -> skip and added them to your parser rule (that's why you see them in the parse tree). The resulting grammar I used was:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP W2 COMMA W1 COMMA W2 COMMA W2 RP;
LP: '(';
RP: ')';
W1: [PCST0];
W2: [0-9]+;
NODE: [0-9]+;
WS: [ \t\r\n]+ -> skip; // toss out whitespace
COMMA: ',';
To take a bit more liberty with your grammar, I'd suggest that your Lexer rules should be raw type focused. And, that you can use labels to make the various elements or your tuple more easily accessible in your code. Here's an example:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP nodef=INT COMMA w1=PCST0 COMMA w2=INT COMMA nodet=INT RP;
LP: '(';
RP: ')';
PCST0: [PCST0];
INT: [0-9]+;
COMMA: ',';
WS: [ \t\r\n]+ -> skip; // toss out whitespace
With this change, your tuple Context class will have accessors for w1, w1, and node. node will be an array of NBR tokens as I've defined it here.

ANTLR grammar not working as expected. What am I doing wrong?

I have this grammar below for implementing an IN operator taking a list of numbers or strings.
grammar listFilterExpr;
listFilterExpr: entityIdNumberListFilter | entityIdStringListFilter;
entityIdNumberProperty
: 'a.Id'
| 'c.Id'
| 'e.Id'
;
entityIdStringProperty
: 'f.phone'
;
listFilterExpr
: entityIdNumberListFilter
| entityIdStringListFilter
;
listOperator
: '$in:'
;
entityIdNumberListFilter
: entityIdNumberProperty listOperator numberList
;
entityIdStringListFilter
: entityIdStringProperty listOperator stringList
;
numberList: '[' ID (',' ID)* ']';
fragment ID: [1-9][0-9]*;
stringList: '[' STRING (',' STRING)* ']';
STRING
: '"'(ESC | SAFECODEPOINT)*'"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
If I try to parse the following input:
c.Id $in: [1,1]
Then I get the following error in the parser:
mismatched input '1' expecting ID
Please help me to correct this grammar.
Update
I found this following rule way above in the huge grammar file of my project that might be matching '1' before it gets to match to ID:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
But, If I write my ID rule before NUMBER then other things fail, because they have already matched ID which should have matched NUMBER
What should I do?
As mentioned by rici: ID should not be a fragment. Fragments can only be used by other lexer rules, they will never become a token on their own (and can therefor not be used in parser rules).
Just remove the fragment keyword from it: ID: [1-9][0-9]*;
Note that you'll also have to account for spaces. You probably want to skip them:
SPACES : [ \t\r\n] -> skip;
...
mismatched input '1' expecting ID
...
This looks like there's another lexer, besides ID, that also matches the input 1 and is defined before ID. In that case, have a look at this Q&A: ANTLR 4.5 - Mismatched Input 'x' expecting 'x'
EDIT
Because you have the rules ordered like this:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
ID
: [1-9][0-9]*
;
the lexer will never create an ID token (only NUMBER tokens will be created). This is just how ANTLR works: in case of 2 or more lexer rules match the same amount of characters, the one defined first "wins".
In the first place I think it's odd to have an ID rule that matches only digits, but, if that's the language you're parsing, OK. In your case, you could do something like this:
id : POS_NUMBER;
number : POS_NUMBER | NEG_NUMBER;
POS_NUMBER : INT ('.' [0-9] +)?;
NEG_NUMBER : '-' POS_NUMBER;
fragment INT
: '0' | [1-9] [0-9]*
;
and then instead of ID, use id in your parser rules. As well as using number instead of the NUMBER you're using now.

How do I figure out ANTLR grammar failure to parse input, special timestamp in logfile

INPUT:
Mar 9 10:19:07 west info tmm1[17280]: 01870003:6:
/Common/mysaml.app/mysaml:Common:00000000: helloasdfasdf asdfadf vgnfg
GRAMMAR:
grammar scratch;
lines : datestamp hostname level proc msgnum module msgstring;
datestamp: month day time;
//month : MONTH;
day : INTEGER;
time : INTEGER ':' INTEGER ':' INTEGER;
hostname : STRING;
level : ALPHA;
proc: procname '[' procnum ']' ':';
procname : STRING;
procnum : INTEGER;
msgnum : INTEGER ':' DIGIT':';
module : '/' DOTSLASHSTRING ':' PARTITION ':' SESSID ':';
PARTITION: STRING;
sessid : HEX;
msgstring: MSGSTRING;
DOTSLASHSTRING : [a-zA-Z./]+;
SESSID : HEX;
INTEGER : [0-9]+;
DIGIT: [0-9];
STRING : [a-zA-Z][a-zA-Z0-9]*;
HEX : [a-f0-9]+;
//ALPHA: [a-zA-Z]+;
ALPHA: ('['|'(') .*? (']'|')');
MSGSTRING : [a-zA-Z0-9':,_(). ]+ [\r];
// | 'Agent' MSGSTRING;
month : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
WS : [ \t\r\n]+ -> skip;
PROBLEM:
the parse tree shows that the month is populated properly, but the next item, day is not. In the parse tree, it shows day is set to the entire rest of the input. Don't see how this is possible.
Error from parser is:
line 1:4 mismatched input '9' expecting INTEGER
The parser (i.e. rules starting with lowercase letter) and lexer (uppercase first letter) behave in a slightly different way:
The parser knows what token it expects and tries to match it (except when it has multiple alternatives - then it looks at the next tokens to see which alternative to choose)
The lexer however knows nothing about the parser rules - it matches whatever it can match to the current input. When multiple lexer rules can match a prefix of the input:
It will match (and emit token for) the rule that matches the longest sequence
If multiple rules can match the same sequence, the rule earlier in the file (closer to the top) wins.
So your input would most likely be tokenized to*:
MONTH Mar
(WS)
SESSID 9 - SESSID matches and is higher up than INTEGER
(WS)
SESSID 10
':' :
SESSID 19
':' :
SESSID 07
(WS)
PARTITION west - same as STRING but higher up - STRING will never be matched
(WS)
PARTITION info
(WS)
PARTITION tmm1
ALPHA [17280] - matches longer sequence than just '[' in rule "proc"
':' :
(WS)
SESSID 01870003
':' :
SESSID 6
':' :
(WS)
DOTSLASHSTRING /Common/mysaml.app/mysaml - longer than just '/' in rule "module"
MSGSTRING :Common:00000000: helloasdfasdf asdfadf vgnfg - the rest can be matched to this rule
As you can see, these are quite different tokens than your parser expects.
The bottom line is, you have too much logic in your lexer rules, namely you tried to put semantic meanings into lexer. It's not suited to that task. If a single input sequence can mean different things (like 123 might be an integer number, hex number or session ID), that distinction needs to go into parser since it can only be decided based on context (where in the sentence it occurred), and not by the content of the 123 itself. Similarly, if [17280] can either be ALPHA (whatever that is) or an INTEGER in brackets, that decision needs to go into parser because it cannot be decided solely by looking at [17280] (it's now in lexer due to the ALPHA rule).
* The likely tokenization is based on the input from your screenshot which is all on one line, while the input in question itself is on two lines - not sure whether that is intentional or a result of line wrap.

Require newline or EOF after statement match

Just looking for a simple way of getting ANTLR4 to generate a parser that will do the following (ignore anything after the ;):
int #i ; defines an int
int #j ; see how I have to go to another line for another statement?
My parser is as the following:
compilationUnit:
(statement END?)*
statement END?
EOF
;
statement:
intdef |
WS
;
// 10 - 1F block.
intdef:
'intdef' Identifier
;
// Lexer.
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
// Whitespace, fragments and terminals.
WS: [ \t\r\n\u000C]+ -> skip;
//COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
END: (';' ~[\r\n]*) | '\n';
In essence, any time I have a statement, I need it to REQUIRE a newline before another is entered. I don't care if there's 3 new lines and then on the second one a bunch of tabs persist, as long as there's a new line.
The issue is, the ANTLR4 Parse Tree seems to be giving me errors for inputs such as:
.
(Pretend the dot isnt there, its literally no input)
int #i int #j
Woops, we got two on the same line!
Any ideas on how I can achieve this? I appreciate the help.
I've simplified your grammar a bit but made it require an end-of-line sequence after each statement to parse correctly.
grammar Testnl;
program: (statement )* EOF ;
statement: 'int' Identifier EOL;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? '\r\n'
| ';' .*? '\n'
;
WS: [ \t\r\n\u000C]+ -> skip;
It parses
int #i ;
int #j;
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:9=';\r\n',<EOL>,1:7]
[#3,10:12='int',<'int'>,2:0]
[#4,14:15='#j',<Identifier>,2:4]
[#5,16:18=';\r\n',<EOL>,2:6]
[#6,19:18='<EOF>',<EOF>,3:0]
It also ignore stuff after the semicolon as just part of the EOL token:
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:20='; ignore this\n',<EOL>,1:7]
[#3,21:23='int',<'int'>,2:0]
[#4,25:26='#j',<Identifier>,2:4]
[#5,27:28=';\n',<EOL>,2:6]
[#6,29:28='<EOF>',<EOF>,3:0]
using either linefeed or carriagereturn-linefeed just fine. Is that what you're looking for?
EDIT
Per OP comment, made a small change to allow consecutive EOL tokens, and also move EOL token to statement to reduce repetition:
grammar Testnl;
program: ( statement EOL )* EOF ;
statement: 'int' Identifier;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? ('\r\n')+
| ';' .*? ('\n')+
;
WS: [ \t\r\n\u000C]+ -> skip;

ANTLR4 can't parse Integer if a parser rules has an own numeric literal

I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
Antlr creates implicit lexer tokens for the subterms. Since they are automatically created in the lexer, you have no control over where the sit in the precedence evaluation of Lexer rules.
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.

Resources