Overlapping Tokens in ANTLR 4 - parsing

I have the following ANTLR 4 combined grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
This parses:
field x { A }
field x { B }
This does not:
field a { A }
field b { B }
In the case where parsing fails, I think the lexer is getting confused and emitting a NOTE token where I want it to emit an IDENTIFIER token.
Edit:
In the tokens coming out of the lexer, a 'NOTE' token is showing up where the parser expects 'IDENTIFIER'. 'NOTE' has higher precedence because it appears first in the grammar. So I can think of two ways to fix this: first, I could alter the grammar to disambiguate 'NOTE' and 'IDENTIFIER' (for example, by requiring a '$' in front of a 'NOTE'). Or I could just use 'IDENTIFIER' wherever I would use note and then deal with detecting issues when I walk the parse tree. Neither of those feels optimal. Surely there must be a way to fix this?
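One way to confirm that diagnosis is to dump the token stream before parsing. A minimal sketch (the driver class name is made up; ExampleLexer is the lexer ANTLR generates from the grammar above):
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        // Lex the failing input and print each token with its type name.
        ExampleLexer lexer = new ExampleLexer(CharStreams.fromString("field a { A }"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.println(ExampleLexer.VOCABULARY.getDisplayName(t.getType())
                    + " -> '" + t.getText() + "'");
        }
        // The 'a' after 'field' comes out as NOTE, not IDENTIFIER: both rules match
        // that single character, and NOTE wins because it is declared first.
    }
}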

I actually ended up solving it like this:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER | NOTE ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
My parse tree still ends up looking how I'd like.
The actual grammar I'm developing is more complicated, as is the workaround based on this approach. But in general, the approach seems to work well.
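For completeness, a small driver along these lines (the class name is illustrative; ExampleLexer and ExampleParser are the classes generated from the grammar) shows the previously failing input parsing cleanly and prints the tree in LISP form:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class ParseFields {
    public static void main(String[] args) {
        String input = "field a { A }\nfield b { B }\n";
        ExampleLexer lexer = new ExampleLexer(CharStreams.fromString(input));
        ExampleParser parser = new ExampleParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.fieldList();  // start rule from the grammar above
        // identifier: IDENTIFIER | NOTE; lets the NOTE token 'a' stand in as an identifier,
        // so the identifier/note structure of the tree is preserved.
        System.out.println(tree.toStringTree(parser));
    }
}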

Quick and dirty fix for your problem can be:
Change IDENTIFIER to match only the complement of NOTE, then put the two together in the identifier parser rule.
Resulting grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: (NOTE|IDENTIFIER_C)+ ;
NOTE: [A-Ga-g] ;
IDENTIFIER_C: [H-Zh-z0-9] ;
WS: [ \t\r\n]+ -> skip ;
The disadvantage of this solution is that you no longer get identifiers as single tokens; every single character is tokenized separately.

Related

antlr4 not parsing according to grammar

I'm trying to parse 'for loop' according to this (partial) grammar:
grammar GaleugParserNew;
/*
* PARSER RULES
*/
relational
: '>'
| '<'
;
varChange
: '++'
| '--'
;
values
: ID
| DIGIT
;
for_stat
: FOR '(' ID '=' values ';' values relational values ';' ID varChange ')' '{' '}'
;
/*
* LEXER RULES
*/
FOR : 'for' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
When I try to generate the GUI view of how the input is parsed, the resulting tree does not follow the grammar I provided above.
I've encountered this problem before; back then I simply closed cmd, opened it again, recompiled everything, and somehow that worked. It's not working now, though.
I'm not very knowledgeable about ANTLR 4, so I'm not sure where to look to solve this problem.
This must be a problem with the IDE you are using. The grammar is fine and produces the expected parse tree in Visual Studio Code.
I guess the IDE is using the wrong parser or lexer (maybe from a different work file?). Print the lexer tokens to see whether they are what you expect. Hint: avoid defining implicit lexer tokens (like '(', '}' etc.); defining them explicitly lets you give the tokens proper names.
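One way to print the tokens from a small Java driver (the class name and sample input are just for illustration; GaleugParserNewLexer is the lexer generated from the combined grammar):
import org.antlr.v4.runtime.*;

public class ShowTokens {
    public static void main(String[] args) {
        String input = "for (i = 0; i < 10; i++) {}";
        GaleugParserNewLexer lexer = new GaleugParserNewLexer(CharStreams.fromString(input));
        for (Token t : lexer.getAllTokens()) {
            System.out.printf("%-10s '%s'%n",
                    GaleugParserNewLexer.VOCABULARY.getDisplayName(t.getType()), t.getText());
        }
        // If FOR, ID, DIGIT and the literal tokens do not come out as expected,
        // the IDE is most likely running stale or mismatched generated classes.
    }
}
The bundled TestRig gives the same information from the command line, e.g. grun GaleugParserNew for_stat -tokens -gui (assuming the usual grun alias is set up).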

Parse any character until semicolon in ANTLR4

I am trying to parse input with the following grammar, where Value can be any run of characters up to the semicolon, but I cannot get it to work correctly:
grammar Test;
pragmaDirective : 'pragma' Identifier Value ';' ;
Identifier : [a-z]+ ;
Value : ~';'* ;
WS : [ \t\r\n\u000C]+ -> skip ;
When I test it with pragma foo bar;, I get the following error:
line 1:6 extraneous input ' ' expecting Identifier
line 1:11 extraneous input 'bar' expecting ';'
Try this:
pragmaDirective : 'pragma' Identifier .*? ';' ;
and remove the Value rule. That should do the job.
And a recommendation: define lexer rules for your literals (like 'pragma') instead of defining them directly in the parser rules.
The Value rule is much too greedy. Lexer rules try to match as much as possible, so for input like this: pragma mu foo;, the Value rule would match pragma mu foo. After all, that's zero or more characters other than a semicolon.
Value is not well suited to be used as a lexer rule. I suggest you rethink your approach. Perhaps create a parser rule value that matches an Identifier and perhaps other lexer rules. Hard to make a suggestion without seeing much of the "real" grammar (you probably posted a dumbed down version of the grammar you're working on).
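If it helps, here is a sketch of how the .*? version behaves in a driver (the class name and input are illustrative; TestLexer, TestParser and PragmaDirectiveContext are the names ANTLR generates from grammar Test):
import org.antlr.v4.runtime.*;

public class PragmaCheck {
    public static void main(String[] args) {
        TestLexer lexer = new TestLexer(CharStreams.fromString("pragma foo bar;"));
        TestParser parser = new TestParser(new CommonTokenStream(lexer));
        TestParser.PragmaDirectiveContext ctx = parser.pragmaDirective();
        // The tokens consumed by .*? sit between the Identifier and the ';',
        // i.e. children 2 .. childCount - 2 of the rule context.
        StringBuilder value = new StringBuilder();
        for (int i = 2; i < ctx.getChildCount() - 1; i++) {
            value.append(ctx.getChild(i).getText());
        }
        System.out.println("value = " + value);  // prints: value = bar
    }
}
Note that because WS is skipped, any whitespace inside the value is lost here; recovering the exact source text would mean reading it back from the character stream using the start and stop token indexes.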

How to define alphanumeric token in antlr?

I'd like to have an alphanumeric lexer rule, i.e. a token made of any combination of letters and digits. Here's my grammar:
grammar Equery;
query: queryTerm+;
queryTerm: filter
| '(' queryTerm ')'
;
filter: kvpair
| 'NOT' filter
;
kvpair: ID '=' VALUE;
ID: [a-zA-Z]+;
VALUE: [a-z0-9]+;
WS: [ \r\n\t]+ -> skip;
When I tested the kvpair rule with a=12, this error occurred:
mismatched input '12' expecting VALUE
I could work around this, but I'd like to know why 12 is not recognized as a VALUE?
Your grammar is correct, as far as I can tell. On my machine, using ANTLR 4, I tested a = 12 against your kvpair rule and it parsed fine. By visual inspection, it should work on previous versions of ANTLR as well. I would try deleting all of your ANTLR-generated files and rebuilding the grammar to see whether stale output is your issue.

Ignoring whitespace (in certain parts) in Antlr4

I am not so familiar with ANTLR. I am using version 4, and I have a grammar where whitespace is not important in some parts (but it might be in others; it's rather hit or miss).
So say we have the following grammar
grammar Foo;
program : A* ;
A : ID '#' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;
and a test input
foo#bar(X,Y);
foo#baz ( z,Z) ;
The first line is parsed correctly whereas the second one is not.
I don't want to pollute my rules with the places where whitespace is not relevant, since my actual grammar is more complicated than this toy example. In case it's not clear: the ID '#' ID part must not contain any whitespace, while whitespace in any other position shouldn't matter at all.
Even though you are skipping WS, lexer rules are still sensitive to the presence of whitespace characters. Skip simply means that no token is generated for consumption by the parser. Thus, the Addr lexer rule below explicitly does not permit any interior whitespace characters.
Conversely, the a and idList parser rules never see interior whitespace tokens, so those rules are insensitive to whitespace occurring between the generated tokens.
grammar Foo;
program : a* EOF ; // EOF will require parsing the entire input
a : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ; // simpler equivalent construct
Addr : ID '#' ID ;
ID : [a-zA-Z]+ ;
LParen : '(' ; // explicit punctuation tokens, so they have proper names
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
WS : [ \t\r\n]+ -> skip ;
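A quick sanity check of this reworked grammar against both test lines might look like the following (the driver class is hypothetical; FooLexer and FooParser are the generated classes, and the punctuation rules are assumed to be defined as above):
import org.antlr.v4.runtime.*;

public class FooCheck {
    public static void main(String[] args) {
        String input = "foo#bar(X,Y);\nfoo#baz ( z,Z) ;\n";
        FooLexer lexer = new FooLexer(CharStreams.fromString(input));
        FooParser parser = new FooParser(new CommonTokenStream(lexer));
        parser.program();
        // "foo#bar" and "foo#baz" each reach the parser as a single Addr token,
        // while whitespace around '(' ',' ')' ';' is skipped before the parser sees it.
        System.out.println("syntax errors: " + parser.getNumberOfSyntaxErrors());  // expect 0
    }
}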
Define ID '#' ID as a single lexer token rather than as a sequence of parser tokens:
A : AID '(' IDList ')' ';' ;
AID : [a-zA-Z]+ '#' [a-zA-Z]+;
Other options:
enable/disable whitespace in your token stream
enable/disable whitespace with lexer modes (this may be a problem because lexer modes are triggered by context, which is not easy to determine in your case)

Shift/reduce conflict in yacc due to look-ahead token limitation?

I've been trying to tackle a seemingly simple shift/reduce conflict, to no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here I've simplified a relatively complex grammar down to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
After the parser shifts the second ID (the lookahead token is '('), it doesn't know whether that ID ends the first statement (with the '(' starting a second statement) or whether it begins a function call. So it shifts (continuing to parse a function), which is the wrong thing to do for the example input above.
If you extend function to allow an argument inside the parentheses, and expression to allow actual expressions, things become worse, as the required lookahead is unbounded: the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.
