ANTLR4: Unrecognized constant value in a lexer command - parsing

I am learning how to use the "more" lexer command. I typed in the lexer grammar shown in the ANTLR book, page 281:
lexer grammar Lexer_To_Test_More_Command ;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \t\r\n]+ -> skip ;
mode STR ;
STRING : '"' -> mode(DEFAULT_MODE) ;
TEXT : . -> more ;
Then I created this simple parser to use the lexer:
grammar Parser_To_Test_More_Command ;
import Lexer_To_Test_More_Command ;
test: STRING EOF ;
Then I opened a DOS window and entered this command:
antlr4 Parser_To_Test_More_Command.g4
That generated this warning message:
warning(155): Parser_To_Test_More_Command.g4:3:29: rule LQUOTE
contains a lexer command with an unrecognized constant value; lexer
interpreters may produce incorrect output
Am I doing something wrong in the lexer or parser?

Combined grammars (grammars that start with just grammar, instead of parser grammar or lexer grammar) cannot use lexer modes, and importing a lexer grammar into a combined grammar does not change that. Instead of using the import feature¹, you should use the tokenVocab feature like this:
Lexer_To_Test_More_Command.g4:
lexer grammar Lexer_To_Test_More_Command;
// lexer rules and modes here
Parser_To_Test_More_Command.g4:
parser grammar Parser_To_Test_More_Command;
options {
tokenVocab = Lexer_To_Test_More_Command;
}
// parser rules here
¹ I actually recommend avoiding the import statement altogether in ANTLR. The method I described above is almost always preferable.
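For reference, the behaviour that the question's lexer is meant to exercise can be sketched in plain Python. This is a hand-written simulation of the rules LQUOTE, WS, STRING and TEXT, not code generated by ANTLR, and the function name tokenize is an assumption made for illustration. It shows how more keeps the text matched so far, so the opening quote and every TEXT character end up inside the single STRING token.

```python
def tokenize(text):
    """Simulate the question's lexer by hand (illustrative only, not ANTLR output)."""
    tokens = []
    mode = "DEFAULT_MODE"
    pending = ""  # text accumulated by `more` for the token still being built
    for ch in text:
        if mode == "DEFAULT_MODE":
            if ch == '"':
                # LQUOTE : '"' -> more, mode(STR) ;
                pending += ch
                mode = "STR"
            elif ch in " \t\r\n":
                # WS : [ \t\r\n]+ -> skip ;
                continue
            else:
                # ANTLR would report a token recognition error here
                raise ValueError("no rule matches %r in DEFAULT_MODE" % ch)
        else:
            if ch == '"':
                # STRING : '"' -> mode(DEFAULT_MODE) ;  (listed before TEXT, so it wins the tie)
                tokens.append(("STRING", pending + ch))
                pending = ""
                mode = "DEFAULT_MODE"
            else:
                # TEXT : . -> more ;
                pending += ch
    return tokens


if __name__ == "__main__":
    print(tokenize('  "hi there" \n'))
```

With this sketch, the input "hi there" surrounded by whitespace produces exactly one STRING token whose text includes both quotes, which is what the parser rule test: STRING EOF expects.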

Related

Differentiate between multiplication and comment

I am writing an ANTLR grammar for SAS and running into an issue where the lexer cannot differentiate between a single line comment and a multiplication operation.
The SAS syntax for comments is regrettably:
*message;
or
/*message*/
I have written a simple test grammar to illustrate the problem:
grammar TEST;
prog: expr* EOF;
expr
: VAR #base
| expr '*' expr #mult
;
VAR: ALPHA+;
fragment ALPHA : [a-zA-Z]+ ;
COMMENT: '*' ~[\r\n];
WS: [ \t\r\n] -> skip;
I'm not sure how I can qualify the lexer to differentiate between these two situations. I am an ANTLR beginner so I may have missed something obvious.

What does pushMode, popMode, mode, OPEN and CLOSE mean in the lexer grammar?

I am studying lexer and parser grammars and using ANTLR to create parsers and lexers from .g4 files. However, I am quite confused about what pushMode and popMode do in general.
OPEN : '[' -> pushMode(BBCODE) ;
TEXT : ~('[')+ ;
mode BBCODE;
CLOSE : ']' -> popMode ;
What do OPEN, pushMode, BBCODE, CLOSE and popMode mean in the lexer grammar? I tried searching for these modes, but I found no clear definition or explanation of them.
pushMode and popMode are used for so-called "Island Grammars" or lexical modes. These allow dealing with different formats in the same file. The basic idea is to have the lexer switch between modes when it sees certain character sequences.
In your grammar example, when the lexer encounters [ it will switch from the default mode (i.e. the rules defined outside any mode <name>) to the rules that follow
mode BBCODE;
pushMode also remembers the mode it came from on a stack. The rule
CLOSE : ']' -> popMode ;
belongs to the BBCODE mode, so when the lexer encounters ] it pops that stack and switches back to the default mode.
One example of an island grammar would be Javadoc tags inside Java code.
Theoretically, lexical modes could also be used to parse JavaScript inside HTML. For example, the main grammar would define HTML, but when it encounters a <script ... tag it would switch to the JavaScript grammar with -> pushMode(javascript). When it encounters the </script> tag it would popMode to return to the "default" HTML grammar.
OPEN and CLOSE in your example are lexer rules (token names) for '[' and ']'. Parser rules refer to these tokens by name, which improves readability: in a parser rule you simply write CLOSE, while the -> popMode command stays in the lexer.
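The mode switching described above can be sketched as a small hand-written tokenizer in Python. This is a simplified illustration, not ANTLR's implementation: the RULES table and the tokenize function are hypothetical names, and the TAG rule is an assumption added so that the BBCODE mode has something to match besides CLOSE (the question's snippet shows only CLOSE). It also takes the first matching rule rather than the longest match, which is enough here because the rules within each mode cannot match the same text.

```python
import re

# mode name -> list of (token name, pattern, lexer command)
RULES = {
    "DEFAULT_MODE": [
        ("OPEN", re.compile(r"\["), ("pushMode", "BBCODE")),
        ("TEXT", re.compile(r"[^\[]+"), None),
    ],
    "BBCODE": [
        ("CLOSE", re.compile(r"\]"), ("popMode", None)),
        ("TAG", re.compile(r"[^\]]+"), None),  # assumed rule, not in the question
    ],
}


def tokenize(text):
    mode_stack = ["DEFAULT_MODE"]  # the top of the stack is the active mode
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern, command in RULES[mode_stack[-1]]:
            match = pattern.match(text, pos)
            if match:
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((name, match.group()))
        pos = match.end()
        if command is not None:
            action, target = command
            if action == "pushMode":
                mode_stack.append(target)  # remember where we came from
            elif action == "popMode":
                mode_stack.pop()  # return to the previous mode
    return tokens


if __name__ == "__main__":
    print(tokenize("hi [b] there"))
```

Running it on hi [b] there yields TEXT, OPEN, TAG, CLOSE, TEXT: the same character class is lexed differently depending on which mode is on top of the stack.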
If you plan any serious development with ANTLR4, I strongly recommend reading this book: The Definitive ANTLR 4 Reference by Terence Parr.

ANTLR4 can't parse Integer if a parser rule has its own numeric literal

I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
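Antlr creates implicit lexer tokens for the literals 'hi' and '5'. Since they are automatically created in the lexer, you have no control over where they sit in the precedence evaluation of lexer rules. Here the implicit token for '5' takes precedence over INTEGER, so the input 5 is never tokenized as INTEGER.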
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.
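The effect can be reproduced with a small Python model of how a lexer chooses between rules: the longest match wins, and when two rules match the same length, the one listed first wins. In a combined grammar the implicit tokens for parser literals are placed ahead of the explicitly written lexer rules. This is a simplified sketch, not ANTLR's actual algorithm, and pick_token and the two rule lists are hypothetical names used only for illustration.

```python
import re


def pick_token(rules, text):
    """Return (rule name, lexeme) for the start of text.

    Longest match wins; on a tie, the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        match = re.match(pattern, text)
        # Strictly longer only, so an earlier rule keeps a tie
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


# First grammar: the only literal used in a parser rule is 'hello'
first_grammar = [("'hello'", r"hello"), ("INTEGER", r"[0-9]+")]

# Second grammar: unusedRule adds implicit tokens for 'hi' and '5',
# and they sit ahead of INTEGER
second_grammar = [
    ("'hello'", r"hello"),
    ("'hi'", r"hi"),
    ("'5'", r"5"),
    ("INTEGER", r"[0-9]+"),
]

if __name__ == "__main__":
    print(pick_token(first_grammar, "5"))
    print(pick_token(second_grammar, "5"))
    print(pick_token(second_grammar, "55"))
```

In this model the input 5 becomes INTEGER under the first rule list but the implicit '5' token under the second, while 55 is still INTEGER because it is the longer match. That matches the reported error, where the parser sees '5' but expects INTEGER.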

Antlr Error when adding a Mode for Lexers

I'm trying Lexing Modes for the first time.
I have a lexer grammar with a mode that I'm importing into my "main" grammar.
I get this error when generating the java classes for the Grammar's lexer
'rule DESCRIPTION_FIELD contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output'
I followed this article
My Lexer grammar is the following :
lexer grammar TestLexerGrammar;
DESCRIPTION_FIELD
:
'DESCRIPTION:'-> pushMode(FREETEXTMODE)
;
mode FREETEXTMODE;
FREE_TEXT_FIELD_FORMAT
:
STR+
;
fragment
STR
:
(
LETTER
| DIGIT
)
;
my main grammar:
grammar Grammar;
import TestLexerGrammar;
descriptionElement
:
DESCRIPTION_FIELD freeTextFields
;
freeTextFields
:
FREE_TEXT_FIELD_FORMAT+
;
As a result, the generated GrammarLexer.java contains this error: "FREETEXTMODE cannot be resolved to a variable".
Is this the wrong approach? And is there a way to trigger a mode change from a parser rule?
You cannot use modes in a grammar that brings in its lexer rules with an import statement. There are related issues on github: Problems with lexical modes inside an imported grammar and No error/incorrect code generation when importing lexer grammar with modes into a combined grammar.
So you should fix your main grammar by removing the import statement, declaring it as a parser grammar, and using tokenVocab instead:
parser grammar Grammar;
options { tokenVocab=TestLexerGrammar; }

Why does this grammar fail to parse this input?

I'm defining a grammar for a small language using Antlr4. The idea is that in this language there's a keyword "function" which can be used either to define a function or as a type specifier when defining parameters. I would like to be able to do something like this:
function aFunctionHere(int a, function callback) ....
However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.
In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.
The error I get with the grammar as-is:
line 1:0 mismatched input 'function' expecting <INVALID>
What does "expecting <INVALID>" mean?
The (stripped down) grammar:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: BaseParamType IDENTIFIER ;
// Lexer stuff
BaseParamType:
INT_TYPE
| FUNCTION_TYPE // <---- LINE 1
;
FUNCTION : 'function'; // <---- LINE 2
INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
The input I'm using:
function abc(int c, int d, int a)
The program to test the generated parser:
import sys

from antlr4 import *
from antlr4.tree.Trees import Trees
from testLexer import testLexer as Lexer
from testParser import testParser as Parser

def main(argv):
    # Read the file named on the command line, or test.in by default
    input = FileStream(argv[1] if len(argv) > 1 else "test.in")
    lexer = Lexer(input)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)
    tree = parser.begin()  # begin is the grammar's start rule
    print(Trees.toStringTree(tree, None, parser))

if __name__ == '__main__':
    main(sys.argv)
Just use one name for the token function.
A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION comes first in the file, that's the one the lexer uses. That makes it impossible to match FUNCTION_TYPE, so that becomes an invalid token type.
The parser will figure out the syntactic role of the token function. So there would be no point in using two different lexical descriptors for the same token, even if it were possible.
In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.
I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)
Here's the grammar file which I tested:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;
baseParamType:
INT_TYPE
| FUNCTION
;
// Lexer stuff
FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;

Resources