Antlr Error when adding a Mode for Lexers - parsing

I'm trying lexer modes for the first time.
I have a lexer grammar with a mode that I'm importing into my "main" grammar.
I get this error when generating the Java classes for the grammar's lexer:
'rule DESCRIPTION_FIELD contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output'
I followed this article.
My lexer grammar is the following:
lexer grammar TestLexerGrammar;
DESCRIPTION_FIELD
:
'DESCRIPTION:'-> pushMode(FREETEXTMODE)
;
mode FREETEXTMODE;
FREE_TEXT_FIELD_FORMAT
:
STR+
;
fragment
STR
:
(
LETTER
| DIGIT
)
;
My main grammar:
grammar Grammar;
import TestLexerGrammar;
descriptionElement
:
DESCRIPTION_FIELD freeTextFields
;
freeTextFields
:
FREE_TEXT_FIELD_FORMAT+
;
So in the generated GrammarLexer.java I get an error: "FREETEXTMODE cannot be resolved to a variable".
Is this the wrong approach? And is there a way to trigger a mode change from a parser rule?

You cannot use modes in a lexer grammar that is brought in with an import statement. There are related issues on GitHub: "Problems with lexical modes inside an imported grammar" and "No error/incorrect code generation when importing lexer grammar with modes into a combined grammar".
So remove the import statement from your main grammar, declare it as a parser grammar, and pull in the lexer's token types with tokenVocab:
parser grammar Grammar;
options { tokenVocab=TestLexerGrammar; }

Related

Antlr Matlab grammar lexing conflict

I've been using the Antlr Matlab grammar from Antlr grammars.
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used like this:
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason, removing CR from the grammar and \n from the tests also makes the parsing run without problems, but again that is not a usable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
I don't think the lexer can ever be made that context-aware, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar to how, in ECMAScript, a / can be either a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check the context.
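The idea behind such a check can be sketched in plain Python (this is not ANTLR target code, just a hand-written tokenizer for illustration). It assumes that a ' is the transpose operator whenever the previous token could end an operand, and a string delimiter otherwise. The token names and the OPERAND_END set are assumptions made for this sketch; real Matlab has more cases to consider.

```python
import re

# Token kinds after which a quote is read as the transpose operator.
# This set is an assumption for illustration, not the full Matlab rule.
OPERAND_END = {"IDENTIFIER", "NUMBER", "RPAREN", "RBRACKET", "TRANSPOSE"}

TOKEN_RE = re.compile(r"""
    (?P<IDENTIFIER>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<RPAREN>\))
  | (?P<LPAREN>\()
  | (?P<RBRACKET>\])
  | (?P<LBRACKET>\[)
  | (?P<OP>[+\-*/=;])
  | (?P<WS>[ \t]+)
  | (?P<CR>[\r\n]+)
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, text) pairs, deciding the role of ' from the previous token."""
    tokens = []
    prev = None  # kind of the last non-whitespace token
    pos = 0
    while pos < len(text):
        if text[pos] == "'":
            if prev in OPERAND_END:
                # Quote directly follows an operand: transpose operator.
                tokens.append(("TRANSPOSE", "'"))
                prev = "TRANSPOSE"
                pos += 1
            else:
                # Otherwise the quote opens a string literal.
                end = text.find("'", pos + 1)
                if end == -1:
                    raise ValueError(f"unterminated string at offset {pos}")
                tokens.append(("STRING_LITERAL", text[pos:end + 1]))
                prev = "STRING_LITERAL"
                pos = end + 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = m.lastgroup
        if kind != "WS":  # whitespace is skipped and does not change prev
            tokens.append((kind, m.group()))
            prev = kind
        pos = m.end()
    return tokens
```

With this rule, x''\n tokenizes as an identifier followed by two transpose operators, while x = 'x' + ' + y'; yields two string literals. In an ANTLR grammar the same decision would live in a semantic predicate on the lexer rules, using whatever the target language offers for remembering the last emitted token.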
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".

What does pushMode, popMode, mode, OPEN and CLOSE mean in the lexer grammar?

I am studying lexer and parser grammars and using ANTLR to create parsers and lexers from .g4 files. However, I am quite confused about what pushMode and popMode do in general.
OPEN : '[' -> pushMode(BBCODE) ;
TEXT : ~('[')+ ;
mode BBCODE;
CLOSE : ']' -> popMode ;
What do OPEN, pushMode, BBCODE, CLOSE and popMode mean in the lexer grammar? I tried searching for these modes, but found no clear definition or explanation.
pushMode and popMode are used for so-called "Island Grammars" or lexical modes. These allow dealing with different formats in the same file. The basic idea is to have the lexer switch between modes when it sees certain character sequences.
In your grammar example, when the lexer encounters [ it switches from the default mode (i.e. the rules defined outside any mode <name>) to the rules defined after
mode BBCODE;
which in your excerpt is just
CLOSE : ']' -> popMode ;
When it then encounters ] it pops that mode and switches back to the default mode.
One example of an island grammar would be Javadoc tags inside Java code.
Theoretically, lexical modes could be also used to parse JavaScript inside HTML. For example, the main grammar would define HTML, but when it encounters a <script ... tag it would switch to the JavaScript grammar with -> pushMode(javascript). When it encounters </script> tag it would popMode to return back to "default" HTML grammar.
OPEN and CLOSE in your example are the names of the lexer rules (token types) for '[' and ']'. Parser rules refer to these tokens by name, which improves readability: in a parser rule you simply write CLOSE, and the -> popMode command stays in the lexer.
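To make the mode stack concrete, here is a small Python simulation of the lexer above (a sketch of the mechanism, not how ANTLR is implemented). Each mode has its own rule list, pushMode remembers the current mode on a stack, and popMode returns to it. The TAG rule inside BBCODE is hypothetical and added only so the mode has something to match besides ]. For brevity the sketch takes the first rule that matches rather than the longest match.

```python
import re

# Rules per mode: (token name, pattern, action). Actions mirror the
# lexer commands: ("push", mode), ("pop",) or None.
RULES = {
    "DEFAULT_MODE": [
        ("OPEN", re.compile(r"\["), ("push", "BBCODE")),
        ("TEXT", re.compile(r"[^\[]+"), None),
    ],
    "BBCODE": [
        ("CLOSE", re.compile(r"\]"), ("pop",)),
        # TAG is a hypothetical rule, not part of the original excerpt.
        ("TAG", re.compile(r"[^\]]+"), None),
    ],
}


def tokenize(text):
    """Return (mode, token name, text) triples for the input."""
    mode = "DEFAULT_MODE"
    stack = []  # modes to return to on popMode
    pos = 0
    tokens = []
    while pos < len(text):
        for name, pattern, action in RULES[mode]:
            m = pattern.match(text, pos)
            if m:
                tokens.append((mode, name, m.group()))
                pos = m.end()
                if action and action[0] == "push":
                    stack.append(mode)   # remember where we came from
                    mode = action[1]
                elif action and action[0] == "pop":
                    mode = stack.pop()   # go back to the previous mode
                break
        else:
            raise ValueError(f"no rule in mode {mode} matches at offset {pos}")
    return tokens
```

Running tokenize("hi [b] there") shows TEXT and OPEN being produced in the default mode, TAG and CLOSE in BBCODE, and the final TEXT in the default mode again.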
If you plan any serious development with ANTLR4, I strongly recommend reading this book: The Definitive ANTLR 4 Reference by Terence Parr.

Why does this grammar fail to parse this input?

I'm defining a grammar for a small language in Antlr4. The idea is that in that language there's a keyword "function" which can be used either to define a function or as a type specifier when defining parameters. I would like to be able to do something like this:
function aFunctionHere(int a, function callback) ....
However, it seems Antlr doesn't like that I use "function" in two different places. As far as I can tell, the grammar isn't even ambiguous.
In the following grammar, if I remove LINE 1, the generated parser parses the sample input without a problem. Also, if I change the token string in either LINE 2 or LINE 3, so that they are not equal, the parser works.
The error I get with the grammar as-is:
line 1:0 mismatched input 'function' expecting <INVALID>
What does "expecting <INVALID>" mean?
The (stripped down) grammar:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: BaseParamType IDENTIFIER ;
// Lexer stuff
BaseParamType:
INT_TYPE
| FUNCTION_TYPE // <---- LINE 1
;
FUNCTION : 'function'; // <---- LINE 2
INT_TYPE : 'int';
FUNCTION_TYPE : 'function'; // <---- LINE 3
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;
The input I'm using:
function abc(int c, int d, int a)
The program to test the generated parser:
import sys
from antlr4 import *
from testLexer import testLexer as Lexer
from testParser import testParser as Parser
from antlr4.tree.Trees import Trees
def main(argv):
    # Read the file named on the command line, defaulting to test.in
    input_stream = FileStream(argv[1] if len(argv) > 1 else "test.in")
    lexer = Lexer(input_stream)
    tokens = CommonTokenStream(lexer)
    parser = Parser(tokens)
    # Parse starting from the 'begin' rule and print the tree in LISP style
    tree = parser.begin()
    print(Trees.toStringTree(tree, None, parser))
if __name__ == '__main__':
    main(sys.argv)
Just use one name for the token function.
A token is just a token. Looking at function in isolation, it is not possible to decide whether it is a FUNCTION or a FUNCTION_TYPE. Since FUNCTION comes first in the file, that's what the lexer uses. That makes it impossible to ever match FUNCTION_TYPE, so that becomes an invalid token type.
The parser will figure out the syntactic role of the token function. So there would be no point using two different lexical descriptors for the same token, even if it would be possible.
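The tie-breaking can be illustrated with a small Python sketch of a longest-match tokenizer in which the earliest rule wins when two rules match the same text. It is a simplified model, not ANTLR's implementation. The rule order is copied from the question's grammar; BaseParamType is left out, and PUNCT is a stand-in name for the implicit '(' ')' ',' tokens.

```python
import re

# Rule order copied from the question's grammar (BaseParamType omitted).
RULES = [
    ("FUNCTION", r"function"),
    ("INT_TYPE", r"int"),
    ("FUNCTION_TYPE", r"function"),
    ("IDENTIFIER", r"[a-zA-Z_$]+[a-zA-Z_$0-9]*"),
    ("PUNCT", r"[(),]"),  # stand-in for the implicit literal tokens
    ("WS", r"[ \t\r\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps the token when lengths are equal.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS -> skip
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

For any input, the text function comes out as FUNCTION and never as FUNCTION_TYPE, because both rules match the same eight characters and FUNCTION is listed first. A longer word such as functions still becomes an IDENTIFIER, since the longest match takes priority over rule order.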
In the grammar in the OP, BaseParamType is also a lexical type, which will absorb all uses of the token function, preventing FUNCTION from being recognized in the production for function. Changing its name to baseParamType, which effectively changes it to a parser non-terminal, will allow the parser to work, although I suppose it may alter the parse tree in undesirable ways.
I understand the objection that the parser "should know" which lexical tokens are possible in context, given the nature of Antlr's predictive parsing strategy. I'm far from an Antlr expert so I won't pretend to explain why it doesn't seem to work, but with the majority of parser generators -- and all the ones I commonly use -- lexical analysis is effectively performed as a prior pass to parsing, so the conversion of textual input into a stream of tokens is done prior to the parser establishing context. (Most lexical generators, including Antlr, have mechanisms with which the user can build lexical context, but IMHO these mechanisms reduce grammar readability and should only be used if strictly necessary.)
Here's the grammar file which I tested:
grammar test;
begin : function ;
function: FUNCTION IDENTIFIER '(' parameterlist? ')' ;
parameterlist: parameter (',' parameter)+ ;
parameter: baseParamType IDENTIFIER ;
baseParamType:
INT_TYPE
| FUNCTION
;
// Lexer stuff
FUNCTION : 'function';
INT_TYPE : 'int';
IDENTIFIER : [a-zA-Z_$]+[a-zA-Z_$0-9]*;
WS : [ \t\r\n]+ -> skip ;

ANTLR4: Unrecognized constant value in a lexer command

I am learning how to use the "more" lexer command. I typed in the lexer grammar shown in the ANTLR book, page 281:
lexer grammar Lexer_To_Test_More_Command ;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \t\r\n]+ -> skip ;
mode STR ;
STRING : '"' -> mode(DEFAULT_MODE) ;
TEXT : . -> more ;
Then I created this simple parser to use the lexer:
grammar Parser_To_Test_More_Command ;
import Lexer_To_Test_More_Command ;
test: STRING EOF ;
Then I opened a DOS window and entered this command:
antlr4 Parser_To_Test_More_Command.g4
That generated this warning message:
warning(155): Parser_To_Test_More_Command.g4:3:29: rule LQUOTE
contains a lexer command with an unrecognized constant value; lexer
interpreters may produce incorrect output
Am I doing something wrong in the lexer or parser?
Combined grammars (which are grammars that start with just grammar, instead of parser grammar or lexer grammar) cannot use lexer modes. Instead of using the import feature¹, you should use the tokenVocab feature like this:
Lexer_To_Test_More_Command.g4:
lexer grammar Lexer_To_Test_More_Command;
// lexer rules and modes here
Parser_To_Test_More_Command.g4:
parser grammar Parser_To_Test_More_Command;
options {
tokenVocab = Lexer_To_Test_More_Command;
}
// parser rules here
¹ I actually recommend avoiding the import statement altogether in ANTLR. The method I described above is almost always preferable.
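For intuition about what the lexer in the question does once it is generated correctly, here is a rough Python simulation of its rules (a sketch of the semantics only, written by hand rather than generated by ANTLR). The more command keeps the matched text and continues without emitting a token, and mode(...) switches the active rule set, so the eventual STRING token carries everything from the opening quote to the closing quote.

```python
def tokenize(text):
    """Simulate LQUOTE/WS in the default mode and STRING/TEXT in mode STR."""
    mode = "DEFAULT_MODE"
    tokens = []
    pending = ""  # text accumulated by `more`, not yet emitted
    for ch in text:
        if mode == "DEFAULT_MODE":
            if ch == '"':            # LQUOTE : '"' -> more, mode(STR)
                pending += ch
                mode = "STR"
            elif ch in " \t\r\n":    # WS -> skip
                continue
            else:
                raise ValueError(f"token recognition error at: {ch!r}")
        else:
            if ch == '"':            # STRING : '"' -> mode(DEFAULT_MODE)
                tokens.append(("STRING", pending + ch))
                pending = ""
                mode = "DEFAULT_MODE"
            else:                    # TEXT : . -> more
                pending += ch
    return tokens
```

Calling tokenize on the input "hi there" (including the quotes) gives a single STRING token whose text is the whole quoted string; no LQUOTE or TEXT tokens ever reach the parser.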

equivalent BNF-grammar of grammar written in XText

In my current project, I have written a grammar in Xtext with nice functionality. For instance, here is a snippet of my grammar:
Device:
deviceName = ID ':'
('region' ':' ( deviceRegions += DeviceRegions)+ )* ;
DeviceRegions:
regionLabel = [RegionLabel] ';'
// It stores a List of regionLabel functionalities
;
RegionLabel: name = ID ;
Using the above grammar, I write the following high-level specification:
DeviceOne :
region :
Room ;
Floor ;
Building;
DeviceTwo:
region :
Room ;
Floor ;
Building;
I would like to see a BNF grammar equivalent to the grammar written in Xtext. For instance, the equivalent grammar would be the following:
Device = ID ':'
( 'region' ':' (deviceRegions = DeviceRegions)+)* ;
DeviceRegions :
regionLabel = RegionLabel ';' ;
RegionLabel = 'room' | 'Floor' | 'Building' ;
ID = 'A'..'Z' ('a' ..'z' | 'A' ..'Z')* ;
My question is: is there any way to convert a grammar written in Xtext into an equivalent BNF grammar, or should I do it manually?
I know that Xtext grammars are very easy to learn and write. However, I have a requirement to provide a BNF grammar.
I needed to do the same for documentation (visualizing an Xtext grammar as railroad diagrams). The first time I did it by hand, but as the grammar evolved that became tedious. I found two useful articles:
Simple solution - http://fornax-sculptor.blogspot.nl/2010/05/generating-syntaxrailroad-diagrams-from.html
A more solid one http://xtexterience.wordpress.com/2011/05/13/an-ebnf-grammar-in-xtext/
Please note that it is not possible to produce an 'equivalent' EBNF from an Xtext grammar. Xtext grammars support the notion of cross references where you do not reference production rules but produced instances. This cannot be expressed in EBNF. Anyway, it's possible to write a generator fragment that produces output from an Xtext grammar, e.g. the Antlr grammar is created that way. Have a look at the docs to learn more about that.
