How to prefix numbers with optional letter? - xtext

I am building on an initial Xtext project build using gradle.
ext.xtextVersion = '2.20.0'
I have following xtext grammar:
grammar com.exampe.Rule with org.eclipse.xtext.common.Terminals hidden(WS, ML_COMMENT, SL_COMMENT)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate rule "http://www.example.com/Rule"
Rule:
{Number} (other?='o')? number=INT
;
This does NOT parse o19.
Then, the Rule is changed to following:
Rule:
{Number} (other?='*')? number=INT
;
This DOES parse *19.
I did not find any special treatment in letters versus symbols.
What is going wrong here? How can I make o19 getting parsed.

o19 is parsed by the rule ID which you imported by inheriting from org.eclipse.xtext.common.Terminals. In Xtext, the Lexer runs independent from the parser (context insensitive) and tokenizes the text into keywords and terminal rule calls.
You have to add a terminal rule for such cases.
terminal PREFIXED_INT:
'o' INT;
But I don't know whether it's a good idea in terms of readability if you keep the ID rule as well. Readers of your code might be mislead.

Related

Is there a way to insert phases between the lexer and parser in ANTLR

I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords. And, I am trying to determine the best way to do it.
And one thought that occurs to me, is to insert a phase between the lexer and the parser, where the lexer recognizes the general class, e.g. is this a "command name" or is it an "option" and then passes those general tokens to a second phase which does further analysis and recognizes which command name it is and passes that on as the token type to the parser.
It will make the parser simple. I will only have to deal with well formed command names. Every token will be clear what it means.
It will keep the lexer simple. It will only have to divide things into classes. This is a simple name. This is a glob. This is an option name (starts with a dash).
The phase is the middle will also be relatively simple. The simple name (and option forms) will only have to deal with strings. The glob form can use standard glob techniques to match the glob against the legal candidates, which are in the tables for the simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer and it creates tokens and the intermediate phase massages them and then the parser gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (programmer writing the language I am parsing) need to be to write
g*te and have if match generate.
The phase between the lexer and the parser when it sees a glob needs to look at the glob (and the list of keywords) and see if only one of them matches the glob and if so, return that keyword. The stuff I listed in the "filter grammar" is the stuff that builds the list of keywords globs can match. I have found code on the web that matches globs to a list of names. That part isn't hard.
And, I've since found in the ANTLR doc how to run arbitrary code on matching a token and how to change the resulting tokens type. (See my answer.)
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
in your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); }
in your Java (or whatever language you are using code):
void int lexGlob(String origText()) {
return xyzzy; // some code that computes the right kind of token type
}

Running Antlr4 parser with lexer grammar gets token recognition errors

I'm trying to create a grammar to parse Solr queries (only mildly relevant and you don't need to know anything about solr to answer the question -- just know more than I do about antlr 4.7). I'm basing it on the QueryParser.jj file from solr 6. I looked for an existing one, but there doesn't seem to be one that isn't old and out-of-date.
I'm stuck because when I try to run the parser I get "token recognition error"s.
The lexer I created uses lexer modes which, as I understand it means I need to have a separate lexer grammar file. So, I have a parser and a lexer file.
I whittled it down to a simple example to show I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):
grammar Junk;
options {
language = Java;
tokenVocab=JLexer;
}
term : TERM '\r\n';
I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab parameter (as shown in the XML example in github).
Here's the lexer (JLexer.g4):
lexer grammar JLexer;
TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;
If I copy the lexer code into the parser, then things work as expected (e.g., "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).
If I run the parser ("grun Junk term -tokens"), then I get:
line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[#0,4:5='\r\n',<'
'>,1:4]
I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.
I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.
Always trust your intuition! There is some convention internal to grun :-) See here TestRig.java c. lines 125, 150. Would have been lot nicer if some additional CLI args were also added.
When lexer and grammar are compiled separately, the grammar name - in your case - would be (insofar as TestRig goes) "Junk" and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly the headers in parser file JunkParser.g4 should be modified too
parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff
Now you can run your tests
> antlr4 JunkLexer
> antlr4 JunkParser
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[#0,0:3='aeee',<TERM>,1:0]
[#1,6:5='<EOF>',<EOF>,2:0]
>

ANTLR4: implicit or explicit token definition

What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single parentheses more descriptive and easier to use than creating a separate token and using that in place of the text.
E.g.:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name ';' ;
library_name: IDENTIFIER;
module: MODULE module_name ';' ;
module_name: IDENTIFIER;
MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;
The generated tokens are:
T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3
If I am not interested in the 'library' "token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?
Actually, there is a difference between implicit and explicit tokens:
From "The Definitive ANTLR4 Reference", page 76:
ANTLR collects and separates all of the string literals and lexer
rules from the parser rules. Literals such as 'enum' become lexical
rules and go immediately after the parser rules but before the
explicit lexical rules.
ANTLR lexers resolve ambiguities between
lexical rules by favoring the rule specified first.
Highlight from me.
The Antlr (and most compiler/compiler generators) implementation use the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise represetations, like an enum or int-codes for each token. The parser will work on these tokens instead of the original input for ease of implementation and performance.
There are two ways to "declare" the usage of a token in Antlr, one is explicit and have a regular pattern expresion, the other is implicit, is always a fixed string.
ExplicitRegExp: [A-Z][a-z]+; // lexer rule starts with uppercase letter
ExplicitFixed: 'fixed';
parserRule: 'implicit' ExplicitRegExp; // parser rules starts with lowercase letter
When declare a token explicitly, it's assigned an int-code to be used in the parsing state machine. Let's say ExplicitRegExp becomes 1 and ExplicitFixed becomes 2. But the parser will also need the implicit tokens to be able to parse the grammar correctly, so the implicit token is assigned the code 3 implicitly.
How is that bad? You may have typos in different parts of the grammar:
a : 'implicit' c;
b : 'implcit' d; // typo here
And your grammar will not work as expected, because implcit will be a valid token, assigned the int-code 4. It also makes your grammar/lexer harder to debug due to Antlr auto-generating names for the implicit rules, like T___0. Another thing is that you lose the ordering of lexer rules, which could make a diference (usually not because implicit tokens are all fixed content).
The Antlr compiler could choose to give you an error message and require you to write the tokens explicitly, but it chooses to let it go and just warn you that you should not to that, probably for prototyping/testing reasons.
To let Antlr be happy, do it the verbose way and declare all of your tokens:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name=IDENTIFIER ';' ; // I'm using aliasing instead of different parser rule here, just a preference
module: 'module' module_name=IDENTIFIER ';' ;
MODULE: 'module' ;
LIBRARY: 'library' ;
IDENTIFIER: [a-zA-Z0-9]+;
Then it makes no difference if you reference a fixed token by it's explicit name (like MODULE) or by its content (like 'module').

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer is "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that it would probably help to know why you want to skip individual input letters in your parser. Looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTlr is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/'
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTlr 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Resources