ANTLR4 match any not-matched sections into one single STRING token - parsing

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered inbetween.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"ยง$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the british keyboard (German umlaute or French accents and all other special characters other languages have must to work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything inbetween '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?

Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .* non-greedy wildcard is followed in the rule with the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contents, called "lexical modes" in Antlr. There's a publically viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.

Related

How to get a quote-enclosed string and discard the quote itself

I would like to parse text enclosed in single quotes as just the text itself. For example:
"hello" --> hello
I'm able to parse the string with the quotes in antlr using the following rule:
grammar Test;
root
: string EOF
;
string
: QUOTE WORD QUOTE
;
WORD
: [a-zA-Z0-9-]+
;
QUOTE
: '"'
;
And with the input text "mrrogers":
However I'm not sure how to 'discard' the S_QUOTE values, I've tried doing the following two items and it seems like I'm on the wrong course:
fragment QUOTE
: '\''
;
And:
QUOTE
: '\'' -> skip
;
What would be the proper way to do this?
IMO it's better to just keep the quotes in the token and eliminate them at the stage where you need the text. This also allows you to handle special case, like the conversion of double-quotes (two consecutive quote char) to single quotes or handle escape codes (if this is something you want to support).
If you separate your grammar into a Lexer grammar and a Parser grammar, you can use lexical modes to control which tokens are emitted for use by the parser.
As #MikeCargal has intimated, this is a simplistic solution but may help you see how your grammars may be structured.
TestLexer.g4
lexer grammar TestLexer;
tokens { WORD }
NEWLINE
: [\n\r]
->channel(HIDDEN)
;
QUOTE
: '"'
->pushMode(STRING_MODE),channel(HIDDEN)
;
mode STRING_MODE;
WORD
: [a-zA-Z0-9-]+
;
STRING_MODE_QUOTE
: '"'
->popMode,channel(HIDDEN)
;
The tokens { WORD } is an instruction to ANTLR to include the WORD token in the generated TestLexer.tokens file. This makes it visible to the parser.
The pushMode(STRING_MODE) is a way to have the lexer only emit tokens defined in the section mode STRING_MODE. ANTLR maintains a stack of modes, lexing starts out in DEFAULT_MODE and each pushMode() pushes a new mode onto the stack as the current mode governing which tokens will be emitted. Each popMode pops the current mode off the stack and the next one on the stack takes precedence.
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
: string EOF
;
string
: WORD
;
These are very simplistic rules for a String type. How would you handle a string that should contain a ', for example. (Maybe you're string content is as simple as your word rule, but it looks more like this is just a starting point.
It will probably serve you well to take a look at the String rules in a grammar for a language with strings similar to what you want to allow. (You can find many grammars here. (It's pretty common to need to use Lexer modes to properly parse a String token)
I think you'll find that you need to capture the initial and terminal ' (or ") characters in the Lexer rule, so they will, necessarily be a part of the token. It's really trivial to strip thee first and last character from your token to get the string content from the token in your ParseTree.

ANTLR Tries to Match an Expression That Wasn't Specified as an Option

I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals so it seems that have to define it how it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule because ANTLR checks them in the order they appear.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. when a parameter identified in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
my current approach for the grammar to get this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, in the token stream - no QUOTE_SEPARATOR tokens show up and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS and that it is first in the grammar that it would be selected, but it is behaving as if WS was matched and the characters were skipped and not send to the token string.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters) - this way just seemed easier and it should be easy enough for the program that will process the parse tree to deal with as other manipulations to the data will be done there anyway (my first grammar - teach me;) ).
No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string.

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer is "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that it would probably help to know why you want to skip individual input letters in your parser. Looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTlr is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/'
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTlr 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Resources