ANTLR using a symbol as a token name - token

I want to use ( as a token name in a ANTLR lexer file. I've tried several versions, but none of them seem to work and cause errors when building. Here are my (broken) attempts:
(1) (: '(';
(2) '(': '(';
(3) '\(': '(';
(4) \(: '(';
Is this possible and, if so, how would I do that?

That is not possible. A rule name must start with an underscore or letter, and is followed by zero or more underscores, letters or digits. So '(', or whatever variation of it, is not a valid rule name.

Related

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the syntax SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference in the two screenshots is that one input has two sets of multiple digits and the other input string has only set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", this first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single KEY token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered inbetween.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"ยง$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the british keyboard (German umlaute or French accents and all other special characters other languages have must to work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything inbetween '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .* non-greedy wildcard is followed in the rule with the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contents, called "lexical modes" in Antlr. There's a publically viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.

Lex won't recognize double operators- !=, :=, <<, etc. Can I give a Lex expression precedence?

Trying to parse operators (+, -, =, <<, !=), using states like
%{
%}
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]
DOUBOP [":="|".."|"<<"|">>"|"<>"|"<="|">="|"=>"|"**"|"!="|"{:"|"}:"|"\-"]
and later on
{DOUBOP} { printf("%s (operator)\n", yytext); }
{OP} { printf("%s (operator)\n", yytext); }
but Lex is identifying operators like "<<" as "<" and "<". I thought since it was in double quotes this would work, but I see that's not the case.
Is there anyway I can give a regular expression precedence, ie have lex check for a double operator first, and then a single operator?
Thanks in advance.
[...] is a character class, not an eccentric type of parenthesis. If you want to parenthesize a sub-expression in a pattern, use ordinary parentheses. In this case, however, parentheses are not necessary. (Indeed, most of the quotes aren't necessary either, but they don't hurt and some of them would be useful.)
"==" recognises the two character-sequence consisting of two equal signs. "=="|"++" recognizes either two equal signs or two plus signs.
By contrast, ["=="] recognises a single character, which could be either a quote or an equals sign. Since a character class is a set, the fact that each of those appears twice is irrelevant (although I think it would save a lot of grief if flex issued a warning). Similarly, ["=="|"<<"] recognises a single character if it is a quote, an equals sign, a vertical bar or a less than sign.
Flex pattern syntax is documented in the flex manual. It differs in a few ways from regexes in other systems, so it's worth reading the short document. However, character classes are mostly the same in all regex syntaxes in common use, especially the use of square brackets to delimit the set.
An easier way is to put all single characters together, and run the * command on the end up curly braces.
i.e.
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]*

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

Accidental space inserted - Why does this compile?

I accidentally hit the spacebar and wrote this:
lTTEvent .CustUpdateStatus := usUnchanged;
and was surprised to see that the compiler accepted the space in front of the dot (actually, any number of spaces).
Is the dot such a special character that the parser can interpret it correctly? How would that work in Pascal?
The parser first translates text to tokens. So the text:
lTTEvent .CustUpdateStatus := usUnchanged;
Is translated to the tokens:
identifier
period
identifier
becomes
identifier
semicolon
The space is a whitespace and it can have three functions:
separator between tokens (for example between an identifier and a keyword).
a literal space (in that case it is included in a string.
cosmetic.
The first and last function spaces are lost in the translation to tokens.
An identifier and a period don't have any characters in common so there is no way those can be confused so a space is not required but it still can be used.
short answer
'lTTEvent' and '.' are tokens. Tokens can (sometimes) be separated by whitespace.

Resources