ANTLR4 - Make space optional between tokens - parsing

I have the following grammar:
grammar Hello;
prog: stat+ EOF;
stat: DELIMITER_OPEN expr DELIMITER_CLOSE;
expr: NOTES COMMA value=VAR_VALUE #delim_body;
VAR_VALUE: ANBang*;
NOTES: WS* 'notes' WS*;
COMMA: ',';
DELIMITER_OPEN: '<<!';
DELIMITER_CLOSE: '!>>';
fragment ANBang: AlphaNum | Bang;
fragment AlphaNum: [a-zA-Z0-9];
fragment Bang: '!';
WS : [ \t\r\n]+ -> skip ;
Parsing the following works:
<<! notes, Test !>>
and the variable value is "Test", however, the parser fails when I eliminate the space between the DELIMITER_OPEN and NOTES:
<<!notes, Test !>>
line 1:3 mismatched input 'notes' expecting NOTES

This is yet another case of badly ordered lexer rules.
When the lexer scans for the next token, it first tries to find the rule which will match the longest token. If several rules match, it will disambiguate by choosing the first one in definition order.
<<! notes, Test !>> will be tokenized as such:
DELIMITER_OPEN NOTES COMMA VAR_VALUE WS DELIMITER_CLOSE
This is because the NOTES rule can match the following:
<<! notes, Test !>>
\____/
Which includes the whitespace. If you remove it:
<<!notes, Test !>>
Then both the NOTES and VAR_VALUE rules can match the text notes, and VAR_VALUE is defined first in the grammar, so it gets precedence. The tokenization is:
DELIMITER_OPEN VAR_VALUE COMMA VAR_VALUE WS DELIMITER_CLOSE
and it doesn't match your expr rule.
Change your rules like this to fix the problem:
NOTES: 'notes';
VAR_VALUE: ANBang+;
Adding WS* to other rules doesn't make much sense, since WS is skipped. Declaring a token that can have zero width (with *) is also meaningless, so use + instead. Finally, reorder the rules so that the most specific ones match first.
This way, notes becomes a keyword in your grammar. If you don't want it to be a keyword, remove the NOTES rule altogether, and use the VAR_VALUE rule with a predicate. Alternatively, you could use lexer modes.
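To make the "longest match, then first rule" behaviour concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR itself: the tokenize helper and the regex translations of the lexer rules are assumptions made for illustration, and skipped WS tokens are simply left out of the result.

```python
import re

def tokenize(rules, text, skipped=("WS",)):
    """Maximal munch: the longest match wins, ties go to the earliest rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule on a tie; empty matches never win.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skipped:
            tokens.append(best_name)
        pos += best_len
    return tokens

COMMON = [
    ("COMMA", r","),
    ("DELIMITER_OPEN", r"<<!"),
    ("DELIMITER_CLOSE", r"!>>"),
    ("WS", r"[ \t\r\n]+"),
]
# Rule order as posted in the question.
ORIGINAL = [
    ("VAR_VALUE", r"[a-zA-Z0-9!]*"),
    ("NOTES", r"[ \t\r\n]*notes[ \t\r\n]*"),
] + COMMON
# Rule order after the fix: keyword first, no zero-width token.
FIXED = [("NOTES", r"notes"), ("VAR_VALUE", r"[a-zA-Z0-9!]+")] + COMMON

with_space = tokenize(ORIGINAL, "<<! notes, Test !>>")
without_space = tokenize(ORIGINAL, "<<!notes, Test !>>")
fixed = tokenize(FIXED, "<<!notes, Test !>>")
print(with_space)
print(without_space)
print(fixed)
```

With the original rule order, the input without the space yields VAR_VALUE where the parser expects NOTES. With the reordered rules, the same input yields NOTES.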

Related

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2"
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
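The difference between one and two tokens of lookahead can be illustrated with a small Python sketch. It only models the single decision point after an IDENT in the corrected rule. The ALTERNATIVES table, the predict function and the token name RPAREN are hypothetical names chosen for this example, not anything ANTLR exposes.

```python
# Token sequences that can follow an IDENT inside the tuple rule,
# one entry per alternative at that decision point.
ALTERNATIVES = {
    "repeat (COMMA IDENT)": ("COMMA", "IDENT"),
    "trailing (COMMA)?": ("COMMA", "RPAREN"),
    "close tuple": ("RPAREN",),
}

def predict(lookahead, k):
    """Return every alternative still viable after inspecting k tokens."""
    window = tuple(lookahead[:k])
    viable = []
    for name, follows in ALTERNATIVES.items():
        n = min(k, len(follows), len(window))
        if follows[:n] == window[:n]:
            viable.append(name)
    return viable

upcoming = ["COMMA", "RPAREN"]   # the parser sits just before ",)"
one_token = predict(upcoming, k=1)
two_tokens = predict(upcoming, k=2)
print(one_token)    # two candidates remain: the decision is ambiguous
print(two_tokens)   # a single candidate remains
```

With k=1 both COMMA alternatives stay viable, which is the situation the warning describes. With k=2 the second token separates them.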
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
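As a rough check that the left-factored rules really need only one token of lookahead, here is a hand-written recursive-descent sketch in Python. It is an illustration rather than ANTLR output: the lex and parse_tuple helpers and the token names LPAREN and RPAREN are assumptions, and the parser follows tuple : '(' idents ')' without the optional empty tuple.

```python
import re

TOKEN = re.compile(
    r"\s*(?:(?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<COMMA>,)|(?P<LPAREN>\()|(?P<RPAREN>\)))"
)

def lex(text):
    """Split text into (kind, value) pairs, ending with an EOF marker."""
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    tokens.append(("EOF", ""))
    return tokens

def parse_tuple(text):
    """Return the identifier names of a tuple, or raise SyntaxError."""
    tokens, pos, names = lex(text), 0, []

    def expect(kind):
        nonlocal pos
        found, value = tokens[pos]
        if found != kind:
            raise SyntaxError(f"expected {kind}, found {found}")
        pos += 1
        return value

    def idents():
        # idents : IDENT ( COMMA ( idents )? )? ;
        names.append(expect("IDENT"))
        if tokens[pos][0] == "COMMA":       # decided on one token
            expect("COMMA")
            if tokens[pos][0] == "IDENT":   # decided on one token
                idents()

    expect("LPAREN")
    idents()
    expect("RPAREN")
    expect("EOF")
    return names

print(parse_tuple("(a)"))
print(parse_tuple("(a, b,)"))
```

Every branch in idents looks only at the current token, which is what makes the factored form LL(1).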

Unary minus messes up parsing

Here is the grammar of the language I'd like to parse:
expr ::= var | const | (expr) | unop expr | expr binop expr
var ::= letter
const ::= {digit}+
unop ::= -
binop ::= / | * | + | -
I'm using an example from the haskell wiki.
The semantics and token parser are not shown here.
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "-" >> return (Uno Oppo))]
        , [Infix (m_reservedOp "/" >> return (Bino Quot)) AssocLeft
          ,Infix (m_reservedOp "*" >> return (Bino Prod)) AssocLeft]
        , [Infix (m_reservedOp "-" >> return (Bino Diff)) AssocLeft
          ,Infix (m_reservedOp "+" >> return (Bino Somm)) AssocLeft]
        ]
term = m_parens exprparser
       <|> fmap Var m_identifier
       <|> fmap Con m_natural
The minus char appears two times, once as unary, once as binary operator.
On input "1--2", the parser gives only
Con 1
instead of the expected
"Bino Diff (Con 1) (Uno Oppo (Con 2))"
Any help welcome. Full code here
The purpose of reservedOp is to create a parser (which you've named m_reservedOp) that parses the given string of operator symbols while ensuring that it is not the prefix of a longer string of operator symbols. You can see this from the definition of reservedOp in the source:
reservedOp name =
    lexeme $ try $
    do{ _ <- string name
      ; notFollowedBy (opLetter languageDef) <?> ("end of " ++ show name)
      }
Note that the supplied name is parsed only if it is not followed by any opLetter symbols.
In your case, the string "--2" can't be parsed by m_reservedOp "-" because, even though it starts with the valid operator "-", that "-" is immediately followed by another operator character, which makes it the prefix of the longer operator string "--".
In a language with single-character operators, you probably don't want to use reservedOp at all, unless you want to disallow adjacent operators without intervening whitespace. Just use symbol "-", which will always parse "-", no matter what follows (and consume following whitespace, if any). Also, in a language with a fixed set of operators (i.e., no user-defined operators), you probably won't use the operator parser, so you won't need opStart, or reservedOpNames. Without reservedOp or operator, the opLetter parser isn't used, so you can drop it too.
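To show the effect of matching each operator character on its own, here is a small Python sketch of an expression parser for the question's language. It is not Parsec: the regex tokenizer plays the role of symbol, the tuples stand in for the question's Con, Var, Uno and Bino constructors, and the parse function and LEVELS table are names invented for this example. Characters the tokenizer does not recognise, including whitespace, are silently dropped.

```python
import re

# Binary operator levels, lowest precedence first, as in the question's table.
LEVELS = [{"-": "Diff", "+": "Somm"}, {"/": "Quot", "*": "Prod"}]

def parse(text):
    # Each operator character is its own token, whatever follows it.
    tokens = re.findall(r"\d+|[A-Za-z]\w*|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return token

    def binary(level):
        if level == len(LEVELS):
            return prefix()
        left = binary(level + 1)
        while peek() in LEVELS[level]:      # left-associative loop
            name = LEVELS[level][advance()]
            left = ("Bino", name, left, binary(level + 1))
        return left

    def prefix():
        if peek() == "-":                   # unary minus binds tightest
            advance()
            return ("Uno", "Oppo", term())
        return term()

    def term():
        token = advance()
        if token == "(":
            inner = binary(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if token.isdigit():
            return ("Con", int(token))
        if token[0].isalpha():
            return ("Var", token)
        raise SyntaxError(f"unexpected token {token!r}")

    tree = binary(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("1--2"))
```

Because the second "-" is read as a separate token, "1--2" comes out as a binary difference whose right operand is a negated constant, the shape the question expected.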
This is probably pretty confusing, and the Parsec documentation does a terrible job of explaining how the "reserved" mechanism is supposed to work. Here's a primer:
Let's start with identifiers, instead of operators. In a typical language that allows user-defined identifiers (i.e., pretty much any language, since "variables" and "functions" have user-defined names) and may also have some reserved words that aren't allowed as identifiers, the relevant settings in the GenLanguageDef are:
identStart -- parser for first character of valid identifier
identLetter -- second and following characters of valid identifier
reservedNames -- list of reserved names not allowed as identifiers
The lexeme (whitespace-absorbing) parsers created using the GenTokenParser object are:
identifier - Parses an unknown, user-defined identifier. It parses a character from identStart followed by zero or more identLetters up to the first non-identLetter. (It never parses a partial identifier, so it'll never leave more identLetters on the table.) Additionally, it checks that the identifier is not in the list reservedNames.
symbol - Parses the given string. If the string is a reserved word, no check is made that it isn't part of a larger valid identifier. So, symbol "for" would match the beginning of foreground = "black", which is rarely what you want. Note that symbol makes no use of identStart, identLetter, or reservedNames.
reserved - Parses the given string, and then ensures that it's not followed by an identLetter. So, m_reserved "for" will parse for (i=1; ... but not parse foreground = "black". Usually, the supplied string will be a valid identifier, but no check is made for this, so you can write m_reserved "15" if you want -- in a language with the usual sorts of alphanumeric identifiers, this would parse "15" provided it wasn't followed by a letter or another digit. Also, maybe somewhat surprisingly, no check is made that the supplied string is in reservedNames.
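The three behaviours described above can be mimicked in a few lines of Python. This is only a model of the semantics, not Parsec's implementation: the character classes and the reserved-name set are assumed values, each function returns the matched text or None, and the whitespace skipping done by lexeme parsers is left out.

```python
import re

IDENT_START = r"[A-Za-z_]"        # assumed identStart
IDENT_LETTER = r"[A-Za-z0-9_]"    # assumed identLetter
RESERVED_NAMES = {"for", "if"}    # assumed reservedNames

def identifier(text):
    """Whole identifier, rejected if it is a reserved name."""
    match = re.match(f"{IDENT_START}{IDENT_LETTER}*", text)
    if match and match.group() not in RESERVED_NAMES:
        return match.group()
    return None

def symbol(name, text):
    """Plain prefix match; ignores identLetter and reservedNames."""
    return name if text.startswith(name) else None

def reserved(name, text):
    """Prefix match that must not be followed by an identLetter."""
    rest = text[len(name):]
    if text.startswith(name) and not re.match(IDENT_LETTER, rest):
        return name
    return None

line = 'foreground = "black"'
print(symbol("for", line))      # matches the first three letters
print(reserved("for", line))    # refuses: more identifier letters follow
print(identifier(line))         # takes the whole identifier
```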
If that makes sense to you, then the operator settings follow the exact same pattern. The relevant settings are:
opStart -- parser for first character of valid operator
opLetter -- valid second and following operator chars, for multichar operators
reservedOpNames -- list of reserved operator names not allowed as user-defined operators
and the relevant parsers are:
operator - Parses an unknown, user-defined operator starting with an opStart and followed by zero or more opLetters up to the first non-opLetter. So, operator applied to the string "--2" will always take the whole operator "--", never just the prefix "-". An additional check is made that the resulting operator is not in the reservedOpNames list.
symbol - Exactly as for identifiers. It parses a string with no checks or reference to opStart, opLetter, or reservedOpNames, so symbol "-" will parse the first character of the string "--" just fine, leaving the second "-" character for a later parser.
reservedOp - Parses the given string, ensuring it's not followed by opLetter. So, m_reservedOp "-" will parse the start of "-x" but not "--2", assuming - matches opLetter. As before, no check is made that the string is in reservedOpNames.
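The operator-side parsers can be modelled in Python along the same lines. Again this is a sketch of the described semantics under assumed settings, not Parsec code: the operator character classes and the reserved operator set are made up for the example, and whitespace handling is omitted.

```python
import re

OP_START = r"[-+*/<>=]"                    # assumed opStart
OP_LETTER = r"[-+*/<>=]"                   # assumed opLetter
RESERVED_OP_NAMES = {"-", "+", "*", "/"}   # assumed reservedOpNames

def operator(text):
    """Whole run of operator characters, rejected if it is reserved."""
    match = re.match(f"{OP_START}{OP_LETTER}*", text)
    if match and match.group() not in RESERVED_OP_NAMES:
        return match.group()
    return None

def symbol(name, text):
    """Plain prefix match with no operator-related checks."""
    return name if text.startswith(name) else None

def reserved_op(name, text):
    """Prefix match that must not be followed by an opLetter."""
    rest = text[len(name):]
    if text.startswith(name) and not re.match(OP_LETTER, rest):
        return name
    return None

print(operator("--2"))           # takes the whole "--"
print(symbol("-", "--2"))        # takes just the first "-"
print(reserved_op("-", "--2"))   # refuses: another opLetter follows
print(reserved_op("-", "-x"))    # accepts
```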

ANTLR4 how to separate Lexer subrule

Let's say I have Lexer rules like this:
EMPTY_LITERAL: '\'' '\'';
LITERAL: '\'' (ESCAPED_SEQ|.)*? '\'' ;
fragment ESCAPED_SEQ: '\\\'' | '\\\\' ;
and a parser rule like this:
literal: EMPTY_LITERAL #EmptyLiteral | LITERAL #LiteralWithContent;
I want to get the content of LITERAL without quotes in the parser. I can strip the quotes, of course, but I am interested in getting that string without quotes.
If I move the inner rule in the LITERAL the rule will not match properly (will match only 1 char). If I move LITERAL as a parser rule, I can match ESCAPED_SEQ but this is not what I want. Is there a way to name the inner rule in the lexer?
Is there a way to name the inner rule in the lexer?
No, there is not. It's not possible to name or access specific parts of a token in ANTLR 4, nor is there a sensible way to turn LITERAL into a parser rule.
So stripping the quotes from the token's text yourself is your only option.
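Stripping the quotes can be kept in one small helper that a listener or visitor calls with the token text. Here is a minimal Python sketch, assuming the two escapes from ESCAPED_SEQ should also be undone; the function name literal_content is hypothetical.

```python
import re

def literal_content(token_text):
    """Return the text between the quotes with the two escapes undone."""
    if len(token_text) < 2 or token_text[0] != "'" or token_text[-1] != "'":
        raise ValueError("not a quoted literal")
    inner = token_text[1:-1]
    # ESCAPED_SEQ allows exactly two escapes: backslash-quote and
    # backslash-backslash. Each is replaced by its second character.
    return re.sub(r"\\(['\\])", r"\1", inner)

print(literal_content("''"))
print(literal_content(r"'it\'s'"))
```

In a generated listener this would probably be called with something like the text of the LITERAL terminal node; the exact accessor depends on the target runtime.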

Ordering lexer rules in a grammar using ANTLR4

I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, such as in the example sentence I gave above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD" then I decided to put the punctuation in "item" instead of in both of "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
And run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference :
ANTLR resolves lexical ambiguities by
matching the input string to the rule specified first in the grammar.
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[#0,0:4='Hello',<WORD>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<WORD>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<WORD>,1:18]
[#6,22:24='are',<WORD>,1:22]
[#7,26:28='you',<WORD>,1:26]
[#8,30:31='at',<WORD>,1:30]
[#9,33:36='nine',<WORD>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule :
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD :
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[#0,0:4='Hello',<ID>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<ID>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<ID>,1:18]
[#6,22:24='are',<ID>,1:22]
[#7,26:28='you',<ID>,1:26]
[#8,30:31='at',<ID>,1:30]
[#9,33:36='nine',<ID>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
and there is no more error. As the -gui graphic shows, you have now branches identified as word or func.
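The two token listings can be reproduced approximately with a small Python model of the lexer: the longest match wins, and on a tie the rule listed first wins. The token_types helper and the regex translations of the rules are assumptions for illustration. The implicit '<<' and '>>' literals are placed ahead of the named rules, and skipped WS tokens are dropped.

```python
import re

def token_types(rules, text):
    """Token type names for text; longest match first, then rule order."""
    pos, types = 0, []
    while pos < len(text):
        candidates = []
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            if match and match.end() > pos:
                candidates.append((name, match.end()))
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        # max() keeps the first of several equally long matches,
        # so rules listed earlier win ties.
        name, pos = max(candidates, key=lambda c: c[1])
        if name != "WS":
            types.append(name)
    return types

LITERALS = [("'<<'", r"<<"), ("'>>'", r">>")]
WS = ("WS", r"[ \t\n\r]")
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
WORD = ("WORD", r"[^.,?! |{}<>]+")   # LETTER | DIGIT | SYMB, repeated
PUNCT = ("PUNCT", r"[.,?!]")

text = "Hello << name >>, how are you at nine o'clock?"
word_first = token_types(LITERALS + [WS, WORD, ID, PUNCT], text)
id_first = token_types(LITERALS + [WS, ID, WORD, PUNCT], text)
print(word_first)
print(id_first)
```

With WORD first, ID never appears. With ID first, plain words become ID, while o'clock is still a WORD because WORD matches more characters there.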
As "500 - Internal Server Error" already mentioned in his comment, when several lexer rules match the same text ANTLR picks the one defined first in the grammar, and once the input has been matched ANTLR won't try to match it differently.
In your case the WORD and ID rules can both match input like abc, but as WORD is declared first, abc will always be matched as a WORD and never as an ID. In fact ID will never be matched, because every valid ID is also matched in full by WORD.
If your only goal is to replace whatever is between << and >>, you'd be better off using regular expressions. If you still want to use ANTLR for it, reduce your grammar to the essentials: distinguishing arbitrary input from input between << and >>. Your grammar would then look something like this:
start: (INTERESTING | UNINTERESTING) ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;
Or you could skip the UNINTERESTING completely.
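For the regular-expression route, a minimal Python sketch might look like the following. The render function, the TAG pattern and the values mapping are hypothetical names; the pattern assumes tag names follow the question's ID rule (a letter followed by letters or digits) and leaves unknown tags untouched.

```python
import re

# << name >> with optional spaces around the name.
TAG = re.compile(r"<<\s*([A-Za-z][A-Za-z0-9]*)\s*>>")

def render(template, values):
    """Replace each << name >> tag with values[name]; keep unknown tags."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), template)

print(render("Hello << name >>, how are you?", {"name": "Ada"}))
```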

Parse any character until semicolon in ANTLR4

I am trying to parse the following grammar, where Value can be any character up to the semicolon, but I cannot get it to work correctly:
grammar Test;
pragmaDirective : 'pragma' Identifier Value ';' ;
Identifier : [a-z]+ ;
Value : ~';'* ;
WS : [ \t\r\n\u000C]+ -> skip ;
When I test it with pragma foo bar;, I get the following error:
line 1:6 extraneous input ' ' expecting Identifier
line 1:11 extraneous input 'bar' expecting ';'
Try this:
pragmaDirective : 'pragma' Identifier .*? ';' ;
and remove the Value rule. That should do the job.
And a recommendation: define lexer rules for your literals (like 'pragma') instead of defining them directly in the parser rules.
The Value rule is much too greedy. Lexer rules try to match as much as possible, so for input like this: pragma mu foo;, the Value rule would match pragma mu foo. After all, that's zero or more characters other than a semicolon.
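The greediness is easy to see by measuring how much each rule could match at the start of that input. The Python sketch below is only a model of the lexer rules using regular expressions; the RULES table and match_lengths are names made up for the example.

```python
import re

RULES = [
    ("'pragma'", r"pragma"),
    ("';'", r";"),
    ("Identifier", r"[a-z]+"),
    ("Value", r"[^;]*"),
    ("WS", r"[ \t\r\n\f]+"),
]

def match_lengths(text, pos=0):
    """Length each rule would match at pos (0 when it does not match)."""
    lengths = {}
    for name, pattern in RULES:
        match = re.compile(pattern).match(text, pos)
        lengths[name] = match.end() - pos if match else 0
    return lengths

lengths = match_lengths("pragma mu foo;")
winner = max(lengths, key=lengths.get)
print(lengths)
print(winner)
```

Value covers everything up to the semicolon, so it beats both the 'pragma' literal and Identifier at the very first character.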
Value is not well suited to be used as a lexer rule. I suggest you rethink your approach. Perhaps create a parser rule value that matches an Identifier and perhaps other lexer rules. Hard to make a suggestion without seeing much of the "real" grammar (you probably posted a dumbed down version of the grammar you're working on).

Resources