I am trying to parse input with the following grammar, where Value should match any sequence of characters up to the semicolon, but I cannot get it to work correctly:
grammar Test;
pragmaDirective : 'pragma' Identifier Value ';' ;
Identifier : [a-z]+ ;
Value : ~';'* ;
WS : [ \t\r\n\u000C]+ -> skip ;
When I test it with pragma foo bar;, I get the following errors:
line 1:6 extraneous input ' ' expecting Identifier
line 1:11 extraneous input 'bar' expecting ';'
Try this:
pragmaDirective : 'pragma' Identifier .*? ';' ;
and remove the Value rule. That should do the job.
And a recommendation: define lexer rules for your literals (like 'pragma') instead of defining them directly in the parser rules.
The Value rule is much too greedy. Lexer rules try to match as much as possible, so for input like this: pragma mu foo;, the Value rule would match pragma mu foo. After all, that's zero or more characters other than a semicolon.
Value is not well suited to be a lexer rule. I suggest you rethink your approach: perhaps create a parser rule value that matches an Identifier and possibly other token types. It's hard to make a concrete suggestion without seeing more of the "real" grammar (you probably posted a dumbed-down version of the grammar you're working on).
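To make that concrete, here is one possible direction, assuming the value really is just a sequence of identifier-like words (the PRAGMA and value rule names are mine, not something from your real grammar):

grammar Test;

pragmaDirective : PRAGMA Identifier value ';' ;

// a parser rule instead of the lexer rule Value; extend the
// alternatives with other token types as needed
value : Identifier+ ;

// an explicit lexer rule for the keyword; for the text pragma it wins
// over Identifier because it is defined first
PRAGMA     : 'pragma' ;
Identifier : [a-z]+ ;
WS         : [ \t\r\n\u000C]+ -> skip ;

With this, pragma foo bar; parses as PRAGMA, Identifier foo, a value holding Identifier bar, and ';'.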
I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
I am also further confused by the discussion I found here: Amend JSON-based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
Edit: fixed a typo in tuple: ..., changing (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the corrected grammar (with my modification above), IDENT is a syntax error at that point, and COMMA could be either another repetition of ( COMMA IDENT ) or the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
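To see that concretely, here is one way the corrected rule could be written out in plain BNF with a helper non-terminal (the name ident_tail is mine; ANTLR would report the same conflict on this form at k=1):

tuple      : '(' IDENT ident_tail ')' ;
ident_tail : COMMA IDENT ident_tail   /* another element */
           | COMMA                    /* trailing comma */
           |                          /* end of list */
           ;

After a COMMA, the parser still has to look at the following token, IDENT or ')', to choose between the first two alternatives, which is exactly the two-token lookahead.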
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
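Combining those two suggestions (equally untested):

tuple  : '(' (idents)? ')' ;
idents : IDENT ( ',' ( idents )? )? ;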
I've been using the Antlr Matlab grammar from Antlr grammars.
I found out I need to implement the Matlab ' operator: the complex conjugate transpose operator, used like this:
result = input'
I tried the straightforward solution of adding it to unary_expression as an alternative: postfix_expression '\''.
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL or change the * to +. Neither is a proper solution, of course: removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty strings does nothing to help the real use case, where the input is something like x' + y' rather than the operator being used twice in a row.
For some reason, removing CR from the grammar and \n from the tests also makes the parsing run without problems, but again that is not a usable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
I don't think the lexer can ever be made that context-aware, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar to how, in ECMAScript, a / can be either a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check which it is.
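For illustration, a rough sketch of what such a lexer predicate could look like with a Java target; the quoteIsTranspose() helper and its heuristic (a quote directly after an identifier character, a closing bracket or another quote is a transpose) are my own assumptions, not something taken from the ECMAScript grammar or from Matlab's actual rules:

@lexer::members {
  // look at the character just before the quote to decide what it is
  private boolean quoteIsTranspose() {
    int prev = _input.LA(-1);
    return Character.isLetterOrDigit(prev) || prev == ')' || prev == ']' || prev == '\'';
  }
}

TRANSPOSE      : {quoteIsTranspose()}?  '\'' ;
STRING_LITERAL : {!quoteIsTranspose()}? '\'' [a-z]* '\'' ;

The unary_expression rule would then reference TRANSPOSE instead of the '\'' literal.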
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".
I'm using ANTLR4 to generate a parser. I am new to parser grammars. I've read the very helpful ANTLR Mega Tutorial but I am still stuck on how to properly order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
Hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I am parsing text words (and punctuation, symbols, etc), except with the occasional "<< something >>" tag, which I am calling a "func" in my lexer rules.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, as in the example sentence above, to have a comma appear right after a "func". But since you can also have a comma after a "WORD", I decided to put the punctuation in "item" instead of in both "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that is happening.
If I swap the order of "ID" and "WORD" in my grammar, so now they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
And run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and just treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference:
ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.
With your grammar (in Question.g4) and a t.text file containing
Hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[#0,0:4='Hello',<WORD>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<WORD>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<WORD>,1:18]
[#6,22:24='are',<WORD>,1:22]
[#7,26:28='you',<WORD>,1:26]
[#8,30:31='at',<WORD>,1:30]
[#9,33:36='nine',<WORD>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word rule:
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[#0,0:4='Hello',<ID>,1:0]
[#1,6:7='<<',<'<<'>,1:6]
[#2,9:12='name',<ID>,1:9]
[#3,14:15='>>',<'>>'>,1:14]
[#4,16:16=',',<PUNCT>,1:16]
[#5,18:20='how',<ID>,1:18]
[#6,22:24='are',<ID>,1:22]
[#7,26:28='you',<ID>,1:26]
[#8,30:31='at',<ID>,1:30]
[#9,33:36='nine',<ID>,1:33]
[#10,38:44='o'clock',<WORD>,1:38]
[#11,45:45='?',<PUNCT>,1:45]
[#12,47:46='<EOF>',<EOF>,2:0]
and there are no more errors. As the -gui graphic shows, you now have branches identified as word or func.
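For reference, the full grammar with those two changes applied (every other rule copied unchanged from the question):

doc: item* EOF ;
item: (func | word) PUNCT? ;
word: WORD | ID ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;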
As "500 - Internal Server Error" already mentioned in his comment ANTLR will match lexer rules in the order they are defined in the grammar (the topmost rule will be matched first) and if a certain input has been matched ANTLR won't try to match it differently.
In your case the WORD and ID rule can both match input like abc but as WORD is declared first abc will always be matched as a WORD and never as an ID. In fact ID will never be matched as there is no valid input as an ID that can not be matched by WORD.
However if your only goal is to replace whatever is in between << and >> you'd be better off using regular expressions. However if you still want to use ANTLR for it you should reduce your grammar to only care about the essentials. That is to distinguish between any input and input in between << and >>. Therefore your grammar should look something like this:
start: (INTERESTING | UNINTERESTING)* ;
INTERESTING: '<<' .*? '>>' ;
UNINTERESTING: (~[<])+ | '<' ;
Or you could skip the UNINTERESTING completely.
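If you skip the surrounding text, a minimal sketch of that variant could look like this (the exact rule shapes are mine):

start         : INTERESTING* EOF ;
INTERESTING   : '<<' .*? '>>' ;
UNINTERESTING : . -> skip ;

Everything outside the << ... >> pairs is then silently dropped, and only the INTERESTING tokens reach the parser.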
I have the following grammar:
grammar Hello;
prog: stat+ EOF;
stat: DELIMITER_OPEN expr DELIMITER_CLOSE;
expr: NOTES COMMA value=VAR_VALUE #delim_body;
VAR_VALUE: ANBang*;
NOTES: WS* 'notes' WS*;
COMMA: ',';
DELIMITER_OPEN: '<<!';
DELIMITER_CLOSE: '!>>';
fragment ANBang: AlphaNum | Bang;
fragment AlphaNum: [a-zA-Z0-9];
fragment Bang: '!';
WS : [ \t\r\n]+ -> skip ;
Parsing the following works:
<<! notes, Test !>>
and the variable value is "Test". However, the parser fails when I eliminate the space between the DELIMITER_OPEN and NOTES:
<<!notes, Test !>>
line 1:3 mismatched input 'notes' expecting NOTES
This is yet another case of badly ordered lexer rules.
When the lexer scans for the next token, it first tries to find the rule which will match the longest token. If several rules match that same longest input, it disambiguates by choosing the one defined first.
<<! notes, Test !>> will be tokenized as follows:
DELIMITER_OPEN NOTES COMMA VAR_VALUE WS DELIMITER_CLOSE
This is because the NOTES rule can match the following:
<<! notes, Test !>>
   \____/
Which includes the whitespace. If you remove it:
<<!notes, Test !>>
Then both the NOTES and VAR_VALUE rules can match the text notes, and since VAR_VALUE is defined first in the grammar, it gets precedence. The tokenization is:
DELIMITER_OPEN VAR_VALUE COMMA VAR_VALUE WS DELIMITER_CLOSE
and it doesn't match your expr rule.
Change your rules like this to fix the problem:
NOTES: 'notes';
VAR_VALUE: ANBang+;
Adding WS* to other rules doesn't make much sense, since WS is skipped anyway. And declaring a token that can match zero-width input with * is also meaningless, so use + instead. Finally, reorder the rules so that the most specific ones match first.
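With those changes applied, the lexer rules could be ordered like this (one reasonable choice; the fragments are unchanged):

DELIMITER_OPEN  : '<<!' ;
DELIMITER_CLOSE : '!>>' ;
NOTES           : 'notes' ;
COMMA           : ',' ;
VAR_VALUE       : ANBang+ ;
fragment ANBang   : AlphaNum | Bang ;
fragment AlphaNum : [a-zA-Z0-9] ;
fragment Bang     : '!' ;
WS : [ \t\r\n]+ -> skip ;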
This way, notes becomes a keyword in your grammar. If you don't want it to be a keyword, remove the NOTES rule altogether, and use the VAR_VALUE rule with a predicate. Alternatively, you could use lexer modes.
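For the lexer-modes variant, a rough sketch as a split lexer grammar (the mode layout and rule names are my own; note that VAR_VALUE here deliberately excludes the bang so it cannot swallow the closing !>>):

lexer grammar HelloLexer;

DELIMITER_OPEN : '<<!' -> pushMode(BODY) ;
WS             : [ \t\r\n]+ -> skip ;

mode BODY;
DELIMITER_CLOSE : '!>>' -> popMode ;
NOTES           : 'notes' ;
COMMA           : ',' ;
VAR_VALUE       : [a-zA-Z0-9]+ ;
BODY_WS         : [ \t\r\n]+ -> skip ;

The parser rules would then go into a separate parser grammar that pulls in these tokens with options { tokenVocab = HelloLexer; }.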
I have the following ANTLR 4 combined grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
This parses:
field x { A }
field x { B }
This does not:
field a { A }
field b { B }
In the case where parsing fails, I think the lexer is getting confused and emitting a NOTE token where I want it to emit an IDENTIFIER token.
Edit:
In the tokens coming out of the lexer, the 'NOTE' token is showing up where the parser is expecting 'IDENTIFIER'. 'NOTE' has higher precedence because it's shown first in the grammar. So, I can think of two ways to fix this... first, I could alter the grammar to disambiguate 'NOTE' and 'IDENTIFIER' (like adding a '$' in front of 'NOTE'). Or, I could just use 'IDENTIFIER' where I would use note and then deal with detecting issues when I walk the parse tree. Neither of those feels optimal. Surely there must be a way to fix this?
I actually ended up solving it like this:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER | NOTE ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
My parse tree still ends up looking how I'd like.
The actual grammar I'm developing is more complicated, as is the workaround based on this approach. But in general, the approach seems to work well.
A quick and dirty fix for your problem can be:
Change IDENTIFIER to match only the complement of NOTE, then put them together in identifier.
Resulting grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: (NOTE|IDENTIFIER_C)+ ;
NOTE: [A-Ga-g] ;
IDENTIFIER_C: [H-Zh-z0-9] ;
WS: [ \t\r\n]+ -> skip ;
The disadvantage of this solution is that you do not get the identifier as a single token and that you tokenize every single character.