Are the following two rules the same in ANTLR4, or do the parentheses add any grouping information, similar to a regex capturing group?
WS : (' '|'\t'|'\r'|'\n')+ -> skip;
WS : [ \t\r\n]+ -> skip;
Those are synonymous. The latter is just a more concise syntax for the former.
I am writing an ANTLR grammar for SAS and running into an issue where the lexer cannot differentiate between a single line comment and a multiplication operation.
The SAS syntax for comments is regrettably:
*message;
or
/*message*/
I have written a simple test grammar to illustrate the problem:
grammar TEST;
prog: expr* EOF;
expr
: VAR #base
| expr '*' expr #mult
;
VAR: ALPHA+;
fragment ALPHA : [a-zA-Z]+ ;
COMMENT: '*' ~[\r\n];
WS: [ \t\r\n] -> skip;
I'm not sure how I can qualify the lexer to differentiate between these two situations. I am an ANTLR beginner so I may have missed something obvious.
I'm trying to define the language of XQuery and XPath in test.g4. The part of the file relevant to my question looks like:
grammar test;
ap: 'doc' '(' '"' FILENAME '"' ')' '/' rp
| 'doc' '(' '"' FILENAME '"' ')' '//' rp
;
rp: ...;
f: ...;
xq: STRING
| ...
;
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS: [ \n\t\r]+ -> skip;
I tried to parse something like doc("movies.xml")//TITLE, but it gives
line 1:4 no viable alternative at input 'doc("movies.xml"'
But if I remove the STRING lexer rule, it works fine. And since FILENAME appears before STRING, I don't understand why doc("movies.xml")//TITLE is not matched using the FILENAME rule. How can I fix this? Thank you!
The literal tokens in your grammar are nothing more than regular tokens. So your lexer effectively looks like this:
TOKEN_1 : 'doc';
TOKEN_2 : '(';
TOKEN_3 : '"';
TOKEN_4 : ')';
TOKEN_5 : '/';
TOKEN_6 : '//';
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS : [ \n\t\r]+ -> skip;
(they're not really called TOKEN_..., but that's unimportant)
Now, the way ANTLR creates tokens is to try to match as many characters as possible. Whenever two (or more) rules match the same number of characters, the one defined first "wins". Given these rules, the input doc("movies.xml") will be tokenised as follows:
doc → TOKEN_1
( → TOKEN_2
"movies.xml" → STRING
) → TOKEN_4
Since ANTLR tries to match as many characters as possible, "movies.xml" (including the quotes) is tokenised as a single STRING token. The lexer does not "listen" to what the parser might need at a given time. This is how ANTLR works; you cannot change it.
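The two rules described above (longest match first, then definition order on a tie) can be sketched in a few lines of Python. This is only a simulation under stated assumptions, not ANTLR's actual algorithm: the TOKEN_n names are placeholders, the rules are approximated with Python regular expressions, and `tokenize` is a hypothetical helper. For patterns this simple, Python's backtracking `re` engine finds the same matches a longest-match lexer would.

```python
import re

# Rule table mirroring the lexer rules listed above, in definition order.
# The TOKEN_n names are placeholders, as in the answer.
RULES = [
    ("TOKEN_1", r"doc"),
    ("TOKEN_2", r"\("),
    ("TOKEN_3", r'"'),
    ("TOKEN_4", r"\)"),
    ("TOKEN_5", r"/"),
    ("TOKEN_6", r"//"),
    ("FILENAME", r"[a-zA-Z0-9/_]+\.xml"),
    ("STRING", r'"[a-zA-Z0-9~!##$%^&*()=+._ -]+"'),
    ("WS", r"[ \n\t\r]+"),
]


def tokenize(text, rules=RULES):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps its place on a tie.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


with_string = tokenize('doc("movies.xml")')
without_string = tokenize(
    'doc("movies.xml")', [rule for rule in RULES if rule[0] != "STRING"]
)
print(with_string)
print(without_string)
```

With the STRING rule present, the quoted text becomes one STRING token, so the parser never receives the '"' FILENAME '"' sequence it expects. With STRING removed, the same input yields TOKEN_3, FILENAME, TOKEN_3, which is why the grammar appeared to work in that case.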
FYI, there's a user contributed XPath grammar here: https://github.com/antlr/grammars-v4/blob/master/xpath/xpath.g4
I'm new to Antlr4/CFG and am trying to write a parser for a boolean querying DSL of the form
(id AND id AND id AND (id OR id OR id))
The logic can also take the form
(id OR id OR (id AND id AND id))
A more complex example might be:
(((id AND id AND (id OR id OR (id AND id)))))
(enclosed in an arbitrary amount of parentheses)
I've tried two things. First, I did a very simple parser, which ended up parsing everything left to right:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom
: INT;
I got the following parse tree for input:
( 60 ) AND ( 55 ) AND ( 53 ) AND ( 3337 OR 2830 OR 23)
This "works", but ideally I want to be able to separate my AND and OR blocks. Trying to separate these blocks into separate grammars leads to left-recursion. Secondly, I want my AND and OR blocks to be grouped together, instead of reading left-to-right, for example, on input (id AND id AND id),
I want:
(and id id id)
not
(and id (and id (and id)))
as it currently is.
The second thing I've tried is making OR blocks direct descendants of AND blocks (i.e. the first case).
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| and_expr;
and_expr
: term (AND term)* ;
term
: LPAREN or_expr RPAREN
| LPAREN atom RPAREN ;
or_expr
: atom (OR atom)+;
atom: INT ;
For the same input, I get the following parse tree, which is more along the lines of what I'm looking for but has one main problem: there isn't an actual hierarchy to OR and AND blocks in the DSL, so this doesn't work for the second case. This approach also seems a bit hacky, for what I'm trying to do.
What's the best way to proceed? Again, I'm not too familiar with parsing and CFGs, so some guidance would be great.
Both are equivalent in their ability to parse your sample input. If you simplify your input by removing the unnecessary parentheses, the output of this grammar looks pretty good too:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
Which is what I suspect your first grammar looks like in its entirety.
Your second one requires too many parentheses for my liking (mainly in term), and the breaking up of AND and OR into separate rules instead of alternatives doesn't seem as clean to me.
You can simplify even more though:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN # ParenExp
| expression AND expression # AndBlock
| expression OR expression # OrBlock
| atom # AtomExp
;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
This gives a tree with a different shape but is still equivalent. Note the use of the # AndBlock and # OrBlock labels: these "alternative labels" cause your generated listener or visitor to have separate methods for each alternative, allowing you to completely separate the two in your code semantically as well as syntactically. Perhaps that's what you're looking for?
I like this one the best because it has the simplest and clearest recursion, and it offers specific code alternatives for AND and OR.
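The binary alternatives produce nested AND and OR nodes, whereas the question asked for a flat (and id id id) shape. One possible way to get there is to flatten the tree after parsing, for example from the separate visitor methods. The sketch below is an illustration only: it works on plain Python tuples such as ("and", left, right) rather than on ANTLR's generated context classes, and `flatten` is a hypothetical helper, not part of any ANTLR API.

```python
def flatten(node):
    """Merge nested nodes that share an operator into one n-ary node."""
    if not isinstance(node, tuple):
        return node  # an atom such as 60
    op, *children = node
    merged = []
    for child in map(flatten, children):
        if isinstance(child, tuple) and child[0] == op:
            # Same operator as the parent: lift the grandchildren up.
            merged.extend(child[1:])
        else:
            merged.append(child)
    return (op, *merged)


# Assumed nesting for: 60 AND 55 AND 53 AND (3337 OR 2830 OR 23)
tree = ("and", ("and", ("and", 60, 55), 53), ("or", ("or", 3337, 2830), 23))
flat = flatten(tree)
print(flat)
```

Because only children with the same operator as their parent are merged, an OR block inside an AND block (or the reverse) stays a separate node, so neither operator has to sit above the other in the grammar.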
I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
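Antlr creates implicit lexer tokens for the subterms. Since they are automatically created in the lexer, you have no control over where they sit in the precedence evaluation of lexer rules. Here the implicit token for '5' ends up ahead of INTEGER, and since both match the single character 5, the implicit token wins and the parser never receives an INTEGER.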
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.
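The effect on the second Hello grammar can be simulated in Python. This is a sketch under assumptions, not ANTLR's real implementation: it assumes the implicit tokens for 'hello', 'hi' and '5' are ordered before the explicit lexer rules, the T__n names are placeholders, and `token_types` is a hypothetical helper.

```python
import re

# Assumed rule order for the second Hello grammar: the implicit tokens for
# the literals 'hello', 'hi' and '5' come before the explicit lexer rules.
RULES = [
    ("T__0", r"hello"),
    ("T__1", r"hi"),
    ("T__2", r"5"),
    ("INTEGER", r"[0-9]+"),
    ("WS", r"[ \t\r\n]+"),
]


def token_types(text):
    """Return the token type names produced for text, skipping WS."""
    types, pos = [], 0
    while pos < len(text):
        matches = []
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            if m:
                matches.append((name, m.end()))
        if not matches:
            raise ValueError(f"no rule matches at offset {pos}")
        # max() returns the first of several equally long matches,
        # which models "the rule defined first wins".
        name, end = max(matches, key=lambda item: item[1])
        if name != "WS":
            types.append(name)
        pos = end
    return types


print(token_types("hello 5"))   # the literal '5' shadows INTEGER
print(token_types("hello 55"))  # the longer match goes to INTEGER
```

Only the exact input 5 is affected: any longer digit string, or any other single digit, still becomes an INTEGER. That is what makes this kind of clash easy to miss.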
(I use Yecc, an Erlang parser generator similar to Yacc, so the syntax is different from Yacc)
The problem is simple: say we want to parse a Lisp-like syntax, and I want to match expression lists.
An expression list is a list of expressions separated with blank space.
In Erlang, [1,3,4] is a list and ++ concatenates two lists.
We want to match 1 (1+2) 3. The expression rule will match 1, (1+2) and 3. So I match on the list followed by one more expression, and if that does not match, I fall back to matching a single expression. This builds a list recursively.
expressionlist -> expressionlist expression : '$1' ++ ['$2'].
expressionlist -> expression : ['$1'].
But I can also do this (inverting the order):
expressionlist -> expression expressionlist : ['$1'] ++ '$2'.
expressionlist -> expression : ['$1'].
Both of these seem to work; I would like to know whether there is any difference.
With a separator
I want to match {name = albert , age = 43}. propdef matches name = value. So it is the same problem, but with an additional separator (,). Does that change anything compared with the first problem?
proplist -> propdef ',' proplist : ['$1'] ++ '$3'.
proplist -> propdef : ['$1'].
proplist -> '{' proplist '}' : '$2'.
proplist -> '{' '}' : [].
%% Could write this
%% proplist -> proplist ',' propdef : '$1' ++ ['$3'].
Thank you.
Since Yecc is an LALR parser generator, the use of left recursion or right recursion doesn't matter much. In the old days, people preferred left recursion in Yacc/Bison and similar tools because it allows the parser to keep reducing instead of shifting everything onto the stack until the end of the list, but these days stack space and speed aren't that important, so you can pick whatever suits you best. (Note that in an LL parser, left recursion causes an infinite loop, so for such parsers, right recursion is necessary.)
The more important thing, for your example above, is that '$1' ++ ['$2'] in the left recursive version will cause quadratic time complexity, since the "expressionlist" part is the one that's going to be longer and longer. You should never have the growing component on the left when you use the ++ operator. If you parse a list of thousands of elements, this complexity will hurt you. Using the right recursive version instead, ['$1'] ++ '$2' will give you linear time, even if the parser has to shift the whole list onto the stack before it starts reducing. You could try both versions and parse a really long list to see the difference.
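The difference can be estimated without running a parser by counting how many list cells each variant copies. The sketch below is a rough Python model, not Erlang: it assumes that Left ++ Right copies Left and shares Right, and it only counts those copies. The function names are hypothetical.

```python
def append_cost(left, right):
    """Model Erlang's `Left ++ Right`: Left is copied, Right is shared.

    Returns the joined list and the number of cells that Erlang would copy.
    (Python's + copies both sides; only the count models Erlang.)
    """
    return left + right, len(left)


def build_left_recursive(items):
    # expressionlist -> expressionlist expression : '$1' ++ ['$2'].
    acc, copied = [items[0]], 0
    for item in items[1:]:
        acc, cost = append_cost(acc, [item])  # growing list on the left
        copied += cost
    return acc, copied


def build_right_recursive(items):
    # expressionlist -> expression expressionlist : ['$1'] ++ '$2'.
    acc, copied = [items[-1]], 0
    for item in reversed(items[:-1]):
        acc, cost = append_cost([item], acc)  # one-element list on the left
        copied += cost
    return acc, copied


items = list(range(1000))
left_list, left_copied = build_left_recursive(items)
right_list, right_copied = build_right_recursive(items)
print(left_copied, right_copied)
```

Both variants build the same list, but for n elements the left-recursive action copies n(n-1)/2 cells in this model while the right-recursive one copies n-1.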
Using a separator as in "propdef ',' proplist" does not change the problem.