I'm new to ANTLR4/CFGs and am trying to write a parser for a boolean querying DSL of the form
(id AND id AND id AND (id OR id OR id))
The logic can also take the form
(id OR id OR (id AND id AND id))
A more complex example might be:
(((id AND id AND (id OR id OR (id AND id)))))
(enclosed in an arbitrary amount of parentheses)
I've tried two things. First, I wrote a very simple grammar, which ended up parsing everything left to right:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom
: INT;
I tested it with the following input:
( 60 ) AND ( 55 ) AND ( 53 ) AND ( 3337 OR 2830 OR 23)
This "works", but ideally I want to be able to separate my AND and OR blocks. Trying to separate these blocks into separate grammars leads to left-recursion. Secondly, I want my AND and OR blocks to be grouped together, instead of reading left-to-right, for example, on input (id AND id AND id),
I want:
(and id id id)
not
(and id (and id (and id)))
as it currently is.
The second thing I've tried is making OR blocks direct descendants of AND blocks (i.e., the first case).
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| and_expr;
and_expr
: term (AND term)* ;
term
: LPAREN or_expr RPAREN
| LPAREN atom RPAREN ;
or_expr
: atom (OR atom)+;
atom: INT ;
For the same input, I get a parse tree that is more along the lines of what I'm looking for, but it has one main problem: there isn't an actual hierarchy of OR and AND blocks in the DSL, so this doesn't work for the second case. This approach also seems a bit hacky for what I'm trying to do.
What's the best way to proceed? Again, I'm not too familiar with parsing and CFGs, so some guidance would be great.
Both grammars are equivalent in their ability to parse your sample input. If you simplify your input by removing the unnecessary parentheses, the output of this grammar looks pretty good too:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
Which is what I suspect your first grammar looks like in its entirety.
Your second one requires too many parentheses for my liking (mainly in term), and the breaking up of AND and OR into separate rules instead of alternatives doesn't seem as clean to me.
You can simplify even more though:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN # ParenExp
| expression AND expression # AndBlock
| expression OR expression # OrBlock
| atom # AtomExp
;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
This gives a tree with a different shape, but it is still equivalent. And note the # AndBlock and # OrBlock labels... these "alternative labels" will cause your generated listener or visitor to have separate methods for each, allowing you to completely separate the two in your code semantically as well as syntactically. Perhaps that's what you're looking for?
I like this one best because it's the simplest, its recursion is the clearest, and it offers specific code alternatives for AND and OR.
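For example, here is a minimal sketch with the Python3 target (the command line and the generated module names filterLexer/filterParser/filterVisitor are assumptions, e.g. antlr4 -Dlanguage=Python3 -visitor filter.g4), where the visitor also flattens nested AndBlock/OrBlock contexts into the flat (and ...) / (or ...) groups you asked about:

from antlr4 import InputStream, CommonTokenStream
from filterLexer import filterLexer
from filterParser import filterParser
from filterVisitor import filterVisitor

class FlattenVisitor(filterVisitor):
    # One method per labelled alternative: AND and OR handling are fully separated.
    def visitAndBlock(self, ctx):
        return self._flatten(ctx, 'and')

    def visitOrBlock(self, ctx):
        return self._flatten(ctx, 'or')

    def visitParenExp(self, ctx):
        return self.visit(ctx.expression())

    def visitAtomExp(self, ctx):
        return int(ctx.getText())

    def _flatten(self, ctx, op):
        # Merge a directly nested block of the same operator into one group,
        # turning (and id (and id id)) into (and id id id).
        group = []
        for operand in (self.visit(ctx.expression(0)), self.visit(ctx.expression(1))):
            if isinstance(operand, tuple) and operand[0] == op:
                group.extend(operand[1:])
            else:
                group.append(operand)
        return (op, *group)

lexer = filterLexer(InputStream("60 AND 55 AND 53 AND (3337 OR 2830 OR 23)"))
parser = filterParser(CommonTokenStream(lexer))
tree = parser.expression()   # using expression as the entry rule for brevity
print(FlattenVisitor().visit(tree))
# ('and', 60, 55, 53, ('or', 3337, 2830, 23))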
In the following example, the order matters in terms of precedence:
grammar Precedence;
root: expr EOF;
expr
: expr ('+'|'-') expr
| expr ('*' | '/') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
For example, for the expression 1+1*2, the grammar above produces a parse tree that evaluates as (1+1)*2 = 4.
Whereas if I swapped the first and second alternatives in expr, I would get a parse tree that evaluates as 1+(1*2) = 3.
What are the 'rules', then, for when the ordering of alternatives actually matters? Is it only relevant when one of the 'edges' of the alternative recursively calls expr? For example, something like ~ expr or expr + expr would matter, but something like func_call '(' expr ')' or Atom would not. In other words, when is it important to order things for precedence?
If ANTLR did not have the rule to give precedence to the first alternative that could match, then either of those trees would be a valid interpretation of your input (which means the grammar is technically ambiguous).
However, when there are two alternatives that could be used to match your input, ANTLR will use the first alternative to resolve the ambiguity, in this case establishing operator precedence. So typically you would put the multiplication/division alternative before the addition/subtraction one, since that matches the traditional order of operations:
grammar Precedence;
root: expr EOF;
expr
: expr ('*' | '/') expr
| expr ('+' | '-') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
Most grammar authors will just put them in precedence order, but things like Atoms or parenthesized exprs won’t really care about the order since there’s only a single alternative that could be used.
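If you want to check which tree you are getting, printing it is enough. A minimal sketch, assuming the Precedence grammar is generated with the Python3 target (giving PrecedenceLexer/PrecedenceParser modules):

from antlr4 import InputStream, CommonTokenStream
from PrecedenceLexer import PrecedenceLexer
from PrecedenceParser import PrecedenceParser

parser = PrecedenceParser(CommonTokenStream(PrecedenceLexer(InputStream("1+1*2"))))
print(parser.root().toStringTree(recog=parser))
# With '*'|'/' listed first (as above), 1*2 is nested under the '+', i.e. 1+(1*2);
# swap the two alternatives back and 1+1 ends up nested under the '*', i.e. (1+1)*2.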
I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still gives me the same thing. Also, I've noticed that adding the following rule does nothing for me:
WS
: [ \r\n\t]+ -> skip
;
Which makes me wonder: is skipping whitespace the default behavior of ANTLR4?
ANTLR4-based parsers have the ability to skip over single unwanted or missing tokens and continue parsing when possible (which is the case here). And there is no default that ignores whitespace: you always have to specify a whitespace rule yourself, one that either skips the characters or puts them on a hidden channel.
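If you want to see exactly what the parser receives, dump the token stream. A minimal sketch, assuming the Calc grammar above is generated with the Python3 target (giving a CalcLexer module):

from antlr4 import InputStream, CommonTokenStream
from CalcLexer import CalcLexer

# Tokenize the version of the input with the stray space. The space never
# shows up as a token: without a WS rule the lexer reports an error for it
# and drops the character, and with a skip rule it is discarded silently,
# so the parser sees the same token sequence either way.
tokens = CommonTokenStream(CalcLexer(InputStream("1 +2\n2+3")))
tokens.fill()
for t in tokens.tokens:
    print(t)

Note that this grammar uses NEWLINE as a real token in the expressions rule, so a skip rule like [ \t]+ -> skip is the safer choice here; one that also consumes '\n' can swallow the newlines the parser needs.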
I have this grammar below for implementing an IN operator taking a list of numbers or strings.
grammar listFilterExpr;
entityIdNumberProperty
: 'a.Id'
| 'c.Id'
| 'e.Id'
;
entityIdStringProperty
: 'f.phone'
;
listFilterExpr
: entityIdNumberListFilter
| entityIdStringListFilter
;
listOperator
: '$in:'
;
entityIdNumberListFilter
: entityIdNumberProperty listOperator numberList
;
entityIdStringListFilter
: entityIdStringProperty listOperator stringList
;
numberList: '[' ID (',' ID)* ']';
fragment ID: [1-9][0-9]*;
stringList: '[' STRING (',' STRING)* ']';
STRING
: '"'(ESC | SAFECODEPOINT)*'"'
;
fragment ESC
: '\\' (["\\/bfnrt] | UNICODE)
;
fragment SAFECODEPOINT
: ~ ["\\\u0000-\u001F]
;
If I try to parse the following input:
c.Id $in: [1,1]
Then I get the following error in the parser:
mismatched input '1' expecting ID
Please help me to correct this grammar.
Update
I found the following rule way above in the huge grammar file of my project, which might be matching '1' before it can be matched as ID:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
But if I write my ID rule before NUMBER, then other things fail, because input that should have matched NUMBER is now matched as ID.
What should I do?
As mentioned by rici: ID should not be a fragment. Fragments can only be used by other lexer rules; they will never become a token on their own (and can therefore not be used in parser rules).
Just remove the fragment keyword from it: ID: [1-9][0-9]*;
Note that you'll also have to account for spaces. You probably want to skip them:
SPACES : [ \t\r\n]+ -> skip;
As for the error mismatched input '1' expecting ID: this looks like there's another lexer rule, besides ID, that also matches the input 1 and is defined before ID. In that case, have a look at this Q&A: ANTLR 4.5 - Mismatched Input 'x' expecting 'x'
EDIT
Because you have the rules ordered like this:
NUMBER
: '-'? INT ('.' [0-9] +)?
;
fragment INT
: '0' | [1-9] [0-9]*
;
ID
: [1-9][0-9]*
;
the lexer will never create an ID token (only NUMBER tokens will be created). This is just how ANTLR works: when two or more lexer rules match the same number of characters, the one defined first "wins".
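You can see this for yourself by dumping the tokens. A minimal sketch, assuming the combined grammar is generated with the Python3 target and that the generated lexer class is called listFilterExprLexer (adjust the name to whatever your project actually generates):

from antlr4 import InputStream, CommonTokenStream
from listFilterExprLexer import listFilterExprLexer   # assumed generated name

lexer = listFilterExprLexer(InputStream("c.Id $in: [1,1]"))
stream = CommonTokenStream(lexer)
stream.fill()
for tok in stream.tokens:
    # Each token prints with its numeric type; comparing those numbers with
    # the constants in the generated lexer (e.g. listFilterExprLexer.NUMBER,
    # listFilterExprLexer.ID) shows the '1's coming out as NUMBER, never ID.
    print(tok)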
In the first place I think it's odd to have an ID rule that matches only digits, but, if that's the language you're parsing, OK. In your case, you could do something like this:
id : POS_NUMBER;
number : POS_NUMBER | NEG_NUMBER;
POS_NUMBER : INT ('.' [0-9] +)?;
NEG_NUMBER : '-' POS_NUMBER;
fragment INT
: '0' | [1-9] [0-9]*
;
and then use id instead of ID in your parser rules, and number instead of the NUMBER you're using now.
I'm working with ANTLR4 to parse a boolean-like DSL.
Here is my grammar:
grammar filter;
filter: overall EOF;
overall
: LPAREN overall RPAREN
| category
;
category
: expression # InferenceCategory
| category AND category # CategoryAndBlock
| label COLON expression # CategoryBlock
| LPAREN category RPAREN # NestedCategory
;
expression
: NOT expression # NotExpr
| expression AND expression # AndExpr
| expression OR expression # OrExpr
| atom # AtomExpr
| LPAREN expression RPAREN # NestedExpression
;
label
: ALPHANUM
;
atom
: ALPHANUM
;
Here is an example input string to parse:
(cat1:(1 OR 2) AND cat2:( 4 ))
This grammar works fine with this input; it produces a parse tree that perfectly suits my needs.
However, there is a weird case of the DSL where the "cat1" label is implicit when no other category is specified. This is what the InferenceCategory label catches; such an expression will be handled as a category in my code later.
For example, with
((1 OR 2) AND cat2:( 4 ))
I get the expected tree: the (1 OR 2) block is identified as an InferenceCategory.
However, in the following instance:
cat2:( 4 ) AND (1 OR 2)
I get a tree in which the second block is not identified as an InferenceCategory, but is instead treated as a normal expression under the first category. This is because the grammar parses the ( 4 ) following cat2: as a normal expression, and everything past that point is parsed as part of that expression.
Is there any way to fix this? I've tried:
label COLON expression (AND category)* # CategoryBlock
(which doesn't work)
and
category AND category AND category
(which "works", but is extremely hacky and only works in the specific case that I have exactly three categories. Any more, and it breaks again.)
The "alternative labels" like NOT expression # NotExpr do not make a difference in your parse tree. They are semantic-only. They will cause the code generation process to create specific signatures that you can override in your Visitor or Listener.
The rationale behind this is that, for example, instead of getting just one Visitor override for expression, you'll get several, one for each alternative label. That way, you don't have to examine an expression and determine what type it is before acting on it. Instead, you'll get an override for # OrExpr, for example, and as soon as you're in that override code, you know you're dealing with an OR, with an expression on each side of the OR token.
The parse tree is useful, but much of the semantics only become apparent when you code up your Listener or Visitor.
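As a minimal sketch of that with the Python3 target (the module names filterLexer/filterParser/filterListener below assume the full grammar, including its lexer rules such as ALPHANUM and COLON, is generated as filter with a listener), each labelled alternative gets its own callback:

from antlr4 import InputStream, CommonTokenStream, ParseTreeWalker
from filterLexer import filterLexer
from filterParser import filterParser
from filterListener import filterListener

class CategoryListener(filterListener):
    # One callback per labelled alternative, so explicit and inferred
    # categories are handled in completely separate methods.
    def enterCategoryBlock(self, ctx):
        print("explicit category:", ctx.label().getText())

    def enterInferenceCategory(self, ctx):
        print("implicit 'cat1' category:", ctx.expression().getText())

lexer = filterLexer(InputStream("((1 OR 2) AND cat2:( 4 ))"))
tree = filterParser(CommonTokenStream(lexer)).overall()   # overall as the entry rule for brevity
ParseTreeWalker().walk(CategoryListener(), tree)
# the (1 OR 2) block triggers enterInferenceCategory, cat2 triggers enterCategoryBlock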
I am trying to implement an interpreter for a programming language, and I ended up stumbling upon a case where I would need to backtrack, but my parser generator (PLY, a lex & yacc clone written in Python) does not allow that.
Here are the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first-class objects), and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends up in an error state. I would clearly like the parser to backtrack and try constructing an expression = new typereference(SomeGuy.SomeOtherGuy) LPAREN empty_list RPAREN instead; then things would work, because the ; would match the var statement syntax just fine.
However, given that PLY does not support backtracking, and that I definitely do not have enough experience with parser generators to implement it myself: is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision: once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
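If it helps with the PLY side, here is a rough transcription of that outline. Treat it as a sketch: the lexer rules and exact token names are my own assumptions (the question doesn't show them), all semantic actions are omitted, and only the statement form above is wired up as the start symbol.

import ply.lex as lex
import ply.yacc as yacc

# --- lexer: my own assumptions, adjust to your real token definitions ---
reserved = {'new': 'NEW', 'super': 'SUPER', 'var': 'VAR'}
tokens = ['NAME', 'DOT', 'LPAREN', 'RPAREN', 'OPSQR', 'CLSQR',
          'COMMA', 'EQUALS', 'SEMIC'] + list(reserved.values())

t_DOT    = r'\.'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_OPSQR  = r'\['
t_CLSQR  = r'\]'
t_COMMA  = r','
t_EQUALS = r'='
t_SEMIC  = r';'
t_ignore = ' \t\n'

def t_NAME(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = reserved.get(t.value, 'NAME')
    return t

def t_error(t):
    raise SyntaxError('bad character %r' % t.value[0])

# --- parser rules mirroring the outline above (no semantic actions) ---
def p_primary(p):
    '''primary : NAME
               | SUPER'''

def p_postfixed(p):
    '''postfixed : primary
                 | postfixed DOT NAME
                 | postfixed OPSQR expression CLSQR
                 | postfixed LPAREN call_args RPAREN'''   # PRODUCTION 1

def p_expression(p):
    '''expression : postfixed
                  | NEW postfixed LPAREN call_args RPAREN'''   # PRODUCTION 2

def p_call_args(p):
    '''call_args : expression_list
                 | empty'''

def p_expression_list(p):
    '''expression_list : expression
                       | expression_list COMMA expression'''

def p_var_decl(p):
    '''var_decl : NAME
                | NAME EQUALS expression'''

def p_var_decl_list(p):
    '''var_decl_list : var_decl
                     | var_decl_list COMMA var_decl'''

def p_statement(p):
    '''statement : VAR var_decl_list SEMIC'''

def p_empty(p):
    'empty :'

def p_error(p):
    raise SyntaxError('parse error at %r' % (p,))

lex.lex()
parser = yacc.yacc(start='statement')
parser.parse('var x = new SomeGuy.SomeOtherGuy();')
print('parsed without backtracking')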