Greedy subrules in ANTLR4 - parsing

I'm working on a parser grammar that should allow trailing expressions without enclosing symbols. The following is a simplified version that demonstrates the issue:
grammar Example;
root: expression EOF;
expression: binaryExpression;
binaryExpression
: binaryExpression 'and' binaryExpression
| binaryExpression 'or' binaryExpression
| quantifier
| '(' expression ')'
| OPERAND
;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
If you try to parse the following expression, you'll notice that, although the parser correctly recognizes the input, it reports an ambiguity:
true or false and no x in y satisfies true or false
Error reporting works as expected for an invalid input such as (1 (more about this later):
line 1:1 token recognition error at: '1'
line 1:2 mismatched input '<EOF>' expecting {'(', 'no', OPERAND}
I'm looking for some way to explicitly tell the parser that the quantifier should be greedy: everything on the right-hand side should be consumed unambiguously until the end of the expression.
I tried to refactor the rules to allow the quantifier only on the RHS of binary expressions. Although it worked, the error recovery mechanism becomes unable to recognize most expressions:
grammar Example;
root: expression EOF;
expression: quantifier | booleanExpression;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
booleanExpression
: orExpression ('or' (quantifier | andQuantifier))?
| andQuantifier
;
andQuantifier: andExpression 'and' quantifier;
orExpression
: orExpression 'or' orExpression
| andExpression
;
andExpression
: andExpression 'and' andExpression
| '(' expression ')'
| OPERAND
;
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
With this grammar, the ambiguity is gone.
But it came at the cost of a more complex grammar whose error reporting no longer handles wrong inputs like (1 well:
line 1:1 token recognition error at: '1'
line 1:2 no viable alternative at input '('
Does anyone else have any other idea on how to fix it?

This is the way I'd do it, using ANTLR4's built-in algorithm for resolving ambiguity with precedence (since the grammar is certainly ambiguous). In order to get the precedence algorithm to work, it's useful to think of a quantification as a unary operator with low precedence, which is why quantifier below is just the "operator" and not the full expression. Presumably in a real grammar you would have other quantifiers, and very likely unary operators with higher precedence like not.
grammar Example;
root: expression EOF;
expression
: expression 'and' expression
| expression 'or' expression
| quantifier expression
| operand
| '(' expression ')'
;
quantifier
: 'no' ID 'in' ID 'satisfies'
;
operand: BOOLEAN | ID;
BOOLEAN: 'true' | 'false';
ID: [a-zA-Z]+;
WHITE_SPACE: (' ' | '\r' | '\n' | '\t')+ -> channel(HIDDEN);
This isn't quite the same as the example in your post because you modified a few minor details from the first version of the question. But I think it's indicative.
For obvious reasons I couldn't try it with (1 (I suppose that input corresponds to yet a different version where integers are OPERANDs), but with (true it gave me what looks like the error report you are seeking. I'm not really an ANTLR4 expert so I don't know how to predict the details of error recovery.
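The idea of treating the quantifier as a low-precedence prefix operator can also be illustrated outside ANTLR. The following is a minimal hand-written precedence-climbing sketch in Python, not ANTLR's generated parser; the names `tokenize`, `parse` and the `BINARY` binding-power table are assumptions made for illustration. It shows the quantifier body swallowing everything to its right:

```python
import re

# Binding powers mirror the order of alternatives in the grammar above:
# 'and' binds tighter than 'or'; the quantifier body is parsed at the lowest level.
BINARY = {"or": 1, "and": 2}
NOT_AN_OPERAND = {")", "and", "or", "in", "satisfies"}
TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z]+)")


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {token!r}")
        pos += 1
        return token

    def expression(min_power):
        token = take()
        if token == "no":
            var = take()
            take("in")
            collection = take()
            take("satisfies")
            # Power 0: the body greedily absorbs every following 'and'/'or'.
            left = ("no", var, collection, expression(0))
        elif token == "(":
            left = expression(0)
            take(")")
        elif token in NOT_AN_OPERAND:
            raise SyntaxError(f"unexpected token {token!r}")
        else:
            left = token
        while peek() in BINARY and BINARY[peek()] >= min_power:
            op = take()
            left = (op, left, expression(BINARY[op] + 1))
        return left

    tree = expression(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Under these assumptions, `parse("true or false and no x in y satisfies true or false")` returns `('or', 'true', ('and', 'false', ('no', 'x', 'y', ('or', 'true', 'false'))))`, so the trailing `or false` ends up inside the quantifier.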

OK, after a lot of back and forth here, I think I finally get that what you're looking for is associativity. Try:
grammar Example;
root: expression EOF;
expression
: '(' expression ')' # parenExpr
| <assoc=right>expression (AND | OR) quantifier # quantifierExpr
| expression AND expression # andExpr
| expression OR expression # orExpr
| OPERAND # operandExpr
;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
AND: 'and';
OR: 'or';
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
(I took the liberty of adding labels to your alternatives and simplifying the expression rule.) The labels will come in very handy in your code, as you need to deal with each alternative individually: they give you separate functions to override in your listeners/visitors, along with Context classes specific to each alternative.
true and false or false and no x in y satisfies true or false
true and false or false or no x in y satisfies true or false
true and false or false and no x in y satisfies true or false
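To make the benefit of labels concrete, here is a small Python sketch of label-based dispatch. It is a stand-in, not the ANTLR runtime or its generated visitor classes: the tuple-shaped tree and the `Visitor` and `Evaluator` classes are hypothetical, and the quantifier alternative is left out. Each labelled alternative gets its own method, much as each label gets its own function in generated listeners and visitors:

```python
class Visitor:
    """Dispatch on the node's label, one method per labelled alternative."""

    def visit(self, node):
        label, *children = node
        return getattr(self, "visit_" + label)(*children)


class Evaluator(Visitor):
    def visit_parenExpr(self, inner):
        return self.visit(inner)

    def visit_andExpr(self, left, right):
        return self.visit(left) and self.visit(right)

    def visit_orExpr(self, left, right):
        return self.visit(left) or self.visit(right)

    def visit_operandExpr(self, text):
        return text == "true"


# Hand-built tree for: true and false or true (with 'and' grouped first).
tree = ("orExpr",
        ("andExpr", ("operandExpr", "true"), ("operandExpr", "false")),
        ("operandExpr", "true"))
print(Evaluator().visit(tree))  # prints True
```

Without labels, all of this logic would have to live in a single method that inspects the children to work out which alternative matched.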

Related

Ambiguity between tuple and parenthesized expression

One of the constructs that can remain ambiguous until almost the very end is a tuple vs. a parenthesized expression. A tuple is distinguished from a parenthesized expression by the presence of a comma, and often a single-member tuple is not allowed, as it would be ambiguous. For example, from BigQuery:
Tuple syntax
(expr1, expr2 [, ... ])
The output type is an anonymous STRUCT type with anonymous fields with types matching the types of the input expressions. There must be at least two expressions specified. Otherwise this syntax is indistinguishable from an expression wrapped with parentheses.
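The comma rule quoted above can be sketched as a small check: count the commas at nesting depth zero inside the outermost parentheses. This is a hypothetical helper (`classify_parenthesized` is not part of BigQuery or ANTLR), and it assumes the text contains no string literals with commas or parentheses in them:

```python
def classify_parenthesized(text):
    """Classify '( ... )' as a tuple or a parenthesized expression."""
    text = text.strip()
    if len(text) < 2 or text[0] != "(" or text[-1] != ")":
        raise ValueError("expected text wrapped in parentheses")
    depth = 0
    top_level_commas = 0
    for ch in text[1:-1]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # e.g. "(a)+(b)": the first '(' does not match the last ')'
                raise ValueError("outer parentheses do not enclose the whole text")
        elif ch == "," and depth == 0:
            top_level_commas += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return "tuple" if top_level_commas else "parenthesized expression"
```

Note that the decision can only be made once the whole group has been scanned, which is why this construct is hard on a parser with limited lookahead.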
I am having trouble figuring out why my grammar is ambiguous here, which allows for both:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| select # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
And with my input:
SELECT id FROM sales WHERE country IN ((select 1,1,1,1,1,1,1,1,1),1)
I get the following profiling information from Antlr telling me I have ambiguities.
Why is this occurring, and how would I properly resolve this?
The ambiguity arises from the non-parenthesized sub-select expression. For example if we have:
SELECT a FROM b WHERE x IN (select 1,1)
The IN expression part can be parsed in two different ways:
Atom inExpression(tupleLiteralExpression(subSelectExpression, Atom))
Or as:
Atom inExpression(subSelectExpression)
Since (SELECT 1,1) could either be seen as a select clause SELECT 1,1 or it can be seen as a tuple containing two elements, SELECT 1 and 1.
Because of this, we must require parentheses around the sub-select so we know where the select clause starts and ends. Here would be the proper grammar resolving the ambiguities:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| '(' select ')' # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;

Good ways to test antlr performance

Is there a good way to compare the grammar between two different files or rules to see which one performs better? As an example, let's say I'm 'starting' with the following grammar that I want to optimize:
grammar Calc;
program
: equations
;
equations
: equation* EOF
;
equation
: expression relop expression
;
expression
: LPAREN expression RPAREN
| expression (POWER) expression
| expression (TIMES | DIV) expression
| expression (PLUS | MINUS) expression
| (PLUS | MINUS)* atom
;
atom
: number
| variable
;
variable // so the entire variable gets consumed as one token
: VARIABLE
;
number
: NUMBER
;
relop
: EQ
| GTE
| LTE
| GT
| LT
;
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
POWER: '^';
EQ: '=';
GTE: '>=';
GT: '>';
LTE: '<=';
LT: '<';
LPAREN: '(';
RPAREN: ')';
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
VARIABLE: [a-zA-Z_] [a-zA-Z_0-9]*;
WS: [ \r\n\t] -> skip;
And then, perhaps I'm curious whether it performs better if I 'inline' some of the rules:
grammar Calc2;
program: equations;
equations: equation* EOF;
equation: expression ('=' | '>' | '>=' | '<' | '<=' ) expression
;
expression
: '(' expression ')'
| expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| ('+' | '-')* ATOM
;
ATOM
: ([a-zA-Z_] [a-zA-Z_0-9]* // variable
| ([0-9]+ ('.' [0-9]*)? | '.' [0-9]+) ([Ee] [+-]? [0-9]+)? // decimal, with the exponent applying to both forms
);
WS: [ \r\n\t] -> skip;
I was thinking perhaps I could generate an output of about a million test expressions or something and then run both of the grammars against it to see the performance difference. Is there a tool to do this or basically to evaluate performance of one set of rules (or file) against another?
Just inlining the rules as above made a dramatic difference in parsing time and in memory consumption. Here is what I did:
First, I generated a file with 1M equations with the following:
# Write 1M equations, one per line; the file is closed when the block exits.
with open('input2.txt', 'w') as f:
    for i in range(1000000):
        f.write('x=(2+4)*%s-(x*72);\n' % i)
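The fixed pattern above exercises only one expression shape. A more varied corpus could be produced with a seeded random generator along these lines. The function names, the atom list and the depth limit are assumptions for illustration; the generator emits only characters that the Calc lexer rules recognize, and seeding makes the corpus reproducible between runs:

```python
import random

OPERATORS = ["+", "-", "*", "/", "^"]
RELOPS = ["=", ">=", "<=", ">", "<"]


def random_expression(rng, depth=0, max_depth=3):
    # Stop at an atom once the depth limit is hit, or at random before that.
    if depth >= max_depth or rng.random() < 0.3:
        return rng.choice(["x", "y_1", str(rng.randint(0, 999)), "2.5"])
    left = random_expression(rng, depth + 1, max_depth)
    right = random_expression(rng, depth + 1, max_depth)
    text = left + rng.choice(OPERATORS) + right
    # Parenthesize about half of the compound expressions.
    return "(" + text + ")" if rng.random() < 0.5 else text


def random_equation(rng):
    return random_expression(rng) + rng.choice(RELOPS) + random_expression(rng)


def write_corpus(path, count, seed=0):
    rng = random.Random(seed)
    with open(path, "w") as out:
        for _ in range(count):
            out.write(random_equation(rng) + "\n")
```

Feeding the same file to both grammars keeps the comparison fair, since both see exactly the same input.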
Next I timed the two runs with the following:
$ # this is the full file
$ antlr4 Calc.g4 && javac Calc*.java
$ time java org.antlr.v4.gui.TestRig Calc program input2.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.antlr.v4.gui.TestRig.process(TestRig.java:207)
at org.antlr.v4.gui.TestRig.process(TestRig.java:166)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
real 0m43.721s
user 2m11.130s
sys 0m2.481s
After about 40 seconds it runs out of memory. Here is the in-lined version:
$ # this is the in-lined file
$ antlr4 Calc2.g4 && javac Calc2*.java
$ time java org.antlr.v4.gui.TestRig Calc2 program input2.txt
real 0m7.149s
user 0m12.589s
sys 0m1.240s
So the first one where I write the items cleanly takes 43 seconds until it runs out of memory! The second version takes 7 seconds and finishes!
Though it's possible that the cost in the first grammar comes from the interplay between these lexer rules:
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
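For a more repeatable comparison than a single `time` run, the measurement can be wrapped in a small harness. The sketch below is generic Python, not an ANTLR tool: `time_parser`, `compare` and the parser callables passed to them are hypothetical, and in practice each callable would wrap a parser generated from one of the grammars. It keeps the best of several runs to reduce noise:

```python
import time


def time_parser(parse, lines, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            parse(line)
        best = min(best, time.perf_counter() - start)
    return best


def compare(parsers, lines, repeats=3):
    """Map each parser name to its best time; `parsers` is a dict of name -> callable."""
    return {name: time_parser(fn, lines, repeats) for name, fn in parsers.items()}
```

Parsing line by line also avoids holding one parse tree for the entire file in memory, which may matter for inputs of the size used above.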

How to avoid ambiguity in rules with optional right-hand operand?

ANTLR4 is known to rewrite left-recursive rules automatically. But how can ambiguity be avoided when the last operand is optional?
Given the following simplified grammar as an example:
grammar Example;
root: expression EOF;
expression: binaryExpression;
binaryExpression
: binaryExpression 'and' binaryExpression
| binaryExpression 'or' binaryExpression
| 'no' ID 'in' ID ('satisfies' expression)?
| '(' expression ')'
| OPERAND
;
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
This grammar accepts both the expressions no x in y and no x in y satisfies condition. However, the optional subrule ('satisfies' expression)? is not considered during the left-recursion refactoring. As a result, the input no x in y satisfies true and false is reported as ambiguous:
The parser assumes two viable trees, no x in y satisfies (true and false) and (no x in y satisfies true) and false, but only the first should be considered a viable interpretation.
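The two readings can be reproduced with a brute-force enumerator. This Python sketch is not how ANTLR works internally: it lists every derivation of the plain context-free grammar and ignores ANTLR's precedence rules for binary operators, so it over-reports for chains of and/or. The names `parse_trees`, `is_id` and `KEYWORDS` are made up for illustration:

```python
from functools import lru_cache

KEYWORDS = {"and", "or", "no", "in", "satisfies", "true", "false"}


def is_id(token):
    return token.isalpha() and token.islower() and token not in KEYWORDS


def parse_trees(tokens):
    """Return every parse tree of the token list under the grammar above."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(i, j):
        span = tokens[i:j]
        found = []
        if len(span) == 1 and span[0] in ("true", "false"):
            found.append(span[0])
        if len(span) >= 3 and span[0] == "(" and span[-1] == ")":
            found += [("paren", inner) for inner in trees(i + 1, j - 1)]
        if (len(span) >= 4 and span[0] == "no" and is_id(span[1])
                and span[2] == "in" and is_id(span[3])):
            if len(span) == 4:
                found.append(("no", span[1], span[3]))
            elif span[4] == "satisfies":
                found += [("no", span[1], span[3], body) for body in trees(i + 5, j)]
        # Try every 'and'/'or' inside the span as the top-level operator.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("and", "or"):
                found += [(tokens[k], left, right)
                          for left in trees(i, k) for right in trees(k + 1, j)]
        return tuple(found)

    return list(trees(0, len(tokens)))
```

For `"no x in y satisfies true and false".split()` this yields exactly two trees: `('no', 'x', 'y', ('and', 'true', 'false'))` and `('and', ('no', 'x', 'y', 'true'), 'false')`.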
One could argue that I could rewrite the grammar as follows:
binaryExpression
: binaryExpression 'and' binaryExpression
| binaryExpression 'or' binaryExpression
| 'no' ID 'in' ID 'satisfies' expression
| 'no' ID 'in' ID
| '(' expression ')'
| OPERAND
;
Although it "works", in addition to the impact on performance, error reporting is no longer useful, as it is not possible to infer what comes next until the parser reaches the ID token.
Is there any way to refactor the grammar to explicitly specify that the expression in the RHS of the quantifier should always take precedence?

YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operator). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generate GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
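These rules are easy to check directly in Python. The short script below simply evaluates each of the literals mentioned above and reports its type:

```python
examples = {
    "(5)": (5),        # parenthesized expression, not a tuple
    "()": (),
    "(5,)": (5,),
    "(5,6)": (5, 6),
    "(5,6,)": (5, 6,),
    "[5]": [5],        # no ambiguity with square brackets
    "[5,]": [5,],
}

for source, value in examples.items():
    print(f"{source:8} -> {type(value).__name__}")
```

Only `(5)` comes out as an `int`; every other parenthesized form is a `tuple`, and both bracketed forms are a `list`.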

How to define logical operator with parenthesis in ANTLR grammar

I am defining a grammar in ANTLR for expressions that combine logical operators and parentheses.
Here is the grammar
grammar simpleGrammar;
/* This will be the entry point of the parser. */
parse
:
expression EOF
;
expression
:
expression binOp expression | ID | unOp (expression) | '(' expression ')'
;
binOp
:
('AND' | 'OR')
;
unOp
:
'NOT'
;
ID :
('a'..'z' | 'A'..'Z')+
;
The grammar produces a parse tree for input without parentheses, but when I give it an example with parentheses, such as (Apple OR Bananana)AND Orange
it throws a MismatchedTokenException.
I would really appreciate it if someone could explain how to define the grammar so that it handles parentheses.
You forgot to tell ANTLR what to do with whitespace. For example:
WS : [ \t\r\n] -> skip;
Add this and your grammar will work.
As a side note, your grammar has the same precedence for the AND and OR operators. And these operators have higher precedence than NOT. As this goes against conventional rules, I'd advise you to write your expression rule like this instead:
expression
: '(' expression ')' # parenExp
| 'NOT' expression # notExpr
| expression 'AND' expression # andExpr
| expression 'OR' expression # orExpr
| ID # atomExpr
;
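The conventional ordering (NOT binds tightest, then AND, then OR) is the same one Python uses for its own boolean operators, which gives a quick way to see the expected tree shape. This sketch uses Python's standard `ast` module rather than ANTLR, and the `shape` helper is a made-up name:

```python
import ast


def shape(node):
    """Render a boolean expression tree as nested tuples of operator names."""
    if isinstance(node, ast.Expression):
        return shape(node.body)
    if isinstance(node, ast.BoolOp):
        return (type(node.op).__name__, *[shape(value) for value in node.values])
    if isinstance(node, ast.UnaryOp):
        return (type(node.op).__name__, shape(node.operand))
    if isinstance(node, ast.Name):
        return node.id
    raise ValueError(f"unsupported node: {type(node).__name__}")


tree = ast.parse("not a and b or c", mode="eval")
print(shape(tree))  # prints ('Or', ('And', ('Not', 'a'), 'b'), 'c')
```

A grammar written with the alternative order shown above should group NOT a AND b OR c the same way.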

Resources