I have written the following two grammars, one that groups the arithmetic operators into a single alternative where possible and one that doesn't:
grammar NoPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| Atom
;
Atom: [a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
grammar YesPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| Atom
;
Atom:[a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
It seems that these two have almost identical runtimes, build sizes, etc. Does ANTLR automatically convert the two forms of alternatives to the same output, for example:
expr: expr '*' expr | expr '/' expr <==> expr: expr ('*'|'/') expr;
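No, the two forms are not equivalent. In a left-recursive rule, ANTLR derives operator precedence from the order of the alternatives, so writing '*' and '/' as separate alternatives gives '*' a higher precedence than '/', whereas writing ('*'|'/') in one alternative gives them the same precedence. How would ANTLR know that you wanted * and / to have the same binding precedence, different from + and -? You need to be explicit about that.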
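To make the difference concrete, here is a minimal Python sketch (not ANTLR's generated code) of a precedence-climbing parser, the general technique ANTLR 4 applies to left-recursive rules. The tables NO_PREFIX and YES_PREFIX are hypothetical stand-ins for the two grammars, assuming that earlier alternatives bind tighter and that every operator is left-associative.

```python
import re


def tokenize(text):
    # Atoms are runs of letters or digits; operators and parens are single chars.
    return re.findall(r"[a-z]+|[0-9]+|[-+*/()]", text)


def parse(text, prec):
    """Precedence-climbing parser; `prec` maps operator -> binding strength."""
    tokens = tokenize(text)
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr(0)
            pos += 1  # skip the closing ")"
            return node
        return tok

    def expr(min_prec):
        nonlocal pos
        left = primary()
        while (pos < len(tokens) and tokens[pos] in prec
               and prec[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            # Asking for a strictly higher level on the right-hand side
            # makes every operator left-associative.
            right = expr(prec[op] + 1)
            left = (op, left, right)
        return left

    return expr(0)


# Assumption: one level per alternative, earlier alternatives bind tighter.
NO_PREFIX = {"*": 4, "/": 3, "+": 2, "-": 1}
# Assumption: one level per grouped alternative.
YES_PREFIX = {"*": 2, "/": 2, "+": 1, "-": 1}

print(parse("a/b*c", NO_PREFIX))
print(parse("a/b*c", YES_PREFIX))
```

With these tables, a/b*c comes out as a/(b*c) for the NoPrefix ordering and as (a/b)*c for the YesPrefix ordering: both grammars accept the same input, but they build different trees.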
One of the constructs that can remain ambiguous until almost the very end is a tuple vs. a parenthesized expression. A tuple is distinguished from a parenthesized expression by the presence of a comma -- and often a single-member tuple is not allowed, as it would be ambiguous, for example from BigQuery:
Tuple syntax
(expr1, expr2 [, ... ])
The output type is an anonymous STRUCT type with anonymous fields with types matching the types of the input expressions. There must be at least two expressions specified. Otherwise this syntax is indistinguishable from an expression wrapped with parentheses.
I am having trouble figuring out why my grammar is ambiguous here, which allows for both:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| select # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
And with my input:
SELECT id FROM sales WHERE country IN ((select 1,1,1,1,1,1,1,1,1),1)
I get the following profiling information from Antlr telling me I have ambiguities.
Why is this occurring, and how would I properly resolve this?
The ambiguity arises from the non-parenthesized sub-select expression. For example if we have:
SELECT a FROM b WHERE x IN (select 1,1)
The IN expression part can be parsed in two different ways:
Atom inExpression(tupleLiteralExpression(subSelectExpression, Atom))
Or as:
Atom inExpression(subSelectExpression)
This is because (SELECT 1,1) can be read either as a single select clause, SELECT 1,1, or as a tuple containing two elements, SELECT 1 and 1.
Because of this, we must require parentheses around the sub-select so we know where the select clause starts and ends. Here is a grammar that resolves the ambiguity:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| '(' select ')' # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
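As a rough check of this reasoning, the following Python sketch counts how many distinct parse trees a token sequence has under a simplified version of the expr rule. It is only an illustration: the function count_parses and its bare_select flag are made up for this example, and the IN, FROM and WHERE parts of the grammar are left out.

```python
import re
from functools import lru_cache


def count_parses(text, bare_select):
    """Count parse trees for `text` as an expr.

    bare_select=True  models the alternative `select`          (original grammar)
    bare_select=False models the alternative `'(' select ')'`  (fixed grammar)
    """
    tokens = tuple(re.findall(r"[a-z0-9]+|[(),]", text.lower()))

    @lru_cache(maxsize=None)
    def expr(i, j):
        # Number of ways tokens[i:j] can be derived from expr.
        if j <= i:
            return 0
        total = 0
        if j - i == 1 and tokens[i].isalnum() and tokens[i] != "select":
            total += 1  # Atom
        if tokens[i] == "(" and tokens[j - 1] == ")" and j - i >= 2:
            # One inner expr is a parenExpression, two or more a tuple.
            total += expr_list(i + 1, j - 1)
            if not bare_select and tokens[i + 1] == "select":
                total += expr_list(i + 2, j - 1)  # '(' select ')'
        if bare_select and tokens[i] == "select":
            total += expr_list(i + 1, j)  # select without parentheses
        return total

    @lru_cache(maxsize=None)
    def expr_list(i, j):
        # Number of ways tokens[i:j] can be derived from expr (',' expr)*.
        if j <= i:
            return 0
        total = expr(i, j)
        for k in range(i + 1, j - 1):
            if tokens[k] == ",":
                total += expr(i, k) * expr_list(k + 1, j)
        return total

    return expr(0, len(tokens))


print(count_parses("(select 1,1)", bare_select=True))
print(count_parses("(select 1,1)", bare_select=False))
```

Under these simplifications, (select 1,1) has two parse trees when a bare select is allowed as an expression and exactly one when the sub-select must carry its own parentheses.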
In various tests I've run over the past few weeks, it seems that the more 'compact' a grammar is, the faster it runs and the smaller the generated program -- so anything that reduces downstream rule/function calls (while keeping the grammar valid) is a good thing to do.
Here is the most basic example I could come up with demonstrating this:
grammar NoIndirection;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| '-' expr
| '+' expr
| Atom
;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
grammar YesIndirection1;
root: (expr ';')* EOF;
expr
: parenExpr
| uExpr
| atomExpr
;
parenExpr: '(' expr ')';
uExpr: ('+'|'-') expr;
atomExpr: Atom;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
grammar YesIndirection2;
root: (expr ';')* EOF;
expr
: parenExpr
| uExpr
| atomExpr
;
parenExpr: '(' expr ')';
uExpr: uExprP | uExprM;
uExprP: '+' expr;
uExprM: '-' expr;
atomExpr: Atom;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
The timings and output program size on a ~1MB file are as follows:
NoIndirection: 0m0.476s / 72K
YesIndirection1: 0m0.578s / 88K
YesIndirection2: 0m0.636s / 104K (~1.4x on both performance and size over the first)
My question(s) related to this are as follows:
Does the above match your experience -- that is, the fewer rules and levels of indirection there are, the faster the parser is?
Why are these function calls so expensive?
Finally, if indirection is (always) a performance hit, would it be a good idea to write the rules for maximum readability and then preprocess the grammar so that as much as possible is inlined?
Can someone identify where the grammar conflict is in this expression production?
expr '+' expr
|
expr '-' expr
|
expr '*' expr
|
expr '/' expr
|
expr '(' ')'
|
T_IDENTIFIER
|
T_STRING_LITERAL
|
T_INTEGER_LITERAL
|
T_FLOAT_LITERAL
I'm trying to implement function calls that take an expr as the callee, so for example, the following would all be valid input:
1()
1.5()
"STRING"()
fn()
I would like to be able to write a "meta-rule" in ANTLR4 that takes a rule as an input argument and performs a set modification to that rule. Here's an example grammar:
grammar G;
WS: [ \t\n\r] + -> skip;
CHAR: [a-z];
term: (CHAR)+;
sum: term ('+' term)+;
pterm: '(' term ')' | '(' pterm ')';
psum: '(' sum ')' | '(' psum ')';
expr: term | sum | pterm | psum;
The rules for pterm and psum perform the same action on term and sum, enclosing them in possibly nested parentheses. I would like to be able to replace the last three lines above with something like the following:
enclose[rule]: '(' rule ')' | '(' enclose(rule) ')';
expr: term | sum | enclose(term) | enclose(sum);
Is there a way to construct a meta-rule like this?
The short answer is no.
Better to resolve by refactoring the grammar and identifying the structurally significant terms:
expr: sum | LPAREN expr RPAREN ; // bare sums stay valid, as in the original expr rule
sum : term ('+' term)* ; // changed to Kleene star
term: CHAR+ ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : [a-z] ;
WS : [ \t\n\r]+ -> skip ;
The sum rule will consume all terms, so the expr rule only needs to handle sums.
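For comparison, outside ANTLR the "meta-rule" idea is just a higher-order function. The Python sketch below is a hand-written recognizer for the question's original grammar, not something ANTLR can generate; the names enclose, sum_ and matches_expr are made up for this illustration. Each rule takes a string and a start index and returns the index just past the match, or None.

```python
def term(s, i):
    """term: CHAR+ . Returns the index just past the match, or None."""
    j = i
    while j < len(s) and "a" <= s[j] <= "z":
        j += 1
    return j if j > i else None


def sum_(s, i):
    """sum: term ('+' term)+ . Needs at least one '+'."""
    j = term(s, i)
    if j is None:
        return None
    plus_count = 0
    while j < len(s) and s[j] == "+":
        k = term(s, j + 1)
        if k is None:
            break
        j, plus_count = k, plus_count + 1
    return j if plus_count else None


def enclose(rule):
    """Hypothetical meta-rule: '(' rule ')' | '(' enclose(rule) ')'."""
    def parser(s, i):
        if i < len(s) and s[i] == "(":
            # Try the wrapped rule first, then another level of parentheses.
            for inner in (rule, parser):
                j = inner(s, i + 1)
                if j is not None and j < len(s) and s[j] == ")":
                    return j + 1
        return None
    return parser


pterm = enclose(term)
psum = enclose(sum_)


def matches_expr(text):
    """expr: term | sum | pterm | psum, with whitespace skipped."""
    s = "".join(text.split())
    return any(p(s, 0) == len(s) for p in (term, sum_, pterm, psum))


print(matches_expr("((a + b))"), matches_expr("(a+b"))
```

Because enclose takes the rule as an argument, pterm and psum are each a one-line definition, which is the reuse the question is after.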
I'm writing a grammar in YACC (actually Bison), and I'm having a shift/reduce problem. It results from including the postfix increment and decrement operators. Here is a trimmed down version of the grammar:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%right PREINC
%left POSTINC
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr %prec PREINC
| DEC expr %prec PREINC
| expr INC %prec POSTINC
| expr DEC %prec POSTINC
| '(' expr ')'
;
%%
Bison tells me there are 12 shift/reduce conflicts, but if I comment out the lines for the postfix increment and decrement, it works fine. Does anyone know how to fix this conflict? At this point, I'm considering moving to an LL(k) parser generator, which makes it much easier, but LALR grammars have always seemed much more natural to write. I'm also considering GLR, but I don't know of any good C/C++ GLR parser generators.
Bison (though not classic Yacc) can generate a GLR parser if you specify %glr-parser in the declarations section.
Try this:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
/* INC and DEC are the tokens the lexer returns for ++ and -- */
%nonassoc INC DEC
%left '('
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr
| DEC expr
| expr INC
| expr DEC
| '(' expr ')'
;
%%
The key is to declare the postfix operators as non-associative. Otherwise you would be able to write:
++var++--
The parentheses also need to be given a precedence to minimize shift/reduce warnings.
I like to define more nonterminals, one per precedence level. Then you shouldn't need the %left, %right, %prec stuff.
simple_expr: NUMBER
| INC simple_expr
| DEC simple_expr
| '(' expr ')'
;
term: simple_expr
| term '*' simple_expr
| term '/' simple_expr
;
expr: term
| expr '+' term
| expr '-' term
;
Play around with this approach.
The basic problem is that you don't have a precedence for the INC and DEC tokens, so Bison doesn't know how to resolve ambiguities involving a lookahead of INC or DEC. If you add
%right INC DEC
at the end of the precedence list (you want unaries to be higher precedence and postfix higher than prefix), it will fix it, and you can even get rid of all the PREINC/POSTINC stuff, as it's irrelevant.
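The rule Bison applies here can be sketched in a few lines of Python. This is a simplified model of the documented precedence comparison, not Bison's implementation: the tables BEFORE and AFTER and the RULES mapping are written by hand from the question's grammar, and the precedence levels are just the order of the declarations.

```python
def resolve(rule_prec, token_prec, token_assoc):
    """Decide a shift/reduce conflict the way precedence declarations do."""
    if rule_prec is None or token_prec is None:
        return "conflict"  # nothing to compare: reported as a conflict
    if rule_prec > token_prec:
        return "reduce"
    if rule_prec < token_prec:
        return "shift"
    # Same level: associativity decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token_assoc]


# Precedence declarations from the question, lowest level first.
BEFORE = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"),
    "PREINC": (3, "right"),
    "POSTINC": (4, "left"),
}
# The same list with `%right INC DEC` appended at the end.
AFTER = dict(BEFORE, INC=(5, "right"), DEC=(5, "right"))

# Rules that can be complete while INC or DEC is the lookahead, mapped to
# the symbol that supplies their precedence (last terminal or %prec).
RULES = {
    "expr '+' expr": "+", "expr '-' expr": "-",
    "expr '*' expr": "*", "expr '/' expr": "/",
    "INC expr": "PREINC", "DEC expr": "PREINC",
}


def unresolved(table):
    found = []
    for rule, prec_symbol in RULES.items():
        for lookahead in ("INC", "DEC"):
            rule_prec = table[prec_symbol][0]
            token_prec, token_assoc = table.get(lookahead, (None, None))
            if resolve(rule_prec, token_prec, token_assoc) == "conflict":
                found.append((rule, lookahead))
    return found


print(len(unresolved(BEFORE)), len(unresolved(AFTER)))
```

In this model, six rules times two lookahead tokens gives 12 unresolved cases for the original declarations, consistent with the count reported in the question, and none once INC and DEC have a precedence of their own. Every case then resolves to a shift, which is what makes postfix bind tighter than prefix.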
Preincrement and postincrement operators are non-associative, so declare them with %nonassoc in the precedence section, and in the rules give these operators a high precedence by using %prec.