Good ways to test antlr performance - parsing

Is there a good way to compare the grammar between two different files or rules to see which one performs better? As an example, let's say I'm 'starting' with the following grammar that I want to optimize:
grammar Calc;
program
: equations
;
equations
: equation* EOF
;
equation
: expression relop expression
;
expression
: LPAREN expression RPAREN
| expression (POWER) expression
| expression (TIMES | DIV) expression
| expression (PLUS | MINUS) expression
| (PLUS | MINUS)* atom
;
atom
: number
| variable
;
variable // so the entire variable gets consumed as one token
: VARIABLE
;
number
: NUMBER
;
relop
: EQ
| GTE
| LTE
| GT
| LT
;
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
POWER: '^';
EQ: '=';
GTE: '>=';
GT: '>';
LTE: '<=';
LT: '<';
LPAREN: '(';
RPAREN: ')';
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
VARIABLE: [a-zA-Z_] [a-zA-Z_0-9]*;
WS: [ \r\n\t] -> skip;
And then, perhaps I'm curious whether it performs better if I 'inline' some of the rules:
grammar Calc2;
program: equations;
equations: equation* EOF;
equation: expression ('=' | '>' | '>=' | '<' | '<=' ) expression
;
expression
: '(' expression ')'
| expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| ('+' | '-')* ATOM
;
ATOM
: ([a-zA-Z_] [a-zA-Z_0-9]* // variable
| [0-9]+ ('.' [0-9]*)? | '.' [0-9]+ ([Ee] [+-]? [0-9]+)? // decimal
);
WS: [ \r\n\t] -> skip;
I was thinking perhaps I could generate an output of about a million test expressions or something and then run both of the grammars against it to see the performance difference. Is there a tool to do this or basically to evaluate performance of one set of rules (or file) against another?

Just doing the above made an absolutely extraordinary difference in the parsing time and obviously memory consumption. Here is what I did:
First, I generated a file with 1M equations with the following:
# range(0, 1000000-1) writes 999,999 equations, one per line
with open('input2.txt', 'w') as x:
    for i in range(0, 1000000-1):
        _ = x.write('x=(2+4)*%s-(x*72);\n' % i)
Next I timed the two runs with the following:
$ # this is the full file
$ antlr4 Calc.g4 && javac Calc*.java
$ time java org.antlr.v4.gui.TestRig Calc program input2.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.antlr.v4.gui.TestRig.process(TestRig.java:207)
at org.antlr.v4.gui.TestRig.process(TestRig.java:166)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
real 0m43.721s
user 2m11.130s
sys 0m2.481s
After about 40 seconds it runs out of memory. Here is the in-lined version:
# this is the in-lined file
$ antlr4 Calc2.g4 && javac Calc2*.java
$ time java org.antlr.v4.gui.TestRig Calc2 program input2.txt
real 0m7.149s
user 0m12.589s
sys 0m1.240s
So the first one where I write the items cleanly takes 43 seconds until it runs out of memory! The second version takes 7 seconds and finishes!
Though it's possible that in the first one this is caused by the interaction between these lexer rules:
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
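The manual `time` comparison above could be scripted so that both grammars are run several times against the same input. The following is only a sketch: the helper names `time_command` and `compare` are made up for illustration, and the placeholder commands would need to be replaced with the actual TestRig invocations. It measures wall-clock time of the whole process, so JVM start-up is included in every sample.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def compare(commands, runs=3):
    """Return {label: median seconds} for each labelled command."""
    return {label: statistics.median(time_command(cmd, runs))
            for label, cmd in commands.items()}


if __name__ == "__main__":
    # Placeholder commands so the sketch runs anywhere. Swap in e.g.
    # ["java", "org.antlr.v4.gui.TestRig", "Calc", "program", "input2.txt"]
    commands = {
        "Calc": [sys.executable, "-c", "pass"],
        "Calc2": [sys.executable, "-c", "pass"],
    }
    for label, seconds in compare(commands).items():
        print(f"{label}: {seconds:.3f}s")
```

Taking the median of a few runs makes the comparison less sensitive to a single slow run than one `time` measurement per grammar.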

Related

Is indirection always a size/performance hit in antlr grammars?

In testing I've done over the past few weeks, it seems the more 'compact' a grammar is, the faster it runs and the smaller the program size -- and anything that reduces downstream rule/function calls (while keeping the grammar valid) is a good thing to do.
Here is the most basic example I could come up with demonstrating this:
grammar NoIndirection;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| '-' expr
| '+' expr
| Atom
;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
grammar YesIndirection1;
root: (expr ';')* EOF;
expr
: parenExpr
| uExpr
| atomExpr
;
parenExpr: '(' expr ')';
uExpr: ('+'|'-') expr;
atomExpr: Atom;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
grammar YesIndirection2;
root: (expr ';')* EOF;
expr
: parenExpr
| uExpr
| atomExpr
;
parenExpr: '(' expr ')';
uExpr: uExprP | uExprM;
uExprP: '+' expr;
uExprM: '-' expr;
atomExpr: Atom;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
The timings and output program size on a ~1MB file are as follows:
NoIndirection: 0m0.476s / 72K
YesIndirection1: 0m0.578s / 88K
YesIndirection2: 0m0.636s / 104K (~1.4x on both performance and size over the first)
My question(s) related to this are as follows:
Does the above seem valid in your experience -- that is, the fewer rules/levels of indirection there are, the faster the parser is?
Why is this the case? Why should function calls be so expensive?
Finally, if indirection is (always) a performance hit, would it be a good idea to write the rules for maximum readability and then preprocess the file so that as much as possible is in-lined?
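One way to make the "extra rule calls" visible is to count them in a hand-written recursive-descent recognizer. The sketch below is not what ANTLR generates: the class names are invented, `Atom` is simplified to a run of alphanumeric characters, and nothing here models ANTLR's prediction step. It only shows how many rule functions are entered for the same input under NoIndirection and YesIndirection2. In a generated ANTLR parser each rule invocation also creates a context object, so the real per-call cost is higher than a bare function call.

```python
class Parser:
    """Shared helpers; `calls` counts how many rule functions were entered."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.calls = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def eat(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def atom(self):
        # Simplified Atom token: one or more alphanumeric characters.
        if not self.peek().isalnum():
            raise SyntaxError(f"expected Atom at position {self.pos}")
        while self.peek().isalnum():
            self.pos += 1


class Direct(Parser):
    """Mirrors grammar NoIndirection: every alternative lives in expr."""

    def expr(self):
        self.calls += 1
        if self.peek() == "(":
            self.eat("(")
            self.expr()
            self.eat(")")
        elif self.peek() in ("+", "-"):
            self.pos += 1
            self.expr()
        else:
            self.atom()


class Indirect(Parser):
    """Mirrors grammar YesIndirection2: each alternative is its own rule."""

    def expr(self):
        self.calls += 1
        if self.peek() == "(":
            self.paren_expr()
        elif self.peek() in ("+", "-"):
            self.u_expr()
        else:
            self.atom_expr()

    def paren_expr(self):
        self.calls += 1
        self.eat("(")
        self.expr()
        self.eat(")")

    def u_expr(self):
        self.calls += 1
        if self.peek() == "+":
            self.u_expr_p()
        else:
            self.u_expr_m()

    def u_expr_p(self):
        self.calls += 1
        self.eat("+")
        self.expr()

    def u_expr_m(self):
        self.calls += 1
        self.eat("-")
        self.expr()

    def atom_expr(self):
        self.calls += 1
        self.atom()


def count_calls(parser_cls, text):
    """Recognize `text` as one expr and return the number of rule calls."""
    parser = parser_cls(text)
    parser.expr()
    if parser.pos != len(text):
        raise SyntaxError("trailing input")
    return parser.calls


print(count_calls(Direct, "-(a)"), count_calls(Indirect, "-(a)"))  # prints: 3 7
```

For the input `-(a)` the direct version enters 3 rule functions and the indirect version 7, which matches the direction of the timings above even though it says nothing about their magnitude.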

Antlr grun error - no viable alternative input at

I'm trying to write a grammar for a Prolog interpreter. When I run grun from the command line on input like "father(john,mary).", I get a message saying "no viable alternative at input 'father(john,'" and I don't know why. I've tried rearranging rules in my grammar, using different entry points, etc., but I still get the same error. I'm not even sure if it's caused by my grammar or by something else, like antlr itself. Can someone point out what is wrong with my grammar, or suggest what the cause could be if not the grammar?
The commands I ran are:
antlr4 -no-listener -visitor Expr.g4
javac *.java
grun antlr.Expr start tests/test.txt -gui
And this is the resulting parse tree:
Here is my grammar:
grammar Expr;
@header {
package antlr;
}
//start rule
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound
| compound ':-' conjunction
;
conjunction : compound
| compound ',' conjunction
;
compound : Atom '(' elements ')'
| '.(' elements ')'
;
list : '[]'
| '[' element ']'
| '[' elements ']'
;
element : Term
| list
| compound
;
elements : element
| element ',' elements
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z]([a-z]|[A-Z]|[0-9]|'_')*
| '0'
;
Var : [A-Z]([a-z]|[A-Z]|[0-9]|'_')*
;
Term : Atom
| Var
;
The lexer will always produce the same tokens for a given input. It does not "listen" to what the parser is trying to match. The rules the lexer applies are quite simple:
try to match as many characters as possible
when 2 or more lexer rules match the same amount of characters, let the rule defined first "win"
Because of the 2nd rule, the rule Term will never be matched. And moving the Term rule above Var and Atom will cause the latter rules to be never matched. The solution: "promote" the Term rule to a parser rule:
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound (':-' conjunction)?
;
conjunction : compound (',' conjunction)?
;
compound : Atom '(' elements ')'
| '.' '(' elements ')'
;
list : '[' elements? ']'
;
element : term
| list
| compound
;
elements : element (',' element)*
;
term : Atom
| Var
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z] [a-zA-Z0-9_]*
| '0'
;
Var : [A-Z] [a-zA-Z0-9_]*
;
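The two lexer rules described above (longest match, then first-defined rule wins a tie) can be simulated with regular expressions. This is only an illustration, not ANTLR's lexer: the function `next_token` and the rule tables are made up for the example, and the patterns are transcriptions of the question's Atom, Var and Term rules.

```python
import re

# Lexer rules in definition order, as in the question's grammar (WS omitted).
QUESTION_RULES = [
    ("Atom", r"[a-z][a-zA-Z0-9_]*|0"),
    ("Var", r"[A-Z][a-zA-Z0-9_]*"),
    ("Term", r"[a-z][a-zA-Z0-9_]*|0|[A-Z][a-zA-Z0-9_]*"),
]

# Same rules with Term moved to the top.
TERM_FIRST_RULES = [QUESTION_RULES[2], QUESTION_RULES[0], QUESTION_RULES[1]]


def next_token(rules, text, pos=0):
    """Return (rule_name, lexeme): longest match, earliest rule wins ties."""
    best_name, best_text = None, ""
    for name, pattern in rules:
        match = re.compile(pattern).match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on equal length the rule defined first is kept.
        if match and len(match.group()) > len(best_text):
            best_name, best_text = name, match.group()
    return best_name, best_text


print(next_token(QUESTION_RULES, "john"))    # ('Atom', 'john')
print(next_token(QUESTION_RULES, "Mary"))    # ('Var', 'Mary')
print(next_token(TERM_FIRST_RULES, "john"))  # ('Term', 'john')
```

With the original order no input ever comes out as Term, and with Term first nothing comes out as Atom or Var, which is why the choice has to be made in a parser rule instead.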

Greedy subrules in ANTLR4

I'm working on a parser grammar that should allow trailing expressions without enclosing symbols. The following is a simplified version that evidences the issue:
grammar Example;
root: expression EOF;
expression: binaryExpression;
binaryExpression
: binaryExpression 'and' binaryExpression
| binaryExpression 'or' binaryExpression
| quantifier
| '(' expression ')'
| OPERAND
;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
If you try to parse the following expression, you'll notice that, although the parser correctly recognizes the input, it reports an ambiguity:
true or false and no x in y satisfies true or false
The error reporting works as expected (more about this later):
line 1:1 token recognition error at: '1'
line 1:2 mismatched input '<EOF>' expecting {'(', 'no', OPERAND}
I'm looking for some way to explicitly tell the parser that the quantifier should be greedy: everything on the right-hand side should be consumed unambiguously until the end of the expression.
I tried to refactor the rules to allow the quantifier only on the RHS of binary expressions. Although it worked, the error recovery mechanism becomes unable to recognize most expressions:
grammar Example;
root: expression EOF;
expression: quantifier | booleanExpression;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
booleanExpression
: orExpression ('or' (quantifier | andQuantifier))?
| andQuantifier
;
andQuantifier: andExpression 'and' quantifier;
orExpression
: orExpression 'or' orExpression
| andExpression
;
andExpression
: andExpression 'and' andExpression
| '(' expression ')'
| OPERAND
;
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
As you can see, the problem is gone:
But it came at the cost of a more complex grammar that is unable to properly report wrong inputs like (1:
line 1:1 token recognition error at: '1'
line 1:2 no viable alternative at input '('
Does anyone else have any other idea on how to fix it?
This is the way I'd do it, using Antlr4's built-in algorithm for resolving ambiguity with precedence (since the grammar is certainly ambiguous). In order to get the precedence algorithm to work, it's useful to think of a quantification as a unary operator with low precedence, which is why quantifier below is just the "operator" and not the full expression. Presumably in a real grammar you would have other quantifiers, and very likely unary operators with higher precedence like not.
grammar Example;
root: expression EOF;
expression
: expression 'and' expression
| expression 'or' expression
| quantifier expression
| operand
| '(' expression ')'
;
quantifier
: 'no' ID 'in' ID 'satisfies'
;
operand: BOOLEAN | ID;
BOOLEAN: 'true' | 'false';
ID: [a-zA-Z]+;
WHITE_SPACE: (' ' | '\r' | '\n' | '\t')+ -> channel(HIDDEN);
This isn't quite the same as the example in your post because you modified a few minor details from the first version of the question. But I think it's indicative.
For obvious reasons I couldn't try it with (1 (I suppose that input corresponds to yet a different version where integers are OPERANDs), but with (true it gave me what looks like the error report you are seeking. I'm not really an ANTLR4 expert so I don't know how to predict the details of error recovery.
OK, after a lot of back and forth here, I think I finally get that what you're looking for is associativity. Try:
grammar Example;
root: expression EOF;
expression
: '(' expression ')' # parenExpr
| <assoc=right>expression (AND | OR) quantifier # quantifierExpr
| expression AND expression # andExpr
| expression OR expression # orExpr
| OPERAND # operandExpr
;
quantifier
: 'no' ID 'in' ID 'satisfies' expression
;
AND: 'and';
OR: 'or';
OPERAND: 'true' | 'false';
ID: [a-z]+;
WS: (' ' | '\r' | '\t')+ -> channel(HIDDEN);
(I took the liberty of adding labels to your alternatives and simplifying the expression rule.) The labels will come in very handy in your code, as you need to deal with each alternative individually. Labels give you separate methods to override in your listeners/visitors (along with Context classes specific to each alternative).
true and false or false and no x in y satisfies true or false
true and false or false or no x in y satisfies true or false
true and false or false and no x in y satisfies true or false

Antlr4: Another "No Viable Alternative Error"

I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar, but for some reason the generated parser(?) is unable to walk from parser rule "equal" to parser rule "expr" in order to reach lexer rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, something like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e., from rule "equal" to "expr").
equal                equal
  |                    |
 add                  add
  |                    |
 mul                  mul
  |                    |
exponent             exponent
  |                    |
expr                 expr
  |                    |
 NUM                  NAME
  |                    |
"78" # No Error      "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr5.6 by the way and I will appreciate if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
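The effect of precedence order and of <assoc = right> on the power alternative can be illustrated outside ANTLR with a small precedence-climbing parser. This is a hand-written sketch, not the algorithm ANTLR generates: the names `OPS`, `tokenize` and `parse` are invented for the example, it covers only the expr part of the grammar (no assignment), and it does no error handling.

```python
import re

# Binary operators: (precedence, right_associative). Higher binds tighter.
OPS = {
    "+": (1, False), "-": (1, False),
    "*": (2, False), "/": (2, False),
    "^": (3, True),
}


def tokenize(text):
    """Split text into NAME, NUM, operator and parenthesis tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/^()]", text)


def parse(tokens, min_prec=1):
    """Consume tokens from the front and return a nested (op, left, right) tree."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # drop the closing ")"
    else:
        left = tok
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, right_assoc = OPS[op]
        # For a right-associative operator the right operand may itself
        # contain the same operator, so the minimum precedence is not raised.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = (op, left, right)
    return left


print(parse(tokenize("a ^ b ^ c")))  # ('^', 'a', ('^', 'b', 'c'))
print(parse(tokenize("a - b - c")))  # ('-', ('-', 'a', 'b'), 'c')
print(parse(tokenize("7 * 2 ^ 5")))  # ('*', '7', ('^', '2', '5'))
```

The first two lines of output show the difference between right and left associativity, and the third shows the power operator binding tighter than multiplication, as in the expr rule above.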
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is caught by NAME, rule expr is satisfied and terminates;
running with equal, it's all in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token 78. It backs out of each rule, checking whether another alternative of that rule applies, but each of them requires an operator. It thus arrives back in equal and starts again with the token variable, this time using the alt | NAME '='. NAME consumes the token, then the rule requires '=', but the next token is 78, which does not satisfy it. As there is no other choice, it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, the same reasoning applies: first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The other alternative is not an option, because it starts with NAME. The rule is satisfied and no error is reported, but note that nothing has checked for EOF.
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr, I'd like to point out that with Antlr4 you can use direct left recursion in your rule specification, which makes your grammar more compact and improves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
NAME '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar happily takes a NAME as part of an expression, so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar, then I'd first check whether the lexer has matched the input into the expected tokens. Afterwards I would look at the parse tree to see what the parser makes of the given token sequence, and then try to do the parsing manually according to your grammar. While doing that you should be able to find the point at which the parser does something different from what you'd expect.

Antlr4 grammar left recursive error

I am having quite a problem with antlr4 right now.
Whenever I try to feed antlr with this RPN grammar
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
plus : expression expression '+';
minus : expression expression '-';
mult : expression expression '*';
div : expression expression '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
antlr will throw an error because plus, minus, mult and div are mutually left recursive.
I don't know how to fix that.
(I know this occurs because with this grammar "expression" could loop infinitely. I have had this problem before with another grammar, but I could fix that on my own.)
My only solution would be to restrict the grammar in the following way
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
exp2 : plus2 | minus2 | mult2 | div2 | NUMBER;
plus : exp2 exp2'+';
minus : exp2 exp2'-';
mult: exp2 exp2'*';
div: exp2 exp2'/';
plus2 : NUMBER NUMBER '+';
minus2 : NUMBER NUMBER '-';
mult2: NUMBER NUMBER '*';
div2: NUMBER NUMBER '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
but this is not really what I want, because now the most I could handle is expressions like
2 3 + 5 4 - *
and the grammar would be more complex than it actually could be.
Hope you guys can help me
ANTLR4 only supports "direct" left recursive rules, not "indirect", as you have them.
Try something like this:
grammar RPN;
parse : expression EOF;
expression
: expression expression '+'
| expression expression '-'
| expression expression '*'
| expression expression '/'
| NUMBER
;
NUMBER : '-'? ('0'..'9')+;
SPACES : [ \t\r\n] -> skip;
Btw, 23+54-* is not a valid RPN expression: it must start with two numbers.
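For comparison, the usual way to evaluate RPN without a grammar is a stack. The sketch below is independent of the ANTLR grammar above: `eval_rpn` is a name chosen for the example, tokens are assumed to be separated by whitespace, and numbers are integers as in the NUMBER rule.

```python
def eval_rpn(text):
    """Evaluate a whitespace-separated RPN expression over integers."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for tok in text.split():
        if tok in ops:
            # Every operator consumes the two most recent operands.
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append(ops[tok](left, right))
        else:
            # Corresponds to NUMBER : '-'? ('0'..'9')+
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single value")
    return stack[0]


print(eval_rpn("2 3 + 5 4 - *"))  # 5
```

The expression from the question, `2 3 + 5 4 - *`, reduces to (2 + 3) * (5 - 4) = 5, while an input whose first operator arrives before two operands are on the stack is rejected.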

Resources