Antlr4 grammar left recursive error - parsing

I am having quite a problem with antlr4 right now.
Whenever I try to feed antlr with this RPN grammar
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
plus : expression expression '+';
minus : expression expression '-';
mult : expression expression '*';
div : expression expression '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
antlr will throw an error because plus, minus, mult and div are mutually left recursive.
I don't know how to fix that.
(I know this occurs because with this grammar "expression" could loop infinitely. I have had this problem before with another grammar, but I could fix that on my own.)
My only solution would be to restrict the grammar in the following way
grammar UPN;
//Parser
expression : plus | minus | mult | div | NUMBER;
exp2 : plus2 | minus2 | mult2 | div2 | NUMBER;
plus : exp2 exp2'+';
minus : exp2 exp2'-';
mult: exp2 exp2'*';
div: exp2 exp2'/';
plus2 : NUMBER NUMBER '+';
minus2 : NUMBER NUMBER '-';
mult2: NUMBER NUMBER '*';
div2: NUMBER NUMBER '/';
//Lexer
NUMBER : '-'? ('0'..'9')+;
but this is not really what I want, because now the most I could handle are expressions like
2 3 + 5 4 - *
and the grammar would be more complex than it actually could be.
Hope you guys can help me

ANTLR4 only supports "direct" left recursive rules, not "indirect", as you have them.
Try something like this:
grammar RPN;
parse : expression EOF;
expression
: expression expression '+'
| expression expression '-'
| expression expression '*'
| expression expression '/'
| NUMBER
;
NUMBER : '-'? ('0'..'9')+;
SPACES : [ \t\r\n] -> skip;
Btw, 23+54-* is not a valid RPN expression: it must start with two numbers.
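To see which inputs the single-rule grammar above accepts, here is a minimal stack-based sketch in Python. It is not generated by ANTLR; the names eval_rpn, TOKEN and OPS are made up for illustration, and the token pattern is only an approximation of the NUMBER rule plus the four operator literals.

```python
import operator
import re

# Approximates the lexer: an optionally signed integer, or one operator.
# Note: findall silently ignores characters that match neither alternative.
TOKEN = re.compile(r"-?\d+|[+\-*/]")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(text):
    """Evaluate an RPN string; raise ValueError if it is not a valid expression."""
    stack = []
    for tok in TOKEN.findall(text):
        if tok in OPS:
            # Mirrors `expression expression op`: two operands must already exist.
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(int(tok))
    if len(stack) != 1:
        raise ValueError("input does not reduce to a single expression")
    return stack[0]
```

With spaces, 2 3 + 5 4 - * reduces to a single value, while 23+54-* fails at the first operator because only one number (23) precedes it.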

Related

Good ways to test antlr performance

Is there a good way to compare the grammar between two different files or rules to see which one performs better? As an example, let's say I'm 'starting' with the following grammar that I want to optimize:
grammar Calc;
program
: equations
;
equations
: equation* EOF
;
equation
: expression relop expression
;
expression
: LPAREN expression RPAREN
| expression (POWER) expression
| expression (TIMES | DIV) expression
| expression (PLUS | MINUS) expression
| (PLUS | MINUS)* atom
;
atom
: number
| variable
;
variable // so the entire variable gets consumed as one token
: VARIABLE
;
number
: NUMBER
;
relop
: EQ
| GTE
| LTE
| GT
| LT
;
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
POWER: '^';
EQ: '=';
GTE: '>=';
GT: '>';
LTE: '<=';
LT: '<';
LPAREN: '(';
RPAREN: ')';
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
VARIABLE: [a-zA-Z_] [a-zA-Z_0-9]*;
WS: [ \r\n\t] -> skip;
And then, perhaps I'm curious whether it performs better if I 'inline' some of the rules:
grammar Calc2;
program: equations;
equations: equation* EOF;
equation: expression ('=' | '>' | '>=' | '<' | '<=' ) expression
;
expression
: '(' expression ')'
| expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| ('+' | '-')* ATOM
;
ATOM
: ([a-zA-Z_] [a-zA-Z_0-9]* // variable
| ([0-9]+ ('.' [0-9]*)? | '.' [0-9]+) ([Ee] [+-]? [0-9]+)? // decimal, exponent applies to both forms
);
WS: [ \r\n\t] -> skip;
I was thinking perhaps I could generate an output of about a million test expressions or something and then run both of the grammars against it to see the performance difference. Is there a tool to do this or basically to evaluate performance of one set of rules (or file) against another?
Just doing the above made an absolutely extraordinary difference in the parsing time and obviously memory consumption. Here is what I did:
First, I generated a file with 1M equations with the following:
x=open('input2.txt','w')
for i in range(0,1000000-1):
    _ = x.write('x=(2+4)*%s-(x*72);\n' % i)
Next I timed the two runs with the following:
$ # this is the full file
$ antlr4 Calc.g4 && javac Calc*.java
$ time java org.antlr.v4.gui.TestRig Calc program input2.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.antlr.v4.gui.TestRig.process(TestRig.java:207)
at org.antlr.v4.gui.TestRig.process(TestRig.java:166)
at org.antlr.v4.gui.TestRig.main(TestRig.java:119)
real 0m43.721s
user 2m11.130s
sys 0m2.481s
After about 40 seconds it runs out of memory. Here is the in-lined version:
# this is the in-lined file
$ antlr4 Calc2.g4 && javac Calc2*.java
$ time java org.antlr.v4.gui.TestRig Calc2 program input2.txt
real 0m7.149s
user 0m12.589s
sys 0m1.240s
So the first one where I write the items cleanly takes 43 seconds until it runs out of memory! The second version takes 7 seconds and finishes!
Though it's possible that in the first grammar the slowdown is caused by the interaction between these lexer rules:
NUMBER: DECIMAL ([Ee] [+-]? UNSIGNED_INTEGER)?;
fragment DECIMAL: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+;
UNSIGNED_INTEGER: [0-9]+;
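For repeatable comparisons, the manual `time` runs can be wrapped in a small script. The following is only a sketch: time_command and compare are hypothetical helper names, it measures wall-clock time only (not memory), and the TestRig command lines in the comment are copied from the runs above and assume the same classpath setup.

```python
import subprocess
import time

def time_command(argv, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of argv."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded so terminal I/O does not distort the timing.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        best = min(best, time.perf_counter() - start)
    return best

def compare(commands, repeats=3):
    """Time each labelled command and return {label: seconds}."""
    return {label: time_command(argv, repeats)
            for label, argv in commands.items()}

# Example call (assumes both grammars are already generated and compiled):
# compare({
#     "Calc":  ["java", "org.antlr.v4.gui.TestRig", "Calc", "program", "input2.txt"],
#     "Calc2": ["java", "org.antlr.v4.gui.TestRig", "Calc2", "program", "input2.txt"],
# })
```

Taking the best of several runs reduces noise from JVM startup and disk caching, but a run that dies with OutOfMemoryError still just shows up as a time, so check the exit status separately if that matters.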

Does antlr automatically discard whitespace?

I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still give me the same thing. Also I've noticed that adding in the following rule does nothing for me:
WS
: [ \r\n\t] + -> skip
;
Which makes me wonder whether skipping whitespace is the default behavior of antlr4?
ANTLR4-based parsers can skip over single unwanted or missing tokens and continue parsing if possible (which is the case here). There is no default that ignores whitespace, though: you always have to specify a whitespace rule that either skips it or puts it on a hidden channel.
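A rough way to picture this is a toy tokenizer in Python. It is not ANTLR's implementation, only an analogy under the assumption that characters no rule matches are reported as errors and dropped; the rule table, the tokenize function and the error text are all made up for illustration. Unlike the WS rule in the question, the whitespace pattern here leaves out '\n' so that NEWLINE still produces a token.

```python
import re

# Token rules loosely modelled on the Calc lexer from the question.
RULES = [
    ("OPERAND", r"\d+(?:\.\d*)?|\.\d+"),
    ("MULDIV", r"[*/]"),
    ("ADDSUB", r"[-+]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("NEWLINE", r"\n"),
]
WS_RULE = ("WS", r"[ \r\t]+")  # the explicit "skip" rule

def tokenize(text, with_ws_rule):
    """Return (tokens, errors); whitespace is silent only if the WS rule exists."""
    rules = RULES + ([WS_RULE] if with_ws_rule else [])
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in rules))
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None:
            # No rule matches: report the character and move on.
            errors.append(f"token recognition error at: {text[pos]!r}")
            pos += 1
            continue
        if m.lastgroup != "WS":  # WS matches are skipped, not emitted
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors
```

Both variants produce the same token list for 1 +2, which is why the parse result looks identical; the difference is that without the whitespace rule each blank shows up as an error.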

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, and so on. How should I modify my grammar to implement operator precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
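Here ()* means zero or more times. This grammar is deterministic and unambiguous. Each repetition yields a flat sequence of operands, so associativity depends on how you fold that sequence: folding left to right gives the usual left-associative result for "-" and "/". Multiplication and division have the same priority, as do addition and subtraction.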
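As a sketch of how this grammar maps onto a recursive descent parser, here is a small Python version with one method per rule. The class and method names are my own, and it evaluates directly instead of building a tree. The loops in add and mul fold operands left to right, which is a choice made in this sketch.

```python
import re

TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")

def tokenize(text):
    """Split text into number and operator tokens; reject anything else."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        # exp ::= add, followed by end of input
        value = self.add()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return value

    def add(self):
        # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.mul()
            else:
                value -= self.mul()
        return value

    def mul(self):
        # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.term()
            else:
                value /= self.term()
        return value

    def term(self):
        # term ::= number | "(" exp ")"
        tok = self.next()
        if tok == "(":
            value = self.add()
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return float(tok)
```

Parser("1 + 2 * 3 + 4").parse() gives 11.0, because mul consumes 2 * 3 before add ever sees the second "+".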

ANTLR Making Negative Test Cases

I'm new to ANTLR and am trying to understand how to do some things with it. I need it to throw an error when a statement is missing things, like a semicolon or an end bracket. It's been called negative test cases by the problem set that I'm working through.
For example, the below code returns true, which is correct.
val program = """
1 + 2;
"""
recognize(program)
However, this code also returns true, despite it missing the semicolon at the end. It should return false ([PARSER error at line=1]: missing ';' at '').
val program = """
1 + 2
""".trimIndent()
recognize(program)
The grammar is as follows:
program: (expression ';')* | EOF;
expression: INT PLUS INT | OPENBRAC INT PLUS INT CLOSEBRAC | QUOTE IDENT QUOTE PLUS QUOTE IDENT QUOTE;
IDENT: [A-Za-z0-9]+;
INT: [-][0-9]+ | ('0'..'9')+;
PLUS: '+';
OPENBRAC: '(';
CLOSEBRAC: ')';
QUOTE: '"';
program: (expression ';')* | EOF;
This means a program can either be zero or more instances of expression ';' followed by whatever else is in the input stream or it can be empty. Since (expression ';')* can already match the empty input by itself, the | EOF is just redundant.
What you want is program: (expression ';')* EOF, which means that a program consists of zero or more instances of expression ';', followed by the end of input, meaning there must be nothing left in the input afterwards.
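The same distinction exists in regular expressions, which makes for a compact analogy. This Python sketch is not ANTLR code: it covers only the INT PLUS INT alternative, assumes whitespace is ignored, and the function names are invented for the example.

```python
import re

# One statement of the simplified form INT '+' INT ';' (a subset of the grammar).
STATEMENT = r"-?\d+\s*\+\s*-?\d+\s*;\s*"
PROGRAM = re.compile(rf"\s*(?:{STATEMENT})*")

def recognize_prefix(text):
    """Like `program: (expression ';')*`: any prefix may match, even an empty one."""
    return PROGRAM.match(text) is not None

def recognize_all(text):
    """Like `program: (expression ';')* EOF`: the whole input must be consumed."""
    return PROGRAM.fullmatch(text) is not None
```

For the input 1 + 2 without a semicolon, recognize_prefix succeeds by matching zero statements and ignoring the rest, while recognize_all fails because input is left over. That is the effect the trailing EOF has in the grammar.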

Grammar of calculator in a finite field

I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 different cases:
1. When there is some expression further, like so: -(3+3)
2. When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously the compiler reports a reduce/reduce conflict, as E can also be a NUM in some cases. Although it works, I want to get rid of the compiler warning, and I have run out of ideas.
It has to be evaluated and dealt with in 2 different cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just on one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like [0-9]+. In other words, it does not parse a minus sign; it is always a non-negative integer. Negative integers are calculated by matching a '-' Number sequence of tokens.
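To check that a single unary-minus rule covers both -3 and -(3+3), here is a small Python sketch that evaluates modulo p. It is not bison code and does not produce the postfix string; evaluate and its parameter p are hypothetical names, and division is left out because it would need a modular inverse.

```python
import re

def evaluate(text, p):
    """Evaluate text over the integers modulo p with one unary-minus rule."""
    # findall silently drops characters that are not digits or operators.
    tokens = re.findall(r"\d+|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():
        # expression : term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value = (value + term()) % p
            else:
                value = (value - term()) % p
        return value

    def term():
        # term : primary ('*' primary)*
        value = primary()
        while peek() == "*":
            advance()
            value = (value * primary()) % p
        return value

    def primary():
        # primary : NUM | '-' primary | '+' primary | '(' expression ')'
        tok = advance()
        if tok == "-":
            return (-primary()) % p
        if tok == "+":
            return primary()
        if tok == "(":
            value = expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return int(tok) % p

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

With p = 10, evaluate("-3", 10) gives 7 and evaluate("-(3+3)", 10) gives 4, both through the same '-' primary branch, so no separate '-' NUM rule is needed.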
