Example of removing left-recursion on a simple program - parsing

I have the following grammar which intentionally has left-recursion:
grammar DBParser;
statement: expr EOF;
expr: expr ('+' | '-') expr | 'x';
Is there a way to transform this using the method described here as:
A: Aa | b;
// becomes
A: bR;
R: (aR)?;
Does the initial A require it to be on the left-hand-side of the expression, making the above 'technique' unable to do a replacement? And if it can be replaced using that technique, what would the process look like?
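For what it's worth, here is a rough sketch of applying the pattern mechanically, assuming a = ('+' | '-') expr and b = 'x'. That would give expr: 'x' rest; rest: (('+' | '-') expr rest)?;. The Python recognizer below is only a hand-written illustration of that rewritten grammar. The names recognize, expr and rest are mine, not anything ANTLR generates.

```python
def recognize(text):
    """Return True if text matches: statement: expr EOF (rewritten form)."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def expr():
        # expr: 'x' rest
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "x":
            pos += 1
            return rest()
        return False

    def rest():
        # rest: (('+' | '-') expr rest)?
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in "+-":
            pos += 1
            return expr() and rest()
        return True  # the empty alternative

    return expr() and pos == len(tokens)


print(recognize("x+x-x"))
print(recognize("x+"))
```

Since no function calls itself before consuming a token, the recursion always terminates, which is the point of the rewrite.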

Related

When does order of alternation matter in antlr?

In the following example, the order matters in terms of precedence:
grammar Precedence;
root: expr EOF;
expr
: expr ('+'|'-') expr
| expr ('*' | '/') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
For example, on the expression 1+1*2 the above would produce the following parse tree which would evaluate to (1+1)*2=4:
Whereas if I changed the first and second alternations in the expr I would then get the following parse tree which would evaluate to 1+(1*2)=3:
What are the 'rules' then for when the ordering within an alternation actually matters? Is this only relevant if one of the 'edges' of the alternation recursively calls expr? For example, something like ~ expr or expr + expr would matter, but something like func_call '(' expr ')' or Atom would not. Or, when is it important to order things for precedence?
If ANTLR did not have the rule to give precedence to the first alternative that could match, then either of those trees would be a valid interpretation of your input (which means the grammar is technically ambiguous).
However, when two alternatives could be used to match your input, ANTLR uses the first one to resolve the ambiguity, which in this case establishes operator precedence. So typically you would put the multiplication/division alternative before the addition/subtraction one, since that is the traditional order of operations:
grammar Precedence;
root: expr EOF;
expr
: expr ('*' | '/') expr
| expr ('+'|'-') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
Most grammar authors will just put them in precedence order, but things like Atoms or parenthesized exprs won’t really care about the order since there’s only a single alternative that could be used.
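To make the effect of ordering concrete, here is a small Python sketch. It uses plain precedence climbing, which is not ANTLR's actual algorithm, and the levels parameter and function names are made up for illustration. Operator groups listed earlier bind tighter, mirroring the order of the alternatives in expr.

```python
import re

def evaluate(text, levels):
    """Evaluate text; operators in earlier groups of `levels` bind tighter."""
    tokens = re.findall(r"\d+|[-+*/]", text)
    # Earlier group -> larger binding power, like an earlier alternative.
    power = {op: len(levels) - i for i, group in enumerate(levels) for op in group}
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    pos = 0

    def parse(min_power):
        nonlocal pos
        left = int(tokens[pos])  # Atom
        pos += 1
        while pos < len(tokens) and power[tokens[pos]] >= min_power:
            op = tokens[pos]
            pos += 1
            right = parse(power[op] + 1)  # +1 makes the operator left-associative
            left = ops[op](left, right)
        return left

    return parse(1)


print(evaluate("1+1*2", ["+-", "*/"]))  # '+' listed first: (1+1)*2 = 4
print(evaluate("1+1*2", ["*/", "+-"]))  # '*' listed first: 1+(1*2) = 3
```

Swapping the two groups is the only change between the calls, just as swapping the two alternatives is the only change between the two parse trees in the question.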

Why do parentheses create a left-recursion?

The following grammar works fine:
grammar DBParser;
statement: expr EOF;
expr: expr '+' expr | Atom;
Atom: [a-z]+ | [0-9]+ ;
However, neither of the following do:
grammar DBParser;
statement: expr EOF;
expr: (expr '+' expr) | Atom;
Atom: [a-z]+ | [0-9]+ ;
grammar DBParser;
statement: expr EOF;
expr: (expr '+' expr | Atom);
Atom: [a-z]+ | [0-9]+ ;
Why does antlr4 raise an error when the parentheses are added? Does that somehow change the meaning of the production being parsed?
Parentheses create a subrule, and subrules are handled internally by treating them as though they were new productions (in effect anonymous, which is why the mutual recursion error message only lists one non-terminal).
In these particular examples, the subrule is pointless; the parentheses could simply be removed without altering the grammar. But apparently Antlr doesn't attempt to decide which subrules are actually serving a purpose. (I suppose it could, but I wonder if it's a common enough usage to justify the additional code complexity. But it's certainly not up to me to decide.)

Matching parentheses in ANTLR

I'm new to Antlr and I recently ran into an issue with parentheses matching. Each node in the parse tree has the form (Node1,W1,W2,Node2), where Node1 and Node2 are two nodes and W1 and W2 are two weights between them. Given an input file containing (1,C,10,2).((2,P,2,3).(3,S,3,2))*.(2,T,2,4), the parse tree looks wrong: the operator is not the parent of those nodes and the parentheses are not matched.
The parse file I wrote is like this:
grammar Semi;
prog
: expr+
;
expr
: expr '*'
| expr ('.'|'+') expr
| tuple
| '(' expr ')'
;
tuple
: LP NODE W1 W2 NODE RP
;
LP : '(' ;
RP : ')' ;
W1 : [PCST0];
W2 : [0-9]+;
NODE: [0-9]+;
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
COMMA: ',' -> skip;
It seems like expr | '(' expr ')' doesn't work correctly. So what should I do to make this parser detect whether parentheses belong to the node or not?
Update:
There are two errors in the command:
line 1:1 no viable alternative at input '(1'
line 1:13 no viable alternative at input '(2'
So it seems like the lexer didn't detect the tuples, but why is that?
Your W2 and NODE rules are the same, so nodes you intend to be NODE are matching W2.
grun with -tokens option: (notice, no NODE tokens)
[#0,0:0='(',<'('>,1:0]
[#1,1:1='1',<W2>,1:1]
[#2,3:3='C',<W1>,1:3]
[#3,5:6='10',<W2>,1:5]
[#4,8:8='2',<W2>,1:8]
[#5,9:9=')',<')'>,1:9]
[#6,10:10='.',<'.'>,1:10]
[#7,11:11='(',<'('>,1:11]
[#8,12:12='(',<'('>,1:12]
[#9,13:13='2',<W2>,1:13]
[#10,15:15='P',<W1>,1:15]
[#11,17:17='2',<W2>,1:17]
[#12,19:19='3',<W2>,1:19]
[#13,20:20=')',<')'>,1:20]
[#14,21:21='.',<'.'>,1:21]
[#15,22:22='(',<'('>,1:22]
[#16,23:23='3',<W2>,1:23]
[#17,25:25='S',<W1>,1:25]
[#18,27:27='3',<W2>,1:27]
[#19,29:29='2',<W2>,1:29]
[#20,30:30=')',<')'>,1:30]
[#21,31:31=')',<')'>,1:31]
[#22,32:32='*',<'*'>,1:32]
[#23,33:33='.',<'.'>,1:33]
[#24,34:34='(',<'('>,1:34]
[#25,35:35='2',<W2>,1:35]
[#26,37:37='T',<W1>,1:37]
[#27,39:39='2',<W2>,1:39]
[#28,41:41='4',<W2>,1:41]
[#29,42:42=')',<')'>,1:42]
[#30,43:42='<EOF>',<EOF>,1:43]
If I replace the NODEs in your parse rule with W2s (sorry, I have no idea what this is supposed to represent), I get:
It appears that your misconception is that the recursive descent parsing starts with the parser rule and when it encounters a Lexer rule, attempts to match it.
This is not how ANTLR works. With ANTLR, your input is first run through the Lexer (aka Tokenizer) to produce a stream of tokens. This step knows absolutely nothing about your parser rules. That's why it's so often useful to use grun to dump the stream of tokens: it gives you a picture of what your parser rules are acting upon (and you can see in your example that there are no NODE tokens, because they all matched W2).
Also, a suggestion... It would appear that commas are an essential part of correct input (unless (1C102).((2P23).(3S32))*.(2T24) is considered valid input). On that assumption, I removed the -> skip and added them to your parser rule (that's why you see them in the parse tree). The resulting grammar I used was:
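As a rough illustration of why NODE never shows up, here is a toy Python tokenizer. It assumes the usual lexer rule that the longest match wins and, on a tie, the rule defined first wins. The OP and SKIP entries are stand-ins of my own for the literal tokens and the skipped whitespace/commas; this is not how ANTLR is implemented.

```python
import re

# Lexer rules in definition order (names mirror the question's grammar).
RULES = [
    ("LP", r"\("), ("RP", r"\)"),
    ("W1", r"[PCST0]"),
    ("W2", r"[0-9]+"),
    ("NODE", r"[0-9]+"),      # same pattern as W2, but defined later
    ("OP", r"[*.+]"),         # stand-in for the literals used in parser rules
    ("SKIP", r"[ \t\r\n,]+"),  # stand-in for the WS and COMMA skip rules
]

def tokenize(text):
    """Longest match wins; on a tie, the earliest rule wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name != "SKIP":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


print(tokenize("(1,C,10,2)"))
```

The tokenizer never consults any parser rule, so every run of digits comes out as W2 and NODE is unreachable, which matches the grun dump above.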
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP W2 COMMA W1 COMMA W2 COMMA W2 RP;
LP: '(';
RP: ')';
W1: [PCST0];
W2: [0-9]+;
NODE: [0-9]+;
WS: [ \t\r\n]+ -> skip; // toss out whitespace
COMMA: ',';
To take a bit more liberty with your grammar, I'd suggest that your Lexer rules should be focused on raw types, and that you use labels to make the various elements of your tuple more easily accessible in your code. Here's an example:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP nodef=INT COMMA w1=PCST0 COMMA w2=INT COMMA nodet=INT RP;
LP: '(';
RP: ')';
PCST0: [PCST0];
INT: [0-9]+;
COMMA: ',';
WS: [ \t\r\n]+ -> skip; // toss out whitespace
With this change, your TupleContext class will expose the labeled tokens nodef, w1, w2, and nodet as fields (in the Java target), each holding the token matched at that position.

YACC grammar for arithmetic expressions, with no surrounding parentheses

I want to write the rules for arithmetic expressions in YACC; where the following operations are defined:
+ - * / ()
But, I don't want the statement to have surrounding parentheses. That is, a+(b*c) should have a matching rule but (a+(b*c)) shouldn't.
How can I achieve this?
The motive:
In my grammar I define a set like this: (1,2,3,4) and I want (5) to be treated as a 1-element set. The ambiguity causes a reduce/reduce conflict.
Here's a pretty minimal arithmetic grammar. It handles the four operators you mention and assignment statements:
stmt: ID '=' expr ';'
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
It's easy to define "set" literals:
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
If we assume that a set literal can only appear as the value in an assignment statement, and not as the operand of an arithmetic operator, then we would add a syntax for "expressions or set literals":
value: expr | set
and modify the syntax for assignment statements to use that:
stmt: ID '=' value ';'
But that leads to the reduce/reduce conflict you mention because (5) could be an expr, through the expansion expr → term → factor → '(' expr ')'.
Here are three solutions to this ambiguity:
1. Explicitly remove the ambiguity
Disambiguating is tedious but not particularly difficult; we just define two kinds of subexpression at each precedence level, one which is possibly parenthesized and one which is definitely not surrounded by parentheses. We start with some short-hand for a parenthesized expression:
paren: '(' expr ')'
and then for each subexpression type X, we add a production pp_X:
pp_term: term | paren
and modify the existing production by allowing possibly parenthesized subexpressions as operands:
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
Unfortunately, we will still end up with a shift/reduce conflict, because of the way expr_list was defined. Confronted with the beginning of an assignment statement:
a = ( 5 )
having finished with the 5, so that ) is the lookahead token, the parser does not know whether the (5) is a set (in which case the next token will be a ;) or a paren (which is only valid if the next token is an operator). This is not an ambiguity -- the parse could be trivially resolved with an LR(2) parse table -- but there are not many tools which can generate LR(2) parsers. So we sidestep the issue by insisting that the expr_list has to have at least two expressions, and adding paren to the productions for set:
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
Now the parser doesn't need to choose between expr_list and expr in the assignment statement; it simply reduces (5) to paren and waits for the next token to clarify the parse.
So that ends up with:
stmt: ID '=' value ';'
value: expr | set
set: '(' ')' | paren | '(' expr_list ')'
expr_list: expr ',' expr | expr_list ',' expr
paren: '(' expr ')'
pp_expr: expr | paren
expr: term | pp_expr '-' pp_term | pp_expr '+' pp_term
pp_term: term | paren
term: factor | pp_term '*' pp_factor | pp_term '/' pp_factor
pp_factor: factor | paren
factor: ID | NUMBER | '-' pp_factor
which has no conflicts.
2. Use a GLR parser
Although it is possible to explicitly disambiguate, the resulting grammar is bloated and not really very clear, which is unfortunate.
Bison can generate GLR parsers, which would allow for a much simpler grammar. In fact, the original grammar would work almost without modification; we just need to use the Bison %dprec dynamic precedence declaration to indicate how to disambiguate:
%glr-parser
%%
stmt: ID '=' value ';'
value: expr %dprec 1
| set %dprec 2
expr: term | expr '-' term | expr '+' term
term: factor | term '*' factor | term '/' factor
factor: ID | NUMBER | '(' expr ')' | '-' factor
set: '(' ')' | '(' expr_list ')'
expr_list: expr | expr_list ',' expr
The %dprec declarations in the two productions for value tell the parser to prefer value: set if both productions are possible. (They have no effect in contexts in which only one production is possible.)
3. Fix the language
While it is possible to parse the language as specified, we might not be doing anyone any favours. There might even be complaints from people who are surprised when they change
a = ( some complicated expression ) * 2
to
a = ( some complicated expression )
and suddenly a becomes a set instead of a scalar.
It is often the case that languages for which the grammar is not obvious are also hard for humans to parse. (See, for example, C++'s "most vexing parse").
Python, which uses ( expression list ) to create tuple literals, takes a very simple approach: ( expression ) is always an expression, so a tuple needs to either be empty or contain at least one comma. To make the latter possible, Python allows a tuple literal to be written with a trailing comma; the trailing comma is optional unless the tuple contains a single element. So (5) is an expression, while (), (5,), (5,6) and (5,6,) are all tuples (the last two are semantically identical).
Python lists are written between square brackets; here, a trailing comma is again permitted, but it is never required because [5] is not ambiguous. So [], [5], [5,], [5,6] and [5,6,] are all lists.
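A quick check of those Python rules, as a minimal sketch (the variable names are arbitrary):

```python
# Parentheses alone only group; the comma is what makes a tuple.
a = (5)
b = (5,)
c = (5, 6)
d = (5, 6,)
e = [5]
f = [5,]

print(type(a).__name__)   # int
print(type(b).__name__)   # tuple
print(c == d)             # True: the trailing comma changes nothing here
print(type(()).__name__)  # tuple: the empty tuple needs no comma
print(e == f)             # True: both are one-element lists
```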

How to define logical operator with parenthesis in ANTLR grammar

I am defining a grammar in ANTLR that will express an expression which includes logical operator and parenthesis together.
Here is the grammar
grammar simpleGrammar;
/* This will be the entry point of the parser. */
parse
:
expression EOF
;
expression
:
expression binOp expression | ID | unOp (expression) | '(' expression ')'
;
binOp
:
('AND' | 'OR')
;
unOp
:
'NOT'
;
ID :
('a'..'z' | 'A'..'Z')+
;
The grammar is able to produce a parse tree for input without parentheses, but when I give it an example with parentheses, such as (Apple OR Bananana)AND Orange,
it throws a MismatchedTokenException.
So I would really appreciate it if someone could explain how to define the grammar so that it handles parentheses.
You forgot to tell ANTLR what to do with whitespace. For example:
WS : [ \t\r\n] -> skip;
Add this and your grammar will work.
As a side note, your grammar has the same precedence for the AND and OR operators. And these operators have higher precedence than NOT. As this goes against conventional rules, I'd advise you to write your expression rule like this instead:
expression
: '(' expression ')' # parenExp
| 'NOT' expression # notExpr
| expression 'AND' expression # andExpr
| expression 'OR' expression # orExpr
| ID # atomExpr
;
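If it helps to see what that ordering means in practice, here is a hand-written Python sketch of the same precedence (NOT binds tightest, then AND, then OR). It is not ANTLR output. The env mapping from identifiers to booleans and all function names are assumptions made for the example.

```python
import re

def evaluate(text, env):
    """Evaluate a NOT/AND/OR expression; `env` maps identifiers to booleans."""
    # findall keeps only words and parentheses, so whitespace simply disappears,
    # which plays the role of the `-> skip` rule.
    tokens = re.findall(r"[A-Za-z]+|[()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def or_expr():   # lowest precedence
        value = and_expr()
        while peek() == "OR":
            take()
            rhs = and_expr()
            value = value or rhs
        return value

    def and_expr():
        value = not_expr()
        while peek() == "AND":
            take()
            rhs = not_expr()
            value = value and rhs
        return value

    def not_expr():  # NOT, parentheses and identifiers
        if peek() == "NOT":
            take()
            return not not_expr()
        if peek() == "(":
            take()
            value = or_expr()
            take(")")
            return value
        return env[take()]

    result = or_expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


env = {"Apple": True, "Bananana": False, "Orange": False}
print(evaluate("(Apple OR Bananana)AND Orange", env))  # False
print(evaluate("Apple OR Bananana AND Orange", env))   # True
```

The second call returns True only because AND groups before OR; with equal precedence and left-to-right grouping it would be (Apple OR Bananana) AND Orange, which is False.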

Resources