Recursive Tree Rewrite ANTLR - parsing

I have an AST containing a simple list of tokens...
and I simply want to group pairs of balanced parameters into nested trees.
I've been trying various rules but I can't quite get it...
bottomup : findParams;
findParams
: ^(LIST left+=expression* LPARAM inner? RPARAM right+=expression*)
-> ^(LIST $left* ^(PARAMS inner?) $right*);
inner : (left+=expression* LPARAM inner? RPARAM right+=expression*)
-> $left* ^(PARAMS inner?) $right*) | (a+=expression* -> $a*);
fragment expression = INT;
This is somewhat like the Dyck language, but applied to a tree rather than to source text. Also, I can't debug pattern-matching tree grammars with remote debugging, which is a hindrance.

Your approach is on the right track, but you're mixing a top-down approach with a bottom-up one. Top-down is good for breaking things down: "this list is big, make it into some smaller ones." Bottom-up is good for breaking things out: "this is the simplest thing that could be a list, so I'll make it into one."
Here is a bottom-up solution to grouping your nodes:
bottomup
: exit_list
;
exit_list
: ^(LIST pre* LPAR reduced* RPAR post+=.*) -> ^(LIST pre* ^(LIST reduced*) $post*)
;
pre : INT
| LPAR
| ^(LIST .*)
;
reduced
: INT
| ^(LIST .*)
;
For each set of parentheses that contains no other parentheses, convert the contents of that set into a new list. This rule is repeated until there are no more parentheses.
Example:
Input
1(3(4))5
Baseline AST
Final AST
Rule bottomup was recursively applied twice:
applied to (4): (LIST 1 '(' 3 '(' 4 ')' ')' 5) -> (LIST 1 '(' 3 (LIST 4) ')' 5)
applied to (3(4)): (LIST 1 '(' 3 (LIST 4) ')' 5) -> (LIST 1 (LIST 3 (LIST 4)) 5)
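The repeated rewrite described above can be sketched outside ANTLR. This is only an illustration in plain Python, not the tree grammar itself: the function names are made up for the example, the LIST node is modeled as a Python list, and the parentheses are modeled as the strings "(" and ")".

```python
def rewrite_once(tokens):
    """Group the first innermost '(' ... ')' pair into a nested list.

    Returns (new_tokens, True) if a pair was rewritten, else (tokens, False).
    """
    open_index = None
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_index = i                      # remember the most recent '('
        elif tok == ")" and open_index is not None:
            inner = tokens[open_index + 1:i]    # contains no parentheses
            return tokens[:open_index] + [inner] + tokens[i + 1:], True
    return tokens, False


def rewrite_to_fixpoint(tokens):
    """Apply rewrite_once until no more pairs can be grouped."""
    changed = True
    while changed:
        tokens, changed = rewrite_once(tokens)
    return tokens


# Token list for the input 1(3(4))5
result = rewrite_to_fixpoint([1, "(", 3, "(", 4, ")", ")", 5])
print(result)  # [1, [3, [4]], 5]
```

Each call to rewrite_once corresponds to one application of the bottomup rule: the first pass wraps 4, the second wraps 3 and the already reduced list.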

Related

Matching parentheses in ANTLR

I'm new to Antlr and I met one issue with parentheses matching recently. Each node in the parse tree has the form (Node1,W1,W2,Node2), where Node1 and Node2 are two nodes and W1 and W2 are two weights between them. Given an input file as (1,C,10,2).((2,P,2,3).(3,S,3,2))*.(2,T,2,4), the parse tree looks wrong, where the operator is not the parent of those nodes and the parentheses are not matched.
The parse file I wrote is like this:
grammar Semi;
prog
: expr+
;
expr
: expr '*'
| expr ('.'|'+') expr
| tuple
| '(' expr ')'
;
tuple
: LP NODE W1 W2 NODE RP
;
LP : '(' ;
RP : ')' ;
W1 : [PCST0];
W2 : [0-9]+;
NODE: [0-9]+;
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
COMMA: ',' -> skip;
It seems like expr | '(' expr ')' doesn't work correctly. So what should I do to make this parser detect whether parentheses belong to a node or not?
Update:
There are two errors in the command:
line 1:1 no viable alternative at input '(1'
line 1:13 no viable alternative at input '(2'
So it seems like the lexer didn't detect the tuples, but why is that?
Your W2 and NODE rules are the same, so nodes you intend to be NODE are matching W2.
grun with -tokens option: (notice, no NODE tokens)
[#0,0:0='(',<'('>,1:0]
[#1,1:1='1',<W2>,1:1]
[#2,3:3='C',<W1>,1:3]
[#3,5:6='10',<W2>,1:5]
[#4,8:8='2',<W2>,1:8]
[#5,9:9=')',<')'>,1:9]
[#6,10:10='.',<'.'>,1:10]
[#7,11:11='(',<'('>,1:11]
[#8,12:12='(',<'('>,1:12]
[#9,13:13='2',<W2>,1:13]
[#10,15:15='P',<W1>,1:15]
[#11,17:17='2',<W2>,1:17]
[#12,19:19='3',<W2>,1:19]
[#13,20:20=')',<')'>,1:20]
[#14,21:21='.',<'.'>,1:21]
[#15,22:22='(',<'('>,1:22]
[#16,23:23='3',<W2>,1:23]
[#17,25:25='S',<W1>,1:25]
[#18,27:27='3',<W2>,1:27]
[#19,29:29='2',<W2>,1:29]
[#20,30:30=')',<')'>,1:30]
[#21,31:31=')',<')'>,1:31]
[#22,32:32='*',<'*'>,1:32]
[#23,33:33='.',<'.'>,1:33]
[#24,34:34='(',<'('>,1:34]
[#25,35:35='2',<W2>,1:35]
[#26,37:37='T',<W1>,1:37]
[#27,39:39='2',<W2>,1:39]
[#28,41:41='4',<W2>,1:41]
[#29,42:42=')',<')'>,1:42]
[#30,43:42='<EOF>',<EOF>,1:43]
If I replace the NODEs in your parse rule with W2s (sorry, I have no idea what this is supposed to represent), I get:
It appears that your misconception is that the recursive descent parsing starts with the parser rule and when it encounters a Lexer rule, attempts to match it.
This is not how ANTLR works. With ANTLR, your input is first run through the Lexer (aka Tokenizer) to produce a stream of tokens. This step knows absolutely nothing about your parser rules. That's why it's so often useful to use grun to dump the stream of tokens: it gives you a picture of what your parser rules are acting upon, and you can see in your example that there are no NODE tokens, because they all matched W2. When two lexer rules match the same text, the rule defined first in the grammar wins, so NODE can never be produced.
Also, a suggestion... It would appear that commas are an essential part of correct input (unless (1C102).((2P23).(3S32))*.(2T24) is considered valid input). On that assumption, I removed the -> skip and added them to your parser rule (that's why you see them in the parse tree). The resulting grammar I used was:
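To make the lexer behavior concrete, here is a minimal sketch in plain Python of a tokenizer that follows the same two conventions: the longest match wins, and on a tie the rule listed first wins. It is not ANTLR's implementation; the RULES table and the tokenize function are assumptions made for illustration, with the patterns copied from the original grammar.

```python
import re

# Lexer rules in grammar order, mirroring the original Semi grammar.
RULES = [
    ("LP", r"\("),
    ("RP", r"\)"),
    ("W1", r"[PCST0]"),
    ("W2", r"[0-9]+"),
    ("NODE", r"[0-9]+"),      # same pattern as W2, so it can never win
    ("WS", r"[ \t\r\n]+"),
    ("COMMA", r","),
]
SKIPPED = {"WS", "COMMA"}     # rules marked '-> skip' in the grammar


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


tokens = tokenize("(1,C,10,2)")
print(tokens)
```

Running this on the first tuple of the input yields W2 for every number and no NODE token at all, which matches the grun -tokens dump.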
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP W2 COMMA W1 COMMA W2 COMMA W2 RP;
LP: '(';
RP: ')';
W1: [PCST0];
W2: [0-9]+;
NODE: [0-9]+;
WS: [ \t\r\n]+ -> skip; // toss out whitespace
COMMA: ',';
To take a bit more liberty with your grammar, I'd suggest that your Lexer rules should be focused on raw types, and that you use labels to make the various elements of your tuple more easily accessible in your code. Here's an example:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP nodef=INT COMMA w1=PCST0 COMMA w2=INT COMMA nodet=INT RP;
LP: '(';
RP: ')';
PCST0: [PCST0];
INT: [0-9]+;
COMMA: ',';
WS: [ \t\r\n]+ -> skip; // toss out whitespace
With this change, your tuple Context class will expose the labeled tokens nodef, w1, w2, and nodet. The two node labels and w2 hold INT tokens, and w1 holds a PCST0 token, as I've defined them here.

Antlr4 parser for boolean logic

I'm new to Antlr4/CFG and am trying to write a parser for a boolean querying DSL of the form
(id AND id AND ID (OR id OR id OR id))
The logic can also take the form
(id OR id OR (id AND id AND id))
A more complex example might be:
(((id AND id AND (id OR id OR (id AND id)))))
(enclosed in an arbitrary amount of parentheses)
I've tried two things. First, I did a very simple parser, which ended up parsing everything left to right:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom
: INT;
I got the following parse tree for input:
( 60 ) AND ( 55 ) AND ( 53 ) AND ( 3337 OR 2830 OR 23)
This "works", but ideally I want to be able to separate my AND and OR blocks. Trying to separate these blocks into separate grammars leads to left-recursion. Secondly, I want my AND and OR blocks to be grouped together, instead of reading left-to-right, for example, on input (id AND id AND id),
I want:
(and id id id)
not
(and id (and id (and id)))
as it currently is.
The second thing I've tried is making OR blocks directly descendant of AND blocks (ie the first case).
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| and_expr;
and_expr
: term (AND term)* ;
term
: LPAREN or_expr RPAREN
| LPAREN atom RPAREN ;
or_expr
: atom (OR atom)+;
atom: INT ;
For the same input, I get the following parse tree, which is more along the lines of what I'm looking for but has one main problem: there isn't an actual hierarchy to OR and AND blocks in the DSL, so this doesn't work for the second case. This approach also seems a bit hacky, for what I'm trying to do.
What's the best way to proceed? Again, I'm not too familiar with parsing and CFGs, so some guidance would be great.
Both are equivalent in their ability to parse your sample input. If you simplify your input by removing the unnecessary parentheses, the output of this grammar looks pretty good too:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
Which is what I suspect your first grammar looks like in its entirety.
Your second one requires too many parentheses for my liking (mainly in term), and the breaking up of AND and OR into separate rules instead of alternatives doesn't seem as clean to me.
You can simplify even more though:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN # ParenExp
| expression AND expression # AndBlock
| expression OR expression # OrBlock
| atom # AtomExp
;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
This gives a tree with a different shape but still is equivalent. And note the use of the # AndBlock and # OrBlock labels... these "alternative labels" will cause your generated listener or visitor to have separate methods for each, allowing you to completely separate these two in your code semantically as well as syntactically. Perhaps that's what you're looking for?
I like this one the best because it has the simplest and clearest recursion, and it offers specific code alternatives for AND and OR.
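If the flat shape (and id id id) from the question is still wanted, one option is to flatten the binary nodes after parsing, for example inside a visitor. The following is a sketch in plain Python under stated assumptions: the tree is modeled as nested tuples of the form (operator, left, right), and the flatten function is a hypothetical helper, not something ANTLR generates.

```python
def flatten(node):
    """Collapse nested binary nodes with the same operator into one n-ary node."""
    if not isinstance(node, tuple):
        return node                       # an atom such as 60
    op, *children = node
    flat = []
    for child in map(flatten, children):
        if isinstance(child, tuple) and child[0] == op:
            flat.extend(child[1:])        # same operator: splice the operands in
        else:
            flat.append(child)            # atom or a different operator: keep as is
    return (op, *flat)


# Hypothetical binary tree for: 60 AND 55 AND (3337 OR 2830 OR 23)
tree = ("and", ("and", 60, 55), ("or", ("or", 3337, 2830), 23))
flat_tree = flatten(tree)
print(flat_tree)  # ('and', 60, 55, ('or', 3337, 2830, 23))
```

Because AND and OR are each associative, merging a chain of the same operator does not change the meaning of the query. A different operator underneath stays as its own nested block.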

Antlr4: Another "No Viable Alternative Error"

I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar, but for some reason the generated parser(?) is unable to walk from parser rule "equal" to parser rule "expr" in order to reach lexer rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, something like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e., from rule "equal" to "expr").
equal equal
| |
add add
| |
mul mul
| |
exponent exponent
| |
expr expr
| |
NUM NAME
| |
"78" # No Error "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr 4.6, by the way, and I would appreciate it if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
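The effect of the precedence order and of <assoc = right> can be illustrated with a small precedence-climbing parser. This is a sketch in plain Python, not what ANTLR generates: the OPS table and the parse function are assumptions for the example, assignments are left out, and the input is assumed to be well formed.

```python
import re

# Binding power and associativity, mirroring the expr rule above.
OPS = {
    "^": (3, "right"),
    "*": (2, "left"), "/": (2, "left"),
    "+": (1, "left"), "-": (1, "left"),
}


def parse(text):
    """Parse an expression into nested (op, left, right) tuples."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/^()]", text)
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr(1)
            pos += 1                      # skip the closing ')'
            return node
        return tok                        # NAME or NUM, kept as a string

    def expr(min_prec):
        nonlocal pos
        left = atom()
        while pos < len(tokens) and tokens[pos] in OPS:
            op = tokens[pos]
            prec, assoc = OPS[op]
            if prec < min_prec:
                break
            pos += 1
            # A right-associative operator lets its right operand stay on the same level.
            right = expr(prec if assoc == "right" else prec + 1)
            left = (op, left, right)
        return left

    return expr(1)


print(parse("a ^ b ^ c"))   # ('^', 'a', ('^', 'b', 'c'))
print(parse("a - b - c"))   # ('-', ('-', 'a', 'b'), 'c')
print(parse("7 * 2 ^ 5"))   # ('*', '7', ('^', '2', '5'))
```

The power operator nests to the right, the other operators nest to the left, and the higher-precedence operator binds first regardless of where it appears.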
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is matched as NAME, rule expr is satisfied and terminates;
running with equal, everything ends in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and then looks at the next token, 78. It backs out through each rule, checking whether another alternative of that rule applies, but each of those alternatives requires an operator. It therefore arrives back in equal and starts again with the token variable, this time using the alternative | NAME '='. NAME consumes the token, then the rule requires '=', but the next token is 78, which does not satisfy it. As there is no other choice, it reports that there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, the same reasoning applies: first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The second alternative is not an option because it starts with NAME. The rule is satisfied, but note that nothing in the grammar accounts for EOF yet.
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr, I'd like to point out that with Antlr4 you can use direct left recursion in your rule specification, which makes your grammar more compact and improves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
NAME '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar happily takes a NAME as part of an expression, so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar, I'd first check whether the lexer has matched the input into the expected tokens. Afterwards I would look at the parse tree to see what the parser makes of the given token sequence, and then try to do the parsing manually according to your grammar. While doing that, you should be able to find the point at which the parser does something different from what you'd expect.

ANTLR 4 Parser Grammar

How can I improve my parser grammar so that, instead of producing a tree with a couple of decFunc rules for my test code, it produces only one, with sum becoming the second root? I have tried several approaches, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This is my grammar:
The image showing the two decFunc subtrees is at this link:
http://postimg.org/image/w5goph9b7/
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form.
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
Choose the right tool for the whole job, in whatever way you define it.

Parse Parenthesis as atoms ANTLR

I'm trying to match balanced parentheses such that, a PARAMS tree is created if a match is made, else the LPARAM and RPARAM tokens are simply added as atoms to the tree...
tokens
{
LIST;
PARAMS;
}
start : list -> ^(LIST list);
list : (expr|atom)+;
expr : LPARAM list? RPARAM -> ^(PARAMS list?);
atom : INT | LPARAM | RPARAM;
INT : '0'..'9'+;
LPARAM : '(';
RPARAM : ')';
At the moment, it will never create a PARAMS tree, because in the rule expr it will always see the end RPARAM as an atom, rather than the closing token for that rule.
So at the moment, something like 1 2 3 (4) 5 is added to a LIST tree as a flat list of tokens, rather than the required grouping.
I've handled adding tokens as atoms to a tree before, but they never were able to start another rule, as LPARAM does here.
Do I need some sort of syntactic/semantic predicate here?
Here is a simple approach that comes with a couple of constraints. I think these conform to the expected behavior that you mentioned in the comments.
An unmatched LPARAM never appears inside a child list
An unmatched RPARAM never appears inside a child list
Grammar:
start : root+ EOF -> ^(LIST root+ );
root : expr
| LPARAM
| RPARAM
;
expr : list
| atom
;
list : LPARAM expr+ RPARAM -> ^(LIST expr+)
;
atom : INT
;
Rule root matches mismatched LPARAMs and RPARAMs. Rules list and atom only care about themselves.
This solution is relatively fragile because rule root requires expr to be listed before LPARAM and RPARAM. Even so, maybe this is enough to solve your problem.
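For comparison, the same two constraints can be met with a hand-written stack, which avoids the dependence on alternative order. This is a sketch in plain Python and not part of the grammar: group_parens is a hypothetical helper, child lists are modeled as Python lists, and the parentheses are modeled as the strings "(" and ")". Unlike rule list, which requires expr+, this version also turns an empty pair () into an empty child list.

```python
def group_parens(tokens):
    """Nest matched '(' ... ')' pairs; leave unmatched parentheses as atoms."""
    stack = [[]]        # stack[0] is the top-level list; deeper entries are open groups
    for tok in tokens:
        if tok == "(":
            stack.append([])                 # tentatively open a child list
        elif tok == ")" and len(stack) > 1:
            child = stack.pop()
            stack[-1].append(child)          # matched pair becomes a nested list
        else:
            stack[-1].append(tok)            # INT, or a ')' with nothing to close
    # Any groups still open had no ')': put their '(' back as an atom.
    while len(stack) > 1:
        child = stack.pop()
        stack[-1].extend(["("] + child)
    return stack[0]


print(group_parens([1, "(", 2, ")", 3]))            # [1, [2], 3]
print(group_parens(["(", "(", 1, 2, 3]))            # ['(', '(', 1, 2, 3]
print(group_parens([1, 2, 3, ")", ")"]))            # [1, 2, 3, ')', ')']
```

An unmatched '(' is only discovered at the end of input, at which point its tentative group is merged back into its parent, so unmatched parentheses always end up at the top level and never inside a child list.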
Test case 1 : no lists
Input: 1 2 3
Output:
Test case 2 : one list
Input: 1 (2) 3
Output:
Test case 3 : two lists
Input: (1) 2 (3)
Output:
Test case 4 : no lists, mismatched lefts
Input: ((1 2 3
Output:
Test case 5 : two lists, mismatched lefts
Input: ((1 (2) (3)
Output:
Test case 6 : no lists, mismatched rights
Input: 1 2 3))
Output:
Test case 7 : two lists, mismatched rights
Input: (1) (2) 3))
Output:
Test case 8 : two lists, mixed mismatched lefts
Input: ((1 (2) ( (3)
Output:
Test case 9 : two lists, mixed mismatched rights
Input: (1) ) (2) 3))
Output:
Here's a slightly more complicated grammar that operates on [] and () pairs. I think the solution is going to get exponentially worse as you add pairs, but hey, it's fun! You may also be hitting the limitation of what you can do with grammar-driven AST building.
start : root+ EOF -> ^(LIST root+ )
;
root : expr
| LPARAM
| RPARAM
| LSQB
| RSQB
;
expr : plist
| slist
| atom
;
plist : LPARAM pexpr* RPARAM -> ^(LIST pexpr*)
;
pexpr : slist
| atom
| LSQB
| RSQB
;
slist : LSQB sexpr* RSQB -> ^(LIST sexpr*)
;
sexpr : plist
| atom
| LPARAM
| RPARAM
;
atom : INT;
INT : ('0'..'9')+;
LPARAM : '(';
RPARAM : ')';
LSQB : '[';
RSQB : ']';

Resources