I'm trying to make a rule that will rewrite into a nested tree (similar to a binary tree).
For example:
a + b + c + d;
Would parse to a tree like ( ( (a + b) + c) + d). Basically each root node would have three children (LHS '+' RHS) where LHS could be more nested nodes.
I attempted some things like:
rule: lhs '+' ID;
lhs: ID | rule;
and
rule
: rule '+' ID
| ID '+' ID;
(with some tree rewrites) but they all gave me an error about it being left-recursive. I'm not sure how to solve this without some type of recursion.
EDIT: My latest attempt recurses on the right side which gives the reverse of what I want:
rule:
ID (op='+' rule)?
-> {op == null}? ID
-> ^(BinaryExpression<node=MyBinaryExpression> ID $op rule)
Gives (a + (b + (c + d) ) )
The following grammar:
grammar T;
options {
output=AST;
}
tokens {
BinaryExpression;
}
parse
: expr ';' EOF -> expr
;
expr
: (atom -> atom) (ADD a=atom -> ^(BinaryExpression $expr ADD $a))*
;
atom
: ID
| NUM
| '(' expr ')'
;
ADD : '+';
NUM : '0'..'9'+;
ID : 'a'..'z'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
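parses your input "a + b + c + d;" into the left-nested tree ^(BinaryExpression ^(BinaryExpression ^(BinaryExpression a + b) + c) + d), i.e. (((a + b) + c) + d).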
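To see why the loop form nests to the left, here is a rough hand-written sketch of the same idea in plain Java, without ANTLR. The class name LeftAssocSketch and the string-based "tree" are made up for illustration; the two commented lines mirror the two rewrite parts of the expr rule.

```java
// Hypothetical stand-in for the expr rule: the first atom seeds the tree,
// and every further "+ atom" wraps the tree built so far.
class LeftAssocSketch {
    static String parse(String input) {
        String[] atoms = input.replace(";", "").trim().split("\\s*\\+\\s*");
        String tree = atoms[0];                     // (atom -> atom)
        for (int i = 1; i < atoms.length; i++) {    // (ADD a=atom -> ^(BinaryExpression $expr ADD $a))*
            tree = "(" + tree + " + " + atoms[i] + ")";
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("a + b + c + d;"));
    }
}
```

Because the previous result always becomes the left operand, no left recursion is needed to get a left-leaning tree.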
Did you try
rule: ID '+' rule | ID;
?
I am creating parser and lexer rules for the Decaf programming language in ANTLR4. I run a parser test file to get its parse tree: the visited nodes are printed to the terminal and I paste them into D3_parser_tree.html. The current parse tree is missing the right square bracket and the number 10 for this test file: class program { int i [10]; }
The error I am getting: mismatched input '10' expecting INT_LITERAL
I am not sure why I am getting this error, since I have declared a lexer rule for INT_LITERAL and referenced it in a parser rule within field_decl according to the given Decaf spec:
** Parser rules **
<program> → class Program ‘{‘ <field_decl>* <method_decl>* ‘}’
<field_decl> → <type> { <id> | <id> ‘[‘ <int_literal> ‘]’ }+, ;
<method_decl> → { <type> | void } <id> ( [ { <type> <id> }+, ] ) <block>
<digit> → 0 | 1 | 2 | … | 9
<block> → ‘{‘ <var_decl>* <statement>* ‘}’
<literal> → <int_literal> | <char_literal> | <bool_literal>
<hex_digit> → <digit> | a | b | c | … | f | A | B | C | … | F
<int_literal> → <decimal_literal> | <hex_literal>
<decimal_literal> → <digit> <digit>*
<hex_literal> → 0x <hex_digit> <hex_digit>*
Related Lexer rules :
NUMBER : [0-9]+;
fragment ALPHA : [_a-zA-Z0-9];
fragment DIGIT : [0-9];
fragment DECIMAL_LITERAL : DIGIT+;
CHAR_LITERAL : '\'' CHAR '\'';
STRING_LITERAL : '"' CHAR+ '"' ;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
Related Parser rules :
program : CLASS VAR LCURLYBRACE field_decl*method_decl* RCURLYBRACE EOF;
field_decl : data_type field ( COMMA field )* SEMICOLON;
Please let me know if you need further details & I appreciate your help a lot.
The following rules conflict:
VAR : ALPHA+;
...
NUMBER : [0-9]+;
...
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
They all match 10, but the lexer will always choose VAR since that is the rule defined first.
This is just how ANTLR's lexer works: it tries to match as many characters as possible, and when two (or more) rules match the same number of characters, the one defined first "wins".
You will see that it parses correctly if you change field into:
field : VAR | VAR LSQUAREBRACE VAR RSQUAREBRACE;
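The tie-breaking behaviour described above can be imitated with a small Java sketch. This is not ANTLR's implementation, just an approximation using java.util.regex. The class and method names are invented, the VAR and NUMBER patterns are copied from the rules shown, and the INT_LITERAL pattern is an assumption based on the decimal/hex definitions in the spec.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerTieBreakSketch {
    // Returns the rule that wins at the start of the input:
    // the longest match, and on a tie the rule listed first.
    static String winner(Map<String, String> rules, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strictly longer only, so an earlier rule keeps a tie.
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    // Rule order as described in the answer: VAR comes first.
    static Map<String, String> varFirst() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("VAR", "[_a-zA-Z0-9]+");
        rules.put("NUMBER", "[0-9]+");
        rules.put("INT_LITERAL", "0x[0-9a-fA-F]+|[0-9]+"); // assumed pattern
        return rules;
    }

    // Hypothetical reordering with INT_LITERAL ahead of the others.
    static Map<String, String> intLiteralFirst() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("INT_LITERAL", "0x[0-9a-fA-F]+|[0-9]+"); // assumed pattern
        rules.put("NUMBER", "[0-9]+");
        rules.put("VAR", "[_a-zA-Z0-9]+");
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winner(varFirst(), "10"));
        System.out.println(winner(intLiteralFirst(), "10"));
    }
}
```

With VAR first, "10" is tokenized as VAR, which is why the parser never sees an INT_LITERAL. Reordering the rules is only one way out; making VAR unable to start with a digit would likely be the more robust change.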
I'm writing a grammar that supports arbitrary boolean expressions. The grammar is used to represent a program, which is later passed through the static analysis tool. The static analysis tool has certain limitations so I want to apply the following rewrite rules:
Strict inequalities are approximated with epsilon:
expression_a > expression_b -> expression_a >= expression_b + EPSILON
Inequality is approximated using "or" statement:
expression_a != expression_b -> expression_a > expression_b || expression_a < expression_b
Is there any easy way to do it using ANTLR? Currently my grammar looks like so:
comparison : expression ('=='^|'<='^|'>='^|'!='^|'>'^|'<'^) expression;
I'm not sure how to apply a different rewrite rule depending on what the operator is. I want the tree to stay as it is if the operator is "==", "<=" or ">=", and to be recursively transformed otherwise, according to the rules defined above.
[...] and to recursively transform it otherwise, [...]
You can do it partly.
You can't tell ANTLR to rewrite a > b to ^('>=' a ^('+' b epsilon)) and then define a != b to become ^('||' ^('>' a b) ^('<' a b)) and then have ANTLR automatically rewrite both ^('>' a b) and ^('<' a b) to ^('>=' a ^('+' b epsilon)) and ^('<=' a ^('-' b epsilon)) respectively.
A bit of manual work is needed here. The trick is that you can't just use a token like >= if this token isn't actually parsed. A solution to this is to use imaginary tokens.
A quick demo:
grammar T;
options {
output=AST;
}
tokens {
AND;
OR;
GTEQ;
LTEQ;
SUB;
ADD;
EPSILON;
}
parse
: expr
;
expr
: logical_expr
;
logical_expr
: comp_expr ((And | Or)^ comp_expr)*
;
comp_expr
: (e1=add_expr -> $e1) ( Eq e2=add_expr -> ^(AND ^(GTEQ $e1 $e2) ^(LTEQ $e1 $e2))
| LtEq e2=add_expr -> ^(LTEQ $e1 $e2)
| GtEq e2=add_expr -> ^(GTEQ $e1 $e2)
| NEq e2=add_expr -> ^(OR ^(GTEQ $e1 ^(ADD $e2 EPSILON)) ^(LTEQ $e1 ^(SUB $e2 EPSILON)))
| Gt e2=add_expr -> ^(GTEQ $e1 ^(ADD $e2 EPSILON))
| Lt e2=add_expr -> ^(LTEQ $e1 ^(SUB $e2 EPSILON))
)?
;
add_expr
: mult_expr ((Add | Sub)^ mult_expr)*
;
mult_expr
: atom ((Mult | Div)^ atom)*
;
atom
: Num
| Id
| '(' expr ')'
;
Eq : '==';
LtEq : '<=';
GtEq : '>=';
NEq : '!=';
Gt : '>';
Lt : '<';
Or : '||';
And : '&&';
Mult : '*';
Div : '/';
Add : '+';
Sub : '-';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
Id : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
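The parser generated from the grammar above will produce the following trees:
a == b gives ^(AND ^(GTEQ a b) ^(LTEQ a b))
a != b gives ^(OR ^(GTEQ a ^(ADD b EPSILON)) ^(LTEQ a ^(SUB b EPSILON)))
a > b gives ^(GTEQ a ^(ADD b EPSILON))
a < b gives ^(LTEQ a ^(SUB b EPSILON))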
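The "manual work" amounts to spelling out the final shape for every operator. A small Java sketch of that mapping (not generated by ANTLR; the class name and the prefix-string output are made up here) shows that != can simply reuse the > and < cases instead of relying on a second rewrite pass:

```java
class ComparisonRewriteSketch {
    // Maps one comparison to the tree shape the demo grammar's rewrite rules build.
    static String rewrite(String lhs, String op, String rhs) {
        switch (op) {
            case "==": return "(AND " + rewrite(lhs, ">=", rhs) + " " + rewrite(lhs, "<=", rhs) + ")";
            case ">=": return "(GTEQ " + lhs + " " + rhs + ")";
            case "<=": return "(LTEQ " + lhs + " " + rhs + ")";
            case ">":  return "(GTEQ " + lhs + " (ADD " + rhs + " EPSILON))";
            case "<":  return "(LTEQ " + lhs + " (SUB " + rhs + " EPSILON))";
            // != is "greater or less", each side already in its epsilon form.
            case "!=": return "(OR " + rewrite(lhs, ">", rhs) + " " + rewrite(lhs, "<", rhs) + ")";
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("a", "!=", "b"));
    }
}
```

In the grammar the same reuse has to be written out by hand in the NEq alternative, because ANTLR does not re-apply rewrite rules to the trees it has just built.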
I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in:
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: expr ';' | functionCall ';'...;
expr: Identifier indexes? '=' expr | condExpr ;
condExpr: .... and so on;
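To make the idea concrete, here is a rough hand-written recursive-descent sketch in Java of such an expr rule. It is not ANTLR output: the class name, the tokenizer regex and the prefix-string trees are all invented for illustration, and indexes and function calls are left out. Two tokens of lookahead (an identifier followed by '=') are enough to pick the assignment alternative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssignExprSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private AssignExprSketch(String src) {
        // Anything not matched here (spaces, ';') is silently skipped.
        Matcher m = Pattern.compile("[a-z]+|[0-9]+|[-+*/=()]").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    static String parse(String src) {
        AssignExprSketch p = new AssignExprSketch(src);
        String tree = p.expr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + p.peek(0));
        }
        return tree;
    }

    private String peek(int k) {
        return pos + k < tokens.size() ? tokens.get(pos + k) : null;
    }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("unexpected end of input");
        return tokens.get(pos++);
    }

    // expr : Id '=' expr | add   (right recursion gives right associativity)
    private String expr() {
        String first = peek(0);
        if (first != null && first.matches("[a-z]+") && "=".equals(peek(1))) {
            String id = next();
            next(); // consume '='
            return "(= " + id + " " + expr() + ")";
        }
        return add();
    }

    // add : mult (('+' | '-') mult)*
    private String add() {
        String tree = mult();
        while ("+".equals(peek(0)) || "-".equals(peek(0))) {
            String op = next();
            tree = "(" + op + " " + tree + " " + mult() + ")";
        }
        return tree;
    }

    // mult : atom (('*' | '/') atom)*
    private String mult() {
        String tree = atom();
        while ("*".equals(peek(0)) || "/".equals(peek(0))) {
            String op = next();
            tree = "(" + op + " " + tree + " " + atom() + ")";
        }
        return tree;
    }

    // atom : Id | Num | '(' expr ')'
    private String atom() {
        String tok = next();
        if (tok.equals("(")) {
            String inner = expr();
            if (!")".equals(next())) throw new IllegalArgumentException("expected ')'");
            return inner;
        }
        if (tok.matches("[a-z]+|[0-9]+")) return tok;
        throw new IllegalArgumentException("unexpected token: " + tok);
    }
}
```

For example, AssignExprSketch.parse("a = 2 * (b = 1)") accepts the contrived case, while "a = 1 = 2" is rejected because 1 is not an identifier and the second '=' is left over.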
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST: ^(ROOT ^(ASSIGN a ^(ASSIGN b 1)) ^(ASSIGN a ^(* 2 ^(ASSIGN b 1))))
The following grammar works, but also gives a warning:
test.g
grammar test;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
program
: expr ';'!
;
term: ID | INT
;
assign
: term ('='^ expr)?
;
add : assign (('+' | '-')^ assign)*
;
expr: add
;
// T O K E N S
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS :
( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
DOT : '.' ;
fragment
LETTER : ('a'..'z'|'A'..'Z') ;
fragment
DIGIT : '0'..'9' ;
Warning
[15:08:20] warning(200): C:\Users\Charles\Desktop\test.g:21:34:
Decision can match input such as "'+'..'-'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Again, it does produce a tree the way I want:
Input: 0 + a = 1 + b = 2 + 3;
ANTLR produces | ... but I think it
this tree: | gives the warning
| because it _could_
+ | also be parsed this
/ \ | way:
0 = |
/ \ | +
a + | / \
/ \ | + 3
1 = | / \
/ \ | + =
b + | / \ / \
/ \ | 0 = b 2
2 3 | / \
| a 1
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
Charles wrote:
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
You shouldn't create two separate rules for assign and add. As your rules are now, assign has precedence over add, which you don't want: they should have equal precedence by looking at your desired AST. So, you need to wrap all operators +, - and = in one rule:
program
: expr ';'!
;
expr
: term (('+' | '-' | '=')^ expr)*
;
But now the grammar is still ambiguous. You'll need to "help" the parser to look beyond this ambiguity to assure there really is operator expr ahead when parsing (('+' | '-' | '=') expr)*. This can be done using a syntactic predicate, which looks like this:
(look_ahead_rule(s)_in_here)=> rule(s)_to_actually_parse
(the ( ... )=> is the predicate syntax)
A little demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
program
: expr ';'!
;
expr
: term ((op expr)=> op^ expr)*
;
op
: '+'
| '-'
| '='
;
term
: ID
| INT
;
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "0 + a = 1 + b = 2 + 3;";
testLexer lexer = new testLexer(new ANTLRStringStream(source));
testParser parser = new testParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.program().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
And the output of the Main class corresponds to the AST ^(+ 0 ^(= a ^(+ 1 ^(= b ^(+ 2 3))))), which is created without any warnings from ANTLR.
I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:
rule '=' expression '.' ;
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
term
: (factor (factor)*)? ;
factor
: symbol
| '(' expression ')'
-> ^( GROUPED_EXPR expression )
| '[' expression ']'
-> ^( OPTIONAL_EXPR expression)
| '{' expression '}'
-> ^( SEQUENCE_EXPR expression)
;
symbol
: IF_ACTION
| ID (ATTRIBUTES)?
| STRINGLITERAL
;
My problem is with constructions such as these:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
CS results in an AST with a OR_EXPR node although no '|' character
actually appears. I'm sure this is due to the definition of
expression but I cannot see any other way to write the rules.
I did experiment with this to resolve the ambiguity.
// explicitly test for the presence of an '|' character
expression
@init { bool ored = false; }
: term {ored = (input.LT(1).Type == OR); } (OR term)*
-> {ored}? ^(OR_EXPR term term*)
-> ^(LIST term term*)
It works but the hack reinforces my conviction that something fundamental is wrong.
Any tips much appreciated.
Your rule:
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:
expression
: (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
;
And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.
A quick demo:
grammar CocoR;
options {
output=AST;
backtrack=true;
}
tokens {
RULE;
GROUP;
SEQUENCE;
OPTIONAL;
OR;
ATOMS;
}
parse
: rule EOF -> rule
;
rule
: ID '=' expr* '.' -> ^(RULE ID expr*)
;
expr
: (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
;
atoms
: atom+ -> ^(ATOMS atom+)
;
atom
: ID
| '(' expr ')' -> ^(GROUP expr)
| '{' expr '}' -> ^(SEQUENCE expr)
| '[' expr ']' -> ^(OPTIONAL expr)
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
with input:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
produces the AST:
and the input:
foo = a | b ({c} | d [e f]) .
produces:
The class to test this:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
/*
String source =
"CS = { ExternAliasDirective } \n" +
"{ UsingDirective } \n" +
"EOF . ";
*/
String source = "foo = a | b ({c} | d [e f]) .";
ANTLRStringStream in = new ANTLRStringStream(source);
CocoRLexer lexer = new CocoRLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CocoRParser parser = new CocoRParser(tokens);
CocoRParser.parse_return returnValue = parser.parse();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/
HTH
EDIT
To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this:
expr
: (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
| '|' -> ^(OR $expr NOTHING)
)
)*
;
which parses the source:
foo = a | b | .
into the following AST:
The production for expression explicitly says that it can only return an OR_EXPR node. You can try something like:
expression
:
term
|
term ('|' term)+
-> ^( OR_EXPR term term* )
;
Further down, you could use:
term
: factor*;
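Both answers come down to the same thing: only build an OR node when a '|' was actually seen. A minimal Java sketch of that logic, outside ANTLR (the class name and the string trees are made up, and the naive split on '|' does not handle alternatives nested inside brackets):

```java
import java.util.ArrayList;
import java.util.List;

class OrNodeSketch {
    // The first term is returned as-is; an OR node is only created
    // for each additional alternative.
    static String build(List<String> terms) {
        String tree = terms.get(0);                          // (a=atoms -> $a)
        for (int i = 1; i < terms.size(); i++) {
            tree = "(OR " + tree + " " + terms.get(i) + ")"; // ('|' b=atoms -> ^(OR $expr $b))*
        }
        return tree;
    }

    // Naive top-level split; assumes no '|' inside (), [] or {}.
    static String parse(String expression) {
        List<String> terms = new ArrayList<>();
        for (String part : expression.split("\\|")) {
            terms.add(part.trim());
        }
        return build(terms);
    }

    public static void main(String[] args) {
        System.out.println(parse("{ ExternAliasDirective } { UsingDirective } EOF"));
        System.out.println(parse("a | b | c"));
    }
}
```

An expression without any '|' comes back unchanged, so no spurious OR root appears for rules like CS.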