Here is a basic structure for simple nested expressions...
infix : prefix (INFIX_OP^ prefix)*;
prefix : postfix | (PREFIX_OP postfix) -> ^(PREFIX_OP postfix);
postfix : INT (POSTFIX_OP^)?;
POSTFIX_OP : '!';
INFIX_OP : '+';
PREFIX_OP : '-';
INT : '0'..'9'*;
If I wanted to create a list of these expressions I could use the following...
list: infix (',' infix)*;
Here we use the ',' as a delimiter.
I want to be able to build a list of expressions without any delimiter.
So if I have the string 4 5 2+3 1 6 I would like to be able to interpret that as (4) (5) ^(+ 2 3) (1) (6)
The problem is that both 4 and 2+3 have the same first symbol (INT) so I have a conflict. I'm trying to figure out how I can resolve this.
EDIT
I've almost figured it out, just having trouble coming up with the correct rewrite for a certain condition...
expr: (a=atom -> $a)
(op='+' b=atom-> {$a.text != "+" && $b.text != "+"}? ^($op $expr $b) // infix
-> {$b.text != "+"}? // HAVING TROUBLE COMING UP WITH THIS CORRECT REWRITE!
-> $expr $op $b)*; // simple list
atom: INT | '+';
INT : '0'..'9'+;
This will parse 1+2+3++4+5+ as ^(+ ^(+ 1 2) 3) (+) (+) ^(+ 4 5) (+), which is what I want.
Now I'm trying to finish my rewrite rule so that ++1+2 will parse as (+) (+) ^(+ 1 2).
Overall I want a list of tokens and to find all the infix expressions, and leave the rest as a list.
There's a problem with your INT rule:
INT : '0'..'9'*;
which matches an empty string. It should always match at least 1 char:
INT : '0'..'9'+;
Besides that, it seems to work just fine.
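To see why the `*` version is a problem, here is a small sketch that uses java.util.regex as a stand-in for the two lexer rules. This is only an analogy (ANTLR's lexer is not regex-based), and the class name IntRuleDemo is made up for illustration.

```java
import java.util.regex.Pattern;

// Analogy only: regular expressions standing in for the two INT lexer rules.
class IntRuleDemo {
    // Mirrors INT : '0'..'9'*;  (zero or more digits)
    static final Pattern STAR = Pattern.compile("[0-9]*");
    // Mirrors INT : '0'..'9'+;  (one or more digits)
    static final Pattern PLUS = Pattern.compile("[0-9]+");

    // Returns true when the rule can succeed without consuming any character.
    static boolean matchesEmpty(Pattern p) {
        return p.matcher("").matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesEmpty(STAR)); // true
        System.out.println(matchesEmpty(PLUS)); // false
    }
}
```

A rule that can succeed on nothing gives the lexer a way to produce a token without advancing, which is why the `+` form is the safer choice.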
Given the grammar:
grammar T;
options {
output=AST;
}
tokens {
LIST;
}
parse : list EOF -> list;
list : infix+ -> ^(LIST infix+);
infix : prefix (INFIX_OP^ prefix)*;
prefix : postfix -> postfix
| PREFIX_OP postfix -> ^(PREFIX_OP postfix)
;
postfix : INT (POSTFIX_OP^)?;
POSTFIX_OP : '!';
INFIX_OP : '+';
PREFIX_OP : '-';
INT : '0'..'9'+;
SPACE : ' ' {skip();};
which parses the input:
4 5 2+3 1 6
into the following AST, written here in ANTLR's tree notation:
^(LIST 4 5 ^(+ 2 3) 1 6)
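As a rough illustration of why no delimiter is needed, the rules above can be mirrored in a small hand-written recursive-descent parser. This is only a sketch under assumptions, not what ANTLR generates: the class name ListParserSketch is made up, trees are printed with plain parentheses instead of ^(...), and error handling is minimal.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of:
//   list : infix+ ;  infix : prefix ('+' prefix)* ;
//   prefix : '-'? postfix ;  postfix : INT '!'? ;
class ListParserSketch {
    private final String src;
    private int pos = 0;

    private ListParserSketch(String src) { this.src = src; }

    // Parses the whole input and returns the tree as a parenthesized string.
    static String parse(String input) {
        ListParserSketch p = new ListParserSketch(input);
        List<String> items = new ArrayList<>();
        p.skipSpaces();
        while (p.pos < p.src.length()) {
            items.add(p.infix());   // each infix expression is one list item
            p.skipSpaces();
        }
        return "(LIST " + String.join(" ", items) + ")";
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    // Consumes c (after optional spaces) if it is the next character.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private String infix() {
        String left = prefix();
        // Keep extending the current item only while a '+' follows.
        while (accept('+')) {
            left = "(+ " + left + " " + prefix() + ")";
        }
        return left;
    }

    private String prefix() {
        if (accept('-')) return "(- " + postfix() + ")";
        return postfix();
    }

    private String postfix() {
        skipSpaces();
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("INT expected at " + pos);
        String value = src.substring(start, pos);
        return accept('!') ? "(! " + value + ")" : value;
    }
}
```

With this sketch, ListParserSketch.parse("4 5 2+3 1 6") returns (LIST 4 5 (+ 2 3) 1 6): an item simply ends as soon as the next token is not an infix operator.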
EDIT
Introducing operators that can both be used in post- and infix expressions will make your list ambiguous (well, in my version below, that is... :)). So, I'll keep the comma in there for this demo:
grammar T;
options {
output=AST;
}
tokens {
LIST;
P_ADD;
}
parse : list EOF -> list;
list : expr (',' expr)* -> ^(LIST expr+);
expr : postfix_expr;
postfix_expr : (infix_expr -> infix_expr) (ADD -> ^(P_ADD infix_expr))?;
infix_expr : atom ((ADD | SUB)^ atom)*;
atom : INT;
ADD : '+';
SUB : '-';
INT : '0'..'9'+;
SPACE : ' ' {skip();};
In the grammar above, the + as an infix operator has precedence over the postfix version, as you can see when parsing input like 2+5+, which gives the following tree in ANTLR's notation:
^(LIST ^(P_ADD ^(+ 2 5)))
I'm writing a grammar that supports arbitrary boolean expressions. The grammar is used to represent a program, which is later passed through the static analysis tool. The static analysis tool has certain limitations so I want to apply the following rewrite rules:
Strict inequalities are approximated with epsilon:
expression_a > expression_b -> expression_a >= expression_b + EPSILON
Inequality is approximated using "or" statement:
expression_a != expression_b -> expression_a > expression_b || expression_a < expression_b
Is there any easy way to do it using ANTLR? Currently my grammar looks like so:
comparison : expression ('=='^|'<='^|'>='^|'!='^|'>'^|'<'^) expression;
I'm not sure how to apply a different rewrite rule depending on what the operator is. I want the tree to stay as it is if the operator is "==", "<=" or ">=", and to recursively transform it otherwise, according to the rules defined above.
[...] and to recursively transform it otherwise, [...]
You can do it partly.
You can't tell ANTLR to rewrite a > b to ^('>=' a ^('+' b epsilon)) and then define a != b to become ^('||' ^('>' a b) ^('<' a b)) and then have ANTLR automatically rewrite both ^('>' a b) and ^('<' a b) to ^('>=' a ^('+' b epsilon)) and ^('<=' a ^('-' b epsilon)) respectively.
A bit of manual work is needed here. The trick is that you can't just use a token like >= if this token isn't actually parsed. A solution to this is to use imaginary tokens.
A quick demo:
grammar T;
options {
output=AST;
}
tokens {
AND;
OR;
GTEQ;
LTEQ;
SUB;
ADD;
EPSILON;
}
parse
: expr
;
expr
: logical_expr
;
logical_expr
: comp_expr ((And | Or)^ comp_expr)*
;
comp_expr
: (e1=add_expr -> $e1) ( Eq e2=add_expr -> ^(AND ^(GTEQ $e1 $e2) ^(LTEQ $e1 $e2))
| LtEq e2=add_expr -> ^(LTEQ $e1 $e2)
| GtEq e2=add_expr -> ^(GTEQ $e1 $e2)
| NEq e2=add_expr -> ^(OR ^(GTEQ $e1 ^(ADD $e2 EPSILON)) ^(LTEQ $e1 ^(SUB $e2 EPSILON)))
| Gt e2=add_expr -> ^(GTEQ $e1 ^(ADD $e2 EPSILON))
| Lt e2=add_expr -> ^(LTEQ $e1 ^(SUB $e2 EPSILON))
)?
;
add_expr
: mult_expr ((Add | Sub)^ mult_expr)*
;
mult_expr
: atom ((Mult | Div)^ atom)*
;
atom
: Num
| Id
| '(' expr ')'
;
Eq : '==';
LtEq : '<=';
GtEq : '>=';
NEq : '!=';
Gt : '>';
Lt : '<';
Or : '||';
And : '&&';
Mult : '*';
Div : '/';
Add : '+';
Sub : '-';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
Id : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
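The parser generated from the grammar above will produce the following trees (written in ANTLR's tree notation):
a == b    ^(AND ^(GTEQ a b) ^(LTEQ a b))
a != b    ^(OR ^(GTEQ a ^(ADD b EPSILON)) ^(LTEQ a ^(SUB b EPSILON)))
a > b     ^(GTEQ a ^(ADD b EPSILON))
a < b     ^(LTEQ a ^(SUB b EPSILON))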
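The rewrite alternatives of comp_expr can also be read as a plain function from an operator and two operands to a tree. The sketch below restates them in Java for illustration only: the class and method names are hypothetical, operands are assumed to be already-printed subtrees, and trees are written with plain parentheses instead of ^(...).

```java
// Sketch of the comp_expr rewrites as a function over already-parsed operands.
class ComparisonRewriteSketch {
    static String rewrite(String op, String a, String b) {
        // b + EPSILON and b - EPSILON, built from imaginary-token names.
        String bPlus = "(ADD " + b + " EPSILON)";
        String bMinus = "(SUB " + b + " EPSILON)";
        switch (op) {
            case "==": return "(AND (GTEQ " + a + " " + b + ") (LTEQ " + a + " " + b + "))";
            case "<=": return "(LTEQ " + a + " " + b + ")";
            case ">=": return "(GTEQ " + a + " " + b + ")";
            case "!=": return "(OR (GTEQ " + a + " " + bPlus + ") (LTEQ " + a + " " + bMinus + "))";
            case ">":  return "(GTEQ " + a + " " + bPlus + ")";
            case "<":  return "(LTEQ " + a + " " + bMinus + ")";
            default:   throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

Note that the "!=" case is written out in full rather than built from the ">" and "<" cases, which is the manual work mentioned above: each alternative has to spell out its final tree itself.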
Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, the expression
"left right" AND center
should produce the same parse tree after dropping the quotes:
left right AND center
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
@modifier{public}
@ctorModifier{public}
@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows (expression subtree, in ANTLR's tree notation):
^(AND ^(WORD left right) ^(WORD center))
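The effect of TEXT+ with a WORD root can be sketched outside ANTLR as a grouping pass over the tokens. The Java below is a simplified illustration under assumptions: tokens are taken to be whitespace-separated (so parentheses must be surrounded by spaces), quoted text is ignored, and the class name WordGroupingSketch is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Groups runs of consecutive plain-text tokens under a WORD node.
class WordGroupingSketch {
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("AND", "OR", "NOT", "(", ")"));

    static List<String> group(String input) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();   // current run of TEXT tokens
        for (String tok : input.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;        // empty input yields one empty token
            if (KEYWORDS.contains(tok)) {
                flush(run, out);                // a keyword ends the current run
                out.add(tok);
            } else {
                run.add(tok);
            }
        }
        flush(run, out);
        return out;
    }

    // Emits the pending run as a single WORD node, if there is one.
    private static void flush(List<String> run, List<String> out) {
        if (!run.isEmpty()) {
            out.add("(WORD " + String.join(" ", run) + ")");
            run.clear();
        }
    }
}
```

For "left right AND center" this yields the three items (WORD left right), AND and (WORD center), which are the operands and operator the andexpr rule then combines.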
I'm trying to make a rule that will rewrite into a nested tree (similar to a binary tree).
For example:
a + b + c + d;
Would parse to a tree like ( ( (a + b) + c) + d). Basically each root node would have three children (LHS '+' RHS) where LHS could be more nested nodes.
I attempted some things like:
rule: lhs '+' ID;
lhs: ID | rule;
and
rule
: rule '+' ID
| ID '+' ID;
(with some tree rewrites) but they all gave me an error about it being left-recursive. I'm not sure how to solve this without some type of recursion.
EDIT: My latest attempt recurses on the right side which gives the reverse of what I want:
rule:
ID (op='+' rule)?
-> {op == null}? ID
-> ^(BinaryExpression<node=MyBinaryExpression> ID $op rule)
Gives (a + (b + (c + d) ) )
The following grammar:
grammar T;
options {
output=AST;
}
tokens {
BinaryExpression;
}
parse
: expr ';' EOF -> expr
;
expr
: (atom -> atom) (ADD a=atom -> ^(BinaryExpression $expr ADD $a))*
;
atom
: ID
| NUM
| '(' expr ')'
;
ADD : '+';
NUM : '0'..'9'+;
ID : 'a'..'z'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
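parses your input "a + b + c + d;" as follows (in ANTLR's tree notation):
^(BinaryExpression ^(BinaryExpression ^(BinaryExpression a + b) + c) + d)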
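The expr rule avoids left recursion by looping: $expr in the rewrite refers to the tree built so far, and each iteration wraps it in a new root. A minimal Java sketch of the same idea follows. It is an illustration only: the class name is hypothetical, the input is assumed to contain nothing but operands separated by '+' with an optional trailing ';', and trees are printed with plain parentheses.

```java
// Builds a left-nested tree with a loop instead of left recursion.
class LeftAssocSketch {
    static String parse(String input) {
        // Drop the terminating ';' and split into operands around '+'.
        String[] parts = input.replace(";", "").trim().split("\\s*\\+\\s*");
        String tree = parts[0];                 // plays the role of (atom -> atom)
        for (int i = 1; i < parts.length; i++) {
            // Plays the role of ^(BinaryExpression $expr ADD $a):
            // the tree so far becomes the left child of the new root.
            tree = "(BinaryExpression " + tree + " + " + parts[i] + ")";
        }
        return tree;
    }
}
```

Because the accumulated tree always becomes the left child, the nesting grows to the left, which is the opposite of what the right-recursive attempt produces.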
Did you try
rule: ID '+' rule | ID;
?
I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below).
@Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("B",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements in my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I solve that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
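To make the difference concrete, the sketch below uses java.util.regex as a stand-in for two possible ID rules. It is an analogy rather than ANTLR code, the class name IdRuleDemo is made up, and the letter-first pattern is just one possible restructuring, not necessarily the one intended above.

```java
import java.util.regex.Pattern;

// Analogy only: regular expressions standing in for two possible ID lexer rules.
class IdRuleDemo {
    // Mirrors ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
    static final Pattern LETTERS_OR_DIGITS = Pattern.compile("[a-zA-Z0-9]+");
    // One possible restructuring: a letter first, then letters or digits.
    static final Pattern LETTER_FIRST = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    // Returns true when the whole text is accepted by the given rule.
    static boolean accepts(Pattern rule, String text) {
        return rule.matcher(text).matches();
    }
}
```

Under these assumptions the first pattern accepts abc, 123, 12ab and ab12, while the letter-first pattern accepts only abc and ab12.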
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
@lexer::header {
package ConFreeG;
}
@header {
package ConFreeG;
import ConFreeG.IR.*;
}
@parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.
You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule.
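The role of EOF can be shown with a tiny hand-written parser. The Java sketch below is a deliberately simplified illustration (it only knows a NUM rule, and the class and method names are hypothetical): a rule on its own stops as soon as it has matched, while the same rule followed by an end-of-input check turns leftover input into an error.

```java
// Contrasts a rule that may stop early with the same rule followed by EOF.
class EofSketch {
    // Consumes one NUM from the start of the input and returns how many
    // characters were read; anything after the digits is silently ignored.
    static int parseNum(String input) {
        int pos = 0;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        if (pos == 0) throw new IllegalArgumentException("NUM expected");
        return pos;
    }

    // Same rule followed by EOF: leftover input is reported, not ignored.
    static int parseNumThenEof(String input) {
        int pos = parseNum(input);
        if (pos != input.length()) {
            throw new IllegalArgumentException("unexpected input at offset " + pos);
        }
        return pos;
    }
}
```

Here parseNum("8 + a45") succeeds after reading a single character, whereas parseNumThenEof("8 + a45") throws, which corresponds to forcing the parser through the entire token stream.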