Alternatives in ANTLR rewrite rule - parsing

I'm writing a grammar that supports arbitrary boolean expressions. The grammar is used to represent a program, which is later passed through the static analysis tool. The static analysis tool has certain limitations so I want to apply the following rewrite rules:
Strict inequalities are approximated with epsilon:
expression_a > expression_b -> expression_a >= expression_b + EPSILON
Inequality is approximated using "or" statement:
expression_a != expression_b -> expression_a > expression_b || expression_a < expression_b
Is there any easy way to do it using ANTLR? Currently my grammar looks like so:
comparison : expression ('=='^|'<='^|'>='^|'!='^|'>'^|'<'^) expression;
I'm not sure how to apply a different rewrite rule depending on what the operator is. I want the tree to stay as it is if the operator is "==", "<=" or ">=", and to recursively transform it otherwise, according to the rules defined above.

[...] and to recursively transform it otherwise, [...]
You can do it partly.
You can't tell ANTLR to rewrite a > b to ^('>=' a ^('+' b epsilon)) and then define a != b to become ^('||' ^('>' a b) ^('<' a b)) and then have ANTLR automatically rewrite both ^('>' a b) and ^('<' a b) to ^('>=' a ^('+' b epsilon)) and ^('<=' a ^('-' b epsilon)) respectively.
A bit of manual work is needed here. The trick is that you can't just use a token like >= if this token isn't actually parsed. A solution to this is to use imaginary tokens.
A quick demo:
grammar T;
options {
output=AST;
}
tokens {
AND;
OR;
GTEQ;
LTEQ;
SUB;
ADD;
EPSILON;
}
parse
: expr
;
expr
: logical_expr
;
logical_expr
: comp_expr ((And | Or)^ comp_expr)*
;
comp_expr
: (e1=add_expr -> $e1) ( Eq e2=add_expr -> ^(AND ^(GTEQ $e1 $e2) ^(LTEQ $e1 $e2))
| LtEq e2=add_expr -> ^(LTEQ $e1 $e2)
| GtEq e2=add_expr -> ^(GTEQ $e1 $e2)
| NEq e2=add_expr -> ^(OR ^(GTEQ $e1 ^(ADD $e2 EPSILON)) ^(LTEQ $e1 ^(SUB $e2 EPSILON)))
| Gt e2=add_expr -> ^(GTEQ $e1 ^(ADD $e2 EPSILON))
| Lt e2=add_expr -> ^(LTEQ $e1 ^(SUB $e2 EPSILON))
)?
;
add_expr
: mult_expr ((Add | Sub)^ mult_expr)*
;
mult_expr
: atom ((Mult | Div)^ atom)*
;
atom
: Num
| Id
| '(' expr ')'
;
Eq : '==';
LtEq : '<=';
GtEq : '>=';
NEq : '!=';
Gt : '>';
Lt : '<';
Or : '||';
And : '&&';
Mult : '*';
Div : '/';
Add : '+';
Sub : '-';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
Id : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
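The parser generated from the grammar above will produce the following ASTs (written in ANTLR's LISP-style tree notation):
a == b becomes (AND (GTEQ a b) (LTEQ a b))
a != b becomes (OR (GTEQ a (ADD b EPSILON)) (LTEQ a (SUB b EPSILON)))
a > b becomes (GTEQ a (ADD b EPSILON))
a < b becomes (LTEQ a (SUB b EPSILON))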
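If you would rather leave the tree as parsed and transform it afterwards, the same rules can be applied by a small recursive pass over the tree. The following is only a minimal sketch in plain Java: it assumes a hypothetical binary Node class instead of ANTLR's own tree types, and the names ComparisonRewriter, Node, gt, lt and rewrite are made up for this illustration.

```java
final class ComparisonRewriter {
    /** Hypothetical binary AST node (not an ANTLR class); leaves have no children. */
    static final class Node {
        final String label;
        final Node left, right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        static Node leaf(String label) { return new Node(label, null, null); }

        @Override public String toString() {
            return left == null ? label : "(" + label + " " + left + " " + right + ")";
        }
    }

    static final Node EPSILON = Node.leaf("EPSILON");

    // a > b  becomes  a >= b + EPSILON
    static Node gt(Node a, Node b) { return new Node(">=", a, new Node("+", b, EPSILON)); }

    // a < b  becomes  a <= b - EPSILON
    static Node lt(Node a, Node b) { return new Node("<=", a, new Node("-", b, EPSILON)); }

    /** Rewrites the children first, then the node itself, so nested comparisons are covered. */
    static Node rewrite(Node n) {
        if (n.left == null) return n;                    // leaf: nothing to rewrite
        Node a = rewrite(n.left);
        Node b = rewrite(n.right);
        switch (n.label) {
            case ">":  return gt(a, b);
            case "<":  return lt(a, b);
            case "!=": return new Node("||", gt(a, b), lt(a, b));
            default:   return new Node(n.label, a, b);   // ==, <=, >=, &&, ... keep their operator
        }
    }

    public static void main(String[] args) {
        Node a = Node.leaf("a"), b = Node.leaf("b");
        System.out.println(rewrite(new Node("!=", a, b)));
        System.out.println(rewrite(new Node("==", a, b)));
    }
}
```

Because "!=" is expanded directly into the already approximated forms, no second pass over the result is needed.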

Related

ANTLR Lexer matching the wrong rule

I'm working on a lexer and parser for an old object oriented chat system (MOO in case any readers are familiar with its language). Within this language, any of the below examples are valid floating point numbers:
2.3
3.
.2
3e+5
The language also implements an indexing syntax for extracting one or more characters from a string or list (which is a set of comma separated expressions enclosed in curly braces). The problem arises from the fact that the language supports a range operator inside the index brackets. For example:
a = foo[1..3];
I understand that ANTLR wants to match the longest possible match first. Unfortunately this results in the lexer seeing '1..3' as two floating-point numbers (1. and .3), rather than two integers with a range operator ('..') between them. Is there any way to solve this short of using lexer modes? Given that the values inside an indexing expression can be any valid expression, I would have to duplicate a lot of token rules (essentially all but the floating-point numbers, as I understand it). Granted, I'm new to ANTLR, so I'm sure I'm missing something, and any help is much appreciated. I will supply my lexer grammar below:
lexer grammar MooLexer;
channels { COMMENTS_CHANNEL }
SINGLE_LINE_COMMENT
: '//' INPUT_CHARACTER* -> channel(COMMENTS_CHANNEL);
DELIMITED_COMMENT
: '/*' .*? '*/' -> channel(COMMENTS_CHANNEL);
WS
: [ \t\r\n] -> channel(HIDDEN)
;
IF
: I F
;
ELSE
: E L S E
;
ELSEIF
: E L S E I F
;
ENDIF
: E N D I F
;
FOR
: F O R;
ENDFOR
: E N D F O R;
WHILE
: W H I L E
;
ENDWHILE
: E N D W H I L E
;
FORK
: F O R K
;
ENDFORK
: E N D F O R K
;
RETURN
: R E T U R N
;
BREAK
: B R E A K
;
CONTINUE
: C O N T I N U E
;
TRY
: T R Y
;
EXCEPT
: E X C E P T
;
ENDTRY
: E N D T R Y
;
IN
: I N
;
SPLICER
: '#';
UNDERSCORE
: '_';
DOLLAR
: '$';
SEMI
: ';';
COLON
: ':';
DOT
: '.';
COMMA
: ',';
BANG
: '!';
OPEN_QUOTE
: '`';
SINGLE_QUOTE
: '\'';
LEFT_BRACKET
: '[';
RIGHT_BRACKET
: ']';
LEFT_CURLY_BRACE
: '{';
RIGHT_CURLY_BRACE
: '}';
LEFT_PARENTHESIS
: '(';
RIGHT_PARENTHESIS
: ')';
PLUS
: '+';
MINUS
: '-';
STAR
: '*';
DIV
: '/';
PERCENT
: '%';
PIPE
: '|';
CARET
: '^';
ASSIGNMENT
: '=';
QMARK
: '?';
OP_AND
: '&&';
OP_OR
: '||';
OP_EQUALS
: '==';
OP_NOT_EQUAL
: '!=';
OP_LESS_THAN
: '<';
OP_GREATER_THAN
: '>';
OP_LESS_THAN_OR_EQUAL_TO
: '<=';
OP_GREATER_THAN_OR_EQUAL_TO
: '>=';
RANGE
: '..';
ERROR
: 'E_NONE'
| 'E_TYPE'
| 'E_DIV'
| 'E_PERM'
| 'E_PROPNF'
| 'E_VERBNF'
| 'E_VARNF'
| 'E_INVIND'
| 'E_RECMOVE'
| 'E_MAXREC'
| 'E_RANGE'
| 'E_ARGS'
| 'E_NACC'
| 'E_INVARG'
| 'E_QUOTA'
| 'E_FLOAT'
;
OBJECT
: '#' DIGIT+
| '#-' DIGIT+
;
STRING
: '"' ( ESC | [ !] | [#-[] | [\]-~] | [\t] )* '"';
INTEGER
: DIGIT+;
FLOAT
: DIGIT+ [.] (DIGIT*)? (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| [.] DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+
;
IDENTIFIER
: (LETTER | DIGIT | UNDERSCORE)+
;
LETTER
: LOWERCASE
| UPPERCASE
;
/*
* fragments
*/
fragment LOWERCASE
: [a-z] ;
fragment UPPERCASE
: [A-Z] ;
fragment EXPONENTNOTATION
: ('E' | 'e');
fragment EXPONENTSIGN
: ('-' | '+');
fragment DIGIT
: [0-9] ;
fragment ESC
: '\\"' | '\\\\' ;
fragment INPUT_CHARACTER
: ~[\r\n\u0085\u2028\u2029];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
No, AFAIK, there is no way to solve this using lexer modes. You'll need a predicate with a bit of target specific code. If Java is your target, that might look like this:
lexer grammar RangeTestLexer;
FLOAT
: [0-9]+ '.' [0-9]+
| [0-9]+ '.' {_input.LA(1) != '.'}?
| '.' [0-9]+
;
INTEGER
: [0-9]+
;
RANGE
: '..'
;
SPACES
: [ \t\r\n] -> skip
;
If you run the following Java code:
Lexer lexer = new RangeTestLexer(CharStreams.fromString("1 .2 3. 4.5 6..7 8 .. 9"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s `%s`\n", RangeTestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get the following output:
INTEGER `1`
FLOAT `.2`
FLOAT `3.`
FLOAT `4.5`
INTEGER `6`
RANGE `..`
INTEGER `7`
INTEGER `8`
RANGE `..`
INTEGER `9`
EOF `<EOF>`
The { ... }? is the predicate and the embedded code must evaluate to a boolean. In my example, the Java code _input.LA(1) != '.' returns true if the character stream 1 step ahead of the current position does not equal a '.' char.
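To make the effect of the predicate concrete, here is a hand-written tokenizer in plain Java that takes the same decision with one character of lookahead. It is only a sketch of the idea and does not use the ANTLR runtime; the class RangeLookaheadSketch and its "TYPE:text" output format are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeLookaheadSketch {
    /** Splits the input into "TYPE:text" strings, using the same three token kinds as the grammar. */
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0, n = in.length();
        while (i < n) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (isDigit(c)) {
                while (i < n && isDigit(in.charAt(i))) i++;
                // A '.' belongs to the number only if it does not start a "..".
                // This plays the role of the predicate {_input.LA(1) != '.'}? in the grammar.
                boolean dot = i < n && in.charAt(i) == '.';
                boolean range = dot && i + 1 < n && in.charAt(i + 1) == '.';
                if (dot && !range) {
                    i++;
                    while (i < n && isDigit(in.charAt(i))) i++;
                    out.add("FLOAT:" + in.substring(start, i));
                } else {
                    out.add("INTEGER:" + in.substring(start, i));
                }
            } else if (c == '.' && i + 1 < n && in.charAt(i + 1) == '.') {
                i += 2;
                out.add("RANGE:..");
            } else if (c == '.' && i + 1 < n && isDigit(in.charAt(i + 1))) {
                i++;
                while (i < n && isDigit(in.charAt(i))) i++;
                out.add("FLOAT:" + in.substring(start, i));
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at index " + i);
            }
        }
        return out;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    public static void main(String[] args) {
        System.out.println(tokenize("1 .2 3. 4.5 6..7 8 .. 9"));
    }
}
```

For the input "6..7" the digit branch sees that the '.' is followed by another '.', so it stops after the integer and leaves both dots for the RANGE branch.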

How to make certain rules mandatory in Antlr

I wrote the following grammar which should check for a conditional expression.
The examples below show what I want to achieve using this grammar:
test invalid
test = 1 valid
test = 1 and another_test>=0.2 valid
test = 1 kasd y = 1 invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) invalid (there cannot be a lonely character like 'c'. It should always be a triplet. i.e, literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is, how can I get the grammar to work for the examples mentioned above? Can we make certain words mandatory between two triplets (literal operator literal)? In a sense I'm just trying to get a parser to validate the where clause condition, but only simple conditions and functions are permitted. I also want to have a visitor in Java that retrieves values such as functions, parentheses and literals. How can I achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions -- you need to classify the functions/vars in your lexer and have a different terminal for each, which is tricky and error prone.
Instead, it is generally better to NOT do this kind of checking in the parser -- have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of operations and the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').
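A separate checking pass of that kind might look roughly like the sketch below. It is written in plain Java against a hypothetical Expr node rather than an ANTLR-generated parse tree, and the names ConditionChecker, Expr, Type and validate are assumptions for this example; in a real project the same logic would typically sit in a visitor over the tree the parser produces.

```java
import java.util.ArrayList;
import java.util.List;

final class ConditionChecker {
    enum Type {
        CONDITION("condition"), VALUE("value");
        final String label;
        Type(String label) { this.label = label; }
    }

    /** Hypothetical expression node: op is null for literals, otherwise a binary operator. */
    static final class Expr {
        final String op, text;
        final Expr left, right;

        private Expr(String op, String text, Expr left, Expr right) {
            this.op = op; this.text = text; this.left = left; this.right = right;
        }

        static Expr lit(String text) { return new Expr(null, text, null, null); }
        static Expr bin(String op, Expr l, Expr r) { return new Expr(op, null, l, r); }
    }

    /** Returns all problems found; an empty list means the expression is a valid condition. */
    static List<String> validate(Expr root) {
        List<String> errors = new ArrayList<>();
        if (check(root, errors) != Type.CONDITION) {
            errors.add("top-level expression must be a condition");
        }
        return errors;
    }

    private static Type check(Expr e, List<String> errors) {
        if (e.op == null) return Type.VALUE;               // a lone literal is a value
        boolean logical = e.op.equals("and") || e.op.equals("or");
        // and/or combine conditions; comparison operators compare values
        Type expected = logical ? Type.CONDITION : Type.VALUE;
        require(e.left, expected, "left", e.op, errors);
        require(e.right, expected, "right", e.op, errors);
        return Type.CONDITION;
    }

    private static void require(Expr operand, Type expected, String side, String op, List<String> errors) {
        if (check(operand, errors) != expected) {
            errors.add(side + " operand of '" + op + "' must be a " + expected.label);
        }
    }

    public static void main(String[] args) {
        // a = 1 or (b = 1 and c)
        Expr e = Expr.bin("or",
                Expr.bin("=", Expr.lit("a"), Expr.lit("1")),
                Expr.bin("and", Expr.bin("=", Expr.lit("b"), Expr.lit("1")), Expr.lit("c")));
        System.out.println(validate(e));
    }
}
```

The error text names the offending operator and operand, which is the kind of message a bare syntax error cannot give.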

ANTLR rule rewrite to nested tree

I'm trying to make a rule that will rewrite into a nested tree (similar to a binary tree).
For example:
a + b + c + d;
Would parse to a tree like ( ( (a + b) + c) + d). Basically each root node would have three children (LHS '+' RHS) where LHS could be more nested nodes.
I attempted some things like:
rule: lhs '+' ID;
lhs: ID | rule;
and
rule
: rule '+' ID
| ID '+' ID;
(with some tree rewrites) but they all gave me an error about it being left-recursive. I'm not sure how to solve this without some type of recursion.
EDIT: My latest attempt recurses on the right side which gives the reverse of what I want:
rule:
ID (op='+' rule)?
-> {op == null}? ID
-> ^(BinaryExpression<node=MyBinaryExpression> ID $op rule)
Gives (a + (b + (c + d) ) )
The following grammar:
grammar T;
options {
output=AST;
}
tokens {
BinaryExpression;
}
parse
: expr ';' EOF -> expr
;
expr
: (atom -> atom) (ADD a=atom -> ^(BinaryExpression $expr ADD $a))*
;
atom
: ID
| NUM
| '(' expr ')'
;
ADD : '+';
NUM : '0'..'9'+;
ID : 'a'..'z'+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
parses your input "a + b + c + d;" into the left-nested tree (BinaryExpression (BinaryExpression (BinaryExpression a + b) + c) + d).
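The reason the loop in expr nests to the left is that each iteration wraps the tree built so far ($expr) as the left child of a new node. The same fold can be written by hand, as in this minimal Java sketch. It works on whitespace-separated tokens, builds a string instead of a real tree, and does not validate the operands; the class LeftFoldSketch and method parseSum are hypothetical names.

```java
final class LeftFoldSketch {
    /** Turns "a + b + c" (whitespace-separated tokens) into a left-nested tree string. */
    static String parseSum(String input) {
        String[] toks = input.trim().split("\\s+");
        if (toks.length % 2 == 0 || toks[0].isEmpty()) {
            throw new IllegalArgumentException("expected: operand ('+' operand)*");
        }
        String tree = toks[0];                              // (atom -> atom)
        for (int i = 1; i < toks.length; i += 2) {          // (ADD a=atom -> ^(... $expr ADD $a))*
            if (!toks[i].equals("+")) {
                throw new IllegalArgumentException("expected '+' but found " + toks[i]);
            }
            // the tree built so far becomes the left child of the new node
            tree = "(" + tree + " + " + toks[i + 1] + ")";
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parseSum("a + b + c + d"));
    }
}
```

A loop avoids the left recursion that ANTLR 3 rejects while still producing the left-associative shape.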
Did you try
rule: ID '+' rule | ID;
?

ANTLR parse assignments

I want to parse some assignments, where I only care about the assignment as a whole, not about what's inside the assignment. An assignment is indicated by ':='. (EDIT: Before and after the assignments other things may come)
Some examples:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
Currently I make a difference between assignments containing a 'case' and other assignments. For simple assignments I tried something like ~('case' | 'esac' | ';') but then antlr complained about unmatched tokens (like '=').
assignment :
NAME ':='! expression ;
expression :
( simple_expression | case_expression) ;
simple_expression :
((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ;
case_expression :
'case' .+ 'esac' ';'! ;
I tried replacing with the following, because the eclipse-interpreter did not seem to like the ((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ; because of the 'and'.
(~(OPERATOR | ~NAME | ('case' | 'esac')) |
~(~OPERATOR | NAME | ('case' | 'esac')) |
~(~OPERATOR | ~NAME | ('case' | 'esac'))) ';'!
But this does not work. I get
"error(139): /AntlrTutorial/src/foo/NusmvInput.g:78:5: set complement is empty |---> ~(~OPERATOR | ~NAME | ('case' | 'esac'))) EOC! ;"
How can I parse it?
There are a couple of things going wrong here:
you're using & in your grammar while it should be with quotes around it: '&'
unless you know exactly what you're doing, don't use ~ and . (especially not .+ !) inside parser rules: use them in lexer rules only;
create lexer rules instead of defining 'case' and 'esac' in your parser rules (it's safe to use literal tokens in your parser rules if no other lexer rule can potentially match them, but 'case' and 'esac' look a lot like NAME and they could end up in your AST, in which case it's better to explicitly define them yourself in the lexer)
Here's a quick demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
CASES;
CASE;
}
parse
: (assignment SCOL)* EOF -> ^(ROOT assignment*)
;
assignment
: NAME ASSIGN^ expression
;
expression
: ternary_expression
;
ternary_expression
: or_expression (QMARK^ ternary_expression COL! ternary_expression)?
;
or_expression
: unary_expression ((AND | OR)^ unary_expression)*
;
unary_expression
: NOT^ atom
| atom
;
atom
: TRUE
| FALSE
| NUMBER
| NAME
| CASE single_case+ ESAC -> ^(CASES single_case+)
| '(' expression ')' -> expression
;
single_case
: expression COL expression SCOL -> ^(CASE expression expression)
;
TRUE : 'TRUE';
FALSE : 'FALSE';
CASE : 'case';
ESAC : 'esac';
ASSIGN : ':=';
AND : '&';
OR : '|';
NOT : '!';
QMARK : '?';
COL : ':';
SCOL : ';';
NAME : ('a'..'z' | 'A'..'Z')+;
NUMBER : ('0'..'9')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse your input:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
as follows:

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
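The expr rule above decides between its two alternatives by looking at two tokens: an Id followed by '=' starts an assignment, anything else is an ordinary expression. A hand-written recursive-descent version of the same idea is sketched below in plain Java. It is only an illustration: tokens must be separated by whitespace, the result is a LISP-style string rather than an ANTLR tree, and the names AssignExprSketch, parse and accepts are made up.

```java
final class AssignExprSketch {
    private final String[] toks;
    private int pos = 0;

    private AssignExprSketch(String input) { toks = input.trim().split("\\s+"); }

    /** Returns a LISP-style tree, or throws IllegalArgumentException for invalid input. */
    static String parse(String input) {
        AssignExprSketch p = new AssignExprSketch(input);
        String tree = p.expr();
        if (p.pos != p.toks.length) {
            throw new IllegalArgumentException("unexpected token: " + p.toks[p.pos]);
        }
        return tree;
    }

    static boolean accepts(String input) {
        try { parse(input); return true; } catch (IllegalArgumentException e) { return false; }
    }

    private String peek(int k) { return pos + k < toks.length ? toks[pos + k] : ""; }

    // expr : Id '=' expr | add ;   two tokens of lookahead choose the alternative
    private String expr() {
        if (peek(0).matches("[a-z]+") && peek(1).equals("=")) {
            String id = toks[pos];
            pos += 2;
            return "(= " + id + " " + expr() + ")";   // recursion on the right: right associative
        }
        return add();
    }

    // add : mult (('+' | '-') mult)* ;
    private String add() {
        String left = mult();
        while (peek(0).equals("+") || peek(0).equals("-")) {
            String op = toks[pos++];
            left = "(" + op + " " + left + " " + mult() + ")";
        }
        return left;
    }

    // mult : atom (('*' | '/') atom)* ;
    private String mult() {
        String left = atom();
        while (peek(0).equals("*") || peek(0).equals("/")) {
            String op = toks[pos++];
            left = "(" + op + " " + left + " " + atom() + ")";
        }
        return left;
    }

    // atom : Id | Num | '(' expr ')' ;
    private String atom() {
        String t = peek(0);
        if (t.matches("[a-z]+|[0-9]+")) { pos++; return t; }
        if (t.equals("(")) {
            pos++;
            String inner = expr();
            if (!peek(0).equals(")")) throw new IllegalArgumentException("expected ')'");
            pos++;
            return inner;
        }
        throw new IllegalArgumentException("unexpected token: '" + t + "'");
    }

    public static void main(String[] args) {
        System.out.println(parse("a = b = 1"));
        System.out.println(parse("a = 2 * ( b = 1 )"));
        System.out.println(accepts("a = 1 = 2"));
    }
}
```

Since only an identifier can appear to the left of '=', an input such as "a = 1 = 2" is rejected without any extra check.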
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:
