ANTLR grammar not handling my "not" operator correctly - parsing

I am trying to parse a small expression language (I didn't define the language; it comes from a vendor), and everything is fine until I try to use the not operator, which is a tilde in this language.
My grammar has been heavily influenced by these two links (aka shameless cut and pasting):
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx
http://www.alittlemadness.com/2006/06/05/antlr-by-example-part-1-the-language
The language consists of three expression types that can be combined with the and, or, and not operators, with parentheses to change precedence. The expressions are:
Skill("name") > some_number (can also be <, >=, <=, =, !=)
SkillExists("name")
LoggedIn("name") (this one can also have name#name)
This input works fine:
Skill("somename") > 1 | (LoggedIn("somename") & SkillExists("othername"))
However, as soon as I try to use the not operator I get a NoViableAltException, and I can't figure out why. I have compared my grammar to the ECalc.g one at the codeproject.com link and they seem to match, so there must be some subtle difference I can't see. This input fails:
Skill("somename") < 10 ~ SkillExists("othername")
My Grammar:
grammar UserAttribute;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
SKILL = 'Skill' ;
SKILL_EXISTS = 'SkillExists' ;
LOGGED_IN = 'LoggedIn';
GT = '>';
LT = '<';
LTE = '<=';
GTE = '>=';
EQUALS = '=';
NOT_EQUALS = '!=';
AND = '&';
OR = '|' ;
NOT = '~';
LPAREN = '(';
RPAREN = ')';
QUOTE = '"';
AT = '#';
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expression : orexpression EOF!;
orexpression : andexpression (OR^ andexpression)*;
andexpression : notexpression (AND^ notexpression)*;
notexpression : primaryexpression | NOT^ primaryexpression;
primaryexpression : term | LPAREN! orexpression RPAREN!;
term : skill_exists | skill | logged_in;
skill_exists : SKILL_EXISTS LPAREN QUOTE NAME QUOTE RPAREN;
logged_in : LOGGED_IN LPAREN QUOTE NAME (AT NAME)? QUOTE RPAREN;
skill: SKILL LPAREN QUOTE NAME QUOTE RPAREN ((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)? NUMBER*)?;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NAME : ('a'..'z' | 'A'..'Z' | '_')+;
NUMBER : ('0'..'9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;

I have 2 remarks:
1
Since you're parsing single expressions (expression : orexpression EOF!;), the input "Skill("somename") < 10 ~ SkillExists("othername")" is not only invalid in your grammar, it is invalid for any expression parser I know of. A notexpression only takes a right-hand-side operand, so ~ SkillExists("othername") is a single expression and Skill("somename") < 10 is also a single expression. But between those two expressions there is no OR or AND operator. It would be the same as evaluating the expression true false instead of true | false or true and false.
In short, your grammar disallows:
Skill("somename") < 10 ~ SkillExists("othername")
but allows for:
Skill("somename") < 10 & SkillExists("othername")
which seems logical to me.
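To make the point concrete, here is a minimal hand-written sketch in plain Python (not ANTLR) that mirrors the orexpression / andexpression / notexpression / primaryexpression layering. It assumes a simplified input in which every term such as Skill("x") < 10 is replaced by a bare identifier; the Parser class and the parses helper are hypothetical names used only for this illustration.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z_]+|[&|~()])")

def tokenize(text):
    """Split the input into identifiers and the operators & | ~ ( )."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):          # expression : orexpression EOF
        tree = self.orexpression()
        if self.peek() is not None:
            raise SyntaxError(f"no viable alternative at {self.peek()!r}")
        return tree

    def orexpression(self):        # andexpression (OR andexpression)*
        left = self.andexpression()
        while self.peek() == "|":
            self.take()
            left = ("|", left, self.andexpression())
        return left

    def andexpression(self):       # notexpression (AND notexpression)*
        left = self.notexpression()
        while self.peek() == "&":
            self.take()
            left = ("&", left, self.notexpression())
        return left

    def notexpression(self):       # primaryexpression | NOT primaryexpression
        if self.peek() == "~":
            self.take()
            return ("~", self.primaryexpression())
        return self.primaryexpression()

    def primaryexpression(self):   # term | LPAREN orexpression RPAREN
        tok = self.take()
        if tok == "(":
            inner = self.orexpression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok is None or tok in {"&", "|", "~", ")"}:
            raise SyntaxError(f"no viable alternative at {tok!r}")
        return tok

def parses(text):
    """Return True if the whole input is one valid expression."""
    try:
        Parser(text).expression()
        return True
    except SyntaxError:
        return False
```

With this sketch, parses("A & ~ B") succeeds, while parses("A ~ B") fails: after A has been consumed, the ~ is neither & nor |, so the parser hits leftover input where it expects the end of the expression.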
2
I don't quite understand your skill rule (which is ambiguous, btw):
skill
: SKILL LPAREN QUOTE NAME QUOTE RPAREN
((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)? NUMBER*)?
;
This means that the operator is optional and that there can be zero or more numbers at the end, so the following inputs are all valid:
Skill("foo") = 10 20
Skill("foo") 10 20 30
Skill("foo") <
Perhaps you meant:
skill
: SKILL LPAREN QUOTE NAME QUOTE RPAREN
((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)^ NUMBER)?
;
instead? (the ? becomes a ^ and the * is removed)
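The difference between the two versions of the rule can be checked with a small Python sketch. It is only an analogy: each token type is abbreviated to one letter (S for the whole Skill("name") prefix, O for a comparison operator, N for NUMBER), and the two rule tails are written as regular expressions over those letters. The names ORIGINAL, FIXED, and accepted are made up for this example.

```python
import re

# One letter per token type:
#   S = the Skill("name") prefix, O = comparison operator, N = NUMBER
ORIGINAL = re.compile(r"S(O?N*)?")  # ((GT | LT | ...)? NUMBER*)?
FIXED = re.compile(r"S(ON)?")       # ((GT | LT | ...)^ NUMBER)?

def accepted(pattern, token_types):
    """True if the whole token-type string matches the rule."""
    return pattern.fullmatch(token_types) is not None

# Skill("foo") = 10 20   -> "SONN"
# Skill("foo") 10 20 30  -> "SNNN"
# Skill("foo") <         -> "SO"
# Skill("foo") < 10      -> "SON"
for sample in ("SONN", "SNNN", "SO", "SON", "S"):
    print(sample, accepted(ORIGINAL, sample), accepted(FIXED, sample))
```

The original pattern accepts all five samples, while the fixed one accepts only the bare Skill("foo") and the operator-plus-one-number form.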
If I only change that rule and parse the input:
Skill("somename") < 10 & SkillExists("othername")
the following AST is created:
(as you can see, the AST needs to be better formed: i.e. you need some rewrite rules in your skill_exists, logged_in and skill rules)
EDIT
If you want successive expressions to have implied AND tokens in between, do something like this:
grammar UserAttribute;
...
tokens {
...
I_AND; // <- added a token without any text (imaginary token)
AND = '&';
...
}
andexpression
: (notexpression -> notexpression) (AND? notexpression -> ^(I_AND $andexpression notexpression))*
;
...
As you can see, since the AND is now optional, it cannot be used inside a rewrite rule, so you'll have to use the imaginary token I_AND instead.
If you now parse the input:
Skill("somename") < 10 ~ SkillExists("othername")
you will get the following AST:
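The shape of such a tree can be sketched in plain Python (not ANTLR). This is a simplified stand-in under stated assumptions: terms are reduced to bare identifiers, trees are nested tuples, and the parse function is a hypothetical helper. As in the rewrite rule, the root is always I_AND, whether or not an explicit & was written.

```python
import re

def tokenize(text):
    # Characters outside identifiers and & | ~ ( ) are silently skipped.
    return re.findall(r"[A-Za-z_]+|[&|~()]", text)

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def starts_operand(tok):
        # An identifier, '~' or '(' can begin another notexpression.
        return tok is not None and tok not in {"&", "|", ")"}

    def orexpression():
        left = andexpression()
        while peek() == "|":
            take()
            left = ("|", left, andexpression())
        return left

    def andexpression():
        # (AND? notexpression)* : the '&' is optional, the root is I_AND.
        left = notexpression()
        while peek() == "&" or starts_operand(peek()):
            if peek() == "&":
                take()
            left = ("I_AND", left, notexpression())
        return left

    def notexpression():
        if peek() == "~":
            take()
            return ("~", primaryexpression())
        return primaryexpression()

    def primaryexpression():
        tok = take()
        if tok == "(":
            inner = orexpression()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok is None or tok in {"&", "|", "~", ")"}:
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    tree = orexpression()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree
```

For example, parse("A ~ B") yields ("I_AND", "A", ("~", "B")), and parse("A & B") yields ("I_AND", "A", "B").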

Related

How to make certain rules mandatory in Antlr

I wrote the following grammar which should check for a conditional expression.
The examples below show what I want to achieve with this grammar:
test invalid
test = 1 valid
test = 1 and another_test>=0.2 valid
test = 1 kasd y = 1 invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) invalid (there cannot be a lonely character like 'c'. It should always be a triplet. i.e, literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is, how can I get the grammar to work for the examples mentioned above? Can we make certain words mandatory between two triplets (literal operator literal)? In a sense I'm just trying to get a parser to validate a where-clause condition in which only simple conditions and functions are permitted. I also want to have a visitor in Java that retrieves values such as functions, parentheses, and literals. How can I achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions -- you need to classify the functions/vars in your lexer and have a different terminal for each, which is tricky and error prone.
Instead, it is generally better to NOT do this kind of checking in the parser -- have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of operations and the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').
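As a rough illustration of the separate-pass idea, here is a minimal Python sketch. It assumes the parser has already produced a permissive tree of nested tuples; the node shapes ("lit", "call", "cmp", "and", "or") and the function names are invented for this example and are not part of ANTLR.

```python
def kind(node):
    """Classify a node as 'bool' (a condition) or 'value' (literal or call)."""
    return "bool" if node[0] in ("cmp", "and", "or") else "value"

def describe(node):
    if node[0] == "lit":
        return f"literal {node[1]!r}"
    if node[0] == "call":
        return f"call to {node[1]}"
    return f"{node[0]} expression"

def check(node, errors):
    """Walk the tree and collect type errors instead of failing on the first."""
    tag = node[0]
    if tag in ("and", "or"):
        for child in node[1:]:
            if kind(child) != "bool":
                errors.append(
                    f"operand of {tag.upper()} must be a comparison, "
                    f"got {describe(child)}")
            check(child, errors)
    elif tag == "cmp":
        _, op, left, right = node
        for child in (left, right):
            if kind(child) != "value":
                errors.append(
                    f"operand of '{op}' must be a value, got {describe(child)}")
            check(child, errors)
    elif tag == "call":
        for arg in node[2]:
            check(arg, errors)
    return errors

def check_condition(tree):
    """A whole condition must itself be boolean, so a lone 'test' is rejected."""
    errors = []
    if kind(tree) != "bool":
        errors.append(f"condition must be a comparison, got {describe(tree)}")
    return check(tree, errors)

# a = 1 or (b = 1 and c)
tree = ("or",
        ("cmp", "=", ("lit", "a"), ("lit", 1)),
        ("and", ("cmp", "=", ("lit", "b"), ("lit", 1)), ("lit", "c")))
print(check_condition(tree))
```

Because the checker sees the whole tree, it can report which operand is wrong and why, which a plain syntax error from the parser cannot do.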

antlr4 doesn't parse obvious tree

I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(sorry, I am not allowed to post images :( imagine a statement node with the Tokens
[myvar] [is] [43] [+] [23] as children all marked red. Same goes for the other statement)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where have the spaces gone? I assume it's something in my lexer, but I can't see what the problem is. Just in case, here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
here are all the Lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
Ok, found it. There was a compOP defined for the parser, and it was messing up the treegeneration.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' // <- removed this one and it works now
;
So: never assign the same keyword to Parser and Lexer, I guess.

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for context free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure if function call should be regarded as expression..
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : ( expression ( ',' expression )* )? ;

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
@modifier{public}
@ctorModifier{public}
@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your word rule match one or more times: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:
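The grouping that TEXT+ -> ^(WORD TEXT+) performs can be sketched in plain Python (not ANTLR). The sketch assumes the input is already split into tokens and that anything other than AND, OR, NOT, and the parentheses is a TEXT token; group_words is a hypothetical helper name.

```python
KEYWORDS = {"AND", "OR", "NOT", "(", ")"}

def group_words(tokens):
    """Collapse each run of consecutive TEXT tokens into one ('WORD', [...]) node."""
    result, run = [], []
    for tok in tokens:
        if tok in KEYWORDS:
            if run:                      # close the current run of TEXT tokens
                result.append(("WORD", run))
                run = []
            result.append(tok)
        else:
            run.append(tok)              # still inside a run of TEXT tokens
    if run:
        result.append(("WORD", run))
    return result

print(group_words("left right AND center".split()))
```

Here left and right end up under a single WORD node, and center gets a WORD node of its own, so the AND operator sees exactly two operands.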

Antlr parser for and/or logic - how to get expressions between logic operators?

I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below).
@Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I solve that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
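The effect of the different ID definitions can be tried out with Python regular expressions as a stand-in for the lexer rules. This is only an approximation of what ANTLR does, and the LETTER_FIRST pattern is just one possible reading of the restructuring mentioned above, under the assumption that identifiers should have to start with a letter.

```python
import re

LETTERS_ONLY = re.compile(r"[A-Za-z]+")              # ID : ('a'..'z' | 'A'..'Z')+
LETTERS_DIGITS = re.compile(r"[A-Za-z0-9]+")         # ... | '0'..'9' added
LETTER_FIRST = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # assumed restructuring

def is_id(pattern, text):
    """True if the whole text would be a single ID under the given rule."""
    return pattern.fullmatch(text) is not None

for sample in ("abc", "123", "12ab", "ab12"):
    print(sample,
          is_id(LETTERS_ONLY, sample),
          is_id(LETTERS_DIGITS, sample),
          is_id(LETTER_FIRST, sample))
```

With the letters-only rule, an operand such as 1 in "1 eq 1" can never be an ID, which is why the original grammar had nothing to match it.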
