I'm trying to parse algebraic expressions with ANTLR4. One feature I tried to accomplish with my parser is the "intelligent" handling of percent expressions.
Edit: The goal is to make the calculation of discounts or tips in a restaurant easier. E.g. if you see an advert "30% off" you could enter "price - 30%" and get the correct result. Or in a restaurant you could enter the price of your meal plus 15% and get the sum you have to pay including a tip of 15%. But this interpretation should only occur if the expression looks like "expression1 (- or +) expression2". In all other cases the percent sign should be interpreted as usual. The Google Search box calculator behaves like that./Edit
100-30% should return 70
100-(20+10)% should also return 70
3+(100-(20+10)%) should return 73
but
5% should return 0.05
(5+5)% should return 0.10
My grammar looks like this:
expr:
e EOF
;
e:
'-'a=e
| '(' a=e ')'
| a=e op=(ADD|SUB) b=e '%'
| a=e op=(ADD|SUB) b=e
| a=e'%' //**PERCENTRULE**
| FLT
;
ADD : '+' ;
SUB : '-' ;
FLT: [0-9]+(('.'|',')[0-9]+)?;
NEWLINE:'\r'? '\n' ;
WS : [ \t\n]+ -> skip ;
For the expression 100-30% I would expect the this tree:
But I get this:
How can I get the correct tree (without deleting PERCENTRULE)?
I deleted my original grammar-based answer because I realized I had a very different idea of what kind of handling you were trying to accomplish. It sounds like you want anything in the form X op Y % to become X * (1 op (Y/100)) instead. Is that accurate?
One feature I tried to accomplish with my parser is the "intelligent" handling of percent expressions:
Are you sure your specification for this is solid enough to even begin coding? It looks quite confusing to me, especially since % is more like a units-designation.
For example, I would have expected 50-30% to be either one of these:
(50 - 0.3) = 49.3
(50 - 30) / 100 = 0.20
...but what you're asking for sounds stranger still: 50 * (1 - 0.3) = 35.
That opens up additional weirdness. Wouldn't both of these be true?
0+5% would become 0 * (1 + 0.05) = 0
5% would become 5 / 100 = 0.05
This is odd because adding zero usually doesn't change what the number means.
A more-restrictive version
OK, what about allowing percentage-based changes only if the user avoids ambiguity? One way would be to create new binary operators like A -% B or A +% B, but that's not quite human-centric, so how about:
expr: e EOF ;
e
: SUB e
| parenExpr
| percentOp
| binaryOp
| FLT
;
parenExpr
: LPAREN e RPAREN
;
percentOp
: (FLT|parenExpr) (ADD|SUB) (FLT|parenExpr) PCT
;
binaryOp
: e (ADD|SUB|MUL|DIV) e
;
PCT : '%';
LPAREN : '(';
RPAREN : ')';
ADD : '+' ;
SUB : '-' ;
MUL : '*' ;
DIV : '/' ;
FLT: [0-9]+(('.'|',')[0-9]+)?;
NEWLINE:'\r'? '\n' ;
WS : [ \t\n]+ -> skip ;
This would mean:
50-5-4% is treated as 100-(5-4%) to get 45.2.
5% is not valid (on its own)
5%+4 is not valid
Related
I have the following PEG grammar defined:
Program = _{ SOI ~ Expr ~ EOF }
Expr = { UnaryExpr | BinaryExpr }
Term = _{Int | "(" ~ Expr ~ ")" }
UnaryExpr = { Operator ~ Term }
BinaryExpr = { Term ~ (Operator ~ Term)* }
Operator = { "+" | "-" | "*" | "^" }
Int = #{ Operator? ~ ASCII_DIGIT+ }
WHITESPACE = _{ " " | "\t" }
EOF = _{ EOI | ";" }
And the following expressions are all parsed correctly:
1 + 2
1 - 2
1 + -2
1 - -2
1
+1
-1
But any expression that begins with a negative number errors
-1 + 2
errors with
--> 1:4
|
1 | -1 + 2
| ^---
|
= expected EOI
What I expect (what I would like) is for -1 + 2 to be treated the same as 1 + -2, that is a Binary expression that is made up of two Unary Expressions.
I have toyed around with a lot of variations with no success. And, I'm open to using an entirely different paradigm if I need to, but I'd really like to keep the UnaryExpression idea since I've already built my parser around it.
I'm new to PEG, so I'd appreciate any help.
For what its worth, I'm using Rust v1.59 and https://pest.rs/ to both parse and test my expressions.
You have a small error in the Expr logic. The first part before the | takes precedence if both match.
And -1 is a valid UnaryExpr so the program as a whole is expected to match SOI ~ UnaryExpr ~ EOF in this case. But there is additional data (+ 2) which leads to this error.
If you reverse the possibilities of Expr so that Expr = { BinaryExpr | UnaryExpr } the example works. The reason for that is that first BinaryExpr will be checked and only if that fails UnaryExpr.
I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as much characters as possible, and when 2 or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL. And input like 0 will always be a CONSTANT. An INT token will only be created when a ~ is encountered followed by some digits. The NUM and DIGIT will never produce a token since the rules before it will be matched. The fact that NUM and DIGIT can never become tokens on their own, makes them candidates to becoming fragment tokens:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So an unary operator like ~ is often better used in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to make a distinction between a non-zero integer value as a label, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;
I've created a grammar to express a search through a Map using key and value pairs using an ANTLR4 grammar file:
START: 'SEARCH FOR';
VALUE_EXPRESSION: 'VALUE:'[a-zA-Z0-9]+;
MATCH: 'MATCHING';
COMMA: ',';
KEY_EXPRESSION: 'KEY:'[a-zA-Z0-9]*;
KEY_VALUE_PAIR: KEY_EXPRESSION MATCH VALUE_EXPRESSION;
r : START KEY_VALUE_PAIR (COMMA KEY_VALUE_PAIR)*;
WS: [ \n\t\r]+ -> skip;
The "Interpret Lexer" in ANTLRWorks produces:
And the "Parse Tree" like this:
I'm not sure if this is the correct (or even typical) way to go about parsing an input string but what I'd like to do is have each of the key/value pairs split up and placed under a parent node like such:
[SEARCH FOR] [PAIR], [PAIR]
| |
/ \ / \
/ \ / \
/ \ / \
colour red size small
My belief is that in doing this It will make like easier when I come to walk the tree.
I've searched around and tried to use the caret '^' character to specify the parent but ANTLRWorks always indicates that there is an error in my grammar.
Can anybody help with this, or possibly supply another solution (if this is an atypical approach)?
You can probably simplify this even further. You might want to have a LEXER rule for your keys to keep track of them. So below, I am simply using string as the key. But you could define a lexer rule for 'colour', 'size', etc... Also, I did away with the matching. Instead, I created a set of pairs.
grammar GRAMMAR;
start: START set ;
set
: pair (',' pair)*
;
pair: STRING ':' value ;
value
: STRING
| NUMBER
;
START: 'SEARCH FOR: ' ;
STRING : '"' [a-zA-Z_0-9]* '"' ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;
I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR Looking for advice on project. Parsing logical expression and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('&&' and)*
;
and
: not ('||' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (ie, "(A || B) AND C", I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10", see the test case below).
#Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements in my grammar file (see bellow)
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parenthesis and the &&/|| (I solve that by replacing " (" with ")" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as #Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
I am trying to parse a small expression language (I didn't define the language, from a vendor) and everything is fine until I try to use the not operator, which is a tilde in this language.
My grammar has been heavily influenced by these two links (aka shameless cut and pasting):
http://www.codeproject.com/KB/recipes/sota_expression_evaluator.aspx http://www.alittlemadness.com/2006/06/05/antlr-by-example-part-1-the-language
The language consists of three expression types that can be used with and, or, not operators and parenthesis change precedence. Expressions are:
Skill("name") > some_number (can also be <, >=, <=, =, !=)
SkillExists("name")
LoggedIn("name") (this one can also have name#name)
This input works fine:
Skill("somename") > 1 | (LoggedIn("somename") & SkillExists("othername"))
However, as soon as I try to use the not operator I get NoViableAltException. I can't figure out why. I have compared my grammar to the ECalc.g one at the codeproject.com link and they seem to match, there must be some subtle difference I can't see. Fails:
Skill("somename") < 10 ~ SkillExists("othername")
My Grammar:
grammar UserAttribute;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
SKILL = 'Skill' ;
SKILL_EXISTS = 'SkillExists' ;
LOGGED_IN = 'LoggedIn';
GT = '>';
LT = '<';
LTE = '<=';
GTE = '>=';
EQUALS = '=';
NOT_EQUALS = '!=';
AND = '&';
OR = '|' ;
NOT = '~';
LPAREN = '(';
RPAREN = ')';
QUOTE = '"';
AT = '#';
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
expression : orexpression EOF!;
orexpression : andexpression (OR^ andexpression)*;
andexpression : notexpression (AND^ notexpression)*;
notexpression : primaryexpression | NOT^ primaryexpression;
primaryexpression : term | LPAREN! orexpression RPAREN!;
term : skill_exists | skill | logged_in;
skill_exists : SKILL_EXISTS LPAREN QUOTE NAME QUOTE RPAREN;
logged_in : LOGGED_IN LPAREN QUOTE NAME (AT NAME)? QUOTE RPAREN;
skill: SKILL LPAREN QUOTE NAME QUOTE RPAREN ((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)? NUMBER*)?;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NAME : ('a'..'z' | 'A'..'Z' | '_')+;
NUMBER : ('0'..'9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
I have 2 remarks:
1
Since you're parsing single expressions (expression : orexpression EOF!;), the input "Skill("somename") < 10 ~ SkillExists("othername")" is not only invalid in your grammar, but it's invalid in terms of any expression parser (I know of). A notexpression only takes a "right-hand-side" expression, so ~ SkillExists("othername") is a single expression and Skill("somename") < 10 is also a single expression. But in between those two single expression, there's no OR or AND operator. It would be the same as evaluating the expression true false instead of true | false or true and false.
In short, your grammar disallows:
Skill("somename") < 10 ~ SkillExists("othername")
but allows for:
Skill("somename") < 10 & SkillExists("othername")
which seems logical to me.
2
I don't quite understand your skill rule (which is ambiguous, btw):
skill
: SKILL LPAREN QUOTE NAME QUOTE RPAREN
((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)? NUMBER*)?
;
This means that the operator is optional and there can be zero or more numbers at the end. This means that the following input are all valid:
Skill("foo") = 10 20
Skill("foo") 10 20 30
Skill("foo") <
Perhaps you meant:
skill
: SKILL LPAREN QUOTE NAME QUOTE RPAREN
((GT | LT| LTE | GTE | EQUALS | NOT_EQUALS)^ NUMBER)?
;
instead? (the ? becomes a ^ and the * is removed)
If I only change that rule and parse the input:
Skill("somename") < 10 & SkillExists("othername")
the following AST is created:
(as you can see, the AST needs to be better formed: i.e. you need some rewrite rules in your skill_exists, logged_in and skill rules)
EDIT
and if you want successive expressions to have implied AND tokens in between, do something like this:
grammar UserAttribute;
...
tokens {
...
I_AND; // <- added a token without any text (imaginary token)
AND = '&';
...
}
andexpression
: (notexpression -> notexpression) (AND? notexpression -> ^(I_AND $andexpression notexpression))*
;
...
As you can see, since the AND is now optional, it cannot be used inside a rewrite rule, but you'll have to use the imaginary token I_AND.
If you now parse the input:
Skill("somename") < 10 ~ SkillExists("othername")
you will get the following AST: