ANTLR4 grammar to specify parent child relationship - parsing

I've created a grammar to express a search through a Map using key and value pairs using an ANTLR4 grammar file:
START: 'SEARCH FOR';
VALUE_EXPRESSION: 'VALUE:'[a-zA-Z0-9]+;
MATCH: 'MATCHING';
COMMA: ',';
KEY_EXPRESSION: 'KEY:'[a-zA-Z0-9]*;
KEY_VALUE_PAIR: KEY_EXPRESSION MATCH VALUE_EXPRESSION;
r : START KEY_VALUE_PAIR (COMMA KEY_VALUE_PAIR)*;
WS: [ \n\t\r]+ -> skip;
The "Interpret Lexer" in ANTLRWorks produces:
And the "Parse Tree" like this:
I'm not sure if this is the correct (or even typical) way to go about parsing an input string but what I'd like to do is have each of the key/value pairs split up and placed under a parent node like such:
[SEARCH FOR]    [PAIR],      [PAIR]
                  |            |
                 / \          / \
                /   \        /   \
               /     \      /     \
           colour   red   size   small
My belief is that doing this will make life easier when I come to walk the tree.
I've searched around and tried to use the caret '^' character to specify the parent but ANTLRWorks always indicates that there is an error in my grammar.
Can anybody help with this, or possibly supply another solution (if this is an atypical approach)?

You can probably simplify this even further. You might want to have a LEXER rule for your keys to keep track of them. So below, I am simply using string as the key. But you could define a lexer rule for 'colour', 'size', etc... Also, I did away with the matching. Instead, I created a set of pairs.
grammar GRAMMAR;
start: START set ;
set
: pair (',' pair)*
;
pair: STRING ':' value ;
value
: STRING
| NUMBER
;
START: 'SEARCH FOR: ' ;
STRING : '"' [a-zA-Z_0-9]* '"' ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;
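To make the shape of the result concrete, here is a minimal plain-Java sketch of what the pairs boil down to once they are collected. It is not ANTLR-generated code: the class name PairSketch, the method pairs and the regular expression are assumptions made for illustration, numbers are simplified to integers and decimals, and nothing is validated beyond the START prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not produced by ANTLR: collects "key":value pairs into a map.
final class PairSketch {
    private static final String PREFIX = "SEARCH FOR: "; // mirrors the START token

    // One pair: a quoted key, a colon, then a quoted string or a simple number.
    private static final Pattern PAIR = Pattern.compile(
        "\"([a-zA-Z_0-9]*)\"\\s*:\\s*(\"[a-zA-Z_0-9]*\"|-?[0-9]+(?:\\.[0-9]+)?)");

    static Map<String, String> pairs(String input) {
        if (!input.startsWith(PREFIX)) {
            throw new IllegalArgumentException("input does not start with the START token");
        }
        Map<String, String> result = new LinkedHashMap<>(); // keeps the order of the pairs
        Matcher m = PAIR.matcher(input.substring(PREFIX.length()));
        while (m.find()) {
            // group(1) is the key, group(2) the value; quotes around string values are dropped
            result.put(m.group(1), m.group(2).replace("\"", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("SEARCH FOR: \"colour\":\"red\",\"size\":\"small\""));
    }
}
```

With a real ANTLR 4 parser you would collect the same map from the pair nodes of the parse tree instead of using a regular expression.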

Related

ANTLR4 grammar for SML choking on positive integer literals

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as many characters as possible, and when two or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL, and input like 0 will always be a CONSTANT. Because CONSTANT is defined before INT and matches exactly the same text, input like ~5 also becomes a CONSTANT, so INT never produces a token. NUM and DIGIT never produce a token either, since a rule defined before them always matches the same input. The fact that NUM and DIGIT can never become tokens on their own makes them candidates for becoming fragment rules:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
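The "longest match, then first rule wins" behaviour can be imitated in a few lines of plain Java. This is only a sketch of the selection rule, not of ANTLR's actual lexer: the class LexerRuleDemo and the regular expressions (the original rules with NUM and DIGIT expanded inline) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of how a lexer picks a rule: longest match, ties go to the first rule.
final class LexerRuleDemo {
    // Rules in grammar order; LinkedHashMap preserves that order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LABEL", Pattern.compile("[1-9][0-9]*")); // [1-9] NUM*
        RULES.put("CONSTANT", Pattern.compile("~?[0-9]+")); // INT
        RULES.put("INT", Pattern.compile("~?[0-9]+"));      // '~'? NUM
    }

    /** Returns the name of the rule that wins for the token at the start of the input, or null. */
    static String tokenType(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule defined earlier is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1", "0", "10"}) {
            System.out.println(s + " -> " + tokenType(s));
        }
    }
}
```

Here tokenType("1") and tokenType("10") give LABEL, while tokenType("0") gives CONSTANT, which is why a parser rule that expects CONSTANT rejects the input 1.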
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So an unary operator like ~ is often better used in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to distinguish non-zero integers so that only they can be used as labels, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;

Parsing percent expressions with antlr4

I'm trying to parse algebraic expressions with ANTLR4. One feature I tried to accomplish with my parser is the "intelligent" handling of percent expressions.
Edit: The goal is to make the calculation of discounts or tips in a restaurant easier. E.g. if you see an advert "30% off" you could enter "price - 30%" and get the correct result. Or in a restaurant you could enter the price of your meal plus 15% and get the sum you have to pay including a tip of 15%. But this interpretation should only occur if the expression looks like "expression1 (- or +) expression2". In all other cases the percent sign should be interpreted as usual. The Google Search box calculator behaves like that.
100-30% should return 70
100-(20+10)% should also return 70
3+(100-(20+10)%) should return 73
but
5% should return 0.05
(5+5)% should return 0.10
My grammar looks like this:
expr:
e EOF
;
e:
'-'a=e
| '(' a=e ')'
| a=e op=(ADD|SUB) b=e '%'
| a=e op=(ADD|SUB) b=e
| a=e'%' //**PERCENTRULE**
| FLT
;
ADD : '+' ;
SUB : '-' ;
FLT: [0-9]+(('.'|',')[0-9]+)?;
NEWLINE:'\r'? '\n' ;
WS : [ \t\n]+ -> skip ;
For the expression 100-30% I would expect this tree:
But I get this:
How can I get the correct tree (without deleting PERCENTRULE)?
I deleted my original grammar-based answer because I realized I had a very different idea of what kind of handling you were trying to accomplish. It sounds like you want anything in the form X op Y % to become X * (1 op (Y/100)) instead. Is that accurate?
One feature I tried to accomplish with my parser is the "intelligent" handling of percent expressions:
Are you sure your specification for this is solid enough to even begin coding? It looks quite confusing to me, especially since % is more like a units-designation.
For example, I would have expected 50-30% to be either one of these:
(50 - 0.3) = 49.7
(50 - 30) / 100 = 0.20
...but what you're asking for sounds stranger still: 50 * (1 - 0.3) = 35.
That opens up additional weirdness. Wouldn't both of these be true?
0+5% would become 0 * (1 + 0.05) = 0
5% would become 5 / 100 = 0.05
This is odd because adding zero usually doesn't change what the number means.
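To put numbers on the two readings, here is a small Java sketch. The class and method names are made up for illustration; the formulas are just the ones written out above.

```java
// Hypothetical helper contrasting the two readings of the percent sign.
final class PercentDemo {
    /** "X op Y%" read as a relative change: X * (1 op Y/100), with op being '+' or '-'. */
    static double relativeChange(double x, char op, double yPercent) {
        double delta = x * yPercent / 100.0; // Y percent of X
        return op == '+' ? x + delta : x - delta;
    }

    /** A bare "Y%" read as a plain fraction: Y / 100. */
    static double plainPercent(double y) {
        return y / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(relativeChange(50, '-', 30)); // 35.0
        System.out.println(relativeChange(0, '+', 5));   // 0.0
        System.out.println(plainPercent(5));             // 0.05
    }
}
```

The last two lines show the oddity: 0+5% and 5% give different results even though only a zero was added.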
A more-restrictive version
OK, what about allowing percentage-based changes only if the user avoids ambiguity? One way would be to create new binary operators like A -% B or A +% B, but that's not quite human-centric, so how about:
expr: e EOF ;
e
: SUB e
| parenExpr
| percentOp
| e (ADD|SUB|MUL|DIV) e // binary operators, inlined: ANTLR 4 rejects indirect left recursion (e -> binaryOp -> e)
| FLT
;
parenExpr
: LPAREN e RPAREN
;
percentOp
: (FLT|parenExpr) (ADD|SUB) (FLT|parenExpr) PCT
;
PCT : '%';
LPAREN : '(';
RPAREN : ')';
ADD : '+' ;
SUB : '-' ;
MUL : '*' ;
DIV : '/' ;
FLT: [0-9]+(('.'|',')[0-9]+)?;
NEWLINE:'\r'? '\n' ;
WS : [ \t\n]+ -> skip ;
This would mean:
50-5-4% is treated as 50-(5-4%) to get 45.2.
5% is not valid (on its own)
5%+4 is not valid

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
@modifier{public}
@ctorModifier{public}
@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:
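The grouping done by the rewrite rule TEXT+ -> ^(WORD TEXT+) can also be imitated in plain Java on a list of tokens. This is only a sketch of the idea, not the AST that ANTLR builds: the class WordGrouping and its text output are assumptions made for illustration, and quoted text is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: wraps each run of consecutive plain words in a single WORD group.
final class WordGrouping {
    private static final Set<String> NON_TEXT =
        new HashSet<>(Arrays.asList("AND", "OR", "NOT", "(", ")"));

    static List<String> group(List<String> tokens) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>(); // current run of TEXT tokens
        for (String token : tokens) {
            if (NON_TEXT.contains(token)) {
                flush(run, out); // an operator or parenthesis ends the current run
                out.add(token);
            } else {
                run.add(token);
            }
        }
        flush(run, out);
        return out;
    }

    private static void flush(List<String> run, List<String> out) {
        if (!run.isEmpty()) {
            out.add("(WORD " + String.join(" ", run) + ")");
            run.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("left", "right", "AND", "center")));
    }
}
```

For the tokens left, right, AND, center this yields (WORD left right), AND, (WORD center): both operands of AND are single WORD groups, however many words they contain.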

Antlr parser for and/or logic - how to get expressions between logic operators?

I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading the post "Looking for advice on project. Parsing logical expression" on logic expressions in ANTLR, and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case. In the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below.
@Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I solve that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
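The difference between the two ID definitions is easy to check with java.util.regex, since both rules are plain character classes. This is a sketch only: the class IdRuleDemo and its constants are names made up for illustration, and a regular expression is not the ANTLR lexer.

```java
import java.util.regex.Pattern;

// Hypothetical check of the two ID definitions, written as regular expressions.
final class IdRuleDemo {
    static final Pattern LETTERS_ONLY = Pattern.compile("[a-zA-Z]+");          // original ID
    static final Pattern LETTERS_AND_DIGITS = Pattern.compile("[a-zA-Z0-9]+"); // suggested ID

    /** True if the whole text would be a single ID under the given definition. */
    static boolean isId(Pattern rule, String text) {
        return rule.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "123", "12ab", "ab12"}) {
            System.out.println(s + ": letters only = " + isId(LETTERS_ONLY, s)
                + ", letters and digits = " + isId(LETTERS_AND_DIGITS, s));
        }
    }
}
```

Only abc is an ID under the original definition; all four inputs are accepted once digits are allowed, which is what an operand like 1 in "1 eq 1" needs.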

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
@lexer::header {
package ConFreeG;
}
@header {
package ConFreeG;
import ConFreeG.IR.*;
}
@parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.
You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule:
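What the EOF buys you can be seen without ANTLR at all. The hand-written Java recognizer below is only a stand-in for the generated parser: the class EofDemo, its methods and the tokenizer are assumptions made for illustration, and the actions from the grammar are left out. arithOnly mimics starting at arith and is satisfied as soon as one arith has been matched; parse mimics arith EOF and also requires that no tokens are left over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical recognizer for: arith : term | '(' arith ('-'|'+'|'*') arith ')' ;
final class EofDemo {
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z][A-Za-z0-9]*|0|[1-9][0-9]*|[()+*-]");

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            tokens.add(m.group()); // whitespace between tokens is skipped
        }
        return tokens;
    }

    /** Matches one arith starting at pos; returns the index after it, or -1 on failure. */
    static int arith(List<String> t, int pos) {
        if (pos >= t.size()) return -1;
        String tok = t.get(pos);
        if (tok.equals("(")) {
            int p = arith(t, pos + 1);
            if (p < 0 || p >= t.size() || !"-+*".contains(t.get(p))) return -1;
            p = arith(t, p + 1);
            if (p < 0 || p >= t.size() || !t.get(p).equals(")")) return -1;
            return p + 1;
        }
        // term : NUM | IDENT (both start with a letter or a digit)
        return Character.isLetterOrDigit(tok.charAt(0)) ? pos + 1 : -1;
    }

    /** Like starting at arith: succeeds even if tokens remain after the match. */
    static boolean arithOnly(String src) {
        return arith(tokenize(src), 0) >= 0;
    }

    /** Like parse : arith EOF ; succeeds only if every token was consumed. */
    static boolean parse(String src) {
        List<String> t = tokenize(src);
        return arith(t, 0) == t.size();
    }

    public static void main(String[] args) {
        System.out.println(arithOnly("8 + a45")); // true: stops after the term 8
        System.out.println(parse("8 + a45"));     // false: "+ a45" is left over
        System.out.println(parse("(8 + a45)"));   // true
    }
}
```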

Resources