ANTLR4: how to match kv expression with same rule - token

I have the following statement I wish to parse:
key=value
key: [a-zA-Z] ([a-zA-Z0-9_-])*
value: [a-zA-Z] ([a-zA-Z0-9_-])*
The parser always gets confused because key and value have the same rule.
My failing grammar:
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
;
kv: KEY OP VALUE;
JOINER: '&';
KEY : [a-zA-Z] ([a-zA-Z0-9])*;
OP : '=';
VALUE : [a-zA-Z0-9];
WS : [ \t]+ -> skip ; // toss out whitespace
But this other grammar works:
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
;
kv: KV;
KV: [a-zA-Z] ([a-zA-Z0-9_-])* '=' [a-zA-Z0-9] ([a-zA-Z0-9._-])*;
JOINER: '&';
WS : [ \t]+ -> skip ; // toss out whitespace
Why?

ANTLR will always create a KEY token for the input foo. Even if the input is mu = foo, two KEY tokens will be created (with an OP token in between).
This is simply how ANTLR's lexer works. The lexer is not "driven" by the parser. It doesn't matter if the parser is trying to match a VALUE token, the input foo will always be a KEY token.
These are the 2 rules by which the lexer creates tokens:
create the longest possible match
if there are 2 or more lexer rules that match the same characters, let the one defined first "win"
Because of rule 2, you can see why KEY will be created for foo and not a VALUE.
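The two lexer rules above can be imitated in a few lines of Python. This is only a rough sketch: Python regular expressions stand in for the ANTLR lexer rules of the question's first grammar, and the tokenize helper is a hypothetical name, not part of ANTLR.

```python
import re

# Lexer rules in definition order, mirroring the question's first grammar.
# Python regexes are used as an approximation of ANTLR lexer rules.
RULES = [
    ("JOINER", r"&"),
    ("KEY", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("OP", r"="),
    ("VALUE", r"[a-zA-Z0-9]"),
]
SKIP = re.compile(r"[ \t]+")  # plays the role of `WS -> skip`


def tokenize(text, rules=RULES):
    """Return (rule_name, lexeme) pairs: longest match, first rule wins on ties."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = SKIP.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match equally long text.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("mu = foo"))
# [('KEY', 'mu'), ('OP', '='), ('KEY', 'foo')]
```

Note that the parser never appears in this loop: the token types are decided from the characters alone, which is why foo cannot become a VALUE here.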
To fix this, do something like this:
kv : KEY OP value;
value : KEY | VALUE;
JOINER : '&';
KEY : [a-zA-Z] [a-zA-Z0-9]*;
VALUE : [a-zA-Z0-9]+; // matches an ID starting with a digit
OP : '=';

Related

ANTLR4 grammar for SML choking on positive integer literals

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as many characters as possible, and when 2 or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL, and input like 0 will always be a CONSTANT. Input such as ~5 also becomes a CONSTANT rather than an INT: both rules match the same text, and CONSTANT is defined first. NUM and DIGIT will never produce a token either, since the rules before them will always be matched. The fact that NUM and DIGIT can never become tokens on their own makes them candidates for becoming fragment rules:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
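To see why 1 ends up as a LABEL while 0 ends up as a CONSTANT, here is a small Python sketch. It assumes that Python regexes are a fair stand-in for the five lexer rules (with NUM and DIGIT inlined by hand), and first_token is a hypothetical helper name.

```python
import re

# The question's lexer rules in definition order, with NUM/DIGIT inlined.
RULES = [
    ("LABEL", r"[1-9][0-9]*"),
    ("CONSTANT", r"~?[0-9]+"),
    ("INT", r"~?[0-9]+"),
    ("NUM", r"[0-9]+"),
    ("DIGIT", r"[0-9]"),
]


def first_token(text):
    """Return (rule_name, lexeme) for the token at the start of text, or None."""
    candidates = []
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m:
            candidates.append((name, m.group()))
    if not candidates:
        return None
    # max() returns the first maximal element, so ties go to the earlier rule.
    return max(candidates, key=lambda c: len(c[1]))


print(first_token("1"))   # ('LABEL', '1')
print(first_token("0"))   # ('CONSTANT', '0')
print(first_token("42"))  # ('LABEL', '42')
```

Every rule after LABEL also matches 1, but none of them matches more characters, so the earliest definition wins.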
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So a unary operator like ~ is often better used in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to distinguish a non-zero integer (which can serve as a label) from other integers, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;

ANTLR4 parsing a keyword-contained variable name

I'm trying to parse a simple integer declaration in antlr4.
The grammar I'm doing now is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.
The lexer creates tokens with as many characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
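The difference between the two grammars can be checked with a small simulation. This is a sketch under two assumptions: Python regexes approximate the lexer rules, and the literals 'int' and '=' used in the parser rule are treated as if they were defined before the named lexer rules. The tokenize helper and the ORIGINAL/FIXED names are hypothetical.

```python
import re

# Literal tokens first (assumed priority), then the named lexer rules.
ORIGINAL = [("'int'", r"int"), ("'='", r"="),
            ("LETTER", r"[a-zA-Z_]"), ("NUMBER", r"[0-9]")]
FIXED = [("'int'", r"int"), ("'='", r"="),
         ("VAR", r"[a-zA-Z_][a-zA-Z_0-9]*"), ("NUMBER", r"[0-9]+")]


def tokenize(text, rules):
    """Return token type names: longest match, earliest rule wins on ties."""
    types, pos = [], 0
    while pos < len(text):
        if text[pos] in " \t\r\n":  # stands in for `WS -> skip`
            pos += 1
            continue
        matches = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                matches.append((name, m.group()))
        if not matches:
            raise ValueError(f"no rule matches at position {pos}")
        # max() keeps the first of several equally long matches.
        name, lexeme = max(matches, key=lambda t: len(t[1]))
        types.append(name)
        pos += len(lexeme)
    return types


print(tokenize("int int_A = 0", ORIGINAL))
# ["'int'", "'int'", 'LETTER', 'LETTER', "'='", 'NUMBER']
print(tokenize("int int_A = 0", FIXED))
# ["'int'", 'VAR', "'='", 'NUMBER']
```

With VAR as a lexer rule, int_A is a five-character match that beats the three-character keyword, while a bare int still ties with VAR and stays a keyword.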

ANTLR4 grammar to specify parent child relationship

I've created a grammar to express a search through a Map using key and value pairs using an ANTLR4 grammar file:
START: 'SEARCH FOR';
VALUE_EXPRESSION: 'VALUE:'[a-zA-Z0-9]+;
MATCH: 'MATCHING';
COMMA: ',';
KEY_EXPRESSION: 'KEY:'[a-zA-Z0-9]*;
KEY_VALUE_PAIR: KEY_EXPRESSION MATCH VALUE_EXPRESSION;
r : START KEY_VALUE_PAIR (COMMA KEY_VALUE_PAIR)*;
WS: [ \n\t\r]+ -> skip;
The "Interpret Lexer" in ANTLRWorks produces: [screenshot not included]
And the "Parse Tree" looks like this: [screenshot not included]
I'm not sure if this is the correct (or even typical) way to go about parsing an input string but what I'd like to do is have each of the key/value pairs split up and placed under a parent node like such:
[SEARCH FOR]    [PAIR],          [PAIR]
                   |                |
                  / \              / \
                 /   \            /   \
                /     \          /     \
          colour       red   size       small
My belief is that doing this will make life easier when I come to walk the tree.
I've searched around and tried to use the caret '^' character to specify the parent but ANTLRWorks always indicates that there is an error in my grammar.
Can anybody help with this, or possibly supply another solution (if this is an atypical approach)?
You can probably simplify this even further. You might want to have a LEXER rule for your keys to keep track of them. So below, I am simply using string as the key. But you could define a lexer rule for 'colour', 'size', etc... Also, I did away with the matching. Instead, I created a set of pairs.
grammar GRAMMAR;
start: START set ;
set
: pair (',' pair)*
;
pair: STRING ':' value ;
value
: STRING
| NUMBER
;
START: 'SEARCH FOR: ' ;
STRING : '"' [a-zA-Z_0-9]* '"' ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

cannot create implicit token for string literal in non-combined grammar

so found a nice grammar for a calculator and copied it with some lil changes from here:
https://dexvis.wordpress.com/2012/11/22/a-tale-of-two-grammars/
I have two Files: Parser and Lexer. Looks like this:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) ';')+;
assignment : ID '=' expression;
expression
: '(' expression ')' # parenExpression
| expression ('*'|'/') expression # multOrDiv
| expression ('+'|'-') expression # addOrSubtract
| 'print' arg (',' arg)* # print
| STRING # string
| ID # identifier
| INT # integer;
arg : ID|STRING;
and the Lexer:
lexer grammar WRBLexer;
STRING : '"' (' '..'~')* '"';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
Basically I just split the lexer and parser into two files.
But when I try to save, I get some errors:
error(126): Parser.g4:9:35: cannot create implicit token for string literal in non-combined grammar: ';'
error(126): Parser.g4:11:16: cannot create implicit token for string literal in non-combined grammar: '='
error(126): Parser.g4:2:13: cannot create implicit token for string literal in non-combined grammar: '('
error(126): Parser.g4:2:28: cannot create implicit token for string literal in non-combined grammar: ')'
error(126): Parser.g4:3:10: cannot create implicit token for string literal in non-combined grammar: 'print'
error(126): Parser.g4:3:23: cannot create implicit token for string literal in non-combined grammar: ','
error(126): Parser.g4:9:37: cannot create implicit token for string literal in non-combined grammar: '*'
error(126): Parser.g4:9:41: cannot create implicit token for string literal in non-combined grammar: '/'
error(126): Parser.g4:10:47: cannot create implicit token for string literal in non-combined grammar: '+'
error(126): Parser.g4:10:51: cannot create implicit token for string literal in non-combined grammar: '-'
10 error(s)
Hope someone can help me with this.
Best regards
All literal tokens inside your parser grammar: '*', '/', etc. need to be defined in your lexer grammar:
lexer grammar WRBLexer;
ADD : '+';
MUL : '*';
...
And then in your parser grammar, you'd do:
expression
: ...
| expression (MUL|DIV) expression # multOrDiv
| expression (ADD|SUB) expression # addOrSubtract
| ...
;
Since you have split the grammar into two files, all of your symbols must be defined in the lexer file.
I suggest doing this:
In Lexer file:
STRING : '"' (' '..'~')* '"';
PRINT : 'print'; // keywords go before ID, otherwise 'print' is lexed as an ID
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
ADD_SUB: '+' | '-';
MUL_DIV: '*' | '/';
COMMA : ',';
Lb : '(';
Rb : ')';
COLON : ';';
EQUAL : '=';
And your Parser:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) COLON)+;
assignment : ID EQUAL expression;
expression
: Lb expression Rb # parenExpression
| expression MUL_DIV expression # multOrDiv
| expression ADD_SUB expression # addOrSubtract
| PRINT arg (COMMA arg)* # print
| STRING # string
| ID # identifier
| INT # integer
;
arg : ID|STRING;
Actually, it's okay to write literal tokens inside your parser rules, as long as the lexer grammar defines a token for each literal. You can also name literal tokens. For example,
expr: expr op=('*' | '/') expr # binaryExpr
| expr op=('+' | '-') expr # binaryExpr
| Number # number
;
Number: blah blah ;
Star : '*';
Div : '/';
Plus : '+';
Minus: '-';
And you can write the listener as follows:
class BinaryExpr {
public enum BinaryOp {
// ...
}
// ...
}
public class MyListener extends YourGrammarBaseListener {
@Override
public void exitBinaryExpr(YourGrammarParser.BinaryExprContext ctx) {
BinaryExpr.BinaryOp op;
switch (ctx.op.getType()) {
case YourGrammarParser.Star: op = BinaryExpr.BinaryOp.MUL; break;
case YourGrammarParser.Div: op = BinaryExpr.BinaryOp.DIV; break;
case YourGrammarParser.Plus: op = BinaryExpr.BinaryOp.ADD; break;
case YourGrammarParser.Minus: op = BinaryExpr.BinaryOp.SUB; break;
default: throw new RuntimeException("Unknown binary op.");
}
// ...
}
}

Antlr parser for and/or logic - how to get expressions between logic operators?

I am using ANTLR to create an and/or parser+evaluator. Expressions will have the format like:
x eq 1 && y eq 10
(x lt 10 && x gt 1) OR x eq -1
I was reading this post on logic expressions in ANTLR, "Looking for advice on project. Parsing logical expression", and I found the grammar posted there a good start:
grammar Logic;
parse
: expression EOF
;
expression
: implication
;
implication
: or ('->' or)*
;
or
: and ('||' and)*
;
and
: not ('&&' not)*
;
not
: '~' atom
| atom
;
atom
: ID
| '(' expression ')'
;
ID : ('a'..'z' | 'A'..'Z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;};
However, while getting a tree from the parser works for expressions where the variables are just one character (i.e., "(A || B) AND C"), I am having a hard time adapting this to my case (in the example "x eq 1 && y eq 10" I'd expect one "AND" parent and two children, "x eq 1" and "y eq 10"; see the test case below).
@Test
public void simpleAndEvaluation() throws RecognitionException{
String src = "1 eq 1 && B";
LogicLexer lexer = new LogicLexer(new ANTLRStringStream(src));
LogicParser parser = new LogicParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
assertEquals("&&",tree.getText());
assertEquals("1 eq 1",tree.getChild(0).getText());
assertEquals("a neq a",tree.getChild(1).getText());
}
I believe this is related with the "ID". What would the correct syntax be?
For those interested, I made some improvements to my grammar file (see below).
Current limitations:
only works with &&/||, not AND/OR (not very problematic)
you can't have spaces between the parentheses and the &&/|| (I work around that by replacing " (" with "(" and ") " with ")" in the source String before feeding the lexer)
grammar Logic;
options {
output = AST;
}
tokens {
AND = '&&';
OR = '||';
NOT = '~';
}
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| SYMBOL
)+
;
SYMBOL
:
('+'|'-'|'*'|'/'|'_')
;
ID : ('a'..'z' | 'A'..'Z')+;
states that an identifier is a sequence of one or more letters, but does not allow any digits. Try
ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+;
which will allow e.g. abc, 123, 12ab, and ab12. If you don't want the latter types, you'll have to restructure the rule a little bit (left as a challenge...)
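One possible restructuring, if "the latter types" is read as names that begin with a digit, is to require a leading letter. The sketch below checks that idea with Python regexes standing in for the lexer rule; LOOSE, STRICT and accepts are hypothetical names, and the equivalent ANTLR rule would have to be written separately.

```python
import re

# LOOSE mirrors ID : ('a'..'z' | 'A'..'Z' | '0'..'9')+ ;
LOOSE = re.compile(r"[a-zA-Z0-9]+")
# STRICT is one assumed restructuring: a letter first, then letters or digits.
STRICT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def accepts(pattern, text):
    """True when the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for name in ["abc", "ab12", "123", "12ab"]:
    print(name, accepts(LOOSE, name), accepts(STRICT, name))
```

Keep in mind that a leading-letter rule would reject plain numbers such as 1, so an expression like 1 eq 1 would then need a separate number token.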
In order to accept arbitrarily many identifiers, you could define atom as ID+ instead of ID.
Also, you will likely need to specify AND, OR, -> and ~ as tokens so that, as @Bart Kiers says, the first two won't get classified as ID, and so that the latter two will get recognized at all.
