ANTLR4 parsing a keyword-contained variable name - parsing

I'm trying to parse a simple integer declaration in antlr4.
The grammar I'm doing now is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.

The lexer creates tokens with as much characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;

Related

ANTLR4: how to match kv expression with same rule

I have the following statement I wish to parse:
key=value
key: [a-zA-Z] ([a-zA-Z0-9_-])*
value: [a-zA-Z] ([a-zA-Z0-9_-])*
The parser is always confused as key and value have the same rule.
my error grammar:
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
;
kv: KEY OP VALUE;
JOINER: '&';
KEY : [a-zA-Z] ([a-zA-Z0-9])*;
OP : '=';
VALUE : [a-zA-Z0-9];
WS : [ \t]+ -> skip ; // toss out whitespace
but another grammar can run :
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
; kv: KV;
KV: [a-zA-Z] ([a-zA-Z0-9_-])* '=' [a-zA-Z0-9] ([a-zA-Z0-9._-])*;
JOINER: '&';
WS : [ \t]+ -> skip ; // toss out whitespace
why?
ANTLR will always create a KEY token for the input foo. No matter if the input is mu = foo, then too will there be 2 KEY tokens created (with an OP token in between).
This is simply how ANTLR's lexer works. The lexer is not "driven" by the parser. It doesn't matter if the parser is trying to match a VALUE token, the input foo will always be a KEY token.
These are the 2 rules by which the lexer creates tokens:
create the longest possible match
if there are 2 or more lexer rules than match the same characters, let the one defined first "win"
Because of rule 2, you can see why KEY will be created for foo and not a VALUE.
To fix this, do something like this:
kv : KEY OP value;
value : KEY | VALUE;
JOINER : '&';
KEY : [a-zA-Z] [a-zA-Z0-9]*;
VALUE : [a-zA-Z0-9]+ // matches an ID starting with a digit
OP : '=';

ANTLR4 grammar for SML choking on positive integer literals

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:
# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)
I've trimmed as much as I can from the grammar to still show this issue, which appears very strange. This grammar shows the issue (despite LABEL not even being used):
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
On the other hand, removing LABEL makes positive numbers work again:
grammar SML_Small;
Whitespace : [ \t\r\n]+ -> skip ;
expression : CONSTANT ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.
I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.
As already mentioned in the comments by Rici: the lexer tries to match as much characters as possible, and when 2 or more rules match the same characters, the one defined first "wins". So with rules like these:
LABEL : [1-9] NUM* ;
CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;
the input 1 will always become a LABEL. And input like 0 will always be a CONSTANT. An INT token will only be created when a ~ is encountered followed by some digits. The NUM and DIGIT will never produce a token since the rules before it will be matched. The fact that NUM and DIGIT can never become tokens on their own, makes them candidates to becoming fragment tokens:
fragment NUM : DIGIT+ ;
fragment DIGIT : [0-9] ;
That way, you can't accidentally use these tokens inside parser rules.
Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So an unary operator like ~ is often better used in a parser rule: expression : '~' expression | ... ;.
Finally, if you want to make a distinction between a non-zero integer value as a label, you can do it like this:
grammar SML_Small;
expression
: '(' expression ')'
| '~' expression
| integer
;
integer
: INT
| INT_NON_ZERO
;
label
: INT_NON_ZERO
;
INT_NON_ZERO : [1-9] DIGIT* ;
INT : DIGIT+ ;
SPACES : [ \t\r\n]+ -> skip ;
fragment DIGIT : [0-9] ;

"No viable alternative at input" error for ANTLR 4 JSON grammar

I am trying to adapt the STRING part of Pair in Object to a CamelString, but it fails. and report "No viable alternative at input".
I have tried to used my CamelString as an independent grammar, it works well. I think it means there is ambiguity in my grammar, but I can not understand why.
For the test input
{
'BaaaBcccCdddd':'cc'
}
Ther error is
line 2:2 no viable alternative at input '{'BaaaBcccCdddd''
The following is my grammar. It's almost the same with the standard JSON grammar for ANTLR 4.
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar Json;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair : camel_string ':' value;
camel_string : '\'' (camel_body)*? '\'';
STRING
: '\'' (ESC | ~['\\])* '\'';
camel_body: CAMEL_BODY;
CAMEL_START: [a-z] ALPHA_NUM_LOWER*;
CAMEL_BODY: [A-Z] ALPHA_NUM_LOWER*;
CAMEL_END: [A-Z]+;
fragment ALPHA_NUM_LOWER : [0-9a-z];
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // recursion
| array // recursion
| 'true' // keywords
| 'false'
| 'null'
;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

cannot create implicit token for string literal in non-combined grammar

so found a nice grammar for a calculator and copied it with some lil changes from here:
https://dexvis.wordpress.com/2012/11/22/a-tale-of-two-grammars/
I have two Files: Parser and Lexer. Looks like this:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) ';')+;
assignment : ID '=' expression;
expression
: '(' expression ')' # parenExpression
| expression ('*'|'/') expression # multOrDiv
| expression ('+'|'-') expression # addOrSubtract
| 'print' arg (',' arg)* # print
| STRING # string
| ID # identifier
| INT # integer;
arg : ID|STRING;
and the Lexer:
lexer grammar WRBLexer;
STRING : '"' (' '..'~')* '"';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
Basically just splitted Lexer and Parser into two files.
But when i try to save i get some Errors:
error(126): Parser.g4:9:35: cannot create implicit token for string literal in non-combined grammar: ';'
error(126): Parser.g4:11:16: cannot create implicit token for string literal in non-combined grammar: '='
error(126): Parser.g4:2:13: cannot create implicit token for string literal in non-combined grammar: '('
error(126): Parser.g4:2:28: cannot create implicit token for string literal in non-combined grammar: ')'
error(126): Parser.g4:3:10: cannot create implicit token for string literal in non-combined grammar: 'print'
error(126): Parser.g4:3:23: cannot create implicit token for string literal in non-combined grammar: ','
error(126): Parser.g4:9:37: cannot create implicit token for string literal in non-combined grammar: '*'
error(126): Parser.g4:9:41: cannot create implicit token for string literal in non-combined grammar: '/'
error(126): Parser.g4:10:47: cannot create implicit token for string literal in non-combined grammar: '+'
error(126): Parser.g4:10:51: cannot create implicit token for string literal in non-combined grammar: '-'
10 error(s)
Hope someone can help me with this.
Best regards
All literal tokens inside your parser grammar: '*', '/', etc. need to be defined in your lexer grammar:
lexer grammar WRBLexer;
ADD : '+';
MUL : '*';
...
And then in your parser grammar, you'd do:
expression
: ...
| expression (MUL|DIV) expression # multOrDiv
| expression (ADD|SUB) expression # addOrSubtract
| ...
;
Since you write two file.
All your symbols, must write in Lexer file.
I suggest you to do this:
In Lexer file:
STRING : '"' (' '..'~')* '"';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
ADD_SUB: '+' | '-';
MUL_DIV: '*' | '/';
COMMA : ',';
PRINT : 'print';
Lb : '(';
Rb : ')';
COLON : ';';
EQUAL : '=';
And your Parser:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) COLON)+;
assignment : ID EQUAL expression;
expression
: Lb expression Rb # parenExpression
| expression MUL_DIV expression # multOrDiv
| expression ADD_SUB expression # addOrSubtract
| PRINT arg (COMMA arg)* # print
| STRING # string
| ID # identifier
| INT # integer
;
arg : ID|STRING;
Actually, it's okay to write literal tokens inside your rules. You can name literal tokens. For example,
expr: expr op=('*' | '/') expr # binaryExpr
| expr op=('+' | '-') expr # binaryExpr
| Number # number
;
Number: blah blah ;
Star : '*';
Div : '/';
Plus : '+';
Minus: '-';
And you can write the listener as follows:
class BinaryExpr {
public enum BinaryOp {
// ...
}
// ...
}
public class MyListener extends YourGrammarBaseListener {
#Override
public void exitBinaryExpr(YourGrammarParser.BinaryExprContext ctx) {
BinaryExpr.BinaryOp op;
switch (ctx.op.getType()) {
case YourGrammarParser.Star: op = BinaryExpr.BinaryOp.MUL; break;
case YourGrammarParser.Div: op = BinaryExpr.BinaryOp.DIV; break;
case YourGrammarParser.Plus: op = BinaryExpr.BinaryOp.ADD; break;
case YourGrammarParser.Minus: op = BinaryExpr.BinaryOp.SUB; break;
default: throw new RuntimeException("Unknown binary op.");
}
// ...
}
}

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, expression-
"left right" AND center
should have the same parse tree even after dropping the quotes-
left right AND center.
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
#modifier{public}
#ctorModifier{public}
#lexer::namespace{Org.CSharp.Parsers}
#parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed as follows:

Resources