cannot create implicit token for string literal in non-combined grammar - token

I found a nice calculator grammar and copied it, with a few small changes, from here:
https://dexvis.wordpress.com/2012/11/22/a-tale-of-two-grammars/
I have two files, a parser grammar and a lexer grammar. They look like this:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) ';')+;
assignment : ID '=' expression;
expression
: '(' expression ')' # parenExpression
| expression ('*'|'/') expression # multOrDiv
| expression ('+'|'-') expression # addOrSubtract
| 'print' arg (',' arg)* # print
| STRING # string
| ID # identifier
| INT # integer;
arg : ID|STRING;
and the Lexer:
lexer grammar WRBLexer;
STRING : '"' (' '..'~')* '"';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
Basically, I just split the lexer and the parser into two files.
But when I try to save, I get these errors:
error(126): Parser.g4:9:35: cannot create implicit token for string literal in non-combined grammar: ';'
error(126): Parser.g4:11:16: cannot create implicit token for string literal in non-combined grammar: '='
error(126): Parser.g4:2:13: cannot create implicit token for string literal in non-combined grammar: '('
error(126): Parser.g4:2:28: cannot create implicit token for string literal in non-combined grammar: ')'
error(126): Parser.g4:3:10: cannot create implicit token for string literal in non-combined grammar: 'print'
error(126): Parser.g4:3:23: cannot create implicit token for string literal in non-combined grammar: ','
error(126): Parser.g4:9:37: cannot create implicit token for string literal in non-combined grammar: '*'
error(126): Parser.g4:9:41: cannot create implicit token for string literal in non-combined grammar: '/'
error(126): Parser.g4:10:47: cannot create implicit token for string literal in non-combined grammar: '+'
error(126): Parser.g4:10:51: cannot create implicit token for string literal in non-combined grammar: '-'
10 error(s)
Hope someone can help me with this.
Best regards

All literal tokens inside your parser grammar: '*', '/', etc. need to be defined in your lexer grammar:
lexer grammar WRBLexer;
ADD : '+';
MUL : '*';
...
And then in your parser grammar, you'd do:
expression
: ...
| expression (MUL|DIV) expression # multOrDiv
| expression (ADD|SUB) expression # addOrSubtract
| ...
;

Since you are using two files, every symbol has to be defined in the lexer file.
I suggest the following.
In the lexer file:
STRING : '"' (' '..'~')* '"';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
WS : [ \t\n\r]+ -> skip ;
ADD_SUB: '+' | '-';
MUL_DIV: '*' | '/';
COMMA : ',';
PRINT : 'print';
Lb : '(';
Rb : ')';
COLON : ';';
EQUAL : '=';
And your Parser:
parser grammar Parser;
options{
language = Java;
tokenVocab = Lexer;
}
// PARSER
program : ((assignment|expression) COLON)+;
assignment : ID EQUAL expression;
expression
: Lb expression Rb # parenExpression
| expression MUL_DIV expression # multOrDiv
| expression ADD_SUB expression # addOrSubtract
| PRINT arg (COMMA arg)* # print
| STRING # string
| ID # identifier
| INT # integer
;
arg : ID|STRING;

Actually, it's okay to write literal tokens inside your rules, as long as the lexer grammar defines a named token for each literal. For example,
expr: expr op=('*' | '/') expr # binaryExpr
| expr op=('+' | '-') expr # binaryExpr
| Number # number
;
Number: blah blah ;
Star : '*';
Div : '/';
Plus : '+';
Minus: '-';
And you can write the listener as follows:
class BinaryExpr {
public enum BinaryOp {
// ...
}
// ...
}
public class MyListener extends YourGrammarBaseListener {
@Override
public void exitBinaryExpr(YourGrammarParser.BinaryExprContext ctx) {
BinaryExpr.BinaryOp op;
switch (ctx.op.getType()) {
case YourGrammarParser.Star: op = BinaryExpr.BinaryOp.MUL; break;
case YourGrammarParser.Div: op = BinaryExpr.BinaryOp.DIV; break;
case YourGrammarParser.Plus: op = BinaryExpr.BinaryOp.ADD; break;
case YourGrammarParser.Minus: op = BinaryExpr.BinaryOp.SUB; break;
default: throw new RuntimeException("Unknown binary op.");
}
// ...
}
}

Related

ANTLR4: how to match kv expression with same rule

I have the following statement I wish to parse:
key=value
key: [a-zA-Z] ([a-zA-Z0-9_-])*
value: [a-zA-Z] ([a-zA-Z0-9_-])*
The parser is always confused as key and value have the same rule.
My failing grammar:
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
;
kv: KEY OP VALUE;
JOINER: '&';
KEY : [a-zA-Z] ([a-zA-Z0-9])*;
OP : '=';
VALUE : [a-zA-Z0-9];
WS : [ \t]+ -> skip ; // toss out whitespace
But this other grammar works:
grammar MatchExpr;
prog: stat ;
stat: expr
;
expr : kv JOINER kv #joiner
| kv #condition
;
kv: KV;
KV: [a-zA-Z] ([a-zA-Z0-9_-])* '=' [a-zA-Z0-9] ([a-zA-Z0-9._-])*;
JOINER: '&';
WS : [ \t]+ -> skip ; // toss out whitespace
Why?
ANTLR will always create a KEY token for the input foo. Even if the input is mu = foo, two KEY tokens are created (with an OP token in between).
This is simply how ANTLR's lexer works. The lexer is not "driven" by the parser. Even when the parser is trying to match a VALUE token, the input foo will always be a KEY token.
These are the 2 rules by which the lexer creates tokens:
1. create the longest possible match
2. if there are 2 or more lexer rules that match the same characters, let the one defined first "win"
Because of rule 2, you can see why KEY will be created for foo and not a VALUE.
To fix this, do something like this:
kv : KEY OP value;
value : KEY | VALUE;
JOINER : '&';
KEY : [a-zA-Z] [a-zA-Z0-9]*;
VALUE : [a-zA-Z0-9]+ ; // matches an ID starting with a digit
OP : '=';
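The two lexer rules described above can be imitated in a few lines of plain Java. This is only a sketch: MiniLexer is a made-up class built on java.util.regex, not ANTLR code, and it assumes that a greedy regex match is the longest match for these simple patterns. It registers the token rules in definition order and, at each position, keeps the longest match, preferring the earlier rule on a tie.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a generated lexer; not ANTLR code.
class MiniLexer {
    // LinkedHashMap keeps insertion order, which stands in for definition order.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    MiniLexer rule(String name, String regex) {
        rules.put(name, Pattern.compile(regex));
        return this;
    }

    // Returns the token names for the input, skipping spaces and tabs.
    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t') {
                pos++;
                continue;
            }
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Rule 1: a strictly longer match wins.
                // Rule 2: on a tie, the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestName);
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        MiniLexer lexer = new MiniLexer()
                .rule("JOINER", "&")
                .rule("KEY", "[a-zA-Z][a-zA-Z0-9]*")
                .rule("VALUE", "[a-zA-Z0-9]+")
                .rule("OP", "=");
        System.out.println(lexer.tokenize("mu = foo")); // [KEY, OP, KEY]
        System.out.println(lexer.tokenize("mu = 42"));  // [KEY, OP, VALUE]
    }
}
```

Both KEY and VALUE match foo with the same length, so KEY wins because it was registered first. VALUE only shows up for input that KEY cannot match, such as 42. That is why the parser rule value has to accept both token types.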

Xtext to Acceleo

I have an Xtext grammar for an expression like this:
expr : RelExp ( {LogicExp.args+=current} op=LO args+=RelExp)* ;
RelExp returns expr : ArithExp ( {RelExp.args+=current} op=RO args+=ArithExp)* ;
ArithExp returns expr : Term ( {ArithExp.args+=current} op=AO1 args+=Term)* ;
Term returns expr : Factor ( {Term.args+=current} op=AO2 args+=Factor)* ;
Factor returns expr : Atom ({PostfixOp.arg=current} uo=UO)?
| {PrefixOp} uo=UO arg=Atom ;
Atom returns expr : Literal
| {Parenteses} '(' exp=expr ')'
| lValue ;
lValue returns expr : {Var} valor=ID (
({FuncCall.def=current} '(' arg=Argument? ')') |
({FieldAccess.obj=current} '.' field=ID) |
({ArrayAccess.arr=current} '[' index=expr ']')
)*
| PointerExp ;
PointerExp : {PointerExp} '**' '(' exp=expr ')' ;
//Case : 'case' val=Atom ':' (commands+=Command)* ;
//Type : tipo=TYPELIT ('[' exp=expr? ']')?;
Literal : {IntLit} val=NUMBER | {TrueLit} val='TRUE' | {FalseLit} val='FALSE' | {StrLit} val=STRING;
I am trying to write Acceleo code to print an expression. But every time I write [stat.exp/] in Acceleo, it prints org.xtext.example.scldsl.sclDsl.impl.TrueLitImpl@67af833b(val: 0).
But I need only (val: 0).
Can anyone please help?
When you call [stat.exp/] in Acceleo, it adds an implicit toString() to obtain a String representation from your AST element.
If you want to use your Xtext grammar to obtain a String representation, you will need to find a way to use the Xtext serializer generated for your DSL.
As a first step, you should add a Java service in your Acceleo, and implement your Java service in such a way that it takes an AST element (probably EObject or some common super-type in your metamodel if you have one), has access to the Xtext serializer, and returns the serialized version of your AST element.

ANTLR4 parsing a keyword-contained variable name

I'm trying to parse a simple integer declaration in antlr4.
My current grammar is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.
The lexer creates tokens with as many characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
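To compare the two grammars on the input from the question, the sketch below imitates the lexer with plain Java regular expressions. KeywordLexDemo and its two rule tables are made up for illustration, and the real ANTLR lexer is a generated state machine rather than a regex loop. The sketch assumes that the implicit literals 'int' and '=' take precedence over the named lexer rules, so they are listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not ANTLR code.
class KeywordLexDemo {
    // Each rule is {name, regex}; array order stands in for definition order.
    static final String[][] ORIGINAL = {
        {"'int'", "int"}, {"'='", "="}, {"LETTER", "[a-zA-Z_]"}, {"NUMBER", "[0-9]"}
    };
    static final String[][] FIXED = {
        {"'int'", "int"}, {"'='", "="}, {"VAR", "[a-zA-Z_][a-zA-Z_0-9]*"}, {"NUMBER", "[0-9]+"}
    };

    // Returns tokens as NAME(text), skipping whitespace.
    static List<String> tokenize(String[][] rules, String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            String name = null;
            int len = 0;
            for (String[] rule : rules) {
                Matcher m = Pattern.compile(rule[1]).matcher(input).region(pos, input.length());
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > len) {
                    len = m.end() - pos;
                    name = rule[0];
                }
            }
            if (name == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            out.add(name + "(" + input.substring(pos, pos + len) + ")");
            pos += len;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(ORIGINAL, "int int_A = 0"));
        System.out.println(tokenize(FIXED, "int int_A = 0"));
    }
}
```

With the original rules, LETTER can only ever match one character, so the three-character literal 'int' wins at the start of int_A. With the VAR rule, the five-character match int_A is longer than 'int', so it becomes a single VAR token, while a bare int is still the keyword because the literal wins the tie.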

Support optional quotes in a Boolean expression

Background
I have been using ANTLRWorks (V 1.4.3) for a few days now and trying to write a simple Boolean parser. The combined lexer/parser grammar below works well for most of the requirements including support for quoted white-spaced text as operands for a Boolean expression.
Problem
I would like the grammar to work for white-spaced operands without the need of quotes.
Example
For example, the expression
"left right" AND center
should have the same parse tree even after dropping the quotes:
left right AND center
I have been learning about backtracking, predicates etc but can't seem to find a solution.
Code
Below is the grammar I have got so far. Any feedback on the foolish mistakes is appreciated :).
Lexer/Parser Grammar
grammar boolean_expr;
options {
TokenLabelType=CommonToken;
output=AST;
ASTLabelType=CommonTree;
}
@modifier{public}
@ctorModifier{public}
@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}
public
evaluator
: expr EOF
;
public
expr
: orexpr
;
public
orexpr
: andexpr (OR^ andexpr)*
;
public
andexpr
: notexpr (AND^ notexpr)*
;
public
notexpr
: (NOT^)? atom
;
public
atom
: word | LPAREN! expr RPAREN!
;
public
word
: QUOTED_TEXT | TEXT
;
/*
* Lexer Rules
*/
LPAREN
: '('
;
RPAREN
: ')'
;
AND
: 'AND'
;
OR
: 'OR'
;
NOT
: 'NOT'
;
WS
: ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
QUOTED_TEXT
: '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
;
TEXT
: (LETTER | DIGIT)+
;
/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT
: ('0'..'9')
;
fragment LOWER
: ('a'..'z')
;
fragment UPPER
: ('A'..'Z')
;
fragment LETTER
: LOWER | UPPER
;
Simply let TEXT in your atom rule match once or more: TEXT+. When it matches a TEXT token more than once, you'll also want to create a custom root node for these TEXT tokens (I added an imaginary token called WORD in the grammar below).
grammar boolean_expr;
options {
output=AST;
}
tokens {
WORD;
}
evaluator
: expr EOF
;
...
word
: QUOTED_TEXT
| TEXT+ -> ^(WORD TEXT+)
;
...
Your input "left right AND center" would now be parsed into the AST ^(AND ^(WORD left right) ^(WORD center)).

Why does ANTLR not parse the entire input?

I am quite new to ANTLR, so this is likely a simple question.
I have defined a simple grammar which is supposed to include arithmetic expressions with numbers and identifiers (strings that start with a letter and continue with one or more letters or numbers.)
The grammar looks as follows:
grammar while;
@lexer::header {
package ConFreeG;
}
@header {
package ConFreeG;
import ConFreeG.IR.*;
}
@parser::members {
}
arith:
term
| '(' arith ( '-' | '+' | '*' ) arith ')'
;
term returns [AExpr a]:
NUM
{
int n = Integer.parseInt($NUM.text);
a = new Num(n);
}
| IDENT
{
a = new Var($IDENT.text);
}
;
fragment LOWER : ('a'..'z');
fragment UPPER : ('A'..'Z');
fragment NONNULL : ('1'..'9');
fragment NUMBER : ('0' | NONNULL);
IDENT : ( LOWER | UPPER ) ( LOWER | UPPER | NUMBER )*;
NUM : '0' | NONNULL NUMBER*;
fragment NEWLINE:'\r'? '\n';
WHITESPACE : ( ' ' | '\t' | NEWLINE )+ { $channel=HIDDEN; };
I am using ANTLR v3 with the ANTLR IDE Eclipse plugin. When I parse the expression (8 + a45) using the interpreter, only part of the parse tree is generated:
Why does the second term (a45) not get parsed? The same happens if both terms are numbers.
You'll want to create a parser rule that has an EOF (end of file) token in it so that the parser will be forced to go through the entire token stream.
Add this rule to your grammar:
parse
: arith EOF
;
and let the interpreter start at that rule instead of the arith rule.
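The effect of the EOF token can be illustrated with a small hand-written recursive-descent parser for the same arith and term rules. This is a sketch only: ArithParser, its pre-split token list and the "<EOF>" marker are assumptions made for the example, and it does not reproduce how the ANTLR interpreter builds its tree. It shows the general point that a rule without EOF may return after matching a prefix of the input, while a rule ending in EOF must account for every token.

```java
import java.util.List;

// Hand-written recursive-descent sketch; not generated by ANTLR.
class ArithParser {
    private static final String EOF = "<EOF>";
    private final List<String> tokens;
    private int pos = 0;

    ArithParser(List<String> tokens) {
        this.tokens = tokens;
    }

    // parse : arith EOF ;
    void parse() {
        arith();
        expect(EOF);
    }

    // arith : term | '(' arith ('-' | '+' | '*') arith ')' ;
    void arith() {
        if (peek().equals("(")) {
            pos++;
            arith();
            if (!peek().matches("[-+*]")) {
                throw new IllegalStateException("expected operator but found " + peek());
            }
            pos++;
            arith();
            expect(")");
        } else {
            term();
        }
    }

    // term : NUM | IDENT ;
    void term() {
        if (!peek().matches("0|[1-9][0-9]*|[a-zA-Z][a-zA-Z0-9]*")) {
            throw new IllegalStateException("expected NUM or IDENT but found " + peek());
        }
        pos++;
    }

    // Number of tokens consumed so far.
    int consumed() {
        return pos;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : EOF;
    }

    private void expect(String expected) {
        if (!peek().equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but found " + peek());
        }
        if (pos < tokens.size()) {
            pos++;
        }
    }

    public static void main(String[] args) {
        List<String> input = List.of("8", "+", "a45");
        ArithParser loose = new ArithParser(input);
        loose.arith(); // returns after "8" without complaint
        System.out.println(loose.consumed()); // 1
        try {
            new ArithParser(input).parse();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // expected <EOF> but found +
        }
    }
}
```

Starting at arith, the unparenthesised input 8 + a45 is accepted after a single token and the rest is silently ignored. Starting at parse, the same input is rejected because + is not the end of the input, and the parenthesised form (8 + a45) is consumed completely.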
