I'm new to ANTLR and am trying to understand how to do some things with it. I need it to throw an error when a statement is missing things, like a semicolon or an end bracket. It's been called negative test cases by the problem set that I'm working through.
For example, the below code returns true, which is correct.
val program = """
1 + 2;
"""
recognize(program)
However, this code also returns true, despite it missing the semicolon at the end. It should return false ([PARSER error at line=1]: missing ';' at '').
val program = """
1 + 2
""".trimIndent()
recognize(program)
The grammar is as follows:
program: (expression ';')* | EOF;
expression: INT PLUS INT | OPENBRAC INT PLUS INT CLOSEBRAC | QUOTE IDENT QUOTE PLUS QUOTE IDENT QUOTE;
IDENT: [A-Za-z0-9]+;
INT: [-][0-9]+ | ('0'..'9')+;
PLUS: '+';
OPENBRAC: '(';
CLOSEBRAC: ')';
QUOTE: '"';
program: (expression ';')* | EOF;
This means a program can either be zero or more instances of expression ';' followed by whatever else is in the input stream or it can be empty. Since (expression ';')* can already match the empty input by itself, the | EOF is just redundant.
What you want is program: (expression ';')* EOF, which means that a program consists of zero or more instances of expression ';', followed by the end of input, meaning there must be nothing left in the input afterwards.
I have checked similar questions surrounding this issue but none seems to provide a solution to my version of the problem.
I just started Antlr4 recently and all has been going nicely until I hit this particular roadblock.
My grammar is a basic math expression grammar but for some reason I noticed the generated parser(?) is unable to walk from paser-rule "equal" to paser-rule "expr", in order to reach lexer-rule "NAME".
grammar MathCraze;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : '\r'? '\n' -> skip;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
ADD: '+';
SUB : '-';
MUL : '*';
DIV : '/';
POW : '^';
equal
: add # add1
| NAME '=' equal # assign
;
add
: mul # mul1
| add op=('+'|'-') mul # addSub
;
mul
: exponent # power1
| mul op=('*'|'/') exponent # mulDiv
;
exponent
: expr # expr1
| expr '^' exponent # power
;
expr
: NUM # num
| NAME # name
| '(' add ')' # parens
;
If I pass a word as input, sth like "variable", the parser throws the error above, but if I pass a number as input (say "78"), the parser walks the tree successfully (i.e, from rule "equal" to "expr").
equal equal
| |
add add
| |
mul mul
| |
exponent exponent
| |
expr expr
| |
NUM NAME
| |
"78" # No Error "variable" # Error! Tree walk doesn't reach here.
I've checked for every type of ambiguity I know of, so I'm probably missing something here.
I'm using Antlr5.6 by the way and I will appreciate if this problem gets solved. Thanks in advance.
Your style of expression hierarchy is the one we use in parsers written by hand or in ANTLR v3, from low to high precedence.
As Raven said, ANTLR 4 is much more powerful. Note the <assoc = right> specification in the power rule, which is usually right-associative.
grammar Question;
question
: line+ EOF
;
line
: expr NL
| assign NL
;
assign
: NAME '=' expr # assignSingle
| NAME '=' assign # assignMulti
;
expr // from high to low precedence
: <assoc = right> expr '^' expr # power
| expr op=( '*' | '/' ) expr # mulDiv
| expr op=( '+' | '-' ) expr # addSub
| '(' expr ')' # parens
| atom_r # atom
;
atom_r
: NUM
| NAME
;
NAME: [a-zA-Z_][a-zA-Z_0-9]*;
NUM : [0-9]+ ('.' [0-9]+)?;
WS : [ \t]+ -> skip;
NL : [\r\n]+ ;
Run with the -gui option to see the parse tree :
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -gui data.txt
and this data.txt file :
variable
78
a + b * c
a * b + c
a = 8 + (6 * 9)
a ^ b
a ^ b ^ c
7 * 2 ^ 5
a = b = c = 88
.
Added
Using your original grammar and starting with the equal rule, I have the following error :
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,9:10='78',<NUM>,2:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
line 2:0 no viable alternative at input 'variable78'
If I start with rule expr, there is no error :
$ grun Q2 expr -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
...
[#41,89:88='<EOF>',<EOF>,10:0]
$
Run grun with the -gui option and you'll see the difference :
running with expr, the input token variable is catched in NAME, rule expr is satisfied and terminates;
running with equal it's all in error. The parser tries the first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK. It consumes the token variable and tries to do something with the next token 78. It rolls back in each rule, see if it can do something with the alt of rule, but each alt requires an operator. Thus it arrives in equal and starts again with the token variable, this time using the alt | NAME '='. NAME consumes the token, then the rule requires '=', but the input is 78 and does not satisfies it. As there is no other choice, it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
line 1:8 no viable alternative at input 'variable'
If variable is the only token, same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NAME => OK, consumes variable, back to equal, tries the alt which requires '=', but the input is at EOF. That's why it says there is no viable alternative.
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
If 78 is the only token, do the same reasoning : first alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. The alternative is not an option. Satisfied ? oops, what about EOF.
Now let's add a NUM alt to equal :
equal
: add # add1
| NAME '=' equal # assign
| NUM '=' equal # assignNum
;
$ grun Q2 equal -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
line 1:2 no viable alternative at input '78'
First alternative equal -> add -> mul -> exponent -> expr -> NUM => OK, consumes 78, back to equal. Now there is also an alt for NUM, starts again, this time using the alt | NUM '='. NUM consumes the token 78,
then the parser requires '=', but the input is at EOF, hence the message.
Now let's add a new rule with EOF and let's run the grammar from all :
all : equal EOF ;
$ grun Q2 all -tokens data.txt
[#0,0:1='78',<NUM>,1:0]
[#1,2:1='<EOF>',<EOF>,1:2]
$ grun Q2 all -tokens data.txt
[#0,0:7='variable',<NAME>,1:0]
[#1,8:7='<EOF>',<EOF>,1:8]
The input corresponds to the grammar, and there is no more message.
Although I can't answer your question about why the parser can't reach NAME in expr I'd like to point out that with Antlr4 you can use direct left recursion in your rule specification which makes your grammar more compact and omproves readability.
With that in mind your grammar could be rewritten as
math:
assignment
| expression
;
assignment:
ID '=' (assignment | expression)
;
expression:
expression '^' expression
| expression ('*' | '/') expression
| expression ('+' | '-') expression
| NAME
| NUM
;
That grammar hapily takes a NAME as part of an expression so I guess it would solve your problem.
If you're really interested in why it didn't work with your grammar then I'd first check if the lexer has matched the input into the expected tokens. Afterwards I would have a look at the parse tree to see what the parser is making of the given token sequence and then trying to do the parsing manually accoding to your grammar and during that you should be able to find the point at which the parser does something different from what you'd expect it to do.
I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes. It is possible to change mode after encounter of specific token. But lexer rules must be defined separately, not in combined grammar.
In your case, after encountering quote, you will change mode and after encountering another quote, you will change mode back to the default one.
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'
I'm writing a grammar that should convert infix to postfix. Our teacher told us to change this grammar:
E -> TT'
T -> FF'
T'-> +T | -T | nil
F -> (E) | id | num
F' -> *F | /F | nil
Note: tokens are +,-,*,/, ^ (pow). The problem is power operator . I don't know how to change the grammar so that it could parse power too.
Thanks in advance.
I'm facing a strange ANTLR issue with a that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
also I do not get a Synatx diagram, since "rule "start"" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi
There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.