I am working on parsing a grammar using Antlr4 and running into a problem that I cannot understand. In a nutshell, the problem is that Antlr4 parser fails to fully parse a test string in my original grammar but when I add a superfluous rule , the parse is completed. I am providing a simplified version of my grammar to illustrate the issue.
grammar my;
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip
When provided with the input string:
H C D C C D
Antlr4 parser reports the error:
line 2:0 no viable alternative at input 'DCCD'
and the command grun st -gui shows the partial parse tree:
However, if the commented out rule (p: hd) is included in the grammar, Antlr4 parses the string completely and produces the following parse tree:
Note that the nonterminal p is not in the original grammar and cannot be reached from the start symbol st. As such, the added production is superfluous and should not affect the parsing of the grammar.
I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still give me the same thing. Also I've noticed that adding in the following rule does nothing for me:
WS
: [ \r\n\t] + -> skip
Which makes me wonder whether skipping whitespace is the default behavior of antlr4?
ANTLR4 based parsers have the ability to skip over single unwanted or missing tokens and continue parsing if possible (which is the case here). And there's no default to ignore whitespaces. You have to always specify a whitespace rule which either skips them or puts them on a hidden channel.
In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
One of the exercises is to add a break statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if statements etc. if those are inside a loop.
My first approach was to create a new rule, whileBody, to accept break:
## FIRST TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break ;
break → "break" ";" ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
But we have to accept break inside nested loops, if conditionals etc. What I could imagine is, I'd need a new rule for blocks and conditionals which accept break:
## SECOND TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break
| whileBlock
| whileIfStmt
whileBlock→ "{" (declaration | break)* "}" ;
whileIfStmt → "if" "(" expression ")" whileBody ( "else" whileBody )? ;
break → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
It is not infeasible for now, but it can be cumbersome to handle it once the language grows. It is boring and error-prone to write even today!
I looked for inspiration in C and Java BNF specifications. Apparently, none of those specifications prohibit the break outside loop. I guess their parsers have ad hoc code to prevent that. So, I followed suit and added code into the parser to prevent break outside loops.
TL;DR
My questions are:
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Is there a more practical way to bake the break command inside a syntax specification?
Or the standard way is indeed to change a parser to prevent breaks outside loops while parsing?
Attribute grammars are good at this sort of thing. Define an inherited attribute (I'll call it LC for loop count). The 'program' non-terminal passes LC = 0 to its children; loops pass LC = $LC + 1 to their children; all other constructs pass LC = $LC to their children. Make the rule for 'break' syntactically valid only if $LC > 0.
There is no standard syntax for attribute grammars, or for using attribute values in guards (as I am suggesting for 'break'), but using Prolog definite-clause grammar notation your grammar might look something like the following. I have added a few notes on DCG notation, in case it's been too long since you have used them.
/* nt(X) means, roughly, pass the value X as an inherited attribute.
** In a recursive-descent system, it can be passed as a parameter.
** N.B. in definite-clause grammars, semicolon separates alternatives,
** and full stop ends a rule.
*/
/* DCD doesn't have regular-right-part rules, so we have to
** handle repetition via recursion.
*/
program -->
statement(0);
statement(0), program.
statement(LC) -->
exprStmt(LC);
ifStmt(LC);
printStmt(LC);
whileStmt(LC);
block(LC);
break(LC).
block(LC) -->
"{", star-declaration(LC), "}".
/* The notation [] denotes the empty list, and matches zero
** tokens in the input.
*/
star-declaration(LC) -->
[];
declaration(LC), star-declaration(LC).
/* On the RHS of a rule, braces { ... } contain Prolog code. Here,
** the code "LC2 is LC + 1" adds 1 to LC and binds LC2 to that value.
*/
whileStmt(LC) -->
{ LC2 is LC + 1 }, "while", "(", expression(LC2), ")", statement(LC2).
ifStmt(LC) --> "if", "(", expression(LC), ")", statement(LC), opt-else(LC).
opt-else(LC) -->
"else", statement(LC);
[].
/* The definition of break checks the value of the loop count:
** "LC > 0" succeeds if LC is greater than zero, and allows the
** parse to succeed. If LC is not greater than zero, the expression
** fails. And since there is no other rule for 'break', any attempt
** to parse a 'break' rule when LC = 0 will fail.
*/
break(LC) --> { LC > 0 }, "break", ";".
Nice introductions to attribute grammars can be found in Grune and Jacobs, Parsing Techniques and in the Springer volumes Lecture Notes in Computer Science 461 (Attribute Grammars and Their Applications*, ed. P. Deransart and M. Jourdan) and 545 (Attribute Grammars, Applications, and Systems, ed. H. Alblas and B. Melichar.
The technique of duplicating some productions in order to distinguish two situations (am I in a loop? or not?), as illustrated in the answer by #rici, can be regarded as a way of pushing a Boolean attribute into the non-terminal names.
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Sure. But you need a lot of duplication. Since while is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B to the name of non-terminals which might include break statements.
declaration → varDecl
| statement
declaration_B → varDecl
| statement_B
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block
statement_B → exprStmt
| printStmt
| whileStmt
| breakStmt
| ifStmt_B
| block_B
breakStmt → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt → "while" "(" expression ")" statement_B ;
block → "{" declaration* "}"
block_B → "{" declaration_B* "}"
Not all statement types need to be duplicated. Non-compound statement like exprStmt don't, because they cannot possibly include a break statement (or any other statement type). And the statement which is the target of a loop statement like whileStmt can always include break, regardless of whether the while was inside a loop or not.
Is there a more practical way to bake the break command inside a syntax specification?
Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.
Is there a different way to do this?
Since this is a top-down (recursive descent) parser, it's pretty straight-forward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt would set that argument to True (or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False. The breakStmt implementation would just return failure if it is called with False.
I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
Antlr creates implicit lexer tokens for the subterms. Since they are automatically created in the lexer, you have no control over where the sit in the precedence evaluation of Lexer rules.
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.
I recently started studying ANTLR. Below is the grammar for the arithmetic expression.
The problem is that when I am putting (calling) expression rule in the term rule then it is parsing incorrectly even for (9+8). It is somehow ignoring the right parenthesis.
While when I put add rule instead of calling expression rule from the rule term, it is working fine.
As in:
term:
INTEGER
| '(' add ')'
;
Can anyone tell me why it is happening because more or les they both are the same.
Grammer for which it is giving incorrect results
term
:
INTEGER
| '(' expression ')'
;
mult
:
term ('*' term)*
;
add
:
mult ('+' mult)*
;
expression
:
add
;
When I parse "(8+9)" with a parser generated from your grammar, starting with the expression rule, I get the following parse tree:
In other words: it works just fine.
Perhaps you're using ANTLRWorks' (or ANTLR IDE's) interpreter to test your grammar? In thta case: don't use the interpreter, it's buggy. Use ANTLRWorks' debugger instead (the image is exported from ANTLRWorks' debugger).