Validating a "break" statement with a recursive descent parser - parsing

In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
One of the exercises is to add a break statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if statements etc. if those are inside a loop.
My first approach was to create a new rule, whileBody, to accept break:
## FIRST TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break ;
break → "break" ";" ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
But we have to accept break inside nested loops, if conditionals etc. What I could imagine is, I'd need a new rule for blocks and conditionals which accept break:
## SECOND TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break
| whileBlock
| whileIfStmt
whileBlock→ "{" (declaration | break)* "}" ;
whileIfStmt → "if" "(" expression ")" whileBody ( "else" whileBody )? ;
break → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
It is not infeasible for now, but it can be cumbersome to handle it once the language grows. It is boring and error-prone to write even today!
I looked for inspiration in C and Java BNF specifications. Apparently, none of those specifications prohibit the break outside loop. I guess their parsers have ad hoc code to prevent that. So, I followed suit and added code into the parser to prevent break outside loops.
TL;DR
My questions are:
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Is there a more practical way to bake the break command inside a syntax specification?
Or the standard way is indeed to change a parser to prevent breaks outside loops while parsing?

Attribute grammars are good at this sort of thing. Define an inherited attribute (I'll call it LC for loop count). The 'program' non-terminal passes LC = 0 to its children; loops pass LC = $LC + 1 to their children; all other constructs pass LC = $LC to their children. Make the rule for 'break' syntactically valid only if $LC > 0.
There is no standard syntax for attribute grammars, or for using attribute values in guards (as I am suggesting for 'break'), but using Prolog definite-clause grammar notation your grammar might look something like the following. I have added a few notes on DCG notation, in case it's been too long since you have used them.
/* nt(X) means, roughly, pass the value X as an inherited attribute.
** In a recursive-descent system, it can be passed as a parameter.
** N.B. in definite-clause grammars, semicolon separates alternatives,
** and full stop ends a rule.
*/
/* DCD doesn't have regular-right-part rules, so we have to
** handle repetition via recursion.
*/
program -->
statement(0);
statement(0), program.
statement(LC) -->
exprStmt(LC);
ifStmt(LC);
printStmt(LC);
whileStmt(LC);
block(LC);
break(LC).
block(LC) -->
"{", star-declaration(LC), "}".
/* The notation [] denotes the empty list, and matches zero
** tokens in the input.
*/
star-declaration(LC) -->
[];
declaration(LC), star-declaration(LC).
/* On the RHS of a rule, braces { ... } contain Prolog code. Here,
** the code "LC2 is LC + 1" adds 1 to LC and binds LC2 to that value.
*/
whileStmt(LC) -->
{ LC2 is LC + 1 }, "while", "(", expression(LC2), ")", statement(LC2).
ifStmt(LC) --> "if", "(", expression(LC), ")", statement(LC), opt-else(LC).
opt-else(LC) -->
"else", statement(LC);
[].
/* The definition of break checks the value of the loop count:
** "LC > 0" succeeds if LC is greater than zero, and allows the
** parse to succeed. If LC is not greater than zero, the expression
** fails. And since there is no other rule for 'break', any attempt
** to parse a 'break' rule when LC = 0 will fail.
*/
break(LC) --> { LC > 0 }, "break", ";".
Nice introductions to attribute grammars can be found in Grune and Jacobs, Parsing Techniques and in the Springer volumes Lecture Notes in Computer Science 461 (Attribute Grammars and Their Applications*, ed. P. Deransart and M. Jourdan) and 545 (Attribute Grammars, Applications, and Systems, ed. H. Alblas and B. Melichar.
The technique of duplicating some productions in order to distinguish two situations (am I in a loop? or not?), as illustrated in the answer by #rici, can be regarded as a way of pushing a Boolean attribute into the non-terminal names.

Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Sure. But you need a lot of duplication. Since while is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B to the name of non-terminals which might include break statements.
declaration → varDecl
| statement
declaration_B → varDecl
| statement_B
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block
statement_B → exprStmt
| printStmt
| whileStmt
| breakStmt
| ifStmt_B
| block_B
breakStmt → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt → "while" "(" expression ")" statement_B ;
block → "{" declaration* "}"
block_B → "{" declaration_B* "}"
Not all statement types need to be duplicated. Non-compound statement like exprStmt don't, because they cannot possibly include a break statement (or any other statement type). And the statement which is the target of a loop statement like whileStmt can always include break, regardless of whether the while was inside a loop or not.
Is there a more practical way to bake the break command inside a syntax specification?
Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.
Is there a different way to do this?
Since this is a top-down (recursive descent) parser, it's pretty straight-forward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt would set that argument to True (or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False. The breakStmt implementation would just return failure if it is called with False.

Related

Grammar for if statements in newline sensitive language

I'm working on a language that is meant to read much like English, and having issues with the grammar for if statements. In case you are curious, the language is inspired by HyperTalk, so I'm trying to make sure I match all the valid constructs in that language. The sample input I'm using that demonstrates all the possible if constructs can be viewed here. There are a lot, so I didn't want to inline the code.
I've removed most other constructs from the grammar to make it a bit easier to read, but basically statements look like this:
start
: statementList
;
statementList
: '\n'
| statement '\n'
| statementList '\n'
| statementList statement '\n'
;
statement
: ID
| ifStatement
;
The shift/reduce conflicts I'm seeing are in the ifStatement rules:
ifStatement
: ifCondition THEN statement
| ifCondition THEN statement ELSE statement
| ifCondition THEN statement ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList END IF
| ifCondition THEN '\n' END IF
| ifCondition THEN '\n' ELSE statement
| ifCondition THEN '\n' ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList ELSE statement
| ifCondition THEN '\n' statementList ELSE '\n' statementList END IF
// The following rules cause issues, but should be legal:
| ifCondition THEN statement newlines ELSE statement
| ifCondition THEN statement newlines ELSE '\n' statementList END IF
;
ifCondition
: IF expression
| IF expression '\n'
;
expression
: TRUE
| FALSE
;
newlines
: '\n'
| newlines '\n'
;
The problem is that I need to support this construct:
if true then statement # <- Any number of newlines
else statement
The problem (as I understand it) is that there isn't enough context to correctly determine whether to shift the else, or reduce just the if true then statement part without knowing what comes later (the end of the statement list, or another statement). Is this even parseable?
I have gists for the parser, scanner, and sample input to try out.
Getting this right is surprisingly difficult, so I've tried to annotate the steps. There are a lot of annoying details.
At its core, this is just a manifestation of the dangling else ambiguity, whose resolution is pretty well-known (force the parser to always shift the else). The solution below resolves the ambiguity in the grammar itself, which is unambiguous.
The basic principle that I've used here is the one outlined several decades ago in Principles of Compiler Design by Alfred Aho and Jeffrey Ullman (the so-called "Dragon book", which I mention since its authors were recently granted the Turing award precisely for that and their other influential works). In particular, I use the terms "matched" and "unmatched" (rather than "open" and "closed", which are also popular) because that's the way I learned it.
It is also possible to solve this grammar problem using precedence declarations; indeed, that often turns out to be much simpler. But in this particular case, it's not easy to work with operator precedence because the relevant token (the else) can be preceded by an arbitrary number of newline tokens. I'm pretty sure you could still construct a precedence-based solution, but there are advantages to using an unambiguous grammar, including the ease of porting to a parser generator which doesn't use the same precedence algorithm, and the fact that it is possible to analyze mechanically.
The basic outline of the solution is to divide all statements into two categories:
"matched" (or "closed") statements, which are complete in the sense that it is not possible to extend the statement with an else clause. (In other words, every if…then is matched by a corresponding else.) These
"unmatched" (or "open") statements, which could have been extended with an else clause. (In other words, at least one if…then clause is not matched by an else.) Since the unmatched statement is a complete statement, it cannot be immediately followed by an else token; had an else token appeared, it would have served to extend the statement.
Once we manage to construct grammars for these two categories of statement, it's only necessary to figure out which uses of statement in the ambiguous grammar can be followed by else. In all of these contexts, the non-terminal statement must be replaced with the non-terminal matched-statement, because only matched statements can be followed by else without interacting with it. In other contexts, where else could not be the next token, either category of statement is valid.
So the essential grammar style is (taken from the Dragon book):
stmt → matched_stmt
| unmatched_stmt
matched_stmt → "if" expr "then" matched_stmt "else" matched_stmt
| other_stmt
unmatched_stmt → "if" expr "then" matched_stmt "else" unmatched_stmt
| "if" expr "then" stmt
other_stmt is anything other than a conditional statement. Or, to be more precise, anything other than a compound statement which precisely ends with a stmt.
In Hypertalk, as far as I know, if statements are the only compound statements which can end with a statement. Other compound statements are precisely terminated with an end X, which effectively closes the statement. But in other languages, such as C, there are a variety of compound statements, and most of these need to be divided into "matched" and "unmatched" depending precisely on whether their terminating substatement is (recursively) matched or unmatched.
One thing I want to note here, which is apparent from that outline grammar if you look at it a bit sideways, is that the if…then…else part of the if statement is grammatically similar to a bracketed prefix operator. That is, both matched_stmt and unmatched_stmt are similar to the right-recursive rule for unary minus:
unary → '-' unary
| atom
which in turn could be written in an Extended BNF dialect which allows Kleene stars as
unary → ('-')* atom
If we were to do that transformation to Aho&Ullman's grammar, we'd end up with:
if_then_else → "if" expr "then" matched_stmt "else"
matched_stmt → (if_then_else)* other_stmt
unmatched_stmt → (if_then_else)* "if" expr "then" stmt
That makes it reasonably clear how to implement this grammar with a top-down recursive-descent parser. (A bit of left-factoring is needed, but it still ends up being similar to the unary minus grammar.) I'm not planning on developing this thought further in this answer, but I think that the EBNF conversion helps guide the intuitions about how this grammar actually works to undangle the else.
It was also really helpful in figuring out how to deal with newlines. The key insight (for me) was that statements must end with a newline. The one exception is the condensed single-line version of the if command. But that exception only happens just before an else token (and only when the then which it matches in on the same line). In this grammar, that case is implemented with the inner-matched non-terminal, assisted by the fact that one-line statements (like do-statement) lack the terminating newline. The newline which terminates one-line statements is added in the recursive base case for matched (single-statement NL); that's the only place it needs to be handled. Multi-line compound statements are all defined with a terminating newline (see, for example, repeat-statement).
Most of the rest of the complications deal with the variety of syntactic forms. The only one which is really interesting is the handling of blocks after a then token at the end of a line. That block can be terminated in two ways:
with an end if line, without an else clause. This is treated as a "matched" case, since it clearly could not be extended with an else clause.
with an else clause (which could be a single line else or a block else, where the else token is at the end of the line). But here there is a possible ambiguity; if the last statement in the block is an unmatched if, then an else line should extend that statement, rather than terminating the block. That's not really different from the rest of the matched/unmatched logic; to implement it, I created two different block non-terminals, one ending with a matched statement and the other ending with an unmatched statement. And then, as usual, only the matched block can be used before an else.
(I found the new counterexample generator in bison 3.7.6 extremely helpful here; my initial attempt just used block because I'd failed to notice the ambiguity. But it is a real ambiguity, and it lead to a shift-reduce conflict whose origins seemed mysterious. Once I saw the counterexample produced by the counterexample generator -- which showed the conflict happening inside a block following an if-then -- the problem became a lot more evident.)
The alternation between matched-block and unmatched-block is a simple example of the correspondence between grammar productions and state machines. The two non-terminals represent the two states in a very simple state machine, whose state records a single bit: whether or not the last statement was matched. The non-terminals must be right-recursive for this to work, which is a deviation from the usual "prefer left-recursion" heuristic for building LALR(1) grammars.
OK, with that overlong preamble, here's the grammar. In the interests of compactification, I simplified expressions down to just variables and boolean constants, included only one simple statement (do expr) and included only one other compound statement (repeat until expr / block / end repeat). (The last one is there as a placeholder.)
program : block
block : %empty
| matched-block
| unmatched-block
NL : '\n'
| NL '\n'
matched-block
: block matched
unmatched-block
: block unmatched
simple-statement
: "do" expression
repeat-statement
: "repeat" "until" expression NL block "end" "repeat" NL
matched : if-then matched else-matched
| if-then inner-matched else-matched
| if-then NL matched-block else-matched
| if-then NL else-matched
| if-then NL block "end" "if" NL
| repeat-statement
| simple-statement NL
inner-matched
: %empty
| simple-statement
| if-then inner-matched "else" inner-matched
unmatched
: if-then matched
| if-then unmatched
| if-then inner-matched "else" unmatched
| if-then matched "else" unmatched
if-then : "if" expression NL "then"
| "if" expression "then"
else-matched
: "else" NL block "end" "if" NL
| "else" matched
expression
: ID
| "true"
| "false"
Previous answer (to original question, only visible in the edit history)
There is an obvious ambiguity between
ifCondition THEN statement EOL ELSE statement
and
ifCondition THEN EOL statementList ELSE statement
Recall that
statement: %empty
statementList: statement
with the result that both statement and statementList can derive the empty sequence. So both of the above productions for ifStatement can derive:
ifCondition THEN EOL ELSE statement
The parser has no way to know whether there is an empty statement before the EOL or an empty statementList after it. (You might not care which of these is chosen but parsers obsess about this kind of decision.)
Nullable productions are often problematic. Where possible, avoid them. Instead of letting statement derive empty, indicate explicitly where an empty statement might go by adding a rule where the optional statement is omitted. And consider rewriting statementList so that it must end with an EOL, which I think was your intention anyway (but perhaps I'm wrong).

parsing maxscript - problem with newlines

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.
My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction term: term '+' prod. State 16 applies at the top-level, and thus the lookahead includes newline but not ) Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

Solving shift/reduce conflict in expression grammar

I am new to bison and I am trying to make a grammar parsing expressions.
I am facing a shift/reduce conflight right now I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions, first a single variable, and second a parenthesized expression. To avoid this problem you need to have some token which separates expression (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.

How to correctly parse a VB Case statement?

I'm trying to parse VBA code, and the 5.4.2.10 section of the spec defines the Select Case statement, which we've defined as follows:
// 5.4.2.10 Select Case Statement
selectCaseStmt :
SELECT whiteSpace? CASE whiteSpace? selectExpression endOfStatement
caseClause*
caseElseClause?
END_SELECT
;
selectExpression : expression;
caseClause :
CASE whiteSpace rangeClause (whiteSpace? COMMA whiteSpace? rangeClause)* endOfStatement block
;
caseElseClause : CASE whiteSpace? ELSE endOfStatement block;
rangeClause :
expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
;
selectStartValue : expression;
selectEndValue : expression;
The problem is that the expression in rangeClause is taking precedence, and makes this:
Select Case foo
Case Is = 42
Exit Sub
End Select
...ultimately get picked up and treated as {undeclared-variable} {EQ} {literal}, which is a problem, because Is ought to be a lexer token, not the LHS of a comparison expression:
expression whiteSpace? (EQ | NEQ | LT | GT | LEQ | GEQ | LIKE | IS) whiteSpace? expression # relationalOp
I tried reordering the alternatives so that the expression branch has lower precedence, like this:
rangeClause :
selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
| expression
;
But that broke the entire grammar in all kinds of ways (breaks ~1000 tests in my project), so instead I tried changing the rangeClause to this (removed optional tokens, because Is without = is actually illegal VBA code):
rangeClause :
expression (whiteSpace TO whiteSpace expression)? #caseFromTo
| (IS whiteSpace comparisonOperator whiteSpace)? expression #caseIs
;
And then working with CaseFromToContext and CaseIsContext classes in the code (had to, to keep it compiling), but again it broke ~1000 tests in my project.
Then I figured, "hey that's potentially ambiguous!" and turned it into this:
rangeClause :
expression whiteSpace TO whiteSpace expression #caseFromTo
| IS whiteSpace comparisonOperator whiteSpace expression #caseIs
| expression #caseExpr
;
...but no luck, same identical outcome.
How can I make the rangeClause understand this annoying Case Is = foobar syntax? I'm using ANTLR 4.3, but we're planning to upgrade to ANTLR 4.6 soon-ish.
If additional context is needed, the complete VBAParser.g4 grammar is on github.
Turns out that re-ordering actually does work, but in order to keep the ambiguity out of the parse, the IS whiteSpace comparisonOperator has to come first:
rangeClause :
(IS whiteSpace?)? comparisonOperator whiteSpace? expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| expression
The problem is with expression (and by extension selectStartValue and selectEndValue) which will recursively match Is = because comparisonOperator comparisonOperator is an expression match. There's probably some work that can be done to prevent comparisonOperator comparisonOperator from matching expression (it's never valid in VBA AFAIK), but the above works as a quick and dirty fix.
Basically all the above grammar does is ensure that the "invalid" comparisonOperator comparisonOperator matches as a rangeClause before it can be matched as an expression.

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes. It is possible to change mode after encounter of specific token. But lexer rules must be defined separately, not in combined grammar.
In your case, after encountering quote, you will change mode and after encountering another quote, you will change mode back to the default one.
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'

Resources