How to correctly parse a VB Case statement?

I'm trying to parse VBA code; section 5.4.2.10 of the spec defines the Select Case statement, which we've implemented as follows:
// 5.4.2.10 Select Case Statement
selectCaseStmt :
SELECT whiteSpace? CASE whiteSpace? selectExpression endOfStatement
caseClause*
caseElseClause?
END_SELECT
;
selectExpression : expression;
caseClause :
CASE whiteSpace rangeClause (whiteSpace? COMMA whiteSpace? rangeClause)* endOfStatement block
;
caseElseClause : CASE whiteSpace? ELSE endOfStatement block;
rangeClause :
expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
;
selectStartValue : expression;
selectEndValue : expression;
The problem is that the expression in rangeClause is taking precedence, and makes this:
Select Case foo
Case Is = 42
Exit Sub
End Select
...ultimately get picked up and treated as {undeclared-variable} {EQ} {literal}, which is a problem, because Is ought to be a lexer token, not the LHS of a comparison expression:
expression whiteSpace? (EQ | NEQ | LT | GT | LEQ | GEQ | LIKE | IS) whiteSpace? expression # relationalOp
I tried reordering the alternatives so that the expression branch has lower precedence, like this:
rangeClause :
selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
| expression
;
But that broke the entire grammar in all kinds of ways (breaks ~1000 tests in my project), so instead I tried changing the rangeClause to this (removed optional tokens, because Is without = is actually illegal VBA code):
rangeClause :
expression (whiteSpace TO whiteSpace expression)? #caseFromTo
| (IS whiteSpace comparisonOperator whiteSpace)? expression #caseIs
;
And then working with CaseFromToContext and CaseIsContext classes in the code (had to, to keep it compiling), but again it broke ~1000 tests in my project.
Then I figured, "hey that's potentially ambiguous!" and turned it into this:
rangeClause :
expression whiteSpace TO whiteSpace expression #caseFromTo
| IS whiteSpace comparisonOperator whiteSpace expression #caseIs
| expression #caseExpr
;
...but no luck, same identical outcome.
How can I make the rangeClause understand this annoying Case Is = foobar syntax? I'm using ANTLR 4.3, but we're planning to upgrade to ANTLR 4.6 soon-ish.
If additional context is needed, the complete VBAParser.g4 grammar is on github.

Turns out that re-ordering actually does work, but in order to keep the ambiguity out of the parse, the IS whiteSpace comparisonOperator has to come first:
rangeClause :
(IS whiteSpace?)? comparisonOperator whiteSpace? expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| expression
;
The problem is with expression (and by extension selectStartValue and selectEndValue) which will recursively match Is = because comparisonOperator comparisonOperator is an expression match. There's probably some work that can be done to prevent comparisonOperator comparisonOperator from matching expression (it's never valid in VBA AFAIK), but the above works as a quick and dirty fix.
Basically all the above grammar does is ensure that the "invalid" comparisonOperator comparisonOperator matches as a rangeClause before it can be matched as an expression.
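If it helps to see the ordering idea outside ANTLR, here is a tiny hand-rolled sketch in Python (with made-up token names; this is not the real VBAParser) that makes the same decision: the Is comparisonOperator alternative is tried before the bare expression alternative, so the Is = prefix can never be swallowed as an expression:

# Toy token stream for a single Case clause, e.g. "Is = 42" -> ["IS", "EQ", "42"].
COMPARISON_OPS = {"EQ", "NEQ", "LT", "GT", "LEQ", "GEQ"}

def parse_range_clause(tokens):
    # Alternative 1: "Is" comparisonOperator expression, tried first so that
    # the generic expression alternative below can never swallow "Is =".
    if len(tokens) >= 3 and tokens[0] == "IS" and tokens[1] in COMPARISON_OPS:
        return ("caseIs", tokens[1], tokens[2])
    # Alternative 2: selectStartValue "To" selectEndValue.
    if len(tokens) == 3 and tokens[1] == "TO":
        return ("caseFromTo", tokens[0], tokens[2])
    # Alternative 3: a plain expression.
    return ("caseExpr", tokens[0])

print(parse_range_clause(["IS", "EQ", "42"]))  # ('caseIs', 'EQ', '42')
print(parse_range_clause(["1", "TO", "10"]))   # ('caseFromTo', '1', '10')
print(parse_range_clause(["42"]))              # ('caseExpr', '42')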

Related

Grammar for if statements in newline sensitive language

I'm working on a language that is meant to read much like English, and having issues with the grammar for if statements. In case you are curious, the language is inspired by HyperTalk, so I'm trying to make sure I match all the valid constructs in that language. The sample input I'm using that demonstrates all the possible if constructs can be viewed here. There are a lot, so I didn't want to inline the code.
I've removed most other constructs from the grammar to make it a bit easier to read, but basically statements look like this:
start
: statementList
;
statementList
: '\n'
| statement '\n'
| statementList '\n'
| statementList statement '\n'
;
statement
: ID
| ifStatement
;
The shift/reduce conflicts I'm seeing are in the ifStatement rules:
ifStatement
: ifCondition THEN statement
| ifCondition THEN statement ELSE statement
| ifCondition THEN statement ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList END IF
| ifCondition THEN '\n' END IF
| ifCondition THEN '\n' ELSE statement
| ifCondition THEN '\n' ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList ELSE statement
| ifCondition THEN '\n' statementList ELSE '\n' statementList END IF
// The following rules cause issues, but should be legal:
| ifCondition THEN statement newlines ELSE statement
| ifCondition THEN statement newlines ELSE '\n' statementList END IF
;
ifCondition
: IF expression
| IF expression '\n'
;
expression
: TRUE
| FALSE
;
newlines
: '\n'
| newlines '\n'
;
The problem is that I need to support this construct:
if true then statement # <- Any number of newlines
else statement
The problem (as I understand it) is that there isn't enough context to correctly determine whether to shift the else, or reduce just the if true then statement part without knowing what comes later (the end of the statement list, or another statement). Is this even parseable?
I have gists for the parser, scanner, and sample input to try out.
Getting this right is surprisingly difficult, so I've tried to annotate the steps. There are a lot of annoying details.
At its core, this is just a manifestation of the dangling else ambiguity, whose resolution is pretty well-known (force the parser to always shift the else). The solution below resolves the ambiguity in the grammar itself, so the resulting grammar is unambiguous.
The basic principle that I've used here is the one outlined several decades ago in Principles of Compiler Design by Alfred Aho and Jeffrey Ullman (the so-called "Dragon book", which I mention since its authors were recently granted the Turing award precisely for that and their other influential works). In particular, I use the terms "matched" and "unmatched" (rather than "open" and "closed", which are also popular) because that's the way I learned it.
It is also possible to solve this grammar problem using precedence declarations; indeed, that often turns out to be much simpler. But in this particular case, it's not easy to work with operator precedence because the relevant token (the else) can be preceded by an arbitrary number of newline tokens. I'm pretty sure you could still construct a precedence-based solution, but there are advantages to using an unambiguous grammar, including the ease of porting to a parser generator which doesn't use the same precedence algorithm, and the fact that it can be analyzed mechanically.
The basic outline of the solution is to divide all statements into two categories:
"matched" (or "closed") statements, which are complete in the sense that it is not possible to extend the statement with an else clause. (In other words, every if…then is matched by a corresponding else.) These
"unmatched" (or "open") statements, which could have been extended with an else clause. (In other words, at least one if…then clause is not matched by an else.) Since the unmatched statement is a complete statement, it cannot be immediately followed by an else token; had an else token appeared, it would have served to extend the statement.
Once we manage to construct grammars for these two categories of statement, it's only necessary to figure out which uses of statement in the ambiguous grammar can be followed by else. In all of these contexts, the non-terminal statement must be replaced with the non-terminal matched-statement, because only matched statements can be followed by else without interacting with it. In other contexts, where else could not be the next token, either category of statement is valid.
So the essential grammar style is (taken from the Dragon book):
stmt → matched_stmt
| unmatched_stmt
matched_stmt → "if" expr "then" matched_stmt "else" matched_stmt
| other_stmt
unmatched_stmt → "if" expr "then" matched_stmt "else" unmatched_stmt
| "if" expr "then" stmt
other_stmt is anything other than a conditional statement. Or, to be more precise, anything other than a compound statement which precisely ends with a stmt.
In Hypertalk, as far as I know, if statements are the only compound statements which can end with a statement. Other compound statements are precisely terminated with an end X, which effectively closes the statement. But in other languages, such as C, there are a variety of compound statements, and most of these need to be divided into "matched" and "unmatched" depending precisely on whether their terminating substatement is (recursively) matched or unmatched.
One thing I want to note here, which is apparent from that outline grammar if you look at it a bit sideways, is that the if…then…else part of the if statement is grammatically similar to a bracketed prefix operator. That is, both matched_stmt and unmatched_stmt are similar to the right-recursive rule for unary minus:
unary → '-' unary
| atom
which in turn could be written in an Extended BNF dialect which allows Kleene stars as
unary → ('-')* atom
If we were to do that transformation to Aho&Ullman's grammar, we'd end up with:
if_then_else → "if" expr "then" matched_stmt "else"
matched_stmt → (if_then_else)* other_stmt
unmatched_stmt → (if_then_else)* "if" expr "then" stmt
That makes it reasonably clear how to implement this grammar with a top-down recursive-descent parser. (A bit of left-factoring is needed, but it still ends up being similar to the unary minus grammar.) I'm not planning on developing this thought further in this answer, but I think that the EBNF conversion helps guide the intuitions about how this grammar actually works to undangle the else.
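For what it's worth, a recursive-descent parser gets the "always shift the else" behaviour almost for free from that right-recursive shape, because each if greedily claims the next else. A toy sketch in Python (single-letter stand-in tokens, newlines ignored; this is just an illustration of the shape, not HyperTalk):

def stmt(toks, pos):
    # stmt -> "i" stmt ("e" stmt)? | "s"
    # where "i" stands for "if expr then", "e" for "else", "s" for other_stmt.
    if toks[pos] == "i":
        then_branch, pos = stmt(toks, pos + 1)
        if pos < len(toks) and toks[pos] == "e":
            else_branch, pos = stmt(toks, pos + 1)
            return ("if", then_branch, else_branch), pos
        return ("if", then_branch, None), pos
    assert toks[pos] == "s"
    return "s", pos + 1

# "if ... then if ... then s else s": the else attaches to the inner if.
print(stmt(list("iises"), 0)[0])   # ('if', ('if', 's', 's'), None)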
It was also really helpful in figuring out how to deal with newlines. The key insight (for me) was that statements must end with a newline. The one exception is the condensed single-line version of the if command. But that exception only happens just before an else token (and only when the then which it matches is on the same line). In this grammar, that case is implemented with the inner-matched non-terminal, assisted by the fact that one-line statements (like the do statement) lack the terminating newline. The newline which terminates one-line statements is added in the recursive base case for matched (simple-statement NL); that's the only place it needs to be handled. Multi-line compound statements are all defined with a terminating newline (see, for example, repeat-statement).
Most of the rest of the complications deal with the variety of syntactic forms. The only one which is really interesting is the handling of blocks after a then token at the end of a line. That block can be terminated in two ways:
with an end if line, without an else clause. This is treated as a "matched" case, since it clearly could not be extended with an else clause.
with an else clause (which could be a single line else or a block else, where the else token is at the end of the line). But here there is a possible ambiguity; if the last statement in the block is an unmatched if, then an else line should extend that statement, rather than terminating the block. That's not really different from the rest of the matched/unmatched logic; to implement it, I created two different block non-terminals, one ending with a matched statement and the other ending with an unmatched statement. And then, as usual, only the matched block can be used before an else.
(I found the new counterexample generator in bison 3.7.6 extremely helpful here; my initial attempt just used block because I'd failed to notice the ambiguity. But it is a real ambiguity, and it led to a shift-reduce conflict whose origins seemed mysterious. Once I saw the counterexample produced by the counterexample generator -- which showed the conflict happening inside a block following an if-then -- the problem became a lot more evident.)
The alternation between matched-block and unmatched-block is a simple example of the correspondence between grammar productions and state machines. The two non-terminals represent the two states in a very simple state machine, whose state records a single bit: whether or not the last statement was matched. The non-terminals must be right-recursive for this to work, which is a deviation from the usual "prefer left-recursion" heuristic for building LALR(1) grammars.
OK, with that overlong preamble, here's the grammar. In the interests of compactification, I simplified expressions down to just variables and boolean constants, included only one simple statement (do expr) and included only one other compound statement (repeat until expr / block / end repeat). (The last one is there as a placeholder.)
program : block
block : %empty
| matched-block
| unmatched-block
NL : '\n'
| NL '\n'
matched-block
: block matched
unmatched-block
: block unmatched
simple-statement
: "do" expression
repeat-statement
: "repeat" "until" expression NL block "end" "repeat" NL
matched : if-then matched else-matched
| if-then inner-matched else-matched
| if-then NL matched-block else-matched
| if-then NL else-matched
| if-then NL block "end" "if" NL
| repeat-statement
| simple-statement NL
inner-matched
: %empty
| simple-statement
| if-then inner-matched "else" inner-matched
unmatched
: if-then matched
| if-then unmatched
| if-then inner-matched "else" unmatched
| if-then matched "else" unmatched
if-then : "if" expression NL "then"
| "if" expression "then"
else-matched
: "else" NL block "end" "if" NL
| "else" matched
expression
: ID
| "true"
| "false"
Previous answer (to original question, only visible in the edit history)
There is an obvious ambiguity between
ifCondition THEN statement EOL ELSE statement
and
ifCondition THEN EOL statementList ELSE statement
Recall that
statement: %empty
statementList: statement
with the result that both statement and statementList can derive the empty sequence. So both of the above productions for ifStatement can derive:
ifCondition THEN EOL ELSE statement
The parser has no way to know whether there is an empty statement before the EOL or an empty statementList after it. (You might not care which of these is chosen but parsers obsess about this kind of decision.)
Nullable productions are often problematic. Where possible, avoid them. Instead of letting statement derive empty, indicate explicitly where an empty statement might go by adding a rule where the optional statement is omitted. And consider rewriting statementList so that it must end with an EOL, which I think was your intention anyway (but perhaps I'm wrong).

Validating a "break" statement with a recursive descent parser

In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
One of the exercises is to add a break statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if statements etc. if those are inside a loop.
My first approach was to create a new rule, whileBody, to accept break:
## FIRST TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break ;
break → "break" ";" ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
But we have to accept break inside nested loops, if conditionals, etc. As far as I can tell, I'd need new rules for blocks and conditionals that accept break:
## SECOND TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break
| whileBlock
| whileIfStmt
whileBlock→ "{" (declaration | break)* "}" ;
whileIfStmt → "if" "(" expression ")" whileBody ( "else" whileBody )? ;
break → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
It is not infeasible for now, but it will become cumbersome once the language grows; it is boring and error-prone to write even today!
I looked for inspiration in the C and Java BNF specifications. Apparently, neither specification prohibits break outside a loop. I guess their parsers have ad hoc code to prevent that. So I followed suit and added code to the parser to prevent break outside loops.
TL;DR
My questions are:
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Is there a more practical way to bake the break command inside a syntax specification?
Or the standard way is indeed to change a parser to prevent breaks outside loops while parsing?
Attribute grammars are good at this sort of thing. Define an inherited attribute (I'll call it LC for loop count). The 'program' non-terminal passes LC = 0 to its children; loops pass LC = $LC + 1 to their children; all other constructs pass LC = $LC to their children. Make the rule for 'break' syntactically valid only if $LC > 0.
There is no standard syntax for attribute grammars, or for using attribute values in guards (as I am suggesting for 'break'), but using Prolog definite-clause grammar notation your grammar might look something like the following. I have added a few notes on DCG notation, in case it's been too long since you have used them.
/* nt(X) means, roughly, pass the value X as an inherited attribute.
** In a recursive-descent system, it can be passed as a parameter.
** N.B. in definite-clause grammars, semicolon separates alternatives,
** and full stop ends a rule.
*/
/* DCG doesn't have regular-right-part rules, so we have to
** handle repetition via recursion.
*/
program -->
statement(0);
statement(0), program.
statement(LC) -->
exprStmt(LC);
ifStmt(LC);
printStmt(LC);
whileStmt(LC);
block(LC);
break(LC).
block(LC) -->
"{", star-declaration(LC), "}".
/* The notation [] denotes the empty list, and matches zero
** tokens in the input.
*/
star-declaration(LC) -->
[];
declaration(LC), star-declaration(LC).
/* On the RHS of a rule, braces { ... } contain Prolog code. Here,
** the code "LC2 is LC + 1" adds 1 to LC and binds LC2 to that value.
*/
whileStmt(LC) -->
{ LC2 is LC + 1 }, "while", "(", expression(LC2), ")", statement(LC2).
ifStmt(LC) --> "if", "(", expression(LC), ")", statement(LC), opt-else(LC).
opt-else(LC) -->
"else", statement(LC);
[].
/* The definition of break checks the value of the loop count:
** "LC > 0" succeeds if LC is greater than zero, and allows the
** parse to succeed. If LC is not greater than zero, the expression
** fails. And since there is no other rule for 'break', any attempt
** to parse a 'break' rule when LC = 0 will fail.
*/
break(LC) --> { LC > 0 }, "break", ";".
Nice introductions to attribute grammars can be found in Grune and Jacobs, Parsing Techniques, and in the Springer volumes Lecture Notes in Computer Science 461 (Attribute Grammars and Their Applications, ed. P. Deransart and M. Jourdan) and 545 (Attribute Grammars, Applications, and Systems, ed. H. Alblas and B. Melichar).
The technique of duplicating some productions in order to distinguish two situations (am I in a loop? or not?), as illustrated in the answer by @rici, can be regarded as a way of pushing a Boolean attribute into the non-terminal names.
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Sure. But you need a lot of duplication. Since while is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B to the name of non-terminals which might include break statements.
declaration → varDecl
| statement
declaration_B → varDecl
| statement_B
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block
statement_B → exprStmt
| printStmt
| whileStmt
| breakStmt
| ifStmt_B
| block_B
breakStmt → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt → "while" "(" expression ")" statement_B ;
block → "{" declaration* "}"
block_B → "{" declaration_B* "}"
Not all statement types need to be duplicated. Non-compound statements like exprStmt don't, because they cannot possibly include a break statement (or any other statement type). And the statement which is the target of a loop statement like whileStmt can always include break, regardless of whether the while was inside a loop or not.
Is there a more practical way to bake the break command inside a syntax specification?
Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.
Is there a different way to do this?
Since this is a top-down (recursive descent) parser, it's pretty straightforward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt would set that argument to True (or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False. The breakStmt implementation would just return failure if it is called with False.
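A minimal, self-contained sketch of that flag-passing approach in Python (the token shapes and statement kinds are invented for illustration; this is not the book's jlox parser):

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens      # flat list of token-type strings
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, kind):
        if self.peek() == kind:
            self.pos += 1
            return True
        return False

    def consume(self, kind):
        if not self.match(kind):
            raise ParseError(f"expected {kind}, got {self.peek()}")

    def statement(self, in_loop=False):
        if self.match("while"):
            self.consume("(")
            self.consume("cond")              # stand-in for a real expression
            self.consume(")")
            # The loop body may contain 'break', so the flag is forced to True.
            return ("while", self.statement(in_loop=True))
        if self.match("if"):
            self.consume("(")
            self.consume("cond")
            self.consume(")")
            then = self.statement(in_loop)    # 'if' passes the flag through
            other = self.statement(in_loop) if self.match("else") else None
            return ("if", then, other)
        if self.match("break"):
            if not in_loop:
                raise ParseError("'break' outside of a loop")
            self.consume(";")
            return ("break",)
        self.consume("expr")                  # stand-in for an expression statement
        self.consume(";")
        return ("expr",)

# break nested inside an if inside a while: accepted
print(Parser(["while", "(", "cond", ")",
              "if", "(", "cond", ")", "break", ";", "EOF"]).statement())
# break at the top level: rejected
try:
    Parser(["break", ";", "EOF"]).statement()
except ParseError as e:
    print(e)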

Antlr4 grammar - trouble identifying grammar

I'm working with Antlr4 to parse a boolean-like DSL.
Here is my grammar:
grammar filter;
filter: overall EOF;
overall
: LPAREN overall RPAREN
| category
;
category
: expression # InferenceCategory
| category AND category # CategoryAndBlock
| label COLON expression # CategoryBlock
| LPAREN category RPAREN # NestedCategory
;
expression
: NOT expression # NotExpr
| expression AND expression # AndExpr
| expression OR expression # OrExpr
| atom # AtomExpr
| LPAREN expression RPAREN # NestedExpression
;
label
: ALPHANUM
;
atom
: ALPHANUM
;
Here is an example input string to parse:
(cat1:(1 OR 2) AND cat2:( 4 ))
This grammar works fine with this input; it produces a parse tree which perfectly suits my needs.
However, there is a weird case of the DSL where the "cat1" label is implicit when no other category is specified. This is what the InferenceCategory tag catches; that expression will be handled as a category in my code later.
For example, with
((1 OR 2) AND cat2:( 4 ))
I get the expected tree: the (1 OR 2) block is identified as an InferenceCategory.
However, in the following instance:
cat2:( 4 ) AND (1 OR 2)
I get a different result: the second block, (1 OR 2), is not identified as an InferenceCategory but is instead treated as a normal expression under the first category. This is because the grammar parses ( 4 ) following cat2: as a normal expression, and everything after that is parsed as part of that same expression.
Is there any way to fix this? I've tried:
label COLON expression (AND category)* # CategoryBlock
(which doesn't work)
and
category AND category AND category
(which "works", but is extremely hacky and only works in the specific case that I have exactly three categories. Any more, and it breaks again.)
The "alternative labels" like NOT expression # NotExpr do not make a difference in your parse tree. They are semantic-only. They will cause the code generation process to create specific signatures that you can override in your Visitor or Listener.
The rationale behind this is that, instead of getting just one Visitor override for expression, you'll get several, one for each alternative label. That way, you don't have to examine expression and determine what type it is before acting on it. Instead, you'll get an override for # OrExpr, for example, and as soon as you're in that override code, you know you're dealing with an OR, with an expression on each side of the OR token.
The parse tree is useful, but much of the semantics only become apparent when you code up your Listener or Visitor.
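For example, assuming the grammar above is run through ANTLR's Python target (the module and method names below follow ANTLR's usual generation conventions, but they are an assumption, not code from the question), each labeled alternative becomes its own visit method:

# Hypothetical evaluator for the 'filter' grammar, assuming the generated
# filterParser / filterVisitor modules sit next to this file.
from filterVisitor import filterVisitor

class EvalVisitor(filterVisitor):
    # One override per labeled alternative, instead of a single
    # visitExpression that has to inspect the node before acting on it.
    def visitOrExpr(self, ctx):
        return self.visit(ctx.expression(0)) or self.visit(ctx.expression(1))

    def visitAndExpr(self, ctx):
        return self.visit(ctx.expression(0)) and self.visit(ctx.expression(1))

    def visitNotExpr(self, ctx):
        return not self.visit(ctx.expression())

    def visitNestedExpression(self, ctx):
        return self.visit(ctx.expression())

    def visitAtomExpr(self, ctx):
        # Stand-in: decide how an atom maps to a boolean in the DSL.
        return ctx.getText() != "0"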

Ambiguous call expression in ANTLR4 grammar

I have a simple grammar (for demonstration)
grammar Test;
program
: expression* EOF
;
expression
: Identifier
| expression '(' expression? ')'
| '(' expression ')'
;
Identifier
: [a-zA-Z_] [a-zA-Z_0-9?]*
;
WS
: [ \r\t\n]+ -> channel(HIDDEN)
;
Obviously the second and third alternatives in the expression rule are ambiguous. I want to resolve this ambiguity by permitting the second alternative only if an expression is immediately followed by a '('.
So the following
bar(foo)
should match the second alternative while
bar
(foo)
should match the 1st and 3rd alternatives (even if the token between them is in the HIDDEN channel).
How can I do that? I have seen these ambiguities, between call expressions and parenthesized expressions, present in languages that have no (or have optional) expression terminator tokens (or rules) - example
The solution to this is to temporarily "unhide" whitespace in your second alternative. Have a look at this question for how this can be done.
With that solution your code could look something like this:
expression
: Identifier
| {enableWS();} expression '(' {disableWS();} expression? ')'
| '(' expression ')'
;
That way the second alternative matches the input whitespace-sensitively and will therefore only be matched if the identifier is directly followed by the opening parenthesis.
See here for the implementation of the MultiChannelTokenStream that is mentioned in the linked question.
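If you end up hand-rolling something instead, the property being exploited can also be expressed as a plain adjacency test: the '(' only starts a call when no hidden-channel (whitespace) token sits between it and the preceding expression. A toy sketch of that check in Python (the token tuples are made up; this is not ANTLR's API and not the MultiChannelTokenStream approach):

# Toy tokens: (type, text, channel); channel 1 plays the role of HIDDEN.
def is_call_paren(tokens, lparen_index):
    # The '(' starts a call suffix only if the token immediately before it is
    # on the default channel, i.e. there is no hidden whitespace/newline token
    # in between.
    return lparen_index > 0 and tokens[lparen_index - 1][2] == 0

tokens = [("ID", "bar", 0), ("WS", "\n", 1), ("LPAREN", "(", 0),
          ("ID", "foo", 0), ("RPAREN", ")", 0)]
print(is_call_paren(tokens, 2))   # False: bar and (foo) are separate expressions

tokens = [("ID", "bar", 0), ("LPAREN", "(", 0),
          ("ID", "foo", 0), ("RPAREN", ")", 0)]
print(is_call_paren(tokens, 1))   # True: bar(foo) is a call expression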

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that. Here are the rules involved:
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state. I would clearly like the parser to backtrack and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN instead; then things would work, because the ; would match the var statement syntax all right.
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision: once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)
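Since the question uses PLY, here is a rough ply.yacc transcription of that outline (grammar skeleton only, no semantic actions; the token names are assumed to match the original lexer, which isn't shown):

import ply.yacc as yacc

# Token names assumed to match the original lexer.
tokens = ('NAME', 'SUPER', 'NEW', 'VAR', 'DOT', 'COMMA', 'SEMIC', 'EQUALS',
          'LPAREN', 'RPAREN', 'OPSQR', 'CLSQR')

def p_primary(p):
    '''primary : NAME
               | SUPER'''

def p_postfixed(p):
    '''postfixed : primary
                 | postfixed DOT NAME
                 | postfixed OPSQR expression CLSQR
                 | postfixed LPAREN call_args RPAREN'''

def p_expression(p):
    '''expression : postfixed
                  | NEW postfixed LPAREN call_args RPAREN'''

def p_call_args(p):
    '''call_args : empty
                 | expression_list'''

def p_expression_list(p):
    '''expression_list : expression
                       | expression_list COMMA expression'''

def p_var_decl(p):
    '''var_decl : NAME
                | NAME EQUALS expression'''

def p_var_decl_list(p):
    '''var_decl_list : var_decl
                     | var_decl_list COMMA var_decl'''

def p_statement(p):
    '''statement : VAR var_decl_list SEMIC'''

def p_empty(p):
    'empty :'

def p_error(p):
    raise SyntaxError(f"unexpected token {p!r}")

# Building the tables will report any shift/reduce or reduce/reduce conflicts.
parser = yacc.yacc(start='statement')

If the table build stays quiet, the var x = new SomeGuy.SomeOtherGuy(); case is decided by a single token of lookahead at the closing parenthesis, exactly as described above.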
