Grammar for if statements in newline sensitive language - parsing

I'm working on a language that is meant to read much like English, and having issues with the grammar for if statements. In case you are curious, the language is inspired by HyperTalk, so I'm trying to make sure I match all the valid constructs in that language. The sample input I'm using that demonstrates all the possible if constructs can be viewed here. There are a lot, so I didn't want to inline the code.
I've removed most other constructs from the grammar to make it a bit easier to read, but basically statements look like this:
start
: statementList
;
statementList
: '\n'
| statement '\n'
| statementList '\n'
| statementList statement '\n'
;
statement
: ID
| ifStatement
;
The shift/reduce conflicts I'm seeing are in the ifStatement rules:
ifStatement
: ifCondition THEN statement
| ifCondition THEN statement ELSE statement
| ifCondition THEN statement ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList END IF
| ifCondition THEN '\n' END IF
| ifCondition THEN '\n' ELSE statement
| ifCondition THEN '\n' ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList ELSE statement
| ifCondition THEN '\n' statementList ELSE '\n' statementList END IF
// The following rules cause issues, but should be legal:
| ifCondition THEN statement newlines ELSE statement
| ifCondition THEN statement newlines ELSE '\n' statementList END IF
;
ifCondition
: IF expression
| IF expression '\n'
;
expression
: TRUE
| FALSE
;
newlines
: '\n'
| newlines '\n'
;
The problem is that I need to support this construct:
if true then statement # <- Any number of newlines
else statement
The problem (as I understand it) is that there isn't enough context to correctly determine whether to shift the else, or reduce just the if true then statement part without knowing what comes later (the end of the statement list, or another statement). Is this even parseable?
I have gists for the parser, scanner, and sample input to try out.

Getting this right is surprisingly difficult, so I've tried to annotate the steps. There are a lot of annoying details.
At its core, this is just a manifestation of the dangling else ambiguity, whose resolution is pretty well-known (force the parser to always shift the else). The solution below resolves the ambiguity in the grammar itself, which is unambiguous.
The basic principle that I've used here is the one outlined several decades ago in Principles of Compiler Design by Alfred Aho and Jeffrey Ullman (the so-called "Dragon book", which I mention since its authors were recently granted the Turing award precisely for that and their other influential works). In particular, I use the terms "matched" and "unmatched" (rather than "open" and "closed", which are also popular) because that's the way I learned it.
It is also possible to solve this grammar problem using precedence declarations; indeed, that often turns out to be much simpler. But in this particular case, it's not easy to work with operator precedence because the relevant token (the else) can be preceded by an arbitrary number of newline tokens. I'm pretty sure you could still construct a precedence-based solution, but there are advantages to using an unambiguous grammar, including the ease of porting to a parser generator which doesn't use the same precedence algorithm, and the fact that it is possible to analyze mechanically.
The basic outline of the solution is to divide all statements into two categories:
"matched" (or "closed") statements, which are complete in the sense that it is not possible to extend the statement with an else clause. (In other words, every if…then is matched by a corresponding else.) These
"unmatched" (or "open") statements, which could have been extended with an else clause. (In other words, at least one if…then clause is not matched by an else.) Since the unmatched statement is a complete statement, it cannot be immediately followed by an else token; had an else token appeared, it would have served to extend the statement.
Once we manage to construct grammars for these two categories of statement, it's only necessary to figure out which uses of statement in the ambiguous grammar can be followed by else. In all of these contexts, the non-terminal statement must be replaced with the non-terminal matched-statement, because only matched statements can be followed by else without interacting with it. In other contexts, where else could not be the next token, either category of statement is valid.
So the essential grammar style is (taken from the Dragon book):
stmt → matched_stmt
| unmatched_stmt
matched_stmt → "if" expr "then" matched_stmt "else" matched_stmt
| other_stmt
unmatched_stmt → "if" expr "then" matched_stmt "else" unmatched_stmt
| "if" expr "then" stmt
other_stmt is anything other than a conditional statement. Or, to be more precise, anything other than a compound statement which precisely ends with a stmt.
In Hypertalk, as far as I know, if statements are the only compound statements which can end with a statement. Other compound statements are precisely terminated with an end X, which effectively closes the statement. But in other languages, such as C, there are a variety of compound statements, and most of these need to be divided into "matched" and "unmatched" depending precisely on whether their terminating substatement is (recursively) matched or unmatched.
One thing I want to note here, which is apparent from that outline grammar if you look at it a bit sideways, is that the if…then…else part of the if statement is grammatically similar to a bracketed prefix operator. That is, both matched_stmt and unmatched_stmt are similar to the right-recursive rule for unary minus:
unary → '-' unary
| atom
which in turn could be written in an Extended BNF dialect which allows Kleene stars as
unary → ('-')* atom
If we were to do that transformation to Aho&Ullman's grammar, we'd end up with:
if_then_else → "if" expr "then" matched_stmt "else"
matched_stmt → (if_then_else)* other_stmt
unmatched_stmt → (if_then_else)* "if" expr "then" stmt
That makes it reasonably clear how to implement this grammar with a top-down recursive-descent parser. (A bit of left-factoring is needed, but it still ends up being similar to the unary minus grammar.) I'm not planning on developing this thought further in this answer, but I think that the EBNF conversion helps guide the intuitions about how this grammar actually works to undangle the else.
It was also really helpful in figuring out how to deal with newlines. The key insight (for me) was that statements must end with a newline. The one exception is the condensed single-line version of the if command. But that exception only happens just before an else token (and only when the then which it matches in on the same line). In this grammar, that case is implemented with the inner-matched non-terminal, assisted by the fact that one-line statements (like do-statement) lack the terminating newline. The newline which terminates one-line statements is added in the recursive base case for matched (single-statement NL); that's the only place it needs to be handled. Multi-line compound statements are all defined with a terminating newline (see, for example, repeat-statement).
Most of the rest of the complications deal with the variety of syntactic forms. The only one which is really interesting is the handling of blocks after a then token at the end of a line. That block can be terminated in two ways:
with an end if line, without an else clause. This is treated as a "matched" case, since it clearly could not be extended with an else clause.
with an else clause (which could be a single line else or a block else, where the else token is at the end of the line). But here there is a possible ambiguity; if the last statement in the block is an unmatched if, then an else line should extend that statement, rather than terminating the block. That's not really different from the rest of the matched/unmatched logic; to implement it, I created two different block non-terminals, one ending with a matched statement and the other ending with an unmatched statement. And then, as usual, only the matched block can be used before an else.
(I found the new counterexample generator in bison 3.7.6 extremely helpful here; my initial attempt just used block because I'd failed to notice the ambiguity. But it is a real ambiguity, and it lead to a shift-reduce conflict whose origins seemed mysterious. Once I saw the counterexample produced by the counterexample generator -- which showed the conflict happening inside a block following an if-then -- the problem became a lot more evident.)
The alternation between matched-block and unmatched-block is a simple example of the correspondence between grammar productions and state machines. The two non-terminals represent the two states in a very simple state machine, whose state records a single bit: whether or not the last statement was matched. The non-terminals must be right-recursive for this to work, which is a deviation from the usual "prefer left-recursion" heuristic for building LALR(1) grammars.
OK, with that overlong preamble, here's the grammar. In the interests of compactification, I simplified expressions down to just variables and boolean constants, included only one simple statement (do expr) and included only one other compound statement (repeat until expr / block / end repeat). (The last one is there as a placeholder.)
program : block
block : %empty
| matched-block
| unmatched-block
NL : '\n'
| NL '\n'
matched-block
: block matched
unmatched-block
: block unmatched
simple-statement
: "do" expression
repeat-statement
: "repeat" "until" expression NL block "end" "repeat" NL
matched : if-then matched else-matched
| if-then inner-matched else-matched
| if-then NL matched-block else-matched
| if-then NL else-matched
| if-then NL block "end" "if" NL
| repeat-statement
| simple-statement NL
inner-matched
: %empty
| simple-statement
| if-then inner-matched "else" inner-matched
unmatched
: if-then matched
| if-then unmatched
| if-then inner-matched "else" unmatched
| if-then matched "else" unmatched
if-then : "if" expression NL "then"
| "if" expression "then"
else-matched
: "else" NL block "end" "if" NL
| "else" matched
expression
: ID
| "true"
| "false"
Previous answer (to original question, only visible in the edit history)
There is an obvious ambiguity between
ifCondition THEN statement EOL ELSE statement
and
ifCondition THEN EOL statementList ELSE statement
Recall that
statement: %empty
statementList: statement
with the result that both statement and statementList can derive the empty sequence. So both of the above productions for ifStatement can derive:
ifCondition THEN EOL ELSE statement
The parser has no way to know whether there is an empty statement before the EOL or an empty statementList after it. (You might not care which of these is chosen but parsers obsess about this kind of decision.)
Nullable productions are often problematic. Where possible, avoid them. Instead of letting statement derive empty, indicate explicitly where an empty statement might go by adding a rule where the optional statement is omitted. And consider rewriting statementList so that it must end with an EOL, which I think was your intention anyway (but perhaps I'm wrong).

Related

Solve shift-reduce conlict for optional newlines in C-style block statement

I want to create a grammar rule for a simplified version of a block statement in C, which matches a list of statements in braces with optional newlines at the beginning and end. Statements in this language are terminated by a newline character.
The optional newlines are so that block statements can span multiple as well as single lines. i.e, both
{ statement }
and
{
statement
}
should be supported.
Currently my rules are as follows:
BlockStmt:
'{' OptionalNewlines BlockStmtList OptionalNewlines '}';
OptionalNewlines:
OptionalNewlines '\n'
| %empty;
Empty blocks are also supported, which are basically blocks with just newlines in them and no statements. This is possible because BlockStmtList can reduce to %empty.
However for empty blocks with just newlines, this leads to a shift-reduce conflict as the newlines can be matched by both the beginning and the ending OptionalNewlines non-terminal.
How do I tell yacc to prioritise one of the OptionalNewlines in the case of an empty block with just newlines?
Unless you have a problem with blank lines in the middle of a block -- something which many of us like to do -- the simple solution is to just allow empty statements (that is, a statement consisting only of the newline terminator. If you do that, you can stop worrying about optional newlines and just use
BlockStmt: '{' BlockStmtList '}';
That's far and away the easiest. But if it doesn't work for you, read on.
In general, you cannot have a sequence of optional lists where two of the lists have the same elements. That leads to an ambiguity: if your grammar allows a* b* a* (using Kleene * for simplicity) and the input is a, there is no way to know whether the empty a* is before or after the empty b*. "Optional" elements are problematic in many situations; it's often necessary to expand empty non-terminals into multiple rules using non-optional elements:
BlockStmt: '{' '}'
| '{' NewLineList '}'
| '{' NewLineList StmtList OptionalNewLineList '}'
| '{' StmtList OptionalNewLineList '}'

parsing maxscript - problem with newlines

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.
My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction term: term '+' prod. State 16 applies at the top-level, and thus the lookahead includes newline but not ) Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

How to correctly parse a VB Case statement?

I'm trying to parse VBA code, and the 5.4.2.10 section of the spec defines the Select Case statement, which we've defined as follows:
// 5.4.2.10 Select Case Statement
selectCaseStmt :
SELECT whiteSpace? CASE whiteSpace? selectExpression endOfStatement
caseClause*
caseElseClause?
END_SELECT
;
selectExpression : expression;
caseClause :
CASE whiteSpace rangeClause (whiteSpace? COMMA whiteSpace? rangeClause)* endOfStatement block
;
caseElseClause : CASE whiteSpace? ELSE endOfStatement block;
rangeClause :
expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
;
selectStartValue : expression;
selectEndValue : expression;
The problem is that the expression in rangeClause is taking precedence, and makes this:
Select Case foo
Case Is = 42
Exit Sub
End Select
...ultimately get picked up and treated as {undeclared-variable} {EQ} {literal}, which is a problem, because Is ought to be a lexer token, not the LHS of a comparison expression:
expression whiteSpace? (EQ | NEQ | LT | GT | LEQ | GEQ | LIKE | IS) whiteSpace? expression # relationalOp
I tried reordering the alternatives so that the expression branch has lower precedence, like this:
rangeClause :
selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
| expression
;
But that broke the entire grammar in all kinds of ways (breaks ~1000 tests in my project), so instead I tried changing the rangeClause to this (removed optional tokens, because Is without = is actually illegal VBA code):
rangeClause :
expression (whiteSpace TO whiteSpace expression)? #caseFromTo
| (IS whiteSpace comparisonOperator whiteSpace)? expression #caseIs
;
And then working with CaseFromToContext and CaseIsContext classes in the code (had to, to keep it compiling), but again it broke ~1000 tests in my project.
Then I figured, "hey that's potentially ambiguous!" and turned it into this:
rangeClause :
expression whiteSpace TO whiteSpace expression #caseFromTo
| IS whiteSpace comparisonOperator whiteSpace expression #caseIs
| expression #caseExpr
;
...but no luck, same identical outcome.
How can I make the rangeClause understand this annoying Case Is = foobar syntax? I'm using ANTLR 4.3, but we're planning to upgrade to ANTLR 4.6 soon-ish.
If additional context is needed, the complete VBAParser.g4 grammar is on github.
Turns out that re-ordering actually does work, but in order to keep the ambiguity out of the parse, the IS whiteSpace comparisonOperator has to come first:
rangeClause :
(IS whiteSpace?)? comparisonOperator whiteSpace? expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| expression
The problem is with expression (and by extension selectStartValue and selectEndValue) which will recursively match Is = because comparisonOperator comparisonOperator is an expression match. There's probably some work that can be done to prevent comparisonOperator comparisonOperator from matching expression (it's never valid in VBA AFAIK), but the above works as a quick and dirty fix.
Basically all the above grammar does is ensure that the "invalid" comparisonOperator comparisonOperator matches as a rangeClause before it can be matched as an expression.

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

Bison Shift/Reduce conflict for simple grammar

I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing to a method call). The second is for accessing data on 'this'. The third rule is for calling methods, however I've removed the brackets and parameters since they are irrelevant to the problem. The originally grammar was clearly much larger than this, however this is the smallest part that generates the same error (1 Shift/Reduce conflict) - I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the grammar can only look one ahead. Therefore when you see a THIS then a ., are you in line 2(Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1).
The grammar could reduce THIS. to Expression. and then look for a TYPENAME or shift it to THIS. and look for a VARNAME, but it has to decide when it gets to the ..
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS. [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying it sees '.' and can reduce or it can shift. Reduce makes me anrgu sometimes because they are hard to fine. The shift is rule 3 and is obvious (but the output doesnt mention the rule #). The reduce where it see's '.' in this case is the line
| Expression '.' TYPENAME
When it goes to Expression it looks at the next letter (the '.') and goes in. Now it sees THIS | so when it gets to the end of that statement it expects '.' when it leaves or an error. However it sees THIS '.' while its between this and '.' (hence the dot in the out file) and it CAN reduce a rule so there is a path conflict. I believe you can use %glr-parser to allow it to try both but the more conflicts you have the more likely you'll either get unexpected output or an ambiguity error. I had ambiguity errors in the past. They are annoying to deal with especially if you dont remember what rule caused or affected them. it is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I cant think of a 'great' solution but this gives no conflicts
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS //trick is moving this AWAY so it doesnt reduce
rval:
THIS '.' VARNAME
Alternative you can make it reduce later by adding more to the rule so it doesnt reduce as soon or by adding a token after or before to make it clear which path to take or fails (remember, it must know BEFORE reducing ANY path)
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
rval:
THIS '#'
| THIS '.' VARNAME
%%
-edit- note if i want to do func param and type varname i cant because type according to the lexer func is a Var (which is A-Za-z09_) as well as type. param and varname are both var's as well so this will cause me a reduce/reduce conflict. You cant write this as what they are, only what they look like. So keep that in mind when writing. You'll have to write a token to differentiate the two or write it as one of the two but write additional logic in code (the part that is in { } on the right side of the rules) to check if it is a funcname or a type and handle both those case.

Resources