GNU Bison shift/reduce conflicts in an indentation-based grammar describing hierarchical expressions

I haven't used Bison for a long time, so I may be missing something simple here, but I cannot figure out why the following grammar produces shift/reduce conflicts. I believe the grammar is not ambiguous. Its purpose is to parse expressions like:
a b
    c d
        e f
    g h
as (in pseudo-AST):
App
    (App a b)
    (Seq
        [ App
            (App c d)
            (Seq [App e f])
        , (App g h)
        ]
    )
The grammar:
%token <Token> VAR
%token <Token> EOL
%token <Token> INDENT_INC
%token <Token> INDENT_DEC
%token <AST> CONS
%token <AST> WILDCARD
%type <AST> expr
%type <AST> subExpr
%type <AST> block
%type <AST> tok
%start program
%%
program:
    expr                            { result = $1; }
expr:
    subExpr                         { $$ = $1; }
  | subExpr EOL INDENT_INC block    { $$ = AST.app($1, $4); } // block is $4; $3 is the INDENT_INC token
subExpr:
    tok                             { $$ = $1; }
  | subExpr tok                     { $$ = AST.app($1, $2); }
block:
    expr                            { $$ = $1; }
  | block EOL expr                  { $$ = AST.seq($1, $3); } // causes error
tok:
    VAR                             { $$ = AST.fromToken($1); }
%%
The error is just 2 shift/reduce conflicts. When debugging the parser, we can observe:
Grammar
0 $accept: program $end
1 program: expr
2 expr: subExpr
3 | subExpr EOL INDENT_INC block
4 subExpr: tok
5 | subExpr tok
6 block: expr
7 | block EOL expr
8 tok: VAR
[...]
State 4
2 expr: subExpr .
3 | subExpr . EOL INDENT_INC block
5 subExpr: subExpr . tok
VAR shift, and go to state 1
EOL shift, and go to state 7
EOL [reduce using rule 2 (expr)]
$default reduce using rule 2 (expr)
tok go to state 8
[...]
State 11
3 expr: subExpr EOL INDENT_INC block .
7 block: block . EOL expr
EOL shift, and go to state 12
EOL [reduce using rule 3 (expr)]
$default reduce using rule 3 (expr)
To be honest, I don't see where the ambiguity comes from. I'd be thankful for any help on how to remove the conflicts in such a grammar.

Your grammar does not use INDENT_DEC; without that, you cannot know where an indented block ends.
In effect, that's what those shift/reduce conflicts are telling you. Since the grammar doesn't see INDENT_DEC, it cannot distinguish between the EOL which separates two exprs in the same block and the EOL which terminates a block. Thus, an EOL is ambiguous (provided at least one INDENT_INC has been seen).
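The question does not show its lexer, so the following is only a minimal sketch of a hypothetical one (the function name `tokenize` and the space-based indentation rule are assumptions). It illustrates why INDENT_DEC matters: the end of every block becomes an explicit token in the stream instead of something the parser has to guess from an EOL.

```python
def tokenize(text):
    """Turn indented source text into a flat list of token kinds (hypothetical lexer)."""
    tokens = []
    indents = [0]                       # stack of currently open indentation widths
    for line in text.splitlines():
        if not line.strip():
            continue                    # blank lines carry no structure
        width = len(line) - len(line.lstrip(" "))
        if width > indents[-1]:         # deeper than before: a block opens
            indents.append(width)
            tokens.append("INDENT_INC")
        while width < indents[-1]:      # shallower: close every block we left
            indents.pop()
            tokens.append("INDENT_DEC")
        tokens.extend("VAR" for _ in line.split())
        tokens.append("EOL")
    # Close the blocks that are still open at end of input.
    tokens.extend("INDENT_DEC" for _ in indents[1:])
    return tokens


print(tokenize("a b\n  c d\n    e f\n  g h\n"))
```

The sketch does not check for inconsistent dedents (a width that matches no open block); a real lexer would report those as errors.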
Here's a simple demonstration of ambiguity. The expression to parse is:
a EOL INDENT_INC b EOL INDENT_INC c EOL d
Here are two derivations, which differ in where d is nested (I condensed the subExpr ⇒ tok ⇒ VAR path into TOK for simplicity):
# Here, d belongs to the outer subexpr (effectively, a single indent)
expr ⇒ subexpr EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC block EOL expr
⇒ TOK (a) EOL INDENT_INC expr EOL expr
⇒ TOK (a) EOL INDENT_INC subexpr EOL INDENT_INC block EOL expr
⇒ TOK (a) EOL INDENT_INC subexpr EOL INDENT_INC expr EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC subexpr EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL subexpr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL TOK (d)
# Here, d belongs to the inner subexpr (effectively two indents)
expr ⇒ subexpr EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC expr
⇒ TOK (a) EOL INDENT_INC subexpr EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC block
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC block EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC expr EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC subexpr EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL expr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL subexpr
⇒ TOK (a) EOL INDENT_INC TOK (b) EOL INDENT_INC TOK (c) EOL TOK (d)
So the grammar really is ambiguous. But the shift/reduce conflicts don't directly point at the ambiguity. They point at the problem of deciding whether or not to reduce the construct before the EOL without seeing the symbol following the EOL. This is the essence of the LR(1) restriction: every reduction must be made before shifting the next symbol, so even if a grammar would be unambiguous given enough lookahead, it will still have shift/reduce conflicts if the reduction decision could go either way.
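The two derivations can be cross-checked mechanically. The sketch below is not how Bison works; it is a small memoized counter (the name `count_parses` is made up for this illustration) that counts how many distinct parse trees the question's grammar admits for a given token sequence, with tok collapsed to VAR.

```python
from functools import lru_cache


def count_parses(tokens):
    """Count distinct parse trees of `tokens` as an `expr` in the question's grammar."""
    toks = tuple(tokens)

    def sub_expr(i, j):
        # subExpr: tok | subExpr tok  -> exactly one tree for a non-empty run of VARs
        return int(j > i and all(t == "VAR" for t in toks[i:j]))

    @lru_cache(maxsize=None)
    def expr(i, j):
        # expr: subExpr | subExpr EOL INDENT_INC block
        total = sub_expr(i, j)
        for k in range(i + 1, j - 1):
            if toks[k] == "EOL" and toks[k + 1] == "INDENT_INC":
                total += sub_expr(i, k) * block(k + 2, j)
        return total

    @lru_cache(maxsize=None)
    def block(i, j):
        # block: expr | block EOL expr  (try every EOL as the separator)
        total = expr(i, j)
        for k in range(i + 1, j - 1):
            if toks[k] == "EOL":
                total += block(i, k) * expr(k + 1, j)
        return total

    return expr(0, len(toks))


# a EOL INDENT_INC b EOL INDENT_INC c EOL d
ambiguous = "VAR EOL INDENT_INC VAR EOL INDENT_INC VAR EOL VAR".split()
print(count_parses(ambiguous))   # 2: d can attach to either block
```

With only one level of nesting (`VAR EOL INDENT_INC VAR EOL VAR`) the count is 1, which matches the observation that an EOL only becomes ambiguous once more than one block is open.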

Related

Does ANTLR automatically factor top-level alternates?

I have written the following two grammars, one grouping the arithmetic expressions (where possible) and another that doesn't:
grammar NoPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| Atom
;
Atom: [a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
grammar YesPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| Atom
;
Atom:[a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
It seems that these two have almost identical runtimes, build sizes, etc. Does ANTLR automatically convert the two forms of alternatives to the same output, for example:
expr: expr '*' expr | expr '/' expr <==> expr: expr ('*'|'/') expr;
No. How would ANTLR know that you wanted * and / to have the same binding precedence, different from + and -? You need to be explicit about that.
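To see why the two grammars are not equivalent, recall that ANTLR 4 gives earlier alternatives of a left-recursive rule higher precedence. The sketch below is not ANTLR code; it is a small precedence-climbing parser in Python, with the two precedence tables (`NO_PREFIX`, `YES_PREFIX`, both names made up here) standing in for the two grammars, and with every operator assumed to be left-associative.

```python
def parse(tokens, precedence):
    """Precedence-climbing parser over a token list; operators are left-associative."""
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr(min_prec):
        nonlocal pos
        left = atom()
        # Keep absorbing operators that bind at least as tightly as min_prec.
        while pos < len(tokens) and precedence.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            right = expr(precedence[op] + 1)   # +1 makes the operator left-associative
            left = (op, left, right)
        return left

    return expr(1)


# NoPrefix: one alternative per operator, so four distinct levels.
NO_PREFIX = {"*": 4, "/": 3, "+": 2, "-": 1}
# YesPrefix: grouped alternatives, so two levels.
YES_PREFIX = {"*": 2, "/": 2, "+": 1, "-": 1}

print(parse("a - b + c".split(), NO_PREFIX))    # ('-', 'a', ('+', 'b', 'c'))
print(parse("a - b + c".split(), YES_PREFIX))   # ('+', ('-', 'a', 'b'), 'c')
```

Under the first table `a - b + c` groups as `a - (b + c)`, which is rarely what is intended; grouping the operators into one alternative is what puts them on the same level.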

Implementation of a context-free grammar for logical operators with parentheses

I'm trying to implement a context-free grammar for the language of logical operators with parentheses including operator precedence.
For example, the following:
{1} or {2}
{1} and {2} or {3}
({1} or {2}) and {3}
not {1} and ({2} or {3})
...
I've started from the following grammar:
expr := control | expr and expr | expr or expr | not expr | (expr)
control := {d+}
To implement operator precedence and eliminate left recursion, I changed it in the following way:
S ::= expr
expr ::= control or expr1 | expr1
expr1 ::= control and expr2 | expr2
expr2 ::= not expr3 | expr3
expr3 ::= (expr) | expr | control
control := {d+}
But this grammar doesn't support examples like ({1} or {2}) and {3}, which contain 'and' / 'or' after parentheses.
For now, I have the following grammar:
S ::= expr
expr ::= control or expr1 | expr1
expr1 ::= control and expr2 | expr2
expr2 ::= not expr3 | expr3
expr3 ::= (expr) | (expr) expr4 | expr | control
expr4 ::= and expr | or expr
control := {d+}
Is this grammar correct?
Can it be simplified in some way?
Thanks!
To implement your operator precedence, you want just:
S ::= expr
expr ::= expr or expr1 | expr1
expr1 ::= expr1 and expr2 | expr2
expr2 ::= not expr2 | expr3
expr3 ::= (expr) | control
control := {d+}
This is left-recursive in order to be left-associative, as that's generally what you want (both for correctness and for most parser generators), but if you need to avoid left recursion for some reason, you can use a right-recursive, right-associative grammar:
S ::= expr
expr ::= expr1 or expr | expr1
expr1 ::= expr2 and expr1 | expr2
expr2 ::= not expr2 | expr3
expr3 ::= (expr) | control
control := {d+}
Both and and or are associative operators, so the choice between left and right associativity doesn't change the result.
In both cases, you can "simplify" it by folding expr3 and control into expr2:
expr2 ::= not expr2 | ( expr ) | {d+}
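For a hand-written parser, the left-recursive grammar above can be implemented without left recursion by turning each recursive rule into a loop. The following is a minimal sketch, assuming that `{d+}` means one or more digits in braces and that the result should be a nested tuple; the function names are made up for the illustration.

```python
import re

TOKEN = re.compile(r"\s*(\{\d+\}|\(|\)|and\b|or\b|not\b)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def expr():             # expr ::= expr or expr1 | expr1
        node = expr1()
        while peek() == "or":
            take()
            node = ("or", node, expr1())     # the loop builds a left-leaning tree
        return node

    def expr1():            # expr1 ::= expr1 and expr2 | expr2
        node = expr2()
        while peek() == "and":
            take()
            node = ("and", node, expr2())
        return node

    def expr2():            # expr2 ::= not expr2 | ( expr ) | {d+}
        tok = take()
        if tok == "not":
            return ("not", expr2())
        if tok == "(":
            node = expr()
            take(")")
            return node
        if tok.startswith("{"):
            return tok[1:-1]                 # strip the braces, keep the digits
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing input: {peek()!r}")
    return tree


print(parse("({1} or {2}) and {3}"))
print(parse("not {1} and ({2} or {3})"))
```

The `while` loops play the role of the left-recursive alternatives, so the operators stay left-associative even though no function calls itself first.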

Yacc grammar expressions, conflicts

Can someone identify where the grammar conflict is in this expression production?
expr '+' expr
|
expr '-' expr
|
expr '*' expr
|
expr '/' expr
|
expr '(' ')'
|
T_IDENTIFIER
|
T_STRING_LITERAL
|
T_INTEGER_LITERAL
|
T_FLOAT_LITERAL
I'm trying to implement function calls taking an expr as the operand, so for example, the following would be valid grammar:
1()
1.5()
"STRING"()
fn()

Is Lemon correctly handling nonassoc precedence?

I feel like the Lemon parser generator is doing it wrong with nonassoc precedence. I have a simplified grammar that exhibits the problems I'm seeing.
%nonassoc EQ.
%left PLUS.
stmt ::= expr.
expr ::= expr EQ expr.
expr ::= expr PLUS expr.
expr ::= IDENTIFIER.
Yields a report with a conflict like so:
State 4:
expr ::= expr * EQ expr
(1) expr ::= expr EQ expr *
expr ::= expr * PLUS expr
EQ shift 2
EQ reduce 1 ** Parsing conflict **
PLUS shift 1
{default} reduce 1
If I tell it that equals is left associative, the problem goes away. It's as if nonassoc doesn't put the rule into the precedence set. Comparing to a Bison version of that grammar, there is no conflict. And assignment really should be nonassociative. I'd rather not lie to it about that to work around this.
After spending some time poring over the reports generated by both Lemon and Bison for the associated grammars, I can only conclude that Lemon is, indeed, mishandling nonassoc precedence. The smoking gun is contained in that state 4 quoted above, but I should probably lay out some more detail for clarity.
The states that build up to expr EQ are straightforward. You arrive at state 2 then:
State 2:
expr ::= * expr EQ expr
expr ::= expr EQ * expr
expr ::= * expr PLUS expr
expr ::= * IDENTIFIER
IDENTIFIER shift 5
expr shift 4
This state contains the current expr EQ item, which expects to be followed by another expr. Because of that, it contains the First set for expr, which are the 3 entries starting with * in the state. If we read an expr in this state, we'll land in state 4 with an item either partway through the reduction or at the end.
expr ::= expr * EQ expr
expr ::= expr EQ expr *
What happens if we read an EQ in this state? I told Lemon the answer. It's an error because EQ is nonassociative. Instead it reports a shift/reduce conflict. In practice, it will shift, which will let it accept an illegal parse, such as x=y=z.
Bison contains these same states, numbered differently, but with a telling distinction.
state 8
2 expr: expr . EQ expr [$end, PLUS]
2 | expr EQ expr . [$end, PLUS]
3 | expr . PLUS expr
EQ error (nonassociative)
$default reduce using rule 2 (expr)
Conflict between rule 2 and token PLUS resolved as reduce (PLUS < EQ).
Conflict between rule 2 and token EQ resolved as an error (%nonassoc EQ).
Bison knows what nonassociative means, and uses that to eliminate the supposed ambiguity if it sees a second EQ in an expression.
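The rule Bison applies here can be written down compactly. The sketch below is a plain-Python restatement of yacc-style precedence resolution, not code from either tool; the function name `resolve` and the table layout are assumptions made for the illustration.

```python
def resolve(rule_token, lookahead, precedence):
    """Resolve a shift/reduce conflict with yacc-style precedence rules.

    `precedence` maps a token to (level, associativity). The rule being
    reduced takes the precedence of `rule_token`, its last terminal.
    """
    if rule_token not in precedence or lookahead not in precedence:
        return "conflict"               # no precedence information: reported to the user
    rule_level, _ = precedence[rule_token]
    look_level, assoc = precedence[lookahead]
    if rule_level > look_level:
        return "reduce"
    if rule_level < look_level:
        return "shift"
    # Same level: the associativity of the token decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[assoc]


# Declarations from the question: later declarations bind tighter.
table = {"EQ": (1, "nonassoc"), "PLUS": (2, "left")}

print(resolve("EQ", "EQ", table))       # error: x = y = z is rejected
print(resolve("PLUS", "PLUS", table))   # reduce: x + y + z groups to the left
```

Declaring EQ as `left` instead turns the first result into `reduce`, which is why that workaround silences the conflict while also accepting x=y=z.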

How to fix YACC shift/reduce conflicts from post-increment operator?

I'm writing a grammar in YACC (actually Bison), and I'm having a shift/reduce problem. It results from including the postfix increment and decrement operators. Here is a trimmed down version of the grammar:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%right PREINC
%left POSTINC
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr %prec PREINC
| DEC expr %prec PREINC
| expr INC %prec POSTINC
| expr DEC %prec POSTINC
| '(' expr ')'
;
%%
Bison tells me there are 12 shift/reduce conflicts, but if I comment out the lines for the postfix increment and decrement, it works fine. Does anyone know how to fix this conflict? At this point, I'm considering moving to an LL(k) parser generator, which makes it much easier, but LALR grammars have always seemed much more natural to write. I'm also considering GLR, but I don't know of any good C/C++ GLR parser generators.
Bison/Yacc can generate a GLR parser if you specify %glr-parser in the option section.
Try this:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%nonassoc INC DEC
%left '('
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr
| DEC expr
| expr INC
| expr DEC
| '(' expr ')'
;
%%
The key is to declare the postfix operators as nonassociative. Otherwise you would be able to write:
++var++--
The parentheses also need to be given a precedence to minimize shift/reduce warnings.
I like to define more items. You shouldn't need the %left, %right, %prec stuff.
simple_expr: NUMBER
| INC simple_expr
| DEC simple_expr
| '(' expr ')'
;
term: simple_expr
| term '*' simple_expr
| term '/' simple_expr
;
expr: term
| expr '+' term
| expr '-' term
;
Play around with this approach.
The basic problem is that you don't have a precedence for the INC and DEC tokens, so Bison doesn't know how to resolve ambiguities involving a lookahead of INC or DEC. If you add
%right INC DEC
at the end of the precedence list (you want unaries to be higher precedence and postfix higher than prefix), it will fix it, and you can even get rid of all the PREINC/POSTINC stuff, as it's irrelevant.
Pre-increment and post-increment operators are nonassociative, so declare them with %nonassoc in the precedence section, and give their rules a high precedence with %prec.

Resources