Parsing conflict in Lemon grammar

I am writing a parser for LaTeX mathematical formulas to convert them into MathML. So I wrote this grammar for Lemon.
%token BEGIN_GROUP END_GROUP MATH_SHIFT ALIGNMENT_TAB.
%token END_OF_LINE PARAMETER SUPERSCRIPT SUBSCRIPT.
%token SPACE LETTER DIGIT SYMBOL.
%token COMMAND COMMAND_LEFT COMMAND_RIGHT.
%token COMMAND_LIMITS COMMAND_NOLIMITS.
%token BEGIN_ENV END_ENV.
%token NBSP.
/* Some API */
document ::= list.
list ::= list element.
list ::= .
element ::= identifier(Id).
element ::= symbol(O).
element ::= number(Num).
identifier ::= LETTER.
symbol ::= SYMBOL.
number(N) ::= number DIGIT(D). /* Append digit */
number(N) ::= DIGIT(D). /* Init digits */
/* Lexer code */
This grammar is incomplete: it does not contain the main program code. This is the output from Lemon:
State 2:
(2) element ::= number *
number ::= number * DIGIT
DIGIT shift-reduce 3 number ::= number DIGIT
DIGIT reduce 2 ** Parsing conflict **
{default} reduce 2 element ::= number
This grammar produces one parsing conflict. How can I resolve this conflict?
I am writing my parser for the first time so I don't have enough experience to solve this problem.

Related

Parens in BNF, EBNF

I could capture a parenthetical group using something like:
expr ::= "(" <something> ")"
However, sometimes it's useful to use multiple levels of nesting, so it's (theoretically) possible to have more than one pair of parentheses as long as they match. For example:
>>> (1)+1
2
>>> (((((-1)))))+2
1
>>> ((2+2)+(1+1))
6
>>> (2+2))
SyntaxError: invalid syntax
Is there a way to specify a "matching-ness" in EBNF, or how is parenthetical-matching handled by most parsers?
In order to be able to match an arbitrary amount of anything (be it parentheses, operators, list items etc.) you need recursion (EBNF also features repetition operators that can be used instead of recursion in some cases, but not for constructs that need to be matched like parentheses).
For well-matched parentheses, the proper production is simply:
expr ::= "(" expr ")"
That's in addition to productions for other types of expressions, of course, so a complete grammar might look like this:
expr ::= "(" expr ")"
expr ::= NUMBER
expr ::= expr "+" expr
expr ::= expr "-" expr
expr ::= expr "*" expr
expr ::= expr "/" expr
Or for an unambiguous grammar:
expr ::= expr "+" multExpr
expr ::= expr "-" multExpr
expr ::= multExpr
multExpr ::= multExpr "*" primaryExpr
multExpr ::= multExpr "/" primaryExpr
multExpr ::= primaryExpr
primaryExpr ::= "(" expr ")"
primaryExpr ::= NUMBER
Also, how do you usually go about 'testing' that it is correct -- is there an online tool or something that can validate a syntax?
There are many parser generators that can accept some form of BNF- or EBNF-like notation and generate a parser from it. You can use one of those and then test whether the generated parser parses what you want it to. They're usually not available as online tools though. Also note that parser generators generally need the grammar to be unambiguous or you to add precedence declarations to disambiguate it.
also wouldn't infinite loop?
No. The exact mechanics depend on the parsing algorithm used of course, but if the character at the current input position is not an opening parenthesis, then clearly this isn't the right production to use and another one needs to be applied (or a syntax error raised if none of the productions apply).
Left recursion can cause infinite recursion when using top-down parsing algorithms (though in case of parser generators it's more likely that the grammar will either be rejected or in some cases automatically rewritten than that you get an actual infinite recursion or loop), but non-left recursion doesn't cause that kind of problem with any algorithm.
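To make the recursion concrete, here is a minimal hand-written recursive-descent sketch in Python for the unambiguous grammar above. It is only an illustration, not what a parser generator would emit: the names tokenize, Parser and evaluate are made up for this example, the left-recursive rules are rewritten as loops (the usual workaround for top-down parsing), and unary minus is not handled because the grammar above has no rule for it.

```python
import re

def tokenize(text):
    """Split the input into integer literals, operators and parentheses."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    # Reject anything the pattern skipped, apart from whitespace.
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise SyntaxError("unexpected character in input")
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr ::= expr ("+" | "-") multExpr, with left recursion turned into a loop
        value = self.mult_expr()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.mult_expr()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mult_expr(self):
        # multExpr ::= multExpr ("*" | "/") primaryExpr, again as a loop
        value = self.primary_expr()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.primary_expr()
            value = value * rhs if op == "*" else value / rhs
        return value

    def primary_expr(self):
        # primaryExpr ::= "(" expr ")" | NUMBER
        tok = self.take()
        if tok == "(":
            value = self.expr()  # recursion handles arbitrary nesting
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:  # leftover input, e.g. an unmatched ")"
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this sketch, evaluate("((2+2)+(1+1))") gives 6, while evaluate("(2+2))") raises SyntaxError because the extra ")" is left over after a complete expr has been parsed.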

reduce/reduce conflict in CUP

I am implementing a parser for a subset of Java using Java CUP.
The grammar is like
vardecl ::= type ID
type ::= ID | INT | FLOAT | ...
exp ::= ID | exp LBRACKET exp RBRACKET | ...
stmt ::= ID ASSIGN exp SEMI
This works fine, but when I add
stmt ::= ID ASSIGN exp SEMI
|ID LBRACKET exp RBRACKET ASSIGN exp SEMI
CUP won't work, the warnings are:
Warning : *** Shift/Reduce conflict found in state #122
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Reduce/Reduce conflict found in state #42
between type ::= identifier (*)
and exp ::= identifier (*)
under symbols: {}
Resolved in favor of the first production.
Warning : *** Shift/Reduce conflict found in state #42
between type ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #42
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
I think there are two problems:
1. type ::= ID and exp ::= ID: when the parser sees an ID, it wants to reduce it, but it doesn't know whether to reduce it to type or to exp.
2. stmt ::= ID LBRACKET exp RBRACKET ASSIGN exp SEMI is for assignment to an array element, such as arr[key] = value;
exp ::= exp LBRACKET exp RBRACKET is for an expression that reads an element from an array, such as arr[key]
So in the case of arr[key], when the parser sees arr, it knows that it is an ID, but it doesn't know whether it should shift or reduce it to exp.
However, I have no idea of how to fix this, please give me some advice if you have, thanks a lot.
Your analysis is correct. The grammar is LR(2) because declarations cannot be identified until the ] token is seen, which will be the second-next token from the ID which could be a type.
One simple solution is to hack the lexer to return [] as a single token when the brackets appear as consecutive tokens. (The lexer should probably allow whitespace between the brackets, too, so it's not quite trivial but it's not complicated.) If a [ is not immediately followed by a ], the lexer will return it as an ordinary [. That makes it easy for the parser to distinguish between assignment to an array (which will have a [ token) and declaration of an array (which will have a [] token).
It's also possible to rewrite the grammar, but that's a real nuisance.
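As a rough illustration of the lexer approach, here is a small Python sketch (not CUP or JFlex code). The token name LRBRACKET for the combined [] token is made up for this example, and the token set is reduced to what the snippets in the question need. The only point is the ordering: the pattern for "[", optional whitespace, "]" is tried before the pattern for a plain "[".

```python
import re

# Order matters: the combined "[]" pattern must come before the single "[".
TOKEN_SPEC = [
    ("LRBRACKET", r"\[\s*\]"),  # hypothetical name for the merged "[]" token
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("ASSIGN", r"="),
    ("SEMI", r";"),
    ("ID", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return the list of token names for text, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens
```

Here tokenize("foo[ ] arr;") yields ID LRBRACKET ID SEMI, while tokenize("arr[key] = value;") yields ID LBRACKET ID RBRACKET ASSIGN ID SEMI, so one token of lookahead after the ID is enough for the parser to tell the two apart.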
The second problem -- array indexing assignment versus array indexed expressions. Normally programming languages allow assignment of the form:
exp [ exp ] = exp
and not just ID [ exp ]. Making this change will delay the need to reduce until late enough for the parser to identify the correct reduction. Depending on the language, it's possible that this syntax is not semantically meaningful but checking that is in the realm of type checking (semantics) not syntax. If there is some syntax of that form which is meaningful, however, there is no obvious reason to prohibit it.
Some parser generators implement GLR parsers. A GLR parser would have no problem with this grammar because it is not ambiguous. But CUP isn't such a generator.

How to overcome shift-reduce conflict in LALR grammar

I am trying to parse positive and negative decimals.
number(N) ::= pnumber(N1).
number(N) ::= nnumber(N1).
number(N) ::= pnumber(N1) DOT pnumber(N2).
number(N) ::= nnumber(N1) DOT pnumber(N2).
pnumber(N) ::= NUMBER(N1).
nnumber(N) ::= MINUS NUMBER(N1).
The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs.
I am using the Lemon parser.
Edit: conflicts from .out file
State 79:
(56) number ::= nnumber *
number ::= nnumber * DOT pnumber
DOT shift 39
DOT reduce 56 ** Parsing conflict **
{default} reduce 56 number ::= nnumber
State 80:
(55) number ::= pnumber *
number ::= pnumber * DOT pnumber
DOT shift 40
DOT reduce 55 ** Parsing conflict **
{default} reduce 55 number ::= pnumber
State 39:
number ::= nnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 58 number ::= nnumber DOT pnumber
State 40:
number ::= pnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 57 number ::= pnumber DOT pnumber
Edit 2: Minimal grammar that causes issue
start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.
The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.
The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)
So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).
Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...
So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.
The easy fix is to let the lexer deal with it. Lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a distinct token from NUMBER (probably still without the -), so you'd end up with
number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER
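A sketch of what that lexer-side decision could look like, written in Python rather than in whatever lexer you feed Lemon from. The spellings ":-" for REVERSE_IMPLICATION and "," for CONJUNCTION are assumptions, since the question doesn't show them. The REAL_NUMBER pattern is tried first and only matches when a digit follows the dot, so a dot at the end of a rule still comes out as DOT.

```python
import re

# REAL_NUMBER is listed before NUMBER so "2.5" is not split into NUMBER DOT NUMBER.
TOKEN_RE = re.compile(
    r"(?P<REAL_NUMBER>[0-9]+\.[0-9]+)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<REVERSE_IMPLICATION>:-)"  # assumed spelling
    r"|(?P<CONJUNCTION>,)"           # assumed spelling
    r"|(?P<MINUS>-)"
    r"|(?P<DOT>\.)"
    r"|(?P<WS>\s+)"
    r"|(?P<MISMATCH>.)"
)

def tokenize(text):
    """Return (token name, lexeme) pairs, skipping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "WS":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens
```

For the input ":- 1, -2.5." this produces REVERSE_IMPLICATION NUMBER CONJUNCTION MINUS REAL_NUMBER DOT, and for ":- 3." it produces REVERSE_IMPLICATION NUMBER DOT, so the parser never has to decide what a DOT after a number means.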
It's much harder to remove the conflict by factoring the grammar but it can be done.
In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.
First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.
variable ::= number | name
We want to move the number rule "up" in the grammar to get it into the same place as the rule with DOT. So we need to split the containing rules to special-case when they end with a number. We add an _n suffix to denote the rules that correspond to the original rule with all versions that end in a number split off:
variable ::= number | variable_n
variable_n ::= name
...and propagate that "up"
literal ::= number | literal_n
literal_n ::= variable_n
...and again
bodydef ::= number | bodydef_n
bodydef_n ::= literal_n
...and again
body ::= number | body_n
body ::= body CONJUNCTION number
body_n ::= bodydef_n
body_n ::= body CONJUNCTION bodydef_n
Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.
...last step
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION number DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT
Now you have the DOTs in all the same places, so expand the number rules:
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION integer DOT
rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT
and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use.
I've reduced the number of rules in this final expansion by adding
integer ::= pnumber | nnumber
You have to declare the associativity of the DOT operator token with %left or %right.
Or, another idea is to drop this intermediate reduction. The obvious feature in your grammar is that numbers grow by DOT followed by a number. That can be captured with a single rule:
number : number DOT NUMBER
A number followed by a DOT followed by a NUMBER token is still a number.
This rule doesn't require DOT to have an associativity declared, because there is no ambiguity; the rule is purely left-recursive, and the right hand of DOT is a terminal token. The parser must reduce the top of the stack to number when the state machine is at this point, and then shift DOT:
number : number DOT NUMBER
The language which you are parsing here is regular; it can be parsed by regular expressions without any recursion. That is why rules that have both left and right recursion in them and require associativity to be declared are somewhat of a "big hammer".
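To illustrate the point about the language being regular, here is a small Python sketch that recognizes the numbers from the question (an optional minus, digits, and at most one fractional part) with a single regular expression. The name is_number is made up for this example, and digits are assumed to be plain ASCII 0-9.

```python
import re

# Optional "-", one or more digits, then at most one ".digits" part.
NUMBER_RE = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")

def is_number(text):
    """Return True if the whole string is a positive or negative decimal."""
    return NUMBER_RE.fullmatch(text) is not None
```

Replacing the final "?" with "*" would accept any number of ".digits" parts, which is the language the left-recursive rule above describes.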

BNFC parser and bracket Mathematica like syntax

I played a bit with the BNF Converter and tried to re-engineer parts of the Mathematica language. My BNF had already about 150 lines and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create the parser for a language which consists only of identifiers, function calls, element access and sequence of expressions as arguments. These forms would be valid
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong input-file for BNFC could look like
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples of the first code-block.
The problem seems to be located in the created Yylex lexer file, which matches ] and ]] separately. This is wrong, because as can be seen in the last two examples, whether or not it's a closing ] or ]] depends on the context. So either you have to create a stack of braces to ensure the right matching or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)
Your problem is the token "]]". If the lexer collects this without having
any memory of its past, it might be mistaken. So just don't do that!
The parser by definition remembers its left context, so you can get
it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.
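To show that single-bracket tokens are enough, here is a hand-written recursive-descent sketch in Python for the first variant of the grammar above. It is not BNFC output: the names tokenize, Parser and parse and the tuple-based tree are made up for this example. It also accepts whitespace between the two brackets of a Part, which may or may not be what you want.

```python
import re

def tokenize(text):
    """Split the input into identifiers, single brackets and commas."""
    tokens = re.findall(r"[A-Za-z_]\w*|[\[\],]", text)
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise SyntaxError("unexpected character in input")
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self, offset=0):
        index = self.pos + offset
        return self.tokens[index] if index < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def expr(self):
        name = self.peek()
        if name is None or name in ("[", "]", ","):
            raise SyntaxError(f"expected identifier, got {name!r}")
        self.pos += 1
        node = ("Symbol", name)
        # Postfix loop: each "[" starts either a call or an element access.
        while self.peek() == "[":
            if self.peek(1) == "[":
                # A sequence never starts with "[", so "[" "[" must be a Part.
                self.expect("[")
                self.expect("[")
                args = self.sequence()
                self.expect("]")
                self.expect("]")
                node = ("Part", node, args)
            else:
                self.expect("[")
                args = self.sequence()
                self.expect("]")
                node = ("Call", node, args)
        return node

    def sequence(self):
        args = []
        if self.peek() == "]":  # empty argument list, as in f[]
            return args
        args.append(self.expr())
        while self.peek() == ",":
            self.expect(",")
            args.append(self.expr())
        return args

def parse(text):
    parser = Parser(text)
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node
```

Because the parser knows which construct it is inside, the three closing brackets of f[[a,f[x]]] are matched correctly: the first closes the call f[x] and the remaining two close the Part.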

Resolving reduce/reduce conflict in yacc/ocamlyacc

I'm trying to parse a grammar in ocamlyacc (pretty much the same as regular yacc) which supports function application with no operators (like in Ocaml or Haskell), and the normal assortment of binary and unary operators. I'm getting a reduce/reduce conflict with the '-' operator, which can be used both for subtraction and negation. Here is a sample of the grammar I'm using:
%token <int> INT
%token <string> ID
%token MINUS
%start expr
%type <expr> expr
%nonassoc INT ID
%left MINUS
%left APPLY
%%
expr: INT
{ ExprInt $1 }
| ID
{ ExprId $1 }
| expr MINUS expr
{ ExprSub($1, $3) }
| MINUS expr
{ ExprNeg $2 }
| expr expr %prec APPLY
{ ExprApply($1, $2) };
The problem is that when you get an expression like "a - b" the parser doesn't know whether this should be reduced as "a (-b)" (negation of b, followed by application) or "a - b" (subtraction). The subtraction reduction is correct. How do I resolve the conflict in favor of that rule?
Unfortunately, the only answer I can come up with means increasing the complexity of the grammar.
split expr into simple_expr and expr_with_prefix
allow only simple_expr or (expr_with_prefix) in an APPLY
The first step turns your reduce/reduce conflict into a shift/reduce conflict, but the parentheses resolve that.
You're going to have the same problem with 'a b c': is it a(b(c)) or (a(b))(c)? You'll need to also break off applied_expression and require (applied_expression) in the grammar.
I think this will do it, but I'm not sure:
expr := INT
| parenthesized_expr
| expr MINUS expr
parenthesized_expr := ( expr )
| ( applied_expr )
| ( expr_with_prefix )
applied_expr := expr expr
expr_with_prefix := MINUS expr
Well, the simplest answer is to just ignore it and let the default reduce/reduce resolution handle it -- reduce the rule that appears first in the grammar. In this case, that means reducing expr MINUS expr in preference to MINUS expr, which is exactly what you want. After seeing a-b, you want to parse it as a binary minus, rather than a unary minus and then an apply.
