How to overcome shift-reduce conflict in LALR grammar - parsing

I am trying to parse positive and negative decimals.
number(N) ::= pnumber(N1).
number(N) ::= nnumber(N1).
number(N) ::= pnumber(N1) DOT pnumber(N2).
number(N) ::= nnumber(N1) DOT pnumber(N2).
pnumber(N) ::= NUMBER(N1).
nnumber(N) ::= MINUS NUMBER(N1).
The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs.
I am using the Lemon parser.
Edit: conflicts from .out file
State 79:
(56) number ::= nnumber *
number ::= nnumber * DOT pnumber
DOT shift 39
DOT reduce 56 ** Parsing conflict **
{default} reduce 56 number ::= nnumber
State 80:
(55) number ::= pnumber *
number ::= pnumber * DOT pnumber
DOT shift 40
DOT reduce 55 ** Parsing conflict **
{default} reduce 55 number ::= pnumber
State 39:
number ::= nnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 58 number ::= nnumber DOT pnumber
State 40:
number ::= pnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 57 number ::= pnumber DOT pnumber
Edit 2: Minimal grammar that causes issue
start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.

The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.
The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)
So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).
Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...
So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.
The easy fix is to let the lexer deal with it: lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a token distinct from NUMBER (probably still without the -), so you'd end up with
number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER
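As a rough sketch of the lexer side of that fix, here is a small regex-based tokenizer in Python (not Lemon's own lexer interface, which you supply yourself). The spellings ":-" for REVERSE_IMPLICATION and "," for CONJUNCTION are assumptions, since the question never shows them. The point is only that REAL_NUMBER is tried before NUMBER, so the lexer does the extra lookahead and the parser never has to.

```python
import re

# Token names follow the grammar above; the concrete spellings of
# REVERSE_IMPLICATION (":-") and CONJUNCTION (",") are assumptions.
# REAL_NUMBER is listed before NUMBER so "3.14" becomes one token, while
# "3." followed by a non-digit is lexed as NUMBER then DOT.
TOKEN_RE = re.compile(r"""
    (?P<REAL_NUMBER>\d+\.\d+)
  | (?P<NUMBER>\d+)
  | (?P<REVERSE_IMPLICATION>:-)
  | (?P<CONJUNCTION>,)
  | (?P<MINUS>-)
  | (?P<DOT>\.)
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this, ":- 1.5, -2." lexes as REVERSE_IMPLICATION REAL_NUMBER CONJUNCTION MINUS NUMBER DOT, so the final DOT can only be the end of the rule.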
It's much harder to remove the conflict by factoring the grammar but it can be done.
In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.
First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.
variable ::= number | name
We want to move the number rule "up" in the grammar to get it into the same place as rule with DOT. So we need to split the containing rules to special-case when they end with a number. We add an _n suffix to denote the version of the original rule with every alternative that ends in a number split off:
variable ::= number | variable_n
variable_n ::= name
...and propagate that "up"
literal ::= number | literal_n
literal_n ::= variable_n
...and again
bodydef ::= number | bodydef_n
bodydef_n ::= literal_n
...and again
body ::= number | body_n
body ::= body CONJUNCTION number
body_n ::= bodydef_n
body_n ::= body CONJUNCTION bodydef_n
Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.
...last step
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION number DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT
Now you have the DOTs in all the same places, so expand the number rules:
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION integer DOT
rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT
and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use.
I've reduced the number of rules in this final expansion by adding
integer ::= pnumber | nnumber

You have to declare the associativity of the DOT operator token with %left or %right.
Or, another idea is to drop this intermediate reduction. The obvious feature in your grammar is that numbers grow by DOT followed by a number. That can be captured with a single rule:
number : number DOT NUMBER
A number followed by a DOT followed by a NUMBER token is still a number.
This rule doesn't require DOT to have an associativity declared, because there is no ambiguity; the rule is purely left-recursive, and the right hand of DOT is a terminal token. The parser must reduce the top of the stack to number when the state machine is at this point, and then shift DOT:
number : number DOT NUMBER
The language which you are parsing here is regular; it can be parsed by regular expressions without any recursion. That is why rules that have both left and right recursion in them and require associativity to be declared are somewhat of a "big hammer".
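To illustrate the "regular language" point, here is a minimal Python sketch. It assumes the base case of number is NUMBER or MINUS NUMBER (the answer only shows the recursive rule), and is_number is a made-up helper name. Note that, like the left-recursive rule itself, it accepts several DOT NUMBER groups in a row.

```python
import re

# number : number DOT NUMBER, with an assumed base case of
# NUMBER or MINUS NUMBER, written as a single regular expression.
NUMBER_RE = re.compile(r"-?[0-9]+(?:\.[0-9]+)*")


def is_number(text):
    """True if the whole string matches the number language."""
    return NUMBER_RE.fullmatch(text) is not None
```

If you want at most one fractional part, replacing the trailing * with ? would do it.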

Related

Parsing conflict in Lemon grammar

I am writing a parser for LaTeX mathematical formulas to convert them into MathML. So I wrote this grammar for Lemon.
%token BEGIN_GROUP END_GROUP MATH_SHIFT ALIGNMENT_TAB.
%token END_OF_LINE PARAMETER SUPERSCRIPT SUBSCRIPT.
%token SPACE LETTER DIGIT SYMBOL.
%token COMMAND COMMAND_LEFT COMMAND_RIGHT.
%token COMMAND_LIMITS COMMAND_NOLIMITS.
%token BEGIN_ENV END_ENV.
%token NBSP.
/* Some API */
document ::= list.
list ::= list element.
list ::= .
element ::= identifier(Id).
element ::= symbol(O).
element ::= number(Num).
identifier ::= LETTER.
symbol ::= SYMBOL.
number(N) ::= number DIGIT(D). /* Append digit */
number(N) ::= DIGIT(D). /* Init digits */
/* Lexer code */
This grammar is incomplete; it doesn't contain the main program code. This is the output from the Lemon parser:
State 2:
(2) element ::= number *
number ::= number * DIGIT
DIGIT shift-reduce 3 number ::= number DIGIT
DIGIT reduce 2 ** Parsing conflict **
{default} reduce 2 element ::= number
This grammar produces one parsing conflict. How can I resolve this conflict?
I am writing my parser for the first time so I don't have enough experience to solve this problem.

Lemon Parser - How can I set different associativity for unary minus and subtraction?

expr ::= expr MINUS expr.
expr ::= MINUS expr.
I need to set different associativity for the two MINUS tokens, but I can't set the associativity of MINUS twice.
%left PLUS MINUS. // + -
%right NOT MINUS. // ! - // error!
This is answered in the Lemon documentation, which provides an example of that specific requirement:
The precedence of a grammar rule is equal to the precedence of the left-most terminal symbol in the rule for which a precedence is defined. This is normally what you want, but in those cases where you want the precedence of a grammar rule to be something different, you can specify an alternative precedence symbol by putting the symbol in square braces after the period at the end of the rule and before any C-code. For example:
expr = MINUS expr. [NOT]
This rule has a precedence equal to that of the NOT symbol, not the MINUS symbol as would have been the case by default.
The above example assumes you have a token NOT which you have placed in the correct order in your precedence list.

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, etc. How should I modify my grammar to implement operator precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
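Here ()* means zero or more times. This grammar is deterministic and unambiguous. The repetition does not by itself fix associativity; a parser that folds each repeated operand into the running result from left to right produces left-associative trees, which is what subtraction and division need. Multiplication and division have the same priority, as do addition and subtraction, and the former pair binds tighter than the latter.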
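For what it's worth, here is one way the grammar could be turned into a recursive descent evaluator in Python. This is only a sketch: the Parser class, tokenize and evaluate are names I made up, numbers are evaluated as floats, and the tokenizer silently skips characters it doesn't recognize. Each ( ... )* in the grammar becomes a while loop that folds the next operand into the running value.

```python
import re


def tokenize(text):
    # Numbers, operators and parentheses; anything else (e.g. spaces)
    # is silently skipped in this sketch.
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def exp(self):                      # exp ::= add
        return self.add()

    def add(self):                      # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mul(self):                      # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value

    def term(self):                     # term ::= number | "(" exp ")"
        tok = self.advance()
        if tok == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return float(tok)


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

evaluate("1 + 2 * 3 + 4") gives 11.0 here, because mul consumes 2 * 3 before add ever sees the second +.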

Handling whitespace in EBNF

Let's say I have the following EBNF defined for a simpler two-term adder:
<expression> ::= <number> <plus> <number>
<number> ::= [0-9]+
<plus> ::= "+"
What would be the proper way to allow any amount of whitespace except a newline/return between the terms? For example to allow:
1 + 2
1 <tab> + 2
1 + 2
etc.
For example, doing something like the following fails:
<whitespace>::= " " | \t
Furthermore, it seems (almost) every term would be preceded and followed by an optional space. Something like:
<plus> ::= <whitespace>? "+" <whitespace>?
How would that be properly addressed?
The XML standard, as an example, uses the following production for whitespace:
S ::= (#x20 | #x9 | #xD | #xA)+
You could omit CR (#xD) and LF (#xA) if you don't want those.
Regarding your observation that grammars could become overwhelmed by whitespace non-terminals, note that whitespace handling can be done in lexical analysis rather than in parsing. See EBNF Grammar for list of words separated by a space.
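A minimal sketch of the lexical-analysis approach in Python, assuming the two-term adder from the question (tokenize and is_expression are hypothetical helper names). Spaces and tabs are consumed by the lexer and never reach the grammar, while a newline is not whitespace here and is rejected.

```python
import re

# Spaces and tabs form a "ws" token that is dropped; newlines are
# deliberately not included, so they cause an error.
TOKEN_RE = re.compile(r"(?P<ws>[ \t]+)|(?P<number>[0-9]+)|(?P<plus>\+)")


def tokenize(text):
    """Return (kind, lexeme) pairs with whitespace tokens removed."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def is_expression(text):
    # <expression> ::= <number> <plus> <number>, checked on the token kinds.
    kinds = [kind for kind, _ in tokenize(text)]
    return kinds == ["number", "plus", "number"]
```

The grammar itself stays exactly as written in the question; only the token stream it is matched against changes.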

How should "or" be treated in a BNF production rule?

I'm looking at the BNF grammar for SVG path data, and one of the derivation rules is:
digit-sequence ::= digit | digit digit-sequence
Is there a semantic difference between this rule and:
digit-sequence ::= digit digit-sequence | digit
Exactly what does the | mean in a BNF grammar? Should the first match be selected, or the one that consumes most of the input?
| in a BNF grammar means alternation: the input is accepted if it matches either alternative, and the order in which the alternatives are written carries no meaning, so your two rules are equivalent. Here is a tutorial on BNF.
However, the rule you quoted is recursive (note digit-sequence is on both the left- and the right-hand sides of the rule) so that rule means a sequence of digits, e.g. [0-9]+ in regex.
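As a quick sanity check of that equivalence, here is a small Python sketch that implements the recursive rule directly (is_digit_sequence is a made-up name). It agrees with the regex on every input, whichever order the two alternatives are written in.

```python
import re


def is_digit_sequence(s):
    # digit-sequence ::= digit | digit digit-sequence
    if len(s) == 0 or s[0] not in "0123456789":
        return False
    # Either a single digit, or a digit followed by another digit-sequence.
    return len(s) == 1 or is_digit_sequence(s[1:])


def matches_regex(s):
    # The same language written as a regular expression.
    return re.fullmatch(r"[0-9]+", s) is not None
```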
BTW, parsing SVG path data seems to be a nontrivial task, so much so that a general BNF parser was once used to parse path data in combination with an XML parser: https://metacpan.org/release/MarpaX-Languages-SVG-Parser
