bison shift/reduce conflict - parsing

In the following simple grammar, can 'shift' be made the action taken for the conflict in state 4 without changing the rules?
(I thought shift was bison's preferred action by default.)
%token one two three
%%
start : a;
a : X Y Z;
X : one;
Z : two | three;
Y : two | ;
%%

Shift is bison's preferred action, and you can see in the state output that it will shift two in state 4. Bison will still report a shift/reduce conflict, but you can treat that as a warning if you like. (See %expect.) You'd probably be better off fixing the grammar:
start : a;
a : X Z | X Y Z;
X : one;
Y : two;
Z : two | three;

Shift is the default, but the generated parser then gives an error for the input one two: having shifted two as Y, it still expects a Z. That is probably not what you want. Instead, follow rici's advice and fix the grammar.
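To make the failure on one two concrete, here is a minimal Python sketch, not bison output. The function names parse_original_prefer_shift and parse_fixed are hypothetical; each one hard-codes by hand the decision the corresponding generated parser would make after reading one, assuming a single token of lookahead.

```python
def parse_original_prefer_shift(tokens):
    """Mimic a : X Y Z with Y : two | (empty), resolving the conflict by shifting."""
    toks = list(tokens) + ["$end"]
    pos = 0
    if toks[pos] != "one":                   # X : one
        return False
    pos += 1
    if toks[pos] == "two":                   # conflict: 'two' is shifted as Y
        pos += 1                             # instead of reducing Y to empty
    if toks[pos] not in ("two", "three"):    # Z : two | three
        return False
    pos += 1
    return toks[pos] == "$end"


def parse_fixed(tokens):
    """Mimic a : X Z | X Y Z with Y : two, deciding after 'two' has been shifted."""
    toks = list(tokens) + ["$end"]
    pos = 0
    if toks[pos] != "one":                   # X : one
        return False
    pos += 1
    # A shifted 'two' is Y only if more input follows; at end of input it is Z.
    if toks[pos] == "two" and toks[pos + 1] != "$end":
        pos += 1
    if toks[pos] not in ("two", "three"):    # Z : two | three
        return False
    pos += 1
    return toks[pos] == "$end"


for sentence in (["one", "two"], ["one", "three"], ["one", "two", "three"]):
    print(sentence, parse_original_prefer_shift(sentence), parse_fixed(sentence))
```

The first function rejects one two because the only two has already been consumed as Y; the second looks at what follows the two before deciding which nonterminal it belongs to.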

Related

Ambiguity when Parsing Preceding List

While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A* is the same as list(A)):
exp:
| A*; B; A; {1}
| B; B; {2}
However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:
** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:
. B B
** In state 0, looking ahead at B, reducing production
** list(A) ->
** is permitted because of the following sub-derivation:
list(A) B A // lookahead token appears
So | B B requires a shift, while | A* B A requires a reduce when the first token is B. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+ is the same as nonempty_list(A)):
exp2:
| B; A; {1}
| A+; B; A; {1}
| B; B; {2}
In my mind, exp and exp2 read the same, but are clearly treated differently. Is there a way to write exp to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?
exp and exp2 parse the same language, but they're definitely not the same grammar. exp requires a two-symbol lookahead to correctly parse a sentence starting with B, for precisely the reason you've noted: the parser can't decide whether to insert an empty A* into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B. By contrast, exp2 does not need an empty production to create an empty list of As before B A, so no decision is necessary.
You don't need a list to produce this conflict. Replacing A* with A? would produce exactly the same conflict.
You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.
Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:
prefix:
| B;
| A; prefix;
exp:
| prefix; A; { 1 }
| B; B; { 2 }
As far as I know, menhir's standard library doesn't include a terminated list macro, but it would be easy enough to write. It might look something like this:
%public terminated_list(X, Y):
| y = Y;
{ ( [], y ) }
| x = X; xsy = terminated_list(X, Y);
{ ( x :: (fst xsy), (snd xsy) ) }
There's probably a more idiomatic way to write that; I don't pretend to be an OCaml coder.
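For readers who don't use menhir, the same right-recursive idea can be sketched in Python. This is a hand-written recognizer rather than a generated LR parser, and the names terminated_list and parse_exp, as well as the token spellings "A" and "B", are assumptions made for illustration.

```python
def terminated_list(tokens, pos=0):
    """prefix ::= B | A prefix  (a run of As terminated by a B).

    Returns (list_of_As, terminator, next_position), or None on failure.
    """
    if pos < len(tokens) and tokens[pos] == "B":
        return [], "B", pos + 1              # base case: just the terminator
    if pos < len(tokens) and tokens[pos] == "A":
        rest = terminated_list(tokens, pos + 1)
        if rest is None:
            return None
        xs, y, nxt = rest
        return ["A"] + xs, y, nxt            # cons the A onto the rest
    return None


def parse_exp(tokens):
    """exp ::= prefix A {1} | B B {2}.  Returns 1, 2, or None."""
    if tokens == ["B", "B"]:
        return 2
    result = terminated_list(tokens)
    if result is None:
        return None
    _, _, nxt = result
    return 1 if tokens[nxt:] == ["A"] else None


print(parse_exp(list("BA")), parse_exp(list("AABA")), parse_exp(list("BB")))
```

No empty list ever has to be produced before the B is seen, which is the property that removes the conflict in the grammar above.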

Different methods of implementing a specific parsing rule for a compiler

Let's say we have a rule in parsing tokens that specifies:
x -> [y[,y]*]
where the brackets '[ ]' mean that anything inside them is optional and the * means zero or more.
E.g., it could be:
x : (empty)
OR
x : y
OR
x : y,y
and so on. (The above are examples of input that would match the 'x' rule, not how the code should look.)
I have already tried the following, which works:
x : y commaY
|
;
commaY : COMMA y commaY
|
;
For educational purposes, I would like to know whether there are alternative ways to write the above that would also work.
Thank you in advance.
EDIT my earlier answer was incorrect (as pointed out in the comments), but I cannot remove an accepted answer, so I decided to edit it.
You will need (at least) 2 rules for x -> [y[,y]*]. Here is another possibility:
x
: list
| /* eps */
;
list
: list ',' y
| y
;
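As a rough illustration of what the two rules accept, here is a small hand-written Python recognizer. It is only a sketch: the name parse_x is hypothetical, and it assumes that every token other than ',' counts as a y. The left-recursive list rule corresponds to a loop that accumulates items from left to right.

```python
def parse_x(tokens):
    """x : list | /* eps */ ;   list : list ',' y | y ;

    Returns the list of y tokens, or None if the input does not match.
    """
    if not tokens:
        return []                    # the eps alternative of x
    items = []
    expect_y = True                  # a list must start with a y
    for tok in tokens:
        if expect_y:
            if tok == ",":
                return None          # a comma where a y was required
            items.append(tok)
        elif tok != ",":
            return None              # two y tokens without a separator
        expect_y = not expect_y
    # Ending while a y is still expected means the input had a trailing comma.
    return None if expect_y else items


print(parse_x([]), parse_x(["y"]), parse_x(["y", ",", "y"]))
```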

Resolving left-recursion in my grammar

My grammar has a case of left-recursion in the sixth production rule.
I resolved this by replacing Rule 6 and 7 like this:
I couldn't find any indirect left recursions in this grammar.
The only thing that bothers me is the final production rule, which has a terminal surrounded by two non-terminals.
My two questions are:
Is my resolved left recursion correct?
Is the final production rule a left recursion? I am not sure how to treat this special case.
Yes, your resolution is correct. You may want to remove the epsilon rule for ease of use, but the accepted strings are correct.
X -> -
X -> -Z
Z -> +
Z -> +Z
Z -> X + Y
... and Y is of the form 0* 1 (no syntax collisions)
As a check, note that you could now replace this final rule with two new rules, one for each expansion of X:
Z -> - + Y
Z -> -Z + Y
This removes X entirely from the Z rules, and each Z rule would then begin with a terminal.
No, your final production rule is no longer left-recursive. X now must resolve to a string beginning with a terminal.
I have to admit, though, I'm curious about what use the language has. :-)

Top down parsing - Compute FIRST and FOLLOW

Given the following grammar:
S -> S + S | S S | (S) | S* | a
S -> S S + | S S * | a
For the life of me I can't seem to figure out how to compute the FIRST and FOLLOW for the above grammar. The recursive non-terminal of S confuses me. Does that mean I have to factor out the grammar first before computing the FIRST and FOLLOW?
The general rule for computing FIRST sets in CFGs without ε productions is the following:
Initialize FIRST(A) as follows: for each production A → tω, where t is a terminal, add t to FIRST(A).
Repeatedly apply the following until nothing changes: for each production of the form A → Bω, where B is a nonterminal, set FIRST(A) = FIRST(A) ∪ FIRST(B).
We could follow the above rules as written, but there's something interesting here we can notice. Your grammar only has a single nonterminal, so that second rule - which imports elements into the FIRST set of one nonterminal from FIRST sets from another nonterminal - won't actually do anything. In other words, we can compute the FIRST set just by applying that initial rule. And that's not too bad here - we just look at all the productions that start with a terminal and get FIRST(S) = { a, ( }.
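The two-step rule above can be sketched in Python as a fixed-point loop. This is a minimal illustration that assumes, as the rule does, a grammar without ε productions; the function name first_sets and the dictionary encoding of the grammar are choices made here, not anything prescribed by the question.

```python
def first_sets(grammar, terminals):
    """grammar maps each nonterminal to a list of productions (lists of symbols).

    Assumes there are no epsilon productions.
    """
    first = {nt: set() for nt in grammar}
    # Step 1: A -> t w, where t is a terminal, puts t into FIRST(A).
    for nt, prods in grammar.items():
        for prod in prods:
            if prod[0] in terminals:
                first[nt].add(prod[0])
    # Step 2: A -> B w, where B is a nonterminal, imports FIRST(B) into
    # FIRST(A); repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                head = prod[0]
                if head in grammar and not first[head] <= first[nt]:
                    first[nt] |= first[head]
                    changed = True
    return first


TERMINALS = {"+", "*", "(", ")", "a"}
infix = {"S": [["S", "+", "S"], ["S", "S"], ["(", "S", ")"], ["S", "*"], ["a"]]}
postfix = {"S": [["S", "S", "+"], ["S", "S", "*"], ["a"]]}

print(sorted(first_sets(infix, TERMINALS)["S"]))    # ['(', 'a']
print(sorted(first_sets(postfix, TERMINALS)["S"]))  # ['a']
```

With a single nonterminal, the loop in step 2 only ever tries to import FIRST(S) into itself, so it stops after one pass without changing anything.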

SLR(1) parser with epsilon

let's suppose I have the following grammar:
E --> TX
T --> (E) | int Y
X --> + E | ε
Y --> * T | ε
Building the item sets I get a state like this one:
T --> int . Y
Y --> . * T
Y --> .
Is this state adequate or not? That is, is the grammar SLR(1) or not?
Thanks
Yes, the entries you specified in the state are absolutely correct.
T->int.Y
Y->.*T
Y->.
This is the 5th state in the DFA created for SLR(1) parser for given grammar.
The confusion may have arisen from Y->Ɛ. When you place a dot in a production, for example S->A.B, it means that A has been completed and B is yet to be completed (completion here means progress in parsing). Similarly, Y->.Ɛ would mean that Ɛ is yet to be parsed, but since Ɛ is the null string, i.e. nothing, Y->.Ɛ is written as Y->.
I created the DFA (13 states) for this grammar and found that the given grammar is SLR(1) as there is no Reduce-Reduce or Shift-Reduce conflict.
You have to construct the FOLLOW sets and see whether FOLLOW(Y) contains *, the only token that can be shifted in that state. If it did, there would be a shift/reduce conflict between shifting * and reducing Y -> ., and the grammar wouldn't be SLR(1).
Check all the states; if none has a conflict, the grammar is SLR(1).
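The FOLLOW check for this one state can be sketched in Python. This is an illustrative computation with the standard FIRST/FOLLOW fixed-point rules, not a full SLR(1) table builder; the names compute_first, compute_follow, and first_of_seq and the grammar encoding (with [] standing for an ε production and $ for end of input) are assumptions made here.

```python
EPS = "ε"
END = "$"

GRAMMAR = {
    "E": [["T", "X"]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "X": [["+", "E"], []],          # [] is the ε production
    "Y": [["*", "T"], []],
}


def first_of_seq(seq, first):
    """FIRST of a symbol sequence; includes EPS if the whole sequence can vanish."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in first else {sym}   # terminal: itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                new = first_of_seq(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue                    # only nonterminals have FOLLOW
                    tail = first_of_seq(prod[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:                 # the rest can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


follow = compute_follow(GRAMMAR, "E")
print(sorted(follow["Y"]))      # ['$', ')', '+']
print("*" in follow["Y"])       # False, so no shift/reduce conflict in this state
```

Since * is not in FOLLOW(Y), the state from the question reduces Y -> . only on +, ) and end of input, and shifts on *.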
