LALR(1) Parser DFA Lookahead Core Question - parsing

I am having trouble understanding the rules for adding a lookahead to a core production during construction of the DFA. To illustrate my confusion, I will use an online parser generator that exposes all of its internal calculations: this_tool (open it in a new tab).
(The format is NONTERMINAL -> RULE, LOOKAHEADS, where the lookaheads are separated by forward slashes.)
Using this grammar as an example:
S -> E
E -> ( E )
E -> N O E
E -> N
N -> 1
N -> 2
N -> 3
O -> +
O -> -
Copying and pasting the above grammar into the LALR parser generator produces a DFA with 12 states (click the >>). My question, finally: why are the goto(0, N) kernel productions ( {[E -> N.O E, $/)]; [E -> N., $/)]} ) initialized with the ) terminal? Where does the ) come from? I would expect goto(0, N) to be {[E -> N.O E, $]; [E -> N., $]}. Likewise, the kernel production in goto(0, ( ) has an 'extra' ).
As the DFA is being constructed, equal cores are merged (the core is the set of productions that introduces a new state when closure is performed on that set). State 2 has the production [E -> .N, )], which when merged with [E -> N., $] produces the correct output, but there's no way for state 0 to have known about the lookahead ).
Thanks in advance, sorry if this was a confusing and specific question and about using an external website to demonstrate my issue.✌️

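The solution is to propagate each newly found lookahead: whenever a kernel item gains a lookahead, follow the goto transitions to the states whose cores contain the items it leads to, add the lookahead there, and repeat until nothing changes.
The method is described in section 4.7.5 of the Dragon Book, 2nd ed.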
(here: https://github.com/muthukumarse/books/blob/master/Dragon%20Book%20Compilers%20Principle%20Techniques%20and%20Tools%202nd%20Edtion.pdf)
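To see where the ) comes from, here is a minimal Python sketch. Note that it does not implement the propagation algorithm from the book; it uses the simpler (and slower) textbook route of building the canonical LR(1) item sets and then merging the sets that share a core, which yields the same LALR(1) lookaheads. The item encoding (production index, dot position, lookahead) and all function names are my own assumptions, not the tool's internals.

```python
from collections import defaultdict

# Grammar from the question; production 0 is the start rule S -> E.
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("(", "E", ")")),
    ("E", ("N", "O", "E")),
    ("E", ("N",)),
    ("N", ("1",)), ("N", ("2",)), ("N", ("3",)),
    ("O", ("+",)), ("O", ("-",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first(symbol):
    # No production here is empty or left-recursive, so FIRST of a symbol is
    # the union of FIRST of the leading symbol of each of its right-hand sides.
    if symbol not in NONTERMINALS:
        return {symbol}
    return set().union(*(first(rhs[0]) for lhs, rhs in GRAMMAR if lhs == symbol))

def closure(items):
    # LR(1) closure over items of the form (production index, dot, lookahead).
    items = set(items)
    work = list(items)
    while work:
        prod, dot, la = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            new_las = first(rhs[dot + 1]) if dot + 1 < len(rhs) else {la}
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot]:
                    for b in new_las:
                        if (q, 0, b) not in items:
                            items.add((q, 0, b))
                            work.append((q, 0, b))
    return frozenset(items)

def goto(state, symbol):
    moved = {(p, d + 1, la) for p, d, la in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)

def canonical_lr1():
    start = closure({(0, 0, "$")})
    states, work = {start}, [start]
    while work:
        state = work.pop()
        for sym in {GRAMMAR[p][1][d] for p, d, _ in state if d < len(GRAMMAR[p][1])}:
            nxt = goto(state, sym)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return start, states

def core_of(state):
    return frozenset((p, d) for p, d, _ in state)

def merge_by_core(states):
    # LALR(1): union the items of all LR(1) states that share a core.
    merged = defaultdict(set)
    for state in states:
        merged[core_of(state)] |= state
    return merged

def lookaheads(state, prod, dot):
    return sorted(la for p, d, la in state if (p, d) == (prod, dot))

start, lr1_states = canonical_lr1()
after_n = goto(start, "N")                    # reached from state 0 on N
after_paren_n = goto(goto(start, "("), "N")   # reached on N inside parentheses
merged = merge_by_core(lr1_states)[core_of(after_n)]
for state in (after_n, after_paren_n, merged):
    print(lookaheads(state, 3, 1))            # lookaheads of E -> N .
```

In the canonical LR(1) automaton, the state reached from state 0 on N carries only $ for E -> N., while the state reached on N after a ( carries only ). Both have the same core, so the LALR state is their union, which is where the $/) in goto(0, N) comes from. State 0 never needs to "know" about ); the lookahead arrives through the merge (or, in the book's formulation, through propagation).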

Related

Why this grammar has Reduce/Reduce conflict in LR(0)?

I have the following grammar:
S -> a b D E
S -> A B E F
D -> M x
E -> N y
F -> z
M -> epsilon
N -> epsilon
My textbook says there is a Reduce/Reduce conflict in LR(0). I built a diagram and found out that there is a state:
S -> a b . D E
S -> A B . E F
D -> . M x
E -> . N y
M -> .
N -> .
The textbook says that it's a Reduce/Reduce conflict. I'm trying to figure out why. If I build the SLR table I get the following row (3 is the state above):
That's because:
Follow(M)={x} so we can do reduce to rule 6 from state 3.
Follow(N)={y} so we can do reduce to rule 7 from state 3.
I was taught that there is a conflict S/R if there is a cell with S/R and conflict R/R if there is a cell with R/R. But I don't see two Rs in the same cell in the table. So why is it a reduce/reduce conflict?
You show an SLR(1) parsing table, in which the columns correspond to a lookahead of length 1. It's correct, and there is no conflict.
But here we're talking about an LR(0) machine, in which there is no lookahead. (That's the 0 in LR(0).) The only decision the machine can make is to shift or reduce, and since it cannot use lookahead, it can only use the state itself. A given state must be either a shift state or a reduce state (and, if a reduce state, which production is being reduced).
(In case it's confusing, and it often is, the concept of lookahead does not refer to the use of the shifted symbol to decide which state to transition to. The transition is taken based on the shifted symbol, which is at that point no longer part of the lookahead.)
So in that state, there is no possible shift action; in all items in the itemset, either the dot is at the end or the next symbol is a non-terminal (implying a GOTO action after returning from a reduce).
But the state does not have a unique reduction. Depending on the lookahead, the parser needs to choose between reducing M and reducing N. Since there is no lookahead, the decision cannot be made, and hence it's a conflict.
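As a rough illustration of the point above, here is a small Python sketch that closes the kernel of that state and lists the actions an LR(0) machine would have to choose between. The item encoding (production index, dot position) and the function names are assumptions of mine; the kernel and the FOLLOW sets are copied from the question rather than computed.

```python
# Grammar from the question; indices 5 and 6 are rules 6 and 7 in its 1-based numbering.
GRAMMAR = [
    ("S", ("a", "b", "D", "E")),
    ("S", ("A", "B", "E", "F")),
    ("D", ("M", "x")),
    ("E", ("N", "y")),
    ("F", ("z",)),
    ("M", ()),   # M -> epsilon
    ("N", ()),   # N -> epsilon
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(kernel):
    # LR(0) closure: items are (production index, dot position).
    items, work = set(kernel), list(kernel)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return items

def lr0_actions(state):
    # Terminals that can be shifted, and productions whose dot is at the end.
    shifts = sorted({GRAMMAR[p][1][d] for p, d in state
                     if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] not in NONTERMINALS})
    reduces = sorted(p for p, d in state if d == len(GRAMMAR[p][1]))
    return shifts, reduces

# Kernel as given in the question: S -> a b . D E and S -> A B . E F
state = closure({(0, 2), (1, 2)})
shifts, reduces = lr0_actions(state)
print("shifts:", shifts, "reduces:", reduces)
print("LR(0) reduce/reduce conflict:", len(reduces) > 1)

# FOLLOW sets as stated in the question; SLR(1) uses them to pick a reduction.
FOLLOW = {5: {"x"}, 6: {"y"}}
print("SLR(1) conflict:", not FOLLOW[5].isdisjoint(FOLLOW[6]))
```

The state has no shiftable terminal and two completed items, so LR(0) cannot choose; SLR(1) can, because the two FOLLOW sets do not overlap.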

How do I expand the item set for this grammar?

I have this grammar
E -> E + i
E -> i
The augmented grammar
E' -> E
E -> E + i
E -> i
Now I try to expand the item set 0
I0)
E' -> .E
+E -> .E + i
+E -> .i
Then, since I have .E in I0 I would expand it but then I will get another E rule, and so on, this is my first doubt.
Assuming that this is alright the next item sets are
I0)
E' -> .E
+E -> .E + i
+E -> .i
I1) (I moved the dot from I0, no variables at rhs of dot)
E' -> E.
E -> E. + i
E -> i.
I2) (I moved the dot from I1, no vars at rhs of dot)
E -> E +. i
I3) (I moved the dot from I2, also no vars)
E -> E + i.
Then I will have this DFA
I0 -(E, i)-> I1 -(+)-> I2 -(i)-> I3
| |
+-(∅)-> acpt <-(∅)--+
I'm missing something, because E -> E + i must accept i + i + ..., but the DFA doesn't go back to I0, so it seems wrong to me. My guess is that it should have an I0 to I0 transition, but then I don't know what to do with the dot.
What you call the "expansion" of the item set is actually a closure; that's how it's described in all the descriptions of the algorithm I've seen (at least in textbooks). Like any closure operation, you just keep on doing the transformation until you reach a fixed-point; once you've included the productions for E, they're included.
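The fixed-point idea can be sketched in a few lines of Python. This is only an illustration under my own encoding (an item is a pair of production index and dot position; the names closure and goto follow textbook usage).

```python
# Augmented grammar from the question.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "i")),
    ("E", ("i",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(kernel):
    items = set(kernel)
    changed = True
    while changed:                      # repeat until a full pass adds nothing
        changed = False
        for prod, dot in list(items):
            rhs = GRAMMAR[prod][1]
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for q, (lhs, _) in enumerate(GRAMMAR):
                    # An item that is already present is not added again,
                    # which is why .E in E -> .E + i does not loop forever.
                    if lhs == rhs[dot] and (q, 0) not in items:
                        items.add((q, 0))
                        changed = True
    return frozenset(items)

def goto(state, symbol):
    # Advance the dot over `symbol` in every item that allows it, then close.
    return closure({(p, d + 1) for p, d in state
                    if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol})

def show(state):
    return sorted(
        f"{GRAMMAR[p][0]} -> {' '.join(GRAMMAR[p][1][:d] + ('.',) + GRAMMAR[p][1][d:])}"
        for p, d in state)

I0 = closure({(0, 0)})
print(show(I0))
print(show(goto(I0, "E")))
print(show(goto(I0, "i")))
```

With this sketch, I0 contains exactly the three items listed in the question, and the transitions on E and on i lead to two different item sets: a goto advances the dot over one specific symbol, so E -> i. does not end up in the same set as E' -> E.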
But the essential point is that you're not building a DFA. You're building a pushdown automaton, and the DFA is just one part of it. The DFA is used for shift operations; when you shift a new terminal (because the current parse stack is not a handle), you do a state transition according to the DFA. But you also push the current state onto the PDA's stack.
The interesting part is what happens when the automaton decides to perform a reduction, which replaces the right-hand side of a production with its left-hand side non-terminal. (The right-hand side at the top of the stack is called a "handle".) To do the reduction, you unwind the stack, popping each right-hand side symbol (and the corresponding DFA state) until you reach the beginning of the production. What that does is rewind the DFA to the state it was in before it shifted the first symbol from the right-hand side. (Note that it is only at this point that you know for sure which production was used.) With the DFA thus reset, you can now shift the non-terminal which was encountered, do the corresponding DFA transition, and continue with the parse.
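The shift and reduce steps just described can be sketched as a tiny table-driven driver in Python. The state numbers and the hand-written ACTION/GOTO tables below are my own assumptions for the grammar E -> E + i | i (state 0 is the start, 1 is reached after E, 2 after a leading i, 3 after E +, 4 after E + i); they are not taken from any particular tool.

```python
# Hand-built SLR(1) tables for:  E' -> E ;  E -> E + i ;  E -> i
ACTION = {
    (0, "i"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "+"): ("reduce", "E", 1),   # E -> i      (pop 1 state)
    (2, "$"): ("reduce", "E", 1),
    (3, "i"): ("shift", 4),
    (4, "+"): ("reduce", "E", 3),   # E -> E + i  (pop 3 states)
    (4, "$"): ("reduce", "E", 3),
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    """Return (accepted, list of reductions as (lhs, rhs length))."""
    tokens = list(tokens) + ["$"]
    stack = [0]            # stack of DFA states; the symbols are implicit
    reductions = []
    pos = 0
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:
            return False, reductions
        if action[0] == "shift":
            stack.append(action[1])     # push the new state, consume the token
            pos += 1
        elif action[0] == "reduce":
            _, lhs, length = action
            del stack[-length:]         # unwind: pop one state per rhs symbol
            # The exposed state is the one the DFA was in before the handle
            # began; "shift" the non-terminal from there.
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append((lhs, length))
        else:
            return True, reductions

print(parse(["i", "+", "i", "+", "i"]))
print(parse(["i", "i"]))
```

Note that nothing ever transitions back to state 0 directly: the parser gets "back" there only by popping the stack during a reduction, which is how i + i + i is handled with so few states.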
The basis for this procedure is the fact that the parser stack is at all times a "viable prefix"; that is, a sequence of symbols which are the prefix of some right sentential form which can be derived from the start symbol. What's interesting about the set of viable prefixes for a context-free grammar is that it is a regular language, and consequently can be recognised by a DFA. The reduction procedure given above precisely represents this recognition procedure when handles are "pruned" (to use Knuth's original vocabulary).
In that sense, it doesn't really matter what procedure is used to determine which handle is to be pruned, as long as it provides a valid answer. You could, for example, fork the parse every time a potential handle is noticed at the top of the stack, and continue in parallel with both forks. With clever stack management, this parallel search can be done in worst-case O(n^3) time for any context-free grammar (and this can be reduced if the grammar is not ambiguous). That's a very rough description of Earley parsers.
But in the case of an LR(k) parser, we require that the grammar be unambiguous, and we also require that we can identify a reduction by looking at no more than k more symbols from the input stream, which is an O(1) operation since k is fixed. If at each point in the parse we know whether to reduce or not, and if so which reduction to choose, then the reductions can be implemented as I outlined above. Each reduction can be performed in O(1) time for a fixed grammar (since the maximum size of a right-hand side in a particular grammar is fixed), and since the number of reductions in a parse is linear in the size of the input, the entire parse can be done in linear time.
That was all a bit informal, but I hope it serves as an intuitive explanation. If you're interested in the formal proof, Donald Knuth's original 1965 paper (On the Translation of Languages from Left to Right) is easy to find and highly readable as these things go.

Find a s-grammar (simple grammar)

Find a simple grammar (a.k.a. s-grammar) for the following language:
L = {(ab)^(2m) b : m >= 0}
[I tried this, but it is wrong]
S-> aASBB|b
A-> a
B->b
What about this?
S -> aA | T
A -> bB
B -> aC
C -> bS
T -> b
This is a regular grammar: all productions are of the form X -> sY or X -> t (apart from the unit production S -> T, which can be inlined as S -> b), and it corresponds to a minimal DFA for the language in question via a direct mapping of productions to transitions and nonterminal symbols to states.
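To make the production-to-transition mapping concrete, here is a hedged Python sketch that treats each nonterminal as a state and checks the result against the regular expression (abab)*b, which is my own restatement of the language. The table layout, the use of "" to mark the unit production S -> T, and the function names are assumptions for illustration.

```python
import re
from itertools import product

# Each production X -> sY becomes a transition X --s--> Y.
# X -> t becomes a transition to None, which stands for "derivation finished".
PRODUCTIONS = {
    "S": [("a", "A"), ("", "T")],   # S -> aA | T  ("" marks the unit production)
    "A": [("b", "B")],
    "B": [("a", "C")],
    "C": [("b", "S")],
    "T": [("b", None)],
}

def unit_closure(states):
    # Follow unit productions without consuming input.
    states, work = set(states), list(states)
    while work:
        s = work.pop()
        if s is None:
            continue
        for terminal, target in PRODUCTIONS[s]:
            if terminal == "" and target not in states:
                states.add(target)
                work.append(target)
    return states

def accepts(word):
    current = unit_closure({"S"})
    for ch in word:
        moved = {target for s in current if s is not None
                 for terminal, target in PRODUCTIONS[s] if terminal == ch}
        current = unit_closure(moved)
    return None in current

def agrees_with_regex(max_len=9):
    # Exhaustively compare with (abab)*b on all short strings over {a, b}.
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if accepts(word) != bool(re.fullmatch(r"(abab)*b", word)):
                return False
    return True

print(accepts("b"), accepts("ababb"), accepts("abb"))
print(agrees_with_regex())
```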

SLR parsing conflicts with epsilon production

Consider the following grammar
S -> aPbSQ | a
Q -> tS | ε
P -> r
While constructing the DFA we can see there shall be a state which contains Items
Q -> .tS
Q -> . (epsilon as a blank string)
Since t is in FOLLOW(Q), there appears to be a shift/reduce conflict.
Can we conclude that the grammar isn't SLR(1)?
(Please ignore my incorrect previous answer.)
Yes, the fact that you have a shift/reduce conflict in this configuring set is sufficient to show that this grammar isn't SLR(1).
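For anyone who wants to check the FOLLOW set mechanically, here is a small Python sketch of the usual fixed-point computation of nullable, FIRST and FOLLOW for this grammar. The data layout and the names first_of and analyse are my own; "$" is assumed as the end marker.

```python
# Grammar from the question; an empty list is the epsilon production.
GRAMMAR = {
    "S": [["a", "P", "b", "S", "Q"], ["a"]],
    "Q": [["t", "S"], []],
    "P": [["r"]],
}

def first_of(seq, first, nullable):
    """FIRST of a symbol sequence, plus whether the whole sequence can vanish."""
    out = set()
    for sym in seq:
        if sym not in GRAMMAR:          # terminal
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def analyse(start="S"):
    nullable = set()
    first = {n: set() for n in GRAMMAR}
    follow = {n: set() for n in GRAMMAR}
    follow[start].add("$")
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                f, vanishes = first_of(rhs, first, nullable)
                if not f <= first[lhs]:
                    first[lhs] |= f
                    changed = True
                if vanishes and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in GRAMMAR:
                        f, vanishes = first_of(rhs[i + 1:], first, nullable)
                        new = f | (follow[lhs] if vanishes else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return nullable, first, follow

nullable, first, follow = analyse()
print("FOLLOW(Q) =", sorted(follow["Q"]))
# The state holding Q -> .tS and Q -> . shifts on t and reduces Q -> epsilon
# on every terminal in FOLLOW(Q), so t in FOLLOW(Q) means a shift/reduce clash.
print("shift/reduce conflict on t:", "t" in follow["Q"])
```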

fixing a grammar to LR(0)

Question:
Given the following grammar, fix it to an LR(0) grammar:
S -> S' $
S'-> aS'b | T
T -> cT | c
Thoughts
I've been trying this for quite some time, using automatic tools to check my fixed grammars, with no success. Our professor likes asking this kind of question on tests without giving us a methodology for approaching it (other than repeated trying). Is there any method that can be applied to answer this kind of question? Can anyone show how such a method can be applied to this example?
I don't know of an automatic procedure, but the basic idea is to defer decisions. That is, if at a particular state in the parse, both shift and reduce actions are possible, find a way to defer the reduction.
In the LR(0) parser, you can make a decision based on the token you just shifted, but not on the token you (might be) about to shift. So you need to move decisions to the end of productions, in a manner of speaking.
For example, your language consists of all sentences { a^n c^m b^n $ | n ≥ 0, m > 0 }. If we restrict that to n > 0, then an LR(0) grammar can be constructed by deferring the reduction decision to the point following a b:
S -> S' $.
S' -> U | a S' b.
U -> a c T.
T -> b | c T.
That grammar is LR(0). In the original grammar, at the itemset including T -> c . and T -> c . T, both shift and reduce are possible: shift c and reduce before b. By moving the b into the production for T, we defer the decision until after the shift: after shifting b, a reduction is required; after c, the reduction is impossible.
But that forces every sentence to have at least one b. It omits sentences for which n = 0 (that is, the regular language c*$). That subset has an LR(0) grammar:
S -> S' $.
S' -> c | S' c.
We can construct the union of these two languages in a straight-forward manner, renaming one of the S's:
S -> S1' $ | S2' $.
S1' -> U | a S1' b.
U -> a c T.
T -> b | c T.
S2' -> c | S2' c.
This grammar is LR(0), but the form in which the end-of-input sentinel $ has been included seems to be cheating. At least, it violates the rule for augmented grammars, because an augmented grammar's base rule is always S -> S' $ where S' and $ are symbols not used in the original grammar.
It might seem that we could avoid that technicality by right-factoring:
S -> S' $
S' -> S1' | S2'
Unfortunately, while that grammar is still deterministic, and does recognise exactly the original language, it is not LR(0).
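If you want to check such claims without an external tool, here is a rough Python sketch of an LR(0) conflict checker: it builds the LR(0) item sets and reports any state that has a completed item together with a shiftable terminal or with another completed item. The grammar text format and the function names are my own assumptions, and $ is treated as an ordinary terminal, as in the grammars above.

```python
def parse_grammar(text):
    # One rule per line, "LHS -> alt | alt", symbols separated by spaces.
    prods = []
    for line in text.strip().splitlines():
        lhs, rhs = line.split("->")
        for alt in rhs.split("|"):
            prods.append((lhs.strip(), tuple(alt.split())))
    return prods

def lr0_conflicts(prods):
    """Return the set of conflict kinds found in the LR(0) automaton."""
    nonterminals = {lhs for lhs, _ in prods}

    def closure(kernel):
        items, work = set(kernel), list(kernel)
        while work:
            p, d = work.pop()
            rhs = prods[p][1]
            if d < len(rhs) and rhs[d] in nonterminals:
                for q, (lhs, _) in enumerate(prods):
                    if lhs == rhs[d] and (q, 0) not in items:
                        items.add((q, 0))
                        work.append((q, 0))
        return frozenset(items)

    start_symbol = prods[0][0]
    start = closure({(p, 0) for p, (lhs, _) in enumerate(prods) if lhs == start_symbol})
    states, work, conflicts = {start}, [start], set()
    while work:
        state = work.pop()
        reduces = [p for p, d in state if d == len(prods[p][1])]
        next_symbols = {prods[p][1][d] for p, d in state if d < len(prods[p][1])}
        if len(reduces) > 1:
            conflicts.add("reduce/reduce")
        if reduces and next_symbols - nonterminals:
            conflicts.add("shift/reduce")
        for sym in next_symbols:
            nxt = closure({(p, d + 1) for p, d in state
                           if d < len(prods[p][1]) and prods[p][1][d] == sym})
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return conflicts

ORIGINAL = """
S -> S' $
S' -> a S' b | T
T -> c T | c
"""
DEFERRED = """
S -> S' $
S' -> U | a S' b
U -> a c T
T -> b | c T
"""
UNION = """
S -> S1' $ | S2' $
S1' -> U | a S1' b
U -> a c T
T -> b | c T
S2' -> c | S2' c
"""
FACTORED = """
S -> S' $
S' -> S1' | S2'
S1' -> U | a S1' b
U -> a c T
T -> b | c T
S2' -> c | S2' c
"""
for name, text in [("original", ORIGINAL), ("deferred", DEFERRED),
                   ("union", UNION), ("factored", FACTORED)]:
    print(name, sorted(lr0_conflicts(parse_grammar(text))))
```

Under this sketch the original and the right-factored grammars each report a shift/reduce conflict, while the deferred (n > 0) grammar and the union grammar report none, which matches the discussion above.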
(Many thanks to @templatetypedef for checking the original answer and identifying a flaw, and also to @Dennis, who observed that c* was omitted.)
