Understanding whether a grammar is LR(1) without building a parsing table

I've found an exercise that requires a trick to determine whether a grammar is LR(1) without any parsing-table operations.
The grammar is the following:
S -> Aa | Bb
A -> aAb | ab
B -> aBbb | abb
Do you know what the trick behind it is?
Thanks, :)

Imagine that you're an LR(1) parser and that you've just read aab with a lookahead of b. (I know, you're probably thinking "man, that happens to me all the time!") What exactly should you do here?
Looking at the grammar, you can't tell whether the initial production was Aa or Bb, so you're going to have to simultaneously consider production rules for A and for B. If you look at the A options, you'll see that one option here would be to reduce A → ab, which is plausible here because the lookahead is a b, and that's precisely what you'd expect to find after seeing an ab when expanding out an A (notice that there's the rule A → aAb, so any recursively-expanded As would be followed by a b). So that tells you to reduce. On the other hand, look at the B options. If you see aab followed by a b, you'd be thinking "oh, that second b is going to make aabb, and then I'd go and reduce B → abb, because that's totally a thing I like to do because I'm an LR(1) parser." So that tells you to shift. At that point, bam! You've got a shift/reduce conflict, so you're almost certainly not going to have an LR(1) grammar.
So does that actually happen? Well, let's go build the LR(1) configurating sets that we'd see if we did indeed read aab and see b as a lookahead:
Initial State
S' -> .S [$]
S -> .Aa [$]
S -> .Bb [$]
A -> .aAb [a]
A -> .ab [a]
B -> .aBbb [b]
B -> .abb [b]
State after reading a
A -> a.Ab [a]
A -> a.b [a]
A -> .aAb [b]
A -> .ab [b]
B -> a.Bbb [b]
B -> a.bb [b]
B -> .aBbb [b]
B -> .abb [b]
State after reading aa
A -> a.Ab [b]
A -> a.b [b]
A -> .aAb [b]
A -> .ab [b]
B -> a.Bbb [b]
B -> a.bb [b]
B -> .aBbb [b]
B -> .abb [b]
State after reading aab
A -> ab. [b]
B -> ab.b [b]
And hey! There's that shift/reduce conflict we were talking about. The first item reduces on b, but the second shifts on b. So there you go! Our intuition led us to think this wasn't going to be an LR(1) grammar, and the configurating sets bear that out.
So how would you know to try that? Well, in general, it's pretty hard to do. The main cue, for me at least, is that the parser has to guess whether it wants A or B at some point, but the only thing that breaks the tie is the number of bs. The parser was at some point going to have to determine whether it's looking at ab and should go with A, or at abb and should go with B, but it can't see both of the bs before making that decision. That led me to look for a conflict where we've seen enough input to know that some recursion was happening (so that the trailing b would cause a problem), and where the recursion differs between the two production rules.
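If you'd like to make the hand construction above mechanical, here is a minimal Python sketch (the names and structure are my own, not from any particular tool) that computes the same configurating sets via closure/goto, walks them over the input aab, and detects the conflict. It assumes the grammar has no ε-productions and no left recursion, both of which hold here.

# Grammar from the question, augmented with S' -> S.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A", "a"), ("B", "b")],
    "A": [("a", "A", "b"), ("a", "b")],
    "B": [("a", "B", "b", "b"), ("a", "b", "b")],
}
NONTERMINALS = set(GRAMMAR)

def first_of(symbols):
    # FIRST of a symbol string; safe here because no production is
    # left-recursive and nothing derives epsilon.
    sym = symbols[0]
    if sym not in NONTERMINALS:
        return {sym}
    return set().union(*(first_of(rhs) for rhs in GRAMMAR[sym]))

def closure(items):
    # Items are tuples (lhs, rhs, dot_position, lookahead).
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for la2 in first_of(rhs[dot + 1:] + (la,)):
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0, la2)
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

def goto(items, sym):
    return closure({(l, r, d + 1, la) for (l, r, d, la) in items
                    if d < len(r) and r[d] == sym})

state = closure({("S'", ("S",), 0, "$")})
for sym in "aab":                       # walk the states for input aab
    state = goto(state, sym)

reduce_las = {la for (l, r, d, la) in state if d == len(r)}
shift_syms = {r[d] for (l, r, d, la) in state
              if d < len(r) and r[d] not in NONTERMINALS}
print("shift/reduce conflict on:", reduce_las & shift_syms)  # -> {'b'}

Running it prints a conflict on b, matching the state after reading aab above.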

Related

LALR(1) Parser DFA Lookahead Core Question

I am having trouble understanding what the rules are for adding a lookahead to a core production during the construction of the DFA. To illustrate my confusion, I will be using an online parser generator that exposes all the internal calculations: this_tool.
(The formatting is: NONTERMINAL -> RULE, LOOKAHEADS, where the lookaheads are forward-slash separated.)
Using this grammar as an example:
S -> E
E -> ( E )
E -> N O E
E -> N
N -> 1
N -> 2
N -> 3
O -> +
O -> -
Copying and pasting the above grammar into the LALR parser generator produces a DFA with 12 states (click the >>). My question, finally: why are the goto(0, N) kernel productions, {[E -> N.O E, $/)]; [E -> N., $/)]}, initialized with the ) terminal? Where does the ) come from? I would expect goto(0, N) to be {[E -> N.O E, $]; [E -> N., $]}. Likewise, the kernel production in goto(0, () has an 'extra' ).
As the DFA is being constructed, equal cores are merged (the core being the set of productions that introduces a new state when closure is performed on it). State 2 has the production [E -> .N, )], which when merged with [E -> N., $] produces the correct output, but there's no way for state 0 to have known about the ) lookahead.
Thanks in advance; sorry if this was a confusing and overly specific question, and for using an external website to demonstrate my issue. ✌️
The solution is to propagate any newly found lookaheads along the goto edges, into the kernel ('core') items of the successor states, and repeat until nothing changes.
The method is described in chapter 4 section 7.5 of the Dragon Book 2nd ed.
(here: https://github.com/muthukumarse/books/blob/master/Dragon%20Book%20Compilers%20Principle%20Techniques%20and%20Tools%202nd%20Edtion.pdf)
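For intuition, here is a tiny Python sketch of just the propagation step; the state names, items, and links below are schematic stand-ins (not output from the linked tool). Lookaheads that are generated spontaneously in one state's closure flow along the propagation links into the kernels of successor states until a fixed point is reached, which is how the merged goto(0, N) kernel ends up with both $ and ).

def propagate_lookaheads(kernel_items, spontaneous, propagates_to):
    # Dragon Book sec. 4.7.5: seed kernel items with spontaneously generated
    # lookaheads, then push lookaheads along propagation links to a fixpoint.
    lookaheads = {item: set(spontaneous.get(item, ())) for item in kernel_items}
    changed = True
    while changed:
        changed = False
        for src, dsts in propagates_to.items():
            for dst in dsts:
                new = lookaheads[src] - lookaheads[dst]
                if new:
                    lookaheads[dst] |= new
                    changed = True
    return lookaheads

# Schematic stand-in for the question: the merged goto(0, N) kernel item
# [E -> N.] gets ')' spontaneously (from the closure of the '(' state,
# where [E -> .N, )] lives) and '$' by propagation from state 0.
kernel = {("I0", "S -> .E"), ("IN", "E -> N.")}
spontaneous = {("I0", "S -> .E"): {"$"}, ("IN", "E -> N."): {")"}}
links = {("I0", "S -> .E"): [("IN", "E -> N.")]}
print(propagate_lookaheads(kernel, spontaneous, links)[("IN", "E -> N.")])
# -> {')', '$'}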

Fixing a grammar to be LR(0)

Question:
Given the following grammar, fix it to an LR(0) grammar:
S -> S' $
S'-> aS'b | T
T -> cT | c
Thoughts
I've been trying this for quite some time, using automatic tools to check my fixed grammars, with no success. Our professor likes asking these kinds of questions on tests without giving us a methodology for approaching them (other than repeated trying). Is there a method that can be applied to answer these kinds of questions? Can anyone show how such a method applies to this example?
I don't know of an automatic procedure, but the basic idea is to defer decisions. That is, if at a particular state in the parse, both shift and reduce actions are possible, find a way to defer the reduction.
In the LR(0) parser, you can make a decision based on the token you just shifted, but not on the token you (might be) about to shift. So you need to move decisions to the end of productions, in a manner of speaking.
For example, your language consists of all sentences { a^n c^m b^n $ | n ≥ 0, m > 0 }. If we restrict that to n > 0, then an LR(0) grammar can be constructed by deferring the reduction decision to the point following a b:
S -> S' $.
S' -> U | a S' b.
U -> a c T.
T -> b | c T.
That grammar is LR(0). In the original grammar, at the itemset including T -> c . and T -> c . T, both shift and reduce are possible: shift c and reduce before b. By moving the b into the production for T, we defer the decision until after the shift: after shifting b, a reduction is required; after c, the reduction is impossible.
But that forces every sentence to have at least one b. It omits sentences for which n = 0 (that is, the regular language c*$). That subset has an LR(0) grammar:
S -> S' $.
S' -> c | S' c.
We can construct the union of these two languages in a straightforward manner, renaming one of the S's:
S -> S1' $ | S2' $.
S1' -> U | a S1' b.
U -> a c T.
T -> b | c T.
S2' -> c | S2' c.
This grammar is LR(0), but the form in which the end-of-input sentinel $ has been included seems to be cheating. At least, it violates the rule for augmented grammars, because an augmented grammar's base rule is always S -> S' $ where S' and $ are symbols not used in the original grammar.
It might seem that we could avoid that technicality by right-factoring:
S -> S' $
S' -> S1' | S2'
Unfortunately, while that grammar is still deterministic, and does recognise exactly the original language, it is not LR(0).
(Many thanks to @templatetypedef for checking the original answer and identifying a flaw, and also to @Dennis, who observed that c* was omitted.)
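If you'd rather check candidates mechanically than by repeated trying, a small LR(0) conflict checker is easy to write. Here is a sketch (helper names are mine, using the usual item-set construction, with $ kept as an ordinary grammar symbol) that confirms the union grammar above is LR(0); swapping in the right-factored variant makes it report a conflict instead.

# Final union grammar from the answer.
GRAMMAR = {
    "S":   [("S1'", "$"), ("S2'", "$")],
    "S1'": [("U",), ("a", "S1'", "b")],
    "U":   [("a", "c", "T")],
    "T":   [("b",), ("c", "T")],
    "S2'": [("c",), ("S2'", "c")],
}
NONTERMINALS = set(GRAMMAR)

def closure(items):
    # LR(0) items are tuples (lhs, rhs, dot_position); no lookaheads.
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for prod in GRAMMAR[rhs[dot]]:
                if (rhs[dot], prod, 0) not in items:
                    items.add((rhs[dot], prod, 0))
                    work.append((rhs[dot], prod, 0))
    return frozenset(items)

def goto(items, sym):
    return closure({(l, r, d + 1) for (l, r, d) in items
                    if d < len(r) and r[d] == sym})

def lr0_conflicts(start="S"):
    state = closure({(start, rhs, 0) for rhs in GRAMMAR[start]})
    states, work, conflicts = {state}, [state], []
    while work:
        st = work.pop()
        complete = [(l, r) for (l, r, d) in st if d == len(r)]
        can_shift = any(d < len(r) and r[d] not in NONTERMINALS
                        for (l, r, d) in st)
        if len(complete) > 1 or (complete and can_shift):
            conflicts.append(st)        # reduce/reduce or shift/reduce
        for sym in {r[d] for (l, r, d) in st if d < len(r)}:
            nxt = goto(st, sym)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return conflicts

print("LR(0)?", not lr0_conflicts())    # -> True for the union grammar

With S' -> S1' | S2' substituted in, the state reached on S2' contains the complete item S' -> S2'. alongside the shift item S2' -> S2'.c, which is exactly the shift/reduce conflict the last paragraph alludes to.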

Generating the LL(1) parsing table for the given CFG

The CFG is as following :
S -> SD|SB
B -> b|c
D -> a|dB
The method I tried is as follows:
I removed the non-determinism from the first production (S -> SD | SB) using the left-factoring method.
So, the CFG after applying left-factoring is:
S -> SS'
S'-> D|B
B -> b|c
D -> a|dB
I need to find FIRST(S) for the production S -> SS' in order to proceed further. Could someone please help or advise?
You cannot convert this grammar into an LL(1) parser that way: the grammar is left-recursive, so you will first have to perform left-recursion removal. The point is that you can perform the following trick: since the only rules for S are S -> SS' and S -> ε, you can simply reverse the order, and thus introduce the rule S -> S'S. So now the grammar is:
S -> S'S
S'-> D|B
B -> b|c
D -> a|dB
Now we can construct first: first(B)={b,c}, first(D)={a,d}, first(S')={a,b,c,d} and first(S)={a,b,c,d}.
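To check those sets mechanically, here is a minimal fixed-point computation in Python. Note that it writes in the ε-production for S that the answer relies on (an empty right-hand side), since S -> S'S alone would never terminate.

GRAMMAR = {
    "S":  [("S'", "S"), ()],            # S -> S'S | epsilon
    "S'": [("D",), ("B",)],
    "B":  [("b",), ("c",)],
    "D":  [("a",), ("d", "B")],
}
NONTERMINALS = set(GRAMMAR)

first = {nt: set() for nt in NONTERMINALS}
nullable = {nt: False for nt in NONTERMINALS}
changed = True
while changed:                          # iterate until nothing changes
    changed = False
    for lhs, prods in GRAMMAR.items():
        for rhs in prods:
            all_nullable = True
            for sym in rhs:
                add = first[sym] if sym in NONTERMINALS else {sym}
                if not add <= first[lhs]:
                    first[lhs] |= add
                    changed = True
                if sym in NONTERMINALS and nullable[sym]:
                    continue            # look past nullable symbols
                all_nullable = False
                break                   # FIRST can't see past this symbol
            if all_nullable and not nullable[lhs]:
                nullable[lhs] = True
                changed = True

print(first["S'"], first["S"])          # both {'a', 'b', 'c', 'd'}

The fixpoint converges in a few passes and reproduces exactly the sets stated above: first(B)={b,c}, first(D)={a,d}, first(S')=first(S)={a,b,c,d}.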

Dealing with infinite loops when constructing states for LR(1) parsing

I'm currently constructing LR(1) states from the following grammar.
S->AS
S->c
A->aA
A->b
where A,S are nonterminals and a,b,c are terminals.
This is the construction of I0
I0: S' -> .S, epsilon
---------------
S -> .AS, epsilon
S -> .c, epsilon
---------------
S -> .AS, a
S -> .c, c
A -> .aA, a
A -> .b, b
And I1.
From S, I1: S' -> S., epsilon //DONE
And so on. But when I get to constructing I4...
From a, I4: A -> a.A, a
-----------
A -> .aA, a
A -> .b, b
The problem is
A -> .aA
When I attempt to construct the next state from a, I'm going to once again get the exact same content of I4, and this continues infinitely. A similar loop occurs with
S -> .AS
So, what am I doing wrong? There has to be some detail that I'm missing, but I've browsed my notes and my book and either can't find or just don't understand what's wrong here. Any help?
I'm pretty sure I figured out the answer. Obviously, states can point to each other (including back to themselves), so there's no need to create a new state when one with identical contents already exists. I'd still like it if someone could confirm this, though.
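That is exactly right, and it's easy to see in code. Below is a minimal sketch (helper names are mine) of the standard worklist construction for this grammar: each state is a frozenset of items, so a goto that reproduces an existing state, like the goto on a out of I4, is detected and simply not enqueued again.

GRAMMAR = {
    "S'": [("S",)],
    "S": [("A", "S"), ("c",)],
    "A": [("a", "A"), ("b",)],
}
NONTERMINALS = set(GRAMMAR)

def first_of(symbols):
    # FIRST of a symbol string; fine here since nothing is left-recursive
    # and nothing derives epsilon.
    sym = symbols[0]
    if sym not in NONTERMINALS:
        return {sym}
    return set().union(*(first_of(rhs) for rhs in GRAMMAR[sym]))

def closure(items):
    # LR(1) items are tuples (lhs, rhs, dot_position, lookahead).
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for la2 in first_of(rhs[dot + 1:] + (la,)):
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0, la2)
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

def goto(items, sym):
    return closure({(l, r, d + 1, la) for (l, r, d, la) in items
                    if d < len(r) and r[d] == sym})

start = closure({("S'", ("S",), 0, "$")})
states, work = {start}, [start]
while work:
    st = work.pop()
    for sym in {r[d] for (_, r, d, _) in st if d < len(r)}:
        nxt = goto(st, sym)
        if nxt not in states:       # the check that stops the "infinite" loop
            states.add(nxt)
            work.append(nxt)
print(len(states), "LR(1) states")  # a finite collection; construction halts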

Will this 'algorithm' for nullable and first work (in a parser)?

Working through this for fun: http://www.diku.dk/hjemmesider/ansatte/torbenm/Basics/
Example calculation of nullable and first uses a fixed-point calculation. (see section 3.8)
I'm doing things in Scheme and relying a lot on recursion.
If you try to implement nullable or first via recursion, it should be clear you'll recur infinitely on a production like
N -> N a b
where N is a non-terminal and a,b are terminals.
Could this be solved, recursively, by maintaining a set of non-terminals seen on the left hand side of production rules, and ignoring them after we have accounted for them once?
This seems to work for nullable. What about for first?
EDIT: This is what I have learned from playing around. Source code link at bottom.
Non-terminals cannot be ignored in the calculation of first unless they are nullable.
Consider:
N -> N a
N -> X
N ->
Here we can ignore N in N a because N is nullable. We can replace N -> N a with N -> a and deduce that a is a member of first(N).
Here we cannot ignore N:
N -> N a
N -> M
M -> b
If we ignored the N in N -> N a we would deduce that a is in first(N) which is false. Instead, we see that N is not nullable, and hence when calculating first, we can omit any production where N is found as the first symbol in the RHS.
This yields:
N -> M
M -> b
which tells us b is in first(N).
Source Code: http://gist.github.com/287069
So ... does this sound OK?
I suggest you keep on reading :)
Section 3.13, Rewriting a grammar for LL(1) parsing, and especially 3.13.1, Eliminating left-recursion.
Just to note you can run into indirect left recursion as well:
A -> Bac
B -> A
B -> _also something else_
But the solution here is quite similar to eliminating the direct left recursion as in your first example.
You might want to check this paper, which explains it in a slightly more straightforward way. Less theory :)
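To make the rule from the EDIT concrete, here is a small Python sketch of it (the linked gist is in Scheme; this translation and its names are mine). A computing set guards against infinite recursion, and first only looks past a leading symbol when that symbol is nullable. On the N -> Na, N -> M, M -> b grammar it correctly reports first(N) = {b}.

GRAMMAR = {
    "N": [("N", "a"), ("M",)],
    "M": [("b",)],
}
NONTERMINALS = set(GRAMMAR)

def nullable(sym, computing=frozenset()):
    # Terminals are never nullable; a nonterminal already being computed
    # is treated as non-nullable, which cuts the recursive cycle safely.
    if sym not in NONTERMINALS or sym in computing:
        return False
    return any(all(nullable(s, computing | {sym}) for s in rhs)
               for rhs in GRAMMAR[sym])

def first(sym, computing=frozenset()):
    if sym not in NONTERMINALS:
        return {sym}
    if sym in computing:
        return set()                 # skip: this nonterminal is in progress
    out = set()
    for rhs in GRAMMAR[sym]:
        for s in rhs:
            out |= first(s, computing | {sym})
            if not nullable(s):
                break                # can't see past a non-nullable symbol
    return out

print(first("N"))  # -> {'b'}: 'a' is *not* in first(N), as argued above

Cutting the cycle with an empty set is sound here because any terminal in first(N) must be derivable along some non-repeating leftmost chain, and that chain is explored on the call where the guard is not triggered.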
