SLR Parsing - with an epsilon production

Say I have:
S -> A
A -> B C A a | ϵ
B -> k | ϵ
C -> m
Now in the initial state S' -> S, I'm going to include:
S' -> .S
Then the closure of S:
A -> .B C A a , A -> .
Closure would also include B -> .k and B -> . obviously.
But since B -> ϵ is a production, would I also have to include C -> .m in the initial state, since B can be ϵ in A -> B C A a?
I just wanted to know if I'm right and if this is the right way to deal with epsilons in grammar. If not, do guide me in the right direction. Thanks!

No, C -> . m is not part of the initial state, because C cannot be reduced without a preceding B (even if the B is reduced from ε).
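To see this mechanically, here is a minimal sketch of the LR(0) closure for the grammar in the question. The item encoding as (lhs, rhs, dot) triples and the helper names `lr0_closure` and `show` are my own choices for illustration, not taken from any particular tool.

```python
# Grammar from the question, augmented with S' -> S; () is the ε production.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A",)],
    "A": [("B", "C", "A", "a"), ()],
    "B": [("k",), ()],
    "C": [("m",)],
}

def lr0_closure(items):
    """Close a set of LR(0) items, each a (lhs, rhs, dot) triple."""
    closure = set(items)
    work = list(items)
    while work:
        _, rhs, dot = work.pop()
        # Only the symbol immediately after the dot is expanded,
        # even when that symbol can derive ε.
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in closure:
                    closure.add(item)
                    work.append(item)
    return closure

def show(item):
    """Render an item with a '.' at the dot position."""
    lhs, rhs, dot = item
    parts = list(rhs[:dot]) + ["."] + list(rhs[dot:])
    return f"{lhs} -> {' '.join(parts)}"

state0 = lr0_closure({("S'", ("S",), 0)})
for line in sorted(show(i) for i in state0):
    print(line)
```

The printed state contains B -> . but no item for C. In an SLR table the parser reduces B -> ε in this state on lookahead m (FOLLOW(B) = {m}), then moves on B to the state containing A -> B . C A a, and only the closure of that state adds C -> .m.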

Related

Calculating First and Follow of a grammar

I'm trying to calculate First and Follow of the following grammar:
S -> A B C D E
A -> a
A -> EPSILON
B -> b
B -> EPSILON
C -> c
D -> d
D -> EPSILON
E -> e
E -> EPSILON
I calculated them and got First(S) = {a, b, c}. But this tool says First(S) = {a, ε, c, b}. Why is epsilon part of First(S)? As I understand it, it should not be there. Is it my mistake or a bug? If it's a bug, are there other tools I can use to verify my results? If it's my mistake, it would be helpful to understand why.
Also I got Follow(C)={d,e,$} but their result is Follow(C)={c, d, $}. Why?
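One way to cross-check results like these without relying on a particular tool is to run the usual fixed-point computation yourself. The following is a minimal sketch under the standard textbook definitions; the function names and the list-based grammar encoding are assumptions made for this example.

```python
EPS, END = "ε", "$"

# Grammar from the question; an empty list is the EPSILON production.
GRAMMAR = {
    "S": [["A", "B", "C", "D", "E"]],
    "A": [["a"], []],
    "B": [["b"], []],
    "C": [["c"]],
    "D": [["d"], []],
    "E": [["e"], []],
}

def first_of(seq, first):
    """FIRST of a symbol string; contains EPS only if every symbol is nullable."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in GRAMMAR else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, prods in GRAMMAR.items():
            for rhs in prods:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first, start="S"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:
                        continue
                    tail = first_of(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:  # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FIRST["S"]), sorted(FOLLOW["C"]))
```

Under these definitions the sketch gives First(S) = {a, b, c}, because C is not nullable and so S cannot derive ε, and Follow(C) = {d, e, $}, because both D and E can vanish.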

Which productions are considered in LR(1) lookahead?

I'm currently looking at two closure calculation examples using the tool at
http://jsmachines.sourceforge.net/machines/lr1.html
Example 1
S -> A c
A -> b B
B -> A b
Here, the initial state ends up with a closure of:
{[S -> .A c, $]; [A -> .b B, c]}
Example 2
S -> A B
A -> a
B -> b
B -> ''
The calculated first step closure is:
{[S -> .A B, $]; [A -> .a, b/$]}
In example 1, why is the follow of b from rule 3 not included in the lookahead? In case 2, we follow B to figure out that $ is part of the lookahead so is there some special reason to not consider all rules in case 1?
When doing a closure with ". A α" we use FIRST(α) as the lookahead, and only include the containing (parent) lookahead if ε ∈ FIRST(α). In example 1, ε ∉ FIRST(c), so the lookahead is just c. In example 2, ε ∈ FIRST(B), so we add the containing lookahead ($ in this case) to the lookahead.
FOLLOW is never relevant.
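The rule can be written down in a few lines. This is only a sketch: the FIRST sets are passed in by hand rather than computed, and the names `first_of_sequence` and `closure_lookaheads` are made up for this example.

```python
EPS = "ε"

def first_of_sequence(seq, first):
    """FIRST of a symbol string; contains EPS only if every symbol is nullable."""
    out = set()
    for sym in seq:
        sym_first = first.get(sym, {sym})  # a terminal's FIRST set is itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out

def closure_lookaheads(alpha, parent_lookaheads, first):
    """Lookaheads for items added when expanding [X -> β . A α, parent_lookaheads]."""
    f = first_of_sequence(alpha, first)
    if EPS in f:
        # α can vanish, so the parent's lookaheads show through.
        return (f - {EPS}) | set(parent_lookaheads)
    return f

# Example 1: [S -> .A c, $], so α = c
ex1 = closure_lookaheads(["c"], {"$"}, first={"A": {"b"}, "B": {"b"}})
# Example 2: [S -> .A B, $], so α = B, and B is nullable
ex2 = closure_lookaheads(["B"], {"$"}, first={"A": {"a"}, "B": {"b", EPS}})
print(sorted(ex1), sorted(ex2))
```

Example 1 yields only c, and example 2 yields b and $, matching the closures reported by the tool.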

How to remove indirect left recursion

I need help removing the indirect left recursion from this grammar:
A -> B (sB)*
| dAd
| z
B -> <id>
| sB
| A
So you could move from A->B->A.... without consuming any characters.
I tried to fix it a couple different ways but keep running into issues because of this bit (sB)*
I am not sure if I'm doing something wrong or if the grammar is wrong in general.
Before we begin, let's number your productions, so that we have something to refer to:
1: A -> B (s B)*
2: A -> d A d
3: A -> z
4: B -> <id>
5: B -> s B
6: B -> A
Since you're trying to eliminate left recursion, I can only assume you're trying to apply LL parsing. However, this grammar is ambiguous, so it can't be an LL(1) grammar. For instance, the phrase zszsz can be (leftmost) derived from A in more than one way:
A ->+ B s B (1)
->+ A s B (6)
->+ z s B (3)
->+ z s B s B (1)
->+ z s z s z (6, 3, 6, 3)
A ->+ B s B (1)
->+ A s B (6)
->+ B s B s B (1)
->+ A s B s B (6)
->+ z s B s B (3)
->+ z s z s z (6, 3, 6, 3)
The first step would be to simplify this grammar, so that every production only has sequences of terminals and non-terminals on the "expanded" side. Rule #1 has a Kleene star, so let's get rid of it by replacing it by a non-terminal C:
1: A -> B C
2: A -> d A d
3: A -> z
4: B -> <id>
5: B -> s B
6: B -> A
7: C -> <empty>
8: C -> s B C
Now, all productions are simple.
Next, we identify indirect left recursion (if any), and turn it into direct left recursion. By looking at all productions that start with a non-terminal, we find that A and B are involved in indirect left recursion (through rules #1 and #6). We can break this loop by substituting B in rule #1 with what it can produce; we replace rule #1 with
9: A -> <id> C
10: A -> s B C
11: A -> A C
Alternatively, we could break the loop by substituting the productions #1, #2, and #3 in #6. However we do it, the resulting grammar is free of indirect left recursion.
Then we eliminate direct left recursion (if any) in our grammar. This occurs in the non-terminal A, as a result of our substitution:
2: A -> d A d
3: A -> z
...
9: A -> <id> C
10: A -> s B C
11: A -> A C
We introduce another non-terminal D, and replace these rules with
12: A -> d A d D
13: A -> z D
14: A -> <id> C D
15: A -> s B C D
17: D -> <empty>
18: D -> A D
The resulting grammar is free of left recursion:
4: B -> <id>
5: B -> s B
6: B -> A
7: C -> <empty>
8: C -> s B C
12: A -> d A d D
13: A -> z D
14: A -> <id> C D
15: A -> s B C D
17: D -> <empty>
18: D -> A D
As stated in the beginning, you can't construct an LL(1) parsing table from this grammar, because the leftmost derivation of zszsz from A is still ambiguous.
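For reference, the standard textbook scheme for removing direct left recursion rewrites A -> A α | β as A -> β A', A' -> α A' | <empty>. Below is a minimal sketch of that scheme on a small expression grammar (E -> E + T | T), which is a hypothetical example and not the grammar from the question. The function name is also made up, and the sketch assumes that no α is empty or nullable, since otherwise the new non-terminal could derive itself.

```python
def eliminate_direct_left_recursion(nt, productions, new_nt):
    """Rewrite  nt -> nt α | β  as  nt -> β new_nt,  new_nt -> α new_nt | <empty>.

    Productions are lists of symbols; the empty list stands for <empty>.
    Assumes every α is non-empty and not nullable.
    """
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    others = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not recursive:
        return {nt: productions}  # nothing to do
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }

result = eliminate_direct_left_recursion(
    "E", [["E", "+", "T"], ["T"]], "E'"
)
for lhs, prods in result.items():
    for rhs in prods:
        print(lhs, "->", " ".join(rhs) if rhs else "<empty>")
```

The rewritten grammar generates the same strings as the original one, but the repetition is now right-recursive through the new non-terminal.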
Interesting. I can't see a mechanical way to do it. Is this how the language is specified or did you end up with it by some other simplifications? Anyway, a solution for the specific issue is to "inline" B in the left-recursive part:
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | sB | A
The basic idea is to substitute the non-recursive terms into the recursive part and move the recursive term to the end.
Start with
A -> B (sB)* | dAd | z
B -> <id> | sB | A
Substitute
A -> (<id> | sB | A) (sB)* | dAd | z
Define
C -> (sB)*
Substitute
A -> (<id> | sB | A) C | dAd | z
Factor
A -> <id> C | sBC | AC | dAd | z
Define
D -> <id> C | sBC | dAd | z
So
A -> D | AC
Remove left recursion
A -> D (C)*
Substitute for C and D
A -> (<id> (sB)* | sB(sB)* | dAd | z) (sB)**
Since x** = x*
A -> (<id> (sB)* | sB(sB)* | dAd | z) (sB)*
Since x*x* = x*
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | sB | A
Same result as Sreenivasa's.
Edit added after seeing @Rymoid's answer.
At this point the left recursion has been removed, so we are done. But as pointed out by @Rymoid, the grammar is still ambiguous and so not LL(1). Below I will try to cope with the ambiguity, but not to find an LL(1) grammar.
One problem is that, since A =>* sB, the choice sB | A is ambiguous and unneeded. Let's start by removing that choice. We have
A -> (<id> | sB | dAd | z) (sB)*
B -> <id> | A
Likewise A =>* <id>, so the choice <id> | A is ambiguous and not needed. We have
A -> (<id> | sB | dAd | z) (sB)*
B -> A
And then we don't need B anymore.
A -> (<id> | sA | dAd | z) (sA)*
The remaining problem is that, since s is in the follow set of A, there is no way to tell, with one token of lookahead, whether to stay in the (sA)* loop or exit it.
The original question did not ask for an LL(1) grammar, but since the post is tagged [JavaCC], we might assume that what is wanted is one that works with JavaCC. That's not quite the same thing as being LL(1), although being LL(1) implies that the grammar will work well with JavaCC.
I'll assume all uses of A outside of the definition of A are definitely not followed by an s. To be concrete about this, I'll assume that there is (only) one more production, which is S -> A <EOF>, and that S is the start nonterminal. But really the important thing is that you never have an A followed by an s except because of the loop in A's current definition.
We have
S -> A <EOF>
A -> (<id> | sA | dAd | z) (sA)*
When you have an ambiguous grammar but want to eliminate ambiguity, the question to ask yourself is: Which parse do I want in the ambiguous cases? Two answers are: "Stay in the loop as long as possible." and "Jump out of the loop as soon as possible." (Other answers are possible, but unlikely.)
"Stay in the loop as long as possible"
This is the JavaCC default, so there is no need to change the grammar. It might generate a warning. It might be possible to suppress that warning with LOOKAHEAD( <s> ) at the start of the loop.
"Exit the loop as soon as possible"
Make two versions of A. A0 is never followed by an s. A1 is always followed by an s. (In fact it is followed by the first s possible, so the (sA)* part is not wanted. This choice corresponds to bailing out of the loop as soon as possible.)
S -> A0 <EOF>
A0 -> (<id> | sA0 | dA0d | z) [ s (A1s)* A0 ]
A1 -> <id> | sA1 | dA0d | z
I'm fairly sure this is unambiguous and that A0 defines the same language as A. It is not LL(1) and JavaCC will give a warning that should be heeded.
To make it suitable for JavaCC we can add a syntactic lookahead of LOOKAHEAD( A1 <s> ) to the start of the loop.

Why do we put 'A' as the look ahead symbol when all have `$`?

I am using canonical LR Method to construct the Parsing table.
Consider the grammar :
s -> D C A
s -> D a B
a -> C
s -> a A
The book I am reading mentions the first closure state as :
I(0) = [s -> .D C A , $]
[s -> .D a B , $]
[a -> .C , A]
[s -> .a A , $]
In the state
[a -> .C , A]
where does the A in the item come from? All the other items have $ as the lookahead symbol, but the third item has A.
Please explain this.
The item:
[ a -> · C, A ]
Results from the expansion of the item:
[ s -> · a A ]
in which the nonterminal a is followed by the terminal A. That means that the reduction of C to a can occur in a successor state whose context is s -> a · A; or, in other words, when the lookahead is A.
All of the other items in the state you mention result from the initial (implicit) item
[ s' -> · s $ ]
where the nonterminal s is followed by the pseudo-terminal $ (that is, the end-of-input marker), so that their lookaheads are all $.

LR(1) Item DFA - Computing Lookaheads

I have trouble understanding how to compute the lookaheads for the LR(1)-items.
Let's say that I have this grammar:
S -> AB
A -> aAb | a
B -> d
An LR(1) item is an LR(0) item with a lookahead. So we will get the following items for state 0:
S -> .AB , {lookahead}
A -> .aAb, {lookahead}
A -> .a, {lookahead}
State: 1
A -> a.Ab, {lookahead}
A -> a. ,{lookahead}
A -> .aAb ,{lookahead}
A ->.a ,{lookahead}
Can somebody explain how to compute the lookaheads? What is the general approach?
Thank you in advance
The lookaheads used in an LR(1) parser are computed as follows. First, the start state has an item of the form
S -> .w ($)
for every production S -> w, where S is the start symbol. Here, the $ marker denotes the end of the input.
Next, for any state that contains an item of the form A -> x.By (t), where x and y are arbitrary strings of terminals and nonterminals and B is a nonterminal, you add an item of the form B -> .w (s) for every production B -> w and for every terminal s in the set FIRST(yt). (Here, FIRST refers to FIRST sets, which are usually introduced when talking about LL parsers. If you haven't seen them before, I would take a few minutes to look over those lecture notes.)
Let's try this out on your grammar. We start off by creating an item set containing
S -> .AB ($)
Next, using our second rule, for every production of A, we add in a new item corresponding to that production and with lookaheads of every terminal in FIRST(B$). Since B always produces the string d, FIRST(B$) = d, so all of the productions we introduce will have lookahead d. This gives
S -> .AB ($)
A -> .aAb (d)
A -> .a (d)
Now, let's build the state corresponding to seeing an 'a' in this initial state. We start by moving the dot over one step for each production that starts with a:
A -> a.Ab (d)
A -> a. (d)
Now, since the first item has a dot before a nonterminal, we use our rule to add one item for each production of A, giving those items lookahead FIRST(bd) = b. This gives
A -> a.Ab (d)
A -> a. (d)
A -> .aAb (b)
A -> .a (b)
Continuing this process will ultimately construct all the LR(1) states for this LR(1) parser. This is shown here:
[0]
S -> .AB ($)
A -> .aAb (d)
A -> .a (d)
[1]
A -> a.Ab (d)
A -> a. (d)
A -> .aAb (b)
A -> .a (b)
[2]
A -> a.Ab (b)
A -> a. (b)
A -> .aAb (b)
A -> .a (b)
[3]
A -> aA.b (d)
[4]
A -> aAb. (d)
[5]
S -> A.B ($)
B -> .d ($)
[6]
B -> d. ($)
[7]
S -> AB. ($)
[8]
A -> aA.b (b)
[9]
A -> aAb. (b)
In case it helps, I taught a compilers course last summer and have all the lecture slides available online. The slides on bottom-up parsing should cover all of the details of LR parsing and parse table construction, and I hope that you find them useful!
Hope this helps!
Here is the LR(1) automaton for the grammar, following the construction above.
I think it helps understanding to draw the automaton yourself; following the flow of items from state to state makes the idea of the lookaheads clearer.
The LR(1) item sets you constructed should include two more states:
I8 A--> aA.b , b from I2
I9 A--> aAb. , b from I8
I also get 11 states, not 8:
State 0
S: .A B ["$"]
A: .a A b ["d"]
A: .a ["d"]
Transitions
S -> 1
A -> 2
a -> 5
Reductions
none
State 1
S_Prime: S .$ ["$"]
Transitions
none
Reductions
none
State 2
S: A .B ["$"]
B: .d ["$"]
Transitions
B -> 3
d -> 4
Reductions
none
State 3
S: A B .["$"]
Transitions
none
Reductions
$ => S: A B .
State 4
B: d .["$"]
Transitions
none
Reductions
$ => B: d .
State 5
A: a .A b ["d"]
A: .a A b ["b"]
A: .a ["b"]
A: a .["d"]
Transitions
A -> 6
a -> 8
Reductions
d => A: a .
State 6
A: a A .b ["d"]
Transitions
b -> 7
Reductions
none
State 7
A: a A b .["d"]
Transitions
none
Reductions
d => A: a A b .
State 8
A: a .A b ["b"]
A: .a A b ["b"]
A: .a ["b"]
A: a .["b"]
Transitions
A -> 9
a -> 8
Reductions
b => A: a .
State 9
A: a A .b ["b"]
Transitions
b -> 10
Reductions
none
State 10
A: a A b .["b"]
Transitions
none
Reductions
b => A: a A b .
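The state count can be checked by building the canonical LR(1) collection programmatically. This is a minimal sketch, assuming the grammar is augmented with S' -> S and relying on the fact that this grammar has no empty productions, which keeps the FIRST computation short. The function names and the tuple encoding of items are choices made for this example.

```python
END = "$"

# Grammar from the question, augmented with S' -> S (an assumption).
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A", "B")],
    "A": [("a", "A", "b"), ("a",)],
    "B": [("d",)],
}

def compute_first():
    """FIRST sets by fixed point; valid only because no production is empty."""
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for rhs in prods:
                head = rhs[0]
                new = first[head] if head in GRAMMAR else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

FIRST = compute_first()

def first_of(seq):
    """FIRST of a non-empty symbol string whose symbols are never nullable."""
    head = seq[0]
    return FIRST[head] if head in GRAMMAR else {head}

def closure(items):
    """Close a set of LR(1) items, each a (lhs, rhs, dot, lookahead) tuple."""
    result = set(items)
    work = list(items)
    while work:
        _, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            # Lookaheads of the new items are FIRST(rest-of-rhs lookahead).
            for t in first_of(rhs[dot + 1:] + (la,)):
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0, t)
                    if item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)

def goto(state, sym):
    """Move the dot over sym in every item that allows it, then close."""
    moved = {(lhs, rhs, dot + 1, la) for (lhs, rhs, dot, la) in state
             if dot < len(rhs) and rhs[dot] == sym}
    return closure(moved)

def canonical_collection():
    start = closure({("S'", ("S",), 0, END)})
    states, seen = [start], {start}
    for state in states:  # the list grows while we iterate over it
        for sym in sorted({r[d] for (_, r, d, _) in state if d < len(r)}):
            nxt = goto(state, sym)
            if nxt not in seen:
                seen.add(nxt)
                states.append(nxt)
    return states

STATES = canonical_collection()
print(len(STATES))
```

With the augmented start production this prints 11: the ten states numbered [0] to [9] in the earlier listing plus the accepting state reached from the start state on S.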
