Why do we need FOLLOW set in LL(1) grammar parser? - parsing

In generated parsing function we use an algorithm which looks on a peek of a tokens list and chooses rule (alternative) based on the current non-terminal FIRST set. If it contains an epsilon (rule is nullable), FOLLOW set is checked as well.
Consider following grammar [not LL(1)]:
B : A term
A : N1 | N2
N1 :
N2 :
During calculation of the FOLLOW set terminal term will be propagated from A to both N1 and N2, so FOLLOW set won't help us decide.
On the other hand, if there is exactly one nullable alternative, we know for sure how to continue execution, even in case current token doesn't match against anything from the FIRST set (by choosing epsilon production).
If above statements are true, FOLLOW set is redundant. Is it needed only for error-handling?

Yes, it is not necessary.
I was asked precisely this question on the colloquium, and my answer that FOLLOW set is used
to check that grammar is LL(1)
to fail immediately when an error occurs, instead of dragging the ill-formatted token to some later production, where generated fail message may be unclear
and for nothing else
was accepted

While you can certainly find grammars for which FOLLOW is unnecessary (i.e., it doesn't play a role in the calculation of the parsing table), in general it is necessary.
For example, consider the grammar
S : A | C
A : B a
B : b | epsilon
C : D c
D : d | epsilon
You need to know that
Follow(B) = {a}
Follow(D) = {c}
to calculate
First(A) = {b, a}
First(C) = {d, c}
in order to make the correct choice at S.

Related

How to calculate Follow set of a production rule that has only ONE symbol on the right side

So i have this grammar :
S -> (D)
D -> EF
E -> a|b|S
F -> *D | +D | ε
First of all, books solution uses the P -> pBq , First(q) - {ε} is subset of FOLLOW(B) for the rule D -> EF but that rule has only 2 symbols do we assume ε infront of E (ε being the p in pBq)?
And secondly i can't understand how to calculate Follow(E).
FOLLOW(E) consists of every terminal symbol which can immediately follow E in some derivation step. That's the precise definition; it's not very complicated.
For a simple grammar, you should be able to figure out all the FOLLOW sets just be looking at the grammar and applying a little bit of common sense. It would probably be a good idea to do that, since it will give you a better idea of how the algorithm works.
As a side note, it's maybe worth mentioning that ε is not a thing. Or at least, it's not a grammar symbol. It's one of several conventions used to make the empty sequence visible, just like 0 is a way to make nothing visible. Sometimes that's useful, but it's important to not let it confuse you. (Abuse of notation is endemic in mathematics, which can be frustrating.)
So, what can follow E? E only appears in one place on the right-hand sdie of that grammar, in the production D → E F. So clearly any symbol which be the first symbol of F must be in FOLLOW(E). The symbols which could be at the start of F are + and *, since as mentioned, ε is not a grammar symbol. (Many definitions of FIRST allow ε to be a member of that set, along with any actual terminal symbol. That's an example of the abuse of notation I was talking about in the previous paragraph, since it makes it look like ε is a terminal symbol. But it isn't. It's nothing.)
F is what we call a "nullable" non-terminal, because it can derive the empty sequence (which was written as ε so that you can see it). In other words, it's possible for F to disappear completely in a derivation step. And if it does disappear, then E might be at the end of the production D → E F. If E is at the end of D, then it can be followed by anything which could follow D, which includes ). D can also appear at the end of a derivation of F, which means that F could be followed by anything which could follow F, a tautology which adds no information whatsoever.
So it's easy to see that FOLLOW(F) = {*, +, )}, and you can use that to check your understanding of any algorithm to compute follow sets.
Now, I don't know what book you are referring to (and it would have been courteous to mention that in your original question; sources should always be correctly cited). But the book I happen to have in front of me --the Dragon Book-- has a pretty similar algorithm. The Dragon book uses a simple convention for writing statements like that. Probably your book does, too, but it might not be the same convention. You should check what it says and make sure that you typed the copied statement correctly, respecting whatever formatting used to indicate what the symbols stand for.
In the Dragon book, some of the conventions include:
Lower case characters at the start of the alphabet. –a, b, c,…– are terminals (as well as actual symbols like * and +).
Upper case characters at the start of the alphabet. –A, B, C,…– are non-terminals.
S is the start symbol.
Upper case characters at the end of the alphabet. –X, Y, Z– stand for arbitrary grammar symbols (either terminals or non-terminals).
$ is the marker used to indicate the end of the input.
Lower-case Greek letters –α, β, γ,…– are possibly-empty strings of grammar symbols.
The phrase "possibly empty" is very important, so I'm repeating it.
With that convention, they write the rules for computing the FOLLOW set:
Place $ in FOLLOW(S).
For every production A → αBβ, copy everything from FIRST(&beta) except ε into FOLLOW(B).
If there is a production A → αB or a production A → αBβ where FIRST(β) contains ε, place everything in FOLLOW(A) into FOLLOW(B).
As mentioned above, α is a possibly-empty string of grammar symbols. So it might not be visible.
Keep doing steps 2 and 3 until no new symbols are added to any follow set.
I'm pretty sure that the algorithm in your book differs only in notation conventions.

Is it possible that FIRST SET contains same terminal more than one time

I am confused that can FIRST SET contains same terminal twice..
for example I have grammar
E->T+E|T FIRST(E)={a,a}
T->a FIRST(T)={a}
..
Is this correct? or I should write
FIRST(E)={a}
By definition sets can not contain the same element multiple times - this applies to first sets as much as any other set. So {a} is the proper way to write it.
I guess you're trying to compute the First and Follow sets, to construct the final predictive table, but generally, you need to resolve all the conflicts first, which are:
ε-derivation
Direct Left Recursion
Indirect Left Recursion
Ambiguous prefixes
In your example (Or part of it, I guess), you need to factor out ambiguous prefixes, the T.
E -> T E'
E' -> + E | ε
T -> a
Formally, for any non-terminal with derivation rules of the form A → αβ | αγ
1- Remove these 2 derivation rules
2- Create a rule A′ → β | γ
3- Create a rule A → α A′
Check out this Paper about Conflicts, it was very helpful for me, and you might as well check this slide and this, if you have any problem with top-down parsing.

Why is this grammar LL(1) even though all the FIRST sets are the same?

Consider the following CFG:
S := AbC
A := aA | epsilon
C := Ac
Here, FIRST(A) = FIRST(B) = FIRST(C) = {a, ε}, so all the FIRST sets are the same. However, this grammar is supposedly LL(1). How is that possible? Wouldn't that mean that there would be a bunch of FIRST/FIRST conflicts everywhere?
There's nothing fundamentally wrong about having multiple nonterminals that have the same FIRST sets. Things only become problematic if you have multiple nonterminals with overlapping FIRST or FOLLOW sets in a context where you have to choose between a number of production options.
As an example, consider this simple grammar:
A → bB | cC
B → b | c
C → b | c
Notice that all three of A, B, and C have the same FIRST set, namely {b, c}. But this grammar is also LL(1). While you can formally convince yourself of this by writing out the actual LL(1) parsing table, you can think of this in another way as well. Imagine you're reading the nonterminal A, and you see the character b. Which production should you pick: A → bB, or A → cC? Well, there's no reason to pick A → cC, because that would put c at the front of your string. So don't pick that one. Instead, pick A → bB. Similarly, suppose you're reading an A and you see the character c. Which production should you pick? You'd never want to pick A → bB, since that would put b at the front of your string. Instead, you'd pick A → cC.
Notice that in this discussion, we never stopped to think about what FIRST(B) or FIRST(C) was. It simply didn't come up because we never needed to know what characters could be produced there.
Now, let's look at your example. If you're trying to expand an S, there's only one possible production to apply, which is S → AbC. So there's no possible conflict here; when you see S, you always apply that rule. Similarly, if you're trying to expand a C, there's no choice of what to do. You have to expand C → Ac.
So now let's think about the nonterminal A, where there really is a choice of what to do next. If you see the character a, then we have to decide - do we expand out A → aA, or do we expand out A → ε? In answering that question, we have to think about the FOLLOW set of A, since the production A → ε would only make sense to pick if we saw a terminal symbol where we basically just want to get A out of the way. Here, FOLLOW(A) = {b, c}, with the b coming from the production S → AbC and the c coming from the production C → Ac. So we'd only pick A → ε if we see b or c, not if we see a. That means that
on reading a, we pick A → aA, and
on reading b o r c, we pick A → ε.
Notice that in this discussion we never needed to think about what FIRST(B) or FIRST(C) was. In fact, we never even needed to look at what FIRST(A) was either! So that's why there isn't necessarily a conflict. Were we to encounter a scenario where we needed to compare FIRST(A) against FIRST(B) or something like that, then yes, we'd definitely have a conflict. But that never came up, so no conflict exists.

Determining FOLLOW sets of CFG

On this page the author explains how to determine the FOLLOW sets of a CFG. Under the headline Syntax Analysis Goal: FOLLOW Sets he states:
Steps to Make the Follow Set
Conventions: a, b, and c represent a terminal or non-terminal. a*
represents zero or more terminals or non-terminals (possibly both). a+
represents one or more... D is a non-terminal.
Place an End of Input token ($) into the starting rule's follow set.
Suppose we have a rule R → a*Db. Everything in First(b) (except for ε)
is added to Follow(D). If First(b) contains ε then everything in
Follow(R) is put in Follow(D).
Finally, if we have a rule R → a*D,
then everything in Follow(R) is placed in Follow(D).
The Follow set of
a terminal is an empty set.
So far so good. But in the box below this item, we read:
[...] Step 2 on rule 1 (N → V = E) indicates that first(=) is in Follow(V).
Now this is the part I don't understand. When he says that First(=) is in Follow (V), he obviously maps = to b and V to D (b and D from the explication in the first box). But (a*)(D)(b) does not match ()(V)(=)E.
Am I reading this completely wrong, or did the author maybe write a*Db instead of a*Dba*?
(Especially if you read this on wikipedia: "FOLLOW(I) of an Item I [A → α • B β, x] is the set of terminals that can appear after nonterminal B, where α, β are arbitrary symbol strings, and x is an arbitrary lookahead terminal.")
Yes, he meant:
R → a* D b*
and since b* could be zero symbols, i.e. ε, the second rule is unneeded. Remember that FIRST is defined on arbitrary sequences of symbols.
In other words, for:
A → α B β
Every terminal in FIRST(β) is in FOLLOW(B), and
If β ⇒* ε, then everything in FOLLOW(A) is in FOLLOW(B).
Here's what Aho, Sethi & Ullman say in the dragon book:
Formally, we say LR(1) item [A → α·β, a] is valid for a viable prefix γ if there is a derivation S ⇒* δAw ⇒ δαβw
where γ = δα and either a is the first symbol of w or w is ε and a is $.
(The ⇒'s above are marked rm, meaning right-most derivation; in other words, in every derivation step, the right-most non-terminal is substituted with one of its productions. Consequently, w only contains terminals.)
In other words, the LR(1) item is valid (could apply) if we've reached some point where we've decided that A might be the next reduction and a might follow A; at the current point in the parse, we've read α. So if a follows β, then the reduction is possible. We don't yet know that, unless β is the empty sequence, but we need to remember the fact in case it turns out that β can derive the empty sequence.
I hope that helps. It's late here and I'm too tired to check it again. Maybe tomorrow...

What are FIRST and FOLLOW sets used for in parsing?

What are FIRST and FOLLOW sets? What are they used for in parsing?
Are they used for top-down or bottom-up parsers?
Can anyone explain me FIRST and FOLLOW SETS for the following set of grammar rules:
E := E+T | T
T := T*V | T
V := <id>
They are typically used in LL (top-down) parsers to check if the running parser would encounter any situation where there is more than one way to continue parsing.
If you have the alternative A | B and also have FIRST(A) = {"a"} and FIRST(B) = {"b", "a"} then you would have a FIRST/FIRST conflict because when "a" comes next in the input you wouldn't know whether to expand A or B. (Assuming lookahead is 1).
On the other side if you have a Nonterminal that is nullable like AOpt: ("a")? then you have to make sure that FOLLOW(AOpt) doesn't contain "a" because otherwise you wouldn't know if to expand AOpt or not like here: S: AOpt "a" Either S or AOpt could consume "a" which gives us a FIRST/FOLLOW conflict.
FIRST sets can also be used during the parsing process for performance reasons. If you have a nullable nonterminal NullableNt you can expand it in order to see if it can consume anything, or it may be faster to check if FIRST(NullableNt) contains the next token and if not simply ignore it (backtracking vs predictive parsing). Another performance improvement would be to additionally provide the lexical scanner with the current FIRST set, so the scanner does not try all possible terminals but only those that are currently allowed by the context. This conflicts with reserved terminals but those are not always needed.
Bottom up parsers have different kinds of conflicts namely Reduce/Reduce and Shift/Reduce. They also use item sets to detect conflicts and not FIRST,FOLLOW.
Your grammar would't work with LL-parsers because it contains left recursion. But the FIRST sets for E, T and V would be {id} (assuming your T := T*V | T is meant to be T := T*V | V).
Answer :
E->E+T|T
left recursion
E->TE'
E'->+TE'|eipsilon
T->T*V|T
left recursion
T->VT'
T'->*VT'|epsilon
no left recursion in
V->(id)
Therefore the grammar is:
E->TE'
E'->+TE'|epsilon
T->VT'
T'->*VT'|epsilon
V-> (id)
FIRST(E)={(}
FIRST(E')={+,epsilon}
FIRST(T)={(}
FIRST(T')={*,epsilon}
FIRST(V)={(}
Starting Symbol=FOLLOW(E)={$}
E->TE',E'->TE'|epsilon:FOLLOW(E')=FOLLOW(E)={$}
E->TE',E'->+TE'|epsilon:FOLLOW(T)=FIRST(E')={+,$}
T->VT',T'->*VT'|epsilon:FOLLOW(T')=FOLLOW(T)={+,$}
T->VT',T->*VT'|epsilon:FOLLOW(V)=FIRST(T)={ *,epsilon}
Rules for First Sets
If X is a terminal then First(X) is just X!
If there is a Production X → ε then add ε to first(X)
If there is a Production X → Y1Y2..Yk then add first(Y1Y2..Yk) to first(X)
First(Y1Y2..Yk) is either
First(Y1) (if First(Y1) doesn't contain ε)
OR (if First(Y1) does contain ε) then First (Y1Y2..Yk) is everything in First(Y1) except for ε as well as everything in First(Y2..Yk)
If First(Y1) First(Y2)..First(Yk) all contain ε then add ε to First(Y1Y2..Yk) as well.
Rules for Follow Sets
First put $ (the end of input marker) in Follow(S) (S is the start symbol)
If there is a production A → aBb, (where a can be a whole string) then everything in FIRST(b) except for ε is placed in FOLLOW(B).
If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B)
If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
Wikipedia is your friend. See discussion of LL parsers and first/follow sets.
Fundamentally they are used as the basic for parser construction, e.g., as part of parser generators. You can also use them to reason about properties of grammars, but most people don't have much of a need to do this.

Resources