What are FIRST and FOLLOW sets used for in parsing?

What are FIRST and FOLLOW sets? What are they used for in parsing?
Are they used for top-down or bottom-up parsers?
Can anyone explain the FIRST and FOLLOW sets for the following grammar rules:
E := E+T | T
T := T*V | T
V := <id>

They are typically used in LL (top-down) parsers to check if the running parser would encounter any situation where there is more than one way to continue parsing.
If you have the alternative A | B and also have FIRST(A) = {"a"} and FIRST(B) = {"b", "a"} then you would have a FIRST/FIRST conflict because when "a" comes next in the input you wouldn't know whether to expand A or B. (Assuming lookahead is 1).
On the other hand, if you have a nullable nonterminal such as AOpt: ("a")?, you have to make sure that FOLLOW(AOpt) doesn't contain "a", because otherwise you wouldn't know whether to expand AOpt or not. For example, in S: AOpt "a", either S or AOpt could consume the "a", which gives us a FIRST/FOLLOW conflict.
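A rough sketch of how both checks could be automated for a single nonterminal, given the FIRST set of each alternative and the FOLLOW set of the nonterminal. The function name `ll1_conflicts` and the use of the string "ε" to mark a nullable alternative are assumptions made for illustration; they do not come from any particular parser generator.

```python
EPS = "ε"  # assumed marker for "this alternative can derive the empty string"

def ll1_conflicts(alternatives, follow):
    """alternatives: list of FIRST sets, one per alternative of a nonterminal.
    follow: the FOLLOW set of that nonterminal.
    Returns a list of (kind, i, j, overlapping_tokens) tuples."""
    conflicts = []
    # FIRST/FIRST: two alternatives can start with the same token
    # (two nullable alternatives also overlap, on ε).
    for i in range(len(alternatives)):
        for j in range(i + 1, len(alternatives)):
            overlap = alternatives[i] & alternatives[j]
            if overlap:
                conflicts.append(("FIRST/FIRST", i, j, overlap))
    # FIRST/FOLLOW: alternative i is nullable, and a token that may follow
    # the nonterminal can also start another alternative j.
    for i, first in enumerate(alternatives):
        if EPS not in first:
            continue
        for j, other in enumerate(alternatives):
            if j == i:
                continue
            overlap = (other - {EPS}) & follow
            if overlap:
                conflicts.append(("FIRST/FOLLOW", i, j, overlap))
    return conflicts

# A | B with FIRST(A) = {"a"} and FIRST(B) = {"b", "a"}
print(ll1_conflicts([{"a"}, {"a", "b"}], set()))
# AOpt: "a" | ε, used in S: AOpt "a", so FOLLOW(AOpt) contains "a"
print(ll1_conflicts([{"a"}, {EPS}], {"a"}))
```

An empty result means the nonterminal passes the LL(1) check with one token of lookahead.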
FIRST sets can also be used during the parsing process for performance reasons. If you have a nullable nonterminal NullableNt you can expand it in order to see if it can consume anything, or it may be faster to check if FIRST(NullableNt) contains the next token and if not simply ignore it (backtracking vs predictive parsing). Another performance improvement would be to additionally provide the lexical scanner with the current FIRST set, so the scanner does not try all possible terminals but only those that are currently allowed by the context. This conflicts with reserved terminals but those are not always needed.
Bottom-up parsers have different kinds of conflicts, namely reduce/reduce and shift/reduce. They detect conflicts with item sets rather than by comparing FIRST and FOLLOW sets directly (although SLR parsers do consult FOLLOW sets, and LR(1)/LALR lookaheads are computed using FIRST).
Your grammar wouldn't work with LL parsers because it contains left recursion. But the FIRST sets for E, T and V would all be {id} (assuming your T := T*V | T is meant to be T := T*V | V).

Answer:
E -> E+T | T
has left recursion; rewrite it as
E -> TE'
E' -> +TE' | epsilon
T -> T*V | V
has left recursion; rewrite it as
T -> VT'
T' -> *VT' | epsilon
There is no left recursion in
V -> id
Therefore the grammar is:
E -> TE'
E' -> +TE' | epsilon
T -> VT'
T' -> *VT' | epsilon
V -> id
FIRST(E) = {id}
FIRST(E') = {+, epsilon}
FIRST(T) = {id}
FIRST(T') = {*, epsilon}
FIRST(V) = {id}
E is the start symbol: FOLLOW(E) = {$}
E -> TE', E' -> +TE' | epsilon: FOLLOW(E') = FOLLOW(E) = {$}
E -> TE', E' -> +TE' | epsilon: FOLLOW(T) = (FIRST(E') - {epsilon}) ∪ FOLLOW(E) = {+, $}
T -> VT', T' -> *VT' | epsilon: FOLLOW(T') = FOLLOW(T) = {+, $}
T -> VT', T' -> *VT' | epsilon: FOLLOW(V) = (FIRST(T') - {epsilon}) ∪ FOLLOW(T) = {*, +, $}
Rules for First Sets
If X is a terminal then First(X) is just {X}.
If there is a Production X → ε then add ε to first(X)
If there is a Production X → Y1Y2..Yk then add first(Y1Y2..Yk) to first(X)
First(Y1Y2..Yk) is either
First(Y1) (if First(Y1) doesn't contain ε)
OR (if First(Y1) does contain ε) then First (Y1Y2..Yk) is everything in First(Y1) except for ε as well as everything in First(Y2..Yk)
If First(Y1) First(Y2)..First(Yk) all contain ε then add ε to First(Y1Y2..Yk) as well.
Rules for Follow Sets
First put $ (the end of input marker) in Follow(S) (S is the start symbol)
If there is a production A → aBb (where a and b can each be a whole string of symbols), then everything in FIRST(b) except for ε is placed in FOLLOW(B).
If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B)
If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
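The rules above can be turned into a fixed-point computation fairly directly. Below is a minimal sketch in Python, applied to the left-recursion-free expression grammar with V → id as in the question. The dictionary representation of the grammar, the function names, and the strings "ε" and "$" are all choices made for this illustration.

```python
EPS, END = "ε", "$"

# Each alternative is a tuple of symbols; the empty tuple () is the ε-production.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("V", "T'")],
    "T'": [("*", "V", "T'"), ()],
    "V":  [("id",)],
}

def first_of_string(symbols, first):
    """FIRST(Y1 Y2 .. Yk), using the FIRST sets computed so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: FIRST is {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every Yi was nullable, or the string was empty
    return result

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)  # $ goes into FOLLOW of the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue  # only nonterminals have FOLLOW sets
                    rest = first_of_string(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # nothing, or only nullable symbols, after sym
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
for nt in GRAMMAR:
    print(nt, "FIRST =", sorted(FIRST[nt]), "FOLLOW =", sorted(FOLLOW[nt]))
```

Both loops only ever add elements to the sets, so they stop once a full pass over the grammar adds nothing new.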

Wikipedia is your friend. See discussion of LL parsers and first/follow sets.
Fundamentally they are used as the basis for parser construction, e.g., as part of parser generators. You can also use them to reason about properties of grammars, but most people don't have much of a need to do this.

Related

How to calculate Follow set of a production rule that has only ONE symbol on the right side

So I have this grammar:
S -> (D)
D -> EF
E -> a|b|S
F -> *D | +D | ε
First of all, the book's solution applies the rule "for P -> pBq, FIRST(q) - {ε} is a subset of FOLLOW(B)" to the production D -> EF. But that production has only two symbols on the right-hand side; do we assume an ε in front of E (ε being the p in pBq)?
And secondly, I can't understand how to calculate FOLLOW(E).
FOLLOW(E) consists of every terminal symbol which can immediately follow E in some derivation step. That's the precise definition; it's not very complicated.
For a simple grammar, you should be able to figure out all the FOLLOW sets just by looking at the grammar and applying a little bit of common sense. It would probably be a good idea to do that, since it will give you a better idea of how the algorithm works.
As a side note, it's maybe worth mentioning that ε is not a thing. Or at least, it's not a grammar symbol. It's one of several conventions used to make the empty sequence visible, just like 0 is a way to make nothing visible. Sometimes that's useful, but it's important to not let it confuse you. (Abuse of notation is endemic in mathematics, which can be frustrating.)
So, what can follow E? E only appears in one place on the right-hand side of that grammar, in the production D → E F. So clearly any symbol which can be the first symbol of F must be in FOLLOW(E). The symbols which could be at the start of F are + and *, since as mentioned, ε is not a grammar symbol. (Many definitions of FIRST allow ε to be a member of that set, along with any actual terminal symbol. That's an example of the abuse of notation I was talking about in the previous paragraph, since it makes it look like ε is a terminal symbol. But it isn't. It's nothing.)
F is what we call a "nullable" non-terminal, because it can derive the empty sequence (which was written as ε so that you can see it). In other words, it's possible for F to disappear completely in a derivation step. And if it does disappear, then E ends up at the end of the production D → E F. If E is at the end of D, then it can be followed by anything which could follow D, which includes ). D can also appear at the end of a derivation of F, which means that D could be followed by anything which could follow F; but F itself only appears at the end of D, so that is just whatever can follow D, a tautology which adds no information whatsoever.
So it's easy to see that FOLLOW(E) = {*, +, )}, and you can use that to check your understanding of any algorithm to compute follow sets.
Now, I don't know what book you are referring to (and it would have been courteous to mention that in your original question; sources should always be correctly cited). But the book I happen to have in front of me, the Dragon Book, has a pretty similar algorithm. The Dragon Book uses a simple convention for writing statements like that. Probably your book does, too, but it might not be the same convention. You should check what it says and make sure that you typed the copied statement correctly, respecting whatever formatting is used to indicate what the symbols stand for.
In the Dragon book, some of the conventions include:
Lower case characters at the start of the alphabet (a, b, c, …) are terminals (as well as actual symbols like * and +).
Upper case characters at the start of the alphabet (A, B, C, …) are non-terminals.
S is the start symbol.
Upper case characters at the end of the alphabet (X, Y, Z) stand for arbitrary grammar symbols (either terminals or non-terminals).
$ is the marker used to indicate the end of the input.
Lower-case Greek letters (α, β, γ, …) are possibly-empty strings of grammar symbols.
The phrase "possibly empty" is very important, so I'm repeating it.
With that convention, they write the rules for computing the FOLLOW set:
1. Place $ in FOLLOW(S).
2. For every production A → αBβ, copy everything from FIRST(β) except ε into FOLLOW(B).
3. If there is a production A → αB or a production A → αBβ where FIRST(β) contains ε, place everything in FOLLOW(A) into FOLLOW(B).
As mentioned above, α is a possibly-empty string of grammar symbols. So it might not be visible.
Keep doing steps 2 and 3 until no new symbols are added to any follow set.
I'm pretty sure that the algorithm in your book differs only in notation conventions.
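If you want to check a hand calculation, the three steps can be sketched in a few lines of Python for the grammar in the question. The FIRST sets are written in by hand here (they are my own working for this small grammar), and the data layout and function names are assumptions made for the example. The empty list [] plays the role of the ε-production, and an empty β falls out naturally as an empty slice.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "S": [["(", "D", ")"]],
    "D": [["E", "F"]],
    "E": [["a"], ["b"], ["S"]],
    "F": [["*", "D"], ["+", "D"], []],  # [] is the empty production
}

# FIRST sets worked out by hand for this small grammar.
FIRST = {
    "S": {"("},
    "D": {"a", "b", "("},
    "E": {"a", "b", "("},
    "F": {"*", "+", EPS},
}

def first_of(beta):
    """FIRST of a possibly-empty string of grammar symbols β."""
    out = set()
    for sym in beta:
        f = FIRST.get(sym, {sym})  # a terminal's FIRST set is just itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)  # β is empty or entirely nullable
    return out

def follow_sets(start="S"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)                    # step 1
    changed = True
    while changed:                            # keep doing steps 2 and 3
        changed = False
        for a, productions in GRAMMAR.items():
            for rhs in productions:
                for i, b in enumerate(rhs):
                    if b not in GRAMMAR:
                        continue              # FOLLOW is only kept for non-terminals
                    beta = first_of(rhs[i + 1:])
                    add = beta - {EPS}        # step 2
                    if EPS in beta:           # step 3: β empty or nullable
                        add |= follow[a]
                    if not add <= follow[b]:
                        follow[b] |= add
                        changed = True
    return follow

for nt, terminals in follow_sets().items():
    print(f"FOLLOW({nt}) =", sorted(terminals))
```

For this grammar the loop settles on {*, +, )} for E, which matches the reasoning by hand above.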

Why do we need FOLLOW set in LL(1) grammar parser?

In the generated parsing function we use an algorithm which peeks at the token list and chooses a rule (alternative) based on the FIRST set of the current non-terminal. If it contains epsilon (the rule is nullable), the FOLLOW set is checked as well.
Consider the following grammar [not LL(1)]:
B : A term
A : N1 | N2
N1 :
N2 :
During calculation of the FOLLOW sets, the terminal term will be propagated from A to both N1 and N2, so the FOLLOW set won't help us decide.
On the other hand, if there is exactly one nullable alternative, we know for sure how to continue execution, even when the current token doesn't match anything in the FIRST set (by choosing the epsilon production).
If the above statements are true, the FOLLOW set is redundant. Is it needed only for error handling?
Yes, it is not necessary.
I was asked precisely this question at the colloquium, and my answer, that the FOLLOW set is used
to check that the grammar is LL(1)
to fail immediately when an error occurs, instead of dragging the ill-formatted token to some later production, where the generated failure message may be unclear
and for nothing else,
was accepted.
While you can certainly find grammars for which FOLLOW is unnecessary (i.e., it doesn't play a role in the calculation of the parsing table), in general it is necessary.
For example, consider the grammar
S : A | C
A : B a
B : b | epsilon
C : D c
D : d | epsilon
You need to know that
Follow(B) = {a}
Follow(D) = {c}
to calculate
First(A) = {b, a}
First(C) = {d, c}
in order to make the correct choice at S.
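To see where FOLLOW actually enters the parsing table for this grammar, here is a small table-driven sketch. The FIRST set of each right-hand side and the FOLLOW sets are written in by hand; FOLLOW(S), FOLLOW(A) and FOLLOW(C) = {$} are my own working rather than something stated above, and the names `PRODUCTIONS`, `build_table` and `parse` are made up for the example.

```python
EPS = "ε"

# (left-hand side, right-hand side, FIRST of the right-hand side)
PRODUCTIONS = [
    ("S", ["A"],      {"b", "a"}),
    ("S", ["C"],      {"d", "c"}),
    ("A", ["B", "a"], {"b", "a"}),
    ("B", ["b"],      {"b"}),
    ("B", [],         {EPS}),
    ("C", ["D", "c"], {"d", "c"}),
    ("D", ["d"],      {"d"}),
    ("D", [],         {EPS}),
]
FOLLOW = {"S": {"$"}, "A": {"$"}, "B": {"a"}, "C": {"$"}, "D": {"c"}}

def build_table():
    """Map (nonterminal, lookahead token) to the right-hand side to expand."""
    table = {}
    for lhs, rhs, first in PRODUCTIONS:
        lookaheads = first - {EPS}
        if EPS in first:
            lookaheads |= FOLLOW[lhs]  # the only place FOLLOW is consulted
        for tok in lookaheads:
            if (lhs, tok) in table:
                raise ValueError(f"LL(1) conflict at ({lhs}, {tok})")
            table[(lhs, tok)] = rhs
    return table

def parse(tokens):
    """Return True if the token list is accepted by the table-driven parser."""
    table = build_table()
    nonterminals = {lhs for lhs, _, _ in PRODUCTIONS}
    tokens = list(tokens) + ["$"]
    stack = ["$", "S"]
    pos = 0
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in nonterminals:
            rhs = table.get((top, tok))
            if rhs is None:          # note: [] is a valid (empty) expansion
                return False
            stack.extend(reversed(rhs))
        elif top == tok:
            pos += 1
        else:
            return False
    return pos == len(tokens)

print(build_table()[("B", "a")])  # the entry that exists only because a is in FOLLOW(B)
print(parse(["a"]), parse(["d", "c"]), parse(["b"]))
```

Without the FOLLOW lookup, the entries for B on a and for D on c would be missing, and the inputs "a" and "c" would be rejected.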

Is it possible for a FIRST set to contain the same terminal more than once?

I am confused about whether a FIRST set can contain the same terminal twice.
for example I have grammar
E->T+E|T FIRST(E)={a,a}
T->a FIRST(T)={a}
..
Is this correct, or should I write
FIRST(E)={a}
By definition sets can not contain the same element multiple times - this applies to first sets as much as any other set. So {a} is the proper way to write it.
I guess you're trying to compute the First and Follow sets, to construct the final predictive table, but generally, you need to resolve all the conflicts first, which are:
ε-derivation
Direct Left Recursion
Indirect Left Recursion
Ambiguous prefixes
In your example (Or part of it, I guess), you need to factor out ambiguous prefixes, the T.
E -> T E'
E' -> + E | ε
T -> a
Formally, for any non-terminal with derivation rules of the form A → αβ | αγ
1- Remove these 2 derivation rules
2- Create a rule A′ → β | γ
3- Create a rule A → α A′
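The three steps can be sketched as a small Python function. This is only an illustration under some assumptions: alternatives are lists of symbols, the empty list stands for ε, alternatives are grouped by their first symbol, and the new non-terminal is named by appending primes (which could clash with an existing name in a real grammar).

```python
def common_prefix(seqs):
    """Longest prefix shared by all the symbol lists in seqs."""
    prefix = []
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def left_factor(name, alternatives):
    """One left-factoring pass over the alternatives of non-terminal `name`.
    Returns a dict: non-terminal -> list of alternatives (lists of symbols)."""
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None      # group by the first symbol
        groups.setdefault(key, []).append(alt)
    rules = {name: []}
    for key, group in groups.items():
        if key is None or len(group) == 1:
            rules[name].extend(group)      # nothing to factor out
            continue
        prefix = common_prefix(group)      # this is α
        new_name = name + "'" * len(rules) # A', A'', ... (assumed naming scheme)
        rules[name].append(prefix + [new_name])                 # A  → α A′
        rules[new_name] = [alt[len(prefix):] for alt in group]  # A′ → β | γ
    return rules

# E -> T + E | T
print(left_factor("E", [["T", "+", "E"], ["T"]]))
```

A single pass may leave common prefixes inside the new non-terminal's alternatives, so in general the function would be applied again to each new rule until nothing changes.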
Check out this Paper about Conflicts, it was very helpful for me, and you might as well check this slide and this, if you have any problem with top-down parsing.

Why is this grammar LL(1) even though all the FIRST sets are the same?

Consider the following CFG:
S := AbC
A := aA | epsilon
C := Ac
Here, FIRST(A) = FIRST(B) = FIRST(C) = {a, ε}, so all the FIRST sets are the same. However, this grammar is supposedly LL(1). How is that possible? Wouldn't that mean that there would be a bunch of FIRST/FIRST conflicts everywhere?
There's nothing fundamentally wrong about having multiple nonterminals that have the same FIRST sets. Things only become problematic if you have multiple nonterminals with overlapping FIRST or FOLLOW sets in a context where you have to choose between a number of production options.
As an example, consider this simple grammar:
A → bB | cC
B → b | c
C → b | c
Notice that all three of A, B, and C have the same FIRST set, namely {b, c}. But this grammar is also LL(1). While you can formally convince yourself of this by writing out the actual LL(1) parsing table, you can think of this in another way as well. Imagine you're reading the nonterminal A, and you see the character b. Which production should you pick: A → bB, or A → cC? Well, there's no reason to pick A → cC, because that would put c at the front of your string. So don't pick that one. Instead, pick A → bB. Similarly, suppose you're reading an A and you see the character c. Which production should you pick? You'd never want to pick A → bB, since that would put b at the front of your string. Instead, you'd pick A → cC.
Notice that in this discussion, we never stopped to think about what FIRST(B) or FIRST(C) was. It simply didn't come up because we never needed to know what characters could be produced there.
Now, let's look at your example. If you're trying to expand an S, there's only one possible production to apply, which is S → AbC. So there's no possible conflict here; when you see S, you always apply that rule. Similarly, if you're trying to expand a C, there's no choice of what to do. You have to expand C → Ac.
So now let's think about the nonterminal A, where there really is a choice of what to do next. If you see the character a, then we have to decide - do we expand out A → aA, or do we expand out A → ε? In answering that question, we have to think about the FOLLOW set of A, since the production A → ε would only make sense to pick if we saw a terminal symbol where we basically just want to get A out of the way. Here, FOLLOW(A) = {b, c}, with the b coming from the production S → AbC and the c coming from the production C → Ac. So we'd only pick A → ε if we see b or c, not if we see a. That means that
on reading a, we pick A → aA, and
on reading b or c, we pick A → ε.
Notice that in this discussion we never needed to think about what FIRST(B) or FIRST(C) was. In fact, we never even needed to look at what FIRST(A) was either! So that's why there isn't necessarily a conflict. Were we to encounter a scenario where we needed to compare FIRST(A) against FIRST(B) or something like that, then yes, we'd definitely have a conflict. But that never came up, so no conflict exists.
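Here is how that decision might look in a hand-written recursive-descent parser for the grammar in the question, treating each character as a token. The function names and the "$" end marker are choices made for this sketch. The only place with a real choice is `parse_A`, and it looks at a single character of lookahead.

```python
class ParseError(Exception):
    pass

def parse(text):
    """Return True if text can be derived from S, using one character of lookahead."""
    tokens = list(text) + ["$"]   # "$" marks the end of the input
    pos = 0

    def peek():
        return tokens[pos]

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise ParseError(f"expected {ch!r} at position {pos}, found {peek()!r}")
        pos += 1

    def parse_A():
        if peek() == "a":             # a starts the alternative aA: pick A → aA
            expect("a")
            parse_A()
        elif peek() in ("b", "c"):    # b and c are in FOLLOW(A): pick A → ε
            return
        else:
            raise ParseError(f"unexpected {peek()!r} at position {pos}")

    def parse_C():                    # C → Ac, no choice to make
        parse_A()
        expect("c")

    def parse_S():                    # S → AbC, no choice to make
        parse_A()
        expect("b")
        parse_C()

    try:
        parse_S()
        expect("$")
        return True
    except ParseError:
        return False

for s in ["bc", "aabac", "ac", "b"]:
    print(repr(s), parse(s))
```

Neither `parse_S` nor `parse_C` ever inspects a FIRST set, which is why the identical FIRST sets cause no trouble.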

Difference between Left Factoring and Left Recursion

What is the difference between Left Factoring and Left Recursion ? I understand that Left factoring is a predictive top down parsing technique. But I get confused when I hear these two terms.
Left factoring is removing the common left factor that appears in two productions of the same non-terminal. It is done to avoid backtracking by the parser. Suppose the parser has one token of look-ahead, and consider this example:
A -> qB | qC
where A, B and C are non-terminals and q is a sentence.
In this case, the parser will be confused as to which of the two productions to choose, and it might have to backtrack. After left factoring, the grammar is converted to:
A -> qD
D -> B | C
In this case, a parser with a look-ahead will always choose the right production.
Left recursion is the case where the left-most symbol in a production of a non-terminal is the non-terminal itself (direct left recursion), or where it rewrites to that non-terminal again through other non-terminal definitions (indirect left recursion).
Consider these examples:
(1) A -> Aq (direct)
(2) A -> Bq
B -> Ar (indirect)
Left recursion has to be removed if the parser performs top-down parsing.
Left Factoring is a grammar transformation technique. It consists in "factoring out" prefixes which are common to two or more productions.
For example, going from:
A → α β | α γ
to:
A → α A'
A' → β | γ
Left Recursion is a property a grammar has whenever you can derive from a given variable (non terminal) a rhs that begins with the same variable, in one or more steps.
For example:
A → A α
or
A → B α
B → A γ
There is a grammar transformation technique called Elimination of left recursion, which provides a method to generate, given a left recursive grammar, another grammar that is equivalent and is not left recursive.
The relationship/confusion between both terms probably derives from the fact that both transformation techniques may need to be applied to a grammar before being able to derive a predictive top down parser for it.
This is the way I've seen the two terms used:
Left recursion: when one or more productions can be reached from themselves with no tokens consumed in-between.
Left factoring: a process of transformation, turning the grammar from a left-recursive form to an equivalent non-left-recursive form.
left factor :
Let the given grammar :
A-->ab1 | ab2 | ab3
1) We can see that every production has a common prefix, so whichever production we choose, there is no guarantee that we will not need to backtrack.
2) It is non-deterministic, because we cannot choose a production and be assured that it leads to the desired string and the correct parse tree.
But if we rewrite the grammar so that it is deterministic and still generates exactly the same strings without backtracking, it becomes:
A --> aA',
A' --> b1 | b2| b3
Now, if we are asked to build the parse tree for the string ab2, we don't need backtracking, because we can always choose the correct production when we reach A', and so we generate the correct parse tree.
Left recursion :
A --> Aa | b
Here it is clear that the left child of A will always be A if we choose the first production. This is left recursion, because A keeps calling itself over and over again.
The language generated by this grammar is:
ba*
Since a top-down parser cannot handle this form, we eliminate the left recursion by writing:
A --> bA'
A' --> ε | aA'
Now we have no left recursion, and we can still generate ba*.
Left Recursion:
A grammar is left recursive if it has a nonterminal A such that there is a derivation A -> Aα | β where α and β are sequences of terminals and nonterminals that do not start with A.
When designing a top-down parser, if left recursion exists in the grammar then the parser falls into an infinite loop, because to match A it first tries to match A itself without consuming any input.
We can eliminate the above left recursion by rewriting the offending production as:
A -> βA'
A' -> αA' | epsilon
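The rewrite above is mechanical enough to sketch in Python. This handles direct left recursion only, and assumes that alternatives are lists of symbols, that the empty list stands for epsilon, that every α is non-empty, and that naming the new non-terminal by appending a prime does not clash with an existing name.

```python
def eliminate_direct_left_recursion(name, alternatives):
    """Rewrite  A -> A α1 | ... | β1 | ...  as
                A  -> β1 A' | ...
                A' -> α1 A' | ... | ε
    Returns a dict: non-terminal -> list of alternatives (lists of symbols)."""
    # The α parts: what follows A in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # The β parts: alternatives that do not start with A.
    betas = [alt for alt in alternatives if not alt or alt[0] != name]
    if not alphas:
        return {name: alternatives}   # no direct left recursion, nothing to do
    new_name = name + "'"
    return {
        name: [beta + [new_name] for beta in betas],
        new_name: [alpha + [new_name] for alpha in alphas] + [[]],  # [] is ε
    }

# E -> E + T | T
print(eliminate_direct_left_recursion("E", [["E", "+", "T"], ["T"]]))
# A -> A a | b
print(eliminate_direct_left_recursion("A", [["A", "a"], ["b"]]))
```

Indirect left recursion needs an extra substitution step first and is not covered by this sketch.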
Left Factoring: Left factoring is required to eliminate non-determinism from a grammar. Suppose we have the grammar S -> abS | aSb
Here, both alternatives for S begin with the same terminal a, which makes the choice between them non-deterministic. We can rewrite the productions to defer the decision:
S -> aS'
S' -> bS | Sb
Thus, S' can be replaced by bS or Sb.
Here is a simple way to differentiate between both terms:
Left Recursion:
When the leftmost element of a production's right-hand side is the producing non-terminal itself.
e.g. A -> Aα / Aβ
Left Factoring:
When the same leftmost element (a common prefix) is repeated across alternatives of the same non-terminal.
e.g. A -> αB / αC
Furthermore,
If a grammar is left recursive, it might result in an infinite loop, hence we need to eliminate left recursion.
If a grammar needs left factoring (its alternatives share a common prefix), it confuses the parser, hence we need to left-factor it as well.
Left recursion: when the left-hand non-terminal is the same as the leftmost right-hand non-terminal.
Example:
A -> Aα | B
We can remove the left recursion by rewriting this production as:
A -> BA'
A' -> αA' | ε
Left factoring means making sure the productions are not non-deterministic.
Example:
A -> αA | αB | αC

Resources