Computing the FOLLOW sets - parsing

My task is to calculate FIRST and FOLLOW sets for the following grammar:
P ::= S CS .
S ::= ( int , int )
CS ::= C CS | epsilon
C ::= left int | right int | low C
I got the following first sets:
FIRST(S) = {'('}
FIRST(C) = {left,right,low}
FIRST(CS) = {left,right,low,epsilon}
FIRST(P) = FIRST(S) = {'('}
For the FOLLOW sets I calculated:
FOLLOW(P) = $ (or empty)
FOLLOW(C) = {left,right,low,'.'}
FOLLOW(CS) = {'.'}
FOLLOW(S) = {left,right,low}
I checked my solution with a FIRST and FOLLOW set generator, and for FOLLOW(S) it gave FOLLOW(S) = {'.', left, right, low}. Is the generator's solution correct, and why? I calculated mine using the formula FOLLOW(S) = FIRST({left,right,low} concat FOLLOW(P)) = {left, right, low}. Can someone please explain to me why my solution (or the generator's) is not correct, and check whether I got everything else right? I also want to know why int does not appear in any FIRST or FOLLOW set, and whether that will cause problems when I build the parser later. Thank you.

When you compute FOLLOW sets you have to be careful with empty productions.
In this case, CS has an empty production, which means that S might be followed directly by the '.' token in P → S CS '.'. Similarly, the C in C CS might be the last symbol actually derived from that production, so C could also be followed by a '.'. The generator's FOLLOW(S) = {'.', left, right, low} is therefore correct.
int only ever appears immediately after a terminal (left, right, '(' or ','). It can never appear at the beginning of a non-terminal nor immediately following a non-terminal. So it is entirely expected that it is not in any FIRST or FOLLOW set.
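The rule about empty productions can be checked mechanically. The following is a minimal sketch, not the generator's actual code: it computes FIRST and FOLLOW for the grammar in the question by iterating to a fixed point. The names GRAMMAR, first_of, compute_first and compute_follow are illustrative assumptions, as is the choice of "epsilon" and "$" as markers.

```python
EPS = "epsilon"   # marker for the empty string (assumed name)
END = "$"         # end-of-input marker (assumed name)

# The grammar from the question; each alternative is a list of symbols.
GRAMMAR = {
    "P":  [["S", "CS", "."]],
    "S":  [["(", "int", ",", "int", ")"]],
    "CS": [["C", "CS"], []],            # [] is the epsilon alternative
    "C":  [["left", "int"], ["right", "int"], ["low", "C"]],
}

def first_of(symbols, first):
    """FIRST of a symbol sequence, given the FIRST sets of the nonterminals."""
    out = set()
    for sym in symbols:
        if sym not in GRAMMAR:          # a terminal starts the sequence
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:       # sym is not nullable: stop here
            return out
    out.add(EPS)                        # every symbol was nullable
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                      # repeat until nothing changes
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                new = first_of(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first, start="P"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue        # FOLLOW is only defined for nonterminals
                    rest = first_of(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:     # the rest can vanish: inherit FOLLOW(nt)
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FOLLOW["S"]))
```

For S, the symbols after it in P → S CS . are CS and '.', and because CS is nullable first_of keeps going past it and picks up '.', which is exactly the element missing from the hand calculation.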

Related

Why do we need FOLLOW set in LL(1) grammar parser?

In a generated parsing function we use an algorithm which peeks at the next token in the token list and chooses a rule (alternative) based on the FIRST set of the current non-terminal. If that set contains epsilon (the rule is nullable), the FOLLOW set is checked as well.
Consider the following grammar (not LL(1)):
B : A term
A : N1 | N2
N1 :
N2 :
During calculation of the FOLLOW sets, the terminal term will be propagated from A to both N1 and N2, so the FOLLOW sets won't help us decide.
On the other hand, if there is exactly one nullable alternative, we know for sure how to continue, even when the current token doesn't match anything in the FIRST set (by choosing the epsilon production).
If the above statements are true, the FOLLOW set is redundant. Is it needed only for error handling?
Yes, it is not necessary.
I was asked precisely this question at a colloquium, and my answer was accepted: the FOLLOW set is used
to check that the grammar is LL(1),
to fail immediately when an error occurs, instead of dragging the ill-formed token along to some later production, where the generated error message may be unclear,
and for nothing else.
While you can certainly find grammars for which FOLLOW is unnecessary (i.e., it doesn't play a role in the calculation of the parsing table), in general it is necessary.
For example, consider the grammar
S : A | C
A : B a
B : b | epsilon
C : D c
D : d | epsilon
You need to know that
Follow(B) = {a}
Follow(D) = {c}
to calculate
First(A) = {b, a}
First(C) = {d, c}
in order to make the correct choice at S.
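Both grammars in this thread can be compared with a small experiment. The sketch below is an illustration under assumed names (first_follow, predict_sets, ll1_conflicts); it is not taken from any parser generator. It computes, for every alternative, the set of lookahead tokens that select it: FIRST of the right-hand side, plus FOLLOW of the left-hand side when the right-hand side is nullable. Two alternatives of the same non-terminal with overlapping sets mean the grammar is not LL(1).

```python
EPS, END = "epsilon", "$"   # assumed marker names

def first_of(symbols, grammar, first):
    """FIRST of a symbol sequence."""
    out = set()
    for sym in symbols:
        if sym not in grammar:              # terminal
            return out | {sym}
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    return out | {EPS}                      # whole sequence is nullable

def first_follow(grammar, start):
    """Compute FIRST and FOLLOW together by fixed-point iteration."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                updates = [(first[nt], first_of(alt, grammar, first))]
                for i, sym in enumerate(alt):
                    if sym in grammar:
                        rest = first_of(alt[i + 1:], grammar, first)
                        new = rest - {EPS}
                        if EPS in rest:     # nothing visible after sym
                            new = new | follow[nt]
                        updates.append((follow[sym], new))
                for target, new in updates:
                    if not new <= target:
                        target |= new
                        changed = True
    return first, follow

def predict_sets(grammar, start):
    """Map (nonterminal, alternative) to the lookahead tokens selecting it."""
    first, follow = first_follow(grammar, start)
    table = {}
    for nt, alts in grammar.items():
        for alt in alts:
            f = first_of(alt, grammar, first)
            table[(nt, tuple(alt))] = (f - {EPS}) | (follow[nt] if EPS in f else set())
    return table

def ll1_conflicts(grammar, start):
    """List (nonterminal, overlapping tokens) for every clashing pair of alternatives."""
    table = predict_sets(grammar, start)
    conflicts = []
    for nt, alts in grammar.items():
        for i in range(len(alts)):
            for j in range(i + 1, len(alts)):
                overlap = table[(nt, tuple(alts[i]))] & table[(nt, tuple(alts[j]))]
                if overlap:
                    conflicts.append((nt, overlap))
    return conflicts

G1 = {"B": [["A", "term"]], "A": [["N1"], ["N2"]], "N1": [[]], "N2": [[]]}
G2 = {"S": [["A"], ["C"]], "A": [["B", "a"]], "B": [["b"], []],
      "C": [["D", "c"]], "D": [["d"], []]}

print(ll1_conflicts(G1, "B"))
print(ll1_conflicts(G2, "S"))
```

In G1 both alternatives of A are selected by term, which is the conflict described in the question. In G2 the epsilon alternative of B is selected by FOLLOW(B) = {a} and that of D by FOLLOW(D) = {c}, so the table has no clashes.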

Ambiguity when Parsing Preceding List

While writing parser code in Menhir, I keep coming across this design pattern that is becoming very frustrating. I'm trying to build a parser that accepts either "a*ba" or "bb". To do this, I'm using the following syntax (note that A* is the same as list(A)):
exp:
| A*; B; A; {1}
| B; B; {2}
However, this code fails to parse the string "ba". The menhir compiler also indicates that there are shift-reduce conflicts in the parser, specifically as follows:
** In state 0, looking ahead at B, shifting is permitted
** because of the following sub-derivation:
. B B
** In state 0, looking ahead at B, reducing production
** list(A) ->
** is permitted because of the following sub-derivation:
list(A) B A // lookahead token appears
So | B; B requires a shift, while | A*; B; A requires a reduce when the first token is B. I can resolve this ambiguity manually and get the expected behavior by changing the expression to read as follows (note that A+ is the same as nonempty_list(A)):
exp2:
| B; A; {1}
| A+; B; A; {1}
| B; B; {2}
In my mind, exp and exp2 read the same, but are clearly treated differently. Is there a way to write exp to do what I want without code duplication (which can cause other problems)? Is this a design pattern I should be avoiding entirely?
exp and exp2 parse the same language, but they're definitely not the same grammar. exp requires a two-symbol lookahead to correctly parse a sentence starting with B, for precisely the reason you've noted: the parser can't decide whether to insert an empty A* into the parse before it sees the symbol after the B, but it needs to do that insertion before it can handle the B. By contrast, exp2 does not need an empty production to create an empty list of As before B A, so no decision is necessary.
You don't need a list to produce this conflict. Replacing A* with A? would produce exactly the same conflict.
You've already found the usual solution to this shift-reduce conflict for LALR(1) parser generators: a little bit of redundancy. As you note, though, that solution is not ideal.
Another common solution (but maybe not for menhir) involves using a right-recursive definition of a terminated list:
prefix:
| B;
| A; prefix;
exp:
| prefix; A; { 1 }
| B; B; { 2 }
As far as I know, menhir's standard library doesn't include a terminated list macro, but it would be easy enough to write. It might look something like this:
%public terminated_list(X, Y):
  | y = Y;
    { ( [], y ) }
  | x = X; xsy = terminated_list(X, Y);
    { ( x :: (fst xsy), (snd xsy) ) }
There's probably a more idiomatic way to write that; I don't pretend to be an OCaml coder.
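The idea behind the terminated list, consuming the B before committing to either alternative, can also be seen in a hand-written recognizer. This is only an analogue in Python, not Menhir output; terminated_list and parse_exp are hypothetical helper names, and tokens are assumed to be the plain strings "A" and "B".

```python
def terminated_list(tokens, pos, item, terminator):
    """Parse item* terminator starting at pos.

    Returns ((items, terminator_token), new_pos), mirroring the
    (list, terminator) pair built by the grammar rule above.
    """
    items = []
    while pos < len(tokens) and tokens[pos] == item:
        items.append(tokens[pos])
        pos += 1
    if pos < len(tokens) and tokens[pos] == terminator:
        return (items, tokens[pos]), pos + 1
    raise ValueError(f"expected {terminator!r} at position {pos}")

def parse_exp(tokens):
    """Return 1 for A* B A and 2 for B B; raise ValueError otherwise."""
    # No decision is needed before the first B: it is simply consumed.
    (items, _), pos = terminated_list(tokens, 0, "A", "B")
    rest = tokens[pos:]
    if rest == ["A"]:
        return 1
    if rest == ["B"] and not items:     # B B is only valid with no leading As
        return 2
    raise ValueError(f"unexpected input after position {pos}")

print(parse_exp(["B", "A"]), parse_exp(["A", "A", "B", "A"]), parse_exp(["B", "B"]))
```

The choice between the two results is made only after the B has been read, which is why no empty list has to be inserted up front.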

Top down parsing - Compute FIRST and FOLLOW

Given the following grammar:
S -> S + S | S S | (S) | S* | a
S -> S S + | S S * | a
For the life of me I can't seem to figure out how to compute the FIRST and FOLLOW for the above grammar. The recursive non-terminal of S confuses me. Does that mean I have to factor out the grammar first before computing the FIRST and FOLLOW?
The general rule for computing FIRST sets in CFGs without ε productions is the following:
Initialize FIRST(A) as follows: for each production A → tω, where t is a terminal, add t to FIRST(A).
Repeatedly apply the following until nothing changes: for each production of the form A → Bω, where B is a nonterminal, set FIRST(A) = FIRST(A) ∪ FIRST(B).
We could follow the above rules as written, but there's something interesting here we can notice. Your grammar only has a single nonterminal, so that second rule - which imports elements into the FIRST set of one nonterminal from FIRST sets from another nonterminal - won't actually do anything. In other words, we can compute the FIRST set just by applying that initial rule. And that's not too bad here - we just look at all the productions that start with a terminal and get FIRST(S) = { a, ( }.
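The two rules can be written down directly. Below is a minimal sketch for grammars without ε productions; the function name first_sets_no_epsilon and the dictionary encoding of the grammars are assumptions made for illustration.

```python
def first_sets_no_epsilon(grammar):
    """FIRST sets for a grammar with no epsilon productions.

    grammar maps each nonterminal to a list of alternatives,
    each alternative being a non-empty list of symbols.
    """
    first = {nt: set() for nt in grammar}
    # Rule 1: productions A -> t w, with t a terminal, contribute t.
    for nt, alts in grammar.items():
        for alt in alts:
            if alt[0] not in grammar:
                first[nt].add(alt[0])
    # Rule 2: for A -> B w, import FIRST(B) into FIRST(A) until nothing changes.
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                head = alt[0]
                if head in grammar and not first[head] <= first[nt]:
                    first[nt] |= first[head]
                    changed = True
    return first

infix = {"S": [["S", "+", "S"], ["S", "S"], ["(", "S", ")"], ["S", "*"], ["a"]]}
postfix = {"S": [["S", "S", "+"], ["S", "S", "*"], ["a"]]}

print(first_sets_no_epsilon(infix))
print(first_sets_no_epsilon(postfix))
```

With a single nonterminal the loop for rule 2 never finds anything new, so the result comes entirely from rule 1, as argued above.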

confusion in finding first and follow in left recursive grammar

Recently I ran into a problem finding FIRST and FOLLOW for this grammar:
S->cAd
A->Ab|a
Here I am confused about FIRST of A.
Which one is correct, {a} or {empty, a}, given that there is left recursion in A's production?
I am not sure whether to include the empty string in FIRST of A or not.
Any help would be appreciated.
-------------edited---------------
What will be the FIRST and FOLLOW of this one? It is the most confusing grammar I have ever seen.
S->SA|A
A->a
I need to prove that this grammar is not LL(1) using the parsing table, but I was unable to, because I did not get two entries in a single cell.
Firstly, you'll need to remove the left recursion, leading to
S -> cAd
A -> aA'
A' -> bA' | epsilon
Then, you can calculate
FIRST(A) = {a} // as a is the only terminal that can be derived first from A.
EDIT :-
For your second question,
S -> AS'
S' -> AS' | epsilon
A -> a
FIRST(A) = {a}
FIRST(S) = {a}
FIRST(S') = {a, epsilon}
The idea of removing left-recursion before calculating FIRST() and FOLLOW() can be learnt here.
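The transformation used in both cases is the standard one for immediate left recursion: A → Aα | β becomes A → βA', A' → αA' | epsilon. A minimal sketch follows; the function name and the list-based encoding are assumptions, and indirect left recursion is not handled.

```python
def remove_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon.

    alternatives is a list of symbol lists; [] stands for epsilon.
    Returns a dict with the rewritten productions.
    """
    # Tails (alpha) of the alternatives that start with nt itself.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    # The remaining alternatives (beta).
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}       # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }

print(remove_immediate_left_recursion("A", [["A", "b"], ["a"]]))
print(remove_immediate_left_recursion("S", [["S", "A"], ["A"]]))
```

Applied to A → Ab | a and to S → SA | A, this produces the same productions as written out by hand above.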

Explanation on this FIRST function

LL(1) Grammar:
(1) Var -> ID DimList
(2) DimList -> ε DimList'
(3) DimList' -> Dim DimList'
(4) DimList' -> ε
(5) Dim -> [ CONST ]
And, in the script that I am reading, it says that the function FIRST(ε DimList') gives {#, [}. But, how?
My guess is that since the right side of (2) begins with ε, it skips epsilon and takes FIRST(DimList') which is, considering (3) and (5), equal to {[}, BUT also, because of (4), takes FOLLOW(DimList') which is {#}.
The other possibility is that, since (2) begins with ε, it skips epsilon and takes FIRST(DimList') BUT ALSO takes FOLLOW(DimList) from (2)...
First one makes more sense to me, though I'm still in the process of learning basics of LL(1) grammars so I would appreciate if someone takes the time to make this clear, thank you.
EDIT: And, of course, it could be that neither of these is true.
The usual definition of the FIRST function would result in FIRST(Dimlist) (or, if you like, FIRST(ε Dimlist')) being {ε, [}. ε is in FIRST(ε Dimlist') because both ε and Dimlist' are nullable. [ is an element because it could be the first symbol in a derivation of ε Dimlist', which is the same as saying that it could be the first symbol in a derivation of Dimlist'.
Another way of saying this is that:
FIRST(ε Dimlist' #) = {#, [}
We usually then define the function PREDICT:
PREDICT(ω) = FIRST(ω FOLLOW(ω))
and we can see that
PREDICT(Dimlist) = FIRST(Dimlist FOLLOW(Dimlist)) = {#, [}
Here, FIRST(ω) is the set of strings of terminals (of length ≤ 1) which could appear at the beginning of a derivation of ω, while PREDICT(ω) is the set of strings of terminals (of length ≤ 1) which could be present in the input when a derivation of ω is possible.
It's not uncommon to confuse FIRST and PREDICT, but it's better to keep the difference straight.
Note that all of these functions can be generalized to strings of maximum length k, which are usually written FIRST_k, FOLLOW_k and PREDICT_k, and the definition of PREDICT_k is similar to the above:
PREDICT_k(ω) = FIRST_k(ω FOLLOW_k(ω))
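The difference between FIRST and PREDICT (for k = 1) can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the FIRST and FOLLOW tables are hard-coded from the values discussed in this thread rather than computed, # is taken to be the end marker that follows Dimlist, and first_of and predict are illustrative names.

```python
EPS = "ε"

# Assumed inputs, copied from the discussion above rather than computed.
FIRST = {"Dim": {"["}, "Dimlist'": {"[", EPS}, "Dimlist": {"[", EPS}}
FOLLOW = {"Dimlist": {"#"}, "Dimlist'": {"#"}}

def first_of(symbols):
    """FIRST of a symbol sequence; a terminal's FIRST set is itself."""
    out = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    return out | {EPS}                  # the whole sequence is nullable

def predict(nonterminal, rhs):
    """Lookahead tokens for which the production nonterminal -> rhs applies."""
    f = first_of(rhs)
    return (f - {EPS}) | (FOLLOW[nonterminal] if EPS in f else set())

# Production (2): Dimlist -> ε Dimlist'; the ε contributes no symbol.
print(first_of(["Dimlist'"]))
print(predict("Dimlist", ["Dimlist'"]))
```

first_of reports {ε, [} for the right-hand side of production (2), while predict replaces the ε by FOLLOW(Dimlist) and reports {#, [}, which is the set the script in the question calls FIRST.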

Resources