Explanation on this FIRST function - parsing

LL(1) Grammar:
(1) Var -> ID DimList
(2) DimList -> ε DimList'
(3) DimList' -> Dim DimList'
(4) DimList' -> ε
(5) Dim -> [ CONST ]
And, in the script that I am reading, it says that the function FIRST(ε DimList') gives {#, [}. But, how?
My guess is that since the right side of (2) begins with ε, it skips epsilon and takes FIRST(DimList') which is, considering (3) and (5), equal to {[}, BUT also, because of (4), takes FOLLOW(DimList') which is {#}.
Other way it could be is that, since (2) begins with ε it skips epsilon and takes FIRST(DimList') BUT ALSO takes FOLLOW(DimList) from (2)...
First one makes more sense to me, though I'm still in the process of learning basics of LL(1) grammars so I would appreciate if someone takes the time to make this clear, thank you.
EDIT: And, of course, it could be that neither of these is true.

The usual definition of the FIRST function would result in FIRST(Dimlist) (or, if you like, FIRST(ε Dimlist') being {ε, [}. ε is in FIRST(ε Dimlist') because both ε and Dimlist' are nullable. [ is an element because it could be the first symbol in a derivation of ε Dimlist, which is the same as saying that it could be the first symbol in a derivation of Dimlist'.
Another way of saying this is that:
FIRST(ε Dimlist' #) = {#, [}
We usually then define the function PREDICT:
PREDICT(ω) = FIRST(ω FOLLOW(ω))
and we can see that
PREDICT(Dimlist) = FIRST(Dimlist FOLLOW(Dimlist)) = {#, [}
Here, FIRST(ω) is the set of strings of terminals (of length ≤ 1) which could appear at the beginning of a derivation of ω, while PREDICT(ω) is the set of strings of terminals (of length ≤ 1) which could be present in the input when a derivation of ω is possible.
It's not uncommon to confuse FIRST and PREDICT, but it's better to keep the difference straight.
Note that all of these functions can be generalized to strings of maximum length k, which are usually written FIRSTk, FOLLOWk and PREDICTk, and the definition of PREDICTk is similar to the above:
PREDICTk(ω) = FIRSTk(ω FOLLOWk(ω))

Related

How to calculate Follow set of a production rule that has only ONE symbol on the right side

So i have this grammar :
S -> (D)
D -> EF
E -> a|b|S
F -> *D | +D | ε
First of all, books solution uses the P -> pBq , First(q) - {ε} is subset of FOLLOW(B) for the rule D -> EF but that rule has only 2 symbols do we assume ε infront of E (ε being the p in pBq)?
And secondly i can't understand how to calculate Follow(E).
FOLLOW(E) consists of every terminal symbol which can immediately follow E in some derivation step. That's the precise definition; it's not very complicated.
For a simple grammar, you should be able to figure out all the FOLLOW sets just be looking at the grammar and applying a little bit of common sense. It would probably be a good idea to do that, since it will give you a better idea of how the algorithm works.
As a side note, it's maybe worth mentioning that ε is not a thing. Or at least, it's not a grammar symbol. It's one of several conventions used to make the empty sequence visible, just like 0 is a way to make nothing visible. Sometimes that's useful, but it's important to not let it confuse you. (Abuse of notation is endemic in mathematics, which can be frustrating.)
So, what can follow E? E only appears in one place on the right-hand sdie of that grammar, in the production D → E F. So clearly any symbol which be the first symbol of F must be in FOLLOW(E). The symbols which could be at the start of F are + and *, since as mentioned, ε is not a grammar symbol. (Many definitions of FIRST allow ε to be a member of that set, along with any actual terminal symbol. That's an example of the abuse of notation I was talking about in the previous paragraph, since it makes it look like ε is a terminal symbol. But it isn't. It's nothing.)
F is what we call a "nullable" non-terminal, because it can derive the empty sequence (which was written as ε so that you can see it). In other words, it's possible for F to disappear completely in a derivation step. And if it does disappear, then E might be at the end of the production D → E F. If E is at the end of D, then it can be followed by anything which could follow D, which includes ). D can also appear at the end of a derivation of F, which means that F could be followed by anything which could follow F, a tautology which adds no information whatsoever.
So it's easy to see that FOLLOW(F) = {*, +, )}, and you can use that to check your understanding of any algorithm to compute follow sets.
Now, I don't know what book you are referring to (and it would have been courteous to mention that in your original question; sources should always be correctly cited). But the book I happen to have in front of me --the Dragon Book-- has a pretty similar algorithm. The Dragon book uses a simple convention for writing statements like that. Probably your book does, too, but it might not be the same convention. You should check what it says and make sure that you typed the copied statement correctly, respecting whatever formatting used to indicate what the symbols stand for.
In the Dragon book, some of the conventions include:
Lower case characters at the start of the alphabet. –a, b, c,…– are terminals (as well as actual symbols like * and +).
Upper case characters at the start of the alphabet. –A, B, C,…– are non-terminals.
S is the start symbol.
Upper case characters at the end of the alphabet. –X, Y, Z– stand for arbitrary grammar symbols (either terminals or non-terminals).
$ is the marker used to indicate the end of the input.
Lower-case Greek letters –α, β, γ,…– are possibly-empty strings of grammar symbols.
The phrase "possibly empty" is very important, so I'm repeating it.
With that convention, they write the rules for computing the FOLLOW set:
Place $ in FOLLOW(S).
For every production A → αBβ, copy everything from FIRST(&beta) except ε into FOLLOW(B).
If there is a production A → αB or a production A → αBβ where FIRST(β) contains ε, place everything in FOLLOW(A) into FOLLOW(B).
As mentioned above, α is a possibly-empty string of grammar symbols. So it might not be visible.
Keep doing steps 2 and 3 until no new symbols are added to any follow set.
I'm pretty sure that the algorithm in your book differs only in notation conventions.

Are those languages regular or not

Hi I would need help from you, I got the following languages and need to determine if are regular or not.
Now I think that Y is not regular and I applied the Pumping Lemma to determine that.
For X I am not sure if is a regular language or not, I was thinking that X is the set of strings with an odd number of a's that can be easily represented with an NFA.
Can anyone help me with that ?
The first one (X) is regular, because you can construct a finite automaton for it:
(start) --- a --> (final) -- a --> (state)
^ |
\------ a -------/
The second one (Y) is not regular, because you cannot construct a finite automaton for it. It would require memory to store the number of a to be able to later find one more of b. That language is context-free, with a grammar:
S = T b
T = a T b
T = ε

How do I expand the item set for this grammar?

I have this grammar
E -> E + i
E -> i
The augmented grammar
E' -> E
E -> E + i
E -> i
Now I try to expand the item set 0
I0)
E' -> .E
+E -> .E + i
+E -> .i
Then, since I have .E in I0 I would expand it but then I will get another E rule, and so on, this is my first doubt.
Assuming that this is alright the next item sets are
I0)
E' -> .E
+E -> .E + i
+E -> .i
I1) (I moved the dot from I0, no variables at rhs of dot)
E' -> E.
E -> E. + i
E -> i.
I2) (I moved the dot from I1, no vars at rhs of dot)
E -> E +. i
I3) (I moved the dot from I2, also no vars)
E -> E + i.
Then I will have this DFA
I0 -(E, i)-> I1 -(+)-> I2 -(i)-> I3
| |
+-(∅)-> acpt <-(∅)--+
I'm missing something because E -> E + i must accept i + i + .. but the DFA doesn't goes back to the I0, so it seems wrong to me. My guess is that it should have a I0 to I0 transition, but I then I don't know that to do with the dot.
What you call the "expansion" of the item set is actually a closure; that's how it's described in all the descriptions of the algorithm I've seen (at least in textbooks). Like any closure operation, you just keep on doing the transformation until you reach a fixed-point; once you've included the productions for E, they're included.
But the essential point is that you're not building a DFA. You're building a pushdown automaton, and the DFA is just one part of it. The DFA is used for shift operations; when you shift a new terminal (because the current parse stack is not a handle), you do a state transition according to the DFA. But you also push the current state onto the PDA's stack.
The interesting part is what happens when the automaton decides to perform a reduction, which replaces the right-hand side of a production with its left-hand side non-terminal. (The right-hand side at the top of the stack is called a "handle".) To do the reduction, you unwind the stack, popping each right-hand side symbol (and the corresponding DFA state) until you reach the beginning of the production. What that does is rewind the DFA to the state it was in before it shifted the first symbol from the right-hand side. (Note that it is only at this point that you know for sure which production was used.) With the DFA thus reset, you can now shift the non-terminal which was encountered, do the corresponding DFA transition, and continue with the parse.
The basis for this procedure is the fact that the parser stack is at all times a "viable prefix"; that is, a sequence of symbols which are the prefix of some right sentential form which can be derived from the start symbol. What's interesting about the set of viable prefixes for a context-free grammar is that it is a regular language, and consequently can be recognised by a DFA. The reduction procedure given above precisely represents this recognition procedure when handles are "pruned" (to use Knuth's original vocabulary).
In that sense, it doesn't really matter what procedure is used to determine which handle is to be pruned, as long as it provides a valid answer. You could, for example, fork the parse every time a potential handle is noticed at the top of the stack, and continue in parallel with both forks. With clever stack management, this parallel search can be done in worst-case O(n3) time for any context-free grammar (and this can be reduced if the grammar is not ambiguous). That's a very rough description of Earley parsers.
But in the case of an LR(k) parser, we require that the grammar be unambiguous, and we also require that we can identify a reduction by looking at no more than k more symbols from the input stream, which is an O(1) operation since k is fixed. If at each point in the parse we know whether to reduce or not, and if so which reduction to choose, then the reductions can be implemented as I outlined above. Each reduction can be performed in O(1) time for a fixed grammar (since the maximum size of a right-hand side in a particular grammar is fixed), and since the number of reductions in a parse is linear in the size of the input, the entire parse can be done in linear time.
That was all a bit informal, but I hope it serves as an intuitive explanation. If you're interested in the formal proof, Donald Knuth's original 1965 paper (On the Translation of Languages from Left to Right) is easy to find and highly readable as these things go.

Automata and Formal Languages

Showing that the reverse of a word for a regular language L is also regular
I am confused as to how I am to approach this question, i've been stuck for hours: For a word x, we use x^r to denote its reverse. For a language L, we use L^r to denote {x^r where x is in the set of L}. Show that if L is regular then so is L^r
If L is regular, then there exists some regular grammar which generates it. It can be always represented as either a left-regular grammar, or a right-regular grammar. Let's assume that it's left-regular grammar G_l(the proof for right-regular grammar is analogous).
This grammar has productions of two types; the terminating-type:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
or the chaining type:
B -> Ca, where B, C are non-terminals and a is a terminal
When we apply reverse to a regular language, we basically also apply it to the tails of productions (since heads are just single non-terminals). It's going to be proved later on. So we get a new grammar G_r, with productions:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
B -> aC, where B, C are non-terminals and a is a terminal
But hey, it's a right-regular grammar! So the language it accepts is also regular.
There is one thing to do - to show that reversing tails actually does the thing it's supposed to. We're going to prove it very simply:
If L contains \epsilon, then there is production 'S -> \epsilon' in G_l. Since we don't touch productions like that, it's also present in G_r.
If L contains a, a word composed of a single terminal, then it's similar to the above
If L contains aZ, where a is a terminal and Z is a word from the language constructed from chopping off the first terminals out of words in L, then L^r contains (because of changes to the chaining productions) (Z^r)a. Z is also a regular language, since it can be constructed by dropping the first "level" of left-productions from G_l, which leaves us with a regular grammar.
I hope it helped. There's also an arguably easier way of doing that by reversing edges of the relevant finite automata and changing accepting and entry states a bit.

What are FIRST and FOLLOW sets used for in parsing?

What are FIRST and FOLLOW sets? What are they used for in parsing?
Are they used for top-down or bottom-up parsers?
Can anyone explain me FIRST and FOLLOW SETS for the following set of grammar rules:
E := E+T | T
T := T*V | T
V := <id>
They are typically used in LL (top-down) parsers to check if the running parser would encounter any situation where there is more than one way to continue parsing.
If you have the alternative A | B and also have FIRST(A) = {"a"} and FIRST(B) = {"b", "a"} then you would have a FIRST/FIRST conflict because when "a" comes next in the input you wouldn't know whether to expand A or B. (Assuming lookahead is 1).
On the other side if you have a Nonterminal that is nullable like AOpt: ("a")? then you have to make sure that FOLLOW(AOpt) doesn't contain "a" because otherwise you wouldn't know if to expand AOpt or not like here: S: AOpt "a" Either S or AOpt could consume "a" which gives us a FIRST/FOLLOW conflict.
FIRST sets can also be used during the parsing process for performance reasons. If you have a nullable nonterminal NullableNt you can expand it in order to see if it can consume anything, or it may be faster to check if FIRST(NullableNt) contains the next token and if not simply ignore it (backtracking vs predictive parsing). Another performance improvement would be to additionally provide the lexical scanner with the current FIRST set, so the scanner does not try all possible terminals but only those that are currently allowed by the context. This conflicts with reserved terminals but those are not always needed.
Bottom up parsers have different kinds of conflicts namely Reduce/Reduce and Shift/Reduce. They also use item sets to detect conflicts and not FIRST,FOLLOW.
Your grammar would't work with LL-parsers because it contains left recursion. But the FIRST sets for E, T and V would be {id} (assuming your T := T*V | T is meant to be T := T*V | V).
Answer :
E->E+T|T
left recursion
E->TE'
E'->+TE'|eipsilon
T->T*V|T
left recursion
T->VT'
T'->*VT'|epsilon
no left recursion in
V->(id)
Therefore the grammar is:
E->TE'
E'->+TE'|epsilon
T->VT'
T'->*VT'|epsilon
V-> (id)
FIRST(E)={(}
FIRST(E')={+,epsilon}
FIRST(T)={(}
FIRST(T')={*,epsilon}
FIRST(V)={(}
Starting Symbol=FOLLOW(E)={$}
E->TE',E'->TE'|epsilon:FOLLOW(E')=FOLLOW(E)={$}
E->TE',E'->+TE'|epsilon:FOLLOW(T)=FIRST(E')={+,$}
T->VT',T'->*VT'|epsilon:FOLLOW(T')=FOLLOW(T)={+,$}
T->VT',T->*VT'|epsilon:FOLLOW(V)=FIRST(T)={ *,epsilon}
Rules for First Sets
If X is a terminal then First(X) is just X!
If there is a Production X → ε then add ε to first(X)
If there is a Production X → Y1Y2..Yk then add first(Y1Y2..Yk) to first(X)
First(Y1Y2..Yk) is either
First(Y1) (if First(Y1) doesn't contain ε)
OR (if First(Y1) does contain ε) then First (Y1Y2..Yk) is everything in First(Y1) except for ε as well as everything in First(Y2..Yk)
If First(Y1) First(Y2)..First(Yk) all contain ε then add ε to First(Y1Y2..Yk) as well.
Rules for Follow Sets
First put $ (the end of input marker) in Follow(S) (S is the start symbol)
If there is a production A → aBb, (where a can be a whole string) then everything in FIRST(b) except for ε is placed in FOLLOW(B).
If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B)
If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
Wikipedia is your friend. See discussion of LL parsers and first/follow sets.
Fundamentally they are used as the basic for parser construction, e.g., as part of parser generators. You can also use them to reason about properties of grammars, but most people don't have much of a need to do this.

Resources