I was given the question for test:
Show the parse tree of a string proving that b is in the follow of T.
S=> ET
T=>bSc | d
E=>aTE| ε
I solved the First set :
First(S)=>First(E)=>{a} U First(T)=> {a,b,d}
First(T)=>{b,d}
First(E)=>{a,ε}
And the Follow set :
Follow(S)=>{$,c}
Follow(T)=>Follow(S) U First(E)=> {$,c} U First(E)=>{$,c,a}
Follow(E)=>First(T) U Follow(E)=>{b,d}
Where I am going wrong ?
You wrote:
Follow(T) ⇒ Follow(S) ∪ First(E)
But ε is in First(E), so that should be:
Follow(T) ⇒ Follow(S) ∪ First(E) ∪ Follow(E)
I've seen this post about how to convert context free grammar to a DFA:
Automata theory : Conversion of a Context free grammar to a DFA
However, just wondering can all context free grammars be converted to DFA/NFA? What about context free grammars that cannot be expressed as a regular expression? Ex. S->(S) | ()
Thanks!
Only regular languages can be converted to a DFA, and not all CFGs represent regular languages, including the one in the question.
So the answer is "no".
NFAs are not more expressive than DFAs, so the above statement would still be true if you replaced DFA with NFA
A CFG represents a regular language if it is right- or left-linear. But the mere fact that a CFG is not left- or right-linear proves nothing. For example, S→a | a S a happens to generate the same language as S→a | S a a.
Yes ... if the F in "DFA" is replaced by I to get "DIA", but no ... for DFA, itself; and I will show how this works for your example at the end. In fact, all languages have DIA's whose state diagrams reside on a single Universal State Diagram as sub-diagrams thereof.
Consider your example, but rewrite it as S → u S v, S → w. This grammar, like all grammars, is algebraically a system of inequations over a certain partially ordered algebra. In particular, it can be rewritten as
S ⊇ {u}S{v}, S ⊇ {w},
or equivalently as
S ⊇ {u}S{v} ∪ {w}.
The object identified by the grammar is the least solution to the system. Since the system is a fixed point system S ⊇ f(S) = {u}S{v} ∪ {w}, then the least solution may also be described as the least fixed point solution and it is denoted μx f(x) = μx({u}x{v} ∪ {w}).
The ordering relation, for this algebra here, is subset ordering y ⊆ x ⇔ x ⊇ y. The operations include a product AB ≡ { ab: a ∈ A, b ∈ B }, defined element-wise (where, component-wise, the product is word concatenation, with ab being the concatenation of a and b). The product has {1} as an identity, where 1 denotes the empty word. Both word concatenation and product satisfy the fundamental properties
(xy)z = x(yz) [Associativity]
and
xe = x = ex [Identity property]
with the respective identities e = 1 (for concatenation) or e = {1} (for set product). The algebra is called a Monoid.
The simplest and most direct monoid formed from the elements X = {u,v,w} is the Free Monoid X* = {u,v,w}*, which is equivalently described as the set of all words of finite length (including the empty word, 1, of length 0) formed from u, v and w. It is possible to frame the question in terms of more general monoids, but (as the literature usually does) I will restrict it to free monoids.
The family of languages over X is one and the same as the family 𝔓M of subsets A ⊆ M of the monoid M = X*; the defining condition being A ∈ 𝔓M ⇔ A ⊆ M. Other distinguished subfamilies exist, such as the families ℜM ⊆ ℭM ⊆ 𝔗M ⊆ 𝔓M, respectively, of rational, context-free and Turing (or recursively enumerable) languages. The second of these ℭM, which is what your question is concerned with, are given by context-free grammars and are identified as the least fixed point solutions to the corresponding fixed point system of inequations.
Over 𝔓M, one can define the left-quotient operation v\A = { w ∈ M: vw ∈ A }, for each word v ∈ M and subset A ∈ 𝔓M. Because M = X* is a free monoid, it can be decomposed uniquely into left-quotients on the individual elements of X, by the properties 1\A = A, and (vw)\A = w\(v\A).
Correspondingly, one can define a state transition on each x ∈ X by x: A → x\A, treating each subset A ∈ 𝔓M as a state. Together, 𝔓M comprises the state set of the Universal State Diagram over M. Because M = X* is a free monoid, every element of M is either of the form xw for some x ∈ X and w ∈ X*, or is the empty word 1. The decomposition is unique: xw ≠ 1 for any x ∈ X or w ∈ X* and xw = x'w' for x, x' ∈ X and w, w' ∈ X*, only if x = x' and w = w'. Therefore, every A ∈ 𝔓M decomposes uniquely into a partition in a manner analogous to Taylor's Theorem as
A = A₀ ∪ ⋃_{x∈X} {x} x\A.
where A₀ ≡ A ∩ {1} is either {1} if 1 ∈ A or is ∅ if 1 ∉ A. The states for which A₀ = {1} may be regarded as the Final States in the Universal State Diagram.
The analogy to Taylor's Theorem is not too far-removed, since the left-quotient satisfies an analogue of the Product Rule
x\(AB) = (x\A) B ∪ A₀ (x\B)
so it is also denoted as a partial derivative x\A = ∂A/∂x: the Brzozowski Derivative, so that the decomposition rule could just as well be written as:
A = A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x.
What you actually have is an infinite fixed-point system of inequations
A ⊇ A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x for all A ∈ 𝔓M,
with variables A ∈ 𝔓M ranging over all of 𝔓M, whose right-hand sides are all right-linear in the variables. The sets, themselves, are the least fixed point solution to their own system (and to all closed subsystems of the universal system that contain that set as a variable).
Choosing different states as start states yields the different DIA's contained within it. Every minimal DIA (and every minimal DFA) of every language over X is contained in it.
In particular, in this diagram, you can consider the largest subdiagram accessible from a specific state A ∈ 𝔓M. All the states that can be accessed from A are left-quotients by words in M. So, together they comprise a family δA ≡ { v\A: v ∈ M }. The subdiagram consisting only of these states gives you the minimal DIA for the language A, where A, itself, is treated as the start state of the DIA.
If δA is finite, then the I is an F and it's actually a DFA - and that's what you're looking for. Which states in 𝔓M have DIA that are actually DFA's? The regular ones - the ones in the subfamily ℜM ⊆ 𝔓M. This is the case when M = X* is a free monoid. I'm not totally sure if this can also be proven for non-free monoids (like X* × Y*, whose rational subsets ℜ(X* × Y*) are one and the same as what are known as rational transductions) ... because of the reliance on the Taylor's Formula decomposition. There is still something like a Taylor's Theorem, but the decompositions are not necessarily partitions or unique, any longer.
For larger subfamilies of 𝔓M, the DIA are necessarily infinite; but their transitions may possess a sufficient degree of symmetry to allow both the states and transition rules to be wrapped up more succinctly. Correspondingly, one can distinguish different families of DIA by what symmetry properties they possess.
For your example, X = {u,v,w} and M = {u,v,w}*. The subset identified by your grammar is S = {uⁿ w vⁿ: n = 0, 1, 2, ...}. We can define the following sets
S(n) = S {vⁿ}, T(n) = {vⁿ}, for n = 0, 1, 2, ...
The sub-diagram of states accessible from S consists of all the states
δS = { S(n): n = 0, 1, 2, ... } ∪ { T(n): n = 0, 1, 2, ... } ∪ { ∅ }
The state transitions are the following
u: S(n) → S(n+1)
v: T(n+1) → T(n)
w: S(n) → T(n)
with x: A → ∅ in all other cases for x ∈ {u,v,w} and A ∈ δS. The sole final state is T(0).
As you can see, the DIA is infinite and is not a DFA at all. If you were to draw out the diagram, you would see an infinite ladder with S = S(0) being the start state T(0) = {1} the final state, with all the u transitions climbing up a rung, all the v transitions climbing down a rung, and the w transitions crossing over on a rung.
The symmetry is captured by factoring the state set into
δS = {S,T}×{0,1,2,3,⋯} ∪ {∅}
with S(n) rewritten as (S,n) and T(n) as (T,n). This includes a finite set of states Q = {S,T} for a finite state "control" and a set of states D = {0,1,2,3,⋯} for a "device"; as well as the empty set ∅ for the fail state. That device is none other than a counter, and this DIA is just a one-counter automaton in disguise.
All of the classical automata models posed in the literature have a similar form, when expressed as DIA. They contain a state set Q×D ∪ {∅} that includes a finite set Q for the "finite state control" and a (generally infinite) state set D for the device, along with the fail state ∅. The restrictions or constraints on the device correspond to what types of symmetries are contained in the underlying DIA. A deterministic PDA, with two stack symbols {a,b} for instance, has a device state set D = {a,b}* (consisting of all stack words), and an underlying DIA that has the form of an infinite binary tree with copies of Q residing at each node.
You can best see this by writing out and graphing the DIA for the Dyck language, which is given by the grammar D₂ → b D₂ d D₂, D₂ → p D₂ q D₂, D₂ → 1 as a language over X = {b,d,p,q} and subset of M = X* = {b,d,p,q}*; i.e. as the least-fixed point D₂ = μx ({b}x{d}x ∪ {p}x{q}x ∪ {1}).
Every subset in A ⊆ ℭM can be expressed in terms of a subset in A' ⊆ ℜM[b,d,p,q] of the free extension of the monoid M by indeterminates {b,d,p,q}, by carrying out insertions of {b,d,p,q} in suitable places in A, such that the result upon applying the identities {bd} = {1} = {pq}, {bq} = ∅ = {pd}, and xy = yx for x ∈ M and y ∈ {b,d,p,q} will yield A, itself, from A'.
This result (known, but unpublished since the 1990's and published only in 2022) is the algebraic form of the Chomsky-Schützenberger Theorem and is true for all monoids M. For instance, it holds for the non-free monoid M = X* × Y*, where the corresponding family ℭ(X* × Y*) comprise the push-down transductions from X to Y (or "simple syntax directed translations"; aka yacc-like grammars).
So, there is also something like a DFA even for these classes of DIA; provided you include transition arrows for {b,d,p,q}. For your example, A = μx({u}x{v} ∪ {w}), you have A' = {b}{up,qv,w}*{d} and you can easily write down the corresponding DFA. That automaton is just the one-counter machine, itself, with "b" interpreted as "start up at count 0", "d" as "check for count 0 and finish", "p" as "add one to the count" and "q" as "check for count greater than 0 and subtract 1". With respect to the algebraic rules given for {b,d,p,q}, A' is not just a representation of A, it is actually is A: A' = A.
On this page the author explains how to determine the FOLLOW sets of a CFG. Under the headline Syntax Analysis Goal: FOLLOW Sets he states:
Steps to Make the Follow Set
Conventions: a, b, and c represent a terminal or non-terminal. a*
represents zero or more terminals or non-terminals (possibly both). a+
represents one or more... D is a non-terminal.
Place an End of Input token ($) into the starting rule's follow set.
Suppose we have a rule R → a*Db. Everything in First(b) (except for ε)
is added to Follow(D). If First(b) contains ε then everything in
Follow(R) is put in Follow(D).
Finally, if we have a rule R → a*D,
then everything in Follow(R) is placed in Follow(D).
The Follow set of
a terminal is an empty set.
So far so good. But in the box below this item, we read:
[...] Step 2 on rule 1 (N → V = E) indicates that first(=) is in Follow(V).
Now this is the part I don't understand. When he says that First(=) is in Follow (V), he obviously maps = to b and V to D (b and D from the explication in the first box). But (a*)(D)(b) does not match ()(V)(=)E.
Am I reading this completely wrong, or did the author maybe write a*Db instead of a*Dba*?
(Especially if you read this on wikipedia: "FOLLOW(I) of an Item I [A → α • B β, x] is the set of terminals that can appear after nonterminal B, where α, β are arbitrary symbol strings, and x is an arbitrary lookahead terminal.")
Yes, he meant:
R → a* D b*
and since b* could be zero symbols, i.e. ε, the second rule is unneeded. Remember that FIRST is defined on arbitrary sequences of symbols.
In other words, for:
A → α B β
Every terminal in FIRST(β) is in FOLLOW(B), and
If β ⇒* ε, then everything in FOLLOW(A) is in FOLLOW(B).
Here's what Aho, Sethi & Ullman say in the dragon book:
Formally, we say LR(1) item [A → α·β, a] is valid for a viable prefix γ if there is a derivation S ⇒* δAw ⇒ δαβw
where γ = δα and either a is the first symbol of w or w is ε and a is $.
(The ⇒'s above are marked rm, meaning right-most derivation; in other words, in every derivation step, the right-most non-terminal is substituted with one of its productions. Consequently, w only contains terminals.)
In other words, the LR(1) item is valid (could apply) if we've reached some point where we've decided that A might be the next reduction and a might follow A; at the current point in the parse, we've read α. So if a follows β, then the reduction is possible. We don't yet know that, unless β is the empty sequence, but we need to remember the fact in case it turns out that β can derive the empty sequence.
I hope that helps. It's late here and I'm too tired to check it again. Maybe tomorrow...
How would I apply the FIRST() rule on a production such as :
A -> AAb | Ab | s
where A is a non-terminal, and b,s are terminals.
FIRST(A) of alternatives 1 & 2 would be A again, but such would end in infinite applications of FIRST, since I need a terminal to get the FIRST set?
To compute FIRST sets, you typically perform a fixed-point iteration. That is, you start off with a small set of values, then iteratively recompute FIRST sets until the sets converge.
In this case, you would start off by noting that the production A → s means that FIRST(A) must contain {s}. So initially you set FIRST(A) = {s}.
Now, you iterate across each production of A and update FIRST based on the knowledge of the FIRST sets you've computed so far. For example, the rule
A → AAb
Means that you should update FIRST(A) to include all elements of FIRST(AAb). This causes no change to FIRST(A). You then visit
A → Ab
You again update FIRST(A) to include FIRST(Ab), which is again a no-op. Finally, you visit
A → s
And since FIRST(A) already contains s, this causes no change.
Since nothing changed on this iteration, you would end up with FIRST(A) = {s}, which is indeed correct because any derivation starting at A ultimately will produce an s as its first character.
For more information, you might find these lecture slides useful (here's part two). They describe in detail how top-down parsing works and how to iteratively compute FIRST sets.
Hope this helps!
My teaching notes are in Spanish, but the algorithms are in English. This is one way to calculate FIRST:
foreach a ∈ Σ do
F(a) := {a}
for each A ∈ N do
if A→ε ∈ P then
F(A) := {ε}
else
F(A) := ∅
repeat
for each A ∈ N do
F'(A) := F(A)
for each A → X1X2...Xn ∈ P do
if n > 0 then
F(A) := F(A) ∪ F'(X1) ⋅k F'(X2) ⋅k ... ⋅k F'(Xn)
until F(A) = F'(A) forall A ∈ N
FIRSTk(X) := F(X) forall X ∈ (Σ ∪ N)
Σ is the alphabet (terminals), N is the set of non-terminals, P is the set of productions (rules), ε is the null string, and ⋅k is concatenation trimmed to k places. Note that ∅ ⋅k x = ∅, and that concatenating two sets produces the concatenation of the elements in the Cartesian product.
The easiest way to calculate FIRST sets by hand is by using one table per algorithm iteration.
F(A) = ∅
F'(A) = F(A) ⋅1 F(A) .1 F(b) U F(A) .1 F(b) U F(s)
F'(A) = ∅ ⋅1 ∅ ⋅1 {b} U ∅ ⋅1 {b} U {s}
F'(A) = ∅ U ∅ U {s}
F'(A) = {s}
F''(A) = F'(A) ⋅1 F'(A) .1 F'(b) U F'(A) .1 F'(b) U F'(s)
F''(A) = {s} ⋅1 {s} ⋅1 {b} U {s} ⋅1 {b} U {s}
F''(A) = {s} U {s} U {s}
F''(A) = {s}
And we're done, because F' = F'', so FIRST = F'', and FIRST(A) = {s}.
your grammar rule has left recursion as you already realized and LL parsers are not able to parse grammars with left recursion.
So you need to get rid of left recursion first and then you should be able to compute the first set for the rule.
I'm having a slight problem that's been bugging me for quite some time now and I haven't been able to find a suitable answer online.
Given a grammar:
S = Sc | Eab | Db | b
D = EDcD | ca | ɛ
E = dE | DY
Y = Ed | Dad | ɛ
To find the FIRST set of say Y, so FIRST(Y), am I correct in assuming that it goes like this:
FIRST(Y)
= FIRST(Ed) ∪ FIRST(Dad) ∪ FIRST(ɛ)
= FIRST(E) ∪ (FIRST(D)\{ɛ}) ∪ FIRST(ad) ∪ {ɛ}
= FIRST(E) ∪ (FIRST(D)\{ɛ}) ∪ {a, ɛ}
Now the question is how do I find FIRST(E) and FIRST(D)?
So, the problem with FIRST(E) and FIRST(D) is that E and D reference one another. And the solution is the usual one when you want a sort of "least fixed point" -- start with everything empty and keep iterating until they stabilize.
That is: first of all, initialize all your FIRST sets to the empty set. Now, repeatedly, consider each production and pretend your current estimates for the non-terminal's FIRST sets are the truth. (In reality, they will typically be "underestimates".) Work out what the production tells you about the FIRST set of its LHS, and update your estimate of that non-terminal's FIRST set accordingly. Keep doing this, processing all the productions in turn, until you've gone through all of them and your estimates haven't changed. At that point, you're done.
In this case, here's how it goes (assuming I haven't goofed, of course). The first pass produces successively S: {b}, D: {c,ɛ}, E: {c,d}, Y: {c,d,ɛ}. The second produces successively S: {b,c,d}, D: {c,d,ɛ}, E: {c,d,ɛ}, Y: {c,d,ɛ}. The third doesn't change any of those, so those are the final answers.