Closure property of regular languages under concatenation and star operation

In our Theory of Computation course, we proved that regular languages (L1, L2) are closed under intersection, union and complement. But their closure under concatenation (L1L2) and the star operation (L1*) was not covered. It would be great if someone could explain how to prove these two.
Thanks in advance

The proofs of these facts are constructive.
Let L1 and L2 be arbitrary regular languages. Because they are regular languages, we know there are minimal DFAs for L1 and L2; let's call these M1 and M2, respectively.
To see that the concatenation of these languages must be regular, construct a machine M* as follows:
the states of M* are the states of M1 and M2 put together (renamed if necessary so that the two state sets are disjoint)
the alphabet of M* is the union of the alphabets of M1 and M2
initial state of M* is the initial state of M1
accepting states of M* are the accepting states of M2
M* has all the same transitions as M1 and M2 put together, plus empty/epsilon/lambda transitions from all the accepting states in M1 to the initial state of M2
This defines an NFA-lambda (NFA with empty/lambda/epsilon transitions). We know those are equivalent to DFAs and all DFAs can be minimized; let us call the equivalent minimal DFA M**.
Because there is a minimal DFA for the concatenation of L1 and L2, the concatenation must be regular.
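The concatenation construction above can be sketched in Python as follows. This is only an illustration under assumptions: the tuple representation (start, accepting, delta), the helper names concat, closure and accepts, and the two example machines are all made up for this sketch, and the state names of the two machines are assumed to be disjoint. The symbol None stands for a lambda transition.

```python
# An automaton is a tuple (start, accepting, delta), where delta maps
# (state, symbol) to a set of next states; symbol None means lambda.

def concat(m1, m2):
    """Build an NFA-lambda for L(m1)L(m2). State names are assumed disjoint."""
    start1, acc1, delta1 = m1
    start2, acc2, delta2 = m2
    delta = {}
    # Keep all transitions of both machines.
    for d in (delta1, delta2):
        for key, targets in d.items():
            delta.setdefault(key, set()).update(targets)
    # Add lambda transitions from every accepting state of m1 to the start of m2.
    for q in acc1:
        delta.setdefault((q, None), set()).add(start2)
    return start1, set(acc2), delta

def closure(states, delta):
    """All states reachable from `states` using only lambda transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(m, word):
    """Simulate the NFA-lambda on `word` by tracking the set of current states."""
    start, acc, delta = m
    current = closure({start}, delta)
    for ch in word:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, ch), set())
        current = closure(nxt, delta)
    return bool(current & acc)

# Hypothetical example machines: m1 accepts a*, m2 accepts exactly "b".
m1 = ("p0", {"p0"}, {("p0", "a"): {"p0"}})
m2 = ("r0", {"r1"}, {("r0", "b"): {"r1"}})
m12 = concat(m1, m2)
```

Simulating the NFA-lambda directly, as accepts does, avoids building the equivalent DFA; the subset construction would produce the same sets of states as DFA states.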
The argument for L* where L is regular is similar. Let M3 be a minimal DFA for L. Then define M*** as follows:
States of M*** are the states of M3 plus an extra state q*
Alphabet of M*** is the alphabet of M3
Initial state of M*** is q*
Accepting state of M*** is q*
Transitions of M*** are the transitions of M3, plus an empty/epsilon/lambda transition from q* to the initial state of M3, as well as empty/epsilon/lambda transitions from the accepting states of M3 to q*
This defines an NFA-lambda (NFA with empty/epsilon/lambda transitions) which accepts L*. Because NFA-lambdas are equivalent to DFAs and DFAs can be minimized, there is a minimal DFA M**** for L*. As such, L* must be regular whenever L is.
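A minimal Python sketch of the star construction, with q* wired by a lambda transition to the initial state of M3 and every accepting state of M3 wired back to q*. The representation (start, accepting, delta), the helper names and the example machine are assumptions made for illustration; None stands for a lambda transition.

```python
def star(m, new_state="q*"):
    """Build an NFA-lambda for L(m)*. `new_state` must not already be a state of m."""
    start, acc, delta = m
    new_delta = {key: set(targets) for key, targets in delta.items()}
    # Lambda transition from the new state q* to the initial state of m.
    new_delta.setdefault((new_state, None), set()).add(start)
    # Lambda transitions from every accepting state of m back to q*.
    for q in acc:
        new_delta.setdefault((q, None), set()).add(new_state)
    # q* is both the initial state and the only accepting state.
    return new_state, {new_state}, new_delta

def closure(states, delta):
    """All states reachable from `states` using only lambda transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(m, word):
    """Simulate the NFA-lambda on `word` by tracking the set of current states."""
    start, acc, delta = m
    current = closure({start}, delta)
    for ch in word:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, ch), set())
        current = closure(nxt, delta)
    return bool(current & acc)

# Hypothetical example: m3 accepts exactly the word "ab".
m3 = ("s0", {"s2"}, {("s0", "a"): {"s1"}, ("s1", "b"): {"s2"}})
m3_star = star(m3)
```

Because q* is accepting and is also the initial state, the empty word is accepted, which matches the zero-repetitions case of L*.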

Related

Can all context free grammars be converted to NFA/DFA?

I've seen this post about how to convert context free grammar to a DFA:
Automata theory : Conversion of a Context free grammar to a DFA
However, I am wondering: can all context-free grammars be converted to a DFA/NFA? What about context-free grammars that cannot be expressed as a regular expression? Ex. S->(S) | ()
Thanks!
Only regular languages can be converted to a DFA, and not all CFGs represent regular languages, including the one in the question.
So the answer is "no".
NFAs are not more expressive than DFAs, so the above statement would still be true if you replaced DFA with NFA.
A CFG represents a regular language if it is right- or left-linear. But the mere fact that a CFG is not left- or right-linear proves nothing. For example, S→a | a S a happens to generate the same language as S→a | S a a.
Yes ... if the F in "DFA" is replaced by I to get "DIA", but no ... for DFA, itself; and I will show how this works for your example at the end. In fact, all languages have DIA's whose state diagrams reside on a single Universal State Diagram as sub-diagrams thereof.
Consider your example, but rewrite it as S → u S v, S → w. This grammar, like all grammars, is algebraically a system of inequations over a certain partially ordered algebra. In particular, it can be rewritten as
S ⊇ {u}S{v}, S ⊇ {w},
or equivalently as
S ⊇ {u}S{v} ∪ {w}.
The object identified by the grammar is the least solution to the system. Since the system is a fixed point system S ⊇ f(S) = {u}S{v} ∪ {w}, then the least solution may also be described as the least fixed point solution and it is denoted μx f(x) = μx({u}x{v} ∪ {w}).
The ordering relation, for this algebra here, is subset ordering y ⊆ x ⇔ x ⊇ y. The operations include a product AB ≡ { ab: a ∈ A, b ∈ B }, defined element-wise (where, component-wise, the product is word concatenation, with ab being the concatenation of a and b). The product has {1} as an identity, where 1 denotes the empty word. Both word concatenation and product satisfy the fundamental properties
(xy)z = x(yz) [Associativity]
and
xe = x = ex [Identity property]
with the respective identities e = 1 (for concatenation) or e = {1} (for set product). The algebra is called a Monoid.
The simplest and most direct monoid formed from the elements X = {u,v,w} is the Free Monoid X* = {u,v,w}*, which is equivalently described as the set of all words of finite length (including the empty word, 1, of length 0) formed from u, v and w. It is possible to frame the question in terms of more general monoids, but (as the literature usually does) I will restrict it to free monoids.
The family of languages over X is one and the same as the family 𝔓M of subsets A ⊆ M of the monoid M = X*; the defining condition being A ∈ 𝔓M ⇔ A ⊆ M. Other distinguished subfamilies exist, such as the families ℜM ⊆ ℭM ⊆ 𝔗M ⊆ 𝔓M, respectively, of rational, context-free and Turing (or recursively enumerable) languages. The second of these, ℭM, which is what your question is concerned with, consists of the languages given by context-free grammars; they are identified as the least fixed point solutions to the corresponding fixed point systems of inequations.
Over 𝔓M, one can define the left-quotient operation v\A = { w ∈ M: vw ∈ A }, for each word v ∈ M and subset A ∈ 𝔓M. Because M = X* is a free monoid, it can be decomposed uniquely into left-quotients on the individual elements of X, by the properties 1\A = A, and (vw)\A = w\(v\A).
Correspondingly, one can define a state transition on each x ∈ X by x: A → x\A, treating each subset A ∈ 𝔓M as a state. Together, 𝔓M comprises the state set of the Universal State Diagram over M. Because M = X* is a free monoid, every element of M is either of the form xw for some x ∈ X and w ∈ X*, or is the empty word 1. The decomposition is unique: xw ≠ 1 for any x ∈ X or w ∈ X* and xw = x'w' for x, x' ∈ X and w, w' ∈ X*, only if x = x' and w = w'. Therefore, every A ∈ 𝔓M decomposes uniquely into a partition in a manner analogous to Taylor's Theorem as
A = A₀ ∪ ⋃_{x∈X} {x} x\A.
where A₀ ≡ A ∩ {1} is either {1} if 1 ∈ A or is ∅ if 1 ∉ A. The states for which A₀ = {1} may be regarded as the Final States in the Universal State Diagram.
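To make the decomposition concrete, here is a small Python sketch on a finite language. It assumes words are Python strings over single-character symbols, with the empty string playing the role of the empty word 1; the function names and the sample set A are hypothetical and chosen only for illustration.

```python
def left_quotient(x, language):
    """x\\A = { w : xw in A } for a single symbol x."""
    return {w[1:] for w in language if w[:1] == x}

def decompose(language, alphabet):
    """Split A into A0 (the empty-word part) and the left-quotients x\\A."""
    a0 = {""} & language
    parts = {x: left_quotient(x, language) for x in alphabet}
    return a0, parts

def recompose(a0, parts):
    """Rebuild A as A0 united with {x} x\\A over all symbols x."""
    out = set(a0)
    for x, quotient in parts.items():
        out |= {x + w for w in quotient}
    return out

# Hypothetical finite sample over X = {u, v, w}.
A = {"", "w", "uwv", "uuwvv"}
a0, parts = decompose(A, "uvw")
```

Here a0 is {""} because the sample contains the empty word, so A would count as a final state; the pieces produced for different symbols are disjoint, which is the partition property used in the text.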
The analogy to Taylor's Theorem is not too far-removed, since the left-quotient satisfies an analogue of the Product Rule
x\(AB) = (x\A) B ∪ A₀ (x\B)
so it is also denoted as a partial derivative x\A = ∂A/∂x: the Brzozowski Derivative, so that the decomposition rule could just as well be written as:
A = A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x.
What you actually have is an infinite fixed-point system of inequations
A ⊇ A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x for all A ∈ 𝔓M,
with variables A ∈ 𝔓M ranging over all of 𝔓M, whose right-hand sides are all right-linear in the variables. The sets, themselves, are the least fixed point solution to their own system (and to all closed subsystems of the universal system that contain that set as a variable).
Choosing different states as start states yields the different DIA's contained within it. Every minimal DIA (and every minimal DFA) of every language over X is contained in it.
In particular, in this diagram, you can consider the largest subdiagram accessible from a specific state A ∈ 𝔓M. All the states that can be accessed from A are left-quotients by words in M. So, together they comprise a family δA ≡ { v\A: v ∈ M }. The subdiagram consisting only of these states gives you the minimal DIA for the language A, where A, itself, is treated as the start state of the DIA.
If δA is finite, then the I is an F and it's actually a DFA - and that's what you're looking for. Which states in 𝔓M have DIA that are actually DFA's? The regular ones - the ones in the subfamily ℜM ⊆ 𝔓M. This is the case when M = X* is a free monoid. I'm not totally sure if this can also be proven for non-free monoids (like X* × Y*, whose rational subsets ℜ(X* × Y*) are one and the same as what are known as rational transductions) ... because of the reliance on the Taylor's Formula decomposition. There is still something like a Taylor's Theorem, but the decompositions are not necessarily partitions or unique, any longer.
For larger subfamilies of 𝔓M, the DIA are necessarily infinite; but their transitions may possess a sufficient degree of symmetry to allow both the states and transition rules to be wrapped up more succinctly. Correspondingly, one can distinguish different families of DIA by what symmetry properties they possess.
For your example, X = {u,v,w} and M = {u,v,w}*. The subset identified by your grammar is S = {uⁿ w vⁿ: n = 0, 1, 2, ...}. We can define the following sets
S(n) = S {vⁿ}, T(n) = {vⁿ}, for n = 0, 1, 2, ...
The sub-diagram of states accessible from S consists of all the states
δS = { S(n): n = 0, 1, 2, ... } ∪ { T(n): n = 0, 1, 2, ... } ∪ { ∅ }
The state transitions are the following
u: S(n) → S(n+1)
v: T(n+1) → T(n)
w: S(n) → T(n)
with x: A → ∅ in all other cases for x ∈ {u,v,w} and A ∈ δS. The sole final state is T(0).
As you can see, the DIA is infinite and is not a DFA at all. If you were to draw out the diagram, you would see an infinite ladder with S = S(0) being the start state and T(0) = {1} the final state, with all the u transitions climbing up a rung, all the v transitions climbing down a rung, and the w transitions crossing over on a rung.
The symmetry is captured by factoring the state set into
δS = {S,T}×{0,1,2,3,⋯} ∪ {∅}
with S(n) rewritten as (S,n) and T(n) as (T,n). This includes a finite set of states Q = {S,T} for a finite state "control" and a set of states D = {0,1,2,3,⋯} for a "device"; as well as the empty set ∅ for the fail state. That device is none other than a counter, and this DIA is just a one-counter automaton in disguise.
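The ladder can be simulated directly as a one-counter machine. The Python sketch below is only an illustration: states are encoded as pairs ("S", n) or ("T", n), the fail state ∅ is encoded as None, and the function names step and accepts are made up for this example.

```python
def step(state, x):
    """One transition of the ladder. `state` is ("S", n), ("T", n) or None (fail)."""
    if state is None:
        return None
    kind, n = state
    if kind == "S" and x == "u":
        return ("S", n + 1)          # u: S(n) -> S(n+1)
    if kind == "S" and x == "w":
        return ("T", n)              # w: S(n) -> T(n)
    if kind == "T" and x == "v" and n > 0:
        return ("T", n - 1)          # v: T(n+1) -> T(n)
    return None                      # every other case leads to the fail state

def accepts(word):
    """Run from the start state S(0); the sole final state is T(0)."""
    state = ("S", 0)
    for x in word:
        state = step(state, x)
    return state == ("T", 0)
```

The first component of the pair is the finite control Q = {S, T} and the second is the counter, so only the current rung is ever stored, even though the diagram itself is infinite.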
All of the classical automata models posed in the literature have a similar form, when expressed as DIA. They contain a state set Q×D ∪ {∅} that includes a finite set Q for the "finite state control" and a (generally infinite) state set D for the device, along with the fail state ∅. The restrictions or constraints on the device correspond to what types of symmetries are contained in the underlying DIA. A deterministic PDA, with two stack symbols {a,b} for instance, has a device state set D = {a,b}* (consisting of all stack words), and an underlying DIA that has the form of an infinite binary tree with copies of Q residing at each node.
You can best see this by writing out and graphing the DIA for the Dyck language, which is given by the grammar D₂ → b D₂ d D₂, D₂ → p D₂ q D₂, D₂ → 1 as a language over X = {b,d,p,q} and subset of M = X* = {b,d,p,q}*; i.e. as the least-fixed point D₂ = μx ({b}x{d}x ∪ {p}x{q}x ∪ {1}).
Every subset A ∈ ℭM can be expressed in terms of a subset A' ∈ ℜM[b,d,p,q] of the free extension of the monoid M by indeterminates {b,d,p,q}, by carrying out insertions of {b,d,p,q} in suitable places in A, such that the result upon applying the identities {bd} = {1} = {pq}, {bq} = ∅ = {pd}, and xy = yx for x ∈ M and y ∈ {b,d,p,q} will yield A, itself, from A'.
This result (known, but unpublished since the 1990's and published only in 2022) is the algebraic form of the Chomsky-Schützenberger Theorem and is true for all monoids M. For instance, it holds for the non-free monoid M = X* × Y*, where the corresponding family ℭ(X* × Y*) comprise the push-down transductions from X to Y (or "simple syntax directed translations"; aka yacc-like grammars).
So, there is also something like a DFA even for these classes of DIA; provided you include transition arrows for {b,d,p,q}. For your example, A = μx({u}x{v} ∪ {w}), you have A' = {b}{up,qv,w}*{d} and you can easily write down the corresponding DFA. That automaton is just the one-counter machine, itself, with "b" interpreted as "start up at count 0", "d" as "check for count 0 and finish", "p" as "add one to the count" and "q" as "check for count greater than 0 and subtract 1". With respect to the algebraic rules given for {b,d,p,q}, A' is not just a representation of A, it actually is A: A' = A.

Is a Pushdown Automaton with an Epsilon Transition a NDPA?

Let's suppose we have this PA:
-> q0 (e, e -> $) --> q1
Where:
q0 is a final and initial state;
e is epsilon (empty); and q1 is another state.
If the automaton were to read the e word, it could either make the transition to q1 or stop in q0.
So, would this PA be Non-Deterministic?
My teacher says it wouldn't because, in reality, there's only one path for the automaton to follow: since the word is empty and all symbols had already been consumed in q0, it would make no transition whatsoever; however, we're not sure if he's right (by the way, he says that in order for a PA to recognize a word it needs not only to be in a final state but also all the word's symbols must have been consumed).
For a PA to be deterministic it must follow at least the following rule:
If there is an epsilon transition from a state q, there must not be any alphabet transition from that state.
So in your case, if there isn't any other rule, the PA is deterministic.

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase. Parsers evidently have an exponential lower bound on the size of the resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
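Since the grammar grows quickly with n, an alternative in the spirit of the semantic-predicate workaround mentioned in the question is to accept any sequence of allowed tokens and reject repeats with a side check. The Python sketch below is not a grammar and not the ANTLR solution itself; it assumes whitespace-separated tokens, and the function name parse_permutation and the flat list used as an "AST" are hypothetical choices for illustration.

```python
def parse_permutation(text, allowed):
    """Return the tokens in order if each allowed token occurs at most once.

    Raises ValueError on an unknown or repeated token.
    """
    seen = set()
    ast = []
    for tok in text.split():
        if tok not in allowed:
            raise ValueError(f"unknown token: {tok!r}")
        if tok in seen:
            # This is the "semantic predicate": the token was already used.
            raise ValueError(f"repeated token: {tok!r}")
        seen.add(tok)
        ast.append(tok)
    return ast

def is_valid(text, allowed):
    """Boolean wrapper around parse_permutation."""
    try:
        parse_permutation(text, allowed)
        return True
    except ValueError:
        return False
```

The check runs in time linear in the input and its size does not depend on the number of permutations, at the cost of moving the constraint out of the grammar.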

Pushdown Automata definition

I am quite confused about the formal description of a PDA (pushdown automaton).
If we write L(M), this means that the PDA M recognizes the language L, correct?
Then L(M)* means that the PDA M recognizes L*, right?
But what does L* mean? How can a PDA recognize infinitely many combinations of L?
L(M)* means that PDA M recognizes that L* right ?
No! It denotes the language formed by concatenating any number of sentences (including zero) that are valid in the language recognized by the PDA M.
Given that L(M) is context-free, it is easy to prove that L(M)* is context-free too.
To construct a grammar for L(M)*, take the whole grammar of L(M), with its starting symbol S, and add two new productions R -> S R and R -> empty, where R is a fresh nonterminal that becomes the starting symbol of the grammar for L(M)*.
So, given that L(M)* is context-free, there is some PDA that recognizes it. If you can construct a PDA for L(M), then a PDA for L(M)* should be straightforward to construct, given that its grammar is almost the same as that of L(M), just with two extra productions.
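The two extra productions can be added mechanically. The Python sketch below assumes a grammar is stored as a dict from each nonterminal to a list of right-hand sides, each a list of symbols, with the empty list standing for the empty production; the function name star_grammar and the sample grammar are hypothetical.

```python
def star_grammar(productions, start, new_start="R"):
    """Return (grammar, start symbol) for L* given a grammar for L.

    `productions` maps each nonterminal to a list of right-hand sides,
    where a right-hand side is a list of symbols and [] is the empty production.
    """
    if new_start in productions:
        raise ValueError("the new start symbol must not already be in use")
    # Copy the original grammar so the input is left untouched.
    result = {nt: [list(rhs) for rhs in rhss] for nt, rhss in productions.items()}
    # R -> S R and R -> empty
    result[new_start] = [[start, new_start], []]
    return result, new_start

# Hypothetical sample grammar for { a^n b^n : n >= 1 }: S -> a S b | a b
g = {"S": [["a", "S", "b"], ["a", "b"]]}
g_star, r = star_grammar(g, "S")
```

With the sample grammar, R derives zero or more copies of S side by side, for example the word aabbab via R => S R => S S R => S S.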

Testing intersection of two regular languages

I want to test whether two languages have a string in common. Both of these languages are from a subset of regular languages described below and I only need to know whether there exists a string in both languages, not produce an example string.
The language is specified by a glob-like string like
/foo/**/bar/*.baz
where ** matches 0 or more characters, and * matches zero or more characters that are not /, and all other characters are literal.
Any ideas?
thanks,
mike
EDIT:
I implemented something that seems to perform well, but have yet to try a correctness proof. You can see the source and unit tests
Build FAs A and B for both languages, and construct the "intersection FA" AnB. If AnB has at least one accepting state accessible from the start state, then there is a word that is in both languages.
Constructing AnB could be tricky, but I'm sure there are FA textbooks that cover it. The approach I would take is:
The states of AnB are the Cartesian product of the states of A and B respectively. A state in AnB is written (a, b) where a is a state in A and b is a state in B.
A transition (a, b) ->r (c, d) (meaning, there is a transition from (a, b) to (c, d) on symbol r) exists iff a ->r c is a transition in A, and b ->r d is a transition in B.
(a, b) is a start state in AnB iff a and b are start states in A and B respectively.
(a, b) is an accepting state in AnB iff each is an accepting state in its respective FA.
This is all off the top of my head, and hence completely unproven!
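The product construction and the reachability test can be combined so that only reachable pairs (a, b) are ever built. The Python sketch below assumes both automata are deterministic (possibly with missing transitions) and are given as tuples (start, accepting, delta) with delta[(state, symbol)] = next state; this representation, the function name languages_intersect and the example machines are assumptions for illustration, not the asker's implementation.

```python
from collections import deque

def index_by_state(delta):
    """Regroup delta[(state, symbol)] -> next into out[state][symbol] -> next."""
    out = {}
    for (q, sym), r in delta.items():
        out.setdefault(q, {})[sym] = r
    return out

def languages_intersect(a, b):
    """True iff some word is accepted by both deterministic automata a and b."""
    start_a, acc_a, delta_a = a
    start_b, acc_b, delta_b = b
    out_a, out_b = index_by_state(delta_a), index_by_state(delta_b)
    start = (start_a, start_b)
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        # (p, q) is accepting iff each component is accepting in its own FA.
        if p in acc_a and q in acc_b:
            return True
        # (p, q) -> (p2, q2) on sym exists iff both machines move on sym.
        for sym, p2 in out_a.get(p, {}).items():
            moves_b = out_b.get(q, {})
            if sym in moves_b:
                nxt = (p2, moves_b[sym])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Hypothetical examples over {a, b}:
ends_in_a = (0, {1}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0})
even_len = ("e", {"e"}, {("e", "a"): "o", ("e", "b"): "o",
                         ("o", "a"): "e", ("o", "b"): "e"})
only_b = ("c", {"c"}, {("c", "b"): "c"})
```

The search visits at most |A| × |B| pairs, so no strings need to be enumerated. For NFAs the same idea applies with sets of successor states instead of single states.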
I just did a quick search and this problem is decidable (i.e. it can be done), but I don't know of any good algorithms to do it. One solution is:
Convert both regular expressions to NFAs A and B
Create an NFA, C, that represents the intersection of A and B.
Now try every string of length 0 up to the number of states in C and see if C accepts it (a longer accepted string must repeat a state at some point, so removing the loop gives a shorter accepted string).
I know this might be a little hard to follow, but this is the only way I know how.
