Are those languages regular or not - automata

Hi I would need help from you, I got the following languages and need to determine if are regular or not.
Now I think that Y is not regular and I applied the Pumping Lemma to determine that.
For X I am not sure if is a regular language or not, I was thinking that X is the set of strings with an odd number of a's that can be easily represented with an NFA.
Can anyone help me with that ?

The first one (X) is regular, because you can construct a finite automaton for it:
(start) --- a --> (final) -- a --> (state)
^ |
\------ a -------/
The second one (Y) is not regular, because you cannot construct a finite automaton for it. It would require memory to store the number of a to be able to later find one more of b. That language is context-free, with a grammar:
S = T b
T = a T b
T = ε

Related

Automata and Formal Languages

Showing that the reverse of a word for a regular language L is also regular
I am confused as to how I am to approach this question, i've been stuck for hours: For a word x, we use x^r to denote its reverse. For a language L, we use L^r to denote {x^r where x is in the set of L}. Show that if L is regular then so is L^r
If L is regular, then there exists some regular grammar which generates it. It can be always represented as either a left-regular grammar, or a right-regular grammar. Let's assume that it's left-regular grammar G_l(the proof for right-regular grammar is analogous).
This grammar has productions of two types; the terminating-type:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
or the chaining type:
B -> Ca, where B, C are non-terminals and a is a terminal
When we apply reverse to a regular language, we basically also apply it to the tails of productions (since heads are just single non-terminals). It's going to be proved later on. So we get a new grammar G_r, with productions:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
B -> aC, where B, C are non-terminals and a is a terminal
But hey, it's a right-regular grammar! So the language it accepts is also regular.
There is one thing to do - to show that reversing tails actually does the thing it's supposed to. We're going to prove it very simply:
If L contains \epsilon, then there is production 'S -> \epsilon' in G_l. Since we don't touch productions like that, it's also present in G_r.
If L contains a, a word composed of a single terminal, then it's similar to the above
If L contains aZ, where a is a terminal and Z is a word from the language constructed from chopping off the first terminals out of words in L, then L^r contains (because of changes to the chaining productions) (Z^r)a. Z is also a regular language, since it can be constructed by dropping the first "level" of left-productions from G_l, which leaves us with a regular grammar.
I hope it helped. There's also an arguably easier way of doing that by reversing edges of the relevant finite automata and changing accepting and entry states a bit.

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).

What is a "Production" in plain English?

I can read on Wikipedia the formal definition of a Production, however when you start reading that article, it makes an assumption about prior knowledge.
Wikipedia defines it as follows:
A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences.
This assumes that I know and understand what a rewrite rule is. I don't, and if I click the link, I get into another fairly technical explanation.
Can someone explain to me in plain English what a Production actually is?
Note: I have made many attempts to understand this, but I don't think I've succeeded. From what I can tell it rewrites the given string in terms of grammar rules. Not sure if I'm correct.
To explain what a production is I'd like to introduce a bit of context first.
The dragon book states that a context free grammar has 4 components:
a set of terminal symbols (tokens)
a set of non-terminal symbols (syntactic variables)
a set of productions of the form: non-term --> sequence of terminals and non-terminals
a non-terminal symbol designated as the start symbol
It is also said that parsing is the problem of taking a string of terminals (the source code) and figuring out what are the steps required to derive this string of terminals from the start symbol of the grammar.
Now that this has been said, a production is essentially a possible (intermediate) step. I say possible because some symbols can derive into different sequences.
For example, let's make a simple grammar to represent an arbitrarily long sequences of a's ending with a b. The 4 components of this grammar would be:
Terminals: a, b
Non-terminals: S, X
Rules: S --> X, X --> aX, X --> ab
Start symbol: S
From the description I gave above "aaaab" should be derivable from this grammar. Let's see if that holds up. We start from, the start symbol, and then apply productions until a) we get the final sequence, b) we exhaust all possibilities without succeeding (meaning the sequence is not "grammatically correct").
S
X (after applying S --> X)
aX (after applying X --> aX)
aaX (after applying X --> aX)
aaaX (after applying X --> aX)
aaaab (after applying X --> ab)
And we're done, we got the original sequence. So as you can see we re-wrote the non-terminal symbols by applying rules (one of them we applied recursively) which transformed the sequence into a new sequence of symbols at every step and we did so until we had the final sequence.
A rewrite rule is a method of replacing subterms of a formula with other terms. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.
An example of a rewrite rule could look like:
A → B
Now as for what this actually does! You are right on your note, take for example a list of things (and 2 rewrite rules):
X, Y, Z
X → Y
Y → Z
Which would result in:
Z, Z, Z
A production rule is a rewrite rule because it is a method of replacing subterms of a formula (probably a string in your case). A production rule could look like:
X, Y, Z
X → aX
By using the rule in such a way it becomes possible to apply recursion (create new sequences) as it will keep replacing itself:
aX, Y, Z
aaX, Y, Z
aaaX, Y, Z
As for the question you are asking, you could say: "A production rule is a replacement rule for formulas that uses recursion to create new sequences".

What exactly is the 'pumping length' in the Pumping lemma?

I'm trying to understand what is this 'magical' number 'n' that is used in every application of the Pumping lemma. After hours of research on the subject, I came to the following website: http://elvis.rowan.edu/~nlt/TheoryNotes/PumpingLemma.pdf
It states
n is
the longest a string can be without having a loop. The biggest n can
be is s, though it might be smaller for some particular language.
From what I understand if there is a Language L then the pumping length of L is the amount of states in the Finite State Automata that recognizes L. Is this true?
If it is then what exactly does the last line from above say "though it might be smaller for some particular language"? Complete mess in my head. Could somebody clear it up, please?
A state doesn't recognise a language. A DFA recognises a language by accepting exactly the set of words in the languages and no others. A DFA has many states.
If there is a regular language L, which can be modelled by the Pumping Lemma, it will have a property n.
For a DFA with s states, in order for it to accept L, s must be >= n.
The last line merely states that there are some languages in which s is greater than n, rather than equal.
This is probably more suited for https://cstheory.stackexchange.com/ or https://cs.stackexchange.com/ (not quite sure of the value of both myself).
Note: Trivially, not all DFA's with sufficient states will accept the language. Also, the fact that a language passes the pumping lemma doesn't mean it's regular (but failing it means definitely isn't).
Note also, the language changes between FA and DFA - this is a bit lax, but because NDFAs have the same power as DFAs and DFAs are easier to write and understand, DFAs are used for the proof.
Edit I'll give an example of a regular language, so you can see an idea of u,v,w,z and n. Then we'll discuss s.
L = xy^nz, n > 2 (i.e. xyyz, xyyyz, xyyyyz)
u = xy
v = y
w = z
n = 4
Hence:
|z| = 3: xyz (i = 0) Not in L, but |z| < n
|z| = 4: xyyz (i = 1)
|z| = 5: xyyyz (i = 2)
etc
Hence, it's modelled by the Pumping Lemma. A DFA would need at least 4 states. So let's think of one.
-> State 1: x
State 1:
-> State 2: y
State 2:
-> State 3: y
State 3:
-> State 3: y
-> State 4: z
State 4:
Accepting state
Terminating state
4 states, as expected. Not possible to do it in less, as expected by n = 4. In this case, the example is quite simple so I don't think you can build one with more than 4 states (but that would be okay if it were needed).
I think a possibility is when you have a FA with an unreachable state. The FA has s states, but 1 is unreachable, so all strings recognizing L will be comprised of (s-1)=n states, so n<s.

Explanation on this FIRST function

LL(1) Grammar:
(1) Var -> ID DimList
(2) DimList -> ε DimList'
(3) DimList' -> Dim DimList'
(4) DimList' -> ε
(5) Dim -> [ CONST ]
And, in the script that I am reading, it says that the function FIRST(ε DimList') gives {#, [}. But, how?
My guess is that since the right side of (2) begins with ε, it skips epsilon and takes FIRST(DimList') which is, considering (3) and (5), equal to {[}, BUT also, because of (4), takes FOLLOW(DimList') which is {#}.
Other way it could be is that, since (2) begins with ε it skips epsilon and takes FIRST(DimList') BUT ALSO takes FOLLOW(DimList) from (2)...
First one makes more sense to me, though I'm still in the process of learning basics of LL(1) grammars so I would appreciate if someone takes the time to make this clear, thank you.
EDIT: And, of course, it could be that neither of these is true.
The usual definition of the FIRST function would result in FIRST(Dimlist) (or, if you like, FIRST(ε Dimlist') being {ε, [}. ε is in FIRST(ε Dimlist') because both ε and Dimlist' are nullable. [ is an element because it could be the first symbol in a derivation of ε Dimlist, which is the same as saying that it could be the first symbol in a derivation of Dimlist'.
Another way of saying this is that:
FIRST(ε Dimlist' #) = {#, [}
We usually then define the function PREDICT:
PREDICT(ω) = FIRST(ω FOLLOW(ω))
and we can see that
PREDICT(Dimlist) = FIRST(Dimlist FOLLOW(Dimlist)) = {#, [}
Here, FIRST(ω) is the set of strings of terminals (of length ≤ 1) which could appear at the beginning of a derivation of ω, while PREDICT(ω) is the set of strings of terminals (of length ≤ 1) which could be present in the input when a derivation of ω is possible.
It's not uncommon to confuse FIRST and PREDICT, but it's better to keep the difference straight.
Note that all of these functions can be generalized to strings of maximum length k, which are usually written FIRSTk, FOLLOWk and PREDICTk, and the definition of PREDICTk is similar to the above:
PREDICTk(ω) = FIRSTk(ω FOLLOWk(ω))

Resources