Finite automata to regular expression clarification - automata

Can you check on this: https://dl.dropbox.com/u/25439537/finite%20automata.png
This is a checked homework, so don't worry. I just want to clarify whether my answer is correct or not, because it is marked by my teacher as incorrect.
My answer is ((a+b)(a+b))*a
The first (a+b) signifies the upper arrows. The second (a+b) signifies the lower arrows. The last 'a' tells us that it should always end in 'a'.
I just want to collect evidence from several experts so that I can show it to my teacher.

I believe your answer is correct.
Let's consider the whole process as two parts: (1) start with start, and go back to start; and (2) go from start to end and accept. Obviously, the (1) part is a loop.
For (1), starting from start, the automaton reads either b or a. For b, it's b(a+b) to go back. For a, it's a(a+b) to go back. So (1) is b(a+b) + a(a+b), which is (a+b)(a+b).
For (2), it's a.
So, the final result is (loop in (1))* (2), i.e. ((a+b)(a+b))*a.
Following the description above, you can also come up with a proof of the equivalence between the two: part (a), every sequence accepted by the automaton is in the set ((a+b)(a+b))*a; part (b), every sequence in the set ((a+b)(a+b))*a is accepted by the automaton.
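One quick sanity check on the regex itself (not a substitute for the proof) is to enumerate short strings and see which ones it matches. The sketch below uses Python's re module, writing the union + as |. The helper name matches_up_to is made up for illustration, and the automaton from the picture is not modelled here.

```python
import re
from itertools import product

# ((a+b)(a+b))*a in textbook notation; Python's re writes union as "|".
PATTERN = re.compile(r"((a|b)(a|b))*a")

def matches_up_to(max_len):
    """Return every string over {a, b} of length <= max_len that the regex matches."""
    found = []
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if PATTERN.fullmatch(word):
                found.append(word)
    return found
```

For lengths up to 3 this yields a, aaa, aba, baa and bba, i.e. the odd-length strings that end in a.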

Your answer is wrong, because it doesn't provide for strings starting with b.
The path (start) -> b -> a+b -> a -> (end) is accepted by your finite automaton, but not by your regex. The simplest counterexample to your answer being correct is the regex's rejection of the string "baba".
By the way, if the teacher gave you that automaton without the "end" state having two concentric circles (to indicate that it is an accept state), it was probably a trick question. Having no accept state means your automaton rejects everything. The best way to describe that would be to just write down {} (the empty set).


What production rule should I use to reduce in bottom-up parsing?

So far, my understanding of the bottom-up parsing algorithm is this:
1. shift a token onto the stack
2. check from the top of the stack whether some elements, including the top, can be reduced by a production rule
3. if they can be reduced, pop them and push the left-hand side of the production rule
4. continue these steps until the top is the start symbol and the next input is EOF
So to support my question with an example grammar,
S → aABe
A → Abc
A → b
B → d
if we have the input string
abbcde$
we will shift a onto the stack,
and because there is no production rule that reduces a, we shift the next token b.
Then we can find the production rule A → b and reduce b to A.
Then my question is this. We have aA on the stack and the next input is b. How can the parser determine whether to reduce b to A or to wait for c to come and use the rule A → Abc?
Well of course, reducing b to A at that point results in an error. But how does the parser know at that point that we should wait for c?
I'm sorry if I missed something while studying.

That's an excellent question, and it will be addressed in the next part of your course.
For now, it's sufficient to pretend that there is some magic black box which tells the parser when it should reduce (and, sometimes, which of several possible productions to use).
The various parsing algorithms explain the construction of this black box. Note that one possible solution is to fork reality and try both actions in parallel, but a more common solution is to process the grammar in order to work out how to predict the correct action.
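To make the "fork reality" option concrete, here is a minimal sketch of a backtracking shift-reduce recognizer for the grammar in the question. It is not how a real LR parser works: it simply tries every reduction and then the shift at each step, and backs out of dead ends. The function name shift_reduce and the trace format are my own, and the input is given without the trailing $.

```python
# Grammar from the question: (left-hand side, right-hand side).
GRAMMAR = [("S", "aABe"), ("A", "Abc"), ("A", "b"), ("B", "d")]

def shift_reduce(stack, rest, trace=()):
    """Return the list of actions that reduces the input to S, or None.

    stack and rest are strings of one-character symbols. Every choice point
    is explored depth-first, so wrong guesses are simply abandoned.
    """
    if stack == "S" and not rest:
        return list(trace)
    # Option 1: reduce by any rule whose right-hand side is on top of the stack.
    for lhs, rhs in GRAMMAR:
        if stack.endswith(rhs):
            step = f"reduce {lhs} -> {rhs}"
            found = shift_reduce(stack[: -len(rhs)] + lhs, rest, trace + (step,))
            if found is not None:
                return found
    # Option 2: shift the next input symbol.
    if rest:
        return shift_reduce(stack + rest[0], rest[1:], trace + (f"shift {rest[0]}",))
    return None
```

Calling shift_reduce("", "abbcde") finds the sequence that shifts the second b and the c before reducing with A → Abc; the branch that reduces the second b to A right away dead-ends and is abandoned. A table-driven parser avoids this search by precomputing which action to take.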

How to parse this simple grammar? Is it ambiguous?

I'm delving deeper into parsing and came across an issue I don't quite understand. I made up the following grammar:
S = R | aSc
R = b | RbR
where S is the start symbol. It is possible to show that abbbc is a valid sentence of this grammar (hopefully that is correct, but I may have completely misunderstood something). If I try to implement this using recursive descent, I seem to have a problem when trying to parse abbbc using a leftmost derivation, e.g.
S => aSc
aSc => aRc
at this point I would have thought that recursive descent would pick the first option in the second production because the next token is b leading to:
aRc => abc
and we're finished since there are no more non-terminals, but the result of course isn't abbbc. The only way to show that abbbc is valid is to pick the second option, but with one token of lookahead I assume it would always pick b. I don't think the grammar is ambiguous unless I missed something. So what am I doing wrong?
Update: I came across this nice derivation app at https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/. I used it to do a sanity check that abbbc is a valid sentence, and it is.
Thinking more about this problem, is it true to say that I can't use LL(1) to parse this grammar but in fact need LL(2)? With two tokens of lookahead I could correctly pick the second option in the second production, because I would also know there are more tokens to be read and therefore that picking b would prematurely terminate the derivation.

For starters, I’m glad you’re finding our CFG tool useful! A few of my TAs made that a while back and we’ve gotten a lot of mileage out of it.
Your grammar is indeed ambiguous. This stems from your R nonterminal:
R → b | RbR
Generally speaking, if a recursive production rule contains two copies of the same nonterminal, it will lead to ambiguities because there will be multiple options for how to apply the rule twice. For example, in this case, you can derive bbbbb by first expanding R to RbR, then either
expanding the left R to RbR and converting each R to a b, or
expanding the right R to RbR and converting each R to a b.
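If you'd like to see the ambiguity as numbers, here is a small sketch that counts the distinct parse trees for a string of n b's under R → b | RbR. The function name count_trees is just for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(n):
    """Number of parse trees deriving a string of n b's from R -> b | R b R."""
    if n == 1:
        return 1  # R -> b
    total = 0
    # R -> R b R: the left R covers `left` b's, the middle b covers one,
    # and the right R covers whatever remains.
    for left in range(1, n - 1):
        right = n - 1 - left
        total += count_trees(left) * count_trees(right)
    return total
```

Here count_trees(5) is 2, matching the two derivations of bbbbb listed above, and any count greater than 1 means the grammar is ambiguous.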
Because this grammar is ambiguous, it isn’t going to be LL(k) for any choice of k because all LL(k) grammars must be unambiguous. That means that stepping up the power of your parser won’t help here. You’ll need to rewrite the grammar to not be ambiguous.
The nonterminal R that you’ve described here generates strings consisting of an odd number of b’s, so we could try redesigning R to achieve this more directly. An initial try might be something like this:
R → b | bbR
This, unfortunately, isn’t LL(1), since after seeing a single b it’s unclear whether you’re supposed to apply the first production rule or the second. However, it is LL(2).
If you’d like an LL(1) grammar, you could do something like this:
R → bX
X → bbX | ε
This works by laying down a single b, then laying down as many optional pairs of b’s as you’d like.
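As a rough sketch of what a recursive descent parser for the rewritten grammar could look like (S → R | aSc, R → bX, X → bbX | ε), here is one in Python. Each function takes the input string and a position and returns the position after the part it consumed. The names ParseError, expect and accepts are my own choices, not part of any library.

```python
class ParseError(Exception):
    pass

def expect(s, i, ch):
    """Consume the terminal ch at position i, or fail."""
    if i < len(s) and s[i] == ch:
        return i + 1
    raise ParseError(f"expected {ch!r} at position {i}")

def parse_S(s, i):
    # S -> a S c | R, chosen with one token of lookahead.
    if i < len(s) and s[i] == "a":
        i = parse_S(s, i + 1)
        return expect(s, i, "c")
    return parse_R(s, i)

def parse_R(s, i):
    # R -> b X
    return parse_X(s, expect(s, i, "b"))

def parse_X(s, i):
    # X -> b b X | epsilon: keep taking pairs of b's while the lookahead is b.
    while i < len(s) and s[i] == "b":
        i = expect(s, i + 1, "b")
    return i

def accepts(s):
    """True if the whole string s is derivable from S."""
    try:
        return parse_S(s, 0) == len(s)
    except ParseError:
        return False
```

With this, accepts("abbbc") and accepts("abc") are both true, while accepts("abbc") is false because the b's between a and c must come in an odd number.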

How do ɛ-transitions work in nondeterministic finite automata?

I am confused about the implementation of a language by an automaton. Does the automaton go directly to the next state if there is an ɛ-transition? Suppose I have an automaton consisting of three states a, b, and c (where a is the initial state and c the accepting state) with alphabet {0,1}. How does the following work?
a----ɛ--->(b----0---->a)
          (b----1---->c)
Is the string "1" accepted? What if we had
a---1--->b----ɛ--->c
? Would the string "1" be accepted?

Does the automaton go directly to the next state if there is an ɛ-transition?
Roughly speaking, yes. An ɛ-transition (in a non-deterministic finite automaton, or NFA, for short) is a transition that is not associated with the consumption of any symbol (0 or 1, in this case). Once you understand that, it's easy (in this case) to derive deterministic finite automata (or DFA, for short) that are equivalent to your NFAs and identify the languages that the latter describe.
Suppose I have an automaton [...] Is the string "1" accepted?
Yes. Here is a nicer diagram (courtesy of LaTeX and tikz) of your first NFA:
An equivalent DFA would be:
Once you have that, it's easy to see that the language accepted by that NFA is the set of strings that
start with zero or more 0's,
end with exactly one 1.
The string "1", because it starts with zero 0's and ends with one 1, is indeed accepted.
What if we had [...]? Would the string "1" be accepted?
Yes. Here is a nicer diagram of your second NFA:
An equivalent DFA would be:
In fact, it's easy to see that "1" is the only accepted string, here.
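If you want to check these answers mechanically, here is a minimal sketch of an NFA simulator that follows ɛ-transitions for free. The representation (a dict for symbol transitions, a separate dict for ɛ-transitions) and the function names are assumptions of mine, not a standard API.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def accepts(word, start, accepting, delta, eps):
    """Simulate the NFA on `word`, tracking the set of possible states."""
    current = epsilon_closure({start}, eps)
    for symbol in word:
        moved = {r for q in current for r in delta.get((q, symbol), ())}
        current = epsilon_closure(moved, eps)
    return bool(current & accepting)

# First NFA from the question: a --eps--> b, b --0--> a, b --1--> c.
NFA1 = dict(start="a", accepting={"c"},
            delta={("b", "0"): {"a"}, ("b", "1"): {"c"}},
            eps={"a": {"b"}})

# Second NFA from the question: a --1--> b, b --eps--> c.
NFA2 = dict(start="a", accepting={"c"},
            delta={("a", "1"): {"b"}},
            eps={"b": {"c"}})
```

Both accepts("1", **NFA1) and accepts("1", **NFA2) come out true. The first NFA also accepts "001", and the second rejects everything except "1".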

LALR parsers and look-ahead

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really come up when building an LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but rather to deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.

The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue                  remaining input             action
---------------------- --------------------------- -----
                       ELEMENT , ELEMENT , ELEMENT SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT                , ELEMENT , ELEMENT         REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a ',' in the incoming input. This goes on for a while:
List                   , ELEMENT , ELEMENT         SHIFT
List ,                 ELEMENT , ELEMENT           SHIFT
List , ELEMENT         , ELEMENT                   REDUCE 3
List                   , ELEMENT                   SHIFT
List ,                 ELEMENT                     SHIFT
List , ELEMENT         --                          REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List                   --                          REDUCE 1
Start                  --                          ACCEPT
and the parse is successful.
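The walk-through above can be sketched as a tiny hand-written shift-reduce loop. To be clear, this is not a generated LALR table: the decisions are hard-coded for this one grammar, and the function name parse and the use of None for the end-of-input pseudo-token are my own conventions. The only place the lookahead matters is the choice between production 1 and a shift.

```python
def parse(tokens):
    """Return the list of actions taken for a comma-separated ELEMENT list."""
    queue, rest, actions = [], list(tokens), []
    while True:
        lookahead = rest[0] if rest else None  # None plays the role of "--"
        if queue == ["Start"] and lookahead is None:
            actions.append("ACCEPT")
            return actions
        if queue[-3:] == ["List", ",", "ELEMENT"]:
            queue[-3:] = ["List"]               # List -> List ',' ELEMENT
            actions.append("REDUCE 3")
        elif queue == ["ELEMENT"]:
            queue[:] = ["List"]                 # List -> ELEMENT
            actions.append("REDUCE 2")
        elif queue == ["List"] and lookahead is None:
            queue[:] = ["Start"]                # Start -> List, only at end of input
            actions.append("REDUCE 1")
        elif lookahead is not None:
            queue.append(rest.pop(0))
            actions.append("SHIFT")
        else:
            raise SyntaxError(f"cannot continue with queue {queue}")
```

Running parse("ELEMENT , ELEMENT , ELEMENT".split()) produces the same action column as the trace above.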
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ',' because ',' is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
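For the list grammar, the two sets can be computed with a simple fixed-point iteration. The sketch below is deliberately simplified: it assumes the grammar has no empty right-hand sides (true here), so FIRST of a right-hand side is just FIRST of its first symbol. The names GRAMMAR, END, first_sets and follow_sets are mine, and "$" stands for the end-of-input pseudo-token.

```python
GRAMMAR = [
    ("Start", ["List"]),
    ("List", ["ELEMENT"]),
    ("List", ["List", ",", "ELEMENT"]),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
END = "$"

def first_sets():
    """FIRST sets, assuming no production has an empty right-hand side."""
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

def follow_sets(first):
    """FOLLOW sets under the same no-empty-productions assumption."""
    follow = {nt: set() for nt in NONTERMINALS}
    follow["Start"].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    nxt = rhs[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    new = follow[lhs]  # sym ends the rule: inherit FOLLOW(lhs)
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow
```

This gives FOLLOW(Start) = {$} and FOLLOW(List) = {$, ','}, which is why a ',' lookahead rules out reducing by production 1 but allows reducing by productions 2 and 3.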
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which led to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.

LL-1 Parsers: Is the FOLLOW-Set really necessary?

As far as I understand, the FOLLOW set is there to tell me at the first possible moment whether there is an error in the input stream. Is that right?
Because otherwise I'm wondering what you actually need it for. Suppose your parser has a non-terminal on top of the stack (in our class we used a stack as an abstraction for LL parsers)
i.e.
[TOP] X...[BOTTOM]
The X - let it be a non-terminal - is to be replaced in the next step since it is at the top of the stack. Therefore the parser asks the parsing table what derivation to use for X. Consider the input is
+ b
Where + and b are both terminals.
Suppose X has "" (i.e. the empty string) in its FIRST set, and it has NO + in its FIRST set.
As far as I can see, in this situation the parser could simply check that there is no + in the FIRST set of X and then use the derivation that lets X dissolve into "" (the empty string), since that is the only way the parser can possibly continue parsing the input without throwing an error. If the input stream is invalid, the parser will recognize it anyway at some later moment. I understand that the FOLLOW set can help here to identify right away whether parsing can continue without an error or not.
My question is - is that really the only role that the FOLLOW set plays?
I hope my question belongs here - I'm sorry if not. Also feel free to request clarification in case something is unclear.
Thank you in advance

You are right about that. The parser could just continue parsing and would eventually detect the error in another way.
Besides that, the FOLLOW set can be very convenient in reasoning about a grammar. Not by the parser, but by the people who construct the grammar. For instance, if you discover that a FIRST/FIRST or FIRST/FOLLOW conflict exists, the grammar is not LL(1) (and it may be ambiguous), so you may want to revise it.
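To illustrate the two strategies from the question, here is a small sketch of how a parser might pick a production for a nullable non-terminal X. Everything in it is hypothetical: the function name choose_production, the way alternatives are labelled by strings, and the example sets (X → a X | ε with FOLLOW(X) = {+}) are invented for illustration only.

```python
def choose_production(lookahead, first, follow=None):
    """Pick an alternative of a nullable non-terminal X for this lookahead.

    first:  maps each non-empty alternative (a label) to its FIRST set.
    follow: FOLLOW(X), or None to fall back to the empty string
            unconditionally and let a later step detect any error.
    """
    for alternative, first_set in first.items():
        if lookahead in first_set:
            return alternative
    if follow is None or lookahead in follow:
        return "ε"
    raise SyntaxError(f"unexpected {lookahead!r}")

# Hypothetical example: X -> a X | ε, with FOLLOW(X) = {"+"}.
FIRST_X = {"a X": {"a"}}
FOLLOW_X = {"+"}
```

With the FOLLOW set, choose_production("b", FIRST_X, FOLLOW_X) raises immediately; without it, choose_production("b", FIRST_X) quietly picks ε and the error only surfaces when some later symbol fails to match b.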
