I need help on proves about automata - automata

Prove or disprove the following statements about the set of languages decided
by each of the automata:
1) DPDA ⊆ 2-stack DPDA
2) 2-stack DPDA ⊆ DPDA
3) NPDA ⊆ 2-stack DPDA
4) 2-stack DPDA ⊆ NPDA

1) A DPDA has one stack, so a DPDA with two stacks can definitely do anything a DPDA can do just by not using (or trivially using) the second stack.
2) A two stack DPDA can recognize the language of words with the same numbers of a's, b's and c's by using the first stack to ensure the number of a's is equal to the number of b's, and the second stack to ensure the number of b's is equal to the number of c's. Because this language is not context-free, there is no one-stack DPDA for it; therefore, 2-stack DPDA is not equal. Because we know DPDA is subset-or-equal to 2-stack DPDA, then we know 2-stack DPDA is not subset-or-equal to DPDA.
3) this is hard to say, will attempt later
4) certainly not, by the same argument as for (2).
As for (3) - I strongly suspect that a 2-stack DPDA can simulate a deterministic TM, which would imply NPDA is a subset of 2-stack DPDA. To prove this, one needs to show how to simulate the operation of any TM using a 2-stack DPDA. I think something like this visualization maybe suffices...
_ _ _ _ _ _ _ _ _ _ _ _ _ _
|Z|_|_|_|_|_|_| |_|_|_|_|_|_|Z|
stack #1 ^ stack #2
TM
The input is originally recorded entirely on stack #2. To move right, the TM pops from stack #2 and pushes to stack #1. To move left, the TM pops from stack #1 and pushes to stack #2. The current tape cell is always the top of stack #2. The 2-stack DPDA can do whatever a deterministic TM can do in terms of states and transitions; and it can use the two stacks to simulate a tape as above. For instance, a TM to accept the nondeterministic context-free language of all palindromes works as follows:
check the first cell. if empty, halt-accept. Otherwise, write a blank and move to state End(s) where s is the symbol observed.
while in state End(s), move right until you find a blank. Then, move left and enter state Check(s).
while in state Check(s), confirm the observed symbol is s. If not, then halt-accept if blank and halt-reject otherwise. Otherwise, write a blank and enter the state IsFinished and move left.
while in state IsFinished, check to see whether the tape cell is blank. If so, halt-accept. Otherwise, enter state Start and move left.
while in state Start, move left until you find a blank. Then, move right and enter the initial state again.
Example on a TM vs a 2-stack DPDA:
DTM 2-stack DPDA
------- --------------------
tape State Stack #1 Stack #2
B11011B Z 11011Z
^ initial ^
BB1011B ZB 1011Z
^ End(1) ^
BB1011B ZB1 011Z
^ End(1) ^
BB1011B ZB10 11Z
^ End(1) ^
BB1011B ZB101 1Z
^ End(1)
BB1011B ZB1011 Z
^ End(1) ^
BB1011B ZB101 1Z
^ Check(1) ^
BB101BB ZB10 1BZ
^ IsFin ^
BB101BB ZB1 01BZ
^ Start ^
BB101BB ZB 101BZ
^ Start ^
BB101BB Z B101BZ
^ Start ^
BB101BB ZB 101BZ
^ Initial ^
BBB01BB ZBB 01BZ
^ End(1) ^
BBB01BB ZBB0 1BZ
^ End(1) ^
BBB01BB ZBB01 BZ
^ End(1) ^
BBB01BB ZBB0 1BZ
^ Check(1) ^
BBB0BBB ZBB 0BBZ
^ IsFin ^
BBB0BBB ZB B0BBZ
^ Start ^
BBB0BBB ZBB 0BBZ
^ initial ^
BBBBBBB ZBBB BBZ
^ End(0) ^
BBBBBBB ZBB BBBZ
^ Check(0) ^
BBBBBBB ZBB BBBZ
^ halt-accept ^
Based on this example, I think the construction is correct and 2-stack DPDAs are equivalent to Turing machines. Therefore, NPDA is subset-or-equal to 2-stack DPDA.

Related

How to simplify this left-recursive rule?

I am trying to simplify a rule that has left-recursion in the form of:
A → Aα | β
----------
A → βA'
A' → αA' | ε
The rule I have is:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
From my understanding of the formula, here is what my variables would be:
A = selectStmt
α = setOp selectStmt
β = simpleSelectStmt
A'= selectStmt' // for readability
Then, from application of the rule we get:
1. A → βA'
selectStmt → simpleSelectStmt selectStmt'
2. A' → αA' | ε
selectStmt' -> setOp selectStmt selectStmt' | ε
But then how do I simplify it further to get the final production? In a comment from my previous question at Removing this left-recursive way to define a SELECT statement, it was stated:
In our case a direct application it would take us from what we had originally:
selectStmt: selectStmt (setOp selectStmt) | simpleSelectStmt
to
selectStmt: simpleSelectStmt selectStmt'
and
selectStmt': (setOp selectStmt) | empty
which simplifies to
selectStmt: simpleSelectStmt (setOp selectStmt)?
I don't get how that simplification works. Specifically:
How does selectStmt' -> setOp selectStmt selectStmt' | ε simplify to selectStmt': (setOp selectStmt) | empty? And how is the ε removed here? I assume (setOp selectStmt) | empty simplifies to (setOp selectStmt)? because if it can be empty than it just means the optional ?.
Your starting point:
# I removed the parentheses from the following
selectStmt: selectStmt setOp selectStmt | simpleSelectStmt
is ambiguous. Left recursion elimination does not solve ambiguity; rather, it retains ambiguity. So it's not a bad idea to resolve the ambiguity first.
Real-world parser generators can resolve this kind of ambiguity by using operator precedence rules. Some parser generators require you to write out the precedence rules, but Antlr prefers to use a default set of precedence rules (using the order of the productions in the grammar, and assuming every operator to be left-associative unless otherwise declared). (I mention Antlr because you seem to be using it as a reference implementation, although its production semantics is a bit quirky; the implicit precedence rule is just one of the quirks.)
Translating operator precedence into precise BNF is a non-trivial endeavour. Parser generators tend to implement operator precedence by eliminating certain productions, either at grammar compile time (yacc/bison) or with runtime predicates (Antlr4 and most generators based on the shunting yard algorithm). Nevertheless, since operator precedence doesn't affect the context-free property, we know that there is a context-free grammar in which the ambiguity has been resolved. And in some cases, like this one, it's very easy to find.
This is essentially the same ambiguity as you find in arithmetic expressions; without some kind of precedence resolution, 1+2+3+4 is syntactically ambiguous, with five different parse trees. ((1+(2+(3+4))), (1+((2+3)+4)), ((1+2)+(3+4)), ((1+(2+3))+4), (((1+2)+3)+4)). As it happens, these are semantically identical, because addition is associative (in the mathematical sense). But with other operators, such as - or /, the different parses result in different semantics. (The semantics are also different if you use floating point arithmetic, which is not associative.)
So, in the same way as your grammar, the algebraic grammar which starts:
expr: expr '+' expr
expr: expr '*' expr
is ambiguous; it leads precisely to the above ambiguity. The resolution is to say that + and most other algebraic operators are left associative. That results in an adjustment to the grammar:
expr: expr '+' term | term
term: term '*' factor | factor
...
which is not ambiguous (but is still left recursive).
Note that if we had chosen to make those operators right associative, thereby producing the parse (1+(2+(3+4))), then the unambiguous grammar would be right-recursive:
expr: term '+' expr | term
term: factor '*' term | factor
...
Since those particular operators are associative, so that it doesn't matter which syntactic binding we chose (as long as * binds more tightly than +), we could bypass left-recursion elimination altogether, as long as those were the only operators we cared about. But, as noted above, there are lots of operators whose semantics are not so convenient.
It's worth stopping for a moment to understand why the unambiguous grammars are unambiguous. It shouldn't be hard to understand, and it's an important aspect of context-free grammars.
Take the production expr: expr '+' term. Note that term does not derive 2 + 3; term only allows multiplicative operators. So 1 + 2 + 3 can only be produced by reducing 1 + 2 to expr and 3 to term, leaving expr '+' term, which matches the production for expr. Consequently, ((1+2)+3) is the only possible parse. (1+(2+3)) could only be written with explicit parentheses.
Now, it's easy to do left-recursion elimination on expr: expr '+' term, or selectStmt: selectStmt setOp simpleSelectStmt | simpleSelectStmt, to return to the question at hand. We proceed precisely as you indicate, except that α is setOp simpleSelectStmt. We then get:
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
By back-substituting selectStmt into the first production of selectStmt', we get
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp selectStmt
| ε
That's cool; it's not ambiguous, not left-recursive, and has no LL(1) conflicts. But it does not produce the same parse tree as the original. In fact, the parse tree is quite peculiar: S1 UNION S2 UNION S3 is parsed as (S1 (UNION S2 (UNION S3 ()))).
Intriguingly, this is exactly the same place we would have gotten to had we used the right-associative grammar selectStmt: simpleSelectStmt setOp selectStmt | simpleSelectStmt. That grammar is unambiguous and not left-recursive, but it's not LL(1) because both alternatives start with simpleSelectStmt. So we need to left-factor, turning it into selectStmt: simpleSelectStmt (setop selectStmt | ε), exactly the same grammar as we ended up with from the left-recursive starting point.
But the left-recursive and right-recursive grammars really are different: one of them parses as ((S1 UNION S2) UNION S3) and the other one as (S1 UNION (S2 UNION S3)). With UNION, we have the luxury of not caring, but that wouldn't be the case with a SET DIFFERENCE operator, for example.
So the take-away: left-recursion elimination erases the difference between left and right associative operators, and that difference has to be put back using some non-grammatical feature (such as Antlr's run-time semantics). On the other hand, bottom-up parsers like Yacc/Bison, which do not require left-recursion elimination, can implement either parse without requiring any extra mechanism.
Anyway, let's go back to
selectStmt: simpleSelectStmt selectStmt'
selectStmt': setOp simpleSelectStmt selectStmt'
| ε
It should be clear that selectStmt' represents zero or more repetitions of setOp simpleSelectStmt. (Try that with a pad of paper, deriving successively longer sentences, in order to convince yourself that it's true.)
So if we had a parser generator which implemented the Kleene * operator (zero or more repetitions), we could write selectStmt' as (setOp simpleSelectStmt)*, making the final grammar
selectStmt: simpleSelectStmt (setOp simpleSelectStmt)*
That's no longer BNF --BNF does not have grouping, optionality, or repetition operators-- but in practical terms it's a lot easier to read, and if you're using Antlr or a similar parser generator, it's what you will inevitably write. (All the same, it still doesn't indicate whether setOp binds to the left or to the right. So the convenience does come at a small price.)

Combining unary operators with different precedence

I was having some trouble with Bison creating an operator as such:
<- = identity postfix operator with a low precedence to force evaluation of what's on the left first, e.g. 1+2<-*3 (equivalent (1+2)*3) as well as -> which is a prefix operator which does the same thing but to the right.
I was not able to get the syntax to work properly and tested with Python using - not False, which resulted in a syntax error (in Python, - has a greater precedence than not). However, this is not a problem in C or C++, where - and !/not have the same precedence.
Of course, the difference in precedence has nothing to do with the relationship between the 2 operators, only a relationship with other operators that result in the relative precedences between them.
Why is chaining prefix or postfix operators with different precedences a problem when parsing and how can implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?
Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):
post_unary:
copy
| post_unary "++"
| post_unary "--"
| post_unary '!'
;
Chaining operators in this category, e.g. x ! -- ! works fine syntactically.
Ok, let me suggest a possible erroneous grammar based on your sketch:
low_postfix:
mid_infix
| low_postfix "<-"
mid_infix:
high_postfix
| mid_infix '+' high_postfix
high_postfix:
term
| high_postfix "++"
term:
ID
'(' expr ')'
It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.
On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
If the intention were to allow both of those postfix sequences, then that grammar is incorrect.
A version of that cascade is present in the Python grammar (albeit using prefix operators) which is why not - False is OK, but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention but not on SO, which prefers to avoid opinionated discussions.
Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:
$ python3 -c 'print(41 + not False)'
File "<string>", line 1
print(41 + not False)
^
SyntaxError: invalid syntax
So, how can we fix that?
On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]
Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)
%left ','
%precedence "<-"
%precedence "->"
%left '+'
%left '*'
%precedence NEG
%right "++" '('
%%
expr: expr ',' expr { $$ = make_binop(OP_LIST, $1, $3); }
| "<-" expr { $$ = make_unop(OP_LARR, $2); }
| expr "->" { $$ = make_unop(OP_RARR, $1); }
| expr '+' expr { $$ = make_binop(OP_ADD, $1, $3); }
| expr '*' expr { $$ = make_binop(OP_MUL, $1, $3); }
| '-' expr %prec NEG { $$ = make_unop(OP_NEG, $2); }
| expr '(' expr ')' %prec '(' { $$ = make_binop(OP_CALL, $1, $3); }
| "++" expr { $$ = make_unop(OP_PREINC, $2); }
| expr "++" { $$ = make_unop(OP_POSTINC, $1); }
| VALUE { $$ = make_ident($1); }
| '(' expr ')' { $$ = $2; }
A couple of notes:
I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though. ( is the lookahead symbol which is used in precedence comparisons.
For many unary operators, I used bison %precedence declaration in the precedence list, rather than %right or %left. Really, there is no such thing as associativity with unary operators, so I think that it's more self-documenting to use %precedence, which doesn't resolve conflicts involving reductions and shifts in the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence resolution algorithm is that you can put prefix operators and postfix operators in the same precedence level and choose whether the postfix or prefix operators have priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point.
This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:
? a++->
=> [-> [++/post a]]
? a->++
=> [++/post [-> a]]
? 3*f(a)+2
=> [+ [* 3 [CALL f a]] 2]
? 3*f(a)->+2
=> [+ [-> [* 3 [CALL f a]]] 2]
? 2+<-f(a)*3
=> [+ 2 [<- [* [CALL f a] 3]]]
? 2+<-f(a)*3->
=> [+ 2 [<- [-> [* [CALL f a] 3]]]]
But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:
? <-2*f(a)->+3
=> [<- [+ [-> [* 2 [CALL f a]]] 3]]
? <-2+f(a)->*3
=> [<- [* [-> [+ 2 [CALL f a]]] 3]]
? 2+<-f(a)->*3
=> [+ 2 [<- [* [-> [CALL f a]] 3]]]
There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:
? 2+f(a)*3
=> [+ 2 [* [CALL f a] 3]]
? 2+f(a)->*3
=> [* [-> [+ 2 [CALL f a]]] 3]
If that's your intention, fine. It's your language.
Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.
A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
Notes
Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)
All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.
The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.
But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions already has split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence value. But unary operators do not have the same binding power on both sides (since, as noted above, one side of the unary operator cannot take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, and not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, corresponding to whether the non-terminal appears on the left or the right side of a binary operator.
That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)
It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.

How do I expand the item set for this grammar?

I have this grammar
E -> E + i
E -> i
The augmented grammar
E' -> E
E -> E + i
E -> i
Now I try to expand the item set 0
I0)
E' -> .E
+E -> .E + i
+E -> .i
Then, since I have .E in I0 I would expand it but then I will get another E rule, and so on, this is my first doubt.
Assuming that this is alright the next item sets are
I0)
E' -> .E
+E -> .E + i
+E -> .i
I1) (I moved the dot from I0, no variables at rhs of dot)
E' -> E.
E -> E. + i
E -> i.
I2) (I moved the dot from I1, no vars at rhs of dot)
E -> E +. i
I3) (I moved the dot from I2, also no vars)
E -> E + i.
Then I will have this DFA
I0 -(E, i)-> I1 -(+)-> I2 -(i)-> I3
| |
+-(∅)-> acpt <-(∅)--+
I'm missing something because E -> E + i must accept i + i + .. but the DFA doesn't goes back to the I0, so it seems wrong to me. My guess is that it should have a I0 to I0 transition, but I then I don't know that to do with the dot.
What you call the "expansion" of the item set is actually a closure; that's how it's described in all the descriptions of the algorithm I've seen (at least in textbooks). Like any closure operation, you just keep on doing the transformation until you reach a fixed-point; once you've included the productions for E, they're included.
But the essential point is that you're not building a DFA. You're building a pushdown automaton, and the DFA is just one part of it. The DFA is used for shift operations; when you shift a new terminal (because the current parse stack is not a handle), you do a state transition according to the DFA. But you also push the current state onto the PDA's stack.
The interesting part is what happens when the automaton decides to perform a reduction, which replaces the right-hand side of a production with its left-hand side non-terminal. (The right-hand side at the top of the stack is called a "handle".) To do the reduction, you unwind the stack, popping each right-hand side symbol (and the corresponding DFA state) until you reach the beginning of the production. What that does is rewind the DFA to the state it was in before it shifted the first symbol from the right-hand side. (Note that it is only at this point that you know for sure which production was used.) With the DFA thus reset, you can now shift the non-terminal which was encountered, do the corresponding DFA transition, and continue with the parse.
The basis for this procedure is the fact that the parser stack is at all times a "viable prefix"; that is, a sequence of symbols which are the prefix of some right sentential form which can be derived from the start symbol. What's interesting about the set of viable prefixes for a context-free grammar is that it is a regular language, and consequently can be recognised by a DFA. The reduction procedure given above precisely represents this recognition procedure when handles are "pruned" (to use Knuth's original vocabulary).
In that sense, it doesn't really matter what procedure is used to determine which handle is to be pruned, as long as it provides a valid answer. You could, for example, fork the parse every time a potential handle is noticed at the top of the stack, and continue in parallel with both forks. With clever stack management, this parallel search can be done in worst-case O(n3) time for any context-free grammar (and this can be reduced if the grammar is not ambiguous). That's a very rough description of Earley parsers.
But in the case of an LR(k) parser, we require that the grammar be unambiguous, and we also require that we can identify a reduction by looking at no more than k more symbols from the input stream, which is an O(1) operation since k is fixed. If at each point in the parse we know whether to reduce or not, and if so which reduction to choose, then the reductions can be implemented as I outlined above. Each reduction can be performed in O(1) time for a fixed grammar (since the maximum size of a right-hand side in a particular grammar is fixed), and since the number of reductions in a parse is linear in the size of the input, the entire parse can be done in linear time.
That was all a bit informal, but I hope it serves as an intuitive explanation. If you're interested in the formal proof, Donald Knuth's original 1965 paper (On the Translation of Languages from Left to Right) is easy to find and highly readable as these things go.

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for regular expression. I use the strategy similar to arithmetic expression evaluation (i know that there are ways like recursive descent), that is, use two stack, the OPND stack and the OPTR stack, then to process.
I use different kind of node to represent different kind of RE. For example, the SymbolExpression, the CatExpression, the OrExpression and the StarExpression, all of them are derived from RegularExpression.
So the OPND stack stores the RegularExpression*.
while(c || optr.top()):
if(!isOp(c):
opnd.push(c)
c = getchar();
else:
switch(precede(optr.top(), c){
case Less:
optr.push(c)
c = getchar();
case Equal:
optr.pop()
c = getchar();
case Greater:
pop from opnd and optr then do operation,
then push the result back to opnd
}
But my primary question is, in typical RE, the cat operator is implicit.
a|bc represents a|b.c, (a|b)*abb represents (a|b)*.a.b.b. So when meeting an non-operator, how should i do to determine whether there's a cat operator or not? And how should i deal with the cat operator, to correctly implement the conversion?
Update
Now i've learn that there is a kind of grammar called "operator precedence grammar", its evaluation is similar to arithmetic expression's. It require that the pattern of the grammar cannot have the form of S -> ...AB...(A and B are non-terminal). So i guess that i just cannot directly use this method to parse the regular expression.
Update II
I try to design a LL(1) grammar to parse the basic regular expression.
Here's the origin grammar.(\| is the escape character, since | is a special character in grammar's pattern)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursive, import new Variable
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
now, for pattern F -> P* | P, import P'
P' -> * | ε
F -> PP'
However, the pattern T' -> FT' | ε has problem. Consider case (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, our human know that we should substitute the Variable T' with T' -> ε, but program will just call T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should i rewrite it to make it suitable for the recursive descendent method.
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {), i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatentation. Similarly at the next step
(a|b)*abb
^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, however I know that Bison supports LR(1) parsing which means it has a lookahead token. This means that tokens aren't passed to the stack immediately. Rather they wait until no more reductions can happen. Then, if shifting the next token makes sense it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.

Resources