Conversion to Chomsky Normal Form

Conversion to Chomsky Normal Form - parsing

I do need your help.
I have these productions:
1) A--> aAb
2) A--> bAa
3) A--> ε
I should convert them to Chomsky Normal Form (CNF).
In order to do that I should:
eliminate ε-productions
eliminate unit productions
remove useless symbols
Immediately I get stuck. The reason is that A is a nullable symbol (ε is one of its bodies).
Of course I can't remove the A symbol.
Can anyone help me to get the final solution?

As Wikipedia notes, there are two definitions of Chomsky Normal Form, which differ in the treatment of ε-productions. You will have to pick the one where these are allowed, as otherwise you will never get an equivalent grammar: your grammar produces the empty string, while a CNF grammar following the other definition isn't capable of that.

To begin conversion to Chomsky normal form (using Definition (1) provided by the Wikipedia page), you need to find an equivalent essentially noncontracting grammar. A grammar G with start symbol S is essentially noncontracting iff
1. S is not a recursive variable
2. G has no ε-rules other than S -> ε if ε ∈ L(G)
Calling your grammar G, an equivalent grammar G' with a non-recursive start symbol is:
G' : S -> A
A -> aAb | bAa | ε
Clearly, the set of nullable variables of G' is {S,A}, since A -> ε is a production in G' and S -> A is a chain rule. I assume that you have been given an algorithm for removing ε-rules from a grammar. That algorithm should produce a grammar similar to:
G'' : S -> A | ε
A -> aAb | bAa | ab | ba
The grammar G'' is essentially noncontracting; you can now apply the remaining algorithms to the grammar to find an equivalent grammar in Chomsky normal form.
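The ε-rule removal step can be sketched in Python. This is only an illustration, not the specific algorithm from your course: the dictionary representation (bodies as tuples of symbols, with the empty tuple standing for ε) and the function names are assumptions made for this example, and the start symbol is assumed to be non-recursive, as in G'.

```python
from itertools import product

def nullable_set(grammar):
    """Return the set of variables that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # A body is nullable when every symbol in it is nullable
            # (the empty body () is trivially nullable).
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

def remove_epsilon_rules(grammar, start):
    """Drop ε-rules, keeping start -> ε only if the start symbol is nullable."""
    nullable = nullable_set(grammar)
    result = {}
    for head, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable symbol may be kept or dropped independently.
            options = [((sym,), ()) if sym in nullable else ((sym,),)
                       for sym in body]
            for choice in product(*options):
                new_body = tuple(s for part in choice for s in part)
                if new_body:  # skip bodies that became empty
                    new_bodies.add(new_body)
        result[head] = new_bodies
    if start in nullable:
        result[start].add(())
    return result

# G' from above: S -> A, A -> aAb | bAa | ε
g_prime = {
    "S": {("A",)},
    "A": {("a", "A", "b"), ("b", "A", "a"), ()},
}
g_double_prime = remove_epsilon_rules(g_prime, "S")
for head, bodies in g_double_prime.items():
    print(head, "->", sorted(bodies))
```

On G' this yields the same rules as G'': S keeps A and gains ε, and A gains ab and ba in place of its ε-rule.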

Related

Determining the type of grammar [duplicate]

How do you identify whether a grammar is LL(1), LR(0), or SLR(1)?
Can anyone please explain it using this example, or any other example?
X → Yz | a
Y → bZ | ε
Z → ε

To check if a grammar is LL(1), one option is to construct the LL(1) parsing table and check for any conflicts. These conflicts can be
FIRST/FIRST conflicts, where two different productions would have to be predicted for a nonterminal/terminal pair.
FIRST/FOLLOW conflicts, where two different productions are predicted: one that expands the nonterminal to a nonempty string of symbols, and one that ultimately expands it to the empty string.
FOLLOW/FOLLOW conflicts, where two productions indicating that a nonterminal should ultimately be expanded to the empty string conflict with one another.
Let's try this on your grammar by building the FIRST and FOLLOW sets for each of the nonterminals. Here, we get that
FIRST(X) = {a, b, z}
FIRST(Y) = {b, epsilon}
FIRST(Z) = {epsilon}
We also have that the FOLLOW sets are
FOLLOW(X) = {$}
FOLLOW(Y) = {z}
FOLLOW(Z) = {z}
From this, we can build the following LL(1) parsing table:
        a     b     z     $
    X   a     Yz    Yz
    Y         bZ    eps
    Z               eps
Since we can build this parsing table with no conflicts, the grammar is LL(1).
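The same computation can be sketched in Python as a cross-check. This is a minimal fixed-point version written for this example; the grammar encoding (lists of symbols, an empty list for ε) and the function names are assumptions, not part of any standard tool.

```python
EPS, END = "ε", "$"

GRAMMAR = {  # X → Yz | a ; Y → bZ | ε ; Z → ε
    "X": [["Y", "z"], ["a"]],
    "Y": [["b", "Z"], []],
    "Z": [[]],
}

def first_of(symbols, first):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol was nullable, or the sequence was empty
    return out

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                new = first_of(body, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue  # FOLLOW is only defined for nonterminals
                    rest = first_of(body[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # the tail can vanish: inherit FOLLOW(nt)
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

def ll1_table(grammar, first, follow):
    table, conflicts = {}, []
    for nt, bodies in grammar.items():
        for body in bodies:
            body_first = first_of(body, first)
            lookaheads = body_first - {EPS}
            if EPS in body_first:
                lookaheads |= follow[nt]
            for t in lookaheads:
                if (nt, t) in table:
                    conflicts.append((nt, t))
                table[(nt, t)] = body
    return table, conflicts

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first, "X")
table, conflicts = ll1_table(GRAMMAR, first, follow)
print(conflicts)  # an empty list means no cell holds two productions
```

A nonempty `conflicts` list would name the (nonterminal, terminal) cells that hold more than one production.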
To check if a grammar is LR(0) or SLR(1), we begin by building up all of the LR(0) configurating sets for the grammar. In this case, assuming that X is your start symbol, we get the following:
(1)
X' -> .X
X -> .Yz
X -> .a
Y -> .
Y -> .bZ
(2)
X' -> X.
(3)
X -> Y.z
(4)
X -> Yz.
(5)
X -> a.
(6)
Y -> b.Z
Z -> .
(7)
Y -> bZ.
From this, we can see that the grammar is not LR(0) because there is a shift/reduce conflict in state (1). Specifically, because we have the shift items X → .a and Y → .bZ alongside the reduce item Y → ., we can't tell whether to shift or to reduce the empty string. More generally, an ε-production causes an LR(0) conflict whenever its reduce item shares a state with a shift item, which is why grammars with ε-productions are rarely LR(0).
However, this grammar might be SLR(1). To see this, we annotate each item with the FOLLOW set of its left-hand nonterminal, which is the lookahead set used for its reduction. This gives the following SLR(1) configurating sets:
(1)
X' -> .X
X -> .Yz [$]
X -> .a [$]
Y -> . [z]
Y -> .bZ [z]
(2)
X' -> X.
(3)
X -> Y.z [$]
(4)
X -> Yz. [$]
(5)
X -> a. [$]
(6)
Y -> b.Z [z]
Z -> . [z]
(7)
Y -> bZ. [z]
The shift/reduce conflict in state (1) has been eliminated because we only reduce when the lookahead is z, which doesn't conflict with any of the other items. No other state has a conflict, so the grammar is SLR(1).
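For readers who want to reproduce the item sets mechanically, here is a rough Python sketch of the LR(0) closure/goto construction with a conflict check. The item encoding (head, body, dot position), the function names, and the hard-coded FOLLOW sets are assumptions made for this example.

```python
GRAMMAR = [  # augmented grammar, X' is the added start symbol
    ("X'", ("X",)),
    ("X", ("Y", "z")),
    ("X", ("a",)),
    ("Y", ("b", "Z")),
    ("Y", ()),
    ("Z", ()),
]
NONTERMINALS = {head for head, _ in GRAMMAR}
FOLLOW = {"X": {"$"}, "Y": {"z"}, "Z": {"z"}}  # copied from the FOLLOW sets above

def closure(items):
    """Add X -> .γ for every nonterminal X that appears right after a dot."""
    items = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body) and body[dot] in NONTERMINALS:
            for h, b in GRAMMAR:
                if h == body[dot] and (h, b, 0) not in items:
                    items.add((h, b, 0))
                    work.append((h, b, 0))
    return frozenset(items)

def goto(state, symbol):
    """Move the dot over `symbol` in every item where that is possible."""
    moved = [(h, b, d + 1) for h, b, d in state if d < len(b) and b[d] == symbol]
    return closure(moved)

def canonical_collection():
    start = closure([GRAMMAR[0] + (0,)])
    states, work = {start}, [start]
    while work:
        state = work.pop()
        for symbol in {b[d] for _, b, d in state if d < len(b)}:
            nxt = goto(state, symbol)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states

def has_conflict(state, use_follow):
    """True if the state has two competing actions under LR(0) or SLR(1)."""
    shifts = {b[d] for _, b, d in state
              if d < len(b) and b[d] not in NONTERMINALS}
    reduces = [h for h, b, d in state if d == len(b) and h != "X'"]
    if not use_follow:
        # LR(0): a reduce item tolerates no shift and no second reduction.
        return bool(reduces) and (bool(shifts) or len(reduces) > 1)
    seen = set(shifts)
    for head in reduces:
        # SLR(1): reduce only on FOLLOW(head); lookaheads must not collide.
        if FOLLOW[head] & seen:
            return True
        seen |= FOLLOW[head]
    return False

states = canonical_collection()
print(len(states))
print(sum(has_conflict(s, use_follow=False) for s in states))
print(sum(has_conflict(s, use_follow=True) for s in states))
```

The three printed numbers are the number of states, the number of states with an LR(0) conflict, and the number with an SLR(1) conflict.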

If you have no FIRST/FIRST conflicts and no FIRST/FOLLOW conflicts, your grammar is LL(1).
An example of a FIRST/FIRST conflict:
S -> Xb | Yc
X -> a
Y -> a
By seeing only the first input symbol "a", you cannot know whether to apply the production S -> Xb or S -> Yc, because "a" is in the FIRST set of both X and Y.
An example of a FIRST/FOLLOW conflict:
S -> AB
A -> fe | ε
B -> fg
By seeing only the first input symbol "f", you cannot decide whether to apply the production A -> fe or A -> ε, because "f" is in both the FIRST set of A and the FOLLOW set of A (A can be parsed as ε/empty and B as f).
Notice that if you have no epsilon-productions you cannot have a FIRST/FOLLOW conflict.
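Both kinds of conflict can be detected with a few set operations once the FIRST set of each alternative and the FOLLOW set of each nonterminal are known. The sketch below takes those sets as hand-computed inputs for the two example grammars; the function name and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

EPS = "ε"

def ll1_conflicts(first_of_alternatives, follow):
    """first_of_alternatives maps nonterminal -> {alternative: FIRST(alternative)}."""
    conflicts = []
    for nt, alts in first_of_alternatives.items():
        # FIRST/FIRST: two alternatives can start with the same symbol.
        # (Two alternatives that both derive ε are reported here as well.)
        for (body1, f1), (body2, f2) in combinations(alts.items(), 2):
            if f1 & f2:
                conflicts.append(("FIRST/FIRST", nt, body1, body2))
        # FIRST/FOLLOW: nt can vanish, and a symbol that may follow it
        # can also start one of its nonempty alternatives.
        first_nt = set().union(*alts.values())
        if EPS in first_nt and (first_nt - {EPS}) & follow[nt]:
            conflicts.append(("FIRST/FOLLOW", nt))
    return conflicts

# S -> Xb | Yc, X -> a, Y -> a
g1 = {"S": {"Xb": {"a"}, "Yc": {"a"}}, "X": {"a": {"a"}}, "Y": {"a": {"a"}}}
follow1 = {"S": {"$"}, "X": {"b"}, "Y": {"c"}}

# S -> AB, A -> fe | ε, B -> fg
g2 = {"S": {"AB": {"f"}}, "A": {"fe": {"f"}, "ε": {EPS}}, "B": {"fg": {"f"}}}
follow2 = {"S": {"$"}, "A": {"f"}, "B": {"$"}}

print(ll1_conflicts(g1, follow1))
print(ll1_conflicts(g2, follow2))
```

The first grammar reports a FIRST/FIRST conflict on S, the second a FIRST/FOLLOW conflict on A.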

Simple answer: a grammar is said to be LL(1) if the associated LL(1) parsing table has at most one production in each table entry.
Take the simple grammar A --> Aa | b (A is a nonterminal; a and b are terminals),
then find the FIRST and FOLLOW sets of A.
First{A} = {b}.
Follow{A} = {$, a}.
Parsing table for our grammar, with terminals as columns and the nonterminal A as the row:
          a          b            $
    --------------------------------------------
    A |           | A --> Aa   |
      |           | A --> b    |
    --------------------------------------------
As [A,b] contains two productions, there is confusion as to which rule to choose. So it is not LL(1).
Some simple checks to see whether a grammar can be LL(1). Failing any one of them means the grammar is not LL(1):
Check 1: The grammar should not be left-recursive.
Example: E --> E+T is not LL(1) because it is left-recursive.
Check 2: The grammar should be left-factored.
Left factoring is required when two or more alternatives of a rule share a common prefix.
Example: S --> A+int | A, where both alternatives start with A.
Check 3: The grammar should not be ambiguous.
These are necessary conditions only; passing all three does not by itself prove that a grammar is LL(1).

An LL(1) grammar is an unambiguous context-free grammar that can be parsed by an LL(1) parser.
In LL(1), the first L stands for scanning the input from left to right, the second L stands for leftmost derivation, and 1 stands for using one input symbol of lookahead at each step.
To check whether a grammar is LL(1), you can draw the predictive parsing table. If any cell of the table has multiple entries, the grammar is not LL(1).
There is also a shortcut to check whether the grammar is LL(1).
Shortcut technique

With these two conditions we can check whether a grammar is LL(1).
Both of them have to be satisfied.
1. If we have the production A -> a1 | a2 | a3 | a4 | ... | an,
then First(a_i) ∩ First(a_j) must be the empty set for all i ≠ j.
2. For every nonterminal A, if First(A) contains epsilon,
then First(A) ∩ Follow(A) must be the empty set.

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for a regular expression. I use a strategy similar to arithmetic expression evaluation (I know that there are other ways, like recursive descent): two stacks, the OPND (operand) stack and the OPTR (operator) stack, drive the processing.
I use different kinds of nodes to represent different kinds of RE, for example SymbolExpression, CatExpression, OrExpression and StarExpression, all of which are derived from RegularExpression.
So the OPND stack stores RegularExpression* values.
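    while (c || optr.top()) {
        if (!isOp(c)) {
            opnd.push(c);
            c = getchar();
        } else {
            switch (precede(optr.top(), c)) {
            case Less:      // top binds less tightly than c: shift c
                optr.push(c);
                c = getchar();
                break;
            case Equal:     // matching pair, e.g. '(' meets ')': drop both
                optr.pop();
                c = getchar();
                break;
            case Greater:   // top binds more tightly than c: reduce
                // pop from opnd and optr, then do the operation,
                // then push the result back to opnd
                break;
            }
        }
    }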
But my primary question is this: in a typical RE, the cat operator is implicit.
a|bc represents a|b.c, and (a|b)*abb represents (a|b)*.a.b.b. So when I meet a non-operator, how do I determine whether there is a cat operator before it or not? And how should I deal with the cat operator to implement the conversion correctly?
Update
Now I've learned that there is a kind of grammar called an "operator precedence grammar", whose evaluation is similar to that of arithmetic expressions. It requires that no production has the form S -> ...AB... (where A and B are nonterminals). So I guess that I cannot directly use this method to parse regular expressions.
Update II
I tried to design an LL(1) grammar to parse basic regular expressions.
Here's the original grammar (\| is an escaped |, since | is a special character in the grammar notation):
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursion, introduce new variables:
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
Now, to left-factor the rule F -> P* | P, introduce P':
P' -> * | ε
F -> PP'
However, the rule T' -> FT' | ε has a problem. Consider the input (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, a human knows that T' should be replaced using T' -> ε, but the program will just apply T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should I rewrite it to make it suitable for the recursive descent method?

1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {(, i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.

I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
      ^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatenation. Similarly at the next step
(a|b)*abb
       ^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
   ^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.
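A preprocessing pass of that kind could look roughly like the Python sketch below. It inserts an explicit operator between two adjacent characters whenever the left one can end an operand (a symbol, `)` or `*`) and the right one can start an operand (a symbol or `(`). The function name and the choice of `.` as the concatenation character are assumptions; escapes, character classes and a literal `.` are not handled.

```python
def insert_concat(pattern, cat="."):
    """Make the implicit concatenation in a basic regex explicit."""
    operators = set("|*()")
    out = []
    prev = None
    for ch in pattern:
        if prev is not None:
            # prev closes an operand: a plain symbol, ')' or '*'
            prev_ends_operand = prev not in operators or prev in ")*"
            # ch opens an operand: a plain symbol or '('
            ch_starts_operand = ch not in operators or ch == "("
            if prev_ends_operand and ch_starts_operand:
                out.append(cat)
        out.append(ch)
        prev = ch
    return "".join(out)

print(insert_concat("a|bc"))       # a|b.c
print(insert_concat("(a|b)*abb"))  # (a|b)*.a.b.b
```

After this pass the two-stack loop can treat the concatenation character as an ordinary binary operator with precedence between `|` and `*`.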

Compilers, finding FIRST set for grammar

I am reading the famous violet dragon book 2nd edition, and can't get the example from page 65, about creating the FIRST set:
We have the following grammar (terminals are bolded):
stmt → expr;
| if ( expr ) stmt
| for ( optexpr ; optexpr ; optexpr ) stmt
| other
optexpr → ε
| expr
And the book suggests that the following is the correct calculation of FIRST:
FIRST(stmt) = {expr, if, for, other} // agree on this
FIRST(expr ;) = {expr} // Where does this come from?
As the comment suggests, where does the second line come from?

There is no error in the textbook.
The FIRST function is defined (on page 64, emphasis added):
Let α be a string of grammar symbols (terminals and/or nonterminals). We define FIRST(α) to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α.
In this example, expr ; is a string of grammar symbols consisting of two terminals, so it is a possible value of α. Since it includes no nonterminals, it can only generate itself; thus the only string of terminals generated from that value of α is precisely expr ;, and the only terminal that will appear in FIRST(α) is the first symbol in that string, expr.
This may all seem to be belabouring the obvious, but it leads up to the important note just under the example you cite:
The FIRST sets must be considered if there are two productions A → α and A → β. Ignoring ε-productions for the moment, predictive parsing requires FIRST(α) and FIRST(β) to be disjoint.
Since expr ; is one of the possible right-hand sides for stmt, we need to compute its FIRST set (even though the computation in this case is trivial) in order to test this prerequisite.
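To make the definition concrete, here is a small recursive Python sketch of FIRST for a string of grammar symbols, applied to this grammar. The encoding (lists of symbol names, an empty list for ε) and the function name are assumptions, and the direct recursion only terminates because this grammar has no left recursion.

```python
EPS = "ε"

GRAMMAR = {  # expr is a terminal in this grammar, as in the book's example
    "stmt": [
        ["expr", ";"],
        ["if", "(", "expr", ")", "stmt"],
        ["for", "(", "optexpr", ";", "optexpr", ";", "optexpr", ")", "stmt"],
        ["other"],
    ],
    "optexpr": [[], ["expr"]],
}

def first(symbols):
    """FIRST of a string of grammar symbols (assumes no left recursion)."""
    result = set()
    for sym in symbols:
        if sym not in GRAMMAR:  # terminal: every derived string starts with it
            result.add(sym)
            return result
        # Nonterminal: union of FIRST over all of its alternatives.
        sym_first = set().union(*(first(body) for body in GRAMMAR[sym]))
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # the whole string can derive the empty string
    return result

for body in GRAMMAR["stmt"]:
    print(" ".join(body), "->", sorted(first(body)))
print(sorted(first(["optexpr", ";"])))
```

The four alternatives of stmt get the pairwise disjoint sets {expr}, {if}, {for} and {other}, which is the property predictive parsing needs. The last line shows a less trivial case: because optexpr can derive ε, FIRST(optexpr ;) contains both expr and ;.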

fixing a grammar to LR(0)

Question:
Given the following grammar, fix it to an LR(0) grammar:
S -> S' $
S'-> aS'b | T
T -> cT | c
Thoughts
I've been trying this for quite some time, using automatic tools to check my fixed grammars, with no success. Our professor likes asking this kind of question on tests without giving us a methodology for approaching it (except for repeated trying). Is there any method that can be applied to answer this kind of question? Can anyone show how this method can be applied to this example?

I don't know of an automatic procedure, but the basic idea is to defer decisions. That is, if at a particular state in the parse, both shift and reduce actions are possible, find a way to defer the reduction.
In the LR(0) parser, you can make a decision based on the token you just shifted, but not on the token you (might be) about to shift. So you need to move decisions to the end of productions, in a manner of speaking.
For example, your language consists of all sentences { a^n c^m b^n $ | n ≥ 0, m > 0 }. If we restrict that to n > 0, then an LR(0) grammar can be constructed by deferring the reduction decision to the point following a b:
S -> S' $.
S' -> U | a S' b.
U -> a c T.
T -> b | c T.
That grammar is LR(0). In the original grammar, at the itemset including T -> c . and T -> c . T, both shift and reduce are possible: shift c and reduce before b. By moving the b into the production for T, we defer the decision until after the shift: after shifting b, a reduction is required; after c, the reduction is impossible.
But that forces every sentence to have at least one b. It omits sentences for which n = 0 (that is, the regular language c*$). That subset has an LR(0) grammar:
S -> S' $.
S' -> c | S' c.
We can construct the union of these two languages in a straightforward manner, renaming one of the S's:
S -> S1' $ | S2' $.
S1' -> U | a S1' b.
U -> a c T.
T -> b | c T.
S2' -> c | S2' c.
This grammar is LR(0), but the form in which the end-of-input sentinel $ has been included seems to be cheating. At least, it violates the rule for augmented grammars, because an augmented grammar's base rule is always S -> S' $ where S' and $ are symbols not used in the original grammar.
It might seem that we could avoid that technicality by right-factoring:
S -> S' $
S' -> S1' | S2'
Unfortunately, while that grammar is still deterministic, and does recognise exactly the original language, it is not LR(0).
(Many thanks to @templatetypedef for checking the original answer, and identifying a flaw, and also to @Dennis, who observed that c* was omitted.)
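A quick sanity check that a rewritten grammar still generates the same language is to enumerate all sentences up to some length for both grammars and compare. The Python sketch below does that for the original grammar and the union grammar above. It says nothing about whether a grammar is LR(0), and it only compares sentences up to the chosen bound. The encoding and the function name are assumptions; the pruning relies on neither grammar having ε-rules.

```python
def sentences(grammar, start, max_len):
    """All terminal strings with at most max_len symbols derivable from start.

    Assumes no ε-rules, so a sentential form never gets shorter and any
    form longer than max_len can be discarded.
    """
    done, seen, work = set(), {(start,)}, [(start,)]
    while work:
        form = work.pop()
        # Expand the leftmost nonterminal, if there is one.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            done.add("".join(form))
            continue
        for body in grammar[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                work.append(new)
    return done

original = {
    "S": [("S'", "$")],
    "S'": [("a", "S'", "b"), ("T",)],
    "T": [("c", "T"), ("c",)],
}
fixed = {
    "S": [("S1'", "$"), ("S2'", "$")],
    "S1'": [("U",), ("a", "S1'", "b")],
    "U": [("a", "c", "T")],
    "T": [("b",), ("c", "T")],
    "S2'": [("c",), ("S2'", "c")],
}

bound = 9
print(sentences(original, "S", bound) == sentences(fixed, "S", bound))
```

Raising `bound` makes the check stronger at the cost of a longer enumeration.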

Automatic grammar transformation for left-factoring; and left-recursion removal

Standard methods are readily available to transform a context-free grammar which is not LL(1) into an equivalent grammar which is. Are there any tools available which can automate this process?
In the examples below I use upper-case lettering for non-terminals, and lower-case for terminals.
The following left-recursive non-terminal:
A -> A a | b
can be transformed into a right-recursive form:
A -> b A'
A' -> NIL | a A'
Note though that left-recursive production rules ensure that expressions associate to the left, and similarly for right recursive productions; and so a grammar modification will also change expression associativity.
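For reference, the immediate left-recursion rewrite shown above is mechanical enough to sketch in a few lines of Python. The function name, the list-of-symbols encoding (an empty list plays the role of NIL) and the naming of the new nonterminal by appending a prime are assumptions; indirect left recursion is not handled.

```python
def remove_immediate_left_recursion(head, bodies):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | NIL."""
    # Tails α of the left-recursive alternatives A -> A α.
    recursive = [body[1:] for body in bodies if body and body[0] == head]
    # Alternatives β that do not start with A.
    others = [body for body in bodies if not body or body[0] != head]
    if not recursive:
        return {head: bodies}  # nothing to do
    new = head + "'"  # assumed not to clash with an existing nonterminal
    return {
        head: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],  # [] is NIL
    }

print(remove_immediate_left_recursion("A", [["A", "a"], ["b"]]))
```

For A -> A a | b this produces A -> b A' and A' -> a A' | NIL, the same rules as the hand-derived form.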
Another issue is indirect left-recursion, such as the following:
A -> B a
B -> A b
Left-factoring is also used to ensure that only one look-ahead token is required by the parser. The following production must look ahead by two tokens:
A -> a b | a c
This can also be refactored to:
A -> a (b | c)
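A single round of left-factoring can be sketched the same way. The Python function below groups the alternatives of one nonterminal by their first symbol and moves each shared first symbol into a new nonterminal. The function name, the encoding and the primed names are assumptions, and only a common prefix of length one is factored per call, so longer prefixes need repeated application.

```python
def left_factor(head, bodies):
    """One round of left-factoring on the first symbol of each alternative."""
    groups = {}
    for body in bodies:
        key = body[0] if body else None  # None groups the empty alternative
        groups.setdefault(key, []).append(body)
    result = {head: []}
    fresh = head
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[head].extend(group)  # nothing shared: keep as is
            continue
        fresh += "'"  # A', A'', ... assumed unused elsewhere in the grammar
        result[head].append([first, fresh])
        result[fresh] = [body[1:] for body in group]
    return result

print(left_factor("A", [["a", "b"], ["a", "c"]]))
```

For A -> a b | a c this gives A -> a A' and A' -> b | c, which is the grouped form A -> a (b | c) written with an explicit helper nonterminal.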
Are there any software tools which can automate these grammar transformations, and so produce an equivalent grammar suitable for an LL(1) parser?

The Haskell grammar-combinators library allows a grammar to be transformed into a non-left-recursive form. The input grammar must, however, be a parsing expression grammar.
