LR(1) parsing table with epsilon productions - parsing

I'm having trouble building the collection of sets of items for LR(1) parsers with a grammar containing epsilon productions. For example, given the following grammar (where eps stands for epsilon)
S -> a S U
U -> b
| eps
State0 would be
S' -> .S, $
S -> .a S U, $
Moving with 'a' from State0 would give the following state, let's call it State2
S -> a .S U, $
S -> .a S U, $/???
In order to have the lookahead for the second item of State2 I need to calculate FIRST(U$). I know that FIRST(U) = {'b', eps}. My first question is: the lookaheads of the second item of State2 are $ and 'b'? Since U can be eps, my brain tells me that I can have $ as a lookahead as well, not just 'b'. It would have been just 'b' if FIRST(U) would have been just {'b'}. Is that correct?
Second question: at some point I will have a state as the following one
S -> a S .U, $
U -> .b, $
U -> .eps, $
What do I do here? Do I need to move with eps and have a set with the item U -> eps., $? What if I have another terminal as lookahead, i.e. X -> .eps, a/$? And if I move, ending up having a set of the form X -> eps., $, do I reduce?
And more: do I need to insert eps in the parse table as a symbol?
Thanks

FIRST(U$) means "the set of symbols which could be first in a derivation of U$". Clearly, if U can derive the empty string, $ must be part of this set. The end-of-input marker $ ensures that we never have to worry about epsilons in the FIRST sets. (If we were doing LR(k) instead of LR(1), we would use k end markers so that all the strings in FIRSTk had length k.
The item associated with U → (or with U → ε if you insist) is U → • . In other words, it is reducible and should trigger a reduce action on matching lookahead.
ε is not a symbol; we only use it (sometimes) to make the empty string visible. But the empty string is empty.

Related

Why this grammar has Reduce/Reduce conflict in LR(0)?

I have the following grammar:
S -> a b D E
S -> A B E F
D -> M x
E -> N y
F -> z
M -> epsilon
N -> epsilon
My textbook says there is a Reduce/Reduce conflict in LR(0). I built a diagram and found out that there is a state:
S -> a b . D E
S -> A B . E F
D -> . M x
E -> . N y
M -> .
N -> .
The textbook says that it's a Reduce/Reduce conflict. I'm trying to figure out why. If I build the SLR table I get the following row (3 is the state above):
That's because:
Follow(M)={x} so we can do reduce to rule 6 from state 3.
Follow(N)={y} so we can do reduce to rule 7 from state 3.
I was taught that there is a conflict S/R if there is a cell with S/R and conflict R/R if there is a cell with R/R. But I don't see two Rs in the same cell in the table. So why is it a reduce/reduce conflict?
You show an SLR(1) parsing table, in which the columns correspond to a lookahead of length 1. It's correct, and there is no conflict.
But here we're talking about an LR(0) machine, in which there is no lookahead. (That's the 0 in LR(0).) The only decision the machine can make is to shift or reduce, and since it cannot use lookahead, it can only use the state itself. A given state must be either a shift state or a reduce state (and, if a reduce state, which production is being reduced).
(In case it's confusing, and it often is, the concept of lookahead does not refer to the use of the shifted symbol to decide which state to transition to. The transition is taken based on the shifted symbol, which is at that point no longer part of the lookahead.)
So in that state, there is no possible shift action; in all items in the itemset, either the dot is at the end or the next symbol is a non-terminal (implying a GOTO action after returning from a reduce).
But the state does not have a unique reduction. Depending on the lookahead, the parsers needs to choose to reduce M or to reduce N. And since there is no lookahead, the decision cannot be made and hence it's a conflict.

Determining the type of grammar [duplicate]

How do you identify whether a grammar is LL(1), LR(0), or SLR(1)?
Can anyone please explain it using this example, or any other example?
X → Yz | a
Y → bZ | ε
Z → ε
To check if a grammar is LL(1), one option is to construct the LL(1) parsing table and check for any conflicts. These conflicts can be
FIRST/FIRST conflicts, where two different productions would have to be predicted for a nonterminal/terminal pair.
FIRST/FOLLOW conflicts, where two different productions are predicted, one representing that some production should be taken and expands out to a nonzero number of symbols, and one representing that a production should be used indicating that some nonterminal should be ultimately expanded out to the empty string.
FOLLOW/FOLLOW conflicts, where two productions indicating that a nonterminal should ultimately be expanded to the empty string conflict with one another.
Let's try this on your grammar by building the FIRST and FOLLOW sets for each of the nonterminals. Here, we get that
FIRST(X) = {a, b, z}
FIRST(Y) = {b, epsilon}
FIRST(Z) = {epsilon}
We also have that the FOLLOW sets are
FOLLOW(X) = {$}
FOLLOW(Y) = {z}
FOLLOW(Z) = {z}
From this, we can build the following LL(1) parsing table:
a b z $
X a Yz Yz
Y bZ eps
Z eps
Since we can build this parsing table with no conflicts, the grammar is LL(1).
To check if a grammar is LR(0) or SLR(1), we begin by building up all of the LR(0) configurating sets for the grammar. In this case, assuming that X is your start symbol, we get the following:
(1)
X' -> .X
X -> .Yz
X -> .a
Y -> .
Y -> .bZ
(2)
X' -> X.
(3)
X -> Y.z
(4)
X -> Yz.
(5)
X -> a.
(6)
Y -> b.Z
Z -> .
(7)
Y -> bZ.
From this, we can see that the grammar is not LR(0) because there is a shift/reduce conflicts in state (1). Specifically, because we have the shift item X → .a and Y → ., we can't tell whether to shift the a or reduce the empty string. More generally, no grammar with ε-productions is LR(0).
However, this grammar might be SLR(1). To see this, we augment each reduction with the lookahead set for the particular nonterminals. This gives back this set of SLR(1) configurating sets:
(1)
X' -> .X
X -> .Yz [$]
X -> .a [$]
Y -> . [z]
Y -> .bZ [z]
(2)
X' -> X.
(3)
X -> Y.z [$]
(4)
X -> Yz. [$]
(5)
X -> a. [$]
(6)
Y -> b.Z [z]
Z -> . [z]
(7)
Y -> bZ. [z]
The shift/reduce conflict in state (1) has been eliminated because we only reduce when the lookahead is z, which doesn't conflict with any of the other items.
If you have no FIRST/FIRST conflicts and no FIRST/FOLLOW conflicts, your grammar is LL(1).
An example of a FIRST/FIRST conflict:
S -> Xb | Yc
X -> a
Y -> a
By seeing only the first input symbol "a", you cannot know whether to apply the production S -> Xb or S -> Yc, because "a" is in the FIRST set of both X and Y.
An example of a FIRST/FOLLOW conflict:
S -> AB
A -> fe | ε
B -> fg
By seeing only the first input symbol "f", you cannot decide whether to apply the production A -> fe or A -> ε, because "f" is in both the FIRST set of A and the FOLLOW set of A (A can be parsed as ε/empty and B as f).
Notice that if you have no epsilon-productions you cannot have a FIRST/FOLLOW conflict.
Simple answer:A grammar is said to be an LL(1),if the associated LL(1) parsing table has atmost one production in each table entry.
Take the simple grammar A -->Aa|b.[A is non-terminal & a,b are terminals]
then find the First and follow sets A.
First{A}={b}.
Follow{A}={$,a}.
Parsing table for Our grammar.Terminals as columns and Nonterminal S as a row element.
a b $
--------------------------------------------
S | A-->a |
| A-->Aa. |
--------------------------------------------
As [S,b] contains two Productions there is a confusion as to which rule to choose.So it is not LL(1).
Some simple checks to see whether a grammar is LL(1) or not.
Check 1: The Grammar should not be left Recursive.
Example: E --> E+T. is not LL(1) because it is Left recursive.
Check 2: The Grammar should be Left Factored.
Left factoring is required when two or more grammar rule choices share a common prefix string.
Example: S-->A+int|A.
Check 3:The Grammar should not be ambiguous.
These are some simple checks.
LL(1) grammar is Context free unambiguous grammar which can be parsed by LL(1) parsers.
In LL(1)
First L stands for scanning input from Left to Right. Second L stands
for Left Most Derivation. 1 stands for using one input symbol at each
step.
For Checking grammar is LL(1) you can draw predictive parsing table. And if you find any multiple entries in table then you can say grammar is not LL(1).
Their is also short cut to check if the grammar is LL(1) or not .
Shortcut Technique
With these two steps we can check if it LL(1) or not.
Both of them have to be satisfied.
1.If we have the production:A->a1|a2|a3|a4|.....|an.
Then,First(a(i)) intersection First(a(j)) must be phi(empty set)[a(i)-a subscript i.]
2.For every non terminal 'A',if First(A) contains epsilon
Then First(A) intersection Follow(A) must be phi(empty set).

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for regular expression. I use the strategy similar to arithmetic expression evaluation (i know that there are ways like recursive descent), that is, use two stack, the OPND stack and the OPTR stack, then to process.
I use different kind of node to represent different kind of RE. For example, the SymbolExpression, the CatExpression, the OrExpression and the StarExpression, all of them are derived from RegularExpression.
So the OPND stack stores the RegularExpression*.
while(c || optr.top()):
if(!isOp(c):
opnd.push(c)
c = getchar();
else:
switch(precede(optr.top(), c){
case Less:
optr.push(c)
c = getchar();
case Equal:
optr.pop()
c = getchar();
case Greater:
pop from opnd and optr then do operation,
then push the result back to opnd
}
But my primary question is, in typical RE, the cat operator is implicit.
a|bc represents a|b.c, (a|b)*abb represents (a|b)*.a.b.b. So when meeting an non-operator, how should i do to determine whether there's a cat operator or not? And how should i deal with the cat operator, to correctly implement the conversion?
Update
Now i've learn that there is a kind of grammar called "operator precedence grammar", its evaluation is similar to arithmetic expression's. It require that the pattern of the grammar cannot have the form of S -> ...AB...(A and B are non-terminal). So i guess that i just cannot directly use this method to parse the regular expression.
Update II
I try to design a LL(1) grammar to parse the basic regular expression.
Here's the origin grammar.(\| is the escape character, since | is a special character in grammar's pattern)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursive, import new Variable
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
now, for pattern F -> P* | P, import P'
P' -> * | ε
F -> PP'
However, the pattern T' -> FT' | ε has problem. Consider case (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, our human know that we should substitute the Variable T' with T' -> ε, but program will just call T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should i rewrite it to make it suitable for the recursive descendent method.
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {), i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatentation. Similarly at the next step
(a|b)*abb
^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.

fixing a grammar to LR(0)

Question:
Given the following grammar, fix it to an LR(O) grammar:
S -> S' $
S'-> aS'b | T
T -> cT | c
Thoughts
I've been trying this for quite sometime, using automatic tools for checking my fixed grammars, with no success. Our professor likes asking this kind of questions on test without giving us a methodology for approaching this (except for repeated trying). Is there any method that can be applied to answer these kind of questions? Can anyone show this method can be applied on this example?
I don't know of an automatic procedure, but the basic idea is to defer decisions. That is, if at a particular state in the parse, both shift and reduce actions are possible, find a way to defer the reduction.
In the LR(0) parser, you can make a decision based on the token you just shifted, but not on the token you (might be) about to shift. So you need to move decisions to the end of productions, in a manner of speaking.
For example, your language consists of all sentences { ancmbn$ | n ≥ 0, m > 0}. If we restrict that to n > 0, then an LR(0) grammar can be constructed by deferring the reduction decision to the point following a b:
S -> S' $.
S' -> U | a S' b.
U -> a c T.
T -> b | c T.
That grammar is LR(0). In the original grammar, at the itemset including T -> c . and T -> c . T, both shift and reduce are possible: shift c and reduce before b. By moving the b into the production for T, we defer the decision until after the shift: after shifting b, a reduction is required; after c, the reduction is impossible.
But that forces every sentence to have at least one b. It omits sentences for which n = 0 (that is, the regular language c*$). That subset has an LR(0) grammar:
S -> S' $.
S' -> c | S' c.
We can construct the union of these two languages in a straight-forward manner, renaming one of the S's:
S -> S1' $ | S2' $.
S1' -> U | a S1' b.
U -> a c T.
T -> b | c T.
S2' -> c | S2' c.
This grammar is LR(0), but the form in which the end-of-input sentinel $ has been included seems to be cheating. At least, it violates the rule for augmented grammars, because an augmented grammar's base rule is always S -> S' $ where S' and $ are symbols not used in the original grammar.
It might seem that we could avoid that technicality by right-factoring:
S -> S' $
S' -> S1' | S2'
Unfortunately, while that grammar is still deterministic, and does recognise exactly the original language, it is not LR(0).
(Many thanks to #templatetypedef for checking the original answer, and identifying a flaw, and also to #Dennis, who observed that c* was omitted.)

How to identify whether a grammar is LL(1), LR(0) or SLR(1)?

How do you identify whether a grammar is LL(1), LR(0), or SLR(1)?
Can anyone please explain it using this example, or any other example?
X → Yz | a
Y → bZ | ε
Z → ε
To check if a grammar is LL(1), one option is to construct the LL(1) parsing table and check for any conflicts. These conflicts can be
FIRST/FIRST conflicts, where two different productions would have to be predicted for a nonterminal/terminal pair.
FIRST/FOLLOW conflicts, where two different productions are predicted, one representing that some production should be taken and expands out to a nonzero number of symbols, and one representing that a production should be used indicating that some nonterminal should be ultimately expanded out to the empty string.
FOLLOW/FOLLOW conflicts, where two productions indicating that a nonterminal should ultimately be expanded to the empty string conflict with one another.
Let's try this on your grammar by building the FIRST and FOLLOW sets for each of the nonterminals. Here, we get that
FIRST(X) = {a, b, z}
FIRST(Y) = {b, epsilon}
FIRST(Z) = {epsilon}
We also have that the FOLLOW sets are
FOLLOW(X) = {$}
FOLLOW(Y) = {z}
FOLLOW(Z) = {z}
From this, we can build the following LL(1) parsing table:
a b z $
X a Yz Yz
Y bZ eps
Z eps
Since we can build this parsing table with no conflicts, the grammar is LL(1).
To check if a grammar is LR(0) or SLR(1), we begin by building up all of the LR(0) configurating sets for the grammar. In this case, assuming that X is your start symbol, we get the following:
(1)
X' -> .X
X -> .Yz
X -> .a
Y -> .
Y -> .bZ
(2)
X' -> X.
(3)
X -> Y.z
(4)
X -> Yz.
(5)
X -> a.
(6)
Y -> b.Z
Z -> .
(7)
Y -> bZ.
From this, we can see that the grammar is not LR(0) because there is a shift/reduce conflicts in state (1). Specifically, because we have the shift item X → .a and Y → ., we can't tell whether to shift the a or reduce the empty string. More generally, no grammar with ε-productions is LR(0).
However, this grammar might be SLR(1). To see this, we augment each reduction with the lookahead set for the particular nonterminals. This gives back this set of SLR(1) configurating sets:
(1)
X' -> .X
X -> .Yz [$]
X -> .a [$]
Y -> . [z]
Y -> .bZ [z]
(2)
X' -> X.
(3)
X -> Y.z [$]
(4)
X -> Yz. [$]
(5)
X -> a. [$]
(6)
Y -> b.Z [z]
Z -> . [z]
(7)
Y -> bZ. [z]
The shift/reduce conflict in state (1) has been eliminated because we only reduce when the lookahead is z, which doesn't conflict with any of the other items.
If you have no FIRST/FIRST conflicts and no FIRST/FOLLOW conflicts, your grammar is LL(1).
An example of a FIRST/FIRST conflict:
S -> Xb | Yc
X -> a
Y -> a
By seeing only the first input symbol "a", you cannot know whether to apply the production S -> Xb or S -> Yc, because "a" is in the FIRST set of both X and Y.
An example of a FIRST/FOLLOW conflict:
S -> AB
A -> fe | ε
B -> fg
By seeing only the first input symbol "f", you cannot decide whether to apply the production A -> fe or A -> ε, because "f" is in both the FIRST set of A and the FOLLOW set of A (A can be parsed as ε/empty and B as f).
Notice that if you have no epsilon-productions you cannot have a FIRST/FOLLOW conflict.
Simple answer:A grammar is said to be an LL(1),if the associated LL(1) parsing table has atmost one production in each table entry.
Take the simple grammar A -->Aa|b.[A is non-terminal & a,b are terminals]
then find the First and follow sets A.
First{A}={b}.
Follow{A}={$,a}.
Parsing table for Our grammar.Terminals as columns and Nonterminal S as a row element.
a b $
--------------------------------------------
S | A-->a |
| A-->Aa. |
--------------------------------------------
As [S,b] contains two Productions there is a confusion as to which rule to choose.So it is not LL(1).
Some simple checks to see whether a grammar is LL(1) or not.
Check 1: The Grammar should not be left Recursive.
Example: E --> E+T. is not LL(1) because it is Left recursive.
Check 2: The Grammar should be Left Factored.
Left factoring is required when two or more grammar rule choices share a common prefix string.
Example: S-->A+int|A.
Check 3:The Grammar should not be ambiguous.
These are some simple checks.
LL(1) grammar is Context free unambiguous grammar which can be parsed by LL(1) parsers.
In LL(1)
First L stands for scanning input from Left to Right. Second L stands
for Left Most Derivation. 1 stands for using one input symbol at each
step.
For Checking grammar is LL(1) you can draw predictive parsing table. And if you find any multiple entries in table then you can say grammar is not LL(1).
Their is also short cut to check if the grammar is LL(1) or not .
Shortcut Technique
With these two steps we can check if it LL(1) or not.
Both of them have to be satisfied.
1.If we have the production:A->a1|a2|a3|a4|.....|an.
Then,First(a(i)) intersection First(a(j)) must be phi(empty set)[a(i)-a subscript i.]
2.For every non terminal 'A',if First(A) contains epsilon
Then First(A) intersection Follow(A) must be phi(empty set).

Resources