Resolving left-recursion in my grammar - parsing

My grammar has a case of left-recursion in the sixth production rule.
I resolved this by replacing Rule 6 and 7 like this:
I couldn't find any indirect left recursions in this grammar.
The only thing that bothers me is the final production rule, which has a terminal surrounded by two non-terminals.
My two questions are:
Is my resolved left recursion correct?
Is the final production rule a left recursion? I am not sure how to
treat this special case.

Yes, your resolution is correct. You may want to remove the epsilon rule for ease of use, but the accepted strings are correct.
X -> -
X -> -Z
Z -> +
Z -> +Z
Z -> X + Y
... and Y is of the form 0* 1 (no syntax collisions)
As a check, note that you could now replace this final rule with two new rules, one for each expansion of X:
Z -> - + Y
Z -> -Z + Y
This removes X entirely from the Z rules, and each Z rule would then begin with a terminal.
No, your final production rule is no longer left-recursive. X now must resolve to a string beginning with a terminal.
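If you want to check this mechanically, here is a rough Python sketch that reports which non-terminals can reach themselves in leftmost position. The dictionary encoding, the function names, and the rules chosen for Y (read off from "0* 1") are my own assumptions, not something taken from the question.

```python
def nullable_set(grammar):
    """Non-terminals that can derive the empty string (fixed-point iteration)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def left_recursive(grammar):
    """Return the non-terminals A for which A can derive a string starting with A."""
    nullable = nullable_set(grammar)
    # starts[A]: non-terminals that can appear leftmost after one step from A
    starts = {head: set() for head in grammar}
    for head, bodies in grammar.items():
        for body in bodies:
            for sym in body:
                if sym not in grammar:   # a terminal blocks anything behind it
                    break
                starts[head].add(sym)
                if sym not in nullable:
                    break
    # Transitive closure: follow leftmost non-terminals until nothing changes.
    changed = True
    while changed:
        changed = False
        for head in grammar:
            for mid in list(starts[head]):
                new = starts[mid] - starts[head]
                if new:
                    starts[head] |= new
                    changed = True
    return {head for head in grammar if head in starts[head]}


rewritten = {
    "X": [["-"], ["-", "Z"]],
    "Z": [["+"], ["+", "Z"], ["X", "+", "Y"]],
    "Y": [["1"], ["0", "Y"]],   # assumed encoding of "0* 1"
}
print(left_recursive(rewritten))   # no left-recursive non-terminals
print(left_recursive({"A": [["B", "q"]], "B": [["A", "r"]]}))  # indirect case
```

The first call comes back empty because every path from Z through X hits the terminal - before it can return to Z.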
I have to admit, though, I'm curious about what use the language has. :-)

Related

How do I expand the item set for this grammar?

I have this grammar
E -> E + i
E -> i
The augmented grammar
E' -> E
E -> E + i
E -> i
Now I try to expand the item set 0
I0)
E' -> .E
+E -> .E + i
+E -> .i
Then, since I have .E in I0 I would expand it but then I will get another E rule, and so on, this is my first doubt.
Assuming that this is alright the next item sets are
I0)
E' -> .E
+E -> .E + i
+E -> .i
I1) (I moved the dot from I0, no variables at rhs of dot)
E' -> E.
E -> E. + i
E -> i.
I2) (I moved the dot from I1, no vars at rhs of dot)
E -> E +. i
I3) (I moved the dot from I2, also no vars)
E -> E + i.
Then I will have this DFA
I0 -(E, i)-> I1 -(+)-> I2 -(i)-> I3
| |
+-(∅)-> acpt <-(∅)--+
I'm missing something because E -> E + i must accept i + i + .. but the DFA doesn't go back to I0, so it seems wrong to me. My guess is that it should have an I0 to I0 transition, but then I don't know what to do with the dot.
What you call the "expansion" of the item set is actually a closure; that's how it's described in all the descriptions of the algorithm I've seen (at least in textbooks). Like any closure operation, you just keep on doing the transformation until you reach a fixed-point; once you've included the productions for E, they're included.
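A minimal Python sketch of that fixed-point closure, using the augmented grammar from the question. Encoding an item as (production index, dot position) and the names closure, goto and show are illustrative choices of mine.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "i")),
    ("E", ("i",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """LR(0) closure: keep adding items until nothing new appears."""
    result = set(items)
    worklist = list(items)
    while worklist:
        prod, dot = worklist.pop()
        body = GRAMMAR[prod][1]
        if dot < len(body) and body[dot] in NONTERMINALS:
            for index, (head, _) in enumerate(GRAMMAR):
                item = (index, 0)
                # An item already in the set is not added again, so the
                # left-recursive rule E -> .E + i cannot loop forever.
                if head == body[dot] and item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {
        (prod, dot + 1)
        for prod, dot in items
        if dot < len(GRAMMAR[prod][1]) and GRAMMAR[prod][1][dot] == symbol
    }
    return closure(moved)


def show(item):
    prod, dot = item
    head, body = GRAMMAR[prod]
    return f"{head} -> " + " ".join(list(body[:dot]) + ["."] + list(body[dot:]))


i0 = closure({(0, 0)})
for item in sorted(i0):
    print(show(item))
print([show(i) for i in sorted(goto(i0, "E"))])
print([show(i) for i in sorted(goto(i0, "i"))])
```

With this construction the move on E and the move on i from I0 lead to two different item sets, rather than the single I1 shown in the question.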
But the essential point is that you're not building a DFA. You're building a pushdown automaton, and the DFA is just one part of it. The DFA is used for shift operations; when you shift a new terminal (because the current parse stack is not a handle), you do a state transition according to the DFA. But you also push the current state onto the PDA's stack.
The interesting part is what happens when the automaton decides to perform a reduction, which replaces the right-hand side of a production with its left-hand side non-terminal. (The right-hand side at the top of the stack is called a "handle".) To do the reduction, you unwind the stack, popping each right-hand side symbol (and the corresponding DFA state) until you reach the beginning of the production. What that does is rewind the DFA to the state it was in before it shifted the first symbol from the right-hand side. (Note that it is only at this point that you know for sure which production was used.) With the DFA thus reset, you can now shift the non-terminal which was encountered, do the corresponding DFA transition, and continue with the parse.
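To make the stack unwinding concrete, here is a small Python sketch of the shift/reduce loop for the grammar in the question. The ACTION and GOTO tables are ones I worked out by hand from the item sets, the state numbers are my own (0 is the start, 1 is after E, 2 after i, 3 after E +, 4 after E + i), and reductions are attempted only when the next token is + or the end marker $. Treat it as an illustration rather than generated parser output.

```python
PRODUCTIONS = {
    1: ("E", 3),   # E -> E + i   (three symbols on the right-hand side)
    2: ("E", 1),   # E -> i
}
ACTION = {
    (0, "i"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "i"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}


def parse(tokens):
    """Return the list of production numbers used, in reduction order."""
    stack = [0]                      # stack of DFA states
    tokens = list(tokens) + ["$"]
    pos = 0
    reductions = []
    while True:
        state = stack[-1]
        action = ACTION.get((state, tokens[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {state}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)        # push the new state, consume the token
            pos += 1
        elif kind == "reduce":
            head, length = PRODUCTIONS[arg]
            del stack[-length:]      # rewind to the state before the handle
            stack.append(GOTO[(stack[-1], head)])   # then "shift" the non-terminal
            reductions.append(arg)
        else:
            return reductions


print(parse("i + i + i".split()))
```

After each reduction the stack is back at state 0, and the transition on E leads to state 1 again. That is why no edge back to I0 is needed in the DFA itself.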
The basis for this procedure is the fact that the parser stack is at all times a "viable prefix"; that is, a sequence of symbols which are the prefix of some right sentential form which can be derived from the start symbol. What's interesting about the set of viable prefixes for a context-free grammar is that it is a regular language, and consequently can be recognised by a DFA. The reduction procedure given above precisely represents this recognition procedure when handles are "pruned" (to use Knuth's original vocabulary).
In that sense, it doesn't really matter what procedure is used to determine which handle is to be pruned, as long as it provides a valid answer. You could, for example, fork the parse every time a potential handle is noticed at the top of the stack, and continue in parallel with both forks. With clever stack management, this parallel search can be done in worst-case O(n^3) time for any context-free grammar (and this can be reduced if the grammar is not ambiguous). That's a very rough description of Earley parsers.
But in the case of an LR(k) parser, we require that the grammar be unambiguous, and we also require that we can identify a reduction by looking at no more than k more symbols from the input stream, which is an O(1) operation since k is fixed. If at each point in the parse we know whether to reduce or not, and if so which reduction to choose, then the reductions can be implemented as I outlined above. Each reduction can be performed in O(1) time for a fixed grammar (since the maximum size of a right-hand side in a particular grammar is fixed), and since the number of reductions in a parse is linear in the size of the input, the entire parse can be done in linear time.
That was all a bit informal, but I hope it serves as an intuitive explanation. If you're interested in the formal proof, Donald Knuth's original 1965 paper (On the Translation of Languages from Left to Right) is easy to find and highly readable as these things go.

What is a "Production" in plain English?

I can read on Wikipedia the formal definition of a Production, however when you start reading that article, it makes an assumption about prior knowledge.
Wikipedia defines it as follows:
A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences.
This assumes that I know and understand what a rewrite rule is. I don't, and if I click the link, I get into another fairly technical explanation.
Can someone explain to me in plain English what a Production actually is?
Note: I have made many attempts to understand this, but I don't think I've succeeded. From what I can tell it rewrites the given string in terms of grammar rules. Not sure if I'm correct.
To explain what a production is I'd like to introduce a bit of context first.
The dragon book states that a context free grammar has 4 components:
a set of terminal symbols (tokens)
a set of non-terminal symbols (syntactic variables)
a set of productions of the form: non-term --> sequence of terminals and non-terminals
a non-terminal symbol designated as the start symbol
It is also said that parsing is the problem of taking a string of terminals (the source code) and figuring out what are the steps required to derive this string of terminals from the start symbol of the grammar.
Now that this has been said, a production is essentially a possible (intermediate) step. I say possible because some symbols can derive into different sequences.
For example, let's make a simple grammar to represent an arbitrarily long sequence of a's ending with a b. The 4 components of this grammar would be:
Terminals: a, b
Non-terminals: S, X
Rules: S --> X, X --> aX, X --> ab
Start symbol: S
From the description I gave above, "aaaab" should be derivable from this grammar. Let's see if that holds up. We start from the start symbol and then apply productions until either a) we get the final sequence, or b) we exhaust all possibilities without succeeding (meaning the sequence is not "grammatically correct").
S
X (after applying S --> X)
aX (after applying X --> aX)
aaX (after applying X --> aX)
aaaX (after applying X --> aX)
aaaab (after applying X --> ab)
And we're done, we got the original sequence. So as you can see we re-wrote the non-terminal symbols by applying rules (one of them we applied recursively) which transformed the sequence into a new sequence of symbols at every step and we did so until we had the final sequence.
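The same derivation can be replayed in a few lines of Python. This is only a sketch: it relies on every symbol in this grammar being a single character, and the helper name apply_production is made up for the example.

```python
def apply_production(form, head, body):
    """Rewrite the leftmost occurrence of the non-terminal `head` with `body`."""
    index = form.index(head)
    return form[:index] + body + form[index + 1:]


# The productions applied above, in order.
steps = [("S", "X"), ("X", "aX"), ("X", "aX"), ("X", "aX"), ("X", "ab")]

form = "S"
history = [form]
for head, body in steps:
    form = apply_production(form, head, body)
    history.append(form)

print(" => ".join(history))
# S => X => aX => aaX => aaaX => aaaab
```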
A rewrite rule is a method of replacing subterms of a formula with other terms. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.
An example of a rewrite rule could look like:
A → B
Now as for what this actually does! You are right on your note, take for example a list of things (and 2 rewrite rules):
X, Y, Z
X → Y
Y → Z
Which would result in:
Z, Z, Z
A production rule is a rewrite rule because it is a method of replacing subterms of a formula (probably a string in your case). A production rule could look like:
X, Y, Z
X → aX
By using the rule in such a way it becomes possible to apply recursion (create new sequences) as it will keep replacing itself:
aX, Y, Z
aaX, Y, Z
aaaX, Y, Z
As for the question you are asking, you could say: "A production rule is a replacement rule for formulas that uses recursion to create new sequences".

Explanation on this FIRST function

LL(1) Grammar:
(1) Var -> ID DimList
(2) DimList -> ε DimList'
(3) DimList' -> Dim DimList'
(4) DimList' -> ε
(5) Dim -> [ CONST ]
And, in the script that I am reading, it says that the function FIRST(ε DimList') gives {#, [}. But, how?
My guess is that since the right side of (2) begins with ε, it skips epsilon and takes FIRST(DimList') which is, considering (3) and (5), equal to {[}, BUT also, because of (4), takes FOLLOW(DimList') which is {#}.
Other way it could be is that, since (2) begins with ε it skips epsilon and takes FIRST(DimList') BUT ALSO takes FOLLOW(DimList) from (2)...
First one makes more sense to me, though I'm still in the process of learning basics of LL(1) grammars so I would appreciate if someone takes the time to make this clear, thank you.
EDIT: And, of course, it could be that neither of these is true.
The usual definition of the FIRST function would result in FIRST(Dimlist) (or, if you like, FIRST(ε Dimlist')) being {ε, [}. ε is in FIRST(ε Dimlist') because both ε and Dimlist' are nullable. [ is an element because it could be the first symbol in a derivation of ε Dimlist', which is the same as saying that it could be the first symbol in a derivation of Dimlist'.
Another way of saying this is that:
FIRST(ε Dimlist' #) = {#, [}
We usually then define the function PREDICT:
PREDICT(ω) = FIRST(ω FOLLOW(ω))
and we can see that
PREDICT(Dimlist) = FIRST(Dimlist FOLLOW(Dimlist)) = {#, [}
Here, FIRST(ω) is the set of strings of terminals (of length ≤ 1) which could appear at the beginning of a derivation of ω, while PREDICT(ω) is the set of strings of terminals (of length ≤ 1) which could be present in the input when a derivation of ω is possible.
It's not uncommon to confuse FIRST and PREDICT, but it's better to keep the difference straight.
Note that all of these functions can be generalized to strings of maximum length k, which are usually written FIRST_k, FOLLOW_k and PREDICT_k, and the definition of PREDICT_k is similar to the above:
PREDICT_k(ω) = FIRST_k(ω FOLLOW_k(ω))
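For the k = 1 case, here is a rough Python sketch that computes FIRST, FOLLOW and a per-production PREDICT set for the grammar in the question. Several things are assumptions of mine: the dictionary encoding, writing the ε-production as an empty list, dropping the leading ε from rule (2), and treating # as the end marker that follows the start symbol Var.

```python
EPS = "ε"
GRAMMAR = {
    "Var": [["ID", "DimList"]],
    "DimList": [["DimList'"]],               # rule (2) with the ε dropped
    "DimList'": [["Dim", "DimList'"], []],   # [] stands for the ε-production
    "Dim": [["[", "CONST", "]"]],
}
START, END = "Var", "#"


def first_of(symbols, first):
    """FIRST of a symbol sequence, given the FIRST sets of the non-terminals."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)          # every symbol was nullable, or the sequence was empty
    return out


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of(body, first) - first[head]
                if new:
                    first[head] |= new
                    changed = True
    return first


def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add(END)
    changed = True
    while changed:
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in GRAMMAR:
                        continue
                    rest = first_of(body[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:          # the tail can vanish
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def predict(head, body, first, follow):
    """Tokens that select the production head -> body."""
    f = first_of(body, first)
    return (f - {EPS}) | (follow[head] if EPS in f else set())


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FIRST["DimList"]))
print(sorted(predict("DimList", ["DimList'"], FIRST, FOLLOW)))
```

FIRST(DimList) contains ε and [, while the PREDICT set replaces ε with whatever can follow DimList, which is # here.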

bison shift/reduce conflict

in the following simple grammar, on the conflict at state 4,
can 'shift' become the taken action without changing the rules ?
(I thought that by default shift was bison's preferred action)
%token one two three
%%
start : a;
a : X Y Z;
X : one;
Z : two | three;
Y : two | ;
%%
shift is bison's preferred action, and you can see in the state output that it will shift two in state 4. It will still report a shift-reduce conflict, but you can take that as a warning if you like. (See %expect.) You'd probably be better off fixing the grammar:
start : a;
a : X Z | X Y Z;
X : one;
Y : two;
Z : two | three;
Shift is the default, but that results in the generated parser giving an error for the input one two so that is probably not what you want. Instead, follow rici's advice and fix the grammar.
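Both grammars are small and non-recursive, so one way to convince yourself that the rewritten grammar describes the same inputs is to enumerate every sentence of each. A rough Python sketch follows. The dictionary encoding and the sentences helper are my own illustration and have nothing to do with bison's internals.

```python
from itertools import product


def sentences(grammar, symbol):
    """All token sequences derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in grammar:          # a terminal derives just itself
        return {(symbol,)}
    out = set()
    for body in grammar[symbol]:
        parts = [sentences(grammar, sym) for sym in body]
        for combo in product(*parts):  # an empty body yields one empty sentence
            out.add(tuple(tok for part in combo for tok in part))
    return out


original = {
    "start": [["a"]],
    "a": [["X", "Y", "Z"]],
    "X": [["one"]],
    "Z": [["two"], ["three"]],
    "Y": [["two"], []],
}
fixed = {
    "start": [["a"]],
    "a": [["X", "Z"], ["X", "Y", "Z"]],
    "X": [["one"]],
    "Y": [["two"]],
    "Z": [["two"], ["three"]],
}

print(sorted(" ".join(s) for s in sentences(fixed, "start")))
print(sentences(original, "start") == sentences(fixed, "start"))
```

Both grammars list one two as a valid sentence, which is exactly the input the shift-by-default parser built from the original rules rejects.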

Difference between Left Factoring and Left Recursion

What is the difference between Left Factoring and Left Recursion ? I understand that Left factoring is a predictive top down parsing technique. But I get confused when I hear these two terms.
Left factoring is removing the common left factor that appears in two productions of the same non-terminal. It is done to avoid backtracking by the parser. Suppose the parser has a look-ahead, and consider this example:
A -> qB | qC
where A, B and C are non-terminals and q is a sentence.
In this case, the parser will be confused as to which of the two productions to choose and it might have to backtrack. After left factoring, the grammar is converted to:
A -> qD
D -> B | C
In this case, a parser with a look-ahead will always choose the right production.
Left recursion occurs when the left-most symbol in a production of a non-terminal is that non-terminal itself (direct left recursion) or, through some other non-terminal definitions, rewrites to that non-terminal again (indirect left recursion).
Consider these examples:
(1) A -> Aq (direct)
(2) A -> Bq
B -> Ar (indirect)
Left recursion has to be removed if the parser performs top-down parsing.
Left Factoring is a grammar transformation technique. It consists of "factoring out" prefixes which are common to two or more productions.
For example, going from:
A → α β | α γ
to:
A → α A'
A' → β | γ
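A minimal Python sketch of one factoring step. As a simplification it treats the common prefix α as a single symbol and factors only the first symbol of each alternative. The function name and the list-of-lists encoding are hypothetical.

```python
from collections import defaultdict


def left_factor_once(head, alternatives):
    """Factor out a shared first symbol from the alternatives of `head`.

    `alternatives` is a list of symbol lists. Only one symbol is factored per
    call; apply it again to the new non-terminal for longer common prefixes.
    """
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0] if alt else None].append(alt)

    rules = {head: []}
    fresh = head
    for first, group in groups.items():
        if first is None or len(group) == 1:
            rules[head].extend(group)      # nothing shared, keep as is
            continue
        fresh += "'"                       # new non-terminal: A', A'', ...
        rules[head].append([first, fresh])
        rules[fresh] = [alt[1:] for alt in group]
    return rules


print(left_factor_once("A", [["α", "β"], ["α", "γ"]]))
# {'A': [['α', "A'"]], "A'": [['β'], ['γ']]}
```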
Left Recursion is a property a grammar has whenever you can derive from a given variable (non terminal) a rhs that begins with the same variable, in one or more steps.
For example:
A → A α
or
A → B α
B → A γ
There is a grammar transformation technique called Elimination of left recursion, which provides a method to generate, given a left recursive grammar, another grammar that is equivalent and is not left recursive.
The relationship/confusion between both terms probably derives from the fact that both transformation techniques may need to be applied to a grammar before being able to derive a predictive top down parser for it.
This is the way I've seen the two terms used:
Left recursion: when one or more productions can be reached from themselves with no tokens consumed in-between.
Left factoring: a process of transformation, turning the grammar from a left-recursive form to an equivalent non-left-recursive form.
Left factoring:
Consider the grammar:
A --> ab1 | ab2 | ab3
1) Every alternative shares a common prefix, so whichever production we choose, there is no guarantee that we won't need to backtrack.
2) The grammar is non-deterministic: we cannot pick a production and be sure that it leads to the desired string and the correct parse tree.
If we rewrite the grammar so that the choice is deterministic, while still generating every string the original could, we get:
A --> aA'
A' --> b1 | b2 | b3
Now, if we are asked to build the parse tree for the string ab2, no backtracking is needed: we can always choose the correct production when we reach A', so we generate the correct parse tree.
Left recursion:
A --> Aa | b
Here the left child of A is always A whenever we choose the first production. This is left recursion, because A keeps calling itself over and over again.
The strings generated by this grammar are:
ba*
Since a top-down parser cannot handle this form, we eliminate the left recursion by writing:
A --> bA'
A' --> ε | aA'
Now there is no left recursion, and the grammar still generates ba*.
Left Recursion:
A grammar is left recursive if it has a nonterminal A with productions of the form A -> Aα | β, where α and β are sequences of terminals and nonterminals that do not start with A.
If left recursion exists in the grammar, a top-down parser falls into an infinite loop, because to match A it first tries to match A again without consuming any input.
We can eliminate the above left recursion by rewriting the offending production as:
A -> βA'
A' -> αA' | epsilon
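The rewrite above is mechanical enough to sketch in Python. The function name and the list-of-lists encoding are my own, the empty list stands for epsilon, and only direct left recursion is handled.

```python
def eliminate_direct_left_recursion(head, alternatives):
    """Turn A -> Aα | β into A -> βA' and A' -> αA' | ε."""
    # The α parts: what follows the leading A in each left-recursive alternative.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    # The β parts: alternatives that do not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != head]
    if not recursive:
        return {head: alternatives}        # nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],   # [] is ε
    }


print(eliminate_direct_left_recursion("A", [["A", "a"], ["b"]]))
# {'A': [['b', "A'"]], "A'": [['a', "A'"], []]}
```

Applied to the grammar from the question at the top of the page, E -> E + i | i becomes E -> i E' and E' -> + i E' | ε.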
Left Factoring: Left factoring is required to eliminate non-determinism from a grammar. Suppose we have the grammar S -> abS | aSb
Here, both alternatives for S begin with the same terminal a, so the parser cannot tell which one to choose, which makes the grammar non-deterministic. We can rewrite the production to defer the decision as:
S -> aS'
S' -> bS | Sb
Thus, S' can be replaced by either bS or Sb.
Here is a simple way to differentiate between both terms:
Left Recursion:
When the leftmost element of a production's right-hand side is the producing element itself (a non-terminal).
e.g. A -> Aα / Aβ
Left Factoring:
When the same leftmost element (a terminal) is repeated across several alternatives of the same production.
e.g. A -> αB / αC
Furthermore,
If a grammar is left recursive, it might send the parser into an infinite loop, hence we need to eliminate left recursion.
If a grammar needs left factoring, it confuses the parser, hence we need to left-factor it as well.
Left recursion: when the non-terminal on the left-hand side is the same as the leftmost non-terminal on the right-hand side.
Example:
A -> Aα | B
We can remove the left recursion by rewriting this production as follows:
A -> BA'
A' -> αA' | ε
Left factoring means that the productions should not be non-deterministic.
Example:
A -> αA | αB | αC

Resources