Epsilon(ε) productions and LR(0) grammars and LL(1) grammars - parsing

In many places (for example, in this answer here), I have seen it stated that an LR(0) grammar cannot contain ε-productions.
Also in Wikipedia I have seen statements like: An ε free LL(1) grammar is also SLR(1).
Now the problem which I am facing is that I cannot reason out the logic behind these statements.
Well, I know that LR(0) grammars accept the languages accepted by a DPDA by empty stack, i.e. the language they accept must have the prefix property. [This prefix property can, however, be dealt with if we assume end markers, and as such, given any language, the prefix property shall always be satisfied. Many texts, like Theory of Computation by Sipser, assume this end marker to simplify their argument.] That being said, we can say (informally?) that a grammar is LR(0) if there is no state in the canonical collection of LR(0) items that has a shift-reduce conflict or a reduce-reduce conflict.
With this background, I tried to consider the following grammar:
S -> Aa
A -> ε
[Figure: canonical collection of LR(0) items]
In the above DFA, I find that there is no state which has a shift-reduce conflict or reduce-reduce conflict.
So this grammar should be LR(0) as per my analysis. But it also has ε production.
Isn't this example contradicting the statement:
"no grammar with ε productions can be LR(0)"
I guess if I know the logic behind the above quoted statement then I can understand the concept better.
Actually my main problem arose with the statement :
An ε free LL(1) grammar is also SLR(1).
When I asked one of my friends, he gave the argument that as the LL(1) grammar is ε free hence it is LR(0) and hence it is SLR(1).
But I could not understand his logic either. When I asked him about his reasoning, he started sharing posts saying that a "grammar with ε productions can never be LR(0)".
But personally I could not think of any logic as to how an "ε free LL(1) grammar is SLR(1)". Is it really related to the above property that a "grammar with ε productions cannot be LR(0)"? If so, please help me out. If not, should I consider asking a separate question for the second confusion?
I have got my concepts of compiler design from the dragon book by Ullman only. Also the knowledge of TOC from Ullman and from few other texts like Sipser, Linz.

A notable feature of your grammar is that A could just be eliminated. It serves absolutely no purpose. (By "eliminated", I mean simply removing all references to it; leaving productions otherwise intact.)
It is true that its existence doesn't preclude the grammar from being LR(0). Similarly, a grammar with an unreachable non-terminal and an ε-production for that non-terminal could also be LR(0).
So it would be more accurate to say that a grammar cannot be LR(0) if it has a productive non-terminal with both an ε-production and some other productive production. But since we usually only consider reduced grammars without pointless non-terminals, I'm not sure that this additional pedantry serves much purpose.
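The distinction can be checked mechanically. Below is a minimal sketch of the canonical LR(0) item-set construction, not taken from any particular textbook or tool. The function name `lr0_conflicts`, the grammar encoding (a dict of tuples, with `()` standing for ε) and the second test grammar are assumptions made for illustration. It also assumes the end-marker convention mentioned in the question, so the accept item S' -> S . is not counted as a reduction.

```python
from collections import deque

def lr0_conflicts(grammar, start):
    """Return the conflicts found in the canonical collection of LR(0) items.

    grammar maps each non-terminal to a list of right-hand sides (tuples of
    symbols); the empty tuple () stands for an ε-production.
    """
    augmented = start + "'"            # hypothetical name for the added start symbol
    prods = dict(grammar)
    prods[augmented] = [(start,)]

    def closure(items):
        items = set(items)
        work = list(items)
        while work:
            _, rhs, dot = work.pop()
            if dot < len(rhs) and rhs[dot] in prods:
                for alt in prods[rhs[dot]]:
                    item = (rhs[dot], alt, 0)
                    if item not in items:
                        items.add(item)
                        work.append(item)
        return frozenset(items)

    def goto(state, symbol):
        return closure((lhs, rhs, dot + 1) for lhs, rhs, dot in state
                       if dot < len(rhs) and rhs[dot] == symbol)

    initial = closure({(augmented, (start,), 0)})
    seen, queue, conflicts = {initial}, deque([initial]), []
    while queue:
        state = queue.popleft()
        # Assumption: the accept item S' -> S . is not counted as a reduction.
        reduces = [i for i in state if i[2] == len(i[1]) and i[0] != augmented]
        shifts = {rhs[dot] for _, rhs, dot in state
                  if dot < len(rhs) and rhs[dot] not in prods}
        if len(reduces) > 1:
            conflicts.append("reduce/reduce")
        if reduces and shifts:
            conflicts.append("shift/reduce")
        for symbol in {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}:
            target = goto(state, symbol)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return conflicts

# The grammar from the question: A has only the ε-production.
useless_epsilon = {"S": [("A", "a")], "A": [()]}
# Hypothetical variant: A has an ε-production and another productive production.
real_choice = {"S": [("A", "a")], "A": [(), ("b",)]}

print(lr0_conflicts(useless_epsilon, "S"))  # []
print(lr0_conflicts(real_choice, "S"))      # ['shift/reduce']
```

In the second grammar the initial state contains both the complete item A -> . and the item A -> . b, which is exactly the shift/reduce conflict described above.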
As for your question about ε-free LL(1) grammars, here's a rough outline:
If an ε-free grammar is not LR(0), then there is some state with both a shift and a reduce action. Since the grammar is ε-free, that state was reached by way of a shift or a goto. The previous state must then have had two different productions with the same FIRST set, contradicting the LL(1) condition.

Related

Parsing in compiler design

As far as I know, left recursion is not a problem for an LR parser. And I know that an ambiguous grammar can't be parsed by any kind of parser. So if I have an ambiguous grammar as follows, how can I remove the ambiguity so that I can check whether this grammar is SLR(1) or not?
E->E+E|E-E|(E)|id
And one more question: is left factoring needed to check whether a grammar is LL(1) or SLR(1)?
Any help will be appreciated.
Any parser generator you are likely to encounter will be able to handle the ambiguities in your grammar simply.
Your grammar produces shift/reduce conflicts. Unlike reduce/reduce conflicts, these are not necessarily a problem. The default action on a shift/reduce conflict in yacc-style parser generators is to shift, which resolves the conflict, although always shifting makes + and - group to the right. There are usually mechanisms (as in YACC or Bison) to suppress the resulting warning.
You can remove the conflicts in your grammar by setting up multiple levels of expressions so that you force the precedence of the operators.
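As a rough illustration of such layering, one possible unambiguous version of the grammar is E -> E + T | E - T | T with T -> ( E ) | id. This layering is an assumption, since the answer does not spell one out. The sketch below is a small hand-written parser for that layered grammar; it handles the left-recursive rule with a loop, and the function name `parse` and the tuple-based tree format are made up for the example. It shows that each input gets exactly one tree, with + and - grouping to the left.

```python
def parse(tokens):
    """Parse E -> E + T | E - T | T ;  T -> ( E ) | id  into a nested tuple."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def term():
        nonlocal pos
        tok = peek()
        if tok == "id":
            pos += 1
            return "id"
        if tok == "(":
            pos += 1
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr():
        nonlocal pos
        node = term()
        # The loop plays the role of the left-recursive rule E -> E op T.
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse("id - id + id".split()))      # ('+', ('-', 'id', 'id'), 'id')
print(parse("id - ( id + id )".split()))  # ('-', 'id', ('+', 'id', 'id'))
```

An LR parser generator can consume the layered grammar directly, left recursion included; the loop is only needed because this sketch is written top-down.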

Difference between: 'Eliminate left-recursion' and 'construct an equivalent unambiguous grammar'

For example:
R → R bar R|RR|R star|(R)|a|b
construct an equivalent unambiguous grammar:
R → S | R bar S
S → T | S T
T → U | T star
U → a | b | (R)
How would one eliminate left recursion for R → R bar R|RR|R star|(R)|a|b?
What's the difference between eliminating left recursion and constructing an equivalent unambiguous grammar?
An unambiguous grammar is one where for each string in the language, there is exactly one way to derive it from the grammar. In the context of compiler construction, the problem with an ambiguous grammar is that it's not obvious from the grammar what the parse tree for a given input string should be. Some tools solve this using their own rules for resolving ambiguities, while others simply require the grammar to be unambiguous.
A left-recursive grammar is one where the derivation for a given non-terminal can produce that same non-terminal again without first producing a terminal. This leads to infinite loops in recursive-descent-style parsers, but is no problem for shift-reduce parsers.
Note that an unambiguous grammar can still be left-recursive and a grammar without left recursion can still be ambiguous. Also note that depending on your tools, you may need to only remove ambiguity, but not left-recursion, or you may need to remove left-recursion, but not ambiguity (though an unambiguous grammar is generally preferable).
So the difference is that eliminating left recursion and eliminating ambiguity solve different problems and are necessary in different situations.
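To make the independence of the two properties concrete, here is a minimal sketch that finds left-recursive non-terminals by following leftmost symbols. The function name and the grammar encoding are assumptions for illustration, and the check assumes an ε-free grammar, which holds for both grammars in the question. Applied to the unambiguous grammar from the question, it shows that removing ambiguity did not remove left recursion.

```python
def left_recursive_nonterminals(grammar):
    """Return the non-terminals A with A =>+ A ... (assumes no ε-productions)."""
    # leftmost[A]: non-terminals that can appear as the first symbol derived from A
    leftmost = {a: {rhs[0] for rhs in alts if rhs and rhs[0] in grammar}
                for a, alts in grammar.items()}
    changed = True
    while changed:                      # transitive closure of the relation
        changed = False
        for a in grammar:
            for b in list(leftmost[a]):
                new = leftmost[b] - leftmost[a]
                if new:
                    leftmost[a] |= new
                    changed = True
    return {a for a in grammar if a in leftmost[a]}

ambiguous = {
    "R": [("R", "bar", "R"), ("R", "R"), ("R", "star"),
          ("(", "R", ")"), ("a",), ("b",)],
}
unambiguous = {
    "R": [("S",), ("R", "bar", "S")],
    "S": [("T",), ("S", "T")],
    "T": [("U",), ("T", "star")],
    "U": [("a",), ("b",), ("(", "R", ")")],
}

print(left_recursive_nonterminals(ambiguous))            # {'R'}
print(sorted(left_recursive_nonterminals(unambiguous)))  # ['R', 'S', 'T']
```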

How to solve this Grammar through SLR?

I want to solve this Grammar.
S->SS+
S->SS*
S->a
I want to construct SLR sets of items and parsing table with action and goto.
Can this grammar be parsed without eliminating left recursion?
Is this grammar SLR?
No, this grammar is not SLR. It is ambiguous.
Left recursion is not a problem for LR parsers. Left recursion elimination is only necessary for LL parsers.
I am not entirely sure about this, but I think this grammar is actually SLR(1). I constructed by hand the SLR(1) table and I obtained one with no conflicts (having added a 0-transition from S' (new start symbol) -> S).
Can somebody provide a sentence that can be derived in two different ways from this grammar? I was able to get a parser for it in Bison without any warning. Are you sure it is ambiguous?
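One way to probe the ambiguity claim is to count parse trees for sample strings. The sketch below does this for S -> SS+ | SS* | a; the function name `count_parses` is made up for the example. It only checks individual strings, so a count of 1 on some inputs is evidence against ambiguity rather than a proof.

```python
from functools import lru_cache

def count_parses(s):
    """Count the parse trees of s under S -> S S + | S S * | a."""
    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways S derives the substring s[i:j].
        n = 1 if s[i:j] == "a" else 0
        if j - i >= 3 and s[j - 1] in "+*":
            # S -> S S op: try every split of s[i:j-1] into two non-empty parts.
            for k in range(i + 1, j - 1):
                n += count(i, k) * count(k, j - 1)
        return n
    return count(0, len(s))

print(count_parses("aa+a*"))  # 1
print(count_parses("aaa+*"))  # 1
print(count_parses("aa"))     # 0 (not in the language)
```

The language is postfix notation, where the position of each operator determines its operands, which is consistent with every string tried here having at most one tree.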

Is this grammar LL(1)?

I have derived the following grammar:
S -> a | aT
T -> b | bR
R -> cb | cbR
I understand that in order for a grammar to be LL(1) it has to be non-ambiguous and right-recursive. The problem is that I do not fully understand the concept of left-recursive and right-recursive grammars. I do not know whether or not the following grammar is right recursive. I would really appreciate a simple explanation of the concept of left-recursive and right-recursive grammars, and if my grammar is LL(1).
Many thanks.
This grammar is not LL(1). In an LL(1) parser, it should always be possible to determine which production to use next based on the current nonterminal symbol and the next token of the input.
Let's look at this production, for example:
S → a | aT
Now, suppose that I told you that the current nonterminal symbol is S and the next symbol of input is an a. Could you determine which production to use? Unfortunately, without more context, you couldn't do so: perhaps you're supposed to use S → a, and perhaps you're supposed to use S → aT. Using similar reasoning, you can see that all the other productions have similar problems.
This doesn't have anything to do with left or right recursion, but rather the fact that no two productions for the same nonterminal in an LL(1) grammar can have a nonempty common prefix. In fact, a simple heuristic for checking if a grammar is not LL(1) is to see if you can find two production rules like this.
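The heuristic can be sketched in a few lines. The function name and grammar encoding below are assumptions for illustration, and the check only compares the first symbol of each alternative, so it is a quick test for "definitely not LL(1)" rather than a full FIRST/FOLLOW computation.

```python
def first_symbol_clashes(grammar):
    """Report alternatives of the same non-terminal that start with the same symbol."""
    clashes = {}
    for lhs, alternatives in grammar.items():
        by_first = {}
        for rhs in alternatives:
            if rhs:                                   # skip ε-productions
                by_first.setdefault(rhs[0], []).append(rhs)
        shared = {sym: rs for sym, rs in by_first.items() if len(rs) > 1}
        if shared:
            clashes[lhs] = shared
    return clashes

grammar = {
    "S": [("a",), ("a", "T")],
    "T": [("b",), ("b", "R")],
    "R": [("c", "b"), ("c", "b", "R")],
}

for lhs, shared in first_symbol_clashes(grammar).items():
    for symbol, alternatives in shared.items():
        print(lhs, "has", len(alternatives), "alternatives starting with", symbol)
```

For the grammar in the question, every non-terminal is reported, which matches the observation that all of its productions have the same problem.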
Hope this helps!
The grammar has only a single recursive rule: the last one where R is the symbol on the left, and also appears on the right. It is right-recursive because in the grammar rule, R is the rightmost symbol. The rule refers to R, and that reference is rightmost.
The language is LL(1). How we know this is that we can easily construct a recursive descent parser that uses no backtracking and at most one token of lookahead.
But such a parser would be based on a slightly modified version of the grammar.
For instance the two productions: S -> a and S -> a T could be merged into a single one that can be expressed by the EBNF S -> a [ T ]. (S derives a, followed by optional T). This rule can be handled by a single parsing function for recognizing S.
The function matches a and then looks for the optional T, which would be indicated by the next input symbol being b.
We can write an LL(1) grammar for this, along these lines:
S -> a T_opt
T_opt -> b R_opt
T_opt -> <empty>
... et cetera
The optionality of T is handled explicitly, by making T (which we rename to T_opt) capable of deriving the empty string, and then condensing to a single rule for S, so that we don't have two phrases that both start with a.
So in summary, the language is LL(1), but the given grammar for it isn't. Since the language is LL(1) it is possible to find another grammar which is LL(1), and that grammar is not far off from the given one.
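A minimal sketch of such a recursive descent recognizer follows. The rules for S and T_opt are the ones given above; the rule for R_opt is elided in the answer, so the version used here (R_opt -> c b R_opt | empty) is an assumption, as are the function names.

```python
def accepts(text):
    """Recognize the language with one symbol of lookahead and no backtracking."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise SyntaxError(f"expected {symbol!r} at position {pos}")
        pos += 1

    def parse_S():            # S -> a T_opt
        expect("a")
        parse_T_opt()

    def parse_T_opt():        # T_opt -> b R_opt | <empty>
        if peek() == "b":
            expect("b")
            parse_R_opt()

    def parse_R_opt():        # assumed: R_opt -> c b R_opt | <empty>
        while peek() == "c":
            expect("c")
            expect("b")

    try:
        parse_S()
    except SyntaxError:
        return False
    return pos == len(text)

for sample in ["a", "ab", "abcb", "abcbcb", "abc", "b"]:
    print(sample, accepts(sample))
```

Each parsing function chooses its branch by looking only at the next input symbol, which is the LL(1) property the original grammar lacks.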

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
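The following is a hand-rolled sketch of the bottom-up view for this grammar, not a table-driven LR parser; the function name and the stack encoding are assumptions for illustration. It shifts one token at a time and reduces whenever the stack matches a right-hand side, so the choice between B and C is made only when the final y or z arrives. A top-down parser would have to make that choice before consuming the first x.

```python
def shift_reduce(text):
    """Return 'B' or 'C' depending on how text parses, or None if it does not."""
    stack = []
    for token in text:
        stack.append(token)              # shift
        if stack == ["x"]:               # reduce A -> x
            stack = ["A"]
        elif stack == ["A", "x"]:        # reduce A -> A x
            stack = ["A"]
        elif stack == ["A", "y"]:        # reduce B -> A y
            stack = ["B"]
        elif stack == ["A", "z"]:        # reduce C -> A z
            stack = ["C"]
    # S -> B | C: accept only if exactly one B or one C remains.
    return stack[0] if stack in (["B"], ["C"]) else None

print(shift_reduce("xxxxxxy"))  # B
print(shift_reduce("xxz"))      # C
print(shift_reduce("xyx"))      # None
```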
I hope this is more or less in line with your question. Of course, Knuth showed that any deterministic context-free language can be described by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left to right. But consider the pattern below:
. . . . . | . . . . .
I rather expect that with short patterns such as this one, people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather process the pattern in parallel, or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that an LL/LR parser employs.
Furthermore, if we can describe any deterministic context-free language using an LR(1) grammar, then it is clear that simply recognizing a string is not the same as "understanding" it.
Well, for one, left-recursive definitions are impossible in LL(k) grammars (as far as I know); I don't know about others. This doesn't make it impossible to define other things, just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left recursion, um, do simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Resources