I'm trying to learn how to make a compiler. In order to do so, I read a lot about context-free language. But there are some things I cannot get by myself yet.
Since it's my first compiler there are some practices that I'm not aware of. My questions are asked with in mind to build a parser generator, not a compiler neither a lexer. Some questions may be obvious..
Among my reads are : Bottom-Up Parsing, Top-Down Parsing, Formal Grammars. The picture shown comes from : Miscellanous Parsing. All coming from the Stanford CS143 class.
Here are the points :
0) How do ( ambiguous / unambiguous ) and ( left-recursive / right-recursive ) influence the needs for one algorithm or another ? Are there other ways to qualify a grammar ?
1) An ambiguous grammar is one that have several parse trees. But shouldn't the choice of a leftmost-derivation or rightmost-derivation lead to unicity of the parse tree ?
[EDIT: Answered here ]
2.1) But still, is the ambiguity of the grammar related to k ? I mean giving a LR(2) grammar, is it ambiguous for a LR(1) parser and not ambiguous for a LR(2) one ?
[EDIT: No it's not, a LR(2) grammar means that the parser will need two tokens of lookahead to choose the right rule to use. On the other hand, an ambiguous grammar is one that possibly leads to several parse trees. ]
2.2) So a LR(*) parser, as long as you can imagine it, will have no ambiguous grammar at all and can then parse the entire set of context free languages ?
[EDIT: Answered by Ira Baxter, LR(*) is less powerful than GLR, in that it can't handle multiple parse trees. ]
3) Depending on the previous answers, what follows may be self contradictory. Considering LR parsing, do ambiguous grammars trigger shift-reduce conflict ? Can an unambiguous grammar trigger one too ? In the same way, what about reduce-reduce conflicts ?
[EDIT: this is it, ambiguous grammars leads to shift-reduce and reduce-reduce conflicts. By contrapositive, if there are no conflicts, the grammar is univocal. ]
4) The ability to parse left-recursive grammar is an advantage of LR(k) parser over LL(k), is it the only difference between them ?
[EDIT: yes. ]
5) Giving G1 :
G1 :
S -> S + S
S -> S - S
S -> a
5.1) G1 is both left-recursive, right-recursive, and ambiguous, am I right ? Is it a LR(2) grammar ? One would make it unambiguous :
G2 :
S -> S + a
S -> S - a
S -> a
5.2) Is G2 still ambiguous ? Does a parser for G2 needs two lookaheads ? By factorisation we have :
G3 :
S -> S V
V -> + a
V -> - a
S -> a
5.3) Now, does a parser for G3 need one lookahead only ? What are the counter parts for doing these transformations ? Is LR(1) the minimal parser required ?
5.4) G1 is left recursive, in order to parse it with a LL parser, one need to transform it into a right recursive grammar :
G4 :
S -> a + S
S -> a - S
S -> a
then
G5 :
S -> a V
V -> - V
V -> + V
V -> a
5.5) Does G4 need at least a LL(2) parser ? G5 only is parsable by a LL(1) parser, G1-G5 do define the same language, and this language is ( a (+/- a)^n ). Is it true ?
5.6) For each grammar G1 to G5, what is the minimal set to which it belongs ?
6) Finally, since many differents grammars may define the same language, how does one chose the grammar and the associated parser ? Is the resulting parse tree imortant ? What is the influence of the parse tree ?
I'm asking a lot, and I don't really expect a complete answer, anyway any help would be very appreciated.
Thx for reading !
"Many grammars may define the same langauge, how does one choose..."?
Usually, you choose the one that meets the following criteria:
conceptually as simple as you can make it (implication: smaller than others)
tracks the terminology in the langauge reference manual where possible
least amount of bending to meet the constraints of your parser generator
That last one can make a mess of your conceptual simplicity, and your chart of various parser styles shows the number of different issues that you face depending on your choice-of-generator. This is aggravated by the fact that choice is often made well before you actually choose the grammar.
One way to minimize grammar bending is to choose a parser generator which handles fully context-free grammars. GLR parsing has this very significant advantage. I've been using it for 15 years and have done dozens of real langauges with it.
Related
For example:
R → R bar R|RR|R star|(R)|a|b
construct an equivalent unambiguous grammar:
R → S|RbarS S→T|ST
T → U|Tstar U→a|b|(R)
How about Eliminate left-recursion for R → R bar R|RR|R star|(R)|a|b?
What's the different between Eliminate left-recursion and construct an equivalent unambiguous grammar?
An unambiguous grammar is one where for each string in the language, there is exactly one way to derive it from the grammar. In the context of compiler construction the problem with ambiguous grammar is that it's not obvious from the grammar what the parse tree for a given input string should be. Some tools solve this using their rules for resolving ambiguities while other simply require the grammar to be unambiguous.
A left-recursive grammar is one where the derivation for a given non-terminal can produce that same non-terminal again without first producing a terminal. This leads to infinite loops in recursive-descent-style parsers, but is no problems for shift-reduce parsers.
Note that an unambiguous grammar can still be left-recursive and a grammar without left recursion can still be ambiguous. Also note that depending on your tools, you may need to only remove ambiguity, but not left-recursion, or you may need to remove left-recursion, but not ambiguity (though an unambiguous grammar is generally preferable).
So the difference is that eliminating left recursion and ambiguity solve different problems and are necessary in different situation.
to compile a syntax portion of a language we need to parse syntactically and lexiqually
then a step that is not trivial: adding semantic
what I do is to use DCG with some predicates in prolog for the first two step, but now I want to know how it done for the semantic ?
how I consider this task?
separate to the parser or mix it with them?...
EDIT :
Usually, a semantic of simple expression with DCG be written as;
expr (Z) -> term(X) "+", expr(Y), {Z is X + Y}.
but if the language is based on logical formulas such as B Method , problem become more complicated,
and what I need is some tips to overcome this problem
sorry about the wrong expressions
I am very appreciated to your opinions
I have derived the following grammar:
S -> a | aT
T -> b | bR
R -> cb | cbR
I understand that in order for a grammar to be LL(1) it has to be non-ambiguous and right-recursive. The problem is that I do not fully understand the concept of left-recursive and right-recursive grammars. I do not know whether or not the following grammar is right recursive. I would really appreciate a simple explanation of the concept of left-recursive and right-recursive grammars, and if my grammar is LL(1).
Many thanks.
This grammar is not LL(1). In an LL(1) parser, it should always be possible to determine which production to use next based on the current nonterminal symbol and the next token of the input.
Let's look at this production, for example:
S → a | aT
Now, suppose that I told you that the current nonterminal symbol is S and the next symbol of input was an a. Could you determine which production to use? Unfortunately, without more context, you couldn't do so: perhaps you're suppose to use S → a, and perhaps you're supposed to use S → aT. Using similar reasoning, you can see that all the other productions have similar problems.
This doesn't have anything to do with left or right recursion, but rather the fact that no two productions for the same nonterminal in an LL(1) grammar can have a nonempty common prefix. In fact, a simple heuristic for checking if a grammar is not LL(1) is to see if you can find two production rules like this.
Hope this helps!
The grammar has only a single recursive rule: the last one where R is the symbol on the left, and also appears on the right. It is right-recursive because in the grammar rule, R is the rightmost symbol. The rule refers to R, and that reference is rightmost.
The language is LL(1). How we know this is that we can easily construct a recursive descent parser that uses no backtracking and at most one token of lookahead.
But such a parser would be based on a slightly modified version of the grammar.
For instance the two productions: S -> a and S -> a T could be merged into a single one that can be expressed by the EBNF S -> a [ T ]. (S derives a, followed by optional T). This rule can be handled by a single parsing function for recognizing S.
The function matches a and then looks for the optional T, which would be indicated by the next input symbol being b.
We can write an LL(1) grammar for this, along these lines:
S -> a T_opt
T_opt -> b R_opt
T_opt -> <empty>
... et cetera
The optionality of T is handled explicitly, by making T (which we rename to T_opt) capable of deriving the empty string, and then condensing to a single rule for S, so that we don't have two phrases that both start with a.
So in summary, the language is LL(1), but the given grammar for it isn't. Since the language is LL(1) it is possible to find another grammar which is LL(1), and that grammar is not far off from the given one.
Is it possible to construct an LR(0) parser that could parse a language with both prefix and postfix operators? For example, if I had a grammar with the + (addition) and ! (factorial) operators with the usual precedence then 1+3! should be 1 + 3! = 1 + 6 = 7, but surely if the parser were LR(0) then when it had 1+3 on the stack it would reduce rather than shift?
Also, do right associative operators pose a problem? For example, 2^3^4 should be 2^(3^4) but again, when the parser have 2^3 on the stack how would it know to reduce or shift?
If this isn't possible is there still a way to use an LR(0) parser, possibly by altering the grammar to add brackets in the appropriate places?
LR(0) parsers have a weakness in that they can only parse prefix-free languages, languages where no string in the language is a prefix of any other. This generally makes it a bit tricky to parse expressions like these, since something like 5 is a prefix of 5!. This also explains why it's hard to get right-associative operators - given a production like
S → F | F ^ S
the parser will have a shift/reduce conflict after seeing an F because it can't tell whether to extend it or to reduce again. This is related to the prefix-free property mentioned earlier.
This weakness of LR(0) is one of the reasons why people don't use it much in practice. SLR(1) and LALR(1) parsers can usually parse these grammars because they have a token of lookahead that lets them decide whether to shift or reduce. In the above case, the parsers wouldn't encounter shift/reduce conflicts because when deciding whether to reduce an F or shift a ^, they can see to shift the ^ because there's no correct string where a ^ should appear after an S.
Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.
well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.