You've written a yacc grammar (or some other LALR grammar in the tool of your choice), and you've decided that you want to refactor some productions for efficiency, clarity, whatever. For example, you had:
xs : xs ';' x
| xs ';'
| x
You want to make it more obvious that there can be multiple semicolons, so you rewrite it as:
semi_plus : semi_plus ';'
| ';'
xs : xs semi_plus x
| x
OK, so, looks plausible... but did I actually do the refactoring correctly? It would be great if I could pass these declarations to a tool that would tell me if the grammars are equivalent or not. (For now, let us consider solely the question of whether or not we recognize the same languages.)
One knee-jerk reaction is to point out that equivalence of context-free grammars is undecidable. In fact, even the problem of determining whether a CFG describes a regular language is undecidable. But yacc does not accept arbitrary CFGs: it accepts grammars for deterministic context-free languages, and for those, equivalence is known to be decidable. But has anyone implemented any of these decision procedures?
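In the meantime, the best sanity check I can think of is brute force: enumerate every string each grammar generates up to some small length and compare the sets. Here is a minimal sketch (helper names are mine; for the comparison I treat x and ';' as opaque terminal tokens, and it is only a check, not a decision procedure):

def language(grammar, start, max_len):
    """All terminal strings (as tuples) of length <= max_len derivable from start."""
    results, seen = set(), set()
    stack = [(start,)]
    while stack:
        form = stack.pop()
        if form in seen:
            continue
        seen.add(form)
        nt = next((i for i, s in enumerate(form) if s in grammar), None)
        if nt is None:                      # no nonterminals left: a sentence
            if len(form) <= max_len:
                results.add(form)
            continue
        for rhs in grammar[form[nt]]:       # expand the leftmost nonterminal
            new = form[:nt] + rhs + form[nt + 1:]
            # prune: the terminals already present can only grow
            if sum(1 for s in new if s not in grammar) <= max_len:
                stack.append(new)
    return results

old = {'xs': [('xs', ';', 'x'), ('xs', ';'), ('x',)]}
new = {'semi_plus': [('semi_plus', ';'), (';',)],
       'xs': [('xs', 'semi_plus', 'x'), ('x',)]}

# Anything printed here is generated by one grammar but not the other.
print(language(old, 'xs', 4) ^ language(new, 'xs', 4))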
Related
Is there an easy way to tell whether a simple grammar is suitable for recursive descent? Is eliminating left recursion and left factoring the grammar enough to achieve this?
Not necessarily.
To build a recursive descent parser (without backtracking), you need to eliminate or resolve all predict conflicts. So one definitive test is to see if the grammar is LL(1); LL(1) grammars have no predict conflicts, by definition. Left-factoring and left-recursion elimination are necessary for this task, but they might not be sufficient, since a predict conflict might be hiding behind two competing non-terminals:
list ::= item list'
list' ::= ε
| ';' item list'
item ::= expr1
| expr2
expr1 ::= ID '+' ID
expr2 ::= ID '(' list ')'
The problem with the above (or, at least, one problem) is that when the parser expects an item and sees an ID, it can't know which of expr1 and expr2 to try. (That's a predict conflict: Both non-terminals could be predicted.) In this particular case, it's pretty easy to see how to eliminate that conflict, but it's not really left-factoring since it starts by combining two non-terminals. (And in the full grammar this might be excerpted from, combining the two non-terminals might be much more difficult.)
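For illustration, here is one hypothetical way to resolve that particular conflict in a hand-written parser: merge expr1 and expr2 into a single parsing function and only decide between the two alternatives after the shared ID has been consumed. (peek and match are assumed helpers that inspect and consume the next token, as in the recursive-descent sketch further down.)

def parse_item():
    match('ID')
    if peek('+'):            # the expr1 alternative: ID '+' ID
        match('+')
        match('ID')
    elif peek('('):          # the expr2 alternative: ID '(' list ')'
        match('(')
        parse_list()
        match(')')
    else:
        raise SyntaxError("expected '+' or '(' after ID")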
In the general case, there is no algorithm which can turn an arbitrary grammar into an LL(1) grammar, nor even one which can say whether the language recognised by that grammar has an LL(1) grammar at all. (However, it's easy to tell whether a given grammar itself is LL(1).) So there's always going to be some art and/or experimentation involved.
I think it's worth adding that you don't really need to eliminate left-recursion in a practical recursive descent parser, since you can usually turn it into a while-loop instead of recursion. For example, leaving aside the question of the two expr types above, the original grammar in an extended BNF with repetition operators might be something like
list ::= item (';' item)*
Which translates into something like:
def parse_list():
    parse_item()
    while peek(';'):
        match(';')
        parse_item()
(Error checking and AST building omitted.)
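For what it's worth, peek and match aren't from any particular library; one hypothetical shape for them is a cursor over a list of tokens, along these lines (written as methods here, with the functions above closing over such an object):

class TokenStream:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self, expected):
        # True if the next token is `expected`, without consuming it
        return self.pos < len(self.tokens) and self.tokens[self.pos] == expected

    def match(self, expected):
        # Consume the next token, or fail if it isn't what we expected
        if not self.peek(expected):
            raise SyntaxError(f"expected {expected!r} at position {self.pos}")
        self.pos += 1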
I have been experimenting with parser generators in Haskell, using Happy. I used to use parser combinators such as Parsec, and one thing I can't achieve now is the dynamic addition (during execution) of new externally defined operators. For example, Haskell has some basic operators, but we can add more, giving them precedence and fixity. So I would like to know how to reproduce this with Happy, following the Haskell design (see the example code below, which is the text to be parsed), or, if it is not easily feasible, whether it should perhaps be done with parser combinators instead.
-- Adding the new operator
infixr 5 ++
(++) :: [a] -> [a] -> [a]
[] ++ ys = ys
(x:xs) ++ ys = x : xs ++ ys
-- Using the new operator taking into consideration fixity and precedence during parsing
example = "Hello, " ++ "world!"
Haskell only allows a small, fixed set of precedence levels (0 through 9). So you don't strictly need a dynamic grammar; you could just write out the grammar using precedence-level token classes instead of individual operators, leaving the lexer with the problem of associating a given symbol with a given precedence level.
In effect, that moves the dynamic addition of operators to the lexer. That's a slightly uncomfortable design decision, although in some cases it may not be too difficult to implement. It's an uncomfortable design because it requires semantic feedback to the lexer: at a minimum, the lexer needs to consult the symbol table to figure out what kind of token it is looking at. In the case of Haskell, at least, this is made more uncomfortable by the fact that fixity declarations are scoped, so in order to track fixity information, the lexer would also need to understand the scoping rules.
In practice, most languages which allow program text to define operators and operator precedence work in precisely the same way the Haskell compiler does: expressions are parsed by the grammar into a simple list of items (where parenthesized subexpressions count as a single item), and in a later semantic analysis the list is rearranged into an actual tree taking into account precedence and associativity rules, using a simple version of the shunting yard algorithm. (It's a simple version because it doesn't need to deal with parenthesized subconstructs.)
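As a rough illustration of that two-phase approach, here is a sketch (my own names, not the Haskell compiler's internals) that takes the flat alternating list item, op, item, op, ... produced by the grammar and rebuilds the tree from a fixity table:

# Hypothetical fixity table: a precedence number per operator, plus the set of
# right-associative operators.
PREC  = {'+': 6, '*': 7, '++': 5}
RIGHT = {'++'}

def build_tree(flat):
    """flat = [item, op, item, op, ...]; returns nested (op, left, right) tuples."""
    operands, operators = [flat[0]], []

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for i in range(1, len(flat), 2):
        op, item = flat[i], flat[i + 1]
        # Reduce while the operator on the stack binds at least as tightly
        # (equal precedence reduces only for left-associative operators).
        while operators and (PREC[operators[-1]] > PREC[op]
                             or (PREC[operators[-1]] == PREC[op] and op not in RIGHT)):
            reduce_top()
        operators.append(op)
        operands.append(item)
    while operators:
        reduce_top()
    return operands[0]

# build_tree(['a', '+', 'b', '*', 'c'])   -> ('+', 'a', ('*', 'b', 'c'))
# build_tree(['x', '++', 'y', '++', 'z']) -> ('++', 'x', ('++', 'y', 'z'))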
There are several reasons for this design decision:
As mentioned above, figuring out what the precedence of a symbol is (or even whether the symbol is an operator with a precedence at all) requires close collaboration between the lexer and the parser, which many would say violates separation of concerns. Worse, it makes it difficult or impossible to use parsing technologies which don't rely on a small fixed lookahead, such as GLR parsers.
Many languages have more precedence levels than Haskell. In some cases, even the number of precedence levels is not defined by the grammar. In Swift, for example, you can declare your own precedence levels, and you define a level not with a number but with a comparison to another previously defined level, leading to a partial order between precedence levels.
IMHO, that's actually a better design decision than Haskell's, in part because it avoids the ambiguity of a precedence level having both left- and right-associative operators, but more importantly because relative precedence declarations both avoid magic numbers and allow the parser to flag the ambiguous use of operators from different modules. In other words, it does not force a precedence declaration to apply mechanically to every pair of totally unrelated operators; in this sense it makes operator declarations easier to compose.
The grammar is much simpler, and arguably easier to understand, since most people rely on precedence tables anyway rather than analysing grammar productions to figure out how operators interact with each other. In that sense, having precedence set by the grammar is more a distraction than documentation. See the C++ grammar as a good example of why precedence tables are easier to read than grammars.
On the other hand, as the C++ grammar also illustrates, a grammar is a lot more general than simple precedence declarations because it can express asymmetric precedences. (The grammar doesn't always express these gracefully, but they can be expressed.) A classic example of an asymmetric precedence is a lambda construct (λ ID expr) which binds very loosely to the right and very tightly to the left: the expected parse of a ∘ λ b b ∘ a does not ever consult the associativity of ∘ because the λ comes between them.
In practice, there is very little cost to building the tree later. The algorithm to build the tree is well-known, simple and cheap.
I am trying to find out how an LL(1) parser handles a right-associative grammar. For example, for a left-associative grammar rewritten into the form E -> T E', E' -> + T E' | ε, first() and follow() work smoothly and the parsing table is generated easily. But in the case of a right-recursive grammar, for example exponentiation with E -> T ^ E | T, the parsing table isn't generated properly. I am searching for resources, but every example I find avoids right associativity, like powers.
LL algorithms handle right recursion with no problem whatsoever. In fact, the transformation you mention turns a left-recursive (left-associative) grammar into a right-recursive one, and left-associativity then needs to be restored by transforming the syntax tree in a semantic rule. So if the production is really right-associative, you can use the same grammar without the need for post-processing the tree.
The problem with E -> T ^ E | T is not that it is right-recursive. The problem is that the two right-hand sides start with the same non-terminal, making prediction impossible. The solution is left-factoring, which produces E -> T E' with E' -> ε | ^ T E'.
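For concreteness, here is a sketch of how that left-factored grammar might be realised in a recursive-descent (LL(1)) parser; parse_T, peek and match are assumed helpers, and the tree comes out right-associative with no post-processing:

def parse_E():                    # E  -> T E'
    return parse_E_rest(parse_T())

def parse_E_rest(left):           # E' -> '^' T E' | ε
    if peek('^'):
        match('^')
        right = parse_T()
        # Recurse to the right, so a ^ b ^ c groups as a ^ (b ^ c).
        return ('^', left, parse_E_rest(right))
    return left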
I have derived the following grammar:
S -> a | aT
T -> b | bR
R -> cb | cbR
I understand that in order for a grammar to be LL(1) it has to be unambiguous and right-recursive. The problem is that I do not fully understand the concepts of left-recursive and right-recursive grammars. I do not know whether or not this grammar is right-recursive. I would really appreciate a simple explanation of left-recursive and right-recursive grammars, and whether my grammar is LL(1).
Many thanks.
This grammar is not LL(1). In an LL(1) parser, it should always be possible to determine which production to use next based on the current nonterminal symbol and the next token of the input.
Let's look at this production, for example:
S → a | aT
Now, suppose that I told you that the current nonterminal symbol is S and the next symbol of input was an a. Could you determine which production to use? Unfortunately, without more context, you couldn't do so: perhaps you're supposed to use S → a, and perhaps you're supposed to use S → aT. Using similar reasoning, you can see that all the other productions have similar problems.
This doesn't have anything to do with left or right recursion, but rather the fact that no two productions for the same nonterminal in an LL(1) grammar can have a nonempty common prefix. In fact, a simple heuristic for checking if a grammar is not LL(1) is to see if you can find two production rules like this.
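If it helps, that heuristic can be mechanised in a few lines. This is just a sketch with made-up conventions: the grammar maps each nonterminal to a list of right-hand sides, and we flag any two alternatives of the same nonterminal that begin with the same symbol:

def common_prefix_conflict(grammar):
    for lhs, alternatives in grammar.items():
        first_symbol_seen = {}
        for rhs in alternatives:
            if not rhs:
                continue                       # ignore empty productions here
            head = rhs[0]
            if head in first_symbol_seen:
                return lhs, first_symbol_seen[head], rhs
            first_symbol_seen[head] = rhs
    return None

g = {'S': [('a',), ('a', 'T')],
     'T': [('b',), ('b', 'R')],
     'R': [('c', 'b'), ('c', 'b', 'R')]}
print(common_prefix_conflict(g))   # -> ('S', ('a',), ('a', 'T'))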
Hope this helps!
The grammar has only a single recursive rule: the last one, where R is the symbol on the left and also appears on the right. It is right-recursive because that reference to R is the rightmost symbol of the rule.
The language is LL(1). We know this because we can easily construct a recursive descent parser that uses no backtracking and at most one token of lookahead.
But such a parser would be based on a slightly modified version of the grammar.
For instance, the two productions S -> a and S -> a T can be merged into a single one, expressed in EBNF as S -> a [ T ] (S derives a, followed by an optional T). This rule can be handled by a single parsing function for recognizing S.
The function matches a and then looks for the optional T, which would be indicated by the next input symbol being b.
We can write an LL(1) grammar for this, along these lines:
S -> a T_opt
T_opt -> b R_opt
T_opt -> <empty>
... et cetera
The optionality of T is handled explicitly, by making T (which we rename to T_opt) capable of deriving the empty string, and then condensing to a single rule for S, so that we don't have two phrases that both start with a.
So in summary, the language is LL(1), but the given grammar for it isn't. Since the language is LL(1) it is possible to find another grammar which is LL(1), and that grammar is not far off from the given one.
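To make that concrete, here is a recursive-descent sketch for this language (one token of lookahead, no backtracking), using the same hypothetical peek/match helpers as in the earlier answers:

def parse_S():
    match('a')
    if peek('b'):            # T_opt -> b R_opt
        match('b')
        while peek('c'):     # R_opt -> c b R_opt, coded as a loop
            match('c')
            match('b')
    # otherwise T_opt -> <empty>: nothing more to consume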
I'm trying to learn how to make a compiler. In order to do so, I read a lot about context-free language. But there are some things I cannot get by myself yet.
Since it's my first compiler, there are some practices that I'm not aware of. My questions are asked with building a parser generator in mind, not a compiler nor a lexer. Some questions may be obvious.
Among my readings are: Bottom-Up Parsing, Top-Down Parsing, and Formal Grammars. The picture shown comes from Miscellaneous Parsing. All of these come from the Stanford CS143 class.
Here are the points:
0) How do (ambiguous / unambiguous) and (left-recursive / right-recursive) influence the need for one algorithm or another? Are there other ways to qualify a grammar?
1) An ambiguous grammar is one that has several parse trees. But shouldn't the choice of a leftmost derivation or a rightmost derivation lead to a unique parse tree?
[EDIT: Answered here ]
2.1) But still, is the ambiguity of the grammar related to k? I mean, given an LR(2) grammar, is it ambiguous for an LR(1) parser and not ambiguous for an LR(2) one?
[EDIT: No it's not; an LR(2) grammar means that the parser needs two tokens of lookahead to choose the right rule to use. On the other hand, an ambiguous grammar is one that can lead to several parse trees.]
2.2) So an LR(*) parser, if you can imagine such a thing, would have no ambiguous grammars at all and could then parse the entire set of context-free languages?
[EDIT: Answered by Ira Baxter, LR(*) is less powerful than GLR, in that it can't handle multiple parse trees. ]
3) Depending on the previous answers, what follows may be self-contradictory. Considering LR parsing, do ambiguous grammars trigger shift-reduce conflicts? Can an unambiguous grammar trigger one too? In the same way, what about reduce-reduce conflicts?
[EDIT: That's it; ambiguous grammars lead to shift-reduce and reduce-reduce conflicts. By contraposition, if there are no conflicts, the grammar is unambiguous.]
4) The ability to parse left-recursive grammars is an advantage of LR(k) parsers over LL(k) parsers; is it the only difference between them?
[EDIT: yes. ]
5) Given G1:
G1 :
S -> S + S
S -> S - S
S -> a
5.1) G1 is both left-recursive and right-recursive, and it is ambiguous, am I right? Is it an LR(2) grammar? One way to make it unambiguous:
G2 :
S -> S + a
S -> S - a
S -> a
5.2) Is G2 still ambiguous? Does a parser for G2 need two tokens of lookahead? By factorisation we have:
G3 :
S -> S V
V -> + a
V -> - a
S -> a
5.3) Now, does a parser for G3 need only one token of lookahead? What are the trade-offs of doing these transformations? Is LR(1) the minimal parser required?
5.4) G1 is left-recursive; in order to parse it with an LL parser, one needs to transform it into a right-recursive grammar:
G4 :
S -> a + S
S -> a - S
S -> a
then
G5 :
S -> a V
V -> - V
V -> + V
V -> a
5.5) Does G4 need at least an LL(2) parser? Is only G5 parsable by an LL(1) parser? G1-G5 all define the same language, and this language is ( a (+/- a)^n ). Is that true?
5.6) For each grammar G1 to G5, what is the smallest class of grammars to which it belongs?
6) Finally, since many different grammars may define the same language, how does one choose the grammar and the associated parser? Is the resulting parse tree important? What is the influence of the parse tree?
I'm asking a lot, and I don't really expect a complete answer; anyway, any help would be very appreciated.
Thanks for reading!
"Many grammars may define the same langauge, how does one choose..."?
Usually, you choose the one that meets the following criteria:
conceptually as simple as you can make it (implication: smaller than others)
tracks the terminology in the language reference manual where possible
least amount of bending to meet the constraints of your parser generator
That last one can make a mess of your conceptual simplicity, and your chart of various parser styles shows the number of different issues that you face depending on your choice of generator. This is aggravated by the fact that the choice of generator is often made well before you actually choose the grammar.
One way to minimize grammar bending is to choose a parser generator which handles fully context-free grammars. GLR parsing has this very significant advantage. I've been using it for 15 years and have done dozens of real languages with it.