Parse trees in ambiguous and unambiguous grammar - parsing

In an unambiguous grammar, do left and right derivation both produce the same parse tree?
Because I have read that a grammar in which some sentence has more than one parse tree is said to be ambiguous.

If the grammar is unambiguous, there is only one parse tree for each sentence. (By definition.) So the leftmost and rightmost derivations generate the same tree.
You can think of a derivation as a tree walk. For a given tree, there are many different possible ways of traversing it. Leftmost and rightmost derivations are pre- and post-order depth-first traversals, respectively.
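As a rough sketch of that correspondence, here is a tiny made-up parse tree (for the sentence ab under the grammar S -> A B, A -> a, B -> b) walked both ways in Python:

# A node is (label, children); the tree below derives the sentence "ab"
# from the made-up grammar S -> A B, A -> a, B -> b.
tree = ("S", [("A", [("a", [])]), ("B", [("b", [])])])

def preorder(node):
    label, children = node
    result = [label]
    for child in children:
        result.extend(preorder(child))
    return result

def postorder(node):
    label, children = node
    result = []
    for child in children:
        result.extend(postorder(child))
    result.append(label)
    return result

print(preorder(tree))   # ['S', 'A', 'a', 'B', 'b']: nodes in leftmost-derivation order
print(postorder(tree))  # ['a', 'A', 'b', 'B', 'S']: the rightmost derivation, back to front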

Related

Parsing Tree and Derivation?

I don't understand the relationship between a parse tree and a derivation. The parse tree is invariant with respect to the derivation, but does this mean that regardless of the derivation (rightmost or leftmost) the parse tree remains the same? Or does the parse tree change according to the method (rightmost or leftmost)?
The parse tree is a record of the derivation. Each non-leaf node of the tree is the result of a single derivation step.
At the root of the parse tree and at the beginning of the derivation, you find the grammar's start symbol. A derivation step replaces a non-terminal with the right-hand side of some production which has that non-terminal on the left-hand side. In the tree, the node corresponding to the non-terminal is given a sequence of children, each one a symbol in the right-hand side of the production. Terminal symbols become leaf nodes and non-terminals will eventually become the top of a subtree.
If the grammar is unambiguous, there is only one parse tree for each derivable sentence. But that parse tree represents a large number of possible derivations, unless the grammar is linear (that is, the right-hand side of every production contains at most one non-terminal). In a derivation which is being built, you can select any non-terminal for the next derivation step; in the parse tree, you can select any node representing a non-terminal which does not yet have children.
The leftmost and rightmost derivations are just two of these many possibilities. (Again, unless the grammar is linear, in which case the leftmost and rightmost derivations are the same derivation.) But a derivation doesn't have to be leftmost or rightmost. At each step, it can choose any non-terminal, not only the leftmost or rightmost one. In the tree representation, a possible derivation can be generated by any valid topological sort of the nodes.
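For example, take the made-up grammar S -> A B C with A -> a, B -> b, C -> c. Every derivation of abc starts with S => A B C, after which A, B and C can be expanded in any of 3! = 6 orders. The derivation S => ABC => AbC => abC => abc expands B first, so it is neither leftmost nor rightmost, yet it produces exactly the same parse tree.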
But that doesn't matter in practical terms. The only useful practical question is whether there is more than one parse tree for some sentence in the language, which is exactly the same as asking whether there is more than one leftmost derivation or more than one rightmost derivation.

Theory: LL(k) parser vs parser for LL(k) grammars

I'm concerned about the very important difference between the terms "LL(k) parser" and "parser for LL(k) grammars". When an LL(1) backtracking parser is in question, it IS a parser for LL(k) grammars, because it can parse them, but it is NOT an LL(k) parser, because it does not use k tokens of lookahead from a single position in the grammar; instead it explores the possible cases with backtracking, even though it still uses k tokens while exploring.
Am I right?
The question may break down to the way the lookahead is performed. If the lookahead actually still processes the grammar with backtracking, that does not make it an LL(k) parser. To be an LL(k) parser, the parser must not use a backtracking mechanism on the grammar, because then it would be an "LL(1) parser with backtracking that can parse LL(k) grammars".
Am I right again?
I think the difference is related to the expectation that an LL(1) parser uses constant time per token, and an LL(k) parser uses at most k * constant time per token (linear in the lookahead), not exponential time as it would be in the case of a backtracking parser.
Update 1: to simplify - per token, is LL(k) parsing expected to run in time exponential in k, or linear in k?
Update 2: I have changed it to LL(k) because the question does not depend on the range of k (a fixed integer or unbounded).
An LL(k) parser needs to do the following at each point in the inner loop:
Collect the next k input symbols. Since this is done at each point in the input, this can be done in constant time by keeping the lookahead vector in a circular buffer.
If the top of the prediction stack is a terminal, then it is compared with the next input symbol; either both are discarded or an error is signalled. This is clearly constant time.
If the top of the prediction stack is a non-terminal, the action table is consulted, using the non-terminal, the current state and the current lookahead vector as keys. (Not all LL(k) parsers need to maintain a state; this is the most general formulation. But it doesn't make a difference to complexity.) This lookup can also be done in constant time, again by taking advantage of the incremental nature of the lookahead vector.
The prediction action is normally done by pushing the right-hand side of the selected production onto the stack. A naive implementation would take time proportional to the length of the right-hand side, which is correlated with neither the lookahead k nor the length of the input N, but rather with the size of the grammar itself. It's possible to avoid the variability of this work by simply pushing a reference to the right-hand side, which can be used as though it were the list of symbols (since the list can't change during the parse).
However, that's not the full story. Executing a prediction action does not consume an input, and it's possible -- even likely -- that multiple predictions will be made for a single input symbol. Again, the maximum number of predictions is only related to the grammar itself, not to k nor to N.
More specifically, since the same non-terminal cannot be predicted twice in the same place without violating the LL property, the total number of predictions cannot exceed the number of non-terminals in the grammar. Therefore, even if you do push the entire right-hand side onto the stack, the total number of symbols pushed between consecutive shift actions cannot exceed the size of the grammar. (Each right-hand side can be pushed at most once. In fact, only one right-hand side for a given non-terminal can be pushed, but it's possible that almost every non-terminal has only one right-hand side, so that doesn't reduce the asymptote.) If instead only a reference is pushed onto the stack, the number of objects pushed between consecutive shift actions -- that is, the number of predict actions between two consecutive shift actions -- cannot exceed the size of the non-terminal alphabet. (But, again, it's possible that |V| is O(|G|).)
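To make that loop concrete, here is a minimal table-driven sketch in Python with k = 1 (the toy grammar S -> a S b | c and its table are my own illustration, not taken from any particular text):

# Table-driven LL(1) parser sketch for the made-up grammar S -> a S b | c.
# TABLE maps (non-terminal, lookahead) to the right-hand side to predict.
TABLE = {
    ('S', 'a'): ['a', 'S', 'b'],
    ('S', 'c'): ['c'],
}
NONTERMINALS = {'S'}

def parse(text):
    tokens = list(text) + ['$']         # '$' is the end marker
    pos = 0
    stack = ['$', 'S']                  # prediction stack, start symbol on top
    while stack:
        top = stack.pop()
        la = tokens[pos]                # k = 1, so no circular buffer is needed
        if top in NONTERMINALS:
            rhs = TABLE.get((top, la))  # constant-time table lookup (predict)
            if rhs is None:
                raise SyntaxError("no prediction for (%s, %s)" % (top, la))
            stack.extend(reversed(rhs)) # push the right-hand side
        elif top == la:
            pos += 1                    # shift: terminal matches the lookahead
        else:
            raise SyntaxError("expected %r, saw %r" % (top, la))
    return pos == len(tokens)

print(parse("acb"))    # True
print(parse("aacbb"))  # True

Each iteration does a constant amount of work, and the argument above bounds the number of predict actions between two shifts, which is where the overall linearity comes from.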
The linearity of LL(k) parsing was established, I believe, in Lewis and Stearns (1968), but I don't have that paper at hand right now, so I'll refer you to the proof in Sippu & Soisalon-Soininen's Parsing Theory (1988), where it is proved in Chapter 5 for strong LL(k) (as defined by Rosenkrantz & Stearns 1970), and in Chapter 8 for canonical LL(k).
In short, the time the LL(k) algorithm spends between shifting two successive input symbols is expected to be O(|G|), which is independent of both k and N (and, of course, constant for a given grammar).
This does not really have any relation to LL(*) parsers, since an LL(*) parser does not just try successive LL(k) parses (which would not be possible, anyway). For the LL(*) algorithm presented by Terence Parr (which is the only reference I know of which defines what LL(*) means), there is no bound to the amount of time which could be taken between successive shift actions. The parser might expand the lookahead to the entire remaining input (which would, therefore, make the time complexity dependent on the total size of the input), or it might fail over to a backtracking algorithm, in which case it is more complicated to define what is meant by "processing an input symbol".
I suggest you read chapter 5.1 of Aho & Ullman, Volume 1.
https://dl.acm.org/doi/book/10.5555/578789
An LL(k) parser is a k-predictive algorithm (k is the lookahead, an integer >= 1).
An LL(k) parser can parse any LL(k) grammar. (chapter 5.1.2)
For all a, b with a < b, every LL(a) grammar is also an LL(b) grammar, but the reverse is not true (see the example after this list).
An LL(k) parser is PREDICTIVE. So there is NO backtracking.
All LL(k) parsers are O(n), where n is the length of the parsed sentence.
It is important to understand that an LL(3) parser does not parse faster than an LL(1) parser. But the LL(3) parser can parse MORE grammars than the LL(1) parser (see points #2 and #3).
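To illustrate points #2 and #3, here is a made-up grammar that is not LL(1) but is LL(2):

S -> a b | a c

With one token of lookahead both alternatives begin with a, so an LL(1) parser cannot choose; with two tokens (ab versus ac) the production is selected uniquely.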

How can LR parsers generate parse trees?

Suppose I have a grammar:
S -> Aaa
A -> a | ε
Clearly, this grammar generates only the sequences aa and aaa. A simple LR(1) parser (or even an LL one) can parse these once the grammar is transformed into an equivalent one:
S -> aaA
A -> a | ε
Although these grammars are equivalent, their generated parse trees are different. Consider, for the sequence aaa:
   S              S
  / \            / \
 A   aa        aa   A
 |                  |
 a                  a
Grammars determine whether a sequence is part of a language; they do not by themselves provide the parse tree that represents it in the language. The untransformed grammar cannot parse the sequence (without greater lookahead), while the transformed grammar can parse it but builds an invalid parse tree.
How would one go about building a parse tree for a sequence whose (context-free) grammar cannot, untransformed, be handled by an LR parser?
If a grammar has no LR(1) parser, you need to use a different parsing algorithm. In this case, you could use an LR(3) parser. Or you could (in general) use an Earley or GLR parser, which have no lookahead limitations.
I think your question has to do with recovering the original parse from the results of a parse with a transformed grammar. This will depend on the transformation algorithm.
In the example you provide, I think you're using the left-recursion-elimination transformation; this procedure does not preserve derivations, so as far as I know there is no algorithm for recovering the original parse.
There is a different transformation which can be used to construct an LR(1) grammar from an LR(k) grammar if the value of k is known. That transformation is reversible. However, it's not usually considered practical, because it effectively encodes the LR(k) machine into the grammar rules, leading to a massive blow-up of the grammar. It would be equivalent to use a real LR(k) parser, which also has a huge automaton.
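As a sketch of the Earley route mentioned above, using the third-party Python library Lark (the rule names are mine; Earley is Lark's default algorithm, so no grammar transformation is needed):

from lark import Lark   # pip install lark

# The untransformed grammar S -> A a a, A -> a | epsilon, in Lark syntax.
# Since Earley has no fixed lookahead limit, A can stay on the left.
parser = Lark(r'''
    s: a_rule "a" "a"
    a_rule: "a"?
''', start="s", keep_all_tokens=True)   # keep the anonymous "a" tokens in the tree

print(parser.parse("aaa").pretty())   # the a_rule subtree holds the first a
print(parser.parse("aa").pretty())    # the a_rule subtree is empty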
First, I would say, "Grammars determine whether a sequence is a sentence of a language." You then say that the transformed grammar builds an invalid parse tree. I would say that it builds a different parse tree, which may or may not be useful. But, to your question about building a parse tree from a non-LR grammar: consider the following grammar, which is not LR(k) for any k because it is ambiguous:
E -> E + E | E * E | number
For example:
7 * 4 + 3
There are two distinct parse trees you can build from this sentence, one grouping as (7 * 4) + 3 and the other as 7 * (4 + 3), precisely due to the ambiguity in the grammar (really, this is the definition of an ambiguous grammar). So the answer to your question is that I wouldn't know how to do it in the general case.

Difference between left/right recursive, left/right-most derivation, precedence, associativity etc

I am currently learning about language processors, and a topic that comes up very often is the direction in which elements in a grammar are consumed: left to right or right to left.
I understand the concept but there seems to be so many ways of writing these rules and I am not sure if they are all the same. What I've seen so far is:
Right/Left recursion,
Right/Left-most derivation,
Right/Left reduction, precedence, associativity etc.
Do these all mean the same thing?
No, they all have different meanings.
Right- and left-recursion refer to recursion within production rules. A production for a non-terminal is recursive if it can derive a sequence containing that non-terminal; it is left-recursive if the non-terminal can appear at the start (left edge) of the derived sequence, and right-recursive if it can appear at the end (right edge). A production can be recursive without being either left- or right-recursive, and it can even be both left- and right-recursive.
For example:
term: term '*' factor { /* left-recursive */ }
assignment: lval '=' assignment { /* right-recursive */ }
The above examples are both direct recursion; the non-terminal directly derives a sequence containing the non-terminal. Recursion can also be indirect; it is still recursion.
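And, as mentioned above, a production can be both at once; a made-up example (which, for a binary operator like this, also makes the grammar ambiguous):

expr: expr '+' expr { /* both left- and right-recursive */ }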
All common parsing algorithms process the input left-to-right, which is the first L in LL and LR. Top-down (LL) parsing finds a leftmost derivation (the second L), while bottom-up (LR) parsing finds a rightmost derivation (the R).
Effectively, both types of parser start with a single non-terminal (the start symbol) and "guess" a derivation by repeatedly expanding some non-terminal in the current sequence until the input text is derived. In a leftmost derivation, it is always the leftmost non-terminal which is expanded. In a rightmost derivation, it is always the rightmost non-terminal.
So a top-down parser always guesses which production to use for the first non-terminal, after which it needs to again work on whatever is now the first non-terminal. ("Guess" here is informal. It can look at the input to be matched -- or at least the next k tokens of the input -- in order to determine which production to use.) This is called top-down processing because it builds the parse tree from the top down.
It's easier (at least for me) to visualize the action of a bottom-up parser in reverse; it builds the parse tree bottom up by repeatedly reading just enough of the input to find some production, which will be the last derivation in the derivation chain. So it does produce a rightmost derivation, but it outputs it back-to-front.
In an LR grammar for an operator language (roughly speaking, a grammar for languages which look like arithmetic expressions), left- and right- associativity are modelled using left- and right-recursive grammar rules, respectively. "Associativity" is an informal description of the grammar, as is "precedence".
Precedence is modelled by using a series of grammar rules, each of which refers to the next rule (and which usually end up with a recursive production for handling parentheses -- '(' expr ')' -- which is neither left- nor right-recursive).
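For instance, a typical precedence ladder for arithmetic expressions looks like this (the rule names are illustrative):

expr: expr '+' term | term          /* lowest precedence, left-associative */
term: term '*' factor | factor      /* binds tighter than '+' */
factor: '(' expr ')' | number       /* the recursion that is neither left- nor right-recursive */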
There is an older style of bottom-up parsing, called "operator precedence parsing", in which precedence is explicitly part of the language description. One common operator-precedence algorithm is the so-called Shunting Yard algorithm. But if you have an LALR(1) parser generator, like bison, you might as well use that instead, because it is both more general and more precise.
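For a flavour of the operator-precedence idea, here is a minimal shunting-yard sketch in Python (the two-operator precedence table is my own illustration):

# Convert a token list such as [7, '*', 4, '+', 3] to postfix (RPN),
# honouring precedence: higher numbers bind tighter, both left-associative.
PREC = {'+': 1, '*': 2}

def shunting_yard(tokens):
    output, ops = [], []
    for tok in tokens:
        if isinstance(tok, int):      # operand goes straight to the output
            output.append(tok)
        else:                         # operator: pop higher/equal precedence first
            while ops and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
    while ops:                        # flush the remaining operators
        output.append(ops.pop())
    return output

print(shunting_yard([7, '*', 4, '+', 3]))   # [7, 4, '*', 3, '+'], i.e. (7 * 4) + 3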
(I am NOT an expert on parser and compiler theory. I happen to be learning something related. And I'd like to share something I have found so far.)
I strongly suggest taking a look at this awesome article.
It explains and illustrates the LL and LR algorithms. You can clearly see why LL is called top-down and LR is called bottom-up.
Some quotation:
The primary difference between how LL and LR parsers operate is that
an LL parser outputs a pre-order traversal of the parse tree and an LR
parser outputs a post-order traversal.
...
We are converging on a very simple model for how LL and LR parsers
operate. Both read a stream of input tokens and output that same token
stream, inserting rules in the appropriate places to achieve a
pre-order (LL) or post-order (LR) traversal of the parse tree.
...
When you see designations like LL(1), LR(0), etc. the number in
parentheses is the number of tokens of lookahead.
And as to the acronyms: (source)
The first L in LR and LL means that the parser reads input
text in one direction without backing up; that direction is typically
Left to right within each line, and top to bottom across the lines of
the full input file.
The remaining R and L mean: rightmost and leftmost derivation, respectively.
These are 2 different parsing strategies. A parsing strategy determines the next non-terminal to rewrite. (source)
For left-most derivation, it is always the leftmost nonterminal.
For right-most derivation, it is always the rightmost nonterminal.

Converting a context-free grammar into a LL(1) grammar

First off, I have checked similar questions and none has quite the information I seek.
I want to know if, given a context-free grammar, it is possible to:
Know whether or not an equivalent LL(1) grammar exists, equivalent in the sense that it generates the same language.
Convert the context-free grammar to the equivalent LL(1) grammar, given that it exists. The conversion should succeed whenever an equivalent LL(1) grammar exists; it is OK if it does not terminate when no equivalent exists.
If the answer to those questions is yes, are such algorithms or tools available somewhere? My own research has been fruitless.
Another answer mentions that the Dragon Book has algorithms to eliminate left recursion and to left-factor a context-free grammar. I have access to the book and checked it, but it is unclear to me whether the resulting grammar is guaranteed to be LL(1). The restrictions imposed on the context-free grammar (no null productions and no cycles) are acceptable to me.
From the university compiler courses I took, I remember that every LL(1) grammar is context-free, but the class of context-free grammars is much bigger than LL(1). There is no general algorithm to check whether a context-free grammar has an equivalent LL(1) grammar and to convert it; the underlying question is in fact undecidable, not merely hard.
Applying the bag of tricks (removing left recursion, removing first-follow conflicts, left-factoring, and so on) is similar to the mathematical transformations you use when you want to integrate a function: you need experience, and it is sometimes very close to an art. The transformations are often inverses of each other.
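For instance, the standard left-recursion-elimination trick rewrites

E -> E + T | T

into

E  -> T E'
E' -> + T E' | ε

which generates the same language but changes the shape of the parse tree: the chain of additions now hangs to the right.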
One reason why LR-type grammars are now used a lot for generated parsers is that they cover a much wider spectrum of context-free grammars than LL(1).
By the way, the C grammar can be expressed as LL(1), but the C# grammar cannot (the lambda function x => x + 1 comes to mind, where you cannot decide whether you are seeing the parameter of a lambda or a known variable).

Resources