How can LR parsers generate parse trees?

Suppose I have a grammar:
S -> Aaa
A -> a | ε
Clearly, this grammar generates only the sequences aa and aaa. A simple LR(1) parser (or even an LL parser) can parse these once the grammar is transformed into an equivalent one:
S -> aaA
A -> a | ε
Although these grammars are equivalent, their generated parse trees are different. Consider, for the sequence aaa:
    S              S
   / \            / \
  A   aa        aa   A
  |                  |
  a                  a
Grammars determine whether a sequence is part of a language, rather than providing the parse tree that represents it in the language. The untransformed grammar cannot be parsed by an LR(1) parser (without greater lookahead), while the transformed grammar can be parsed, but it builds an invalid parse tree.
How would one go about building a parse tree for a sequence whose (context-free) grammar, untransformed, cannot be handled by an LR parser?

If a grammar has no LR(1) parser, you need to use a different parsing algorithm. In this case, you could use an LR(3) parser. Or you could (in general) use an Earley or GLR parser, which have no lookahead limitations.
I think your question has to do with recovering the original parse from the results of a parse with a transformed grammar. This will depend on the transformation algorithm.
In the example you provide, I think you're using the left-recursion-elimination transformation; this procedure does not preserve derivations, so as far as I know there is no algorithm for recovering the original parse.
There is a different transformation which can be used to construct an LR(1) grammar from an LR(k) grammar if the value of k is known. That transformation is reversible. However, it's not usually considered practical, because it effectively encodes the LR(k) machine into the grammar rules, leading to a massive blow-up of the grammar. It would be equivalent to use a real LR(k) parser, which also has a huge automaton.

First, I would say, "Grammars determine whether a sequence is a sentence of a language." You then say that the transformed grammar builds an invalid parse tree. I would say that it builds a different parse tree, which may or may not be useful. But, to your question about building a parse tree from a non-LR grammar: consider the following grammar, which is not LR(k) for any k because it is ambiguous:
E -> E + E | E * E | number
For example:
7 * 4 + 3
There are two distinct parse trees you can build for this sentence, precisely because of the ambiguity in the grammar (really, this is the definition of an ambiguous grammar). So the answer to your question is that I wouldn't know how to do it in the general case.
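For instance, here are the two trees for 7 * 4 + 3. The left one groups (7 * 4) + 3 and evaluates to 31; the right one groups 7 * (4 + 3) and evaluates to 49:
      E                    E
    / | \                / | \
   E  +  E              E  *  E
 / | \   |              |   / | \
E  *  E  3              7  E  +  E
|     |                    |     |
7     4                    4     3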

Related

How does LR parsing select a qualifying grammar production (to construct the parse tree from the leaves)?

I am reading a tutorial on LR parsing. The tutorial uses this example grammar:
S -> aABe
A -> Abc | b
B -> d
Then, to illustrate how the parsing algorithm works, the tutorial shows the process of parsing the word "abbcde" step by step.
I understand that at each step of the algorithm, a qualifying production (namely a grammar rule, shown in column 2 of the tutorial's table) is found that matches a segment of the string. But how does the LR parser choose among a set of qualifying productions (shown in column 3 of the table)?
An LR parse of a string traces out a rightmost derivation in reverse. In that sense, the ordering of the reductions applied is what you would get if you derived the string by always expanding out the rightmost nonterminal, then running that process backwards. (Try this out on your example - isn’t that neat?)
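Spelled out for "abbcde", the reductions run like this:
abbcde   (reduce the first b using A -> b)
aAbcde   (reduce Abc using A -> Abc)
aAde     (reduce d using B -> d)
aABe     (reduce aABe using S -> aABe)
S
Read bottom-up, this is exactly the rightmost derivation S => aABe => aAde => aAbcde => abbcde.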
The specific mechanism by which LR parsers actually do this involves the use of a parsing automaton that tracks where within the grammar productions the parse happens to be, along with some lookahead information. There are several different flavors of LR parser (LR(0), SLR(1), LALR(1), LR(1), etc.), which differ on how the automaton is structured and how they use lookahead information. You may find it helpful to search for a tutorial on how these automata work, as that’s the heart of how LR parsers actually work.

How can LL(1) grammars be recursive if an LL(1) parser is non-recursive?

An LL(1) parser is a non-recursive predictive parser. Given that, why can LL(1) grammars be recursive? These seem inconsistent with one another.
I think your confusion stems from the fact that there are several different types of recursion here.
First, there's the fact that any CFG that can generate infinitely many strings - which would be basically any CFG you'd actually want to use in practice - has to involve some amount of recursion. If there isn't any recursion, you only get finitely many strings. So in that sense, there's CFG recursion, where there are production rules that lead to the original nonterminal getting produced a second (or third, fourth, etc.) time.
Next, there's the way that the parser is implemented. Some parsers are implemented using recursive descent or recursive backtracking. That's a design decision that's separate from whether the original grammar is recursive. Let's call that parser recursion.
Generally speaking, most LL(1) parsers are implemented without parser recursion, instead doing table-based lookups to drive the parsing. Many LL(1) grammars do have CFG recursion in them, but that's a separate matter.
As an example, consider this (very simple) LL(1) grammar:
A → b | cA
Notice that there's CFG recursion here, since the production A → cA is recursive.
After augmenting the grammar, we get this grammar:
S → A$
A → b | cA
Here's the LL(1) parsing table for the above grammar:
     |  b |  c |  $
-----+----+----+-----
  S  | A$ | A$ | acc
  A  |  b | cA |  -
We can use this parsing table to implement a (non-recursive) LL(1) parser by keeping track of the partial match so far and consulting this table any time we need to predict which production to use.
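As a sketch of what that looks like in code, here is a table-driven version in Python. The table encoding and token handling are mine, purely illustrative:
# Table-driven (non-recursive) LL(1) parser for S -> A$, A -> b | cA.
TABLE = {
    ('S', 'b'): ['A', '$'],
    ('S', 'c'): ['A', '$'],
    ('A', 'b'): ['b'],
    ('A', 'c'): ['c', 'A'],
}

def parse(tokens):
    tokens = list(tokens) + ['$']        # append the end-of-input marker
    stack = ['S']                        # start with the start symbol
    pos = 0
    while stack:
        top = stack.pop()
        if top == tokens[pos]:           # terminal on top: match and advance
            pos += 1
        elif (top, tokens[pos]) in TABLE:
            # nonterminal on top: predict a production, push its RHS reversed
            stack.extend(reversed(TABLE[(top, tokens[pos])]))
        else:
            return False                 # no table entry: syntax error
    return pos == len(tokens)

print(parse('ccb'))   # True
print(parse('cc'))    # False
Note how the loop never calls itself: all the "recursion" in the grammar lives in the stack and the table.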

Understanding and Writing Parsers

I'm writing a program that requires me to create my first real, somewhat complicated parser. I would like to understand what parsing algorithms exist, as well as how to create a "grammar". So my questions are as follows:
1) How does one create a formal grammar that a parser can understand? What are the basic components of a grammar?
2) What parsing algorithms exist, and what kind of input does each excel at parsing?
3) In light of the broad nature of the questions above, what are some good references I can read through to understand the answer to questions 1 and 2?
I'm looking for more of a broad overview with the keywords/topic areas I need so I can look into the details myself. Thanks everybody!
You generally write a context-free grammar G that describes a certain formal language L (e.g. the set of all syntactically valid C programs) which is simply a set of strings over a certain alphabet (think of all well-formed C programs; or of all well-formed HTML documents; or of all well-formed MARKDOWN posts; all of these are sets of finite strings over certain subsets of the ASCII character set). After that you come up with a parser for the given grammar---that is, an algorithm that, given a string w, decides whether the string w can be derived by the grammar G. (For example, the grammar of the C11 language describes the set of all well-formed C programs.)
Some types of grammars admit simple-to-implement parsers. An example of grammars that are often used in practice are LL grammars. A special subset of LL grammars, called the LL(1) grammars, have parsers that run in linear time (linear in the length of the string we're parsing).
There are more general parsing algorithms---most notably the Earley parser and the CYK algorithm---that take as input a string w and a grammar G and decide in time O(|w|^3) whether the string w is derivable by the grammar G. (Notice how cool this is: the algorithm takes the grammar as an argument. But I don't think this approach is much used in practice.)
I implemented the Earley parser in Java some time ago. If you're interested, the code is available on GitHub.
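For a flavor of the grammar-as-input style, here is a minimal CYK recognizer sketch in Python. It requires the grammar in Chomsky normal form; the toy grammar below is my own encoding of the language a^n b^n (n >= 1) as S -> AX | AB, X -> SB, A -> a, B -> b:
# CYK membership test: is `word` derivable from `start` in `grammar`?
CNF = {
    'S': [('A', 'X'), ('A', 'B')],
    'X': [('S', 'B')],
    'A': [('a',)],
    'B': [('b',)],
}

def cyk(word, grammar, start='S'):
    n = len(word)
    # table[i][l] = set of nonterminals deriving the substring word[i:i+l+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                  # length-1 substrings
        for nt, prods in grammar.items():
            if (ch,) in prods:
                table[i][0].add(nt)
    for length in range(2, n + 1):                 # longer substrings
        for i in range(n - length + 1):
            for split in range(1, length):
                for nt, prods in grammar.items():
                    for prod in prods:
                        if (len(prod) == 2
                                and prod[0] in table[i][split - 1]
                                and prod[1] in table[i + split][length - split - 1]):
                            table[i][length - 1].add(nt)
    return n > 0 and start in table[0][n - 1]

print(cyk('aabb', CNF))   # True
print(cyk('aab', CNF))    # False
The three nested loops over length, start position, and split point are where the O(|w|^3) bound mentioned above comes from.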
For a concrete example of the whole process, consider the language of all balanced strings of parentheses: (), (()), ((()))()(())(), and so on. We can describe them with the following context-free grammar:
S -> (S) | SS | eps
where eps denotes the empty string. For example, we can derive the string (())() as follows: S => SS => (S)S => ((S))S => (())S => (())(S) => (())(). We can easily implement a parser for this grammar (left as an exercise :-).
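To take some of the mystery out of that exercise, here is one possible recognizer in Python. Since S -> (S) | SS | eps is ambiguous, this sketch uses the equivalent (and LL(1)) grammar S -> '(' S ')' S | eps:
def parse_S(s, i=0):
    # Parse one S starting at index i; return the index just past it,
    # or None if no valid S can be parsed there.
    while i < len(s) and s[i] == '(':
        j = parse_S(s, i + 1)            # the inner S
        if j is None or j >= len(s) or s[j] != ')':
            return None
        i = j + 1                        # the trailing S: iterate again
    return i                             # eps: consume nothing

def balanced(s):
    return parse_S(s) == len(s)

print(balanced('(())()'))   # True
print(balanced('(()'))      # False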
A very good reference is the so-called dragon book: Compilers: Principles, Techniques, and Tools by Aho et al. It covers all the essential topics. Another good reference is the classic book Introduction to Automata Theory, Languages, and Computation by Hopcroft et al.

Finding a language that is not LL(1)?

I've been playing around with a lot of grammars that are not LL(1) recently, and many of them can be transformed into grammars that are LL(1).
However, I have never seen an example of an unambiguous language that is not LL(1), that is, a language for which every grammar fails to be LL(1); nor do I have any idea how I would go about proving that I had found one if I accidentally stumbled across one.
Does anyone know how to prove that a particular unambiguous language is not LL(1)?
I was thinking about the question a while and then found this language at Wikipedia:
S -> A | B
A -> 'a' A 'b' | ε
B -> 'a' B 'b' 'b' | ε
They claim that the language described by the grammar above cannot be described by an LL(k) grammar. You asked about LL(1) only, and there the argument is pretty straightforward: seeing only the first symbol, you cannot tell whether the sequence is 'ab' (one b per a) or 'abb' (two b's per a), or any more deeply nested variant, and therefore you cannot choose the right rule. So the language is definitely not LL(1).
Also for every sequence generated by this grammar there is only one derivation tree. So the language is unambiguous.
The second part of your question is a little harder. It is much easier to prove that a language is LL(1) (just exhibit an LL(1) grammar for it) than the opposite (that no LL(1) grammar describing the language exists). I think you would create a grammar describing the language and then try to make it LL(1); after discovering a conflict which cannot be resolved, you somehow have to take advantage of it to build a proof.

How to determine whether a language is LL(1) LR(0) SLR(1)

Is there a simple way to determine whether a grammar is LL(1), LR(0), SLR(1)... just from looking at the grammar, without doing any complex analysis?
For instance: to decide whether a BNF grammar is LL(1), you have to calculate the FIRST and FOLLOW sets, which can be time-consuming in some cases.
Has anybody got an idea how to do this faster?
Any help would really be appreciated!
First off, a bit of pedantry. You cannot determine whether a language is LL(1) from inspecting a grammar for it, you can only make statements about the grammar itself. It is perfectly possible to write non-LL(1) grammars for languages for which an LL(1) grammar exists.
With that out of the way:
You could write a parser for the grammar notation and have a program calculate the FIRST and FOLLOW sets and other properties for you. After all, that's the big advantage of BNF grammars: they are machine comprehensible.
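As a rough illustration of how mechanical this is, here is a fixpoint computation of FIRST sets in Python. The grammar encoding is mine (nonterminals are the dict keys, 'eps' marks an empty production); it's a sketch, not a full grammar toolkit:
GRAMMAR = {
    'S': [['A', 'b']],
    'A': [['a', 'A'], ['eps']],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                                 # iterate to a fixpoint
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for sym in prod:
                    if sym not in grammar:         # terminal (or 'eps')
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        break
                    new = first[sym] - {'eps'}     # nonterminal: copy its FIRST
                    if not new <= first[nt]:
                        first[nt] |= new
                        changed = True
                    if 'eps' not in first[sym]:    # stop unless it can vanish
                        break
                else:
                    # every symbol in the production can derive eps
                    if 'eps' not in first[nt]:
                        first[nt].add('eps')
                        changed = True
    return first

print(first_sets(GRAMMAR))   # FIRST(S) = {a, b}, FIRST(A) = {a, eps}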
Inspect the grammar and look for violations of the constraints of the various grammar types. For instance: LL(1) allows right recursion but not left recursion; thus, a grammar that contains left recursion is not LL(1). (For the other grammar properties you're going to have to spend some quality time with the definitions, because I can't remember anything else off the top of my head right now. :)
In answer to your main question: For a very simple grammar, it may be possible to determine whether it is LL(1) without constructing FIRST and FOLLOW sets, e.g.
A → A + A | a
is not LL(1), while
A → a | b
is.
But when you get more complex than that, you'll need to do some analysis.
A → B | a
B → A + A
This is not LL(1), but that may not be immediately obvious.
The grammar rules for arithmetic quickly get very complex:
expr → term { '+' term }
term → factor { '*' factor }
factor → number | '(' expr ')'
This grammar handles only multiplication and addition, and already it's not immediately clear whether it is LL(1). It's still possible to check by looking through the grammar, but as the grammar grows that becomes less feasible. If we're defining a grammar for an entire programming language, it's almost certainly going to take some complex analysis.
That said, there are a few obvious telltale signs that the grammar is not LL(1) — like the A → A + A above — and if you can find any of these in your grammar, you'll know it needs to be rewritten if you're writing a recursive descent parser. But there's no shortcut to verify that the grammar is LL(1).
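For what it's worth, the arithmetic grammar above is well suited to recursive descent: each { ... } repetition becomes a loop. Here is a minimal evaluator sketch in Python (single-character tokens only, function names mine):
def parse_expr(toks, i):
    # expr -> term { '+' term }
    val, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == '+':
        rhs, i = parse_term(toks, i + 1)
        val += rhs
    return val, i

def parse_term(toks, i):
    # term -> factor { '*' factor }
    val, i = parse_factor(toks, i)
    while i < len(toks) and toks[i] == '*':
        rhs, i = parse_factor(toks, i + 1)
        val *= rhs
    return val, i

def parse_factor(toks, i):
    # factor -> number | '(' expr ')'
    if toks[i] == '(':
        val, i = parse_expr(toks, i + 1)
        assert toks[i] == ')', "expected closing parenthesis"
        return val, i + 1
    return int(toks[i]), i + 1

print(parse_expr(list('7*4+3'), 0))   # (31, 5): '*' binds tighter than '+'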
One aspect, "is the language/grammar ambiguous", is a known undecidable question like the Post correspondence and halting problems.
Straight from the book "Compilers: Principles, Techniques, & Tools" by Aho et al.
Page 223:
A grammar G is LL(1) if and only if whenever A -> alpha | beta are two distinct productions of G, the following conditions hold:
1) For no terminal "a" do both alpha and beta derive strings beginning with "a".
2) At most one of alpha and beta can derive the empty string.
3) If beta derives the empty string (in zero or more steps), then alpha does not derive any string beginning with a terminal in FOLLOW(A); likewise, if alpha derives the empty string, then beta does not derive any string beginning with a terminal in FOLLOW(A).
Essentially, this is a matter of verifying that the grammar passes the pairwise disjointness test and does not involve left recursion. Or, more succinctly: a grammar G that is left-recursive or ambiguous cannot be LL(1).
Check whether the grammar is ambiguous or not. If it is, then the grammar is not LL(1) because no ambiguous grammar is LL(1).
Yes, there are shortcuts for LL(1) grammars:
1) If A -> B1 | B2 | ... | Bn, then the sets FIRST(B1), FIRST(B2), ..., FIRST(Bn) must be pairwise disjoint.
2) If A -> B1 | epsilon, then the intersection of FIRST(B1) and FOLLOW(A) must be the empty set.
3) If G is a grammar in which every nonterminal has only one production, then the grammar is LL(1).
p0 S' → E
p1 E → id
p2 E → id ( E )
p3 E → E + id
Construct the LR(0) DFA, the FOLLOW set for E, and the SLR action/goto tables.
Is this an LR(0) grammar? Prove your answer.
Using the SLR tables, show the steps (shifts, reductions, accept) of an LR parser parsing: id ( id + id )
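A sketch of how this plays out, assuming the standard LR(0) and SLR constructions (worth checking against your own tables): the state reached after shifting id contains both the complete item E -> id . and the item E -> id . ( E ), a shift-reduce conflict, so the grammar is not LR(0). SLR(1) resolves the conflict because FOLLOW(E) = { +, ), $ } does not contain (. The parse of id ( id + id ) then runs:
shift id
shift (                ( is not in FOLLOW(E), so no reduce
shift id
reduce E -> id         lookahead + is in FOLLOW(E)
shift +
shift id
reduce E -> E + id
shift )
reduce E -> id ( E )
accept                 lookahead is $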
