Is there a well-defined and well-reasoned transformation from LL(*) to PEG? - parsing

I am in the process of investigating PEG (Parsing Expression Grammar) parsers, and one of the topics I'm looking into is equivalence with other parsing techniques.
I found a good paper about transforming regexes into equivalent PEGs: "From Regular Expressions to Parsing Expression Grammars".
I am hoping to find a similar treatment for LL(*) parsers but have so far come up empty-handed. It seems to me that a lot of the techniques described in that paper will also apply to the problem of LL(*) transformation, but I'm not sufficiently steeped in the formalisms to be confident of my own analysis.
Your collective help would be much appreciated!

The Wikipedia article about PEG says it all, I think. PEG does recursive descent by using clause ordering for disambiguation. In theory, the family of languages that can be parsed with recursive descent is the LL family, but, because PEG has unlimited lookahead and no ambiguity, the family should be a larger one, although it is not known whether it covers every context-free language.
Every LL(k) grammar can be implemented by a recursive-descent parser with k tokens of lookahead, therefore every LL(k) grammar can be transformed to a PEG grammar by ordering the rules so that those requiring the longest lookahead are listed first.
This is an LL(k) grammar:
params = expr
params = expr ',' params
To make it a PEG grammar for the same language, the rules must be reordered:
params = expr ',' params
params = expr
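To see why the order matters, here is a minimal sketch of both orderings as hand-written PEG-style matchers. The definition of expr as a single letter is an assumption made purely for illustration; the function names are hypothetical too. Each function returns the position after the match, or None on failure.

```python
def expr(text, pos):
    # Assumption for illustration only: an expr is a single letter.
    if pos < len(text) and text[pos].isalpha():
        return pos + 1
    return None


def expr_comma_params(params, text, pos):
    # The sequence: expr ',' params
    end = expr(text, pos)
    if end is None or text[end:end + 1] != ",":
        return None
    return params(text, end + 1)


def params_original_order(text, pos):
    # params <- expr / expr ',' params
    end = expr(text, pos)
    if end is not None:
        return end  # ordered choice commits here; the longer rule is never tried
    return expr_comma_params(params_original_order, text, pos)


def params_reordered(text, pos):
    # params <- expr ',' params / expr
    end = expr_comma_params(params_reordered, text, pos)
    if end is not None:
        return end
    return expr(text, pos)
```

With the original order, matching "a,b,c" stops after the first expr and leaves ",b,c" unconsumed; with the longer alternative first, the whole list is consumed.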

Related

Is there a simple example how Earley parser outperforms (or allows less verbose grammar than) LL(k) parser?

I've read that Earley is easier to use and that it can handle more cases than LL(k) (see https://www.wikiwand.com/en/Earley_parser):
Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages.
But I cannot find a simple example that shows Earley has an advantage over LL(k).
That quote (and the Wikipedia entry it comes from) does not claim that the Earley parsing algorithm is faster than LL(k), nor that it allows for shorter grammars. What it claims is that the Earley parser can parse grammars which cannot be parsed by LL(k) or LR(k) parsers. That is true; the Earley parser can parse any context-free grammar.
A simple example of a grammar which cannot be parsed by an LR(k) parser is the language of palindromes (sentences which read the same left-to-right as right-to-left). Here's a grammar for even-length palindromes over the alphabet {a, b}:
S → ε
S → a S a
S → b S b
You can add odd-length palindromes by adding two more productions (S → a and S → b), and it should be easy to see how to extend it to a larger alphabet.
Note that the grammar is unambiguous; there is only one parse tree for every valid input. That's not an issue for Earley parsing -- the parser can produce a representation of all possible parses from an ambiguous grammar, although it might take longer than parsing an unambiguous grammar. However, LR(k) parsers only exist for unambiguous grammars (and, as shown by the above example, not for all unambiguous grammars).
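As a rough illustration, the palindrome grammar above can be fed to a small Earley recognizer. This is a deliberately naive sketch (it re-runs prediction and completion on each chart set until nothing changes, which also copes with the empty production), not an optimized implementation; the function name and the grammar encoding are my own choices.

```python
def earley_recognize(grammar, start, tokens):
    """grammar maps each nonterminal to a list of bodies (tuples of symbols)."""
    n = len(tokens)
    # chart[i] holds items (head, body, dot, origin) valid after i tokens.
    chart = [set() for _ in range(n + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(n + 1):
        changed = True
        while changed:  # naive fixpoint over the current set
            changed = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body) and body[dot] in grammar:
                    # Predict: expand the nonterminal after the dot.
                    new = {(body[dot], b, 0, i) for b in grammar[body[dot]]}
                elif dot == len(body):
                    # Complete: advance items that were waiting for `head`.
                    new = {(h, b, d + 1, o)
                           for (h, b, d, o) in list(chart[origin])
                           if d < len(b) and b[d] == head}
                else:
                    new = set()
                if not new <= chart[i]:
                    chart[i] |= new
                    changed = True
        if i < n:
            # Scan: move the dot over a matching terminal.
            for head, body, dot, origin in chart[i]:
                if dot < len(body) and body[dot] == tokens[i]:
                    chart[i + 1].add((head, body, dot + 1, origin))
    return any(h == start and d == len(b) and o == 0
               for (h, b, d, o) in chart[n])


# S -> ε | a S a | b S b
PALINDROMES = {"S": [(), ("a", "S", "a"), ("b", "S", "b")]}
```

The recognizer never has to guess where the middle of the palindrome is: every possible midpoint is tracked in the chart at once, which is exactly what a deterministic parser with bounded lookahead cannot do.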
In the above, I only mention LR(k) parsing, because LR(k) parsing is strictly more powerful than LL(k) parsing. Any grammar with an LL(k) parser can be parsed with an LR(k) parser, so if there is no LR(k) parser, there is also no LL(k) parser. However, the converse is not true: there are grammars which can be parsed by an LR(k) parser for which no LL(k') parser exists, for any value of k'. Moreover, there are languages which have an LR(1) grammar and which have no LL(k) grammar, for any value of k. The proofs of these assertions can be found in any good textbook on automata theory.
Any LL(k) language can be parsed in time linear in the length of the input using LL(k), LR(k) or Earley parsers. That is, the three algorithms have asymptotically equal computational complexity for grammars which qualify for all three algorithms. But asymptotic complexity isn't the full story. If you have an LR(1) grammar, it is probably faster (by a constant factor) to use an LR(1) parser, because the individual steps are just lookups in precomputed tables.
If the grammar is also LL(1), then a well-written recursive descent or table-driven LL(1) is also probably faster. (Comparisons between LR(1) and LL(1) parsers are not as clear-cut; a lot will depend on the quality of the parser code.)
But for values of k larger than 1, the Earley algorithm could well be the best choice, because of the size of LR(k) decision tables for larger values of k.

What happens if you directly use LL grammar for an LR parser, after making basic syntactical changes?

Sorry for the amateurish question. I have a grammar that's LL and I want to write an LR grammar. Can I use the LL grammar, make minimal syntactical changes so that it fits an LR parser, and use it? Is that a bad idea? Are there structural differences between them that don't translate?
All LL(1) grammars are LR(1), so if you had an LR(1) parser generator, you could definitely use your LL(1) grammar, assuming the BNF syntax is that used by the parser generator.
You probably don't have an LR(1) parser generator, though, but rather a parser generator which can only handle the LALR(1) subset of LR(1) grammars. All the same, you're probably fine. "Most" LL(1) grammars are in LALR(1), and it's pretty rare to find a useful LL(1) grammar which is not. (This pattern is unlikely to arise in a practical grammar, for example.)
So it's probably possible. But it might not be a good idea.
Top-down parsers can't handle left-recursion, and without left-recursion you can't write a grammar which represents left-associative operators, which is to say most arithmetic operators. This problem is usually solved in practice by using a right-associative grammar along with an idiosyncratic evaluation function which corrects the associativity. That's less than ideal. Also, LL grammars created by mechanically removing left-recursion tend to be very hard to read.
So you are probably best off using a grammar designed for LR parsing. But you probably don't have to.
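The associativity workaround mentioned above can be sketched as follows. The grammar is written without left recursion (expr -> term rest, rest -> op term rest | ε), and the value accumulated so far is passed down into rest so that evaluation still groups to the left. The function names and the integers-only tokenizer are assumptions for illustration.

```python
import re


def tokenize(text):
    # Assumption: only non-negative integers and the operators + and -.
    return re.findall(r"\d+|[+-]", text)


def parse_expr(tokens):
    # expr -> term rest
    left = int(tokens.pop(0))
    return parse_rest(tokens, left)


def parse_rest(tokens, left):
    # rest -> ('+' | '-') term rest | ε
    # The grammar is right-recursive, but `left` carries the running
    # result, so "10-3-2" is evaluated as (10-3)-2 rather than 10-(3-2).
    if tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        right = int(tokens.pop(0))
        left = left + right if op == "+" else left - right
        return parse_rest(tokens, left)
    return left


def evaluate(text):
    return parse_expr(tokenize(text))
```

The parse tree implied by the grammar nests to the right; only the accumulator threaded through parse_rest restores the left-associative meaning, which is the kind of mismatch between grammar and evaluation that an LR grammar with direct left recursion avoids.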

What type of grammar is used to parse Lua?

I recently met the concept of LR, LL etc. Which category is Lua? Are there, or, can there be implementations that differ from the official code in this aspect?
LR, LL and so on are algorithms which attempt to find a parser for a given grammar. This is not possible for every grammar, and you can categorize on the basis of that possibility. But you have to be aware of the difference between a language and a grammar.
It might be possible to create an LR(k) parser for a given grammar, for some specific value of k. If so, the grammar is LR(k). Note that an LR(k) grammar is also an LR(k+1) grammar, and that an LL(k) grammar is also LR(k). So these are not categories in the sense that every grammar is in exactly one category.
Any language can be recognised by many different grammars (in fact, an unlimited number). These grammars can be arbitrarily complex. You can always write a grammar for a given language which is not even context-free. We say that a language is <X> if there exists a grammar for that language which is <X>. But the fact that a specific grammar for that language is not <X> says nothing about the language itself.
One interesting theorem demonstrates that if there is an LR(k) grammar for any language, then it is possible to derive an LR(1) grammar for that language. So while the k parameter is useful for describing grammars, languages can only be LR(0) or LR(1). This is not true of LL(k) languages, though.
Lua as a language is basically LR(1) and LL(2). The grammar is part of the reference manual, except that the published grammar doesn't specify operator precedences or a few rules having to do with newlines. The actual parser is a hand-written recursive-descent parser (at least, the last time I looked) with a couple of small deviations in order to handle operator precedence and the minor deviations from LL(1). However, there exist LALR(1) parsers for Lua as well.

Removing ambiguity from context free grammars

Given an ambiguous grammar, to remove operator precedence problems we would convert the grammar to follow the operator precedence rules. To solve the operator associativity problem, we convert the grammar into left recursive or right recursive by considering the operator it is associated with.
Now, when the computer has to do the parsing, say with the recursive descent algorithm, does the grammar have to be unambiguous in the first place? Or does the grammar have different requirements depending on the algorithm?
If the grammar is left recursive, the recursive descent algorithm doesn't terminate. So how do I give an unambiguous grammar (with the associativity problems solved) to the algorithm as input?
The grammar must be LL(k) to use the standard efficient recursive descent algorithm with no backtracking. There are standard transformations that can often take a general LR grammar (basically any grammar parsable by a deterministic stack-based algorithm) to LL(k) form, although not every LR grammar has an LL(k) equivalent. They include left recursion elimination and left factoring. These are extensive topics I won't attempt to cover here, but they are covered well in almost any good compiler text and reasonably well in online notes available through search. Aho, Sethi and Ullman's Compilers: Principles, Techniques, and Tools is a great reference for this and most other compiler basics.
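As a small taste of the first transformation, here is a sketch of removing immediate left recursion from a single nonterminal: A -> A α | β becomes A -> β A' and A' -> α A' | ε. The function name and the list-of-lists grammar encoding are illustrative assumptions, and the sketch assumes the primed name is not already in use. Indirect left recursion and left factoring are not handled.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite the productions of `head`; each body is a list of symbols.

    Returns a dict mapping nonterminals to their new lists of bodies.
    """
    # Bodies of the form  head α  (keep only α).
    recursive = [body[1:] for body in bodies if body and body[0] == head]
    # Bodies β that do not start with head.
    others = [body for body in bodies if not body or body[0] != head]
    if not recursive:
        return {head: bodies}
    tail = head + "'"  # assumed to be a fresh nonterminal name
    return {
        head: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],  # [] is ε
    }
```

For example, E -> E + T | T comes out as E -> T E' and E' -> + T E' | ε. The result accepts the same strings, but the parse tree now leans to the right, so a left-associative operator needs the associativity restored during evaluation.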

Difference between an LL and Recursive Descent parser?

I've recently been trying to teach myself how parsers (for languages/context-free grammars) work, and most of it seems to be making sense, except for one thing. I'm focusing my attention in particular on LL(k) grammars, for which the two main algorithms seem to be the LL parser (using stack/parse table) and the Recursive Descent parser (simply using recursion).
As far as I can see, the recursive descent algorithm works on all LL(k) grammars and possibly more, whereas an LL parser works on all LL(k) grammars. A recursive descent parser is clearly much simpler than an LL parser to implement, however (just as an LL one is simpler than an LR one).
So my question is, what are the advantages/problems one might encounter when using either of the algorithms? Why might one ever pick LL over recursive descent, given that it works on the same set of grammars and is trickier to implement?
LL is usually a more efficient parsing technique than recursive-descent. In fact, a naive recursive-descent parser will actually be O(k^n) (where n is the input size) in the worst case. Some techniques such as memoization (which yields a Packrat parser) can improve this as well as extend the class of grammars accepted by the parser, but there is always a space tradeoff. LL parsers are (to my knowledge) always linear time.
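The blow-up and the effect of memoization can be seen in a small experiment. This is a sketch under assumptions: the toy grammar (expr <- primary '+' expr / primary '-' expr / primary, primary <- '(' expr ')' / 'x') and all names are made up for illustration, and functools.lru_cache stands in for a packrat memo table.

```python
from functools import lru_cache


def make_parser(text, memoize):
    """Return (expr, calls); expr(pos) gives the end position or None."""
    calls = {"primary": 0}  # counts real executions of primary

    def expr(pos):
        # expr <- primary '+' expr / primary '-' expr / primary
        for op in "+-":
            end = primary(pos)  # re-parsed for every alternative
            if end is not None and text[end:end + 1] == op:
                rest = expr(end + 1)
                if rest is not None:
                    return rest
        return primary(pos)

    def primary(pos):
        # primary <- '(' expr ')' / 'x'
        calls["primary"] += 1
        if text[pos:pos + 1] == "x":
            return pos + 1
        if text[pos:pos + 1] == "(":
            end = expr(pos + 1)
            if end is not None and text[end:end + 1] == ")":
                return end + 1
        return None

    if memoize:
        # Packrat idea: remember the result for each (rule, position) pair.
        expr = lru_cache(maxsize=None)(expr)
        primary = lru_cache(maxsize=None)(primary)
    return expr, calls


def count_primary_calls(depth, memoize):
    text = "(" * depth + "x" + ")" * depth
    expr, calls = make_parser(text, memoize)
    assert expr(0) == len(text)
    return calls["primary"]
```

Without memoization every nesting level triples the work, because each alternative of expr re-parses the same primary; with memoization primary runs once per starting position, at the cost of storing one result per rule and position.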
On the flip side, you are correct in your intuition that recursive-descent parsers can handle a greater class of grammars than LL. Recursive-descent can handle any grammar which is LL(*) (that is, unlimited lookahead) as well as a small set of ambiguous grammars. This is because recursive-descent is actually a directly-encoded implementation of PEGs, or Parsing Expression Grammars. Specifically, the disjunctive operator (a | b) is not commutative, meaning that a | b does not equal b | a. A recursive-descent parser will try each alternative in order. So if a matches the input, it will succeed even if b would have matched the input. This allows classic "longest match" ambiguities like the dangling else problem to be handled simply by ordering disjunctions correctly.
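For the dangling else case, a minimal sketch might look like this. The toy statement language (a fixed condition token c and a single simple statement s) and the function name are assumptions for illustration; the two 'if' alternatives share their common prefix in the code, which gives the same result as trying them in order.

```python
def parse_stmt(tokens, pos):
    """stmt <- 'if' 'c' 'then' stmt 'else' stmt
             / 'if' 'c' 'then' stmt
             / 's'
    Returns (tree, next_pos), or None if nothing matches at pos.
    """
    if tokens[pos:pos + 3] == ["if", "c", "then"]:
        body = parse_stmt(tokens, pos + 3)
        if body is None:
            return None
        then_tree, after_then = body
        # First alternative: take an 'else' if one follows immediately.
        if tokens[after_then:after_then + 1] == ["else"]:
            alt = parse_stmt(tokens, after_then + 1)
            if alt is not None:
                else_tree, end = alt
                return ("if", then_tree, else_tree), end
        # Second alternative: the 'if' has no 'else' branch.
        return ("if", then_tree, None), after_then
    if tokens[pos:pos + 1] == ["s"]:
        return "s", pos + 1
    return None
```

Because the innermost call tries the alternative with else first, the else in "if c then if c then s else s" attaches to the nearest if, and the outer if is left without an else branch.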
With all of that said, it is possible to implement an LL(k) parser using recursive-descent so that it runs in linear time. This is done by essentially inlining the predict sets so that each parse routine determines the appropriate production for a given input in constant time. Unfortunately, such a technique eliminates an entire class of grammars from being handled. Once we get into predictive parsing, problems like dangling else are no longer solvable with such ease.
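Here is a sketch of what inlining the predict sets means, using the balanced-parentheses grammar S -> '(' S ')' S | ε as an assumed example (the grammar, the '$' end marker and the function name are mine). Each routine looks at one token and picks its production directly, so there is no backtracking.

```python
def accepts_balanced(text):
    """Predictive recursive descent for  S -> '(' S ')' S | ε."""
    pos = 0

    def peek():
        # '$' is a made-up end-of-input marker.
        return text[pos] if pos < len(text) else "$"

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {pos}")
        pos += 1

    def parse_s():
        tok = peek()
        if tok == "(":
            # PREDICT(S -> '(' S ')' S) = { '(' }
            expect("(")
            parse_s()
            expect(")")
            parse_s()
        elif tok in (")", "$"):
            # PREDICT(S -> ε) = FOLLOW(S) = { ')', '$' }
            return
        else:
            raise SyntaxError(f"unexpected {tok!r} at position {pos}")

    try:
        parse_s()
        expect("$" if pos < len(text) else peek())  # must be at end of input
    except SyntaxError:
        return False
    return True
```

Every decision is a constant-time test on the next token, which is what makes the parser linear, and it is also why a grammar whose alternatives share a predict-set token no longer fits without further changes.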
As for why LL would be chosen over recursive-descent, it's mainly a question of efficiency and maintainability. Recursive-descent parsers are markedly easier to implement, but they're usually harder to maintain since the grammar they represent does not exist in any declarative form. Most non-trivial parser use-cases employ a parser generator such as ANTLR or Bison. With such tools, it really doesn't matter if the algorithm is directly-encoded recursive-descent or table-driven LL(k).
As a matter of interest, it is also worth looking into recursive-ascent, which is a parsing algorithm directly encoded after the fashion of recursive-descent, but capable of handling any LALR grammar. I would also dig into parser combinators, which are a functional way of composing recursive-descent parsers together.

Resources