I am implementing a Recursive Descent Parser to parse a C-like language. For parsing expressions, I was using an Operator Precedence parser, but I wanted to make the precedence of some binary operators higher than the unary operators. How can I do this?
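In case it helps, here is a minimal Pratt-style (precedence-climbing) sketch in Python of how that is usually arranged; the operator set, the binding-power numbers and the function names are illustrative assumptions, not taken from your grammar. The trick is that the unary rule requests its operand with its own binding power, so any binary operator given a higher binding power (here '^') gets pulled into the operand, while lower-precedence binary operators (here '*') do not:

# Illustrative sketch: binding powers chosen so that '^' binds tighter than prefix '-'.
BINARY_BP = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 40}
UNARY_BP = 30                                    # between '*' and '^'

def parse_expr(tokens, pos, min_bp=0):
    tok = tokens[pos]; pos += 1
    if tok == '-':                               # prefix minus: operand parsed at UNARY_BP
        operand, pos = parse_expr(tokens, pos, UNARY_BP)
        left = ('neg', operand)
    elif tok == '(':
        left, pos = parse_expr(tokens, pos, 0)
        pos += 1                                 # skip ')'
    else:
        left = ('num', tok)
    while pos < len(tokens) and tokens[pos] in BINARY_BP and BINARY_BP[tokens[pos]] > min_bp:
        op = tokens[pos]
        # right-associative '^' recurses with bp - 1 so an equal '^' continues the recursion
        rbp = BINARY_BP[op] - 1 if op == '^' else BINARY_BP[op]
        right, pos = parse_expr(tokens, pos + 1, rbp)
        left = (op, left, right)
    return left, pos

print(parse_expr(list('-a^b'), 0)[0])   # ('neg', ('^', ('num', 'a'), ('num', 'b')))
print(parse_expr(list('-a*b'), 0)[0])   # ('*', ('neg', ('num', 'a')), ('num', 'b'))

The same idea applies to a table-driven operator-precedence parser: give the unary operator its own precedence level and place the binary operators you want to bind more tightly above it in the table.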
Related
I've read that Earley is easier to use and that it can handle more cases than LL(k) (see https://www.wikiwand.com/en/Earley_parser):
Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages.
But I cannot find a simple example that shows Earley has an advantage over LL(k).
That quote (and the Wikipedia entry it comes from) does not claim that the Earley parsing algorithm is faster than LL(k), nor that it allows for shorter grammars. What it claims is that the Earley parser can parse grammars which cannot be parsed by LL(k) or LR(k) parsers. That is true: the Earley parser can parse any context-free grammar.
A simple example of a language which cannot be parsed by an LR(k) parser is the language of palindromes (sentences which read the same left-to-right as right-to-left). Here's a grammar for even-length palindromes over the alphabet {a, b}:
S → ε
S → a S a
S → b S b
You can add odd-length palindromes by adding two more productions (S → a and S → b), and it should be easy to see how to extend it to a larger alphabet.
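To make that concrete, here is a minimal, unoptimized Earley recognizer in Python, together with the palindrome grammar above; the function names and the dictionary encoding of the grammar are my own illustrative choices:

# Illustrative Earley recognizer.
# grammar: dict mapping a nonterminal to a list of right-hand sides (tuples of symbols).
# A symbol is treated as a nonterminal iff it is a key of `grammar`.
def earley_recognize(grammar, start, tokens):
    # Precompute which nonterminals can derive the empty string, so that
    # predictions over nullable symbols are advanced immediately.
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs not in nullable and any(all(s in nullable for s in rhs) for rhs in rhss):
                nullable.add(lhs)
                changed = True

    n = len(tokens)
    # An item is (lhs, rhs, dot, origin): a partially recognized production.
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        work = list(chart[i])
        while work:
            lhs, rhs, dot, origin = work.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                        # predict
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            work.append(new)
                    if sym in nullable:                   # skip over nullable nonterminal
                        new = (lhs, rhs, dot + 1, origin)
                        if new not in chart[i]:
                            chart[i].add(new)
                            work.append(new)
                elif i < n and tokens[i] == sym:          # scan a terminal
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # complete: lhs fully recognized
                for olhs, orhs, odot, oorigin in list(chart[origin]):
                    if odot < len(orhs) and orhs[odot] == lhs:
                        new = (olhs, orhs, odot + 1, oorigin)
                        if new not in chart[i]:
                            chart[i].add(new)
                            work.append(new)
    return any(it[0] == start and it[2] == len(it[1]) and it[3] == 0 for it in chart[n])

# The even-length palindrome grammar from above, plus the two odd-length rules.
palindrome = {'S': [(), ('a', 'S', 'a'), ('b', 'S', 'b'), ('a',), ('b',)]}
print(earley_recognize(palindrome, 'S', list('abba')))   # True
print(earley_recognize(palindrome, 'S', list('abab')))   # False

An LR(k) or LL(k) parser, reading strictly left to right with k tokens of lookahead, has no way to decide where the middle of the palindrome is; the Earley chart simply keeps all the possibilities alive until the end of the input resolves them.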
Note that the grammar is unambiguous: there is only one parse tree for every valid input. (Ambiguity would not have been an obstacle for Earley parsing anyway -- the parser can produce a representation of all possible parses of an ambiguous grammar, although it might take longer than parsing with an unambiguous grammar.) LR(k) parsers, however, only exist for unambiguous grammars and, as the above example shows, not even for all unambiguous grammars.
In the above, I only mention LR(k) parsing, because LR(k) parsing is strictly more powerful than LL(k) parsing. Any grammar with an LL(k) parser can be parsed with an LR(k) parser, so if there is no LR(k) parser, there is also no LL(k) parser. However, the converse is not true: there are grammars which can be parsed by an LR(k) parser but which cannot be parsed by an LL(k') parser, for any value of k'. Moreover, there are languages which have an LR(1) grammar but no LL(k) grammar, for any value of k. The proofs of these assertions can be found in any good textbook on automata theory.
Any LL(k) language can be parsed in time linear in the length of the input using LL(k), LR(k) or Earley parsers. That is, the three algorithms have asymptotically equal computational complexity for grammars which qualify for all three. But asymptotic complexity isn't the full story. If you have an LR(1) grammar, it is probably faster (by a constant factor) to use an LR(1) parser, because the individual steps are just lookups in precomputed tables.
If the grammar is also LL(1), then a well-written recursive descent or table-driven LL(1) parser is probably faster still. (Comparisons between LR(1) and LL(1) parsers are not as clear-cut; a lot depends on the quality of the parser code.)
But for values of k larger than 1, the Earley algorithm could well be the best choice, because of the size of LR(k) decision tables for larger values of k.
I recently came across the concepts of LR, LL, etc. Which category does Lua fall into? Are there, or can there be, implementations that differ from the official code in this respect?
LR, LL and so on are parser-construction algorithms: given a grammar, they attempt to build a parser for it. This is not possible for every grammar, and grammars can be categorized on the basis of that possibility. But you have to be aware of the difference between a language and a grammar.
It might be possible to create an LR(k) parser for a given grammar, for some specific value of k. If so, the grammar is LR(k). Note that an LR(k) grammar is also an LR(k+1) grammar, and that an LL(k) grammar is also LR(k). So these are not categories in the sense that every grammar is in exactly one category.
Any language can be recognised by many different grammars (in fact, an unlimited number), and these grammars can be arbitrarily complex. You can always write a grammar for a given language which is not even context-free. We say that a language is <X> if there exists a grammar for that language which is <X>; the fact that some particular grammar for the language is not <X> tells you nothing about the language itself.
One interesting theorem demonstrates that if a language has an LR(k) grammar for some k, then an LR(1) grammar for that language can be derived from it. So while the k parameter is useful for describing grammars, languages can only be LR(0) or LR(1). This is not true of LL(k) languages, though.
Lua as a language is basically LR(1) and LL(2). The grammar is part of the reference manual, except that the published grammar doesn't specify operator precedences or a few rules having to do with newlines. The actual parser is a hand-written recursive-descent parser (at least, the last time I looked) with a couple of small adjustments to handle operator precedence and the minor deviations from LL(1). However, LALR(1) parsers for Lua exist as well.
Given an ambiguous grammar, we remove the operator precedence problems by rewriting the grammar so that it encodes the precedence rules. To solve the operator associativity problem, we make the relevant productions left-recursive or right-recursive, depending on the associativity of the operator involved.
Now, when the computer actually does the parsing, say with the recursive descent algorithm, does the grammar have to be unambiguous in the first place? Or does the grammar have different requirements depending on the algorithm?
If the grammar is left-recursive, the recursive descent algorithm doesn't terminate. So how do I give an unambiguous grammar (with the associativity problems solved) to the algorithm as input?
The grammar must be LL(k) to use the standard efficient recursive descent algorithm with no backtracking. There are standard transformations useful for taking a general LR grammar (basically any grammar parsable by a deterministic stack-based algorithm) to LL(k) form. They include left recursion elimination and left factoring. These are extensive topics I won't attempt to cover here, but they are covered well in most any good compiler text and reasonably well in online notes available through search. Aho, Sethi and Ullman's Compilers: Principles, Techniques, and Tools is a great reference for this and most other compiler basics.
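To give a flavour of those transformations, here is the classic left-recursion elimination applied to a binary-operator rule, followed by a small recursive-descent sketch in Python; the rule and the function names are illustrative, not taken from any particular text:

expr  → expr '+' term | term        (left recursive: direct recursive descent loops forever)
expr  → term expr'
expr' → '+' term expr' | ε

# Illustrative sketch: the transformed grammar parsed by recursive descent.
# Writing expr' as a loop keeps '+' left-associative when the tree is built,
# which addresses the associativity concern raised above.
def parse_expr(tokens, pos):
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == '+':      # expr' → '+' term expr'
        right, pos = parse_term(tokens, pos + 1)
        left = ('+', left, right)                        # build a left-leaning tree
    return left, pos                                     # expr' → ε falls out of the loop

def parse_term(tokens, pos):
    return ('num', tokens[pos]), pos + 1                 # toy: a term is a single number

tree, _ = parse_expr(['1', '+', '2', '+', '3'], 0)
print(tree)   # ('+', ('+', ('num', '1'), ('num', '2')), ('num', '3'))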
I'm implementing Pratt's top-down operator precedence parser and I'd like to know which formal category it falls into - is it LR(1)?
Pratt parsers are not LR parsers. And they're not exactly LL parsers either. In fact, Pratt parsers are generally hand-coded in some general-purpose programming language; the technique is not based on an abstraction like a push-down automaton. This makes it somewhat more difficult to prove assertions about a given Pratt parser, such as that it recognizes a particular formal language.
In general, Pratt parsers can easily be designed to recognize a language if the grammar is an operator precedence grammar, so they can be considered to be a dual of operator precedence parsing, even though operator precedence parsing is bottom-up and Pratt parsers are nominally top-down. Tracing a Pratt parser and the transitions of an operator precedence parser for the same language will show the similarity.
So I suppose that it might be possible to come up with a formalism for Pratt parsers, but as far as I know, none exists.
I am in the process of investigating PEG (Parsing Expression Grammar) parsers, and one of the topics I'm looking into is equivalence with other parsing techniques.
I found a good paper about transforming regexes into equivalent PEGs at From Regular Expressions to Parsing Expression Grammars.
I am hoping to find a similar treatment for LL(*) parsers but have as yet come up empty-handed. It seems to me that a lot of the techniques described in that paper are also going to be applicable to the problem of LL(*) transformation; however, I'm not sufficiently steeped in the formalisms to be confident of my own analysis.
Your collective help would be much appreciated!
The Wikipedia article about PEG says it all, I think. PEG does recursive descent, using clause ordering for disambiguation. In theory, the family of languages that can be parsed with recursive descent is the LL family, but, because PEG has unlimited lookahead and no ambiguity, the family should be a larger one (though whether it covers all context-free languages is, as far as I know, not settled).
Every LL(k) grammar can be implemented by a recursive-descent parser with k tokens of lookahead; therefore every LL(k) grammar can be transformed to a PEG grammar by ordering the alternatives of each rule so that those requiring the longest lookahead are listed first.
This is an LL(k) grammar:
params = expr
params = expr ',' params
To make it a PEG grammar for the same language, the rules must be reordered:
params = expr ',' params
params = expr
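As a rough illustration of why the ordering matters, here is that PEG rule written out as a backtracking recursive-descent function in Python; the helper names and the toy definition of expr (a single identifier) are assumptions made for the example:

# Illustrative sketch of PEG-style ordered choice: the first alternative that
# matches wins, so the longer alternative (expr ',' params) must be tried first.
def parse_params(tokens, pos):
    # alternative 1: expr ',' params
    p = parse_expr(tokens, pos)
    if p is not None and p < len(tokens) and tokens[p] == ',':
        q = parse_params(tokens, p + 1)
        if q is not None:
            return q
    # alternative 2: expr  (tried only if alternative 1 fails)
    return parse_expr(tokens, pos)

def parse_expr(tokens, pos):
    # toy expr: a single identifier token
    if pos < len(tokens) and tokens[pos].isidentifier():
        return pos + 1
    return None

print(parse_params(['a', ',', 'b'], 0))   # 3: the whole list is consumed
print(parse_params(['a'], 0))             # 1: falls through to the bare-expr alternative

With the alternatives in the original LL order, the bare expr alternative would match first and commit, so the ',' and the rest of the list would never be consumed by this rule -- which is exactly what the reordering avoids.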