Is there a simple example how Earley parser outperforms (or allows less verbose grammar than) LL(k) parser? - parsing

I've read that Earley is easier to use and that it can handle more cases than LL(k) (see https://www.wikiwand.com/en/Earley_parser):
Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages.
But I cannot find a simple example that shows Earley has an advantage over LL(k).

That quote (and the Wikipedia entry it comes from) does not claim that the Earley parsing algorithm is faster than LL(k), nor that it allows for shorter grammars. What it claims is that the Earley parser can parse grammars which cannot be parsed by LL(k) or LR(k) parsers. That is true; the Earley parser can parse any context-free grammar.
A simple example of a language which cannot be parsed by an LR(k) parser is the language of palindromes (sentences which read the same left-to-right as right-to-left). Here's a grammar for even-length palindromes over the alphabet {a, b}:
S → ε
S → a S a
S → b S b
You can add odd-length palindromes by adding two more productions (S → a and S → b), and it should be easy to see how to extend it to a larger alphabet.
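To make this concrete, here is a minimal sketch of an Earley recognizer run on the even-length palindrome grammar. The function name `earley_recognize` and the dictionary encoding of the grammar are my own choices for illustration, not part of any library. The inner loop simply repeats prediction and completion until nothing changes, which is slow but handles the ε production without special cases.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    An item is (lhs, rhs, dot, origin): rule lhs -> rhs with `dot` symbols
    matched so far, starting at input position `origin`.
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        # Predict and complete until chart[i] stops growing (a fixpoint).
        changed = True
        while changed:
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot == len(rhs):
                    # Complete: advance every item that was waiting for lhs.
                    new = {(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                           if d < len(r) and r[d] == lhs}
                elif rhs[dot] in grammar:
                    # Predict: start every rule for the expected nonterminal.
                    new = {(rhs[dot], r, 0, i) for r in grammar[rhs[dot]]}
                else:
                    new = set()
                if not new.issubset(chart[i]):
                    chart[i] |= new
                    changed = True
        if i < n:
            # Scan: move the dot over the next input token.
            chart[i + 1] = {(l, r, d + 1, o) for (l, r, d, o) in chart[i]
                            if d < len(r) and r[d] == tokens[i]}
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])


# S -> ε | a S a | b S b
PALINDROMES = {"S": [(), ("a", "S", "a"), ("b", "S", "b")]}

print(earley_recognize(PALINDROMES, "S", "abba"))  # True
print(earley_recognize(PALINDROMES, "S", "abab"))  # False
```

The recognizer never has to guess where the middle of the palindrome is; it keeps every possibility alive in the chart, which is exactly what a deterministic LR(k) or LL(k) parser cannot do.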
Note that the grammar is unambiguous; there is only one parse tree for every valid input. That's not an issue for Earley parsing -- the parser can produce a representation of all possible parses from an ambiguous grammar, although it might take longer than parsing an unambiguous grammar. However, LR(k) parsers only exist for unambiguous grammars (and, as shown by the above example, not for all unambiguous grammars).
In the above, I only mention LR(k) parsing, because LR(k) parsing is strictly more powerful than LL(k) parsing. Any grammar with an LL(k) parser can be parsed with an LR(k) parser, so if there is no LR(k) parser, there is also no LL(k) parser. However, the converse is not true: there are grammars which can be parsed by an LR(k) parser but which are not LL(k') for any value of k'. Moreover, there are languages which have an LR(1) grammar and which have no LL(k) grammar, for any value of k. The proofs of these assertions can be found in any good textbook on automata theory.
Any LL(k) language can be parsed in time linear in the length of the input using LL(k), LR(k) or Earley parsers. That is, the three algorithms have asymptotically equal computational complexity for grammars which qualify for all three algorithms. But asymptotic complexity isn't the full story. If you have an LR(1) grammar, it is probably faster (by a constant factor) to use an LR(1) parser, because the individual steps are just lookups in precomputed tables.
If the grammar is also LL(1), then a well-written recursive descent or table-driven LL(1) is also probably faster. (Comparisons between LR(1) and LL(1) parsers are not as clear-cut; a lot will depend on the quality of the parser code.)
But for values of k larger than 1, the Earley algorithm could well be the best choice, because of the size of LR(k) decision tables for larger values of k.

Related

What happens if you directly use LL grammar for an LR parser, after making basic syntactical changes?

sorry for the amateurish question. I have a grammar that's LL and I want to write an LR grammar. Can I use the LL grammar, make minimal syntactical changes for it to fit with an LR parser and use it? Is that a bad idea? Are there structural differences between them that don't translate?
All LL(1) grammars are LR(1), so if you had an LR(1) parser generator, you could definitely use your LL(1) grammar, assuming the BNF syntax is that used by the parser generator.
But you probably don't have an LR(1) parser generator, but rather a parser generator which can only handle the LALR(1) subset of LR(1) grammars. All the same, you're probably fine. "Most" LL(1) grammars are in LALR(1), and it's pretty rare to find a useful LL(1) which is not. (This pattern is unlikely to arise in a practical grammar, for example.)
So it's probably possible. But it might not be a good idea.
Top-down parsers can't handle left-recursion, and without left-recursion you can't write a grammar which represents left-associative operators, which is to say most arithmetic operators. This problem is usually solved in practice by using a right-associative grammar along with an idiosyncratic evaluation function which corrects the associativity. That's less than ideal. Also, LL grammars created by mechanically removing left-recursion tend to be very hard to read.
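A small sketch of the associativity problem, assuming a toy language of integers joined by `+` and `-` (the function names are hypothetical). The first evaluator follows a right-recursive grammar literally and gets subtraction wrong; the second is the usual workaround, where the recursion is replaced by a loop that folds results to the left.

```python
import re


def tokenize(text):
    # Toy tokenizer: integers and the operators + and - only.
    return re.findall(r"\d+|[-+]", text)


def eval_right_recursive(tokens, pos=0):
    # expr -> NUMBER | NUMBER op expr
    # Evaluates exactly the way the right-recursive grammar nests,
    # so "10 - 3 - 2" is treated as 10 - (3 - 2).
    value = int(tokens[pos])
    if pos + 1 < len(tokens):
        rest = eval_right_recursive(tokens, pos + 2)
        return value + rest if tokens[pos + 1] == "+" else value - rest
    return value


def eval_left_assoc(tokens):
    # expr -> NUMBER (op NUMBER)*
    # Same language, but the repetition is a loop that accumulates
    # to the left, which restores left associativity.
    value = int(tokens[0])
    pos = 1
    while pos < len(tokens):
        operand = int(tokens[pos + 1])
        value = value + operand if tokens[pos] == "+" else value - operand
        pos += 2
    return value


tokens = tokenize("10 - 3 - 2")
print(eval_right_recursive(tokens))  # 9, i.e. 10 - (3 - 2)
print(eval_left_assoc(tokens))       # 5, i.e. (10 - 3) - 2
```

With an LR grammar you would just write the left-recursive rule `expr → expr '-' NUMBER` and the parse tree would already have the right shape.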
So you are probably best off using a grammar designed for LR parsing. But you probably don't have to.

What type of grammar is used to parse Lua?

I recently met the concept of LR, LL etc. Which category is Lua? Are there, or, can there be implementations that differ from the official code in this aspect?
LR, LL and so on are algorithms which attempt to find a parser for a given grammar. This is not possible for every grammar, and you can categorize on the basis of that possibility. But you have to be aware of the difference between a language and a grammar.
It might be possible to create an LR(k) parser for a given grammar, for some specific value of k. If so, the grammar is LR(k). Note that an LR(k) grammar is also an LR(k+1) grammar, and that an LL(k) grammar is also LR(k). So these are not categories in the sense that every grammar is in exactly one category.
Any language can be recognised by many different grammars. (In fact, an unlimited number). These grammars can be arbitrarily complex. You can always write a grammar for a given language which is not even context-free. We say that a language is <X> if there exists a grammar for that language which is <X>. But the fact that a specific grammar for that language is not <X> says nothing.
One interesting theorem demonstrates that if there is an LR(k) grammar for any language, then it is possible to derive an LR(1) grammar for that language. So while the k parameter is useful for describing grammars, languages can only be LR(0) or LR(1). This is not true of LL(k) languages, though.
Lua as a language is basically LR(1) and LL(2). The grammar is part of the reference manual, except that the published grammar doesn't specify operator precedences or a few rules having to do with newlines. The actual parser is a hand-written recursive-descent parser (at least, the last time I looked) with a couple of small deviations in order to handle operator precedence and the minor deviations from LL(1). However, there exist LALR(1) parsers for Lua as well.

Is there a type of parser generator that handles all deterministic context-free grammars?

I need a way of generating parsers for all deterministic context-free grammars.
I know that every deterministic context-free grammar can be parsed by some LR(k) parser. The problem is that I need to generate parsers for grammars of unknown k. So, to handle every deterministic context-free grammar, k would need to be infinite.
I also know that GLR parsers can parse all context-free grammars, deterministic or not. But I need to reject non-deterministic grammars. I'm not sure if GLR can detect that property from an input grammar.
Is there a type of parser generator that can handle all deterministic context-free grammars, while rejecting non-deterministic grammars, without needing a k input? (The only input is the grammar itself)
The problem of “given a CFG, decide whether it’s LR(k) for any k” is, surprisingly, undecidable! This means that it’s not possible for any parser generator to always be able to take an arbitrary grammar and determine which choice of k to use, or even if such a choice of k exists.
In practice, most grammars that we care about are fairly close to LR(1), for some definition of “fairly close,” which is why most parser generators focus on that simpler case.

What is the difference between LALR and LR parsing? [duplicate]

I understand both LR and LALR are bottom-up parsing algorithms, but what's the difference between the two?
What's the difference between LR(0), LALR(1), and LR(1) parsing? How can I tell if a grammar is LR(0), LALR(1), or LR(1)?
At a high level, the difference between LR(0), LALR(1), and LR(1) is the following:
An LALR(1) parser is an "upgraded" version of an LR(0) parser that keeps track of more precise information to disambiguate the grammar. An LR(1) parser is a significantly more powerful parser that keeps track of even more precise information than an LALR(1) parser.
LALR(1) parsers are a constant factor larger than LR(0) parsers, and LR(1) parsers are usually exponentially larger than LALR(1) parsers.
Any grammar that can be parsed with an LR(0) parser can be parsed with an LALR(1) parser and any grammar that can be parsed with an LALR(1) parser can be parsed with an LR(1) parser. There are grammars that are LALR(1) but not LR(0) and LR(1) but not LALR(1).
More formally, an LR(k) parser is a bottom-up parser that works by maintaining a stack of terminals and nonterminals. The parser is controlled by a finite automaton that determines, based on the current state of the parser and the next k tokens of input, whether to shift a new token onto the stack or reduce the top symbols of the stack by applying a production in reverse.
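As an illustration of that shift/reduce loop, here is a sketch of a table-driven parser for the small grammar E → E + T | T, T → id. The tables are ones I worked out by hand for this grammar (it happens to be LR(0), so reductions ignore the lookahead); the state numbers and the names `ACTION`, `REDUCE`, `GOTO` and `lr_parse` are assumptions for the example, not the output of any particular generator.

```python
# Shift and accept actions, indexed by state and next token.
ACTION = {
    0: {"id": ("shift", 3)},
    1: {"+": ("shift", 4), "$": ("accept",)},
    4: {"id": ("shift", 3)},
}
# States that hold a single completed production: (lhs, rhs length, label).
REDUCE = {
    2: ("E", 1, "E -> T"),
    3: ("T", 1, "T -> id"),
    5: ("E", 3, "E -> E + T"),
}
# Where to go after a reduction exposes `state` and produces `lhs`.
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 5}


def lr_parse(tokens):
    """Parse `tokens` and return the list of shift/reduce steps taken."""
    tokens = list(tokens) + ["$"]  # explicit end-of-input marker
    stack = [0]                    # stack of automaton states
    pos = 0
    trace = []
    while True:
        state = stack[-1]
        if state in REDUCE:
            # Apply the production in reverse: pop its right-hand side,
            # then follow the goto transition for its left-hand side.
            lhs, length, label = REDUCE[state]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])
            trace.append("reduce " + label)
            continue
        action = ACTION[state].get(tokens[pos])
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
        if action[0] == "accept":
            return trace
        stack.append(action[1])
        trace.append("shift " + tokens[pos])
        pos += 1


for step in lr_parse(["id", "+", "id"]):
    print(step)
```

For clarity the stack here holds only state numbers; the grammar symbols are implied by the states.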
In order to keep track of enough information to make a determination about whether to shift or reduce, LR(k) parsers have each state correspond to a "configurating set," a set of productions annotated with the following information:
How much of the production has been seen so far, and
What tokens to expect after the production has been completed (the lookahead)
The first of these pieces of information is used to determine whether the parser may need to do a reduction - if none of the productions in a current state have been completed, there's no reason to do a reduction. The second of these pieces of information is used when doing a reduction to determine whether the reduction should be performed. When deciding whether to reduce, an LR(k) parser looks at the next k tokens of the input stream. If they match the lookahead tokens, the parser will reduce; otherwise, it does not perform that reduction.
Problems arise in an LR(k) parser when there are conflicts about what the parser should do in a given state. One type of conflict, a shift/reduce conflict, comes up when the parser is in a state where a production has been completed, but the lookahead symbols for that production are also the next symbols expected by another, uncompleted production in the state. This means that the parser can't tell whether to perform the reduction or not. A second type of conflict is a reduce/reduce conflict, where the parser knows it has to do a reduction, but two or more reductions are possible and it can't tell which to do.
Intuitively, as k gets larger and larger, the parser has more and more precise information available to it to determine when to shift and when to reduce. If a grammar is not LR(0), for example, the parser might have a state where given no lookahead at all it can't determine whether to shift or to reduce. However, that grammar might still be LR(1) because given an extra token of lookahead, it may be able to recognize that it should definitely shift and not reduce or definitely reduce and not shift.
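The LR(0) case can be checked mechanically. The following is a rough sketch, with function names of my own choosing, that builds the LR(0) item sets for a grammar and reports states where a completed production coexists with a possible shift or with another completed production. It assumes the grammar is augmented with S' → start $, so that accepting is just a shift of the end marker.

```python
def closure(items, grammar):
    # Add X -> . rhs for every nonterminal X that appears right after a dot.
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)


def goto(state, symbol, grammar):
    # Move the dot over `symbol` in every item that expects it.
    moved = {(l, r, d + 1) for (l, r, d) in state
             if d < len(r) and r[d] == symbol}
    return closure(moved, grammar)


def lr0_states(grammar, start):
    initial = closure({("S'", (start, "$"), 0)}, grammar)
    states = {initial}
    work = [initial]
    while work:
        state = work.pop()
        for symbol in {r[d] for (l, r, d) in state if d < len(r)}:
            nxt = goto(state, symbol, grammar)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states


def lr0_conflicts(grammar, start):
    conflicts = []
    for state in lr0_states(grammar, start):
        complete = [it for it in state if it[2] == len(it[1])]
        shifts = [it for it in state
                  if it[2] < len(it[1]) and it[1][it[2]] not in grammar]
        if len(complete) > 1:
            conflicts.append("reduce/reduce")
        if complete and shifts:
            conflicts.append("shift/reduce")
    return conflicts


LEFT = {"E": [("E", "+", "T"), ("T",)], "T": [("id",)]}
RIGHT = {"E": [("T", "+", "E"), ("T",)], "T": [("id",)]}
print(lr0_conflicts(LEFT, "E"))   # []
print(lr0_conflicts(RIGHT, "E"))  # ['shift/reduce']
```

In the right-recursive grammar, the state reached after a T contains both E → T · + E and E → T ·, so without lookahead the parser cannot choose. One token of lookahead resolves it: shift on `+`, reduce at end of input.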
The problem with LR(k) parsers is that as k gets larger, the number of states can increase exponentially. Lookahead in LR(k) parsers is handled by building more and more states in the parser to correspond to different combinations of productions and lookaheads, so as the number of possible lookaheads increases so does the number of states. Consequently, LR(1) parsers are commonly too large to be practical, and LR(2) or greater is almost unheard of in practice.
LALR(1) was invented as a compromise between the space efficiency of LR(0) parsers and the expressive power of LR(1) parsers. There are several ways to think about what an LALR(1) parser is. Originally, LALR(1) parsers were specified as a transformation that converts LR(1) automata into smaller automata. Although an LR(1) parser may have many more states than an LR(0) automaton, the only difference is that an LR(1) parser may have multiple copies of any particular state in an LR(0) automaton, each annotated with different lookahead information. An LALR(1) parser can be formed by starting with an LR(1) parser, then combining together all states that have the same "core" (the set of productions and their positions), then aggregating all the lookahead information together. This results in a parser that has the same number of states as an LR(0) parser but retains some amount of information about lookaheads to help avoid LR conflicts.
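The merge-by-core step can be sketched in a few lines. The example uses the textbook grammar S → a A d | b B d | a B e | b A e, A → c, B → c, and only the two LR(1) states reached after reading "a c" and "b c", which I have written out by hand rather than computed. The helper names are hypothetical.

```python
from collections import defaultdict


def merge_by_core(lr1_states):
    # An LR(1) item is (lhs, rhs, dot, lookahead); its core drops the lookahead.
    merged = defaultdict(set)
    for state in lr1_states:
        core = frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)
        merged[core] |= set(state)
    return [frozenset(items) for items in merged.values()]


def reduce_reduce_lookaheads(state):
    # Lookaheads on which more than one completed production could reduce.
    by_lookahead = defaultdict(set)
    for lhs, rhs, dot, lookahead in state:
        if dot == len(rhs):
            by_lookahead[lookahead].add((lhs, rhs))
    return {la for la, prods in by_lookahead.items() if len(prods) > 1}


# Hand-written LR(1) states for the grammar above (an assumption of this sketch).
after_a_c = frozenset({("A", ("c",), 1, "d"), ("B", ("c",), 1, "e")})
after_b_c = frozenset({("A", ("c",), 1, "e"), ("B", ("c",), 1, "d")})

print(reduce_reduce_lookaheads(after_a_c))  # set()
print(reduce_reduce_lookaheads(after_b_c))  # set()

(merged,) = merge_by_core([after_a_c, after_b_c])
print(sorted(reduce_reduce_lookaheads(merged)))  # ['d', 'e']
```

Each LR(1) state is conflict-free on its own, but the two share a core, so LALR(1) merges them and the pooled lookaheads produce a reduce/reduce conflict. This is why that grammar is LR(1) but not LALR(1).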
Another view of LALR(1) grammars uses the "LALR-by-SLR" method. LALR(1) parsers can be constructed by starting with an LR(0) parser for a grammar, then creating a new grammar for the language that annotates nonterminals with information about what states in the LR(0) parser they correspond to. The information about the FOLLOW sets of the nonterminals in that grammar can then be used to compute the lookaheads in the LR(0) parser.
The net result is that
LR(0) parsers are small, but not very expressive.
LALR(1) parsers are slightly larger due to the lookahead information, but very expressive.
LR(1) parsers are huge, but extremely expressive.
As for your second question - how do you determine whether a grammar is LR(1) or LALR(1) - the standard approach is to try to build the parsing automata for the LR(1) parser and LALR(1) parser and check for conflicts. To build the LR(1) parser, you build up the LR(1) configurating sets, then check to see if any of those configurating sets have a shift/reduce conflict or reduce/reduce conflict. To construct an LALR(1) parser, you can either build the LR(1) parser and then condense configurating sets with the same core or can use the LALR-by-SLR method based on the LR(0) parser for the language. More details about how to construct these configurating sets are available in most compilers textbooks. You can also check out the lecture notes from a compilers course I taught in Summer 2012, which cover all of the above parsing methods and a few others.
Hope this helps!
LR(0), SLR(1), LALR(1) parsers all have the same number of states. Minimal LR(1) parsers will have a few more states if the grammar requires it, to avoid reduce-reduce conflicts.
Canonical LR(1) parsers will have many more states, too many for medium or large computer languages.
SLR(1) parser generators build an LR(0) state machine and determine the k=1 lookaheads by examining the grammar (which may report erroneous conflicts).
LALR(1) parser generators build an LR(0) state machine and determine the k=1 lookaheads by examining the LR(0) state machine (which is very complicated).
Canonical LR(1) parser generators build an LR(1) state machine.
Minimal LR(1) parser generators build an LR(1) state machine and merge compatible states during the build process.
The parsing algorithm for a good LALR(1) parser is different in two ways: (1) It should have shift-reduce actions, which reduces the number of states by about 30% and makes the parser faster, and (2) it must do one or more reductions when detecting a syntax error, which makes error recovery more complicated.
The parsing algorithm for a canonical LR(1) parser (1) does not have shift-reduce actions and (2) does not make any reductions when detecting a syntax error, which makes error recovery simpler.
There is another case, called minimal LR(1), which uses the same parsing algorithm and error recovery algorithm as LALR(1). Minimal LR(1) parsers offer the power of LR(1) and their size is almost as small as LALR(1). The LRSTAR Parser Generator creates minimal LR(1) parsers for C++ programmers.

Is there a well-defined and well-reasoned transformation from LL(*) to PEG?

I am in the process of investigating PEG (Parsing Expression Grammar) parsers, and one of the topics I'm looking into is equivalence with other parsing techniques.
I found a good paper about transforming regexes into equivalent PEGs at From Regular Expressions to Parsing Expression Grammars.
I am hoping to find a similar treatment for LL(*) parsers but have as yet come up empty-handed. It seems to me that a lot of the techniques described in that paper are also going to be applicable to the problem of LL(*) transformation; however, I'm not sufficiently steeped in the formalisms to be confident of my own analysis.
Your collective help would be much appreciated!
The Wikipedia article about PEG says it all, I think. PEG does recursive descent by using clause ordering for disambiguation. In theory, the family of languages that can be parsed with recursive descent is the LL family, but, because PEG has unlimited lookahead and no ambiguity, the family is a larger one. It is not known whether it covers every context-free language, though, and PEGs can also recognize some languages that are not context-free.
Every LL(k) grammar can be implemented by a recursive-descent parser with k lookahead, therefore every LL(k) grammar can be transformed to a PEG grammar by ordering the rules so that those which require the longest lookahead are listed first.
This is an LL(k) grammar:
params = expr
params = expr ',' params
To make it a PEG grammar for the same language, the rules must be reordered:
params = expr ',' params
params = expr
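Why the order matters can be seen in a small hand-written sketch of PEG-style ordered choice. Here `expr` is assumed to be a single letter purely to keep the example short, and each function returns the position where its match ends, or None on failure; all names are made up for the illustration.

```python
def parse_expr(text, pos):
    # Assumption for this sketch: an expr is a single letter.
    if pos < len(text) and text[pos].isalpha():
        return pos + 1
    return None


def expr_comma_params(text, pos, params):
    # The sequence: expr ',' params
    end = parse_expr(text, pos)
    if end is None or end >= len(text) or text[end] != ",":
        return None
    return params(text, end + 1)


def params_short_first(text, pos=0):
    # params <- expr / expr ',' params
    # Ordered choice commits to the first alternative that succeeds,
    # so the longer alternative is never tried once expr matches.
    end = parse_expr(text, pos)
    if end is not None:
        return end
    return expr_comma_params(text, pos, params_short_first)


def params_long_first(text, pos=0):
    # params <- expr ',' params / expr
    end = expr_comma_params(text, pos, params_long_first)
    if end is not None:
        return end
    return parse_expr(text, pos)


print(params_short_first("x,y,z"))  # 1: stops after the first expr
print(params_long_first("x,y,z"))   # 5: consumes the whole list
```

With the short alternative first, the parser matches `x` and stops, leaving `,y,z` unconsumed; with the long alternative first it matches the whole list and falls back to a bare `expr` only for the last element.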
