How many ways are there to build a parser? [closed]

I am learning about ANTLR v4, a parser generator based on the so-called Adaptive LL(*) algorithm. It claims to be a big improvement over the LL(*) algorithm, but I have also heard about other algorithms such as LR.
What are the advantages/limitations of ANTLR's Adaptive LL(*) algorithm (over LR)?

How many contemporary algorithms are there to build a parser?
To start with, one can look at the list of common parser generators.
See the Wikipedia article Comparison of parser generators and look under the heading Parsing algorithm.
ALL(*)
Backtracking Bottom-up
Backtracking LALR(1)
Backtracking LALR(k)
GLR
LALR(1)
LR(1)
IELR(1)
LALR(K)
LR(K)
LL
LL(1)
LL(*)
LL(1), Backtracking, Shunting yard
LL(k) + syntactic and semantic predicates
LL, Backtracking
LR(0)
SLR
Recursive descent
Recursive descent, Backtracking
PEG parser interpreter, Packrat
Packrat (modified)
Packrat
Packrat + Cut + Left Recursion
Packrat (modified), mutating interpreter
2-phase scannerless top-down backtracking + runtime support
Packrat (modified to support left-recursion and resolve grammar ambiguity)
Parsing Machine
Earley
Recursive descent + Pratt
Packrat (modified, partial memoization)
Hybrid recursive descent / operator precedence
Scannerless GLR
runtime-extensible GLR
Scannerless, two phase
Combinators
Earley/combinators
Earley/combinators, infinitary CFGs
delta chain
Besides parser generators, there are other algorithms/means to parse. In particular, Prolog has DCGs (definite clause grammars), and most people who write their first parser from scratch without formal training typically start with recursive descent. There are also chart parsers and left-corner parsers.
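To make the recursive-descent idea concrete, here is a minimal sketch in Python; the toy grammar of nested number lists and all the names are illustrative, not from any particular tool. There is one function per grammar rule, and the call stack does the work of the parse stack.

    import re

    class Parser:
        """Hand-written recursive descent for a toy grammar:
           list  -> '(' items ')'
           items -> value (',' value)*
           value -> NUMBER | list
        """
        def __init__(self, text):
            # Trivial lexer: numbers, parentheses, and commas.
            self.tokens = re.findall(r"\d+|[(),]", text)
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, tok):
            if self.peek() != tok:
                raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
            self.pos += 1

        def parse_list(self):
            self.expect("(")
            items = [self.parse_value()]
            while self.peek() == ",":
                self.expect(",")
                items.append(self.parse_value())
            self.expect(")")
            return items

        def parse_value(self):
            if self.peek() == "(":      # one token of lookahead picks the rule
                return self.parse_list()
            tok = self.peek()
            if tok is not None and tok.isdigit():
                self.pos += 1
                return int(tok)
            raise SyntaxError(f"unexpected token {tok!r}")

    print(Parser("(1, (2, 3), 4)").parse_list())   # -> [1, [2, 3], 4]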
In writing parsers, the first question I always ask myself is how I can make a grammar for the language at the highest possible type in the Chomsky hierarchy, where the lowest is Type-0 and the highest is Type-3.
Almost 90% of the time it is a Type-2 grammar (context-free grammar); for the easier tasks it is a Type-3 grammar (regular grammar). I have experimented with Type-1 grammars (context-sensitive grammars) and even Type-0 grammars (unrestricted grammars).
And what are the advantages/limitations of ANTLR's Adaptive LL(*) algorithm?
See the paper written by Terence Parr, the creator of Adaptive LL(*):
Adaptive LL(*) Parsing: The Power of Dynamic Analysis
In practical terms, Adaptive LL(*) lets you get from a grammar to a working parser faster, because you do not have to understand as much parsing theory: Adaptive LL(*) is, shall I say, nimble enough to sidestep the mines you unknowingly place in the grammar. The price is that some of those mines can lead to inefficiencies in the runtime of the parser.
For most practical programming-language purposes, Adaptive LL(*) is enough. IIRC, Adaptive LL(*) cannot handle Type-0 grammars (unrestricted grammars), which Prolog DCGs can; but as I said, most people and most common programming tasks only need Type-2 or Type-3.
Also, most parser generators target Type-2 grammars, but that does not mean they can't do Type-1 or possibly Type-0. I cannot be more specific, as I do not have practical experience with all of them.
Any time you use a parsing tool or library, there is a learning curve in finding out how to use it and what it can and cannot do.
If you are new to lexing/parsing and really want to understand it, take a course and/or read Compilers: Principles, Techniques, and Tools (2nd Edition).

Related

Removing ambiguity from context free grammars

Given an ambiguous grammar, to remove operator precedence problems we convert the grammar to follow the operator precedence rules. To solve the operator associativity problem, we convert the grammar into a left-recursive or right-recursive form according to the operator it is associated with.
Now, when the computer has to do the parsing, suppose it uses the recursive descent algorithm: does the grammar need to be unambiguous in the first place? Or does the grammar have different requirements depending on the algorithm?
If the grammar is left-recursive, the recursive descent algorithm doesn't terminate. How, then, do I give an unambiguous grammar (with associativity problems solved) to the algorithm as its input?
The grammar must be LL(k) to use the standard efficient recursive descent algorithm with no backtracking. There are standard transformations useful for taking a general LR grammar (basically any grammar parsable by a deterministic stack-based algorithm) to LL(k) form. They include left-recursion elimination and left factoring. These are extensive topics I won't attempt to cover here, but they are covered well in almost any good compiler text, and reasonably well in online notes available through search. Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman is a great reference for this and most other compiler basics.
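As a concrete illustration of left-recursion elimination, here is a sketch in Python on the classic (illustrative) expression grammar, showing how the rewritten rule becomes a loop:

    # The left-recursive rule
    #     E -> E '+' T | T
    # makes parse_e call itself without consuming input, so a naive
    # recursive-descent parser never terminates. The standard rewrite
    #     E  -> T E'
    #     E' -> '+' T E' | epsilon
    # becomes a loop in code, and left associativity is preserved by
    # folding the result as we go.

    def parse_e(toks, pos):
        value, pos = parse_t(toks, pos)               # E -> T E'
        while pos < len(toks) and toks[pos] == "+":   # E' -> '+' T E'
            rhs, pos = parse_t(toks, pos + 1)
            value = value + rhs                       # left-associative fold
        return value, pos                             # E' -> epsilon

    def parse_t(toks, pos):
        # T -> NUMBER, kept trivial for the sketch
        return int(toks[pos]), pos + 1

    print(parse_e("1 + 2 + 3".split(), 0))   # -> (6, 5)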

LALR vs LL parser

I've been using lex/yacc, and now I'm trying to switch to ANTLR. The major concern is that ANTLR is an LL(*) parser, unlike yacc, which is LALR. I'm used to thinking bottom-up, and I don't exactly know what the advantage of LL grammars is. People say that LL grammars are easier to understand and more popular these days. But it seems that LR parsers are more powerful, e.g. LL parsers are incapable of dealing with left recursion, although there seem to be some workarounds.
So the question is what is the advantage of LL grammars over LALR? I'd appreciate it if somebody could give me some examples. Links to useful articles would be great, too.
Thanks for your help in advance!
(I see this is a great resource: What advantages do LL parsers have over LR parsers?, but it would've been better with some examples.)
LR parsers are strictly more powerful than LL parsers, and in addition, LALR parsers can run in O(n) like LL parsers. So you won't find any functional advantages of LL over LR.
Thus, the only advantage of LL is that LR state machines are quite a bit more complex and difficult to understand, and LR parsers themselves are not especially intuitive. On the other hand, LL parser code which is automatically generated can be very easy to understand and debug.
The greatest advantage I see to LL parsers is that they are so easy to understand and implement! You can hand write recursive descent parsers with code that closely matches the grammar.
LR parsers are generally considered more powerful and also much faster, BUT there are a few trade-offs that I know of:
LR parsers can only use synthesized attributes; they cannot pass inherited attributes (see the sketch after this list).
Actions in an LR grammar can cause grammar nondeterminism, but not in LL.
However, you will find that LL(*) parsers are also very powerful.
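To illustrate the first trade-off, here is a sketch in Python with a made-up C-style declaration grammar: in a top-down parser, the type parsed on the left is simply passed down as an argument to each declarator, something a bottom-up parser cannot do because the parent is not built until after its children are reduced.

    def parse_declaration(tokens):
        # Parse "int a , b": the base type is an attribute inherited
        # by every declarator to its right.
        base_type = tokens[0]
        symbols = {}
        pos = 1
        while pos < len(tokens):
            pos = parse_declarator(tokens, pos, base_type, symbols)
            if pos < len(tokens) and tokens[pos] == ",":
                pos += 1
        return symbols

    def parse_declarator(tokens, pos, inherited_type, symbols):
        # 'inherited_type' is the inherited attribute flowing downward.
        symbols[tokens[pos]] = inherited_type
        return pos + 1

    print(parse_declaration(["int", "a", ",", "b"]))   # -> {'a': 'int', 'b': 'int'}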

Performance of parsers: PEG vs LALR(1) or LL(k)

I've seen some claims that optimized PEG parsers in general cannot be faster than optimized LALR(1) or LL(k) parsers. (Of course, performance of parsing would depend on a particular grammar.)
I'd like to know if there are any specific limitations of PEG parsers, either valid in general or for some subsets of PEG grammars, that would make them inferior to LALR(1) or LL(k) performance-wise.
In particular, I'm interested in parser generators, but assume that their output can be tweaked for performance in any particular case. I also assume that parsers are optimized and it is possible to tweak a particular grammar a bit if that's needed to improve performance.
Found a good answer about Packrat vs LALR parsing. Some quotes from it:
L(AL)R parsers are linear time parsers, too. So in theory, neither packrat nor L(AL)R parsers are "faster".
What matters, in practice, of course, is implementation. L(AL)R state transitions can be executed in very few machine instructions ("look token code up in vector, get next state and action") so they can be extremely fast in practice.
An observation: most language front-ends don't spend most of their time "parsing"; rather, they spend a lot of time in lexical analysis. Optimize that ..., and the parser speed won't matter much.
PEG parsers can use unlimited lookahead (while maintaining linear parse time on average, via packrat), unlike (default) LL(k) or LR(k) parsers, which use limited lookahead while maintaining linear parse time.
Lately (2014-2015) ANTLR4 has made extensions to handle arbitrary lookahead (as in PEG) while maintaining linear parse time on average (said to be more efficient than the packrat algorithm); however, this incorporates new extensions and variations of the LL parsing algorithm (not the default LL algorithm).
The packrat parser (and the associated LL and LR parsers) is not necessarily practical, but it provides theoretical bounds on parsing so that comparisons can be made.
But note that unlimited lookahead can be used to parse grammars/languages in linear time (e.g. via packrat or ANTLR) which are not possible to parse via LL(k) or LR(k) even in non-linear time, so it is important to understand what is being compared to what.
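To make the quoted point about L(AL)R speed concrete, here is a sketch of the table-driven loop in Python. The SLR(1) tables are hand-built for the toy grammar S -> '(' S ')' | 'x' and are illustrative, not generated by any tool; the point is that each step really is just a couple of table lookups.

    # ACTION[state][token] is ('s', next_state), ('r', rule), or 'acc';
    # RULES[i] gives (lhs, rhs_length) for the reduce step.
    ACTION = {
        0: {"(": ("s", 2), "x": ("s", 3)},
        1: {"$": "acc"},
        2: {"(": ("s", 2), "x": ("s", 3)},
        3: {")": ("r", 2), "$": ("r", 2)},
        4: {")": ("s", 5)},
        5: {")": ("r", 1), "$": ("r", 1)},
    }
    GOTO = {0: {"S": 1}, 2: {"S": 4}}
    RULES = {1: ("S", 3), 2: ("S", 1)}   # 1: S -> ( S )   2: S -> x

    def parse(tokens):
        stack = [0]                        # stack of states
        tokens = tokens + ["$"]
        i = 0
        while True:
            act = ACTION[stack[-1]].get(tokens[i])
            if act == "acc":
                return True
            if act is None:
                raise SyntaxError(f"unexpected {tokens[i]!r}")
            kind, arg = act
            if kind == "s":                # shift: push state, advance input
                stack.append(arg)
                i += 1
            else:                          # reduce: pop |rhs| states, then goto
                lhs, rhs_len = RULES[arg]
                del stack[-rhs_len:]
                stack.append(GOTO[stack[-1]][lhs])

    print(parse(["(", "(", "x", ")", ")"]))   # -> True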

CKY for Parsing Programming Languages

Is it a good idea to use the CKY chart parsing algorithm to parse the syntax of programming languages (knowing that it is mostly used to parse the syntax of natural language)?
CKY can parse any context-free language, but its time complexity is not great compared to the alternatives. CKY also requires the grammar to be in Chomsky Normal Form, which can blow up the size of the grammar and hurt running time too. It's an okay approach for a quick-and-dirty parser, but you'll run into issues when you try to scale up to larger inputs or complex grammars.
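For reference, here is a minimal CKY recognizer sketch in Python, assuming the grammar is already in Chomsky Normal Form; the triple nesting over span length, start, and split point is where the cubic cost comes from.

    def cky_recognize(words, unary, binary, start="S"):
        # unary:  {terminal: {A, ...}}  for CNF rules A -> terminal
        # binary: {(B, C): {A, ...}}    for CNF rules A -> B C
        # table[i][j] holds the nonterminals deriving the span of
        # length j+1 that starts at position i.
        n = len(words)
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][0] = set(unary.get(w, ()))
        for span in range(2, n + 1):              # span length
            for i in range(n - span + 1):         # span start
                for split in range(1, span):      # split point
                    for B in table[i][split - 1]:
                        for C in table[i + split][span - split - 1]:
                            table[i][span - 1] |= binary.get((B, C), set())
        return start in table[0][n - 1]

    # Toy CNF grammar: S -> A B,  A -> 'a',  B -> 'b'
    print(cky_recognize(["a", "b"],
                        unary={"a": {"A"}, "b": {"B"}},
                        binary={("A", "B"): {"S"}}))   # -> True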
If you're looking for an understandable parsing algorithm that's relatively straightforward to implement, take a look at Parsing Expression Grammars (PEGs). They can recognize a large subset of context-free languages, plus some languages with limited context sensitivity. Once you have a working PEG parser it's easy to add memoization, which gives you a Packrat Parser that runs in linear time. The academic papers on PEGs, Packrat, and this extension to allow left-recursive grammars are all quite understandable.
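The memoization step really is that small. A sketch in Python (the PEG rule and the input are illustrative): cache each (rule, position) result so that no rule is ever evaluated twice at the same position, which is exactly what makes a packrat parser linear-time at the cost of memory.

    from functools import lru_cache

    TEXT = "aaab"   # module-level input so positions alone are cache keys

    @lru_cache(maxsize=None)
    def parse_as(pos):
        # PEG rule  As <- 'a' As / 'a'   (ordered choice, with backtracking)
        if pos < len(TEXT) and TEXT[pos] == "a":
            rest = parse_as(pos + 1)
            if rest is not None:
                return rest        # first alternative matched
            return pos + 1         # backtrack to the second alternative
        return None                # failures are memoized too

    print(parse_as(0))   # -> 3, i.e. matched "aaa"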

Difference between an LL and Recursive Descent parser?

I've recently been trying to teach myself how parsers (for languages/context-free grammars) work, and most of it seems to be making sense, except for one thing. I'm focusing my attention in particular on LL(k) grammars, for which the two main algorithms seem to be the LL parser (using a stack/parse table) and the recursive descent parser (simply using recursion).
As far as I can see, the recursive descent algorithm works on all LL(k) grammars and possibly more, whereas an LL parser works on all LL(k) grammars. A recursive descent parser is clearly much simpler than an LL parser to implement, however (just as an LL one is simpler than an LR one).
So my question is, what are the advantages/problems one might encounter when using either of the algorithms? Why might one ever pick LL over recursive descent, given that it works on the same set of grammars and is trickier to implement?
LL is usually a more efficient parsing technique than recursive-descent. In fact, a naive recursive-descent parser will actually be O(k^n) (where n is the input size) in the worst case. Some techniques such as memoization (which yields a Packrat parser) can improve this as well as extend the class of grammars accepted by the parser, but there is always a space tradeoff. LL parsers are (to my knowledge) always linear time.
On the flip side, you are correct in your intuition that recursive-descent parsers can handle a greater class of grammars than LL. Recursive descent can handle any grammar which is LL(*) (that is, unlimited lookahead) as well as a small set of ambiguous grammars. This is because recursive descent is actually a directly-encoded implementation of PEGs, or Parsing Expression Grammars. Specifically, the disjunctive operator (a | b) is not commutative, meaning that a | b does not equal b | a. A recursive-descent parser will try each alternative in order, so if a matches the input, it will succeed even if b would also have matched. This allows classic "longest match" ambiguities like the dangling else problem to be handled simply by ordering disjunctions correctly.
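A sketch of that dangling-else point, in Python with an illustrative statement grammar: because the longer 'if ... then ... else ...' alternative is tried first, the else always binds to the nearest then, with no ambiguity left in the grammar.

    def parse_stmt(toks, pos):
        # Stmt <- 'if' E 'then' Stmt 'else' Stmt
        #       / 'if' E 'then' Stmt
        #       / 'other'
        p = match(toks, pos, ["if", "E", "then"])
        if p is not None:
            q = parse_stmt(toks, p)
            if q is not None and q < len(toks) and toks[q] == "else":
                r = parse_stmt(toks, q + 1)
                if r is not None:
                    return r               # first alternative wins
            return parse_stmt(toks, p)     # backtrack: 'if' without 'else'
        if pos < len(toks) and toks[pos] == "other":
            return pos + 1
        return None

    def match(toks, pos, expected):
        return pos + len(expected) if toks[pos:pos + len(expected)] == expected else None

    # The 'else' attaches to the *inner* 'if'; all 9 tokens are consumed.
    print(parse_stmt("if E then if E then other else other".split(), 0))   # -> 9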
With all of that said, it is possible to implement an LL(k) parser using recursive-descent so that it runs in linear time. This is done by essentially inlining the predict sets so that each parse routine determines the appropriate production for a given input in constant time. Unfortunately, such a technique eliminates an entire class of grammars from being handled. Once we get into predictive parsing, problems like dangling else are no longer solvable with such ease.
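A sketch of that inlining, in Python with a made-up two-rule grammar: each routine checks the lookahead token against each production's predict set (hard-coded here) and commits in constant time, never re-parsing anything.

    # Stmt -> 'print' Expr      PREDICT = {'print'}
    # Stmt -> 'if' Expr Stmt    PREDICT = {'if'}
    # Expr -> NUMBER            PREDICT = {NUMBER}

    def parse_stmt(toks, pos):
        tok = toks[pos]
        if tok == "print":             # lookahead in PREDICT(Stmt -> 'print' Expr)
            return parse_expr(toks, pos + 1)
        if tok == "if":                # lookahead in PREDICT(Stmt -> 'if' Expr Stmt)
            pos = parse_expr(toks, pos + 1)
            return parse_stmt(toks, pos)
        raise SyntaxError(f"no production predicts lookahead {tok!r}")

    def parse_expr(toks, pos):
        if toks[pos].isdigit():
            return pos + 1
        raise SyntaxError(f"expected a number, got {toks[pos]!r}")

    print(parse_stmt(["if", "1", "print", "2"], 0))   # -> 4 (all tokens consumed)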
As for why LL would be chosen over recursive-descent, it's mainly a question of efficiency and maintainability. Recursive-descent parsers are markedly easier to implement, but they're usually harder to maintain since the grammar they represent does not exist in any declarative form. Most non-trivial parser use-cases employ a parser generator such as ANTLR or Bison. With such tools, it really doesn't matter if the algorithm is directly-encoded recursive-descent or table-driven LL(k).
As a matter of interest, it is also worth looking into recursive-ascent, which is a parsing algorithm directly encoded after the fashion of recursive-descent, but capable of handling any LALR grammar. I would also dig into parser combinators, which are a functional way of composing recursive-descent parsers together.
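For a taste of parser combinators, here is a micro sketch in Python (the combinator names are the conventional ones, not any specific library's API): parsers are functions from (text, pos) to (value, new_pos) or None, and higher-order functions compose them so the code mirrors the grammar declaratively.

    def lit(s):                         # match a literal string
        def p(text, pos):
            return (s, pos + len(s)) if text.startswith(s, pos) else None
        return p

    def alt(*parsers):                  # ordered choice
        def p(text, pos):
            for q in parsers:
                r = q(text, pos)
                if r is not None:
                    return r
            return None
        return p

    def seq(*parsers):                  # sequence, collecting the values
        def p(text, pos):
            values = []
            for q in parsers:
                r = q(text, pos)
                if r is None:
                    return None
                v, pos = r
                values.append(v)
            return values, pos
        return p

    # Balanced parentheses:  S <- '(' S ')' / ''
    def s(text, pos):
        return alt(seq(lit("("), s, lit(")")), lit(""))(text, pos)

    print(s("(())", 0))   # -> (['(', ['(', '', ')'], ')'], 4)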
