CKY for Parsing Programming Languages

CKY for Parsing Programming Languages - parsing

Is it a good idea to use the CKY chart parsing algorithm to parse the syntax of programming languages (knowing that it is mostly used to parse the syntax of natural language)?

CKY can parse any context free language, but the time complexity is not great compared to alternatives. CKY requires the grammar to be in Chomsky Normal Form, which can blow up the size of the grammar and hurt running time too. It's an okay approach for a quick-and-dirty parser, but you'll run into issues when you try to scale up to larger inputs or complex grammars.
If you're looking for an understandable parsing algorithm that's relatively straightforward to implement, take a look at Parsing Expression Grammars (PEGs). They can recognize a large subset of context-free languages, plus some languages with limited context sensitivity. Once you have a working PEG parser it's easy to add memoization, which gives you a Packrat Parser that runs in linear time. The academic papers on PEGs, Packrat, and this extension to allow left-recursive grammars are all quite understandable.

Related

Converting grammars to the most efficient parser

Is there an online algorithm which converts certain grammar to the most efficient parser possible?
For example: SLR/LR(k) such as k>=0

For the class of grammars you are discussing (xLR(k)), they are all linear time anyway, and it is impossible to do sublinear time if you have to examine every character.
If you insist on optimizing parse time, you should get a very fast LR parsing engine. LRStar used to be the cat's meow on this topic, but the guy behind it got zero reward from the world of "I want it for free" and pulled all instances of it off the net. You can settle for Bison.
Frankly most of your parsing time will be determined by how fast your parser can process individual characters, e.g., the lexer. Tune that first and you may discover there's no need to tune the parser.

First let's distinguish LR(k) grammars and LR(k) languages. A grammar may not be LR(1), but let's say, for example, LR(2). But the language it generates must have an LR(1) grammar -- and for that matter, it must have an LALR(1) grammar. The table size for such a grammar is essentially the same as for SLR(1) and is more powerful (all SLR(1) grammars are LALR(1) but not vice versa). So, there is really no reason not to use an LALR(1) parser generator if you are looking to do LR parsing.
Since parsing represents only a fraction of the compilation time in modern compilers when lexical analysis and code generation that contains peephole and global optimizations are taken into consideration, I would just pick a tool considering its entire set of features. You should also remember that one parser generator may take a little longer than another to analyze a grammar and to generate the parsing tables. But once that job is done, the table-driven parsing algorithm that will run in the actual compiler thousands of times should not vary significantly from one parser generator to another.
As far as tools for converting arbitrary grammars to LALR(1), for example (in theory this can be done), you could do a Google search (I have not done this). But since semantics are tied to the productions, I would want to have complete control over the grammar being used for parsing and would probably eschew such conversion tools.

Deterministic Context-Free Grammar versus Context-Free Grammar?

I'm reading my notes for my comparative languages class and I'm a bit confused...
What is the difference between a context-free grammar and a deterministic context-free grammar? I'm specifically reading about how parsers are O(n^3) for CFGs and compilers are O(n) for DCFGs, and don't really understand how the difference in time complexities could be that great (not to mention I'm still confused about what the characteristics that make a CFG a DCFG).
Thank you so much in advance!

Conceptually they are quite simple to understand. The context free grammars are those which can be expressed in BNF. The DCFGs are the subset for which a workable parser can be written.
In writing compilers we are only interested in DCFGs. The reason is that 'deterministic' means roughly that the next rule to be applied at any point in the parse is determined by the input so far and a finite amount of lookahead. Knuth invented the LR() compiler back in the 1960s and proved it could handle any DCFG. Since then some refinements, especially LALR(1) and LL(1), have defined grammars that can be parsed in limited memory, and techniques by which we can write them.
We also have techniques to derive parsers automatically from the BNF, if we know it's one of these grammars. Yacc, Bison and ANTLR are familiar examples.
I've never seen a parser for a NDCFG, but at any point in the parse it would potentially need to consider the whole of the input string and every possible parse that could be applied. It's not hard to see why that would get rather large and slow.
I should point out that many real languages are imperfect, in that they are not entirely context free, not unambiguous or otherwise depart from the ideal DCFG. C/C++ is a good example, but there are many others. These languages are usually handled by special purpose rules such as semantic or syntactic predicates, special case backtracking or other 'tricks' with no effect on performance.
The comments point out that certain kinds of NDCFG are common and many tools provide a way to handle them. One common problem is ambiguity. It is relatively easy to parse an ambiguous grammar by introducing a simple local semantic rule, but of course this can only ever generate one of the possible parse trees. A generalised parser for NDCFG would potentially produce all parse trees, and could perhaps allow those trees to be filtered on some arbitrary condition. I don't know any of those.
Left recursion is not a feature of NDCFG. It presents a particular challenge to the design of LL() parsers but no problems for LR() parsers.

Performance of parsers: PEG vs LALR(1) or LL(k)

I've seen some claims that optimized PEG parsers in general cannot be faster than optimized LALR(1) or LL(k) parsers. (Of course, performance of parsing would depend on a particular grammar.)
I'd like to know if there are any specific limitations of PEG parsers, either valid in general or for some subsets of PEG grammars that would make them inferior to LALR(1) or
LL(k) performance-wise.
In particular, I'm interested in parser generators, but assume that their output can be tweaked for performance in any particular case. I also assume that parsers are optimized and it is possible to tweak a particular grammar a bit if that's needed to improve performance.

Found a good answer about Packrat vs LALR parsing. Some quotes from it:
L(AL)R parsers are linear time parsers, too. So in theory, neither packrat nor L(AL)R parsers are "faster".
What matters, in practice, of course, is implementation. L(AL)R state transitions can be executed in very few machine instructions ("look token code up in vector, get next state and action") so they can be extremely fast in practice.
An observation: most language front-ends don't spend most of their time "parsing"; rather, they spend a lot of time in lexical analysis. Optimize that ..., and the parser speed won't matter much.

PEG parsers can use unlimited lookahead (while maintaining linear parse time on average, via packrat) unlike (default) LL(k), or LR(k) parsers which use limited lookahead, while maintining linear parse time.
Lately (2014-2015) ANTLR4 has made extensions to handle arbitrary lookahead (as in PEG) while maintaining linear parse time on average (said to be more efficient than packrat algorithm), however this is incorporates new extensions and variations of the LR parsing algorithm (and not the default LR algorithm).
The packrat parser (and associated parsers for LL, LR) is not necesarily practical, but provides theoretical bounds on parsing so comparison can be made.
But note that unlimited lookahead can be used to parse grammars/languages in linear time (e.g via packrat or antlr) which are not possible to parse via LL(k) or LR(k) even in non-linear time, So it is important to understand what is compared to what.

Linear-time general parser

I've been reading Dick Grune's Parsing techniques 1st edition for quite a while now, the book is from the mid-90's and the author argues that no such parsing method (Linear-time general parsing) has been discovered until the date.
"we should like to have a linear-time general parsing method.
Unfortunately no such method has been discovered to date." pg 76
Has anyone developed such method?

No such method has been devised. As far as I can tell, the CYK algorithm remains the general parsing algorithm with the best worst case performance (O(n3)).

A GLR parsers are O(n^3) in worst case, but offer linear performance where the grammar is deterministic. Many grammars have this property, so in effect you get linear time parsing in practice.
We've managed to build parsers for many real, complicated languages using a GLR parser, even the famously hard to parse C++.

Packrat with a full memoisation is a guaranteed O(n), but there is a relatively big linear multiplier.

I am developing a parser (JoeSon) written in CoffeeScript, and I believe it is O(n) for most interesting grammars.
I think it is essentially a Packrat parser, but with the ability to bypass the cache for some rules, which I think is necessary to write whitespace-sensitive grammars.
Packrat does not parse all context free grammars. It has difficulty with counting problems, like with the grammar ' S = x S x | x '. But these kinds of grammars are also difficult for humans to parse.
https://github.com/jaekwon/joeson/blob/master/joeson_grammar.coffee
https://github.com/jaekwon/joeson/blob/master/joescript_grammar.coffee

How do recursive ascent parsers work?

How do recursive ascent parsers work? I have written a recursive descent parser myself but I don't understand LR parsers all that well. What I found on Wikipedia has only added to my confusion.
Another question is why recursive ascent parsers aren't used more than their table-based counterparts. It seems that recursive ascent parsers have greater performance overall.

The clasical dragon book explains very well how LR parsers work. There is also Parsing Techniques. A Practical Guide. where you can read about them, if I remember well. The article in wikipedia (at least the introduction) is not right. They were created by Donald Knuth, and he explains them in his The Art of Computer Programming Volume 5. If you understand spanish, there is a complete list of books here posted by me. Not all that books are in spanish, either.
Before to understand how they work, you must understand a few concepts like first, follows and lookahead. Also, I really recommend you to understand the concepts behind LL (descendant) parsers before trying to understand LR (ascendant) parsers.
There are a family of parsers LR, specially LR(K), SLR(K) and LALR(K), where K is how many lookahead they need to work. Yacc supports LALR(1) parsers, but you can make tweaks, not theory based, to make it works with more powerful kind of grammars.
About performance, it depends on the grammar being analyzed. They execute in linear time, but how many space they need depends on how many states do you build for the final parser.

I'm personally having a hard time understanding how a function call can be faster -- much less "significantly faster" than a table lookup. And I suspect that even "significantly faster" is insignificant when compared to everything else that a lexer/parser has to do (primarily reading and tokenizing the file). I looked at the Wikipedia page, but didn't follow the references; did the author actually profile a complete lexer/parser?
More interesting to me is the decline of table-driven parsers with respect to recursive descent. I come from a C background, where yacc (or equivalent) was the parser generator of choice. When I moved to Java, I found one table-driven implementation (JavaCup), and several recursive descent implementations (JavaCC, ANTLR).
I suspect that the answer is similar to the answer of "why Java instead of C": speed of execution isn't as important as speed of development. As noted in the Wikipedia article, table-driven parsers are pretty much impossible to understand from code (back when I was using them, I could follow their actions but would never have been able to reconstruct the grammar from the parser). Recursive descent, by comparison, is very intuitive (which is no doubt why it predates table-driven by some 20 years).

The Wikipedia article on recursive ascent parsing references what appears to be the original paper on the topic ("Very Fast LR Parsing"). Skimming that paper cleared a few things up for me. Things I noticed:
The paper talks about generating assembly code. I wonder if you can do the same things they do if you're generating C or Java code instead; see sections 4 and 5, "Error recovery" and "Stack overflow checking". (I'm not trying to FUD their technique -- it might work out fine -- just saying that it's something you might want to look into before committing.)
They compare their recursive ascent tool to their own table-driven parser. From the description in their results section, it looks like their table-driven parser is "fully interpreted"; it doesn't require any custom generated code. I wonder if there's a middle ground where the overall structure is still table-driven but you generate custom code for certain actions to speed things up.
The paper referenced by the Wikipedia page:
"Very fast LR parsing" (1986)
http://portal.acm.org/citation.cfm?id=13310.13326
Another paper about using code-generation instead of table-interpretation:
"Very fast YACC-compatible parsers (for very little effort)" (1999)
http://www3.interscience.wiley.com/journal/1773/abstract
Also, note that recursive-descent parsing is not the fastest way to parse LL-grammar-based languages:
Difference between an LL and Recursive Descent parser?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart