Deterministic Context-Free Grammar versus Context-Free Grammar? - parsing

I'm reading my notes for my comparative languages class and I'm a bit confused...
What is the difference between a context-free grammar and a deterministic context-free grammar? I'm specifically reading about how parsers for CFGs run in O(n^3) while parsers for DCFGs run in O(n), and I don't really understand how the difference in time complexity could be that great (not to mention that I'm still confused about what characteristics make a CFG a DCFG).
Thank you so much in advance!

Conceptually they are quite simple to understand. The context-free grammars are those which can be expressed in BNF. The DCFGs are the subset for which a workable parser can be written.
In writing compilers we are only interested in DCFGs. The reason is that 'deterministic' means roughly that the next rule to be applied at any point in the parse is determined by the input so far plus a finite amount of lookahead. Knuth invented the LR(k) parser back in the 1960s and proved it could handle any DCFG. Since then some refinements, especially LALR(1) and LL(1), have defined subclasses of grammars that can be parsed in limited memory, and techniques by which we can write them.
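The force of 'deterministic' can be seen in a tiny recursive-descent recognizer. This is a sketch in Python for a toy grammar (my own example, not from the post): one token of lookahead uniquely selects the production to apply, so the parser never backtracks and runs in linear time.

```python
# Minimal recursive-descent (LL(1)) recognizer for the toy grammar
#   E -> "int" | "(" E "+" E ")"
# The production to apply is fully determined by one token of lookahead,
# which is what makes the grammar deterministic.

def parse_e(tokens, i):
    """Recognize an E starting at position i; return the next position."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[i] == "int":          # lookahead says: use E -> "int"
        return i + 1
    if tokens[i] == "(":            # lookahead says: use E -> "(" E "+" E ")"
        i = parse_e(tokens, i + 1)
        if tokens[i:i + 1] != ["+"]:
            raise SyntaxError("expected '+'")
        i = parse_e(tokens, i + 1)
        if tokens[i:i + 1] != [")"]:
            raise SyntaxError("expected ')'")
        return i + 1
    raise SyntaxError("unexpected token %r" % tokens[i])

def accepts(tokens):
    try:
        return parse_e(tokens, 0) == len(tokens)
    except SyntaxError:
        return False
```

Because each step commits immediately based on the lookahead, there is never a set of alternative parses to carry along.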
We also have techniques to derive parsers automatically from the BNF, if we know it's one of these grammars. Yacc, Bison and ANTLR are familiar examples.
I've never seen a parser for an NDCFG, but at any point in the parse it would potentially need to consider the whole of the input string and every possible parse that could be applied. It's not hard to see why that would get rather large and slow.
I should point out that many real languages are imperfect, in that they are not entirely context-free, not unambiguous, or otherwise depart from the ideal DCFG. C/C++ is a good example, but there are many others. These languages are usually handled with special-purpose rules such as semantic or syntactic predicates, special-case backtracking, or other 'tricks' with no effect on performance.
The comments point out that certain kinds of NDCFG are common, and that many tools provide ways to handle them. One common problem is ambiguity. It is relatively easy to parse an ambiguous grammar by introducing a simple local semantic rule, but of course this can only ever generate one of the possible parse trees. A generalised parser for NDCFGs would potentially produce all parse trees, and could perhaps allow those trees to be filtered on some arbitrary condition. I don't know of any such parser.
Left recursion is not a marker of nondeterminism. It presents a particular challenge to the design of LL(k) parsers, but no problems for LR(k) parsers.
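To make the LL difficulty concrete, here is a sketch in Python (toy grammar of my own choosing): the left-recursive rule E -> E "+" "n" | "n" would send a naive recursive-descent parser into infinite recursion, but the standard rewrite into a right-recursive "tail" rule recognizes the same language with a simple loop. An LR parser would have accepted the left-recursive form directly.

```python
# Left recursion such as  E -> E "+" "n" | "n"  breaks naive recursive
# descent: parse_e would call itself before consuming any input. The
# standard rewrite is
#   E      -> "n" E_tail
#   E_tail -> "+" "n" E_tail | <empty>
# which an LL parser handles with a loop.

def parse_e(tokens, i):
    """Recognize  n (+ n)*  and return the position after the match."""
    if tokens[i:i + 1] != ["n"]:
        raise SyntaxError("expected 'n'")
    i += 1
    while tokens[i:i + 1] == ["+"]:   # E_tail, expressed iteratively
        if tokens[i + 1:i + 2] != ["n"]:
            raise SyntaxError("expected 'n' after '+'")
        i += 2
    return i

def accepts(tokens):
    try:
        return parse_e(tokens, 0) == len(tokens)
    except SyntaxError:
        return False
```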


Converting grammars to the most efficient parser

Is there an algorithm (or online tool) that converts a given grammar into the most efficient parser possible?
For example: SLR or LR(k), with k >= 0.
For the class of grammars you are discussing (xLR(k)), they are all linear time anyway, and it is impossible to do sublinear time if you have to examine every character.
If you insist on optimizing parse time, you should get a very fast LR parsing engine. LRStar used to be the cat's meow on this topic, but the guy behind it got zero reward from the world of "I want it for free" and pulled all instances of it off the net. You can settle for Bison.
Frankly most of your parsing time will be determined by how fast your parser can process individual characters, e.g., the lexer. Tune that first and you may discover there's no need to tune the parser.
First, let's distinguish LR(k) grammars from LR(k) languages. A grammar might not be LR(1) but, say, LR(2). The language it generates, however, must have an LR(1) grammar -- and for that matter, an LALR(1) grammar. The table size for such a grammar is essentially the same as for SLR(1), and LALR(1) is more powerful (all SLR(1) grammars are LALR(1), but not vice versa). So there is really no reason not to use an LALR(1) parser generator if you want to do LR parsing.
Since parsing represents only a fraction of compilation time in modern compilers, once lexical analysis and code generation (with peephole and global optimizations) are taken into consideration, I would just pick a tool based on its entire set of features. You should also remember that one parser generator may take a little longer than another to analyze a grammar and generate the parsing tables, but once that job is done, the table-driven parsing algorithm that runs in the actual compiler thousands of times should not vary significantly from one parser generator to another.
As for tools for converting arbitrary grammars to LALR(1) (in theory this can be done), you could do a Google search; I have not. But since semantics are tied to the productions, I would want complete control over the grammar being used for parsing and would probably eschew such conversion tools.

Generalized Bottom up Parser Combinators in Haskell

I wonder why there are no generalized parser combinator libraries for bottom-up parsing in Haskell, analogous to the Parsec combinators for top-down parsing.
(I could find some research work from around 2004, but nothing after:
https://haskell-functional-parsing.googlecode.com/files/Ljunglof-2002a.pdf
http://www.di.ubi.pt/~jpf/Site/Publications_files/technicalReport.pdf )
Is there a specific reason this has not been achieved?
This is because of referential transparency. Just as no function can tell the difference between
let x = 1:x
let x = 1:1:1:x
let x = 1:1:1:1:1:1:1:1:1:... -- if this were writeable
no function can tell the difference between a grammar which is a finite graph and a grammar which is an infinite tree. Bottom-up parsing algorithms need to be able to see the grammar as a graph, in order to enumerate all the possible parsing states.
The fact that top-down parsers can see the grammar as an infinite tree allows them to be more powerful, since such a tree can be computationally more complex than any finite graph could be; for example,
numSequence n = string (show n) *> option () (numSequence (n+1))
accepts any finite ascending sequence of numbers starting at n. This has infinitely many different parsing states. (It might be possible to represent this in a context-free way, but it would be tricky and require more understanding of the code than a parsing library is capable of, I think)
A bottom-up combinator library could be written, though it would be a bit ugly, by requiring all parsers to be "labelled" in such a way that
the same label always refers to the same parser, and
there is only a finite set of labels
at which point it begins to look a lot more like a traditional specification of a grammar than a combinatory specification. However, it could still be nice; you would only have to label recursive productions, which would rule out any infinitely-large rules such as numSequence.
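The labelling idea can be sketched in Python for concreteness (an illustrative encoding, not a real library's API): productions live in a finite table keyed by label, and recursion goes through labels rather than through host-language recursion, so the grammar is an inspectable finite graph rather than an opaque infinite tree of closures.

```python
# Productions stored in a finite table keyed by label. Recursion is a
# ("ref", label) marker, so a tool can walk the grammar as a graph --
# impossible if the rules were opaque recursive closures.

# S -> A S | <empty>
# A -> "(" S ")"
grammar = {
    "S": [[("ref", "A"), ("ref", "S")],   # S -> A S
          []],                            # S -> <empty>
    "A": [["(", ("ref", "S"), ")"]],      # A -> ( S )
}

def nonterminals_reachable(grammar, start):
    """Enumerate every labelled rule reachable from `start` -- the kind
    of whole-grammar analysis a bottom-up parser generator needs."""
    seen, todo = set(), [start]
    while todo:
        label = todo.pop()
        if label in seen:
            continue
        seen.add(label)
        for alt in grammar[label]:
            for sym in alt:
                if isinstance(sym, tuple) and sym[0] == "ref":
                    todo.append(sym[1])
    return seen
```

With the labels in place, enumerating parser states, computing nullability, and similar analyses become ordinary graph traversals over a finite structure.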
As luqui's answer indicates, a bottom-up parser combinator library is not realistic. On the chance that someone lands on this page just looking for Haskell's bottom-up parser generator: what you are looking for is called the Happy parser generator. It is like yacc for Haskell.
As luqui said above, Haskell's treatment of recursive parser definitions does not permit the definition of bottom-up parsing libraries. Bottom-up parsing libraries are possible, though, if you represent recursive grammars differently. With apologies for the self-promotion, one (research) parser library that uses such an approach is grammar-combinators. It implements a grammar transformation called the uniform Paull transformation, which can be combined with the top-down parser algorithm to obtain a bottom-up parser for the original grammar.
@luqui essentially says that there are cases in which sharing is unobservable. However, this is not the case in general: many approaches to observable sharing exist. E.g. http://www.ittc.ku.edu/~andygill/papers/reifyGraph.pdf surveys a few different methods for achieving observable sharing and proposes its own new method:
"This looping structure can be used for interpretation, but not for further analysis, pretty printing, or general processing. The challenge here, and the subject of this paper, is how to allow trees extracted from Haskell hosted deep DSLs to have observable back-edges, or more generally, observable sharing. This is a well-understood problem, with a number of standard solutions."
Note that the "ugly" solution of @luqui is mentioned by the paper under the name of explicit labels. The solution proposed by the paper is still "ugly" in that it uses so-called "stable names", but other solutions such as http://www.cs.utexas.edu/~wcook/Drafts/2012/graphs.pdf (which relies on PHOAS) may work.

LR(k) to LR(1) grammar conversion

I am confused by the following quote from Wikipedia:
In other words, if a language was reasonable enough to allow an efficient one-pass parser, it could be described by an LR(k) grammar. And that grammar could always be mechanically transformed into an equivalent (but larger) LR(1) grammar. So an LR(1) parsing method was, in theory, powerful enough to handle any reasonable language. In practice, the natural grammars for many programming languages are close to being LR(1).
This means that a parser generator like bison is very powerful (since it can effectively handle LR(k) grammars), provided one is able to convert an LR(k) grammar to an LR(1) grammar. Do some examples of this exist, or is there a recipe for doing the conversion? I'd like to know because I have a shift/reduce conflict in my grammar; I think it is an LR(2) grammar and I would like to convert it to LR(1). Side question: is C++ an unreasonable language? I've read that bison-generated parsers cannot parse it.
For references on the general purpose algorithm to find a covering LR(1) grammar for an LR(k) grammar, see Real-world LR(k > 1) grammars?
The general purpose algorithm produces quite large grammars; in fact, I'm pretty sure that the resulting PDA is the same size as the LR(k) PDA would be. However, in particular cases it's possible to come up with simpler solutions. The general principle applies, though: you need to defer the shift/reduce decision by unconditionally shifting until the decision can be made with a single lookahead token.
One example: Is C#'s lambda expression grammar LALR(1)?
Without knowing more details about your grammar, I can't really help more than that.
With regard to C++, the things that make it tricky to parse are the preprocessor and some corner cases in parsing (and lexing) template instantiations. The fact that the parse of an expression depends on the "kind" (not the type) of a symbol, in the context in which the symbol occurs, makes precise parsing with bison complicated. [1] "Unreasonable" is a value judgement which I'm not comfortable making; certainly, tool support (like accurate syntax colourizers and tab-completers) would have been simpler with a different grammar, but the evidence is that it is not that hard to write (or even read) good C++ code.
Notes:
[1] The classic tricky parse, which also applies to C, is (a)*b, which is a cast of a dereference if a names a type, and otherwise a multiplication. If you were to write it in the context c/(a)*b, it would be clear that an AST cannot be constructed without knowing whether it's a cast or a product, since that affects the shape of the AST.
A more C++-specific issue is x<y>(z) (or x<y<z>>(3)), which parses (and arguably tokenises) differently depending on whether x names a template or not.
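A sketch of the usual workaround, often called the "lexer hack": the parser (or lexer) consults the symbol table to decide how to read (a)*b. Everything below is illustrative Python, not a real compiler API, and it assumes the input is exactly of that shape.

```python
# The "lexer hack" in miniature: the decision between cast-of-dereference
# and multiplication for the C fragment "(a)*b" depends on whether `a`
# is a known type name -- information that lives in the symbol table,
# not in the grammar.

def classify(expr, typedef_names):
    """Classify a fragment of the exact form '(a)*b' given the set of
    names currently known to be types."""
    name = expr[expr.index("(") + 1:expr.index(")")]
    if name in typedef_names:
        return "cast-of-dereference"   # (T)*b : cast *b to type T
    return "multiplication"            # (a) * b : product

print(classify("(a)*b", typedef_names={"a"}))   # cast-of-dereference
print(classify("(a)*b", typedef_names=set()))   # multiplication
```

The point is that a pure context-free parser cannot make this decision; some channel from semantic analysis back into parsing (or lexing) is required.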

CKY for Parsing Programming Languages

Is it a good idea to use the CKY chart parsing algorithm to parse the syntax of programming languages (knowing that it is mostly used to parse the syntax of natural language)?
CKY can parse any context free language, but the time complexity is not great compared to alternatives. CKY requires the grammar to be in Chomsky Normal Form, which can blow up the size of the grammar and hurt running time too. It's an okay approach for a quick-and-dirty parser, but you'll run into issues when you try to scale up to larger inputs or complex grammars.
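For concreteness, here is a minimal CYK recognizer in Python; the nested loops over span length, span start, and split point are where the O(n^3) comes from. The toy CNF grammar below is my own illustration, not from the question.

```python
# Minimal CYK recognizer for a grammar in Chomsky Normal Form.
# terminal_rules: list of (A, ch) for rules A -> ch
# binary_rules:   list of (A, B, C) for rules A -> B C

def cyk(word, terminal_rules, binary_rules, start):
    n = len(word)
    if n == 0:
        return False
    # table[i][j] = set of nonterminals deriving word[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = {a for (a, t) in terminal_rules if t == ch}
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length - 1
            for k in range(i, j):             # split point
                for (a, b, c) in binary_rules:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(a)
    return start in table[0][n - 1]

# Toy CNF grammar for the language { a^n b^n : n >= 1 }:
#   S -> A X | A B,  X -> S B,  A -> a,  B -> b
binary = [("S", "A", "X"), ("S", "A", "B"), ("X", "S", "B")]
terminal = [("A", "a"), ("B", "b")]
```

Note that the grammar had to be massaged into CNF by hand; doing this mechanically for a full programming-language grammar is exactly the blow-up mentioned above.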
If you're looking for an understandable parsing algorithm that's relatively straightforward to implement, take a look at Parsing Expression Grammars (PEGs). They can recognize a large subset of context-free languages, plus some languages with limited context sensitivity. Once you have a working PEG parser it's easy to add memoization, which gives you a Packrat Parser that runs in linear time. The academic papers on PEGs, Packrat, and this extension to allow left-recursive grammars are all quite understandable.
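A minimal sketch of the packrat idea in Python (toy PEG of my own, not from the answer): a recursive-descent PEG recognizer whose (rule, position) results are memoised, so each pair is computed at most once; that bound on the number of distinct calls is where the linear-time guarantee comes from.

```python
from functools import lru_cache

# Packrat sketch: PEG for nested parentheses,
#   S <- "(" S ")" S / <empty>
# with every (rule, position) result memoised.

def make_parser(text):
    @lru_cache(maxsize=None)          # the packrat memo table
    def s(i):
        """Return the position after the S match starting at i."""
        if i < len(text) and text[i] == "(":
            j = s(i + 1)
            if j < len(text) and text[j] == ")":
                return s(j + 1)
        return i                      # empty alternative always succeeds
    return s

def accepts(text):
    return make_parser(text)(0) == len(text)
```

Without the memoisation the same recognizer can re-parse the same suffix many times; with it, each position is visited a bounded number of times per rule.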

Linear-time general parser

I've been reading Dick Grune's Parsing Techniques, 1st edition, for quite a while now. The book is from the mid-90s, and the author argues that no linear-time general parsing method had been discovered as of then:
"we should like to have a linear-time general parsing method.
Unfortunately no such method has been discovered to date." (p. 76)
Has anyone developed such a method since?
No such method has been devised. As far as I can tell, the CYK algorithm remains the general parsing algorithm with the best worst-case performance (O(n^3)).
GLR parsers are O(n^3) in the worst case, but offer linear performance where the grammar is deterministic. Many grammars have this property, so in effect you get linear-time parsing in practice.
We've managed to build parsers for many real, complicated languages using a GLR parser, even the famously hard-to-parse C++.
Packrat parsing with full memoisation is guaranteed O(n), but there is a relatively large constant factor.
I am developing a parser (JoeSon) written in CoffeeScript, and I believe it is O(n) for most interesting grammars.
I think it is essentially a Packrat parser, but with the ability to bypass the cache for some rules, which I think is necessary to write whitespace-sensitive grammars.
Packrat does not parse all context-free grammars. It has difficulty with counting problems, like the grammar 'S = x S x | x'. But these kinds of grammars are also difficult for humans to parse.
https://github.com/jaekwon/joeson/blob/master/joeson_grammar.coffee
https://github.com/jaekwon/joeson/blob/master/joescript_grammar.coffee
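The counting grammar mentioned above can be seen failing under a direct PEG transcription. The sketch below (illustrative Python) encodes S <- "x" S "x" / "x" using PEG ordered choice: it accepts "xxx" but rejects "xxxxx", even though the CFG derives every odd-length string of x's, because the ordered choice commits greedily and never reconsiders.

```python
# The CFG  S = x S x | x  generates all odd-length strings of x's.
# A direct PEG transcription,  S <- "x" S "x" / "x",  does not: the
# first alternative commits to however much S consumed, with no way
# to backtrack into a shorter S match.

def s(text, i):
    """PEG rule S; return position after the match, or None on failure."""
    if i < len(text) and text[i] == "x":
        j = s(text, i + 1)                       # first alternative: x S x
        if j is not None and j < len(text) and text[j] == "x":
            return j + 1
        return i + 1                             # second alternative: x
    return None

def accepts(text):
    return s(text, 0) == len(text)

print(accepts("xxx"))     # True
print(accepts("xxxxx"))   # False -- though the CFG derives it
```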
