How do recursive ascent parsers work? - parsing

How do recursive ascent parsers work? I have written a recursive descent parser myself but I don't understand LR parsers all that well. What I found on Wikipedia has only added to my confusion.
Another question is why recursive ascent parsers aren't used more than their table-based counterparts. It seems that recursive ascent parsers have greater performance overall.

The clasical dragon book explains very well how LR parsers work. There is also Parsing Techniques. A Practical Guide. where you can read about them, if I remember well. The article in wikipedia (at least the introduction) is not right. They were created by Donald Knuth, and he explains them in his The Art of Computer Programming Volume 5. If you understand spanish, there is a complete list of books here posted by me. Not all that books are in spanish, either.
Before to understand how they work, you must understand a few concepts like first, follows and lookahead. Also, I really recommend you to understand the concepts behind LL (descendant) parsers before trying to understand LR (ascendant) parsers.
There are a family of parsers LR, specially LR(K), SLR(K) and LALR(K), where K is how many lookahead they need to work. Yacc supports LALR(1) parsers, but you can make tweaks, not theory based, to make it works with more powerful kind of grammars.
About performance, it depends on the grammar being analyzed. They execute in linear time, but how many space they need depends on how many states do you build for the final parser.

I'm personally having a hard time understanding how a function call can be faster -- much less "significantly faster" than a table lookup. And I suspect that even "significantly faster" is insignificant when compared to everything else that a lexer/parser has to do (primarily reading and tokenizing the file). I looked at the Wikipedia page, but didn't follow the references; did the author actually profile a complete lexer/parser?
More interesting to me is the decline of table-driven parsers with respect to recursive descent. I come from a C background, where yacc (or equivalent) was the parser generator of choice. When I moved to Java, I found one table-driven implementation (JavaCup), and several recursive descent implementations (JavaCC, ANTLR).
I suspect that the answer is similar to the answer of "why Java instead of C": speed of execution isn't as important as speed of development. As noted in the Wikipedia article, table-driven parsers are pretty much impossible to understand from code (back when I was using them, I could follow their actions but would never have been able to reconstruct the grammar from the parser). Recursive descent, by comparison, is very intuitive (which is no doubt why it predates table-driven by some 20 years).

The Wikipedia article on recursive ascent parsing references what appears to be the original paper on the topic ("Very Fast LR Parsing"). Skimming that paper cleared a few things up for me. Things I noticed:
The paper talks about generating assembly code. I wonder if you can do the same things they do if you're generating C or Java code instead; see sections 4 and 5, "Error recovery" and "Stack overflow checking". (I'm not trying to FUD their technique -- it might work out fine -- just saying that it's something you might want to look into before committing.)
They compare their recursive ascent tool to their own table-driven parser. From the description in their results section, it looks like their table-driven parser is "fully interpreted"; it doesn't require any custom generated code. I wonder if there's a middle ground where the overall structure is still table-driven but you generate custom code for certain actions to speed things up.
The paper referenced by the Wikipedia page:
"Very fast LR parsing" (1986)
http://portal.acm.org/citation.cfm?id=13310.13326
Another paper about using code-generation instead of table-interpretation:
"Very fast YACC-compatible parsers (for very little effort)" (1999)
http://www3.interscience.wiley.com/journal/1773/abstract
Also, note that recursive-descent parsing is not the fastest way to parse LL-grammar-based languages:
Difference between an LL and Recursive Descent parser?

Related

Neural Networks For Generating New Programming Language Grammars

I have recently had the need to create an ANTLR language grammar for the purpose of a transpiler (Converting one scripting language to another). It occurs to me that Google Translate does a pretty good job translating natural language. We have all manner of recurrent neural network models, LSTM, and GPT-2 is generating grammatically correct text.
Question: Is there a model sufficient to train on grammar/code example combinations for the purpose of then outputting a new grammar file given an arbitrary example source-code?
I doubt any such model exists.
The main issue is that languages are generated from the grammars and it is next to impossible to convert back due to the infinite number of parser trees (combinations) available for various source-codes.
So in your case, say you train on python code (1000 sample codes), the resultant grammar for training will be the same. So, the model will always generate the same grammar irrespective of the example source code.
If you use training samples from a number of languages, the model still can't generate the grammar as it consists of an infinite number of possibilities.
Your example of Google translate works for real life translation as small errors are acceptable, but these models don't rely on generating the root grammar for each language. There are some tools that can translate programming languages example, but they don't generate the grammar, work based on the grammar.
Update
How to learn grammar from code.
After comparing to some NLP concepts, I have a list of issue that may arise and a way to counter them.
Dealing with variable names, coding structures and tokens.
For understanding the grammar, we'll have to breakdown the code to its bare minimum form. This means understanding what each and every term in the code means. Have a look at this example
The already simple expression is reduced to the parse tree. We can see that the tree breaks down the expression and tags each number as a factor. This is really important to get rid of the human element of the code (such as variable names etc.) and dive into the actual grammar. In NLP this concept is known as Part of Speech tagging. You'll have to develop your own method to do the tagging, by it's easy given that you know grammar for the language.
Understanding the relations
For this, you can tokenize the reduced code and train using a model based on the output you are looking for. In case you want to write code, make use of a n grams model using LSTM like this example. The model will learn the grammar, but extracting it is not a simple task. You'll have to run separate code to try and extract all the possible relations learned by the model.
Example
Code snippet
# Sample code
int a = 1 + 2;
cout<<a;
Tags
# Sample tags and tokens
int a = 1 + 2 ;
[int] [variable] [operator] [factor] [expr] [factor] [end]
Leaving the operator, expr and keywords shouldn't matter if there is enough data present, but they will become a part of the grammar.
This is a sample to help understand my idea. You can improve on this by having a deeper look at the Theory of Computation and understanding the working of the automata and the different grammars.
What you're describing is 'just' learning structure of Context-Free Grammars.
I'm not sure if this approach will actually work for your case, but it's a long-standing problem in NLP: grammar induction for Context-Free Grammars. An example introduction how to tackle this problem using statistical learning methods can be found in Charniak's Statistical Language Learning.
Note that what I described is about CFGs in general, but you might want to check induction for LL grammars, because parser generators mostly use these types of grammars.
I know nothing about ANTLR, but there are pretty good examples of translating natural language e.g. into valid SQL requests: http://nlpprogress.com/english/semantic_parsing.html#sql-parsing.

Generalized Bottom up Parser Combinators in Haskell

I am wondered why there is no generalized parser combinators for Bottom-up parsing in Haskell like a Parsec combinators for top down parsing.
( I could find some research work went during 2004 but nothing after
https://haskell-functional-parsing.googlecode.com/files/Ljunglof-2002a.pdf
http://www.di.ubi.pt/~jpf/Site/Publications_files/technicalReport.pdf )
Is there any specific reason for not achieving it?
This is because of referential transparency. Just as no function can tell the difference between
let x = 1:x
let x = 1:1:1:x
let x = 1:1:1:1:1:1:1:1:1:... -- if this were writeable
no function can tell the difference between a grammar which is a finite graph and a grammar which is an infinite tree. Bottom-up parsing algorithms need to be able to see the grammar as a graph, in order to enumerate all the possible parsing states.
The fact that top-down parsers see their input as infinite trees allows them to be more powerful, since the tree could be computationally more complex than any graph could be; for example,
numSequence n = string (show n) *> option () (numSequence (n+1))
accepts any finite ascending sequence of numbers starting at n. This has infinitely many different parsing states. (It might be possible to represent this in a context-free way, but it would be tricky and require more understanding of the code than a parsing library is capable of, I think)
A bottom up combinator library could be written, though it is a bit ugly, by requiring all parsers to be "labelled" in such a way that
the same label always refers to the same parser, and
there is only a finite set of labels
at which point it begins to look a lot more like a traditional specification of a grammar than a combinatory specification. However, it could still be nice; you would only have to label recursive productions, which would rule out any infinitely-large rules such as numSequence.
As luqui's answer indicates a bottom-up parser combinator library is not a realistic. On the chance that someone gets to this page just looking for haskell's bottom up parser generator, what you are looking for is called the Happy parser generator. It is like yacc for haskell.
As luqui said above: Haskell's treatment of recursive parser definitions does not permit the definition of bottom-up parsing libraries. Bottom-up parsing libraries are possible though if you represent recursive grammars differently. With apologies for the self-promotion, one (research) parser library that uses such an approach is grammar-combinators. It implements a grammar transformation called the uniform Paull transformation that can be combined with the top-down parser algorithm to obtain a bottom-up parser for the original grammar.
#luqui essentially says, that there are cases in which sharing is unobservable. However, it's not the case in general: many approaches to observable sharing exist. E.g. http://www.ittc.ku.edu/~andygill/papers/reifyGraph.pdf mentions a few different methods to achieve observable sharing and proposes its own new method:
This looping structure can be used for interpretation, but not for
further analysis, pretty printing, or general processing. The
challenge here, and the subject of this paper, is how to allow trees
extracted from Haskell hosted deep DSLs to have observable back-edges,
or more generally, observable sharing. This a well-understood problem,
with a number of standard solutions.
Note that the "ugly" solution of #liqui is mentioned by the paper under the name of explicit labels. The solution proposed by the paper is still "ugly" as it uses so called "stable names", but other solutions such as http://www.cs.utexas.edu/~wcook/Drafts/2012/graphs.pdf (which relies on PHOAS) may work.

Deterministic Context-Free Grammar versus Context-Free Grammar?

I'm reading my notes for my comparative languages class and I'm a bit confused...
What is the difference between a context-free grammar and a deterministic context-free grammar? I'm specifically reading about how parsers are O(n^3) for CFGs and compilers are O(n) for DCFGs, and don't really understand how the difference in time complexities could be that great (not to mention I'm still confused about what the characteristics that make a CFG a DCFG).
Thank you so much in advance!
Conceptually they are quite simple to understand. The context free grammars are those which can be expressed in BNF. The DCFGs are the subset for which a workable parser can be written.
In writing compilers we are only interested in DCFGs. The reason is that 'deterministic' means roughly that the next rule to be applied at any point in the parse is determined by the input so far and a finite amount of lookahead. Knuth invented the LR() compiler back in the 1960s and proved it could handle any DCFG. Since then some refinements, especially LALR(1) and LL(1), have defined grammars that can be parsed in limited memory, and techniques by which we can write them.
We also have techniques to derive parsers automatically from the BNF, if we know it's one of these grammars. Yacc, Bison and ANTLR are familiar examples.
I've never seen a parser for a NDCFG, but at any point in the parse it would potentially need to consider the whole of the input string and every possible parse that could be applied. It's not hard to see why that would get rather large and slow.
I should point out that many real languages are imperfect, in that they are not entirely context free, not unambiguous or otherwise depart from the ideal DCFG. C/C++ is a good example, but there are many others. These languages are usually handled by special purpose rules such as semantic or syntactic predicates, special case backtracking or other 'tricks' with no effect on performance.
The comments point out that certain kinds of NDCFG are common and many tools provide a way to handle them. One common problem is ambiguity. It is relatively easy to parse an ambiguous grammar by introducing a simple local semantic rule, but of course this can only ever generate one of the possible parse trees. A generalised parser for NDCFG would potentially produce all parse trees, and could perhaps allow those trees to be filtered on some arbitrary condition. I don't know any of those.
Left recursion is not a feature of NDCFG. It presents a particular challenge to the design of LL() parsers but no problems for LR() parsers.

CKY for Parsing Programming Languages

Is it a good idea to use the CKY chart parsing algorithm to parse the syntax of programming languages (knowing that it is mostly used to parse the syntax of natural language)?
CKY can parse any context free language, but the time complexity is not great compared to alternatives. CKY requires the grammar to be in Chomsky Normal Form, which can blow up the size of the grammar and hurt running time too. It's an okay approach for a quick-and-dirty parser, but you'll run into issues when you try to scale up to larger inputs or complex grammars.
If you're looking for an understandable parsing algorithm that's relatively straightforward to implement, take a look at Parsing Expression Grammars (PEGs). They can recognize a large subset of context-free languages, plus some languages with limited context sensitivity. Once you have a working PEG parser it's easy to add memoization, which gives you a Packrat Parser that runs in linear time. The academic papers on PEGs, Packrat, and this extension to allow left-recursive grammars are all quite understandable.

Linear-time general parser

I've been reading Dick Grune's Parsing techniques 1st edition for quite a while now, the book is from the mid-90's and the author argues that no such parsing method (Linear-time general parsing) has been discovered until the date.
"we should like to have a linear-time general parsing method.
Unfortunately no such method has been discovered to date." pg 76
Has anyone developed such method?
No such method has been devised. As far as I can tell, the CYK algorithm remains the general parsing algorithm with the best worst case performance (O(n3)).
A GLR parsers are O(n^3) in worst case, but offer linear performance where the grammar is deterministic. Many grammars have this property, so in effect you get linear time parsing in practice.
We've managed to build parsers for many real, complicated languages using a GLR parser, even the famously hard to parse C++.
Packrat with a full memoisation is a guaranteed O(n), but there is a relatively big linear multiplier.
I am developing a parser (JoeSon) written in CoffeeScript, and I believe it is O(n) for most interesting grammars.
I think it is essentially a Packrat parser, but with the ability to bypass the cache for some rules, which I think is necessary to write whitespace-sensitive grammars.
Packrat does not parse all context free grammars. It has difficulty with counting problems, like with the grammar ' S = x S x | x '. But these kinds of grammars are also difficult for humans to parse.
https://github.com/jaekwon/joeson/blob/master/joeson_grammar.coffee
https://github.com/jaekwon/joeson/blob/master/joescript_grammar.coffee

Resources