How to construct a parsing table for LL(k > 1)?

On the web, there are a lot of examples showing how to construct parsing tables for an LL(1) parser from the FIRST/FOLLOW sets of a context-free grammar.
But I haven't found anything useful for the k > 1 case. Even Wikipedia gives no information about this.
I expect it must be similar in some way, but pointers to existing research in this area would be very helpful.
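For what it's worth, the construction for k > 1 is analogous to LL(1): instead of single-token FIRST/FOLLOW sets you compute FIRST_k/FOLLOW_k sets of token strings of length up to k, and the table maps a (nonterminal, k-token lookahead) pair to a production (this gives the strong-LL(k) construction; full LL(k) additionally needs context-dependent follow sets). As a rough illustration only, here is a minimal Python sketch of computing FIRST_k by fixed-point iteration with k-bounded concatenation; the grammar encoding and all names are made up for this example:

    # Grammar encoding (illustrative): nonterminal -> list of productions,
    # each production a tuple of symbols; epsilon is the empty tuple.
    # Any symbol that is not a key of the grammar dict is a terminal.

    def k_concat(k, a, b):
        """k-bounded concatenation: join two token strings, truncate to k."""
        return (a + b)[:k]

    def first_k(grammar, k):
        first = {nt: set() for nt in grammar}

        def first_of_string(symbols):
            # FIRST_k of a sentential form: k-bounded concatenation of
            # the FIRST_k sets of its symbols, left to right.
            results = {()}
            for sym in symbols:
                sym_first = first[sym] if sym in grammar else {(sym,)}
                results = {k_concat(k, r, f) for r in results for f in sym_first}
                if all(len(r) >= k for r in results):
                    break   # every string already has k tokens
            return results

        changed = True
        while changed:   # iterate until no FIRST_k set grows (fixed point)
            changed = False
            for nt, prods in grammar.items():
                for prod in prods:
                    new = first_of_string(prod)
                    if not new <= first[nt]:
                        first[nt] |= new
                        changed = True
        return first

    # Example: S -> a S b | epsilon, with k = 2
    g = {"S": [("a", "S", "b"), ()]}
    print(first_k(g, 2))   # {'S': {(), ('a', 'a'), ('a', 'b')}} (order may vary)

The table is then filled by entering production p for nonterminal A under every k-token string in FIRST_k(rhs(p) concatenated with FOLLOW_k(A)), reporting a conflict if two productions claim the same cell, exactly as in the LL(1) case but over token strings instead of single tokens.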

I'm struggling with pretty much the same issues, though I'm building an LR parser, not an LL one. I found a slightly better page than the LL(k) one mentioned by #cakeplus -- http://www.seanerikoconnor.freeservers.com/ComputerScience/Compiler/ParserGeneratorAndParser/QuickReviewOfLRandLALRParsingTheory.html There is also a related paper available for free -- http://ci.nii.ac.jp/naid/110002673618/
However, even those didn't help me much, so I started from the basics myself. If anyone is interested: https://aboutskila.wordpress.com/2013/06/14/lalrk-first-sets/ and the battle will continue :-)

Related

What kind of parser is most frequently used in the real world?

I just learned the theory behind LL(1) parsers, and subsequently I learned that this parsing technique is only suitable for a subset of context-free grammars. LL(1) is predictive and doesn't do any backtracking; I don't know whether the real world needs backtracking capabilities. LL(1) seems like a fair introduction to parsing, but like nothing anyone would use to "get a job done". For example, you have to left-factor your grammar (a rule like A -> a b | a c must be rewritten as A -> a A' with A' -> b | c), and even then LL(1) isn't suited to parse a formatting language like the one Wikipedia uses, where some rules involve repetition, as with headings. (Correct me if I'm wrong!)
But what parsing technique is used in the real world if you want to get things done?
Probably the answer involves some kind of compromise. Speed is not much of an issue yet.
By "real world" abilities I mean that the parser should handle the whole set of context-free grammars, if possible without any special-snowflake treatment like left-factoring. I'm also thinking of a hypothetical situation where I wind up writing back-end code for a game project, and the task happens to be giving the designers a scripting language: what parser would meet the requirements without, on the other hand, being too "esoteric" and "complex", in short, without being overkill?
I hope that suffices to give you an idea of what I mean.
Thank you very much!

Writing a parser for mixed languages

I am trying to write a parser that can analyze mixed languages and generate an AST from them. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a parser beginner. Then I googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: what is the best way to do it?
For example, I have a code file that contains several tags, JS code, and some $CMS_SET(x,y)$ code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a parser based on my grammar that can analyze those mixed-language files and generate an AST from them?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...
This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere: the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree, and your customer always has some twist in his code that you're not prepared for. Finally, there are the glitches related to character-set encodings, variations in newline endings, and preprocessors that complicate the parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to get it right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now-deprecated ANTLR3; there's no assurance these parsers are robust, let alone suited to your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real language matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict" or "eliminating right recursion". The only parser generator that IMHO has stood the test of time is a GLR parser generator (I hear good things about GLL, but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14, and Java 8.
Your second problem will be building ASTs. You can hand-code the AST-building process, but that gets old fast when you have to change the grammars often and/or you have many grammars, as you are effectively contemplating. This you can beat your way through with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any energy into this when building a grammar; to do this, your parser engine has to offer you this help, and none of the mainstream ones do.)
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course your chosen parser isn't designed to do this, so you'll have to twist it. The first hard part is providing a parser with clues that a sublanguage is coming up in the input stream, having it hand off that parsing to the sublanguage parser, and getting it to pass a tree back to be incorporated into the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublanguages are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what do you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
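To make the hand-off idea concrete, here is a hypothetical Python sketch (not from any real tool; Node, parse_js, and the <script> delimiters are all invented for illustration): the host parser scans for a sublanguage delimiter, delegates that region to a sub-parser, and wraps the returned subtree in a marker node so the transition is visible in the combined tree:

    class Node:
        def __init__(self, kind, children=None, text=""):
            self.kind, self.children, self.text = kind, children or [], text

    def parse_js(source):
        # stand-in for a real JavaScript sub-parser
        return Node("js_program", text=source)

    def parse_mixed(source):
        root = Node("document")
        pos = 0
        while pos < len(source):
            start = source.find("<script>", pos)
            if start == -1:
                root.children.append(Node("html_text", text=source[pos:]))
                break
            root.children.append(Node("html_text", text=source[pos:start]))
            end = source.find("</script>", start)
            js = source[start + len("<script>"):end]
            # marker node records the sublanguage transition in the tree
            root.children.append(Node("sublanguage", [parse_js(js)]))
            pos = end + len("</script>")
        return root

    tree = parse_mixed("<p>hi</p><script>var x = 1;</script><p>bye</p>")
    print([c.kind for c in tree.children])
    # ['html_text', 'sublanguage', 'html_text']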
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team has spent over 50 man-years of PhD-level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the language in which you solve all of the above is irrelevant.

Any references for parsing incomplete or incorrect code?

Can anybody point me at references on techniques for parsing code that contains syntax errors, or is missing necessary punctuation, for example?
The application that I'm working on is an IDE, where we'd like to provide features like "jump to definition", auto-complete, and refactoring features, without requiring the source to be syntactically correct at the moment the functions are invoked.
Most parser code I've seen appears to work on the principle of "fail early", rather than focusing on error recovery or parsing partially-complete code.
Have you tried ANTLR?
In "The Definitive ANTLR Reference", section 10.7 Automatic Error Recovery Strategy for 5 pages Terrence talks about this. He references Algorithms + Data Structures = Programs, A Note on Error Recovery in Recursive Descent Parsers, Efficient and Comfortable Error Recovery in Recursive Descent Parsers.
Also see the pages from the web site:
Error reporting and recovery
ANTLR 3.0 Error Reporting and Recovery
Custom Syntax Error Recovery
Also check the ANTLR tag for access to the ANTLR forum, where Terence Parr answers questions. He does answer some questions here as The ANTLR Guy.
Also, the new version, ANTLR 4, is due out, as well as its book.
Sorry to sound like a sales pitch, but I have been using ANTLR for years, because it is used by lots of people, is used in production systems, has a few solid runtimes (Java, C, C#), has a very active community, has a website, has books, is evolving, is maintained, is open source, has a BSD license, is easy to use, and has some GUI tools.
One of the people working on a GUI for ANTLR 4 that has syntax highlight and auto-completion among other useful IDE editing is Sam Harwell. If you can reach him through the ANTLR forum, he might be able to help you out.
I don’t know of any papers or tutorials, but uu-parsinglib is a Haskell parsing library that can recover from syntax errors in a general fashion. If, for example, ; was expected but int was received, the parser can continue as though ; were inserted at that source position.
It’s up to you where the parser will fail and where it will proceed with corrections, and the results will be delivered alongside a set of the errors corrected during parsing. Even if you don’t intend to implement your parsing code in Haskell, an examination of the library may offer you some insight. Or you can write a parser in Haskell and call it from C.
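Even if you don't use Haskell, the insertion-style recovery described above is easy to sketch. Below is a toy recursive-descent parser in Python, a hand-rolled illustration rather than uu-parsinglib's actual mechanism, that records an error and continues as though the missing token had been inserted:

    class RecoveringParser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0
            self.errors = []

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, tok):
            if self.peek() == tok:
                self.pos += 1
            else:
                # pretend `tok` was inserted at this position and move on
                self.errors.append(f"expected {tok!r} before {self.peek()!r}")

        def statement(self):
            # stmt := IDENT '=' NUM ';'
            name = self.peek(); self.pos += 1
            self.expect("=")
            value = self.peek(); self.pos += 1
            self.expect(";")
            return ("assign", name, value)

    p = RecoveringParser(["x", "=", "1", "y", "=", "2", ";"])
    stmts = [p.statement(), p.statement()]
    print(stmts)     # both statements parsed despite the missing ';'
    print(p.errors)  # ["expected ';' before 'y'"]

The key point is that expect() never aborts: the parse always yields a tree plus a list of corrections, which is exactly the shape of result an IDE wants.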
Research on "Island grammars" may interest you. It's been a while since I looked at them, but I believe that they are supposed to reasonably handle cases where there are many chunks of nonsense in the file. I didn't have much luck with CiteSeer (oddly; usually it's pretty good), but Google Scholar found a number of relevant papers. Generating robust parsers using island grammars looks like a good place to start.

Online resources for writing a parser-generator

I want to write a parser-generator for educational purposes, and was wondering if there are some nice online resources or tutorials that explain how to write one. Something on the lines of "Let's Build a Compiler" by Jack Crenshaw.
I want to write the parser generator for LR(1) grammars.
I have a decent understanding of the theory behind generating the action and goto tables, but want some resource which will help me with implementing it.
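For orientation while implementing: once the action and goto tables exist, the table-driven LR driver itself is short. Here is a sketch in Python (chosen for brevity; the structure carries over to C++ or Java) for the toy grammar E -> E + n | n, with tables worked out by hand; a real generator would emit these for you:

    ACTION = {
        (0, "n"): ("shift", 2),
        (1, "+"): ("shift", 3),
        (1, "$"): ("accept", None),
        (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
        (3, "n"): ("shift", 4),
        (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
    }
    GOTO = {(0, "E"): 1}
    # productions: number -> (lhs, length of right-hand side)
    PRODS = {1: ("E", 3), 2: ("E", 1)}   # 1: E -> E + n, 2: E -> n

    def lr_parse(tokens):
        stack = [0]                  # stack of states
        tokens = tokens + ["$"]      # end-of-input marker
        i = 0
        while True:
            act = ACTION.get((stack[-1], tokens[i]))
            if act is None:
                raise SyntaxError(f"unexpected {tokens[i]!r}")
            kind, arg = act
            if kind == "shift":
                stack.append(arg); i += 1
            elif kind == "reduce":
                lhs, length = PRODS[arg]
                del stack[len(stack) - length:]        # pop |rhs| states
                stack.append(GOTO[(stack[-1], lhs)])   # push goto state
            else:                                      # accept
                return True

    print(lr_parse(["n", "+", "n", "+", "n"]))  # True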
Preferred languages are C/C++ and Java, though other languages are OK too.
Thanks.
I agree with others, the Dragon book is good background for LR parsing.
If you are interested in recursive descent parsers, an enormously fun learning experience is this website, which walks you through building a completely self-contained compiler system that can compile itself and other languages:
MetaII Compiler Tutorial
This is all based on an amazing little 10-page technical paper by Val Schorre: META II: A Syntax-Oriented Compiler Writing Language from honest-to-god 1964. I learned how to build compilers from this back in 1970. There's a mind-blowing moment when you finally grok how the compiler can regenerate itself....
I know the website author from my college days, but have nothing to do with the website.
If you want to go the Python route, I would recommend the following.
Text Processing in Python
Pyparsing
I have found both of these to be extremely helpful, and Paul McGuire, the author of pyparsing, is super at helping you out when you run into problems. The book Text Processing in Python is just a handy reference to have at your fingertips, and it helps get you into the right frame of mind when attempting to build a parser.
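As a taste of pyparsing, here is a small example (the grammar itself is made up for illustration) that parses simple "name = number;" assignments:

    from pyparsing import Word, alphas, nums, Suppress, OneOrMore, Group

    ident = Word(alphas)                 # one or more letters
    number = Word(nums)                  # one or more digits
    assignment = Group(ident + Suppress("=") + number + Suppress(";"))
    program = OneOrMore(assignment)

    print(program.parseString("x = 1; y = 2;").asList())
    # [['x', '1'], ['y', '2']]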
I would also point out that an OO language is better suited as a language-parsing engine, because it's extensible, and polymorphism is the right way to do it (IMHO). Looking at the problem in terms of a state machine rather than "look for a semicolon at the end of xyz" will make your parser much more robust in the end.
Hope that Helps!
Not really online, but the Dragon Book has fairly elaborate discussions of LR parsing.
I found it easier to learn to write recursive-descent parsers before learning to write LR parsers. Well, to be honest, after many years of writing parsers, I never found it necessary to write an LR parser.
I've recently written a tutorial at CodeProject called Implementing Programming Language Tools in C# 4.0 which describes recursive descent parsing techniques.

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code doesn't define a formal grammar. I usually hack something together in my language of choice, because that's easy, I know how to do it, and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC), compared with the hacks most programmers use to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles for someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTLR and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
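For reference, a minimal sketch of the Interpreter pattern for a tiny arithmetic language might look like this in Python (class names are illustrative): each grammar rule becomes a class, and evaluation walks the object tree:

    class Num:
        def __init__(self, value): self.value = value
        def interpret(self, env): return self.value

    class Var:
        def __init__(self, name): self.name = name
        def interpret(self, env): return env[self.name]

    class Add:
        def __init__(self, left, right): self.left, self.right = left, right
        def interpret(self, env):
            return self.left.interpret(env) + self.right.interpret(env)

    # represents "x + 2" as an object tree, then evaluates it
    expr = Add(Var("x"), Num(2))
    print(expr.interpret({"x": 40}))  # 42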
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
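The combinator idea transfers outside Haskell, too. Here is the style rendered in plain Python, an illustration of the concept rather than Parsec itself: a parser is a function from an input string to a (result, remaining-input) pair, or None on failure, and combinators build bigger parsers from smaller ones:

    def char(c):
        """Parser that matches the single character c."""
        def p(s):
            return (c, s[1:]) if s.startswith(c) else None
        return p

    def seq(p1, p2):
        """Run p1, then p2 on the rest; succeed only if both succeed."""
        def p(s):
            r1 = p1(s)
            if r1 is None:
                return None
            v1, rest = r1
            r2 = p2(rest)
            if r2 is None:
                return None
            v2, rest2 = r2
            return ((v1, v2), rest2)
        return p

    def alt(p1, p2):
        """Try p1; if it fails, try p2 on the same input."""
        return lambda s: p1(s) or p2(s)

    def many(p1):
        """Apply p1 zero or more times, collecting the results."""
        def p(s):
            out = []
            while (r := p1(s)) is not None:
                v, s = r
                out.append(v)
            return (out, s)
        return p

    ab = seq(char("a"), many(alt(char("b"), char("c"))))
    print(ab("abcbx"))   # (('a', ['b', 'c', 'b']), 'x')
    print(ab("xyz"))     # None

In Haskell, the monadic interface additionally threads results through do-notation, which is what makes Parsec grammars read almost like BNF.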
In Perl, the Parse::RecDescent module is the first place to start. Add "tutorial" to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel, and it is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of your needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler, which can be used to design and implement "low overhead" DSLs very quickly: http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on META II. Yes, 1964. And it is amazing. This is how I learned about compilers back in 1970.
