I want to write a parser generator for educational purposes, and was wondering if there are some nice online resources or tutorials that explain how to write one. Something along the lines of "Let's Build a Compiler" by Jack Crenshaw.
I want to write the parser generator for LR(1) grammars.
I have a decent understanding of the theory behind generating the action and goto tables, but want some resource which will help me with implementing it.
Preferred languages are C/C++ and Java, though other languages are OK too.
Thanks.
I agree with others, the Dragon book is good background for LR parsing.
If you are interested in recursive descent parsers, an enormously fun learning experience is this website, which walks you through building a completely self-contained compiler system that can compile itself and other languages:
MetaII Compiler Tutorial
This is all based on an amazing little 10-page technical paper by Val Schorre: META II: A Syntax-Oriented Compiler Writing Language from honest-to-god 1964. I learned how to build compilers from this back in 1970. There's a mind-blowing moment when you finally grok how the compiler can regenerate itself....
I know the website author from my college days, but have nothing to do with the website.
If you wanted to go the python route I would recommend the following.
Text Processing in Python
Pyparsing
I have found both of these to be extremely helpful, and Paul McGuire, the author of pyparsing, is super at helping you out when you run into problems. The book Text Processing in Python is just a handy reference to have at your fingertips and helps get you into the right frame of mind when attempting to build a parser.
I would also point out that an OO language is better suited as a language parsing engine because it's extensible and polymorphism is the right way to do it (IMHO). Looking at the problem in terms of a state machine rather than "Look for a semicolon at the end of xyz" will demonstrate that your parser becomes much more robust in the end.
Hope that Helps!
Not really online, but the Dragon Book has fairly elaborate discussions of LR parsing.
I found it easier to learn to write recursive-descent parsers before learning to write LR parsers. Well, to be honest, after many years of writing parsers, I have never found it necessary to write an LR parser.
I've recently written a tutorial at CodeProject called Implementing Programming Language Tools in C# 4.0 which describes recursive descent parsing techniques.
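To give a feel for the recursive descent technique mentioned above, here is a minimal sketch in Python (not the C# code from the tutorial). The grammar, the `Parser` class and the `evaluate` helper are all made up for illustration; each grammar rule simply becomes one method.

```python
import re

# Hypothetical toy grammar, chosen only for illustration:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """One method per grammar rule; each returns the value of what it parsed."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

def evaluate(text):
    """Parse and evaluate a whole expression, rejecting trailing tokens."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

For example, `evaluate("(1 + 2) * 3")` walks expr, term and factor in turn and returns 9. The loops in `expr` and `term` are what make the operators left-associative.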
I'm learning F# because I'd like to write a lexer and parser. I have a tiny bit of experience with this sort of processing but really need to learn it properly as well as F#.
When learning the lexing/parsing functionality of F#, is studying lex and yacc sufficient?
Or are there some differences that mean code for lex/yacc will not work with fslex and fsyacc?
I personally found these OcamlLex and OcamlYacc tutorials excellent resources to get started -- easy to follow, and you can translate most everything in those tutorials for FsLex/FsYacc almost verbatim.
Well, with lex and yacc, you put C/C++ code in the 'actions', whereas with fslex and fsyacc you put F# code there, but I presume you know this?
I think they are otherwise based on the same (established/ancient) tokenizing and parsing technologies, so the general structure/behavior of the grammar should be similar, if that's what you're after...
I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code doesn't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers use to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing should read? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTLR and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
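As a rough illustration of the Interpreter pattern for a "little language", here is a small Python sketch. The class names (`Number`, `Variable`, `Add`, `Multiply`) are invented for this example; the idea is just that each grammar rule becomes a class with an `interpret` method, and a sentence in the language is a tree of such objects.

```python
from dataclasses import dataclass

class Expr:
    """Base class: every node of the little language can interpret itself."""

    def interpret(self, env):
        raise NotImplementedError

@dataclass
class Number(Expr):
    value: float

    def interpret(self, env):
        return self.value

@dataclass
class Variable(Expr):
    name: str

    def interpret(self, env):
        # Variables are looked up in the environment passed down the tree.
        return env[self.name]

@dataclass
class Add(Expr):
    left: Expr
    right: Expr

    def interpret(self, env):
        return self.left.interpret(env) + self.right.interpret(env)

@dataclass
class Multiply(Expr):
    left: Expr
    right: Expr

    def interpret(self, env):
        return self.left.interpret(env) * self.right.interpret(env)

# The expression x + 2 * y, built by hand as an object tree.
tree = Add(Variable("x"), Multiply(Number(2), Variable("y")))
result = tree.interpret({"x": 1, "y": 3})
```

Note that the pattern only covers evaluation. Something still has to build the tree, either by hand as here or with a parser.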
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
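To show the flavour of the combinator approach without requiring Haskell, here is a minimal sketch in Python. It is not Parsec and not Hutton's code. A parser here is assumed to be a function from `(text, pos)` to either `(value, new_pos)` or `None`, and all the names (`pure`, `bind`, `either`, `many`, `many1`) are my own choices for illustration.

```python
def char(predicate):
    """Parser that consumes one character satisfying predicate."""
    def parse(text, pos):
        if pos < len(text) and predicate(text[pos]):
            return text[pos], pos + 1
        return None
    return parse

def literal(expected):
    return char(lambda c: c == expected)

def pure(value):
    """Succeed with value without consuming input (the monadic 'return')."""
    return lambda text, pos: (value, pos)

def bind(parser, make_next):
    """Run parser, then run the parser that make_next builds from its result."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, next_pos = result
        return make_next(value)(text, next_pos)
    return parse

def either(first, second):
    """Try first; if it fails, try second from the same position."""
    def parse(text, pos):
        result = first(text, pos)
        return result if result is not None else second(text, pos)
    return parse

def many(parser):
    """Zero or more repetitions; parser must consume input when it succeeds."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

def many1(parser):
    """One or more repetitions, built from bind and many."""
    return bind(parser, lambda first:
                bind(many(parser), lambda rest: pure([first] + rest)))

digit = char(lambda c: c in "0123456789")
number = bind(many1(digit), lambda digits: pure(int("".join(digits))))

# A bracketed, comma-separated list of numbers such as "[1,22,333]".
comma_number = bind(literal(","), lambda _: number)
items = either(
    bind(number, lambda first:
         bind(many(comma_number), lambda rest: pure([first] + rest))),
    pure([]),
)
number_list = bind(literal("["), lambda _:
                   bind(items, lambda values:
                        bind(literal("]"), lambda _: pure(values))))
```

Calling `number_list("[1,22,333]", 0)` gives the parsed list together with the position where parsing stopped. Haskell's do-notation makes the nested `bind` calls much more pleasant to read, which is a large part of why the technique feels so natural there.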
In Perl, the Parse::RecDescent module is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar is easier, and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else in the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.
I am teaching (with others) a relatively introductory course in computer science for IT professionals without a background in CS. Since I developed the course materials on automata and grammars, I am also responsible for teaching about compilers and compiler construction.
Years ago, when I studied compilation in college, all our examples came from Lex and Yacc. Are these still in widespread use? Is there something that is more commonly used for Java? The students are proficient in C and Java but have never used parser generators.
Any tips on what to teach would be appreciated.
Antlr is widely used, well documented, and free. It is supported by Ant, and can target Java among many other languages.
I don't use lexer and parser generators. They're simple enough to generate by hand, and are the easiest parts of a compiler to write. Besides, when you build them by hand, you can make them really fast.
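For readers who have not seen a hand-written lexer, here is a minimal sketch in Python of what "building it by hand" can look like. The token kinds (`NUMBER`, `IDENT`, `PUNCT`) and the punctuation set are assumptions made up for this example, and Python is used for readability rather than speed.

```python
def tokenize(source):
    """Return a list of (kind, text) pairs using an explicit character loop."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            # Whitespace separates tokens but produces none.
            i += 1
        elif ch.isdigit():
            start = i
            while i < n and source[i].isdigit():
                i += 1
            tokens.append(("NUMBER", source[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("IDENT", source[start:i]))
        elif ch in "+-*/=();":
            tokens.append(("PUNCT", ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at position {i}")
    return tokens
```

Each branch of the loop is effectively one state of the scanner, so adding a new token kind means adding one more branch.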
It's a pity your students aren't well-versed in C++. Once I came across the Spirit library with its concept of a rich, EBNF-style DSL, I've left Antlr, Lex and Yacc behind! It's much more flexible having the grammar described alongside the code.
Brilliant library, though with an admittedly non-trivial learning curve.
However, without C++, Antlr is probably your best bet.
Lex and Yacc are still in use. One of the newest languages around, F#, has its own versions (fslex, fsyacc -- see here for an example). So I think teaching them is still relevant.
Yacc and all the other LALR(1) parsers date from an era when machine resources were scarce and it was necessary to spend a lot of time engineering the grammar so that you could run a parser at all on a PDP-11 with 64K of RAM. Today it does not make sense to teach a tool like yacc with a terrible human interface and a very limited set of grammars it can use.
I would recommend either one of the PEG-based parsers, such as Rats!, or the GLR parser Elkhound developed by George Necula and Scott McPeak (thanks quark). Sorry I can't recommend a specific tool for Java, but Rats! is good for C.
ANTLR is OK but is too complex for my taste.
PEG parser systems like RATS are simpler than the lex/yacc combo. This may or may not be a plus for your class: is your goal to teach about regular expressions and finite automata, and LR grammars and pushdown automata, etc.? Or do you want the simplest practical compiler frontend tools?
(Since I don't program in Java these days I haven't tried RATS in particular.)
JavaCC is very easy.
In the same file you have the grammar and the token list.
https://javacc.dev.java.net/
I remember using CUP and liking it. Take a look at the CUP Parser Generator for Java.
CUP is maintained at the Technical University of Munich. I believe its primary purpose is to teach students.
It also has a free licensing model.
...Permission to use, copy, modify, and
distribute this software and its
documentation for any purpose and
without fee is hereby granted,
provided that the above copyright
notice appear in all copies and that
both the copyright notice and this
permission notice and warranty
disclaimer appear in supporting
documentation...
You could skip the generator part and have a look at Scala's parser combinators.
Haven't tried it yet, but I found jparsec a few days ago. It is not a parser generator; instead, the parser is built in Java from combinators in an EBNF style.
I like the GOLD Parsing System very much, because it basically generates the tables needed and you then only have to use a (generic) implementation of a processor which uses the table information to process the tokens. This engine (as it is called) is quite easy to write and is basically a pure implementation using the LALR and DFA tables to process the input, and writing such an implementation may be a good exercise to teach those.
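To make the "generic engine driven by tables" idea concrete, here is a minimal sketch in Python of an LR driver loop. It does not read GOLD's table files; the `ACTION` and `GOTO` tables below were worked out by hand for a made-up two-rule grammar, purely for illustration.

```python
# Toy grammar (hypothetical):
#   1: E -> E + n
#   2: E -> n
# Production number -> (left-hand side, length of right-hand side).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1)}

# Hand-built tables for the grammar above; "$" marks end of input.
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    """Run the LR driver; return the production numbers in reduction order."""
    stack = [0]                      # stack of parser states
    stream = list(tokens) + ["$"]    # tokens are assumed not to contain "$"
    pos = 0
    reductions = []
    while True:
        state, lookahead = stack[-1], stream[pos]
        action = ACTION.get((state, lookahead))
        if action is None:
            raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[-length:]                       # pop the handle
            stack.append(GOTO[(stack[-1], lhs)])      # then take the goto
            reductions.append(arg)
        else:  # accept
            return reductions
```

For the input `["n", "+", "n"]` the driver reduces by production 2 and then by production 1. The loop itself knows nothing about the grammar, so swapping in larger tables is enough to parse a different language.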
If you plan to work with Java, JavaCC or ANTLR should suffice. This latter one also supports C and Python. But if you plan to work with C++, maybe you should take a look at Boost::Spirit.
I am currently taking a compilers course which uses Lex and Yacc. I don't really know about any other types out there, but the theory we're learning seems to map pretty well to these tools.
I remember using Bison in one of my compilers classes. We also used flex and YACC.
OCaml has a fantastic set of parser generators. Here are some simple examples.
JavaCC is also quite good.
I would strongly recommend avoiding C (and C++) for this purpose because they are extraordinarily painful in this context.
My knowledge about implementing a parser is a bit rusty.
I have no idea about the current state of research in the area, and could use some links regarding recent advances and their impact on performance.
General resources about writing a parser are also welcome, (tutorials, guides etc.) since much of what I had learned at college I have already forgotten :)
I have the Dragon book, but that's about it.
And does anyone have input on parser generators like ANTLR and their performance (i.e. comparison with other generators)?
edit My main target is RDF/OWL/SKOS in N3 notation.
Mentioning the dragon book and antlr means you've answered your own question.
If you're looking for other parser generators you could also check out boost::spirit (http://spirit.sourceforge.net/).
Depending on what you're trying to achieve you might also want to consider a DSL, which you can either parse yourself or write in a scripting language like boo, ruby, python etc...
Hmm … your request is a bit unspecific. While there are many recent developments in this general area, they're all quite specialized (naturally, since the field has matured). The original parsing approaches haven't really changed, though. You might want to read up on changes in parser creation tools (Antlr, Gold Parser, to name but a few).
You might also want to take a look at SableCC, another parser generator "which generates fully featured object-oriented frameworks for building compilers".
There is some documentation about basic uses here and here. Since you asked about research papers, the master's thesis (1998) of SableCC's main developer is available and explains a little more about SableCC's advantages.
Although the current stable version is 3.2, the development branch v4 is a complete rewrite and should implement features new to parser generators.
If you want to build custom analyzers for complex languages,
consider our DMS Software Reengineering Toolkit.
See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
This provides very strong parsing technology, making it "easy" to define your language
(especially in comparison with most parser generators).
Conventional parser generators may help
with parsing, but they provide zero help in the hard part of the
process, which happens after you can parse the code.
DMS provides a vast amount of machinery to support analyzing and transforming
the code once you have parsed it.
I've been given a job of 'translating' one language into another. The source is too flexible (complex) for a simple line by line approach with regex. Where can I go to learn more about lexical analysis and parsers?
If you want to get "emotional" about the subject, pick up a copy of "The Dragon Book." It is usually the text in a compiler design course. It will definitely meet your need to "learn more about lexical analysis and parsers" as well as a bunch of other fun stuff!
IMH(umble)O, save yourself an arm and/or leg and buy an older edition - it will fill your information desires.
Try ANTLR:
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There's a book for it also.
Niklaus Wirth's book "Compiler Construction" (available as a free PDF)
http://www.google.com/search?q=wirth+compiler+construction
I've recently been working with PLY which is an implementation of lex and yacc in Python. It's quite easy to get started with it and there are some simple examples in the documentation.
Parsing can quickly become a very technical topic and you'll find that you probably won't need to know all the details of the parsing algorithm if you're using a parser builder like PLY.
Lots of people have recommended books. For many these are much more useful in a structured environment with assignments and due dates and so forth. Even if not, having the material presented in a different way can help greatly.
(a) Have you considered going to a school with a decent CS curriculum?
(b) There are lots of online lectures, such as MIT's Open Courseware. Their EE/CS section has many courses that touch on parsing, though I can't see any on parsing per se. It's typically introduced in one of the first theory courses, as language classification and automata are at the heart of much of CS theory.
If you prefer Java-based tools, the Java Compiler Compiler, JavaCC, is a nice parser/scanner. It's config file driven, and will generate Java code that you can include in your program. I haven't used it in a couple of years, though, so I'm not sure how the current version is. You can find out more here: https://javacc.dev.java.net/
Lexing/parsing + type checking + code generation is a great CS exercise. I would recommend it to anyone wanting a solid basis, so I'm all for the Dragon Book.
I found this site helpful:
Lex and YACC primer/HOWTO
The first time I used lex/yacc was for a relatively simple project. This tutorial was all I really needed. When I approached more complex projects later, the familiarity I had from this tutorial and a simple project allowed me to build something fancier.
After taking (quite) a few compilers classes, I've used both The Dragon Book and C&T. I think C&T does a far better job of making compiler construction digestible. Not to take anything away from The Dragon Book, but I think C&T is a far more practical book.
Also, if you like writing in Java, I recommend using JFlex and BYACC/J for your lexing and parsing needs.
Yet another textbook to consider is Programming Language Pragmatics. I prefer it over the Dragon book, but YMMV.
If you're using Perl, yet another tool to consider is Parse::RecDescent.
If you just need to do this translation once and don't know anything about compiler technology, I would suggest that you get as far as you can with some fairly simplistic translations and then fix it up by hand. Yes, it is a lot of work. But it is less work than learning a complex subject and coding up the right solution for one job. That said, you should still learn the subject, but don't let not knowing it be a roadblock to finishing your current project.
Parsing Techniques - A Practical Guide
By Dick Grune and Ceriel J.H. Jacobs
This book (freely available as PDF) gives an extensive overview of different parsing techniques/algorithms. If you really want to understand the different parsing algorithms, this IMO is a better reference than the Dragon Book (as Parsing Techniques focuses entirely on parsing, while the Dragon Book covers parsing only as one - although important - part of the compiler construction process).
flex and bison are the new lex and yacc though. The syntax for BNF is often derided for being a bit obtuse. Some have moved to ANTLR and Ragel for this reason.
If you're not doing much translation, you may want to pull a one-off using multiline regexes with Perl or Ruby. Writing a compatible BNF grammar for an existing language is not a task to be taken lightly.
On the other hand, it is entirely possible to leverage any given language's .l and .y files if they are available as open source. Then, you could construct new code from an existing parse tree.
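For the one-off regex route mentioned above, here is a rough sketch of what such a translation pass might look like, written in Python rather than Perl or Ruby. The source language (`LET`, `PRINT`, `/* ... */` comments) and the `translate` function are entirely made up for illustration; your actual source language will need different rules.

```python
import re

def comment_to_hash(match):
    """Turn one (possibly multi-line) block comment into '#' line comments."""
    lines = match.group(1).strip().splitlines()
    return "\n".join("# " + line.strip() for line in lines)

def translate(source):
    """Translate the hypothetical source language with a few regex passes."""
    # DOTALL lets '.' cross newlines, so one pattern covers multi-line comments.
    out = re.sub(r"/\*(.*?)\*/", comment_to_hash, source, flags=re.DOTALL)
    # MULTILINE makes '^' and '$' match at every line, not just the whole text.
    # 'LET name = value' becomes 'name = value', keeping the indentation.
    out = re.sub(r"^([ \t]*)LET[ \t]+(\w+)[ \t]*=", r"\1\2 =", out,
                 flags=re.MULTILINE)
    # 'PRINT expr' becomes 'print(expr)'.
    out = re.sub(r"^([ \t]*)PRINT[ \t]+(.+?)[ \t]*$", r"\1print(\2)", out,
                 flags=re.MULTILINE)
    return out

example = "/* set up\n   values */\nLET x = 1\nPRINT x + 1\n"
translated = translate(example)
```

This kind of pass has no idea about nesting or string literals (a `/*` inside a quoted string would be mangled), which is roughly the point where a real grammar starts to pay off.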