Writing a parser - In the need of guides and research papers

Writing a parser - In the need of guides and research papers - parsing

My knowledge about implementing a parser is a bit rusty.
I have no idea about the current state of research in the area, and could need some links regarding recent advances and their impact on performance.
General resources about writing a parser are also welcome, (tutorials, guides etc.) since much of what I had learned at college I have already forgotten :)
I have the Dragon book, but that's about it.
And does anyone have input on parser generators like ANTLR and their performance? (ie. comparison with other generators)
edit My main target is RDF/OWL/SKOS in N3 notation.

Mentioning the dragon book and antlr means you've answered your own question.
If you're looking for other parser generators you could also check out boost::spirit (http://spirit.sourceforge.net/).
Depending on what you're trying to achieve you might also want to consider a DSL, which you can either parse yourself or write in a scripting language like boo, ruby, python etc...

Hmm … your request is a bit unspecific. While there are many recent developments in this general area, they're all quite specialized (naturally, since the field has matured). The original parsing approaches haven't really changed, though. You might want to read up on changes in parser creation tools (Antlr, Gold Parser, to name but a few).

You might also want to take a look at SableCC, another parser generator "which generates fully featured object-oriented frameworks for building compilers".
Their is some documentation about basic uses here and here. Since you asked about research papers, SableCC's main developper's master thesis (1998) is available and explains a little more about SableCC advantages.
Although the current stable version is 3.2, the development branch v4 is a complete rewrite and should implement features new to parser generators.

If you want to build custom analyzers for complex languages,
consider our DMS Software Reengineering Toolkit.
See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
This provides very strong parsing technology, making it "easy" to define your language
(especially in comparison with most parser generators).
Conventional parser generators may help
with parsing, but they provide zero help in the hard part of the
process, which happens after you can parse the code.
DMS provides a vast amount of machinery to support analyzing and transforming
the code once your have parsed it.

Related

What front-end can I use with RPython to implement a language?

I've looked high and low for examples of implementing a language using the RPython toolchain, but the only one I've been able to find so far is this one in which the author writes a simple BF interpreter. Because the grammar is so simple, he doesn't need to use a parser/lexer generator. Is there a front-end out there that supports developing a language in RPython?
Thanks!

I'm not aware of any general lexer or parser generator targeting RPython specifically. Some with Python output may work, but I wouldn't bet on it. However, there's a set of parsing tools in rlib.parsing. It seems quite usable. OTOH, there's a warning in the documentation: It's reportedly still in development, experimental, and only used for the Prolog interpreter so far.
Alternatively, you can write the frontend by hand. Lexers can be annoying and unnatural, granted (you may be able to rip out the utility modules for DFAs used by the Python implementation). But parsers are a piece of cake if you know the right algorithms. I'm a huge fan of "Top Down Operator Precedence parsers" a.k.a. "Pratt parsers", which are reasonably simple (recursive descent) but make all expression parsing issues (nesting, precedence, associativity, etc.) a breeze. There's depressingly little information on them, but the few blog posts were sufficient for me:
One by Crockford (wouldn't recommend it though, it throws a whole lot of unrelated stuff into the parser and thus obscures it),
another one at effbot.org (uses Python),
and a third by a sadly even-less-famous guy who's developing a language himself, Robert Nystrom.

Alex Gaynor has ported David Beazley's excellent PLY to RPython. Its documentation is quite good, and he even gave a talk about using it to implement an interpreter at PyCon US 2013.

Can Xtext be used for parsing general purpose programming languages?

I'm currently developing a general-purpose agent-based programming language (its syntaxt will be somewhat inspired by Java, and we are also using object in this language).
Since the beginning of the project we were doubtful about the fact of using ANTLR or Xtext. At that time we found out that Xtext was implementing a subset of the feature of ANTLR. So we decided to use ANLTR for our language losing the possibility to have a full-fledged Eclipse editor for free for our language (such a nice features provided by Xtext).
However, as the best of my knowledge, this summer the Xtext project has done a big step forward. Quoting from the link:
What are the limitations of Xtext?
Sven: You can implement almost any kind of programming language or DSL
with Xtext. There is one exception, that is if you need to use so
called 'Semantic Predicates' which is a rather complicated thing I
don't think is worth being explained here. Very few languages really
need this concept. However the prominent example is C/C++. We want to
look into that topic for the next release.
And that is also reinforced in the Xtext documentation:
What is Xtext? No matter if you want to create a small textual domain-specific language (DSL) or you want to implement a full-blown
general purpose programming language. With Xtext you can create your
very own languages in a snap. Also if you already have an existing
language but it lacks decent tool support, you can use Xtext to create
a sophisticated Eclipse-based development environment providing
editing experience known from modern Java IDEs in a surprisingly short
amount of time. We call Xtext a language development framework.
If Xtext has got rid of its past limitations why is it still not possible to find a complex Xtext grammar for the best known programming languages (Java, C#, etc.)?
On the ANTLR website you can find tons of such grammar examples, for what concerns Xtext instead the only sample I was able to find is the one reported in the documentation. So maybe Xtext is still not mature to be used for implementing a general purpose programming language? I'm a bit worried about this... I would not start to re-write the grammar in Xtext for then to recognize that it was not suited for that.

I think nobody implemented Java or C++ because it is a lot of work (even with Xtext) and the existing tools and compilers are excellent.
However, you could have a look at Xbase and Xtend, which is the expression language we ship with Xtext. It is built with Xtext and is quite a good proof for what you can build with Xtext. We have done that in about 4 person months.
I did a couple of screencasts on Xtend:
http://blog.efftinge.de/2011/03/xtend-screencast-part-1-basics.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-2-switch.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-3-rich-strings-ie.html
Note, that you can simply embed Xbase expressions into your language.

I can't speak for what Xtext is or does well.
I can speak to the problem of developing robust tools for processing real languages, based on our experience with the DMS Software Reengineering Toolkit, which we imagine is a language manipulation framework.
First, parsing of real languages usually involves something messy in lexing and/or parsing, due to the historical ways these languages have evolved. Java is pretty clean. C# has context-dependent keywords and a rudimentary preprocessor sort of like C's. C has a full blown preprocessor. C++ is famously "hard to parse" due to ambiguities in the grammar and shenanigans with template syntax. COBOL is fairly ugly, doesn't have any reference grammars, and comes in a variety of dialects. PHP will turn you to stone if you look at it because it is so poorly defined. (DMS has parsers for all of these, used in anger on real applications).
Yet you can parse all of these with most of the available parsing technologies if you try hard enough, usually by abusing the lexer or the parser to achieve your goals (how the GNU guys abused Bison to parse C++ by tangling lexical analysis with symbol table lookup is a nice ugly case in point). But it takes a lot of effort to get the language details right, and the reference manuals are only close approximations of the truth with respect to what the compilers really accept.
If Xtext has a decent parsing engine, one can likely do this with Xtext. A brief perusal of the Xtext site sounds like the lexers and parsers are fairly decent. I didn't see anything about the "Semantic Predicate"s; we have them in DMS and they are lifesavers in some of the really dark corners of parsing. Even using the really good parsing technology (we use GLR parsers), it would be very hard to parse COBOL data declarations (extracting their nesting structure during the parse) without them.
You have an interesting problem in that your language isn't well defined yet. That will make your initial parsers somewhat messy, and you'll revise them a lot. Here's where strong parsing technology helps you: if you can revise your grammar easily you can focus on what you want your language to look like, rather than focusing on fighting the lexer and parser. The fact that you can change your language definition means in fact that if Xtext has some limitations, you can probably bend your language syntax to match without huge amounts of pain. ANTLR does have the proven ability to parse a language pretty much as you imagine it, modulo the usual amount of parser hacking.
What is never discussed is what else is needed to process a language for real. The first thing you need to be able to do is to construct ASTs, which ANTLR and YACC will help you do; I presume Xtext does also. You also need symbol tables, control and data flow analysis (both local and global), and machinery to transform your language into something else (presumably more executable). Doing just symbol tables you will find surprisingly hard; C++ has several hundred pages of "how to look up an identifier"; Java generics are a lot tougher to get right than you might expect. You might also want to prettyprint the AST back to source code, if you want to offer refactorings. (EDIT: Here both ANTLR and Xtext offer what amounts to text-template driven code generation).
Yet these are complex mechanisms that take as much time, if not more than building the parser. The reason DMS exists isn't because it can parse (we view this just as the ante in a poker game), but because all of this other stuff is very hard and we wanted to amortize the cost of doing it all (DMS has, we think, excellent support for all of these mechanisms but YMMV).
On reading the Xtext overview, it sounds like they have some support for symbol tables but it is unclear what kind of assumption is behind it (e.g., for C++ you have to support multiple inheritance and namespaces).
If you are already started down the ANTLR road and have something running, I'd be tempted to stay the course; I doubt if Xtext will offer you a lot of additional help. If you really really want Xtext's editor, then you can probably switch at the price of restructuring what grammar you have (this is a pretty typical price to pay when changing parsing paradigms). Expect most of your work to appear after you get the parser right, in an ad hoc way. I doubt you will find Xtext or ANTLR much different here.

I guess the most simple answer to your question is: Many general purpose languages can be implemented using Xtext. But since there is no general answer to which parser-capabilities a general purpose languages needs, there is no general answer to your questions.
However, I've got a few pointers:
With Xtext 2.0 (released this summer), Xtext supports syntactic predicates. This is one of the most requested features to handle ambiguous syntax without enabling antlr's backtracking.
You might want to look at the brand-new languages Xbase and Xtend, which are (judging based on their capabilities) general-purpose and which are developed using Xtext. Sven has some nice screen casts in his blog: http://blog.efftinge.de/
Regarding your question why we don't see Xtext-grammars for Java, C++, etc.:
With Xtext, a language is more than just a grammar, so just having a grammar that describes a language's syntax is a good starting point but usually not an artifact valuable enough for shipping. The reason is that with an Xtext-grammar you also define the AST's structure (Abstract Syntax Tree, and an Ecore Model in fact) including true cross references. Since this model is the main internal API of your language people usually spend a lot of thought designing it. Furthermore, to resolve cross references (aka linking) you need to implement scoping (as it is called in Xtext). Without a proper implementation of scoping you can either not have true cross references in your model or you'll get many lining errors.
A guess my point is that creating a grammar + designing the AST model + implementing scoping is just a little more effort that taking a grammar from some language-zoo and translating it to Xtext's syntax.

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?

On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).

Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.

Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.

I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.

In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.

Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)

Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).

Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Is Yacc still used in the industry?

The software base I am developing for uses a signficant amount of yacc which I don't need to deal with. Some times I think it would be helpful in understanding some problems I find but most of the time I can get away with my complete ignorance of yacc.
My question are there enough new projects out there that still use yacc to warrant the time I'll need to learn it?
Edit: Given the response is mostly in favour of learning Yacc, is there a similar language that you would recommend over yacc?

Yes, these tools are worth learning if you ever need to create or modify code that parses a grammar.
For many years the de facto tool for generating code to parse a grammar was yacc, or its GNU cousin, bison.
Lately I've heard there are a couple of new kids on the block, but the principle is the same: you write a declarative grammar in a format that is more or less in Backus-Naur Form (BNF) and yacc/bison/whatever generates some code for you that would be extremely tedious to write by hand.
Also, the principles behind grammars can be very useful to learn even if you don't need to work on such code directly. I haven't worked with parsers much since taking a course on Compiler Design in college, but understanding runtime stacks, lookahead parsers, expression evaluation, and a lot of other related things has helped me immensely to write and debug my code effectively.
edit: Given your followup question about other tools, Yacc/Bison of course are best for C/C++ projects, since they generate C code. There are similar tools for other languages. Not all grammars are equivalent, and some parser generators can only grok grammars of a certain complexity. So you might need to find a tool that can parse your grammar. See http://en.wikipedia.org/wiki/Comparison_of_parser_generators

I don't know about new projects using it but I'm involved in seven different maintenance jobs that use lex and yacc for processing configuration files.
No XML for me, no-sir-ee :-).
Solutions using lex/yacc are a step up from the old configuration files of key=val lines since they allow better hierarchical structures like:
server = "mercury" {
ip = "172.3.5.13"
gateway = "172.3.5.1"
}
server = "venus" {
ip = "172.3.5.21"
gateway = "172.3.5.1"
}
And, yes, I know you can do that with XML, but these are primarily legacy applications written in C and, to be honest, I'd probably use lex/yacc for new (non-Java) jobs as well.
That's because I prefer delivering software on time and budget rather than delivering the greatest new whizz-bang technology - my clients won't pay for my education, they want results first and foremost and I'm already expert at lex/yacc and have all the template code for doing it quickly.

A general rule of thumb: code lasts a long time, so the technologies used in that code last a long time, too. It would take an enormous amount of time to replace the codebase you mention (it took 15 years to build it...), which in turn implies that it will still be around in 5, 10, or more years. (There's even a chance that someone who reads this answer will end up working on it!)
Another rule of thumb: if a general-purpose technology is commonplace enough that you have encountered it already, it's probably commonplace enough that you should familiarize yourself with it, because you'll see it again one day. Who knows: by familiarizing yourself with it, maybe you added a useful tool to your toolbox...
Yacc is one of these technologies: you're probably going to run into it again, it's not that difficult, and the principles you'll learn apply to the whole family of parser constructors.

PEGs are the new hotness, but there are still a ton of projects that use yacc or tools more modern than yacc. I would frown on a new project that chose to use yacc, but for existing projects porting to a more modern tool may not make sense. This makes having rough familiarity with yacc a useful skill.
If you're totally unfamiliar with the topic of parser generators I'd encourage you to learn about one, any one. Many of the concepts are portable between them. Also, it's a useful tool to have in the belt: once you know one you'll understand how they can often be superior compared to regex heavy hand written parsers. If you're already comfortable with the topic of parsers, I wouldn't worry about it. You'll learn yacc if and when you need to in order to get something done.

I work on projects that use Yacc. Not new code - but were they new, they'd still use Yacc or a close relative (Bison, Byacc, ...).
Yes, I regard it as worth learning if you work in C.
Also consider learning ANTLR, or other more modern parser generators. But knowledge of Yacc will stand you in good stead - it will help you learn any other similar tools too, since a lot of the basic theory is similar.

I don't know about yacc/bison specifically, but I have used antlr, cup, jlex and javacc. I thought they would only be of accademic importance, but as it turns out we needed a domain-specific language, and this gave us a much nicer solution than some "simpler" (regex based) parsers out there. Maintenance might be an issue in many environments, though - since most coders these days won't have any experience with parsing tools.

I haven't had the chance to compare it with other parsing systems but I can definitely recommend ANTLR based on my own experience and also with its large and active user base.
Another plus point for ANTLR is ANTLRWorks: The ANTLR GUI Development Environment which is a great help while developing and debugging your grammars. I've yet to see another parsing system which is supported by such an IDE.

We are writing new yacc code at my company for shipping products. Yes, this stuff is still used.

Parsing, where can I learn about it

I've been given a job of 'translating' one language into another. The source is too flexible (complex) for a simple line by line approach with regex. Where can I go to learn more about lexical analysis and parsers?

If you want to get "emotional" about the subject, pick up a copy of "The Dragon Book." It is usually the text in a compiler design course. It will definitely meet your need "learn more about lexical analysis and parsers" as well as a bunch of other fun stuff!
IMH(umble)O, save yourself an arm and/or leg and buy an older edition - it will fill your information desires.

Try ANLTR:
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There's a book for it also.

Niklaus Wirth's book "Compiler Construction" (available as a free PDF)
http://www.google.com/search?q=wirth+compiler+construction

I've recently been working with PLY which is an implementation of lex and yacc in Python. It's quite easy to get started with it and there are some simple examples in the documentation.
Parsing can quickly become a very technical topic and you'll find that you probably won't need to know all the details of the parsing algorithm if you're using a parser builder like PLY.

Lots of people have recommended books. For many these are much more useful in a structured environment with assignments and due dates and so forth. Even if not, having the material presented in a different way can help greatly.
(a) Have you considered going to a school with a decent CS curriculum?
(b) There are lots of online lectures, such as MIT's Open Courseware. Their EE/CS section has many courses that touch on parsing, though I can't see any on parsing per se. It's typically introduced as one of the first theory courses as language classification and automata is at the heart of much of CS theory.

If you prefer Java based tools, the Java Compiler Compiler, JavaCC, is a nice parser/scanner. It's config file driven, and will generate java code that you can include in your program. I haven't used it a couple years though, so I'm not sure how the current version is. You can find out more here: https://javacc.dev.java.net/

Lexing/Parsing + typecheck + code generation is a great CS exercise I would recommend it to anyone wanting a solid basis, so I'm all for the Dragon Book

I found this site helpful:
Lex and YACC primer/HOWTO
The first time I used lex/yacc was for a relatively simple project. This tutorial was all I really needed. When I approached more complex projects later, the familiarity I had from this tutorial and a simple project allowed me to build something fancier.

After taking (quite) a few compilers classes, I've used both The Dragon Book and C&T. I think C&T does a far better job of making compiler construction digestible. Not to take anything away from The Dragon Book, but I think C&T is a far more practical book.
Also, if you like writing in Java, I recommend using JFlex and BYACC/J for your lexing and parsing needs.

Yet another textbook to consider is Programming Language Pragmatics. I prefer it over the Dragon book, but YMMV.
If you're using Perl, yet another tool to consider is Parse::RecDescent.
If you just need to do this translation once and don't know anything about compiler technology, I would suggest that you get as far as you can with some fairly simplistic translations and then fix it up by hand. Yes, it is a lot of work. But it is less work than learning a complex subject and coding up the right solution for one job. That said, you should still learn the subject, but don't let not knowing it be a roadblock to finishing your current project.

Parsing Techniques - A Practical Guide
By Dick Grune and Ceriel J.H. Jacobs
This book (freely available as PDF) gives an extensive overview of different parsing techniques/algorithms. If you really want to understand the different parsing algorithms, this IMO is a better reference than the Dragon Book (as Parsing Techniques focuses entirely on parsing, while the Dragon Book covers parsing only as one - although important - part of the compiler construction process).

flex and bison are the new lex and yacc though. The syntax for BNF is often derided for being a bit obtuse. Some have moved to ANTLR and Ragel for this reason.
If you're not doing much translation, you may one to pull a one-off using multiline regexes with Perl or Ruby. Writing a compatible BNF grammar for an existing language is not a task to be taken lightly.
On the other hand, it is entirely possible to leverage any given language's .l and .y files if they are available as open source. Then, you could construct new code from an existing parse tree.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart