Grammar/own-written parser? - parsing

I'm doing some small projects which involve having different syntaxes for something, however sometimes these syntaxes are so easy that using a parser generator might be overkill.
Now, when should I use a hand-made parser, and when should I use a parser generator?
Thanks,
William van Doorn

There is no hard-and-fast answer, other than "use whatever is easiest for the particular situation".
My experience is that parsers tend to get more complicated over their lifetimes, so using a parser generator up front usually pays off. Even if the language doesn't get more complicated, using a generator forces you to create a formal specification of the syntax, which is itself valuable.
The downsides are that other programmers may not know how to use the generator, so it makes it difficult for others to help out, and it makes your project dependent on that generator.

It's worth coding the parser by hand if, and only if, you're super-keen to have it be extremely fast even on a machine of very modest speed. For example, in this article on the history of Turbo Pascal from before it got its name, you can see how and why the prototype impressed the small (then Danish) firm "Borland" to hire the prototype's author (Anders Hejlsberg), fully develop the compiler, and launch it as its main product, and I quote...:
with no great expectations I hit the
compile key - AND THEN I WAS
COMPLETELY FLOORED! My test program,
that took minutes to compile and link
using Digital Research’s Pascal MT+,
was compiled and running before I
could blink an eye! That was a great
WOW moment!
Turbo Pascal's amazing compile speed -- coming first and foremost from a carefully hand-coded and highly tuned recursive descent parser coded in assembly language -- allowed it to use a very different strategy from most compilers: no separate compilation pass generating object files and libraries, and then a linker to put them together, rather, Turbo Pascal 1.0 was a single-pass compiler that directly turned source code into a single executable binary.
I remember just the same amazing experience on the tiny personal computers of that era (when a Z80, 64K or RAM, and two floppies was a lot;-) -- Turbo Pascal, with its amazing parser and the IDE and everything else, fit comfortably in memory together with a substantial program in both source and compiled form -- no floppies were needed, which meant many orders of magnitude of difference in program turnaround time.
If Hejlsberg had stuck to what was already the traditional wisdom at the time -- always use parser generators -- Turbo Pascal would probably never have emerged as a commercial product, and definitely not achieved the dominance in the Pascal world it enjoyed for years.
Of course, on a typical PC of today, such extreme parsing speed would not be needed for most compilers. Possible exceptions include compilers that must run seamlessly as part of an "interpreter-like" environment (the simple compilers for languages such as Perl and Python are typically hand-coded, to substantial extents, for that reason -- that was an implementation choice that made them viable in the '90s, although today it's not clear it's still needed), or compilers that run on very limited hardware resources, such as smartphones or low-end netbooks.
In the vast majority of cases in which you'll be writing a compiler, none of these performance considerations probably apply, and you'll be happier with a parser generator.

Your question title suggests that using a grammar is optional. It really isn't - even if I was going to implement a tiny language, I'd sketch out a grammar on a single sheet of paper.
As for when to use parser generators, this is really personal preference. Many people believe in hand-writing recursive descent parsers, rather than using the table-driven approach, for example. The important thing is to be comfortable in understanding the capabilities of the generator.
And don't be thinking that using parser generators is somehow the more professional, or even the easier approach. Bjarne Stroustrup when writing the first C++ compiler intended to use recursive descent, but got talked out of it by some keen colleagues at Bell Labs, much to his eventual chagrin. See section 3.3.2 of The Design and Evolution of C++ for more details.

Related

What front-end can I use with RPython to implement a language?

I've looked high and low for examples of implementing a language using the RPython toolchain, but the only one I've been able to find so far is this one in which the author writes a simple BF interpreter. Because the grammar is so simple, he doesn't need to use a parser/lexer generator. Is there a front-end out there that supports developing a language in RPython?
Thanks!
I'm not aware of any general lexer or parser generator targeting RPython specifically. Some with Python output may work, but I wouldn't bet on it. However, there's a set of parsing tools in rlib.parsing. It seems quite usable. OTOH, there's a warning in the documentation: It's reportedly still in development, experimental, and only used for the Prolog interpreter so far.
Alternatively, you can write the frontend by hand. Lexers can be annoying and unnatural, granted (you may be able to rip out the utility modules for DFAs used by the Python implementation). But parsers are a piece of cake if you know the right algorithms. I'm a huge fan of "Top Down Operator Precedence parsers" a.k.a. "Pratt parsers", which are reasonably simple (recursive descent) but make all expression parsing issues (nesting, precedence, associativity, etc.) a breeze. There's depressingly little information on them, but the few blog posts were sufficient for me:
One by Crockford (wouldn't recommend it though, it throws a whole lot of unrelated stuff into the parser and thus obscures it),
another one at effbot.org (uses Python),
and a third by a sadly even-less-famous guy who's developing a language himself, Robert Nystrom.
Alex Gaynor has ported David Beazley's excellent PLY to RPython. Its documentation is quite good, and he even gave a talk about using it to implement an interpreter at PyCon US 2013.

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other langauges; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not-quite-production capable. There's always Bison, which has been around long enough so you'd expect a library of langauge definitions to be collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine but these are used daily with DMS to carry out mass change tasks on large bodies of code.
I don't know of any others where the set of langauge definitions are mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer out the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
Because DMS has consistently-defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same langauges. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of codes in multiple languages that works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these langauage accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

Appropriate uses for yacc/byacc/bison and lex/flex

Most of the posts that I read pertaining to these utilities usually suggest using some other method to obtain the same effect. For example, questions mentioning these tools usual have at least one answer containing some of the following:
Use the boost library (insert appropriate boost library here)
Don't create a DSL use (insert favorite scripting language here)
Antlr is better
Assuming the developer ...
... is comfortable with the C language
... does know at least one scripting
language (e.g., Python, Perl, etc.)
... must write some parsing code in almost
every project worked on
So my questions are:
What are appropriate situations which
are well suited for these utilities?
Are there any (reasonable) situations
where there is not a better
alternative to a problem than yacc
and lex (or derivatives)?
How often in actual parsing problems
can one expect to run into any short
comings in yacc and lex which are
better addressed by more recent
solutions?
For a developer which is not already
familiar with these tools is it worth
it for them to invest time in
learning their syntax/idioms? How do
these compare with other solutions?
The reasons why lex/yacc and derivatives seem so ubiquitous today are that they have been around for much longer than other tools, that they have far more coverage in the literature and that they traditionally came with Unix operating systems. It has very little to do with how they compare to other lexer and parser generator tools.
No matter which tool you pick, there is always going to be a significant learning curve. So once you have used a given tool a few times and become relatively comfortable in its use, you are unlikely to want to incur the extra effort of learning another tool. That's only natural.
Also, in the late 1960s and early 1970s when lex/yacc were created, hardware limitations posed a serious challenge to parsing. The table driven LR parsing method used by Yacc was the most suitable at the time because it could be implemented with a small memory footprint by using a relatively small general program logic and by keeping state in files on tape or disk. Code driven parsing methods such as LL had a larger minimum memory footprint because the parser program's code itself represents the grammar and therefore it needs to fit entirely into RAM to execute and it keeps state on the stack in RAM.
When memory became more plentiful a lot more research went into different parsing methods such as LL and PEG and how to build tools using those methods. This means that many of the alternative tools that have been created after the lex/yacc family use different types of grammars. However, switching grammar types also incurs a significant learning curve. Once you are familiar with one type of grammar, for example LR or LALR grammars, you are less likely to want to switch to a tool that uses a different type of grammar, for example LL grammars.
Overall, the lex/yacc family of tools is generally more rudimentary than more recent arrivals which often have sophisticated user interfaces to graphically visualise grammars and grammar conflicts or even resolve conflicts through automatic refactoring.
So, if you have no prior experience with any parser tools, if you have to learn a new tool anyway, then you should probably look at other factors such as graphical visualisation of grammars and conflicts, auto-refactoring, availability of good documentation, languages in which the generated lexers/parsers can be output etc etc. Don't pick any tool simply because "this is what everybody else seems to be using".
Here are some reasons I could think of for using lex/yacc or flex/bison :
the developer is already familiar with lex/yacc or flex/bison
the developer is most familiar and comfortable with LR/LALR grammars
the developer has plenty of books covering lex/yacc but no books covering others
the developer has a prospective job offer coming up and has been told that lex/yacc skills would increase his chances to get hired
the developer could not get buy-in from project members/stake holders for the use of other tools
the environment has lex/yacc installed and for some reason it is not feasible to install other tools
Whether it's worth learning these tools or not will depend heavily (almost entirely on how much parsing code you write, or how interested you are in writing more code on that general order. I've used them quite a bit, and find them extremely useful.
The tool you use doesn't really make as much difference as many would have you believe. For about 95% of the inputs I've had to deal with, there's little enough difference between one and another that the best choice is simply the one with which I'm most familiar and comfortable.
Of course, lex and yacc produce (and demand that you write your actions in) C (or C++). If you're not comfortable with them, a tool that uses and produces a language you prefer (e.g. Python or Java) will undoubtedly be a much better choice. I, for one, would not advise trying to use a tool like this with a language with which you're unfamiliar or uncomfortable. In particular, if you write code in an action that produces a compiler error, you'll probably get considerably less help from the compiler than usual in tracking down the problem, so you really need to be familiar enough with the language to recognize the problem with only a minimal hint about where compiler noticed something being wrong.
In a previous project, I needed a way to be able to generate queries on arbitrary data in a way that was easy for a relatively non-technical person to be able to use. The data was CRM-type stuff (so First Name, Last Name, Email Address, etc) but it was meant to work against a number of different databases, all with different schemas.
So I developed a little DSL for specifying the queries (e.g. [FirstName]='Joe' AND [LastName]='Bloggs' would select everybody called "Joe Bloggs"). It had some more complicated options, for example there was the "optedout(medium)" syntax which would select all people who had opted-out of receiving messages on a particular medium (email, sms, etc). There was "ingroup(xyz)" which would select everybody in a particular group, etc.
Basically, it allowed us to specify queries like "ingroup('GroupA') and not ingroup('GroupB')" which would be translated to an SQL query like this:
SELECT
*
FROM
Users
WHERE
Users.UserID IN (SELECT UserID FROM GroupMemberships WHERE GroupID=2) AND
Users.UserID NOT IN (SELECT UserID GroupMemberships WHERE GroupID=3)
(As you can see, the queries aren't as effecient as possible, but that's what you get with machine generation, I guess).
I didn't use flex/bison for it, but I did use a parser generator (the name of which has escaped me at the moment...)
I think it's pretty good advice to eschew the creation of new languages just to support a Domain specific language. It's going to be a better use of your time to take an existing language and extend it with domain functionality.
If you are trying to create a new language for some other reason, perhaps for research into language design, then these tools are a bit outdated. Newer generators such as antlr, or even newer implementation languages like ML, make language design a much easier affair.
If there's a good reason to use these tools, it's probably because of their legacy. You might already have a skeleton of a language you need to enhance, which is already implemented in one of these tools. You might also benefit from the huge volumes of tutorial information written about these old tools, for which there is not so great a corpus written for newer and slicker ways of implementing languages.
We have a whole programming language implemented in my office. We use it for that. I think it's meant to be a quick and easy way to write interpreters for things. You could conceivably write almost any sort of text parser using them, but a lot of times it's either A) easier to write it yourself quick or B) you need more flexibility than they provide.

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Is Yacc still used in the industry?

The software base I am developing for uses a signficant amount of yacc which I don't need to deal with. Some times I think it would be helpful in understanding some problems I find but most of the time I can get away with my complete ignorance of yacc.
My question are there enough new projects out there that still use yacc to warrant the time I'll need to learn it?
Edit: Given the response is mostly in favour of learning Yacc, is there a similar language that you would recommend over yacc?
Yes, these tools are worth learning if you ever need to create or modify code that parses a grammar.
For many years the de facto tool for generating code to parse a grammar was yacc, or its GNU cousin, bison.
Lately I've heard there are a couple of new kids on the block, but the principle is the same: you write a declarative grammar in a format that is more or less in Backus-Naur Form (BNF) and yacc/bison/whatever generates some code for you that would be extremely tedious to write by hand.
Also, the principles behind grammars can be very useful to learn even if you don't need to work on such code directly. I haven't worked with parsers much since taking a course on Compiler Design in college, but understanding runtime stacks, lookahead parsers, expression evaluation, and a lot of other related things has helped me immensely to write and debug my code effectively.
edit: Given your followup question about other tools, Yacc/Bison of course are best for C/C++ projects, since they generate C code. There are similar tools for other languages. Not all grammars are equivalent, and some parser generators can only grok grammars of a certain complexity. So you might need to find a tool that can parse your grammar. See http://en.wikipedia.org/wiki/Comparison_of_parser_generators
I don't know about new projects using it but I'm involved in seven different maintenance jobs that use lex and yacc for processing configuration files.
No XML for me, no-sir-ee :-).
Solutions using lex/yacc are a step up from the old configuration files of key=val lines since they allow better hierarchical structures like:
server = "mercury" {
ip = "172.3.5.13"
gateway = "172.3.5.1"
}
server = "venus" {
ip = "172.3.5.21"
gateway = "172.3.5.1"
}
And, yes, I know you can do that with XML, but these are primarily legacy applications written in C and, to be honest, I'd probably use lex/yacc for new (non-Java) jobs as well.
That's because I prefer delivering software on time and budget rather than delivering the greatest new whizz-bang technology - my clients won't pay for my education, they want results first and foremost and I'm already expert at lex/yacc and have all the template code for doing it quickly.
A general rule of thumb: code lasts a long time, so the technologies used in that code last a long time, too. It would take an enormous amount of time to replace the codebase you mention (it took 15 years to build it...), which in turn implies that it will still be around in 5, 10, or more years. (There's even a chance that someone who reads this answer will end up working on it!)
Another rule of thumb: if a general-purpose technology is commonplace enough that you have encountered it already, it's probably commonplace enough that you should familiarize yourself with it, because you'll see it again one day. Who knows: by familiarizing yourself with it, maybe you added a useful tool to your toolbox...
Yacc is one of these technologies: you're probably going to run into it again, it's not that difficult, and the principles you'll learn apply to the whole family of parser constructors.
PEGs are the new hotness, but there are still a ton of projects that use yacc or tools more modern than yacc. I would frown on a new project that chose to use yacc, but for existing projects porting to a more modern tool may not make sense. This makes having rough familiarity with yacc a useful skill.
If you're totally unfamiliar with the topic of parser generators I'd encourage you to learn about one, any one. Many of the concepts are portable between them. Also, it's a useful tool to have in the belt: once you know one you'll understand how they can often be superior compared to regex heavy hand written parsers. If you're already comfortable with the topic of parsers, I wouldn't worry about it. You'll learn yacc if and when you need to in order to get something done.
I work on projects that use Yacc. Not new code - but were they new, they'd still use Yacc or a close relative (Bison, Byacc, ...).
Yes, I regard it as worth learning if you work in C.
Also consider learning ANTLR, or other more modern parser generators. But knowledge of Yacc will stand you in good stead - it will help you learn any other similar tools too, since a lot of the basic theory is similar.
I don't know about yacc/bison specifically, but I have used antlr, cup, jlex and javacc. I thought they would only be of accademic importance, but as it turns out we needed a domain-specific language, and this gave us a much nicer solution than some "simpler" (regex based) parsers out there. Maintenance might be an issue in many environments, though - since most coders these days won't have any experience with parsing tools.
I haven't had the chance to compare it with other parsing systems but I can definitely recommend ANTLR based on my own experience and also with its large and active user base.
Another plus point for ANTLR is ANTLRWorks: The ANTLR GUI Development Environment which is a great help while developing and debugging your grammars. I've yet to see another parsing system which is supported by such an IDE.
We are writing new yacc code at my company for shipping products. Yes, this stuff is still used.

Resources