Writing a Parser (for a markup language): Theory & Practice

Writing a Parser (for a markup language): Theory & Practice - parsing

I'd like to write an idiomatic parser for a markup language like Markdown. My version will be slightly different, but I perceive at least a minor need for something like this in Clojure, and I'd like to get on it.
I don't want to use a mess of RegExes (though I realize some will probably be needed), and I'd like to make something both powerful and in idiomatic Clojure.
I've begun a few different attempts (mostly on paper), but I'm terribly happy with them, as I feel as though I'm just improvising. That would be fine, but I've done plenty of exploring in the language of Clojure in the past month or two, and would like to, at least in part, follow in the paths of giants.
I'd like some pointers, or suggestions, or resources (books from O'Reilly would be awesome–love me some eBooks–but Amazon or wherever would be great, too). Whatever you can offer.
EDIT Brian Carper has an interesting post on using ANTLR from Clojure.
There's also clojure-pg and fnparse, which are Clojure parser-generators. fnparse even looks like it's got some decent documentation.
Still looking for resources etc! Just thought I'd update these with some findings of my own.

Best I can think of is that Terrence Parr - the guy that leads the ANTLR parser generator - has written a markup language documented here. Anyway, there's source code there to look at.

There is also clj-peg project, that allows to specify PEG grammar for parsing data

Another not yet mentioned here is clarsec, a port of Haskell's parsec library.
I've recently been on a very similar quest to build a parser in Clojure. I went pretty far down the fnparse path, in particular using the (yet unreleased) fnparse 3 which you can find in the develop branch on github. It is broken into two forms: hound (specifically for LL(1) single lookahead parsers) and cat, which is a packrat parser. Both are functional parsers built on monads (like clarsec). fnparse has some impressive work - the ability to document your parser, build error messages, etc is neat. The documentation on the develop branch is non-existent though other than the function docstrings, which are actually quite good. In the end, I hit some road-blocks with trying to make LL(k) work. I think it's possible to make it work, it's just hard without a decent set of examples on how to make backtracking work well. I'm also so familiar with parsers that split lexing and parsing that it was hard for me to think that way. I'm still very interested in this as a good solution in the future.
In the meantime, I've fallen back to Antlr, which is very robust, well-traveled, well-documented (in 2 books), etc. It doesn't have a Clojure back-end but I hope it will in the future, which would make it really nice for parser work. I'm using it for lexing, parsing, tree transformation, and templating via StringTemplate. It hasn't been entirely bump-free, but I've been able to find workable solutions to all problems so far. Antlr's unique LL(*) parsing algorithm lets you write really readable grammars but still make them fairly efficient (and tweak things gradually if they're not).

Two functional markup translators are;
Pandoc, a markdown implemented in Haskell with source on github
Simple_markdown implemented in OCaml.

Related

Writing a Parser for mixed Languages

I am trying to write a Parser that can analyze mixed Languages and generate an AST of it. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a Parser-beginner. Then i googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: What is the best way to do it?
For example i have a Codefile, that contains several Tags, JS Code, and some $CMS_SET(x,y)$ Code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a Parser based on my grammer that can analyze those mixed Language files and generate and AST Tree of it?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...

This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere; the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree and your customer always has some twist in his code that you're not prepared for. Finally there's the glitches related to character set encodings, variations in newline endings, and preprocessors to complicate the parsing parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to have this right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now deprecated ANTLR3; there's no assurance these parsers are robust let alone compatible for your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real langauge matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict", or "eliminating right recursion". The only parser generator IMHO that has stood the test of time is a GLR parser generator (I hear good things about GLL but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14 and Java8.
You second problem will be building ASTs. You can hand code the AST building process, but that gets old fast when you have to change the grammars often and/or you have many grammars as you are effectively contemplating. This you can beat your way throught with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any energy this in building a grammar; to do this, your parser engine has to offer you this help and none of the mainstream ones do.).
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course your chosen parser isn't designed to do this so you'll have to twist it. The first hard part is provide a parser with clues that a sublanguage is coming up in the input stream, and for it to hand off that parsing to the sublangauge parser, and get it to pass a tree back to be incorporated in the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublangauges are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team spent over a 50 man years of PhD level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the langauge in which you solve all of the above is irrelevant.

Is using lex/yacc (or flex/bison) an overkill for configuration file parsing?

For the last couple of weeks I kept reading and playing with flex/bison, the main goal is to parse structured configuration file with nested groups and lists.
flex/bison seems very powerful but too complicated.
I surveyed few open source project and the only example I found for configuration parsing using Bison was ntpd, other projects build their own parser and lexer.
Is it really the right tool for the job? or is it better to build a recursive descent parser by hand (may be with flex as a lexer)?!

It's entirely appropriate. If you are versed in bison you can throw it together way quicker than you could write an RDP or some kind of ad-hoc parser. Might take a little longer if it's your first go at it - but it might also be a good way to learn.
It will also help you design your grammar - if you accidentally make it ambiguous, you'll get a R/R conflict right away, rather than getting way down to a depp dark place in your RDP and finding you have no way out...

I don't believe it's too complicated. Besides, handwritten parsers are poorly maintainable, compared to autogenerated parsers.
The biggest problem with GNU Bison and Flex is that there is no good tutorial for C++. There are plenty of badly written C examples with global variables, which doesn't help Bison/Flex reputation. Your percepsion may change when you have a working example.
Here is a working C++ solution using Bison 3 and Flex. Encapsulate it in your own namespace and voila - you can stuff your project with gazilion parsers for everything.
https://github.com/ezaquarii/bison-flex-cpp-example

There are lots of home-brew configuration file syntaxes that have been developed using primitive ad-hoc approaches, such as splitting a line into a name and value based on simple tokenizing. Such approaches tend to have limitations, and Java properties files come to mind as a particularly bad configuration format.
When you have made the decision to define a lexical and BNF specification for your configuration syntax, you are already ahead of the game. Whether you then choose to implement that specification via hand-written code or via tools such as flex & bison is just a relatively unimportant implementation detail.
When I designed and implemented Config4*, I choose the hand-written code approach, for reasons I discuss in one of the Config4* manuals. However, I agree with the advice from BadZen: if you are already comfortable using flex and bison, then using them will probably save time compared to using a hand-written lexer and recursive-descent parser.

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?

On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).

Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.

Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.

I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.

In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.

Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)

Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).

Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Is Yacc still used in the industry?

The software base I am developing for uses a signficant amount of yacc which I don't need to deal with. Some times I think it would be helpful in understanding some problems I find but most of the time I can get away with my complete ignorance of yacc.
My question are there enough new projects out there that still use yacc to warrant the time I'll need to learn it?
Edit: Given the response is mostly in favour of learning Yacc, is there a similar language that you would recommend over yacc?

Yes, these tools are worth learning if you ever need to create or modify code that parses a grammar.
For many years the de facto tool for generating code to parse a grammar was yacc, or its GNU cousin, bison.
Lately I've heard there are a couple of new kids on the block, but the principle is the same: you write a declarative grammar in a format that is more or less in Backus-Naur Form (BNF) and yacc/bison/whatever generates some code for you that would be extremely tedious to write by hand.
Also, the principles behind grammars can be very useful to learn even if you don't need to work on such code directly. I haven't worked with parsers much since taking a course on Compiler Design in college, but understanding runtime stacks, lookahead parsers, expression evaluation, and a lot of other related things has helped me immensely to write and debug my code effectively.
edit: Given your followup question about other tools, Yacc/Bison of course are best for C/C++ projects, since they generate C code. There are similar tools for other languages. Not all grammars are equivalent, and some parser generators can only grok grammars of a certain complexity. So you might need to find a tool that can parse your grammar. See http://en.wikipedia.org/wiki/Comparison_of_parser_generators

I don't know about new projects using it but I'm involved in seven different maintenance jobs that use lex and yacc for processing configuration files.
No XML for me, no-sir-ee :-).
Solutions using lex/yacc are a step up from the old configuration files of key=val lines since they allow better hierarchical structures like:
server = "mercury" {
ip = "172.3.5.13"
gateway = "172.3.5.1"
}
server = "venus" {
ip = "172.3.5.21"
gateway = "172.3.5.1"
}
And, yes, I know you can do that with XML, but these are primarily legacy applications written in C and, to be honest, I'd probably use lex/yacc for new (non-Java) jobs as well.
That's because I prefer delivering software on time and budget rather than delivering the greatest new whizz-bang technology - my clients won't pay for my education, they want results first and foremost and I'm already expert at lex/yacc and have all the template code for doing it quickly.

A general rule of thumb: code lasts a long time, so the technologies used in that code last a long time, too. It would take an enormous amount of time to replace the codebase you mention (it took 15 years to build it...), which in turn implies that it will still be around in 5, 10, or more years. (There's even a chance that someone who reads this answer will end up working on it!)
Another rule of thumb: if a general-purpose technology is commonplace enough that you have encountered it already, it's probably commonplace enough that you should familiarize yourself with it, because you'll see it again one day. Who knows: by familiarizing yourself with it, maybe you added a useful tool to your toolbox...
Yacc is one of these technologies: you're probably going to run into it again, it's not that difficult, and the principles you'll learn apply to the whole family of parser constructors.

PEGs are the new hotness, but there are still a ton of projects that use yacc or tools more modern than yacc. I would frown on a new project that chose to use yacc, but for existing projects porting to a more modern tool may not make sense. This makes having rough familiarity with yacc a useful skill.
If you're totally unfamiliar with the topic of parser generators I'd encourage you to learn about one, any one. Many of the concepts are portable between them. Also, it's a useful tool to have in the belt: once you know one you'll understand how they can often be superior compared to regex heavy hand written parsers. If you're already comfortable with the topic of parsers, I wouldn't worry about it. You'll learn yacc if and when you need to in order to get something done.

I work on projects that use Yacc. Not new code - but were they new, they'd still use Yacc or a close relative (Bison, Byacc, ...).
Yes, I regard it as worth learning if you work in C.
Also consider learning ANTLR, or other more modern parser generators. But knowledge of Yacc will stand you in good stead - it will help you learn any other similar tools too, since a lot of the basic theory is similar.

I don't know about yacc/bison specifically, but I have used antlr, cup, jlex and javacc. I thought they would only be of accademic importance, but as it turns out we needed a domain-specific language, and this gave us a much nicer solution than some "simpler" (regex based) parsers out there. Maintenance might be an issue in many environments, though - since most coders these days won't have any experience with parsing tools.

I haven't had the chance to compare it with other parsing systems but I can definitely recommend ANTLR based on my own experience and also with its large and active user base.
Another plus point for ANTLR is ANTLRWorks: The ANTLR GUI Development Environment which is a great help while developing and debugging your grammars. I've yet to see another parsing system which is supported by such an IDE.

We are writing new yacc code at my company for shipping products. Yes, this stuff is still used.

Writing a parser - In the need of guides and research papers

My knowledge about implementing a parser is a bit rusty.
I have no idea about the current state of research in the area, and could need some links regarding recent advances and their impact on performance.
General resources about writing a parser are also welcome, (tutorials, guides etc.) since much of what I had learned at college I have already forgotten :)
I have the Dragon book, but that's about it.
And does anyone have input on parser generators like ANTLR and their performance? (ie. comparison with other generators)
edit My main target is RDF/OWL/SKOS in N3 notation.

Mentioning the dragon book and antlr means you've answered your own question.
If you're looking for other parser generators you could also check out boost::spirit (http://spirit.sourceforge.net/).
Depending on what you're trying to achieve you might also want to consider a DSL, which you can either parse yourself or write in a scripting language like boo, ruby, python etc...

Hmm … your request is a bit unspecific. While there are many recent developments in this general area, they're all quite specialized (naturally, since the field has matured). The original parsing approaches haven't really changed, though. You might want to read up on changes in parser creation tools (Antlr, Gold Parser, to name but a few).

You might also want to take a look at SableCC, another parser generator "which generates fully featured object-oriented frameworks for building compilers".
Their is some documentation about basic uses here and here. Since you asked about research papers, SableCC's main developper's master thesis (1998) is available and explains a little more about SableCC advantages.
Although the current stable version is 3.2, the development branch v4 is a complete rewrite and should implement features new to parser generators.

If you want to build custom analyzers for complex languages,
consider our DMS Software Reengineering Toolkit.
See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
This provides very strong parsing technology, making it "easy" to define your language
(especially in comparison with most parser generators).
Conventional parser generators may help
with parsing, but they provide zero help in the hard part of the
process, which happens after you can parse the code.
DMS provides a vast amount of machinery to support analyzing and transforming
the code once your have parsed it.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart