Example code for dynamic parsing techniques - parsing

I would like to learn how to write dynamic parsers to perform tasks such as code-completion, highlighting, etc.
I have read the dragon book and written some parsers, but I would like more experience with handling incorrect code, especially code as it is being written.
IDEs like Eclipse and NetBeans obviously include code for stuff like this, but where?
What other projects / books might be relevant?
LISP or functional examples are also welcome.

Take a look at http://www.antlr.org/.

Check out xtext. It uses an ANTLR parser behind the scenes, but generates a syntax-highlighting editor, content assist, outlining, and many other features for you.
See http://www.eclipse.org/Xtext/

Related

parsing code into components

I need to parse some code and convert it to components because I want to make some stats about the code like number of code lines , position of condition statements and so on .
Is there any tool that I can use to fulfill that ?
Antlr is a nice tool that works with many languages, has good documentation and many sample grammars for languages included.
You can also go old-school and use Yacc and Lex (or the GNU versions Bison and Flex), which has pretty good book on generating parsers, as well as the classic dragon book.
It might be overkill, however, and you might just want to use Ruby or even Javascript.
You mention two distinct tasks:
refactor code into modular components
Run static code analysis to get code metrics.
I recommend:
Resharper, as the top code refactoring tool out there (assuming you're a .NET guy)
NCover and NDepend for static code analysis. You get #line of code, cyclomatic complexity, abstractness vs instability diagrams.. all of the cool stuff.

Writing a Parser (for a markup language): Theory & Practice

I'd like to write an idiomatic parser for a markup language like Markdown. My version will be slightly different, but I perceive at least a minor need for something like this in Clojure, and I'd like to get on it.
I don't want to use a mess of RegExes (though I realize some will probably be needed), and I'd like to make something both powerful and in idiomatic Clojure.
I've begun a few different attempts (mostly on paper), but I'm terribly happy with them, as I feel as though I'm just improvising. That would be fine, but I've done plenty of exploring in the language of Clojure in the past month or two, and would like to, at least in part, follow in the paths of giants.
I'd like some pointers, or suggestions, or resources (books from O'Reilly would be awesome–love me some eBooks–but Amazon or wherever would be great, too). Whatever you can offer.
EDIT Brian Carper has an interesting post on using ANTLR from Clojure.
There's also clojure-pg and fnparse, which are Clojure parser-generators. fnparse even looks like it's got some decent documentation.
Still looking for resources etc! Just thought I'd update these with some findings of my own.
Best I can think of is that Terrence Parr - the guy that leads the ANTLR parser generator - has written a markup language documented here. Anyway, there's source code there to look at.
There is also clj-peg project, that allows to specify PEG grammar for parsing data
Another not yet mentioned here is clarsec, a port of Haskell's parsec library.
I've recently been on a very similar quest to build a parser in Clojure. I went pretty far down the fnparse path, in particular using the (yet unreleased) fnparse 3 which you can find in the develop branch on github. It is broken into two forms: hound (specifically for LL(1) single lookahead parsers) and cat, which is a packrat parser. Both are functional parsers built on monads (like clarsec). fnparse has some impressive work - the ability to document your parser, build error messages, etc is neat. The documentation on the develop branch is non-existent though other than the function docstrings, which are actually quite good. In the end, I hit some road-blocks with trying to make LL(k) work. I think it's possible to make it work, it's just hard without a decent set of examples on how to make backtracking work well. I'm also so familiar with parsers that split lexing and parsing that it was hard for me to think that way. I'm still very interested in this as a good solution in the future.
In the meantime, I've fallen back to Antlr, which is very robust, well-traveled, well-documented (in 2 books), etc. It doesn't have a Clojure back-end but I hope it will in the future, which would make it really nice for parser work. I'm using it for lexing, parsing, tree transformation, and templating via StringTemplate. It hasn't been entirely bump-free, but I've been able to find workable solutions to all problems so far. Antlr's unique LL(*) parsing algorithm lets you write really readable grammars but still make them fairly efficient (and tweak things gradually if they're not).
Two functional markup translators are;
Pandoc, a markdown implemented in Haskell with source on github
Simple_markdown implemented in OCaml.

F# parsing Abstract Syntax Trees

What is the best way to use F# to parse an AST to build an interpreter? There are plenty of F# examples for trivial syntax (basic arithmatical operations) but I can't seem to find anything for languages with much larger ranges of features.
Discriminated unions look to be incrediably useful but how would you go about constructing one with large number of options? Is it better to define the types (say addition, subtraction, conditionals, control flow) elsewhere and just bring them together as predefined types in the union?
Or have I missed some far more effective means of writing interpreters? Is having an eval function for each type more effective, or perhaps using monads?
Thanks in advance
Discriminated unions look to be
incrediably useful but how would you
go about constructing one with large
number of options? Is it better to
define the types (say addition,
subtraction, conditionals, control
flow) elsewhere and just bring them
together as predefined types in the
union?
I am not sure what you are asking here; even with a large number of options, DUs still are simple to define. See e.g. this blog entry for a tiny language's DU structure (as well as a more general discussion about writing tree transforms). It's fine to have a DU with many more cases, and common in compilers/interpreters to use such a representation.
As for parsing, I prefer monadic parser combinators; check out FParsec or see this old blog entry. After using such parser combinators, I can never go back to anything like lex/yacc/ANTLR - external DSLs seem so primitive in comparison.
(EDIT: The 'tiny arithmetic examples' you have found are probably pretty much representative of what larger solutions looks like as well. The 'toy' examples usually show off the right architecture.)
You should take a copy of Robert Pickering's "Beginning F#".
The chapter 13, "Parsing Text", contains an example with FsLex and FsYacc, as suggested by Noldorin.
Other than that, in the same book, chapter 12, the author explains how to build an actual simple compiler for an arithmetic language he proposes. Very enlightening. The most important part is what you are looking for: the AST parser.
Good luck.
I second Brian's suggestion to take a look at FParsec. If you're interested in doing things the old-school way with FsLex and FsYacc, one place to look for how to parse a non-trivial language is the F# source itself. See the source\fsharp\FSharp.Compiler directory in the distribution.
You may be interesting in checking out the Lexing and Parsing section of the F# WikiBook. The F# PowerPack library contains the FsLex and FsYacc tools, which asssist greatly with this. The WikiBook guide is a good way to get started with this.
Beyond that, you will need to think about how you actually want to execute the code from the AST form, which is common the design of both compilers and interpreters. This is generally considered the easier part however, and there are lots of general resources on compilers/interpreter out there that should provide information on this.
I haven't done an interpreter myself. Hope the following helps:)
Here's a compiler course taught at Yale using ML, which you might find useful. The lecture notes are very concise (short) and informative. You can follow the first few lecture notes and the assignments. As you know F#, you won't have problem reading ML programs.
Btw, the professor was a student of A. Appel, who is the creator of SML implementation. So from these notes, you also get the most natural way to write a compiler/interpreter in ML family language.
This is an excellent example of a complete Small Basic implementation with F# and FParsec. It includes even IL compiler. The whole code is very accessible and is accompanied by a series of blog post from the author at http://trelford.com/blog/

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Parsing, where can I learn about it

I've been given a job of 'translating' one language into another. The source is too flexible (complex) for a simple line by line approach with regex. Where can I go to learn more about lexical analysis and parsers?
If you want to get "emotional" about the subject, pick up a copy of "The Dragon Book." It is usually the text in a compiler design course. It will definitely meet your need "learn more about lexical analysis and parsers" as well as a bunch of other fun stuff!
IMH(umble)O, save yourself an arm and/or leg and buy an older edition - it will fill your information desires.
Try ANLTR:
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There's a book for it also.
Niklaus Wirth's book "Compiler Construction" (available as a free PDF)
http://www.google.com/search?q=wirth+compiler+construction
I've recently been working with PLY which is an implementation of lex and yacc in Python. It's quite easy to get started with it and there are some simple examples in the documentation.
Parsing can quickly become a very technical topic and you'll find that you probably won't need to know all the details of the parsing algorithm if you're using a parser builder like PLY.
Lots of people have recommended books. For many these are much more useful in a structured environment with assignments and due dates and so forth. Even if not, having the material presented in a different way can help greatly.
(a) Have you considered going to a school with a decent CS curriculum?
(b) There are lots of online lectures, such as MIT's Open Courseware. Their EE/CS section has many courses that touch on parsing, though I can't see any on parsing per se. It's typically introduced as one of the first theory courses as language classification and automata is at the heart of much of CS theory.
If you prefer Java based tools, the Java Compiler Compiler, JavaCC, is a nice parser/scanner. It's config file driven, and will generate java code that you can include in your program. I haven't used it a couple years though, so I'm not sure how the current version is. You can find out more here: https://javacc.dev.java.net/
Lexing/Parsing + typecheck + code generation is a great CS exercise I would recommend it to anyone wanting a solid basis, so I'm all for the Dragon Book
I found this site helpful:
Lex and YACC primer/HOWTO
The first time I used lex/yacc was for a relatively simple project. This tutorial was all I really needed. When I approached more complex projects later, the familiarity I had from this tutorial and a simple project allowed me to build something fancier.
After taking (quite) a few compilers classes, I've used both The Dragon Book and C&T. I think C&T does a far better job of making compiler construction digestible. Not to take anything away from The Dragon Book, but I think C&T is a far more practical book.
Also, if you like writing in Java, I recommend using JFlex and BYACC/J for your lexing and parsing needs.
Yet another textbook to consider is Programming Language Pragmatics. I prefer it over the Dragon book, but YMMV.
If you're using Perl, yet another tool to consider is Parse::RecDescent.
If you just need to do this translation once and don't know anything about compiler technology, I would suggest that you get as far as you can with some fairly simplistic translations and then fix it up by hand. Yes, it is a lot of work. But it is less work than learning a complex subject and coding up the right solution for one job. That said, you should still learn the subject, but don't let not knowing it be a roadblock to finishing your current project.
Parsing Techniques - A Practical Guide
By Dick Grune and Ceriel J.H. Jacobs
This book (freely available as PDF) gives an extensive overview of different parsing techniques/algorithms. If you really want to understand the different parsing algorithms, this IMO is a better reference than the Dragon Book (as Parsing Techniques focuses entirely on parsing, while the Dragon Book covers parsing only as one - although important - part of the compiler construction process).
flex and bison are the new lex and yacc though. The syntax for BNF is often derided for being a bit obtuse. Some have moved to ANTLR and Ragel for this reason.
If you're not doing much translation, you may one to pull a one-off using multiline regexes with Perl or Ruby. Writing a compatible BNF grammar for an existing language is not a task to be taken lightly.
On the other hand, it is entirely possible to leverage any given language's .l and .y files if they are available as open source. Then, you could construct new code from an existing parse tree.

Resources