I'm learning F# because I'd like to write a lexer and parser. I have a tiny bit of experience with this sort of processing but really need to learn it properly as well as F#.
When learning the lexing/parsing functionality of F#, is studying lex and yacc sufficient?
Or are there differences that mean code written for lex/yacc will not work with fslex and fsyacc?
I personally found these OcamlLex and OcamlYacc tutorials excellent resources to get started -- easy to follow, and you can translate most everything in those tutorials for FsLex/FsYacc almost verbatim.
Well, with lex and yacc, you put C/C++ code in the 'actions', whereas with fslex and fsyacc you put F# code there, but I presume you know this?
I think they are otherwise based on the same (established/ancient) tokenizing and parsing technologies, so the general structure/behavior of the grammar should be similar, if that's what you're after...
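To make the shared structure concrete, here is a rough sketch in Python (not lex or fslex syntax) of what a lexer specification boils down to: a list of patterns, each paired with an action written in the host language. The token names and the `tokenize` helper are made up for illustration, and unlike real lex this sketch takes the first matching rule rather than the longest match.

```python
import re

# Each rule pairs a pattern with an action in the host language.
# In lex the action body would be C; in fslex it would be F#.
RULES = [
    (r"[ \t\n]+",      lambda text: None),                # skip whitespace
    (r"\d+",           lambda text: ("INT", int(text))),
    (r"[A-Za-z_]\w*",  lambda text: ("ID", text)),
    (r"[-+*/()=]",     lambda text: ("OP", text)),
]

def tokenize(source):
    """Try the rules in order at each position; the first match wins."""
    compiled = [(re.compile(pattern), action) for pattern, action in RULES]
    pos, tokens = 0, []
    while pos < len(source):
        for pattern, action in compiled:
            m = pattern.match(source, pos)
            if m:
                token = action(m.group())
                if token is not None:
                    tokens.append(token)
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens
```

Porting a spec between the tools is mostly a matter of rewriting the right-hand column; the patterns and the overall layout carry over.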
I'm currently trying to translate OCaml programs (with a fairly standard/limited grammar) into Racket, and I'm trying to see if there is a way to do the parsing to intermediate representation using camlp4. I tried to build a lexer and parser using ocamlyacc and ocamllex but considering how large the grammar can be, it became quite complicated. So, I searched around and found camlp4 to have some of this already built-in, but I can't seem to look up how to get the AST of some OCaml code using it. Any documentation/examples/ideas? Also, if you have any suggestions on how to do this better, that'd be great as well! Thanks.
Just use compiler-libs, which is distributed with the compiler. You can use the ocaml parser itself directly that way.
Here is an example of code that reads a .ml file. The documentation for the parser is quite decent. You will obtain, after parsing, a Parsetree.
For the last couple of weeks I have been reading about and playing with flex/bison. The main goal is to parse a structured configuration file with nested groups and lists.
flex/bison seems very powerful but too complicated.
I surveyed a few open source projects, and the only example I found of configuration parsing with Bison was ntpd; the other projects build their own parser and lexer.
Is it really the right tool for the job? Or is it better to build a recursive descent parser by hand (maybe with flex as a lexer)?
It's entirely appropriate. If you are versed in bison you can throw it together way quicker than you could write an RDP or some kind of ad-hoc parser. Might take a little longer if it's your first go at it - but it might also be a good way to learn.
It will also help you design your grammar - if you accidentally make it ambiguous, you'll get an R/R conflict right away, rather than getting way down to a deep dark place in your RDP and finding you have no way out...
I don't believe it's too complicated. Besides, handwritten parsers are poorly maintainable, compared to autogenerated parsers.
The biggest problem with GNU Bison and Flex is that there is no good tutorial for C++. There are plenty of badly written C examples with global variables, which doesn't help the reputation of Bison/Flex. Your perception may change when you have a working example.
Here is a working C++ solution using Bison 3 and Flex. Encapsulate it in your own namespace and voila - you can stuff your project with a gazillion parsers for everything.
https://github.com/ezaquarii/bison-flex-cpp-example
There are lots of home-brew configuration file syntaxes that have been developed using primitive ad-hoc approaches, such as splitting a line into a name and value based on simple tokenizing. Such approaches tend to have limitations, and Java properties files come to mind as a particularly bad configuration format.
When you have made the decision to define a lexical and BNF specification for your configuration syntax, you are already ahead of the game. Whether you then choose to implement that specification via hand-written code or via tools such as flex & bison is just a relatively unimportant implementation detail.
When I designed and implemented Config4*, I chose the hand-written code approach, for reasons I discuss in one of the Config4* manuals. However, I agree with the advice from BadZen: if you are already comfortable using flex and bison, then using them will probably save time compared to using a hand-written lexer and recursive-descent parser.
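For a sense of how much work the hand-written route involves, here is a minimal sketch in Python of a lexer plus recursive-descent parser for a configuration format with nested groups and lists. The syntax (`name = value;`, `name { ... }`, `[a, b]`) is invented for this example and is not the format of ntpd, Config4*, or any other tool mentioned here.

```python
import re

# Assumed grammar (hypothetical):
#   entries := entry*
#   entry   := NAME '=' value ';'  |  NAME '{' entries '}'
#   value   := NUMBER | STRING | '[' (value (',' value)*)? ']'
TOKEN_RE = re.compile(r"""
    \s+
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | "(?P<STRING>[^"]*)"
  | (?P<PUNCT>[{}\[\]=;,])
""", re.VERBOSE)

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup:  # lastgroup is None when only whitespace matched
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("EOF", "")

    def expect(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def entries(self, until):
        # One function per grammar rule; `until` is the token that ends the block.
        result = {}
        while self.peek() != until:
            name = self.expect("NAME")
            if self.peek() == ("PUNCT", "{"):
                self.expect("PUNCT", "{")
                result[name] = self.entries(("PUNCT", "}"))
                self.expect("PUNCT", "}")
            else:
                self.expect("PUNCT", "=")
                result[name] = self.value()
                self.expect("PUNCT", ";")
        return result

    def value(self):
        kind, text = self.peek()
        if kind == "NUMBER":
            self.i += 1
            return int(text)
        if kind == "STRING":
            self.i += 1
            return text
        self.expect("PUNCT", "[")
        items = []
        if self.peek() != ("PUNCT", "]"):
            items.append(self.value())
            while self.peek() == ("PUNCT", ","):
                self.i += 1
                items.append(self.value())
        self.expect("PUNCT", "]")
        return items

def parse_config(text):
    return Parser(lex(text)).entries(("EOF", ""))
```

Each grammar rule maps to one method, which is why writing the BNF first matters more than the choice between hand-written code and a generator.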
I'd like to write an idiomatic parser for a markup language like Markdown. My version will be slightly different, but I perceive at least a minor need for something like this in Clojure, and I'd like to get on it.
I don't want to use a mess of RegExes (though I realize some will probably be needed), and I'd like to make something both powerful and in idiomatic Clojure.
I've begun a few different attempts (mostly on paper), but I'm not terribly happy with them, as I feel as though I'm just improvising. That would be fine, but I've done plenty of exploring in the language of Clojure in the past month or two, and would like to, at least in part, follow in the paths of giants.
I'd like some pointers, or suggestions, or resources (books from O'Reilly would be awesome–love me some eBooks–but Amazon or wherever would be great, too). Whatever you can offer.
EDIT Brian Carper has an interesting post on using ANTLR from Clojure.
There's also clojure-pg and fnparse, which are Clojure parser-generators. fnparse even looks like it's got some decent documentation.
Still looking for resources etc! Just thought I'd update these with some findings of my own.
Best I can think of is that Terence Parr - the guy that leads the ANTLR parser generator - has written a markup language documented here. Anyway, there's source code there to look at.
There is also the clj-peg project, which lets you specify a PEG grammar for parsing data.
Another not yet mentioned here is clarsec, a port of Haskell's parsec library.
I've recently been on a very similar quest to build a parser in Clojure. I went pretty far down the fnparse path, in particular using the (yet unreleased) fnparse 3 which you can find in the develop branch on github. It is broken into two forms: hound (specifically for LL(1) single lookahead parsers) and cat, which is a packrat parser. Both are functional parsers built on monads (like clarsec). fnparse has some impressive work - the ability to document your parser, build error messages, etc is neat. The documentation on the develop branch is non-existent though other than the function docstrings, which are actually quite good. In the end, I hit some road-blocks with trying to make LL(k) work. I think it's possible to make it work, it's just hard without a decent set of examples on how to make backtracking work well. I'm also so familiar with parsers that split lexing and parsing that it was hard for me to think that way. I'm still very interested in this as a good solution in the future.
In the meantime, I've fallen back to Antlr, which is very robust, well-traveled, well-documented (in 2 books), etc. It doesn't have a Clojure back-end but I hope it will in the future, which would make it really nice for parser work. I'm using it for lexing, parsing, tree transformation, and templating via StringTemplate. It hasn't been entirely bump-free, but I've been able to find workable solutions to all problems so far. Antlr's unique LL(*) parsing algorithm lets you write really readable grammars but still make them fairly efficient (and tweak things gradually if they're not).
Two functional markup translators are:
Pandoc, a Markdown processor implemented in Haskell, with source on GitHub.
Simple_markdown, implemented in OCaml.
I want to write a parser-generator for educational purposes, and was wondering if there are some nice online resources or tutorials that explain how to write one. Something along the lines of "Let's Build a Compiler" by Jack Crenshaw.
I want to write the parser generator for LR(1) grammar.
I have a decent understanding of the theory behind generating the action and goto tables, but want some resource which will help me with implementing it.
Preferred languages are C/C++ and Java, though other languages are OK.
Thanks.
I agree with others, the Dragon book is good background for LR parsing.
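On the implementation side, it may help to write the table-driven driver first and feed it hand-built tables, then replace the tables with generated ones later. Below is a small Python sketch of that driver loop. The toy grammar (`E -> E + n | n`), the state numbering, and the `ACTION`/`GOTO` dictionaries were worked out by hand for this example; a generator would compute them from the item sets. The driver itself does not care whether the tables come from an SLR, LALR, or canonical LR(1) construction.

```python
# Toy grammar (augmented):
#   0: S' -> E
#   1: E  -> E + n
#   2: E  -> n
# Production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1)}

# Hand-built tables; "$" marks end of input.
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}

def lr_parse(tokens):
    """Return the production numbers in the order they were reduced."""
    stack = [0]                      # stack of automaton states
    tokens = list(tokens) + ["$"]
    pos = 0
    reductions = []
    while True:
        state, lookahead = stack[-1], tokens[pos]
        action = ACTION.get((state, lookahead))
        if action is None:
            raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, rhs_len = PRODUCTIONS[arg]
            del stack[len(stack) - rhs_len:]          # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])      # then follow the goto edge
            reductions.append(arg)
        else:  # accept
            return reductions
```

Once this loop works, the remaining job of the generator is purely to fill in `ACTION` and `GOTO`, which is where the closure and goto computations from the Dragon Book come in.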
If you are interested in recursive descent parsers, an enormously fun learning experience is this website, which walks you through building a completely self-contained compiler system that can compile itself and other languages:
MetaII Compiler Tutorial
This is all based on an amazing little 10-page technical paper by Val Schorre: META II: A Syntax-Oriented Compiler Writing Language from honest-to-god 1964. I learned how to build compilers from this back in 1970. There's a mind-blowing moment when you finally grok how the compiler can regenerate itself....
I know the website author from my college days, but have nothing to do with the website.
If you wanted to go the python route I would recommend the following.
Text Processing in Python
Pyparsing
I have found both of these to be extremely helpful, and Paul McGuire, the author of pyparsing, is super at helping you out when you run into problems. The book Text Processing in Python is just a handy reference to have at your fingertips and helps get you into the right frame of mind when attempting to build a parser.
I would also point out that an OO language is better suited as a language parsing engine because it's extensible and polymorphism is the right way to do it (IMHO). Looking at the problem in terms of a state machine rather than "Look for a semicolon at the end of xyz" will demonstrate that your parser becomes much more robust in the end.
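As a small illustration of the state-machine view, here is a Python sketch of splitting input into statements. A plain "look for a semicolon" approach breaks as soon as a semicolon appears inside a string literal, while tracking an explicit state handles it. The function name and the quoting rules (double quotes, backslash escapes) are assumptions made for the example.

```python
def split_statements(text):
    """Split on ';' except inside double-quoted strings, using explicit states."""
    NORMAL, IN_STRING, ESCAPE = "normal", "in_string", "escape"
    state = NORMAL
    current, statements = [], []
    for ch in text:
        if state == NORMAL:
            if ch == ";":
                statements.append("".join(current).strip())
                current = []
                continue
            if ch == '"':
                state = IN_STRING
        elif state == IN_STRING:
            if ch == "\\":
                state = ESCAPE
            elif ch == '"':
                state = NORMAL
        else:
            # ESCAPE: the character after a backslash is taken literally
            state = IN_STRING
        current.append(ch)
    if state != NORMAL:
        raise SyntaxError("unterminated string literal")
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

For comparison, `'a = "x;y"; b = 2;'.split(";")` would cut the string literal in half, whereas the state machine keeps it intact.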
Hope that Helps!
Not really online, but the Dragon Book has fairly elaborate discussions of LR parsing.
I found it easier to learn to write recursive-descent parsers before learning to write LR parsers. Well to be honest, after many years of writing parsers, I never found it necessary to write an LR parser.
I've recently written a tutorial at CodeProject called Implementing Programming Language Tools in C# 4.0 which describes recursive descent parsing techniques.
I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code doesn't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers use to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing should read? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTLR and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
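For anyone who hasn't seen the Interpreter pattern, here is a minimal Python sketch: each grammar rule becomes a class, and every class knows how to evaluate itself against a context. The class names and the arithmetic "language" are invented for the example; the pattern itself says nothing about how the tree gets built, so a parser (or hand-written constructor calls, as in the usage below) still has to produce it.

```python
from dataclasses import dataclass

class Expr:
    """Base class: one subclass per grammar rule."""
    def evaluate(self, env):
        raise NotImplementedError

@dataclass
class Num(Expr):
    value: float
    def evaluate(self, env):
        return self.value

@dataclass
class Var(Expr):
    name: str
    def evaluate(self, env):
        return env[self.name]       # look the variable up in the context

@dataclass
class Add(Expr):
    left: Expr
    right: Expr
    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

@dataclass
class Mul(Expr):
    left: Expr
    right: Expr
    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

# x + 2 * 3, built by hand instead of by a parser
tree = Add(Var("x"), Mul(Num(2), Num(3)))
result = tree.evaluate({"x": 4})
```

Adding a new construct to the little language means adding one class, which is the main appeal for small DSLs.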
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
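To give a feel for the idea without Haskell, here is a rough Python rendering of the core combinators. A parser is just a function from `(text, position)` to either `(value, new_position)` or `None`, and `bind` plays the role of monadic sequencing. All names here (`pure`, `bind`, `satisfy`, `many`, `many1`, `total`) are chosen for this sketch and are not the Parsec API or the code from Hutton's book.

```python
def pure(value):
    """Parser that consumes nothing and succeeds with `value`."""
    return lambda s, i: (value, i)

def bind(p, f):
    """Run p, then pass its result to f to obtain the next parser."""
    def parse(s, i):
        r = p(s, i)
        if r is None:
            return None
        value, j = r
        return f(value)(s, j)
    return parse

def satisfy(pred):
    """Parser for a single character that satisfies `pred`."""
    def parse(s, i):
        if i < len(s) and pred(s[i]):
            return s[i], i + 1
        return None
    return parse

def many(p):
    """Zero or more repetitions of p (p must consume input when it succeeds)."""
    def parse(s, i):
        values = []
        while True:
            r = p(s, i)
            if r is None:
                return values, i
            value, i = r
            values.append(value)
    return parse

def many1(p):
    """One or more repetitions of p."""
    return bind(p, lambda x: bind(many(p), lambda xs: pure([x] + xs)))

def symbol(c):
    return satisfy(lambda ch: ch == c)

digit = satisfy(lambda ch: ch in "0123456789")
number = bind(many1(digit), lambda ds: pure(int("".join(ds))))

# total := number ('+' number)*   and the result is the sum
plus_number = bind(symbol("+"), lambda _: number)
total = bind(number, lambda first:
        bind(many(plus_number), lambda rest:
        pure(first + sum(rest))))
```

Calling `total("1+22+3", 0)` runs the whole parser from position 0. The nested lambdas are what Haskell's `do` notation hides, which is a large part of why this style reads so much better there.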
In Perl, the Parse::RecDescent module is the first place to start. Add "tutorial" to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar is easier, and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else in the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII. Yes, 1964. And it is amazing. This is how I learned about compilers back in 1970.