YAML parsing - lex or hand-rolled? - parsing

I am trying to write a simple YAML parser, I read the spec from yaml.org,
before I start, I was wondering if it is better to write a hand-rolled parser, or
use lex (flex/bison). I looked at the libyaml (C library) -
doesn't seem to use lex/yacc.
YAML (excluding the flow styles), seems to be more line-oriented, so, is it
easier to write a hand-rolled parser, or use flex/bison
Thanks.

This answer is basically an answer to the question: "Should I roll my own parser or use parser generator?" and has not much to do with YAML. But nevertheless it will "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.
I have written among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples, because they go against intuition and exterior requirements dictated the decision.
Since I established on a technical level it is possible, why would you pick one over the other? This is really hard question to answer, here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use of hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or similar generators make the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes is quite clear what happens in the parser. This provides the easy means to gracefully handle harry conflicts. You can write a simple if, instead of rearranging the entire grammar to be LALR1.
While developing the parser you can "gloss over details" with a hand written parser, using bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should have written both types of parser.

Related

Generate a parser from programatically generated BNF

I've seen two approaches to parsing:
Use a parser generator like happy. This allows you to specify your language in BNF, and not worry about the intricacies of parsing. However, since it's a preprocessor you have to write your whole parse tree textually.
Use a parser directly like megaparsec. With this approach you have direct access to your code so you can generate your parser programatically, but you haven't got the convenience of happy's simple BNF specification with precedence annotations etc. Also it seems non trivial to print out a BNF tree for documentation from your parsing code unless this is considered during it's construction.
What I'd like to do is something like this:
Generate a data structure programatically that represents BNF.
Feed this through to a "happy like" parser generator to generate a parser.
Feed this through a pretty printer to generate actual BNF documentation.
The reason I want to do this is that the grammar I'm working on has grown quite large and has a lot of repetition, as a lot of it's constructs are similar to others but slightly different. It would improve maintenence effort if it could be generated programmatically instead of modifying happy BNF spec directly, but I'd rather not have to develop my own parser from scratch.
Any ideas about a good approach here. It would be great if I could just generate a data structure and force it into happy (as it presumably generates it's own internal structure after parsing the BNF feed to it) but happy doesn't seem to have a library interface.
I guess I could generate attonated BNF, and feed that through to happy, but it seems like a messy process of converting back and forth. A cleaner approach would be better. Perhaps even a BNF style extension to parsec or megaparsec?
The simplest thing to do would to make some data type representing the relevant grammar, and then convert it to a parser using some parser combinators as a (run-time) "compile" step. Unfortunately, most parser combinators are less efficient and/or less flexible (in some ways) than the parser generators, so this would be a bit of a lowest common denominator approach. That said, the grammar-combinators library may be useful, though it doesn't appear to be maintained.
There are libraries that can generate parsers at run-time. One I found just now is Grempa, which doesn't appear to be maintained but that may not be a problem. Another option (by the same person who made Grempa but maintained) is Earley which, due to the way Earley parsers are made, it makes sense to have an explicit grammar that gets processed into a parser. Earley parsing is certainly flexible, but may be overpowered for you (or maybe not).

Parsec or happy (with alex) or uu-parsinglib

I am going to write a parser of verilog (or vhdl) language and will do a lot of manipulations (sort of transformations) of the parsed data. I intend to parse really big files (full Verilog designs, as big as 10K lines) and I will ultimately support most of the Verilog. I don't mind typing but I don't want to rewrite any part of the code whenever I add support for some other rule.
In Haskell, which library would you recommend? I know Haskell and have used Happy before (to play). I feel that there are possibilities in using Parsec for transforming the parsed string in the code (which is a great plus). I have no experience with uu-paringlib.
So to parse a full-grammar of verilog/VHDL which one of them is recommended? My main concern is the ease and 'correctness' with which I can manipulate the parsed data at my whim. Speed is not a primary concern.
I personally prefer Parsec with the help of Alex for lexing.
I prefer Parsec over Happy because 1) Parsec is a library, while Happy is a program and you'll write in a different language if you use Happy and then compile with Happy. 2) Parsec gives you context-sensitive parsing abilities thanks to its monadic interface. You can use extra state for context-sensitive parsing, and then inspect and decide depending on that state. Or just look at some parsed value before and decide on next parsers etc. (like a <- parseSomething; if test a then ... do ...) And when you don't need any context-sensitive information, you can simply use applicative style and get an implementation like implemented in YACC or a similar tool.
As a downside of Parsec, you'll never know if your Parsec parser contains a left recursion, and your parser will get stuck in runtime (because Parsec is basically a top-down recursive-descent parser). You have to find left recursions and eliminate them. YACC-style parsers can give you some static guarantees and information (like shift/reduce conflicts, unused terminals etc.) that you can't get with Parsec.
Alex is highly recommended for lexing in both situations (I think you have to use Alex if you decide to go on with Happy). Because even if you use Parsec, it really simplifies your parser implementation, and catches a great deal of bugs too (for example: parsing a keyword as an identifier was a common bug I did while I was using Parsec without Alex. It's just one example).
You can have a look at my Lua parser implemented in Alex+Parsec And here's the code to use Alex-generated tokens in Parsec.
EDIT: Thanks John L for corrections. Apparently you can do context-sensitive parsing with Happy too. Also, Alex for lexing is not required in Happy, though it's recommended.

Do production compilers use parser generators?

I've heard that "real compiler writers" roll their own handmade parser rather than using parser generators. I've also heard that parser generators don't cut it for real-world languages. Supposedly, there are many special cases that are difficult to implement using a parser generator. I have my doubts about this:
Theoretically, a GLR parser generator should be able to handle most programming language designs (except maybe C++...)
I know of at least one production language that uses a parser generator: Ruby [1].
When I took my compilers class in school, we used a parser generator.
So my question: Is it reasonable to write a production compiler using a parser generator, or is using a parser generator considered a poor design decision by the compiler community?
[1] https://github.com/ruby/ruby/blob/trunk/parse.y
For what it's worth, GCC used a parser generator pre-4.0 I believe, then switched to a hand written recursive descent parser because it was easier to maintain and extend.
Parser generators DO "cut it" for "real" languages, but the amount of work to transform your grammar into something workable grows exponentially.
Edit: link to the GCC document detailing the change with reasons and benefits vs cost analysis: http://gcc.gnu.org/wiki/New_C_Parser.
I worked for a company for a few years where we were more or less writing compilers. We weren't concerned much with performance; just reducing the amount of work/maintenance. We used a combination of generated parsers + handwritten code to achieve this. The ideal balance is to automate the easy, repetitive parts with the parser generator and then tackle the hard stuff in custom functions.
Sometimes a combination of both methods, is used, like generating code with a parser, and later, modifying "by hand" that code.
Other way is that some scanner (lexer) and parser tools allow them to add custom code, additional to the grammar rules, called "semantic actions". A good example of this case, is that, a parser detects generic identifiers, and some custom code, transform some specific identifiers into keywords.
EDIT:
add "semantic actions"

Writing correct LL(1) grammars?

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetics working. By this, make rules and derivations for +, -, /, * etc and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simper constructs working like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definetly check out other programming language grammars as an reference (unfortunately I did not find in 1 min any full LL grammar for any language online, but LR grammars should be useful as an reference too). For example:
ANSI C grammar
Python grammar
and of course some small examples in Wikipedia about LL grammars Wikipedia LL Parser that you probably have already checked out.
I hope you find some of this stuff useful
There are algorithms both for determining if a grammar is LL(k). Parser generators implement them. There are also heuristics for converting a grammar to LL(k), if possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that you're language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There's two main rules of thumb to making a grammar LL(1)
If of multiple choices can appear at a given point and they can
start with the same token, add a keyword in front telling you which
choice was taken.
If you have an optional or repeated part, make
sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.

When is better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
More generally speaking, I would say that most regular languages could be parsed more quickly by hand (using a regular expression).
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
Grammars tend to evolve, (as do requirements). Home brew parsers are difficult to maintain and lead to re-inventing the wheel example. If you think you can write a quick parser in java, you should know that it would be quicker to use any of the lex/yacc/compiler-compiler solutions. Lexers are easier to write, then you would want your own rule precedence semantics which are not easy to test or maintain. ANTLR also provides an ide for visualising AST, can you beat that mate. Added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.

Resources