What is the shortest way to write a parser for my language? - parsing

P.S. Where to read about parsing theory?

Summary: the shortest is probably Antlr.
It's tempting to go to the Dragon Book to learn about parsing theory. But I don't think the Dragon Book and you have the same idea of what "theory" means. The Dragon Book describes how to build hand-written parsers, parser generators, etc., but you almost certainly want to use a parser-generation tool instead.
A few people have suggested Bison and Flex (or their older versions Yacc and Lex).
Those are the old stalwarts, but they are not very usable tools.
Their documentation is not poor per se; it's just that it doesn't quite help in dealing with the accidental complexity of using them.
Their internal data is not well encapsulated, and it is very hard to do anything advanced with them. As an example, in phc we still do not have correct line numbers because it is very difficult. They got better when we modified our grammar to include no-op statements, but that is an incredible hack which should not be necessary.
Ostensibly, Bison and Flex work together, but the interface is awkward. Worse, there are many versions of each, which only play nicely with some specific versions of the other. And, last I checked at least, the documentation of which versions went with which was pretty poor.
Writing a recursive descent parser is straightforward, but can be tedious. Antlr can do that for you, and it seems to be a pretty good toolset, with the benefit that what you learn on this project can be applied to lots of other languages and platforms (Antlr is very portable). There are also lots of existing grammars to learn from.
It's not clear what language you're working in, but some languages have excellent parsing frameworks. In particular, the Haskell Parsec library seems very elegant. If you use C++ you might be tempted to use Spirit. I found it very easy to get started with, and difficult--but still possible--to do advanced things with it. This matches my experience of C++ in general. I say I found it easy to start with, but then I had already written a couple of parsers and studied parsing in a compilers class.
Long story short: Antlr, unless you've a very good reason.

It's always a good idea to read the Dragon Book. But be aware that if your language is not trivial, there's not really a "short" way to do it.

It rather depends on your language. Some very simple languages take very little parsing, so they can be hand-coded; other languages use PEG generators such as Rats! (PEG is parsing expression grammar, which sits between a regex and an LR parser) or conventional parser generators such as Antlr and Yacc. Less formal languages require probabilistic techniques such as link grammars.

Write a Recursive Descent Parser. This is sometimes easier than YACC/BISON, and usually more intuitive.
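To give a feel for it, below is a minimal hand-written recursive descent parser for ordinary arithmetic expressions (an illustrative sketch of my own, not taken from any answer here; all names are made up). The repetition loops are the usual way to avoid the left-recursive rules (expr -> expr '+' term) that a naive recursive implementation could not handle.

import re

# Grammar (illustrative):
#   expr   -> term   (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'

def tokenize(source):
    return re.findall(r'\d+|[-+*/()]', source)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def expr(self):
        node = self.term()
        while self.peek() in ('+', '-'):
            op = self.peek(); self.eat(op)
            node = (op, node, self.term())    # build a simple tuple AST
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ('*', '/'):
            op = self.peek(); self.eat(op)
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == '(':
            self.eat('(')
            node = self.expr()
            self.eat(')')
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        self.eat(tok)
        return int(tok)

print(Parser(tokenize("1+2*(3-4)")).expr())   # ('+', 1, ('*', 2, ('-', 3, 4)))

Each nonterminal becomes one method, which is why this style stays readable even for fairly large languages.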

Douglas Crockford has an approachable example of a parser written in JavaScript.

YACC, there are various implementation for different languages.
Good luck with your language ;-)

I used the GOLD Parsing System, because it seemed easier to use than ANTLR for a novice like me, while still being sufficiently full-featured for my needs. The web site includes documentation (including instructions on Writing Grammars, which is half the work) as well as software.

Try Bison for parsing and Flex for lexing.
The Bison definition of your language is in the form of a context-free grammar. The Wikipedia article on this topic is quite good and is probably a good place to start.

Using a parser generator for your host language is the fastest way, combined with parsing theory from a book such as the Dragon Book or the Modern Compiler Implementation in {C, ML} series.
If you use C, yacc and the GNU version bison are the standard generators. Antlr is widely used in many languages, supporting Java, C#, and C++ as far as I know. There are also many others in almost any language.
My personal favorite at present is Menhir, an excellent parser generator for OCaml. ML-style languages (OCaml, Standard ML, etc.) in general are very good for building compilers and interpreters.

ANTLR is the easiest for someone without a compiler-theory background because of:
ANTLRWORKS (visual parsing and AST debugging)
The ANTLR book (no compiler theory background required)
A single syntax for both lexer and parser.

If you are happy with parsing expression grammars, your own parser can be incredibly short. Here is a simple Packrat parser that takes a reasonable subset of PEG:
import functools

class peg_parse:
    def __init__(self, grammar):
        self.grammar = {k: [tuple(l) for l in rules] for k, rules in grammar.items()}

    # Memoizing this method is what makes the parser "packrat".
    @functools.lru_cache(maxsize=None)
    def unify_key(self, key, text, at=0):
        # A key that is not a nonterminal is matched literally against the text.
        if key not in self.grammar:
            return (at + len(key), (key, [])) if text[at:].startswith(key) \
                else (at, None)
        # Try each alternative in order; the first one that matches wins.
        rules = self.grammar[key]
        for rule in rules:
            l, res = self.unify_rule(rule, text, at)
            if res is not None: return l, (key, res)
        return (0, None)

    def unify_rule(self, parts, text, tfrom):
        # Match each part of the alternative in sequence, threading the position.
        results = []
        for part in parts:
            tfrom, res = self.unify_key(part, text, tfrom)
            if res is None: return tfrom, None
            results.append(res)
        return tfrom, results
It accepts grammars in the form of a Python dictionary: nonterminals are the keys, each key maps to a list of alternatives, and each alternative is a sequence of symbols (nonterminals or literal tokens). Below is an example grammar.
term_grammar = {
    'expr': [
        ['term', 'add_op', 'expr'],
        ['term']],
    'term': [
        ['fact', 'mul_op', 'term'],
        ['fact']],
    'fact': [
        ['digits'],
        ['(', 'expr', ')']],
    'digits': [
        ['digit', 'digits'],
        ['digit']],
    'digit': [[str(i)] for i in range(10)],
    'add_op': [['+'], ['-']],
    'mul_op': [['*'], ['/']]
}
Here is the driver:
import sys

def main(to_parse):
    result = peg_parse(term_grammar).unify_key('expr', to_parse)
    assert (len(to_parse) - result[0]) == 0
    print(result[1])

if __name__ == '__main__': main(sys.argv[1])
Which can be invoked thus:
python3 parser.py '1+2'
('expr',
[('term',
[('fact',
[('digits', [('digit', [('1', [])])])])]),
('add_op', [('+', [])]),
('expr',
[('term', [('fact', [('digits', [('digit', [('2', [])])])])])])])
Parsing expression grammars take some care to write: the ordering of alternatives is important (unlike in a context-free grammar, the alternatives are an ordered choice, with the first choice tried first and the second tried only if the first did not match; see the example below). However, they can represent all known context-free grammars.
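For instance, with the peg_parse class and term_grammar above, listing the shorter alternative for 'expr' first changes the result (this snippet is an added illustration, not part of the original answer):

bad_grammar = dict(term_grammar)
# Put the shorter alternative first: the PEG ordered choice now commits to it.
bad_grammar['expr'] = [['term'], ['term', 'add_op', 'expr']]

consumed, tree = peg_parse(bad_grammar).unify_key('expr', '1+2')
print(consumed)   # 1 -- only '1' was consumed; '+2' is left unparsed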
If, on the other hand, you decide to go with a context-free grammar, the Earley parser is one of the simplest.
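For comparison, here is a minimal Earley recognizer in the same dictionary-grammar style (my own sketch, not from the original answer): it only recognizes, does not build a parse tree, and assumes single-character terminals as in term_grammar.

def earley_recognize(grammar, start, tokens):
    # Each item is (lhs, rhs, dot, origin): a partially matched rule.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                       # iterate to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:       # predict a nonterminal
                        for prod in grammar[sym]:
                            new = (sym, tuple(prod), 0, i)
                            if new not in chart[i]:
                                chart[i].add(new)
                                changed = True
                    elif i < len(tokens) and tokens[i] == sym:  # scan a terminal
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:                        # complete a finished rule
                    for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                        if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                            new = (p_lhs, p_rhs, p_dot + 1, p_origin)
                            if new not in chart[i]:
                                chart[i].add(new)
                                changed = True
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[-1])

print(earley_recognize(term_grammar, 'expr', list('1+2')))   # True

Unlike the packrat parser, this treats the alternatives as unordered and handles left-recursive rules as well.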

Related

Parsec or happy (with alex) or uu-parsinglib

I am going to write a parser for the Verilog (or VHDL) language and will do a lot of manipulations (a kind of transformation) of the parsed data. I intend to parse really big files (full Verilog designs, as big as 10K lines) and I will ultimately support most of Verilog. I don't mind typing, but I don't want to rewrite any part of the code whenever I add support for some other rule.
In Haskell, which library would you recommend? I know Haskell and have used Happy before (to play). I feel that there are possibilities in using Parsec for transforming the parsed string within the code (which is a great plus). I have no experience with uu-parsinglib.
So, to parse the full grammar of Verilog/VHDL, which of them is recommended? My main concern is the ease and 'correctness' with which I can manipulate the parsed data at my whim. Speed is not a primary concern.
I personally prefer Parsec with the help of Alex for lexing.
I prefer Parsec over Happy because 1) Parsec is a library, while Happy is a program: with Happy you write your grammar in a separate language and then compile it with Happy. 2) Parsec gives you context-sensitive parsing abilities thanks to its monadic interface. You can keep extra state for context-sensitive parsing and then inspect and decide depending on that state, or just look at a previously parsed value and decide on the next parsers, etc. (like a <- parseSomething; if test a then ... do ...). And when you don't need any context-sensitive information, you can simply use the applicative style and get an implementation like one written with YACC or a similar tool.
As a downside of Parsec, you'll never know if your Parsec parser contains a left recursion, and your parser will get stuck at runtime (because Parsec is basically a top-down recursive-descent parser). You have to find left recursions and eliminate them. YACC-style parsers can give you some static guarantees and information (like shift/reduce conflicts, unused terminals, etc.) that you can't get with Parsec.
Alex is highly recommended for lexing in both situations (I think you have to use Alex if you decide to go with Happy), because even if you use Parsec, it really simplifies your parser implementation and catches a great many bugs too (for example, parsing a keyword as an identifier was a common bug I hit while using Parsec without Alex; it's just one example).
You can have a look at my Lua parser implemented in Alex+Parsec. And here's the code to use Alex-generated tokens in Parsec.
EDIT: Thanks John L for corrections. Apparently you can do context-sensitive parsing with Happy too. Also, Alex for lexing is not required in Happy, though it's recommended.

YAML parsing - lex or hand-rolled?

I am trying to write a simple YAML parser. I read the spec from yaml.org, and before I start I was wondering whether it is better to write a hand-rolled parser or to use lex (flex/bison). I looked at libyaml (the C library); it doesn't seem to use lex/yacc.
YAML (excluding the flow styles) seems to be more line-oriented, so is it easier to write a hand-rolled parser, or to use flex/bison?
Thanks.
This answer is basically an answer to the question "Should I roll my own parser or use a parser generator?" and doesn't have much to do with YAML. But nevertheless it will "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.
I have written, among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples because they go against intuition, and exterior requirements dictated the decision.
Since I have established that on a technical level both are possible, why would you pick one over the other? This is a really hard question to answer; here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use of hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or similar generators make the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes it quite clear what happens in the parser. This makes it easy to gracefully handle hairy conflicts: you can write a simple if, instead of rearranging the entire grammar to be LALR(1).
While developing the parser you can "gloss over details" with a hand-written parser; with bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
This is not a definitive answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should write both types of parser.

What is the advantage of using a parser generator like happy as opposed to using parser combinators?

To learn how to write and parse a context-free grammar I want to choose a tool. For Haskell, there are two big options: Happy, which generates a parser from a grammar description, and Parsec, which allows you to code a parser directly in Haskell.
What are the (dis)advantages of either approach?
External vs internal DSL
The parser specification format for Happy is an external DSL, whereas with Parsec you have the full power of Haskell available when defining your parsers. This means that you can for example write functions to generate parsers, use Template Haskell and so on.
Precedence rules
With Happy, you can use precedences to simplify your grammar, whereas with Parsec you have to nest the grammar rules correctly yourself. Changing the precedence of an operator is therefore much more tedious in Parsec.
Static checking
Happy will warn you about ambiguities in your grammar at compile time. (Though it's not great at telling you where they are.) With Parsec, you get no warning until your parser fails at run time.
This is the traditional decision: do I use lex/yacc (Happy), or do I write my own (mostly recursive descent) parser? Except that the Parsec library is like a DSL for doing the latter right.
If one has experience with the yacc/lex approach, using Happy involves a smaller learning curve.
In my opinion Parsec hides most of the nasty grammar details and lets you write your parsers more intuitively. If you want to learn this stuff in the first place go with some parser-generator like Happy (or even try to implement one yourself).
I'm used to the parser combinator library uu-parsinglib from Utrecht University. You get error correction and permutation parsing for free, as well as the things that Parsec has. I also like it because my implemented grammar looks like an EBNF grammar, without so much monadic stuff, and is easy to read.
Naive parser combinators do not allow left-recursion in grammar rules and I haven't found a library that does.
Happy does allow full BNF in the language spec, and some useful stuff like priority rules. So, for complicated cases Happy and parser generators in general are much better. However, for simple, stupid languages with LL(k)-parseable grammars, I would use a parser combinator library as the more maintainer-friendly option.

Do production compilers use parser generators?

I've heard that "real compiler writers" roll their own handmade parser rather than using parser generators. I've also heard that parser generators don't cut it for real-world languages. Supposedly, there are many special cases that are difficult to implement using a parser generator. I have my doubts about this:
Theoretically, a GLR parser generator should be able to handle most programming language designs (except maybe C++...)
I know of at least one production language that uses a parser generator: Ruby [1].
When I took my compilers class in school, we used a parser generator.
So my question: Is it reasonable to write a production compiler using a parser generator, or is using a parser generator considered a poor design decision by the compiler community?
[1] https://github.com/ruby/ruby/blob/trunk/parse.y
For what it's worth, GCC used a parser generator pre-4.0, I believe, then switched to a hand-written recursive descent parser because it was easier to maintain and extend.
Parser generators DO "cut it" for "real" languages, but the amount of work to transform your grammar into something workable grows exponentially.
Edit: link to the GCC document detailing the change with reasons and benefits vs cost analysis: http://gcc.gnu.org/wiki/New_C_Parser.
I worked for a company for a few years where we were more or less writing compilers. We weren't concerned much with performance; just reducing the amount of work/maintenance. We used a combination of generated parsers + handwritten code to achieve this. The ideal balance is to automate the easy, repetitive parts with the parser generator and then tackle the hard stuff in custom functions.
Sometimes a combination of both methods is used, like generating code with a parser generator and then modifying that code by hand.
Another way is that some scanner (lexer) and parser tools allow you to add custom code in addition to the grammar rules, called "semantic actions". A good example of this is a parser that detects generic identifiers, plus some custom code that transforms specific identifiers into keywords (a rough sketch follows below).
EDIT: added "semantic actions".

Is there a parser generator that can use the Wirth syntax?

i.e.: http://en.wikipedia.org/wiki/Wirth_syntax_notation
It seems like most use BNF / EBNF ...
The distinction made by the Wikipedia article looks to me like it is splitting hairs. "BNF/EBNF" has long meant writing grammar rules in roughly the following form:
nonterminal = right_hand_side end_rule_marker
As with other silly language differences ("{" in C, begin in Pascal, endif vs. fi), you can get very different looking but identical meaning by choosing different, er, syntax for end_rule_marker and what you are allowed to say for the right_hand_side.
Normally people allow literal tokens in (your choice of!) quotes, other nonterminal names, and, for EBNF, various "choice" operators: typically | or / for alternatives, * or + for "repeat", [ ... ] or ? for "optional", etc.
Because people designing language syntax are playing with syntax, they seem to invent their own every time they write some down. (Check the various syntax formalisms in the language standards; none of them are the same). Yes, we'd all be better off if there were one standard way to write this stuff. But we don't do that for C or C++ or C# (not even Microsoft could follow their own standard); why should BNF be any different?
Because people who build parser generators usually use them to parse their own syntax, they can easily define their own notation for each parser generator, and they do. I've never seen one that did precisely the "WSN" version at Wikipedia, and I doubt I ever will in practice.
Does it matter much? Well, no. What really matters is the power of the parser generator behind the notation. You have to bend most grammars to match the power (well, weakness) of the parser generator; for instance, for LL-style generators, your grammar rules can't have left recursion (WSN allows that according to my reading). People building parser generators also want to make it convenient to express non-parser issues (such as, "how to build tree nodes") and so they also add extra notation for non-parsing issues.
So what really drives the parser generator syntax are weaknesses of the parser generator to handle arbitrary languages, extra goodies deemed valuable for the parser generator, and the mood of the guy implementing it.
