Alternative parsing methods - parsing

I know something about regular expressions, parse trees and abstract syntax trees. But once I read there is still another parsing technique that, as far as I remember, people from SO used to re-implement its markdown parser.
What I don't recall is the name of this method, or how it did work. Do you? If not, what it could be?

Maybe you're thinking of Parsing Expression Grammars?
(If I'm remembering the same thing you're remembering, it's cletus writing about this here.)

Here's a blog about SO's markdown parser: https://blog.stackoverflow.com/2009/12/introducing-markdownsharp/
Here's the source: http://code.google.com/p/markdownsharp/
It does use advanced regular expressions. I'm not aware of any "other" technique of parsing. The most common solutions for parsing used by virtually all programmers are:
Regular expressions (or finite state machines) for regular grammars.
Non-deterministic pushdown automata for context-free grammars. This is where you get parser generators like yacc, bison, ANTLR, etc.
See also the Chomsky hierarchy of formal grammars.

Related

What is the advantage of using a parser generator like happy as opposed to using parser combinators?

To learn how to write and parse a context-free grammar I want to choose a tool. For Haskell, there are two big options: Happy, which generates a parser from a grammar description and *Parsec, which allows you to directly code a parser in Haskell.
What are the (dis)advantages of either approach?
External vs internal DSL
The parser specification format for Happy is an external DSL, whereas with Parsec you have the full power of Haskell available when defining your parsers. This means that you can for example write functions to generate parsers, use Template Haskell and so on.
Precedence rules
With Happy, you can use precedences to simplify your grammar, whereas with Parsec you have to nest the grammar rules correctly yourself. Changing the precedence of an operator is therefore much more tedious in Parsec.
Static checking
Happy will warn you about ambiguities in your grammar at compile time. (Though it's not great at telling you where they are.) With Parsec, you get no warning until your parser fails at run time.
This is the traditional decision: do I use lex/yacc (happy) or do I write my own (mostly recursive descent) parser, only that the parsec library is like a DSL for doing it right.
If one has experience with the yacc/lex approach, using happy will be a smaller learning curve.
In my opinion Parsec hides most of the nasty grammar details and lets you write your parsers more intuitively. If you want to learn this stuff in the first place go with some parser-generator like Happy (or even try to implement one yourself).
I'm used to the parser combinator library uu-parsinglib from utrecht university. One can have error correcting and permutations for free, and also the things that parsec has. I also like it because my implemented grammar looks like an EBNF grammar, without so much monadic stuff, and is easy to read.
Naive parser combinators do not allow left-recursion in grammar rules and I haven't found a library that does.
Happy does allow full BNF in language spec, and some useful staff like priority rules. So, for complicated cases Happy and parser generators in general are much better. However, in case of simple, stupid languages with LL(k) parseable grammars, I would use a parser combinator library as more maintainer-friendly.

Writing correct LL(1) grammars?

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetics working. By this, make rules and derivations for +, -, /, * etc and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simper constructs working like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definetly check out other programming language grammars as an reference (unfortunately I did not find in 1 min any full LL grammar for any language online, but LR grammars should be useful as an reference too). For example:
ANSI C grammar
Python grammar
and of course some small examples in Wikipedia about LL grammars Wikipedia LL Parser that you probably have already checked out.
I hope you find some of this stuff useful
There are algorithms both for determining if a grammar is LL(k). Parser generators implement them. There are also heuristics for converting a grammar to LL(k), if possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that you're language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There's two main rules of thumb to making a grammar LL(1)
If of multiple choices can appear at a given point and they can
start with the same token, add a keyword in front telling you which
choice was taken.
If you have an optional or repeated part, make
sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.

Resources on parsing expressions

I'm writing a program that needs to build syntax trees from expressions in prefix notation. What resources would you recommend I look into to learn about parsing expressions?
Your question is rather broad. I'd look into anything dealing with the following:
Parsing expression-grammar
BNFs, EBNF`s
Recursive-Descent Parsers
Operator-precedence parser (basically prefix parser)
Polish notation (talks about prefix notation)
Your best bet is to try and understand BNF's and EBNF's. From there you can go on to writing recursive-descent parsers (they can be created easily from your grammars with a few simple rules).
This page here talks about recursive-descent parsing using BNF's.
In addition to the stuff already listed, I'd recomment to consider the following:
http://en.wikipedia.org/wiki/Parsing_expression_grammar

When is better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
More generally speaking, I would say that most regular languages could be parsed more quickly by hand (using a regular expression).
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
Grammars tend to evolve, (as do requirements). Home brew parsers are difficult to maintain and lead to re-inventing the wheel example. If you think you can write a quick parser in java, you should know that it would be quicker to use any of the lex/yacc/compiler-compiler solutions. Lexers are easier to write, then you would want your own rule precedence semantics which are not easy to test or maintain. ANTLR also provides an ide for visualising AST, can you beat that mate. Added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.

What is the shortest way to write parser for my language?

PS.Where to read about parsing theory?
Summary: the shortest is probably Antlr.
Its tempting to go to the Dragon Book to learn about parsing theory. But I don't think the Dragon Book and you have the same idea of what "theory" means. The Dragon Book describes how to built hand-written parsers, parser generators, etc, but you almost certainly want to use a parser-generation tool instead.
A few people have suggested Bison and Flex (or their older versions Yacc and Lex).
Those are the old stalwarts, but they are not very usable tools.
Their documentation is not poor per se, its just that it doesn't quite help in getting dealing with the accidental complexity of using them.
Their internal data is not well encapsulated, and it is very hard to do anything advanced with them. As an example, in phc we still do not have correct line numbers because it is very difficult. They got better when we modified out grammar to include No-op statements, but that is an incredible hack which should not be necessary.
Ostensibly, Bison and Flex work together, but the interface is awkward. Worse, there are many versions of each, which only play nicely with some specific versions of the other. And, last I checked at least, the documentation of which versions went with which was pretty poor.
Writing a recursive descent parser is straightforward, but can be tedious. Antlr can do that for you, and it seems to be a pretty good toolset, with the benefit that what you learn on this project can be applied to lots of other languages and platforms (Antlr is very portable). There are also lots of existing grammars to learn from.
Its not clear what language you're working in, but some languages have excellent parsing frameworks. In particular, the Haskell Parsec Library seems very elegant. If you use C++ you might be tempted to use Spirit. I found it very easy to get started with, and difficult--but still possible--to do advanced things with it. This matches my experience of C++ in general. I say I found it easy to start, but then I had already written a couple of parsers, and studied parsing in compiler class.
Long story short: Antlr, unless you've a very good reason.
It's always a good idea to read the Dragon Book. But be aware that if your language is not trivial, there's not really a "short" way to do it.
It rather depends on your language. Some very simple languages take very little parsing so can be hand-coded; other languages use PEG generators such as Rats! ( PEG is parser expression grammar, which sits between a Regex and a LR parser ) or conventional parser generators such as Antlr and Yacc. Less formal languages require probabilistic techniques such as link grammars.
Write a Recursive Descent Parser. This is sometimes easier than YACC/BISON, and usually more intuitive.
Douglas Crockford has an approachable example of a parser written in JavaScript.
YACC, there are various implementation for different languages.
Good luck with your language ;-)
I used the GOLD Parsing System, because it seemed easier to use than ANTLR for a novice like me, while still being sufficiently-fully-featured for my needs. The web site includes documentation (including an instructions on Writing Grammars, which is half the work) as well as software.
Try Bison for parsing and Flex for lexing
The bison definition of your language is in the form of a context-free grammar. The wikipedia artcile on this topic is quite good, and is probably a good place to start.
Using a parser generator for your host language is the fastest way, combined with parsing theory from a book such as the Dragon Book or the Modern Compiler Construction in {C,ML} series.
If you use C, yacc and the GNU version bison are the standard generators. Antlr is widely used in many languages, supporting Java, C#, and C++ as far as I know. There are also many others in almost any language.
My personal favorite at present is Menhir, an excellent parser generator for OCaml. ML-style languages (Ocaml, Standard ML, etc.) dialects in general are very good for building compilers and interpreters.
ANTLR is the easiest for someone without compiler theory background because of:
ANTLRWORKS (visual parsing and AST debugging)
The ANTLR book (no compiler theory background required)
Just 1 syntax for lexer and parser.
If you are happy with parsing expression grammars, writing your own parsers can be incredibly short. Here is a simple Packrat parser that takes a reasonable subset of PEG:
import functools
class peg_parse:
def __init__(self, grammar):
self.grammar = {k:[tuple(l) for l in rules] for k,rules in grammar.items()}
#functools.lru_cache(maxsize=None)
def unify_key(self, key, text, at=0):
if key not in self.grammar:
return (at + len(key), (key, [])) if text[at:].startswith(key) \
else (at, None)
rules = self.grammar[key]
for rule in rules:
l, res = self.unify_rule(rule, text, at)
if res is not None: return l, (key, res)
return (0, None)
def unify_line(self, parts, text, tfrom):
results = []
for part in parts:
tfrom, res = self.unify_key(part, text, tfrom)
if res is None: return tfrom, None
results.append(res)
return tfrom, results
It accepts grammars of the form of a python dictionary, with nonterminals as keys and alternatives as elements of the array, and each alternative is a sequence of expressions. Below is an example grammar.
term_grammar = {
'expr': [
['term', 'add_op', 'expr'],
['term']],
'term': [
['fact', 'mul_op', 'term'],
['fact']],
'fact': [
['digits'],
['(','expr',')']],
'digits': [
['digit','digits'],
['digit']],
'digit': [[str(i)] for i in list(range(10))],
'add_op': [['+'], ['-']],
'mul_op': [['*'], ['/']]
}
Here is the driver:
import sys
def main(to_parse):
result = peg_parse(term_grammar).unify_key('expr', to_parse)
assert (len(to_parse) - result[0]) == 0
print(result[1])
if __name__ == '__main__': main(sys.argv[1])
Which can be invoked thus:
python3 parser.py '1+2'
('expr',
[('term',
[('fact',
[('digits', [('digit', [('1', [])])])])]),
('add_op', [('+', [])]),
('expr',
[('term', [('fact', [('digits', [('digit', [('2', [])])])])])])])
Parsing Expression Grammars take some care to write: The ordering of alternatives is important (Unlike a Context Free Grammar, the alternatives are an ordered choice, with the first choice being tried first, and second being tried only if the first did not match). However, they can represent all known context free grammars.
If on the other hand, you decide to go with a Context Free Grammar, Earley Parser is one of the simplest.

Resources