Parser library that can handle ambiguity

Parser library that can handle ambiguity - parsing

I'm looking for a mature parser library, either for Scala or Haskell.
The most important point is, that the library can handle ambiguity.
If an expression is ambiguous, I want every possible abstract syntax tree, that matches the expression.
Simple example: The expression a ⊗ b ⊗ c can be seen as (a ⊗ b) ⊗ c or a ⊗ (b ⊗ c), and I need both variants.
Thanks!

I feel like the old guy for remembering when Walder's papers like Comprehending Monads (the precursor to the do notation) were exciting and new. The idea is that you (to quote) replace a failure by a list of successes, meaning maintain a list of all the possible parses. At the end you normally just take the first match, but with this setup, you can take all of them.
These aren't all that efficient for a deterministic parser, which is why they're less in fashion, but they are what you need.
Have a look at polyparse, and in particular Text.ParserCombinators.HuttonMeijer and Text.ParserCombinators.HuttonMeijerWallace.
(Hutton & Meijer translated the parser library to Haskell (from Gofer) and Wallace added extra features.)
Make sure you check it out on simple cases like parsing "aaaa" with
testP = do
a <- many $ char 'a'
b <- many $ char 'a'
return (a,b)
to see if it has the semantics you seek.
You asked for mature. These libraries are part of pure functional programming's heritage! Having said that, I'd call parsec more mature, even though it's younger.
(Speculation: I don't think parsec can do what you want. Its standard choice combinator is deterministic. I haven't looked into tweaking or replacing that behaviour, and I wouldn't want to I'm afraid.)

This question immediately reminded me of the Yacc is dead / No, it's not debate from the end of 2010. The authors of the Yacc is dead paper provide a library in Scala (unmaintained), Haskell and Racket. In the Yacc is alive response, Russ Cox points out that the code runs in exponential time for ambiguous grammars.
It's well-known that it is possible to parse ambiguous grammars in O(n^3), although obviously it can take exponential time to enumerate all the parse trees in the case that there are exponentially many of them -- and there will be in the case of x1 + x2 + x3 ... + xn. bison implements the GLR algorithm which does so; unfortunately, while bison is certainly mature (if not actually moribund), it is written neither in Haskell nor in Scala.
Daniel Spiewak implemented a GLL parser in Scala IIRC, but last time I looked at it, it suffered from some performance issues. So I'm not sure that it could be described as mature, either.

I can't speak to how mature it is or give you any usage examples, but I've had the scala gll-combinators library open in a tab for a few days. It handles ambiguous grammars and looks pretty nifty.

At the end the the choice fell on the Syntax Definition Formalism (SDF2)
with an sdf table generator here
and JSGLR as parser generator.

Related

Is there a parser generator that can use the Wirth syntax?

ie: http://en.wikipedia.org/wiki/Wirth_syntax_notation
It seems like most use BNF / EBNF ...

The distinction made by the Wikipedia article looks to me like it is splitting hairs. "BNF/EBNF" has long meant writing grammar rules in roughly the following form:
nonterminal = right_hand_side end_rule_marker
As with other silly langauge differences ( "{" in C, begin in Pascal, endif vs. fi) you can get very different looking but identical meaning by choosing different, er, syntax for end_rule_marker and what you are allowed to say for the right_hand_side.
Normally people allow literal tokens in (your choice! of) quotes, other nonterminal names, and for EBNF, various "choice" operators typically | or / for alternatives, * or + for "repeat", [ ... ] or ? for optional, etc.
Because people designing language syntax are playing with syntax, they seem to invent their own every time they write some down. (Check the various syntax formalisms in the language standards; none of them are the same). Yes, we'd all be better off if there were one standard way to write this stuff. But we don't do that for C or C++ or C# (not even Microsoft could follow their own standard); why should BNF be any different?
Because people that build parser generators usually use it to parse their own syntax, they can and so easily define their own for each parser generator. I've never seen one that did precisely the "WSN" version at Wikipedia and I doubt I ever will in practice.
Does it matter much? Well, no. What really matters is the power of the parser generator behind the notation. You have to bend most grammars to match the power (well, weakness) of the parser generator; for instance, for LL-style generators, your grammar rules can't have left recursion (WSN allows that according to my reading). People building parser generators also want to make it convenient to express non-parser issues (such as, "how to build tree nodes") and so they also add extra notation for non-parsing issues.
So what really drives the parser generator syntax are weaknesses of the parser generator to handle arbitrary languages, extra goodies deemed valuable for the parser generator, and the mood of the guy implementing it.

Writing correct LL(1) grammars?

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.

I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetics working. By this, make rules and derivations for +, -, /, * etc and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simper constructs working like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definetly check out other programming language grammars as an reference (unfortunately I did not find in 1 min any full LL grammar for any language online, but LR grammars should be useful as an reference too). For example:
ANSI C grammar
Python grammar
and of course some small examples in Wikipedia about LL grammars Wikipedia LL Parser that you probably have already checked out.
I hope you find some of this stuff useful

There are algorithms both for determining if a grammar is LL(k). Parser generators implement them. There are also heuristics for converting a grammar to LL(k), if possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.

So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that you're language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There's two main rules of thumb to making a grammar LL(1)
If of multiple choices can appear at a given point and they can
start with the same token, add a keyword in front telling you which
choice was taken.
If you have an optional or repeated part, make
sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?

Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.

The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.

well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?

The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Parsing rules - how to make them play nice together

So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.

LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.

My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.

What is the shortest way to write parser for my language?

PS.Where to read about parsing theory?

Summary: the shortest is probably Antlr.
Its tempting to go to the Dragon Book to learn about parsing theory. But I don't think the Dragon Book and you have the same idea of what "theory" means. The Dragon Book describes how to built hand-written parsers, parser generators, etc, but you almost certainly want to use a parser-generation tool instead.
A few people have suggested Bison and Flex (or their older versions Yacc and Lex).
Those are the old stalwarts, but they are not very usable tools.
Their documentation is not poor per se, its just that it doesn't quite help in getting dealing with the accidental complexity of using them.
Their internal data is not well encapsulated, and it is very hard to do anything advanced with them. As an example, in phc we still do not have correct line numbers because it is very difficult. They got better when we modified out grammar to include No-op statements, but that is an incredible hack which should not be necessary.
Ostensibly, Bison and Flex work together, but the interface is awkward. Worse, there are many versions of each, which only play nicely with some specific versions of the other. And, last I checked at least, the documentation of which versions went with which was pretty poor.
Writing a recursive descent parser is straightforward, but can be tedious. Antlr can do that for you, and it seems to be a pretty good toolset, with the benefit that what you learn on this project can be applied to lots of other languages and platforms (Antlr is very portable). There are also lots of existing grammars to learn from.
Its not clear what language you're working in, but some languages have excellent parsing frameworks. In particular, the Haskell Parsec Library seems very elegant. If you use C++ you might be tempted to use Spirit. I found it very easy to get started with, and difficult--but still possible--to do advanced things with it. This matches my experience of C++ in general. I say I found it easy to start, but then I had already written a couple of parsers, and studied parsing in compiler class.
Long story short: Antlr, unless you've a very good reason.

It's always a good idea to read the Dragon Book. But be aware that if your language is not trivial, there's not really a "short" way to do it.

It rather depends on your language. Some very simple languages take very little parsing so can be hand-coded; other languages use PEG generators such as Rats! ( PEG is parser expression grammar, which sits between a Regex and a LR parser ) or conventional parser generators such as Antlr and Yacc. Less formal languages require probabilistic techniques such as link grammars.

Write a Recursive Descent Parser. This is sometimes easier than YACC/BISON, and usually more intuitive.

Douglas Crockford has an approachable example of a parser written in JavaScript.

YACC, there are various implementation for different languages.
Good luck with your language ;-)

I used the GOLD Parsing System, because it seemed easier to use than ANTLR for a novice like me, while still being sufficiently-fully-featured for my needs. The web site includes documentation (including an instructions on Writing Grammars, which is half the work) as well as software.

Try Bison for parsing and Flex for lexing
The bison definition of your language is in the form of a context-free grammar. The wikipedia artcile on this topic is quite good, and is probably a good place to start.

Using a parser generator for your host language is the fastest way, combined with parsing theory from a book such as the Dragon Book or the Modern Compiler Construction in {C,ML} series.
If you use C, yacc and the GNU version bison are the standard generators. Antlr is widely used in many languages, supporting Java, C#, and C++ as far as I know. There are also many others in almost any language.
My personal favorite at present is Menhir, an excellent parser generator for OCaml. ML-style languages (Ocaml, Standard ML, etc.) dialects in general are very good for building compilers and interpreters.

ANTLR is the easiest for someone without compiler theory background because of:
ANTLRWORKS (visual parsing and AST debugging)
The ANTLR book (no compiler theory background required)
Just 1 syntax for lexer and parser.

If you are happy with parsing expression grammars, writing your own parsers can be incredibly short. Here is a simple Packrat parser that takes a reasonable subset of PEG:
import functools
class peg_parse:
def __init__(self, grammar):
self.grammar = {k:[tuple(l) for l in rules] for k,rules in grammar.items()}
#functools.lru_cache(maxsize=None)
def unify_key(self, key, text, at=0):
if key not in self.grammar:
return (at + len(key), (key, [])) if text[at:].startswith(key) \
else (at, None)
rules = self.grammar[key]
for rule in rules:
l, res = self.unify_rule(rule, text, at)
if res is not None: return l, (key, res)
return (0, None)
def unify_line(self, parts, text, tfrom):
results = []
for part in parts:
tfrom, res = self.unify_key(part, text, tfrom)
if res is None: return tfrom, None
results.append(res)
return tfrom, results
It accepts grammars of the form of a python dictionary, with nonterminals as keys and alternatives as elements of the array, and each alternative is a sequence of expressions. Below is an example grammar.
term_grammar = {
'expr': [
['term', 'add_op', 'expr'],
['term']],
'term': [
['fact', 'mul_op', 'term'],
['fact']],
'fact': [
['digits'],
['(','expr',')']],
'digits': [
['digit','digits'],
['digit']],
'digit': [[str(i)] for i in list(range(10))],
'add_op': [['+'], ['-']],
'mul_op': [['*'], ['/']]
}
Here is the driver:
import sys
def main(to_parse):
result = peg_parse(term_grammar).unify_key('expr', to_parse)
assert (len(to_parse) - result[0]) == 0
print(result[1])
if __name__ == '__main__': main(sys.argv[1])
Which can be invoked thus:
python3 parser.py '1+2'
('expr',
[('term',
[('fact',
[('digits', [('digit', [('1', [])])])])]),
('add_op', [('+', [])]),
('expr',
[('term', [('fact', [('digits', [('digit', [('2', [])])])])])])])
Parsing Expression Grammars take some care to write: The ordering of alternatives is important (Unlike a Context Free Grammar, the alternatives are an ordered choice, with the first choice being tried first, and second being tried only if the first did not match). However, they can represent all known context free grammars.
If on the other hand, you decide to go with a Context Free Grammar, Earley Parser is one of the simplest.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart