Writing correct LL(1) grammars? - parsing

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I have gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation with an LL(1) parser, so I was wondering what the correct (or recommended) approach to writing grammars in LL(1) form is.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.

I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetic working. By this I mean: make rules and derivations for +, -, /, *, etc., and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly (a minimal sketch of this step follows after these points).
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simpler constructs like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
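To make that first step concrete, here is a minimal sketch in Parsec (since that is what you plan to use). The Expr type, the rule names and the restriction to integers, '+' and '*' are my own simplifications, not anything your language has to follow:

import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = Num Integer | Add Expr Expr | Mul Expr Expr
  deriving Show

expr, term, factor :: Parser Expr
expr   = term   `chainl1` (Add <$ char '+')   -- '+' has the lowest precedence
term   = factor `chainl1` (Mul <$ char '*')   -- '*' binds tighter than '+'
factor = Num . read <$> many1 digit
     <|> between (char '(') (char ')') expr

eval :: Expr -> Integer
eval (Num n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b

-- e.g.  fmap eval (parse expr "" "2+3*4")  should give  Right 14

Once something like this round-trips correctly on a handful of inputs, it is much easier to grow the grammar rule by rule.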
Definitely check out other programming languages' grammars as a reference (unfortunately I could not quickly find a full LL grammar for any language online, but LR grammars should be useful as a reference too). For example:
ANSI C grammar
Python grammar
and of course the small examples of LL grammars on the Wikipedia LL parser page, which you have probably already checked out.
I hope you find some of this useful.

There are algorithms for determining whether a grammar is LL(k), and parser generators implement them. There are also heuristics for converting a grammar to LL(k), when that is possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
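For example (my own illustrative fragment, not taken from any particular language): a statement that can be either a label "name:" or an assignment "name = expr" cannot be decided on the identifier alone; you need the second token. In Parsec the usual way to express that extra lookahead is to wrap the first alternative in try, so it can back out after consuming the identifier:

import Text.Parsec
import Text.Parsec.String (Parser)

data Stmt = Label String | Assign String Integer
  deriving Show

identifier :: Parser String
identifier = many1 letter <* spaces

-- both alternatives start with an identifier, so this needs k = 2
stmt :: Parser Stmt
stmt =  try (Label  <$> identifier <* char ':')
    <|>     (Assign <$> identifier <* (char '=' >> spaces) <*> (read <$> many1 digit))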

So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that your language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all the possibilities that can appear at each point, and check whether they can be distinguished by their first token.
There are two main rules of thumb for making a grammar LL(1):
If multiple choices can appear at a given point and they can start with the same token, add a keyword in front telling you which choice was taken.
If you have an optional or repeated part, make sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two rules a lot easier to apply.
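As a tiny illustration of the first rule (the statement forms and keywords here are made up): two statement forms that would otherwise both begin with an identifier become trivially LL(1) once each is fronted by its own keyword. With Parsec's character-level string, no try is needed here because the two alternatives already differ at their first character:

import Text.Parsec
import Text.Parsec.String (Parser)

data Stmt = Assign String Integer | Call String
  deriving Show

identifier :: Parser String
identifier = many1 letter <* spaces

number :: Parser Integer
number = read <$> many1 digit <* spaces

-- "set x 5" vs "call f": the leading keyword decides the alternative,
-- so one token of lookahead is enough and no backtracking is required.
stmt :: Parser Stmt
stmt =  (Assign <$> (string "set "  *> identifier) <*> number)
    <|> (Call   <$> (string "call " *> identifier))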

Related

Can this be parsed by a LALR(1) parser?

I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When the parser reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression, because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above), or to keep it as identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?
Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficulty in creating a correct AST springs from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
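To make the two shapes concrete, here is a sketch in Haskell with made-up constructor names (the actual AST in the question will differ); the only point is where the comma-separated items end up:

data Expr = Id String
          | Slice Expr [Expr]      -- expression[e1, e2]         (slicing)
          | Call  Expr [Expr]      -- [receiver arg1, arg2, ...] (dispatch)
  deriving Show

-- a[b[c],d]   : b[c] and d are sibling children of the slice of a
sliceReading :: Expr
sliceReading = Slice (Id "a") [Slice (Id "b") [Id "c"], Id "d"]

-- [a[b[c],d]] : a is dispatched with a single argument, itself a dispatch
--               in which [c] and d are siblings under b
callReading :: Expr
callReading = Call (Id "a") [Call (Id "b") [Call (Id "c") [], Id "d"]]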
That's all a bit hand-wavey, since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is not really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Normally, if you want to describe your language, you are better off using a clear CFG than trying to describe it informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description, and omitting that makes it harder for others to "be helpful".

YAML parsing - lex or hand-rolled?

I am trying to write a simple YAML parser. I read the spec from yaml.org, and before I start I was wondering whether it is better to write a hand-rolled parser or to use lex (flex/bison). I looked at libyaml (the C library); it doesn't seem to use lex/yacc.
YAML (excluding the flow styles) seems to be more line-oriented, so is it easier to write a hand-rolled parser, or to use flex/bison?
Thanks.
This answer is basically an answer to the question "Should I roll my own parser or use a parser generator?" and does not have much to do with YAML specifically. But nevertheless it will "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.
I have written, among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples because they go against intuition: external requirements dictated the decision.
Since I have established that either is possible on a technical level, why would you pick one over the other? This is a really hard question to answer; here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little point in hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or similar generators makes the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes it quite clear what happens in the parser. This provides an easy means of gracefully handling hairy conflicts: you can write a simple if, instead of rearranging the entire grammar to be LALR(1). (A toy sketch of this follows after these points.)
While developing the parser you can "gloss over details" with a hand-written parser; with bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
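To illustrate the "simple if" point, here is a toy hand-rolled fragment; the token and node types are invented for the example and it is nowhere near a real YAML parser. The parser simply pattern-matches on the next two tokens and branches, a decision that a generated parser has to encode in the grammar itself:

-- Invented token and node types, just for illustration.
data Token = TIdent String | TNumber Int | TColon | TDash
  deriving (Eq, Show)

data Node = Scalar Token | KeyValue String Node
  deriving Show

-- Parse one node from a token stream, returning it with the remaining tokens.
parseNode :: [Token] -> Maybe (Node, [Token])
parseNode (TIdent k : TColon : rest) =          -- lookahead: identifier followed by ':'
  do (v, rest') <- parseNode rest               -- ...means this is a key/value pair
     Just (KeyValue k v, rest')
parseNode (t : rest) = Just (Scalar t, rest)    -- anything else is treated as a scalar
parseNode []         = Nothing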
This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should write both types of parser.

What is the advantage of using a parser generator like happy as opposed to using parser combinators?

To learn how to write and parse a context-free grammar I want to choose a tool. For Haskell, there are two big options: Happy, which generates a parser from a grammar description, and Parsec, which allows you to code a parser directly in Haskell.
What are the (dis)advantages of either approach?
External vs internal DSL
The parser specification format for Happy is an external DSL, whereas with Parsec you have the full power of Haskell available when defining your parsers. This means that you can for example write functions to generate parsers, use Template Haskell and so on.
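For instance (the helper below is my own sketch, not part of Parsec's API): because a Parsec parser is an ordinary Haskell value, you can write a plain function that builds a parser from data, something an external DSL cannot express directly:

import Data.List (sortOn)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Build a keyword parser from a plain list of strings.  Longer keywords are
-- tried first so that e.g. "elseif" is not swallowed as "else", and
-- notFollowedBy keeps "iffy" from matching "if".
keywordOf :: [String] -> Parser String
keywordOf kws =
  choice [ try (string kw <* notFollowedBy letter)
         | kw <- sortOn (negate . length) kws ]

-- Usage with a made-up keyword set:
keyword :: Parser String
keyword = keywordOf ["if", "then", "else", "elseif", "while"]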
Precedence rules
With Happy, you can use precedences to simplify your grammar, whereas with Parsec you have to nest the grammar rules correctly yourself. Changing the precedence of an operator is therefore much more tedious in Parsec.
Static checking
Happy will warn you about ambiguities in your grammar at compile time. (Though it's not great at telling you where they are.) With Parsec, you get no warning until your parser fails at run time.
This is the traditional decision: do I use lex/yacc (Happy), or do I write my own (mostly recursive descent) parser? The difference is that the Parsec library is like a DSL for doing the latter properly.
If one has experience with the yacc/lex approach, using Happy will involve a smaller learning curve.
In my opinion Parsec hides most of the nasty grammar details and lets you write your parsers more intuitively. If you want to learn this stuff in the first place, go with a parser generator like Happy (or even try to implement one yourself).
I'm used to the parser combinator library uu-parsinglib from Utrecht University. One gets error correction and permutation parsing for free, as well as the things that Parsec has. I also like it because my implemented grammar looks like an EBNF grammar, without so much monadic stuff, and is easy to read.
Naive parser combinators do not allow left-recursion in grammar rules and I haven't found a library that does.
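The standard workaround (sketched below with made-up names) is to rewrite the left-recursive rule expr -> expr '+' term | term as "a term followed by zero or more ('+' term) tails" and rebuild the left-associative tree with a fold; Parsec's chainl1 packages exactly this pattern:

import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = Num Integer | Add Expr Expr
  deriving Show

term :: Parser Expr
term = Num . read <$> many1 digit

-- expr -> expr '+' term | term, with the left recursion turned into iteration
expr :: Parser Expr
expr = do
  first <- term
  tails <- many (char '+' *> term)
  return (foldl Add first tails)     -- foldl keeps the tree left-associative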
Happy does allow full BNF in the language spec, and some useful stuff like priority rules. So, for complicated cases Happy and parser generators in general are much better. However, in the case of simple, stupid languages with LL(k)-parseable grammars, I would use a parser combinator library as the more maintainer-friendly option.

Is there a parser generator that can use the Wirth syntax?

ie: http://en.wikipedia.org/wiki/Wirth_syntax_notation
It seems like most use BNF / EBNF ...
The distinction made by the Wikipedia article looks to me like it is splitting hairs. "BNF/EBNF" has long meant writing grammar rules in roughly the following form:
nonterminal = right_hand_side end_rule_marker
As with other silly language differences ("{" in C, begin in Pascal, endif vs. fi), you can get very different-looking but identical meaning by choosing different, er, syntax for end_rule_marker and for what you are allowed to say in the right_hand_side.
Normally people allow literal tokens in (your choice! of) quotes, other nonterminal names, and, for EBNF, various "choice" operators: typically | or / for alternatives, * or + for repetition, [ ... ] or ? for optional parts, etc.
Because people designing language syntax are playing with syntax, they seem to invent their own every time they write some down. (Check the various syntax formalisms in the language standards; none of them are the same). Yes, we'd all be better off if there were one standard way to write this stuff. But we don't do that for C or C++ or C# (not even Microsoft could follow their own standard); why should BNF be any different?
Because the people who build parser generators usually use them to parse their own syntax, they can, and do, easily define their own notation for each parser generator. I've never seen one that did precisely the "WSN" version on Wikipedia, and I doubt I ever will in practice.
Does it matter much? Well, no. What really matters is the power of the parser generator behind the notation. You have to bend most grammars to match the power (well, weakness) of the parser generator; for instance, for LL-style generators, your grammar rules can't have left recursion (WSN allows that according to my reading). People building parser generators also want to make it convenient to express non-parser issues (such as, "how to build tree nodes") and so they also add extra notation for non-parsing issues.
So what really drives the parser generator syntax are weaknesses of the parser generator to handle arbitrary languages, extra goodies deemed valuable for the parser generator, and the mood of the guy implementing it.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammar classes is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
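For instance, here is the dangling-else case written directly with Parsec; the keywords and the toy Stmt type are invented for the example. The classic two-rule grammar (if-then and if-then-else) has both alternatives starting with 'if', and even the left-factored version has a FIRST/FOLLOW conflict because 'else' can also follow a statement. A combinator parser can still state the rule the "natural" way; optionMaybe simply attaches a dangling 'else' to the innermost 'if', which is the conventional resolution:

import Text.Parsec
import Text.Parsec.String (Parser)

data Stmt = If String Stmt (Maybe Stmt) | Say String
  deriving Show

word :: String -> Parser ()          -- deliberately simplistic keyword matcher
word w = try (string w) *> spaces

name :: Parser String
name = many1 letter <* spaces

-- stmt -> 'if' name 'then' stmt ['else' stmt] | 'say' name
stmt :: Parser Stmt
stmt =  (If  <$> (word "if" *> name)
             <*> (word "then" *> stmt)
             <*> optionMaybe (word "else" *> stmt))
    <|> (Say <$> (word "say" *> name))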
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser: an LL(k) parser has to commit to B or C before consuming the run of x's, but the distinguishing y or z only appears at the very end (and A is left-recursive, which an LL grammar cannot express directly), whereas an LR parser can shift all the x's and decide once it sees the final token. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values, and most people no longer care.
I hope this is more or less in line with your question. Of course, Knuth showed that any deterministic context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left to right. But consider the pattern below:
. . . . . | . . . . .
I rather expect that with short patterns such as this one, people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather process the pattern in parallel, or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that an LL/LR parser employs.
Furthermore, if we can describe any deterministic context-free language using an LR(1) grammar, then it is clear that simply recognizing a string is not the same as "understanding" it.
Well, for one, left-recursive definitions are impossible in LL(k) grammars (as far as I know); I don't know about other constructs. This doesn't make things impossible to define, it just makes them a massive pain to express otherwise. For instance, putting together expressions can be easy in a left-recursive grammar (in pseudocode):
rule expression = <other rules>
                | expression
                | '(' expression ')' ;
As for syntactically useful things that can be made with left recursion, um, do simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Resources