Resources on parsing expressions

I'm writing a program that needs to build syntax trees from expressions in prefix notation. What resources would you recommend I look into to learn about parsing expressions?

Your question is rather broad. I'd look into anything dealing with the following:
Parsing expression grammars
BNF and EBNF
Recursive-descent parsers
Operator-precedence parsers
Polish notation (i.e., prefix notation)
Your best bet is to start by understanding BNF and EBNF. From there you can go on to writing recursive-descent parsers (they can be derived from your grammars with a few simple rules).
This page here talks about recursive-descent parsing using BNF.
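To make the recursive-descent idea concrete, here is a minimal sketch of a parser that builds a syntax tree from prefix-notation arithmetic such as `+ 1 * 2 3` (Haskell used for illustration; the `Expr` type and the token handling are my own assumptions, not taken from the resources above):

```haskell
-- Minimal recursive-descent parser for prefix-notation arithmetic.
-- Grammar (EBNF-ish):  expr ::= number | op expr expr
data Expr = Num Int
          | BinOp Char Expr Expr
          deriving Show

-- Parse one expression from a list of tokens, returning the tree
-- and whatever tokens remain.
parseExpr :: [String] -> Maybe (Expr, [String])
parseExpr []         = Nothing
parseExpr (tok:rest)
  | tok `elem` ["+", "-", "*", "/"] = do
      (lhs, rest')  <- parseExpr rest
      (rhs, rest'') <- parseExpr rest'
      Just (BinOp (head tok) lhs rhs, rest'')
  | all (`elem` ['0'..'9']) tok     = Just (Num (read tok), rest)
  | otherwise                       = Nothing

main :: IO ()
main = print (parseExpr (words "+ 1 * 2 3"))
-- Just (BinOp '+' (Num 1) (BinOp '*' (Num 2) (Num 3)),[])
```

Each alternative of the grammar is decided by looking at a single token, which is what makes the translation from grammar to recursive-descent code so mechanical.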

In addition to the resources already listed, I'd recommend considering the following:
http://en.wikipedia.org/wiki/Parsing_expression_grammar

Related

LALR(1) parser generator for scala

I know that it's possible to use, for example, Bison-generated Java files in a Scala project, but are there any native "grammar to Scala" LALR(1) generators?
Another plug here: ScalaBison is close to LALR(1) and lets you use Scala in the actions.
I'm not really answering the original question, and please excuse the plug, but you may be interested in our sbt-rats plugin for the sbt tool. It uses the Rats! parser generator for Java, but makes it easier to use from Scala.
Rats! uses parsing expression grammars as its syntax description formalism, not context-free grammars and definitely not LALR(1) grammars. sbt-rats also has a high-level syntax definition language that in most cases means you do not need to write semantic actions to get a syntax tree that represents your input. The plugin will optionally generate case classes for the tree representation and a pretty-printer for the tree structure.

What is the advantage of using a parser generator like happy as opposed to using parser combinators?

To learn how to write and parse a context-free grammar I want to choose a tool. For Haskell, there are two big options: Happy, which generates a parser from a grammar description, and Parsec, which allows you to code a parser directly in Haskell.
What are the (dis)advantages of either approach?
External vs internal DSL
The parser specification format for Happy is an external DSL, whereas with Parsec you have the full power of Haskell available when defining your parsers. This means that you can for example write functions to generate parsers, use Template Haskell and so on.
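To make the "full power of Haskell" point concrete, here is a hedged sketch of an ordinary Haskell function that manufactures Parsec parsers, something an external DSL like Happy's grammar format cannot express directly (the `keywordThen` helper and the example are my own invention, not from the answer):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- An ordinary Haskell function that *returns* a parser.
keywordThen :: String -> Parser a -> Parser a
keywordThen kw body = string kw *> spaces *> body

-- Reuse it to build concrete parsers from one definition.
letBinding :: Parser (String, Integer)
letBinding = keywordThen "let" $ do
  name <- many1 letter
  _    <- spaces *> char '=' *> spaces
  val  <- read <$> many1 digit
  return (name, val)

main :: IO ()
main = print (parse letBinding "" "let x = 42")
-- Right ("x",42)
```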
Precedence rules
With Happy, you can use precedences to simplify your grammar, whereas with Parsec you have to nest the grammar rules correctly yourself. Changing the precedence of an operator is therefore much more tedious in Parsec.
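For example, in a combinator parser you typically encode precedence by layering the grammar, so changing an operator's precedence means moving it between levels. A minimal sketch of that hand-nested style (my own example; Parsec also ships a Text.Parsec.Expr module with an operator table, but the nesting below is what the answer refers to):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Precedence is baked into the nesting: expr handles '+', while the
-- tighter-binding '*' lives one level down in term. To change the
-- precedence of an operator, you must move it to a different level.
expr, term, factor :: Parser Int
expr   = term   `chainl1` (char '+' >> return (+))
term   = factor `chainl1` (char '*' >> return (*))
factor = read <$> many1 digit <|> between (char '(') (char ')') expr

main :: IO ()
main = print (parse expr "" "1+2*3")  -- Right 7
```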
Static checking
Happy will warn you about ambiguities in your grammar at compile time. (Though it's not great at telling you where they are.) With Parsec, you get no warning until your parser fails at run time.
This is the traditional decision: do I use lex/yacc (here, Happy), or do I write my own (mostly recursive-descent) parser? The difference is that the Parsec library is like a DSL for doing the latter properly.
If one has experience with the yacc/lex approach, Happy will present a smaller learning curve.
In my opinion Parsec hides most of the nasty grammar details and lets you write your parsers more intuitively. If you want to learn this stuff in the first place, go with a parser generator like Happy (or even try to implement one yourself).
I'm used to the parser combinator library uu-parsinglib from Utrecht University. You get error correction and permutation parsing for free, along with the things that Parsec has. I also like it because my implemented grammar looks like an EBNF grammar, without so much monadic stuff, and is easy to read.
Naive parser combinators do not allow left recursion in grammar rules, and I haven't found a library that does.
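The reason is easy to see: transliterating a left-recursive rule such as `expr ::= expr '-' term | term` into combinators calls `expr` again before consuming any input, so the parser loops forever. The standard workaround is to eliminate the left recursion by iterating and folding, as in this hedged Parsec sketch (my own example):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- BAD:  expr = expr <* char '-' ... would recurse before consuming
-- a token and never terminate.
-- GOOD: parse "term ('-' term)*" and fold the results back into a
-- left-associative value.
expr :: Parser Int
expr = do
  first <- term
  rest  <- many (char '-' *> term)
  return (foldl (-) first rest)

term :: Parser Int
term = read <$> many1 digit

main :: IO ()
main = print (parse expr "" "10-3-2")  -- Right 5, i.e. (10-3)-2
```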
Happy does allow full BNF in the language spec, plus some useful features like precedence rules, so for complicated cases Happy and parser generators in general are much better. However, for simple languages with LL(k)-parseable grammars, I would use a parser combinator library as the more maintainer-friendly choice.

Writing correct LL(1) grammars?

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I have gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation with an LL(1) parser. Therefore, I was wondering about the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetic working. By this I mean: make rules and derivations for +, -, /, *, etc., and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different inputs to ensure that it does the arithmetic correctly (see the sketch after this list).
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simpler constructs like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
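As a hedged illustration of that first step (the `Ast` type and `eval` function are my own guesses, not from the answer), the abstract syntax tree for the arithmetic subset and an evaluator to test it against might look like this in Haskell:

```haskell
-- A minimal AST for the arithmetic subset, plus an evaluator to
-- sanity-check whatever trees the parser produces.
data Ast = Lit Double
         | Add Ast Ast
         | Sub Ast Ast
         | Mul Ast Ast
         | Div Ast Ast
         deriving Show

eval :: Ast -> Double
eval (Lit n)   = n
eval (Add a b) = eval a + eval b
eval (Sub a b) = eval a - eval b
eval (Mul a b) = eval a * eval b
eval (Div a b) = eval a / eval b

-- e.g. the tree for 1 + 2 * 3 should evaluate to 7:
main :: IO ()
main = print (eval (Add (Lit 1) (Mul (Lit 2) (Lit 3))))
```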
Definitely check out other programming language grammars as a reference (unfortunately I did not quickly find any full LL grammar for a real language online, but LR grammars should be useful as a reference too). For example:
ANSI C grammar
Python grammar
and of course the small examples of LL grammars in the Wikipedia article on LL parsers, which you have probably already checked out.
I hope you find some of this stuff useful.
There are algorithms for determining whether a grammar is LL(k); parser generators implement them. There are also heuristics for converting a grammar to LL(k), where possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
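For instance (an invented example, not from the answer): if a statement can begin either with `x = ...` (an assignment) or `x()` (a call), both alternatives start with an identifier, so one token of lookahead cannot separate them; you need k = 2. In Parsec you buy that extra lookahead explicitly with `try`:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Stmt = Assign String Int | Call String
  deriving Show

ident :: Parser String
ident = many1 letter <* spaces

-- Both alternatives start with an identifier, so the grammar is not
-- LL(1): only the *second* token reveals the branch. `try` backtracks
-- if the '=' never shows up.
stmt :: Parser Stmt
stmt = try assign <|> call
  where
    assign = Assign <$> ident <*> (char '=' *> spaces *> (read <$> many1 digit))
    call   = Call   <$> (ident <* char '(' <* char ')')

main :: IO ()
main = do
  print (parse stmt "" "x = 42")  -- Right (Assign "x" 42)
  print (parse stmt "" "x ()")    -- Right (Call "x")
```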
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that your language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's OK, your next step is to mentally step through the grammar, imagine all the possibilities that can appear at each point, and check whether they can be distinguished by their first token.
There are two main rules of thumb for making a grammar LL(1):
If multiple choices can appear at a given point and they can start with the same token, add a keyword in front telling you which choice was taken (see the sketch below).
If you have an optional or repeated part, make sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.
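As a hedged sketch of the first rule (the statement forms are invented for the example): once every alternative is announced by its own keyword, the first token alone decides the branch, which is exactly the LL(1) property:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Stmt = While String | Return String | Print String
  deriving Show

ident :: Parser String
ident = spaces *> many1 letter

-- Every alternative starts with a distinct keyword, so one token of
-- lookahead commits to a branch and no backtracking is needed.
stmt :: Parser Stmt
stmt =  (While  <$> (string "while"  *> ident))
    <|> (Return <$> (string "return" *> ident))
    <|> (Print  <$> (string "print"  *> ident))

main :: IO ()
main = print (parse stmt "" "return x")  -- Right (Return "x")
```

Note that these keywords begin with distinct characters; with shared prefixes (say `print` and `println`), Parsec's `string` would consume input before failing and you would need `try` again.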

Alternative parsing methods

I know something about regular expressions, parse trees, and abstract syntax trees. But I once read that there is still another parsing technique which, as far as I remember, the Stack Overflow team used to re-implement their Markdown parser.
What I don't recall is the name of this method, or how it worked. Do you? If not, what could it be?
Maybe you're thinking of Parsing Expression Grammars?
(If I'm remembering the same thing you're remembering, it's cletus writing about this here.)
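To make the PEG idea concrete (a hedged aside; the toy grammar is invented): a PEG's choice operator `/` is ordered and backtracking, unlike the unordered `|` of a context-free grammar, and Parsec approximates it with `try`:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- PEG:  greeting <- "hello world" / "hello"
-- Ordered choice: commit to the first alternative; only if it fails
-- do we backtrack (via `try`) and attempt the second.
greeting :: Parser String
greeting = try (string "hello world") <|> string "hello"

main :: IO ()
main = do
  print (parse greeting "" "hello world")  -- Right "hello world"
  print (parse greeting "" "hello")        -- Right "hello"
```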
Here's a blog about SO's markdown parser: https://blog.stackoverflow.com/2009/12/introducing-markdownsharp/
Here's the source: http://code.google.com/p/markdownsharp/
It does use advanced regular expressions. I'm not aware of any "other" technique of parsing. The most common solutions for parsing used by virtually all programmers are:
Regular expressions (or finite state machines) for regular grammars.
Non-deterministic pushdown automata for context-free grammars. This is where you get parser generators like yacc, bison, ANTLR, etc.
See also the Chomsky hierarchy of formal grammars.

When is better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
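For scale, here is a hedged sketch of what "writing it by hand" might look like (in Haskell rather than Java, and the `Node` type and nesting rules are guesses about the DSL, not a specification): a recursive-descent parser for lines like the one above fits in a couple of dozen lines.

```haskell
-- Hand-rolled recursive-descent parser for the DSL line above.
-- Grammar guess:  call ::= name arg*    arg ::= word | '(' call ')'
data Node = Word String
          | CallN String [Node]   -- function name and arguments
          deriving Show

-- Split '(' and ')' off as their own tokens, then split on spaces.
tokenize :: String -> [String]
tokenize = words . concatMap pad
  where pad c = if c `elem` "()" then [' ', c, ' '] else [c]

parseCall :: [String] -> (Node, [String])
parseCall (name:rest) = let (args, rest') = parseArgs rest
                        in (CallN name args, rest')
parseCall []          = error "expected a function name"

parseArgs :: [String] -> ([Node], [String])
parseArgs []         = ([], [])
parseArgs ts@(")":_) = ([], ts)                 -- caller consumes ')'
parseArgs ("(":rest) = let (call, rest') = parseCall rest
                           rest''        = drop 1 rest'  -- skip ')'
                           (more, rest3) = parseArgs rest''
                       in (call : more, rest3)
parseArgs (t:rest)   = let (more, rest') = parseArgs rest
                       in (Word t : more, rest')

main :: IO ()
main = print (fst (parseCall (tokenize "funcA Type1 a (funcB Type1 b) ReturnType c")))
```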
More generally speaking, I would say that most regular languages can be parsed more quickly with a hand-written regular expression than with a generated parser.
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
Grammars tend to evolve (as do requirements). Home-brew parsers are difficult to maintain and lead to reinventing the wheel. If you think you can write a quick parser in Java, you should know that it would be quicker to use any of the lex/yacc/compiler-compiler solutions. Lexers are easier to write, but then you would want your own rule-precedence semantics, which are not easy to test or maintain. ANTLR also provides an IDE for visualising the AST; can you beat that, mate? An added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.
