LALR(1) parser generator for Scala

I know that it's possible to use, for example, bison-generated Java files in a Scala project, but are there any native "grammar to Scala" LALR(1) generators?

Another plug here: ScalaBison is close to LALR(1) and lets you use Scala in the actions.

I'm not really answering the original question, and please excuse the plug, but you may be interested in our sbt-rats plugin for the sbt tool. It uses the Rats! parser generator for Java, but makes it easier to use from Scala.
Rats! uses parsing expression grammars as its syntax description formalism, not context-free grammars and definitely not LALR(1) grammars. sbt-rats also has a high-level syntax definition language that in most cases means you do not need to write semantic actions to get a syntax tree that represents your input. The plugin will optionally generate case classes for the tree representation and a pretty-printer for the tree structure.


Is it possible to parse "off-side" (indentation-based) languages with FParsec?

I wish to use FParsec for a Python-like, indentation-based language.
I understand that this must be done in the lexing phase, but FParsec doesn't have a lexing phase. Is it possible to use FParsec, or how can I feed it tokens after a separate lexing pass?
P.S.: I'm new to F#, but experienced in other languages.
Yes, it's possible.
Here is a relevant article by the FParsec author. If you want to go deeper into the subject, this paper might be worth a read. The paper points out that there are multiple packages for indentation-aware parsing based on Parsec, the parser combinator library that inspired FParsec.
FParsec doesn't have a separate lexing phase; instead it fuses lexing and parsing into a single phase. IMO indentation-aware parsing is better done with parser combinators (FParsec) than with parser generators (fslex/fsyacc), because you need to manually track the current indentation and report good error messages based on context.
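To make the usual technique concrete, here is a minimal sketch of offside-rule tokenization, written in Scala rather than F# (the `Indenter` object and its token types are invented for the example, not part of FParsec or any library). A pre-pass compares each line's leading whitespace against a stack of open indentation levels and emits synthetic INDENT/DEDENT tokens for the parser to consume.

```scala
object Indenter {
  sealed trait Tok
  case object Indent extends Tok
  case object Dedent extends Tok
  final case class Line(text: String) extends Tok

  // Offside-rule pre-pass: track a stack of open indentation widths and
  // emit synthetic Indent/Dedent tokens whenever the width changes.
  def tokenize(src: String): List[Tok] = {
    var levels = List(0) // stack of open indentation widths
    val out = scala.collection.mutable.ListBuffer.empty[Tok]
    for (line <- src.linesIterator if line.trim.nonEmpty) {
      val width = line.takeWhile(_ == ' ').length
      if (width > levels.head) { // deeper than before: open a block
        levels ::= width
        out += Indent
      } else while (width < levels.head) { // shallower: close blocks
        levels = levels.tail
        out += Dedent
      }
      out += Line(line.trim)
    }
    out ++= List.fill(levels.length - 1)(Dedent) // close blocks still open at EOF
    out.toList
  }
}
```

A real implementation would additionally reject a dedent that lands between two open levels (inconsistent indentation) and decide how tabs count; both are exactly the kind of context-dependent error reporting mentioned above.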

What is the advantage of using a parser generator like Happy as opposed to using parser combinators?

To learn how to write and parse a context-free grammar I want to choose a tool. For Haskell, there are two big options: Happy, which generates a parser from a grammar description and *Parsec, which allows you to directly code a parser in Haskell.
What are the (dis)advantages of either approach?
External vs internal DSL
The parser specification format for Happy is an external DSL, whereas with Parsec you have the full power of Haskell available when defining your parsers. This means that you can for example write functions to generate parsers, use Template Haskell and so on.
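To illustrate the "full power of the host language" point with one concrete case (sketched in Scala with the scala-parser-combinators library rather than Parsec itself; `keyValue` and the rules around it are invented for the example): a parser here is an ordinary value, so an ordinary function can manufacture parsers.

```scala
import scala.util.parsing.combinator.RegexParsers

object ConfigParsers extends RegexParsers {
  def ident: Parser[String] = """[A-Za-z_]\w*""".r

  // An ordinary function that manufactures a parser for "key = <value>"
  // lines, for whatever value parser is handed in.
  def keyValue[A](key: String, value: Parser[A]): Parser[A] =
    key ~ "=" ~> value

  def port: Parser[Int]    = keyValue("port", """\d+""".r ^^ (_.toInt))
  def host: Parser[String] = keyValue("host", ident)
}
```

`ConfigParsers.parseAll(ConfigParsers.port, "port = 8080")` yields `Success(8080)`; an external DSL like Happy's grammar format has no direct analogue of a helper like `keyValue`.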
Precedence rules
With Happy, you can use precedences to simplify your grammar, whereas with Parsec you have to nest the grammar rules correctly yourself. Changing the precedence of an operator is therefore much more tedious in Parsec.
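What nesting the rules yourself looks like, sketched again with Scala combinators standing in for Parsec (the `Arith` grammar is invented for the example): each precedence level is its own rule that delegates to the next-tighter level.

```scala
import scala.util.parsing.combinator.RegexParsers

// Precedence encoded structurally: expr (loosest) -> term -> factor (tightest).
object Arith extends RegexParsers {
  def number: Parser[Int] = """\d+""".r ^^ (_.toInt)

  def factor: Parser[Int] = number | "(" ~> expr <~ ")"

  // "*" binds tighter than "+" only because term sits below expr.
  def term: Parser[Int] =
    chainl1(factor, "*" ^^^ ((a: Int, b: Int) => a * b))

  def expr: Parser[Int] =
    chainl1(term, "+" ^^^ ((a: Int, b: Int) => a + b))
}
```

Swapping the precedence of `+` and `*` means restructuring `expr` and `term`; in Happy it is a one-line change to the precedence declarations.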
Static checking
Happy will warn you about ambiguities in your grammar at compile time. (Though it's not great at telling you where they are.) With Parsec, you get no warning until your parser fails at run time.
This is the traditional decision: do I use lex/yacc (Happy), or do I write my own (mostly recursive-descent) parser? The difference is that the Parsec library is like a DSL for doing the hand-written approach right.
If you have experience with the yacc/lex approach, Happy will have a smaller learning curve.
In my opinion Parsec hides most of the nasty grammar details and lets you write your parsers more intuitively. If you want to learn this stuff in the first place, go with a parser generator like Happy (or even try to implement one yourself).
I'm used to the parser combinator library uu-parsinglib from Utrecht University. It gives you error correction and permutation parsing for free, along with the things Parsec has. I also like it because my implemented grammar looks like an EBNF grammar, without so much monadic stuff, and is easy to read.
Naive parser combinators do not allow left recursion in grammar rules, and I haven't found a library that does.
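The standard workaround is to eliminate the left recursion by hand: rewrite the rule as a repetition and restore left associativity with a fold. A sketch in Scala combinators (the `NoLeftRec` grammar is invented for the example):

```scala
import scala.util.parsing.combinator.RegexParsers

// expr ::= expr "+" term | term   -- left-recursive: a naive combinator loops forever
// expr ::= term ("+" term)*       -- equivalent and combinator-friendly
object NoLeftRec extends RegexParsers {
  def term: Parser[Int] = """\d+""".r ^^ (_.toInt)

  def expr: Parser[Int] =
    term ~ rep("+" ~> term) ^^ { case first ~ rest =>
      rest.foldLeft(first)(_ + _) // the fold restores left associativity
    }
}
```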
Happy does allow full BNF in the language spec, plus some useful stuff like precedence rules, so for complicated cases Happy and parser generators in general are much better. However, for simple languages with LL(k)-parseable grammars, I would use a parser combinator library as the more maintainer-friendly option.

Do production compilers use parser generators?

I've heard that "real compiler writers" roll their own handmade parser rather than using parser generators. I've also heard that parser generators don't cut it for real-world languages. Supposedly, there are many special cases that are difficult to implement using a parser generator. I have my doubts about this:
Theoretically, a GLR parser generator should be able to handle most programming language designs (except maybe C++...)
I know of at least one production language that uses a parser generator: Ruby [1].
When I took my compilers class in school, we used a parser generator.
So my question: Is it reasonable to write a production compiler using a parser generator, or is using a parser generator considered a poor design decision by the compiler community?
[1] https://github.com/ruby/ruby/blob/trunk/parse.y
For what it's worth, I believe GCC used a parser generator before 4.0, then switched to a hand-written recursive-descent parser because it was easier to maintain and extend.
Parser generators DO "cut it" for "real" languages, but the work needed to massage your grammar into something the generator accepts grows quickly with the language's complexity.
Edit: link to the GCC document detailing the change with reasons and benefits vs cost analysis: http://gcc.gnu.org/wiki/New_C_Parser.
I worked for a company for a few years where we were more or less writing compilers. We weren't concerned much with performance; just reducing the amount of work/maintenance. We used a combination of generated parsers + handwritten code to achieve this. The ideal balance is to automate the easy, repetitive parts with the parser generator and then tackle the hard stuff in custom functions.
Sometimes a combination of both methods is used: generate code with a parser generator, then modify that code by hand.
Another approach: some scanner (lexer) and parser tools let you attach custom code to the grammar rules, called "semantic actions". A good example: the lexer matches generic identifiers, and a semantic action turns specific identifiers into keywords.
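A sketch of that identifier-to-keyword trick in plain Scala (the names and the keyword set are illustrative; a real tool would call a hook like this from the identifier rule's semantic action):

```scala
// The lexer rule matches every identifier; this post-step decides which
// ones are actually keywords.
sealed trait Token
final case class Ident(name: String) extends Token
final case class Keyword(name: String) extends Token

object KeywordAction {
  private val keywords = Set("if", "else", "while", "return")

  def classify(lexeme: String): Token =
    if (keywords(lexeme)) Keyword(lexeme) else Ident(lexeme)
}
```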

Alternative parsing methods

I know something about regular expressions, parse trees, and abstract syntax trees. But I once read that there is still another parsing technique which, as far as I remember, the Stack Overflow team used to re-implement its Markdown parser.
What I don't recall is the name of this method or how it worked. Do you know it? If not, what could it be?
Maybe you're thinking of Parsing Expression Grammars?
(If I'm remembering the same thing you're remembering, it's cletus writing about this here.)
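For context, the feature that sets parsing expression grammars apart from context-free grammars is ordered choice: alternatives are tried in order and the first success wins, so a PEG is never ambiguous. A sketch of that behavior (Scala's combinator `|` is also first-match; the tiny grammar is invented):

```scala
import scala.util.parsing.combinator.RegexParsers

// Ordered choice, the hallmark of PEGs: "<=" must be listed before "<",
// because the first alternative that matches wins.
object Ordered extends RegexParsers {
  def relOp: Parser[String] = "<=" | "<" | ">=" | ">"
}
```

Written the other way around, `"<" | "<="` would match only the `<` of `<=` and leave the `=` behind, whereas in a context-free grammar both orderings describe the same language.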
Here's a blog about SO's markdown parser: https://blog.stackoverflow.com/2009/12/introducing-markdownsharp/
Here's the source: http://code.google.com/p/markdownsharp/
It does use advanced regular expressions. I'm not aware of any "other" parsing technique. The most common parsing solutions, used by virtually all programmers, are:
Regular expressions (or finite state machines) for regular grammars.
Non-deterministic pushdown automata for context-free grammars. This is where you get parser generators like yacc, bison, ANTLR, etc.
See also the Chomsky hierarchy of formal grammars.

When is better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
More generally speaking, I would say that most regular languages could be parsed more quickly by hand (using a regular expression).
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
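For a sense of scale, the sample line from the question fits in a few combinator rules. A sketch in Scala (the grammar and AST are guessed from the single example, so their shape is an assumption):

```scala
import scala.util.parsing.combinator.RegexParsers

// Grammar guessed from the one sample line:
//   call ::= ident arg*
//   arg  ::= "(" call ")" | ident
object DslParser extends RegexParsers {
  sealed trait Arg
  final case class Name(s: String) extends Arg
  final case class Call(fn: String, args: List[Arg]) extends Arg

  def ident: Parser[String] = """[A-Za-z]\w*""".r

  def call: Parser[Call] =
    ident ~ rep(arg) ^^ { case f ~ as => Call(f, as) }

  def arg: Parser[Arg] =
    "(" ~> call <~ ")" | ident ^^ Name
}
```

`DslParser.parseAll(DslParser.call, "funcA Type1 a (funcB Type1 b) ReturnType c")` yields a `Call("funcA", ...)` whose third argument is the nested `Call("funcB", ...)`. Even at this size, the nested-call case would already be awkward to handle with a regular expression alone.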
Grammars tend to evolve (as do requirements). Home-brew parsers are difficult to maintain and lead to reinventing the wheel. If you think you can write a quick parser in Java, know that it would be quicker still to use any of the lex/yacc/compiler-compiler solutions. Lexers are the easy part; after that you would need your own rule-precedence semantics, which are not easy to test or maintain. ANTLR also provides an IDE for visualising the AST, which is hard to beat. An added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.
