Translate Haskell Parsec grammar to Scala? - parsing

I'm trying to translate a grammar written in Haskell using Parsec into Scala's parser combinators.
The translation of the actual matching expressions is pretty straightforward and, at least in my opinion, even a little easier in Scala. What's not at all clear to me is how to handle the statefulness that Parsec passes around using monads.
A Scala parser reads in Input and produces a ParseResult[T].
In contrast, a GenParser in Haskell reads in input and a state and produces another parser. Passing that state around in Scala has me confused.
Does anyone have an example of stateful parsing in Scala that they'd be willing to share?

The only way I know of to handle statefulness in Scala's parser combinators is through the into method, also known as >> and flatMap (and, yes, you can use it in for-comprehensions). However, it passes state (or, more precisely, the parse result) into the next parser only, rather than threading it along all subsequent parsers, which seems to be what you are describing.
Not knowing Haskell's Parsec, it is difficult for me to guess at how that can be used to translate your grammar.
See also this question. There was a very interesting paper about Scala parser combinators, but I was not able to find it. Some spelunking on Scala Lang might turn it up.
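To make the difference concrete, here is a minimal sketch, in Python rather than Scala or Haskell, of parsers that thread a user state alongside the input position, which is roughly the shape of Parsec's user state. All names here (literal, bind, pure, update_state, many, counted_a) are hypothetical and are not the API of Parsec or of Scala's parser combinators.

```python
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position, user_state) and returns
# (value, new_position, new_user_state), or None on failure.
Result = Optional[Tuple[Any, int, Any]]
Parser = Callable[[str, int, Any], Result]

def literal(s: str) -> Parser:
    """Match the exact string s; the user state passes through unchanged."""
    def run(text, pos, state):
        if text.startswith(s, pos):
            return s, pos + len(s), state
        return None
    return run

def pure(value: Any) -> Parser:
    """Succeed without consuming input or touching the state."""
    return lambda text, pos, state: (value, pos, state)

def update_state(fn: Callable[[Any], Any]) -> Parser:
    """Consume nothing; replace the user state with fn(state)."""
    return lambda text, pos, state: (None, pos, fn(state))

def bind(p: Parser, f: Callable[[Any], Parser]) -> Parser:
    """Run p, then feed its value to f; position and state flow through both."""
    def run(text, pos, state):
        r = p(text, pos, state)
        if r is None:
            return None
        value, pos2, state2 = r
        return f(value)(text, pos2, state2)
    return run

def many(p: Parser) -> Parser:
    """Apply p zero or more times, carrying the state from one run to the next."""
    def run(text, pos, state):
        values = []
        while True:
            r = p(text, pos, state)
            if r is None or r[1] == pos:  # stop on failure or no progress
                return values, pos, state
            value, pos, state = r
            values.append(value)
    return run

# Match "a" and increment a counter kept in the user state.
counted_a = bind(literal("a"),
                 lambda v: bind(update_state(lambda n: n + 1),
                                lambda _: pure(v)))
```

Running many(counted_a) on "aaab" with an initial state of 0 yields the three matched values, the end position 3, and a final state of 3. The point of the sketch is that bind forwards the state to every later parser, not just the parse result to the next one.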

Related

Bypassing left recursion with right-to-left parser

I'm working on a project for my OOP class. Part of the task is developing a parser for a very simple grammar. As far as I understand, by far the simplest parser to implement by hand is a recursive-descent parser.
However, all operators in the language I'm parsing are left-associative by nature. As far as I know, the best way to deal with the left recursion enforced by left associativity is to use an LR parser instead.
My idea is to parse tokens right-to-left, which I believe should enable me to rewrite left-associative rules as right-associative ones.
Will this work, and if not, why not?
You can do this if you'd like, though this won't necessarily solve all your problems. If you're familiar with the LL or LR parsers, there are corresponding versions that work right-to-left called RR and RL parsers that pretty much work like LL or LR parsers scanning in the opposite direction. As a result, they have similar weaknesses to the original LL or LR parsing algorithms, so while this might help you, it might not actually solve anything.
As alternatives, you can try rewriting the grammar to see if you can encode precedence and associativity directly. You could also, depending on the grammar, consider using a precedence parsing algorithm like Dijkstra's shunting-yard algorithm. You could also consider using a recursive descent parser with backtracking. Finally, you could use something like an Earley parser, which can handle any context-free grammar and isn't too hard to code up.
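As an illustration of the shunting-yard option, here is a minimal Python sketch that converts infix tokens to reverse Polish notation. The operator table and the function name to_rpn are assumptions made for this example; it treats all four operators as left-associative and does no validation beyond parenthesis balance.

```python
def to_rpn(tokens):
    """Convert infix tokens to reverse Polish notation (shunting-yard)."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}  # assumed: all left-associative
    output, ops = [], []
    for tok in tokens:
        if tok in prec:
            # Left associativity: pop operators of equal or higher precedence
            # before pushing the new one.
            while ops and ops[-1] in prec and prec[ops[-1]] >= prec[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops and ops[-1] != "(":
                output.append(ops.pop())
            if not ops:
                raise ValueError("unbalanced ')'")
            ops.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand
    while ops:
        op = ops.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output
```

For the tokens of 8 - 3 - 2 this produces 8 3 - 2 -, that is (8 - 3) - 2, so left associativity falls out of the >= comparison without any left-recursive grammar rule.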
Good insight -- it turns out this works really well to solve the left recursion problem, but you also have to parse bottom-up, not just right to left. I published a preprint about this.
Pika parsing: parsing in reverse solves the left recursion and error recovery problems
https://arxiv.org/abs/2005.06444

Generalized Bottom up Parser Combinators in Haskell

I wonder why there are no generalized parser combinators for bottom-up parsing in Haskell, like the Parsec combinators for top-down parsing.
(I could find some research work done around 2004, but nothing after:
https://haskell-functional-parsing.googlecode.com/files/Ljunglof-2002a.pdf
http://www.di.ubi.pt/~jpf/Site/Publications_files/technicalReport.pdf )
Is there a specific reason why this has not been achieved?
This is because of referential transparency. Just as no function can tell the difference between
let x = 1:x
let x = 1:1:1:x
let x = 1:1:1:1:1:1:1:1:1:... -- if this were writeable
no function can tell the difference between a grammar which is a finite graph and a grammar which is an infinite tree. Bottom-up parsing algorithms need to be able to see the grammar as a graph, in order to enumerate all the possible parsing states.
The fact that top-down parsers see their input as infinite trees allows them to be more powerful, since the tree could be computationally more complex than any graph could be; for example,
numSequence n = string (show n) *> option () (numSequence (n+1))
accepts any finite ascending sequence of numbers starting at n. This has infinitely many different parsing states. (It might be possible to represent this in a context-free way, but it would be tricky and require more understanding of the code than a parsing library is capable of, I think)
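For readers who do not read Haskell, the same idea can be sketched in Python. This is an assumed analogue, not a translation of Parsec's semantics: it backtracks freely on the optional tail, whereas Parsec's string and option have their own rules about consumed input.

```python
def num_sequence(n, text, pos=0):
    """Return the position after the longest run str(n), str(n+1), ...
    starting at pos, or None if str(n) does not match at pos."""
    digits = str(n)
    if not text.startswith(digits, pos):
        return None
    pos += len(digits)
    # The "rest of the grammar" depends on n + 1, so there is a distinct
    # parsing state for every integer: no finite set of states suffices.
    rest = num_sequence(n + 1, text, pos)
    return pos if rest is None else rest
```

Each recursive call is parameterised by a different n, which is exactly what makes the grammar an infinite tree rather than a finite graph.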
A bottom up combinator library could be written, though it is a bit ugly, by requiring all parsers to be "labelled" in such a way that
the same label always refers to the same parser, and
there is only a finite set of labels
at which point it begins to look a lot more like a traditional specification of a grammar than a combinatory specification. However, it could still be nice; you would only have to label recursive productions, which would rule out any infinitely-large rules such as numSequence.
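A small Python sketch of what labelling buys you: once every recursive production is referred to by a label from a finite set, the grammar is a finite graph that an algorithm can walk to completion. The grammar and the function name reachable_labels are made up for this example.

```python
# Grammar keyed by label. Each label maps to a list of alternatives, and each
# alternative is a list of symbols. A symbol that is a key of the dict is a
# reference to another labelled production; anything else is a terminal.
grammar = {
    "expr": [["expr", "+", "term"], ["term"]],
    "term": [["term", "*", "atom"], ["atom"]],
    "atom": [["(", "expr", ")"], ["num"]],
}

def reachable_labels(grammar, start):
    """Enumerate every label reachable from start.

    This terminates even though the grammar is recursive, because the
    labels make the cycles visible: a label is expanded at most once.
    """
    seen, stack = set(), [start]
    while stack:
        label = stack.pop()
        if label in seen:
            continue
        seen.add(label)
        for alternative in grammar[label]:
            stack.extend(sym for sym in alternative if sym in grammar)
    return seen
```

The same traversal over unlabelled combinator values would never terminate on a left-recursive rule such as expr, since each unfolding looks like a fresh parser. Building LR item sets needs this kind of exhaustive enumeration.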
As luqui's answer indicates, a bottom-up parser combinator library is not realistic. On the chance that someone gets to this page just looking for Haskell's bottom-up parser generator, what you are looking for is called the Happy parser generator. It is like yacc for Haskell.
As luqui said above: Haskell's treatment of recursive parser definitions does not permit the definition of bottom-up parsing libraries. Bottom-up parsing libraries are possible though if you represent recursive grammars differently. With apologies for the self-promotion, one (research) parser library that uses such an approach is grammar-combinators. It implements a grammar transformation called the uniform Paull transformation that can be combined with the top-down parser algorithm to obtain a bottom-up parser for the original grammar.
@luqui essentially says that there are cases in which sharing is unobservable. However, that is not the case in general: many approaches to observable sharing exist. E.g. http://www.ittc.ku.edu/~andygill/papers/reifyGraph.pdf mentions a few different methods to achieve observable sharing and proposes its own new method:
This looping structure can be used for interpretation, but not for
further analysis, pretty printing, or general processing. The
challenge here, and the subject of this paper, is how to allow trees
extracted from Haskell hosted deep DSLs to have observable back-edges,
or more generally, observable sharing. This a well-understood problem,
with a number of standard solutions.
Note that the "ugly" solution of @luqui is mentioned by the paper under the name of explicit labels. The solution proposed by the paper is still "ugly" as it uses so-called "stable names", but other solutions such as http://www.cs.utexas.edu/~wcook/Drafts/2012/graphs.pdf (which relies on PHOAS) may work.

LR(k) to LR(1) grammar conversion

I am confused by the following quote from Wikipedia:
In other words, if a language was reasonable enough to allow an
efficient one-pass parser, it could be described by an LR(k) grammar.
And that grammar could always be mechanically transformed into an
equivalent (but larger) LR(1) grammar. So an LR(1) parsing method was,
in theory, powerful enough to handle any reasonable language. In
practice, the natural grammars for many programming languages are
close to being LR(1).[citation needed]
This means that a parser generator like bison is very powerful (since it can handle LR(k) grammars), provided one is able to convert an LR(k) grammar to an LR(1) grammar. Do some examples of this exist, or is there a recipe for how to do it? I'd like to know because I have a shift/reduce conflict in my grammar, which I think arises because it is an LR(2) grammar, and I would like to convert it to an LR(1) grammar. Side question: is C++ an unreasonable language, since I've read that bison-generated parsers cannot parse it?
For references on the general purpose algorithm to find a covering LR(1) grammar for an LR(k) grammar, see Real-world LR(k > 1) grammars?
The general purpose algorithm produces quite large grammars; in fact, I'm pretty sure that the resulting PDA is the same size as the LR(k) PDA would be. However, in particular cases it's possible to come up with simpler solutions. The general principle applies, though: you need to defer the shift/reduce decision by unconditionally shifting until the decision can be made with a single lookahead token.
One example: Is C#'s lambda expression grammar LALR(1)?
Without knowing more details about your grammar, I can't really help more than that.
With regard to C++, the things that make it tricky to parse are the preprocessor and some corner cases in parsing (and lexing) template instantiations. The fact that the parse of an expression depends on the "kind" (not type) of a symbol (in the context in which the symbol occurs) makes precise parsing with bison complicated. [1] "Unreasonable" is a value judgement which I'm not comfortable making; certainly, tool support (like accurate syntax colourizers and tab-completers) would have been simpler with a different grammar, but the evidence is that it is not that hard to write (or even read) good C++ code.
Notes:
[1] The classic tricky parse, which also applies to C, is (a)*b, which is a cast of a dereference if a represents a type, and otherwise a multiplication. If you were to write it in the context c/(a)*b, it would be clear that an AST cannot be constructed without knowing whether it's a cast or a product, since that affects the shape of the AST.
A more C++-specific issue is: x<y>(z) (or x<y<z>>(3)) which parse (and arguably tokenise) differently depending on whether x names a template or not.
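The c/(a)*b case can be made concrete with a toy Python parser. This is a sketch under strong assumptions: it only knows identifiers, parentheses, binary * and /, unary * and casts, it has no error handling, and the set of type names is passed in by hand, where a real C or C++ front end would consult its symbol table.

```python
import re

def parse(src, type_names):
    """Parse a tiny C-like expression into nested tuples.

    type_names is the set of identifiers currently known to name types;
    it decides whether "(a)" starts a cast or is a parenthesised operand.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|[()*/]", src)
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def unary():
        # "(T) operand" is a cast only when T is known to name a type.
        if peek() == "(" and peek(1) in type_names and peek(2) == ")":
            advance()
            name = advance()
            advance()
            return ("cast", name, unary())
        if peek() == "*":
            advance()
            return ("deref", unary())
        if peek() == "(":
            advance()
            inner = binary()
            advance()  # closing ")"
            return inner
        return advance()  # identifier

    def binary():
        # Left-associative chain of * and / over unary operands.
        left = unary()
        while peek() in ("*", "/"):
            op = advance()
            left = (op, left, unary())
        return left

    return binary()
```

With a known as a type, parse("c/(a)*b", {"a"}) gives ("/", "c", ("cast", "a", ("deref", "b"))); with no known types it gives ("*", ("/", "c", "a"), "b"). The two trees have different shapes, so the decision cannot be postponed until after parsing.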

LR(k) or LALR(k) parser generator with features similar to ANTLR

I'm currently in the process of writing a parser for some language. I've been given a grammar for this language, but this grammar has some left recursions and non-LL(*) constructs, so ANTLR doesn't do very well, even with backtracking.
Because removing these left recursions and non-LL(*) constructs is harder than it looked at first glance, I now want to try an LR(k) or LALR(k) parser generator. The higher k, the better.
Can anyone recommend a parser generator that fulfils these requirements?
The generated parser is preferably an LR(k) parser with some high (or even arbitrary) k, or at least an LALR(k) parser with some high k.
The generated parser is written in C or C++, and if it is written in C, it is linkable to C++ code.
A feature set similar to ANTLR (especially the AST rewriting) would be nice.
Performance is not the most pressing issue; the generated parser is intended to be used on desktop machines with plenty of memory and CPU power.
Thanks and greetings,
Jost
PS: I'm not asking because I can't google myself, but because there is no time left to test some generators myself. So please only answer if you have experience with the recommended parser generators.
You might consider LRSTAR.
I have no experience with the tool itself, but I've met the author and he seems like a pretty competent guy. (I do build parsing engines and related technology for a living).
LRSTAR 10.0 is available now. On the comparison page, there is a comparison of LRSTAR, ANTLR and Bison. LRSTAR now reads ANTLR's style notation using the same EBNF operators (:, |, *, +, ?). It's a C++ based system generating LR(k) parsers in C++. The parsers do automatic AST construction and traversal. The new version 10.0 reads Yacc/Bison grammars if there is no action code in the grammar.
I have now decided to use DParser, which is a GLR parser generator capable of recognizing any context-free language. It seems to be well programmed (look at the tests in the source distribution), but lacks a lot of the features ANTLR provides, most notably the AST construction tools.
As a plus, it mostly reuses ANTLR's grammar file format, which is the format my grammar is in.

Scala Parsers: Availability, Differences and Combining?

My question is about the Scala Parsers:
Which ones are available (in the Standard library and outside),
what's the difference between them,
do they share a common API and
can different Parsers be combined to parse one input string?
I found at least these:
Scala's "standard" parser (seems to be an LL parser)
Scala's Packrat parser (since 2.8, is a LALR parser)
The Parboiled parser (PEG parser?)
Spiewak's GLL parser combinator
There's also Dan Spiewak's implementation of GLL parser combinators.
It's worth noting that Scala's standard parser combinators are not LL, nor are Packrat combinators LALR. Parser combinators are a form of recursive descent with infinite backtracking. You can think of them a bit like "LL(*)". The class of languages supported by this technique is precisely the class of unambiguous context-free languages, or the same class as LALR(1) and Packrat. However, the class of grammars is quite a bit different, with the most notable weakness being non-support for left-recursion.
Packrat combinators do support left-recursion, but they still fail to support many other, more subtle features of LALR. This weakness generally stems from the ordered choice operator, which can lead to some devilishly tricky grammar bugs, as well as prevents certain nice grammatical formulations. The most often-seen example of these bugs happens when you accidentally order ambiguous choices as shortest first, resulting in a greedy match that prevents the correct branch from ever being tried. LALR doesn't have this problem, since it simply tries all possible branches at once, deferring the decision point until the end of the production.
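The shortest-first bug is easy to reproduce with a few hand-rolled combinators. The sketch below is in Python with hypothetical names (lit, ordered_choice, seq, eof); it models a choice operator that commits to the first alternative that succeeds, and is not the API of any Scala library.

```python
def lit(s):
    """Match the exact string s; return the new position or None."""
    def run(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return run

def ordered_choice(*alternatives):
    """Commit to the first alternative that succeeds; later ones are
    never tried, even if the rest of the parse then fails."""
    def run(text, pos):
        for p in alternatives:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return run

def seq(*parts):
    """Run the parts one after another, failing if any part fails."""
    def run(text, pos):
        for p in parts:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return run

def eof(text, pos):
    """Succeed only at the end of the input."""
    return pos if pos == len(text) else None

# "in" is a prefix of "int", so listing it first hides the longer branch.
shortest_first = seq(ordered_choice(lit("in"), lit("int")), eof)
longest_first = seq(ordered_choice(lit("int"), lit("in")), eof)
```

On the input "int", shortest_first fails: the choice matches "in", eof then fails on the leftover "t", and the "int" branch is never attempted. longest_first accepts both "int" and "in".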
There is also a new approach known as "parsing with derivatives". The approach is described here. There is an implementation in Scala by Daniel Spiewak.
Just wanted to update this answer with a pointer to the latest iteration of the parboiled project, called parboiled2:
https://github.com/sirthias/parboiled2
parboiled2 targets only Scala (as opposed to Scala + Java), makes use of Scala macros, and is very actively maintained.

Resources