Use FParsec on already tokenized UInt16 stream - parsing

I need to parse an already tokenized stream of type UInt16 seq.
How can I do this with FParsec?
All the top level functions I can find in the reference work on charstreams.
At the moment I convert the UInt16s to chars which seems silly.

Unfortunately, it is not possible to use FParsec on anything other than a CharStream.
I solved the problem by writing a simple parser combinator myself, using this article.
Surprisingly this was only one day's worth of work.
I learned a lot about parser combinators in the process.
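To give a feel for how small such a hand-written combinator library can be, here is a minimal sketch. It is written in Haskell, like the code elsewhere on this page, rather than F#, and it is not the code from the article mentioned above. The names Parser, satisfy, token, orElse and many0, and the token codes tokOpen and tokClose, are all made up for illustration.

```haskell
import Data.Word (Word16)

-- A parser consumes a list of tokens and, on success, returns a value
-- together with the tokens it did not consume.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

instance Functor (Parser t) where
  fmap f (Parser p) = Parser $ \ts -> case p ts of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative (Parser t) where
  pure a = Parser $ \ts -> Just (a, ts)
  Parser pf <*> Parser pa = Parser $ \ts -> case pf ts of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad (Parser t) where
  Parser p >>= f = Parser $ \ts -> case p ts of
    Nothing        -> Nothing
    Just (a, rest) -> runParser (f a) rest

-- Accept one token that satisfies a predicate.
satisfy :: (t -> Bool) -> Parser t t
satisfy ok = Parser $ \ts -> case ts of
  (t:rest) | ok t -> Just (t, rest)
  _               -> Nothing

-- Accept exactly the given token.
token :: Eq t => t -> Parser t t
token = satisfy . (==)

-- Try the first parser; fall back to the second if it fails.
orElse :: Parser t a -> Parser t a -> Parser t a
orElse (Parser p) (Parser q) = Parser $ \ts -> maybe (q ts) Just (p ts)

-- Zero or more repetitions of a parser that consumes input.
many0 :: Parser t a -> Parser t [a]
many0 p = ((:) <$> p <*> many0 p) `orElse` pure []

-- Hypothetical token codes, chosen only for this example.
tokOpen, tokClose :: Word16
tokOpen  = 1
tokClose = 2

-- OPEN item* CLOSE, where an item is any token code >= 100.
group :: Parser Word16 [Word16]
group = token tokOpen *> many0 (satisfy (>= 100)) <* token tokClose
```

Because the parser is polymorphic in the token type t, nothing has to be converted to characters first; error reporting and position tracking are left out of this sketch.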

Related

"Batteries" for Parsec in Haskell

I am new to Haskell, and I have been trying to write a JSON parser using Parsec as an exercise. This has mostly been going well, I am able to parse lists and objects with relatively little code which is also readable (great!). However, for JSON I also need to parse primitives like
Integers (possibly signed)
Floats (possibly using scientific notation such as "3.4e-8")
Strings with e.g. escaped quotes
I was hoping to find ready-to-use parsers for things like these as part of Parsec. The closest I get is the Parsec.Token module (which defines integer and friends), but those parsers require a "language definition" that seems way beyond what I should have to write to parse something as simple as JSON -- it appears to be designed for programming languages.
So my questions are:
Are the functions in Parsec.Token the right way to go here? If so, how to make a suitable language definition?
Are "primitive" parsers for integers etc defined somewhere else? Maybe in another package?
Am I supposed to write these kinds of low-level parsers myself? I can see myself reusing them frequently... (obscure scientific data formats etc.)
I have noticed that a question on this site says Megaparsec has these primitives included [1], but I suppose these cannot be used with parsec.
Related questions:
How do I get Parsec to let me call `read` :: Int?
How to parse an Integer with parsec
Are the functions in Parsec.Token the right way to go here?
Yes, they are. If you don't care about the minutiae specified by a language definition (i.e. you don't plan to use the parsers which depend on them, such as identifier or reserved), just use emptyDef as a default:
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)
lexer = P.makeTokenParser emptyDef
integer = P.integer lexer
As you noted, this feels unnecessarily clunky for your use case. It is worth mentioning that megaparsec (cf. Alec's suggestion) provides a corresponding integer parser without the ceremony. (The flip side is that megaparsec doesn't try to bake in support for e.g. reserved words, but that isn't difficult to implement in the cases where you actually need it.)
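The same lexer also covers the other primitives from the question. The following is a rough sketch, assuming that Haskell-style literals are close enough for your data; the names number and str are made up here. P.naturalOrFloat is unsigned, so the sign is handled by hand, and P.stringLiteral follows Haskell's escape rules, which are similar but not identical to JSON's.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)

lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef

-- An integer (Left) or a float (Right), with an optional leading minus.
-- Scientific notation such as 3.4e-8 is accepted by naturalOrFloat.
number :: Parser (Either Integer Double)
number = do
  neg <- option False (True <$ char '-')
  n   <- P.naturalOrFloat lexer
  pure $ if neg then either (Left . negate) (Right . negate) n else n

-- A double-quoted string with backslash escapes (Haskell rules, not JSON's).
str :: Parser String
str = P.stringLiteral lexer
```

If exact JSON escape handling matters, a small hand-written string parser is probably safer than stringLiteral.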

How to write a language with Python-like indentation in syntax?

I'm writing a tool with its own built-in language similar to Python. I want to make indentation meaningful in the syntax (so that tabs and spaces at the beginning of a line would represent nesting of commands).
What is the best way to do this?
I've written recursive-descent and finite automata parsers before.
The current CPython's parser seems to be generated using something called ASDL.
Regarding the indentation you're asking for, it's done using special lexer tokens called INDENT and DEDENT. To replicate that, just implement those tokens in your lexer (that is pretty easy if you use a stack to store the starting columns of previous indented lines), and then plug them into your grammar as usual (like any other keyword or operator token).
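A minimal sketch of that stack-based approach, written in Haskell: the Tok type and the layout function are hypothetical names, only spaces are counted as indentation (tabs are ignored), and each non-blank line is passed through as a single Line token instead of being lexed further.

```haskell
-- Hypothetical token type for this sketch.
data Tok = Indent | Dedent | Line String
  deriving (Eq, Show)

-- Turn source lines into tokens, keeping a stack of the indentation
-- columns of the blocks that are currently open (innermost first).
layout :: [String] -> [Tok]
layout = go [0]
  where
    -- End of input: close every block that is still open.
    go stack [] = replicate (length stack - 1) Dedent
    go stack (l:ls)
      | all (== ' ') l = go stack ls            -- skip blank lines
      | otherwise =
          let (spaces, body) = span (== ' ') l
              (toks, stack') = adjust (length spaces) stack
          in toks ++ Line body : go stack' ls

    -- Compare the new column with the top of the stack.
    adjust col stack@(top:rest)
      | col > top  = ([Indent], col : stack)    -- deeper: open a block
      | col == top = ([], stack)                -- same level: nothing to do
      | otherwise  =                            -- shallower: pop and retry
          let (toks, stack') = adjust col rest
          in (Dedent : toks, stack')
    adjust _ [] = ([], [0])                     -- not reached: 0 stays on the stack
```

For the lines "if x:", "  a", "c" this yields Line "if x:", Indent, Line "a", Dedent, Line "c". A real lexer would reject a dedent to a column that is not on the stack; this sketch simply opens a new block at that column.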
Check out the python compiler and in particular compiler.parse.
I'd suggest ANTLR for any lexer/parser generation ( http://www.antlr.org ).
Also, this website ( http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/ ) has some more information, in particular:
Python’s indentation cannot be solved with a DFA. (I’m still perplexed at whether it can even be solved with a context-free grammar).
PyPy produced an interesting post about lexing Python (they intend to solve it using post-processing the lexer output)
CPython’s tokenizer is written in C. It’s ad-hoc, hand-written, and complex. It is the only official implementation of Python lexing that I know of.

Parsec or happy (with alex) or uu-parsinglib

I am going to write a parser of verilog (or vhdl) language and will do a lot of manipulations (sort of transformations) of the parsed data. I intend to parse really big files (full Verilog designs, as big as 10K lines) and I will ultimately support most of the Verilog. I don't mind typing but I don't want to rewrite any part of the code whenever I add support for some other rule.
In Haskell, which library would you recommend? I know Haskell and have used Happy before (to play). I feel that there are possibilities in using Parsec for transforming the parsed string in the code (which is a great plus). I have no experience with uu-parsinglib.
So to parse a full-grammar of verilog/VHDL which one of them is recommended? My main concern is the ease and 'correctness' with which I can manipulate the parsed data at my whim. Speed is not a primary concern.
I personally prefer Parsec with the help of Alex for lexing.
I prefer Parsec over Happy because 1) Parsec is a library, while Happy is a program: if you use Happy, you write your grammar in a different language and then compile it with Happy. 2) Parsec gives you context-sensitive parsing abilities thanks to its monadic interface. You can use extra state for context-sensitive parsing, and then inspect that state and decide what to do depending on it. Or you can just look at a previously parsed value and decide on the next parsers (like a <- parseSomething; if test a then ... do ...). And when you don't need any context-sensitive information, you can simply use applicative style and get an implementation that looks like one written in YACC or a similar tool.
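The difference between the two styles can be sketched with a pair of toy parsers. The names pair and counted are made up for this illustration and do not come from any Verilog grammar.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Applicative style: the shape of the parser is fixed up front,
-- much like a grammar rule  pair ::= digit ',' digit.
pair :: Parser (Char, Char)
pair = (,) <$> digit <* char ',' <*> digit

-- Monadic style: a value parsed earlier decides which parser runs next.
-- Here a leading count says how many letters must follow the colon.
counted :: Parser String
counted = do
  n <- read <$> many1 digit
  _ <- char ':'
  count n letter
```

With these definitions, parse counted "" "3:abcde" consumes exactly three letters, while "3:ab" is rejected, which is something a fixed context-free rule cannot express directly.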
As a downside, Parsec never tells you that your parser contains left recursion; the parser simply gets stuck at runtime (because Parsec is basically a top-down recursive-descent parser). You have to find left recursions and eliminate them yourself. YACC-style parsers can give you some static guarantees and information (like shift/reduce conflicts, unused terminals etc.) that you can't get with Parsec.
Alex is highly recommended for lexing in both situations (I think you have to use Alex if you decide to go on with Happy). Even if you use Parsec, it really simplifies your parser implementation and catches a great deal of bugs too (for example, parsing a keyword as an identifier was a common mistake I made while I was using Parsec without Alex; that's just one example).
You can have a look at my Lua parser implemented in Alex+Parsec. And here's the code to use Alex-generated tokens in Parsec.
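For readers without access to that code, the general idea of running Parsec over a stream of tokens can be sketched with tokenPrim. This is not the Lua parser's code; the Tok and Located types stand in for whatever an Alex lexer would emit, and matchTok, keyword, ident, number and letStmt are hypothetical names.

```haskell
import Text.Parsec
import Text.Parsec.Pos (newPos)

-- Hypothetical token type, standing in for the output of an Alex lexer.
data Tok = TIdent String | TKeyword String | TNum Integer
  deriving (Eq, Show)

-- Each token carries the line and column where the lexer found it.
data Located = Located { locLine :: Int, locCol :: Int, locTok :: Tok }
  deriving (Eq, Show)

type TokParser a = Parsec [Located] () a

-- Core primitive: accept one token if the function returns Just.
matchTok :: (Tok -> Maybe a) -> TokParser a
matchTok f = tokenPrim (show . locTok) nextPos (f . locTok)
  where
    -- After consuming a token, report the position of the next one.
    nextPos pos _ (t:_) = newPos (sourceName pos) (locLine t) (locCol t)
    nextPos pos _ []    = pos

keyword :: String -> TokParser ()
keyword k = matchTok (\t -> if t == TKeyword k then Just () else Nothing)

-- Only TIdent matches, so a keyword can never be taken for an identifier.
ident :: TokParser String
ident = matchTok (\t -> case t of { TIdent s -> Just s; _ -> Nothing })

number :: TokParser Integer
number = matchTok (\t -> case t of { TNum n -> Just n; _ -> Nothing })

-- Toy rule: let <ident> <number>
letStmt :: TokParser (String, Integer)
letStmt = (,) <$> (keyword "let" *> ident) <*> number <* eof
```

Since the lexer has already decided what is a keyword and what is an identifier, the keyword-as-identifier bug mentioned above cannot occur in the parser.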
EDIT: Thanks John L for corrections. Apparently you can do context-sensitive parsing with Happy too. Also, Alex for lexing is not required in Happy, though it's recommended.

YAML parsing - lex or hand-rolled?

I am trying to write a simple YAML parser and have read the spec at yaml.org. Before I start, I was wondering whether it is better to write a hand-rolled parser or to use lex (flex/bison). I looked at libyaml (the C library), and it doesn't seem to use lex/yacc.
YAML (excluding the flow styles) seems to be more line-oriented, so is it easier to write a hand-rolled parser or to use flex/bison?
Thanks.
This answer is basically an answer to the question: "Should I roll my own parser or use parser generator?" and has not much to do with YAML. But nevertheless it will "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand-written recursive descent parser.
I have written, among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples because they go against intuition; in both cases exterior requirements dictated the decision.
Since I have established that both are possible on a technical level, why would you pick one over the other? This is a really hard question to answer. Here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use of hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or a similar generator makes the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes it quite clear what happens in the parser. This provides an easy means to gracefully handle hairy conflicts. You can write a simple if, instead of rearranging the entire grammar to be LALR(1).
While developing the parser you can "gloss over details" with a hand-written parser; with bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should write both types of parser.

Choosing a Haskell parser

There are many open-source parser implementations available to us in Haskell. Parsec seems to be the standard for text parsing and attoparsec seems to be a popular choice for binary parsing, but I don't know much beyond that. Is there a particular decision tree that you follow for choosing a parser implementation? Have you learned anything interesting about the strengths or weaknesses of the libraries?
You have several good options.
For lightweight parsing of String types:
parsec
polyparse
For packed bytestring parsing, e.g. of HTTP headers:
attoparsec
For actual binary data most people use either:
binary -- for lazy binary parsing
cereal -- for strict binary parsing
The main question to ask yourself is what is the underlying string type?
String?
bytestring (strict)?
bytestring (lazy)?
unicode text?
That decision largely determines which parser toolset you'll use.
The second question to ask is: do I already have a grammar for the data type? If so, I can just use happy:
The Happy parser generator
And obviously for custom data types there are a variety of good existing parsers:
XML
haxml
xml-light
hxt
hexpat
CSV
bytestring-csv
csv
JSON
json
rss/atom
feed
Just to add to Don's post: Personally, I quite like Text.ParserCombinators.ReadP (part of base) for no-nonsense quick and easy stuff. Particularly when Parsec seems like overkill.
There is a bytestringreadp library for the bytestring version, but it doesn't cover Char8 bytestrings, and I suspect attoparsec would be a better choice at this point.
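As a small illustration of the quick-and-easy style ReadP allows, here is a sketch that parses a "key = value" line with an integer value. The setting and parseSetting names and the input format are invented for this example.

```haskell
import Data.Char (isDigit, isSpace)
import Text.ParserCombinators.ReadP

-- A "key=value" pair where the value is a non-negative integer.
-- Spaces around the '=' and at the end of the line are allowed.
setting :: ReadP (String, Int)
setting = do
  key <- munch1 (\c -> not (isSpace c) && c /= '=')
  skipSpaces
  _ <- char '='
  skipSpaces
  val <- munch1 isDigit
  skipSpaces
  eof
  pure (key, read val)

-- Run the parser and keep only a single parse that consumed all input.
parseSetting :: String -> Maybe (String, Int)
parseSetting s = case readP_to_S setting s of
  [(r, "")] -> Just r
  _         -> Nothing
```

ReadP returns a list of all possible parses and offers no error messages, which is part of why it suits small jobs better than large grammars.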
I recently converted some code from Parsec to Attoparsec. Both are quite capable.
Attoparsec wins on performance and memory footprint, but Parsec provides better error reporting and has more complete documentation.
Bryan O’Sullivan’s blog post What’s in a parser? Attoparsec rewired (2/2) includes a nice performance benchmark comparing several implementations along with some comments comparing memory usage.

Resources