How to parse a very large file in F# using FParsec

I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it
might also be worth trying to parse multiple sections within a single
file in parallel. For this to be efficient there must be a fast way to
find the start and end points of such sections. For example, if you
are parsing a large serialized data structure, the format might allow
you to easily skip over segments within the file, so that you can chop
up the input into multiple independent parts that can be parsed in
parallel. Another example could be a programming languages whose
grammar makes it easy to skip over a complete class or function
definition, e.g. by finding the closing brace or by interpreting the
indentation. In this case it might be worth not to parse the
definitions directly when they are encountered, but instead to skip
over them, push their text content into a queue and then to process
that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)

The "obvious" thing that comes to mind would be to pre-process the file using something like File.ReadLines and then parse one line at a time.
If this doesn't work (from your PDF, it looked like a record spans a few lines), then you can build a seq of records, or of batches of 1000 records or so, using normal FileStream reading. This would not require knowing the details of a record, but it helps if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
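A minimal sketch of the batching idea, written in Python only to keep the example language-neutral. The names batches, parse_record and parse_file are hypothetical, and parse_record is a stand-in for the real per-record parser. In F#, File.ReadLines combined with Seq.chunkBySize should play a similar role.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def batches(lines: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items without reading ahead."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parse_record(line: str) -> List[str]:
    # Hypothetical stand-in for the real per-record parser.
    return line.rstrip("\n").split(",")


def parse_file(handle: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Lazily parse a line-oriented input, one batch at a time."""
    for batch in batches(handle, size):
        for line in batch:
            yield parse_record(line)


# A small in-memory file stands in for the 61 GB input.
sample = io.StringIO("a,1\nb,2\nc,3\n")
records = list(parse_file(sample, size=2))
```

Only one batch is held in memory at a time, so each batch could also be handed to a worker pool instead of being parsed inline.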

Related

Multiple parsers calling each other?

I am working on a complicated system that uses a number of XML schemas and associated parsers. One of the schemas is used to hold general data that are accessed by all of the other schemas. I would like to maintain this division in the (flex and bison) parsers. So, if I parse the main XML file and get to, say, the tag <matrix>, I would like to call a <matrix> parser as a subroutine, return its content to the calling program and continue parsing there after the </matrix> tag. I have been looking around the net, but I have not found anything useful. Is it even possible to do this?
It seems easiest to maintain the common pieces in a separate file and to split the individual parser components into two more files: Part 1 has the Prologue and the individual grammar rules, part 2 has the epilogue. Then the three files can be concatenated (in a Makefile) before calling the parser:
parser.y: parser.part1 common.inc parser.part2
	cat parser.part1 common.inc parser.part2 >parser.y
Your approach is wrong. You shouldn't need a special parser for each distinct tag. Instead, parse all tags regardless of their properties and link them into a tree. Afterwards you can validate the tree to ensure that the tags are nested consistently. If the markup language you're talking about is really that special, then you could create a parser that takes rules describing each tag. In this case parsing and checking are done at the same time; most HTML parsers are implemented like this.
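The "parse everything into a tree, then validate" approach might look roughly like the following. This is a Python sketch using the standard xml.etree.ElementTree module rather than flex and bison; the ALLOWED_CHILDREN rule table and the tag names other than matrix are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: which child tags each parent tag may contain.
ALLOWED_CHILDREN = {
    "document": {"matrix", "title"},
    "matrix": {"row"},
    "row": set(),
    "title": set(),
}


def validate(element, errors=None):
    """Walk the already-parsed tree and collect nesting violations."""
    if errors is None:
        errors = []
    allowed = ALLOWED_CHILDREN.get(element.tag)
    if allowed is None:
        errors.append(f"unknown tag <{element.tag}>")
        return errors
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        validate(child, errors)
    return errors


good = ET.fromstring(
    "<document><title>t</title><matrix><row>1 2</row></matrix></document>"
)
bad = ET.fromstring("<document><row>1 2</row></document>")
good_errors = validate(good)
bad_errors = validate(bad)
```

Parsing knows nothing about individual tags here; all tag-specific knowledge lives in the rule table that the validation pass consults.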

What is the process for saving erlang values to a file and loading them back?

For example, I have a list I want to save as a file that has a lot of other erlang types. Then I want to load it back into a process. What would I use? io_lib:format("~P", [Term]) with io:write and then file:consult?
Yes. Note that you need a trailing dot for each term, and that file:consult returns a list of all dot-terminated terms in the file. So if you only have one term, the code would look like:
ok = file:write_file("myfile", io_lib:format("~p.~n", [Term])),
{ok, [Term]} = file:consult("myfile").
As an alternative to legoscia's solution, you can also write the result of erlang:term_to_binary/1 to a file and read it back with erlang:binary_to_term/1. There are a few caveats with this approach, though:
The file will not be human-readable (at least not easily)
You can't store multiple terms easily because erlang:term_to_binary/1 can produce null-characters and newlines, which can create problems with parsing. There are a few ways to get around this, though:
base64 encode the terms and separate by newline
store your terms inside of another term. For instance, if you have three terms you want to store, use erlang:term_to_binary({T1, T2, T3})
There's no handy file:consult equivalent for term_to_binary, so you have to explicitly read (as a binary) and then run binary_to_term
So why would you bother with erlang:term_to_binary/1 at all? Two reasons:
Space efficiency (in most cases)
Parsing-speed (faster to parse term_to_binary than a human-readable term)
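The "base64 encode and separate by newline" workaround is not specific to Erlang. As a rough illustration in Python, with arbitrary byte strings standing in for the output of term_to_binary and with hypothetical helper names write_blobs and read_blobs:

```python
import base64
import os
import tempfile
from typing import Iterable, List


def write_blobs(path: str, blobs: Iterable[bytes]) -> None:
    """Write each binary blob as one base64-encoded line."""
    with open(path, "wb") as f:
        for blob in blobs:
            # The base64 alphabet contains no newline, so "\n" is a safe separator.
            f.write(base64.b64encode(blob) + b"\n")


def read_blobs(path: str) -> List[bytes]:
    """Read the blobs back, one per line."""
    with open(path, "rb") as f:
        return [base64.b64decode(line.strip()) for line in f]


# These blobs contain null bytes and newlines, which would break naive line splitting.
blobs = [b"\x00\n\xff", b"second\nterm"]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "terms.b64")
    write_blobs(path, blobs)
    restored = read_blobs(path)
```

The price is the size overhead of base64 (about one third), which eats into the space advantage mentioned above.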

Incremental Parsing from Handle in Haskell

I'm trying to interface Haskell with a command line program that has a read-eval-print loop. I'd like to put some text into an input handle, and then read from an output handle until I find a prompt (and then repeat). The reading should block until a prompt is found, but no longer. Instead of coding up my own little state machine that reads one character at a time until it constructs a prompt, it would be nice to use Parsec or Attoparsec. (One issue is that the prompt changes over time, so I can't just check for a constant string of characters.)
What is the best way to read the appropriate amount of data from the output handle and feed it to a parser? I'm confused because most of the handle-reading primitives require me to decide beforehand how much data I want to read. But it's the parser that should decide when to stop.
You seem to have two questions wrapped up in here. One is about incremental parsing, and one is about incremental reading.
Attoparsec supports incremental parsing directly. See the IResult type in Data.Attoparsec.Text. Parsec, alas, doesn't. You can run your parser on what you have, and if it gives an error, add more input and try again, but you really don't know whether the error was an unrecoverable parse error or just a need for more input.
In your case, REPLs usually read one line at a time. Hence you can use hGetLine to read a line and pass it to Attoparsec; if it parses, evaluate it, and if not, get another line.
If you want to see all this in action, I do this kind of thing in Plush.Job.Output, but with three small differences: 1) I'm parsing byte streams, not strings. 2) I've set it up to pull as much as is available from the input and parse as many items as I can. 3) I'm reading directly from file descriptors. But the same structure should help you do it in your situation.
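The feed-until-done loop can be sketched independently of any parser library. The Python below is only an illustration of the control flow, not Attoparsec's API: try_parse returns None to mean "need more input", and the prompt pattern (a name, a counter in brackets, then "> ") is an assumption made up for the example.

```python
import io
import re
from typing import Optional, Tuple

# Assumed prompt shape for illustration: "name[counter]> ".
PROMPT = re.compile(r"\w+\[\d+\]> ")


def try_parse(buffer: str) -> Optional[Tuple[str, str, str]]:
    """Return (output, prompt, leftover) once a prompt is seen, else None."""
    match = PROMPT.search(buffer)
    if match is None:
        return None  # "partial": the caller must supply more input
    return buffer[:match.start()], match.group(), buffer[match.end():]


def read_until_prompt(stream, chunk_size: int = 4) -> Tuple[str, str, str]:
    """Feed chunks to try_parse until it reports a complete result."""
    buffer = ""
    while True:
        # With a real pipe, use a read that returns whatever is available
        # (for example os.read on the file descriptor) so this never blocks
        # waiting for a full chunk after the prompt has already arrived.
        chunk = stream.read(chunk_size)
        if chunk == "":
            raise EOFError("stream ended before a prompt was found")
        buffer += chunk
        result = try_parse(buffer)
        if result is not None:
            return result


fake_output = io.StringIO("42\nrepl[7]> ")
output, prompt, leftover = read_until_prompt(fake_output)
```

The reader never decides how much input is enough; it keeps supplying chunks until the parser says it is done, which is the division of labour the question asks for.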

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable to do binary file format parsing for complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in attoparsec, which I think was designed for this purpose.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
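To make the "discriminator plus length field" structure concrete, here is a small sketch in Python using the standard struct module. It is not Data.Binary, and the record layout is invented for the example; the names get and put merely mirror the roles of the two methods described above.

```python
import struct

# Hypothetical record layout, big-endian:
#   1 byte  tag      (1 = integer, 2 = text)
#   tag 1:  4-byte signed integer
#   tag 2:  2-byte length, then that many UTF-8 bytes


def put(value) -> bytes:
    """Encode an int or a str according to the layout above."""
    if isinstance(value, int):
        return struct.pack(">Bi", 1, value)
    data = value.encode("utf-8")
    return struct.pack(">BH", 2, len(data)) + data


def get(buf: bytes, offset: int = 0):
    """Decode one value; return (value, offset of the next record)."""
    (tag,) = struct.unpack_from(">B", buf, offset)
    offset += 1
    if tag == 1:
        (number,) = struct.unpack_from(">i", buf, offset)
        return number, offset + 4
    if tag == 2:
        (length,) = struct.unpack_from(">H", buf, offset)
        offset += 2
        return buf[offset:offset + length].decode("utf-8"), offset + length
    raise ValueError(f"unknown tag {tag}")


encoded = put(-5) + put("héllo")
first, pos = get(encoded)
second, end = get(encoded, pos)
```

Because the tag and the length always come first, the decoder never has to guess or backtrack, which is why a very simple parser is enough for formats of this kind.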
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
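For the simple "tag, colon, values" case, hand-written code can be as short as the following Python sketch. The function name parse_tagged_lines is hypothetical, and the sample input is the one from the question; real files may need more error handling.

```python
def parse_tagged_lines(text: str) -> dict:
    """Split each 'Tag: value value ...' line into a tag and its values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, _, rest = line.partition(":")
        result[tag.strip()] = rest.split()
    return result


sample = """Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
"""
parsed = parse_tagged_lines(sample)
name = parsed["Name"][0]
values = [int(v) for v in parsed["Values"]]
```

If every line follows this shape, a parser generator adds little; it starts to pay off once lines have nested or substantially different structure.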
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user, can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.

Resources