How else but aeson? - parsing

aeson seems to take a somewhat simple-minded approach to parsing JSON: it parses a top-level JSON value (an object or array) into its own fixed representation and then offers facilities to help users convert that representation to their own. This approach works pretty well when JSON objects and arrays are small. When they're very large, things start to fall apart, because user code can't do anything until the JSON values are completely read and parsed. This is particularly unfortunate since JSON seems to be designed for recursive descent parsers; it should be fairly simple to allow user code to step in and say how each piece should be parsed. Is there a deep reason aeson and the earlier json work this way, or should I try to make a new library for more flexible JSON parsing?

json-stream is a stream-based parser. This is a bit out of date (2015), but its authors took the benchmarks from aeson and compared the two libraries: aeson and json-stream performance comparison. There is one case where json-stream is significantly worse than aeson.
If you just want a faster aeson (not streaming), haskell-sajson looks interesting. It wraps a performant C++ library in Haskell and returns Value from aeson.

Related

Parse batch of SequenceExample

There is a function to parse a SequenceExample: tf.parse_single_sequence_example().
But it parses only a single SequenceExample, which is not efficient.
Is there any way to parse a batch of SequenceExamples?
tf.parse_example can parse many Examples.
The documentation for tf.parse_example contains a little information about SequenceExample:
Each FixedLenSequenceFeature df maps to a Tensor of the specified type (or tf.float32 if not specified) and shape (serialized.size(), None) + df.shape. All examples in serialized will be padded with default_value along the second dimension.
But it is not clear how to do that. I have not found any examples on Google.
Is it possible to parse many SequenceExamples using parse_example(), or does some other function exist?
Edit:
Where can I ask the TensorFlow developers whether they plan to implement a parse function for multiple SequenceExamples?
Any help will be appreciated.
If you have many small sequences where batching at this stage is important, I would recommend VarLenFeatures or FixedLenSequenceFeatures with regular Example protos (which, as you note, can be parsed in batches with parse_example). For examples of this, see the unit tests associated with example parsing (testSerializedContainingSparse parses Examples with FixedLenSequenceFeatures).
SequenceExamples are geared more toward cases where there is a significant amount of preprocessing work to be done for each SequenceExample (which can be done in parallel with queues). parse_example does not support SequenceExamples.

How to parse a very large file in F# using FParsec

I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it
might also be worth trying to parse multiple sections within a single
file in parallel. For this to be efficient there must be a fast way to
find the start and end points of such sections. For example, if you
are parsing a large serialized data structure, the format might allow
you to easily skip over segments within the file, so that you can chop
up the input into multiple independent parts that can be parsed in
parallel. Another example could be a programming languages whose
grammar makes it easy to skip over a complete class or function
definition, e.g. by finding the closing brace or by interpreting the
indentation. In this case it might be worth not to parse the
definitions directly when they are encountered, but instead to skip
over them, push their text content into a queue and then to process
that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
The "obvious" thing that comes to mind would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (from your PDF, it looked like a record is a few lines long), then you can make a seq of records, or of batches of 1000 records or so, using normal FileStream reading. This would not need to know the details of a record, but it would be convenient if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
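The batching idea above is language-independent. Here is a minimal sketch in Python rather than F#, assuming a line-oriented file with one record per line. The names parse_record, batches, and parse_file are hypothetical, and the comma-separated record format is only a stand-in for the real one.

```python
import io
from itertools import islice

def parse_record(line):
    # Hypothetical record format: comma-separated fields on one line.
    return line.rstrip("\n").split(",")

def batches(lines, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def parse_file(handle, size=1000):
    """Yield parsed records one at a time, pulling `size` lines per batch."""
    for batch in batches(handle, size):
        for line in batch:
            yield parse_record(line)

# An in-memory stand-in for the large file; a real file object
# iterates over its lines in the same way.
sample = io.StringIO("a,1\nb,2\nc,3\n")
records = list(parse_file(sample, size=2))
```

Only one batch of lines is held in memory at a time, so the consumer decides how much of the file is ever materialized. Each batch could also be handed to a worker pool instead of being parsed inline.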

NSXMLParser vs JSON Parser

What are the pros and cons of NSXMLParser and a JSON parser?
Which one is good in which scenario?
Currently, my app uses NSXMLParser. I'm planning to move to a JSON parser if it is more efficient.
Thanks
NSXMLParser is an "event driven" parser which basically notifies a delegate about the occurrence of certain elements in the XML document.
Event-driven parsers do not create a representation of the XML document by themselves. The actual processing of the elements has to be done by a delegate. Properly using event-driven parsers is elaborate and error-prone, and it takes experience to know how to approach such a task. Well, you know it.
NSJSONSerialization, on the other hand, and all other third-party JSON parsers that I know of, create a Foundation object (an NSArray or NSDictionary) from the JSON input. Parsing a JSON document and getting an NSDictionary or an NSArray object back is a matter of one statement. A few also support the "event driven" mode.
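To make the contrast concrete, here is a rough analogue in Python rather than Objective-C. The standard-library xml.sax handler plays the role of the NSXMLParser delegate, and json.loads plays the role of NSJSONSerialization. The NameCollector class and the sample documents are made up for illustration.

```python
import json
import xml.sax

class NameCollector(xml.sax.ContentHandler):
    """Event-driven style: the handler must track state between callbacks."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False

xml_doc = (b"<people><person><name>Ada</name></person>"
           b"<person><name>Alan</name></person></people>")
handler = NameCollector()
xml.sax.parseString(xml_doc, handler)
xml_names = handler.names

# Tree style: one call returns plain dictionaries and lists.
doc = json.loads('{"people": [{"name": "Ada"}, {"name": "Alan"}]}')
json_names = [person["name"] for person in doc["people"]]
```

Both approaches recover the same two names, but the event-driven version needs explicit state (the flag and the buffer) that the tree-building version gets for free.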
XML is far more complex than JSON. Inherently, a JSON parser is much simpler and also almost always more efficient at parsing documents.
Despite its simplicity, JSON is almost always sufficient to express your data.
So, when you can express your data in JSON, by all means use JSON. If possible, use NSJSONSerialization.
Other third-party JSON parsers may offer additional features, such as an event-driven API, an improved way to handle chunks of data, or more sophisticated options to customize certain edge cases (the handling of the Unicode NULL character, Unicode noncharacters, how to convert JSON numbers, etc.), and may possibly be faster than NSJSONSerialization.
Today, NSJSONSerialization is about as fast as JSONKit. (For some input, JSONKit is a bit faster.) AFAIK, there are two third-party parsers which are almost always faster than NSJSONSerialization for any input, especially on ARM and when it comes to converting numbers. You can expect them to be faster by a factor in the range of 1 to 2. But consider that parsing JSON is almost never the culprit for performance issues.

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable to do binary file format parsing for complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in attoparsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand, if your binary data doesn't fit into this kind of world, then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.
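The "discriminator plus length field" layout described above can be sketched in a few lines. This sketch is in Python with the standard struct module rather than Haskell's Data.Binary, and the wire format is invented for illustration: a 1-byte tag, a 4-byte big-endian length, then the payload, where tag 1 is assumed to hold an unsigned 32-bit integer and tag 2 UTF-8 text.

```python
import struct

HEADER = ">BI"  # hypothetical header: 1-byte tag, 4-byte big-endian length

def parse_records(data):
    """Parse a byte string of tag-length-value records into (tag, value) pairs."""
    offset = 0
    records = []
    while offset < len(data):
        # The discriminator and length tell us what comes next and how much.
        tag, length = struct.unpack_from(HEADER, data, offset)
        offset += struct.calcsize(HEADER)
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        offset += length
        if tag == 1:    # assumed: unsigned 32-bit integer
            (value,) = struct.unpack(">I", payload)
        elif tag == 2:  # assumed: UTF-8 text
            value = payload.decode("utf-8")
        else:
            raise ValueError(f"unknown tag {tag}")
        records.append((tag, value))
    return records

# Build a two-record sample: the integer 42, then the text "hi".
blob = struct.pack(">BII", 1, 4, 42) + struct.pack(">BI", 2, 2) + b"hi"
parsed = parse_records(blob)
```

Because every record announces its own type and size, the parser never needs to backtrack, which is why a simple sequential reader is enough for formats of this kind.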

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse in a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
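For the "more of the same" case, a hand-written reader really is short. Here is a minimal sketch in Python, assuming every line has the form of a tag, a colon, and the values, as in the first sample from the question. The function name parse_tagged is hypothetical.

```python
def parse_tagged(text):
    """Map each 'Tag: values' line to its tag and its raw value string."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split at the first colon only, so values may contain colons.
        tag, _, rest = line.partition(":")
        fields[tag.strip()] = rest.strip()
    return fields

sample = "Name: TheFileName\nValues: 5 3 1 6 1 3\nOther Values: 5 3 1 5 1\n"
fields = parse_tagged(sample)
name = fields["Name"]
values = [int(v) for v in fields["Values"].split()]
```

Anything beyond this shape, such as nested or multi-line values, is where a parser generator starts to pay for itself.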
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user, can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.

Resources