Fastest iOS data format for parsing

I need a data format that will allow me to reduce parsing time to a minimum. In other words, I'm looking for a format with as little overhead as possible that can be parsed in the shortest amount of time.
I am building an application which will pull a lot of data from an API, parse it, and display it to the user. The format should therefore be as small as possible, so that transmission is fast, and also very efficient to parse. What are my options?
Here are a few formats that pop into my head:
XML (a lot of overhead and slow parsing IMO)
JSON (still too cumbersome)
MessagePack (looks interesting)
CSV (with a custom parser written in C)
Plist (fast parsing, a lot of overhead)
... any others?
So currently I'm looking at CSV the most. Any other suggestions?

As stated by Apple in the Property List Programming Guide, the binary plist representation should be the fastest:
Property List Representations
A property list can be stored in one of three different ways: in an XML representation, in a binary format, or in an “old-style” ASCII format inherited from OpenStep. You can serialize property lists in the XML and binary formats. The serialization API with the old-style format is read-only.
XML property lists are more portable than the binary alternative and can be manually edited, but binary property lists are much more compact; as a result, they require less memory and can be read and written much faster than XML property lists. In general, if your property list is relatively small, the benefits of XML property lists outweigh the I/O speed and compactness that comes with binary property lists. If you have a large data set, binary property lists, keyed archives, or custom data formats are a better solution.
You just need to pass the correct format flag, NSPropertyListBinaryFormat_v1_0, when creating or reading the plist. Just be sure that the data you want to store can be represented by this format.
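For illustration, here is a minimal Objective-C sketch of writing and reading a binary plist with NSPropertyListSerialization (the payload dictionary is just example data):

    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            NSDictionary *payload = @{ @"name": @"example", @"ids": @[@1, @2, @3] };
            NSError *error = nil;

            // Serialize using the binary format rather than the XML format.
            NSData *data =
                [NSPropertyListSerialization dataWithPropertyList:payload
                                                           format:NSPropertyListBinaryFormat_v1_0
                                                          options:0
                                                            error:&error];

            // Reading detects the format automatically and reports it back.
            NSPropertyListFormat format;
            NSDictionary *decoded =
                [NSPropertyListSerialization propertyListWithData:data
                                                          options:NSPropertyListImmutable
                                                           format:&format
                                                            error:&error];
            NSLog(@"%@ (binary: %d)", decoded, format == NSPropertyListBinaryFormat_v1_0);
        }
        return 0;
    }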

Related

PostgreSQL Storing Large Json Strings (How to Best Optimize for Memory)

I have three tables, each with an attribute storing large JSON strings as text on a PostgreSQL database (on an AWS RDS instance). These JSON fields represent fabric.js javascript objects for canvas drawing and manipulation. Many of them have svg representations of complex shapes, such as maps.
My initial assumption was that we could somehow compress these fields whenever writing to the DB, and decompress them whenever we need the JSON. I naively wrote a script for the conversion, using a bytea column type and zlib compression. I selected some records (using Rails ActiveRecord) to compare the JSON bytesize with the bytesize of the compressed JSON, confirming that the compressed version was usually about half the bytesize. I wanted a proof of concept that compressing the JSON attribute on one table (notebook_question_answers) would save some memory.
After running my script on all the records for one table, I was very surprised to find out that compressing the JSON fields and converting them to binary type resulted in ADDED memory usage to the database.
So I have two questions:
Does the bytea data type have additional overhead? There are many NULL values in this particular column, and at least 1-4 bytes are allocated for each binary data entry. I read in the PostgreSQL documentation that text is optimized for textual data, so you should use the text data type for it. Otherwise, why is there added memory usage when using compression?
I've started to read about the json vs. jsonb PostgreSQL data types. json offers parsing, and jsonb offers optimized indexing for search, as far as I understand. Would one of these be a good path to take for reducing memory consumption when storing large JSON strings? What would you suggest?
EDIT:
The column in question was originally of data type text, and I changed it to bytea.
After running the following command, I now see that the table takes up less space than before: roughly a 30% reduction in total bytes.
VACUUM (VERBOSE, ANALYZE, FULL) notebook_question_answers;
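For anyone comparing sizes the same way, here is a hedged SQL sketch; only notebook_question_answers comes from the question, while the column name json_field is a hypothetical stand-in:

    -- Per-value size as stored on disk (after any TOAST compression
    -- PostgreSQL applies) versus the raw length of the value:
    SELECT pg_column_size(json_field) AS stored_bytes,
           octet_length(json_field)   AS raw_bytes
    FROM notebook_question_answers
    LIMIT 10;

    -- Total size of the table, including TOAST data and indexes:
    SELECT pg_size_pretty(pg_total_relation_size('notebook_question_answers'));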

XML Serialization VS XML Parsing

What is the difference between XML Serialization and XML Parsing? When should we use each one?
Parsing is, generally speaking, the processing of an input stream into meaningful data structures; in the XML context, parsing is the process of reading a sequence of characters conforming to the grammar and other constraints of the XML spec into whatever internal representation of XML your program uses.
Serialization is the opposite process: processing the internal data structures of a program (in this context, your internal representation of an XML document) and creating a character sequence (typically written to an output stream) that conforms to the angle-bracket syntax of the spec.
Use a parser to read XML from a character stream into data structures; use a serializer to write data structures out into a character stream.
I don't know much about XML, but here's what I know about serialization and parsing.
parsing - reading data (parse-in) from storage, and writing data (parse-out) to storage… "such as a text file"
serializing - (serialize) translating data into a readable format, and (de-serialize) translate that format back to data… "i.e. you want to translate a struct into readable content, stream that content across a network, and translate it back into code."
here's a new one…
marshalling - (marshal and unmarshal) similar to serialize, except marshalling is used to translate data into a different format… "i.e. you want to translate a stream of bytes into a 32-bit structure (one byte to four bytes)"
in easy terms (for beginners)
TL;DR
XML parsing (or XML deserialization) ==> input: valid XML, output: data structures
XML serialization ==> input: data structures, output: valid XML
XML parsing (a.k.a. XML deserialization)
You take a .xml file (example.xml) as input and process it with your programming language of choice, so that your program can do something useful with the data in that file. Your program transforms the information from the file into data structures that your programming language can deal with (i.e. lists, arrays, objects, etc.).
XML serialization
Your program (in any programming language) transforms information represented as data structures (lists, arrays, objects, etc.) into valid XML output, which can be saved to a file or transmitted to another program.
NOTE: Technically, the input (when we are talking about parsing) and the output (when we are talking about serialization) do not have to be files. As said in the more professional answer above, it can be any input/output stream, too. And files don't have to have the .xml extension; they can have any extension that holds a valid XML format (e.g. .svg is also an XML-based format). The key to understanding is that when we do XML parsing we have valid XML on the input side and data structures on the output side, and when we do XML serialization we have data structures on the input side and valid XML on the output side.
To give an example from the Python world: you can use built-in packages (like xml.etree.ElementTree) or third-party libraries (like lxml (recommended) or xmltodict) to do both - parse (deserialize) or create (serialize) XML data.
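A minimal sketch with the built-in xml.etree.ElementTree mentioned above (the XML content here is just example data):

    import xml.etree.ElementTree as ET

    # Parsing (deserialization): valid XML in, data structures out.
    root = ET.fromstring("<items><item id='1'>apple</item></items>")
    print(root[0].attrib["id"], root[0].text)  # -> 1 apple

    # Serialization: data structures in, valid XML out.
    new_root = ET.Element("items")
    ET.SubElement(new_root, "item", id="2").text = "banana"
    print(ET.tostring(new_root, encoding="unicode"))
    # -> <items><item id="2">banana</item></items>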

iOS build settings Property List Output Encoding

What's Property List Output Encoding for? If it is set to binary, does it actually compress the plist files?
If set to binary, Xcode appears to take in any XML property lists and to save them into the application bundle as binary property lists. A binary property list has a more compact file format than an XML property list, but I don't know if you'd call it compressed.
From the Property List Programming Guide:
XML property lists are more portable than the binary alternative and can be manually edited, but binary property lists are much more compact; as a result, they require less memory and can be read and written much faster than XML property lists. In general, if your property list is relatively small, the benefits of XML property lists outweigh the I/O speed and compactness that comes with binary property lists.
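You can observe the difference yourself with the plutil command-line tool, which converts a plist between the two representations (Settings.plist is just a placeholder name):

    plutil -convert binary1 Settings.plist   # rewrite as a binary plist
    plutil -convert xml1 Settings.plist      # convert back to XML

Comparing the file size before and after the conversion shows how much more compact the binary format is for your particular data.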

Delphi TStringList wrapper to implement on-the-fly compression

I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added gives much better performance than having to look up a 'best fit' entry (needed for similarity compression) over the entire set.
Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access-speed will suffer greatly. In one simple implementation, I did a mod 16 on the index, to establish the entry at which prefix-reduction started, which gave me on average about 40% memory savings (this number is completely data-dependant of course).
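A hedged Delphi sketch of the prefix-reduction encoding described above; the type and function names are illustrative, and the mod-16 restart trick is left out for brevity:

    type
      TPrefixEntry = record
        SharedPrefixLen: Integer; // characters reused from the previous string
        Suffix: string;           // the part unique to this string
      end;

    function CommonPrefixLen(const A, B: string): Integer;
    begin
      Result := 0;
      while (Result < Length(A)) and (Result < Length(B))
        and (A[Result + 1] = B[Result + 1]) do
        Inc(Result);
    end;

    function Encode(const Prev, Cur: string): TPrefixEntry;
    begin
      Result.SharedPrefixLen := CommonPrefixLen(Prev, Cur);
      Result.Suffix := Copy(Cur, Result.SharedPrefixLen + 1, MaxInt);
    end;

    function Decode(const Prev: string; const E: TPrefixEntry): string;
    begin
      Result := Copy(Prev, 1, E.SharedPrefixLen) + E.Suffix;
    end;

Decoding an entry only needs the decoded previous string, which is why limiting how far back the chain can recurse keeps random access fast.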
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including the delete operation); in that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:
1. Split each string into words, keep a separate, growing dictionary to reference the words, and store each string internally as a list of word indexes (see the sketch below).
2. Implement something based on the zlib stream available in Delphi, but operating on blocks that each contain, for example, 10-100 strings. In this case you still have to decompress/recompress a complete block, but the "price" you pay is lower.
I don't think you really want to compress TStrings items in memory, because it's terribly inefficient. I suggest you look at the TStream implementations in the ZLib unit. Just wrap a regular stream in TDecompressionStream on load and TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile.
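A hedged sketch of that approach, assuming a Delphi version whose TStrings declares LoadFromStream/SaveToStream as virtual (recent versions do); the class name is illustrative:

    uses
      Classes, ZLib;

    type
      TCompressedStringList = class(TStringList)
      public
        procedure SaveToStream(Stream: TStream); override;
        procedure LoadFromStream(Stream: TStream); override;
      end;

    procedure TCompressedStringList.SaveToStream(Stream: TStream);
    var
      Z: TCompressionStream;
    begin
      Z := TCompressionStream.Create(clDefault, Stream);
      try
        inherited SaveToStream(Z); // the plain text goes through the compressor
      finally
        Z.Free; // flushes the remaining compressed data
      end;
    end;

    procedure TCompressedStringList.LoadFromStream(Stream: TStream);
    var
      Z: TDecompressionStream;
      Buf: TMemoryStream;
      Block: array[0..8191] of Byte;
      Count: Integer;
    begin
      // TStrings.LoadFromStream queries Stream.Size, which a decompression
      // stream cannot report, so decompress into a memory stream first.
      Buf := TMemoryStream.Create;
      try
        Z := TDecompressionStream.Create(Stream);
        try
          repeat
            Count := Z.Read(Block, SizeOf(Block));
            if Count > 0 then
              Buf.WriteBuffer(Block, Count);
          until Count = 0;
        finally
          Z.Free;
        end;
        Buf.Position := 0;
        inherited LoadFromStream(Buf);
      finally
        Buf.Free;
      end;
    end;

With this in place, SaveToFile and LoadFromFile work unchanged, since they route through the stream methods.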

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable for parsing complex binary file formats that involve conditional segments, out-of-order segments, etc.
Is there a way to do this, or a similar alternative package that does? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats, which are primarily meant to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed-length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.
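To make the Data.Binary pattern above concrete, here is a hedged sketch; the Record type, its tags, and the field layout are invented for illustration:

    import Data.Binary (Binary (..), decode, encode)
    import Data.Binary.Get (getByteString, getWord32be, getWord8)
    import Data.Binary.Put (putByteString, putWord32be, putWord8)
    import qualified Data.ByteString as BS

    data Record
      = Heartbeat
      | Payload BS.ByteString
      deriving (Show)

    instance Binary Record where
      put Heartbeat = putWord8 0
      put (Payload bs) = do
        putWord8 1
        putWord32be (fromIntegral (BS.length bs)) -- length field precedes the data
        putByteString bs

      get = do
        tag <- getWord8 -- the discriminator tells us which type to expect
        case tag of
          0 -> pure Heartbeat
          1 -> do
            len <- getWord32be
            Payload <$> getByteString (fromIntegral len)
          _ -> fail ("unknown tag: " ++ show tag)

    main :: IO ()
    main = print (decode (encode (Payload (BS.pack [1, 2, 3]))) :: Record)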
