Framework for building structured binary data parsers?

I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
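For illustration, the kind of code I mean the generator to emit looks roughly like this hand-written Python sketch (the record layout, field names, and scale factor here are made up):

    import struct

    # Hypothetical 12-byte record: uint16 id, uint16 raw_temp, float64 timestamp.
    RECORD = struct.Struct("<HHd")
    TEMP_SCALE = 0.01  # raw counts -> degrees C (made-up scaling)

    def parse_record(buf):
        """Pull raw bytes into a language-level structure and apply scaling."""
        rec_id, raw_temp, ts = RECORD.unpack(buf)
        return {"id": rec_id, "temp_c": raw_temp * TEMP_SCALE, "timestamp": ts}

    print(parse_record(struct.pack("<HHd", 7, 2150, 1.5)))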
What I had used was a Perl script called Jeeves, which worked, but it's general-purpose, so any functions I wanted for manipulating my data I was writing from the ground up.
Are there any frameworks that are well suited to creating parsers for structured binary data? What I've read of ANTLR suggests that it's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas that I should be looking at (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.

You may also look at a relatively new project, Kaitai Struct, which provides a language for this purpose and also has a good IDE:
Kaitai.io
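To give a rough sense of the workflow (not from the original answer; the module and field names are assumptions): you write a small YAML spec, run the compiler, and get a parser class per target language. In Python that looks roughly like:

    # Suppose example.ksy declares a little-endian format with a "magic"
    # field and a u2 "version" field. Compiling it with
    # kaitai-struct-compiler -t python example.ksy yields example.py.
    from example import Example  # hypothetical generated module

    obj = Example.from_file("data.bin")  # from_file comes from the kaitaistruct runtime
    print(obj.version)                   # one generated attribute per declared field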

You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and for automatically generating accessors and generators of such. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.

Related

Adding structure to HDF5 files - Equivalent of NetCDF "Conventions" for HDF5

NetCDF4 has the "Conventions" convention for adding structure to NetCDF files. I'm looking for the analogous thing, but for HDF5 specifically.
My general aim is to add structure to my HDF5 files in a standard way. I want to do something like what HDF5 does with images to define a type, using attributes on groups and datasets, like:
CLASS: IMAGE
IMAGE_VERSION: 1.2
IMAGE_SUBCLASS: IMAGE_TRUECOLOR
...
But as far as I can tell, that image specification is standalone. Maybe I should just reuse the NetCDF "conventions"?
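For reference, writing that kind of attribute is one line per attribute in h5py; a minimal sketch (the dataset name and shape are invented):

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset("frame", data=np.zeros((480, 640, 3), dtype="u1"))
        # Attribute names follow the HDF5 image specification; the spec
        # technically asks for fixed-length ASCII strings, hence np.bytes_.
        dset.attrs["CLASS"] = np.bytes_("IMAGE")
        dset.attrs["IMAGE_VERSION"] = np.bytes_("1.2")
        dset.attrs["IMAGE_SUBCLASS"] = np.bytes_("IMAGE_TRUECOLOR")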
Update:
I'm aware NetCDF4 is implemented on top of HDF5. In this case, we have data from turbulence simulations and experiments, not geo data. This data is usually limited to <= 4D. We use HDF5 for storing this data already, but we have no developed standards. Pseudo-standard formats have just sort of developed organically within the organization.
NetCDF4 files are actually stored using the HDF5 format (http://www.unidata.ucar.edu/publications/factsheets/current/factsheet_netcdf.pdf); however, they use netCDF4 conventions for attributes, dimensions, etc. Files are self-describing, which is a big plus. HDF5 without netCDF4 allows for much more liberty in defining your data. Is there a specific reason that you would like to use HDF5 instead of netCDF4?
I would say that if you don't have any specific constraints (like a model or visualisation software that chokes on netCDF4 files), you'd be better off using netCDF. netCDF4 can be used by the NCO/CDO operators, NCL (NCL also accepts HDF5), IDL, the netCDF4 Python module, Ferret, etc. Personally, I find netCDF4 very convenient for storing climate or meteorological data. There are a lot of operators already written for it, and you don't have to go through the trouble of developing a standard for your own data - it's already done for you. CMOR (http://cmip-pcmdi.llnl.gov/cmip5/output_req.html) can be used to write CF-compliant climate data. It was used for the most recent climate model comparison project.
On the other hand, HDF5 might be worth it if you have another type of data and are looking for some very specific functionality for which you need a more customised file format. Would you mind specifying your needs a little better in the comments?
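To illustrate the "self-describing" point, a minimal netCDF4-python sketch (the variable names, dimensions, and units are made up):

    from netCDF4 import Dataset
    import numpy as np

    with Dataset("turbulence.nc", "w", format="NETCDF4") as nc:
        nc.createDimension("x", 64)
        nc.createDimension("t", None)          # unlimited dimension
        u = nc.createVariable("u", "f4", ("t", "x"))
        u.units = "m s-1"                      # CF-style attributes travel with the data
        u.long_name = "streamwise velocity"
        u[0, :] = np.random.rand(64)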
Update:
Unfortunately, the standards for variable and field names are a little less clear and well-organised for HDF5 files than for netCDF, since netCDF was the format of choice for big climate modelling projects like CMIP or CORDEX. The problem essentially boils down to using EOSDIS or CF conventions, but finding currently maintained libraries that implement these standards for HDF5 files and have clear documentation isn't exactly easy (if it were, you probably wouldn't have asked the question).
If you really just want a standard, NASA explains all the different possible metadata standards in painful detail here: http://gcmd.nasa.gov/add/standards/index.html
For reference, HDF-EOS and HDF5 aren't exactly the same format (HDF-EOS already contains cartography data and is standardised for earth science data), so I don't know whether this format would be too restrictive for you. The tools for working with this format are described here: http://hdfeos.net/software/tool.php and summarized here: http://hdfeos.org/help/reference/HTIC_Brochure_Examples.pdf
If you still prefer to use HDF5, your best bet would probably be to download an HDF5-formatted file from NASA for similar data and use it as a basis to create your own tools in the language of your choice. Here's a list of comprehensive examples using the HDF5, HDF4 and HDF-EOS formats, with scripts for data treatment and visualisation in Python, MATLAB, IDL and NCL: http://hdfeos.net/zoo/index_openLAADS_Examples.php#MODIS
Essentially the problem is that NASA makes tools available so that you can work with their data, but not necessarily so you can re-create similarly structured data in your own lab setting.
Here's some more specs/information about HDF5 for earth science data from NASA:
MERRA product
https://gmao.gsfc.nasa.gov/products/documents/MERRA_File_Specification.pdf
GrADS compatible HDF5 information
http://disc.sci.gsfc.nasa.gov/recipes/?q=recipes/How-to-Read-Data-in-HDF-5-Format-with-GrADS
HDF data manipulation tools on NASA's Atmospheric Science Data Center:
https://eosweb.larc.nasa.gov/HBDOCS/hdf_data_manipulation.html
Hope this helps a little.
The best choice for a standard really depends on the kind of data you want to store. The CF conventions are most useful for measurement data that is georeferenced, for instance data measured by a satellite. It would be helpful to know what your data consists of.
Assuming you have georeferenced data, I think you have two options:
Reuse the CF conventions in HDF like you suggested. There are more people looking into this; a quick Google search gave me this.
HDF-EOS (disclaimer, I have never used it). It stores data in the HDF files using a certain structure but seems to require an extension library to use. I did not find a specification of the structure, only an API. Also there does not seem to be a vibrant community outside NASA.
Therefore I would probably go with option 1: use the CF conventions in your HDF files and see if a 3rd party tool, such as Panoply, can make use of it.
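If you go with option 1, the CF part is again mostly attributes, plus HDF5 dimension scales to play the role of netCDF coordinate variables. A rough h5py sketch (all names invented; make_scale/attach_scale are the h5py dimension-scale API):

    import h5py
    import numpy as np

    with h5py.File("cf_style.h5", "w") as f:
        t = f.create_dataset("time", data=np.arange(10.0))
        t.attrs["units"] = "seconds since 2000-01-01"   # CF-style attribute
        t.make_scale("time")                            # mark as a dimension scale

        u = f.create_dataset("u", data=np.random.rand(10, 64))
        u.attrs["units"] = "m s-1"
        u.attrs["long_name"] = "streamwise velocity"
        u.dims[0].attach_scale(t)                       # netCDF-like dimension link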

What options exist to store configuration data in Delphi?

I want to store and load diverse program data in a Delphi project. This data ranges from simple strings to more complex recurring configuration object data.
As we all know, INI files provide a fast and easy way to store program data, but they are limited to key-value representations.
XML is often the weapon of choice when it comes to requirements like this but I want to know if there is an alternative to XML.
Recently I found superobject for Delphi, which seems to be a lot easier to handle than XML. Is there anything to be said against using JSON for such "non-web" tasks?
Are you aware of other options for storing and loading data in plain text (like INI, XML, JSON) in Delphi?
In fact, it doesn't matter which storage format you choose (INI, XML, JSON, whatever). Build an abstract Configuration class that fits all your needs, and after that think about the concrete class and the concrete storage format, deciding by ease of implementation and maybe human readability.
In some cases you also want to have different configuration aspects (global, machine, user).
With your configuration class you can easily mix them together (use global if not user defined) and can also mix up storing formats (global-config from DB, machine-config from Registry, user-config from file).
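The layering itself can be a simple ordered lookup. A minimal sketch, in Python for brevity (the Delphi version would be your abstract class with one concrete reader per source; all keys and values here are invented):

    from collections import ChainMap

    # Later maps are fallbacks: user overrides machine, machine overrides global.
    user_cfg    = {"theme": "dark"}                                      # e.g. from a file
    machine_cfg = {"db_host": "localhost"}                               # e.g. from the Registry
    global_cfg  = {"theme": "light", "db_host": "db01", "timeout": 30}   # e.g. from a DB

    config = ChainMap(user_cfg, machine_cfg, global_cfg)
    print(config["theme"])    # "dark" (user wins)
    print(config["timeout"])  # 30 (falls through to global)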
Good old INI files work great for me, in combination with the built-in TIniFile and TMemIniFile classes in the IniFiles unit.
Benefits of INI files:
Not binary.
Easier to move from machine to machine than Registry settings.
Easy to inspect and view.
Unlike XML, it's simple and human-readable.
INI files are easy to modify either by hand or by tool, and they're almost bulletproof: whereas it's easy to make malformed JSON or XML that is completely unreadable, it's hard to do more than damage one section of an INI file. Simplicity wins.
Drawbacks:
Unlike XML and the Registry, it's more or less "two levels": sections and items.
TMemIniFile doesn't order the results in any controllable way. If my INI files are generated by a human being, I would like the order of items to be preserved, and TMemIniFile does not preserve order; thus I find I do not love TMemIniFile as much as I love plain old TIniFile.

serialization/deserialization vs raw data buffer

Greetings,
I am working on a distributed pub-sub system expected to have minimum latency. I now have to choose between using serialization/deserialization and a raw data buffer. I prefer the raw data approach as there's almost no overhead, which keeps the latency low. But my colleague argues that I should use marshalling, as the parser will be less complex and less buggy. I stated my concern about latency, but he said it's going to be worth it in the long run, and there are even FPGA devices to accelerate it.
What's your opinions on this?
TIA!
Using a 'raw data' approach, if hardcoded in one language, for one platform, has a few problems when you try to write code on another platform in another language (or sometimes even the same language/platform for a different compiler, if your fields don't have natural alignment).
I recommend using an IDL to describe your message formats. If you pick one that is reducible to "raw data" in your language of choice, then the generated code to access message fields for that language won't do anything but access variables in a structure that is memory-overlayed on your data buffer, but it still represents the metadata in a platform- and language-neutral way, so the generators can produce more complicated code for other platforms.
The downside of picking something that is reducible to a C structure overlay is that it doesn't handle optional fields, doesn't handle variable-length arrays, and may not handle backwards-compatible extensions in the future (unless you just tack them onto the end of the structure). I'd suggest you read about Google's protocol buffers if you haven't yet.
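To make the trade-off concrete, here's what the "structure overlay" end of the spectrum looks like in Python with ctypes (the message layout is hypothetical); note there is no room in it for optional or variable-length fields:

    import ctypes

    class Msg(ctypes.LittleEndianStructure):
        _pack_ = 1                      # no padding: the layout is exactly the wire bytes
        _fields_ = [("msg_id", ctypes.c_uint32),
                    ("price",  ctypes.c_double),
                    ("qty",    ctypes.c_uint32)]

    raw = bytes(16)                     # pretend this came off the network
    msg = Msg.from_buffer_copy(raw)     # zero-parse "deserialization"
    print(msg.msg_id, msg.price, msg.qty)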

Parsing text file (100+ MB) and sending data over network

I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing it and then updating the DB?
I would also need to identify certain tokens within this text file.
Please suggest what language I should use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory as an array. However, for really large files this is usually not practical.
Whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read the file in chunks. Then run your parsing logic over the chunks, and when you get to the end of a chunk, read in the next. Be careful with the parsing logic; it can sometimes be tricky to handle a chunk that ends in a place your parsing is not expecting.
Additionally, a double-buffer system sometimes works well. Process one chunk and, when you get near the end, fill the other buffer with the next chunk. If your parsing is CPU-intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start with just getting the logic working before any performance optimizations.
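A minimal sketch of the chunked-read-with-carry-over pattern in Python (assuming newline-delimited records; chunk size is arbitrary):

    CHUNK_SIZE = 1 << 20  # 1 MiB

    def records(path):
        """Yield newline-delimited records without loading the whole file."""
        carry = b""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunk = carry + chunk
                lines = chunk.split(b"\n")
                carry = lines.pop()      # the last piece may be a partial record
                for line in lines:
                    yield line
        if carry:                        # file didn't end with a newline
            yield carry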
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be to pick one of the languages you already know, probably the one you know best.

Lightweight Store Mechanisms

I'm about to write a small utility to organize and tag my MP3s.
What is the best way to store small amounts of data? More importantly, are there databases where I don't need to install a client/server environment, and can just include the library and be good to go?
I could use XML, but I'm afraid the file size would become large and hard to handle, not to mention hard on the memory footprint.
Thanks
EDIT: I haven't decided on the language, I wanted to make my decision independent of platform. If I had to choose, most likely .NET, second Java, third C++.
My apologies, this is for a Windows App.
On Windows you can use the built-in ESENT database engine. There is an API you can use from C++:
http://blogs.msdn.com/windowssdk/archive/2008/10/23/esent-extensible-storage-engine-api-in-the-windows-sdk.aspx
There is also a managed interop layer that you can use from C# code:
http://www.codeplex.com/ManagedEsent
Which language/platform are you talking about?
In the Java world I prefer using embedded databases such as HSQLDB, H2 or JavaDB (f.k.a. Derby).
They don't need installing and still provide the simple access you're used to from a "real" DBMS.
In the C/Python/Unixy world SQLite is a hot contender in that area.
Another option is the various forms of the Berkeley database (e.g., db3, db4, SleepyCat).
SQLite, if you want the pain of a relational DB without a server install or hassle.
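In some languages the embedded-database option really is just an import away; for example, Python ships sqlite3 in the standard library. A rough sketch for the MP3-tag case (the table layout is invented):

    import sqlite3

    con = sqlite3.connect("tags.db")          # a single file, no server
    con.execute("""CREATE TABLE IF NOT EXISTS tracks
                   (path TEXT PRIMARY KEY, artist TEXT, album TEXT, title TEXT)""")
    con.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?)",
                ("song.mp3", "Artist", "Album", "Title"))
    con.commit()
    for row in con.execute("SELECT title FROM tracks WHERE artist = ?", ("Artist",)):
        print(row[0])
    con.close()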
I would use one of the many text-serialization formats. I personally think that YAML 1.1 is the most powerful (built-in support for referential object graphs) and the easiest for a human to read and modify (parsing is a bear, though; use a library such as PyYAML or JYaml or some .NET library).
Otherwise XML or JSON are adequate file formats.
Whichever format you use, just compress the file if you're concerned about disk usage. If you're worried about in-memory usage, then I don't see how your serialization format matters...
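For what it's worth, a minimal round-trip with PyYAML looks like this (the tag structure is invented for illustration):

    import yaml  # PyYAML

    tags = {"song.mp3": {"artist": "Artist", "rating": 5}}

    with open("tags.yaml", "w") as f:
        yaml.safe_dump(tags, f, default_flow_style=False)

    with open("tags.yaml") as f:
        print(yaml.safe_load(f))  # {'song.mp3': {'artist': 'Artist', 'rating': 5}}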
Have a look at Prevayler - it's a serialization persistence framework (use XStream etc. if you want your data human-readable), which is really fast, does not require annotations and "just works". Some basic info:
It does impose a more rigorous transaction pattern, as it does not give you automatic rollback:
1. Ensure the transaction will succeed given the current state of the system - e.g. does it make sense now?
2. The transaction is added to a queue and stored (to survive power reset etc.)
3. The transaction is executed and applied to the object structure.
Writes in the 1000's of transactions/sec
Reads in the 100,000's of transactions/sec
I haven't used it much, but it's sooo much nicer to use for small projects (persisting any serializable object is so nice)
Oh - as for everyone asking "what platform are you running on?": Prevayler (Java) has/had ports to quite a few platforms, but I can't find a decent list :(. I remember that there were around 5-7, but I can only remember .NET.
If you're planning on storing everything in memory while your program does work on it, then serializing to a file using basic load() and save() functions that you write would be fine, and less pain than a full-on DB.
In Java that can be done using standard Serialization (or can serialize to and from XML to make it somewhat human readable and editable).
It shouldn't affect your memory footprint at all as it is merely saving and restoring your objects. You just won't get transactions and random access and queries and all that good stuff.
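Sketched in Python rather than Java (pickle standing in for standard serialization; the data is invented), that basic load()/save() approach is just:

    import pickle

    def save(obj, path):
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    library = {"song.mp3": {"artist": "Artist", "tags": ["rock"]}}
    save(library, "library.pkl")
    assert load("library.pkl") == library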
You could even use XML, JSON, an .ini file... even a plain text file.
I would advise a SQL-like database (such as SQLite). Today your requirements might make a full SQL database seem silly, but you never know how much this "little project" will grow over the years. When it does grow to the point where you have to have a SQL engine, you will be glad you didn't just serialize some Java objects or store stuff in JSON format.
