Adding structure to HDF5 files - Equivalent of NetCDF "Conventions" for HDF5

NetCDF4 has the Conventions convention for adding structure to NetCDFs. I'm looking for the analogous thing but for HDF5 specifically.
My general aim is to add structure to my HDF5 files in a standard way. I want to do something like what HDF5 does with images to define a type, using attributes on groups and datasets, like:
CLASS: IMAGE
IMAGE_VERSION: 1.2
IMAGE_SUBCLASS: IMAGE_TRUECOLOR
...
But as far as I can tell, that image specification stands alone. Maybe I should just reuse the NetCDF "Conventions"?
Update:
I'm aware NetCDF4 is implemented on top of HDF5. In this case, we have data from turbulence simulations and experiments, not geo data. This data is usually limited to <= 4D. We already use HDF5 for storing this data, but we have no developed standards. Pseudo-standard formats have just sort of developed organically within the organization.
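For concreteness, here is a minimal sketch (using h5py) of the attribute-based typing I have in mind, modelled on the image spec above; the TURBULENCE_FIELD names are made up for illustration, not an existing standard.
# Sketch: tag datasets with convention attributes, like the HDF5 Image spec does
import numpy as np
import h5py

with h5py.File("run_0001.h5", "w") as f:
    # hypothetical file-level "Conventions"-style marker
    f.attrs["CONVENTIONS"] = "MYLAB-TURB-1.0"

    grp = f.create_group("velocity")
    dset = grp.create_dataset("u", data=np.zeros((64, 64, 64), dtype="f4"))

    # attribute-based typing, analogous to CLASS/IMAGE_SUBCLASS for images
    dset.attrs["CLASS"] = "TURBULENCE_FIELD"
    dset.attrs["FIELD_VERSION"] = "1.0"
    dset.attrs["FIELD_SUBCLASS"] = "VELOCITY_COMPONENT"
    dset.attrs["UNITS"] = "m s-1"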

NetCDF4 files are actually stored using the HDF5 format (http://www.unidata.ucar.edu/publications/factsheets/current/factsheet_netcdf.pdf); however, they use netCDF4 conventions for attributes, dimensions, etc. Files are self-describing, which is a big plus. HDF5 without netCDF4 allows for much more liberty in defining your data. Is there a specific reason that you would like to use HDF5 instead of netCDF4?
I would say that if you don't have any specific constraints (like a model or visualisation software that chokes on netCDF4 files), you'd be better off using netCDF. netCDF4 can be used by NCO/CDO operators, ncl (ncl also accepts HDF5), idl, the netCDF4 python module, ferret, etc. Personally, I find netCDF4 to be very convenient for storing climate or meteorological data. There are a lot of operators already written for it, and you don't have to go through the trouble of developing a standard for your own data - it's already done for you. CMOR (http://cmip-pcmdi.llnl.gov/cmip5/output_req.html) can be used to write CF compliant climate data. It was used for the most recent climate model comparison project.
On the other hand, HDF5 might be worth it if you have another type of data and you are looking for some very specific functionality for which you need a more customised file format. Would you mind specifying your needs a little better in the comments?
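To illustrate what "it's already done for you" looks like in practice, here is a minimal sketch with the netCDF4 python module; the attribute names (Conventions, units, standard_name) come from CF, while the variable and dimension names are just illustrative.
# Sketch: writing a small CF-style file with the netCDF4 python module
import numpy as np
from netCDF4 import Dataset

with Dataset("example.nc", "w", format="NETCDF4") as nc:
    nc.Conventions = "CF-1.8"

    nc.createDimension("time", None)   # unlimited dimension
    nc.createDimension("x", 128)

    t = nc.createVariable("time", "f8", ("time",))
    t.units = "seconds since 2020-01-01 00:00:00"
    t.standard_name = "time"

    u = nc.createVariable("u", "f4", ("time", "x"))
    u.units = "m s-1"
    u.long_name = "streamwise velocity"

    t[:] = [0.0, 1.0]
    u[:] = np.zeros((2, 128), dtype="f4")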
Update:
Unfortunately, the standards for variable and field names are a little less clear and well-organised for HDF5 files than for netCDF, since netCDF was the format of choice for big climate modelling projects like CMIP or CORDEX. The problem essentially boils down to using EOSDIS or CF conventions, but finding currently maintained libraries that implement these standards for HDF5 files and have clear documentation isn't exactly easy (if it were, you probably wouldn't have asked the question).
If you really just want a standard, NASA explains all the different possible metadata standards in painful detail here: http://gcmd.nasa.gov/add/standards/index.html.
For reference, HDF-EOS and HDF5 aren't exactly the same format (HDF-EOS contains cartography data and is standardised for earth science data), so I don't know whether this format would be too restrictive for you. The tools for working with this format are described here: http://hdfeos.net/software/tool.php and summarized here: http://hdfeos.org/help/reference/HTIC_Brochure_Examples.pdf.
If you still prefer to use HDF5, your best bet would probably be to download an HDF5-formatted file from NASA for similar data and use it as a basis to create your own tools in the language of your choice. Here's a list of comprehensive examples using HDF5, HDF4 and HDF-EOS formats with scripts for data treatment and visualisation in Python, MATLAB, IDL and NCL: http://hdfeos.net/zoo/index_openLAADS_Examples.php#MODIS
Essentially the problem is that NASA makes tools available so that you can work with their data, but not necessarily so you can re-create similarly structured data in your own lab setting.
Here are some more specs/information about HDF5 for earth science data from NASA:
MERRA product
https://gmao.gsfc.nasa.gov/products/documents/MERRA_File_Specification.pdf
GrADS compatible HDF5 information
http://disc.sci.gsfc.nasa.gov/recipes/?q=recipes/How-to-Read-Data-in-HDF-5-Format-with-GrADS
HDF data manipulation tools on NASA's Atmospheric Science Data Center:
https://eosweb.larc.nasa.gov/HBDOCS/hdf_data_manipulation.html
Hope this helps a little.

The best choice for a standard really depends on the kind of data you want to store. The CF conventions are most useful for measurement data that is georeferenced, for instance data measured with a satellite. It would be helpful to know what your data consists of.
Assuming you have georeferenced data, I think you have two options:
Reuse the CF conventions in HDF5 like you suggested. More people are looking into this; a quick Google search gave me this.
HDF-EOS (disclaimer, I have never used it). It stores data in the HDF files using a certain structure but seems to require an extension library to use. I did not find a specification of the structure, only an API. Also there does not seem to be a vibrant community outside NASA.
Therefore I would probably go with option 1: use the CF conventions in your HDF files and see if a 3rd party tool, such as Panoply, can make use of it.
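As a rough sketch of option 1 using h5py, you could write CF-style attributes and HDF5 dimension scales directly (dimension scales are what netCDF4 uses under the hood for coordinates); whether a given tool such as Panoply actually picks this up is something you'd have to test, and the variable names below are only illustrative.
# Sketch: CF-style metadata and dimension scales written directly with h5py
import numpy as np
import h5py

with h5py.File("cf_style.h5", "w") as f:
    f.attrs["Conventions"] = "CF-1.8"

    x = f.create_dataset("x", data=np.linspace(0.0, 1.0, 128))
    x.attrs["units"] = "m"
    x.make_scale("x")          # mark as a dimension scale, like a netCDF coordinate variable

    u = f.create_dataset("u", data=np.zeros(128, dtype="f4"))
    u.attrs["units"] = "m s-1"
    u.attrs["long_name"] = "streamwise velocity"
    u.dims[0].attach_scale(x)  # associate the coordinate with the data variable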

Related

Where is the pydata BLAZE project heading?

I find the blaze ecosystem* amazing because it covers most of the data engineering use cases. There was definitely a lot of interest in these projects during the 2015-2016 period, but of late they have been neglected. I say this based on the commits on the GitHub repos.
So my questions to the community are:
- What happened in 2016 that resulted in lost interest?
- Are there other python based libraries that have replaced blaze?
*blaze ecosystem:
Blaze: An interface to query data on different storage systems
Dask: Parallel computing through task scheduling and blocked algorithms
Datashape: A data description language
DyND: A C++ library for dynamic, multidimensional arrays
Odo: Data migration between different storage systems
references:
http://blaze.pydata.org/
I can give some part of the picture, although others were more involved.
Blaze was both an umbrella project for incubating data-engineering ideas into released oss packages, and a package itself focussing on symbolic manipulations of data-frames and translating these into various backend execution engines, particularly database services. Critically, Blaze wanted to be the (start of a) solution for a very broad range of problems! In particular, the translation layer became very large and hard to maintain and by trying to cater to all, limited the range of operations that the symbolic layer could offer.
In terms of an umbrella project, Blaze was a success. Many ideas that started in Blaze percolated into the ecosystem. Probably the most prominent single project to come out of Blaze is Dask, which, while originally planned as an execution layer for Blaze, implements an even larger API of data-frame operations, as well as other high-level collections and arbitrary graph manipulation. Even fully symbolic optimisations exist in Dask, though this is perhaps not as complete. Other Anaconda-stable projects such as numba and bokeh were influenced by the Blaze effort, but I'll not talk about them here.
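As a hedged illustration of the Dask dataframe API mentioned here: operations build a task graph lazily and only execute on .compute(); the CSV paths and column names below are hypothetical.
# Sketch: lazy, task-scheduled dataframe operations with Dask
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")                 # lazily references many files
result = df.groupby("user_id")["value"].mean()   # still symbolic, nothing computed yet
print(result.compute())                          # executes the task graph in parallel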
As far as datashape/dynd go, this is a somewhat crowded space with many other related projects (xnd, uarray, etc.) and ideas that can be loosely thought of as "numpy 2" (i.e., a more comprehensive, flexible representation of complex data layouts and their description). This has not really been adopted by the community yet; almost everything uses numpy's type system (a notable exception being what arrow does internally).
Finally, for data formats and Odo, I encourage you to consider Intake, which can be seen as a successor and which offers much more functionality, such as data source cataloging; it does this by limiting the scope of operation to the read side. The big web of interactions that is Odo was also a many-to-many problem that became hard to maintain, and by keeping things simpler, Intake is hoping to become the de facto layer over data loading libraries and the main way to describe the location, description and parametrisation of data. Odo is not dead, though, so if file conversion is exactly what you need, you can still use it.
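A rough sketch of the Intake usage pattern described above; the catalog file and entry name are hypothetical.
# Sketch: cataloged data loading with Intake
import intake

cat = intake.open_catalog("catalog.yml")   # YAML catalog describing data sources
source = cat.my_table                      # hypothetical entry defined in the catalog
print(source.discover())                   # metadata: dtypes, shape, etc.
df = source.read()                         # load into a pandas DataFrame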
I was looking for a project similar to odo for loading csv data to various sources. An odo issue (https://github.com/blaze/odo/issues/614) recommended d6tstack, which appears to be currently maintained.
In practice, it is often just as easy to roll your own csv loader, in which case the tableschema project is very handy. It automates inferring datatypes from csv files.
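If I remember the tableschema API correctly, the usage is roughly as follows; 'data.csv' is a placeholder path.
# Sketch: inferring column types from a CSV with tableschema
from tableschema import Table

table = Table("data.csv")
table.infer()                      # sample rows and guess column types
print(table.schema.descriptor)     # e.g. {'fields': [{'name': ..., 'type': 'integer'}, ...]}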

Extracting Specific Information from Scientific Papers

I am looking for specific information that I need to extract from scientific papers. The information mostly resides in the "Evaluation" or "Implementation" sections of the papers. I need to extract any function names, parameters, file names, application names, and application versions in the content.
Is there any NLP technique/machine learning algorithm to do this type of information extraction from scientific papers?
I'm not aware of any off-the-shelf applications that do this specific task (although that does not mean there isn't one, and there may be commercial solutions to do this). But there are open source options that would probably allow you to do what you want with a bit of work (annotation and/or rule-writing):
GATE (has a "user-friendly" graphical interface so you don't need to code if you don't want to)
Reverb
Stanford OpenIE
Canary (geared towards clinical NLP by the looks of it, but could be more generally applicable)
GROBID (this looks like it could be of use to segment the articles into sections)
Alternatively, you could build your own solution on top of libraries like NLTK or spaCy (if you code in Python) or Stanford CoreNLP (Java). It sounds like you would need to first identify document sections and then search for patterns within them. Whether you adopt a machine learning or rule-based approach, this will probably take a fair bit of work. If you have a predefined list of items you are looking for, that will make your life far easier!
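As a rough rule-based starting point, here is a sketch with spaCy's Matcher for picking out version strings and file names from section text; the patterns and the example sentence are illustrative and would need tuning against real papers.
# Sketch: rule-based extraction of versions and file names with spaCy's Matcher
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")                     # tokenizer only, no trained model needed
matcher = Matcher(nlp.vocab)

# e.g. "version 2.3.1" or "v 1.0"
matcher.add("VERSION", [
    [{"LOWER": {"IN": ["version", "v"]}}, {"TEXT": {"REGEX": r"^\d+(\.\d+)*$"}}],
])
# e.g. "config.yaml", "run_model.py"
matcher.add("FILENAME", [
    [{"TEXT": {"REGEX": r"^[\w\-]+\.(py|c|cpp|java|json|yaml|cfg|txt)$"}}],
])

doc = nlp("We ran run_model.py with TensorFlow version 2.3.1 and config.yaml.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)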

Does it make sense to interrogate structured data using NLP?

I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense to me: how can IFC models be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS-based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for such type of data? I don't see NLP relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc. The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of the time the data would not be found in sentences, but instead in HTML structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from html be better (or even possible) with machine vision where the CSS styling might make it easier to differentiate important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help I think.
Many programming languages have external libraries that generate canonical date-stamps from various formats (e.g. in Java, using SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but dates can be expressed in only a small number of variations, so writing down the regular expressions for a few (let's say half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
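As an illustrative Python take on that idea (the answer above mentions Java's SimpleDateFormat; the patterns below are just a starting point and real pages would need more of them):
# Sketch: regex date spotting plus dateutil for canonicalisation
import re
from dateutil import parser

DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",                       # 2013-02-01
    r"\b\d{1,2}/\d{1,2}/\d{4}\b",                   # 01/02/2013
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b",
]

def extract_dates(text):
    found = []
    for pattern in DATE_PATTERNS:
        for match in re.findall(pattern, text):
            found.append(parser.parse(match))       # canonical datetime object
    return found

print(extract_dates("2013-02-01 - Name of Event - Arena Name"))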
Extraction of places and names is harder, however. This is where natural language processing will have to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open source NER systems is Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER system performs well when the places and names you extract occur in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.
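For a quick feel of what off-the-shelf NER gives you (using spaCy's small English model here rather than Stanford NER, and a made-up sentence):
# Sketch: off-the-shelf named entity recognition on sentence-like text
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("The concert with John Smith takes place at Madison Square Garden on 1 February 2013.")
for ent in doc.ents:
    print(ent.text, ent.label_)      # e.g. PERSON, FAC/ORG, DATE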

Framework for building structured binary data parsers?

I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
What I had used was a Perl script called Jeeves, which worked, but it's general purpose, and any functions I wanted for manipulating my data I had to write from the ground up.
Are there any frameworks that are well-suited for creating parsers for structured binary data? What I've read of Antlr suggests that that's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas that I should be looking it (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.
Also, you may look at a relatively new project, Kaitai Struct, which provides a language for that purpose and also has a good IDE:
Kaitai.io
You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and automatically generating accessors and generators of such. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.
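For comparison with what these generator frameworks produce, here is a hand-rolled Python sketch using the standard struct module for a hypothetical header-plus-records layout (the field names and formats are made up):
# Sketch: hand-rolled parser for a hypothetical binary record layout
import struct
from collections import namedtuple

Header = namedtuple("Header", "magic version num_records")
Record = namedtuple("Record", "timestamp value")

HEADER_FMT = "<4sHI"   # little-endian: 4-byte magic, u16 version, u32 record count
RECORD_FMT = "<Qd"     # u64 timestamp, f64 value

def parse(buf):
    offset = 0
    header = Header._make(struct.unpack_from(HEADER_FMT, buf, offset))
    offset += struct.calcsize(HEADER_FMT)
    records = []
    for _ in range(header.num_records):
        records.append(Record._make(struct.unpack_from(RECORD_FMT, buf, offset)))
        offset += struct.calcsize(RECORD_FMT)
    return header, records

# round-trip check with synthetic bytes
blob = (struct.pack(HEADER_FMT, b"DATA", 1, 2)
        + struct.pack(RECORD_FMT, 10, 1.5)
        + struct.pack(RECORD_FMT, 11, 2.5))
print(parse(blob))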
