Is it possible to create an HDF5 virtual dataset (VDS) using collective I/O? - hdf5

I am trying to link four different HDF5 files into a single virtual dataset (VDS) collectively; by collectively, I mean every process calls H5Pset_virtual(...) for its own local file. Is it possible to create VDS files in this way? I have searched lots of VDS tutorials and documentation by the HDF5 Group, but cannot find such a feature or example.

No. In the current version of HDF5 (1.10.1) there is no communication of metadata between processes, so if you want to use VDS, a single process must be responsible for creating the VDS file.
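For illustration, here is a minimal sketch of that single-writer pattern using h5py and mpi4py (the file names, dataset name, and shapes are placeholders): each rank writes its own file in an earlier parallel phase, and only rank 0 builds the VDS.

    # Minimal sketch: each rank has already written its own file
    # (part_0.h5 ... part_3.h5, each holding a 1-D dataset "data").
    # Only rank 0 creates the virtual dataset; there is no collective
    # equivalent of H5Pset_virtual in HDF5 1.10.x.
    import h5py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n_files, n_rows = 4, 100                # placeholder sizes

    if comm.rank == 0:
        layout = h5py.VirtualLayout(shape=(n_files, n_rows), dtype='f8')
        for i in range(n_files):
            layout[i] = h5py.VirtualSource(f'part_{i}.h5', 'data',
                                           shape=(n_rows,))
        with h5py.File('combined_vds.h5', 'w', libver='latest') as f:
            f.create_virtual_dataset('data', layout, fillvalue=-1)

    comm.Barrier()                          # don't read the VDS before it exists

Since the VDS file stores only the mappings, the rank-0 step is cheap even when the source files are large.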

Related

Should I use a framework or a self-made script for machine learning workflow automation?

For a personal project, I am trying to automate the workflow of my machine learning model, but I have some questions about taking a professional approach.
At the moment I am doing the following tasks manually:
From the raw data, I extract the data of interest into a directory using third-party software (to which I pass the extraction parameters as arguments).
Then I run another piece of software, or in some cases one (or more) of my Python scripts, to pre-process the data, which is stored in a new directory.
Finally, I feed the processed data to one of my models, which returns labeled data that I store in a final directory.
[Process diagram of the workflow described above.]
The steps (extract, pre-process, model) are always executed in the same order, but I change the scripts, software parameters, or model according to my needs or the comparison I want to make.
All my scripts are stored in an ordered script directory, and the third-party software is called from the command line by a Python script.
My goal is to have a script/program that runs the whole loop by itself. As input it would take the raw data (or the directory where it is stored) and the parameters needed to run the loop with the desired modules (and their respective parameters).
The number of module and parameter combinations is so large that I can't write a script for each one, which is why I want to build something very modular.
I could code my own script (a minimal sketch of what I have in mind is below), but I would like a more professional approach, as if I had to implement it for a company.
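To illustrate, here is a minimal sketch of the kind of modular loop I have in mind (the step names and the registry are placeholders, not my real scripts):

    # Sketch of a modular pipeline: each step is a callable taking an
    # input directory, an output directory, and its parameters. My real
    # extractor, scripts and models would be registered instead.
    import subprocess
    from pathlib import Path

    def extract(src: Path, dst: Path, params: dict) -> None:
        # e.g. call the third-party extraction tool on the command line
        subprocess.run(['extract_tool', str(src), str(dst),
                        *params.get('args', [])], check=True)

    def preprocess(src: Path, dst: Path, params: dict) -> None:
        ...  # one of my pre-processing scripts

    def model(src: Path, dst: Path, params: dict) -> None:
        ...  # run the chosen model, write the labeled data

    REGISTRY = {'extract': extract, 'preprocess': preprocess, 'model': model}

    def run_pipeline(raw_dir: Path, work_dir: Path, config: list) -> Path:
        # config is an ordered list of (step_name, params) pairs
        current = raw_dir
        for i, (name, params) in enumerate(config):
            out = work_dir / f'{i:02d}_{name}'
            out.mkdir(parents=True, exist_ok=True)
            REGISTRY[name](current, out, params)
            current = out
        return current  # directory holding the final labeled data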
My questions: in my case (customizable/interchangeable modules), would it be more appropriate to use a framework (e.g. Kedro or any other) or to build it myself (because my needs are too specific)? If frameworks are appropriate, which one should I choose (and why)?
I've been researching existing frameworks, but besides not being sure they fit my needs, there are so many that I'd rather spend my time on one that could also help me in future projects or professional experience.
Thank you!

Adding structure to HDF5 files - Equivalent of NetCDF "Conventions" for HDF5

NetCDF4 has the "Conventions" convention for adding structure to NetCDF files. I'm looking for the analogous thing, but for HDF5 specifically.
My general aim is to add structure to my HDF5 files in a standard way. I want to do something like what HDF5 does with images to define a type, using attributes on groups and datasets, like:
CLASS: IMAGE
IMAGE_VERSION: 1.2
IMAGE_SUBCLASS: IMAGE_TRUECOLOR
...
But as far as I can tell, that image specification stands alone. Maybe I should just reuse the NetCDF "Conventions"?
Update:
I'm aware NetCDF4 is implemented on top of HDF5. In this case, we have data from turbulence simulations and experiments, not geo data. This data is usually limited to <= 4D. We already use HDF5 for storing this data, but we have no developed standards; pseudo-standard formats have just sort of developed organically within the organization.
NetCDF4 files are actually stored using the HDF5 format (http://www.unidata.ucar.edu/publications/factsheets/current/factsheet_netcdf.pdf); however, they use netCDF4 conventions for attributes, dimensions, etc. Files are self-describing, which is a big plus. HDF5 without netCDF4 allows much more liberty in defining your data. Is there a specific reason that you would like to use HDF5 instead of netCDF4?
I would say that if you don't have any specific constraints (like a model or visualisation software that chokes on netCDF4 files), you'd be better off using netCDF. netCDF4 can be used by the NCO/CDO operators, NCL (which also accepts HDF5), IDL, the netCDF4 Python module, Ferret, etc. Personally, I find netCDF4 very convenient for storing climate or meteorological data. There are a lot of operators already written for it, and you don't have to go through the trouble of developing a standard for your own data; it's already done for you. CMOR (http://cmip-pcmdi.llnl.gov/cmip5/output_req.html) can be used to write CF-compliant climate data. It was used for the most recent climate model comparison project.
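For example, here is a minimal sketch of writing a CF-style file with the netCDF4 Python module (the variable names and attribute values are purely illustrative):

    # Minimal sketch: a self-describing netCDF4 file with CF-style
    # attributes, written via the netCDF4 Python module.
    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset('example.nc', 'w', format='NETCDF4')
    nc.Conventions = 'CF-1.8'
    nc.createDimension('x', 64)
    nc.createDimension('y', 64)
    v = nc.createVariable('velocity_u', 'f8', ('x', 'y'))
    v.units = 'm s-1'
    v.long_name = 'x-component of velocity'
    v[:] = np.zeros((64, 64))
    nc.close()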
On the other hand, HDF5 might be worth it if you have another type of data and you are looking for some very specific functionalities for which you need a more customised file format. Would you mind specifying your needs a little better in the comments ?
Update:
Unfortunately, the standards for variable and field names are a little less clear and well-organised for HDF5 files than for netCDF, since netCDF was the format of choice for big climate modelling projects like CMIP or CORDEX. The problem essentially boils down to using EOSDIS or CF conventions, but finding currently maintained libraries that implement these standards for HDF5 files and have clear documentation isn't exactly easy (if it were, you probably wouldn't have asked the question).
If you really just want a standard, NASA explains all the different possible metadata standards in painful detail here: http://gcmd.nasa.gov/add/standards/index.html.
For reference, HDF-EOS and HDF5 aren't exactly the same format (HDF-EOS already contains cartography data and is standardised for earth-science data), so I don't know if this format would be too restrictive for you. The tools for working with this format are described here: http://hdfeos.net/software/tool.php and summarized here: http://hdfeos.org/help/reference/HTIC_Brochure_Examples.pdf.
If you still prefer to use HDF5, your best bet would probably be to download an HDF5-formatted file from NASA for similar data and use it as a basis to create your own tools in the language of your choice. Here's a list of comprehensive examples using the HDF5, HDF4 and HDF-EOS formats, with scripts for data treatment and visualisation in Python, MATLAB, IDL and NCL: http://hdfeos.net/zoo/index_openLAADS_Examples.php#MODIS
Essentially, the problem is that NASA makes tools available so that you can work with their data, but not necessarily so that you can re-create similarly structured data in your own lab setting.
Here are some more specs/information about HDF5 for earth-science data from NASA:
MERRA product file specification:
https://gmao.gsfc.nasa.gov/products/documents/MERRA_File_Specification.pdf
GrADS-compatible HDF5 information:
http://disc.sci.gsfc.nasa.gov/recipes/?q=recipes/How-to-Read-Data-in-HDF-5-Format-with-GrADS
HDF data manipulation tools on NASA's Atmospheric Science Data Center:
https://eosweb.larc.nasa.gov/HBDOCS/hdf_data_manipulation.html
Hope this helps a little.
The best choice for a standard really depends on the kind of data you want to store. The CF conventions are most useful for measurement data that is georeferenced, for instance data measured by satellite. It would be helpful to know what your data consists of.
Assuming you have georeferenced data, I think you have two options:
Reuse the CF conventions in HDF5 like you suggested. More people are looking into this; a quick Google search gave me this.
HDF-EOS (disclaimer: I have never used it). It stores data in HDF files using a certain structure, but it seems to require an extension library to use. I did not find a specification of the structure, only an API. There also does not seem to be a vibrant community outside NASA.
Therefore I would probably go with option 1: use the CF conventions in your HDF files and see if a 3rd party tool, such as Panoply, can make use of it.
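As a concrete starting point for option 1, here is a minimal sketch using h5py to attach CF-style attributes to plain HDF5 datasets (the names and values are illustrative, not a vetted CF encoding):

    # Minimal sketch: CF-style metadata stored as ordinary HDF5
    # attributes, so generic HDF5 tools can still read the file.
    import h5py
    import numpy as np

    with h5py.File('example.h5', 'w') as f:
        f.attrs['Conventions'] = 'CF-1.8'
        u = f.create_dataset('velocity_u', data=np.zeros((64, 64)))
        u.attrs['units'] = 'm s-1'
        u.attrs['long_name'] = 'x-component of velocity'

Whether Panoply or another CF-aware tool picks these attributes up is exactly what you would want to test.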

How to pull patient ID for a program/device that records MRI patient scans

My company is working on a project, and we've run into a wall trying to determine the best way to pull the patient's ID. We need the patient's ID to name the video file for easy searching.
We want to install this system in a bunch of different scan rooms with different MRIs, so we think (but we may be wrong) the best way would be to sniff the conversation between the MRI and the server off the network, since this would be more standardized.
I know very little about HL7 or how MRIs interact with the server. If you have any knowledge of these protocols, I would love to hear from you.
As was already pointed out in a comment, this is more related to DICOM than to HL7.
I assume the MRI machines will store their image data on a PACS server at some point, so the easiest way would be to just query the PACS server via its DICOM interface for MRI studies. The studies have all the patient information embedded in the DICOM image files when they are stored from the MRI to the PACS. There is also a possibility that you could query the MRI machine itself via its DICOM interface, if it happens to provide the necessary Query/Retrieve SCPs; information on that can be found in the DICOM conformance statement of the MRIs. However, getting the data from a PACS server would be the easiest and most logical way of achieving it.
There should be DICOM libraries available for all major programming languages and platforms both paid and free. Get one and start exploring!
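For example, here is a minimal C-FIND query sketch using pydicom/pynetdicom (the PACS host, port, and AE titles are placeholders to replace with your site's values):

    # Minimal sketch: ask a PACS for MR studies and print each PatientID.
    from pydicom.dataset import Dataset
    from pynetdicom import AE
    from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

    ae = AE(ae_title='VIDEO_BOX')                     # placeholder AE title
    ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

    query = Dataset()
    query.QueryRetrieveLevel = 'STUDY'
    query.ModalitiesInStudy = 'MR'
    query.PatientID = ''                              # empty = return this field

    assoc = ae.associate('pacs.example.com', 104, ae_title='PACS')
    if assoc.is_established:
        for status, identifier in assoc.send_c_find(
                query, StudyRootQueryRetrieveInformationModelFind):
            if status and status.Status in (0xFF00, 0xFF01) and identifier:
                print(identifier.PatientID)
        assoc.release()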

Performance of flat directory structures in iOS

On the iOS filesystem, is there a way to optimize file access performance by using a tiered directory structure vs. a flat directory structure?
Specifically, my app has Objects that each contain a number of images and data files. A user could create thousands of these Objects and I need to optimize access to one image for ~100 arbitrary Objects at a time.
In this situation, how should I organize files on the filesystem? Would a tiered directory structure be faster than a flat one? And if so, how should I structure the tiered system (i.e. how many tiers, and how many subdirectories / files per tier)?
THANKS!
Well, first of all, you might as well try it with a flat structure and see whether it is slow. Perhaps Apple has put in code to optimize how files are found, and you don't even need to worry about this. You can probably build out the whole app and just test how quickly it loads to see if that meets your requirements.
If you need to speed it up, I would suggest making some sort of structure based on the name of the file. You could have a folder holding all of the items beginning with the letter 'a', another for 'b', and so on. This would split the items into 26 folders, which should significantly decrease the number in each. Depending on how you name the files, you might want a different scheme so that each of the folders holds a similar number of items.
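As a generic illustration of that idea (sketched in Python; the scheme carries over to any filesystem code), hashing the file name into a fixed number of buckets keeps the folders evenly filled no matter how the files are named:

    # Minimal sketch: shard files into N subdirectories by a hash of the
    # name, so buckets stay balanced regardless of the naming scheme.
    import hashlib
    from pathlib import Path

    N_BUCKETS = 64                      # illustrative bucket count

    def shard_path(root: Path, filename: str) -> Path:
        digest = hashlib.md5(filename.encode()).hexdigest()
        bucket = int(digest, 16) % N_BUCKETS
        return root / f'{bucket:02x}' / filename

    p = shard_path(Path('Documents/objects'), 'image_12345.png')
    p.parent.mkdir(parents=True, exist_ok=True)   # create the bucket lazily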
If you are using Core Data, you could always just enable the Allows External Storage option on the attribute in your model and let the system decide where the data should go.
That would be a decent first step to see if the performance is ok.

Suitable data storage backend for Erlang application when data doesn't fit memory

I'm researching possible options for organizing data storage for an Erlang application. The data it is supposed to use is basically a huge collection of binary blobs indexed by short string ids. Each blob is under 10 KB, but there are many of them. I'd expect them to total up to 200 GB, so obviously the data cannot fit into memory. The typical operation on this data is reading a blob by its id, updating a blob by its id, or adding a new one. At any given period of the day only a subset of ids is in use, so data-storage access performance might benefit from an in-memory cache. Speaking of performance: it is quite critical. The target is around 500 reads and 500 updates per second on commodity hardware (say, an EC2 VM).
Any suggestions what to use here? As I understand it, dets is out of the question, as it is limited to 2 GB (or was it 4 GB?). Mnesia is probably out of the question too; my impression is that it was mainly designed for cases where the data fits in memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in production under similar conditions?
tcerl came out of facing the same size limit. I'm not using Erlang these days, but it sounds like what you're looking for.
Have you looked at what CouchDB is doing? It might not be quite what you are after as a drop-in product, but there is lots of Erlang code in there for storing data. There is also some talk of providing a native Erlang interface instead of the REST API.
Is there any reason why you can't just use a file system, treating the filename as your string id and the file contents as a binary blob? You can choose a filesystem that fits your performance requirements, and you should get caching basically for free, provided by your OS.
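A minimal sketch of that approach (the root path and id scheme are illustrative), with two-level directory sharding so that no single directory accumulates millions of files:

    # Minimal sketch: blob store on the file system, keyed by string id.
    # Two hash-prefix levels keep directories small; the OS page cache
    # provides the in-memory caching for hot ids.
    import hashlib
    from pathlib import Path

    ROOT = Path('/var/blobs')                 # illustrative location

    def _path(blob_id: str) -> Path:
        h = hashlib.sha1(blob_id.encode()).hexdigest()
        return ROOT / h[:2] / h[2:4] / blob_id

    def put(blob_id: str, data: bytes) -> None:
        p = _path(blob_id)
        p.parent.mkdir(parents=True, exist_ok=True)
        tmp = p.parent / (p.name + '.tmp')
        tmp.write_bytes(data)                 # write-then-rename: update is atomic
        tmp.replace(p)

    def get(blob_id: str) -> bytes:
        return _path(blob_id).read_bytes()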
Mnesia can store data on disk just fine. There's also dets (disk-based term storage), which is roughly analogous to Berkeley DB. It's in the standard library: http://www.erlang.org/doc/apps/stdlib/index.html
I would recommend Apache CouchDB.
It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements) you're looking for a document-oriented database.
Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.
The documentation for CouchDB is of a very high quality.
It also has built-in Map-Reduce :)
