Repair corrupted hdf5 file

We are saving time-of-flight mass spectra in HDF5 files. In more than 99% of cases this works without any errors, but sometimes a measurement crashes.
We save various (meta)data, but my question is about one incrementally growing table:
Storage layout: CHUNKED: 1 x 1 x 10 x 248584
Compression: GZIP (level = 5), ratio 8,312:1
Values are inserted at each time step (usually every 10 seconds). After 1 min the layout is 6 x 1 x 10 x 248584.
But the corrupted table has size 0 x 1 x 10 x 248584 (tried with h5py 2.10.0 and HDFView 3.1.0).
My question: is there any low-level library (Python preferred) with which I can try to access the lost data? The file size (several hours of measurement, so several GB) suggests the data is there, but it cannot be read with the two programs I tried.
Thank you.
[Update]:
In HDFView the corrupted file shows the layout above. From the file size I expect the first dimension to be >1000.
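For anyone who wants to poke at such a file from Python, here is a minimal sketch of probing the dataset's metadata and raw chunk allocation with h5py. The file name and the dataset path '/spectra' are placeholders, and the low-level chunk-query calls (get_num_chunks / get_chunk_info) assume an h5py build that exposes the HDF5 chunk-query API (recent h5py against HDF5 1.10.5 or newer).

import h5py

# Placeholder names: replace 'corrupted.h5' and '/spectra' with the real ones.
with h5py.File('corrupted.h5', 'r') as f:
    dset = f['/spectra']
    print('reported shape:', dset.shape)    # e.g. (0, 1, 10, 248584)
    print('chunk shape:', dset.chunks)      # e.g. (1, 1, 10, 248584)

    # Assumption: this h5py build exposes the HDF5 chunk-query API.
    # If raw chunks are still allocated on disk, their count and byte
    # offsets show up here even though the reported shape starts with 0.
    try:
        dsid = dset.id
        n_chunks = dsid.get_num_chunks()
        print('allocated chunks:', n_chunks)
        for i in range(min(n_chunks, 5)):
            info = dsid.get_chunk_info(i)
            print(info.chunk_offset, info.byte_offset, info.size)
    except AttributeError:
        print('chunk-query API not available in this h5py build')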

Related

Why does categorizing a Dask DataFrame constructed from a Parquet file drastically increase its size?

Here is the archetypal scenario:
I construct a Dask DataFrame from a set of Parquet files written by FastParquet
I run categorize() on the DataFrame. Quite a few categories become newly "known."
I save the DataFrame to a new Parquet file-set via FastParquet
The new Parquet files now take up several times more disk space than the original set! Now, it's not that I care about disk space—I have enough—but rather I seek understanding:
Even if the original file-set's categories were not "known," they still had to have been in the file-set's disk space somewhere. If anything, I might expect a decrease in disk usage, if the original file-set's categorical columns were not using a dictionary to begin with.
So, yeah, just trying to understand. What gives?
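For concreteness, a minimal sketch of the scenario described above; the paths are placeholders and the calls shown are just the standard dask.dataframe API, not the original poster's exact code.

import dask.dataframe as dd

# Placeholder paths; engine='fastparquet' matches the writer mentioned above.
ddf = dd.read_parquet('original_parquet/', engine='fastparquet')

# Scan the data so the categorical columns' categories become "known".
ddf = ddf.categorize()

# Write the result back out; this file-set is the one that comes out larger.
ddf.to_parquet('categorized_parquet/', engine='fastparquet')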

Importing MNIST dataset with Fortran

A Linux/GFortran question.
I know exactly what my problem is but I can't figure out how to solve it...
I want to import the MNIST dataset images and labels into Fortran arrays to play around with Machine Learning algorithms using Fortran. I've done this with Python but I can't replicate reading the data files with Fortran.
The dataset files and file layout descriptions are at:
http://yann.lecun.com/exdb/mnist/
The 2 problems I'm struggling with are...
1) The data in the files is stored as unsigned bytes. I can't find a similar datatype in Fortran. I'm using integer(kind=1) to read the first 4 bytes successfully, which constitute the file's magic number, but I'm worried about incorrectly reading the value of one of these bytes into the signed integer(kind=1) datatype.
2) The data is stored in big-endian format. So when I read the number of images, rows and columns, which are stored in 4-byte integers, on my little-endian machine, I receive the obvious gobbledegook. Ideally, what I would like to be able to do is specify the endianness of a variable to read from a file in an edit descriptor. Is this possible?
Any assistance would be much appreciated.
Kind regards
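Since the poster mentions having already read these files in Python, here is a minimal Python sketch of the layout the Fortran code has to reproduce: four big-endian 4-byte integers (magic number, image count, rows, columns) followed by the pixel data as unsigned bytes. The file name is the standard uncompressed name from the MNIST page.

import struct
import numpy as np

with open('train-images-idx3-ubyte', 'rb') as f:
    # Header: four big-endian 4-byte integers.
    magic, n_images, n_rows, n_cols = struct.unpack('>4i', f.read(16))
    # Pixel data: unsigned bytes, one per pixel.
    images = np.frombuffer(f.read(), dtype=np.uint8)
    images = images.reshape(n_images, n_rows, n_cols)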

How to store large compressed CSV on S3 for use with Dask

I have a large dataset (~1 terabyte of data) spread across several CSV files that I want to store (compressed) on S3. I have had issues reading compressed files into Dask because they are too large, so my initial solution was to split each CSV into files of manageable size. These files are then read in the following way:
import dask.dataframe as dd

ddf = dd.read_csv('s3://bucket-name/*.xz', encoding="ISO-8859-1",
                  compression='xz', blocksize=None, parse_dates=[6])
Before I ingest the full dataset - is this the correct approach, or is there a better way to accomplish what I need?
This seems sensible to me.
The only challenge that arises here is due to compression. If a compression format doesn't support random access, then Dask can't break a large file up into multiple smaller pieces. This can also be true for formats that do support random access, like xz, but that are not configured for it in that particular file.
Breaking up the file manually into many small files and using blocksize=None as you've done above is a good solution in this case.
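As a small sanity check for this setup (a sketch built on the call already shown in the question): with blocksize=None, Dask treats each matched file as exactly one partition, so the partition count should equal the number of files you split the data into.

import dask.dataframe as dd

ddf = dd.read_csv('s3://bucket-name/*.xz', encoding="ISO-8859-1",
                  compression='xz', blocksize=None, parse_dates=[6])

# With blocksize=None, Dask does not try to split inside the compressed
# files, so each *.xz file becomes exactly one partition.
print(ddf.npartitions)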

Advice (Best practices) for handling large number of large 2D arrays in HDF5 files

I am using a Python program to write a 4000x4000 array into an HDF5 file.
Then I read the data with a C program, where I need it as input for some simulations. I need approximately 1000 of these 4000x4000 arrays (meaning I am doing 1000 simulation runs).
My question now is the following: which way is "better", 1000 separate HDF5 files or one big HDF5 file with 1000 different datasets (named 'dataset_%04d')?
Any advice or best practice for this kind of problem is greatly appreciated (as I am not too familiar with HDF5).
In case this is of interest, here is the Python code I am using to write the HDF5 file:
import h5py

h5f = h5py.File('data_0001.h5', 'w')
h5f.create_dataset('dataset_1', data=myData)
h5f.close()
This is really interesting, as I'm currently dealing with a similar problem.
Performance
To investigate the problem a little closer, I created the following file:
import h5py
import numpy as np

def one_file(shape=(4000, 4000), n=1000):
    h5f = h5py.File('data.h5', 'w')
    for i in xrange(n):
        dataset = np.random.random(shape)
        dataset_name = 'dataset_{:08d}'.format(i)
        h5f.create_dataset(dataset_name, data=dataset)
        print i
    h5f.close()

def more_files(shape=(4000, 4000), n=1000):
    for i in xrange(n):
        file_name = 'data_{:08d}'.format(i)
        h5f = h5py.File(file_name, 'w')
        dataset = np.random.random(shape)
        h5f.create_dataset('dataset', data=dataset)
        h5f.close()
        print i
Then, in IPython,
>>> from testing import one_file, more_files
>>> %timeit one_file(n=25) # with n=25, the resulting file is 3.0GB
1 loops, best of 3: 42.5 s per loop
>>> %timeit more_files(n=25)
1 loops, best of 3: 41.7 s per loop
>>> %timeit one_file(n=250)
1 loops, best of 3: 7min 29s per loop
>>> %timeit more_files(n=250)
1 loops, best of 3: 8min 10s per loop
The difference is quite surprising to me: for n=25, having more files is faster; however, this is no longer true for more datasets.
Experience
As others noted in the comments, there is probably no correct answer, as this is very problem-specific. I deal with HDF5 files for my research in plasma physics. I don't know if it helps you, but I can share my HDF5 experience.
I run lots of simulations, and the output for a given simulation used to go to one HDF5 file. When a simulation finished, it dumped its state to this HDF5 file, so later I was able to take this state and extend the simulation from that point (I could change some parameters as well and didn't need to start from scratch). The output from this simulation went again into the same file. This was great - I had only one file per simulation. However, there are certain drawbacks to this approach:
When a simulation crashes, you end up with a file that is not 'complete' - you can't start a new simulation from that file.
There is no simple way to safely take a look into an HDF5 file while another process is writing to it. If you try to read from it while another process is writing to it, you end up with a corrupted file and all your data is lost!
I don't know of any simple way to delete groups from a file (if anyone knows a way, let me know). So, if I need to restructure a file, I need to create a new one from it (h5copy, h5repack, ...).
So I ended up with the following approach, which works much better (a minimal sketch follows the list below):
I periodically flush the state from a simulation, and after that I write to a new file. If the simulation crashes, I only need to delete the last file and I don't lose that much CPU time.
I currently only plot data from all files but the last one. Note that there is another way (see here), but my approach is definitely simpler and I'm OK with that.
It is much better to process many small files than one huge file - you can see the progress, and so on.
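A minimal sketch of this "one file per flush" pattern (names and shapes are illustrative only):

import h5py
import numpy as np

def flush_state(flush_index, state):
    # Each periodic flush goes to its own file, so a crash can at worst
    # affect the file currently being written.
    file_name = 'state_{:08d}.h5'.format(flush_index)
    with h5py.File(file_name, 'w') as f:
        f.create_dataset('state', data=state)

# e.g. called every N time steps of the simulation
flush_state(0, np.random.random((4000, 4000)))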
Hope this helps.
A little late to the party, I know, but I thought I'd share my experience. My data sizes are smaller, but from a simplicity-of-analysis standpoint I actually prefer one large (1000, 4000, 4000) dataset. In your case, it looks like you'd need to use the maxshape property to make it extendable as you create new results. Saving multiple separate datasets makes it hard to look at trends across datasets, since you have to slice them all separately. With one dataset you could do e.g. data[:, 5, 20] to look at a single point across all 1000 results. Also, to address the corruption problem, I highly recommend using h5py.File as a context manager:
with h5py.File('myfilename') as f:
    f.create_dataset('mydata', data=data, maxshape=(1000, 4000, 4000))
This automatically closes the file even if there is an exception. I used to curse incessantly due to corrupted data and then I started doing this and haven't had a problem since.
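To make the maxshape suggestion above concrete, here is a minimal sketch of an extendable dataset that grows along the first axis as new 4000x4000 results arrive; the file name, dataset name and chunk shape are illustrative choices, not requirements.

import h5py
import numpy as np

with h5py.File('all_results.h5', 'a') as f:
    # Create once: start empty along the first axis and allow it to grow.
    if 'mydata' not in f:
        f.create_dataset('mydata', shape=(0, 4000, 4000),
                         maxshape=(None, 4000, 4000),
                         chunks=(1, 4000, 4000))  # one result per chunk (illustrative)
    dset = f['mydata']

    new_result = np.random.random((4000, 4000))  # stand-in for a real result
    dset.resize(dset.shape[0] + 1, axis=0)       # grow by one slice
    dset[-1] = new_result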

Importing and processing data from a CSV file in Delphi

I had a pre-interview task, which I completed and whose solution worked; however, I was marked down and did not get an interview because I had used a TADODataset. I basically imported a CSV file which populated the dataset; the data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure the data was ordered the way I wanted, and then I did the logic processing in a while loop. The feedback I received said that this was bad, as it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the files are and the available resources (in this case RAM) for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, files are ~1MB up to ~10MB, but that's not to say that others would not dump far more data in CSV format) without worrying too much (if at all) about import/export, since it is extremely simplistic.
Suppose you have an 80MB CSV file; now that's a file you want to process in chunks, otherwise (depending on your processing) you can eat hundreds of MB of RAM. In this case what I would do is:
while dataToProcess do begin
  // step 1:
  // read <X> lines from the file, where <X> is the max number of lines you
  // read in one go; if there are fewer lines left to process (i.e. you're
  // down to 50 lines and X is 100), then you read those
  // step 2:
  // process the information
  // step 3:
  // generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries (batch inserts), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see whether you're capable of creating algorithms and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and create one (or more) sorting algorithms.
"Should I have used String Lists or something like that?"
The response might have been the same; again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution on any medium-sized file upwards is to use an 'external sort'.
An external sort is a two-stage process: the first stage is to split the file into manageable, sorted smaller files; the second stage is to merge these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient on any CSV file with, say, over 200,000 lines. The amount of memory the process runs in can be controlled, and thus the danger of running out of memory can be eliminated.
I have implemented many such sort processes, and in Delphi I would recommend a combination of the TStringList, TList and TQueue classes.
Good Luck
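To make the two-stage process above concrete, here is a minimal sketch in Python (the answer itself recommends TStringList, TList and TQueue in Delphi); the chunk size is arbitrary and whole lines are compared lexicographically, which you would replace with your real sort key.

import heapq
import os
import tempfile

def external_sort(input_path, output_path, lines_per_chunk=100000):
    # Stage 1: split the big file into smaller, individually sorted run files.
    run_paths = []
    with open(input_path) as src:
        while True:
            lines = [line for _, line in zip(range(lines_per_chunk), src)]
            if not lines:
                break
            lines.sort()  # illustrative: whole-line lexicographic sort
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, 'w') as run:
                run.writelines(lines)
            run_paths.append(path)

    # Stage 2: merge the sorted runs into a single sorted output file,
    # which can then be processed line by line.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, 'w') as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)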
