Read HDF5 dataset of multiple data types

I have an HDF5 file with a dataset that contains different data types (int and float).
When I read it into a NumPy array, it is detected as an array of type np.void.
import numpy as np
import h5py
f = h5py.File('Sample.h5', 'r')
array = np.array(f['/Group1/Dataset'])
print(array.dtype)
(Screenshot of the print(array.dtype) output omitted.)
How can I read this dataset into arrays where each column keeps the same data type as the input? Thanks in advance.

Here are two simple examples showing both ways to slice a subset of the dataset using the HDF5 field/column names.
The first method extracts a subset of the data into a record array by slicing field names when accessing the dataset.
The second method follows your current approach: it extracts the entire dataset into a record array, then slices a new view to access a subset of the data.
Print statements are used liberally so you can see what's going on.
Method 1
real_array = np.array(f['/Group1/Dataset'][:,'XR','YR','ZR'])
print(real_array.dtype)
print(real_array.shape)
Method 2
cmplx_array = np.array(f['/Group1/Dataset'])
print(cmplx_array.dtype)
print(cmplx_array.shape)
disp_real = cmplx_array[['XR','YR','ZR']]
print(disp_real.dtype)
print(disp_real.shape)
Review this SO topic for additional insights into copying values from a recarray to an ndarray, and back:
copy-numpy-recarray-to-ndarray
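If you eventually need a plain 2-D float array rather than the structured view, one option is a minimal sketch like the following, assuming NumPy >= 1.16 (for numpy.lib.recfunctions.structured_to_unstructured) and reusing disp_real from Method 2 above:
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# convert the 3-field structured slice into an ordinary (N, 3) float ndarray
plain_xyz = structured_to_unstructured(disp_real)
print(plain_xyz.dtype, plain_xyz.shape)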

Related

How to Combine Two HDF5 Datasets without intermediate buffer

I have several HDF5 files, all of which have a /dataset that contains vectors. I would like to combine all these vectors into one dataset in one file (that is, repeatedly append from one file to another). The combined dataset would have chunked storage and be resizable.
Every option I've seen for doing this seems to require reading all the data into a buffer and then writing it back out. Is there a way to simply pass a dataset/dataspace from one file to another in order to append the data?
Have you investigated the h5py Group .copy() method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes, and it supports recursive copying of group members. If you prefer a command-line tool, the HDF Group has one to do this. Take a look at h5copy here: HDF5 Group h5copy doc
Here is an example that demonstrates a simple h5py .copy() implementation. It creates a set of 3 files, each with 1 dataset (named /dataset, dtype=float, shape=(10,10)). It then creates a NEW HDF5 file and, in a second loop, opens each of the previous files and copies the dataset from the "read" file (h5r) to the new "write" file (h5w).
import numpy as np
import h5py

# create 3 example files, each with one (10, 10) float dataset named 'dataset'
for i in range(1, 4):
    with h5py.File('SO_68025342_' + str(i) + '.h5', mode='w') as h5f:
        arr = np.random.random(100).reshape(10, 10)
        h5f.create_dataset('dataset', data=arr)

# copy each file's dataset into a new file under a unique name
with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            h5r.copy('dataset', h5w, name='dataset_' + str(i))
Here is a method to copy data from multiple files to a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you must know the number of datasets in advance to size the new dataset. (If not, you can create a resizable dataset by adding maxshape=(None,a0,a1) and then use .resize() as needed; a sketch of that variant follows the code below.) I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.
with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                a0, a1 = h5r['dataset'].shape
                h5w.create_dataset('dataset', shape=(3, a0, a1))
            h5w['dataset'][i-1, :] = h5r['dataset']
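For reference, here is a rough sketch of the resizable variant mentioned above (maxshape plus .resize()). The same "all sources share one shape" caveat applies, and 'SO_68025342_merge2.h5' is just a placeholder name:
import h5py

# resizable variant: grow the merged dataset one slice at a time
with h5py.File('SO_68025342_merge2.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            arr = h5r['dataset'][:]
            if 'dataset' not in h5w:
                a0, a1 = arr.shape
                h5w.create_dataset('dataset', shape=(1, a0, a1),
                                   maxshape=(None, a0, a1), chunks=(1, a0, a1))
            else:
                h5w['dataset'].resize(h5w['dataset'].shape[0] + 1, axis=0)
            h5w['dataset'][-1, :, :] = arr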
Assuming your files aren't so conveniently named, you can use glob.iglob() to loop over the file names to read (see the sketch after the docs link below). Then use .keys() to get the dataset names in each file. Also, if all of your datasets really are named /dataset, you need to come up with a naming convention for the new datasets.
Here is a link to the h5py docs with more details: h5py Group .copy() method
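Building on the glob.iglob() suggestion, here is a rough sketch of the copy loop driven by a filename pattern. The pattern 'data_*.h5' and the output name 'merged.h5' are hypothetical, and each source file is assumed to hold a single dataset named /dataset:
import glob
import h5py

src_files = sorted(glob.glob('data_*.h5'))   # hypothetical pattern
with h5py.File('merged.h5', mode='w') as h5w:
    for n, fname in enumerate(src_files, start=1):
        with h5py.File(fname, mode='r') as h5r:
            # give each copy a unique name so names don't collide in the new file
            h5r.copy('dataset', h5w, name='dataset_' + str(n))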
If you are not bound to a particular library and programming language, one way to solve your issue could be with the usage of HDFql (in C, C++, Java, Python, C#, Fortran or R).
Given that your posts seem to mention C# quite often, find below a solution in C#. It assumes that 1) the dataset name is dset, 2) each dataset is of data type float, and 3) each dataset is a one-dimensional vector (size 100). Feel free to adapt the code to your concrete use case:
// declare variable
float[] data = new float[100];
// retrieve all file names (from current directory) that end with '.h5'
HDFql.Execute("SHOW FILE LIKE \\.h5$");
// create an HDF5 file named 'output.h5' and use (i.e. open) it
HDFql.Execute("CREATE AND USE FILE output.h5");
// create a chunked and extendible HDF5 dataset named 'dset' in file 'output.h5'
HDFql.Execute("CREATE CHUNKED(100) DATASET dset AS FLOAT(0 TO UNLIMITED)");
// register variable 'data' for subsequent usage (by HDFql)
HDFql.VariableRegister(data);
// loop cursor and process each file found
while(HDFql.CursorNext() == HDFql.Success)
{
    // alter (i.e. extend) dataset 'dset' (from file 'output.h5') with 100 more floats
    HDFql.Execute("ALTER DIMENSION dset TO +100");
    // select (i.e. read) dataset 'dset' (from the file found) and populate variable 'data'
    HDFql.Execute("SELECT FROM \"" + HDFql.CursorGetChar() + "\" dset INTO MEMORY " + HDFql.VariableGetNumber(data));
    // insert (i.e. write) values stored in variable 'data' into dataset 'dset' (from file 'output.h5') at the end of it (using a hyperslab)
    HDFql.Execute("INSERT INTO dset(-1:::) VALUES FROM MEMORY " + HDFql.VariableGetNumber(data));
}

How can I create a Dask array from zipped .npy files?

I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da

data = np.load('data.npz')

def load(files):
    list_ = [da.from_array(data[file]) for file in files]
    return da.stack(list_)

x = load(['foo', 'bar'])
Well, you can't load a large npz file into memory that way, because then you're already out of memory. I would read each array in a delayed fashion, and then call da.from_array and da.stack, much as you are already doing in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
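For what it's worth, here is a rough sketch of that delayed approach. It assumes you know each member array's shape and dtype up front (da.from_delayed needs them); 'data.npz', 'foo' and 'bar' come from the question, while the shape and dtype values are placeholders:
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_member(path, key):
    # nothing is read from disk until the graph is actually computed
    with np.load(path) as npz:
        return npz[key]

def load(path, keys, shape, dtype):
    parts = [da.from_delayed(load_member(path, k), shape=shape, dtype=dtype)
             for k in keys]
    return da.stack(parts)

x = load('data.npz', ['foo', 'bar'], shape=(1000, 1000), dtype='float64')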

Performance and data manipulation on Dask

I have imported a parquet file of approx. 800 MB with ~50 million rows into a Dask dataframe.
There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS
Questions:
How can I specify data types in read_parquet, or do I have to do it with astype?
Can I parse dates within read_parquet?
I simply tried to do the following:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.str[0:4]
n = df.INDUSTRY.unique().compute()
and it takes forever to return. Am I doing anything wrong here? The number of partitions is automatically set to 1.
When I try something like df[df.INDUSTRY == '4010'].compute(), it also takes forever to return or crashes.
To answer your questions:
A parquet file has its types stored, as noted in the Apache docs here, so you won't be able to change the data types when you read the file in, meaning you'll have to use astype.
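A minimal sketch of the astype route (the dtype choices here are illustrative, not taken from your data):
import dask.dataframe as dd

df = dd.read_parquet(r'.\abc.gzip')
# cast columns after the read; pick dtypes that match your data
df = df.astype({'TICKER': 'category', 'COUNTRY': 'category', 'RETURN': 'float64'})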
You can't convert a string to a date within the read, though if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:
import dask.dataframe as dd
import pandas as pd

df = dd.read_parquet(your_file)
meta = ('date', 'datetime64[ns]')
# you can add your own date format, or just let pandas guess
to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
The map_partitions function will convert the dates on each chunk of the parquet when the file is computed, making it functionally the same as converting the date when the file is read in.
Here again I think you would benefit from using the map_partitions function, so you might try something like this:
import dask.dataframe as dd

df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
df[df.INDUSTRY == '4010']
Note that if you run compute, the object is converted to pandas. If the result is too large to fit in memory, Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs; one rough example is sketched below.
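As one example of those profiling tools for the default local scheduler (the visualize() call assumes bokeh is installed):
from dask.diagnostics import ProgressBar, Profiler

# watch progress and record per-task timings while the filter runs
with ProgressBar(), Profiler() as prof:
    result = df[df.INDUSTRY == '4010'].compute()
prof.visualize()   # renders a bokeh plot of where the time went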

DataLoader - shuffle implicit pairs

Is there a way to handle the DataLoader as a list? The idea is that I want to shuffle implicit pairs of images, without setting shuffle to True.
Basically, I have for example 10 scenes, each containing let's say 100 sequences, so they are represented inside the directory as
'1_1.png', '1_2.png', '1_3.png', ..., '2_1.png', '2_2.png', '2_3.png', ..., '3_1.png', '3_2.png', '3_3.png', ..., '10_1.png', '10_2.png', '10_3.png', ...
I don't want complete shuffling of the data; what I want is simply to shuffle while keeping pairs together, so they are represented in the DataLoader as
[ '1_3.png', '1_4.png', '2_2.png', '2_3.png', '10_1.png', '10_2.png', '1_2.png', '1_3.png', ...]
and so on
Please have a look at this question which I have already asked on Stack Overflow concerning shuffling an array of implicit pairs, where you can see what I mean.
As an example:
if this is a list
L = [['1_1'],['1_2'],['1_3'],['1_4'],['1_5'],['1_6'],['2_1'],['2_2'],['2_3'],['2_4'],['2_5'],['2_6'],['3_1'],['3_2'],['3_3'],['3_4'],['3_5'],['3_6']]
then this is the output
[['1_2'], ['1_3'], ['2_1'], ['2_2'], ['2_4'], ['2_5'],
['2_2'], ['2_3'], ['1_3'], ['1_4'], ['3_4'], ['3_5'],
['3_3'], ['3_4'], ['3_2'], ['3_3'], ['1_6'], ['2_1'],
['2_5'], ['2_6'], ['2_6'], ['3_1'], ['1_4'], ['1_5'],
['1_1'], ['1_2'], ['2_3'], ['2_4'], ['1_5'], ['1_6'],
['3_1'], ['3_2'], ['3_5'], ['3_6']]
I want to achieve the same for a DataLoader.
The main idea is that I want to train my network on sequential frames. It doesn't have to be the complete sequence, but at each step I need at least two consecutive frames to be present.
I think you are looking for data.Sampler: instead of the completely random default shuffle of data.DataLoader, you can provide your own "sampler" that samples examples from your Dataset.
Looking at the input parameters of data.DataLoader:
sampler (Sampler, optional) – defines the strategy to draw samples
from the dataset. If specified, shuffle must be False.
I think a good starting point is to look at the code of data.SubsetRandomSampler. A rough sketch of a pair-preserving sampler follows.
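As a rough sketch (not tested against your data), a custom sampler that emits shuffled consecutive (i, i+1) index pairs could look like this; the dataset and loader names are placeholders:
import torch
from torch.utils.data import Sampler, DataLoader

class PairShuffleSampler(Sampler):
    # yields indices so that consecutive pairs (i, i+1) stay adjacent,
    # while the order of the pairs themselves is shuffled
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        n = len(self.data_source)
        starts = torch.randperm(n - 1)   # first index of each pair
        for s in starts.tolist():
            yield s
            yield s + 1

    def __len__(self):
        return 2 * (len(self.data_source) - 1)

# usage: shuffle must stay False because a custom sampler is supplied
# loader = DataLoader(dataset, batch_size=2, sampler=PairShuffleSampler(dataset))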

Best way to transpose a grid of data in a file

I have large data files of values on a 2D grid.
They are organized such that subsequent rows of data in the grid are subsequent lines in the file.
Each column is separated by a tab character.
Essentially, this is a CSV file, but with tabs instead of commas.
I need to transpose the data (first row becomes first column) and output it to another file. What's the best way to do this? Any language is okay (I prefer to use Perl or C/C++). Currently, I have a Perl script that just reads the entire file into memory, but I have files which are simply gigantic.
The simplest way would be to make multiple passes through your input, extracting a subset of columns on each pass. The number of columns would be determined by how much memory you wanted to use and how many rows are in the input file.
For example:
On pass 1 you read the entire input file and process only the first, say, 10 columns. If the input had 1 million rows, the output would be a file with 1 million columns and 10 rows. On the next pass you would read the input again and process columns 11 through 20, appending the results to the original output file. And so on; a rough sketch is below.
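Here is a rough Python sketch of that multi-pass idea, assuming tab-separated input; the file names and the 10-column band size are just placeholders:
# transpose a huge tab-separated grid in bands of columns to bound memory use
COLS_PER_PASS = 10

def transpose_in_passes(in_path, out_path, cols_per_pass=COLS_PER_PASS):
    # find the number of columns from the first line
    with open(in_path) as f:
        ncols = len(f.readline().rstrip('\n').split('\t'))
    with open(out_path, 'w') as out:
        for start in range(0, ncols, cols_per_pass):
            stop = min(start + cols_per_pass, ncols)
            band = [[] for _ in range(stop - start)]   # one output row per column
            with open(in_path) as f:
                for line in f:
                    fields = line.rstrip('\n').split('\t')
                    for j in range(start, stop):
                        band[j - start].append(fields[j])
            for row in band:
                out.write('\t'.join(row) + '\n')

transpose_in_passes('grid.tsv', 'grid_T.tsv')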
If you have Python with NumPy installed, it's as easy as this:
#!/usr/bin/env python
import csv
import numpy

# read the tab-delimited grid into a 2-D array of strings, then transpose it
with open('/path/to/data.csv', newline='') as f:
    csvdata = list(csv.reader(f, delimiter='\t'))
data = numpy.array(csvdata)
transpose = data.T
... the csv module is part of Python's standard library.
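To finish the job and write the transposed grid back out tab-separated (the output path is a placeholder):
# strings write out fine with a %s format; delimiter keeps it tab-separated
numpy.savetxt('/path/to/data_T.csv', transpose, fmt='%s', delimiter='\t')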
