How to Combine Two HDF5 Datasets without intermediate buffer - hdf5

I have several HDF5 files all of which have a /dataset that contains vectors. I would like to combine all these vectors into one dataset in one file (that is repeatedly append from one file to another). The combined dataset would have chunked storage and be resizable.
Every option I've seen for doing this seems to require reading all the data into a buffer and then writing it back out. Is there a simpler way to pass a dataset/dataspace from one file to another in order to append the data?

Have you investigated h5py Group .copy() method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes, and supports recursive copying of group members. If you prefer a command line tool, the HDF Group has one to do this. Take a look at h5copy here: HDF5 Group h5 copy doc
Here is an example that demonstrates a simple h5py .copy() implementation. It creates a set of 3 files, each with 1 dataset (named /dataset, dtype=float, shape=(10,10)). It then creates a NEW HDF5 file, followed by another loop that opens the previous files and copies the dataset from each "read" file (h5r) to the new "write" file (h5w).
import h5py
import numpy as np

# Create 3 sample files, each with one 10x10 float dataset named 'dataset'
for i in range(1, 4):
    with h5py.File('SO_68025342_'+str(i)+'.h5', mode='w') as h5f:
        arr = np.random.random(100).reshape(10, 10)
        h5f.create_dataset('dataset', data=arr)

# Copy each dataset into the new file under a unique name
with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_'+str(i)+'.h5', mode='r') as h5r:
            h5r.copy('dataset', h5w, name='dataset_'+str(i))
Here is a method to copy data from multiple files to a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you must know the number of datasets in advance to size the new dataset. (If not, you can create a resizable dataset by adding maxshape=(None,a0,a1), and then use .resize() as needed.) I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.
with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_'+str(i)+'.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                # size the merged dataset from the first file read
                a0, a1 = h5r['dataset'].shape
                h5w.create_dataset('dataset', shape=(3, a0, a1))
            h5w['dataset'][i-1, :] = h5r['dataset']
Assuming your files aren't so conveniently named, you can use glob.iglob() to loop on the file names to read. Then use .keys() to get the dataset names in each file. Also, if all of your datasets really are named /dataset, you need to come up with a naming convention for the new datasets.
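The resizable-dataset and glob ideas can be combined into one loop. What follows is only a minimal sketch, not a definitive implementation: the function name append_datasets, its parameters and the block size are assumptions made for illustration, and it assumes every source dataset has the same trailing dimensions and dtype. Unlike the example above, which stacks the arrays along a new axis, it appends rows along the first axis and copies block by block, so only `block` rows are held in memory at a time.

```python
import glob
import os

import h5py


def append_datasets(pattern, out_path, name="dataset", block=1024):
    """Append dataset `name` from every file matching `pattern` to `out_path`."""
    # Resolve the inputs first and skip the output file if the pattern matches it.
    out_abs = os.path.abspath(out_path)
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.abspath(p) != out_abs]

    with h5py.File(out_path, "w") as h5w:
        merged = None
        for path in paths:
            with h5py.File(path, "r") as h5r:
                src = h5r[name]
                if merged is None:
                    # Unlimited first axis; resizable datasets use chunked storage.
                    merged = h5w.create_dataset(
                        name,
                        shape=(0,) + src.shape[1:],
                        maxshape=(None,) + src.shape[1:],
                        dtype=src.dtype,
                        chunks=True,
                    )
                start = merged.shape[0]
                merged.resize(start + src.shape[0], axis=0)
                # Copy in blocks so the whole source is never in memory at once.
                for lo in range(0, src.shape[0], block):
                    hi = min(lo + block, src.shape[0])
                    merged[start + lo:start + hi] = src[lo:hi]
```

A call such as append_datasets('SO_68025342_[0-9].h5', 'SO_68025342_append.h5') would merge the three sample files into one dataset of shape (30, 10). The data still passes through a small buffer; the sketch only bounds its size.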
Here is a link to the h5py docs with more details: h5py Group .copy() method

If you are not bound to a particular library and programming language, one way to solve your issue could be with the usage of HDFql (in C, C++, Java, Python, C#, Fortran or R).
Given that your posts seem to mention C# quite often, find below a solution in C#. It assumes that 1) the dataset name is dset, 2) each dataset is of data type float, and 3) each dataset is a vector of one dimension (size 100) - feel free to adapt the code to your concrete use-case:
// declare variable
float []data = new float[100];
// retrieve all file names (from current directory) that end with '.h5'
HDFql.Execute("SHOW FILE LIKE \\.h5$");
// create an HDF5 file named 'output.h5' and use (i.e. open) it
HDFql.Execute("CREATE AND USE FILE output.h5");
// create a chunked and extendible HDF5 dataset named 'dset' in file 'output.h5'
HDFql.Execute("CREATE CHUNKED(100) DATASET dset AS FLOAT(0 TO UNLIMITED)");
// register variable 'data' for subsequent usage (by HDFql)
HDFql.VariableRegister(data);
// loop cursor and process each file found
while(HDFql.CursorNext() == HDFql.Success)
{
    // alter (i.e. extend) dataset 'dset' (from file 'output.h5') by 100 more floats
    HDFql.Execute("ALTER DIMENSION dset TO +100");
    // select (i.e. read) dataset 'dset' (from file found) and populate variable 'data'
    HDFql.Execute("SELECT FROM \"" + HDFql.CursorGetChar() + "\" dset INTO MEMORY " + HDFql.VariableGetNumber(data));
    // insert (i.e. write) values stored in variable 'data' into dataset 'dset' (from file 'output.h5') at the end of it (using a hyperslab)
    HDFql.Execute("INSERT INTO dset(-1:::) VALUES FROM MEMORY " + HDFql.VariableGetNumber(data));
}

Related

Read HDF5 dataset of multiple data types

I have an HDF5 dataset which contains different data types (int and float).
When I read it into a NumPy array, it is detected as an array of type np.void.
import numpy as np
import h5py
f = h5py.File('Sample.h5', 'r')
array = np.array(f['/Group1/Dataset'])
print(array.dtype)
Image of the data types {print(array.dtype)}
How can I read this dataset into arrays with each column as the same data type as that of input? Thanks in advance for the reply
Here are 2 simple examples showing both ways to slice a subset of the dataset using the HDF5 Field/Column names.
The first method extracts a subset of the data to a record array by slicing when accessing the dataset.
The second method follows your current method. It extracts the entire dataset to a record array, then slices a new view to access a subset of the data.
Print statements are used liberally so you can see what's going on.
Method 1
real_array= np.array(f['/Group1/Dataset'][:,'XR','YR','ZR'])
print(real_array.dtype)
print(real_array.shape)
Method 2
cmplx_array = np.array(f['/Group1/Dataset'])
print(cmplx_array.dtype)
print(cmplx_array.shape)
disp_real = cmplx_array[['XR','YR','ZR']]
print(disp_real.dtype)
print(disp_real.shape)
Review this SO topic for additional insights into copying values from a recarray to a ndarray, and back.
copy-numpy-recarray-to-ndarray
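To get one plain array per column, each field of the structured array can be pulled out by name. The sketch below uses a small in-memory array as a stand-in for np.array(f['/Group1/Dataset']); the ID field and the sample values are assumptions for illustration, while XR, YR and ZR follow the names used above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for the compound dataset read from the file (hypothetical values).
records = np.array(
    [(1, 0.1, 0.2, 0.3), (2, 1.1, 1.2, 1.3)],
    dtype=[("ID", "i8"), ("XR", "f8"), ("YR", "f8"), ("ZR", "f8")],
)

# Each field comes back as an ordinary 1-D array with that field's own dtype.
ids = records["ID"]
xr = records["XR"]

# Fields that share a dtype can be packed into one regular 2-D array.
xyz = rfn.structured_to_unstructured(records[["XR", "YR", "ZR"]])

print(ids.dtype, xr.dtype, xyz.shape)
```

Here ids keeps the integer type and xr the float type of the input, and xyz is a regular float array with one row per record.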

Preparing image dataset for input into Caffe deep learning

I know the first step is to create two file lists with the corresponding labels, one for the training and one for the test set. Suppose the former is called train.txt and the latter val.txt. The paths in these file lists should be relative. The labels should start at 0 and look similar to this:
relative/path/img1.jpg 0
relative/path/img2.jpg 0
relative/path/img3.jpg 1
relative/path/img4.jpg 1
relative/path/img5.jpg 2
For each of these two sets, we will create a separate LevelDB. Is this formatted as a text file? I thought I would create a directory with several subdirectories for each of my classes. Do I manually have to create a text file?
Please see this tutorial on how to use convert_imageset to build LevelDB or LMDB datasets for Caffe's training.
As you can see from these instructions, it does not matter how you arrange the image files on disk (same folder, different folders...) as long as the paths in your 'train.txt'/'val.txt' files are correct relative to the '/path/to/jpegs/' argument. But if you want to use the convert_imageset tool, you'll have to create a text file listing all the images you want to use.
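The text file does not have to be written by hand. If the images are already sorted into one subdirectory per class, a short script can generate the list. This is a sketch under assumptions: the function name write_file_list, the extension filter, and the rule that labels follow the alphabetical order of the class directories are all choices made here for illustration.

```python
import os


def write_file_list(root, list_path, extensions=(".jpg", ".jpeg", ".png")):
    """Write 'relative/path label' lines for images stored as root/<class>/<file>."""
    # One label per class directory, numbered from 0 in alphabetical order.
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    with open(list_path, "w") as out:
        for label, cls in enumerate(classes):
            for fname in sorted(os.listdir(os.path.join(root, cls))):
                if fname.lower().endswith(extensions):
                    # Paths are written relative to `root`.
                    out.write(f"{cls}/{fname} {label}\n")
    return classes
```

Calling it once per set, for example on a training directory to produce train.txt and on a validation directory to produce val.txt, gives lists in the format shown in the question, with `root` playing the role of the '/path/to/jpegs/' argument.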

F# - Organisation of algorithms in a file

I cannot find a good way to organize various algorithms. Today the file looks like this:
1/ Extraction of values from Excel
2/ First algorithm based on these values (extracted from Excel) starting with
"let matriceAlgo1 ="
3/ Second algorithm starting from the same values
"let matriceAlgo2 ="
4/ Synthesis algorithm, doing a weighted average (depending on several values) of the 2/ and 3/ and selecting the result to be shown.
"let matriceSynthesis ="
My question is the following: what should I put before the different parts of this file in order to just call them by their name? I have seen answers explaining that Module could be an answer, but I don't know how to apply it in my case (or anything else, if it's not the right answer). In the end, I would like to be able to write something like this:
"launch Extraction
launch First Algorithm
launch Second Algorithm
Launch Synthesis"
The way I usually organize files is to have some clear visual separator between different sections of a file (see for example Crawler.fsx on GitHub) and then have one "main" section at the end that calls functions declared previously.
I don't really use modules unless I have a large number of functions with clashing names. It would be a good idea to use modules if your algorithm consists of more functions (e.g. Alg1.initialize, Alg1.run, etc.). Then you could easily switch between different algorithms using a module alias:
module Alg = Alg1 // or Alg2
let a = Alg.initialize
Alg.run a
If the file is getting longer, then you could also move sections to separate files and use #load "File.fs" to load algorithms or functions from a file. In that case, you probably need to use modules, but you can always open the module after loading the file.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in Hadoop map/reduce (written in Java, on Linux).
Text files 'rules-1' and 'rules-2' (3GB in total) contain some rules. Each rule is separated by an endline character, so the files can be read using the readLine() function.
These files 'rules-1' and 'rules-2' need to be imported as a whole from HDFS in every map function in my cluster, i.e. these files are not splittable across different map functions.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
The problem is, if I pull each line of the rules-1 and rules-2 files into a static ArrayList only once, so that each mapper can share the same ArrayList, and try to compare its elements with each input value from the record file, I get a memory overflow error, since 3GB cannot be stored in the ArrayList at one time.
Alternatively, if I import only a few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce takes a lot of time to finish its job.
Could you provide any alternative ideas for how this can be done without the memory overflow error? Would it help if I put rules-1 and rules-2 inside a database that supports HDFS, or something similar? I'm running out of ideas, and I would really appreciate your suggestions.
If your input files are small, you can load them into static variables and use the rules as an input.
If that is not the case, I can suggest the following approaches:
a) Give rules-1 and rules-2 a high replication factor, close to the number of nodes you have. Then you can read rules-1 and rules-2 from HDFS for each record in the input relatively efficiently, because it will be a sequential read from the local datanode.
b) If you can find some hash function which, when applied to the rule and to the input string, predicts without false negatives that they can match, then you can emit this hash for rules and input records and resolve all possible matches in the reducer. It will be very similar to the way a join is done using MR.
c) I would consider some other optimization techniques like building search trees, or sorting, since otherwise the problem looks computationally expensive and will take forever...
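Option b) can be illustrated outside Hadoop with plain Python standing in for the map, shuffle and reduce phases. Everything specific here is an assumption, because the question does not say what a rule is: the sketch pretends a rule matches a record when the record starts with the rule, which makes the first character a bucket key with no false negatives. The function names are hypothetical.

```python
from collections import defaultdict


def matches(rule, record):
    # Hypothetical rule semantics: a rule matches records that start with it.
    return record.startswith(rule)


def bucket_key(line):
    # If record.startswith(rule), both share the same first character,
    # so this key never separates a rule from a record it could match.
    return line[:1]


def map_phase(rules, records):
    # Emit (key, tagged value) for rules and records alike, skipping empty lines.
    for rule in rules:
        if rule:
            yield bucket_key(rule), ("rule", rule)
    for record in records:
        if record:
            yield bucket_key(record), ("record", record)


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Each reducer call only compares rules and records from one bucket.
    for values in groups.values():
        rules = [v for tag, v in values if tag == "rule"]
        for tag, record in values:
            if tag == "record":
                for rule in rules:
                    if matches(rule, record):
                        yield record, rule
```

Each bucket then needs to hold only the rules that share its key, rather than the full 3GB, which is the point of the approach. How well it works depends on finding a key that spreads the rules evenly.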
On this page, find Real-World Cluster Configurations; it covers file size configuration.
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
Read the content text file as the MapReduce input, and read the keyword text file from HDFS inside the mapper function. Split each input value using StringTokenizer on value.toString(). The HDFS read code in your mapper reads the keyword file line by line, so you compare using two while loops. Whenever you want to keep data, send it to the reducer.
Split the 3GB text file into several text files and feed all of them to your previous MapReduce program as usual.
To split the text file I wrote a Java program; you decide how many lines to write to each text file.
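The Java splitter mentioned above is not shown. As a rough equivalent, here is a minimal Python sketch that writes a fixed number of lines to each output file. The function name split_lines and the numbered output file naming are assumptions for illustration.

```python
def split_lines(path, lines_per_file, prefix):
    """Split `path` into files named <prefix>00000.txt, <prefix>00001.txt, ..."""
    parts = []
    out = None
    with open(path) as src:
        for n, line in enumerate(src):
            if n % lines_per_file == 0:
                # Start a new part every `lines_per_file` lines.
                if out is not None:
                    out.close()
                name = f"{prefix}{len(parts):05d}.txt"
                parts.append(name)
                out = open(name, "w")
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

The input is read one line at a time, so memory use stays small regardless of the file size.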

Best way to transpose a grid of data in a file

I have large data files of values on a 2D grid.
They are organized such that subsequent rows of data in the grid are subsequent lines in the file.
Each column is separated by a tab character.
Essentially, this is a CSV file, but with tabs instead of commas.
I need to transpose the data (first row becomes first column) and output it to another file. What's the best way to do this? Any language is okay (I prefer to use Perl or C/C++). Currently, I have a Perl script that just reads the entire file into memory, but I have files which are simply gigantic.
The simplest way would be to make multiple passes through your input, extracting a subset of columns on each pass. The number of columns would be determined by how much memory you wanted to use and how many rows are in the input file.
For example:
On pass 1 you read the entire input file and process only the first, say, 10 columns. If the input had 1 million rows, the output would be a file with 1 million columns and 10 rows. On the next pass you would read the input again, and process columns 11 thru 20, appending the results to the original output file. And so on....
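The multi-pass idea could look roughly like this in Python. It is a sketch that assumes a rectangular, tab-separated input (every row has the same number of fields); the function name transpose_tsv and the cols_per_pass parameter are made up for illustration.

```python
def transpose_tsv(in_path, out_path, cols_per_pass=10):
    """Transpose a tab-separated grid, holding only a few columns in memory."""
    # Count the columns from the first line.
    with open(in_path) as f:
        first = f.readline()
    ncols = len(first.rstrip("\n").split("\t")) if first else 0

    with open(out_path, "w") as out:
        for start in range(0, ncols, cols_per_pass):
            stop = min(start + cols_per_pass, ncols)
            columns = [[] for _ in range(stop - start)]
            # One full pass over the input per group of columns.
            with open(in_path) as f:
                for line in f:
                    fields = line.rstrip("\n").split("\t")
                    for k in range(start, stop):
                        columns[k - start].append(fields[k])
            # Each collected input column becomes one output row.
            for col in columns:
                out.write("\t".join(col) + "\n")
```

Memory use is about cols_per_pass times the number of rows, at the cost of reading the input ncols / cols_per_pass times (rounded up).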
If you have Python with NumPy installed, it's as easy as this:
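#!/usr/bin/env python
import csv
import numpy

# Read the tab-separated grid; note that this loads the whole file into memory.
with open('/path/to/data.csv', newline='') as file:
    data = numpy.array(list(csv.reader(file, delimiter='\t')))
transpose = data.T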
... the csv module is part of Python's standard library.

Resources