HDF5 files larger than expected using HDFql

Consider the following code, which simply dumps one million 2-byte integers into an HDF5 file using HDFql:
std::string filepath = "/tmp/test.h5";
char script_[1024]; // buffer for the HDFql statements built with sprintf
sprintf(script_, "CREATE TRUNCATE FILE %s", filepath.c_str());
HDFql::execute(script_);
sprintf(script_, "USE FILE %s", filepath.c_str());
HDFql::execute(script_);
HDFql::execute("CREATE CHUNKED DATASET data AS SMALLINT(UNLIMITED)");
const int data_size = 1000000;
std::vector<uint16_t> data(data_size);
HDFql::variableRegister(&data[0]);
for(int i=0; i<data_size; i++) {data.at(i)=i;}
// Grow the dataset so that it can hold all data_size values
sprintf(script_, "ALTER DIMENSION data TO +%d", data_size-1);
HDFql::execute(script_);
sprintf(script_, "INSERT INTO data(-%d:1:1:%d) VALUES FROM MEMORY 0", 0, data_size);
HDFql::execute(script_);
Since HDF5 is an efficient binary format for storing data, I'd expect the file to be around 1E6 * 2 bytes ≈ 2 MB. Instead the file size is ~40 MB! That's around 20 times larger than expected. I found this while using HDFql to convert another binary format to HDF5: the resulting HDF5 files were far bigger than the original binary files. Does anyone know what's going on here?
Many thanks!
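The expectation above can be written out as a quick back-of-the-envelope check. This is only a sketch of the arithmetic: the 40 MB figure is the approximate size reported in the question, taken here as exactly 40,000,000 bytes, which is an assumption.

```python
# Back-of-the-envelope check of the sizes quoted in the question.
n_values = 1_000_000          # number of integers written
bytes_per_value = 2           # SMALLINT is a 2-byte integer
observed_bytes = 40_000_000   # "~40MB" as reported (assumed exact here)

expected_bytes = n_values * bytes_per_value
ratio = observed_bytes / expected_bytes
overhead_per_value = (observed_bytes - expected_bytes) / n_values

print(f"expected payload: {expected_bytes / 1e6:.1f} MB")
print(f"observed / expected: {ratio:.0f}x")
print(f"extra bytes per stored value: {overhead_per_value:.0f}")
```

Under that assumption, the file carries roughly 38 bytes of overhead for every 2-byte value stored.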

Related

Performance of multiple chunked datasets in the same HDF5 file?

Suppose (I am adding a code example below) that I create multiple chunked datasets in the same HDF5 file, and I start appending data to each dataset in random order. Since HDF5 does not know in advance what size to allocate for each dataset, I would think that each append operation (or perhaps a dataset buffer, once filled) is appended directly to the HDF5 file. If so, the data of each dataset would be interleaved with the data from the other datasets, and would be spread out in chunks over the entire HDF5 file.
My question is: if the above description is more or less accurate, would this not adversely affect the performance of read operations done later on that file, and perhaps also the file size if more metadata records are required? And, as a corollary, if the option exists to store each dataset in a separate file, would it not be better to do so from the viewpoint of read performance?
Here is an example of how the HDF5 file that I describe at the beginning could be created:
import h5py, numpy as np
dtype1 = np.dtype( [ ('t','f8'), ('T','f8') ] )
dtype2 = np.dtype( [ ('q','i2'), ('Q','f8'), ('R','f8') ] )
dtype3 = np.dtype( [ ('p','f8'), ('P','i8') ] )
with h5py.File('foo.hdf5','w') as f:
    dset1 = f.create_dataset('dset1', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype1))
    dset2 = f.create_dataset('dset2', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype2))
    dset3 = f.create_dataset('dset3', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype3))
    for _ in range(10):
        random_lengths = np.random.randint(low=1, high=10, size=3)
        d1 = np.ones( (random_lengths[0],), dtype=dtype1 )
        dset1[-1] = d1
        dset1.resize( (dset1.shape[0]+1,) )
        d2 = np.ones( (random_lengths[1],), dtype=dtype2 )
        dset2[-1] = d2
        dset2.resize( (dset2.shape[0]+1,) )
        d3 = np.ones( (random_lengths[2],), dtype=dtype3 )
        dset3[-1] = d3
        dset3.resize( (dset3.shape[0]+1,) )
I know I could try it both ways (a single file with multiple datasets, or multiple files with a single dataset each) and time it, but the result might depend on the specifics of the example data used. I would rather have a more general answer to this question, and possibly some insight into how HDF5/h5py work internally in this case.
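The layout the question worries about can be pictured with a toy model. The sketch below is not HDF5's actual allocator: it simply assumes every append places one fixed-size chunk at the current end of the file, and the helper names `simulate_layout` and `contiguous_runs` are hypothetical.

```python
from itertools import groupby

def simulate_layout(append_order, chunk_bytes=1024):
    """Toy model: every append places one chunk at the current end of file.

    Returns a list of (file_offset, dataset_name) tuples in file order.
    """
    layout = []
    offset = 0
    for name in append_order:
        layout.append((offset, name))
        offset += chunk_bytes
    return layout

def contiguous_runs(layout):
    """Count runs of adjacent chunks that belong to the same dataset."""
    return sum(1 for _ in groupby(name for _, name in layout))

# Appending to the three datasets in turn vs. writing each one in full.
interleaved = simulate_layout(["dset1", "dset2", "dset3"] * 3)
sequential = simulate_layout(["dset1"] * 3 + ["dset2"] * 3 + ["dset3"] * 3)

print(contiguous_runs(interleaved))  # 9 runs: no two neighbouring chunks match
print(contiguous_runs(sequential))   # 3 runs: one per dataset
```

In this model, the number of runs is a rough proxy for how many separate regions a reader would have to visit to read one dataset in full. Whether that matters in practice depends on the real library and storage, which the model does not capture.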

dask dataframe,to_parquet creates a 10 kb file first and then creates rest of the files

When dask dataframe's to_parquet is called, a small Parquet file of about 10 kB is created immediately, and then other files of about 100 MB each are created. I am not sure where this small file comes from; my intent is to create files of uniform size.
I am also missing one row when comparing the row counts of the source CSV and the output Parquet files.
import os
import uuid
import dask.dataframe as dd

def ProcessPQ(aDfChunk, output_pqfolder):
    # Write one partition to its own uniquely named Parquet file
    fPQfile = os.path.join(output_pqfolder, "{}.parquet".format(str(uuid.uuid1())))
    aDfChunk.to_parquet(fPQfile, compression="snappy")

def ConvertTxttoPQ(aLocation, aFilePattern, aColumns, aDTypes):
    files = os.path.join(aLocation, aFilePattern)
    df = dd.read_csv(files, dtype=aDTypes, header=None, sep='|', names=aColumns)
    df2 = df.repartition(npartitions=2)
    # aObjFileInfo is defined elsewhere in the original program
    df2.map_partitions(ProcessPQ, aObjFileInfo.outputfilelocation).compute(num_workers=3)

MedianBlur() calculate max kernel size

I want to use the medianBlur function with a very large ksize, such as 301 or more. But if I pass a ksize that is too high, the function sometimes crashes. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I tried reducing ksize in a try/catch loop, but that is not very effective since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break;
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations of sum and the H.coarse array are quite complicated.
Researching further, I found a scientific paper about the algorithm: https://www.researchgate.net/publication/321690537_Efficient_Scalable_Median_Filtering_Using_Histogram-Based_Operations
I am trying to read it but, honestly, I don't understand much of it.
How do I calculate the maximum ksize with a given image?

The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
So for an image of data type CV_8U, the maximum kernel size is 255.
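If that limit is taken at face value, a requested kernel size can be clamped before the call rather than recovered in a try/catch loop. The following is a minimal sketch: `clamp_median_ksize` is a hypothetical helper, the default of 255 is simply the figure quoted in this answer, and the odd-and-greater-than-1 rule is medianBlur's usual requirement on ksize.

```python
def clamp_median_ksize(requested, max_ksize=255):
    """Clamp `requested` to an odd kernel size between 3 and `max_ksize`."""
    k = max(3, min(requested, max_ksize))
    if k % 2 == 0:
        k -= 1  # medianBlur expects an odd aperture size
    return k

for requested in (1, 8, 255, 300, 301):
    print(requested, "->", clamp_median_ksize(requested))
```

The returned value would then be passed as ksize; if a different limit turns out to apply to a particular build or image type, only the `max_ksize` argument needs to change.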

OpenCV read image from csv file

I have an image in a CSV file and I want to load it in my program. I found that I can load an image from CSV like this:
CvMLData mlData;
mlData.read_csv(argv[1]);
const CvMat* tmp = mlData.get_values();
cv::Mat img(tmp, true),img1;
img.convertTo(img, CV_8UC3);
cv::namedWindow("img");
cv::imshow("img", img);
I have an RGB picture in that file but I get a grey picture. Can somebody explain how to load a color image, or how I can modify this code to get a color image?
Thanks!

Updated
Ok, I don't know how to read your file into OpenCV for the moment, but I can offer you a work-around to get you started. The following will create a header for a PNM format file to match your CSV file and then append your data onto the end and you should end up with a file that you can load.
printf "P3\n284 177\n255\n" > a.pnm # Create PNM header
tr -d ',][' < izlaz.csv >> a.pnm # Append CSV data, after removing commas and []
If I do the above, I can see your bench, tree and river.
If you cannot read that PNM file directly into OpenCV, you can make it into a JPEG with ImageMagick like this:
convert a.pnm a.jpg
I also had a look at the University of Wisconsin ML data archive, that is read with those OpenCV functions that you are using, and the format of their data is different from yours... theirs is like this:
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
yours looks like this:
[201, 191, 157, 201 ... ]
So maybe this tr command is enough to convert your data:
tr -d '][' < izlaz.csv > TryMe.csv
Original Answer
If you run the following on your CSV file, it translates commas into newlines and then counts the lines:
tr "," "\n" < izlaz.csv | wc -l
And that gives 150,804 lines, which means 150,804 commas in your file and therefore 150,804 integers in your file (+/- 1 or 2). If your greyscale image is 177 rows by 852 columns, you are going to need 150,804 RGB triplets (i.e. roughly 450,000 integers) to represent a colour image. As it is, you only have a single greyscale value for each pixel.
The fault is in the way you write the file, not the way you read it.
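The count can be checked against both sets of dimensions mentioned in this answer. The sketch below uses only the numbers quoted above (150,804 values, 177 rows, 852 columns, and the 284 x 177 PNM header); `count_values` is a hypothetical helper and the sample string is made up for illustration.

```python
def count_values(csv_text):
    """Count the integers in CSV text after dropping '[' and ']'."""
    cleaned = csv_text.replace("[", "").replace("]", "").replace("\n", ",")
    return sum(1 for field in cleaned.split(",") if field.strip())

rows, width, channels = 177, 284, 3   # dimensions used in the PNM header
values_in_file = 150_804              # count reported by tr ... | wc -l

as_colour = rows * width * channels   # 3 values per pixel
as_grey = rows * 852                  # same data read as 852 grey columns
print(as_colour == values_in_file, as_grey == values_in_file)

sample = "[201, 191, 157, 201, 191, 157]"  # made-up sample row
print(count_values(sample))                # 6 values, i.e. 2 pixels of 3 channels
```

Both products equal 150,804, so the same file reads either as a 177 x 852 single-channel image or as a 177 x 284 three-channel image.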

To see a color image I must set the number of channels. This code works for me:
CvMLData mlData;
mlData.read_csv(argv[1]);
const CvMat* tmp = mlData.get_values();
cv::Mat img(tmp, true),img1;
img.convertTo(img, CV_8UC3);
img= img.reshape(3); //set number of channels
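Conceptually, setting the number of channels regroups each flat row of values into consecutive triplets. Here is a pure-Python stand-in for that regrouping, as a sketch only: `to_pixels` is a hypothetical helper, not an OpenCV call, and the sample row is made up.

```python
def to_pixels(flat_row, channels=3):
    """Group a flat row of values into tuples of `channels` values each."""
    if len(flat_row) % channels != 0:
        raise ValueError("row length is not a multiple of the channel count")
    return [tuple(flat_row[i:i + channels])
            for i in range(0, len(flat_row), channels)]

row = [201, 191, 157, 201, 190, 156]   # made-up sample: 6 values
print(to_pixels(row))                  # two pixels of three channel values
print(len(to_pixels([0] * 852)))       # an 852-value row becomes 284 pixels
```

Each tuple keeps the channel values in the order in which they appear in the file.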

Real to Complex FFT with CUFFT, using OpenCV as Data source

I'm having an issue trying to perform a two dimensional transform on an array of floats using cuFFT. I've had a look at the documentation, but some of the information is contradictory/not clear; so I have a few questions:
My data is 480 rows, with 640 columns (e.g. float data[480][640] but in a single dimension so float data[480*640])
If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftExecC2R given the above values?
I think that's all for now...
Many thanks in advance
EDIT: So, I've implemented the Forward and Inverse transforms as suggested by JackOLantern. However my results are not what I am expecting (an identical Result after FFT as Before it). I have an image gallery here showing two sets of examples. The first is from my room, the second from my University Project.
In the cuFFT Documentation, there is ambiguity in the use of cufftPlan2d (hence why I asked). In the documentation, for a two dimensional array, the data should be input as above (float data[480][640] == float data[NY][NX]) So NY represents the rows. However in the function listing for cufftPlan2d, it states that nx (the parameter) is for the rows...
Swapping the values of NX and NY in the function call gives the result as in the project image (correct orientation, but split into three partially overlapping images at 1/4 the normal size) however, using the parameters as JackOLantern states in his answer gives a slanted/skewed result.
Am I doing something wrong here? Or does the cuFFT library have issues with this type of thing.
ALSO: I have undone a couple of the edits made by JackOLantern to this question as my issues MAY stem from the fact my data is coming from OpenCV.

EDIT: I've recently found out that I was the one who made a mistake in the way I used the function.
Originally I thought the function definition referred to the size of the data being passed into it.
However, it appears that the parameters actually refer directly to the size of the REAL part.
This means that the parameters refer to:
The size of the input data when using R2C (Real to Complex)
The size of the output data when using C2R (Complex to Real)
So it appears that the cuFFT documentation and the library itself do not correspond.
When performing an R2C followed by a C2R (real to complex, complex to real respectively), the documentation states that for a Real input of NX x NY dimensions, the Complex output is NX x (floor(NY/2) +1); and vice versa.
However the actual output is of dimensions NX x NY and the actual input is of dimensions NX x NY. This is (half) mentioned on the very first page as
C2R - Symmetric complex input to real output
Implying that the complex data must be Symmetric, i.e. must also have the redundant data in addition to the non-redundant data.
There are a number of other contradictions within the documentation as well which I won't go into.
Needless to say, the problem has been solved.
I have included a MWE below. Near the top are a couple of lines with #define NUM_C2 and appropriate comments. Changing this changes whether the documentation format is followed, or my "fix".
The output is
The Input Real data
The Intermediate Complex data
The output Real data
The ratio of the output data to the input data (there are minor FFT errors, ~1 indicates correct)
Feel free to change the parameters (NUM_R and NUM_C) and feel free to comment if you think I have made a mistake somewhere.
#include <iostream>
#include <cstdlib>
#include <math.h>
#include <cufft.h>
// e.g. float data[NUM_R][NUM_C]
#define NUM_R 12
#define NUM_C 16
// Documentation Version
//#define NUM_C2 (1+NUM_C/2)
// "Correct" Version
#define NUM_C2 NUM_C
using namespace std;
int main(int argc, char** argv)
{
    cufftReal *in_h, *out_h, *in_d, *out_d;
    cufftComplex *mid_d, *mid_h;
    cufftHandle pF, pI;
    int r, c;
    // Host buffers
    in_h = (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
    out_h= (cufftReal*) malloc(NUM_R * NUM_C * sizeof(cufftReal));
    mid_h= (cufftComplex*)malloc(NUM_C2*NUM_R*sizeof(cufftComplex));
    // Device buffers
    cudaMalloc((void**) &in_d, NUM_R * NUM_C * sizeof(cufftReal));
    cudaMalloc((void**)&out_d, NUM_R * NUM_C * sizeof(cufftReal));
    cudaMalloc((void**)&mid_d, NUM_C2 * NUM_R * sizeof(cufftComplex));
    // Forward (R2C) and inverse (C2R) plans
    cufftPlan2d(&pF, NUM_R, NUM_C, CUFFT_R2C);
    cufftPlan2d(&pI, NUM_R,NUM_C2, CUFFT_C2R);
    // Fill and print the input real data
    cout<<endl<<"------"<<endl;
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            in_h[c + NUM_C * r] = cos(2.0*M_PI*(c*7.0/NUM_C+r*3.0/NUM_R));
            out_h[c+ NUM_C * r] = 0.f;
            cout<<in_h[c+NUM_C*r];
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    cudaMemcpy((cufftReal*)in_d, (cufftReal*)in_h, NUM_R * NUM_C * sizeof(cufftReal),cudaMemcpyHostToDevice);
    cufftExecR2C(pF, (cufftReal*)in_d, (cufftComplex*)mid_d);
    cudaMemcpy((cufftComplex*)mid_h, (cufftComplex*)mid_d, NUM_C2*NUM_R*sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    // Print the intermediate complex data as real|imaginary
    cout<<endl<<"------"<<endl;
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C2; c++)
        {
            cout<<mid_h[c+(NUM_C2)*r].x<<"|"<<mid_h[c+(NUM_C2)*r].y;
            if(c<(NUM_C2-1)) cout<<", ";
            else cout<<endl;
        }
    }
    cufftExecC2R(pI, (cufftComplex*)mid_d, (cufftReal*)out_d);
    cudaMemcpy((cufftReal*)out_h, (cufftReal*)out_d, NUM_R*NUM_C*sizeof(cufftReal), cudaMemcpyDeviceToHost);
    // Print the output real data, scaled by 1/(NUM_R*NUM_C)
    cout<<endl<<"------"<<endl;
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            cout<<out_h[c+NUM_C*r]/(NUM_R*NUM_C);
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    // Print the ratio of scaled output to input (~1 indicates correct)
    cout<<endl<<"------"<<endl;
    for(r=0; r<NUM_R; r++)
    {
        for(c=0; c<NUM_C; c++)
        {
            cout<<(out_h[c+NUM_C*r]/(NUM_R*NUM_C))/in_h[c+NUM_C*r];
            if(c<(NUM_C-1)) cout<<", ";
            else cout<<endl;
        }
    }
    // Release plans, host memory and device memory
    cufftDestroy(pF);
    cufftDestroy(pI);
    free(in_h);
    free(out_h);
    free(mid_h);
    cudaFree(in_d);
    cudaFree(out_d); // was cudaFree(out_h), which passed a host pointer
    cudaFree(mid_d);
    return 0;
}
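Two details of the program above can be checked without a GPU: the division of the output by NUM_R*NUM_C, and the "redundant" half of the complex data. The sketch below uses a naive one-dimensional DFT in plain Python rather than cuFFT. It assumes only that the transforms are unnormalised, as the scaling in the program implies, and the helper `dft` and the sample signal are made up for illustration.

```python
import cmath

def dft(x, sign):
    """Naive unnormalised DFT; sign=-1 is forward, sign=+1 is inverse."""
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]  # made-up real input
n = len(signal)

spectrum = dft(signal, -1)
# Real input gives a Hermitian-symmetric spectrum: X[n-k] == conj(X[k]),
# so only n // 2 + 1 coefficients carry independent information.
symmetric = all(abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9
                for k in range(1, n))

# The unnormalised inverse returns n times the input, hence the division.
round_trip = [v.real / n for v in dft(spectrum, +1)]
max_error = max(abs(a - b) for a, b in zip(round_trip, signal))

print(symmetric, n // 2 + 1, max_error < 1e-9)
```

In two dimensions the same scaling applies with n replaced by the total number of real samples, which is why the program divides by NUM_R*NUM_C before comparing output to input.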

1) If we say my input dimensions (of real data) are N1 = 480 and N2 = 640. Are the dimensions (after a real to complex transform) N1=480, N2=321?
The output of cufftExecR2C is a NX*(NY/2+1) cufftComplex matrix. So in your case, you will have a 480x321 float2 matrix as output.
2) Can I cudaMemcpy the data directly into a cufftReal array of the same size? Or must it be a cufftComplex array?
If it must be a cufftComplex array, I am assuming the elements need to be in the place of the real components?
Yes, you can copy the N1xN2 data directly into a cufftReal array.
3) What is the correct structure of a call to cufftPlan2d, cufftExecR2C and cufftExecC2R given the above values?
cufftPlan2d(&plan, N1, N2, CUFFT_R2C);
cufftExecR2C(plan, (cufftReal*)idata, (cufftComplex*) odata);
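The sizes from point 1 can be worked out for the 480 x 640 case as follows. This is only a sketch of the arithmetic: `r2c_output_shape` is a hypothetical helper, and the byte sizes assume cufftReal is a 32-bit float and cufftComplex is a pair of 32-bit floats.

```python
def r2c_output_shape(n1, n2):
    """Shape of the complex output for a real n1 x n2 input."""
    return (n1, n2 // 2 + 1)

REAL_BYTES = 4      # cufftReal: one 32-bit float (assumed)
COMPLEX_BYTES = 8   # cufftComplex: two 32-bit floats (assumed)

n1, n2 = 480, 640
out_rows, out_cols = r2c_output_shape(n1, n2)
input_bytes = n1 * n2 * REAL_BYTES
output_bytes = out_rows * out_cols * COMPLEX_BYTES

print((out_rows, out_cols))   # (480, 321)
print(input_bytes)            # 1228800
print(output_bytes)           # 1232640
```

The two buffers are nearly the same size in bytes, since the complex output has about half as many elements, each twice as large.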
