H5PY Writes Very Slow - hdf5

I have an h5py dataset like the one below. I want to index the records by string instead of by numeric value, so that, for example, I could get the value of the first record with dset[dset.attrs['id1']].
I am trying to write the attributes with the code below, but it is extremely slow. If I run %timeit dset.attrs[rid] = idx in the loop, a single write takes about 310 ms. The strings I am writing are 36 characters long. I have about 100k records to write, which would take about 9 hours. Something must be terribly wrong. The CPU is also pegged.
import h5py

ids = ['id1', 'id2', 'id3']
h5 = h5py.File("/tmp/ds.h5", "w")
dset = h5.create_dataset("lds", (100000,), dtype='float32')
for idx, rid in enumerate(ids):  # loop takes forever
    dset.attrs[rid] = idx  # each write takes ~310 ms
EDIT
Minimal "working" example.
for idx, rid in enumerate(range(10)):
    %timeit dset.attrs[str(rid)] = idx
10 loops, best of 3: 470 ms per loop
10 loops, best of 3: 470 ms per loop
...
Nearly 0.5 seconds for a single write.
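The extrapolation to hours is plain arithmetic and can be checked with a short sketch. It assumes every attribute write costs the same as the measured ones (310 ms and 470 ms), which is a simplification; the helper name total_hours is made up for illustration.

```python
def total_hours(n_writes: int, seconds_per_write: float) -> float:
    """Projected wall-clock hours for n_writes sequential writes."""
    return n_writes * seconds_per_write / 3600


n_records = 100_000  # record count quoted in the question

# Projection at each of the two per-write timings reported in the question.
for per_write in (0.310, 0.470):
    hours = total_hours(n_records, per_write)
    print(f"{per_write * 1000:.0f} ms/write -> {hours:.1f} h")
```

At these rates the full run comes out at roughly 8.6 and 13.1 hours respectively, which matches the "about 9 hours" estimate for the first timing.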

Open the file with libver='latest'. Attribute writes are a lot faster that way. For example:
h5py.File('ds.h5', 'w', libver='latest')
See https://github.com/h5py/h5py/issues/705
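A different route to the same goal, not something this answer proposes, is to skip per-record attributes and keep a single id-to-row lookup instead. The sketch below is pure Python and makes several assumptions: the ids are UUID-like 36-character strings, the list named values stands in for the float32 dataset, and the ordered id list would be persisted separately (for instance as a companion string dataset) so the dictionary can be rebuilt after loading.

```python
import uuid

# Hypothetical 36-character record ids, in the same order as the dataset rows.
ids = [str(uuid.uuid4()) for _ in range(1000)]

# One dictionary maps each id to its row number. Only the ordered id list
# needs to be stored; the dictionary is rebuilt from it after loading.
row_of = {rid: idx for idx, rid in enumerate(ids)}

# Stand-in for the float32 dataset from the question.
values = [float(i) for i in range(len(ids))]


def lookup(rid: str) -> float:
    """Return the value of the record with the given string id."""
    return values[row_of[rid]]
```

Building the dictionary is a single in-memory pass, so the cost no longer scales with the number of attribute writes.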

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks).
import numpy as np
import xarray as xr

n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
da_raster = xr.DataArray(vals_raster, coords={"y": y_raster, "x": x_raster, 'time': time})
da_raster = da_raster.chunk(dict(x=100, y=100))

def fn_on_chunk(da_chunk):
    # Tried to replicate the fact that I can't know in advance
    # the length of one dimension of the output
    len_range = np.random.randint(10)
    outs = []
    for foo in range(len_range):
        # Do some magic that finds needed coordinates
        # on this particular chunk
        x_chunk, y_chunk = fn_magic(foo)
        out = da_chunk.sel(x=x_chunk, y=y_chunk)
        out['foo'] = foo
        outs.append(out)
    return xr.concat(outs, dim='foo')
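One way to build the per-chunk pieces described in the question is to turn the chunk sizes into index slices and select with them. The following is a minimal sketch in pure Python, assuming the chunk sizes are available as a mapping from dimension name to a tuple of chunk lengths (the shape that DataArray.chunksizes reports); the helper name chunk_slices is made up for illustration, and the xarray usage in the trailing comment is not run here.

```python
from itertools import accumulate, product


def chunk_slices(chunksizes):
    """Yield one {dim: slice} mapping per chunk.

    chunksizes maps each dimension name to a tuple of chunk lengths,
    e.g. {"x": (100, 100), "y": (100,)}.
    """
    dims = list(chunksizes)
    per_dim = []
    for dim in dims:
        # Cumulative sums give the stop index of each chunk along this dim.
        stops = list(accumulate(chunksizes[dim]))
        starts = [0] + stops[:-1]
        per_dim.append([slice(a, b) for a, b in zip(starts, stops)])
    # Every combination of per-dimension slices addresses exactly one chunk.
    for combo in product(*per_dim):
        yield dict(zip(dims, combo))


# Possible use with xarray (an assumption, not executed in this sketch):
#     for sl in chunk_slices(da_raster.chunksizes):
#         da_chunk = da_raster.isel(sl)
```

Each selected piece keeps its coordinates and dimensions, so a function such as fn_on_chunk could be applied to it directly.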

Filtering input file with chunksize and skiprows using line number as index in dask dataframe

I have ~70 GB of output from MD simulations. The file repeats a regular pattern: a fixed number of explanation lines followed by a fixed number of data lines. How can I read the file chunk by chunk into a Dask DataFrame while ignoring the explanation lines?
I successfully wrote a lambda function for the skiprows argument of pandas.read_csv that ignores the explanation lines and reads only the data lines. I then converted the pandas code to dask, but it does not work. Here is the dask code, written by replacing pandas.read_csv with dd.read_csv:
import dask.dataframe as dd

# First extract the number of atoms, and hence the number of data lines:
with open(filename[0], mode='r') as file:  # The same as Chanil's code
    line = file.readline()
    line = file.readline()
    line = file.readline()
    line = file.readline()  # natoms
    natoms = int(line)

skiplines = 9  # Number of explanation lines repeating after natoms lines of data

def logic_for_chunk(index):
    """Return True if the line at this file index is an explanation line to skip."""
    if index % (natoms + skiplines) > 8:
        return False
    return True

df_chunk = dd.read_csv('trajectory.txt', sep=' ', header=None, index_col=False, skiprows=lambda x: logic_for_chunk(x), chunksize=natoms)
Here the index of the dataframe is the line number in the file. With the code above, in the first chunk lines 0 to 8 of the file are ignored and lines 9 to 58 are read. In the next chunk, lines 59 to 67 are ignored and a natoms-sized chunk from line 68 to 117 is read. This continues until all the data snapshots are read.
Unfortunately, while the above code works well in pandas, it does not work in dask. How can I implement a similar procedure with a dask dataframe?
The dask dataframe read_csv function cuts the file up at byte locations. It is unable to determine exactly how many lines are in each partition, so it is unwise to depend on the row index within each partition.
If there is some other way to detect a bad line then I would try that. Ideally you will be able to determine a bad line based on the content of the line, not on its location within the file (like every eighth line).
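A content-based test could look like the sketch below. It assumes, purely for illustration, that every data line consists of exactly five whitespace-separated numbers and that no explanation line has that shape. The sample lines and the function name is_data_line are hypothetical, since the real file format is not shown in the question.

```python
def is_data_line(line: str, n_fields: int) -> bool:
    """True if the line has exactly n_fields whitespace-separated numbers."""
    parts = line.split()
    if len(parts) != n_fields:
        return False
    try:
        for part in parts:
            float(part)
    except ValueError:
        return False
    return True


# Hypothetical excerpt: two explanation lines followed by two data lines.
sample = [
    "# explanation line",
    "1000",
    "1 1 0.5 0.25 0.75",
    "2 1 1.5 2.25 0.10",
]
data = [ln for ln in sample if is_data_line(ln, n_fields=5)]
```

Because the predicate looks only at the line itself, it gives the same answer regardless of where a partition boundary falls. It could, for example, be used as the filter on a line-oriented collection such as a dask bag created with read_text.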

luajit copy table is slow

Within a larger lua-script, I have to copy several tables dt:
for i=1,dt:nrow() do
  local r = {}
  for j=1,dt:ncol() do
    r[j] = dt[i][j]
  end
  rslt:append(r)
end
The tables are about 50,000 lines x 25 cols, containing mainly doubles. luajit takes about 10 times as long as "standard" lua. On all other calculations/operations I do before, luajit is faster (1.5 to 3 times).
As silly as this may sound, try pre-allocating the r table with 25 values:
local r = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Unfortunately, plain Lua code cannot pre-size a table (only the C API's lua_createtable and LuaJIT 2.1's table.new extension can), so a filled constructor like this is the simplest way to avoid the re-allocations caused by array assignment in the inner loop. My tests show a noticeable improvement, but nothing close to 10x (although I don't use your methods, so your results may vary).

Moving Average across Variables in Stata

I have a panel data set for which I would like to calculate moving averages across years.
Each year is a variable for which there is an observation for each state, and I would like to create a new variable for the average of every three year period.
For example:
P1947=rmean(v1943 v1944 v1945), P1947=rmean(v1944 v1945 v1946)
I figured I should use a foreach loop with the egen command, but I'm not sure about how I should refer to the different variables within the loop.
I'd appreciate any guidance!
This data structure is quite unfit for purpose. Assuming an identifier id you need to reshape, e.g.
reshape long v, i(id) j(year)
tsset id year
Then a moving average is easy. Use tssmooth or just generate, e.g.
gen mave = (L.v + v + F.v)/3
or (better)
gen mave = 0.25 * L.v + 0.5 * v + 0.25 * F.v
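For readers who do not use Stata, the two smoothers can be sketched in Python. This is a sketch only: it assumes a single panel unit with consecutive years and no gaps, represents missing values as None, and uses a made-up function name.

```python
def centered_average(values, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Three-point centered moving average.

    The first and last points lack a neighbour on one side, so they are
    returned as None, mirroring a missing lag or lead.
    """
    w_prev, w_here, w_next = weights
    out = [None] * len(values)
    for t in range(1, len(values) - 1):
        out[t] = (w_prev * values[t - 1]
                  + w_here * values[t]
                  + w_next * values[t + 1])
    return out


series = [1.0, 2.0, 4.0, 8.0]  # one panel unit, ordered by year
equal = centered_average(series)                       # like (L.v + v + F.v)/3
smooth = centered_average(series, (0.25, 0.5, 0.25))   # unequal weights
```

The unequal weights give the middle year twice the influence of each neighbour. That is the only difference between the two calls.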
More on why your data structure is quite unfit: Not only would calculation of a moving average need a loop (not necessarily involving egen), but you would be creating several new extra variables. Using those in any subsequent analysis would be somewhere between awkward and impossible.
EDIT I'll give a sample loop, while not moving from my stance that it is poor technique. I don't see a reason behind your naming convention whereby P1947 is a mean for 1943-1945; I assume that's just a typo. Let's suppose that we have data for 1913-2012. For means of 3 years, we lose one year at each end.
forval j = 1914/2011 {
    local i = `j' - 1
    local k = `j' + 1
    gen P`j' = (v`i' + v`j' + v`k') / 3
}
That could be written more concisely, at the expense of a flurry of macros within macros. Using unequal weights is easy, as above. The only reason to use egen is that it doesn't give up if there are missings, which the above will do.
FURTHER EDIT
As a matter of completeness, note that it is easy to handle missings without resorting to egen.
The numerator
(v`i' + v`j' + v`k')
generalises to
(cond(missing(v`i'), 0, v`i') + cond(missing(v`j'), 0, v`j') + cond(missing(v`k'), 0, v`k'))
and the denominator
3
generalises to
!missing(v`i') + !missing(v`j') + !missing(v`k')
If all values are missing, this reduces to 0/0, or missing. Otherwise, if any value is missing, we add 0 to the numerator and 0 to the denominator, which is the same as ignoring it. Naturally the code is tolerable as above for averages of 3 years, but either for that case or for averaging over more years, we would replace the lines above by a loop, which is what egen does.
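The same numerator and denominator logic can be written as a short Python sketch. It uses None as a stand-in for Stata's missing value, and the function name is made up for illustration.

```python
def mean_ignoring_missing(values):
    """Mean of the non-missing values; None plays the role of missing."""
    # Missing values contribute 0 to the numerator...
    numerator = sum(0 if v is None else v for v in values)
    # ...and 0 to the denominator, so they are effectively ignored.
    denominator = sum(v is not None for v in values)
    if denominator == 0:
        return None  # the 0/0 case: every value is missing
    return numerator / denominator
```

Because it loops over whatever is passed in, the function works unchanged for windows longer than three years.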
There is a user-written program that can do this very easily for you. It is called mvsumm and can be found through findit mvsumm.
xtset id time
mvsumm observations, stat(mean) win(t) gen(new_variable) end

Size of the array that Fortran can handle

I have 30000 files to process, and each file has 80000 x 5 lines. I need to read all the files and process them to find the average of each line. I have written Fortran code to read and extract all the data from the files. The full problem needs an array of (30000 x 80000), but my program could not go beyond (3300 x 80000). I need to add the 4th column of each file in steps of 300 files: the 4th column of the 1st file with the 4th column of the 301st file, the 4th column of the 2nd file with the 4th column of the 302nd file, and so on. Do you think this is because of a limit on the size of array that Fortran can handle? If so, is there any way to increase that limit? What about the number of files? My code looks like this:
This program runs well.
      implicit double precision (a-h,o-z),integer(i-n)
      dimension x(78805,5),y(78805,5),den(78805,5)
      dimension b(3300,78805),bb(78805)
      character*70,fn
      nf = 3300       ! NUMBER OF FILES
      nj = 78804      ! Number of rows in file.
      ns = 300        ! No. of steps for files.
      ncores = 11     ! No of Cores
c--------------------------------------------------------------------
c--------------------------------------------------------------------
!Initialization
      do i = 1,nf
        do j = 1, nj
          x(j,1) = 0.0
          y(j,2) = 0.0
          den(j,4) = 0.0
c         a(i,j) = 0.0
          b(i,j) = 0.0
c         aa(j) = 0.0
          bb(j) = 0.0
        end do
      end do
c-------!Body program-----------------------------------------------
      iout = 6        ! Output Files upto "ns" no.
      DO i= 1,nf      ! LOOP FOR THE NUMBER OF FILES
        write(fn,10)i
        open(1,file=fn)
        do j=1,nj     ! Loop for the no of rows in the domain
          read(1,*)x(j,1),y(j,2),den(j,4)
          if(i.le.ns) then
c           a(i,j) = prob(j,3)
            b(i,j) = den(j,4)
          else
c           a(i,j) = prob(j,3) + a(i-ns,j)
            b(i,j) = den(j,4) + b(i-ns,j)
          end if
        end do
        close(1)
c ----------------------------------------------------------
c -----Write Out put [Probability and density matrix]-------
c ----------------------------------------------------------
        if(i.ge.(nf-ns)) then
          do j = 1, nj
c           aa(j) = a(i,j)/(ncores*1.0)
            bb(j) = b(i,j)/(ncores*1.0)
            write(iout,*) int(x(j,1)),int(y(j,2)),bb(j)
          end do
          close(iout)
          iout = iout + 1
        end if
      END DO
 10   format(i0,'.txt')
      END
It's hard to say for sure because you haven't given all the details yet, but your problem is quite possibly that you are using a 32 bit compiler producing 32 bit executables and you are simply running out of address space.
Although your operating system supports 64 bit address space, your 32 bit process is still limited to 32 bit addresses.
You have found a limit at 3300*78805*8 bytes, which is just under 2 GB, and this supports my theory.
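The size estimate can be reproduced with a few lines of Python. This sketch assumes 8 bytes per element, as the program's double precision declaration implies, and uses the program's own row dimension of 78805 for the full 30000-file case; the helper name array_bytes is made up for illustration.

```python
BYTES_PER_DOUBLE = 8  # double precision, as declared in the program
GIB = 2**30


def array_bytes(rows: int, cols: int) -> int:
    """Bytes needed for a rows x cols array of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


working = array_bytes(3300, 78805)   # the size that still runs
wanted = array_bytes(30000, 78805)   # the full 30000-file problem

print(f"working: {working / GIB:.2f} GiB, wanted: {wanted / GIB:.2f} GiB")
```

The working array sits just below the 2 GiB (2^31 byte) mark, while the full-size array would need roughly nine times as much memory.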
Whatever the cause of your immediate problem, your fundamental problem is that you appear to be loading everything into memory at once. I have not studied your algorithm closely, but on first inspection it seems likely that you could rearrange it to avoid having everything in memory at once.
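One possible rearrangement, sketched here in Python rather than Fortran, keeps only one running sum per step position instead of one row per file. It rests on an assumption drawn from the code in the question: files whose numbers differ by a multiple of ns are the ones added together. The function name and the toy data are made up for illustration.

```python
def stepped_means(columns, ns):
    """Average columns whose file numbers differ by a multiple of ns.

    columns yields one list of floats per file, in file order. Only ns
    running sums are held in memory, never the full files x rows table.
    """
    sums = [None] * ns
    counts = [0] * ns
    for i, col in enumerate(columns):
        k = i % ns  # which of the ns groups this file belongs to
        if sums[k] is None:
            sums[k] = list(col)
        else:
            sums[k] = [s + v for s, v in zip(sums[k], col)]
        counts[k] += 1
    return [
        [s / counts[k] for s in sums[k]]
        for k in range(ns)
        if sums[k] is not None
    ]


# Toy input: four "files" of two rows each, grouped in steps of two.
files = [[1.0, 2.0], [10.0, 20.0], [3.0, 4.0], [30.0, 40.0]]
means = stepped_means(iter(files), ns=2)
```

With ns = 300 this holds 300 columns in memory rather than 3300 or 30000, so memory use no longer grows with the number of files.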
