Writing Dask/XArray to NetCDF - Parallel IO - dask

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about ~30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF file is quite slow (~3 hours) and does not seem to run in parallel. It is unclear to me whether Xarray's "to_netcdf" function is supposed to support parallel writes. Currently, my approach is to write an empty NetCDF file with netCDF4 and then append the data from the Xarray object:
f_mosaic = 't1.nc'

meta = {'width': dat_f.shape[1],
        'height': dat_f.shape[2],
        'crs': rasterio.crs.CRS(init='epsg:' + fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
        'transform': aff_final,
        'count': dat_f.shape[0]}

with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
    # Create spatial dimensions
    y = t1.createDimension('y', meta['width'])
    x = t1.createDimension('x', meta['height'])
    wl_dim = t1.createDimension('wl', meta['count'])

    reflectance = t1.createVariable("reflectance", "int16", ("wl", "y", "x",), fill_value=null_val, zlib=True)
    reflectance.setncattr('grid_mapping', 'crs')

    crs = t1.createVariable('crs', 'c')
    crs.spatial_ref = meta['crs'].wkt
    crs.epsg_code = meta['crs'].to_string()
    crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())

dat_f.to_netcdf(path=f_mosaic, mode='a', format='NETCDF4', encoding={'reflectance': {'zlib': True}})
Overall, the question is: how can I write this data to a NetCDF4 file quickly? Does Dask/Xarray support parallel writes to NetCDF4? If so, what am I doing incorrectly?
Thanks!
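A hedged sketch, not part of the original question: one commonly suggested workaround for serial NetCDF writes is to let Dask write the chunked result to Zarr, which supports parallel writes, and only convert to NetCDF4 at the end if that format is required. The name dat_f and the 'reflectance' variable come from the snippet above; the rest is an assumption.

# Sketch only: assumes dat_f is the dask-backed xarray object from the question.
import xarray as xr

ds = dat_f if isinstance(dat_f, xr.Dataset) else dat_f.to_dataset(name='reflectance')
ds.to_zarr('t1.zarr', mode='w')                               # each dask chunk is written as its own task
xr.open_zarr('t1.zarr').to_netcdf('t1.nc', format='NETCDF4')  # one final, serial conversion pass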

Related

Reasons why swifter/dask/ray only use one core for an apply task?

I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger

RDLogger.DisableLog('rdApp.*')

def standardize_smiles(smiles):
    if smiles is None:
        return None
    try:
        mol = Chem.MolFromSmiles(smiles)
        # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # if many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # try to neutralize molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
        # note that no attempt is made at reionization at this step
        # nor at ionization at some pH (rdkit has no pKa calculator)
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogue, etc.
        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
        return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    except:
        return False
standardize_smiles('CCC')
'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd

N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC'] * N})
smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this works with the dummy data, it does not with the real data, where the strings are a bit more complicated than the ones in the dummy data.
It seems Swifter first needs some time to prepare the parallel execution and only uses one core, but then uses more cores. However, for the real data it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as Dask, Ray, Modin, and Swifter.
Is there something I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single computer (with multiple cores)? Or is there an issue with the RDKit library that makes the above function difficult to parallelize?
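A hedged sketch, not part of the original question: one way to check whether the apply itself can scale is to go through dask.dataframe explicitly with a process-based scheduler, so that the RDKit calls (which may not release the GIL) are not serialized on threads. The partition count and column names here are assumptions.

import pandas as pd
import dask.dataframe as dd

smiles_test = pd.DataFrame({'smiles': ['CCC'] * 1000})
ddf = dd.from_pandas(smiles_test, npartitions=8)              # roughly one partition per core
standardized = ddf['smiles'].apply(standardize_smiles, meta=('smiles', 'object'))
smiles_test['standardized_smiles'] = standardized.compute(scheduler='processes')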

Dask Delayed with xarray - compute() result is still delayed

I tried to perform some analysis (e.g. an average) over two datasets with Dask and xarray, and then compute the difference between the two results.
This is my code:
cluster = LocalCluster(n_workers=5, threads_per_worker=3, **worker_kwargs)

def calc_avg(path):
    mean = xr.open_mfdataset(path, combine='nested', concat_dim="time", parallel=True,
                             decode_times=False, decode_cf=False)['var'] \
        .sel(lat=slice(south, north), lon=slice(west, east)).mean(dim='time')
    return mean

def diff_(x, y):
    return x - y

p1 = "/path/to/first/multi-file/dataset"
p2 = "/path/to/second/multi-file/dataset"

a = dask.delayed(calc_avg)(p1)
b = dask.delayed(calc_avg)(p2)
total = dask.delayed(diff_)(a, b)
result = total.compute()
The execution time here is 17 s.
However, plotting the result (result.plot()) takes more than 1 min, so it seems that the calculation actually happens when trying to plot the result.
Is this the proper way to use Dask delayed?
You’re wrapping a call to xr.open_mfdataset, which is itself a dask operation, in a delayed function. So when you call result.compute, you’re executing the functions calc_avg and mean. However, calc_avg returns a dask-backed DataArray. So the 17 s compute only converts the scheduled delayed graph of calc_avg and mean into a scheduled dask.array graph of open_mfdataset and array operations; the real work is deferred until that array is evaluated, which here happens when you plot.
To resolve this, drop the delayed wrappers and simply use the dask.array xarray workflow:
a = calc_avg(p1) # this is already a dask array because
# calc_avg calls open_mfdataset
b = calc_avg(p2) # so is this
total = a - b # dask understands array math, so this "just works"
result = total.compute() # execute the scheduled job
See the xarray guide to parallel computing with dask for an introduction.
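A small hedged addition, not part of the original answer: on the LocalCluster from the question you can also keep the intermediate result in distributed memory with persist, so that a later plot does not trigger a full recomputation. The Client line is an assumption.

from dask.distributed import Client
client = Client(cluster)                           # attach to the LocalCluster defined earlier
total = (calc_avg(p1) - calc_avg(p2)).persist()    # start the computation in the background
result = total.compute()                           # quick once the persisted pieces are ready
result.plot()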

I'm using Dask to apply LabelingFunction using Snorkel on multiple datasets but it seems to take forever. Is this normal?

My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M rows) in CSV format, which I load into multiple Dask DataFrames.
Then I concatenate them all into one Dask DataFrame that I can feed to my Snorkel applier, which applies a bunch of labeling functions to each row of my DataFrame and returns a NumPy array with as many rows as there are in the DataFrame and as many columns as there are labeling functions.
The call to the Snorkel applier seems to take forever when I do that with 3 datasets (more than 2 days...). However, if I run the code with only the first dataset, the call takes around 2 hours (in that case, of course, I skip the concatenation step).
So I was wondering: how can this be? Should I change the number of partitions in the concatenated DataFrame? Or maybe I'm using Dask badly in the first place?
Here is the code I'm using:
from snorkel.labeling.apply.dask import DaskLFApplier
import dask.dataframe as dd
import numpy as np
import os
import time
from datetime import timedelta

start = time.time()

applier = DaskLFApplier(lfs)  # lfs are the functions that are going to be applied; one of them featurizes one of the columns of my DataFrame and applies a sklearn classifier (I set n_jobs to None when loading the model)

# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'})
    slices = None

# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'}) for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)

    # some useful bookkeeping so I know where to slice the final result and can assign each part to its dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))

# The call that lasts forever: I tested all the code above without this line on my 3 datasets and it runs perfectly fine
L_train = applier.apply(training_data)

end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end - start)))
If you need more info, I will try to provide as much as I can.
Thank you in advance for your help :)
It seems that by default the applier function uses processes, so it does not benefit from the additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)
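A possible follow-up, my own addition rather than part of the original answer: if the concatenated frame ends up with many small or very uneven partitions, repartitioning it before the apply can also help balance the work across workers.

# Assumption: training_data is the concatenated dask DataFrame from the question.
training_data = training_data.repartition(partition_size="100MB")
L_train = applier.apply(training_data, scheduler=client)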

Performance of multiple chunked datasets in the same HDF5 file?

Suppose (I am adding a code example below) that I create multiple chunked datasets in the same HDF5 file, and I start appending data to each dataset in random order. Since HDF does not know in advance what size to allocate for each dataset, I would think that each append operation (or perhaps a dataset buffer, once filled) is appended directly to the HDF5 file. If so, the data of each dataset would be interleaved with the data from the other datasets and spread out in chunks over the entire HDF5 file.
My question is: if the above description is more or less accurate, would this not adversely affect the performance of read operations done later from that file, and perhaps also the file size if more metadata records are required? And (corollary), if the option exists to store each dataset in a separate file, would it not be better to do so from the viewpoint of read performance?
Here is an example of how the HDF5 file that I describe at the beginning could be created:
import h5py, numpy as np

dtype1 = np.dtype([('t', 'f8'), ('T', 'f8')])
dtype2 = np.dtype([('q', 'i2'), ('Q', 'f8'), ('R', 'f8')])
dtype3 = np.dtype([('p', 'f8'), ('P', 'i8')])

with h5py.File('foo.hdf5', 'w') as f:
    dset1 = f.create_dataset('dset1', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype1))
    dset2 = f.create_dataset('dset2', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype2))
    dset3 = f.create_dataset('dset3', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype3))

    for _ in range(10):
        random_lengths = np.random.randint(low=1, high=10, size=3)

        d1 = np.ones((random_lengths[0],), dtype=dtype1)
        dset1[-1] = d1
        dset1.resize((dset1.shape[0] + 1,))

        d2 = np.ones((random_lengths[1],), dtype=dtype2)
        dset2[-1] = d2
        dset2.resize((dset2.shape[0] + 1,))

        d3 = np.ones((random_lengths[2],), dtype=dtype3)
        dset3[-1] = d3
        dset3.resize((dset3.shape[0] + 1,))
I know I could try it both ways (a single file with multiple datasets, or multiple files each with a single dataset) and time it, but the result might depend on the specifics of the example data used, and I would rather have a more general answer to this question, and possibly some insight into how HDF5/h5py work internally in this case.
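A hedged sketch, my own addition rather than part of the original question: with a recent h5py you can inspect where each dataset's chunks actually landed in the file, which shows directly whether the three datasets end up interleaved on disk.

# Assumes h5py >= 2.10 for the low-level chunk-inspection API.
import h5py

with h5py.File('foo.hdf5', 'r') as f:
    for name in ('dset1', 'dset2', 'dset3'):
        dsid = f[name].id
        offsets = sorted(dsid.get_chunk_info(i).byte_offset
                         for i in range(dsid.get_num_chunks()))
        print(name, offsets)   # interleaving shows up as the datasets' byte offsets alternating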

Large csv to parquet using Dask - OOM

I have 7 CSV files of 8 GB each and need to convert them to Parquet.
Memory usage goes up to 100 GB and I had to kill the process.
I tried with distributed Dask as well; the memory is limited to 12 GB, but no output is produced for a long time.
FYI, with traditional pandas using chunking plus a producer/consumer pattern I was able to convert in 30 minutes.
What am I missing for the Dask processing?
def ProcessChunk(df, ...):
    df.to_parquet()

for factfile in fArrFileList:
    df = dd.read_csv(factfile, blocksize="100MB",
                     dtype=fColTypes, header=None, sep='|', names=fCSVCols)
    result = ProcessChunk(df, output_parquet_file, chunksize, fPQ_Schema, fCSVCols, fColTypes)
Thanks all for the suggestions. map_partitions worked:
df = dd.read_csv(filename, blocksize="500MB",
                 dtype=fColTypes, header=None, sep='|', names=fCSVCols)
df.map_partitions(DoWork, output_parquet_file, chunksize, Schema, CSVCols, fColTypes).compute(num_workers=2)
But the same approach didn't work well with a Dask distributed local cluster; it only worked in local cluster mode when the CSV size was < 100 MB.
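A hypothetical sketch, my own addition: DoWork is not shown in the post, so the signature and body below are assumptions based on the map_partitions call above; the idea is simply that each partition writes its own Parquet file.

import os
import uuid
import pandas as pd

def DoWork(part, output_parquet_file, chunksize, Schema, CSVCols, fColTypes):
    # part is one pandas DataFrame partition; write it to its own file under the output directory
    fn = os.path.join(output_parquet_file, "part-{}.parquet".format(uuid.uuid4().hex))
    part.to_parquet(fn, index=False)
    return pd.Series([fn])  # return something small so compute() has a result to gather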
I had a similar problem and found that using Dask to split the CSVs into smaller Parquet files is very slow and will eventually fail. If you have access to a Linux terminal you can use parallel or split; for an example of their usage, check the answers here.
My workflow assumes your files are called file1.csv, ..., file7.csv and are stored in data/raw. I'm also assuming you are running the terminal commands from your notebook, which is why I'm adding the %%bash magic.
Create the folders data/raw_parts/part1/, ..., data/raw_parts/part7/:
%%bash
for year in {1..7}
do
    mkdir -p data/raw_parts/part${year}
done
For each file, run (in case you want to use parallel):
%%bash
cat data/raw/file1.csv | parallel --header : --pipe -N1000000 'cat > data/raw_parts/part1/file_{#}.csv'
Convert the files to Parquet.
First, create the output folders:
%%bash
for year in {1..7}
do
    mkdir -p data/processed/part${year}
done
Define a function to convert one CSV to Parquet:
import pandas as pd
import os
from dask import delayed, compute

# this can run in parallel
@delayed
def convert2parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".csv", ".parquet")
    df = pd.read_csv(fn)
    df.to_parquet(fn_out, index=False)
Get all the files you want to convert:
jobs = []
fldr_in = "data/raw_parts/"
fldr_out = "data/processed/"  # the output root created above
for (dirpath, dirnames, filenames) in os.walk(fldr_in):
    if len(filenames) > 0:
        jobs += [os.path.join(dirpath, fn) for fn in filenames]
Process them all in parallel:
%%time
to_process = [convert2parquet(job, fldr_in, fldr_out) for job in jobs]
out = compute(to_process)
