I tried to use Dask and xarray to run some analysis (e.g. an average) over two datasets and then compute the difference between the two results.
This is my code:
cluster = LocalCluster(n_workers=5, threads_per_worker=3, **worker_kwargs)
def calc_avg(path):
    ds = xr.open_mfdataset(path, combine='nested', concat_dim="time", parallel=True, decode_times=False, decode_cf=False)
    mean = ds['var'].sel(lat=slice(south, north), lon=slice(west, east)).mean(dim='time')
    return mean
def diff_(x, y):
    return x - y
p1 = "/path/to/first/multi-file/dataset"
p2 = "/path/to/second/multi-file/dataset"
a = dask.delayed(calc_avg)(p1)
b = dask.delayed(calc_avg)(p2)
total = dask.delayed(diff_)(a, b)
result = total.compute()
The execution time here is 17 s.
However, plotting the result (result.plot()) takes more than 1 minute, so it seems that the calculation actually happens when I try to plot the result.
Is this the proper way to use Dask delayed?
You’re wrapping a call to xr.open_mfdataset, which is itself a dask operation, in a delayed function. So when you call result.compute(), you’re executing the delayed functions calc_avg and diff_. However, calc_avg returns a dask-backed DataArray, which is itself still lazy. So yes, the 17 s step only converts the delayed graph of calc_avg and diff_ into a scheduled dask.array graph of open_mfdataset and array operations; the array computation itself runs later, when you plot.
To resolve this, drop the delayed wrappers and simply use the dask.array xarray workflow:
a = calc_avg(p1) # this is already a dask array because
# calc_avg calls open_mfdataset
b = calc_avg(p2) # so is this
total = a - b # dask understands array math, so this "just works"
result = total.compute() # execute the scheduled job
See the xarray guide to parallel computing with dask for an introduction.
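To see why the timing was misleading, here is a minimal sketch that uses a plain dask array in place of the xarray datasets (an assumption, so that it runs without any files; lazy_mean is a hypothetical stand-in for calc_avg). Computing a delayed function that returns a dask collection hands back that collection, still lazy, and a second compute is what actually produces numbers:

```python
import dask
import dask.array as da
import numpy as np


def lazy_mean(scale):
    # Hypothetical stand-in for calc_avg: builds a lazy dask array
    # and returns it without computing anything.
    data = da.ones((1000, 1000), chunks=(250, 250)) * scale
    return data.mean(axis=0)


# Pattern from the question: the function is wrapped in dask.delayed.
wrapped = dask.delayed(lazy_mean)(3.0)
inner = wrapped.compute()   # runs lazy_mean, but the result is still a dask array
values = inner.compute()    # only now is the mean actually evaluated

# Pattern from the answer: no delayed wrapper, one compute for the whole graph.
direct = (lazy_mean(3.0) - lazy_mean(1.0)).compute()

print(type(inner).__name__, type(values).__name__, direct[:3])
```

In the question, the second evaluation was triggered implicitly by result.plot(), which would explain why the plot appeared to be the slow step.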
Related
I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
def standardize_smiles(smiles):
    if smiles is None:
        return None
    # try:
    mol = Chem.MolFromSmiles(smiles)
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)
    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa calculator)
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
    te = rdMolStandardize.TautomerEnumerator()  # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    # except:
    #     return False
standardize_smiles('CCC')
'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd
N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC']*N})
smiles_test
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this WORKS with the dummy data, it does not with the real data, which looks like this:
The strings are a bit more complicated than the ones in the dummy data.
It seems that swifter first needs some time to prepare the parallel execution and uses only one core, but then uses more cores. However, for the real data, it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as dask, ray and modin.
Is there something I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single computer (with multiple cores)? Or is there an issue with the RDKit library I am using that makes it difficult to parallelize the above function?
I need to compute the difference between two datasets (two daily variables resampled on a monthly basis) with Dask and xarray. Here is my code:
def diff(path_1, path_2):
    import xarray as xr
    max_v = xr.open_mfdataset(path_1, combine='by_coords', concat_dim="time", parallel=True)['variable_1'].resample({'time': '1M'}).max()
    min_v = xr.open_mfdataset(path_2, combine='by_coords', concat_dim="time", parallel=True)['variable_2'].resample({'time': '1M'}).min()
    return (max_v - min_v).compute()
future = client.submit(diff, path_1, path_2)
diff = client.gather(future)
I also tried this:
%%time
def max_var(path):
    import xarray as xr
    multi_file_dataset = xr.open_mfdataset(path, combine='by_coords', concat_dim="time", parallel=True)
    max_v = multi_file_dataset['variable_1'].resample(time='1M').max(dim='time')
    return max_v.compute()
def min_var(path):
    import xarray as xr
    multi_file_dataset = xr.open_mfdataset(path, combine='by_coords', concat_dim="time", parallel=True)
    min_v = multi_file_dataset['variable_2'].resample(time='1M').min(dim='time')
    return min_v.compute()
futures = []
future = client.submit(max_var, path1)
futures.append(future)
future = client.submit(min_var, path2)
futures.append(future)
results = client.gather(futures)
diff = results[0] - results[1]
But I noticed that the computation becomes very slow in the final steps, getitem-nanmax and getitem-nanmin (at 1974 out of 1980, for example).
Here is the cluster configuration:
cluster = SLURMCluster(walltime='1:00:00',cores=5,memory='5GB')
cluster.scale(jobs=10)
Each dataset consists of several files (total size = 7 GB).
Is there a better way to implement this computation?
Thanks
I'm not 100% sure this works in your case, but without an MWE it's difficult to do much better. My suspicion is that the .compute() called by xarray might conflict with client.submit, because the computation now happens on a worker, and I'm not sure it can correctly distribute the work among its peers. One way out is to move the computations into the main script (since xarray integrates with dask in the background), so perhaps this will work:
import xarray as xr
max_v=xr.open_mfdataset(path_1, combine='by_coords', concat_dim="time", parallel=True, chunks={'time': 10})['variable_1'].resample({'time': '1M'}).max()
min_v=xr.open_mfdataset(path_2, combine='by_coords', concat_dim="time", parallel=True, chunks={'time': 10})['variable_2'].resample({'time': '1M'}).min()
diff_result = (max_v-min_v).compute()
Below is an MWE on a different dataset:
import xarray as xr
# chunks option will create dask array
ds = xr.tutorial.open_dataset('rasm', decode_times=True, chunks={'time': 10})
# these are lazy calculations
max_v = ds['Tair'].resample({'time': '1M'}).max()
min_v = ds['Tair'].resample({'time': '1M'}).min()
# this will use dask scheduler in the background
diff_result = (max_v-min_v).compute()
# since the data refers to the same variable, all the results will be either 0 or `nan` (if the variable was not available in that time/x/y combination)
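If downloading the tutorial dataset is not an option, the same lazy pattern can be checked offline on synthetic data. This is only a sketch under assumptions: the daily series below is made up, and the "MS" (month start) frequency alias is used instead of '1M' because it is accepted by both older and newer pandas versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series for Jan-Mar 2000: the values are simply 0, 1, 2, ...
time = pd.date_range("2000-01-01", "2000-03-31", freq="D")
values = np.arange(time.size, dtype="float64")

# .chunk() turns the data into a dask array, so everything below stays lazy.
daily = xr.DataArray(values, coords={"time": time}, dims="time").chunk({"time": 10})

monthly_max = daily.resample(time="MS").max()
monthly_min = daily.resample(time="MS").min()
lazy_diff = monthly_max - monthly_min   # no computation has happened yet

# A single compute() evaluates the whole graph with the active dask scheduler.
diff_result = lazy_diff.compute()
print(diff_result.values)
```

Because the values increase by one per day, each monthly difference is the number of days in that month minus one, which makes the result easy to verify by hand.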
My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M entries) in CSV format, which I load into multiple Dask DataFrames.
Then I concatenate them into a single Dask DataFrame that I feed to my Snorkel applier, which applies a set of labeling functions to each row of the DataFrame and returns a NumPy array with as many rows as the DataFrame and as many columns as there are labeling functions.
The call to the Snorkel applier seems to take forever with 3 datasets (more than 2 days...). However, if I run the code with only the first dataset, the call takes around 2 hours. Of course, I skip the concatenation step in that case.
So I was wondering how this can be. Should I change the number of partitions in the concatenated DataFrame? Or maybe I'm using Dask badly in the first place?
Here is the code I'm using:
from snorkel.labeling.apply.dask import DaskLFApplier
import dask.dataframe as dd
import numpy as np
import os
import time
from datetime import timedelta
start = time.time()
# lfs are the functions that are going to be applied; one of them featurizes one of the columns of my DataFrame and applies a sklearn classifier (I set n_jobs to None when loading the model)
applier = DaskLFApplier(lfs)
# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'})
    slices = None
# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'}) for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)
    # some useful things I do to know where to slice my final result and be sure I can assign each part to each dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))
# The call that lasts forever: I tested all the code above without that line on my 3 datasets and it runs perfectly fine
L_train = applier.apply(training_data)
end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end-start)))
If you need more information, I will try to provide as much as I can.
Thanks in advance for your help :)
It seems that the applier's apply method uses the processes scheduler by default, so it does not benefit from any additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)
I have a Dask dataframe that is grouped, and then a function is applied to each group. That function uses some pre-calculated metrics from another dataframe as part of its work.
In the actual code, all the data is in parquet datasets loaded from S3 and run on a distributed Dask cluster. Here's a simplified example using csv files.
profiles.csv
company,stat1
1000,10
2000,20
catalog.csv
company,desc
1000,ABC
1000,def
2000,GHI
2000,jkl
code
from dask import dataframe as ddf
profiles_df = ddf.read_csv("profiles.csv").set_index("company")
catalog_df = ddf.read_csv("catalog.csv").set_index("company")
def refine(group_df):
    profile = profiles_df.loc[group_df.name].compute()
    group_df["desc_"] = group_df["desc"].apply(lambda t: f"{t}-{int(profile.stat1)}")
    return group_df
catalog_grouped_df = catalog_df.groupby("company")
refined_catalog_meta = catalog_df._meta.copy()
refined_catalog_meta["desc_"] = None
refined_catalog_df = catalog_grouped_df.apply(refine, meta=refined_catalog_meta)
refined_catalog_df.compute()
This works, except that the source profiles_df csv/parquet is read over and over again, once for each invocation of refine(group_df). How do I improve this so that profiles_df is read once, and only the row relevant to each group is passed to or accessed by the refine function?
Update
I've managed to stop the repeated reads from the source Parquet datasets by reading the profiles_df and scattering it. Something like this
from dask import dataframe as ddf
from dask.distributed import default_client
profiles_df = ddf.read_csv("profiles.csv").set_index("company")
catalog_df = ddf.read_csv("catalog.csv").set_index("company")
def refine(group_df):
    profile = profiles_df.loc[group_df.name].compute()
    group_df["desc_"] = group_df["desc"].apply(lambda t: f"{t}-{int(profile.stat1)}")
    return group_df
profiles_df = default_client().scatter(profiles_df.compute(), broadcast=True)
catalog_grouped_df = catalog_df.groupby("company")
refined_catalog_meta = catalog_df._meta.copy()
refined_catalog_meta["desc_"] = None
refined_catalog_df = catalog_grouped_df.apply(refine, meta=refined_catalog_meta)
refined_catalog_df.compute()
…
The main downside is that profiles_df is being read to the calling client and then sent to the scheduler. Is there a way I can get the scheduler or a worker to do the read and scatter?
I'm trying to persist 1.5 million images to a dask cluster as a dask array, and then get some summary stats. I'm following an image processing tutorial from @mrocklin's blog and have edited my script to be a minimal reproducible example:
import time
import dask
import dask.array as da
import numpy as np
from distributed import Client
client = Client()
def get_imgs(num_imgs):
    def get():
        arr = np.random.randint(2000, size=(3, 120, 120)).flatten()
        return arr
    delayed_get = dask.delayed(get)
    return [da.from_delayed(delayed_get(), shape=(3 * 120 * 120,), dtype=np.uint16) for num in range(num_imgs)]
imgs = get_imgs(1500000)
imgs = da.stack(imgs, axis=0)
client.persist(imgs)
The persist step causes my jupyter process to crash. Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory? So I use scatter instead:
start = time.time()
imgs_future = client.scatter(imgs, broadcast=True)
print(time.time() - start)
But the jupyter process crashes, or the network connection to the scheduler gets lost.
So I tried breaking up the scatter step:
st = time.time()
chunk_size = 50000
chunk_num = 0
chunk_futures = []
start = 0
end = start + chunk_size
is_last_chunk = False
for dataset in client.list_datasets():
    client.unpublish_dataset(dataset)
while True:
    cst = time.time()
    chunk = imgs[start:end]
    cst1 = time.time()
    if start == 0:
        print('loaded chunk in', cst1 - cst)
    if len(chunk) == 0:
        break
    chunk_future = client.scatter(chunk)
    chunk_futures.append(chunk_future)
    dataset_name = "chunk_{}".format(chunk_num)
    client.publish_dataset(**{dataset_name: chunk_future})
    if start == 0:
        print('submitted chunk in', time.time() - cst1)
    start = end
    if is_last_chunk:
        break
    chunk_num += 1
    end = start + chunk_size
    if end > len(imgs):
        is_last_chunk = True
        end = len(imgs)
    if start == end:
        break
    if chunk_num % 5 == 0:
        print('chunk_num', chunk_num, 'start', start)
print('completed in', time.time() - st)
But this approach results in the connection being lost as well. What's the recommended approach to persisting a large image dataset in a cluster in an asynchronous way?
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.
Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory?
The best way to find out if this is the case is by using Dask's dashboard.
https://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
I'm following an image processing tutorial from @mrocklin's blog
That post is somewhat old. You may also want to take a look at this more recent post:
https://blog.dask.org/2019/06/20/load-image-data
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.
Yes, that might be a problem. If you can keep the number of tasks down that would be nice.
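As a rough illustration of the batching idea, the sketch below has each delayed call produce a whole batch of images instead of a single one, so the number of tasks drops by a factor of batch_size. The names get_batch and batch_size and the small sizes are assumptions chosen so the example runs quickly; the random data mirrors the example in the question.

```python
import dask
import dask.array as da
import numpy as np

IMG_SIZE = 3 * 120 * 120  # flattened image length, as in the question


def get_batch(batch_size, seed):
    # Hypothetical loader: returns a whole batch of flattened images at once.
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2000, size=(batch_size, IMG_SIZE), dtype=np.uint16)


def get_imgs(num_imgs, batch_size):
    blocks = []
    for i, start in enumerate(range(0, num_imgs, batch_size)):
        n = min(batch_size, num_imgs - start)  # the last batch may be smaller
        block = da.from_delayed(
            dask.delayed(get_batch)(n, i),
            shape=(n, IMG_SIZE),
            dtype=np.uint16,
        )
        blocks.append(block)
    # One chunk (and one loading task) per batch rather than per image.
    return da.concatenate(blocks, axis=0)


imgs = get_imgs(num_imgs=100, batch_size=25)
print(imgs.shape, imgs.numblocks)
```

With 1.5 million images, a batch size in the thousands would bring the graph down to hundreds of tasks; the right value depends on worker memory and is best tuned while watching the dashboard.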