Difference between two datasets with Dask and Xarray

I need to compute the difference between two datasets (two daily variables resampled on a monthly basis) with Dask and Xarray. Here is my code:
def diff(path_1, path_2):
    import xarray as xr
    max_v = xr.open_mfdataset(path_1, combine='by_coords', concat_dim="time", parallel=True)['variable_1'].resample({'time': '1M'}).max()
    min_v = xr.open_mfdataset(path_2, combine='by_coords', concat_dim="time", parallel=True)['variable_2'].resample({'time': '1M'}).min()
    return (max_v - min_v).compute()

future = client.submit(diff, path_1, path_2)
diff = client.gather(future)
I also tried this:
%%time
def max_var(path):
    import xarray as xr
    multi_file_dataset = xr.open_mfdataset(path, combine='by_coords', concat_dim="time", parallel=True)
    max_v = multi_file_dataset['variable_1'].resample(time='1M').max(dim='time')
    return max_v.compute()

def min_var(path):
    import xarray as xr
    multi_file_dataset = xr.open_mfdataset(path, combine='by_coords', concat_dim="time", parallel=True)
    min_v = multi_file_dataset['variable_2'].resample(time='1M').min(dim='time')
    return min_v.compute()

futures = []
future = client.submit(max_var, path1)
futures.append(future)
future = client.submit(min_var, path2)
futures.append(future)
results = client.gather(futures)
diff = results[0] - results[1]
But I noticed that the computation becomes very slow in the final steps of getitem-nanmax and getitem-nanmin (e.g. at 1974 out of 1980 tasks).
Here is the cluster configuration:
cluster = SLURMCluster(walltime='1:00:00',cores=5,memory='5GB')
cluster.scale(jobs=10)
Each dataset consists of several files; total size = 7 GB.
Is there a better way to implement this computation?
Thanks

Not 100% sure this works in your case, but without an MWE it's difficult to do much better. My suspicion is that the .compute() used by xarray might conflict with client.submit, because the computation is now happening on a worker and I'm not sure it can correctly distribute the work among its peers (but this is only a suspicion). One way out is to move the computations into the main script, since xarray will integrate with dask in the background, so perhaps this will work:
import xarray as xr
max_v=xr.open_mfdataset(path_1, combine='by_coords', concat_dim="time", parallel=True, chunks={'time': 10})['variable_1'].resample({'time': '1M'}).max()
min_v=xr.open_mfdataset(path_2, combine='by_coords', concat_dim="time", parallel=True, chunks={'time': 10})['variable_2'].resample({'time': '1M'}).min()
diff_result = (max_v-min_v).compute()
Below is an MWE on a different dataset:
import xarray as xr
# chunks option will create dask array
ds = xr.tutorial.open_dataset('rasm', decode_times=True, chunks={'time': 10})
# these are lazy calculations
max_v = ds['Tair'].resample({'time': '1M'}).max()
min_v = ds['Tair'].resample({'time': '1M'}).min()
# this will use dask scheduler in the background
diff_result = (max_v-min_v).compute()
# since the data refers to the same variable, all the results will be either 0 or `nan` (if the variable was not available in that time/x/y combination)

Related

Reasons why swifter/dask/ray only use one core for an apply task?

I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

def standardize_smiles(smiles):
    if smiles is None:
        return None
    try:
        mol = Chem.MolFromSmiles(smiles)
        # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # if many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # try to neutralize the molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
        # note that no attempt is made at reionization at this step,
        # nor at ionization at some pH (rdkit has no pKa calculator);
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogues, etc.
        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
        return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    except:
        return False

standardize_smiles('CCC')
# 'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd
N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC'] * N})
smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this WORKS with the dummy data, it does not work with the real data, where the strings are a bit more complicated than in the dummy example.
It seems that swifter first needs some time to prepare the parallel execution and only uses one core, but then uses more cores. However, for the real data it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as dask, ray, modin and swifter.
Is there something I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single machine (with multiple cores)? Or is there an issue with the RDKit library that I am using that makes it difficult to parallelize the above function?

Dask dataframe parallel task

I want to create features (additional columns) from a dataframe, and I have the following structure for many functions.
Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code below.
However, I get the error message concurrent.futures._base.CancelledError, and many times I get the warning: distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
I understand that the object I am appending to delay is very large (it works OK when I use the commented-out df), which is why the program crashes, but is there a better way of doing it?
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask

def main():
    # df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
    df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000000), "col2": np.random.randint(101, 200, 100000000), "col3": np.random.uniform(0, 4, 100000000)})
    ddf = dd.from_pandas(df, npartitions=100)
    ddf = ddf.set_index("col1")
    delay = []

    def create_col_sth():
        group = ddf.groupby("col1")["col3"]

        @dask.delayed
        def small_fun(lag):
            return f"col_{lag}", group.transform(lambda x: x.shift(lag), meta=('x', 'float64')).apply(lambda x: np.log(x), meta=('x', 'float64'))

        for lag in range(5):
            x = small_fun(lag)
            delay.append(x)

    create_col_sth()
    delayed = dask.compute(*delay)
    for data in delayed:
        ddf[data[0]] = data[1]
    ddf.to_parquet("test", engine="fastparquet")

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main()
Not sure if this will resolve all of your issues, but generally you don't need to (and shouldn't) mix delayed and dask.dataframe operations like this. Additionally, you shouldn't pass large data objects into delayed functions through closures, like group in your example. Instead, include them as explicit arguments, or in this case, don't use delayed at all and use dask.dataframe native operations or in-memory operations with dask.dataframe.map_partitions (a small map_partitions sketch follows the rewritten code below).
Implementing these, I would rewrite your main function as follows:
df = pd.DataFrame({
    "col1": np.random.randint(1, 100, 100000000),
    "col2": np.random.randint(101, 200, 100000000),
    "col3": np.random.uniform(0, 4, 100000000),
})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")

group = ddf.groupby("col1")["col3"]

# directly assign the dataframe operations as columns
for lag in range(5):
    ddf[f"col_{lag}"] = (
        group
        .transform(lambda x: x.shift(lag), meta=('x', 'float64'))
        .apply(lambda x: np.log(x), meta=('x', 'float64'))
    )

# this triggers the operation implicitly - no need to call compute
ddf.to_parquet("test", engine="fastparquet")
After long periods of frustration with Dask, I think I've cracked the holy grail of refactoring pandas transformations wrapped with Dask.
Learning points:
Index intelligently. If you are grouping by or merging, consider indexing the columns you use for those operations.
Partition and repartition intelligently. If you have a dataframe of 10k rows and another of 1M rows, they should naturally have different partitions.
Don't use dask dataframe transformation methods, except for example merge. The others should be pandas code wrapped in map_partitions.
Don't accumulate too-large graphs, so consider saving after, for example, indexing or after making a complex transformation.
If possible, filter the dataframe and work with a smaller subset; you can always merge this back into the bigger dataset later.
If you are working on your local machine, set the memory limits within the boundaries of your system specifications. This point is very important. In the example below I create one million rows of 3 columns: one is int64 and two are float64, which are 8 bytes each, 24 bytes per row in total, giving roughly 24 million bytes.
import gc

import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask

# https://stackoverflow.com/questions/52642966/repartition-dask-dataframe-to-get-even-partitions
def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.

    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)

def main(client):
    size = 1000000
    df = pd.DataFrame({"col1": np.random.randint(1, 10000, size), "col2": np.random.randint(101, 20000, size), "col3": np.random.uniform(0, 100, size)})

    # Select appropriate partitions
    ddf = dd.from_pandas(df, npartitions=500)
    del df
    gc.collect()

    # If you want to group by a certain column, it is always best if that column is indexed
    ddf = ddf.set_index("col1")
    ddf = _rebalance_ddf(ddf)
    print(ddf.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf.memory_usage(deep=True).sum().compute())

    # Always persist your data to prevent big task graphs; if you omit this step, processing will actually fail
    ddf.to_parquet("test", engine="fastparquet")
    ddf = dd.read_parquet("test")

    # Dummy code to create a dataframe to be merged based on col1
    ddf2 = ddf[["col2", "col3"]]
    ddf2["col2/col3"] = ddf["col2"] / ddf["col3"]
    ddf2 = ddf2.drop(columns=["col2", "col3"])

    # Repartition the data
    ddf2 = _rebalance_ddf(ddf2)
    print(ddf2.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf2.memory_usage(deep=True).sum().compute())

    def mapped_fun(data):
        for lag in range(5):
            data[f"col_{lag}"] = data.groupby("col1")["col3"].transform(lambda x: x.shift(lag)).apply(lambda x: np.log(x))
        return data

    # Do the groupby transformation in pandas, wrapped with map_partitions;
    # using the Dask dataframe functions for this leads to a variety of issues.
    ddf = ddf.map_partitions(mapped_fun)

    # Additionally, you can merge ddf with ddf2, but only on an indexed column,
    # otherwise you run into a variety of issues
    ddf = ddf.merge(ddf2, on=['col1'], how="left")
    ddf.to_parquet("final", engine="fastparquet")

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main(client)

Dask Delayed with xarray - compute() result is still delayed

Using Dask and xarray, I tried to perform some analysis (e.g. an average) over two datasets and then compute the difference between the two results.
This is my code:
cluster = LocalCluster(n_workers=5, threads_per_worker=3, **worker_kwargs)

def calc_avg(path):
    mean = xr.open_mfdataset(path, combine='nested', concat_dim="time", parallel=True, decode_times=False, decode_cf=False)['var'].sel(lat=slice(south, north), lon=slice(west, east)).mean(dim='time')
    return mean

def diff_(x, y):
    return x - y

p1 = "/path/to/first/multi-file/dataset"
p2 = "/path/to/second/multi-file/dataset"

a = dask.delayed(calc_avg)(p1)
b = dask.delayed(calc_avg)(p2)
total = dask.delayed(diff_)(a, b)
result = total.compute()
The execution time here is 17 s.
However, plotting the result (result.plot()) takes more than 1 min, so it seems that the calculation actually happens when trying to plot the result.
Is this the proper way to use Dask delayed?
You're wrapping a call to xr.open_mfdataset, which is itself a dask operation, in a delayed function. So when you call result.compute, you're executing the functions calc_avg and mean. However, calc_avg returns a dask-backed DataArray, so the 17 s task merely converts the scheduled delayed graph of calc_avg and mean into a scheduled dask.array graph of open_mfdataset and the array operations; the actual data work has not happened yet.
To resolve this, drop the delayed wrappers and simply use the dask.array xarray workflow:
a = calc_avg(p1) # this is already a dask array because
# calc_avg calls open_mfdataset
b = calc_avg(p2) # so is this
total = a - b # dask understands array math, so this "just works"
result = total.compute() # execute the scheduled job
See the xarray guide to parallel computing with dask for an introduction.
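As a quick sanity check (this snippet is mine, not from the original answer), you can see that the question's version returns a result that is still dask-backed: after total.compute() on the delayed object, the underlying data is a dask array, and only a second compute (or the plot) actually loads the values:

# result = total.compute() from the question's delayed version
print(type(result.data))   # dask.array.core.Array -> still lazy
print(result.chunks)       # chunk layout; would be None for a numpy-backed DataArray

result = result.compute()  # this is where the real work (the ~1 min) happens
print(type(result.data))   # numpy.ndarray -> now materialized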

I'm using Dask to apply LabelingFunction using Snorkel on multiple datasets but it seems to take forever. Is this normal?

My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M entries) in csv format, which I load into multiple Dask Dataframes.
Then I concatenate them all into one Dask Dataframe that I can feed to my Snorkel Applier, which applies a bunch of Labeling Functions to each row of my Dataframe and returns a numpy array with as many rows as there are in the Dataframe and as many columns as there are Labeling Functions.
The call to Snorkel Applier seems to take forever when I do that with 3 datasets (more than 2 days...). However if I just run the code with only the first dataset, the call takes around 2 hours. Of course I don't do the concatenation step.
So I was wondering: how can this be? Should I change the number of partitions in the concatenated Dataframe? Or maybe I'm using Dask badly in the first place?
Here is the code I'm using:
from snorkel.labeling.apply.dask import DaskLFApplier
import dask.dataframe as dd
import numpy as np
import os
import time
from datetime import timedelta

start = time.time()

# lfs are the labeling functions that are going to be applied; one of them featurizes one of the
# columns of my Dataframe and applies a sklearn classifier (I set n_jobs to None when loading the model)
applier = DaskLFApplier(lfs)

# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'})
    slices = None

# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'}) for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)
    # some useful things I do to know where to slice my final result
    # and be sure I can assign each part to each dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))

# The call that lasts forever: I tested all the code above without this line on my 3 datasets and it runs perfectly fine
L_train = applier.apply(training_data)

end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end - start)))
If you need more info, I will try to provide as much as I can.
Thanks in advance for your help :)
It seems that by default the applier function uses processes, so it does not benefit from the additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)

Debugging very slow `from_delayed` call

I have a long-ish chained dask pipeline, and one of the last bits is a string of dask.dataframe.from_delayed calls like the one below. That line is extremely slow: many minutes per call. It takes 1-2 hours just to set up the pipeline.
When I debug the problem, I pull out the relevant code and pass in arrays with the same shape. It runs instantly.
Is this because my real life pipeline has an upstream graph that it's contending with? My solution is going to be to split my pipeline into two and see if that solves it. Anything else that could be going on here?
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
image = da.zeros((100, 8192, 8192), chunks=(100,256,256))
labels = da.zeros((100, 8192, 8192), chunks=(100,256,256))
image_chunks = image.to_delayed().ravel()
labels_chunks = labels.to_delayed().ravel()
results = []
for image_chunk, labels_chunk in zip(image_chunks, labels_chunks):
    offsets = np.array(image_chunk.key[1:]) * np.array(image.chunksize)
    result = dask.delayed(lambda x, y, z: None)(image_chunk, labels_chunk, offsets)
    results.append(result)
df_meta = pd.DataFrame(columns=['a', 'b'], dtype=np.float64)
df_meta = df_meta.astype({'a': np.int64})
# This line takes forever in actual use, but is instantaneous in this example.
df = dd.from_delayed(results, meta=df_meta)
The code that you have posted works great for me (as you predicted). Without knowing more I don't know how to help. In your situation I would slowly add back in parts of your actual pipeline and see when things get slow. That should help you to isolate the problem.
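One cheap check while adding pipeline stages back (a sketch of my own, not from the answer above): look at how much graph each delayed object is dragging along, since a very large upstream graph attached to every element of results is a common reason the from_delayed setup itself becomes slow.

# each delayed object carries its whole upstream graph; if this count is huge,
# dd.from_delayed has a lot of graph to assemble for every partition
graph = results[0].__dask_graph__()
print(len(graph), "tasks behind a single partition")

# visualizing one partition's graph (requires graphviz) can reveal an unexpectedly deep upstream
results[0].visualize(filename="one_partition_graph.svg")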
