How do I get xarray.interp() to work in parallel? - dask

I'm using xarray.interp on a large 3D DataArray (weather data: lat, lon, time) to map the values (wind speed) to new values based on a discrete mapping function f.
The interpolation method seems to only utilise one core for computation, making the process horibly inefficient. I can not figure out how to make xarray to use more than one core for this task.
I did monitor the computation via htop and a dask dashboard for xarray.interp.
htop only shows one core to be in use, the dashboard doesn't show any activity in any of the workers. The only dask activity I can observe is from loading the netcdf data file from disk. If I preload the data using .load(), this dask activity is gone.
I also tried using using a scipy.interpolate.interp1d function with xarray.apply_ufunc() to achieve the equivalent result I am aiming for but did not observe any parallel utilisation (htop) or activity (dask dashboard) either.
The fastest approach for me right now is using numpy.interp and then recasting it back to a xr.DataArray with the coordinates of the original DataArray. But that's also not parallelised and only some percent faster.
In the following MWE I don't see any dask activity after the da.load() statement in block 4.
edit:
The code has to be run in the separate blocks 1 - 4 when evaluting using e.g. htop. Because load() is causing multi-core activity and happens either explicitly (block 2) or implicitly (triggered by 4), it's easy to missattribute the multi-core activity to .interp() when its caused by data loading if you run the script as a whole.
# 1: For the dask dashboard
from dask.distributed import Client
client = Client()
display(client)
import xarray as xr
import numpy as np
da = xr.tutorial.open_dataset("air_temperature", chunks={})['air']
# 2: Preload data into memory
da.load()
# 3: Dummy interpolation function
xp = np.linspace(0,400,21)
fp = -1*(xp-300)**2
xr_interp_da = xr.DataArray(fp, [('xp', xp)], name='interpolation function')
# 4: I expect this to run in parallel but it does not
f = xr_interp_da.interp({'xp':da})

Related

Understanding dask cudf object lifecycle

I want to understand the efficient memory management process for Dask objects. I have setup a Dask GPU cluster and I am able to execute tasks that runs across the cluster. However, with the dask objects, especially when I run the compute function, the process that runs on the GPU is quickly growing by using more and more of the memory and soon I am getting "Run out of memory Error".
I want to understand how I can release the memory from dask object once I am done with using them. In this following example, after the compute function how can I release that object. I am running the following code for a few times. The memory keeps growing in the process where it is running
import cupy as cp
import pandas as pd
import cudf
import dask_cudf
nrows = 100000000
df2 = cudf.DataFrame({'a': cp.arange(nrows), 'b': cp.arange(nrows)})
ddf2 = dask_cudf.from_cudf(df2, npartitions=5)
ddf2['c'] = ddf2['a'] + 5
ddf2
ddf2.compute()
Please check this blog post by Nick Becker. you may want to set up a client first.
You read into cudf first, which you shouldn't do as practice. You should read directly into dask_cudf.
When dask_cudf computes, the result returns as a cudf dataframe, which MUST fit into the remaining memory of your GPU. Chances are reading into cudf first may have taken a chunk of your memory.
Then, you can delete a dask object when you are done using client.cancel().

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
d = []
for lot in lots:
lot_data = data[data["LOTID"]==lot]
trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
d.append(trmat)
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)
Visualized graph of the operations
General rule: if your data comfortable fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processed; however using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.

Parallelization on cluster dask

I'm looking for the best way to parallelize on a cluster the following problem. I have several files
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoints with respect to the key I want to use to groupby, that is if a set of keys is in file1.csv any of these keys has an item in any other files.
In one side I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so in a sort of
delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores it will be ideal use all of them. For every single file I can use this hacky way
import numpy as np
import multiprocessing as mp
cores = mp.cpu_count() #Number of CPU cores on your system
partitions = cores #Define as many partitions as you want
def parallelize(data, func):
data_split = np.array_split(data, partitions)
pool = mp.Pool(cores)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
data = parallelize(data, f);
And, again, I'm not sure if there is an efficent dask way to do so.
you could use a Client (will run in multi process by default) and read your data with a certain blocksize. you can get the amount of workers (and number of cores per worker) with the ncores method and then calculate the optimal blocksize.
however according to the documantaion blocksize is by default "computed based on available physical memory and the number of cores."
so i think the best way to do it is a simple:
from distributed import Client
# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that use map_partitions and do the gorupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
don't use compute because it will result in a single pandas.dataframe, instead use a dask output method to keep the entire process parallel and larger then ram compatible.

How to use all the cpu cores using Dask?

I have a pandas series with more than 35000 rows. I want to use dask make it more efficient. However, I both the dask code and the pandas code are taking the same time.
Initially "ser" is pandas series and fun1 and fun2 are basic functions performing pattern match in individual rows of series.
Pandas:
ser = ser.apply(fun1).apply(fun2)
Dask:
ser = dd.from_pandas(ser, npartitions = 16)
ser = ser.apply(fun1).apply(fun2)
On checking the status of cores of cpu, I found that not all the cores were getting used. Only one core was getting used to 100%.
Is there any method to make the series code faster using dask or utilize all the cores of cpu while performing Dask operations in series?
See http://dask.pydata.org/en/latest/scheduler-overview.html
It is likely that the functions that you are calling are pure-python, and so claim the GIL, the lock which ensures that only one python instruction is being carried out at a time within a thread. In this case, you will need to run your functions in separate processes to see any parallelism. You could do this by using the multiprocess scheduler
ser = ser.apply(fun1).apply(fun2).compute(scheduler='processes')
or by using the distributed scheduler (which works fine on a single machine, and actually comes with some next-generation benefits, such as the status dashboard); in the simplest, default case, creating a client is enough:
client = dask.distributed.Client()
but you should read the docs

dask.bag, how should I efficiently run several computations over the same data

I'm processing a fair amount of data, and the bulk of the time is spent loading the data and parsing the json/whatever. I'd like to collect simple statistics over the whole dataset using a single scan.
I'd hoped I could use the graph simplification in compute using the following pattern:
parsed = read_text(files).map(parsing)
example_stat_future = parsed.map(foo).frequencies()
another_stat_future = parsed.map(bar).sum()
etc.
example_stat, another_stat = compute(example_stat_future, another_stat_future)
but I see extreme slowdowns when trying that. Here's my example code:
from json import loads, dumps
from time import time
import dask.bag as db
print("Setup some dummy data")
for partition in range(10):
with open("/tmp/issue.%d.jsonl" % partition, "w") as f_out:
for i in range(100000):
f_out.write(dumps({"val": i, "doubleval": i * 2}) + "\n")
print("Running as distinct computations")
loaded = db.read_text("/tmp/issue.*.jsonl").map(loads)
first_val = loaded.pluck("val").sum()
second_val = loaded.pluck("doubleval").sum()
start = time()
first_val.compute()
print("First value", time() - start)
start = time()
second_val.compute()
print("Second value", time() - start)
print("Running as a single computation")
loaded = db.read_text("/tmp/issue.*.jsonl").map(loads)
first_val = loaded.pluck("val").sum()
second_val = loaded.pluck("doubleval").sum()
start = time()
db.compute(first_val, second_val)
print("Both values", time() - start)
And the output
On datasets with millions of items, I've never finished a run before killing it for taking too long.
Setup some dummy data
Running as distinct computations
First value 0.7081761360168457
Second value 0.6579079627990723
Running as a single computation
Both values 37.74176549911499
Is there a common pattern for solving this kind of issue?
Short answer
Import and run the following and things should be faster
from dask.distributed import Client
c = Client()
Make sure you have dask.distributed installed
conda install dask distributed -c conda-forge
# or
pip install dask distributed --upgrade
Although note, you'll have to do this within an if __name__ == '__main__': block at the bottom of the file rather than at the top level:
from ... import ...
if __name__ == '__main__':
c = Client()
# proceed with the rest of your dask.bag code
Long answer
Dask has a variety of schedulers. Dask.bag uses the multiprocessing scheduler by default, but could use others just as easily. See this doc for more information.
The multiprocessing scheduler does work in a separate process, and then it brings those results back to the main process when necessary. For simple linear workloads like b.map(...).filter(...).frequencies() tasks can all be fused into one single task that goes to a process, computes, and then returns just a very small result.
However, when a workload has any sort of forking (such as you describe) the multiprocessing scheduler has to send the data back to the main process. Depending on the data, this can be expensive because we need to serialize the objects as they move between processes. The basic multiprocessing scheduler within Dask has no concept of data locality. Everything is coordinated by the central process.
Fortunately, Dask's distributed scheduler is much smarter and can handle these situations easily. When you run dask.distributed.Client() without any arguments you create a local "cluster" of processes on your computer. There are a variety of other advantages to this; for example if you navigate to https://localhost:8787/status you'll be treated to a running dashboard of all of your computations.

Resources