Monitor dask-xarray performance - dask

I have the following basic code which (I thought) should set up xarray to use a LocalCluster.
from dask.distributed import Client
client = Client("tcp://127.0.0.1:46573") # this points to a LocalCluster
import xarray as xr
ds = xr.open_mfdataset('*.nc', combine='by_coords') # Uses dask to defer actually loading data
I now launch a computation, which also completes with no issues:
(ds.mean('time').mean('longitude')**10).compute()
I noticed that the Task Graph, Workers, and Task Stream tabs (among others) in the dask-labextension for my LocalCluster remain empty. Shouldn't there be some sort of progress displayed while the computation is running?
Which leads me to wonder: how do I tell xarray to explicitly use this cluster? Or is Client a singleton, such that xarray only ever has one instance to use anyway?

When you create a Dask Client it automatically registers itself as the default way to run Dask computations.
You can check to see if an object is a Dask collection with the dask.is_dask_collection function. As you say, I believe that xr.open_mfdataset uses Dask by default, but this would be a good way to check.
As to why you're not seeing anything on the dashboard, I unfortunately don't know enough about your situation to be able to help you there.
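A quick sanity check along these lines (reusing the scheduler address from the question) might look like:
import dask
import xarray as xr
from dask.distributed import Client
client = Client("tcp://127.0.0.1:46573")  # the most recently created Client becomes the default scheduler
ds = xr.open_mfdataset('*.nc', combine='by_coords')
print(dask.is_dask_collection(ds))  # True if the dataset is backed by lazy Dask arrays
print(client.dashboard_link)        # the dashboard this client (and hence xarray) reports to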

Related

memory/cpu utilization issues when running dask on cluster (SGE)

I ported my code that uses pandas to Dask (basically replaced pd.read_csv with dd.read_csv). The code essentially applies a series of filters/transformations to the (Dask) dataframe. Similarly, I use Dask bags/arrays instead of numpy array-like data. The code works great locally (i.e., on my workstation or on a virtual machine). However, when I run the code on a cluster (SGE), my jobs get killed by the scheduler, which reports that memory/CPU usage has exceeded the allocated limit. It looks like Dask is trying to consume all the memory/threads available on the node rather than what has been allocated by the scheduler.

There seem to be two approaches to fix this issue: (a) set memory/CPU limits for Dask from within the code as soon as we load the library (just like we set matplotlib.use("Agg") as soon as we load matplotlib when we need to set the backend), and/or (b) have Dask understand the memory limits set by the scheduler. Could you please advise on how to go about this issue? My preference would be to set memory/CPU limits for Dask from within the code.

PS: I understand there is a way to spin up Dask workers in a cluster environment and specify these limits, but I am interested in running my code, which works great locally, on the cluster with minimal additional changes. Thanks!
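Not part of the original thread, but a minimal sketch of approach (a), assuming it is acceptable to start a bounded LocalCluster from within the script (the resource numbers are placeholders to be matched to the SGE allocation):
from dask.distributed import Client, LocalCluster
# Constrain Dask to the resources the SGE job actually requested
# (placeholder values; match them to your qsub allocation).
cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)
# dd.read_csv and the subsequent filters/transformations now run inside
# this bounded cluster instead of using every core and all memory on the node.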

How to create a custom Dask worker with imports

I'm setting up Dask, and I can use dask for multiprocessing just fine.
I run into issues, however, when I want to use pre-configured Dask workers: they don't have the same imports as my main process.
I was wondering: how do I add custom imports to Dask workers so that all futures accessing those workers can operate effectively?
Ideally your Dask workers should all have the same software environment. Typically this is guaranteed outside of Dask with Docker images or with a Network File System (NFS). There are some other solutions, like Client.upload_file, which can be useful for small scripts.
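As a rough illustration of the Client.upload_file route (the scheduler address, module, and function names here are hypothetical):
from dask.distributed import Client
client = Client("tcp://scheduler:8786")   # hypothetical scheduler address
client.upload_file("my_helpers.py")       # ships the small script to every current worker
def task(x):
    import my_helpers                     # importable on the workers after upload_file
    return my_helpers.process(x)
future = client.submit(task, 42)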

Is there a dask equivalent to maxtasksperchild?

We have jobs which interact with native code and there are unavoidable memory leaks while the worker is processing the task. The simple solution for our problems has been to restart the worker after a specified number of tasks.
We are migrating from python's multiprocessing which has a useful maxtasksperchild option which closes down the workers after a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers who have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
No, there is no such equivalent in Dask.
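A rough sketch of the workaround described above (the leaky function and scheduler address are hypothetical, and it assumes whatever manages the cluster replaces retired workers):
from collections import Counter
from dask.distributed import Client, as_completed, get_worker

client = Client("tcp://scheduler:8786")     # hypothetical scheduler address
MAX_TASKS_PER_WORKER = 50                   # rough analogue of maxtasksperchild

def leaky_task(x):
    result = do_native_work(x)              # hypothetical native-code call that leaks memory
    # Append the worker address to the result payload so the client can count tasks per worker.
    return result, get_worker().address

tasks_done = Counter()
futures = client.map(leaky_task, range(1000))
for future in as_completed(futures):
    result, worker_addr = future.result()
    tasks_done[worker_addr] += 1
    if tasks_done[worker_addr] >= MAX_TASKS_PER_WORKER:
        # Manually retire the worker once it has processed enough tasks.
        client.retire_workers(workers=[worker_addr])
        del tasks_done[worker_addr]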

Get Dask diagnostic values for a Dask distributed client

Is there any way to get Dask diagnostic data (not the dashboard) for a dask.distributed client?
Dask already provides a nice Bokeh dashboard, where it plots quite a lot of diagnostic information. However, what I want are not the plots but their values: something like the progress value and CPU and memory usage, along with a timestamp. I would like to store these values in a database for my own monitoring purposes.
So far, I have tried the dask.distributed.get_task_stream() function; it returns information about the workers as a list, but I would like to receive it as a stream, exactly what the Task Stream plot shows in the dashboard.
Note: there is a dask.diagnostics module from which you can import ProgressBar, Profiler, ResourceProfiler, etc. However, from my current understanding, these are only for the single-machine schedulers and not for the distributed scheduler. Am I right? Or can I use them in a distributed environment?
In most cases we recommend the get_task_stream function that you've already found.
If you want to trigger something on every transition you might consider the Scheduler plugins. In particular, the task stream plugin that feeds that dashboard lives here:
https://github.com/dask/distributed/blob/master/distributed/diagnostics/task_stream.py
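For example, a rough sketch of collecting those values with get_task_stream (the scheduler address is hypothetical, and the exact record fields can vary between distributed versions):
from dask.distributed import Client, get_task_stream

client = Client("tcp://scheduler:8786")    # hypothetical scheduler address

# Record task-level timings for everything computed inside the block.
with get_task_stream(client) as ts:
    client.submit(sum, range(1000)).result()

# ts.data is a list of per-task records (worker, key, start/stop timings, ...)
# that could be written to a database for monitoring.
for record in ts.data:
    print(record["worker"], record["key"], record["startstops"])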

Waiting for external dependencies in dask

Context:
I'm using custom dask graphs to manage and distribute computations.
Problem:
Some tasks involve reading in files which are produced outside of Dask and are not necessarily available at the time of calling dask.get(graph, result_key).
Question:
Having the I/O tasks simply wait for the files is not an option, as this would block workers. Is there (or what would be) a good way to let Dask wait for the files to become available and only then execute the I/O tasks?
Thanks a lot for any thoughts!
It sounds like you might want to use some of the more real-time features of Dask, described here.
You might consider making tasks that use secede and rejoin, or using async-await style programming and only launching tasks once your client process notices that the files exist.
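As a rough sketch of the secede/rejoin idea (the scheduler address, file path, and poll interval are made up):
import os
import time
from dask.distributed import Client, secede, rejoin

client = Client("tcp://scheduler:8786")    # hypothetical scheduler address

def read_when_ready(path, poll_interval=10):
    # Step out of the worker's thread pool so waiting does not block other tasks.
    secede()
    while not os.path.exists(path):
        time.sleep(poll_interval)
    # Re-enter the thread pool before doing the actual I/O work.
    rejoin()
    with open(path) as f:
        return f.read()

future = client.submit(read_when_ready, "/data/external_output.txt")  # hypothetical path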
