How to create a custom Dask worker with imports - dask

I'm setting up Dask, and I can use dask for multiprocessing just fine.
I run into issues, however, when I want to use pre-configured Dask workers: they don't have the same imports as my main process.
How do I add custom imports to Dask workers so that all futures running on those workers can operate correctly?

Ideally your Dask workers should all have the same software environment. Typically this is guaranteed outside of Dask with Docker images or with a network file system (NFS). There are also other solutions, such as Client.upload_file, which can be useful for small scripts.
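For the Client.upload_file route, a minimal sketch (the scheduler address, module name, and transform function are placeholders, not part of the original answer):

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Ship a local module to every worker; workers can then import it inside tasks
client.upload_file("my_helpers.py")  # placeholder module name

def task(x):
    import my_helpers  # available on the worker after upload_file
    return my_helpers.transform(x)  # placeholder function

future = client.submit(task, 1)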

Related

Monitor dask-xarray performance

I have the following basic code which (I thought) should set up xarray to use a LocalCluster.
from dask.distributed import Client
client = Client("tcp://127.0.0.1:46573") # this points to a LocalCluster
import xarray as xr
ds = xr.open_mfdataset('*.nc', combine='by_coords') # Uses dask to defer actually loading data
I now launch some task which also completes with no issues:
(ds.mean('time').mean('longitude')**10).compute()
I noticed that the tabs for the Task graph, Workers, or Task stream (among others) in the dask-labextension for my LocalCluster remain empty. Shouldn't there be some sort of progress displayed while the computation is running?
Which leads me to wonder, how do I tell xarray to explicitly use this cluster? Or is Client a singleton such that xarray only ever has one instance to use anyway?
When you create a Dask Client it automatically registers itself as the default way to run Dask computations.
You can check to see if an object is a Dask collection with the dask.is_dask_collection function. As you say, I believe that xr.open_mfdataset uses Dask by default, but this would be a good way to check.
As to why you're not seeing anything on the dashboard, I unfortunately don't know enough about your situation to be able to help you there.
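As a quick sketch of that check, based on the snippet in the question:

import dask
import xarray as xr

ds = xr.open_mfdataset('*.nc', combine='by_coords')

# True when the dataset's variables are backed by Dask arrays
print(dask.is_dask_collection(ds))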

memory/cpu utilization issues when running dask on cluster (SGE)

I ported my code that uses pandas to Dask (basically replaced pd.read_csv with dd.read_csv). The code essentially applies a series of filters/transformations to the (Dask) dataframe. Similarly, I use Dask bag/array instead of numpy array-like data.
The code works great locally (i.e., on my workstation or on a virtual machine). However, when I run it on a cluster (SGE), my jobs get killed by the scheduler, which reports that memory/CPU usage exceeded the allocated limit. It looks like Dask tries to consume all the memory/threads available on the node rather than what the scheduler has allocated.
There seem to be two approaches to fix this: (a) set memory/CPU limits for Dask from within the code as soon as the library is loaded (just as we call matplotlib.use("Agg") right after importing matplotlib when we need to set the backend), and/or (b) have Dask understand the memory limits set by the scheduler. Could you please advise on how to go about this? My preference would be to set memory/CPU limits for Dask from within the code.
PS: I understand there is a way to spin up Dask workers in a cluster environment and specify these limits, but I am interested in running the code that works great locally on the cluster with minimal additional changes. Thanks!
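One way to express approach (a) from within the code is to create the workers explicitly with LocalCluster and pass n_workers, threads_per_worker, and memory_limit to match the SGE allocation. A minimal sketch with placeholder values:

from dask.distributed import Client, LocalCluster

# Match the resources actually granted by the SGE scheduler (placeholder values)
cluster = LocalCluster(
    n_workers=4,            # number of worker processes
    threads_per_worker=1,   # threads each worker may use
    memory_limit="4GB",     # per-worker memory cap
)
client = Client(cluster)

# ... dd.read_csv(...), filters/transformations, .compute() as before ...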

Is there a dask equivalent to maxtasksperchild?

We have jobs which interact with native code and there are unavoidable memory leaks while the worker is processing the task. The simple solution for our problems has been to restart the worker after a specified number of tasks.
We are migrating from Python's multiprocessing, which has a useful maxtasksperchild option that shuts down a worker after it has completed a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers who have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
No, there is no such equivalent in Dask.
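For reference, a minimal sketch of the workaround described in the question (the scheduler address, threshold, and task body are placeholders):

from dask.distributed import Client, get_worker, as_completed

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
TASKS_BEFORE_RESTART = 100               # placeholder threshold

def leaky_task(x):
    result = x * 2                       # stands in for the native-code call
    # Tag the result with the address of the worker that ran it
    return result, get_worker().address

futures = client.map(leaky_task, range(1000))

counts = {}
for future in as_completed(futures):
    result, address = future.result()
    counts[address] = counts.get(address, 0) + 1
    if counts[address] >= TASKS_BEFORE_RESTART:
        # Retire the worker; the cluster manager can then start a fresh one
        client.retire_workers(workers=[address])
        counts[address] = 0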

Waiting for external dependencies in dask

Context:
I'm using custom dask graphs to manage and distribute computations.
Problem:
Some tasks involve reading in files which are produced outside of Dask and are not necessarily available at the time of calling dask.get(graph, result_key).
Question:
Having the I/O tasks wait for the files is not an option, as this would block workers. Is there (or what would be) a good way to let Dask wait for the files to become available and only then execute the I/O tasks?
Thanks a lot for any thoughts!
It sounds like you might want to use some of the more real-time features of Dask, described here.
You might consider writing tasks that use secede and rejoin, or using async/await-style programming and only launching the I/O tasks once your client process notices that the files exist.
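As a rough illustration of the secede/rejoin idea (the file name and polling interval are placeholders, not part of the original answer):

import os
import time
from dask.distributed import Client, secede, rejoin

def load_when_ready(path, poll_interval=5):
    # Step out of the worker's thread pool so waiting doesn't block other tasks
    secede()
    while not os.path.exists(path):
        time.sleep(poll_interval)
    # Re-enter the thread pool before doing the actual I/O work
    rejoin()
    with open(path) as f:
        return f.read()

client = Client()                                      # assumes an existing cluster
future = client.submit(load_when_ready, "output.dat")  # placeholder file name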

Running process scheduler in Dask distributed

Local Dask allows using a process-based scheduler. Workers in Dask distributed use a ThreadPoolExecutor to compute tasks. Is it possible to replace the ThreadPoolExecutor with a ProcessPoolExecutor in Dask distributed? Thanks.
The distributed scheduler allows you to work with any number of processes, via any of the deployment options, and each process can have one or more threads. You are therefore free to choose whatever mix of threads and processes suits your workload.
The simplest expression of this is with the LocalCluster (same as Client() by default):
cluster = LocalCluster(n_workers=W, threads_per_worker=T, processes=True)
makes W workers with T threads each (which can be 1).
As things stand, the implementation of workers uses a thread pool internally, and you cannot swap in a process pool in its place.
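For completeness, a minimal runnable sketch of the snippet above, with placeholder values standing in for W and T:

from dask.distributed import Client, LocalCluster

# Four single-threaded worker processes: tasks run in separate processes,
# though each worker still uses a thread pool internally
cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
client = Client(cluster)

futures = client.map(lambda x: x ** 2, range(10))
print(client.gather(futures))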
