I'm setting up a Dask script to be executed on PSC Bridges P100 GPU nodes. These nodes offer 2 GPUs and 32 CPU cores. I would like to start both CPU-based and GPU-based dask-workers.
The CPU workers will be started as:
dask-worker --nprocs 1 --nthreads 1
while the GPU workers as:
CUDA_VISIBLE_DEVICES=0 dask-worker --nprocs 1 --nthreads 1
My workflow consists of a set of CPU-only tasks and a set of GPU tasks that depend on the results of the CPU tasks. Is there a way to bind the GPU tasks only to the GPU workers?
Furthermore, I would like to make sure that the GPU tasks land on the same compute node as the CPU task they depend on. Can I somehow do that?
For your sort of problem it makes sense to run dask with the dask.distributed backend (a more sophisticated task scheduler), which provides a feature called "worker resources".
For every worker it lets you specify virtual worker resources with an associated count, such as "GPU=2". On the client side you can then specify which and how many resources are required for each task. See the docs here.
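For example, the workers could be started with the --resources flag and the client can then request those resources per task. This is only a sketch: the scheduler address is hypothetical and cpu_task / gpu_task are placeholders for your own functions.

dask-worker tcp://scheduler:8786 --nprocs 1 --nthreads 1 --resources "CPU=1"
CUDA_VISIBLE_DEVICES=0 dask-worker tcp://scheduler:8786 --nprocs 1 --nthreads 1 --resources "GPU=1"

and on the client side:

from dask.distributed import Client

def cpu_task(x):      # placeholder CPU-only function
    return x + 1

def gpu_task(x):      # placeholder GPU function
    return x * 2

client = Client("tcp://scheduler:8786")   # hypothetical scheduler address
# Each submit call declares which abstract resource it needs, so the scheduler
# only places it on workers that advertise that resource.
cpu_future = client.submit(cpu_task, 10, resources={"CPU": 1})
gpu_future = client.submit(gpu_task, cpu_future, resources={"GPU": 1})
print(gpu_future.result())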
To make sure a GPU task lands on the same compute node as the task it depends on, you could:
set the resources accordingly, i.e. split the tasks up explicitly by using per-node resource labels such as "GPU1" and "GPU2"
alternatively, bundle the CPU task and the subsequent GPU task into one task, either by manually defining an encompassing function or by using dask graph optimizations as documented here (I'm thinking of "fusing" tasks).
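A minimal sketch of the bundling option (the scheduler address and the two stage functions are placeholders):

from dask.distributed import Client

def cpu_stage(x):     # placeholder for the CPU-only computation
    return x + 1

def gpu_stage(x):     # placeholder for the GPU computation
    return x * 2

def cpu_then_gpu(x):
    # Both stages execute inside a single task, so they can never be split
    # across nodes; requesting a GPU resource places the whole task on a GPU worker.
    return gpu_stage(cpu_stage(x))

client = Client("tcp://scheduler:8786")   # hypothetical scheduler address
future = client.submit(cpu_then_gpu, 10, resources={"GPU": 1})
print(future.result())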
I am using dask on an LSF cluster to process some images in parallel. The processing function itself uses joblib to perform multiple computations on the image in parallel.
It seems that setting the n_workers and cores parameters to some numbers will generally produce n_workers * cores futures running at the same time. I would like to have n_workers futures being processed at a time, each of them having cores cores at its disposal for the purpose of using them with joblib.
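A sketch of the setup being described, assuming dask_jobqueue's LSFCluster (which the n_workers/cores parameters suggest); the values are illustrative:

from dask_jobqueue import LSFCluster
from dask.distributed import Client

# With processes=1, each LSF job starts one worker process that advertises
# `cores` threads, so the scheduler will run up to n_workers * cores tasks at once.
cluster = LSFCluster(cores=8, processes=1, memory="16GB")
cluster.scale(4)                 # request 4 workers
client = Client(cluster)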
How do I achieve such a result?
I am seeing some performance degradation in my data analysis when I go beyond 25 workers, each with 192 threads. Are there any limits on the scheduler? There is no load footprint on communication (InfiniBand is used), CPU, or RAM.
For example, initially I have 170K HDF5 files on the Lustre filesystem:
ddf=dd.read_hdf(hdf5files,key="G18",mode="r")
ddf.repartition(npartitions=4096).to_parquet(splitspath+"gdr3-input-cache")
The code runs slower on 64 workers than on 25. It looks like the scheduler is heavily overloaded during the initial task-graph construction phase.
EDIT:
dask-2021.06.0
distributed-2021.06.0
There are many potential bottlenecks. Here are some hints.
Yes, the scheduler is a single process through which all tasks must pass, and it introduces an overhead per task (<1 ms) just to manipulate its internal state and send the task out to a worker. So, if you have many tasks per second, you will see that overhead take up a larger fraction of the total time; for example, 170K input files imply at least 170K tasks, which at ~1 ms each is already on the order of three minutes of pure scheduler time.
Similarly, if you have a lot of workers, you will have a lot of network traffic for both distribution of tasks and any data shuffling between workers. More workers, more traffic.
Thirdly, Python uses a global lock, the GIL, when running code. Even when your tasks are GIL-friendly (e.g., array/dataframe ops), threads may still need the GIL sometimes, and this can cause contention and degraded performance.
Finally, you say you are using Lustre, so you have many tasks simultaneously hitting network storage, which has its own limitations both for metadata access and for data traffic.
I have some calculations calling the pardiso() solver from Python. The solver allocates its own memory in a way that is opaque to Python, but the pointers used to access that memory are stored in Python. If I were to run these calculations using dask.delayed, is there any way to tell dask the expected memory consumption of each calculation so that it can schedule them appropriately?
There are at least two solutions to a situation where there is some constraint that dask should respect: the resources argument and a Semaphore.
For resources, the workflow is to allocate some amount of a resource to each worker (either via the CLI when launching the workers, or via the resources kwarg in LocalCluster or another cluster type). The code then specifies how much of this resource each task uses, at the time of .compute or .map/.submit.
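A sketch of the resources approach (the resource label "MEMORY" and all numbers are arbitrary; dask only does the book-keeping, you give them meaning):

from dask.distributed import Client, LocalCluster

# Each worker advertises 16e9 units of an abstract "MEMORY" resource.
cluster = LocalCluster(n_workers=2, threads_per_worker=4,
                       resources={"MEMORY": 16e9})
client = Client(cluster)

def run_solver(x):
    return x * 2          # placeholder for the memory-hungry pardiso() call

# Each task declares it needs 10e9 units, so at most one such task
# can run on a worker at any given time.
futures = client.map(run_solver, range(8), resources={"MEMORY": 10e9})
print(client.gather(futures))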
The workflow with a Semaphore is to specify the number of possible leases when creating the Semaphore (note that unlike resources this is an integer, so in some sense it is less flexible; see the docs). Then, whenever the costly resource is accessed, the access should be wrapped in a with sem context manager.
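A sketch of the Semaphore approach (cluster size, lease count, and the solver placeholder are illustrative):

from dask.distributed import Client, LocalCluster, Semaphore

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Allow at most 2 concurrent solver calls across the whole cluster.
sem = Semaphore(max_leases=2, name="pardiso")

def run_solver(x, sem):
    with sem:                 # blocks until a lease is available
        return x * 2          # placeholder for the memory-hungry solver call

futures = client.map(run_solver, range(8), sem=sem)
print(client.gather(futures))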
To compare and contrast the performance of three different algorithms in a scientific experiment, I am planning to use the Celery scheduler. These algorithms are implemented by three different tools. They may or may not have parallelism implemented, and I don't want to make any prior assumptions about that. The dataset contains 10K data points. All three tools are supposed to run on all the data points, which translates into 30K tasks for the scheduler. All I want is to allocate the same amount of resources to all the tools, across all the executions.
Assume my physical Ubuntu 18.04 server is equipped with 24 cores and 96 GB of RAM. Tasks are scheduled by 4 Celery workers, each handling a single task. I want to put an upper limit of 4 CPU cores and 16 GB of memory per task. Moreover, no two tasks should race for the same cores, i.e., the 4 tasks should use 16 cores in total, each scheduled on its own set of cores.
Is there any means to accomplish this setup, either through Celery, or cgroup, or by any other mechanism? I want to refrain from using docker, kubernetes, or any VM based approach, unless it is absolutely required.
Dealing with CPU cores should be fairly easy by setting the concurrency to 6. Limiting memory usage is the hard part of the requirement, and I believe you can accomplish that by making the worker processes be owned by a particular cgroup on which you have set a memory limit.
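A minimal sketch of the Celery side, assuming a project named proj and a Redis broker (both illustrative); the cgroup memory limit itself would be configured at the OS level, outside Celery:

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")   # broker URL is illustrative

# Run at most 6 tasks concurrently (24 cores / 4 cores per task).
app.conf.worker_concurrency = 6
# Have each worker child reserve only one task at a time.
app.conf.worker_prefetch_multiplier = 1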
An alternative would be to run Celery workers in containers with specified limits.
I prefer not to do this, as there may be tasks (or tasks with particular arguments) that allocate only a tiny amount of RAM, so it would be wasteful if you can't use 4 GB of RAM while such a task runs.
It's a pity that Celery autoscaling is deprecated (it is one of the coolest features of Celery, IMHO). It should not be a difficult task to implement a Celery autoscaler that scales up/down depending on memory utilization.
Local dask allows using the processes scheduler. Workers in dask distributed use a ThreadPoolExecutor to compute tasks. Is it possible to replace the ThreadPoolExecutor with a ProcessPoolExecutor in dask distributed? Thanks.
The distributed scheduler allows you to work with any number of processes, via any of the deployment options, and each of these processes can have one or more threads. Thus, you are free to choose whatever mix of threads and processes you see fit.
The simplest expression of this is with the LocalCluster (same as Client() by default):
cluster = LocalCluster(n_workers=W, threads_per_worker=T, processes=True)
makes W workers with T threads each (which can be 1).
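For example (a sketch with illustrative values): two single-threaded worker processes, so every task runs in its own process rather than sharing a thread pool:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)
print(client.submit(lambda x: x + 1, 41).result())   # -> 42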
As things stand, the implementation of workers uses a thread pool internally, and you cannot swap in a process pool in its place.