Running functions that are themselves multithreaded on a dask cluster - dask

I have a function that uses image processing functions which are themselves multithreaded. I distribute many of those function calls on a dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk
def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4) # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'
# [...]
from distributed import Client
client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one so as not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain from extra cores is small. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use, and dask should decide how many parallel function invocations are started on each node, given the number of available threads. I.e., is it possible to specify the number of threads a function will use?

No, Dask is not able to limit the number of threads spawned by a function, and it doesn't attempt to measure this either.
The only thing I can think of that you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
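As a rough sketch of how the resources approach could be wired up for the thread-count case: the helpers below (hypothetical names `worker_command` and `map_with_cores`) assume a resource label called "cores", which is an arbitrary choice, and assume workers are started through the `dask-worker` CLI with its `--resources` option. The first helper only builds the command line; the second forwards a per-task requirement to `client.map`.

```python
import shlex


def worker_command(scheduler, host_cores, cores_per_task):
    """Build a dask-worker command line advertising `host_cores` abstract cores.

    The worker gets one Dask thread per task that fits on the host, so Dask
    may start that many tasks at once; the "cores" resource caps the total.
    """
    if cores_per_task < 1 or cores_per_task > host_cores:
        raise ValueError("cores_per_task must be between 1 and host_cores")
    concurrent_tasks = host_cores // cores_per_task
    args = [
        "dask-worker", scheduler,
        "--nthreads", str(concurrent_tasks),
        "--resources", f"cores={host_cores}",  # "cores" is an arbitrary label
    ]
    return " ".join(shlex.quote(a) for a in args)


def map_with_cores(client, func, items, cores_per_task):
    """Submit one task per item, each reserving `cores_per_task` "cores"."""
    return client.map(func, items, resources={"cores": cores_per_task})


print(worker_command("tcp://scheduler:8786", host_cores=8, cores_per_task=4))
# dask-worker tcp://scheduler:8786 --nthreads 2 --resources cores=8
```

Dask only does the bookkeeping here. The function itself still has to cap its own threads, for example with SetGlobalDefaultNumberOfThreads as in the question, using the same number that is passed as cores_per_task.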

Related

Dask : tasks submit with resources constraints not working

I submit a Dask task like this:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task, # a task that consumes at most 100MiB
    # task arguments
    arg1,
    arg2,
)
Everything works fine.
Now I set some constraints:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task, # a task that consumes at most 100MiB
    # task arguments
    arg1,
    arg2,
    # resource constraints at the Dask scheduler level
    resources={
        'process': 1,
        'memory': 100*1024*1024 # 100MiB
    }
)
The problem is that, in this case, the future is never resolved and the Python program waits forever, even with only 'process': 1 and/or a very small amount of RAM such as 'memory': 10. So it's weird.
Beyond this reduced example, in my real-world application a given Dask worker is configured with multiple processes and thus may run multiple tasks at the same time.
So I want to set the amount of RAM for each task, to prevent the Dask scheduler from running tasks on a given Dask worker in a way that can lead to out-of-memory errors.
Why doesn't it work as expected? How can I debug it?
Thank you
Adding to @pavithraes's comment - the resources argument to client.submit and other scheduling calls does NOT modify the available workers. Instead, it creates a constraint on the workers that can be used for the given tasks. Importantly, the terms you use here, "process" and "memory", are not interpreted by dask in terms of physical hardware - they are simply qualifiers you define, which dask uses to filter the available workers down to those that match your tag criteria.
From the dask docs:
Resources listed in this way are just abstract quantities. We could equally well have used terms “mem”, “memory”, “bytes” etc. above because, from Dask’s perspective, this is just an abstract term. You can choose any term as long as you are consistent across workers and clients.
It’s worth noting that Dask separately track number of cores and available memory as actual resources and uses these in normal scheduling operation.
Because of this, your tasks hang forever because the scheduler is actually waiting for workers which meet your conditions to appear so that it can schedule these tasks. Unless you create workers with these tags applied, the jobs will never start.
See the dask docs on specifying and using worker resources, and especially the section on Specifying Resources, for more information about how to configure workers such that such resource constraints can be applied.
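One way to debug a hang like this is to check which workers could satisfy the request at all. The sketch below uses a hypothetical helper, `workers_satisfying`, and assumes its input has the shape of `client.scheduler_info()["workers"]`, that is, a mapping from worker address to a dict with an optional "resources" entry. The snapshot is made up for illustration.

```python
def workers_satisfying(workers, required):
    """Return addresses of workers whose declared resources cover `required`.

    `workers` maps a worker address to a dict with an optional "resources"
    entry (the shape assumed here for client.scheduler_info()["workers"]).
    """
    matches = []
    for address, info in workers.items():
        declared = info.get("resources", {})
        if all(declared.get(name, 0) >= amount
               for name, amount in required.items()):
            matches.append(address)
    return matches


# Hypothetical snapshot: the first worker was started without resources,
# the second with something like --resources "process=1 memory=200e6".
snapshot = {
    "tcp://10.0.0.1:40001": {"resources": {}},
    "tcp://10.0.0.2:40002": {"resources": {"process": 1, "memory": 200e6}},
}
required = {"process": 1, "memory": 100 * 1024 * 1024}

print(workers_satisfying(snapshot, required))
# ['tcp://10.0.0.2:40002']
```

An empty list for the real cluster would mean that no worker declares the requested labels, which matches the situation where the future never resolves.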

How to pick proper number of threads, workers, processes for Dask when running in an ephemeral environment as single machine and cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate containers, which will use Dask LocalCluster with a certain number of workers, threads, processes passed into the LocalCluster object.
Our journey on Dask will look very much like this:
Continue using single machine LocalCluster until we outgrow the max cpu/memory allowed
When we outgrow a single container, spawn additional worker containers on the initial container (a la dask-kubernetes) and join them to the LocalCluster.
We're currently starting with containers that have 256 cpu (.25 vCPU) and 512 memory and pinning the LocalCluster to 1 n_workers and 3 threads_per_worker to get a reasonable amount of parallelism. However, this really is guesswork: 1 n_workers since it's a machine with less than 1 vcpu, and 3 threads because that doesn't sound crazy to me based on my previous experience running other python based applications in Fargate. This seems to work fine in a very simple example that just maps a function against a list of items.
RENEWAL_TABLES = [
    'Activity',
    'CurrentPolicyTermStaus',
    'PolicyRenewalStatus',
    'PolicyTerm',
    'PolicyTermStatus',
    'EndorsementPolicyTerm',
    'PolicyLifeState'
]
RENEWAL_TABLES_PAIRS = [
    (i, 1433 + idx) for idx, i in enumerate(RENEWAL_TABLES)
]
@task(state_handlers=[HANDLER])
def dummy_step():
    LOGGER.info('Dummy Step...')
    sleep(15)
@task(state_handlers=[HANDLER])
def test_map(table):
    LOGGER.info('table: {}...'.format(table))
    sleep(15)
with Flow(Path(__file__).stem, SCHEDULE, state_handlers=[HANDLER]) as flow:
    first_step = dummy_step()
    test_map.map(RENEWAL_TABLES_PAIRS).set_dependencies(upstream_tasks=[first_step])
I see no more than 3 tasks executed at once.
I would really like to understand how best to configure n_workers (single machine), threads, and processes as we grow from a single machine to adding remote workers. I know it depends on my workload, but you could see a combination of things in a single flow, where one task extracts from a database to a csv and another task runs a pandas computation. I have seen things online suggesting that, according to the documentation, it should be threads = number of cpus requested, but it seems like you can still achieve parallelism with less than one cpu in Fargate.
Any feedback would be appreciated and could help others looking to leverage Dask in a more ephemeral nature.
Given that Fargate increments from .25 -> .50 -> 1 -> 2 -> 4 for vCPU, I think it's safe to go with a 1 worker to 1 vcpu setup. However, it would be helpful to understand how to choose a good upper limit for the number of threads per worker given how Fargate vcpu allotment works.
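To make the rule of thumb above concrete, here is a minimal sketch with a hypothetical helper, `local_cluster_kwargs`, that turns Fargate CPU units (1024 units per vCPU) into `n_workers` and `threads_per_worker` values for LocalCluster. It only encodes the "one worker per vCPU, one worker below 1 vCPU" idea from this question. The default of 3 threads is the same guess as above, not a recommendation.

```python
import math


def local_cluster_kwargs(cpu_units, threads_per_worker=3):
    """Translate Fargate CPU units (1024 units = 1 vCPU) into LocalCluster sizing.

    One worker per whole vCPU, and a single worker for fractional allotments.
    `threads_per_worker` is a guess that has to be tuned per workload.
    """
    vcpus = cpu_units / 1024
    n_workers = max(1, math.floor(vcpus))
    return {"n_workers": n_workers, "threads_per_worker": threads_per_worker}


for units in (256, 512, 1024, 2048, 4096):
    print(units, local_cluster_kwargs(units))
```

The resulting dict could be passed on as LocalCluster(**local_cluster_kwargs(256)). For I/O-heavy tasks such as database extracts, more threads than cores may be reasonable, while CPU-bound pandas work is less likely to benefit.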

Why does a dask worker fail due to MemoryError on a "small" size task? [Dask.bag]

I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images to the file system. However, the dask worker fails due to MemoryError.
Is there a way to ensure the dask workers don't load too many images into memory? I.e., wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers with 4 cores and 15GB of RAM each, running Centos7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3GB is required for the whole process.
I tried to process a smaller number of images and it works great.
EDITED
from dask.distributed import Client, LocalCluster
# LocalCluster is used to show the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))
paths = ['list', 'of', 'paths']
# Read the file data from each path
data = client.map(read, paths, resources={'process': 1})
# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})
# Save the processed data
data = client.map(save, data, resources={'process': 1})
# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously on the different workers.
EDIT:
My issue is that all tasks get assigned to workers and they don't have enough memory. I found how to limit the number of tasks a worker handles at any one time (see https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process).
However, with that limit, when I execute my tasks they all finish the read step, then the process step, and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every chain of tasks finish before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and that only once they are finished. This is because Dask simply executes a python function and then waits for it to complete, but all sorts of things can happen within a python function. You should generally expect as many tasks to begin as you have available worker cores - as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers, so that if all of them are using the maximum memory that you can expect, you still have some spare in the system to cope.
To EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
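If graph optimization does not fuse the chain, one manual alternative (not something the answer above prescribes) is to compose the steps into a single function, so that each image is read, processed and saved inside one task. The sketch below uses a hypothetical helper, `make_pipeline`, with stand-in `read`, `foo` and `save` functions so that it runs without image data.

```python
def make_pipeline(read, process, save, n_rounds):
    """Compose read -> process (n_rounds times) -> save into one callable.

    Submitting this single function per path means a worker finishes one
    image completely before it picks up the next one.
    """
    def pipeline(path):
        data = read(path)
        for _ in range(n_rounds):
            data = process(data)
        return save(data)
    return pipeline


# Stand-ins for the real image functions, for illustration only.
def read(path):
    return [path]


def foo(data):
    return data + ["foo"]


def save(data):
    return "->".join(data)


run_one = make_pipeline(read, foo, save, n_rounds=2)
print(run_one("img1"))
# img1->foo->foo
```

On the cluster this would presumably be submitted as client.map(run_one, paths, resources={'process': 1}), so the intermediate images never exist as separate task results that could be spilled to disk.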

Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to run queries against a database. But the database can only handle m queries in parallel, where m < n. How can I model that in dask.distributed? Only m workers should work on such a task in parallel.
I have seen that distributed supports locks (http://distributed.readthedocs.io/en/latest/api.html#distributed.Lock). But with that, I could run only one query at a time, not m.
I have also seen that I could define resources per worker (https://distributed.readthedocs.io/en/latest/resources.html). But that does not fit either, as the database is independent of the workers. I would either have to define 1 database resource per worker (which leads to too many parallel queries), or I would have to distribute m database resources among n workers, which makes setting up the cluster difficult and is suboptimal in execution.
Is it possible to define something like semaphores in dask to solve that?
You could probably hack something together with Locks and Variables.
A cleaner solution would be to just implement Semaphores much like how Locks are implemented. Depending on your experience this may not be that hard (the lock implementation is 150 lines), and it would be a welcome pull request.
https://github.com/dask/distributed/blob/master/distributed/lock.py
You can use a dask.distributed.Queue
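import dask.distributed
class DDSemaphore(object):
    """Dask Distributed Semaphore"""
    def __init__(self, value=1):
        # The queue holds one token per available slot
        self._q = dask.distributed.Queue()
        for _ in range(value):
            self._q.put(42)
    def acquire(self):
        # Blocks until a token is available
        self._q.get()
    def release(self):
        self._q.put(42)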
dask.distributed now contains a Semaphore class that can be used via the distributed Futures API:
https://docs.dask.org/en/stable/futures.html#id1
If you're using Dask collections such as Bag, DataFrame, or Array, then you may need to obtain their enclosed Future objects to use them with the Semaphore. Do that with futures_of()
https://docs.dask.org/en/stable/user-interfaces.html?highlight=futures_of#combining-interfaces
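A minimal sketch of the pattern, assuming the semaphore object supports the `with` statement (which `distributed.Semaphore` does): the hypothetical `limited` wrapper makes each call hold one lease while it runs. To keep the example runnable without a cluster, a `threading.BoundedSemaphore` and a thread pool stand in for `distributed.Semaphore(max_leases=m, name="database")` and the Dask workers, and `fake_query` is a made-up placeholder for the real database call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def limited(sem, func):
    """Wrap func so that every call holds one lease of sem while it runs."""
    def wrapper(*args, **kwargs):
        with sem:
            return func(*args, **kwargs)
    return wrapper


# Local demonstration: a thread pool stands in for the cluster.
m = 2
state = {"active": 0, "peak": 0}
guard = threading.Lock()


def fake_query(i):
    """Pretend database query that records how many calls overlap."""
    with guard:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.01)
    with guard:
        state["active"] -= 1
    return i * 2


query = limited(threading.BoundedSemaphore(m), fake_query)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query, range(16)))

print(results, "peak concurrency:", state["peak"])
```

With eight threads but only two leases, the recorded peak never exceeds m. On a real cluster the same wrapper would be handed to client.map together with a distributed.Semaphore, so the limit applies across all workers rather than within one process.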

DASK - Stopping workers during execution causes completed tasks to be launched twice

I want to use dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
from dask.distributed import Client
batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)
client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed
client = Client('the_scheduler:8786')
futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures # now only reference to the futures is within seq
for future in seq:
    pass # let future be garbage collected
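The loop above leaves the final step implicit. A possible way to finish the job, sketched with a hypothetical helper `drain_then_finalize`: consume the completed futures one by one, surface any failure, and only then call the final function locally. For a self-contained run, `concurrent.futures` stands in for `dask.distributed`, and a plain list stands in for the database. Both are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def drain_then_finalize(completed, finalize):
    """Consume futures as they finish, keeping none of them, then finalize.

    `completed` is any iterator of finished futures, such as the object
    returned by as_completed. Returns (number_of_batches, finalize()).
    """
    done = 0
    for future in completed:
        future.result()  # re-raises the exception if a batch failed
        done += 1
    return done, finalize()


stored = []  # stands in for the relational database

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(stored.append, batch) for batch in range(10)]
    seq = as_completed(futures)
    del futures  # the iterator now holds the only references
    count, summary = drain_then_finalize(seq, lambda: sorted(stored))

print(count, summary)
# 10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With Dask, finalize would be read_database_and_store_bundled_result_into_s3, run on the client or submitted as one last task once every batch has reported back.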
