I have a dask cluster with n workers and want the workers to do queries to the database. But the database is only capable of handling m queries in parallel where m < n. How can I model that in dask.distributed? Only m workers should work on such a task in parallel.
I have seen that distributed supports locks (http://distributed.readthedocs.io/en/latest/api.html#distributed.Lock). But with that, I could do only one query in parallel, not m.
Also I have seen that I could define resources per worker (https://distributed.readthedocs.io/en/latest/resources.html). But that does not fit either, as the database is independent of the workers. I would either have to define one database resource per worker (which leads to too many parallel queries), or I would have to distribute m database resources across n workers, which is difficult when setting up the cluster and suboptimal in execution.
Is it possible to define something like semaphores in dask to solve that?
You could probably hack something together with Locks and Variables.
A cleaner solution would be to just implement Semaphores much like how Locks are implemented. Depending on your experience this may not be that hard (the lock implementation is 150 lines), and it would be a welcome pull request.
https://github.com/dask/distributed/blob/master/distributed/lock.py
You can use a dask.distributed.Queue
import dask.distributed

class DDSemaphore(object):
    """Dask Distributed Semaphore"""
    def __init__(self, value=1):
        self._q = dask.distributed.Queue()
        for _ in range(value):
            self._q.put(42)  # seed the queue with `value` tokens

    def acquire(self):
        self._q.get()  # blocks until a token is available

    def release(self):
        self._q.put(42)  # return a token
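A usage sketch (run_query is a placeholder for the actual database call, and this assumes a Client is active so the Queue can reach the scheduler):

sem = DDSemaphore(value=4)  # allow at most four concurrent queries

def do_query(q):
    sem.acquire()
    try:
        return run_query(q)  # placeholder database call
    finally:
        sem.release()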
dask.distributed now contains a Semaphore class that can be used via the distributed Futures API:
https://docs.dask.org/en/stable/futures.html#id1
If you're using Dask collections such as Bag, DataFrame, or Array, then you may need to obtain their enclosed Future objects to use them with the Semaphore. Do that with futures_of()
https://docs.dask.org/en/stable/user-interfaces.html?highlight=futures_of#combining-interfaces
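Applied to the original question, a minimal sketch with the built-in Semaphore (the scheduler address, run_query, and list_of_args are placeholders):

from dask.distributed import Client, Semaphore

client = Client('tcp://scheduler:8786')  # assumed scheduler address
sem = Semaphore(max_leases=3, name='database')  # at most 3 concurrent queries

def query_db(arg):
    with sem:  # blocks until one of the 3 leases is free
        return run_query(arg)  # placeholder database call

futures = client.map(query_db, list_of_args)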
Related
I submit a Dask task like this:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task,  # a task that consumes at most 100MiB
    # task arguments
    arg1,
    arg2,
)
Everything works fine.
Now I set some constraints:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task,  # a task that consumes at most 100MiB
    # task arguments
    arg1,
    arg2,
    # resource constraints at the Dask scheduler level
    resources={
        'process': 1,
        'memory': 100*1024*1024  # 100MiB
    }
)
The problem is that, in this case, the future never resolves and the Python program waits forever. This happens even with only 'process': 1 and/or a very small amount of RAM such as 'memory': 10, which is weird.
Beyond this reduced example, in my real-world application a given Dask worker is configured with multiple processes and can therefore run multiple tasks at the same time.
So I want to set the amount of RAM each task needs, so that the Dask scheduler avoids running tasks on a given Dask worker in a way that can lead to out-of-memory errors.
Why doesn't it work as expected? How can I debug it?
Thank you
Adding to @pavithraes's comment - the resources argument to client.submit and other scheduling calls does NOT modify the available workers. Instead, it creates a constraint on the workers that can be used for the given tasks. Importantly, the terms you use here, "process" and "memory", are not interpreted by dask in terms of physical hardware - they are simply qualifiers you can define that dask uses to filter the available workers down to only those which match your tag criteria.
From the dask docs:
Resources listed in this way are just abstract quantities. We could equally well have used terms “mem”, “memory”, “bytes” etc. above because, from Dask’s perspective, this is just an abstract term. You can choose any term as long as you are consistent across workers and clients.
It’s worth noting that Dask separately tracks the number of cores and available memory as actual resources and uses these in normal scheduling operation.
Because of this, your tasks hang forever because the scheduler is actually waiting for workers which meet your conditions to appear so that it can schedule these tasks. Unless you create workers with these tags applied, the jobs will never start.
See the dask docs on specifying and using worker resources, and especially the section on Specifying Resources, for more information about how to configure workers such that such resource constraints can be applied.
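For instance, a sketch of the full setup (the scheduler address is an assumption; what matters is that the tag names match between workers and client):

# Workers must first be started advertising matching resource tags, e.g.:
#   dask-worker tcp://scheduler:8786 --resources "process=1 memory=104857600"
from dask.distributed import Client

client = Client('tcp://scheduler:8786')  # assumed scheduler address
future = client.submit(
    my_dask_task,
    arg1,
    arg2,
    # these tags now match what the workers advertise, so the task can be scheduled
    resources={'process': 1, 'memory': 100*1024*1024},
)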
I have some function which uses image processing functions which are itself multithreaded. I distribute many of those function calls on a dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk

def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'

# [...]
from distributed import Client
client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use, and dask should decide how many parallel function invocations are started on each node, given the number of available threads. That is, is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by some function, and it doesn't attempt to measure this either.
The only thing I can think you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
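A hedged sketch of that idea, where "threads" is just an arbitrary label and the numbers are assumptions:

# Start each worker advertising, say, 8 units of a "threads" resource:
#   dask-worker tcp://scheduler:8786 --nthreads 2 --resources "threads=8"
from distributed import Client

client = Client('tcp://scheduler:8786')

# Each invocation claims 4 units, so at most two run concurrently per worker:
futures = client.map(func, ['somefile', 'someotherfile'],
                     resources={'threads': 4})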
In the context of the Dask distributed scheduler w/ a LocalCluster: Can somebody help me understand the dynamics of having a large (heap) mapping function?
For example, consider the Dask Data Frame ddf and the map_partitions operation:
def mapper():
    resource = ...  # load some large resource, e.g. 50MB
    def inner(pdf):
        return pdf.apply(lambda x: ..., axis=1)
    return inner

mapper_fn = mapper()  # 50MB on heap
ddf.map_partitions(mapper_fn)
What happens here? Will Dask serialize mapper_fn and send it to all tasks? Say I have n partitions, so n tasks.
Empirically, I've observed that if I have 40 tasks and a 50MB mapper, it takes about 70s for the tasks to start working; the cluster seems to just sit there with full CPU, but the dashboard shows nothing. What is happening here? What are the consequences of having large (heap) functions in the dask distributed scheduler?
Dask serializes non-trivial functions with cloudpickle, and includes the serialized version of those functions in every task. This is highly inefficient. We recommend that you not do this, but instead pass data explicitly.
resource = ...
ddf.map_partitions(func, resource=resource)
This will be far more efficient.
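Concretely, a sketch of that pattern (load_resource is a placeholder for however the ~50MB object is built):

def func(pdf, resource):
    # `resource` arrives as an explicit argument, so Dask serializes it once
    # as its own piece of data instead of re-shipping it inside the pickled
    # closure of every task
    return pdf.apply(lambda x: ..., axis=1)

resource = load_resource()  # placeholder
ddf.map_partitions(func, resource=resource)

If the object is very large, pre-scattering it with client.scatter(resource) and passing the resulting future instead may additionally avoid repeated transfers from the client.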
I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.
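For instance, a sketch of the load-inside-workers pattern under the distributed scheduler (read_lot is a placeholder; LOT and lots are the names from the question's code):

from dask.distributed import Client
from dask import delayed

# Process-based workers sidestep the GIL for pure-Python work:
client = Client(n_workers=4, threads_per_worker=1)

@delayed
def load_lot_data(lot):
    return read_lot(lot)  # placeholder: load this lot's rows inside the worker

d = [delayed(LOT)(lot, load_lot_data(lot)).transition_matrix(lot) for lot in lots]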
I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (each one has its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves df = list(df.groupby("key")) and then applying my_fun with multiprocessing. The performance, despite the huge usage of RAM, is pretty good on my machine and terrible on Google Cloud Compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
`df.groupby("key").apply(my_fun).to_frame.compute(get=get)
As I didn't set the indices df.known_divisions is False
The resulting graph is shown below, and I don't understand whether what I see is a bottleneck or not.
Questions:
Is it better to have df.npartitions as a multiple of ncpu, or does it not matter?
From this it seems that it is better to set the index to key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions: if you are grouping over a column that is not nicely split between partitions and applying anything other than the simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as an explicit step, or else you get the implicit shuffle apparent in your task graph. This is a many-to-many operation: each aggregation task needs input from every original partition, hence the bottleneck. There is no getting around that.
As for the number of partitions: yes, you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle), but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.
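To make the shuffle an explicit, one-time step, a sketch along the lines of the question's own guess (assuming "key" is a plain column):

# One explicit shuffle up front; the groupby-apply then aligns with
# partition boundaries instead of shuffling implicitly:
df = df.set_index("key")
result = df.groupby("key").apply(my_fun).compute()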