Dask Distributed Scheduler and Large Functions - dask

In the context of the Dask distributed scheduler w/ a LocalCluster: Can somebody help me understand the dynamics of having a large (heap) mapping function?
For example, consider the Dask Data Frame ddf and the map_partitions operation:
def mapper():
resource=... #load some large resource eg 50MB
def inner(pdf):
return pdf.apply(lambda x: ..., axis=1)
return inner
mapper_fn = mapper() #50MB on heap
ddf.map_partitions(mapper_fn)
What happens here? Dask will serialize mapper_fn and send to all tasks? Say, I have n partitions so, n tasks.
Empirically, I've observed, that if I have 40 tasks and a 50MB mapper, then it takes about 70s taks to start working, the cluster seems to kinda sit there w/ full CPU, but the dashboard shows nothing. What is happening here? What are the consequences of having large (heap) functions in the dish distributed scheduler?

Dask serializes non-trivial functions with cloudpickle, and includes the serialized version of those functions in every task. This is highly inefficient. We recommend that you not do this, but instead pass data explicitly.
resource = ...
ddf.map_partitions(func, resource=resource)
This will be far more efficient.

Related

Dask utilising all ram when saving to parquet

I am having issues using dask. It is very slow compared to pandas especially when reading large datasets of up to 40gig. The data set grows to about 100+ columns which are mainly float64 after some additional processing(This is quite slow especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading'))
I think I could live with delay even if frustrating however, when I try to save the data to parquet: df.to_parquet('my data frame', engine="fastparquet") it runs out of memory in a server with about 110gig ram. I notice that the buff/cache memory when I do free -h goes up from about 40megabytes to 40+gig.
I am confused how this is possible given that dask does not load everything into memory. I use 100 partitions for the dataset in dask.
Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
you read a 40GB dataset
you run grouping / sorting
you join with other datasets
you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition, it's hard to say given the information provided.

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
d = []
for lot in lots:
lot_data = data[data["LOTID"]==lot]
trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
d.append(trmat)
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)
Visualized graph of the operations
General rule: if your data comfortable fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processed; however using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.

Parallelizing in-memory tasks in dask using shared memory (no sending to other processes)?

I have a trivially parallelizable in-memory problem, but one that which does not give great speedups with regular Python multiprocessing (only 2xish), due to the need for sending lots of data back and forth between processes. Hoping dask can help.
My code basically looks like this:
delayed_results = []
for key, kdf in natsorted(scdf.groupby(grpby_key)):
d1 = dd.from_pandas(kdf, npartitions=1)
d2 = dd.from_pandas(other_dfs[key], npartitions=1)
result = dask.delayed(function)(d1, d2, key=key, n_jobs=n_jobs, **kwargs)
delayed_results.append(result)
outdfs = dask.compute(*delayed_results)
This is what my old joblib code looked like:
outdfs = Parallel(n_jobs=n_jobs)(delayed(function)(scdf, other_dfs[key], key=key, n_jobs=n_jobs, **kwargs) for key, scdf in natsorted(scdf.groupby(grpby_key)))
However, the dask code is much much slower and more memory-consuming, both for the threaded and multiprocessing schedulers. I was hoping that dask could be used to parallelize tasks without needing to send stuff to other processes. Is there a way to use multiple processes with dask by using shared memory?
Btw. The docs have a reference to http://distributed.readthedocs.io/en/latest/local-cluster.html where they explain that this scheduler
It handles data locality with more sophistication, and so can be more
efficient than the multiprocessing scheduler on workloads that require
multiple processes.
But they have no examples of its usage. What should I replace my dask.compute() call with in the code above to try the local cluster?
So you can just do the following
from distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
<your code>
Distributed will by default register itself as the executor, and you can just use dask.compute as normal

large data shuffle causes timeouts

I'm trying to read a 100000 data records of about 100kB each simultaneously from 50 disks, shuffling them, and writing it to 50 output disks at disk speed. What's a good way of doing that with Dask?
I've tried creating 50 queues and submitting 50 reader/writer functions using 100 workers (all on different machines, this is using Kubernetes). I ramp up first the writers, then the readers gradually. The scheduler gets stuck at 100% CPU at around 10 readers, and then gets timeouts when any more readers are added. So this approach isn't working.
Most dask operations have something like 1ms of overhead. As a result Dask is not well suited to be placed within innermost loops. Typically it is used at a coarser level, parallelizing across many Python functions, each of which is expected to take 100ms.
In a situation like yours I would push data onto a shared message system like Kafka, and then use Dask to pull off chunks of data when appropriate.
Data transfer
If your problem is in the bandwidth limitation of moving data through dask queues then you might consider turning your data into dask-reference-able futures before placing things into queues. See this section of the Queue docstring: http://dask.pydata.org/en/latest/futures.html#distributed.Queue
Elements of the Queue must be either Futures or msgpack-encodable data (ints, strings, lists, dicts). All data is sent through the scheduler so it is wise not to send large objects. To share large objects scatter the data and share the future instead.
So you probably want something like the following:
def f(queue):
client = get_client()
for fn in local_filenames():
data = read(fn)
future = client.scatter(data)
queue.put(future)
Shuffle
If you're just looking to shuffle data then you could read it with something like dask.bag or dask.dataframe
df = dd.read_parquet(...)
and then sort your data using the set_index method
df.set_index('my-column')

Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to do queries to the database. But the database is only capable of handling m queries in parallel where m < n. How can I model that in dask.distributed? Only m workers should work on such a task in parallel.
I have seen that distributed supports locks (http://distributed.readthedocs.io/en/latest/api.html#distributed.Lock). But with that, I could do only one query in parallel, not m.
Also I have seen that I could define resources per worker (https://distributed.readthedocs.io/en/latest/resources.html). But that does not fit also, as the database is independent from the workers. I would either have to define 1 database resource per worker (which leads to too much parallel queries). Or I would have to distribute m database resources to n workers, which is difficult on setting up the cluster and suboptimal in execution.
Is it possible to define something like semaphores in dask to solve that?
You could probably hack something together with Locks and Variables.
A cleaner solution would be to just implement Semaphores much like how Locks are implemented. Depending on your experience this may not be that hard, (the lock implementation is 150 lines) and would be a welcome pull request.
https://github.com/dask/distributed/blob/master/distributed/lock.py
You can use a dask.distributed.Queue
class DDSemaphore(object):
"""Dask Distributed Semaphore"""
def __init__(self, value=1):
self._q = dask.distributed.Queue()
for _ in range(value):
self._q.put(42)
def acquire():
self._q.get()
def release():
self._q.put(42)
dask.distributed now contains a Semaphore class that can be used via the distributed Futures API:
https://docs.dask.org/en/stable/futures.html#id1
If you're using Dask collections such as Bag, DataFrame, or Array, then you may need to obtain their enclosed Future objects to use them with the Semaphore. Do that with futures_of()
https://docs.dask.org/en/stable/user-interfaces.html?highlight=futures_of#combining-interfaces

Resources