DASK - Stopping workers during execution causes completed tasks to be launched twice - dask

I want to use dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the databse and generate a result file (which will be stored in AWS S3)
So it's more or less like this:
from dask import bag, delayed
batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)
client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?

I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed
client = Client('the_scheduler:8786')
futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures # now only reference to the futures is within seq
for future in seq:
pass # let future be garbage collected

Related

Dask task/job queuing

I've been using Dask for a good while but I still I don't know if there is a queue system for tasks by default. Let's say we have a local cluster and a client for it:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
cli = Client(cluster)
I'd like to run, not in parallel but task after task (submit after submit, or future after future) the following:
import time
a, b = cli.submit(time.sleep, 5), cli.submit(time.sleep, 1)
It's easy to see that these run at the same time since future b finishes before future a. My question is the following
Is it possible to force that future b does not start before future a finishes?
If tasks are heavy, I don't want that all of them are running in the same time, I'd like some queue system. Is dask-jobqueue way to go or what? I have no external queue systems available (slurm etc.)
Or does the Dask-Scheduler somehow take care that it won't burden workers too much by scheduling too many simultaneous tasks?
To have one task depend on another, you have some options:
don't launch tasks until the previous ones have finished, e.g., by using the .results attribute to wait for them. In this case, Dask isn't doing much for you.
set up your cluster to limit the number of worker threads to as many tasks as you think can comfortably run simultaneously, with appropriate arguments to LocalCluster (this is the preferred solution)
have a task explicitly depend on a previous one, e.g.,
.
def sleepme(t, *args):
time.sleep(t)
print("done", t)
f1 = client.submit(sleepme, 5)
f2 = client.submit(sleepme, 1, f1) # won't run until f1 is done

Running itself multithreaded functions on a dask cluster

I have some function which uses image processing functions which are itself multithreaded. I distribute many of those function calls on a dask cluster.
First, I started a scheduler on a host: dask-scheduler. The I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk
def func(filename):
sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4) # limit to four threads
img = sitk.ReadImage(filename)
# Do more stuff and store resulting image
# SimpleITK is already multithreaded
return 'someresult'
# [...]
from distributed import Client
client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use and dask should decide how many parallel functions invocations are started on each node, given the number of available threads. I.e. is it possible to specify the number of threads a function will use?
No, Dask is not able to either limit the number of threads spawned by some function, and doesn't attempt to measure this either.
The only thing I could think you might want to do is use Dask's abstract rsources, where you control how much of each labelled quantity is available per worker and how much each task needs to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster), you may need to use the CLI.

Dask Distributed with Asynchronous Real-time Parallelism

I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit().
I have an existing function some_func that is grabbing individual documents (say, a text file) asynchronously and I want to take the raw document and grab all words that don't contain a vowel and shove it back into a different database. This data processing step is blocking.
Assuming that there are several million documents and the distributed cluster only has 10 nodes with 1 process available (i.e., it can only process 10 documents at a time), how will dask.distributed handle the flow of the documents that it needs to process?
Here is some example code:
client = dask.distributed('tcp://1.2.3.4:8786')
def some_func():
doc = retrieve_next_document_asynchronously()
client.submit(get_vowelless_words, doc)
def get_vowelless_words(doc):
vowelless_words = process(doc)
write_to_database(vowelless_words)
if __name__ == '__main__':
for i in range(1000000):
some_func()
Since the processing of a document is blocking and the cluster can only handle 10 documents simultaneously, what happens when 30 other documents are retrieved while the cluster is busy? I understand that client.submit() is asynchronous and it would return a concurrent future but what would happen in this case? Would it hold the document in memory until it 1/10 cores are available and potentially cause the machine to run out of memory after, say, if 1,000 documents are waiting.
What would the scheduler do in this case? FIFO? Should I somehow change the code so that it waits for a core to be available before retrieving the next document? How might that be accomplished?
To use Queues with dask, below is a modified example of using dask Queues with a distributed cluster (based on the documentation):
#!/usr/bin/env python
import distributed
from queue import Queue
from threading import Thread
client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())
def increment(x):
return x+1
def double(x):
return 2*x
input_q = Queue(maxsize=nprocs)
remote_q = client.scatter(input_q)
remote_q.maxsize = nprocs
inc_q = client.map(increment, remote_q)
inc_q.maxsize = nprocs
double_q = client.map(double, inc_q)
double_q.maxsize = nprocs
result_q = client.gather(double_q)
def load_data(q):
i = 0
while True:
q.put(i)
i += 1
load_thread = Thread(target=load_data, args=(input_q,))
load_thread.start()
while True:
size = result_q.qsize()
item = result_q.get()
print(item, size)
In this case, we explicitly limit the maximum size of each queue to be equal to the number of distributed processes that are available. Otherwise, the while loop will overload the cluster. Of course, you can adjust the maxsize to be some multiple of the number of available processes as well. For simple functions like increment and double, I found that maxsize = 10*nprocs is still reasonable but this will surely be limited by the amount of time that it takes to run your custom function.
When you call submit all of the arguments are serialized and immediately sent to the scheduler. An alternative would be to both get documents and process them on the cluster (this assumes that documents are globally visible from all workers).
for fn in filenames:
doc = client.submit(retrieve_doc, fn)
process = client.submit(process_doc, doc)
fire_and_forget(process)
If documents are only available on your client machine and you want to restrict flow then you might consider using dask Queues or the as_completed iterator.

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2GB) from Gcloud storage into Dask Dataframe then process it with apply() and after saving the result to parquet file to Gcloud.
During loading files from Gcloud storage, a cluster memory usage is increasing to about 3-4GB. Then workers (each worker has 2GB of RAM) are terminated/restarted and some tasks getting lost,
so cluster starts computing the same things in a circle.
I removed apply() operation and leave only read_parquet() to test
if my custom code causes a trouble, but the problem was the same, even with just single read_parquet() operation. This is a code:
client = Client('<ip>:8786')
client.restart()
def command():
client = get_client()
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute()
x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all necessary commands to avoid problems with gcsfs authentication inside a cluster
After some investigation, I understood that problem could be in .compute() which returns all data to a process, but this process (my command function) is running on a worker. Because of that, a worker doesn't have enough RAM, crashes and lose all computed task which triggers tasks re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and without even returning data from a cluster write it back to Gcloud storage in parquet format.
So, simply I want to keep data on a cluster and not return it back. Just compute and save data somewhere else.
After reading Dask distributed docs, I have found client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you, please, help me with client.persist() and client.compute() methods for my example
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines

Parallelizing in-memory tasks in dask using shared memory (no sending to other processes)?

I have a trivially parallelizable in-memory problem, but one that which does not give great speedups with regular Python multiprocessing (only 2xish), due to the need for sending lots of data back and forth between processes. Hoping dask can help.
My code basically looks like this:
delayed_results = []
for key, kdf in natsorted(scdf.groupby(grpby_key)):
d1 = dd.from_pandas(kdf, npartitions=1)
d2 = dd.from_pandas(other_dfs[key], npartitions=1)
result = dask.delayed(function)(d1, d2, key=key, n_jobs=n_jobs, **kwargs)
delayed_results.append(result)
outdfs = dask.compute(*delayed_results)
This is what my old joblib code looked like:
outdfs = Parallel(n_jobs=n_jobs)(delayed(function)(scdf, other_dfs[key], key=key, n_jobs=n_jobs, **kwargs) for key, scdf in natsorted(scdf.groupby(grpby_key)))
However, the dask code is much much slower and more memory-consuming, both for the threaded and multiprocessing schedulers. I was hoping that dask could be used to parallelize tasks without needing to send stuff to other processes. Is there a way to use multiple processes with dask by using shared memory?
Btw. The docs have a reference to http://distributed.readthedocs.io/en/latest/local-cluster.html where they explain that this scheduler
It handles data locality with more sophistication, and so can be more
efficient than the multiprocessing scheduler on workloads that require
multiple processes.
But they have no examples of its usage. What should I replace my dask.compute() call with in the code above to try the local cluster?
So you can just do the following
from distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
<your code>
Distributed will by default register itself as the executor, and you can just use dask.compute as normal

Resources