dask distributed code is slower than corresponding serial execution - dask

I have this dask example: a standalone Python script that runs on my desktop, which has 4 CPU cores. It currently takes 0.735 seconds. The goal is to use separate processes on my Linux machine to overcome the limitations of the GIL, etc.
import numpy as np
import dask
from dask import delayed
from dask.distributed import Client
import time

def main():
    if __name__ == "__main__":
        client = Client()
        tmp, pres = setUpData()
        startTime = time.time()
        executeCalc(tmp, pres)
        stopTime = time.time()
        print(stopTime - startTime)

def setUpData():
    temperature = 273 + 20 * np.random.random([4, 17, 73, 144])
    pres = ["1000", "925", "850", "700", "600", "500", "400", "300", "250", "200", "150", "100", "70", "50", "30", "20", "10"]
    level = np.array(pres)
    level = level.astype(float) * 100
    return temperature, level

def executeCalc(tmp, pres):
    potempList = []
    for i in range(0, tmp.shape[0]):
        tmpInstant = tmp[i, :, :, :]
        potempList.append(delayed(pot)(tmpInstant, pres))
    results = dask.compute(potempList, scheduler='processes', num_workers=4)

def pot(tmp, pres):
    potemp = np.zeros((17, 73, 144))
    potemp = tmp * (100000. / pres[:, None, None])
    return potemp

main()
Here is the corresponding serial execution with only trivial modifications and this takes 0.0024 seconds.
def executeCalc(tmp, pres):
    potempList = []
    for i in range(0, tmp.shape[0]):
        tmpInstant = tmp[i, :, :, :]
        potemp = pot(tmpInstant, pres)
Where am I going wrong? At the very least for this trivial amount of data the execution times should be identical.

This assumption is where you’re going wrong:
At the very least for this trivial amount of data the execution times should be identical.
Any parallel execution engine does the same amount of work as the serial engine, plus a lot of additional overhead. So you need enough parallelizable work to justify the time required to spin up the scheduler, web server, and worker processes, transmit the job and results across workers, serialize and deserialize all inputs and outputs, monitor and manage job and worker state, etc.
To see the benefits of dask, you need a task which is many orders of magnitude larger than this.
Take a look at the dask docs on best practices. I’m paraphrasing, but the first recommendation is don’t use dask unless you have to.
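To get a feel for where the crossover is, one option is to time the same pattern as the per-task work grows. Below is a minimal sketch along those lines; the repeats knob and the linspace pressure levels are invented for illustration, and for memory-bound numpy arithmetic like this the cost of pickling arrays to worker processes can still dominate, so the dask version only wins once each task does substantial compute relative to the data it receives.

import time
import numpy as np
import dask
from dask import delayed

def pot(tmp, pres, repeats=200):
    # Same style of computation as in the question, repeated artificially
    # so that each task carries a non-trivial amount of work.
    out = tmp
    for _ in range(repeats):
        out = tmp * (100000. / pres[:, None, None])
    return out

if __name__ == "__main__":
    tmp = 273 + 20 * np.random.random([4, 17, 73, 144])
    pres = np.linspace(100000., 1000., 17)

    start = time.time()
    serial = [pot(tmp[i], pres) for i in range(tmp.shape[0])]
    print("serial:", time.time() - start)

    start = time.time()
    tasks = [delayed(pot)(tmp[i], pres) for i in range(tmp.shape[0])]
    parallel = dask.compute(*tasks, scheduler="processes", num_workers=4)
    print("dask processes:", time.time() - start)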

Related

Dask: LocalCluster scheduler not using all cores and slower than default threaded scheduler?

I am using dask array to speed up computations on a single machine (either 4-core or 32 core) using either the default "threads" scheduler or the dask.distributed LocalCluster (threads, no processes).
Given that the dask.distributed scheduler is newer and comes with a nice dashboard, I was hoping to use this scheduler. However, I found that the LocalCluster scheduler is slower (by a factor of 2 or more) than the default scheduler. The LocalCluster scheduler also did not fully utilize all requested cores, and occasionally, on the 32-core machine, used just one or a few.
Question: is this expected behavior? If not, what can I do to improve the performance of the LocalCluster scheduler?
Below is the code I used for testing, an example output (run on 4-core machine) and a snapshot of the system monitor following the test.
Code
import numpy as np
import dask.array as da
import dask.distributed
from datetime import datetime
n_threads = 4
n = 40_000

def test(n=40_000, chunk=1000):
    da.random.seed(731)
    x = da.random.random((n, n), chunks=(chunk, chunk))
    y = x + x.T
    z = y[::2, 5000:].mean(axis=1)
    return z
print("Test default threads scheduler (size={}, {} threads)".format(n, n_threads))
start = datetime.now()
result1 = test(n=n).compute(scheduler="threads", num_workers=n_threads)
print("Done in {}".format(datetime.now()-start))
print("Test dask distributed LocalCluster scheduler (size={}, {} threads)".format(n, n_threads))
client = dask.distributed.Client(processes=False, n_workers=1, threads_per_worker=n_threads)
print("Client: ", client)
start = datetime.now()
result2 = test(n=n).compute()
print("Done in {}".format(datetime.now()-start))
client.close()
error = np.mean(np.abs(result1-result2))
print("Mean absolute difference between results: {}".format(error))
Output
>> python test_dask.py
Test default threads scheduler (size=40000, 4 threads)
Done in 0:00:09.872372
Test dask distributed LocalCluster scheduler (size=40000, 4 threads)
Client: <Client: 'inproc://192.168.0.129/32574/1' processes=1 threads=4, memory=16.67 GB>
Done in 0:00:18.028071
Mean absolute difference between results: 0.0
CPU and memory usage (system monitor snapshot, not reproduced here: default threads scheduler from ~43-53 seconds, LocalCluster from ~23-45 seconds)
Numpy workloads typically do work well with many threads as opposed to many processes, because the underlying operations release the GIL, and with threads you minimise memory copies.
The distributed scheduler (i.e., LocalCluster) allows you to choose your mix of processes and threads, and indeed can work in-process too (although this is rarer). See the long list of arguments, particularly n_workers, threads_per_worker and processes. If you have one worker and many threads, you should have something similar to the non-distributed threaded scheduler.
Note, however, that the distributed scheduler is more complicated than the default threaded one. This usually means smarter scheduling, but it always means more per-task overhead/latency. You will notice this when the execution time of each task is very short, which may well be the case for your simple numpy operations.
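One practical mitigation, if you want to keep the LocalCluster and its dashboard, is to use larger chunks so that each task carries more work and the per-task overhead is amortized. A minimal sketch of that idea follows; the chunk size of 4,000 is an arbitrary choice for illustration, not a recommendation.

import dask.array as da
from dask.distributed import Client

if __name__ == "__main__":
    # One worker with several threads approximates the default threaded scheduler.
    client = Client(processes=False, n_workers=1, threads_per_worker=4)

    # Larger chunks mean fewer, longer tasks, so the distributed scheduler's
    # per-task overhead is spread over more work (chunk size chosen arbitrarily).
    x = da.random.random((40_000, 40_000), chunks=(4_000, 4_000))
    y = x + x.T
    z = y[::2, 5000:].mean(axis=1)
    print(z.compute())

    client.close()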

Dask Distributed with Asynchronous Real-time Parallelism

I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit().
I have an existing function some_func that grabs individual documents (say, text files) asynchronously, and I want to take each raw document, extract all words that don't contain a vowel, and store them in a different database. This data processing step is blocking.
Assuming that there are several million documents and the distributed cluster only has 10 nodes with 1 process available (i.e., it can only process 10 documents at a time), how will dask.distributed handle the flow of the documents that it needs to process?
Here is some example code:
import dask.distributed

client = dask.distributed.Client('tcp://1.2.3.4:8786')

def some_func():
    doc = retrieve_next_document_asynchronously()
    client.submit(get_vowelless_words, doc)

def get_vowelless_words(doc):
    vowelless_words = process(doc)
    write_to_database(vowelless_words)

if __name__ == '__main__':
    for i in range(1000000):
        some_func()
Since the processing of a document is blocking and the cluster can only handle 10 documents simultaneously, what happens when 30 other documents are retrieved while the cluster is busy? I understand that client.submit() is asynchronous and returns a concurrent future, but what would happen in this case? Would it hold the documents in memory until one of the 10 cores is available, potentially causing the machine to run out of memory if, say, 1,000 documents are waiting?
What would the scheduler do in this case? FIFO? Should I somehow change the code so that it waits for a core to be available before retrieving the next document? How might that be accomplished?
To use Queues with dask, below is a modified example of dask Queues with a distributed cluster (based on the documentation):
#!/usr/bin/env python
import distributed
from queue import Queue
from threading import Thread

client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())

def increment(x):
    return x + 1

def double(x):
    return 2 * x

input_q = Queue(maxsize=nprocs)
remote_q = client.scatter(input_q)
remote_q.maxsize = nprocs
inc_q = client.map(increment, remote_q)
inc_q.maxsize = nprocs
double_q = client.map(double, inc_q)
double_q.maxsize = nprocs
result_q = client.gather(double_q)

def load_data(q):
    i = 0
    while True:
        q.put(i)
        i += 1

load_thread = Thread(target=load_data, args=(input_q,))
load_thread.start()

while True:
    size = result_q.qsize()
    item = result_q.get()
    print(item, size)
In this case, we explicitly limit the maximum size of each queue to be equal to the number of distributed processes that are available. Otherwise, the while loop will overload the cluster. Of course, you can adjust the maxsize to be some multiple of the number of available processes as well. For simple functions like increment and double, I found that maxsize = 10*nprocs is still reasonable but this will surely be limited by the amount of time that it takes to run your custom function.
When you call submit, all of the arguments are serialized and immediately sent to the scheduler. An alternative would be to both retrieve documents and process them on the cluster (this assumes that documents are globally visible from all workers):
from dask.distributed import fire_and_forget

for fn in filenames:
    doc = client.submit(retrieve_doc, fn)
    process = client.submit(process_doc, doc)
    fire_and_forget(process)
If documents are only available on your client machine and you want to restrict flow then you might consider using dask Queues or the as_completed iterator.
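For the flow-control case, a sketch of the as_completed approach might look like the following. The names my_filenames, retrieve_doc, and process_doc are hypothetical placeholders (the latter two reused from the snippet above), and the window of 20 in-flight documents is an arbitrary choice.

import itertools
from dask.distributed import Client, as_completed

client = Client('tcp://1.2.3.4:8786')

filenames = iter(my_filenames)   # hypothetical iterable of document names
window = 20                      # arbitrary number of documents in flight at once

def submit_one(fn):
    doc = client.submit(retrieve_doc, fn)       # hypothetical helpers, as above
    return client.submit(process_doc, doc)

# Prime the window, then add one new task for each task that finishes,
# so at most `window` documents are being retrieved/processed at any time.
seq = as_completed([submit_one(fn) for fn in itertools.islice(filenames, window)])
for finished in seq:
    try:
        seq.add(submit_one(next(filenames)))
    except StopIteration:
        pass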

Parallelization on cluster dask

I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to use for the groupby; that is, if a set of keys appears in file001.csv, none of those keys appears in any other file.
On the one hand, I can just run
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so, in a sort of delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For every single file I can use this hacky approach:
import numpy as np
import pandas as pd
import multiprocessing as mp

cores = mp.cpu_count()  # number of CPU cores on your system
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)
And, again, I'm not sure if there is an efficient dask way to do so.
You could use a Client (it will run in multiprocess mode by default) and read your data with a certain blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate the optimal blocksize.
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores."
So I think the best way to do it is simply:
import dask.dataframe as dd
from distributed import Client

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will collect everything into a single pandas DataFrame; instead use a dask output method, which keeps the entire process parallel and larger-than-RAM friendly.
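Putting that together, a sketch of ending the pipeline with a dask output method rather than compute could look like the following; f and meta are the user's function and metadata from the question, and the output path is made up.

import dask.dataframe as dd
from distributed import Client

client = Client('cluster_scheduler_path')   # or Client() on a single machine

ddf = dd.read_csv("folder/*")
grouped = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)

# Write each partition out in parallel instead of collecting everything
# into one pandas DataFrame with .compute(); the output path is made up.
grouped.to_parquet("folder/output/", write_index=True)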

DASK - Stopping workers during execution causes completed tasks to be launched twice

I want to use dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)
client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures  # now only reference to the futures is within seq

for future in seq:
    pass  # let future be garbage collected
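As a follow-up, once that loop drains, every batch result already lives in the database, so the final bundling step from the question can be kicked off as an ordinary task afterwards. A sketch, reusing the (hypothetical) function names from the question:

from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures

for future in seq:
    pass  # results are in the database, so each future can be dropped

# All batches are now in the database; run the final bundling step as a
# normal task (it could equally run locally, since its inputs come from
# the database rather than from dask).
client.submit(read_database_and_store_bundled_result_into_s3).result()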

dask.bag, how should I efficiently run several computations over the same data

I'm processing a fair amount of data, and the bulk of the time is spent loading the data and parsing the json/whatever. I'd like to collect simple statistics over the whole dataset using a single scan.
I'd hoped I could use the graph simplification in compute using the following pattern:
parsed = read_text(files).map(parsing)
example_stat_future = parsed.map(foo).frequencies()
another_stat_future = parsed.map(bar).sum()
etc.
example_stat, another_stat = compute(example_stat_future, another_stat_future)
but I see extreme slowdowns when trying that. Here's my example code:
from json import loads, dumps
from time import time
import dask.bag as db

print("Setup some dummy data")
for partition in range(10):
    with open("/tmp/issue.%d.jsonl" % partition, "w") as f_out:
        for i in range(100000):
            f_out.write(dumps({"val": i, "doubleval": i * 2}) + "\n")

print("Running as distinct computations")
loaded = db.read_text("/tmp/issue.*.jsonl").map(loads)
first_val = loaded.pluck("val").sum()
second_val = loaded.pluck("doubleval").sum()

start = time()
first_val.compute()
print("First value", time() - start)
start = time()
second_val.compute()
print("Second value", time() - start)

print("Running as a single computation")
loaded = db.read_text("/tmp/issue.*.jsonl").map(loads)
first_val = loaded.pluck("val").sum()
second_val = loaded.pluck("doubleval").sum()

start = time()
db.compute(first_val, second_val)
print("Both values", time() - start)
And the output
On datasets with millions of items, I've never finished a run before killing it for taking too long.
Setup some dummy data
Running as distinct computations
First value 0.7081761360168457
Second value 0.6579079627990723
Running as a single computation
Both values 37.74176549911499
Is there a common pattern for solving this kind of issue?
Short answer
Import and run the following and things should be faster
from dask.distributed import Client
c = Client()
Make sure you have dask.distributed installed
conda install dask distributed -c conda-forge
# or
pip install dask distributed --upgrade
Although note, you'll have to do this within an if __name__ == '__main__': block at the bottom of the file rather than at the top level:
from ... import ...
if __name__ == '__main__':
    c = Client()
    # proceed with the rest of your dask.bag code
Long answer
Dask has a variety of schedulers. Dask.bag uses the multiprocessing scheduler by default, but could use others just as easily. See this doc for more information.
The multiprocessing scheduler does its work in separate processes, and then brings those results back to the main process when necessary. For simple linear workloads like b.map(...).filter(...).frequencies(), tasks can all be fused into one single task that goes to a process, computes, and then returns just a very small result.
However, when a workload has any sort of forking (such as you describe) the multiprocessing scheduler has to send the data back to the main process. Depending on the data, this can be expensive because we need to serialize the objects as they move between processes. The basic multiprocessing scheduler within Dask has no concept of data locality. Everything is coordinated by the central process.
Fortunately, Dask's distributed scheduler is much smarter and can handle these situations easily. When you run dask.distributed.Client() without any arguments you create a local "cluster" of processes on your computer. There are a variety of other advantages to this; for example if you navigate to https://localhost:8787/status you'll be treated to a running dashboard of all of your computations.
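Concretely, a sketch of the question's single-scan pattern running on a local distributed client (same dummy files as above) might look like this:

from json import loads
import dask.bag as db
from dask.distributed import Client

if __name__ == '__main__':
    # A local "cluster" of worker processes; the client prints the address
    # of the dashboard (usually on port 8787).
    client = Client()

    loaded = db.read_text("/tmp/issue.*.jsonl").map(loads)
    first_val = loaded.pluck("val").sum()
    second_val = loaded.pluck("doubleval").sum()

    # One shared scan over the data: both sums are computed together and only
    # the small aggregated results travel back to the main process.
    total, double_total = db.compute(first_val, second_val)
    print(total, double_total)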
