I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit().
I have an existing function some_func that grabs individual documents (say, a text file) asynchronously, and I want to take the raw document, grab all words that don't contain a vowel, and shove them back into a different database. This data processing step is blocking.
Assuming that there are several million documents and the distributed cluster only has 10 nodes with 1 process each (i.e., it can only process 10 documents at a time), how will dask.distributed handle the flow of the documents that it needs to process?
Here is some example code:
from dask.distributed import Client

client = Client('tcp://1.2.3.4:8786')

def some_func():
    doc = retrieve_next_document_asynchronously()
    client.submit(get_vowelless_words, doc)

def get_vowelless_words(doc):
    vowelless_words = process(doc)
    write_to_database(vowelless_words)

if __name__ == '__main__':
    for i in range(1000000):
        some_func()
Since the processing of a document is blocking and the cluster can only handle 10 documents simultaneously, what happens when 30 other documents are retrieved while the cluster is busy? I understand that client.submit() is asynchronous and returns a concurrent future, but what would happen in this case? Would it hold the documents in memory until one of the 10 cores becomes available, potentially causing the machine to run out of memory if, say, 1,000 documents are waiting?
What would the scheduler do in this case? FIFO? Should I somehow change the code so that it waits for a core to be available before retrieving the next document? How might that be accomplished?
To use Queues with dask, below is a modified example of using dask Queues with a distributed cluster (based on the documentation):
#!/usr/bin/env python

import distributed
from queue import Queue
from threading import Thread

client = distributed.Client('tcp://1.2.3.4:8786')
nprocs = len(client.ncores())

def increment(x):
    return x + 1

def double(x):
    return 2 * x

input_q = Queue(maxsize=nprocs)
remote_q = client.scatter(input_q)
remote_q.maxsize = nprocs
inc_q = client.map(increment, remote_q)
inc_q.maxsize = nprocs
double_q = client.map(double, inc_q)
double_q.maxsize = nprocs
result_q = client.gather(double_q)

def load_data(q):
    i = 0
    while True:
        q.put(i)
        i += 1

load_thread = Thread(target=load_data, args=(input_q,))
load_thread.start()

while True:
    size = result_q.qsize()
    item = result_q.get()
    print(item, size)
In this case, we explicitly limit the maximum size of each queue to be equal to the number of distributed processes that are available. Otherwise, the while loop will overload the cluster. Of course, you can adjust the maxsize to be some multiple of the number of available processes as well. For simple functions like increment and double, I found that maxsize = 10*nprocs is still reasonable but this will surely be limited by the amount of time that it takes to run your custom function.
When you call submit, all of the arguments are serialized and immediately sent to the scheduler. An alternative would be to both get documents and process them on the cluster (this assumes that documents are globally visible from all workers).
from dask.distributed import fire_and_forget

for fn in filenames:
    doc = client.submit(retrieve_doc, fn)
    process = client.submit(process_doc, doc)
    fire_and_forget(process)
If documents are only available on your client machine and you want to restrict flow, then you might consider using dask Queues or the as_completed iterator.
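For example, here is a minimal flow-control sketch using as_completed. It assumes the retrieve_doc/process_doc functions from above and a hypothetical my_filenames() iterable of document names; it keeps roughly one task in flight per core and submits one replacement each time a task finishes:
from dask.distributed import Client, as_completed

client = Client('tcp://1.2.3.4:8786')

filenames = iter(my_filenames())          # assumed iterable of document names
n_slots = sum(client.ncores().values())   # roughly one in-flight document per core

# prime the cluster with one retrieve -> process pipeline per slot
initial = [client.submit(process_doc, client.submit(retrieve_doc, next(filenames)))
           for _ in range(n_slots)]
seq = as_completed(initial)

# each time a task finishes, submit exactly one replacement
for finished in seq:
    try:
        fn = next(filenames)
    except StopIteration:
        continue                          # nothing left to submit; drain the rest
    doc = client.submit(retrieve_doc, fn)
    seq.add(client.submit(process_doc, doc))
This way the client never holds more than a core's worth of pending documents, so memory use on the client stays bounded.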
I have this dask example of a standalone Python script that runs on my desktop, which has 4 CPU cores. It currently takes 0.735 seconds. The goal is to use separate processes on my Linux machine to overcome the limitations of the GIL, etc.
import numpy as np
import dask
from dask import delayed
from dask.distributed import Client
import time

def main():
    if __name__ == "__main__":
        client = Client()
        tmp, pres = setUpData()
        startTime = time.time()
        executeCalc(tmp, pres)
        stopTime = time.time()
        print(stopTime - startTime)

def setUpData():
    temperature = 273 + 20 * np.random.random([4, 17, 73, 144])
    pres = ["1000", "925", "850", "700", "600", "500", "400", "300", "250", "200", "150", "100", "70", "50", "30", "20", "10"]
    level = np.array(pres)
    level = level.astype(float) * 100
    return temperature, level

def executeCalc(tmp, pres):
    potempList = []
    for i in range(0, tmp.shape[0]):
        tmpInstant = tmp[i, :, :, :]
        potempList.append(delayed(pot)(tmpInstant, pres))
    results = dask.compute(potempList, scheduler='processes', num_workers=4)

def pot(tmp, pres):
    potemp = np.zeros((17, 73, 144))
    potemp = tmp * (100000. / pres[:, None, None])
    return potemp

main()
Here is the corresponding serial execution with only trivial modifications and this takes 0.0024 seconds.
def executeCalc(tmp, pres):
    potempList = []
    for i in range(0, tmp.shape[0]):
        tmpInstant = tmp[i, :, :, :]
        potemp = pot(tmpInstant, pres)
Where am I going wrong? At the very least, for this trivial amount of data, the execution times should be identical.
This assumption is where you’re going wrong:
At the very least for this trivial amount of data the execution times should be identical.
Any parallel execution engine does the same amount of work as the serial engine, plus a lot of additional overhead. So you need enough parallelizable work to justify the time required to spin up the scheduler, web server, and worker processes, transmit the job and results across workers, serialize and deserialize all inputs and outputs, monitor and manage job and worker state, etc.
To see the benefits of dask, you need a task which is many orders of magnitude larger than this.
Take a look at the dask docs on best practices. I’m paraphrasing, but the first recommendation is don’t use dask unless you have to.
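To get a feel for where the crossover lies, one option is to rerun both versions on a much bigger input and compare. This is only a sketch (the 100x larger array and the worker count are arbitrary choices, not recommendations), and whether Dask wins still depends on how much per-task serialization your real function incurs:
import time
import numpy as np
import dask
from dask import delayed

def pot(tmp, pres):
    return tmp * (100000. / pres[:, None, None])

pres = np.linspace(100000., 1000., 17)                  # stand-in pressure levels
tmp = 273 + 20 * np.random.random([400, 17, 73, 144])   # ~100x the original workload

start = time.time()
serial = [pot(tmp[i], pres) for i in range(tmp.shape[0])]
print("serial:", time.time() - start)

start = time.time()
parallel, = dask.compute([delayed(pot)(tmp[i], pres) for i in range(tmp.shape[0])],
                         scheduler='processes', num_workers=4)
print("dask:  ", time.time() - start)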
I have a function which uses image processing functions that are themselves multithreaded. I distribute many of those function calls on a dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk

def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store the resulting image
    # SimpleITK is already multithreaded
    return 'someresult'

# [...]
from distributed import Client

client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use, and dask should decide how many parallel function invocations are started on each node, given the number of available threads. That is, is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by some function, and it doesn't attempt to measure this either.
The only thing I can think of that you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs in order to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
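As a rough sketch of what that might look like (the "CPU" label and the counts are arbitrary examples, and this assumes each worker was started on the CLI with a matching --resources declaration):
from distributed import Client

# assumes each worker was started with something like:
#   dask-worker tcp://scheduler:8786 --nthreads 1 --resources "CPU=8"
client = Client('tcp://scheduler:8786')

# declare that each call to func consumes 4 "CPU" units, so at most two
# invocations run concurrently on a worker advertising CPU=8
futures = client.map(func, ['somefile', 'someotherfile'],
                     resources={'CPU': 4})
results = client.gather(futures)
Note that the resource accounting is purely declarative: Dask does not check what the function actually does with its threads.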
I'm looking for the best way to parallelize on a cluster the following problem. I have several files
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to use for the groupby; that is, if a set of keys is in file001.csv, none of those keys has an item in any other file.
On one side, I can just run:
import dask.dataframe as dd

df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so in a sort of delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For every single file I can use this hacky way:
import numpy as np
import pandas as pd
import multiprocessing as mp

cores = mp.cpu_count()  # number of CPU cores on your system
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)
And, again, I'm not sure if there is an efficient dask way to do so.
You could use a Client (it will run in multi-process mode by default) and read your data with a certain blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate an optimal blocksize, for example as sketched below.
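A rough sketch of that calculation (the assumed total size and the two-partitions-per-core heuristic are arbitrary choices, not dask defaults):
from distributed import Client
import dask.dataframe as dd

client = Client('cluster_scheduler_path')        # placeholder scheduler address

total_cores = sum(client.ncores().values())      # cores summed over all workers
total_bytes = 5 * 10**9                          # assumed total size of folder/*.csv
# aim for roughly two partitions per core, but never tiny blocks
blocksize = max(total_bytes // (2 * total_cores), 64 * 2**20)

ddf = dd.read_csv("folder/*", blocksize=blocksize)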
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores."
So I think the best way to do it is simply:
from distributed import Client
import dask.dataframe as dd

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will result in a single pandas DataFrame; instead, use a dask output method (for example, writing the result out as sketched below) to keep the entire process parallel and larger-than-RAM compatible.
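A minimal sketch of that final step, assuming the new_ddf from above (the output path is a placeholder, and writing parquet requires an engine such as pyarrow or fastparquet):
# triggers execution but writes one file per partition on the workers,
# instead of collecting everything into a single pandas DataFrame
new_ddf.to_parquet("folder/output_parquet")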
This is a follow-up question to a potential answer to one of my previous questions on using Dask compute to access one element in a large array.
Why does using Dask compute cause the execution to hang below?
Here's the working code snippet:
# Suppose you created a scheduler at the IP address 111.111.11.11:8786
from dask.distributed import Client
import dask.array as da

# client1
client1 = Client("111.111.11.11:8786")
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 size chunks
x = x.persist()
client1.publish_dataset(x=x)

# client2
client2 = Client("111.111.11.11:8786")
x = client2.get_dataset('x')  # get the lazy collection x
result = x[0].compute()       # code execution hangs here
print(result)
persist behaves differently, depending on whether you have a distributed client active or not. In your case, you call it before making any client, with the result that the whole of the data is packed into the graph description. This behaviour is OK on the threaded scheduler, where memory is shared between workers, but when you publish, you are sending the whole thing to the scheduler, and apparently it is choking.
If you make client1 first, you will notice that persist happens very quickly (the scheduler is only getting pointers to the data in this case), and the publish-fetch cycle will work as expected.
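For reference, a minimal sketch of the ordering described above, with comments marking what happens at each step (the scheduler address is the same placeholder used in the question):
from dask.distributed import Client
import dask.array as da

# create the client *before* persisting, so the chunks are computed and
# held on the workers rather than packed into the graph description
client1 = Client("111.111.11.11:8786")

x = da.ones(10000000, chunks=(100000,))
x = x.persist()               # returns almost immediately
client1.publish_dataset(x=x)  # publishes keys/metadata, not the data itself

# in the second session, the fetch-and-compute now works as expected
client2 = Client("111.111.11.11:8786")
x = client2.get_dataset('x')
print(x[0].compute())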
I want to use dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')
futures = client.map(process_batch_and_store_results_in_database, my_batches())

seq = as_completed(futures)
del futures  # now the only reference to the futures is within seq

for future in seq:
    pass  # let each future be garbage collected