I'm distributing the computation of some functions using Dask. My general layout looks like this:
from dask.distributed import Client, LocalCluster, as_completed

cluster = LocalCluster(processes=config.use_dask_local_processes,
                       n_workers=1,
                       threads_per_worker=1,
                       )
client = Client(cluster)
cluster.scale(config.dask_local_worker_instances)

fcast_futures = []
# For each group, submit the work
for group in groups:
    fcast_futures.append(client.submit(_work, group))

# Wait until the work is done
for done_work in as_completed(fcast_futures, with_results=False):
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
My issue is that for a large number of jobs I tend to hit memory limits. I see a lot of:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 1.15 GB -- Worker memory limit: 1.43 GB
It seems that each future isn't releasing its memory. How can I trigger that? I'm using dask==1.2.0 on Python 2.7.
Results are held by the scheduler for as long as a future on a client points to them. Memory is released when (or shortly after) the last future is garbage-collected by Python. In your case you are keeping all of your futures in a list throughout the computation. You could try modifying your loop:
for done_work in as_completed(fcast_futures, with_results=False):
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
    done_work.release()
or replacing the as_completed loop with something that explicitly removes futures from the list once they have been processed.
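A minimal sketch of that second approach, reusing the names from your snippet: build the as_completed iterator first, then drop the list, so each future can be garbage-collected as soon as it has been handled.

seq = as_completed(fcast_futures, with_results=False)
del fcast_futures  # the iterator now holds the only references to the futures

for done_work in seq:
    try:
        result = done_work.result()
    except Exception as error:
        log.exception(error)
    # nothing else references done_work, so it can be collected on the next iteration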
Related
I am working with Dask on a distributed cluster, and I noticed a peak in memory consumption when getting the results back to the local process.
My minimal example consists of instantiating the cluster and creating a simple array of ~1.6 GB with dask.array.arange.
I expected the memory consumption to be around the array size, but I observed a memory peak of around 3.2 GB.
Is there any copy made by Dask during the computation? Or does JupyterLab need to make a copy?
import dask.array
import dask_jobqueue
import distributed
cluster_conf = {
    "cores": 1,
    "log_directory": "/work/scratch/chevrir/dask-workspace",
    "walltime": '06:00:00',
    "memory": "5GB"
}
cluster = dask_jobqueue.PBSCluster(**cluster_conf)
cluster.scale(n=1)
client = distributed.Client(cluster)
client
# 1.6 G in memory
a = dask.array.arange(2e8)
%load_ext memory_profiler
%memit a.compute()
# peak memory: 3219.02 MiB, increment: 3064.36 MiB
What happens when you do compute():
- the graph of your computation is constructed (this is small) and sent to the scheduler
- the scheduler has the workers produce the pieces of the array, which should total about 1.6 GB on the workers
- the client constructs an empty array for the output you are asking for, knowing its type and size
- the client receives bunches of bytes across the network or IPC from each worker that holds pieces of the output; these are copied into the client's output array
- the complete array is returned to you
You can see that the penultimate step here necessarily requires duplication of data. The original byte buffers may eventually be garbage-collected later.
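If the goal is to avoid that peak rather than just explain it, one option is to aggregate on the workers so that only a small result crosses the wire. A minimal sketch, using the same array as above and a sum as a stand-in for whatever reduction actually makes sense for your workload:

import dask.array

a = dask.array.arange(2e8)

# Only a single float travels from the workers to the client, so the
# local process never holds the full 1.6 GB plus a transient copy.
total = a.sum().compute()

If you genuinely need the full array locally, the transient duplication during the gather is expected; the extra buffers are freed once the received bytes have been copied into the output array.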
I have a function which uses image-processing routines that are themselves multithreaded. I distribute many calls to that function on a Dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The python code looks similar to this:
import SimpleITK as sitk
def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'
# [...]
from distributed import Client
client = Client('tcp://scheduler:8786')
futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one, in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain is not so high. Instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usages by hand.
Ideally, I would like to set an arbitrary number of cores each function can use, and Dask should decide how many parallel function invocations are started on each node, given the number of available threads. I.e., is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by a function, nor does it attempt to measure this.
The closest thing I can think of is Dask's abstract resources, where you declare how much of each labelled quantity is available per worker and how much each task needs in order to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
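A hedged sketch of how that could look with CLI-started workers: advertise an abstract "threads" resource sized to the cores on each node (the label and the 16-core figure are arbitrary examples), then declare how many units each call to func consumes so the scheduler limits concurrency accordingly.

# On each 16-core worker node, advertise 16 units of an abstract resource
# and allow up to four concurrent tasks:
#   dask-worker tcp://scheduler:8786 --nthreads 4 --resources "threads=16"

from distributed import Client

client = Client('tcp://scheduler:8786')

# Each invocation of func is declared to consume 4 of the 16 "threads",
# so at most four invocations run at once on each worker.
futures = client.map(func, ['somefile', 'someotherfile'],
                     resources={'threads': 4})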
I'm reading about SQS and message queues and I'm wondering what happens when there is a fast producer and a slow consumer. Where is the message buildup stored? Is it kept in memory until it eventually overflows and the server crashes? Is the underlying problem that memory is used as the queue, and memory is finite?
If you continuously consume messages more slowly than they are produced, messages in SQS (or whichever other messaging system you choose) will typically expire and be lost. A slow consumer can cause memory problems if the producer and consumer share an in-memory queue in the same process, and in that case you may run out of memory and crash, but that is less typical in a message-based architecture. More commonly, messages are stored on disk, but there is usually still a finite retention period.
[Edit]
If you use an unbounded in-memory queue, you can run into issues; take the following example:
from threading import Thread
import time
from multiprocessing import Queue

class SlowConsumer(Thread):
    def __init__(self, queue: Queue):
        self.queue = queue
        Thread.__init__(self)

    def run(self):
        while True:
            v = self.queue.get()
            print(f"Processing #{v}")
            time.sleep(1)

q = Queue()
c = SlowConsumer(q)
c.start()

i = 0
while True:
    q.put(i)
    i += 1
    time.sleep(.1)
The consumer will always fall further and further behind the producer, and in the end the process will exhaust available memory. This is a pathological case which should be avoided, generally by using a bounded queue. Note: effectively the same problem can exist with external queues that support limitless retention, such as Kafka, except that we won't run out of memory, we will run out of disk. For on-disk queues it is also best to set a retention period, if a sensible one exists.
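A minimal sketch of the bounded variant, keeping the same toy producer and consumer but using the standard-library queue.Queue (which also supports a maxsize): once the buffer is full, put() blocks, so the producer is throttled to the consumer's pace instead of growing memory without limit.

from queue import Queue  # thread-safe, standard-library queue
from threading import Thread
import time

q = Queue(maxsize=10)  # bounded: at most 10 items buffered at any time

def consume():
    while True:
        v = q.get()
        print(f"Processing #{v}")
        time.sleep(1)

Thread(target=consume, daemon=True).start()

i = 0
while True:
    q.put(i)   # blocks while the queue already holds 10 items (back-pressure)
    i += 1
    time.sleep(.1)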
I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images back to the file system. However, the Dask workers fail due to a MemoryError.
Is there a way to ensure the Dask workers don't load too many images into memory? i.e. wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers with 4 cores, 15 GB of RAM, running CentOS 7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3 GB is required for the whole process.
I tried to process a smaller number of images and it works great.
EDITED
from dask.distributed import Client, LocalCluster

# LocalCluster is used to show the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))

paths = ['list', 'of', 'paths']

# Read the file data from each path
data = client.map(read, paths, resources={'process': 1})

# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})

# Save the processed data
data = client.map(save, data, resources={'process': 1})

# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously on the different workers.
EDIT:
My issue is that all tasks get assigned to workers and the workers don't have enough memory. I found how to limit the number of tasks a worker handles at a single moment ([see here](https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process)).
However, with that limit, when I execute my tasks they all finish the read step, then the process step, and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every task finish before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and that only once they are finished. This is because Dask simply executes a Python function and then waits for it to complete, and all sorts of things can happen within a Python function. You should generally expect as many tasks to begin as you have available worker cores, as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers, so that if all of them are using the maximum memory that you can expect, you still have some spare in the system to cope.
To EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
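Another way to get the per-image read → process → save ordering, without relying on graph optimization, is to wrap the whole chain in a single function and map that instead. A minimal sketch, reusing the (hypothetical) read, foo, save, n and paths from the question:

def pipeline(path):
    # One task per image: read it, process it n times, save it.
    # Intermediate images never sit around waiting for other stages.
    img = read(path)
    for _ in range(n):
        img = foo(img)
    save(img)
    return path

futures = client.map(pipeline, paths, resources={'process': 1})
client.gather(futures)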
I want to use Dask to process some 5000 batch tasks that store their results in a relational database, and after they are all completed I want to run a final task that will query the database and generate a result file (which will be stored in AWS S3).
So it's more or less like this:
from dask import bag, delayed
from dask.distributed import Client

batches = bag.from_sequence(my_batches())
results = batches.map(process_batch_and_store_results_in_database)
graph = delayed(read_database_and_store_bundled_result_into_s3)(results)

client = Client('the_scheduler:8786')
client.compute(graph)
And this works, but: Near the end of processing, many workers are idle and I would like to be able to turn them off (and save some money on AWS EC2), but if I do that, the scheduler will "forget" that those tasks were already completed and try to run them again on the remaining workers.
I understand that this is actually a feature, not a bug, as Dask is trying to keep track of all the results before starting read_database_and_store_bundled_result_into_s3, but: Is there any way that I can tell dask to just orchestrate the distributed processing graph and not worry about state management?
I recommend that you simply forget the futures after they complete. This solution uses the dask.distributed concurrent.futures interface rather than dask.bag. In particular it uses the as_completed iterator.
from dask.distributed import Client, as_completed

client = Client('the_scheduler:8786')

futures = client.map(process_batch_and_store_results_in_database, my_batches())
seq = as_completed(futures)
del futures  # now the only reference to the futures is within seq

for future in seq:
    pass  # let each future be garbage-collected
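Once the loop has drained, all batches have finished and their futures have been released. Assuming read_database_and_store_bundled_result_into_s3 can be called without the in-memory results (everything it needs is already in the database), the final step can then be submitted as an ordinary task:

# All batches are done; their results live in the database, so the
# final task takes no Dask inputs.
final = client.submit(read_database_and_store_bundled_result_into_s3)
final.result()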