Loading large datasets with Dask - HDF5

I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with HDF5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
c = Client(ip_of_scheduler)  # address of the running scheduler
import dask.array as da
import h5py
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[list(hf.keys())[0]]  # first dataset in the file (keys() is a view in Python 3)
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x)  # takes 40 minutes; throughput far below network and filesystem capabilities
print(x[300000, :, :].compute())  # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file formats, and have seen similar performance degradation as file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?

Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
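For example, a minimal sketch of worker-side loading with dask.delayed and from_delayed; the dataset key, first-axis length, trailing dimensions, and dtype here are assumptions for illustration, not taken from the question:
import dask
import dask.array as da
import h5py

def load_chunk(path, key, start, stop):
    # runs on a worker: each task opens the file and reads only its own slice
    with h5py.File(path, "r") as f:
        return f[key][start:stop]

path, key = "path_to_600GB_hdf5_file", "mydata"  # "mydata" is a hypothetical key
n_rows, chunk = 600_000, 100                     # assumed first-axis length
blocks = [
    da.from_delayed(
        dask.delayed(load_chunk)(path, key, i, min(i + chunk, n_rows)),
        shape=(min(i + chunk, n_rows) - i, 64, 64),  # trailing dims assumed
        dtype="float64",
    )
    for i in range(0, n_rows, chunk)
]
x = da.concatenate(blocks, axis=0)  # each worker reads Lustre directly; nothing funnels through the client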
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between aggregate bandwidth to your shared filesystem (Lustre, in your case) and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what is taking so much time. You might want to read through the documentation on understanding performance, and the section on the dashboard in particular; that section has a video that might be especially helpful. I would ask two questions (a small capture sketch follows the list):
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
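To gather that evidence programmatically, here is a minimal sketch that records the task stream around the slow persist, reusing c and x from the question:
from dask.distributed import get_task_stream, wait

with get_task_stream(plot="save", filename="task-stream.html") as ts:
    x = c.persist(x)
    wait(x)  # block until the persist actually finishes

# each record names the worker that ran a task and its start/stop timings
for record in ts.data[:5]:
    print(record["key"], record["startstops"])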

Related

Memory issue in Dask when using local cluster

I'm trying to use a Dask local cluster to manage system-wide memory usage:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273, memory_limit='4GB')
I connect with:
client = Client('tcp://127.0.0.1:5272')
I have 8 cores and 32 GB of RAM. The local cluster distributes 4GB * 4 = 16GB of memory to the workers (I have another task that needs about 10GB of memory). Previously there were some tasks I could finish without calling client = Client('tcp://127.0.0.1:5272'); after I call it, a memory error is triggered. What can I do in this scenario? Thanks!
I wonder if it is because each worker is only allocated 4GB of memory... but if I assign memory_limit='16GB' and all four workers use their full allocation, that would take 64GB, and I don't have that much memory. What can I do?
It's not clear what you are trying to achieve, but your observation about memory is correct: if a worker is constrained by memory, it won't be able to complete the task. What are the ways out of this? (A small configuration sketch follows the list.)
getting access to more resources: if you don't have access to additional hardware, you can check coiled.io or look into the various Dask cloud options
optimizing your code: perhaps some calculations could be done in smaller chunks, data could be compressed (e.g. with a categorical dtype), or there may be other opportunities to reduce memory requirements (this really depends on the functions, but, say, some internal calculation might be done at lower accuracy with fewer resources)
using all available resources with non-distributed code (which would add some overhead to the resource requirements)
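For instance, if individual tasks need more than 4GB, one middle ground is fewer workers with a larger per-worker limit, keeping the same total budget. A minimal sketch, with the worker counts being assumptions rather than a tested recommendation:
from dask.distributed import Client, LocalCluster

# same 16GB total budget as 4 x 4GB, but each worker can now hold
# larger intermediate results before hitting its memory limit
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=4,
    memory_limit="8GB",
    scheduler_port=5272,
    dashboard_address=":5273",
)
client = Client(cluster)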

Dask Distributed - Plugin for Monitoring Memory Usage

I have a distributed Dask cluster that I send a bunch of work to via Dask Distributed Client.
At the end of sending a bunch of work, I'd love to get a report or something that tells me what was the peak memory usage of each worker.
Is this possible via existing diagnostics tools? https://docs.dask.org/en/latest/diagnostics-distributed.html
Thanks!
Specifically for memory, it's possible to extract information from the scheduler (while it's running) using client.scheduler_info() (this can be dumped as JSON). For peak memory you would need an extra function that compares the current usage with the previous maximum and keeps the larger value.
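For example, a minimal polling sketch, assuming the default layout of scheduler_info(), where each worker entry carries a metrics['memory'] field:
import time
from collections import defaultdict

def track_peak_memory(client, interval=1.0, duration=60.0):
    """Poll scheduler_info() and keep the max memory seen per worker."""
    peak = defaultdict(int)
    deadline = time.time() + duration
    while time.time() < deadline:
        for addr, info in client.scheduler_info()["workers"].items():
            mem = info.get("metrics", {}).get("memory", 0)
            peak[addr] = max(peak[addr], mem)
        time.sleep(interval)
    return dict(peak)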
For a lot of other useful information, but not the peak memory consumption, there's the built-in report:
from dask.distributed import performance_report

with performance_report(filename="dask-report.html"):
    ...  # some dask computation
(code from the documentation: https://docs.dask.org/en/latest/diagnostics-distributed.html)
Update: there is also a dedicated plugin for dask to record min/max memory usage per task: https://github.com/itamarst/dask-memusage
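A minimal sketch of wiring that plugin in, based on my reading of its README (check the repo for the current API; it also wants single-threaded workers for accurate numbers):
from dask.distributed import Client, LocalCluster
import dask_memusage

# the plugin samples per-process memory, so use one thread per worker
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_memusage.install(cluster.scheduler, "memusage.csv")  # writes per-task min/max rows
client = Client(cluster)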
Update 2: there is a nice blog post with code to track memory usage by dask: https://blog.dask.org/2021/03/11/dask_memory_usage

Best practice on data access with remote cluster: pushing from client memory to workers vs direct link from worker to data storage

Hi, I am new to Dask and cannot seem to find relevant examples on this topic. I would appreciate any documentation or help.
The example I am working with is pre-processing of an image dataset in the Azure environment with the dask_cloudprovider library; I would like to speed up processing by dividing the work across a cluster of machines.
From what I have read and tested, I can:
(1) load the data into memory on the client machine and push it to the workers:
# pseudocode: load the data into an array and send it to the workers
# through a delayed function
(2) establish a link between every worker node and the data storage (see the function below), and access the data at the worker level:
import adlfs
import cv2
import imageio

def get_remote_image(img_path):
    ACCOUNT_NAME = 'xxx'
    ACCOUNT_KEY = 'xxx'
    CONTAINER = 'xxx'
    abfs = adlfs.AzureBlobFileSystem(account_name=ACCOUNT_NAME,
                                     account_key=ACCOUNT_KEY,
                                     container_name=CONTAINER)
    file = abfs.cat(img_path)
    image = imageio.core.asarray(imageio.imread(file, "PNG"))
    return cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
What I would like to know more about is whether there are any best practices for accessing and working on data on a remote cluster with Dask.
If you were to try version 1), you would first see warnings saying that sending large delayed objects is a bad pattern in Dask, and makes for large graphs and high memory use on the scheduler. You can send the data directly to workers using client.scatter, but it would still be essentially a serial process, bottlenecking on receiving and sending all of your data through the client process's network connection.
The best practice and canonical way to load data in Dask is for the workers to do it. All the built in loading functions work this way, and is even true when running locally (because any download or open logic should be easily parallelisable).
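For instance, here is a minimal sketch of worker-side loading built on the get_remote_image function above; the paths, image shape, and dtype are assumptions for illustration:
import dask
import dask.array as da

paths = ["container/img0.png", "container/img1.png"]  # hypothetical blob paths

# each delayed call runs on a worker, which fetches its own image
lazy_images = [dask.delayed(get_remote_image)(p) for p in paths]
arrays = [
    da.from_delayed(img, shape=(512, 512, 3), dtype="uint8")  # assumed shape
    for img in lazy_images
]
stack = da.stack(arrays)           # lazy (n, 512, 512, 3) array
result = stack.mean().compute()    # workers load and reduce; client receives a scalar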
This is also true for the outputs of your processing. You haven't said what you plan to do next, but to grab all of those images to the client (e.g., .compute()) would be the other side of exactly the same bottleneck. You want to reduce and/or write your images directly on the workers and only handle small transfers from the client.
Note that there are examples out there of image processing with dask (e.g., https://examples.dask.org/applications/image-processing.html ) and of course a lot about arrays. Passing around whole image arrays might be fine for you, but this should be worth a read.

How much overhead is there per partition when loading dask_cudf partitions into GPU memory?

PCIE bus bandwidth latencies force constraints on how and when applications should copy data to and from GPUs.
When working with cuDF directly, I can efficiently move a single large chunk of data into a single DataFrame.
When using dask_cudf to partition my DataFrames, does Dask copy partitions into GPU memory one at a time? In batches? If so, is there significant overhead from multiple copy operations instead of a single larger copy?
This probably depends on the scheduler you're using. As of 2019-02-19 dask-cudf uses the single-threaded scheduler by default (cudf segfaulted for a while there if used in multiple threads), so any transfers would be sequential if you're not using some dask.distributed cluster. If you're using a dask.distributed cluster, then presumably this would happen across each of your GPUs concurrently.
It's worth noting that dask.dataframe + cudf doesn't do anything special on top of what cudf would do. It's as though you made many cudf calls in a for loop, or in one for-loop per GPU, depending on the scheduler choice above.
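Conceptually, the single-GPU, single-threaded case behaves like this hypothetical loop (the filenames are illustrative; this is the mental model, not dask_cudf's actual code):
import cudf

filenames = ["part-0.csv", "part-1.csv", "part-2.csv"]  # hypothetical partitions

# one host-to-GPU copy per partition, executed sequentially; the per-partition
# overhead is roughly the same as calling cudf yourself in a loop
partitions = [cudf.read_csv(f) for f in filenames]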
Disclaimer: cudf and dask-cudf are in heavy flux. Future readers probably should check with current documentation before trusting this answer.

Storm process increasing memory

I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8 GB, some with 16 GB, and some with 32 GB of memory). I have set the worker heap size to 6 GB (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of the same bolt.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I have been stuck on this problem for quite a while, so any help would be greatly appreciated.
It seems you have a bottleneck in your topology, i.e., a bolt receives more data than it can process. Thus, the bolt's input queue grows over time, consuming more and more memory.
You can either increase the parallelism of the bottleneck bolt, or enable the fault-tolerance mechanism, which also provides flow control via a limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set the "max spout pending" parameter.
