Dask Distributed - Plugin for Monitoring Memory Usage

I have a distributed Dask cluster that I send a bunch of work to via Dask Distributed Client.
After all that work has been submitted, I'd love to get a report that tells me the peak memory usage of each worker.
Is this possible via existing diagnostics tools? https://docs.dask.org/en/latest/diagnostics-distributed.html
Thanks!

Specifically for memory, it's possible to extract information from the scheduler (while it's running) using client.scheduler_info(), which can be dumped as JSON. For peak memory there would have to be an extra function that compares the current usage with the previous maximum and keeps the larger value.
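For example, a minimal polling sketch along those lines; the "metrics"/"memory" keys below match recent distributed versions but may differ in yours (inspect client.scheduler_info() to confirm), and the interval/duration arguments are arbitrary choices here:
import time

def poll_peak_memory(client, interval=1.0, duration=60.0):
    # Keep the maximum memory reading seen per worker while polling.
    peaks = {}
    deadline = time.time() + duration
    while time.time() < deadline:
        info = client.scheduler_info()
        for addr, worker in info["workers"].items():
            mem = worker.get("metrics", {}).get("memory", 0)
            peaks[addr] = max(peaks.get(addr, 0), mem)
        time.sleep(interval)
    return peaks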
For a lot of other useful information, but not the peak memory consumption, there's the built-in report:
from dask.distributed import performance_report

with performance_report(filename="dask-report.html"):
    ...  # some dask computation goes here
(code from the documentation: https://docs.dask.org/en/latest/diagnostics-distributed.html)
Update: there is also a dedicated plugin for dask to record min/max memory usage per task: https://github.com/itamarst/dask-memusage
Update 2: there is a nice blog post with code to track memory usage by dask: https://blog.dask.org/2021/03/11/dask_memory_usage
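If you prefer to roll your own plugin instead, here is a minimal sketch of a custom WorkerPlugin, assuming psutil is installed on every worker; the plugin name "peak-memory" and sampling only at task transitions are choices of this sketch, not something the tools above prescribe:
import psutil
from distributed.diagnostics.plugin import WorkerPlugin

class PeakMemoryPlugin(WorkerPlugin):
    # Track the highest RSS observed on this worker, sampled at task transitions.
    def setup(self, worker):
        self.peak = 0

    def transition(self, key, start, finish, **kwargs):
        rss = psutil.Process().memory_info().rss
        self.peak = max(self.peak, rss)

# Register once, then read the peaks back at the end of the workload:
# client.register_worker_plugin(PeakMemoryPlugin(), name="peak-memory")
# peaks = client.run(lambda dask_worker: dask_worker.plugins["peak-memory"].peak)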

Related

Memory issue in Dask when using a local cluster

I'm trying to use a Dask local cluster to manage system-wide memory usage:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273, memory_limit='4GB')
I connect with:
client = Client('tcp://127.0.0.1:5272')
I have 8 cores and 32 GB of RAM. The local cluster distributes 4GB * 4 = 16GB of memory across the workers (I have another task that requires about 10GB of memory). However, there are some tasks I could previously finish fine without calling client = Client('tcp://127.0.0.1:5272'); after I call it, a memory error is triggered. What can I do in this scenario? Thanks!
I'm wondering whether it is because each worker is only allocated 4GB of memory. But if I set memory_limit='16GB' and the workers use all of it, that would take 64GB, and I don't have that much memory. What can I do?
It's not clear what you are trying to achieve, but your observation on memory is correct: if a worker is constrained by memory, it won't be able to complete the task. What are the ways out of this?
getting access to more resources; if you don't have access to additional hardware, you can check coiled.io or look into the various Dask cloud options
optimizing your code; perhaps some calculations could be done in smaller chunks, data could be compressed (e.g. categorical dtype), or there are other opportunities to reduce memory requirements (it really depends on the functions, but for example some internal calculation might be done at lower accuracy with fewer resources)
using all available resources with non-distributed code (running a distributed cluster itself adds some overhead to the resource requirements)
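To make the arithmetic from the question concrete, here is a hedged sketch of sizing a LocalCluster so the total stays under 32GB; the worker count and per-worker limit below are illustrative, not a recommendation:
from dask.distributed import Client, LocalCluster

# The per-worker memory_limit multiplies by the number of workers:
# 4 workers x 6GB = 24GB total, leaving headroom on a 32GB machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=2,
                       scheduler_port=5272, dashboard_address=':5273',
                       memory_limit='6GB')
client = Client(cluster)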

diagnosing bandwidth from dask dashboard

This may be a very silly problem, but I cannot diagnose bandwidth from the Dask dashboard. I am under the impression the line is always so low that it is not visible (cf. the screen grab).
Can I use the dashboard to get a value in such a situation?
Yes, there are a few places in the dashboard where bandwidth is mentioned:
workers: the read/write network bandwidth is listed in real time per worker
bandwidth-per-workers: the aggregate bandwidth accumulated per worker pair
bandwidth-per-type: the aggregate bandwidth accumulated per type of data
Some of these are more accessible in the performance_report, which may interest you. See https://docs.dask.org/en/latest/diagnostics-distributed.html#capture-diagnostics
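If you are not sure where the dashboard is being served, the client exposes its URL; a small sketch (the scheduler address below is a placeholder):
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')  # placeholder address
print(client.dashboard_link)  # URL of the dashboard with the bandwidth panels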

Monitor Google Cloud Run memory usage

Is there any built-in way to monitor memory usage of an application running in managed Google Cloud Run instances?
In the "Metrics" page of a managed Cloud Run service, there is an item called "Container Memory Allocation". However, as far as I understand it, this graph refers to the instance's maximum allocated memory (chosen in the settings), and not to the memory actually used inside the container. (Please correct me if I'm wrong.)
In the Stackdriver Monitoring list of available metrics for managed Cloud Run ( https://cloud.google.com/monitoring/api/metrics_gcp#gcp-run ), there also doesn't seem to be any metric related to the memory usage, only to allocated memory.
Thank you in advance.
Cloud Run now exposes a new metric named "Memory Utilization" in Cloud Monitoring; see more details here.
This metric captures the container memory utilization distribution across all container instances of the revision. It is recommended to look at the percentiles of this metric (50th, 95th and 99th) to understand how utilized your instances are.
Currently, there seems to be no way to monitor the memory usage of a Google Cloud Run instance through Stackdriver or on the "Cloud Run" page in the Google Cloud Console.
I have filed a feature request on your behalf, in order to add memory usage metrics to Cloud Run. You can see and track this feature request in the following link.
There is not currently a metric on memory utilization. However, if your service reaches a memory limit, the following log will appear in Stackdriver Logging with ERROR-level severity:
"Memory limit of 256M exceeded with 325M used. Consider increasing the memory limit, see https://cloud.google.com/run/docs/configuring/memory-limits"
(Replace specific numbers accordingly.)
Based on this log message, you could create a Log-based Metric for memory exceeded.

How much overhead is there per partition when loading dask_cudf partitions into GPU memory?

PCIe bus bandwidth and latency force constraints on how and when applications should copy data to and from GPUs.
When working with cuDF directly, I can efficiently move a single large chunk of data into a single DataFrame.
When using dask_cudf to partition my DataFrames, does Dask copy partitions into GPU memory one at a time? In batches? If so, is there significant overhead from multiple copy operations instead of a single larger copy?
This probably depends on the scheduler you're using. As of 2019-02-19 dask-cudf uses the single-threaded scheduler by default (cudf segfaulted for a while there if used in multiple threads), so any transfers would be sequential if you're not using some dask.distributed cluster. If you're using a dask.distributed cluster, then presumably this would happen across each of your GPUs concurrently.
It's worth noting that dask.dataframe + cudf doesn't do anything special on top of what cudf would do. It's as though you called many cudf calls in a for loop, or in one for-loop per GPU, depending on the scheduler choice above.
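To make the scheduler choice explicit, a hedged sketch (LocalCUDACluster from dask_cuda is an assumption about current GPU tooling, not something the answer above prescribes):
import dask

# Single-threaded scheduler: partitions are processed, and hence copied,
# one after another in the local process.
dask.config.set(scheduler="single-threaded")

# Distributed cluster: each worker owns a GPU and copies its own partitions,
# so host-to-device transfers can overlap across GPUs.
# from dask_cuda import LocalCUDACluster
# from dask.distributed import Client
# client = Client(LocalCUDACluster())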
Disclaimer: cudf and dask-cudf are in heavy flux. Future readers probably should check with current documentation before trusting this answer.

Loading large datasets with dask

I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with hdf5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
c = Client(ip_of_scheduler)  # placeholder for the scheduler address
import dask.array as da
import h5py
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[list(hf.keys())[0]]
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x)  # takes 40 minutes, far below network and filesystem capabilities
print(x[300000, :, :].compute())  # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file formats, and have seen similar degradation of performance as the file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
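For instance, a sketch of that pattern applied to the hdf5 example above (the dataset key, dtype, and shapes below are hypothetical): each delayed task opens the file itself, so the read happens on whichever worker runs it rather than passing through the client.
import dask
import dask.array as da
import h5py
import numpy as np

@dask.delayed
def load_rows(path, key, start, stop):
    # Opened on the worker, so the bytes never pass through the client or scheduler.
    with h5py.File(path, 'r') as hf:
        return hf[key][start:stop]

path, key = 'path_to_600GB_hdf5_file', 'some_dataset'        # hypothetical names
n_rows, rest_shape, dtype = 600_000, (100, 100), np.float32  # hypothetical sizes
step = 100
pieces = [
    da.from_delayed(load_rows(path, key, i, min(i + step, n_rows)),
                    shape=(min(i + step, n_rows) - i,) + rest_shape,
                    dtype=dtype)
    for i in range(0, n_rows, step)
]
x = da.concatenate(pieces, axis=0)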
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between distributed bandwidth to your shared filesystem and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what was taking up so much time. You might want to read through the documentation on understanding performance and the section on the dashboard in particular. That section has a video that might be particularly helpful. I would ask two questions:
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
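As one concrete way to capture both views for the persist above, you could wrap it in a performance report; a sketch reusing the c and x names from the question:
from dask.distributed import performance_report, wait

with performance_report(filename="persist-report.html"):
    x = c.persist(x)
    wait(x)  # block until the persist finishes so the report covers all of it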
