I am using Dask distributed. I have a Dask cluster running on AWS, and I would like to shut down workers when they are idle. How do I find out whether a Dask worker is idle? I have access to the client:
from xxxxxx.distributed.ecscluster import EcsCluster
from dask.distributed import Client
cpu_cluster = EcsCluster(workers=1)
client = Client(cpu_cluster)
You can use dask-labextension to access cluster visualizations and identify workers that are idle.
You can also plug into Dask's adaptive scaling APIs, which are designed to spin down idle workers.
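A hedged sketch of both approaches, with a LocalCluster standing in for the EcsCluster in the question (adapt() and scheduler_info() work the same way on any cluster object):

```python
from dask.distributed import Client, LocalCluster

# LocalCluster is a stand-in here; on AWS you would call adapt() on
# your EcsCluster instance instead.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
client = Client(cluster)

# Adaptive mode lets the scheduler add workers under load and retire
# workers that have been idle for a while.
cluster.adapt(minimum=1, maximum=4)

# To check idleness yourself: scheduler_info() reports per-worker
# metrics, including how many tasks each worker is currently executing.
info = client.scheduler_info()
idle = [addr for addr, w in info["workers"].items()
        if w["metrics"]["executing"] == 0]
print(f"{len(idle)} idle worker(s)")

client.close()
cluster.close()
```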
I used AWS EC2 to run everything. In one Jupyter notebook process, I do:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273, memory_limit='4GB')
# client = Client(cluster)
to start a cluster on a specific port. Then, in another process, I connect to it with:
client = Client('tcp://127.0.0.1:5272')
To check the dashboard, I open this in Safari:
http://ec2-1-123-12-123.compute-1.amazonaws.com:5273/status
I can see the dashboard, but when I run some big processing tasks there is no activity in it; it still shows very little CPU usage. Meanwhile I see huge CPU usage in htop while my ddf.compute() is running. Did I miss anything?
I'm trying to use a Dask local cluster to manage system-wide memory usage:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(scheduler_port=5272, dashboard_address=5273, memory_limit='4GB')
I connect with:
client = Client('tcp://127.0.0.1:5272')
I have 8 cores and 32 GB of RAM. The local cluster allocates 4 GB × 4 workers = 16 GB of memory (and I have another task that requires about 10 GB). Previously, there were some tasks I could finish fine without calling client = Client('tcp://127.0.0.1:5272'); after I call it, a memory error is triggered. What can I do in this scenario? Thanks!
I'm wondering whether it is because each worker is only allocated 4 GB of memory. But if I set memory_limit='16GB' and every worker uses its full allowance, that would take 64 GB, and I don't have that much memory. What can I do?
It's not clear what you are trying to achieve, but your observation on memory is correct: if a worker is constrained by memory, it won't be able to complete the task. What are the ways out of this?
Getting access to more resources. If you don't have access to additional hardware, you can check coiled.io or look into the various Dask cloud deployment options.
Optimizing your code. Perhaps some calculations could be done in smaller chunks, data could be compressed (e.g. with a categorical dtype), or there are other opportunities to reduce memory requirements (this really depends on the functions, but for example some internal calculation could be done at lower accuracy with fewer resources).
Using all available resources with non-distributed code (the distributed scheduler adds some overhead to the resource requirements).
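One concrete option on this particular machine is to trade worker count for per-worker memory, so the ~10 GB task fits inside a single worker while the total stays well under 32 GB. A sketch (the numbers come from the question; adjust to taste):

```python
from dask.distributed import Client, LocalCluster

# 2 workers x 12 GB = 24 GB total: leaves headroom on a 32 GB machine
# and lets a single ~10 GB task fit inside one worker's limit.
cluster = LocalCluster(n_workers=2, threads_per_worker=4,
                       memory_limit="12GB")
client = Client(cluster)
workers = client.scheduler_info()["workers"]
print(len(workers), "workers")

client.close()
cluster.close()
```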
We are trying Dask Distributed to run some heavy computations and visualizations for a frontend.
Right now we have one gunicorn worker that connects to an existing Dask Distributed cluster; it uploads the data with read_csv and persists it into the cluster.
I've tried using pickle to save the futures from the persisted dataframe, but it doesn't work.
We want to have multiple gunicorn workers, each with a different client connecting to the same cluster and using the same data, but with the setup above each additional worker uploads a new copy of the dataframe.
It sounds like you are looking for Dask's ability to publish datasets.
A convenient way to do this is to use the client.datasets mapping:
Client 1
client = Client('...')
df = dd.read_csv(...)
client.datasets['my-data'] = df
Client 2..n
client = Client('...') # same scheduler
df = client.datasets['my-data']
In the code below, why does dd.read_csv run on the cluster? I would have expected something like client.read_csv to be needed for work to run on the cluster.
import dask.dataframe as dd
from dask.distributed import Client
client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()
Is it the case that once I create a Client object, all API calls will run on the cluster?
The command dd.read_csv('file.csv', blocksize=1e8) will generate many pd.read_csv(...) tasks, each of which will run on your Dask workers. Each task will open file.csv, seek to some location within that file defined by your blocksize, and read those bytes to create a pandas dataframe. The file file.csv must therefore be visible to every worker.
It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
In addition to the first answer:
Yes, creating a client for a distributed cluster makes that cluster the default scheduler for all subsequent Dask work. You can, however, specify where you would like work to run as follows.
For a specific compute:
df.compute(scheduler='threads')
For a block of code:
with dask.config.set(scheduler='threads'):
    df.compute()
Until further notice:
dask.config.set(scheduler='threads')
df.compute()
See http://dask.pydata.org/en/latest/scheduling.html
I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with hdf5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
c = Client({ip_of_scheduler})
import dask.array as da
import h5py
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[list(hf.keys())[0]]
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x)  # takes 40 minutes, far below network and filesystem capabilities
print(x[300000, :, :].compute())  # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file formats, and have seen similar degradation of performance as the file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between distributed bandwidth to your NFS and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what was taking up so much time. You might want to read through the documentation on understanding performance and the section on the dashboard in particular. That section has a video that might be particularly helpful. I would ask two questions:
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
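Those same diagnostics can also be captured programmatically. For example, get_task_stream records the data behind the dashboard's Task Stream plot; a hedged sketch on a throwaway local client (in your setup you would connect to the existing scheduler instead):

```python
import dask.array as da
from dask.distributed import Client, get_task_stream

client = Client(processes=False)  # throwaway local cluster for the sketch

x = da.random.random((1000, 1000), chunks=(250, 250))

# Record what ran, where, and for how long -- the Task Stream data.
with get_task_stream() as ts:
    x.sum().compute()

print(len(ts.data), "task records")
client.close()
```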