Dask Distributed - Sharing persisted data between multiple clients

We are trying Dask Distributed for some heavy computation and visualization behind a frontend.
Right now we have one gunicorn worker that connects to an existing Dask Distributed cluster; the worker currently uploads the data with read_csv and persists it into the cluster.
I've tried using pickle to save the futures from the persisted dataframe, but that doesn't work.
We want to have multiple gunicorn workers, each with its own client connecting to the same cluster and using the same data, but with more workers each one uploads a new dataframe.

It sounds like you are looking for Dask's ability to publish datasets.
A convenient way to do this is to use the client.datasets mapping:
Client 1
import dask.dataframe as dd
from dask.distributed import Client

client = Client('...')
df = dd.read_csv(...)
client.datasets['my-data'] = df   # publish the dataframe under a name
Client 2..n
client = Client('...')  # same scheduler
df = client.datasets['my-data']   # retrieve the published dataframe
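Equivalently, you can publish and retrieve the dataset with explicit method calls; a minimal sketch, assuming the same scheduler address as above:
from dask.distributed import Client
import dask.dataframe as dd

client = Client('...')                    # publishing client
df = client.persist(dd.read_csv('...'))   # persist first, then publish under a name
client.publish_dataset(my_data=df)

client2 = Client('...')                   # any other client on the same scheduler
df2 = client2.get_dataset('my_data')      # same persisted data, nothing is reloaded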

Related

How to find out if dask worker is idle?

I am using dask distributed with a dask cluster running on AWS. I would like to shut down workers when they are idle. How do I find out whether a dask worker is idle? I have access to the client:
from xxxxxx.distributed.ecscluster import EcsCluster
from dask.distributed import Client
cpu_cluster = EcsCluster(workers=1)
client = Client(cpu_cluster)
You can use dask-labextension to access cluster visualizations and identify workers that are idle.
You can also plug into Dask's adaptive scaling APIs, which are designed to spin down idle nodes.
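For example, a minimal adaptive-scaling sketch, assuming the cluster object supports the standard Cluster.adapt interface (the bounds here are illustrative):
from dask.distributed import Client
cpu_cluster = EcsCluster(workers=1)        # same cluster object as in the question
cpu_cluster.adapt(minimum=0, maximum=4)    # let the scheduler retire workers once they go idle
client = Client(cpu_cluster)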

Best practice on data access with remote cluster: pushing from client memory to workers vs direct link from worker to data storage

Hi, I am new to dask and cannot seem to find relevant examples on the topic in the title. I would appreciate any documentation or help on this.
The example I am working with is pre-processing of an image dataset in the Azure environment with the dask_cloudprovider library; I would like to speed up processing by dividing the work across a cluster of machines.
From what I have read and tested, I can
(1) load the data into memory on the client machine and push it to the workers, or
# pseudo-code for option 1:
# load the data into an array on the client,
# then send it to the workers through a delayed function
(2) establish a link between every worker node and the data storage (see the function below), and access the data at the worker level.
import adlfs
import cv2
import imageio

def get_remote_image(img_path):
    ACCOUNT_NAME = 'xxx'
    ACCOUNT_KEY = 'xxx'
    CONTAINER = 'xxx'
    # each call opens its own connection to Azure Blob Storage
    abfs = adlfs.AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY, container_name=CONTAINER)
    file = abfs.cat(img_path)  # read the raw bytes of the blob
    image = imageio.core.asarray(imageio.imread(file, "PNG"))
    return cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
What I would like to know more about is whether there are any best practices for accessing and working with data on a remote cluster using dask.
If you were to try version 1), you would first see warnings saying that sending large delayed objects is a bad pattern in Dask, and makes for large graphs and high memory use on the scheduler. You can send the data directly to workers using client.scatter, but it would still be essentially a serial process, bottlenecking on receiving and sending all of your data through the client process's network connection.
The best practice and canonical way to load data in Dask is for the workers to do it. All the built-in loading functions work this way, and this is even true when running locally (because any download or open logic should be easily parallelisable).
This is also true for the outputs of your processing. You haven't said what you plan to do next, but to grab all of those images to the client (e.g., .compute()) would be the other side of exactly the same bottleneck. You want to reduce and/or write your images directly on the workers and only handle small transfers from the client.
Note that there are examples out there of image processing with dask (e.g., https://examples.dask.org/applications/image-processing.html ) and of course a lot about arrays. Passing around whole image arrays might be fine for you, but this should be worth a read.
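As an illustration of option (2) done the worker-side way, here is a sketch that reuses the get_remote_image function above; preprocess, write_remote_image, and img_paths are placeholders for your own preprocessing step, output writer, and list of blob paths:
import dask
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')   # placeholder scheduler address

@dask.delayed
def process_one(img_path):
    image = get_remote_image(img_path)        # the worker reads the blob itself
    result = preprocess(image)                # placeholder preprocessing step
    out_path = img_path + '.processed.png'    # placeholder output location
    write_remote_image(out_path, result)      # placeholder: write back to storage from the worker
    return out_path                           # only a small string travels back to the client

out_paths = dask.compute(*[process_one(p) for p in img_paths])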

Best practices for Kafka streams

We have a prediction service written in Python that provides machine learning functionality: you send it a set of data, and it returns an anomaly detection result, a prediction, and so on.
I want to use Kafka Streams to process the real-time data.
There are two options to choose between:
The Kafka Streams jobs only do the ETL work: load the data, apply simple transforms, and save the data to Elasticsearch. A timer then periodically loads data from ES, calls the prediction service to compute results, and saves them back to ES.
The Kafka Streams jobs do everything beyond the ETL as well: after finishing the ETL they send the data to the prediction service, save the computed result to Kafka, and a consumer forwards the result from Kafka to ES.
I think the second way is more real-time, but I don't know whether it is a good idea to run so many prediction tasks inside streaming jobs.
Are there any common patterns or advice for such an application?
Yes, I'd opt for the second option as well.
What you can do is use Kafka as the data pipeline between your ML training module and your prediction module. These modules could very well be implemented with Kafka Streams.
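As a rough illustration of option 2, here is a sketch using the kafka-python client and an HTTP prediction service; the topic names, broker address, and /predict endpoint are all placeholders:
import json
import requests
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('etl-output', bootstrap_servers='localhost:9092')   # placeholder topic/broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')

for message in consumer:
    record = json.loads(message.value)
    # call the prediction service for each ETL-ed record
    response = requests.post('http://predict-service/predict', json=record)  # placeholder URL
    producer.send('predictions', json.dumps(response.json()).encode('utf-8'))
    # a separate consumer forwards the 'predictions' topic to Elasticsearch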

Confusion regarding cluster scheduler and single machine distributed scheduler

In the code below, why is dd.read_csv running on the cluster?
I thought it would have to be something like client.read_csv to run on the cluster.
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()
Is it the case that once I create a client object, all API calls will run on the cluster?
The command dd.read_csv('file.csv', blocksize=1e8) will generate many pd.read_csv(...) commands, each of which will run on your dask workers. Each task will look for the file.csv file, seek to some location within that file defined by your blocksize, and read those bytes to create a pandas dataframe. The file.csv file therefore needs to be present and accessible on every worker.
It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
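For example, a sketch of reading from a cloud object store so that every worker can reach the data; the bucket path is a placeholder, and s3fs must be installed on the workers:
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')                              # same scheduler as in the question
df = dd.read_csv('s3://my-bucket/data/*.csv', blocksize=10e7)    # placeholder bucket; each worker reads its own byte range
result = df.describe().compute()                                 # the per-partition reads run on the workers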
In addition to the first answer:
Yes, creating a client connected to a distributed cluster makes that the default scheduler for all subsequent dask work. You can, however, specify where you would like work to run, as follows:
for a specific compute,
df.compute(scheduler='threads')
for a block of code,
import dask
with dask.config.set(scheduler='threads'):
    df.compute()
until further notice,
dask.config.set(scheduler='threads')
df.compute()
See http://dask.pydata.org/en/latest/scheduling.html
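Conversely, a small sketch of explicitly sending work to the cluster through the client, building on the df from the question:
future = client.compute(df.describe())   # schedule the computation on the cluster workers
result = future.result()                 # block until the workers have finished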

Loading large datasets with dask

I am in an HPC environment with clusters, tightly coupled interconnects, and backing Lustre filesystems. We have been exploring how to leverage Dask to not only provide computation, but also to act as a distributed cache to speed up our workflows. Our proprietary data format is n-dimensional and regular, and we have coded a lazy reader to pass into the from_array/from_delayed methods.
We have had some issues with loading and persisting larger-than-memory datasets across a Dask cluster.
Example with hdf5:
# Dask scheduler has been started and connected to 8 workers
# spread out on 8 machines, each with --memory-limit=150e9.
# File locking for reading hdf5 is also turned off
from dask.distributed import Client
c = Client({ip_of_scheduler})
import dask.array as da
import h5py
hf = h5py.File('path_to_600GB_hdf5_file', 'r')
ds = hf[list(hf.keys())[0]]  # first dataset in the file
x = da.from_array(ds, chunks=(100, -1, -1))
x = c.persist(x) # takes 40 minutes, far below network and filesystem capabilities
print(x[300000, :, :].compute())  # works as expected
We have also loaded datasets (using slicing, dask.delayed, and from_delayed) from some of our own file formats, and have seen similar degradation in performance as the file size increases.
My questions: Are there inherent bottlenecks to using Dask as a distributed cache? Will all data be forced to funnel through the scheduler? Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow? If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
Are there inherent bottlenecks to using Dask as a distributed cache?
There are bottlenecks to every system, but it sounds like you're not close to running into the bottlenecks that I would expect from Dask.
I suspect that you're running into something else.
Will all data be forced to funnel through the scheduler?
No, workers can execute functions that load data on their own. That data will then stay on the workers.
Are the workers able to take advantage of Lustre, or are functions and/or I/O serialized somehow?
Workers are just Python processes, so if Python processes running on your cluster can take advantage of Lustre (this is almost certainly the case) then yes, Dask Workers can take advantage of Lustre.
If this is the case, would it be more effective to not call persist on massive datasets and just let Dask handle the data and computation when it needs to?
This is certainly common. The tradeoff here is between distributed bandwidth to your NFS and the availability of distributed memory.
In your position I would use Dask's diagnostics to figure out what was taking up so much time. You might want to read through the documentation on understanding performance and the section on the dashboard in particular. That section has a video that might be particularly helpful. I would ask two questions:
Are workers running tasks all the time? (status page, Task Stream plot)
Within those tasks, what is taking up time? (profile page)
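For instance, one way to capture those diagnostics for later inspection is a performance report; a minimal sketch wrapped around the persist call from the question:
from dask.distributed import performance_report, wait

with performance_report(filename='persist-report.html'):
    x = c.persist(x)
    wait(x)   # block until the persist finishes so the report covers the whole run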
