Confusion regarding cluster scheduler and single machine distributed scheduler - dask

In the code below, why does dd.read_csv run on the cluster? Shouldn't it be client.read_csv that runs on the cluster?
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()
Is it the case that once I make a Client object, all API calls will run on the cluster?

The command dd.read_csv('file.csv', blocksize=1e8) will generate many pd.read_csv(...) commands, each of which will run on your dask workers. Each task will look for the file.csv file, seek to some location within that file defined by your blocksize, and read those bytes to create a pandas dataframe. The file.csv file therefore needs to be present, at the same path, for every worker.
It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
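For illustration, a minimal sketch of how the blocksize maps onto partitions (assuming a file.csv that every worker can see at the same path):

import dask.dataframe as dd

df = dd.read_csv('file.csv', blocksize=1e8)   # roughly 100 MB per block
print(df.npartitions)                         # one pd.read_csv task per block
print(df.head())                              # only the first block is read here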

In addition to the first answer:
Yes, creating a client connected to a distributed cluster will make that cluster the default scheduler for all following dask work. You can, however, specify where you would like work to run as follows:
for a specific compute,
df.compute(scheduler='threads')
for a block of code,
with dask.config.set(scheduler='threads'):
    df.compute()
until further notice,
dask.config.set(scheduler='threads')
df.compute()
See http://dask.pydata.org/en/latest/scheduling.html
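Putting both points together, a small sketch (reusing the scheduler address and file path from the question, which are only illustrative):

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')           # this cluster is now the default scheduler
df = dd.read_csv('file.csv', blocksize=10e7)  # file.csv must be visible to the workers, as noted above

df.compute()                                  # runs on the cluster

with dask.config.set(scheduler='threads'):
    df.compute()                              # runs locally with the threaded scheduler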

Related

Configuration Dask Distributed

I'm setting up an environment for our data scientists to work on. Currently we have a single node running JupyterHub with Anaconda and Dask installed (2 sockets with 6 cores each and 2 threads per core, with 140 GB RAM). When users create a LocalCluster, the current defaults take all available cores and memory (as far as I can tell). This is okay when done explicitly, but I want the standard LocalCluster to use less than this. Because almost everything we do is
Looking into the config, I see no settings dealing with n_workers, n_threads_per_worker, n_cores, etc. For memory, in dask.config.get('distributed.worker') I see two memory-related options (memory and memory-limit), both specifying the behaviour listed here: https://distributed.dask.org/en/latest/worker.html.
I've also looked at the jupyterlab dask extension, which lets me do all this. However, I can't force people to use jupyterlab.
TL;DR I want to be able to set the following standard configuration when creating a cluster:
n_workers
processes = False (I think?)
threads_per_worker
memory_limit either per worker, or for the cluster. I know this can only be a soft limit.
Any suggestions for configuration are also very welcome.
As of 2019-09-20 this isn't implemented. I recommend raising a feature request at https://github.com/dask/distributed/issues/new, or even a pull request.
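Until such a feature exists, one workaround is to ask users to pass these settings explicitly when constructing their cluster; a minimal sketch with hypothetical values:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,               # hypothetical defaults for a shared node
    threads_per_worker=1,
    processes=False,
    memory_limit='16GB',       # per worker, and only a soft limit
)
client = Client(cluster)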

Using NFS with Dask workers

I have been experimenting with using an NFS shared drive with my user and Dask workers. Is this something that can work? I noticed that Dask created two files in my home directory, global.lock and purge.lock, and did not clean them up when workers were finished. What do these files do?
It is entirely normal to use NFS to host a user's software environment. The files you're seeing are used by a different system altogether.
When Dask workers run low on memory they spill excess data to disk. An NFS share can work here, but it's much nicer to use local disk if available. This is usually configurable with the dask-worker --local-directory option, or with the temporary-directory configuration value.
You can read more about storage issues with NFS and more guidelines here: https://docs.dask.org/en/latest/setup/hpc.html
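For example, a sketch of pointing spilled data at local disk rather than the NFS share (the scheduler address and scratch path are placeholders):

# On each machine, start the worker with a local scratch directory:
#
#   dask-worker scheduler-address:8786 --local-directory /local/scratch/dask
#
# or set the equivalent configuration value before the worker starts:
import dask

dask.config.set({'temporary-directory': '/local/scratch/dask'})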
Yes, Dask can be used with an NFS mount, and indeed you can share configuration/scheduler state between the various processes. Each worker process will use its own temporary storage area. The lock files are safe to ignore, and whether they appear will depend on exactly the workload you are doing.

How can I send to a remote dask-distributed cluster objects whose source code only exists locally?

I have a remote dask-distributed cluster to which I want to send a series of objects to be used during computations. The problem is that the source code defining the classes of those objects only exists locally and, as a consequence, pickling does not work. Is there a way to send those objects without first moving the code onto the cluster?
You can use Client.upload_file to push out small source files to your workers.
Otherwise you will need to ensure that the software environments of your client and workers are the same. Usually people do this by relying on a network file system or on Docker images.
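A sketch of the upload_file route, assuming the local-only code lives in a single file my_classes.py that defines a class MyThing (both names are hypothetical, as is the scheduler address):

from dask.distributed import Client

client = Client('scheduler-address:8786')
client.upload_file('my_classes.py')        # copies the source file to every worker

from my_classes import MyThing             # importable locally and, after the upload, on the workers

futures = client.map(MyThing, range(10))   # construct the objects on the cluster
results = client.gather(futures)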

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1. The rest of the machines are workers.
When I run a Dask program that calls read_csv on that file, it gives me a 'file not found' error.
When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways:
copy the file to every worker; this is obviously wasteful in terms of disc space, but the easiest to achieve
place the file on a networked filesystem (NFS mount, gluster, HDFS, etc.)
place the file on an external storage system such as Amazon S3 and refer to that location
load the data in your local process and distribute it with scatter; in this case the data was presumably small enough to fit in memory, so dask is probably not doing much for you (see the sketch below).
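A sketch of that last option, assuming the file really does fit comfortably in the client's memory (the scheduler address is a placeholder):

import pandas as pd
from dask.distributed import Client

client = Client('scheduler-address:8786')     # the scheduler running on M1

df = pd.read_csv('file.csv')                  # read locally on M1, where the file lives
future = client.scatter(df)                   # ship the data out to the workers
n_rows = client.submit(len, future).result()  # operate on it remotely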

What is recommended solution for monitoring heterogeneous infrastructure?

I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machines (CPU usage, memory usage, I/O, available space)
Extract metrics from SQL Server (probably by running some queries)
Extract information from an external service about processing, i.e. how many processing jobs are currently running and for how long. I am thinking about writing Python scripts, but don't know how to combine them with a monitoring tool
Plot charts and manage alerts; it would also be nice to be able to send not only emails but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, a SQL exporter, and Alertmanager with the ability to send notifications to multiple destinations, but I don't know what to do about the external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up node_exporter and having it scraped by Prometheus, but I don't think it has, e.g., information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there is a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short-lived batch jobs.
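For the custom agent/exporter route, a small sketch using the Python client library, prometheus_client (the job-counting function is a placeholder for a call to your external service):

import time
from prometheus_client import Gauge, start_http_server

RUNNING_JOBS = Gauge('processing_jobs_running',
                     'Number of processing jobs currently running')

def count_running_jobs():
    # placeholder: query the external service's API here
    return 3

if __name__ == '__main__':
    start_http_server(8000)   # exposes metrics at :8000/metrics for Prometheus to scrape
    while True:
        RUNNING_JOBS.set(count_running_jobs())
        time.sleep(15)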
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
