I ported my pandas code to dask (basically replacing pd.read_csv with dd.read_csv). The code applies a series of filters and transformations to the (dask) dataframe. Similarly, I use dask bag/array instead of numpy array-like data. The code works great locally (i.e., on my workstation or on a virtual machine). However, when I run it on a cluster (SGE), my jobs get killed by the scheduler, which reports that memory/CPU usage has exceeded the allocated limit. It looks like dask is trying to consume all the memory and threads available on the node rather than what the scheduler has allocated.
There seem to be two approaches to fix this: (a) set memory/CPU limits for dask from within the code as soon as we load the library (just as we call matplotlib.use("Agg") right after importing matplotlib when we need to set the backend), and/or (b) have dask understand the memory limits set by the scheduler. Could you please help me with this issue? My preference would be to set the memory/CPU limits for dask from within the code.
PS: I understand there is a way to spin up dask workers in a cluster environment and specify these limits, but I am interested in running the code that works great locally on the cluster with minimal additional changes. Thanks!
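Approach (a) could be sketched with dask.distributed, which accepts the limits when the client is created. This is a sketch under assumptions rather than a confirmed fix: it assumes the distributed package is installed, that SGE exports the granted slot count as NSLOTS, and that the job script exports the memory request in a variable called JOB_MEM_LIMIT. That variable name is hypothetical, as are the helper names. The memory limit applies per worker and is policed by Dask itself on a best-effort basis.

```python
import os


def limits_from_env(env=None, default_threads=1, default_memory="2GB"):
    """Translate scheduler-provided variables into LocalCluster keyword arguments."""
    env = os.environ if env is None else env
    # NSLOTS is the slot count SGE exports to the job; fall back to one thread.
    threads = int(env.get("NSLOTS", default_threads))
    # JOB_MEM_LIMIT is a hypothetical variable that the job script would export.
    memory = env.get("JOB_MEM_LIMIT", default_memory)
    return {"n_workers": 1, "threads_per_worker": threads, "memory_limit": memory}


def start_client(env=None):
    """Start a local cluster with explicit limits and return the connected client."""
    # Imported here so limits_from_env stays usable without dask installed.
    from dask.distributed import Client

    return Client(**limits_from_env(env))
```

Calling start_client() once, inside an `if __name__ == "__main__":` guard and before any compute() call, should route later dataframe, bag, and array computations through that client, because a newly created Client registers itself as the default scheduler.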
I'm trying to run a pipeline via dask on a cluster on gcp. The pipeline loads a lot of avro files from cloud storage (~5300 files with around 300MB each) like this
import dask.bag as db
bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point it runs out of memory (I have tried scaling my master node running the scheduler up to 64GB of RAM but that still does not suffice), while the workers are idling. I assume that the problem is that it has to create an excessive amount of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see if things are getting serialised there that are large, or if simply a massive number of tasks are being generated. If the former, it may be solvable, and you could raise an issue.
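To judge whether the sheer number of tasks is the problem, a back-of-the-envelope count can help before inspecting the real graph. The sketch below assumes every file is cut into blocksize-sized pieces, which only approximates how read_avro actually splits files, and it uses the rough figures from the question. The 100 MB alternative is a hypothetical value for comparison. Each transformation applied afterwards adds further tasks per partition.

```python
import math


def estimated_partitions(n_files, file_bytes, blocksize):
    """Rough partition count if every file is split into blocksize-byte chunks."""
    return n_files * math.ceil(file_bytes / blocksize)


# Figures from the question: ~5300 files of ~300 MB each, blocksize=5000000.
small_blocks = estimated_partitions(5300, 300_000_000, 5_000_000)
# Hypothetical alternative for comparison: 100 MB blocks.
large_blocks = estimated_partitions(5300, 300_000_000, 100_000_000)
print(small_blocks, large_blocks)  # 318000 15900
```

On a bag built from a subset of the files, bag.npartitions gives the real figure to compare against this estimate.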
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.
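The sub-batch idea could look roughly like this. It is a sketch, not a tested pipeline: batches and run_in_batches are hypothetical helper names, and process_batch stands for whatever function builds a bag from a list of paths, applies the transformations, and writes the parquet output before returning.

```python
def batches(items, size):
    """Split a list into consecutive sub-lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_batches(paths, size, process_batch):
    """Call process_batch on each sub-batch in turn, one batch at a time."""
    results = []
    for batch in batches(paths, size):
        # process_batch is expected to block until its batch is finished, so
        # the scheduler only ever holds the graph for one batch.
        results.append(process_batch(batch))
    return results
```

As far as I know, db.read_avro accepts a list of paths as well as a glob string, so process_batch can pass its batch straight to it.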
Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason for such being labeled as "Priority"). Meanwhile, the master would run and eventually hit the wall-time, leaving the workers to die when they finally started.
Thinking that the issue might be my requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on github. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or launch new masters every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
I might be wrong on the below, but in my experience with SLURM, Dask itself won't be able to communicate with the SLURM scheduler. There is dask_jobqueue that helps to create workers, so one option could be to launch the scheduler on a low-resource node (that presumably could be requested for longer).
There is a relatively new feature of heterogeneous jobs on SLURM (see https://slurm.schedmd.com/heterogeneous_jobs.html), and as I understand this will guarantee that your workers, scheduler and client launch at the same time, and perhaps this is something that your IT can help with as this is specific to SLURM (rather than dask). Unfortunately, this will work only for non-interactive workloads.
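On the Dask side, the client can at least be told to hold off submitting work until enough workers have connected. A minimal sketch, assuming a dask.distributed Client that is already connected to the scheduler; run_when_ready and work are hypothetical names. This does not pause the wall clock of the scheduler job, it only avoids starting the computation on too few workers.

```python
def run_when_ready(client, n_workers, work, timeout=None):
    """Block until n_workers have joined the scheduler, then run work(client)."""
    # Client.wait_for_workers blocks until at least n_workers are connected;
    # the timeout (in seconds, or None to wait indefinitely) is passed through.
    client.wait_for_workers(n_workers, timeout=timeout)
    return work(client)
```

Here work would be a function that takes the client and runs the actual computation, for example by calling compute() on the collections involved.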
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler. Because my Dask workers were using the maximum possible --time (24 hours), this meant that the backfill scheduler wasn't working effectively. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
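If the workers are launched through dask_jobqueue, the shorter time request can be passed as the walltime argument. The sketch below is an assumption about the setup, since the workers in question may have been submitted by hand: slurm_walltime and make_cluster are hypothetical helpers, and the safety factor, cores, and memory values are placeholders.

```python
import math


def slurm_walltime(estimated_hours, safety_factor=1.5, cap_hours=24):
    """Format an estimated runtime, padded by a safety factor, as HH:MM:SS."""
    # Never ask for more than the cluster's hard limit (24h in this case).
    hours = min(estimated_hours * safety_factor, cap_hours)
    total_minutes = math.ceil(hours * 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}:00"


def make_cluster(estimated_hours, cores=4, memory="16GB"):
    """Create a SLURMCluster whose worker jobs request a short walltime."""
    # Imported here so slurm_walltime can be used without dask_jobqueue installed.
    from dask_jobqueue import SLURMCluster

    # cores and memory are placeholder values, not taken from the question.
    return SLURMCluster(cores=cores, memory=memory, walltime=slurm_walltime(estimated_hours))
```

For a script expected to take about two hours, slurm_walltime(2) gives "03:00:00", which leaves the backfill scheduler far more room than a 24-hour request.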
I have the following basic code which (I thought) should set up xarray to use a LocalCluster.
from dask.distributed import Client
client = Client("tcp://127.0.0.1:46573") # this points to a LocalCluster
import xarray as xr
ds = xr.open_mfdataset('*.nc', combine='by_coords') # Uses dask to defer actually loading data
I now launch some task which also completes with no issues:
(ds.mean('time').mean('longitude')**10).compute()
I noticed that the tabs for the Task graph, Workers, or Task stream (among others) in the dask-labextension for my LocalCluster remain empty. Shouldn't there be some sort of progress displayed while the computation is running?
Which leads me to wonder, how do I tell xarray to explicitly use this cluster? Or is Client a singleton such that xarray only ever has one instance to use anyway?
When you create a Dask Client it automatically registers itself as the default way to run Dask computations.
You can check to see if an object is a Dask collection with the dask.is_dask_collection function. As you say, I believe that xr.open_mfdataset uses Dask by default, but this would be a good way to check.
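These checks can be gathered in one small helper. This is a sketch: describe_setup is a hypothetical name, and it assumes a dask.distributed Client. The dashboard address it reports can be compared with the one entered in dask-labextension to confirm that both point at the same scheduler.

```python
def describe_setup(client, collection=None):
    """Summarise which scheduler a client talks to and whether an object uses Dask."""
    info = {
        # Address of the dashboard served by the scheduler this client uses.
        "dashboard": client.dashboard_link,
        # Number of workers currently known to that scheduler.
        "n_workers": len(client.scheduler_info()["workers"]),
    }
    if collection is not None:
        # Imported lazily; only needed when a collection is passed in.
        import dask

        info["is_dask_collection"] = dask.is_dask_collection(collection)
    return info
```

With the objects from the question, the call would be describe_setup(client, ds).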
As to why you're not seeing anything on the dashboard, I unfortunately don't know enough about your situation to be able to help you there.
I'm setting up Dask, and I can use dask for multiprocessing just fine.
I run into issues, however, when I want to use pre-configured Dask workers. They don't have the same imports as my main process.
I was wondering: how do I add custom imports to Dask workers so that all futures running on those workers can operate effectively?
Ideally your Dask workers should all have the same software environment. Typically this is guaranteed outside of Dask with Docker images or with a Network File System (NFS). There are some other solutions like Client.upload_file, which can be useful for small scripts.
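Before deciding how to fix the environment, it can help to find out which modules the workers are actually missing. A minimal sketch, assuming a connected dask.distributed Client; missing_modules and missing_on_workers are hypothetical helper names. It relies on Client.run, which executes a function on every worker and returns the results keyed by worker address.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this process."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is absent, or for
            # modules with a malformed spec; treat both as missing.
            found = False
        if not found:
            missing.append(name)
    return missing


def missing_on_workers(client, names):
    """Run the check on every worker; the result maps worker address to a list."""
    return client.run(missing_modules, list(names))
```

A single small script that shows up as missing could then be shipped with client.upload_file, while larger gaps are better closed by aligning the environments themselves.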
When using Dask normally things work fine. However, when I use Dask with an adaptive cluster I find that sometimes all the tasks get assigned to a single worker. Why is this?
This should be considered a usability bug, and it would be reasonable to file an issue about it.
However, to explain what is going on (at least as of today, 2018-08-09), what probably happens is this:
1. Your scheduler first has no tasks and so has no workers assigned to it.
2. You submit a lot of work from a client; the scheduler responds and asks for many workers.
3. The first worker arrives and the scheduler hands it all of the work.
4. Milliseconds later, several other workers arrive. The scheduler then proceeds to load balance between the available workers.
Ideally, the load balancing heuristics should handle the situation. There were older versions of Dask where this performed less well, but usually this is fine. I recommend first updating your version of the dask and distributed packages to the newest possible releases and if that doesn't work, report an issue with a minimal example if possible.