Performance impact with dask multiple processes no-nanny - dask

I notice a 5-6x performance degradation when using dask workers with processes only and --no-nanny versus with a nanny. Is this expected behaviour?
I want to run dask without a nanny due to state in the worker. I appreciate that having state in workers is not desirable, but it's beyond my control (third-party library).
Alternatively, if I run dask workers with a nanny, can I capture worker failures/restarts and reinitialise the worker?

A Nanny process just starts a dask-worker process and then watches it, restarting it if it falls over. It should not affect performance at all. If you do not have a nanny, then you cannot capture worker failures or restarts; that is the role of the nanny.
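If you do keep the nanny, one pattern that can cover the reinitialisation part of the question is a worker plugin: its setup hook runs on every worker that joins the cluster, including workers the nanny has restarted after a crash. A minimal sketch, where init_third_party_library is a hypothetical stand-in for whatever stateful setup your library needs and the scheduler address is an example:

    from dask.distributed import Client, WorkerPlugin

    def init_third_party_library():
        # Hypothetical stand-in for the stateful third-party initialisation
        # that has to happen once per worker process.
        return {"initialised": True}

    class ReinitState(WorkerPlugin):
        def setup(self, worker):
            # Called on every worker that starts, including nanny restarts.
            worker.my_state = init_third_party_library()

    client = Client("tcp://scheduler-host:8786")   # example address
    client.register_worker_plugin(ReinitState())   # applies to current and future workers

Depending on your version of distributed, the registration method may be named register_plugin instead, so check the API of the release you are running.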

Related

How to have Dask workers terminate when done?

I can't just shut down the entire cluster like in this answer because there might be other jobs running. I run one cluster in order to avoid having to use Kubernetes. Jobs get submitted to this cluster, but they call into C libraries that leak memory.
The workers run one thread per process, so it would be acceptable to terminate the entire worker process and have it be restarted.
I can't just use os.kill from the task itself because the task's return value has to be propagated back through Dask. I have to get Dask to terminate the process for me at the right time.
Is there any way to do this?
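No answer is recorded here, but two nanny-based mechanisms are worth knowing about: dask-worker has --lifetime and --lifetime-restart options for recycling workers on a schedule, and the nanny will also terminate and restart a worker whose memory use crosses the configured terminate fraction, with Dask recomputing any tasks that were lost with it. A rough sketch of the memory-based variant, using a throwaway local cluster with illustrative thresholds and sizes:

    import dask
    from dask.distributed import Client, LocalCluster

    # Let the nanny restart any worker whose memory use exceeds 80% of its limit.
    # With LocalCluster the workers inherit this config; workers started with
    # dask-worker read the same key from their own dask configuration.
    dask.config.set({"distributed.worker.memory.terminate": 0.8})

    cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                           processes=True, memory_limit="4GB")
    client = Client(cluster)

Note that a memory-triggered kill can interrupt a running task, which Dask then recomputes elsewhere; if that is unacceptable, the scheduled --lifetime restart, which closes the worker gracefully, may be a better fit.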

Dask: Would storage network speed cause a worker to die

I am running a process that writes large files across the storage network. I can run the process with a simple loop and I get no failures. I can also run it with distributed and jobqueue during off-peak hours and no workers fail. However, when I run the same command during peak hours, I get workers killing themselves.
I have ample memory for the task and plenty of workers, so I am not sitting in a queue.
The error logs usually show a bunch of warnings about exceeding garbage collection limits, followed by a worker killed with signal 9.
Signal 9 suggests that the process has violated some system limit, not that Dask itself has decided the worker should die. Since this only happens under heavy disk I/O at busy times, I agree that the network storage is the likely culprit, e.g., a lot of writes have been buffered but are not being flushed over the relatively limited bandwidth.
Dask also uses local storage for temporary files, and "local" might in fact be the network storage. If you have real local disks on the nodes, you should use those, or if not, maybe turn off disk-spilling altogether. https://docs.dask.org/en/latest/setup/hpc.html#local-storage
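The knobs for that, sketched here with a local cluster and placeholder paths, are the worker's temporary directory and the memory-spill settings; the same configuration keys apply to workers launched through jobqueue:

    import dask
    from dask.distributed import Client, LocalCluster

    dask.config.set({
        # Put temporary/spill files on a genuinely local disk (path is an example);
        # dask-worker also has a --local-directory flag for the same purpose.
        "temporary-directory": "/local/scratch/dask",
        # Or disable spilling to disk entirely if there is no local disk:
        "distributed.worker.memory.target": False,
        "distributed.worker.memory.spill": False,
    })

    cluster = LocalCluster(processes=True)
    client = Client(cluster)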

How do I monitor and restart my application running in Docker based on memory usage?

I have an application running in Docker that leaks memory over time. I need to restart this application periodically when memory usage gets over a threshold. My application can respond to signals or by touching tmp/restart.txt (this is Rails)... as long as I can run a script or send a configurable signal when limits are triggered, I can safely shut down/restart my process.
I have looked into limiting memory utilization with Docker, but I don't see a way to trigger a custom action when a limit or reservation is hit. A SIGKILL would not be appropriate for my app... I need some time to clean up.
I am using runit as a minimal init system inside the container, and ECS for container orchestration. This feels like a problem that should be handled at the application or init level... killing a container rather than restarting the process seems heavy-handed.
I have used Monit for this in the past, but I don't like how Monit deals with pidfiles... too often Monit loses control of a process. I am trying out Inspeqtor which seems to fit the bill very well, but while it supports runit there are no packages that work with runit out of the box.
So my question is, if SIGKILL is inappropriate for my use case, what's the best way to monitor a process for memory usage and then perform a cleanup/restart action based on that usage crossing a threshold?
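One low-tech option, if none of those tools fits, is a small watcher script inside the container that polls the app's RSS and triggers your own restart hook when it crosses a threshold. A sketch assuming psutil is available, the app writes a pidfile, and SIGUSR2 (or touching tmp/restart.txt) is the graceful-restart trigger you have configured; all names and numbers are illustrative:

    import os
    import signal
    import time

    import psutil  # assumed to be installed in the container

    PID_FILE = "/run/app.pid"        # example: wherever your app/runit writes its pid
    LIMIT_BYTES = 1_500_000_000      # example threshold (~1.5 GB RSS)
    CHECK_EVERY = 30                 # seconds between checks

    while True:
        try:
            pid = int(open(PID_FILE).read().strip())
            if psutil.Process(pid).memory_info().rss > LIMIT_BYTES:
                # Graceful restart: send the app your configured signal,
                # or touch tmp/restart.txt for a Rails-style restart.
                os.kill(pid, signal.SIGUSR2)
        except (OSError, ValueError, psutil.NoSuchProcess):
            pass  # app not running yet, or pidfile stale; retry next round
        time.sleep(CHECK_EVERY)

Under runit, a watcher like this can itself run as a supervised service alongside the app.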

What threads do Dask Workers have active?

When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask Worker and what are they doing?
Dask workers have the following threads:
A pool of threads in which to run tasks. This is typically somewhere between 1 and the number of logical cores on the computer
One administrative thread to manage the event loop, communication over (non-blocking) sockets, responding to fast queries, the allocation of tasks onto worker threads, etc.
A couple of threads that are used for optional compression and (de)serialization of messages during communication
One thread to monitor and profile the two items above
In addition, by default there is a separate Nanny process that watches the worker. This process has a couple of threads of its own for administration.
These are internal details as of October 2018 and may change without notice.
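If you want to verify the count on your own setup, one quick check (a sketch using a throwaway local cluster) is to ask each worker process directly:

    import threading
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=1, threads_per_worker=2, processes=True)
    client = Client(cluster)

    # Runs on every worker process and reports how many Python threads it has:
    # the task pool plus the administrative threads described above.
    print(client.run(lambda: threading.active_count()))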
People who run into "too many threads" issues are often running tasks that are themselves multi-threaded, and so end up with an N-squared threading problem. Often the solution is to set environment variables like OMP_NUM_THREADS=1, but this depends on the exact libraries you are using.
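As a concrete illustration of that fix, the environment variables need to be in place before the numerical libraries spin up their own thread pools, i.e. before the worker processes import them. A sketch with a local process-based cluster (the variable names are the usual OpenMP/BLAS ones; adjust for your libraries):

    import os

    # Set these before numpy/BLAS/OpenMP are imported in the worker processes,
    # so each task stays single-threaded inside Dask's own thread pool.
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = "1"

    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
    client = Client(cluster)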

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up after dask? killall dask-worker dask-scheduler doesn't seem to do the trick, possibly because dask somehow starts up new processes in their place.
If you start a worker with dask-worker, you will notice in ps that it starts more than one process, because there is a "nanny" responsible for restarting the worker in case it crashes. Also, there may be "semaphore" processes around for communicating between the two, depending on which form of process spawning you are using.
The correct way to stop all of these is to send a SIGINT (i.e., a keyboard interrupt) to the parent process. A KILL signal might not give it the chance to stop and clean up its child process(es). If something (e.g., an ssh hangup) caused a more abrupt termination, or a session never sent a stop signal at all, then you will probably have to grep the output of ps for dask-like processes and kill them all.
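A sketch of that last-resort sweep, using psutil instead of parsing ps by hand (the name matching is approximate and the grace period is arbitrary):

    import signal

    import psutil  # assumed available on the machines you ssh into

    def kill_dask(grace=10):
        # Find anything whose command line looks like a dask scheduler or worker.
        targets = [p for p in psutil.process_iter(["cmdline"])
                   if any(name in " ".join(p.info["cmdline"] or [])
                          for name in ("dask-worker", "dask-scheduler"))]
        for p in targets:
            try:
                p.send_signal(signal.SIGINT)   # like a keyboard interrupt: lets the nanny clean up
            except psutil.NoSuchProcess:
                pass
        # Anything still alive after the grace period gets SIGKILL.
        _, alive = psutil.wait_procs(targets, timeout=grace)
        for p in alive:
            p.kill()

    if __name__ == "__main__":
        kill_dask()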
