My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?
This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example by segfaults or memory errors.
Whenever a worker dies unexpectedly, the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die, the scheduler eventually gives up on retrying that task and instead marks it as failed with the exception KilledWorker.
Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. This is likely a bigger issue than your task failing.
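If you don't have direct access to the machines, the client can fetch recent worker logs for you. A minimal sketch, assuming an already-running cluster (the scheduler address is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    # dict mapping each worker address to its recent log records
    logs = client.get_worker_logs()
    for address, records in logs.items():
        print(address)
        print(records)

    # pass nanny=True to fetch logs from the watchdog processes instead
    nanny_logs = client.get_worker_logs(nanny=True)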
You can control this behavior by modifying the following entry in your ~/.config/dask/distributed.yaml file.
allowed-failures: 3 # number of retries before a task is considered bad
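If you would rather not edit the YAML file, the same setting can be applied from Python before the scheduler starts. A sketch, assuming the current config key path distributed.scheduler.allowed-failures:

    import dask

    # Raise the threshold before creating the cluster/scheduler so that a
    # task must be involved in more worker deaths before it is marked bad
    dask.config.set({"distributed.scheduler.allowed-failures": 10})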
Related
I can't just shut down the entire cluster like in this answer because there might be other jobs running. I run one cluster in order to avoid having to use Kubernetes. Jobs get submitted to this cluster, but they call into C libraries that leak memory.
The workers run one thread per process, so it would be acceptable to terminate the entire worker process and have it be restarted.
I can't just use os.kill from the task itself because the task's return value has to be propagated back through Dask. I have to get Dask to terminate the process for me at the right time.
Is there any way to do this?
We have jobs which interact with native code and there are unavoidable memory leaks while the worker is processing the task. The simple solution for our problems has been to restart the worker after a specified number of tasks.
We are migrating from Python's multiprocessing, which has a useful maxtasksperchild option that shuts down workers after a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers who have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
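For what it's worth, a rough sketch of that workaround might look like the following. The threshold and the bookkeeping are illustrative, not a Dask feature, and whether a retired worker comes back depends on your deployment:

    from collections import Counter
    from dask.distributed import Client, get_worker

    MAX_TASKS_PER_WORKER = 50  # illustrative threshold

    def leaky_task(x):
        result = x * 2  # stand-in for the call into the leaking C library
        # tag the result with the address of the worker that ran it
        return result, get_worker().address

    client = Client("tcp://scheduler-address:8786")  # placeholder address
    task_counts = Counter()

    for future in client.map(leaky_task, range(100)):
        result, address = future.result()
        task_counts[address] += 1
        if task_counts[address] >= MAX_TASKS_PER_WORKER:
            # gracefully remove the tired worker; restarting it afterwards
            # is left to the deployment (this part is an assumption)
            client.retire_workers(workers=[address])
            task_counts.pop(address)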
No, there is no such equivalent in Dask.
When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask Worker and what are they doing?
Dask workers have the following threads:
A pool of threads in which to run tasks. This is typically somewhere between 1 and the number of logical cores on the computer
One administrative thread to manage the event loop, communication over (non-blocking) sockets, responding to fast queries, the allocation of tasks onto worker threads, etc.
A couple of threads that are used for optional compression and (de)serialization of messages during communication
One thread to monitor and profile the two items above
Additionally, by default there is an additional Nanny process that watches the worker. This process has a couple of its own threads for administration.
These are internal details as of October 2018 and may change without notice.
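If you want to see the current set for yourself, you can ask each worker to enumerate its own threads. A small sketch, assuming a running cluster (the scheduler address is a placeholder):

    import threading
    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    def list_threads():
        # names of every live thread inside this worker process
        return sorted(t.name for t in threading.enumerate())

    # client.run executes the function in each worker process and returns
    # a dict of worker address -> result
    print(client.run(list_threads))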
People who run into "too many threads" issues are often running tasks that are themselves multi-threaded, and so get an N-squared threading problem. Often the solution is to set environment variables like OMP_NUM_THREADS=1, but this depends on the exact libraries you're using.
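As a concrete (hedged) example of that fix: the variables can be pinned before the numeric libraries load and before the workers start. Which variables matter depends on your BLAS/OpenMP stack:

    import os

    # one thread per math library, so the worker's own thread pool is the
    # only source of parallelism
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = "1"

    from dask.distributed import Client, LocalCluster

    # workers launched from this process inherit the environment above;
    # other deployments need the variables set wherever workers are started
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)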
Initially I have no delayed_job process running (as indicated by htop). Then, when I run the command RAILS_ENV=production bin/delayed_job start, I get one delayed job worker, as indicated by the files in tmp/pids. However, htop now shows two processes, as in the picture below.
So why is this happening? The other delayed_job process consumes memory, which I don't have much of, yet its TIME+ is zero, so it hasn't consumed any CPU time. What does this mean?
I guess these are actually not two processes but two threads of a single process. You can hide threads by pressing the capital H key in htop. If you then see just one line, that proves it's a single process.
Delayed job probably has a master thread that governs the worker threads (or just the single worker in your setup), watches the queues, and runs the workers when needed. Threads share most of their memory, so I doubt the resource-consumption issue comes from the two lines in htop.
At least once a day my Delayed::Job workers will randomly stop working jobs off the queue, yet the processes are still alive.
Pictured: "Zombies"
When I inspect the remaining jobs in the queue, none show that they are locked or being worked by the zombified workers in question. Even when looking at failed jobs, it's hard to make a definite connection between a failure and the workers going into zombie mode.
I have a theory that a job has an error that causes workers to segfault, but not completely die. Is there any way to inspect a worker process and see what it's doing? How would one go about debugging this issue when there's not even a stacktrace or failed job to inspect?