Long-running workers blocking the GIL cause timeout errors - dask

I'm using dask-distributed with a local setup (a LocalCluster with 5 workers) on a dask.delayed workload. Most of the work is done by the vtk Python bindings. Since vtk is C++-based, I think that means the workers don't release the GIL during a long-running call. When I run the workload, my terminal prints out a bunch of errors like this:
Traceback (most recent call last):
  File "C:\Users\patri\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\comm\core.py", line 221, in connect
    _raise(error)
  File "C:\Users\patri\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\comm\core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:49721' after 10 s: connect() didn't finish in time
My workload continues fine, however: I get a bunch of errors on the command line, but it keeps chugging along. So I think the workers aren't crashing, but the heartbeat communication stops. Since I don't want to mess with vtk internals to release the GIL, how can I fix the errors? I get so many of these benign timeout errors that I can't see any real errors that might occur.

Release the GIL temporarily by sleeping in the VTK event loop thread.
If you are using a vtkRenderWindowInteractor instance, create a repeating timer with a callback that sleeps execution for a bit using time.sleep.
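For concreteness, a minimal sketch of that idea, assuming the long-running VTK work happens inside a vtkRenderWindowInteractor event loop (the 0.1 s sleep and 500 ms timer interval are arbitrary values, not taken from the question):
import time
import vtk

def release_gil(caller, event):
    # A pure-Python sleep releases the GIL for its duration, giving the dask
    # worker's heartbeat/communication threads a chance to run.
    time.sleep(0.1)

render_window = vtk.vtkRenderWindow()
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(render_window)
interactor.Initialize()
interactor.AddObserver("TimerEvent", release_gil)
interactor.CreateRepeatingTimer(500)  # fire the callback roughly every 500 ms
interactor.Start()                    # blocks, but now yields the GIL periodically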

Related

Properly handle timeout on Cloud Run

We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything and it's difficult to debug, because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message:
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadlineExceededError for Python and DeadlineExceededException for Java.
We are currently evaluating the following approach:
1. Explicitly set Cloud Run's request timeout.
2. Provide the same value as an environment variable, so it's available to the container.
3. When receiving a request, start a timer that calls a "cleanup" function just before the timeout is exceeded.
4. The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage.
This feels a little complicated, so any feedback is very much appreciated.
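For concreteness, here is a rough sketch of that timer in Python (our real service is a Fastify app, so the names here - run_analysis, upload_logs, REQUEST_TIMEOUT_S, the Rscript command - are only illustrative placeholders):
import os
import subprocess
import threading

TIMEOUT_S = int(os.environ.get("REQUEST_TIMEOUT_S", "600"))  # mirror of Cloud Run's request timeout
SAFETY_MARGIN_S = 30  # leave enough time to upload logs before the platform kills the request

def upload_logs(log_path):
    # Placeholder for the real Cloud Storage upload.
    pass

def run_analysis(cmd, log_path):
    # Run the analysis, terminating it shortly before the request timeout.
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        timer = threading.Timer(TIMEOUT_S - SAFETY_MARGIN_S, proc.terminate)
        timer.start()
        try:
            proc.wait()
        finally:
            timer.cancel()
    upload_logs(log_path)  # always upload whatever stdout/stderr we have

run_analysis(["Rscript", "analysis.R"], "/tmp/analysis.log")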
Since the default timeout is 5 minutes and can extend up to 60 minutes, I would simply start by increasing this to 10 minutes. Then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected and if it's perhaps due to a forever-growing result set.
If there's no result-set scalability concern, then bumping the timeout up from the 5-minute default seems to be the most reasonable and simple fix. It would only become a problem again if your script has to deal with more data at some point in the future.

Automate bazel shutdown

I am building a large project on a remote machine using Bazel. Clean build times are around 30 minutes. Incremental builds (changing code in 1-2 files) typically take around 10-20 seconds.
The problem I have is that when I log out of my machine and log back in again after 1-2 days the build command takes around 10 minutes even though I have not modified any source code.
If I call bazel shutdown and then call bazel build again the "no-build" op takes around 5-10 seconds (i.e. much better than the other "no-build" op).
If I log out and log back in again immediately I can see there is still a bazel process running in the background, which disappears when I call bazel shutdown. I am guessing that when I do not shut bazel down properly it gets killed in such a way that corrupts or deletes cached data. The long "no-build" op then spends a long time reconstructing data that was previously stored in the Bazel cache.
Is there a way to automatically shut down the bazel server when I am disconnected? Preferably this should work both when (i) I call exit from the command line to log out, and (ii) I get automatically disconnected through some kind of timeout or interruption in network connectivity.
Set up your development environment so that your sessions do not automatically exit or get killed, e.g., by using a tool like screen or tmux. When you want to end a session, call bazel shutdown before exiting. This is not completely automated, but the point is that you should be in control of when your sessions end.

What do KilledWorker exceptions mean in Dask?

My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?
This error is generated when the Dask scheduler no longer trusts your task, because it was present too often when workers died unexpectedly. It is designed to protect the cluster against tasks that kill workers, for example by segfaults or memory errors.
Whenever a worker dies unexpectedly the scheduler notes which tasks were running on that worker when it died. It retries those tasks on other workers but also marks them as suspicious. If the same task is present on several workers when they die then eventually the scheduler will give up on trying to retry this task, and instead marks it as failed with the exception KilledWorker.
Often this means that your task has some other issue. Perhaps it causes a segmentation fault or allocates too much memory. Perhaps it uses a library that is not threadsafe. Or perhaps it is just very unlucky. Regardless, you should inspect your worker logs to determine why your workers are failing. This is likely a bigger issue than your task failing.
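For a LocalCluster the worker logs can also be pulled straight from Python, which is sometimes quicker than hunting for log files on disk (a sketch, assuming an existing Client object):
from dask.distributed import Client

client = Client()  # or reuse the client already connected to your cluster
for worker, log in client.get_worker_logs().items():
    print(worker)
    print(log)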
You can control this behavior by modifying the following entry in your ~/.config/dask/distributed.yaml file.
allowed-failures: 3 # number of retries before a task is considered bad
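The same limit can also be raised from Python before the cluster is created, which is convenient for one-off experiments (a sketch; the value 10 is arbitrary):
import dask
from dask.distributed import Client, LocalCluster

# Tolerate more unexpected worker deaths per task before the scheduler
# marks the task as failed with KilledWorker.
dask.config.set({"distributed.scheduler.allowed-failures": 10})
cluster = LocalCluster(n_workers=5)
client = Client(cluster)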

Ruby mod_passenger process timeout

A few Ruby apps I've worked with hang for a long time on slow calls, causing processes to back up on the machine and eventually requiring a reboot. Is there a quick and easy way in Passenger to limit the execution time for a single Apache request?
In PHP, if a process exceeds the max_execution_time setting in php.ini, the process returns an error to Apache and the server keeps merrily plugging away.
I would take a look at fixing the application. Cutting off requests at the web server level is really more of a band-aid that doesn't address the core problem - which is request failures, one way or another. If the Ruby app is dependent on another service that is timing out, you can patch the code like this, using the timeout.rb library:
require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes too much time...
}
This will let the code "give up" and close out the request gracefully when needed.

ftplib timeout continuously prevents script from running through

I have a question which probably identifies me as a beginner in programming, which I am indeed.
I wrote a script that downloads many NetCDF files; each file is about 500 MB in size, and there are many hundreds of files. The script would run for several days if all files were downloaded. The problem is that the script regularly stops with the error message:
TimeoutError: [Errno 60] Operation timed out
This is annoying, because I would like to start the script and come back some days later when the downloading is done. As it is, I have to check at least every hour whether the script is still running.
I found in the manual that the timeout of the FTP connection can be set manually.
I set it really high:
from ftplib import FTP
ftp = FTP("rancmems.mercator-ocean.fr", timeout=10000)
My question is: is this the line where the problem comes from, and is there any way to solve the problem?
Thank you for helping :-)
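Setting a large timeout only delays the error; the usual pattern is to catch the timeout for each file, reconnect, and retry. A rough sketch, assuming each file is fetched with retrbinary (the helper name, retry counts, and login details are illustrative, not from the original script):
import ftplib
import time

HOST = "rancmems.mercator-ocean.fr"

def download_with_retry(remote_path, local_path, retries=5):
    # Retry a single FTP download, reconnecting after each failure or timeout.
    for attempt in range(retries):
        try:
            ftp = ftplib.FTP(HOST, timeout=600)  # per-operation timeout in seconds
            ftp.login()                          # add your credentials here if the server requires them
            with open(local_path, "wb") as fh:
                ftp.retrbinary("RETR " + remote_path, fh.write)
            ftp.quit()
            return
        except ftplib.all_errors as err:         # includes OSError, so timeouts are caught too
            print("Attempt", attempt + 1, "failed:", err, "- reconnecting...")
            time.sleep(30)                       # short pause before reconnecting
    raise RuntimeError("Giving up on " + remote_path + " after " + str(retries) + " attempts")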
