Dask Workers lifetime option not waiting for job to finish

When applying the worker lifetime option with restart, it looks like the worker still goes ahead with the restart even if it is in the middle of running a job.
I applied the lifetime restart option with a 60-second interval to a single worker and ran a job that simply sleeps for twice that long. The restart still appears to take place even while the worker is running the job.
For a graceful restart, I thought the worker would wait for a long-running task / job to finish and only restart itself once it was idle. That way, even a long-running task would not be interrupted by the auto-restart option.
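For reference, a minimal sketch of the setup described above, assuming the documented --lifetime / --lifetime-restart worker flags (the scheduler address is a placeholder):

    import time
    from dask.distributed import Client

    # Worker started separately with a 60-second lifetime, e.g.:
    #   dask-scheduler
    #   dask-worker tcp://<scheduler>:8786 --nthreads 1 --lifetime 60s --lifetime-restart
    client = Client("tcp://<scheduler>:8786")  # placeholder address

    def long_task():
        time.sleep(120)  # sleeps for twice the worker lifetime
        return "done"

    future = client.submit(long_task)
    # With the behaviour described above, the 60-second restart interrupts the
    # task instead of waiting for it to finish.
    print(future.result())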

Related

How to safely shut down Sidekiq on Heroku

I guess I need a sanity check here: if I want to prevent any Sidekiq jobs from ending prematurely, will Heroku Redis handle this for me?
When I want to push new changes to a production site, I put the application in maintenance mode: heroku maintenance:on. When I do this and run heroku ps, I can see that both my web process and my worker (i.e. Sidekiq) are still up (which makes sense, because maintenance mode just prevents users from accessing the site).
If I shut down the worker dyno with a command like heroku ps:stop worker after the site is in maintenance mode, will this safely stop the Sidekiq workers before the dyno goes down? Also, from Sidekiq's documentation:
https://github.com/mperham/sidekiq/wiki/Deployment#heroku
It mentions a -t N switch, where N is a number of seconds, but notes that Heroku has a hard limit of 30 seconds for a process to shut down on its own. Am I correct that if I stop the worker process with the heroku command, it will give any currently running jobs N seconds to finish before sending a SIGTERM signal?
If not, what additional steps do I need to take to make sure Sidekiq has safely shut down?
Sounds like you are fine. Heroku sends SIGTERM when you call ps:stop. Sending SIGTERM tells Sidekiq to shut down within N seconds. Your worker dyno should be safely down within 30 seconds.
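As a concrete illustration (the Procfile entry and the -t value are examples, not the asker's actual setup): the shutdown timeout is passed to Sidekiq on the worker process and should stay under Heroku's 30-second window, so that jobs which don't finish in time are pushed back to Redis and re-run after the restart.

    worker: bundle exec sidekiq -t 25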

Killing tasks spawned by a job

I am considering replacing Celery with Dask. Currently we have a cluster where different jobs are submitted, each one generating multiple tasks that run in parallel. Celery has a killer feature, the "revoke" command: I can kill all the tasks of a given job without interfering with the other jobs that are running at the same time. How can I do that with Dask? I only find references saying that this is not possible, but for us it would be a disaster: I don't want to be forced to shut down the entire cluster when a calculation goes rogue, killing the jobs of the other users.
You can cancel tasks using the Client.cancel command.
If they haven't started yet then they won't start; however, if they're already running in a thread somewhere, there isn't much that Python can do to stop them other than tearing down the process.
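A minimal sketch of that pattern (process_record and records are hypothetical stand-ins for one job's work; the scheduler address is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://<scheduler>:8786")  # placeholder address

    # Keep one job's futures together so they can be revoked as a group.
    job_futures = client.map(process_record, records)

    # "Revoke" the job: pending tasks are removed from the scheduler; tasks
    # already executing in a worker thread still run to completion.
    client.cancel(job_futures)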

Dask restart worker(s) using client

Is there a way, using the Dask client, to restart a specific worker or a provided list of workers? I need a way to bounce a worker after a task is executed, to reset the state of the process, which may have been changed by the execution.
Client.restart() restarts the entire cluster, and so may end up killing any tasks running in parallel to the one that just completed.
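One commonly suggested workaround, assuming each worker is supervised by a Nanny (the default), is to ask just the target worker process to exit and let the Nanny spawn a fresh one; that resets in-process state without touching the rest of the cluster. A sketch (addresses are placeholders):

    import os
    from dask.distributed import Client

    client = Client("tcp://<scheduler>:8786")  # placeholder address
    worker_addr = "tcp://10.0.0.5:39217"       # hypothetical worker address

    # Fire-and-forget: the worker process exits and its Nanny restarts it.
    client.run(lambda: os._exit(0), workers=[worker_addr], wait=False)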

Sidekiq - Enqueued Job is running from old code

I have about 30 Sidekiq jobs scheduled in the future (let's say 1 per day for the next 30 days).
I use Capistrano for deployment, so I have 5 release directories at any time. Let's say:
/var/www/release1/ (recent)
/var/www/release2/
/var/www/release3/
/var/www/release4/
/var/www/release5/
Let's say after a few days I make a new release. Now, the previously scheduled jobs still run from the old code. Is this expected? How can we fix this to ensure that a job uses the latest release directory when it starts running, rather than the one it was scheduled from?
I'd just like to contribute an alternate answer for someone who might get into this situation for another reason.
In my case there was a zombie Sidekiq process running, so even when I stopped Sidekiq manually and restarted it, another Sidekiq process was still hanging around, running the old code. It's therefore a good idea to run the Unix htop command or ps aux | grep sidek and look for zombie processes.
This could be because the Sidekiq process didn't restart after a successful deployment.
Make sure your deployment process restarts Sidekiq, and make sure the restart actually works; otherwise the Sidekiq processes will still be holding on to the old code.
https://github.com/mperham/sidekiq/wiki/Deployment
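For reference, a hedged sketch of checking for and gracefully stopping a stale process (the PID is a placeholder; TSTP is Sidekiq's "quiet" signal, TERM its shutdown signal):

    ps aux | grep '[s]idekiq'   # list running Sidekiq processes, old releases included
    kill -TSTP <pid>            # quiet: stop picking up new jobs
    kill -TERM <pid>            # shut down; unfinished jobs are pushed back to Redis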

Delayed Jobs workers getting timed out on engineyard

I think I'm having a problem where Engine Yard is adding a timeout to some of my Delayed Job workers (it seems to be 10 minutes). I have a copy process that can run for more than 10 minutes, and every time it hits that 10-minute threshold the job is killed. Is there any way to configure the Engine Yard timeout for worker instances? Looking through the configuration, all I can see are timeouts regarding nginx/Apache.
There isn't a timeout set for the Delayed Job workers, so this is more likely a memory usage issue. Monit tracks the memory consumed by the workers and will restart those that reach a set threshold. Monit's actions will be logged in /var/log/syslog, so this can be checked to confirm if Monit is terminating the workers. The memory threshold is set in the /etc/monit.d/delayed_job.monitrc file(s) and can be increased to fit the workers' requirements. After alteration of the configuration Monit must be reloaded using sudo monit reload.
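A hedged illustration of the knob being described (the process name, pidfile path, and 300 MB threshold are examples, not Engine Yard's actual defaults):

    check process delayed_job_0 with pidfile /var/run/delayed_job.0.pid
      if totalmem > 300 MB for 2 cycles then restart

After raising the threshold, reload Monit with sudo monit reload as noted above.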
If you submit a ticket at https://support.cloud.engineyard.com the support staff will be more than happy to help you further diagnose this issue.
