Locked delayed_job row lingers in the database after capistrano deploy

Whenever I deploy with capistrano or run cap production delayed_job:restart, I end up with the currently-running delayed_job row remaining locked.
The delayed_job process is successfully stopped, a new delayed_job process is started, and a new row is locked by the new process. The problem is that the last process' row is still sitting there & marked as locked. So I have to go into the database manually, delete the row, and then manually add that job back into the queue for the new delayed_job process to get to.
Is there a way for the database cleanup & re-queue of the previous job to happen automatically?

I have the same problem. This happens whenever a job is forcibly killed. Part of the problem is that worker processes are managed by the daemons gem rather than delayed_job itself. I'm currently investigating ways to fix this, such as:
Setting a longer timeout before daemons forcibly terminates (neither the delayed_job docs nor the daemons docs say anything about this)
Clearing locks before starting delayed_job workers
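For the second idea, something like this ought to work (a rough sketch, assuming the standard delayed_job_active_record schema and Rails 4+ query syntax; run it from a deploy hook before delayed_job start):

# Unlock any jobs still held by workers from the previous deploy.
# locked_by and locked_at are the standard delayed_jobs columns.
Delayed::Job.where.not(locked_by: nil).update_all(locked_by: nil, locked_at: nil)

Note that unlocking blindly re-runs whatever the old worker was doing, so this only makes sense if your jobs are idempotent.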
I'll post back here when and if I come up with a solution.

Adjust your Daemon wait time or raise an exception on SIGINT.
@John Carney is correct. In short, all delayed_job workers get sent something like a SIGINT (a polite interrupt) on a redeploy. By default, delayed_job workers will complete their current job (if they are working on one) and then gracefully terminate.
However, if the job that they are working on is a longer-running job, there's an amount of time the Daemon manager waits before it gets annoyed and sends a more serious interrupt signal, like a SIGTERM or SIGKILL. This wait time and what gets sent really depends on your setup and configuration.
When that happens, the delayed_job worker gets killed immediately without being able to finish the job it is working on or even cleanup after itself and mark the job as no longer locked.
This ends up in a "stranded" job that is marked as "locked" but locked to a process/worker that no longer exists. Not good.
That's the crux of the issue and what is happening. To get around this, you have two main options, depending on what your jobs look like (we use both):
1. Raise an exception when an interrupt is received.
You can do this by setting the raise_signal_exceptions configuration to either :term or true:
Delayed::Worker.raise_signal_exceptions = :term
This configuration option accepts :term, true, or false (the default). You can read more in the original commit here.
I would try first with :term and see if that solves your issue. If not, you may need to set it to true.
Setting it to :term or true will gracefully raise an exception and unlock the job for another delayed_job worker to pick up and start working on.
Setting it to true means that your delayed_job workers won't even attempt to finish the current job they are working on. They will just immediately raise an exception, unlock the job, and terminate themselves.
2. Adjust how your workers are interrupted/terminated/killed on a redeploy.
This really depends on your redeploy, etc. In our case, we are using Cloud66 to handle deploys so we just had to configure this with them. But this is what ours looks like:
stop_sequence: int, 172800, term, 90, kill # allow long-running delayed jobs to finish before being killed on a redeploy
On a redeploy, this tells the Daemon manager to follow these steps with each delayed_job worker:
Send a SIGINT.
Wait 172800 seconds (2 days) - we have very long-running jobs.
Send a SIGTERM, if the worker is still alive.
Wait 90 seconds.
Send a SIGKILL, if the worker is still alive.
Anyway, that should help get you on the right track to configuring this properly for yourself.
We use both methods by setting a lengthy timeout as well as raising an exception when a SIGTERM is received. This ensures that if there is a job that runs past the 2 day limit, it will at least raise an exception and unlock the job, allowing us to investigate instead of just leaving a stranded job that is locked to a process that no longer exists.

Related

Ruby delayed_job gem how to stop process

I am currently using the delayed_job_active_record gem to run some long-running scheduled tasks. The processes run in the background on a separate worker dyno on Heroku and rarely go wrong, but in some cases I would like to be able to stop a process mid-run. I have been running the processes locally, and because of my setup, the scheduled tasks only kick off the process, which is essentially a very long loop.
Using
bin/delayed_job stop
only stops the workers; since the process has already started, it doesn't stop it.
Because of this, I can't seem to stop the process once it has got going without restarting the entire dyno. This seems a bit excessive but is my only option at the moment.
Any help is greatly appreciated
I don't think there's any way to interrupt it without essentially killing the process like you are doing. I would usually delete the job record in the database and then terminate the worker running it, so it doesn't just retry the job (if you've got retries enabled for that job).
Another option... Since you know it's long running and, I imagine, has multiple steps... Modularize the operation and/or add periodic checks for a 'cancelled' flag you put somewhere in the model(s). If you detect the cancelled request, you can then give up and do any cleanup needed. This is probably preferred anyway so you can manage what happens when it's aborted more explicitly.
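A rough sketch of that flag check in delayed_job's Struct-job style (the Task model, its cancel_requested column, and the process/cleanup helpers are all hypothetical):

class LongTaskJob < Struct.new(:task_id)
  def perform
    task = Task.find(task_id) # hypothetical parent record
    task.items.find_each do |item|
      # Re-read the flag between steps so a cancel takes effect quickly.
      return cleanup(task) if task.reload.cancel_requested?
      process(item) # hypothetical per-step work
    end
  end
end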

How do I handle long running jobs on Heroku?

I want to use Heroku but the fact they restart dynos every 24 hours at random times is making things a bit difficult.
I have a series of jobs dealing with payment processing that are very important, and I want them backed by the database so they're 100% reliable. For this reason, I chose DJ which is slow.
Because I chose DJ, it means that I also can't just push 5,000,000 events to the database at once (one per email send).
Because of THAT, I have longer running jobs (send 200,000 text messages over a few hours).
With these longer running jobs, it's more challenging to get them working if they're cut off right in the middle.
It appears Heroku sends SIGTERM and then expects the process to shut down within 30 seconds. This is not going to happen for my longer jobs.
Now I'm not sure how to handle them... the only way I can think is to update the database immediately after sending texts for instance (for example, a sms_sent_at column), but that just means I'm destroying database performance instead of sending a single update query for every batch.
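For illustration, the batched checkpoint I have in mind looks roughly like this (send_sms and the batch size of 500 are placeholders):

sent_ids = []
messages.find_each do |msg|
  send_sms(msg) # hypothetical sender
  sent_ids << msg.id
  next if sent_ids.size < 500
  Message.where(id: sent_ids).update_all(sms_sent_at: Time.now) # one UPDATE per batch
  sent_ids.clear
end
Message.where(id: sent_ids).update_all(sms_sent_at: Time.now) if sent_ids.any?

That's one write per 500 sends instead of one per message, but a SIGKILL mid-batch still re-sends up to 500 texts.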
This would be a lot better if I could schedule restarts, at least then I could do it at night when I'm 99% likely not going to be running any jobs that don't take longer than 30 seconds to shut down.
Or.. another way, can I 'listen' for SIGTERM within a long running DJ and at least abort the loop early so it can resume later?
Manual restarts will reset the 24 hr clock - heroku ps:restart at your preferred time ought to give you the control you are looking for.
More info can be found here: Dynos and the Dyno Manager
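For example, to restart at 4am every day from any machine with the Heroku CLI configured (app name is a placeholder):

# crontab entry: restart the dynos at a quiet hour, resetting the 24-hour cycle
0 4 * * * heroku ps:restart --app my-app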
Here's the proper answer: you listen for SIGTERM (I'm using DJ here) and then rescue gracefully. It's important that the jobs are idempotent.
Long running delayed_job jobs stay locked after a restart on Heroku
class WithdrawPaymentsJob
  def perform
    begin
      term_now = false
      # Install our own TERM handler, chaining to the old one (delayed_job's own Proc).
      old_term_handler = trap('TERM') { term_now = true; old_term_handler.call }
      loop do
        puts 'doing long running job'
        sleep 1
        if term_now
          raise 'Gracefully terminating job early...'
        end
      end
    ensure
      trap('TERM', old_term_handler) # restore the previous handler
    end
  end
end
Here's how you solve it with Que:
if Que.worker_count.zero?
  raise 'Gracefully terminating job early...'
end

How to lock Resque jobs to one server

I have a "cluster" of Resque servers in my infrastructure. They all have the same exact job priorities etc. I automagically scale the number of Resque servers up and down based on how many pending jobs there are and available resources on the servers to handle said jobs. I always have a minimum of two Resque servers up.
My issue is that when I do a quick, one off job, sometimes both the servers process that job. This is bad.
I've tried adding a lock to my job with something like the following:
require 'resque-lock-timeout'
class ExampleJob
  extend Resque::Plugins::LockTimeout
  def self.perform
    # some code
  end
end
This plugin works for longer-running jobs. However, for these super tiny one-off jobs, processing happens right away. Neither Resque server sees the lock set by its sister server; both set a lock, process the job, unlock, and are done.
I'm not entirely sure what to do at this point or what solutions there are except for having one dedicated server handle this type of job. That would be a serious pain to configure and scale. I really want both the servers to be able to handle it, but once one of them grabs it from the queue, ensure the other does not run it.
Can anyone suggest some viable solution(s)?
Write your lock interpreter to wait T milliseconds before it looks for a lock with a unique_id less than the value of the lock it made.
This will determine who won the race, and the loser will self-terminate.
T is the parallelism latency between all N servers in the pool of a given queue. You can determine this heuristically by scaling back from 1000 milliseconds until you again find the job happening in-duplicate. Give padding for latency variation.
This is called the busy-wait solution to mutual exclusion. It is considered an acceptable trade-off among the various scenarios in which one must implement a mutex (e.g. locking).
I'll post some links when off mobile. Wikipedia entry on mutex should explain all this.
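In the meantime, here's a rough sketch of the tie-break with the redis-rb gem (the key name, the value of T, and the use of claim time as the ordering are all placeholders for your own scheme):

require 'redis'
require 'securerandom'

T = 0.25 # tuned parallelism latency, in seconds
redis = Redis.new
my_id = SecureRandom.uuid

redis.zadd('lock:my_job', Time.now.to_f, my_id)    # register my claim
sleep T                                            # busy-wait: let slower claims arrive
if redis.zrange('lock:my_job', 0, 0).first == my_id
  run_the_job # hypothetical; the earliest claim wins
end
redis.zrem('lock:my_job', my_id)                   # the loser self-terminates without running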
If this won't work for you, then:
1. Use a scheduler to control duplication.
2. Classify short-running jobs to a queue designed to run them in serial.
TL;DR: there is no perfect solution, only good trade-offs for your conditions.
It should not be possible for two workers to get the same 'payload' because items are dequeued using BLPOP. Redis will only send the queued item to the first client that calls BLPOP. It sounds like you are enqueueing the job more than once and therefore two workers are able to acquire different payloads with the same arguments. The purpose of 'resque-lock-timeout' is to assure that payloads that have the same method and arguments do not run concurrently; it does not however stop the second payload from being worked if the first job releases the lock before the second job tries to acquire it.
It would make sense that this only happens to short running jobs. Here is what might be happening:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 1 is unlocked
payload 2 is locked
payload 2 is worked
payload 2 is unlocked
Whereas with long-running jobs the following scenario might happen:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 2 fails to get the lock
payload 1 is unlocked
Try turning off Resque and enqueueing your job. Take a look in redis at the list for your Resque queue (or monitor Redis using redis-cli monitor). See if Resque has queued more than one payload. If you still only see one payload then monitor the list to see if another one of your resque workers is calling recreate on failed jobs.
If you want to have 'resque-lock-timeout' hold the lock for longer than the duration it takes to process the job you can override the release_lock! method to set an expiry on the lock instead of just deleting it.
module Resque
  module Plugins
    module LockTimeout
      def release_lock!(*args)
        lock_redis.expire(redis_lock_key(*args), 60) # expire the lock after 60 seconds
      end
    end
  end
end
https://github.com/lantins/resque-lock-timeout/blob/master/lib/resque/plugins/lock_timeout.rb#l153-155

Fork delayed job from the app server?

Here's my simple ideal case scenario for when I'd like delayed job to run:
When the first application server (whether through mongrel or passenger) starts, it'll start my delayed job workers.
When the last running application server terminates, it'll kill all the delayed job workers.
The first part (starting) is doable, although I'm not sure what the "right" or "best" way to do it is. Just make a conditional (on process not already running) system call to delayed_job start?
The second part (terminating) -- well, I'm not sure if it is doable or not. I definitely have no idea how this could be accomplished.
Any thoughts or ideas?
Is there another way that you start/end delayed job workers that you think is best?
Side question:
The main questions above are for the production environment -- a more difficult case because there are multiple app servers running at the same time. Could the same thing be easily done in the development environment (where there's guaranteed to only be one application server, not a cluster of them) by forking a child process to run the delayed job workers that would always terminate when the parent terminates? How would I go about doing this?
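Here's roughly what I'm picturing for the development case (an untested sketch; the initializer path and the Rails::Server guard are my guesses):

# config/initializers/delayed_job_dev.rb
if Rails.env.development? && defined?(Rails::Server)
  pid = Process.spawn("#{Rails.root}/script/delayed_job run") # 'run' keeps the worker in the foreground
  at_exit do
    Process.kill('TERM', pid) rescue nil # stop the worker when the server exits
    Process.wait(pid) rescue nil
  end
end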
You could definitely pull the terminating off with god.
Simply watch the app processes and god will fire a callback when they're all stopped.

Can I start and stop delayed_job workers from within my Rails app?

I've got an app that could benefit from delayed_job and some background processing. The thing is, I don't really need/want delayed_job workers running all the time.
The app runs in a shared hosting environment and in multiple locations (for different users). Plus, the app doesn't get a large amount of usage.
Is there a way to start and stop processing jobs (either with the script or rake task) from my app only after certain actions/events?
You could call out to system:
system "cd #{Rails.root} && rake delayed_job:start RAILS_ENV=production"
You could just change delayed_job to check less often too. Instead of the 5 second default, set it to 15 minutes or something.
Yes, you can, but I'm not sure what the benefit will be. You say you don't want workers running all the time - what are your concerns? Memory usage? Database connections?
To keep the impact of delayed_job low on your system, I'd run only one worker, and configure it to sleep most of the time.
Delayed::Worker.sleep_delay = 60 * 5 # in your initializer.rb
A single worker will only wake up and check the db for new jobs every 5 minutes. Running this way keeps you from 'customizing' too much.
But if you really want to start a Delayed::Worker programatically, look in that class for work_off, and implement your own script/run_jobs_and_exit script. It should probably look much like script/delayed_job does - 3 lines.
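Something like this, perhaps (a sketch; work_off(100) works off at most 100 queued jobs and then returns):

#!/usr/bin/env ruby
# script/run_jobs_and_exit -- drain the queue once, then exit.
require File.expand_path(File.join(File.dirname(__FILE__), '..', 'config', 'environment'))
Delayed::Worker.new.work_off(100)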
I found this because I was looking for a way to run some background jobs without spending all the money to run them all the time when they weren't needed. Someone made a hack using google app engine to run the background jobs:
http://viatropos.com/blog/how-to-run-background-jobs-on-heroku-for-free/
It's a little outdated though. There is an interesting comment in the thread:
"When I need to send an e-mail, copy a file, etc I basically add it to the queue. At the end of every request it checks if there is anything in the queue. If so then it uses the Heroku API to set the worker to 1. At the end of a worker getting a task done it checks to see if there is anything left in the queue. If not then it sets the workers back to 0. The end result is the background worker will just work for a few seconds here and there. I can do all the background processing that I need and the bill at the end of the month rarely ever reaches 1 hour total worth of work. Even if it does no problem, I'll pay $0.05 for background processing. :)"
If you go to stop a worker, you are given the PID. You can simply kill -9 PID if all else fails.
