Dokku remove worker without finishing tasks - ruby-on-rails

I have Ruby on Rails app and after I do any deployment with Dokku, I have my old worker running the tasks normally, but when I complete the deploy, after a few seconds this worker is removed and a new one is created without finishing all the tasks that were running. Does anyone know how I get around this? Or that the worker finishes all tasks before removing, or when a worker is created after deploying, the tasks from the old worker go to the new worker? Any indication?

Dokku's zero-downtime deploy mechanism will wait a configurable amount of time before running docker container stop. This will send a SIGTERM to your application, and then a SIGKILL after a grace period.
Your application code should handle SIGTERM and gracefully stop accepting new work, finish old work, and then terminate. This is generally what background processing frameworks do by default, but you may need to configure this in yours or add the functionality if it is a custom framework you wrote.

I managed to solve it with a workaround, I created a check_worker.rb file that I call in the predeploy, it silences the workers so as not to take any more tasks, and while there is a worker with tasks being executed, it does not advance with the deploy until it completes all, only after completing all the tasks it completes the deploy
app.json
{
"scripts": {
"dokku": {
"predeploy": "ruby check_worker.rb"
}
}
}
check_worker.rb
#!/usr/bin/env ruby
require 'sidekiq'
require 'sidekiq/api'
ps = Sidekiq::ProcessSet.new
ps.each(&:quiet!)
workers = true
while workers
works = []
active_workers = Sidekiq::Workers.new.map do |_process_id, _thread_id, work|
works = work
end
workers = works.length > 0
if workers
sleep 2
end
end

Related

Is there a way to tell a sleeping delayed job worker to process the queue from Rails code?

I'm using delayed_job library as adapter for Active Jobs in Rails:
config.active_job.queue_adapter = :delayed_job
My delayed_job worker is configured to sleep for 60 seconds before checking the queue for new jobs:
Delayed::Worker.sleep_delay = 60
In my Rails code I add a new job to the queue like this:
MyJob.perform_later(data)
However, the job will not be picked up by the delayed_job worker immediately, even if there are no jobs in the queue, because of the sleep_delay. Is there a way to tell the delayed_job worker to wake up and start processing the job queue if it's sleeping?
There is a MyJob.perform_now method, but it blocks the thread, so it's not what I want because I want to execute a job asynchronously.
Looking at the delayed_job code and it appears that there's no way to directly control or communicate with the workers after they are daemonized.
I think the best you can do would be to start a separate worker with a small sleep_delay that only reads a specific queue, then use that queue for these jobs. A separate command is necessary because you can't start a worker pool where the workers have different sleep delays:
Start your main worker: bin/delayed_job start
Start your fast worker: bin/delayed_job start --sleep-delay=5 --queue=fast --identifier=999 (the identifier is necessary to differentiate the workers daemons)
Update your job to use that queue:
class MyJob < ApplicationJob
queue_as :fast
def perform...
end
Notes:
When you want to stop the workers you'll also have to do it separately: bin/delayed_job stop and bin/delayed_job stop --identifier=999.
This introduces some potential parallelism and extra load on the server when both workers are working at the same time.
The main worker will process jobs from the fast queue too, it's just a matter of which worker grabs the job first. If you don't want that you need to setup the main worker to only read from the other queue(s), by default that's only 'default', so: bin/delayed_job start --queue=default.

How many rails instances does delayed job initialize if running multiple pools

I'm running Delayed Job with the pool option like:
bundle exec bin/delayed_job -m --pool=queue1 --pool=queue2 start
Will this spawn one OR multiple rails instances? (ie: will it spawn one instance for all the pools or will every pool get its own rails instance)?
When testing locally it seemed to only spawn one rails instance for all the pools.
But I want to confirm this 100% (esp on production).
I tried using commands like these to see what the DJ processes were actually pointing to:
ps aux, lsof, pstree
Anyone know for sure how this works, or any easy way to find out? I started digging through the source code but figured someone prob knows a quicker way.
Thanks!
It should spawn multiple processes, not sure why you're seeing only one.
From the readme:
Use the --pool option to specify a worker pool. You can use this option multiple times to start different numbers of workers for different queues.
The following command will start 1 worker for the tracking queue, 2 workers for the mailers and tasks queues, and 2 workers for any jobs:
RAILS_ENV=production script/delayed_job --pool=tracking --pool=mailers,tasks:2 --pool=*:2 start
Further details after discussion in comments
The question mentions "Rails instances", but instance is a generic term. The word you're looking for is process. The text quoted from DelayedJob's readme uses the word worker, short for worker process. In Rails, you usually refer to server processes as just servers, and to worker processes as just workers.
The rails console, too, is just another process.
In Rails all these processes will load the whole application, but will do different things.
Server processes will wait for incoming HTTP requests and send back responses; worker processes will periodically poll a queue (DelayedJob uses the DB) and execute jobs; the console process will start a REPL and wait for input.
They will all have access to the same code (models, DB config, assets, view template, etc), but will have very different responsibilities.
I hope this makes things clearer.
After digging through the code the short answer is..
Running something like this:
bundle exec bin/delayed_job -m --pool=queue1 --pool=queue2 start
will start ONE rails process/instance for ALL the pools/queues you specify.
Details below if you want more explanation:
In the Command class:
this loops through and setups up the workers:
def setup_pools
worker_index = 0
#worker_pools.each do |queues, worker_count|
options = #options.merge(:queues => queues)
worker_count.times do
process_name = "delayed_job.#{worker_index}"
run_process(process_name, options)
worker_index += 1
end
end
end
Which will run this for each queue:
def run(worker_name = nil, options = {})
Dir.chdir(root)
Delayed::Worker.after_fork
Delayed::Worker.logger ||= Logger.new(File.join(#options[:log_dir], 'delayed_job.log'))
worker = Delayed::Worker.new(options)
worker.name_prefix = "#{worker_name} "
worker.start
Each worker is daemonized, but there aren't new rails processes started. It just loops through each pool/queue in its own daemon.
You can see this in the "start" method:
def start
loop do
self.class.lifecycle.run_callbacks(:loop, self) do
#realtime = Benchmark.realtime do
#result = work_off
end
end
If you want to start a rails instance for each new queue you could use monit and do something like:
check process delayed_job_0
with pidfile /var/www/apps/{app_name}/shared/pids/delayed_job.0.pid
start program = "/usr/bin/env RAILS_ENV=production /var/www/apps/{app_name}/current/bin/delayed_job start -i 0"
stop program = "/usr/bin/env RAILS_ENV=production /var/www/apps/{app_name}/current/bin/delayed_job stop -i 0"
group delayed_job
check process delayed_job_1
with pidfile /var/www/apps/{app_name}/shared/pids/delayed_job.1.pid
start program = "/usr/bin/env RAILS_ENV=production /var/www/apps/{app_name}/current/bin/delayed_job start -i 1"
stop program = "/usr/bin/env RAILS_ENV=production /var/www/apps/{app_name}/current/bin/delayed_job stop -i 1"
group delayed_job

delayed_job - Stop only one process

Using delayed_job gem, how can I stop only one process without stopping all the workers?
For example:
rake jobs:work start workers
process1 = SomeClass.enqueue start process 1 in code
process2 = SomeClass.enqueue start process 2 in code
process1.stop will stop only process 1 and keep process 2 running.
I guess a similar question would be "How can I get the PID of a delayed job process?" because then I can kill the process using the PID.
Delayed Job on a whole is a process. To know the pid check the .pid file that is created in temp folder when delayed job starts.
In order to stop one particular process from code use Delayed::Job.all, or find your job and use the destroy method to close it.

Rufus-scheduler only running once on production

I'm using rufus-scheduler to run a process every day from a rails server. For testing purposes, let's say every 5 minutes. My code looks like this:
in config/initializers/task_scheduler.rb
scheduler = Rufus::Scheduler::PlainScheduler.start_new
scheduler.every "10m", :first_in => '30s' do
# Do stuff
end
I've also tried the cron format:
scheduler.cron '50 * * * *' do
# stuff
end
for example, to get the process to run every hour at 50 minutes after the hour.
The infuriating part is that it works on my local machine. The process will run regularly and just work. It's only on my deployed-to-production app that the process will run once, and not repeat.
ps faux reveals that cron is running, passenger is handling the spin-up of the rails process, the site has been pinged again so it knows it should refresh, and production shows the changes in the code. The only thing that's different is that, without a warning or error, the scheduled task doesn't repeat.
Help!
You probably shouldn't run rufus-scheduler in the rails server itself, especially not with a multi-process framework like passenger. Instead, you should run it in a daemon process.
My theory on what's happening:
Passenger starts up a ruby server process and uses it to fork off other servers to handle requests. But since rufus-scheduler runs its jobs in a separate thread from the main thread, the rufus thread is only alive in the original ruby process (ruby's fork only duplicates the thread that does the forking). This might seem like a good thing because it prevents multiple schedulers from running, but... Passenger may kill ruby processes under certain conditions - and if it kills the original, the scheduler thread is gone.
Add the Below lines to your apache2 config /etc/apache2/apach2.conf and restart your apache server
RailsAppSpawnerIdleTime 0
PassengerMinInstances 1
Kelvin is right.
Passenger kills 'unnecessary' threads.
http://groups.google.com/group/rufus-ruby/search?group=rufus-ruby&q=passenger

How do I clear stuck/stale Resque workers?

As you can see from the attached image, I've got a couple of workers that seem to be stuck. Those processes shouldn't take longer than a couple of seconds.
I'm not sure why they won't clear or how to manually remove them.
I'm on Heroku using Resque with Redis-to-Go and HireFire to automatically scale workers.
None of these solutions worked for me, I would still see this in redis-web:
0 out of 10 Workers Working
Finally, this worked for me to clear all the workers:
Resque.workers.each {|w| w.unregister_worker}
In your console:
queue_name = "process_numbers"
Resque.redis.del "queue:#{queue_name}"
Otherwise you can try to fake them as being done to remove them, with:
Resque::Worker.working.each {|w| w.done_working}
EDIT
A lot of people have been upvoting this answer and I feel that it's important that people try hagope's solution which unregisters workers off a queue, whereas the above code deletes queues. If you're happy to fake them, then cool.
You probably have the resque gem installed, so you can open the console and get current workers
Resque.workers
It returns a list of workers
#=> [#<Worker infusion.local:40194-0:JAVA_DYNAMIC_QUEUES,index_migrator,converter,extractor>]
pick the worker and prune_dead_workers, for example the first one
Resque.workers.first.prune_dead_workers
Adding to answer by hagope, I wanted to be able to only unregister workers that had been running for a certain amount of time. The code below will only unregister workers running for over 300 seconds (5 minutes).
Resque.workers.each {|w| w.unregister_worker if w.processing['run_at'] && Time.now - w.processing['run_at'].to_time > 300}
I have an ongoing collection of Resque related Rake tasks that I have also added this to: https://gist.github.com/ewherrmann/8809350
Run this command wherever you ran the command to start the server
$ ps -e -o pid,command | grep [r]esque
you should see something like this:
92102 resque: Processing ProcessNumbers since 1253142769
Make note of the PID (process id) in my example it is 92102
Then you can quit the process 1 of 2 ways.
Gracefully use QUIT 92102
Forcefully use TERM 92102
* I'm not sure of the syntax it's either QUIT 92102 or QUIT -92102
Let me know if you have any trouble.
I just did:
% rails c production
irb(main):001:0>Resque.workers
Got the list of workers.
irb(main):002:0>Resque.remove_worker(Resque.workers[n].id)
... where n is the zero based index of the unwanted worker.
I had a similar problem that Redis saved the DB to disk that included invalid (non running) workers. Each time Redis/resque was started they appeared.
Fix this using:
Resque::Worker.working.each {|w| w.done_working}
Resque.redis.save # Save the DB to disk without ANY workers
Make sure you restart Redis and your Resque workers.
Started working on https://github.com/shaiguitar/resque_stuck_queue/ recently. It's not a solution to how to fix stuck workers but it addresses the issue of resque hanging/being stuck, so I figured it could be helpful for people on this thread. From README:
"If resque doesn't run jobs within a certain timeframe, it will trigger a pre-defined handler of your choice. You can use this to send an email, pager duty, add more resque workers, restart resque, send you a txt...whatever suits you."
Been used in production and works pretty well for me thus far.
Here's how you can purge them from Redis by hostname. This happens to me when I decommission a server and workers do not exit gracefully.
Resque.workers.each { |w| w.unregister_worker if w.id.start_with?(hostname) }
I ran into this issue and started down the path of implementing a lot of the suggestions here. However, I discovered the root cause that was creating this issue was that I was using the gem redis-rb 3.3.0. Downgrading to redis-rb 3.2.2 prevented these workers from getting stuck in the first place.
I've cleared them out from redis-cli directly. Luckily redistogo.com allows access from environments outside heroku.
Get dead worker ID from the list. Mine was
55ba6f3b-9287-4f81-987a-4e8ae7f51210:2
Run this command in redis directly.
del "resque:worker:55ba6f3b-9287-4f81-987a-4e8ae7f51210:2:*"
You can monitor redis db to see what it's doing behind the scenes.
redis xxx.redistogo.com> MONITOR
OK
1380274567.540613 "MONITOR"
1380274568.345198 "incrby" "resque:stat:processed" "1"
1380274568.346898 "incrby" "resque:stat:processed:c65c8e2b-555a-4a57-aaa6-477b27d6452d:2:*" "1"
1380274568.346920 "del" "resque:worker:c65c8e2b-555a-4a57-aaa6-477b27d6452d:2:*"
1380274568.348803 "smembers" "resque:queues"
Second last line deletes the worker.
In resque 2.0.0, here's one way that seems to work to remove only actually appearantly-dead workers in resque 2.0.0:
Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }
I am not an expert in what's going, it's possible there's a better way to do this or that this will have problems. I'm just trying to figure this out too.
This seems to remove workers that haven't sent a "heartbeat" in much longer than expected from the resque worker list.
If the phantom worker was in a "running" state, then a new entry in the "failed" job queue will be created corresponding to phantom job.
I had stuck/stale resque workers here too, or should I say 'jobs', because the worker is actually still there and running fine, it's the forked process that is stuck.
I chose the brutal solution of killing the forked process "Processing" since more than 5min, via a bash script, then the worker just spawn the next in queue, and everything keeps on going
have a look at my script here: https://gist.github.com/jobwat/5712437
If you are using newer versions of Resque, you'll need to use the following command as the internal APIs have changed...
Resque::WorkerRegistry.working.each {|work| Resque::WorkerRegistry.remove(work.id)}
This avoids the problem as long as you have a resque version newer than 1.26.0:
resque: env QUEUE=foo TERM_CHILD=1 bundle exec rake resque:work
Keep in mind that it does not let the currently running job finish.
If you use Docker, you can also use this command:
<id> is the worker id.
docker stop <id>
docker start <id>

Resources