Long-running Sidekiq jobs keep dying - ruby-on-rails

I'm using the sidekiq gem to process background jobs in Rails. For some reason, the jobs just hang after a while -- the process either becomes unresponsive (it shows up in top but not much else) or mysteriously vanishes without errors (nothing is reported to airbrake.io).
Has anyone had experience with this?

Use the TTIN signal to get a backtrace of all threads in the process so you can figure out where the workers are stuck.
https://github.com/mperham/sidekiq/wiki/Signals
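
If you're not sure how to send the signal, you can do it from any Ruby console on the worker machine; a minimal sketch, where the pid is a placeholder (look up the real one with ps or in Sidekiq's log):

# Ask the Sidekiq process to dump a backtrace of every thread to its log.
# 4880 is a placeholder pid; substitute your actual Sidekiq pid.
Process.kill("TTIN", 4880)

The backtraces show up in Sidekiq's log file, one per thread, which is usually enough to see what the stuck workers are blocked on.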

I've experienced this, and haven't found a solution/root cause.
I couldn't resolve this cleanly, but came up with a hack.
I configured God to monitor my Sidekiq processes, and to restart them if a file changed.
I then set up a cron job that ran every 5 minutes and checked all the current Sidekiq workers for a queue. If a certain percentage of the workers had a start time more than 5 minutes in the past, it meant those workers had hung for some reason. When that happened, I touched a file, which made God restart Sidekiq. For me, 5 minutes was ideal, but it depends on how long your jobs typically run.
This was the only way I could deal with hanging Sidekiq jobs without manually checking on them every hour and restarting them myself.
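
For anyone wanting to reproduce that check, here's a minimal sketch of the cron script, assuming the Sidekiq::Workers API and a God config that restarts Sidekiq when the flag file is touched (the threshold, ratio, and file path are all placeholders to tune):

require 'sidekiq/api'
require 'fileutils'

THRESHOLD = 5 * 60                          # seconds; match your typical job duration
RESTART_FLAG = '/tmp/restart_sidekiq.txt'   # placeholder path that God watches

workers = Sidekiq::Workers.new
stuck = workers.count do |_process_id, _thread_id, work|
  # 'run_at' is the epoch time at which the job started on that thread
  Time.now.to_i - work['run_at'].to_i >= THRESHOLD
end

# If most busy threads have been running past the threshold, assume they
# hung and touch the flag file so God restarts the Sidekiq process.
FileUtils.touch(RESTART_FLAG) if workers.size > 0 && stuck >= workers.size * 0.5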

Related

Ruby delayed_job gem how to stop process

I am currently using the delayed_job_active_record gem to run some long-running scheduled tasks. The processes run in the background on a separate worker dyno on Heroku and rarely go wrong, but in some cases I would like to be able to stop a process mid-run. I have been running the processes locally, and because of the setup I have, the scheduled tasks only kick off the process, which is essentially a very long loop.
Using
bin/delayed_job stop
only stops the jobs, but since the process has already started, it doesn't stop that.
Because of this, I can't seem to stop the process once it has got going without restarting the entire dyno. This seems a bit excessive but is my only option at the moment.
Any help is greatly appreciated
I don't think there's any way to interrupt it without essentially killing the process like you are doing. I would usually delete the job record in the database and then terminate the worker running it so it doesn't just retry the job (if you've got retries enabled for that job).
Another option... Since you know it's long-running and, I imagine, has multiple steps... Modularize the operation and/or add periodic checks for a 'cancelled' flag you put somewhere in the model(s). If you detect a cancel request, you can then give up and do any cleanup needed. This is probably preferred anyway, so you can manage what happens when it's aborted more explicitly.
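
A minimal sketch of that second approach, assuming a hypothetical cancelled boolean on the record that drives the job (the model and helper names are placeholders):

class LongTask < Struct.new(:record_id)
  def perform
    record = Report.find(record_id)   # Report is a placeholder model
    record.items.find_each do |item|
      # Re-check the flag between steps so a cancel request takes effect
      # within one iteration instead of only at the very end.
      if record.reload.cancelled?
        cleanup(record)               # placeholder cleanup hook
        break
      end
      process(item)                   # placeholder unit of work
    end
  end
end

Delayed::Job.enqueue LongTask.new(report.id)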

Sidekiq workers suddenly slowed down (almost like stuck workers)

I've been using Sidekiq for a while now and it was working flawlessly (up to 5 million jobs processed). However, in the past few days the workers have been getting stuck, leaving the jobs unprocessed. Restarting the workers makes them start consuming the jobs again, but they eventually get stuck again (after ~10-30 minutes; I haven't done any exact measurements).
Here's my setup:
Rails v4.2.5.1, with ActiveJob.
MySQL DB, clustered (with 3 masters)
ActiveRecord::Base.connection_pool set to 32 (verified in Sidekiq process as well).
2 sidekiq workers, 3 threads per worker (total 6).
Symptoms:
If the workers have just been restarted, they process jobs fast (~1s).
After several jobs are processed, the time needed to complete a job (the same job that previously took only ~1s) suddenly spikes to ~2900s, which makes the worker look stuck.
The slowdown affects every kind of job (there's no specific offending job).
CPU usage and memory consumption are normal, and there's no swapping either.
Here is the TTIN log. It seems like the process hangs in:
retrieve_connection
clear_active_connections
But I'm not sure why it is happening. Anyone have similar experience or know something about this issue? Thanks in advance.
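
Those two methods belong to ActiveRecord's connection handling, which suggests the threads are blocked waiting on the connection pool rather than on MySQL itself. One common cause is connections that get checked out and never returned; wrapping any manual-thread work in with_connection avoids that. A sketch:

# The block checks a connection out of the pool and is guaranteed to
# return it when the block exits, even if an exception is raised.
ActiveRecord::Base.connection_pool.with_connection do |conn|
  conn.execute("SELECT 1")   # placeholder query
end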
EDIT:
Here's the relevant MySQL SHOW PROCESSLIST log.
I'm running Sidekiq on 2 different machines:
20.7.4.5
20.7.4.6
Focusing on 20.7.4.5, there were 10 connections, all of them currently sleeping. If I understand correctly:
1 is the Passenger connection.
3 are the currently "busy" (stuck) Sidekiq workers.
6 are unclosed connections.
There's no long-running query here, since all the connections are currently sleeping (idle, waiting to be terminated after the default timeout of 8 hours), is that correct?
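
For reference, the same check can be run from a Rails console on either machine; a sketch (the exact row format depends on the database adapter):

result = ActiveRecord::Base.connection.execute("SHOW PROCESSLIST")
result.each do |row|
  # Each row carries the connection's id, user, host, db, command,
  # idle time, and state.
  puts row.inspect
end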
EDIT 2:
So it turns out the issue has something to do with our DB configuration. We are using this schema:
Sidekiq workers => Load balancer => DB clusters.
With this setup, the Sidekiq workers start hanging after a while (completing jobs MUCH more slowly, up to 3000s, when they usually take only 1s).
However, if we set the workers up to talk directly to the DB cluster, everything works flawlessly. So something is probably wrong with our setup, and this is not a Sidekiq issue.
Thanks for all the help guys.

Current Sidekiq job lost when deploying to Heroku

I have a Sidekiq job that runs for a while, and when I deploy to Heroku while the job is running, it can't finish within the few seconds available before the dyno shuts down.
That is fine, as the job is designed to be able to be re-run if needed.
The problem is that the job gets lost (instead of being pushed back to Redis and run again after the deploy).
I found that it is advised to set :timeout: 8 on Heroku, and I tried it, but it had no effect (I also tried setting it to 5).
When there is an exception I get errors reported, but I don't see any, so I'm not sure what could be wrong.
Any tips on how to debug this?
The free version of Sidekiq will push unfinished jobs back to Redis after the timeout has passed, default of 8 seconds. Heroku gives a process 10 seconds to shut down. That means we have 2 seconds to get those jobs back to Redis or they will be lost. If your network is slow, if the Redis server is swapping, etc, that 2 sec deadline might not be met and the jobs lost.
You were on the right track: one answer is to lower the timeout so you have a better chance of meeting that deadline. But network or swapping delay can't be predicted: even 5 seconds might not be enough time.
Under normal healthy conditions, things should work as designed. Keep your machines healthy (uncongested network, plenty of RAM) and the basic fetch should work well. Sidekiq Pro's reliable fetch feature is a fundamental redesign of how Sidekiq fetches jobs and works around all of these issues by keeping jobs in Redis all the time so they can't be lost. But it comes with serious trade-offs too: it's more complicated, slower, and more Redis-intensive than "basic" fetch.
In short, I don't know why you are losing jobs but make sure your instances and Redis server are healthy and the latency is low.
https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
This is actually a feature of Sidekiq - designed to steer you toward the paid Pro version:
http://sidekiq.org/products/pro
RELIABILITY
More reliable message processing.
Cloud environments are noisy and unreliable. Seeing timeouts? Wild swings in latency or performance? Ruby VM crashes or processes disappearing?
If a Sidekiq process crashes while processing a job, that job is lost.
If the Sidekiq client gets a networking error while pushing a job to Redis, an exception is raised and the job is not delivered.
Sidekiq Pro uses Redis's RPOPLPUSH command to ensure that jobs will not be lost if the process crashes or gets a KILL signal.
The Sidekiq Pro client can withstand transient Redis outages or timeouts. It will enqueue jobs locally upon error and attempt to deliver those jobs once connectivity is restored.
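
For the curious, the RPOPLPUSH pattern mentioned there looks roughly like this with plain redis-rb; the queue and backup list names are made up, and this is a sketch of the idea, not Sidekiq Pro's actual code:

require 'redis'

redis = Redis.new
# Atomically move the job from the queue to a backup list, so a crash
# between fetch and completion cannot lose it.
job = redis.rpoplpush('queue:default', 'queue:default:in_progress')
if job
  process(job)   # placeholder for the actual work
  # Acknowledge: remove the backup copy only after the work has finished.
  redis.lrem('queue:default:in_progress', 1, job)
end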
A deploy terminates all processes that belong to the user, therefore the job is lost. There is actually not much you can do there.
As @mike-perham and @esse noted, Sidekiq is designed in a way that can lose jobs due to its fetching mechanism. Your options to get around this are:
To buy Sidekiq Pro (although it was reported to cause the same issue)
To write your own fetcher (but that would mean you can not use most of 3rd party libraries, as they will not work with your custom fetcher)
To mimic Sidekiq Pro's reliable fetch by backing up your jobs data. In case you are up for this way, check out attentive_sidekiq gem which does exactly that.

delayed_job process silently quits

I wish I had more information to put here, but I'm kind of just casting out the nets and hoping someone has some ideas on what I can try or which direction to look in. Basically, I have a Rails app that uses delayed jobs. It offloads a process that takes about 10 or 15 minutes to a background task. It was working fine up until yesterday. Now every time I log onto the server, I find that there are no delayed_job processes running. I've restarted, stopped and started, etc. a dozen times and am getting nowhere. The second it tries to process the first item in the queue, the process gets killed, and nothing gets logged to the log file.
I tried running it like this:
RAILS_ENV=production script/delayed_job run
Instead of the normal daemon:
RAILS_ENV=production script/delayed_job start
and that did not give me any more info. Here is the output:
delayed_job: process with pid 4880 started.
Killed
It runs for probably 10 seconds before it is just killed. I have no idea where to start on this. I've tried a number of things, like downgrading the daemons gem to 1.0.10 as suggested in other posts.
Any ideas would be amazing.
The solution, in case anyone else comes across this, was that the process was simply running out of memory and the OS was killing it.
I ran the jobs a couple of times and watched top while waiting. I saw the memory usage for that pid slowly climb until the process was killed and all its memory released.
Hope that might help someone.
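
If anyone hits the same thing, a quick way to confirm the theory without staring at top is to log the worker's resident set size from inside the job; a Linux-only sketch (it reads /proc):

def log_rss(tag)
  # VmRSS is the resident set size of the current process, in kB.
  rss_kb = File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+)/, 1]
  Rails.logger.info "[#{tag}] worker RSS: #{rss_kb} kB"
end

log_rss("before import")   # call at checkpoints inside the long-running job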

running fork in delayed job

We use delayed_job in our web application, and we need multiple delayed_job workers running in parallel, but we don't know how many will be needed.
The solution I'm currently trying is running one worker and calling fork/Process.detach inside the task that needs it (see the sketch below).
I previously tried running fork directly in the Rails application, but it didn't work well with Passenger.
This solution seems to work well. Could there be any caveats in production?
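
For context, the pattern being described looks roughly like this; a sketch, where the job class and work method are placeholders, and note that the child has to open its own database connection since it can't safely share the parent's across a fork:

class ParallelTask < Struct.new(:record_id)
  def perform
    pid = fork do
      # The child inherits the parent's DB connection; open a fresh one.
      ActiveRecord::Base.establish_connection
      do_the_work(record_id)   # placeholder for the actual work
      exit!                    # skip at_exit hooks inherited from the parent
    end
    Process.detach(pid)        # reap the child in the background; no zombies
  end
end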
One issue that happened to me today, and that anyone trying this should watch out for, is the following:
I noticed that the worker was down, so I started it. Something I didn't think about was that there were 70 jobs waiting in the queue, and since the processes are forked, they pretty much killed our server for around half an hour by all starting almost immediately and eating all the memory in the process.. :]
So ensuring that there is God watching over the worker is important.
Also, the worker seems to die often, but I'm not sure yet if that's connected to the forking.
