Delayed Job worker daemons randomly stop working jobs off the queue - ruby-on-rails

At least once a day my Delayed::Job workers will randomly stop working jobs off the queue, yet the processes are still alive.
Pictured: "Zombies"
When I inspect the remaining jobs in the queue, none show that they are locked or being worked by the zombified workers in question. Even when looking at failed jobs, it's hard to draw a definite connection between a failure and the workers going into zombie mode.
I have a theory that a job has an error that causes workers to segfault, but not completely die. Is there any way to inspect a worker process and see what it's doing? How would one go about debugging this issue when there's not even a stacktrace or failed job to inspect?
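One way to inspect a stuck worker is to have it dump its thread backtraces on demand. Below is a minimal sketch, assuming you can add it to code loaded by the Delayed Job worker process (e.g. an initializer); the signal choice and output format are illustrative, borrowed from the TTIN convention Sidekiq uses:

```ruby
# Hypothetical addition to code loaded by the Delayed Job worker process:
# dump every thread's backtrace on demand so a "zombie" worker can be
# inspected with `kill -TTIN <pid>`.
Signal.trap("TTIN") do
  Thread.list.each do |thread|
    # Thread#backtrace may be nil for threads that have not started running.
    backtrace = thread.backtrace || ["<no backtrace available>"]
    warn "== Thread #{thread.object_id} (status: #{thread.status}) =="
    warn backtrace.join("\n")
  end
end
```

With that in place, sending TTIN to a zombified worker prints where every thread is blocked, which is usually enough to identify the job or library call it is hung on.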

Related

Identify the retry job in Sidekiq

When a job in Sidekiq fails, it goes into the retry queue and is retried up to 25 times, per https://github.com/mperham/sidekiq/wiki/Error-Handling#automatic-job-retry. So the question is: is there any way to find out whether the job currently being executed is running for the first time, or whether it is the n'th retry of that job?
Note this job is running separately in a worker.
P.S.: I'm new to Sidekiq, workers and async jobs, so pardon me if the question is not clear or an obvious one.
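One approach is a sketch along these lines: Sidekiq stores a retry_count entry in the job payload once the job has failed at least once, so a custom server middleware can expose that value to the running job. The middleware class name and thread-local key below are my own, not part of Sidekiq:

```ruby
# A server-middleware sketch; the class name and thread-local key are assumptions.
# Sidekiq adds a "retry_count" entry to the job payload after the first failure,
# so its absence means the current run is the first attempt.
class RetryCountMiddleware
  def call(_worker, job, _queue)
    # nil => first execution; 0 => first retry; 1 => second retry; and so on.
    Thread.current[:sidekiq_retry_count] = job["retry_count"]
    yield
  ensure
    Thread.current[:sidekiq_retry_count] = nil
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add RetryCountMiddleware
  end
end
```

Inside `perform`, `Thread.current[:sidekiq_retry_count]` is then `nil` on the first run, `0` on the first retry, `1` on the second, and so on.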

How to have Dask workers terminate when done?

I can't just shut down the entire cluster like in this answer because there might be other jobs running. I run one cluster in order to avoid having to use Kubernetes. Jobs get submitted to this cluster, but they call into C libraries that leak memory.
The workers run one thread per process, so it would be acceptable to terminate the entire worker process and have it be restarted.
I can't just use os.kill from the task itself because the task's return value has to be propagated back through Dask. I have to get Dask to terminate the process for me at the right time.
Is there any way to do this?

Kill multiple Sidekiq jobs from the same worker

I would like to know how to kill many Sidekiq jobs from the same worker at once.
I deployed a bug to a production environment and there are queued jobs that are bugging out. I can simply fix the bug and deploy again, but the jobs are time-sensitive (they send out SMS alerts to people).
Once the bug is fixed and deployed, those queued jobs will be executed and many people will get outdated SMS alerts. So I would like to kill all the jobs from that worker before deploying my fix.
Any suggestions? The buggy jobs are enqueued with many other jobs and I can't just remove all jobs from one queue.
Ideally you should enqueue those messages to a different queue so you can clear that queue on its own. There's no other efficient way to remove a set of jobs.
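As a sketch of what that looks like with the Sidekiq API (the queue names and worker class below are placeholders, not from the original setup):

```ruby
require "sidekiq/api"

# Ideal case: the SMS jobs live on their own queue, so the whole queue can be
# dropped in one call without touching jobs elsewhere.
Sidekiq::Queue.new("sms_alerts").clear

# Fallback for the situation in the question, where the jobs are mixed into a
# shared queue: scan it and delete only the buggy worker's jobs. This walks
# the entire queue, so it is the slow path.
Sidekiq::Queue.new("default").each do |job|
  job.delete if job.klass == "SmsAlertWorker"
end
```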

Can 2 sidekiq worker threads process the same job?

Is it possible that one job is being processed twice by two different Sidekiq threads? I am using Sidekiq to insert some analytics events into a MongoDB collection asynchronously. I see around 15 duplicates in that collection. My guess is that two worker threads picked up the same job at the same time and each added it to the collection.
Does Sidekiq ensure that a job is picked up by only one thread? We can ignore the restart case, as the jobs are small and will complete in less than 8s.
Is firing analytics events asynchronously with Sidekiq not a good practice? What are my options? I could add a unique key to the event and check it before insert to avoid duplicates, but that's adding data (plus an overhead/query) that I am never going to use (and it adds up over millions of events). Can I somehow ensure that a job is processed only once by Sidekiq?
Thanks for your help.
No. Sidekiq uses Redis as a work queue for background processing. Redis provides atomic operations for adding jobs to the queue and popping jobs off the queue (specifically the Redis BRPOP command). Each Sidekiq worker fetches a job from the queue with a timeout via BRPOP, and any given job popped from the queue is returned to only one of the workers pulling work from it.
What is more likely is that you are enqueuing the job multiple times.
Another possibility is that your job is throwing an error, causing it to partially execute and then be retried multiple times. By default Sidekiq retries failed jobs, but it has no built-in mechanism for transactions/atomicity of work. I.e., if your Sidekiq job does A, B, and C, and doing B raises an exception that causes the job to fail, the job will be retried, causing A to be run again on each retry.
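If the retry scenario is the culprit, making the write idempotent along the lines the question already suggests avoids duplicates without any extra coordination. A rough sketch (the worker name, collection name, and MONGO_CLIENT constant are assumptions, not from the original code):

```ruby
class AnalyticsEventWorker
  include Sidekiq::Worker

  def perform(event_id, payload)
    # An upsert keyed on the caller-supplied event_id makes the write idempotent:
    # retrying the job after a partial failure cannot create a duplicate document.
    collection.update_one(
      { event_id: event_id },
      { "$setOnInsert" => payload.merge("event_id" => event_id) },
      upsert: true
    )
  end

  private

  def collection
    # Assumes a configured Mongo::Client is available as MONGO_CLIENT.
    MONGO_CLIENT[:analytics_events]
  end
end
```

The event_id would be generated by the caller when the job is enqueued, so re-running the job can only ever touch the same document.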

Long-running Sidekiq jobs keep dying

I'm using the sidekiq gem to process background jobs in Rails. For some reason, the jobs just hang after a while -- the process either becomes unresponsive, showing up in top but not doing much else, or mysteriously vanishes without errors (nothing is reported to airbrake.io).
Has anyone had experience with this?
Use the TTIN signal to get a backtrace of all threads in the process so you can figure out where the workers are stuck.
https://github.com/mperham/sidekiq/wiki/Signals
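For example (the pidfile path below is an assumption about how Sidekiq was started; `kill -TTIN <pid>` from a shell works just as well):

```ruby
# Send TTIN to the running Sidekiq process; it responds by logging a backtrace
# for every live thread, which shows where each stuck job is blocked.
sidekiq_pid = File.read("tmp/pids/sidekiq.pid").to_i
Process.kill("TTIN", sidekiq_pid)
```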
I've experienced this, and haven't found a solution/root cause.
I couldn't resolve this cleanly, but came up with a hack.
I configured God to monitor my Sidekiq processes, and to restart them if a file changed.
I then set up a cron job that ran every 5 minutes and checked all the current Sidekiq workers for a queue. If a certain percentage of the workers had a start time of 5 minutes or more in the past (i.e., had been busy on the same job for at least 5 minutes), it meant those workers had hung for some reason. When that happened, I touched a file, which made God restart Sidekiq. For me, 5 minutes was ideal, but it depends on how long your jobs typically run.
This is the only way I could resolve hanging Sidekiq jobs without manually checking on them every hour and restarting them myself.
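A sketch of what that cron check might look like with the Sidekiq API (the 5-minute threshold matches the description above, while the 50% ratio and the touched file path are placeholders; God is assumed to restart Sidekiq when that file changes, and the exact shape of the data yielded by Sidekiq::Workers varies between Sidekiq versions):

```ruby
require "sidekiq/api"
require "fileutils"

STUCK_AFTER = 5 * 60 # seconds; matches the 5-minute window described above

# Collect the "work" hashes for everything the workers are currently busy on.
busy = []
Sidekiq::Workers.new.each do |_process_id, _thread_id, work|
  busy << work
end

# "run_at" is the epoch time the worker started its current job.
stuck = busy.count { |work| Time.now.to_i - work["run_at"].to_i >= STUCK_AFTER }

# Touch the file God is watching when "enough" of the busy workers look stuck;
# the 50% ratio and the file path are arbitrary placeholders.
if busy.any? && stuck.to_f / busy.size >= 0.5
  FileUtils.touch("tmp/restart_sidekiq.txt")
end
```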
