I've been using Sidekiq for a while now and it was working flawlessly (up to 5 million jobs processed). However, in the past few days the workers have been getting stuck, leaving jobs unprocessed. Only after restarting the workers do they start consuming jobs again, but they eventually get stuck again (after roughly 10-30 minutes; I haven't done any exact measurements).
Here's my setup:
Rails v4.2.5.1, with ActiveJob.
MySQL DB, clustered (with 3 masters)
ActiveRecord::Base.connection_pool set to 32 (verified in Sidekiq process as well).
2 Sidekiq worker processes, 3 threads per process (6 threads total).
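For reference, this is roughly how the pool size was verified from inside the Sidekiq process (a sketch; the initializer path and log message are illustrative, ActiveRecord::Base.connection_pool is the standard Rails API):

# config/initializers/sidekiq.rb (sketch)
Sidekiq.configure_server do |config|
  pool = ActiveRecord::Base.connection_pool
  Rails.logger.info "Sidekiq booted with ActiveRecord pool size #{pool.size}"
end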
Symptoms:
Right after the workers are restarted, they process jobs fast (~1 s per job).
After several jobs have been processed, the time needed to complete a job (the same kind of job that previously took only ~1 s) suddenly spikes to ~2900 s, which makes the worker look stuck.
The slowdown affects every kind of job (there is no specific offending job).
CPU usage and memory consumption are normal, and there is no swapping either.
Here is the TTIN backtrace log. It seems like the process hangs in:
retrieve_connection
clear_active_connections
But I'm not sure why this is happening. Has anyone had a similar experience, or does anyone know something about this issue? Thanks in advance.
EDIT:
Here's the relevant MySQL SHOW PROCESSLIST output.
I'm running Sidekiq on 2 different machines:
20.7.4.5
20.7.4.6
Focusing on 20.7.4.5, there are 10 connections, all of them currently sleeping. If I understand correctly:
1 is the Passenger connection.
3 are the currently "busy" (stuck) Sidekiq workers.
6 are unclosed connections.
There's no long-running query here, since all the connections are currently sleeping (idle, waiting to be terminated after the default timeout of 8 hours). Is that correct?
EDIT 2:
So it turns out the issue has something to do with our DB configuration. We are using this schema:
Sidekiq workers => Load balancer => DB clusters.
With this setup, the Sidekiq workers start hanging after a while (completing jobs MUCH more slowly, up to 3000 s, when they usually take only 1 s).
However, if we set up the workers to talk directly to the DB cluster, everything works flawlessly. So something is probably wrong with our setup, and this is not a Sidekiq issue.
Thanks for all the help guys.
Related
Right now I have my Puma setup running successfully, with zero-downtime deployment via phased restarts (served behind nginx).
It works pretty nicely, since no connections are lost and new versions become available seamlessly.
But: right now we set the process count to the number of CPUs minus 1, so on a big server with 32 CPUs there are 31 Puma processes, which get restarted one by one during a phased restart.
This takes a really long time for us (around 15 minutes, since each process takes about 30 seconds to boot; yes, many gems, big system).
I saw that clustered mode can also be used for fast deployments by setting preload_app!, but I can't understand what happens during the deployment:
Will current requests get dropped? Is there a small timeframe where no new connections are accepted? I tried to figure it out via the READMEs, but it wasn't clear to me what exactly happens.
Would be great to hear an explanation, thank you!
OK, I tested it myself: new connections get stuck and you have to wait until the workers have finished loading. Existing connections are not closed. So for zero-downtime deployments you really do need phased restarts (with the disadvantage that they take quite a long time if you have many workers).
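For anyone else comparing the two modes, here is a minimal sketch of a config/puma.rb that keeps phased restarts possible (the worker count, paths and pumactl invocation are illustrative, not our exact setup):

# config/puma.rb (sketch)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 31))   # cluster mode is required for phased restarts
threads 1, 5

# preload_app! would speed up worker boot by loading the app once in the parent,
# but preloaded workers can only pick up new code via a full restart,
# so it is incompatible with phased restarts. Leave it out here.
# preload_app!

prune_bundler                 # lets a phased restart load new gem versions
state_path "tmp/puma.state"
activate_control_app          # then deploy with: pumactl -S tmp/puma.state phased-restart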
Here is my problem: each night I have to process around 50k background jobs, each taking an average of 60 s. Those jobs basically call the Facebook, Instagram and Twitter APIs to collect users' posts and save them in my DB. The jobs are processed by Sidekiq.
At first, my setup was:
:concurrency: 5 in sidekiq.yml
pool: 5 in my database.yml
RAILS_MAX_THREADS set to 5 in my Web Server (puma) configuration.
My understanding is:
my web server (rails s) will use at most 5 threads and hence at most 5 connections to my DB, which is fine since the connection pool is set to 5.
my Sidekiq process will use 5 threads (as the concurrency is set to 5), which is also fine since the connection pool is set to 5.
In order to process more jobs at the same time and reduce the overall time to process all my jobs, I decided to increase the Sidekiq concurrency to 25. In production, I provisioned a Heroku Postgres Standard database with a maximum of 120 connections, to be sure I would be able to use that Sidekiq concurrency.
Thus, now the setup is:
:concurrency: 25 in sidekiq.yml
pool: 25 in my database.yml
RAILS_MAX_THREADS set to 5 in my Web Server (puma) configuration.
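(As a sanity check, I can run something like this throwaway worker to see which pool size and concurrency the Sidekiq process actually uses at runtime; the worker itself is hypothetical, and Sidekiq.options is the pre-Sidekiq-7 API:)

# app/workers/pool_check_worker.rb (sketch)
class PoolCheckWorker
  include Sidekiq::Worker

  def perform
    pool = ActiveRecord::Base.connection_pool
    Sidekiq.logger.info "AR pool size: #{pool.size}"
    Sidekiq.logger.info "Connections in use: #{pool.connections.count(&:in_use?)}"
    Sidekiq.logger.info "Sidekiq concurrency: #{Sidekiq.options[:concurrency]}"
  end
end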
I can see that 25 Sidekiq workers are busy, but each job is taking far more time (sometimes more than 40 minutes instead of 1 minute)!?
Actually, I've been doing some tests and realized that processing 50 of my jobs with a Sidekiq concurrency of 5, 10 or 25 results in the same total duration. It's as if there were a bottleneck of 5 connections somewhere.
I have checked Sidekiq Documentation and some other posts on SO (sidekiq - Is concurrency > 50 stable?, Scaling sidekiq network archetecture: concurrency vs processes) but I haven't been able to solve my problem.
So I am wondering:
Is my understanding of the Rails database.yml connection pool and Sidekiq concurrency right?
What's the correct way to set up those parameters?
Dropping this here in case someone else could use a quick, very general pointer:
Sometimes increasing the number of concurrent workers may not yield the expected results.
For instance, if there's a large discrepancy between the number of tasks and the number of cores, the scheduler will keep switching between your tasks and there isn't really much to gain; the jobs will just take about the same time, or a bit more.
Here's a link to a rather interesting read on how job scheduling works: https://en.wikipedia.org/wiki/Scheduling_(computing)#Operating_system_process_scheduler_implementations
There are other aspects to consider as well, such as datastore access: are your workers using the same table(s)? Is the table backed by a storage engine that locks the entire table, such as MyISAM? If so, it won't matter that you have 100 workers running at the same time with enough RAM and cores; they will all be waiting in line for whichever query is currently running to release the lock on the table they're all meant to be working with.
This can also happen with tables using engines such as InnoDB, which doesn't lock the entire table on writes; you may still have different workers contending for the same rows (InnoDB uses row-level locking), or simply some large indexes that don't lock anything but do slow writes down.
Another issue I've encountered is that Rails (which I'm assuming you're using) can take quite a toll on RAM in some cases, so you might want to look at your memory footprint as well.
My suggestion is to turn on logging and look at the data: where do your workers spend most of their time? Is it something at the network layer (unlikely)? Waiting to get access to a core? Reading/writing to your data store? Is your machine swapping?
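As a concrete starting point for that logging, a tiny Sidekiq server middleware can record per-job timings (a sketch; Sidekiq already logs elapsed time for each job by default, but this shows where to hook in whatever extra measurements you need):

# Hypothetical middleware, registered in an initializer.
class JobTimingLogger
  def call(worker, job, queue)
    started = Time.now
    yield
  ensure
    Sidekiq.logger.info "#{job['class']} on #{queue} took #{(Time.now - started).round(2)}s"
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobTimingLogger
  end
end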
I have a Sidekiq job that runs for a while, and when I deploy to Heroku while the job is running, it can't finish within the few seconds it gets before shutdown.
That is fine, as the job is designed to be able to be re-run if needed.
The problem is that the job gets lost (instead of being put back into Redis and run again after the deploy).
I found that it is advised to set :timeout: 8 on Heroku and I tried that, but it had no effect (I also tried setting it to 5).
When there is an exception I get errors reported, but I don't see any here, so I'm not sure what could be wrong.
Any tips on how to debug this?
The free version of Sidekiq pushes unfinished jobs back to Redis after the shutdown timeout has passed, which defaults to 8 seconds. Heroku gives a process 10 seconds to shut down. That means we have 2 seconds to get those jobs back into Redis or they will be lost. If your network is slow, if the Redis server is swapping, etc., that 2-second deadline might not be met and the jobs are lost.
You were on the right track: one answer is to lower the timeout so you have a better chance of meeting that deadline. But network or swapping delay can't be predicted: even 5 seconds might not be enough time.
Under normal, healthy conditions, things should work as designed. Keep your machines healthy (uncongested network, plenty of RAM) and the basic fetch should work well. Sidekiq Pro's reliable fetch feature is a fundamental redesign of how Sidekiq fetches jobs and works around all of these issues by keeping jobs in Redis all the time so they can't be lost. But it comes with serious trade-offs too: it's more complicated, slower and more Redis-intensive than "basic" fetch.
In short, I don't know why you are losing jobs, but make sure your instances and Redis server are healthy and the latency is low.
https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
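For completeness, the shutdown timeout can be lowered in sidekiq.yml (as the question already does with :timeout:), with the -t command-line flag, or programmatically. A hedged sketch of the programmatic form on the Sidekiq versions of that era (Sidekiq.options was replaced by a different config API in Sidekiq 7+):

# config/initializers/sidekiq.rb (sketch, pre-Sidekiq-7 API)
Sidekiq.configure_server do |config|
  config.options[:timeout] = 5  # seconds to wait for jobs before pushing them back to Redis
end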
This is actually a feature of Sidekiq, designed to steer you toward the paid Pro version:
http://sidekiq.org/products/pro
RELIABILITY
More reliable message processing.
Cloud environments are noisy and unreliable. Seeing timeouts? Wild swings in latency or performance? Ruby VM crashes or processes disappearing?
If a Sidekiq process crashes while processing a job, that job is lost.
If the Sidekiq client gets a networking error while pushing a job to Redis, an exception is raised and the job is not delivered.
Sidekiq Pro uses Redis's RPOPLPUSH command to ensure that jobs will not be lost if the process crashes or gets a KILL signal.
The Sidekiq Pro client can withstand transient Redis outages or timeouts. It will enqueue jobs locally upon error and attempt to deliver those jobs once connectivity is restored.
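To make the RPOPLPUSH idea concrete, here is a rough illustration of the pattern with the plain redis gem; this is not Sidekiq Pro's actual implementation, and the key names are made up:

require "redis"

redis = Redis.new

# Atomically move a job from the queue onto a backup list.
raw = redis.rpoplpush("queue", "working")
if raw
  begin
    # ... decode and perform the job here ...
    redis.lrem("working", -1, raw)   # success: remove our backup copy
  rescue
    # On a crash or error the copy stays in "working",
    # and a recovery pass can push it back onto "queue" later.
    raise
  end
end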
A deploy terminates all processes that belong to the user, so the job is lost. There is actually not much you can do there.
As #mike-perham and #esse noted, Sidekiq is designed in a way that can lose jobs because of its fetching mechanism. Your options to get around this are:
To buy Sidekiq Pro (although it was reported to cause the same issue)
To write your own fetcher (but that would mean you cannot use most third-party libraries, as they will not work with your custom fetcher)
To mimic Sidekiq Pro's reliable fetch by backing up your job data. If you're up for that route, check out the attentive_sidekiq gem, which does exactly that.
We use Delayed Job in our web application and we need multiple Delayed Job workers running in parallel, but we don't know how many will be needed.
The solution I'm currently trying is running one worker and calling fork / Process.detach inside the task that needs it.
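Roughly, the task does something like this (simplified sketch; do_the_heavy_work stands in for the real job body):

def run_in_background
  pid = fork do
    # Forked children share the parent's DB socket, so give the child its own connection.
    ActiveRecord::Base.establish_connection
    begin
      do_the_heavy_work
    ensure
      ActiveRecord::Base.connection_pool.disconnect!
    end
  end
  Process.detach(pid)  # don't wait on the child, and avoid leaving a zombie
end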
I previously tried running fork directly in the Rails application, but it didn't work too well with Passenger.
This solution seems to work well. Could there be any caveats in production?
One issue that happened to me today, and which anyone trying this should watch out for, is the following:
I noticed that the worker was down, so I started it. Something I didn't think about was that there were 70 jobs waiting in the queue. Since the processes are forked, they pretty much killed our server for around half an hour by all starting almost immediately and eating all the memory in the process. :]
So ensuring that there is something like God watching over the worker is important.
Also, the worker seems to die often, but I'm not sure yet whether that's connected to the forking.
I'm using the sidekiq gem to process background jobs in Rails. For some reason, the jobs just hang after a while: the process either becomes unresponsive (showing up in top but not doing much else) or mysteriously vanishes, without errors (nothing is reported to airbrake.io).
Has anyone had experience with this?
Use the TTIN signal to get a backtrace of all threads in the process so you can figure out where the workers are stuck.
https://github.com/mperham/sidekiq/wiki/Signals
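For example, from the same host (the pid file path is illustrative; the backtraces go to Sidekiq's log output, not your console):

sidekiq_pid = File.read("tmp/pids/sidekiq.pid").to_i
Process.kill("TTIN", sidekiq_pid)
# or from a shell: kill -TTIN <pid>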
I've experienced this, and haven't found a solution/root cause.
I couldn't resolve this cleanly, but came up with a hack.
I configured God to monitor my Sidekiq processes, and to restart them if a file changed.
I then set up a cron job that ran every 5 minutes and checked all the currently busy Sidekiq workers for a queue. If a certain percentage of those workers had started their current job 5 or more minutes earlier, it meant they had hung for some reason. If that happened, I touched a file, which made God restart Sidekiq. For me 5 minutes was the right threshold, but it depends on how long your jobs typically run.
This was the only way I could deal with hanging Sidekiq jobs without manually checking on them every hour and restarting Sidekiq myself.
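Roughly, the cron-driven check looked like this (a reconstruction using Sidekiq's public Workers API; the thresholds and file path are illustrative):

require "sidekiq/api"
require "fileutils"

STUCK_AFTER  = 5 * 60                     # seconds a job may run before we call it stuck
STUCK_RATIO  = 0.5                        # restart once half of the busy workers look stuck
RESTART_FILE = "tmp/restart_sidekiq.txt"  # the file God watches for changes

workers = Sidekiq::Workers.new.to_a
unless workers.empty?
  stuck = workers.count { |_process, _thread, work| Time.now.to_i - work["run_at"] > STUCK_AFTER }
  FileUtils.touch(RESTART_FILE) if stuck.to_f / workers.size >= STUCK_RATIO
end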