I am running a Rails application locally on a VM with 64 cores and 192 GB RAM. We run 12 Sidekiq processes with 10 threads each, plus 40 Puma processes.
App memory and CPU allocation are handled by Docker.
The problem we are facing: when we enqueue a large number of jobs, say 1,000, Sidekiq picks up 120 jobs (12 processes × 10 threads) and starts processing them.
Mid-execution, the number of running jobs suddenly drops to 0, and Sidekiq stops picking up new jobs.
All the threads in each process show the busy status, but none of them are processing any jobs.
The problem goes away when we reduce the number of Sidekiq processes to 2: all jobs process normally and Sidekiq keeps picking up new jobs from the enqueued set.
What is the correct way to debug this issue, or how should I proceed further?
I tried reducing the number of threads while keeping the number of processes the same, but that resulted in the same issue.
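One way to proceed (a hedged sketch, not an official recipe): Sidekiq logs a backtrace for every thread when it receives the TTIN signal, which shows where the "busy" threads are actually blocked. The `pgrep -f sidekiq` pattern below is an assumption about how the processes are named on the host; adjust it to your deployment.

```ruby
# Hedged debugging sketch: ask every local Sidekiq process to log a
# backtrace of all its threads (Sidekiq's documented TTIN behaviour).
# The pgrep pattern is an assumption about process naming.
pids = `pgrep -f sidekiq`.split.map(&:to_i)
pids.each { |pid| Process.kill("TTIN", pid) }
```

Comparing those backtraces against your Redis and database connection settings usually narrows down whether the threads are blocked on Redis, on the database, or inside the job code itself.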
Related
I have a job that takes more than an hour to execute. Because of that, the remaining jobs stay enqueued and cannot start, so I have decided to set a maximum run time for background jobs. Is there any way to set a timeout for jobs in Sidekiq?
You cannot time out or stop jobs in Sidekiq; doing so is dangerous and can corrupt your application data.
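What people do instead (a hedged sketch, not a Sidekiq feature): make the long job watch its own clock, stop cleanly before a deadline, and re-enqueue itself to continue. The class name, deadline value, and work loop below are all illustrative assumptions.

```ruby
require "sidekiq"

# Hedged sketch of cooperative time-limiting: the job checks its own
# elapsed time and re-enqueues the remaining work instead of being killed.
class LongRunningJob
  include Sidekiq::Worker
  DEADLINE_SECONDS = 50 * 60

  def perform(cursor = 0)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    while cursor < 1_000_000                       # stand-in for the real work
      cursor += do_one_unit_of_work(cursor)

      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      if elapsed > DEADLINE_SECONDS
        self.class.perform_async(cursor)           # hand the rest to a fresh job
        return
      end
    end
  end

  private

  def do_one_unit_of_work(cursor)
    # placeholder: process one batch starting at `cursor`, return units handled
    100
  end
end
```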
It sounds like you only have one Sidekiq process with a concurrency of 1. You can start multiple Sidekiq processes, which will work on different jobs, and you can also increase each process's concurrency to the same effect.
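For reference, a hedged sketch of the two knobs (the path and numbers are illustrative, not recommendations): `concurrency` in `config/sidekiq.yml` sets the threads per process, and every additional `bundle exec sidekiq -C config/sidekiq.yml` you start against the same Redis adds another process pulling from the same queues.

```yaml
# config/sidekiq.yml -- illustrative values, tune for your workload
:concurrency: 10   # threads per Sidekiq process
:queues:
  - default
```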
I've been using Sidekiq for a while now and it was working flawlessly (up to 5 million jobs processed). However, in the past few days the workers have been getting stuck, leaving jobs unprocessed. Only by restarting the workers do they start working and consuming jobs again, but they eventually get stuck again (after roughly 10-30 minutes; I haven't done any exact measurements).
Here's my setup:
Rails v4.2.5.1, with ActiveJob.
MySQL DB, clustered (with 3 masters)
ActiveRecord::Base.connection_pool set to 32 (verified in Sidekiq process as well).
2 Sidekiq workers, 3 threads per worker (6 total).
Symptoms:
Right after the workers are restarted, they process jobs fast (~1s each).
After several jobs have been processed, the time needed to complete a job (the same job that previously took only ~1s) suddenly spikes to ~2900s, which makes the worker look stuck.
The slowdown affects every kind of job (there is no specific offending job).
CPU usage and memory consumption are normal, and there is no swapping either.
Here is the TTIN log. It seems like the process hung when:
retrieve_connection
clear_active_connections
But I'm not sure why that is happening. Has anyone had a similar experience or does anyone know something about this issue? Thanks in advance.
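Since both frames point at connection handling, one thing worth logging from inside the Rails environment on the worker host (a hedged sketch; `#stat` only exists on Rails 5.1+, so on Rails 4.2 you would rely on the other accessors) is whether the ActiveRecord pool is actually exhausted when the hang happens.

```ruby
# Hedged inspection sketch: see whether the ActiveRecord pool is saturated.
# `pool.stat` exists on Rails 5.1+; the other calls work on older versions.
pool = ActiveRecord::Base.connection_pool
puts "configured size: #{pool.size}"
puts "checked out:     #{pool.connections.count(&:in_use?)}"
puts pool.stat.inspect if pool.respond_to?(:stat)
```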
EDIT:
Here's the relevant MySQL SHOW PROCESSLIST output.
I'm running Sidekiq on 2 different machines:
20.7.4.5
20.7.4.6
Focusing on 20.7.4.5, there are 10 connections and all of them are currently sleeping. If I understand correctly:
1 is the Passenger connection.
3 are the currently "busy" (stuck) Sidekiq workers.
6 are the unclosed connections.
There is no long-running query here, since all the connections are sleeping (idle, waiting to be terminated after the default timeout of 8 hours). Is this correct?
EDIT 2:
So it turns out the issue has something to do with our DB configuration. We are using this topology:
Sidekiq workers => Load balancer => DB clusters.
With this setup, the Sidekiq workers start hanging after a while (completing jobs MUCH slower, up to 3000s, when they usually take only 1s).
However, if we set up the workers to talk directly to the DB cluster, everything works flawlessly. So something is probably wrong with our setup, and this is not a Sidekiq issue.
Thanks for all the help guys.
One of the benefits of Sidekiq over Resque is that it can run multiple jobs in the same process. The drawback, however, is that I can't figure out how to force a set of concurrent jobs to run in different processes.
Here's my use case: say I have to generate 64M rows of data, and I have 8 vCPUs on an Amazon EC2 instance. I'd like to carve the task up into 8 concurrent jobs generating 8M rows each. The problem is that if I'm running 8 Sidekiq processes, sometimes Sidekiq will decide to run two or more of the jobs in the same process, so it doesn't use all 8 vCPUs and takes much longer to finish. Is there any way to tell Sidekiq which worker to use, or to force it to spread a group of jobs evenly among the processes?
The answer is that you can't, easily, by design: specialization is what leads to single points of failure (SPOFs).
You can create a custom queue for each process and then create one job for each queue.
You can use JRuby, which doesn't suffer from the same flaw.
You can execute the processing as a Rake task, which will spawn one process per job, ensuring an even load.
You can carve the work into 64 jobs instead of 8 and get a more even load that way.
I would probably do the last option (sketched below) unless the resulting I/O crushes the machine.
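A hedged sketch of that last option; `RowGeneratorJob` and its `(offset, count)` arguments are assumptions for illustration, not an existing API.

```ruby
require "sidekiq"

# Hedged sketch: split the 64M rows into many small jobs so the load
# spreads evenly across the Sidekiq processes.
class RowGeneratorJob
  include Sidekiq::Worker

  def perform(offset, count)
    # generate rows [offset, offset + count) here
  end
end

TOTAL_ROWS = 64_000_000
CHUNKS     = 64                     # more chunks than vCPUs evens out the load
PER_CHUNK  = TOTAL_ROWS / CHUNKS

CHUNKS.times do |i|
  RowGeneratorJob.perform_async(i * PER_CHUNK, PER_CHUNK)
end
```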
I have scheduled some background tasks using Sidekiq with a worker concurrency of 22, but I am seeing about 13.7% memory consumption even when none of the workers are working. Is this normal, or do I have to change some configuration in Sidekiq to avoid it?
ubuntu 9331 21.8 13.7 1505656 1082988 ? Sl Mar06 557:08 sidekiq 2.7.5 jobs [0 of 22 busy]
Thanks
13.7% seems like a lot of memory, but I don't know how many GB you have (I'm pretty bad at reading top output).
However, even when all your workers are idle, Sidekiq is still running, ready to process new jobs, and that consumes memory.
So depending on how much memory you have, this can be perfectly normal.
In my Resque web UI, it says that currently N of M workers are working. In this case, M is not right. This happened because I rebooted a couple of Resque instances without shutting down the workers first; I'm guessing a SIGKILL leaves the worker registered as active in Redis.
The number of Resque instances I am running is constantly growing and shrinking depending on queue/system load at any given time. Is there any way I can update the number of workers recorded in Redis so that it always stays in sync with what is actually running? Any way to get this number accurate would be great.
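One way to reconcile the registry (a hedged sketch against Resque's public worker API; the host/pid parsing assumes Resque's usual `host:pid:queues` worker id format, and it is safest to run while the affected instances' workers are stopped, since unregistering a live worker hides it until that worker restarts):

```ruby
require "resque"

# Hedged cleanup sketch: unregister workers whose process no longer exists
# on this host. Assumes Resque's "host:pid:queues" worker string format.
hostname = `hostname`.strip

Resque.workers.each do |worker|
  host, pid, _queues = worker.to_s.split(":")
  next unless host == hostname                     # only judge local workers

  alive = system("ps -p #{pid} > /dev/null 2>&1")  # does the pid still exist?
  worker.unregister_worker unless alive
end
```

Resque does something similar itself at worker startup (pruning dead workers on the same host), which is why cleanly restarting a worker on each box also tends to correct the count.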