My Resque Web UI currently says "N of M Workers Working", and M is not right. This happened because I rebooted a couple of Resque instances without shutting down the workers first; I'm guessing a SIGKILL leaves the worker registered as active in Redis.
The number of Resque instances I am running is constantly growing and shrinking depending on the queue/system load at any given time. Is there any way to update the number of workers recorded in Redis so that it always matches the workers actually running? Any way to get this number accurate would be great.
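One way to approach the cleanup, as a rough sketch: Resque registers each worker under an id of the form hostname:pid:queues, so entries that belong to a host that has gone away (or to a pid that is no longer alive) can be picked out and unregistered. The helper names below (parse_worker_id, stale_worker_ids, pid_alive?) are made up for illustration; only Resque.workers and Resque::Worker#unregister_worker, which appear in the trailing comment, are Resque's own calls.

```ruby
# Resque worker ids look like "hostname:pid:queue1,queue2".
WorkerId = Struct.new(:host, :pid, :queues)

# Split a worker id into its parts (hypothetical helper).
def parse_worker_id(id)
  host, pid, queues = id.to_s.split(':', 3)
  WorkerId.new(host, Integer(pid), queues.to_s.split(','))
end

# Ids registered for `host` whose pid is not in `live_pids`.
# For an instance that has been terminated, pass an empty `live_pids`.
def stale_worker_ids(registered_ids, host:, live_pids:)
  registered_ids.select do |id|
    worker = parse_worker_id(id)
    worker.host == host && !live_pids.include?(worker.pid)
  end
end

# True if a process with this pid exists on the local machine.
def pid_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# With Resque loaded, the result could be applied roughly like this:
#
#   ids   = Resque.workers.map(&:to_s)
#   stale = stale_worker_ids(ids, host: 'dead-host', live_pids: [])
#   Resque.workers.each { |w| w.unregister_worker if stale.include?(w.to_s) }
```

Running something like this from a scheduled task, or whenever an instance is torn down, should keep the count close to reality. It is still worth shutting workers down with a catchable signal where possible so they unregister themselves.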
I am running a local Rails application on a VM with 64 cores and 192 GB RAM. We are running 12 Sidekiq processes with 10 threads each, and 40 Puma processes.
The app's memory and CPU allocation is handled by Docker.
The problem we are facing is that when we enqueue a large number of jobs, say 1000, Sidekiq picks up 10*12 jobs and starts processing them.
Mid-execution, the number of running jobs suddenly drops to 0. Moreover, Sidekiq stops picking up new jobs.
All the threads in each process show the busy status, but none of them is processing any job.
This problem goes away when we reduce the number of Sidekiq processes to 2. All the jobs process normally and Sidekiq even picks up new jobs from the Enqueued set.
What is the correct way to debug this issue, or how should I proceed further?
I tried reducing the number of threads while keeping the number of processes the same, but that resulted in the same issue.
I am working on a project that launches a very resource-intensive process from a Rails worker, and on Heroku only a Performance worker can handle it properly. 1X workers are killed because they use too much RAM, and 2X workers can barely handle the load, exceeding their RAM limit by up to 160%. A Performance worker does the job fine with no issues.
My question is, is there a way to dynamically switch the Dyno size to Performance before a job initiates and then scale it back down once the job is finished or a queue is empty?
I know HireFire exists, but to my knowledge that service only increases the number of workers based on queue length and similar metrics. Another possible solution I thought about was using the Heroku API, which has a Dyno endpoint, to resize the worker dyno before the job starts and then resize it back down when the job ends.
Does anyone else have other recommendations, ideas or strategies for this issue?
Thanks!
The best way is the one you mentioned: use the Heroku Platform API to scale your Dyno size up before starting the job, and then down again afterwards.
This is because tools like HireFire only work by inspecting stuff like application response time, router queue, etc. -- so there's no way for them to know you're about to run some job and then scale up just for that.
Depending on the specifics of the usage, you may be able to just create a distinct process type in your Procfile that only runs this particular worker and is always sized as Performance, but isn't always running. You could even run it as a one-off dyno instead of scaling (this can also be done via the API, roughly equivalent to heroku run ...). That said, @rdegges's answer should certainly work.
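A minimal sketch of the scale-up / scale-down idea, assuming a client object that responds to formation.update(app, type, attrs). The platform-api gem's client has that shape as far as I know, but check it against the version you use. The method name with_dyno_size, the app name and the size names are placeholders. As I understand it, resizing a formation restarts its dynos, so this would have to run from outside the worker being resized (a web dyno, a scheduler, a one-off dyno), with the block enqueuing the job and waiting for it to finish.

```ruby
# Resize a process type for the duration of a block, then restore it.
# `client` is any object responding to `formation.update(app, type, attrs)`.
def with_dyno_size(client, app:, type:, size:, restore_size:,
                   quantity: 1, restore_quantity: 0)
  client.formation.update(app, type, { 'size' => size, 'quantity' => quantity })
  begin
    yield
  ensure
    # Runs even if the block raises, so the large dyno is not left running.
    client.formation.update(app, type,
                            { 'size' => restore_size, 'quantity' => restore_quantity })
  end
end

# Possible usage (all names here are placeholders):
#
#   heroku = PlatformAPI.connect_oauth(ENV.fetch('HEROKU_OAUTH_TOKEN'))
#   with_dyno_size(heroku, app: 'my-app', type: 'heavyworker',
#                  size: 'performance-m', restore_size: 'standard-1x') do
#     enqueue_heavy_job_and_wait_until_done # hypothetical helper
#   end
```

The ensure block is the important part: without it, a failed job would leave the expensive dyno size in place.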
I've been using Sidekiq for a while now and it was working flawlessly (up to 5 million jobs processed). However, in the past few days the workers have been getting stuck, leaving jobs unprocessed. Only restarting the workers gets them consuming jobs again, and they eventually get stuck again (after ~10-30 minutes; I haven't taken exact measurements).
Here's my setup:
Rails v4.2.5.1, with ActiveJob.
MySQL DB, clustered (with 3 masters)
ActiveRecord::Base.connection_pool set to 32 (verified in Sidekiq process as well).
2 sidekiq workers, 3 threads per worker (total 6).
Symptoms:
If the workers have just been restarted, they process jobs fast (~1s).
After several jobs have been processed, the time needed to complete a job (the same job that previously took only ~1s) suddenly spikes to ~2900s, which makes the worker look stuck.
The slowdown affects every kind of job (there's no specific offending job).
CPU usage and memory consumption are normal, and there is no swapping.
Here is the TTIN log. It seems like the process hung when:
retrieve_connection
clear_active_connections
But I'm not sure why it is happening. Has anyone had a similar experience, or does anyone know something about this issue? Thanks in advance.
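For anyone wanting to look at the same kind of information from inside their own process, here is a small sketch that collects a backtrace per thread and picks out the threads sitting in a given method. It is not Sidekiq's TTIN code, and the names thread_dump and threads_blocked_in are made up; it only uses Ruby's Thread.list and Thread#backtrace.

```ruby
# One multi-line string per live thread: a header plus its backtrace.
def thread_dump
  Thread.list.map do |thread|
    header = "Thread #{thread.object_id} status=#{thread.status.inspect}"
    frames = thread.backtrace || [] # nil for threads that have finished
    ([header] + frames.map { |frame| "  #{frame}" }).join("\n")
  end
end

# Threads whose backtrace contains a frame matching `pattern`.
def threads_blocked_in(pattern)
  Thread.list.select do |thread|
    (thread.backtrace || []).any? { |frame| frame.match?(pattern) }
  end
end

# Possible usage from a console attached to the stuck process:
#
#   puts thread_dump
#   threads_blocked_in(/retrieve_connection/).size
```

If most threads show up under the same frame, as with retrieve_connection here, that points at whatever that call is waiting on rather than at the job code itself.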
EDIT:
Here's the relevant mysql show processlist log.
I'm running Sidekiq on 2 different machines:
20.7.4.5
20.7.4.6
Focusing on 20.7.4.5, there were 10 connections and all of them are currently sleeping. If I understand correctly:
1 is the Passenger connection
3 are the currently "busy" (stuck) Sidekiq workers.
6 are the unclosed connections.
There's no long-running query here since all the connections are currently sleeping (idle, waiting to be terminated with default timeout duration of 8 hours), is this correct?
EDIT 2:
So it turns out the issue has something to do with our DB configuration. We are using this schema:
Sidekiq workers => Load balancer => DB clusters.
With this setup, the Sidekiq workers start hanging after a while (completing jobs MUCH more slowly, taking up to 3000s for a job that usually takes only 1s).
However, if we set up the workers to talk directly to the DB cluster, it works flawlessly. So something is probably wrong with our setup, and this is not a Sidekiq issue.
Thanks for all the help guys.
I have a Sidekiq job that runs for a while, and when I deploy to Heroku while the job is running, it can't finish within the few seconds it is given.
That is fine, as the job is designed to be able to be re-run if needed.
The problem is that the job gets lost (instead of being put back into Redis and run again after the deploy).
I found that it is advised to set :timeout: 8 on Heroku and I tried it, but it had no effect (I also tried setting it to 5).
Exceptions normally get reported to me, but I don't see any here, so I'm not sure what could be wrong.
Any tips on how to debug this?
The free version of Sidekiq will push unfinished jobs back to Redis after the timeout has passed, default of 8 seconds. Heroku gives a process 10 seconds to shut down. That means we have 2 seconds to get those jobs back to Redis or they will be lost. If your network is slow, if the Redis server is swapping, etc, that 2 sec deadline might not be met and the jobs lost.
You were on the right track: one answer is to lower the timeout so you have a better chance of meeting that deadline. But network or swapping delay can't be predicted: even 5 seconds might not be enough time.
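To make the arithmetic concrete, here is a trivial sketch using the figures quoted above (10 seconds from Heroku, a Sidekiq timeout of 8 or 5). The method name requeue_margin is made up for illustration.

```ruby
# Seconds left to push unfinished jobs back to Redis once Sidekiq's own
# shutdown timeout has expired and before the platform kills the process.
def requeue_margin(platform_grace_seconds:, sidekiq_timeout_seconds:)
  platform_grace_seconds - sidekiq_timeout_seconds
end

requeue_margin(platform_grace_seconds: 10, sidekiq_timeout_seconds: 8) # => 2
requeue_margin(platform_grace_seconds: 10, sidekiq_timeout_seconds: 5) # => 5
```

A larger margin only improves the odds; it does not turn the requeue into a guarantee.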
Under normal healthy conditions, things should work as designed. Keep your machines healthy (uncongested network, plenty of RAM) and the basic fetch should work well. Sidekiq Pro's reliable fetch feature is a fundamental redesign of how Sidekiq fetches jobs and works around all of these issues by keeping jobs in Redis all the time so they can't be lost. But it comes with serious trade offs too: it's more complicated, slower and more Redis intensive than "basic" fetch.
In short, I don't know why you are losing jobs but make sure your instances and Redis server are healthy and the latency is low.
https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
This is actually a feature of Sidekiq, designed to steer you toward the paid Pro version:
http://sidekiq.org/products/pro
RELIABILITY
More reliable message processing.
Cloud environments are noisy and unreliable. Seeing timeouts? Wild swings in latency or performance? Ruby VM crashes or processes disappearing?
If a Sidekiq process crashes while processing a job, that job is lost.
If the Sidekiq client gets a networking error while pushing a job to Redis, an exception is raised and the job is not delivered.
Sidekiq Pro uses Redis's RPOPLPUSH command to ensure that jobs will not be lost if the process crashes or gets a KILL signal.
The Sidekiq Pro client can withstand transient Redis outages or timeouts. It will enqueue jobs locally upon error and attempt to deliver those jobs once connectivity is restored.
A deploy terminates all processes that belong to the user, so the job is lost. There is actually not much you can do there.
As @mike-perham and @esse noted, Sidekiq is designed in a way that lets it lose jobs because of its fetching mechanism. Your options to get around this are:
To buy Sidekiq Pro (although it was reported to cause the same issue)
To write your own fetcher (but that would mean you cannot use most third-party libraries, as they will not work with your custom fetcher)
To mimic Sidekiq Pro's reliable fetch by backing up your job data. If you want to go this way, check out the attentive_sidekiq gem, which does exactly that.
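To illustrate the third option, here is a rough sketch of the backup idea written as a Sidekiq-style server middleware (an object with call(worker, job, queue) that yields). This is not how attentive_sidekiq is implemented, as far as I know; InFlightBackup and orphaned_jobs are hypothetical names, and the store is a plain Hash standing in for something durable such as a Redis hash.

```ruby
require 'json'

# Records each job before it runs and removes the record when the job
# returns or raises. After a hard kill the ensure never runs, so the
# record stays behind and marks the job as lost.
class InFlightBackup
  def initialize(store)
    @store = store # Hash-like; assumed to be durable in real use
  end

  def call(_worker, job, queue)
    @store[job['jid']] = JSON.generate('queue' => queue, 'payload' => job)
    yield
  ensure
    # Normal completion or an ordinary exception: Sidekiq's own retry
    # logic takes over, so the backup is no longer needed.
    @store.delete(job['jid'])
  end
end

# Backed-up jobs that no live worker is running. `running_jids` must come
# from somewhere trustworthy, otherwise healthy jobs would be run twice.
def orphaned_jobs(store, running_jids: [])
  store.reject { |jid, _raw| running_jids.include?(jid) }
       .map { |_jid, raw| JSON.parse(raw) }
end

# Registration would look roughly like:
#
#   Sidekiq.configure_server do |config|
#     config.server_middleware { |chain| chain.add InFlightBackup, store }
#   end
```

A separate periodic task would then re-enqueue whatever orphaned_jobs returns. Since a job can run again after being recovered, this only suits jobs that are safe to re-run.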
I'm running Ruby/Rails on Heroku, and I want to scale my workers.
Heroku has this great API to scale the workers:
heroku.post_ps_scale(APP_NAME, 'worker', count)
# brings up new worker dynos, or scales down
The problem is that the workers are not aware of queues. If I have two clients and each gets its own queue in Delayed Job, their tasks clash with each other.
I can get workers working on different queues, but they run on the same dyno as the app that started them (i.e. the web dyno):
cmd = "rake jobs:work WORKER=STYLE_SERF QUEUES=#{queue}"
Rush::Box.new[Rails.root].bash( cmd, :background => true )
Question
How do I create worker dynos that are queue aware?
I expect it would be some combination of the code above.
EDIT:
Some have suggested that I make use of the Procfile; the problem I face is that I don't know the number of worker queues that I'm going to need.
It is excellent for setting up "urgent" or "due_yesterday" workers who work off specific queues... but I want to have a queue per client.
So worker_1 works on client_1 jobs and worker_2 works on client_2 jobs. Never crossing to the other.
It's easy to imagine client_1 doing product updates every hour and client_2 doing them once a week.
EXTENDED QUESTION (based on edit): Can one adjust the Procfile at run time? Is it a file, or is it compiled into the code? If it's an accessible file, then perhaps I could "create" worker types.
You could try http://hirefire.io/. It's 10 USD per month and you get 30 days to try it. It scales both workers and dynos according to your queue size and response time, and it's customizable.