In my Rails application, I have a model called Report
Report has one or many chunks (Chunk records), each of which generates a piece of content based on external service calls (APIs, etc.).
When a user requests a report, I use Sidekiq to queue the chunk jobs so they run in the background, and I notify the user that we will email them the result once the report is generated.
Report uses a state machine to flag whether or not all the jobs have finished successfully. All the chunks must be completed before we flag the report as ready. If one fails, we need to either try again or give up at some point.
I have defined the states as draft (default), working, and finished. The finished result is a combination of all the service pieces put together. 'Draft' is when the chunks are still in the queue and none of them has started generating any content.
How would you tackle this situation with Sidekiq? How do you keep track (live) of which chunk jobs are finished, working, or failed, so we can flag the report as finished or failed?
I'd like to see a way to periodically check the jobs to see where they stand, and change the state when they have all finished successfully, or flag the report as failed if all the retries give up!
Thank you
We had a similar need in our application to determine when Sidekiq jobs were finished during automated testing.
What we used is the sidekiq-status gem: https://github.com/utgarda/sidekiq-status
Here's the rough usage:
job_id = Job.perform_async()
You'd then pass the job ID to the place where it will try to check the status of the job
Sidekiq::Status::status job_id #=> :working, :queued, :failed, :complete
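For the report/chunks case in the question, a rough sketch might look like this (ChunkWorker, a chunk_job_ids column on Report, and the finish!/fail! state transitions are assumptions for illustration, not part of the gem):

class ChunkWorker
  include Sidekiq::Worker
  include Sidekiq::Status::Worker # enables status tracking for this job's jid

  def perform(chunk_id)
    Chunk.find(chunk_id).generate_content! # hypothetical content generation
  end
end

# Enqueue the chunks and remember their job ids on the report:
job_ids = report.chunks.map { |chunk| ChunkWorker.perform_async(chunk.id) }
report.update!(chunk_job_ids: job_ids, state: 'working')

# A periodic job (cron, clockwork, etc.) can then poll the statuses:
statuses = report.chunk_job_ids.map { |jid| Sidekiq::Status::status(jid) }
report.finish! if statuses.all? { |s| s == :complete }
report.fail!   if statuses.any? { |s| s == :failed }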
Hope this helps.
This is a Sidekiq Pro feature called Batches.
https://github.com/mperham/sidekiq/wiki/Batches
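For the report/chunks scenario in the question, a rough sketch based on the Batches wiki might look like the following (ChunkWorker and the Report#finish!/#fail! state transitions are assumptions for illustration):

batch = Sidekiq::Batch.new
batch.description = "Chunks for report #{report.id}"
batch.on(:success, ReportCallback, 'report_id' => report.id)  # fires only if every job succeeded
batch.on(:complete, ReportCallback, 'report_id' => report.id) # fires when all jobs have run, even with failures
batch.jobs do
  report.chunks.each { |chunk| ChunkWorker.perform_async(chunk.id) }
end

class ReportCallback
  def on_success(status, options)
    Report.find(options['report_id']).finish!
  end

  def on_complete(status, options)
    Report.find(options['report_id']).fail! if status.failures > 0
  end
end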
Related
In the Sidekiq wiki it talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer has what looks like an effective approach at a small scale. But it's not perfect.
Jobs can disappear in the middle of running. We've noticed certain work is incomplete, and when we look in the logs they cut short in the middle of the work as if the job just evaporated. Probably due to a server restart or something, but it often doesn't find its way back into the queue.
super_fetch tries to address this, but it errs on the side of duplicating jobs. With that we see a lot of jobs that end up running twice simultaneously. Having a database transaction cannot protect us from duplicate work if both transactions start at the same time. We'd need locking to prevent that.
Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:
Spawn 1000 jobs, which each individually start a transaction, update a record, and send an email. This seems to be the default, and it is pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user-facing slowdowns and timeouts.
Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:
User.transaction do
  users.update_all(email_sent: true)
  users.each { |user| UserMailer.notification(user).deliver_now }
end
In the above scenario, if the UserMailer loop halts in the middle due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So there will be a subset of the emails that get re-sent. Potentially multiple times if there is a code error and the job keeps requeueing.
Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above.
What alternatives am I missing?
One additional problem with any transaction based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past the more we've used transactions the more we've had deadlock errors. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlock issues better. We tried going one further and locking the record, but then we started getting timeouts on the user side as web processes compete with background jobs for locks.
Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given the fact that we need to use the database to ensure idempotency, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.
This is a really complicated/loaded question, as the architecture really depends on more factors than can be concisely described in simple question/answer formats. However, I can give a general recommendation.
Separate Processing From Delivery
start a transaction, update a record, and send an email
Separate these steps out. Better to avoid doing both a DB update and an email send inside a transaction, batched or not.
Do all your logic and record updates inside transactions, separately from email sends. Do them individually or in bulk, or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to roll back failures. If you save results as args to email send jobs, make sure processing the entire batch succeeds before enqueuing the batch. You have flexibility now because it's a pure data transform.
Enqueue email send jobs for each of those data transforms. These jobs must do little to no logic & processing! Keep them dead simple, no DB writes -- all processing should have already been done. Only pass values to an email template and send. This is critical because this external effect can't be wrapped in a transaction. Making email send jobs read-only for your system (they "write" to email, which is external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
By doing this, you'll separate the DB load for email processing from email sends, and they are now dealt with separately. Bugs in your email processing won't affect email sends. Email send failures won't affect email processing.
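As a rough sketch of that separation (ProcessNotificationsJob, SendEmailJob, and the UserMailer signature used here are illustrative assumptions, not the original code):

class ProcessNotificationsJob
  include Sidekiq::Worker

  def perform(user_ids)
    payloads = []
    # All logic and record updates happen inside the transaction.
    User.transaction do
      User.where(id: user_ids).find_each do |user|
        user.update!(email_sent: true)
        payloads << { 'email' => user.email, 'name' => user.name }
      end
    end
    # Enqueue sends only after the whole batch has committed successfully.
    payloads.each { |p| SendEmailJob.perform_async(p['email'], p['name']) }
  end
end

class SendEmailJob
  include Sidekiq::Worker

  # Dead simple: no DB writes, no processing -- just render and send.
  def perform(email, name)
    UserMailer.notification(email, name).deliver_now
  end
end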
Regarding Row Locking & Deadlocks
There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.
Response: Jobs that die in the middle
Say the job is killed just after the transaction completes but before the emails go out.
I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).
Response: Duplicate jobs
Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"
You're using Sidekiq and not writing your own async job system, so you need to consider job system failures out of your scope. What remains are your job performance characteristics and job system configurations. If you're getting duplicate jobs, my guess is your jobs are taking longer to complete than the configured job timeout. Your job is taking so long that Sidekiq thinks it died (since it hasn't reported back success/fail yet), and then spawns another attempt. Speed up or break up the job so it will succeed or fail within the configured timeout, and this will stop happening (99.99% of the time).
Unlike web requests, there's no human on the other side that will decide whether or not to retry in an async job system. This is why your job performance profile needs to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like:
expected job run time
expected job CPU/mem/disk usage
expected job DB or other I/O usage
job read only? write only? both?
jobs hitting external services
jobs users are actively waiting on
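For illustration, routing jobs to dedicated queues along those lines is just a sidekiq_options setting per worker (the queue names here are assumptions):

class BulkExportJob
  include Sidekiq::Worker
  sidekiq_options queue: 'long_running' # slow, disk/IO-heavy work
end

class SendReceiptJob
  include Sidekiq::Worker
  sidekiq_options queue: 'critical' # users are actively waiting on this
end

# Each queue can then get its own worker process and concurrency, e.g.:
#   bundle exec sidekiq -q critical -c 10
#   bundle exec sidekiq -q long_running -c 2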
This is a super interesting question but I'm afraid it's nearly impossible to give a "one size fits all" kind of answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.
In my experience, generally the approach of having a scheduling job that then schedules individual jobs tends to work best. So in a full-blown system I have a schedule defined in clockwork where I schedule a scheduling job which then schedules the individual jobs:
# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
  UserUsageReportSchedulerJob.perform_now
end

# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
  def perform
    # need_usage_report is a scope to determine the list of users who need a report.
    # This could, of course, also be "all".
    User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
  end
end

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  def perform(user)
    # the actual report generation
  end
end
If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I can say that I've had projects where schedulers enqueued tens of thousands of individual (small) jobs, which Sidekiq then happily worked through in batches of 10 or 20 on a low-priority queue over a couple of hours, with no issues whatsoever for Sidekiq itself, the server, the database, etc.
I have Sidekiq jobs processing many types of resources. However, for one particular type of resource, e.g. resource X, I need to ensure that only one Sidekiq job can process that resource at any given time.
For example, if I have 3 Sidekiq jobs that get queued simultaneously and want to interact with resource X, then only one Sidekiq job can process resource X while the 2 remaining jobs have to wait (or be re-queued) until the job currently processing the resource finishes.
Currently, I am trying to add a database record when a Sidekiq job starts processing the resource, and to use that record to stop other Sidekiq jobs from processing the resource until the record is deleted by the job that added it (when it finishes processing resource X), or until a certain amount of time has elapsed (e.g. if the record was created more than 5 minutes ago, it is considered to no longer hold exclusive access to resource X, and the next Sidekiq job that wants to process resource X may alter the record and claim exclusive access).
A pseudocode of my current implementation:
def perform(res_id, res_type)
  # Only applies to "RESOURCE_X"
  if res_type == RESOURCE_X
    existing = ResourceProcessor.find_by(res_id: res_id)
    if existing.nil? || (Time.now - existing.created_at) > 5.minutes
      ResourceProcessor.create(res_id: res_id)
      process_resource_x(res_id)
    else
      self.class.perform_in(5.minutes, res_id, res_type) # Try again later
      return
    end
    # Letting other Sidekiq jobs know they can now fight over who gets to process resource X
    ResourceProcessor.where(res_id: res_id).destroy_all
  else
    process_other_resource(res_id)
  end
end
Unfortunately, my solution does not work. It works just fine if there is a delay between the Sidekiq jobs that want to process resource X. However, if the jobs that want to process resource X arrive simultaneously, then my solution falls apart.
Is there any way I can enforce some sort of synchronization only when processing resource X?
Btw, my Sidekiq jobs may be distributed across several machines (but they access the same Redis server on a dedicated machine).
I did more research based on the comment provided by Thomas.
The link he provided was extremely useful. They implemented their own custom Lock class to achieve the results they want. However, I did not use their custom lock code because I needed a different behaviour.
The specific behaviour I was looking to implement is "re-queue if locked", not "wait if locked".
There are alternative tools that I could have used, such as redis-semaphore and the with_advisory_lock gem.
I tested redis-semaphore and found it buggy. It wasn't returning the lock state and resource count correctly. Also, after checking the issues on GitHub, it appears that in some situations redis-semaphore can get itself into a deadlock, so I decided to abandon it. As a result, I also decided not to use with_advisory_lock, due to its lower star count compared to redis-semaphore.
In the end I found a way to implement the locking pattern I described in my question, which is to block Sidekiq jobs based on a value in my database. I dealt with the concurrency issue of multiple Sidekiq jobs reading stale values by locking the entire database row with Rails' own pessimistic locking (ActiveRecord::Locking::Pessimistic). This ensures that only one Sidekiq worker can access the database row holding the locking value at any given time. The locking period is kept to a minimum because only a read and, when applicable, a write are performed while the row is locked. Subsequent operations, such as re-queueing and cleaning up, are done afterwards.
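For illustration, a minimal sketch of that approach using Rails' pessimistic locking (ResourceLock, its locked_at column, and process_resource_x are assumptions, not the exact production code):

def perform(res_id, res_type)
  return process_other_resource(res_id) unless res_type == RESOURCE_X

  acquired = false
  lock = ResourceLock.find_by!(res_id: res_id) # assumes one pre-created row per resource
  lock.with_lock do
    # The row is locked only for this short read and (possible) write.
    if lock.locked_at.nil? || lock.locked_at < 5.minutes.ago
      lock.update!(locked_at: Time.current)
      acquired = true
    end
  end

  if acquired
    process_resource_x(res_id)
    lock.update!(locked_at: nil) # release so other jobs can claim the resource
  else
    self.class.perform_in(5.minutes, res_id, res_type) # re-queue if locked
  end
end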
Example Scenario:
Payment handling and electronic-product delivery transaction.
Requirements
There are approximately a few thousand payment transactions a day that need to be executed. Each taking about 1 second. (So the entire process should take about an hour)
Transactions must be processed linearly in a single thread (the next transaction must not start until the last transaction has completed, strong FIFO order is necessary)
Each payment transaction is wrapped inside a database transaction; if anything causes the transaction to roll back, it is aborted and put into another queue for manual error handling. After that, processing should continue with the rest of the transactions.
Order of Importance
Single execution (if failed, put into error queue for manual handling)
Single Threadedness
FIFO
Is Sidekiq suitable for such mission-critical processes? Would Sidekiq be able to fulfill all of these requirements? Or would you recommend other alternatives? Could you point me to some best practices regarding payment handling in Rails?
Note: The question is not regarding whether to use stripe or ActiveMerchant for payment handling. It is more about the safest way to programmatically execute those processes in the background.
Yes, Sidekiq can fulfill all of these requirements.
To process your transactions one at a time in serial, you can launch a single Sidekiq process with a concurrency of 1 that only works on that queue. The process will work jobs off the queue one at a time, in order.
For a failed task to go into a failure queue, you'll need to use the Sidekiq Failures gem and ensure retries are turned off for that task.
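A minimal sketch of that setup (the 'payments' queue name and the ProcessPaymentTransaction service object are assumptions for illustration):

class PaymentTransactionWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'payments', retry: false # no automatic retries

  def perform(transaction_id)
    # The actual payment work, wrapped in its own DB transaction as described
    # in the question. With retries off and the sidekiq-failures gem loaded,
    # a raised error lands the job in the failures list for manual handling.
    ProcessPaymentTransaction.call(transaction_id)
  end
end

# Run one dedicated process with a single thread so jobs on this queue are
# executed strictly one at a time, in order:
#   bundle exec sidekiq -c 1 -q payments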
To guarantee that each task is executed at least once, you can purchase Sidekiq Pro and use Reliable Fetch. If Sidekiq crashes, it will execute the task when it starts back up. This assumes you will set up monitoring to ensure the Sidekiq process stays running. You might also want to make your task idempotent, so it doesn't write the same transaction twice. (In theory, the process could crash after your database transaction commits, but before Sidekiq reports to Redis that the task completed.)
If using Ruby is a constraint, Sidekiq is probably your best bet. I've used several different queuing systems in Ruby and none of them have the reliability guarantee that Sidekiq does.
the next transaction must not start until the last transaction has completed
In that case, I think background processing is suitable as long as you create the next job at the completion of the previous job.
class Worker
  include Sidekiq::Worker
  def perform(*params)
    # do work, raising exception if necessary
    NextWorker.perform_async(params, here)
  end
end
I have a "cluster" of Resque servers in my infrastructure. They all have the same exact job priorities etc. I automagically scale the number of Resque servers up and down based on how many pending jobs there are and available resources on the servers to handle said jobs. I always have a minimum of two Resque servers up.
My issue is that when I do a quick, one off job, sometimes both the servers process that job. This is bad.
I've tried adding a lock to my job with something like the following:
require 'resque-lock-timeout'

class ExampleJob
  extend Resque::Plugins::LockTimeout
  def self.perform
    # some code
  end
end
This plugin works for longer running jobs. However for these super tiny one off jobs, processing happens right away. The Resque servers both do not see the lock set by its sister server, both set a lock, process the job, unlock, and are done.
I'm not entirely sure what to do at this point or what solutions there are except for having one dedicated server handle this type of job. That would be a serious pain to configure and scale. I really want both the servers to be able to handle it, but once one of them grabs it from the queue, ensure the other does not run it.
Can anyone suggest some viable solution(s)?
Write your lock interpreter to wait T milliseconds before it looks for a lock with a unique_id less than the value of the lock it made.
This will determine who won the race, and the loser will self-terminate.
T is the parallelism latency between all N servers in the pool of a given queue. You can determine this heuristically by scaling back from 1000 milliseconds until you again find the job happening in-duplicate. Give padding for latency variation.
This is called the busy-wait solution to mutual exclusion. It is considered an acceptable trade-off in many of the scenarios where a mutex must be implemented (locking, etc.).
I'll post some links when off mobile. Wikipedia entry on mutex should explain all this.
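For illustration, a rough sketch of that busy-wait idea using the Resque Redis connection (the key naming, the T_MS value, and do_the_work are assumptions):

require 'securerandom'

class ExampleJob
  T_MS = 500 # parallelism latency; tune heuristically as described above

  def self.perform(*args)
    key   = "busywait:#{name}:#{args.hash}"
    my_id = "#{Time.now.to_f}-#{SecureRandom.hex(4)}"
    Resque.redis.zadd(key, Time.now.to_f, my_id) # register this worker's claim
    sleep(T_MS / 1000.0)                         # let all racing workers register
    winner, _score = Resque.redis.zrange(key, 0, 0, with_scores: true).first
    return unless winner == my_id                # the loser self-terminates
    do_the_work(*args)
  ensure
    Resque.redis.zrem(key, my_id) if my_id
  end
end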
If this won't work for you, then:
1. Use a scheduler to control duplication.
2. Classify short-running jobs to a queue designed to run them in serial.
TL;DR there is no perfect solution, only good trade-off for your conditions.
It should not be possible for two workers to get the same 'payload' because items are dequeued using BLPOP. Redis will only send the queued item to the first client that calls BLPOP. It sounds like you are enqueueing the job more than once and therefore two workers are able to acquire different payloads with the same arguments. The purpose of 'resque-lock-timeout' is to assure that payloads that have the same method and arguments do not run concurrently; it does not however stop the second payload from being worked if the first job releases the lock before the second job tries to acquire it.
It would make sense that this only happens to short running jobs. Here is what might be happening:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 1 is unlocked
payload 2 is locked
payload 2 is worked
payload 2 is unlocked
Whereas in long-running jobs the following scenario might happen:
payload 1 is enqueued
payload 2 is enqueued
payload 1 is locked
payload 1 is worked
payload 2 fails to get lock
payload 1 is unlocked
Try turning off Resque and enqueueing your job. Take a look in redis at the list for your Resque queue (or monitor Redis using redis-cli monitor). See if Resque has queued more than one payload. If you still only see one payload then monitor the list to see if another one of your resque workers is calling recreate on failed jobs.
If you want to have 'resque-lock-timeout' hold the lock for longer than the duration it takes to process the job you can override the release_lock! method to set an expiry on the lock instead of just deleting it.
module Resque
  module Plugins
    module LockTimeout
      def release_lock!(*args)
        lock_redis.expire(redis_lock_key(*args), 60) # expire lock after 60 seconds
      end
    end
  end
end
https://github.com/lantins/resque-lock-timeout/blob/master/lib/resque/plugins/lock_timeout.rb#l153-155
I'm very happy with Bj so far, but I have this one issue:
When one process takes 1 or 2 hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when uploading to a server which times out regularly.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform specific details the design of Bj is quite simple: the
main Rails application submits jobs to table, stored in the database. The act
of submitting triggers exactly one of two things to occur:
1) a new long running background runner to be started
2) an existing background runner to be signaled
The background runner refuses to run two copies of itself for a given
hostname/rails_env combination. For example you may only have one background
runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself -
you need do nothing to start, monitor, or stop it - it just works. However,
some people will prefer manage their own background process, see 'External
Runner' section below for more on this.
The runner simply processes each job in a highest priority oldest-in fashion,
capturing stdout, stderr, exit_status, etc. and storing the information back
into the database while logging it's actions. When there are no jobs to run
the runner goes to sleep for 42 seconds; however this sleep is interuptable,
such as when the runner is signaled that a new job has been submitted so,
under normal circumstances there will be zero lag between job submission and
job running for an empty queue.
You can learn more on the project's GitHub page.