Better approach to do background processes using Sidekiq and MongoDB - ruby-on-rails

I have a service that analyzes e-mails in a list. Each list can contain 500 to 200,000 e-mails. The best way I found to analyze these lists is with Sidekiq background jobs. It works, but I don't think it is the best way to analyze these e-mails, for the following reason:
Sidekiq is configured to use 25 threads, and I'm using MongoDB to store the e-mails.
The workflow of the job is:
Search the e-mail in MongoDB using the ID passed via parameters to the job;
Analyze the e-mail and update the status of the analysis in the database : Valid or Invalid.
So there are on average 25 concurrent accesses to the database, and after 1 to 4 seconds (the processing time) each job needs to access the database again to update the status (once per job).
If a list has 100,000 e-mails, 100,000 jobs are needed to complete the task. That means 200,000 database accesses, with a concurrency of 25 most of the time.
Is this a good way to perform this service? Is there a way to access MongoDB less often?
I cannot see another way to do this.
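One way to cut the number of round trips, offered as a sketch rather than a tested recommendation: pass a slice of IDs to each job, load the whole slice with one query, and write the results back in one bulk operation. In the example below `FakeStore` is a hypothetical in-memory stand-in for MongoDB that only counts calls, `valid_email?` is a placeholder for the real 1 to 4 second analysis, and the slice size of 500 is an assumption.

```ruby
# Hypothetical stand-in for the MongoDB collection: it only counts round trips.
class FakeStore
  attr_reader :round_trips, :statuses

  def initialize(emails)
    @emails = emails      # { id => address }
    @statuses = {}        # { id => "Valid" | "Invalid" }
    @round_trips = 0
  end

  # One read for a whole slice of ids
  def find_many(ids)
    @round_trips += 1
    @emails.slice(*ids)
  end

  # One write for a whole slice of results
  def update_many(results)
    @round_trips += 1
    @statuses.merge!(results)
  end
end

# Placeholder for the real analysis
def valid_email?(address)
  address.match?(/\A[^@\s]+@[^@\s]+\.[^@\s]+\z/)
end

# Body of a hypothetical batch job: one read, many analyses, one write
def analyze_batch(store, ids)
  emails = store.find_many(ids)
  results = emails.transform_values { |addr| valid_email?(addr) ? "Valid" : "Invalid" }
  store.update_many(results)
end

emails = (1..100_000).to_h do |i|
  [i, (i % 10).zero? ? "broken-address" : "user#{i}@example.com"]
end
store = FakeStore.new(emails)

# 500 ids per job gives 200 jobs instead of 100,000
emails.keys.each_slice(500) { |ids| analyze_batch(store, ids) }

puts store.round_trips   # prints 400
```

The trade-off is that a batch that fails halfway is retried as a whole, so the analysis has to be safe to repeat.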

Related

Sidekiq Idempotency, N+1 Queries and deadlocks

In the Sidekiq wiki it talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer has what looks like an effective approach at a small scale. But it's not perfect. Jobs can disappear in the middle of running. We've noticed certain work is incomplete and when we look in the logs they cut short in the middle of the work as if the job just evaporated. Probably due to a server restart or something, but it often doesn't find its way back into the queue. super_fetch tries to address this, but it errs on the side of duplicating jobs. With that we see a lot of jobs that end up running twice simultaneously. Having a database transaction cannot protect us from duplicate work if both transactions start at the same time. We'd need locking to prevent that.
Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:
Spawn 1000 jobs, which each individually start a transaction, update a record, and send an email. This seems to be the default, and it is pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user facing slowdowns and timeouts.
Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:
User.transaction do
  users.update_all(email_sent: true)
  users.each { |user| UserMailer.notification(user).deliver_now }
end
In the above scenario, if the UserMailer loop halts in the middle due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So there will be a subset of the emails that get re-sent. Potentially multiple times if there is a code error and the job keeps requeueing.
Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above.
What alternatives am I missing?
One additional problem with any transaction based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past the more we've used transactions the more we've had deadlock errors. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlock issues better. We tried going one further and locking the record, but then we started getting timeouts on the user side as web processes compete with background jobs for locks.
Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given the fact that we need to use the database to ensure idempotency, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.
This is a really complicated/loaded question, as the architecture really depends on more factors than can be concisely described in simple question/answer formats. However, I can give a general recommendation.
Separate Processing From Delivery
"start a transaction, update a record, and send an email"
Separate these steps out. Better to avoid doing both a DB update and email send inside a transaction, batched or not.
Do all your logic and record updates inside transactions, separately from email sends. Do them individually or in bulk, or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to roll back failures. If you save results as args to email send jobs, make sure processing the entire batch succeeds before enqueuing the batch. You have flexibility now b/c it's a pure data transform.
Enqueue email send jobs for each of those data transforms. These jobs must do little to no logic & processing! Keep them dead simple, no DB writes -- all processing should have already been done. Only pass values to an email template and send. This is critical b/c this external effect can't be wrapped in a transaction. Making email send jobs read-only for your system (they "write" to email, which is external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
By doing this, you'll separate the DB load for email processing from email sends, and they are now dealt with separately. Bugs in your email processing won't affect email sends. Email send failures won't affect email processing.
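Here is a minimal sketch of that split in plain Ruby, under some loud assumptions: the `User` struct, the `queue` array and the `outbox` array are hypothetical stand-ins for the model, the job queue and the mail server, and the ordering inside `process` only imitates what a real database transaction would guarantee.

```ruby
User = Struct.new(:id, :email, :name, :email_sent)

# Step 1: processing. Build every payload first, then mark the records.
# In a real app both would sit inside one database transaction; here the
# ordering alone ensures a failure leaves the records untouched.
def process(users)
  pending = users.reject(&:email_sent)
  payloads = pending.map do |user|
    raise ArgumentError, "user #{user.id} has no email" if user.email.to_s.empty?
    { to: user.email, subject: "Hello, #{user.name}" }
  end
  pending.each { |user| user.email_sent = true }
  payloads
end

# Step 2: delivery. No logic and no database writes: template values in,
# one email out.
def deliver(payload, outbox)
  outbox << "To: #{payload[:to]} / Subject: #{payload[:subject]}"
end

users = [
  User.new(1, "ann@example.com", "Ann", false),
  User.new(2, "bob@example.com", "Bob", true),   # already handled earlier
  User.new(3, "cyd@example.com", "Cyd", false)
]

queue  = []   # stand-in for the job queue
outbox = []   # stand-in for the mail server

queue.concat(process(users))                       # enqueue only after the whole batch succeeded
queue.each { |payload| deliver(payload, outbox) }  # what the send jobs would do

puts outbox
```

Running `process(users)` a second time returns an empty list, so a retried processing job enqueues nothing new; the delivery side stays simple enough that there is very little left to fail.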
Regarding Row Locking & Deadlocks
There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.
Response: Jobs that die in the middle
Say the job is killed just after the transaction completes but before the emails go out.
I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).
Response: Duplicate jobs
Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"
You're using Sidekiq and not writing your own async job system, so you need to consider job system failures out of your scope. What remains are your job performance characteristics and job system configurations. If you're getting duplicate jobs, my guess is your jobs are taking longer to complete than the configured job timeout. Your job is taking so long that Sidekiq thinks it died (since it hasn't reported back success/fail yet), and then spawns another attempt. Speed up or break up the job so it will succeed or fail within the configured timeout, and this will stop happening (99.99% of the time).
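If duplicates still slip through after tuning, one cheap guard is to have each job claim its record atomically before doing any work, so that only one copy wins. In SQL this is typically a conditional `UPDATE ... WHERE status = 'pending'` followed by a check of the affected row count. The sketch below is only an illustration of that idea: the `Records` class is hypothetical and its `Mutex` plays the role of the database making the conditional update atomic.

```ruby
# Hypothetical stand-in for a table; the Mutex imitates the atomicity the
# database gives a single conditional UPDATE statement.
class Records
  def initialize(ids)
    @status = ids.to_h { |id| [id, :pending] }
    @lock = Mutex.new
  end

  # Equivalent in spirit to:
  #   UPDATE records SET status = 'processing'
  #   WHERE id = ? AND status = 'pending'
  # Returns true only for the caller whose update changed the row.
  def claim(id)
    @lock.synchronize do
      return false unless @status[id] == :pending
      @status[id] = :processing
      true
    end
  end
end

records = Records.new([42])
sent = Queue.new   # thread-safe log of side effects

# Two copies of the same job running simultaneously
threads = 2.times.map do |n|
  Thread.new do
    sent << "copy #{n} sent the email" if records.claim(42)
  end
end
threads.each(&:join)

puts sent.size   # prints 1
```

A job that dies after claiming leaves the record stuck in "processing", so a real system would also need some way to expire stale claims.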
Unlike web requests, there's no human on the other side that will decide whether or not to retry in an async job system. This is why your job performance profile needs to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like:
expected job run time
expected job CPU/mem/disk usage
expected job DB or other I/O usage
job read only? write only? both?
jobs hitting external services
jobs users are actively waiting on
This is a super interesting question but I'm afraid it's nearly impossible to give a "one size fits all" kind of answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.
In my experience, generally the approach of having a scheduling job that then schedules individual jobs tends to work best. So in a full-blown system I have a schedule defined in clockwork where I schedule a scheduling job which then schedules the individual jobs:
# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
  UserUsageReportSchedulerJob.perform_now
end

# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
  def perform
    # need_usage_report is a scope to determine the list of users who need a report.
    # This could, of course, also be "all".
    User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
  end
end

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  def perform(user)
    # the actual report generation
  end
end
If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I can say that I've had projects where we've had schedulers that scheduled tens of thousands of individual (small) jobs which Sidekiq then happily took in in batches of 10 or 20 on a low priority queue and processed over a couple of hours with no issues whatsoever for Sidekiq itself, the server, the database etc.

options for managing ActiveMailer generated queues

An application, hosted as the only application on a server, will be handling e-mails for large numbers of users who will launch groups of mailings. Most other processing by the application is not very intense.
While the volumes of mails will not be massive, they are important ≈ in the thousands per day. Mails will mostly be sent as individual items following an action that involves multiple mail recipients; a lag will occur between individual items and within sub-groups of the mail recipients.
In other words, each mail can have a calculation as to the time when it should be issued.
There are multiple options for handling queues, which I would group into two categories.
a) RAM-based objects. These have the disadvantage of losing the queues if something happens to the server.
b) Database-based objects. These require more processing. (I can only think of a mechanism whereby the mails are stored with their release time, and a cron job (scheduler gem) checks every minute for unreleased mails whose datetime is < Time.now, sends them off, and sets the mail's 'released' attribute.)
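The mechanism in option (b) can be sketched in plain Ruby as below. This is only an illustration: `ScheduledMail` is a hypothetical struct standing in for a database row, and actual delivery is left to the block passed by the caller.

```ruby
ScheduledMail = Struct.new(:id, :to, :release_at, :released)

# One pass of the periodic task: deliver everything that is due and not yet
# released, and mark each mail only after its delivery block returned.
def release_due_mails(mails, now: Time.now)
  due = mails.select { |mail| !mail.released && mail.release_at <= now }
  due.each do |mail|
    yield mail
    mail.released = true
  end
  due.size
end

now = Time.utc(2024, 1, 1, 12, 0, 0)
mails = [
  ScheduledMail.new(1, "a@example.com", now - 60,  false),  # due
  ScheduledMail.new(2, "b@example.com", now + 300, false),  # not due yet
  ScheduledMail.new(3, "c@example.com", now - 10,  true)    # already released
]

delivered = []
count = release_due_mails(mails, now: now) { |mail| delivered << mail.to }

puts count   # prints 1
```

Because the queue lives in the database, a server restart loses nothing: the next pass simply picks up whatever is still unreleased.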
Not having experience with any of the queuing options, my question is: based on your experience, which option and ActiveJob adapter (or none!) makes the most sense given the context, while containing complexity?

Rails/Postgres - What type of DB lock do I need?

I have a PendingEmail table which I push many records to for emails I want to send.
I then have multiple Que workers which process my app's jobs. One of said jobs is my SendEmailJob.
The purpose of this job is to check PendingEmail, pull the latest 500 ordered by priority, make a batch request to my 3rd party email provider, wait for array response of all 500 responses, then delete the successful items and mark the failed records' error column. The single job will continue in this fashion until the records returned from the DB are 0, and the job will exit/destroy.
The issues are:
It's critical only one SendEmailJob processes email at one time.
I need to check the database every second if a current SendEmailJob isn't running. If it is running, then there's no issue as that job will get to it in ~3 seconds.
If a table is locked (however that may be), my app/other workers MUST still be able to INSERT, as other parts of my app need to add emails to the table. I mainly just need to restrict SELECT I think.
All this needs to be FAST. Part of the reason I did it this way is for performance as I'm sending millions of email in a short timespan.
Currently my jobs are initiated with a clock process (Clockwork), so it would add this job every 1 second.
What I'm thinking...
Que already uses advisory locks and other PG mechanisms. I'd rather not attempt to mess with that table trying to prevent adding more than one job in the first place. Instead, I think it's ok that potentially many SendEmailJob could be running at once, as long as they abort early if there is a lock in place.
Apparently there are some Rails ways to do this, but I assume I will need to execute code directly against PG to take some sort of lock in each job. Before doing that, the job would check whether a lock is already held, and abort if there is one.
I just don't know which type of lock to choose, whether to do it in Rails or in the database directly. There are so many of them with such subtle differences (I'm using PG). Any insight would be greatly appreciated!
Answer: I needed an advisory lock.
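For readers wondering what that could look like, here is a hedged sketch. `pg_try_advisory_lock` and `pg_advisory_unlock` are real PostgreSQL functions; everything else is an assumption: the lock key is an arbitrary integer, `with_send_email_lock` is a hypothetical helper, and `FakeConnection` exists only so the example runs without a database. In a Rails app `conn` would be something like `ActiveRecord::Base.connection`.

```ruby
SEND_EMAIL_LOCK_KEY = 815_001   # arbitrary application-chosen integer (assumption)

# Runs the block only if this session obtains the advisory lock. Returns
# false immediately, without waiting, when another session holds it.
def with_send_email_lock(conn, key = SEND_EMAIL_LOCK_KEY)
  got = conn.select_value("SELECT pg_try_advisory_lock(#{Integer(key)})")
  return false unless [true, "t"].include?(got)   # adapters may return true or "t"
  begin
    yield
    true
  ensure
    conn.select_value("SELECT pg_advisory_unlock(#{Integer(key)})")
  end
end

# Minimal fake so the sketch runs standalone; each instance is one "session".
class FakeConnection
  def initialize(shared_locks)
    @locks = shared_locks
  end

  def select_value(sql)
    key = sql[/\((\d+)\)/, 1].to_i
    if sql.include?("pg_try_advisory_lock")
      return false if @locks[key]
      @locks[key] = self
      true
    elsif sql.include?("pg_advisory_unlock")
      return false unless @locks[key].equal?(self)
      @locks.delete(key)
      true
    end
  end
end

locks = {}
job_a = FakeConnection.new(locks)
job_b = FakeConnection.new(locks)

log = []
with_send_email_lock(job_a) do
  log << "A sending"
  # A second SendEmailJob starting now aborts early instead of waiting
  ran = with_send_email_lock(job_b) { log << "B sending" }
  log << "B skipped" unless ran
end

p log
```

Advisory locks do not lock the table itself, so INSERTs from the rest of the app are unaffected. With a connection pool, the lock and the unlock have to run on the same connection, since the lock belongs to the session.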

What's the best way to manage Resque jobs on per user basis?

I'm migrating from Delayed_jobs to Resque and I have difficulties finding the best way to handle those cases:
A user can NOT add the same command to the list of jobs twice (e.g. "export all my data"). Only one export command at a time. For others it's fine to have many (e.g. send emails)
Some jobs should not run for more than 5 minutes, while others are allowed to run for 30 minutes. In both cases, I'd like to have a time-out in case the process is blocked or is not completed on time.
Can add jobs to start in a few days
Inform the user on all their current & future jobs.
Can cancel some jobs (current and future) for the user
Keep ability to have different lists (mostly for priorities / slow and fast tasks)
I looked at resque-status and it seems like it provides the low level query, but I would still need to do my per user job management.
Suggestions on best way to handle this?

How do I create a worker daemon which waits for jobs and executes them?

I'm new to Rails and multithreading and am curious about how to achieve the following in the most elegant way.
I couldn't find any nice tutorials which explained in detail what's the best design decision for the following task:
I have a couple of HTTP requests which will be run for a user in the background, for example, parsing a couple of websites and getting some information like HTTP response code and response time, then returning the results. For performance reasons, I decided to split the total number of URLs to parse into batches of 25 each, then execute each batch in a thread, join these and write the result to a database.
I decided to use the following gem (http://rubygems.org/gems/thread) to ensure that there's a maximum number of threads that are run simultaneously. So far so good.
The problem is, if two users start their analysis in parallel, the maximum number of threads is two times the maximum of my threadpool.
My solution (imho) is to create a worker daemon which runs on its own and waits for jobs from the clients.
My question is, what's the best way to achieve this in Rails?
Maybe create a Rake task, and use it as a daemon (see: "Daemoninsing a rake task") and (how?) add jobs to it?
Thank you very much in advance!
I'd build a queue in a table in the database, and a bit of code that is periodically started by cron, which walks that table, passing requests to Typhoeus and Hydra.
Here's how the author summarizes the gem:
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
As users add requests, append them to the table. You'll want fields like:
A "processed" field so you can tell which were handled in case the system goes down.
A "success" field so you can tell which requests were processed successfully, so you can retry if they failed.
A "retry_count" field so you can retry up to "n" times, then flag that URL as unreachable.
A "next_scan_time" field that says when the URL should be scanned again so you don't DOS a site by hitting it continuously.
Typhoeus and Hydra are easy to use, and do make it easy to handle multiple requests.
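The bookkeeping around those fields might look like the sketch below. It leaves the HTTP call itself out, and the `UrlRequest` struct, the `unreachable` flag, the retry limit and the rescan delay are all assumptions made for illustration.

```ruby
MAX_RETRIES  = 3         # the "n" from the list above; the value is an assumption
RESCAN_DELAY = 15 * 60   # seconds before a URL may be hit again; also an assumption

# Hypothetical stand-in for one row of the queue table
UrlRequest = Struct.new(:url, :processed, :success, :retry_count, :next_scan_time, :unreachable)

# Rows the periodic cron-started code should pick up on this run
def due_requests(rows, now)
  rows.select { |r| !r.unreachable && !r.success && r.next_scan_time <= now }
end

# Record the outcome of one attempt
def record_result(row, ok, now)
  row.processed = true
  row.success = ok
  unless ok
    row.retry_count += 1
    row.unreachable = row.retry_count >= MAX_RETRIES
  end
  row.next_scan_time = now + RESCAN_DELAY
  row
end

start = Time.utc(2024, 1, 1)
row = UrlRequest.new("http://example.com", false, false, 0, start, false)

# Simulate three cron runs in which the request fails every time
3.times do |i|
  run_time = start + i * RESCAN_DELAY
  due_requests([row], run_time).each { |r| record_result(r, false, run_time) }
end

p row.retry_count   # prints 3
p row.unreachable   # prints true
```

After the third failure the row is flagged and `due_requests` stops returning it, so the site is no longer hit.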
There are a bunch of libraries for Rails that can manage queues of long-running background jobs for you. Here are a few:
Sidekiq uses Redis for job storage and supports multiple worker threads.
Resque also uses Redis and a single worker thread.
delayed_job manages a job queue through ActiveRecord (or Mongoid).
Once you've chosen one, I'd recommend using Foreman to simplify launching multiple daemons at once.
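The core idea behind the worker processes these libraries give you, a fixed pool of threads pulling from one shared queue, can be sketched with Ruby's standard library alone. This is an illustration of the pattern, not how any of the gems above is implemented; the pool size of 4 and the `sleep` standing in for an HTTP request are assumptions.

```ruby
POOL_SIZE = 4   # assumption for the demo; the question used batches of 25

jobs    = Queue.new   # shared by all users
results = Queue.new
active  = 0
peak    = 0
counter_lock = Mutex.new

workers = POOL_SIZE.times.map do
  Thread.new do
    # Queue#pop blocks until a job arrives; nil is the shutdown signal
    while (job = jobs.pop)
      counter_lock.synchronize do
        active += 1
        peak = [peak, active].max
      end
      sleep 0.01   # stand-in for the HTTP request
      results << [job[:user], job[:url]]
      counter_lock.synchronize { active -= 1 }
    end
  end
end

# Two users start their analysis at the same time; both feed the same queue
%w[alice bob].each do |user|
  10.times { |i| jobs << { user: user, url: "http://example.com/#{user}/#{i}" } }
end

POOL_SIZE.times { jobs << nil }   # one shutdown signal per worker
workers.each(&:join)

puts results.size        # prints 20
puts peak <= POOL_SIZE   # prints true
```

However many users submit work, the number of requests in flight never exceeds the pool size, which is exactly the limit the per-user thread pools in the question could not enforce.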

Resources