Options for managing ActionMailer-generated queues

An application, hosted as the only application on a server, will be handling e-mails for large numbers of users who will launch groups of mailings. Most other processing by the application is not very intense.
While the volume of mail will not be massive, it is significant, roughly in the thousands per day. Mails will mostly be sent as individual items following an action that involves multiple mail recipients; a lag will occur between individual items and between sub-groups of the recipients.
In other words, each mail can have a calculation as to the time when it should be issued.
There are multiple options for handling queues, which I would group into two categories.
a) RAM-based objects. These have the disadvantage of losing the queues if something happens to the server.
b) Database-based objects. These require more processing. (The only mechanism I can think of is one where the mails are stored with their release time, and a cron job (scheduler gem) checks every minute for unreleased mails whose release datetime is < Time.now, sends them off, and updates each mail's 'released' attribute; a minimal sketch follows.)
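A minimal sketch of that polling approach, assuming a hypothetical ScheduledMail model with release_at and released columns and a hypothetical UserMailer.scheduled mailer:

# Run every minute by cron or a scheduler gem.
class ReleaseDueMailsJob < ApplicationJob
  def perform
    ScheduledMail.where(released: false)
                 .where("release_at <= ?", Time.now)
                 .find_each do |mail|
      UserMailer.scheduled(mail).deliver_now
      mail.update!(released: true)
    end
  end
end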
Not having experience with any of the queuing options, my question is based on your experience: which option and which ActiveJob adapter (or none!) makes the most sense given the context, while containing complexity?

Related

Sidekiq Idempotency, N+1 Queries and deadlocks

The Sidekiq wiki talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer has what looks like an effective approach at a small scale. But it's not perfect. Jobs can disappear in the middle of running. We've noticed certain work is incomplete, and when we look in the logs they cut off in the middle of the work as if the job just evaporated. It's probably due to a server restart or something, but the job often doesn't find its way back into the queue. super_fetch tries to address this, but it errs on the side of duplicating jobs, and with it we see a lot of jobs that end up running twice simultaneously. Having a database transaction cannot protect us from duplicate work if both transactions start at the same time; we'd need locking to prevent that.
Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:
Spawn 1000 jobs, which each individually start a transaction, update a record, and send an email. This seems to be the default approach, and it is pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user-facing slowdowns and timeouts.
Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:
User.transaction do
  users.update_all(email_sent: true)
  users.each { |user| UserMailer.notification(user).deliver_now }
end
In the above scenario, if the UserMailer loop halts in the middle due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So there will be a subset of the emails that get re-sent. Potentially multiple times if there is a code error and the job keeps requeueing.
Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above.
What alternatives am I missing?
One additional problem with any transaction-based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past, the more we used transactions, the more deadlock errors we hit. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlocks better. We tried going one step further and locking the record, but then we started getting timeouts on the user side as web processes competed with background jobs for locks.
Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given the fact that we need to use the database to ensure idempotency, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.
This is a complicated, loaded question, as the right architecture depends on more factors than can be concisely described in a question/answer format. However, I can give a general recommendation.
Separate Processing From Delivery
start a transaction, update a record, and send an email
Separate these steps out. It's better to avoid doing both a DB update and an email send inside the same transaction, batched or not.
Do all your logic and record updates inside transactions, separately from email sends. Do them individually or in bulk, or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to roll back failures. If you save results as args to email send jobs, make sure processing the entire batch succeeds before enqueuing the batch. You have flexibility now because it's a pure data transform.
Enqueue email send jobs for each of those data transforms. These jobs must do little to no logic & processing! Keep them dead simple, no DB writes -- all processing should have already been done. Only pass values to an email template and send. This is critical because this external effect can't be wrapped in a transaction. Making email send jobs read-only for your system (they "write" to email, which is external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
By doing this, you separate the DB load of email processing from email sending, and the two are dealt with independently. Bugs in your email processing won't affect email sends. Email send failures won't affect email processing.
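A rough sketch of that separation, with hypothetical job and mailer names; the key point is that all DB writes happen inside the first job's transaction, and the send job only renders and delivers:

class ProcessNotificationsJob < ApplicationJob
  def perform(user_ids)
    payloads = User.transaction do
      users = User.where(id: user_ids)
      users.update_all(email_sent: true)
      users.map { |u| { "email" => u.email, "name" => u.name } }
    end
    # Enqueue sends only after the transaction has committed.
    payloads.each { |payload| SendNotificationEmailJob.perform_later(payload) }
  end
end

class SendNotificationEmailJob < ApplicationJob
  # Read-only for the DB: just render the template and send.
  def perform(payload)
    UserMailer.notification(payload["email"], payload["name"]).deliver_now
  end
end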
Regarding Row Locking & Deadlocks
There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.
Response: Jobs that die in the middle
Say the job is killed just after the transaction completes but before the emails go out.
I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).
Response: Duplicate jobs
Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"
You're using Sidekiq rather than writing your own async job system, so you can treat job system failures as out of scope. What remains are your jobs' performance characteristics and your job system configuration. If you're getting duplicate jobs, my guess is that your jobs are taking longer to complete than the configured job timeout: a job takes so long that Sidekiq thinks it died (since it hasn't reported success/failure yet) and spawns another attempt. Speed up or break up the job so it succeeds or fails within the configured timeout, and this will stop happening (99.99% of the time).
Unlike web requests, there's no human on the other side deciding whether or not to retry in an async job system. This is why your job performance profile needs to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like the following (see the sketch after the list):
expected job run time
expected job CPU/mem/disk usage
expected job DB or other I/O usage
job read only? write only? both?
jobs hitting external services
jobs users are actively waiting on
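For example, a hedged sketch of that kind of split with ActiveJob (queue names are illustrative, and they also need to appear in your Sidekiq configuration with appropriate weights):

class UserFacingExportJob < ApplicationJob
  queue_as :critical   # short job, a user is actively waiting on it

  def perform(user_id)
    # fast, read-mostly work
  end
end

class BulkEmailSendJob < ApplicationJob
  queue_as :low        # long-running, read-only, can lag behind

  def perform(payload)
    # slow work that hits external services
  end
end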
This is a super interesting question, but I'm afraid it's nearly impossible to give a "one size fits all" answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.
In my experience, generally the approach of having a scheduling job that then schedules individual jobs tends to work best. So in a full-blown system I have a schedule defined in clockwork where I schedule a scheduling job which then schedules the individual jobs:
# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
  UserUsageReportSchedulerJob.perform_now
end

# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
  def perform
    # need_usage_report is a scope to determine the list of users who need a report.
    # This could, of course, also be "all".
    User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
  end
end

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  def perform(user)
    # the actual report generation
  end
end
If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I can say that I've had projects where schedulers scheduled tens of thousands of individual (small) jobs, which Sidekiq then happily pulled in in batches of 10 or 20 on a low-priority queue and processed over a couple of hours with no issues whatsoever for Sidekiq itself, the server, the database, etc.

Should data being used by ActiveJob (resque) be persisted or put into a ruby object and passed by object id?

I am using Twilio to send/receive texts in a Rails 4.2 app. I am sending in bulk, around 1000 at a time, and receiving sporadically.
Currently when I receive a text I save it to the DB (to, from, body) and then pass that record to an ActiveJob worker to process later. For sending messages I currently persist the Twilio params to another DB table and pass that record to a different ActiveJob worker. Since I am often doing it in batches I have two workers. The first outgoing-message worker sends a single message. The second one queries the DB, finds all the users who should receive the message, creates a DB record for each message that should be sent, and then passes that record to the first outgoing-message worker. So the second one basically just creates a bunch of jobs for the first one to process.
Right now I have the workers destroying the records once they finish processing (both incoming and outgoing). I am worried about not persisting things in case the server, Redis, or Resque goes down, but I do not know if this is actually a good design pattern. It was suggested to me to just use a vanilla Ruby object and pass its id to the worker, but I am not sure how that affects data reliability. So is it overkill to be creating all these DB records, and should I just be creating vanilla Ruby objects and passing those objects' ids to the workers?
Any and all insight is appreciated,
Drew
It seems to me that the approach of sending a minimal amount of data to your jobs is the best approach. Check out the 'Best Practices' section on the sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Best-Practices
What if your queue backs up and that quote object changes in the meantime? Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
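For example, a minimal sketch of passing only an identifier and doing the lookup in perform (the job and Message model names here are illustrative):

class ProcessIncomingMessageJob < ActiveJob::Base
  queue_as :default

  def perform(message_id)
    message = Message.find_by(id: message_id)
    return unless message # the record may have been deleted since enqueueing
    # ... process the message ...
  end
end

# Enqueue with just the id:
ProcessIncomingMessageJob.perform_later(incoming_message.id)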
Also, in terms of reliability: you should be worried about your job queue going down. It happens. You either design your system to be fault-tolerant to failures or you find a job queue system that has higher reliability guarantees (though even then no queue system can guarantee 100% message deliverability). Sidekiq Pro has better reliability guarantees than Sidekiq (non-Pro), but if you design your jobs with a little bit of forethought, you can create jobs that can scan your database after a crash and re-queue any jobs that may have been lost (see the sketch below).
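A hedged sketch of that kind of recovery sweep, assuming the persisted records keep a processed_at timestamp instead of being destroyed (names are illustrative):

class RequeueUnprocessedMessagesJob < ActiveJob::Base
  def perform
    # Anything persisted a while ago but never processed is assumed lost.
    Message.where(processed_at: nil)
           .where("created_at < ?", 10.minutes.ago)
           .find_each do |message|
      ProcessIncomingMessageJob.perform_later(message.id)
    end
  end
end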
How much work you spend designing fault-tolerant solutions really just depends on how critical it is that your information makes it from point A to point B :)

One Delayed_job per email vs. delayed_job for all emails?

As part of my app I am sending out an email to many users daily. Depending on their status they will be sent one of five possible types of emails.
The logic that determines which email the user receives is fairly long.
Should I:
1) Create a delayed_job for each email
or
2) Send the entire logic (50 lines of Ruby) with the send commands into a single job
What are the pros/cons of either approach?
Further to Sabyasachi Ghosh's answer, here are the differences between DelayedJob and Resque:
DelayedJob relies on the DB
Requires ActiveRecord
Uses Ruby Objects (not just references)
Has much deeper queuing functionality (queue depth etc)
Runs much heavier than Resque
Resque relies on Redis
Lightweight
Runs independently of ActiveRecord
Is meant to process references (not entire objects)
Modularity
In answer to your question, I would look at modularity
Rails is based on the principle of DRY code -- which essentially means you should be as modular as possible (reusing code wherever you can). This leads to efficiency & simpler development cycles
In light of this, you have to observe your queueing functionality from the perspective of modularity. What does the queuing system actually do?
It queues things
Therefore, you want to include as little code as possible in the queuing system
I would create a Redis instance (you can get them on Heroku), and use Resque to queue specific information (such as an id or an email address)
This will allow you to use Resque to run through the Redis list, sending as many emails as you need. For example:
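A minimal Resque sketch of that, queuing only a user id and doing the lookup and send in the worker (class, queue, and mailer names are illustrative):

class EmailSender
  @queue = :mailers

  def self.perform(user_id)
    user = User.find_by(id: user_id)
    return unless user
    UserMailer.daily_update(user).deliver_now
  end
end

# Enqueue one job per recipient:
Resque.enqueue(EmailSender, user.id)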
If you have a lot of logic, I recommend not putting it in delayed_job when you need to send a bunch of emails; better to use Resque (https://github.com/resque/resque) or Sidekiq (http://sidekiq.org/), since while sending the emails delayed_job will be locking rows in your database, so your performance will be low.
If you have little logic and a small number of emails, just go for a delayed_job per email, as it is easy to set up and implement.
I think you should send each email via its own delayed job: if anything happens to a large job (it crashes or gets stopped), re-executing it can cause problems, so I suggest adding each email as a separate delayed job.

Practical use of delayed background job when dealing with many users

When a background job starts, it's sent to the back of a queue where a worker handles it; one task clears and the next starts. I think I've got this one right, except I don't understand the practical side of it in some cases. Sure, if you're a company sending out 15,000 newsletters once a week, using a delayed job makes perfect sense. But when you have an application of even 100 users, in which some task is long enough to need background work (like sending/fetching emails that might take a minute), then each user will have to wait in line while another user's job clears (if there's a single worker).
This is the part I'm not sure I'm getting right. I'm talking about the same job, but individually for each user. Does that count as a job per user? If I have 100 users, do I need to keep 100 workers for each one's process to not get tied up?
I've tried using delayed_job to simulate that, and indeed when I sign in with a different account I have to wait until another user's email gets sent until mine is. While the plugin is swift and simple to work with, I think it's not the right approach here.
I've also tried using Ajax, but since it's an HTTP request it ties up the browser in loading mode until it gets a response from the server (even with async: true). Not sure if I ruled this one out too quickly, but I was sort of looking for a more elegant server-side solution.
Is there a way to achieve a background job like this? (I've heard of different, mostly commercial solutions promising little waiting time, but I'm interested in completely eliminating the queue between users). If not, is there a method to make an ajax request without waiting for a response? I realize my questions are both drastically different but both seem like an appropriate solution to this problem.
Resque is a background processing engine that can support multiple queues.
Ways you could use this:
Group your tasks into queues that make sense based on their priority. If you need fast response times, put the task in a 'foreground' queue. Slow tasks (like sending/receiving emails) can go in a 'background' queue (see the sketch below)
Have one queue per user (you will need many, many workers for this)
This SO question also gives a way to use delayed_jobs with multiple queues/tables
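A hedged sketch of that split with plain Resque (class names are illustrative; a worker started with QUEUE=foreground,background rake resque:work drains the foreground queue before touching the background one):

class SendPasswordResetEmail
  @queue = :foreground # a user is actively waiting on this

  def self.perform(user_id)
    UserMailer.password_reset(User.find(user_id)).deliver_now
  end
end

class SyncMailbox
  @queue = :background # slow, nobody is blocked on it

  def self.perform(user_id)
    # fetch/send mail for the user...
  end
end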
The purpose of delayed_job and other message queues is to asynchronously process jobs outside of your core application. I always use a queue for sending email since I'm relying on an outside application (sometimes a third-party API like gmail) to send them and I can't guarantee available and operating efficiency.
So for your use case, even with very few users, I highly recommend offloading emails to delayed_job. This will speed up your front end (ajax) and will also give you retries upon failure. You could spin up multiple workers to process the queue, but it shouldn't be necessary with your numbers unless your calls to send mail are taking a really long time (more than a couple seconds?).
And yes in most situations I'd create separate jobs for each user even though the message might be identical. The only time I'd process them all together would be if the email application / API has bulk sending and you can reduce the number of calls significantly by sending a large payload in a few calls.
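For instance, a minimal delayed_job sketch of one job per user (the mailer method and scope are illustrative; with delayed_job's mailer support you call .delay and omit .deliver):

User.needs_daily_email.find_each do |user|
  # Each call enqueues one job; the worker renders and sends the mail.
  # Passing just user.id and looking the user up in the mailer is even safer.
  UserMailer.delay.status_notification(user)
end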

E-mailing users through RoR app

I am at the tail end of building a forum/Q&A community-based application, and I would like to add email notifications. The app has several different entities, including: threads, questions, projects, photos, etc. The goal is that a user can "subscribe" to any number of these entities, queuing an e-mail whenever the entity receives new comments or activity. This functionality is very similar to facebook and forums.
I have looked into ActionMailer (with rake tasks and delayed jobs), MailChimp API (and plugins), and other app mailers (PostageApp and Postmark).
I am leaning against ActionMailer, because of potential issues with memory hogging and server overload. The app will be running on Heroku, but I'm afraid the servers could be easily overwhelmed sending out potentially hundreds of emails every few minutes.
Another complexity is that there will be different types of subscriptions (instant email notification, daily email notification) based on user preference.
What would be the best way to manage email for functionality like this? Any tips/ideas are greatly appreciated!
You can use ActionMailer to send with SendGrid, or Postmark. PostageApp still needs an SMTP server and adds an additional dependency, but it can be nice to have. MailChimp is for newsletters only I believe, so that's probably not much use for you here.
Giving a high level overview here, a few things are important:
Keep mailer logic from cluttering controllers.
Prevent delaying responses to user requests.
Avoid issues with "application overload".
Handle event-based and periodic emails.
To address #1, you will want to use an Observer to decide when to send an event-based email. To account for #2 and #3, you can drop in DelayedJob so that emails are sent in the background. You can use SendGrid with ActionMailer on Heroku pretty easily (especially if you drop in Pony). For #4 you should just create a rake task that handles the business logic of deciding who to email and queues the send jobs as DJ tasks like the Observer would.
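A hedged sketch of the observer piece, with illustrative model, association, and mailer names (in Rails 4+ ActiveRecord::Observer lives in the rails-observers gem, and the observer must be registered via config.active_record.observers):

class CommentObserver < ActiveRecord::Observer
  def after_create(comment)
    # Queue one delayed email per subscriber of the commented-on entity.
    comment.entity.subscribers.find_each do |subscriber|
      NotificationMailer.delay.new_comment(subscriber.id, comment.id)
    end
  end
end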
And just to be clear, DelayedJob will execute jobs in a separate process. In the case of Heroku, you're actually running each DelayedJob worker in a different Dyno, which is an entirely separate stack/environment (quite probably on a different physical server). There won't be any issues with your app getting overloaded this way, unless of course your database can't keep up with adding jobs (in which case you can use Redis as a DJ store instead). You can easily scale horizontally by adding more DJ workers as needed.
Take a look at SimpleWorker, a cloud-based background processing / worker queue.
It's an add-on for Heroku and is able to scale up and out to handle a set of one-time posts or scheduled emails. (A master job, for example, can be scheduled and then when it comes off schedule to run, it queues up 10s, 100s, 1000s of jobs to run concurrently across a scaled out infrastructure.)
Heroku workers can work fine given they'll run as separate processes but if you have variable load, then you want a service that can scale up and scale down by the jobs -- so you a) don't pay for unused capacity and b) can handle burst traffic and batch output.
(Disclosure: I work for the company.)
