ActiveStorage::PurgeJob called under high throughput - ruby-on-rails

I am attaching files via Active Storage (backed by S3) in Rails, from async jobs. Occasionally an object will be uploaded, the job will complete, and then "Performing ActiveStorage::PurgeJob" and "Deleted file from key" will appear in the logs, after which the key is no longer accessible in S3. I am not triggering this purge anywhere in my codebase. It does not happen every time, but it does happen often during periods of high throughput.

Related

How does dataflow manage current processes during upscaling streaming job?

When a Dataflow streaming job with autoscaling enabled is deployed, it starts with a single worker.
Let's assume that the pipeline reads Pub/Sub messages, runs some DoFn operations, and uploads the results into BigQuery.
Let's also assume that the Pub/Sub queue is already fairly large.
So the pipeline starts, pulls some Pub/Sub messages, and processes them on the single worker.
After a couple of minutes it realizes that some extra workers are needed and creates them.
Many Pub/Sub messages have already been pulled and are being processed, but not yet acked.
And here is my question: how will Dataflow manage those unacked, in-flight elements?
My observations suggest that Dataflow sends many of those in-flight messages to a newly created worker, and we can see the same element being processed at the same time on two workers.
Is this expected behavior?
Another question: what happens next? Does the first attempt win, or the new one?
I mean, we have the same Pub/Sub message still being processed on the first worker and on the new one.
What if the process on the first worker is faster and finishes first? Will its result be acked and go downstream, or will it be dropped because a newer attempt for this element is in flight and only the new one can be finalized?
Dataflow provides exactly-once processing of every record. Perhaps surprisingly, this does not mean that user code is run only once per record, whether by the streaming or the batch runner.
It might run a given record through a user transform multiple times, or it might even run the same record simultaneously on multiple workers; this is necessary to guarantee at-least-once processing in the face of worker failures. Only one of these invocations can "win" and produce output further down the pipeline.
More information here - https://cloud.google.com/blog/products/data-analytics/after-lambda-exactly-once-processing-in-google-cloud-dataflow-part-1
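The "only one invocation wins" idea can be pictured as a commit step that dedupes on a stable record ID. The sketch below is only an illustration of the concept, not Dataflow's actual implementation; the class and method names are made up:

```ruby
# Toy illustration: at-least-once execution plus a commit that dedupes on
# a stable record id yields exactly-once *output*. If two workers process
# the same Pub/Sub message, only the first commit for that id wins.
class DedupingSink
  def initialize
    @committed = {}
  end

  # Returns :committed for the winning invocation and :dropped for any
  # duplicate attempt with the same record id.
  def commit(record_id, value)
    return :dropped if @committed.key?(record_id)

    @committed[record_id] = value
    :committed
  end
end

sink = DedupingSink.new
sink.commit("msg-1", "result from worker A")  # => :committed, first wins
sink.commit("msg-1", "result from worker B")  # => :dropped, duplicate
```

User code may have run twice, but only one result is ever emitted downstream, which is why redundant in-flight processing during upscaling is harmless.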

Sidekiq Idempotency, N+1 Queries and deadlocks

In the Sidekiq wiki it talks about the need for jobs to be idempotent and transactional. Conceptually this makes sense to me, and this SO answer has what looks like an effective approach at a small scale. But it's not perfect. Jobs can disappear in the middle of running. We've noticed certain work is incomplete and when we look in the logs they cut short in the middle of the work as if the job just evaporated. Probably due to a server restart or something, but it often doesn't find its way back into the queue. super_fetch tries to address this, but it errs on the side of duplicating jobs. With that we see a lot of jobs that end up running twice simultaneously. Having a database transaction cannot protect us from duplicate work if both transactions start at the same time. We'd need locking to prevent that.
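One sketch of that locking idea: take a short-lived lock keyed on the unit of work before starting, so two copies of the same job pulled off the queue can't both run. With the real redis gem this would be `redis.set(key, val, nx: true, ex: ttl)`; the in-memory stand-in below exists only so the sketch is self-contained:

```ruby
# Minimal in-memory stand-in for a Redis client's SET with NX/EX so the
# sketch runs anywhere; in production you'd pass a real Redis connection.
class FakeRedis
  def initialize
    @store = {}
  end

  def set(key, value, nx: false, ex: nil)
    return false if nx && @store.key?(key)

    @store[key] = value
    true
  end
end

class IdempotentWorker
  def initialize(redis)
    @redis = redis
  end

  # Runs the block only if this invocation wins the lock. The lock is left
  # in place (it expires via EX in real Redis), so a duplicate copy of the
  # same job started moments later becomes a no-op.
  def perform(job_key)
    return :skipped unless @redis.set("lock:#{job_key}", "1", nx: true, ex: 600)

    yield
    :done
  end
end

worker = IdempotentWorker.new(FakeRedis.new)
worker.perform("send-email-42") { :work }  # => :done, this copy runs
worker.perform("send-email-42") { :work }  # => :skipped, duplicate dropped
```

Note the usual caveat with any TTL-based lock: pick an expiry comfortably longer than the job's worst-case runtime, or a slow job's duplicate can sneak in after the lock expires.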
Besides the transaction, though, I haven't been able to figure out a graceful solution when we want to do things in bulk. For example, let's say I need to send out 1000 emails. Options I can think of:
Spawn 1000 jobs, which each individually start a transaction, update a record, and send an email. This seems to be the default, and it is pretty good in terms of idempotency. But it has the side effect of creating a distributed N+1 query, spamming the database and causing user facing slowdowns and timeouts.
Handle all of the emails in one large transaction and accept that emails may be sent more than once, or not at all, depending on the structure. For example:
User.transaction do
  users.update_all(email_sent: true)
  users.each { |user| UserMailer.notification(user).deliver_now }
end
In the above scenario, if the UserMailer loop halts in the middle due to an error or a server restart, the transaction rolls back and the job goes back into the queue. But any emails that have already been sent can't be recalled, since they're independent of the transaction. So there will be a subset of the emails that get re-sent. Potentially multiple times if there is a code error and the job keeps requeueing.
Handle the emails in small batches of, say, 100, and accept that up to 100 may be sent more than once, or not at all, depending on the structure, as above.
What alternatives am I missing?
One additional problem with any transaction-based approach is the risk of deadlocks in PostgreSQL. When a user does something in our system, we may spawn several processes that need to update the record in different ways. In the past, the more we used transactions, the more deadlock errors we saw. It's been a couple of years since we went down that path, so maybe more recent versions of PostgreSQL handle deadlock issues better. We tried going one step further and locking the record, but then we started getting timeouts on the user side as web processes competed with background jobs for locks.
Is there any systematic way of handling jobs that gracefully copes with these issues? Do I just need to accept the distributed N+1s and layer in more caching to deal with it? Given the fact that we need to use the database to ensure idempotency, it makes me wonder if we should instead be using delayed_job with active_record, since that handles its own locking internally.
This is a really complicated/loaded question, as the architecture really depends on more factors than can be concisely described in simple question/answer formats. However, I can give a general recommendation.
Separate Processing From Delivery
start a transaction, update a record, and send an email
Separate these steps. Avoid doing both a DB update and an email send inside one transaction, batched or not.
Do all your logic and record updates inside transactions, separately from email sends. Do them individually, in bulk, or perhaps even in the original web request if it's fast enough. If you save results to the DB, you can use transactions to roll back failures. If you save results as args to email send jobs, make sure processing the entire batch succeeds before enqueueing the batch. You have flexibility now because it's a pure data transform.
Enqueue an email send job for each of those data transforms. These jobs must do little to no logic and processing! Keep them dead simple, with no DB writes -- all processing should already have been done. Only pass values to an email template and send. This is critical because this external effect can't be wrapped in a transaction. Making email send jobs read-only for your system (they "write" to email, which is external to your system) also gives you flexibility -- you can cache, read from replicas, etc.
By doing this, you'll separate the DB load for email processing from email sends, and they are now dealt with separately. Bugs in your email processing won't affect email sends. Email send failures won't affect email processing.
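A sketch of that split, using minimal in-memory stand-ins for the job queue and mailer so it runs without Rails. In a real app these would be your ActiveJob classes and mailer; all names here (NotificationProcessingJob, EmailSendJob) are made up:

```ruby
QUEUE = []  # stand-in for the Sidekiq/ActiveJob queue
SENT  = []  # stand-in for actually delivered emails

class ApplicationJob
  def self.perform_later(*args)
    QUEUE << [self, args]  # enqueue; a worker calls perform later
  end
end

# Step 1: processing. In the real app this runs inside a DB transaction;
# its only output is plain data, so a failure rolls back cleanly and
# nothing external has happened yet.
class NotificationProcessingJob < ApplicationJob
  def self.process(users)
    payloads = users.map { |u| { to: u[:email], name: u[:name] } }
    # Enqueue sends only after the whole batch has succeeded ("committed").
    payloads.each { |p| EmailSendJob.perform_later(p) }
  end
end

# Step 2: delivery. Dead simple -- no logic, no DB writes, just pass the
# precomputed values along and send.
class EmailSendJob < ApplicationJob
  def self.perform(payload)
    SENT << payload[:to]  # stand-in for UserMailer.notification(...).deliver_now
  end
end

users = [{ email: "a@example.com", name: "Ada" },
         { email: "b@example.com", name: "Bo" }]
NotificationProcessingJob.process(users)
QUEUE.each { |job, args| job.perform(*args) }  # simulate the worker draining
```

Because the send jobs carry plain values and never touch the DB, a failed send can be retried indefinitely without re-running any processing.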
Regarding Row Locking & Deadlocks
There shouldn't be any need to lock rows at all anymore -- the transaction around processing is enough to let the DB engine handle it. There also shouldn't be any deadlocks, since no two jobs are reading and writing the same rows.
Response: Jobs that die in the middle
Say the job is killed just after the transaction completes but before the emails go out.
I've reduced the possibility of that happening as much as possible by processing in a transaction separately from email sending, and making email sending as dead simple as possible. Once the transaction commits, there is no more processing to be done, and the only things left to fail are systems generally outside your control (Redis, Sidekiq, the DB, your hosting service, the internet connection, etc).
Response: Duplicate jobs
Two copies of the same job might get pulled off the queue, both checking some flag before it has been set to "processing"
You're using Sidekiq rather than writing your own async job system, so you can treat job-system failures as out of your scope. What remains are your job performance characteristics and job system configuration. If you're getting duplicate jobs, my guess is that your jobs are taking longer to complete than the configured job timeout. Your job takes so long that Sidekiq thinks it died (since it hasn't reported back success/fail yet), and then spawns another attempt. Speed up or break up the job so it will succeed or fail within the configured timeout, and this will stop happening (99.99% of the time).
Unlike web requests, there's no human on the other side that will decide whether or not to retry in an async job system. This is why your job performance profile needs to be predictable. Once a system gets large enough, I'd expect completely separate job queues and workers based on differences like:
expected job run time
expected job CPU/mem/disk usage
expected job DB or other I/O usage
job read only? write only? both?
jobs hitting external services
jobs users are actively waiting on
This is a super interesting question but I'm afraid it's nearly impossible to give a "one size fits all" kind of answer that is anything but rather generic. What I can try to answer is your question of individual jobs vs. all jobs at once vs. batching.
In my experience, generally the approach of having a scheduling job that then schedules individual jobs tends to work best. So in a full-blown system I have a schedule defined in clockwork where I schedule a scheduling job which then schedules the individual jobs:
# in config/clock.rb
every(1.day, 'user.usage_report', at: '00:00') do
  UserUsageReportSchedulerJob.perform_now
end

# in app/jobs/user_usage_report_scheduler_job.rb
class UserUsageReportSchedulerJob < ApplicationJob
  def perform
    # need_usage_report is a scope to determine the list of users who need a report.
    # This could, of course, also be "all".
    User.need_usage_report.each(&UserUsageReportJob.method(:perform_later))
  end
end

# in app/jobs/user_usage_report_job.rb
class UserUsageReportJob < ApplicationJob
  def perform(user)
    # the actual report generation
  end
end
If you're worried about concurrency here, tweak Sidekiq's concurrency settings and potentially the connection settings of your PostgreSQL server to allow for the desired level of concurrency. I can say that I've had projects where schedulers queued tens of thousands of individual (small) jobs, which Sidekiq then happily worked through in batches of 10 or 20 on a low-priority queue over a couple of hours, with no issues whatsoever for Sidekiq itself, the server, the database, etc.
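For reference, capping concurrency and routing the small jobs to a low-priority queue is a sidekiq.yml change along these lines (queue names and weights are illustrative, not a recommendation):

```yaml
# config/sidekiq.yml
:concurrency: 10       # worker threads per Sidekiq process
:queues:
  - [critical, 4]      # weighted: checked 4x as often as...
  - [default, 2]
  - [low, 1]           # ...the queue holding the small report jobs
```

Weighted queues let the bulk jobs trickle through without starving latency-sensitive work.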

How to temporarily store files on heroku for delayed job import

I have an import functionality in my rails app that imports a CSV file and updates records accordingly. As this file gets bigger, the request takes longer and eventually times out. Therefore I chose to implement delayed_job to handle my long running requests. The only problem is, when the job runs, the error message Errno::ENOENT: No such file or directory is thrown. This is because my solution works with the CSV file in memory.
Is there a way to save the CSV file to my heroku server (and delete it after the import)?
Heroku's filesystems are ephemeral, i.e. content on them doesn't persist and they are not shared across dynos.
If your delayed job runs on another dyno (which it should, if it doesn't already), it cannot access a CSV that exists only on the disk of your web dyno.
One workaround would be to create an action that serves the CSV. You could then use an HTTP library to download the CSV before the job starts.
You can't store files on a dyno's filesystem and read them back from another dyno.
You can store the temporary file in an external cloud storage like AWS S3 and read this back from your delayed job.
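A sketch of that S3 approach, with storage abstracted behind a tiny interface so the example is self-contained. With aws-sdk-s3, the write/read/delete calls would map to `put_object`/`get_object`/`delete_object` on a bucket; all names here are illustrative:

```ruby
require "csv"

# Stand-in for external storage; in production this would wrap an S3
# bucket so both the web dyno and the worker dyno see the same bytes.
class MemoryStorage
  def initialize
    @blobs = {}
  end

  def write(key, data)
    @blobs[key] = data
  end

  def read(key)
    @blobs.fetch(key)  # raises KeyError if the object is gone
  end

  def delete(key)
    @blobs.delete(key)
  end
end

class CsvImport
  def initialize(storage)
    @storage = storage
  end

  # Web dyno: persist the uploaded CSV externally *before* enqueueing, and
  # hand the job only the key -- never a local file path.
  def enqueue(raw_csv)
    key = "imports/#{Time.now.to_i}.csv"
    @storage.write(key, raw_csv)
    key  # in the real app you'd enqueue a delayed_job with this key
  end

  # Worker dyno: fetch the CSV back, import each row, then clean up.
  def perform(key)
    rows = CSV.parse(@storage.read(key), headers: true)
    rows.each { |row| row.to_h }  # update your records from each row here
    @storage.delete(key)
    rows.count
  end
end
```

Since the job receives only a key into shared external storage, it no longer matters which dyno the web request or the worker lands on.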

Why is my Resque job showing up as a DirtyExit even though it succeeds in doing its job?

We have a scheduled job in resque which runs every morning at 06:00.
The script runs three different SQL-queries, two of which are quick (10-15 sec) and one heavy one (~12 min). After finishing each query it converts the resulting ActiveRecord to a csv in a Tempfile and then writes the csv to a file in an S3 bucket.
The job works fine and the files end up in S3 as expected, but for some reason the job still lands in Resque's list of failed jobs with a DirtyExit status.
What is marking it as failed and why?

Uploading multiple files to heroku serially locks up all dynos

It's my understanding that when I upload a file to my Heroku instance, it's a synchronous request: I get a 200 back when the request is done, which means my upload has been processed and stored by Paperclip.
I am using plupload, which performs a serial upload (one file at a time). On Heroku I have 3 dynos, and my app becomes unresponsive and I get timeouts trying to use it. My upload should really tie up at most a single dyno while all the files are being uploaded, since it's done serially and file 2 doesn't start until a response is returned for file 1.
As a test I bumped my dynos to 15 and ran the upload. Again I see the POSTs come into the logs, then I start seeing output of Paperclip commands (can't remember if it was identify or convert), and I start getting timeouts.
I'm really lost as to why this is happening. I do know I 'can' upload directly to S3, but my current approach should be just fine. It's an admin interface used by a single person, and again, at most it should tie up a single dyno since the uploaded files are sent serially.
Any ideas?
I've been working on the same problem for a couple of days. The problem, as far as I understand it, is that when uploading files through Heroku, your requests are still governed by the 30-second timeout limit. On top of this, subsequent requests issued to the same dyno (application instance) can cause it to accumulate response time and terminate. For example, if you issue two subsequent requests to your web app that each take 15 seconds to upload, you could receive a timeout, which will force the dyno to terminate the request. This is most likely why you are seeing timeout errors. If this continues across multiple dynos, you could end up with an application crash, or just generally poor performance.
What I ended up doing was using jquery-file-upload. However, if you are uploading large files (multiple MBs), you will still experience errors while Heroku is processing the uploads. In particular, I used this technique to bypass Heroku entirely and upload directly from the client's browser to S3. I upload to a temp directory, then use CarrierWave to 're-download' the file and process medium and thumbnail versions in the background by pushing the job to Qu. Now there are no timeouts, but the user has to wait for the jobs to be processed in the background.
Also important to note: Heroku dynos operate independently of each other, so by increasing the number of web dynos you are creating more instances of your application for other users, but each one is still subject to the 30-second timeout and 512 MB of memory. Regardless of how many dynos you have, you will still have the same issues. More dynos != better performance.
You can use something like Dropzone.js to queue your files and send them separately. That way the requests won't time out.
