Rails background image upload causing application timeout - ruby-on-rails

Note for anyone with a similar issue: unfortunately the bounty was awarded to an answer that does not solve this problem.
I have a form with an image upload (heroku to s3). When I submit the form, my rails server waits for the background job that uploads the image to finish before returning a response to the user. This causes an application timeout every single time there is an image upload.
Current order of events:
User submits form
Server receives form
If there is an image, server starts a background job
If a background job was started, the server waits for it to complete (rails times out here)
If started, the background job completes
The server processes the request
The server responds to the user
Desired order of events:
User submits form
Server receives form
Server processes non-image fields
If there is an image, server starts a background job
The server responds to the user
The background job completes and the server processes the uploaded image (saves URL)
Uploader code
class PhotoUploader < CarrierWave::Uploader::Base
  include ::CarrierWave::Backgrounder::Delay
  include CarrierWave::MimeTypes

  process :set_content_type

  storage :fog
end
Carrierwave::Backgrounder initializer
CarrierWave::Backgrounder.configure do |c|
  c.backend :sidekiq, queue: :carrierwave
end
User model
class User < ActiveRecord::Base
  mount_uploader :photo, PhotoUploader, delayed: true
  process_in_background :photo
end
There is no controller code because the form is handled by ActiveAdmin. I can override wherever is needed, but have not been able to figure out what needs to change.
What do I have to change to get the correct order of events?

The underlying issue here is that Heroku has strict limits on how long a request may block without sending data back to the client. If you hit that limit (30 seconds for the initial byte), Heroku will time out your request. For file uploads, it is very likely you will hit this limit.
The best approach is to have the user's browser directly upload the file to S3 first. There is some discussion here that is relevant to this: Direct Uploads to S3 using Carrierwave
If you use something like the jQuery File Upload plugin (https://github.com/blueimp/jQuery-File-Upload), the flow would be something like:
User adds one or more files to the form before clicking submit.
The files are uploaded directly to S3 and a file upload token is added for each file to your form.
User submits form with the tokens of the uploaded files rather than the contents of the files.
The server can move the files to their real home in S3 based on the submitted token.
This allows your web server to focus on just serving the request and not block on file uploads which can take a long time.
This requires a bit more work, but ultimately I think it is your only option for staying under Heroku's timeout limits.
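If it helps, here is a rough sketch of step 4 from the flow above, assuming the aws-sdk-s3 gem; the bucket names, key layout, and the promote_upload helper are illustrative placeholders, not anything Carrierwave prescribes:

# Hypothetical sketch: move a direct-uploaded file from the temporary
# uploads bucket to its permanent location, keyed by the token the
# client submitted with the form.
require 'aws-sdk-s3'

def promote_upload(token, user_id)
  s3     = Aws::S3::Resource.new
  source = s3.bucket('myapp-uploads-tmp').object(token)

  # copy to the permanent bucket under the user's prefix, then delete the temp copy
  source.move_to(bucket: 'myapp-uploads', key: "users/#{user_id}/#{token}")
end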
Also, I'd recommend you create an upload S3 bucket and then set an S3 lifecycle policy to purge files older than some interval. When you are doing direct file uploads, it is common for some uploads to never be processed (the user gives up, etc.), so the lifecycle rule does the job of cleaning up these orphaned files.
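One possible lifecycle rule, again via the aws-sdk-s3 gem (the bucket name and one-day retention are placeholders; the same rule can be created from the S3 console instead):

# Hypothetical sketch: expire leftover direct uploads after one day so
# abandoned files do not accumulate in the temporary bucket.
require 'aws-sdk-s3'

Aws::S3::Client.new.put_bucket_lifecycle_configuration(
  bucket: 'myapp-uploads-tmp',
  lifecycle_configuration: {
    rules: [
      {
        id:         'purge-stale-uploads',
        status:     'Enabled',
        filter:     { prefix: '' },   # every object in the bucket
        expiration: { days: 1 }       # purge anything older than a day
      }
    ]
  }
)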

Related

How to upload files and handle processing and validations - a very general overview?

The problem at hand
I have a rails app.
Users will be uploading files. Anywhere between 1 file and 3000 files. Sometimes they are zip files, and sometimes they are not. I do not want to hold up the server with these file uploads, so I am looking for a solution around this problem.
The zipped files will have to be unzipped.
I then want to check whether the user has previously uploaded the same files, i.e. if the user has already uploaded the same file(s) one week ago, then this is a problem: (i) either we don't allow that particular file to be uploaded, or (ii) we ask the user: are you sure you want to upload the same file again?
Then I want to store the keys/links to the files within the appropriate models/records on the back end.
I was wondering what the best workflow for handling the above could be, i.e. a very general overview. In other words, could AWS Lambda, Google Cloud, etc. best be employed to handle the above problem? How would we use the Shrine gem to best handle this situation? Would it make sense to use AWS Lambda rather than background jobs?
My preferences are to use the Shrine gem for uploading.
My Ideas:
On the client side, the user drags and drops the files they want to upload.
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
If zip files are uploaded, then perhaps an AWS Lambda function must be triggered to unzip the files. If that's the case, then at the end of the day the keys for these files must somehow be returned to the client to handle validation issues – but then how would the AWS Lambda function be able to return its result to the original client side where the request originated? Or rather, should the AWS Lambda function be invoked from the client side, passing in the IDs of the unzipped blobs?
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our Rails backend as to whether those files have already been uploaded.
After those validation issues are handled, then user submits the form, and all the keys are stored within the appropriate records.
These ideas are by no means prescriptive.
I am seeking some very general advice on the best way of doing all this. I am by no means constrained to AWS: I could use Google or Azure just as easily. Any guidance on the above would be much appreciated.
Specific questions:
How would the AWS lambda function get triggered?
How would we be able to return the keys of the uploaded files back to the client?
What do I mean by general overview?
Here are some examples of general overviews:
(1) Uploading & Unzipping files to S3 through Rails hosted on Heroku?
(2) https://www.quora.com/How-do-I-extract-large-zip-files-in-AWS-Lambda
Any pointers in the right direction would be much appreciated.
Cheers!
This isn't a really difficult problem to solve if you are willing to change the process flow a little bit.
On the client side, the user drags and drops the files they want to upload.
When the user requests the upload operation to begin, you can make HTTP GET requests to an API Gateway endpoint backed with a Lambda. The Lambda can query for previous files uploaded by the client and send back a result set showing which files already exist. You then filter those out and send only what is considered new from the client to the server. This saves the user time waiting for the upload to happen and saves you time on the S3/Lambda side by not having to store or process duplicates. This isn't a substitute for server-side validation, though; you'll still want to do that. For legit clients, this will save you and them a lot of bandwidth and storage.
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
This works. As the files enter the temp bucket, use a Lambda with an S3 event to process them, unzip archives, push any metadata needed into DynamoDB, and delete the files from the temp bucket. In the temp bucket, I would place the files into a folder that is unique per request and user. I would take the user/client id plus a UUID of some kind and make that your folder name, such as Johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f, or encode this value into a Base64 string and make that your folder name. Store this in DynamoDB for each file uploaded into your permanent bucket, with the hash key being the user/client id, the sort key being the full folder path + file name, and an extra attribute of IsProcessed. The IsProcessed attribute will be updated by your Lambda that is processing the files and moving them to their permanent S3 bucket. If there are errors, you can put the error in this field; if processing succeeds, you mark it accordingly.
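To make that concrete, here is a minimal sketch of that S3-event Lambda using the Ruby runtime; the table name, attribute names, and the permanent bucket are all placeholders that follow the scheme described above:

# Hypothetical S3-event Lambda (Ruby runtime) for the temporary bucket.
# It records each object in DynamoDB, processes/moves it, then updates
# IsProcessed. All table, key, and bucket names are assumptions.
require 'aws-sdk-dynamodb'
require 'aws-sdk-s3'

DDB = Aws::DynamoDB::Client.new
S3  = Aws::S3::Resource.new

def handler(event:, context:)
  event['Records'].each do |record|
    bucket = record['s3']['bucket']['name']
    key    = record['s3']['object']['key']   # e.g. "Johnathon+<uuid>/archive.zip"
    client_id = key.split('/').first.split('+', 2).first

    DDB.put_item(
      table_name: 'uploads',
      item: { 'ClientId' => client_id, 'FileKey' => key, 'IsProcessed' => 'pending' }
    )

    begin
      # ... unzip, run duplicate checks, push the result to the permanent bucket ...
      S3.bucket(bucket).object(key).move_to(bucket: 'permanent-bucket', key: key)
      status = 'ok'
    rescue => e
      status = "error: #{e.message}"
    end

    DDB.update_item(
      table_name: 'uploads',
      key: { 'ClientId' => client_id, 'FileKey' => key },
      update_expression: 'SET IsProcessed = :s',
      expression_attribute_values: { ':s' => status }
    )
  end
end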
the keys for these files must somehow be returned to the client, to handle validation issues – but then how would the AWS Lambda function be able to return its result to the original client side where the request originated? Or rather, should the AWS Lambda function be invoked from the client side, passing in the IDs of the unzipped blobs?
The original API request to push the files to the temp S3 bucket would be able to return the folder name johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f to the client. So let's say you made an HTTP POST to /jobs. You would return 201 Created with an HTTP Location header of /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f. Your client can then start polling /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f for the status of the process.
Your response back to /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f can return the DynamoDB records. This would include all DynamoDB records for the HashKey matching the folder name. Your client side can look at all of the objects in the result set and check the IsProcessed attribute to see if everything worked out ok, or if there were issues.
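On the polling side, one way to back GET /jobs/:id (whether behind API Gateway or a plain Rails route) might look like the following; the table and attribute names match the assumptions in the sketch above:

# Hypothetical Rails controller behind GET /jobs/:id, where :id is the
# "user+UUID" folder name returned in the Location header. It returns the
# DynamoDB items for that job so the client can inspect IsProcessed.
class JobsController < ApplicationController
  def show
    ddb       = Aws::DynamoDB::Client.new
    client_id = params[:id].split('+', 2).first

    result = ddb.query(
      table_name: 'uploads',
      key_condition_expression: 'ClientId = :c AND begins_with(FileKey, :f)',
      expression_attribute_values: { ':c' => client_id, ':f' => params[:id] }
    )

    render json: result.items
  end
end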
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our Rails backend as to whether those files have already been uploaded.
Handle this with the Lambda that is executed by the temporary bucket. Grab the files from the temp bucket folder, handle your business logic and back-end queries then push them to their final permanent bucket.
After those validation issues are handled, then user submits the form, and all the keys are stored within the appropriate records.
All of this would happen asynchronously, starting when the user submits the form. The client side needs to be able to handle this by making HTTP GET requests to the endpoint mentioned above, checking for the status of the process. This gives you some more flexibility, as you can also publish SNS messages on failures, such as sending an email to the clients if they upload 3,000 files and you need to spend 30 minutes processing them.

Share 1 storage repository between 2 database

I'm working on my project for my "distributed system development" class, and my project is a minimal version of cloud storage (something like Google Drive).
My approach here is to use two backend servers written in Rails, with one proxy server to route requests to them, and two Postgres servers in a master-slave replication relationship.
But the problem here is how to store real assets (video, pdf, mp3, ...). I have no experience with this.
Example:
If one user opens two browser tabs, and in each tab uploads a video with the same name to the same directory, what will happen?
Since you probably want to upload asynchronously, this is pretty easy to handle: generate some sort of token before uploading (i.e. filename + hash), then hand the upload off to the delayed job. If the user tries uploading the second file, it will generate the same token and be rejected.
Example of keeping track of the uploads in the DB: generate a record before the upload starts and save the filename and the hash.
Asset.create(filename: ..., hash: ...)
Once the upload finishes you can update the record with the S3 URL or whatever you use for storage (pass the asset id to the delayed job). The validation then is easy:
validates :hash, uniqueness: { scope: :filename }
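Put together, a hedged sketch of the whole flow might look like this; the content_hash column (renamed from hash, since ActiveRecord treats hash as a dangerous attribute name) and the S3Uploader class are placeholders, and upload stands for the ActionDispatch::Http::UploadedFile from the form:

# Hypothetical Asset model plus the hand-off to the delayed job.
require 'digest'

class Asset < ActiveRecord::Base
  # content_hash is the "hash" column from above, renamed to avoid clashing
  # with Ruby's Object#hash.
  validates :filename,     presence: true
  validates :content_hash, presence: true, uniqueness: { scope: :filename }
end

# Generate the token (filename + hash) and record it before the upload starts;
# a duplicate raises a validation error and is rejected.
asset = Asset.create!(
  filename:     upload.original_filename,
  content_hash: Digest::MD5.file(upload.path).hexdigest
)

# Hand the actual S3 transfer off to delayed_job; the job updates the record
# with the S3 URL once the transfer finishes.
S3Uploader.delay.upload(asset.id, upload.path)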

Rails/Heroku - How to create a background job for process that requires file upload

I run my Rails app on Heroku. I have an admin dashboard that allows for creating new objects in bulk through a custom CSV uploader. Ultimately I'll be uploading CSVs with 10k-35k rows. The parser works perfectly on my dev environment and 20k+ entries are successfully created through uploading the CSV. On Heroku, however, I run into H12 errors (request timeout). This obviously makes sense since the files are so large and so many objects are being created. To get around this I tried some simple solutions, amping up the dyno power on Heroku and reducing the CSV file to 2500 rows. Neither of these did the trick.
I tried to use my delayed_job implementation, in combination with adding a worker dyno to my Procfile, to .delay the file upload and processing so that the web request wouldn't time out waiting for the file to process. This fails, though, because the background process relies on a CSV upload which is held in memory at the time of the web request, so the background job doesn't have the file when it executes.
It seems like what I might need to do is:
Execute the upload of the CSV to S3 as a background process
Schedule the processing of the CSV file as a background job
Make sure the CSV parser knows how to find the file on S3
Parse and finish
This solution isn't 100% ideal as the admin user who uploads the file will essentially get an "ok, you sent the instructions" confirmation without good visibility into whether or not the process is executing properly. But I can handle that and fix later if it gets the job done.
tl;dr question
Assuming the above-mentioned solution is the right/recommended approach, how can I structure this properly? I am mostly unclear on how to schedule/create a delayed_job entry that knows where to find a CSV file uploaded to S3 via Carrierwave. Any and all help much appreciated.
Please request any code that's helpful.
I've primarily used Sidekiq to queue asynchronous processes on Heroku.
This link is also a great resource to help you get started with implementing Sidekiq with Heroku.
You can put the files that need to be processed in a specific S3 bucket and eliminate the need for passing file names to the background job.
The background job can then fetch files from that S3 bucket and start processing, as sketched below.
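A minimal Sidekiq worker along those lines might look like this; the bucket name, queue name, and the Product model/columns are placeholders, and the CSV is assumed to fit in memory:

# Hypothetical Sidekiq worker: pulls a CSV that Carrierwave pushed to a
# known S3 bucket, creates one record per row, then deletes the file.
require 'sidekiq'
require 'aws-sdk-s3'
require 'csv'

class CsvImportWorker
  include Sidekiq::Worker
  sidekiq_options queue: :imports

  def perform(s3_key)
    object = Aws::S3::Resource.new.bucket('myapp-csv-imports').object(s3_key)

    CSV.parse(object.get.body.read, headers: true) do |row|
      Product.create!(row.to_h)   # placeholder model and columns
    end

    object.delete   # clean up the bucket once processing is done
  end
end

# Enqueued from the admin controller once the upload has landed on S3:
#   CsvImportWorker.perform_async(csv_upload.path)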
To provide real time update to the user, you can do the following:
Use memcached to maintain the status; the background job should keep updating the status information. If you are not familiar with caching, you can use a db table.
Include JavaScript/jQuery in the user response. This script should make AJAX requests to get the status information and show updates to the user. But if it is a big file, the user may not want to wait for the job to complete, in which case it is better to provide a query interface for checking job status.
The background job should delete/move the file from the bucket on completion.
In our app, we let users import data for multiple models and developed a generic design. We maintain the status information in db since we perform some analytics on it. If you are interested, here is a blog article http://koradainc.com/blog/ that describes our design. The design does not describe background process or S3 but combined with above steps should give you full solution.

rails controller download from aws s3

I am trying to build a really easy way for my users to download audio content from aws via my website. Here is the flow:
I give the user a download link. Ex: www.mysite.com/foobar
User clicks on the link.
In my Rails controller, I create an expiring AWS S3 URL and automatically start downloading the audio content from that URL.
The user's browser should ask whether or not to save the file. In the event the user accepts to save the file, I want a callback to my Rails app to log that the user actually downloaded the file.
So, from a user's perspective, I want the process to be as simple as going to a url I determine, and accepting to download the file when prompted.
In the background, I want to keep the aws s3 url hidden from the user and I want to have the flexibility to write callback logic after the user accepts the download.
What is the recommended way to achieving this?
The best way to solve this is to create an S3 URL with a very short (10 minute?) lifetime and return a redirect to the S3 URL. This does expose the S3 url to the user, but isn't a vulnerability.
If you want to hide the S3 URL, you will need to proxy the download through your servers, which is expensive and consumes a worker process for long periods of time. I do not recommend this, but it is the only way to conceal the S3 resource.
Additionally, if triggering a download vs. a view is important, you need to set the Content-Disposition header to trigger an attachment download:
Content-Disposition: attachment; filename="fname.ext"
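As a sketch of both points together (the model names, bucket, and ten-minute expiry are assumptions, and allow_other_host: is only needed on Rails 7+):

# Hypothetical download action: log that the user requested the file, then
# redirect to a short-lived presigned URL that forces an attachment download.
class DownloadsController < ApplicationController
  def show
    track = Track.find(params[:id])                         # placeholder model
    DownloadLog.create!(user: current_user, track: track)   # records the click, not a confirmed download

    url = Aws::S3::Resource.new
                           .bucket('myapp-audio')
                           .object(track.s3_key)
                           .presigned_url(
                             :get,
                             expires_in: 600,   # 10 minutes
                             response_content_disposition:
                               %(attachment; filename="#{track.filename}")
                           )

    redirect_to url, allow_other_host: true
  end
end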

Can we find out when a Paperclip download is complete?

I have an application where I need to know when a user's Rails/Paperclip file download is complete. My app is set up to interact with Amazon S3 and I need to run a javascript function when the user has received the completed file.
How can I do this?
Tracking whether or not the download completes is hard, especially in JavaScript. There are a few blurred lines in your question which make me think it's not possible.
First, send_file sets a special header telling the webserver what to send. See the send_file docs. Rails doesn't actually send the file at all; it sets this header, which tells the webserver to send the file, then returns immediately and moves on to serve another request. To be able to track whether the download completes, you'd have to occupy your Rails application process sending the file and block until the user has downloaded it, instead of leaving that to the webserver (which is what it's designed to do). This is super inefficient.
Next, how can you still be on a page to execute a JavaScript function if you are downloading a file? Your user clicks the file download link and is taken to wherever the file is, whether that be a send_file from Rails or a redirect to S3 or whatever; they are no longer on the page they came from. If you are thinking about the way Chrome or Firefox works, where the download goes into a download manager and the user stays on the page, there's no more interaction with the server from the old page! If you want that page to be notified of download completion, then you'd need a periodic check or long poll to the server to see if the download is done.
I think you'd be better served by redirecting to the S3 file and setting a session variable to redirect the user to where you want them to go after the download is complete so that the next time they visit any page they are back in your planned flow.
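A tiny sketch of that idea, assuming Paperclip's S3 storage (the attachment name, expiry, session key, and after_download_path helper are placeholders):

# Hypothetical: remember where the user should go next, then redirect to a
# short-lived Paperclip S3 URL. A later request checks the session flag.
class DownloadsController < ApplicationController
  def show
    upload = Upload.find(params[:id])            # placeholder model
    session[:return_to] = after_download_path    # wherever your planned flow resumes

    redirect_to upload.file.expiring_url(600)
  end
end

# Elsewhere (e.g. a before_action), send the user back into the flow:
#   if (path = session.delete(:return_to))
#     redirect_to path
#   end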
Hope this helps!
