One of the features of the application I am currently working on is photo upload. Customers upload photos in the frontend, and the photos are passed to the Rails backend and then stored on Amazon S3.
I have noticed that a huge amount of request time is spent uploading photos to S3. The photos are uploaded one by one, so the latency is multiplied. It would be great if I could somehow store photos temporarily in RAM and speed up the request.
I have thought about running a Sidekiq job with the file as a param, but according to the Sidekiq documentation, passing a huge object is not good practice. How can I solve this another way?
I would solve this problem by using an API to generate a presigned URL and using Cognito to upload the image to S3 from the client, then getting the image link.
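A minimal sketch of the presigned-URL piece with the aws-sdk-s3 gem (the bucket env var and key layout are placeholders; Cognito only comes into play on the client side for credentials):

    require 'aws-sdk-s3'
    require 'securerandom'

    # Returns a short-lived URL the browser can PUT the photo to directly,
    # so the Rails backend never touches the file bytes.
    def presigned_upload_url(filename)
      object = Aws::S3::Resource.new
                                .bucket(ENV['S3_BUCKET'])
                                .object("uploads/#{SecureRandom.uuid}/#{filename}")
      { url: object.presigned_url(:put, expires_in: 900), key: object.key }
    end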
nginx/puma running on machine A should save the image as a local file. Run Sidekiq on the same machine A and pass the filename to a job in a host-specific queue for Sidekiq to process. That way you can pass a file reference without worrying about which machine will process it.
Make sure Sidekiq deletes the file so you don't fill up the disk!
https://www.mikeperham.com/2013/11/13/advanced-sidekiq-host-specific-queues/
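A rough sketch of that setup (the worker, bucket env var, and file path are placeholders; assumes the aws-sdk-s3 gem):

    require 'socket'
    require 'sidekiq'
    require 'aws-sdk-s3'

    class PhotoUploadWorker
      include Sidekiq::Worker

      def perform(path)
        # Upload the locally saved file to S3, then clean up the disk.
        object = Aws::S3::Resource.new.bucket(ENV['S3_BUCKET']).object(File.basename(path))
        object.upload_file(path)
        File.delete(path)
      end
    end

    # The web process on machine A enqueues into a queue named after its host;
    # start Sidekiq on the same machine with: sidekiq -q default -q `hostname`
    PhotoUploadWorker.set(queue: Socket.gethostname).perform_async('/tmp/uploads/photo123.jpg')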
I have a simple setup going for an API I'm building in Rails.
A zip is uploaded via a POST, and I take the file, store it in Rails.root/tmp using CarrierWave, and then background an S3 upload with Sidekiq.
The reason I store the file temporarily is that I can't send a complex object to Sidekiq, so I store the file, send its ID, let Sidekiq find it and do the work, and then delete the file once it's done.
The problem is that once it's time for my Sidekiq worker to find the file by its path, it can't, because the file doesn't exist. I've read that Heroku's ephemeral filesystem deletes its files when things are reconfigured, servers are restarted, etc.
None of those things are happening, however, and the file still doesn't exist. My theory is that the Sidekiq worker is actually trying to open the path that gets passed to it on its own filesystem, and since it's a separate worker, that file doesn't exist there. Can someone confirm this? If that's the case, are there any alternate ways to do this?
If your worker is executed on a different dyno than your web process, you are experiencing this issue because of dyno isolation. Read more about this here: https://devcenter.heroku.com/articles/dynos#isolation-and-security
Although it is possible to run Sidekiq workers and the web process on the same machine (maybe not on Heroku, I am not sure about that), it is not advisable to design your system architecture like that.
If your application grows or experiences temporary spikes in load, you may want to spread the load across multiple servers, and usually also run your workers on separate servers from your web process, so that busy workers cannot block the web process.
In all those cases you can never share data on the local filesystem between the web process and the worker.
I would recommend considering a direct upload to S3 using https://github.com/waynehoover/s3_direct_upload
This also takes a lot of load off your web server.
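The same idea can also be done without the gem; here is a rough sketch of generating a browser POST policy with aws-sdk-s3 (the bucket name and key prefix are assumptions):

    require 'aws-sdk-s3'
    require 'securerandom'

    bucket = Aws::S3::Resource.new.bucket(ENV['S3_BUCKET'])
    post   = bucket.presigned_post(key: "uploads/#{SecureRandom.uuid}",
                                   success_action_status: '201',
                                   acl: 'private')

    # The client-side uploader posts the file straight to post.url with
    # post.fields as hidden form inputs, so the upload never touches your dyno.
    puts post.url
    puts post.fields.inspect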
I'm using Unicorn on Heroku. One of the issues I'm having is with file uploads. We use CarrierWave for uploads, and basically, even for a file that's about 2 MB in size, Unicorn times out by the time the upload is 50-60% done.
We aren't using Unicorn when we test locally, and I don't have any issues with large files locally (though the files get uploaded to AWS using CarrierWave, just as in production and staging). However, on the staging and production servers, I see that we get a timeout.
Any strategies on fixing this issue? I'm not sure I can put this file upload on a delayed job (because I need to confirm to my users that the file has indeed been successfully uploaded).
Thanks!
Ringo
If you're uploading big files to S3 via Heroku, you can't reasonably avoid timeouts. If someone decides to upload a large file and the combined upload to Heroku, transfer to S3, and processing take longer than 30s, the request will time out. And for good reason: a 30-second request is just crappy performance.
This blog post (and github repo) is very helpful: http://pjambet.github.io/blog/direct-upload-to-s3/
With it, you should be able to get direct-to-s3 file uploads working. You completely avoid hitting Heroku for the bulk of the upload. Using jquery-fileupload's callbacks, you can post to your application after the file is successfully uploaded, and process it in the background using delayed_job. Confirming to your users that the upload is successful is an application problem you just need to take care of.
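As a hedged sketch of that callback wiring (the controller and job names are made up):

    # The job only needs the S3 key; delayed_job serializes the Struct instance.
    ProcessS3Upload = Struct.new(:s3_key) do
      def perform
        # Download from S3, create records, generate thumbnails, etc.
      end
    end

    class UploadsController < ApplicationController
      # jquery-fileupload's `done` callback POSTs the S3 key here once the
      # direct-to-S3 upload succeeds, so you can confirm it to the user right away.
      def create
        Delayed::Job.enqueue ProcessS3Upload.new(params[:key])
        head :created
      end
    end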
Sounds like your timeout is set too low. What does your unicorn config look like?
See https://devcenter.heroku.com/articles/rails-unicorn for a good starting point.
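For reference, a config/unicorn.rb along the lines of what that article suggests (from memory; tune the numbers for your app):

    # config/unicorn.rb
    worker_processes Integer(ENV['WEB_CONCURRENCY'] || 3)
    timeout 15          # raise this only if legitimate requests really need longer
    preload_app true

    before_fork do |_server, _worker|
      # Disconnect shared resources before forking workers.
      defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
    end

    after_fork do |_server, _worker|
      # Each worker needs its own database connection.
      defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
    end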
I want to upload images (around 200 kB each) in bulk. We have multiple options such as CarrierWave, Paperclip, and others. How can I perform these uploads in bulk?
Like other things in computer science, the answer is "it depends™". What I really mean is:
Are end users going to be uploading these? If yes, use the jQuery File Upload plugin to present an easy-to-use interface.
For storage, you can store the images on your own server, or even better, upload them directly from users' computers to Amazon S3. Here is an example: Uploading Files to S3 in Ruby with Paperclip.
Obviously you will need to convert this into a background job in which each image gets uploaded via AJAX and handled in a separate job. If you don't already have a favorite queue system, I would suggest Resque or Sidekiq.
Note: if you choose to upload images directly to S3 via CORS, then your Rails server is freed from managing file uploads. Direct uploads are recommended if you have large images or a large number of files to upload.
However, direct uploads limit your ability to modify the images (resize, etc.), so keep that in mind if you choose the direct-upload solution; a background job can handle that step, as sketched below.
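A rough sketch of that post-processing step with Sidekiq and MiniMagick (the bucket, key layout, and class name are placeholders):

    require 'sidekiq'
    require 'aws-sdk-s3'
    require 'mini_magick'
    require 'tempfile'

    class ImageResizeWorker
      include Sidekiq::Worker

      def perform(key)
        s3 = Aws::S3::Client.new
        Tempfile.create(['original', File.extname(key)]) do |tmp|
          # Pull down the image the browser uploaded directly to S3.
          s3.get_object(bucket: ENV['S3_BUCKET'], key: key, response_target: tmp.path)

          image = MiniMagick::Image.open(tmp.path)
          image.resize '1024x1024>'   # shrink only, keep aspect ratio
          image.write tmp.path

          s3.put_object(bucket: ENV['S3_BUCKET'], key: "resized/#{key}",
                        body: File.read(tmp.path))
        end
      end
    end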
TL;DR
Don't. Rails isn't optimized for bulk uploads, so do it out-of-band whenever you can.
Use FTP/SFTP
The best way to deal with large volumes of files is to use an entirely out-of-band process rather than tying up your Rails process. For example, use FTP, FTPS, SCP, or SFTP to upload your files in bulk. Once the files are on a server, post-process them with a cron job, or use inotify to kick off a rake task.
NB: Make sure you pay attention to file-locking when you use this technique.
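A hypothetical rake task for that post-processing step (the directory and service object are made up; the flock call is the file-locking part):

    # lib/tasks/uploads.rake
    namespace :uploads do
      desc 'Process files dropped into the SFTP inbox'
      task process: :environment do
        Dir.glob('/data/sftp_inbox/*.jpg').each do |path|
          File.open(path, 'rb') do |file|
            # Skip files the FTP daemon is still writing to.
            next unless file.flock(File::LOCK_EX | File::LOCK_NB)
            ImageImporter.import(file)   # hypothetical service object
            File.delete(path)
          end
        end
      end
    end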
Use a Queue
If you insist on doing it via Rails, don't upload hundreds of files. Instead, upload a single archive containing your files to be processed by a background job or queue. There are many alternatives here, including Sidekiq and RabbitMQ among others.
Once the archive is uploaded and the queued job submitted, your queue process can unpack the archive and do whatever else needs to be done. This type of solution scales very well.
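A sketch of the unpack step, assuming the archive was stashed on S3 first and the rubyzip gem is available (class, bucket, and key names are placeholders):

    require 'sidekiq'
    require 'aws-sdk-s3'
    require 'zip'
    require 'tempfile'
    require 'tmpdir'

    class ArchiveImportWorker
      include Sidekiq::Worker

      def perform(s3_key)
        Tempfile.create(['archive', '.zip']) do |tmp|
          Aws::S3::Client.new.get_object(bucket: ENV['S3_BUCKET'], key: s3_key,
                                         response_target: tmp.path)

          Zip::File.open(tmp.path) do |zip|
            zip.each do |entry|
              next unless entry.file?
              # Process each file however the app needs; here we just extract it.
              entry.extract(File.join(Dir.tmpdir, File.basename(entry.name)))
            end
          end
        end
      end
    end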
Currently I have an application that is uploading images to S3 in a background (Sidekiq) task. It works fine; however, I have had to "hack" together a solution and was curious whether anyone knew of a better way to do this.
Problem:
When using Paperclip and a background job on Heroku, the worker is usually not able to access the tmp file because it is spun up on a different server. I have tried to have Paperclip use the tmp folder on Heroku, and it stores the file there, but the background tasks have always returned a "File not found".
Temp solution:
It involves encoding the image to a Base64 string and passing that into the perform task (disgusting, bad, horrible, large overhead).
Is there a better way to do this on Heroku? I don't want to save an image blob into the database, as that is just as bad a practice.
Would it be possible to use the direct-upload approach from the Heroku S3 guide, and then have some background job resize or process the image if needed?
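Roughly what I have in mind, as a sketch (it assumes the browser has already PUT the file to a temporary S3 key via a presigned URL; the worker and bucket names are made up):

    require 'sidekiq'
    require 'aws-sdk-s3'
    require 'tempfile'

    class AttachPhotoWorker
      include Sidekiq::Worker

      def perform(photo_id, temp_key)
        s3    = Aws::S3::Client.new
        photo = Photo.find(photo_id)   # Photo has `has_attached_file :image`

        Tempfile.create(['upload', File.extname(temp_key)]) do |tmp|
          s3.get_object(bucket: ENV['S3_TEMP_BUCKET'], key: temp_key, response_target: tmp.path)
          File.open(tmp.path, 'rb') do |file|
            photo.image = file   # Paperclip runs its processors and re-uploads
            photo.save!
          end
        end

        s3.delete_object(bucket: ENV['S3_TEMP_BUCKET'], key: temp_key)
      end
    end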
Since Heroku has a read-only filesystem, I can't use Paperclip to store even a small quantity of files on the server. Database image storage is an option, but not particularly ideal, since it may crank my client's DB size up from a few hundred KB to over the 5 MB 'free' shared DB limit (depending on the size of the images).
That leaves Amazon S3 as a likely solution. I understand that Heroku is hosted on EC2 (I believe?). Amazon's pricing wording was a little confusing when referring to S3-EC2 file transfers. If I have my client set up an S3 account and let them do file transfers to and from there, what is the pricing going to look like?
Is it cheaper from an S3 point of view to both upload and download data in the Rails controllers and then feed the data to the browser using send_file? Or would it make more sense to just link straight to the image or PDF from the browser as normal?
Would my client have to pay anything at all, since Heroku is hosted on Amazon? I was looking for other questions related to this, but there weren't any really straight answers concerning which parts of the file transfer would be charged for.
I guess the storage would cost a little (hardly anything), but what about the bandwidth? Thanks :)
"Is it cheaper from an S3 point of view to both upload and download data in the Rails controllers and then feed the data to the browser using send_file? Or would it make more sense to just link straight to the image or PDF from the browser as normal?"
From an S3 standpoint, yes, this would be free, because Heroku would be covering your transfer costs. HOWEVER: Heroku only lets a request run for 30 seconds, and during that time other clients won't be able to load the site, so this is really a terrible idea. Your best bet is to serve the files out of S3 directly, in which case your customer pays for the transfer between S3 and the end user.
Any interaction you have with the file from Heroku (i.e. metadata and whatnot) will be free because it is EC2-to-S3 traffic.
In most cases, your pricing would be identical to what it would be if you were not using Heroku. The only case where this would change is if your app is constantly accessing the data directly on S3 (to read metadata or load files).
You can use Paperclip on Heroku, just not the local filesystem for storage. Fortunately, Paperclip can use S3 for storage; Heroku has a tech article here that covers it.
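A minimal sketch of what that looks like (the model, attachment, and env var names are mine):

    class Photo < ActiveRecord::Base
      has_attached_file :image,
        storage: :s3,
        s3_credentials: {
          bucket: ENV['S3_BUCKET'],
          access_key_id: ENV['AWS_ACCESS_KEY_ID'],
          secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
        },
        path: 'photos/:id/:style/:filename'

      validates_attachment_content_type :image, content_type: %r{\Aimage/.*\z}
    end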
Also, when an uploaded asset is displayed on a page (look up asset_host), the image is loaded directly from your S3 bucket's URL, so you will pay Amazon for the GET request for the image and the data transfer involved, as well as for storing the assets on S3. Have you looked at the S3 calculator to get indicative costs?