Uploading files to EC2, first to an EBS volume, then moving to S3

http://farm8.staticflickr.com/7020/6702134377_cf70482470_z.jpg
OK, sorry for the terrible drawing, but it seemed a better way to organize my thoughts and convey them. I have been wrestling for a while with how to create an optimal, decoupled, easily scalable system for uploading files to a web app on AWS.
Uploading directly to S3 would work, except that the files need to be instantly accessible to the uploader for manipulation; once manipulated, they can go to S3, where they will be served to all instances.
I played with the idea of creating a SAN with something like GlusterFS, then uploading directly to that and serving from it. I have not ruled it out, but from various sources the reliability of this solution might be less than ideal (if anyone has better insight on this I would love to hear it). In any case, I wanted to formulate a more "out of the box" (in the context of AWS) solution.
So, to elaborate on the diagram: I want the file to be uploaded to the local filesystem of whichever instance it happens to reach, which is an EBS volume. The storage location of the file would not be served to the public (e.g. /tmp/uploads/), but it could still be accessed by the instance through a readfile() operation in PHP, so that the user could see and manipulate it right after uploading. Once the user is finished manipulating the file, a message to move it to S3 could be queued in SQS.
My question is: once I save the file "locally" on an instance (which could be any instance, due to the load balancer), how can I record which instance it is on (in the DB) so that subsequent requests through PHP to read or move the file will find it?
If anyone with more experience in this has some insight I would be very grateful. Thanks.

I have a suggestion for a different design that might solve your problem.
Why not always write the file to S3 first? And then copy it to the local EBS file system on whichever node you're on while you're working on it (I'm not quite sure what manipulations you need to do, but I'm hoping it doesn't matter). When you're finished modifying the file, simply write it back to S3 and delete it from the local EBS volume.
In this way, none of the nodes in your cluster need to know which of the others might have the file because the answer is it's always in S3. And by deleting the file locally, you get a fresh version of the file if another node updates it.
Another thing you might consider, if it's too expensive to copy the file from S3 every time (it's too big, or you don't like the latency), is turning on session affinity in the load balancer (AWS calls it sticky sessions). This can be handled by your own cookie or by the ELB. Then subsequent requests from the same browser come to the same cluster node. Simply check the modified time of the file on the local EBS volume against the S3 copy, and replace the local copy if the S3 version is more recent. That way you get to take advantage of the local EBS filesystem while the file is being worked on.
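Here is a minimal Ruby sketch of that check, assuming the aws-sdk-s3 gem (your question is PHP, but the equivalent calls exist in the AWS SDK for PHP); the bucket and key names are placeholders:

    require 'aws-sdk-s3'

    # Refresh the local working copy only when S3 has a newer version
    # (or we have no local copy at all); S3 stays the source of truth.
    def fetch_working_copy(bucket:, key:, local_path:)
      obj = Aws::S3::Resource.new(region: 'us-east-1').bucket(bucket).object(key)
      if !File.exist?(local_path) || obj.last_modified > File.mtime(local_path)
        obj.download_file(local_path)
      end
      local_path
    end

    # When the user is done manipulating the file, push it back and clean up:
    #   obj.upload_file(local_path)
    #   File.delete(local_path)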
Of course there are a bunch of things I don't get about your system. Apologies for that.

Related

Is it not preferred to use the :local/:disk option in a production environment when using ActiveStorage? Will it prevent me from backing up the files?

I am working on a project using Rails as my backend API server. Uploading clients' files will be one of the most important parts of this application (it is something like a CRM/ERP system). However, my client prefers storing all the data and files on their own server due to the security and privacy concerns of their clients.
While reading the ActiveStorage documentation, however, it sounds like the :disk option is only meant for the test and development environments. I understand that using cloud storage like S3 would benefit scalability and backups and be more secure and flexible for web development, but, you know, client requirements.
1) Therefore, I would like to know: is it not preferred to use :disk in any production environment? What are the cons I may be missing?
Also, will it be hard for me to back up the files? I saw that in the /storage path the files are not saved under their original names.
My guess is that I could back up the whole site by doing a pg_dump and a clone of the entire site directory, including /storage (those files will be gitignored, so I need to back them up myself and do some git clone/git pull work during recovery or a server transition). Will this workflow work flawlessly?
2) What should be the actual backup and recover flow if I use :disk option in ActiveStorage?
Thanks for your help; I appreciate any advice!
Disk and local are indeed not recommended for production.
If you lose the table contents or some of the files in storage/ you may not be able to recover your data.
The growing storage/ directory will make it difficult to move your application somewhere else as you'd have to copy all of the content along with the code.
It will also make it difficult to scale horizontally, as the storage/ directory must be present on all instances of your application and always in sync. You may counter this by setting up an NFS share somewhere and mounting it under storage/, but that can come with reliability issues: a timeout or permissions error when writing a file, for instance, will create the ActiveStorage table entry without the associated file, leading to lots of annoying errors.
I believe it may also be rather difficult to make incremental backups: you'd have to dump the table and zip it along with all of the storage/ files, and if something changes while you're backing up or restoring, you'll experience all kinds of errors.
None of this is impossible to work around, just rather impractical.
You may want to check out MinIO or a similar application. It gives you ActiveStorage support (via the S3 service) with none of the S3 costs or data privacy concerns. Just drop it on a Docker instance somewhere on your network, set up persistence and backups/RAID, and you're pretty much done.
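Wiring MinIO into ActiveStorage is just the regular S3 service pointed at your own endpoint. A minimal sketch, where the endpoint, bucket, and env var names are assumptions for your setup:

    # config/storage.yml
    minio:
      service: S3
      endpoint: http://minio.internal:9000        # your MinIO server
      access_key_id: <%= ENV["MINIO_ACCESS_KEY_ID"] %>
      secret_access_key: <%= ENV["MINIO_SECRET_ACCESS_KEY"] %>
      region: us-east-1                            # the SDK requires a region value
      bucket: myapp-uploads
      force_path_style: true                       # MinIO serves path-style URLs

    # config/environments/production.rb
    #   config.active_storage.service = :minio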

Heroku erases images each time the application shuts down and restarts after being inactive for x minutes?

I saw this question: After git push heroku - uploaded files on Heroku are lost. It says:
each time the application shuts down and restarts (after being inactive for x minutes), your application is recreated and all stored data is lost.
Right now I have users who can upload two photos. I get email confirmations for new users, so I can check that users registered and uploaded photos 4 and 14 hours ago.
I made my last commit and pushed it to Heroku around 19 hours ago, and these 4 images that new users uploaded are lost now. But I can see images if I register a user right now. So it seems to be true that my app was inactive for x minutes, then restarted and deleted the images.
I read some questions like this one: [Rails] Images erased after a new commit on heroku.
There it says that I should use an external service like AWS S3 (I have no idea what that is, how much it will cost, or how to connect to it).
So is it really true? What are my other options? Maybe I should simply use DigitalOcean (won't there be the same problem?) or something else. Will this problem continue on a paid account?
I use Rails and upload files using the CarrierWave gem. I can't post code here because I am writing from another laptop.
Heroku's filesystem is read-only. You can't expect anything you upload to persist there; you need to use an external storage mechanism, something like Amazon's S3 for example.
See the links for more details.
https://devcenter.heroku.com/articles/s3
https://devcenter.heroku.com/articles/dynos#isolation-and-security
The filesystem that your Heroku instance runs on is not read-only, but it is transient - i.e. files that you store there will not persist after an instance restart.
This is a deliberate design decision by Heroku, to force you to think about where you store your data and how it impacts on scalability.
You're asking about Digital Ocean - I haven't used them but I assume from your question that they allow you to store to a persistent local filesystem.
The question that you then have to ask is: what happens if you want to run more than one instance of your app? Do they share the same persistent filesystem? Can they access each other's files? How do you handle file locking to avoid race-conditions when several app instances are using the same filesystem?
Heroku's model forces you to either put stuff in a database or store it using some external service. Generally, any of these sorts of systems will be reasonably scalable - you can have multiple Heroku instances (perhaps running on different machines, different data-centers, etc), and they will all interact nicely.
I do agree that for a simple use-case where you just want to run a single instance of an app during development it can be inconvenient, but I think this is the reasoning behind it - to force you to design this sort of thing in, rather than developing your whole app on the assumption that it can store everything locally and then find out later that you need to completely redesign to make it scalable.
What you're looking at is something called the ephemeral file system in Heroku:
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted
In short, it means that any files you upload will only last for the time the dyno is running. When the dyno shuts down, the files will be removed unless they were part of the local git repo.
The way to resolve the issue is to store the files on a third-party service - typically S3. This will store the files on a system independent of Heroku.
Both Paperclip and CarrierWave support S3 (Simple Storage Service) through a gem called fog. S3 gives you a free tier, allowing you to store a certain amount of data (I've forgotten how much) at no cost.
I would strongly recommend setting up an S3 account and linking it to your Heroku app. This way, any files you upload will be stored off-site.
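As a rough sketch of what that wiring looks like with CarrierWave and fog (assuming the carrierwave and fog-aws gems; the bucket name and env vars are placeholders):

    # config/initializers/carrierwave.rb
    CarrierWave.configure do |config|
      config.fog_provider = 'fog/aws'
      config.fog_credentials = {
        provider:              'AWS',
        aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
        aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
        region:                'us-east-1'
      }
      config.fog_directory = ENV['S3_BUCKET']   # the bucket uploads go to
    end

    # In your uploader class, switch from local storage to fog:
    #   storage :fog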

Heroku - hosting files and static files for my project

I want to use Heroku to host my Ruby on Rails project. It will involve lots of file uploads, mostly images. Can I host and serve those static files on Heroku, or is it wiser to use a service like Amazon S3? What is your opinion on that approach? What are my options for hosting those static files on Heroku?
To answer your question, Heroku's "ephemeral filesystem" will not serve as a storage for static uploads. Heroku is an app server, period. You have to plug into data storage elsewhere.
From Heroku's spec:
Ephemeral filesystem
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted. For example, this occurs any time a dyno is replaced due to application deployment and approximately once a day as part of normal dyno management.
Heroku is a great option for RoR in my opinion. I have used it personally and ran into the problem that has been mentioned here already (you can't store anything in Heroku's filesystem). I therefore used S3, following this tutorial: https://devcenter.heroku.com/articles/s3
Hope it helps!
PS: Make sure not to store the S3 credentials in any file; instead create config vars as described here: https://devcenter.heroku.com/articles/config-vars
I used to have them in a file, and long story short, someone gained access to my Amazon account, which was billed several thousand dollars (in just a couple of days). The Amazon staff were kind enough to waive those charges. Just something to keep in mind.
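In practice that means setting the keys as Heroku config vars and reading them from ENV; a small sketch (the variable names are just the conventional ones):

    # Set the credentials once, outside of your code:
    #   heroku config:set AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx

    # Then, in an initializer, read them from the environment:
    require 'aws-sdk-s3'

    Aws.config.update(
      credentials: Aws::Credentials.new(
        ENV['AWS_ACCESS_KEY_ID'],
        ENV['AWS_SECRET_ACCESS_KEY']
      )
    )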
As pointed out, you shouldn't do this with Heroku for the specific reason of ephemeral storage, but to answer your question more broadly storing user-uploaded content on a local filesystem on any host has a few inherent issues:
You can quickly run out of local storage space on the disk
You can lose all your user-uploaded content if the hardware crashes / the directory gets deleted / etc.
Heroku, EC2, Digital Ocean, etc. all provide servers that don't come with any guarantee of persistence (ephemeral storage especially). This means that your instance may shut down at any point, be swapped out, etc.
You can't scale your application horizontally. The files on one server won't be accessible from another (or dyno, or whatever your provider of choice calls them).
S3, however, is such a widely-used solution because:
It's incredibly cheap (we store 20 TB of data for something like $500 a month)
Your uploaded files aren't at risk of disappearing due to hardware failure
Your uploaded files are decoupled from the application, meaning any server / dyno / whatever could access them.
You can always publish your S3 buckets through CloudFront if you need a CDN, without any extra effort.
And certainly many more reasons. The most important thing to remember, is that by storing uploaded content locally on a server, you put yourself in a position where you can't scale horizontally, regardless of how you're hosting your app.
It is wiser to host files on S3, and actually it is even wiser to use direct uploads to S3.
You can read the arguments, for example, here.
Main point: Heroku is a really, really expensive thing.
So you need to save every bit of resources you have. The only way to store static files on Heroku is to have a separate dyno running an app server for you, and static files don't need an app server. So it's just a waste of CPU time (and you should read that as "a waste of a lot of my money").
Also, uploading a huge number of huge files will quickly push you past the memory quota (read that as "will waste even more of my money, because I will need to run more dynos"). So it's best to upload files directly to S3.
Heroku is great for hosting your app. Use the tool that best suits the task.
UPD: I forgot to mention that not only will you need a separate dyno for static assets, your static assets will also die every time that dyno is restarted.
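A minimal sketch of what "upload directly to S3" can look like on the Rails side, assuming the aws-sdk-s3 gem and a placeholder bucket name: generate a presigned URL and let the browser PUT the file straight to S3, so the dyno never touches the bytes.

    require 'aws-sdk-s3'
    require 'securerandom'

    s3  = Aws::S3::Resource.new(region: 'us-east-1')
    obj = s3.bucket('my-uploads-bucket').object("uploads/#{SecureRandom.uuid}")

    # Short-lived URL the client-side JS can PUT the file to directly.
    url = obj.presigned_url(:put, expires_in: 15 * 60)

    # Hand `url` (and obj.key) to the browser; after the upload completes,
    # store obj.key in your database instead of the file itself.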
I had the same problem. I solved it by adding all my images to my Rails app. I then reference the images using links like
myapp.herokuapp.com/assets/image1.jpg
I might add the link from the CMS. It might not be the best option, but it works.

What is the best approach to handle large file uploads in a rails app?

I am interested in understanding the different approaches to handling large file uploads in a Rails application, for 2-5 GB files.
I understand that in order to transfer a file of this size it will need to be broken down into smaller parts. I have done some research, and here is what I have so far.
Server-side configuration will be required to accept large POST requests, and probably a 64-bit machine to handle anything over 4 GB.
AWS supports multipart upload.
The HTML5 FileSystem API has a persistent uploader that uploads the file in chunks.
A BitTorrent library, although this requires a torrent client such as Transmission, which is not ideal.
Can all of these methods be resumed like FTP? The reason I don't want to use FTP is that I want to keep everything in the web app, if possible. I have used CarrierWave and Paperclip, but I am looking for something that can be resumed, as uploading a 5 GB file could take some time!
Of the approaches I have listed, I would like to understand which have worked well, and whether there are other approaches I may be missing. No plugins if possible; I would rather not use Java applets or Flash. Another concern is that these solutions hold the file in memory while uploading; that is also a constraint I would rather avoid if possible.
I've dealt with this issue on several sites, using a few of the techniques you've illustrated above and a few that you haven't. The good news is that it is actually pretty realistic to allow massive uploads.
A lot of this depends on what you actually plan to do with the file after you have uploaded it... The more work you have to do on the file, the closer you are going to want it to your server. If you need to do immediate processing on the upload, you probably want to do a pure rails solution. If you don't need to do any processing, or it is not time-critical, you can start to consider "hybrid" solutions...
Believe it or not, I've actually had pretty good luck just using mod_porter. Mod_porter makes apache do a bunch of the work that your app would normally do. It helps not tie up a thread and a bunch of memory during the upload. It results in a file local to your app, for easy processing. If you pay attention to the way you are processing the uploaded files (think streams), you can make the whole process use very little memory, even for what would traditionally be fairly expensive operations. This approach requires very little actual setup to your app to get working, and no real modification to your code, but it does require a particular environment (apache server), as well as the ability to configure it.
I've also had good luck using jQuery-File-Upload, which supports good stuff like chunked and resumable uploads. Without something like mod_porter, this can still tie up an entire thread of execution during upload, but it should be decent on memory, if done right. This also results in a file that is "close" and, as a result, easy to process. This approach will require adjustments to your view layer to implement, and will not work in all browsers.
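For reference, the server side of a chunked upload can be quite small. A rough sketch of a Rails controller that appends each chunk to a temp file (the controller, params, and upload_id convention here are assumptions for illustration, not jQuery-File-Upload's own API):

    class ChunkedUploadsController < ApplicationController
      def create
        chunk  = params[:file]   # an ActionDispatch::Http::UploadedFile for this chunk
        # In real code, sanitize upload_id so it can't escape the uploads directory.
        target = Rails.root.join('tmp', 'uploads', params.require(:upload_id))

        FileUtils.mkdir_p(File.dirname(target))
        # Append in binary mode; the client sends chunks sequentially.
        File.open(target, 'ab') { |f| f.write(chunk.read) }

        head :ok
      end
    end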
You mentioned FTP and BitTorrent as possible options. These are not as bad as you might think, as you can still get the files pretty close to the server. They are not even mutually exclusive, which is nice, because (as you pointed out) they do require an additional client that may or may not be present on the uploading machine. The way this works is, basically, you set up an area for them to dump to that is visible to your app. Then, if you need to do any processing, you run a cron job (or whatever) to monitor that location for uploads and trigger your server's processing method. This does not get you the immediate response the methods above can provide, but you can set the interval small enough to get pretty close. The only real advantage to this method is that the protocols used are better suited to transferring large files; the additional client requirement and fragmented process usually outweigh any benefit from that, in my experience.
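A tiny sketch of that watcher idea, e.g. a script run from cron; the directory and job class names are hypothetical:

    DROP_DIR = '/var/uploads/incoming'

    Dir.glob(File.join(DROP_DIR, '*')).each do |path|
      next unless File.file?(path)
      # Skip files modified in the last minute; they may still be transferring.
      next if Time.now - File.mtime(path) < 60

      ProcessUploadJob.perform_later(path)   # hypothetical ActiveJob that does the processing
    end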
If you don't need any processing at all, your best bet may be to simply go straight to S3 with them. This solution falls down the second you actually need to do anything with the files other than serve them as static assets...
I do not have any experience using the HTML5 FileSystemAPI in a rails app, so I can't speak to that point, although it seems that it would significantly limit the clients you are able to support.
Unfortunately, there is not one real silver bullet - all of these options need to be weighed against your environment in the context of what you are trying to accomplish. You may not be able to configure your web server or permanently write to your local file system, for example. For what it's worth, I think jQuery-File-Upload is probably your best bet in most environments, as it only really requires modification to your application, so you could move an implementation to another environment most easily.
This project is a new protocol over HTTP to support resumable uploads for large files. It bypasses Rails by providing its own server.
http://tus.io/
http://www.jedi.be/blog/2009/04/10/rails-and-large-large-file-uploads-looking-at-the-alternatives/ has some good comparisons of the options, including some outside of Rails.
Please go through it; it was helpful in my case.
Another site worth looking at is:
http://bclennox.com/extremely-large-file-uploads-with-nginx-passenger-rails-and-jquery
Please let me know if any of this does not work out
I would bypass the Rails server and post your large files (split into chunks) directly from the browser to Amazon Simple Storage Service. Take a look at this post on splitting files with JavaScript. I'm a little curious how performant this setup would be, and I feel like tinkering with it this weekend.
I think that Brad Werth nailed the answer
Just one approach could be to upload directly to S3 (and even if you do need some reprocessing afterwards, you could theoretically use AWS Lambda to notify your app ... but to be honest I'm just guessing here; I'm about to solve the same problem myself and will expand on this later).
http://aws.amazon.com/articles/1434
If you use CarrierWave:
https://github.com/dwilkie/carrierwave_direct_example
Uploading large files on Heroku with Carrierwave
Let me also pin down a few options that might help others looking for a real-world solution.
I have a Rails 6 app with Ruby 2.7, and the main purpose of this app is to create a Google Drive-like environment where users can upload images and videos and then process them again for higher quality.
Obviously, we did try local processing using Sidekiq background jobs, but it was overwhelming for large uploads of 1 GB and more.
We did try tus.io, but personally I think it is not as easy to set up as jQuery File Upload.
So we experimented with AWS, moving through the steps listed below, and it worked like a charm: uploading directly to S3 from the browser.
Using a React dropzone uploader, we upload multiple files to S3.
We set up an AWS Lambda on an input bucket, triggered by all object-creation events on that bucket.
This Lambda converts the file, uploads the reprocessed version to another (output) bucket, and notifies us via AWS SNS so we can keep track of what worked and what failed.
On the Rails side, we just use the new output bucket dynamically and serve it through an AWS CloudFront distribution.
You can check the AWS notes on MediaConvert for a step-by-step guide; they also have a well-written GitHub repo for all sorts of experimentation.
So, from the user's point of view, they can upload one large file (with Transfer Acceleration enabled on S3), the React library shows upload progress, and once the file is uploaded, a Rails callback API verifies its existence in the S3 bucket under a key like mybucket/user_id/file_uploaded_slug, and then it is confirmed to the user through a simple flash message.
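That existence check is a one-liner with the aws-sdk-s3 gem; a sketch with placeholder bucket and key names:

    require 'aws-sdk-s3'

    def upload_confirmed?(user_id, file_uploaded_slug)
      Aws::S3::Resource.new(region: 'us-east-1')
                       .bucket('mybucket')
                       .object("#{user_id}/#{file_uploaded_slug}")
                       .exists?   # HEAD request; true once the browser upload landed
    end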
You can also configure Lambda to notify the end user of a successful upload/encoding, if needed.
Refer this documentation - https://github.com/mike1011/aws-media-services-vod-automation/tree/master/MediaConvert-WorkflowWatchFolderAndNotification
Hope it helps someone here.

Is it an issue to create a directory for each file upload, in a web application on linux/unix?

I am doing file uploads for a web application (running on Unix/Linux). I'm wondering whether there would be a concern if I planned to create a new directory for each file upload. This is the out-of-the-box approach of the Ruby on Rails plugin "paperclip". I'm debating what the trade-offs are, or whether perhaps it's just not a concern when deploying in a Linux/Unix environment.
The options would seem to be:
One folder per file attachment - per how paperclip seems to work out of the box
One folder per user perhaps (i.e. if the web service has multiple users with their own accounts), and then one would need to add some uniqueness to the filename (perhaps the model ID)
Put all attachments in one folder - but this is probably going too far the other way
Question - Should I be concerned about the number of directories being created? Is this an issue for the OS if the service became popular? For a website that allows users with their own separate accounts to upload files, what structure might be good for storing them? (I guess I've discounted the concept of storing files in MySQL.)
Thanks
Assuming an ext3-formatted drive under Linux (the most common):
From http://en.wikipedia.org/wiki/Ext3:
"There is a limit of 31998 sub-directories per one directory, stemming from its limit of 32000 links per inode.[13]"
So, if you hit that limit of roughly 32k uploads in one directory, which isn't that high, your application will fail.
Not as such, but having gazillions of folders in one directory (or the same for files) isn't recommended (it's a real hit to speed).
Reason: c-style strings
A good solution would be to store things hierarchically, something like:
/path/to/usernamefirstletter/username/year/month/file
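A small Ruby sketch of building that kind of path (assuming the username and filename have already been sanitized):

    require 'date'

    def upload_path(root, username, filename)
      today = Date.today
      File.join(root,
                username[0].downcase,          # first-letter bucket keeps any one directory small
                username,
                today.year.to_s,
                format('%02d', today.month),
                filename)
    end

    # upload_path('/srv/uploads', 'alice', 'avatar.png')
    #   => "/srv/uploads/a/alice/<year>/<month>/avatar.png"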
If you have a separate partition for the directory where the new files/directories get created, I'd say it's not a problem. It can become a problem if you don't, since you can run out of inodes and/or free disk space, which can be bad.
Using a separate partition would (in the case of a DoS attack) only stop your application from working correctly; the system won't get hurt in any way.

Resources