serving premium videos using ruby on rails - ruby-on-rails

I am stuck with a performance related issue. I have a ROR application running on VPS. I am posting my question here after having spent lot of time on internet regarding this issue and was not able to find reliable solution.
My ROR application has nearly 300 premium videos and 200 pdf's, after registration user is allowed to watch Free Videos. If the user upgrades account by making payment, then he/she can watch premium videos.
Currently I am serving video file using send file method, below is the code.
SEND_FILE_METHOD = :default
def download
head(:not_found) and return if (track = Video.find_by_id(params[:id])).nil?
#head(:forbidden) and return unless track.downloadable?(current_user)
path = track.video.path(params[:style])
head(:bad_request) and return unless File.exist?(path) && params[:format].to_s == File.extname(path).gsub(/^\.+/, '')
contenttype = MIME::Types.type_for(path).first.content_type # => "image/gif"
send_file_options = { :type => contenttype }
case SEND_FILE_METHOD
when :apache then send_file_options[:x_sendfile] = true
when :nginx then head(:x_accel_redirect => path.gsub(Rails.root, ''), :content_type => send_file_options[:type]) and return
end
send_file(path, send_file_options)
end
My question here is,
what is this right way to serve premium video's via rails application.(My Client VPS has only 1 GB RAM ) and videos are saved in FILE SYSTEM on VPS. My concern is that, if my application gets more than a 100 request at a time, ROR app may fail to server request. I also thought of placing it in public folder, but the videos are only for Paid user, so I cant put them in public videos.
Any help or link to solution is highly appreciated..
Many Thanks :)

If you really want to support serving static content at high volumes, don't put the load on your web server, especially if it's not that strong.
Keep your server resources for stuff that are your core business logic (like managing accounts and permissions) and delegate serving static content to other dedicated services.
I'm only familiar with the AWS suite so I can recommend the following:
For the PDFs (and other files) - Store them on AWS S3 and have your app point to that location. You can create an address within your domain, and can generate separate links for different customers (with expiration) so you can control who accesses the files and for how long.
For the videos - If you serve them as regular files (e.g. download an mp4) - I'd go with the same S3 solution. If you plan to stream them to clients - look at CloudFront Streaming
CloudFront (AWS implementation of a CDN) in general is a good idea if you have clients from different geographies and you want to serve their content from the closest location to them.
As for prices - you can see at the products' pages. It's pretty cheap in my opinion and definitely worth the money considering the scalability it gives you. You might have some learning-curve at the beginning to start working with their APIs but it's really not that difficult and plenty of tutorials are available.
Equivalent solutions (from other vendors) do exist. I'd recommend looking around to see what fits you best.
Good luck!

If possible, you should consider using something similar to S3 or a CDN if possible. These are built to get files to customers quickly and will result in zero load on your server once you've handed the file off to them.
If you want to stick with serving files from your vps then you should look at X-Sendfile (apache) or X-Accel-Redirect (nginx). With these the file is still served by your VPS but it is handled by the web server rather than your rails code.
Rails will use generate these headers for you when you use send_file, but for this to work you need to configure your web server appropriately.
On apache this means installing the sendfile module, on nginx you have to configure which bits of the filesystem are accessible. The code for Rack::Sendfile (the mechanism rails uses) explains how to do this

Related

Web application security concerns with user-uploaded files on Amazon S3?

Background
I have a web application where users may upload a wide variety of files (e.g. jpg, png, r, csv, epub, pdf, docx, nb, tex, etc). Currently, we whitelist exactly which files types are a user may upload. This limitation is sometimes annoying for users (i.e. because they must zip disallowed files, then upload the zip) and for us (i.e. because users write support asking for additional file types to be whitelisted).
Ideal Solution
Ideally, I'd love to whitelist files more aggressively. Specifically, I'd would like to (1) figure out which file types may be trusted and (2) whitelist them all. Having a larger whitelist would be more convienient for users and it reduce the number of support tickets (if only very slightly).
What I Know
I've done a few hours of research and have identified common problems (e.g. path traversal, placing assets in root directory, htaccess vulnerabilities, failure to validate mime type, etc). While this research has been interesting, my understanding is that many of these issues are moot (or considerably mitigated) if your assets are stored on Amazon S3 (or a similar cloud storage service) – which is how most modern web application manage user-uploaded files.
Hasn't this question already been asked a zillion times?!
Please don't mistake this as a general "What are the security risks of user-uploaded content?" question. There are already many questions like that and I don't want to rehash that discussion here.
More specifically, my question is, "What risks, if any, exist given a conventional / modern web application setup?" In other words, I don't care about some old PHP app or vulnerabilities related to IE6. What should I be worried about assuming files are stored in a cloud service like AmazonS3?
Context about infrastructure / architecture
So... To answer that, you'll probably need more context about my setup. That said, I suspect this is a relatively common setup and therefore hope the answers will be broadly useful to anyone writing a modern web application.
My stack
Ruby on Rails application, hosted on Heroku
Users may upload a variety of files (via Paperclip)
Server validates both mime type and extension (against a whitelist)
Files are stored on Amazon S3 (with varying ACL permissions)
When a user uploads a file...
I upload the file directly on AS3 in a tmp folder (hasn't touched my server yet)
My server then downloads the file from the tmp folder on AS3.
Paperclip runs validations and executes any processing (e.g. cutting thumbnails of images)
Finally, Paperclip places the file(s) back on AS3 in their new, permanent location.
When a user downloads a file...
User clicks to download a file which sends a request to my API (e.g. /api/atricle/123/download)
Internally, my API reads the file from AS3 and then serves it to the user (as content type attachment)
Thus the file does briefly pass through my server (i.e. not merely a redirect)
From the user's perspective, the file is served from my API (i.e. the user has no idea the file live on AS3)
Questions
Given this setup, is it safe to whitelist a wide range of file types?
Are there some types of files that are always best avoided (e.g. JS files)?
Are there any glaring flaws in my setup? I suspect not, but if so, please alert me!

PDF caching on heroku with cloudflare

I'm having a problem getting the caching I need to work using CloudFlare.
We use CloudFlare for caching all our assets on S3 which works 100% using a separate subdomain cdn
We also use CloudFlare for our main site (hosted on Heroku) as well, e.g. www
My problem is I can't get CloudFlare to cache PDFs that are generated from our Rails app. I'm using the WickedPDF gem to dynamically generate certain PDFs for invoices, etc. I don't want to upload these as files to say S3 but we would like to have CloudFlare cache these so they don't get generated each and every time, as the time spent generating these PDFs is a little intensive.
CloudFlare is turned on and is "accelerating" for the subdomain in question and we're using SSL, but PDFs never seem to cache properly.
Is there something else we need to do to ensure these get cached? Or maybe there's another solution that would work for Heroku? (eg we can't use Page caching since it relies on the filesystem) I also checked the WickedPDF documentation so see if we could do anything else, but found nothing about expire controls.
Thanks,
We should actually cache it as long as the resources are on-domain & not being delivered through a third-party resource in some way.
Keep in mind:
1. Our caching depends on the number of requests for the resources (at least three).
2. Caching is very much data center dependent (in other words, if your site receives a lot of traffic at a data center it is going to be cached; if your site doesn't get a lot of traffic in another data center it may not cache).
I would open a support ticket if you're still having issues.

Rails implementation for securing S3 documents

I would like to protect my s3 documents behind by rails app such that if I go to:
www.myapp.com/attachment/5 that should authenticate the user prior to displaying/downloading the document.
I have read similar questions on stackoverflow but I'm not sure I've seen any good conclusions.
From what I have read there are several things you can do to "protect" your S3 documents.
1) Obfuscate the URL. I have done this. I think this is a good thing to do so no one can guess the URL. For example it would be easy to "walk" the URL's if your S3 URLs are obvious: https://s3.amazonaws.com/myapp.com/attachments/1/document.doc. Having a URL such as:
https://s3.amazonaws.com/myapp.com/7ca/6ab/c9d/db2/727/f14/document.doc seems much better.
This is great to do but doesn't resolve the issue of passing around URLs via email or websites.
2) Use an expiring URL as shown here: Rails 3, paperclip + S3 - Howto Store for an Instance and Protect Access
For me, however this is not a great solution because the URL is exposed (even for just a short period of time) and another user could perhaps in time reuse the URL quickly. You have to adjust the time to allow for the download without providing too much time for copying. It just seems like the wrong solution.
3) Proxy the document download via the app. At first I tried to just use send_file: http://www.therailsway.com/2009/2/22/file-downloads-done-right but the problem is that these files can only be static/local files on your server and not served via another site (S3/AWS). I can however use send_data and load the document into my app and immediately serve the document to the user. The problem with this solution is obvious - twice the bandwidth and twice the time (to load the document to my app and then back to the user).
I'm looking for a solution that provides the full security of #3 but does not require the additional bandwidth and time for loading. It looks like Basecamp is "protecting" documents behind their app (via authentication) and I assume other sites are doing something similar but I don't think they are using my #3 solution.
Suggestions would be greatly appreciated.
UPDATE:
I went with a 4th solution:
4) Use amazon bucket policies to control access to the files based on referrer:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingBucketPolicies.html
UPDATE AGAIN:
Well #4 can easily be worked around via a browsers developer's tool. So I'm still in search of a solid solution.
You'd want to do two things:
Make the bucket and all objects inside it private. The naming convention doesn't actually matter, the simpler the better.
Generate signed URLs, and redirect to them from your application. This way, your app can check if the user is authenticated and authorized, and then generate a new signed URL and redirect them to it using a 301 HTTP Status code. This means that the file will never go through your servers, so there's no load or bandwidth on you. Here's the docs to presign a GET_OBJECT request:
https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Presigner.html
I would vote for number 3 it is the only truly secure approach. Because once you pass the user to the S3 URL that is valid till its expiration time. A crafty user could use that hole the only question is, will that affect your application?
Perhaps you could set the expire time to be lower which would minimise the risk?
Take a look at an excerpt from this post:
Accessing private objects from a browser
All private objects are accessible via
an authenticated GET request to the S3
servers. You can generate an
authenticated url for an object like
this:
S3Object.url_for('beluga_baby.jpg', 'marcel_molina')
By default
authenticated urls expire 5 minutes
after they were generated.
Expiration options can be specified
either with an absolute time since the
epoch with the :expires options, or
with a number of seconds relative to
now with the :expires_in options:
I have been in the process of trying to do something similar for quite sometime now. If you dont want to use the bandwidth twice, then the only way that this is possible is to allow S3 to do it. Now I am totally with you about the exposed URL. Were you able to come up with any alternative?
I found something that might be useful in this regard - http://docs.aws.amazon.com/AmazonS3/latest/dev/AuthUsingTempFederationTokenRuby.html
Once a user logs in, an aws session with his IP as a part of the aws policy should be created and then this can be used to generate the signed urls. So in case, somebody else grabs the URL the signature will not match since the source of the request will be a different IP. Let me know if this makes sense and is secure enough.

How to proxy files from S3 through rails application to avoid leeching?

In order to avoid hot-linking, S3 bandwidth leeching, etc I would like to make my bucket private and serve the files through a Rails app. Concept in general sounds very easy, but I am not entirely sure which approach would be the best for the situation.
I am using paperclip for general asset management. Is there any build-in way to achieve this type of proxy?
In general I can easily parse the url's from paperclip and point them back to my own controller. What should happen from this point? Should I simply use Net::HTTP to download the image, and then serve it with send_data? In between I want to log referer and set proper Control-Cache headers, since I have a reverse-proxy in front of the app. Is Net::HTTP + send_data resonable way in this case?
Maybe whole ideas is really bad for some reasons I am not aware at this moment? I general I believe that reveling the direct S3 links to public bucket is dangerous and yield in some serious problems in case of leeching / hot-linking...
Update:
If you have any other ideas which can reduce S3 bill and prevent hot-linking leeching in anyway please share, even if they are not directly related to Rails.
Use (a private bucket|private files) and use signed URLs to the files stored on S3.
The signature includes an expiration time (e.g. 10 minutes from now, whatever you would like to set), as well as a cryptographic hash. S3 will refuse to serve files if the signature is invalid, or if the expiration time has passed.
This is useful because only you can create valid URLs to your private files in S3, and you can control how long the URLs remain valid. This prevents leeching, because leechers can't sign their own URLs and, if they get a URL that you signed, that URL will expire very shortly and after that can not be used.
Since there wasn't a nuts and bolts answer above, here's a small code sample of how to stream a file that's stored on S3.
render :text => proc { |response, output|
AWS::S3::S3Object.stream(path, bucket) do |segment|
output.write segment
output.flush # not sure if this is needed
end
}
Depending on your webserver this may (mongrel) or may not (webrick) work, so don't get too frustrated if it doesn't stream in development.
Provide temporary pre-signed URLs:
def show
redirect_to Aws::S3::Presigner.new.presigned_url(
:get_object,
bucket: 'mybucket',
key: '/folder/file.pdf'
expires_in: 60)
end
S3 still distributes the content so you offload the work from Rails (which is very slow at it), handles HTTP caching, HEAD operations, and uses Amazon CDN.
I'd probably avoid to do this -- at least until I'd have no other choice.
You need to take into account that you'll probably also add to the bandwidth bill if you download the image each time. Also, by processing each image through a script you'll also need more CPU and RAM required to do this. Not the greatest outlook -- IMHO.
I would probably enable the access logs for Amazon S3 and write a tool small to analyze usage and change the permissions on the bucket/object in case usage is goes the roof. Run this as a cronjob every 10 minutes or so and you should be save?
You could also use s3stat. They also offer a free plan.
Edit: As per my recommendation for Varnish, I'm adding a link to a blog entry about preventing hotlinking using Varnish.

Why would you upload assets directly to S3?

I have seen quite a few code samples/plugins that promote uploading assets directly to S3. For example, if you have a user object with an avatar, the file upload field would load directly to S3.
The only way I see this being possible is if the user object is already created in the database and your S3 bucket + path is something like
user_avatars.domain.com/some/id/partition/medium.jpg
But then if you had an image tag that tried to access that URL when an avatar was not uploaded, it would yield a bad result. How would you handle checking for existence?
Also, it seems like this would not work well for most has many associations. For example, if a user had many songs/mp3s, where would you store those and how would you access them.
Also, your validations will be shot.
I am having trouble thinking of situations where direct upload to S3 (or any cloud) is a good idea and was hoping people could clarify either proper use cases, or tell me why my logic is incorrect.
Why pay for storage/bandwidth/backups/etc. when you can have somebody in the cloud handle it for you?
S3 (and other Cloud-based storage options) handle all the headaches for you. You get all the storage you need, a good distribution network (almost definitely better than you'd have on your own unless you're paying for a premium CDN), and backups.
Allowing users to upload directly to S3 takes even more of the bandwidth load off of you. I can see the tracking concerns, but S3 makes it pretty easy to handle that situation. If you look at the direct upload methods, you'll see that you can force a redirect on a successful upload.
Amazon will then pass the following to the redirect handler: bucket, key, etag
That should give you what you need to track the uploaded asset after success. Direct uploads give you the best of both worlds. You get your tracking information and it unloads your bandwidth.
Check this link for details: Amazon S3: Browser-Based Uploads using POST
If you are hosting your Rails application on Heroku, the reason could very well be that Heroku doesn't allow file-uploads larger than 4MB:
http://docs.heroku.com/s3#direct-upload
So if you would like your users to be able to upload large files, this is the only way forward.
Remember how web servers work.
Unless you're using a sort of async web setup like you could achieve with Node.JS or Erlang (just 2 examples), then every upload request your web application serves ties up an entire process or thread while the file is being uploaded.
Imagine that you're uploading a file that's several megabytes large. Most internet users don't have tremendously fast uplinks, so your web server spends a lot of time doing nothing. While it's doing all of that nothing, it can't service any other requests. Which means your users start to get long delays and/or error responses from the server. Which means they start using some other website to get the same thing done. You can always have more processes and threads running, but each of those costs additional memory which eventually means additional $.
By uploading straight to S3, in addition to the bandwidth savings that Justin Niessner mentioned and the Heroku workaround that Thomas Watson mentioned, you let Amazon worry about that problem. You can have a single-process webserver effectively handle very large uploads, since it punts that actual functionality over to Amazon.
So yeah, it's more complicated to set up, and you have to handle the callbacks to track things, but if you deal with anything other than really small files (and even in those cases), why cost yourself more money?
Edit: fixing typos

Resources