Rails ActiveStorage: how to avoid one redirect for each image? - ruby-on-rails

If you use ActiveStorage and you have a page with N images you get N additional requests to your Rails app (i.e. N redirects). That means wasting a lot of server resources if you have tens of images on a page.
I know that the redirect is useful for signed URLs. However I wonder why Rails does not precompute the final signed URL and embed that into the HTML page... In this way we could keep the advantages of signed URLs / protected files, without making N additional calls to the Rails server.
Is it possible to include the final URL / pre-signed URL of image variants directly in the HTML (thus avoiding the redirect)? Otherwise, why is that impossible?

After days of reasoning and testing, I am really excited about my final solution, which I explain below. This is an opinionated approach to images and may not represent the current Rails Way™️, however it has incredible advantages for websites that serve many public images, in particular:
When you serve a page with N images you don't get 1 + N requests to your app server, instead you get only 1 request for the page
The images are served through a CDN and this improves the loading time
The bucket is not completely public, instead it is protected by Cloudflare
The images are cached by Cloudflare, which greatly reduces your S3 bill
You greatly reduce the number of API requests (e.g. existence checks) made to S3
This solution does not require large changes to Rails, and thus it is straightforward to switch back to Rails default behavior in case of problems
Here's the solution:
Create an S3 bucket and configure it to host a public website (e.g. call it storage.example.com) - you can even disable public access at the bucket level and allow access only to the Cloudflare IPs using a bucket policy
Go to Cloudflare and configure a CNAME for storage.example.com that points to the bucket's S3 website endpoint; you need to use Flexible SSL (you can use a page rule for the subdomain); use page rules to set heavy caching: set Cache Everything and set a very long value (e.g. 1 year) for Browser Cache TTL and Edge Cache TTL
In your Rails application you can keep using private storage / ACLs, which is the default Rails behavior
In your Rails application call @post.image.variant(...).processed after every update or creation of @post; then in your views use 'https://storage.example.com/' + @post.image.variant(...).key (note that we don't call processed in the views, to avoid additional checks against S3); you can also have a rake task that calls processed on each object, in case you need to regenerate the variants; this works perfectly if you have only a few variants (e.g. one image / variant per post) that change infrequently (a sketch of this setup follows below)
Most of the above steps are optional, so you can combine them based on your needs.
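For concreteness, here is a minimal sketch of the model side of this setup. The Post model, the cover attachment name, the variant options and the storage.example.com host are illustrative assumptions, not requirements:

class Post < ApplicationRecord
  has_one_attached :cover

  THUMB = { resize_to_limit: [400, 400] }.freeze

  # Generate and upload the variant eagerly, so rendering a page never has to
  # hit S3 or go through the Rails redirect controller at request time.
  after_save :process_cover_variant

  # Build the final public URL served by Cloudflare in front of the bucket.
  def cover_thumb_url
    return unless cover.attached?
    "https://storage.example.com/#{cover.variant(THUMB).key}"
  end

  private

  def process_cover_variant
    cover.variant(THUMB).processed if cover.attached?
  end
end

In a view, image_tag @post.cover_thumb_url then emits a plain img tag pointing at the CDN, with no per-image request to the Rails app.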

You can use the service_url to create direct links to your resources.
We don't use Rails views in our project so my knowledge about the view layer is rusty. I think you could put it in a dedicated helper and then use it from your views.
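I haven't tested this against ActiveStorage recently, but a helper along these lines should work (the helper name and attachment are made up; note that service_url was renamed to url in Rails 6.1):

module AttachmentsHelper
  # Returns a signed URL pointing directly at the storage service (e.g. S3),
  # bypassing the Rails redirect controller entirely.
  def direct_attachment_url(attachment, expires_in: 5.minutes)
    attachment.service_url(expires_in: expires_in)
  end
end

Then in a view: <%= image_tag direct_attachment_url(@post.cover) %>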

Related

How to set up app to disallow Cloudfront from fetching anything?

I use Rails 5.2 and CloudFront for assets.
How to set up app to disallow Cloudfront from fetching anything except for assets?
CloudFront doesn't have an explicit way to "allow only" certain path prefixes, since requests will ultimately match the default * cache behavior if they don't match any others, but there are several ways of working around this, depending on the level of sophistication and complexity that suits your taste. All of them would start with this step:
create a new cache behavior using the desired path pattern, such as /assets/* and select your existing origin to handle these requests.
At this point, CloudFront still works as before, it's just internally considering the asset requests to match one behavior and everything else to match the other.
So, what we need next is something different for the "other."
The simplest solution is to create a second Origin, using the Origin Domain Name invalid.invalid. This is a syntactically valid hostname that points to a nonexistent target (the .invalid TLD is reserved for such purposes).
After creating this origin, edit your default cache behavior to use this new origin.
With this change in place and propagated, CloudFront will process /assets/* requests as before, but will throw an error on any other path. (The error is 502 Bad Gateway, if I remember correctly).
This accomplishes the simple purpose of blocking all other requests.
If you want to be a bit more proactive, and actually redirect requests back to the main site, you can accomplish this by creating an empty bucket in S3 and selecting the "Redirect requests" option. In the "target bucket or domain" box, put your main web site hostname. Then take the "Endpoint" shown in the Static website hosting box and use that as your origin hostname in CloudFront, for the default cache behavior. Any requests that arrive at CloudFront for paths other than /assets/* will receive a redirect back to the main site.
This option may be the better option if your CDN has been inadvertently picked up by search engines, because the links will redirect back to the main site.
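If you'd rather script that redirect bucket than click through the console, a rough aws-sdk-s3 sketch (bucket name, region and target hostname are placeholders) would be:

require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")

# Configure the empty bucket to redirect every request to the main site;
# its static website endpoint then becomes the CloudFront default origin.
s3.put_bucket_website(
  bucket: "my-redirect-bucket",
  website_configuration: {
    redirect_all_requests_to: {
      host_name: "www.example.com",
      protocol: "https"
    }
  }
)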

PDF caching on heroku with cloudflare

I'm having a problem getting the caching I need to work using CloudFlare.
We use CloudFlare for caching all our assets on S3, which works 100% using a separate subdomain, cdn
We also use CloudFlare for our main site (hosted on Heroku) as well, e.g. www
My problem is I can't get CloudFlare to cache PDFs that are generated from our Rails app. I'm using the WickedPDF gem to dynamically generate certain PDFs for invoices, etc. I don't want to upload these as files to say S3 but we would like to have CloudFlare cache these so they don't get generated each and every time, as the time spent generating these PDFs is a little intensive.
CloudFlare is turned on and is "accelerating" for the subdomain in question and we're using SSL, but PDFs never seem to cache properly.
Is there something else we need to do to ensure these get cached? Or maybe there's another solution that would work for Heroku? (e.g. we can't use page caching since it relies on the filesystem) I also checked the WickedPDF documentation to see if we could do anything else, but found nothing about expiry controls.
Thanks,
We should actually cache it as long as the resources are on-domain & not being delivered through a third-party resource in some way.
Keep in mind:
1. Our caching depends on the number of requests for the resources (at least three).
2. Caching is very much data center dependent (in other words, if your site receives a lot of traffic at a data center it is going to be cached; if your site doesn't get a lot of traffic in another data center it may not cache).
I would open a support ticket if you're still having issues.
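One more thing worth checking (my suggestion, not something CloudFlare support stated): make sure the PDF response itself is marked cacheable, since Rails marks dynamic responses private by default. Combined with a "Cache Everything" page rule, something like this in the controller lets the edge keep the PDF (controller, model and template names are placeholders):

class InvoicesController < ApplicationController
  def show
    @invoice = Invoice.find(params[:id])

    # Allow shared caches (e.g. CloudFlare) and browsers to reuse the PDF for a day.
    expires_in 1.day, public: true

    # Standard WickedPDF render options.
    render pdf: "invoice-#{@invoice.id}",
           template: "invoices/show",
           formats: [:html]
  end
end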

Rails implementation for securing S3 documents

I would like to protect my S3 documents behind my Rails app such that if I go to:
www.myapp.com/attachment/5, it should authenticate the user prior to displaying/downloading the document.
I have read similar questions on stackoverflow but I'm not sure I've seen any good conclusions.
From what I have read there are several things you can do to "protect" your S3 documents.
1) Obfuscate the URL. I have done this. I think this is a good thing to do so no one can guess the URL. For example it would be easy to "walk" the URLs if your S3 URLs are obvious: https://s3.amazonaws.com/myapp.com/attachments/1/document.doc. Having a URL such as:
https://s3.amazonaws.com/myapp.com/7ca/6ab/c9d/db2/727/f14/document.doc seems much better.
This is great to do but doesn't resolve the issue of passing around URLs via email or websites.
2) Use an expiring URL as shown here: Rails 3, paperclip + S3 - Howto Store for an Instance and Protect Access
For me, however, this is not a great solution because the URL is exposed (even if just for a short period of time) and another user could reuse it within that window. You have to adjust the time to allow for the download without providing too much time for copying. It just seems like the wrong solution.
3) Proxy the document download via the app. At first I tried to just use send_file: http://www.therailsway.com/2009/2/22/file-downloads-done-right but the problem is that these files can only be static/local files on your server and not served via another site (S3/AWS). I can however use send_data and load the document into my app and immediately serve the document to the user. The problem with this solution is obvious - twice the bandwidth and twice the time (to load the document to my app and then back to the user).
I'm looking for a solution that provides the full security of #3 but does not require the additional bandwidth and time for loading. It looks like Basecamp is "protecting" documents behind their app (via authentication) and I assume other sites are doing something similar but I don't think they are using my #3 solution.
Suggestions would be greatly appreciated.
UPDATE:
I went with a 4th solution:
4) Use amazon bucket policies to control access to the files based on referrer:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingBucketPolicies.html
UPDATE AGAIN:
Well, #4 can easily be worked around via a browser's developer tools. So I'm still in search of a solid solution.
You'd want to do two things:
Make the bucket and all objects inside it private. The naming convention doesn't actually matter, the simpler the better.
Generate signed URLs, and redirect to them from your application. This way, your app can check if the user is authenticated and authorized, then generate a new signed URL and redirect the user to it with a temporary redirect (302). This means that the file never goes through your servers, so there's no load or bandwidth on you. Here are the docs for presigning a GET_OBJECT request:
https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Presigner.html
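A bare-bones sketch of that flow (controller, model, bucket and key attribute are made up; use whatever authentication your app already has):

require "aws-sdk-s3"

class AttachmentsController < ApplicationController
  before_action :authenticate_user!

  def show
    # Authorization: only documents belonging to the signed-in user.
    attachment = current_user.attachments.find(params[:id])

    url = Aws::S3::Presigner.new.presigned_url(
      :get_object,
      bucket: "my-private-bucket",
      key: attachment.s3_key,
      expires_in: 60 # seconds
    )

    # Temporary redirect; the file bytes never pass through your servers.
    # (On Rails 7+ you also need allow_other_host: true.)
    redirect_to url
  end
end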
I would vote for number 3; it is the only truly secure approach, because once you hand the user the S3 URL it is valid until its expiration time. A crafty user could exploit that hole; the only question is, will that affect your application?
Perhaps you could set the expire time to be lower which would minimise the risk?
Take a look at an excerpt from this post:
Accessing private objects from a browser
All private objects are accessible via an authenticated GET request to the S3 servers. You can generate an authenticated URL for an object like this:
S3Object.url_for('beluga_baby.jpg', 'marcel_molina')
By default authenticated URLs expire 5 minutes after they were generated. Expiration options can be specified either with an absolute time since the epoch with the :expires option, or with a number of seconds relative to now with the :expires_in option.
I have been trying to do something similar for quite some time now. If you don't want to use the bandwidth twice, then the only way this is possible is to let S3 do it. Now I am totally with you about the exposed URL. Were you able to come up with any alternative?
I found something that might be useful in this regard - http://docs.aws.amazon.com/AmazonS3/latest/dev/AuthUsingTempFederationTokenRuby.html
Once a user logs in, an AWS session with their IP as part of the AWS policy should be created, and this can then be used to generate the signed URLs. So in case somebody else grabs the URL, the signature will not match, since the source of the request will be a different IP. Let me know if this makes sense and is secure enough.
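I haven't verified this end to end, but the mechanics would look roughly like the sketch below: ask STS for temporary credentials scoped by a policy with an aws:SourceIp condition, then presign with those credentials (bucket, key, region and method name are placeholders):

require "aws-sdk-core" # provides Aws::STS
require "aws-sdk-s3"
require "json"

def presigned_url_locked_to_ip(bucket:, key:, client_ip:)
  # Policy attached to the temporary credentials: GetObject on this key,
  # only when the request comes from the user's IP.
  policy = {
    Version: "2012-10-17",
    Statement: [{
      Effect: "Allow",
      Action: "s3:GetObject",
      Resource: "arn:aws:s3:::#{bucket}/#{key}",
      Condition: { IpAddress: { "aws:SourceIp" => client_ip } }
    }]
  }.to_json

  session = Aws::STS::Client.new(region: "us-east-1").get_federation_token(
    name: "download-session",
    policy: policy,
    duration_seconds: 900 # the minimum STS allows
  )

  credentials = Aws::Credentials.new(
    session.credentials.access_key_id,
    session.credentials.secret_access_key,
    session.credentials.session_token
  )

  Aws::S3::Presigner
    .new(client: Aws::S3::Client.new(region: "us-east-1", credentials: credentials))
    .presigned_url(:get_object, bucket: bucket, key: key, expires_in: 600)
end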

How to proxy files from S3 through rails application to avoid leeching?

In order to avoid hot-linking, S3 bandwidth leeching, etc. I would like to make my bucket private and serve the files through a Rails app. The concept in general sounds very easy, but I am not entirely sure which approach would be the best for the situation.
I am using paperclip for general asset management. Is there any built-in way to achieve this type of proxy?
In general I can easily parse the URLs from paperclip and point them back to my own controller. What should happen from this point? Should I simply use Net::HTTP to download the image, and then serve it with send_data? In between I want to log the referer and set proper Cache-Control headers, since I have a reverse proxy in front of the app. Is Net::HTTP + send_data a reasonable way to handle this case?
Maybe the whole idea is really bad for some reason I am not aware of at the moment? In general I believe that revealing direct S3 links to a public bucket is dangerous and will yield serious problems in case of leeching / hot-linking...
Update:
If you have any other ideas which can reduce S3 bill and prevent hot-linking leeching in anyway please share, even if they are not directly related to Rails.
Use a private bucket (or private files) and use signed URLs to the files stored on S3.
The signature includes an expiration time (e.g. 10 minutes from now, whatever you would like to set), as well as a cryptographic hash. S3 will refuse to serve files if the signature is invalid, or if the expiration time has passed.
This is useful because only you can create valid URLs to your private files in S3, and you can control how long the URLs remain valid. This prevents leeching, because leechers can't sign their own URLs and, if they get a URL that you signed, that URL will expire very shortly and after that can not be used.
Since there wasn't a nuts and bolts answer above, here's a small code sample of how to stream a file that's stored on S3.
render :text => proc { |response, output|
  # Stream the object from S3 in chunks straight to the client (old aws-s3 gem API)
  AWS::S3::S3Object.stream(path, bucket) do |segment|
    output.write segment
    output.flush # not sure if this is needed
  end
}
Depending on your webserver this may (mongrel) or may not (webrick) work, so don't get too frustrated if it doesn't stream in development.
Provide temporary pre-signed URLs:
def show
  redirect_to Aws::S3::Presigner.new.presigned_url(
    :get_object,
    bucket: 'mybucket',
    key: '/folder/file.pdf',
    expires_in: 60
  )
end
S3 still distributes the content, so you offload the work from Rails (which is very slow at serving files); S3 handles HTTP caching and HEAD operations for you, and you can put Amazon's CDN in front of it.
I'd probably avoid doing this -- at least until I had no other choice.
You need to take into account that you'll probably also add to the bandwidth bill if you download the image each time. Also, processing each image through a script requires more CPU and RAM. Not the greatest outlook -- IMHO.
I would probably enable the access logs for Amazon S3 and write a small tool to analyze usage and change the permissions on the bucket/object in case usage goes through the roof. Run this as a cronjob every 10 minutes or so and you should be safe.
You could also use s3stat. They also offer a free plan.
Edit: As per my recommendation for Varnish, I'm adding a link to a blog entry about preventing hotlinking using Varnish.

How to use multiple caches in rails?

I have a rails application where I would like to use both memcached and the file store cache, for different purposes.
I want to use the file store cache to keep a large number of pages that don't change often (some not at all) - i.e. page caching - and use memcached for everything else (action and DB caching etc). The reason is that the pages stored on the file store cache are likely to require a large amount of storage, but individually most will be accessed infrequently.
Is this possible to do or will configuring memcached as the cache mean that it is also used for page caching?
As a secondary question, what is a safe way to remove pages from the file store cache in some form of cron job, as there does not seem to be an option to specify a TTL for this cache? For example, a UNIX find command would quickly find and remove all old pages or pages that haven't been accessed in a long time - is this safe to do given the app server might potentially try to serve one of those pages at that moment (though this is very unlikely)? If not, then what is the best way to do this?
If you want to use the filesystem only for page caching and memcached for action and fragment caching, you're fine. Page caching always uses the filesystem. Just remember that page caching bypasses your Rails application, so you can't use it for pages that include content that changes from user to user or for pages that are access controlled with filters.
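In config terms (a sketch; the memcached address and page cache directory are assumptions), the split looks roughly like this:

# config/environments/production.rb
config.action_controller.perform_caching = true

# Action and fragment caching (and Rails.cache) go to memcached.
config.cache_store = :mem_cache_store, "localhost:11211"

# Page caching ignores cache_store entirely and writes plain files
# that the web server can serve directly.
config.action_controller.page_cache_directory = Rails.root.join("public", "cached_pages")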
Regarding the removal of pages, on Unix, a file can be deleted, but it is not actually removed from disk until all open file handles are closed. If the app server has opened the file to serve a request, and the find command deletes it a split-second later, the app server doesn't suddenly get an error when it tries to read.
You could also consider having find delete files based on their last access time, instead of creation or modification, and using a sweeper in your Rails app to delete the cached page when its content is out of date.
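If you go the cron route, a small rake task along these lines (directory name and age threshold are placeholders) keeps the cleanup logic inside the app:

# lib/tasks/page_cache.rake
namespace :page_cache do
  desc "Delete cached pages not accessed in the last 7 days"
  task cleanup: :environment do
    dir = Rails.root.join("public", "cached_pages").to_s

    Dir.glob(File.join(dir, "**", "*.html")).each do |file|
      # Deleting is safe even if the app server is currently reading the file:
      # the open handle keeps the data available until it is closed.
      File.delete(file) if File.atime(file) < 7.days.ago
    end
  end
end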
A simpler approach may be to use an HTTP cache upstream of your application as your page cache rather than two stores within Rails. This way you can use HTTP headers to control the cache behavior, including TTLs. These same limits will also apply to browsers' local caches as a nice bonus.
Varnish is about as high performance as it gets, but would require setting up another moving piece in your hosting environment as a proxy. This may still be worthwhile depending on what you're doing.
A simpler approach might be Rack::Cache, which will be easy to set up provided you're using a Rack-enabled version of Rails.
