URL fingerprint caching on Amazon S3

I have a bucket on Amazon S3 where I keep files that sometimes change but I want to use maximum caching on them, so I want to use URL fingerprinting to invalidate the cache.
I use the "last modified" date of the files for the fingerprint, and the html page requesting the S3 files always knows each file's fingerprint.
Now, I realize that I could use the fingerprint in the query string, like so:
http://aws.amazon.com/bucket/myFile.jpg?v=1310476099061
but the query string is not always enough for some proxies or older browsers to invalidate the cache, and some proxies and browsers don't even cache it if it contains a query string. That's why I want to keep the fingerprint in the actual URL, like one of these:
http://aws.amazon.com/bucket/myFile-1310476099061.jpg
http://aws.amazon.com/bucket/1310476099061/myFile.jpg
http://aws.amazon.com/bucket/myFile.jpg/1310476099061
etc
Any of these URLs would be perfect for requesting myFile.jpg, but I want them all to be remapped to the http://aws.amazon.com/bucket/myFile.jpg file. That is, I only want the URL to change so the browser will think it is a new file and fetch a fresh copy, which it will then cache for a year. When I upload a new version of the file, the fingerprint is automatically updated.
Now here is my question: is there any way to rewrite the URL so that a request for a URL like http://aws.amazon.com/bucket/myFile-xxxxxx.jpg will serve the http://aws.amazon.com/bucket/myFile.jpg file on Amazon S3? Or are there any other workarounds that will still keep the file cached? Thanks =)

I'm afraid you're stuck with the version in the query string. There is no way to rewrite URLs on S3 without actually changing the filename.
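If you control the upload step, one workaround is to make the fingerprinted URL exist for real: every time the file changes, copy it to a fingerprinted key with far-future cache headers and have the HTML reference that key. A rough sketch using the aws-sdk-s3 gem (bucket name, key, and region are placeholders):
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Build the fingerprint from the object's Last-Modified time (milliseconds since
# epoch, to match the format used in the question).
last_modified = s3.head_object(bucket: 'bucket', key: 'myFile.jpg').last_modified
fingerprint   = (last_modified.to_f * 1000).to_i

# Copy myFile.jpg to myFile-<fingerprint>.jpg with a one-year Cache-Control header.
s3.copy_object(
  bucket: 'bucket',
  copy_source: 'bucket/myFile.jpg',
  key: "myFile-#{fingerprint}.jpg",
  metadata_directive: 'REPLACE',
  cache_control: 'public, max-age=31536000',
  acl: 'public-read'
)
You pay for the duplicate storage, but each fingerprinted copy is a plain S3 object that caches aggressively.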

Related

"Request has expired" when using S3 with Active Storage

I'm using ActiveStorage for the first time.
Everything works fine in development but in production (Heroku) my images disappear without a reason.
They were showing ok the first time, but now no image is displayed. In the console I can see this error:
GET https://XXX.s3.amazonaws.com/variants/Q7MZrLyoKKmQFFwMMw9tQhPW/XXX 403 (Forbidden)
If I try to visit that URL directly I get an XML
<Error>
<Code>AccessDenied</Code>
<Message>Request has expired</Message>
<X-Amz-Expires>300</X-Amz-Expires>
<Expires>2018-07-24T13:48:25Z</Expires>
<ServerTime>2018-07-24T15:25:37Z</ServerTime>
<RequestId>291D41FAC6708334</RequestId>
<HostId>lEVGuwA6Hvlm/i40PeXaje9SEBYks9+uk6DvBs=</HostId>
</Error>
This is what I have in the view
<div class="cover" style="background-image: url('<%= rails_representation_path(experience.thumbnail) %>')"></div>
This is what I have in the model
def thumbnail
  self.cover.variant(resize: "300x300").processed
end
In simple words, I don't want images to expire but to be always there.
Thanks
ActiveStorage does not support non-expiring links. It uses expiring (private) links, and it only supports uploading files as private on your service.
It was a problem for me too, so I wrote two patches (use with caution) for S3 only: a simple one (~30 lines) that overrides ActiveStorage to work only with non-expiring (public) links, and another that adds an acl option to the has_one_attached and has_many_attached methods.
Hope it helps.
Your question doesn't say so, but it's common to use a CDN like AWS CloudFront with a Rails app. Especially on Heroku you probably want to conserve compute power.
Here is what happens in that scenario. You render a page as usual, and all the images are requested from the asset host, which is the CDN, because that's how it is configured to integrate. It's set up to fetch anything it doesn't find in its cache from the origin, which is your application again.
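That integration is usually just the asset host setting pointing at the distribution, e.g. (a sketch; the CloudFront domain is a placeholder):
# config/environments/production.rb
config.action_controller.asset_host = 'https://d111111abcdef8.cloudfront.net'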
At first, all image requests are passed through. The ActiveStorage controller creates signed URLs for them, and the CDN passes them on, but also caches them.
Now comes the problem. The signed URL expires after 5 minutes by default, but the CDN usually caches for much longer. That is because you typically use digest assets, which are invalidated not by time but by name, on any change.
The solution is simple. Increase the expiry of the signed URL to be longer than the cache's TTL. Now the cache drops the cached signed URL before it becomes invalid.
Set the URL expiry using ActiveStorage::Service.url_expires_in in Rails 5.2, or directly via Rails.application.config.active_storage.service_urls_expire_in in an initializer (see this answer for details).
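For example, a one-week expiry (a sketch; pick any duration longer than your CDN's TTL):
# Rails 5.2, e.g. in an initializer as mentioned above:
ActiveStorage::Service.url_expires_in = 1.week

# Rails 6+:
Rails.application.config.active_storage.service_urls_expire_in = 1.week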
To set the cache TTL in CloudFront: open the AWS console, pick the distribution, open the Behaviors tab, edit the relevant behavior, and adjust the Object Caching TTL fields (Minimum TTL, Default TTL, Maximum TTL).
Then optionally issue an invalidation to force re-caching of all contents.
Keep in mind there is a security trade-off. If the image contents are private, then they most likely don't belong in a CDN, and they shouldn't have long-lasting temporary URLs either. In that case choose a solution that exempts attachments from the CDN altogether. Your application will then have to handle the additional load of signing all attached assets' URLs on top of rendering the relevant page.
Further keep in mind that this isn't necessarily a good solution, but more of a workaround. With the above setup you will cache redirects, and the heavier requests will hit your storage bucket directly. The usual scenario for CDNs is large media, not lightweight redirects. You do relieve the app of handling a lot of requests, though; whether that is a worthwhile optimization is worth measuring.
I had this same issue, but after I corrected the time on my computer the problem was resolved. It was clock skew between my machine and the AWS servers, so the signed URLs were already expired as far as AWS was concerned.
# config/environments/production.rb
Change
config.active_storage.service = :local
to
config.active_storage.service = :amazon
The service name should match whatever you defined in config/storage.yml (aws, amazon, or whatever you called it).
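For reference, the service name has to line up with an entry in config/storage.yml, e.g. (credentials and bucket are placeholders):
amazon:
  service: S3
  access_key_id: <%= Rails.application.credentials.dig(:aws, :access_key_id) %>
  secret_access_key: <%= Rails.application.credentials.dig(:aws, :secret_access_key) %>
  region: us-east-1
  bucket: my-bucket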

From the server, how do I force an external file to expire so that the browser receives a fresh one?

I have a show view that uses a 'Universal Viewer' to load images. The image dimensions come from a JSON file served by an IIIF image server.
I fixed a bug and a new JSON file exists, but the user's browser is still using the old info.json file.
I understand that I could just have them do a hard-reload, like I myself did on my machine, but many users may be affected, and I'm just damn curious now.
Modern browsers all ship with cache-control functionality baked in. Using a combination of ETags and Cache-Control headers, you can accomplish what you seek without having to change the file names or use cache-busting query parameters.
ETags allow you to communicate a token to a client that will tell their browser to update the cached version. This token can be created based on the content creation date, content length, or a fingerprint of the content.
Cache-Control headers allow you to create policies for web resources about how long, who, and how your content can be cached.
Using ETags and Cache-Control headers is a useful way to tell clients when to update their cache when serving IIIF or any other content. However, adding them can be quite specific to your local setup. Many frameworks (like Ruby on Rails) have much of this functionality built in. There may also be web server configuration to modify; some sample configurations that use these strategies are available from the HTML5 Boilerplate project.
Sample Apache configurations for:
ETags https://github.com/h5bp/server-configs-apache/blob/master/src/web_performance/etags.conf
Cache expiration https://github.com/h5bp/server-configs-apache/blob/master/src/web_performance/expires_headers.conf
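To illustrate the Rails side mentioned above, conditional GET support is built into controllers via fresh_when/stale?. A minimal sketch (controller and model names are hypothetical):
class ManifestsController < ApplicationController
  def show
    manifest = Manifest.find(params[:id])
    # Sets ETag and Last-Modified on the response; Rails answers 304 Not Modified
    # when the client's cached copy is still current.
    fresh_when etag: manifest, last_modified: manifest.updated_at, public: true
  end
end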
It depends on where the JSON file is being served from, and how it's being cached.
The guaranteed way to expire the cache on the file is to change the filename every time it changes. This is typically done by renaming it filename-MD5HASH.ext, where MD5HASH is the MD5 hash of the file.
If you can't change the file name (it comes from a source you can't control), you might be able to get away with adding a cache-busting query key to the URL. Something like http://example.com/file.ext?q=123456.
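A quick sketch of the renaming approach (the file name is hypothetical):
require 'digest/md5'

path          = 'info.json'
digest        = Digest::MD5.file(path).hexdigest
fingerprinted = path.sub(/(\.\w+)\z/, "-#{digest}\\1")
# => e.g. "info-5d41402abc4b2a76b9719d911017c592.json"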

How do I limit AWS CloudFront so that it only serves requests from a single directory on my domain?

I have gone through the process of creating a CloudFront distribution with the Origin Domain Name pointing to my main Rails application, where assets (images, CSS, JS, etc.) are located at /assets.
However, by default, the CloudFront distribution is mirroring the entire domain (including dynamic pages).
How can I limit it to just the /assets sub-tree?
PS This is the article I am following:
https://devcenter.heroku.com/articles/using-amazon-cloudfront-cdn
Thanks!
Since the default cache behavior can't (afaik) be removed, this seems like a clever "serverless" solution:
Create a bucket in S3. The name won't matter. Don't put anything in it.
Add a second origin to your CloudFront distribution, selecting the new bucket as the origin.
Create a second cache behavior with path pattern /assets/* pointing to your original origin.
Change the default cache behavior to use the new S3 origin (the unused, empty bucket).
CloudFront will forward requests for /assets/* to your existing server, where they will be handled as now, but all other requests will be sent to the empty bucket, which has no content and no permissions, so the response will be 403 Forbidden.
Optionally, add an appropriate robots.txt file to that otherwise-empty bucket and make it publicly readable, so CloudFront will serve it to any crawlers that visit your distribution and disallow them from indexing. That should hopefully prompt them to remove any already-indexed results and not try to index the assets or any other paths they may have learned by crawling the previously exposed content at the "wrong" URL.
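For that optional robots.txt, something like this works (aws-sdk-s3 gem; the bucket name is a placeholder, and the bucket must allow public ACLs or an equivalent bucket policy):
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Upload a robots.txt that disallows all crawling to the otherwise-empty default-origin bucket.
s3.put_object(
  bucket: 'my-empty-default-origin-bucket',
  key: 'robots.txt',
  body: "User-agent: *\nDisallow: /\n",
  content_type: 'text/plain',
  acl: 'public-read'
)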

How to proxy files from S3 through rails application to avoid leeching?

In order to avoid hot-linking, S3 bandwidth leeching, etc I would like to make my bucket private and serve the files through a Rails app. Concept in general sounds very easy, but I am not entirely sure which approach would be the best for the situation.
I am using Paperclip for general asset management. Is there any built-in way to achieve this type of proxy?
In general I can easily parse the URLs from Paperclip and point them back to my own controller. What should happen from that point? Should I simply use Net::HTTP to download the image and then serve it with send_data? In between I want to log the referer and set proper Cache-Control headers, since I have a reverse proxy in front of the app. Is Net::HTTP + send_data a reasonable approach in this case?
Maybe the whole idea is really bad for some reason I'm not aware of at the moment? In general I believe that revealing direct S3 links to a public bucket is dangerous and can lead to serious problems with leeching / hot-linking...
Update:
If you have any other ideas that can reduce the S3 bill and prevent hot-linking / leeching in any way, please share, even if they are not directly related to Rails.
Use a private bucket (or private files) and signed URLs to the files stored on S3.
The signature includes an expiration time (e.g. 10 minutes from now, whatever you would like to set), as well as a cryptographic hash. S3 will refuse to serve files if the signature is invalid, or if the expiration time has passed.
This is useful because only you can create valid URLs to your private files in S3, and you can control how long the URLs remain valid. This prevents leeching, because leechers can't sign their own URLs and, if they get a URL that you signed, that URL will expire very shortly and after that can not be used.
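Since the question mentions Paperclip: with S3 storage and private permissions, Paperclip can hand back such signed, expiring URLs directly. A sketch (model and attachment names are hypothetical):
class AssetsController < ApplicationController
  def show
    asset = Asset.find(params[:id])
    # Log the referer, then redirect to a URL that S3 will only honor for 10 minutes.
    logger.info "Asset #{asset.id} requested by #{request.referer}"
    redirect_to asset.image.expiring_url(10.minutes.to_i)
  end
end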
Since there wasn't a nuts and bolts answer above, here's a small code sample of how to stream a file that's stored on S3.
render :text => proc { |response, output|
  AWS::S3::S3Object.stream(path, bucket) do |segment|
    output.write segment
    output.flush # not sure if this is needed
  end
}
Depending on your webserver this may (mongrel) or may not (webrick) work, so don't get too frustrated if it doesn't stream in development.
Provide temporary pre-signed URLs:
def show
  redirect_to Aws::S3::Presigner.new.presigned_url(
    :get_object,
    bucket: 'mybucket',
    key: '/folder/file.pdf',
    expires_in: 60
  )
end
S3 still serves the content, so you offload the work from Rails (which is very slow at it); S3 also handles HTTP caching and HEAD operations, and can be fronted by Amazon's CDN (CloudFront).
I'd probably avoid doing this -- at least until I had no other choice.
You need to take into account that you'll probably also add to the bandwidth bill if you download the image from S3 each time. Piping each image through a script also means you'll need more CPU and RAM. Not the greatest outlook -- IMHO.
I would probably enable access logs for Amazon S3 and write a small tool to analyze usage and change the permissions on the bucket/object in case usage goes through the roof. Run this as a cron job every 10 minutes or so and you should be safe.
You could also use s3stat. They also offer a free plan.
Edit: As per my recommendation for Varnish, I'm adding a link to a blog entry about preventing hotlinking using Varnish.

CDN and URLs with query strings

We have an images folder on our web servers that we may publish via a CDN. Sometimes we append query-string-like syntax to URLs to help us freshen content that has changed, even though it rarely does. Example:
/images/file.png?20090821
Will URLs like this work with your average content delivery network?
Yes. We use Akamai, which keeps a cached copy of each distinct URL requested, including the query string. So the first request for /images/file.png?20090821 will go to the origin server; requests thereafter for /images/file.png?20090821 will get the image from the Akamai servers. The next day, assuming the img src changes to /images/file.png?20090822, the first request will go to the origin server again.
Amazon CloudFront turned on this feature in May 2012
You wouldn't have a problem with the CDN. However, you may have a problem with browsers: some browsers won't cache any content whose URL contains a query string. Even though it may be faster to fetch the image from the CDN, it will not be as fast as a locally cached image. So you want to do something like this:
/images/file.png/20090821
Our CDN provider also recommends a hash mechanism. When we publish our content, it adds a hash to the URL so you don't have to add the version yourself. Unfortunately, I don't know the details on how that magic is done.
Amazon CloudFront won't propagate the query string (though see the note above about the query-string feature CloudFront added in May 2012).