How can I exclude URLs within Amazon CloudFront?

I plan on using the Amazon CloudFront CDN, and there is a URL that I need to exclude. It's a dynamic URL (it contains a query string).
For example, I want to store/cache every page on my site, except:
http://www.mypage.com/do-not-cache/?query1=true&query2=5
EDIT - Is this what invalidating 'objects' does? If so, please provide an example.
Thanks

At the risk of helping you solve the wrong problem, I should point out that if you configure CloudFront to forward query strings to the origin server, then the response is cached against the entire URI -- that is, against the path plus query string -- not against the path alone, so...
/dynamic-page?r=1
/dynamic-page?r=2
/dynamic-page?r=2&foo=bar
...would be three different "pages" as far as CloudFront is concerned. It would never serve a request from the cache unless the query string were identical.
If you configure CloudFront to forward query strings to your origin, CloudFront will include the query string portion of the URL when caching the object. [...]
This is true even if your origin always returns the same [content] regardless of the query string.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/QueryStringParameters.html
So, it shouldn't really be necessary to explicitly and deliberately avoid caching on this page. The correct behavior should be automatic, if CloudFront is configured to forward the query string.
Additionally, of course, if your origin server sets Cache-Control: private, no-cache, no-store or similar in the response headers, neither CloudFront nor the browser should cache the response.
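For illustration, here is what that could look like on the origin side, sketched as a bare WSGI app in Python (the paths and the 300-second TTL are made up, not anything from your stack):
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # The dynamic page opts out of caching; everything else may be cached.
    if environ.get('PATH_INFO', '').startswith('/do-not-cache'):
        cache = 'private, no-cache, no-store'
    else:
        cache = 'public, max-age=300'   # arbitrary example TTL
    start_response('200 OK', [('Content-Type', 'text/html'),
                              ('Cache-Control', cache)])
    return [b'<html>...</html>']

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()
With headers like these, CloudFront and the browser should decide on their own not to store the response, so no CloudFront-side configuration is needed for that page.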
But if you're very insistent that CloudFront be explicitly configured not to cache this page, create a new cache behavior, with the path pattern matching /do-not-cache*, and configure CloudFront to forward all of the request headers to the origin, which disables caching for page requests matching that path pattern.
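As for the EDIT: invalidation is a different tool. It evicts objects that CloudFront has already cached, but it does not stop CloudFront from caching that path again on the next request, so it is not a way to exclude a URL. For completeness, a minimal invalidation sketch using boto3 (the distribution ID is a placeholder):
import time
import boto3

cloudfront = boto3.client('cloudfront')
cloudfront.create_invalidation(
    DistributionId='EXXXXXXXXXXXXX',          # placeholder; use your distribution's ID
    InvalidationBatch={
        'Paths': {'Quantity': 1, 'Items': ['/do-not-cache/*']},
        'CallerReference': str(time.time()),  # any string that is unique per request
    },
)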

Related

'reload' vs 'reloadFromOrigin' in WKWebView

What's the difference between reload and reloadFromOrigin in WKWebView? Apple's documentation says that reloadFromOrigin:
Reloads the current page, performing end-to-end revalidation using cache-validating conditionals if possible.
But I am not sure what that really means.
I was interested in this too. Looking at WebKit's source (Source/WebCore/loader/FrameLoader.cpp, FrameLoader::addExtraFieldsToRequest(...) around the if (loadType == FrameLoadType::Reload) conditional), it looks like the key difference is in which extra HTTP request header fields are specified.
reloadFromOrigin() sets the Cache-Control and Pragma fields to no-cache, while a simple reload() only results in the Cache-Control header field having max-age=0 set.
To work out what this means I looked at the Header Field Definitions section of the HTTP 1.1 spec. Section 14.9.4 'Cache Revalidation and Reload Controls' states:
The client can specify these three kinds of action using Cache-Control request directives:
End-to-end reload
The request includes a "no-cache" cache-control directive or, for compatibility with HTTP/1.0 clients, "Pragma: no-cache". Field names MUST NOT be included with the no-cache directive in a request. The server MUST NOT use a cached copy when responding to such a request.
Specific end-to-end revalidation
The request includes a "max-age=0" cache-control directive, which forces each cache along the path to the origin server to revalidate its own entry, if any, with the next cache or server. The initial request includes a cache-validating conditional with the client's current validator.
Unspecified end-to-end revalidation
The request includes "max-age=0" cache-control directive, which forces each cache along the path to the origin server to revalidate its own entry, if any, with the next cache or server. The initial request does not include a cache-validating
conditional; the first cache along the path (if any) that holds a cache entry for this resource includes a cache-validating conditional with its current validator.
From my reading of the spec, it looks like reload() uses only max-age=0 so may result in you getting a cached, but validated, copy of the requested data, whereas reloadFromOrigin() will force a fresh copy to be obtained from the origin server.
(This seems to contradict Apple's header/Class Reference documentation for the two functions in WKWebView. I think the descriptions for the two should be swapped over - I have lodged Bug Report/Radar 27020398 with Apple and will update this answer either way if I hear back from them...)
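To see the difference at the HTTP level (independent of WKWebView), here is a rough sketch in Python; the URL is a placeholder, and any server that returns a validator such as an ETag will do:
import urllib.request, urllib.error

URL = "https://example.com/"  # placeholder

# First fetch: note the validator (ETag) the server hands back, if any.
first = urllib.request.urlopen(URL)
etag = first.headers.get("ETag")

# reload()-style request: "max-age=0" plus a cache-validating conditional.
# Caches along the path must revalidate; a 304 means the stored copy is still good.
revalidate = urllib.request.Request(URL, headers={"Cache-Control": "max-age=0"})
if etag:
    revalidate.add_header("If-None-Match", etag)
try:
    print("revalidate:", urllib.request.urlopen(revalidate).status)  # 200: new body sent
except urllib.error.HTTPError as err:
    print("revalidate:", err.code)                                   # 304: cached copy validated

# reloadFromOrigin()-style request: "no-cache" (plus "Pragma" for HTTP/1.0 caches)
# tells every cache on the path not to answer from its store at all.
fresh = urllib.request.Request(URL, headers={"Cache-Control": "no-cache",
                                             "Pragma": "no-cache"})
print("end-to-end reload:", urllib.request.urlopen(fresh).status)    # expect a full 200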

Rails 404 handler for non-Rails URLs

I've inherited a site with hundreds of scattered HTML and non-framework PHP files, which I am porting to Ruby on Rails 3.0.
As functionality is added in the Rails app, the corresponding pages are deleted from the document root; but, because there are often links to these in Google or from external sites, simply returning a 404 is not acceptable.
A URL like '/contact.php' should redirect to '/app/contact/', for example.
For the first few cases of this, I created simple stub HTML files at the old locations, with meta refresh tags to perform the redirect. This doesn't scale well, particularly once I start replacing product pages, of which there are thousands.
My preference is to delete the old pages, then have the 404 handler dispatch these to the new Rails app, which will examine the URL using regexes and database lookup to try to figure out what the replacement page is, then issue a 301 redirect to that new page.
In httpd.conf, I placed the directive:
ErrorDocument 404 /app/error/handle404
# /app/error is a rails url.
When I hit "http://localhost/does-not-exist", this causes my ErrorController to be invoked, as expected.
However, within the controller, I cannot find the original path ("/does-not-exist") anywhere. I have called likely methods such as request.request_uri (which contains /app/error/handle404) and examined request.headers and ENV without finding the expected original path.
The Apache access_log shows only the request for /does-not-exist, indicating that it transparently invoked /app/error/handle404 (without doing a redirect or causing a second request to be made).
How can I get access to the original URL?
Edit: to clarify, here is the sequence of events:
User hits legacy path like http://mysite/foo.php, probably coming from some ancient link from a blog.
...but foo.php no longer exists!
this is a 404, thus Apache invokes ErrorDocument
directive is "ErrorDocument 404 /railsapp/error/handle404"
Rails routes this to ErrorController action "handle404" - this is working correctly
problem: in ErrorController, request.request_uri and request.headers do not provide any clue as to which URL the user was actually trying to get to, like "/foo.php"; I need to know the original URL to serve up an appropriate replacement page.
As I couldn't find the original, non-rewritten URL in the Rails request, I ended up doing it in PHP - plain, old-fashioned, non-framework PHP with explicit mysqli_*() calls.
The PHP error handler receives the necessary information in the $_SERVER hash; $_SERVER['REQUEST_URI'] contains the original URI that I needed.
I look this up in a database, and if I find a corresponding entry, issue a 301 redirect to the new location; if there's no entry, I simply display a 404 page to the user.
Simplified (PHP):
$url = $_SERVER['REQUEST_URI'];                  # the original URI, e.g. /foo.php
$redir = lookupRedirect($url);                   # database stuff here; falsy if no match
if (! $redir) {
    include('404.phtml');                        # no replacement known: show the 404 page
} else {
    header("Location: " . $redir['new_url'], true, 301);  # permanent redirect
    exit;
}
It's an ugly kluge, but I just couldn't find a way to make the Rails app aware of the error URL.
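For what it's worth, when Apache serves an ErrorDocument it exposes the failing request to the handler in REDIRECT_* CGI variables (REDIRECT_URL, REDIRECT_STATUS and friends); whether those survive into the Rails request.env depends on how the app is wired up to Apache. A throwaway CGI script pointed at by the ErrorDocument directive will show what actually arrives -- sketched here in Python purely for inspection:
#!/usr/bin/env python3
# Temporarily point "ErrorDocument 404" at this CGI script to see which
# variables Apache forwards; REDIRECT_URL usually carries the original
# path (e.g. /foo.php).
import os

print("Content-Type: text/plain")
print()
for key in sorted(os.environ):
    if key.startswith("REDIRECT_") or key in ("REQUEST_URI", "PATH_INFO"):
        print(key, "=", os.environ[key])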

max age with nginx/passenger/memcached/rails2.3.5

I notice that in my production environment (where I have memcached implemented) I see a Cache-Control max-age header in Firebug any time I am looking at an index page (posts, for example).
Cache-Control max-age=315360000
In my dev environment that header looks like the following.
Cache-Control private, max-age=0, must-revalidate
As far as I know, I have not done anything special in my nginx.conf file to specify a max age for regular content; I do have expires max set for css, jpg, etc. Here is my nginx.conf file:
http://pastie.org/1167080
So why is this Cache-Control header being set, and how can I control it? Its side effect is rather bad. This is what happens:
1 - User requests the all_posts listing and gets a list of 10 pages (paginated).
2 - User views pages 1, 2 and 3, and the respective caches are created.
3 - User goes back to page 1 and Firefox does not even make a request to the server. Normally I would expect it to make the request and hit the cache created in step 2.
The other issue is that if a new post has been created, the cache is refreshed and the post should be at the top of page 1, but the user does not get to see it because the browser isn't hitting the server.
Please help!
Thanks
Update:
I tried setting expires_now in my index action. No difference; the max-age is still the same large value.
Could this be a problem with my max-age regex? I basically want it to match only asset files (images, js, css, etc.).
I think you're right that this is a problem with your max-age regex.
You're matching against this pattern: ^.+.(jpg|jpeg|gif|png|css|js|swf)?([0-9]+)?$
Because you have question marks ("this part is optional") after both parenthesized sections, the only mandatory part of the regex is that the request URI have at least two characters (.+.). In other words, it's matching pretty near every request to your site.
This is the pattern we use: \.(js|css|jpg|jpeg|gif|png|swf)$
That will match only requests for paths ending with a dot and then one of those seven patterns.
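If you want to convince yourself, here is a quick check of the two patterns, sketched in Python (the request paths are made up):
import re

broken = re.compile(r'^.+.(jpg|jpeg|gif|png|css|js|swf)?([0-9]+)?$')  # the old pattern
fixed  = re.compile(r'\.(js|css|jpg|jpeg|gif|png|swf)$')              # the suggested one

for path in ('/posts', '/posts/123', '/stylesheets/app.css'):
    print(path, bool(broken.search(path)), bool(fixed.search(path)))
# /posts               True  False   <- the old pattern matches ordinary pages too
# /posts/123           True  False
# /stylesheets/app.css True  True    <- only real asset paths match the new one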

Why don't we use such URL formats?

I am reworking the URL formats of my project. The basic format of our search URLs is this:
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
When searching with no search keyword and no exam filter, the URL will be:
www.projectname/module/search///<subject filter>/... other params ...
My question is: why don't we see such URLs with back-to-back slashes (three slashes after www.projectname/module/search)? Please note that I am not using .htaccess rewrite rules in my project anymore. This URL works perfectly well functionally. So, should I use this format?
For more details on why we chose this format, please check my other question:
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request, for a mix of compatibility and security reasons. When serving plain files, it is usual for any number of slashes between path segments to behave as one slash.
Blank URL path segments are not invalid in URLs, but they are typically avoided because relative URLs with blank segments may parse unexpectedly. For example, in /module/search, a link to //subject/param is not relative to the file, but a link to the server subject with path /param.
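A quick way to see this, sketched in Python (the base URL is just a stand-in for one of your search pages):
from urllib.parse import urljoin

base = 'http://www.projectname/module/search/kw/exam/subject'
print(urljoin(base, 'other/param'))      # http://www.projectname/module/search/kw/exam/other/param
print(urljoin(base, '//subject/param'))  # http://subject/param -- a different host entirely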
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable REQUEST_URI which gives the original form of the request without having elided slashes or done any %-unescaping like PATH_INFO does. So if you want to allow empty path segments, you can, but it'll cut down on your deployment options.
There are other strings than the empty string that don't make good path segments either. Using an encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put any old string in a segment; it'll have to be processed to remove some characters (often ‘slug’-ified to remove all but letters and numbers). Whilst you are doing this you may as well replace the empty string with _.
Probably because it's not clearly defined whether or not the extra / should be ignored.
For instance: http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. The server is treating the two URLs as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere or not, but it does seem to make sense (at least for the BBC website - if I type an extra /, it does what I meant it to do.)

Redirect 301 with hash part (anchor) #

One of our websites has a URL like this: example.oursite.com. We decided to move our site to a URL like this: www.oursite.com/example. To do this, we wrote a rewrite rule in our Apache server that redirects to our new URL with a 301 code.
Many websites link to us with URLs of the form example.oursite.com/#id=23. The problem is that the redirection erases the hash part of the URL in IE. As far as I know, the hash part is never sent to the server.
I wanted to implement the redirection with JavaScript to keep the hash part, but then search engines would not be aware that our URL changed (no 301 code returned).
I want search engines to be notified of our new URL (301) because we need to transfer the page rank to our new URL.
Is there a way to redirect with a 301 code and keep the hash part (#id=23) of the URL?
Search engines do in fact care about hash fragments; they frequently use them to highlight specific content on a page.
As to the question, though, anchor locations are unfortunately not sent to the server as part of the HTTP request. If you want to redirect a user, you will need to do it in JavaScript on the client side.
Good article: http://web.archive.org/web/20090508005814/http://www.mikeduncan.com/named-anchors-are-not-sent/
Seeing as the server will never see the # (ruling out 301 Redirects) and Google has deprecated their AJAX Crawling scheme, it seems that a front-end solution is the only way!
How I did it:
(function() {
  // map old hash-bang fragments to their new locations
  var redirects = [
    ['#!/about', '/about'],
    ['#!/contact', '/contact'],
    ['#!/page-x', '/pageX']
  ];
  for (var i = 0; i < redirects.length; i++) {
    if (window.location.hash == redirects[i][0]) {
      // replace() avoids leaving the old hash URL in the browser history
      window.location.replace(redirects[i][1]);
    }
  }
})();
I'm assuming that, because Google's crawlers do indeed execute JavaScript, the new pages will be indexed properly.
I've put it in a <script> tag directly underneath the <title> tag, so that it gets executed before any other JS/CSS. Note that this script should only be required for your index file.
I am fairly certain that the hash/page anchor/bookmark part of a URL is not indexed by search engines, and therefore has no effect on your page ranking. A Google search for "inurl:#" returns zero documents, which backs up my assumption. Links from external sites will be indexed without the hash.
You are right that the hash part isn't sent to the server, so as far as I am aware, there isn't a good way to create a redirection URL with the hash in it.
Because of this, it's up to the browser to correctly manage the hash during a redirect. Firefox 3.5 appears to do this successfully: if you append a hash to a URL that has a known redirect, you will see the URL change in the address bar to the new location, and the hash stays on there.
Edit: In response to the comment below, if there isn't a hash sign in the external URL for the part you need, then it is entirely possible to rewrite the URL. An Apache rewrite rule would take care of it:
RewriteCond %{HTTP_HOST} !^exemple\.oursite\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/(.*) http://www.oursite.com/exemple/$1 [L,R]
If you're not using Apache, then you'll have to look into the server docs for something similar.
Google has a special syntax for AJAX applications that is based on hash URLs: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
You could create a page on the old address that catches all requests and redirects to the new site with the correct address and code.
I did something like that, but it was in asp.net, which I guess it's not the language you use. Anyway there should be a way to do this in any language.
When returning status 301, your server is supposed to return a 'Location:' header which points to the new location. In practice, the way this is implemented varies; some servers provide the full URL (netloc and path), some just provide the new path and expect the browser to look for that path on the original netloc. It sounds like your rewrite rule is stripping the path.
An easy way to see what the returned Location header is, in the python shell:
>>> import httplib
>>> conn = httplib.HTTPConnection('exemple.oursite.com')
>>> conn.request('HEAD', '/')
>>> res = conn.getresponse()
>>> print res.getheader('location')
I'm afraid I don't know enough about mod_rewrite to tell you how to do the rewrite rule correctly, but this should give you an idea of what your server is actually telling clients to do.
The search bots don't care about hash tags. And if you are using them for some kind of Flash or AJAX calls, you have more serious problems than your 301 redirects not working: unless you have the content in an alternate form, the search engines are not indexing your site, and you are definitely suffering as far as SEO goes.
I registered my account so I can't edit.
zombat: I'm sorry, I made a mistake in my comment. The link to our video is exemple.oursite.com/#video_id=233. In this case, my rewrite rule in Apache doesn't work.
Nick Berardi: We changed the way our links work. We don't use # anymore, only for backward compatibility.
