Rails 404 handler for non-Rails URLs - ruby-on-rails

I've inherited a site with hundreds of scattered HTML and non-framework PHP files, which I am porting to Ruby on Rails 3.0.
As functionality is added in the Rails app, the corresponding pages are deleted from the document root; but, because there are often links to these in Google or from external sites, simply returning a 404 is not acceptable.
A URL like '/contact.php' should redirect to '/app/contact/', for example.
For the first few cases of this, I created simple stub html files at the old locations, with Meta tags within to perform the redirect. This doesn't scale well, particularly once I start replacing product pages, of which there are thousands.
My preference is to delete the old pages, then have the 404 handler dispatch these to the new Rails app, which will examine the URL using regexes and database lookup to try to figure out what the replacement page is, then issue a 301 redirect to that new page.
In httpd.conf, I placed the directive:
ErrorDocument 404 /app/error/handle404
# /app/error is a rails url.
When I hit "http://localhost/does-not-exist", this causes my ErrorController to be invoked, as expected.
However, within the controller, I cannot find the original path ("/does-not-exist") anywhere in request, request.headers, or ENV - I've been calling likely methods like request.request_uri (which contains /app/error/handle404), and examining request.headers and ENV without finding the expected original path.
The Apache access_log shows only the request for /does-not-exist, indicating that it transparently invoked /app/error/handle404 (without doing a redirect or causing a second request to be made).
How can I get access to the original URL?
Edit: to clarify, here is the sequence of events:
User hits legacy path like http://mysite/foo.php, probably coming from some ancient link from a blog.
...but foo.php no longer exists!
this is a 404, thus Apache invokes ErrorDocument
directive is "ErrorDocument 404 /railsapp/error/handle404"
Rails routes this to ErrorController action "handle404" - this is working correctly
problem: in ErrorController, request.request.uri, request.headers do not provide any clue as to which URL the user was actually trying to get to, like "/foo.php"; I need to know the original URL to serve up an appropriate replacement page.

As I couldn't find the original, non-rewritten URL in the Rails request, I ended up doing it in PHP - plain, old-fashioned, non-framework PHP with explicit mysqli_*() calls.
The PHP error handler receives the necessary information in the $_SERVER hash; $_SERVER['REQUEST_URI'] contains the original URI that I needed.
I look this up in a database, and if I find a corresponding entry, issue a 301 redirect to the new location; if there's no entry, I simply display a 404 page to the user.
Simplified (PHP):
$url = $_SERVER['REQUEST_URI'];
$redir = lookupRedirect($url); # database stuff here
if (! $redir) {
include ('404.phtml');
} else {
header("Status: 301");
header("Location: " . $redir['new_url']);
}
It's an ugly kluge, but I just couldn't find a way to make the Rails app aware of the error URL.

Related

Web page without real files corresponding to URLs?

geniuses!
I need to make a demo page acting like DBpedia (http://dbpedia.org).
Two pages from different URLs,
http://dbpedia.org/page/Barack_Obama and
http://dbpedia.org/page/Lionel_Messi,
show different content.
I cannot really think DBpedia has million pages for all individual entities (E.g., Barack Obama and Lionel Messi).
How can I handle such URL request?
I know a bit about GET request but example URLs above do not seem like to use GET method.
Thank you in advance!
ps. Please teach me the process. Something like:
1. A user enters URL on a browser.
2. ...
When visiting http://dbpedia.org/page/Barack_Obama, your browser does send a GET request, e.g.:
GET /page/Barack_Obama HTTP/1.1
Host: dbpedia.org
The server (dbpedia.org) receives this GET request and then decides what to do. From the outside, you can’t know (for sure) how the server does something. The two common cases are:
Static web page: a file gets served that exists somewhere on the server. The URL path is often mapped to the server’s file system, but that’s not necessarily the case.
Dynamic web page: a file gets served that is generated on the fly. The content often comes from a database, but that’s not necessarily the case.
After trying some solutions, I'm now using Spring Web MVC framework.
Maybe Dynamic web page solution mentioned in unor's answer.
#Controller
public class SimpleDisplayController {
#RequestMapping("/page/{symbolicName:[!-z]+}")
public String displayEntity(HttpServletRequest hsr, Model model) {
String reqPath = (String) hsr.getAttribute(HandlerMapping.PATH_WITHIN_HANDLER_MAPPING_ATTRIBUTE);
String entityLb = reqPath.substring(reqPath.lastIndexOf("/"));
model.addAttribute("label", entityLb);
return "entity";
}
}
I could get request using regex as you can see at the 4th line: #RequestMapping("/page/{symbolicName:[!-z]+}").
The function above returns the string 'entity' which is the name of a HTML file serving as a template.
The following code is a body part of the example HTML template.
<body>
<p th:text="'About entity ' + ${label} + '...'" />
</body>
Since I add an attribute with the key 'label' in the controller above, the template can process ${label}.
In the example HTML template, th:text is a snytax of Thymeleaf (Java library to make an XML/XHTML/HTML5 template) which is supported by Spring.

Dynamically creating web page (using portion of URL as a variable)

In a site I'm developing I have a page that presents a post based on the variable in the url:
http://www.mywebsite.com?id=18
So this would load the post who's ID is 18 in the mySQL database.
I would like the create the same effect, but with the url being something like:
http://www.mywebsite.com/articles/title-of-article-18/
Would there be a way to create these pages on the fly with dynamic post content, where the url would originally be created by:
"http://www.mywebsite.com/articles/" + postTitle
You are looking for mod_rewrite and rewriting of urls via htaccess.
What it does is it takes patterns from your url, and the htaccess file detects the pattern redirects that to http://www.mywebsite.com?id=18. Users still see the nice url.
The directory /articles/title-of-article-18/ will not actually exist, and the user never really reaches that location because the htaccess secretly changes the url that the server processes.
See
http://en.wikipedia.org/wiki/Rewrite_engine
or a random tutorial I found:
http://www.blogstorm.co.uk/htaccess-mod_rewrite-ultimate-guide/
try url-rewriting
http://www.simple-talk.com/dotnet/asp.net/a-complete-url-rewriting-solution-for-asp.net-2.0/

Can I redirect www.abc.com/a b c/test.html to www.abc.com/a-b-c/test.html with MVC3

I am changing the way links show on my web site. I changed from allowing space in the URL to a new format where the URL has dashes where spaces used to be.
This effects only ONE string in the middle of the URL.
Google has indexed many of my pages with the old spaces in the URL but now they show up as 404s. Is it possible for me to put some code in place (temporary) that can redirect those URLs with spaces to the ones with dashes. I think it's a 403 redirect. A permanent redirect.
Thanks,
We wen't through the same thing recently. We ended up creating a LegacyController, which basically called into RedirectToActionPermanent or RedirectToRoutePermanent. (HTTP 301 - Moved Permanently).
Ideally, you should let IIS7 do the redirects, but we couldn't, because we needed to call our DB in order to figure out where to go.
If your redirect is as simple as you say it is (e.g no "dynamic" info in the URL), then you should use IIS.
Why don't you try to configure you routing to support both: legacy and new routes?
Basically /a b c/page and /a-b-c/page should be mapped to the same action of controller.

301 redirect in ASP.NET MVC from known URL's

I have a .Net MVC 1 site that replaced a legacy. Google still has a stack of old URL's in its index and i need to 301 redirect them. All of the old URL's are .html or .php pages, i also have a db table for old urls and their new equivalent. I know what i need to do, im just unsure of how to do it! Here are my thoughts
somewhere in the global.asax catch the url requested using a regexp
do a db lookup to hopefully find the new url
if we found the new url then 301 redirect it. if not either 301 to the homepage or throw a 404
Ive tried hacking around myself with little luck, plus all of the examples i can find online dont really cover this example. Would really like to do this via code rather than adding about 80 seperate routes to the global.asax
Any help or links is greatly appreciated
You could probably use Handlers. If the URLs are distinct enough, you could write a pretty broad entry in to urlMappings section of your web.config and then use the handler to re-route the traffic.
I'd use the catch all route and before returning that 404 I'd insert the logic to check if it needs a 301 instead.
[UrlRoute(Name = "404", Path = "{*path}", Order = 100)]
public ViewResult NotFound(string path)
{
}

Redirect 301 with hash part (anchor) #

One of our website has URL like this : example.oursite.com. We decided to move our site with an URL like this www.oursite.com/example. To do this, we wrote a rewrite rule in our Apache server that redirect to our new URL with a code 301.
Many websites link to us with URLs of the form example.oursite.com/#id=23. The problem is that the redirection erase the hash part of the URL with IE. As far as I know, the hash part is never sent to the server.
I wanted to implement the redirection with javascript to keep the hash part, but the Search Engine will not be aware that our URL changed. (no code 301 returned)
I want the Search Engine to be notified of our new URL(301) because we need to transfer the page rank to our new URL.
Is there a way to redirect with a 301 code and keep the hash part(#id=23) of in the URL ?
Search engines do in fact care about hash tags, they frequently use them to highlight specific content on a page.
To the question, however, anchor locations are unfortunately not sent to the server as part of the HTTP request. If you want to redirect a user, you will need to do this in Javascript on the client side.
Good article: http://web.archive.org/web/20090508005814/http://www.mikeduncan.com/named-anchors-are-not-sent/
Seeing as the server will never see the # (ruling out 301 Redirects) and Google has deprecated their AJAX Crawling scheme, it seems that a front-end solution is the only way!
How I did it:
(function() {
var redirects = [
['#!/about', '/about'],
['#!/contact', '/contact'],
['#!/page-x', '/pageX']
]
for (var i=0; i<redirects.length; i++) {
if (window.location.hash == redirects[i][0]) {
window.location.replace(redirects[i][1]);
}
}
})();
I'm assuming that because Google crawlers do indeed execute Javascript, the new pages will be indexed properly.
I've put it in a <script> tag directly underneath the <title> tag, so that it get executed before any other JS/CSS. Note that this script should only be required for your index file.
I am fairly certain that the hash/page anchor/bookmark part of a URL is not indexed by search engines, and therefore has no effect on your page ranking. Doing a google search for "inurl:#" returns zero documents, so that backs up my assumption. Links from external sites will be indexed without the hash.
You are right in that the hash part isn't sent to the server, so as far as I am aware, there isn't a good way to be able to create a redirection url with the hash in it.
Because of this, it's up to the browser to correctly manage the hash during a redirect. Firefox 3.5 appears to do this successfully. If you append a hash to a URL that has a known redirect, you will see the URL change in the address bar to the new location, but the hash stays on there successfully.
Edit: In response to the comment below, if there isn't a hash sign in the external URL for the part you need, then it is entirely possible to rewrite the URL. An Apache rewrite rule would take care of it:
RewriteCond %{HTTP_HOST} !^exemple\.oursite\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/(.*) http://www.oursite.com/exemple/$1 [L,R]
If you're not using Apache, then you'll have to look into the server docs for something similar.
Google has a special syntax for AJAX applications that is based on hash URLs: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
You could create a page on the old address that catches all requests and redirects to the new site with the correct address and code.
I did something like that, but it was in asp.net, which I guess it's not the language you use. Anyway there should be a way to do this in any language.
When returning status 301, your server is supposed to return a 'Location:' header which points to the new location. In practice, the way this is implemented varies; some servers provide the full URL (netloc and path), some just provide the new path and expect the browser to look for that path on the original netloc. It sounds like your rewrite rule is stripping the path.
An easy way to see what the returned Location header is, in the python shell:
>>> import httplib
>>> conn = httplib.HTTPConnection('exemple.oursite.com')
>>> conn.request('HEAD', '/')
>>> res = conn.getresponse()
>>> print res.getheader('location')
I'm afraid I don't know enough about mod_rewrite to tell you how to do the rewrite rule correctly, but this should give you an idea of what your server is actually telling clients to do.
The search bots don't care about hash tags. And if you are using them for some kind of flash or AJAX calls, you have more serious problems than your 301 redirects don't work. Because unless you have the content in an alternate form, the search engines are not indexing your site and you are definitely suffering as far as SEO goes.
I registered my account so I can't edit.
zombat : I'm sorry I made a mistake in my comment. The link to our video is exemple.oursite.com/#video_id=233. In this case, my rewrite rule in Apache doesn't work.
Nick Berardi: We changed the way our links work. We don't use # anymore, only for backward compatibility

Resources