We want to remove the /dotnetnuke/ from all 300 pages of our website, which has been running since Feb 09.
Google isn't indexing all of our pages, just 98. I'm thinking that the /dotnetnuke/ is pushing our content too deep in our site for Google to find(?)
We also don't have any PageRank, although our site appears on page one for most search queries. Obviously we don't want to lose our position in Google.
Would you advise that we remove the /dotnetnuke/ from our URLs? If so, should we create a new site and use 301 redirects, or is there a way of removing the /dotnetnuke/ from our existing URLs while still keeping our Google history?
Many thanks
DotNetNuke uses its own URL rewriting, which is built into the framework. DotNetNuke uses the provider model, so you can also plug in your own URL rewriter or obtain one from a third party. If that is what you need, I'd suggest taking a look at Bruce Chapman's iFinity URL Rewriter as a quality free third-party extension to DotNetNuke. He also offers a fancier commercial version called URL Master, which I haven't needed to use yet.
However, I believe the /dotnetnuke/ you're referring to may not actually be part of your "pages," but the actual alias of your DotNetNuke portal (i.e. www.yoursite.com/dotnetnuke). This would mean that /dotnetnuke/ is part of the base path for all pages, because DotNetNuke uses the base path as the identifier to determine which portal to load. If this is the case, you could potentially just change your portal alias to www.yoursite.com (depending on the level of access you have to the site/server).
Lastly, sometimes virtual pages do not get included in DotNetNuke's site map. If you are using a third-party module for your dynamic content, it may in fact not be represented in your site map. I'd look into which pages are currently represented in your site map as well.
In IIS 7 you can use the URL Rewrite module to hide /dotnetnuke/.
A 301 redirect will also work fine (just make sure you are not using a 302; a 302 is a temporary redirect, so Google won't treat the move as permanent).
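To make that concrete, here is a minimal sketch of the redirect logic, written as Express-style JavaScript purely for illustration (in IIS 7 you would express the same pattern declaratively with a URL Rewrite rule); everything beyond the /dotnetnuke/ prefix is an assumption:
// Sketch: permanently redirect any /dotnetnuke/ URL to the same path
// without the prefix, assuming the pages already resolve at the
// shorter URLs (e.g. after changing the portal alias).
const express = require('express');
const app = express();

app.use((req, res, next) => {
  if (req.path.startsWith('/dotnetnuke/')) {
    // 301 (permanent), not 302 (temporary), so Google carries the
    // old URL's history over to the new one.
    return res.redirect(301, req.path.replace('/dotnetnuke', ''));
  }
  next();
});

app.listen(80);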
Another answer, in addition to the first two: you are running DNN on GoDaddy hosting, and GoDaddy has a strange way of setting up sites. Here is how you can remove that problem:
Set up a second (non-primary) domain. Under domain management, you can actually assign the second domain to point to a subdirectory. Make sure that the subdirectory is whatever directory you set DNN up in.
I might have this slightly wrong, as I got it off GoDaddy's site, but I have done it twice and got it to work correctly.
A month ago I relaunched a website in the TYPO3 CMS. Before that, the site ran on the Joomla CMS.
In the Joomla configuration, SEO-friendly links were disabled, so Google indexed the page URLs like this:
www.domain.de/index.php?com_component&itemid=123....
for example.
Now, a month later (after the TYPO3 relaunch), these links are still visible in Google because the URLs don't return a 404 error. That's because "index.php" also exists in TYPO3, and TYPO3 doesn't care about the additional query string/variables; it returns a 200 status code and shows the front page.
In Google Webmaster Tools it's possible to delete single URLs from the Google index, but that way I would have to delete about 10000 URLs manually...
My question is: is there a way to remove these old URLs from the Google index?
Greetings
With this number of URLs there is only one sensible solution: implement proper 404 handling in your TYPO3 site, or even better, redirects to the same content in its new TYPO3 location.
You can use TYPO3's built-in handler, called pageNotFound_handling (search for it in Install Tool > All configuration). It supports options like REDIRECT, for redirecting to some page, or even USER_FUNCTION, which allows you to use your own PHP script; check the description in the Install Tool.
You can also write a simple condition in TypoScript that checks whether the typical Joomla parameters exist in the URL; that way you can easily return a custom 404 page. If it's important to you to build a more sophisticated condition (for example, you want to redirect links that previously pointed to some gallery in Joomla to the new gallery in TYPO3), you can make use of a userFunc condition, and that would probably be the best option for SEO.
If these URLs share a manageable number of common indicators, you could redirect these links with a rule in your virtual host configuration or .htaccess so that Google runs into the correct error message.
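As a hedged sketch of that idea in Express-style JavaScript (the real rule would live in TYPO3's pageNotFound_handling, a TypoScript condition, or .htaccess; the itemid parameter is the Joomla indicator from the example URL above, and the option check is an assumption):
// Sketch: old Joomla URLs all hit index.php with Joomla-typical query
// parameters, so catch those and return a real 404 (or a 301 to the
// matching TYPO3 page, if a mapping exists).
const express = require('express');
const app = express();

app.get('/index.php', (req, res, next) => {
  const looksLikeJoomla = 'itemid' in req.query || 'option' in req.query;
  if (looksLikeJoomla) {
    // A genuine 404 lets Google drop the URL from its index instead
    // of re-crawling a 200 front page forever.
    return res.status(404).send('Not Found');
  }
  next(); // not a Joomla leftover, let the CMS handle it
});

app.listen(80);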
I wrote a Google Chrome extension to remove URLs in bulk in Google Webmaster Tools. Check it out here: https://github.com/noitcudni/google-webmaster-tools-bulk-url-removal.
Basically, it's a glorified for loop. You put all the URLs in a text file. For example,
http://your-domain/link-1
http://your-domain/link-2
Having installed the extension as described in the README, you'll find a new "choose a file" button.
Select the file you just created. The extension reads it in, loops through all the URLs, and submits them for removal.
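In spirit, the extension's core is something like the following sketch, where submitRemoval is a hypothetical stand-in for the form-driving the extension actually does inside Webmaster Tools:
// Sketch of the "glorified for loop": read a text file of URLs, one
// per line, and submit each one for removal.
const fs = require('fs');

function submitRemoval(url) {
  // Hypothetical stand-in: the real extension fills in and submits
  // the Webmaster Tools removal form for each URL.
  console.log('submitting removal for', url);
}

const urls = fs.readFileSync('urls.txt', 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter((line) => line.length > 0);

for (const url of urls) {
  submitRemoval(url);
}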
I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, like for a certain Ajax framework or something, since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
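For illustration, that mapping is mechanical enough to sketch in a few lines of JavaScript (the example URL is the one from above):
// Sketch of Google's #! to _escaped_fragment_ mapping. (The full
// scheme additionally requires escaping some characters inside the
// fragment, which is omitted here.)
function escapedFragmentUrl(url) {
  const parts = url.split('#!');
  if (parts.length < 2) return url; // no hashbang, nothing to map
  const sep = parts[0].includes('?') ? '&' : '?';
  return parts[0] + sep + '_escaped_fragment_=' + parts[1];
}

console.log(escapedFragmentUrl('http://www.example.com/ajax.html#!key=value'));
// -> http://www.example.com/ajax.html?_escaped_fragment_=key=value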
The octothorpe/number-sign/hashmark has a special significance in a URL: it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of a URL. If you use Wikipedia, you will see that most pages have a table of contents and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';) to a different URL, or to the same URL without specifying an anchor, the browser will load the entire page from that URL. Try this in Firebug or Safari's Javascript console. Load http://minimal-github.gilesb.com/raganwald. Now in the Javascript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
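A minimal sketch of that restore step (showSection and the state names are invented for illustration):
// Restore application state from the anchor, both on initial load
// (a bookmark or shared link) and when the user presses back.
function showSection(name) {
  console.log('rendering state:', name); // stand-in for real rendering
}

function restoreFromAnchor() {
  const anchor = window.location.hash.replace(/^#/, '');
  showSection(anchor || 'home');
}

window.addEventListener('hashchange', restoreFromAnchor); // back button
restoreFromAnchor(); // handle an anchor included on first load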
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
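A sketch of that injection pattern, using the plain XMLHttpRequest of the era (the /fragments/ URL scheme and the #content element are assumptions):
// Fetch just the content fragment over AJAX and inject it into the
// current DOM instead of reloading the whole page.
function loadSection(name) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', '/fragments/' + name + '.html');
  xhr.onload = function () {
    if (xhr.status === 200) {
      document.getElementById('content').innerHTML = xhr.responseText;
    }
  };
  xhr.send();
}

loadSection('profile'); // e.g. swap in a hypothetical profile section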
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of The Single Page Interface Manifesto cited by raganwald.
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used in Facebook and Twitter is the use of the hash # in URLs.
The character ! is added only for Google's purposes; this notation is a Google "standard" for crawling web sites that are AJAX-intensive (in the extreme, Single Page Interface web sites). When Google's crawler finds a URL with #!, it knows that an alternative conventional URL exists that provides the same page "state", but in this case at load time.
Although the #! combination is very interesting for SEO, it is only supported by Google (as far as I know); with some JavaScript tricks you can build SPI web sites that are SEO-compatible with any web crawler (Yahoo, Bing...).
The SPI Manifesto and demos do not use Google's format of ! in hashes; this notation could easily be added, and SPI crawling could be even easier (UPDATE: the ! notation is now used and remains compatible with other search engines).
Take a look at this tutorial; it is an example of a simple ItsNat SPI site, but you can pick up some ideas for other frameworks. This example is SEO-compatible with any web crawler.
The hard problem is generating any (or selected) "AJAX page state" as plain HTML for SEO. In ItsNat this is very easy and automatic: the same site is at the same time SPI-based and page-based for SEO (or for when JavaScript is disabled, for accessibility). With other web frameworks you can always follow the double-site approach: one site is SPI-based and another is page-based for SEO; for instance, Twitter uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
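For comparison, a minimal pushState sketch (the render function is a stand-in, and the example path is borrowed from the Twitter URL above):
// The History API alternative: a real, crawlable path goes into the
// address bar with no page reload, and popstate restores state when
// the user presses the back button.
function render(path) {
  console.log('rendering', path); // stand-in for real rendering
}

function navigate(path) {
  history.pushState({ path: path }, '', path); // e.g. '/BoltClock'
  render(path);
}

window.addEventListener('popstate', function (event) {
  render(event.state ? event.state.path : window.location.pathname);
});
// navigate('/BoltClock'); // e.g. wired to a link's click handler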
As a good follow-up to all of this: Twitter, one of the pioneers of hashbang URLs and the single-page interface, admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
The article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.
Scenario
The web server gets a request for http://domain.com/folder/page. The Accept-Language header tells us the user prefers Greek, with the language code el. That's good, since we have a Greek version of the page.
Now we could do one of the following with the URL:
Return a Greek version keeping the current URL: http://domain.com/folder/page
Redirect to http://domain.com/folder/page/el
Redirect to http://domain.com/el/folder/page
Redirect to http://el.domain.com/folder/page
Redirect to http://domain.com/folder/page?hl=el
...other alternatives?
Which one is best? Pros, cons from a user perspective? developer perspective?
I would not go for option 1 if your pages are publicly available, i.e. you are not required to log in to view the pages.
The reason is that a search engine will not scan the different language versions of the page.
The same reason goes against option 5. A search engine is less likely to identify two pages as separate pages if the language identification goes in the query string.
Let's look at option 4, placing the language in the host name. I would use that option if the different language versions of the site contain completely different content. On a site like Wikipedia, for example, the Greek version contains its own complete set of articles, and the English version contains another set of articles.
So if you don't have completely different content (which it doesn't seem like from your post), you are left with option 2 or 3. I don't know if there are any compelling arguments for one over the other, but no. 3 looks nicer in my eyes. So that is what I would use.
But just a comment for inspiration. I'm currently working on a web application that has 3 major parts: one public, and two parts for two different user types. I have chosen the following URL scheme (with en referring to the language, of course):
http://www.example.com/en/x/y/z for the public part.
http://www.example.com/part1/en/x/y/z for the one private part
http://www.example.com/part2/en/x/y/z for the other private part.
The reason for this is that if I were to split the three parts up into separate applications, it would be a simple reconfiguration in the web server, since I have the name of the part at the top of the path. E.g. if we were to use a commercial CMS system for the public part of the site, only that path prefix would need to be repointed.
Edit:
Another argument against option 1 is that if you ONLY listen to Accept-Language, you are not giving the user a choice. The user may not know how to change the language set up in a browser, or may be using a friend's computer set up with a different language. You should at least give the user a choice (storing it in a cookie or the user's profile).
I'd choose number 3, redirect to http://example.com/el/folder/page, because:
Language selection is more important than page selection, so the selected language should go first in a truly human-readable URL.
Only one domain gets all of Google's PageRank. That's good for SEO.
You could advertise your site locally with the language code built in. E.g. in Greece you would advertise http://example.com/el/, so every local visitor lands on the Greek version of the site and avoids language-choosing frustration.
Alternatively, you can go for number 5: it is fine for Google and friends, but not as nice for a user.
Also, we should refrain from redirecting a user anywhere unless required. Thus, in my mind, a user opening http://example.com/folder/page should not get a redirect, but the page in the default language.
Number four is the best option, because it specifies the language code pretty early. If you are going to provide any redirects, always be sure to use a canonical link tag.
Pick option 5, and I don't believe it is bad for SEO.
This option is good because it shows that the content for say:
http://domain.com/about/corporate/locations is the same as the content in
http://domain.com/about/corporate/locations?hl=el except that the language differs.
The hl parameter should override the Accept-Language header so that the user can easily control the matter. The header would only be used when the hl parameter is missing. Granted, linking is a little complicated by this. It should probably be addressed either through a cookie that keeps redirects going to the language chosen by the hl parameter (as it may have been changed by the user from the Accept-Language setting), or by processing all the links on the page to add on the current hl parameter.
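That precedence is easy to pin down in a sketch (Express-style JavaScript for illustration; the cookie name and simplified parsing are assumptions):
// Pick the language: explicit hl parameter first, then a remembered
// cookie, then the Accept-Language header, then a default.
function pickLanguage(req) {
  if (req.query.hl) return req.query.hl; // explicit user choice wins
  const cookie = /(?:^|;\s*)hl=([^;]+)/.exec(req.headers.cookie || '');
  if (cookie) return cookie[1]; // choice remembered from a prior visit
  const header = req.headers['accept-language'] || '';
  const primary = header.split(',')[0].split(';')[0].trim(); // e.g. 'el-GR'
  return primary.split('-')[0] || 'en'; // fall back to a default
}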
The SEO issues can be addressed by creating index files for everything, like Stack Overflow does; these could include multiple sets of indices for the different languages, hopefully encouraging the site to show up in results for the non-default language.
The use of 1 takes away the differentiator in the URL. The use of 2 and 3 suggest that the page is different, possibly beyond just language, like wikipedia is. And the use of 4 suggests that the server itself is separated, perhaps even geographically.
Because there is a surprisingly poor correlation of geographic location to language preferences, the issue of providing geographically close servers should be left to a proper CDN setup.
My own choice is #3: http://domain.com/el/folder/page. It seems to be the most popular out there on the web. All the other alternatives have problems:
http://domain.com/folder/page --- Bad for SEO?
http://domain.com/folder/page/el --- Doesn't work for pages with parameters. This looks weird: ...page?par1=x&par2=y/el
http://domain.com/el/folder/page --- Looks good!
http://el.domain.com/folder/page --- More work needed to deploy since it requires adding subdomains.
http://domain.com/folder/page?hl=el --- Bad for SEO?
It depends. I would choose number four personally, but many successful companies have different ways of achieving this.
Wikipedia uses subdomains for various languages (el.wikipedia.org).
So does Yahoo (es.yahoo.com for Spanish), although it doesn't support Greek.
So does Gravatar (el.gravatar.com)
Google uses a /intl/el/ directory.
Apple uses a /gr/ directory (albeit in English and limited to an iPhone page)
It's really up to you. What do you think your customers will like the most?
None of them. A 'normal user' wouldn't understand (and so wouldn't remember) any of those abbreviations.
In order of preference I'd suggest:
http://www.domain.gr/folder/page
http://www.domain.com/
http://domain.com/gr/folder/page
3 or 4.
3: Can be easily dealt with using htaccess/mod_rewrite. The only downside is that you'd have to write some method of automatically injecting the language code as the first segment of the URI (see the sketch after this answer).
4: Probably the best method. Using host headers, it can all be sent to the same web application/content but you can then use code to extract the language code and go from there.
Simples. ;)
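For option 3, the injection/extraction mentioned above amounts to something like this sketch (the supported-language list is an assumption):
// Peel the language code off the front of the URI and hand the
// remainder to the application; no prefix means redirect, or serve
// the default language, per the earlier answers.
const SUPPORTED = ['en', 'el'];

function splitLanguage(path) {
  const parts = path.split('/').filter(Boolean); // drop empty segments
  if (SUPPORTED.indexOf(parts[0]) !== -1) {
    return { lang: parts[0], rest: '/' + parts.slice(1).join('/') };
  }
  return { lang: null, rest: path };
}

console.log(splitLanguage('/el/folder/page'));
// -> { lang: 'el', rest: '/folder/page' }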
I prefer 3 or 4
I'm currently working on a project to redevelop a public sector website. The website uses Google Analytics and has done so since April 2007, so a lot of data has been captured.
The new site will be developed using ASP.NET MVC, and as part of the redevelopment we want to make the URLs more SEO-friendly by getting rid of the ?id=123 and replacing it with /news/2010/01/this-is-the-title.aspx. Other pages will have different routes; however, they may have the same content.
I've already developed a legacy route engine which can issue a 301 and redirect the user to the new page, but I am uncertain what will happen to the Google Analytics data.
I really don't want to start over with the analytics. Is there a way to join up the new URL with the old URL in Google Analytics?
I guess you could make use of the _trackPageview function to track a different page name. See "How do I track AJAX applications?".
For that, you would obviously need to know programmatically the name of the old URL.
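A sketch with the classic asynchronous ga.js API of that era: report the new page under its old URL so the old and new reports line up. The lookup table is hypothetical; in practice the mapping would come from the same data that drives the legacy 301 route engine:
// Map the new route back to the URL Analytics already knows about,
// and record the pageview under that old name.
var lookupTable = {
  // hypothetical entry, using the URLs from the question
  '/news/2010/01/this-is-the-title.aspx': '/news.aspx?id=123'
};

function legacyUrlFor(newPath) {
  return lookupTable[newPath] || newPath; // fall back to the new URL
}

var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXXX-X']); // existing profile (placeholder)
_gaq.push(['_trackPageview', legacyUrlFor(window.location.pathname)]);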
It might be less hassle to make a clean cut and start a new profile from scratch with the new URLs.