The following URLs point to the same resource:
http://www.domain.com/books/123
http://www.domain.com/books/123-Harry-Potter
http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows
I would like to use (1) as the canonical URL for OpenGraph/Facebook, so that if someone "likes" (2) or (3), the like counts toward (1).
But I would like to use (3) as the canonical URL for Google, because of SEO.
Is this recommended?
Hey Victor, that's a great question. As social media becomes more important, it is going to be interesting to see how its technical requirements start to clash with other things like SEO.
From an SEO standpoint, it is pretty common for a website to have multiple URLs that point to a single page of unique content. Generally this is due to extra query parameters or some type of domain canonicalization issue (e.g. www vs. non-www). The best way to handle this would be to use the rel=canonical tag in the head of your pages; Google and Bing will treat it much like a 301 redirect to the canonical URL. The search engines will index all three URLs, but they will merge the rank of the other two into your canonical page, and you will achieve your search objective.
<html>
<head>
<link rel="canonical" href="http://www.domain.com/books/123-Harry-Potter-And-The-Deathly-Hallows" />
</head>
</html>
Here's Google's description of the rel=canonical feature: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
From a user's perspective on a search engine, this seems like a good experience, so I don't think you'll run into any issues with the search engines (e.g. I don't completely agree with @Jan Hančič), but you might want to double-check the quality standards for the Facebook Like feature (http://developers.facebook.com/docs/opengraph). This seems to be a grey area in their best practices.
Good luck!
Choose one main URL and 301-redirect the rest to it.
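A minimal PHP sketch of that approach, assuming the URL from (3) is the one chosen as the main URL (the slug would really come from your own routing):
<?php
// Hypothetical front-controller snippet: 301-redirect any non-canonical
// book URL (e.g. /books/123 or /books/123-Harry-Potter) to the chosen one.
$canonical = '/books/123-Harry-Potter-And-The-Deathly-Hallows';

if ($_SERVER['REQUEST_URI'] !== $canonical) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.domain.com' . $canonical);
    exit;
}
?>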
This question is about SEO in URL naming. I just want to know: does SEO really weigh that much more than user experience? Where do you see the limit to how far SEO should go before it starts ruining people's experience? Take this example: I have a page that contains information about an art contest that is running, or has run, on my website.
Which URL is better?
example.com/contest/{contest-id}/{name-of-contest}
or
example.com/online-graphic-design-contest/{contest-id}/{name-of-contest}
Is stuffing keywords such as 'online', 'graphic', 'design' and 'contest' into the URL really that much more important for SEO than having a shorter, more readable URL like the first one?
The best way to think about SEO these days is through the perspective of the user first, and then through the perspective of the search engine. I would argue that your second URL is much better in both cases. It's more descriptive to the user (we have an "online graphic design contest") and also to search engines.
Google has made it apparent that their focus is on providing content that is relevant to the user, and the best way to be relevant is with content that is descriptive and fits what your users are searching for. I don't think you're keyword stuffing if you're using a single natural-language phrase in the URL to describe the content of the page. That portion of the URL should also match your page title, the header tags on the page, and so on.
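As a rough illustration (the helper below is hypothetical, not from the question), the descriptive part of the URL can simply be generated from the same title you already show on the page:
<?php
// Hypothetical helper: turn the page title into a readable, keyword-bearing slug
// so the URL, <title> and headers all share the same natural-language phrase.
function slugify($title) {
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug); // non-alphanumerics become hyphens
    return trim($slug, '-');
}

// e.g. example.com/online-graphic-design-contest/42/summer-poster-challenge
echo '/online-graphic-design-contest/42/' . slugify('Summer Poster Challenge');
?>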
Here are some useful resources:
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
http://linchpinseo.com/user-focused-seo-redefining-what-search-engine-optimization-is
In my application I have localized urls that look something like this:
http://examle.com/en/animals/elephant
http://examle.com/nl/dieren/olifant
http://examle.com/de/tiere/elefant
This question is mainly for Facebook Likes, but I guess I will hit similar problems when I start thinking about search engine crawlers.
What kind of URL would you expect as the canonical URL? I don't want to use the exact English URL, because I want people who click the link to be forwarded to their own language (based on browser settings or IP).
The IP lookup is not something that I want to do on every page hit. Besides that, I would need to incorporate more 'state' into my application, because I would have to check whether a user has already been forwarded to his own locale or is browsing the English version on purpose.
I guess it is going to be something like:
http://example.com/something/animals/elephant
or maybe without any language identifier at all:
http://example.com/animals/elephant
but that is a bit harder to implement and has a bigger chance of URL clashes in the future (in the rare case that I ever add a category called en or de).
Summary
What kind of URL would you expect as the canonical URL? Is there already a standard established for this?
I know this question is a bit old, but I was facing the same issue.
I found this:
Different language versions of a single page are considered duplicates only if the main content is in the same language (that is, if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates).
That can be found here: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
From this I can conclude that we should add locales to canonicals.
I did find one resource that recommends not using the canonical tag with localized addresses. However, Google's documentation does not specify and only mentions subdomains in another context.
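A small sketch of what "adding locales to canonicals" could look like in practice; the locale and translated path are assumptions standing in for whatever your routing layer provides:
<?php
// Each language version declares its own localized URL as canonical,
// rather than all versions pointing at the English URL.
$locale        = 'nl';                 // assumed to come from your router
$localizedPath = '/nl/dieren/olifant'; // assumed translated path for this page

printf('<link rel="canonical" href="http://examle.com%s" />', htmlspecialchars($localizedPath));
?>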
There is more than just language that you need to think about.
It's typically a tuple of three: {region, language, property}.
If you only have one website, then you have only {region, language}.
Every piece of content can be different in this three-dimensional space, or at least presented differently. But it is still the same piece of content, so you'd like to centralize the management of editorial signals, promotions, tracking, etc. Think about search systems: you'd like PageRank to be merged across all instances of the article, not spread thin.
I think there is a standard solution: Canonical URL
Put language/region into the domain name
example.com
uk.example.com
fr.example.com
Now you have a choice of whether to attach a cookie to the subdomain (for language/region) or to the whole domain (for user tracking)!
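A quick sketch of that choice in PHP (cookie names and lifetimes here are just assumptions):
<?php
// Language/region cookie scoped to one subdomain only.
setcookie('lang', 'fr', time() + 86400 * 365, '/', 'fr.example.com');

// Tracking cookie scoped to the whole domain, shared by uk./fr./www.
setcookie('visitor_id', 'abc123', time() + 86400 * 365, '/', '.example.com');
?>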
On every HTML page add a link to the canonical URL:
<link rel="canonical" href="http://example.com/awesome-article.html" />
Now you are done.
There certainly is no "standard" beyond the fact that it has to be a URL. What you do see on many commercial websites is exactly what you describe:
<protocol>://<server>/<language>/<more-path>
For the "language-tag" you may follow RFCs as well. I guess your 2-letter-abbrev is quite fine.
I only disagree on the <more-path> part of the URL. If I understand you right, you are thinking about translating each page's path into the local language? I would not do that. Maybe I am not the standard user, but I personally like to manually tinker with URLs, i.e. if the URL shown is http://examle.com/de/tiere/elefant but I don't trust the content to be translated well, I would manually try http://examle.com/en/tiere/elefant, and that would not bring me to the expected page. And since I also dislike URLs like http://ex.com/with-the-whole-title-in-the-url-so-the-page-will-be-keyworded-by-search-engines, my preference would be to just exchange the <language> part and use generic English (or any other single language) for <more-path>. E.g.:
http://examle.com/en/animals/elephant
http://examle.com/nl/animals/elephant
http://examle.com/de/animals/elephant
If your site is something like Wikipedia, then I would agree with your scheme of translating the <more-path> as well.
Maybe these Google guidelines can help with your issue: https://support.google.com/webmasters/answer/189077?hl=en
They say that many websites serve users across the world with content targeted to users in a certain region. It is advised to use rel="alternate" hreflang="x" annotations so that the correct language or regional URL is served in search results.
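Applied to the URLs from the question, that could look roughly like this (a sketch; every language version would emit the same set of annotations):
<?php
// hreflang annotations telling search engines which URL serves which language.
$alternates = [
    'en' => 'http://examle.com/en/animals/elephant',
    'nl' => 'http://examle.com/nl/dieren/olifant',
    'de' => 'http://examle.com/de/tiere/elefant',
];

foreach ($alternates as $lang => $url) {
    printf("<link rel=\"alternate\" hreflang=\"%s\" href=\"%s\" />\n", $lang, $url);
}
?>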
I have a product page with options in a select list (e.g. the color of the product, etc.).
You can access my product with different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you access it with the URL www.mysite.com/product_1.html#/color-green, the green option in the select list is automatically selected (with JavaScript).
If I link to my product page with those URLs, is there a risk of duplicate content? Is it good for my SEO?
Thanks
You need to use canonical URLs in order to let the search engines know that you are aware that the content seems duplicated.
Basically, putting a canonical URL on your page www.mysite.com/product_1.html#/color-green that points to www.mysite.com/product_1.html tells the search engine that whenever they see www.mysite.com/product_1.html#/color-green they should not index that page, but rather index www.mysite.com/product_1.html.
This is the suggested method to overcome duplicate content of this type.
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
At one time I saw Google indexing the odd #ed URL and showing it in results, but it didn't last long. I think it also required that there was an on-page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to make anchors indexable and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (duplicate content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for using the # for exactly what it is for?
And what is "the risk of duplicate content"? Generally, the only on-site risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume a # points to the same page, it is more likely not to even try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues from implementing it badly than from the supposed issues it is there to solve.
Both answers above are correct. Google has said they ignore URL fragments (the part after #) unless you use the hash-bang format (#!), and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about the risk than about a missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution involves ensuring that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects them to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use this in addition to the "canonical" link tag and other techniques.
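For illustration, here is a minimal PHP sketch of domain canonicalization (this is usually done in the web server configuration instead; www.example.com as the chosen host is an assumption):
<?php
// Send visitors and bots arriving on the bare domain to the www host with a 301.
if ($_SERVER['HTTP_HOST'] === 'example.com') {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $_SERVER['REQUEST_URI']);
    exit;
}
?>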
I have a large directory of individual names along with generic, publicly available, category-specific information that I want indexed as much as possible in search engines. Listing these names on the site itself is not a concern to people, but some don't want to show up in search results when they "Google" themselves.
We want to continue listing these names within a page AND still index the page BUT not index specified names or keywords in search engines.
Can this be done page by page, or would setting up two pages be a better workaround?
Options available:
PHP can censor keywords if user-agent=robot/search engine
htaccess to restrict robots to non-censored content, but allowing to a second censored version
Meta tags defining words not to index?
JavaScript could hide keywords from robots but otherwise viewable
I will go through the options and tell you some problems I can see:
PHP: If you don't mind trusting the user agent, this will work well. I am unsure how some search engines will react to different content being displayed to their bots.
htaccess: You would probably need to redirect the bot to a different page. You could use URL parameters, but this would be no different than using a pure PHP solution. The bot would index the page it is redirected to and not the page you wish it to visit. You may be able to use the rewrite engine to overcome this.
Meta tags: Even if you could use meta tags to get the bot to ignore certain words, there is no guarantee search engines would honor them, since there is no set "standard" for such meta tags. But that doesn't matter, since I don't know of any way to get a bot to ignore certain words or phrases using meta tags anyway.
JavaScript: No bot I have ever heard of executes (or even reads) JavaScript when looking at a page, so I don't see this working. You could display the content you want hidden using JavaScript and bots won't be able to see it, but neither will users who have JavaScript disabled.
I would go the PHP route.
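A rough sketch of that PHP route; the bot pattern and the opt-out list are assumptions, and user-agent detection is easy to spoof:
<?php
// Names that asked not to appear in search results (hypothetical list).
$censoredNames = ['John Doe', 'Jane Roe'];

// Very naive crawler detection based on the user-agent string.
$isBot = preg_match('/googlebot|bingbot|slurp/i', $_SERVER['HTTP_USER_AGENT'] ?? '');

foreach (['John Doe', 'Alice Smith'] as $name) {
    if ($isBot && in_array($name, $censoredNames, true)) {
        continue; // crawlers never see the opted-out names
    }
    echo '<li>' . htmlspecialchars($name) . "</li>\n";
}
?>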
You can tell robots to skip indexing a particular page by adding the ROBOTS meta tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
UPDATE: The ways to restrict indexing of particular words that I can think of are:
Use JS to add those to the page (see below).
Add a module to the server that would strip those words from the rendered page.
The JavaScript could be something like this:
<p>
  <span id="secretWord">
    <script type="text/javascript">
      // Build the protected word at run time (e.g. by concatenating strings or
      // using hex character codes) so it never appears verbatim in the HTML source.
      document.write('secret' + 'Word');
    </script>
  </span>
</p>
The server module is probably the best option. In ASP.NET it should be fairly easy to do that; I'm not sure about PHP, though (a rough sketch of one possible PHP approach follows).
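In PHP, one rough approximation of such a module is an output-buffer callback that strips the protected words from the rendered page before it is sent; the word list and bot check below are assumptions:
<?php
$protectedWords = ['SecretName']; // hypothetical words to hide from crawlers

ob_start(function ($html) use ($protectedWords) {
    $isBot = preg_match('/googlebot|bingbot/i', $_SERVER['HTTP_USER_AGENT'] ?? '');
    // Crawlers get the page with the protected words removed; users get it unchanged.
    return $isBot ? str_replace($protectedWords, '', $html) : $html;
});

echo '<p>Winner: SecretName</p>';
?>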
What's not clear from your posting is whether you want to protect your names and keywords against Google, or against all search engines. Google is generally well-behaved: you can use the ROBOTS meta tag to prevent a page from being indexed, but it won't prevent search engines that ignore the ROBOTS tag from indexing your site.
Other approaches you did not suggest:
Having the content of the page fetched with client-side JavaScript.
Forcing the user to solve a CAPTCHA before displaying the text. I recommend the reCAPTCHA package, which is easy to use.
Of all these, the reCAPTCHA approach is probably the best, as it will also protect against ill-behaved spiders. But it is the most onerous for your users.
For some mysterious reason, Google has indexed both of these addresses, which lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(A short note: the site has had friendly URLs since its launch, and I have no idea how Google found the "index.php?" URL; the "unfriendly" URLs exist only in the content management system, which is password-restricted.)
What can I do to solve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "disallow: index.php?" in the robots.txt file.
Right or wrong? Any other suggestions?
You'd be surprised at how pervasive and quick the Google bots are at indexing site content. That, combined with the fact that many CMSs create unintended pages and links, makes it likely that those links were exposed at some point, which is the most likely culprit. It's also possible your administration area isn't as secure as you think and the Google bot got in that way.
The well-behaved, and Google-recommended, things to do here are:
If possible, create 301 redirects from your query-string-style URLs to your canonical-style URLs (a sketch follows this list). That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL".
Block the query-string content in your robots.txt. That's like asking the spiders or other automated programs: "Hey, please don't look at this stuff. These aren't the URLs you're looking for."
Google apparently now allows you to specify a canonical URL via a <link /> tag in the <head> of your page. Consider adding these in.
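A sketch of the first item, assuming index.php can look up the friendly slug for a given id (the lookup here is faked):
<?php
// 301-redirect /index.php?pg=something&id=1055 to /something/some-text-1055.html
if (isset($_GET['pg'], $_GET['id'])) {
    $slug = 'some-text'; // in reality, fetch the page's slug for this id from the CMS
    $friendlyUrl = sprintf('/%s/%s-%d.html', urlencode($_GET['pg']), $slug, (int) $_GET['id']);

    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $friendlyUrl);
    exit;
}
?>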
As to whether doing the well-behaved things is the "right" thing to do re: Google rankings... who knows. Only "Google" knows how their algorithms work now and will work in the future, and by Google I mean a bunch of engineers and executives with conflicting goals on how search should work.
Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can read more about canonical URLs in Google's blog post on the subject, here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
If you use sitemap generators to submit URLs to search engines, you'll want to exclude the unwanted URLs there as well. Sitemap generators typically build their lists by crawling your folders and checking your logs, so they are likely where Google got those links.
Better to check what URI has been requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
Changing robots.txt will not help, since the page is already indexed.
The best is to use a permanent redirect (301).
If you want to remove a page once it has been indexed by Google, the only way, more or less, is to make it return a 404 Not Found status.
Is it possible you're posting a form to a similar URL and Google is simply picking it up from the source?