Double slash in URL path - bad practice?

My web app generates multiple slashes in URLs like: http://www.example.com/some///slashes.
Is it a bad practice? Does Google care?
Does Google see /some/slashes and /some///slashes as different URLs? If so, I assume Google won't merge the PageRank of these URLs, or will it?
Thanks!

Yes, search engines treat /some/slashes and /some///slashes as distinct URLs, so this can lead to them indexing the wrong URL and seeing duplicate content. Multiple pages with the same content are bad for SEO: the duplicates detract from the original page, and you can end up with the incorrect URL being indexed while the proper URL is removed from search results.
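If the app runs behind a PHP front controller, one way to stop the duplicates at the source is to normalize the path with a 301 redirect before rendering anything. This is a minimal sketch, not a drop-in fix; it assumes all requests pass through a single PHP entry point:

<?php
// Collapse runs of slashes in the path and 301-redirect to the clean URL.
// Must run before any output is sent.
$parts = explode('?', $_SERVER['REQUEST_URI'], 2);
$path  = $parts[0];
$clean = preg_replace('#/{2,}#', '/', $path);
if ($clean !== $path) {
    $target = $clean . (isset($parts[1]) ? '?' . $parts[1] : '');
    header('Location: ' . $target, true, 301);
    exit;
}

With this in place, a request for /some///slashes answers with a 301 to /some/slashes, and search engines will consolidate their signals on the clean URL.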

Remove multiple indexed URLs (duplicates) with redirect

I am managing a website that has only about 20-50 pages (articles, links, etc.). Somehow, Google indexed over 1000 links (duplicates: the same page with a different string in the URL). I found that those links contain ?date= in the URL. I already blocked them by writing Disallow: *date* in robots.txt, made an XML sitemap (which I did not have before), placed it in the root folder, and imported it into Google Webmaster Tools. But the problem remains: the links are (and probably will stay) in search results. I would happily remove the URLs in GWT, but it only lets you remove one link at a time, and removing >1000 one by one is not an option.
The question: Is it possible to make dynamic 301 redirects from every page that contains ?date= in the URL to the original one, and how? My thinking is that Google will re-crawl those pages, follow the redirects to the original ones, and drop the numerous duplicates from search results.
Example:
bad page: www.website.com/article?date=1961-11-1, plus n copies of the same page with different "date" values
good page: www.website.com/article
automatically redirect all bad pages to good ones.
I have spent a whole work day trying to solve this problem; it would be nice to get some support. Thank you!
P.S. I think this coding question is the right one to ask on Stack Overflow, but if I am wrong (forgive me), please redirect me to the right place to ask it.
You're looking for the canonical link element; that's the way Google suggests solving this problem (here's the Webmasters help page about it), and it's honored by most if not all search engines. When you place an element like
<link rel='canonical' href='http://www.website.com/article'>
in the <head> of the page, the URI in the href attribute will be considered the 'canonical' version of the page: the one to be indexed, and so on.
For the record: if the duplicate content is not an HTML page (say, it's a dynamically generated image), and supposing you're using Apache, you can use .htaccess to redirect to the canonical version. Unfortunately the Redirect and RedirectMatch directives don't match query strings (they operate on the URL path only), but you can use mod_rewrite to strip parts of the query string. See, for example, this answer for a way to do it.
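If the articles are served by a PHP script, here is a minimal sketch of the dynamic 301 the question asks for: drop the date parameter and redirect to the article's canonical URL. It assumes the check runs before any output is sent:

<?php
// 301-redirect any request carrying a ?date= parameter to the same path
// without it, preserving any other query parameters.
if (isset($_GET['date'])) {
    $params = $_GET;
    unset($params['date']);
    $parts  = explode('?', $_SERVER['REQUEST_URI'], 2);
    $target = $parts[0];
    if ($params) {
        $target .= '?' . http_build_query($params);
    }
    header('Location: ' . $target, true, 301);
    exit;
}

Note that the robots.txt Disallow actually works against this: if Google is not allowed to crawl the ?date= URLs, it can never see the redirect (or a canonical tag) on them, so consider removing that rule while the redirects do their work.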

URL with pseudo-anchors and duplicate content / SEO

I have a product page with options in a select list (e.g. the color of the product).
You can access my product page via different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you access it via the URL www.mysite.com/product_1.html#/color-green, the green option in the select list is automatically selected (with JavaScript).
If I link to my product page with both of those URLs, is there a risk of duplicate content? Is it good for my SEO?
thx
You need to use canonical URLs in order to let the search engines know that you are aware the content appears duplicated.
Basically, placing a canonical URL pointing to www.mysite.com/product_1.html on the page www.mysite.com/product_1.html#/color-green tells search engines that whenever they see www.mysite.com/product_1.html#/color-green, they should not index that variant but rather index www.mysite.com/product_1.html.
This is the suggested method to overcome duplicate content of this type.
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
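Because the fragment is never sent to the server, product_1.html and product_1.html#/color-green are the same request as far as your server is concerned, so one canonical tag in the page <head> covers every fragment variant. A minimal sketch of emitting it from PHP, using the URL from the question:

<?php
// Emit the canonical link element for this product page. The fragment
// (#/color-green) never reaches the server, so this single tag covers
// all fragment variants of the page.
$canonical = 'http://www.mysite.com/product_1.html';
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '">';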
At one time I saw Google indexing the odd #ed URL and showing it in results, but that didn't last long. I think it also required an on-page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (the fragment is not even sent to the server in the HTTP request) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for doing what the # is for?
And what is "the risk of duplicate content"? Generally, the only on-site risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume # means the same page, it is more likely not even to try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues caused by implementing it badly than caused by the problems it is supposed to solve.
Both answers above are correct. Google has said they ignore the fragment (everything after the #) unless you use the hashbang format (#!) -- and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about risk than about missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution is to ensure that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects them to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use it in addition to the "canonical" link tag and other techniques.
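A minimal sketch of that 301 in PHP, assuming www.example.com is the preferred hostname (this is usually done in the web server configuration, but the application layer works too):

<?php
// Domain canonicalization: 301 bare-domain requests to the www hostname.
if (strcasecmp($_SERVER['HTTP_HOST'], 'example.com') === 0) {
    header('Location: http://www.example.com' . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}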

Anchor tags in urls & search engines, HOWTO?

How do I implement anchor tag urls so that search engines crawl my pages? Here's an example from twitter:
In search results it's:
http://twitter.com/username
When I click on it, it redirects me to
http://twitter.com/#!/username
How does twitter know when to redirect? Relying on a User-Agent doesn't seem such a good idea.
Twitter isn't optimizing their site for SEO. They have a special deal with Google, so I wouldn't use them as an example. Google has support for hash URLs, which you can read about here: https://developers.google.com/webmasters/ajax-crawling/docs/specification.
The main idea is that crawlers convert a URL like http://www.example.org/#!/my-url into http://www.example.org/?_escaped_fragment_=/my-url. When Google encounters such a URL, it makes a GET request to the alternative URL and uses that content to index the original.
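On the server side, that means answering the _escaped_fragment_ request with a static HTML snapshot of what JavaScript would render for the #! URL. A hedged sketch in PHP, where render_snapshot() is a hypothetical stand-in for your own server-side rendering:

<?php
// Serve an indexable HTML snapshot when Google's crawler requests the
// ?_escaped_fragment_= form of a #! URL.
// render_snapshot() is hypothetical; replace it with your own renderer.
if (isset($_GET['_escaped_fragment_'])) {
    echo render_snapshot($_GET['_escaped_fragment_']);
    exit;
}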

Same webpage on different URLs

What are the implications (SEO-wise) of having the same resource at many different URLs?
I've seen some websites that practically never show a 404 page. Any wrong URL path will simply render the homepage.
Other sites, for example, redirect http://example.com/path/ to http://example.com/path - (no trailing slash) or vice versa in order to avoid duplicate URLs.
Is this a good practice and why (not)?
The largest implication of having the same resource at many different URLs is that your search results (notably in Google; I'm not sure how other search engines handle it) will be diluted/fragmented. Instead of one URL ranking higher in search-result relevance, multiple URLs will each rank lower even though they point to the same resource.
It's generally good practice to normalize your URLs for SEO. The issue most website administrators have with supporting normalized URLs is that it sometimes requires drastic changes to their URL structure, and this isn't always possible. To avoid having to change the URLs directly, there's a rel="canonical" link tag that's supported by Google's web crawler:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
It's a step in the right direction. For more information on normalized URLs, the Wikipedia article is helpful:
http://en.wikipedia.org/wiki/URL_normalization
As for trailing slashes, I'm not sure if web crawlers count these variations as distinct. If, in your example, http://example.com/path/ is a directory, then it should have a trailing slash. If path is the name of a file, the trailing slash should be omitted. In IIS at least, when a trailing slash is omitted, the server hunts for a file first and, if no file is found, checks whether a directory by that name exists. If the directory exists, it redirects internally by adding a trailing slash. This amounts to extra work on the web server's end that isn't necessary if you generate consistent internal links on your pages.
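If you want to enforce one form explicitly rather than rely on server defaults, here is a sketch of the slash-less convention in PHP (invert the rule if the trailing-slash form is your canonical one):

<?php
// Normalize trailing slashes: 301-redirect /path/ to /path, leaving the
// root path "/" alone and preserving any query string.
$parts = explode('?', $_SERVER['REQUEST_URI'], 2);
$path  = $parts[0];
if ($path !== '/' && substr($path, -1) === '/') {
    $target = (rtrim($path, '/') ?: '/') . (isset($parts[1]) ? '?' . $parts[1] : '');
    header('Location: ' . $target, true, 301);
    exit;
}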
"Demystifying the 'duplicate content penalty'" is a pretty nice article on various duplicate content issues. Google's Duplicate Content help page seems to be kept up to date on the best ways to handle it from a technical perspective.

Google sees something that it shouldn't see. Why?

For some mysterious reason, Google has indexed both of these addresses, which lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(A short note: the site has had friendly URLs since its launch, and I have no idea how Google found the "index.php?" URL; the "unfriendly" URLs appear only in the content management system, which is password-restricted.)
What can I do to solve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "disallow: index.php?" in the robots.txt file.
Right or wrong? Any other suggestions?
You'd be surprised at how pervasive and quick the Google bots are at indexing site content. That, combined with the fact that lots of CMS systems create unintended pages/links, makes it likely that those links were exposed at some point, which is the most likely culprit. It's also possible your administration area isn't as secure as you think, and the Google bot got through that way.
The well-behaved, and Google-recommended, things to do here are:
If possible, create 301 redirects from your query-string-style URLs to your canonical-style URLs (see the PHP sketch after this list). That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL".
Block the query-string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for".
Google apparently now allows you to specify a canonical URL via a <link /> tag in the <head> of your page. Consider adding these in.
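A sketch of the first step in PHP, assuming the friendly URLs are internally rewritten to index.php. lookup_slug() is a hypothetical helper that maps an article id to its URL slug (e.g. 1055 => "some-text"):

<?php
// 301-redirect /index.php?pg=something&id=1055 to /something/some-text-1055.html.
// Checking REQUEST_URI (not SCRIPT_NAME) avoids redirect loops when the
// friendly URLs are themselves rewritten to index.php internally.
if (isset($_GET['pg'], $_GET['id'])
        && strpos($_SERVER['REQUEST_URI'], '/index.php') === 0) {
    $id   = (int) $_GET['id'];
    $slug = lookup_slug($id); // hypothetical id-to-slug lookup
    header('Location: /' . rawurlencode($_GET['pg']) . '/' . $slug . '-' . $id . '.html', true, 301);
    exit;
}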
As to whether doing the well-behaved things is the "right" thing to do re: Google rankings... who knows. Only Google knows how their algorithms work now and will work in the future, and by "Google" I mean a bunch of engineers and executives with conflicting goals on how search should work.
Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can read more about canonical URLs on Google on their blog post on the subject, here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
If you use sitemap generators to submit to search engines, you'll want to exclude those URLs in them as well. The generated sitemap is likely where Google got your links, since generators find URLs by crawling your folders and checking your logs.
Better to check which URI was requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
Changing robots.txt will not help, since the page is already indexed.
The best is to use a permanent redirect (301).
If you want to remove a page once it has been indexed by Google, the only way, more or less, is to make it return a 404 Not Found status.
Is it possible you're posting a form to a similar URL and Google is simply picking it up from the source?
