Remove multiple indexed URLs (duplicates) with redirect - url

I am managing a website that has only about 20-50 pages (articles, links, etc.). Somehow, Google indexed over 1000 links: duplicates of the same pages with different strings in the URL. I found that those links contain ?date= in the URL. I have already blocked them by adding Disallow: *date* to robots.txt, made an XML sitemap (which I did not have before), placed it in the root folder, and submitted it to Google Webmaster Tools. But the problem remains: the links are (and probably will stay) in the search results. I would happily remove the URLs in GWT, but it only removes one link at a time, and removing more than 1000 one by one is not an option.
The question: is it possible to make dynamic 301 redirects from every page that contains ?date= in the URL to the original one, and how? I am thinking that Google will re-crawl those pages, follow the redirects to the original ones, and delete the numerous duplicates from the search results.
Example:
bad page: www.website.com/article?date=1961-11-1, and n copies of the same page with different "date" values
good page: www.website.com/article
Automatically redirect all bad pages to the good ones.
I have spent a whole work day trying to solve this problem; some support would be much appreciated. Thank you!
P.S. As far as I can tell, this is a coding question and Stack Overflow is the right place to ask it, but if I am wrong (forgive me), please redirect me to the right place to ask it.

You're looking for the canonical link element; that's the way Google suggests solving this problem (here's the Webmasters help page about it), and it's supported by most if not all search engines. When you place an element like
<link rel="canonical" href="http://www.website.com/article">
in the head of the page, the URI in the href attribute will be considered the 'canonical' version of the page, the one to be indexed and so on.
For the record: if the duplicate content is not an HTML page (say, it's a dynamically generated image), and supposing you're using Apache, you can use .htaccess to redirect to the canonical version. Unfortunately, the Redirect and RedirectMatch directives don't handle query strings (they match only on the URL path), but you can use mod_rewrite to strip parts of the query string. See, for example, this answer for a way to do it.
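For the asker's case, where the duplicate URLs hit an ordinary page script, the 301 can also be issued at the application level. Here is a minimal sketch in PHP, assuming the top of the article script can be edited (the file layout is an assumption):

<?php
// Sketch: 301-redirect any request carrying a ?date= parameter to the
// same URL without it, e.g. /article?date=1961-11-1 -> /article
if (isset($_GET['date'])) {
    $params = $_GET;
    unset($params['date']);                         // drop only the offending parameter
    $query  = http_build_query($params);
    $target = strtok($_SERVER['REQUEST_URI'], '?'); // path without the query string
    if ($query !== '') {
        $target .= '?' . $query;
    }
    header('Location: ' . $target, true, 301);      // permanent redirect
    exit;
}

Once Googlebot recrawls the ?date= URLs it will follow the 301s, and the duplicates should drop out of the index over time, which is exactly the effect the asker is after.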

Related

Url with pseudo anchors and duplicate content / SEO

I have a product page with options in a select list (e.g. the colour of the product).
You can reach the product via different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you arrive via the URL www.mysite.com/product_1.html#/color-green, the green option in the select list is automatically selected (with JavaScript).
If I link to my product page with both of those URLs, is there a risk of duplicate content? Is it good for my SEO?
Thanks
You need to use canonical URLs in order to let the search engines know that you are aware the content appears duplicated.
Basically, putting a canonical URL on your page www.mysite.com/product_1.html#/color-green that points to www.mysite.com/product_1.html tells the search engine that whenever it sees www.mysite.com/product_1.html#/color-green, it should not index it as a separate page but index www.mysite.com/product_1.html instead.
This is the suggested method to overcome duplicate content of this type.
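Concretely, since the fragment (#/color-green) is never sent to the server, every variant serves the same HTML, so a single canonical element in the head of product_1.html covers them all:
<link rel="canonical" href="http://www.mysite.com/product_1.html">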
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
At one time I saw Google indexing the odd #ed URL and showing it in results, but it didn't last long. I think it also required an on-page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies that an anchor without the bang (!) will not be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (the same content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for using the # for what it is meant for?
And what is "the risk of duplicate content"? Generally, the only on-site risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. Since Google will assume a # points into the same page, it is all the more likely not to even try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues caused by implementing it badly than by the problems it is supposed to solve.
Both answers above are correct. Google has said they ignore hash fragments unless you use the hashbang format (#!), and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about risk than about missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution involves ensuring that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects them to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use it in addition to the "canonical" link tag and other techniques.
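A minimal sketch of that domain canonicalization in PHP; doing it in the web server configuration works just as well, and the preferred host name here is an assumption:

<?php
// Sketch: 301 every request for example.com over to www.example.com
// so that all inbound links consolidate on one host name.
$canonicalHost = 'www.example.com'; // assumed preferred host
if ($_SERVER['HTTP_HOST'] !== $canonicalHost) {
    header('Location: http://' . $canonicalHost . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}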

Strange google result listing, invalid URL created

Would be great if you guys could shed some light on this, has baffled me:
I was asked by a client if I could try and make the search term for his comedy night "sketchercise" put his website top of the Google ranking. I simply changed the title tag of the header for the whole site from "Allnutt and Simpson" to "Allnutt and Simpson - Sketchercise # Ginglik - Sketch Duo". It did the trick and now the site comes up top of the Google listing when typing in "sketchercise". However, it gives off this very strange link:
http://www.allnuttandsimpson.com/index.php/videos/
This is the link to the google search result too:
http://www.google.co.uk/search?sourceid=chrome&ie=UTF-8&q=sketchercise
This link is invalid; it doesn't make any sense. I guess it has something to do with the use of hash fragments and the AJAX-driven site, but before I changed the title tag, it linked to the site fine using the # URLs. What is the deal with this slash?
The strangest part is that the valid URL for the videos page on that site is /index.php#vidspics; I have never used the word "videos" in a URL!
If anyone can explain the cause of this or just help me stop it from happening, I'd be very grateful. I realise that this is an SEO question and I hate that stuff generally, but I hope you can see this is a bit of a strange case!
Just to compare, if you google "allnutt and simpson", it works just fine: it links to the site and all of its pages as .php pages (and then my JS converts them to hash URLs to keep things clean).
It's because there must be a folder called 'videos' under your hosted files; use an FTP client and check.
Google crawls every folder and file unless you tell it not to; read up on robots.txt files to learn how to keep things out of the index.
Also ask Google to remove that result once you have solved this.
Finally, that behaviour is not related to hash fragments; those are just references used by JavaScript to display the appropriate content on your webpage.
Not sure why it's posted like this, but the only way to stop that page from appearing is to use a Google Webmaster account for this website and make sure the crawlers can't find the link anymore. The alternative is to have the site admin put this tag, <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">, in the header when isset($_REQUEST['videos']) is true.
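In template form, that conditional tag might look like the following sketch (taking the videos parameter check exactly as the answer describes; whether that check matches this site's URL scheme is an assumption):

<?php if (isset($_REQUEST['videos'])): ?>
    <!-- keep this accidental variant out of the index -->
    <meta name="robots" content="noindex, nofollow">
<?php endif; ?>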
The slash in the address is the parsed form of www.allnuttandsimpson.com/index.php?=videos. You can have the web server rewrite the PHP parameters into slashes to make the links look pretty.
The best option for correct results is to create a sitemap and submit it to https://www.google.com/webmasters/tools/ for that site. You will need access.
I forgot to mention: the sitemap will make Google see all the pages you want listed; use it for the major pages, like those in the main menu. Removing links you don't want requires a robots.txt in the main directory of the site.

SEO duplicate content

If you have a URL like this:
Java lib or app to convert CSV to XML file?
and someone creates a link to it on his website but changes the name, like this:
Java lib or app to convert CSV to XML file?
that would cause duplicate content. Is it necessary to check whether the name in the URL matches the real name of the question and, if not, do a 301 redirect to the correct URL?
I see stackoverflow.com doesn't do a 301 redirect, but just accepts any name in the URL. Can this cause duplicate content or make your ranking drop? Is SO not affected by this?
This won't cause duplicate content as long as you use a <link rel="canonical"> element. For example, this question's <link rel="canonical"> element is:
<link rel="canonical" href="http://stackoverflow.com/questions/3488655/seo-duplicate-content">
This element does not change even if you visit the page with a different slug at the end of the URL.
It tells Google that, even if it got to this page via a different URL, it's the same page as the canonical page (not a duplicate) and it should be counted as such.
More info: Official Google Webmaster Central Blog — Specify Your Canonical
I think this can be resolved with the canonical link tag.
Duplicate content does not "cause your ranking to drop"; this is a myth. In the example you gave, the worst that will happen is that the first URL won't gain any PageRank benefit from the (incorrect) link, since it points to a different URL. You want to avoid this sort of thing simply because, if lots of people link to the same content at different addresses, the benefit you get from those links is split. There is no 'penalty'.
In my opinion, the correct thing to do here is to check the name in the URL against the name in the database and, if it doesn't match, send a 404 (the link is wrong, after all). This way people are unlikely to link to the wrong URL in the first place.
If this is an old site and you already have lots of inbound links to incorrect URLs, you might want to send 301s to the correct URLs instead, so as not to dead-end users.
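A sketch of that check in PHP, with find_question_slug standing in for whatever database lookup the site actually has (both the URL pattern and the helper are assumptions):

<?php
// Sketch: validate the slug portion of /questions/<id>/<slug> and
// respond with a 404 or a 301 when it doesn't match the stored title.
if (!preg_match('#^/questions/(\d+)/([^/?]*)#', $_SERVER['REQUEST_URI'], $m)) {
    header('HTTP/1.1 404 Not Found');
    exit;
}
$id   = (int) $m[1];
$slug = $m[2];

$canonicalSlug = find_question_slug($id); // hypothetical DB helper
if ($canonicalSlug === null) {
    header('HTTP/1.1 404 Not Found');     // no such question: the link is simply wrong
    exit;
}
if ($slug !== $canonicalSlug) {
    // For an established site with inbound links, prefer this 301;
    // for a new site, a 404 here discourages bad links from the start.
    header('Location: /questions/' . $id . '/' . $canonicalSlug, true, 301);
    exit;
}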
Some further reading:
http://googlewebmastercentral.blogspot.com/2008/09/demystifying-duplicate-content-penalty.html

Hidden file names in URLs

I usually like naming my pages so that I know exactly which page does what. However, on a number of sites I see that the filename is hidden from view, and I was just a little curious.
Is there any specific benefit of having URLs appear like this:
http://mydomain/my_directory/my_subdirectory/
As opposed to this:
http://mydomain/my_directory/my_subdirectory/index.php
Thanks.
Done correctly, it can be better for:
the end user: it is easier to say and remember.
SEO: the file name may just be noise in the URL as far as search parsing is concerned.
Note: in your example (at least with IIS), all that may have happened is that you've made index.php the default document of that subdirectory. You could then use both URLs to access the page, which could again affect SEO: a search engine would see the two URLs as different, but the page content would be the same, resulting in duplicate content being flagged. The solution would be to either:
301 redirect from one of the URLs to the other (sketched below), or
add a canonical tag to the page saying which URL you want the page rank attributed to.
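A minimal sketch of the redirect, placed at the top of index.php (query strings are ignored here for brevity):

<?php
// Sketch: collapse /my_directory/my_subdirectory/index.php onto
// /my_directory/my_subdirectory/ with a permanent redirect.
$uri = $_SERVER['REQUEST_URI'];
if (substr($uri, -10) === '/index.php') {
    header('Location: ' . substr($uri, 0, -9), true, 301); // strip 'index.php', keep the slash
    exit;
}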
Some technologies simply don't map URLs to files; Java Servlets, for example.
These are not file names; these are URLs. Their goal is to describe the resource. Nobody cares whether you did it in PHP or ASP, or typed your HTML in Emacs. Nobody cares that you named your file index.php. We like to see clean URLs with clear structure and semantics.

Google sees something that it shouldn't see. Why?

For some mysterious reason, Google has indexed both of these addresses, which lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(Short notice: the site has had friendly URLs since its launch; I have no idea how Google found the "index.php?" URL, since the "unfriendly" URLs exist only in the content management system, which is password-restricted.)
What can I do to resolve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "Disallow: index.php?" in the robots.txt file.
Right or wrong? Any other suggestions?
You'd be surprised at how pervasive and quick the Google bots are at indexing site content. That, combined with the fact that lots of CMS systems create unintended pages/links, makes it likely that those links were exposed at some point, and that is the most likely culprit. It's also possible your administration area isn't as secure as you think and the Google bot got in that way.
The well-behaved (and Google-recommended) things to do here are:
If possible, create 301 redirects from your query-string-style URLs to your canonical-style URLs (sketched after this list). That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL"
Block the query string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for"
Google apparently allows you to specify a canonical URL now via a <link /> tag in the top of your page. Consider adding these in.
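A sketch of point 1 for this site's URL scheme, with build_friendly_url standing in for however the CMS maps pg/id pairs onto friendly URLs (the helper is an assumption):

<?php
// Sketch: 301 the old /index.php?pg=...&id=... URLs onto the friendly
// /something/some-text-1055.html style URLs. Checking REQUEST_URI keeps
// internally rewritten friendly requests from redirecting in a loop.
if (strpos($_SERVER['REQUEST_URI'], '/index.php') === 0
        && isset($_GET['pg'], $_GET['id'])) {
    $friendly = build_friendly_url($_GET['pg'], (int) $_GET['id']); // hypothetical CMS helper
    if ($friendly !== null) {
        header('Location: ' . $friendly, true, 301);
        exit;
    }
}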
As to whether doing the well-behaved things is the "right" thing to do re: Google rankings... who knows. Only "Google" knows how their algorithms work now and will work in the future, and by Google I mean a bunch of engineers and executives with conflicting goals about how search should work.
Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can read more about canonical URLs on Google on their blog post on the subject, here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
If you use sitemap generators to submit to search engines, you'll want to disallow those URLs there as well. The sitemaps are likely where Google got your links, along with crawling your folders and checking your logs.
Better to check which URI was requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
Changing robots.txt will not help, since the page is already indexed.
The best approach is a permanent redirect (301).
If you want to remove a page once it has been indexed by Google, the only way, more or less, is to make it return a 404 Not Found.
Is it possible you're posting a form to a similar URL and Google is simply picking it up from the source?
