Get rid of old links to a retired website in Google search - search-engine

I have a website that has been replaced by another website with a different domain name.
In Google search, I am able to find links to the pages on the old site, and I hope they will not show up in future Google search.
Here is what I did, but I am not sure whether it is correct or enough.
Access to any page on the old website will be immediately redirected to the homepage of the new website. There is no one-to-one page mapping between the two sites. Here is the code for the redirect on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com" >
I went to Google Webmasters site. For the old website, I went to Fetch as Google, clicked "Fetch and Render" and "Reindex".
Really appreciate any input.

A few things you'll want to do here:
You need to use permanent server redirects, not meta refresh. Also I suggest you do provide one-to-one page mapping. It's a better user experience, and large numbers of redirects to root are often interpreted as soft 404s. Consult Google's guide to site migrations for more details.
Rather than Fetch & Render, use Google Search Console's (Webmaster Tools) Change of Address tool. Bing have a similar tool.
A common mistake is blocking crawler access to an retired site. That has the opposite of the intended effect: old URLs need to be accessible to search engines for the redirects to be "seen".

Related

How do search engines obtain unlinked pages?

I noticed that quite a lot Dropbox pages are indexed by Google, Bing, etc. and was wondering how these search engines obtain for instance links like these:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One solution might be that it was posted on a forum and picked up by the search engine but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show a considerable amount of more results than Google which would mean that Bing does a better job in picking up these links which seems unlikely to me.
Even if the document is really unlinked (no link on their site, no link on someone other’s site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it’s still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could get all URLs that certain users visit in their browsers. This could, for example, happen when the user has installed a toolbar from that search engine. This is the case with Bing, see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.

How do search engines see dynamic profiles?

Recently search engines have been able to page dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi frequently. Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username, is not an actual file stored on disk but is shorthand for a query like: select username from users and display the information on the page. How does Google know about every user, this gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know about. Do I need to be as big as twitter or facebook in order for google to make special ways to crawl my site? Will google automatically find my users profiles if I allow anyone to view them? If not what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say..
As far as I know Google isn't able to read and store the actual contents of profiles, because the Google bot doesn't have a Facebook account, and it would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the page it hits, it stores. So even if it follows a dynamic url like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully in that particular case, it isn't all the private data of said user.
Additionally, facebook can and does provide special instructions that search bots can follow, so that google results don't include a bunch of login pages.
profiles can be linked from outside;
site may provide sitemap

How-To get private pages being crawled by google

How can i get private pages of my web site being crawled and indexed by google ?
maybe it's not very "conventionnal", but i want my private page "links" displayed in google index, but next require a registration to display the page.
EDIT: Based on the addition of "maybe it's not very "conventionnal", but i want my private page "links" displayed in google index, but next require a registration to display the page." To the question:
You can check the User Agent in your php code to basically allow google to see pages if it was a registered user (google's user agent is "Googlebot/1.0" and you can search to find user agents for other common engines).
However, this behavior is specifically against google's rules and they can and will remove your site from the index if they catch you doing it. Their policy is you should not treat googlebot any differently than you treat any random person who visits your site.
(Original Answer) One way is to use a sitemap to show google how to find all of your pages.
In general, and even in the case of sitemaps, if the content you want indexed is not linked to from a page that can be found through the "root" (/) (i.e. there is no way for the public to find it), then it probably won't get indexed. The only way to get it indexed is to link it in someplace.
The question is though, why do you want your private pages in google anyway?
They'll get crawled if and only if they're publicly accessible and your robots.txt file allows it. That's pretty much all you need to do.
Are you asking how to get Google to index your pages?
There are a couple of ways. You need to ensure that you have SEO'd, or Search Engine Optimisation, the pages properly with title text and description key words in your meta data.
You can also submit your site to Google, it's a free service, and it'll be placed in a queue of things that Google will index. May take some time though.
By far the best way to get your pages indexed is using the meta data in the pages themselves.
Google will only index what is
linked from somewhere already in Google's index
accessible to its crawler via normal (unauthenticated) HTTP
It will also
make the contents available in search results to anyone.
This may conflict with your idea of a "private" page.
I'm going to assume that all the other previous answerers are misunderstanding you. As I read it, you aren't asking how to get Google to index your pages, but rather how to get a list of all the pages that Google currently has already indexed on your site? If that is true, you should have a look at the Google Webmaster Tools.

Add identifier to browser bookmarks

Is there a way to automatically append an identifier to a page URL when it is bookmarked in the browser, perhaps something in the document head that gives the browser a directive or an onBookmark JavaScript type of event? I'm looking for ways to further segment my direct traffic in Google Analytics (if you have other ideas for doing that not related to bookmarks, please share them as well).
Example:
http://www.example.com/article
When bookmarked becomes:
http://www.example.com/article#bookmarked
I don't think you can reliably do what you're describing. The best you could do is have a button on your pages that uses window.external.AddFavorite, and then specify your hashed address. It will work for the people who use the button, which may be better than nothing.
Keep in mind that bookmark traffic is not simply part of your direct traffic.
There is a nice demonstration in Justin Cutroni's blog how Google Analytics tracks bookmark visits and it turned out, that most of the time, bookmark visits are shown as organic (if you suppose, that people first do a Google search and then bookmark your site).

Google sees something that it shouldn't see. Why?

For some mysterious reason, Google has indexed both these adresses, that lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(short notice - the site has had friendly urls since its launch, I have no idea how google found the "index.php?" url - there are "unfriendly" urls only in the content management system, which is password-restricted)
What can I do to solve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "disallow: index.php?" in the robots.txt file.
Right or wrong? Any other suggestions?
You'd be surprised as how pervasive and quick the google bots are at indexing site content. That, combined with lots of CMS systems creating unintended pages/links making it likely that at some point those links were exposed is the most likely culprit. It's also possible your administration area isn't as secure as you think, the google bot got through that way.
The well-behaved, and google recommended, things to do here are
If possible, create 301 redirects from you query string style URLs to your canonical style URLs. That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL"
Block the query string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for"
Google apparently allows you to specify a canonical URL now via a <link /> tag in the top of your page. Consider adding these in.
As to whether doing the well behaved things is the the "right" thing to do re: Google rankings ... who knows. Only "Google" knows how their algorithms work now, and will work in the future, and by Google, I mean a bunch of engineers and executives with conflicting goals on how search should work.
Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can read more about canonical URLs on Google on their blog post on the subject, here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
If you use sitemap generators to submit to search engines, you'll want to disallow in them as well. They are likely where Google got your links, from crawling your folder and from checking your logs.
Better check what URI has been requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
Changing robots.txt will not help, since the page is already indexed.
The best is to use a permanent redirect (301).
If you want to remove a page once indexed by Google the only way, more or less, is to make it return a 404 not found message.
Is it possible you're posting a form to a similar url and google is simply picking it up from the source?

Resources