Google indexed my domain anyway? - search-engine

I have a robots.txt like below but Google has still indexed my domain. Basically they've indexed mydomain.com but not mydomain.com/any_page
UserAgent: *
Disallow: /
I mean how can I go back further than / which I thought was the root of domain?
Note this domain is a work in progess, hence I don't want Google or any other search engines seeing it for a minute.

If you don't have one already, get a Google Webmaster Tools account. It includes a URL removal tool that may work for you.
This doesn't address the problem of search engines possibly ignoring or misinterpreting your robots.txt file, of course.
If you REALLY want your site to be off the air until it's launched, your best bet is to actually take it off the air. Make the site inaccessible except by password. If you put HTTP Basic authentication on your documentroot, then no search engine will be able to index anything, but you'll have full access with a password.

Related

Google refuses to index new site URL

This is either a problem that Google is inflicting upon me, or a problem I am inflicting upon myself. I'm not totally sure.
When I first created my website a couple years ago, it followed a path similar to: http://www.mywebsite.abc123.com
Now, after a change in hosting services, I changed my domain to simply: https://www.mywebsite.com
I also added an SSL certifcate at the time for what it's worth. And it's been almost six months. I have all the variations (past and present) of my website registered and verified with Google's search console, but I can see no reason why the http://www.mywebsite.abc123.com link is getting indexed over the https://www.mywebsite.com link. I actually just assumed that http://www.mywebsite.abc123.com wouldn't even work anymore.
I've read about 301 redirects and it looks like something like that would solve my problem, but upon trying to implement it, I was confronted with nothing but a "Too many redirects" error.
Long story short, Google won't index my newer better URL.
But Yahoo and Bing will.
301 redirects have to be set up in the old domain so that it will point to the new one. If you still have access to that domain, you can add the redirects via .htaccess or in the admin panel.

Can a dedicated IP address with a website on it be found and crawled by search engines?

I have a VPS. I have placed a Drupal installation on that IP address. There is no URL registered for my website. The site on the IP address is for personal reference.
Can my IP address get indexed and found on search engines if there is no traditional URL for it? Will it get crawled?
I have no A-records pointing to it from other domain names I have on another VPS platform either. As far as I know, I am the only one that knows this IP address by heart or even goes there to add or refer to content.
There are three ways I know for a search engine to learn about the existence a website.
You submit the domain to them directly.
Someone else links to the domain.
The search engine watches all domain registrations (Google can do this easily because they run a DNS themselves), and tries the standard prefixes (e.g. www).
There does not seem to be an automatic approach for discovering IP addresses with content unless someone links to it.
If it's purely for personal reference and you want to be sure no one else can access it, then you should implement security anyway. Don't just rely on no one knowing the IP.
Can my IP address get indexed and found on search engines if there is no traditional URL for it?
Yes, if you can reach it externally, then so can the search engines. If you don't want it to be indexed, add a "robots.txt" that requests for the site not to be indexed. Bear in mind that crawlers do not have to respect this, but the major ones do.
As for how the search engines discover IP addresses that are not indexed elsewhere, that is probably part of their "secret sauce" that we will never know about. Perhaps your IP has been used before, and it has previously been indexed in that context; if so, a search engine that has a poke around may be expecting that old site but will happily index your new one.
Or, maybe other IP addresses in the same netblock are in active use, and the search engines give yours "a quick try" to see if it responds on ports 80 (http) or 443 (https). If they do, it gets added to their indexes (or do-not-crawl lists, if your robots.txt requests it).
If you specifically do not want search engines to see your content, you could make the default home page blank, and put your Drupal installation in a sub-directory. The search engines will then have nothing to index apart from a blank home page.

google bot, false links

I have a little problem with google bot, I have a server working on windows server 2009, the system called Workcube and it works on coldfusion, there is an error reporter built-in, thus i recieve every message of error, especially it concerned with google bot, that trying to go to a false link, which doesn't exist! the links looks like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
of course with definition like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg is false, i don't have any idea what can be the problem?! need advice! thank you all for help! ;)
You might want to verify your site with Google Webmaster Tools which will provide URLs that it finds that error out.
Your logs are also valid, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
Once you've verified it's the real Googlebot you can start troubleshooting. You see Googlebot won't request URLs that it hasn't naturally seen before, meaning Googlebot shouldn't be making direct object reference requests. I suspect it's a rogue bot with a User Agent of Googlebot, but if it's not you might want to look through your site to see if you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googelbot will see the links from Stack Overflow and continue to crawl them because it'll be in their crawl queue.
I'd suggest 301 redirecting these URLs to someplace that make sense to your users. Otherwise I would 404 or 410 these pages so Google know to remove these pages from their index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
Unfortunately there's no real good way of telling Googlebot to never ever crawl these URLs again. You can always go into Google Webmaster Tools and request the URLs to be removed from their index which may stop Googlebot from crawling them again, but that doesn't guarantee it.

Geolocation with wordpressmu/buddypress

I am just looking for the best solution for the following problem.
I have installed wordpress mu, and I wanted to create child blogs, for different areas in the world. But I want it so 1 domain can switch them instantly using the users ip address.
IS there a extention of wordpressmu or buddypress or do I need something on the server say in htaccess to do that?
Check this plugin out: http://wordpress.org/extend/plugins/ip-to-country/
You could possibly use this, check the value returned, and set a WordPress cookie for the user that automatically redirects them to the geo-specific blog and will do so if/when they return to the site.
Check out the Google API for geocoding, user detection, and geogrouping options (redirect all users from combined location like 'Europe'). Rather then doing all the heavy lifting on your server, work with the Google API, get your nicely formatted answer back, and redirect based on that....
Anything that involves geographical relation .... GOOGLE GOOGLE GOOLE

What is the point of www in web urls? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I've been trying to collect analytics for my website and realized that Google analytics was not setup to capture data for visitors to www.example.com (it was only setup for example.com). I noticed that many sites will redirect me to www.example.com when I type only example.com. However, stackoverflow does exactly the opposite (redirects www.stackoverflow.com to just stackoverflow.com).
So, I've decided that in order to get accurate analytics, I should have my web server redirect all users to either www.example.com, or example.com. Is there a reason to do one or the other? Is it purely personal preference? What's the deal with www? I never type it in when I type domains in my browser.
History lesson.
There was a time when the Web did not dominate the Internet. An organisation with a domain (e.g. my university, aston.ac.uk) would typically have several hostnames set up for various services: gopher.aston.ac.uk (Gopher is a precursor to the World-wide Web), news.aston.ac.uk (for NNTP Usenet), ftp.aston.ac.uk (FTP - including anonymous FTP archives). They were just the obvious names for accessing those services.
When HTTP came along, the convention became to give the web server the hostname "www". The convention was so widespread, that some people came to believe that the "www" part actually told the client what protocol to use.
That convention remains popular today, and it does make some amount of sense. However it's not technically required.
I think Slashdot was one of the first web sites to decide to use a www-less URL. Their head man Rob Malda refers to "TCWWW" - "The Cursed WWW" - when press articles include "www" in his URL. I guess that for a site like Slashdot which is primarily a web site to a strong degree, "www" in the URL is redundant.
You may choose whichever you like as the canonical address. But do be consistent. Redirecting from other forms to the canonical form is good practice.
Also, skipping the “www.” saves you four bytes on each request. :)
It's important to be aware that if you don't use a www (or some other subdomain) then all cookies will be submitted to every subdomain and you won't be able to have a cookie-less subdomain for serving static content thus reducing the amount of data sent back and forth between the browser and the server. Something you might later come to regret.
(On the other hand, authenticating users across subdomains becomes harder.)
It's just a subdomain based on tradition, really. There's no point of it if you don't like it, and it wastes typing time as well. I like http://somedomain.com more that http://www.somedomain.com for my sites.
It's primarily a matter of establishing indirection for hostnames. If you want to be able to change where www.example.com points without affecting where example.com points, this matters. This was more likely to be useful when the web was younger, and the "www" helped make it clear why the box existed. These days, many, many domains exist largely to serve web content, and the example.com record all but has to point to the HTTP server anyway, since people will blindly omit the www. (Just this week I was horrified when I tried going to a site someone had mentioned, only to find that it didn't work when I omitted the www, or when I accidentally added a trailing dot after the TLD.)
Omitting the "www" is very Web 2.0 Adoptr Gamma... but with good reason. If people only go to your site for the web content, why keep re-adding the www? I general, I'd drop it.
http://no-www.org/
Google Analytics should work just fine with or without a www subdomain, though. Plenty of sites using GA successfully that don't force either/or.
It is the third-level domain (see Domain name. There was a time where it designated a physical server: some sites used URLs like www1.foo.com, www3.foo.com and so on.
Now, it is more virtual (different 3rd-level domains pointing to same server, same URL handled by different servers), but it is often used to handle sub-domains, and with some trick, you can even handle an infinite number of sub-domains: see, precisely, Wikipedia which uses this level for the language (en.wikipedia.org, fr.wikipedia.org and so on) or others site to give friendly URLs to their users (eg. my page http://PhiLho.deviantART.com).
So the www. isn't just here for decoration, it has a purpose, even if the vast majority of sites just stick to this default, and if not provided, supply it automatically. I knew some sites forgetting to redirect, giving an error if you omitted it, while they communicated on the www-less URL: they expected users to supply it automatically!
Let alone the URL already specifies what protocol is to be used so "www." is really of no use.
As far as I remember, in former times services like www and ftp were located on different machines, therefore using the natural DNS features (subdomains) was necessary at this time (more or less).

Resources