google bot, false links - hyperlink

I have a little problem with Googlebot. I have a server running Windows Server 2009 with a system called Workcube, which runs on ColdFusion. It has a built-in error reporter, so I receive every error message, especially those concerning Googlebot trying to go to a false link that doesn't exist. The links look like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
Of course, a value like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg is false. I don't have any idea what the problem could be. I need advice! Thank you all for your help! ;)

You might want to verify your site with Google Webmaster Tools, which will report the URLs it finds that error out.
Your logs are also a valid source, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
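For example, here is a minimal sketch of that check in Ruby (the site in question runs ColdFusion, but the same two lookups apply on any stack; the method name is illustrative):

    require "resolv"

    # Reverse lookup the requesting IP, confirm the host belongs to Google,
    # then forward lookup that host and confirm it resolves back to the same IP.
    def verified_googlebot?(ip)
      host = Resolv.getname(ip)
      return false unless host.end_with?(".googlebot.com", ".google.com")
      Resolv.getaddresses(host).include?(ip)
    rescue Resolv::ResolvError
      false
    end

    # verified_googlebot?(request_ip)  # => true only for a genuine Googlebot address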
Once you've verified it's the real Googlebot, you can start troubleshooting. You see, Googlebot won't request URLs that it hasn't naturally seen before, meaning it shouldn't be making direct object reference requests. I suspect it's a rogue bot with a User Agent of Googlebot, but if it's not, you might want to look through your site to see if you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googlebot will see the links from Stack Overflow and continue to crawl them because they'll be in its crawl queue.
I'd suggest 301 redirecting these URLs to someplace that makes sense to your users. Otherwise I would 404 or 410 these pages so Google knows to remove them from its index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
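For instance, a robots.txt rule along these lines would keep compliant crawlers away from that handler (the Disallow value below is just the common prefix of the URLs above; adjust it to whatever you actually want blocked):

    User-agent: *
    Disallow: /index.cfm?fuseaction=objects2.view_product_list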
Unfortunately there's no really good way of telling Googlebot to never crawl these URLs again. You can always go into Google Webmaster Tools and request that the URLs be removed from their index, which may stop Googlebot from crawling them again, but that doesn't guarantee it.

Related

.com extension adds random characters/numbers

I'm using a domain name with this general structure: http://mydomainname.com/
However, when I click it, I get a 404 message saying:
And when I look in the URL, it's not http://mydomainname.com/ but surprisingly http://mydomainname.com/YkPWZ/.
How did YkPWZ/ appear automatically and what can I do to eliminate this issue? Sometimes accessing http://mydomainname.com/ works fine, but most of the time the browser automatically tacks on some random characters at the end of the URL, throwing the 404 message. This is not a browser-specific issue and I've had a few colleagues replicate this issue on different operating systems (both desktop and iOS).
P.S. If it matters at all, I generated my website using Github Pages (markdown files, not HTML).
I'm quite certain this is an issue on the GoDaddy side of things, though I'm unable to find any official documentation on the subject. As noted in comments above, the redirect isn't coming from GitHub Pages.
I found an old thread discussing the issue. Here is a brief summary:
GoDaddy may use redirects like this to handle load balancing on their shared hosting servers. In several cases, users contacted GoDaddy to ask about the problem and had the issue resolved, but were never told the technical specifics of what was happening.
If you wish to stay with GoDaddy, I recommend contacting them and pointing them to the thread I found above. They may be able to resolve the issue for you, though I wouldn't expect an explanation.
Alternatively, you can switch to another web host; in many circles GoDaddy isn't rated very highly, and luckily there are plenty of web hosts to choose from. You could also use a custom domain directly with GitHub Pages, bypassing a third-party host entirely.

Is this dangerous to my Rails App?

We are hosted on Heroku, and have the NewRelic add on. Every day I check the errors, and almost every day this error comes up.
Action and Type
Middleware/Rack/Rack::MethodOverride#call
EOFError
Message
bad content body
This is a Rails application, so I figure it's not doing anything in particular other than returning a 404 response status, because there is nothing at the URL they are trying to access.
URL
/wp-admin/admin-ajax.php
Through some Google-fu I found an article describing this as a brute-force attack on WordPress sites.
My specific question is:
Do I worry about this?
I inherited the site and am not sure if this is just something that happens, and whether it is something Rails applications don't have to worry about. It seems fairly targeted towards WordPress, but I can't find any documentation on whether I should be doing more to stop this.
Other frequently pinged URLs that don't exist on my application:
/sites/all/libraries/elfinder/php/connector.minimal.php
/license.php
/tiny_mce/plugins/tinybrowser/upload_file.php
Any enlightenment on the subject would be great. Stack trace available if needed. Thanks in advance, overflowers.
As long as you don't have a route configured to handle those requests, you only have to worry about being spammed with these requests and losing network resources. They'll receive a 404 Not Found error when they try to reach those paths, so there is nothing they can really do except slow your site down if they flood it with requests. If they do it often, you can ban their IP address.
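If the noise bothers you, one option is to answer these probes before they even reach the Rails router. A minimal sketch of a Rack middleware, assuming a standard Rails app (the class name and path list are illustrative, not from any gem):

    class BlockProbes
      PROBE_PATHS = [
        "/wp-admin/admin-ajax.php",
        "/license.php",
        "/tiny_mce/plugins/tinybrowser/upload_file.php",
        "/sites/all/libraries/elfinder/php/connector.minimal.php"
      ].freeze

      def initialize(app)
        @app = app
      end

      def call(env)
        if PROBE_PATHS.include?(env["PATH_INFO"])
          # Answer cheaply without touching the rest of the stack.
          [404, { "Content-Type" => "text/plain" }, ["Not Found"]]
        else
          @app.call(env)
        end
      end
    end

It could then be required from config/application.rb and registered with config.middleware.insert_before 0, BlockProbes.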

Verifying Googlebot in Rails

I am looking to implement First Click Free in my Rails application. Google has information on how to verify whether Googlebot is viewing your site here.
I have been searching to see if there is anything existing for Rails to do this, but I have been unable to find anything. So firstly, does anyone know of anything? If not, could anyone point me in the right direction on how to implement the verification suggested on that page?
Also, that solution has to do a lookup every time to try to detect Google, which seems like a big performance hit if I have to do it on every page load. I could cache an IP once it has been verified, but Google has stated that their IPs change, so at some point an IP may no longer belong to them. Then again, that probably doesn't happen regularly, so it may not be that big of an issue.
Many thanks!!
Check out the browser gem: https://github.com/fnando/browser
What I'd do is use the browser.bot? method to check if your site is being accessed by a bot or not. If you care about the Googlebot specifically, you could check if browser.name includes googlebot. Keep in mind that this gem just checks the user agent sent by the client's browser, which could of course be spoofed. Sounds like that isn't a huge concern for your purposes.
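For instance, a minimal sketch of that check in a Rails controller, assuming the browser gem is installed (the filter name is illustrative, and the User-Agent string is checked directly rather than via browser.name, since the exact name the gem reports can vary by version):

    class ApplicationController < ActionController::Base
      before_action :detect_googlebot

      private

      # Flags requests whose User-Agent claims to be Googlebot; spoofable, as noted above.
      def detect_googlebot
        browser = Browser.new(request.user_agent.to_s)
        @from_googlebot = browser.bot? && request.user_agent.to_s.downcase.include?("googlebot")
      end
    end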
I've recently built a Ruby gem for that; it's called "legitbot".
You may learn if a Web request comes from a supported bot using
bot = Legitbot.bot(userAgent, ip)
"legitbot" does this looking into User-agent and searching for a bot signature, i.e. how bots identify themselves. This doesn't guarantee that the Web request IP really comes from e.g. Googlebot. To make sure it is, call
bot.detected_as # => "Google"
bot.valid? # => true
bot.fake? # => false
Supported bots are Googlebot, Yandex bots, Bing, Baidu, DuckDuckGo.
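On the performance concern from the question: the result of the DNS verification can be cached per IP with a short expiry, so an IP that stops belonging to Google only goes unnoticed until the cache entry expires. A minimal sketch, assuming a Rails app using legitbot (the helper name and expiry are illustrative):

    # Returns true if this request comes from a bot whose IP was DNS-verified.
    def verified_bot?(request)
      Rails.cache.fetch(["legitbot-verified", request.remote_ip], expires_in: 12.hours) do
        bot = Legitbot.bot(request.user_agent, request.remote_ip)
        !!(bot && bot.valid?)
      end
    end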

How to block requests to server with user name / password?

We have realized that the URL http://Keyword:redacted@example.com/ redirects to http://example.com/ when copied and pasted into the browser's address bar.
As far as I understand, this syntax might be used in some FTP connections, but we have no such use on our website. We suspect that we are being targeted by an attack, and we have been warned by Google that we are passing PII (mostly email addresses) in our URL requests to their Google Adsense network. We have not been able to find the source, but we have been warned that the violation is in the form of http://Keyword:redacted@example.com/
How can we stop this from happening?
What URL redirect method we can use to not accept this and return an error message?
FYI, I experienced a similar issue for a client website and followed up with Adsense support. The matter was escalated to a specialist team, who investigated and determined that flagged violations with the format http://Keyword:redacted@example.com/ will be considered false positives. I'm not sure if this applies to all publishers or was specific to our case, but it might be worth following up with Adsense support.
There is nothing you can do. This is handled entirely by your browser long before it even thinks about "talking" to your server.
That's a strange URL for people to copy/paste into the browser's address bar unless they have been told/trained to do so. Your best bet is to tell them to STOP IT! :-)
I suppose you could look at the HTTP Authorization headers and report an error if they come in populated... (This would be $_SERVER['PHP_AUTH_USER'] in PHP.) I've never looked at these values when the server hasn't requested them, so I'm not sure whether it would work or not...
The syntax http://abc:def@something.com means you're sending userid='abc', password='def' as basic authentication parameters. Your browser will pull out the userid and password and send them along as authentication information, leaving the URL without them.
As Peter Bowers mentioned, you could check the authorization headers and see if they're coming in that way, but you can't stop others from doing it if they want. If it happens a lot, I'd suspect that somewhere there's a web form asking users to enter their user/password and it's getting encoded that way. One way to sleuth it out would be to see if you can identify someone by the userid specified.
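If the site happens to run on a Rack-based stack rather than PHP, the equivalent of that check would look roughly like this (a sketch only; as noted above, it's unclear whether browsers forward unsolicited credentials at all):

    # In a Rails controller: reject requests that arrive with Basic credentials
    # we never asked for.
    before_action do
      head :bad_request if request.headers["Authorization"].to_s.start_with?("Basic ")
    end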
Having Keyword:redacted sounds odd. It's possible Google Adsense changed the values to avoid including confidential info.

Google indexed my domain anyway?

I have a robots.txt like below but Google has still indexed my domain. Basically they've indexed mydomain.com but not mydomain.com/any_page
UserAgent: *
Disallow: /
I mean, how can I go back further than /, which I thought was the root of the domain?
Note that this domain is a work in progress, hence I don't want Google or any other search engine seeing it for the moment.
If you don't have one already, get a Google Webmaster Tools account. It includes a URL removal tool that may work for you.
This doesn't address the problem of search engines possibly ignoring or misinterpreting your robots.txt file, of course.
If you REALLY want your site to be off the air until it's launched, your best bet is to actually take it off the air. Make the site inaccessible except by password. If you put HTTP Basic authentication on your documentroot, then no search engine will be able to index anything, but you'll have full access with a password.
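For example, if the site happens to be served by a Rack-based app, a minimal sketch in config.ru would be the following (on Apache or nginx the usual htpasswd/basic-auth configuration achieves the same thing; the realm, username, and app name are placeholders):

    # config.ru -- password-protect the whole site while it's a work in progress.
    use Rack::Auth::Basic, "Private preview" do |username, password|
      # Set PREVIEW_PASSWORD in the environment; empty passwords are rejected.
      username == "preview" && !password.to_s.empty? && password == ENV["PREVIEW_PASSWORD"]
    end
    run MyApp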

Resources