Verifying Googlebot in Rails

Verifying Googlebot in Rails - ruby-on-rails

I am looking to implement First Click Free in my rails application. Google has this information on how to verify a if a googlebot is viewing your site here.
I have been searching to see if there is anything existing for Rails to do this but I have been unable to find anything. So firstly, does anyone know of anything? If not, could anyone point me in the right direction of how to go about implementing what they have suggested in that page about how to verify?
Also, in that solution, it has to do a lookup every time to try and detect google, that seems like its going to be a big performance hit if I have to do it every page load? I could cache the IP if it has been verified in the past but Google have stated that their IP's change so at some point it may no longer belong to them. Although it probably doesn't happen regularly so it may not be that big of an issue.
Many thanks!!

Check out the browser gem: https://github.com/fnando/browser
What I'd do is use the
browser.bot?
method to check if your site is being accessed by a bot or not. If you care about the Googlebot specifically, you could check if
browser.name
includes googlebot. Keep in mind that this gem just checks the user agent sent by the client's browser, which could of course be spoofed. Sounds like that isn't a huge concern for your purposes.

I've built a Ruby gem for that recently, it's called "legitbot".
You may learn if a Web request comes from a supported bot using
bot = Legitbot.bot(userAgent, ip)
"legitbot" does this looking into User-agent and searching for a bot signature, i.e. how bots identify themselves. This doesn't guarantee that the Web request IP really comes from e.g. Googlebot. To make sure it is, call
bot.detected_as # => "Google"
bot.valid? # => true
bot.fake? # => false
Supported bots are Googlebot, Yandex bots, Bing, Baidu, DuckDuckGo.

Related

Google script origin request url

I'm developing a Google Sheets add-on. The add-on calls an API. In the API configuration, a url like https://longString-script.googleusercontent.com had to be added to the list of urls allowed to make requests from another domain.
Today, I noticed that this url changed to https://sameLongString-0lu-script.googleusercontent.com.
The url changed about 3 months after development start.
I'm wondering what makes the url to change because it also means a change in configuration in our back-end every time.
EDIT: Thanks for both your responses so far. Helped me understand better how this works but I still don't know if/when/how/why the url is going to change.
Quick update, the changing part of the url was "-1lu" for another user today (but not for me when I was testing). It's quite annoying since we can't use wildcards in the google dev console redirect uri field. Am I supposed to paste a lot of "-xlu" uris with x from 1 to like 10 so I don't have to touch this for a while?

For people coming across this now, we've also just encountered this issue while developing a Google Add-on. We've needed to add multiple origin urls to our oauth client for sign-in, following the longString-#lu-script.googleusercontent.com pattern mentioned by OP.
This is annoying as each url has to be entered separately in the authorized urls field (subdomain or wildcard matching isn't allowed). Also this is pretty fragile since it breaks if Google changes the urls they're hosting our add-on from. Furthermore I wasn't able to find any documentation from Google confirming that these are the script origins.

URLs are managed by the host in various ways. At the most basic level, when you build a web server you decide what to call it and what to call any pages on it. Google and other large content providers with farms of servers and redundant data centers and everything are going to manage it a bit differently, but for your purposes, it will be effectively the same in that ... you need to ask them since they are the hosting provider of your cloud content.
Something that MIGHT be related is that Google rolled out some changes recently dealing with the googleusercontent.com domain and picassa images (or at least was scheduled to do so.) So the google support forums will be the way to go with this question for the freshest answers since the cause of a URL change is usually going to be specific to that moment in time and not something that you necessarily need to worry about changing repeatedly. But again, they are going to need to confirm that it was something related to the recent planned changes... or not. :-)
When you find something out you can update this question in case it is of use to others. Especially, if they tell you that it wasn't a one time thing dealing with a change on their end.

This is more likely related to Changing origin in Same-origin Policy. As discussed:
A page may change its own origin with some limitations. A script can set the value of document.domain to its current domain or a superdomain of its current domain. If it sets it to a superdomain of its current domain, the shorter domain is used for subsequent origin checks.
For example, assume a script in the document at http://store.company.com/dir/other.html executes the following statement:
document.domain = "company.com";
After that statement executes, the page can pass the origin check with http://company.com/dir/page.html
So, as noted:
When using document.domain to allow a subdomain to access its parent securely, you need to set document.domain to the same value in both the parent domain and the subdomain. This is necessary even if doing so is simply setting the parent domain back to its original value. Failure to do this may result in permission errors.

Is it possible to ensure that requests come from a specific domain?

I'm making a Rails polling site, which should have results that are very accurate. Users vote using POST links. I've taken pains to make sure users only vote once, and know exactly what they're voting for.
But it occurred to me that third parties with an interest in the results could put up POST links on their own websites, that point to my voting paths. They could skew my results this way, for example by adding a misleading description.
Is there any way of making sure that the requests can only come from my domain? So a link coming from a different domain wouldn't run any of the code in my controller.

There are various things that you'll need to check. First is request.referer, which will tell you the page that referred the link to your site. If it's not your site, you should reject it.
if URI(request.referer).host != my_host
raise ArgumentError.new, "Invalid request from external domain"
end
However, this only protects you from web clients (browsers) that accurately populate the HTTP referer header. And that's assuming that it came from a web page at all. For instance, someone could send a link by email, and an email client is unlikely to provide a referer at all.
In the case of no referer, you can check for that, as well:
if request.referer.blank?
raise ArgumentError.new, "Invalid request from unknown domain"
elsif URI(request.referer).host != my_host
raise ArgumentError.new, "Invalid request from external domain"
end
It's also very easy with simple scripting to spoof the HTTP 'referer', so even if you do get a valid domain, you'll need other checks to ensure that it's a legitimate POST. Script kiddies do this sort of thing all the time, and with a dozen or so lines of Ruby, python, perl, curl, or even VBA, you can simulate interaction by a "real user".
You may want to use something like a request/response key mechanism. In this approach, the link served from your site includes a unique key (that you track) for each visit to the page, and that only someone with that key can vote.
How you identify voters is important, as well. Passive identification techniques are good for non-critical activities, such as serving advertisements or making recommendations. However, this approach regularly fails a measurable percentage of the time when used across the general population. When you also consider the fact that people actually want to corrupt voting activities, it's very easy to suddenly become a target for everyone with a good concept to "beat the system" and some spare time on their hands.
Build in as much security as possible early on, because you'll need far more than you expect. During the 2012 Presidential Election, I was asked to pre-test 41 online voting sites, and was able to break 39 of them within the first 24 hours (6 of them within 1 hour). Be overly cautious. Know how attackers can get in, not just using "normal" mechanisms. Don't publish information about which technologies you're using, even in the code. Seeing "Rails-isms" anywhere in the HTML or Javascript code (or even the URL pathnames) will immediately give the attacker an enormous edge in defeating your safety mechanisms. Use obscurity to your advantage, and use security everywhere that you can.
NOTE: Checking the request.referer is like putting a padlock on a bank vault: it'll keep out those that are easily dissuaded, but won't even slow down the determined individual.

What you are trying to prevent here is basically cross-site request forgery. As Michael correctly pointed out, checking the Referer header will buy you nothing.
A popular counter-measure is to give each user an individual one-time token that is sent with each form and stored in the user's session. If, on submit, the submitted value and the stored value do not match, the request is disgarded. Luckily for you, RoR seems to ship such a feature. Looks like a one-liner indeed.

How to block requests to server with user name / password?

We have realized that this URL http://Keyword:redacted#example.com/ redirects to http://example.com/ when copied and pasted into the browser's address bar.
As far as I understand this might be used in some ftp connections but we have no such use on our website. We are suspecting that we are targeted by an attack and have been warned by Google that we are passing PII (mostly email addresses) in our URL requests to their Google Adsense network. We have not been able to find the source, but we have been warned that the violation is in the form of http://Keyword:redacted#example.com/
How can we stop this from happening?
What URL redirect method we can use to not accept this and return an error message?

FYI I experienced a similar issue for a client website and followed up with Adsense support. The matter was escalated to a specialist team who investigated and determined that flagged violations with the format http://Keyword:redacted#example.com/ will be considered false positives. I'm not sure if this applies to all publishers or was specific to our case, but it might be worth following up with Adsense support.

There is nothing you can do. This is handled entirely by your browser long before it even thinks about "talking" to your server.
That's a strange URL for people to copy/paste into the browser's address bar unless they have been told/trained to do so. Your best bet is to tell them to STOP IT! :-)
I suppose you could look at the HTTP Authorization Headers and report an error if they come in populated... (This would $_SERVER['PHP_AUTH_USER'] in PHP.) I've never looked at these values when the header doesn't request them, so I'm not sure if it would work or not...

The syntax http://abc:def#something.com means you're sending userid='abc', password='def' as basic authentication parameters. Your browser will pull out the userid & password and send them along as authentication information, leaving the url without them.
As Peter Bowers mentioned, you could check the authorization headers and see if they're coming in that way, but you can't stop others from doing it if they want. If it happens a lot then I'd suspect that somewhere there's a web form asking users to enter their user/password and it's getting encoded that way. One way to sleuth it out would be to see if you can identify someone by the userid specified.
Having Keyword:redacted sounds odd. It's possible Google Adsense changed the values to avoid including confidential info.

google bot, false links

I have a little problem with google bot, I have a server working on windows server 2009, the system called Workcube and it works on coldfusion, there is an error reporter built-in, thus i recieve every message of error, especially it concerned with google bot, that trying to go to a false link, which doesn't exist! the links looks like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
of course with definition like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg is false, i don't have any idea what can be the problem?! need advice! thank you all for help! ;)

You might want to verify your site with Google Webmaster Tools which will provide URLs that it finds that error out.
Your logs are also valid, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
Once you've verified it's the real Googlebot you can start troubleshooting. You see Googlebot won't request URLs that it hasn't naturally seen before, meaning Googlebot shouldn't be making direct object reference requests. I suspect it's a rogue bot with a User Agent of Googlebot, but if it's not you might want to look through your site to see if you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googelbot will see the links from Stack Overflow and continue to crawl them because it'll be in their crawl queue.
I'd suggest 301 redirecting these URLs to someplace that make sense to your users. Otherwise I would 404 or 410 these pages so Google know to remove these pages from their index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
Unfortunately there's no real good way of telling Googlebot to never ever crawl these URLs again. You can always go into Google Webmaster Tools and request the URLs to be removed from their index which may stop Googlebot from crawling them again, but that doesn't guarantee it.

Ruby on rails (based on Mephisto) - Unable to contact server

I am completely new to ruby and I inherited a ruby system for a product catalogue. Most of my users are able to view everything as they should but overseas users (specifically Mexico) cannot contact the server once logged in. They are an active user. I'm sorry I cannot be more specific, and the system is private so I cannot grant access.
Has anyone had any issues similar to this before? Is it a user-end issue or a system error?

Speaking as somebody who regularly ends up on your user's side of the fence, the number one culprit for this symptom is "Clueless administrator". There are many, many sites which generically block either large blocks of IP space or which geolocate and carve out big portions of the world.
For example, a surprising number of American blogs block Asian countries (including Japan) out of a misplaced effort to avoid DDOS attacks (which actually probably originated in Russia or China but, hey, this species of administrator isn't very good on fine tuning solutions). I have to hop over to my American proxy server to access those sites.
So the first thing I'd do to diagnose your problems is to see whether your Mexican users are making it to the server at all, or whether they're being blocked somewhere earlier (router? firewall? etc). Then, to determine whether the problem is on your end or their end, I'd try to replicate the issue with you proxying your connection through a Mexican proxy and repeating the actions they took to cause the issue.
The fact that they get blocked after logging in could indicate that you have https issues , for example with an HTTPS accelerator installed [1], or it could be that your frontend server is properly serving up the static content but doing the checking on dynamic requests only.
[1] We've seen some really weird bugs at work caused by a malfunctioning HTTPS accelerator.

If it's working for everyone else then it would appear that the problem is not with Ruby or Rails working, since they are...
My first thought would be to check for a network issue: are the Mexican users all behind the same proxy server and/or firewall?
Is login handled within the Rails application or via some other resource? Can you see any evidence that requests from Mexican users are reaching your web server at all?

Login is handled by the rails app. Am currently trying to hunt down the logs, taking some time as again I am new to this system.
Cheers guys

Maybe INS is cracking down on cyber-immigration.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart