Avoid robots from going into a www.domain.com/thishash when link posted to twitter, facebook - ruby-on-rails

I'm building a service where people gets notified (mails) when they follow a link with the format www.domain.com/this_is_a_hash. The people that use this server can share this link on different places like, twitter, tumblr, facebook and more...
The main problem I'm having is that as soon as the link is shared on any of this platforms a lot of request to the www.domain.com/this_is_a_hash are coming to my server. The problem with this is that each time one of this requests hits my server a notification is sent to the owner of the this_is_a_hash, and of course this is not what I want. I just want to get notifications when real people is going into this resource.
I found a very interesting article here that talks about the huge amount of request a server receives when posting to twitter...
So what I need is to avoid search engines to hit the "resource" url... the www.mydomain.com/this_is_a_hash
Any idea? I'm using rails 3.
Thanks!

If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.
User-agent: *
Disallow: /
(That would block all URLs for all user-agents. You may want to add a folder to block only those URLs inside of it. Or you could add the forbidden URLs dynamically as they get created, however, some bots might cache the robots.txt for some time so they might not recognize that a new URL should be blocked, too.)
It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.
If your users would copy&paste HTML, you could make use of the nofollow link relationship type:
cute cat
However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.
Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.
But I assume they only copy&paste the plain URL, so this wouldn’t work anyway.
So the only chance you have is to decide if it’s a bot or a human after the link got clicked.
You could check for user-agents. You could analyze the behaviour on the page (e.g. how long it takes for the first click). Or, if it’s really important to you, you could force the users to enter a CAPTCHA to be able to see the page content at all. Of course you can never catch all bots with such methods.
You could use analytics on the pages, like Piwik. They try to differentiate users from bots, so that only users show up in the statistics. I’m sure most analytics tools provide an API that would allow sending out mails for each registered visit.

Related

Receive notification when site server adds page

I've been doing some programming off and on for my brother, who is a stock trader. I'm wondering if it is possible to receive a push notification when a site server adds a page. For example, the site smallcapfortunes.com frequently adds pages that are simple extensions off the main URL. For example, the site regularly adds pages under URLs such as /neca/, /stev/, etc.
Are there existing methods to execute this? Or is this something I need to write myself? Has anyone here written anything like that?
I know there are existing sites to track basic updates to a single page. In my research, though, I haven't found anything like this.
Please let me know if there are any other details I need to provide.
Generally you can only get a push notification if a specific website offers that service.
Some websites publish a structured (XML) site map. If the one you're interested in does that, you could pull that sitemap on a regular basis and look for differences.
you're most likely going to want to use http://scrapy.org/ to go through the site and find new /neca/ and /stev/ urls, etc, then just trigger the script every so often.

Tracking users' clicks and page visits in Rails

I would like to monitor users' page visits and clicks in my Rails app to make recommendations. My questions are:
Is there a Rails gem for this, or Google Analytics is the standard? If latter is true, then how should I link a page visit to a particular user profile?
It is typical in Rails to have a section in application.html.erb, which is shared for all pages. If I add Google Analytics pageview tracking code to in application.html.erb, will it be able to track all individual pages?
There are other ways, but the vast majority probably use Google Analytics. Several gems exist that help you integrate with GA to get at the data. See here: https://www.ruby-toolbox.com/categories/Web_Analytics.
Based on your first question, it seems you may want more insight than GA can provide. I've used ClickTale (http://www.clicktale.com) and Woopra (http://www.woopra.com) before, to good effect. This article lists several other alternatives, too - notice the high marks for Clicky: http://imimpact.com/web-stats-alternatives-to-google-analytics/.
Google Analytics (and almost all of these others) will take care of your second question automatically whenever the user loads a new page, since it keyed by URL. That means that, although you put the GA script code in a single place, each unique page is tracked individually.
If you have AJAX requests that change that page without changing the URL, you'll need to dig in to the GA script API. Essentially you'll need to push a new url (possibly with a # in it) whenever you want to track an AJAX-driven link/button click. See here: http://davidwalsh.name/ajax-analytics
I am biased, but I would recommend checking out impressionist, if you need to integrate the page views into the app in real-time. With analytics you will always have some lag time and you are also relying on an external dependency. Impressionist is good if you need this kind of control, but if you are just looking for simple metrics and don't need to pull them into the app, then analytics is probably the way to go.
Check out Ahoy, at https://github.com/ankane/ahoy. With just a few lines of code in your app, you can track page views and tie them to user accounts.
You can further customize Ahoy to track custom events, both the client (with JavaScript) and server.
Ahoy does not depend on any third-party services.

Protecting website content from crawlers

The contents of a commerce website (ASP.NET MVC) are regularly crawled by the competition. These people are programmers and they use sophisticated methods to crawl the site so identifying them by IP is not possible.
Unfortunately replacing values with images is not an option because the site should still remain readable by screen readers (JAWS).
My personal idea is using robots.txt: prohibit crawlers from accessing one common URL on the page (this could be disguised as a normal item detail link, but hidden from normal users Valid URL: http://example.com?itemId=1234 Prohibited: http://example.com?itemId=123 under 128). If an IP owner entered the prohibited link show a CAPTCHA validation.
A normal user would never follow a link like this because it is not visible, Google does not have to crawl it because it is bogus. The issue with this is that the screen reader still reads the link and I don't think that this would be so effective to be worth implementing.
Your idea could possibly work for a few basic crawlers but would be very easy to work around. They would just need to use a proxy and do a get on each link from a new IP.
If you allow anonymous access to your website then you can never fully protect your data. Even if you manage to prevent crawlers with lots of time and effort they could just get a human to browse and capture the content with something like fiddler. The best way to prevent your data being seen by your competitors would be to not put it on a public part of your website.
Forcing users to log in might help matters, at least then you could pick up who is crawling your site and ban them.
As mentioned, its not really going to be possible to hide publicly accessible data from a determined user, however, as these are automated crawlers, you could make life harder for them by altering the layout of your page regularly.
It is probably possible to use different master pages to produce the same (or similar) layouts, and you could swap in the master page on a random basis - this would make the writing of an automated crawler that bit more difficult.
I am about to get to the phase of protecting my content from crawlers either.
I am thinking of limiting what an anonymous user can see of the website and require them to register for a full functionality.
example:
public ActionResult Index()
{
if(Page.User.Identity.IsAuthorized)
return RedirectToAction("IndexAll");
// show only some poor content
}
[Authorize(Roles="Users")]
public ActionResult IndexAll()
{
// Show everything
}
Since you know users now, you can punish any crawler.

Why do some websites have "#!" in the URL [duplicate]

I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, like for a certain Ajax framework or something since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
The octothorpe/number-sign/hashmark has a special significance in an URL, it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of an URL. If you use Wikipedia, you will see that most pages have a table of contents and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';) to a different URL or without specifying an anchor, the browser will load the entire page from the URL. Try this in Firebug or Safari's Javascript console. Load http://minimal-github.gilesb.com/raganwald. Now in the Javascript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of the The Single Page Interface Manifesto cited by raganwald
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used in FaceBook and Twitter is the use of hash # in URLs
The character ! is added only for Google purposes, this notation is a Google "standard" for crawling web sites intensive on AJAX (in the extreme Single Page Interface web sites). When Google's crawler finds an URL with #! it knows that an alternative conventional URL exists providing the same page "state" but in this case on load time.
In spite of #! combination is very interesting for SEO, is only supported by Google (as far I know), with some JavaScript tricks you can build SPI web sites SEO compatible for any web crawler (Yahoo, Bing...).
The SPI Manifesto and demos do not use Google's format of ! in hashes, this notation could be easily added and SPI crawling could be even easier (UPDATE: now ! notation is used and remains compatible with other search engines).
Take a look to this tutorial, is an example of a simple ItsNat SPI site but you can pick some ideas for other frameworks, this example is SEO compatible for any web crawler.
The hard problem is to generate any (or selected) "AJAX page state" as plain HTML for SEO, in ItsNat is very easy and automatic, the same site is in the same time SPI or page based for SEO (or when JavaScript is disabled for accessibility). With other web frameworks you can ever follow the double site approach, one site is SPI based and another page based for SEO, for instance Twitter uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
To have a good follow-up about all this, Twitter - one of the pioneers of hashbang URL's and single-page-interface - admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
Article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.

How do search engines see dynamic profiles?

Recently search engines have been able to page dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi frequently. Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username, is not an actual file stored on disk but is shorthand for a query like: select username from users and display the information on the page. How does Google know about every user, this gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know about. Do I need to be as big as twitter or facebook in order for google to make special ways to crawl my site? Will google automatically find my users profiles if I allow anyone to view them? If not what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say..
As far as I know Google isn't able to read and store the actual contents of profiles, because the Google bot doesn't have a Facebook account, and it would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the page it hits, it stores. So even if it follows a dynamic url like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully in that particular case, it isn't all the private data of said user.
Additionally, facebook can and does provide special instructions that search bots can follow, so that google results don't include a bunch of login pages.
profiles can be linked from outside;
site may provide sitemap

Resources