I notice that Stack Overflow has a view count for each question, and that these view numbers are fairly low and accurate.
I have something similar on one of my sites: the backend code logs a "hit" whenever the page is loaded. Unfortunately it also does this for search engine requests, giving bloated and inaccurate numbers.
I guess one way to avoid counting a robot would be to do the view counting with an AJAX call once the page has loaded, but I'm sure there are other, better ways to ignore search engines in your hit counters while still letting them in to crawl your site. Do you know of any?
An AJAX call will do it, but usually search engines will not load images, javascript or CSS files, so it may be easier to include one of those files in the page, and pass the URL of the page you want to log a request against as a parameter in the file request.
For example, in the page...
http://www.example.com/example.html
You might include in the head section
<link href="empty.css?log=example.html" rel="stylesheet" type="text/css" />
And have your server side log the request, then return an empty CSS file. The same approach would apply to a JavaScript or image file, though in all cases you'll want to look carefully at what caching might take place.
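The suggestion above is language-agnostic, so purely as an illustration, a minimal Node.js sketch of an endpoint that logs the log parameter and returns an empty stylesheet might look like this (the hits.log file name and the port are made-up assumptions):

const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
  const url = new URL(req.url, 'http://localhost');
  if (url.pathname === '/empty.css') {
    // Record the page name passed in the "log" query parameter.
    const page = url.searchParams.get('log') || 'unknown';
    fs.appendFile('hits.log', new Date().toISOString() + ' ' + page + '\n', () => {});
    // Return an empty stylesheet and discourage caching so every view is logged.
    res.writeHead(200, { 'Content-Type': 'text/css', 'Cache-Control': 'no-store' });
    res.end('');
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);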
Another option would be to eliminate the search engines based on their user agent. There's a big list of possible user agents at http://user-agents.org/ to get you started. Of course, you could go the other way, and only count requests from things you know are web browsers (covering IE, Firefox, Safari, Opera and this newfangled Chrome thing would get you 99% of the way there).
Even easier would be to use a log analysis tool like AWStats or a service like Google Analytics, both of which have already solved this problem.
To solve this problem I implemented a simple filter that would look at the User-Agent header in the HTTP request and compare it to a list of known robots.
I got the robot list from www.robotstxt.org. It's downloadable in a simple text-format that can easily be parsed to auto-generate the "blacklist".
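A minimal sketch of that kind of filter in JavaScript (the robot names below are only a tiny illustrative sample; the real blacklist would be generated from the robotstxt.org database):

// Substrings of User-Agent values for known robots (illustrative sample only).
const KNOWN_ROBOTS = ['googlebot', 'bingbot', 'slurp', 'baiduspider', 'yandexbot'];

function isKnownRobot(userAgent) {
  const ua = (userAgent || '').toLowerCase();
  return KNOWN_ROBOTS.some(robot => ua.includes(robot));
}

// Count the hit only when the request does not look like a crawler.
function shouldCountHit(requestHeaders) {
  return !isKnownRobot(requestHeaders['user-agent']);
}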
You don't really need to use AJAX, just use JavaScript to add an off-screen iframe. KEEP IT SIMPLE
<script type="javascript">
document.write('<iframe src="myLogScript.php" style="visibility:hidden" width="1" height="1" frameborder="0">');
</script>
An extension to Matt Sheppard's answer might be something like the following:
<script type="text/javascript">
var thePg=window.location.pathname;
var theSite=window.location.hostname;
var theImage=new Image;
theImage.src="/test/hitcounter.php?pg=" + thePg + "?site=" + theSite;
</script>
which can be plugged into a page header or footer template without needing to substitute the page name server-side. Note that if you include the query string (window.location.search), a robust version of this should encode the string to prevent evildoers from crafting page requests that exploit vulnerabilities based on weird stuff in URLs. The nice thing about this vs. a regular <img> tag or <iframe> is that the user won't see a red X if there is a problem with the hitcounter script.
In some cases it's also important to know the URL that the browser saw, before any server-side rewrites, and this gives you that. If you want it both ways, then add another parameter server-side that inserts that version of the page name into the query string as well.
An example of the log files from a test of this page:
10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/testpage.html HTTP/1.1" 200 306 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
10.1.1.17 - - [13/Sep/2008:22:21:00 -0400] "GET /test/hitcounter.php?pg=/test/testpage.html&site=www.home.***.com HTTP/1.1" 301 - "http://www.home.***.com/test/testpage.html" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16"
The reason Stack Overflow has accurate view counts is that it only counts each view once per user.
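Stack Overflow's actual implementation isn't shown here, but a rough sketch of per-user deduplication might look like this (the in-memory Set and the incrementViewCount call are placeholders for real persistence):

// Remember which (question, user) pairs have already been counted.
const seenViews = new Set();

function recordView(questionId, userKey) {
  const key = questionId + ':' + userKey;   // userKey could be a user id, cookie or IP
  if (seenViews.has(key)) return false;     // this user already viewed this question
  seenViews.add(key);
  incrementViewCount(questionId);           // hypothetical persistence call
  return true;
}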
Third-party hit counter (and web statistics) applications often filter out search engines and display them in a separate window/tab/section.
You are either going to have to do what you said in your question with AJAX, or exclude User-Agent strings that are known search engines. The only sure way to stop bots is with AJAX.
I am working on a news-based application in which I want to fetch a site's feed dynamically by just typing the website's name.
For example: if I want to fetch the feed from CNN.com or BBCNEWS.com, I just want to type the website name (like "BBC.com") into a textbox instead of its full RSS URL, such as
http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml.
I know how to fetch a feed from a static link, but I want to do it dynamically.
I have searched a lot but haven't found an answer. I have seen this done in the Feedly application.
So if anybody knows how, please help me with this issue.
RSS comes with a mechanism call Auto-Discovery which links RSS feeds to an HTML page.
It relies on the use of a <link> element in the <head> section of any HTML page.
The <link> tag includes 4 important attributes:
- rel should include alternate, which tells the application that the linked document contains an alternate view of the current document/page. You can also use the feed value, though in our experience this is much less frequent; using both is probably a safe bet.
- type indicates the MIME type of this alternate representation. RSS uses application/rss+xml while Atom uses application/atom+xml.
- title is a human-readable description of the document. It's good to re-use the page's title. Do not add "RSS", as it's meaningless for people :)
- href is the most important attribute: it's the URL (relative or absolute) of the feed.
Here’s, for example, the discovery for this page's very RSS feed:
<link rel="alternate" type="application/atom+xml" title="Feed for question 'iOS RSSFeed, How to fech feed automatic from website'" href="/feeds/question/32946522">
It's a great example!
In the HTML of the site, you'll find a snippet like this
<link rel='alternate' type='application/rss+xml' title='RSS' href='http://feeds.feedburner.com/martini'>
That's where the RSS URL comes from.
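The question is about iOS, but just to illustrate the auto-discovery mechanism, here is a small JavaScript sketch that extracts feed URLs from a page's HTML (on iOS you would do the equivalent with whatever HTML parser you use):

// Given the HTML of a site's front page and its URL, return the advertised feed URLs.
function discoverFeeds(html, baseUrl) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const links = doc.querySelectorAll(
    'link[type="application/rss+xml"], link[type="application/atom+xml"]'
  );
  // Resolve relative hrefs (like "/feeds/question/32946522") against the page URL.
  return Array.from(links).map(link =>
    new URL(link.getAttribute('href'), baseUrl).toString()
  );
}

// Hypothetical usage: discoverFeeds(bbcFrontPageHtml, 'http://www.bbc.com/')
// would return the RSS/Atom URLs that the BBC advertises in its <head>.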
I have a dynamic MVC4 / jQuery Mobile application that works quite well for the most part. I have an auto-posting dropdown list that selects a list from the database via the following code.
<script type="text/javascript">
$(function () {
$("#TownID").live('change', function () {
//$("#TownID").change(function () {
var actionUrl = $('#TheForm1').attr('action') + '/' + $('#TownID').val();
$('#TheForm1').attr('action', actionUrl);
$('#TheForm1').submit();
});
});
</script>
<p>
@using (Html.BeginForm("SearchTown", "Home", FormMethod.Post, new { id = "TheForm1" }))
{
    @Html.DropDownList("TownID", (SelectList)ViewBag.TownId, "Select a Town")
}
</p>
The problem is it only works properly the first time a search is performed unless I click refresh. I don’t think this has anything to do with MVC, I think the problem is with AJAX and jQuery Mobile.
Edit:
The first time I search, www.mysite.com/Home/Search/2 yields a result and works fine, but the second time something seems to be left behind in the DOM??? and it looks for:
www.mysite.com/Home/Search/2/2 also
I get 404 errors in my log and “Error Loading Page” but it still finds the results and displays the page correctly!
Then with a third search I get the 404 errors in my log and "Error Loading Page", but the list has grown and now it looks for:
www.mysite.com/Home/Search/2/2
www.mysite.com/Home/Search/2/2/2 also
This then continues to grow after every search until, at some seemingly random point in each test, it seems to give up and I get error 505.
Additional Edit:
The code works perfectly if I take jQuery Mobile out of the question
Can anyone tell me what might be going on here?
Get rid of: $(function () {
And replace it with: $(document).delegate('[data-role="page"]', 'pageinit', function () {
Please read the big yellow sections at the top of this page: http://jquerymobile.com/demos/1.1.0/docs/api/events.html
You can't rely on document.ready or any other event that only fires once per page. Instead you have to get used to using jQuery Mobile's custom page events like pageinit so your code will work no-matter when the page is added to the DOM (which you don't know when this will happen in a jQuery Mobile website). There are a ton of events, so again, please read the documentation I linked-to above.
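Applied to the code in the question, that might look roughly like the following sketch (keeping the original action-appending logic unchanged, just moving the binding into pageinit):

$(document).delegate('[data-role="page"]', 'pageinit', function () {
    // Bind the change handler each time jQuery Mobile injects this page into the DOM.
    $('#TownID', this).change(function () {
        var actionUrl = $('#TheForm1').attr('action') + '/' + $(this).val();
        $('#TheForm1').attr('action', actionUrl);
        $('#TheForm1').submit();
    });
});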
Firstly, dynamically generated html using a server side templating engine blows. I really don't understand what value people see in it.
My guess is that it used to make sense 10 years ago before AJAX became popular, and has just hung in there ever since because people have this feeling that it is "the right way to do it". It isn't. ESPECIALLY for mobile web apps.
Secondly, it looks like you are trying to do a pretty simple search. All this MVC4 garbage makes it difficult for you to see what is really happening, though. You don't need to append parameters to your URL for a simple form submission like this. In fact your TownId should already be part of the POST data when you submit, so you can just remove the URL modification bit.
Alternatively, don't use a form submission, but just a GET and AJAX. I don't know what your app is doing here, but I imagine you want to display the results on the page dynamically somehow, so a GET is more than enough.
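Ignoring for a moment the jQuery Mobile page-injection issue covered in the other answer, a minimal jQuery sketch of that GET-plus-AJAX alternative could be (the /Home/Search URL and the #results container are assumptions, not taken from the question):

$('#TownID').change(function () {
    // Fetch the search results for the selected town and inject them into the page.
    $.get('/Home/Search/' + $(this).val(), function (html) {
        $('#results').html(html);
    });
});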
Use your developer browser tools (F12) to see what exactly is getting submitted when you do the submit - it really helps. And for your next project, abandon MVC4! "Well established design patterns" my foot.
This problem bothered me for a long time. I think there were duplicate select elements in the DOM, so I used $('.SelectCSS:last').val() and it seems to work well. (I'm from China and my English is poor...)
I guess this is one for the future; MVC and jQuery Mobile don't seem to blend completely right now. Maybe MS's response to the issue is Single Page Applications! Maybe an SPA would satisfy Danial too?
As the title said, I have some DOM manipulation tasks. For example, I want to:
- find all H1 elements which have blue color.
- find all text which has a 12px font size.
- etc.
How can I do it with Rails?
Thank you.. :)
Update
I have been doing some research about extracting web page content based on this paper-> http://www.springerlink.com/index/A65708XMUR9KN9EA.pdf
The summary of the steps is:
get the web URL I want to extract from (a single web page)
grab some elements from the web page based on some visual rules (e.g. grab all H1 elements which have blue color)
process the elements with my algorithm
save the result into my database.
-sorry for my bad english-
If what you're trying to do is manipulate HTML documents inside a rails application, you should take a look at Nokogiri.
It uses XPath to search through the document. With the following, you would find every link with the "blue" CSS class inside an h1 element.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.stackoverflow.com'))
doc.xpath('//h1/a[@class="blue"]').each do |link|
puts link.content
end
On the other hand, if what you're trying to do is parse the current page's DOM, you should take a look at JavaScript and jQuery. Rails can't do that.
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
To reliably sort out what color an arbitrary element on a webpage is, you would need to reverse engineer a browser (to accurately take into account stylesheets, markup hacks, broken tags, images, etc).
A far easier approach would be to embed an existing browser such as gecko into a custom application of your making.
As your spider would browse pages, it would pass them to your embedded instance of gecko where you could use getComputedStyle to pull what color an individual element happens to be.
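For illustration, the getComputedStyle check running inside the embedded browser could be as simple as this sketch (rgb(0, 0, 255) is the computed form of the CSS keyword blue):

// Collect every h1 whose rendered text colour is blue, whichever stylesheet set it.
var blueHeadings = [];
var headings = document.getElementsByTagName('h1');
for (var i = 0; i < headings.length; i++) {
  // getComputedStyle reports colours in rgb() form.
  if (window.getComputedStyle(headings[i], null).color === 'rgb(0, 0, 255)') {
    blueHeadings.push(headings[i]);
  }
}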
You originally mentioned wanting to use Ruby on Rails for this project. Rails is a framework for writing presentational applications and is really a bad fit for a project like this.
As a starting point, I'd recommend you check out RubyGnome, and in particular RubyGnome's Gtk::MozEmbed functionality.
I'm setting up an e-mail form and I need to be able to check for bots and filter them quietly. The site runs ASP.NET MVC. I'd like to avoid CAPTCHA. Any ideas?
Add a new input field, label it "Please leave blank", hide it using CSS, and ignore the post if that field is filled in. Something like this:
<style type='text/css'>
#other_email_label, #other_email {
display: none;
}
</style>
...
<form action='mail'>
<label id='other_email_label' for='other_email'>Please leave blank:</label>
<input type='text' name='other_email' id='other_email'>
...
</form>
So a human being won't see that field (unless they have CSS turned off, in which case they'll see the label and leave it blank) but a spam robot will fill it in. Any post with that field populated must be from a spam robot.
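The server-side check is just "is the honeypot field non-empty?". The question is about ASP.NET MVC, but purely to show the idea, a sketch in JavaScript might be:

// Treat the submission as spam if the hidden honeypot field was filled in.
function isSpamSubmission(formFields) {
  return (formFields.other_email || '').trim().length > 0;
}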
(Copied from my answer to this related question: "What is a good invisible captcha?")
IIRF can do blacklisting based on user-agent or IP address (or other things).
Works with ASP.NET, PHP, anything. Runs on IIS5, 6, 7. Fast, easy, free.
You can browse the doc here.
I saw a solution to this with forms; the premise was using JavaScript to count keystrokes and time the distance from page load to form submission. It then guessed whether the submitter was a bot based on that time and a typical expected range of keystrokes per second, as bots (that use the browser) tend to dump text very quickly without keystrokes (just a Ctrl-V).
Bots just sending POST or GET data without loading the page just get filtered too.
I don't know the details of the implementation, but might be an idea.
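A hedged sketch of the client-side half of that idea (the form and hidden field ids are made up; the server would then compare the reported timing and keystroke count against what a human could plausibly do):

// Record when the form was loaded and how many keys were pressed while filling it in.
var formLoadedAt = Date.now();
var keystrokes = 0;

document.addEventListener('keydown', function () { keystrokes++; });

document.getElementById('contact-form').addEventListener('submit', function () {
  // Copy the measurements into hidden fields so the server can evaluate them.
  document.getElementById('fill_time_ms').value = Date.now() - formLoadedAt;
  document.getElementById('keystroke_count').value = keystrokes;
});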
Is there a way to get struts 2 (using tiles) to build the whole page before sending it to the browser? I don't want the page to be build "progressively" in the browser one part at a time.
The main problem I'm trying to solve is that internet explorer 7 flashes/blinks the page even if only some of the content changes (firefox does this much more smoothly).
So that if I have a page with:
HEADER
some content
FOOTER
And the "some content" area only changes between page loads, the FOOTER part still flashes the white background before filling it with the background color of the footer. I tought that maybe by getting struts to send the complete page it would load fast enough to eliminate the "blinking".
Now the FOOTER comes from the server a little bit later than the parts before it and so it flashes (in internet explorer, firefox displays the page smoothly).
NB: this is an important requirement for the site, and using ajax to load the middle content is out (as are frames or other "hacks"). The site is built using CSS and not a table layout, maybe I will have to use a table layout to get it to work...
About using tiles flush parameter:
I tried that and it doesn't work as I need. I would need a flush parameter for the whole page. I have tried the normal JSP page directive autoFlush="false" but it didn't work. I set this directive on my main template page (and not in the tiles).
Here is an example from the main template, which uses header, body and footer templates. With the Thread.sleep() I added the problem is easy to spot. The footer renders 2 secs later than the rest of the page.
<body>
<div id="container">
<t:insertAttribute name="header" flush="false" />
<div id="content"><t:insertAttribute name="body" flush="false"/></div>
<div class="clear"></div>
<% Thread.sleep(2000); %>
<t:insertAttribute name="footer" flush="false" />
</div>
</body>
UPDATE
Thanks for the comments. The requirement is actually almost reasonable as this isn't a normal web page, think embedded.
But apparently there is no way of configuring IE to start rendering after some delay (like firefox has a configurable delay of some 100ms)?
I tried to intercept the TilesResult, but the method doExecute runs before the whole content is evaluated, so the method has already exited before the JSP is evaluated (my Thread.sleep() test). I was wondering how I could render the whole response to a string and then output it all at once to the browser.
I know this isn't foolproof and network delays etc. may factor in, but if I could get the response to output all at once and maybe use a table-based layout (IE possibly renders a table only after it closes), this could work reasonably well.
Or I could try to get this switched to Firefox, or maybe forget all about this little glitch...
UPDATE 2
This started to bother me so I did some investigation.
If I had a plain JSP page (no tiles), the buffering worked (with the buffer attribute): with my Thread.sleep() there, the whole page rendered after two seconds as long as the page size was below the buffer size.
But if I used tiles in the page (as in the example above) I couldn't get the page to render all at once (I even included the page directive in all my tiles templates/"components"; no help). So Tiles probably flushes the response somewhere?
Furthermore, the "problematic tiles" was my body-part, which contained a struts:form tag. I replaced it with a normal form-tag and it worked as I wanted...
UPDATE 3
Ok, nobody seems to know the inner workings of tiles or struts tags...
No big problem as this is a very specific case and requirement.
I worked around it by putting Apache in front of the application as a proxy, and using Apache's proxy configuration options to specify a large buffer.
I'll mark this as answered.
You can send page data all at once at the server end if you like (and many frameworks do that anyway for convenience) but the reality of networking is that it won't all arrive at once and the browser will render it as packets arrive. And this is a good thing for responsiveness, even if you* aesthetically would like the page to display all at once.
You can reduce the lag as much as possible by simplifying markup and using deflate compression to keep the payload size down, and that's a worthwhile thing to do in general. Plus you can make sure you're not hitting a Flash Of Unstyled Content. But you can't control when the browser chooses to render, short of doing it all in JavaScript with all the downsides that entails (and even then, the browser might redraw slowly).
(* - or your client/boss, if that's who has come up with this "important requirement" that your site somehow work differently to every other page on the web.)
Can you use the "flush" attribute on the tiles components?
<tiles:insertAttribute name="body" flush="false"/>
In addition if the output buffer gets too big, it will flush anyway. Try increasing the buffer size?
<%@ page language="java" buffer="500kb" autoFlush="false" %>