Hi, I am trying to build a simple application using Grails in which I need to crawl 3 websites to get data about the price of a book. After getting those details, when the user selects a book to buy, the app has to redirect to the selected site. For an example, refer to http://www.mydiscountbay.com/ I am stuck: I don't know how to implement a simple crawler in Grails. Please guide me with sample code or a tutorial on how to implement it.
Thanks in advance.
Implementing a crawler has nothing to do with Grails; there are some open-source Java crawlers that you may be able to use or customize to your needs. The front-end part would be like a normal Grails web app.
Using something like URL#getText() will not get you very far with sites that have redirections, cookies, etc.
For anything even a little bit involved, use Commons HttpClient or the Groovy HTTPBuilder:
http://hc.apache.org/httpcomponents-client-ga/index.html
http://groovy.codehaus.org/HTTP+Builder
To parse the response and extract content, use XmlSlurper; see, e.g., Using XmlSlurper: How to select sub-elements while iterating over a GPathResult.
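For illustration, here is what a minimal fetch might look like with the HttpClient 4.x API from the first link. This is a sketch only: the bookstore URL is a placeholder, and the actual price extraction is left as a comment.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class PriceFetcher {
    public static void main(String[] args) throws Exception {
        // Unlike URL#getText(), the default client follows redirects
        // and keeps a cookie store across requests.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://www.example-bookstore.com/book/123"); // placeholder URL
            try (CloseableHttpResponse response = client.execute(get)) {
                String html = EntityUtils.toString(response.getEntity());
                // Hand `html` to XmlSlurper (from Groovy) or another HTML
                // parser to pull out the price element.
                System.out.println(html.substring(0, Math.min(200, html.length())));
            }
        }
    }
}
```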
I am trying to develop a crawler to crawl youtube.com, parse the meta information (title, description, publisher, etc.), and store it in HBase or another storage system. I understand that I have to write plugin(s) to achieve this, but I'm confused about which plugins I need to write. I am looking at these four:
Parser
ParserFilter
Indexer
IndexFilter
To parse the specific metadata from a YouTube page, do I need to write a custom Parser plugin or a ParseFilter plugin alongside the parse-html plugin?
After parsing, do I need to write an IndexWriter plugin to store the entry in HBase or another storage system? By indexing, we generally mean indexing into Solr, Elasticsearch, etc., but I obviously don't need to index into any search engine. So how can I store the parsed data in a store such as HBase?
Thanks in advance!
Since YouTube is a web page, you'll need to write an HtmlParseFilter, which gives you access to the raw HTML fetched from the server. However, at the moment YouTube uses a LOT of JavaScript, and neither parse-html nor parse-tika supports executing JS code, so I'd advise you to use the protocol-selenium plugin: you delegate the rendering of the web page to the Selenium driver and get the HTML back after all the JS has been executed. After you write your own HtmlParseFilter, you'll need to write your own IndexingFilter; in this case you only need to specify what info you want to send to your backend. This is totally backend-agnostic and relies only on the Nutch codebase (that's why you'll need your own IndexWriter).
I assume that you're using Nutch 1.x; in that case, yes, you need to write a custom IndexWriter for your backend (which is fairly easy). If you use Nutch 2.x, you'll have access to several backends through Apache Gora, but then some features are missing (such as protocol-selenium).
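To make the shape of that concrete, here is a rough sketch of a custom HtmlParseFilter that copies page metadata into the parse metadata, where an IndexingFilter can later pick it up. Treat it as a sketch against Nutch 1.x: the class name, the metadata key, and the DOM-walking helper are placeholders, and exact plugin signatures vary between 1.x releases.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class YoutubeMetaParseFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        Metadata parseMeta = parse.getData().getParseMeta();
        // Walk the parsed DOM in `doc` and stash whatever you need; the key
        // name and the extraction helper below are placeholders.
        parseMeta.set("yt.title", extractTitle(doc));
        return parseResult;
    }

    private String extractTitle(DocumentFragment doc) {
        return ""; // DOM traversal omitted; depends on the page structure
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```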
I think you should use something like Crawler4j for your purposes.
The real power of Nutch shows when you want to do a much wider crawl or you want to index your data directly into Solr/ES. But since you just want to download the data for each URL, I would totally go with Crawler4j: it's much easier to set up and does not require complex configuration.
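As a hedged sketch of what that setup looks like (the class name, storage folder, and seed URL are placeholders; the two-argument shouldVisit shown here is the crawler4j 4.x signature):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class YoutubeCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay on the one site you care about.
        return url.getURL().startsWith("https://www.youtube.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String title = html.getTitle();
            // Extract the metadata you need from html.getHtml() and write it
            // straight to HBase (or any other store) right here.
            System.out.println(page.getWebURL().getURL() + " -> " + title);
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://www.youtube.com/");
        controller.start(YoutubeCrawler.class, 4); // 4 crawler threads
    }
}
```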
I'm working on my first Tornado project and I have some questions:
1. Part of the project is collecting and categorizing real-time hashtags and tweets from different Twitter users and putting them on the website. I want to use IOStream for real-time results. Are there libraries that help with that, and how do I use them? I found libraries like python-twitter and tweepy, but I don't know which is best, and I read about the Twitter API's rate limits, so what library/approach should I use? Sorry, but it's my first time working with Twitter too.
2. I found UIModule in the Tornado documentation. I didn't understand how to use it: what is the benefit of it?
3. Is there a way to render global template tags for use in more than one template?
4. I'm using MongoEngine. Will it work with Tornado's asynchronous model, or do I have to use asyncmongo?
I don't know much about this one, but you could either do AJAX calls to Twitter on the frontend, or do something like this: http://arstechnica.com/open-source/guides/2010/04/tutorial-use-twitters-new-real-time-stream-api-in-python.ars
UIModules are reusable parts of a site that can easily be inserted into any template. E.g., you could have a post module and a comment module in a blog, which you could then reuse on multiple pages.
Not really.
Use asyncmongo.
For my Grails application I want to set up Google Analytics to track only "partial" URLs. I'll explain:
A typical Grails URL consists of the following parts: domain + application name + controller + action + id,
e.g. www.mydomain.com/myapp/controller/action/12345
As far as I understand, for Google Analytics the page to be tracked is identified by the entire URL. For my purpose I'm not interested in the id part of the URL: I want to know which actions have been performed, but I don't need to know for which id the action was executed.
And of course I would like a generic solution, because I have multiple controllers and multiple actions... Maybe some kind of filter stating "I want to track pages three levels deep (/myapp/controller/action)" would do? Or a filter stating "exclude everything in the URL after the last /"?
Any help would be much appreciated.
Kind regards,
Pieter
I think this issue is best solved within the realm of Google Analytics, where you can create a specific report that ignores the id part of the URL.
That way you can just use Google Analytics at its easiest and need not make any code changes to your project.
There can be several approaches. The first thing that comes to my mind is taking one of these steps:
Using profile filters (more info)
Generating the same virtual pageview for each action, regardless of id (more info)
Using advanced segments tool with a proper condition (page url pattern match) (more info)
Each approach has its pros and cons; choosing the proper one depends on the goal you are trying to achieve.
I think this question is best answered by this article.
As the other contributors suggested, I too thought the issue should be resolved in Google Analytics. I clicked around a bit and got hopelessly confused.
Solving the issue within Grails is much, much easier. In short, the answer is: the Google Analytics tracking JavaScript has a "_trackPageview" action, and this action can take as a parameter the URL you want to track. So in Grails I can simply pass exactly the part I want to track: application/controller/action. My Google Analytics script is in my main template, and I just use:
_gaq.push(['_trackPageview', 'myapp/${controllerName}/${actionName}']);
("myapp" should be the name of your application; ${controllerName} and ${actionName} are generically available variables in the Grails views.)
Hope this will help others.
Thanks for the other answers.
For an event in a couple of weeks I'd like to make a web page/app which displays tweets from a specific user, a specific hashtag, and all @replies to the first user, in 3 boxes on the screen.
However, I've never tried this. I want to use either .NET (C#) or HTML/CSS/JS since I'm proficient in those. Are there any libraries/APIs I can use? Or is there a readily available freeware/open-source app I can use?
Have you seen TweetSharp?
Use Twitter's profile and search widgets: the profile widget for the first box, a search on the hashtag for the second box, and a search on to:username for the third box.
I actually just posted this as an answer to another question:
I just updated a plugin to work with the Twitter 1.1 API. Unfortunately, per Twitter's urging, you will have to perform the actual request from server-side code. However, you can pass the response to the plugin and it will take care of the rest. I don't know what framework you are running, but I have already added sample code for making the request in C#, and will be adding sample code for PHP shortly.
The plugin makes a call to statuses/user_timeline, but you will likely want to look at statuses/filter or statuses/search, instead. All you will have to do is add your desired parameters (hashtag, replies, etc.) to the server-side code and it should work (with the addition of your security keys and tokens, of course).
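If your server-side code happens to run on the JVM rather than C#, a library such as Twitter4J wraps those same endpoints. A minimal sketch, assuming your OAuth keys and tokens sit in a twitter4j.properties file on the classpath; the screen name and search queries are placeholders:

```java
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class TweetBoxes {
    public static void main(String[] args) throws Exception {
        // Reads the OAuth consumer key/secret and access token/secret
        // from twitter4j.properties on the classpath.
        Twitter twitter = TwitterFactory.getSingleton();

        // Box 1: statuses/user_timeline for a specific user (placeholder name).
        for (Status status : twitter.getUserTimeline("some_user")) {
            System.out.println(status.getText());
        }

        // Boxes 2 and 3: search for a hashtag and for replies to that user.
        QueryResult hashtag = twitter.search(new Query("#some_hashtag"));
        QueryResult replies = twitter.search(new Query("to:some_user"));
        hashtag.getTweets().forEach(s -> System.out.println(s.getText()));
        replies.getTweets().forEach(s -> System.out.println(s.getText()));
    }
}
```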
Good luck! :)
I want to add a Twitter feed for a specific keyword search to my Rails application. Where should I start?
You might start with one of the Twitter API libraries written for Ruby.
You may want to consider grabbing the RSS feed for the search and parsing that; I show how in Railscasts episode 168. If you need something fancier, the API is the way to go, as Dav mentioned.
But whichever solution you choose, it's important to cache the search results locally on your end. That way your site isn't hitting Twitter every time someone visits the page; this improves performance and makes your site more stable (it won't break when Twitter breaks). You can have the cache auto-update every 10 minutes (or whatever fits) using a cron task.
We download and store the tweets in a local database. I recently wrote a blog post about how I achieved this:
http://www.arctickiwi.com/blog/16-download-you-twitter-feed-using-ruby-on-rails-with-oauth
You can then use will_paginate to handle your pagination and go back as far as you want.