I am trying to develop a crawler to crawl youtube.com, parse the meta information (title, description, publisher, etc.) and store it in HBase or other storage systems. I understand that I have to write plugin(s) to achieve this, but I'm confused about which plugins I need to write. I am looking at these four -
Parser
ParserFilter
Indexer
IndexFilter
To parse the specific metadata information from a YouTube page, do I need to write a custom Parser plugin or a ParseFilter plugin along with the parse-html plugin?
After parsing, to store the entry in HBase or another storage system, do I need to write an IndexWriter plugin? By indexing, we generally mean indexing in Solr, ElasticSearch, etc., but obviously I don't need to index in any search engine. So how can I store the data in some store, say HBase, after parsing?
Thanks in advance!
Since YouTube is a web page, you'll need to write an HtmlParseFilter, which gives you access to the raw HTML fetched from the server. However, YouTube currently uses a LOT of JavaScript, and neither parse-html nor parse-tika supports executing JS code, so I'd advise you to use the protocol-selenium plugin: it delegates the rendering of a web page to the Selenium driver and gives you back the HTML after all the JS has been executed. After you write your own HtmlParseFilter, you'll need to write your own IndexingFilter; in this case you only need to specify what info you want to send to your backend. This part is totally backend-agnostic and relies only on the Nutch codebase (that's why you'll also need your own IndexWriter).
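To make that concrete, here is a rough, hedged sketch of what such an HtmlParseFilter could look like in Nutch 1.x. The class name and the "yt.description" metadata key are made up, and exact method signatures vary a little between Nutch versions, so treat this as a starting point rather than the actual implementation:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class YoutubeMetaParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    // Look for <meta name="description" content="..."> in the (rendered) DOM
    // and stash it in the parse metadata so an IndexingFilter can pick it up later.
    String description = findMetaContent(doc, "description");
    if (description != null) {
      Parse parse = parseResult.get(content.getUrl());
      parse.getData().getParseMeta().set("yt.description", description);
    }
    return parseResult;
  }

  // Plain recursive walk over the org.w3c.dom tree looking for a <meta> tag
  // whose name attribute matches, returning its content attribute.
  private String findMetaContent(Node node, String name) {
    if ("meta".equalsIgnoreCase(node.getNodeName()) && node.getAttributes() != null) {
      NamedNodeMap attrs = node.getAttributes();
      Node n = attrs.getNamedItem("name");
      Node c = attrs.getNamedItem("content");
      if (n != null && c != null && name.equalsIgnoreCase(n.getNodeValue())) {
        return c.getNodeValue();
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findMetaContent(children.item(i), name);
      if (found != null) {
        return found;
      }
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The matching IndexingFilter is the mirror image: its filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) method reads "yt.description" back out of parse.getData().getParseMeta() and calls doc.add("description", value), so the field ends up on the NutchDocument handed to whatever IndexWriter you configure.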
I assume that you're using Nutch 1.x; in this case, yes, you need to write a custom IndexWriter for your backend (which is fairly easy). If you use Nutch 2.x you'll have access to several backends through Apache Gora, but then you'll have some features missing (like protocol-selenium).
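The IndexWriter interface itself differs between Nutch releases, so I won't reproduce it here, but the write path inside it typically boils down to a plain HBase client call. A minimal sketch, assuming the standard HBase 1.x+ client API; the "videos" table, "meta" column family and field values are made up:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSink {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("videos"))) {
      // Row key = page URL; one Put per crawled document.
      Put put = new Put(Bytes.toBytes("https://www.youtube.com/watch?v=example"));
      put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"), Bytes.toBytes("Some title"));
      put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("description"), Bytes.toBytes("Some description"));
      table.put(put);
    }
  }
}

Using the URL as the row key is a common choice because it makes re-crawls idempotent: the new Put simply overwrites the old cells for that page.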
I think you should use something like Crawler4j for your purposes.
The real power of Nutch is utilized when you want to do a much wider crawl or you want to index your data directly into Solr/ES. But since you just want to download data for each URL, I would totally go with Crawler4j. It's much easier to set up and does not require complex configuration.
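For reference, a typical Crawler4j setup looks roughly like the sketch below. The class names follow the crawler4j 4.x API as I remember it (the shouldVisit signature changed between versions), so check the project's README for your version:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    // Stay on the one site you care about.
    return url.getURL().toLowerCase().startsWith("https://www.youtube.com/");
  }

  @Override
  public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
      HtmlParseData html = (HtmlParseData) page.getParseData();
      // Pull what you need from html.getHtml() / html.getText()
      // and write it to your own storage here.
      System.out.println(page.getWebURL().getURL() + " -> " + html.getTitle());
    }
  }

  public static void main(String[] args) throws Exception {
    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp/crawl");          // intermediate crawl data
    PageFetcher fetcher = new PageFetcher(config);
    RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
    CrawlController controller = new CrawlController(config, fetcher, robots);
    controller.addSeed("https://www.youtube.com/");
    controller.start(MyCrawler.class, 4);                // 4 crawler threads
  }
}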
I'm developing a website using Rails. I want to add an "our users' tweets" section to the main page, and I need advice on how best to do it. I hoped to find a standard way, maybe some Twitter widget or something else. I searched Google but found nothing. Please point me to the right path. Sorry if my question is very simple, but I don't really know how to do it. I hope that I needn't parse JSON and add styles independently; I just need the simple design from Twitter :)
To answer your [ambiguous] question, there are a number of things to consider:
How will you retrieve the tweets?
How will you store / access them?
How will the data be displayed on front-end?
The two methods you have are to use either the Twitter gem or the TwitterFetcher JS plugin:
Gem
The Twitter gem pulls data from the official Twitter API. This means you've got throttling & authentication to build into your app
The benefit of using this gem is that it gives you a HUGE amount of flexibility with the data. You can pull as much data as you need / want, in whatever shape you want - all returned as JSON & ready to be displayed on your site
This gem is best suited to storing your tweets, either in a DB or in Redis etc.; otherwise you'll have a massive synchronous dependency on Twitter's API - which is never good for performance
JS
The TwitterFetcher JS plugin is epic - it basically takes a Twitter widget & strips out the HTML, allowing you to style it how you like
This is the most effective way to retrieve Twitter data on the fly, as it's asynchronous, relies on Twitter's widget system (far more robust than the API), and stores no data locally
Hi, I am trying to build a simple application using Grails wherein I need to crawl 3 websites to get data about the price of a book. After getting those details, when I select one to buy, it has to redirect to the selected site. For an example, refer to the link http://www.mydiscountbay.com/ I am stuck; I don't know how to implement a simple crawler in Grails. Please guide me with sample code or a tutorial on how to implement it
Thanks in advance!
Implementing a crawler has nothing to do with Grails; there are some open-source Java crawlers that you may be able to use or customize as per your needs. The front-end part would be like a normal Grails web app.
Using something like URL#getText() will not get you very far with sites that have redirections, cookies, etc.
For anything even a little bit involved, use Apache Commons HttpClient or the Groovy HttpBuilder.
http://hc.apache.org/httpcomponents-client-ga/index.html
http://groovy.codehaus.org/HTTP+Builder
To parse the response and extract content, use XmlSlurper, e.g.: Using XmlSlurper: How to select sub-elements while iterating over a GPathResult
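If you'd rather stay in plain Java (Grails can call Java libraries directly), a minimal HttpClient 4.x fetch might look like the sketch below. The default client already follows redirects and keeps cookies; the URL is just the example site from the question, and the actual price extraction is left to XmlSlurper or whatever parser you prefer:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchPage {
  public static void main(String[] args) throws Exception {
    try (CloseableHttpClient client = HttpClients.createDefault()) {
      // Fetch the page body as a String; redirects and cookies are handled for you.
      String html = EntityUtils.toString(
          client.execute(new HttpGet("http://www.mydiscountbay.com/")).getEntity());
      // Hand the HTML to XmlSlurper (after tidying) or a regex to pull out the price.
      System.out.println(html.length() + " bytes fetched");
    }
  }
}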
I'm trying to write an iOS application that'll get data from a web server and display it as I want. I want to use JSON for this purpose. But as I'm absolutely new to web apps I've got no idea how I'm going to get the url to a certain feed. Now here're the two big questions:
How do I find the url to a feed provided by a web service? Is there a standard way or is it publicly or exclusively handed to the web service subscribers?
Is the format they provide data in up to their preference (like XML or JSON)? I mean, do I choose my data parsing method according to the format the web service gives data in? So if the feed is in XML format, using the NSJSONSerialization class makes no sense.
The URL to use is dependent on the web service and is usually well described in the documentation.
The type of data they return and the structure is also usually well described in the documentation.
The common bits you'll need to know are how to get to the web service (NSURLRequest/NSURLConnection, or any of the many asynchronous wrappers that are open source and available with a bit of searching), and how to deal with the returned data - whether it's in JSON (NSJSONSerialization, JSONKit) or XML (NSXMLParser, libxml, or any of the many open source implementations that are available and described with a bit of searching).
I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching, i.e. extracting the movie rating from 'imdb.com'.
I'm currently using the Indy HTTP components to get the page, and I'm using StrUtils to parse the text, but the content is limited.
I found plain simple regexes to be highly intuitive and simple when dealing with good websites, and IMDB is a good website.
For example, the movie rating on IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then it looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and do Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easily identifiable! You'll usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS requires IDs or classes to be able to style the elements properly.
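Just to show how the capture group is consumed, here is the same pattern wired up in a few lines of Java (most regex engines handle this pattern identically); the HTML literal is a stand-in for the page you would actually fetch:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingExtractor {
  public static void main(String[] args) {
    String html = "<div class=\"star-box-giga-star\"> 8.3 </div>";  // stand-in for the fetched page
    Matcher m = Pattern.compile("star-box-giga-star[^>]*>([^<]*)<").matcher(html);
    if (m.find()) {
      System.out.println(m.group(1).trim());   // capture group 1 holds the rating
    }
  }
}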
Processing an RSS feed is more comfortable.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
Yet, you may request that a new one be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
Use HTML Tidy to convert any HTML to valid XML, and then use an XML parser, perhaps with XPath, or develop your own code (which is what I do).
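As a sketch of the second half of that approach, the standard JDK XPath API is enough once the tidying step has produced well-formed XML (the tidying itself is omitted here, and the sample markup just reuses the rating DIV from the earlier answer for illustration):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathExample {
  public static void main(String[] args) throws Exception {
    // Pretend this is the output of HTML Tidy: well-formed XML.
    String xml = "<html><body><div class=\"star-box-giga-star\"> 8.3 </div></body></html>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    // XPath query against the DOM; evaluate() returns the text of the first match.
    String rating = XPathFactory.newInstance().newXPath()
        .evaluate("//div[@class='star-box-giga-star']", doc);
    System.out.println(rating.trim());   // -> 8.3
  }
}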
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin. I use WinInet and regexes for most of my web extraction needs.
But let me add my two cents on the specific sub-question of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be...
program imdbrating;

{$APPTYPE CONSOLE}

uses
  htmlutils; // provides HttpGet and UrlEncode

// Pulls the value of "Parm":"..." out of the flat JSON string returned by imdbapi.com.
function ExtractJsonParm(Parm, h: string): string;
var
  r: integer;
begin
  r := Pos('"' + Parm + '":', h);
  if r <> 0 then
    // skip past the key, the colon and the opening quote, then copy up to the
    // closing quote before the next comma
    Result := Copy(h, r + Length(Parm) + 4,
      Pos(',', Copy(h, r + Length(Parm) + 4, Length(h))) - 2)
  else
    Result := 'N/A';
end;

var
  h: string;
begin
  // the movie title is passed as the first command-line parameter
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  Writeln(ExtractJsonParm('Rating', h));
end.
If the page you are crawling is valid XML, I use SimpleXML to extract info. It works pretty well.
Resource:
Download link.
I want to make a program that takes as user input a website address. The program then goes to that website, downloads it, and then parses the information inside. It outputs a new html file using the information from the website.
Specifically, what this program will do is take certain links from the website, and put the links in the output html file, and it will discard everything else.
Right now I just want to make it for websites that don't require a login, but later on I want to make it work for sites where you have to login, so it will have to be able to deal with cookies.
I'll also want to later on have the program be able to explore certain links and download information from those other sites.
What are the best programming languages or tools to do this?
Beautiful Soup (Python) comes highly recommended, though I have no experience with it personally.
Python.
It's fairly easy to write a simple crawler using Python's standard libraries, but you'll also be able to find some existing Python crawler libraries available on the web.