I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching, e.g. extracting the movie rating from imdb.com.
I'm currently using the Indy HTTP components to fetch the page and StrUtils to parse the text, but that approach is limited.
I found plain simple regexes to be highly intuitive when dealing with good websites, and IMDB is a good website.
For example, the movie rating on IMDB's movie page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract with a regular expression. The following regex extracts the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that terminates the DIV, and then captures everything up to the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and choose Inspect Element, then look around for easily identifiable elements that can be used to build a good regex. In this case the "star-box-giga-star" class is obviously easy to identify! You'll usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS requires IDs or classes to style elements properly.
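To make that concrete, here is a minimal sketch of the idea, written in Ruby for brevity (the question is about Delphi, where the same pattern works with TRegEx). The URL and the class name are only illustrative and may have changed since this was posted:

require 'net/http'
require 'uri'

# Fetch a movie page and pull the rating with the regex above.
html = Net::HTTP.get(URI('http://www.imdb.com/title/tt0111161/'))

if (match = html.match(/star-box-giga-star[^>]*>([^<]*)</))
  puts match[1].strip # capture group 1 holds the rating text
else
  puts 'rating not found'
end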
Processing an RSS feed is more convenient (a minimal sketch follows the resource list below).
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
Still, you can request a new one by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
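As a rough illustration of how little code a feed needs compared with HTML scraping, here is a sketch using Ruby's standard-library RSS parser (the answer's context is Delphi, and the feed URL is an assumption based on the feeds listed above):

require 'rss'
require 'open-uri'

# Parse one of the feeds and print each entry; the second argument
# disables strict validation.
feed = RSS::Parser.parse(URI.open('http://rss.imdb.com/daily/born/'), false)
feed.items.each { |item| puts "#{item.title}: #{item.link}" }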
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may change the format frequently to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
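As a sketch of what using the downloaded data looks like, here is a Ruby snippet assuming the current IMDb dataset layout (title.ratings.tsv with tab-separated tconst, averageRating and numVotes columns, gunzipped beforehand); the download format may have been different at the time of the original answer:

require 'csv'

# Load the standalone ratings table into a hash keyed by title ID.
ratings = {}
CSV.foreach('title.ratings.tsv', col_sep: "\t", headers: true) do |row|
  ratings[row['tconst']] = row['averageRating'].to_f
end

puts ratings['tt0111161'] # look up a rating by IMDb title ID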
Use HTML Tidy to convert the HTML to valid XML and then use an XML parser, perhaps with XPath, or develop your own code (which is what I do).
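A minimal sketch of that pipeline in Ruby, assuming the tidy command-line tool and the nokogiri gem are installed (the XPath expression is illustrative):

require 'open3'
require 'nokogiri'

# Run HTML Tidy to turn tag soup into well-formed XHTML, then query
# the result with a real XML parser.
raw_html = File.read('page.html')
xml, _status = Open3.capture2('tidy', '-asxhtml', '-numeric', '-quiet',
                              stdin_data: raw_html)

doc = Nokogiri::XML(xml)
doc.remove_namespaces! # XHTML's default namespace gets in XPath's way
puts doc.xpath('//div[@class="star-box-giga-star"]/text()')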
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin, using WinINet and regexes for most of my web-extraction needs.
But let me add my two cents on the specific subquestion of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be:
program imdbrating;

{$APPTYPE CONSOLE}

uses
  htmlutils; // provides HttpGet and UrlEncode

// Extracts the value of a "Parm":"value" pair from raw JSON.
// A quick positional hack, not a real JSON parser: it skips past
// '"Parm":"' and copies everything up to the closing '",'.
function ExtractJsonParm(const Parm, h: string): string;
var
  r: Integer;
begin
  r := Pos('"' + Parm + '":', h);
  if r <> 0 then
    Result := Copy(h, r + Length(Parm) + 4,
      Pos(',', Copy(h, r + Length(Parm) + 4, Length(h))) - 2)
  else
    Result := 'N/A';
end;

var
  h: string;
begin
  // Query the API with the movie title given as the first parameter.
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  Writeln(ExtractJsonParm('Rating', h));
end.
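Called from the command line as, e.g., imdbrating Casablanca, it prints the rating, or N/A if the field isn't found in the response.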
If the page you are crawling is valid XML, I use SimpleXML to extract info. It works pretty well.
Resource:
Download link.
I tried using Feedjira to assist with content analysis from newsfeeds, but it appears that RSS feeds now only link to content rather than including it, as I found out in "Feedjira not adding content and author". I plan to use Feedjira to get the URL for the article, then use Nokogiri to scrape the article and pick out the relevant parts.
The problem is that each media outlet will have a different format for its pages. I need to know the best way for Nokogiri to take the URL from the database (supplied by Feedjira) and, depending on the associated feed title (also in the database from the Feedjira sync), scrape the page in a specific way and save it to a separate table in the database. Anyone got any suggestions?
I don't know your specific use case, but I'm also doing content analysis using news feeds.
Maybe have a look at Readability, which provides a generic content scraper.
The problem you've encountered is that every feed generator does it a bit differently, just as with HTML generators. You can assume certain fields are going to be in place in an RDF, RSS or Atom feed; however, the author of the feed could use optional tags that you would find very useful, so you have to write code to look for them.
I wrote several feed aggregators in the past, including one that was handling well over 1000 feeds daily. By sniffing out the feed type, Atom vs. RSS vs. RDF, I could make sensible checks for fields that were interesting given that format, and extract the data if it was available.
Pre-canned parsers get it wrong too often, either grabbing data you don't want and making a mess of the output, or skipping data you do want leaving gaps in the output, so be prepared to write code if you want it done correctly.
You'll probably want to take advantage of a backing database too, to keep track of what you looked at last and when you're supposed to look at it again; that's part of being a good network citizen. You'll also want to keep track of whether a feed was down the last n times you looked, so you can trim out dead sites.
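For the feed-sniffing step, a minimal sketch in Ruby using Nokogiri (the root-element names are the standard ones for each format; everything past this check is necessarily per-format code):

require 'nokogiri'

# Look at the root element to decide which format you're holding,
# then branch to format-specific field extraction.
def feed_type(xml)
  case Nokogiri::XML(xml).root&.name
  when 'rss'  then :rss   # RSS 0.9x / 2.0
  when 'feed' then :atom  # Atom
  when 'RDF'  then :rdf   # RSS 1.0 (RDF)
  else :unknown
  end
end

feed_type(File.read('feed.xml')) # => :rss, :atom, :rdf or :unknown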
I want to fetch the content of a website, but to get the correct content it is necessary to change an HTML input (scroll) field on the page.
Any idea how to manage this with Xcode?
Thanks a lot!!
Lars
If you want to retrieve the HTML that you get after filling in an HTML form, you have to identify precisely what the series of requests to fetch the data looks like. And be careful, because it's often not as simple as just looking at the single request the form in question generates: unfortunately, it is sometimes a complex series of requests (e.g. retrieving the original HTML often also seamlessly retrieves some critical hidden form fields and/or cookies).
Bottom line: to reverse engineer the required HTTP requests, you often have to pore through HTML code and/or watch the requests with something like Charles. It can take quite a bit of time to do this with complicated sites.
Before you invest a lot of time here, though, you should first see if the web site provider's Terms of Service permit such usage. They often strictly prohibit this sort of practice. It's much better to contact the web site provider and see if they provide a web service to retrieve the data. That's far easier and will result in a far more robust interface for your app.
But if you're forced to parse HTML programmatically, I'd refer you to How to Parse HTML on iOS on Ray Wenderlich's site.
I have been using LabVIEW to collect measurement data, and I would like to know if it is possible for LabVIEW to communicate the results to a Google Spreadsheet. If so, where could I find resources to learn how to make LabVIEW transmit information to the Google Spreadsheet?
Thanks!
EDIT AND FOLLOW-UP: I used Jonathan's suggestion below and experimented with LabVIEW's HTTP Post VI. It's very simple: all you need to do is enter the URL of the Google form (replacing the final "viewform" with "formResponse") and a string with the data you want to enter (roughly entry.0=value0&entry.1=value1). A big thanks for that answer, it was really helpful!
However, when I try to use this method for a Google form with more than one page, the data isn't read properly... The form is still sent, but every field not present on the first page of the form remains blank in the spreadsheet. I feel this is somehow linked to the fact that in the Google form, the URLs of all pages after page 1 are the URL of page 1 with the final "viewform" replaced with "formResponse". Is this what is causing the error, or is it something else altogether, and how can I fix it?
I can think of two ways to do this:
You can create a form in Google Spreadsheets. The form appears as an HTML document with standard tags. From there, I would use LabVIEW's HTTP functionality to submit data to that form using a POST request (see the sketch after this list). This would be the easiest way to get data in there.
Using the Google Apps API, you can manipulate Google spreadsheets and dump data in there directly. This is more complicated in terms of development time, but more configurable in the long run. https://developers.google.com/google-apps/spreadsheets/#what_can_this_api_do There are .NET and Java code examples throughout the documentation, so it would take some work to port this to LabVIEW, but it could be done.
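For option 1, here is a hedged sketch (in Ruby, just to show the shape of the HTTP request LabVIEW's POST VI would send) of submitting to a form with "viewform" replaced by "formResponse", as described in the question's follow-up. The form key and the entry.* field IDs are hypothetical; read the real ones out of your form's HTML:

require 'net/http'
require 'uri'

# Form-encoded POST to the form's submission URL.
uri = URI('https://docs.google.com/spreadsheet/formResponse?formkey=YOUR_KEY')
res = Net::HTTP.post_form(uri, 'entry.0' => '23.5', 'entry.1' => 'sensor A')
puts res.code # 200 means Google accepted the submission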
Hi and thanks in advance for helping me with my question.
Is it possible to write a script that would extract the following information when provided with a Craigslist or Kijiji post, e.g. http://toronto.en.craigslist.ca/tor/atq/3346994296.html:
email address (default one provided by craigslist)
items in the post
address of poster
Items 1-3 above can be obtained manually, but I would like to just input a posting or ad ID and be able to extract this info automatically.
The short answer to this question is...
Yes, automatically extracting the info listed from web pages similar to the example provided can be done by a relatively simple script.
In general, this activity [of automatically extracting info from web pages] is known as Web Scraping, a particular form of Data Scraping.
There are both off-the-shelf products that can be used (little or no programming involved; the desired pages and desired fields within the pages are specified by way of configuration), and software libraries that can be used with scripting languages such as Python or Java, which facilitate the parsing of HTML pages and more generally support the various tasks associated with this activity.
Aside from technical considerations, you need to assess the etiquette and legality of performing this kind of scraping. While some data and sites may be explicitly copyright-protected, it is always a good idea to perform big scraping jobs at low-traffic hours and to throttle the requests so as not to burden the host site unduly. Also, many sites provide an API or data dumps to supply the same info in a simpler and more controlled fashion.
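As a purely illustrative sketch of the scripting-library route (Ruby with Nokogiri here; the CSS selectors are hypothetical, so inspect the actual page, and mind the etiquette and terms-of-use caveats above):

require 'net/http'
require 'nokogiri'
require 'uri'

# Fetch the example post and pull out a couple of fields.
url  = 'http://toronto.en.craigslist.ca/tor/atq/3346994296.html'
page = Nokogiri::HTML(Net::HTTP.get(URI(url)))

title = page.at_css('h2.postingtitle')&.text&.strip # hypothetical selector
body  = page.at_css('#postingbody')&.text&.strip    # hypothetical selector
puts title, body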
I just have a link to a product page on Amazon. How do I get all the information (photo, price, etc.) in my Ruby program, using just this link?
Here's the list of supported URLs as disclosed by Amazon for their oEmbed endpoint. The Product Advertising API would come into the picture only after parsing these URLs and getting the ASINs:
http://*amazon.*/gp/product/*
http://*amazon.*/*/dp/*
http://*amazon.*/dp/*
http://*amazon.*/o/ASIN/*
http://*amazon.*/gp/offer-listing/*
http://*amazon.*/*/ASIN/*
http://*amazon.*/gp/product/images/*
http://*amazon.*/gp/aw/d/*
http://www.amzn.com/*
http://amzn.com/*
I found this library (I'm using Rails): amazon-ecs.
I'm experimenting with it. Still, I'd require some kind of ID (product ID?) to get the details of a particular product. For example, consider this link to the Kindle:
http://www.amazon.com/Kindle-Amazons-Wireless-Reading-Generation/dp/B00154JDAI/ref=amb_link_84372271_1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=06JJGQP9J3BHKPE38SXP&pf_rd_t=101&pf_rd_p=478184871&pf_rd_i=507846
In that link, I noticed ASIN, which is B00154JDAI.
Looks like I can use this ID to get product information (using amazon-ecs). I just need to parse the URL to get the ASIN.
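A minimal sketch of that URL-parsing step; the regex simply mirrors the /dp/, /gp/product/ and /ASIN/ patterns from the oEmbed list above:

# Return the ASIN embedded in an Amazon product URL, or nil.
def extract_asin(url)
  url[%r{/(?:dp|gp/product|o/ASIN|ASIN|gp/aw/d)/([A-Z0-9]{10})}, 1]
end

extract_asin('http://www.amazon.com/Kindle-Amazons-Wireless-Reading-Generation/dp/B00154JDAI/ref=amb_link_84372271_1')
# => "B00154JDAI"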
Is there any other way to do it?
No, I am not going to do screen scraping; that is never a good idea.
If you want to do this, the Nokogiri or hpricot libraries both allow HTML parsing and searching. However, this kind of screen-scraping is notoriously unreliable (as it may break any time Amazon decides to reorganize their HTML), so if you're planning to do this sort of thing for any length of time I'd recommend leveraging the Amazon Product Advertising API instead.
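For completeness, a hedged sketch of an item lookup with the amazon-ecs gem mentioned earlier; the option keys and required credentials (including the associate tag) vary by gem and API version, so treat this as the shape of the call rather than a definitive recipe:

require 'amazon/ecs'

# Placeholder credentials; the odd :aWS_* key casing is the gem's own.
Amazon::Ecs.options = {
  :aWS_access_key_id => 'YOUR_KEY',
  :aWS_secret_key    => 'YOUR_SECRET',
  :associate_tag     => 'YOUR_TAG'
}

res  = Amazon::Ecs.item_lookup('B00154JDAI', :response_group => 'Medium')
item = res.first_item
puts item.get('ItemAttributes/Title')
puts item.get('ItemAttributes/ListPrice/FormattedPrice')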
In your program: fetch the page and parse the HTML, then filter out the required information. There may be Ruby libraries (that I am unaware of) which parse HTML.
hpricot seems to do what you want.
You should use the Ruby/AWS library (Google for it; my karma is not high enough to allow external links...). It was written exactly for that.
You might need to use the built-in Search to find the item you're looking for. After that, the API gives access to pictures, links and all usable information.