How to scrape different URLs from database with Nokogiri with different requirements - ruby-on-rails

I tried using Feedjira to assist with content analysis of news feeds, but it appears that RSS feeds now only link to content rather than including it in the feed itself, as I found out in "Feedjira not adding content and author". I plan to use Feedjira to get the URL for each article, and then use Nokogiri to scrape the article and pick out the relevant parts.
The problem is that each media outlet uses a different format for its pages. I need to know the best way for Nokogiri to take the URL from the database (supplied by Feedjira) and, depending on the associated feed title (also in the database from the Feedjira sync), scrape the page in a specific way and save the result to a separate table in the database. Does anyone have any suggestions?

I don't know your specific use case, but I'm also doing content analysis using news feeds.
Maybe have a look at Readability, which provides a generic content scraper.

The problem you've encountered is that every feed generator does it a bit differently, just as with HTML generators. You can assume certain fields are going to be in place in an RDF, RSS or Atom feed; however, the author of the feed could use optional tags that you could find very useful, so you have to write code to look for them.
I wrote several feed aggregators in the past, including one that was handling well over 1000 feeds daily. By sniffing out the feed type (Atom vs. RSS vs. RDF), I could make sensible checks for fields that were interesting given that format, and extract the data if it was available.
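Sniffing the feed type can be as simple as looking at the root element. The following is a minimal sketch, not the aggregator described above: it assumes the feed arrives as a string and uses REXML from the Ruby distribution; the method name sniff_feed_type is made up for illustration.

```ruby
require "rexml/document"

# Map the root element's local name to a feed format.
FEED_TYPES = { "rss" => :rss, "feed" => :atom, "RDF" => :rdf }.freeze

# Returns :rss, :atom, :rdf, or :unknown for a feed document given as a string.
def sniff_feed_type(xml)
  root = REXML::Document.new(xml).root
  return :unknown if root.nil?

  # root.name is the local name, so <rdf:RDF> is reported as "RDF".
  FEED_TYPES.fetch(root.name, :unknown)
rescue REXML::ParseException
  :unknown
end

puts sniff_feed_type('<rss version="2.0"><channel/></rss>')
puts sniff_feed_type('<feed xmlns="http://www.w3.org/2005/Atom"/>')
```

Once the type is known, the format-specific checks for optional tags can live in separate methods, one per format.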
Pre-canned parsers get it wrong too often, either grabbing data you don't want and making a mess of the output, or skipping data you do want leaving gaps in the output, so be prepared to write code if you want it done correctly.
You'll probably want to take advantage of a backing database too, to keep track of what you looked at last and when you're supposed to look at it again; that's part of being a good network citizen. You'll also want to keep track of whether a feed was down the last n times you looked, so you can trim out dead sites.
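For the per-outlet scraping the question asks about, one option is a lookup table of CSS selectors keyed by feed title. This is only a sketch under assumptions: the feed titles and selectors are invented placeholders, and extract_article is a hypothetical helper that accepts any document object responding to at_css, such as the one returned by Nokogiri::HTML(html).

```ruby
# Per-outlet CSS selectors, keyed by the feed title stored in the database.
# Titles and selectors here are made-up placeholders.
SELECTORS = {
  "Example Times" => { headline: "h1.headline", body: "div.article-body" },
  "Sample Post"   => { headline: "header h1",   body: "section#story" }
}.freeze

# Fallback used when a feed title has no dedicated entry.
DEFAULT_SELECTORS = { headline: "h1", body: "article" }.freeze

# doc is any object responding to at_css, e.g. Nokogiri::HTML(html).
# Returns a hash of field name => stripped text (nil when no node matches).
def extract_article(doc, feed_title)
  selectors = SELECTORS.fetch(feed_title, DEFAULT_SELECTORS)
  selectors.transform_values do |css|
    node = doc.at_css(css)
    node && node.text.strip
  end
end
```

Keeping the selectors in data rather than in code means that adding an outlet is a one-line change, and the resulting hash can be written straight to the separate articles table.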

Related

What is the best way to extract data from wiki tables, and links from that table to JSON?

I'm kind of new at web dev and had a question about getting data from Wikipedia. I am making a personal web app that will keep track of past UFC events. I couldn't find an open source API with event details and results. However, the following table on Wikipedia has a lot of the info I need: http://en.wikipedia.org/wiki/List_of_UFC_events
I have seen several tutorials on how to get the info from a wiki table and format it as .csv using Google Spreadsheets or other software such as OpenRefine. But I also want the information from each event's wiki page (fight results, winners, award winners, poster images etc.), and each event's own wiki page is linked from the table I mentioned above. I was wondering, what is the easiest way to go about extracting this information?
You can use the Nokogiri gem to scrape the web page.

best way to extract info from the web delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting a movie rating from 'imdb.com'.
I'm currently using the IndyHttp components to get the page and I'm using strUtils to parse the text, but the content is limited.
I found plain simple regexes to be highly intuitive and simple when dealing with good web sites, and IMDB is a good web site.
For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that terminates the opening DIV tag, and then captures everything up to the following <. To create a new regex like this, use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easily identifiable! You'll usually have no problem finding such identifiable elements on good web sites, because good web sites use CSS, and CSS requires IDs or classes to be able to style the elements properly.
Processing an RSS feed is more convenient.
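To see what the pattern captures, it can be tried against a small sample. The sketch below uses Ruby rather than Delphi, and the HTML string is hand-written for illustration, not fetched from IMDB; extract_rating is a hypothetical helper name.

```ruby
# The pattern from the answer: capture the text that follows the tag
# carrying the star-box-giga-star class.
RATING_PATTERN = /star-box-giga-star[^>]*>([^<]*)</

# Returns the rating text, or nil when the markup does not match.
def extract_rating(html)
  match = RATING_PATTERN.match(html)
  match && match[1].strip
end

# Hand-written sample markup, used only to exercise the pattern.
sample = '<div class="titlePageSprite star-box-giga-star"> 8.5 </div>'
puts extract_rating(sample)
```

Returning nil on a miss matters in practice, because a pattern like this silently stops matching as soon as the site renames the class.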
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
However, you can request a new one by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the web site to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (Denial of Service and Intellectual Property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, maybe using XPATH or developing your own code (which is what I do).
All the answers posted cover your general question well. I usually follow a strategy similar to the one detailed by Cosmin. I use wininet and regex for most of my web extraction needs.
But let me add my two cents on the specific subquestion of extracting the IMDB rating. IMDBAPI.COM provides a query interface returning JSON, which is very handy for this type of search.
So a very simple command line program for getting an IMDB rating would be...
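program imdbrating;
{$apptype console}
uses htmlutils;

// Returns the value of the JSON key parm in h, or 'N/A' if the key is absent.
// Assumes the value is a quoted string that is followed by a comma.
function ExtractJsonParm(parm, h: string): string;
var
  r: integer;
begin
  r := pos('"' + parm + '":', h);
  if r <> 0 then
    // Start just after "parm":" and stop before the quote that precedes the next comma.
    result := copy(h, r + length(parm) + 4, pos(',', copy(h, r + length(parm) + 4, length(h))) - 2)
  else
    result := 'N/A';
end;

var
  h: string;
begin
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating', h));
end.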
If the page you are crawling is valid XML, I use SimpleXML to extract the information. It works pretty well.
Resource:
Download link.

Ruby Rss parser and event trigger

I'm using an RSS library so I can parse Atom and RSS in Ruby and Rails and store the result in a model.
I've looked at the standard RSS library, but is there a library that will auto-detect that there is a new RSS feed so I can update my database?
What is the best practice for triggering the code that stores the new RSS feed?
Should I use threads to handle this problem? Is it going to be slow?
Thank you for your help.
OK, here's the deal.
If you want a really fast feed parser, go for Feedzirra. It does not work on Windows. http://github.com/pauldix/feedzirra
Autodiscovery?
- There's truffle-hog if you don't want to do GET redirects. http://github.com/pauldix/truffle-hog
- There's feedbag if you want to do GET redirects to find feeds from given URLs. This is slower, though. http://github.com/damog/feedbag
Feedzirra is the best bet if you want to poll for new entries in your feed. But if you want a non-polling solution to your problem, then I would suggest going through the pubsubhubbub spec. While parsing your feeds, check whether they are pubsubhubbub enabled by looking for the hub link tag. If it points to pubsubhubbub.appspot.com or any other pubsub-enabled hub, subscribe to the feed by sending a subscription request to the hub. You can then define an endpoint in your app which will receive updated-entry pings for your feed subscription from the hub. Just read the raw POST data and store it in your database. Stats say that 95% of Blogger blogs are pubsub enabled. That is a lot of data in your hands already. :)
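The hub check and the subscription request can be sketched roughly as follows. This is an illustration under assumptions, not code from Feedzirra: it uses REXML, looks for any link element with rel="hub", and builds only the form body with the hub.mode, hub.topic and hub.callback parameters from the spec. The helper names and the example URLs in the comments are hypothetical, and a particular hub may expect further parameters.

```ruby
require "rexml/document"
require "uri"

# Depth-first search for a <link rel="hub" href="..."> element.
def find_hub(element)
  if element.name == "link" && element.attributes["rel"] == "hub"
    return element.attributes["href"]
  end
  element.elements.each do |child|
    found = find_hub(child)
    return found if found
  end
  nil
end

# Returns the hub URL advertised by the feed, or nil when there is none.
def hub_url(feed_xml)
  root = REXML::Document.new(feed_xml).root
  root && find_hub(root)
end

# Form-encoded body for a subscription request to the hub.
# topic_url is the feed URL; callback_url is the endpoint in your own app.
def subscription_params(topic_url, callback_url)
  URI.encode_www_form(
    "hub.mode"     => "subscribe",
    "hub.topic"    => topic_url,
    "hub.callback" => callback_url
  )
end
```

The body would then be POSTed to the hub URL, and the hub verifies the callback before it starts sending pings.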
If you are polling for changes, then you should check the Last-Modified or ETag value from the response headers rather than parsing the entire feed again. This saves you from wasting resources. Feedzirra takes care of this for you.
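Without Feedzirra, the same idea is a conditional GET: send back the validators stored from the previous response and treat a 304 reply as "nothing new". A minimal sketch with Net::HTTP, assuming the ETag and Last-Modified values were saved alongside the feed record; both method names are made up for illustration.

```ruby
require "net/http"
require "uri"

# Builds a GET request that asks the server to reply 304 Not Modified
# when the feed has not changed since the stored validators were issued.
def conditional_request(feed_url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(feed_url))
  request["If-None-Match"] = etag if etag
  request["If-Modified-Since"] = last_modified if last_modified
  request
end

# Sends the request; returns nil when nothing changed, else the response.
def fetch_if_changed(feed_url, etag: nil, last_modified: nil)
  uri = URI(feed_url)
  request = conditional_request(feed_url, etag: etag, last_modified: last_modified)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPNotModified) ? nil : response
end
```

After a successful fetch, the ETag and Last-Modified headers of the new response replace the stored values for the next poll.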
I am not sure what you mean by "auto-detect" a new feed?
Are you looking for code that can discover when someone creates a new feed on a site? Or, do you mean discover when an existing feed has a new article?
The first is tough because your code needs to know what site to look at, so it needs some sort of auto-discovery of sites with new feeds. Searching Google for "new rss feeds" doesn't return anything that looks useful, at least not on the first page. If you, or your users, know of a new site, then you can have an interface to add new sites to search. Then you grab the page at that URL, look for the RSS/Atom auto-discovery links, and go from there. Auto-discovery links can open a can of worms because of duplicate content being served using different protocols (RDF, RSS and Atom), so you have to determine which to use, or multiple feeds with alternate content listed.
If you mean you want to discover when an existing feed has new articles, then you have to keep track of the last time your code looked at the feed, and the last article that was seen, then retrieve the feed and see if any articles were not in your list of previously seen articles. Your code needs to be sensitive to the time-to-live information in a lot of feeds too. Hitting the feed every fifteen minutes when they update once a week is bad form. Most aggregation code can do those things already but you might need to configure a database and tell the code how to find it.
Generally, for this sort of task I set up a crontab entry on a production Linux or Unix system and fire off the job periodically, looking in the database for feeds whose last-run-time plus the stored time-to-live value is in the past.
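The "last-run-time plus time-to-live is in the past" check can be sketched in plain Ruby. The Feed struct below stands in for a database row, and the field names and the MAX_FAILURES cutoff are assumptions made for illustration; in a real setup the same condition would be a WHERE clause run by the cron job.

```ruby
# Stand-in for a database row; a real application would use a table.
Feed = Struct.new(:url, :last_run_at, :ttl_seconds, :failures, keyword_init: true)

# Hypothetical cutoff: stop polling feeds that failed this many times in a row.
MAX_FAILURES = 5

# A feed is due when its last run plus its time-to-live is in the past.
def due_feeds(feeds, now: Time.now)
  feeds.select do |feed|
    feed.failures < MAX_FAILURES && feed.last_run_at + feed.ttl_seconds <= now
  end
end
```

Respecting the stored time-to-live this way keeps a weekly feed from being hit every fifteen minutes.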
Does that help any?
A very easy solution is to use dynamic attribute-based finders.
When you are filling your model with RSS feed data, instead of Model.create(...) use Model.find_or_create_by_column(value, :other_column => other_value).
You can use a date or the RSS message title as the unique value (whatever you want).
I think this is pretty easy. You can set up a cron task to fill your model once per hour, for example. Only new entries will be added.
There is no way to get an "event" when an RSS feed is updated without downloading the whole feed again.
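The effect of the find-or-create approach is that importing the same feed twice adds nothing the second time. The sketch below shows that behaviour with an in-memory stand-in for the model, so it runs without Rails; EntryStore and import_entries are hypothetical names. Note that the dynamic find_or_create_by_column form was deprecated in Rails 4, where the call is written Model.find_or_create_by(column: value).

```ruby
# In-memory stand-in for a table with a unique key on guid.
class EntryStore
  def initialize
    @rows = {}
  end

  # Returns [row, created]; an existing row is left untouched.
  def find_or_create_by(guid, attributes)
    existing = @rows[guid]
    return [existing, false] if existing

    @rows[guid] = attributes.merge(guid: guid)
    [@rows[guid], true]
  end

  def size
    @rows.size
  end
end

# Imports parsed entries and returns only the ones not seen before.
def import_entries(store, entries)
  entries.filter_map do |entry|
    row, created = store.find_or_create_by(entry[:guid], title: entry[:title])
    row if created
  end
end
```

With a real model, a unique index on the chosen column is still worth adding, since two overlapping cron runs could otherwise insert the same entry.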

How to get product information from amazon, just based on the URL?

I just have a link to a product page, at amazon. How do I get all the information (photo, price etc), in my ruby program, just using this link?
Here's the list of supported URLs as disclosed by Amazon for their oEmbed endpoint. The Product Advertising API only comes into the picture after parsing these URLs and extracting the ASINs:
http://*amazon.*/gp/product/*
http://*amazon.*/*/dp/*
http://*amazon.*/dp/*
http://*amazon.*/o/ASIN/*
http://*amazon.*/gp/offer-listing/*
http://*amazon.*/*/ASIN/*
http://*amazon.*/gp/product/images/*
http://*amazon.*/gp/aw/d/*
http://www.amzn.com/*
http://amzn.com/*
I found this library (I'm using Rails)
amazon-ecs
I'm experimenting with it. Still, I'd need some kind of ID (product ID?) to get the details of a particular product. For example, consider this link to the Kindle:
http://www.amazon.com/Kindle-Amazons-Wireless-Reading-Generation/dp/B00154JDAI/ref=amb_link_84372271_1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=06JJGQP9J3BHKPE38SXP&pf_rd_t=101&pf_rd_p=478184871&pf_rd_i=507846
In that link, I noticed the ASIN, which is B00154JDAI.
It looks like I can use this ID to get product information (using amazon-ecs). I just need to parse the URL to get the ASIN.
Is there any other way to do it?
No, I am not going to do screen scraping; that is never a good idea.
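Parsing the ASIN out of the URL, as described above, might look like the following sketch. It assumes an ASIN is ten uppercase letters or digits and covers only some of the path shapes from the list (dp, gp/product, gp/aw/d, gp/offer-listing and ASIN); amzn.com short links and the gp/product/images form are not handled. extract_asin is a made-up helper name.

```ruby
require "uri"

# Matches the ASIN segment that follows one of the known path markers.
ASIN_PATTERN = %r{/(?:dp|gp/product|gp/aw/d|gp/offer-listing|ASIN)/([A-Z0-9]{10})(?:/|\z)}

# Returns the ASIN found in the URL's path, or nil when none is recognised.
def extract_asin(url)
  match = ASIN_PATTERN.match(URI(url).path)
  match && match[1]
end

puts extract_asin("http://www.amazon.com/Kindle-Amazons-Wireless-Reading-Generation/dp/B00154JDAI/ref=amb_link_84372271_1")
```

The returned ID can then be passed to an item lookup in a library such as amazon-ecs.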
If you want to do this, the Nokogiri or hpricot libraries both allow HTML parsing and searching. However, this kind of screen-scraping is notoriously unreliable (as it may break any time Amazon decides to reorganize their HTML), so if you're planning to do this sort of thing for any length of time I'd recommend leveraging the Amazon Product Advertising API instead.
In your program: fetch the page and parse HTML. Filter out the required information. There may be some libraries in Ruby (that I am unaware of), which parse HTML.
hpricot seems to do what you want.
You should use the library Ruby/AWS (google for it, my karma is not high enough to allow external links...). It has been written exactly for that.
You might need to use the built-in Search to find the item you're looking for. After that, the API gives access to pictures, links and all usable information.

Aggregating feeds in Rails application

I am thinking of writing a daemon to loop through feeds and then add them into the database as ActiveRecord objects.
Firstly, one problem I am facing is that I cannot reliably retrieve the author/user of a story using the feed-normalizer gem. It appears that sometimes it does not recognize the tag (I don't know if anyone else has faced this problem).
Secondly, I haven't seen anyone convert RSS feeds back into database entries. I need to do this as each entry will have associations with other ActiveRecord objects. I can't find any gems to do this specifically, but could I somehow hack something like acts_as_feed to do that?
Don't use SimpleRSS. It won't decode HTML entities for you, and it occasionally ignores the structure of the feed.
I've found it easiest to parse the feed as XML with XMLSimple, but you can use any XML parser.
SimpleRSS exposes a very simple API and works pretty well on most feeds. I recommend not looking at the implementation as its "parser" is a bunch of regexes (which is so wrong on so many levels), but it works well.
Daemons is a good gem for running it in the background.
If you are using active record, you should follow the instructions for using AR outside of rails and then inline define the model classes. This will cut down on bloat a bit.
RSS feeds are pretty inconsistent; this is the fall-through we use:
date = i[:pubDate] || i[:published] || i[:updated]
body = i[:description] || i[:content] || i[:summary] || ""
url = i[:guid] || i[:link]
Also, from experience, make sure you try to rescue everything (and remember that on older Ruby versions timeouts are not caught by a bare rescue, so rescue them explicitly). It sucks to have to constantly bounce RSS daemons that get bad data.
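Put together, the fall-through and the rescue advice might look like this sketch. It assumes each parsed item is a Hash with symbol keys, as in the lines above; normalize_item and each_feed_safely are hypothetical helper names, and Timeout::Error is listed explicitly in case a bare rescue would miss it.

```ruby
require "timeout"

# Applies the fall-through to one parsed item (a Hash with symbol keys).
def normalize_item(item)
  {
    date: item[:pubDate] || item[:published] || item[:updated],
    body: item[:description] || item[:content] || item[:summary] || "",
    url:  item[:guid] || item[:link]
  }
end

# Runs the block for every feed and keeps going when one of them fails.
# Returns a list of [feed, error class name] pairs for the failures.
def each_feed_safely(feeds)
  failures = []
  feeds.each do |feed|
    begin
      yield feed
    rescue Timeout::Error, StandardError => e
      failures << [feed, e.class.name]
    end
  end
  failures
end
```

Collecting the failures instead of letting them propagate keeps the daemon alive, and the list can feed the "down the last n times" bookkeeping.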
The best approach is to use a Rails Engine connected to a Feed API like Superfeedr's.
Polling RSS feeds implies that you'll need to run your own asynchronous workers and/or a queue system, which can be fairly complex to build and maintain over time. You'll also have to handle hundreds of formats and inconsistencies. Here's a blog post that shows how to consume RSS feeds in a Rails application.
