How do I get a teaser excerpt from a web article / user-posted link? - ruby-on-rails

I have a site where users can submit content based on a link. Is there a way to detect the main content of the link and take a teaser? For example, on Digg, all of the entries have a small clip / excerpt from the link. That's pretty much exactly what I want.
I'm working with Ruby on Rails. I found this question on extracting article excerpts but any tips in the right direction would be helpful.

I found out that Digg uses the Open Graph Protocol (http://ogp.me/) by Facebook.
Ultimately, this was exactly what I was looking for!
The Ruby Gem OpenGraph:
https://github.com/intridea/opengraph
Reading the "description" metadata tag gave me the description, e.g.:
article = OpenGraph.fetch('http://www.page.com/article/1124')
article.description # => 'This is a small description of the movie'
Some pages don't provide a description, though most articles do.
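For pages that lack an Open Graph description, one possible fallback is the ordinary meta description tag. The following is only a minimal sketch using the Ruby standard library: the method name `teaser_from_html` and the sample markup are hypothetical, and the regex matching assumes the `property` or `name` attribute appears before `content` in each tag. A real HTML parser such as Nokogiri would likely be more robust.

```ruby
# Hypothetical helper: prefers og:description, falls back to the plain
# <meta name="description"> tag, and returns nil when neither is present.
# Assumption: the identifying attribute comes before content="...".
def teaser_from_html(html)
  og    = html[/<meta[^>]+property=["']og:description["'][^>]*content=(["'])(.*?)\1/im, 2]
  plain = html[/<meta[^>]+name=["']description["'][^>]*content=(["'])(.*?)\1/im, 2]
  og || plain
end

# Made-up sample pages for illustration.
with_og = <<~HTML
  <head>
    <meta property="og:description" content="This is a small description of the movie">
    <meta name="description" content="Generic site description">
  </head>
HTML

without_og = <<~HTML
  <head><meta name="description" content="Generic site description"></head>
HTML

puts teaser_from_html(with_og)
puts teaser_from_html(without_og)
```

If both lookups return nil, the page has to be handled some other way, for example by extracting the main article text.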

How to Extract a Webpage’s Main Article Content
Try extracting the text using the DOM. Here is an example page:
<body>
<div>
<ul>
<li>Home</li>
<li>Politics</li>
<li>Health</li>
<li>Travel</li>
<li>About</li>
</ul>
</div>
<div>
<div>
<p><b>MIAMI, Florida (CNN) </b> -- Hurricane Ike weakened slightly...
<p>Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
<p>"It pretty much looks like an episode of 'The Twilight Zone,' " said...
<p>Aftwood estimates at least 90 percent of homes he saw on the island were...
<p>The possibility of similar devastation prompted state and local officials...
<p>"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
</div>
<div>
<p>Some side-story that we don't really care about.</p>
<p>Another paragraph for this story.</p>
</div>
<div>
<p>Yet another semi-related side-story that we still don't care about.</p>
<p>Another paragraph for this story.</p>
<p>Another paragraph for this story.</p>
<p>Yet another paragraph for this story.</p>
</div>
</div>
<div>© 2008 Cable News Network.</div>
</body>
Clearly, we don’t care about the navigation link text, or the two side-stories. Let’s break it down based on DOM location. We have six <p> tags in the first <div> of the second <div> of the body. We’ll represent this location as a list of indexes, like (2,1,*). If we group all the text nodes in this fashion, and track how much text each group contains, we get a table like:
location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
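The excerpt stops at the table, but a natural next step (an assumption here, not something the source states) is to treat the location with the most text as the main article. A minimal sketch of that selection, with the counts copied from the table and a hypothetical helper name `densest_location`:

```ruby
# Character counts per DOM location, copied from the table above.
TEXT_BY_LOCATION = {
  "(1,1,1,1)" => 4,
  "(1,1,2,1)" => 8,
  "(1,1,3,1)" => 6,
  "(1,1,4,1)" => 6,
  "(1,1,5,1)" => 5,
  "(2,1,*)"   => 500,
  "(2,2,*)"   => 100,
  "(2,3,*)"   => 250,
  "(3)"       => 26
}.freeze

# Hypothetical helper: returns the [location, characters] pair
# that holds the most text.
def densest_location(counts)
  counts.max_by { |_location, chars| chars }
end

location, chars = densest_location(TEXT_BY_LOCATION)
puts "#{location} = #{chars}" # prints "(2,1,*) = 500"
```

Here (2,1,*) wins with 500 characters, which matches the six main-story paragraphs. The first paragraph or two of that group would then serve as the teaser.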

Related

UIWebView is not working right SWIFT

I'm looking for a better way to customize this string. Right now I'm using a UIWebView and trying to style the string with inline CSS through loadHTMLString. This isn't working well at all. The images are all over the place, and any changes I put in loadHTMLString are being ignored by certain CSS tags. Is there a better way to do this? Or what can I do to fix this? This approach also gives me blue boxes with question marks for a different busDescriptio string. I'm not sure whether that is related to how I'm calling the string or to using UIWebView. Thanks!
webView.loadHTMLString("<html><body p style='font-family:arial;font-size:48px;color:white;'>" + busDescriptio + "</body></html>", baseURL: nil)
webView.stringByEvaluatingJavaScriptFromString("document.all[0].innerHTML")
webView.backgroundColor = UIColor.clearColor()
webView.opaque = false
webView.scrollView.scrollEnabled = true
webView.scrollView.bounces = true
webView.sizeToFit()
webView.delegate = self
My busDescriptio contains this string:
<p style='text-align:center'> </p>
<p style='text-align:center'><span style='font-family:comic sans ms'><strong><u>Fresh Seafood and USDA Prime Steak Directly from OUR OWN MARKET!</u></strong></span></p>
<p style='text-align:center'><strong><span style='font-family:tahoma'><span style='font-size:medium'>Wh</span></span></strong><span style='font-family:tahoma'><strong><span style='font-size:medium'>at do we offer?</span></strong></span></p>
<p style='text-align:center'><span style='font-family:comic sans ms'><span style='font-size:medium'>Fresh Seafood, USDA Prime Steaks, Live Music, Full Catering Options with a Customizable Catering Menu, Boxed Lunches delivered to your office, Happy Hour, Daily Drink Specials, Free WiFi, Exceptional Service, Fun and Unique Ambience, Private Party Room, Banquets, Personable Birthday Experiences, Delivery available through eatgoodexpress.com, and </span></span></p>
<p style='text-align:center'><span style='font-size:medium'><strong><span style='font-size:large'><span style='font-family:comic sans ms'>THE LARGEST CRAB IN THE WORLD! </span></span></strong></span></p>
<p style='text-align:center'><em><strong><span style='font-family:tahoma'><span style='font-size:medium'>Why go anywhere else?</span></span></strong></em></p>
<p style='text-align:left'><em> </em></p>
<p style='text-align:center'><img alt='' src='http://i189.photobucket.com/albums/z189/txmom3x/Crabby%20Daddy/crabbydrinks.jpg' style='height:148px; width:222px' /><img alt='' src='http://i189.photobucket.com/albums/z189/txmom3x/Crabby%20Daddy/CrabbyPrivateRoom.jpg' style='height:149px; width:244px' /></p>
<p> </p>
<p style='text-align:center'><img alt='' src='http://i189.photobucket.com/albums/z189/txmom3x/Crabby%20Daddy/crabbycrab.jpg' style='height:142px; width:204px' /><img alt='' src='http://i189.photobucket.com/albums/z189/txmom3x/Crabby%20Daddy/Crabby_music.jpg' style='height:142px; width:153px' /></p>

XPath Node selection

I am using HtmlAgilityPack to parse data for a Windows Phone 8 app. I have managed four nodes but I am having difficulties on the final one.
Game newGame = new Game();
newGame.Title = div.SelectSingleNode(".//section//h3").InnerText.Trim();
newGame.Cover = div.SelectSingleNode(".//section//img").Attributes["src"].Value;
newGame.Summary = div.SelectSingleNode(".//section//p").InnerText.Trim();
newGame.StoreLink = div.SelectSingleNode(".//img[@class= 'Store']").Attributes["src"].Value;
newGame.Logo = div.SelectSingleNode(".//div[@class= 'text-col'").FirstChild.Attributes["src"].Value;
That last piece of code is the one I am having problems with. The HTML on the website looks like this (simplified with the data I need)
<div id= "ContentBlockList" class="tier ">
<section>
<div class="left-side"><img src="newGame.Cover"></div>
<div class="text-col">
<img src="newGame.Logo http://url.png" />
<h3>newGame.Title</h3>
<p>new.Game.Summary</p>
<img src="newGame.StoreLink" class="Store" />
</div>
</div>
</section>
As you can see, I need to parse two images from this block of HTML. This code seems to take the first img src and use it correctly for the game cover:
newGame.Cover = div.SelectSingleNode(".//section//img").Attributes["src"].Value;
However, I'm not sure how to get the second img src to retrieve the store Logo. Any ideas?
newGame.Cover = div.SelectSingleNode(".//img[2]").Attributes["src"].Value;
You didn't post the entire thing, but this should do the trick.
You can try it this way:
newGame.Cover = div.SelectSingleNode("(.//img)[2]")
.GetAttributeValue("src", "");
GetAttributeValue() is preferable to Attributes["..."].Value because the latter throws an exception when the attribute is not found, whereas the former returns its second parameter (an empty string in the example above).
Side note: your HTML markup is invalid as posted (some elements are not closed, <section> for example). That may cause confusion.

How can I get the number of likes shown in the App Store

I want to know if there is a way to get the number of likes shown in the App Store or Game Center.
In this way I can check if a user really likes my application on Facebook.
Thanks!
You should be using the iTunes lookup API to get back the info you need in JSON formatted results:
http://www.apple.com/itunes/affiliates/resources/documentation/itunes-store-web-service-search-api.html
The link you should use for Angry Birds Rio would look like this:
http://itunes.apple.com/lookup?id=420635506
{
"resultCount":1,
"results":[
{
"artistId":298910979,
"artistName":"Rovio Entertainment Ltd",
...
"averageUserRating":4.5,
"averageUserRatingForCurrentVersion":4.5,
...
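Reading the rating out of such a response could look roughly like the sketch below. It is only an illustration: the sample string contains just the fields quoted above (a real response has many more), and the helper name `average_rating` is hypothetical.

```ruby
require "json"

# Trimmed sample holding only the fields quoted above.
SAMPLE_RESPONSE = <<~BODY
  {
    "resultCount": 1,
    "results": [
      {
        "artistId": 298910979,
        "artistName": "Rovio Entertainment Ltd",
        "averageUserRating": 4.5,
        "averageUserRatingForCurrentVersion": 4.5
      }
    ]
  }
BODY

# Hypothetical helper: returns the average rating of the first result,
# or nil when the lookup returned no results.
def average_rating(json_text)
  data  = JSON.parse(json_text)
  first = data.fetch("results", []).first
  first && first["averageUserRating"]
end

puts average_rating(SAMPLE_RESPONSE) # prints 4.5
```

In a real task the JSON text would come from an HTTP request to the lookup URL shown above instead of a hard-coded string.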
Using an HTTP GET request, you can look at the iTunes page for the app, e.g. for Angry Birds Rio: https://itunes.apple.com/us/app/angry-birds-rio/id420635506?mt=8&ign-mpt=uo%3D2
Then, if you want the total number of ratings, look at this markup and extract the relevant part of the string:
<div class='extra-list customer-ratings'>
<h4>Customer Ratings</h4>
<div>Current Version:</div>
<div class='rating' role='img' tabindex='-1' aria-label='4 and a half stars, 3113 Ratings'><div><span class="rating-star"> </span><span class="rating-star"> </span> <span class="rating-star"> </span><span class="rating-star"> </span><span class="rating-star half"> </span></div><span class="rating-count">3113 Ratings</span>
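Extracting the count from that markup could be sketched with a regular expression, as below. The helper name `rating_count` is hypothetical, the pattern assumes the aria-label keeps the exact "N Ratings" wording with no thousands separator, and page markup like this can change at any time.

```ruby
# Fragment of the markup quoted above.
RATING_HTML = %q(<div class='rating' role='img' tabindex='-1' aria-label='4 and a half stars, 3113 Ratings'>)

# Hypothetical helper: pulls the number in front of "Ratings" out of the
# aria-label attribute; returns nil when the pattern is not found.
def rating_count(html)
  count = html[/aria-label='[^']*?(\d+) Ratings'/, 1]
  count && count.to_i
end

puts rating_count(RATING_HTML) # prints 3113
```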

Graph background-image resizing without PHP

I've read several helpful answers in re. image resizing using PHP and max-height etc.: Image resize script
However, my problem is that I want to resize an image of a graph that I am retrieving from another site (USGS), and putting into a site (zenfolio) that supports HTML and JavaScript, but not PHP. I have tried adjusting the specified height and width, but keep on ending up resizing only the amount of the image that shows on the page, and not the image itself (sorry I cannot post images as I am a new user).
I just posted them as png's above to demonstrate the problem, but the images are generated as follows:
<div id="riverlevels">
<center>
<div id="MyWidget" style="background-image:url(http://waterdata.usgs.gov/nwisweb/graph?agency_cd=USGS&site_no=12354500&parm_cd=00065&period=21);width:576px;Height:400px;">
<br/>
Montana River Photography </div>
</center>
</div>
</div>
This same image can be generated using this JavaScript, but for some reason that does not allow me to display more than one variable graph per page (I want to show both discharge (00060), and gage height (00065)):
<script type="text/javascript">
wStation = "12354500";
wDays = "21";
wType = "00065";
wWidth = "576px";
wHeight = "400px";
wFColor = "#000033";
wTitle = "";
document.write('<div id="gageheight"></div>');
document.write('<scr'+'ipt type="text/JavaScript"src="http://batpigandme.com/js/showstring.js"></scr'+'ipt>');
As you can tell, I have to use a separate site that I own to create the JavaScript file. The graphs are currently located in various iterations at:
montanariverphoto.com/test
clark fork gage height
I sincerely apologize if I have missed an obvious answer to this! I basically created this widget by reverse engineering a widget from another site, so perhaps my call is incorrect altogether.
Does it absolutely have to be a background image? Scaling them is possible (using background-size), but this property is not well supported (basically it won't work in Internet Explorer). Your code would work almost as-is if you can use an image tag instead:
<img src="http://waterdata.usgs.gov/nwisweb/graph?agency_cd=USGS&site_no=12354500&parm_cd=00065&period=21" width="576" height="400" alt="..." />
For your other problem, IDs need to be unique on a page. In your code example you are creating a div with the ID gageheight, and this ID is hardcoded into your JavaScript file at http://batpigandme.com/js/showstring.js. Since only one element on the page can have this ID, repeating the code later on won't work. You'd need to change the script so that you can pass in the ID as a variable, something like:
wTitle = "";
wElement = "gageheight";
document.write('<div id="' + wElement + '"></div>');
document.write('<scr'+'ipt type="text/JavaScript"src="http://batpigandme.com/js/showstring.js"></scr'+'ipt>');
and then in your JS:
var myElement = document.getElementById(wElement);
var JavaScriptCode = document.createElement("script");
JavaScriptCode.setAttribute('type', 'text/javascript');
JavaScriptCode.setAttribute("src", 'http://batpigandme.com/js/data2.js');
myElement.appendChild(JavaScriptCode);

Mechanize not recognizing anchor tags via CSS selector methods

(Hope this isn't a breach of etiquette: I posted this on RailsForum, but I haven't been getting much response from there recently.)
Has anyone else had problems with Mechanize not recognizing anchor tags via CSS selectors?
The HTML looks like this (snippet with white space removed for clarity):
<td class='calendarCell' align='left'>
10
<p style="margin-bottom:15px; line-height:14px; text-align:left;">
<span class="sidenavHeadType">
Current Events</span><br />
<b><a href="http://www.mysite.org/index.php/site/
Clubs/banks_and_the_fed" class="a2">Banks and the Fed</a></b>
<br />
10:30am- 11:45am
</p>
I'm trying to collect the data from these events. Everything is working except getting the anchor within the <p>. There's clearly an <a> tag inside the <b>, and I'm going to need to follow that link to get further details on this event.
In my rake task, I have:
agent.page.search(".calendarCell,.calendarToday").each do |item|
  day = item.at("a").text
  item.search("p").each do |e|
    anchor = e.at("a")
    puts anchor
    puts e.inner_html
  end
end
What's interesting is that the item.at("a") always returns the anchor. But the e.at("a") returns nil. And when I do inner_html on the p element, it ignores the anchor entirely. Example output:
nil
<span class="sidenavHeadType">
Photo Club</span><br><b>Indexing Slide Collections</b>
<br>
2:00pm- 3:00pm
However, when I run the same scrape directly with Nokogiri:
doc.css(".calendarCell,.calendarToday").each do |item|
  day = item.at_css("a").text
  item.css("p").each do |e|
    link = e.at_css("a")[:href]
    puts e.inner_html
  end
end
It recognizes the <a> inside the <b>, and it will return the href, etc.
<span class="sidenavHeadType">
Bridge Party</span><br><b>Party Bridge</b>
<br>
7:00pm- 9:00pm
Mechanize is supposed to use Nokogiri, so I'm wondering if I have a bad version or if this affects others as well.
Thanks for any leads.
Never mind. False alarm. In my Nokogiri task, I was pointing to a local copy of the page that included the anchors. The live page requires a login, so I could see the <a> tags when I browsed to it, but the rake task, which wasn't logged in, got the page without them. Adding the login to the rake task solved it.
