Using Nokogiri I am unable to find certain nodes in a document - ruby-on-rails

When I search the document for ysr-bio-data (the "Height" value) on this page: http://sports.yahoo.com/footballrecruiting/football/recruiting/player-Jonathan-Allen-125805
the node is nil. Is this because Nokogiri fetches the page before this section is populated? Or is the Nokogiri object not storing the whole page?
Below is some sample code showing how I'm trying to retrieve the data.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://sports.yahoo.com/footballrecruiting/football/recruiting/player-Jonathan-Allen-125805'))
doc.css('ul#ysr-bio-data')
If I need to provide any additional information please let me know. Thanks!
Edit: Fixed incorrect syntax.

Sorry, but JavaScript needs to run on the page before those cells are filled in.
You can work around this, though, by letting the JavaScript run in a real web browser:
require 'nokogiri'
require 'watir-webdriver' #http://watir.com/
$browser = Watir::Browser.start "http://sports.yahoo.com/footballrecruiting/football/recruiting/player-Jonathan-Allen-125805"
doc = Nokogiri::HTML.parse($browser.html)
doc.css("ul#ysr-bio-data").text
=> "Ht:6'3\"Wt:263 lbs40:4.5 secsBench Max:280Class:2013 (High School)\t"
We're basically replacing open-uri with watir.
Hope this helps.

I found another question on Stack Overflow that provided a means of solving my issue:
HTML is read before fully loaded using open-uri and nokogiri

Related

Nokogiri gem vs. opening by hand

I can't get Nokogiri to return the same thing I see when I go to a page and "View Source", and for the life of me I can't figure out why.
This is the page I am looking at:
http://www.amazon.com/gp/product/B009NWFP5Q
As you can see, it shows an orange shoe. If I view the source and find the link I'm looking for by searching for "hiRes" twice, I get:
http://ecx.images-amazon.com/images/I/71b75uTtzDL.UL1500.jpg
However, if I run this code with Nokogiri:
require 'nokogiri'
require 'open-uri'
require 'uri'
url = "http://www.amazon.com/gp/product/B009NWFP5Q"
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer fetches URLs on Ruby 3.0+
pic = doc.css('div#imageBlock_feature_div script')[0]
puts pic
and look for the link in the same position I get this image:
http://ecx.images-amazon.com/images/I/81R97WG9nyL.UL1500.jpg
which is a BLUE shoe!!! Arghhh..
Any idea why??
Maybe the color being shown is based on your session or on dynamic attributes assigned to a cookie stored in your browser. Try to find a URL that returns exactly the variant you are looking for. It may also be possible to send a cookie from the HTTP client code, but that seems like a plan B.
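The cookie fallback could look roughly like this with Ruby's standard Net::HTTP. This is only a sketch: the build_request helper is made up for illustration, and the cookie name and value are placeholders (not anything the site is known to use), so copy the real ones from your browser's developer tools.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that carries a cookie copied from the browser.
# build_request is a hypothetical helper, not part of any library.
def build_request(url, cookie)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['Cookie'] = cookie
  request
end

# The cookie string is a placeholder, not a real session value.
request = build_request('http://www.amazon.com/gp/product/B009NWFP5Q',
                        'session-id=PLACEHOLDER')

# Sending it would look like this (commented out so the sketch does
# not hit the network):
# response = Net::HTTP.start(request.uri.hostname, request.uri.port) do |http|
#   http.request(request)
# end
# doc = Nokogiri::HTML(response.body)
```

If the page still differs from what the browser shows, the variant is probably chosen by something other than that cookie.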

How to parse Nokogiri/libXML XML errors to human-friendly errors?

We are using Nokogiri to validate XML files against an XSD. The problem is that the error messages Nokogiri generates are not very friendly and are hard to translate:
"Element '{http://www.portalfiscal.inf.br/nfe}infNFe': The attribute 'Id' is required but missing."
Does anyone know of a parser or any other way to capture the info needed from the error to generate a more human friendly error?
Until then, we will be doing a custom parser for them... ouch!
I created a gem for this that is now open source: https://rubygems.org/gems/xml_errors_parser
It seems to work pretty well so far, but the number of errors it parses is still small. It is, however, very easy to add new errors, so we will be adding them as needed.
Code reviews and pull requests are always great :)
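To illustrate the custom-parser idea (this is not the xml_errors_parser gem's code or API), a message like the one quoted in the question can be taken apart with a regular expression. The pattern, the friendly_error helper, and the wording of the friendly message are all assumptions, and the sketch covers only this one error type.

```ruby
# Matches libxml2 messages of the form:
#   Element '{namespace}name': The attribute 'attr' is required but missing.
# The namespace part in braces is optional and is discarded.
MISSING_ATTRIBUTE =
  /Element '(?:\{[^}]*\})?([^']+)': The attribute '([^']+)' is required but missing/

# Hypothetical helper: returns a friendlier sentence, or the original
# message unchanged when the pattern does not recognise it.
def friendly_error(message)
  match = MISSING_ATTRIBUTE.match(message)
  return message unless match

  element, attribute = match.captures
  "The <#{element}> element must have an \"#{attribute}\" attribute."
end

raw = "Element '{http://www.portalfiscal.inf.br/nfe}infNFe': The attribute 'Id' is required but missing."
puts friendly_error(raw)
```

With Nokogiri, the raw strings would come from the error objects returned by Nokogiri::XML::Schema#validate. Each further error type needs its own pattern.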

using 'puts' to get information from external domain

I just started with Ruby on Rails the other day, and I was wondering whether it is possible to use the puts function to get the content of a div from a page on an external site.
Something like puts "http://www.example.com #about"
Would something like this work? Or would you have to get the entire page and then puts the section you wanted?
Additionally, if the content of the #about div on example.com is constantly changing, would puts constantly update its output, or would the script only run each time the page is refreshed?
The open-uri library (for fetching the page) and the Nokogiri gem (for parsing and retrieving specific content) can assist with this.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.example.com/')) # URI.open: Kernel#open no longer fetches URLs on Ruby 3.0+
puts doc.at('#about').text
puts will not work that way. Ruby makes parsing HTML fairly easy, though. Take a look at the Nokogiri library; you can use XPath queries to get to the div you want to print out. I believe you would need to fetch the page again if the div changes, but I'm not positive about that. You can easily test it (or someone here can confirm or reject that statement).

Ruby on Rails - Converting Twitter @mentions, #hashtags and URLs within a string

Let's say I have a string which contains text grabbed from Twitter, as follows:
myString = "I like using @twitter, because I learn so many new things! [line break]
Read my blog: http://www.myblog.com #procrastination"
The tweet is then presented in a view. However, prior to this, I'd like to convert the string so that, in my view:
@twitter links to http://www.twitter.com/twitter
The URL is turned into a link (in which the URL remains the link text)
#procrastination is turned into https://twitter.com/i/#!/search/?q=%23procrastination, in which #procrastination is the link text
I'm sure there must be a gem out there that would allow me to do this, but I can't find one. I have come across twitter-text-rb but I can't quite work out how to apply it to the above. I've done it in PHP using regex and a few other methods, but it got a bit messy!
Thanks in advance for any solutions!
The twitter-text gem has pretty much all the work covered for you. Install it manually (gem install twitter-text, use sudo if needed) or add it to your Gemfile (gem 'twitter-text') if you are using bundler and do bundle install.
Then include the Twitter auto-link library (require 'twitter-text' and include Twitter::Autolink) at the top of your class and call auto_link(inputString) with the input string as the parameter; it returns the auto-linked version.
Full code:
require 'twitter-text'
include Twitter::Autolink
myString = "I like using @twitter, because I learn so many new things! [line break]
Read my blog: http://www.myblog.com #procrastination"
linkedString = auto_link(myString)
If you output the contents of linkedString, you get the following output:
I like using @<a class="tweet-url username" href="https://twitter.com/twitter" rel="nofollow">twitter</a>, because I learn so many new things! [line break]
Read my blog: http://www.myblog.com <a class="tweet-url hashtag" href="https://twitter.com/#!/search?q=%23procrastination" rel="nofollow" title="#procrastination">#procrastination</a>
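For comparison, here is a plain-regex sketch of the three conversions the question asks for. It is a hand-rolled alternative, not how twitter-text works internally, and it is far less thorough than the gem. The linkify helper and its pattern are assumptions, the search URL is copied from the output above, and the input is assumed to be HTML-escaped already.

```ruby
# One alternation, scanned in a single pass, so markup that has just
# been inserted is never matched again. The lookbehinds skip things
# like e-mail addresses (a@b) and HTML entities (&#39;).
TOKEN = %r{(https?://[^\s<]+)|(?<![\w&])@(\w+)|(?<![\w&])#(\w+)}

# Hypothetical helper: wraps URLs, @mentions and #hashtags in links.
def linkify(text)
  text.gsub(TOKEN) do
    url, user, tag = $1, $2, $3
    if url
      %(<a href="#{url}">#{url}</a>)
    elsif user
      %(<a href="https://twitter.com/#{user}">@#{user}</a>)
    else
      %(<a href="https://twitter.com/#!/search?q=%23#{tag}">##{tag}</a>)
    end
  end
end

tweet = 'I like using @twitter! Read my blog: http://www.myblog.com #procrastination'
puts linkify(tweet)
```

The URL pattern is deliberately naive (it would swallow trailing punctuation, for example), which is one reason the gem is the safer choice for real tweets.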
Use jQuery Tweet Linkify
A small jQuery plugin that transforms @mention texts into hyperlinks pointing to the actual Twitter profile, #hashtag texts into real hashtag searches, and hyperlink texts into actual hyperlinks.

How to get the thumbnail image from embedded video

I'm using Ruby on Rails 2.3.8 and Hpricot plugin for parsing HTML.
I would like to get thumbnails for embedded videos. Searching the internet, I found that at least YouTube and Vimeo use the Open Graph (OG) protocol, which provides meta tags containing the video info (URL, thumbnail, etc.).
For example, if I had this video, I could read the following meta tag, using Hpricot plugin:
<meta property="og:image" content="http://b.vimeocdn.com/ts/101/345/101345354_200.jpg" />
So, using Hpricot I should be able to parse it as follows:
video_url = "http://vimeo.com/16430948"
video_page = Hpricot.parse(open(video_url))
element = video_page.search("//meta[@property='og:image']")
But I get an empty element instead.
Note: if you search for video_page.search("//meta"), the result list includes the tag I want, but the previous syntax does not find it.
Could anybody tell me how I can solve this?
I came across this question whilst having a similar problem with Hpricot and meta data.
In the end I had to change the xpath from //meta to /html/head to get my scraping working. Trying the same here seems to work.
video_page.at('/html/head/meta[@property="og:image"]')['content']
Returns your image's URL.
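The attribute test in that XPath uses @ (as in @property), and that part is standard XPath rather than anything Hpricot-specific. Here is a small sketch of the same lookup using REXML instead of Hpricot, chosen only so that it runs without extra gems. The HTML is a cut-down, hard-coded stand-in for the video page, and the og:title value is invented.

```ruby
require 'rexml/document'

# A well-formed stand-in for the video page's <head>; only the
# og:image tag is taken from the question.
page = <<~HTML
  <html>
    <head>
      <meta property="og:title" content="Example video" />
      <meta property="og:image" content="http://b.vimeocdn.com/ts/101/345/101345354_200.jpg" />
    </head>
  </html>
HTML

doc = REXML::Document.new(page)

# [@property='og:image'] keeps only the <meta> element whose
# "property" attribute has that value.
meta = REXML::XPath.first(doc, "/html/head/meta[@property='og:image']")
thumbnail = meta.attributes['content']
puts thumbnail
```

REXML needs well-formed markup, so for real pages an HTML-tolerant parser is the better fit.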
