Rails: Is it possible to import content from another website? - ruby-on-rails

Specifically, I would like to import the first block of text before the table of contents from a Wikipedia page (which is public domain).
Let's say I have a Model "Resource", with an attribute x, and x is a string that is a Wikipedia link (eg. x: "http://en.wikipedia.org/wiki/Lanny_McDonald"). The first block of text on every Wikipedia page is the group of <p>...</p>'s before <div id="toc" class="toc">...</div>.
Can I write code that copies the content of these <p>...</p>'s and writes it onto my website?

This is known as web scraping.
Ironically, Wikipedia's own article on the subject is a good place to start, and you should consider the legal ramifications before reusing the content.
Nokogiri is a good tool for this.
Install:
sudo gem install nokogiri -- --with-xml2-include=/usr/local/include/libxml2 --with-xml2-lib=/usr/local/lib
Usage:
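Nokogiri can search a document with XPath or CSS selectors, which keeps things simple.
# wiki_scraper.rb
require 'open-uri'
require 'nokogiri'
# Load the URL (URI.open is needed on Ruby 3.0 and later; plain open no longer fetches URLs).
@doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Branch_predictor"))
# Print the first <p> element in the document.
# On Wikipedia the paragraphs are nested inside content divs, not directly under <body>.
puts @doc.at_xpath("//p")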

You could use an HttpWebRequest to retrieve the entire page and then parse the HTML. There are tools available to convert HTML to XHTML, at which point you could use XML libraries to parse the XHTML.

Related

Nokogiri returning variable name instead of actual data on website?

I am fetching data from a website and need to get the text inside an h1 tag. When I inspect the element, there is text inside that h1 tag. But when I fetch it using Nokogiri, there is a variable name in that h1 tag instead.
require 'open-uri'
require 'nokogiri'
content = URI.open('https://example.com').read
html = Nokogiri::HTML(content)
html.css('h1#egift-refresh-online-number-desktop').text
When I inspect the element in Chrome, I see the actual text.
But when I view the source of that page, I see the variable name.
I need to extract the actual value, not the variable name. How can I do that with Nokogiri? Is there a method for doing this?
Nokogiri is just a simple XML/HTML parser and is not the right tool for this job.
What you have fetched looks like a Handlebars template (or one of its many offshoots), and {{ ecardDetails.cardCardnumber }} is just a placeholder in the HTML file that is replaced with actual data by JavaScript, possibly after an AJAX request.
Nokogiri does not execute JavaScript, as it's not a browser.
Capybara is a DSL, mostly used for acceptance testing, which can automate a browser when used with the correct driver (like Selenium or WebKit) and can thus scrape pages that rely on JavaScript.

Ruby gem Mechanize

Is it possible to use the render method of a controller to render the content of a Mechanize object? I tried:
def new
a = Mechanize.new
a.get('http://flickr.com/')
render :html => a.current_page
end
which throws an error, as do render :text => a, render :text => a.page, and render :text => a.current_page.
I understand that the render function is not expecting a Mechanize object, I just don't know what it wants and how to get it there.
I am at the beginning stages of my development and researching all web scraping frameworks for Ruby and any help would be appreciated.
Try the body method:
agent = Mechanize.new
page = agent.get('http://www.example.net')
puts page.body[0..100]
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml"
You can also dive deeper into the document using Nokogiri's capabilities. Mechanize is built around Nokogiri, so you can get to the parsed document Nokogiri creates, then use CSS or XPath accessors to locate sub-sections of the document. Once you've found what you want, you can use the to_html method to have Nokogiri emit the HTML for the node or nodeset. See "extract single string from html using ruby/mechanize (and nokogiri)" for more information.
Now, while that'll work, you might want to consider whether you're violating the terms-of-service or copyrights by reusing the content directly on your page.

Embedding youtube video in markdown?

I use the Ruby gem formatize to parse my Markdown-formatted text. Now I want to embed a YouTube video in the Markdown text, but whenever I add the iframe snippet, the gem (or Markdown?) removes it from the output. Any advice?
Thanks!
You'll have to get formatize to ignore <iframe> tags. See this link.
You can have markdown + HTML together so it sounds like it's an issue with the gem. Notice how the markdown syntax recommends that the older YouTube markup is embedded via direct HTML. You might be able to get away using the older <object> tag approach; I think it's still supported.
According to formatize's documentation, you should pass :safe => true into the markdown function (this opens a security hole, so be sure to run your own, customized sanitization).
That doesn't work, so I am instead using my own copy of formatize's function that does no sanitization (yet):
module ApplicationHelper
def post_body(post)
(post.body.blank? ? "" : BlueCloth.new(post.body).to_html).html_safe
end
end

Generating a link with Markdown (BlueCloth) that opens in a new window

I'd like to have a link generated with BlueCloth that opens in a new window. All I could find was the ordinary [Google](http://www.google.com/) syntax but nothing with a new window.
Ideas?
Regards
Tom
Here is a complete reference for markdown: http://daringfireball.net/projects/markdown/syntax
And since there is no mention of how to set the target attribute, I would believe it is not directly possible, but the reference also says:
"For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags."
Source: http://daringfireball.net/projects/markdown/syntax#html
So I would suggest you have to use the html syntax for links like this
update
if you wrap the markdown generated content in a div with a specific id like this:
and you use jQuery, you can add the following javascript:
$('#some_id a').attr('target','_blank');
Or you can save the BlueCloth output in a variable and rewrite the links before outputting:
markdown_generated_string.gsub!(/<a\s+/i,'<a target="_blank" ')
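A self-contained sketch of the gsub approach follows. A plain string stands in for the BlueCloth output so the snippet runs without the gem, and open_links_in_new_window is a hypothetical helper name.

```ruby
# Stand-in for the HTML string that BlueCloth would return (assumption).
markdown_generated_string = '<p><a href="http://www.google.com/">Google</a></p>'

# Hypothetical helper: returns a copy of the HTML with target="_blank"
# inserted into every opening <a ...> tag.
def open_links_in_new_window(html)
  html.gsub(/<a\s+/i, '<a target="_blank" ')
end

puts open_links_in_new_window(markdown_generated_string)
# prints: <p><a target="_blank" href="http://www.google.com/">Google</a></p>
```

The non-mutating gsub is used here because gsub! returns nil when nothing matches, which is easy to trip over if the result is output directly. The pattern does not check for an existing target attribute, so links that already have one would end up with two.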

How to manipulate DOM with Ruby on Rails

As the title said, I have some DOM manipulation tasks. For example, I want to:
- find all H1 elements that are blue.
- find all text that has a size of 12px.
- etc..
How can I do it with Rails?
Thank you.. :)
Update
I have been doing some research about extracting web page content based on this paper-> http://www.springerlink.com/index/A65708XMUR9KN9EA.pdf
The steps, in summary:
- get the URL of the web page I want to extract (a single web page)
- grab some elements from the web page based on visual rules (e.g. grab all H1 elements that are blue)
- process the elements with my algorithm
- save the result into my database
-sorry for my bad english-
If what you're trying to do is manipulate HTML documents inside a rails application, you should take a look at Nokogiri.
It can search a document with XPath. With the following, you would find any link with the "blue" CSS class that sits directly inside an h1.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.stackoverflow.com'))
doc.xpath('//h1/a[@class="blue"]').each do |link|
puts link.content
end
If, on the other hand, you want to work with the DOM of the page currently displayed in the browser, you should take a look at JavaScript and jQuery. Rails can't do that, because it runs on the server.
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
To reliably sort out what color an arbitrary element on a webpage is, you would need to reverse engineer a browser (to accurately take into account stylesheets, markup hacks, broken tags, images, etc).
A far easier approach would be to embed an existing browser engine such as Gecko into a custom application of your own.
As your spider browses pages, it would pass them to your embedded instance of Gecko, where you could use getComputedStyle to find out what color an individual element is.
You originally mentioned wanting to use Ruby on Rails for this project. Rails is a framework for writing presentational applications and is really a bad fit for a project like this.
As a starting point, I'd recommend you check out RubyGnome, and in particular RubyGnome's Gtk::MozEmbed functionality.
