In need of an explanation of web scraping with Nokogiri in Rails

I am utterly confused and lost with Nokogiri and web scraping in Rails. I need someone to explain how I can get article titles from a web site and list them in a view in my Rails application. I can manage to retrieve the data in irb; however, I have no clue how to get that same data displayed in a view I made.
I have watched a number of tutorials and read documentation, and one thing that confuses me the most is this: when they require nokogiri or open-uri in their example Ruby file, what directory is that file supposed to be placed in? Also, how is that file associated with a controller so its results show up in the particular view I made?
I hope I am explaining my issue as clearly as possible; I am trying not to confuse myself any more than I already am.
What I am trying to do is create an application where the user can register and sign in. After they are signed in, they are redirected to a page with three links: Audi, BMW, and Mercedes-Benz. Depending on which link is clicked, the user is then directed to another page that returns a list of articles mentioning their chosen make.
I hope this explanation was helpful and I really hope someone can offer to help or give me some kind of documentation that will benefit me.
Thank you!
This is what I did in irb:
2.1.1 :001 > require 'rubygems'
=> false
2.1.1 :002 > require 'nokogiri'
=> true
2.1.1 :003 > require 'open-uri'
=> true
2.1.1 :004 > page = Nokogiri::HTML(open("http://www.dtm.com/de/News/Archiv/index.html"))
I then got this returned:
=> #<Nokogiri::HTML::Document:0x814e3b40 name="document" children=[#<Nokogiri::XML::DTD:0x814e37f8 name="HTML">, #<Nokogiri::XML::Element:0x814e358c name="html" children=[#<Nokogiri::XML::Text:0x814e3384 "\r\n">, #<Nokogiri::XML::Element:0x814e32d0 name="head" children=[#<Nokogiri::XML::Text:0x814e30f0 "\r\n">, #<Nokogiri::XML::Element:0x814e3028 name="title" children=[#<Nokogiri::XML::Text:0x814e2e48 "DTM | Newsarchiv">]>, #<Nokogiri::XML::Text:0x814e2c90 "\r\n">, #<Nokogiri::XML::Element:0x814e2bc8 name="meta" attributes=[#<Nokogiri::XML::Attr:0x814e2b64 name="charset" value="utf-8">]>, #<Nokogiri::XML::Text:0x814e2718 "\r\n">, #<Nokogiri::XML::Element:0x814e2664 name="meta" ...
(I got more, but only put up a few lines of what was returned.) I am assuming this is the parsed representation of the page.
I then put:
2.1.1 :008 > puts page
Which returned the raw HTML content.
Finally I entered:
2.1.1 :014 > page.css("a")
Which returned all the links on the page.
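For reference, continuing in irb, candidate article titles can often be pulled straight out of those links. A minimal sketch (the assumption that titles are the link text is hypothetical and would need adjusting to the site's actual markup):
titles = page.css("a").map { |link| link.text.strip }.reject(&:empty?)
titles.first(10) # peek at the first few candidates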

I am hoping to help you with a real-world example. Let's get some data from Reuters.
In your console try this:
# require your tools; make sure you have run gem install nokogiri first
pry(main)> require 'nokogiri'
pry(main)> require 'open-uri'
# set the url
pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"
# load and assign to a variable
pry(main)> doc = Nokogiri::HTML(open(url))
# grab the pieces of the page that have the CSS class .sectionQuote (you can use ids too)
pry(main)> quote = doc.css(".sectionQuote")
Now if you have a look in quote you will see that you have Nokogiri elements. Let's have a look inside:
pry(main)> quote.size
=> 6
pry(main)> quote.first
=> #(Element:0x43ff468 {
name = "div",
attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote nasdaqChange" })],
children = [
#(Text "\n\t\t\t"),
#(Element:0x43fef18 {
name = "div",
attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
children = [
#(Text "\n\t\t\t\t"),
#(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
.....
}),
#(Text "\n\t\t")]
})
You can see that Nokogiri has essentially encapsulated each DOM element, so that you can search and access it quickly.
If you simply want to display this div element, you can:
pry(main)> quote.first.to_html
=> "<div class=\"sectionQuote nasdaqChange\">\n\t\t\t<div class=\"sectionQuoteDetail\">\n\t\t\t\t<span class=\"nasdaqChangeHeader\">0005.HK on Hong Kong Stock</span>\n\t\t\t\t<br class=\"clear\"><br class=\"clear\">\n\t\t\t\t<span style=\"font-size: 23px;\">\n\t\t\t\t82.85</span><span>HKD</span><br>\n\t\t\t\t<span class=\"nasdaqChangeTime\">14 Aug 2014</span>\n\t\t\t</div>\n\t\t</div>"
and it is possible to use it directly in the view of a Rails application.
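For instance, a sketch (assuming the controller assigned @quote = doc.css(".sectionQuote"), which is hypothetical here; raw is needed because Rails escapes strings in views by default):
<%# @quote is assumed to have been assigned in the controller action %>
<%= raw @quote.first.to_html %>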
If you want to be more specific and pick out individual components one level down, you can loop over the quote variable:
pry(main)> quote.each{|p| puts p.inspect}
Or be very specific and get the value of a single element, i.e. the name of the stock in our example:
pry(main)> quote.at_css(".nasdaqChangeHeader").content
=> "0005.HK on Hong Kong Stock"
This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
Really hope this helps
PS: A tip for looking inside objects
(http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect)
puts quote.inspect

First, you can put nokogiri in the Gemfile of your Rails app; with that in place you don't need to require it yourself (open-uri is part of Ruby's standard library and can be required where needed).
Your flow to scrape the sites should be:
# put this code in your controller
web_site = params[:web_site] # could be http://www.bmw.com/com/en/
@doc = Nokogiri::HTML(open(web_site))
Then you can iterate over the document in your view:
<% @doc.css('.standardTeaser').each do |teaser_bmw| %>
  <p><%= teaser_bmw.css('.headline').text %></p>
  <%# other content of the teaser can be searched here %>
<% end %>
So, to scrape the web site you need to fetch its HTML and find the content you want to grab.
If you know the basics of CSS selectors it will be very easy to do. My example doesn't take into account saving the data to a database... but if you want that, you just need to create a table with the fields you need and then create a record after parsing the HTML.
Does that make sense to you?
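Tying this back to the original question, here is a fuller sketch of the three-link flow. Everything here (controller name, route parameter, the idea of filtering link text) is hypothetical and would need adapting; only the DTM URL comes from the question itself:
# app/controllers/articles_controller.rb (hypothetical)
require 'open-uri'

class ArticlesController < ApplicationController
  MAKES = { 'audi' => 'Audi', 'bmw' => 'BMW', 'mercedes' => 'Mercedes-Benz' }

  def index
    make = MAKES[params[:make]] # e.g. /articles?make=bmw
    page = Nokogiri::HTML(open("http://www.dtm.com/de/News/Archiv/index.html"))
    # keep only links whose text mentions the chosen make
    @titles = page.css('a').map { |a| a.text.strip }.select { |t| t.include?(make) }
  end
end
Then the matching view simply lists the titles:
<%# app/views/articles/index.html.erb %>
<ul>
  <% @titles.each do |title| %>
    <li><%= title %></li>
  <% end %>
</ul>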

Related

Is there a way to parse external RSS Feeds with Jekyll?

I have several websites and would like to show their content, like headlines, via RSS in a Jekyll project. Is it possible to parse external RSS feeds with Jekyll and use them?
Yes. You'd either want to create a plugin to fetch and parse the external feeds during jekyll build or, plan B, you could always fetch and parse the feeds client-side with AJAX. Since you asked for a Jekyll answer, here's a rough approximation of the former approach:
# Runs during jekyll build
class RssFeedCollector < Jekyll::Generator
  safe true
  priority :high

  def generate(site)
    # TODO: Insert code here to fetch RSS feeds
    rss_item_coll = nil
    # Create a new on-the-fly Jekyll collection called "external_feed"
    jekyll_coll = Jekyll::Collection.new(site, 'external_feed')
    site.collections['external_feed'] = jekyll_coll
    # Add fake virtual documents to the collection
    rss_item_coll.each do |item|
      title = item[:title]
      content = item[:content]
      guid = item[:guid]
      path = site.in_source_dir("_rss/" + guid + ".md")
      doc = Jekyll::Document.new(path, :site => site, :collection => jekyll_coll)
      doc.data['title'] = title
      doc.data['feed_content'] = content
      jekyll_coll.docs << doc
    end
  end
end
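For the TODO above, a minimal sketch using Ruby's standard rss library (the feed URL is a placeholder, and this assumes RSS 2.0-style items that carry a guid):
require 'rss'
require 'open-uri'

# Fetch a feed and map its items into the shape the generator expects.
def fetch_rss_items(url)
  feed = RSS::Parser.parse(open(url).read, false)
  feed.items.map do |item|
    { :title => item.title, :content => item.description, :guid => item.guid.content }
  end
end

rss_item_coll = fetch_rss_items("http://example.com/feed.xml")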
You can then access the collection in your template like so:
{% for item in site.collections['external_feed'].docs %}
<h2>{{ item.title }}</h2>
<p>{{ item.feed_content }}</p>
{% endfor %}
There are a lot of possible variations on the theme, but that's the idea.
Well, I don't think Jekyll per se can do that, because Jekyll is more of a CMS. However, Jekyll is written in Ruby, and I believe you can easily run Ruby/Rake tasks with it (that's probably even what happens when you build a Jekyll site), so you should probably do this as a Ruby script.

Rails Microsoft Word, XML databinding, repeat rows

Those willing to jump straight to my questions can go to the paragraph "Please help with". There you will find my beginning of an implementation, along with short XML samples.
The story
The famous problem: inserting repeating content, like table rows, into a Word template, using the Rails framework.
I decided to implement a 'cleaner' solution for replacing some variables in a Word document with Rails, using XML databinding. This solution works very well for non-repetitive content, but for repetitive content a little extra dirty work must be done, and I need help with it.
No C#, no Visual, just plain old Ruby on Rails & XML.
The databinded document
I have a Word document with some content controls, tagged with "human-readable" text, so my users know what should be inside.
I have used Word 2007 Content Control Toolkit to add some custom XML to a .docx file. Therefore in each .docx I have some customXml/itemsx.xml that contains my custom XML.
I have manually databound this XML to the text content controls in my Word template, using drag & drop in the Word 2007 Content Control Toolkit.
The replacing process with nokogiri
Basically I already have some code that replaces every XML node with the corresponding value from a hash. For example, if I provide this hash to my function:
variables = {
  "some_xml-node" => "some_value"
}
it will properly replace the XML in customXml/itemsx.xml of the .docx file:
<root> <some> <xml-node>some_value</xml-node> </some> </root>
So this is taken care of!
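(For reference, the kind of helper described might look roughly like the sketch below. The real code already exists; this just assumes, as the sample suggests, that '_' in a hash key separates nesting levels:)
def replace_variables(xml_doc, variables)
  variables.each do |key, value|
    # "some_xml-node" => //some/xml-node
    xpath = "//" + key.split("_").join("/")
    node = xml_doc.at_xpath(xpath)
    node.content = value if node
  end
end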
The repetitive content
Now as I said, this works perfectly for non-repetitive content. For repetitive content (in my case I want to repeat some <w:tr> in a document), the solution I'd like to go with is:
Manually insert some tags in word/document.xml of the .docx file (this is dirty, but hell, I can't think of anything else) before every <w:tr> that needs to be duplicated
In Rails, parse the XML and locate the <w:tr> that needs duplicating using Nokogiri
Copy the <w:tr> as many times as I need
Look at some text inside this <w:tr>, find the databinding (which looks like <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>)
Replace movie[1] with movie[index]
Repeat for every table that needs <w:tr> duplication
This way I ensure 100% compatibility with my existing system! It's a kind of preprocessing...
Please help with
1. Finding an XML comment containing a custom string, and selecting the node just below it (using Nokogiri)
2. Changing attributes in many sub-nodes of the node found in 1.
XML/Hash samples that could be used (my beginning of implementation after that):
Sample of .docx word/document.xml
<w:document>
  <!-- My_Custom_Tag_ID -->
  <w:tr someparam="something">
    <w:td></w:td>
    <w:td><w:sthelse></w:sthelse><w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/><w:sth>Value</w:sth></w:td>
    <w:td></w:td>
  </w:tr>
</w:document>
Sample of input parameter repeat_tag hash
repeat_tags_sample = [
  {
    "tag" => "My_Custom_Tag_ID",
    "repeatable-content" => "movie"
  },
  {
    "tag" => "My_Custom_Tag_ID_2",
    "repeatable-content" => "cartoons"
  }
]
Sample of input parameter contents hash
contents_sample = {
  "movies" => [
    { "name" => "X-Men",   "year" => 1998, "property-xxx" => 42 },
    { "name" => "X-Men-4", "year" => 2007, "property-xxx" => 42 }
  ],
  "cartoons" => [
    { "name" => "Tom_Jerry",   "year" => 1995, "property-yyy" => "cat" },
    { "name" => "Random_name", "year" => 2008, "property-yyy" => 42 }
  ]
}
My beginning of an implementation:
def dynamic_table_content(zip, repeat_tags, contents)
  doc = zip.find_entry("word/document.xml")
  xml = Nokogiri::XML.parse(doc.get_input_stream)
  # repeat_tags_sample = [ {
  #   "tag" => "My_Custom_Tag_ID",
  #   "repeatable-content" => "movie" },
  #   ... ]
  repeat_tags.each do |rpt|
    content = contents[rpt["repeatable-content"]]
    # content now looks like [
    #   { "name" => "X-Men",
    #     "year" => 1998,
    #     "property-xxx" => 42, ... },
    #   ... ]
    content_name = rpt["repeatable-content"].to_s
    # the 'movie' of '/root[1]/movies[1]/movie[1]/name[1]' (see below)
    puts "Processing #{rpt["tag"]}, adding #{content_name}s"
    # Word document.xml sample code looks like this:
    # <!-- My_Custom_Tag_ID_inserted_manually -->
    # <w:tr ...>
    #   ...
    #   <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>
    #   ...
    # </w:tr>

    # 1. Find a comment containing a custom string, and select the node just below:
    #    find the starting <w:tr> tag located after <!-- rpt["tag"] -->
    base_tr_node = nil # TODO: find the node just after the comment

    # Duplicate it as many times as we want.
    content.each_with_index do |item, index|
      puts "Adding #{content_name}: #{item}"
      # dup: add_next_sibling with the node itself would move it instead of copying it
      new_tr_node = base_tr_node.add_next_sibling(base_tr_node.dup)
      # Inside this new node there are many bindings:
      #   <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>
      #   <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/year[1]"/>
      #   .../movie[1]/property-xxx[1]
      # GOAL: replace every movie[1] with movie[index]

      # 2. Change attributes in many sub-nodes of the node found in 1.
      # TODO: change the attributes of new_tr_node as described in GOAL above.
      # Maybe it would be something like
      #   new_tr_node.gsub(/(#{content_name})\[[1-9]+\]/, "\\1[#{index}]")
      # ...but new_tr_node is a Nokogiri element, so .gsub doesn't exist.
    end
  end
  # replace["word/document.xml"] = xml.serialize :save_zip_with => 0
end
I have looked at the DoPE extension for Word documents. It looks great! But alas, I had already done a lot of work, and I have just now (almost) finished building my own preprocessor.
What I needed was more complicated than what I originally asked. But nevertheless, the answers would be:
EDIT: fixed bad regex/XPath
# 1. Find a comment containing a custom string, and select the node just below
comment_nodes = doc.xpath("//comment()")
# loop like: comment_nodes.each do |comment| ... (index comes from the duplication loop)
base_tr_node = comment.next_sibling.next_sibling
# next_sibling must be applied twice because the whitespace between the
# comment and the <w:tr> node is itself a text node, even though the
# comment sits just above the <w:tr>

# 2. Change attributes in many sub-nodes of the node found in 1.
matches = base_tr_node.search(".//*[name()='w:dataBinding']")
matches.each do |databinding_node|
  # replace '...movie[1]...' with '...movie[index]...'; gsub returns a new
  # string, so assign the result back to the attribute
  databinding_node['w:xpath'] = databinding_node['w:xpath'].gsub("#{comment.text}[1]", "#{comment.text}[#{index}]")
end
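A tiny self-contained demo of why next_sibling is applied twice (the newline and indentation between the comment and the element form a text node of their own):
require 'nokogiri'

doc = Nokogiri::XML("<root>\n  <!-- My_Custom_Tag_ID -->\n  <tr/>\n</root>")
comment = doc.xpath("//comment()").first
comment.next_sibling.class             #=> Nokogiri::XML::Text (the "\n  ")
comment.next_sibling.next_sibling.name #=> "tr"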

Nokogiri Timeout::Error when scraping own site

Nokogiri works fine for me in the console, but if I put it anywhere... Model, View, or Controller, it times out.
I'd like to use it in one of two ways...
Controller
def show
  @design = Design.find(params[:id])
  doc = Nokogiri::HTML(open(design_url(@design)))
  images = doc.css('.well img') ? doc.css('.well img').map { |i| i['src'] } : []
end
or...
Model
def first_image
  doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
  image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
  self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances running: one making the call (your console) and one answering the call (your server).
When you run it from within your app, you are referencing the app itself, which is causing an infinite loop since there is no available resource to respond to your call (that resource is the one making the call!). So you would need to be running multiple instances with something like Unicorn (or simply another localhost instance at a different port), and you would need at least one of those instances to be free to answer the Nokogiri request.
If you plan to run this in production, just know that this setup will require an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. So if you have 4 instances and all 4 happen to make the call at the same time, your whole application is screwed. You'll probably experience pretty severe degradation with only 1 or 2 calls at a time as well...
I'm not sure what the default timeout value is.
But you can specify a timeout value like below:
require 'net/http'
http = Net::HTTP.new('localhost')
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
Finally, you can find out what the problem is, since you now control the timeout value.
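Relatedly, if you would rather stay with open-uri, it also accepts a :read_timeout option (a sketch):
require 'open-uri'
require 'nokogiri'

# fail fast on slow responses instead of hanging indefinitely
html = open("http://localhost:3000/blog/#{self.id}", :read_timeout => 10).read
doc = Nokogiri::HTML(html)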
So, with Tyler's advice I dug into what I was doing a bit more. Because of the disconnect that CKEditor has with the images, due to CarrierWave and S3, I can't get any info directly from the uploader (at least it seems that way to me).
Instead, I'm sticking with Nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML. I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site, to get the images associated with a blog entry:
designs_controller.rb
def create
  params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map { |i| i['src'] }[0]
  @design = Design.new(params[:design])
  if @design.save
    flash[:success] = "Design created"
    redirect_to designs_url
  else
    render 'designs/new'
  end
end

def show
  @design = Design.find(params[:id])
  @categories = @design.categories
  @tags = @categories.map { |c| c.name }
  @related = Design.joins(:categories).where('categories.name' => @tags).reject { |d| d.id == @design.id }.uniq
  set_meta_tags og: {
    title: @design.name,
    type: 'article',
    url: design_url(@design),
    image: Nokogiri::HTML(@design.content).css('img').map { |i| i['src'] },
    article: {
      published_time: @design.published_at.to_datetime,
      modified_time: @design.updated_at.to_datetime,
      author: 'Alphabetic Design',
      section: 'Designs',
      tag: @tags
    }
  }
end
The Update action has the same code for Nokogiri as the Create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...

Post a product with the MWS API and Ruby on Rails

I'm trying to upload a new product to MWS with the MWS API and the mws gem.
The product is added (but lands under "Fix failed listings", because it doesn't have a quantity and price).
I'm trying with the following code:
mws = Mws.connect(
  merchant: 'merchant',
  access: 'access',
  secret: 'secret'
)
Later:
product = Mws::Product('11333663') {
  upc '1234355462233'
  tax_code 'GEN_TAX_CODE'
  name 'Some Pduct 034'
  brand 'Some Bnd'
  msrp 18.9, 'USD'
  quantity 10
  manufacturer 'Some Mufacturer'
  category :ce
  details {
    cable_or_adapter {
      cable_length as_distance 5, :feet
    }
  }
}
Later:
submission_id = mws.feeds.products.add(product)
The product is added, but when I executed this line:
submission_id = mws.feeds.products.update(product)
The next message is displayed:
=> #<Mws::Apis::Feeds::SubmissionResult:0x9d9ae78 @transaction_id="12345678",
   @status=#<Mws::EnumEntry:0x9d96170 @sym=:complete, @val="Complete">,
   @messages_processed=0,
   @counts={:success=>0, :error=>1, :warning=>0},
   @responses={:"0"=>#<... @code=90208, @description="Purge and replace is not allowed for this feed type.">}>
2.0.0-p195 :050 > result = mws.feeds.get(submission_id.id)
=> #<... @messages_processed=1,
   @counts={:success=>0, :error=>1, :warning=>0},
   @responses={:"0"=>#<... @code=90000,
     @description="http://sellercentral.amazon.com/myi/search/ErrorListingsSummary?batchId=7564766086">,
   :"1"=>#<... @code=99042, @description="A value was not provided for \"item_type\". Please provide a value for \"item_type\". Please use the Product Classifier or download the category-specific Browse Tree Guide from Seller Help to see a list of valid \"item_type\" values. This information tells Amazon where your product should be classified and affects how easily customers can find your product.", @additional_info={:sku=>"11333668"}>}>
But when I tried to update the inventory and the price, the following error occurred:
result = mws.feeds.get(price_submission_id.id)
=> #<... @status=#<Mws::EnumEntry @sym=:complete, @val="Complete">, @messages_processed=0,
   @counts={:success=>0, :error=>1, :warning=>0},
   @responses={:"0"=>#<... @code=90208, @description="Purge and replace is not allowed for this feed type.">}>
What can I do?
Without any in-depth knowledge of that Ruby gem (or of Ruby), I can probably still point you in the right direction:
In MWS, feeds automatically update information already in the Amazon database. The call to create a record is identical to a subsequent call to update it. That also means you don't have to keep track of which items were already added to Amazon in the past.
In terms of your Ruby library, you probably should call mws.feeds.products.add(product) for subsequent updates of that product record, and not call mws.feeds.products.update(product) at all. The latter seems to create what's called PurgeAndReplace feeds in MWS, which you should avoid like the plague.
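In other words, reusing only the calls already shown above, subsequent updates would look like this (a sketch):
# the same call both creates and updates the listing; Amazon matches on SKU
submission_id = mws.feeds.products.add(product)
result = mws.feeds.get(submission_id.id)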
All other errors you encountered seem to be related to the same root cause.

Generate a link_to on the fly if a URL is found inside the contents of a db text field?

I have an automated report tool (corp intranet) where the admins have a few text area boxes to enter some text for different parts of the email body.
What I'd like to do is parse the contents of the text areas and wrap any hyperlinks found in link tags (so when the report goes out there are links instead of text URLs).
Is there a simple way to do something like this, without hand-rolling a parser that wraps tags around everything from a found 'http:', 'https:', or 'ftp:' up to the first space after it?
Thank You!
Ruby 1.8.7, Rails 2.3.5
Make a helper:
def make_urls(text)
  urls = %r{(?:https?|ftp|mailto)://\S+}i
  html_text = text.gsub urls, '<a href="\0">\0</a>'
  html_text
end
In the view just call this function and you will get the expected output, like:
irb(main):001:0> string = 'here is a link: http://google.com'
=> "here is a link: http://google.com"
irb(main):002:0> urls = %r{(?:https?|ftp|mailto)://\S+}i
=> /(?:https?|ftp|mailto):\/\/\S+/i
irb(main):003:0> html = string.gsub urls, '<a href="\0">\0</a>'
=> "here is a link: <a href=\"http://google.com\">http://google.com</a>"
There are many ways to accomplish your goal. One way would be to use Regex. If you have never heard of regex, this wikipedia entry should bring you up to speed.
For example:
content_string = "Blah ablal blabla lbal blah blaha http://www.google.com/ adsf dasd dadf dfasdf dadf sdfasdf dadf dfaksjdf kjdfasdf http://www.apple.com/ blah blah blah."
content_string.split(/\s+/).find_all { |u| u =~ /^https?:/ }
Which will return: ["http://www.google.com/", "http://www.apple.com/"]
Now, for the second half of the problem, you will use the array returned above to substitute hyperlinks for the plain-text links.
links = ["http://www.google.com/", "http://www.apple.com/"]
links.each do |l|
  content_string.gsub!(l, "<a href='#{l}'>#{l}</a>")
end
content_string will now be updated to contain HTML hyperlinks for all http/https URLs.
As I mentioned earlier, there are numerous ways to tackle this problem - to find the URLs you could also do something like:
require 'uri'
URI.extract(content_string, ['http', 'https'])
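For the sample content_string above, this returns the same two URLs. Worth noting, too: since this is Rails 2.3.5, ActionView's built-in auto_link helper already does the whole wrap-URLs-in-anchor-tags job in a view (a sketch):
<%# wraps URLs (and email addresses) found in the text in <a> tags %>
<%= auto_link(content_string) %>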
I hope this helps you.
