Trouble receiving properly formatted result using Nokogiri - ruby-on-rails

I am moving some of my scraping from JavaScript to Ruby, and I am having trouble using Nokogiri.
I have trouble getting the right <dl> in a target class. I tried using css and xpath with the same result.
This is a sample of the HTML:
<div class="target">
<dl>
<dt>A:</dt>
<dd>foo</dd>
<dt>B:</dt>
<dd>bar</dd>
</dl>
</div>
This is a sample of my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs in Ruby 3.0+
doc.css(".target > dl").each do |item|
  puts item.text # I would expect to receive a collection of nodes here,
                 # yet I am receiving a single block of text
end
doc.css(".target > dl > dt").each do |item|
  puts item.text # Here I would expect to iterate through a collection of
                 # dt elements, however I receive a single block of text
end
Can someone show me what I am doing wrong?

In the first case, the selector matches the single dl, so item.text returns the text of everything inside it as one block. That is expected.
In the second case, the result is two individual dt elements. You are printing their text one after another, which is hard to tell apart from printing the text of the dl.
doc.css('.target > dl').length
# => 1 # as you have one `dl` element in `.target`
doc.css('.target > dl > dt').length
# => 2 # as you have two `dt` elements that are children of a `dl` in `.target`
doc.css(".target > dl > dt").each do |item|
  puts item.text
  puts "---" # make it obvious which element is which
end
# => A:
# ---
# B:
# ---
I am not quite sure what other result you are expecting.

I'd use something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="target">
<dl>
<dt>A:</dt>
<dd>foo</dd>
<dt>B:</dt>
<dd>bar</dd>
</dl>
</div>
EOT
This finds the first class='target', then its contained <dt> tags, and extracts each <dt>'s text:
doc.at('.target').search('dt').map{ |n| n.text } # => ["A:", "B:"]
This does the same, using the &:text shorthand with map:
doc.at('.target').search('dt').map(&:text) # => ["A:", "B:"]
This lets the engine find all <dt> embedded in all class="target" tags:
doc.search('.target dt').map(&:text) # => ["A:", "B:"]
See "How to avoid joining all text from Nodes when scraping" also.

Related

How to fetch text before an HTML tag using Nokogiri

I need to get details from an email that is sent to me. I need to place each value inside a variable and save it to database or save them on a hash first before saving to the database.
I'm using the Mail gem to retrieve the email using POP3, and Nokogiri for parsing the email. The data I need to retrieve is inside the <span> tag. However I also need to get the text before the <span> tag which will serve as the key for the text inside the tag. For instance, Name: <span> My Name </span>.
Expected output should be like this if saved in hash:
hash = { 'Tour Name:' => 'Day Tour', 'Tour Date:' => '2019-06-07' }
or at least I'm able to get the key and the values together.
Here is my code:
require 'mail'
require 'nokogiri'
class SomeClass
  def self.get_email
    Mail.defaults do
      retriever_method :pop3, :address    => "pop.gmail.com",
                              :port       => 995,
                              :user_name  => username,
                              :password   => password,
                              :enable_ssl => true
    end
    email = Mail.first.html_part.to_s
    doc = Nokogiri::HTML::Document.parse(email)
    puts doc.css('span').map(&:text) # gets the text of each span only
  end
end
Raw HTML code of the email:
<tr>
<td>
Tour Name: <span style="font-weight:bold">Day Tour</span>
</td>
</tr>
<tr>
<td>
Tour Date: <span style="font-weight:bold">June 07, 2019</span>
</td>
</tr>
Everything depends on the raw HTML of the email. If it is as simple as you showed, then the following code should work:
doc.css('td').map { |td| td.children.map(&:text) }
Then you can convert the result into a hash by calling to_h, provided each inner array ends up with exactly two elements (a key and a value).
Keep in mind that the cells contain extra whitespace, including whitespace-only text nodes, which should be stripped or filtered out first.
The answer by @MrShemek is suitable for your HTML. If the nodes inside each cell are nested more deeply, you can do:
Nokogiri::HTML(email).css('td').map{|t| r=t.css('span').remove; [t.text, r.text].map(&:strip)}.to_h
# => {"Tour Name:"=>"Day Tour", "Tour Date:"=>"June 07, 2019"}
The inner element is removed from the cell and the remaining text is fetched; the two are then combined into a pair of the cell's text and the inner element's text.

How to check if one of the options in my select menu contains text?

I’m using Rails 4.2.7 with Nokogiri. Is there a way I can tell, with Nokogiri, that one of the options in my select menu contains the word "Results" in its text field (that would be visible to the end user)?
I have:
options = doc.css("#menu_id option")
And I can cycle through all of them, checking the text, but I figured there might be a CSS-selector expression or something similar I can do with Nokogiri that would tell me this answer more quickly.
This is the generic way to do it:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<form>
<select id="menu_id">
<option value="foo">foo</option>
</select>
</form>
</body>
</html>
EOT
doc.search('#menu_id option').any?{ |option| option.text == 'foo' } # => true
That looks to see if the text, not the value, is "foo".
doc.search('#menu_id option').any?{ |option| option.text['foo'] } # => true
That looks to see if the text, not the value, contains the sub-string "foo".
doc.search('#menu_id option').any?{ |option| option['value'] == 'foo' } # => true
That looks to see if the value attribute matches the word "foo".
Similarly, they'll tell you whether something doesn't match:
doc.search('#menu_id option').any?{ |option| option.text == 'bar' } # => false
doc.search('#menu_id option').any?{ |option| option.text['bar'] } # => false
doc.search('#menu_id option').any?{ |option| option['value'] == 'bar' } # => false
I would probably rely on Nokogiri's implementation of the jQuery extensions that @gmcnaughton mentioned but that's how I am. YMMV.
Nokogiri supports jQuery's :contains selector, which selects nodes with the given content:
doc.css("#menu_id option:contains('Results')")
You could also do it with an XPath, which has more power but is more verbose:
doc.xpath('//*[@id="menu_id"]//option[contains(text(), "Results")]')
See "Nokogiri: How to select nodes by matching text?".

Ruby - link_to - How to add data directly from DB

First of all, I am very new to ruby and I am trying to maintain an application already running in production.
I have been so far able to "interpret" the code well, but there is one thing I am stuck at.
I have a haml.html file where I am trying to display links from DB.
Imagine a DB structure like below
link_name - Home
URL - /home.html
class - clear
id - homeId
I display a link on the page as below
<a href="/home.html" class="clear" id="home">Home</a>
To do this I use 'link_to' where I am adding code as follows
-link_to model.link_name , model.url, {:class => model.class ...... }
Now I have a new requirement where we have a free text in DB, something like -
data-help="home-help" data-redirect="home-redirect" which needs to come into the options.
So code in haml needs to directly display content versus assign it to a variable to display.
In other words I am able to do
attr= '"data-help="home-help" data-redirect="home-redirect"' inside the <a>, but not able to do
data-help="home-help" data-redirect="home-redirect" in <a> tag.
Any help would be greatly appreciated!
link_to accepts a hash, :data => { :foo => "bar" }, of key/value pairs that it builds into data- attributes on the anchor tag. The example above produces the attribute data-foo="bar".
So you could write a method on the model that takes self.data_fields (or whatever it's called), splits it into attribute pairs, and builds a hash from them. Then you can pass that hash directly to the :data option of link_to with :data => model.custom_data_fields_hash.
This somewhat verbose method splits things out and returns a hash containing {"help"=>"home-help", "redirect"=>"home-redirect"}:
def custom_data_fields_hash
  # this would be replaced by self.your_models_attr
  data_fields = 'data-help="home-help" data-redirect="home-redirect"'
  # split the full string by spaces into attr pairs
  field_pairs = data_fields.split " "
  results = {}
  field_pairs.each do |field_pair|
    # split the attr and value by the =
    data_attr, data_value = field_pair.split "="
    # remove the 'data-' substring because the link_to will add that in automatically for :data fields
    data_attr.gsub! "data-", ""
    # Strip the quotes, the helper will add those
    data_value.gsub! '"', ""
    # add the processed pair to the results
    results[data_attr] = data_value
  end
  results
end
Running this in a Rails console gives:
2.1.2 :065 > helper.link_to "Some Link", "http://foo.com/", :data => custom_data_fields_hash
=> "<a data-help=\"home-help\" data-redirect=\"home-redirect\" href=\"http://foo.com/\">Some Link</a>"
Alternatively, you could make it a helper and pass in the model's attribute instead (the method would then need to accept the string as an argument):
link_to "Some Link", "http://foo.com/", :data => custom_data_fields_hash(model.data_fields_attr)
I'm not sure you can embed an attribute string directly. You could try decoding the string into a hash and passing that to link_to (here str holds the attribute string from the database):
= link_to model.link_name, model.url,
  { :class => model.class }.merge(Hash[str.scan(/([\w-]+)="([^"]*)"/)])
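The scan-based decoding can be tried outside Rails. The sketch below assumes a hypothetical helper name, data_attributes_from, and a hard-coded sample string in place of the database column.

```ruby
# Hypothetical helper: turns a stored attribute string into a hash of
# attribute name => value, keeping the "data-" prefix on each key.
def data_attributes_from(attr_string)
  # scan returns one [name, value] pair per name="value" occurrence
  attr_string.to_s.scan(/([\w-]+)="([^"]*)"/).to_h
end

# Sample value standing in for the free-text column from the question.
stored = 'data-help="home-help" data-redirect="home-redirect"'

# Merge the decoded attributes into the other link options.
options = { :class => "clear" }.merge(data_attributes_from(stored))

puts options["data-help"]       # prints: home-help
puts options["data-redirect"]   # prints: home-redirect
```

Because the pattern matches everything between the quotes, values that contain spaces survive, which splitting the string on spaces would not allow. The to_s call makes a nil column produce an empty hash.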

Adding a node using Nokogiri

I have an HTML string (for instance <div class="input">hello</div>) and I want to add a node only if the HTML tag in the string is a label (for instance <label>Hi</label>).
doc = Nokogiri::XML(html)
doc.children.each do |node|
  if node.name == 'label'
    # this code gets called
    span = Nokogiri::XML::Node.new "span", node
    span.content = "hello"
    puts span.parent
    # nil
    span.parent = node
    # throws error "node can only have one parent"
  end
end
doc.to_html # Does not contain the span.
I cannot for the life of me understand what I'm doing wrong, any help would be greatly appreciated.
Edit: This solved my problem, thanks for the answers!
# notice DocumentFragment rather than XML
doc = Nokogiri::HTML::DocumentFragment.parse(html_tag)
doc.children.each do |node|
  if node.name == 'label'
    span = Nokogiri::XML::Node.new "span", doc
    node.add_child(span)
  end
end
It's easy to add/change/delete HTML:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input">hello</div>')
div = doc.at('div')
div << '<span>Hello</span>'
puts doc.to_html
Which results in:
# >> <div class="input">hello<span>Hello</span>
# >> </div>
Notice that the above code appended a new node to the existing children of the <div>, because of <<, which means it was added after the text node containing "hello".
If you want to overwrite the children, you can do that easily using children =:
div.children = '<span>Hello</span>'
puts doc.to_html
Which results in:
# >> <div class="input"><span>Hello</span></div>
children = can take a single Node, which can have multiple other nodes nested under it, or the HTML text of the node(s) being inserted. That's what node_or_tags means when you see it in the documentation.
That said, to change just an embedded <label>, I'd do something like:
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input"><label>hello</label></div>')
label = doc.at('div label')
label.name = 'span' if label
puts doc.to_html
# >> <div class="input"><span>hello</span></div>
Or:
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input"><label>hello</label></div>')
label = doc.at('div label')
label.replace("<span>#{ label.text }</span>") if label
puts doc.to_html
# >> <div class="input"><span>hello</span></div>
Nokogiri makes it easy to change the tag's name once you've pointed at it. You can easily change the text inside the <span> by replacing #{ label.text } with whatever you desire.
at('div label') is one way of finding a particular node. It basically means "find the first label tag inside the first div". at means find the first of something, and is similar to using search(...).first. There are CSS and XPath equivalents to both at and search in the Nokogiri::XML::Node documentation if you need those.
A few issues: your span = ... line was creating the node but not actually adding it to the document. Also, you can't access span outside of the block where you created it.
I think this is what you're after:
html = '<label>Hi</label>'
doc = Nokogiri::XML(html)
doc.children.each do |node|
  if node.name == 'label'
    # this code gets called
    span = Nokogiri::XML::Node.new "span", doc
    span.content = "hello"
    node.add_child(span)
  end
end
# NOTE: neither `node` nor `span` is accessible outside of the each block
doc.to_s # => "<?xml version=\"1.0\"?>\n<label>Hi<span>hello</span></label>\n"
Note the node.add_child(span) line.

Displaying XML Hashes in Rails Views Not Working

I have narrowed down a 33,364 entry XML file to the 1,068 that I need. Now I am attempting to gather pieces of information from each node that I have narrowed my search down to, and store each piece of information in a hash, so that I can list out the relevant data in a rails view.
Here is the code in my controller (home_controller.rb) --
class HomeController < ApplicationController
  # REQUIRE LIBRARIES
  require 'nokogiri'
  require 'open-uri'
  def search
  end
  def listing
    @properties = {}
    # OPEN THE XML FILE
    mits_feed = File.open("app/assets/xml/mits.xml")
    # OUTPUT THE XML DOCUMENT
    doc = Nokogiri::XML(mits_feed)
    doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").each do |property|
      # GATHER PROPERTY INFORMATION
      information = {
        "street_address" => property.xpath("Address/AddressLine1").text,
        "city" => property.xpath("Address/City").text,
        "zipcode" => property.xpath("Address/PostalCode").text,
        "short_description" => property.xpath("Information/ShortDescription").text,
        "long_description" => property.xpath("Information/LongDescription").text,
        "rent" => property.xpath("Information/Rents/StandardRent").text,
        "application_fee" => property.xpath("Fee/ApplicationFee").text,
        "bedrooms" => property.xpath("Floorplan/Room[@RoomType='Bedroom']/Count").text,
        "bathrooms" => property.xpath("Floorplan/Room[@RoomType='Bathroom']/Count").text,
        "bathrooms" => property.xpath("ILS_Unit/Availability/VacancyClass").text
      }
      # MERGE NEW PROPERTY INFORMATION TO THE EXISTING HASH
      @properties.merge(information)
    end
  end
end
I'm not getting any errors and my view is loading fine, but it is pulling up blank. Here is my view file (listing.html.erb) --
<div class="propertiesHolder">
  <% if @properties %>
    <ul>
      <% @properties.each do |property| %>
        <li><%= property.information.street_address %></li>
      <% end %>
    </ul>
  <% else %>
    <h1>There are no properties that match your search</h1>
  <% end %>
</div>
Does anyone know why this might be pulling up blank? I would assume that I would receive an error if I had done something incorrect in the code. I also tried just outputting "Hello World" as text for each |property| and this also pulled up blank. Thank you!
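Ruby's Hash#merge does not mutate the receiver. It returns a new hash containing the contents of both, so @properties stays empty. An empty hash is still truthy in Ruby, which is why the view renders an empty list instead of the fallback message.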
Example
h1 = { "a" => 100, "b" => 200 }
h2 = { "b" => 254, "c" => 300 }
h1.merge(h2)
#=> {"a"=>100, "b"=>254, "c"=>300}
h1
#=> {"a"=>100, "b"=>200}
Note how h1 still retains its original values?
What you will want to do is rename your information hash to @properties. I suggest this because you are merging a hash that holds data (information) into an empty hash (@properties). So instead of merging the hashes, just use the first hash directly.
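For comparison, here is a small sketch of merge versus merge!, followed by one possible way to keep every listing. The rows array and the listings name are hypothetical stand-ins for the values read from the XML nodes, and collecting into an array is an assumption about what the view needs, not something the question confirms.

```ruby
# merge returns a new hash and leaves the receiver alone;
# merge! changes the receiver in place.
properties = {}
properties.merge("city" => "Columbus")
puts properties.empty?    # prints: true
properties.merge!("city" => "Columbus")
puts properties["city"]   # prints: Columbus

# Hypothetical rows standing in for the values pulled from each XML node.
rows = [
  { street: "1 Main St", city: "Columbus" },
  { street: "2 High St", city: "Columbus" }
]

# Every listing uses the same keys, so merging them all into one hash
# would keep only the last listing. An array holds one hash per listing.
listings = rows.map do |row|
  { "street_address" => row[:street], "city" => row[:city] }
end

listings.each { |listing| puts listing["street_address"] }
# prints:
# 1 Main St
# 2 High St
```

With a structure like this, the view would read each value by key, for example property["street_address"], since a plain hash does not respond to information or street_address.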

Resources