Using Nokogiri to select a block of HTML based on text?

I have the following block of HTML:
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
What I'm ultimately trying to pull is that ND16 string, but to do that, I need to select the <tr> based on the text Consignment Service Code.
I'm using Nokogiri already to parse the HTML, so it'd be great to just keep using that.
So, how can I select that block of HTML based on the text "Consignment Service Code"?

You can do this:
require 'nokogiri'
doc=Nokogiri::HTML::parse <<-eot
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
eot
node = doc.at_xpath("//*[text()='Consignment Service Code']/following-sibling::*[1]")
puts node.text
# >> ND16
Here's an additional try, which might help you to get going:
## parent node
parent_node = doc.at_xpath("//*[text()='Consignment Service Code']/..")
puts parent_node.name # => tr
## to get the child td
puts parent_node.at_xpath("./td").text # => ND16 (a leading // would search the whole document, not just this row)
puts parent_node.to_html
#<tr>
#<th>Consignment Service Code</th>
# <td>ND16</td>
#</tr>

Yet another way.
Use Nokogiri's css method to find the appropriate tr nodes and then select the ones that have the desired text in the th tag. Finally, work with the selected nodes and extract the td values:
require 'nokogiri'
str = '<tr>
<th>Consignment</th>
<td>ND15</td>
</tr>
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
<tr>
<th>Consignment Service Code</th>
<td>ND17</td>
</tr>'
doc = Nokogiri::HTML.parse(str)
nodes = doc.css('tr')
.select{|el|
el.css('th').text =~ /^Consignment Service Code$/
}
nodes.each do |el|
p el.css('td').text
end
Output is:
"ND16"
"ND17"

Related

Nokogiri move node to parent's sibling

My task is to translate an XML table to an HTML table. The problem is that the XML does not follow HTML conventions, so I am going to have to move nodes to the right place. The headers are pre-ordered instead of level-ordered, and there are table notes between the last table row and the closing table tag.
I solved the pre-order to level-order conversion issue by computing and creating the HTML using a builder and then replacing the XML table header with the HTML that I generated. But the last issue, which should be simple, has given me a mental blowout. I need to move the <TNOTE> out of the <GPOTABLE> and put it in a <div> immediately after </GPOTABLE>.
The XML data snippet is:
<P>(vi) Grinding wheels or discs for vertical single-spindle disc grinders shall be encircled with hoods to remove the dust generated in the operation. The hoods shall be connected to one or more branch pipes having exhaust volumes as shown in Table D-57.5.</P>
<GPOTABLE CDEF="s15,6,6,6,6" COLS="5" OPTS="L2">
<TTITLE>Table D-57.5—Vertical Spindle Disc Grinder</TTITLE>
<BOXHD>
<CHED H="1">Disc diameter, inches (cm)</CHED>
<CHED H="1">One-half or more of disc covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot <SU>3</SU>/min.</CHED>
<CHED H="1">Disc not covered</CHED>
<CHED H="2">Number <SU>1</SU>
</CHED>
<CHED H="2">Exhaust foot<SU>3</SU>/min.</CHED>
</BOXHD>
<ROW>
<ENT I="01">Up to 20 (50.8)</ENT>
<ENT>1</ENT>
<ENT>500</ENT>
<ENT>2</ENT>
<ENT>780</ENT>
</ROW>
<!-- ....snip .... -->
<ROW>
<ENT I="01">Over 53 to 72 (134.62 to 182.88)</ENT>
<ENT>2</ENT>
<ENT>3,140</ENT>
<ENT>5</ENT>
<ENT>6,010</ENT>
</ROW>
<TNOTE>
<SU>1</SU> Number of exhaust outlets around periphery of hood, or equal distribution provided by other means.</TNOTE>
</GPOTABLE>
<P>(vii) Grinding and polishing belts shall be provided with hoods to remove dust and dirt generated in the operations and the hoods shall be connected to branch pipes having exhaust volumes as shown in Table D-57.6.</P>
After conversion to HTML, it should look something like this:
<table cdef="s15,6,6,6,6" cols="5" opts="L2">
<caption>Table D-57.5—Vertical Spindle Disc Grinder</caption>
<tr>
<th rowspan="2" colspan="1" class="table_header">Disc diameter, inches (cm)</th>
<th rowspan="1" colspan="2" class="table_header">One-half or more of disc covered</th>
<th rowspan="1" colspan="2" class="table_header">Disc not covered</th>
</tr>
<tr>
<th rowspan="1" colspan="1" class="table_header">Number <su>1</su></th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot <su>3</su>/min.</th>
<th rowspan="1" colspan="1" class="table_header">Number <su>1</su> </th>
<th rowspan="1" colspan="1" class="table_header">Exhaust foot<su>3</su>/min.</th>
</tr>
<tr>
<td i="01">Up to 20 (50.8)</td>
<td>1</td>
<td>500</td>
<td>2</td>
<td>780</td>
</tr>
<!-- .... snip .... -->
<tr>
<td i="01">Over 53 to 72 (134.62 to 182.88)</td>
<td>2</td>
<td>3,140</td>
<td>5</td>
<td>6,010</td>
</tr>
</table>
<div class='tnote'><su>1</su> Number of exhaust outlets around periphery of hood, or equal distribution provided by other means</div>
Here's what I've got so far:
def xslt_tables(xml_text)
frag = Nokogiri::HTML(xml_text)
frag.xpath("//gpotable").each do |table|
TableConverter.new(table)
table.name = 'table'
end
frag.inner_html
end
class TableConverter
attr_accessor :data, :rows, :columns, :frag
# Expects a nokogiri object (a single <gpotable> node), not merely an html fragment
def initialize(nokogiri_fragment)
@column_index = 0
@frag = nokogiri_fragment
puts "find table size..."
find_table_size()
puts "populating the grid..."
populate_grid()
puts "computing rowspans and colspans, save in @data..."
compute_rowspans_and_colspans()
puts "assemble headers from @data"
nokogiri_headers = html_headers()
puts "replace the boxhd with nokogiri_headers, translate remaining table entities"
replace_nodes(nokogiri_headers)
end
# .... snip ....
def replace_nodes(headers)
# note: this actually changes values in the original nokogiri object!
# I'll leave it to the calling script to change the name to <table>
# @frag.xpath("//gpotable").each do |table|
# puts "renaming //gpotable"
# table.name = 'table'
# end
@frag.xpath("ttitle").each do |cap|
puts "replacing ttitle with caption"
cap.name = 'caption'
end
@frag.xpath("boxhd").each do |old|
puts "replacing boxhd with generated th with computed rowspan and colspan"
old.replace headers
end
@frag.xpath("row").each do |row|
puts "renaming row to tr"
row.name = 'tr'
end
@frag.xpath("tr/ent").each do |ent|
puts "renaming ent to td"
ent.name = 'td'
end
@frag.xpath("tnote").each do |tfoot|
puts "moving tnote"
tfoot.add_next_sibling('tnote')
end
end
end
Obviously, the last block with the tnote is wrong, but I'm stumped on how to tack those node(s) on after the end of @frag.
I'd be grateful for any nudges in the right direction; the Nokogiri tutorial and cheatsheet just don't make any sense to me.

Three hours after posting, the obvious (now that I see it) answer smacks me upside the head...
@frag.xpath("tnote").each do |tfoot|
puts "moving tnote"
tfoot.parent.add_next_sibling(tfoot).name = 'div'
end
Hope this helps someone else.

How to get <td> elements from Nokogiri element?

Here is my rb file:
require "open-uri"
require "nokogiri"
url = "http://www.languagedaily.com/learn-german/vocabulary/common-german-words"
html = URI.open(url) # on Ruby 3.0+, open-uri no longer patches Kernel#open for URLs
doc = Nokogiri::HTML(html)
row = doc.css("tr")
row.each do |cell|
puts cell
end
The separate element looks like:
<tr class="rowA">
<td class="number">49.</td>
<td class="bigLetter">kann</td>
<td>(I) can, am able to (he/she/it) can (1st- and 3-rd person singular present of "können")</td>
<td>verb</td>
</tr>
I need to get first three td's.

Just continue using css:
row.css("td").take(3).each{}

Traverse HTML with no CSS class using Nokogiri?

I've got the following HTML:
<table width="100%" border="0" cellpadding="6" cellspacing="1">
<tbody>
<tr>
<td bgcolor="#ffd204" width="40%" nowrap=""><b>Tracking Number:</b></td>
<td bgcolor="#ffffff" width="60%" nowrap="">C123456789012345</td>
</tr>
<!-- ...there could be additional table rows here... -->
<tr>
<td bgcolor="#ffd204" width="40%" nowrap=""><b>Deliver To:</b></td>
<td bgcolor="#ffffff" width="60%" nowrap="">ANYWHERE, NY</td>
</tr>
</tbody>
</table>
Say, for instance I need to pull the ANYWHERE, NY data. How would I do that using Nokogiri? Or is there something better for traversing this sort of thing where there aren't any CSS selectors to search with?

Since we don't have a CSS class, id attribute, or other semantic markup to use, we instead look for something that is likely to not change in this document to anchor our search to. In this case, I suspect that the "Deliver To:" label will always come right before the td we want. So:
require 'nokogiri'
html = File.read('page.html') # hypothetical file name; or fetch over HTTP with open-uri's URI.open
doc = Nokogiri.HTML(html)
delivery = doc.at_xpath '//td[preceding-sibling::td[b="Deliver To:"]]/text()'
p delivery.content
#=> "ANYWHERE, NY"
That XPath expression says:
// — at any level,
td — find me an element named td
[…] — but only if…
preceding-sibling:: — it has a preceding sibling
td — that is an element named td
[…] — but only if…
b — it has a child element named b
="Deliver To:" — whose text content equals this string
/text() — and then find me the child text node(s) of that td.
Because we used at_xpath instead of xpath, Nokogiri returns the first matching node it can find—which in this case happens to be the only child text node of that td—instead of an array of nodes.
In case that <td> can have markup, such as <td…>ANYWHERE,<br>NY</td> you can modify the expression to omit the trailing /text() (so that you select only the <td> itself) and then use the text method to fetch the combined visible text inside there.

Given that you don't mind some preprocessing, you could do:
lookup = {}
c = Nokogiri::HTML(open("http://..."))
c.search("tr").each do |tr|
cells = tr.search("td")
lookup[cells.first.text.gsub(':', '')] = cells.last.text
end
puts lookup["Tracking Number"]
I didn't test that code so there might be some syntax issues.

Delete Nodes from a HTML table using Nokogiri

I have been scratching my head over this for a while. Help me out before I start picking my brain.
I have an HTML document that has an events table which has 'In' and 'Out' as part of the columns. A record can either be an In or Out event. I want to only get the rows with values in the 'In' column and then save the text in an event model with the same attributes. The code below is what I have, which returns '0'.
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML <<-EOS
<table><thead><th>Reference</th><th>Event Date</th><th>Event Details</th><th>In</th><th>Out</th></thead><tbody><tr><td>BCE16</td><td>2011-08-16 11:14:52</td><td>Received from Arap Moi</td><td>30.00</td><td></td></tr><tr><td>B07K2</td><td>2011-08-16 11:10:06</td><td>Sent out to John Doe.</td><td> </td><td>-50.00</td></tr></tbody><tfoot></tfoot></table>
EOS
minus_received = doc.xpath('//td[contains(text(), "Received from")]').each do |node|
node.parent.remove
end
p minus_received.to_s
Human-readable markup:
<table>
<thead>
<th>Reference</th>
<th>Event Date</th>
<th>Event Details</th>
<th>In</th>
<th>Out</th>
</thead>
<tbody>
<tr>
<td>BCE16</td>
<td>2011-08-16 11:14:52</td>
<td>Received from Arap Moi.</td>
<td>30.00</td>
<td></td>
</tr>
<tr>
<td>B07K2</td>
<td>2011-08-16 11:10:06</td>
<td>Sent out to John Doe.</td>
<td> </td>
<td>-50.00</td>
</tr>
</tbody>
<tfoot></tfoot>
</table>
I appreciate your help.

You're printing the return value of .each rather than the modified document. If you inspect doc after the each call finishes, the HTML contains only the header and the John Doe row.

Ruby regular expression help using match to extract pieces of html doc

I have an HTML document of this format:
<tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
<tr>
<td class="sectionHeader">Contact</td>
<td class="sectionHeader">Phone</td>
<td class="sectionHeader">Home</td>
<td class="sectionHeader">Work</td>
</tr>
<tr valign="top">
<td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio#erols.com</span></td>
<td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td>
<td class="sectionContent"><span>NY</span><br><span>New York</span><br><span>78642</span></td>
<td class="sectionContent"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>
</tr>
<tr><td colspan="4"><hr class="contactSeparator"></td></tr>
<tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
<tr>
<td class="sectionHeader">Contact</td>
<td class="sectionHeader">Phone</td>
<td class="sectionHeader">Home</td>
<td class="sectionHeader">Work</td>
</tr>
<tr valign="top">
<td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span><br><span>Email 1:</span> <span>osefo#wam.umd.edu</span></td>
<td class="sectionContent"></td>
<td class="sectionContent"><span></span></td>
<td class="sectionContent"><span></span></td>
</tr>
<tr><td colspan="4"><hr class="contactSeparator"></td></tr>
So it alternates - chunk of contact info and then a "contact separator". I want to grab the contact info so my first obstacle is to grab the chunks in between the contact separator. I have already figured out the regular expression using rubular. It is:
/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/
You can check on rubular to verify that this isolates chunks.
However, my big issue is that I am having trouble with the Ruby code. I use the built-in match function and print the results, but I do not get what I expect. Here is the code:
page = agent.get uri.to_s
chunks = page.body.match(/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/).captures
chunks.each do |chunk|
puts "new chunk: " + chunk.inspect
end
Note that page.body is just the body of the html document grabbed by Mechanize. The html document is much larger but has this format. So, the unexpected output is below:
new chunk: "Bill Gussio</span></td></tr>\r\n\t<tr>\r\n\t\t<td class=\"sectionHeader\">Contact</td>\r\n\t\t<td class=\"sectionHeader\">Phone</td>\r\n\t\t<td class=\"sectionHeader\">Home</td>\r\n\t\t<td class=\"sectionHeader\">Work</td>\r\n\t</tr>\r\n\t<tr valign=\"top\">\r\n\t\t<td class=\"sectionContent\"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio#erols.com</span></td>\r\n\t\t<td class=\"sectionContent\"><span>Mobile: </span><span>2404173223</span></td>\r\n\t\t<td class=\"sectionContent\"><span>NY</span><br><span>New York</span><br><span>78642</span></td>\r\n\t\t<td class=\"sectionContent\"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>\r\n\t</tr>\r\n\t\r\n\t<tr><td colspan=\"4\">"
new chunk: ">"
There are 2 surprises here for me:
1) There are not 2 matches that contain the chunks of contact info, even though on rubular I have verified that these chunks should be extracted.
2) All of the \r\n\t (line feeds, tabs, etc.) are showing up in the matches.
Can anyone see the issue here?
Alternatively, if anyone knows of a good free AOL contacts importer, that would be great. I have been using blackbook but it keeps failing for me on AOL and I am attempting to fix it. Unfortunately, AOL has no contacts API yet.
Thank you!

See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is a bad idea. Use an HTML parser instead.

If you're just extracting information out of XML, it might be easier to use something other than regular expressions. XPath is a good tool for extracting info from XML. I believe there are some libraries available for Ruby that support XPath, maybe try REXML:
http://www.germane-software.com/software/rexml/
http://redhanded.hobix.com/inspect/noXpathOnMessyHtmlIsJustAsEasyInRuby.html

Using an HTML parser such as Hpricot will save you lots of headaches :)
sudo gem install hpricot
It's mostly written in C, so it's fast as well.
Here is how to use it:
http://wiki.github.com/why/hpricot/hpricot-basics

This is the code that parses that HTML. Feel free to suggest something better:
contacts = []
email, mobile = "",""
names = page.search("//span[@class='fullName']")
# Every contact has a fullName node, so for each fullName node, we grab the chunk of contact info
names.each do |n|
# next_sibling.next_sibling skips:
# <tr>
# <td class=\"sectionHeader\">Contact</td>
# <td class=\"sectionHeader\">Phone</td>
# <td class=\"sectionHeader\">Home</td>
# <td class=\"sectionHeader\">Work</td>
# </tr>
# to give us the actual chunk of contact information
# then taking the children of that chunk gives us rows of contact info
contact_info_rows = n.parent.parent.next_sibling.next_sibling.children
# Iterate through the rows of contact info
contact_info_rows.each do |row|
# Iterate through the contact info in each row
row.children.each do |info|
# Get Email. There are two ".next_siblings" because space after "Email 1" element is processed as a sibling
if info.content.strip == "Email 1:" then email = info.next_sibling.next_sibling.content.strip end
# If the contact info has a screen name but no email, use screenname@aol.com
if (info.content.strip == "Screen Name:" && email == "") then email = info.next_sibling.next_sibling.content.strip + "@aol.com" end
# Get Mobile #'s
if info.content.strip == "Mobile:" then mobile = info.next_sibling.content.strip end
# Maybe we can try and get zips later. Right now the zip field can look like the street address field
# so we can not tell the difference. There is no label node
#zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip)
#zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)
end
end
contacts << { :name => n.content, :email => email, :mobile => mobile }
# clear variables
email, mobile = "", ""
end
