Here is my rb file:
require "open-uri"
require "nokogiri"
url = "http://www.languagedaily.com/learn-german/vocabulary/common-german-words"
html = URI.open(url) # open-uri; plain open(url) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(html)
row = doc.css("tr")
row.each do |cell|
  puts cell
end
The separate element looks like:
<tr class="rowA">
<td class="number">49.</td>
<td class="bigLetter">kann</td>
<td>(I) can, am able to (he/she/it) can (1st- and 3-rd person singular present of "können")</td>
<td>verb</td>
</tr>
I need to get the first three td's of each row.
Just keep using css, this time on each row inside your loop:
cell.css("td").take(3).each { |td| puts td.text }
I wish to run a test* to check for presence of something like this:
<tr>
<td> name </td> <td> date </td>
</tr>
Code, such as below:
assert_select "tr" do
assert_select "td", name
assert_select "td", date
end
looks plausible, but is not correct, as the below for example (which is not the match required) would also pass:
<tr>
<td> name </td>
</tr>
<tr>
<td> date </td>
</tr>
I’m struggling to see how this should be approached from the documentation of assert_select.
Thank you
Daniel
* Within a default Rails integration test (I believe this means Minitest).
I ended up using a regex on the result of css_select, which seems a bit inelegant but worked for my purposes. If there is a better way, I'd be interested to hear it. I used something like this:
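pars = css_select "tr"
# Escape the interpolated values so regex metacharacters in them are matched literally
regexs = /<td>#{Regexp.escape(name.to_s)}<\/td>.*<td>#{Regexp.escape(date.to_s)}<\/td>/m
# Passes only if a single <tr> contains both cells, in this order
match = pars.any? { |i| i.to_s =~ regexs }
assert match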
I have the following block of HTML:
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
What I'm ultimately trying to pull is that ND16 string, but to do that, I need to select the <tr> based on the text Consignment Service Code.
I'm using Nokogiri already to parse the HTML, so it'd be great to just keep using that.
So, how can I select that block of HTML based on the text "Consignment Service Code"?
You can do this:
require 'nokogiri'
doc=Nokogiri::HTML::parse <<-eot
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
eot
node = doc.at_xpath("//*[text()='Consignment Service Code']/following-sibling::*[1]")
puts node.text
# >> ND16
Here's an additional try, which might help you to get going:
## parent node
parent_node = doc.at_xpath("//*[text()='Consignment Service Code']/..")
puts parent_node.name # => tr
## to get the child td
puts parent_node.at_xpath("./td").text # => ND16 (a leading "//" would search the whole document)
puts parent_node.to_html
#<tr>
#<th>Consignment Service Code</th>
# <td>ND16</td>
#</tr>
Yet another way.
Use Nokogiri's css method to find the appropriate tr nodes and then select the ones that have the desired text in the th tag. Finally, work with the selected nodes and extract the td values:
require 'nokogiri'
str = '<tr>
<th>Consignment</th>
<td>ND15</td>
</tr>
<tr>
<th>Consignment Service Code</th>
<td>ND16</td>
</tr>
<tr>
<th>Consignment Service Code</th>
<td>ND17</td>
</tr>'
doc = Nokogiri::HTML.parse(str)
nodes = doc.css('tr').select { |el|
  el.css('th').text =~ /^Consignment Service Code$/
}
nodes.each do |el|
  p el.css('td').text
end
Output is:
"ND16"
"ND17"
I'm trying to parse labels out of a table with Nokogiri, where a single td cell can contain more than one label:
<tr class="alt2">
<td class="company">ABB Shanghai Transformer Co., Ltd.</td>
<td class="contactperson">Mr. Frank Liang<br/></td>
<td class="businesscategory">
<label><code>C27.11 </code>Manufacture of electric motors, generators and transformers</label>
<label><code>C27.33 </code>Manufacture of wiring devices</label>
</td>
</tr>
So what I've done now is this:
doc.css("tbody tr").each do |company|
new = GermanSubsidiary.new
new.name = company.at_css(".company").text
new.contact = company.at_css(".contactperson").text
company.at_css(".businesscategory label").each do |category|
new_class = BusinessClassification.create
new_class.code = category.at_css("code").text
new_class.name = category.text
end
end
Unfortunately, company.at_css(".businesscategory label").each do |category| is not working, because at_css doesn't work for arrays... does it?
How can I parse deeper into the structure? Since the table has multiple rows, I have to know which row I am in, so I can't run a single XPath query over the whole document.
Thanks Markus
.at_css('.businesscategory label') only returns the first matching node. Use .css('.businesscategory label') to get all the matching nodes.
This XML:
xml = <<-XML
<tbody>
<tr class="alt2">
<td class="company">ABB Shanghai Transformer Co., Ltd.</td>
<td class="contactperson">Mr. Frank Liang<br/></td>
<td class="businesscategory">
<label><code>C27.11 </code>Manufacture of electric motors, generators and transformers</label>
<label><code>C27.33 </code>Manufacture of wiring devices</label>
</td>
</tr>
</tbody>
XML
and this script
require 'rubygems'
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML.fragment(xml)
puts "with at_css example:"
doc.css("tbody tr").each do |company|
company.at_css(".businesscategory label").each do |category|
puts category.at_css("code").text
puts category.text
end
end
puts "\n\nwith css"
doc.css("tbody tr").each do |company|
company.css(".businesscategory label").each do |category|
puts category.at_css("code").text
puts category.text
end
end
prints this result
with at_css example:
with css
C27.11
C27.11 Manufacture of electric motors, generators and transformers
C27.33
C27.33 Manufacture of wiring devices
So, as you can see, using .css instead of .at_css will solve your issue.
Be careful with .at_css('.businesscategory').children, though: it also yields the whitespace text nodes between the labels:
puts "\n\nwith at_css().children"
doc.css("tbody tr").each do |company|
company.at_css(".businesscategory").children.each do |category|
puts category.text.inspect
end
end
prints
with at_css().children
"\n "
"C27.11 Manufacture of electric motors, generators and transformers"
"\n "
"C27.33 Manufacture of wiring devices"
"\n "
I have been scratching my head over this for a while. Help me out before I start picking my brain.
I have an HTML document with an events table that has 'In' and 'Out' among its columns. A record can be either an In or an Out event. I want to get only the rows with a value in the 'In' column and then save the text in an event model with the same attributes. The code below is what I have, and it returns '0'.
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML <<-EOS
<table><thead><th>Reference</th><th>Event Date</th><th>Event Details</th><th>In</th><th>Out</th></thead><tbody><tr><td>BCE16</td><td>2011-08-16 11:14:52</td><td>Received from Arap Moi</td><td>30.00</td><td></td></tr><tr><td>B07K2</td><td>2011-08-16 11:10:06</td><td>Sent out to John Doe.</td><td> </td><td>-50.00</td></tr></tbody><tfoot></tfoot></table>
EOS
minus_received = doc.xpath('//td[contains(text(), "Received from")]').each do |node|
node.parent.remove
end
p minus_received.to_s
Human Readable markup
<table>
<thead>
<th>Reference</th>
<th>Event Date</th>
<th>Event Details</th>
<th>In</th>
<th>Out</th>
</thead>
<tbody>
<tr>
<td>BCE16</td>
<td>2011-08-16 11:14:52</td>
<td>Received from Arap Moi.</td>
<td>30.00</td>
<td></td>
</tr>
<tr>
<td>B07K2</td>
<td>2011-08-16 11:10:06</td>
<td>Sent out to John Doe.</td>
<td> </td>
<td>-50.00</td>
</tr>
</tbody>
<tfoot></tfoot>
</table>
I appreciate your help.
You're outputting the value of .each - if you look at doc after your each call finishes, the html only contains the header and John Doe.
I have an HTML document of this format:
<tr><td colspan="4"><span class="fullName">Bill Gussio</span></td></tr>
<tr>
<td class="sectionHeader">Contact</td>
<td class="sectionHeader">Phone</td>
<td class="sectionHeader">Home</td>
<td class="sectionHeader">Work</td>
</tr>
<tr valign="top">
<td class="sectionContent"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio#erols.com</span></td>
<td class="sectionContent"><span>Mobile: </span><span>2404173223</span></td>
<td class="sectionContent"><span>NY</span><br><span>New York</span><br><span>78642</span></td>
<td class="sectionContent"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>
</tr>
<tr><td colspan="4"><hr class="contactSeparator"></td></tr>
<tr><td colspan="4"><span class="fullName">Eddie Osefo</span></td></tr>
<tr>
<td class="sectionHeader">Contact</td>
<td class="sectionHeader">Phone</td>
<td class="sectionHeader">Home</td>
<td class="sectionHeader">Work</td>
</tr>
<tr valign="top">
<td class="sectionContent"><span>Screen Name:</span> <span>eddieOS</span><br><span>Email 1:</span> <span>osefo#wam.umd.edu</span></td>
<td class="sectionContent"></td>
<td class="sectionContent"><span></span></td>
<td class="sectionContent"><span></span></td>
</tr>
<tr><td colspan="4"><hr class="contactSeparator"></td></tr>
So it alternates - chunk of contact info and then a "contact separator". I want to grab the contact info so my first obstacle is to grab the chunks in between the contact separator. I have already figured out the regular expression using rubular. It is:
/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/
You can check on rubular to verify that this isolates chunks.
However my big issue is that I am having trouble with the ruby code. I use the built in match function and make prints, but do not get the results I expect. Here is the code:
page = agent.get uri.to_s
chunks = page.body.match(/<tr><td colspan="4"><span class="fullName">((.|\s)*?)<hr class="contactSeparator">/).captures
chunks.each do |chunk|
puts "new chunk: " + chunk.inspect
end
Note that page.body is just the body of the html document grabbed by Mechanize. The html document is much larger but has this format. So, the unexpected output is below:
new chunk: "Bill Gussio</span></td></tr>\r\n\t<tr>\r\n\t\t<td class=\"sectionHeader\">Contact</td>\r\n\t\t<td class=\"sectionHeader\">Phone</td>\r\n\t\t<td class=\"sectionHeader\">Home</td>\r\n\t\t<td class=\"sectionHeader\">Work</td>\r\n\t</tr>\r\n\t<tr valign=\"top\">\r\n\t\t<td class=\"sectionContent\"><span>Screen Name:</span> <span>bhjiggy</span><br><span>Email 1:</span> <span>wmgussio#erols.com</span></td>\r\n\t\t<td class=\"sectionContent\"><span>Mobile: </span><span>2404173223</span></td>\r\n\t\t<td class=\"sectionContent\"><span>NY</span><br><span>New York</span><br><span>78642</span></td>\r\n\t\t<td class=\"sectionContent\"><span>MD</span><br><span>Owings Mills</span><br><span>21093</span></td>\r\n\t</tr>\r\n\t\r\n\t<tr><td colspan=\"4\">"
new chunk: ">"
There are 2 surprises here for me:
1) There are not 2 matches that contain the chunks of contact info, even though on rubular I have verified that these chunks should be extracted.
2) All of the \r\n\t (line feeds, tabs, etc.) are showing up in the matches.
Can anyone see the issue here?
Alternatively, if anyone knows of a good free AOL contacts importer, that would be great. I have been using blackbook but it keeps failing for me on AOL and I am attempting to fix it. Unfortunately, AOL has no contacts API yet.
Thank you!
See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why this is a bad idea. Use an HTML parser instead.
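As for the two surprises in the question: match only ever returns the first match, and captures returns that one match's groups, which is why the second "chunk" is just the last character swallowed by the inner (.|\s) group. The \r\n\t characters are simply part of the page body. A tiny sketch with a made-up string (not the AOL page) shows the difference between match and scan:

```ruby
html = "<b>one</b>\r\n<b>two</b>"

# match stops at the first hit; captures holds that hit's groups,
# including the last character consumed by the inner (.|\s) group.
first_only = html.match(/<b>((.|\s)*?)<\/b>/).captures

# scan walks the whole string and returns every hit; /m lets . span newlines.
all_hits = html.scan(/<b>(.*?)<\/b>/m).flatten

p first_only
p all_hits
```

So scan would return both chunks, although an HTML parser is still the more robust choice for this page.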
If you're just extracting information out of XML, it might be easier to use something other than regular expressions. XPath is a good tool for extracting info from XML. I believe there are some libraries available for Ruby that support XPath, maybe try REXML:
http://www.germane-software.com/software/rexml/
http://redhanded.hobix.com/inspect/noXpathOnMessyHtmlIsJustAsEasyInRuby.html
Using an HTML parser such as Hpricot will save you lots of headaches :)
sudo gem install hpricot
It's mostly written in C, so it's fast as well.
Here is how to use it:
http://wiki.github.com/why/hpricot/hpricot-basics
This is the code that parses that HTML. Feel free to suggest something better:
contacts = []
email, mobile = "",""
names = page.search("//span[@class='fullName']")
# Every contact has a fullName node, so for each fullName node, we grab the chunk of contact info
names.each do |n|
# next_sibling.next_sibling skips:
# <tr>
# <td class=\"sectionHeader\">Contact</td>
# <td class=\"sectionHeader\">Phone</td>
# <td class=\"sectionHeader\">Home</td>
# <td class=\"sectionHeader\">Work</td>
# </tr>
# to give us the actual chunk of contact information
# then taking the children of that chunk gives us rows of contact info
contact_info_rows = n.parent.parent.next_sibling.next_sibling.children
# Iterate through the rows of contact info
contact_info_rows.each do |row|
# Iterate through the contact info in each row
row.children.each do |info|
# Get Email. There are two ".next_siblings" because space after "Email 1" element is processed as a sibling
if info.content.strip == "Email 1:" then email = info.next_sibling.next_sibling.content.strip end
# If the contact info has a screen name but no email, use screenname@aol.com
if (info.content.strip == "Screen Name:" && email == "") then email = info.next_sibling.next_sibling.content.strip + "@aol.com" end
# Get Mobile #'s
if info.content.strip == "Mobile:" then mobile = info.next_sibling.content.strip end
# Maybe we can try and get zips later. Right now the zip field can look like the street address field
# so we can not tell the difference. There is no label node
#zip_match = /\A\D*(\d{5})-?\d{4}\D*\z/i.match(info.content.strip)
#zip_match = /\A\D*(\d{5})[^\d-]*\z/i.match(info.content.strip)
end
end
contacts << { :name => n.content, :email => email, :mobile => mobile }
# clear variables
email, mobile = "", ""
end