Nokogiri::XML parse value based on another XML attribute's value - ruby-on-rails

Using Nokogiri::XML, how can I retrieve an attribute's value based on another attribute?
XML file:
<RateReplyDetails>
<ServiceType>INT</ServiceType>
<Price>1.0</Price>
</RateReplyDetails>
<RateReplyDetails>
<ServiceType>LOCAL</ServiceType>
<Price>2.0</Price>
</RateReplyDetails>
I would like to retrieve the Price of the LOCAL ServiceType, which is 2.0.
I could take the value without any condition with this:
rated_shipment.at('RateReplyDetails/Price').text
And probably I could do something like:
if rated_shipment.at('RateReplyDetails/ServiceType').text == "LOCAL"
rated_shipment.at('RateReplyDetails/Price').text
end
But is there any elegant and clean way of doing so?

I'd do something like:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<RateReplyDetails>
<ServiceType>INT</ServiceType>
<Price>1.0</Price>
</RateReplyDetails>
<RateReplyDetails>
<ServiceType>LOCAL</ServiceType>
<Price>2.0</Price>
</RateReplyDetails>
</xml>
EOT
service_type = doc.at('//RateReplyDetails/*[text() = "LOCAL"]')
service_type.name # => "ServiceType"
'//RateReplyDetails/*[text() = "LOCAL"]' is an XPath selector that looks for a child of a <RateReplyDetails> node whose text node equals "LOCAL" and returns that child, which is the <ServiceType> node.
service_type.next_element.text # => "2.0"
Once we've found that it's easy to look at the next element and get its text.

Try this, where content is the XML content string:
doc = Nokogiri::HTML(content)
doc.at('servicetype:contains("INT")').next_element.content
# => "1.0"
doc.at('servicetype:contains("LOCAL")').next_element.content
# => "2.0"
I have tested it, and it works.

Fully in XPath:
rated_shipment.at('//RateReplyDetails[ServiceType="LOCAL"]/Price/text()').to_s
# => "2.0"
EDIT, in reply to the comment "it didn't work for me":
Full code as proof that it does work:
#!/usr/bin/env ruby
require 'nokogiri'
rated_shipment = Nokogiri::XML(DATA)
puts rated_shipment.at('//RateReplyDetails[ServiceType="LOCAL"]/Price/text()').to_s
__END__
<xml>
<RateReplyDetails>
<ServiceType>INT</ServiceType>
<Price>1.0</Price>
</RateReplyDetails>
<RateReplyDetails>
<ServiceType>LOCAL</ServiceType>
<Price>2.0</Price>
</RateReplyDetails>
</xml>
(This outputs 2.0.) If it does not work for you, then your file contents do not match the XML in your question.

Related

Declaring XML Tags in Ruby

I am using Ruby to pull information from an Excel sheet and, with this information, produce an XML file. I need to produce this in Ruby:
What I want:
<Betrag waehrung="EUR">150000</Betrag>
What I have:
<Betrag waehrung ="EUR"/>
I am currently trying xml.Betrag "Waehrung": "Eur".
The Betrag value comes from "#{row[13]}", which is where it can be found in the Excel sheet I am using. I have tried xml.Betrag "Waehrung": ("Eur"), ("#{row[13]}") with no success. Could you please advise?
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
xml.Betrag(waehrung: 'EUR') do |e|
e << '150000'
end
end
puts builder.to_xml
=>
<?xml version="1.0"?>
<Betrag waehrung="EUR">150000</Betrag>

Is there a way of iterating through a specific XML tag in Ruby?

Is it possible to iterate over a specific XML tag in Ruby? In my case I want to iterate over the desc tag in the following XML code:
<desc>
<id>2408</id>
<who name="Joe Silva">joe#silva.com</who>
<when>Today</when>
<thetext>Hello World</thetext>
</desc>
<desc>
<id>2409</id>
<who name="Joe Silva2">joe2#silva.com</who>
<when>Future</when>
<thetext>Hello World Again</thetext>
</desc>
So far, here is the code I use:
xml_doc = agent.get("www.somewhere.com/file.xml")
document = REXML::Document.new(xml_doc.body);
# iterate over desc here
I want to iterate over each desc tag so that I get the following output:
commentid : 2408
name : Joe Silva
who : joe#silva.com
bug_when : Today
thetext : Hello World
commentid : 2409
name : Joe Silva2
who : joe2#silva.com
bug_when : Future
thetext : Hello World Again
Any suggestions?
Nokogiri example that includes the name attribute for the who node:
require 'nokogiri'
doc = Nokogiri.XML '
<root>
<desc>
<id>2408</id>
<who name="Joe Silva">joe#silva.com</who>
<when>Today</when>
<thetext>Hello World</thetext>
</desc>
<desc>
<id>2409</id>
<who name="Joe Silva2">joe2#silva.com</who>
<when>Future</when>
<thetext>Hello World Again</thetext>
</desc>
</root>
'
doc.css("desc").each do |desc|
puts "commentid : #{desc.css("id").text}"
puts "name : #{desc.css("who").attribute("name")}"
puts "who : #{desc.css("who").text}"
puts "bug_when : #{desc.css("when").text}"
puts "thetext : #{desc.css("thetext").text}"
end
I'd also recommend using the Nokogiri gem. Something like this ought to work:
require 'open-uri'
require 'nokogiri'
# fetch and parse the document
doc = Nokogiri::XML(URI.open('http://www.somewhere.com/file.xml'))
# search with css selectors
puts doc.at('desc id').text
# search by xpath
puts doc.at_xpath('//desc/id').text
# to iterate over a specific tag
doc.css('desc').each do |tag|
puts tag.css('id').text
# ...
end
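Since the question started from REXML, here is a hedged sketch of the same iteration without Nokogiri. It assumes the desc elements sit under a single root element (called root here, as in the Nokogiri example above) and collects each comment into a hash before printing it.

```ruby
require 'rexml/document'

# Assumed sample input; the <root> wrapper is not in the original question.
xml = <<~XML
  <root>
    <desc>
      <id>2408</id>
      <who name="Joe Silva">joe#silva.com</who>
      <when>Today</when>
      <thetext>Hello World</thetext>
    </desc>
    <desc>
      <id>2409</id>
      <who name="Joe Silva2">joe2#silva.com</who>
      <when>Future</when>
      <thetext>Hello World Again</thetext>
    </desc>
  </root>
XML

document = REXML::Document.new(xml)

# Build one hash per <desc> element, keyed by the labels from the question.
comments = document.get_elements('//desc').map do |desc|
  who = desc.elements['who']
  {
    'commentid' => desc.elements['id'].text,
    'name'      => who.attributes['name'],
    'who'       => who.text,
    'bug_when'  => desc.elements['when'].text,
    'thetext'   => desc.elements['thetext'].text
  }
end

comments.each do |comment|
  comment.each { |label, value| puts "#{label} : #{value}" }
end
```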

Find within the first 10?

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <div> elements that I want to retrieve. For example, I may want the first 10 tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title']
end
However, when fewer than 10 matching div tags are returned, it reports:
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it returns nil.
How can I make it return all the available data within 10? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
end
Returned results (near the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but first(30) returns a 30-element array.
And it actually is not an array, but a Nokogiri::XML::NodeSet? I'm not sure.
title = item.css("a")[0]['title']
is bad practice.
Instead, consider using at or at_css rather than search or css:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title attribute, Nokogiri and/or Ruby will be upset because the title variable will be nil. Instead, tighten your CSS selector so it only matches tags like <a title="foo">:
require 'nokogiri'
doc = Nokogiri::HTML('<body><a href="foo">foo</a><a href="bar" title="bar">bar</a></body>')
doc.at('a').to_html # => "<a href=\"foo\">foo</a>"
doc.at('a[title]').to_html # => "<a href=\"bar\" title=\"bar\">bar</a>"
Notice how the first call, which is not constrained to tags with a title attribute, returns the first <a> tag. Using a[title] only returns tags that have a title attribute.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search, css and xpath are equivalent, except that the first is generic and can take either CSS or XPath, while the last two are specific to that language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by doing a simple search returns only one node, and what the node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Calling first(n) on this NodeSet returns n elements. If that many aren't found, Nokogiri (at least in the version used here) fills in the rest with nil values.
This runs counter to what we'd assume first(n) does, since Enumerable#first returns up to n elements and won't pad with nils. It isn't a bug, but it is unexpected, because Enumerable#first sets the expectation for methods with that name. This is NodeSet#first, though, not Enumerable#first, so it does what it does until the Nokogiri authors change it. (You can see why it happens if you look at the source for that particular method.)
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(URI.open(URL))
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchor|
anchor['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874
Try this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title'] rescue nil
end
Let me know whether it works. It should no longer raise an error.
Try compact:
[1, nil, 2, nil, 3].compact # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(i.e. first(fetch_number).compact.each do |item|)

REXML::Document.new take a simple string as good doc?

I would like to check if the xml is valid. So, here is my code
require 'rexml/document'
begin
def valid_xml?(xml)
REXML::Document.new(xml)
rescue REXML::ParseException
return nil
end
bad_xml_2=%{aasdasdasd}
if(valid_xml?(bad_xml_2) == nil)
puts("bad xml")
raise "bad xml"
end
puts("good_xml")
rescue Exception => e
puts("exception" + e.message)
end
and it prints good_xml as the result. Did I do something wrong? It does print "bad xml" if the string is:
bad_xml = %{
<tasks>
<pending>
<entry>Grocery Shopping</entry>
<done>
<entry>Dry Cleaning</entry>
</tasks>}
Personally, I'd recommend using Nokogiri, as it's the de facto standard for XML/HTML parsing in Ruby. Using it to parse a malformed document:
require 'nokogiri'
doc = Nokogiri::XML('<xml><foo><bar></xml>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: bar line 1 and xml>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag xml line 1>]
If I parse a document that is well-formed:
doc = Nokogiri::XML('<xml><foo/><bar/></xml>')
doc.errors # => []
REXML treats a simple string as valid XML with no root node:
xml = REXML::Document.new('aasdasdasd')
# => <UNDEFINED> ... </>
It does not, however, accept malformed XML (with mismatched tags, for example); it raises an exception instead:
REXML::Document.new(bad_xml)
# REXML::ParseException: #<REXML::ParseException: Missing end tag for 'done' (got "tasks")
Your bad_xml is missing the end tag for <done>, so it is not well-formed.
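To stay with REXML and still reject a bare string, one option is to also require a root element. This is a minimal sketch that reuses the question's valid_xml? name but returns true or false rather than a document or nil.

```ruby
require 'rexml/document'

# Treats the input as valid only if it parses and has a root element.
def valid_xml?(xml)
  !REXML::Document.new(xml).root.nil?
rescue REXML::ParseException
  false
end

bad_xml = %{
<tasks>
<pending>
<entry>Grocery Shopping</entry>
<done>
<entry>Dry Cleaning</entry>
</tasks>}

puts valid_xml?('aasdasdasd')
puts valid_xml?(bad_xml)
puts valid_xml?('<tasks><entry>Dry Cleaning</entry></tasks>')
```

The root check covers the case where the parser accepts a plain string, and the rescue covers the case where it raises.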

Reading xml file using REXML, says <UNDEFINED> ... </>

I have a very simple xml file that I am trying to access:
<article>
<text>hello world</text>
</article>
I'm doing this so far:
file = File.open("#{Rails.root}/public/files/#{file_id}.xml", "r")
xml = file.read
doc = REXML::Document.new(xml)
When I run this code in rails console, I see:
1.9.3-p194 :033 > doc.inspect
=> "<UNDEFINED> ... </>"
I can't seem to understand why it is not loading the file correctly, I can't access the text xml element either.
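It is loading correctly. The "<UNDEFINED> ... </>" output is just how a REXML::Document inspects itself, because the document object has no tag name of its own; the <article> element is still there: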
require "rexml/document"
doc = REXML::Document.new DATA.read
doc.root_node # => <UNDEFINED> ... </>
doc.inspect # => "<UNDEFINED> ... </>"
doc.to_s # => "<article>\n <text>hello world</text>\n</article>\n"
doc.get_elements('//article') # => [<article> ... </>]
doc.get_elements('//text') # => [<text> ... </>]
__END__
<article>
<text>hello world</text>
</article>
By the way, I think the Ruby community has pretty much universally endorsed Nokogiri for XML parsing.
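To reach the text element itself with REXML, something like the following should work. The xml string is an assumed stand-in for the file contents read from disk.

```ruby
require 'rexml/document'

# Assumed stand-in for the contents of the XML file.
xml = "<article>\n  <text>hello world</text>\n</article>\n"
doc = REXML::Document.new(xml)

# The document's root element is <article>, even though doc.inspect
# prints "<UNDEFINED> ... </>".
root_name = doc.root.name

# get_elements returns an array of matching elements; take the first one.
text = doc.get_elements('//text').first.text

puts root_name
puts text
```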
