Preventing Nokogiri from escaping characters in URLs - nokogiri

Nokogiri("<a href='*|UNSUB|*'>unsubscribe</a>").to_html
# returns
"<a href=\"*%7CUNSUB%7C*\">unsubscribe</a>"
How can I get Nokogiri to not escape the pipes?

require 'nokogiri'
doc = Nokogiri("<a href='*|UNSUB|*'>unsubscribe</a>")
puts doc.to_html
#=> <a href="*%7CUNSUB%7C*">unsubscribe</a>
puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <a href="*|UNSUB|*">unsubscribe</a>
Alternatively:
puts doc.to_html.gsub('%7C','|')
#=> <a href="*|UNSUB|*">unsubscribe</a>
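The gsub above replaces every %7C in the output, including any that legitimately belong to other URLs. A narrower sketch, assuming the merge tags always look like *|NAME|* (an assumption about the template format; restore_merge_tags and MERGE_TAG are hypothetical names, not part of Nokogiri), restores the pipes only inside that pattern:

```ruby
# Hypothetical helper: undo the %7C escaping only inside *|TAG|* merge tags,
# leaving any other percent-encoded pipes in the document untouched.
MERGE_TAG = /\*%7C(\w+)%7C\*/i

def restore_merge_tags(html)
  html.gsub(MERGE_TAG) { "*|#{Regexp.last_match(1)}|*" }
end

escaped = '<a href="*%7CUNSUB%7C*">unsubscribe</a> <a href="/q?a=1%7C2">other</a>'
puts restore_merge_tags(escaped)
```

It would be applied to the string returned by to_html, in place of the plain gsub.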

Related

Declaring XML Tags in Ruby

I am using Ruby to pull information from an Excel sheet and produce an XML file from it. I need to produce this in Ruby:
What I want:
<Betrag waehrung="EUR">150000</Betrag>
What I have:
<Betrag waehrung ="EUR"/>
I am currently trying xml.Betrag "Waehrung": "Eur"
The value for Betrag comes from "#{row[13]}", which is where it can be found in the Excel sheet I am using. I have tried xml.Betrag "Waehrung": ("Eur"), ("#{row[13]}") with no success. Could you please advise?
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
  xml.Betrag(waehrung: 'EUR') do |e|
    e << '150000'
  end
end
puts builder.to_xml
=>
<?xml version="1.0"?>
<Betrag waehrung="EUR">150000</Betrag>

Find within the first 10?

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <divs> that I want to retrieve. For example, I may want the first(10) tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
  title = item.css("a")[0]['title']
end
However, when fewer than 10 matching div tags are returned, it reports
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it returns nil.
How can I make it return all the available data, up to 10 items? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
  doc = Nokogiri::HTML(open(url) )
  puts doc.css(".main-wrap .item").count
  doc.css(".main-wrap .item").first(30).each do |item_info|
    if item_info
      href = item_info.at(".detail a")['href']
      puts href
    else
      puts 'this is empty'
    end
  end
end
Returned results (near the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but first(30) returns a collection of 30.
And it is actually not an array, but a Nokogiri::XML::NodeSet? I'm not sure.
title = item.css("a")[0]['title']
is bad practice.
Instead, consider using at or at_css rather than search or css when you only want the first match:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title attribute, the title variable will be nil, which is likely to cause trouble later. Instead, improve your CSS selector to only allow matches like <a title="foo">:
require 'nokogiri'
doc = Nokogiri::HTML('<body><a>foo</a><a title="bar">bar</a></body>')
doc.at('a').to_html # => "<a>foo</a>"
doc.at('a[title]').to_html # => "<a title=\"bar\">bar</a>"
Notice how the first, which is not constrained to look for tags with a title attribute, returns the first <a> tag. Using a[title] will only return ones with a title attribute.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
  if item_info
    href = item_info.at(".detail a")['href']
    puts href
  else
    puts 'this is empty'
  end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search, css and xpath are equivalent, except that the first is generic and can take either CSS or XPath, while the last two are specific to that language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by a simple search contains only one node, and what that node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Calling first(n) on the NodeSet returns exactly n elements. If that many aren't found, the missing ones are filled in with nil values, at least in the Nokogiri version used here.
This runs counter to what we'd assume first(n) to do, since Enumerable#first returns up to n elements and won't pad with nils. It isn't a bug as such, but it is unexpected, because Enumerable's first sets the expectation for methods with that name. This is NodeSet#first, though, not Enumerable#first, so it does what it does until the Nokogiri authors change it. (You can see why it happens if you look at the source for that particular method.)
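To see how an index-based first(n) can end up padding with nil, here is a small stand-in written in plain Ruby. TinySet, padded_first and bounded_first are hypothetical names used only for illustration. This is not Nokogiri's source, just a sketch of the kind of loop that produces the behavior described, next to a version that stops at the collection's real size:

```ruby
# Hypothetical stand-in for a node collection; not Nokogiri's implementation.
class TinySet
  def initialize(items)
    @items = items
  end

  def [](index)
    @items[index] # nil when index is past the end
  end

  def length
    @items.length
  end

  # Asks for exactly n slots by index, so missing slots come back as nil.
  def padded_first(n)
    (0...n).map { |i| self[i] }
  end

  # Stops at the real length, matching what Enumerable#first(n) would give.
  def bounded_first(n)
    (0...[n, length].min).map { |i| self[i] }
  end
end

set = TinySet.new(%w[foo])
p set.padded_first(2)  # => ["foo", nil]
p set.bounded_first(2) # => ["foo"]
```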
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
# Kernel#open no longer opens URLs on Ruby 3.0+; open-uri provides URI.open.
doc = Nokogiri::HTML(URI.open(URL))
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchor|
  anchor['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874
Try this
doc.css(".tweet").first(fetch_number).each do |item|
  title = item.css("a")[0]['title'] rescue nil
end
Let me know whether it works. It will not raise the error.
Try compact.
[1, nil, 2, nil, 3].compact # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(ie: first(fetch_number).compact.each do |item|)

REXML::Document.new take a simple string as good doc?

I would like to check whether a piece of XML is valid. Here is my code:
require 'rexml/document'
begin
  def valid_xml?(xml)
    REXML::Document.new(xml)
  rescue REXML::ParseException
    return nil
  end
  bad_xml_2=%{aasdasdasd}
  if(valid_xml?(bad_xml_2) == nil)
    puts("bad xml")
    raise "bad xml"
  end
  puts("good_xml")
rescue Exception => e
  puts("exception" + e.message)
end
and it prints good_xml as the result. Did I do something wrong? It does report bad xml if the string is
bad_xml = %{
<tasks>
<pending>
<entry>Grocery Shopping</entry>
<done>
<entry>Dry Cleaning</entry>
</tasks>}
Personally, I'd recommend using Nokogiri, as it's the de facto standard for XML/HTML parsing in Ruby. Using it to parse a malformed document:
require 'nokogiri'
doc = Nokogiri::XML('<xml><foo><bar></xml>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: bar line 1 and xml>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag xml line 1>]
If I parse a document that is well-formed:
doc = Nokogiri::XML('<xml><foo/><bar/></xml>')
doc.errors # => []
REXML treats a simple string as valid XML with no root node:
xml = REXML::Document.new('aasdasdasd')
# => <UNDEFINED> ... </>
It does not, however, treat malformed XML (with mismatched tags, for example) as valid, and raises an exception:
REXML::Document.new(bad_xml)
# REXML::ParseException: #<REXML::ParseException: Missing end tag for 'done' (got "tasks")
It is missing the end tag for <done>, so it is not valid.
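Putting the two observations together, a stricter check could require both that parsing succeeds and that the document has a root element. This is only a sketch, assuming "valid" here means "well-formed with a root element". strict_valid_xml? is a hypothetical name, and how REXML treats bare text may differ between versions, which is why both conditions are checked:

```ruby
require 'rexml/document'

# Hypothetical helper: true only if REXML parses the string without error
# AND the resulting document has a root element.
def strict_valid_xml?(xml)
  doc = REXML::Document.new(xml)
  !doc.root.nil?
rescue REXML::ParseException
  false
end

puts strict_valid_xml?('aasdasdasd')               # bare text, no root element
puts strict_valid_xml?('<tasks><pending></tasks>') # mismatched tags
puts strict_valid_xml?('<tasks><entry>Dry Cleaning</entry></tasks>')
```

Returning true/false rather than a document or nil also removes the need for the == nil comparison in the calling code.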

What can I use to generate a local XML file?

I have a project that I am working on, and I do not know much about Rails or Ruby.
I need to generate an XML file from user input.
Can someone direct me to a resource that shows how to do this quickly and easily?
The Nokogiri gem has a nice interface for creating XML from scratch. It's powerful while still easy to use. It's my preference:
require 'nokogiri'
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    xml.products {
      xml.widget {
        xml.id_ "10"
        xml.name "Awesome widget"
      }
    }
  }
end
puts builder.to_xml
Will output:
<?xml version="1.0"?>
<root>
  <products>
    <widget>
      <id>10</id>
      <name>Awesome widget</name>
    </widget>
  </products>
</root>
Ox can do this too. Here's a sample from the documentation:
require 'ox'
doc = Ox::Document.new(:version => '1.0')
top = Ox::Element.new('top')
top[:name] = 'sample'
doc << top
mid = Ox::Element.new('middle')
mid[:name] = 'second'
top << mid
bot = Ox::Element.new('bottom')
bot[:name] = 'third'
mid << bot
xml = Ox.dump(doc)
# xml =
# <top name="sample">
# <middle name="second">
# <bottom name="third"/>
# </middle>
# </top>
Nokogiri is a wrapper around libxml2.
Gemfile
gem 'nokogiri'
To generate XML, simply use the Nokogiri XML Builder like this:
xml = Nokogiri::XML::Builder.new { |xml|
  xml.body do
    xml.node1 "some string"
    xml.node2 123
    xml.node3 do
      xml.node3_1 "another string"
    end
    xml.node4 "with attributes", :attribute => "some attribute"
    xml.selfclosing
  end
}.to_xml
The result will look like
<?xml version="1.0"?>
<body>
  <node1>some string</node1>
  <node2>123</node2>
  <node3>
    <node3_1>another string</node3_1>
  </node3>
  <node4 attribute="some attribute">with attributes</node4>
  <selfclosing/>
</body>
Source: http://www.jakobbeyer.de/xml-with-nokogiri

to_xml on Hash with string array fails with Not all elements respond to to_xml

>>r={"records"=>["001","002"]}
=> {"records"=>["001", "002"]}
>>r.to_xml
RuntimeError: Not all elements respond
to to_xml from
/jruby/../1.8/gems/activesupport2.3.9/lib/active_support/core_ext/array/conversions.rb:163:in `to_xml'
Is there a Rails-preferred way to change the Hash#to_xml behavior so that it returns
<records>
<record>001</record>
<record>002</record>
</records>
...
Just like DigitalRoss said, this appears to work out of the box in Ruby 1.9 with ActiveSupport 3:
ruby-1.9.2-p0 > require 'active_support/all'
=> true
ruby-1.9.2-p0 > r={"records"=>["001","002"]}
=> {"records"=>["001", "002"]}
ruby-1.9.2-p0 > puts r.to_xml
<?xml version="1.0" encoding="UTF-8"?>
<hash>
<records type="array">
<record>001</record>
<record>002</record>
</records>
</hash>
At least with MRI (you're using JRuby, though), you can get similar behavior on Ruby 1.8 with ActiveSupport 2.3.9:
require 'rubygems'
gem 'activesupport', '~>2.3'
require 'active_support'
class String
  def to_xml(options = {})
    root = options[:root] || 'string'
    options[:builder].tag! root, self
  end
end
Which gives you...
ruby-1.8.7-head > load 'myexample.rb'
=> true
ruby-1.8.7-head > r={"records"=>["001","002"]}
=> {"records"=>["001", "002"]}
ruby-1.8.7-head > puts r.to_xml
<?xml version="1.0" encoding="UTF-8"?>
<hash>
<records type="array">
<record>001</record>
<record>002</record>
</records>
</hash>
Note that my code doesn't work with Ruby 1.9 and ActiveSupport 3.
No, because there is no way that "001" and "002" know how to become <record>001</record>. These strings are just that: strings. They don't know that they are used in a hash with an array, let alone that these strings share a key, that needs to be singularized.
You could do something like:
record = Struct.new(:value) do
  def to_xml
    "<record>#{value}</record>"
  end
end
r = { "records" => [ record.new("001"), record.new("002") ] }
r.to_xml
Or, use a tool like Builder to make the xml separately from the data structure.
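A minimal sketch of that last suggestion, building the XML separately from the Hash. It uses REXML from Ruby's standard distribution as a stand-in for the Builder gem (a substitution made here, not something the answer above used). array_to_xml and its root_name and item_name parameters are hypothetical names, and the child tag name is passed in explicitly rather than derived by singularizing the key:

```ruby
require 'rexml/document'

# Hypothetical helper: wrap each value in an <item_name> element
# under a single <root_name> element.
def array_to_xml(root_name, item_name, values)
  doc = REXML::Document.new
  root = doc.add_element(root_name)
  values.each { |value| root.add_element(item_name).text = value.to_s }
  doc
end

r = { 'records' => %w[001 002] }
doc = array_to_xml('records', 'record', r['records'])
puts doc.to_s
```

Because the data stays a plain Hash of strings, nothing has to be monkey-patched onto String for this to work.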
