I have an HTML string (for instance <div class="input">hello</div>) and I want to add a node only if the HTML tag in the string is a label (for instance <label>Hi</label>).
doc = Nokogiri::XML(html)
doc.children.each do |node|
if node.name == 'label'
# this code gets called
span = Nokogiri::XML::Node.new "span", node
span.content = "hello"
puts span.parent
# nil
span.parent = node
# throws error "node can only have one parent"
end
end
doc.to_html # Does not contain the span.
I cannot for the life of me understand what I'm doing wrong, any help would be greatly appreciated.
Edit: This solved my problem, thanks for the answers!
# notice DocumentFragment rather than XML
doc = Nokogiri::HTML::DocumentFragment.parse(html_tag)
doc.children.each do |node|
if node.name == 'label'
span = Nokogiri::XML::Node.new "span", doc
node.add_child(span)
end
end
It's easy to add/change/delete HTML:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input">hello</div>')
div = doc.at('div')
div << '<span>Hello</span>'
puts doc.to_html
Which results in:
# >> <div class="input">hello<span>Hello</span>
# >> </div>
Notice that the above code appended a new node to the existing children of the <div>. Because << appends, the new <span> landed after the text node containing "hello".
If you want to overwrite the children, you can do that easily using children =:
div.children = '<span>Hello</span>'
puts doc.to_html
Which results in:
# >> <div class="input"><span>Hello</span></div>
children = can take a single Node, which can have multiple other nodes nested under it, or the HTML text of the node(s) being inserted. That's what node_or_tags means when you see it in the documentation.
That said, to change just an embedded <label>, I'd do something like:
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input"><label>hello</label></div>')
label = doc.at('div label')
label.name = 'span' if label
puts doc.to_html
# >> <div class="input"><span>hello</span></div>
Or:
doc = Nokogiri::HTML::DocumentFragment.parse('<div class="input"><label>hello</label></div>')
label = doc.at('div label')
label.replace("<span>#{ label.text }</span>") if label
puts doc.to_html
# >> <div class="input"><span>hello</span></div>
Nokogiri makes it easy to change the tag's name once you've pointed at it. You can easily change the text inside the <span> by replacing #{ label.text } with whatever you desire.
at('div label') is one way of finding a particular node. It basically means "find the first label tag nested inside a div". at finds the first match, and is similar to using search(...).first. There are CSS- and XPath-specific equivalents of both at and search (at_css, at_xpath, css and xpath) in the Nokogiri::XML::Node documentation if you need those.
A few issues: your span = ... line was creating the node but not actually adding it to the document. Also, you can't access span outside of the block where you created it.
I think this is what you're after:
html = '<label>Hi</label>'
doc = Nokogiri::XML(html)
doc.children.each do |node|
if node.name == 'label'
# this code gets called
span = Nokogiri::XML::Node.new "span", doc
span.content = "hello"
node.add_child(span)
end
end
# NOTE: neither `node` nor `span` is accessible outside of the each block
doc.to_s # => "<?xml version=\"1.0\"?>\n<label>Hi<span>hello</span></label>\n"
Note the node.add_child(span) line.
I am moving some of my scraping from JavaScript to Ruby, and I am having trouble using Nokogiri.
I have trouble getting the right <dl> in a target class. I tried using css and xpath with the same result.
This is a sample of the HTML:
<div class="target">
<dl>
<dt>A:</dt>
<dd>foo</dd>
<dt>B:</dt>
<dd>bar</dd>
</dl>
</div>
This is a sample of my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open(url))
doc.css(".target > dl").each do |item|
puts item.text # I would expect to receive a collection of nodes here,
# yet I am receiving a single block of text
end
doc.css(".target > dl > dt").each do |item|
puts item.text # Here I would expect to iterate through a collection of
# dt elements, however I receive a single block of text
end
Can someone show me what I am doing wrong?
In the first case, the result should be the single dl; you are getting a single block of text. That is expected.
In the second case, the result should be two individual dt elements. You are printing their text one after another, which is indistinguishable from printing the text of the dl.
doc.css('.target > dl').length
# => 1 # as you have one `dl` element in `.target`
doc.css('.target > dl > dt').length
# => 2 # as you have two `dt` elements that are children of a `dl` in `.target`
doc.css(".target > dl > dt").each do |item|
puts item.text
puts "---" # make it obvious which element is which
end
# => A:
# ---
# B:
# ---
I am not quite sure what other result you are expecting.
I'd use something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="target">
<dl>
<dt>A:</dt>
<dd>foo</dd>
<dt>B:</dt>
<dd>bar</dd>
</dl>
</div>
EOT
This finds the first class='target', then its contained <dt> tags, and extracts each <dt>'s text:
doc.at('.target').search('dt').map{ |n| n.text } # => ["A:", "B:"]
This does the same, using &:text as shorthand for the block passed to map:
doc.at('.target').search('dt').map(&:text) # => ["A:", "B:"]
This lets the engine find all <dt> embedded in all class="target" tags:
doc.search('.target dt').map(&:text) # => ["A:", "B:"]
See "How to avoid joining all text from Nodes when scraping" also.
First of all, I am very new to ruby and I am trying to maintain an application already running in production.
I have been so far able to "interpret" the code well, but there is one thing I am stuck at.
I have an html.haml file where I am trying to display links from DB.
Imagine a DB structure like below
link_name - Home
URL - /home.html
class - clear
id - homeId
I display a link on the page as below
<a href="/home.html" class="clear" id="home">Home</a>
To do this I use 'link_to' where I am adding code as follows
-link_to model.link_name , model.url, {:class => model.class ...... }
Now I have a new requirement where we have a free text in DB, something like -
data-help="home-help" data-redirect="home-redirect" which needs to come into the options.
So code in haml needs to directly display content versus assign it to a variable to display.
In other words I am able to do
attr='data-help="home-help" data-redirect="home-redirect"' inside the <a>, but not able to do
data-help="home-help" data-redirect="home-redirect" in <a> tag.
Any help would be greatly appreciated!
link_to accepts a hash :data => { :foo => "bar" } of key/val pairs that it will build into data- attributes on the anchor tag. The above will create an attr as follows data-foo="bar"
So you could write a method on the model to grab self.data_fields (or whatever it's called) and split it into attr pairs and then create a hash from that. Then you can just pass the hash directly to the :data param in link_to by :data => model.custom_data_fields_hash
This somewhat verbose method splits things out and returns a hash that'd contain: {"help"=>"home-help", "redirect"=>"home-redirect"}
def custom_data_fields_hash
# this would be replaced by self.your_models_attr
data_fields = 'data-help="home-help" data-redirect="home-redirect"'
# split the full string by spaces into attr pairs
field_pairs = data_fields.split " "
results = {}
field_pairs.each do |field_pair|
# split the attr and value by the =
data_attr, data_value = field_pair.split "="
# remove the 'data-' substring because the link_to will add that in automatically for :data fields
data_attr.gsub! "data-", ""
# Strip the quotes, the helper will add those
data_value.gsub! '"', ""
# add the processed pair to the results
results[data_attr] = data_value
end
results
end
Running this in a Rails console gives:
2.1.2 :065 > helper.link_to "Some Link", "http://foo.com/", :data => custom_data_fields_hash
=> "<a data-help=\"home-help\" data-redirect=\"home-redirect\" href=\"http://foo.com/\">Some Link</a>"
Alternatively you could make it a helper and just pass in the model.data_attr instead
link_to "Some Link", "http://foo.com/", :data => custom_data_fields_hash(model.data_fields_attr)
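A helper that takes the string as an argument, as in the call above, could look roughly like this. It is a sketch in plain Ruby that assumes the stored text is a space-separated list of name="value" pairs with no spaces inside the values.

```ruby
# Sketch of the helper variant: the raw attribute string is passed in
# instead of being read from the model inside the method.
def custom_data_fields_hash(data_fields)
  data_fields.to_s.split(' ').each_with_object({}) do |field_pair, results|
    # Split on the first "=" only, so values may themselves contain "=".
    data_attr, data_value = field_pair.split('=', 2)
    next if data_value.nil?

    # link_to adds the "data-" prefix and the quotes back for :data entries.
    results[data_attr.sub(/\Adata-/, '')] = data_value.delete('"')
  end
end

result = custom_data_fields_hash('data-help="home-help" data-redirect="home-redirect"')
p result
```

The returned hash has the keys "help" and "redirect", ready to be passed as :data => ... to link_to.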
Not sure you can directly embed an attribute string. You could try to decode the string in order to pass it to link_to:
- link_to model.link_name, model.url,
{
:class => model.class
}.merge(Hash[
str.scan(/([\w-]+)="([^"]*)"/)
])
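To see what the scan-based approach produces, here is a small plain-Ruby sketch. The sample string and the 'clear' class value are assumptions for illustration. Note that the keys keep their data- prefix, so the result is merged into the top-level HTML options rather than passed under :data.

```ruby
# Sample free text as it might be stored in the DB (assumed value).
str = 'data-help="home help" data-redirect="home-redirect"'

# scan returns one [name, value] pair per match; values may contain spaces.
pairs = str.scan(/([\w-]+)="([^"]*)"/)
extra = Hash[pairs]

# Merge with the options already passed to link_to.
options = { :class => 'clear' }.merge(extra)

p pairs
p options
```

Because the regular expression matches everything between the quotes, a value such as "home help" survives intact, which a plain split on spaces would break apart.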
I have an XML document which I need to parse with Nokogiri; however, I need to filter out all 'role' nodes whose names do not match those requested.
Essentially I want to return an array of only those roles where the first and last name match those required.
Current Status:
I have all the code working except for the one filtering/search line from within the controller. I have had a look through the filter and search functions of Nokogiri but cannot seem to achieve the desired result.
XML Input
<xml>
<role xsi:type="director">
<firstName>Thomas</firstName>
<lastName>JONES</lastName>
<company>Jones Enterprises</company>
</role>
<role xsi:type="director">
<firstName>Thomas</firstName>
<lastName>TEST</lastName>
<company>Test Factory</company>
</role>
</xml>
Controller
firstname = 'Thomas'
lastname = 'JONES'
@results = doc.css('role').where((doc.css('firstName').text == @firstname) AND (doc.css('lastName').text == @lastname))
View
<%= @results.each do |t| %>
<%= t.company %>
<% end %>
Required Output
Jones Enterprises
You can let the underlying libxml2 library do the work for you using XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<role xsi:type="director">
<firstName>Thomas</firstName>
<lastName>JONES</lastName>
<company>Jones Enterprises</company>
</role>
<role xsi:type="director">
<firstName>Thomas</firstName>
<lastName>TEST</lastName>
<company>Test Factory</company>
</role>
</xml>
EOT
FIRSTNAME = 'Thomas'
LASTNAME = 'JONES'
# Interpolate the constants so the names are defined in one place only.
roles = doc.search("//role[child::firstName[text()[contains(., '#{FIRSTNAME}')]] and child::lastName[text()[contains(., '#{LASTNAME}')]]]")
puts roles.to_xml
# >> <role xsi:type="director">
# >> <firstName>Thomas</firstName>
# >> <lastName>JONES</lastName>
# >> <company>Jones Enterprises</company>
# >> </role>
You can do the same with CSS, only CSS doesn't let us use the logic to test two child nodes' content in the same libXML call. Instead, at that point, we have to make multiple calls and let Ruby and Nokogiri filter for the desired nodes which gets to be more difficult and CPU-intensive. Something like this works:
roles_firstnames = doc.search('role firstName:contains("Thomas")').map(&:parent)
roles_lastnames = doc.search('role lastName:contains("JONES")').map(&:parent)
matching_roles = (roles_firstnames & roles_lastnames)
puts matching_roles.map(&:to_xml)
# >> <role xsi:type="director">
# >> <firstName>Thomas</firstName>
# >> <lastName>JONES</lastName>
# >> <company>Jones Enterprises</company>
# >> </role>
Notice:
Nokogiri lets us use a lot of CSS extensions provided by jQuery, such as :contains.
roles_firstnames & roles_lastnames uses Ruby's set intersection on arrays. Each array holds the parent <role> nodes of the elements that matched the first or last name. & returns the nodes the two arrays have in common, which basically does an and followed by a uniq for us.
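The behaviour of & can be checked in isolation. In this sketch plain strings stand in for the parent <role> nodes (an assumption made purely to keep the example free of dependencies).

```ruby
# Stand-ins for the parent nodes found by each search.
firstname_matches = %w[role1 role2 role2]
lastname_matches  = %w[role2 role3]

# Array#& keeps only elements present in both arrays and removes duplicates.
common = firstname_matches & lastname_matches

p common
```

Only "role2" appears in both arrays, and it appears once in the result even though it was listed twice in the first array.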
Either way you do it, once you have the <role> nodes needed, it's easy to iterate over them and extract the child <company> node's text:
roles.map{ |n| n.at('company').text }
# => ["Jones Enterprises"]
First, select the roles like this:
@roles = doc.css('role').select { |r| firstname == r.at('firstName').text && lastname == r.at('lastName').text }
Inside the select block, use the variables that hold your filter parameters.
Then, in your view, read the filtered XML nodes like this:
<% @roles.each do |r| %>
<%= r.at('company').text %>
<% end %>
So I am trying to build an XML document for export. But I need to add extra text inside the headers and can't figure out how.
def as_xml
require 'rubygems'
require 'builder'
builder = Builder::XmlMarkup.new(:target=>STDOUT, :indent=>2)
xml = builder.propertyList { |b|
b.description(self.description);
self.highlights.each do |h|
b.highlight(h);
end
}
end
returns:
<propertyList>
<description>"Description goes here"
</description>
<highlight>Highlight 1</highlight>
<highlight>Highlight 2</highlight>
</propertyList>
Is there a way to make it so that I can add an ID attribute to the highlight tags?
Such as <highlight id=1>, etc.
I'm also wondering if there is a way to define whether or not a tag should be self-closing using builder.
e.g.:
<auction date=self.auctionDate />
You can pass in attributes as a hash for the second argument:
self.highlights.each_with_index do |h, i|
b.highlight(h, id: i+1);
end
=> <highlight id='1'>Highlight 1</highlight>
=> <highlight id='2'>Highlight 2</highlight>
And if you pass only a hash, you can get a self-closing node:
b.auction(date: 'someDate')
=> <auction date='someDate' />
I want to split a string by whitespace
irb(main):001:0> input = "dog cat"
=> "dog cat"
irb(main):002:0> output = input.strip.split(/\s+/)
=> ["dog", "cat"]
This is good. However, I'm also doing this in the controller in Rails, and when I supply the same input, and have it print out the output #{output} into my view, it shows as dogcat instead of ["dog", "cat"]. I am really confused how that can happen. Any ideas?
I'm printing it using @notice = "#{output}" in the controller, and in my view I have <%= @notice %>
Rather than splitting your string in the controller and sending it as an array to your view, send the entire string to your view:
input = "dog cat"
@notice = input
Then, in your view, split the string and display it as a stringified array:
<%= Array(@notice.strip.split(/\s+/)).to_s %>
If you print an array of strings in Ruby 1.8, you'll get the strings all concatenated together, because Array#to_s there behaves like join (from Ruby 1.9 on it returns the inspect form instead). You'd get the same thing in irb if you had entered print "#{output}". You need to decide how you want to format them and print them that way, perhaps with a simple helper function. For example, the helper could do:
output.each { |s| puts "<p>#{s}</p>" }
Or whatever you like.
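The formatting choices can be compared side by side in plain Ruby. This sketch only uses the values from the question, and the variable names are illustrative.

```ruby
output = "dog cat".strip.split(/\s+/)

joined  = output.join         # no separator: the words run together
spaced  = output.join(', ')   # explicit separator
literal = output.inspect      # array-literal form, as irb shows it

puts joined
puts spaced
puts literal
```

Picking one of these explicitly in the controller or in a helper avoids depending on what the array's default string conversion happens to be.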
Continuing with your example code:
>> input = "dog cat"
=> "dog cat"
>> output = input.strip.split /\s+/
=> ["dog", "cat"]
>> joined = output.join ' '
=> "dog cat"
Remember too that Ruby has several helpers like %w and %W for letting you convert a string into an array of words. If you're starting with an array of words, each of which may have whitespace before and after its individual item, you might try something like this:
>> # `input` is an array of words that was populated Somewhere Else
>> # `input` has the initial value [" dog ", "cat\n", "\r tribble\t"]
>> output = input.join(' ').split
=> ["dog", "cat", "tribble"]
>> joined = output.join ' '
=> "dog cat tribble"
Calling Array#join without any parameter joins the array items together with no separation between them, and is what seems to be done in your example, where you just render the array as a string:
>> @notice = output
>> # @notice will render as 'dogcat'
As opposed to:
>> @notice = input.join(' ').split.join ' '
>> # @notice will render as 'dog cat'
And there you go.
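One detail worth checking when normalizing whitespace this way: splitting on /\s+/ keeps a leading empty string when the text starts with whitespace, while split with no argument does not. A small sketch, using the sample array from above:

```ruby
words = [" dog ", "cat\n", "\r tribble\t"]

# Join with a space so neighbouring words can never run together.
text = words.join(' ')

with_regex = text.split(/\s+/)  # leading whitespace yields an empty first item
plain      = text.split         # no argument: leading whitespace is ignored

p with_regex
p plain
p plain.join(' ')
```

Calling strip before split(/\s+/), as in the original question, avoids the empty first item as well.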