I need to replace a node in a document with new HTML I'm creating.
The class of the node I have to replace is:
Nokogiri::XML::Node
I create my fragment using the Nokogiri Builder:
new_node = Nokogiri::XML::Builder.new do |xml|
  xml.table('border' => '1', 'cellpadding' => '1', 'cellspacing' => '1') {
    xml.thead {
      xml.tr {
        battery_test[0..4].each do |head|
          xml.th_ head["inputValue"]
        end
      }
    }
    xml.tbody {
      battery_test.drop(5).each_slice(5) do |row|
        xml.tr {
          row.each do |item|
            xml.td_ item["inputValue"]
          end
        }
      end
    }
  }
end
But the class of new_node is Nokogiri::XML::Builder.
How can I replace my Nokogiri::XML::Node with the fragment I create with the builder?
You don't have to use Builder to create nodes. Nokogiri allows several ways of defining them. Your question isn't asked well as it's missing essential information, but this will get you started:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head></head>
<body>
</body>
</html>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body>
# >> </body>
# >> </html>
I can add a table using a string containing the HTML:
body = doc.at('body')
body.inner_html = "<table><tbody><tr><td>foo</td><td>bar</td></tr></tbody></table>"
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body><table><tbody><tr>
# >> <td>foo</td>
# >> <td>bar</td>
# >> </tr></tbody></table></body>
# >> </html>
Modify the string generation to contain the HTML you need, let Nokogiri do the heavy lifting, and you're done. It's easier to read and maintain.
inner_html= is defined as:
inner_html=(node_or_tags)
node_or_tags means you can pass a node created using Builder, snipped from some other place in the DOM, or a string containing the markup.
Similarly:
table = Nokogiri::XML::Node.new('table', doc)
table.class # => Nokogiri::XML::Element
table.add_child('<tbody><tr><td>foo</td><td>bar</td></tr></tbody>')
body = doc.at('body')
body.inner_html = table
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body><table><tbody><tr>
# >> <td>foo</td>
# >> <td>bar</td>
# >> </tr></tbody></table></body>
# >> </html>
Note that table is a Nokogiri::XML::Element. HTML nodes are a subclass of XML nodes so don't let that confuse you.
The tutorials are good starting points for trying anything with Nokogiri. In this case "Modifying an HTML / XML Document" is useful. Also the "Cheat sheet" is chock-full of goodness. Finally, "Questions tagged [nokogiri]" reveals all the top questions on Stack Overflow.
ViewComponent lets you test that a component has not rendered in Minitest using
refute_component_rendered, but how do you do the same in RSpec?
class SometimesNotRenderedComponent < ViewComponent::Base
  def initialize(my_param)
    @my_param = my_param
  end
  def render?
    # test this
  end
end
it "renders nothing when..." do
  render_inline(described_class.new(my_param))
  # expect(page).to ... have no content
end
Let's dig a little, otherwise, the answer would be quite short.
The code for refute_component_rendered is pretty simple:
https://github.com/ViewComponent/view_component/blob/v2.78.0/lib/view_component/test_helpers.rb#L14
def refute_component_rendered
assert_no_selector("body")
end
assert_no_selector is a Capybara assertion. The negative matchers for RSpec are defined dynamically with a have_no_ prefix and are not documented.
have_selector delegates to assert_selector.
https://www.rubydoc.info/gems/capybara/Capybara/RSpecMatchers#have_selector-instance_method
Which means have_no_selector delegates to assert_no_selector.
You can use either one, it's just a matter of preference:
it "does not render" do
  render_inline(described_class.new(nil))
  expect(page).to have_no_selector("body")
  expect(page).not_to have_selector("body")
end
it "renders, but why" do
  # why match on `body`? it is just how the `render_inline` method works,
  # https://github.com/ViewComponent/view_component/blob/v2.78.0/lib/view_component/test_helpers.rb#L56
  # it assigns the result of `render` to @rendered_content, like this:
  @rendered_content = "i'm rendering"
  # when you call `page`
  # https://github.com/ViewComponent/view_component/blob/v2.78.0/lib/view_component/test_helpers.rb#L10
  # it wraps @rendered_content in `Capybara::Node::Simple`
  # https://www.rubydoc.info/gems/capybara/Capybara/Node/Simple
  # if content is empty, there is no body
  #   Capybara::Node::Simple.new("").native.to_html
  #   # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n\n"
  puts page.native.to_xhtml # to see what you're matching on
  expect(page).to_not have_no_selector("body")
end
I've double checked:
$ bin/rspec spec/components/badge_component_spec.rb
BadgeComponent
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
does not render
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>i'm rendering</p>
</body>
</html>
renders, but why
Finished in 0.06006 seconds (files took 2.11 seconds to load)
2 examples, 0 failures
I'm using a Nokogiri-based helper to truncate text without breaking HTML tags:
require "rubygems"
require "nokogiri"
module TextHelper
  def truncate_html(text, max_length, ellipsis = "...")
    ellipsis_length = ellipsis.length
    doc = Nokogiri::HTML::DocumentFragment.parse text
    content_length = doc.inner_text.length
    actual_length = max_length - ellipsis_length
    content_length > actual_length ? doc.truncate(actual_length).inner_html + ellipsis : text.to_s
  end
end
module NokogiriTruncator
  module NodeWithChildren
    def truncate(max_length)
      return self if inner_text.length <= max_length
      truncated_node = self.dup
      truncated_node.children.remove
      self.children.each do |node|
        remaining_length = max_length - truncated_node.inner_text.length
        break if remaining_length <= 0
        truncated_node.add_child node.truncate(remaining_length)
      end
      truncated_node
    end
  end
  module TextNode
    def truncate(max_length)
      Nokogiri::XML::Text.new(content[0..(max_length - 1)], parent)
    end
  end
end
Nokogiri::HTML::DocumentFragment.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Element.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Text.send(:include, NokogiriTruncator::TextNode)
On
content_length > actual_length ? doc.truncate(actual_length).inner_html + ellipsis : text.to_s
it appends the ellipsis just after the last tag.
In my view I call
<%= truncate_html(news.parsed_body, 700, "... Read more.").html_safe %>
The issue is that the text that is being parsed is wrapped in <p></p> tags, causing the view to break:
"Lorem Ipsum</p>
... Read More"
Is it possible to append the ellipsis to the last part of the last node using Nokogiri, so the final output becomes:
"Lorem Ipsum... Read More</p>"
Since you didn't supply any input data, you'll have to extrapolate from this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo bar baz</p>
</body>
</html>
EOT
paragraph = doc.at('p')
text = paragraph.text
text[4..-1] = '...'
paragraph.content = text
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <p>foo ...</p>
# >> </body>
# >> </html>
You're making it much harder than it really is. content= replaces a node's children with a single text node, and the string is escaped rather than parsed as markup, so plain text can be assigned directly.
This code simply:
Finds the p tag.
Extracts the text from it.
Replaces the text from a given point to the end with '...'.
Replaces the content of the paragraph with that text.
If you only want to append to that text it becomes even easier:
paragraph = doc.at('p')
paragraph.content = paragraph.text + ' ...Read more.'
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <p>foo bar baz ...Read more.</p>
# >> </body>
# >> </html>
I need to downcase all text in an HTML document that has been parsed with Nokogiri. Here is my code:
agent = Mechanize.new
page = agent.get('http://www.example.com').parser.search('//*[translate(text(),"ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") = *]').to_html
There is no error as such in the code; it executes without an error. If I go in and check a random tag in the document, however, the case is still the same as before. Is there another/better way to downcase all text in a document?
You could use traverse to downcase all text nodes:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://www.example.com/"))
doc.traverse do |node|
node.content = node.content.downcase if node.text?
end
puts doc.to_html
Output:
<!DOCTYPE html>
<html>
<head>
<title>example domain</title>
<meta charset="utf-8">
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
body { ... }
</style>
</head>
<body>
<div>
<h1>example domain</h1>
<p>this domain is established to be used for illustrative examples in documents. you may use this
domain in examples without prior coordination or asking for permission.</p>
<p>more information...</p>
</div>
</body>
</html>
Using Nokogiri, I am manually creating <video> and <source> tags. My code looks like this:
mp4_source_tag = html.create_element('source')
tag.replace(mp4_source_tag)
mp4_source_tag['type'] = 'video/mp4'
mp4_source_tag['src'] = video.mp4_video.url
Which produces the following HTML:
<source type="video/mp4" src="/system/mp4_videos/1/original/trailer.mp4?1347088365"></source>
However this is invalid HTML5. The correct output should be:
<source type="video/mp4" src="/system/mp4_videos/1/original/trailer.mp4?1347088365">
How would I use Nokogiri to output valid HTML5 without the closing </source> tag?
The replaced tag was an <img> tag, but that doesn't appear to matter.
If you create your document as XML instead of HTML, Nokogiri will output empty elements with a closing slash, e.g. <source />; this is valid for HTML5.
html = Nokogiri.HTML('')
puts html.create_element('source')
#=> <source></source>
xml = Nokogiri.XML('')
puts xml.create_element('source')
#=> <source/>
The downside of this, however, is that parsing a valid HTML5 document as XML will cause mistakes in parsing:
require 'nokogiri'
html5 = '<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>Test</title>
</head><body>
<img src="a.jpg"><img src="b.jpg">
</body></html>'
doc = Nokogiri.XML( html5, &:noblanks )
puts doc
#=> <?xml version="1.0"?>
#=> <!DOCTYPE html>
#=> <html>
#=> <head>
#=> <meta charset="utf-8">
#=> <title>Test</title>
#=> </meta>
#=> <body>
#=> <img src="a.jpg">
#=> <img src="b.jpg">
#=> </img>
#=> </img>
#=> </body>
#=> </head>
#=> </html>
To fix this, you'll need to make your source be valid XML by self-closing your void elements (which is also valid for HTML5). Further, to avoid the XML Declaration you'll need to serialize the DTD and root separately:
require 'nokogiri'
html5 = '<!DOCTYPE html>
<html><head>
<meta charset="utf-8"/>
<title>Test</title>
</head><body>
<img src="a.jpg"/><img src="b.jpg"/>
</body></html>'
doc = Nokogiri.XML( html5, &:noblanks )
puts doc.children.map(&:to_s)
#=> <!DOCTYPE html>
#=> <html>
#=> <head>
#=> <meta charset="utf-8"/>
#=> <title>Test</title>
#=> </head>
#=> <body>
#=> <img src="a.jpg"/>
#=> <img src="b.jpg"/>
#=> </body>
#=> </html>
Say I start with everything inside the body element:
Nokogiri::HTML( doc ).xpath( "/html/body/node()" ).to_html
which contains some <script> and <noscript>. How do I get rid of these?
You might want to change your XPath expression to:
Nokogiri::HTML( doc ).xpath( "/html/body/node()[not(self::script or self::noscript)]" ).to_html
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOT
<html>
<head>
<script>
<!-- dummy script !>
</script>
</head>
<body>
<script><!-- dummy script !></script>
<noscript>dummy script</noscript>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
Here's the gist of it:
doc.at('body').search('script,noscript').remove
puts doc.to_xml
>> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>> <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
>> <script>
>> <!-- dummy script !>
>> </script>
>> </head>
>> <body>
>>
>> </body>
>> </html>
For simplicity, I'm using Nokogiri's support for CSS selectors, rather than XPath.
doc.at('body').search('script,noscript').remove
looks for the first occurrence of the <body> tag, then looks inside for all <script> and <noscript> tags, removing them.
The gap between the resulting <body> tags is the result of the newlines in the text nodes that trailed the removed tags.