Convert html to text in ROR - ruby-on-rails

HTML
<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>
I have already tried strip_tags, which gives me the following output:
"Hellothis is a test message"
The output I want:
Hello
this is
a
test message

html = "<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>"
strip_tags
The strip_tags helper seems to work fine when the string keeps its newlines:
puts ActionController::Base.helpers.strip_tags(html)
# =>
# Hello
# this is
# a
# test message
Nokogiri
Nokogiri is included by default in Rails, so you could also use :
doc = Nokogiri::HTML(html)
puts doc.xpath("//text()").to_s
It outputs :
Hello
this is
a
test message
Convert newlines to spaces
If you want to remove newlines :
ActionController::Base.helpers.strip_tags(html).gsub(/\s+/,' ')
#=> "Hello this is a test message"

The HTML is rendered by a browser like:
Hello
this is
a
test message
That isn't a strictly faithful rendering, though, because the HTML contains trailing <br> tags inside the <p> tags, so the text would more accurately be a string like:
this is \n\n\n
which is normally considered a paragraph break plus a newline. Browsers, however, collapse blank lines and runs of whitespace when rendering text, to make it more readable. For example, this HTML:
<p>foo</p>
<p></p>
<p></p>
<p></p>
<p>bar</p>
renders as:
foo
bar
and:
<p>foo bar</p>
renders as:
foo bar
So you have to decide: do you want Nokogiri to render the text the way a browser would, for readability, or accurately, following the markup?
This does it like the browser:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<p>Hello</p>
<p>this is <br></p>
<p>a <br></p>
<p>test message</p><br>
EOT
doc.search('br').remove
text = doc.search('p').map { |p| p.text + "\n\n" }
puts text
# >> Hello
# >>
# >> this is
# >>
# >> a
# >>
# >> test message
# >>
It removes the breaks, then converts the <p> contained text by appending two new-lines.
Doing it accurately, as per how the markup shows, is a little different:
doc.search('br').each { |br| br.replace("\n") }
text = doc.search('p').map { |p| p.text + "\n\n" }
puts text
# >> Hello
# >>
# >> this is
# >>
# >>
# >> a
# >>
# >>
# >> test message
# >>
This is just a simplified way of doing it to get you started. Rails does the opposite of this in ActionView's simple_format method.
Browsers have a lot more rules used to determine when and how to display the text and their rendering can be influenced by CSS and JavaScript which won't necessarily translate to text, especially plain text.

Related

How to create a Nokogiri::XML::Node from Nokogiri::XML::Builder

I need to replace a node in a document with new HTML I'm creating.
The class of the node I have to replace is:
Nokogiri::XML::Node
I create my fragment using the Nokogiri Builder:
new_node = Nokogiri::XML::Builder.new do |xml|
  xml.table('border' => '1', 'cellpadding' => '1', 'cellspacing' => '1') {
    xml.thead {
      xml.tr {
        battery_test[0..4].each do |head|
          xml.th_ head["inputValue"]
        end
      }
    }
    xml.tbody {
      battery_test.drop(5).each_slice(5) do |row|
        xml.tr {
          row.each do |item|
            xml.td_ item["inputValue"]
          end
        }
      end
    }
  }
end
But the class of new_node is Nokogiri::XML::Builder.
How can I replace my Nokogiri::XML::Node with the fragment I create with the builder?
You don't have to use Builder to create nodes. Nokogiri allows several ways of defining them. Your question isn't asked well as it's missing essential information, but this will get you started:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head></head>
<body>
</body>
</html>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body>
# >> </body>
# >> </html>
I can add a table using a string containing the HTML:
body = doc.at('body')
body.inner_html = "<table><tbody><tr><td>foo</td><td>bar</td></tr></tbody></table>"
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body><table><tbody><tr>
# >> <td>foo</td>
# >> <td>bar</td>
# >> </tr></tbody></table></body>
# >> </html>
Modify the string generation to contain the HTML you need, let Nokogiri do the heavy lifting, and you're done. It's easier to read and maintain.
inner_html= is defined as:
inner_html=(node_or_tags)
node_or_tags means you can pass a node created using Builder, snipped from some other place in the DOM, or a string containing the markup.
Similarly:
table = Nokogiri::XML::Node.new('table', doc)
table.class # => Nokogiri::XML::Element
table.add_child('<tbody><tr><td>foo</td><td>bar</td></tr></tbody>')
body = doc.at('body')
body.inner_html = table
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
# >> <body><table><tbody><tr>
# >> <td>foo</td>
# >> <td>bar</td>
# >> </tr></tbody></table></body>
# >> </html>
Note that table is a Nokogiri::XML::Element. HTML nodes are a subclass of XML nodes so don't let that confuse you.
The tutorials are good starting points for trying anything with Nokogiri. In this case "Modifying an HTML / XML Document" is useful. Also the "Cheat sheet" is chock-full of goodness. Finally, "Questions tagged [nokogiri]" reveals all the top questions on Stack Overflow.

Convert xml to hash using Nokogiri but keep the anchor tags

I have an XML file like the one below. I want to parse it and convert it to a
Ruby hash. I tried the code below, but it strips out the anchor tags and I end up
with a description like this:
"Today is a "
How can I convert the xml to a hash but keep the anchor tags?
Code:
@doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
data = Hash.from_xml(@doc.to_s)
XML FILE
<blah>
<tag>
<name>My Name</name>
<url>www.url.com</url>
<file>myfile.zip</file>
<description>Today is a sunny</description>
</tag>
<tag>
<name>Someones Name</name>
<url>www.url2.com</url>
<file>myfile2.zip</file>
<description>Today is a rainy</description>
</tag>
</blah>
The only way I see now is to escape HTML inside <description> in the whole document, then execute Hash#from_xml:
doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
# escape HTML inside <description>
doc.css("description").each do |node|
node.inner_html = CGI.escapeHTML(node.inner_html)
end
data = Hash.from_xml(doc.to_s) # =>
# {"blah"=>
# {
# "tag"=>[
# {
# "name"=>"My Name",
# "url"=>"www.url.com",
# "file"=>"myfile.zip",
# "description"=>"Today is a sunny"
# },
# {
# "name"=>"Someones Name",
# "url"=>"www.url2.com",
# "file"=>"myfile2.zip",
# "description"=>"Today is a rainy"
# }
# ]
# }
# }
Nokogiri is used here just for the HTML escaping. You don't really need it if you find another way to escape. For example:
xml = File.read(xml_file)
# escape the inner HTML (maybe not the best way, just an example);
# the block form is needed so that $1 refers to the current match
xml.gsub!(/<description>(.*?)<\/description>/m) { "<description>#{CGI.escapeHTML($1)}</description>" }
data = Hash.from_xml(xml)
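One caveat with the regex-based approach: $1 inside an interpolated replacement string is evaluated before gsub runs, so the block form of gsub is the safer choice. A small stdlib-only sketch using an inline sample string (the anchor and its href are made up, since the original markup isn't shown):

```ruby
require 'cgi'

# Sample input; the anchor and its href are invented for illustration.
xml = '<tag><description>Today is a <a href="#">sunny</a> day</description></tag>'

# The block runs once per match, so Regexp.last_match(1) is the text between
# the current pair of <description> tags.
escaped = xml.gsub(%r{<description>(.*?)</description>}m) do
  "<description>#{CGI.escapeHTML(Regexp.last_match(1))}</description>"
end

puts escaped
```

Passing the escaped string to Hash.from_xml should then keep the anchor markup as text in the description value.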

Ruby RegEx to locate image assets in an html/erb file

My end goal is to write a script that loops through all my app/views folders and finds any image assets used within them (jpg, png, svg, gif). I can't quite get it working; I feel I am close but need a little assistance.
This is how I am getting all my assets:
assets_in_assets = []
# I searched for image asset names in this folder
image_asset_path = './app/assets/images'
# I haven't made use of the variables below yet
assets_in_use = []
# I plan to loop through the below folders to see if and where the image
# assets are being used
public_folder = './public'
app_folder = './app'
Find.find(image_asset_path) do |path|
# returns path and file names of all files extensions recursively
if !File.directory?(path) && path =~ /.*[\.jpg$ | \.png$ | .svg$ | \.gif$]/
&& !(path =~ /\.DS_Store/)
new_path = File.basename(path) # equiv to path.to_s.split('/').last
assets_in_assets << new_path
end
end
# The above seems to work, it gives me all the asset image names in an array.
This is how I am trying to read an html.erb file to find if and where images are being used.
Here is a sample of part of the page:
<div class="wrapper">
<div class="content-wrapper pull-center center-text">
<img class="pattern-stars" src="<%= image_path('v3/super/pattern-
stars.png') %>" aria-hidden="true">
<h2 class="pull-center uppercase">Built by the Obsessed People at the
Company</h2>
<p class="top-mini">Our pets needed a challenge.</p>
<p class="italicize">So we made one.</p>
<img class="stroke" src="<%= image_path('v3/super/stroke.png') %>"
aria-hidden="true">
</div>
</div>
# The assets I am expecting to find, in this small section, are:
#- pattern-stars.png
#- stroke.png
And my code (I tried two different ways, here is the first):
# My plan is start with one specific file, then expand it once the code works
lines = File.open('./app/views/pages/chewer.html.erb', 'r')
lines.each do |f|
if f =~ / [\w]+\.(jpe?g | png | gif | svg) /xi
puts 'match: ' + f # just wanted to see what's being returned
end
end
# This is what gets returned
# match: <img class="pattern-stars" src="<%= image_path('v3/super
# /pattern-stars.png') %>" aria-hidden="true">
# match: <img class="stroke" src="<%= image_path('v3/super/stroke.png')
# %>" aria-hidden="true">
Not what I was hoping for. I also tried the following:
lines = File.open('./app/views/pages/chewer.html.erb', 'r')
lines.each do |f|
new_f = File.basename(f)
puts 'after split' + new_f # I wanted to see what was being returned
if new_f =~ / [\w]+\.(jpe?g | png | gif | svg) /xi
puts 'match: ' + new_f
end
end
# This is what gets returned
# after split: pattern-stars.png') %>" aria-hidden="true">
# match: pattern-stars.png') %>" aria-hidden="true">
# after split: stroke.png') %>" aria-hidden="true">
# match: stroke.png') %>" aria-hidden="true">
And here I remain blocked. I have searched through S.O. and tried a few things, but nothing I have found has helped; it could be that I implemented the solutions incorrectly. I also tried look-behind (using the single ' as an end point) and look-ahead (using a / as a starting point).
If this is a dup or similar to another question, please let me know. I'd appreciate the help, plus a brief explanation: I really want to get a better understanding to improve my skills.
(?:['"])([^'"]+\.(?:png|jpe?g|gif|svg)) seems to work for the one test case you supplied. It relies on image paths always appearing inside a quoted string, using the opening quote as the start-of-path delimiter, and it stops at the extension, so even if the string is unclosed the match should end at an appropriate place.
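As a quick check, here is a minimal sketch that runs that pattern over the two <img> lines from the question using String#scan. The constant name IMAGE_PATH is made up, and the sample is joined onto single lines for brevity:

```ruby
# The pattern from the answer, with the case-insensitive flag added.
IMAGE_PATH = /['"]([^'"]+\.(?:png|jpe?g|gif|svg))/i

erb = <<~ERB
  <img class="pattern-stars" src="<%= image_path('v3/super/pattern-stars.png') %>" aria-hidden="true">
  <img class="stroke" src="<%= image_path('v3/super/stroke.png') %>" aria-hidden="true">
ERB

# scan returns one array per match holding the capture group; flatten them
# and keep only the file names.
names = erb.scan(IMAGE_PATH).flatten.map { |path| File.basename(path) }
p names
```

Unlike =~, which only reports whether a line matched, scan returns the captured text itself, which is what the question was after.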
Using the above, I eventually arrived at the following solution:
Find.find(app_folder, public_folder) do |path|
  if !File.directory?(path) &&
     !(path =~ /\.\/app\/assets\/images/) &&
     !(path =~ /\.DS_Store/) &&
     !(path =~ /\.\/app\/assets\/fonts/)
    asset_file = File.read(path)
    image_asset = asset_file.scan(/ (?:['"|\s|#])([^'"|\s|#]+\.(?:png | jpe?g |gif | svg)) /xi).flatten
    image_asset.each do |image_name|
      assets_in_use << [path, File.basename(image_name)]
    end
  end
end
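One detail worth knowing for the asset-listing step: square brackets in a regex define a character class, so a pattern such as [\.jpg$ | \.png$] matches any single one of those characters rather than a whole extension. A sketch of an extension check based on File.extname instead; the constant and method names are made up for illustration, and the paths are sample data:

```ruby
# Extensions treated as image assets (an assumed list; adjust as needed).
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .svg .gif].freeze

# Hypothetical helper: true when the path ends in one of the extensions above.
def image_asset?(path)
  IMAGE_EXTENSIONS.include?(File.extname(path).downcase)
end

paths = ['./app/assets/images/v3/super/stroke.png',
         './app/assets/images/.DS_Store',
         './app/assets/images/logo.SVG',
         './app/assets/images/notes.txt']

image_names = paths.select { |path| image_asset?(path) }
                   .map { |path| File.basename(path) }
p image_names
```

File.extname returns an empty string for dotfiles such as .DS_Store, so they are excluded without a separate check.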

Add text just before the closing tag using Nokogiri

I'm using a Nokogiri-based helper to truncate text without breaking HTML tags:
require "rubygems"
require "nokogiri"
module TextHelper
  def truncate_html(text, max_length, ellipsis = "...")
    ellipsis_length = ellipsis.length
    doc = Nokogiri::HTML::DocumentFragment.parse text
    content_length = doc.inner_text.length
    actual_length = max_length - ellipsis_length
    content_length > actual_length ? doc.truncate(actual_length).inner_html + ellipsis : text.to_s
  end
end
module NokogiriTruncator
  module NodeWithChildren
    def truncate(max_length)
      return self if inner_text.length <= max_length
      truncated_node = self.dup
      truncated_node.children.remove
      self.children.each do |node|
        remaining_length = max_length - truncated_node.inner_text.length
        break if remaining_length <= 0
        truncated_node.add_child node.truncate(remaining_length)
      end
      truncated_node
    end
  end
  module TextNode
    def truncate(max_length)
      Nokogiri::XML::Text.new(content[0..(max_length - 1)], parent)
    end
  end
end
Nokogiri::HTML::DocumentFragment.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Element.send(:include, NokogiriTruncator::NodeWithChildren)
Nokogiri::XML::Text.send(:include, NokogiriTruncator::TextNode)
The line
content_length > actual_length ? doc.truncate(actual_length).inner_html + ellipsis : text.to_s
appends the ellipsis just after the last closing tag.
In my view I call
<%= truncate_html(news.parsed_body, 700, "... Read more.").html_safe %>
The issue is that the text that is being parsed is wrapped in <p></p> tags, causing the view to break:
"Lorem Ipsum</p>
... Read More"
Is it possible to append the ellipsis to the last part of the last node using Nokogiri, so the final output becomes:
"Lorem Ipsum... Read More</p>"
Since you didn't supply any input data, you'll have to extrapolate from this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo bar baz</p>
</body>
</html>
EOT
paragraph = doc.at('p')
text = paragraph.text
text[4..-1] = '...'
paragraph.content = text
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <p>foo ...</p>
# >> </body>
# >> </html>
You're making it much harder than it really is. Note that content= always treats its argument as plain text: it creates a text node and escapes any markup in the string. If you need to insert markup rather than text, use inner_html= or add_child instead.
This code simply:
Finds the p tag.
Extracts the text from it.
Replaces the text from a given point to the end with '...'.
Replaces the content of the paragraph with that text.
If you only want to append to that text it becomes even easier:
paragraph = doc.at('p')
paragraph.content = paragraph.text + ' ...Read more.'
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >> <body>
# >> <p>foo bar baz ...Read more.</p>
# >> </body>
# >> </html>

Find within the first 10?

I'm using Nokogiri to screen-scrape contents of a website.
I set fetch_number to specify the number of <divs> that I want to retrieve. For example, I may want the first(10) tweets from the target page.
The code looks like this:
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title']
end
However, when fewer than 10 matching div tags are returned, it reports
NoMethodError: undefined method 'css' for nil:NilClass
This is because, when no matching HTML is found, it will return nil.
How can I make it return all the available data, up to 10 items? I don't need the nils.
UPDATE:
task :test_fetch => :environment do
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
end
Returned results (near the end):
24
http://item.taobao.com/item.htm?id=41249522884
http://item.taobao.com/item.htm?id=40369253621
http://item.taobao.com/item.htm?id=40384876796
http://item.taobao.com/item.htm?id=40352486259
http://item.taobao.com/item.htm?id=40384968205
.....
http://item.taobao.com/item.htm?id=38843789106
http://item.taobao.com/item.htm?id=38843517455
http://item.taobao.com/item.htm?id=38854788276
http://item.taobao.com/item.htm?id=38825442050
http://item.taobao.com/item.htm?id=38630599372
http://item.taobao.com/item.htm?id=38346270714
http://item.taobao.com/item.htm?id=38357729988
http://item.taobao.com/item.htm?id=38345374874
this is empty
this is empty
this is empty
this is empty
this is empty
this is empty
count reports only 24 elements, but first(30) returns a 30-element array.
And is it actually an array, or a Nokogiri::XML::NodeSet? I'm not sure.
title = item.css("a")[0]['title']
is bad practice.
Instead of search or css followed by [0], consider using at or at_css:
title = item.at('a')['title']
Next, if the <a> tag returned doesn't have a title attribute, title will be nil, and if no <a> tag is found at all, calling ['title'] on nil raises a NoMethodError. Instead, tighten your CSS selector so it only matches tags like <a title="foo">:
require 'nokogiri'
doc = Nokogiri::HTML('<body><a>foo</a><a title="bar">bar</a></body>')
doc.at('a').to_html # => "<a>foo</a>"
doc.at('a[title]').to_html # => "<a title=\"bar\">bar</a>"
Notice how the first, which is not constrained to look for tags with a title attribute, returns the first <a> tag. Using a[title] will only return ones with a title attribute.
That means your loop over the values will never return nil, and you won't have a problem needing to compact them out of the returned array.
As a general programming tip, if you're getting nils like that, look at the code generating the array, because odds are good it's not doing it right. You should ALWAYS know what sort of results your code will generate. Using compact to clean up the array is a knee-jerk reaction to not having written the code correctly most of the time.
Here's your updated code:
require 'nokogiri'
require 'open-uri'
url = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(open(url) )
puts doc.css(".main-wrap .item").count
doc.css(".main-wrap .item").first(30).each do |item_info|
if item_info
href = item_info.at(".detail a")['href']
puts href
else
puts 'this is empty'
end
end
And here's what's wrong:
doc.css(".main-wrap .item").first(30)
Here's a simple example demonstrating why that doesn't work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
In Nokogiri, search, css and xpath are equivalent, except that the first is generic and can take either CSS or XPath selectors, while the last two are specific to one selector language.
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fcf360ef750 name="p" children=[#<Nokogiri::XML::Text:0x3fcf360ef4f8 "foo">]>]
doc.search('p').size # => 1
doc.search('p').map(&:to_html) # => ["<p>foo</p>"]
That shows that the NodeSet returned by doing a simple search returns only one node, and what the node looks like.
doc.search('p').first(2) # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>, nil]
doc.search('p').first(2).size # => 2
Calling first(n) returns n elements. In the Nokogiri version used here, if that many aren't found, the missing ones are filled in with nil values.
This runs counter to what we'd assume first(n) does, since Enumerable#first returns up to n elements and won't pad with nils. It isn't a bug as such, but it is unexpected, because Enumerable#first sets the expectation for methods with that name. This is NodeSet#first, though, not Enumerable#first, so it does what it does until the Nokogiri authors change it. (You can see why it happens if you look at the source for that particular method.)
Instead, slicing the NodeSet does show the expected behavior:
doc.search('p')[0..1] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0..1].size # => 1
doc.search('p')[0, 2] # => [#<Nokogiri::XML::Element:0x3fe3a28d2848 name="p" children=[#<Nokogiri::XML::Text:0x3fe3a28c7b50 "foo">]>]
doc.search('p')[0, 2].size # => 1
So, don't use NodeSet#first(n), use the slice form NodeSet#[].
Applying that, I'd write the code something like:
require 'nokogiri'
require 'open-uri'
URL = 'http://themagicway.taobao.com/search.htm?&search=y&orderType=newOn_desc'
doc = Nokogiri::HTML(URI.open(URL)) # Kernel#open no longer opens URLs as of Ruby 3.0
hrefs = doc.css(".main-wrap .item .detail a[href]")[0..29].map { |anchor|
  anchor['href']
}
puts hrefs.size
puts hrefs
# >> 24
# >> http://item.taobao.com/item.htm?id=41249522884
# >> http://item.taobao.com/item.htm?id=40369253621
# >> http://item.taobao.com/item.htm?id=40384876796
# >> http://item.taobao.com/item.htm?id=40352486259
# >> http://item.taobao.com/item.htm?id=40384968205
# >> http://item.taobao.com/item.htm?id=40384816312
# >> http://item.taobao.com/item.htm?id=40384600507
# >> http://item.taobao.com/item.htm?id=39973451949
# >> http://item.taobao.com/item.htm?id=39861209551
# >> http://item.taobao.com/item.htm?id=39545678869
# >> http://item.taobao.com/item.htm?id=39535371171
# >> http://item.taobao.com/item.htm?id=39509186150
# >> http://item.taobao.com/item.htm?id=38973412667
# >> http://item.taobao.com/item.htm?id=38910499863
# >> http://item.taobao.com/item.htm?id=38942960787
# >> http://item.taobao.com/item.htm?id=38910403350
# >> http://item.taobao.com/item.htm?id=38843789106
# >> http://item.taobao.com/item.htm?id=38843517455
# >> http://item.taobao.com/item.htm?id=38854788276
# >> http://item.taobao.com/item.htm?id=38825442050
# >> http://item.taobao.com/item.htm?id=38630599372
# >> http://item.taobao.com/item.htm?id=38346270714
# >> http://item.taobao.com/item.htm?id=38357729988
# >> http://item.taobao.com/item.htm?id=38345374874
Try this
doc.css(".tweet").first(fetch_number).each do |item|
title = item.css("a")[0]['title'] rescue nil
end
Let me know whether it works. It will not raise an error, because the rescue modifier turns any failure into nil.
Try compact.
[1, nil, 2, nil, 3].compact # => [1, 2, 3]
http://www.ruby-doc.org/core-2.1.3/Array.html#method-i-compact
(i.e. first(fetch_number).compact.each do |item|)
