How do I scrape HTML between two HTML comments using Nokogiri? - ruby-on-rails

I have some HTML pages where the contents to be extracted are marked with HTML comments like below.
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.
I want to extract the full elements between these two HTML comments:
<div>some text</div>
<div><p>Some more elements</p></div>
I can get the text-only version using this characters callback:
class TextExtractor < Nokogiri::XML::SAX::Document
  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end

  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^begin content/ # match starting comment
      @interesting = true
    when /^end content/ # match closing comment
      @interesting = false
    end
  end

  def characters(string)
    @text << string if @interesting
  end
end
I get the text-only version with @text, but I need the full HTML stored in @html.

Extracting content between two nodes is not something we'd normally do; usually we want the content inside a particular node. Comments are nodes too, just a special type of node, so they can be located and walked like any other.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT
By looking for a comment containing the specified text it's possible to find a starting node:
start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">
Once that's found then a loop is needed that stores the current node, then looks for the next sibling until it finds another comment:
content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
break if contained_node.comment?
content << contained_node
contained_node = contained_node.next_sibling
end
content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"

Related

Use Lua filter to set pandoc template variable

My goal is to input different text in the template based on a single variable in the yaml.
Below is a minimal attempt, but I can't get it to work.
I'm looking for a Lua filter that set the variable $selected$ based on the value of $switch$.
In practice, I'll set several template variables based on that variable.
The idea is to have one more generic template instead of many templates with relative few differences.
pandoc index.md --to html --from markdown --output index.html --template template.html --lua-filter=filter.lua
file index.md
---
title: "test"
switch: "a"
---
Some text
file template.html
<html>
<title>$title$</title>
<body>
<h1>$selected$</h1>
<h2>$switch$</h2>
$body$
</body>
</html>
file filter.lua
local function choose(info)
local result
if (info == "a")
then
result = "first choise"
else
result = "alternative"
end
return result
end
return {
{
Meta = function(meta)
meta.title, meta.selected = choose(meta.switch)
return meta
end
}
}
desired output
<html>
<title>test</title>
<body>
<h1>first choise</h1>
<h2>a</h2>
<p>Some text</p>
</body>
</html>
the result I get
<html>
<title>alternative</title>
<body>
<h1></h1>
<h2>a</h2>
<p>Some text</p>
</body>
</html>
The issue here is that metadata values look like strings, but can be of some other type. Here, they are Inlines, as can be checked with this filter:
function Meta (meta)
print(pandoc.utils.type(meta.switch))
end
The easiest solution is to convert the value to a string with pandoc.utils.stringify:
Meta = function(meta)
meta.selected = choose(pandoc.utils.stringify(meta.switch))
return meta
end
The filter should work as expected now.

How to write method to csv not just string? Ruby csv gem

I need to put the text content from an html element to a csv file. With the ruby csv gem, it seems that the primary write method for wrapped Strings and IOs only converts a string even if an object is specified.
For example:
Searchresults = puts browser.divs(class: 'results_row').map(&:text)
csv << %w(Searchresults)
returns only "searchresults" in the csv file.
It seems like there should be a way to specify the text from the div element to be put and not just a literal string.
Edit:
Okay arieljuod and spickermann were right. Now I am getting text content from the div element output to the csv, but not all of it like when I output to the console. The div element "results_row" has two a elements with text content. It also has a child div "results_subrow" with a paragraph of text content that is not getting written to the csv.
HTML:
<div class="bodytag" style="padding-bottom:30px; overflow:visible">
<h2>Search Results for "serialnum3"</h2>
<div id="results_banner">
Products
<span>Showing 1 to 2 of 2 results</span>
</div>
<div class="pg_dir"></div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info</p>
</div>
</div>
<div class="results_row">
FUJI
50mm lens
<div class="results_subrow">
<p>more product info 2</p>
</div>
</div>
<div class="pg_dir"></div>
My code:
search_results = browser.divs(class: 'results_row').map(&:text)
csv << search_results
I'm thinking that including the child div "results_subrow" in the locator will find what I am missing. Like:
search_results = browser.divs(class: 'results_row', 'results_subrow').map(&:text)
csv << search_results
%w[Searchresults] creates an array containing the word Searchresults. You probably want something like this:
# assign the array returned from `map` to the `search_results` variable
search_results = browser.divs(class: 'results_row').map(&:text)
# output the `search_results`. Note that the return value of `puts` is `nil`
# therefore something like `Searchresults = puts browser...` doesn't work
puts search_results
# append `search_results` to your csv
csv << search_results
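Note that csv << search_results writes every result into one row. If each result should get its own row, something like the following sketch with the standard csv library may be closer to what is wanted. The search_results array here is hard-coded stand-in data (an assumption about what the Watir call returns), not real output.

```ruby
require 'csv'

# Stand-in for browser.divs(class: 'results_row').map(&:text);
# the strings below are illustrative only.
search_results = [
  "FUJI\n50mm lens\nmore product info",
  "FUJI\n50mm lens\nmore product info 2"
]

csv_text = CSV.generate do |csv|
  search_results.each do |text|
    # One CSV row per result, one column per line of the element's text.
    csv << text.lines.map(&:strip)
  end
end

puts csv_text
```

To write to a file instead of a string, the same block body works inside CSV.open with a path and mode 'w'.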

Wrap specific text with link Nokogiri

I'm using Nokogiri and haven't been able to figure out how to wrap a specific word with a link that I provide.
I have <span class="blah">XSS Attack document</span>
Which I want to change to
<span class="blah"><a href="http://blah.com">XSS</a> Attack document</span>
I know that there's a .wrap() in Nokogiri but it doesn't appear to be able to wrap just the specific XSS text.
By explicitly creating and adding a new node
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
# get the node span
node = html.at_xpath('//span[@class="blah"]')
# change its text content
node.content = node.content.gsub('XSS', '')
# create a node <a>
link = Nokogiri::XML::Node.new('a', html)
link['href'] = 'http://blah.com'
link.content = 'XSS'
# add it before the text
node.children.first.add_previous_sibling(link)
# print it
puts html.to_html
By using inner_html=
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
node = html.at_xpath('//span[@class="blah"]')
node.inner_html = node.content.gsub('XSS', '<a href="http://blah.com">XSS</a>')
puts html.to_html
Both solutions work in this case. However, inner_html= replaces all of the node's children, so any existing child nodes are discarded and the new markup has to be parsed again. When all you need is to add one child node, creating the node explicitly is the better choice, for performance as well.

Nokogiri: how to find a div by id and see what text it contains?

I just started using Nokogiri this morning and I'm wondering how to perform a simple task: I just need to search a webpage for a div like this:
<div id="verify" style="display:none"> site_verification_string </div>
I want my code to look something like this:
require 'nokogiri'
require 'open-uri'
url = h(@user.first_url)
doc = Nokogiri::HTML(open(url))
if #SEARCH_FOR_DIV#.text == site_verification_string
@user.save
end
So the main question is, how do I search for that div using nokogiri?
Any help is appreciated.
html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
For a simple way to get an element by its ID, you can use .at_css("element#id").
Example for finding a div with the id "verify":
# needs require 'open-uri'; on Ruby 3.0+ call URI.open, since plain open no longer fetches URLs
html = Nokogiri::HTML(URI.open("http://example.com"))
puts html.at_css("div#verify")
This returns the div and all the elements it contains, or nil if no such div exists.

Getting attribute's value in Nokogiri to extract link URLs

I have a document which look like this:
<div id="block">
<a href="http://google.com">link</a>
</div>
I can't get Nokogiri to get me the value of href attribute. I'd like to store the address in a Ruby variable as a string.
html = <<HTML
<div id="block">
<a href="http://google.com">link</a>
</div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
Or if you wanna be more specific about the div:
>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> "http://google.com"
doc = Nokogiri::HTML(open("[insert URL here]"))
href = doc.css('#block a')[0]["href"]
The variable href is assigned the value of the "href" attribute of the <a> element inside the element with id 'block'. doc.css('#block a') returns a NodeSet of the matching elements (here, just one). [0] selects the first element, a Nokogiri::XML::Element, and ["href"] reads that element's href attribute, returning the URL as a string.
Having struggled with this question in various forms, I decided to write myself a tutorial disguised as an answer. It may be helpful to others.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
But there's a better way: in the above cases, the .compact is necessary because the searches return the "just a bookmark" element as well. We can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
And what if you want to find the text associated with a particular link?
Not a problem:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
a handy Nokogiri cheat sheet
a tutorial on parsing HTML with Nokogiri
interactively test CSS selector queries
doc = Nokogiri::HTML("HTML ...")
link = doc.at_css("div[id='block'] > a") # at_css returns the first matching element, not a NodeSet
result = link['href'] # => "http://google.com"
data = '<html lang="en" class="">
<head>
<a href="https://example.com/9f40a.css" media="all" rel="stylesheet" /> link1</a>
<a href="https://example.com/4e5fb.css" media="all" rel="stylesheet" />link2</a>
<a href="https://example.com/5s5fb.css" media="all" rel="stylesheet" />link3</a>
</head>
</html>'
Here is my attempt for the sample HTML above:
doc = Nokogiri::HTML(data)
doc.xpath('//@href').map(&:value)
# => ["https://example.com/9f40a.css", "https://example.com/4e5fb.css", "https://example.com/5s5fb.css"]
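document.at_css("#block a")["href"]
where document is the parsed Nokogiri HTML document. Use at_css rather than css here, because css returns a NodeSet, which cannot be indexed by attribute name.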
