Getting attribute's value in Nokogiri to extract link URLs

I have a document which looks like this:
<div id="block">
<a href="http://google.com">link</a>
</div>
I can't get Nokogiri to get me the value of href attribute. I'd like to store the address in a Ruby variable as a string.

html = <<HTML
<div id="block">
<a href="http://google.com">link</a>
</div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
Or, if you want to be more specific about the div:
>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> "http://google.com"

doc = Nokogiri::HTML(URI.open("[insert URL here]")) # needs require 'open-uri'; plain open no longer fetches URLs as of Ruby 3.0
href = doc.css('#block a')[0]["href"]
The variable href is assigned the value of the "href" attribute of the <a> element inside the element with id 'block'. The call doc.css('#block a') returns a NodeSet (an array-like collection) containing every element that matches #block a; here that is a single element. [0] picks that element, a Nokogiri::XML::Element, which can be indexed by attribute name much like a hash. ["href"] looks up the "href" attribute and returns its value, which is a string containing the URL.

Having struggled with this question in various forms, I decided to write myself a tutorial disguised as an answer. It may be helpful to others.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
But there's a better way: in the above cases, the .compact is necessary because the searches return the "just a bookmark" element as well. We can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
And what if you want to find the text associated with a particular link?
Not a problem:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
a handy Nokogiri cheat sheet
a tutorial on parsing HTML with Nokogiri
interactively test CSS selector queries

doc = Nokogiri::HTML("HTML ...")
link = doc.at_css("div[id='block'] > a") # at_css returns the first matching element, not a NodeSet
result = link['href'] # => "http://google.com"

data = '<html lang="en" class="">
<head>
<a href="https://example.com/9f40a.css" media="all" rel="stylesheet" /> link1</a>
<a href="https://example.com/4e5fb.css" media="all" rel="stylesheet" />link2</a>
<a href="https://example.com/5s5fb.css" media="all" rel="stylesheet" />link3</a>
</head>
</html>'
Here is my attempt for the above sample of HTML code:
doc = Nokogiri::HTML(data)
doc.xpath('//@href').map(&:value)
=> ["https://example.com/9f40a.css", "https://example.com/4e5fb.css", "https://example.com/5s5fb.css"]

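document.at_css("#block a")["href"]
where document is the parsed Nokogiri HTML document. (at_css returns the first matching element; a NodeSet returned by css cannot be indexed by attribute name.)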

Related

How can I get all attributes from nokogiri nodeset

Nokogiri::XML::Element has a method called attributes that returns all attributes as a hash. A NodeSet object has no such method, so we need to specify an attribute key to get its value. I know that XPath can extract attributes, but I couldn't think of a solution for the following situation:
Normally, the match element has only one attribute, called match-type:
<D:match match-type="starts-with">appren</D:match>
But now I need to enforce that match-type is the only attribute allowed in this element tag:
<D:match caseless="bogus" match-type="starts-with">appren</D:match>
My idea is to get all the attributes of this element and count the ones other than 'match-type'.
Is there a way to do that? Thanks!
This isn't going to directly answer your question, because it's not clear whether you've tried anything. Instead, this code can be modified to do what you want but you're going to need to figure out what to change:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<a id="some_id" href="/foo/bar/index.html" class='bold'>anchor text</a>
<a id="some_other_id" href="/foo/bar/index2.html" class='italic'>anchor text</a>
</body>
</html>
EOT
doc.search('a').map{ |node| node.keys.reject{ |k| k == 'id' }.map{ |p| node[p].size }.inject(:+) } # => [23, 26]

How to find href from all tags in Nokogiri?

I want to extract the href from all of the tags in some HTML using Nokogiri.
If I have HTML:
<div>
</div>
<link href="/test2"></link>
<map href="/test3"></map>
How should I do this?
You can use this XPath: //@href to get all the href attributes.
Example:
html = Nokogiri::HTML(html_source)
links = html.xpath('//@href').map(&:value)
# => ["/test", "/test2", "/test3"]

How do I scrape HTML between two HTML comments using Nokogiri?

I have some HTML pages where the contents to be extracted are marked with HTML comments like below.
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.
I want to extract the full elements between these two HTML comments:
<div>some text</div>
<div><p>Some more elements</p></div>
I can get the text-only version using this characters callback:
class TextExtractor < Nokogiri::XML::SAX::Document
  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^begin content/ # match starting comment
      @interesting = true
    when /^end content/ # match closing comment
      @interesting = false
    end
  end
  def characters(string)
    @text << string if @interesting
  end
end
I get the text-only version with @text but I need the full HTML stored in @html.
Extracting content between two nodes is not something we'd normally do; usually we want the content inside a particular node. Comments are nodes too; they're just a special type of node.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT
By looking for a comment containing the specified text it's possible to find a starting node:
start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">
Once that's found then a loop is needed that stores the current node, then looks for the next sibling until it finds another comment:
content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
break if contained_node.comment?
content << contained_node
contained_node = contained_node.next_sibling
end
content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"

Wrap specific text with link Nokogiri

I'm using Nokogiri and haven't been able to figure out how to wrap a specific word with a link that I provide.
I have <span class="blah">XSS Attack document</span>
Which I want to change to
<span class="blah"><a href="http://blah.com">XSS</a> Attack document</span>
I know that there's a .wrap() in Nokogiri but it doesn't appear to be able to wrap just the specific XSS text.
By explicitly creating and adding a new node
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
# get the node span
node = html.at_xpath('//span[@class="blah"]')
# change its text content
node.content = node.content.gsub('XSS', '')
# create a node <a>
link = Nokogiri::XML::Node.new('a', html)
link['href'] = 'http://blah.com'
link.content = 'XSS'
# add it before the text
node.children.first.add_previous_sibling(link)
# print it
puts html.to_html
By using inner_html=
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
node = html.at_xpath('//span[@class="blah"]')
node.inner_html = node.content.gsub('XSS', '<a href="http://blah.com">XSS</a>')
puts html.to_html
Both solutions work in this case. When traversing the node tree, however, inner_html= is not the best choice, because it removes all of the node's children and rebuilds them. That also makes it a poor choice in terms of performance when all you need is to add one child node.

Nokogiri: how to find a div by id and see what text it contains?

I just started using Nokogiri this morning and I'm wondering how to perform a simple task: I just need to search a webpage for a div like this:
<div id="verify" style="display:none"> site_verification_string </div>
I want my code to look something like this:
require 'nokogiri'
require 'open-uri'
url = h(@user.first_url)
doc = Nokogiri::HTML(open(url))
if #SEARCH_FOR_DIV#.text == site_verification_string
  @user.save
end
So the main question is, how do I search for that div using nokogiri?
Any help is appreciated.
html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
For a simple way to get an element by its ID you can use .at_css("element#id")
Example for finding a div with the id "verify"
html = Nokogiri::HTML(URI.open("http://example.com")) # needs require 'open-uri'
puts html.at_css("div#verify")
This will get you the div and all the elements it contains, or nil if no such div exists.

Resources