I'm using Nokogiri and haven't been able to figure out how to wrap a specific word with a link that I provide.
I have <span class="blah">XSS Attack document</span>
Which I want to change to
<span class="blah"><a href="http://blah.com">XSS</a> Attack document</span>
I know that there's a .wrap() in Nokogiri but it doesn't appear to be able to wrap just the specific XSS text.
By explicitly creating and adding a new node
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
# get the node span
node = html.at_xpath('//span[@class="blah"]')
# change its text content
node.content = node.content.gsub('XSS', '')
# create a node <a>
link = Nokogiri::XML::Node.new('a', html)
link['href'] = 'http://blah.com'
link.content = 'XSS'
# add it before the text
node.children.first.add_previous_sibling(link)
# print it
puts html.to_html
By using inner_html=
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
node = html.at_xpath('//span[@class="blah"]')
node.inner_html = node.content.gsub('XSS', '<a href="http://blah.com">XSS</a>')
puts html.to_html
Both solutions work in our case. However, when you are traversing the node tree, inner_html= is not the best choice: it replaces all of the node's children, which costs more than necessary when all you need is to add a single child node.
Related
Here is the HTML source I am trying to scrape:
<section class="articles">
<article role="article">
</article>
<article role="article">
</article>
I am trying to scrape the href with this:
require 'open-uri'
require 'nokogiri'
url = "http://www.vg.no/sport/langrenn/"
doc = Nokogiri::HTML(URI.open(url))
doc.css(".articles article").each do |i|
location = i.at_css("a")[:href]
puts location
end
I have tried so many other things, but this seems like it should work. I have been able to scrape content using other selectors on this page, just nothing inside of the <article></article> tags, which contains everything I need.
I was trying to use HtmlAgilityPack to manipulate some HTML. I decided to try CsQuery as well.
The goal is to extract an img tag and its src and reinsert it in front of an h3 tag.
Assume html:
<div class="col-md-6">
<div class="item">
<div class="content galleryItem">
<h3>
Al Shabaab kill at least 29 in latest attacks on Kenyan coast
</h3>
<p> <img alt src="../../../../images/AlShabaab.jpg"></p>
<p>
Al Shabaab killed at least 29 people in two coastal areas of Kenya.</p>
</div>
</div>
</div>
Goal is to move the img in front of the h3
I used the following to strip the style attr from img tags:
Dim csq = CQ.Create(input)
Dim csstyle = csq("img")
Return csstyle.RemoveAttr("style")
Since you did not explicitly tag this VB.NET I'll answer in C#, I hope that's OK:
var cq = CQ.Create(input); // create the CsQuery source
var img = cq["img"]; // image here, img["src"] is its source
img.Remove().InsertBefore(cq["h3"]);// remove it, and add it in front of H3.
Of course, this code could be shorter, but I wanted it to match your literal description.
I have some HTML pages where the contents to be extracted are marked with HTML comments like below.
<html>
.....
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
...
</html>
I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.
I want to extract the full elements between these two HTML comments:
<div>some text</div>
<div><p>Some more elements</p></div>
I can get the text-only version using this characters callback:
class TextExtractor < Nokogiri::XML::SAX::Document
  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end
  def comment(string)
    case string.strip # strip leading and trailing whitespace
    when /^begin content/ # match starting comment
      @interesting = true
    when /^end content/ # match closing comment
      @interesting = false
    end
  end
  def characters(string)
    @text << string if @interesting
  end
end
I get the text-only version with @text but I need the full HTML stored in @html.
Extracting content between two nodes is not something we'd normally do; normally we'd want the content inside a particular node. Comments are nodes too, just a special type of node.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
<div>some text</div>
<div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT
By looking for a comment containing the specified text it's possible to find a starting node:
start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">
Once that's found then a loop is needed that stores the current node, then looks for the next sibling until it finds another comment:
content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
break if contained_node.comment?
content << contained_node
contained_node = contained_node.next_sibling
end
content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"
I just started using Nokogiri this morning and I'm wondering how to perform a simple task: I just need to search a webpage for a div like this:
<div id="verify" style="display:none"> site_verification_string </div>
I want my code to look something like this:
require 'nokogiri'
require 'open-uri'
url = h(@user.first_url)
doc = Nokogiri::HTML(URI.open(url))
if #SEARCH_FOR_DIV#.text == site_verification_string
@user.save
end
So the main question is, how do I search for that div using nokogiri?
Any help is appreciated.
html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
For a simple way to get an element by its ID you can use .at_css("element#id")
Example for finding a div with the id "verify"
html = Nokogiri::HTML(URI.open("http://example.com"))
puts html.at_css("div#verify")
This will get you the div and all the elements it contains
I have a document which looks like this:
<div id="block">
<a href="http://google.com">link</a>
</div>
I can't get Nokogiri to get me the value of href attribute. I'd like to store the address in a Ruby variable as a string.
html = <<HTML
<div id="block">
<a href="http://google.com">link</a>
</div>
HTML
doc = Nokogiri::HTML(html)
doc.xpath('//div/a/@href')
#=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
Or if you wanna be more specific about the div:
>> doc.xpath('//div[@id="block"]/a/@href')
=> [#<Nokogiri::XML::Attr:0x80887798 name="href" value="http://google.com">]
>> doc.xpath('//div[@id="block"]/a/@href').first.value
=> "http://google.com"
doc = Nokogiri::HTML(URI.open("[insert URL here]"))
href = doc.css('#block a')[0]["href"]
The variable href is assigned the value of the "href" attribute of the <a> element inside the element with id 'block'. The call doc.css('#block a') returns a Nokogiri::XML::NodeSet (an array-like collection) of the matching elements, here a single <a>. [0] selects that element, a Nokogiri::XML::Element whose attributes can be read with hash-style [] access. ["href"] looks up the "href" attribute and returns its value, which is a string containing the URL.
Having struggled with this question in various forms, I decided to write myself a tutorial disguised as an answer. It may be helpful to others.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
But there's a better way: in the above cases, the .compact is necessary because the searches return the "just a bookmark" element as well. We can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
And what if you want to find the text associated with a particular link?
Not a problem:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
a handy Nokogiri cheat sheet
a tutorial on parsing HTML with Nokogiri
interactively test CSS selector queries
doc = Nokogiri::HTML("HTML ...")
link = doc.at_css("div[id='block'] > a") # at_css returns the first matching element
result = link['href'] # => "http://google.com"
data = '<html lang="en" class="">
<head>
<a href="https://example.com/9f40a.css" media="all" rel="stylesheet" /> link1</a>
<a href="https://example.com/4e5fb.css" media="all" rel="stylesheet" />link2</a>
<a href="https://example.com/5s5fb.css" media="all" rel="stylesheet" />link3</a>
</head>
</html>'
Here is my attempt for the sample HTML above:
doc = Nokogiri::HTML(data)
doc.xpath('//@href').map(&:value)
=> ["https://example.com/9f40a.css", "https://example.com/4e5fb.css", "https://example.com/5s5fb.css"]
document.at_css("#block a")["href"]
where document is the parsed Nokogiri HTML document.