Can't select <article> selector with Nokogiri - ruby-on-rails

Here is the HTML source I am trying to scrape:
<section class="articles">
<article role="article">
</article>
<article role="article">
</article>
I am trying to scrape the href with this:
require 'open-uri'
require 'nokogiri'
url = "http://www.vg.no/sport/langrenn/"
doc = Nokogiri::HTML(open(url))
doc.css(".articles article").each do |i|
location = i.at_css("a")[:href]
puts location
end
I have tried so many other things, but this seems like it should work. I have been able to scrape content using other selectors on this page, just nothing inside of the <article></article> tags, which contains everything I need.

Related

Get content in href page when crawl data in rails?

I want to crawl data from a website. The page looks like this:
HTML:
<div>
<ul>
<li>Place1</li>
<li>Place2</li>
</ul>
</div>
Inside "http://.../place1":
<div>
<p>Place 1</p>
<img src="...">
<div>
How can I crawl the data behind an href using the 'Nokogiri' gem? (That is, the data on the other page that opens when we click the link.)
When I researched this, I only found ways to crawl data within a single page, not how to crawl the page an href points to. Thanks.
To crawl the page behind an href, you have to make a new request for that URL and parse the response.
...
require 'open-uri'
href = 'http://.../place1'
# On Ruby 3.0+ open-uri no longer patches Kernel#open, so use URI.open
doc = Nokogiri::HTML(URI.open(href))
...
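You can collect all the links with the .css method and then crawl each one like this:
require 'open-uri'
# Collect every href on the page, dropping anchors that have none
links = doc.css('a').map { |link| link['href'] }.compact
links.each do |link|
  # Fetch and parse each linked page (URI.open is required on Ruby 3.0+)
  page = Nokogiri::HTML(URI.open(link))
end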

Rails Nokogiri gem - scrape data using itemprop

I have a div that looks like the following, and I am trying to scrape the data using itemprop, but I can't seem to get it to work.
<div class="information">
<h1 itemprop="title">Some title here</h1>
<span itemprop="addressLocality">St. Inigoes</span>,
<span itemprop="addressRegion">MD</span>
<span itemprop="addressCountry">US</span>
</div>
Without itemprop I can get the data using data.css('.information').css('h1').try(:text), but if I try data.css('meta[@itemprop="title"]') the response I get is null.
So my question is: how can I scrape the data of all the span and h1 elements using itemprop?
You should be able to scrape using the following technique
title = data.at("//h1[@itemprop = 'title']").children.text
addressLocality = data.at("//span[@itemprop = 'addressLocality']").children.text
addressRegion = data.at("//span[@itemprop = 'addressRegion']").children.text
addressCountry = data.at("//span[@itemprop = 'addressCountry']").children.text

NOKOGIRI - Get all the elements that contain a dollar sign

I'm trying to extract the price generically from any shopping site.
What I have tried so far:
price = doc.xpath('//span[contains(text(), "$")]').try(:first).try(:content)
if (!price)
price = doc.xpath('//div[contains(text(), "$")]').try(:first).try(:content)
end
Example of the HTML I tried:
https://jet.com/product/adidas-Real-Madrid-Ball-15-5/a08713c229924ceca7171850680d3e32 (the HTML of this URL)
This is not working very well. What am I doing wrong?
Thank you all.
You may want to look at this question:
Nokogiri: How to select nodes by matching text?
I changed the code from it a little and it worked:
require 'pp'
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>$bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
pp doc.at('p:contains("$")').text.strip

Wrap specific text with link Nokogiri

I'm using Nokogiri and haven't been able to figure out how to wrap a specific word with a link that I provide.
I have <span class="blah">XSS Attack document</span>
Which I want to change to
<span class="blah"><a href="http://blah.com">XSS</a> Attack document</span>
I know that there's a .wrap() in Nokogiri but it doesn't appear to be able to wrap just the specific XSS text.
By explicitly creating and adding a new node:
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
# get the node span
node = html.at_xpath('//span[@class="blah"]')
# change its text content
node.content = node.content.gsub('XSS', '')
# create a node <a>
link = Nokogiri::XML::Node.new('a', html)
link['href'] = 'http://blah.com'
link.content = 'XSS'
# add it before the text
node.children.first.add_previous_sibling(link)
# print it
puts html.to_html
By using inner_html=
require 'nokogiri'
text = '<html> <body> <div> <span class="blah">XSS Attack document</span> </div> </body> </html>'
html = Nokogiri::HTML(text)
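node = html.at_xpath('//span[@class="blah"]')
# replace the word with the link markup; inner_html= parses it into nodes
node.inner_html = node.content.gsub('XSS', '<a href="http://blah.com">XSS</a>')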
puts html.to_html
Both solutions work in this case. When you are traversing the node tree, though, inner_html= is not the best choice: it discards all existing child nodes and re-parses the markup, which is wasteful when all you need is to add one child node.

Nokogiri: how to find a div by id and see what text it contains?

I just started using Nokogiri this morning and I'm wondering how to perform a simple task: I just need to search a webpage for a div like this:
<div id="verify" style="display:none"> site_verification_string </div>
I want my code to look something like this:
require 'nokogiri'
require 'open-uri'
url = h(@user.first_url)
doc = Nokogiri::HTML(open(url))
if #SEARCH_FOR_DIV#.text == site_verification_string
@user.save
end
So the main question is, how do I search for that div using nokogiri?
Any help is appreciated.
html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
A simple way to get an element by its ID is .at_css("element#id").
Example of finding a div with the id "verify":
require 'nokogiri'
require 'open-uri'
html = Nokogiri::HTML(URI.open("http://example.com"))
puts html.at_css("div#verify")
This gives you the div and all the elements it contains, or nil if no such div exists.
