I'm building a web scraper using Nokogiri. Here is the code I currently have:
require 'nokogiri'
require 'open-uri'
require 'pry'
class Scraper
  def get_page
    doc = Nokogiri::HTML(open("http://www.theskimm.com/recent"))
    h = {}
    doc.xpath('//a[@href]').each do |link|
      h[link.text.strip] = link['href']
    end
    puts h
  end
  binding.pry
end
Scraper.new.get_page
This returns a hash of all the URLs on the page (I only pasted the first few entries):
{"Back to Sign Up"=>"/", "SHARE THIS"=>"https://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fwww.theskimm.com%2F2015%2F12%2F07%2Fskimm-for-december-8th-2&display=popup", "theSkimm\nSkimm for December 8th"=>"/", "Trump campaign press release"=>"http://skimmth.is/1SKR0bP", "assault weapons ban"=>"http://skimmth.is/1QbnCO8"}
However, I'd like to grab only the URLs that contain "http://skimmth.is/" as part of the value. What code or regular expression would I need to add to my original Scraper class to select ONLY the URLs with that address?
You can use the contains() function of XPath:
doc.xpath('//a[contains(@href, "http://skimmth.is/")]').map { |e| e.attr(:href) }
=> ["http://skimmth.is/1SKR0bP",
"http://skimmth.is/1QbnCO8",
"http://skimmth.is/1SHBSff",
"http://skimmth.is/1N8dORo",
"http://skimmth.is/1HRwGoO",
"http://skimmth.is/1HRmEUG",
"http://skimmth.is/1NePsmI",
"http://skimmth.is/1IQoJLn",
"http://skimmth.is/1ToQ6T1",
"http://skimmth.is/1IAZ6mW",
"http://skimmth.is/1N7Foy1",
"http://skimmth.is/1m7B6Op",
"http://skimmth.is/1SKBhJW",
"http://skimmth.is/1ToQ6T1",
"http://skimmth.is/1XfpwkX%20",
"http://skimmth.is/1P9rq20"]
You can use if as a statement modifier to check that the value is appropriate before adding it to the hash. For example, update this line:
h[link.text.strip] = link['href']
to
h[link.text.strip] = link['href'] if link['href'] =~ /http:\/\/skimmth\.is\//
FWIW: =~ is the pattern-match operator; it is defined on both Regexp and String, and returns the index of the first match, or nil if there is none.
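A quick illustration of why that works as a condition (values taken from the hash above):

"http://skimmth.is/1SKR0bP" =~ /http:\/\/skimmth\.is\//
# => 0 (index of the match; any Integer is truthy, so the pair is added)
"/relative/path" =~ /http:\/\/skimmth\.is\//
# => nil (falsy, so the pair is skipped)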
I'd like to know if there's a clean way of getting a list of the cookies that a website (URL) uses.
Scenario: a user enters the URL of their website, and the Ruby on Rails application checks for all the cookies that the website uses and returns them. For now, let's assume it's only one URL.
I've tried with these code snippets below, but I'm only getting back one or no cookies:
url = 'http://www.google.com'
r = HTTParty.get(url)
puts r.request.options[:headers].inspect
puts r.code
or
uri = URI('https://www.google.com')
res = Net::HTTP.get_response(uri)
puts "cookies: " + res.get_fields("set-cookie").inspect
puts res.request.options[:headers]["Cookie"].inspect
or with Mechanize gem:
agent = Mechanize.new
page = agent.get("http://www.google.com")
agent.cookies.each { |cooky| puts cooky.to_s }
It doesn't have to be strict Ruby code, just something I can add to Ruby on Rails application without too much hassle.
You should use selenium-webdriver; with it you'll be able to retrieve all the cookies for a given website:
require "selenium-webdriver"
#driver = Selenium::WebDriver.for :firefox #assuming you're using firefox
#driver.get("https://www.google.com/search?q=ruby+get+cookies+from+website&ie=utf-8&oe=utf-8&client=firefox-b-ab")
#driver.manage.all_cookies.each do |cookie|
puts cookie[:name]
end
#cookie handling functions
def add_cookie(name, value)
#driver.manage.add_cookie(name: name, value: value)
end
def get_cookie(cookie_name)
#driver.manage.cookie_named(cookie_name)
end
def get_all_cookies
#driver.manage.all_cookies
end
def delete_cookie(cookie_name)
#driver.manage.delete_cookie(cookie_name)
end
def delete_all_cookies
#driver.manage.delete_all_cookies
end
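A quick sketch of those helpers in use (the cookie name and value are made up for illustration):

add_cookie("session_id", "abc123")
get_cookie("session_id")              # => hash with :name, :value, :path, etc.
get_all_cookies.map { |c| c[:name] }  # => list of all cookie names
delete_cookie("session_id")
delete_all_cookies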
With HTTParty you can do this:
puts HTTParty.get(url).headers["set-cookie"]
Get them as an array with:
puts HTTParty.get(url).headers["set-cookie"].split("; ")
So I wrote some Nokogiri code that works in a standalone .rb file, but when I put it inside a Rails app model it won't iterate and just returns the first value. Here is the code that iterates correctly:
require "rubygems"
require "open-uri"
require "nokogiri"
url = "http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc"
data = Nokogiri::HTML(open(url))
data.css(".li").each do |item|
item_link = item.at_css(".vip")[:href]
item_doc = Nokogiri::HTML(open(item_link))
puts item_doc.at_css("#itemTitle").text.sub! 'Details about', ''
end
Here is the same code in a Rails app, where it only returns the first title it finds:
require "rubygems"
require "open-uri"
require "nokogiri"
class EbayScraper
  attr_accessor :url, :data

  def initialize(url)
    @url = url
  end

  def data
    @data ||= Nokogiri::HTML(open(@url))
  end

  def titles
    data.css(".li").each do |item|
      item_link = item.at_css(".vip")[:href]
      item_data = Nokogiri::HTML(open(item_link))
      return item_data.at_css("#itemTitle").text.sub! 'Details about', ''
    end
  end
end
ebay = EbayScraper.new("http://www.ebay.com/sch/Cars-Trucks-/6001/i.html?_from=R40&_sac=1&_vxp=mtr&_nkw=car+projects&_ipg=200&rt=nc")
titles = ebay.titles
puts titles
Why does the first snippet iterate through the whole list while the second one just returns the first title?
Thanks for your time in advance!
Because you have a return statement inside your loop, which exits the titles method on the first iteration.
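A minimal sketch of the fix: collect the titles with map instead of returning from inside the loop (same selectors as in the question):

def titles
  data.css(".li").map do |item|
    item_link = item.at_css(".vip")[:href]
    item_data = Nokogiri::HTML(open(item_link))
    # sub rather than sub!, so titles without the prefix come back unchanged instead of nil
    item_data.at_css("#itemTitle").text.sub('Details about', '')
  end
end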
Since XPath supports custom functions, I created one to make case-insensitive matching possible:
class XpathFunctions
  def case_insensitive_equals(node_set, str_to_match)
    node_set.find_all do |node|
      node.to_s.downcase == str_to_match.to_s.downcase
    end
  end
end
Testing with this page, however, returns these results:
agent = Mechanize.new
page = agent.get('http://www.angelettiauto.it/parcoveicoli.php').parser
page.xpath("//*[case_insensitive_equals(text(),'Audi')]", XpathFunctions.new).count
# => 1
The expected results would have been 4, because there are 4 Audis listed on the page and I need all of them.
This is of course caused by using an exact match instead of contains(), but I can't figure out where to inject it.
This can be achieved by modifying the case_insensitive_equals method like so:

def case_insensitive_equals(node_set, str_to_match)
  node_set.find_all do |node|
    # include? gives contains() semantics instead of an exact match
    node.to_s.downcase.include?(str_to_match.downcase)
  end
end
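With that change, the lookup from the question should match all four Audi rows described there:

page.xpath("//*[case_insensitive_equals(text(),'Audi')]", XpathFunctions.new).count
# => 4

Since the method now does a substring match, renaming it to something like case_insensitive_contains would keep the XPath expression honest.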
I'm working to integrate UserVoice Single Sign-On with my Rails app. They provide the following class for Ruby:
require 'rubygems'
require 'ezcrypto'
require 'json'
require 'cgi'
require 'base64'
module Uservoice
  class Token
    attr_accessor :data

    USERVOICE_SUBDOMAIN = "FILL IN"
    USERVOICE_SSO_KEY = "FILL IN"

    def initialize(options = {})
      options.merge!({ :expires => (Time.zone.now.utc + 5 * 60).to_s })
      key = EzCrypto::Key.with_password USERVOICE_SUBDOMAIN, USERVOICE_SSO_KEY
      encrypted = key.encrypt(options.to_json)
      @data = Base64.encode64(encrypted).gsub(/\n/, '') # remove line returns, which are annoyingly placed every 60 characters
    end

    def to_s
      @data
    end
  end
end
What I can't figure out is how to use this. I added the file to my lib directory and am running it from the Rails console. I tried:
1.9.3-p125 :013 > Uservoice::Token
=> Uservoice::Token
But I can't get it to actually return anything for the options:
Uservoice::Token.new(:guid => 1, :display_name => "jeff goldmen", :email => "jeff@google.com")
Any ideas on how to actually use this? Thanks.
Looking at the code, it doesn't appear that the initializer (what gets run when you call new) will take just a hash. The method definition looks like this:
def initialize(key, api_key, data)
And it seems to treat the data variable as a hash. You might just need to add the key and api_key values when you instantiate a Token. So a call would look like this:
Uservoice::Token.new(KEY, API_KEY, { guid: 1, display_name: 'foo', email: 'f@b.com' })
I want to traverse some HTML documents with Nokogiri.
After parsing a document, I want the final URL that was actually fetched to be part of my JSON response.
url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(open(url))
...
Nokogiri internally redirects it to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, but I want to have access to it.
I want to know if the "doc" object has access to some URL as attribute or something.
Does someone know a workaround?
By the way, I want the full URL because I'm traversing the HTML to find <img> tags, and some have relative URLs like "/media/image/image.png", which I then adjust using:
URI.join(url, relative_link_url).to_s
The image URL should be:
http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg
Instead of:
http://ow.ly/hh8ri/media/imprensa/2013/01/30979_260_260__trytr.jpg
EDIT: IDEA
class Scraper < Nokogiri::HTML::Document
  attr_accessor :url

  class << self
    def new(url)
      html = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
      self.parse(html).tap do |d|
        url = URI.parse(url)
        response = Net::HTTP.new(url.host, url.port)
        head = response.start do |r|
          r.head url.path
        end
        d.url = head['location']
      end
    end
  end
end
Use Mechanize. The URLs will always be converted to absolute:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://ow.ly/hh8ri'
page.images.map{|i| i.url.to_s}
#=> ["http://www.mp.rs.gov.br/images/imprensa/barra_area.gif", "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"]
Because your example uses OpenURI, that's the library to ask, not Nokogiri. Nokogiri has NO idea where the content came from.
OpenURI can tell you easily:
require 'open-uri'
starting_url = 'http://www.example.com'
final_uri = nil
puts "Starting URL: #{ starting_url }"
doc = open(starting_url) do |io|
  final_uri = io.base_uri
  io.read
end
puts "Final URL: #{ final_uri }"
Which outputs:
Starting URL: http://www.example.com
Final URL: http://www.iana.org/domains/example
base_uri is documented in the OpenURI::Meta module.
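Once you have base_uri, resolving the question's relative image paths is a URI.join away. A sketch, assuming final_uri ended up as the redirected mp.rs.gov.br URL from the question:

require 'uri'

final_uri = URI('http://www.mp.rs.gov.br/imprensa/noticias/id30979.html')
URI.join(final_uri, '/media/imprensa/2013/01/30979_260_260__trytr.jpg').to_s
#=> "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"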
I had the exact same issue recently. What I did was create a class that inherits from Nokogiri::HTML::Document, override the new class method to parse the document, and then save the url in an instance variable with an accessor:
require 'nokogiri'
require 'open-uri'
class Webpage < Nokogiri::HTML::Document
  attr_accessor :url

  class << self
    def new(url)
      html = open(url)
      self.parse(html).tap do |d|
        d.url = url
      end
    end
  end
end
Then you can just create a new Webpage, and it will have access to all the normal methods you would have with a Nokogiri::HTML::Document:
w = Webpage.new("http://www.google.com")
w.url
#=> "http://www.google.com"
w.at_css('title')
#=> #<Nokogiri::XML::Element:0x4952f78 name="title" children=[#<Nokogiri::XML::Text:0x4952cb2 "Google">]>
If you have some relative url that you got from an image tag, you can then make it absolute by passing the return value of the url accessor to URI.join:
relative_link_url = "/media/image/image.png"
=> "/media/image/image.png"
URI.join(w.url, relative_link_url).to_s
=> "http://www.google.com/media/image/image.png"
Hope that helps.
P.S. The title of this question is quite misleading. Something more along the lines of "Accessing URL of Nokogiri HTML document" would be clearer.