How to get a list of website (url) cookies with Ruby - ruby-on-rails

Is there a clean way to get a list of the cookies that a website (URL) uses?
Scenario: a user enters the URL of their website, and the Ruby on Rails application checks for all cookies that the website uses and returns them. For now, assume there is only one URL.
I've tried the code snippets below, but I only get back one cookie or none:
url = 'http://www.google.com'
r = HTTParty.get(url)
puts r.request.options[:headers].inspect
puts r.code
or
uri = URI('https://www.google.com')
res = Net::HTTP.get_response(uri)
puts "cookies: " + res.get_fields("set-cookie").inspect
puts res.request.options[:headers]["Cookie"].inspect
or with Mechanize gem:
agent = Mechanize.new
page = agent.get("http://www.google.com")
agent.cookies.each do |cooky| puts cooky.to_s end
It doesn't have to be strict Ruby code, just something I can add to Ruby on Rails application without too much hassle.

You should use selenium-webdriver. Because it drives a real browser, you'll be able to retrieve all the cookies for a given website, including ones set by JavaScript:
require "selenium-webdriver"
@driver = Selenium::WebDriver.for :firefox # assuming you're using Firefox
@driver.get("https://www.google.com/search?q=ruby+get+cookies+from+website&ie=utf-8&oe=utf-8&client=firefox-b-ab")
@driver.manage.all_cookies.each do |cookie|
  puts cookie[:name]
end

# cookie handling functions
def add_cookie(name, value)
  @driver.manage.add_cookie(name: name, value: value)
end
def get_cookie(cookie_name)
  @driver.manage.cookie_named(cookie_name)
end
def get_all_cookies
  @driver.manage.all_cookies
end
def delete_cookie(cookie_name)
  @driver.manage.delete_cookie(cookie_name)
end
def delete_all_cookies
  @driver.manage.delete_all_cookies
end

With HTTParty you can do this:
puts HTTParty.get(url).headers["set-cookie"]
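To get them as an array with one entry per cookie, ask for the individual header fields (splitting the combined string on "; " would separate a cookie's attributes, not the cookies themselves):
puts HTTParty.get(url).headers.get_fields("set-cookie").inspect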
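Each Set-Cookie response header carries one cookie as name=value followed by attributes such as Path or Expires. A minimal sketch of turning the raw header values (for example, the array that Net::HTTP returns from res.get_fields("set-cookie")) into names and values; the helper name cookie_pairs and the sample headers are made up for illustration:

```ruby
# Hypothetical helper: maps raw Set-Cookie header values to { name => value }.
# It only reads the leading name=value pair and ignores the attributes
# (Path, Expires, HttpOnly, ...).
def cookie_pairs(set_cookie_fields)
  Array(set_cookie_fields).each_with_object({}) do |field, pairs|
    name_value = field.split(";", 2).first.to_s
    name, value = name_value.split("=", 2)
    next if name.nil? || name.strip.empty?
    pairs[name.strip] = value.to_s.strip
  end
end

# Sample header values, invented for this example.
sample = [
  "NID=abc123; expires=Wed, 01-Jan-2025 00:00:00 GMT; path=/; HttpOnly",
  "CONSENT=PENDING+987; path=/; Secure"
]

cookie_pairs(sample).each { |name, value| puts "#{name} = #{value}" }
```

This only sees cookies that arrive in HTTP response headers; cookies created by JavaScript on the page would still need a browser-driven approach such as Selenium.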

Related

Parse Open Graph Data in Rails using Metainspector

I am working on an app where I am required to fetch and save the open graph data of a website.
So far I have been able to grab properties such as title, description, url by using this code
before_save :get_meta_from_link
def check_link
  begin
    @page_link = MetaInspector.new(sanitized_url)
  rescue Faraday::ConnectionFailed => e
    errors.add(:link, "Oops, can't be processed ATM")
  end
end
def get_meta_from_link
  page = @page_link
  return unless page.to_hash.present?
  if page.title.present?
    self.title = page.title
  end
  if page.description.present?
    self.description = page.description
  end
  if page.url.present?
    self.url = page.url
  end
end
I am using the metainspector gem and trying to grab values such as og:locale, og:type. How can I fetch those values?
This is the link I am using to cross reference values: https://metainspectordemo.herokuapp.com
Ok, so I managed to solve it using
def check_link
  begin
    @page_link = MetaInspector.new(sanitized_url)
  rescue MetaInspector::RequestError => e
    errors.add(:link, "you provided is not being read by our system. Please check the link.")
  end
end
in my link model
followed by
def get_meta_from_link
  page = @page_link
  paje = @page_link.meta_tags
  return unless page.to_hash.present?
  if page.title.present?
    self.btitle = page.title
  end
end

Running GET Request Through Rails on separate thread

I have a get request that retrieves JSON needed for graphs to display on a page. I'd do it in JQuery, but because of the API that I am using, it is not possible -- so I have to do it in rails.
I'm wondering this: If I run the get request on a separate thread in the page's action, can the variable then be passed to javascript after the page loads? I'm not sure how threading works in rails.
Would something like this work:
Thread.new do
url = URI.parse("http://api.steampowered.com/IDOTAMatch_570/GetMatchHistory/v001/?key=#{ENV['STEAM_WEB_API_KEY']}&account_id=#{id}&matches_requested=25&game_mode=1234516&format=json")
res = Net::HTTP::get(url)
matchlist = JSON.parse(res)
matches = []
if matchlist['result'] == 1 then
matchlist['result']['matches'].each do |match|
matches.push(GetMatchWin(match['match_id']))
end
end
def GetMatchWin(match_id, id)
match_data = matchlist["result"]["matches"].select {|m| m["match_id"] == match_id}
end
end
end
Given that the above code is in a helper file, and it then gets called in the action for the controller as such:
def index
if not session.key?(:current_user) then
redirect_to root_path
else
gon.winlossdata = GetMatchHistoryRawData(session[:current_user][:uid32])
end
end
The "gon" part is just a gem to pass data to javascript.
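On the threading question, one relevant detail is that a value computed inside Thread.new is only available once the thread has finished, and Thread#value blocks until then. A small sketch with a stand-in for the HTTP call (fetch_matches is a hypothetical placeholder, not the Steam API):

```ruby
# Hypothetical stand-in for the slow HTTP request.
def fetch_matches
  sleep 0.1
  [{ "match_id" => 1 }, { "match_id" => 2 }]
end

worker = Thread.new { fetch_matches }

# Other work in the action could happen here while the thread runs.

# Thread#value joins the thread and returns the block's result, so the
# action still has to wait for the request before it can hand data to the view.
matches = worker.value
puts matches.length
```

So a thread started in the action does not by itself let the page render first; if the data has to arrive after the page loads, it would presumably need to come from a second request to the Rails app.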

Capybara + remote form request

I have a form that I'm testing using Capybara. This form's URL goes to my Braintree sandbox, although I suspect the problem would happen for any remote URL. When Capybara clicks the submit button for the form, the request is routed to the dummy application rather than the remote service.
Here's an example app that reproduces this issue: https://github.com/radar/capybara_remote. Run bundle exec ruby test/form_test.rb and the test will pass, which is not what I'd typically expect.
Why does this happen and is this behaviour that I can rely on always happening?
Mario Visic points out this description in the Capybara documentation:
Furthermore, you cannot use the RackTest driver to test a remote application, or to access remote URLs (e.g., redirects to external sites, external APIs, or OAuth services) that your application might interact with.
But I wanted to know why, so I source dived. Here are my findings:
lib/capybara/node/actions.rb
def click_button(locator)
find(:button, locator).click
end
I don't care about the find here because that's working. It's the click that's more interesting. That method is defined like this:
lib/capybara/node/element.rb
def click
wait_until { base.click }
end
I don't know what base is, but I see the method is defined twice more in lib/capybara/rack_test/node.rb and lib/capybara/selenium/node.rb. The tests are using Rack::Test and not Selenium, so it's probably the former:
lib/capybara/rack_test/node.rb
def click
if tag_name == 'a'
method = self["data-method"] if driver.options[:respect_data_method]
method ||= :get
driver.follow(method, self[:href].to_s)
elsif (tag_name == 'input' and %w(submit image).include?(type)) or
((tag_name == 'button') and type.nil? or type == "submit")
Capybara::RackTest::Form.new(driver, form).submit(self)
end
end
The tag_name is probably not a link -- because it's a button we're clicking -- so it falls to the elsif. It's definitely an input tag with type == "submit", so then let's see what Capybara::RackTest::Form does:
lib/capybara/rack_test/form.rb
def submit(button)
driver.submit(method, native['action'].to_s, params(button))
end
Ok then. driver is probably the Rack::Test driver for Capybara. What's that doing?
lib/capybara/rack_test/driver.rb
def submit(method, path, attributes)
browser.submit(method, path, attributes)
end
What is this mysterious browser? It's defined in the same file thankfully:
def browser
@browser ||= Capybara::RackTest::Browser.new(self)
end
Let's look at what this class's submit method does.
lib/capybara/rack_test/browser.rb
def submit(method, path, attributes)
path = request_path if not path or path.empty?
process_and_follow_redirects(method, path, attributes, {'HTTP_REFERER' => current_url})
end
process_and_follow_redirects does what it says on the box:
def process_and_follow_redirects(method, path, attributes = {}, env = {})
process(method, path, attributes, env)
5.times do
process(:get, last_response["Location"], {}, env) if last_response.redirect?
end
raise Capybara::InfiniteRedirectError, "redirected more than 5 times, check for infinite redirects." if last_response.redirect?
end
So does process:
def process(method, path, attributes = {}, env = {})
new_uri = URI.parse(path)
method.downcase! unless method.is_a? Symbol
if new_uri.host
@current_host = "#{new_uri.scheme}://#{new_uri.host}"
@current_host << ":#{new_uri.port}" if new_uri.port != new_uri.default_port
end
if new_uri.relative?
if path.start_with?('?')
path = request_path + path
elsif not path.start_with?('/')
path = request_path.sub(%r(/[^/]*$), '/') + path
end
path = current_host + path
end
reset_cache!
send(method, path, attributes, env.merge(options[:headers] || {}))
end
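To make the path handling in process easier to follow, here is a rough standalone sketch of just that branch. The function name resolve_path and the example host and paths are made up; in Capybara, current_host and request_path come from the browser object's state:

```ruby
require "uri"

# Sketch of the relative-path handling in process, as a standalone function.
# Absolute URLs are returned untouched; relative ones are resolved against
# the current request path and prefixed with the current host.
def resolve_path(path, current_host:, request_path:)
  uri = URI.parse(path)
  return path unless uri.relative?

  if path.start_with?("?")
    path = request_path + path
  elsif !path.start_with?("/")
    path = request_path.sub(%r{/[^/]*$}, "/") + path
  end
  current_host + path
end

host = "http://www.example.com"
puts resolve_path("?page=2", current_host: host, request_path: "/items/list")
puts resolve_path("edit", current_host: host, request_path: "/items/list")
puts resolve_path("/login", current_host: host, request_path: "/items/list")
puts resolve_path("https://remote.example/pay", current_host: host, request_path: "/items/list")
```

The last call is the interesting one for a remote form: the absolute URL passes through unchanged, so whatever happens to it is decided further down the stack.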
Time to break out the debugger and see what method is here. Sticking a binding.pry before the final line in that method, and a require 'pry' in the test. It turns out method is :post and, for interest's sake, new_uri is a URI object with our remote form's URL.
Where's this post method coming from? method(:post).source_location tells me:
["/Users/ryan/.rbenv/versions/1.9.3-p374/lib/ruby/1.9.1/forwardable.rb", 199]
That doesn't seem right... Does Capybara have a def post somewhere?
capybara (master)★ack "def post"
lib/capybara/rack_test/driver.rb
76: def post(*args, &block); browser.post(*args, &block); end
Cool. We know that browser is a Capybara::RackTest::Browser object. The class beginning gives the next hint:
class Capybara::RackTest::Browser
include ::Rack::Test::Methods
I know that Rack::Test::Methods comes with a post method. Time to dive into that gem.
lib/rack/test.rb
def post(uri, params = {}, env = {}, &block)
env = env_for(uri, env.merge(:method => "POST", :params => params))
process_request(uri, env, &block)
end
Ignoring env_for for the time being, what does process_request do?
lib/rack/test.rb
def process_request(uri, env)
uri = URI.parse(uri)
uri.host ||= @default_host
@rack_mock_session.request(uri, env)
if retry_with_digest_auth?(env)
auth_env = env.merge({
"HTTP_AUTHORIZATION" => digest_auth_header,
"rack-test.digest_auth_retry" => true
})
auth_env.delete('rack.request')
process_request(uri.path, auth_env)
else
yield last_response if block_given?
last_response
end
end
Hey, @rack_mock_session looks interesting. Where's that defined?
rack-test (master)★ack "@rack_mock_session ="
lib/rack/test.rb
40: @rack_mock_session = mock_session
42: @rack_mock_session = MockSession.new(mock_session)
In two places, very close to each other. What's on and around these lines?
def initialize(mock_session)
  @headers = {}
  if mock_session.is_a?(MockSession)
    @rack_mock_session = mock_session
  else
    @rack_mock_session = MockSession.new(mock_session)
  end
  @default_host = @rack_mock_session.default_host
end
Ok then, so it ensures it is a MockSession object. What's MockSession and how is its request method defined?
def request(uri, env)
  env["HTTP_COOKIE"] ||= cookie_jar.for(uri)
  @last_request = Rack::Request.new(env)
  status, headers, body = @app.call(@last_request.env)
  headers["Referer"] = env["HTTP_REFERER"] || ""
  @last_response = MockResponse.new(status, headers, body, env["rack.errors"].flush)
  body.close if body.respond_to?(:close)
  cookie_jar.merge(last_response.headers["Set-Cookie"], uri)
  @after_request.each { |hook| hook.call }
  if @last_response.respond_to?(:finish)
    @last_response.finish
  else
    @last_response
  end
end
I'm going to go right ahead here and assume @app is the Rack application stack. By calling the call method, the request is routed directly to this stack, rather than going out to the world.
I conclude that this behaviour looks intentional and that I can indeed rely on it being that way.
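The effect can be reproduced with a toy version of the idea. This is only a sketch under simplifying assumptions, not rack-test's actual code: ToyMockSession, the lambda app, and the remote.example URL are all invented here.

```ruby
require "uri"

# Toy stand-in for a mock session: it holds a Rack-style app (anything that
# responds to call(env)) and never opens a network connection.
class ToyMockSession
  def initialize(app)
    @app = app
  end

  def request(uri, env = {})
    uri = URI.parse(uri.to_s)
    env = env.merge("PATH_INFO" => uri.path, "SERVER_NAME" => uri.host)
    # The host only ends up as data in env; the app is invoked in-process.
    @app.call(env)
  end
end

# A minimal Rack-style app: returns [status, headers, body].
local_app = lambda do |env|
  [200, { "Content-Type" => "text/plain" },
   ["handled locally: #{env['SERVER_NAME']}#{env['PATH_INFO']}"]]
end

session = ToyMockSession.new(local_app)
status, _headers, body = session.request("https://remote.example/transactions")
puts status
puts body.first
```

Even though the URL names a remote host, the local app answers, because the host is just another value in the environment hash passed to call.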

How to get a full URL given a shortened one passed to Nokogiri?

I want to traverse some HTML documents with Nokogiri.
After getting the XML object, I want to have the last URL used by Nokogiri that fetched a document to be part of my JSON response.
url = "http://ow.ly/hh8ri"
doc = Nokogiri::HTML(open(url))
...
Nokogiri internally redirects it to http://www.mp.rs.gov.br/imprensa/noticias/id30979.html, but I want to have access to it.
I want to know if the "doc" object has access to some URL as attribute or something.
Does someone know a workaround?
By the way, I want the full URL because I'm traversing the HTML to find <img> tags and some have relative ones like: "/media/image/image.png", and then I adjust some using:
URI.join(url, relative_link_url).to_s
The image URL should be:
http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg
Instead of:
http://ow.ly/hh8ri/media/imprensa/2013/01/30979_260_260__trytr.jpg
EDIT: IDEA
class Scraper < Nokogiri::HTML::Document
attr_accessor :url
class << self
def new(url)
html = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
self.parse(html).tap do |d|
url = URI.parse(url)
response = Net::HTTP.new(url.host, url.port)
head = response.start do |r|
r.head url.path
end
d.url = head['location']
end
end
end
end
Use Mechanize. The URLs will always be converted to absolute:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://ow.ly/hh8ri'
page.images.map{|i| i.url.to_s}
#=> ["http://www.mp.rs.gov.br/images/imprensa/barra_area.gif", "http://www.mp.rs.gov.br/media/imprensa/2013/01/30979_260_260__trytr.jpg"]
Because your example is using OpenURI, that's the code to ask, not Nokogiri. Nokogiri has NO idea where the content came from.
OpenURI can tell you easily:
require 'open-uri'
starting_url = 'http://www.example.com'
final_uri = nil
puts "Starting URL: #{ starting_url }"
# URI.open is the form supported on current Rubies; the block's last value becomes doc
doc = URI.open(starting_url) do |io|
  final_uri = io.base_uri
  io.read
end
puts "Final URL: #{ final_uri.to_s }"
Which outputs:
Starting URL: http://www.example.com
Final URL: http://www.iana.org/domains/example
base_uri is documented in the OpenURI::Meta module.
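Once the final URI is known, relative image paths can be resolved against it rather than against the shortened URL. A minimal sketch using only the standard library; the helper name absolutize is made up, and the URLs are the ones from the question:

```ruby
require "uri"

# Resolves each link against the page's final URI (for example the value of
# io.base_uri after redirects). Already-absolute links pass through unchanged.
def absolutize(base_uri, links)
  links.map { |link| URI.join(base_uri.to_s, link).to_s }
end

final_uri = URI.parse("http://www.mp.rs.gov.br/imprensa/noticias/id30979.html")
links = [
  "/media/imprensa/2013/01/30979_260_260__trytr.jpg",
  "http://www.mp.rs.gov.br/images/imprensa/barra_area.gif"
]
puts absolutize(final_uri, links)
```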
I had the exact same issue recently. What I did was to create a class that inherits from Nokogiri::HTML::Document, and then just override the new class method to parse the document, then save the url in an instance variable with an accessor:
require 'nokogiri'
require 'open-uri'
class Webpage < Nokogiri::HTML::Document
attr_accessor :url
class << self
def new(url)
html = open(url)
self.parse(html).tap do |d|
d.url = url
end
end
end
end
Then you can just create a new Webpage, and it will have access to all the normal methods you would have with a Nokogiri::HTML::Document:
w = Webpage.new("http://www.google.com")
w.url
#=> "http://www.google.com"
w.at_css('title')
#=> [#<Nokogiri::XML::Element:0x4952f78 name="title" children=[#<Nokogiri::XML::Text:0x4952cb2 "Google">]>]
If you have some relative url that you got from an image tag, you can then make it absolute by passing the return value of the url accessor to URI.join:
relative_link_url = "/media/image/image.png"
=> "/media/image/image.png"
URI.join(w.url, relative_link_url).to_s
=> "http://www.google.com/media/image/image.png"
Hope that helps.
p.s. the title of this question is quite misleading. Something more along the lines of "Accessing URL of Nokogiri HTML document" would be clearer.

Using OpenUri, how can I get the contents of a redirecting page?

I want to get data from this page:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793
But that page forwards to:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1
So, when I use open, from OpenUri, to try and fetch the data, it throws a RuntimeError error saying HTTP redirection loop:
I'm not really sure how to get that data after it redirects and throws that error.
You need a tool like Mechanize. From its description:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
which is exactly what you need. So,
sudo gem install mechanize
then
require 'mechanize'
agent = Mechanize.new # older releases used WWW::Mechanize
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"
page.content # Get the resulting page as a string
page.body # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri
and you're ready to rock 'n' roll.
The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies they are sending on the first request you will end up in a redirect loop. IMHO it's a crappy implementation on their part.
However, I tried to pass the cookies back to them, but I didn't get it to work, so I can't be completely sure that that is all that's going on here.
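The session-cookie theory can be illustrated without touching the real site. In this sketch the "server" is a lambda invented for the example, and fetch is a hypothetical callable standing in for whatever HTTP client is used; nothing here is confirmed behaviour of the Canada Post site:

```ruby
# Follows redirects while echoing back any cookies the server sets.
# fetch.call(url, cookie_header) must return a hash with :status, :location
# and :set_cookie (an array of "name=value" strings).
def follow_with_cookies(url, fetch, limit: 5)
  cookies = {}
  limit.times do
    header = cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
    response = fetch.call(url, header)
    Array(response[:set_cookie]).each do |pair|
      name, value = pair.split("=", 2)
      cookies[name] = value
    end
    return url unless (300..399).cover?(response[:status])
    url = response[:location]
  end
  raise "redirected more than #{limit} times"
end

# Simulated server: it keeps redirecting to the same URL until the session
# cookie comes back with the request.
fake_server = lambda do |_url, cookie_header|
  if cookie_header.include?("session=1")
    { status: 200, location: nil, set_cookie: [] }
  else
    { status: 302, location: "/track?execution=e1s1", set_cookie: ["session=1"] }
  end
end

puts follow_with_cookies("/track?trackingNumber=123", fake_server)
```

A client that drops the cookie would hit the redirect limit against this simulated server, which matches the loop described above.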
While mechanize is a wonderful tool I prefer to "cook" my own thing.
If you are serious about parsing, take a look at this code. It is used to crawl thousands of sites internationally every day, and as far as I have researched and tweaked, there isn't a more stable approach that also lets you customize it heavily later to suit your needs.
require "open-uri"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
def crawl(url_address)
self.errors = Array.new
begin
begin
url_address = URI.parse(url_address)
rescue URI::InvalidURIError
url_address = URI.decode(url_address)
url_address = URI.encode(url_address)
url_address = URI.parse(url_address)
end
url_address.normalize!
stream = ""
Timeout.timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
if stream.size > 0
url_crawled = URI.parse(stream.base_uri.to_s)
else
self.errors << "Server said status 200 OK but document file is zero bytes."
return
end
rescue Exception => exception
self.errors << exception
return
end
# extract information before html parsing
self.url_posted = url_address.to_s
self.url_parsed = url_crawled.to_s
self.url_host = url_crawled.host
self.status = stream.status
self.content_type = stream.content_type
self.content_encoding = stream.content_encoding
self.charset = stream.charset
if stream.content_encoding.include?('gzip')
document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
document = Zlib::Inflate.inflate(stream.read)
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if !self.charset_guess.blank? && !%w(utf-8 utf8).include?(self.charset_guess.downcase)
document = Iconv.iconv("UTF-8", self.charset_guess, document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end
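The decompression branch of the crawler can be tried in isolation. This is a minimal sketch using only the standard library; decode_body is a made-up name, and it assumes "deflate" bodies are zlib-wrapped, which not every server honours:

```ruby
require "zlib"
require "stringio"

# Decodes a response body according to its Content-Encoding values.
# encodings is an array such as ["gzip"], the shape that OpenURI's
# content_encoding returns.
def decode_body(raw, encodings)
  if encodings.include?("gzip")
    Zlib::GzipReader.new(StringIO.new(raw)).read
  elsif encodings.include?("deflate")
    Zlib::Inflate.inflate(raw)
  else
    raw
  end
end

html = "<html><body>hello</body></html>"
puts decode_body(Zlib.gzip(html), ["gzip"]) == html
puts decode_body(Zlib::Deflate.deflate(html), ["deflate"]) == html
puts decode_body(html, []) == html
```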

Resources