I have a problem requesting a website with the httparty gem: an anti-bot system responds with placeholder content instead of the real page.
Does httparty have any standard methods to alter the request headers (User-Agent, etc.)?
Or can it be done some other way?
I solved this issue some time ago by using the mechanize gem, which has user-agent and cookie support built in.
Quick example:
require 'rubygems'
require 'mechanize'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://google.com/') do |page|
  search_result = page.form_with(:name => 'f') do |search|
    search.q = 'Hello world'
  end.submit

  search_result.links.each do |link|
    puts link.text
  end
end
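To answer the header question directly: HTTParty does accept a :headers hash on each request, so a spoofed User-Agent alone may be enough depending on the site. A minimal sketch (example.com is a placeholder):

require 'httparty'

# HTTParty accepts a :headers hash per request;
# here we only override the User-Agent
response = HTTParty.get(
  'http://example.com/',
  headers: { 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/534.48.3' }
)
puts response.code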
I am writing a simple script to scrape data from this link: https://www.congress.gov/members.
The script goes through each member link, follows it, and scrapes data from the member's page. The script is a .rake file in a Ruby on Rails application.
Below is the script:
require 'mechanize'
require 'date'
require 'json'
require 'openssl'

# Globally disable TLS certificate verification -- the constant name
# below acknowledges that this is a bad idea outside local testing
module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
I_KNOW_THAT_OPENSSL_VERIFY_PEER_EQUALS_VERIFY_NONE_IS_WRONG = nil

task :testing do
  agent = Mechanize.new
  page = agent.get("https://www.congress.gov/members")
  # collect member detail links, then follow only the first two
  page_links = page.links_with(href: %r{^/member/\w+})
  product_links = page_links[0...2]
  products = product_links.map do |link|
    product = link.click
    state = product.search('td:nth-child(1)').text
    website = product.search('.member_website+ td').text
    {
      state: state,
      website: website
    }
  end
  puts JSON.pretty_generate(products)
end
And below is the output when I ran this script/file:
Your regular expression does not match the links, most likely because Mechanize matches against the href exactly as it appears in the document, and those hrefs do not begin with /member.
Try this: page_links = page.links_with(href: %r{.*/member/\w+})
You can test regular expressions here: http://rubular.com/
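When in doubt, it can help to print a few raw hrefs first and shape the pattern around what you actually see (a quick sketch):

require 'mechanize'

agent = Mechanize.new
page = agent.get("https://www.congress.gov/members")
# print the first few hrefs exactly as Mechanize sees them
page.links.first(10).each { |link| puts link.href.inspect }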
I've installed the mechanize gem in a Rails app, and to test it I'm just copying and pasting the code below into the irb console. It logs into the page, and I can put Orange into the search field and submit, but the next page has no content matching "Orange" nor any of the Orange employees that I see in my browser. Does LinkedIn have some security feature that stops this, or am I doing something wrong?
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'

# create agent
agent = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari 4'
}
agent.follow_meta_refresh = true

# visit page
page = agent.get("https://www.linkedin.com/")

# log in
login_form = page.form('login')
login_form.session_key = "email"
login_form.session_password = "password"
page = agent.submit(login_form, login_form.buttons.first)

# get the form
form = agent.page.form_with(:name => "commonSearch")

# fill the form out
form.keywords = 'Orange France'

# get the button you want from the form
button = form.button_with(:value => "Search")

# submit the form using that button
agent.submit(form, button)

agent.page.link_with(:text => "Orange")
=> nil
The problem with Mechanize is that it doesn't execute JavaScript, so content loaded client-side, like the results of this LinkedIn search, never appears in the page it sees.
A workaround is to look at the page's raw body, use a regular expression to pull out the embedded data, and then parse the matches as JSON.
For example:
url = "http://www.linkedin.com/vsearch/p?type=people&keywords=dario+barrionuevo"
results = agent.get(url).body.scan(/\{"person"\:\{.*?\}\}/)
person = results.first # You'd use an each here, but for the example we'll get the first
json = JSON.parse(person)
json['person']['firstName'] # => 'Dario'
json['person']['lastName'] # => 'Barrionuevo'
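Iterating over all of the matches would then look like this (same assumptions as above):

results.each do |fragment|
  person = JSON.parse(fragment)['person']
  puts "#{person['firstName']} #{person['lastName']}"
end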
Here's my problem:
I need to post data from a RoR server to a specific URL on a remote PHP server, but before that I need to authenticate. Any help is much appreciated.
What I have done so far:
# sample data
postparams = {'id' => 1, 'name' => 'Test', 'phone' => '123123123'}
# url  - is in the form http://domain.com/some/somemore
# user - contains the username
# pass - contains the password

require "uri"
require "net/http"

uri = URI(url)
req = Net::HTTP::Post.new(uri.path)
req.set_form_data(postparams)
res = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.request(req)
end

case res
when Net::HTTPSuccess, Net::HTTPRedirection
  # all ok
else
  res.value
end
Obviously I get a 403, because I'm not authorized. How do I authorize?
I also tried my luck with the mechanize gem (below, using the same sample data/vars):

# when not logged in, the URL renders a login form
login_form = agent.get(url).forms.first
login_form.username = user
login_form.password = pass

# submit login form
agent.submit(login_form, login_form.buttons.first)

# not sure how to submit to url..
# note that accessing url will not render the form
# (I can't access it as I did with the login form) - I simply need to post
# postparams to this url... and get the response code..
I think the mechanize gem is your best choice.
Here is an example showing how to post a file to Flickr using mechanize.
Maybe you can easily adapt it to your needs:
require 'rubygems'
require 'mechanize'

abort "#{$0} login passwd filename" if (ARGV.size != 3)

a = Mechanize.new { |agent|
  # Flickr refreshes after login
  agent.follow_meta_refresh = true
}

a.get('http://flickr.com/') do |home_page|
  signin_page = a.click(home_page.link_with(:text => /Sign In/))

  my_page = signin_page.form_with(:name => 'login_form') do |form|
    form.login  = ARGV[0]
    form.passwd = ARGV[1]
  end.submit

  # Click the upload link
  upload_page = a.click(my_page.link_with(:text => /Upload/))

  # We want the basic upload page.
  upload_page = a.click(upload_page.link_with(:text => /basic Uploader/))

  # Upload the file
  upload_page.form_with(:method => 'POST') do |upload_form|
    upload_form.file_uploads.first.file_name = ARGV[2]
  end.submit
end
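For your specific case, once the login form has been submitted the agent holds the session cookie, so a plain POST to the protected URL should then be authorized. A sketch using the url, user, pass and postparams variables from your question:

require 'mechanize'

agent = Mechanize.new

# log in first - the agent keeps the session cookie afterwards
login_form = agent.get(url).forms.first
login_form.username = user
login_form.password = pass
agent.submit(login_form, login_form.buttons.first)

# now POST the data straight to the protected URL
response = agent.post(url, postparams)
puts response.code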
I strongly suggest the ruby rest-client gem.
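For instance, if the PHP endpoint accepts HTTP Basic authentication (an assumption; if it uses a session cookie instead, the mechanize approach above fits better), a minimal sketch would be:

require 'rest-client'

# hypothetical: the remote endpoint honours HTTP Basic auth
response = RestClient::Request.execute(
  method:   :post,
  url:      url,
  user:     user,
  password: pass,
  payload:  postparams
)
puts response.code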
I am trying to create more than one log file on localhost.
One file is sign_in.rb:
require 'mechanize'

@agent = Mechanize.new
page = @agent.get('http://localhost:3000/users/sign_in')
form = page.forms.first
form["user[username]"] = 'admin'
form["user[password]"] = '123456'
@agent.submit(form, form.buttons.first)
pp page
The second is profile_page.rb:
require 'mechanize'
require_relative 'sign_in'

page = @agent.get('http://localhost:3000/users/admin')
form = page.forms.first
form.radiobuttons_with(:name => 'read_permission_level')[1].check
@agent.submit(form, form.buttons.first)
pp page
How can I combine these two files and run them in a loop in order to create more than one log file?
I don't know much about Mechanize, but is there any reason you can't simply combine the two bits of code and put them in a while loop? I don't know how often you need to call Mechanize.new. To make more than one log file, simply open two different files and write to them.
require 'mechanize'
require 'pp'

log1 = File.open("first.log", "w")
log2 = File.open("second.log", "w")

@agent = Mechanize.new

while true
  # @agent = Mechanize.new # not sure if this is needed
  page = @agent.get('http://localhost:3000/users/sign_in')
  form = page.forms.first
  form["user[username]"] = 'admin'
  form["user[password]"] = '123456'
  @agent.submit(form, form.buttons.first)
  PP.pp page, log1

  # @agent = Mechanize.new # not sure if this is needed
  page = @agent.get('http://localhost:3000/users/admin')
  form = page.forms.first
  form.radiobuttons_with(:name => 'read_permission_level')[1].check
  @agent.submit(form, form.buttons.first)
  PP.pp page, log2
end
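If the goal is a fresh log file on every pass rather than two growing ones, you could open the files inside the loop with timestamped names (a hypothetical naming scheme):

while true
  stamp = Time.now.strftime('%Y%m%d%H%M%S')
  File.open("sign_in_#{stamp}.log", "w") do |log|
    page = @agent.get('http://localhost:3000/users/sign_in')
    # ... same sign-in steps as above ...
    PP.pp page, log
  end
  sleep 1 # keep the timestamps unique
end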
My goal is to find the first result in Google's search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon" ...><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code.
How can I do it? I am using the gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of an attribute like this:
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape Google's search results; consider using the Custom Search API instead.
Since mechanize includes nokogiri, you should skip hpricot altogether; it slows your code down unnecessarily, and you are effectively parsing the same page twice.
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
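Since a Mechanize page already wraps a Nokogiri document, you can also query it directly instead of indexing into links (a sketch; picking out result anchors by position or selector varies with Google's markup):

# search_results is a Mechanize::Page, which delegates
# search to the underlying Nokogiri document
search_results.search('a').each do |anchor|
  puts anchor['href']
end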
Instead of converting to a string with url = site.to_s, use url = site[0].attributes['href'].
Try using:
site = doc.search("a[@href]")[16,1]
Watir is a reasonable choice to check the layout of a web page.
require 'rubygems'
require 'watir'

# Launch a browser window and navigate to Google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")

# Log to the console whether a link with href = http://en.wikipedia.org/wiki/Gallon is present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
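To actually collect the link rather than just check for it, something like this should work (assuming the anchor text matches /Gallon/):

# locate the first matching link by text and read its href
puts browser.link(:text => /Gallon/).href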
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first
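For example, with a string like the one from the question (the attributes here are hypothetical):

url = '<a href="http://en.wikipedia.org/wiki/Gallon" class="l"><em>Gallon</em> - Wikipedia</a>'
puts url.split("href=\"").last.split("\"").first
# => http://en.wikipedia.org/wiki/Gallon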