Data Scrape Search Limit - ruby-on-rails

I am using a ruby seed file that scrapes data from APOD (Astronomy Picture of the Day). Since there are thousands of entries, is there a way to limit the scrape to just pull the past 365 images?
Here's the seed code I am using:
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'curl'
require 'fileutils'
BASE = 'http://antwrp.gsfc.nasa.gov/apod/'
FileUtils.mkdir('small') unless File.exist?('small')
FileUtils.mkdir('big') unless File.exist?('big')
f = open 'http://antwrp.gsfc.nasa.gov/apod/archivepix.html'
html_doc = Nokogiri::HTML(f.read)
html_doc.xpath('//b//a').each do |element|
  imgurl = BASE + element.attributes['href'].value
  doc = Nokogiri::HTML(open(imgurl).read)
  doc.xpath('//p//a//img').each do |elem|
    small_img = BASE + elem.attributes['src'].value
    big_img = BASE + elem.parent.attributes['href'].value
    s_img_f = open("small/#{File.basename(small_img)}", 'wb')
    b_img_f = open("big/#{File.basename(big_img)}", 'wb')
    rs_img = Curl::Easy.new(small_img)
    rb_img = Curl::Easy.new(big_img)
    rs_img.perform
    s_img_f.write(rs_img.body_str)
    rb_img.perform
    b_img_f.write(rb_img.body_str)
    s_img_f.close
    puts "Download #{File.basename(small_img)} finished."
    b_img_f.close
    puts "Download #{File.basename(big_img)} finished."
    rs_img.close
    rb_img.close
  end
end
puts "All done."

You can treat the node set like an array and slice it to get the elements between two indices.
Add [0..364] to the node set of links:
html_doc.xpath('//b//a')[0..364].each do |element|
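Applied to the seed file, only the first 365 links get visited (a sketch; it assumes the archive page lists the newest entries first, which is how archivepix.html is ordered):

# Take the 365 most recent entries instead of the whole archive.
html_doc.xpath('//b//a')[0..364].each do |element|
  imgurl = BASE + element.attributes['href'].value
  # ... download the small and big images exactly as before ...
end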

Related

My scraped data is empty (Rails and mechanize)

I am writing a simple script to scrape data from this link: https://www.congress.gov/members.
The script goes through each member's link, follows it, and scrapes data from that page. The script is a .rake file in a Ruby on Rails application.
Below is the script:
require 'mechanize'
require 'date'
require 'json'
require 'openssl'
module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
I_KNOW_THAT_OPENSSL_VERIFY_PEER_EQUALS_VERIFY_NONE_IS_WRONG = nil

task :testing do
  agent = Mechanize.new
  page = agent.get("https://www.congress.gov/members")
  page_links = page.links_with(href: %r{^/member/\w+})
  product_links = page_links[0...2]
  products = product_links.map do |link|
    product = link.click
    state = product.search('td:nth-child(1)').text
    website = product.search('.member_website+ td').text
    {
      state: state,
      website: website
    }
  end
  puts JSON.pretty_generate(products)
end
When I run this script, the output is an empty array.
Your regular expression does not match the links.
Try this: page_links = page.links_with(href: %r{.*/member/\w+})
You can test regular expressions at http://rubular.com/
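The ^ anchor is the culprit: if the hrefs on the page are absolute URLs rather than paths beginning with /member/, a pattern anchored to the start of the string can never match. A quick check (the href value here is hypothetical):

href = "https://www.congress.gov/member/example-member/E000000" # hypothetical href
href =~ %r{^/member/\w+}  #=> nil; anchored to the start of the string
href =~ %r{.*/member/\w+} #=> 0; matches anywhere in the string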

Not able to get authentication token

I tried to use the Twitter API with Ruby for an exact search, but because of the new Twitter API I am not able to access the JSON file.
This is the link:
https://api.twitter.com/1.1/search/tweets.json?q=%23superbowl&result_type=recent
Can you please tell me how to fetch the JSON file? I have posted my .rb file below; please help.
# http://ruby-doc.org/stdlib-2.0.0/libdoc/open-uri/rdoc/OpenURI.html
require 'open-uri'
# https://github.com/flori/json
require 'json'
# http://stackoverflow.com/questions/9008847/what-is-difference-between-p-and-pp
require 'pp'
require 'twitter'
#load 'twitter_config.rb'
# Create a separate config file.
# Encrypt your keys.
client = Twitter::REST::Client.new do |config|
  config.consumer_key = "3OpBStixWSMAzBttp6UWON7QA"
  config.consumer_secret = "MBmfQXoHLY61hYHzGYU8n69sQRQPheXJDejP1SKE2LBdgaqRg4"
  config.access_token = "322718806-qScub4diRzggWUO9DaLKMVKwXZcgrqHD2OFbwQTr"
  config.access_token_secret = "aWAIxQVnqF3nracQ0cC0HbRbSDxlCAaUIICorEAPlxIub"
end
# Construct the URL we'll be calling
print "please enter phrase you want to search"
phrase_value=gets.chomp;
#pets = File.open("https://api.twitter.com/1.1/search/tweets.json?q=#{phrase_value}", "r");
request_uri = "https://api.twitter.com/1.1/search/tweets.json?q=";
request_query = ''
url = "#{request_uri}#{phrase_value}"
url.gsub!(' ','%20')
print url;
# Actually fetch the contents of the remote URL as a String.
buffer = open(url).read
# Parse the JSON string into a plain old Ruby data structure.
result = JSON.parse(buffer)
# Take a random sample of elements by passing a count to .sample().
# It is better to let the server limit the results before sending, but you
# can trim and modify the data however you like with basic Ruby.
result = result.sample(5)
# Loop through each element in 'result' and print some of its attributes.
result.each do |tweet|
  puts "Count Username tweet"
  puts "(#{tweet.url}) #{tweet.user.screen_name} #{tweet.text}"
  #sleep(3)
  #count = count + 1
  # Before following a user, check whether they are already in the list using a boolean.
  client.follow("#{tweet.user.screen_name}")
end
puts "Finished!\n\n"
require 'json'
require 'oauth'
require 'pp'
consumer_key = 'xxxx'
consumer_secret = 'xxxx'
access_token = 'xxxx'
access_token_secret = 'xxxx'
consumer = OAuth::Consumer.new(
  consumer_key,
  consumer_secret,
  site: 'https://api.twitter.com/'
)
endpoint = OAuth::AccessToken.new(consumer, access_token, access_token_secret)
puts "please enter phrase you want to search"
phrase_value=gets.chomp;
request_uri = "https://api.twitter.com/1.1/search/tweets.json?q=#{phrase_value}";
# GET
response = endpoint.get("#{request_uri}")
result = JSON.parse(response.body)
pp result
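One detail the first script handled with gsub(' ', '%20') and this version drops: the search phrase should be URL-encoded before it is interpolated into the query string. A sketch using the standard library:

require 'uri'

phrase_value = gets.chomp
# Properly percent-encode the phrase for use as a query parameter.
request_uri = "https://api.twitter.com/1.1/search/tweets.json?q=" +
              URI.encode_www_form_component(phrase_value)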
Since you are already using the Twitter gem, why not use its search method instead of rolling your own:
require 'twitter'
client = Twitter::REST::Client.new do |config|
  config.consumer_key = "3OpBStixWSMAzBttp6UWON7QA"
  config.consumer_secret = "MBmfQXoHLY61hYHzGYU8n69sQRQPheXJDejP1SKE2LBdgaqRg4"
  config.access_token = "322718806-qScub4diRzggWUO9DaLKMVKwXZcgrqHD2OFbwQTr"
  config.access_token_secret = "aWAIxQVnqF3nracQ0cC0HbRbSDxlCAaUIICorEAPlxIub"
end
print "please enter phrase you want to search"
search_query = gets.chomp;
# #see https://github.com/sferik/twitter/blob/master/examples/Search.md
client.search(search_query, result_type: "recent").each do |tweet|
puts "Count Username tweet"
puts "(#{tweet.url}) #{tweet.user.screen_name} #{tweet.text}";
end
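The search results are enumerable, so the sampling and following from the original script carry over directly (a sketch; take(5) simply caps the number of tweets processed):

client.search(search_query, result_type: "recent").take(5).each do |tweet|
  puts "(#{tweet.url}) #{tweet.user.screen_name} #{tweet.text}"
  client.follow(tweet.user.screen_name) # the same follow call the original script used
end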

How to save pictures from a URL to disk

I want to download pictures from a URL, like http://trinity.e-stile.ru/, and save the images to a directory like "C:\pickaxe\pictures". It is important that I use Nokogiri.
I read similar questions on this site, but I didn't find out how this works and I didn't understand the algorithm.
I wrote code that parses the URL and puts the parts of the webpage source code with an "img" tag into a links object:
require 'nokogiri'
require 'open-uri'

PAGE_URL = "http://trinity.e-stile.ru/"
page = Nokogiri::HTML(open(PAGE_URL)) # parse the page into an object
links = page.css("img")               # node set of <img> elements

puts links.length                      # there are 24 images at this URL
puts
links.each { |i| puts i }              # looks like: <img border="0" alt="" src="/images/kroliku.jpg">
puts
links.each { |link| puts link['src'] } # /images/kroliku.jpg
What method should I use to save the pictures after grabbing the HTML code?
How can I put the images into a directory on my disk?
I changed the code, but it raises an error:
/home/action/.parts/packages/ruby2.1/2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: Name or service not known (SocketError)
This is the code now:
require 'nokogiri'
require 'open-uri'
require 'net/http'

LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://ruby.bastardsbook.com/files/hello-webpage.html"
#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each do |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link['src'].gsub /.*\//, '' # keep the filename only
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
end
You are almost done. The only thing left is to store files. Let’s do it.
LOCATION = 'C:\pickaxe\pictures'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

require 'net/http'

# ... your code with Nokogiri etc. ...

links.each do |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link['src'].gsub /.*\//, '' # keep the filename only
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
end
That’s it.
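As the question discovered, though, Net::HTTP.start(PAGE_URL) is what raises the SocketError: Net::HTTP.start expects a bare host name (plus an optional port), not a full URL, so it tries to resolve the whole "http://www.youtube.com/" string as a host and fails. A minimal sketch of that particular fix:

require 'net/http'
require 'uri'

# Split the page URL into host and port before opening the connection.
uri = URI.parse(PAGE_URL)
Net::HTTP.start(uri.host, uri.port) do |http|
  # request each image path over this one connection
end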
The correct version:
require 'nokogiri'
require 'open-uri'

LOCATION = 'pics'
if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")
links.each do |link|
  uri = URI.join(PAGE_URL, link['src']).to_s # make an absolute URI
  localname = File.basename(link['src'])
  File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(open(uri).read) }
end

How can I create log files in a loop using Mechanize with Ruby

I am trying to create more than one log file for localhost.
The first file is sign_in.rb:
require 'mechanize'

@agent = Mechanize.new
page = @agent.get('http://localhost:3000/users/sign_in')
form = page.forms.first
form["user[username]"] = 'admin'
form["user[password]"] = '123456'
@agent.submit(form, form.buttons.first)
pp page
The second is profile_page.rb:
require 'mechanize'
require_relative 'sign_in'

page = @agent.get('http://localhost:3000/users/admin')
form = page.forms.first
form.radiobuttons_with(:name => 'read_permission_level')[1].check
@agent.submit(form, form.buttons.first)
pp page
How can I combine these two files and run them in a loop in order to create more than one log file?
I don't know much about Mechanize, but is there any reason you can't simply combine the two bits of code and put them in a while loop? I don't know how often you need to call Mechanize.new. To create more than one log file, simply open two different files and write to them.
require 'mechanize'
require 'pp'

log1 = File.open("first.log", "w")
log2 = File.open("second.log", "w")

@agent = Mechanize.new
while true
  # @agent = Mechanize.new # not sure if this is needed
  page = @agent.get('http://localhost:3000/users/sign_in')
  form = page.forms.first
  form["user[username]"] = 'admin'
  form["user[password]"] = '123456'
  @agent.submit(form, form.buttons.first)
  PP.pp page, log1

  # @agent = Mechanize.new # not sure if this is needed
  page = @agent.get('http://localhost:3000/users/admin')
  form = page.forms.first
  form.radiobuttons_with(:name => 'read_permission_level')[1].check
  @agent.submit(form, form.buttons.first)
  PP.pp page, log2
end
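If the goal is a separate log file per pass rather than two fixed files, you could also open a fresh file each time through the loop (a sketch; the run_N naming scheme is made up, and @agent is assumed to be set up as above):

run = 0
loop do
  File.open("run_#{run}.log", "w") do |log|
    page = @agent.get('http://localhost:3000/users/sign_in')
    # ... sign in and fetch the profile exactly as above ...
    PP.pp page, log # each iteration writes to its own file
  end
  run += 1
end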

How to find the href attribute value of an "<a>" tag with Ruby

My goal is to find the first result in the Google search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'

query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body

doc = Hpricot(search_results)
site = doc.search("a")[16, 1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code.
How can I do that? These are the gems I am using:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of an attribute like this:
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape the search results; you should consider using the Custom Search API instead.
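If you do keep scraping, one less brittle alternative to the magic index is to filter the anchors for the first off-site result (a sketch; skipping hrefs that contain "google" is an assumption about which links to ignore):

# Find the first absolute link that does not point back at Google itself.
result = (doc/"a").find do |a|
  href = a.attributes['href'].to_s
  href.start_with?('http') && !href.include?('google')
end
puts result.attributes['href'] if result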
Since Mechanize includes Nokogiri, you should skip Hpricot altogether; it slows your code down unnecessarily, and you are effectively parsing the page twice.
require 'mechanize'

query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)

puts search_results.links[16].href
Instead of converting to a string with url = site.to_s, use url = site[0].attributes['href'].
Try using:
site = doc.search("a[@href]")[16, 1]
Watir is a reasonable choice for checking the layout of a web page.
require 'rubygems'
require 'watir'

# Launch a browser window and navigate to Google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")

# Log to the console whether a link with href = http://en.wikipedia.org/wiki/Gallon is present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first
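That said, string splitting breaks as soon as the attribute order or quoting changes; since Nokogiri ships with Mechanize anyway, parsing the fragment is sturdier (a sketch, assuming url still holds the <a> tag's HTML from the question):

require 'nokogiri'

# Parse the fragment and read the href attribute directly.
link = Nokogiri::HTML(url).at('a')
puts link['href'] if link #=> http://en.wikipedia.org/wiki/Gallon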
