How to save pictures from a URL to disk - ruby-on-rails

I want to download pictures from a URL such as http://trinity.e-stile.ru/ and save them to a directory like "C:\pickaxe\pictures". It is important that I use Nokogiri.
I have read similar questions on this site, but I couldn't work out how it is done or understand the algorithm.
I wrote code that parses the page and collects every img element into a links object:
require 'nokogiri'
require 'open-uri'

PAGE_URL = "http://trinity.e-stile.ru/"

page  = Nokogiri::HTML(open(PAGE_URL)) # parse the page into a Nokogiri document
links = page.css("img")                # node set of every <img> element on the page

puts links.length                      # there are 24 images on this URL
puts
links.each { |i| puts i }              # each looks like: <img border="0" alt="" src="/images/kroliku.jpg">
puts
puts
links.each { |link| puts link['src'] } # e.g. /images/kroliku.jpg
What method should I use to save the pictures after grabbing the HTML?
How can I write the images into a directory on my disk?
I changed the code, but it raises an error:
/home/action/.parts/packages/ruby2.1/2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: Name or service not known (SocketError)
This is the code now:
require 'nokogiri'
require 'open-uri'
require 'net/http'

LOCATION = 'pics'

if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://ruby.bastardsbook.com/files/hello-webpage.html"
#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page  = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")

links.each { |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link.gsub /.*\//, '' # keep only the filename
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}

You are almost done. The only thing left is to store the files. Let's do it.
LOCATION = 'C:\pickaxe\pictures'

if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

require 'net/http'

.... # your code with nokogiri etc.

links.each { |link|
  Net::HTTP.start(PAGE_URL) do |http|
    localname = link.gsub /.*\//, '' # keep only the filename
    resp = http.get link['src']
    open("#{LOCATION}/#{localname}", "wb") do |file|
      file.write resp.body
    end
  end
}
That’s it.
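As written, though, that loop still reproduces the getaddrinfo error from the question: Net::HTTP.start expects a host name (and optionally a port), not a full URL, and link is a Nokogiri node, so the file name has to come from link['src'] rather than link.gsub. A minimal sketch of the Net::HTTP route with those two points adjusted, assuming the image src values are site-relative paths like the /images/kroliku.jpg shown in the output above:

require 'net/http'
require 'uri'

page_uri = URI.parse(PAGE_URL)

Net::HTTP.start(page_uri.host, page_uri.port) do |http|
  links.each do |link|
    src       = link['src']          # e.g. /images/kroliku.jpg
    localname = File.basename(src)
    resp      = http.get(src)        # assumes src is a path on the same host
    File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(resp.body) }
  end
end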

The correct version:
require 'nokogiri'
require 'open-uri'

LOCATION = 'pics'

if !File.exist? LOCATION # create the folder if it does not exist
  require 'fileutils'
  FileUtils.mkpath LOCATION
end

#PAGE_URL = "http://trinity.e-stile.ru/"
PAGE_URL = "http://www.youtube.com/"

page  = Nokogiri::HTML(open(PAGE_URL))
links = page.css("img")

links.each { |link|
  uri       = URI.join(PAGE_URL, link['src']).to_s # make an absolute URI
  localname = File.basename(link['src'])
  File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(open(uri).read) }
}
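A hedged hardening of that loop, in case some <img> tags have no src attribute or a download fails; the rescue clause and the warn messages are additions, not part of the answer:

links.each { |link|
  next unless link['src']                    # skip <img> tags without a src
  uri       = URI.join(PAGE_URL, link['src']).to_s
  localname = File.basename(link['src'])
  begin
    File.open("#{LOCATION}/#{localname}", 'wb') { |f| f.write(open(uri).read) }
  rescue OpenURI::HTTPError, SocketError => e
    warn "Skipping #{uri}: #{e.message}"     # keep going if one image fails
  end
}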

Related

No such file or directory @ rb_sysopen - /app_path/public/10205/18_barcode.png

I am using the barby gem to generate barcodes. I get this error for strings of the form "105811/18". When I use a string like "105811_18" or "105811-18" it works fine and generates the barcode successfully, but it raises the error on the slash character, pointing at the second-to-last line of my method: File.open(file_path, 'wb') { |f| f.write output }. Thanks for your help in advance.
require 'barby'
require 'barby/barcode/code_128'
require 'barby/outputter/png_outputter'

def self.get_bar_code(medical_record_no)
  barcode_digits = medical_record_no.rjust(13, '0')
  barcode        = Barby::Code128.new(barcode_digits)
  output         = Barby::PngOutputter.new(barcode).to_png
  file_path      = File.join(Rails.root, 'public', "#{medical_record_no}_barcode.png")
  File.open(file_path, 'wb') { |f| f.write output }
  return barcode_digits
end
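The slash in "105811/18" is treated as a directory separator, so File.open looks for 18_barcode.png inside a public/105811/ directory that does not exist. A minimal sketch of one way around it, replacing the slash before building the path; the tr-based sanitisation and the underscore replacement are assumptions, not from the question:

def self.get_bar_code(medical_record_no)
  barcode_digits = medical_record_no.rjust(13, '0')
  barcode        = Barby::Code128.new(barcode_digits)
  output         = Barby::PngOutputter.new(barcode).to_png

  # Hypothetical sanitisation: swap the path separator for an underscore so the
  # record number can be used as a flat file name.
  safe_name = medical_record_no.tr('/', '_')
  file_path = File.join(Rails.root, 'public', "#{safe_name}_barcode.png")

  File.open(file_path, 'wb') { |f| f.write output }
  return barcode_digits
end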

CarrierWave Stub Request Error "Not allowed to upload application/octet-stream files"

In a Rails app, I'm writing an RSpec test for an upload method that inherits functionality from CarrierWave. The method itself looks at the URL of a user uploaded image, downloads it, and re-uploads it to S3. In testing, the final step uploads to local storage instead of S3.
I'm trying to modify this process with a stub request. When CarrierWave requests the URL, I intercept with the stub, tell it to return a local file instead.
That part works.
The error comes when CarrierWave tries to upload the stubbed file. This is the error:
CarrierWave::IntegrityError: You are not allowed to upload application/octet-stream files
Here is the stub request.
## rails_helper.rb
...
require 'spec_helper'
require 'rspec/rails'
require 'sidekiq/testing'
require 'capybara/rspec'
require 'webmock/rspec'
...
RSpec.configure do |config|
  ...
  config.before(:each) do |example|
    body_file = File.open(File.expand_path('./spec/fixtures/files/1976.png'))
    stub_request(:get, /fakeimagehost.com/).to_return(status: 200, body: body_file)
  end
end
Here is the RSpec
...
context 'when uploaded image URL does not match AWS URL' do
  let(:image) { build(:user_uploaded_image) }
  let(:image_file) { 'spec/fixtures/files/1976.png' }
  let(:html_string) { "<img src='http://fakeimagehost.com/image.png'>" }
  let(:upload_foreign_image_service) { UploadForeignImageService.new(html_string) }

  it 'returns an array of uploaded images' do
    expect(upload_foreign_image_service.upload_images).to eql([image])
  end
end
...
Here is the relevant portion of the upload method.
...
@uploaded_images = []

def upload_images
  parsed_html = Nokogiri::HTML.fragment(@original_html)
  parsed_html.css('img').each do |element|
    source = element.attributes['src'].value
    if source.match ...
      ...
      @uploaded_images |= [existing_image] if existing_image.present?
    else
      new_image = UserUploadedImage.new
      #### BLOWS UP HERE ####
      new_image.embedded_image.download! source
      ...
    end
  end
  @uploaded_images
end
Happy to provide the UserUploadedImage class as well.
This solved the problem. In the .to_return portion of the stub request, I needed to specify the file type.
stub_request(:get, /fakeimagehost.com/).to_return(status: 200, body: body_file, headers: { 'Content-Type' => 'image/png' })
This headers: {'Content-Type' =>'image/png'} was the missing piece.
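For reference, a consolidated sketch of the before(:each) hook with the fix applied; opening the fixture in binary mode ('rb') is an added assumption, the essential change is the Content-Type header:

config.before(:each) do |example|
  # 'rb' (binary read) is an assumption; the key fix is the Content-Type header,
  # which tells CarrierWave the stubbed body is image/png rather than application/octet-stream.
  body_file = File.open(File.expand_path('./spec/fixtures/files/1976.png'), 'rb')
  stub_request(:get, /fakeimagehost.com/)
    .to_return(status: 200, body: body_file, headers: { 'Content-Type' => 'image/png' })
end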

Data Scrape Search Limit

I am using a ruby seed file that scrapes data from APOD (Astronomy Picture of the Day). Since there are thousands of entries, is there a way to limit the scrape to just pull the past 365 images?
Here's the seed code I am using:
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'curl'
require 'fileutils'

BASE = 'http://antwrp.gsfc.nasa.gov/apod/'

FileUtils.mkdir('small') unless File.exist?('small')
FileUtils.mkdir('big')   unless File.exist?('big')

f = open 'http://antwrp.gsfc.nasa.gov/apod/archivepix.html'
html_doc = Nokogiri::HTML(f.read)

html_doc.xpath('//b//a').each do |element|
  imgurl = BASE + element.attributes['href'].value
  doc = Nokogiri::HTML(open(imgurl).read)
  doc.xpath('//p//a//img').each do |elem|
    small_img = BASE + elem.attributes['src'].value
    big_img   = BASE + elem.parent.attributes['href'].value
    s_img_f = open("small/#{File.basename(small_img)}", 'wb')
    b_img_f = open("big/#{File.basename(big_img)}", 'wb')
    rs_img = Curl::Easy.new(small_img)
    rb_img = Curl::Easy.new(big_img)
    rs_img.perform
    s_img_f.write(rs_img.body_str)
    rb_img.perform
    b_img_f.write(rb_img.body_str)
    s_img_f.close
    puts "Download #{File.basename(small_img)} finished."
    b_img_f.close
    puts "Download #{File.basename(big_img)} finished."
    rs_img.close
    rb_img.close
  end
end
puts "All done."
You can treat the node set like an array and slice out a specific range of elements.
Add [0..364] to the node set of links:
html_doc.xpath('//b//a')[0..364].each do |element|
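An equivalent cap, if you prefer Enumerable methods over index ranges; this assumes the archive page lists the newest entries first, which is how APOD's archive appears to be ordered:

# NodeSet mixes in Enumerable, so take(365) returns the first 365 <a> elements as an Array.
html_doc.xpath('//b//a').take(365).each do |element|
  imgurl = BASE + element.attributes['href'].value
  # ... same download logic as in the question ...
end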

How to grep file names and extensions in a webpage using nokogiri/hpricot and other gems?

I am working on an application where I have to:
1) get all the links of a website, and
2) get the list of all the files and file extensions in each of the web pages/links.
I am done with the first part of it :)
I get all the links of the website with the code below:
require 'rubygems'
require 'spidr'
require 'uri'

Spidr.site('http://testasp.vulnweb.com/') do |spider|
  spider.every_url { |url|
    puts url
  }
end
Now I have to get all the files/file extensions in each of the pages, so I tried the code below:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'spidr'

site = 'http://testasp.vulnweb.com'
in1 = []

Spidr.site(site) do |spider|
  spider.every_url { |url| in1.push url }
end

in1.each do |input1|
  input1 = input1.to_s
  #puts input1
  begin
    doc = Nokogiri::HTML(open(input1))
    doc.traverse do |el|
      [el[:src], el[:href]].grep(/\.(txt|css|gif|jpg|png|pdf)$/i).map { |l| URI.join(input1, l).to_s }.each do |link|
        puts link
      end
    end
  rescue => e
    puts "errrooooooooor"
  end
end
But can anybody guide me on how to parse the links/webpages and get the file extensions in each page?
You might want to take a look at URI.parse. The URI module is part of the Ruby standard library and is a dependency of the spidr gem. Here is an example implementation, with a spec for good measure:
require 'rspec'
require 'uri'

class ExtensionExtractor
  def extract(uri)
    /\A.*\/(?<file>.*\.(?<extension>txt|css|gif|jpg|png|pdf))\z/i =~ URI.parse(uri).path
    { :path => uri, :file => file, :extension => extension }
  end
end

describe ExtensionExtractor do
  before(:all) do
    @css_uri            = "http://testasp.vulnweb.com/styles.css"
    @gif_uri            = "http://testasp.vulnweb.com/Images/logo.gif"
    @gif_uri_with_param = "http://testasp.vulnweb.com/Images/logo.gif?size=350x350"
  end

  describe "Common Extensions" do
    it "should extract CSS files from URIs" do
      file = subject.extract(@css_uri)
      file[:path].should eq @css_uri
      file[:file].should eq "styles.css"
      file[:extension].should eq "css"
    end

    it "should extract GIF files from URIs" do
      file = subject.extract(@gif_uri)
      file[:path].should eq @gif_uri
      file[:file].should eq "logo.gif"
      file[:extension].should eq "gif"
    end

    it "should properly extract extensions even when URIs have parameters" do
      file = subject.extract(@gif_uri_with_param)
      file[:path].should eq @gif_uri_with_param
      file[:file].should eq "logo.gif"
      file[:extension].should eq "gif"
    end
  end
end
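A hedged sketch of how the extractor might be wired into the crawl loop from the question; the in1 and input1 names come from the question's code, while skipping nil attributes and warning on unparseable URIs are added assumptions:

extractor = ExtensionExtractor.new

# Reuses the in1 array of URLs collected with Spidr in the question above.
in1.each do |input1|
  input1 = input1.to_s
  begin
    doc = Nokogiri::HTML(open(input1))
    doc.traverse do |el|
      [el[:src], el[:href]].compact.each do |l|
        info = extractor.extract(URI.join(input1, l).to_s)
        puts "#{info[:path]} -> #{info[:file]} (#{info[:extension]})" if info[:file]
      end
    end
  rescue => e
    # Assumption: skip pages or attribute values that cannot be fetched or parsed.
    warn "Skipping #{input1}: #{e.message}"
  end
end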

FasterCSV: Read Remote CSV Files

I can't seem to get this to work. I want to pull a CSV file from a different webserver to read in my application. This is how I'd like to call it:
url = 'http://www.testing.com/test.csv'
records = FasterCSV.read(url, :headers => true, :header_converters => :symbol)
But that doesn't work. I tried Googling, and all I came up with was an excerpt from Practical Ruby Gems.
So, I tried modifying it as follows:
require 'open-uri'
url = 'http://www.testing.com/test.csv'
csv_url = open(url)
records = FasterCSV.read(csv_url, :headers => true, :header_converters => :symbol)
... and I get a can't convert Tempfile into String error (coming from the FasterCSV gem).
Can anyone tell me how to make this work?
require 'open-uri'

url = 'http://www.testing.com/test.csv'

open(url) do |f|
  f.each_line do |line|
    FasterCSV.parse(line) do |row|
      # Your code here
    end
  end
end
http://www.ruby-doc.org/core/classes/OpenURI.html
http://fastercsv.rubyforge.org/
I would retrieve the file with Net::HTTP, for example, and feed that to FasterCSV.
Extracted from ri Net::HTTP:
require 'net/http'
require 'uri'

url = URI.parse('http://www.example.com/index.html')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/index.html')
}
puts res.body
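To actually feed that response into FasterCSV, a sketch along these lines should work; the header options mirror the ones from the question, and the CSV URL is the placeholder one used throughout this thread:

require 'net/http'
require 'uri'
require 'fastercsv'

url = URI.parse('http://www.testing.com/test.csv')
res = Net::HTTP.start(url.host, url.port) { |http| http.get(url.path) }

# Parse the response body as CSV with the header options the question wanted.
records = FasterCSV.parse(res.body, :headers => true, :header_converters => :symbol)
records.each { |row| puts row[:some_column] } # :some_column is a placeholder header name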
You were close: FasterCSV.read expects a file path, whereas FasterCSV.parse works on the data itself, so use FasterCSV.parse instead of FasterCSV.read:
data = open('http://www.testing.com/test.csv')
records = FasterCSV.parse(data)
I would download it with rio - as easy as:
require 'rio'
require 'fastercsv'
array_of_arrays = FasterCSV.parse(rio('http://www.example.com/index.html').read)
I upload the CSV file with Paperclip, save it to Cloudfiles, and then start processing the file with Delayed_job.
This worked for me:
require 'open-uri'

url = 'http://www.testing.com/test.csv'

open(url) do |file|
  FasterCSV.parse(file.read) do |row|
    # Your code here
  end
end
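If you also want the header handling from the original question, the same approach should accept those options; a hedged variant where :headers and :header_converters come straight from the question:

require 'open-uri'
require 'fastercsv'

url = 'http://www.testing.com/test.csv'

open(url) do |file|
  FasterCSV.parse(file.read, :headers => true, :header_converters => :symbol) do |row|
    # Each row is a FasterCSV::Row; headers are available as symbols.
    puts row.inspect
  end
end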
