Heroku: Unpacking a Gzip file through a rake task fails

I'm using Rails 5.2 with Ruby 2.5.1 and am deploying my app to Heroku.
I ran into problems when running my rake task on Heroku. The task calls an API which responds with a *.gz file, saves it, unzips it, uses the retrieved JSON to populate the database, and finally deletes the *.gz file. The task runs smoothly in development but fails when called in production. The last line printed to the console is 'Unzipping the file...', so my guess is that the issue originates in the zlib library.
companies_list.rake
require 'json'
require 'open-uri'
require 'zlib'
require 'openssl'
require 'action_view'

include ActionView::Helpers::DateHelper

desc 'Updates Company table'
task update_db: :environment do
  start = Time.now
  zip_file_url = 'https://example.com/api/download'
  TEMP_FILE_NAME = 'companies.gz'

  puts 'Creating folders...'
  tempdir = Dir.mktmpdir
  file_path = "#{tempdir}/#{TEMP_FILE_NAME}"

  puts 'Downloading the file...'
  open(file_path, 'wb') do |file|
    open(zip_file_url, { ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE }) do |uri|
      file.write(uri.read)
    end
  end
  puts 'Download complete.'

  puts 'Unzipping the file...'
  gz = Zlib::GzipReader.new(open(file_path))
  output = gz.read
  #companies_array = JSON.parse(output)
  puts 'Unzipping complete.'
  (...)
end
Has anyone else run into similar issues and found a way to get this to work?

Your code snippet does not indicate that you ever close your GzipReader. It is often best to wrap IOs in blocks to ensure they are closed appropriately. Also, Kernel#open may not be the method you want here, so let GzipReader handle opening the file for you and just pass in the file_path. Zlib::GzipReader.open opens the file, yields the reader to the block, and closes everything when the block exits:
Zlib::GzipReader.open(file_path) do |gz|
  output = gz.read
  #companies_array = JSON.parse(output)
end

The issue was linked to the memory limit rather than to Gzip unpacking (that's why the problem only occurred in production).
The solution was using Json::Streamer so that the whole file is not loaded into memory at once.
This is the crucial part (it goes after the code posted in the question):
  puts 'Updating the Company table...'
  streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024) # customize your chunk_size
  streamer.get(nesting_level: 1) do |company|
    (do your stuff with the API data here...)
  end
end
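For context, a minimal sketch of how the streaming version can fit together, reading straight from the gzip stream (assumptions: the json-streamer gem is in the Gemfile, a GzipReader works as the file_io since it responds to read, and Company.create! stands in for whatever persistence logic the task actually runs):
require 'zlib'
require 'json/streamer'

Zlib::GzipReader.open(file_path) do |gz|
  # stream the JSON in chunks instead of loading the whole payload at once
  streamer = Json::Streamer.parser(file_io: gz, chunk_size: 1024)
  streamer.get(nesting_level: 1) do |company|
    Company.create!(company) # hypothetical stand-in for the real upsert
  end
end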

Related

Rails FTP OPEN CSV

I have the following code to connect my Rails app to my FTP server. This works great. However, I want to use open-uri to open the CSV file so I can parse it. Any ideas how to do this? I think it's an easy thing to do, but I'm missing something.
require 'net/ftp'
ftp = Net::FTP.new
ftp.connect("xxx.xxx.xx.xxx",21)
ftp.login("xxxxx","xxxx")
ftp.chdir("/")
ftp.passive = true
puts ftp.list("TEST.csv")
You'll need to use #gettextfile.
A) Get the file to a local temporary file and read its content
# Creating a tmp file can be done differently as well.
# It may also be omitted, in which case `gettextfile`
# will create a file in the current directory.
Dir::Tmpname.create(['TEST', '.csv']) do |file_name|
  ftp.gettextfile('TEST.csv', file_name)
  content = File.read(file_name)
end
B) Pass a block to gettextfile and get the content one line at a time
content = ''
ftp.gettextfile('TEST.csv') do |line|
  content << line
end
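Once content holds the CSV text from either approach, parsing it is a plain CSV call; a small sketch (assuming the first line of TEST.csv holds the headers):
require 'csv'

rows = CSV.parse(content, headers: true)
rows.each do |row|
  puts row.to_h # each row behaves like a hash keyed by header name
end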

How can I download an image from a website using Rails?

I'm using Selenium-Webdriver, OpenUri and Nokogiri to scrape a website. I want to download a particular image from said website to my Ubuntu computer. I tried a few different methods but each of them gives a different error message.
Here's my base code, which opens the website and gets the image url (everything after this I ran in my pry console):
require 'open-uri'
require 'selenium-webdriver'
require 'nokogiri'
require 'uri'

url = "https://www.google.com/"
browser = Selenium::WebDriver.for :chrome
document = open(url).read
parsed_content = Nokogiri::HTML(document)
image = "https://www.google.com" + parsed_content.css('#hplogo').attr('src').value
binding.pry
1) Here's the first thing I tried to download the image:
download = open(image)
IO.copy_stream(download, '~/image.png')
For this, I got the following error:
Errno::ENOENT: No such file or directory # rb_sysopen - ~/image.png from (pry):44:in 'initialize'
As per this question, I tried adding a directory in the code:
FileUtils.mkdir_p(image) unless File.exist?(image)
But I got the same error.
2) Next I tried this:
open('image.png', 'wb') do |file|
  file << open(image).read
end
and this returns
#<File:image.png (closed)>
but the file isn't anywhere on my computer and I can't figure out what that message means.
3) Next I tried
IO.copy_stream(open(image), 'image.png')
which simply returned this:
5482
but again, I have no idea what that means and the file isn't anywhere.
4) Finally I tried
read_image = open(image).read
File.open(image, 'image.png') do |file|
  file.puts read_image
end
which outputs
ArgumentError: invalid access mode image.png
from (pry):53:in 'initialize'
What am I doing wrong? Was I close with any of my approaches?
File.open's second argument is the mode for opening the file:
read_image = open(image).read
File.open('image.png', 'w+') do |file|
file.write read_image
end
Your third variant works fine.
5482 is the length of the file in bytes. The file 'image.png' is saved in the same directory as your .rb file.
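For completeness on the first variant: Ruby does not expand ~ in plain path strings, which is why rb_sysopen reported ENOENT for '~/image.png'. Expanding the path first should make IO.copy_stream work; a small sketch:
download = open(image)                   # open-uri returns an IO-like object
target = File.expand_path('~/image.png') # expands ~ to the real home directory
IO.copy_stream(download, target)         # returns the number of bytes copied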

Generating a CSV and uploading it to S3 when finished in a background job

I'm providing users with the ability to download an extremely large amount of data via CSV. To do this, I'm using Sidekiq and putting the task off into a background job once they've initiated it. What I do in the background job is generate a CSV containing all of the proper data, store it in /tmp, and then call save! on my model, passing the location of the file to the Paperclip attribute, which then sends it off to be stored in S3.
All of this works perfectly fine locally. My problem now lies with Heroku and its habit of only storing files for a short duration, dependent on what node you're on. My background job is unable to find the tmp file that gets saved because of how Heroku deals with these files. I guess I'm searching for a better way to do this. If there's some way that everything can be done in-memory, that would be awesome. The only problem is that Paperclip expects an actual file object as an attribute when you're saving the model. Here's what my background job looks like:
class CsvWorker
  include Sidekiq::Worker

  def perform(report_id)
    puts "Starting the jobz!"
    report = Report.find(report_id)
    items = query_ranged_downloads(report.start_date, report.end_date)
    csv = compile_csv(items)
    update_report(report.id, csv)
  end

  def update_report(report_id, csv)
    report = Report.find(report_id)
    report.update_attributes(csv: csv, status: true)
    report.save!
  end

  def compile_csv(items)
    clean_items = items.compact
    path = File.new("#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv", "w")
    csv_string = CSV.open(path, "w") do |csv|
      csv << ["Item Name", "Parent", "Download Count"]
      clean_items.each do |row|
        if !row.item.nil? && !row.item.parent.nil?
          csv << [
            row.item.name,
            row.item.parent.name,
            row.download_count
          ]
        end
      end
    end
    return path
  end
end
I've omitted the query method for readability's sake.
I don't think Heroku's temporary file storage is the problem here. The warnings around that mostly center on the facts that a) dynos are ephemeral, so anything you write can and will disappear without notice; and b) dynos are interchangeable, so the presence of inter-request tempfiles is a matter of luck when you have more than one web dyno running. However, in no situation do temporary files just vanish while your worker is running.
One thing I notice is that you're actually creating two temporary files with the same name:
> path = File.new("/tmp/filename", "w")
=> #<File:/tmp/filename>
> path.fileno
=> 3
> CSV.open(path, "w") do |csv| csv << %w(foo bar baz); puts csv.fileno end
4
=> nil
You could change the path = line to just set the filename (instead of opening it for writing), and then make update_report open the filename for reading. I haven't dug into what Paperclip does when you give it an empty, already-overwritten, opened-for-writing file handle, but changing that flow may well fix the issue.
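As a rough sketch of that first option (keeping the worker's method names; the CSV body is abbreviated to the columns shown in the question, and the redundant save! is dropped since update_attributes already saves):
def compile_csv(items)
  path = "#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv"
  CSV.open(path, "w") do |csv|
    csv << ["Item Name", "Parent", "Download Count"]
    items.compact.each do |row|
      next if row.item.nil? || row.item.parent.nil?
      csv << [row.item.name, row.item.parent.name, row.download_count]
    end
  end
  path # return just the filename, not an open handle
end

def update_report(report_id, csv_path)
  report = Report.find(report_id)
  File.open(csv_path, "r") do |file|
    report.update_attributes(csv: file, status: true)
  end
end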
Alternately, you could do this in memory instead: generate the CSV as a string and give it to Paperclip as a StringIO. (Paperclip supports certain non-file objects, including StringIOs, using e.g. Paperclip::StringioAdapter.) Try something like:
# returns a CSV as a string
def compile_csv(items)
  CSV.generate do |csv|
    # ...
  end
end

def update_report(report_id, csv)
  report = Report.find(report_id)
  report.update_attributes(csv: StringIO.new(csv), status: true)
  report.save!
end
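For reference, a standalone snippet (with made-up header and row values) showing that CSV.generate simply returns the generated CSV as one string:
require 'csv'

csv_string = CSV.generate do |csv|
  csv << ["Item Name", "Parent", "Download Count"] # header row
  csv << ["Widget", "Catalog", 42]                 # sample data row
end
csv_string # => "Item Name,Parent,Download Count\nWidget,Catalog,42\n"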

How do I copy a file onto a separate server using Net::FTP?

I'm building a Rails app which creates a bookmarklet file for each user upon sign-up. I'd like to save that file onto a remote server, so I'm trying Ruby's Net::FTP, based on "Rails upload file to ftp server".
I tried this code:
require 'net/ftp'
FileUtils.cp('public/ext/files/script.js', 'public/ext/bookmarklets/'+resource.authentication_token )
file = File.open('public/ext/bookmarklets/'+resource.authentication_token, 'a') {|f| f.puts("cb_bookmarklet.init('"+resource.username+"', '"+resource.authentication_token+"', '"+resource.id.to_s+"');$('<link>', {href: '//***.com/bookmarklet/cb.css',rel: 'stylesheet',type: 'text/css'}).appendTo('head');});"); return f }
ftp = Net::FTP.new('www.***.com')
ftp.passive = true
ftp.login(user = '***', psswd = '***')
ftp.storbinary("STOR " + file.original_filename, StringIO.new(file.read), Net::FTP::DEFAULT_BLOCKSIZE)
ftp.quit()
But I'm getting an error that the file variable is nil. I may be doing several things wrong here. I'm pretty new to Ruby and Rails, so any help is welcome.
The block form of File.open does not return the file handle (and even if it did, it would be closed at that point). Perhaps change your code to roughly:
require '…'
FileUtils.cp …
File.open('…', 'a') do |file|
  ftp = …
  ftp.storbinary("STOR #{file.original_filename}", StringIO.new(file.read), Net::FTP::DEFAULT_BLOCKSIZE)
  ftp.quit
end
Alternatively:
require '…'
FileUtils.cp …
filename = '…'
contents = IO.read(filename)
ftp = …
ftp.storbinary("STOR #{filename}", StringIO.new(contents), Net::FTP::DEFAULT_BLOCKSIZE)
ftp.quit
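As a side note, when the content is already on disk you can skip the StringIO wrapper entirely: Net::FTP#putbinaryfile reads and uploads the local file itself. A sketch with placeholder path, host, and credentials:
require 'net/ftp'

filename = 'public/ext/bookmarklets/some_token' # hypothetical local path
ftp = Net::FTP.new('www.example.com')           # placeholder host
ftp.passive = true
ftp.login('user', 'password')                   # placeholder credentials
ftp.putbinaryfile(filename, File.basename(filename))
ftp.quit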

Why does using OpenURI to download a file result in a partial file?

I'm trying to use OpenURI to download a file from S3, and then save it locally so I can send the file as an attachment with ActionMailer.
Something strange is going on. The images being downloaded and attached are corrupt, the bottom parts of the images are missing.
Here's the code:
require 'open-uri'
open("#{Rails.root.to_s}/tmp/#{a.attachment_file_name}", "wb") do |file|
source_url = a.authenticated_url()
io = open(URI.parse(source_url).to_s)
file << io.read
attachments[a.attachment_file_name] = File.read("#{Rails.root.to_s}/tmp/#{a.attachment_file_name}")
end
a is the attachment from ActionMailer.
What can I try next?
It looks like you're trying to read the file before it's been closed, which could leave part of the file buffer unwritten.
I'd do it like this:
require 'open-uri'

source_url = a.authenticated_url()
attachment_file = "#{Rails.root.to_s}/tmp/#{a.attachment_file_name}"

open(attachment_file, "wb") do |file|
  file.print open(source_url, &:read)
end

attachments[a.attachment_file_name] = File.read(attachment_file)
It looks like source_url = a.authenticated_url() will be a string, so parsing the string into a URI and then calling to_s on it is redundant unless URI is doing some normalizing, which I don't think it does.
Based on my sysadmin experience: a side task is cleaning up the downloaded/spooled files. They could be deleted immediately after being attached, or you could have a cron job that runs daily, deleting all spooled files over one day old.
An additional concern is that there is no error handling in case the URL can't be read, which would cause the attachment to fail. Using a temp spool file, you could at least check for the existence of the file. Even better, you should be prepared to handle an exception if the server returns a 400 or 500 error.
To avoid using a temporary spool file try this untested code:
require 'open-uri'
source_url = a.authenticated_url()
attachments[a.attachment_file_name] = open(source_url, &:read)
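Building on the error-handling point above: OpenURI raises OpenURI::HTTPError for 4xx/5xx responses, so a guarded version of the in-memory variant might look like the sketch below (the logging call is illustrative):
require 'open-uri'

source_url = a.authenticated_url()
begin
  attachments[a.attachment_file_name] = open(source_url, &:read)
rescue OpenURI::HTTPError => e
  Rails.logger.error("Attachment download failed for #{source_url}: #{e.message}")
  # decide whether to skip this attachment or re-raise, based on the mailer's needs
end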
