How to read a pdf response from Rails app and save to file with parallel tests? - ruby-on-rails

Ok - I have the following in my test/test_helper.rb:
def read_pdf_from_response(response)
  file = Tempfile.new
  file.write response.body.force_encoding('UTF-8')
  begin
    reader = PDF::Reader.new(file)
    reader.pages.map(&:text).join.squeeze("\n")
  ensure
    file.close
    file.unlink
  end
end
I use it like this in an integration test:
get project_path(project, format: 'pdf')
read_pdf_from_response(response).tap do |pdf|
  assert_match(/whatever/, pdf)
end
This works fine as long as I run a test singly or when running all tests with only one worker, e.g. PARALLEL_WORKERS=1. But tests that use this method will fail intermittently when I run my suite with more than 1 parallel worker. My laptop has 8 cores, so that's normally what it's running with.
Here's the error:
PDF::Reader::MalformedPDFError: PDF malformed, expected 5 but found 96 instead
or sometimes: PDF::Reader::MalformedPDFError: PDF file is empty
The PDF reader is https://github.com/yob/pdf-reader, which hasn't given me any other problems.
The controller that sends the PDF returns like so:
send_file out_file,
  filename: "#{@project.name}.pdf",
  type: 'application/pdf',
  disposition: (params[:download] ? 'attachment' : 'inline')
I can't see why this isn't working. No files should ever have the same name at the same time, since I'm using Tempfile, right? How can I make all this run with parallel tests?

While I cannot confirm why this is happening, the issue may be one of the following:
1. You are forcing the encoding to UTF-8, but PDF documents are binary files, so this conversion can corrupt the PDF.
2. Some of the responses you are receiving are genuinely empty or malformed.
Maybe try this instead:
def read_pdf_from_response(response)
  doc = StringIO.new(response.body.to_s)
  begin
    PDF::Reader.new(doc)
      .pages
      .map(&:text)
      .join
      .squeeze("\n")
  rescue PDF::Reader::MalformedPDFError => e
    # handle issues with the pdf itself
  end
end
This will avoid the file system altogether while still using a compatible IO object and will make sure that the response is read as binary to avoid any conversion conflicts.
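If you would rather keep a Tempfile (for example, because some readers want a real file path), a minimal sketch of the same idea is to open the file in binary mode and drop the force_encoding call entirely. The helper name `write_pdf_to_tempfile` and the sample body below are illustrative, not part of the question's app:

```ruby
require 'tempfile'

# Hypothetical variant of the question's helper: write the response body as
# raw bytes (binmode) instead of forcing UTF-8, so the PDF is not corrupted.
def write_pdf_to_tempfile(body)
  file = Tempfile.new(['response', '.pdf'])
  file.binmode          # treat the PDF as binary, never as text
  file.write(body)
  file.flush
  file.rewind
  file                  # caller is responsible for close/unlink
end

# Stand-in for response.body: a PDF-ish string containing non-UTF-8 bytes.
body = "%PDF-1.4\n\xFF\xFEbinary bytes".b
file = write_pdf_to_tempfile(body)
```

PDF::Reader.new(file.path) could then read it as before; the key difference from the original helper is that no encoding conversion ever touches the bytes.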

Related

rails - Exporting a huge CSV file consumes all RAM in production

So my app exports an 11.5 MB CSV file and uses basically all of the RAM, which never gets freed.
The data for the CSV is taken from the DB, and in the case mentioned above the whole thing is being exported.
I am using Ruby 2.4.1 standard CSV library in the following fashion:
export_helper.rb:
CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  data = Model.scope1(param).scope2(param).includes(:model1, :model2)
  data.each do |item|
    file << [
      item.method1,
      item.method2,
      item.method3
    ]
  end
  # repeat for other models - approx. 5 other similar loops
end
and then in the controller:
generator = ExportHelper::ReportGenerator.new
generator.full_report
respond_to do |format|
  format.csv do
    send_file(
      "#{Rails.root}/full_report.csv",
      filename: 'full_report.csv',
      type: :csv,
      disposition: :attachment
    )
  end
end
After a single request, the puma processes occupy 55% of the whole server's RAM and stay like that until the server eventually runs out of memory completely.
For instance, in this article, generating a million-line, 75 MB CSV file required only 1 MB of RAM. But there was no DB querying involved there.
The server has 1015 MB RAM + 400 MB of swap memory.
So my questions are:
What exactly consumes so much memory? Is it the CSV generation or the communication with the DB?
Am I doing something wrong and missing a memory leak? Or is it just how the library works?
Is there a way to free up the memory without restarting the puma workers?
Thanks in advance!
Instead of each you should be using find_each, which is built specifically for cases like this: it instantiates the models in batches and releases them afterwards, whereas each instantiates all of them at once.
CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  Model.scope1(param).find_each do |item|
    file << [
      item.method1
    ]
  end
end
Furthermore you should stream the CSV instead of writing it to memory or disk before sending it to the browser:
format.csv do
  headers["Content-Type"] = "text/csv"
  headers["Content-disposition"] = "attachment; filename=\"full_report.csv\""

  # streaming headers
  # nginx doc: Setting this to "no" will allow unbuffered responses suitable
  # for Comet and HTTP streaming applications
  headers['X-Accel-Buffering'] = 'no'
  headers["Cache-Control"] ||= "no-cache"

  # Rack::ETag 2.2.x no longer respects 'Cache-Control'
  # https://github.com/rack/rack/commit/0371c69a0850e1b21448df96698e2926359f17fe#diff-1bc61e69628f29acd74010b83f44d041
  headers["Last-Modified"] = Time.current.httpdate
  headers.delete("Content-Length")
  response.status = 200

  header = ['Method 1', 'Method 2']
  csv_options = { col_sep: ";" }

  csv_enumerator = Enumerator.new do |y|
    y << CSV::Row.new(header, header).to_s(csv_options)
    Model.scope1(param).find_each do |item|
      y << CSV::Row.new(header, [item.method1, item.method2]).to_s(csv_options)
    end
  end

  # setting the body to an enumerator; Rails will iterate this enumerator
  self.response_body = csv_enumerator
end
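A plain-Ruby sketch of why this keeps memory flat: the enumerator yields one formatted row per iteration, so rows are produced only as fast as the server consumes the body. Here the Model query is replaced by a small in-memory array, and the options hash is splatted with `**` to suit current Ruby versions:

```ruby
require 'csv'

header = ['Method 1', 'Method 2']
csv_options = { col_sep: ';' }

# Stand-in for Model.scope1(param).find_each
items = [['a1', 'b1'], ['a2', 'b2']]

csv_enumerator = Enumerator.new do |y|
  y << CSV::Row.new(header, header).to_s(**csv_options)
  items.each { |m1, m2| y << CSV::Row.new(header, [m1, m2]).to_s(**csv_options) }
end

# Pulling a single value runs the block only up to the first yield:
first_row = csv_enumerator.next   # the header line; no data rows built yet
```

When Rails iterates this as the response body, each row is written to the socket and becomes garbage-collectable immediately.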
Apart from using find_each, you should try running the ReportGenerator code in a background job with ActiveJob. As background jobs run in separate processes, when they are killed the memory is released back to the OS.
So you could try something like this:
1. A user requests some report (CSV, PDF, Excel).
2. A controller enqueues a ReportGeneratorJob, and a confirmation is displayed to the user.
3. The job is performed and an email is sent with the download link/file.
Beware though: you can easily improve the ActiveRecord side, but when sending the response through Rails it will all end up in a memory buffer in the Response object: https://github.com/rails/rails/blob/master/actionpack/lib/action_dispatch/http/response.rb#L110
You also need to make use of the live streaming feature to pass the data to the client directly without buffering: https://guides.rubyonrails.org/action_controller_overview.html#live-streaming-of-arbitrary-data

How to make Ruby Net::SFTP not block, so I can begin streaming a remote download immediately?

How can I acquire a remote file using Net::SFTP and stream it without waiting for the entire file to download?
My test.zip file is 1GB in size. When I run this code my browser does nothing for several minutes and then the download finally begins. I'd like it to begin streaming the file sooner than that. I have to use Net::SFTP or something like it since the file is only available via SSH or SFTP.
I've also tried Net::SFTP's download and download! methods and got the same result.
headers['Content-Type'] = 'application/zip'
headers['Content-disposition'] = "attachment; filename=test.zip"
self.response_body = Enumerator.new do |lines|
  Net::SFTP.start('example.com', 'foo', keys: ['~/.ssh/id_rsa.pub']) do |sftp|
    file = sftp.file.open('/path/to/test.zip')
    lines << file.read(4096) until file.eof?
  end
end

How can I ZIP and stream many files without appending to memory on Rails5/Ruby2.4? [duplicate]

I need to serve some data from my database in a zip file, streaming it on the fly such that:
I do not write a temporary file to disk
I do not compose the whole file in RAM
I know that I can do streaming generation of zip files to the filesystem using ZipOutputStream as here. I also know that I can do streaming output from a rails controller by setting response_body to a Proc as here. What I need (I think) is a way of plugging those two things together. Can I make rails serve a response from a ZipOutputStream? Can I get ZipOutputStream to give me incremental chunks of data that I can feed into my response_body Proc? Or is there another way?
Short Version
https://github.com/fringd/zipline
Long Version
jo5h's answer didn't work for me in Rails 3.1.1, but I found a YouTube video that helped:
http://www.youtube.com/watch?v=K0XvnspdPsc
The crux of it is creating an object that responds to each. This is what I did:
class ZipGenerator
  def initialize(model)
    @model = model
  end

  def each(&block)
    output = Object.new
    output.define_singleton_method :tell, Proc.new { 0 }
    output.define_singleton_method :pos=, Proc.new { |x| 0 }
    output.define_singleton_method :<<, Proc.new { |x| block.call(x) }
    output.define_singleton_method :close, Proc.new { nil }

    Zip::IoZip.open(output) do |zip|
      @model.attachments.all.each do |attachment|
        zip.put_next_entry "#{attachment.name}.pdf"
        file = attachment.file.file.send :file
        file = File.open(file) if file.is_a? String
        while buffer = file.read(2048)
          zip << buffer
        end
      end
    end
    sleep 10
  end
end

def getzip
  self.response_body = ZipGenerator.new(@model)

  # this is a hack to prevent middleware from buffering
  headers['Last-Modified'] = Time.now.to_s
end
EDIT:
The above solution didn't ACTUALLY work. The problem is that rubyzip needs to jump around the file to rewrite the headers for entries as it goes; in particular, it needs to write the compressed size BEFORE it writes the data. That is simply not possible in a truly streaming situation, so ultimately this task may be impossible. There is a chance it might be possible to buffer a whole file at a time, but that seemed less worth it. Ultimately I just wrote to a tmp file (on Heroku I can write to Rails.root/tmp). Less instant feedback, and not ideal, but necessary.
ANOTHER EDIT:
I got another idea recently: we COULD know the compressed size of the files if we do not compress them. The plan goes something like this: subclass the ZipStreamOutput class so that it:
1. always uses the "stored" compression method - in other words, does not compress
2. never seeks backwards to change file headers; gets it all right up front
3. rewrites any code related to the TOC that seeks
I haven't tried to implement this yet, but will report back if there's any success.
OK, ONE LAST EDIT:
In the zip standard: http://en.wikipedia.org/wiki/Zip_(file_format)#File_headers
they mention that there's a bit you can flip to put the size, compressed size, and CRC AFTER a file's data. So my new plan was to subclass ZipOutputStream so that it:
1. sets this flag
2. writes sizes and CRCs after the data
3. never rewinds the output
Furthermore, I needed to fix up all the hacks required to stream output in Rails.
Anyway, it all worked! Here's a gem:
https://github.com/fringd/zipline
I had a similar issue. I didn't need to stream directly, but only had your first case of not wanting to write a temp file. You can easily modify ZipOutputStream to accept an IO object instead of just a filename.
module Zip
  class IOOutputStream < ZipOutputStream
    def initialize(io)
      super '-'
      @outputStream = io
    end

    def stream
      @outputStream
    end
  end
end
From there, it should just be a matter of using the new Zip::IOOutputStream in your Proc. In your controller, you'd probably do something like:
self.response_body = proc do |response, output|
  Zip::IOOutputStream.open(output) do |zip|
    my_files.each do |file|
      zip.put_next_entry file
      zip << IO.read(file)
    end
  end
end
It is now possible to do this directly:
class SomeController < ApplicationController
  def some_action
    compressed_filestream = Zip::ZipOutputStream.write_buffer do |zos|
      zos.put_next_entry "some/filename.ext"
      zos.print data
    end
    compressed_filestream.rewind

    respond_to do |format|
      format.zip do
        send_data compressed_filestream.read, filename: "some.zip"
      end
    end
    # or some other return of send_data
  end
end
This is the link you want:
http://info.michael-simons.eu/2008/01/21/using-rubyzip-to-create-zip-files-on-the-fly/
It builds and generates the zipfile using ZipOutputStream and then uses send_file to send it directly out from the controller.
Use chunked HTTP transfer encoding for the output: send the HTTP header "Transfer-Encoding: chunked" and restructure the output according to the chunked encoding specification, so there is no need to know the resulting ZIP file's size at the beginning of the transfer. This can be coded easily in Ruby with the help of Open3.popen3 and threads.
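As a rough illustration of the framing this answer refers to (the helper name and sample data below are made up for the sketch): each chunk is its byte size in hex, then CRLF, the data, and another CRLF, with a zero-size chunk terminating the stream.

```ruby
# Hypothetical helper: wrap one piece of ZIP output as an HTTP chunk.
def to_chunk(data)
  format("%x\r\n%s\r\n", data.bytesize, data)
end

TERMINATOR = "0\r\n\r\n"   # zero-size chunk ends the chunked body

# Stand-ins for incremental ZIP output ("PK\x03\x04" is the local file
# header signature that starts every zip entry).
zip_parts = ["PK\x03\x04", "...entry data..."]
stream = zip_parts.map { |part| to_chunk(part) }.join + TERMINATOR
```

A server writing `stream` piece by piece never needs a Content-Length header, which is exactly why this fits on-the-fly ZIP generation.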

Generating a CSV and uploading it to S3 when finished in a background job

I'm providing users with the ability to download an extremely large amount of data via CSV. To do this, I'm using Sidekiq and putting the task off into a background job once they've initiated it. What I've done in the background job is generate a csv containing all of the proper data, storing it in /tmp and then call save! on my model, passing the location of the file to the paperclip attribute which then goes off and is stored in S3.
All of this works perfectly fine locally. My problem now lies with Heroku and its ability to store files only for a short duration, depending on which node you're on. My background job is unable to find the tmp file that gets saved because of how Heroku deals with these files. I guess I'm searching for a better way to do this. If there's some way that everything can be done in-memory, that would be awesome. The only problem is that paperclip expects an actual file object as an attribute when you're saving the model. Here's what my background job looks like:
class CsvWorker
  include Sidekiq::Worker

  def perform(report_id)
    puts "Starting the jobz!"
    report = Report.find(report_id)
    items = query_ranged_downloads(report.start_date, report.end_date)
    csv = compile_csv(items)
    update_report(report.id, csv)
  end

  def update_report(report_id, csv)
    report = Report.find(report_id)
    report.update_attributes(csv: csv, status: true)
    report.save!
  end

  def compile_csv(items)
    clean_items = items.compact
    path = File.new("#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv", "w")
    csv_string = CSV.open(path, "w") do |csv|
      csv << ["Item Name", "Parent", "Download Count"]
      clean_items.each do |row|
        if !row.item.nil? && !row.item.parent.nil?
          csv << [
            row.item.name,
            row.item.parent.name,
            row.download_count
          ]
        end
      end
    end
    return path
  end
end
I've omitted the query method for readability's sake.
I don't think Heroku's temporary file storage is the problem here. The warnings around it mostly center on the facts that a) dynos are ephemeral, so anything you write can and will disappear without notice; and b) dynos are interchangeable, so the presence of inter-request tempfiles is a matter of luck when you have more than one web dyno running. However, in no situation do temporary files just vanish while your worker is running.
One thing I notice is that you're actually creating two temporary files with the same name:
> path = File.new("/tmp/filename", "w")
=> #<File:/tmp/filename>
> path.fileno
=> 3
> CSV.open(path, "w") do |csv| csv << %w(foo bar baz); puts csv.fileno end
4
=> nil
You could change the path = line to just set the filename (instead of opening it for writing), and then make update_report open the filename for reading. I haven't dug into what Paperclip does when you give it an empty, already-overwritten, opened-for-writing file handle, but changing that flow may well fix the issue.
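A minimal sketch of that flow, using a throwaway directory in place of Rails.root/tmp/uploads (the filename and rows here are illustrative):

```ruby
require 'csv'
require 'tmpdir'

contents = nil
Dir.mktmpdir do |dir|
  # path is a plain string, so CSV.open creates the only file handle
  path = File.join(dir, 'downloads_by_title.csv')

  CSV.open(path, 'w') do |csv|
    csv << ['Item Name', 'Parent', 'Download Count']
    csv << ['Some Item', 'Some Parent', 3]
  end

  # what update_report would then do: reopen the finished file by name
  contents = File.read(path)
end
```

With only one handle ever writing the file, there is no second, empty `File.new` handle for Paperclip to trip over.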
Alternately, you could do this in memory instead: generate the CSV as a string and give it to Paperclip as a StringIO. (Paperclip supports certain non-file objects, including StringIOs, using e.g. Paperclip::StringioAdapter.) Try something like:
# returns a CSV as a string
def compile_csv(items)
  CSV.generate do |csv|
    # ...
  end
end

def update_report(report_id, csv)
  report = Report.find(report_id)
  report.update_attributes(csv: StringIO.new(csv), status: true)
  report.save!
end

Length of uploaded file in Ruby on Rails decreases after UploadedFile.read

On a RoR app that I've inherited, a test involving a file upload is failing. The assertion that fails looks like this:
assert_equal File.size("#{RAILS_ROOT}/test/fixtures/#(unknown)"), @candidate.picture.length
It fails with (the test file is 69 bytes):
<69> expected but was <5>.
This is after a post using:
fixture_file_upload(filename, content_type, :binary)
In the candidate model, the uploaded file is assigned to a property that is then saved to a mediumblob in MySQL. It looks to me like the uploaded file is 69 bytes, but after it is assigned to the model property (using UploadedFile.read), the length is showing as only 5 bytes.
So this code:
puts "file.length=" + file.length.to_s
self.picture = file.read
puts "self.picture.length=" + self.picture.length.to_s
results in this output:
file.length=69
self.picture.length=5
I'm at a bit of a loss as to why this is, any ideas?
This came down to a Windows/Ruby idiosyncrasy, where reading the file appeared to be happening in text mode. There is an extension in this app in test_helper, something like:
class ActionController::TestUploadedFile
  # Awkward but necessary for testing, since an ActionController::UploadedFile subtype is expected
  include ActionController::UploadedFile

  def read
    tempfile = File.new(self.path)
    tempfile.read
  end
end
And apparently, on Windows, there is a specific IO method that can be called to force the file into binary mode. Calling this method on the tempfile, like so:
tempfile.binmode
caused everything to work as expected, with the read from the UploadedFile matching the size of the fixture file on disk.
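A small sketch of what the fix guards against: on Linux/macOS text and binary mode behave the same, but on Windows the binmode call is what keeps the byte count intact, because text mode treats byte 0x1A as an end-of-file marker (the file names here are whatever Tempfile generates; nothing is specific to this app).

```ruby
require 'tempfile'

# Write all 256 possible byte values, including 0x1A, which Windows
# text-mode reads would treat as end-of-file.
bytes = (0..255).map(&:chr).join.b

source = Tempfile.new('upload')
source.binmode
source.write(bytes)
source.flush

# Mirror the test helper's read, plus the fix: binmode before reading.
tempfile = File.new(source.path)
tempfile.binmode
data = tempfile.read
```

With binmode in place, `data.bytesize` matches the 256 bytes written on every platform; without it, a Windows read would stop at the 0x1A byte, which is exactly the 69-vs-5 length mismatch the question describes.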
