I have been trying to figure out what's going on with my Rails app's memory usage when parsing XML with Nokogiri. For some reason, this one function alone consumes about 1GB of memory and does not release it when it completes. I'm not quite sure what's happening here.
    def split_nessus
      # To avoid consuming too much memory, we're going to split the Nessus file
      # if it's over 10MB into multiple smaller files.
      file_size = File.size(@nessus_files[0]).to_f / 2**20
      files = []
      if file_size >= 10
        file_num = 1
        d = File.open(@nessus_files[0])
        content = Nokogiri::XML(d)
        data = Nokogiri::XML("<data></data>")
        hosts_num = 1
        content.xpath("//ReportHost").each do |report_host|
          data.root << report_host
          hosts_num += 1
          if hosts_num == 100
            File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") {|f| f.write(data.to_xml)}
            files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
            data = Nokogiri::XML("<data></data>")
            hosts_num = 1
            file_num += 1
          end
        end
        @nessus_files = files
      end
    end
Since Rails crashes when trying to parse a 100MB+ XML file, I've decided to break XML files into separate files if they're over 10MB, and handle them individually.
Any thoughts as to why this will not release about 1GB of memory when it's completed?

Nokogiri uses system libraries like libxml and libxslt under the hood. Because of that I would assume that it's probably not an issue in Ruby's garbage collection but somewhere else.
If you are working with large files, it's usually a good idea to stream-process them so that you do not load the whole file into memory. Loading it whole is in itself expensive, because large strings consume a lot of memory.
Because of this, when working with large XML files, you should use a stream parser. In Nokogiri this is Nokogiri::XML::SAX.
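A minimal sketch of the streaming idea. To keep it runnable without extra gems it uses REXML's stream listener rather than Nokogiri; with Nokogiri the analogous handler would subclass Nokogiri::XML::SAX::Document and be passed to Nokogiri::XML::SAX::Parser. The `ReportHostListener` class, the `name` attribute and the sample XML are assumptions made for illustration, not taken from the question.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Collects the name attribute of each <ReportHost> as the parser walks the
# input. No document tree is built; the handler only sees one tag at a time.
class ReportHostListener
  include REXML::StreamListener

  attr_reader :host_names

  def initialize
    @host_names = []
  end

  # Called by the parser for every opening tag; attrs is a Hash.
  def tag_start(name, attrs)
    @host_names << attrs['name'] if name == 'ReportHost'
  end
end

# Illustrative input; a real Nessus file would be passed as an open File.
xml = StringIO.new(<<~XML)
  <Report>
    <ReportHost name="10.0.0.1"><ReportItem/></ReportHost>
    <ReportHost name="10.0.0.2"><ReportItem/></ReportHost>
  </Report>
XML

listener = ReportHostListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.host_names.inspect
```

The same handler shape works for splitting a file: instead of collecting names, the callbacks would write each host out as it is encountered.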


rails - Exporting a huge CSV file consumes all RAM in production

So my app exports an 11.5 MB CSV file and uses basically all of the RAM, which never gets freed.
The data for the CSV is taken from the DB, and in the case mentioned above the whole thing is being exported.
I am using the standard CSV library from Ruby 2.4.1 in the following fashion:
export_helper.rb:

    CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
      data = Model.scope1(param).scope2(param).includes(:model1, :model2)
      data.each do |item|
        file << [
          # ...
        ]
      end
      # repeat for other models - approx. 5 other similar loops
    end

and then in the controller:
generator =
respond_to do |format|
format.csv do
filename: 'full_report.csv',
type: :csv,
disposition: :attachment
After a single request, the Puma processes occupy 55% of the server's RAM and stay that way until the server eventually runs out of memory completely.
For instance, in this article, generating a million-line, 75 MB CSV file required only 1 MB of RAM, but there was no DB querying involved.
The server has 1015 MB RAM + 400 MB of swap memory.
So my questions are:
What exactly consumes so much memory? Is it the CSV generation or the communication with the DB?
Am I doing something wrong and missing a memory leak? Or is it just how the library works?
Is there a way to free up the memory without restarting puma workers?
Thanks in advance!
Instead of each, you should use find_each, which is meant for cases like this: it instantiates the models in batches and releases them afterwards, whereas each instantiates all of them at once.

    CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
      Model.scope1(param).find_each do |item|
        file << [
          # ...
        ]
      end
    end

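To illustrate why batching bounds memory, here is a plain-Ruby sketch of the pattern find_each follows: fetch a limited batch ordered by id, process it, then fetch the next batch starting after the last id seen. `FakeTable`, `fetch_batch` and `each_in_batches` are hypothetical stand-ins invented for this example; they are not ActiveRecord APIs.

```ruby
# Hypothetical stand-in for a database table: rows are built on demand.
class FakeTable
  def initialize(row_count)
    @row_count = row_count
  end

  # Returns at most `limit` rows whose id is greater than `after_id`.
  def fetch_batch(after_id, limit)
    first = after_id + 1
    last = [after_id + limit, @row_count].min
    (first..last).map { |id| { id: id, name: "row #{id}" } }
  end
end

# Yields every row, but only ever holds one batch in memory at a time.
def each_in_batches(table, batch_size: 1000)
  last_id = 0
  loop do
    batch = table.fetch_batch(last_id, batch_size)
    break if batch.empty?

    batch.each { |row| yield row }
    last_id = batch.last[:id]
  end
end

table = FakeTable.new(2500)
seen = 0
each_in_batches(table, batch_size: 1000) { |_row| seen += 1 }
puts seen
```

Once a batch has been processed nothing references it any more, so the garbage collector can reclaim it before the next batch is loaded.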
Furthermore you should stream the CSV instead of writing it to memory or disk before sending it to the browser:
    format.csv do
      headers["Content-Type"] = "text/csv"
      headers["Content-disposition"] = "attachment; filename=\"full_report.csv\""

      # streaming_headers
      # nginx doc: Setting this to "no" will allow unbuffered responses suitable for Comet and HTTP streaming applications
      headers['X-Accel-Buffering'] = 'no'
      headers["Cache-Control"] ||= "no-cache"

      # Rack::ETag 2.2.x no longer respects 'Cache-Control'
      headers["Last-Modified"] = Time.current.httpdate

      response.status = 200

      header = ['Method 1', 'Method 2']
      csv_options = { col_sep: ";" }

      csv_enumerator = Enumerator.new do |y|
        y << CSV::Row.new(header, header).to_s(csv_options)
        Model.scope1(param).find_each do |item|
          y << CSV::Row.new(header, [item.method1, item.method2]).to_s(csv_options)
        end
      end

      # setting the body to an enumerator, rails will iterate this enumerator
      self.response_body = csv_enumerator
    end

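The point of the enumerator is that rows are produced only when the consumer asks for them. A small sketch outside Rails that shows this laziness; it uses CSV.generate_line in place of CSV::Row#to_s, and the `generated` array and the sample values are assumptions added purely to make the behaviour observable.

```ruby
require 'csv'

header = ['Method 1', 'Method 2']
csv_options = { col_sep: ';' }
generated = [] # records which data rows have actually been built

csv_enumerator = Enumerator.new do |y|
  y << CSV.generate_line(header, **csv_options)
  (1..3).each do |i|
    generated << i
    y << CSV.generate_line(["a#{i}", "b#{i}"], **csv_options)
  end
end

# Pull only the header and the first data row; the rest is never built.
first_two = csv_enumerator.take(2)
puts first_two.inspect
puts generated.inspect
```

When Rails iterates such an enumerator as the response body, each line can be sent as soon as it is generated instead of accumulating the whole file first.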
Apart from using find_each, you should try running the ReportGenerator code in a background job with ActiveJob. As background jobs run in separate processes, their memory is released back to the OS when they are killed.
So you could try something like this:
A user requests some report (CSV, PDF, Excel)
Some controller enqueues a ReportGeneratorJob, and a confirmation is displayed to the user
The job is performed and an email is sent with the download link/file.
Beware, though: you can easily improve the ActiveRecord side, but when the response is sent through Rails, it will all end up in a memory buffer in the Response object.
You also need to make use of the live streaming feature to pass the data to the client directly without buffering.

Why `linedelimiter` does not work for bag.read_text?

I am trying to load yaml from files created by
entries = bag.from_sequence([{1:2}, {3:4}])
yamls =
yamls = bag.read_text(r'\*.yaml.gz', linedelimiter='\n\n')
but it reads files line by line. How to read yamls from files?
With blocksize=None, read_text reads files line by line.
If blocksize is set, you cannot read compressed files.
How can I overcome this? Is uncompressing the files the only option?
Indeed, linedelimiter does not do what you have in mind: it is only used when splitting a file into larger blocks. As you say, when you compress with gzip, the file is no longer random-accessible, and blocks cannot be used at all.
It would be possible to pass the linedelimiter into the functions that turn chunks of data into lines (in dask.bag.text, if you are interested).
For now, a workaround could look like this:
    yamls = bag.read_text(r'\*.yaml.gz').map_partitions(
        lambda x: '\n'.join(x).split(delimiter))

reading large csv files in a rails app takes up a lot of memory - Strategy to reduce memory consumption?

I have a Rails app which allows users to upload CSV files and schedule the reading of multiple CSV files with the help of the delayed_job gem. The problem is that the app reads each file in its entirety into memory and then writes to the database. If just one file is being read it's fine, but when multiple files are read the RAM on the server fills up and the app hangs.
I am trying to find a solution for this problem.
One solution I researched is to break the CSV file into smaller parts, save them on the server, and read the smaller files (see this link).
example: split -b 40k myfile segment
This is not my preferred solution. Are there any other approaches where I don't have to break up the file? Solutions must be Ruby code.
You can use CSV.foreach to read your CSV file one row at a time instead of loading it whole:
    path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
    CSV.foreach(path) do |row|
      # process row[i] here
    end
If it's run in a background job, you could additionally call GC.start every n rows.
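A self-contained sketch of that approach, assuming a temporary file stands in for the upload. The column names, the row count and the `GC_EVERY` interval are illustrative choices, not values from the question.

```ruby
require 'csv'
require 'tempfile'

# Build a small CSV on disk to stand in for an uploaded file.
upload = Tempfile.new(['upload', '.csv'])
upload.write("id,amount\n")
1.upto(1_000) { |i| upload.write("#{i},#{i * 2}\n") }
upload.close

GC_EVERY = 250 # illustrative interval for the optional GC.start call
total = 0
rows_seen = 0

# Only the current row is held in memory while the block runs.
CSV.foreach(upload.path, headers: true) do |row|
  total += row['amount'].to_i
  rows_seen += 1
  GC.start if (rows_seen % GC_EVERY).zero?
end

upload.unlink
puts "#{rows_seen} rows, total #{total}"
```

In a real job the block would write each row to the database rather than summing a column.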
How it works
CSV.foreach operates on an IO stream, as you can see here:
    def IO.foreach(path, options = Hash.new, &block)
      # ...
      open(path, options) do |csv|
        csv.each(&block)
      end
    end
The csv.each part is a call to IO#each, which reads the file line by line (via an rb_io_getline_1 invocation) and leaves each line it has read to be garbage collected:
    static VALUE
    rb_io_each_line(int argc, VALUE *argv, VALUE io)
    {
        // ...
        while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
            // ...
        }
        // ...
    }

How to recursively download FTP folder in parallel in Ruby?

I need to cache an FTP folder locally in Ruby. Right now I'm using ftp_sync to download the FTP folder, but it's painfully slow. Do you know of any library that can download the folder's files in parallel?
The syncftp gem may help you:
Ruby has a decent built-in FTP library in case you want to roll your own:
To download files in parallel, you can use multiple threads with timeouts:
Ruby Net::FTP Timeout Threads
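A rough sketch of the threads-with-timeouts idea, written so that it runs without a server: a fixed pool of worker threads pulls paths from a queue and calls a block for each one under a timeout. `download_in_parallel` and `fake_fetch` are hypothetical names; in real use the block would presumably open its own Net::FTP connection per thread and call getbinaryfile.

```ruby
require 'timeout'

# Runs `fetch` for every path using a fixed number of worker threads.
# Returns a Hash of path => result (or the exception raised for that path).
def download_in_parallel(paths, workers: 4, timeout: 30, &fetch)
  queue = Queue.new
  paths.each { |path| queue << path }
  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        path = begin
          queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end
        outcome = begin
          Timeout.timeout(timeout) { fetch.call(path) }
        rescue StandardError => e
          e # keep the error so one failed file does not stop the others
        end
        lock.synchronize { results[path] = outcome }
      end
    end
  end

  threads.each(&:join)
  results
end

# Stand-in for a real transfer so the example is runnable offline.
fake_fetch = ->(path) { sleep 0.01; "contents of #{path}" }
results = download_in_parallel(%w[a.txt b.txt c.txt], workers: 2, &fake_fetch)
puts results.keys.sort.inspect
```

A small worker count is usually the safer default here, since many FTP servers limit the number of simultaneous connections per client.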
A great way to get parallel work done is Celluloid, the concurrent framework:
All that said, if the download speed is limited to your overall network bandwidth, then none of these approaches will help much.
To speed up the transfers in this case, be sure you're only downloading the information that's changed: new files and changed sections of existing files.
Segmented downloading can give massive speedups in some cases, such as downloading big log files where only a small percentage of the file has changed and all the changes are appends at the end of the file.
You can also consider shelling out to the command line. There are many tools that can help you with this. A good general-purpose one is "curl", which also supports simple ranges for FTP files. For example, you can get the first 100 bytes of a document using FTP like this:
curl -r 0-99 ftp://www.get.this/README
Are you open to other protocols besides FTP? Take a look at the "rsync" command, which is excellent for download synchronization. The rsync command has many optimizations to transfer just the changed data. For example rsync can sync a remote directory to a local directory like this:
rsync -auvC /local/foo/
Take a look at Curb. It's a wrapper around Curl, and can do multiple connections in parallel.
This is a modified version of one of their examples:
    require 'curb'

    urls = %w[
      ...
    ]

    responses = {}
    m = Curl::Multi.new
    # add a few easy handles
    urls.each do |url|
      responses[url] = Curl::Easy.new(url)
      puts "Queuing #{ url }..."
      m.add(responses[url])
    end

    spinner_counter = 0
    spinner = %w[ | / - \ ]
    m.perform do
      print 'Performing downloads ', spinner[spinner_counter], "\r"
      spinner_counter = (spinner_counter + 1) % spinner.size
    end

    urls.each do |url|
      print "[#{ url } #{ responses[url].total_time } seconds] Saving #{ responses[url].body_str.size } bytes..."
      File.open(File.basename(url), 'wb') { |fo| fo.write(responses[url].body_str) }
      puts 'done.'
    end

That'll pull in both the Ruby and Python source (which are pretty big so they'll take about a minute, depending on your internet connection and host). You won't see any files appear until the last block, where they get written out.

Tracking Upload Progress of File to S3 Using Ruby aws-sdk

Firstly, I am aware that there are quite a few questions that are similar to this one in SO. I have read most, if not all of them, over the past week. But I still can't make this work for me.
I am developing a Ruby on Rails app that allows users to upload mp3 files to Amazon S3. The upload itself works perfectly, but a progress bar would greatly improve user experience on the website.
I am using the aws-sdk gem which is the official one from Amazon. I have looked everywhere in its documentation for callbacks during the upload process, but I couldn't find anything.
The files are uploaded one at a time directly to S3 so it doesn't need to load it into memory. No multiple file upload necessary either.
I figured that I may need to use JQuery to make this work and I am fine with that.
I found this that looked very promising:
And I even tried following the example here:
But I just could not make it work for me.
The documentation for aws-sdk also BRIEFLY describes streaming uploads with a block:
    obj.write do |buffer, bytes|
      # writing fewer than the requested number of bytes to the buffer
      # will cause write to stop yielding to the block
    end
But this is barely helpful. How does one "write to the buffer"? I tried a few intuitive options that would always result in timeouts. And how would I even update the browser based on the buffering?
Is there a better or simpler solution to this?
Thank you in advance.
I would appreciate any help on this subject.
The "buffer" object yielded when passing a block to #write is an instance of StringIO. You can write to the buffer using #write or #<<. Here is an example that uses the block form to upload a file.
    file = File.open('/path/to/file', 'r')
    obj = s3.buckets['my-bucket'].objects['object-key']
    obj.write(:content_length => file.size) do |buffer, bytes|
      buffer.write(file.read(bytes))
      # you could do some interesting things here to track progress
    end
    file.close
After reading the source code of the AWS gem, I adapted (or mostly copied) the multipart upload method to yield the current progress based on how many chunks have been uploaded:
    s3 = AWS::S3.new.buckets['your_bucket']

    file = File.open(filepath, 'r', encoding: 'BINARY')
    file_to_upload = "#{s3_dir}/#{filename}"
    upload_progress = 0

    opts = {
      content_type: mime_type,
      cache_control: 'max-age=31536000',
      estimated_content_length: file.size,
    }

    part_size = self.compute_part_size(opts)

    parts_number = (file.size.to_f / part_size).ceil.to_i
    obj = s3.objects[file_to_upload]

    obj.multipart_upload(opts) do |upload|
      until file.eof? do
        break if (abort_upload = upload.aborted?)

        upload.add_part(file.read(part_size))
        upload_progress += 1.0/parts_number

        # Yields the Float progress and the String filepath from the
        # current file that's being uploaded
        yield(upload_progress, upload) if block_given?
      end
    end

The compute_part_size method is defined here and I've modified it to this:
    def compute_part_size options
      max_parts = 10000
      min_size = 5242880 #5 MB
      estimated_size = options[:estimated_content_length]
      [(estimated_size.to_f / max_parts).ceil, min_size].max.to_i
    end
This code was tested on Ruby 2.0.0p0
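To make the part-size arithmetic concrete, here is a small standalone sketch that applies the same rule to two file sizes and derives the number of parts, which determines how much the progress value grows per uploaded part. The helper names `part_size_for` and `parts_for` and the example sizes are assumptions chosen for illustration.

```ruby
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * 1024 * 1024 # 5 MB, i.e. 5242880 bytes

# Same rule as compute_part_size: spread the file over at most MAX_PARTS
# parts, but never use parts smaller than the 5 MB minimum.
def part_size_for(file_size)
  [(file_size.to_f / MAX_PARTS).ceil, MIN_PART_SIZE].max
end

# Number of parts the upload loop will send for a file of this size.
def parts_for(file_size)
  (file_size.to_f / part_size_for(file_size)).ceil
end

small = 100 * 1024**2 # 100 MB
large = 100 * 1024**3 # 100 GB

[small, large].each do |size|
  parts = parts_for(size)
  puts "size=#{size} part_size=#{part_size_for(size)} parts=#{parts} " \
       "progress_step=#{1.0 / parts}"
end
```

For small files the minimum part size dominates, so progress advances in a few coarse steps; for very large files the part count approaches the 10,000 limit and progress becomes fine-grained.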
