Memory consumption when using Nokogiri and XML - ruby-on-rails

I have been trying to figure out what's going on with my Rails app as it relates to memory and Nokogiri's XML parsing. For some reason, this one function alone consumes about 1 GB of memory and does not release it when it completes. I'm not quite sure what's going on here.
def split_nessus
  # To avoid consuming too much memory, we're going to split the Nessus file
  # if it's over 10MB into multiple smaller files.
  file_size = File.size(@nessus_files[0]).to_f / 2**20
  files = []
  if file_size >= 10
    file_num = 1
    d = File.open(@nessus_files[0])
    content = Nokogiri::XML(d.read)
    d.close
    data = Nokogiri::XML("<data></data>")
    hosts_num = 1
    content.xpath("//ReportHost").each do |report_host|
      data.root << report_host
      hosts_num += 1
      if hosts_num == 100
        File.open("#{@nessus_files[0]}_nxtmp_#{file_num}", "w") { |f| f.write(data.to_xml) }
        files << "#{@nessus_files[0]}_nxtmp_#{file_num}"
        data = Nokogiri::XML("<data></data>")
        hosts_num = 1
        file_num += 1
      end
    end
    @nessus_files = files
  end
end
Since Rails crashes when trying to parse a 100MB+ XML file, I've decided to break XML files over 10MB into separate files and handle them individually.
Any thoughts as to why this won't release about 1 GB of memory when it completes?

Nokogiri uses system libraries like libxml2 and libxslt under the hood. Because of that, I would assume the problem is probably not in Ruby's garbage collection but somewhere else.
If you are working with large files, it's usually a good idea to stream-process them so that you do not load the whole file into memory, which by itself results in huge memory consumption, since large strings are very memory-hungry.
Because of this, when working with large XML files you should use a stream parser. In Nokogiri this is Nokogiri::XML::SAX.
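As a rough illustration, here is a minimal SAX sketch, assuming a Nessus-style document containing ReportHost elements; the class name and file path are illustrative, not from the question:

require 'nokogiri'

class ReportHostCounter < Nokogiri::XML::SAX::Document
  attr_reader :host_count

  def initialize
    @host_count = 0
  end

  # Called for every opening tag; only the element name and attributes are
  # held in memory, never the whole document.
  def start_element(name, attrs = [])
    @host_count += 1 if name == "ReportHost"
  end
end

handler = ReportHostCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open("scan.nessus"))
puts handler.host_count

Because the parser only sees one event at a time, memory usage stays roughly constant regardless of the file size, which also removes the need to split the file at all.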

Related

rails - Exporting a huge CSV file consumes all RAM in production

So my app exports an 11.5 MB CSV file and uses basically all of the RAM, which never gets freed.
The data for the CSV is taken from the DB, and in the case mentioned above the whole thing is being exported.
I am using Ruby 2.4.1's standard CSV library in the following fashion:
export_helper.rb:
CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  data = Model.scope1(param).scope2(param).includes(:model1, :model2)
  data.each do |item|
    file << [
      item.method1,
      item.method2,
      item.method3
    ]
  end
  # repeat for other models - approx. 5 other similar loops
end
and then in the controller:
generator = ExportHelper::ReportGenerator.new
generator.full_report
respond_to do |format|
  format.csv do
    send_file(
      "#{Rails.root}/full_report.csv",
      filename: 'full_report.csv',
      type: :csv,
      disposition: :attachment
    )
  end
end
After a single request the Puma processes occupy 55% of the server's RAM and stay like that until the server eventually runs out of memory completely.
For instance, in this article, generating a million-line, 75 MB CSV file required only 1 MB of RAM, but there was no DB querying involved.
The server has 1015 MB RAM + 400 MB of swap memory.
So my questions are:
What exactly consumes so much memory? Is it the CSV generation or the communication with the DB?
Am I doing something wrong and missing a memory leak? Or is it just how the library works?
Is there a way to free up the memory without restarting the Puma workers?
Thanks in advance!
Instead of each you should be using find_each, which is made specifically for cases like this: it instantiates the models in batches (1,000 records by default) and releases them afterwards, whereas each instantiates all of them at once.
CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  Model.scope1(param).find_each do |item|
    file << [
      item.method1
    ]
  end
end
Furthermore you should stream the CSV instead of writing it to memory or disk before sending it to the browser:
format.csv do
  headers["Content-Type"] = "text/csv"
  headers["Content-Disposition"] = "attachment; filename=\"full_report.csv\""
  # streaming_headers
  # nginx doc: Setting this to "no" will allow unbuffered responses suitable
  # for Comet and HTTP streaming applications
  headers['X-Accel-Buffering'] = 'no'
  headers["Cache-Control"] ||= "no-cache"
  # Rack::ETag 2.2.x no longer respects 'Cache-Control'
  # https://github.com/rack/rack/commit/0371c69a0850e1b21448df96698e2926359f17fe#diff-1bc61e69628f29acd74010b83f44d041
  headers["Last-Modified"] = Time.current.httpdate
  headers.delete("Content-Length")
  response.status = 200

  header = ['Method 1', 'Method 2']
  csv_options = { col_sep: ";" }

  csv_enumerator = Enumerator.new do |y|
    y << CSV::Row.new(header, header).to_s(csv_options)
    Model.scope1(param).find_each do |item|
      y << CSV::Row.new(header, [item.method1, item.method2]).to_s(csv_options)
    end
  end

  # setting the body to an enumerator; Rails will iterate this enumerator
  self.response_body = csv_enumerator
end
Apart from using find_each, you should try running the ReportGenerator code in a background job with ActiveJob. Since background jobs run in separate processes, the memory is released back to the OS when they are killed.
So you could try something like this (a minimal sketch of the job follows the steps below):
A user requests some report (CSV, PDF, Excel).
Some controller enqueues a ReportGeneratorJob, and a confirmation is displayed to the user.
The job is performed and an email is sent with the download link/file.
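Here is a minimal sketch of that flow, assuming an ApplicationJob base class; ReportMailer and its report_ready method are hypothetical names for the notification step:

class ReportGeneratorJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    # Generate the CSV on disk inside a separate worker process.
    ExportHelper::ReportGenerator.new.full_report
    # ReportMailer#report_ready is a hypothetical mailer that sends the
    # download link (or the file itself) to the user.
    ReportMailer.report_ready(user_id, "#{Rails.root}/full_report.csv").deliver_later
  end
end

# In the controller: enqueue the job and respond immediately.
ReportGeneratorJob.perform_later(current_user.id)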
Beware though: you can easily improve the ActiveRecord side, but when the response is then sent through Rails, it will all end up in a memory buffer in the Response object: https://github.com/rails/rails/blob/master/actionpack/lib/action_dispatch/http/response.rb#L110
You also need to make use of the live streaming feature to pass the data to the client directly without buffering: https://guides.rubyonrails.org/action_controller_overview.html#live-streaming-of-arbitrary-data
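For illustration, a minimal live-streaming controller could look like this, reusing the csv_enumerator idea from above (assumed to be extracted into a helper method; controller and action names are illustrative):

class ReportsController < ApplicationController
  include ActionController::Live

  def full_report
    response.headers['Content-Type'] = 'text/csv'
    response.headers['Content-Disposition'] = 'attachment; filename="full_report.csv"'
    csv_enumerator.each { |line| response.stream.write(line) }
  ensure
    # The stream must always be closed, even if the client disconnects.
    response.stream.close
  end
end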

Why does `linedelimiter` not work for bag.read_text?

I am trying to load YAML from files created by
entries = bag.from_sequence([{1: 2}, {3: 4}])
yamls = entries.map(yaml.dump)
yamls.to_textfiles(r'*.yaml.gz')
with
yamls = bag.read_text(r'*.yaml.gz', linedelimiter='\n\n')
but it reads the files line by line. How do I read the YAML documents back from the files?
UPDATE:
With blocksize=None, read_text reads the files line by line. If blocksize is set, compressed files cannot be read at all.
How can I overcome this? Is uncompressing the files the only option?
Indeed, linedelimiter is not used in the sense you have in mind, but only for separating the larger blocks. As you say, when you compress with gzip, the file is no longer randomly accessible, and blocks cannot be used at all.
It would be possible to pass the linedelimiter into the functions that turn chunks of data into lines (in dask.bag.text, if you are interested).
For now, a workaround could look like this:
yamls = bag.read_text(r'*.yaml.gz').map_partitions(
    lambda x: '\n'.join(x).split('\n\n'))

reading large csv files in a rails app takes up a lot of memory - Strategy to reduce memory consumption?

I have a Rails app which allows users to upload CSV files and schedule the reading of multiple CSV files with the help of the delayed_job gem. The problem is that the app reads each file in its entirety into memory and then writes to the database. If it's just one file being read, it's fine, but when multiple files are read, the RAM on the server fills up and causes the app to hang.
I am trying to find a solution for this problem.
One solution I researched is to break the CSV file into smaller parts, save them on the server, and read the smaller files. See this link.
Example: split -b 40k myfile segment
This is not my preferred solution. Are there any other approaches to solve this where I don't have to break the file? Solutions must be Ruby code.
Thanks,
You can make use of CSV.foreach to read your CSV file row by row instead of loading it all into memory at once:
path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
CSV.foreach(path) do |row|
  # process row[i] here
end
If it's run in a background job, you could additionally call GC.start every n rows.
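As a rough sketch of what that can look like inside a delayed_job worker, assuming Rails 6+'s insert_all and a hypothetical Record model with name/value columns:

require 'csv'

batch = []
CSV.foreach(path, headers: true).with_index(1) do |row, i|
  batch << { name: row['name'], value: row['value'] }
  if batch.size >= 1_000
    Record.insert_all(batch)        # one INSERT per 1,000 rows
    batch.clear
  end
  GC.start if (i % 50_000).zero?    # optional: nudge the GC periodically
end
Record.insert_all(batch) unless batch.empty?

Only the current row and the current batch are ever held in memory, so RAM usage stays flat even for very large files.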
How it works
CSV.foreach operates on an IO stream, as you can see here:
def CSV.foreach(path, options = Hash.new, &block)
  # ...
  open(path, options) do |csv|
    csv.each(&block)
  end
end
The csv.each part is a call to IO#each, which reads the file line by line (an rb_io_getline_1 invocation) and leaves each line it has read to be garbage collected:
static VALUE
rb_io_each_line(int argc, VALUE *argv, VALUE io)
{
    // ...
    while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
        rb_yield(str);
    }
    // ...
}

How to recursively download FTP folder in parallel in Ruby?

I need to cache an FTP folder locally in Ruby. Right now I'm using ftp_sync to download the FTP folder, but it's painfully slow. Do you know of any library that can download the folder's files in parallel?
Thanks!
The syncftp gem may help you:
http://rubydoc.info/gems/syncftp/0.0.3/frames
Ruby has a decent built-in FTP library in case you want to roll your own:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/net/ftp/rdoc/Net/FTP.html
To download files in parallel, you can use multiple threads with timeouts:
Ruby Net::FTP Timeout Threads
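As a rough sketch of that approach using only the standard library (the host, credentials, and file paths are illustrative):

require 'net/ftp'
require 'timeout'

files = ['logs/a.log', 'logs/b.log', 'logs/c.log']

threads = files.map do |remote_path|
  Thread.new(remote_path) do |path|
    Timeout.timeout(300) do                      # give up on a file after 5 minutes
      Net::FTP.open('ftp.example.com') do |ftp|  # one connection per thread
        ftp.login('user', 'password')
        ftp.getbinaryfile(path, File.basename(path))
      end
    end
  end
end

threads.each(&:join)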
A great way to get parallel work done is Celluloid, the concurrent framework:
https://github.com/celluloid/celluloid
All that said, if the download speed is limited to your overall network bandwidth, then none of these approaches will help much.
To speed up the transfers in this case, be sure you're only downloading the information that's changed: new files and changed sections of existing files.
Segmented downloading can give massive speedups in some cases, such as downloaded big log files where only a small percentage of the file has changed, and the changes are all at the end of the file, and are all appends.
You can also consider shelling out to the command line; there are many tools that can help you with this. A good general-purpose one is curl, which also supports simple byte ranges for FTP. For example, you can get the first 100 bytes of a document over FTP like this:
curl -r 0-99 ftp://www.get.this/README
Are you open to protocols other than FTP? Take a look at the rsync command, which is excellent for download synchronization and has many optimizations to transfer just the changed data. For example, rsync can sync a remote directory to a local directory like this:
rsync -auvC me@my.com:/remote/foo/ /local/foo/
Take a look at Curb. It's a wrapper around Curl, and can do multiple connections in parallel.
This is a modified version of one of their examples:
require 'curb'

urls = %w[
  http://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.3-p286.tar.bz2
  http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
]

responses = {}
m = Curl::Multi.new

# add a few easy handles
urls.each do |url|
  responses[url] = Curl::Easy.new(url)
  puts "Queuing #{ url }..."
  m.add(responses[url])
end

spinner_counter = 0
spinner = %w[ | / - \\ ]
m.perform do
  print 'Performing downloads ', spinner[spinner_counter], "\r"
  spinner_counter = (spinner_counter + 1) % spinner.size
end
puts

urls.each do |url|
  print "[#{ url } #{ responses[url].total_time } seconds] Saving #{ responses[url].body_str.size } bytes..."
  File.open(File.basename(url), 'wb') { |fo| fo.write(responses[url].body_str) }
  puts 'done.'
end
That'll pull in both the Ruby and Python source (which are pretty big so they'll take about a minute, depending on your internet connection and host). You won't see any files appear until the last block, where they get written out.

Tracking Upload Progress of File to S3 Using Ruby aws-sdk

Firstly, I am aware that there are quite a few questions similar to this one on SO. I have read most, if not all, of them over the past week, but I still can't make this work for me.
I am developing a Ruby on Rails app that allows users to upload mp3 files to Amazon S3. The upload itself works perfectly, but a progress bar would greatly improve user experience on the website.
I am using the aws-sdk gem which is the official one from Amazon. I have looked everywhere in its documentation for callbacks during the upload process, but I couldn't find anything.
The files are uploaded one at a time directly to S3, so they don't need to be loaded into memory. No multiple file upload is necessary either.
I figured that I may need to use jQuery to make this work, and I am fine with that.
I found this that looked very promising: https://github.com/blueimp/jQuery-File-Upload
And I even tried following the example here: https://github.com/ncri/s3_uploader_example
But I just could not make it work for me.
The documentation for aws-sdk also BRIEFLY describes streaming uploads with a block:
obj.write do |buffer, bytes|
  # writing fewer than the requested number of bytes to the buffer
  # will cause write to stop yielding to the block
end
But this is barely helpful. How does one "write to the buffer"? I tried a few intuitive options that would always result in timeouts. And how would I even update the browser based on the buffering?
Is there a better or simpler solution to this?
Thank you in advance.
I would appreciate any help on this subject.
The "buffer" object yielded when passing a block to #write is an instance of StringIO. You can write to the buffer using #write or #<<. Here is an example that uses the block form to upload a file.
file = File.open('/path/to/file', 'r')

obj = s3.buckets['my-bucket'].objects['object-key']
obj.write(:content_length => file.size) do |buffer, bytes|
  buffer.write(file.read(bytes))
  # you could do some interesting things here to track progress
end

file.close
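For example, here is a hedged sketch of how you might track progress inside that block; the percentage reporting is illustrative and not part of the SDK:

file = File.open('/path/to/file', 'rb')
obj = s3.buckets['my-bucket'].objects['object-key']
written = 0

obj.write(:content_length => file.size) do |buffer, bytes|
  chunk = file.read(bytes)
  buffer.write(chunk)
  written += chunk.bytesize
  # Report progress however suits your app (log line, websocket push, DB row, etc.)
  puts format('%.1f%% uploaded', 100.0 * written / file.size)
end

file.close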
After reading the source code of the AWS gem, I've adapted (or mostly copied) the multipart upload method to yield the current progress based on how many chunks have been uploaded:
s3 = AWS::S3.new.buckets['your_bucket']

file = File.open(filepath, 'r', encoding: 'BINARY')
file_to_upload = "#{s3_dir}/#{filename}"
upload_progress = 0

opts = {
  content_type: mime_type,
  cache_control: 'max-age=31536000',
  estimated_content_length: file.size,
}

part_size = self.compute_part_size(opts)
parts_number = (file.size.to_f / part_size).ceil.to_i
obj = s3.objects[file_to_upload]

begin
  obj.multipart_upload(opts) do |upload|
    until file.eof? do
      break if (abort_upload = upload.aborted?)

      upload.add_part(file.read(part_size))
      upload_progress += 1.0 / parts_number

      # Yields the Float progress and the upload object for the file
      # that's currently being uploaded
      yield(upload_progress, upload) if block_given?
    end
  end
end
The compute_part_size method is defined here and I've modified it to this:
def compute_part_size(options)
  max_parts = 10000
  min_size  = 5242880 # 5 MB
  estimated_size = options[:estimated_content_length]
  [(estimated_size.to_f / max_parts).ceil, min_size].max.to_i
end
This code was tested on Ruby 2.0.0p0
