Optimizing reading from the database and writing to a CSV file - ruby-on-rails

I'm trying to read a large number of cells from the database (over 100,000) and write them to a CSV file on an Ubuntu VPS. The server doesn't have enough memory for this.
I was thinking about reading 5000 rows at a time and writing them to the file, then reading another 5000, and so on.
How should I restructure my current code so that it doesn't exhaust memory?
Here's my code:
def write_rows(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    emails.each do |l|
      f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
    end
  end
end
The method is called from a Sidekiq worker with:
write_rows(user.emails)
Thanks for the help!

The problem here is that when you call emails.each, ActiveRecord loads all the records from the database and keeps them in memory. To avoid this, you can use the method find_each:
require 'csv'
BATCH_SIZE = 5000

def write_rows(emails)
  CSV.open(file_path, 'w') do |csv|
    csv << %w{email name ip created}
    emails.find_each do |email|
      csv << [email.email, email.name, email.ip, email.created_at]
    end
  end
end
By default find_each loads records in batches of 1000 at a time. If you want to load batches of 5000 records, you have to pass the :batch_size option to find_each:
emails.find_each(:batch_size => 5000) do |email|
  ...
More information about the find_each method (and the related find_in_batches) can be found on the Ruby on Rails Guides.
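For illustration, here is a rough sketch of the same loop written with find_in_batches, which yields whole batches (arrays of records) instead of single records; this assumes it sits inside the same CSV.open block as above and is only meant to show the API:
emails.find_in_batches(batch_size: BATCH_SIZE) do |batch|
  batch.each do |email|
    csv << [email.email, email.name, email.ip, email.created_at]
  end
  # each batch can be garbage collected before the next one is loaded
end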
I've used the CSV class to write the file instead of joining fields and lines by hand. This is not intended to be a performance optimization, since writing to the file shouldn't be the bottleneck here.

Related

XML generation very slow and using lots of memory in Rails 4

I'm generating an XML file to share data with another system. From my troubleshooting, I've found that this process is both slow and consuming lots of memory (I'm getting lots of R14s on Heroku).
My index method on my Jobs Controller looks like this:
def index
  respond_to do |format|
    format.xml { @jobs = @user.jobs.includes(job_types: [:job_lines, :job_photos]) }
    format.json {
      # More code here, this part is not the problem.
    }
  end
end
My view (index.xml.builder) looks like this (I've removed a bunch of fields to keep the example smaller):
xml.instruct!
xml.jobs do
  @jobs.each do |j|
    xml.job do
      xml.id j.id
      xml.job_number j.job_number
      xml.registration j.registration
      xml.name j.name
      xml.job_types do
        j.job_types.each do |t|
          xml.job_type do
            xml.id t.id
            xml.job_id t.job_id
            xml.type_number t.type_number
            xml.description t.description
            xml.job_lines do
              t.job_lines.each do |l|
                xml.job_line do
                  xml.id l.id
                  xml.line_number l.line_number
                  xml.job_type_id l.job_type_id
                  xml.line_type l.line_type
                  xml.type_number l.type_number
                  xml.description l.description
                  xml.part_number l.part_number
                end # job_line node
              end # job_lines.each
            end # job_lines node
            xml.job_photos do
              t.job_photos.each do |p|
                xml.job_photo do
                  xml.id p.id
                  xml.pcid p.pcid
                  xml.job_type_id p.job_type_id
                  xml.image_url p.image.url
                end # job_photo node
              end # job_photos.each
            end # job_photos node
          end # job_type node
        end # job_types.each
      end # job_types node
    end # job node
  end # @jobs.each
end # jobs node
The generated XML file is not small (it's about 100 kB). Running on Heroku, the Scout tool tells me that this process often takes 4-6 seconds to run. Also, despite running only 1 worker with 4 threads (in Puma), this part of my code is consuming all my memory. In Scout, I can see that its "Max Allocations" figure is as high as 10M, compared with my next-worst method, which is only 500k allocations.
Can anyone tell me what I'm doing wrong? Is there a more efficient (in terms of speed and memory usage) way for me to generate XML?
Any help would be appreciated.
EDIT 1
I've tried building the XML manually like this:
joblist.each do |j|
  result << " <job>\n"
  result << " <id>" << j.id.to_s << "</id>\n"
  result << " <job_number>" << j.job_number.to_s << "</job_number>\n"
  # Lots more lines removed
end
This has given me some improvement. My largest allocation count is now 1.8M, but I'm still close to Heroku's limit (I reached a max of 500MB out of the 512MB limit over 24 hours).
I am still only running 1 worker with 4 threads. If I can, I'd like to get the memory down further so I can run more Puma workers and threads.
EDIT 2
I ended up having to do this in batches (using offset and limit) and send 5 jobs at a time. The memory usage dropped substantially when I did this. Obviously there were more calls to the controller, but each one was smaller and faster.
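For illustration, a minimal sketch of what that batching could look like in the index action; the offset parameter name and the batch size of 5 are assumptions, not taken from the original code:
def index
  respond_to do |format|
    format.xml do
      batch_size = 5                    # jobs returned per request (assumed)
      offset     = params[:offset].to_i # hypothetical parameter sent by the consuming system
      @jobs = @user.jobs
                   .includes(job_types: [:job_lines, :job_photos])
                   .offset(offset)
                   .limit(batch_size)
    end
  end
end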

How to download CSV data using ActionController::Live from MongoDB?

I have created a CSV downloader in a controller like this:
format.csv do
  @records = Model.all
  headers['Content-Disposition'] = "attachment; filename=\"products.csv\""
  headers['Content-Type'] ||= 'text/csv'
end
Now I want to use server-sent events to download the CSV from this, for optimisation purposes. I know I can do this in Rails using ActionController::Live, but I have no experience with it.
Can someone explain to me how I can
Query records in batches
Add records to the stream
Handle SSE from the browser side
Write records to CSV files
Correct me if any of my assumptions are wrong. Help me do this in a better way. Thanks.
Mongoid automatically queries your records in batches (more info over here).
To add your records to a CSV file, you should do something like:
records = MyModel.all
# By default batch_size is 100, but you can modify it using .batch_size(x)
result = CSV.generate do |csv|
  csv << ["attribute1", "attribute2", ...]
  records.each do |r|
    csv << [r.attribute1, r.attribute2, ...]
  end
end
send_data result, filename: 'MyCsv.csv'
Remember that send_data is an ActionController method!
I think you don't need SSE for generating a CSV. Just include ActionController::Live in the controller and use response.stream.write while iterating over your collection:
include ActionController::Live
...

def some_action
  respond_to do |format|
    format.csv do
      # Needed for streaming, to work around a Rack 2.2 bug
      response.headers['Last-Modified'] = Time.now.httpdate
      headers['Content-Disposition'] = "attachment; filename=\"products.csv\""
      headers['Content-Type'] ||= 'text/csv'
      [1, 2, 3, 4].each do |i| # --> change this to iterate over your DB records
        response.stream.write ['SOME', 'thing', "interesting #{i}", "#{Time.zone.now}"].to_csv
        sleep 1 # some fake delay to see the chunking
      end
    ensure
      response.stream.close
    end
  end
end
Try it with curl or similar to see the output line by line:
$ curl -i http://localhost:3000/test.csv
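To replace the fake [1, 2, 3, 4] loop with real data, something along these lines should work; Product and its fields are made-up names here, and Mongoid's cursor already pulls documents from the server in batches behind the scenes:
Product.all.each do |product|
  # one CSV line per document; the cursor fetches documents batch by batch
  response.stream.write [product.name, product.price, product.created_at].to_csv
end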

Export large data (millions of rows) into CSV in Rails

I have a million records and I want to export that data to CSV. I used the find_each method to fetch the records, but it is still taking too much time to fetch the data and download the CSV. I am not able to do anything else in the application because it is using so much memory; the browser just shows the page loading.
I have written the following code in the controller:
def export_csv
  require 'csv'
  lines = []
  csv_vals = []
  User.where(status: ACTIVE).order('created_at desc').find_each(batch_size: 10000) do |user|
    csv_vals << user.email if user.email.present?
    csv_vals << user.name if user.name.present?
    # .......
    # ........
    # .......etc
    lines << CSV.generate_line(csv_vals)
  end
  send_data(lines.join, type: 'text/csv; charset=iso-8859-1; header=present',
            disposition: "attachment; filename=file123.csv")
end
Is there another way to load the millions of records and download quickly?
This may help:
Generating and streaming potentially large CSV files using Ruby on Rails
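The gist of that article is to stream the CSV instead of building the whole thing in memory first. A rough sketch of that idea applied to the code above (not the article's exact code, and untested against your schema):
def export_csv
  require 'csv'
  headers['Content-Disposition'] = 'attachment; filename="file123.csv"'
  headers['Content-Type'] = 'text/csv'
  headers['Last-Modified'] = Time.now.httpdate # discourages Rack middleware from buffering the body
  # Lazily yield one CSV line at a time; each yielded string is sent as a chunk of the response body
  self.response_body = Enumerator.new do |yielder|
    # note: find_each ignores an explicit order and walks the table by primary key
    User.where(status: ACTIVE).find_each(batch_size: 10000) do |user|
      yielder << CSV.generate_line([user.email, user.name])
    end
  end
end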

Memory issue with huge CSV Export in Rails

I'm trying to export a large amount of data from a database to a CSV file, but it is taking a very long time and I fear I'll have major memory issues.
Does anyone know of a better way to export a CSV without the memory build-up? If so, can you show me how? Thanks.
Here's my controller:
def users_export
  File.new("users_export.csv", "w") # creates new file to write to
  @todays_date = Time.now.strftime("%m-%d-%Y")
  @outfile = @todays_date + ".csv"
  @users = User.select('id, login, email, last_login, created_at, updated_at')
  FasterCSV.open("users_export.csv", "w+") do |csv|
    csv << [ @todays_date ]
    csv << [ "id", "login", "email", "last_login", "created_at", "updated_at" ]
    @users.find_each do |u|
      csv << [ u.id, u.login, u.email, u.last_login, u.created_at, u.updated_at ]
    end
  end
  send_file "users_export.csv",
            :type => 'text/csv; charset=iso-8859-1; header=present',
            :disposition => "attachment; filename=#{@outfile}"
end
You're building up one giant string, so you have to keep the entire CSV file in memory. You're also loading all of your users at once, which will also take up a lot of memory. It won't make any difference if you only have a few hundred or a few thousand users, but at some point you will probably need to do 2 things:
Use
User.find_each do |user|
  csv << [...]
end
This loads users in batches (1000 by default) rather than all of them.
You should also look at writing the CSV to a file rather than buffering the entire thing in memory. Assuming you have created a temporary file,
FasterCSV.open('/path/to/file', 'w') do |csv|
  ...
end
will write your CSV to a file. You can then use send_file to send it. If you already have a file open, FasterCSV.new(io) should work too.
Lastly, on Rails 3.1 and higher you might be able to stream the CSV file as you create it, but that isn't something I've tried before.
In addition to the tips on CSV generation, be sure to optimize the database call as well.
Select only the columns you need.
@users = User.select('id, login, email, last_login, created_at, updated_at').order('login')
@users.find_each do |user|
  ...
end
If you have, for example, 1000 users, and each has password, password_salt, city, country, ...,
then several thousand fewer objects are transferred from the database, created as Ruby objects, and finally garbage collected.
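Going a step further, if the export doesn't need full ActiveRecord objects at all, you can pull only the raw column values. A rough sketch, assuming a newer Rails (in_batches needs Rails 5+, multi-column pluck Rails 4+) and that it runs inside the CSV block from the answer above:
User.in_batches(of: 1000) do |batch|
  # pluck returns plain arrays of values, skipping model instantiation entirely
  batch.pluck(:id, :login, :email, :last_login, :created_at, :updated_at).each do |row|
    csv << row
  end
end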

Exporting ActiveRecord objects into POROs

I'm developing a "script generator" to automatize some processes at work.
It has a Rails application running on a server that stores all data needed to make the script and generates the script itself at the end of the process.
The problem I am having is how to export the data from the ActiveRecord format to Plain Old Ruby Objects (POROs) so I can deal with them in my script with no database support and a pure-ruby implementation.
I thought about YAML, CSV or something like this to export the data but it would be a painful process to update these structures if the process changes. Is there a simpler way?
Ty!
By "update these structures if the process changes", do you mean changing the code that reads and writes the CSV or YAML data when the fields in the database change?
The following code writes and reads any AR object to/from CSV (requires the FasterCSV gem):
def load_from_csv(csv_filename, poro_class)
  headers_read = []
  first_record = true
  num_headers = 0
  transaction do
    FCSV.foreach(csv_filename) do |row|
      if first_record
        headers_read = row
        num_headers = headers_read.length
        first_record = false
      else
        hash_values = {}
        for col_index in 0...num_headers
          hash_values[headers_read[col_index]] = row[col_index]
        end
        new_poro_obj = poro_class.new(hash_values) # assumes that your PORO has a constructor that accepts a hash. If not, you can do something like new_poro_obj.send(headers_read[col_index], row[col_index]) in the loop above
        # work with your new_poro_obj
      end
    end
  end
end
# objects is a list of ActiveRecord objects of the same class
def dump_to_csv(csv_filename, objects)
  FCSV.open(csv_filename, 'w') do |csv|
    # get column names and write them as headers
    col_names = objects[0].class.column_names()
    csv << col_names
    objects.each do |obj|
      col_values = []
      col_names.each do |col_name|
        col_values.push obj[col_name]
      end
      csv << col_values
    end
  end
end
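A brief usage sketch, under stated assumptions: the two methods above are defined as class methods on a User ActiveRecord model (so that transaction resolves), and UserPoro is a hypothetical plain Ruby class whose constructor accepts a hash of attributes:
# Hypothetical PORO with the hash-accepting constructor assumed above
class UserPoro
  def initialize(attributes)
    @attributes = attributes
  end
end

# In the Rails app: dump every User row to a CSV file...
User.dump_to_csv('users.csv', User.all.to_a)

# ...and later rebuild plain Ruby objects from that file
User.load_from_csv('users.csv', UserPoro)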
