Exporting large amounts of data using FasterCSV with Rails - ruby-on-rails

I have a controller in Rails that generates CSV reports using FasterCSV. These reports will contain approximately 20,000 rows, maybe more.
Creating the csv_string in my implementation below takes about 30 seconds or more. Is there a better/faster way to export the data? Is there any way to output the data without having to hold it all in memory in csv_string?
My current implementation is as follows:
@report_data = Person.find(:all, :conditions => "age > #{params[:age]}")

csv_string = FasterCSV.generate do |csv|
  @report_data.each do |e|
    values = [e.id, e.name, e.description, e.phone, e.address]
    csv << values
  end
end

send_data csv_string, :type => "text/plain",
                      :filename => "report.csv", :disposition => 'attachment'

I would try using find_in_batches to avoid keeping that many ActiveRecord objects in memory at once.
http://ryandaigle.com/articles/2009/2/23/what-s-new-in-edge-rails-batched-find
I believe that should help quite a bit; creating and holding many ActiveRecord objects in memory is slow.
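A minimal sketch of that approach, assuming the Person model and columns from the question and the Rails 2.3-era find_in_batches options (it also passes the age condition as a bound parameter instead of interpolating params into the SQL):

csv_string = FasterCSV.generate do |csv|
  # Pull rows out of the database 1,000 at a time instead of loading every Person at once.
  Person.find_in_batches(:conditions => ["age > ?", params[:age]], :batch_size => 1000) do |batch|
    batch.each do |person|
      csv << [person.id, person.name, person.description, person.phone, person.address]
    end
  end
end

send_data csv_string, :type => "text/plain",
                      :filename => "report.csv", :disposition => 'attachment'

Note that csv_string is still assembled in memory; the batching only limits how many ActiveRecord objects are alive at any one time.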

Related

Massive data export to CSV in Rails 4.2.7.1

I'm having RAM usage issues while exporting huge amounts of data to a CSV file in Rails. Here is the method responsible for the CSV generation:
def csv
  CSV.generate(headers: true, col_sep: ';', force_quotes: true) do |csv|
    csv << headers
    cached_data.each do |cached_object|
      # row_data is a multi-element (~100 elements) array of strings containing the data to be presented in the CSV
      row_data = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
      csv << row_data
    end
  end
end
selected_ordered_scopes is an array: [:id, :preferences, ...]
cached_object is written to Rails.cache as follows:
id => [...],
preferences => ['lorem', 'ipsum', ...],
other_key => ['lorem', 'ipsum', ...]
...
I use the gem 'get_process_mem' to analyze memory usage while executing the export action in the Rails app.
When I perform two/three/four consecutive exports of data containing 50,000 rows (each row is the cached_object data described above), I see that memory usage grows above 2 GB and is never released. Do you have any idea how to deal with that?
I would like to release the memory quickly after the first export, or perform the export in some other way that does not use so much memory.
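For reference, one way to avoid accumulating the whole CSV string is to stream the response row by row. This is only a sketch, not something from the original thread; it assumes the header row (headers in the question, renamed csv_headers here to avoid clashing with the controller's response headers), cached_data, and selected_ordered_scopes from the question, wrapped in a hypothetical controller action named export:

def export
  response.headers["Content-Type"]        = "text/csv"
  response.headers["Content-Disposition"] = 'attachment; filename="export.csv"'
  response.headers["Cache-Control"]       = "no-cache"

  # The response body only needs to respond to #each, so an Enumerator works:
  # each CSV line is generated and handed to the client one at a time.
  self.response_body = Enumerator.new do |lines|
    lines << CSV.generate_line(csv_headers, col_sep: ';', force_quotes: true)
    cached_data.each do |cached_object|
      row = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
      lines << CSV.generate_line(row, col_sep: ';', force_quotes: true)
    end
  end
end

Only one generated line lives on the Ruby side at a time; whatever Rails.cache returns for cached_data is of course still loaded.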

Slow performance when generating XML from a bunch of model objects

class GenericFormatter < Formatter
  attr_accessor :tag_name, :objects

  def generate_xml
    builder = Nokogiri::XML::Builder.new do |xml|
      xml.send(tag_name.pluralize) {
        objects.each do |obj|
          xml.send(tag_name.singularize) {
            self.generate_obj_row(obj, xml)
          }
        end
      }
    end
    builder.to_xml
  end

  def initialize(tag_name, objects)
    self.tag_name = tag_name
    self.objects = objects
  end

  def generate_obj_row(obj, xml)
    obj.attributes.except("updated_at").map do |key, value|
      xml.send(key, value)
    end
    xml.updated_at obj.updated_at.try(:strftime, "%m/%d/%Y %H:%M:%S") if obj.attributes.key?('updated_at')
  end
end
In the above code I have implemented a formatter that uses the Nokogiri XML builder to generate XML from the objects passed in. It generates the XML quickly when the data set is not too large, but with larger data sets (more than 10,000 records) it slows down and takes at least 50-60 seconds.
Problem: Is there any way to generate the XML faster? I have tried XML builders in the view as well, but that didn't work. How can I generate the XML faster? The solution should work on a Rails 3 application; suggestions to optimize the above code are welcome.
Your main problem is processing everything in one go instead of splitting your data into batches. That requires a lot of memory, first to build all those ActiveRecord models and then to build an in-memory representation of the whole XML document. Metaprogramming is also quite expensive (I mean those send calls).
Take a look at this code:
class XmlGenerator
  attr_accessor :tag_name, :ar_relation

  def initialize(tag_name, ar_relation)
    @ar_relation = ar_relation
    @tag_name = tag_name
  end

  def generate_xml
    singular_tag_name = tag_name.singularize
    plural_tag_name = tag_name.pluralize
    xml = ""
    xml << "<#{plural_tag_name}>"
    ar_relation.find_in_batches(batch_size: 1000) do |batch|
      batch.each do |obj|
        xml << "<#{singular_tag_name}>"
        obj.attributes.except("updated_at").each do |key, value|
          xml << "<#{key}>#{value}</#{key}>"
        end
        if obj.attributes.key?("updated_at")
          xml << "<updated_at>#{obj.updated_at.strftime('%m/%d/%Y %H:%M:%S')}</updated_at>"
        end
        xml << "</#{singular_tag_name}>"
      end
    end
    xml << "</#{plural_tag_name}>"
    xml
  end
end

# example usage
XmlGenerator.new("user", User.where("age < 21")).generate_xml
Major improvements are:
fetching data from the database in batches; you need to pass an ActiveRecord relation instead of an array of ActiveRecord models
generating XML by constructing strings; this carries a risk of producing invalid XML, but it is much faster than using a builder
I tested it on over 60k records. It took around 40 seconds to generate such an XML document.
There is much more that can be done to improve this even further, but it all depends on your application.
Here are some ideas:
do not use ActiveRecord to fetch the data; instead use a lighter library or the plain database driver
fetch only the data that you need
tweak the batch size
write the generated XML directly to a file (if that is your use case) to save memory, as in the sketch below
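A minimal sketch of that last idea, reusing the string-building approach from above; the file name and the User relation are placeholders, and the updated_at formatting is omitted for brevity:

# Append each batch straight to a file instead of growing one big string in memory.
File.open("users.xml", "w") do |file|
  file << "<users>"
  User.where("age < 21").find_in_batches(batch_size: 1000) do |batch|
    batch.each do |obj|
      file << "<user>"
      obj.attributes.except("updated_at").each do |key, value|
        file << "<#{key}>#{value}</#{key}>"
      end
      file << "</user>"
    end
  end
  file << "</users>"
end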
The Nokogiri gem has a nice interface for creating XML from scratch; Nokogiri is a wrapper around libxml2.
Add it to your Gemfile:
gem 'nokogiri'
To generate XML, simply use the Nokogiri XML builder like this:
xml = Nokogiri::XML::Builder.new { |xml|
  xml.body do
    xml.test1 "some string"
    xml.test2 890
    xml.test3 do
      xml.test3_1 "some string"
    end
    xml.test4 "with attributes", :attribute => "some attribute"
    xml.closing
  end
}.to_xml
Output:
<?xml version="1.0"?>
<body>
  <test1>some string</test1>
  <test2>890</test2>
  <test3>
    <test3_1>some string</test3_1>
  </test3>
  <test4 attribute="some attribute">with attributes</test4>
  <closing/>
</body>
Demo: http://www.jakobbeyer.de/xml-with-nokogiri

Slow performance when deleting a huge amount of rows

The problem
I'm trying to parse a huge CSV file (27 MB) and delete a large number of rows, but I'm running into performance issues.
Specifications
Rails version 4.2.0, Postgres as the database
videos table has 300,000 rows
categories_videos pivot table has 885,000 rows
Loading the external CSV file takes 29,097 ms
The external CSV file has 3,117,000 lines (one deleted video ID per line)
The task
I have a large CSV file (27 MB) with the IDs of videos that were deleted. I have to go through this file, check whether any videos in my database have a matching ID, and if so delete them from my db.
1) roughly 126724ms (per chunk)
file_location = 'http://my_external_source/file.csv';

open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.delete_all(:id => chunk)
    VideoCategoryRelation.delete_all(:video_video_id => chunk)
  end
end
2) roughly 90000ms (per chunk)
file_location = 'http://my_external_source/file.csv';

open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.where(:video_id => chunk).destroy_all
  end
end
Is there any efficient way to go through this that would not take hours?
I don't know Ruby or the database you are using, but it looks like there are a lot of separate delete calls to the database.
Here's what I would try to speed things up:
First, make sure you have an index on id in both tables.
In each table, create a field (boolean or small int) to mark a record for deletion. In your loop, instead of deleting, just set the deletion field to true (this should be fast if you have an index on id). Only at the end, call delete once on each table (delete from the table where the delete marker is true), as in the sketch below.
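In ActiveRecord terms that could look roughly like this. It is only a sketch: it assumes a hypothetical boolean column videos.marked_for_deletion (added by a migration) and uses the column names from the first attempt above:

open(file_location, 'r:utf-8') do |f|
  SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    ids = chunk.map { |row| row[:id] }
    # One indexed UPDATE per chunk instead of two DELETEs per chunk.
    Video.where(:id => ids).update_all(:marked_for_deletion => true)
  end
end

# A single pass per table at the end.
marked_ids = Video.where(:marked_for_deletion => true).pluck(:id)
VideoCategoryRelation.where(:video_video_id => marked_ids).delete_all
Video.where(:id => marked_ids).delete_all

If the marked ID list gets very large, the final deletes can themselves be run in slices (for example with each_slice), but the overall shape stays the same.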

Exceptionally slow import task

I have a rake task (Rails 3 / Mongoid) that takes a lot of time to complete for no apparent reason. My guess is that I'm doing something multiple times where it's not needed, or that I'm missing something very obvious (I'm no MongoDB or Mongoid expert):
task :fix_editors => :environment do
  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true, force_quotes: true) do |row|
      editors = Hash[*Editor.all.collect { |ed| [ed.name, ed.id] }.flatten]
      begin
        book = Book.where(internal_id: row["ID"], editorial_data_checked: false).first
        if book && !row["Marchio"].nil?
          editor_name = HTMLEntities.new.decode(row['Marchio']).strip.titleize
          editor_id = editors[editor_name]
          unless editor_id
            editor = Editor.create(name: editor_name)
            editors[editor_name] = editor.id
            editor_id = editor.id
          end
          if book.update_attributes(editor_id: editor_id, editorial_data_checked: true)
            puts "#{book.slug} updated with editor data"
          else
            puts "Nothing done for #{book.slug}"
          end
        end
      rescue => e
        puts e
        retry
      end
    end
  end
end
The CSV I had to read at the beginning was very big, so I split it into 50 smaller files (that was my first attempt to speed things up).
Then I tried to remove all the queries I could; that's why it doesn't read from the Editor collection for every row, but collects all of them at the beginning and then just looks things up in a hash.
At the end I removed all save calls and used update_attributes.
The Book collection is more or less 1 million records, so it's pretty large. I have 13k Editors, so no big deal there.
Here is my Book class:
https://gist.github.com/anonymous/087e6c81ef5f355a160d
Locally it takes more than 1 second per row, which I don't think is normal, but feel free to let me know if you disagree. All writes take less than 0.1-0.2 seconds (I've used Benchmark.measure).
I'm out of ideas; can anybody help me? Am I missing something? Thanks in advance.
Move
editors = Hash[*Editor.all.collect {|ed| [ed.name, ed.id]}.flatten]
so that it becomes the second line of the task, right after
task :fix_editors => :environment do
Another thing you could do is batch processing: load 1,000 rows, then the corresponding 1,000 books, and then process those 1,000 books.
Also, do you have an index on the internal_id column of the books table?
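Put together, the reworked task might look roughly like this; it is a sketch that keeps the main logic from the question but builds the editors lookup only once (error handling and logging omitted):

task :fix_editors => :environment do
  # Build the name => id lookup a single time, not once per CSV row.
  editors = Hash[*Editor.all.collect { |ed| [ed.name, ed.id] }.flatten]

  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true, force_quotes: true) do |row|
      next if row["Marchio"].nil?

      book = Book.where(internal_id: row["ID"], editorial_data_checked: false).first
      next unless book

      editor_name = HTMLEntities.new.decode(row["Marchio"]).strip.titleize
      editor_id   = editors[editor_name] ||= Editor.create(name: editor_name).id
      book.update_attributes(editor_id: editor_id, editorial_data_checked: true)
    end
  end
end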

How do I skip over the first three rows instead of only the first in FasterCSV

I am using FasterCSV and I am looping with foreach like this:
FasterCSV.foreach("#{Rails.public_path}/uploads/transfer.csv", :encoding => 'u', :headers => :first_row) do |row|
but the problem is that my CSV has the first 3 lines as headers... is there any way to make FasterCSV skip the first three rows rather than only the first?
Not sure about FasterCSV, but in Ruby 1.9 standard CSV library (which is made from FasterCSV), I can do something like:
c = CSV.open '/path/to/my.csv'
c.drop(3).each do |row|
  # do whatever with row
end
I'm not a user of FasterCSV, but why not handle the skipping yourself:
additional_rows_to_skip = 2
FasterCSV.foreach("...", :encoding => 'u', :headers => :first_row) do |row|
  if additional_rows_to_skip > 0
    additional_rows_to_skip -= 1
  else
    # do stuff...
  end
end
Thanks to Mladen Jablanović, I got my clue. But I realized something interesting:
in 1.9, reading seems to happen from the current position in the file.
By this I mean that if you do
c = CSV.open iFileName
logger.debug c.first
logger.debug c.first
logger.debug c.first
you'll get three different results in your log, one for each of the three header rows.
c.each do |row| # now seems to start on the 4th row
It makes perfect sense that it would read the file this way, since then it only has to keep the current row in memory.
I still like Mladen Jablanović's answer, but this is an interesting bit of logic too.
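Building on that, a small sketch with the 1.9 CSV library that consumes the three header rows explicitly before iterating (the file path is just the example from the question):

csv = CSV.open("#{Rails.public_path}/uploads/transfer.csv", "r:utf-8")
3.times { csv.shift }   # read and discard the three header lines
csv.each do |row|
  # process row; iteration starts at the fourth line
end
csv.close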

Resources