Massive data export to CSV in Rails 4.2.7.1 - ruby-on-rails

I'm having RAM usage issues while exporting a huge amount of data to a CSV file in Rails. The method responsible for CSV generation:
def csv
  CSV.generate(headers: true, col_sep: ';', force_quotes: true) do |csv|
    csv << headers
    cached_data.each do |cached_object|
      # row_data is a multi-element (~100 elements) array of strings containing the data to be presented in the CSV
      row_data = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
      csv << row_data
    end
  end
end
selected_ordered_scopes is an array: [:id, :preferences, ...]
Each cached_object is written to Rails.cache as follows:
id => [...],
preferences => ['lorem', 'ipsum', ...],
other_key => ['lorem', 'ipsum', ...]
...
I use the gem 'get_process_mem' to analyze memory usage while executing the export action in the Rails app.
When I perform two, three, or four consecutive exports of 50,000 rows each (each row being the cached_object data described above), I see memory usage grow above 2 GB and never get released. Do you have any idea how to deal with that?
I would like to release the memory quickly after the first export, or perform the export some other way that doesn't use so much memory.
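One pattern that may help here (a sketch only, since the surrounding controller isn't shown; headers, selected_ordered_scopes, and cached_data are reused from the method above) is to stream the CSV one line at a time with an Enumerator instead of building the whole string with CSV.generate, so at most one row is held in memory at a time:
def csv_enumerator
  Enumerator.new do |yielder|
    # each yielded chunk is a single CSV line, built with the same options as above
    yielder << CSV.generate_line(headers, col_sep: ';', force_quotes: true)
    cached_data.each do |cached_object|
      row_data = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
      yielder << CSV.generate_line(row_data, col_sep: ';', force_quotes: true)
    end
  end
end
In a controller this enumerator could be handed to the response body (for example, self.response_body = csv_enumerator) so the full CSV string is never materialized at once.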

Related

Read large XLS and put into Array is very slow

I have a large XLS file with postal codes, and the problem is that it's quite slow to read the data. The file has multiple sheets, one per state name; each sheet has multiple rows with postal code, neighborhood, and municipality. The file has 33 states, and each state has between 1,000 and 9,000 rows.
I try to parse this into an array of hashes, which takes 22 seconds. Is there any way to read this faster?
This is how I read the sheet
def read_sheet(sheet_name:, offset: 1)
  sheet = file.worksheet sheet_name[0..30]
  clean_data = sheet.each_with_index(offset)
                    .lazy
                    .reject { |k, _| !k.any? }
  data = clean_data.map do |row, _index|
    DATA_MAPPING.map do |field, column|
      { field => row[column] }
    end.inject(:merge)
  end
  yield data
end
And I retrieve it all with:
def read_file
  result = {}
  sheets_titles.each do |name|
    read_sheet(sheet_name: name) do |data|
      result.merge! name => data.to_a
    end
  end
  result
end
So, if I use .to_a or .to_json or any method to process the data and insert it into the DB, I have to wait a few seconds ... any suggestions?
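One thing that might shave some time (a sketch only, assuming DATA_MAPPING maps field names to column indexes as the code above suggests) is to build each row hash directly instead of creating and merging one single-key hash per cell:
data = clean_data.map do |row, _index|
  # one hash per row instead of DATA_MAPPING.size intermediate hashes
  DATA_MAPPING.each_with_object({}) do |(field, column), hash|
    hash[field] = row[column]
  end
end
For the DB side, inserting the collected rows in bulk (for example with a bulk-insert gem) rather than one record at a time usually matters more than the parsing itself.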

How to manipulate a CSV object in ruby?

I want to export some ActiveRecord objects in CSV format. After checking some tutorials, I found this:
def export_as_csv(equipments)
  attributes = %w[id title description category_id]
  CSV.generate(headers: true) do |csv|
    csv << attributes
    equipments.each do |equipment|
      csv << equipment.attributes.values_at(*attributes)
    end
    return csv
  end
end
The problem is, I want to manipulate everything in memory in my tests (i.e. I don't want to save the file to disk). So, when I receive this csv object as a return value, how can I iterate through its rows and columns? I come from Python, so I tried:
csv = exporter.export_as_csv(equipments)
for row in csv:
foo(row)
But obviously that didn't work. Also, equipments is definitely not nil.
CSV.generate returns a string formatted according to CSV rules.
So the most obvious way is to parse it and iterate, like:
csv = exporter.export_as_csv(equipments)
CSV.parse(csv).each do |line|
  # line => ['a', 'b', 'c']
end
After watching some videos, I found that the return was the problem. By returning from inside the block, I was receiving the CSV object, and not the generated CSV string itself.
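For reference, a sketch of the same method without the early return, so that the string built by CSV.generate is what the caller receives:
def export_as_csv(equipments)
  attributes = %w[id title description category_id]
  # CSV.generate already returns the generated string, so no explicit return is needed
  CSV.generate(headers: true) do |csv|
    csv << attributes
    equipments.each do |equipment|
      csv << equipment.attributes.values_at(*attributes)
    end
  end
end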

Slow performance when deleting huge amount of rows

The problem
I'm trying to parse a huge CSV file (27 MB) and delete a large number of rows, but I'm running into performance issues.
Specifications
Rails version 4.2.0, Postgres as the DB
videos table has 300,000 rows
categories_videos pivot table has 885,000 rows
Loading the external CSV file takes 29,097 ms
The external CSV file has 3,117,000 lines (1 deleted video ID per line)
The task
I have a large CSV file (27 MB) with the IDs of videos that were deleted. I have to go through this file, check whether any videos in my database have a matching ID, and delete them from my DB if they do.
1) roughly 126,724 ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.delete_all(:id => chunk)
    VideoCategoryRelation.delete_all(:video_video_id => chunk)
  end
end
2) roughly 90,000 ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.where(:video_id => chunk).destroy_all
  end
end
Is there any efficient way to go through this that would not take hours?
I don't know Ruby or the database you are using, but it looks like there are a lot of separate delete calls to the database.
Here's what I would try to speed things up:
First, make sure you have an index on id in both tables.
In each table, create a field (boolean or small int) to mark a record for deletion. In your loop, instead of deleting, just set the deletion field to true (this should be fast if you have an index on id). Only at the end, call delete once on each table (delete from the table where the deletion marker is true).
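In Rails terms that could look roughly like this (a sketch only, assuming a boolean marked_for_deletion column has been added to both tables; the key names are taken from the question and may need adjusting):
# inside the SmarterCSV chunk block: mark instead of deleting
ids = chunk.map { |row| row[:id] }
Video.where(:id => ids).update_all(:marked_for_deletion => true)
VideoCategoryRelation.where(:video_video_id => ids).update_all(:marked_for_deletion => true)

# after all chunks have been processed: one bulk delete per table
Video.where(:marked_for_deletion => true).delete_all
VideoCategoryRelation.where(:marked_for_deletion => true).delete_all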

Can Rails sort a CSV file?

I'm exporting a CSV built from many different sources, which makes it very hard to sort the data before putting it into the CSV.
csv = CSV.generate col_sep: '#' do |csv|
  ... adding a few columns here
end
Now, it would be awesome if I was able to sort this CSV by the 2nd column. Is that in any way possible?
If you're trying to sort before writing, it depends on your data structure, and I'd need to see a bit more of your code. For reading a CSV, you can convert each row to a hash and even sort by header name:
rows = []
CSV.foreach('mycsvfile.csv', headers: true) do |row|
  rows << row.to_h
end
rows.sort_by { |row| row['last_name'] }
Edited to use sort_by, thanks to max williams.
Here is how you would sort by column number:
rows = []
CSV.foreach('mycsvfile.csv', headers: true) do |row|
  # collect each row as an array of values only
  rows << row.to_h.values
end
# sort in place by the 2nd column
rows.sort_by! { |row| row[1] }
rows.each do |row|
  # do stuff with your now sorted rows
end
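Tying this back to the question's generated string (a sketch, assuming the '#' separator and no header row in the generated output): parse it, sort by the 2nd column, and regenerate:
sorted_csv = CSV.generate(col_sep: '#') do |out|
  CSV.parse(csv, col_sep: '#').sort_by { |row| row[1] }.each { |row| out << row }
end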

Exporting large amounts of data using FasterCSV with Rails

I have a controller in Rails that generates CSV reports using FasterCSV. These reports will contain approximately 20,000 rows, maybe more.
It takes about 30 seconds or more to create the csv_string in my implementation below. Is there a better/faster way to export the data? Is there any way to output the data without having to store it all in memory in the csv_string?
My current implementation is as follows:
@report_data = Person.find(:all, :conditions => "age > #{params[:age]}")
csv_string = FasterCSV.generate do |csv|
  @report_data.each do |e|
    values = [e.id, e.name, e.description, e.phone, e.address]
    csv << values
  end
end
send_data csv_string, :type => "text/plain",
          :filename => "report.csv", :disposition => 'attachment'
I would try using find_in_batches, to avoid having that many ActiveRecord objects in memory at once.
http://ryandaigle.com/articles/2009/2/23/what-s-new-in-edge-rails-batched-find
I believe that should help quite a bit; creating and holding many ActiveRecord objects in memory is slow.
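A sketch of how that could look with the question's column list (the 1,000-row batch size is just an illustrative default, and the :conditions style matches the Rails version implied by FasterCSV):
csv_string = FasterCSV.generate do |csv|
  Person.find_in_batches(:conditions => ["age > ?", params[:age]], :batch_size => 1000) do |batch|
    batch.each do |e|
      csv << [e.id, e.name, e.description, e.phone, e.address]
    end
  end
end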
