Slow performance when deleting huge amount of rows - ruby-on-rails

The problem
I'm trying to parse a huge CSV file (27 MB) and delete a large number of rows, but I'm running into performance issues.
Specifications
Rails 4.2.0, Postgres as the database
videos table has 300,000 rows
categories_videos pivot table has 885,000 rows
Loading the external CSV file takes 29,097 ms
The external CSV file has 3,117,000 lines (one deleted video ID per line)
The task
I have a large CSV file (27 MB) with the IDs of videos that were deleted. I have to go through this file, check whether any videos in my database have a matching ID, and if so, delete them from my database.
1) roughly 126724ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.delete_all(:id => chunk)
    VideoCategoryRelation.delete_all(:video_video_id => chunk)
  end
end
2) roughly 90000ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.where(:video_id => chunk).destroy_all
  end
end
Is there any efficient way to go through this that would not take hours?

I don't know Ruby or the database you are using, but it looks like there are a lot of separate delete calls going to the database.
Here's what I would try to speed things up:
First, make sure you have an index on id in both tables.
In each table, create a field (a boolean or a small int) to mark a record for deletion. In your loop, instead of deleting, just set that deletion field to true (this should be fast if you have an index on id). Only at the end, call delete once on each table (DELETE FROM table WHERE the delete marker is true).
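A rough sketch of that suggestion against the asker's setup (the marked_for_deletion column name and the migration are my own assumptions, not part of the answer; video_video_id is taken from the question's code):

# Assumed migration: add a deletion marker to both tables and index the
# pivot table's lookup column.
class AddDeletionMarkers < ActiveRecord::Migration
  def change
    add_column :videos, :marked_for_deletion, :boolean, default: false
    add_column :categories_videos, :marked_for_deletion, :boolean, default: false
    add_index  :categories_videos, :video_video_id
  end
end

# In the import: mark in bulk per chunk, then delete once per table at the end.
open(file_location, 'r:utf-8') do |f|
  SmarterCSV.process(f, :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000) do |chunk|
    ids = chunk.map { |row| row[:id] }
    Video.where(:id => ids).update_all(:marked_for_deletion => true)
    VideoCategoryRelation.where(:video_video_id => ids).update_all(:marked_for_deletion => true)
  end
end
Video.where(:marked_for_deletion => true).delete_all
VideoCategoryRelation.where(:marked_for_deletion => true).delete_all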

Related

Read large XLS and put into Array is very slow

I have a large XLS file with postal codes. The problem is that it is quite slow to read the data. The file has multiple sheets, one per state name; in each sheet there are multiple rows with postal code, neighborhood, and municipality. The file has 33 states, and each state has between 1000 and 9000 rows.
I try to parse this into an array of hashes, and each one takes about 22 seconds. Is there any way to read this faster?
This is how I read a sheet:
def read_sheet(sheet_name:, offset: 1)
  sheet = file.worksheet sheet_name[0..30]
  clean_data = sheet.each_with_index(offset)
                    .lazy
                    .reject { |k, _| !k.any? }
  data = clean_data.map do |row, _index|
    DATA_MAPPING.map do |field, column|
      { field => row[column] }
    end.inject(:merge)
  end
  yield data
end
And I retrieve all with
def read_file
  result = {}
  sheets_titles.each_with_index do |name|
    read_sheet(sheet_name: name) do |data|
      result.merge! name => data.to_a
    end
  end
  result
end
So, whether I use .to_a or .to_json or any other method to process the data and insert it into the DB, I have to wait a few seconds... any suggestions?
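One small thing I would try (an idea of mine, not from this thread) is building each row hash in a single pass inside read_sheet, instead of creating and merging one tiny hash per mapped field:

# Same mapping as above, but one hash per row instead of a merge per field.
data = clean_data.map do |row, _index|
  DATA_MAPPING.each_with_object({}) do |(field, column), hash|
    hash[field] = row[column]
  end
end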

Iterating over Sets of ActiveRecord Query Batch

I have a table that has a few thousand sets of 2-3 nearly identical records that all share an item "id" (not the database ID). That is, two to three records share the same item id; there are about 2100 records in total, or ~700 unique items. Example:
{id: 1, product_id:333, is_special:true}, {id:2, product_id:333, is_special:false}, {id:3, product_id:333, is_special:false}, {id:4, product_id:334, is_special:false}...
I'd like to perform a query that lets me iterate over each set, modify/remove duplicate records, then move on to the next set.
This is what I currently have:
task find_averages: :environment do
  responses = Response.all
  chunked_responses = responses.chunk_while { |a, b| a.result_id == b.result_id }.to_a
  chunked_responses.each do |chunk|
    if chunk.length < 3
      chunk.each do |chunky_response|
        chunky_response.flagged = true
        chunky_response.save
      end
    else
      chunk.each do |chunky_response|
        # manipulate each item in the chunk here
      end
    end
  end
end
Edit: I worked this one out after discovering the chunk_while method. I am not positive chunk_while is the most efficient method here, but it works well. I am closing this, but for anyone else who needs to group records and then iterate over them, this should help.
The following code iterates over an array of items, some of which share common values, and groups them by those common values:
responses = Response.all
chunked_responses = responses.chunk_while { |a, b| a.result_id == b.result_id }.to_a
chunked_responses.each do |chunk|
  chunk.each do |chunky_response|
    # manipulate each item in the chunk here
  end
end
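One caveat worth adding (my note, not part of the original post): chunk_while only groups adjacent records, so the relation should be ordered by result_id first, otherwise records sharing an id can land in different chunks. A minimal sketch:

# Ordering first keeps records with the same result_id adjacent,
# which is what chunk_while relies on.
chunked_responses = Response.order(:result_id)
                            .chunk_while { |a, b| a.result_id == b.result_id }
                            .to_a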

Massive data export to CSV in Rails 4.2.7.1

I'm having RAM usage issues while exporting huge amounts of data to a CSV file in Rails. My method responsible for CSV generation:
def csv
  CSV.generate(headers: true, col_sep: ';', force_quotes: true) do |csv|
    csv << headers
    cached_data.each do |cached_object|
      # row_data is a multi-element (~100 elements) array of strings containing the data to be presented in the csv
      row_data = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
      csv << row_data
    end
  end
end
selected_ordered_scopes is an array: [:id, :preferences, ...]
cached_object is written to Rails.cache as follows:
id => [...],
preferences => ['lorem', 'ipsum', ...],
other key => ['lorem', 'ipsum', ...]
...
I use the 'get_process_mem' gem to analyze memory usage while executing the export action in the Rails app.
When I perform two/three/four consecutive exports of data containing 50,000 rows (each row is the cached_object data described above), I see memory usage grow above 2 GB and never get released. Do you have any idea how to deal with that?
I would like to release the memory quickly after the first export, or perform the export some other way that doesn't use so much memory.
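A direction worth trying (a sketch of mine, reusing headers, cached_data, and selected_ordered_scopes from the question) is to generate the CSV line by line, so the full export never has to live in memory as a single string:

require 'csv'

# Yield one generated CSV line at a time instead of accumulating a string.
def each_csv_row
  return to_enum(:each_csv_row) unless block_given?
  yield CSV.generate_line(headers, col_sep: ';', force_quotes: true)
  cached_data.each do |cached_object|
    row_data = selected_ordered_scopes.map { |scope| cached_object[scope.to_s] }.flatten
    yield CSV.generate_line(row_data, col_sep: ';', force_quotes: true)
  end
end

# In a controller action, the enumerator can back a streamed response:
# self.response_body = each_csv_row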

Exceptionally slow import task

I have a rake task (Rails 3 / Mongoid) that takes a lot of time to complete for no apparent reason. My guess is that I'm doing something multiple times where it's not needed, or that I'm missing something very obvious (I'm no MongoDB or Mongoid expert):
task :fix_editors => :environment do
  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true, force_quotes: true) do |row|
      editors = Hash[*Editor.all.collect { |ed| [ed.name, ed.id] }.flatten]
      begin
        book = Book.where(internal_id: row["ID"], editorial_data_checked: false).first
        if book && !row["Marchio"].nil?
          editor_name = HTMLEntities.new.decode(row['Marchio']).strip.titleize
          editor_id = editors[editor_name]
          unless editor_id
            editor = Editor.create(name: editor_name)
            editors[editor_name] = editor.id
            editor_id = editor.id
          end
          if book.update_attributes(editor_id: editor_id, editorial_data_checked: true)
            puts "#{book.slug} updated with editor data"
          else
            puts "Nothing done for #{book.slug}"
          end
        end
      rescue => e
        puts e
        retry
      end
    end
  end
end
The CSV I had to read at the beginning was very big, so I split it into 50 smaller files (that was my first attempt to speed things up).
Then I tried to remove all the queries I could; that's why it doesn't read from the Editor collection for every row, but collects all the editors at the beginning and then just looks them up in a hash.
At the end I removed all save calls and used update_attributes instead.
The Book collection has roughly 1 million records, so it's pretty large. I have 13k Editors, so no big deal there.
Here is my Book class:
https://gist.github.com/anonymous/087e6c81ef5f355a160d
Locally it takes more than 1 second per row, which I don't think is normal, but feel free to let me know if you disagree. All writes take less than 0.1/0.2 s (I've used Benchmark.measure).
I'm out of ideas; can anybody help me? Am I missing something? Thanks in advance.
Move the line
editors = Hash[*Editor.all.collect {|ed| [ed.name, ed.id]}.flatten]
so that it sits right after
task :fix_editors => :environment do
Another thing you could do is batch processing: load 1000 rows, then the corresponding 1000 books, and then process those 1000 books.
Do you have an index on internal_id in the books collection?
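A sketch of the hoisted lookup and the index suggestion applied to the task above (the index syntax assumes Mongoid 3+, and the actual Book model is only linked as a gist, so treat this as an assumption):

task :fix_editors => :environment do
  # Build the name => id lookup once, not once per CSV row.
  editors = Hash[*Editor.all.collect { |ed| [ed.name, ed.id] }.flatten]

  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true) do |row|
      # ... same body as above, without the editors line ...
    end
  end
end

# Assumed Mongoid 3+ index declaration for the lookup field used in the task;
# only the index line matters, the rest of the Book class lives in the gist.
class Book
  include Mongoid::Document
  index({ internal_id: 1 }, { background: true })
end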

Exporting large amounts of data using FasterCSV with Rails

I have a controller in Rails that generates CSV reports using FasterCSV. These reports will contain approximately 20,000 rows, maybe more.
It takes about 30 seconds or more to create the csv_string in my implementation below. Is there a better/faster way to export the data? Is there any way to output the data without having to store it all in memory in the csv_string?
My current implementation is as follows:
@report_data = Person.find(:all, :conditions => "age > #{params[:age]}")

csv_string = FasterCSV.generate do |csv|
  @report_data.each do |e|
    values = [e.id, e.name, e.description, e.phone, e.address]
    csv << values
  end
end

send_data csv_string, :type => "text/plain",
          :filename => "report.csv", :disposition => 'attachment'
I would try using find_in_batches to avoid having that many ActiveRecord objects in memory at once.
http://ryandaigle.com/articles/2009/2/23/what-s-new-in-edge-rails-batched-find
I believe that should help quite a bit; creating and holding many ActiveRecord objects in memory is slooowww.
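A rough sketch of that suggestion applied to the controller code above (kept in the question's old-style finder syntax; the exact options find_in_batches accepts depend on the Rails version in use):

csv_string = FasterCSV.generate do |csv|
  # Load people 1000 at a time instead of instantiating the whole result set.
  Person.find_in_batches(:conditions => ["age > ?", params[:age]], :batch_size => 1000) do |batch|
    batch.each do |e|
      csv << [e.id, e.name, e.description, e.phone, e.address]
    end
  end
end

send_data csv_string, :type => "text/plain",
          :filename => "report.csv", :disposition => 'attachment'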
