Rails: export millions of rows to CSV - ruby-on-rails

So I found a lot of articles where people are having issues exporting big data into a CSV with Rails. I'm able to do this, but it takes about 40 seconds per 20k rows.
Has anyone overcome this issue? I searched everywhere for the past couple of hours and couldn't find anything that worked for me.
Thanks!

Suppose you want to load 1k rows into a CSV. You can write a rake task that accepts a limit and offset to pull data from the table, then drive it from a Ruby script something like the one below:
batch_size = 100
offset = 0

10.times do |index|
  # launch each rake invocation in the background so the batches run in parallel,
  # giving each one its own output file so they don't clobber one another
  system("nohup rake 'my_task:load_csv[#{batch_size},#{offset},#{index}]' > rake_#{index}.out 2>&1 &")
  offset += batch_size
end
** Refer to this link to learn more about how to run rake tasks in the background
The rake task will be something like:
namespace :my_task do
  task :load_csv, [:limit, :offset, :index] => :environment do |_task, args|
    # load data from the table using args[:limit] and args[:offset]
    # write the rows returned by that query to FILE_NAME_#{args[:index]}.csv
  end
end
Once you see that all the rake tasks have finished, combine the files by index. If you want to automate the combining step, you need to write some process monitoring: grep for the active rake tasks and store their PIDs in an array, then every 15 seconds or so check each PID's status and pop a PID from the array once its process is no longer running. Continue until the array is empty, i.e. all the rake tasks are finished, and then merge the files by their index.
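A minimal sketch of that monitoring-and-merge loop, assuming the tasks were launched as above and each one wrote a FILE_NAME_<index>.csv (the placeholder name from the rake task):
# grab the PIDs of the backgrounded rake tasks
pids = `pgrep -f "my_task:load_csv"`.split.map(&:to_i)

until pids.empty?
  # Process.kill(0, pid) only checks the process; it raises Errno::ESRCH once it has exited
  pids.reject! do |pid|
    begin
      Process.kill(0, pid)
      false
    rescue Errno::ESRCH
      true
    end
  end
  sleep 15
end

# all rake tasks are finished: merge the per-index files in order
File.open("combined.csv", "w") do |out|
  10.times { |i| out.write(File.read("FILE_NAME_#{i}.csv")) }
end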
Hopefully this helps you. Thanks!

Related

Ruby CSV.foreach start at specific row

I've seen a couple of posts on this with no real answers or out-of-date answers, so I'm wondering if there are any new solutions. I have an enormous CSV I need to read in. I can't call open() on it because it kills my server. I have no choice but to use .foreach().
Doing it this way, my script will take 6 days to run. I want to see if I can cut that down by using Threads and splitting the task in two or four. So one thread reads lines 1 to n and another thread simultaneously reads lines n+1 to the end.
So I need to be able to read only the last half of the file in one thread (and later, if I split it into more threads, just a specific line through a specific line).
Is there any way in Ruby to do this? Can this start at a certain row?
CSV.foreach(FULL_FACT_SHEET_CSV_PATH) do |trial|
  # ...
end
EDIT:
Just to give an idea of what one of my threads looks like:
threads << Thread.new {
  CSV.open('matches_thread3.csv', 'wb') do |output_csv|
    output_csv << HEADER
    count = 1
    index = 0
    CSV.foreach(CSV_PATH) do |trial|
      index += 1
      if index > 120000
        break if index > 180000
        # do stuff
      end
    end
  end
}
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.
If this is still relevant, you can do something like this using .with_index:
rows_array = []
CSV.foreach(path).with_index do |row, i|
  next if i == 0 # skip the header row
  # `columns` here is whatever array of column indexes you want to keep
  rows_array << columns.map { |n| row[n] }
end
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.
Impossible. The content of a CSV file is just a blob of text with some commas and newlines. You can't know at which offset in the file row N starts without knowing where row N-1 ends. And to know that, you have to know where row N-1 starts (see the recursion?) and read the file until you see where it ends (i.e. encounter a newline that is not part of a field value).
The exception is if all your rows are of a fixed size, in which case you can seek directly to offset 120_000 * row_size. I have yet to see a file like that, though.
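A minimal sketch of that fixed-size case, assuming a hypothetical ROW_SIZE in bytes (newline included) and no embedded newlines in any field:
require 'csv'

ROW_SIZE  = 64        # hypothetical fixed record length in bytes, including the newline
START_ROW = 120_000

File.open(CSV_PATH) do |f|
  f.seek(START_ROW * ROW_SIZE)   # jump straight to row 120,000 without reading the earlier rows
  f.each_line do |line|
    row = CSV.parse_line(line)
    # do stuff with row
  end
end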
If I understand your question correctly, something like this in plain Ruby may help you:
require 'csv'

csv_file = "matches_thread3.csv"

# chunk size for each batch of work
CHUNK_SIZE = 120_000

# split("\n") - turns the file into an array of raw CSV lines
# drop(1)     - skips the header row
# each_slice  - groups the lines into chunks of CHUNK_SIZE
File.read(csv_file).split("\n").drop(1).each_slice(CHUNK_SIZE).with_index do |chunk, index|
  data = []
  # each chunk can be handed off as a separate job of 120,000 records
  chunk.each do |row|
    data << row
    # do stuff
  end
end

Sidekiq worker processes not updating database records properly

I have a Sidekiq worker that processes a series of tasks in batches. Once it completes a job, it updates a tracker table with the success/failure of the task. Each batch has a unique identifier that is passed to the worker, and the worker process queries the tracker table for this unique id and updates that particular row through an ActiveRecord query similar to:
cpr = MODEL.find(tracker_unique_id)
cpr.update_attributes(:attempted => cpr[:attempted] + 1, :success => cpr[:success] + 1)
What I have noticed is that the tracker only records one task run, even though I can see from the Sidekiq log and another results table that x tasks finished running.
Can anyone help me with this?
Your update_attributes call has a race condition: you cannot safely increment like that, because multiple workers will read, increment, and write back over each other. You must do the increment in a single UPDATE SQL statement.
update models set attempted = attempted + 1 where tracker_unique_id = ?
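In ActiveRecord you can issue that atomic increment without dropping to raw SQL; a sketch, assuming MODEL is the tracker model from the question:
# both forms push the arithmetic into a single UPDATE, so concurrent
# Sidekiq workers cannot overwrite each other's counts

# option 1: update_counters increments columns atomically by primary key
MODEL.update_counters(tracker_unique_id, :attempted => 1, :success => 1)

# option 2: update_all with SQL arithmetic on the matching row
MODEL.where(:id => tracker_unique_id).update_all("attempted = attempted + 1, success = success + 1")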

Finding mongoDB records in batches (using mongoid ruby adapter)

Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds to the Mongo DB? I need to grab all the records in a particular MongoDB collection and index them in Solr (the initial index of the data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them in memory. Then when I process them and index them in Solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in Mongo so that I can iterate over 1,000 records at a time, pass them to Solr to index, and then process the next 1,000, etc.
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches and keep the memory from getting out of control. However, I can't seem to find anything like this for MongoDB/Mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only working on a manageable problem set each time. The documentation is sparse, however, on doing batch finds in MongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying cursor already fetches the records in batches; by default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
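Depending on your Mongoid/driver version, you can also raise the cursor's batch size per query (the same batch_size call is used in an answer further down); a small sketch:
Model.all.batch_size(500).each do |r|
  Sunspot.index(r)
end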
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In that case you need to perform multiple queries so that you don't leave a cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds that batching functionality to any criteria. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less: as said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to Sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs,
I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to use Sunspot instead
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry on the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start the Resque workers and they will take care of everything.
As #RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you, just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end

Update database after duration (Ruby on Rails)

I have a Cycle model with two fields: duration (string) and completed (boolean). When a user creates a cycle, they enter the duration (let's say 30 minutes) and the cycle is set to not completed (boolean 0). How do I update that database entry after the cycle duration (30 minutes) to mark the cycle as complete (boolean 1)? Is there a way to handle this with Ruby/Rails code, or do I have to execute a JavaScript function?
The goal is to be able to find and display all completed cycles using Cycle.all(:conditions..) against the SQL database. I wrote a "complete?" method in the cycle model that compares the age of the cycle to the duration, but this is useless for SQL find methods.
What's the best way to tackle this? Thanks!
Define a rake task that runs something like:
desc "Expire old cycles"
task :cron => :environment do
expired = Cycle.all :conditions => ["expiration < ?", DateTime.now]
expired.each { |c| c.expire! }
end
Where Cycle#expire! is a method that marks the cycle as expired in the database. Then set up rake cron to run every N minutes via a cron job.
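The cron entry could look something like this (the path, schedule, and log file are placeholders for whatever fits your deployment):
# run the expiry task every 5 minutes
*/5 * * * * cd /path/to/app && bundle exec rake cron RAILS_ENV=production >> log/cron.log 2>&1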
If you're comfortable doing this in SQL, you can optimize it with a single query: UPDATE cycles SET complete = 1 WHERE expiration < NOW();
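If you'd rather stay in Ruby, the ActiveRecord equivalent of that single UPDATE (Rails 3+ query syntax, reusing the expiration column and completed flag from this thread) would be roughly:
# marks every overdue cycle complete in one UPDATE, without loading records
Cycle.where("expiration < ?", Time.now).update_all(:completed => true)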
You can add another field, let's say Expired_time, that stores when the cycle will be complete. For example:
# Here is an example record:
Duration    Created_at    Expired_time
30 mins     Time          Time + 30 mins
Then simply compare the current time with Expired_time to check whether the cycle is complete or not.
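With that extra column in place, "find all completed cycles" becomes a plain SQL condition; a sketch using the finder style from the question, assuming the column is named expired_time:
# completed cycles are the ones whose expiry time has already passed
completed_cycles = Cycle.all(:conditions => ["expired_time <= ?", Time.now])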

Rails: rake db:migrate *very* slow on Oracle

I'm using Rails with the oracle_enhanced adapter to create a new interface for a legacy application.
Database migrations work successfully, but take an incredibly long time before rake finishes. The database changes themselves happen pretty quickly (1 or 2 seconds), but the db/schema.rb dump takes over an hour to complete. (See the example migration below.)
It's a relatively large schema (about 150 tables), but I'm sure it shouldn't be taking this long to dump out each table description.
Is there any way to speed this up by just taking the last schema.rb and applying the change specified in the migration to it? Or am I able to skip this schema dump altogether?
I understand this schema.rb is used to create the test database from scratch each time, but in this case there's a large chunk of the database logic in table triggers which isn't included in the schema.rb anyway, so the rake tests are no good to us in any case. (That's a whole different issue that I need to sort out at some other point.)
dgs#dgs-laptop:~/rails/voyager$ time rake db:migrate
(in /home/dgs/rails/voyager)
== 20090227012452 AddModuleActionAndControllerNames: migrating ================
-- add_column(:modules, :action_name, :text)
-> 0.9619s
-> 0 rows
-- add_column(:modules, :controller_name, :text)
-> 0.1680s
-> 0 rows
== 20090227012452 AddModuleActionAndControllerNames: migrated (1.1304s) =======
real 87m12.961s
user 0m12.949s
sys 0m2.128s
After all migrations are applied to the database, rake db:migrate calls the db:schema:dump task to generate the schema.rb file from the current database schema.
db:schema:dump calls the adapter's "tables" method to get the list of all tables, then for each table calls the "indexes" and "columns" methods. You can find the SQL SELECT statements used in these methods in the activerecord-oracle_enhanced-adapter gem's oracle_enhanced_adapter.rb file. Basically it selects from ALL% or USER% data dictionary views to find all the information.
Initially I had issues with the original Oracle adapter when I used it with databases containing a lot of different schemas (performance can be affected by the total number of tables in the database, not just in your schema), and therefore I did some optimizations in the Oracle enhanced adapter. It would be good to find out which methods are slow in your case (I suspect it is either the "indexes" or the "columns" method, as they are executed for each table).
One way to debug this issue would be to put some debug messages in the oracle_enhanced_adapter.rb file so that you can identify which method calls are taking so long.
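For example, one way to get those timings without editing the gem in place is to wrap the three methods mentioned above from an initializer; a rough sketch (the file name and log format are made up, and this assumes Rails.logger is available):
# config/initializers/oracle_adapter_timing.rb (sketch only)
require 'benchmark'

ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.class_eval do
  [:tables, :indexes, :columns].each do |meth|
    alias_method "#{meth}_without_timing", meth

    # wrap each schema-introspection method and log how long it takes
    define_method(meth) do |*args|
      result = nil
      seconds = Benchmark.realtime { result = send("#{meth}_without_timing", *args) }
      Rails.logger.info("oracle_enhanced ##{meth}(#{args.first.inspect}) took #{'%.2f' % seconds}s")
      result
    end
  end
end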
Problem mostly solved after some digging around in oracle_enhanced_adapter.rb.
The problem came down to way too many tables in the local schema (many EBA_%, EVT_%, EMP_%, SMP_% tables had been created in there coincidentally at some point), archive tables being included in the dump, and a select from the data dictionaries taking 14 seconds to execute.
To fix the speed, I did three things:
Dropped all unneeded tables (about 250 out of 500)
Excluded archive tables from the schema dump
Cached the result of the long running query
This improved the time from the migration/schema dump for the remaining 350 tables from about 90 minutes to about 15 seconds. More than fast enough.
My code is as follows (for inspiration, not copying and pasting; this code is fairly specific to my database, but you should be able to get the idea). You need to create the temp table manually. It takes about 2 or 3 minutes for me to do, which is still too long to generate with each migration, and it's fairly static anyway =)
module ActiveRecord
  module ConnectionAdapters
    class OracleEnhancedAdapter
      def tables(name = nil)
        select_all("select lower(table_name) from all_tables where owner = sys_context('userenv','session_user') and table_name not like 'A!_%' escape '!' ").inject([]) do |tabs, t|
          tabs << t.to_a.first.last
        end
      end

      # TODO think of some way to automatically create the rails_temp_index table
      #
      # Table created by:
      #   create table rails_temp_index_table as
      #   SELECT lower(i.index_name) as index_name, i.uniqueness,
      #          lower(c.column_name) as column_name, i.table_name
      #   FROM all_indexes i, user_ind_columns c
      #   WHERE c.index_name = i.index_name
      #     AND i.owner = sys_context('userenv','session_user')
      #     AND NOT exists (SELECT uc.index_name FROM user_constraints uc
      #                     WHERE uc.constraint_type = 'P' and uc.index_name = i.index_name);
      def indexes(table_name, name = nil) #:nodoc:
        result = select_all(<<-SQL, name)
          SELECT index_name, uniqueness, column_name
          FROM rails_temp_index_table
          WHERE table_name = '#{table_name.to_s.upcase}'
          ORDER BY index_name
        SQL
        current_index = nil
        indexes = []
        result.each do |row|
          if current_index != row['index_name']
            indexes << IndexDefinition.new(table_name, row['index_name'], row['uniqueness'] == "UNIQUE", [])
            current_index = row['index_name']
          end
          indexes.last.columns << row['column_name']
        end
        indexes
      end
    end
  end
end
