How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand) with the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
  inserts = []
  User.find_each do |user|
    tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
    inserts << tuple
  end

  # At this point, the inserts array is of size at least 20,000
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
    conn.execute(sql)
  end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!

It depends on which PG version you are using, but in most cases this checklist is enough for bulk loading data into a table:
try to use COPY instead of INSERT whenever possible;
if using multiple INSERTs, disable autocommit and wrap all INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
disable indexes and checks/constraints on the target table;
disable table triggers;
alter the table so it becomes unlogged (possible since PG 9.5; don't forget to turn logging back on after the import), or increase max_wal_size so the WAL won't be flooded.
20k rows is not such a big deal for PG, so 2000-row sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
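For example, here is a minimal sketch of the question's sliced-INSERT approach wrapped in a single transaction (table and column names taken from the question; the tuple values are placeholders, and real values should be quoted with conn.quote):
task :backfill_my_new_table => :environment do
  conn = ActiveRecord::Base.connection

  inserts = []
  User.find_each do |user|
    # Build one "(ref_id, column_a, column_b, column_c)" tuple per user, quoting every value.
    inserts << "(#{user.id}, #{conn.quote('foo')}, #{conn.quote('bar')}, NULL)"
  end

  # One transaction means a single commit for the whole backfill instead of one per INSERT.
  ActiveRecord::Base.transaction do
    inserts.each_slice(2000) do |slice|
      conn.execute("INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(', ')}")
    end
  end
end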
UPD: also, a slightly old yet wonderful piece from depesz; excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.
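And if you can drop down to the pg driver, here is a rough sketch of streaming the same rows through COPY (an assumption-heavy example: it presumes a reasonably recent pg gem, with raw_connection returning the underlying PG::Connection):
pg = ActiveRecord::Base.connection.raw_connection

pg.copy_data("COPY my_new_table (ref_id, column_a, column_b, column_c) FROM STDIN WITH (FORMAT csv)") do
  User.find_each do |user|
    # One CSV line per row; in CSV format an unquoted empty field is read as NULL.
    pg.put_copy_data("#{user.id},foo,bar,\n")
  end
end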

Related

Syncing data between CSV and DB

I need to sync data between a CSV file and my DB, but it is a very slow process when I check whether each item exists.
For example, I have a very large list of postal codes; when this list is loaded into the system, the app needs to check whether each record already exists in the database.
I tried to use find_or_initialize_by, but it is very slow when the list of postal codes has more than 100_000 records ... I also tried to cache all the records from the database and compare them using .select, but it is almost as slow as using the database.
Any suggestion?
Using find_or_initialize_by is extremely slow for use cases like these, because this approach runs at least one query against the database for each line in the CSV file, and if the record wasn't found there will be a second insert query. Even if every single query is extremely fast, say they only take 5 ms each, they add up: with 100k lines in the CSV, the find_or_initialize_by calls alone will take over 8 minutes.
Therefore my approach would be to avoid doing many small database queries and instead do only a few, big queries and keep the data in memory.
First, load the existing records from the database, but only their unique parts instead of the whole records. For postal code data that might be the zip_code column. Then store that data in an in-memory data structure that allows fast lookups, for example a Set.
require 'set'
existing_zip_codes = Set.new(
  PostalCode.all.pluck(:zip_code)
)
Then iterate over the CSV and collect all the data that needs to be imported into the database.
missing_postal_codes = []

CSV.foreach(...) do |row|
  next if existing_zip_codes.include?(row['zip_code'])

  missing_postal_codes << {
    zip_code: row['zip_code'],
    city: row['city'],
    # ...
  }
end
And in the last step, I would insert all those missing records with one big insert_all call into the database.
PostalCode.insert_all(missing_postal_codes)
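Note that insert_all is only available from Rails 6 onwards (on older versions the activerecord-import gem offers a similar bulk interface), and for very large files it may be worth slicing the batch, roughly like this:
missing_postal_codes.each_slice(5_000) do |batch|
  PostalCode.insert_all(batch)
end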

locking rows on rails update to avoid collisions. (Postgres back end)

So I have a method on my model object which creates a unique sequence number when a binary field in the row is updated from null to true. It's implemented like this:
class AnswerHeader < ApplicationRecord
  before_save :update_survey_complete_sequence, if: :survey_complete_changed?

  def update_survey_complete_sequence
    maxval = AnswerHeader.maximum('survey_complete_sequence')
    self.survey_complete_sequence = maxval + 1
  end
end
My question is what do I need to lock so two rows being updated at the same time don't end up with two rows having the same survey_complete_sequence?
If it is possible to lock a single row rather than whole table that would be good because this is a often accessed table by users.
If you want to handle this in application logic itself, instead of letting the database handle it, you can make use of Rails' with_lock method, which creates a transaction and acquires a row-level DB lock on the selected rows (in your case a single row).
What you need to lock
In your case you have to lock the row containing the maximum survey_complete_sequence, since this is the row every query will look for while getting the value you require.
maxval = AnswerHeader.maximum('survey_complete_sequence')
Is it possible to lock a single row rather than whole table
There is no such specific lock for your scenario. But you can make use of Postgresql's SELECT FOR UPDATE row-level locking.
To acquire an exclusive row-level lock on a row without actually
modifying the row, select the row with SELECT FOR UPDATE.
And you can use pessimistic locking in rails and specify which lock you will use.
Call lock('some locking clause') to use a database-specific locking
clause of your own such as 'LOCK IN SHARE MODE' or 'FOR UPDATE NOWAIT'
Here's an example of how to achieve that, from the official Rails guide itself:
Item.transaction do
  i = Item.lock("LOCK IN SHARE MODE").find(1)
  ...
end
Relations using lock are usually wrapped inside a transaction to prevent deadlocks.
So what you need to do is (sketched below):
Apply a SELECT FOR UPDATE lock to the row containing maximum('survey_complete_sequence')
Get the value you require from that row
Update your AnswerHeader with the value received
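Putting those steps together, a rough, untested sketch (note the caveat raised in a later answer: a concurrent update that locks a different row can still read the same maximum, so a unique index remains a sensible safety net):
class AnswerHeader < ApplicationRecord
  before_save :update_survey_complete_sequence, if: :survey_complete_changed?

  def update_survey_complete_sequence
    # Lock the row currently holding the maximum value; before_save already runs
    # inside the save's transaction, so the lock is held until that transaction commits.
    current_max_row = AnswerHeader.where.not(survey_complete_sequence: nil)
                                  .order(survey_complete_sequence: :desc)
                                  .lock("FOR UPDATE")
                                  .first
    maxval = current_max_row ? current_max_row.survey_complete_sequence : 0
    self.survey_complete_sequence = maxval + 1
  end
end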
I believe you should give advisory locks a look. It makes sure the same block of code isn't executed on two machines simultaneously, while still keeping the table open for other business.
It uses the database, but it doesn't lock your tables.
You can use the gem called "with_advisory_lock" like this:
Model.with_advisory_lock("ADVISORY_LOCK_NAME") do
# Your code
end
https://github.com/ClosureTree/with_advisory_lock
It doesn't work with SQLite.
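Applied to the sequence example in the question, one hypothetical way to use it is to wrap the save itself, so that the callback's max+1 computation and the commit both happen while the lock is held:
# The lock name is arbitrary; with_advisory_lock blocks until the lock is free.
answer_header.survey_complete = true

AnswerHeader.with_advisory_lock("survey_complete_sequence") do
  answer_header.save!
end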
If you are using Postgres, maybe Sequenced can help you out without defining a sequence at the DB level.
Is there a reason survey_complete_sequence should be incremental? If not, maybe randomize a bigint?
You probably don't want to lock the table, and even if you lock the row you're currently updating, the row you're basing your maxval on will still be available for another update to read and use to generate its sequence #.
Unless you have a huge table and lots of updates every millisecond (on the order of thousands), this shouldn't be an issue in real life. But if the idea bothers you, you can go ahead and add a unique index to the table on the "survey_complete_sequence" column. The DB error will propagate to a Rails exception you can deal with within the application.
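A rough sketch of that safety net, with the index added in a migration and the race handled at the call site (table and column names assumed from the question; adjust the migration version to your Rails release):
class AddUniqueIndexOnSurveyCompleteSequence < ActiveRecord::Migration[5.2]
  def change
    # Multiple NULLs are still allowed, so rows without a sequence are unaffected.
    add_index :answer_headers, :survey_complete_sequence, unique: true
  end
end

# At the call site: if two updates raced to the same value, retry a couple of times.
begin
  answer_header.update!(survey_complete: true)
rescue ActiveRecord::RecordNotUnique
  retries = (retries || 0) + 1
  retry if retries < 3
  raise
end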

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the csv and need to be converted into a Rails Date object, and currencies need to be converted to a decimal by removing any commas or dollar signs
If a row already exists it has to be updated; uniqueness is checked based on two columns.
Currently I use a simple csv import code
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
I can add a check to see if the record already exists and skip it in that case, which should save a little time. But I am looking for something significantly faster. Currently importing around 100k rows takes around 30 mins with an empty database, and it will only get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening, i.e. the first_or_initialize for each row. This ends up being multiple SQL calls because it has to first find the record, then update it and then save it.
Bandwidth. Each time the SQL server is called, data flows back and forth, which adds up to a lot of time.
Now for my question. How do I move the update/create logic to the database, i.e. if an order already exists based on the sku and customer_email it needs to update the record, else a new record needs to be created? Currently with Rails I am using the first_or_initialize method to get the record in case it exists and update it; otherwise I create a new one and save it. How do I do that in SQL?
I could run a raw SQL query using ActiveRecord connection execute but I do not think that would be a very elegant way of doing it. Is there a better way of doing that?
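One way to push the update-or-create into a single statement is PostgreSQL's INSERT ... ON CONFLICT (available from 9.5), which needs a unique index on (sku, customer_email). A hedged sketch via connection.execute (on Rails 6+ the built-in Order.upsert_all wraps the same mechanism):
require 'csv'

conn = ActiveRecord::Base.connection

rows = []
CSV.foreach("orders.csv") do |row|
  # Quote every value so it is safe to interpolate into raw SQL.
  rows << "(#{conn.quote(row[0])}, #{conn.quote(row[1])}, #{conn.quote(parse_currency(row[2]))}, #{conn.quote(row[3])}, #{conn.quote(parse_date(row[4]))}, #{conn.quote(row[5])})"
end

rows.each_slice(1_000) do |slice|
  conn.execute(<<-SQL)
    INSERT INTO orders (sku, quantity, value, customer_email, order_date, status)
    VALUES #{slice.join(', ')}
    ON CONFLICT (sku, customer_email) DO UPDATE
      SET quantity = EXCLUDED.quantity,
          value = EXCLUDED.value,
          order_date = EXCLUDED.order_date,
          status = EXCLUDED.status
  SQL
end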
Since Ruby 1.9, FasterCSV is part of the Ruby core. You don't need to use a special gem. Simply use CSV.
With 100k records Ruby takes about 0.018 secs per record. In my opinion most of your time will be spent within Order.first_or_initialize. This part of your code takes an extra round trip to your database. Initialization of an ActiveRecord object takes its time too. But to really be sure I would suggest that you benchmark your code.
Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  x.report("Init: ") { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use rand query to prevent query caching
  x.report('parse_currency') { 1.upto(100_000) { parse_currency(...) } }
  x.report('parse_date') { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the gem activerecord-import or you can start to go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
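For example, a sketch with activerecord-import, building unsaved Order objects and importing them in slices without validations (this only covers the insert path; for the update-or-create requirement see the ON CONFLICT sketch above, or the gem's on_duplicate_key options if your version supports them):
require 'csv'

orders = []
CSV.foreach("orders.csv") do |row|
  orders << Order.new(
    sku: row[0],
    quantity: row[1],
    value: parse_currency(row[2]),
    customer_email: row[3],
    order_date: parse_date(row[4]),
    status: row[5]
  )
end

orders.each_slice(1_000) do |batch|
  Order.import(batch, validate: false)
end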

Rails ActiveRecord and PostgreSQL Partitioning

I've got a large web app which writes many millions of rows into partitioned tables in PostgreSQL each day (meaning there's a new table for each day's data).
We're using PostgreSQL's table inheritance and partitioning to speed things along:
Due to there being years' worth of data in our DB, we can't effectively use insert triggers to route the content to the correct table (the functions are getting very, very long).
Long story short, we need ActiveRecord to know which table to insert and update the data on. BUT, not change the table that is used for selects and other DB tasks.
Obviously it's simple to define the table name for a model, but is it possible to override the table name for just particular actions?
Here's a little more detail:
Database:
Table: dashboard.impressions (id, host, data, created_on, etc)
Table: data.impressions_20120801 (inherited from dashboard.impressions, with a constraint of created_on being equal to the tables date)
Impression.create :host => "localhost", :data => "{...}", :created_on => DateTime.now should write to the data.impressions_20120801 table, whereas Impression.where(:host => "localhost") should search on the dashboard.impressions table, since that contains all the data.
Edit: I'm running PostgreSQL 9.1 and Rails 3.2.6
I don't do Rails so I can't help with the ActiveRecord side, but I can offer a pure Pg fallback solution for if you can't get ActiveRecord to do what you want. It'll cost you a little bit of insert performance so it'll be much better to teach ActiveRecord to do the inserts to the right place.
Personally I'd just do the INSERTs directly via the pg gem and bypass ActiveRecord completely. If you can't do that, or ActiveRecord does caching that means you shouldn't, try this alternate partitioning trigger implementation.
Instead of explicitly listing every partition in your trigger function, consider EXECUTE ... USING for the insertion, and generate the partition name from your naming scheme. Something like this (untested):
CREATE OR REPLACE FUNCTION partition_trigger() RETURNS trigger AS $$
DECLARE
    target_partition text;
BEGIN
    IF tg_op = 'INSERT' THEN
        target_partition := ( ... work out the partition name ... );
        EXECUTE 'INSERT INTO '||quote_ident(target_partition)||' (col1,col2) VALUES ($1, $2)'
            USING NEW.col1, NEW.col2;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
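If you do end up managing the schema through Rails, the trigger that wires this function to the parent table could be created in a migration (a hypothetical sketch; EXECUTE PROCEDURE is the correct spelling on PostgreSQL 9.1, and the parent table name is taken from the question):
class AddImpressionsPartitionTrigger < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE TRIGGER impressions_partition_insert
        BEFORE INSERT ON dashboard.impressions
        FOR EACH ROW EXECUTE PROCEDURE partition_trigger();
    SQL
  end

  def down
    execute "DROP TRIGGER impressions_partition_insert ON dashboard.impressions"
  end
end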

Ruby on Rails migrations are very slow

I have an SQLite3 database which is populated with a large set of data.
I use a migration for that.
3 tables will have following count of records:
Table_1 will have about 10 records
each record of Table_1 will be associated with ~100 records in Table_2
each record of Table_2 will be associated with ~2000 records in Table_3
The count of records will be about 10*100*2000 = 2000000
This takes a long time... Even if I populate my database with only about 20000 records, it takes about 10 minutes.
Also, I have noticed that during the migration the Ruby interpreter uses just 5% of the CPU time and 95% remains unused...
What is the reason for such poor performance?
Quite simply, inserting large numbers of records by manually saving AR objects one at a time is going to take ages.
The best compromise between speed and "cleanness" (i.e. not a complete dodgy hack) for inserting large amounts of data is ar-extensions's (http://github.com/zdennis/ar-extensions) import method. It's not ideal, but it's better than any of the alternatives I could find, and the syntax is clean and doesn't require you to drop to raw sql (or anywhere close).
Example syntax:
items = Array.new
1.upto(200) do |n|
  items << Item.new :some_field => n
end
Item.import items, :validate => false
At least in mysql this will batch the records into a single INSERT statement with multiple sets of values. Pretty damn fast.
If you run each INSERT statement in its own transaction, SQLite can be very, very slow. But if you run it all in one transaction (or a logical set of transactions), then it can be very fast.
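For example, a sketch of the migration's seeding loop wrapped in one transaction (hypothetical model names standing in for Table_1, Table_2 and Table_3):
ActiveRecord::Base.transaction do
  1.upto(10) do |i|
    parent = Parent.create!(:name => "parent #{i}")
    1.upto(100) do |j|
      child = parent.children.create!(:name => "child #{j}")
      1.upto(2000) { |k| child.grandchildren.create!(:value => k) }
    end
  end
end
Even with one transaction, instantiating two million AR objects remains the other big cost, so combining this with a bulk import (as in the ar-extensions example above) helps the most.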
Seed_fu could help, as discussed in this question
