Syncing data between CSV and DB - ruby-on-rails

I need to sync data between a CSV file and my DB, but checking whether each item already exists makes the process very slow.
For example, I have a very large list of postal codes; when this list is loaded into the system, the app needs to check whether each record already exists in the database.
I tried to use find_or_initialize_by, but it is very slow when the list of postal codes has more than 100_000 records ... I also tried caching all the records from the database and comparing them using .select, but that is almost as slow as hitting the database.
Any suggestions?

Using find_or_initialize_by is extremely slow for use cases like these, because this approach runs at least one query against the database for each line in the CSV file, and if the record wasn't found there will be a second insert query. Even if every single query is extremely fast, say 5 ms, they add up: with 100k lines in the CSV, the find_or_initialize_by calls alone will take over 8 minutes.
Therefore my approach would be to avoid doing many small database queries and instead do only a few, big queries and keep the data in memory.
First, load the existing records from the database, but only the unique part of each record rather than the whole row. For postal code data that might be the zip_code column. Then store that data in an in-memory data structure that allows fast lookups, for example a Set.
require 'set'
existing_zip_codes = Set.new(
  PostalCode.all.pluck(:zip_code)
)
Then iterate over the CSV and collect all the data that needs to be imported into the database.
missing_postal_codes = []
CSV.foreach(...) do |row|
  next if existing_zip_codes.include?(row['zip_code'])
  missing_postal_codes << {
    zip_code: row['zip_code'],
    city: row['city'],
    # ...
  }
end
And in the last step, insert all of that missing data into the database with one big insert_all call.
PostalCode.insert_all(missing_postal_codes)
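A note on that last step: insert_all only exists from Rails 6 onwards (on older versions the activerecord-import gem offers a similar bulk insert), and for very large files it may be worth splitting the insert into slices so a single statement doesn't grow too big. A minimal sketch, assuming Rails 6+ and an arbitrary batch size:

missing_postal_codes.each_slice(10_000) do |batch|
  PostalCode.insert_all(batch)
end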

Related

Parsing large CSVs while searching Database

Currently have a tricky problem and need ideas for the most efficient way to go about solving it.
We periodically iterate through large CSV files (~50000 to 2m rows), and for each row, we need to check a database table for matching columns.
So for example, each CSV row could have details about an Event - artist, venue, date/time etc, and for each row, we check our database (PG) for any rows that match the artist, venue and date/time the most, and then perform operations if any match is found.
Currently, the entire process is highly CPU-, memory- and time-intensive when pulling row by row, so we perform the matching in batches, but we are still seeking ideas for an efficient way to perform the comparison, both memory-wise and time-wise.
Thanks.
Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. How to import CSV file data into a PostgreSQL table?)
Perform matching and operations in-database, i.e. in SQL
If necessary, truncate the temporary table afterwards
This would move most of the load into the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation etc.)
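Purely as an illustration of that idea, a rough sketch from the Rails side, assuming PostgreSQL, hypothetical events / events_import tables and column names, a CSV file readable by the database server, and a simplified exact-match join standing in for the real "matches the most" logic:

conn = ActiveRecord::Base.connection

# 1. Stage the CSV in a temporary table whose columns mirror the CSV.
conn.execute(<<~SQL)
  CREATE TEMP TABLE events_import (artist text, venue text, starts_at timestamp)
SQL

# Server-side COPY needs the file on the DB host and the right privileges;
# psql's \copy or a GUI import tool are alternatives.
conn.execute("COPY events_import FROM '/path/to/events.csv' WITH (FORMAT csv, HEADER true)")

# 2. Do the matching in SQL instead of row by row in Ruby.
matches = conn.execute(<<~SQL)
  SELECT e.id, i.artist, i.venue, i.starts_at
  FROM events_import i
  JOIN events e
    ON  e.artist    = i.artist
    AND e.venue     = i.venue
    AND e.starts_at = i.starts_at
SQL

# 3. Temp tables vanish at the end of the session; truncate explicitly if the session is reused.
conn.execute("TRUNCATE events_import")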

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the CSV and need to be converted into Rails Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
If a row already exists it has to be updated; uniqueness is checked based on two columns.
Currently I use a simple csv import code
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
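The question doesn't include those two helpers; purely for context, a minimal sketch of what they might look like (the currency regex and the date format are assumptions to adjust to the actual file):

require 'bigdecimal'
require 'date'

def parse_currency(value)
  # "$1,234.56" -> 1234.56; strip dollar signs and commas, keep a decimal.
  BigDecimal(value.to_s.gsub(/[$,]/, ""))
end

def parse_date(value)
  # Just a wrapper around Date.strptime, as described; the format is an assumption.
  Date.strptime(value, "%Y-%m-%d")
end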
I could add a check to skip rows whose records already exist, which should save a little time, but I am looking for something significantly faster. Currently, importing around 100k rows takes around 30 minutes with an empty database, and it will only get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening, i.e. the first_or_initialize for each row. This ends up as multiple SQL calls, because it has to first find the record and then save the new or updated one.
Bandwidth. Each time the SQL server is called, data flows back and forth, which adds up to a lot of time.
Now for my question. How do I move the update/create logic into the database, i.e. if an order already exists based on the sku and customer_email it needs to be updated, otherwise a new record needs to be created? Currently with Rails I am using the first_or_initialize method to get the record in case it exists and update it; otherwise I create a new one and save it. How do I do that in SQL?
I could run a raw SQL query using the ActiveRecord connection's execute, but I do not think that would be a very elegant way of doing it. Is there a better way of doing that?
Since Ruby 1.9, FasterCSV has been part of the Ruby core. You don't need to use a special gem; simply use CSV.
With 100k records in 30 minutes, Ruby is taking about 0.018 seconds per record. In my opinion most of that time is spent inside Order.first_or_initialize. This part of your code takes an extra round trip to your database, and initializing an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code.
require 'benchmark'

Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  x.report("Init:") { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use rand query to prevent query caching
  x.report("parse_currency") { 1.upto(100_000) { parse_currency(...) } }
  x.report("parse_date") { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
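One simple way to watch that during the import is to sample GC.stat (part of core Ruby) every few thousand rows; a sketch:

CSV.foreach("orders.csv").with_index do |row, i|
  # ... import the row ...
  if (i % 10_000).zero?
    stats = GC.stat
    puts "row #{i}: GC runs=#{stats[:count]}, live slots=#{stats[:heap_live_slots]}"
  end
end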
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the gem activerecord-import or you can start to go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
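As for the edit's question about moving the update-or-create logic into the database: on PostgreSQL 9.5+ this can be expressed as INSERT ... ON CONFLICT DO UPDATE and driven through the ActiveRecord connection (newer Rails versions also wrap this as upsert_all). A rough sketch, assuming a unique index on (sku, customer_email) and the table/column names from the question:

conn = ActiveRecord::Base.connection

CSV.foreach("orders.csv").each_slice(1_000) do |rows|
  values = rows.map do |row|
    [
      conn.quote(row[0]),                 # sku
      conn.quote(row[1].to_i),            # quantity
      conn.quote(parse_currency(row[2])), # value
      conn.quote(row[3]),                 # customer_email
      conn.quote(parse_date(row[4])),     # order_date
      conn.quote(row[5])                  # status
    ].join(", ")
  end

  conn.execute(<<~SQL)
    INSERT INTO orders (sku, quantity, value, customer_email, order_date, status)
    VALUES #{values.map { |v| "(#{v})" }.join(", ")}
    ON CONFLICT (sku, customer_email)
    DO UPDATE SET quantity   = EXCLUDED.quantity,
                  value      = EXCLUDED.value,
                  order_date = EXCLUDED.order_date,
                  status     = EXCLUDED.status
  SQL
end

This keeps the whole find-or-update decision inside the database and reduces the Ruby work per CSV row to building strings.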

How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand) with the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
  inserts = []
  User.find_each do |user|
    tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
    inserts << tuple
  end
  # At this point, the inserts array is of size at least 20,000
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
    conn.execute(sql)
  end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!
It depends on which PG version you are using, but in most cases this checklist is enough for bulk loading data into a table:
try to use COPY instead of INSERT whenever possible;
if using multiple INSERTs, disable autocommit and wrap all INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
disable indexes and checks/constraints on the target table;
disable table triggers;
alter the table so it becomes unlogged (since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded
20k rows is not such a big deal for PG, so 2k-sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
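From Rails, the "sliced INSERTs inside one transaction" part of that checklist could look roughly like this, reusing the names from the question:

conn = ActiveRecord::Base.connection

ActiveRecord::Base.transaction do
  inserts.each_slice(2000) do |slice|
    conn.execute(
      "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
    )
  end
end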
UPD: and a little bit old, yet a wonderful piece from depesz; excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.

How do I optimise getting and updating the id for 500000 records?

I have a CSV file that contains data like the id of the user, unit and size.
I want to update member_id for 500,000 products:
500000.times do |i|
  user = User.find_by(id: tmp[i])
  hash = {
    unit: tmp[UNIT],
    size: tmp[SIZE]
  }
  hash.merge!(user_id: user.id) if user.present?
  Product.create(hash)
end
How do I optimize that procedure to not find each User object but maybe get an array of related hashes?
There are two things here that are massively holding back performance. First, you're doing N User.find calls, which is totally out of control. Second, you're creating individual records instead of doing a mass insert, and each of those individual creates runs inside its own tiny transaction block.
Generally these sorts of bulk operations are better done purely in the SQL domain. You can insert a very large number of rows at the same time, often only limited by the size of the query you can submit, and that parameter is usually adjustable.
While a gigantic query may lock or block your database for a period of time, it will be the fastest way to do your updates. If you need to keep your system running during mass inserts, you'll need to break it up into a series of smaller commits.
Remember that Product.connection is a more low-level access layer allowing you to manipulate the data directly with queries.
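A hedged sketch of that, assuming Rails 6+ for insert_all (activerecord-import or a hand-built INSERT via Product.connection.execute works similarly on older versions), and assuming each element of tmp is a row with hypothetical ID, UNIT and SIZE indexes, since the question's tmp structure is ambiguous:

require 'set'

# One query to find which of the referenced users actually exist.
existing_user_ids = User.where(id: tmp.map { |row| row[ID] }).pluck(:id).to_set

rows = tmp.map do |row|
  {
    unit:    row[UNIT],
    size:    row[SIZE],
    user_id: existing_user_ids.include?(row[ID]) ? row[ID] : nil
  }
end

# Mass-insert instead of 500,000 individual Product.create calls.
rows.each_slice(5_000) { |slice| Product.insert_all(slice) }

Note that insert_all skips validations and callbacks, so any cleanup normally done in the model has to happen in Ruby before building the rows.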

Importing huge excel file to Rails application

I have an Excel file with thousands of rows. In my case, I can't use bulk insert, because for each row I have to create a few associations. Right now the whole process takes more than 1 hour for 20k rows, which is hell. What is the best way to solve this problem?
I'm using spreadsheet gem.
This is analogous to the infamous "N+1" query situation that Rails loves to encounter. I have a similar situation (importing files of 20k+ rows with multiple associations). The way I optimized this process was to pre-load hashes for the associations. So for example, if you have an AssociatedModel that contains a lookup_column that is in your import data, you would first build a hash:
associated_model_hash = Hash.new(:not_found)
AssociatedModel.find_each do |item|
  associated_model_hash[item.lookup_column] = item
end
This provides a hash of objects. You can repeat for as many associations as you have. In your import loop:
associated_model = associated_model_hash[row[:lookup_column]]
new_item.associated_model_id = associated_model.id
Because you don't have to do a search on the database each time, this is much faster. It should also allow you to use bulk insert (assuming you can guarantee that the associated models will not be deleted or modified in a bad way during the load).
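Tying the two fragments together, the whole import might be sketched like this (hypothetical model and column names, with insert_all standing in for whichever bulk-insert mechanism is available):

associated_model_hash = Hash.new(:not_found)
AssociatedModel.find_each do |item|
  associated_model_hash[item.lookup_column] = item
end

new_rows = []
spreadsheet_rows.each do |row|              # rows read via the spreadsheet gem
  associated_model = associated_model_hash[row[:lookup_column]]
  next if associated_model == :not_found    # decide how to handle misses

  new_rows << {
    name:                row[:name],        # hypothetical column
    associated_model_id: associated_model.id
  }
end

MainModel.insert_all(new_rows)              # or activerecord-import's MainModel.import new_rows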
