Ruby on Rails migrations are very slow - ruby-on-rails

I have an SQLite3 database that I populate with a large data set, using a migration.
Three tables will have the following record counts:
Table_1 will have about 10 records
each record of Table_1 will be associated with ~100 records in Table_2
each record of Table_2 will be associated with ~2000 records in Table_3
The total will be about 10*100*2000 = 2,000,000 records.
This takes a long time. Even if I populate the database with only about 20,000 records, it takes about 10 minutes.
I have also noticed that, while the migration runs, the Ruby interpreter uses just 5% of CPU time and the other 95% remains unused.
What is the reason for such poor performance?

Quite simply, inserting large amounts of records through manually saving AR objects one at a time is going to take years.
The best compromise between speed and "cleanness" (i.e. not a complete dodgy hack) for inserting large amounts of data is ar-extensions's (http://github.com/zdennis/ar-extensions) import method. It's not ideal, but it's better than any of the alternatives I could find, and the syntax is clean and doesn't require you to drop to raw sql (or anywhere close).
Example syntax:
items = Array.new
1.upto(200) do |n|
items << Item.new(:some_field => n)
end
Item.import items, :validate => false
At least in MySQL, this will batch the records into a single INSERT statement with multiple sets of values. Pretty damn fast.
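To show what "a single INSERT statement with multiple sets of values" looks like, here is a minimal plain-Ruby sketch that builds such a statement by hand. It is not how ar-extensions works internally; `quote_value`, `multi_row_insert` and the `label` column are hypothetical names made up for the illustration, and the quoting is deliberately simplistic (a real app should rely on the database adapter's quoting).

```ruby
# Naive SQL literal quoting, for illustration only.
def quote_value(value)
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

# Builds one INSERT statement carrying a VALUES tuple per row.
def multi_row_insert(table, columns, rows)
  tuples = rows.map do |row|
    "(" + row.map { |v| quote_value(v) }.join(", ") + ")"
  end
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

rows = (1..3).map { |n| [n, "item #{n}"] }
sql = multi_row_insert("items", %w[some_field label], rows)
puts sql
```

One statement with three tuples means one round trip and one parse instead of three, which is where most of the saving comes from.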

If you run each INSERT statement in its own transaction, SQLite can be very, very slow. But if you run it all in one transaction (or a logical set of transactions), then it can be very fast.
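A rough sketch of the idea, using a stand-in connection object so it runs without a database. `RecordingConnection` and `in_single_transaction` are hypothetical helpers written for this example. In a Rails app, wrapping the saves in `ActiveRecord::Base.transaction do ... end` serves the same purpose.

```ruby
# Stand-in for a database connection: it only records the SQL it receives.
class RecordingConnection
  attr_reader :log

  def initialize
    @log = []
  end

  def execute(sql)
    @log << sql
  end
end

# Sends every statement between one BEGIN/COMMIT pair; rolls back on error.
def in_single_transaction(conn, statements)
  conn.execute("BEGIN")
  statements.each { |sql| conn.execute(sql) }
  conn.execute("COMMIT")
rescue StandardError
  conn.execute("ROLLBACK")
  raise
end

conn = RecordingConnection.new
statements = (1..3).map { |n| "INSERT INTO items (some_field) VALUES (#{n})" }
in_single_transaction(conn, statements)
puts conn.log
```

With a real SQLite connection, the database commits once at the end instead of once per INSERT.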

Seed_fu could help, as discussed in this question

Related

Syncing data between CSV and DB

I need to sync data between a CSV file and my DB, but the process is very slow when I check whether each item exists.
For example, I have a very large list of postal codes. When this list is loaded into the system, the app needs to check whether each record already exists in the database.
I tried find_or_initialize_by, but it is very slow when the list has more than 100_000 records. I also tried caching all the database records and comparing them using .select, but that is almost as slow as querying the database.
Any suggestions?
Using find_or_initialize_by is extremely slow for use cases like these because this approach would run at least one query against the database for each line in the CSV file. And if the record wasn't found there will be a second insert query. Even if every single query is extremely fast, let's assume they only take 5ms, they will add up: With 100k lines in the CSV alone the find_or_initialize_by method calls will take over 8 minutes.
Therefore my approach would be to avoid doing many small database queries and instead do only a few, big queries and keep the data in memory.
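To make that estimate concrete, here is the arithmetic as a few lines of Ruby. The 5 ms per query is the figure assumed above. Charging the same 5 ms for each insert is an extra assumption of mine, used only to show a worst case where every line is new.

```ruby
lines            = 100_000
millis_per_query = 5

# One lookup per CSV line
lookup_seconds = lines * millis_per_query / 1000
lookup_minutes = lookup_seconds / 60.0

# Worst case: every line is missing, so each lookup is followed by an insert
worst_case_seconds = 2 * lookup_seconds

puts "lookups only: #{lookup_seconds} s (about #{lookup_minutes.round(1)} min)"
puts "lookups plus inserts: #{worst_case_seconds} s"
```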
First, load from the database only the part of each record that identifies it, not the whole record. For postal code data that might be the zip_code column. Then store that data in an in-memory data structure that allows fast lookup, for example a Set.
require 'set'
existing_zip_codes = Set.new(
PostalCode.all.pluck(:zip_code)
)
Then iterate over the CSV and collect all data that need to be imported into the database.
missing_postal_codes = []
CSV.foreach(...) do |row|
next if existing_zip_codes.include?(row['zip_code'])
missing_postal_codes << {
zip_code: row['zip_code'],
city: row['city'],
# ...
}
end
And in the last step, I would insert all the missing records into the database with one big insert_all call.
PostalCode.insert_all(missing_postal_codes)
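Here is a self-contained sketch of the same filtering step that runs without Rails. Plain hashes stand in for the parsed CSV rows and a literal array stands in for the result of pluck. `rows_to_insert` is a hypothetical helper, and the batch size of 1000 is an arbitrary choice. Unlike the snippet above, it also skips duplicates inside the file itself, which may or may not matter for your data.

```ruby
require 'set'

# Returns the rows whose key is neither in the database nor seen earlier in the input.
def rows_to_insert(existing_keys, rows, key)
  seen = Set.new(existing_keys)
  # Set#add? returns nil when the element is already present
  rows.select { |row| seen.add?(row[key]) }
end

existing = ["10115", "20095"] # stands in for PostalCode.pluck(:zip_code)
csv_rows = [
  { "zip_code" => "10115", "city" => "Berlin" },
  { "zip_code" => "80331", "city" => "Munich" },
  { "zip_code" => "80331", "city" => "Munich" }, # duplicate line in the file
  { "zip_code" => "50667", "city" => "Cologne" }
]

missing = rows_to_insert(existing, csv_rows, "zip_code")

# In a Rails app each batch would go to insert_all; here we only report sizes.
missing.each_slice(1000) { |batch| puts "would insert #{batch.size} rows" }
```

Slicing keeps each statement at a manageable size if the list of missing rows turns out to be very large.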

How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand) with the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
inserts = []
User.find_each do |user|
tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
inserts << tuple
end
# At this point, the inserts array is of size at least 20,000
conn = ActiveRecord::Base.connection
inserts.each_slice(2000) do |slice|
sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
conn.execute(sql)
end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!
It depends on which PG version you are using, but in most cases of bulk loading data into a table this checklist is enough:
use COPY instead of INSERT whenever possible;
if using multiple INSERTs, disable autocommit and wrap all INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
disable indexes and checks/constraints on the target table;
disable table triggers;
alter the table so it becomes unlogged (available since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded.
20k rows is not a big deal for PG, so 2k-sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
UPD: and a little bit old, yet wonderful piece from depesz, excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.
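If you go the COPY route from Ruby, the rows have to be serialized into COPY's text format first: tab-separated fields, `\N` for NULL, and backslash escapes for tabs, newlines, carriage returns and backslashes. Below is a minimal sketch of that serialization only. `copy_field` and `copy_line` are hypothetical helpers, and the sample rows reuse the tuple shape from the question. With the pg gem the resulting lines would typically be streamed through `copy_data` and `put_copy_data` on the raw connection, which is not shown here.

```ruby
# Escapes one value for PostgreSQL's COPY text format.
COPY_ESCAPES = { "\\" => "\\\\", "\t" => "\\t", "\n" => "\\n", "\r" => "\\r" }.freeze

def copy_field(value)
  return '\N' if value.nil? # NULL marker in COPY text format
  value.to_s.gsub(/[\\\t\n\r]/) { |char| COPY_ESCAPES[char] }
end

# One row becomes one tab-separated, newline-terminated line.
def copy_line(values)
  values.map { |v| copy_field(v) }.join("\t") + "\n"
end

rows = [
  [1, "foo", "bar", nil],
  [2, "tab\there", "x", "y"]
]
payload = rows.map { |row| copy_line(row) }.join
print payload
```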

Effects of Rails' default_scope on performance

Can a default_scope that orders records by something other than ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Right now, because the default_scopes order records by name rather than by ID, I am worried I might be incurring a significant performance penalty when normal queries are run. For example with:
User.find(5563401)
or
User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
The PostgreSQL documentation is pretty explicit. Ordered queries on unindexed fields are much slower than unordered ones because (no surprise) PostgreSQL must sort the results before returning them, first creating a temporary table or index to hold the result. This could easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain the index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This also is discussed in the Postgres docs.
In a nutshell, NEVER add an order clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have order by attached because it must return exactly one result. You can verify this very quickly by starting rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all definitely will be ordered and consequently definitely be slower than needed.

ActiveRecord query much slower than straight SQL?

I've been working on optimizing my project's DB calls and I noticed a "significant" difference in performance between the two identical calls below:
connection = ActiveRecord::Base.connection()
pgresult = connection.execute(
"SELECT SUM(my_column)
FROM table
WHERE id = #{id}
AND created_at BETWEEN '#{lower}' and '#{upper}'")
and the second version:
sum = Table.
where(:id => id, :created_at => lower..upper).
sum(:my_column)
The method using the first version takes 300ms on average to execute (the operation is called a couple thousand times in total within it), and the method using the second version takes about 550ms. That's almost twice the run time.
I double-checked the SQL generated by the second version; it's identical to the first, except that it prefixes the column names with the table name.
Why the slow-down? Is the conversion between ActiveRecord and SQL really making the operation take almost 2x?
Do I need to stick to writing straight SQL (perhaps even a sproc) if I need to perform the same operation a ton of times and I don't want to hit the overhead?
Thanks!
A couple of things jump out.
Firstly, if this code is being called 2000 times and takes 250ms extra to run, that's ~0.125ms per call to convert the Arel to SQL, which isn't unrealistic.
Secondly, I'm not sure of the internals of Range in Ruby, but lower..upper may be doing calculations such as the size of the range and other things, which will be a big performance hit.
Do you see the same performance hit with the following?
sum = Table.
where(:id => id).
where("created_at BETWEEN ? AND ?", lower, upper).
sum(:my_column)
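To check guesses like these, it helps to measure instead of speculating. Here is a small plain-Ruby sketch: `per_call_ms` is a hypothetical helper that averages wall-clock time per call, used here to time nothing more than building a Range of Time objects. The Time values are placeholders, and the last lines redo the per-call overhead arithmetic from the figures in the question.

```ruby
# Average wall-clock milliseconds per call of the given block.
def per_call_ms(iterations)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  elapsed * 1000.0 / iterations
end

# Placeholder bounds; in the real app these are the lower/upper timestamps.
lower = Time.at(0)
upper = Time.at(86_400)

range_ms = per_call_ms(2000) { lower..upper }
puts format("building the range: %.5f ms per call", range_ms)

# Extra cost per call implied by the question: (550 ms - 300 ms) over 2000 calls
extra_ms_per_call = (550 - 300) / 2000.0
puts "reported overhead: #{extra_ms_per_call} ms per call"
```

Wrapping each of the two query versions in the same helper would show where the extra time is actually spent.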

iterating through table in Ruby using hash runs slow

I have the following code:
h2.each {|k, v|
@count += 1
puts @count
sq.each do |word|
if Wordsdoc.find_by_docid(k).tf.include?(word)
sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
end
end
rec_hash[k] = sum
sum = 0
}
h2 -> a hash that contains the ids of documents; it has more than 1000 entries
Wordsdoc -> a model/table in my database
sq -> a hash that contains around 10 words
What I'm doing is going through each of the document ids and then, for each word in sq, looking up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word)); here tf is a hash of {word => value}.
If it does, I get the value of that word in Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This seems to be running very slowly. It processes one document per second. Is there a way to process this faster?
Thanks, I really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k you call it 10 times (once per word in sq), and each time the word is found you call it again. That means you call that method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database is on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store the result in a variable. That would save a lot of querying and make your loop go much faster.
Another optimization, though not nearly as important as the first one, is to get all the Wordsdoc records in a single query. Almost all mid- to high-level (and some low-level, too!) programming languages and libraries work better and faster when they work in bulk, and ActiveRecord is no exception. If you query for all entries of Wordsdoc, filtered by the docids in h2's keys, you can turn 1000 queries (after the first optimization; before it, it was 10,000-20,000 queries) into a single, huge query. That will enable ActiveRecord and the underlying database to retrieve your data in bigger chunks, and save you a lot of disk access.
There are some more minor optimizations you can do, but the two I've specified should be more than enough.
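A rough, database-free sketch of both optimizations together. The `Doc` struct stands in for Wordsdoc rows, and the sample data is invented. In the real app the `docs` array would come from one query, something like Wordsdoc.where(:docid => h2.keys). I'm also treating sq as a plain list of words and using local variables instead of the instance variables from the question.

```ruby
# Stand-in for rows of the Wordsdoc table: a docid plus a {word => value} hash.
Doc = Struct.new(:docid, :tf)

docs = [
  Doc.new(1, { "ruby" => 2, "rails" => 1 }),
  Doc.new(2, { "sql" => 3 })
]
h2 = { 1 => true, 2 => true }   # document ids to score
sq = ["ruby", "rails", "sql"]   # query words
s  = { "ruby" => 0.5, "rails" => 2.0, "sql" => 1.5 }

# Index the preloaded records by docid in a single pass.
docs_by_id = docs.each_with_object({}) { |doc, index| index[doc.docid] = doc }

rec_hash = {}
h2.each_key do |k|
  tf = docs_by_id.fetch(k).tf   # one in-memory lookup per document
  rec_hash[k] = sq.sum { |word| tf.key?(word) ? tf[word] * s[word] : 0 }
end

p rec_hash
```

The inner loop now only touches in-memory hashes, so the number of database round trips no longer depends on the number of documents or words.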
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
sum += wordsdoc.tf[word] * @s[word]
end
...but still it will be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example something like that should be much more efficient:
# find_each loads the matching records in batches and yields them one at a time
Wordsdoc.where(:docid => array_of_doc_ids).find_each do |wordsdoc|
if wordsdoc.tf.include?(word)
sum += wordsdoc.tf[word] * @s[word]
end
end
Also, you can retrieve only the columns you need from the Wordsdoc table, for example with a select restricted to tf (plus the primary key, which the batched finders need for ordering).
As you have a lot going on, I'm just going to offer up a few things to check out.
A book called Eloquent Ruby deals with documents and iterating through them to count the number of times a word was used. All of the author's examples are about a document system he was maintaining, so it could even tackle other problems for you.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
Delayed Job the whole thing if you are doing this asynchronously. If this is a web app, you must be timing out if you're waiting 1000 seconds for this job to complete before it shows its answers on the screen.
Go get em.
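For the inject suggestion, this is roughly what the sum for a single document could look like. The data is made up, and tf, s and sq are plain local hashes and arrays here. It mostly tidies up the accumulator handling; on its own I would not expect it to change the run time much.

```ruby
tf = { "ruby" => 2, "rails" => 1 }                  # word values for one document
s  = { "ruby" => 0.5, "rails" => 2.0, "sql" => 1.5 } # word weights
sq = ["ruby", "rails", "sql"]                        # query words

# Fold the query words into one score, skipping words the document lacks.
score = sq.inject(0) do |sum, word|
  tf.key?(word) ? sum + tf[word] * s[word] : sum
end

puts score
```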

Resources