Rails migration performance: using 35GBs of swap? - ruby-on-rails

I have a pretty big migration scanning through a 300k-row table, using find_each and a batch_size of 1000. The migration takes about two hours to run, and for each row a new row is created in a different table. I can't use pure SQL to do this migration - it has to be Ruby.
My question, though, is why does Ruby first use up all the memory available and then start using insane amounts of swap (35GBs)? (See the screen shots attached.) I would have thought Ruby's GC would have been invoked before it started eating swap. After all, in theory only 1000 records should be being loaded into memory at one time. And these records are small, far smaller than 1MB. What am I doing wrong?
UPDATE: here's some sample code
Post.find_each(:batch_size => 1000) do |p|
user = User.find_by_fb_id(p.fb_uid)
if user
puts "Migrating post #{p.pid}"
e = Entity.new
e.created_at = p.time
e.updated_at = p.last_update
e.response = p.post
e.user_id = user.id
e.legacy_type = "GamePost"
e.legacy_id = p.pid
e.is_approved = true
e.is_muted = true
e.save(:validate => false)
end
end

Related

Optimized method for retrieving top high scores, one per user (has_many through)

I have an app that saves high scores for its users. We want to show a leaderboard but we only want every user to show up once with their highest score. We have a HighScore model, which saves scores and has a few other fields (game type, game settings, etc). HighScore has a has_many_through relationship to User (through high_score_user) because a HighScore can have multiple users (in case of a game that's played with multiple players). Now I need a function that shows that leaderboard but I have yet to find a good way to write this code.
Currently I simply grab the top 500 scores, include the high_score_users, then iterate through them to filter out duplicate users. Once I have 10 scores, I return those scores. Obviously this is extremely suboptimal and it's very, very slow. Here's the code I have so far:
def self.unique_leaderboard(map_name, score_type, game_mode, game_type)
used_scores = Rails.cache.fetch("leaderboards_unique/#{map_name}/#{score_type}/#{game_type}/#{game_mode}",
expires_in: 10.minutes) do
top500 = HighScore
top500 = top500.for_map(map_name) if map_name
top500 = top500.for_type(score_type) if score_type
top500 = top500.for_gametype(game_type) if game_type
top500 = top500.for_gamemode(game_mode) if game_mode
top500 = top500.current_season
top500 = top500.ranked(game_type)
top500 = top500.includes(:high_score_users)
top500 = top500.limit(500)
used_scores = []
top500.each do |score|
break if used_scores.count >= 10
next unless (used_scores.map do |used_score|
used_score.high_score_users.map(&:user_id)
end.flatten.uniq & score.high_score_users.map(&:user_id)).empty?
used_scores << score
end
used_scores
end
HighScore.where(id: used_scores.map(&:id)).includes(:users).ranked(game_type)
end
I'm using Ruby on Rails with Postgres. How do I improve this code so it's not incredibly slow? I couldn't find a way to do it in SQL, nor could I find a way to do it properly with ActiveRecord.
I would bet that the major time killer is this code:
next unless (used_scores.map do |used_score|
used_score.high_score_users.map(&:user_id)
end.flatten.uniq & score.high_score_users.map(&:user_id)).empty?
It is being executed up to 500 times in a worst case, and each iteration is relatively heavy-weight due to several unnecessary maps on each iteration. And all this computational complexity just to track unique user_ids from the high scores (to short-circuit the iteration as soon as 10 unique top scorers are selected...
So if you just replace
...
used_scores = []
top500.each do |score|
...
end
used_scores
...
with smth. like
top_scorers = Hash.new { |h,k| h[k] = [] }
top500.each do |score|
score.high_score_users.each do |user|
top_scorers[user.id] << score
break if top_scorers.size >= 10
end
end
top_scorers.values.flatten.uniq
it should become significantly faster already.
But honestly fetching 500 high scores just to pick 10 top scorers seems weird anyway. This task can be solved perfectly fine by the database itself. If <high scores SQL query> is your high scores query (without the limit part) then something like this would do the job (pseudocode! just to illustrate the idea)
user_ids = select distinct(user_id) from <high scores SQL query> limit 10;
select * from <high scores SQL query> where user_id in <user_ids>
(this pseudocode could be "translated" into AR query(ies) in different ways, it's up to you)

threaded batching with mongoid without using skip

I have a large mongo db that i want to grab a batch of records, process them in a thread, grab the next batch, process in a thread, etc. There is major decay in .skip as explained in this post https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow. The only way I can figure out how to do this is to take the last id of the current batch as follows (this is non-threaded):
batch_size = 1000
starting_id = Person.first.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
while(batch.present?)
batch.each do |b|
# process
starting_id = batch.last.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
end
The problem is, the finding is the slow part (relative) and what I really want to do is parallelize this line (I will take care of governing too many threads so that's not an issue):
batch = Person.where(:id.gte => starting_id).limit(batch_size)
I can't figure out non-skip approach to putting this in a thread because I have to wait until the slow line (above) finishes to start the next thread. Can anyone think of a way to thread this? This is what I've tried, but it has almost zero performance improvement:
batch_size = 1000
starting_id = Person.first.id
thread_count = 10
keep_going = true
while(keep_going)
batch = Person.where(:id.gte => starting_id).limit(batch_size)
if batch.present?
while Thread.list.count > (thread_count - 1)
sleep(1)
end
Thread.new do
batch.each do |b|
# process
starting_id = batch.last.id
end
else
keep_going = false
end
end
This doesn't quite work, but the structure is not the problem, the main question is how can I get the nth batch of records quickly in mongo / mongoid? If I could get the nth batch (which is what limit and skip gets me) I could easily parallelize.
thanks for any help,
Kevin
Something like:
while batch.any?
batch = Person.where(:id.gte => starting_id).order(id: :ASC).limit(batch_size)
Thread.new{ process batch }
starting_id = batch.last.id
end
Alternatively, add a processing key and index and update the documents after they are fetched. It would be done in a single query, so should not be too slow. At least it would be constant time. The batch query would be .where(:id.gte => starting_id, :processing => nil).limit(batch_size)
the main question is how can I get the nth batch of records quickly in mongo / mongoid?
I don't think you need the nth batch, or need to parallelize the query. I believe slow part will be processing each batch...

I need advice in speeding up this rails method that involves many queries

I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
#today_hooks = #m_webhooks.where(:date_sent => this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
This appears that the "date_sent" attribute is included in all of the queries, which implies that the SQL is searching through all 1M records with every single query.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks.controller.rb
def index
def set_sub_count_hash(thip) {
gmail_hooks: {opened: a = thip.gmail.send(#event).size, total_sent: b = thip.gmail.sent.size, perc_opened: find_perc(a, b)},
hotmail_hooks: {opened: a = thip.hotmail.send(#event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b)},
yahoo_hooks: {opened: a = thip.yahoo.send(#event).size, total_sent: b = thip.yahoo.sent.size, perc_opened: find_perc(a, b)},
other_hooks: {opened: a = thip.other.send(#event).size, total_sent: b = thip.other.sent.size, perc_opened: find_perc(a, b)},
}
end
#m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
#event = params[:event] || "unique_opened"
#m_list_of_ips = [#List of three ip addresses]
end_date = Date.today
start_date = Date.today - 10.days
date_range = (end_date - start_date).to_i
#count_array = []
date_range.times do |n|
this_date = end_date - n.days
#today_hooks = #m_webhooks.where(:date_sent => this_date)
#count_array[n] = {:this_date => this_date}
#m_list_of_ips.each_with_index do |ip, index|
thip = #today_hooks.where(:sending_ip => ip) #Stands for "Today Hooks ip"
#count_array[n][index] = set_sub_count_hash(thip)
end
end
Well, your problem is very simple, actually. You gotta remember that when you use where(condition), the query is not straight executed in the DB.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size like in your case) and chain your queries while you don't need one. In your code, you keep chaining conditions to the main query inside a loop (date_range). And it gets worse, you start another loop inside this one adding conditions to each query created in the first loop.
Then you pass the query (not concrete yet, it was not yet executed and does not have results!) to the method set_sub_count_hash which goes on to call the same query many times.
Therefore you have something like:
10(date_range) * 3(ip list) * 8 # (times the query is materialized in the #set_sub_count method)
and then you have a problem.
What you want to do is to do the whole query at once and group it by date, ip and email. You should have a hash structure after that, which you would pass to the #set_sub_count method and do some ruby gymnastics to get the counts you're looking for.
I imagine the query something like:
main_query = #m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
.where(sending_ip:#m_list_of_ips)
Ok, now you have one query, which is nice, but I think you should separate the query in 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, the main_query, will not be executed until you call for materialized results, don forget it). Still, like 100 times faster.
I think this is the result that should be grouped, mapped and passed to #set_sub_count instead of passing the raw query and calling methods on it every time and many times. It will be a little work to do the grouping, mapping and counting for sure, but hey, it's faster. =)
In case this helps anybody else, I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries that I was running before).
#count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
.group('date_sent', 'sending_ip', 'event', 'esp').count

Ruby/ Rails split array into N groups with remainder being added to last group.

I am attempting to make a batch process which will take a parameter that specifies the number of background workers, and split a collection into that many arrays. For example if
def split_for_batch(number_of_workers)
<code>
end
array = [1,2,3,4,5,6,7,8,9,10]
array.split_for_batch(3)
=> [[1,2,3],[4,5,6],[7,8,9,10]]
the thing is that I don't want to have to load all of the users into memory at once because it is a batch. What I have now is
def initialize_audit_run_threads
total_users = tax_audit_run_users.count
partition_size = (total_users / thread_count).round
tax_audit_run_users.in_groups_of(partition_size).each do |group|
thread = TaxAuditRunThread.create(:tax_audit_run_id => id, :status_code => 1)
group.each do |user|
if user
user.tax_audit_run_thread_id = thread.id
user.save
end
end
end
where the thread_count is an attribute of the class that determines the number of background workers. Currently this code will create 4 threads rather than 3. I have also tried using find_in_batches but I am having the same problem where if I have 10 tax_audit_run_users in the array I have no way to let the last worker know to process the last record. Is there a way in ruby or rails to divide a collection into n parts and have the last part include the stragglers?
How to split (chunk) a Ruby array into parts of X elements?
You will of course need to modify it a bit to add the last chunk if it's less than the chunk size, or not, up to you.

Finding mongoDB records in batches (using mongoid ruby adapter)

Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
Sunpot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (i.e querying an external API for each item) it is possible for the cursor to timeout. In this case you need to perform multiple queries in order to not leave the cursor open.
require 'mongoid'
module Mongoid
class Criteria
def in_batches_of(count = 100)
Enumerator.new do |y|
total = 0
loop do
batch = 0
self.limit(count).skip(total).each do |item|
total += 1
batch += 1
y << item
end
break if batch == 0
end
end
end
end
end
Here is a helper method you can use to add the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
# call external slow API
end
Just make sure you ALWAYS have an order_by on your query. Otherwise the paging might not do what you want it to. Also I would stick with batches of 100 or less. As said in the accepted answer Mongoid queries in batches of 100 so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
records << r
if records.size > 1000
Sunspot.index! records
records.clear
end
end
Sunspot.index! records
no_timeout: prevents the cursor to disconnect (after 10 min, by default)
only: selects only the id and the fields, which are actually indexed
batch_size: fetch 1000 entries instead of 100
I am not sure about the batch processing, but you can do this way
current_page = 0
item_count = Model.count
while item_count > 0
Model.all.skip(current_page * 1000).limit(1000).each do |item|
Sunpot.index(item)
end
item_count-=1000
current_page+=1
end
But if you are looking for a perfect long time solution i wouldn't recommend this. Let me explain how i handled the same scenario in my app. Instead of doing batch jobs,
i have created a resque job which updates the solr index
class SolrUpdator
#queue = :solr_updator
def self.perform(item_id)
item = Model.find(item_id)
#i have used RSolr, u can change the below code to handle sunspot
solr = RSolr.connect :url => Rails.application.config.solr_path
js = JSON.parse(item.to_json)
solr.add js
end
end
After adding the item, i just put an entry to the resque queue
Resque.enqueue(SolrUpdator, item.id.to_s)
Thats all, start the resque and it will take care of everything
As #RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
Sunspot.index! records
end
The following will work for you , just try it
Model.all.in_groups_of(1000, false) do |r|
Sunspot.index! r
end

Resources