I have a large MongoDB collection. I want to grab a batch of records, process it in a thread, grab the next batch, process that in a thread, and so on. Performance of .skip degrades badly as the offset grows, as explained in this post: https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow. The only way I can figure out how to do this is to take the last id of the current batch, as follows (this version is non-threaded):
batch_size = 1000
starting_id = Person.first.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
while batch.present?
  batch.each do |b|
    # process
  end
  starting_id = batch.last.id
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
end
The problem is that the query is the (relatively) slow part, and what I really want to do is parallelize this line (I will take care of limiting the number of threads, so that's not an issue):
batch = Person.where(:id.gte => starting_id).limit(batch_size)
I can't figure out a non-skip approach to putting this in a thread, because I have to wait until the slow line (above) finishes before I can start the next thread. Can anyone think of a way to thread this? This is what I've tried, but it gives almost zero performance improvement:
batch_size = 1000
starting_id = Person.first.id
thread_count = 10
keep_going = true
while keep_going
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
  if batch.present?
    while Thread.list.count > (thread_count - 1)
      sleep(1)
    end
    Thread.new do
      batch.each do |b|
        # process
      end
      starting_id = batch.last.id
    end
  else
    keep_going = false
  end
end
This doesn't quite work, but the structure is not the problem; the main question is how can I get the nth batch of records quickly in mongo / mongoid? If I could get the nth batch (which is what limit and skip give me), I could easily parallelize.
thanks for any help,
Kevin
Something like:
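last_id = nil
loop do
  scope = last_id ? Person.where(:id.gt => last_id) : Person.all
  # :id.gt (not gte) so the last record of a batch is not fetched again
  batch = scope.order(id: :asc).limit(batch_size).to_a
  break if batch.empty?
  Thread.new { process(batch) } # process is a placeholder for your own method
  last_id = batch.last.id
end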
Alternatively, add a processing key with an index, and update the documents once they are fetched. The update can be done in a single query, so it should not be too slow; at least it would take constant time. The batch query would then be .where(:id.gte => starting_id, :processing => nil).limit(batch_size)
the main question is how can I get the nth batch of records quickly in mongo / mongoid?
I don't think you need the nth batch, or need to parallelize the query. I believe the slow part will be processing each batch...
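The idea in this answer (one sequential fetcher that pages by last id, with the processing fanned out to threads) can be sketched in plain Ruby. This is only an illustration under stated assumptions: RECORDS and fetch_batch are hypothetical stand-ins for the Mongoid query, and process_all is a made-up name. A SizedQueue bounds how many fetched batches wait in memory.

```ruby
# In-memory stand-in for a collection sorted by id (hypothetical; a real
# app would run something like Person.where(:id.gt => last_id) instead).
RECORDS = (1..2_500).map { |i| { id: i } }.freeze

# Fetch the next batch strictly after last_id (paging by last seen id).
def fetch_batch(last_id, batch_size)
  RECORDS.select { |r| last_id.nil? || r[:id] > last_id }.first(batch_size)
end

def process_all(batch_size: 1000, workers: 4)
  queue = SizedQueue.new(workers) # blocks the fetcher when workers fall behind
  processed = Queue.new           # thread-safe collector, for the demo only

  pool = Array.new(workers) do
    Thread.new do
      while (batch = queue.pop) != :done
        batch.each { |r| processed << r[:id] } # "process" each record
      end
    end
  end

  # Single fetcher: each query only needs the last id of the previous batch.
  last_id = nil
  loop do
    batch = fetch_batch(last_id, batch_size)
    break if batch.empty?
    queue << batch
    last_id = batch.last[:id]
  end

  workers.times { queue << :done } # one stop marker per worker
  pool.each(&:join)
  Array.new(processed.size) { processed.pop }
end

ids = process_all(batch_size: 1000, workers: 4)
puts ids.size
```

Fetching stays sequential here, which matches the answer's premise that processing, not fetching, dominates the run time.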
Sidekiq will run 25 concurrent jobs in our scenario. We need to get a single integer as the result of each job and tally all of the results together. In this case we are querying an external API and returning counts. We want the total count from all of the API requests.
The Report object stores the final total. Postgresql is our database.
At the end of each job, we increment the report with the additional records found.
Report.find(report_id).increment(:total, api_response_total)
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
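increment! shouldn't lead to concurrency issues: at the SQL level, in current Rails versions, it updates atomically with COALESCE(total, 0) + api_response_total. Note that plain increment, without the bang, only changes the attribute in memory and does not write to the database. Race conditions arise only if you do the addition manually and then save the object: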
report = Report.find(report_id)
report.total += api_response_total
report.save # NOT SAFE
Note: Even with increment!, the value at the Rails level can be stale, but it will be correct at the database level:
# suppose initial `total` is 0
report = Report.find(report_id) # Thread 1 at time t0
report2 = Report.find(report_id) # Thread 2 at time t0
report.increment!(:total) # Thread 1 at time t1
report2.increment!(:total) # Thread 2 at time t1
report.total #=> 1 # Thread 1 at time t2
report2.total #=> 1 # Thread 2 at time t2
report.reload.total #=> 2 # Thread 1 at time t3, value was stale in object, but correct in db
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
I would prefer to do this with Sidekiq Batches (a Sidekiq Pro feature). A batch lets you run a set of jobs and attach a callback that executes once all the jobs have been processed. Example:
batch = Sidekiq::Batch.new
batch.description = "Batch description (this is optional)"
batch.on(:success, MyCallback, :to => user.email)
batch.jobs do
rows.each { |row| RowWorker.perform_async(row) }
end
puts "Just started Batch #{batch.bid}"
We need to get a single integer as the result of each job and tally all of the results together.
Note that Sidekiq does nothing with a job's return value; it is ignored and garbage-collected. So with the batch strategy above, the callback will not have the jobs' results. You can tailor the solution to handle this. For example, keep a LIST in Redis keyed by the batch id, and push each job's value onto it when the job completes (in perform). In the callback, read the list and sum it.
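The list-per-batch tally can be sketched without Redis or Sidekiq. Everything below is an assumption made for illustration: FakeListStore is a hypothetical in-memory stand-in for a Redis list, and record_result and tally are made-up names for what perform and the batch callback would do.

```ruby
# Hypothetical stand-in for a Redis list, guarded by a Mutex so that
# concurrent "jobs" can push safely. In production this would be Redis.
class FakeListStore
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
    @lock = Mutex.new
  end

  # Append a value to the list stored under key.
  def rpush(key, value)
    @lock.synchronize { @lists[key] << value }
  end

  # Return a copy of the whole list stored under key.
  def lrange_all(key)
    @lock.synchronize { @lists[key].dup }
  end
end

STORE = FakeListStore.new

# What each job would do at the end of perform: push its count.
def record_result(bid, count)
  STORE.rpush("batch-#{bid}", count)
end

# What the batch's success callback would do: sum everything pushed.
def tally(bid)
  STORE.lrange_all("batch-#{bid}").sum
end

# Simulate 25 concurrent jobs reporting the counts 1..25.
threads = (1..25).map { |n| Thread.new { record_result("abc123", n) } }
threads.each(&:join)
puts tally("abc123") # prints 325
```

Because each job only appends and the sum happens once at the end, no job ever reads a running total that another job might be changing.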
To break up a long migration of data, I'm using a query limited to groups of 100, then processing those 100 records.
something like this...
count = Model.where("conditions").count
batches = count / 100
batches += 1 if count % 100 != 0 # test the remainder of the original count
batches.times do
  # do my data migration steps .limit(100)...
end
Is there a shortcut or better way of doing that count based on whether or not there is a remainder when dividing by 100? It feels like I'm forgetting an easy way (besides rounding, which seems slower, but maybe it's not).
Yes. This is very well supported by Rails; you do not have to roll your own code for finding batches of records.
The easiest way is to simply use find_each, which seamlessly loads 1000 records at a time:
Model.find_each do |model|
  # ...
end
The underlying mechanism is find_in_batches, with a default batch size of 1000. You can use find_in_batches directly, but you do not have to; find_each is sufficient:
Model.find_in_batches(batch_size: 100) do |batch|
  batch.each do |model|
    # ...
  end
end
Rails has several methods for loading records in batches. find_each would work nicely here. It defaults to batches of 1000, but you can specify the batch size:
Model.find_each(batch_size: 100) do |record|
  ...
end
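For the literal arithmetic in the question (how many batches of 100 are needed to cover count records), integer ceiling division avoids the separate remainder check. A minimal sketch; batches_needed is a hypothetical helper name, not a Rails method:

```ruby
# Ceiling division with integers only: adding (batch_size - 1) before
# dividing rounds any partial batch up to a whole one.
def batches_needed(count, batch_size)
  (count + batch_size - 1) / batch_size
end

puts batches_needed(250, 100) # prints 3
puts batches_needed(200, 100) # prints 2
```

The same result can be obtained with count.fdiv(batch_size).ceil, at the cost of going through floating point.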
I am attempting to make a batch process which will take a parameter that specifies the number of background workers, and split a collection into that many arrays. For example:
def split_for_batch(number_of_workers)
<code>
end
array = [1,2,3,4,5,6,7,8,9,10]
array.split_for_batch(3)
=> [[1,2,3],[4,5,6],[7,8,9,10]]
The thing is that I don't want to have to load all of the users into memory at once, because it is a batch process. What I have now is:
def initialize_audit_run_threads
  total_users = tax_audit_run_users.count
  partition_size = (total_users / thread_count).round
  tax_audit_run_users.in_groups_of(partition_size).each do |group|
    thread = TaxAuditRunThread.create(:tax_audit_run_id => id, :status_code => 1)
    group.each do |user|
      if user
        user.tax_audit_run_thread_id = thread.id
        user.save
      end
    end
  end
end
Here thread_count is an attribute of the class that determines the number of background workers. Currently this code creates 4 threads rather than 3. I have also tried find_in_batches, but I have the same problem: with 10 tax_audit_run_users, I have no way to tell the last worker to process the last record. Is there a way in Ruby or Rails to divide a collection into n parts and have the last part include the stragglers?
See "How to split (chunk) a Ruby array into parts of X elements?"
You will of course need to modify it a bit to add the last chunk if it's less than the chunk size, or not, up to you.
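One way to adapt the chunking idea so that the last part absorbs the stragglers is sketched below in plain Ruby. As an assumption for the sketch, split_for_batch takes the array as an explicit argument (the question calls it as an Array method), and it could equally be given an array of ids rather than loaded records.

```ruby
# Split items into number_of_workers groups of equal size, with the last
# group also taking whatever is left over.
def split_for_batch(items, number_of_workers)
  raise ArgumentError, "need at least one worker" unless number_of_workers.positive?

  base = items.length / number_of_workers
  return [items.dup] if base.zero? # fewer items than workers: one group

  head_count = base * (number_of_workers - 1)
  groups = items.first(head_count).each_slice(base).to_a
  groups << items.drop(head_count) # last group keeps the stragglers
  groups
end

p split_for_batch([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
# prints [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
```

When there are fewer items than workers, the sketch returns a single group; whether that is the right policy depends on how the workers are scheduled.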
I have a pretty big migration scanning through a 300k-row table, using find_each and a batch_size of 1000. The migration takes about two hours to run, and for each row a new row is created in a different table. I can't use pure SQL to do this migration - it has to be Ruby.
My question, though, is why does Ruby first use up all the memory available and then start using insane amounts of swap (35GBs)? (See the screen shots attached.) I would have thought Ruby's GC would have been invoked before it started eating swap. After all, in theory only 1000 records should be being loaded into memory at one time. And these records are small, far smaller than 1MB. What am I doing wrong?
UPDATE: here's some sample code
Post.find_each(:batch_size => 1000) do |p|
  user = User.find_by_fb_id(p.fb_uid)
  if user
    puts "Migrating post #{p.pid}"
    e = Entity.new
    e.created_at = p.time
    e.updated_at = p.last_update
    e.response = p.post
    e.user_id = user.id
    e.legacy_type = "GamePost"
    e.legacy_id = p.pid
    e.is_approved = true
    e.is_muted = true
    e.save(:validate => false)
  end
end
Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds against MongoDB? I need to grab all the records in a particular MongoDB collection and index them in Solr (the initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
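The behavior this answer describes, where each pulls records from the server one batch at a time rather than all at once, can be imitated in plain Ruby. FakeCursor below is a hypothetical class written only for this illustration; it is not the Mongo driver's cursor and only counts how many batch fetches an iteration triggers.

```ruby
# Hypothetical stand-in for a server-side cursor: records are produced
# one batch at a time, and only when iteration actually reaches them.
class FakeCursor
  include Enumerable

  attr_reader :fetches

  def initialize(total, batch_size: 100)
    @total = total
    @batch_size = batch_size
    @fetches = 0
  end

  def each
    return to_enum(:each) unless block_given?

    offset = 0
    while offset < @total
      @fetches += 1 # one simulated round trip per batch
      upper = [offset + @batch_size, @total].min
      (offset...upper).each { |i| yield i }
      offset = upper
    end
  end
end

full = FakeCursor.new(250)
full.each { |record| record } # iterate everything
puts full.fetches             # prints 3

partial = FakeCursor.new(250)
partial.first(5)              # stops early, so only one batch is fetched
puts partial.fetches          # prints 1
```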
If you are iterating over a collection where each record requires a lot of processing (e.g., querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries so that the cursor is not left open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query; otherwise the paging might not do what you want. I would also stick with batches of 100 or fewer. As the accepted answer says, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size >= 1000 # flush at exactly 1000 records
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from timing out (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
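The buffer-and-flush loop above follows the same pattern as Ruby's Enumerable#each_slice, which hands over fixed-size groups plus a final partial one. A minimal sketch in plain Ruby, under assumptions: index_in_batches is a hypothetical helper, and the block stands in for a bulk call such as Sunspot.index!.

```ruby
# Hand records to the given block in groups of batch_size; the last group
# may be smaller. Only one group is held in memory at a time.
def index_in_batches(records, batch_size, &indexer)
  records.each_slice(batch_size) { |batch| indexer.call(batch) }
end

sizes = []
# (1..2_500).each is an Enumerator standing in for a database cursor.
index_in_batches((1..2_500).each, 1000) { |batch| sizes << batch.size }
p sizes # prints [1000, 1000, 500]
```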
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a solid long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs, I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to handle Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry on the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start Resque and it will take care of everything.
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following should work for you; just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end