Rails find_in_batches with locking - ruby-on-rails

I need to process a large number of records in batches, and each batch should be processed in its own transaction. Is there a way to wrap each batch in a transaction and lock all the records in the batch at the same time?
Model.scheduled_now.lock.find_in_batches do |batch|
  model_ids = batch.map(&:id)
  Model.where(id: model_ids).update_all(status: 'delivering')
  # creates and updates other DB records
  # and triggers background job
  perform_delivery_actions(batch)
end
Does SELECT FOR UPDATE in this example commit the transaction after each batch?
Or do I need to put an internal transaction block inside each batch and lock the records manually (which means one more query)?
The reason I don't want to put an outer transaction block around the whole loop is that I want to commit each batch separately, not everything at once.

I ended up implementing my own find_in_batches_with_lock:
def find_in_batches_with_lock(scope, user, batch_size: 1000)
  last_processed_id = 0
  records = []
  begin
    user.transaction do
      # to_a forces the locking SELECT to run inside this transaction
      records = scope.where("id > ?", last_processed_id)
                     .order(:id).limit(batch_size).lock.to_a
      next if records.empty?
      yield records
      last_processed_id = records.last.id
    end
  end while !records.empty?
end
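For reference, here is roughly how the helper could be called with the scope from the question. This is just a usage sketch: current_user stands in for whatever object owns the transaction, while scheduled_now and perform_delivery_actions come from the original snippet. Each batch is locked and committed in its own transaction:
find_in_batches_with_lock(Model.scheduled_now, current_user, batch_size: 1000) do |batch|
  Model.where(id: batch.map(&:id)).update_all(status: 'delivering')
  perform_delivery_actions(batch)
end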

Related

threaded batching with mongoid without using skip

I have a large Mongo DB from which I want to grab a batch of records, process them in a thread, grab the next batch, process them in a thread, and so on. There is major decay in .skip, as explained in this post: https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow. The only way I can figure out how to do this is to take the last id of the current batch as follows (this is non-threaded):
batch_size = 1000
starting_id = Person.first.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
while batch.present?
  batch.each do |b|
    # process
  end
  starting_id = batch.last.id
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
end
The problem is that the finding is the (relatively) slow part, and what I really want to do is parallelize this line (I will take care of governing how many threads run, so that's not an issue):
batch = Person.where(:id.gte => starting_id).limit(batch_size)
I can't figure out a non-skip approach to putting this in a thread, because I have to wait until the slow line (above) finishes before starting the next thread. Can anyone think of a way to thread this? This is what I've tried, but it has almost zero performance improvement:
batch_size = 1000
starting_id = Person.first.id
thread_count = 10
keep_going = true
while keep_going
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
  if batch.present?
    # throttle: wait until a thread slot frees up
    while Thread.list.count > (thread_count - 1)
      sleep(1)
    end
    Thread.new do
      batch.each do |b|
        # process
      end
    end
    starting_id = batch.last.id
  else
    keep_going = false
  end
end
This doesn't quite work, but the structure is not the problem. The main question is: how can I get the nth batch of records quickly in mongo / mongoid? If I could get the nth batch (which is what limit and skip give me) I could easily parallelize.
thanks for any help,
Kevin
Something like:
while batch.any?
  batch = Person.where(:id.gte => starting_id).order_by(id: :asc).limit(batch_size)
  Thread.new { process batch }
  starting_id = batch.last.id
end
Alternatively, add a processing flag plus an index, and update the documents after they are fetched. That update would be done in a single query, so it should not be too slow; at least it would be constant time. The batch query would then be .where(:id.gte => starting_id, :processing => nil).limit(batch_size). A rough sketch follows.
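For example, a minimal sketch of that flag-based approach, assuming a processing field exists (and is indexed) on Person; process is a placeholder for the per-document work, and the thread throttling from the question is omitted for brevity:
batch_size = 1000
starting_id = Person.first.id
loop do
  batch = Person.where(:id.gte => starting_id, :processing => nil)
                .order_by(id: :asc).limit(batch_size).to_a
  break if batch.empty?
  # mark the whole batch in one query so the next fetch skips these documents
  Person.in(id: batch.map(&:id)).update_all(processing: true)
  Thread.new { batch.each { |b| process(b) } }
  starting_id = batch.last.id
end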
the main question is how can I get the nth batch of records quickly in mongo / mongoid?
I don't think you need the nth batch, or need to parallelize the query. I believe the slow part will be processing each batch...

Custom bulk indexer for searchkick : mapping options are ignored

I'm using Searchkick 3.1.0
I have to bulk index a certain collection of records. From what I read in the docs and have tried, I cannot pass a predefined array of ids to Searchkick's reindex method. I'm using the async mode.
If you do, for example, Klass.reindex(async: true), it will enqueue jobs with the batch_size specified in your options. The problem with that is that it loops through the entire model's ids and then determines whether each one has to be indexed. For example, if I have 10,000 records in my database and a batch size of 200, it will enqueue 50 jobs, and each job will loop over its ids and index only the ones that meet the search_import conditions.
This step is useless; I would like to enqueue a pre-filtered array of ids to avoid looping through all the records.
I tried writing the following job to override the normal behavior:
def perform(class_name, batch_size = 100, offset = 0)
  model = class_name.constantize
  ids = model
        .joins(:user)
        .where(user: { active: true, id: $rollout.get(:searchkick).users })
        .where("#{class_name.downcase.pluralize}.id > ?", offset)
        .pluck(:id)

  until ids.empty?
    ids_to_enqueue = ids.shift(batch_size)
    Searchkick::BulkReindexJob.perform_later(
      class_name: model.name,
      record_ids: ids_to_enqueue
    )
  end
end
The problem: the Searchkick mapping options are completely ignored when inserting records into Elasticsearch, and I can't figure out why. It doesn't pick up the specified match (text_middle); it creates a mapping with the default 'keyword' match instead.
Is there any clean way to bulk reindex an array of records without having to enqueue jobs containing unwanted records?
You should be able to reindex records based on a condition:
From the searchkick docs:
Reindex multiple records
Product.where(store_id: 1).reindex
You can put that in your own delayed job.
What I have done, for some of our batch operations that already happen in a delayed job, is wrap the code in the job in the bulk block, which is also in the Searchkick docs.
Searchkick.callbacks(:bulk) do
  # wrap some batch operations on a model instrumented with Searchkick;
  # the bulk block should be outside of any transaction block
end
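Putting those two pieces together, a minimal sketch of a job that reindexes only a pre-filtered set of ids (FilteredReindexJob and filtered_ids are illustrative names, not part of Searchkick):
class FilteredReindexJob < ApplicationJob
  def perform(ids)
    # relation reindex, as in the docs example above
    Product.where(id: ids).reindex
  end
end

FilteredReindexJob.perform_later(filtered_ids)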

Using the result of concurrent Rails & Sidekiq jobs

Sidekiq will run 25 concurrent jobs in our scenario. We need to get a single integer as the result of each job and tally all of the results together. In this case we are querying an external API and returning counts. We want the total count from all of the API requests.
The Report object stores the final total. Postgresql is our database.
At the end of each job, we increment the report with the additional records found.
Report.find(report_id).increment(:total, api_response_total)
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
increment shouldn't lead to concurrency issues; at the SQL level it updates atomically with COALESCE(total, 0) + api_response_total. Race conditions can arise only if you do the addition manually and then save the object:
report = Report.find(report_id)
report.total += api_response_total
report.save # NOT SAFE
Note: Even with increment! the value at Rails level can be stale, but it will be correct at database level:
# suppose initial `total` is 0
report = Report.find(report_id) # Thread 1 at time t0
report2 = Report.find(report_id) # Thread 2 at time t0
report.increment!(:total) # Thread 1 at time t1
report2.increment!(:total) # Thread 2 at time t1
report.total #=> 1 # Thread 1 at time t2
report2.total #=> 1 # Thread 2 at time t2
report.reload.total #=> 2 # Thread 1 at time t3, value was stale in object, but correct in db
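If you'd rather not load the record at all, the same atomic SQL can be issued in one call with update_counters, which works on any integer column, not just counter caches:
# issues a single UPDATE reports SET total = COALESCE(total, 0) + n WHERE id = ...
Report.update_counters(report_id, total: api_response_total)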
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
I would prefer to do this with Sidekiq Batches (a Sidekiq Pro feature). It allows you to run a batch of jobs and assign a callback to the batch, which executes once all jobs are processed. Example:
batch = Sidekiq::Batch.new
batch.description = "Batch description (this is optional)"
batch.on(:success, MyCallback, :to => user.email)
batch.jobs do
  rows.each { |row| RowWorker.perform_async(row) }
end
puts "Just started Batch #{batch.bid}"
We need to get a single integer as the result of each job and tally all of the results together.
Note that a Sidekiq job doesn't do anything with its return value; the value is GC'ed and ignored. So with the batch strategy above, you won't have the jobs' results in the callback. You can tailor the solution yourself: for example, keep a LIST in Redis keyed by the batch id, push each job's value onto it when the job completes (in perform), and in the callback simply read the list and sum it.
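A rough sketch of that Redis-list idea, assuming Sidekiq Pro's batch API (bid available inside the worker, status.bid in the callback) and a placeholder fetch_count_from_api helper; the report id would be passed as an option via batch.on(:success, MyCallback, 'report_id' => report.id):
class RowWorker
  include Sidekiq::Worker

  def perform(row)
    count = fetch_count_from_api(row) # placeholder for the external API call
    # push this job's result onto a list keyed by the batch id
    Sidekiq.redis { |r| r.rpush("batch:#{bid}:counts", count) }
  end
end

class MyCallback
  def on_success(status, options)
    counts = Sidekiq.redis { |r| r.lrange("batch:#{status.bid}:counts", 0, -1) }
    Report.find(options["report_id"]).update!(total: counts.sum(&:to_i))
  end
end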

Processing pgSQL query results in batches

I've written a rake task to perform a PostgreSQL query. The task returns an object of class PG::Result.
Here's my task:
task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
  if results.present?
    results
  else
    nil
  end
end

def execute sql
  ActiveRecord::Base.connection.execute sql
end
My further plan is to split the output in batches and save these batches one by one into a .csv file.
Here I get stuck: I can't see how to call the find_in_batches method from the ActiveRecord::Batches module on a PG::Result.
How should I proceed?
Edit: I have a legacy sql query to a legacy database
If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:
1. Force the query to be ordered by the primary key.
2. Add a LIMIT clause to the query to match the batch size.
3. Execute the modified query from (2) to get a batch.
4. Do whatever needs to be done with the batch.
5. If the batch is smaller than the batch size, then the unlimited query has been exhausted, so we're done.
6. Get the maximum primary key value (last_max) from the batch you got in (3).
7. Add primary_key_column > last_max to the query from (2)'s WHERE clause, run the query again, and go to step (4).
Pretty straightforward, and it could be implemented with something like this:
def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }

  results = execute(query % { last_max: last_max }).to_a
  while results.any?
    yield results
    break if results.length < batch_size
    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end

in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end
Where, of course, primary_key_column and friends have been replaced with real values.
If you don't have a primary key in your query then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key but that can get expensive with large result sets.
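Since the end goal is a CSV file, the helper could be wired up roughly like this (the file name and column names are placeholders; each row is a Hash with string keys, as returned by PG::Result#to_a):
require 'csv'

CSV.open("export_products.csv", "w") do |csv|
  in_batches_of(1000) do |batch|
    batch.each { |row| csv << row.values_at("id", "name") }
  end
end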

Finding mongoDB records in batches (using mongoid ruby adapter)

Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only dealing with a manageable problem set each time. The documentation is sparse, however, on doing batch finds in MongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
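If the default of 100 is too chatty for your workload, Mongoid also lets you ask the driver for larger batches (the same batch_size option one of the answers below uses):
# fetch 500 documents per cursor round trip instead of the default 100
Model.all.batch_size(500).each do |r|
  Sunspot.index(r)
end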
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries so as not to leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper above adds this batching functionality to Mongoid::Criteria. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less. As said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a proper long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs, I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to handle Sunspot
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just put an entry onto the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start Resque and it will take care of everything.
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you, just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end
