To break up a long migration of data, I'm using a query limited to groups of 100, then processing those 100 records.
something like this...
total = Model.where("conditions").count
count = total / 100
count = count + 1 if total % 100 != 0
count.times do
#do my data migration steps .limit(100)...
end
Is there a shortcut or better way of computing that count, based on whether or not there is a remainder when dividing by 100? It feels like I'm forgetting an easy way (besides rounding, which seems slower, but maybe it's not).
Yes. This is very well supported by Rails; you do not have to roll your own code for finding batches of records.
The easiest option is to simply use find_each, which loads 1000 records at a time by default:
Model.find_each do |model|
# ...
end
The underlying mechanism is find_in_batches, with a default batch size of 1000. You can also use find_in_batches directly, although find_each is usually sufficient:
Model.find_in_batches(batch_size: 100) do |batch|
batch.each do |model|
# ...
end
end
Rails has several methods for loading records in batches. find_each would work nicely here. It defaults to batches of 1000, but you can specify the batch size:
Model.find_each(batch_size: 100) do |record|
...
end
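As for the arithmetic in the question itself, the number of batches is a ceiling division, which Ruby can compute without a separate remainder check. A minimal sketch in plain Ruby follows; the helper names are made up for illustration and are not part of Rails:

```ruby
# Number of batches needed to cover `total` records, `size` at a time.
# Hypothetical helper, not part of Rails.
def batch_count(total, size = 100)
  # Integer ceiling division: adding (size - 1) before dividing rounds up
  # whenever there is a remainder.
  (total + size - 1) / size
end

# Same result using divmod, which returns quotient and remainder together.
def batch_count_divmod(total, size = 100)
  quotient, remainder = total.divmod(size)
  remainder.zero? ? quotient : quotient + 1
end

# Same result using float division and Float#ceil.
def batch_count_ceil(total, size = 100)
  total.fdiv(size).ceil
end
```

All three agree for non-negative counts; the integer version avoids floating point entirely.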
Related
I have a large MongoDB collection. I want to grab a batch of records, process them in a thread, grab the next batch, process it in a thread, and so on. Performance with .skip degrades badly as the offset grows, as explained in this post: https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow. The only way I can figure out how to do this is to take the last id of the current batch, as follows (this is non-threaded):
batch_size = 1000
starting_id = Person.first.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
while(batch.present?)
  batch.each do |b|
    # process
  end
  starting_id = batch.last.id
  # :id.gt here, so the last record of the previous batch is not fetched again
  batch = Person.where(:id.gt => starting_id).limit(batch_size)
end
The problem is that the find is (relatively) the slow part, and what I really want to do is parallelize this line (I will take care of limiting the number of threads, so that's not an issue):
batch = Person.where(:id.gte => starting_id).limit(batch_size)
I can't figure out a non-skip approach to putting this in a thread, because I have to wait until the slow line (above) finishes before starting the next thread. Can anyone think of a way to thread this? This is what I've tried, but it gives almost zero performance improvement:
batch_size = 1000
starting_id = Person.first.id
thread_count = 10
keep_going = true
while(keep_going)
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
  if batch.present?
    while Thread.list.count > (thread_count - 1)
      sleep(1)
    end
    Thread.new do
      batch.each do |b|
        # process
      end
      starting_id = batch.last.id
    end
  else
    keep_going = false
  end
end
This doesn't quite work, but the structure is not the problem. The main question is: how can I get the nth batch of records quickly in mongo / mongoid? If I could get the nth batch (which is what limit and skip give me), I could easily parallelize.
thanks for any help,
Kevin
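Something like:
batch = Person.where(:id.gte => starting_id).order(id: :ASC).limit(batch_size).to_a
while batch.any?
  # pass the batch in as a block argument so each thread keeps its own batch
  Thread.new(batch) { |records| process records }
  starting_id = batch.last.id
  # :id.gt so the last record of the previous batch is not fetched again
  batch = Person.where(:id.gt => starting_id).order(id: :ASC).limit(batch_size).to_a
end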
Alternatively, add a processing key and index and update the documents after they are fetched. It would be done in a single query, so should not be too slow. At least it would be constant time. The batch query would be .where(:id.gte => starting_id, :processing => nil).limit(batch_size)
the main question is how can I get the nth batch of records quickly in mongo / mongoid?
I don't think you need the nth batch, or need to parallelize the query. I believe the slow part will be processing each batch...
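To make the "fetch sequentially, process in parallel" idea concrete, here is a rough sketch in plain Ruby. It does not use Mongoid: PEOPLE and fetch_batch are stand-ins invented for the example, and with a real collection the fetch would be a query on ids greater than the last one seen, ordered by id and limited to the batch size.

```ruby
# Stand-in for a MongoDB collection: 25 records sorted by id.
PEOPLE = (1..25).map { |i| { id: i, name: "person-#{i}" } }.freeze

# Keyset ("last id") pagination: the next `size` records whose id is
# strictly greater than `last_id`. Hypothetical helper for this sketch.
def fetch_batch(last_id, size)
  PEOPLE.select { |p| last_id.nil? || p[:id] > last_id }.first(size)
end

def process_all(batch_size:, workers:)
  queue = SizedQueue.new(workers) # bounds how many batches wait in memory
  results = Queue.new

  pool = Array.new(workers) do
    Thread.new do
      # A nil batch is the shutdown signal for this worker.
      while (batch = queue.pop)
        batch.each { |person| results << person[:id] } # the "process" step
      end
    end
  end

  # Fetching stays sequential; only the processing runs in parallel.
  last_id = nil
  loop do
    batch = fetch_batch(last_id, batch_size)
    break if batch.empty?
    queue << batch # blocks while all workers are busy and the queue is full
    last_id = batch.last[:id]
  end

  workers.times { queue << nil }
  pool.each(&:join)

  ids = []
  ids << results.pop until results.empty?
  ids
end
```

The SizedQueue replaces the sleep-and-poll loop on Thread.list: the producer simply blocks when the workers fall behind.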
I have an array of Active Record results and I want to iterate over each record to get a specific attribute and add all of them up in one line, with a nil check. Here is what I have so far:
def total_cost(cost_rec)
total= 0.0
unless cost_rec.nil?
cost_rec.each { |c| total += c.cost }
end
total
end
Is there an elegant way to do the same thing in one line?
You could combine safe-navigation (to "hide" the nil check), summation inside the database (to avoid pulling a bunch of data out of the database that you don't need), and a #to_f call to hide the final nil check:
cost_rec&.sum(:cost).to_f
If the cost is an integer, then:
cost_rec&.sum(:cost).to_i
and if cost is a numeric inside the database and you don't want to worry about precision issues:
cost_rec&.sum(:cost).to_d
If cost_rec is an array rather than a relation (i.e. you've already pulled all the data out of the database), then one of:
cost_rec&.sum(&:cost).to_f
cost_rec&.sum(&:cost).to_i
cost_rec&.sum(&:cost).to_d
depending on what type cost is.
You could also use Kernel#Array to ignore nils (since Array(nil) is []) and ignore the difference between arrays and ActiveRecord relations (since #Array calls #to_ary and relations respond to that) and say:
Array(cost_rec).sum(&:cost)
that'll even allow cost_rec to be a single model instance. This also bypasses the need for the final #to_X call since [].sum is 0. The downside of this approach is that you can't push the summation into the database when cost_rec is a relation.
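The in-memory variants above can be tried without a database. This is only a sketch: CostRecord is a hypothetical plain-Ruby stand-in for the model, and the relation-only form sum(:cost) is not covered because it needs Active Record.

```ruby
# Hypothetical stand-in for an Active Record model with a `cost` attribute.
class CostRecord
  attr_reader :cost

  def initialize(cost)
    @cost = cost
  end
end

# Kernel#Array turns nil into [], leaves arrays alone, and wraps a single
# object in an array, so one expression covers all three cases.
def total_cost(cost_rec)
  Array(cost_rec).sum(&:cost)
end

# Safe navigation only guards the `sum` call; for nil the result is nil,
# and nil.to_f turns that into 0.0.
def total_cost_safe_nav(cost_rec)
  cost_rec&.sum(&:cost).to_f
end
```

Note that the first form returns an Integer 0 for nil or an empty array, while the second always returns a Float.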
Anything like these?
def total_cost(cost_rec)
(cost_rec || []).inject(0) { |memo, c| memo + c.cost }
end
or
def total_cost(cost_rec)
(cost_rec || []).sum(&:cost)
end
Any one of these should work:
total = cost_rec.map(&:cost).compact.sum
total = cost_rec.map{|c| c.cost }.compact.sum
total = cost_rec.pluck(:cost).compact.sum
Edit: if cost_rec is nil
total = (cost_rec || []).map{|c| c.cost }.compact.sum
When cost_rec is an ActiveRecord::Relation, this should work out of the box:
cost_rec.sum(:cost)
See ActiveRecord::Calculations#sum.
I am attempting to make a batch process which will take a parameter that specifies the number of background workers, and split a collection into that many arrays. For example:
def split_for_batch(number_of_workers)
<code>
end
array = [1,2,3,4,5,6,7,8,9,10]
array.split_for_batch(3)
=> [[1,2,3],[4,5,6],[7,8,9,10]]
The thing is that I don't want to have to load all of the users into memory at once, because it is a batch. What I have now is:
def initialize_audit_run_threads
  total_users = tax_audit_run_users.count
  partition_size = (total_users / thread_count).round
  tax_audit_run_users.in_groups_of(partition_size).each do |group|
    thread = TaxAuditRunThread.create(:tax_audit_run_id => id, :status_code => 1)
    group.each do |user|
      if user
        user.tax_audit_run_thread_id = thread.id
        user.save
      end
    end
  end
end
Here thread_count is an attribute of the class that determines the number of background workers. Currently this code will create 4 threads rather than 3. I have also tried using find_in_batches, but I have the same problem: if I have 10 tax_audit_run_users in the array, I have no way to let the last worker know to process the last record. Is there a way in Ruby or Rails to divide a collection into n parts and have the last part include the stragglers?
How to split (chunk) a Ruby array into parts of X elements?
You will of course need to modify it a bit to add the last chunk if it's less than the chunk size, or not, up to you.
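One way to get exactly n parts without loading the records is to compute offset/limit pairs from the count alone and let the last pair absorb the remainder. The sketch below is plain Ruby; partition_ranges is a made-up helper name, and feeding each pair to an ordered query's offset and limit is an assumption about how it would be used.

```ruby
# Splits `total` items into `parts` consecutive [offset, limit] pairs.
# Every part gets total / parts items; the last part also takes the
# stragglers. Hypothetical helper for illustration.
def partition_ranges(total, parts)
  base = total / parts
  (0...parts).map do |i|
    offset = i * base
    limit = i == parts - 1 ? total - offset : base
    [offset, limit]
  end
end

# Demonstration on an in-memory array; Array#[] with (start, length)
# plays the role of offset/limit here.
array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
groups = partition_ranges(array.size, 3).map { |offset, limit| array[offset, limit] }
```

When there are fewer items than parts, the leading pairs have a limit of 0 and the last pair takes everything, so callers may want to skip empty parts.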
I am developing a Rails app. I would like to use an array to hold 2,000,000 rows of data, then insert the data into the database like the following:
large_data = Get_data_Method() #get 2,000,000 raw data
all_values = Array.new
large_data.each{ |data|
all_values << data[1] #e.g. data[1] has the format "(2,'john','2002-09-12')"
}
sql="INSERT INTO cars (id,name,date) VALUES "+all_values.join(',')
ActiveRecord::Base.connection.execute(sql)
When I run the code, it takes a very long time at the large_data.each{...} step. In fact, I am still waiting for it to finish (it has been running for 1 hour already and still has not finished the large_data.each{...} part).
Is it because the number of elements is too large for a Ruby array, so that the array cannot hold 2,000,000 elements? Or can a Ruby array hold that many elements, and is it reasonable to wait this long?
Since I would like to use a bulk insert in SQL to speed up insertion of this large data set into the MySQL database, I would like to use only one INSERT INTO statement; that's why I did the above. If this is a bad design, can you recommend a better way?
Some notes:
Don't use the pattern "empty array + each + push", use Enumerable#map.
all_values = large_data.map { |data| data[1] }
Is it possible to write get_data to return items lazily? If the answer is yes, check out enumerators and use them to do batched inserts into the database instead of putting all objects in at once. Something like this:
def get_data
Enumerator.new do |yielder|
yielder.yield some_item
yielder.yield another_item
# yield all items.
end
end
get_data.each_slice(1000) do |data|
# insert those 1000 elements into the database
end
That said, there are projects for doing efficient bulk insertions; check ar-extensions and activerecord-import for Rails >= 3.
An array of 2 million items is never going to be the easiest thing to manage. Have you taken a look at MongoDB? It is a database which can be accessed just like an array and could be the answer to your issues.
An easy fix would be to split your inserts into blocks of 1000, that would make the whole process more manageable.
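Splitting the inserts into blocks of 1000 might look roughly like the sketch below. It only builds the SQL strings and does not touch a database; each_value_tuple is a hypothetical stand-in for the question's data source, and the tuples are assumed to be pre-formatted and already escaped, as in the question.

```ruby
# Lazily yields pre-formatted value tuples like "(2,'john','2002-09-12')".
# Hypothetical stand-in for the question's data source.
def each_value_tuple(count)
  return enum_for(:each_value_tuple, count) unless block_given?

  1.upto(count) { |i| yield "(#{i},'name#{i}','2002-09-12')" }
end

# Builds one multi-row INSERT statement per slice of `batch_size` tuples,
# so no single statement (or array) has to hold all 2,000,000 rows.
def insert_statements(tuples, batch_size: 1000)
  tuples.each_slice(batch_size).map do |slice|
    "INSERT INTO cars (id,name,date) VALUES #{slice.join(',')}"
  end
end

statements = insert_statements(each_value_tuple(2500))
```

In the Rails app, each statement would be passed to ActiveRecord::Base.connection.execute as in the question, ideally inside the each_slice block so that only one statement is held in memory at a time.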
Using Rails 3 and MongoDB with the Mongoid adapter, how can I batch finds against MongoDB? I need to grab all the records in a particular MongoDB collection and index them in Solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only doing a manageable problem set each time. The documentation is sparse, however, on doing batch finds in mongoDB. I see lots of documentation on doing batch inserts but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying Mongo driver cursor already batches all records. By default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
Sunspot.index(r)
end
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries in order to not leave the cursor open.
require 'mongoid'

module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
Here is a helper method you can use to add the batching functionality. It can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
# call external slow API
end
Just make sure you ALWAYS have an order_by on your query; otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less. As said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
records << r
if records.size >= 1000
Sunspot.index! records
records.clear
end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 min, by default)
only: selects only the id and the fields, which are actually indexed
batch_size: fetch 1000 entries instead of 100
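The buffer-and-flush loop above can also be written with Enumerable#each_slice, which emits the final short batch on its own. The sketch below is plain Ruby; FakeIndexer is a hypothetical stand-in for Sunspot so the batching can be checked without Solr, and any object that responds to each_slice should work as the source.

```ruby
# Hypothetical stand-in for Sunspot: remembers each batch it is given.
class FakeIndexer
  attr_reader :batches

  def initialize
    @batches = []
  end

  def index!(records)
    @batches << records.dup
  end
end

# Sends records to the indexer in slices of at most `batch_size`.
# each_slice yields the last, shorter slice too, so no manual buffer
# or trailing flush is needed.
def index_in_batches(records, indexer, batch_size: 1000)
  records.each_slice(batch_size) { |slice| indexer.index!(slice) }
end

indexer = FakeIndexer.new
index_in_batches((1..2500).each, indexer)
```

Only one slice is held in memory at a time, provided the source itself streams its records.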
I am not sure about the batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
Model.all.skip(current_page * 1000).limit(1000).each do |item|
Sunspot.index(item)
end
item_count-=1000
current_page+=1
end
But if you are looking for a robust long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs,
I created a Resque job which updates the Solr index:
class SolrUpdator
@queue = :solr_updator
def self.perform(item_id)
item = Model.find(item_id)
# I have used RSolr; you can change the code below to use Sunspot
solr = RSolr.connect :url => Rails.application.config.solr_path
js = JSON.parse(item.to_json)
solr.add js
end
end
After adding the item, i just put an entry to the resque queue
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all. Start Resque and it will take care of everything.
As @RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much, much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
Sunspot.index! records
end
The following should work for you; just try it:
Model.all.in_groups_of(1000, false) do |r|
Sunspot.index! r
end