Sidekiq will run 25 concurrent jobs in our scenario. We need to get a single integer as the result of each job and tally all of the results together. In this case we are querying an external API and returning counts. We want the total count from all of the API requests.
The Report object stores the final total. Postgresql is our database.
At the end of each job, we increment the report with the additional records found.
Report.find(report_id).increment(:total, api_response_total)
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
increment shouldn't lead to concurrency issues. At the SQL level it updates atomically, setting the column to COALESCE(total, 0) + api_response_total. Race conditions only arise if you do the addition manually in Ruby and then save the object:
report = Report.find(report_id)
report.total += api_response_total
report.save # NOT SAFE
Note: even with increment!, the value at the Rails level can be stale, but it will be correct at the database level:
# suppose initial `total` is 0
report = Report.find(report_id) # Thread 1 at time t0
report2 = Report.find(report_id) # Thread 2 at time t0
report.increment!(:total) # Thread 1 at time t1
report2.increment!(:total) # Thread 2 at time t1
report.total #=> 1 # Thread 1 at time t2
report2.total #=> 1 # Thread 2 at time t2
report.reload.total #=> 2 # Thread 1 at time t3, value was stale in object, but correct in db
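If you would rather not load the record at all, ActiveRecord's update_counters issues the same kind of atomic UPDATE directly; a minimal sketch using the report_id and api_response_total from your code:

# Inside each Sidekiq job, once the API response is tallied.
# update_counters adds the value to the column in a single UPDATE statement,
# without instantiating the Report object first.
Report.update_counters(report_id, total: api_response_total)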
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
I would prefer to do this with Sidekiq Batches. They allow you to run a batch of jobs and assign a callback to the batch, which executes once all jobs are processed. Example:
batch = Sidekiq::Batch.new
batch.description = "Batch description (this is optional)"
batch.on(:success, MyCallback, :to => user.email)
batch.jobs do
  rows.each { |row| RowWorker.perform_async(row) }
end
puts "Just started Batch #{batch.bid}"
We need to get a single integer as the result of each job and tally all of the results together.
Note that Sidekiq does nothing with a job's return value; it is simply GC'ed and ignored. So, with the batch strategy above, you will not have the jobs' results available in the callback. You can tailor a solution around that: for example, keep a LIST in Redis keyed by the batch id, push each job's value onto it at the end of perform, and in the callback read the list and sum it.
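A rough sketch of that idea; the class names and fetch_count_from_api are placeholders, and the list is keyed by report_id here since it is already available to both the jobs and the callback:

class RowWorker
  include Sidekiq::Worker

  def perform(row, report_id)
    count = fetch_count_from_api(row) # your API call
    # push this job's result onto a list keyed by the report
    Sidekiq.redis { |conn| conn.rpush("report-#{report_id}-counts", count) }
  end
end

class SumCallback
  def on_success(status, options)
    report_id = options["report_id"]
    key = "report-#{report_id}-counts"
    counts = Sidekiq.redis { |conn| conn.lrange(key, 0, -1) }
    Report.find(report_id).update!(total: counts.sum(&:to_i))
    Sidekiq.redis { |conn| conn.del(key) }
  end
end

You would then enqueue with batch.on(:success, SumCallback, 'report_id' => report_id) and RowWorker.perform_async(row, report_id) inside batch.jobs.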
Related
I have a large Mongo DB from which I want to grab a batch of records, process them in a thread, grab the next batch, process it in a thread, etc. There is major decay in .skip, as explained in this post: https://arpitbhayani.me/blogs/mongodb-cursor-skip-is-slow. The only way I can figure out to do this is to take the last id of the current batch, as follows (this is non-threaded):
batch_size = 1000
starting_id = Person.first.id
batch = Person.where(:id.gte => starting_id).limit(batch_size)
while(batch.present?)
  batch.each do |b|
    # process
  end
  starting_id = batch.last.id
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
end
The problem is that the find is the (relatively) slow part, and what I really want to do is parallelize this line (I will take care of limiting the number of threads, so that's not an issue):
batch = Person.where(:id.gte => starting_id).limit(batch_size)
I can't figure out a non-skip approach to putting this in a thread, because I have to wait until the slow line (above) finishes before starting the next thread. Can anyone think of a way to thread this? This is what I've tried, but it gives almost zero performance improvement:
batch_size = 1000
starting_id = Person.first.id
thread_count = 10
keep_going = true
while(keep_going)
  batch = Person.where(:id.gte => starting_id).limit(batch_size)
  if batch.present?
    while Thread.list.count > (thread_count - 1)
      sleep(1)
    end
    Thread.new do
      batch.each do |b|
        # process
      end
      starting_id = batch.last.id
    end
  else
    keep_going = false
  end
end
This doesn't quite work, but the structure is not the problem; the main question is: how can I get the nth batch of records quickly in Mongo/Mongoid? If I could get the nth batch (which is what limit and skip gives me), I could easily parallelize.
thanks for any help,
Kevin
Something like:
while batch.any?
  batch = Person.where(:id.gte => starting_id).order(id: :ASC).limit(batch_size)
  Thread.new{ process batch }
  starting_id = batch.last.id
end
Alternatively, add a processing flag (with an index) and update the documents right after they are fetched. The update can be done in a single query, so it should not be too slow; at least it would be constant time. The batch query would then be .where(:id.gte => starting_id, :processing => nil).limit(batch_size)
the main question is how can I get the nth batch of records quickly in mongo / mongoid?
I don't think you need the nth batch, or need to parallelize the query. I believe the slow part will be processing each batch...
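A rough sketch of the claim-then-process idea (the processing flag and the process helper are placeholders):

loop do
  batch = Person.where(:id.gte => starting_id, :processing => nil)
                .limit(batch_size).to_a
  break if batch.empty?

  # claim the whole batch in one query so other threads skip these documents
  Person.where(:id.in => batch.map(&:id)).update_all(processing: true)

  Thread.new { batch.each { |person| process(person) } }
  starting_id = batch.last.id
end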
I have a Sidekiq worker that processes a series of tasks in batches. Once it completes a job, it updates a tracker table with the success/failure of the task. Each batch has a unique identifier that is passed to the worker, and the worker queries that table for the unique id and updates that particular row through an ActiveRecord query similar to:
cpr = MODEL.find(tracker_unique_id)
cpr.update_attributes(:attempted => cpr[:attempted] + 1, :success => cpr[:success] + 1)
What I have noticed is that the tracker only records one task having run, even though I can see from the Sidekiq log and another results table that x tasks finished running.
Can anyone help me with this?
Your update_attributes call has a race condition: you cannot increment like that safely. Multiple threads will stomp on each other. You must issue a proper UPDATE SQL statement:
update models set attempted = attempted + 1 where tracker_unique_id = ?
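If you would rather stay in ActiveRecord, the same kind of atomic UPDATE can be issued like this (a sketch using the MODEL and tracker_unique_id names from your code):

# Both counters are incremented in a single UPDATE statement,
# so concurrent workers cannot overwrite each other's writes.
MODEL.where(id: tracker_unique_id)
     .update_all("attempted = attempted + 1, success = success + 1")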
I noticed that Rails can have concurrency issues with multiple servers and would like to force my model to always lock. Is this possible in Rails, similar to unique constraints to force data integrity? Or does it just require careful programming?
Terminal One
irb(main):033:0* Vote.transaction do
irb(main):034:1* v = Vote.lock.first
irb(main):035:1> v.vote += 1
irb(main):036:1> sleep 60
irb(main):037:1> v.save
irb(main):038:1> end
Terminal Two, while sleeping
irb(main):240:0* Vote.transaction do
irb(main):241:1* v = Vote.first
irb(main):242:1> v.vote += 1
irb(main):243:1> v.save
irb(main):244:1> end
DB Start
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 0 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:42:58.875973
After execution
Terminal One
irb(main):040:0> v.vote
=> 1
Terminal Two
irb(main):245:0> v.vote
=> 1
DB End
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 1 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:44:10.276601
Other Example
http://rhnh.net/2010/06/30/acts-as-list-will-break-in-production
You are correct that transactions by themselves don't protect against many common concurrency scenarios, incrementing a counter being one of them. There isn't a general way to force a lock; you have to ensure you use it everywhere necessary in your code.
For the simple counter incrementing scenario there are two mechanisms that will work well:
Row Locking
Row locking will work as long as you do it everywhere in your code where it matters; knowing where it matters may take some experience to develop an instinct for. If, as in your code above, you have two places where a resource needs concurrency protection and you only lock in one of them, you will have concurrency issues.
You want to use the with_lock form; this does a transaction and takes a row-level lock (table locks obviously scale much more poorly than row locks, although for tables with few rows there is no difference, as PostgreSQL (not sure about MySQL) will use a table lock anyway). It looks like this:
v = Vote.first
v.with_lock do
  v.vote += 1
  sleep 10
  v.save
end
with_lock creates a transaction, locks the row the object represents, and reloads the object's attributes, all in one step, minimizing the opportunity for bugs in your code. However, this does not necessarily help with concurrency issues involving the interaction of multiple objects. It can work if a) all possible interactions depend on one object, and you always lock that object, and b) the other objects each only interact with one instance of that object, e.g. locking a user row and working with objects that all belong_to (possibly indirectly) that user object.
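A minimal sketch of that pattern, assuming a User model whose votes all belong to it (the counted flag is purely illustrative):

user = User.find(user_id)
user.with_lock do
  # while the user row is locked, no other process can enter this block
  # for the same user, so its dependent records can be updated safely
  user.votes.each { |v| v.update_attributes!(counted: true) }
end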
Serializable Transactions
The other possibility is to use serializable transactions. Since 9.1, PostgreSQL has "real" serializable transactions. These can perform much better than locking rows (though that is unlikely to matter in the simple counter-incrementing use case).
The best way to understand what serializable transactions give you is this: if you take all the possible orderings of all the (isolation: :serializable) transactions in your app, what happens when your app is running is guaranteed to always correspond with one of those orderings. With ordinary transactions this is not guaranteed to be true.
However, what you have to do in exchange is to take care of what happens when a transaction fails because the database is unable to guarantee that it was serializable. In the case of the counter increment, all we need to do is retry:
begin
  Vote.transaction(isolation: :serializable) do
    v = Vote.first
    v.vote += 1
    sleep 10 # this is to simulate concurrency
    v.save
  end
rescue ActiveRecord::StatementInvalid => e
  sleep rand/100 # this is NECESSARY in scalable real-world code,
                 # although the amount of sleep is something you can tune.
  retry
end
Note the random sleep before the retry. This is necessary because failed serializable transactions have a non-trivial cost, so if we don't sleep, multiple processes contending for the same resource can swamp the db. In a heavily concurrent app you may need to gradually increase the sleep with each retry. The randomness is VERY important to avoid harmonic deadlocks: if all the processes sleep the same amount of time, they can get into a rhythm with each other, where they are all sleeping and the system is idle, and then they all try for the lock at the same time and the system deadlocks, causing all but one to sleep again.
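A sketch of that gradually increasing, randomized backoff (the retry cap and multiplier are arbitrary):

attempts = 0
begin
  Vote.transaction(isolation: :serializable) do
    v = Vote.first
    v.vote += 1
    v.save!
  end
rescue ActiveRecord::StatementInvalid
  attempts += 1
  raise if attempts > 5          # give up eventually instead of retrying forever
  sleep(rand * 0.01 * attempts)  # randomized, and grows with each retry
  retry
end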
When the transaction that needs to be serializable involves interaction with a source of concurrency other than the database, you may still have to use row-level locks to accomplish what you need. An example of this would be when a state machine transition determines what state to transition to based on a query to something other than the db, like a third-party API. In this case you need to lock the row representing the object with the state machine while the third party API is queried. You cannot nest transactions inside serializable transactions, so you would have to use object.lock! instead of with_lock.
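A sketch of that situation, with Order, order_id and ThirdPartyApi standing in for your own model and external service:

Order.transaction(isolation: :serializable) do
  order = Order.find(order_id)
  order.lock!                                       # row-level lock without opening a nested transaction
  next_state = ThirdPartyApi.next_state_for(order)  # the external query that drives the transition
  order.state = next_state
  order.save!
end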
Another thing to be aware of is that any objects fetched outside the transaction(isolation: :serializable) should have reload called on them before use inside the transaction.
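For example, reusing the Vote model from the question:

v = Vote.first                  # fetched outside the transaction
Vote.transaction(isolation: :serializable) do
  v.reload                      # refresh the attributes before relying on them
  v.vote += 1
  v.save!
end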
ActiveRecord always wraps save operations in a transaction.
For your simple case it might be best to just use a SQL update instead of performing logic in Ruby and then saving. Here is an example which adds a model method to do this:
class Vote
  def vote!
    self.class.update_all("vote = vote + 1", {:id => id})
  end
end
This method avoids the need for locking in your example. If you need more general database locking, see David's suggestion.
You can do the following in your model:
class Vote < ActiveRecord::Base
  validate :handle_conflict, on: :update

  attr_accessible :original_updated_at
  attr_writer :original_updated_at

  def original_updated_at
    @original_updated_at || updated_at
  end

  def handle_conflict
    # If we want to use this across multiple models
    # then extract this to a module
    if @conflict || updated_at.to_f > original_updated_at.to_f
      @conflict = true
      @original_updated_at = nil
      # If two updates are made at the same time, a validation error
      # is displayed listing the fields that changed
      errors.add :base, 'This record changed while you were editing'
      changes.each do |attribute, values|
        errors.add attribute, "was #{values.first}"
      end
    end
  end
end
original_updated_at is a virtual attribute. handle_conflict is fired when the record is updated; it checks whether the updated_at attribute in the database is later than the hidden one (defined on your page). By the way, you should define the following in your app/views/votes/_form.html.erb:
<%= f.hidden_field :original_updated_at %>
If there is a conflict, the validation error is shown.
If you are using Rails 4, you won't have attr_accessible and will need to add :original_updated_at to the vote_params method in your controller.
Hopefully this sheds some light.
For a simple +1:
Vote.increment_counter :vote, Vote.first.id
Because vote was used for both the table name and the field name here, this is how the two are used:
TableName.increment_counter :field_name, id_of_the_row
Foobar.find(1).votes_count returns 0.
In rails console, I am doing:
10.times { Resque.enqueue(AddCountToFoobar, 1) }
My resque worker:
class AddCountToFoobar
  @queue = :low

  def self.perform(id)
    foobar = Foobar.find(id)
    foobar.update_attributes(:count => foobar.votes_count + 1)
  end
end
I would expect Foobar.find(1).votes_count to be 10, but instead it returns 4. If I run 10.times { Resque.enqueue(AddCountToFoobar, 1) } again, I see the same behaviour: it only increments votes_count by 4, and sometimes 5.
Can anyone explain this?
This is a classic race condition scenario. Imagine that only 2 workers exist and that they each run one of your vote incrementing jobs. Imagine the following sequence.
Worker1: load foobar(vote count == 1)
Worker2: load foobar(vote count == 1, in a separate ruby object)
Worker 1: increment vote count (now == 2) and save
Worker 2: increment its copy of foobar (vote count now == 2) and save, overwriting what worker 1 did
Although 2 workers ran 1 update job each, the count only increased by 1 because they were both operating on their own copy of foobar that wasn't aware of the change the other worker was making.
To solve this, you could either do an in-place update, i.e.
UPDATE foos SET count = count + 1
or use one of the two forms of locking ActiveRecord supports (pessimistic locking and optimistic locking).
The former works because the database ensures that you don't have concurrent updates on the same row at the same time.
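For example, the job could be rewritten around increment_counter, which generates exactly that kind of single-statement UPDATE (assuming the counter column is votes_count):

def self.perform(id)
  # issues UPDATE foobars SET votes_count = votes_count + 1 WHERE id = ?
  Foobar.increment_counter(:votes_count, id)
end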
Looks like ActiveRecord is not thread-safe in Resque (or rather redis, I guess). Here's a nice explanation.
As Frederick says, you're observing a race condition. You need to serialize access to the critical section from the time you read the value until you update it.
I'd try to use pessimistic locking:
http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html
http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html
foobar = Foobar.find(id)
foobar.with_lock do
  foobar.update_attributes(:count => foobar.votes_count + 1)
end
Using rails 3 and mongoDB with the mongoid adapter, how can I batch finds to the mongo DB? I need to grab all the records in a particular mongo DB collection and index them in solr (initial index of data for searching).
The problem I'm having is that doing Model.all grabs all the records and stores them into memory. Then when I process over them and index in solr, my memory gets eaten up and the process dies.
What I'm trying to do is batch the find in mongo so that I can iterate over 1,000 records at a time, pass them to solr to index, and then process the next 1,000, etc...
The code I currently have does this:
Model.all.each do |r|
  Sunspot.index(r)
end
For a collection that has about 1.5 million records, this eats up 8+ GB of memory and kills the process. In ActiveRecord, there is a find_in_batches method that allows me to chunk up the queries into manageable batches that keeps the memory from getting out of control. However, I can't seem to find anything like this for mongoDB/mongoid.
I would LIKE to be able to do something like this:
Model.all.in_batches_of(1000) do |batch|
  Sunspot.index(batch)
end
That would alleviate my memory problems and query difficulties by only handling a manageable problem set each time. The documentation is sparse, however, on doing batch finds in MongoDB. I see lots of documentation on doing batch inserts, but not batch finds.
With Mongoid, you don't need to manually batch the query.
In Mongoid, Model.all returns a Mongoid::Criteria instance. Upon calling #each on this Criteria, a Mongo driver cursor is instantiated and used to iterate over the records. This underlying cursor already fetches the records in batches; by default the batch_size is 100.
For more information on this topic, read this comment from the Mongoid author and maintainer.
In summary, you can just do this:
Model.all.each do |r|
  Sunspot.index(r)
end
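If you want the cursor to fetch larger chunks, you can also set the batch size on the criteria (500 here is arbitrary):

Model.batch_size(500).all.each do |r|
  Sunspot.index(r)
end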
If you are iterating over a collection where each record requires a lot of processing (e.g. querying an external API for each item), it is possible for the cursor to time out. In this case you need to perform multiple queries in order not to leave the cursor open.
require 'mongoid'
module Mongoid
  class Criteria
    def in_batches_of(count = 100)
      Enumerator.new do |y|
        total = 0
        loop do
          batch = 0
          self.limit(count).skip(total).each do |item|
            total += 1
            batch += 1
            y << item
          end
          break if batch == 0
        end
      end
    end
  end
end
The helper method above adds the batching functionality; it can be used like so:
Post.all.order_by(:id => 1).in_batches_of(7).each_with_index do |post, index|
  # call external slow API
end
Just make sure you ALWAYS have an order_by on your query, otherwise the paging might not do what you want it to. Also, I would stick with batches of 100 or less: as said in the accepted answer, Mongoid queries in batches of 100, so you never want to leave the cursor open while doing the processing.
It is faster to send batches to Sunspot as well.
This is how I do it:
records = []
Model.batch_size(1000).no_timeout.only(:your_text_field, :_id).all.each do |r|
  records << r
  if records.size > 1000
    Sunspot.index! records
    records.clear
  end
end
Sunspot.index! records
no_timeout: prevents the cursor from disconnecting (after 10 minutes, by default)
only: selects only the id and the fields that are actually indexed
batch_size: fetches 1000 entries at a time instead of 100
I am not sure about batch processing, but you can do it this way:
current_page = 0
item_count = Model.count
while item_count > 0
  Model.all.skip(current_page * 1000).limit(1000).each do |item|
    Sunspot.index(item)
  end
  item_count -= 1000
  current_page += 1
end
But if you are looking for a long-term solution, I wouldn't recommend this. Let me explain how I handled the same scenario in my app. Instead of doing batch jobs,
I created a Resque job which updates the Solr index:
class SolrUpdator
  @queue = :solr_updator

  def self.perform(item_id)
    item = Model.find(item_id)
    # I have used RSolr; you can change the code below to use Sunspot instead
    solr = RSolr.connect :url => Rails.application.config.solr_path
    js = JSON.parse(item.to_json)
    solr.add js
  end
end
After adding the item, I just push an entry onto the Resque queue:
Resque.enqueue(SolrUpdator, item.id.to_s)
That's all; start Resque and it will take care of everything.
As #RyanMcGeary said, you don't need to worry about batching the query. However, indexing objects one at a time is much much slower than batching them.
Model.all.to_a.in_groups_of(1000, false) do |records|
  Sunspot.index! records
end
The following will work for you; just try it:
Model.all.in_groups_of(1000, false) do |r|
  Sunspot.index! r
end