How do I avoid a simple race condition in Rails?

I have a simple race condition. I have a website where people can vote on photos, but a maximum of 10 votes is allowed per photo.
When a user submits a vote, I update a num_votes column in the photos table for that specific photo, so the number of votes is easy to look up.
How can I make sure that the vote.save and the num_votes update happen in the same transaction?
Thanks!

In order to achieve this you have to use some kind of locking. Basically you have three options: Rails' optimistic locking, Rails' pessimistic locking, or an external locking backend (like Redis::Lock).
I would personally go for pessimistic locking if high performance is not a hard requirement here:
photo = Photo.find(photo_id)
photo.with_lock do
  photo.num_votes += 1
  photo.save!
end
I should also point out that merely wrapping the increment and the save in one transaction would not solve the race condition: most RDBMSs run in read committed isolation by default, which does not prevent it.
FYI, see the Pessimistic and Optimistic Locking reference.
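Note the snippet above doesn't yet enforce the 10-vote maximum from the question. A minimal sketch that adds the cap inside the lock (the Vote model and current_user are assumptions, not from the question):

photo = Photo.find(photo_id)
photo.with_lock do
  # with_lock reloads the row inside a transaction using SELECT ... FOR UPDATE,
  # so concurrent voters queue up here instead of racing
  if photo.num_votes < 10
    Vote.create!(:photo_id => photo.id, :user_id => current_user.id) # hypothetical model
    photo.num_votes += 1
    photo.save!
  end
end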

If it is a simple race condition then you should resolve it as a race condition.
Try using some locking mechanism. Redis is good to go:
redis locking for ruby
RedisLocker.new("vote_#{@photo.id}").run! { @photo.vote }
# ... photo model
def vote
  if num_votes < 10
    self.num_votes += 1
    save
  end
end

Well, Rails/Postgres supports transactions. You can simply declare one, on any ActiveRecord model:
photo = Photo.find(photo_id)
Photo.transaction do
  Vote.create!(:whatever)
  photo.num_votes = thing
  photo.save!
end
If an exception is raised during the transaction block (say, by calling .save! on an invalid model), the transaction is rolled back and any database changes that would have happened in there aren't committed (in this case, the Vote record doesn't get inserted). You'll still need to rescue and handle the exception, of course.
Incidentally, storing number of associated objects in a record for easy lookup is a pretty common pattern, known as a counter cache, and Rails supports those as well - you might want to look into formally making num_votes a counter cache (the default name would be photos.votes_count, but it's not required). You might still want a transaction to check that it doesn't exceed the limit, though.
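As a sketch of that counter cache setup (the association and migration are assumptions, since the question doesn't show a Vote model):

class Vote < ActiveRecord::Base
  # each Vote create/destroy atomically adjusts photos.votes_count in SQL
  belongs_to :photo, :counter_cache => true
end

class AddVotesCountToPhotos < ActiveRecord::Migration
  def change
    add_column :photos, :votes_count, :integer, :default => 0, :null => false
  end
end

Counter cache writes are issued as atomic UPDATE ... SET votes_count = votes_count + 1 statements, so they don't suffer from the read-modify-write race, but enforcing the 10-vote cap would still need one of the locking techniques above.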

You don't need explicit locks for this:
Photo.where(:id => photo_id).where('num_votes < 10').update_all('num_votes = num_votes + 1')
will update the number of votes for that photo, but only if there are less than 10 votes. You can check the return value of update_all to see if anything was actually updated: the return value is the number of updated rows. If the update fails then don't create the vote (or if you have already created the vote, rollback the transaction).
Optimistic locking uses a similar technique to detect concurrent updates: it puts a condition on the UPDATE that ensures nothing happens if someone has snuck in before you, and then checks the number of updated rows.
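Putting that together, a minimal sketch of the whole voting flow under this approach (the Vote model is an assumption):

Photo.transaction do
  # the WHERE clause makes the check-and-increment a single atomic statement
  updated = Photo.where(:id => photo_id)
                 .where('num_votes < 10')
                 .update_all('num_votes = num_votes + 1')
  # updated == 0 means the photo already had 10 votes; nothing to roll back
  Vote.create!(:photo_id => photo_id) if updated == 1 # hypothetical Vote model
end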

Related

locking rows on rails update to avoid collisions. (Postgres back end)

So I have a method on my model object which creates a unique sequence number when a binary field in the row is updated from null to true. It's implemented like this:
class AnswerHeader < ApplicationRecord
  before_save :update_survey_complete_sequence, if: :survey_complete_changed?

  def update_survey_complete_sequence
    maxval = AnswerHeader.maximum('survey_complete_sequence')
    self.survey_complete_sequence = maxval + 1
  end
end
My question is what do I need to lock so two rows being updated at the same time don't end up with two rows having the same survey_complete_sequence?
If it is possible to lock a single row rather than the whole table, that would be good, because this table is often accessed by users.
If you want to handle this in application logic itself, instead of letting the database handle it, you can make use of Rails' with_lock function, which will create a transaction and acquire a row-level DB lock on the selected rows (in your case, a single row).
What you need to lock
In your case you have to lock the row containing the maximum survey_complete_sequence, since this is the row every query will look for while getting the value you require.
maxval = AnswerHeader.maximum('survey_complete_sequence')
Is it possible to lock a single row rather than whole table
There is no such specific lock for your scenario. But you can make use of PostgreSQL's SELECT FOR UPDATE row-level locking.
To acquire an exclusive row-level lock on a row without actually
modifying the row, select the row with SELECT FOR UPDATE.
And you can use pessimistic locking in rails and specify which lock you will use.
Call lock('some locking clause') to use a database-specific locking
clause of your own such as 'LOCK IN SHARE MODE' or 'FOR UPDATE NOWAIT'
Here's an example of how to achieve that, from the official Rails guide itself:
Item.transaction do
  i = Item.lock("LOCK IN SHARE MODE").find(1)
  ...
end
Relations using lock are usually wrapped inside a transaction for preventing deadlock conditions.
So what you need to do is (sketched below):
Apply a SELECT FOR UPDATE lock to the row containing maximum('survey_complete_sequence')
Get the value you require from that row
Update your AnswerHeader with the value received
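In Rails terms, a minimal sketch of those three steps (it assumes every writer goes through this same code path, since a row lock only stops others who also ask for the lock):

AnswerHeader.transaction do
  # step 1: SELECT ... FOR UPDATE on the row holding the current maximum
  last = AnswerHeader.order(survey_complete_sequence: :desc).lock.first
  # step 2: read the value (0 if no row has a sequence yet)
  maxval = last ? last.survey_complete_sequence.to_i : 0
  # step 3: assign the next value and save
  answer_header.survey_complete_sequence = maxval + 1
  answer_header.save!
end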
I believe you should give advisory locks a look. It makes sure the same block of code isn't executed on two machines simultaneously, while still keeping the table open for other business.
It uses the database, but it doesn't lock your tables.
You can use the gem called "with_advisory_lock" like this:
Model.with_advisory_lock("ADVISORY_LOCK_NAME") do
  # Your code
end
https://github.com/ClosureTree/with_advisory_lock
It doesn't work with SQLite.
If you are using Postgres, maybe Sequenced can help you out without defining a sequence at the DB level.
Is there a reason survey_complete_sequence should be incremental? If not, maybe randomize a bigint?
You probably don't want to lock the table, and even if you lock the row you're currently updating, the row you're basing your maxval on will still be available for another update to read and generate its sequence number from.
Unless you have a huge table and lots of updates every millisecond (on the order of thousands), this shouldn't be an issue in real life. But if the idea bothers you, you can go ahead and add a unique index to the table on the "survey_complete_sequence" column. The DB error will propagate to a Rails exception you can deal with within the application.
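A sketch of that safety net (the migration name and retry count are illustrative):

class AddUniqueIndexToAnswerHeaders < ActiveRecord::Migration[5.2] # adjust version to your app
  def change
    add_index :answer_headers, :survey_complete_sequence, unique: true
  end
end

# at the call site: the before_save callback recomputes the sequence on each attempt
attempts = 0
begin
  answer_header.save!
rescue ActiveRecord::RecordNotUnique
  attempts += 1
  retry if attempts < 3
  raise
end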

Multiple worker threads working on the same database - how to make it work properly?

I have a database that has a list of rows that need to be operated on. It looks something like this:
id  remaining  delivered  locked
================================
1   10         24         f
2   6          0          f
3   0          14         f
I am using DataMapper with Ruby, but really I think this is a general programming question that isn't specific to the exact implementation I'm using...
I am creating a bunch of worker threads that do something like this (pseudo-ruby-code):
while true do
  t = any_row_in_database_where_remaining_greater_than_zero_and_unlocked
  t.lock # update database to set locked = true
  t.do_some_stuff
  t.delivered += 1
  t.remaining -= 1
  t.unlock
end
Of course, the problem is, these threads compete with each other and the whole thing isn't really thread safe. The first line in the while loop can easily pull out the same row in multiple threads before they get a chance to get locked.
I need to make sure one thread is only working on one row at the same time.
What is the best way to do this?
The key step is when you select an unlocked row from the database and mark it as locked. If you can do that safely then everything else will be fine.
Two ways I know of that can make this safe are pessimistic and optimistic locking. Both rely on your database as the ultimate guarantor when it comes to concurrency.
Pessimistic Locking
Pessimistic locking means acquiring a lock upfront, when you select the rows you want to work with, so that no one else can grab them first.
Something like
SELECT * from some_table WHERE ... FOR UPDATE
works with MySQL and Postgres (and possibly others) and will block any other connection from locking or modifying the rows returned to you until your transaction finishes (plain reads are generally not blocked; how granular that lock is depends on the engine used, indexes, etc. - check your database's documentation). It's called pessimistic because you assume that a concurrency problem will occur and acquire the lock preventatively. It does mean that you bear the cost of locking even when not necessary, and it may reduce your concurrency depending on the granularity of the lock.
Optimistic Locking
Optimistic locking refers to a technique where you don't want the burden of a pessimistic lock because most of the time there won't be concurrent updates (if you update the row setting the locked flag to true as soon as you have read the row, the window is relatively small). AFAIK this only works when updating one row at a time
First add an integer column lock_version to the table. Whenever you update the table, increment lock_version by 1 alongside the other updates you are making. Assume the current lock_version is 3. When you update, change the update query to
update some_table set ... where id=12345 and lock_version = 3
and check the number of rows updated (the DB driver returns this). If this updates 1 row then you know everything was OK. If it updates 0 rows then either the row you wanted was deleted or its lock version has changed, so you go back to step 1 of your process and search for a new row to work on.
I'm not a datamapper user so I don't know whether it / plugins for it provide support for these approaches. Active Record supports both so you can look there for inspiration if data mapper doesn't.
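As a sketch of that optimistic claim step in ActiveRecord terms (the Task model name is assumed; the columns come from the question's table):

# try to flip locked from false to true; the single UPDATE is atomic,
# and the affected-row count tells us whether this worker won the race
claimed = Task.where(:id => t.id, :locked => false).update_all(:locked => true)
if claimed == 1
  t.do_some_stuff
  # increment in SQL too, so we never write back stale counters
  Task.where(:id => t.id)
      .update_all('delivered = delivered + 1, remaining = remaining - 1, locked = false')
else
  # another worker claimed this row first; go pick a different one
end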
I would use a Mutex:
# outside your threads
worker_updater = Mutex.new

# inside each thread's updater
while true
  worker_updater.synchronize do
    # your code here
  end
  sleep 0.1 # Slow down there, mister!
end
This guarantees that only one thread at a time can enter the code in the synchronize. For optimal performance, consider what portion of your code needs to be thread-safe (first two lines?) and only wrap that portion in the Mutex.

State machine transitions at specific times

Simplified example:
I have a to-do. It can be future, current, or late based on what time it is.
Time      State
8:00 am   Future
9:00 am   Current
10:00 am  Late
So, in this example, the to-do is "current" from 9 am to 10 am.
Originally, I thought about adding fields for "current_at" and "late_at" and then using an instance method to return the state. I can query for all "current" todos with now > current_at and now < late_at.
In short, I'd calculate the state each time or use SQL to pull the set of states I need.
If I wanted to use a state machine, I'd have a set of states and would store that state name on the to-do. But, how would I trigger the transition between states at a specific time for each to-do?
- Run a cron job every minute to pull anything still in a state whose transition time has passed, and update it
- Use background processing to queue transition jobs at the appropriate times in the future, so in the above example I would have two jobs: "transition to current at 9 am" and "transition to late at 10 am", which would presumably have logic to guard against deleted todos and "don't mark late if done" and such.
Does anyone have experience with managing either of these options when trying to handle a lot of state transitions at specific times?
It feels like a state machine, I'm just not sure of the best way to manage all of these transitions.
Update after responses:
Yes, I need to query for "current" or "future" todos
Yes, I need to trigger notifications on state change ("your todo wasn't to-done")
Hence my desire for more of a state-machine-like idea, so that I can encapsulate the transitions.
I have designed and maintained several systems that manage huge numbers of these little state machines. (Some systems, up to 100K/day, some 100K/minute)
I have found that the more state you explicitly fiddle with, the more likely it is to break somewhere. Or to put it a different way, the more state you infer, the more robust the solution.
That being said, you must keep some state. But try to keep it as minimal as possible.
Additionally, keeping the state-machine logic in one place makes the system more robust and easier to maintain. That is, don't put your state machine logic in both code and the database. I prefer my logic in the code.
Preferred solution. (Simple pictures are best).
For your example I would have a very simple table:
task_id, current_at, current_duration, is_done, is_deleted, description...
and infer the state based on now in relation to current_at and current_duration. This works surprisingly well. Make sure you index/partition your table on current_at.
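A minimal sketch of that inference using the columns above (current_duration assumed to be a number of seconds):

class Task < ActiveRecord::Base
  # state is derived from the clock on every call, so nothing can drift out of sync
  def state
    now = Time.current
    return :future if now < current_at
    return :current if now < current_at + current_duration
    :late
  end
end

task.state # => :future, :current, or :late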
Handling logic on transition change
Things are different when you need to fire an event on the transition change.
Change your table to look like this:
task_id, current_at, current_duration, state, locked_by, locked_until, description...
Keep your index on current_at, and add one on state if you like. You are now mangling state, so things are a little more fragile due to concurrency or failure, so we'll have to shore it up a little bit using locked_by and locked_until for optimistic locking which I'll describe below.
I assume your program will occasionally fail in the middle of processing, even if only for a deployment.
You need a mechanism to transition a task from one state to another. To simplify the discussion, I'll concern myself with moving from FUTURE to CURRENT, but the logic is the same no matter the transition.
If your dataset is large enough, you constantly poll the database to discover tasks requiring transition (with linear or exponential back-off when there's nothing to do, of course); otherwise, use your favorite scheduler, whether that's cron, something Ruby-based, or Quartz if you subscribe to Java/Scala/C#.
Select all entries that need to be moved from FUTURE to CURRENT and are not currently locked.
-- move from future to current
select task_id
from tasks
where now() >= current_at
  and (locked_until is null or locked_until < now())
  and state = 'FUTURE'
  and current_at >= (now() - interval '3 days') -- optimization
limit :LIMIT -- optimization
Throw all these task_ids into your reliable queue. Or, if you must, just process them in your script.
When you start to work on an item, you must first lock it using our optimistic locking scheme:
update tasks
set locked_by = :worker_id -- unique identifier for host + process + thread
  , locked_until = now() + interval '5 minutes' -- however this looks in your SQL language
where task_id = :task_id -- you can lock multiple tasks here if necessary
  and (locked_until is null or locked_until < now()) -- only if it's not locked!
Now, if you actually updated the record, you own the lock. You may now fire your special on-transition logic. (Applause. This is what makes you different from all the other task managers, right?)
When that is successful, update the task state, make sure you still use the optimistic locking:
update tasks
set state = :new_state
, locked_until = null -- explicitly release the lock (an optimization, really)
where task_id = :task_id
and locked_by = :worker_id -- make sure we still own the lock
-- no-one really cares if we overstep our time-bounds
Multi-thread/process optimization
Only do this when you have multiple threads or processes updating tasks in batch (such as in a cron job, or when polling the database)! The problem is they'll each get similar results from the database and will then contend to lock each row. This is inefficient both because it slows down the database and because you have threads basically doing nothing but slowing down the others.
So, add a limit to how many results the query returns and follow this algorithm:
results = database.tasks_to_move_to_current_state :limit => BATCH_SIZE
while !results.empty?
  results.shuffle! # make sure we're not in lock step with another worker
  contention_count = 0
  results.each do |task_id|
    if database.lock_task :task_id => task_id
      on_transition_to_current task_id
    else
      contention_count += 1
    end
    break if contention_count > MAX_CONTENTION_COUNT # too much contention!
  end
  results = database.tasks_to_move_to_current_state :limit => BATCH_SIZE
end
Fiddle around with BATCH_SIZE and MAX_CONTENTION_COUNT until the program is super-fast.
Update:
The optimistic locking allows for multiple processors in parallel.
By having the lock time out (via the locked_until field), the scheme allows for failure while processing a transition. If the processor fails, another processor is able to pick up the task after the timeout (5 minutes in the above code). It is important, then, to a) only lock the task when you are about to work on it; and b) lock the task for as long as the work will take plus a generous leeway.
The locked_by field is mostly for debugging purposes (which process/machine was this on?). It is enough to have the locked_until field if your database driver returns the number of rows updated, but only if you update one row at a time.
Managing all those transitions at specific times does seem tricky. Perhaps you could use something like DelayedJob to schedule the transitions, so that a cron job every minute wouldn't be necessary, and recovering from a failure would be more automated?
Otherwise - if this is Ruby, is using Enumerable an option?
Like so (in untested pseudo-code, with simplistic methods)
class ToDo < ActiveRecord::Base
  def state
    if future?
      "Future"
    elsif current?
      "Current"
    elsif late?
      "Late"
    else
      "must not have been important"
    end
  end

  def future?
    Time.now.hour <= 8
  end

  def current?
    Time.now.hour == 9
  end

  def late?
    Time.now.hour >= 10
  end

  def self.find_current_to_dos
    self.find(:all, :conditions => "1=1 /* or whatever */").select { |t| t.state == 'Current' }
  end
end
One simple solution for moderately large datasets is to use a SQL database. Each todo record should have a "state_id", "current_at", and "late_at" fields. You can probably omit the "future_at" unless you really have four states.
This allows three states:
Future: when now < current_at
Current: when current_at <= now < late_at
Late: when late_at <= now
Storing the state as state_id (optionally make a foreign key to a lookup table named "states" where 1: Future, 2: Current, 3: Late) is basically storing de-normalized data, which lets you avoid recalculating the state as it rarely changes.
If you aren't actually querying todo records according to state (eg ... WHERE state_id = 1) or triggering some side-effect (eg sending an email) when the state changes, perhaps you don't need to manage state. If you're just showing the user a todo list and indicating which ones are late, the cheapest implementation might even be to calculate it client side. For the purpose of answering, I'll assume you need to manage the state.
You have a few options for updating state_id. I'll assume you are enforcing the constraint current_at < late_at.
The simplest is to update every record: UPDATE todos SET state_id = CASE WHEN late_at <= NOW() THEN 3 WHEN current_at <= NOW() THEN 2 ELSE 1 END;.
You probably will get better performance with something like the following, in one transaction: UPDATE todos SET state_id = 3 WHERE state_id <> 3 AND late_at <= NOW(); UPDATE todos SET state_id = 2 WHERE state_id <> 2 AND NOW() < late_at AND current_at <= NOW(); UPDATE todos SET state_id = 1 WHERE state_id <> 1 AND NOW() < current_at. This avoids retrieving rows that don't need to be updated, but you'll want indices on "late_at" and "current_at" (you can try indexing "state_id", see note below). You can run these three updates as frequently as you need.
Slight variation of the above is to get the IDs of records first, so you can do something with the todos that have changed states. This looks something like SELECT id FROM todos WHERE state_id <> 3 AND late_at <= NOW() FOR UPDATE. You should then do the update like UPDATE todos SET state_id = 3 WHERE id IN (:ids). Now you've still got the IDs to do something with later (eg email a notification "20 tasks have become overdue").
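In ActiveRecord terms that select-then-update variant might look roughly like this (the model name and mailer are assumptions):

Todo.transaction do
  # FOR UPDATE keeps the selected set stable until the UPDATE commits
  ids = Todo.where('state_id <> 3 AND late_at <= ?', Time.current).lock(true).pluck(:id)
  if ids.any?
    Todo.where(:id => ids).update_all(:state_id => 3)
    OverdueMailer.notify(ids).deliver # hypothetical notification hook
  end
end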
Scheduling or queuing update jobs for each todo (eg update this one to "current" at 10AM and "late" at 11PM) will result in a lot of scheduled jobs, at least two times the number of todos, and poor performance -- each scheduled job is updating only a single record.
You could schedule batch updates like UPDATE todos SET state_id = 2 WHERE id IN (1,2,3,4,5,...) where you've pre-calculated the list of todo IDs that will become current near some specific time. This probably won't work out so nicely in practice, for several reasons; one being that some todos' current_at and late_at fields might change after you've scheduled the updates.
Note: you might not gain much by indexing "state_id" as it only divides your dataset into three sets. This is probably not good enough for a query planner to consider using it in a query like SELECT * FROM todos WHERE state_id = 1.
The key to this problem that you didn't discuss is what happens to completed todos? If you leave them in this todos table, the table will grow indefinitely and your performance will degrade over time. The solution is partitioning the data into two separate tables (like "completed_todos" and "pending_todos"). You can then use UNION to concatenate both tables when you actually need to.
State machines are driven by something: user interaction, or the last input from a stream, right? In this case, time drives the state machine. I think a cron job is the right play; it would be the clock driving the machine.
For what it's worth, it is pretty difficult to set up an efficient index on two columns where you have to do a range query like that.
now > current_at && now < late_at is going to be hard to represent in the database in a performant way as an attribute of the task:
id|title|future_time|current_time|late_time
1|hello|8:00am|9:00am|10:00am
Never try to force patterns into problems. Things are the other way around. So, go directly to find a good solution for it.
Here is an idea (for what I understand yours to be):
Use persistent alerts and one monitored process to "consume" them. Secondarily, query them.
That will allow you to:
- keep it simple
- keep it cheap to maintain (and, secondarily, keep you mentally fresher to do something else)
- keep all the logic in code only (as it should be)
I stress the point of having that process monitored by some kind of watchdog, so you can be sure those alerts are sent in time (or, in a worst-case scenario, with some delay after a crash or the like).
Note that persisting those alerts gives you two things:
- it makes/keeps your system resilient (more fault tolerant), and
- it lets you query future and current items (by playing around with the alerts' time range as best fits your needs)
In my experience, a state machine in SQL is most useful when you have an external process acting on something and updating the database with its state. For example, we have a process that uploads and converts videos. We use the database to keep track of what is happening to a video at any time, and what should happen to it next.
In your case, I think you can (and should) use SQL to solve your problem instead of worrying about using a state machine:
Make a todo_states table:
todo_id  todo_state_id  datetime  notified
1        1 (future)     8:00      0
1        2 (current)    9:00      0
1        3 (late)       10:00     0
Your SQL query, where all the real work happens:
SELECT todo_id, MAX(todo_state_id) AS todo_state_id
FROM todo_states
WHERE datetime <= NOW()
GROUP BY todo_id
The currently active state is always the one you select. If you want to notify the user just once, insert the original state with notified = 0, and bump it on the first select.
Once the task is "done", you can either insert another state into the todo_states table, or simply delete all the states associated with a task and raise a "done" flag in the todo item, or whatever is most useful in your case.
Don't forget to clean out stale states.
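A sketch of the notify-once bump described above (placeholders are illustrative; sanitize inputs in real code):

# flip notified exactly once; the affected-row count says whether
# this select was the first, i.e. whether the notification should fire
affected = ActiveRecord::Base.connection.update(<<-SQL)
  UPDATE todo_states
  SET notified = 1
  WHERE todo_id = #{todo_id.to_i}
    AND todo_state_id = #{state_id.to_i}
    AND notified = 0
SQL
send_notification(todo_id) if affected == 1 # hypothetical helper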

Updating several records at once in rails

In a rails 2 app I'm building, I have a need to update a collection of records with specific attributes. I have a named scope to find the collection, but I have to iterate over each record to update the attributes. Instead of making one query to update several thousand records, I'll have to make several thousand queries.
What I've found so far is something like Model.find_by_sql("UPDATE products ...").
This feels really junior, but I've googled and looked around SO and haven't found my answer.
For clarity, what I have is:
ps = Product.last_day_of_freshness
ps.each { |p| p.update_attributes(:stale => true) }
What I want is:
Product.last_day_of_freshness.update_attributes(:stale => true)
It sounds like you are looking for ActiveRecord::Base.update_all - from the documentation:
Updates all records with details given if they match a set of conditions supplied, limits and order can also be supplied. This method constructs a single SQL UPDATE statement and sends it straight to the database. It does not instantiate the involved models and it does not trigger Active Record callbacks or validations.
Product.last_day_of_freshness.update_all(:stale => true)
Actually, since this is Rails 2.x (you didn't specify), the named_scope chaining may not work; you might need to pass the conditions for your named scope as the second parameter to update_all instead of chaining it onto the end of the Product scope.
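For reference, the Rails 2.x form takes the conditions as a second argument; a sketch with a stand-in condition (the real one lives in your named scope):

# Rails 2.x style: no relation chaining, conditions passed explicitly
Product.update_all(
  { :stale => true },
  ['created_at <= ?', 1.day.ago] # stand-in for last_day_of_freshness's conditions
)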
Have you tried using update_all ?
http://api.rubyonrails.org/classes/ActiveRecord/Relation.html#method-i-update_all
For those who will need to update big amount of records, one million or even more, there is a good way to update records by batches.
product_ids = Product.last_day_of_freshness.pluck(:id)
iterations_size = product_ids.count / 5000
puts "Products to update #{product_ids.count}"
product_ids.each_slice(5000).with_index do |batch_ids, i|
  puts "step #{i} of #{iterations_size}"
  Product.where(id: batch_ids).update_all(stale: true)
end
If your table has a lot of indexes, that also increases the time for such operations, because the indexes need to be updated as well. When I called update_all for all records in a table with about two million records and twelve indexes, the operation didn't finish in over an hour. With this approach it took about 20 minutes in the development environment and about 4 minutes in production; of course, it depends on application settings and server hardware. You can put it in a rake task or some background worker.
Looks like update_all is the best option... though I'll maintain my hacky version in case you're curious:
You can use just plain-ole SQL to do what you want thus:
ps = Product.last_day_of_freshness
ps_ids = ps.map(&:id).join(',') # local var just for readability
Product.connection.execute("UPDATE `products` SET `stale` = TRUE WHERE id IN (#{ps_ids})")
Note that this is db-dependent - you may need to adjust quoting style to suit.

Display a record sequentially with every refresh

I have a Rails 3 application that currently shows a single "random" record with every refresh; however, it repeats records too often, or never shows some records at all. I was wondering what a good way would be to loop through each record and display them such that all get shown before any are repeated. I was thinking of somehow using cookies or session_ids to sequentially loop through the record IDs, but I'm not sure if that would work right, or exactly how to go about it.
The database consists of a single table with a single column, and currently only about 25 entries, but more will be added. ID's are generated automatically and are sequential.
Some suggestions would be appreciated.
Thanks.
The funny thing about 'random' is that it doesn't usually feel random when you get the same answer twice in short succession.
The usual answer to this problem is to generate a queue of responses, and make sure when you add entries to the queue that they aren't already on the queue. This can either be a queue of entries that you will return to the user, or a queue of entries that you have already returned to the user. I like your idea of using the record ids, but with only 25 entries, that repeating loop will also be annoying. :)
You could keep track of the queue of previous entries in memcached if you've already got one deployed or you could stuff the queue into the session (it'll probably just be five or six integers, not too excessive data transfer) or the database.
I think I'd avoid the database, because it sure doesn't need to be persistent, it doesn't need to take database bandwidth or compute time, and using the database just to keep track of five or six integers seems silly. :)
UPDATE:
In one of your controllers (maybe ApplicationController), add something like this to a method that you run in a before_filter:
class ApplicationController < ActionController::Base
  before_filter :find_quip

  def find_quip
    last_quip_id = session[:quip_id] || Quip.first.id
    next_quip = Quip.find_by_id(last_quip_id + 1) || Quip.first
    session[:quip_id] = next_quip.id
  end
end
I'm not so happy with the code that wraps around when you run out of quips; it'll completely screw up if there is ever a hole in the sequence, which is probably going to happen someday. And it will start on number 2. But I'm getting too tired to sort it out. :)
If there are only going to be a couple dozen records, like you say, you could store the entire array of IDs as a session variable, with another variable for the current index, and loop through them sequentially, incrementing the index.
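A sketch of that session-based approach (model name assumed; the list is shuffled once so the order still feels random):

def next_record
  # build the playlist once per session, then walk it in a circular loop
  session[:record_ids] ||= Record.all.map(&:id).shuffle
  session[:record_index] = ((session[:record_index] || -1) + 1) % session[:record_ids].size
  Record.find(session[:record_ids][session[:record_index]])
end

Records added after the session starts won't appear until the stored list is rebuilt, so you may want to expire it periodically.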
