Locking rows on Rails update to avoid collisions (Postgres back end) - ruby-on-rails

So I have a method on my model object which creates a unique sequence number when a binary field in the row is updated from null to true. It's implemented like this:
class AnswerHeader < ApplicationRecord
  before_save :update_survey_complete_sequence, if: :survey_complete_changed?

  def update_survey_complete_sequence
    maxval = AnswerHeader.maximum('survey_complete_sequence')
    self.survey_complete_sequence = maxval + 1
  end
end
My question is: what do I need to lock so that two rows being updated at the same time don't end up with the same survey_complete_sequence?
If it is possible to lock a single row rather than the whole table, that would be good, because this is a table that users access often.

If you want to handle this in the application logic itself, instead of letting the database handle it, you can make use of Rails' with_lock method, which creates a transaction and acquires a row-level DB lock on the selected rows (in your case a single row).

What you need to lock
In your case you have to lock the row containing the maximum survey_complete_sequence, since this is the row every query will look at while getting the value you require:
maxval = AnswerHeader.maximum('survey_complete_sequence')
Is it possible to lock a single row rather than the whole table?
There is no such specific lock for your scenario, but you can make use of PostgreSQL's SELECT FOR UPDATE row-level locking.
To acquire an exclusive row-level lock on a row without actually
modifying the row, select the row with SELECT FOR UPDATE.
And you can use pessimistic locking in Rails and specify which lock you will use.
Call lock('some locking clause') to use a database-specific locking
clause of your own such as 'LOCK IN SHARE MODE' or 'FOR UPDATE NOWAIT'
Here's an example of how to achieve that, taken from the official Rails guide itself:
Item.transaction do
  i = Item.lock("LOCK IN SHARE MODE").find(1)
  ...
end
Relations using lock are usually wrapped inside a transaction to prevent deadlock conditions.
So what you need to do is:
1. Apply a SELECT FOR UPDATE lock to the row containing maximum('survey_complete_sequence')
2. Get the value you require from that row
3. Update your AnswerHeader with the value received
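A rough sketch of those three steps folded into the original callback (model and column names follow the question; note that before_save already runs inside the transaction that wraps the save, so the row lock is held until the save commits; if no completed row exists yet, there is nothing to lock and two first completions could still collide):
def update_survey_complete_sequence
  # Step 1: lock the row currently holding the highest sequence so a
  # concurrent save cannot read the same maximum.
  current_max_row = AnswerHeader.where.not(survey_complete_sequence: nil)
                                .order(survey_complete_sequence: :desc)
                                .lock("FOR UPDATE")
                                .first
  # Step 2: read the value from the locked row, defaulting to 0.
  maxval = current_max_row ? current_max_row.survey_complete_sequence : 0
  # Step 3: assign the next value; it is written when the save commits.
  self.survey_complete_sequence = maxval + 1
end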

I believe you should give advisory locks a look. They make sure the same block of code isn't executed on two machines simultaneously, while still keeping the table open for other business.
They use the database, but they don't lock your tables.
You can use the gem called "with_advisory_lock" like this:
Model.with_advisory_lock("ADVISORY_LOCK_NAME") do
  # Your code
end
https://github.com/ClosureTree/with_advisory_lock
It doesn't work with SQLite.
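Applied to the question above, a rough sketch could look like this (the lock name is arbitrary; by default the block waits until the lock is free):
class AnswerHeader < ApplicationRecord
  before_save :update_survey_complete_sequence, if: :survey_complete_changed?

  def update_survey_complete_sequence
    # Only one process or thread can hold this named lock at a time, so the
    # read-maximum-then-increment below cannot interleave across workers.
    AnswerHeader.with_advisory_lock("answer_header_sequence") do
      maxval = AnswerHeader.maximum('survey_complete_sequence') || 0
      self.survey_complete_sequence = maxval + 1
    end
  end
end
One caveat: the lock is released when the block returns, which in a before_save is slightly before the surrounding transaction commits, so a concurrent save could in principle still read a not-yet-committed maximum; holding the lock for the whole transaction (the gem documents a transaction-level option for Postgres, if I recall correctly) closes that window.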

If you are using Postgres, maybe Sequenced can help you out without defining a sequence at the DB level.
Is there a reason survey_complete_sequence should be incremental? If not, maybe randomize a bigint?

You probably don't want to lock the table, and even if you lock the row you're currently updating, the row you're basing your maxval on will still be readable by another update, which can then generate the same sequence number.
Unless you have a huge table and lots of updates every millisecond (on the order of thousands), this shouldn't be an issue in real life. But if the idea bothers you, you can go ahead and add a unique index to the table on the "survey_complete_sequence" column. The DB error will propagate to a Rails exception you can deal with within the application.
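A rough sketch of that fallback, assuming an answer_headers table (the migration version and the retry count are illustrative; in Postgres a unique index still allows multiple NULLs, so rows without a sequence are unaffected):
class AddUniqueIndexOnSurveyCompleteSequence < ActiveRecord::Migration[5.2]
  def change
    add_index :answer_headers, :survey_complete_sequence, unique: true
  end
end

# When completing a survey:
attempts = 0
begin
  answer_header.save!  # before_save recomputes the sequence on each attempt
rescue ActiveRecord::RecordNotUnique
  # Two saves raced to the same sequence number; retry and recompute.
  attempts += 1
  retry if attempts < 3
  raise
end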

Related

How can I add a number column that tracks deletions?

Is there a gem or some database logic which I can use to add a number column to my database that tracks adds and deletes?
For example, GitHub has issues. An issue has a database ID. But it also has a number which is like a human readable identifier. If an issue is deleted, the number continues to increase. And repo A can have an issue with a number, and that doesn’t conflict with repo B.
Create a new table, or add a column to the table, like deleteCount. Every time you call the delete function or method, just add a line that increments deleteCount (like deleteCount++) in the success block. The same goes for the add method.
I believe you are overthinking it. The id column in Rails (which all models possess by default) works the way you are thinking.
If you want a more human readable number as well, I would look at this gem (it has a ton of potential uses):
https://github.com/norman/friendly_id
Edit:
Looks like you might actually be looking for, basically, the number of children of a parent. The logic looks like this:
1. When a parent is created, a child_count column is set to 0
2. Whenever a child is created for that parent, it increments the parent's count and saves the result (this must be done atomically to avoid problems), which returns the current child_count
3. Set that child_count as the child's parent_child_id
The key tricky bit is that #2 has to be done atomically to avoid problems. So lock the row, update the column, then unlock the row.
Code roughly looks like:
# In Child model
after_commit :add_parent_child_id, on: :create  # only on create, so the update! below doesn't re-trigger this callback

def add_parent_child_id
  parent.with_lock do
    new_child_count = parent.child_count + 1
    parent.child_count = new_child_count
    parent.save!
    self.update!(parent_child_id: new_child_count)
  end
end

How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand) with the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
  inserts = []
  User.find_each do |user|
    tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
    inserts << tuple
  end

  # At this point, the inserts array is of size at least 20,000
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
    conn.execute(sql)
  end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!
Depends on which PG version you are using, but in most cases of bulk loading data into a table this checklist is enough:
- try to use COPY instead of INSERT whenever possible;
- if using multiple INSERTs, disable autocommit and wrap all INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
- disable indexes and checks/constraints on the target table;
- disable table triggers;
- alter the table so it becomes unlogged (since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded.
20k rows is not such a big deal for PG, so 2k-sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
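Applied to the task in the question, the single-transaction advice could look roughly like this (a sketch; COPY would still be faster for a genuinely large import):
ActiveRecord::Base.transaction do
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    # One multi-row INSERT per slice, all committed together at the end.
    conn.execute(
      "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) " \
      "VALUES #{slice.join(', ')}"
    )
  end
end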
UPD: also, a little bit old yet still wonderful piece from depesz; excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.

Is it possible in Ruby to set a specific Active Record call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products I have with the flag set to false, I can do a call something like this:
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is being written to. More specifically, exports are happening, and the value I'm interested in counting is ever changing.
So I'd like to do a dirty read on the table... not a dirty read always, and I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention Ruby 1.9.3, Heroku and PostgreSQL.
Now, if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT, one last thing: this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation.
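A hedged sketch of that statistics-based estimate (the products table name is assumed from the question; reltuples is only an approximation maintained by VACUUM/ANALYZE, and it covers the whole table rather than just the exported = false subset):
approx_total = ActiveRecord::Base.connection.select_value(
  "SELECT reltuples::bigint FROM pg_class WHERE relname = 'products'"
)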
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
As soon as a query begins to execute it's against a frozen read-only state because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run, it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
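For that last point, a rough sketch of the index migration (table and column names follow the question; the where: option needs Rails 4+ and Postgres, and creates a partial index covering only the rows actually being counted):
class AddExportedIndexToProducts < ActiveRecord::Migration
  def change
    # Partial index: only rows with exported = false, i.e. the set being counted.
    add_index :products, :exported, where: "exported = false"
  end
end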

How do I avoid a simple race condition in Rails?

I have a simple race condition. I have a website where people can vote on photos, but a maximum of 10 votes is allowed.
When a user submits a vote, I update a num_votes column in the photos table for that specific photo. I do this for easy lookup for the number of votes.
How can I make sure that the vote.save and the num_votes update happen in the same transaction?
Thanks!
In order to achieve this you have to use some kind of locking. Basically you have 3 options: optimistic/pessimistic Rails locking, or some external locking backend (like Redis::Lock).
I personally would go for pessimistic locking if high performance is not really a concern here:
photo = Photo.find(photo_id)
photo.with_lock do
  photo.num_votes += 1
  photo.save!
end
I should also point out that merely wrapping the incrementing of num_votes and the save in one transaction would not solve the race condition. Most RDBMSs work in read committed mode by default, which doesn't prevent such a race condition.
FYI, see the Pessimistic and Optimistic Locking references.
If it is a simple race condition then you should resolve it as a race condition.
Try using some locking mechanism. Redis is good to go:
redis locking for ruby
RedisLocker.new("vote_#{@photo.id}").run! { @photo.vote }

# ... photo model
def vote
  if num_votes < 10  # allow at most 10 votes
    self.num_votes += 1
    save
  end
end
Well, Rails/Postgres supports transactions. You can simply declare one, on any ActiveRecord model:
Photo.transaction do
  Vote.create(:whatever)
  photo.votes = thing
  photo.save!
end
If an exception is raised during the transaction block (say, by calling .save! on an invalid model), the transaction is rolled back and any database changes that would have happened in there aren't committed (in this case, the Vote record doesn't get inserted). You'll still need to rescue and handle the exception, of course.
Incidentally, storing number of associated objects in a record for easy lookup is a pretty common pattern, known as a counter cache, and Rails supports those as well - you might want to look into formally making num_votes a counter cache (the default name would be photos.votes_count, but it's not required). You might still want a transaction to check that it doesn't exceed the limit, though.
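A rough sketch of the counter cache version (assumes a votes_count integer column on photos and a belongs_to association on Vote):
class Vote < ApplicationRecord
  # Rails keeps photos.votes_count in sync with atomic SQL increments as
  # votes are created and destroyed, so no manual num_votes bookkeeping is needed.
  belongs_to :photo, counter_cache: true
end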
You don't need explicit locks for this
Photo.where(:id => photo_id).where('num_votes < 10').update_all('num_votes = num_votes + 1')
will update the number of votes for that photo, but only if there are less than 10 votes. You can check the return value of update_all to see if anything was actually updated: the return value is the number of updated rows. If the update fails then don't create the vote (or if you have already created the vote, rollback the transaction).
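A short sketch of acting on that return value (names follow the answer; the vote creation is illustrative):
updated = Photo.where(id: photo_id)
               .where('num_votes < 10')
               .update_all('num_votes = num_votes + 1')

if updated == 1
  Vote.create!(photo_id: photo_id)  # the vote counted
else
  # already at 10 votes (or no such photo), so don't record the vote
end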
Optimistic locking uses a similar technique to detect attempts at concurrent updates: it places a condition on the update that ensures that nothing will happen if someone has snuck in there before you and then checks the number of updated rows.

Multiple worker threads working on the same database - how to make it work properly?

I have a database that has a list of rows that need to be operated on. It looks something like this:
id    remaining    delivered    locked
============================================
1     10           24           f
2     6            0            f
3     0            14           f
I am using DataMapper with Ruby, but really I think this is a general programming question that isn't specific to the exact implementation I'm using...
I am creating a bunch of worker threads that do something like this (pseudo-ruby-code):
while true do
  t = any_row_in_database_where_remaining_greater_than_zero_and_unlocked
  t.lock           # update database to set locked = true
  t.do_some_stuff
  t.delivered += 1
  t.remaining -= 1
  t.unlock
end
Of course, the problem is, these threads compete with each other and the whole thing isn't really thread safe. The first line in the while loop can easily pull out the same row in multiple threads before they get a chance to get locked.
I need to make sure only one thread is working on a given row at any time.
What is the best way to do this?
The key step is when you select an unlocked row from the database and mark it as locked. If you can do that safely then everything else will be fine.
2 ways I know of that can make this safe are pessimistic and optimistic locking. They both rely on your database as the ultimate guarantor when it comes to concurrency.
Pessimistic Locking
Pessimistic locking means acquiring a lock upfront when you select the rows you want to work with, so that no one else can claim them in the meantime.
Something like
SELECT * from some_table WHERE ... FOR UPDATE
works with MySQL and Postgres (and possibly others) and will prevent any other connection to the database from locking or modifying the rows returned to you (how granular that lock is depends on the engine used, indexes etc. - check your database's documentation). It's called pessimistic because you assume that a concurrency problem will occur and acquire the lock preventatively. It does mean that you bear the cost of locking even when it isn't necessary, and it may reduce your concurrency depending on the granularity of the lock you have.
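A rough sketch of claiming a row this way, shown with ActiveRecord (which, as noted below, supports these locks; a hypothetical Task model stands in for the table above):
task = Task.transaction do
  # FOR UPDATE blocks other transactions from selecting this row FOR UPDATE
  # (or updating it) until we commit.
  row = Task.where("remaining > 0 AND locked = false").lock("FOR UPDATE").first
  row.update!(locked: true) if row
  row
end
# Work on `task` outside the lock, then clear `locked` when finished.
If two workers collide here, one of them may simply get nil back and can loop around to look for another row.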
Optimistic Locking
Optimistic locking refers to a technique where you don't want the burden of a pessimistic lock because most of the time there won't be concurrent updates (if you update the row setting the locked flag to true as soon as you have read the row, the window is relatively small). AFAIK this only works when updating one row at a time.
First add an integer column lock_version to the table. Whenever you update the table, increment lock_version by 1 alongside the other updates you are making. Assume the current lock_version is 3. When you update, change the update query to
update some_table set ... where id=12345 and lock_version = 3
and check the number of rows updated (the db driver returns this). If this updates 1 row then you know everything was OK. If this updates 0 rows then either the row you wanted was deleted or its lock version has changed, so you go back to step 1 in your process and search for a new row to work on.
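The same claim step with the optimistic approach, as a sketch (assumes the lock_version column described above and the same hypothetical Task model):
# `task` was read without any lock; claim it only if nobody has touched it since.
claimed = Task.where(id: task.id, lock_version: task.lock_version)
              .update_all("locked = true, lock_version = lock_version + 1")

if claimed == 1
  # The row is ours: do the work, then clear `locked`.
else
  # Someone else updated (or deleted) it first: go pick another row to work on.
end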
I'm not a DataMapper user, so I don't know whether it (or plugins for it) provides support for these approaches. Active Record supports both, so you can look there for inspiration if DataMapper doesn't.
I would use a Mutex:
# outside your threads
worker_updater = Mutex.new

# inside each thread's updater
while true
  worker_updater.synchronize do
    # your code here
  end
  sleep 0.1 # Slow down there, mister!
end
This guarantees that only one thread at a time can enter the code in the synchronize. For optimal performance, consider what portion of your code needs to be thread-safe (first two lines?) and only wrap that portion in the Mutex.
