How do I change these producer-consumer microservices to allow parallel processing?

I've got a couple of microservices (implemented in Ruby, although I doubt that is important for my question). One of them provides items, and the other one processes them and then marks them as processed (via a DELETE call).
The provider has a GET /items endpoint which lists a bunch of items, each identified by an id, in JSON format. It also has a DELETE /items/:id endpoint which removes one item from the list (presumably because it has been processed).
The code (very simplified) in the "processor" looks like this:
items = <GET provider/items>
items.each do |item|
  process item
  <DELETE provider/items/#{item.id}>
end
This has several problems, but the one I would like to solve is that it is not thread-safe, and thus I can't run it in parallel. If two workers start processing items simultaneously, they will "step onto each other's toes": they will get the same list of items, and then (try to) process and delete each item twice.
What is the simplest way I can change this setup to allow for parallel processing?
You can assume that I have ruby available. I would prefer keeping changes to a minimum, and would rather not install other gems if possible. Sidekiq is available as a queuing system on the consumer.

Some alternatives (just brainstorming):
Just drop HTTP and use pub-sub with a queue. Have the producer queue items and a number of consumers process them (and trigger state changes, in this case with HTTP if you fancy it). See the sketch after this list.
If you really want to use HTTP, I think there are a couple of missing pieces. If your items' states are pending and processed, there's a hidden/implicit state in your state machine: in_progress (or whatever you call it). Once you think of it, the picture becomes clearer: your GET /items is not idempotent (because it changes the state of items from pending to in_progress) and hence should not be a GET in the first place.
a. an alternative could be adding a new entity (e.g. batch) that gets created via POST and groups some items under it and sends them. Items already returned won't be part of future batches, and then you can mark as done whole batches (e.g. PUT /batches/X/done). This gets crazy very fast, as you will start reimplementing features (acks, timeouts, errors) already present both in queueing systems and plain/explicit (see c) HTTP.
b. a slightly simpler alternative: just turn /items into a POST/PUT endpoint (weird in either case) that marks items as being processed (and doesn't return them anymore, because it only returns pending items). The same issues with errors and timeouts apply, though.
c. have the producer be explicit and request the processing of an item from the other service via PUT. You can either include all the needed data in the body, or use it as a ping and have the processor request the info via GET. You can add asynchronous processing on either side (but probably better in the processor).
I would honestly do 1 (unless there's a compelling reason not to).
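To make option 1 concrete, here is a minimal sketch assuming Sidekiq on the consumer side; the worker class and payload shape are made up for illustration and are not part of the original services:
# Consumer side: one Sidekiq job per item. Any number of Sidekiq threads or
# processes can run these in parallel without stepping on each other's toes.
class ProcessItemJob
  include Sidekiq::Worker

  def perform(item_id, payload)
    process payload
    # If the provider still tracks item state over HTTP, acknowledge here:
    # <DELETE provider/items/#{item_id}>
  end
end

# Producer side: instead of exposing GET /items, push each new item straight
# onto the consumer's queue (both services point at the same Redis).
ProcessItemJob.perform_async(item.id, item.to_h)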

It seems to me that the issue with parallelizing this implementation is that each thread will call:
<GET provider/items>
One solution would be to fetch all the items first and then do the async processing.
My Ruby is non-existent, but it might look something like this:
class HardWorker
  include Sidekiq::Worker

  def perform(item)
    # Note: Sidekiq job arguments must be JSON-serializable, so in practice
    # you would pass item.id (or a plain hash) rather than a full object.
    process item
    <DELETE provider/items/#{item.id}>
  end
end

items = <GET provider/items>
items.each do |item|
  HardWorker.perform_async(item)
end
This way your "producer" is the loop and the consumer is the async HardWorker.

What is the simplest way I can change this setup to allow for parallel processing?
If you can upgrade the code on the server, or add middle-man code, then the simplest way is a queue.
If you prefer just client-side, with no middle-man and no client-to-client talk, and some occasional redundancy is ok, then here are some ideas.
Reduce collisions by using shuffle
If it's ok for your server to receive a DELETE for a non-existent object
And the "process item" cost+time is relatively small
And the process is order-independent
Then you could shuffle the items to reduce collisions:
items.shuffle.each do |item|
  process item
Check that the item exists by using HEAD
If your server has the HEAD method
And has a way to look up one item
And the HTTP connection is cheap+fast compared to "process item"
Then you could skip the item if it doesn't exist:
items.each do |item|
  next if !<HEAD provider/items/#{item.id}>
Refresh the items by using a polling loop
If the items are akin to polling an ongoing pool of work
And are order independent
And the GET request is idempotent, i.e. it's ok to request all the items more than once
And the DELETE request returns a result that informs you the item did not exist
Then you could process items until you hit a redundancy, then refresh the items list:
loop do
  items = <GET provider/items>
  if items.blank?
    sleep 1
    next
  end
  items.each do |item|
    process item
    <DELETE provider/items/#{item.id}>
    break if DELETE returns a code that indicates "already deleted"
  end
end
All of the above combined, using a polling loop, shuffle, and HEAD check.
This is surprisingly efficient, given no queue, nor middle-man, nor client-to-client talk.
There's still a rare redundant "process item" that can happen when multiple clients check that an item exists and then start processing it; in practice the probability is near zero, especially when there are many items.
loop do
  items = <GET provider/items>
  if items.blank?
    sleep 1
    next
  end
  items.shuffle.each do |item|
    break if !<HEAD provider/items/#{item.id}>
    process item
    <DELETE provider/items/#{item.id}>
    break if DELETE returns a code that indicates "already deleted"
  end
end

Related

ActiveRecord and Postgres row locking

API clients in a busy application are competing for existing resources. They request 1 or 2 at a time, then attempt actions upon those records. I am trying to use transactions to protect state, but am having trouble getting a clear picture of row locks, especially where nested transactions (I guess savepoints, since PG doesn't really do transactions within transactions?) are concerned.
The process should look like this:
Request N resources
Remove those resources from the pool to prevent other users from attempting to claim them
Perform action with those resources
Roll back the entire transaction and return resources to pool if an error occurs
(Assume happy path for all examples. Requests always result in products returned.)
One version could look like this:
def self.do_it(request_count)
  Product.transaction do
    locked_products = Product.where(state: 'available').lock('FOR UPDATE').limit(request_count).to_a
    Product.where(id: locked_products.map(&:id)).update_all(state: 'locked')
    do_something(locked_products)
  end
end
It seems to me that we could have a deadlock on that first line if two users request 2 and only 3 are available. So, to get around it, I'd like to do...
def self.do_it(request_count)
  Product.transaction do
    locked_products = []
    request_count.times do
      Product.transaction(requires_new: true) do
        locked_product = Product.where(state: 'available').lock('FOR UPDATE').limit(1).first
        locked_product.update!(state: 'locked')
        locked_products << locked_product
      end
    end
    do_something(locked_products)
  end
end
But from what I've managed to find online, that inner transaction's end will not release the row locks -- they'll only be released when the outermost transaction ends.
Finally, I considered this:
def self.do_it(request_count)
  locked_products = []
  request_count.times do
    Product.transaction do
      locked_product = Product.where(state: 'available').lock('FOR UPDATE').limit(1).first
      locked_product.update!(state: 'locked')
      locked_products << locked_product
    end
  end
  Product.transaction { do_something(locked_products) }
ensure
  evaluate_and_cleanup(locked_products)
end
This gives me two completely independent transactions followed by a third that performs the action, but I am forced to do a manual check (or I could rescue) if do_something fails, which makes things messier. It also could lead to deadlocks if someone were to call do_it from within a transaction, which is very possible.
So my big questions:
Is my understanding of the release of row locks correct? Will row locks within nested transactions only be released when the outermost transaction is closed?
Is there a command that will change the lock type without closing the transaction?
My smaller question:
Is there some established or totally obvious pattern here that's jumping out to someone to handle this more sanely?
As it turns out, it was pretty easy to answer these questions by diving into the PostgreSQL console and playing around with transactions.
To answer the big questions:
Yes, my understanding of row locks was correct. Exclusive locks acquired within savepoints are NOT released when the savepoint is released, they are released when the overall transaction is committed.
No, there is no command to change the lock type. What kind of sorcery would that be? Once you have an exclusive lock, all queries that would touch that row must wait for you to release the lock before they can proceed.
Other than committing the transaction, rolling back the savepoint or the transaction will also release the exclusive lock.
In the case of my app, I solved my problem by using multiple transactions and keeping track of state very carefully within the app. This presented a great opportunity for refactoring and the final version of the code is simpler, clearer, and easier to maintain, though it came at the expense of being a bit more spread out than the "throw-it-all-in-a-PG-transaction" approach.
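For reference, a hedged sketch of what that multi-transaction shape can look like, including the cleanup the last example needed; the method and column names are illustrative, not the original code:
def self.do_it(request_count)
  locked_products = []
  # Claim products one at a time, each in its own short transaction, so the
  # row lock is released as soon as the state flips to 'locked'.
  request_count.times do
    Product.transaction do
      product = Product.where(state: 'available').lock('FOR UPDATE').first
      if product
        product.update!(state: 'locked')
        locked_products << product
      end
    end
  end
  do_something(locked_products)
rescue StandardError
  # If the action fails, return whatever was claimed back to the pool.
  Product.where(id: locked_products.map(&:id)).update_all(state: 'available')
  raise
end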

Sidekiq handling re-queue when processing large data

See the updated question below.
Original question:
In my current Rails project, I need to parse a large XML/CSV data file and save it into MongoDB.
Right now I use these steps:
Receive the uploaded file from the user and store the data in MongoDB
Use Sidekiq to perform async processing of the data in MongoDB.
After processing is finished, delete the raw data.
For small and medium data on localhost, the steps above run well. But on Heroku, I use HireFire to dynamically scale the worker dyno up and down. While the worker is still processing the large data, HireFire sees an empty queue and scales the worker dyno down. This sends a kill signal to the process and leaves the job in an incomplete state.
I'm searching for a better way to do the parsing, one that allows the parsing process to be killed at any time (saving the current state when it receives the kill signal) and allows the job to be re-queued.
Right now I'm using Model.delay.parse_file and it doesn't get re-queued.
UPDATE
After reading the Sidekiq wiki, I found an article about job control. Can anyone explain the code, how it works, and how it preserves its state when it receives the SIGTERM signal and the worker gets re-queued?
Is there any alternative way to handle job termination, save current state, and continue right from the last position?
Thanks,
It might be easier to explain the process and the high-level steps, give a sample implementation (a stripped-down version of one that I use), and then talk about throw and catch:
Insert the raw csv rows with an incrementing index (to be able to resume from a specific row/index later)
Process the CSV stopping every 'chunk' to check if the job is done by checking if Sidekiq::Fetcher.done? returns true
When the fetcher is done?, store the index of the currently processed item on the user and return, so that the job completes and control is returned to Sidekiq.
Note that if a job is still running after a short timeout (default 20s) the job will be killed.
Then when the job runs again, simply start where you left off last time (or at 0).
Example:
class UserCSVImportWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    items = user.raw_csv_items.where(:index => {'$gte' => user.last_csv_index.to_i})

    items.each_with_index do |item, i|
      # Every 100 items, check whether Sidekiq has started shutting down.
      # Note the parentheses: `i + 1 % 100` would parse as `i + (1 % 100)`.
      if ((i + 1) % 100) == 0 && Sidekiq::Fetcher.done?
        user.update(last_csv_index: item.index)
        return
      end

      # Process the item as normal
    end
  end
end
The above class makes sure that every 100 items we check whether the fetcher is done (a proxy for whether shutdown has been started), and if so we end execution of the job. Before the execution ends, however, we update the user with the last index that has been processed so that we can start where we left off next time.
throw/catch is a way to implement the above functionality a little more cleanly (maybe), but it is a bit like using Fibers: a nice concept that is hard to wrap your head around. Technically, throw/catch is more like goto than most people are generally comfortable with.
edit
Alternatively, you could skip the call to Sidekiq::Fetcher.done? and record the last_csv_index on each row, or on each chunk of rows processed; that way, if your worker is killed without having the opportunity to record the last_csv_index, you can still resume 'close' to where you left off.
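A rough sketch of that per-chunk bookkeeping, reusing the hypothetical user and raw_csv_items from the example above:
items.each_slice(100) do |chunk|
  chunk.each do |item|
    # Process the item as normal
  end
  # Persist progress after every chunk, so a hard kill loses at most one
  # chunk of work instead of the whole file.
  user.update(last_csv_index: chunk.last.index)
end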
You are trying to address the concept of idempotency, the idea that processing a thing multiple times with potential incomplete cycles does not cause problems. (https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-jobs-idempotent-and-transactional)
Possible steps forward
Split the file up into parts and process those parts with a job per part (see the sketch after this list).
Lift the threshold for hirefire so that it will scale when jobs are likely to have fully completed (10 minutes)
Don't allow hirefire to scale down while a job is working (set a redis key on start and clear on completion)
Track progress of the job as it is processing and pick up where you left off if the job is killed.
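A rough sketch of the first option, splitting the upload into parts with one job per part; the class, collection, and field names are hypothetical and assume the rows were stored with an incrementing index as in the earlier example:
class UserCSVChunkWorker
  include Sidekiq::Worker

  CHUNK_SIZE = 1_000

  # Enqueue one small job per slice of rows instead of one huge job.
  def self.enqueue_chunks(user)
    total = user.raw_csv_items.count
    (0...total).step(CHUNK_SIZE) do |offset|
      perform_async(user.id.to_s, offset, CHUNK_SIZE)
    end
  end

  def perform(user_id, offset, limit)
    user = User.find(user_id)
    rows = user.raw_csv_items.where(:index => {'$gte' => offset, '$lt' => offset + limit})
    rows.each do |item|
      # Process the item as normal
    end
  end
end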

Resque.. how can I get a list of the queues

OK... On Heroku I can have up to 24 workers (as I understand it).
I have, say, 1000 clients, each with their own "schema" in a PostgreSQL database.
Each client has tasks that can be done "later"; sending orders to my company's back end is a great example.
I was thinking that I could create a new queue for each client, and each queue would have its own worker (process). That, it seems, isn't in the cards.
So OK, my thinking now is to have a queue field in the client record,
so clients 1 through 15 are in queue_a
and clients 16 through 106 are in queue_b, etc. If one client is using heaps, we could move them to a new queue, or move others out of the slow queue. Clients with low volumes could be grouped together. It would be a balancing act, but it wouldn't be all that hard to manage if we kept track of metrics (which we will anyway).
(any counter ideas would be awesome to hear, I'm really in spit ball phase)
Right now, though. I'd like to figure out how to create a worker for each queue.
https://gist.github.com/486161 tells me how to create X workers, but doesn't really let me assign a worker to a queue. If I knew that, and how to get a list of queues, I think I'd be on my way to a viable solution to the limits.
Reading on http://blog.winfieldpeterson.com/2012/02/17/resque-queue-priority/
I realize that my plan is fraught with hardship. The first client/queue added to the worker would get priority. I don't want that; I'd want them all to have the same priority, as long as they are part of the same queue.
I'll just stick to the topic :)
Getting all the queues in Resque is pretty easy:
Resque.queues
is a list of all queue names. It does not include the 'failed' queue, so I did something like this:
(['failed'] + Resque.queues).each do |queue|
  # The failed queue has its own counter; everything else uses Resque.size.
  queue_size = queue == 'failed' ? Resque::Failure.count : Resque.size(queue)
end
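As for binding a worker to a specific queue (the part the question actually asks about), here is a hedged sketch using Resque's standard worker API; in practice each worker is usually its own process started with something like QUEUE=queue_a rake resque:work:
Resque.queues.each do |queue_name|
  fork do
    # A worker only pulls jobs from the queues it was constructed with.
    worker = Resque::Worker.new(queue_name)
    worker.work(5) # poll the queue every 5 seconds
  end
end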

Simulating race conditions in RSpec unit tests

We have an asynchronous task that performs a potentially long-running calculation for an object. The result is then cached on the object. To prevent multiple tasks from repeating the same work, we added locking with an atomic SQL update:
UPDATE objects SET locked = 1 WHERE id = 1234 AND locked = 0
The locking is only for the asynchronous task. The object itself may still be updated by the user. If that happens, any unfinished task for an old version of the object should discard its results as they're likely out-of-date. This is also pretty easy to do with an atomic SQL update:
UPDATE objects SET results = '...' WHERE id = 1234 AND version = 1
If the object has been updated, its version won't match and so the results will be discarded.
These two atomic updates should handle any possible race conditions. The question is how to verify that in unit tests.
The first semaphore is easy to test, as it is simply a matter of setting up two different tests with the two possible scenarios: (1) where the object is locked and (2) where the object is not locked. (We don't need to test the atomicity of the SQL query as that should be the responsibility of the database vendor.)
How does one test the second semaphore? The object needs to be changed by a third party some time after the first semaphore but before the second. This would require a pause in execution so that the update may be reliably and consistently performed, but I know of no support for injecting breakpoints with RSpec. Is there a way to do this? Or is there some other technique I'm overlooking for simulating such race conditions?
You can borrow an idea from electronics manufacturing and put test hooks directly into the production code. Just as a circuit board can be manufactured with special places for test equipment to control and probe the circuit, we can do the same thing with the code.
Suppose we have some code inserting a row into the database:
class TestSubject
  def insert_unless_exists
    if !row_exists?
      insert_row
    end
  end
end
But this code is running on multiple computers. There's a race condition, then, since another process may insert the row between our test and our insert, causing a DuplicateKey exception. We want to test that our code handles the exception that results from that race condition. In order to do that, our test needs to insert the row after the call to row_exists? but before the call to insert_row. So let's add a test hook right there:
class TestSubject
  def insert_unless_exists
    if !row_exists?
      before_insert_row_hook
      insert_row
    end
  end

  def before_insert_row_hook
  end
end
When run in the wild, the hook does nothing except eat up a tiny bit of CPU time. But when the code is being tested for the race condition, the test monkey-patches before_insert_row_hook:
class TestSubject
  def before_insert_row_hook
    insert_row
  end
end
Isn't that sly? Like a parasitic wasp larva that has hijacked the body of an unsuspecting caterpillar, the test hijacked the code under test so that it will create the exact condition we need tested.
This idea is as simple as the XOR cursor, so I suspect many programmers have independently invented it. I have found it to be generally useful for testing code with race conditions. I hope it helps.
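In RSpec, the monkey-patch can also be expressed as a stub on the hook. A minimal sketch, assuming the TestSubject class above and a hypothetical helper that plays the part of the competing process:
RSpec.describe TestSubject do
  it "copes with a row inserted by another process mid-way through" do
    inserter = TestSubject.new

    # When the hook fires (after row_exists?, before insert_row), sneak the
    # row in first, exactly as the competing process would.
    allow(inserter).to receive(:before_insert_row_hook) do
      insert_row_from_elsewhere # hypothetical helper for this sketch
    end

    # Whatever behaviour you expect from the duplicate-key path goes here,
    # e.g. that the error is rescued rather than raised.
    expect { inserter.insert_unless_exists }.not_to raise_error
  end
end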

Need alternative to filters/observers for Ruby on Rails project

Rails has a nice set of filters (before_validation, before_create, after_save, etc) as well as support for observers, but I'm faced with a situation in which relying on a filter or observer is far too computationally expensive. I need an alternative.
The problem: I'm logging web server hits to a large number of pages. What I need is a trigger that will perform an action (say, send an email) when a given page has been viewed more than X times. Due to the huge number of pages and hits, using a filter or observer will result in a lot of wasted time because, 99% of the time, the condition it tests will be false. The email does not have to be sent out right away (i.e. a 5-10 minute delay is acceptable).
What I am instead considering is implementing some kind of process that sweeps the database every 5 minutes or so and checks to see which pages have been hit more than X times, recording that state in a new DB table, then sending out a corresponding email. It's not exactly elegant, but it will work.
Does anyone else have a better idea?
Rake tasks are nice! But you will end up writing more custom code for each background job you add. Check out the Delayed Job plugin http://blog.leetsoft.com/2008/2/17/delayed-job-dj
DJ is an asynchronous priority queue that relies on one simple database table. According to the DJ website, you can create a job using the Delayed::Job.enqueue() method shown below.
class NewsletterJob < Struct.new(:text, :emails)
  def perform
    emails.each { |e| NewsletterMailer.deliver_text_to_email(text, e) }
  end
end

Delayed::Job.enqueue( NewsletterJob.new("blah blah", Customers.find(:all).collect(&:email)) )
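Applied to the page-hit threshold in the question, a DJ job might look roughly like this; Page, AlertMailer, the alerted flag, and the threshold are all placeholders for illustration:
class PageAlertJob < Struct.new(:threshold)
  def perform
    # Hypothetical Page model with a hits counter and an "alerted" flag so we
    # only email once per page.
    Page.where("hits_count >= ? AND alerted = ?", threshold, false).find_each do |page|
      AlertMailer.deliver_threshold_reached(page) # hypothetical mailer
      page.update(alerted: true)
    end
  end
end

# Enqueue it from a recurring sweep, e.g. every 5 minutes:
Delayed::Job.enqueue(PageAlertJob.new(1000))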
I was once part of a team that wrote a custom ad server, which has the same requirements: monitor the number of hits per document, and do something once they reach a certain threshold. This server was going to be powering an existing very large site with a lot of traffic, and scalability was a real concern. My company hired two Doubleclick consultants to pick their brains.
Their opinion was: The fastest way to persist any information is to write it in a custom Apache log directive. So we built a site where every time someone would hit a document (ad, page, all the same), the server that handled the request would write a SQL statement to the log: "INSERT INTO impressions (timestamp, page, ip, etc) VALUES (x, 'path/to/doc', y, etc);" -- all output dynamically with data from the webserver. Every 5 minutes, we would gather these files from the web servers, and then dump them all in the master database one at a time. Then, at our leisure, we could parse that data to do anything we well pleased with it.
Depending on your exact requirements and deployment setup, you could do something similar. The computational requirement to check if you're past a certain threshold is still probably even smaller (guessing here) than executing the SQL to increment a value or insert a row. You could get rid of both bits of overhead by logging hits (special format or not), and then periodically gather them, parse them, input them to the database, and do whatever you want with them.
When saving your Hit model, update a redundant column in your Page model that stores a running total of hits. This costs you two extra queries, so maybe each hit takes twice as long to process, but you can then decide whether you need to send the email with a simple if.
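A minimal sketch of that counter-column idea, assuming a hits_count column on Page and an arbitrary threshold; the mailer call is a placeholder:
class Hit < ActiveRecord::Base
  belongs_to :page

  after_create :bump_page_counter

  private

  def bump_page_counter
    # Query 1: UPDATE pages SET hits_count = hits_count + 1 WHERE id = ?
    Page.increment_counter(:hits_count, page_id)
    # Query 2: re-read the total; the simple "if" decides about the email.
    page.reload
    AlertMailer.deliver_threshold_reached(page) if page.hits_count == 1000
  end
end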
Your original solution isn't bad either.
class ApplicationController < ActionController::Base
  before_filter :increment_fancy_counter

  private

  def increment_fancy_counter
    # somehow increment the counter here
  end
end

# lib/tasks/fancy_counter.rake
namespace :fancy_counter do
  task :process do
    # somehow process the counter here
  end
end
Have a cron job run rake fancy_counter:process however often you want it to run.
