Parallelizing methods in Rails - ruby-on-rails

My Rails web app has dozens of methods that make calls to an API and process the query results. These methods have the following structure:
def method_one
batch_query_API
process_data
end
..........
def method_nth
batch_query_API
process_data
end
def summary
method_one
......
method_nth
collect_results
end
How can I run all query methods at the same time instead of sequentially in Rails (without firing up multiple workers, of course)?
Edit: all of the methods are called from a single instance variable. I think this limits the use of Sidekiq or Delay in submitting jobs simultaneously.

Ruby has the excellent promise gem. Your example would look like:
require 'future'
def method_one
...
def method_nth
def summary
result1 = future { method_one }
......
resultn = future { method_nth }
collect_results result1, ..., resultn
end
Simple, isn't it? But let's get to more details. This is a future object:
result1 = future { method_one }
This means result1 is being evaluated in the background. You can pass it around to other methods, much like passing around a Thread, even though it does not hold a result yet. The major difference is that the moment you try to read it, rather than merely pass it along, it blocks and waits for the result. So in the example above, result1 through resultn all keep being evaluated in the background, and when the time comes to collect the results and you actually read these values, the reads wait for the queries to finish at that point.
Install the promise gem and try the below in Ruby console:
require 'future'
x = future { sleep 20; puts 'x calculated'; 10 }; nil
# adding a nil to the end so that x is not immediately tried to print in the console
y = future { sleep 25; puts 'y calculated'; 20 }; nil
# At this point, you'll still be using the console!
# The sleeps are happening in the background
# Now do:
x + y
# At this point, the program actually waits for the x & y future blocks to complete
Edit: Typo in result, should have been result1, change echo to puts
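For intuition, the blocking-on-read behaviour described above can be approximated with the standard library alone. This is only a sketch and not how the promise gem is implemented; SimpleFuture and slow_lookup are hypothetical names used for illustration.

```ruby
# A minimal future: start the work in a background thread right away,
# and block only when the result is actually read.
class SimpleFuture
  def initialize(&block)
    @thread = Thread.new(&block)
  end

  # Thread#value waits for the thread to finish and returns the block's result.
  def value
    @thread.value
  end
end

# Hypothetical stand-in for a slow API call.
def slow_lookup(n)
  sleep 0.2
  n * 10
end

x = SimpleFuture.new { slow_lookup(1) }
y = SimpleFuture.new { slow_lookup(2) }
# Both lookups are already running here; nothing has blocked yet.
sum = x.value + y.value # blocks until both threads have finished
puts sum
```

Unlike the gem's future, this wrapper needs an explicit call to value; the gem's object stands in for the result itself.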

You can take a look at a new option in town: the futuroscope gem.
As you can see from the announcement blog post, it tries to solve the same problem you are facing: making simultaneous API queries. It seems to have pretty good support and good test coverage.

Assuming that your problem is a slow external API, a solution could be the use of either threaded programming or asynchronous programming. By default when doing IO, your code will block. This basically means that if you have a method that does an HTTP request to retrieve some JSON your method will tell your operating system that you're going to sleep and you don't want to be woken up until the operating system has a response to that request. Since that can take several seconds, your application will just idly have to wait.
This behavior is not specific to just HTTP requests. Reading from a file or a device such as a webcam has the same implications. Software does this to avoid hogging the CPU when it obviously has no use for it.
So the question in your case is: Do we really have to wait for one method to finish before we can call another? In the event that the behavior of method_two is dependent on the outcome of method_one, then yes. But in your case, it seems that they are individual units of work without co-dependence. So there is potential for concurrent execution.
You can start new threads by initializing an instance of the Thread class with a block that contains the code you'd like to run. Think of a thread as a program inside your program. Your Ruby interpreter will automatically alternate between the thread and your main program. You can start as many threads as you'd like, but the more threads you create, the longer your main program will have to wait for its turn before resuming execution. However, we are probably talking microseconds or less. Let's look at an example of threaded execution.
def main_method
Thread.new { method_one }
Thread.new { method_two }
Thread.new { method_three }
end
def method_one
# something_slow_that_does_an_http_request
end
def method_two
# something_slow_that_does_an_http_request
end
def method_three
# something_slow_that_does_an_http_request
end
Calling main_method will cause all three methods to be executed in what appears to be parallel. In reality they are still being processed sequentially, but instead of going to sleep when method_one blocks, Ruby will just return to the main thread and switch back to the method_one thread when the OS has the input ready.
Assuming each method takes 2 ms to execute, not counting the wait for the response, all three methods are running after just 6 ms, which is practically instant.
If we assume that a response takes 500 ms to complete, that means you can cut down your total execution time from 2 + 500 + 2 + 500 + 2 + 500 to just 2 + 2 + 2 + 500, in other words from 1506 ms to just 506 ms.
It will feel like the methods are running simultaneously, but in fact they are just sleeping simultaneously.
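The arithmetic above can be checked with a rough timing sketch. Here fake_request is a hypothetical stand-in for an HTTP call, with sleep simulating the wait for the response; the absolute numbers will vary by machine.

```ruby
# Hypothetical stand-in for an HTTP request: sleep simulates waiting on the network.
def fake_request(id)
  sleep 0.2
  "response #{id}"
end

# Returns [result, seconds elapsed] for the given block.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# One request after another: the waits add up.
def run_sequentially
  (1..3).map { |id| fake_request(id) }
end

# One thread per request: the waits overlap.
def run_in_threads
  (1..3).map { |id| Thread.new { fake_request(id) } }.map(&:value)
end

_, sequential_time = timed { run_sequentially }
_, threaded_time = timed { run_in_threads }

puts format("sequential: %.2fs, threaded: %.2fs", sequential_time, threaded_time)
```

The threaded run should take about as long as one request rather than three, because the threads spend their time waiting at the same moment.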
In your case, however, you have a challenge because you have an operation that depends on the completion of a set of previous operations. In other words, if you have tasks A, B, C, D, E and F, then A, B, C, D and E can be performed simultaneously, but F cannot be performed until A, B, C, D and E are all complete.
There are different ways to solve this. Let's look at a simple solution: creating a sleepy loop in the main thread that periodically examines a list of return values to make sure some condition is fulfilled.
def task_1
# Something slow
return results
end
def task_2
# Something slow
return results
end
def task_3
# Something slow
return results
end
my_responses = {}
Thread.new { my_responses[:result_1] = task_1 }
Thread.new { my_responses[:result_2] = task_2 }
Thread.new { my_responses[:result_3] = task_3 }
while (my_responses.count < 3) # Prevents the main thread from continuing until the three spawned threads are done and have stored their results in the hash.
sleep(0.1) # Makes the main thread sleep for 100 ms between checks. Without it, you would check the response count thousands of times per second, which is most likely unnecessary.
end
# Any code at this line will not execute until all three results are collected.
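As an alternative to the polling loop, the main thread can wait on the threads themselves. The sketch below is a variation on the example above, not the answer's original code; the task bodies are hypothetical placeholders in which sleep stands in for the slow part.

```ruby
# Hypothetical slow tasks.
def task_1
  sleep 0.1
  :first
end

def task_2
  sleep 0.2
  :second
end

def task_3
  sleep 0.1
  :third
end

def collect_responses
  threads = {
    result_1: Thread.new { task_1 },
    result_2: Thread.new { task_2 },
    result_3: Thread.new { task_3 }
  }
  # Thread#value blocks until that thread is done and returns its result,
  # so no sleepy loop and no shared hash written from several threads.
  threads.transform_values(&:value)
end

my_responses = collect_responses
puts my_responses.inspect
```

Any code after collect_responses returns runs only once all three results are available.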
Keep in mind that multithreaded programming is a tricky subject with numerous pitfalls. With MRI it's not so bad, because while MRI will happily switch between blocked threads, MRI doesn't support executing two threads simultaneously, and that solves quite a few concurrency concerns.
If you want to get into multithreaded programming, I recommend this book:
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601
It's centered around Java, but the pitfalls and concepts explained are universal.

You should check out Sidekiq.
RailsCasts episode about Sidekiq.

Related

Rails and threading

I'm trying to make a Rails 4 application, which makes a lot of HTTP requests to some API, handle more traffic. Originally the code in the controller looked like this:
def index
@var1 = api_call some_params1
@var2 = api_call some_params2
@var3 = api_call some_params3
@var4 = api_call some_params4
@var5 = api_call some_params5
end
I did some googling around and ended up refactoring it as so:
def index
@var1 = Thread.new { api_call some_params1 }.value
@var2 = Thread.new { api_call some_params2 }.value
@var3 = Thread.new { api_call some_params3 }.value
@var4 = Thread.new { api_call some_params4 }.value
@var5 = Thread.new { api_call some_params5 }.value
end
Am I doing this right? Or am I instead supposed to call join on those threads somewhere?
Is this safe for production or is there something I should be tweaking, maybe in the Nginx or passenger configs?
Am I doing this right?
There are no issues in your code but I don't think that using threads makes a lot of sense in your code example since you're executing requests one after another anyway.
If you want to make parallel requests then you should do it like this instead:
threads = [params1, params2, ...].map { |p| Thread.new { api_call(p) } }
values = threads.map(&:value)
Am I doing this right? Or am I instead supposed to call join on those threads somewhere?
Both join and value calls will wait for a thread to finish but value is more convenient for you there if you want to retrieve a value returned from a thread. value is using join under the hood.
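The difference between the two calls can be seen in a small sketch using only the standard library; join_vs_value is a hypothetical helper name chosen for illustration.

```ruby
# Returns what each call gives back for the same worker thread.
def join_vs_value
  worker = Thread.new do
    sleep 0.2
    42
  end

  timed_out = worker.join(0.01) # nil: the 10 ms limit ran out before the thread finished
  joined    = worker.join       # waits, then returns the Thread object itself
  answer    = worker.value      # waits if necessary, then returns the block's result

  [timed_out, joined.equal?(worker), answer]
end

puts join_vs_value.inspect
```

So join is the call to reach for when only completion matters (optionally with a time limit), while value is the one that hands back the result.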
Is this safe for production or is there something I should be tweaking, maybe in the Nginx or passenger configs?
You don't need to tweak anything to use threads, and it is generally safe to use them in production. If you're using MRI, the GIL ensures that only one thread runs Ruby code at a time, which protects the interpreter's internals, though it does not by itself rule out race conditions or deadlocks in your own code. Just be aware that if you're using a lot of threads, you'll be using a lot of extra memory. And using threads doesn't always improve the performance of a program. For example, due to the GIL there is not much point in using threads for executing CPU-intensive code, even on a multicore machine.

Is there something similar to JS 'Promise.all()' in Ruby?

Below is a code that should be optimized:
def statistics
blogs = Blog.where(id: params[:ids])
results = blogs.map do |blog|
{
id: blog.id,
comment_count: blog.blog_comments.select("DISTINCT user_id").count
}
end
render json: results.to_json
end
Each SQL query costs around 200 ms. If I have 10 blog posts, this function takes 2 s because the queries run one after another. I could use GROUP BY to optimize the query, but I am putting that aside for now because the task could just as well be a third-party request, and I am interested in how Ruby deals with async.
In JavaScript, when I want to dispatch multiple asynchronous jobs and wait for all of them to resolve, I can use Promise.all(). I wonder what the alternatives are in Ruby for solving this problem.
Do I need a thread for this case? And is it safe to do that in Ruby?
There are multiple ways to solve this in ruby, including promises (enabled by gems).
JavaScript accomplishes asynchronous execution using an event loop and event driven I/O. There are event libraries to accomplish the same thing in ruby. One of the most popular is eventmachine.
As you mentioned, threads can also solve this problem. Thread-safety is a big topic and is further complicated by different thread models in different flavors of ruby (MRI, JRuby, etc). In summary I'll just say that of course threads can be used safely... there are just times when that is difficult. However, when used with blocking I/O (like to an API or a database request) threads can be very useful and fairly straight-forward. A solution with threads might look something like this:
# run blocking IO requests simultaneously
thread_pool = [
Thread.new { execute_sql_1 },
Thread.new { execute_sql_2 },
Thread.new { execute_sql_3 },
# ...
]
# wait for the slowest one to finish
thread_pool.each(&:join)
You also have access to other concurrency models, like the actor model, async classes, promises, and others enabled by gems like concurrent-ruby.
Finally, ruby concurrency can take the form of multiple processes communicating through built in mechanisms (drb, sockets, etc) or through distributed message brokers (redis, rabbitmq, etc).
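To connect this back to the question, a rough Promise.all-style helper can be sketched with plain threads. The name await_all is hypothetical and the lambdas are placeholders for blocking queries. One difference from JavaScript worth noting: this version does not reject early, since it waits on the threads in input order and re-raises the first failure it reaches.

```ruby
# Runs every callable in its own thread and returns the results in input order.
# If a callable raises, Thread#value re-raises that error in the caller.
def await_all(*jobs)
  threads = jobs.map do |job|
    Thread.new do
      Thread.current.report_on_exception = false # let errors surface via #value only
      job.call
    end
  end
  threads.map(&:value)
end

counts = await_all(
  -> { sleep 0.2; 3 },
  -> { sleep 0.1; 5 }
)
puts counts.inspect # [3, 5]: input order, not completion order

begin
  await_all(-> { 1 }, -> { raise IOError, "query failed" })
rescue IOError => e
  puts "rejected: #{e.message}"
end
```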
Sure, just do the count in one database call:
blogs = Blog
.select('blogs.id, COUNT(DISTINCT blog_comments.user_id) AS comment_count')
.joins('LEFT JOIN blog_comments ON blog_comments.blog_id = blogs.id')
.where(id: params[:ids])
.group('blogs.id')
results = blogs.map do |blog|
{ id: blog.id, comment_count: blog.comment_count }
end
render json: results.to_json
You might need to change the statements depending on how your tables are named in the database, because I just guessed from the names of your associations.
Okay, generalizing a bit:
You have a list of data and want to operate on that data asynchronously. Assuming the operation is the same for all entries in your list, you can do this:
data = [1, 2, 3, 4] # Example data
operation = -> (data_entry) { data_entry * 2 } # Our operation: multiply by two
results = data.map{ |e| Thread.new(e, &operation) }.map{ |t| t.value }
Taking it apart:
data = [1, 2, 3, 4]
This could be anything from database IDs to URIs. Using numbers for simplicity here.
operation = -> (data_entry) { data_entry * 2 }
Definition of a lambda that takes one argument and does some calculation on it. This could be an API call, an SQL query or any other operation that takes some time to complete. Again, for simplicity, I'm just multiplying the numbers by 2.
results =
This array will contain the results of all the asynchronous operations.
data.map{ |e| Thread.new(e, &operation) }...
For every entry in the data set, spawn a thread that runs operation and pass the entry as argument. This is the data_entry argument in the lambda.
...map{ |t| t.value }
Extract the value from each thread. This will wait for the thread to finish first, so by the end of this line all your data will be there.
Lambdas
Lambdas are really just glorified blocks that raise an error if you pass in the wrong number of arguments. The syntax -> (arguments) {code} is just syntactic sugar for lambda { |arguments| code }.
When a method accepts a block, like Thread.new { do_async_stuff_here }, you can also pass a lambda or Proc object prefixed with & and it will be treated the same way.
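A short sketch of these points, using only core Ruby; the names double and loose are illustrative.

```ruby
double = ->(n) { n * 2 } # equivalent to: lambda { |n| n * 2 }

puts double.lambda?                  # true
puts [1, 2, 3].map(&double).inspect  # & passes the lambda where a block is expected
puts Thread.new(5, &double).value    # the thread's argument becomes n

# A lambda checks its argument count...
begin
  double.call(1, 2)
rescue ArgumentError => e
  puts "lambda: #{e.message}"
end

# ...while a plain proc silently drops extra arguments.
loose = proc { |n| n * 2 }
puts loose.call(1, 2)
```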

How do I wait for multiple asynchronous operations to finish before sending a response in Ruby on Rails?

In some web dev I do, I have multiple operations beginning, like GET requests to external APIs, and I want them to both start at the same time because one doesn't rely on the result of the other. I want things to be able to run in the background. I found the concurrent-ruby library, which seems to work well. By mixing it into a class you create, the class's methods get asynchronous versions which run on a background thread. This led me to write code like the following, where FirstAsyncWorker and SecondAsyncWorker are classes I've coded, into which I've mixed the Concurrent::Async module, and coded a method named "work" which sends an HTTP request:
def index
op1_result = FirstAsyncWorker.new.async.work
op2_result = SecondAsyncWorker.new.async.work
render text: results(op1_result, op2_result)
end
However, the controller will implicitly render a response at the end of the action method's execution. So the response gets sent before op1_result and op2_result get values and the only thing sent to the browser is "#".
My solution to this so far is to use Ruby threads. I write code like:
def index
op1_result = nil
op2_result = nil
op1 = Thread.new do
op1_result = get_request_without_concurrent
end
op2 = Thread.new do
op2_result = get_request_without_concurrent
end
# Wait for the two operations to finish
op1.join
op2.join
render text: results(op1_result, op2_result)
end
I don't use a mutex because the two threads don't access the same memory. But I wonder if this is the best approach. Is there a better way to use the concurrent-ruby library, or other libraries better suited to this situation?
I ended up answering my own question after some more research into the concurrent-ruby library. Futures ended up being what I was after! Simply put, they execute a block of code in a background thread and attempting to access the Future's calculated value blocks the main thread until that background thread has completed its work. My Rails controller actions end up looking like:
def index
op1 = Concurrent::Future.execute { get_request }
op2 = Concurrent::Future.execute { another_request }
render text: "The result is #{result(op1.value, op2.value)}."
end
The line with render blocks until both async tasks have finished, at which point result can begin running.

Timeout in a delayed job

I have some code that potentially can run for a longer period of time. However if it does I want to kill it, here is what I'm doing at the moment :
def perform
Timeout.timeout(ENV['JOB_TIMEOUT'].to_i, Exceptions::WorkerTimeout) { do_perform }
end
private
def do_perform
...some code...
end
Where JOB_TIMEOUT is an environment variable with a value such as 10.seconds. I've got reports that this still doesn't prevent my job from running longer than it should.
Is there a better way to do this?
I believe delayed_job does some exception-handling voodoo with multiple retries etc. I also suspect that do_perform will return immediately and the job will continue as usual in another thread. I would imagine a better approach is doing flow control inside the worker:
def perform
# A nil timeout will continue with no timeout, protect against unset ENV
timeout = (ENV['JOB_TIMEOUT'] || 10).to_i
do_stuff
begin
Timeout.timeout(timeout) { do_long_running_stuff }
rescue Timeout::Error
clean_up_after_self
notify_business_logic_of_failure
end
end
This will work. Added benefits are not coupling delayed_job so tightly with your business logic - this code can be ported to any other job queueing system unmodified.
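One detail worth guarding against when the limit comes from the environment: String#to_i returns 0 for unset or non-numeric input, and Timeout.timeout treats 0 (like nil) as "no limit". The sketch below is a standalone illustration, not delayed_job code; job_timeout, run_with_timeout and DEFAULT_TIMEOUT are hypothetical names.

```ruby
require "timeout"

DEFAULT_TIMEOUT = 10 # seconds; hypothetical fallback

# Parses the limit in seconds, falling back to the default when the
# variable is unset or does not start with a positive number.
def job_timeout(raw = ENV["JOB_TIMEOUT"])
  seconds = raw.to_i
  seconds.positive? ? seconds : DEFAULT_TIMEOUT
end

# Runs the block under the limit and reports how it ended.
def run_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
  :finished
rescue Timeout::Error
  :timed_out
end

puts job_timeout("10.seconds")         # to_i reads the leading digits
puts job_timeout(nil)                  # unset variable, default applies
puts run_with_timeout(0.1) { sleep 1 } # block exceeds the limit
puts run_with_timeout(1) { :quick }    # block finishes in time
```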

Groovy PromiseMap - Can I limit the asynchronous thread pool?

I am making a (quick and dirty) Batching API that allows the UI to send a selection of REST API calls and get results for all of them at once.
I am using PromiseMap to make some asynchronous REST calls to the relevant services, which get collected afterward.
There could be a large number of threads that need to run, and I would like to throttle the number of threads that run at the same time, similar to Executor's thread pool.
Is this possible without physically separating the threads into multiple PromiseMaps and chaining them? I haven't found anything online describing limiting the thread pool.
//get requested calls
JSONArray callsToMake=request.JSON as JSONArray
//registers calls in promise map
def promiseMap = new PromiseMap()
//Can I limit this Map as a thread pool to, say, run 10 at a time until finished
callsToMake.each {
def tempVar=it
promiseMap[tempVar.id]={makeCall(tempVar.method, "${basePath}${tempVar.to}" as String, tempVar.body)}
}
def result=promiseMap.get()
def resultList=parseResults(result)
response.status=HttpStatusCodes.ACCEPTED
render resultList as JSON
I'm hoping there's a fairly straight-forward setting that I may be ignorant of.
Thank you.
The default Async implementation in Grails is GPars. To configure the number of threads you need to use a GParsPool. See:
http://gpars.org/guide/guide/dataParallelism.html#dataParallelism_parallelCollections_GParsPool
Example:
withPool(10) {...}
withPool doesn't seem to work. In case anyone is looking to limit threads, here is what I did: we can create a custom group with a custom thread pool and specify the number of threads.
def customGroup = new DefaultPGroup(new DefaultPool(true, 5))
try {
Dataflow.usingGroup(customGroup, {
def promises = new PromiseList()
(1..100).each { number ->
promises << {
log.info "Performing Task ${number}"
Thread.sleep(200)
number++
}
}
def result = promises.get()
})
}
finally {
customGroup.shutdown()
}
Use
runtime 'org.grails:grails-async-gpars'
in build.gradle
And
GParsExecutorsPool.withPool(10){service ->
Shop.list().each{shop ->
Item.list().each{item ->
service.submit({createOrder(shop, item)} as Runnable)
}
}
}
in your Service, for example.
