I am making a (quick and dirty) Batching API that allows the UI to send a selection of REST API calls and get results for all of them at once.
I am using PromiseMap to make some asynchronous REST calls to the relevant services, which get collected afterward.
There could be a large number of threads that need to run, and I would like to throttle the number of threads that run at the same time, similar to Executor's thread pool.
Is this possible without physically separating the threads into multiple PromiseMaps and chaining them? I haven't found anything online describing limiting the thread pool.
//get requested calls
JSONArray callsToMake=request.JSON as JSONArray
//registers calls in promise map
def promiseMap = new PromiseMap()
//Can I limit this Map as a thread pool to, say, run 10 at a time until finished
callsToMake.each {
    def tempVar = it
    promiseMap[tempVar.id] = { makeCall(tempVar.method, "${basePath}${tempVar.to}" as String, tempVar.body) }
}
def result=promiseMap.get()
def resultList=parseResults(result)
response.status=HttpStatusCodes.ACCEPTED
render resultList as JSON
I'm hoping there's a fairly straight-forward setting that I may be ignorant of.
Thank you.
The default Async implementation in Grails is GPars. To configure the number of threads you need to use a GParsPool. See:
http://gpars.org/guide/guide/dataParallelism.html#dataParallelism_parallelCollections_GParsPool
Example:
withPool(10) {...}
withPool doesn't seem to work here. In case anyone else is looking to limit threads, here is what I did: create a custom group with a custom thread pool and specify the number of threads.
def customGroup = new DefaultPGroup(new DefaultPool(true, 5))
try {
    Dataflow.usingGroup(customGroup, {
        def promises = new PromiseList()
        (1..100).each { number ->
            promises << {
                log.info "Performing Task ${number}"
                Thread.sleep(200)
                number++
            }
        }
        def result = promises.get()
    })
}
finally {
    customGroup.shutdown()
}
Use
runtime 'org.grails:grails-async-gpars'
in build.gradle
And
GParsExecutorsPool.withPool(10) { service ->
    Shop.list().each { shop ->
        Item.list().each { item ->
            service.submit({ createOrder(shop, item) } as Runnable)
        }
    }
}
in your service, for example.
Related
I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple:
It reads individual JSON objects from Pub/Sub
Parses them
And sends them via HTTP to some API
This API requires me to send the items in batches of 75. So I built a DoFn that accumulates events in a list and publishes them via this API once I have 75. This turned out to be too slow, so I thought of executing those HTTP requests in different threads using a thread pool instead.
The implementation of what I have right now looks like this:
private class WriteFn : DoFn<TheEvent, Void>() {
    @Transient lateinit var api: TheApi
    @Transient lateinit var currentBatch: MutableList<TheEvent>
    @Transient lateinit var executor: ExecutorService

    @Setup
    fun setup() {
        api = buildApi()
        executor = Executors.newCachedThreadPool()
    }

    @StartBundle
    fun startBundle() {
        currentBatch = mutableListOf()
    }

    @ProcessElement
    fun processElement(processContext: ProcessContext) {
        val record = processContext.element()
        currentBatch.add(record)
        if (currentBatch.size >= 75) {
            flush()
        }
    }

    private fun flush() {
        // Copy the batch so the submitted task is unaffected by clear() below
        val payloadTrack = currentBatch.toList()
        executor.submit {
            api.sendToApi(payloadTrack)
        }
        currentBatch.clear()
    }

    @FinishBundle
    fun finishBundle() {
        if (currentBatch.isNotEmpty()) {
            flush()
        }
    }

    @Teardown
    fun teardown() {
        executor.shutdown()
        executor.awaitTermination(30, TimeUnit.SECONDS)
    }
}
This seems to work "fine" in the sense that data is making it to the API. But I don't know if this is the right approach and I have the sense that this is very slow.
The reason I think it's slow is that when load testing (by sending a few million events to Pub/Sub), the pipeline takes up to 8 times longer to forward those messages to the API (which has response times of under 8ms) than my laptop takes to feed them into Pub/Sub.
Is there any problem with my implementation? Is this the way I should be doing this?
Also... am I required to wait for all the requests to finish in my @FinishBundle method (i.e. by getting the futures returned by the executor and waiting on them)?
You have two interrelated questions here:
Are you doing this right / do you need to change anything?
Do you need to wait in @FinishBundle?
The second answer: yes. But actually you need to flush more thoroughly, as will become clear.
Once your @FinishBundle method succeeds, a Beam runner will assume the bundle has completed successfully. But your @FinishBundle only sends the requests; it does not ensure they have succeeded. So you could lose data if the requests subsequently fail. Your @FinishBundle method should block and wait for confirmation of success from TheApi. Incidentally, all of the above should be idempotent, since after finishing the bundle, an earthquake could strike and cause a retry ;-)
So to answer the first question: should you change anything? Just the above. The practice of batching requests this way can work as long as you are sure the results are committed before the bundle is committed.
You may find that doing so will cause your pipeline to slow down, because @FinishBundle happens more frequently than @Setup. To batch up requests across bundles you need to use the lower-level features of state and timers. I wrote up a contrived version of your use case at https://beam.apache.org/blog/2017/08/28/timely-processing.html. I would be quite interested in how this works for you.
It may simply be that the extremely low latency you are expecting, in the low millisecond range, is not available when there is a durable shuffle in your pipeline.
Below is the code that should be optimized:
def statistics
  blogs = Blog.where(id: params[:ids])
  results = blogs.map do |blog|
    {
      id: blog.id,
      comment_count: blog.blog_comments.select("DISTINCT user_id").count
    }
  end
  render json: results.to_json
end
Each SQL query costs around 200ms. With 10 blog posts, this function would take 2s because the queries run one after another. I could use GROUP BY to optimize the query, but I am putting that aside for now because the task could just as well be a third-party request, and I am interested in how Ruby deals with async.
In JavaScript, when I want to dispatch multiple asynchronous tasks and wait for all of them to resolve, I can use Promise.all(). I wonder what the alternatives are in Ruby.
Do I need a thread for this case? And is it safe to do that in Ruby?
There are multiple ways to solve this in ruby, including promises (enabled by gems).
JavaScript accomplishes asynchronous execution using an event loop and event driven I/O. There are event libraries to accomplish the same thing in ruby. One of the most popular is eventmachine.
As you mentioned, threads can also solve this problem. Thread-safety is a big topic and is further complicated by different thread models in different flavors of ruby (MRI, JRuby, etc). In summary I'll just say that of course threads can be used safely... there are just times when that is difficult. However, when used with blocking I/O (like to an API or a database request) threads can be very useful and fairly straight-forward. A solution with threads might look something like this:
# run blocking IO requests simultaneously
thread_pool = [
  Thread.new { execute_sql_1 },
  Thread.new { execute_sql_2 },
  Thread.new { execute_sql_3 },
  # ...
]
# wait for the slowest one to finish
thread_pool.each(&:join)
You also have access to other concurrency models, like the actor model, async classes, promises, and others enabled by gems like concurrent-ruby.
Finally, ruby concurrency can take the form of multiple processes communicating through built in mechanisms (drb, sockets, etc) or through distributed message brokers (redis, rabbitmq, etc).
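When the number of blocking calls is large, spawning one thread per call may be more than you want. The following is a minimal sketch of a fixed-size worker pool built only on the standard library's Queue and Thread. The helper name run_in_pool and the sample jobs are hypothetical, and the sleep stands in for a blocking request.

```ruby
# Hypothetical helper: run callable jobs with at most pool_size threads alive.
def run_in_pool(jobs, pool_size)
  queue = Queue.new
  jobs.each_with_index { |job, index| queue << [job, index] }
  queue.close # after close, pop returns nil once the queue is drained

  results = Array.new(jobs.size)
  workers = Array.new(pool_size) do
    Thread.new do
      # Each worker keeps pulling jobs until the queue is empty.
      while (item = queue.pop)
        job, index = item
        results[index] = job.call # each job writes to its own slot
      end
    end
  end

  workers.each(&:join) # wait for every worker to finish
  results
end

# Ten fake blocking calls, at most three in flight at once.
jobs = (1..10).map { |n| -> { sleep 0.01; n * n } }
pool_results = run_in_pool(jobs, 3)
```

Results come back in the order of the input jobs because each job stores its value at its own index, regardless of which worker ran it.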
Sure, just do the count in one database call:
blogs = Blog
  .select('blogs.id, COUNT(DISTINCT blog_comments.user_id) AS comment_count')
  .joins('LEFT JOIN blog_comments ON blog_comments.blog_id = blogs.id')
  .where(blogs: { id: params[:ids] })
  .group('blogs.id')
results = blogs.map do |blog|
  { id: blog.id, comment_count: blog.comment_count }
end
render json: results.to_json
You might need to change the statements depending on how your tables are named in the database, because I just guessed from the names of your associations.
Okay, generalizing a bit:
You have a list of data and want to operate on that data asynchronously. Assuming the operation is the same for all entries in your list, you can do this:
data = [1, 2, 3, 4] # Example data
operation = -> (data_entry) { data_entry * 2 } # Our operation: multiply by two
results = data.map{ |e| Thread.new(e, &operation) }.map{ |t| t.value }
Taking it apart:
data = [1, 2, 3, 4]
This could be anything from database IDs to URIs. Using numbers for simplicity here.
operation = -> (data_entry) { data_entry * 2 }
Definition of a lambda that takes one argument and does some calculation on it. This could be an API call, an SQL query or any other operation that takes some time to complete. Again, for simplicity, I'm just multiplying the numbers by 2.
results =
This array will contain the results of all the asynchronous operations.
data.map{ |e| Thread.new(e, &operation) }...
For every entry in the data set, spawn a thread that runs operation and pass the entry as argument. This is the data_entry argument in the lambda.
...map{ |t| t.value }
Extract the value from each thread. This will wait for the thread to finish first, so by the end of this line all your data will be there.
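One detail worth knowing about Thread#value: besides waiting for the thread, it re-raises any exception that terminated the thread. The sketch below is a hypothetical variation of the example above (the division and the sample data are made up for illustration) that shows how to handle a failing entry without losing the other results.

```ruby
# Silence the default stderr report for threads that die with an exception;
# the exception is handled explicitly below.
Thread.report_on_exception = false

data = [1, 2, 0, 4]
operation = ->(entry) { 12 / entry } # raises ZeroDivisionError for 0

threads = data.map { |e| Thread.new(e, &operation) }

# Thread#value joins the thread, then returns its result or re-raises its error.
results = threads.map do |t|
  begin
    t.value
  rescue ZeroDivisionError => e
    "failed: #{e.class}"
  end
end
```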
Lambdas
Lambdas are really just glorified blocks that raise an error if you pass in the wrong number of arguments. The syntax -> (arguments) {code} is just syntactic sugar for lambda { |arguments| code }.
When a method accepts a block like Thread.new { do_async_stuff_here } you can also pass a lambda or any other Proc object prefixed with & and it will be treated the same way.
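A small sketch of the points above, using made-up names (double_lambda, double_proc): a lambda is a Proc that checks its argument count, a plain proc silently drops extra arguments, and either can be passed with & wherever a block is expected.

```ruby
double_lambda = ->(x) { x * 2 }
double_proc   = proc { |x| x * 2 }

# A lambda is an instance of Proc that reports itself as a lambda.
lambda_is_proc = double_lambda.is_a?(Proc)
lambda_flag    = double_lambda.lambda?
proc_flag      = double_proc.lambda?

# Lambdas are strict about the number of arguments.
arity_error = begin
  double_lambda.call(1, 2)
  nil
rescue ArgumentError => e
  e.class
end

# Plain procs are lenient: the extra argument is ignored.
proc_result = double_proc.call(1, 2)

# Passing a lambda with & where a block is expected.
mapped = [1, 2, 3].map(&double_lambda)
```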
In some web dev I do, I have multiple operations beginning, like GET requests to external APIs, and I want them to both start at the same time because one doesn't rely on the result of the other. I want things to be able to run in the background. I found the concurrent-ruby library which seems to work well. By mixing it into a class you create, the class's methods have asynchronous versions which run on a background thread. This led me to write code like the following, where FirstAsyncWorker and SecondAsyncWorker are classes I've coded, into which I've mixed the Concurrent::Async module, and coded a method named "work" which sends an HTTP request:
def index
  op1_result = FirstAsyncWorker.new.async.work
  op2_result = SecondAsyncWorker.new.async.work
  render text: results(op1_result, op2_result)
end
However, the controller will implicitly render a response at the end of the action method's execution. So the response gets sent before op1_result and op2_result get values and the only thing sent to the browser is "#".
My solution to this so far is to use Ruby threads. I write code like:
def index
  op1_result = nil
  op2_result = nil
  op1 = Thread.new do
    op1_result = get_request_without_concurrent
  end
  op2 = Thread.new do
    op2_result = get_request_without_concurrent
  end
  # Wait for the two operations to finish
  op1.join
  op2.join
  render text: results(op1_result, op2_result)
end
I don't use a mutex because the two threads don't access the same memory. But I wonder if this is the best approach. Is there a better way to use the concurrent-ruby library, or other libraries better suited to this situation?
I ended up answering my own question after some more research into the concurrent-ruby library. Futures ended up being what I was after! Simply put, they execute a block of code in a background thread and attempting to access the Future's calculated value blocks the main thread until that background thread has completed its work. My Rails controller actions end up looking like:
def index
  op1 = Concurrent::Future.execute { get_request }
  op2 = Concurrent::Future.execute { another_request }
  render text: "The result is #{result(op1.value, op2.value)}."
end
The line with render blocks until both async tasks have finished, at which point result can begin running.
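To make the blocking behaviour concrete without depending on the gem, here is a minimal stand-in for a future built on a plain Thread. SimpleFuture is a hypothetical class written for this illustration and is not part of concurrent-ruby; the sleeps stand in for HTTP requests. One difference to keep in mind: Thread#value re-raises an error from the background block, whereas, as far as I know, Concurrent::Future#value returns nil on failure and exposes the error through reason.

```ruby
# Hypothetical stand-in for a future, using only the standard library.
class SimpleFuture
  def self.execute(&block)
    new(&block)
  end

  def initialize(&block)
    @thread = Thread.new(&block) # work starts immediately in the background
  end

  # Blocks the caller until the background block has finished.
  def value
    @thread.value
  end
end

op1 = SimpleFuture.execute { sleep 0.1; 20 } # pretend request
op2 = SimpleFuture.execute { sleep 0.1; 22 } # pretend request

# Both blocks run concurrently; the interpolation waits for each result.
message = "The result is #{op1.value + op2.value}."
```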
I am using the plugin: Grails CSV Plugin in my application with Grails 2.5.3.
I need to process the file concurrently, for example with GPars, but I don't know how to do it.
Currently the processing is sequential. Example of my code fragment:
Thanks.
Implementing concurrency in this case may not give you much of a benefit. It really depends on where the bottleneck is. For example, if the bottleneck is in reading the CSV file, then there would be little advantage because the file can only be read in sequential order. With that out of the way, here's the simplest example I could come up with:
import groovyx.gpars.GParsPool
def tokens = csvFileLoad.inputStream.toCsvReader(['separatorChar': ';', 'charset': 'UTF-8', 'skipLines': 1]).readAll()
def failedSaves = GParsPool.withPool {
    tokens.parallel
        .map { it[0].trim() }
        .filter { !Department.findByName(it) }
        .map { new Department(name: it) }
        .map { customImportService.saveRecordCSVDepartment(it) }
        .map { it ? 0 : 1 }
        .sum()
}
if (failedSaves > 0) transactionStatus.setRollbackOnly()
As you can see, the entire file is read first; hence the main bottleneck. The majority of the processing is done concurrently with the map(), filter(), and sum() methods. At the very end, the transaction is rolled back if any of the Departments failed to save.
Note: I chose to go with a map()-sum() pair instead of using anyParallel() to avoid having to convert the parallel array produced by map() to a regular Groovy collection, perform the anyParallel(), which creates a parallel array and then converts it back to a Groovy collection.
Improvements
As I already mentioned in my example the CSV file is first read completely before the concurrent execution begins. It also attempts to save all of the Department instances, even if one failed to save. You may want that (which is what you demonstrated) or not.
My Rails web app has dozens of methods that make calls to an API and process the query results. These methods have the following structure:
def method_one
  batch_query_API
  process_data
end
..........
def method_nth
  batch_query_API
  process_data
end
def summary
  method_one
  ......
  method_nth
  collect_results
end
How can I run all query methods at the same time instead of sequentially in Rails (without firing up multiple workers, of course)?
Edit: all of the methods are called from a single instance variable. I think this limits the use of Sidekiq or Delay in submitting jobs simultaneously.
Ruby has the excellent promise gem. Your example would look like:
require 'future'
def method_one
...
def method_nth
def summary
result1 = future { method_one }
......
resultn = future { method_nth }
collect_results result1, ..., resultn
end
Simple, isn't it? But let's get to more details. This is a future object:
result1 = future { method_one }
It means, the result1 is getting evaluated in the background. You can pass it around to other methods. But result1 doesn't have any result yet, it is still processing in the background. Think of passing around a Thread. But the major difference is - the moment you try to read it, instead of passing it around, it blocks and waits for the result at that point. So in the above example, all the result1 .. resultn variables will keep getting evaluated in the background, but when the time comes to collect the results, and when you try to actually read these values, the reads will wait for the queries to finish at that point.
Install the promise gem and try the below in Ruby console:
require 'future'
x = future { sleep 20; puts 'x calculated'; 10 }; nil
# adding a nil to the end so that x is not immediately tried to print in the console
y = future { sleep 25; puts 'y calculated'; 20 }; nil
# At this point, you'll still be using the console!
# The sleeps are happening in the background
# Now do:
x + y
# At this point, the program actually waits for the x & y future blocks to complete
Edit: Typo in result, should have been result1, change echo to puts
You can take a look at a new option in town: the futuroscope gem.
As you can see from the announcement blog post, it tries to solve the same problem you are facing: making simultaneous API queries. It seems to have pretty good support and good test coverage.
Assuming that your problem is a slow external API, a solution could be the use of either threaded programming or asynchronous programming. By default when doing IO, your code will block. This basically means that if you have a method that does an HTTP request to retrieve some JSON your method will tell your operating system that you're going to sleep and you don't want to be woken up until the operating system has a response to that request. Since that can take several seconds, your application will just idly have to wait.
This behavior is not specific to just HTTP requests. Reading from a file or a device such as a webcam has the same implications. Software does this to prevent hogging up the CPU when it obviously has no use of it.
So the question in your case is: Do we really have to wait for one method to finish before we can call another? In the event that the behavior of method_two is dependent on the outcome of method_one, then yes. But in your case, it seems that they are individual units of work without co-dependence. So there is a potential for concurrent execution.
You can start new threads by initializing an instance of the Thread class with a block that contains the code you'd like to run. Think of a thread as a program inside your program. Your Ruby interpreter will automatically alternate between the thread and your main program. You can start as many threads as you'd like, but the more threads you create, the longer turns your main program will have to wait before returning to execution. However, we are probably talking microseconds or less. Let's look at an example of threaded execution.
def main_method
  Thread.new { method_one }
  Thread.new { method_two }
  Thread.new { method_three }
end
def method_one
  # something_slow_that_does_an_http_request
end
def method_two
  # something_slow_that_does_an_http_request
end
def method_three
  # something_slow_that_does_an_http_request
end
Calling main_method will cause all three methods to be executed in what appears to be parallel. In reality they are still being processed sequentially, but instead of going to sleep when method_one blocks, Ruby will just return to the main thread and switch back to the method_one thread when the OS has the input ready.
Assuming each method takes 2 ms to execute, not counting the wait for the response, all three methods are running after just 6 ms - practically instantly.
If we assume that a response takes 500 ms to complete, that means you can cut down your total execution time from 2 + 500 + 2 + 500 + 2 + 500 to just 2 + 2 + 2 + 500 - in other words from 1506 ms to just 506 ms.
It will feel like the methods are running simultaneously, but in fact they are just sleeping simultaneously.
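The arithmetic above can be checked with a small timing sketch. The names slow_call and elapsed are hypothetical helpers, and sleep stands in for waiting on an HTTP response; the exact numbers will vary from machine to machine.

```ruby
# Stand-in for a method that blocks on I/O for the given number of seconds.
def slow_call(seconds)
  sleep seconds
  :done
end

# Returns how long the given block took, in seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Three calls one after another: the waits add up.
sequential = elapsed { 3.times { slow_call(0.1) } }

# Three calls in threads: the waits overlap, so the total is close to one wait.
threaded = elapsed do
  Array.new(3) { Thread.new { slow_call(0.1) } }.each(&:join)
end

puts format("sequential: %.3fs, threaded: %.3fs", sequential, threaded)
```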
In your case, however, you have a challenge because you have an operation that is dependent on the completion of a set of previous operations. In other words, if you have tasks A, B, C, D, E and F, then A, B, C, D and E can be performed simultaneously, but F cannot be performed until A, B, C, D and E are all complete.
There are different ways to solve this. Let's look at a simple solution, which is creating a sleepy loop in the main thread that periodically examines a list of return values to make sure some condition is fulfilled.
def task_1
  # Something slow
  return results
end
def task_2
  # Something slow
  return results
end
def task_3
  # Something slow
  return results
end
my_responses = {}
Thread.new { my_responses[:result_1] = task_1 }
Thread.new { my_responses[:result_2] = task_2 }
Thread.new { my_responses[:result_3] = task_3 }
# Prevents the main thread from continuing until the three spawned threads are done and have dumped their results in the hash.
while (my_responses.count < 3)
  # Sleep for 100 ms between checks. Without this, you would end up checking the response count thousands of times per second, which is most likely unnecessary.
  sleep(0.1)
end
# Any code at this line will not execute until all three results are collected.
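As an alternative to polling, the main thread can wait on the threads themselves. The sketch below keeps a handle to each thread and collects results with Thread#value, which blocks until that thread is done. The task method here is a hypothetical placeholder for the slow work, and transform_values requires Ruby 2.4 or later.

```ruby
# Placeholder for something slow, such as an HTTP request.
def task(n)
  sleep 0.05
  "result #{n}"
end

# Keep a reference to every thread so we can wait on it later.
threads = {
  result_1: Thread.new { task(1) },
  result_2: Thread.new { task(2) },
  result_3: Thread.new { task(3) }
}

# value joins each thread and returns its block's result, so no sleep loop
# and no hash shared between threads is needed.
my_responses = threads.transform_values(&:value)

# Code from here on runs only after all three results are collected.
```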
Keep in mind that multithreaded programming is a tricky subject with numerous pitfalls. With MRI it's not so bad, because while MRI will happily switch between blocked threads, MRI doesn't support executing two threads simultaneously and that solves quite a few concurrency concerns.
If you want to get into multithreaded programming, I recommend this book:
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601
It's centered around Java, but the pitfalls and concepts explained are universal.
You should check out Sidekiq.
RailsCasts episode about Sidekiq.