Grails CSV Plugin - Concurrency

Grails CSV Plugin - Concurrency - grails

I am using the plugin: Grails CSV Plugin in my application with Grails 2.5.3.
I need to implement the concurrency functionality with for example: GPars, but I don't know how I can do it.
Now, the configuration is sequential processing. Example of my code fragment:
Thanks.

Implementing concurrency in this case may not give you much of a benefit. It really depends on where the bottleneck is. For example, if the bottleneck is in reading the CSV file, then there would be little advantage because the file can only be read in sequential order. With that out of the way, here's the simplest example I could come up with:
import groovyx.gpars.GParsPool
def tokens = csvFileLoad.inputStream.toCsvReader(['separatorChar': ';', 'charset': 'UTF-8', 'skipLines': 1]).readAll()
def failedSaves = GParsPool.withPool {
tokens.parallel
.map { it[0].trim() }
.filter { !Department.findByName(it) }
.map { new Department(name: it) }
.map { customImportService.saveRecordCSVDepartment(it) }
.map { it ? 0 : 1 }
.sum()
}
if(failedSaves > 0) transactionStatus.setRollbackOnly()
As you can see, the entire file is read first; hence the main bottleneck. The majority of the processing is done concurrently with the map(), filter(), and sum() methods. At the very end, the transaction is rolled back if any of the Departments failed to save.
Note: I chose to go with a map()-sum() pair instead of using anyParallel() to avoid having to convert the parallel array produced by map() to a regular Groovy collection, perform the anyParallel(), which creates a parallel array and then converts it back to a Groovy collection.
Improvements
As I already mentioned in my example the CSV file is first read completely before the concurrent execution begins. It also attempts to save all of the Department instances, even if one failed to save. You may want that (which is what you demonstrated) or not.

Related

How to parallelize HTTP requests within an Apache Beam step?

I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple:
It reads individual JSON objects from Pub/Sub
Parses them
And sends them via HTTP to some API
This API requires me to send the items in batches of 75. So I built a DoFn that accumulates events in a list and publish them via this API once they I get 75. This results to be too slow, so I thought instead of executing those HTTP requests in different threads using a thread pool.
The implementation of what I have right now looks like this:
private class WriteFn : DoFn<TheEvent, Void>() {
#Transient var api: TheApi
#Transient var currentBatch: MutableList<TheEvent>
#Transient var executor: ExecutorService
#Setup
fun setup() {
api = buildApi()
executor = Executors.newCachedThreadPool()
}
#StartBundle
fun startBundle() {
currentBatch = mutableListOf()
}
#ProcessElement
fun processElement(processContext: ProcessContext) {
val record = processContext.element()
currentBatch.add(record)
if (currentBatch.size >= 75) {
flush()
}
}
private fun flush() {
val payloadTrack = currentBatch.toList()
executor.submit {
api.sendToApi(payloadTrack)
}
currentBatch.clear()
}
#FinishBundle
fun finishBundle() {
if (currentBatch.isNotEmpty()) {
flush()
}
}
#Teardown
fun teardown() {
executor.shutdown()
executor.awaitTermination(30, TimeUnit.SECONDS)
}
}
This seems to work "fine" in the sense that data is making it to the API. But I don't know if this is the right approach and I have the sense that this is very slow.
The reason I think it's slow is that when load testing (by sending a few million events to Pub/Sub), it takes it up to 8 times more time for the pipeline to forward those messages to the API (which has response times of under 8ms) than for my laptop to feed them into Pub/Sub.
Is there any problem with my implementation? Is this the way I should be doing this?
Also... am I required to wait for all the requests to finish in my #FinishBundle method (i.e. by getting the futures returned by the executor and waiting on them)?

You have two interrelated questions here:
Are you doing this right / do you need to change anything?
Do you need to wait in #FinishBundle?
The second answer: yes. But actually you need to flush more thoroughly, as will become clear.
Once your #FinishBundle method succeeds, a Beam runner will assume the bundle has completed successfully. But your #FinishBundle only sends the requests - it does not ensure they have succeeded. So you could lose data that way if the requests subsequently fail. Your #FinishBundle method should actually be blocking and waiting for confirmation of success from the TheApi. Incidentally, all of the above should be idempotent, since after finishing the bundle, an earthquake could strike and cause a retry ;-)
So to answer the first question: should you change anything? Just the above. The practice of batching requests this way can work as long as you are sure the results are committed before the bundle is committed.
You may find that doing so will cause your pipeline to slow down, because #FinishBundle happens more frequently than #Setup. To batch up requests across bundles you need to use the lower-level features of state and timers. I wrote up a contrived version of your use case at https://beam.apache.org/blog/2017/08/28/timely-processing.html. I would be quite interested in how this works for you.
It may simply be that the extremely low latency you are expecting, in the low millisecond range, is not available when there is a durable shuffle in your pipeline.

Is there something similar to JS 'Promise.all()' in Ruby?

Below is a code that should be optimized:
def statistics
blogs = Blog.where(id: params[:ids])
results = blogs.map do |blog|
{
id: blog.id,
comment_count: blog.blog_comments.select("DISTINCT user_id").count
}
end
render json: results.to_json
end
Each SQL query cost around 200ms. If I have 10 blog posts, this function would take 2s because it runs synchronously. I can use GROUP BY to optimize the query, but I put that aside first because the task could be a third party request, and I am interested in how Ruby deals with async.
In Javascript, when I want to dispatch multiple asynchronous works and wait all of them to resolve, I can use Promise.all(). I wonder what the alternatives are for Ruby language to solve this problem.
Do I need a thread for this case? And is it safe to do that in Ruby?

There are multiple ways to solve this in ruby, including promises (enabled by gems).
JavaScript accomplishes asynchronous execution using an event loop and event driven I/O. There are event libraries to accomplish the same thing in ruby. One of the most popular is eventmachine.
As you mentioned, threads can also solve this problem. Thread-safety is a big topic and is further complicated by different thread models in different flavors of ruby (MRI, JRuby, etc). In summary I'll just say that of course threads can be used safely... there are just times when that is difficult. However, when used with blocking I/O (like to an API or a database request) threads can be very useful and fairly straight-forward. A solution with threads might look something like this:
# run blocking IO requests simultaneously
thread_pool = [
Thread.new { execute_sql_1 },
Thread.new { execute_sql_2 },
Thread.new { execute_sql_3 },
# ...
]
# wait for the slowest one to finish
thread_pool.each(&:join)
You also have access to other currency models, like the actor model, async classes, promises, and others enabled by gems like concurrent-ruby.
Finally, ruby concurrency can take the form of multiple processes communicating through built in mechanisms (drb, sockets, etc) or through distributed message brokers (redis, rabbitmq, etc).

Sure just do the count in one database call:
blogs = Blog
.select('blogs.id, COUNT(DISTINCT blog_comments.user_id) AS comment_count')
.joins('LEFT JOIN blog_comments ON blog_comments.blog_id = blogs.id')
.where(comments: { id: params[:ids] })
.group('blogs.id')
results = blogs.map do |blog|
{ id: blog.id, comment_count: blog.comment_count }
end
render json: results.to_json
You might need to change the statements depending on how your table as named in the database because I just guessed by the name of your associations.

Okay, generalizing a bit:
You have a list of data data and want to operate on that data asynchronously. Assuming the operation is the same for all entries in your list, you can do this:
data = [1, 2, 3, 4] # Example data
operation = -> (data_entry) { data * 2 } # Our operation: multiply by two
results = data.map{ |e| Thread.new(e, &operation) }.map{ |t| t.value }
Taking it apart:
data = [1, 2, 3, 4]
This could be anything from database IDs to URIs. Using numbers for simplicity here.
operation = -> (data_entry) { data * 2 }
Definition of a lambda that takes one argument and does some calculation on it. This could be an API call, an SQL query or any other operation that takes some time to complete. Again, for simplicity, I'm just multiplicating the numbers by 2.
results =
This array will contain the results of all the asynchronous operations.
data.map{ |e| Thread.new(e, &operation) }...
For every entry in the data set, spawn a thread that runs operation and pass the entry as argument. This is the data_entry argument in the lambda.
...map{ |t| t.value }
Extract the value from each thread. This will wait for the thread to finish first, so by the end of this line all your data will be there.
Lambdas
Lambdas are really just glorified blocks that raise an error if you pass in the wrong number of arguments. The syntax -> (arguments) {code} is just syntactic sugar for Lambda.new { |arguments| code }.
When a method accepts a block like Thread.new { do_async_stuff_here } you can also pass a Lambda or Proc object prefixed with & and it will be treated the same way.

Jenkins Pipeline Multiconfiguration Project

Original situation:
I have a job in Jenkins that is running an ant script. I easily managed to test this ant script on more then one software version using a "Multi-configuration project".
This type of project is really cool because it allows me to specify all the versions of the two software that I need (in my case Java and Matlab) an it will run my ant script with all the combinations of my parameters.
Those parameters are then used as string to be concatenated in the definition of the location of the executable to be used by my ant.
example: env.MATLAB_EXE=/usr/local/MATLAB/${MATLAB_VERSION}/bin/matlab
This is working perfectly but now I am migrating this scripts to a pipline version of it.
Pipeline migration:
I managed to implement the same script in a pipeline fashion using the Parametrized pipelines pluin. With this I achieve the point in which I can manually select which version of my software is going to be used if I trigger the build manually and I also found a way to execute this periodically selecting the parameter I want at each run.
This solution seems fairly working however is not really satisfying.
My multi-config project had some feature that this does not:
With more then one parameter I can set to interpolate them and execute each combination
The executions are clearly separated and in build history/build details is easy to recognize which settings hads been used
Just adding a new "possible" value to the parameter is going to spawn the desired executions
Request
So I wonder if there is a better solution to my problem that can satisfy also the point above.
Long story short: is there a way to implement a multi-configuration project in jenkins but using the pipeline technology?

I've seen this and similar questions asked a lot lately, so it seemed that it would be a fun exercise to work this out...
A matrix/multi-config job, visualized in code, would really just be a few nested for loops, one for each axis of parameters.
You could build something fairly simple with some hard coded for loops to loop over a few lists. Or you can get more complicated and do some recursive looping so you don't have to hard code the specific loops.
DISCLAIMER: I do ops much more than I write code. I am also very new to groovy, so this can probably be done more cleanly, and there are probably a lot of groovier things that could be done, but this gets the job done, anyway.
With a little work, this matrixBuilder could be wrapped up in a class so you could pass in a task closure and the axis list and get the task map back. Stick it in a shared library and use it anywhere. It should be pretty easy to add some of the other features from the multiconfiguration jobs, such as filters.
This attempt uses a recursive matrixBuilder function to work through any number of parameter axes and build all the combinations. Then it executes them in parallel (obviously depending on node availability).
/*
All the config axes are defined here
Add as many lists of axes in the axisList as you need.
All combinations will be built
*/
def axisList = [
["ubuntu","rhel","windows","osx"], //agents
["jdk6","jdk7","jdk8"], //tools
["banana","apple","orange","pineapple"] //fruit
]
def tasks = [:]
def comboBuilder
def comboEntry = []
def task = {
// builds and returns the task for each combination
/* Map the entries back to a more readable format
the index will correspond to the position of this axis in axisList[] */
def myAgent = it[0]
def myJdk = it[1]
def myFruit = it[2]
return {
// This is where the important work happens for each combination
node(myAgent) {
println "Executing combination ${it.join('-')}"
def javaHome = tool myJdk
println "Node=${env.NODE_NAME}"
println "Java=${javaHome}"
}
//We won't declare a specific agent this part
node {
println "fruit=${myFruit}"
}
}
}
/*
This is where the magic happens
recursively work through the axisList and build all combinations
*/
comboBuilder = { def axes, int level ->
for ( entry in axes[0] ) {
comboEntry[level] = entry
if (axes.size() > 1 ) {
comboBuilder(axes[1..-1], level + 1)
}
else {
tasks[comboEntry.join("-")] = task(comboEntry.collect())
}
}
}
stage ("Setup") {
node {
println "Initial Setup"
}
}
stage ("Setup Combinations") {
node {
comboBuilder(axisList, 0)
}
}
stage ("Multiconfiguration Parallel Tasks") {
//Run the tasks in parallel
parallel tasks
}
stage("The End") {
node {
echo "That's all folks"
}
}
You can see a more detailed flow of the job at http://localhost:8080/job/multi-configPipeline/[build]/flowGraphTable/ (available under the Pipeline Steps link on the build page.
EDIT:
You can move the stage down into the "task" creation and then see the details of each stage more clearly, but not in a neat matrix like the multi-config job.
...
return {
// This is where the important work happens for each combination
stage ("${it.join('-')}--build") {
node(myAgent) {
println "Executing combination ${it.join('-')}"
def javaHome = tool myJdk
println "Node=${env.NODE_NAME}"
println "Java=${javaHome}"
}
//Node irrelevant for this part
node {
println "fruit=${myFruit}"
}
}
}
...
Or you could wrap each node with their own stage for even more detail.
As I did this, I noticed a bug in my previous code (fixed above now). I was passing the comboEntry reference to the task. I should have sent a copy, because, while the names of the stages were correct, when it actually executed them, the values were, of course, all the last entry encountered. So I changed it to tasks[comboEntry.join("-")] = task(comboEntry.collect()).
I noticed that you can leave the original stage ("Multiconfiguration Parallel Tasks") {} around the execution of the parallel tasks. Technically now you have nested stages. I'm not sure how Jenkins is supposed to handle that, but it doesn't complain. However, the 'parent' stage timing is not inclusive of the parallel stages timing.
I also noticed is that when a new build starts to run, on the "Stage View" of the job, all the previous builds disappear, presumably because the stage names don't all match up. But after the build finishes running, they all match again and the old builds show up again.
And finally, Blue Ocean doesn't seem to vizualize this the same way. It doesn't recognize the "stages" in the parallel processes, only the enclosing stage (if it is present), or "Parallel" if it isn't. And then only shows the individual parallel processes, not the stages within.

Points 1 and 3 are not completely clear to me, but I suspect you just want to use “scripted” rather than “Declarative” Pipeline syntax, in which case you can make your job do whatever you like—anything permitted by matrix project axes and axis filters and much more, including parallel execution. Declarative syntax trades off syntactic simplicity (and friendliness to “round-trip” editing tools and “linters”) for flexibility.
Point 2 is about visualization of the result, rather than execution per se. While this is a complex topic, the usual concrete request which is not already supported by existing visualizations like Blue Ocean is to be able to see test results distinguished by axis combination. This is tracked by JENKINS-27395 and some related issues, with design in progress.

Jedis/Redis SocketTimeout exception on Lua scripts

We are using lua scripts to perform batch deletes of data on updates to our DB. Jedis executes the lua script using a pipeline.
local result = redis.call('lrange',key,0,12470)
for i,k in ipairs(result) do
redis.call('del',k)
redis.call('ltrim',key,1,k)
end
try (Jedis jedis = jedisPool.getResource()) {
Pipeline pipeline = jedis.pipelined();
long len = jedis.llen(table);
String script = String.format(DELETE_LUA_SCRIPT, table, len);
LOGGER.info(script);
pipeline.eval(script);
pipeline.sync();
} catch (JedisConnectionException e) {
LOGGER.info(e.getMessage());
}
For large ranges we notice that the lua scripts slow down and we get SocketTimeOutExceptions.
running redis-cli slowlog displays only the lua scripts that have taken too long to execute.
Is there a better way to do this? is my lua script blocking?
When I use just pipeline to do the batch deletes, the slowlog also returns slow queries.
try (Jedis jedis = jedisPool.getResource()) {
Pipeline pipeline = jedis.pipelined();
long len = jedis.llen(table);
List<String> queriesContainingTable = jedis.lrange(table,0,len);
if(queriesContainingTable.size() > 0) {
for (String query: queriesContainingTable) {
pipeline.del(query);
pipeline.lrem(table,1,query);
}
pipeline.sync();
}
} catch (JedisConnectionException e) {
LOGGER.info("CACHE INVALIDATE FAIL:"+e.getMessage());
}

slowlog is capable of storing top 128 slowlogs alone (can be changed in redis.conf slowlog-max-len 128). So your 1st model of using LUA script is surely a blocking one.
If you delete such a number (12470) one by one it is surely a blocking one as it take more time to complete. Out of the 2 models 2nd one is fine for me (using pipeline), because you avoid the iteration all you do is hitting del query for n times.
You can use del of multiple keys for every 100 or 1000 (whichever you feel as optimal after a small testing). You can group them to a pipeline altogether.
Or if you can do the same without atomicity, you can delete every 100 or 1000 keys at once in a loop, so that it wouldn't be a blocking call.
Try out with different combinations take the metrics and go with the optimized one.

Groovy PromiseMap - Can I limit the asynchronous thread pool?

I am making a (quick and dirty) Batching API that allows the UI to send a selection of REST API calls and get results for all of them at once.
I am using PromiseMap to make some asynchronous REST calls to the relevant services, which get collected afterward.
There could be a large number of threads that need to run, and I would like to throttle the number of threads that run at the same time, similar to Executor's thread pool.
Is this possible without physically separating the threads into multiple PromiseMaps and chaining them? I haven't found anything online describing limiting the thread pool.
//get requested calls
JSONArray callsToMake=request.JSON as JSONArray
//registers calls in promise map
def promiseMap = new PromiseMap()
//Can I limit this Map as a thread pool to, say, run 10 at a time until finished
data.each {
def tempVar=it
promiseMap[tempVar.id]={makeCall(tempVar.method, "${basePath}${tempVar.to}" as String, tempVar.body)}
}
def result=promiseMap.get()
def resultList=parseResults(result)
response.status=HttpStatusCodes.ACCEPTED
render resultList as JSON
I'm hoping there's a fairly straight-forward setting that I may be ignorant of.
Thank you.

The default Async implementation in Grails is GPars. To configure the number of threads you need to use a GParsPool. See:
http://gpars.org/guide/guide/dataParallelism.html#dataParallelism_parallelCollections_GParsPool
Example:
withPool(10) {...}

withPool doesn't seem to be working. Just incase if anyone is looking to limit threads here is what i did. We can create a custom Group with custom ThreadPool and specify the number of the Threads.
def customGroup = new DefaultPGroup(new DefaultPool(true, 5))
try {
Dataflow.usingGroup(customGroup, {
def promises = new PromiseList()
(1..100).each { number ->
promises << {
log.info "Performing Task ${number}"
Thread.sleep(200)
number++
}
}
def result = promises.get()
})
}
finally {
customGroup.shutdown()
}

Use
runtime 'org.grails:grails-async-gpars'
at build.gradle
And
GParsExecutorsPool.withPool(10){service ->
Shop.list().each{shop ->
Item.list().each{item ->
service.submit({createOrder(shop, item)} as Runnable)
}
}
}
in your Service for example

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart