Process huge amounts of data with Grails and GPars

I have a Grails application which runs a job daily at midnight. In my example app I have 10000 Person records and do the following in the Quartz job:
package threading

import static grails.async.Promises.task
import static groovyx.gpars.GParsExecutorsPool.withPool

class ComplexJob {
    static triggers = {
        simple repeatInterval: 30 * 1000l
    }

    def execute() {
        if (Person.count == 5000) {
            println "Executing job"
            withPool 10000, {
                Person.listOrderByAge(order: "asc").each { p ->
                    task {
                        log.info "Started ${p}"
                        Thread.sleep(15000l - (-1 * p.age))
                    }.onComplete {
                        log.info "Completed ${p}"
                    }
                }
            }
        }
    }
}
Ignore the repeatInterval, as it is set this way only for testing purposes.
When the job gets executed I get the following exception:
2014-11-14 16:11:51,880 quartzScheduler_Worker-3 grails.plugins.quartz.listeners.ExceptionPrinterJobListener - Exception occurred in job: Grails Job
org.quartz.JobExecutionException: java.lang.IllegalStateException: The thread pool executor cannot run the task. The upper limit of the thread pool size has probably been reached. Current pool size: 1000 Maximum pool size: 1000 [See nested exception: java.lang.IllegalStateException: The thread pool executor cannot run the task. The upper limit of the thread pool size has probably been reached. Current pool size: 1000 Maximum pool size: 1000]
at grails.plugins.quartz.GrailsJobFactory$GrailsJob.execute(GrailsJobFactory.java:111)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.IllegalStateException: The thread pool executor cannot run the task. The upper limit of the thread pool size has probably been reached. Current pool size: 1000 Maximum pool size: 1000
at org.grails.async.factory.gpars.LoggingPoolFactory$3.rejectedExecution(LoggingPoolFactory.groovy:100)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at groovyx.gpars.scheduler.DefaultPool.execute(DefaultPool.java:155)
at groovyx.gpars.group.PGroup.task(PGroup.java:305)
at groovyx.gpars.group.PGroup.task(PGroup.java:286)
at groovyx.gpars.dataflow.Dataflow.task(Dataflow.java:93)
at org.grails.async.factory.gpars.GparsPromise.<init>(GparsPromise.groovy:41)
at org.grails.async.factory.gpars.GparsPromiseFactory.createPromise(GparsPromiseFactory.groovy:68)
at grails.async.Promises.task(Promises.java:123)
at threading.ComplexJob$_execute_closure1_closure3.doCall(ComplexJob.groovy:20)
at threading.ComplexJob$_execute_closure1.doCall(ComplexJob.groovy:19)
at groovyx.gpars.GParsExecutorsPool$_withExistingPool_closure2.doCall(GParsExecutorsPool.groovy:192)
at groovyx.gpars.GParsExecutorsPool.withExistingPool(GParsExecutorsPool.groovy:191)
at groovyx.gpars.GParsExecutorsPool.withPool(GParsExecutorsPool.groovy:162)
at groovyx.gpars.GParsExecutorsPool.withPool(GParsExecutorsPool.groovy:136)
at threading.ComplexJob.execute(ComplexJob.groovy:18)
at grails.plugins.quartz.GrailsJobFactory$GrailsJob.execute(GrailsJobFactory.java:104)
... 2 more
2014-11-14 16:12:06,756 Actor Thread 20 org.grails.async.factory.gpars.LoggingPoolFactory - Async execution error: A DataflowVariable can only be assigned once. Only re-assignments to an equal value are allowed.
java.lang.IllegalStateException: A DataflowVariable can only be assigned once. Only re-assignments to an equal value are allowed.
at groovyx.gpars.dataflow.expression.DataflowExpression.bind(DataflowExpression.java:368)
at groovyx.gpars.group.PGroup$4.run(PGroup.java:315)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2014-11-14 16:12:06,756 Actor Thread 5 org.grails.async.factory.gpars.LoggingPoolFactory - Async execution error: A DataflowVariable can only be assigned once. Only re-assignments to an equal value are allowed.
java.lang.IllegalStateException: A DataflowVariable can only be assigned once. Only re-assignments to an equal value are allowed.
at groovyx.gpars.dataflow.expression.DataflowExpression.bind(DataflowExpression.java:368)
at groovyx.gpars.group.PGroup$4.run(PGroup.java:315)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
It seems as if the thread pool size hasn't been set to 10000, even though I use withPool(10000).
Can I perhaps do this computation (which currently only prints log statements) in chunks? If so, how can I tell which item was the last one processed (i.e. where to continue)?

I suspect the withPool() call has no effect, since task most likely uses a default thread pool, not the one created by withPool(). Try removing the call to withPool() and see if the tasks still run.
The groovyx.gpars.scheduler.DefaultPool (the default pool for tasks) in GPars resizes with the workload and has a limit of 1000 concurrent threads.
I'd suggest creating a fixed-size pool instead, e.g.:
import groovyx.gpars.group.DefaultPGroup

def group = new DefaultPGroup(numberOfThreads)
group.task { ... }
Note: I'm not familiar with the grails.async task, only the core GPars ones, so things may be slightly different around PGroups in grails.async.
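For illustration, here is a minimal self-contained sketch using core GPars (not the grails.async API), pulling GPars in via @Grab; the pool stays at the requested size instead of resizing:
@Grab('org.codehaus.gpars:gpars:1.2.1')
import groovyx.gpars.group.DefaultPGroup

def group = new DefaultPGroup(8)  // fixed pool of 8 threads, never grows
try {
    def promises = (1..100).collect { n ->
        group.task {
            Thread.sleep(50)  // simulate per-item work
            "item $n done"
        }
    }
    promises.each { println it.get() }  // block until each task completes
} finally {
    group.shutdown()  // release the pool threads
}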

Wrapping the processing of each element in its own task does not seem optimal. The standard way to parallelize work is to split the whole job into an appropriate number of sub-tasks, so you start by choosing that number. For a CPU-bound task you might create N = number-of-processors sub-tasks and split the list into N sub-lists, like this:
def persons = Person.listOrderByAge(order: "asc")
int nThreads = Runtime.getRuntime().availableProcessors()
int size = Math.ceil(persons.size() / nThreads) as int  // integer chunk size; the last chunk may be smaller
withPool nThreads, {
    persons.collate(size).each { subList ->
        task {
            subList.each { p ->
                ...
            }
        }
    }
}
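To answer the "where to continue" part of the question: one common approach is to persist a checkpoint after each chunk. Below is a rough sketch; the JobCheckpoint domain class and the process() helper are hypothetical, not part of the original code:
def checkpoint = JobCheckpoint.findByName('person-job') ?:
        new JobCheckpoint(name: 'person-job', lastProcessedId: 0L)
// fetch the next batch of unprocessed records, ordered by id
def batch = Person.findAllByIdGreaterThan(checkpoint.lastProcessedId,
        [sort: 'id', order: 'asc', max: 500])
batch.each { p ->
    process(p)  // hypothetical: whatever work each record needs
}
if (batch) {
    checkpoint.lastProcessedId = batch.last().id
    checkpoint.save(flush: true)  // persist so a restarted job resumes here
}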

Related

Issue with Quartz Grails plugin

I have a Grails application with a Quartz job running in it. The job contains code similar to the below.
class MyJob {
    static triggers = {}

    def printLog(msg) {
        String threadId = Thread.currentThread().getId()
        String threadName = Thread.currentThread().getName()
        log.info(threadId + " - " + threadName + " : " + msg)
    }

    def execute(context) {
        printLog("Before Sync")
        synchronized (MyJob) {
            printLog("Inside Sync")
            try {
                printLog("Before sleep 20 minutes")
                Thread.sleep(1200000)
                printLog("After sleep")
            } catch (Exception e) {
                log.error("Error while sleeping")
            }
        }
        printLog("After Sync")
    }
}
I have scheduled it to trigger a job every minute.
I can see in the logs that one thread acquires the synchronized block while the other jobs pile up, waiting for that thread to finish; this is working as expected.
The issue is that the jobs stop after 10 minutes, by which time 10 threads have been created. Of those, one is sleeping for 20 minutes and the other 9 are waiting for the first thread to release the lock. Why are no new jobs created?
I saw in some answers that I can fix the issue by modifying my triggers section like below:
static triggers = {
    simple repeatInterval: 100
}
I tried the above option and it still shows only 10 jobs.
Where is it taking the default value of 10 from?
How can I modify it so that jobs keep being created indefinitely?
I am new to Grails and Quartz, so I have no idea what is happening.
I think the Grails plugin sets the threadCount to 10 in its bundled quartz.properties file. Assuming you're using Grails 3, you can override it in application.yml like this:
quartz:
    threadPool:
        threadCount: 25
In Grails 2 the equivalent goes in Config.groovy:
quartz {
    props {
        threadPool.threadCount = 100
    }
}
In general, it's not a good idea to block the job thread with sleeps.
If you have a job running a long process, you should split it into several jobs in order to release the thread as soon as possible.
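A rough sketch of that advice with the Grails Quartz plugin: each execution does a bounded amount of work and returns, so the worker thread is released and the next trigger picks up the remainder. The chunkOfWork() helper is hypothetical:
class ChunkedJob {
    static triggers = {
        simple repeatInterval: 60000l  // fire every minute
    }

    def concurrent = false  // never overlap executions

    def execute() {
        long deadline = System.currentTimeMillis() + 30000l
        // work for at most 30 seconds, then return and release the thread;
        // chunkOfWork() is assumed to process one batch and return false
        // when nothing is left to do
        while (System.currentTimeMillis() < deadline && chunkOfWork()) {
            // keep going until the deadline or until the work runs out
        }
    }
}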

Memory leak of AKKA Actor

I have a simple test program to try ...
object ActorLeak extends App {
  val system = ActorSystem("ActorLeak")
  val times = 100000000
  for (i <- 1 to times) {
    val myActor = system.actorOf(Props(classOf[TryActor], i), name = s"TryActor-$i")
    //Thread sleep 100
    myActor ! StopCmd
    if (i % 10000 == 0)
      println(s"Completed $i")
  }
  println(s"Creating and stopping $times end.")
  val hookThread = new Thread(new Runnable {
    def run() {
      system.shutdown()
    }
  })
  Runtime.getRuntime.addShutdownHook(hookThread)
}
case object StopCmd

class TryActor(no: Int) extends Actor {
  def receive = {
    case StopCmd => context stop self
  }
}
What I found: sometimes an OutOfMemoryError, sometimes the JVM dies, sometimes it just runs more and more slowly ...
Is there memory leak in creation / stop of actors?
Actor creation and messaging are both asynchronous: when actorOf returns, it does not mean the actor has been created yet, and when ! returns, it does not mean the actor has received or acted upon the message.
This means that you are not actually creating and stopping an actor in each iteration; you only trigger creation and send a message. The loop is probably quicker at queueing up actor creations than the messages can arrive and trigger the stopping of the actors, and this fills up the heap of your JVM.
To do what I think you are trying to do, you would have to make the actor send a reply upon receiving the StopCmd and wait for that reply inside your loop before continuing with the next iteration. This can be done with the ask pattern together with Await.result to block the main thread until the actor's reply arrives.
Note that this is only useful for your understanding and not something that you would do in an actual system using Akka.

Huge performance drop in cassandra-orm after million records

I'm using the cassandra-orm plugin (cassandra-orm:0.4.5) for migrating clicks from a Postgres DB to Cassandra. (I know I could use a raw data import, but I want to make use of groupBy and the explicit indexes maintained by the plugin.)
The migration procedure is simple: I select a bunch of clicks from Postgres (via GORM) and then flush them to Cassandra. Every click is a new record, and a new object is created in Grails and saved in Cassandra. With 20 threads I was able to reach a throughput of 2000 clicks/sec. After importing 5 million clicks the performance started to degrade dramatically, down to 50 clicks/sec.
I did some profiling and found out that 19 threads were waiting (parked) while one thread was performing a rehash in Groovy's AbstractConcurrentMapBase.
Stack trace for the waiting threads:
Name: pool-4-thread-2
State: WAITING on org.codehaus.groovy.util.ManagedConcurrentMap$Segment#5387f7af
Total blocked: 45,027 Total waited: 55,891
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:34)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:101)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.nosql.Click.getProperty(Click.groovy)
Stack trace for the rehashing thread:
Name: pool-4-thread-11
State: RUNNABLE
Total blocked: 46,544 Total waited: 57,433
Stack trace:
org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.rehash(AbstractConcurrentMapBase.java:217)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:105)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.fma.nosql.Click.getProperty(Click.groovy)
After hours of debugging I found out that the issue is the dynamic property "_cassandra_cluster_" which is added to all plugin-managed objects:
// cluster property (_cassandra_cluster_)
clazz.metaClass."${CLUSTER_PROP}" = null
This property is then internally stored in the ThreadManagedMetaBeanProperty instance2Prop map. When the dynamic property is accessed (def cluster = click._cassandra_cluster_), the click instance is put into the instance2Prop map with a soft reference. So far so good: soft references can be garbage collected, right? However, there seems to be a bug in the ManagedConcurrentMap implementation which disregards the garbage-collected elements and keeps rehashing and expanding the map (described here and here).
Workaround
Since the map is internally kept at the class level, the only working solution was to restart the server. Eventually I developed a dirty workaround which clears the internal map of zombie elements. The following code runs in a separate thread:
public void rehashClickSegmentsIfNecessary() {
    ManagedConcurrentMap instanceMap = lookupInstanceMap(Click.class, "_cassandra_cluster_")
    if (instanceMap.fullSize() - instanceMap.size() > 50000) {
        // we have more than 50 000 zombie references in the map
        rehashSegments(instanceMap)
    }
}

private void rehashSegments(ManagedConcurrentMap instanceMap) {
    org.codehaus.groovy.util.ManagedConcurrentMap.Segment[] segments = instanceMap.segments
    for (int i = 0; i < segments.length; i++) {
        segments[i].lock()
        try {
            segments[i].rehash()
        } finally {
            segments[i].unlock()
        }
    }
}

private ManagedConcurrentMap lookupInstanceMap(Class clazz, String prop) {
    MetaClassRegistry registry = GroovySystem.metaClassRegistry
    MetaClassImpl metaClass = registry.getMetaClass(clazz)
    return metaClass.getMetaProperty(prop, false).instance2Prop
}
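For completeness, one way to run that cleanup "in a separate thread" is a small scheduled executor; a sketch, assuming the methods above live in the same class:
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

def cleaner = Executors.newSingleThreadScheduledExecutor()
// check for zombie references once a minute, starting after one minute
cleaner.scheduleAtFixedRate(
        { rehashClickSegmentsIfNecessary() } as Runnable,
        1, 1, TimeUnit.MINUTES)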
Do you have any production experience with cassandra-orm or any other Grails plugin connecting to Cassandra?

How to configure the concurrent execution property as false in Grails

Hi, I am new to Grails and I am using the Quartz plugin for scheduling jobs. I scheduled a job to run every 60 seconds, but it sometimes takes more than 60 seconds; in that case another thread is started while the first thread is still running. Can anyone tell me how to execute the threads sequentially, one by one?
When using the Grails Quartz plugin you can simply set the concurrent property to false to avoid concurrent executions of a Job:
class MyJob {
    static triggers = {
        ...
    }

    def concurrent = false

    def execute(context) {
        ...
    }
}
If you are using Quartz as a plain dependency (not via the Grails plugin) you need to implement StatefulJob (Quartz < 2.0) or use the @DisallowConcurrentExecution and @PersistJobDataAfterExecution annotations (Quartz >= 2.0).
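A minimal plain-Quartz (>= 2.0) sketch of that; the annotation stops the scheduler from starting a second instance of the job while one is still running:
import org.quartz.DisallowConcurrentExecution
import org.quartz.Job
import org.quartz.JobExecutionContext

@DisallowConcurrentExecution
class MySequentialJob implements Job {
    void execute(JobExecutionContext context) {
        // long-running work goes here; Quartz will delay the next
        // firing instead of running it concurrently
    }
}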

Searchable index gets locked on manual update (LockObtainFailedException)

We have a Grails project that runs behind a load balancer. There are three instances of the Grails application running on the server (using separate Tomcat instances). Each instance has its own searchable index. Because the indexes are separate, the automatic update is not enough to keep the index consistent between the application instances. Because of this we have disabled searchable index mirroring, and updates to the index are done manually in a scheduled Quartz job. According to our understanding no other part of the application should modify the index.
The quartz job runs once a minute and it checks from the database which rows have been updated by the application, and re-indexes those objects. The job also checks if the same job is already running so it doesn’t do any concurrent indexing. The application runs fine for few hours after the startup and then suddenly when the job is starting, LockObtainFailedException is thrown:
22.10.2012 11:20:40 [xxxx.ReindexJob] ERROR Could not update searchable index, class org.compass.core.engine.SearchEngineException:
Failed to open writer for sub index [product]; nested exception is
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
out:
SimpleFSLock#/home/xxx/tomcat/searchable-index/index/product/lucene-a7bbc72a49512284f5ac54f5d7d32849-write.lock
According to the log the last time the job was executed, re-indexing was done without any errors and the job finished successfully. Still, this time the re-index operation throws the locking exception, as if the previous operation was unfinished and the lock had not been released. The lock will not be released until the application is restarted.
We tried to solve the problem by manually opening the locked index, which causes the following error to be printed to the log:
22.10.2012 11:21:30 [manager.IndexWritersManager ] ERROR Illegal state, marking an index writer as open, while another is marked as
open for sub index [product]
After this the job seems to be working correctly and doesn’t become stuck in a locked state again. However this causes the application to constantly use 100 % of the CPU resource. Below is a shortened version of the quartz job code.
Any help would be appreciated to solve the problem, thanks in advance.
class ReindexJob {

    def compass
    ...

    static Calendar lastIndexed

    static triggers = {
        // Every day every minute (at xx:xx:30), start delay 2 min
        // cronExpression: "s m h D M W [Y]"
        cron name: "ReindexTrigger", cronExpression: "30 * * * * ?", startDelay: 120000
    }

    def execute() {
        if (ConcurrencyHelper.isLocked(ConcurrencyHelper.Locks.LUCENE_INDEX)) {
            log.error("Search index has been locked, not doing anything.")
            return
        }
        try {
            boolean acquiredLock = ConcurrencyHelper.lock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
            if (!acquiredLock) {
                log.warn("Could not lock search index, not doing anything.")
                return
            }
            Calendar reindexDate = lastIndexed
            Calendar newReindexDate = Calendar.instance
            if (!reindexDate) {
                reindexDate = Calendar.instance
                reindexDate.add(Calendar.MINUTE, -3)
                lastIndexed = reindexDate
            }
            log.debug("+++ Starting ReindexJob, last indexed ${TextHelper.formatDate("yyyy-MM-dd HH:mm:ss", reindexDate.time)} +++")
            Long start = System.currentTimeMillis()
            String reindexMessage = ""

            // Retrieve the ids of products that have been modified since the job last ran
            String productQuery = "select p.id from Product ..."
            List<Long> productIds = Product.executeQuery(productQuery, ["lastIndexedDate": reindexDate.time, "lastIndexedCalendar": reindexDate])

            if (productIds) {
                reindexMessage += "Found ${productIds.size()} product(s) to reindex. "
                final int BATCH_SIZE = 10
                Long time = TimeHelper.timer {
                    for (int inserted = 0; inserted < productIds.size(); inserted += BATCH_SIZE) {
                        log.debug("Indexing from ${inserted + 1} to ${Math.min(inserted + BATCH_SIZE, productIds.size())}: ${productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size()))}")
                        Product.reindex(productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size())))
                        Thread.sleep(250)
                    }
                }
                reindexMessage += " (${time / 1000} s). "
            } else {
                reindexMessage += "No products to reindex. "
            }

            log.debug(reindexMessage)

            // Re-index brands
            Brand.reindex()

            lastIndexed = newReindexDate
            log.debug("+++ Finished ReindexJob (${(System.currentTimeMillis() - start) / 1000} s) +++")
        } catch (Exception e) {
            log.error("Could not update searchable index, ${e.class}: ${e.message}")
            if (e instanceof org.apache.lucene.store.LockObtainFailedException || e instanceof org.compass.core.engine.SearchEngineException) {
                log.info("This is a Lucene index locking exception.")
                for (String subIndex in compass.searchEngineIndexManager.getSubIndexes()) {
                    if (compass.searchEngineIndexManager.isLocked(subIndex)) {
                        log.info("Releasing Lucene index lock for sub index ${subIndex}")
                        compass.searchEngineIndexManager.releaseLock(subIndex)
                    }
                }
            }
        } finally {
            ConcurrencyHelper.unlock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
        }
    }
}
Based on JMX CPU samples, it seems that Compass is doing some scheduling behind the scenes. Comparing one-minute CPU samples of a normal instance and a 100%-CPU instance, a few things stand out:
org.apache.lucene.index.IndexWriter.doWait() is using most of the CPU time.
A Compass Scheduled Executor Thread is shown in the thread list; this was not seen in the normal situation.
One Compass Executor Thread is doing commitMerge; in the normal situation none of these threads was doing commitMerge.
You can try increasing the 'compass.transaction.lockTimeout' setting. The default is 10 (seconds).
Another option is to disable concurrency in Compass and make it synchronous. This is controlled by setting 'compass.transaction.processor.read_committed.concurrentOperations' to 'false'. You might also have to set 'compass.transaction.processor' to 'read_committed'.
These are the compass settings we are currently using:
compassSettings = [
    'compass.engine.optimizer.schedule.period': '300',
    'compass.engine.mergeFactor': '1000',
    'compass.engine.maxBufferedDocs': '1000',
    'compass.engine.ramBufferSize': '128',
    'compass.engine.useCompoundFile': 'false',
    'compass.transaction.processor': 'read_committed',
    'compass.transaction.processor.read_committed.concurrentOperations': 'false',
    'compass.transaction.lockTimeout': '30',
    'compass.transaction.lockPollInterval': '500',
    'compass.transaction.readCommitted.translog.connection': 'ram://'
]
This has concurrency switched off. You can make it asynchronous by changing the 'compass.transaction.processor.read_committed.concurrentOperations' setting to 'true'. (or removing the entry).
Compass configuration reference:
http://static.compassframework.org/docs/latest/core-configuration.html
Documentation for the concurrency of read_committed processor:
http://www.compass-project.org/docs/latest/reference/html/core-searchengine.html#core-searchengine-transaction-read_committed
If you want to keep async operations, you can also control the number of threads it uses. Using compass.transaction.processor.read_committed.concurrencyLevel=1 setting would allow asynchronous operations but just use one thread (the default is 5 threads). There are also the compass.transaction.processor.read_committed.backlog and compass.transaction.processor.read_committed.addTimeout settings.
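For example, the compassSettings map shown above could be adjusted like this to keep async processing but limit it to a single writer thread (a sketch; only the relevant entries are shown):
compassSettings = [
    'compass.transaction.processor': 'read_committed',
    // keep async operations, but with a single processor thread
    'compass.transaction.processor.read_committed.concurrentOperations': 'true',
    'compass.transaction.processor.read_committed.concurrencyLevel': '1'
]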
I hope this helps.
