Searchable index gets locked on manual update (LockObtainFailedException) - grails

We have a Grails project that runs behind a load balancer. There are three instances of the Grails application running on the server (using separate Tomcat instances). Each instance has its own searchable index. Because the indexes are separate, the automatic update is not enough to keep the index consistent between the application instances. We have therefore disabled searchable index mirroring, and updates to the index are done manually in a scheduled Quartz job. As far as we understand, no other part of the application should modify the index.
The Quartz job runs once a minute. It checks the database for rows that have been updated by the application and re-indexes those objects. The job also checks whether the same job is already running, so it doesn't do any concurrent indexing. The application runs fine for a few hours after startup, and then suddenly, when the job starts, a LockObtainFailedException is thrown:
22.10.2012 11:20:40 [xxxx.ReindexJob] ERROR Could not update searchable index, class org.compass.core.engine.SearchEngineException:
Failed to open writer for sub index [product]; nested exception is
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
out:
SimpleFSLock@/home/xxx/tomcat/searchable-index/index/product/lucene-a7bbc72a49512284f5ac54f5d7d32849-write.lock
According to the log the last time the job was executed, re-indexing was done without any errors and the job finished successfully. Still, this time the re-index operation throws the locking exception, as if the previous operation was unfinished and the lock had not been released. The lock will not be released until the application is restarted.
We tried to solve the problem by manually opening the locked index, which causes the following error to be printed to the log:
22.10.2012 11:21:30 [manager.IndexWritersManager ] ERROR Illegal state, marking an index writer as open, while another is marked as
open for sub index [product]
After this the job seems to work correctly and doesn't become stuck in a locked state again. However, the application then constantly uses 100% of the CPU. Below is a shortened version of the Quartz job code.
Any help would be appreciated to solve the problem, thanks in advance.
class ReindexJob {
def compass
...
static Calendar lastIndexed
static triggers = {
// Every day every minute (at xx:xx:30), start delay 2 min
// cronExpression: "s m h D M W [Y]"
cron name: "ReindexTrigger", cronExpression: "30 * * * * ?", startDelay: 120000
}
def execute() {
if (ConcurrencyHelper.isLocked(ConcurrencyHelper.Locks.LUCENE_INDEX)) {
log.error("Search index has been locked, not doing anything.")
return
}
try {
boolean acquiredLock = ConcurrencyHelper.lock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
if (!acquiredLock) {
log.warn("Could not lock search index, not doing anything.")
return
}
Calendar reindexDate = lastIndexed
Calendar newReindexDate = Calendar.instance
if (!reindexDate) {
reindexDate = Calendar.instance
reindexDate.add(Calendar.MINUTE, -3)
lastIndexed = reindexDate
}
log.debug("+++ Starting ReindexJob, last indexed ${TextHelper.formatDate("yyyy-MM-dd HH:mm:ss", reindexDate.time)} +++")
Long start = System.currentTimeMillis()
String reindexMessage = ""
// Retrieve the ids of products that have been modified since the job last ran
String productQuery = "select p.id from Product ..."
List<Long> productIds = Product.executeQuery(productQuery, ["lastIndexedDate": reindexDate.time, "lastIndexedCalendar": reindexDate])
if (productIds) {
reindexMessage += "Found ${productIds.size()} product(s) to reindex. "
final int BATCH_SIZE = 10
Long time = TimeHelper.timer {
for (int inserted = 0; inserted < productIds.size(); inserted += BATCH_SIZE) {
log.debug("Indexing from ${inserted + 1} to ${Math.min(inserted + BATCH_SIZE, productIds.size())}: ${productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size()))}")
Product.reindex(productIds.subList(inserted, Math.min(inserted + BATCH_SIZE, productIds.size())))
Thread.sleep(250)
}
}
reindexMessage += " (${time / 1000} s). "
} else {
reindexMessage += "No products to reindex. "
}
log.debug(reindexMessage)
// Re-index brands
Brand.reindex()
lastIndexed = newReindexDate
log.debug("+++ Finished ReindexJob (${(System.currentTimeMillis() - start) / 1000} s) +++")
} catch (Exception e) {
log.error("Could not update searchable index, ${e.class}: ${e.message}")
if (e instanceof org.apache.lucene.store.LockObtainFailedException || e instanceof org.compass.core.engine.SearchEngineException) {
log.info("This is a Lucene index locking exception.")
for (String subIndex in compass.searchEngineIndexManager.getSubIndexes()) {
if (compass.searchEngineIndexManager.isLocked(subIndex)) {
log.info("Releasing Lucene index lock for sub index ${subIndex}")
compass.searchEngineIndexManager.releaseLock(subIndex)
}
}
}
} finally {
ConcurrencyHelper.unlock(ConcurrencyHelper.Locks.LUCENE_INDEX, "ReindexJob")
}
}
}
Based on JMX CPU samples, it seems that Compass is doing some scheduling behind the scenes. Comparing 1-minute CPU samples from a normal instance and a 100% CPU instance shows a few differences:
org.apache.lucene.index.IndexWriter.doWait() is using most of the CPU time.
Compass Scheduled Executor Thread is shown in the thread list, this was not seen in a normal situation.
One Compass Executor Thread is doing commitMerge, in a normal situation none of these threads was doing commitMerge.

You can try increasing the 'compass.transaction.lockTimeout' setting. The default is 10 (seconds).
Another option is to disable concurrency in Compass and make it synchronous. This is controlled with the 'compass.transaction.processor.read_committed.concurrentOperations': 'false' setting. You might also have to set 'compass.transaction.processor' to 'read_committed'.
These are the compass settings we are currently using:
compassSettings = [
'compass.engine.optimizer.schedule.period': '300',
'compass.engine.mergeFactor':'1000',
'compass.engine.maxBufferedDocs':'1000',
'compass.engine.ramBufferSize': '128',
'compass.engine.useCompoundFile': 'false',
'compass.transaction.processor': 'read_committed',
'compass.transaction.processor.read_committed.concurrentOperations': 'false',
'compass.transaction.lockTimeout': '30',
'compass.transaction.lockPollInterval': '500',
'compass.transaction.readCommitted.translog.connection': 'ram://'
]
This has concurrency switched off. You can make it asynchronous by changing the 'compass.transaction.processor.read_committed.concurrentOperations' setting to 'true' (or by removing the entry).
Compass configuration reference:
http://static.compassframework.org/docs/latest/core-configuration.html
Documentation for the concurrency of read_committed processor:
http://www.compass-project.org/docs/latest/reference/html/core-searchengine.html#core-searchengine-transaction-read_committed
If you want to keep async operations, you can also control the number of threads it uses. Using compass.transaction.processor.read_committed.concurrencyLevel=1 setting would allow asynchronous operations but just use one thread (the default is 5 threads). There are also the compass.transaction.processor.read_committed.backlog and compass.transaction.processor.read_committed.addTimeout settings.
I hope this helps.


When does reactor execute a subscription chain?

The reactor documentation states the following:
Nothing happens until you subscribe
If that was true, why do I see a java.lang.NullPointerException when I run the following code snippet, which has a reactor chain without a subscription?
@Test
void test() {
    String a = null;
    Flux.just(a.toLowerCase())
        .doOnNext(System.out::println);
}
Deepak,
"Nothing happens" means that no data will flow through the chain of your functions to your consumers until a subscription happens.
You're getting the NPE because Java evaluates the argument a.toLowerCase() at the Flux definition step, before the hot operator just() even receives a value. just() can only wrap a value that already exists at definition time.
You can also convert just() to a cold operator using defer(), so you will receive the NPE only after a subscription happens:
public Flux<String> test() {
String a = null;
return Flux.defer(() -> Flux.just(a.toLowerCase()))
.doOnNext(System.out::println);
}
Please read more about hot vs cold operators.
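The difference between evaluating at definition time and evaluating on demand can be reproduced without Reactor. The following is only a plain-Java sketch: the just and defer methods here are hypothetical stand-ins built on java.util.function.Supplier, not the Reactor API, and calling get() plays the role of subscribing.

```java
import java.util.function.Supplier;

class EagerVsDeferred {
    // Stand-in for an eager factory such as Flux.just(value): the caller
    // evaluates the argument expression before this method is entered.
    static Supplier<String> just(String value) {
        return () -> value;
    }

    // Stand-in for a deferring factory such as Flux.defer(supplier): the
    // supplier body only runs when get() is called (our "subscription").
    static Supplier<String> defer(Supplier<String> supplier) {
        return supplier;
    }

    // True if building the eager pipeline already throws.
    static boolean throwsAtDefinition(String a) {
        try {
            just(a.toLowerCase());
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // True if building the deferred pipeline is safe and only get() throws.
    static boolean deferredThrowsOnlyOnGet(String a) {
        Supplier<String> deferred;
        try {
            deferred = defer(() -> a.toLowerCase());
        } catch (NullPointerException e) {
            return false; // would mean the body ran at definition time
        }
        try {
            deferred.get();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("eager throws at definition: " + throwsAtDefinition(null));
        System.out.println("deferred throws only on get: " + deferredThrowsOnlyOnGet(null));
    }
}
```

Both checks report true for a null input, which matches the behaviour described above: the eager form fails while the chain is being built, the deferred form fails only once something asks for the value.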
Update:
A small example of cold and hot publishers. Each time a new subscription happens, the cold publisher's body is re-evaluated. Meanwhile, just() only emits the time that was calculated once, at definition time.
Mono<Date> currentTime = Mono.just(Calendar.getInstance().getTime());
Mono<Date> realCurrentTime = Mono.defer(() -> Mono.just(Calendar.getInstance().getTime()));
// 1 sec sleep
Thread.sleep(1000);
currentTime.subscribe(time -> System.out.println("Current Time " + time.getTime()));
realCurrentTime.subscribe(time -> System.out.println("Real current Time " + time.getTime()));
Thread.sleep(2000);
currentTime.subscribe(time -> System.out.println("Current Time " + time.getTime()));
realCurrentTime.subscribe(time -> System.out.println("Real current Time " + time.getTime()));
The output is:
Current Time 1583788755759
Real current Time 1583788756826
Current Time 1583788755759
Real current Time 1583788758833

Issue with Quartz Grails plugin

I have a Grails application with a Quartz job running in it. The job contains code similar to the following.
class MyJob{
static triggers = {}
def printLog(msg){
String threadId = Thread.currentThread().getId()
String threadName = Thread.currentThread().getName()
log.info(threadId+" - "+threadName+" : "+msg)
}
def execute(context)
{
printLog("Before Sync");
synchronized(MyJob){
printLog("Inside Sync");
try{
printLog("Before sleep 20 minutes")
Thread.sleep(1200000)
printLog("After sleep")
}catch (Exception e){
log.error("Error while sleeping")
}
}
printLog("After Sync")
}
}
I have scheduled it to trigger a job every minute.
I can see in the logs that one thread enters the synchronized block and then the other jobs start piling up, waiting for that thread to finish. This is working as expected.
The issue is that the jobs stop after 10 minutes, by which time 10 threads have been created. One of them is sleeping for 20 minutes and the other 9 are waiting for the first thread to release the lock. Why are no new jobs created?
I saw in some answers that I can fix the issue by modifying my triggers section like below:
static triggers = {
simple repeatInterval: 100
}
I tried the above option and it still shows only 10 jobs.
Where does the default value of 10 come from?
How can I modify the value so that new jobs keep being created indefinitely?
I am new to Grails and Quartz, so I have no idea what is happening.
I think the Grails plugin sets the threadCount to 10 in the bundled quartz.properties file. Assuming you're using Grails 3, you can override it in application.yml like this:
quartz:
    threadPool:
        threadCount: 25
Grails 2 - application.groovy
quartz {
    props {
        threadPool.threadCount = 100
    }
}
In general, it's not a good idea to block the job thread with sleeps.
If a job runs a long process, you should split it into several jobs in order to release the thread as soon as possible.
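To see why the number of running jobs stops at the thread count, here is a small plain-Java sketch of a saturated fixed-size pool. It uses java.util.concurrent rather than Quartz, so the class and method names are made up for illustration, and the comparison is only an analogy: an ExecutorService queues the extra tasks, whereas Quartz waits for a free worker thread before firing the next trigger. The cap on concurrently running jobs is the same idea in both.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolSaturation {
    // Submits `jobs` blocking tasks to a pool of `threads` workers and returns
    // how many tasks had started while every running task was still blocked.
    static int startedWhileBlocked(int threads, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger started = new AtomicInteger();
        CountDownLatch workersBusy = new CountDownLatch(Math.min(threads, jobs));
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < jobs; i++) {
            pool.submit(() -> {
                started.incrementAndGet();
                workersBusy.countDown();
                try {
                    release.await(); // stands in for the long sleep or lock wait
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            workersBusy.await(5, TimeUnit.SECONDS);
            Thread.sleep(200); // give any extra task a chance to start (none can)
            return started.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // unblock the workers so the pool can drain
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 15 jobs, 10 workers: only 10 jobs are running at once
        System.out.println("started: " + startedWhileBlocked(10, 15));
    }
}
```

With 10 workers and 15 blocking jobs, only 10 jobs start. Raising the thread count moves the ceiling but does not remove it, which is why shortening or splitting the job is the more durable fix.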

Quartz 2.2.1, JMX jobruntime always -1?

Is it normal that in Quartz, for the JMX Attribute CurrentlyExecutingJobs=> [item] => jobRunTime always is "-1" while it is currently running, or is there some setting in Quartz to ensure the jobRunTime is updated appropriately?
(confirmed via jconsole, Mission Control, and jmx code)
Usecase is to track/monitor long-running jobs, and thought jobRunTime would be the appropriate path. The alternative path is "fireTime" + CURRENT_NOW calculation, but wanted to avoid extra calculation if it was already occurring somewhere.
After chasing this around: this particular value is not updated unless it is set manually. Tools that monitor Quartz jobs, such as JavaMelody, have to calculate it every time too:
elapsedTime = System.currentTimeMillis()- quartzAdapter.getContextFireTime(jobExecutionContext).getTime();
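As a minimal sketch of that calculation in isolation (the class and method names are hypothetical, and the fire time is passed in as a plain java.util.Date rather than read from a Quartz context):

```java
import java.util.Date;

class ElapsedTime {
    // Elapsed run time in milliseconds: "now" minus the time the job fired.
    // Subtracting in the other order would yield a negative number.
    static long elapsedMillis(Date fireTime, long nowMillis) {
        return nowMillis - fireTime.getTime();
    }

    public static void main(String[] args) {
        Date fireTime = new Date(System.currentTimeMillis() - 5_000); // pretend the job fired 5 s ago
        System.out.println("elapsed ms: " + elapsedMillis(fireTime, System.currentTimeMillis()));
    }
}
```

Passing the current time as a parameter keeps the helper easy to test. In a real monitor the fire time would come from the job execution context.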
If you want to manually update the jobruntime value for long-running jobs to check the value rather than calculating it outside, you have to change every job you have to support this feature. Here is a rough example that can be modified for your needs sourced from: https://github.com/dhartford/quartz-snippets/blob/master/update_jobruntime_timer_innerclass
/**
* inner class to handle scheduled updates of the Quartz jobruntime attribute
*/
class UpdateJobTimer extends TimerTask {
    private JobExecutionContextImpl jec;

    /* usage example, such as at the start of the execute method of the Job interface:
     * Timer timer = new Timer();
     * //update every 10 seconds (in milliseconds), whatever poll timing you want
     * timer.schedule(new UpdateJobTimer(jec), 0, 10000);
     * ...
     * timer.cancel(); //do cleanup in all appropriate spots
     */
    UpdateJobTimer(JobExecutionContextImpl jec) {
        this.jec = jec;
    }

    @Override
    public void run() {
        // elapsed run time = now - fire time (the reverse order gives a negative value)
        long runtimeinms = new java.util.Date().getTime() - jec.getFireTime().getTime();
        jec.setJobRunTime(runtimeinms);
        System.out.println("DEBUG TIMERTASK on JOB: " + jec.getJobDetail().getKey().getName() + " triggered [" + jec.getFireTime() + "] updated [" + new java.util.Date() + "]" );
    }
}

Jedis/Redis SocketTimeout exception on Lua scripts

We are using Lua scripts to perform batch deletes of data on updates to our DB. Jedis executes the Lua script using a pipeline.
local result = redis.call('lrange',key,0,12470)
for i,k in ipairs(result) do
redis.call('del',k)
redis.call('ltrim',key,1,k)
end
try (Jedis jedis = jedisPool.getResource()) {
Pipeline pipeline = jedis.pipelined();
long len = jedis.llen(table);
String script = String.format(DELETE_LUA_SCRIPT, table, len);
LOGGER.info(script);
pipeline.eval(script);
pipeline.sync();
} catch (JedisConnectionException e) {
LOGGER.info(e.getMessage());
}
For large ranges we notice that the Lua scripts slow down and we get SocketTimeoutExceptions.
Running redis-cli slowlog displays only the Lua scripts that have taken too long to execute.
Is there a better way to do this? Is my Lua script blocking?
When I use just a pipeline to do the batch deletes, the slowlog also returns slow queries.
try (Jedis jedis = jedisPool.getResource()) {
Pipeline pipeline = jedis.pipelined();
long len = jedis.llen(table);
List<String> queriesContainingTable = jedis.lrange(table,0,len);
if(queriesContainingTable.size() > 0) {
for (String query: queriesContainingTable) {
pipeline.del(query);
pipeline.lrem(table,1,query);
}
pipeline.sync();
}
} catch (JedisConnectionException e) {
LOGGER.info("CACHE INVALIDATE FAIL:"+e.getMessage());
}
The slowlog only keeps the most recent 128 entries (this can be changed in redis.conf with slowlog-max-len 128). Your first approach, using the Lua script, is certainly a blocking one.
If you delete that many keys (12470) one by one inside a script, it blocks, because Redis runs the script to completion before serving other clients. Of the two approaches, the second one (using a pipeline) looks fine to me, because you avoid the iteration inside the script and simply issue the del command n times.
You can use del with multiple keys, deleting 100 or 1000 at a time (whichever proves optimal after a little testing), and group those calls into a pipeline.
Or, if you can do without atomicity, you can delete 100 or 1000 keys at a time in a loop, so that no single call blocks for long.
Try different combinations, measure them, and go with the one that performs best.
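The chunking itself can be sketched in plain Java. The class and method names below are hypothetical, and the Redis call is abstracted behind a Consumer so the example runs without a server. With Jedis, the callback could be something like batch -> jedis.del(batch), since del accepts several keys in one call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchedDelete {
    // Splits `keys` into chunks of at most `batchSize` and hands each chunk to
    // `deleteMany` (a stand-in for one multi-key DEL). Returns the call count.
    static int deleteInBatches(List<String> keys, int batchSize, Consumer<String[]> deleteMany) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        for (int from = 0; from < keys.size(); from += batchSize) {
            int to = Math.min(from + batchSize, keys.size());
            deleteMany.accept(keys.subList(from, to).toArray(new String[0]));
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("query:" + i); // made-up key names
        }
        int calls = deleteInBatches(keys, 1000,
                batch -> System.out.println("would delete " + batch.length + " keys"));
        System.out.println("calls: " + calls);
    }
}
```

For 2500 keys and a batch size of 1000 this issues three calls (1000, 1000 and 500 keys), so each individual command stays short and other clients get served in between.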

Memory leak of AKKA Actor

I have a simple test program to try ...
object ActorLeak extends App {
val system = ActorSystem("ActorLeak")
val times = 100000000
for (i <- 1 to times) {
val myActor = system.actorOf(Props(classOf[TryActor], i), name = s"TryActor-$i")
//Thread sleep 100
myActor ! StopCmd
if (i % 10000 == 0)
println(s"Completed $i")
}
println(s"Creating and stopping $times end.")
val hookThread = new Thread(new Runnable {
def run() {
system.shutdown()
}
})
Runtime.getRuntime.addShutdownHook(hookThread)
}
case object StopCmd
class TryActor(no: Int) extends Actor {
def receive = {
case StopCmd => context stop self
}
}
What I found: sometimes an OutOfMemoryError, sometimes the JVM dies, and it runs more and more slowly ...
Is there a memory leak in the creation / stopping of actors?
Actor creation and messaging are both asynchronous. When actorOf returns, this does not mean the actor has been created yet, and when ! returns, it does not mean the actor has received or acted upon the message.
This means that you are not actually creating and stopping an actor in each iteration; you only trigger the creation and send a message. The loop is probably queueing up actor creations faster than the messages can arrive and trigger the stopping of the actors, which fills up the heap of your JVM.
To do what I think you are trying to do, you would have to send a response from the actor upon receiving the StopCmd and wait for that response inside your loop before continuing with the next iteration. This can be done with the ask pattern together with Await.result to block the main thread until the actor's reply has arrived.
Note that this is only useful for your understanding and not something that you would do in an actual system using Akka.
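The effect of waiting for an acknowledgement in each iteration can be shown with a plain-Java analogy. This is not Akka code: a CompletableFuture stands in for the reply obtained through the ask pattern, join() stands in for Await.result, and the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitEachStop {
    // Starts `times` asynchronous "create and stop" operations, waiting for each
    // one to finish before starting the next. Returns the highest number of
    // operations that were outstanding at the same time.
    static int maxInFlight(int times) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        AtomicInteger inFlight = new AtomicInteger();
        int max = 0;
        try {
            for (int i = 0; i < times; i++) {
                max = Math.max(max, inFlight.incrementAndGet());
                // The async task plays the actor handling StopCmd and replying.
                CompletableFuture<Void> stopped =
                        CompletableFuture.runAsync(inFlight::decrementAndGet, worker);
                stopped.join(); // block until the "reply" has arrived
            }
        } finally {
            worker.shutdown();
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("max in flight: " + maxInFlight(10_000));
    }
}
```

Because the loop blocks on every reply, at most one operation is ever outstanding, so pending work cannot pile up on the heap. Without the join() call the loop would race ahead of the worker, which is the situation described above.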
