Pooled LdapTemplate stalls for minutes during context validation - grails

I'm trying to use Spring LDAP's LdapTemplate to retrieve information from an LDAP source during a REST service implementation and, while I think I have a working configuration, we're noticing intermittent stalls of up to 15 minutes when the service is hit. Logging shows that the stall happens during the ldapTemplate.search() call.
My beans:
contextSourceTarget(org.springframework.ldap.core.support.LdapContextSource) {
    urls = ["https://someldapsource.com"]
    userDn = 'uid=someaccount,ou=xxx,cn=users,dc=org,dc=com'
    password = 'somepassword'
    pooled = true
}

dirContextValidator(org.springframework.ldap.pool2.validation.DefaultDirContextValidator)

poolConfig(org.springframework.ldap.pool2.factory.PoolConfig) {
    testOnBorrow = true
    testWhileIdle = true
}

ldapContextSource(org.springframework.ldap.pool2.factory.PooledContextSource, ref('poolConfig')) {
    contextSource = ref('contextSourceTarget')
    dirContextValidator = ref('dirContextValidator')
}

ldapTemplate(LdapTemplate, ref('ldapContextSource')) {}
I expect this application could be hitting LDAP several times concurrently (via concurrent REST calls to this app) to retrieve data for different users. Here's the code that makes that call:
List attrs = ['uid', 'otherattr1', 'otherattr2']
// this just returns a Map containing the key/value pairs of the attrs passed in here
LdapNamedContextMapper mapper = new LdapNamedContextMapper(attrs)

log.debug("getLdapUser:preLdapSearch")
List<Map> results = ldapTemplate.search(
        'cn=grouproot,cn=Groups,dc=org,dc=com',
        'uniquemember=userNameImsearchingfor',
        SearchControls.SUBTREE_SCOPE,
        attrs as String[], mapper)
log.debug("getLdapUser:postLdapSearch")
Unfortunately, at seemingly random times, the timestamp difference between the preLdapSearch and postLdapSearch logs is upwards of 15 minutes. Obviously this is bad, and it would seem to be a pool management issue.
So I turned on debug logging for packages org.springframework.ldap and org.apache.commons.pool2
And now when this happens I get the following in the logs:
2018-09-20 20:18:46.251 DEBUG appEvent="getLdapUser:preLdapSearch"
2018-09-20 20:35:03.246 DEBUG A class javax.naming.ServiceUnavailableException - not explicitly configured to be a non-transient exception - encountered; ignoring.
2018-09-20 20:35:03.249 DEBUG DirContext 'javax.naming.ldap.InitialLdapContext@1f4f37b4' failed validation with an exception.
javax.naming.ServiceUnavailableException: my.ldaphost.com:636; socket closed
at com.sun.jndi.ldap.Connection.readReply(Connection.java:454)
at com.sun.jndi.ldap.LdapClient.getSearchReply(LdapClient.java:638)
at com.sun.jndi.ldap.LdapClient.getSearchReply(LdapClient.java:638)
at com.sun.jndi.ldap.LdapClient.search(LdapClient.java:561)
at com.sun.jndi.ldap.LdapCtx.doSearch(LdapCtx.java:1985)
at com.sun.jndi.ldap.LdapCtx.searchAux(LdapCtx.java:1844)
at com.sun.jndi.ldap.LdapCtx.c_search(LdapCtx.java:1769)
at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(ComponentDirContext.java:392)
(LOTS OF STACK TRACE REMOVED)
2018-09-20 20:35:03.249 DEBUG Closing READ_ONLY DirContext='javax.naming.ldap.InitialLdapContext@1f4f37b4'
2018-09-20 20:35:03.249 DEBUG Closed READ_ONLY DirContext='javax.naming.ldap.InitialLdapContext@1f4f37b4'
2018-09-20 20:35:03.249 DEBUG Creating a new READ_ONLY DirContext
2018-09-20 20:35:03.787 DEBUG Created new READ_ONLY DirContext='javax.naming.ldap.InitialLdapContext@5239386d'
2018-09-20 20:35:03.838 DEBUG DirContext 'javax.naming.ldap.InitialLdapContext@5239386d' passed validation.
2018-09-20 20:35:03.890 DEBUG appEvent="getLdapUser:postLdapSearch"
Questions:
How can I find out more? I've got debug logging turned on for org.springframework.ldap and org.apache.commons.pool2
Why does it seem to take 15+ minutes to determine that a connection is stale/unusable? How can I configure things so that this happens much sooner?

There is a good chance that the underlying LDAP system is having connection issues.
You could try adding timeouts in the connection pool settings:
max-wait - default is -1
eviction-run-interval-millis - you may want to set this to control how often to check for problems
Docs: https://docs.spring.io/spring-ldap/docs/current/reference/#pool-configuration
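For illustration, with the bean DSL from the question the pool settings could be extended like the sketch below. Treat the property names as an assumption to check against your spring-ldap version: they mirror commons-pool2's GenericKeyedObjectPoolConfig (maxWaitMillis for max-wait, timeBetweenEvictionRunsMillis for eviction-run-interval-millis), and the values are arbitrary.

poolConfig(org.springframework.ldap.pool2.factory.PoolConfig) {
    testOnBorrow = true
    testWhileIdle = true
    // max-wait: fail a borrow after 5s instead of blocking indefinitely (default -1 = wait forever)
    maxWaitMillis = 5000
    // eviction-run-interval-millis: run the idle-object evictor every 30s;
    // testWhileIdle only has an effect when this interval is > 0
    timeBetweenEvictionRunsMillis = 30000
}

With an eviction run configured, stale connections are more likely to be validated and closed in the background instead of while a request is waiting on them.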

Related

Are stored procedures in Cosmos DB automatically retried on conflict?

Stored procedures in Cosmos DB are transactional and run under snapshot isolation with optimistic concurrency control. That means write conflicts can occur, but they are detected and the transaction is rolled back.
If such a conflict occurs, does Cosmos DB automatically retry the stored procedure, or does the client receive an exception (maybe an HTTP 412 Precondition Failed?) and need to implement the retry logic itself?
I tried running 100 instances of a stored procedure in parallel that would produce a write conflict by reading a document (without setting _etag), waiting for a while, and then incrementing an integer property within that document (again without setting _etag).
In all trials so far, no errors occurred, and the result was as if the 100 runs were run sequentially. So the preliminary answer is: yes, Cosmos DB automatically retries running an SP on write conflicts (or perhaps enforces transactional isolation by some other means like locking), so clients hopefully don't need to worry about aborted SPs due to conflicts.
It would be great to hear from a Cosmos DB engineer how this is achieved: retry, locking or something different?
You're correct in that this isn't properly documented anywhere. Here's how an OCC check can be done in a stored procedure:
function storedProcedureWithEtag(newItem) {
    var context = getContext();
    var collection = context.getCollection();
    var response = context.getResponse();

    if (!newItem) {
        throw 'Missing item';
    }

    // update the item to set changed time
    newItem.ChangedTime = (new Date()).toISOString();

    var etagForOcc = newItem._etag;
    var upsertAccepted = collection.upsertDocument(
        collection.getSelfLink(),
        newItem,
        { etag: etagForOcc }, // <-- pass in the etag
        function (err2, feed2, options2) {
            if (err2) throw err2;
            response.setBody(newItem);
        }
    );

    if (!upsertAccepted) {
        throw "Unable to upsert item. Id: " + newItem.id;
    }
}
Credit: https://peter.intheazuresky.com/2016/12/22/documentdb-optimistic-concurrency-in-a-stored-procedure/
The SDK does not retry on a 412. 412 failures are related to optimistic concurrency, and in those cases you are controlling the ETag that you are passing. The user is expected to handle the 412 by reading the newest version of the document, obtaining the newer ETag, and retrying the operation with the updated value.
Example for V3 SDK
Example for V2 SDK
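The linked examples are for the .NET SDKs; purely as an illustration of the same read-latest / retry-on-412 pattern, here is a hedged sketch using the Azure Cosmos Java SDK (v4) from Groovy. The container, id, partition key and the counter field are placeholders, not anything from the question.

import com.azure.cosmos.CosmosContainer
import com.azure.cosmos.CosmosException
import com.azure.cosmos.models.CosmosItemRequestOptions
import com.azure.cosmos.models.PartitionKey
import com.fasterxml.jackson.databind.node.ObjectNode

// Read the latest version of the document, mutate it, and replace it with an
// If-Match ETag. On a 412 (someone else updated it first) re-read and retry.
void incrementWithOcc(CosmosContainer container, String id, String pk, int maxAttempts = 5) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        def readResponse = container.readItem(id, new PartitionKey(pk), ObjectNode)
        ObjectNode doc = readResponse.item
        String etag = readResponse.getETag()

        doc.put('counter', doc.path('counter').asInt(0) + 1)

        def options = new CosmosItemRequestOptions()
        options.setIfMatchETag(etag)   // only apply the replace if the ETag still matches
        try {
            container.replaceItem(doc, id, new PartitionKey(pk), options)
            return                     // replaced successfully
        } catch (CosmosException e) {
            if (e.statusCode != 412) {
                throw e                // only precondition failures are retried here
            }
            // 412: stale ETag, loop around and re-read the newest version
        }
    }
    throw new IllegalStateException("Gave up after ${maxAttempts} conflicting updates for id=${id}")
}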

Query timeout in Neo4j 3.0.6

It looks like the previously working approach is deprecated now:
unsupported.dbms.executiontime_limit.enabled=true
unsupported.dbms.executiontime_limit.time=1s
According to the documentation, new settings are responsible for timeout handling:
dbms.transaction.timeout
dbms.transaction_timeout
At the same time, the new settings look related to transactions.
The new timeout settings do not appear to work. They were set in neo4j.conf as follows:
dbms.transaction_timeout=5s
dbms.transaction.timeout=5s
A slow Cypher query isn't terminated.
Then a Neo4j plugin was added to model a slow query with a transaction:
@Procedure("test.slowQuery")
public Stream<Res> slowQuery(@Name("delay") Number delay)
{
    ArrayList<Res> res = new ArrayList<>();
    try (Transaction tx = db.beginTx()) {
        Thread.sleep(delay.intValue(), 0);
        tx.success();
    } catch (Exception e) {
        System.out.println(e);
    }
    return res.stream();
}
The function exposed by the plugin is called via the neoism Golang package, and the timeout isn't triggered there either.
The timeout is only honored if your procedure code either invokes operations on the graph, like reading nodes and relationships, or explicitly checks whether the current transaction is marked as terminated.
For the latter, see https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/master/src/main/java/apoc/util/Utils.java#L41-L51 as an example.
According to the documentation the transaction guard is interested in orphaned transactions only.
The server guards against orphaned transactions by using a timeout. If there are no requests for a given transaction within the timeout period, the server will roll it back. You can configure the timeout in the server configuration, by setting dbms.transaction_timeout to the number of seconds before timeout. The default timeout is 60 seconds.
I've not found a way to trigger the timeout for a query that isn't orphaned using native functionality alone.
@StefanArmbruster pointed in a good direction. The timeout-triggering functionality can be obtained by creating a wrapper function in a Neo4j plugin, as is done in apoc.
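For reference, a minimal sketch of such a wrapper (written in Groovy here purely for illustration; Neo4j plugins are normally Java, and the class, procedure and Res record names are placeholders modelled on the question). Following the answer above, it sleeps in short slices and touches the graph on every slice, so the transaction guard gets a chance to terminate the transaction once dbms.transaction.timeout is exceeded:

import org.neo4j.graphdb.GraphDatabaseService
import org.neo4j.graphdb.Transaction
import org.neo4j.procedure.Context
import org.neo4j.procedure.Name
import org.neo4j.procedure.Procedure

import java.util.stream.Stream

class SlowQueryInterruptible {

    @Context
    public GraphDatabaseService db

    @Procedure("test.slowQueryInterruptible")
    Stream<Res> slowQueryInterruptible(@Name("delay") Number delay) {
        long deadline = System.currentTimeMillis() + delay.longValue()
        Transaction tx = db.beginTx()
        try {
            while (System.currentTimeMillis() < deadline) {
                Thread.sleep(100)
                // any graph operation notices a terminated transaction and
                // throws TransactionTerminatedException, ending the procedure
                db.execute("RETURN 1").close()
            }
            tx.success()
        } finally {
            tx.close()
        }
        return Stream.empty()
    }

    // trivial result record required by the procedure API
    static class Res {
        public String value
    }
}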

Huge performance drop in cassandra-orm after million records

I'm using the cassandra-orm plugin (cassandra-orm:0.4.5) for migrating clicks from a Postgres DB to Cassandra. (I know I could use a raw data import, but I want to make use of groupBy and the explicit indexes maintained by the plugin.)
The migration procedure is simple: I select a bunch of clicks from Postgres (via GORM) and then flush them to Cassandra. Every click is a new record, and a new object is created in Grails and saved in Cassandra. With 20 threads I was able to reach a throughput of 2000 clicks/sec. After importing 5 million clicks the performance started to degrade dramatically, to 50 clicks/sec.
I did some profiling and found that 19 threads were waiting (parked) and one thread was performing a rehash on Groovy's AbstractConcurrentMapBase.
stack trace for waiting threads:
Name: pool-4-thread-2
State: WAITING on org.codehaus.groovy.util.ManagedConcurrentMap$Segment@5387f7af
Total blocked: 45,027 Total waited: 55,891
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:34)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:101)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.nosql.Click.getProperty(Click.groovy)
stack trace for rehash thread:
Name: pool-4-thread-11
State: RUNNABLE
Total blocked: 46,544 Total waited: 57,433
Stack trace:
org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.rehash(AbstractConcurrentMapBase.java:217)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:105)
org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:97)
org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:35)
org.codehaus.groovy.runtime.metaclass.ThreadManagedMetaBeanProperty$ThreadBoundGetter.invoke(ThreadManagedMetaBeanProperty.java:180)
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:1604)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1140)
groovy.lang.MetaClassImpl.getProperty(MetaClassImpl.java:3332)
groovy.lang.ExpandoMetaClass.getProperty(ExpandoMetaClass.java:1152)
com.fma.nosql.Click.getProperty(Click.groovy)
After hours of debugging I found that the issue is in the dynamic property "_cassandra_cluster_", which is added to all plugin-managed objects:
// cluster property (_cassandra_cluster_)
clazz.metaClass."${CLUSTER_PROP}" = null
This property is then internally stored in the ThreadManagedMetaBeanProperty instance2Prop map. When the dynamic property is accessed (def cluster = click._cassandra_cluster_), the click instance is saved into the instance2Prop map with a soft reference. So far so good; soft references can be garbage collected, right? However, there seems to be a bug in the ManagedConcurrentMap implementation which disregards the garbage-collected elements and keeps rehashing and expanding the map (described here and here).
Workaround
Since the map is internally kept at the class level, the only working solution was to restart the server. Eventually I developed a dirty workaround which clears the zombie elements from the internal map. The following code runs in a separate thread:
public void rehashClickSegmentsIfNecessary() {
    ManagedConcurrentMap instanceMap = lookupInstanceMap(Click.class, "_cassandra_cluster_")
    if (instanceMap.fullSize() - instanceMap.size() > 50000) {
        // we have more than 50 000 zombie references in the map
        rehashSegments(instanceMap)
    }
}

private void rehashSegments(ManagedConcurrentMap instanceMap) {
    org.codehaus.groovy.util.ManagedConcurrentMap.Segment[] segments = instanceMap.segments
    for (int i = 0; i < segments.length; i++) {
        segments[i].lock()
        try {
            segments[i].rehash()
        } finally {
            segments[i].unlock()
        }
    }
}

private ManagedConcurrentMap lookupInstanceMap(Class clazz, String prop) {
    MetaClassRegistry registry = GroovySystem.metaClassRegistry
    MetaClassImpl metaClass = registry.getMetaClass(clazz)
    return metaClass.getMetaProperty(prop, false).instance2Prop
}
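For completeness, here is one way such a cleanup could be scheduled (only a sketch; the one-minute period is arbitrary):

import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// periodically sweep the zombie soft references out of the instance map
def cleaner = Executors.newSingleThreadScheduledExecutor()
cleaner.scheduleWithFixedDelay(
        { rehashClickSegmentsIfNecessary() } as Runnable,
        1, 1, TimeUnit.MINUTES)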
Do you have any production experience with cassandra-orm or any other Grails plugin connecting to Cassandra?

Timeout for Futures in Akka

We have a server that processes portfolios and the securities inside them in different actors. For portfolios with a smaller number of securities (<20) this works fine. When I increase the security count to 1000, I encounter the following issue:
akka.dispatch.FutureTimeoutException: Futures timed out after [5000] milliseconds
I could bypass this error by increasing the timeout inside the Akka config; is that the right thing to do? In Akka versions earlier than 1.2 I could set self.timeout inside the actor, but that is deprecated.
The other issue I face (intermittently) is that the entire server hangs while joining the futures in the futures.map code inside my portfolio actor:
// fork out for each security
val listOfFutures = new ListBuffer[Future[Security]]()
for (security <- portfolio.getSecurities.toList) {
    val securityProcessor = actorOf[SecurityProcessor].start()
    listOfFutures += (securityProcessor ? security) map {
        _.asInstanceOf[Security]
    }
}

EventHandler.info(this, "joining results from security processors")

// join for each security
val futures = Future.sequence(listOfFutures.toList)
futures.map {
    listOfSecurities =>
        portfolioResponse = MergeHelper.merge(portfolio, listOfSecurities)
}.get
You do not state which version of Akka you're on, and given my limited time with the crystal ball I'll assume that you're on 1.2.
You can specify a Timeout when you call ask/?
(Also, your code is a bit convoluted, but that I have already solved in your other question.)
Cheers,
√

Grails Connections behaving very differently in Integration test

We have a custom data source that extends BasicDataSource, with an overridden getConnection method that does a couple of things inside it. When we run the webapp outside of testing and call a service from a controller, it grabs a new connection and uses that connection until the service is done. All is well. However, inside an integration test, the connection appears to be grabbed before the test even calls the controller. The flow is below.
Regular Run:
call controller -> controller calls service method -> connection is grabbed -> service method is run and returns to controller
Integration Test:
connection is grabbed -> call controller from test -> controller calls service method -> service method is run and returns to controller
Needless to say, this is giving us problems as having the correct connection is very important for our app. Thoughts?
Edit: Still getting significant issues with this. We've reached a point where we have to avoid creating integration tests, or do some manual connection switching (which defeats half the point of the tests)
DataSource.groovy
dataSource {
    pooled = true
    dialect = "org.hibernate.dialect.OracleDialect"
    properties {
        maxActive = 50
        maxIdle = 10
        initialSize = 10
        minEvictableIdleTimeMillis = 1800000
        timeBetweenEvictionRunsMillis = 1800000
        maxWait = 10000
        testWhileIdle = true
        numTestsPerEvictionRun = 3
        testOnBorrow = true
    }
}
hibernate {
    cache.use_second_level_cache = true
    cache.use_query_cache = true
    cache.provider_class = 'net.sf.ehcache.hibernate.EhCacheProvider'
}
This is not a final answer, but I believe it explains what is going on:
Running as a web app: your Service class has a transactionManager, which has a sessionFactory, which gets the connection! So in this case, assuming your service is transactional = true, every method you call on the service begins with a Session.beginTransaction() (there is a Grails proxy that does that when you set transactional = true), and that call stack eventually reaches getConnection().
Running as an integration test: since Grails doesn't commit your DB changes, it always rolls them back! I believe that when your integration test starts, Grails creates a transaction right away, so it can roll it back afterwards (which makes total sense, right?). You can confirm that by taking a look at the class org.codehaus.groovy.grails.test.support.GrailsTestInterceptor: its init() method is called before your services in your integration test. That's why getConnection() is being called before everything else!
Suggestion:
You can try setting your integration test class to transactional = false and see if getConnection() doesn't get called at the beginning!
Go to the Transactions section in here to see more!
Just don't forget that if you set transactional = false, you will have to roll back (or clean up) your changes yourself in the test!
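As a minimal sketch of that suggestion (assuming an older JUnit-style Grails integration test; the class, service and method names are placeholders):

// Disabling the per-test transaction means the test no longer grabs a
// connection before the controller/service runs, but it also means Grails
// will NOT roll back your changes, so clean up manually.
class MyControllerIntegrationTests extends GroovyTestCase {

    static transactional = false   // turn off the automatic wrapping transaction

    def myService

    void testConnectionIsGrabbedInsideTheServiceCall() {
        def result = myService.doWork()   // the custom getConnection() should fire here
        assertNotNull result

        // manual cleanup goes here (no automatic rollback any more)
    }
}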
