We're using Neo4j (3.1.5-enterprise) over HTTP for one of our services.
We set dbms.transaction.timeout=150s in our Neo4j config file.
We have a scenario that may take longer than 150 seconds, but we would like the transaction to expire after 150 seconds anyway.
For some reason that is not happening: the transaction continues until it has fully executed instead of being stopped after 150 seconds. Any guess why?
In our application logs I can see the following exception (more stacktrace details below):
neo.db.NeoHttpDriver - Errors in response:
[NeoResponseError{
code='Neo.DatabaseError.Statement.ExecutionFailed',
message='Transaction timeout. (Overtime: 23793 ms).',
stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
...
Also, in this specific scenario (which may take a long time), our service generally opens a transaction, locks a common entity, and proceeds. Since the transaction is not expired and released after 150 seconds (so the common entity stays locked), other threads may also be blocked for a long time.
Thanks!
Orel
Exception stacktrace:
15:00:59.627 [DefaultThreadPool-7] DEBUG c.e.e.m.neo.db.NeoHttpDriver - Errors in response: [NeoResponseError{code='Neo.DatabaseError.Statement.ExecutionFailed', message='Transaction timeout. (Overtime: 23793 ms).', stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:71)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:57)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:49)
at org.neo4j.kernel.impl.api.GuardingStatementOperations.nodeCursorById(GuardingStatementOperations.java:300)
at org.neo4j.kernel.impl.api.OperationsFacade.nodeHasProperty(OperationsFacade.java:343)
at org.neo4j.cypher.internal.spi.v3_1.TransactionBoundQueryContext$NodeOperations.hasProperty(TransactionBoundQueryContext.scala:319)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply$mcZ$sp(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.spi.v3_1.ExceptionTranslationSupport$class.translateException(ExceptionTranslationSupport.scala:32)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1.translateException(ExceptionTranslatingQueryContextFor3_1.scala:34)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations.hasProperty(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compiler.v3_1.spi.DelegatingOperations.hasProperty(DelegatingQueryContext.scala:221)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.AbstractSetPropertyOperation.setProperty(SetOperation.scala:98)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetEntityPropertyOperation.set(SetOperation.scala:117)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:31)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:71)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$failIfThrows$1.apply(ResultIterator.scala:94)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.decoratedCypherException(ResultIterator.scala:103)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.failIfThrows(ResultIterator.scala:92)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:49)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.foreach(ResultIterator.scala:49)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.to(ResultIterator.scala:49)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toList(ResultIterator.scala:49)
at org.neo4j.cypher.internal.compiler.v3_1.EagerResultIterator.<init>(ResultIterator.scala:35)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toEager(ResultIterator.scala:53)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.buildResultIterator(DefaultExecutionResultBuilderFactory.scala:109)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.createResults(DefaultExecutionResultBuilderFactory.scala:99)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.build(DefaultExecutionResultBuilderFactory.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:164)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:148)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anon$1.run(ExecutionPlanBuilder.scala:123)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:275)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.compatibility.exceptionHandlerFor3_1$runSafely$.apply(CompatibilityFor3_1.scala:190)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper.run(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.PreparedPlanExecution.execute(PreparedPlanExecution.scala:26)
at org.neo4j.cypher.internal.ExecutionEngine.execute(ExecutionEngine.scala:107)
at org.neo4j.cypher.internal.javacompat.ExecutionEngine.executeQuery(ExecutionEngine.java:59)
at org.neo4j.server.rest.transactional.TransactionHandle.safelyExecute(TransactionHandle.java:371)
at org.neo4j.server.rest.transactional.TransactionHandle.executeStatements(TransactionHandle.java:323)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:230)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:119)
at org.neo4j.server.rest.web.TransactionalService.lambda$executeStatements$0(TransactionalService.java:203)
Most likely the problem is that the tx is waiting on a lock. Prior to Neo4j 3.2, dbms.transaction.timeout cannot cover the case of terminating a transaction that's waiting on a lock (or rather, it will mark it for termination, but the actual termination won't happen until the lock is acquired).
In Neo4j 3.2, dbms.lock.acquisition.timeout was introduced, which interrupts waiting on a lock and allows the thread to check if the tx has been terminated and take appropriate action.
The following is based on an answer provided by Neo4j Support:
dbms.lock.acquisition.timeout
As a starting point, dbms.lock.acquisition.timeout was only added in Neo4j 3.2; it does not exist in 3.1. Without a lock acquisition timeout, wait times on locks can overrun the configured limit. Things like GC can also extend the time. Since you're currently on 3.1.5, dbms.lock.acquisition.timeout would not be enforced.
dbms.transaction.timeout
dbms.transaction.timeout marks a transaction for termination, but the actual logic of checking this and performing the termination happens on a running thread, not one waiting on locks, and doesn't cause the thread to wake up and check. Presumably the logic for terminating a thread upon timeout is that some other thread periodically checks execution time for a transaction, and if it has exceeded the transaction timeout, sets a boolean variable on the transaction to indicate that it is marked for termination. The actual termination of the thread likely happens in an event loop for the transaction, where it checks that variable to see if it's marked for termination, then terminates and rolls back. A thread that attempts to acquire a lock enters a waiting state when the lock is already held by another thread. During this waiting state, the event loop is not being processed, so the thread never reaches the point in the event loop where it can check if it's been marked as terminated and take care of it.
Bottom line:
dbms.transaction.timeout does not cause a hard timeout; it only marks the transaction as timed out, which causes it to roll back once the flag is checked.
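To illustrate the mechanism described above, here is a minimal sketch in C with POSIX threads. It is only an analogy, not Neo4j's actual code: tx_t, marked_for_termination, entity_lock and the 50 ms slice are all made up for illustration. The waiting thread acquires the lock in short timed slices, which is roughly what a lock acquisition timeout makes possible, so it gets a chance to see the termination flag while the lock is still held by someone else. With a plain pthread_mutex_lock() the flag would never be looked at during the wait.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a transaction; not a Neo4j type. */
typedef struct {
    atomic_bool marked_for_termination; /* set by a monitor when the timeout expires */
    pthread_mutex_t *entity_lock;       /* lock on the shared entity */
    int outcome;                        /* 0 = got the lock, 1 = rolled back, -1 = error */
} tx_t;

/* Wait for the lock in short slices so the flag is re-checked while waiting. */
static void *tx_thread(void *arg) {
    tx_t *tx = arg;
    for (;;) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 50L * 1000 * 1000;       /* arbitrary 50 ms slice */
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }
        int rc = pthread_mutex_timedlock(tx->entity_lock, &deadline);
        if (rc == 0) {                               /* lock acquired */
            pthread_mutex_unlock(tx->entity_lock);
            tx->outcome = 0;
            return NULL;
        }
        if (rc != ETIMEDOUT) {                       /* unexpected error */
            tx->outcome = -1;
            return NULL;
        }
        if (atomic_load(&tx->marked_for_termination)) {
            tx->outcome = 1;                         /* "roll back" and stop waiting */
            return NULL;
        }
    }
}

/* Holds the lock for the whole run, marks the waiter for termination,
   and returns the waiter's outcome. */
int run_lock_wait_demo(void) {
    pthread_mutex_t entity_lock = PTHREAD_MUTEX_INITIALIZER;
    tx_t tx = { .entity_lock = &entity_lock, .outcome = -1 };
    atomic_init(&tx.marked_for_termination, false);

    pthread_mutex_lock(&entity_lock);                /* another transaction holds the lock */
    pthread_t waiter;
    if (pthread_create(&waiter, NULL, tx_thread, &tx) != 0) {
        pthread_mutex_unlock(&entity_lock);
        return -1;
    }
    atomic_store(&tx.marked_for_termination, true);  /* the timeout monitor fires */
    pthread_join(waiter, NULL);                      /* returns while the lock is still held */
    pthread_mutex_unlock(&entity_lock);
    return tx.outcome;
}
```

Here run_lock_wait_demo() is expected to return 1: the waiter gives up although the lock is never released to it, because each timed-out slice lets it check the flag.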
I have been seeing the error message below for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
After restarting the Dask cluster, it runs successfully.
If you are running a dask-cloudprovider ECSCluster or FargateCluster, the concurrent.futures._base.CancelledError can result from a long-running step in the computation that produces no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError message is misleading, but the logs for the scheduler task itself will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.
When Driver Verifier is enabled on a third-party data loss prevention driver, it triggers a Driver Verifier bugcheck based on the IrqlZwPassive rule.
The crash includes the following information:
ZwOpenKey should only be called at IRQL = PASSIVE_LEVEL.
What are some of the potential impacts to a Windows system if ZwOpenKey is used outside of IRQL=PASSIVE_LEVEL?
Is this always a serious problem that we should raise with the vendor, or only in certain scenarios?
All Zw APIs in the kernel must be called only at PASSIVE_LEVEL; this is by design. Calling one at APC_LEVEL is already undefined behavior (UB): sometimes it works, sometimes it produces a hang or a crash. Take ZwOpenKey: the registry manager may need to read key data from disk if it is not yet in memory, so it passes an IRP to the file system and waits for it to complete. But to complete, the IRP may insert a special APC (IopCompleteRequest) into the calling thread. If the thread is at APC_LEVEL, the APC will not be executed until the thread's IRQL is lowered to passive. That never happens, because the thread is waiting for the IRP to complete.
Another point: on exit from a Zw service, the system checks whether UserApcPending is set in the thread and, if so, raises the IRQL to APC_LEVEL, initiates the user APC, and lowers it back to PASSIVE_LEVEL (the system assumes that Zw was called at PASSIVE_LEVEL). So you can enter a Zw API at APC_LEVEL and exit at PASSIVE_LEVEL. One can ask why the thread was at APC_LEVEL in the first place: was the IRQL raised for no reason, or is there some requirement that it be at APC_LEVEL at that point? If the latter, what happens when the situation requires staying at APC_LEVEL but the thread lowers its IRQL to PASSIVE_LEVEL ahead of time? Really, UB. In most cases nothing happens, but in some cases the result is a very nasty bug that is very hard to catch and investigate.
I am using FreeRTOS v8.2.3 on a PIC32 microcontroller. I have a case where I need to post the following 3 events to 3 corresponding queues from an ISR, in order to unblock a task that waits for one of these events at a time -
a) SETUP packet arrival
b) Transfer completed event 1
c) Transfer completed event 2
My execution sequence and requirements are as follows:
Case 1 (execution is blocked for an event at point_1):
When SETUP arrives while waiting at point_1 of execution -
i) the waiting task should be unblocked
ii) Setup is received from the queue and processed
Some code is processed and execution reaches point_2
Case 2 (execution is blocked for an event at point_2):
If any one of the SETUP or transfer complete events occurs at point_2 -
i) unblock the wait
ii) receive the transfer_complete_1 or transfer_complete_2 event from the queue to carry out some additional transfers, and loop at point_2
iii) if it was a Setup queue event, do not receive it, but go to point_1
The code does not seem to work when I try to use xQueueReceive and xQueueSelectFromSet on the Setup queue even when one of them is used at point_1 and the other used at point_2.
But it seems to work fine if I use xQueueSelectFromSet at both places and check which queue set member handle caused the event before proceeding further.
Given the requirement above, the problem with using xQueueSelectFromSet at both the places is that
- the xQueueSelectFromSet call will be placed back to back, first on a Setup event at point_2 and then immediately on point_1 which is not intentional
- the xQueueSelectFromSet call at point_1 is also not desired
Hence, can anyone please explain whether and how to use both a queue set and xQueueReceive on the same queue? If that is not possible, how do we typically implement the above requirement in FreeRTOS?
This is a duplicate of a question asked on the FreeRTOS support forum, so below is a duplicate of the answer I gave there:
I don't fully understand your usage scenario, but some points which may help.
1) If a queue is a member of a queue set, then the queue can only be read after its handle has been returned from the queue set. Further, if a queue's handle is returned from a queue set then the item must be read from the queue. If either of these requirements is not met then the state of the queue set will not match that of the queues in the set.
2) If the same task is reading from the multiple queues then it is probably not necessary to use a queue set at all. See the "alternatives to using queue sets" section on the following page: http://www.freertos.org/Pend-on-multiple-rtos-objects.html
When pthread_exit(PTHREAD_CANCELED) is called I get the expected behavior (stack unwinding, destructor calls), but a call to pthread_cancel(pthread_self()) just terminates the thread.
Why do pthread_exit(PTHREAD_CANCELED) and pthread_cancel(pthread_self()) differ so significantly, and why is the thread memory not released in the latter case?
The background is as follows:
The calls are made from a signal handler. The reasoning behind this strange approach is to cancel a thread that is waiting for an external library's semop() to complete (it loops around on EINTR, I suppose).
I have noticed that calling pthread_cancel from another thread does not work (as if semop were not a cancellation point), whereas signalling the thread and then calling pthread_exit works but calls the destructors within a signal handler.
pthread_cancel could postpone the action to the next cancellation point.
In terms of thread specific clean-up behaviour there should be no difference between cancelling a thread via pthread_cancel() and exiting a thread via pthread_exit().
POSIX says:
[...] When the cancellation is acted on, the cancellation clean-up handlers for thread shall be called. When the last cancellation clean-up handler returns, the thread-specific data destructor functions shall be called for thread. When the last destructor function returns, thread shall be terminated.
From Linux's man pthread_cancel:
When a cancellation requested is acted on, the following steps occur for thread (in this order):
1. Cancellation clean-up handlers are popped (in the reverse of the order in which they were pushed) and called. (See pthread_cleanup_push(3).)
2. Thread-specific data destructors are called, in an unspecified order. (See pthread_key_create(3).)
3. The thread is terminated. (See pthread_exit(3).)
Regarding the strategy of introducing a cancellation point by signalling a thread, I doubt this is the cleanest way.
As many system calls return on receiving a signal while setting errno to EINTR, it would be easy to catch this case and simply let the thread end itself cleanly under this condition via pthread_exit().
Some pseudo code:
while (some condition)
{
  if (-1 == semop(...))
  { /* getting here on error or signal reception */
    if (EINTR == errno)
    { /* getting here on signal reception */
      pthread_exit(...);
    }
  }
}
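A runnable variant of the pseudo code above, as a sketch under some assumptions: sem_wait() on a POSIX semaphore stands in for the external library's semop() call (on Linux both fail with EINTR when interrupted by a handler installed without SA_RESTART), and the names never_posted, worker and run_eintr_exit_demo are made up for illustration. The signal handler does nothing except interrupt the blocking call, so pthread_exit() and the clean-up handler run in normal thread context rather than inside the handler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static sem_t never_posted;       /* hypothetical resource the worker blocks on */
static atomic_int cleanup_ran;   /* set by the clean-up handler */
static atomic_int worker_done;   /* tells the main thread to stop signalling */

/* The handler's only job is to make the blocking call return with EINTR. */
static void on_sigusr1(int signo) { (void)signo; }

static void cleanup(void *arg) {
    (void)arg;
    atomic_store(&cleanup_ran, 1);
    atomic_store(&worker_done, 1);
}

static void *worker(void *arg) {
    (void)arg;
    pthread_cleanup_push(cleanup, NULL);
    for (;;) {
        if (sem_wait(&never_posted) == -1 && errno == EINTR) {
            pthread_exit(NULL);  /* clean-up handlers run here, outside the signal handler */
        }
    }
    pthread_cleanup_pop(0);      /* never reached; needed to pair with the push */
    return NULL;
}

/* Returns 1 if the worker left via pthread_exit() and its clean-up handler ran. */
int run_eintr_exit_demo(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;             /* no SA_RESTART, so sem_wait() fails with EINTR */
    sigaction(SIGUSR1, &sa, NULL);

    atomic_store(&cleanup_ran, 0);
    atomic_store(&worker_done, 0);
    sem_init(&never_posted, 0, 0);

    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;

    /* A signal that arrives before the worker blocks is lost, so resend until it exits. */
    while (!atomic_load(&worker_done)) {
        pthread_kill(t, SIGUSR1);
        struct timespec pause = { 0, 10L * 1000 * 1000 };
        nanosleep(&pause, NULL);
    }
    pthread_join(t, NULL);
    sem_destroy(&never_posted);
    return atomic_load(&cleanup_ran);
}
```

The resend loop is there because of the race between the signal and the worker entering the blocking call; a real program would have to handle that window as well.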
Turned out that there is no difference.
However some interesting side effects took place.
Operations on std::iostream especially cerr/cout include cancellation points. When the underlying operation is canceled the stream is marked as not good. So you will get no output from any other thread if only one has discovered cancellation on an attempt to print.
So play with pthread_setcancelstate() and pthread_testcancel() or just call cerr.clear() when needed.
This applies to C++ streams only; stderr and stdin do not seem to be affected.
First of all, two settings associated with a thread determine what happens when you call pthread_cancel() on it:
1. pthread_setcancelstate
2. pthread_setcanceltype
The first function controls whether that particular thread can be cancelled at all. The second controls when and how the thread is cancelled: for example, whether it is terminated as soon as the cancellation request is sent (asynchronous), or only once it reaches a cancellation point (deferred).
When you call pthread_cancel(), the thread won't be terminated directly; the two checks above are performed first, i.e., whether the thread can be cancelled and, if so, when.
If you disable the cancel state, pthread_cancel() can't terminate the thread, but the cancellation request stays pending until the thread becomes cancellable; i.e., if you enable the cancel state at some later point, the request takes effect and the thread is terminated.
Whereas if you use pthread_exit(), the thread is terminated irrespective of its cancel state and cancel type.
*This is one of the differences between pthread_exit() and pthread_cancel(); there may be a few more.
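The pending-request behaviour described above can be seen in a small sketch like the following. The names ready, cancel_sent and run_deferred_cancel_demo are made up for illustration, and the thread uses the default deferred cancel type. The worker disables cancellation, the main thread sends the request while it is disabled, and the request is only acted on after the worker re-enables cancellation and reaches pthread_testcancel().

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t ready;        /* worker -> main: cancellation is now disabled */
static sem_t cancel_sent;  /* main -> worker: pthread_cancel() has been called */
static int survived_while_disabled;
static int reached_after_testcancel;

/* sem_wait() that retries if it is interrupted by a signal. */
static void wait_on(sem_t *s) {
    while (sem_wait(s) == -1 && errno == EINTR) { }
}

static void *worker(void *arg) {
    (void)arg;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
    sem_post(&ready);
    wait_on(&cancel_sent);         /* a cancellation point, but cancellation is disabled */
    survived_while_disabled = 1;   /* still running: the request is only pending */
    pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
    pthread_testcancel();          /* the pending request is acted on here at the latest */
    reached_after_testcancel = 1;  /* not executed */
    return NULL;
}

/* Returns 1 if the request stayed pending while disabled and was honoured once enabled. */
int run_deferred_cancel_demo(void) {
    survived_while_disabled = 0;
    reached_after_testcancel = 0;
    sem_init(&ready, 0, 0);
    sem_init(&cancel_sent, 0, 0);

    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;

    wait_on(&ready);
    pthread_cancel(t);             /* sent while the worker cannot be cancelled */
    sem_post(&cancel_sent);

    void *result = NULL;
    pthread_join(t, &result);
    sem_destroy(&ready);
    sem_destroy(&cancel_sent);
    return result == PTHREAD_CANCELED
        && survived_while_disabled
        && !reached_after_testcancel;
}
```

The join result PTHREAD_CANCELED shows that the thread ended through cancellation, even though the request was issued at a time when it could not take effect.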
I changed the JTA transaction timeout from the admin console and set it to 300. Even after the change, it fails, saying JTA transaction unexpectedly rolled back (maybe due to a timeout), with:
weblogic.transaction.RollbackException: Transaction timed out after 181 seconds
To check whether my change (timeout value 300) was applied to that domain, I looked in the domain's config.xml, and it shows 300.
My question is: is there any other place where I need to update the transaction timeout value, and do I need to restart the server?
Full stack trace after the exception from server below:
Caused by: org.springframework.transaction.UnexpectedRollbackException: JTA transaction unexpectedly rolled back (maybe due to a timeout); nested exception is weblogic.transaction.RollbackException: Transaction
timed out after 180 seconds
BEA1-160A800A149091F72E5E
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1031)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:709)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:678)
at org.springframework.transaction.interceptor.TransactionAspectSupport.completeTransactionAfterThrowing(TransactionAspectSupport.java:359)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:110)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at $Proxy103.saveRegistryData(Unknown Source)
at gov.cms.pqri.arch.submission.registry.bean.RegDataAccessManager.persistRegistry(RegDataAccessManager.java:54)
... 14 more
Caused by: weblogic.transaction.RollbackException: Transaction timed out after 180 seconds
BEA1-160A800A149091F72E5E
at weblogic.transaction.internal.TransactionImpl.throwRollbackException(TransactionImpl.java:1818)
at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:333)
at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227)
at weblogic.transaction.internal.TransactionManagerImpl.commit(TransactionManagerImpl.java:281)
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1028)
... 22 more
After changing the Stuck Thread Max Time to 300 under Servers -> Configuration -> Tuning (tab) in the admin console, it is updated and working fine.
I also came across this issue and resolved it. Since this is related to the JTA transaction, we need to increase the JTA timeout as well as the Stuck Thread Max Time. Click on JTA from the WebLogic console home and increase the JTA timeout from 30 (the default) to 300.
We hit the same issue on WebLogic 12.1.2 [JTA transaction unexpectedly rolled back (maybe due to a timeout)], and after investigating we found the root cause. In my opinion it occurs when a huge dataset is processed transactionally and an exception is thrown near the end of the process: JTA rolls back the data as expected, but does not report the details of the error. In our case, it was mostly caused by database integrity errors (e.g., trying to insert data into a column smaller than the data).
In summary, it is better to investigate the DB logs than to increase the Stuck Thread Max Time. Increasing it can be a workaround, but it is not a proper solution for real enterprise systems.
This issue is also discussed in another Stack Overflow question and in a Hibernate JIRA issue.
The suggested solution:
This is a default behaviour of Weblogic JTA realization. To obtain
root exception you should set system property
weblogic.transaction.allowOverrideSetRollbackReason to true.
One of the solution is add this line into
/bin/setDomainEnv.cmd:
set JAVA_OPTIONS=%JAVA_OPTIONS% -Dweblogic.transaction.allowOverrideSetRollbackReason=true
I increased my JTA timeouts by adding a jta.properties file to the config folder of my app with these lines:
com.atomikos.icatch.default_jta_timeout=600000
com.atomikos.icatch.max_timeout=600000