I am using Neo4j 2.1.5 with JdbcCypherExecutor to post my Cypher queries.
The Cypher executor thread often gets stuck, making the app unusable after some time.
The only option at that point is to restart the Spark web app.
Has anyone encountered this problem?
The jstack output of one of the blocked threads is:
"qtp1639509299-63" prio=10 tid=0x00007fe454001000 nid=0x1e0e waiting on condition [0x00007fe564fea000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000586cf6e88> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:158)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:403)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
at org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:391)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
at org.restlet.ext.httpclient.internal.HttpMethodCall.sendRequest(HttpMethodCall.java:336)
at org.restlet.engine.adapter.ClientAdapter.commit(ClientAdapter.java:114)
at org.restlet.engine.adapter.HttpClientHelper.handle(HttpClientHelper.java:112)
at org.restlet.Client.handle(Client.java:180)
at org.restlet.routing.Filter.doHandle(Filter.java:159)
at org.restlet.routing.Filter.handle(Filter.java:206)
at org.restlet.resource.ClientResource.handle(ClientResource.java:1136)
at org.restlet.resource.ClientResource.handleOutbound(ClientResource.java:1225)
at org.restlet.resource.ClientResource.handle(ClientResource.java:1068)
at org.restlet.resource.ClientResource.handle(ClientResource.java:1044)
at org.restlet.resource.ClientResource.post(ClientResource.java:1453)
at org.neo4j.jdbc.internal.rest.TransactionalQueryExecutor.post(TransactionalQueryExecutor.java:98)
at org.neo4j.jdbc.internal.rest.TransactionalQueryExecutor.commit(TransactionalQueryExecutor.java:133)
at org.neo4j.jdbc.internal.rest.TransactionalQueryExecutor.executeQueries(TransactionalQueryExecutor.java:204)
at org.neo4j.jdbc.internal.rest.TransactionalQueryExecutor.executeQuery(TransactionalQueryExecutor.java:214)
at org.neo4j.jdbc.internal.Neo4jConnection.executeQuery(Neo4jConnection.java:370)
at org.neo4j.jdbc.internal.Neo4jPreparedStatement.executeQuery(Neo4jPreparedStatement.java:48)
at com.zahoor.graph.executor.JdbcCypherExecutor.query(JdbcCypherExecutor.java:28)
Separate threads might be the issue here: the JDBC driver keeps the transaction in a thread-local, so if you spawn new threads it creates new transaction objects and new connections (if you don't reuse the same Connection instance).
And the default connection pool of HttpClient is (afaik) 10 parallel connections, so once those are all held, further requests block waiting for a pool entry, which is exactly what the stack trace shows.
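A minimal sketch of that Connection reuse, assuming the Neo4j JDBC driver is on the classpath (the URL and query are illustrative, not from the original post); note that a JDBC Connection is not thread-safe, so access is serialized here:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SharedConnectionSketch {
    // One Connection shared by the web app instead of one per spawned thread,
    // so the driver does not open a fresh HTTP connection for every request.
    private static Connection connection;

    private static synchronized Connection getConnection() throws Exception {
        if (connection == null || connection.isClosed()) {
            // URL is illustrative; adjust host and port for your server.
            connection = DriverManager.getConnection("jdbc:neo4j://localhost:7474/");
        }
        return connection;
    }

    // Synchronized because the shared Connection must not be used concurrently.
    public static synchronized long countNodes() throws Exception {
        try (PreparedStatement stmt = getConnection()
                .prepareStatement("MATCH (n) RETURN count(n) AS c");
             ResultSet rs = stmt.executeQuery()) {
            rs.next();
            return rs.getLong("c");
        }
    }
}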
I'm running a Java application that uses RabbitMQ Server 3.8.9, spring-amqp-2.2.10.RELEASE, and spring-rabbit-2.2.10.RELEASE.
My test case does something like the following:
1. Start the RabbitMQ Server
2. Start my Java application
3. Test and validate some functionality of my Java application
4. Gracefully stop my Java application
5. Gracefully stop the RabbitMQ Server
Repeat steps 1-5 a few more times
Everything looks fine, except that sometimes, during one of the restarts about 10 minutes into the run, I see the following error in my application's logs:
2021-02-05 12:52:46.498 UTC,ERROR,org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl,null,rabbitConnectionFactory23,runWorker():1149,Failed to invoke afterAckCallback
java.lang.NullPointerException: null
at org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl.lambda$doHandleConfirm$1(PublisherCallbackChannelImpl.java:1027) ~[spring-rabbit.jar:2.2.10.RELEASE]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
Further analysis doesn't point to anything specific. There are no errors in the RabbitMQ log files, no restarts of the RabbitMQ server, and nothing unusual in the RabbitMQ logs around the timestamp above.
The code in question:
https://github.com/spring-projects/spring-amqp/blob/v2.2.10.RELEASE/spring-rabbit/src/main/java/org/springframework/amqp/rabbit/connection/PublisherCallbackChannelImpl.java#L1027
My tests are automated and run as part of a CI pipeline. The issue is intermittent and I have had trouble reproducing it locally in my sandbox.
From what I can tell, the functionality of my Java application is unaffected.
Code that creates the RabbitMQ connection factory used everywhere:
final CachingConnectionFactory connectionFactory = new CachingConnectionFactory(HOST_NAME);
connectionFactory.setChannelCacheSize(1);
connectionFactory.setPublisherConfirms(true);
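For context, a minimal sketch of how a factory like this is typically wired to a RabbitTemplate (the host and queue name are illustrative, not from the original post):

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class RabbitWiring {
    public static void main(String[] args) {
        CachingConnectionFactory connectionFactory = new CachingConnectionFactory("localhost");
        connectionFactory.setChannelCacheSize(1);
        connectionFactory.setPublisherConfirms(true); // confirms drive the afterAckCallback path

        RabbitTemplate template = new RabbitTemplate(connectionFactory);
        template.convertAndSend("example-queue", "hello"); // queue name is illustrative
        connectionFactory.destroy(); // close cached channels and connections on shutdown
    }
}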
It seems like a concurrency problem, but I'm not sure how to get to the bottom of it. For the most part, we use the RabbitTemplate and other Spring facilities to connect to RabbitMQ.
Anyone in the Spring world with some knowledge in RabbitMQ care to chime in?
Thanks
The code you are referring to looks like this:
finally {
    try {
        if (this.afterAckCallback != null && getPendingConfirmsCount() == 0) {
            this.afterAckCallback.accept(this);
            this.afterAckCallback = null;
        }
    }
    catch (Exception e) {
        this.logger.error("Failed to invoke afterAckCallback", e);
    }
}
There really could be a race condition around the this.afterAckCallback property.
One thread may pass the if() check, but then a different thread sets this.afterAckCallback to null, so we fail with that NPE.
We have to copy its value to a local variable, then check the copy and call accept() on it.
Feel free to raise a GitHub issue against Spring AMQP project: https://github.com/spring-projects/spring-amqp/issues
We get a race condition because doHandleConfirm(), with its async logic, is really called from the loop in processMultipleAck().
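A minimal sketch of that copy-to-local fix (the class, field types, and counter here are stand-ins, not the actual Spring AMQP code):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class ConfirmCallbackHolder {
    private final AtomicInteger pendingConfirms = new AtomicInteger();
    private volatile Consumer<ConfirmCallbackHolder> afterAckCallback;

    void setAfterAckCallback(Consumer<ConfirmCallbackHolder> callback) {
        this.afterAckCallback = callback;
    }

    void runAfterAckCallbackIfDone() {
        // Copy the field first: another thread may null it between the
        // null check and the accept() call, which is the suspected NPE.
        Consumer<ConfirmCallbackHolder> callback = this.afterAckCallback;
        if (callback != null && pendingConfirms.get() == 0) {
            callback.accept(this);
            this.afterAckCallback = null;
        }
    }
}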
When using Dask's distributed scheduler, I have a task running on a remote worker that I want to stop.
How do I stop it? I know about the cancel method, but this doesn't seem to work if the task has already started executing.
If it's not yet running
If the task has not yet started running, you can cancel it by cancelling the associated future:
future = client.submit(func, *args) # start task
future.cancel() # cancel task
If you are using Dask collections, then you can use the client.cancel method:
x = x.persist() # start many tasks
client.cancel(x) # cancel all tasks
If it is running
However, if your task has already started running on a thread within a worker, then there is nothing you can do to interrupt that thread. Unfortunately, this is a limitation of Python.
Build in an explicit stopping condition
The best you can do is to build some sort of stopping criterion into your function with your own custom logic. You might consider checking a shared variable within a loop; look for "Variable" in these docs: http://dask.pydata.org/en/latest/futures.html
from dask.distributed import Client, Variable

client = Client()
stop = Variable()
stop.set(False)

def long_running_task():
    while not stop.get():  # poll the shared flag on each iteration
        pass               # ... do stuff here

future = client.submit(long_running_task)
# ... wait a while
stop.set(True)  # signal the task to stop itself
I am trying to load test Zuul version 1.1.2.
However, I keep getting the following issue a few minutes into the load test:
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: book could not acquire a semaphore for execution and no fallback available.
at com.netflix.hystrix.AbstractCommand$21.call(AbstractCommand.java:783) ~[hystrix-core-1.5.3.jar:1.5.3]
My question is: how can I increase maxSemaphores via configuration? This is my current configuration:
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds= 20000000
zuul.hystrix.command.default.execution.isolation.strategy= SEMAPHORE
zuul.hystrix.command.default.execution.isolation.semaphore.maxConcurrentRequests= 10
zuul.hystrix.command.default.fallback.isolation.semaphore.maxConcurrentRequests= 10
zuul.semaphore.maxSemaphores=3000
zuul.eureka.book.semaphore.maxSemaphore=30000
I have tried many options found on the Internet, but none of them works for me.
Please advise.
It turns out I was using an old version. In later versions you can set semaphores at the Zuul level. Below is an example that sets maxSemaphores to 3000 as the default for routing to every proxied service:
zuul.semaphore.maxSemaphores=3000
The actual property is max-semaphores (shown here with YAML config); with Spring Boot's relaxed binding, maxSemaphores and max-semaphores bind to the same property:
zuul:
  semaphore:
    # com.netflix.hystrix.exception.HystrixRuntimeException: "microservice" could not acquire a semaphore for execution and no fallback available.
    max-semaphores: 2000
We updated to the latest SDK version, 0.3.150326, and had a job fail due to this error:
(d0f58ccaf368cf1f): Workflow failed. Causes: (539037ea87656484):
Cannot downsize without losing active shuffle data. old_size = 10,
new_size = 8.
Job ID: 2015-04-02_21_26_53-11930390736602232537
I have not been able to reproduce it, but thought I should ask whether it's a known issue.
Looking at the docs, it appears autoscaling is currently only "experimental", but I would have imagined that this is a core feature of Cloud Dataflow, and as such should be fully supported. The full job log follows:
1087 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
1103 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading 79 files from PipelineOptions.filesToStage to GCS to prepare for execution in the cloud.
43086 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading PipelineOptions.filesToStage complete: 2 files newly uploaded, 77 files cached
Dataflow SDK version: 0.3.150326
57718 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/gdfp-7414/dataflow/job/2015-04-02_21_26_53-11930390736602232537
Submitted job: 2015-04-02_21_26_53-11930390736602232537
2015-04-03T04:26:54.710Z: (3a5437c7f9c6e33f): Expanding GroupByKey operations into optimizable parts.
2015-04-03T04:26:54.714Z: (3a5437c7f9c6e8dd): Annotating graph with Autotuner information.
2015-04-03T04:26:55.436Z: (3a5437c7f9c6e85b): Fusing adjacent ParDo, Read, Write, and Flatten operations
2015-04-03T04:26:55.453Z: (3a5437c7f9c6efad): Fusing consumer denormalized-write-to-BQ into events-denormalize
2015-04-03T04:26:55.455Z: (3a5437c7f9c6e54b): Fusing consumer events-denormalize into events-read-from-BQ
2015-04-03T04:26:55.457Z: (3a5437c7f9c6eae9): Fusing consumer unmapped-write-to-BQ into events-denormalize
2015-04-03T04:26:55.504Z: (3a5437c7f9c6e67d): Adding StepResource setup and teardown to workflow graph.
2015-04-03T04:26:55.525Z: (971aceaf96c03b86): Starting the input generators.
2015-04-03T04:26:55.546Z: (ea598353613cc1d3): Adding workflow start and stop steps.
2015-04-03T04:26:55.548Z: (ea598353613ccd39): Assigning stage ids.
2015-04-03T04:26:56.017Z: S07: (fb31ac3e5c3be05a): Executing operation WeightingFactor
2015-04-03T04:26:56.024Z: S09: (ee7049b2bfe3f48c): Executing operation Name_Community
2015-04-03T04:26:56.037Z: (3a5437c7f9c6e293): Starting worker pool setup.
2015-04-03T04:26:56.042Z: (3a5437c7f9c6edcf): Starting 5 workers...
2015-04-03T04:26:56.047Z: S01: (a25730bd9d25e5ed): Executing operation Browser_mapping
2015-04-03T04:26:56.049Z: S11: (fb31ac3e5c3beb06): Executing operation WebsiteVHH
2015-04-03T04:26:56.051Z: (30eb1307dfc8372f): Value "Name_Community.out" materialized.
2015-04-03T04:26:56.065Z: (52e655ceeab44257): Value "WeightingFactor.out" materialized.
2015-04-03T04:26:56.072Z: S03: (c024e27994951718): Executing operation OS_mapping
2015-04-03T04:26:56.076Z: S10: (a3947955b25f3830): Executing operation AsIterable3/CreatePCollectionView
2015-04-03T04:26:56.087Z: (4c9eb5a54721c4f7): Value "WebsiteVHH.out" materialized.
2015-04-03T04:26:56.094Z: S05: (52e655ceeab4458a): Executing operation SA1_Area_Metro
2015-04-03T04:26:56.103Z: S08: (c024e279949513f4): Executing operation AsIterable2/CreatePCollectionView
2015-04-03T04:26:56.106Z: (4c9eb5a54721cd78): Value "AsIterable3/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.107Z: (58b58f637f29b69a): Value "OS_mapping.out" materialized.
2015-04-03T04:26:56.115Z: (f0587ec8b1f9f69f): Value "Browser_mapping.out" materialized.
2015-04-03T04:26:56.126Z: (a277f34c719a133): Value "SA1_Area_Metro.out" materialized.
2015-04-03T04:26:56.127Z: S12: (c024e279949510d0): Executing operation AsIterable4/CreatePCollectionView
2015-04-03T04:26:56.132Z: S04: (52e655ceeab44adf): Executing operation AsIterable6/CreatePCollectionView
2015-04-03T04:26:56.136Z: (f0587ec8b1f9fd86): Value "AsIterable2/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.141Z: S02: (eb97fca639a2101b): Executing operation AsIterable5/CreatePCollectionView
2015-04-03T04:26:56.151Z: S06: (8cc6100045f0af9b): Executing operation AsIterable/CreatePCollectionView
2015-04-03T04:26:56.159Z: (6da6e59d099e8c60): Value "AsIterable4/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.163Z: (4c9eb5a54721c5f9): Value "AsIterable6/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.173Z: (a3947955b25f3701): Value "AsIterable5/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.178Z: (58b58f637f29b853): Value "AsIterable/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.199Z: S13: (8cc6100045f0ac67): Executing operation events-read-from-BQ+events-denormalize+denormalized-write-to-BQ+unmapped-write-to-BQ
2015-04-03T04:26:56.653Z: (6153d4cd276be2a0): Autoscaling: Enabled for job /workflows/wf-2015-04-02_21_26_53-11930390736602232537
2015-04-03T04:30:31.754Z: (a94b4f451005c934): Autoscaling: Resizing worker pool from 5 to 10.
2015-04-03T04:31:01.754Z: (a94b4f451005c38e): Autoscaling: Resizing worker pool from 10 to 8.
2015-04-03T04:31:02.363Z: (d0f58ccaf368cf1f): Workflow failed. Causes: (539037ea87656484): Cannot downsize without losing active shuffle data. old_size = 10, new_size = 8.
2015-04-03T04:31:02.396Z: (7f503ea3d5c37a55): Stopping the input generators.
2015-04-03T04:31:02.411Z: (58b58f637f29ba9f): Cleaning up.
2015-04-03T04:31:02.442Z: (58b58f637f29bc58): Tearing down pending resources...
2015-04-03T04:31:02.447Z: (58b58f637f29be11): Starting worker pool teardown.
2015-04-03T04:31:02.453Z: (58b58f637f29b05d): Stopping worker pool...
2015-04-03T04:31:03.062Z: (a1f260e16fea5b6): Workflow failed. Causes: (539037ea87656484): Cannot downsize without losing active shuffle data. old_size = 10, new_size = 8.
458752 [main] INFO com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner - Job finished with status FAILED
458755 [main] INFO com.<removed>.cdf.job.AbstractCloudDataFlowJob - com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner$PipelineJobState#27a7ef08
458755 [main] INFO com.<removed>.cdf.job.AbstractCloudDataFlowJob - Cleaning up after <removed> job. At the moment nothing to do.
Disconnected from the target VM, address: '127.0.0.1:57739', transport: 'socket'
Sorry for the trouble. This is a bug in the service. I'll update this thread when we address it, and thank you for your patience.
On a working Grails 2.2.5 system, we're occasionally losing the connection to the MySQL database, for reasons that are not relevant here. The majority of the system recovers perfectly well from the outage, but any Quartz jobs (using the Quartz plugin 0.4.2) typically fail to run again after such an outage. This is a typical message that appears in the log at the point the job should run:
2015-02-26 16:30:45,304 [quartzScheduler_Worker-9] ERROR core.ErrorLogger - Unable to notify JobListener(s) of Job to be executed: (Job will NOT be executed!). trigger= GRAILS_JOBS.quickQuoteCleanupJob job= GRAILS_JOBS.com.aire.QuickQuoteCleanupJob
org.quartz.SchedulerException: JobListener 'sessionBinderListener' threw exception: Already value [org.springframework.orm.hibernate3.SessionHolder@593a9498] for key [org.hibernate.impl.SessionFactoryImpl@c8488d7] bound to thread [quartzScheduler_Worker-9] [See nested exception: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder@593a9498] for key [org.hibernate.impl.SessionFactoryImpl@c8488d7] bound to thread [quartzScheduler_Worker-9]]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1868)
at org.quartz.core.JobRunShell.notifyListenersBeginning(JobRunShell.java:338)
at org.quartz.core.JobRunShell.run(JobRunShell.java:176)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
Caused by: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder@593a9498] for key [org.hibernate.impl.SessionFactoryImpl@c8488d7] bound to thread [quartzScheduler_Worker-9]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1866)
... 3 more
What do I need to do to make things more robust, so that the Quartz jobs recover as well?
By default, a Quartz job gets a Hibernate session bound to it. Disable that session binding and let your service handle the transaction/session. That's what we do, and when our DB connections come back up, the jobs still work.
To disable session binding in your job, add:
def sessionRequired = false