Quartz jobs failing after MySQL DB errors - Grails

On a working Grails 2.2.5 system, we're occasionally losing the connection to the MySQL database, for reasons that are not relevant here. The majority of the system recovers perfectly well from the outage, but any Quartz jobs (using Quartz plugin 0.4.2) typically fail to run again after such an outage. This is a typical message that appears in the log at the point the job should run:
2015-02-26 16:30:45,304 [quartzScheduler_Worker-9] ERROR core.ErrorLogger - Unable to notify JobListener(s) of Job to be executed: (Job will NOT be executed!). trigger= GRAILS_JOBS.quickQuoteCleanupJob job= GRAILS_JOBS.com.aire.QuickQuoteCleanupJob
org.quartz.SchedulerException: JobListener 'sessionBinderListener' threw exception: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9] [See nested exception: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9]]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1868)
at org.quartz.core.JobRunShell.notifyListenersBeginning(JobRunShell.java:338)
at org.quartz.core.JobRunShell.run(JobRunShell.java:176)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
Caused by: java.lang.IllegalStateException: Already value [org.springframework.orm.hibernate3.SessionHolder#593a9498] for key [org.hibernate.impl.SessionFactoryImpl#c8488d7] bound to thread [quartzScheduler_Worker-9]
at org.quartz.core.QuartzScheduler.notifyJobListenersToBeExecuted(QuartzScheduler.java:1866)
... 3 more
What do I need to do to make things more robust, so that the Quartz jobs recover as well?

By default, a Quartz job gets a session bound to it. Disable that session binding and let your service handle the transaction/session. That's what we do, and when our DB connections come back up, the jobs still work.
To disable session binding, add the following property to your job class:
def sessionRequired = false

Related

Impossible (?) NullPointerException - Spring Framework RabbitMQ, Failed to invoke afterAckCallback

I'm running a Java application that uses RabbitMQ Server 3.8.9, spring-amqp-2.2.10.RELEASE, and spring-rabbit-2.2.10.RELEASE.
My test case does something like the following:
1. Start the RabbitMQ Server
2. Start my Java application
3. Test and validate some functionality on my Java application
4. Gracefully stop my Java application
5. Gracefully stop the RabbitMQ Server
6. Repeat 1-5 a few more times
Everything looks fine, except that sometimes during one of the restarts, about 10 minutes in, I see the following error in my application's logs:
2021-02-05 12:52:46.498 UTC,ERROR,org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl,null,rabbitConnectionFactory23,runWorker():1149,Failed to invoke afterAckCallback
java.lang.NullPointerException: null
at org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl.lambda$doHandleConfirm$1(PublisherCallbackChannelImpl.java:1027) ~[spring-rabbit.jar:2.2.10.RELEASE]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
Further analysis doesn't point to anything specific. There are no errors in the RabbitMQ log files, no restarts of the RabbitMQ server, and nothing unusual in the RabbitMQ logs around the timestamp above.
The code in question:
https://github.com/spring-projects/spring-amqp/blob/v2.2.10.RELEASE/spring-rabbit/src/main/java/org/springframework/amqp/rabbit/connection/PublisherCallbackChannelImpl.java#L1027
My tests are automated and run as part of a CI pipeline. The issue is intermittent and I have had trouble reproducing it locally in my sandbox.
From what I can tell, the functionality of my Java application is unaffected.
Code that creates the RabbitMQ connection factory used everywhere:
final CachingConnectionFactory connectionFactory = new CachingConnectionFactory(HOST_NAME);
connectionFactory.setChannelCacheSize(1);
connectionFactory.setPublisherConfirms(true);
It seems like a concurrency problem, but I'm not sure how to get to the bottom of it. For the most part, we use the RabbitTemplate and other Spring facilities to connect to RabbitMQ.
Anyone in the Spring world with some knowledge in RabbitMQ care to chime in?
Thanks
The code you are talking about looks like this:
finally {
    try {
        if (this.afterAckCallback != null && getPendingConfirmsCount() == 0) {
            this.afterAckCallback.accept(this);
            this.afterAckCallback = null;
        }
    }
    catch (Exception e) {
        this.logger.error("Failed to invoke afterAckCallback", e);
    }
}
There really could be a race condition around that this.afterAckCallback property:
one thread may pass the if() check, but then a different thread sets this.afterAckCallback to null, so we fail with that NPE.
We have to copy its value to a local variable and then check and perform accept().
Feel free to raise a GitHub issue against Spring AMQP project: https://github.com/spring-projects/spring-amqp/issues
The race condition exists because doHandleConfirm(), with its async logic, is really called from the loop in processMultipleAck().
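For illustration, a minimal self-contained sketch of that local-copy pattern (the class and stub method here are hypothetical; in spring-rabbit the field is the afterAckCallback shown above):

import java.util.function.Consumer;

class ConfirmHandler {

    private volatile Consumer<ConfirmHandler> afterAckCallback;

    void invokeAfterAckCallbackIfDone() {
        // Copy the field once: a concurrent thread nulling afterAckCallback
        // between the null check and accept() can no longer cause an NPE.
        Consumer<ConfirmHandler> callback = this.afterAckCallback;
        if (callback != null && getPendingConfirmsCount() == 0) {
            callback.accept(this);
            this.afterAckCallback = null;
        }
    }

    private int getPendingConfirmsCount() {
        return 0; // stub for illustration
    }
}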

Spring Cloud Dataflow Task Execution Fails on subsequent runs

Name: spring-cloud-dataflow-server
Version: 2.5.0.BUILD-SNAPSHOT
I have a very simple task created. The first run always COMPLETES fine with NO ISSUES. If the task is run again, it FAILS with the following error.
Subsequent launches of the same task fail with the exception below, even though each is a fresh run after the previous execution completed fully. If a task has been run once, can't it be run again?
(log from Task Execution Details - Execution ID: 246)
Caused by: org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters={-spring.cloud.data.flow.taskappname=composed-task-runner, -spring.cloud.task.executionid=246, -graph=threetasks-t1 && threetasks-t2 && threetasks-t3, -spring.datasource.username=root, -spring.cloud.data.flow.platformname=default, -dataflow-server-uri=http://10.104.227.49:9393, -management.metrics.export.prometheus.enabled=true, -management.metrics.export.prometheus.rsocket.host=prometheus-proxy, -spring.datasource.url=jdbc:mysql://10.110.89.91:3306/mysql, -spring.datasource.driverClassName=org.mariadb.jdbc.Driver, -spring.datasource.password=manager, -management.metrics.export.prometheus.rsocket.port=7001, -management.metrics.export.prometheus.rsocket.enabled=true, -spring.cloud.task.name=threetasks}. If you want to run this job again, change the parameters.
A Job Instance in a Spring Batch application requires unique Job Parameters, and this is by design.
In this case, since you are using the Composed Task, you can use the property --increment-instance-enabled=true as part of the composed task definition to handle it. This property makes sure each Job Instance gets unique Job Parameters.
You can check the list of properties supported for Composed Task Runner here
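For context, that property is essentially Spring Batch's standard incrementer mechanism. In a plain (non-composed) Spring Batch job you would get the same effect with a JobParametersIncrementer; a minimal sketch using the standard API (the job and bean names here are hypothetical):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class JobConfig {

    // RunIdIncrementer appends an incrementing run.id job parameter, so each
    // launch gets unique JobParameters and therefore a fresh JobInstance.
    @Bean
    public Job threeTasksJob(JobBuilderFactory jobs, Step step) {
        return jobs.get("threetasks")
                .incrementer(new RunIdIncrementer())
                .start(step)
                .build();
    }
}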

Configuring maxSemaphores for Zuul server

I am trying to load test Zuul version 1.1.2.
However, I keep getting the following issue after the load test has been running for a few minutes.
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: book could not acquire a semaphore for execution and no fallback available.
at com.netflix.hystrix.AbstractCommand$21.call(AbstractCommand.java:783) ~[hystrix-core-1.5.3.jar:1.5.3]
My question is how I can increase maxSemaphores via configuration. This is what I have tried:
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds= 20000000
zuul.hystrix.command.default.execution.isolation.strategy= SEMAPHORE
zuul.hystrix.command.default.execution.isolation.semaphore.maxConcurrentRequests= 10
zuul.hystrix.command.default.fallback.isolation.semaphore.maxConcurrentRequests= 10
zuul.semaphore.maxSemaphores=3000
zuul.eureka.book.semaphore.maxSemaphore=30000
I have tried many of the options I found on the Internet, but none of them works for me.
Please advise.
It turns out I was using an old version. In later versions you can set semaphores at the Zuul level. Below is an example that sets maxSemaphores to 3000 as the default for routing to every proxied service:
zuul.semaphore.maxSemaphores=3000
The actual property is max-semaphores (this would be with YAML config):
zuul:
  semaphore:
    # com.netflix.hystrix.exception.HystrixRuntimeException: "microservice" could not acquire a semaphore for execution and no fallback available.
    max-semaphores: 2000

Error when worker pool is scaling down: "Cannot downsize without losing active shuffle data"

We updated to the latest SDK version 0.3.150326, and had a job fail due to this error:
(d0f58ccaf368cf1f): Workflow failed. Causes: (539037ea87656484):
Cannot downsize without losing active shuffle data. old_size = 10,
new_size = 8.
Job ID: 2015-04-02_21_26_53-11930390736602232537
I have not been able to reproduce it, but thought I should ask whether it's a known issue or not.
Looking at the docs, it appears autoscaling is currently only "experimental", but I would have imagined that this is a core feature of Cloud Dataflow, and as such should be fully supported.
1087 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
1103 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading 79 files from PipelineOptions.filesToStage to GCS to prepare for execution in the cloud.
43086 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading PipelineOptions.filesToStage complete: 2 files newly uploaded, 77 files cached
Dataflow SDK version: 0.3.150326
57718 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/gdfp-7414/dataflow/job/2015-04-02_21_26_53-11930390736602232537
Submitted job: 2015-04-02_21_26_53-11930390736602232537
2015-04-03T04:26:54.710Z: (3a5437c7f9c6e33f): Expanding GroupByKey operations into optimizable parts.
2015-04-03T04:26:54.714Z: (3a5437c7f9c6e8dd): Annotating graph with Autotuner information.
2015-04-03T04:26:55.436Z: (3a5437c7f9c6e85b): Fusing adjacent ParDo, Read, Write, and Flatten operations
2015-04-03T04:26:55.453Z: (3a5437c7f9c6efad): Fusing consumer denormalized-write-to-BQ into events-denormalize
2015-04-03T04:26:55.455Z: (3a5437c7f9c6e54b): Fusing consumer events-denormalize into events-read-from-BQ
2015-04-03T04:26:55.457Z: (3a5437c7f9c6eae9): Fusing consumer unmapped-write-to-BQ into events-denormalize
2015-04-03T04:26:55.504Z: (3a5437c7f9c6e67d): Adding StepResource setup and teardown to workflow graph.
2015-04-03T04:26:55.525Z: (971aceaf96c03b86): Starting the input generators.
2015-04-03T04:26:55.546Z: (ea598353613cc1d3): Adding workflow start and stop steps.
2015-04-03T04:26:55.548Z: (ea598353613ccd39): Assigning stage ids.
2015-04-03T04:26:56.017Z: S07: (fb31ac3e5c3be05a): Executing operation WeightingFactor
2015-04-03T04:26:56.024Z: S09: (ee7049b2bfe3f48c): Executing operation Name_Community
2015-04-03T04:26:56.037Z: (3a5437c7f9c6e293): Starting worker pool setup.
2015-04-03T04:26:56.042Z: (3a5437c7f9c6edcf): Starting 5 workers...
2015-04-03T04:26:56.047Z: S01: (a25730bd9d25e5ed): Executing operation Browser_mapping
2015-04-03T04:26:56.049Z: S11: (fb31ac3e5c3beb06): Executing operation WebsiteVHH
2015-04-03T04:26:56.051Z: (30eb1307dfc8372f): Value "Name_Community.out" materialized.
2015-04-03T04:26:56.065Z: (52e655ceeab44257): Value "WeightingFactor.out" materialized.
2015-04-03T04:26:56.072Z: S03: (c024e27994951718): Executing operation OS_mapping
2015-04-03T04:26:56.076Z: S10: (a3947955b25f3830): Executing operation AsIterable3/CreatePCollectionView
2015-04-03T04:26:56.087Z: (4c9eb5a54721c4f7): Value "WebsiteVHH.out" materialized.
2015-04-03T04:26:56.094Z: S05: (52e655ceeab4458a): Executing operation SA1_Area_Metro
2015-04-03T04:26:56.103Z: S08: (c024e279949513f4): Executing operation AsIterable2/CreatePCollectionView
2015-04-03T04:26:56.106Z: (4c9eb5a54721cd78): Value "AsIterable3/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.107Z: (58b58f637f29b69a): Value "OS_mapping.out" materialized.
2015-04-03T04:26:56.115Z: (f0587ec8b1f9f69f): Value "Browser_mapping.out" materialized.
2015-04-03T04:26:56.126Z: (a277f34c719a133): Value "SA1_Area_Metro.out" materialized.
2015-04-03T04:26:56.127Z: S12: (c024e279949510d0): Executing operation AsIterable4/CreatePCollectionView
2015-04-03T04:26:56.132Z: S04: (52e655ceeab44adf): Executing operation AsIterable6/CreatePCollectionView
2015-04-03T04:26:56.136Z: (f0587ec8b1f9fd86): Value "AsIterable2/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.141Z: S02: (eb97fca639a2101b): Executing operation AsIterable5/CreatePCollectionView
2015-04-03T04:26:56.151Z: S06: (8cc6100045f0af9b): Executing operation AsIterable/CreatePCollectionView
2015-04-03T04:26:56.159Z: (6da6e59d099e8c60): Value "AsIterable4/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.163Z: (4c9eb5a54721c5f9): Value "AsIterable6/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.173Z: (a3947955b25f3701): Value "AsIterable5/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.178Z: (58b58f637f29b853): Value "AsIterable/CreatePCollectionView.out" materialized.
2015-04-03T04:26:56.199Z: S13: (8cc6100045f0ac67): Executing operation events-read-from-BQ+events-denormalize+denormalized-write-to-BQ+unmapped-write-to-BQ
2015-04-03T04:26:56.653Z: (6153d4cd276be2a0): Autoscaling: Enabled for job /workflows/wf-2015-04-02_21_26_53-11930390736602232537
2015-04-03T04:30:31.754Z: (a94b4f451005c934): Autoscaling: Resizing worker pool from 5 to 10.
2015-04-03T04:31:01.754Z: (a94b4f451005c38e): Autoscaling: Resizing worker pool from 10 to 8.
2015-04-03T04:31:02.363Z: (d0f58ccaf368cf1f): Workflow failed. Causes: (539037ea87656484): Cannot downsize without losing active shuffle data. old_size = 10, new_size = 8.
2015-04-03T04:31:02.396Z: (7f503ea3d5c37a55): Stopping the input generators.
2015-04-03T04:31:02.411Z: (58b58f637f29ba9f): Cleaning up.
2015-04-03T04:31:02.442Z: (58b58f637f29bc58): Tearing down pending resources...
2015-04-03T04:31:02.447Z: (58b58f637f29be11): Starting worker pool teardown.
2015-04-03T04:31:02.453Z: (58b58f637f29b05d): Stopping worker pool...
2015-04-03T04:31:03.062Z: (a1f260e16fea5b6): Workflow failed. Causes: (539037ea87656484): Cannot downsize without losing active shuffle data. old_size = 10, new_size = 8.
458752 [main] INFO com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner - Job finished with status FAILED
458755 [main] INFO com.<removed>.cdf.job.AbstractCloudDataFlowJob - com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner$PipelineJobState#27a7ef08
458755 [main] INFO com.<removed>.cdf.job.AbstractCloudDataFlowJob - Cleaning up after <removed> job. At the moment nothing to do.
Disconnected from the target VM, address: '127.0.0.1:57739', transport: 'socket'
Sorry for the trouble. This is a bug in the service. I'll update this thread when we address it, and thank you for your patience.

Cloud Dataflow: java.lang.IllegalStateException: no evaluator registered for GroupedValues

I'm getting the following exception when running the pipeline locally. There is no exception when submitting for cloud execution.
Thanks,
Genady
INFO: Executing pipeline using the DirectPipelineRunner.
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for GroupedValues [GroupedValues]
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:606)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:583)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:327)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at app.Main.main(Main.java:124)
The code outline is basically this:
PCollection<KV<MyKey, Iterable<MyValue>>> groupedByMyKey = ...
PCollection<KV<MyKey, MyAggregated>> aggregated = groupedByMyKey.apply(
    Combine.<MyKey, MyValue, MyAggregated>groupedValues(new Aggregator()));
The Aggregator class extends CombineFn<MyValue, List<MyValue>, MyAggregated>.
Can you share a code snippet that triggers this? GroupedValues is a PTransform that is often used within various combining transforms, so it might be from using something like Min, Max, etc.
The error means that the DirectPipelineRunner doesn't know how to evaluate a GroupedValues. However, that's unexpected, since that should have been expanded into a ParDo before execution.
I found the reason for this behaviour.
I was using a command-line argument to run it in remote mode (--runner=BlockingDataflowPipelineRunner) and then forced it to run locally with:
PipelineRunner<?> runner = DirectPipelineRunner.fromOptions(options);
runner.run(p);
After removing these lines and just using the --runner=DirectPipelineRunner argument, it worked as expected.
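For reference, a minimal sketch of letting the command-line flag pick the runner instead of constructing one by hand (standard Dataflow SDK API; the transform steps are elided):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class Main {
    public static void main(String[] args) {
        // --runner=DirectPipelineRunner or --runner=BlockingDataflowPipelineRunner
        // on the command line is honored here; no runner is hard-coded.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        // ... apply transforms ...
        p.run();
    }
}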