Quartz scheduler rarely runs the same job twice at the same time in a cluster environment

This issue is not happening all the time; very rarely we observe the same job running twice, on the same node or on different nodes, at exactly the same time.
We are using a 4-node cluster.
We have already set org.quartz.jobStore.isClustered = true in quartz.properties.
Do we still need to add the @DisallowConcurrentExecution annotation at the job class level?
We create our job classes by implementing the org.quartz.Job interface.
Any help is greatly appreciated.
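For reference, the annotation in question is applied at the job class level. The following is a minimal sketch (the class name and job body are illustrative, not taken from the question):

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// With this annotation, Quartz will not execute two instances of the same
// JobDetail (same JobKey) concurrently, which is a separate guarantee from
// the cluster coordination enabled by org.quartz.jobStore.isClustered = true.
@DisallowConcurrentExecution
public class ReportJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // actual job work goes here
    }
}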

Related

Quartz.Net is Sometimes Running Overlapping Tasks

I am using Quartz.Net 3.0.7 to manage a scheduler. In my test environment I have two instances of the scheduler running. I have a test process that runs for exactly 2 hours before ending. Quartz is configured to start the process every 10 seconds and I am using the DisallowConcurrentExecution attribute to prevent multiple instances of the task from running at the same time. 80% of the time this is working as expected. Quartz will start up the process and prevent any other instances of the task from starting until after the initial one has completed. If I stop one of the two services hosting Quartz, then the other instance picks up the task at the next 10-second mark.
However, after keeping these two Quartz services running for 48 uninterrupted hours, I have discovered a couple of times where things went horribly wrong. At times host B will start up the task, even though the task is still in the middle of its 2-hour execution on host A. At one point I even found the process had started up 3 times on host B, all within a 10-minute period. So, for a two-hour period, the one task had three instances running simultaneously. After all three finished, Quartz went back to the expected schedule of only having one instance running at a time.
If these overlapping tasks were happening 100% of the time, I would think there was something wrong on my end, but since it seems to happen only about 20% of the time, I am thinking it must be something in the Quartz implementation. Is this by design or is it a bug? If there is an event I can capture from Quartz.Net to tell me that another instance of a task has started up, I can listen for that and stop the existing task from running. I just need to make sure that DisallowConcurrentExecution is getting obeyed and prevent a task from running multiple instances concurrently. Thanks.
Edit:
I added logic that uses context.Scheduler.GetCurrentlyExecutingJobs to look for any jobs that have the same JobDetail.Key but a different FireInstanceId when my task starts up. If I find another currently executing job, I will prevent this instance from doing anything. I am finding that in the duplicate concurrent scenario, Quartz is reporting that there are no other jobs currently executing with the same JobDetail.Key. Should that be possible? Under what case would Quartz.Net start an IJob, lose track of it as an executing job after a few minutes, but allow it to continue executing without cancelling the CancellationToken?
Edit2:
I found an instance in my logs where Quartz started a task as expected. Then, one minute later, Quartz tried to start up 9 additional instances, each with a different FireInstanceId. My custom code blocked the 9 additional instances, because it can see that the original instance was still going, by calling GetCurrentlyExecutingJobs to get a list of running jobs. I double checked and the ConcurrentExecutionDisallowed flag is true on all of the tasks at runtime, so I would expect that Quartz would prevent the duplicate instances. This sounds like a bug. Am I expected to handle this manually or should I expect Quartz to get this right?
Edit3:
I am definitely looking at two different problems. In both cases Quartz.Net is launching my IJob instance with a new FireInstanceId while there is already another FireInstanceId running for the same JobKey. In one scenario I can see that both FireInstanceIds are active by calling GetCurrentlyExecutingJobs. In the second scenario calling GetCurrentlyExecutingJobs shows that the first FireInstanceId is no longer running, even though I can see from my logs that the original instance is still running. Both of these scenarios result in multiple instances of my IJob running at the same time, which is not acceptable. It is easy enough to tackle the first scenario by calling GetCurrentlyExecutingJobs when my IJob starts, but the second scenario is harder. I will have to ping GetCurrentlyExecutingJobs on an interval and stop the task if its FireInstanceId has disappeared from the active list. Has anyone else really not noticed this behavior?
I found that if I set this option, I no longer have overlapping executing jobs. I still wish that Quartz would cancel the job's cancellation token, though, if it lost track of the executing job.
QuartzProperties.Add("quartz.jobStore.clusterCheckinInterval", "60000");
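For the Java Quartz scheduler in the original question above, the defensive check described in the first edit maps onto the Java API roughly as follows. This is only an illustrative sketch (the helper class name is invented), and note that getCurrentlyExecutingJobs only reports jobs running on the local scheduler instance, not on other nodes of the cluster:

import java.util.List;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;

public final class DuplicateRunGuard {
    // Returns true if another fire instance of the same job is already
    // executing on this scheduler instance.
    public static boolean isDuplicate(JobExecutionContext ctx) throws SchedulerException {
        Scheduler scheduler = ctx.getScheduler();
        List<JobExecutionContext> running = scheduler.getCurrentlyExecutingJobs();
        for (JobExecutionContext other : running) {
            if (other.getJobDetail().getKey().equals(ctx.getJobDetail().getKey())
                    && !other.getFireInstanceId().equals(ctx.getFireInstanceId())) {
                return true;
            }
        }
        return false;
    }
}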

Massive-Distributed Parallel Execution of tasks

We are currently struggling with the following task. We need to run a Windows application (which only works as a single instance) 1000 times with different input parameters. A single run of this application can take multiple hours. It feels like we have the same problem as any video rendering farm – each frame of a video should be calculated independently and in parallel – but our workload is not rendering.
So far we have tried to execute it with Jenkins and Pipeline jobs. We use the parallel step in the pipeline and let Jenkins queue and execute the application, and we use the Jenkins Label Expression to let Jenkins choose which node each job can run on.
The current limitation in Jenkins is with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains several hundred jobs, adding new jobs takes much longer – and it gets worse as the queue grows. The main problem: Jenkins starts executing the parallel pipeline part-jobs only after it has finished adding all of them to the queue.
We have already investigated some ideas for solving this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but a complete run like we have in Jenkins => deploying and collecting results looks complex
b. Bidirectional client->server communication is needed – no chance to bring it online through a NAT (VM server)
BOINC: https://boinc.berkeley.edu/
a. As far as we understand, we would have to extend the backend massively to get our jobs working => to configure the jobs in BOINC we would have to write a lot of new automation code
b. Currently we need a pre-deployed application which can distinguish between different inputs => no equivalent of the Jenkins Label Expression
Any ideas on how to solve this?
Thanks in advance

Error in docs about Neo4j cluster joining?

I'm trying to understand how Neo4j cluster creation/joining works, as it is not behaving properly in our application.
So I'm starting from scratch and creating a 3-box cluster as per the tutorial: http://neo4j.com/docs/2.3.4/ha-setup-tutorial.html
The following note is copy/pasted from the tutorial:
Startup Time: When running in HA mode, the startup script returns immediately instead of waiting for the server to become available. This is because the instance does not accept any requests until a cluster has been formed. In the example above this happens when you start the second instance. To keep track of the startup state you can follow the messages in console.log — the path is printed before the startup script returns.
However, when I start up the second instance, my cluster is still not formed... I need to start up the third one for the cluster to start.
Is this an error in the neo4j docs?
Furthermore, is there a way to "force" an instance to become a master on cluster startup? For example, if I have 3 nodes and 2 of them fail and need to be re-installed, when I restart the cluster, how can I force the one with the valid database to become master? Isn't there a chance the 2nd or 3rd one with a blank database would become master?
When you start a cluster for the first time, or stop all instances and then start them again, the initial cluster MUST consist of all members listed in ha.initial_hosts. In addition, all instances in the cluster should have the exact same entries in ha.initial_hosts for the cluster to come up quickly and cleanly. The cluster will not form until all of the instances are up and running.
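As an illustration of that answer, for the three-box tutorial setup the HA section of conf/neo4j.properties would look roughly like this; the hostnames are placeholders, and only ha.server_id differs per instance:

# Identical on all three boxes except ha.server_id (1, 2, 3)
ha.server_id=1
ha.initial_hosts=neo4j-01:5001,neo4j-02:5001,neo4j-03:5001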

Is it possible to run a Neo4j cluster with strong consistency?

The docs of Neo4j state that when running in HA mode, you get eventual consistency. This is a quote from that page:
All updates will however propagate from the master to other slaves eventually so a write from one slave may not be immediately visible on all other slaves
My question is: is there a configuration that will allow me to run a cluster with strong consistency, of course at the cost of reduced performance? I'm looking for some sort of active-passive failover cluster configuration.
There is such a config option. ha.tx_push_factor determines how many slaves a transaction should be pushed to synchronously. When you set it to ha.tx_push_factor = <cluster size> - 1, you have immediate full consistency.
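As a concrete sketch for a three-instance cluster (cluster size 3, so push factor 3 - 1 = 2), that would mean setting the following in conf/neo4j.properties on each instance:

# Push every committed transaction synchronously to the two other instances
ha.tx_push_factor=2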

Why aren't JobListeners Durable in Quartz.NET?

I'm trying to chain a few jobs in Quartz.NET through JobChainingJobListener. I first create a couple of durable jobs (while using ADO JobStore with SQL Server) and this part works well - the jobs are visible across Quartz.NET restarts.
When I chain my jobs with Scheduler.ListenerManager.AddJobListener(listener, matchers), the listener fires correctly, but its definition cannot be made durable in the database. After every server restart, I have to define all listeners again.
Looking at the DB tables, there are no tables for listeners, nor does the code for ListenerManagerImpl contain any hints of listener persistence.
I'm planning to add listener durability and reload the global listener dictionary on server restart. Before I do that, I'm wondering whether there are any reasons the project does not already do so. Considering how mature Quartz.NET is, someone would already have run into this, so it seems I'm missing something.
Can anyone please point to any pitfalls in implementing listener durability?
From Quartz's perspective, listeners are just a configuration issue, just like the job store type or other settings for the library. Listeners are commonly stateless and thus need no persistence services, unlike triggers and jobs, which hold state that needs to be persisted between invocations and across job processing nodes.
If you have a sound configuration management plan, this shouldn't be an issue. Just handle the listener configuration like you would any other aspect of the setup. If you have state in your listeners that would need storage between restarts, that's a different story; then you would naturally need custom persistence.
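As a sketch of that approach: Quartz.NET is a port of the Java Quartz library, and in the Java API re-registering the chain on every application start looks roughly like this (the listener name and job keys are illustrative):

import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.matchers.EverythingMatcher;
import org.quartz.listeners.JobChainingJobListener;

public final class JobChainBootstrap {
    // Run once per application start: the chain lives only in memory,
    // so it has to be rebuilt after every restart.
    public static void registerChain(Scheduler scheduler) throws SchedulerException {
        JobChainingJobListener chain = new JobChainingJobListener("etl-chain");
        chain.addJobChainLink(JobKey.jobKey("extract", "etl"), JobKey.jobKey("transform", "etl"));
        chain.addJobChainLink(JobKey.jobKey("transform", "etl"), JobKey.jobKey("load", "etl"));
        scheduler.getListenerManager().addJobListener(chain, EverythingMatcher.allJobs());
    }
}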
