Why does the task launcher prune the task instance history every time a new task instance is launched? - spring-cloud-dataflow

When the Spring Cloud Data Flow server uses the local deployer to handle task lifecycle management (launch, stop, etc.), the corresponding task execution log can only be obtained while the task execution status is RUNNING.
This is by design: the local task launcher prunes the task instance history every time a new task instance is launched, so access to the log is lost, as can be seen in the code here.

The reason was to avoid growing the number of task process IDs in the local deployer's in-process map. You can see the related issue here.
But this has side effects, as discussed in another thread: the task execution logs of previous instances cannot be shown in local deployer mode.
I think it would be reasonable to keep some fixed number X of task executions in history; that way the side effects are at least avoided for the most recent executions. I have created a GH issue to track this.
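As a rough illustration of that idea (not the actual local deployer code; the class and field names below are hypothetical), a bounded, insertion-ordered map that evicts only its oldest entries would keep the last X task executions, and therefore their logs, reachable:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only - not SCDF's actual implementation. A bounded,
// insertion-ordered map keeps the last MAX_HISTORY task executions instead of
// pruning the whole history on every new launch.
public class BoundedTaskHistory {

    private static final int MAX_HISTORY = 10;

    // Hypothetical stand-in for whatever the deployer tracks per execution.
    public static class TaskInstance { }

    private final Map<String, TaskInstance> taskHistory =
            new LinkedHashMap<String, TaskInstance>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, TaskInstance> eldest) {
                    // Evict only the oldest entry once the cap is exceeded, so the
                    // most recent executions (and their logs) remain accessible.
                    return size() > MAX_HISTORY;
                }
            };

    public void register(String executionId, TaskInstance instance) {
        taskHistory.put(executionId, instance);
    }

    public TaskInstance get(String executionId) {
        return taskHistory.get(executionId);
    }
}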

Related

Quartz.Net is Sometimes Running Overlapping Tasks

I am using Quartz.Net 3.0.7 to manage a scheduler. In my test environment I have two instances of the scheduler running. I have a test process that runs for exactly 2 hours before ending. Quartz is configured to start the process every 10 seconds, and I am using the DisallowConcurrentExecution attribute to prevent multiple instances of the task from running at the same time. 80% of the time this works as expected: Quartz starts up the process and prevents any other instances of the task from starting until after the initial one has completed. If I stop one of the two services hosting Quartz, then the other instance picks up the task at the next 10-second mark.
However, after keeping these two Quartz services running for 48 uninterrupted hours, I have discovered a couple of times where things went horribly wrong. At times host B will start up the task, even though the task is still in the middle of its 2 hour execution on host A. At one point I even found the process had started up 3 times on host B, all within a 10 minute period. So, for a two hour period, the one task had three instances running simultaneously. After all three finished, Quartz went back to the expected schedule of only having one instance running at a time.
If these overlapping tasks were happening 100% of the time, I would think something was wrong on my end, but since it seems to happen only about 20% of the time, I am thinking it must be something in the Quartz implementation. Is this by design or is it a bug? If there is an event I can capture from Quartz.Net to tell me that another instance of a task has started up, I can listen for that and stop the existing task from running. I just need to make sure that DisallowConcurrentExecution is obeyed and that a task cannot run multiple instances concurrently. Thanks.
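For reference, here is roughly what the setup described above looks like when sketched against the Java Quartz API (Quartz.Net is analogous, with [DisallowConcurrentExecution] as an attribute); the job class and the job/trigger identities are hypothetical.

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Sketch of the described setup against the Java Quartz API; Quartz.Net is
// analogous. Job and trigger identities are hypothetical.
@DisallowConcurrentExecution
public class LongRunningJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        doLongRunningWork(); // stands in for the ~2 hour process from the question
    }

    private void doLongRunningWork() { /* ... */ }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(LongRunningJob.class)
                .withIdentity("longRunningJob", "example")
                .build();

        // Fire every 10 seconds; DisallowConcurrentExecution is meant to prevent a
        // second execution of the same JobDetail while one is still running.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("every10s", "example")
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(10)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}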
Edit:
I added logic that uses context.Scheduler.GetCurrentlyExecutingJobs to look for any jobs that have the same JobDetail.Key but a different FireInstanceId when my task starts up. If I find another currently executing job, I will prevent this instance from doing anything. I am finding that in the duplicate concurrent scenario, Quartz is reporting that there are no other jobs currently executing with the same JobDetail.Key. Should that be possible? Under what case would Quartz.Net start an IJob, lose track of it as an executing job after a few minutes, but allow it to continue executing without cancelling the CancellationToken?
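For what it's worth, the guard described in this edit looks roughly like the following when sketched against the Java Quartz API (Quartz.Net exposes the same concepts via context.Scheduler.GetCurrentlyExecutingJobs and FireInstanceId); the class and method names are hypothetical.

import java.util.List;
import org.quartz.JobExecutionContext;
import org.quartz.JobKey;
import org.quartz.SchedulerException;

// Sketch of the guard described above, using the Java Quartz API
// (context.getScheduler().getCurrentlyExecutingJobs()). Names are hypothetical.
public final class ConcurrentExecutionGuard {

    private ConcurrentExecutionGuard() { }

    // Returns true if another execution of the same JobKey (with a different
    // FireInstanceId) is already reported as running by the scheduler.
    public static boolean otherInstanceRunning(JobExecutionContext context)
            throws SchedulerException {
        JobKey myKey = context.getJobDetail().getKey();
        String myFireInstanceId = context.getFireInstanceId();

        List<JobExecutionContext> executing =
                context.getScheduler().getCurrentlyExecutingJobs();

        for (JobExecutionContext other : executing) {
            boolean sameJob = other.getJobDetail().getKey().equals(myKey);
            boolean differentFiring =
                    !other.getFireInstanceId().equals(myFireInstanceId);
            if (sameJob && differentFiring) {
                return true; // skip this firing; another instance is active
            }
        }
        return false;
    }
}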
Edit2:
I found an instance in my logs where Quartz started a task as expected. Then, one minute later, Quartz tried to start up 9 additional instances, each with a different FireInstanceId. My custom code blocked the 9 additional instances, because it can see that the original instance was still going, by calling GetCurrentlyExecutingJobs to get a list of running jobs. I double checked and the ConcurrentExecutionDisallowed flag is true on all of the tasks at runtime, so I would expect that Quartz would prevent the duplicate instances. This sounds like a bug. Am I expected to handle this manually or should I expect Quartz to get this right?
Edit3:
I am definitely looking at two different problems. In both cases Quartz.Net is launching my IJob instance with a new FireInstanceId while another FireInstanceId is already running for the same JobKey. In one scenario I can see that both FireInstanceIds are active by calling GetCurrentlyExecutingJobs. In the second scenario, calling GetCurrentlyExecutingJobs shows that the first FireInstanceId is no longer running, even though I can see from my logs that the original instance is still running. Both of these scenarios result in multiple instances of my IJob running at the same time, which is not acceptable. It is easy enough to tackle the first scenario by calling GetCurrentlyExecutingJobs when my IJob starts, but the second scenario is harder: I will have to poll GetCurrentlyExecutingJobs on an interval and stop the task if its FireInstanceId has disappeared from the active list. Has anyone else really not noticed this behavior?
I found that if I set this option, that I no longer have overlapping executing jobs. I still wish that Quartz would cancel the job’s cancellation token, though, if it lost track of the executing job.
QuartzProperties.Add("quartz.jobStore.clusterCheckinInterval", "60000");

spring cloud dataflow - detect running task

I am using spring cloud data flow (1.3.0.RELEASE). I would like to detect a running task in order to prevent multiple instances of the same task being started.
I was looking at the task execution status features, specifically "End Time", but I noticed that sometimes a task execution can have "Start Time" set along with "Exit Code" set to 0 while "End Time" is not set.
Because of that, "End Time" does not look like a viable deciding factor.
Task execution list
What would be the best way to achieve that?
Thanks.
At the SCDF level, we don't (yet) have the native ability to control it as part of the orchestration layer. A few options are possible, though.
1) You could have the Task applications emit their lifecycle events via the task-events destination (queue or topic); these could be the standard event types or custom events. A stream could then be used as a decision point to trigger the subsequent launches.
2) The recent 2.0 M3 release of Spring Cloud Task can restrict the launch of multiple Task instances of the same type. This is controlled by setting spring.cloud.task.singleInstanceEnabled=true for each Task launch. With this flag set, a lock check is applied automatically when launching Task instances, so duplicate or unintended launches are prevented (see the sketch after this list).
3) If you cannot switch to 2.0 M3, you could, in theory, replicate the #2 solution above in your 1.x-based Task applications.
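A minimal sketch of option #2, assuming a Spring Cloud Task 2.0+ application; the class name is hypothetical, and in practice the property would typically be supplied as a launch-time application property rather than hard-coded as below.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;

// Minimal sketch, assuming Spring Cloud Task 2.0+: with
// spring.cloud.task.singleInstanceEnabled=true, the task takes a lock at startup,
// so a second concurrent launch of the same task fails instead of running in
// parallel. The class name is hypothetical.
@SpringBootApplication
@EnableTask
public class SingleInstanceTaskApplication {

    public static void main(String[] args) {
        SpringApplication.run(SingleInstanceTaskApplication.class,
                "--spring.cloud.task.singleInstanceEnabled=true");
    }
}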

Creating a FIFO queue in SWF to control access to critical code sections

At the moment we have an Amazon Simple Workflow application that has a few tasks that can occur in parallel at the beginning of the process, followed by one path through a critical region where we can only allow one process to proceed.
We have modeled the critical region as a child workflow and we only allow one process to run in the child workflow at a time (though there is a race condition in our code that hasn't caused us issues yet). This is doing the job, but it has some issues.
We have a method that keeps checking whether the child workflow is running; if it isn't, it proceeds (the race condition mentioned above: the "is running" check and starting the run are not an atomic operation), otherwise it throws an exception and retries with exponential backoff. The problems are:
1. With multiple workflows entering, which workflow proceeds first is non-deterministic; it would be better if this were a FIFO queue.
2. We can end up waiting a long time for the next workflow to start, so there is wasted time; it would be nice if each workflow proceeded as soon as the previous one finished.
We can address point 2 by reducing the retry interval, but we would still have the non-FIFO problem.
I can imagine modeling this quite easily on a single machine with a queue and locks, but what are our options in SWF?
You can have "critical section" workflow that is always running. Then signal it to "queue" execute requests. Upon receiving signal the "critical section" workflow either starts activity if it is not running or queues the request in the decider. When activity execution completes the "response" signal is sent back to the requester workflow. As "critical section" workflow is always running it has periodically restart itself as new (passing list of outstanding requests as a parameter) the same way all cron workflows are doing.

Service vs Scheduled Task intervals

If you have a recurring task that runs once per day, you use a Scheduled Task.
If you have a recurring task that runs every 10 seconds, you use a Service.
At what point do you switch between the two? Is there official guidance on this somewhere?
I'm not sure the interval is the main issue here.
Here are a few things to consider:
How much state does this task need in memory? Do you load things from a file or a DB?
Does the system that needs this task to run need to communicate with the task other than when it is running?
Do you need more control over the process lifecycle while the task is up?
You can see where I'm going with this: a service is a resident entity, and a scheduled task isn't.
I think it depends on whether your program is made for only one task or for more. If it just does one "stupid" thing (like running a stored procedure in a database every 20 seconds), I would consider a scheduled task, but if it does more than that and perhaps has some dependencies (on the time it runs, or on file operations), I would consider a service.
I would also consider a service if the interval between runs varies. Say your program runs a single stored procedure in a database and the schedule depends on whether it made "real" changes to the DB: if it did something, the next run is in 5 seconds, and if not, the next run is in 20 seconds. That is a perfect example of a case for a service.
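A small sketch of that variable-interval pattern as a resident service loop; the class name and the doWork stub are hypothetical.

import java.util.concurrent.TimeUnit;

// Sketch of the variable-interval pattern described above: a resident service
// loop that re-runs after 5 seconds when it actually changed something, and
// backs off to 20 seconds otherwise. doWork() is a hypothetical stand-in for
// the stored procedure call.
public class VariableIntervalWorker implements Runnable {

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            boolean madeChanges = doWork();
            long delaySeconds = madeChanges ? 5 : 20;
            try {
                TimeUnit.SECONDS.sleep(delaySeconds);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        }
    }

    // Returns true if the run made "real" changes (hypothetical stub).
    private boolean doWork() {
        return false;
    }

    public static void main(String[] args) {
        new Thread(new VariableIntervalWorker(), "variable-interval-worker").start();
    }
}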

rails backgroundjob running jobs in parallel?

I'm very happy with Bj so far, only I have this one issue:
When one process takes 1 or 2 hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when uploading to a server that times out regularly.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform specific details the design of Bj is quite simple: the
main Rails application submits jobs to a table stored in the database. The act
of submitting triggers exactly one of two things to occur:
1) a new long running background runner to be started
2) an existing background runner to be signaled
The background runner refuses to run two copies of itself for a given
hostname/rails_env combination. For example you may only have one background
runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself -
you need do nothing to start, monitor, or stop it - it just works. However,
some people will prefer to manage their own background process; see the 'External
Runner' section below for more on this.
The runner simply processes each job in a highest priority oldest-in fashion,
capturing stdout, stderr, exit_status, etc. and storing the information back
into the database while logging its actions. When there are no jobs to run,
the runner goes to sleep for 42 seconds; however, this sleep is interruptible,
such as when the runner is signaled that a new job has been submitted so,
under normal circumstances there will be zero lag between job submission and
job running for an empty queue.
You can learn more on the GitHub page here.
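As a language-agnostic illustration of the runner loop the README describes (this is not Bj's actual Ruby code, and the names are hypothetical): a single runner drains a job queue in highest-priority, oldest-first order and sleeps interruptibly when there is nothing to do.

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;

// Illustration of the runner loop described in the README above; not Bj's
// actual code. Names are hypothetical.
public class SingleRunnerSketch {

    static final class SubmittedJob {
        final int priority;        // higher runs first
        final long submittedAtMs;  // older runs first within the same priority
        final Runnable work;

        SubmittedJob(int priority, long submittedAtMs, Runnable work) {
            this.priority = priority;
            this.submittedAtMs = submittedAtMs;
            this.work = work;
        }
    }

    // Highest priority first, then oldest submission first.
    private final PriorityQueue<SubmittedJob> queue = new PriorityQueue<>(
            Comparator.<SubmittedJob>comparingInt(j -> -j.priority)
                    .thenComparingLong(j -> j.submittedAtMs));

    public synchronized void submit(SubmittedJob job) {
        queue.add(job);
        notifyAll(); // wake the runner, like Bj signalling its background runner
    }

    public void runForever() throws InterruptedException {
        while (true) {
            SubmittedJob next;
            synchronized (this) {
                next = queue.poll();
                if (next == null) {
                    // No work: sleep up to 42 seconds, but wake early on submit().
                    wait(TimeUnit.SECONDS.toMillis(42));
                    continue;
                }
            }
            // Bj additionally captures stdout, stderr and the exit status here.
            next.work.run();
        }
    }
}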
