Does PurgeInstanceHistoryAsync remove old history for infinite orchestrations that use ContinueAsNew - azure-durable-functions

I have an orchestration that runs as a singleton by using the same instance id each time. It also runs infinitely by using ContinueAsNew at the end of each iteration to keep the history manageable. However, I have noticed that the history of each past iteration is kept in the history table, each with a different execution id (as is expected when ContinueAsNew is called).
I also use PurgeInstanceHistoryAsync once a day to delete any completed, failed, terminated or cancelled orchestrations that are more than 14 days old. However, since the infinite singleton orchestration is never in any of these states, will PurgeInstanceHistoryAsync ever clean up the old execution histories?
The same question can be asked for a periodic singleton orchestration (i.e. an orchestration that runs periodically but uses the same instance ID each time). If the purge process happens whilst the orchestration is running, will any old histories be removed, or would it be a matter of luck that the orchestration is not actually running at the time the purge executes?

If you look at the history table in your Azure Storage account and query for your instance, you should see that using ContinueAsNew actually purges history automatically. (In my test the table seemed to be at most 1 execution behind.)
From Docs: https://learn.microsoft.com/sv-se/azure/azure-functions/durable/durable-functions-eternal-orchestrations?tabs=csharp#resetting-and-restarting
When ContinueAsNew is called, the instance enqueues a message to itself before it exits. The message restarts the instance with the new input value. The same instance ID is kept, but the orchestrator function's history is effectively truncated.
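To make the two behaviours concrete, here is a minimal toy model in Python. It is not the Durable Functions API: the `Instance` class, the `purge` function and the status names are hypothetical stand-ins. It only sketches the idea that a status-filtered purge skips a running instance, while ContinueAsNew itself starts a fresh history under the same instance ID.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
import uuid

# Terminal states that the daily purge in the question filters on (assumed names).
TERMINAL = {"Completed", "Failed", "Terminated", "Canceled"}


@dataclass
class Instance:
    instance_id: str
    last_updated: datetime
    status: str = "Running"
    execution_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    history: list = field(default_factory=list)

    def record(self, event: str) -> None:
        # Every history row is tagged with the current execution id.
        self.history.append((self.execution_id, event))

    def continue_as_new(self) -> None:
        # Same instance id, new execution id, history starts over.
        # (The answer observed the real table lagging by at most one execution.)
        self.execution_id = uuid.uuid4().hex
        self.history = []
        self.record("ExecutionStarted")


def purge(instances, older_than: timedelta, now: datetime):
    """Return the instances that survive a purge of old terminal instances."""
    cutoff = now - older_than
    return [
        i for i in instances
        if not (i.status in TERMINAL and i.last_updated < cutoff)
    ]
```

In this model the eternal singleton never matches the purge filter, but its history never grows beyond the current execution either, which is the point the answer makes.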

Related

Will adding a new matching service to a Temporal cluster make all cached queues outdated?

The matching service uses consistent hashing to decide which queue is assigned to which server.
Most of the time, the server polls tasks from the cache instead of the persistent database.
If I add a new matching service, all cached queues will be re-hashed to new places, and this will make all the old caches outdated. Will that cause any problems?
Most of the time tasks are not cached but matched immediately to a waiting long poll. We call it a sync match. So adding a matching service shouldn't affect the health of the running applications.
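As a rough illustration of what a sync match means, here is a toy Python model. It is not Temporal's implementation; the `TaskQueue` class and its method names are hypothetical. A task is handed straight to a poller that is already waiting, and is only buffered when nobody is polling.

```python
from collections import deque


class TaskQueue:
    """Toy task queue: sync match first, backlog only as a fallback."""

    def __init__(self):
        self.waiting_pollers = deque()  # pollers parked in a long poll
        self.backlog = deque()          # tasks that found no waiting poller

    def poll(self, poller: str):
        """Return a buffered task if one exists, else park the poller."""
        if self.backlog:
            return self.backlog.popleft()
        self.waiting_pollers.append(poller)
        return None

    def add_task(self, task: str):
        """Return the poller that received the task, or None if it was buffered."""
        if self.waiting_pollers:
            # Sync match: the task never touches the backlog.
            return self.waiting_pollers.popleft()
        self.backlog.append(task)
        return None
```

When pollers are usually waiting, the backlog stays empty, so there is little cached state to lose when queue ownership moves to another host.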

Quartz.Net is Sometimes Running Overlapping Tasks

I am using Quartz.Net 3.0.7 to manage a scheduler. In my test environment I have two instances of the scheduler running. I have a test process that runs for exactly 2 hours before ending. Quartz is configured to start the process every 10 seconds and I am using the DisallowConcurrentExecution attribute to prevent multiple instances of the task from running at the same time. 80% of the time this is working as expected. Quartz will start up the process and prevent any other instances of the task from starting until after the initial one has completed. If I stop one of the two services hosting Quartz, then the other instance picks up the task at the next 10-second mark.
However, after keeping these two Quartz services running for 48 uninterrupted hours, I have discovered a couple of times where things went horribly wrong. At times host B will start up the task, even though the task is still in the middle of its 2 hour execution on host A. At one point I even found the process had started up 3 times on host B, all within a 10 minute period. So, for a two hour period, the one task had three instances running simultaneously. After all three finished, Quartz went back to the expected schedule of only having one instance running at a time.
If these overlapping tasks were happening 100% of the time, I would think there is something wrong on my end, but since it seems to happen only about 20% of the time, I am thinking it must be something in the Quartz implementation. Is this by design or is it a bug? If there is an event I can capture from Quartz.Net to tell me that another instance of a task has started up, I can listen for that and stop the existing task from running. I just need to make sure that DisallowConcurrentExecution is getting obeyed and prevent a task from running multiple instances concurrently. Thanks.
Edit:
I added logic that uses context.Scheduler.GetCurrentlyExecutingJobs to look for any jobs that have the same JobDetail.Key but a different FireInstanceId when my task starts up. If I find another currently executing job, I will prevent this instance from doing anything. I am finding that in the duplicate concurrent scenario, Quartz is reporting that there are no other jobs currently executing with the same JobDetail.Key. Should that be possible? Under what case would Quartz.Net start an IJob, lose track of it as an executing job after a few minutes, but allow it to continue executing without cancelling the CancellationToken?
Edit2:
I found an instance in my logs where Quartz started a task as expected. Then, one minute later, Quartz tried to start up 9 additional instances, each with a different FireInstanceId. My custom code blocked the 9 additional instances, because it can see that the original instance was still going, by calling GetCurrentlyExecutingJobs to get a list of running jobs. I double checked and the ConcurrentExecutionDisallowed flag is true on all of the tasks at runtime, so I would expect that Quartz would prevent the duplicate instances. This sounds like a bug. Am I expected to handle this manually or should I expect Quartz to get this right?
Edit3:
I am definitely looking at two different problems. In both cases Quartz.Net is launching my IJob instance with a new FireInstanceId while there is already another FireInstanceId running for the same JobKey. In one scenario I can see that both FireInstanceIds are active by calling GetCurrentlyExecutingJobs. In the second scenario calling GetCurrentlyExecutingJobs shows that the first FireInstanceId is no longer running, even though I can see from my logs that the original instance is still running. Both of these scenarios result in multiple instances of my IJob running at the same time, which is not acceptable. It is easy enough to tackle the first scenario by calling GetCurrentlyExecutingJobs when my IJob starts, but the second scenario is harder. I will have to poll GetCurrentlyExecutingJobs on an interval and stop the task if its FireInstanceId has disappeared from the active list. Has anyone else really not noticed this behavior?
I found that if I set the option below, I no longer have overlapping executing jobs. I still wish that Quartz would cancel the job's cancellation token, though, if it lost track of the executing job.
QuartzProperties.Add("quartz.jobStore.clusterCheckinInterval", "60000");
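The two manual checks described in the edits above can be sketched as follows. This is plain Python, not Quartz.Net code: `ExecutingJob` is a hypothetical stand-in for the job key and FireInstanceId that Quartz exposes on an executing job's context, and the list argument stands in for the result of GetCurrentlyExecutingJobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutingJob:
    job_key: str            # stand-in for JobDetail.Key
    fire_instance_id: str   # stand-in for FireInstanceId


def should_run(me: ExecutingJob, currently_executing) -> bool:
    """Scenario 1: refuse to start if another fire instance of the same job is active."""
    return not any(
        job.job_key == me.job_key and job.fire_instance_id != me.fire_instance_id
        for job in currently_executing
    )


def still_tracked(me: ExecutingJob, currently_executing) -> bool:
    """Scenario 2: polled on an interval; False means the scheduler lost track of us."""
    return any(job == me for job in currently_executing)
```

A job would call `should_run` once at startup and `still_tracked` periodically, stopping itself when the latter returns False. As far as I know, the Quartz documentation describes GetCurrentlyExecutingJobs as not cluster-aware, so a check like this only sees jobs on the local scheduler instance and cannot replace correct cluster configuration.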

Creating a FIFO queue in SWF to control access to critical code sections

At the moment we have an Amazon Simple Workflow application that has a few tasks that can occur in parallel at the beginning of the process, followed by one path through a critical region where we can only allow one process to proceed.
We have modeled the critical region as a child workflow and we only allow one process to run in the child workflow at a time (though there is a race condition in our code that hasn't caused us issues yet). This is doing the job, but it has some issues.
We have a method that keeps checking whether the child workflow is running. If it isn't, the method proceeds (this is the race condition mentioned above: the "is running" check and the start are not one atomic operation); otherwise it throws an exception and retries with exponential backoff. The problems are:
1. With multiple workflows entering, which workflow proceeds first is non-deterministic. It would be better if this were a FIFO queue.
2. We can end up waiting a long time for the next workflow to start, so time is wasted. It would be nice if the next workflow proceeded as soon as the last one had finished.
We can address point 2 by reducing the retry interval, but we would still have the non-FIFO problem.
I can imagine modeling this quite easily on a single machine with a queue and locks, but what are our options in SWF?
You can have a "critical section" workflow that is always running, and signal it to "queue" execute requests. Upon receiving a signal, the "critical section" workflow either starts the activity if it is not already running, or queues the request in the decider. When the activity execution completes, a "response" signal is sent back to the requester workflow. Because the "critical section" workflow is always running, it has to periodically restart itself as new (passing the list of outstanding requests as a parameter), the same way all cron workflows do.
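The decider logic described above can be sketched as a small state machine. This is a plain Python model under my own assumptions, not SWF or Flow Framework code: the class and method names are hypothetical, signals are modelled as method calls, and restarting "as new" is modelled by returning a fresh object seeded with the outstanding requests.

```python
from collections import deque


class CriticalSectionWorkflow:
    """Toy decider: runs one request at a time, in arrival (FIFO) order."""

    def __init__(self, outstanding=()):
        self.pending = deque(outstanding)  # queued requester ids
        self.active = None                 # requester currently in the critical section
        self.responses = []                # "response" signals sent by this run
        self._start_next()

    def on_request(self, requester_id: str) -> None:
        """Signal handler: queue the request, start it if nothing is running."""
        self.pending.append(requester_id)
        self._start_next()

    def on_activity_completed(self, restart: bool = False):
        """Respond to the requester, then start the next request or restart as new."""
        self.responses.append(self.active)
        self.active = None
        if restart:
            # Restart as new: the fresh run receives the outstanding requests.
            return CriticalSectionWorkflow(outstanding=self.pending)
        self._start_next()
        return self

    def _start_next(self) -> None:
        if self.active is None and self.pending:
            self.active = self.pending.popleft()
```

Because the next request is started in the same step that handles the completion, there is no polling delay, and the queue gives the FIFO ordering the question asks for.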

PostgreSQL + Rails concurrency clarification

I'm building a background job that's updating users' statistics for a web application. The job currently takes 55-60 seconds, and I'm concerned about what would happen if a user were to try to load his stats page at the same time that job is running.
From what I've read about PostgreSQL and concurrency, if two clients attempt to access the same row (one updating and one reading), and I'm not explicitly starting any transactions, the first one just has to wait for the second one to finish.
So if I'm understanding that correctly, the only performance hit I'm likely to incur is on the infinitesimally small chance that a user tries to load his stats page at the same moment that the row is being updated. It's not like the whole stats table is locked up during the 55-60 second job unless I were to explicitly configure Postgres to do that, right?
Is that a correct interpretation? Are there other factors I'm missing?
(I mention the Rails part just in case it has any bearing on the above scenario)
(Also: the PostgreSQL version is 9.0.4)
It depends on the transaction isolation level. If I understand your case, you are asking about the delay involved in avoiding a dirty read. And yes, a dirty read is impossible at the default isolation level. Note that in PostgreSQL a plain SELECT does not wait for the writer, even on the row that is being updated: under MVCC it simply reads the last committed version of that row.
Read Committed is the default isolation level in PostgreSQL. When a transaction runs on this isolation level, a SELECT query sees only data committed before the query began;
specs on ISOLATION
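The Read Committed rule quoted above can be pictured with a toy model. This is plain Python, not PostgreSQL internals; the `Row` class is hypothetical and keeps just one committed value and one pending value. As far as the PostgreSQL documentation describes MVCC, a plain SELECT is served from the last committed row version rather than waiting for the writer.

```python
class Row:
    """Toy row with one committed version and at most one uncommitted write."""

    def __init__(self, value):
        self.committed = value
        self.pending = None
        self.has_pending = False

    def begin_update(self, value) -> None:
        # The writer's change is invisible to readers until commit.
        self.pending = value
        self.has_pending = True

    def commit(self) -> None:
        if not self.has_pending:
            raise RuntimeError("nothing to commit")
        self.committed = self.pending
        self.pending = None
        self.has_pending = False

    def select(self):
        # Readers never see the uncommitted value, so no dirty read.
        return self.committed
```

In terms of the question, a stats page loaded while the background job is mid-update would show the previous committed numbers, and the new ones after the job commits.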

Quartz.Net jobs not always running - can't find any reason why

We're using Quartz.Net to schedule about two hundred repeating jobs. Each job uses the same IJob implementing class, but they can have different schedules. In practice, they end up having the same schedule, so we have about two hundred job details, each with their own (identical) repeating/simple trigger, scheduled. The interval is one hour.
The task this job performs is to download an rss feed, and then download all of the media files linked to in the rss feed. Prior to downloading, it wipes the directory where it is going to place the files. A single run of a job takes anywhere from a couple seconds to a dozen seconds (occasionally more).
Our method of scheduling is to call GetScheduler() on a new StdSchedulerFactory (all jobs are scheduled at once into the same IScheduler instance). We follow the scheduling with an immediate Start().
The jobs appear to run fine, but upon closer inspection we are seeing that a minority of the jobs occasionally - or almost never - run.
So, for example, all two hundred jobs should have run at 6:40 pm this evening. Most of them did. But a handful did not. I determine this by looking at the file timestamps, which should certainly be updated if the job runs (because it deletes and redownloads the file).
I've enabled Quartz.Net logging, and added quite a few logging statements to our code as well.
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
After that, all activity stops. No jobs run, no log messages are created. Zero.
And then, at the next firing interval, Quartz starts up again and my log files update, and various files start downloading. But - it certainly appears like some JobDetail instances never make it to the head of the line (so to speak) or do so very infrequently. Over the entire weekend, some jobs appeared to update quite frequently, and recently, and others had not updated a single time since starting the process on Friday (it runs in a Windows Service shell, btw).
So ... I'm hoping someone can help me understand this behavior of Quartz.
I need to be certain that every job runs. If its trigger is missed, I need Quartz to run it as soon as possible. From reading the documentation, I thought this would be the default behavior - for SimpleTrigger with an indefinite repeat count it would reschedule the job for immediate execution if the trigger window was missed. This doesn't seem to be the case. Is there any way I can determine why Quartz is not firing these jobs? I am logging at the trace level and they just simply aren't there. It creates and executes an awful lot of jobs, but if I notice one missing - all I can find is that it ran it the last time (for example, sometimes it hasn't run for hours or days). Nothing about why it was skipped (I expected Quartz to log something if it skips a job for any reason), etc.
Any help would really, really be appreciated - I've spent my entire day trying to figure this out.
After reading your post, it sounds a lot like the handful of jobs that are not executing are very likely misfiring. The reason that I believe this:
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
In Quartz.NET the default misfire threshold is 1 minute. Chances are, you need to examine your logging configuration to determine why those misfire events are not being logged. I suggest you throw open the floodgates on your logging (i.e. set everything to debug, and make sure that you definitely have a logging directive for the Quartz scheduler class), and then rerun your jobs. I'm almost positive that the problem is that the misfire events are not showing up in your logs because the logging configuration is lacking something. This is understandable, because logging configuration can get very confusing, very quickly.
Also, in the future, you might want to consult the quartz.net forum on google, since that is where some of the more thorny issues are discussed.
http://groups.google.com/group/quartznet?pli=1
Now, as for your other question about setting the policy for what the scheduler should do, I can't specifically help you there, but if you read the API docs closely, and also consult the Google discussion group, you should be able to easily set the misfire policy flag that suits your needs. I believe that triggers have a MisfireInstruction property which you can configure.
Also, I would argue that misfires introduce a lot of "noise" and should be avoided; perhaps bumping up the thread count on your scheduler would be a way to avoid misfires? The other option would be to stagger your job execution into separate/multiple batches.
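To see why a bigger thread pool or staggered batches could help, here is a simplified back-of-the-envelope simulation in Python. It is not how Quartz.NET schedules internally: the function name is hypothetical, all jobs are assumed due at the same instant, and the pool size of 10 and the 6-second job duration are my assumptions for illustration. A job counts as misfired when no worker becomes free within the threshold.

```python
import heapq


def simulate_round(durations, thread_count, misfire_threshold):
    """Return indices of jobs that could not start within the threshold.

    All jobs are due at t=0; durations[i] is the run time of job i in seconds.
    """
    free_at = [0.0] * thread_count  # min-heap of times at which workers become free
    heapq.heapify(free_at)
    misfired = []
    for index, duration in enumerate(durations):
        start = heapq.heappop(free_at)
        if start > misfire_threshold:
            misfired.append(index)
            heapq.heappush(free_at, start)  # job skipped, worker stays free
        else:
            heapq.heappush(free_at, start + duration)
    return misfired


# 200 jobs of 6 seconds each, due at once, with a 60-second threshold.
jobs = [6.0] * 200
late_with_10_threads = simulate_round(jobs, thread_count=10, misfire_threshold=60.0)
late_with_40_threads = simulate_round(jobs, thread_count=40, misfire_threshold=60.0)
late_when_staggered = simulate_round(jobs[:100], thread_count=10, misfire_threshold=60.0)
```

Under these assumptions the 10-thread run works through jobs for about a minute and then leaves the tail of the list unstarted, which resembles the pattern in the question, while the 40-thread and staggered runs leave nothing behind.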
Good luck!
