Does creating Cloud Workflows executions really fail when the number of executions in the Failed state reaches 2000? - quota

I have a question about the "Concurrent executions" limit in the following document.
https://cloud.google.com/workflows/quotas?hl=en#request_limit
I would like to either leave executions in a failed state or delete failed executions.
But my reading of the document is as follows:
Executions cannot be deleted.
Once 2,000 executions have accumulated in the failed state, no new executions can be created.
I would first like to confirm whether this understanding is correct.
Does every workflow execution have to succeed?

The Concurrent executions quota only counts executions that have not yet completed or failed. Executions in a failed state are not counted, so you are fine to leave them (which is usually what you want, so you can inspect and process them later).
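If you later want to inspect those failed executions, a minimal sketch using the google-cloud-workflow-executions Java client could look like the following (the project, location and workflow IDs are placeholders, and this is only one way to query the state):

import com.google.cloud.workflows.executions.v1.Execution;
import com.google.cloud.workflows.executions.v1.ExecutionsClient;
import com.google.cloud.workflows.executions.v1.WorkflowName;

public class ListFailedExecutions {
  public static void main(String[] args) throws Exception {
    // Placeholders: substitute your own project, location and workflow IDs.
    WorkflowName parent = WorkflowName.of("my-project", "us-central1", "my-workflow");
    try (ExecutionsClient client = ExecutionsClient.create()) {
      for (Execution execution : client.listExecutions(parent).iterateAll()) {
        // Failed executions stay listable here but do not count against
        // the concurrent executions quota.
        if (execution.getState() == Execution.State.FAILED) {
          System.out.println(execution.getName());
        }
      }
    }
  }
}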

Related

Sidekiq multiple dependent jobs, when to complete or retry?

In my Rails application, I have a model called Report
A Report has one or many chunks (the Chunk model), each of which generates a piece of content based on external service calls (APIs, etc.).
When a user requests a report, I use Sidekiq to queue the chunks' jobs so they run in the background, and notify the user that we will email them the result once the report is generated.
Report uses a state machine to flag whether or not all the jobs have finished successfully. All the chunks must be completed before we flag the report as ready. If one fails, we need to either try again, or give up at some point.
I determined the states as draft (default), working, and finished. The finished result is a combination of all the services' pieces put together. 'Draft' is when the chunks are still in the queue and none of them has started generating any content.
How would you tackle this situation with Sidekiq? How do you keep track (live) of which chunks' services are finished, working, or failed, so we can flag the report finished or failed?
I'd like a way to periodically check the jobs to see where they stand, and change the state when they have all finished successfully, or flag it failed if all the retries give up!
Thank you
We had a similar need in our application: determining when Sidekiq jobs were finished during automated testing.
What we used is the sidekiq-status gem: https://github.com/utgarda/sidekiq-status
Here's the rough usage (the worker class needs to include Sidekiq::Status::Worker):
job_id = Job.perform_async()
You'd then pass the job ID to the place where it will check the status of the job:
Sidekiq::Status::status(job_id) #=> :working, :queued, :failed, :complete
Hope this helps.
This is a Sidekiq Pro feature called Batches.
https://github.com/mperham/sidekiq/wiki/Batches

TFS 2015 "Build Job Timeout" results in no logs

We have a max execution time set for tests, but frankly this option is about as much use as a chocolate teapot.
When the execution time exceeds this limit, the whole build fails and all subsequent steps are aborted, so the "Publish Test Results" step never executes and you get absolutely no information to help you work out WHY the timeout period was exceeded.
Can anyone suggest an alternative?
I was thinking of maybe trying to implement the timeout as part of the test code itself - does anyone know if this is possible? If I launch a thread that monitors for the timeout, and if it is hit, then...?
Could I just have the test terminate its own process?
Build job timeout in minutes
Specify the maximum time a build job is allowed to execute on an agent
before being canceled by the server. Leave it empty or at zero if you want the job to never be canceled by the server.
This timeout covers all the tasks in your build definition, so if the test step runs out of time, the whole build definition is canceled by the server. Certainly, the whole build fails and all subsequent steps are aborted.
Based on your requirement, I suggest leaving this value at 0 and setting "Continue on error" for the test step, just as the comment suggested. With this, if the step fails, all the following steps still execute and the step/overall build partially succeeds. You can then get the related information to troubleshoot the failed task.
If your tests will not themselves judge whether the execution has timed out, another way is to create a custom build task that collects the execution time of your test task (by reading the build log) and then passes or fails the custom step using Logging Commands.
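As a very rough sketch of the Logging Commands part (a real custom task in TFS 2015 would usually be a PowerShell or Node script, but the commands are just lines written to a step's standard output; the threshold and the way the elapsed time is obtained here are assumptions):

public class FailIfTooSlow {
  public static void main(String[] args) {
    // Hypothetical input: elapsed test time in minutes, e.g. parsed from the build log.
    long elapsedMinutes = args.length > 0 ? Long.parseLong(args[0]) : 0;
    long limitMinutes = 30; // assumed threshold

    if (elapsedMinutes > limitMinutes) {
      // TFS/VSTS logging commands are picked up from a task's standard output.
      System.out.println("##vso[task.logissue type=error]Tests took " + elapsedMinutes
          + " minutes, exceeding the " + limitMinutes + " minute limit");
      System.out.println("##vso[task.complete result=Failed;]Test duration limit exceeded");
    }
  }
}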

Can I trust a pipeline 'succeeded' status with OutOfMemoryError in the job log?

I have a dataflow job with Autoscaling enabled, which resized the worker pool to 14 during execution. By the time the job had finished the job log reported 6 OutOfMemoryErrors but the whole pipeline, as well as each execution step, had status succeeded. Can I trust the job status, or could I have data loss due to the worker failures?
You can trust the job status and results, because Dataflow is designed to process data in a way that is resilient to such failures. Further information can be found in the description of Service Optimization and Execution. Specifically:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).

Dataflow API retries several times after data format exception

I think this may be a needed improvement to the Dataflow API, but I may be wrong.
I created a batch Dataflow job, and by mistake one of the lines in my input file had an invalid data format.
The pipeline job threw a DataFormatException, but instead of stopping the job right away it retried several times (~4 times) before stopping.
I see this as the wrong behavior. When a batch Dataflow job encounters invalid data, it should stop immediately instead of retrying several times and then stopping.
Ideas?
It seems like Dataflow is trying to build in some fault tolerance, which is a good thing, and this behaviour is clearly documented ("How are Java exceptions handled in Dataflow?").
If you don't want this behaviour, write your own exception handling code inside your DoFn and drop or divert the bad element yourself instead of letting the exception propagate, so the element isn't retried.
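For example, a rough sketch of catching the failure inside the DoFn (shown with Apache Beam SDK classes; the class name and the "name,count" line format are made up for illustration):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SafeParseFn extends DoFn<String, KV<String, Integer>> {
  private static final Logger LOG = LoggerFactory.getLogger(SafeParseFn.class);

  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      // Stand-in parsing for "name,count" lines; your real parsing goes here.
      String[] parts = c.element().split(",", 2);
      c.output(KV.of(parts[0], Integer.parseInt(parts[1].trim())));
    } catch (RuntimeException e) {
      // Catch the failure yourself and drop (or divert to a dead-letter output)
      // the bad line instead of rethrowing, so Dataflow does not fail and retry the bundle.
      LOG.warn("Skipping malformed input line: {}", c.element(), e);
    }
  }
}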

Creating a FIFO queue in SWF to control access to critical code sections

At the moment we have an Amazon Simple Workflow application that has a few tasks that can occur in parallel at the beginning of the process, followed by one path through a critical region where we can only allow one process to proceed.
We have modeled the critical region as a child workflow and we only allow one process to run in the child workflow at a time (though there is a race condition in our code that hasn't caused us issues yet). This is doing the job, but it has some issues.
We have a method that keeps checking whether the child workflow is running: if it isn't, the caller proceeds (the race condition mentioned above - the "is running" check and starting the run are not an atomic operation); otherwise it throws an exception and retries with exponential backoff. The problems are:
1. With multiple workflows entering, which workflow proceeds first is non-deterministic; it would be better if this were a FIFO queue.
2. We can end up waiting a long time for the next workflow to start, so there is wasted time; it would be nice if each workflow proceeded as soon as the previous one had finished.
We can address point 2 by reducing the retry interval, but we would still have the non-FIFO problem.
I can imagine modeling this quite easily on a single machine with a queue and locks, but what are our options in SWF?
You can have "critical section" workflow that is always running. Then signal it to "queue" execute requests. Upon receiving signal the "critical section" workflow either starts activity if it is not running or queues the request in the decider. When activity execution completes the "response" signal is sent back to the requester workflow. As "critical section" workflow is always running it has periodically restart itself as new (passing list of outstanding requests as a parameter) the same way all cron workflows are doing.

Resources