Cadence activity task not retrying even after providing retry configuration - temporal

Cadence server version: 0.19.2
I have made the following observation: I have a Job workflow that triggers an encoding workflow (child workflow), which has an activity that handles encoding status. I have supplied a retry configuration and a heartbeat configuration in both the child workflow and the activity, in case the workflow or activity fails due to the server getting killed. However, out of, let's say, 100 jobs, I get about 20 where the activity doesn't retry: it fails on attempt 0 with timeout type heartbeat timeout. I am sharing the parent workflow JSON and child workflow JSON below, along with some screenshots.
Child workflow JSON
http://jsonblob.com/1007526462865293312
Parent Workflow JSON
http://jsonblob.com/1007526121943875584
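For reference, retry and heartbeat options on an activity are typically attached roughly like this (a minimal sketch using the Cadence Java client; the interface names, timeout values and retry numbers are made-up placeholders, not taken from the actual workflow code):

    import com.uber.cadence.activity.ActivityMethod;
    import com.uber.cadence.activity.ActivityOptions;
    import com.uber.cadence.common.RetryOptions;
    import com.uber.cadence.workflow.Workflow;
    import com.uber.cadence.workflow.WorkflowMethod;

    import java.time.Duration;

    public class EncodingRetryExample {

        /** Hypothetical child workflow interface. */
        public interface EncodingWorkflow {
            @WorkflowMethod
            void encode(String jobId);
        }

        /** Hypothetical activity that handles encoding status. */
        public interface EncodingStatusActivity {
            @ActivityMethod
            void handleEncodingStatus(String jobId);
        }

        public static class EncodingWorkflowImpl implements EncodingWorkflow {

            private final EncodingStatusActivity statusActivity =
                Workflow.newActivityStub(
                    EncodingStatusActivity.class,
                    new ActivityOptions.Builder()
                        .setScheduleToCloseTimeout(Duration.ofHours(2))
                        .setStartToCloseTimeout(Duration.ofHours(1))
                        // The activity must heartbeat more often than this,
                        // otherwise the attempt fails with a heartbeat timeout.
                        .setHeartbeatTimeout(Duration.ofSeconds(30))
                        .setRetryOptions(new RetryOptions.Builder()
                            .setInitialInterval(Duration.ofSeconds(5))
                            .setBackoffCoefficient(2.0)
                            .setMaximumInterval(Duration.ofMinutes(1))
                            .setMaximumAttempts(10)
                            .build())
                        .build());

            @Override
            public void encode(String jobId) {
                // With retry options attached, a heartbeat-timeout failure is
                // expected to be retried until the attempts are exhausted.
                statusActivity.handleEncodingStatus(jobId);
            }
        }
    }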

Related

What happens when a branch in the pipeline throws an exception

Let's say, for example, that my pipeline consumes from Kafka and has two branches. The first branch writes to some data store and the second one produces a count of events seen, both belonging to the same window. What would happen if the API request to the datastore throws an exception, but the second branch never does? I.e., would Dataflow stop pulling from Kafka and wait until the first branch recovers, or does it keep buffering data since the second one is chugging along fine?
Exceptions are retried.
If this is a batch pipeline, it will be retried several times; if it doesn't succeed, the entire pipeline will fail.
If this is a streaming pipeline, it will be retried until it succeeds. The rest of the pipeline will continue processing data meanwhile. If the exception keeps happening, you'll need to fix your code and update the pipeline.
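To make the scenario concrete, here is a minimal Beam (Java) sketch of the shape described in the question; the Kafka source is replaced with a hypothetical in-memory source and writeToDatastore() is a made-up stand-in for the datastore call that may throw:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class TwoBranchPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Stand-in for the Kafka source; both branches consume the same windowed input.
        PCollection<String> events = p
            .apply(Create.of("event-1", "event-2", "event-3"))
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

        // Branch 1: write to a datastore; the call may throw, and the runner
        // retries the failing bundle (a few times in batch, indefinitely in streaming).
        events.apply("WriteToStore", ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            writeToDatastore(c.element()); // hypothetical call that can throw
          }
        }));

        // Branch 2: count events per window; it keeps processing while branch 1 retries.
        PCollection<Long> counts = events.apply("CountEvents",
            Combine.globally(Count.<String>combineFn()).withoutDefaults());

        p.run();
      }

      private static void writeToDatastore(String element) {
        // Hypothetical datastore write that may throw an exception.
      }
    }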

How to stop a Flink job using REST API

I am trying to deploy a job to Flink from Jenkins. Thus far I have figured out how to submit the jar file that is created in the build job. Now I want to find any Flink jobs running with the old jar, stop them gracefully, and start a new job utilizing my new jar.
The API has methods to list the jobs, cancel jobs, and submit jobs. However, there does not seem to be a stop-job endpoint. Any ideas on how to gracefully stop a job using the API?
Even though the stop endpoint is not documented, it does exist and behaves similarly to the cancel one.
Basically, this is the bit missing in the Flink REST API documentation:
Stop Job
DELETE request to /jobs/:jobid/stop.
Stops a job, result on success is {}.
For those who are not aware of the difference between cancelling and stopping (copied from here):
The difference between cancelling and stopping a (streaming) job is the following:
On a cancel call, the operators in a job immediately receive a cancel() method call to cancel them as soon as possible. If the operators are not stopping after the cancel call, Flink will start interrupting the thread periodically until it stops.
A "stop" call is a more graceful way of stopping a running streaming job. Stop is only available for jobs which use sources that implement the StoppableFunction interface. When the user requests to stop a job, all sources will receive a stop() method call. The job will keep running until all sources properly shut down. This allows the job to finish processing all in-flight data.
As I'm using Flink 1.7, below is how to cancel/stop a Flink job on this version.
Already Tested By Myself
Request path:
/jobs/{jobid}
jobid - 32-character hexadecimal string value that identifies a job.
Request method: PATCH
Query parameters:
mode (optional): String value that specifies the termination mode. Supported values are: "cancel", "stop".
Example
10.xx.xx.xx:50865/jobs/4c88f503005f79fde0f2d92b4ad3ade4?mode=cancel
The host and port are available when you start the yarn-session.
The jobid is available when you submit a job.
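As a hedged sketch, the same request can be issued from Java 11's HttpClient (the host, port and job id are the placeholders from the example above):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FlinkJobTerminator {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port (from the yarn-session output) and job id.
            String base = "http://10.xx.xx.xx:50865";
            String jobId = "4c88f503005f79fde0f2d92b4ad3ade4";

            // PATCH /jobs/{jobid}?mode=cancel  (use mode=stop for a graceful stop)
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(base + "/jobs/" + jobId + "?mode=cancel"))
                .method("PATCH", HttpRequest.BodyPublishers.noBody())
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println(response.statusCode() + " " + response.body());
        }
    }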
Ref:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html

JSR352: Monitoring Status of Job, Step and Partitions?

IBM's version of JSR352 provides a Rest API which can be used to trigger jobs, restart them, get the job logs. Can it also be used to get the status of each step and each partition of the step?
I want to build a job monitoring console from which I can trigger jobs and monitor the status of the steps and partitions in real time, without actually having to look into the job log (after I trigger a job, it should periodically give me the status of the steps and partitions).
How should I go about doing this?
You can subscribe to our batch events, a JMS topic tree where we publish messages at various stages in the batch job lifecycle (job started/ended, step checkpointed, etc.).
See the Knowledge Center documentation and this whitepaper as well for more information.
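For illustration, a minimal JMS subscriber for such a topic tree might look like the sketch below (the JNDI names and the way the topic is resolved are placeholders; the actual batch-events topic names and configuration come from the Knowledge Center documentation linked above):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.Topic;
    import javax.naming.InitialContext;

    public class BatchEventsListener {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();

            // Placeholder JNDI names; configure them for your JMS provider.
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/batchConnectionFactory");
            Topic topic = (Topic) ctx.lookup("jms/batchEventsTopic");

            Connection connection = cf.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(topic);

            // Each message describes a lifecycle event (job started/ended,
            // step checkpointed, partition ended, ...); a monitoring console
            // would update its job/step/partition view from the message contents.
            consumer.setMessageListener(message -> System.out.println("Batch event: " + message));

            connection.start();
            Thread.sleep(Long.MAX_VALUE); // keep the listener alive in this sketch
        }
    }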

Can I trust a pipeline 'succeeded' status with OutOfMemoryError in the job log?

I have a dataflow job with Autoscaling enabled, which resized the worker pool to 14 during execution. By the time the job had finished the job log reported 6 OutOfMemoryErrors but the whole pipeline, as well as each execution step, had status succeeded. Can I trust the job status, or could I have data loss due to the worker failures?
You can trust the job status and results, because Dataflow is designed to process data in a way that is resilient to such failures. Further information can be found in the description of Service Optimization and Execution. Specifically:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
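As an illustration of the "non-unique names" caveat, a hedged sketch of a DoFn that keeps its side effect safe under retries by using a unique name per attempt (the bucket path and the write helper are hypothetical):

    import java.util.UUID;
    import org.apache.beam.sdk.transforms.DoFn;

    // Writes a temporary file per element under a unique name, so a retried or
    // backup attempt cannot collide with the file written by another attempt.
    public class WriteTempFileFn extends DoFn<String, String> {
      @ProcessElement
      public void process(ProcessContext c) {
        // A fixed name like "tmp/output.tmp" could be overwritten by a
        // backup or retried worker running the same code.
        String uniquePath = "gs://my-bucket/tmp/" + UUID.randomUUID() + ".tmp";
        writeFile(uniquePath, c.element()); // hypothetical write helper
        c.output(uniquePath);
      }

      private void writeFile(String path, String contents) {
        // Hypothetical GCS write; omitted in this sketch.
      }
    }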

Creating a FIFO queue in SWF to control access to critical code sections

At the moment we have an Amazon Simple Workflow application that has a few tasks that can occur in parallel at the beginning of the process, followed by one path through a critical region where we can only allow one process to proceed.
We have modeled the critical region as a child workflow and we only allow one process to run in the child workflow at a time (though there is a race condition in our code that hasn't caused us issues yet). This is doing the job, but it has some issues.
We have a method that keeps checking whether the child workflow is running; if it isn't, it proceeds (the race condition mentioned above - the "is running" check and the start are not an atomic operation), otherwise it throws an exception and retries with exponential backoff. The problems are:
1. With multiple workflows entering, which workflow proceeds first is non-deterministic; it would be better if this were a FIFO queue.
2. We can end up waiting a long time for the next workflow to start, so there is wasted time; it would be nice if a workflow proceeded as soon as the previous one had finished.
We can address point 2 by reducing the retry interval, but we would still have the non-FIFO problem.
I can imagine modeling this quite easily on a single machine with a queue and locks, but what are our options in SWF?
You can have "critical section" workflow that is always running. Then signal it to "queue" execute requests. Upon receiving signal the "critical section" workflow either starts activity if it is not running or queues the request in the decider. When activity execution completes the "response" signal is sent back to the requester workflow. As "critical section" workflow is always running it has periodically restart itself as new (passing list of outstanding requests as a parameter) the same way all cron workflows are doing.
