There is a streaming Dataflow pipeline running on Google Cloud (Apache Beam 2.5). The pipeline was showing some system lag, so I tried to update it with the --update flag. Now the old job is in the Updating state and the new job initiated by the update is in the Pending state.
At this point everything is stuck and I am unable to stop or cancel either job. The old job is still in the Updating state and no status-change operation is permitted; I tried to change its state using gcloud dataflow jobs cancel and the REST API, but it reports that the job cannot be updated because it is in the RELOAD state. The newly initiated job is in the Not Started/Pending state and I cannot change its state either; it reports that the job is not in a condition to perform this operation.
Please let me know how to stop, cancel, or delete this streaming Dataflow job.
Did you try to cancel the job from both the gcloud command-line tool and the web console UI? If nothing works, I think you need to contact Google Cloud Support.
Related
I am studying for the Data Engineer exam and, during my exercises, I have found this question:
You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?
A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name.
B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name.
C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code.
D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code.
The official documentation states: "We recommend that you attempt only smaller changes to your pipeline's windowing, such as changing the duration of fixed- or sliding-time windows. Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output."
Therefore, I don't know if the correct answer is A or D. I think that A is more suitable when we don't want to lose data.
The answer is A, because the question has the precondition that no data is lost during the update. From the official documentation on updating a pipeline:
The replacement job preserves any intermediate state data from the prior job, as well as any buffered data records or metadata currently "in-flight" from the prior job. For example, some records in your pipeline might be buffered while waiting for a window to resolve.
This means that the data will be temporarily saved (i.e. buffered) until the new pipeline is running with the state from the old job. Once the new pipeline is running, the buffered data will be sent to the new job.
In addition, the documentation states that the updated job's name must match the old job's name, so the answer is not B.
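As an illustration of option A, here is a minimal sketch of launching the replacement pipeline with the update flag set programmatically through the Beam Java SDK's Dataflow options; the class name, the job name "my-streaming-job", and the omitted pipeline body are assumptions for the example, not from the question:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateStreamingJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    // Must match the name of the job that is currently running (option A).
    options.setJobName("my-streaming-job");  // assumed existing job name
    // Equivalent to passing --update on the command line: the replacement job
    // takes over the prior job's intermediate state and buffered records.
    options.setUpdate(true);

    // ... rebuild the pipeline with the new code and call pipeline.run() ...
  }
}

Passing --update together with --jobName set to the existing job name on the command line achieves the same thing.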
The Google documentation does mention that if the windowing or triggering algorithm changes, you might get unpredictable results. This question explicitly mentions a change to both the windowing algorithm and the triggering strategy, so the safe bet is D.
I have a Dataflow job that is subscribed to messages from Pub/Sub:
p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getPubSubSubscriptionName()).withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplication, and that Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes Pub/Sub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should be using the update feature instead of draining the pipeline and starting a new one. In the latter approach, state is not shared between the two pipelines, so Dataflow is not able to identify messages that were already delivered from Pub/Sub. With the update feature you should be able to continue your pipeline without duplicate messages.
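One thing to double-check as well: withIdAttribute can only deduplicate if the publisher actually attaches that attribute to every message, and even then the deduplication applies only within a bounded time window, so it does not replace the state carried over by an update. As a rough sketch (the project and topic names are placeholders, not from the question), a Java publisher could attach the "uuid" attribute like this:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.util.UUID;

public class PublishWithUuid {
  public static void main(String[] args) throws Exception {
    // Placeholder project and topic for the example.
    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("{\"payload\":\"...\"}"))
          // Attribute name must match what the pipeline passes to withIdAttribute("uuid").
          .putAttributes("uuid", UUID.randomUUID().toString())
          .build();
      publisher.publish(message).get();  // block until the message id comes back
    } finally {
      publisher.shutdown();
    }
  }
}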
I am trying to understand how Dataflow/Beam manages state. With Kafka Streams, for example, it is possible to stop and restart your application and continue from the last state.
Does Beam/Dataflow have similar possibilities?
While you cannot snapshot Dataflow's state today, you can snapshot the Pub/Sub subscription that Dataflow reads from and restart from it later. Review the Cloud Pub/Sub Seek and Replay feature. More on the integration can be found here.
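For instance, here is a minimal sketch of the snapshot/seek flow with the Java Pub/Sub admin client (the subscription and snapshot resource names are placeholders): you create a snapshot of the subscription before stopping the job, and seek the subscription back to that snapshot before starting the replacement job so it re-reads the unacknowledged backlog. Note that this captures the subscription backlog, not Dataflow's internal state.

import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.SeekRequest;

public class SnapshotAndReplay {
  public static void main(String[] args) throws Exception {
    // Placeholder resource names.
    String subscription = "projects/my-project/subscriptions/my-subscription";
    String snapshot = "projects/my-project/snapshots/my-snapshot";

    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      // Taken before stopping the pipeline: captures the subscription's unacked backlog.
      client.createSnapshot(snapshot, subscription);

      // Run later, before restarting the pipeline: rewinds the subscription to the
      // snapshot so the replacement job receives those messages again.
      client.seek(SeekRequest.newBuilder()
          .setSubscription(subscription)
          .setSnapshot(snapshot)
          .build());
    }
  }
}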
I was trying to cancel a Google Dataflow job, but it has been stuck in the Cancelling state for about 15 minutes now.
When I run the command: gcloud beta dataflow jobs list --status=active
It shows the job as active. I then run the command: gcloud beta dataflow jobs cancel [job id here].
It prompts me that it has been canceled, but it still appears as active in the status list.
These types of Google-end issues are best reported in the Public Issue Tracker where the engineering team responsible can be notified.
Providing further information in the issue report, such as your project number and the job ID of the stuck Dataflow job, will help in resolving the issue more quickly.
I am trying to deploy a job to Flink from Jenkins. Thus far I have figured out how to submit the jar file that is created in the build job. Now I want to find any Flink jobs running with the old jar, stop them gracefully, and start a new job utilizing my new jar.
The API has methods to list jobs, cancel jobs, and submit jobs. However, there does not seem to be a stop-job endpoint. Any ideas on how to gracefully stop a job using the API?
Even though the stop endpoint is not documented, it does exist and behaves similarly to the cancel one.
Basically, this is the bit missing in the Flink REST API documentation:
Stop Job
DELETE request to /jobs/:jobid/stop.
Stops a job, result on success is {}.
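For example, here is a quick sketch of calling that endpoint from Java (assuming Java 11's java.net.http client; the JobManager address localhost:8081 and the job id are placeholders to be replaced with your own values):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StopFlinkJob {
  public static void main(String[] args) throws Exception {
    // Placeholder JobManager address and job id.
    String url = "http://localhost:8081/jobs/4c88f503005f79fde0f2d92b4ad3ade4/stop";

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .DELETE()  // the (undocumented) stop endpoint expects a DELETE, like cancel
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());  // {} on success
  }
}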
For those who are not aware of the difference between cancelling and stopping (copied from here):
The difference between cancelling and stopping a (streaming) job is the following:
- On a cancel call, the operators in a job immediately receive a cancel() method call to cancel them as soon as possible. If operators are not stopping after the cancel call, Flink will start interrupting the thread periodically until it stops.
- A "stop" call is a more graceful way of stopping a running streaming job. Stop is only available for jobs which use sources that implement the StoppableFunction interface. When the user requests to stop a job, all sources will receive a stop() method call. The job will keep running until all sources properly shut down. This allows the job to finish processing all in-flight data.
As I'm using Flink 1.7, below is how to cancel or stop a Flink job on this version.
I have already tested this myself.
Request path:
/jobs/{jobid}
jobid - 32-character hexadecimal string value that identifies a job.
Request method: PATCH
Query parameters:
mode (optional): String value that specifies the termination mode. Supported values are: "cancel", "stop".
Example
10.xx.xx.xx:50865/jobs/4c88f503005f79fde0f2d92b4ad3ade4?mode=cancel
The host and port are shown when you start the yarn-session.
The jobid is shown when you submit a job.
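If you prefer to issue the same request from code rather than a plain HTTP tool, here is a rough Java sketch of the PATCH call (the host, port, and job id are placeholders and must be replaced with your own values):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TerminateFlinkJob17 {
  public static void main(String[] args) throws Exception {
    // Placeholder host/port and job id; mode can be "cancel" or "stop".
    String url = "http://localhost:8081/jobs/4c88f503005f79fde0f2d92b4ad3ade4?mode=cancel";

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .method("PATCH", HttpRequest.BodyPublishers.noBody())  // PATCH with an empty body
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());  // check that the request was accepted
  }
}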
Ref:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html