Update a pipeline in Google Cloud Dataflow - google-cloud-dataflow

I am studying for the Data Engineer exam and, during my exercises, I have found this question:
You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?
A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name.
B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name.
C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code.
D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code.
In the official documentation: "We recommend that you attempt only smaller changes to your pipeline's windowing, such as changing the duration of fixed- or sliding-time windows. Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output.".
Therefore, I don't know if the correct answer is A or D. I think that A is more suitable when we don't want to lose data.

The answer is A because the question has a precondition that no data is lost during the update. From the official documentation on updates:
The replacement job preserves any intermediate state data from the prior job, as well as any buffered data records or metadata currently "in-flight" from the prior job. For example, some records in your pipeline might be buffered while waiting for a window to resolve.
This means that the data will be temporarily saved (i.e. buffered) until the new pipeline is running with the state from the old job. Once the new pipeline is running, the buffered data will be sent to the new job.
In addition, the documentation states that the replacement job's name must match the old job's name, so it's not B.
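For reference, here is a minimal sketch of what requesting such an in-flight update looks like from the Beam Java SDK with the Dataflow runner. It is equivalent to passing --update with --jobName on the command line; the job name below is an example, and option names can vary slightly between SDK versions.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateStreamingJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setJobName("my-streaming-job"); // must match the name of the job being replaced
    options.setUpdate(true);                // request an in-flight update instead of a new job

    Pipeline p = Pipeline.create(options);
    // ... build the new version of the pipeline here ...
    p.run();
  }
}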

The Google documentation does mention that if the windowing or triggering algorithm changes, you might get unpredictable results. This question explicitly mentions a change in both the windowing algorithm and the triggering strategy, so the safe bet is D.

Related

What does the pipeline "state" mean in DataFlow?

I am a beginner in Dataflow. There is a concept I'm not sure I understand: the "state".
When talking about the pipeline state, does that mean the data in the pipeline?
For example, when taking a Dataflow snapshot, the documentation says there are two options:
1. Take a snapshot of only the pipeline state in Dataflow.
2. Take a snapshot as described in 1, plus a snapshot of the Pub/Sub source.
Does the state in option 1 mean the pipeline itself (the DAG) plus the data in flight?
What does the "state" mean?
And if the data in flight is saved, then why do we also need to take a snapshot of the source?
Thank you
Guy
Yes, it means the running pipeline and the data in flight. With a snapshot, you can recreate the state of the running job with a newer version of the pipeline. It's basically updating a streaming job without draining.
The snapshot of the source is specifically for Pub/Sub, so that when reading from the existing subscription, the job knows the ack state of in-flight messages.

Draining a Dataflow job and starting another one right after causes message duplication

I have a Dataflow job that subscribes to messages from Pub/Sub:
p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getPubSubSubscriptionName()).withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplication, and Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes Pub/Sub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should be using the update feature instead of draining the pipeline and starting a new one. In the latter approach, state is not shared between the two pipelines, so Dataflow is not able to identify messages already delivered from Pub/Sub. With the update feature you should be able to continue your pipeline without duplicate messages.
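For context, here is a sketch of the full read-from-Pub/Sub, write-to-BigQuery shape with withIdAttribute. The deduplication state lives inside the running job, which is why it survives an in-flight --update but not a drain followed by a fresh launch. The subscription, table, and schema field below are examples.
import java.nio.charset.StandardCharsets;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PubsubToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("pubsub-topic-read",
            PubsubIO.readMessagesWithAttributes()
                .fromSubscription("projects/my-project/subscriptions/my-sub")
                .withIdAttribute("uuid"))  // messages sharing a "uuid" are deduplicated within this job
        .apply("to-tablerow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((PubsubMessage msg) ->
                    new TableRow().set("payload", new String(msg.getPayload(), StandardCharsets.UTF_8))))
        .setCoder(TableRowJsonCoder.of())
        .apply("write-bq",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

    p.run();
  }
}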

How to read from pubsub source in parallel using dataflow

I am very new to Dataflow, and I am looking to build a pipeline that will use Pub/Sub as its source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; in Flink we can set the parallelism for reading messages from Kafka so that message processing happens in parallel rather than sequentially.
I am wondering whether the same is possible with Pub/Sub and Dataflow, or whether it will only read messages sequentially.
Take a look at the PubSubToBigQuery pipeline. This uses Pub/Sub as a source and will read data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it to downstream transforms for processing.
Please note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users. Just launch the pipeline from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Some users prefer to use it that way, but this depends on where you want to store your data.
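As a rough sketch of the difference from Flink: you don't set a source parallelism on the read itself; the Dataflow runner fans the Pub/Sub read and downstream transforms out across worker threads automatically, and what you tune is the worker pool. The subscription name, worker counts, and ProcessMessageFn below are made up for illustration.
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ParallelPubsubRead {
  // Placeholder processing step; Dataflow runs many instances of it in parallel.
  static class ProcessMessageFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String msg, OutputReceiver<String> out) {
      out.output(msg.toUpperCase());
    }
  }

  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setNumWorkers(3);      // initial worker count (example value)
    options.setMaxNumWorkers(10);  // autoscaling upper bound (example value)

    Pipeline p = Pipeline.create(options);
    p.apply("read", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
        .apply("process", ParDo.of(new ProcessMessageFn()));
    p.run();
  }
}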

How to add labels to an existing Google Dataflow job?

I am using the Java GAPI client to work with Google Cloud Dataflow (v1b3-rev197-1.22.0). I am running a pipeline from a template, and the method for doing that (com.google.api.services.dataflow.Dataflow.Projects.Templates#create) does not allow me to set labels for the job. However, I get the Job object back when I execute the pipeline, so I updated the labels and tried to call com.google.api.services.dataflow.Dataflow.Projects.Jobs#update to persist that information in Dataflow. But the labels do not get updated.
I also tried updating labels on finished jobs (which I also need to do), which didn't work either, so I thought it was because the job is in a terminal state. But updating labels seems to do nothing regardless of the state.
The documentation does not say anything about labels not being mutable on running or terminated pipelines, so I would expect this to work. Am I doing something wrong, and if not, what is the rationale behind the decision not to allow label updates? (And how are template users supposed to set the initial label set when executing the template?)
Background: I want to mark terminated pipelines that have been "processed", i.e. those that our automated infrastructure has already sent notifications about to the appropriate places. Labels seemed like a good approach that would shield me from having to use some kind of local persistence to track this (a big complexity jump). Any suggestions on how to approach this if labels are not the right tool? Sadly, Stackdriver cannot monitor finished pipelines, only failed ones. And sending a notification from within the pipeline code doesn't seem like a good idea to me (wrong?).
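To make the problem concrete, here is roughly what the attempt described above looks like with the v1b3 Java client; the project, job ID, and label key are placeholders, and, as noted in the question, the label change does not appear to take effect.
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UpdateJobLabels {
  public static void main(String[] args) throws Exception {
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(Collections.singleton("https://www.googleapis.com/auth/cloud-platform"));
    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("label-updater")
        .build();

    String projectId = "my-project";  // placeholder
    String jobId = "my-job-id";       // placeholder

    // Fetch the job, add a label, and send it back via projects.jobs.update.
    Job job = dataflow.projects().jobs().get(projectId, jobId).execute();
    Map<String, String> labels = new HashMap<>();
    if (job.getLabels() != null) {
      labels.putAll(job.getLabels());
    }
    labels.put("processed", "true");  // placeholder label
    job.setLabels(labels);
    dataflow.projects().jobs().update(projectId, jobId, job).execute();
  }
}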

Check data watermark at different steps via the Dataflow API

In the Dataflow UI, I can check the data watermark at various steps of the job (e.g. at the GroupByKey step, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?
Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics whose names end in data-watermark, for example:
F82-windmill-data-watermark
However, this is not yet easy to interpret, since the naming is based on an optimized view of the Dataflow graph rather than the pipeline graph that the code and UI present, and it uses identifiers like FX (e.g. F82 above).
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
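If you would rather do this from Java than gcloud, a sketch along these lines lists the job metrics and takes the minimum of the ones ending in data-watermark. It assumes a Dataflow API client built with the v1b3 Java library (as in the labels question above), and the scalar values are whatever the service reports, i.e. the same numbers gcloud prints.
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;

public class MinDataWatermark {
  // Returns the smallest data-watermark scalar across all optimized stages,
  // i.e. the oldest timestamp not yet fully processed by the pipeline.
  static long minDataWatermark(Dataflow dataflow, String projectId, String jobId) throws Exception {
    JobMetrics jobMetrics = dataflow.projects().jobs().getMetrics(projectId, jobId).execute();
    long min = Long.MAX_VALUE;
    for (MetricUpdate metric : jobMetrics.getMetrics()) {
      String name = metric.getName() != null ? metric.getName().getName() : "";
      if (name.endsWith("data-watermark") && metric.getScalar() instanceof Number) {
        min = Math.min(min, ((Number) metric.getScalar()).longValue());
      }
    }
    return min;
  }
}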
