What does the pipeline "state" mean in Dataflow?

I am a beginner in Dataflow. There is a concept I'm not sure I understand: the "state".
When talking about the pipeline state, does it mean the data in the pipeline?
For example, when taking a Dataflow snapshot, the documentation says there are two options:
1. Take a snapshot of only the pipeline state in Dataflow.
2. Take a snapshot as described in 1, plus a snapshot of the Pub/Sub source.
Does the state in option 1 mean the pipeline itself (the DAG) plus the data in flight?
What does the "state" mean?
And if the data in flight is saved, then why do we also need to take a snapshot of the source?
Thank you
Guy

Yes, it means the running pipeline and the data in flight. With the snapshot, you can recreate the state of the running job with a newer version of the pipeline; it's essentially a way to update a streaming job without draining it.
The snapshot of the source is specific to Pub/Sub, so that when the restored job reads from the existing subscription it knows the ack state of the in-flight messages.
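For reference, the gcloud CLI exposes both options; at the time of writing the commands look roughly like the following (job ID and region are placeholders, and the --snapshot-sources flag is what adds the Pub/Sub source snapshot from option 2):
# Option 1: snapshot only the pipeline state
gcloud dataflow snapshots create --job-id=<job_id> --region=<region>
# Option 2: also snapshot the Pub/Sub source subscription
gcloud dataflow snapshots create --job-id=<job_id> --region=<region> --snapshot-sources=true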

Related

Update a pipeline in Google Cloud Dataflow

I am studying for the Data Engineer exam and, during my exercises, I have found this question:
You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?
A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name.
B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name.
C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code.
D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code.
The official documentation says: "We recommend that you attempt only smaller changes to your pipeline's windowing, such as changing the duration of fixed- or sliding-time windows. Making major changes to windowing or triggers, like changing the windowing algorithm, might have unpredictable results on your pipeline output."
Therefore, I don't know if the correct answer is A or D. I think that A is more suitable when we don't want to lose data.
The answer is A because the question has a precondition that no data is lost during the update. From the official documentation on updates:
The replacement job preserves any intermediate state data from the prior job, as well as any buffered data records or metadata currently "in-flight" from the prior job. For example, some records in your pipeline might be buffered while waiting for a window to resolve.
This means that the data will be temporarily saved (i.e. buffered) until the new pipeline is running with the state from the old job. Once the new pipeline is running, the buffered data will be sent to the new job.
In addition, the documentation states that the updated job's name must match the old job's name, so it's not B.
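To make option A concrete, here is a minimal, hedged sketch of launching the replacement job from the Beam Java SDK, assuming DataflowPipelineOptions' setUpdate and setJobName setters (which correspond to the --update and --jobName flags); the job name is a placeholder:
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchReplacementJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setUpdate(true);                // equivalent to passing --update
    options.setJobName("my-streaming-job"); // must match the name of the RUNNING job (option A)

    Pipeline p = Pipeline.create(options);
    // ... apply the updated transforms here ...
    p.run(); // Dataflow transfers the prior job's state to the replacement job
  }
}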
The Google documentation does mention that if the windowing or triggering algorithm changes, you might get unpredictable results, and this question explicitly mentions a change in both the windowing algorithm and the triggering strategy. The safe bet is D.

Draining a Dataflow job and starting another one right after causes message duplication

I have a Dataflow job that subscribes to messages from Pub/Sub:
// Read from Pub/Sub, using the "uuid" attribute as the record ID for deduplication.
p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
    .fromSubscription(options.getPubSubSubscriptionName())
    .withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplication, and Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes Pub/Sub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should be using the update feature instead of draining the pipeline and starting a new one. In the latter approach, state is not shared between the two pipelines, so Dataflow is not able to identify messages that were already delivered from Pub/Sub. With the update feature you should be able to continue your pipeline without duplicate messages.
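In practice that means relaunching the pipeline with the update flag and the existing job name instead of draining first, e.g. with pipeline arguments along these lines (the job name is a placeholder):
--runner=DataflowRunner --update --jobName=<existing_job_name>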

How to read from pubsub source in parallel using dataflow

I am very new to Dataflow. I am looking to build a pipeline that will use Pub/Sub as its source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; in Flink we can set the parallelism for reading messages from Kafka, so that message processing happens in parallel instead of sequentially.
I am wondering whether the same is possible with Pub/Sub -> Dataflow, or whether it will only read messages sequentially.
Take a look at the PubSubToBigQuery pipeline. It uses Pub/Sub as a source and reads data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it off to downstream transforms for processing.
Note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users: just launch it from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Some users prefer to use it that way, but it depends on where you want to store your data.
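As an illustration, here is a minimal, hedged Beam Java sketch (the subscription path and the trivial DoFn are placeholders) that reads from Pub/Sub; when run with the DataflowRunner, the read and the downstream ParDo are spread across worker threads and workers without any explicit parallelism setting:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubParallelRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadFromPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription"))
     .apply("Process", ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Each element is handed to whichever worker thread picks it up.
         c.output(c.element().toUpperCase());
       }
     }));

    p.run();
  }
}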

Can Beam/Dataflow keep state after you stop a pipeline and start a new one?

I am trying to understand how Dataflow/Beam manages state. With Kafka Streams, for example, it is possible to stop and restart your application and continue from the last state.
Does Beam/Dataflow have similar capabilities?
While you cannot snapshot Dataflow's state today, you can snapshot the Pub/Sub subscription that Dataflow gets its data from and restart later on. Have a look at the Cloud Pub/Sub Seek and Replay feature.
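For reference, a hedged sketch of that workflow with the gcloud CLI (subscription and snapshot names are placeholders):
# Capture the subscription's ack state before stopping the pipeline
gcloud pubsub snapshots create my-snapshot --subscription=my-subscription
# Later, rewind the subscription so a new pipeline re-reads from that point
gcloud pubsub subscriptions seek my-subscription --snapshot=my-snapshot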

Check data watermark at different steps via the Dataflow API

In the Dataflow UI, I can check the data watermark at various steps of the job (e.g. at the GroupByKey step, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?
Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics with names ending in data-watermark, for example:
F82-windmill-data-watermark
However, these metrics are not yet easy to interpret, since their names are based on an optimized view of the Dataflow graph rather than the pipeline graph that your code and the UI show, and they use internal stage identifiers like F82.
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
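For example, to pull out just the watermark metrics from that output (same placeholders as above):
gcloud beta dataflow metrics list <job_id> --project=<project_name> | grep data-watermark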
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
