How to read from pubsub source in parallel using dataflow - google-cloud-dataflow

I am very new to Dataflow, and I am looking to build a pipeline which will use Pub/Sub as its source.
I have worked on a streaming pipeline that used Flink as the streaming engine and Kafka as the source; in Flink we can set the parallelism for reading messages from Kafka so that message processing happens in parallel rather than sequentially.
I am wondering whether the same is possible with Pub/Sub -> Dataflow, or whether it will only read messages sequentially.

Take a look at the PubSubToBigQuery pipeline. It uses Pub/Sub as a source and reads data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it to downstream transforms for processing.
Please note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users: just launch the pipeline from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Some users prefer to use it that way, but this depends on where you want to store your data.
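If you would rather write the pipeline yourself instead of launching the template, a minimal Beam Java sketch of the same shape could look like the code below. The subscription, topic and table names are placeholders, and the trivial message-to-TableRow mapping is only illustrative; the point is that there is no per-source parallelism setting to configure, since the Dataflow workers parallelize the read and the downstream transforms for you.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PubSubToBigQuerySketch {
  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))  // placeholder
     // No per-step parallelism knob is needed: on Dataflow, multiple worker
     // threads each pull messages and push them through the transforms below.
     .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String msg) -> new TableRow().set("payload", msg)))
     .setCoder(TableRowJsonCoder.of())
     .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")  // placeholder; table assumed to exist
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}

The degree of parallelism is controlled by autoscaling or by the --numWorkers/--maxNumWorkers pipeline options, not by anything on the PubsubIO read itself.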

Related

Apache Beam KafkaIO consumers in consumer group reading same message

I'm using KafkaIO in dataflow to read messages from one topic. I use the following code.
KafkaIO.<String, String>read()
    .withReadCommitted()
    .withBootstrapServers(endPoint)
    .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
        .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
        .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
        .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 8000)
        .put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 2000)
        .build())
    // .commitOffsetsInFinalize()
    .withTopics(Collections.singletonList(topicNames))
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withoutMetadata();
I run the Dataflow program locally using the direct runner, and everything runs fine. Then I run another instance of the same program in parallel, i.e. another consumer. Now I see duplicate messages being processed in the pipeline.
Even though I have provided a consumer group id, starting another consumer with the same consumer group id (a different instance of the same program) shouldn't process the same elements that the other consumer processes, right?
How does this turn out when using the Dataflow runner?
I don't think the options you have set guarantee non-duplicate delivery of messages across pipelines.
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: this is a flag for the Kafka consumer, not for the Beam pipeline itself. It appears to be best-effort and periodic, so you might still see duplicates across multiple pipelines.
withReadCommitted(): this just means that Beam will not read uncommitted messages; again, it will not prevent duplicates across multiple pipelines.
See here for the protocol the Beam source uses to determine the starting point of the Kafka source.
To guarantee non-duplicate delivery, you probably have to read from different topics or different subscriptions.

Streaming pipeline publish to pubsub after write step completes

I have a use case where I have a Dataflow job running in streaming mode with an hourly fixed window.
When the pipeline runs for a given window, we calculate some data and write it to a data source. What I want to do next is publish some message to PubSub once the write is complete - how might I go about making sure that the write step is complete before writing to PubSub?
If the pipeline were executed in batch mode I know I could execute it in a blocking fashion as suggested here, but the tricky part is that this is constantly running in streaming mode.
The Wait.on() transform is designed for this use case; see its documentation for a usage example.
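For example, a rough sketch of this pattern could look like the following. The subscription, topic and the WriteToDataSourceFn write step are placeholders for your own; the output of the write ParDo is used purely as a completion signal that Wait.on() waits for.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class PublishAfterWrite {

  // Placeholder for the existing write step: re-emits each element once the
  // external write has returned, so its output doubles as a completion signal.
  static class WriteToDataSourceFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      // ... write c.element() to the external data source here ...
      c.output(c.element());
    }
  }

  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    PCollection<String> windowed =
        p.apply("ReadFromPubSub",
                PubsubIO.readStrings().fromSubscription(
                    "projects/my-project/subscriptions/my-sub"))  // placeholder
         .apply("HourlyWindows",
                Window.<String>into(FixedWindows.of(Duration.standardHours(1))));

    // The write step; its output PCollection is used only as a signal.
    PCollection<String> writeSignal =
        windowed.apply("WriteToDataSource", ParDo.of(new WriteToDataSourceFn()));

    // Wait.on() holds these elements back until the signal collection is complete
    // for the corresponding hourly window, i.e. until the writes have finished.
    windowed
        .apply("CountPerWindow", Combine.globally(Count.<String>combineFn()).withoutDefaults())
        .apply("ToMessage", MapElements.into(TypeDescriptors.strings())
            .via((Long n) -> "window complete, " + n + " records written"))
        .apply("WaitForWrites", Wait.on(writeSignal))
        .apply("PublishNotification",
               PubsubIO.writeStrings().to("projects/my-project/topics/write-complete"));  // placeholder

    p.run();
  }
}

The notification elements live in the same hourly window as the writes, so Wait.on() releases them only once that window's write signal is complete, which in practice means the writes for that window have finished.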

Check data watermark at different steps via the Dataflow API

In the Dataflow UI, I can check the data watermark at various steps of the job (ex. at step GroupByKey, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?
Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics with names ending in data-watermark, for example:
F82-windmill-data-watermark
However, this is not yet easy to interpret, since the naming is based on an optimized view of the Dataflow graph rather than the pipeline graph that your code and the UI show, and it uses identifiers like FX.
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
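If you want the same information programmatically rather than via gcloud, a rough sketch using the generated Java client for the Dataflow v1b3 REST API (google-api-services-dataflow) is shown below. The class and method names come from that client, and the units/format of the watermark scalar are whatever the service returns, so treat this as a starting point only.

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;
import java.util.Collections;

public class DataWatermarkFetcher {
  public static void main(String[] args) throws Exception {
    String project = args[0];  // e.g. "my-project"
    String jobId = args[1];    // the Dataflow job id

    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(Collections.singleton("https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("data-watermark-fetcher")
        .build();

    // Same data as `gcloud beta dataflow metrics list <job_id>`.
    JobMetrics jobMetrics = dataflow.projects().jobs().getMetrics(project, jobId).execute();

    // Keep only the *-data-watermark metrics and report the minimum (oldest) value.
    long oldest = Long.MAX_VALUE;
    if (jobMetrics.getMetrics() != null) {
      for (MetricUpdate metric : jobMetrics.getMetrics()) {
        String name = metric.getName().getName();
        if (name != null && name.endsWith("data-watermark") && metric.getScalar() != null) {
          oldest = Math.min(oldest, ((Number) metric.getScalar()).longValue());
        }
      }
    }
    System.out.println("Oldest data watermark value: " + oldest);
  }
}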

Streaming Dataflow pipeline with no sink

We have a streaming Dataflow pipeline running on Google Cloud Dataflow
workers, which needs to read from a PubSub subscription, group
messages, and write them to BigQuery. The built-in BigQuery Sink does
not fit our needs as we need to target specific datasets and tables
for each group. Since custom sinks are not supported for streaming
pipelines, it seems like the only solution is to perform the insert
operations in a ParDo. Something like this:
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink. In fact, in streaming, a sink is just a type of ParDo.
I recommend that you use a custom ParDo and call the BigQuery API from your custom logic. Here is the definition of the BigQuerySink; you can use this code as a starting point.
You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this uses Reshuffle instead of GroupByKey. I recommend using Reshuffle, which also groups by key but avoids unnecessary windowing delays; in this case it means that elements are written out as soon as they come in, without extra buffering. It also lets you determine the BQ table names at runtime.
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API in your custom DoFn rather than the BigQuerySink.
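To make that concrete, here is a hedged sketch of such a DoFn using the public google-cloud-bigquery client (not the internal StreamingWriteFn the answer refers to). The KV layout with a "dataset.table" key, the class name and the error handling are placeholders of my own.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical DoFn: the input is KV<"dataset.table", row contents>, so the
// destination table is chosen per element at runtime.
public class DynamicBigQueryWriteFn extends DoFn<KV<String, Map<String, Object>>, Void> {

  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    // One client per DoFn instance, created on the worker.
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] parts = c.element().getKey().split("\\.", 2);  // "dataset.table"
    TableId tableId = TableId.of(parts[0], parts[1]);

    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId)
            .addRow(c.element().getValue())
            .build());

    if (response.hasErrors()) {
      // A real pipeline would batch rows and route failures to a dead-letter output.
      throw new RuntimeException("BigQuery insert failed: " + response.getInsertErrors());
    }
  }
}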

Any easier way to flush aggregator to GCS at the end of google dataflow pipeline

I am using an Aggregator to log some runtime stats of a Dataflow job, and I want to flush them to either GCS or BQ when the pipeline completes (or when each transform completes).
Currently I am doing this by creating a side output with a TupleTag alongside the Aggregator, and then flushing the side-output PCollection.
However, I am wondering whether there is any other handy way to flush the aggregators themselves directly?
Your method of using a side output PCollection should produce semantically equivalent results to using an Aggregator. (For example, neither Aggregators nor side outputs will include duplicate values when a bundle fails and has to be retried.) The main difference is that partial results for Aggregators are available during pipeline execution, both in the monitoring UI and programmatically.
Within Java, you can use PipelineResult.getAggregatorValues(). If you get the PipelineResult from the (non-blocking) DataflowPipelineRunner, that will let you query aggregators as the job runs. If you use the BlockingDataflowPipelineRunner, Pipeline.run() blocks and you won't get the PipelineResult until after the job completes.
There's also commandline support: gcloud alpha dataflow metrics tail JOB_ID
