Apache Beam KafkaIO consumers in a consumer group reading the same message - google-cloud-dataflow

I'm using KafkaIO in Dataflow to read messages from one topic. I use the following code.
KafkaIO.<String, String>read()
    .withReadCommitted()
    .withBootstrapServers(endPoint)
    .withConsumerConfigUpdates(new ImmutableMap.Builder<String, Object>()
        .put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
        .put(ConsumerConfig.GROUP_ID_CONFIG, groupName)
        .put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 8000)
        .put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 2000)
        .build())
    // .commitOffsetsInFinalize()
    .withTopics(Collections.singletonList(topicNames))
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .withoutMetadata();
I run the Dataflow program locally using the direct runner, and everything runs fine. Then I run another instance of the same program in parallel, i.e. another consumer. Now I see duplicate messages being processed in the pipeline.
Since I have provided a consumer group id, shouldn't starting another consumer with the same consumer group id (a different instance of the same program) avoid processing the elements already being processed by the other consumer?
How does this behave with the Dataflow runner?

I don't think the options you have set guarantee non-duplicate delivery of messages across pipelines. As far as I know, KafkaIO assigns topic partitions to its readers directly rather than subscribing through Kafka's consumer-group mechanism, so the group id does not split work between separate pipelines.
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG: This is a flag for the Kafka consumer, not for the Beam pipeline itself. Auto-commit is best-effort and periodic, so you might still see duplicates across multiple pipelines.
withReadCommitted(): This just means that Beam will not read uncommitted messages. Again, it will not prevent duplicates across multiple pipelines.
See here for the protocol the Beam source uses to determine the starting point of the Kafka source.
To guarantee non-duplicate delivery, you probably have to read from different topics or different subscriptions.
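If you go the per-pipeline-topic route, here is a minimal sketch of what each pipeline's read could look like; the topic name, broker address, and group id are placeholders, and commitOffsetsInFinalize() is used instead of the consumer's periodic auto-commit. This is an illustration, not the only way to configure it.

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerPipelineTopicRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")            // placeholder endpoint
        .withTopic("orders-pipeline-a")                   // one dedicated topic per pipeline
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withReadCommitted()
        .withConsumerConfigUpdates(ImmutableMap.<String, Object>of(
            ConsumerConfig.GROUP_ID_CONFIG, "orders-pipeline-a"))  // group.id is required by commitOffsetsInFinalize()
        // Let Beam commit offsets when bundles are finalized instead of relying
        // on the consumer's periodic auto-commit.
        .commitOffsetsInFinalize()
        .withoutMetadata());

    p.run();
  }
}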

Related

Draining a Dataflow job and starting another one right after causes message duplication

I have a Dataflow job that is subscribed to messages from Pub/Sub:
p.apply("pubsub-topic-read", PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getPubSubSubscriptionName()).withIdAttribute("uuid"))
I see in the docs that there is no guarantee against duplication, and Beam suggests using withIdAttribute.
This works perfectly until I drain an existing job, wait for it to finish, and start another one; then I see millions of duplicate BigQuery records (my job writes Pub/Sub messages to BigQuery).
Any idea what I'm doing wrong?
I think you should use the update feature instead of draining the pipeline and starting a new one. With the latter approach, state is not shared between the two pipelines, so Dataflow is not able to identify messages already delivered from Pub/Sub. With the update feature you should be able to continue your pipeline without duplicate messages.
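For reference, updating a running Dataflow job is done by re-launching the pipeline with the update option and the same job name as the job being replaced. A minimal sketch, with a placeholder job name:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateJobLauncher {
  public static void main(String[] args) {
    // Typically these values come from the command line, e.g.
    //   --runner=DataflowRunner --jobName=my-streaming-job --update
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setJobName("my-streaming-job"); // must match the name of the running job being replaced
    options.setUpdate(true);                // update in place instead of drain + restart

    Pipeline p = Pipeline.create(options);
    // ... build the same (compatible) pipeline graph here, then:
    p.run();
  }
}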

How to read from a Pub/Sub source in parallel using Dataflow

I am very new to Dataflow, and I am looking to build a pipeline which will use Pub/Sub as its source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; there we can set the parallelism in Flink so that messages are read from Kafka and processed in parallel instead of sequentially.
I am wondering whether the same is possible with Pub/Sub and Dataflow, or whether it will only read messages sequentially.
Take a look at the PubSubToBigQuery pipeline. It uses Pub/Sub as a source and reads data in parallel: by default, multiple threads each read a message off Pub/Sub and hand it off to downstream transforms for processing.
Note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users: just launch it from the Template UI and set the appropriate parameters to point to your Pub/Sub and BigQuery locations. Whether that fits depends on where you want to store your data.
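As a rough illustration (the subscription path and processing logic below are placeholders), the read itself needs no explicit parallelism setting; the runner decides how many workers and threads pull from the subscription and run the downstream DoFn:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ParallelPubSubRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        .apply("Process", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String message, OutputReceiver<String> out) {
            // The runner fans this DoFn out across many threads and workers;
            // no explicit parallelism setting is needed as with Flink.
            out.output(message.toUpperCase());
          }
        }));

    p.run();
  }
}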

Replaying data into Apache Beam pipeline over Google Cloud Pub/Sub without overloading other subscribers

What I'm doing: I'm building a system in which one Cloud Pub/Sub topic will be read by dozens of Apache Beam pipelines in streaming mode. Each time I deploy a new pipeline, it should first process several years of historic data (stored in BigQuery).
The problem: If I replay historic data into the topic whenever I deploy a new pipeline (as suggested here), it will also be delivered to every other pipeline currently reading the topic, which would be wasteful and very costly. I can't use Cloud Pub/Sub Seek (as suggested here) as it stores a maximum of 7 days history (more details here).
The question: What is the recommended pattern to replay historic data into new Apache Beam streaming pipelines with minimal overhead (and without causing event time/watermark issues)?
Current ideas: I can currently think of three approaches to solving the problem; however, none of them seems very elegant, and I have not seen any of them mentioned in the documentation, common patterns (part 1 or part 2) or elsewhere. They are:
1) Ideally, I could use Flatten to merge the real-time ReadFromPubSub with a one-off BigQuerySource; however, I see three potential issues: a) I can't account for data that has already been published to Pub/Sub but hasn't yet made it into BigQuery, b) I am not sure whether the BigQuerySource might inadvertently be rerun if the pipeline is restarted, and c) I am unsure whether BigQuerySource works in streaming mode (per the table here).
2) I create a separate replay topic for each pipeline and then use Flatten to merge the ReadFromPubSubs for the main topic and the pipeline-specific replay topic. After deployment of the pipeline, I replay historic data to the pipeline-specific replay topic.
3) I create dedicated topics for each pipeline and deploy a separate pipeline that reads the main topic and broadcasts messages to the pipeline-specific topics. Whenever a replay is needed, I can replay data into the pipeline-specific topic.
Out of your three ideas:
The first one will not work because currently the Python SDK does not support unbounded reads from bounded sources (meaning that you can't add a ReadFromBigQuery to a streaming pipeline).
The third one sounds overly complicated, and maybe costly.
I believe your best bet at the moment is, as you rightly pointed out, to replay your table into an extra Pub/Sub topic that you Flatten with your main topic.
I will check if there's a better solution, but for now, option #2 should do the trick.
Also, I'd refer you to an interesting talk from Lyft on doing this for their architecture (in Flink).
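For concreteness, here is a sketch of option #2 in the Java SDK (the question uses the Python SDK, but the shape is the same; the topic paths are placeholders). You may also want withTimestampAttribute so replayed messages carry their original event times rather than publish times:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class MainPlusReplay {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Live traffic shared by all pipelines.
    PCollection<String> live = p.apply("ReadMainTopic",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/main-topic"));

    // Pipeline-specific replay topic; historic rows are published here once,
    // after this pipeline is deployed.
    PCollection<String> replay = p.apply("ReadReplayTopic",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/replay-for-this-pipeline"));

    PCollection<String> all =
        PCollectionList.of(live).and(replay).apply("MergeStreams", Flatten.pCollections());

    // ... downstream transforms operate on the merged stream.
    p.run();
  }
}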

Audit records while working with streaming data in Apache Beam

I have a use case wherein records will be published from an on-premises system to a Pub/Sub topic. Now, I want to make sure that all published records are read by the Apache Beam job and that they are all correctly written to BigQuery.
I have two questions regarding this:
1) How do I make sure that there is no data loss in the entire process?
2) I need to maintain an audit table somewhere to make sure that if 'n' records were published, I have written each one of them successfully. How do I keep track of the records?
Thank You.
Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about exactly-once guarantees in the blog article: After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks).
For your question about an audit table: can you describe more about what you'd like to accomplish? Dataflow has built-in Elements Added counters available in the UI and API which will show exactly how many elements have been processed. You could match this up with the number of published Pub/Sub messages.
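If you also want your own audit figure from inside the pipeline, a user-defined Beam metric is a lightweight option. A minimal sketch, where the counter name, step placement, and element type are illustrative:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Counts every record that reaches the write step; the value shows up in the
// Dataflow UI and in the job metrics API, so it can be compared with the number
// of messages published to Pub/Sub.
public class CountingWriteFn extends DoFn<String, String> {
  private final Counter recordsSeen = Metrics.counter(CountingWriteFn.class, "records_seen");

  @ProcessElement
  public void processElement(@Element String record, OutputReceiver<String> out) {
    recordsSeen.inc();
    out.output(record); // hand off to the actual BigQuery write
  }
}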

Streaming Dataflow pipeline with no sink

We have a streaming Dataflow pipeline running on Google Cloud Dataflow
workers, which needs to read from a PubSub subscription, group
messages, and write them to BigQuery. The built-in BigQuery Sink does
not fit our needs as we need to target specific datasets and tables
for each group. As the custom sinks are not supported for streaming
pipelines, it seems like the only solution is to perform the insert
operations in a ParDo. Something like this:
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink; in fact, in streaming, a sink is just a type of ParDo.
I recommend that you use a custom ParDo and call the BigQuery API with your custom logic. Here is the definition of the BigQuerySink; you can use this code as a starting point.
You can define your own DoFn, similar to StreamingWriteFn, to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this uses Reshuffle instead of GroupByKey. I recommend using Reshuffle, which also groups by key but avoids unnecessary windowing delays; in this case it means that elements are written out as soon as they come in, without extra buffering. Additionally, this allows you to determine BigQuery table names at runtime.
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API in your custom DoFn rather than the BigQuerySink.
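For illustration only, here is a minimal sketch of such a DoFn using the google-cloud-bigquery client; the input element shape (a KV of "dataset.table" destination and a column-to-value map) and the error handling are assumptions, not part of the original answer:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Streams each element into a BigQuery table chosen at runtime from the element's key.
public class DynamicBigQueryWriteFn extends DoFn<KV<String, Map<String, Object>>, Void> {
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(@Element KV<String, Map<String, Object>> element) {
    // Key is assumed to be "dataset.table"; the value is the row content.
    String[] parts = element.getKey().split("\\.", 2);
    TableId tableId = TableId.of(parts[0], parts[1]);

    InsertAllResponse response = bigquery.insertAll(
        InsertAllRequest.newBuilder(tableId).addRow(element.getValue()).build());

    if (response.hasErrors()) {
      // Real code should retry or route failures to a dead-letter output.
      throw new RuntimeException("BigQuery insert failed: " + response.getInsertErrors());
    }
  }
}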
