Audit records while working with streaming data in Apache Beam - google-cloud-dataflow

I have a use case wherein records will be published from an on-premise system to a PubSub topic. Now, I want to make sure that all records published are read by the Apache Beam job and they are all correctly written to BigQuery.
I have two questions regarding this:
1) How do I make sure that there is no data loss in the entire process?
2) I need to maintain an audit table somewhere to make sure that if 'n' records were published, I have written each one of them to BigQuery successfully. How do I keep track of the records?
Thank You.

Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about exactly-once guarantees in the blog article: After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks).
For your question about an audit table: can you describe more about what you'd like to accomplish? Dataflow has built-in Elements Added counters available in the UI and API which will show exactly how many elements have been processed. You could match this up with the number of published Pub/Sub messages.
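If you also want a count of your own alongside the built-in counters, one lightweight option is a custom counter via Beam's Metrics API, incremented just before the BigQuery write. Below is a minimal sketch using the Beam Java SDK; the namespace and counter name ("audit" / "records-before-bq-write") are just illustrative:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

/** Counts every record that reaches the BigQuery write step. */
class CountRecordsFn extends DoFn<TableRow, TableRow> {
  // Custom counters surface in the Dataflow monitoring UI and the Metrics API.
  private final Counter recordsSeen = Metrics.counter("audit", "records-before-bq-write");

  @ProcessElement
  public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
    recordsSeen.inc();   // one increment per element about to be written
    out.output(row);     // pass the record through unchanged
  }
}

You would apply it with something like rows.apply(ParDo.of(new CountRecordsFn())) right before the BigQuery write, and then compare the counter value with the number of messages published to the topic.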

Related

ordering message pub/sub GCP

I am new to Dataflow and pub-sub tools in GCP.
I need to migrate our current on-prem process to GCP.
The current process is as follows:
We have two types of data feeds
Full feed – an ad hoc job – the full XML is ~100GB (a single, very complex XML containing the complete data set; the ETL job processes this XML and loads it into ~60 tables)
Separate ETL jobs process the full feed: they create load-ready files, and all tables are truncated and re-loaded.
Delta feed – every 30 minutes we need to process delta files (XML files containing only the changes from the last 30 minutes)
The source system pushes XML files every 30 minutes (possibly more than one; each file carries a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes all the XML files, and creates three load-ready files per table: insert, delete and update.
Schedule – ETL jobs are scheduled to run every 5 minutes; if the current run takes longer than 5 minutes, the next run does not trigger until the current one completes.
The order of file processing is very important (the ETL job takes care of this); all files must be processed in sequence.
At the end of the ETL process, the load-ready files are loaded into the tables (Mainframe).
I was asked to propose a design to migrate this to GCP. We need both processes (full and delta) in GCP as well, and my proposed solution should be suitable for both feeds.
Initially I thought of the design below:
Pub/sub -> DataFlow -> mySQL/BigQuery
Then I came to know that Pub/Sub does not guarantee that messages are processed in sequence/order. After doing some research, I learned that Google recently introduced the ordering key concept for Pub/Sub, which ensures that messages are processed in order. The Google Cloud docs mention that this feature is in Beta.
I have two questions:
Has anyone used the ordering key concept with Pub/Sub in a production environment? If yes, did you face any challenges while implementing it?
Is this design suitable for the above requirement, or is there a better solution in GCP?
Is there any alternative to Dataflow?
I also came to know that Pub/Sub can handle messages of at most 10MB, while for us each XML is more than ~5GB.
As @guillaume blaquiere mentioned, the Beta launch phase brings some restrictions, but they are mostly related to product support:
At beta, products or features are ready for broader customer testing and use. Betas are often publicly announced. There are no SLAs or technical support obligations in a beta release unless otherwise specified in product terms or the terms of a particular beta program. The average beta phase lasts about six months.
In general, the Cloud Pub/Sub message ordering feature works as intended; if you come across something that needs the developers' attention, it is highly appreciated if you send a report via the Google Issue Tracker.
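For reference, publishing with an ordering key looks roughly like the sketch below, using the Java Pub/Sub client library; the project, topic and key names are hypothetical, and the subscription you read from must also have message ordering enabled. Messages that share an ordering key are delivered to ordered subscribers in the order they were published.

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class OrderedPublisherSketch {
  public static void main(String[] args) throws Exception {
    TopicName topic = TopicName.of("my-project", "delta-feed-topic"); // hypothetical names

    // Ordering must be enabled explicitly on the publisher client.
    Publisher publisher = Publisher.newBuilder(topic)
        .setEnableMessageOrdering(true)
        .build();
    try {
      for (int i = 0; i < 3; i++) {
        PubsubMessage message = PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8("delta-file-" + i))
            .setOrderingKey("delta-feed") // all messages with this key keep publish order
            .build();
        publisher.publish(message);
      }
    } finally {
      publisher.shutdown();
    }
  }
}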

Replaying data into Apache Beam pipeline over Google Cloud Pub/Sub without overloading other subscribers

What I'm doing: I'm building a system in which one Cloud Pub/Sub topic will be read by dozens of Apache Beam pipelines in streaming mode. Each time I deploy a new pipeline, it should first process several years of historic data (stored in BigQuery).
The problem: If I replay historic data into the topic whenever I deploy a new pipeline (as suggested here), it will also be delivered to every other pipeline currently reading the topic, which would be wasteful and very costly. I can't use Cloud Pub/Sub Seek (as suggested here), as it stores a maximum of 7 days of history (more details here).
The question: What is the recommended pattern to replay historic data into new Apache Beam streaming pipelines with minimal overhead (and without causing event time/watermark issues)?
Current ideas: I can currently think of three approaches to solving the problem; however, none of them seems very elegant, and I have not seen any of them mentioned in the documentation, the common patterns (part 1 or part 2) or elsewhere. They are:
1) Ideally, I could use Flatten to merge the real-time ReadFromPubSub with a one-off BigQuerySource. However, I see three potential issues: a) I can't account for data that has already been published to Pub/Sub but hasn't yet made it into BigQuery, b) I am not sure whether the BigQuerySource might inadvertently be rerun if the pipeline is restarted, and c) I am unsure whether BigQuerySource works in streaming mode (per the table here).
2) I create a separate replay topic for each pipeline and then use Flatten to merge the ReadFromPubSubs for the main topic and the pipeline-specific replay topic. After deploying the pipeline, I replay historic data to the pipeline-specific replay topic.
3) I create dedicated topics for each pipeline and deploy a separate pipeline that reads the main topic and broadcasts messages to the pipeline-specific topics. Whenever a replay is needed, I can replay data into the pipeline-specific topic.
Out of your three ideas:
The first one will not work because currently the Python SDK does not support unbounded reads from bounded sources (meaning that you can't add a ReadFromBigQuery to a streaming pipeline).
The third one sounds overly complicated, and maybe costly.
I believe your best bet at the moment is, as you rightly pointed out, to replay your table into an extra Pub/Sub topic that you Flatten with your main topic.
I will check if there's a better solution, but for now, option #2 should do the trick.
Also, I'd refer you to an interesting talk from Lyft on doing this for their architecture (in Flink).
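To make option #2 concrete, here is a minimal sketch using the Beam Java SDK for illustration (the Python ReadFromPubSub plus Flatten work the same way); the project and topic names are hypothetical:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class ReplayFlattenSketch {
  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // Live events shared by all pipelines.
    PCollection<String> live = p.apply("ReadMainTopic",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/main-topic"));

    // Pipeline-specific topic that only ever receives the replayed history.
    PCollection<String> replay = p.apply("ReadReplayTopic",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-pipeline-replay"));

    // Downstream transforms see one merged, unbounded collection.
    PCollection<String> merged =
        PCollectionList.of(live).and(replay).apply(Flatten.pCollections());

    // ...windowing, processing and sinks go here, applied to `merged`...

    p.run();
  }
}

If event time matters for your windows, consider publishing the original event timestamp as a message attribute and reading both topics with withTimestampAttribute, so that replayed messages keep their historic event times rather than their publish times.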

Simple inquiry about streaming data directly into Cloud SQL using Google DataFlow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because the information I find on how to do this all refers to writing to Cloud SQL in batches rather than full-on streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
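As a rough illustration, a streaming write to Cloud SQL with JdbcIO could look like the sketch below; the subscription, JDBC URL, credentials and table are hypothetical, and in practice you would connect through the Cloud SQL proxy or socket factory appropriate for your setup:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class StreamToCloudSqlSketch {
  public static void main(String[] args) {
    StreamingOptions options = PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadMessages",
            PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/events-sub"))
        .apply("WriteToCloudSQL",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.cj.jdbc.Driver",
                            "jdbc:mysql://127.0.0.1:3306/mydb") // hypothetical connection string
                        .withUsername("beam")
                        .withPassword("secret"))
                // "Batching" here only means several records share one database call;
                // the input itself remains a continuous, unbounded stream.
                .withStatement("INSERT INTO events (payload) VALUES (?)")
                .withPreparedStatementSetter(
                    (String payload, java.sql.PreparedStatement stmt) -> stmt.setString(1, payload)));

    p.run();
  }
}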

Multiple export using google dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that partitions a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, it is more likely for the Dataflow job to fail with an HTTP transport exception, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I want to avoid it if my data source later grows huge.
Therefore I am wondering whether there is any rule of thumb for how many sources and sinks I can group into one job? And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
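As a small illustration of the Partition approach mentioned above, the sketch below routes records to one of a few fixed destinations inside a single job; the bucket path, the region field and the routing rule are all hypothetical:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionedExportSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for the real source; records are "region,payload" strings here.
    PCollection<String> records =
        p.apply(Create.of("us,order-1", "eu,order-2", "us,order-3", "asia,order-4"));

    // Hypothetical routing: three fixed destinations chosen per record.
    final String[] regions = {"us", "eu", "asia"};
    PCollectionList<String> byRegion = records.apply(
        Partition.of(regions.length, new Partition.PartitionFn<String>() {
          @Override
          public int partitionFor(String record, int numPartitions) {
            String region = record.split(",", 2)[0];
            for (int i = 0; i < numPartitions; i++) {
              if (regions[i].equals(region)) {
                return i;
              }
            }
            return 0; // unknown regions fall back to the first partition
          }
        }));

    // One sink per partition, all inside the same job; failed bundles are retried.
    for (int i = 0; i < regions.length; i++) {
      byRegion.get(i).apply("Write-" + regions[i],
          TextIO.write().to("gs://my-bucket/exports/" + regions[i] + "/part"));
    }

    p.run();
  }
}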
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination

What is the Cloud Dataflow equivalent of BigQuery's table decorators?

We have a large table in BigQuery where the data is streaming in. Each night, we want to run Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure if that's what we need. We came up with up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
pipeline.apply(BigQueryIO.Read
        .named("events-read-from-BQ")
        .from("projectid:datasetid.events"))
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
    .apply(ParDo.of(denormalizationParDo)
        .named("events-denormalize")
        .withSideInputs(getSideInputs()))
    .apply(BigQueryIO.Write
        .named("events-write-to-BQ")
        .to("projectid:datasetid.events")
        .withSchema(getBigQueryTableSchema())
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage which extracts the whole BigQuery table, filters out the unnecessary data, and processes what remains. If the table is really big, you may want to fork the filtered data into a separate table, since the amount of data read would be significantly smaller than the total amount of data.
Use streaming Dataflow. For example, you may publish the data onto Pub/Sub and create a streaming pipeline with a 24-hour window. The streaming pipeline runs continuously, but provides sliding windows vs. daily windows.
Hope this helps
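If the batch approach works for you, one way to avoid scanning the whole table is to read from a query rather than a table reference, in the same SDK style as the snippet above. This is only a sketch: the event_timestamp column is hypothetical, so substitute whatever field holds your event time.

PCollection<TableRow> last24h = pipeline.apply(BigQueryIO.Read
    .named("events-read-last-24h")
    // Legacy SQL filter on a hypothetical timestamp column instead of a table decorator.
    .fromQuery("SELECT * FROM [projectid:datasetid.events] "
        + "WHERE event_timestamp >= DATE_ADD(CURRENT_TIMESTAMP(), -24, 'HOUR')"));
// ...then apply the denormalization ParDo and BigQueryIO.Write as before...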
