Simple inquiry about streaming data directly into Cloud SQL using Google Dataflow - google-cloud-dataflow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream it into a full relational DB (i.e. Cloud SQL). I have searched through this site and through Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because when I look up info on how to do this, it all refers to writing to Cloud SQL in batches and not full-on streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!

You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that, for efficiency, it writes multiple records per database call. The overloaded use of the word "batch" can be confusing, but here it only refers to avoiding the overhead of an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
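For reference, a write with JdbcIO in an unbounded (streaming) pipeline might look roughly like the sketch below. The Pub/Sub subscription, JDBC driver/URL, credentials, and the events table are hypothetical placeholders, not details taken from your setup:

pipeline
    .apply("ReadFromPubSub", PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/events-sub"))   // hypothetical subscription
    .apply("WriteToCloudSQL", JdbcIO.<String>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",                // hypothetical driver/URL for a Cloud SQL MySQL instance
                "jdbc:mysql://127.0.0.1:3306/mydb")
            .withUsername("dbuser")
            .withPassword("dbpassword"))
        .withStatement("INSERT INTO events (payload) VALUES (?)")            // hypothetical table/column
        .withPreparedStatementSetter((element, statement) ->
            statement.setString(1, element)));

Because the input PCollection here is unbounded, the inserts happen continuously as elements arrive; the batching described above only controls how many records are grouped into each database call.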

Related

Is apache-beam a good choice when the event time ordering has to be preserved when writing to a sink?

I'm considering using Apache Beam to write a streaming pipeline that applies a stream of mutations to replicate events from a source database into a destination database in event-time order. The source could be either Kafka or Pub/Sub.
An example would be something like this, except that the order in which the mutations are applied to the sink must be the order in which they arrived.
I did go over some of the previous questions asked on preserving order:
Processing Total Ordering of Events By Key using Apache Beam
Sort elements within a fixed window - Cloud Dataflow - This seems to be the same use case I'm interested in.
I understand that if I go down the Apache Beam road I would have to:
1) choose a windowing strategy with accommodation for late data (either a fixed windowing strategy with allowed lateness, or a global window with triggers to emit panes and a buffer for late data)
2) apply transformations
3) GroupByKey over a single key (so that everything goes to the same worker), sort, and write to the sink
In addition to the above, I would have to make sure the windows (if I follow a fixed-window strategy) are executed in order. Step 3 is bound to be the bottleneck.
If step 2) above involves a lot of computation, then Apache Beam would make sense to take advantage of the parallelism Beam offers. But if step 2) is just a simple one-to-one mapping, does Apache Beam make sense for this replication use case? Please let me know if I'm missing something.
Note: We do have a batch pipeline on Dataflow using Apache Beam to load a data dump on GCS into a database, where the entire data is on disk and the order in which it's written to the sink does not matter.
Preserving order is possible, but I'm not sure it's straightforward or efficient.
It also depends on how much data (elements/sec) you're expecting, as well as what the sink type is. Potentially you could have the pipeline write out ordered entries to GCS, and have a secondary process read the files back in order and apply them to the sink.
Your other option, using parallel writes and making sure the database is only usable up to the output watermark time of the last Beam stage, is maybe doable, but it's not really the core use case of Dataflow/Apache Beam.
Maybe there could be ways to process the stream out of order, but write to an intermediate sink that can easily be read from in order. i.e. writing out the mutation batches with a step or file number that can easily be used to order the files when applied to the final sink.
The window + write to final sink architecture is going to be difficult to get right, probably too complex for low volume of elements, and too inefficient for large volume. This is a good example of what this could look like.
But again, keep in mind that all of these approaches are definitely not the core use case for Dataflow/Apache Beam.
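For illustration only, a rough sketch of the "single key + sort within a fixed window" idea from step 3) of the question, in current Beam Java syntax; Mutation, getEventTime() and applyToSink(...) are hypothetical placeholders, and the window/lateness values are arbitrary:

mutations
    .apply("Window", Window.<Mutation>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        .discardingFiredPanes())
    .apply("SingleKey", WithKeys.<String, Mutation>of("all")        // every element gets the same key
        .withKeyType(TypeDescriptors.strings()))
    .apply(GroupByKey.<String, Mutation>create())
    .apply("SortAndApply", ParDo.of(new DoFn<KV<String, Iterable<Mutation>>, Void>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            List<Mutation> sorted = new ArrayList<>();
            c.element().getValue().forEach(sorted::add);
            sorted.sort(Comparator.comparing(Mutation::getEventTime));   // order by event time
            for (Mutation m : sorted) {
                applyToSink(m);   // hypothetical: apply the mutation to the destination DB
            }
        }
    }));

This funnels everything through a single key and buffers a full window before anything is written, which is exactly the bottleneck and latency concern discussed above.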

Audit records while working with streaming data in Apache Beam

I have a use case wherein records will be published from an on-premise system to a PubSub topic. Now, I want to make sure that all records published are read by the Apache Beam job and they are all correctly written to BigQuery.
I have two questions regarding this:
1) How do I make sure that there is no data loss in the entire process?
2) I need to maintain an audit table somewhere to make sure that if 'n' records were published, I have loaded each one of them successfully. How do I keep track of the records?
Thank You.
Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about exactly-once guarantees in the blog article: After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks).
For your question about an audit table: can you describe more about what you'd like to accomplish? Dataflow has built-in Elements Added counters available in the UI and API which will show exactly how many elements have been processed. You could match this up with the number of published Pub/Sub messages.
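If it helps, one lightweight option is a pass-through DoFn that increments a custom Beam metric for every element it sees; Dataflow surfaces such metrics in the monitoring UI and API alongside the built-in counters. This is just a sketch, and the namespace/name and the TableRow element type are arbitrary choices:

class CountingFn extends DoFn<TableRow, TableRow> {
    private final Counter processed = Metrics.counter("audit", "records-processed");

    @ProcessElement
    public void processElement(ProcessContext c) {
        processed.inc();           // one increment per element flowing through this step
        c.output(c.element());     // pass the element along unchanged
    }
}

You could then compare the reported counter value against the number of messages published to the topic.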

Streaming Dataflow pipeline with no sink

We have a streaming Dataflow pipeline running on Google Cloud Dataflow
workers, which needs to read from a PubSub subscription, group
messages, and write them to BigQuery. The built-in BigQuery Sink does
not fit our needs as we need to target specific datasets and tables
for each group. As the custom sinks are not supported for streaming
pipelines, it seems like the only solution is to perform the insert
operations in a ParDo. Something like this:
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink. In fact, a sink is a type of ParDo in streaming.
I recommend that you use a custom ParDo and use the BigQuery API with your custom logic. Here is the definition of the BigQuerySink; you can use this code as a starting point.
You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this is using Reshuffle instead of GroupByKey. I recommend that you use Reshuffle, which will also group by key, but avoid unnecessary windowing delays. In this case it means that the elements should be written out as soon as they come in, without extra buffering/delay. Additionally, this allows you to determine BQ table names at runtime.
Edit: I do not recommend using the built in BigQuerySink to write to different tables. This suggestion is to use the BigQuery API in your custom DoFn, rather than using the BigQuerySink
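As a rough illustration of that suggestion (not the actual StreamingWriteFn), the custom step could look something like the sketch below, in current Beam Java syntax; tableSpecForKey(...) and insertRow(...) are hypothetical helpers you would implement on top of the BigQuery client library:

keyedRows                                                  // a PCollection<KV<String, TableRow>>, keyed by group
    .apply(Reshuffle.<String, TableRow>of())               // groups by key without the buffering of a windowed GroupByKey
    .apply(ParDo.of(new DoFn<KV<String, TableRow>, Void>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Hypothetical: map the group key to a "project:dataset.table" spec at runtime.
            String tableSpec = tableSpecForKey(c.element().getKey());
            // Hypothetical: wrap the BigQuery tabledata().insertAll(...) call here.
            insertRow(tableSpec, c.element().getValue());
        }
    }));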

Multiple export using google dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that partitions a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, it is more likely for the Dataflow job to fail on an HTTP transport exception error, and I assume there is some bound on how much I/O, in terms of sources and sinks, I could wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once for each Dataflow job). It is okay for now, but ideally I would like to avoid it later if my data source grows huge.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one job? And is there any other better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the locations you would like it written to. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
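For reference, a minimal sketch of the Partition approach in the current Beam Java API; the number of outputs, partitionForRecord(...), and the GCS paths are hypothetical (the same pattern applies to table writes):

int numOutputs = 3;                                    // hypothetical number of output locations
PCollectionList<String> parts = records.apply(         // records: a PCollection<String> of the input data
    Partition.of(numOutputs, (String record, int numPartitions) -> partitionForRecord(record)));

for (int i = 0; i < numOutputs; i++) {
    parts.get(i).apply("Write-" + i,
        TextIO.write().to("gs://my-bucket/output-" + i + "/part"));   // one write per partition
}

Each partition gets its own write transform, and failed bundles within any of them are retried as described above.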
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination

What is the Cloud Dataflow equivalent of BigQuery's table decorators?

We have a large table in BigQuery where the data is streaming in. Each night, we want to run Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure out if that's what we need. We came up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
pipeline.apply(BigQueryIO.Read
        .named("events-read-from-BQ")
        .from("projectid:datasetid.events"))
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
    .apply(ParDo.of(denormalizationParDo)
        .named("events-denormalize")
        .withSideInputs(getSideInputs()))
    .apply(BigQueryIO.Write
        .named("events-write-to-BQ")
        .to("projectid:datasetid.events")
        .withSchema(getBigQueryTableSchema())
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage which extracts the whole BigQuery table, filters out unnecessary data, and processes that data. If the table is really big, you may want to fork the data into a separate table if the amount of data read is significantly smaller than the total amount of data.
Use streaming Dataflow. For example, you may publish the data onto Pub/Sub, and create a streaming pipeline with a 24-hour window. The streaming pipeline runs continuously, but provides sliding windows vs. daily windows.
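For reference, a rough sketch of that streaming alternative in the current Beam/Dataflow Java API (your snippet uses the older SDK style); the topic name and the ParseToTableRowFn parsing step are hypothetical:

pipeline
    .apply("ReadEvents", PubsubIO.readStrings()
        .fromTopic("projects/projectid/topics/events"))              // hypothetical topic
    .apply("ParseToTableRow", ParDo.of(new ParseToTableRowFn()))     // hypothetical DoFn<String, TableRow>
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
    .apply("Denormalize", ParDo.of(denormalizationParDo)
        .withSideInputs(getSideInputs()));

The window here is applied to data as it arrives, rather than to the contents of an existing table, which is why the pipeline has to run continuously.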
Hope this helps
