Streaming Dataflow pipeline with no sink - google-cloud-dataflow

We have a streaming Dataflow pipeline running on Google Cloud Dataflow
workers, which needs to read from a PubSub subscription, group
messages, and write them to BigQuery. The built-in BigQuery Sink does
not fit our needs as we need to target specific datasets and tables
for each group. As custom sinks are not supported for streaming pipelines, it seems like the only solution is to perform the insert operations in a ParDo. Something like this:
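(What follows is a rough sketch rather than our actual code, assuming the Beam Java SDK and the google-cloud-bigquery client; the DoFn name and the "dataset.table" key format are illustrative.)

import java.util.Map;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.TableId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Takes grouped rows keyed by a "dataset.table" destination and streams them
// into that table with the BigQuery API instead of a built-in sink.
class InsertGroupedRowsFn extends DoFn<KV<String, Iterable<Map<String, Object>>>, Void> {
  private transient BigQuery bigquery;

  @Setup
  public void setup() {
    bigquery = BigQueryOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] destination = c.element().getKey().split("\\.", 2);
    InsertAllRequest.Builder request =
        InsertAllRequest.newBuilder(TableId.of(destination[0], destination[1]));
    for (Map<String, Object> row : c.element().getValue()) {
      request.addRow(row);
    }
    bigquery.insertAll(request.build());  // error handling and retries omitted
  }
}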
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?

There should not be any issues with writing a pipeline without a sink. In fact, a sink is just a type of ParDo in streaming.
I recommend that you use a custom ParDo and call the BigQuery API with your own logic. Here is the definition of the BigQuerySink; you can use that code as a starting point.
You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this uses Reshuffle instead of GroupByKey. I recommend using Reshuffle, which also groups by key but avoids unnecessary windowing delays. In this case it means the elements are written out as soon as they come in, without extra buffering or delay. Additionally, this allows you to determine BQ table names at runtime.
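For example, a minimal sketch of that wiring with the Beam Java SDK (the class name and the tableSpecFor routing logic are illustrative placeholders, not the actual BigQuerySink code):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DynamicBigQueryWrite {
  // Placeholder routing logic: decide the "dataset.table" destination per element.
  static String tableSpecFor(TableRow row) {
    return "my_dataset.events";
  }

  public static void write(PCollection<TableRow> rows) {
    rows
        // Key each element by the destination table, decided at runtime.
        .apply(WithKeys.of((TableRow r) -> tableSpecFor(r))
            .withKeyType(TypeDescriptors.strings()))
        // Groups by key like GroupByKey would, but without extra windowing delay.
        .apply(Reshuffle.of())
        .apply(ParDo.of(new DoFn<KV<String, TableRow>, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Insert c.element().getValue() into the table named by
            // c.element().getKey() using the BigQuery API, as in the
            // DoFn sketched in the question.
          }
        }));
  }
}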
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API in your custom DoFn rather than the BigQuerySink.

Related

Is apache-beam a good choice when the event time ordering has to be preserved when writing to a sink?

I'm considering using Apache Beam to write a streaming pipeline that applies a stream of mutations, replicating events from a source database into a destination database in event-time order. The source could be either Kafka or Pub/Sub.
An example would be something like this, except that the order in which the mutations are applied to the sink must be the order in which they arrived.
I did go over some of the previous questions asked on preserving order:
Processing Total Ordering of Events By Key using Apache Beam
Sort elements within a fixed window - Cloud Dataflow - This seems to be the same use case I'm interested in.
I understand that if I go down the Apache Beam road I would have to:
1. choose a windowing strategy with accommodation for late data (either a fixed windowing strategy with an allowed lateness, or a global window with triggers to emit panes and a buffer for late data)
2. apply transformations
3. GroupByKey over a single key (so that everything goes to the same worker), sort, and write to the sink
In addition to the above, I would have to make sure the windows (if I follow a fixed window strategy) are executed in order. Step 3 is bound to be the bottleneck.
If step 2 above is a lot of computation, then Apache Beam makes sense to take advantage of the parallelism Beam offers. But if step 2 is just a simple one-to-one mapping, does Apache Beam make sense for this replication use case? Please let me know if I'm missing something.
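As a rough sketch of steps 1-3, assuming the Beam Java SDK (Mutation is a placeholder for the replication event type, and the actual database write is omitted):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Mutation is a placeholder type with a comparable getEventTimestamp() accessor.
PCollection<Mutation> mutations = ...;  // read from KafkaIO or PubsubIO

mutations
    // step 1: fixed windows with allowed lateness for late data
    .apply(Window.<Mutation>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        .discardingFiredPanes())
    // step 2: transformations would go here
    // step 3: a single key so everything lands on one worker, then sort per window
    .apply(WithKeys.of("all"))
    .apply(GroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<Mutation>>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        List<Mutation> sorted = new ArrayList<>();
        c.element().getValue().forEach(sorted::add);
        sorted.sort(Comparator.comparing(Mutation::getEventTimestamp));
        // apply the sorted mutations to the destination database here;
        // nothing in this sketch guarantees that the windows themselves fire in order
      }
    }));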
Note: We do have a batch pipeline on Dataflow using Apache Beam to load a data dump from GCS into the database, where the entire data set is on disk and the order in which it is written to the sink does not matter.
Preserving order is possible, but I'm not sure it's straightforward or efficient.
It also depends on how much data (elements/sec) you're expecting, as well as what the sink type is. Potentially you could have the pipeline write out ordered entries to GCS, and have the sink read the files in, in order, as a secondary process.
Your other option, using parallel writes and making sure the database is only usable up to the output watermark time of the last Beam stage, is maybe doable, but not really the core use case of Dataflow/Apache Beam.
Maybe there could be ways to process the stream out of order but write to an intermediate sink that can easily be read from in order, e.g. writing out the mutation batches with a step or file number that can easily be used to order the files when they are applied to the final sink.
The window + write to final sink architecture is going to be difficult to get right, probably too complex for low volume of elements, and too inefficient for large volume. This is a good example of what this could look like.
But again, keep in mind that all these approaches are definitely not the core use case for Dataflow/Apache Beam.

Simple inquiry about streaming data directly into Cloud SQL using Google DataFlow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream it into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because when I look up info on how to do this it all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
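For example, a sketch of a streaming write to a MySQL-flavored Cloud SQL instance could look like the following; the connection string, table, and the Event type with its accessors are placeholders:

import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<Event> events = ...;  // unbounded, e.g. read from PubsubIO

events.apply(JdbcIO.<Event>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://<CLOUD_SQL_IP>:3306/mydb?user=beam&password=secret"))
    .withStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
    .withPreparedStatementSetter((element, statement) -> {
      statement.setLong(1, element.getId());
      statement.setString(2, element.getPayload());
    })
    // each database call writes up to this many records; this is the "batching"
    // discussed above, not batch-vs-streaming execution
    .withBatchSize(500));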

Check data watermark at different steps via the Dataflow API

In the Dataflow UI, I can check the data watermark at various steps of the job (ex. at step GroupByKey, the data watermark is 2017-05-24 (10:51:58)). Is it possible to access this data via the Dataflow API?
Yes, you can use the gcloud command line tool to access the API.
gcloud beta dataflow metrics list <job_id> --project=<project_name>
Look for metrics ending in data-watermark, for example:
F82-windmill-data-watermark
However, this is not yet easy to interpret, since the naming is based on an optimized view of the Dataflow graph rather than the pipeline graph that your code and the UI present, and it uses stage identifiers like FX.
It might be best to take all the data-watermarks and grab the minimum value, which would show the oldest timestamp for elements not yet fully processed by the pipeline.
What information are you looking for in particular?
See:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/
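If you would rather call the API programmatically instead of via gcloud, a rough sketch with the generated Java client (google-api-services-dataflow) could look like this; client construction and auth are abbreviated behind a hypothetical helper, and the project/job IDs are placeholders:

import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;

Dataflow dataflow = buildAuthorizedDataflowClient();  // hypothetical helper for auth setup
JobMetrics jobMetrics = dataflow.projects().jobs()
    .getMetrics("<project_id>", "<job_id>")
    .execute();

for (MetricUpdate metric : jobMetrics.getMetrics()) {
  String name = metric.getName().getName();
  if (name.endsWith("data-watermark")) {
    // Taking the minimum across all stages approximates the oldest element
    // not yet fully processed by the pipeline.
    System.out.println(name + " = " + metric.getScalar());
  }
}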

Multiple export using google dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, it is more likely for the Dataflow job to fail on an HTTP transport exception error, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, which would mean processing the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I would like to avoid it if my data source later grows huge.
Therefore I am wondering whether there is any rule of thumb for how many sources and sinks I can group into one steady job? And is there any other, better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
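As a sketch of that Partition approach with the Beam Java SDK (destinationIndexFor and the element type are placeholders for your own routing logic and data):

import java.util.List;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

PCollection<TableRow> rows = ...;
int numDestinations = 10;

PCollectionList<TableRow> partitioned = rows.apply(
    Partition.of(numDestinations,
        (Partition.PartitionFn<TableRow>)
            (row, numPartitions) -> destinationIndexFor(row) % numPartitions));

List<PCollection<TableRow>> outputs = partitioned.getAll();
for (int i = 0; i < outputs.size(); i++) {
  // apply the write transform for destination i to outputs.get(i);
  // each write's bundles are retried independently if they fail
}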
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination

Writing large (>20KB) records with BigQueryIO

This wasn't clear from the documentation, but it looks like BigQueryIO.write performs a streaming write, which in turn limits the row size to <20KB?
Is it possible to configure a non-streaming BigQuery write that enables support for the larger (1MB) row size? My Dataflow job is a batch job, not a streaming one, and BigQuery streaming is neither necessary nor desired in this case, since it prevents me from importing my data.
If not, what's the recommended workflow for importing large rows into BigQuery? I guess I could run the Dataflow ETL and write my data into text files using TextIO, but then I'd have to add a manual step outside of this pipeline to trigger a BQ import?
Batch Dataflow jobs don't stream data to BigQuery. The data is written to GCS and then we execute BigQuery import jobs to import the GCS files, so the streaming limits shouldn't apply.
Note the import job is executed by the service not by the workers which is why you don't see code for this in BigQueryIO.write.
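For illustration, a batch write with a current Beam SDK could look like this (the table spec and schema are placeholders); because the input is bounded, the service uses GCS files plus a BigQuery load job rather than streaming inserts:

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<TableRow> rows = ...;   // bounded, e.g. produced by a batch ETL
TableSchema schema = ...;           // your table schema

rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
// Newer SDKs can also force load jobs explicitly with
// .withMethod(BigQueryIO.Write.Method.FILE_LOADS).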
