Writing large (>20KB) records with BigQueryIO - google-cloud-dataflow

This wasn't clear from the documentation, but it looks like BigQueryIO.write performs a streaming write, which in turn limits the row size to <20KB?
Is it possible to configure a non-streaming BigQuery write that enables support for the larger (1MB) row size? My Dataflow job is a batch job, not a streaming one, so BigQuery streaming is unnecessary and, in this case, undesirable, since it prevents me from importing my data.
If not, what's the recommended workflow for importing large rows into BigQuery? I guess I can run the Dataflow ETL and write my data into text files using TextIO, but then I'd have to add a manual step outside of this pipeline to trigger a BQ import?

Batch Dataflow jobs don't stream data to BigQuery. The data is written to GCS, and then we execute BigQuery import jobs to import the GCS files. So the streaming limits shouldn't apply.
Note that the import job is executed by the service, not by the workers, which is why you don't see code for this in BigQueryIO.write.

Related

Autoscaling of IO bound Dataflow streaming jobs

I am in the middle of designing a component to load CSV files that arrive in a GCS bucket into BQ tables.
Since our requirements involve inserting additional columns, I have come up with the following design using a Dataflow streaming job.
1] CSV file arrives on GCS bucket
2] A GCS object creation event is sent to a Pub/Sub topic.
These events are consumed by the 'GCS-BQ Loader' Dataflow streaming job.
2a] Looking at the metadata attached to the GCS object, decide the BQ table name, settings, etc.,
and calculate values for the new columns (lot_num and batch).
3] Using the BigQuery API, a temporary external table is created for the CSV file.
4] Using the BigQuery API, a query is executed to insert data from the BQ external table into the final BQ table. This step is needed because the additional columns (lot_num, batch) must be added to the final table.
Finally, the temporary external table is deleted.
Depending on the metadata attached to the GCS objects, we expect to have around 1000 BQ tables. The CSV files range from a couple of kilobytes to ~1 GB.
I have the following questions regarding this design:
This process is IO bound (rather than CPU bound), since BQ API calls are blocking.
In this case, these blocking calls will block the Dataflow threads. Will this impact performance?
How will Dataflow autoscaling work in this case?
From the docs:
Dataflow scales based on the parallelism of a pipeline. The
parallelism of a pipeline is an estimate of the number of threads
needed to most efficiently process data at any given time.
What kind of metric does Dataflow use for autoscaling this kind of IO-bound process?
Is it the number of unacknowledged Pub/Sub messages in the buffer?
Is Dataflow suitable for this kind of IO-bound processing, or is a plain Java app running on GKE (k8s) more suitable for this case?

Simple inquiry about streaming data directly into Cloud SQL using Google DataFlow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running that streams into BigQuery, but I want to stream into a full relational DB (i.e., Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here, because everything I find on how to do this refers to writing to Cloud SQL in batches and not to full-on streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it in batches instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
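To make the meaning of "batching" above concrete, here is a minimal sketch in plain Java with no Beam or JDBC dependency. The class name `BatchingWriter` and the `flushedBatches` list (which stands in for real database calls) are hypothetical, and this is not JdbcIO's actual implementation. It only shows the idea: records arrive one at a time, but each database call carries up to a maximum number of them, and a flush at the end of a bundle may carry fewer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: buffers records and "writes" them in groups,
// the way a sink may pack several records into one database call.
final class BatchingWriter {
    private final int maxBatchSize;
    private final List<String> buffer = new ArrayList<>();
    // Each inner list stands in for one database round trip.
    final List<List<String>> flushedBatches = new ArrayList<>();

    BatchingWriter(int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.maxBatchSize = maxBatchSize;
    }

    // Called once per incoming record, whether the input is bounded or not.
    void write(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Called when the runner ends a unit of work; the batch may be partial.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With a maximum of 3, writing 7 records and then flushing produces three calls of sizes 3, 3 and 1, which is why the number of records per call can be less than the configured maximum.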

Streaming Dataflow pipeline with no sink

We have a streaming Dataflow pipeline running on Google Cloud Dataflow
workers, which needs to read from a PubSub subscription, group
messages, and write them to BigQuery. The built-in BigQuery Sink does
not fit our needs as we need to target specific datasets and tables
for each group. As the custom sinks are not supported for streaming
pipelines, it seems like the only solution is to perform the insert
operations in a ParDo. Something like this:
Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?
There should not be any issues with writing a pipeline without a sink. In fact, a sink is a type of ParDo in streaming.
I recommend that you use a custom ParDo and use the BigQuery API with your custom logic. Here is the definition of the BigQuerySink; you can use this code as a starting point.
You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
Note that this is using Reshuffle instead of GroupByKey. I recommend that you use Reshuffle, which will also group by key, but avoid unnecessary windowing delays. In this case it means that the elements should be written out as soon as they come in, without extra buffering/delay. Additionally, this allows you to determine BQ table names at runtime.
Edit: I do not recommend using the built-in BigQuerySink to write to different tables. This suggestion is to use the BigQuery API in your custom DoFn, rather than using the BigQuerySink.
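As a rough illustration of choosing the destination at runtime, the sketch below groups message payloads by a table spec computed from each message. It is plain Java with no Dataflow or BigQuery dependency; `TableRouter`, `Message`, and the naming rule (dataset from customer, table from event type) are all hypothetical. In a real pipeline this logic would sit inside the custom DoFn, and the grouped payloads would be passed to BigQuery API insert calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides, per message, which table it belongs to.
final class TableRouter {
    static final class Message {
        final String customer;
        final String eventType;
        final String payload;

        Message(String customer, String eventType, String payload) {
            this.customer = customer;
            this.eventType = eventType;
            this.payload = payload;
        }
    }

    private final String project;

    TableRouter(String project) {
        this.project = project;
    }

    // Assumed naming rule, for illustration only.
    String tableSpecFor(String customer, String eventType) {
        return project + ":" + sanitize(customer) + "." + sanitize(eventType);
    }

    // Restrict IDs to a conservative subset: letters, digits, underscores.
    static String sanitize(String s) {
        return s.replaceAll("[^A-Za-z0-9_]", "_");
    }

    // Groups payloads by destination so each table gets one insert call.
    Map<String, List<String>> groupByTable(List<Message> messages) {
        Map<String, List<String>> byTable = new LinkedHashMap<>();
        for (Message m : messages) {
            byTable.computeIfAbsent(tableSpecFor(m.customer, m.eventType),
                    k -> new ArrayList<>()).add(m.payload);
        }
        return byTable;
    }
}
```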

Is google dataflow BQ/BT Write atomic per job?

Maybe I am a bad seeker, but I couldn't find my answers in the documentation, so I just want to try my luck here.
So my question is: say I have a Dataflow job that writes to BigQuery or Bigtable, and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?
I know that writes to GCS do not seem to be atomic: partial output partitions are produced along the way while the job is running.
However, I have tried dumping data into BQ with Dataflow, and it seems that the output table is not exposed to users until the job reports success.
In Batch, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):
1. Write all data to a temporary directory on GCS.
2. Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.
If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.
Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails there will be no data written to BigQuery.
For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried we will still have exactly-once write semantics to BigQuery.
Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In a common case, you have only one such write in your job, and so if the job succeeds the data will be in BigQuery and if the job fails there will be no data written.
However: Note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails.
This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks, in order to ensure correctness in the presence of committed writes from the earlier failed job.
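The role of the deterministic job id can be sketched in plain Java. Everything here is hypothetical: `FakeLoadService` is a stand-in, not the BigQuery client, and it assumes the load service refuses a second job that reuses an existing id. The id format only loosely follows the `dataflow_job_12423423` example above. The point is that a retry recomputes the same id, so the data is loaded once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a load service that treats the job id as an
// idempotency key: a second submission with the same id loads nothing.
final class FakeLoadService {
    private final Map<String, Integer> rowsByJobId = new HashMap<>();
    int totalRowsLoaded = 0;

    // Returns true if this call loaded data, false if the id was already used.
    boolean submitLoad(String jobId, int rowCount) {
        if (rowsByJobId.containsKey(jobId)) {
            return false;
        }
        rowsByJobId.put(jobId, rowCount);
        totalRowsLoaded += rowCount;
        return true;
    }
}

final class DeterministicJobIds {
    // The same inputs always give the same id, so a retried monitor
    // resubmits under the id of the original attempt. Format is assumed.
    static String loadJobId(String dataflowJobId, String stepName) {
        return "dataflow_job_" + dataflowJobId + "_" + stepName;
    }
}
```

Submitting twice with `loadJobId("12423423", "write_events")` loads the rows only on the first call. A random id per attempt would not have this property.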
I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails part way will write partial data into Bigtable.
BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of a Dataflow job that might have created the BigQuery job. E.g. if your Dataflow job fails but it has written to BigQuery before failing (and that BigQuery job is complete) then the data will remain in BigQuery.

What is the Cloud Dataflow equivalent of BigQuery's table decorators?

We have a large table in BigQuery where the data is streaming in. Each night, we want to run a Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure out if that's what we need. We came up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
pipeline.apply(BigQueryIO.Read
        .named("events-read-from-BQ")
        .from("projectid:datasetid.events"))
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
    .apply(ParDo.of(denormalizationParDo)
        .named("events-denormalize")
        .withSideInputs(getSideInputs()))
    .apply(BigQueryIO.Write
        .named("events-write-to-BQ")
        .to("projectid:datasetid.events")
        .withSchema(getBigQueryTableSchema())
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage which reads the whole BigQuery table, filters out the unnecessary data, and processes the rest. If the table is really big and the data you need is significantly smaller than the total, you may want to fork that data into a separate table.
Use streaming Dataflow. For example, you may publish the data to Pub/Sub and create a streaming pipeline with a 24-hour window. The streaming pipeline runs continuously, but it provides sliding windows as opposed to daily windows.
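For the first approach, the filtering step amounts to a time-range check on each row. Below is a minimal sketch in plain Java, without the Dataflow SDK. `RecentRowsFilter`, `Row`, and the `eventTime` field are hypothetical names, and it assumes each row carries a timestamp you can compare against. In a pipeline the same predicate would run inside a ParDo placed right after the read.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter: keeps only rows from the trailing time window.
final class RecentRowsFilter {
    static final class Row {
        final String id;
        final Instant eventTime; // assumed timestamp column

        Row(String id, Instant eventTime) {
            this.id = id;
            this.eventTime = eventTime;
        }
    }

    // Keeps rows whose timestamp falls in [now - window, now).
    static List<Row> lastWindow(List<Row> rows, Instant now, Duration window) {
        Instant cutoff = now.minus(window);
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            boolean atOrAfterCutoff = !r.eventTime.isBefore(cutoff);
            if (atOrAfterCutoff && r.eventTime.isBefore(now)) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

Passing `now` in explicitly, rather than calling `Instant.now()` inside the filter, keeps a nightly run reproducible if it has to be retried later.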
Hope this helps
