Is google dataflow BQ/BT Write atomic per job? - google-cloud-dataflow

Maybe I'm just bad at searching, but I couldn't find the answer in the documentation, so I want to try my luck here.
My question: say I have a Dataflow job that writes to BigQuery or Bigtable, and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?
I know that writes to GCS don't seem to be atomic, in that partial output partitions are produced along the way while the job is running.
However, I have tried dumping data into BQ with Dataflow, and it seems that the output table is not exposed to users until the job claims success.

In Batch, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):
1. Write all data to a temporary directory on GCS.
2. Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.
If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.
Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails there will be no data written to BigQuery.
For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried we will still have exactly-once write semantics to BigQuery.
Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In a common case, you have only one such write in your job, and so if the job succeeds the data will be in BigQuery and if the job fails there will be no data written.
However, note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails.
This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks, in order to ensure correctness in the presence of committed writes from the earlier failed job.
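To make the common case concrete, here is a minimal sketch of a batch pipeline with a single BigQueryIO.Write, written against the pre-Beam Dataflow Java SDK 1.x that this answer describes; the bucket, project, dataset, and table names are hypothetical.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class SingleAtomicBigQueryWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("line").setType("STRING")));

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))  // hypothetical input path
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(new TableRow().set("line", c.element()));
       }
     }))
     // In batch, this single write is staged to GCS and committed via one
     // BigQuery load job: either all rows appear in the table, or none do.
     .apply(BigQueryIO.Write.to("my-project:my_dataset.some_table")  // hypothetical table
         .withSchema(schema)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}

Because the pipeline contains only this one sink, the job either finishes with the data committed or fails with nothing written, per the answer above.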

I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails partway through will write partial data into Bigtable.

BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of a Dataflow job that might have created the BigQuery job. For example, if your Dataflow job fails but it had already written to BigQuery before failing (and that BigQuery job completed), then the data will remain in BigQuery.
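To illustrate the load-job atomicity and the deterministic-job-id point from the answer above, here is a hedged sketch using the google-cloud-bigquery Java client (not the code Dataflow runs internally); the dataset, table, staged file pattern, and job id are hypothetical.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class AtomicLoadJob {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    LoadJobConfiguration load = LoadJobConfiguration
        .newBuilder(TableId.of("my_dataset", "some_table"),    // hypothetical table
            "gs://my-bucket/temp/part-*")                      // hypothetical staged files
        .setFormatOptions(FormatOptions.json())
        .build();

    // A deterministic job id: if the client monitoring the job dies and the load
    // is re-issued, BigQuery rejects the duplicate id, so the retry can look up
    // and poll the existing job instead of loading the data twice.
    JobId jobId = JobId.of("my-load-job-12423423");            // hypothetical id
    Job job = bigquery.create(JobInfo.newBuilder(load).setJobId(jobId).build());

    // The load commits all rows as one atomic update on success, or writes nothing.
    Job completed = job.waitFor();
    if (completed == null || completed.getStatus().getError() != null) {
      throw new RuntimeException("Load job failed; no rows were written");
    }
  }
}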

Related

Call BQ stored procedures in Dataflow

I have a set of stored procedures that I wish to run back to back, in sequence, so that I can automate their execution. Cloud Dataflow is a good option for performing ETL and post-processing steps, but the issue is that it only accepts SELECT queries in the job. How do I call these procedures, which I have saved in BQ, from a Dataflow job?
If Cloud Dataflow does not offer this, what would be an alternative way to achieve it? I don't want to use the BQ scheduled query option.
P.S. When I say stored procedure, it involves multiple inserts, deletes, updates, and truncates in the same script. I don't have a SELECT there.
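As a hedged illustration (not from the original thread), one way to invoke such a procedure outside of a SELECT-based source is to run a CALL statement as a BigQuery query job via the Java client; the project, dataset, and procedure names below are hypothetical, and this could be wrapped in whatever orchestration step you prefer.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class CallStoredProcedure {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // CALL runs the whole procedure body (inserts, deletes, updates, truncates)
    // server-side as a single BigQuery script job; no SELECT is required.
    QueryJobConfiguration call = QueryJobConfiguration
        .newBuilder("CALL `my-project.my_dataset.my_post_processing_proc`();")  // hypothetical procedure
        .build();

    bigquery.query(call);  // blocks until the script finishes
  }
}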

Spring Cloud Dataflow - composed-task-runner doesn't start second task

I have a Dataflow pipeline consisting of two sequential batch jobs. The first batch gets completed successfully, but the second one doesn't start.
I have started the Dataflow server with the embedded H2 DB, and I've pointed Spring Batch to the same H2 instance via application.properties. After the first step in my pipeline completes, I can see the batch execution logs in that same DB instance.
My composed-task-runner application seems to pick up Dataflow's datasource correctly; I can see it is inherited from the Dataflow server, and the properties are shown in the Dashboard's task execution section.
There are no errors in the logs, only entries from the successful execution of the first batch.
My TASK_EXECUTION entries:
What could be the problem? And why are there two entries in the TASK_EXECUTION table for the first step? Per the task_name, these entries belong to the first batch step only.
I was able to address this issue by rebuilding my batch task using Spring Initializr. Initially I was trying to use spring-cloud-task-app-starters as the base for my work, and that is probably not the right way of building Dataflow tasks.
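For reference, a minimal skeleton of the kind of task application Spring Initializr generates might look like the following; the class name is hypothetical, and the key point is that the task is a plain Spring Boot app with @EnableTask so its executions are recorded for composed-task-runner to sequence.

import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;

// Minimal skeleton of a batch task as generated from Spring Initializr with
// spring-cloud-starter-task and spring-boot-starter-batch on the classpath.
@EnableTask              // records TASK_EXECUTION rows so composed-task-runner can sequence it
@EnableBatchProcessing
@SpringBootApplication
public class MyBatchTaskApplication {
  public static void main(String[] args) {
    SpringApplication.run(MyBatchTaskApplication.class, args);
  }
}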

How to read from pubsub source in parallel using dataflow

I am very new to Dataflow, and I am looking to build a pipeline that will use Pub/Sub as its source.
I have worked on a streaming pipeline with Flink as the streaming engine and Kafka as the source; there, we can set the parallelism in Flink for reading messages from Kafka, so that message processing happens in parallel instead of sequentially.
I am wondering whether the same is possible with Pub/Sub and Dataflow, or whether it will only read messages sequentially.
Take a look at the PubSubToBigQuery pipeline. This uses Pub/Sub as a source and will read data in parallel: by default, multiple threads will each read a message off Pub/Sub and hand it to downstream transforms for processing.
Please note that the PubSubToBQ pipeline can also be run as a template pipeline, which works well for many users. Just launch the pipeline from the Template UI and set the appropriate parameters to point to your Pub/Sub and BQ locations. Some users prefer to use it that way, but this depends on where you want to store your data.
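As a rough illustration (not the template's actual code), a minimal Beam Java sketch of a parallel Pub/Sub read might look like the following; the subscription name is hypothetical.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ParallelPubsubRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubsub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))  // hypothetical subscription
     .apply("Process", ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Each worker thread receives its own messages; there is no explicit
         // parallelism knob as in Flink, because Dataflow scales the read and
         // this ParDo across worker threads and machines automatically.
         c.output(c.element().toUpperCase());
       }
     }));

    p.run();
  }
}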

Cloud Dataflow Streaming continuously failing to insert

My Dataflow pipeline functions as follows:
Read from Pubsub
Transform data into rows
Write the rows to BigQuery
On occasion, data is passed which fails to insert. That is alright; I know the reason for this failure. But Dataflow continuously attempts to insert this data over and over and over. I would like to limit the number of retries, as it bloats the worker logs with irrelevant information, making it extremely difficult to troubleshoot the problem when the same error appears repeatedly.
When running the pipeline locally I get:
no evaluator registered for Read(PubsubSource)
I would love to be able to test the pipeline locally, but it does not seem that Dataflow supports this option with PubSub.
To clear the errors I am left with no other choice than to cancel the pipeline and run a new job on Google Cloud, which costs time and money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?
Dataflow UI
Job ID: 2017-02-08_09_18_15-3168619427405502955
To run the pipeline locally with unbounded data sets, per @Pablo's suggestion, use the InProcessPipelineRunner.
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
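For a fuller picture, here is a minimal sketch of that runner switch, assuming the pre-Beam Dataflow Java SDK 1.6+ where InProcessPipelineRunner lives; the option handling is illustrative only.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.inprocess.InProcessPipelineRunner;

public class LocalRun {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // Run the unbounded Pub/Sub pipeline locally instead of on the service;
    // this runner supports unbounded sources, unlike DirectPipelineRunner.
    options.setRunner(InProcessPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... apply the Pub/Sub read, transform, and BigQuery write here ...
    p.run();
  }
}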

Any easier way to flush aggregator to GCS at the end of google dataflow pipeline

I am using an Aggregator to log some runtime stats of a Dataflow job, and I want to flush them to either GCS or BQ when the pipeline completes (or when each transform completes).
Currently I am doing this by, in addition to using the Aggregator, also creating a side output via a TupleTag at the same time and flushing that side-output PCollection.
However, I am wondering whether there might be any other handy way to flush the aggregators themselves directly?
Your method of using a side output PCollection should produce semantically equivalent results to using an Aggregator. (For example, both Aggregators and side outputs will not include duplicate values when a bundle fails and has to be retried.) The main difference is that partial results for Aggregators are available during pipeline execution in the monitoring UI and programmatically.
Within Java, you can use PipelineResult.getAggregatorValues(). If you get the PipelineResult from the (non-blocking) DataflowPipelineRunner, that will let you query aggregators as the job runs. If you use the BlockingDataflowPipelineRunner, Pipeline.run() blocks and you won't get the PipelineResult until after the job completes.
There's also command-line support: gcloud alpha dataflow metrics tail JOB_ID
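As an illustration of that API, here is a sketch assuming the pre-Beam Dataflow Java SDK 1.x; the aggregator name and input path are hypothetical.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.PipelineResult;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Aggregator;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.Sum;

public class AggregatorFlushExample {
  static class CountEmptyLines extends DoFn<String, String> {
    final Aggregator<Long, Long> emptyLines =
        createAggregator("emptyLines", new Sum.SumLongFn());   // hypothetical aggregator name

    @Override
    public void processElement(ProcessContext c) {
      if (c.element().isEmpty()) {
        emptyLines.addValue(1L);   // retried bundles are not double-counted in the committed value
      } else {
        c.output(c.element());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    CountEmptyLines fn = new CountEmptyLines();

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))        // hypothetical input
     .apply(ParDo.of(fn));

    PipelineResult result = p.run();

    // With the non-blocking runner this can be polled while the job runs; with
    // the blocking runner it is available after completion. From here the values
    // could be written to GCS or BQ with a plain client.
    System.out.println(result.getAggregatorValues(fn.emptyLines).getValues());
  }
}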
