Running periodic Dataflow job - google-cloud-dataflow

I have to join data from Google Datastore and Google Bigtable to produce a report, and I need to execute that operation every minute. Is it possible to accomplish this with Google Cloud Dataflow (assuming the processing itself does not take long and/or can be split into independent parallel jobs)?
Should I have an endless loop inside "main" that creates and executes the same pipeline again and again?
If most of the time in such a scenario is spent bringing up the VMs, is it possible to instruct Dataflow to use customer-provided VMs instead?
Thanks,

If you expect your job to be small enough to complete in 60 seconds, you could consider using the Datastore and Bigtable APIs from within a DoFn in a streaming job. Your pipeline might look something like:
// Generate one element per minute, forever, as the trigger for each read.
PCollection<Long> impulse = p.apply(
    CountingInput.unbounded().withRate(1, Duration.standardMinutes(1)));
PCollection<A> input1 = impulse.apply(ParDo.of(readFromDatastore));
PCollection<B> input2 = impulse.apply(ParDo.of(readFromBigTable));
...
This produces a single input every minute, forever. Running as a streaming pipeline, the VMs will continue running.
After reading from both APIs you can then window/join as necessary.
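For illustration, here is a minimal sketch of what readFromDatastore could look like, assuming the Dataflow SDK 1.x DoFn style used above and the google-cloud-datastore client library; the kind "Report", the property "payload", and the String output type are hypothetical placeholders rather than anything from the original answer:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Query;
import com.google.cloud.datastore.QueryResults;

class ReadFromDatastoreFn extends DoFn<Long, String> {
  private transient Datastore datastore;

  @Override
  public void startBundle(Context c) {
    // Build the Datastore client once per bundle rather than once per element.
    datastore = DatastoreOptions.getDefaultInstance().getService();
  }

  @Override
  public void processElement(ProcessContext c) {
    // Run a query on every impulse tick; "Report" and "payload" are placeholder names.
    Query<Entity> query = Query.newEntityQueryBuilder().setKind("Report").build();
    QueryResults<Entity> results = datastore.run(query);
    while (results.hasNext()) {
      c.output(results.next().getString("payload"));
    }
  }
}

A readFromBigTable DoFn would follow the same pattern with the Bigtable client.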

Related

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job; it wasn't clear to me from the Google documentation.
Currently, I'm using Cloud Dataflow to read a non-standard CSV file, do some basic processing, and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And, we also format our data to insert (in python) into BigQuery:
DATA = [
("house", "1982-12-27"),
("car", "1889-9-11")
]
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for small-scale task scheduling.
Here is an example to help you understand:
Say you have a CSV file in GCS, and, using your example, say you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it and it's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it is overwritten. If you don't want to manually run the job at exactly 01:00 UTC every day, weekends and holidays included, you need something to run the job for you periodically (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You provide Cloud Composer with a config that includes what jobs to run (operators), when to run them (a job start time), and how frequently to run them (daily, weekly, or even yearly).
That seems cool already, but what if the CSV file is overwritten not at 01:00 UTC but at any time of day; how would you choose the daily running time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off a job only when the condition is satisfied.
There are a lot more features that Cloud Composer/Apache Airflow provides, including DAGs that run multiple jobs, failed-task retry, failure notification, and a nice dashboard. You can also learn more from their documentation.
For the basics of your described task, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good fit for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. It usually takes the coordination of more than one task/system to extract the desired data. Think of load, transform, merge, extract, and store types of tasks. Big data processing is often glued together with shell scripts and/or Python programs. This makes automation, management, scheduling, and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross-platform orchestration tool that supports AWS, Azure, and GCP (and more) with management, scheduling, and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer (Apache Airflow) is designed for task scheduling.
Cloud Dataflow (Apache Beam) handles the tasks themselves.
For me, Cloud Composer is a step up (a big one) from Dataflow. If I had one task, say processing my CSV file from Storage into BQ, I would/could use Dataflow. But if I wanted to run the same job daily, I would use Composer.

Dataflow worker pool creation and deletion time overhead

During the execution of each Dataflow job, around 2-4 minutes are spent on the creation and deletion of the VMs (the worker pool).
Please let me know if there is any way to minimize this.
OR
Can we create the VMs for processing before the Dataflow job executes, so that the execution time comes down?
Dataflow is fully managed. From the documentation:
You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.

Cloud Dataflow Streaming continuously failing to insert

My Dataflow pipeline functions as follows:
Read from Pub/Sub
Transform the data into rows
Write the rows to BigQuery
On occasion, data is passed which fails to insert. That is alright; I know the reason for the failure. But Dataflow continuously attempts to insert this data over and over and over and over. I would like to limit the number of retries, as it bloats the worker logs with irrelevant information, making it extremely difficult to troubleshoot the problem when the same error appears repeatedly.
When running the pipeline locally I get:
no evaluator registered for Read(PubsubSource)
I would love to be able to test the pipeline locally, but it does not seem that Dataflow supports this option with Pub/Sub.
To clear the errors, I am left with no choice but to cancel the pipeline and run a new job on Google Cloud, which costs time and money. Is there a way to limit the errors? Is there a way to test my pipeline locally? Is there a better approach to debugging the pipeline?
Dataflow UI
Job ID: 2017-02-08_09_18_15-3168619427405502955
To run the pipeline locally with unbounded data sets, on @Pablo's suggestion, use the InProcessPipelineRunner:
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Running the program locally has allowed me to handle errors with exceptions and optimize my workflow rapidly.
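For reference, a minimal sketch of the surrounding options setup, assuming the Dataflow SDK 1.x classes named above (the variable names are placeholders):

DataflowPipelineOptions dataflowOptions =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
// InProcessPipelineRunner executes the pipeline on the local machine and, unlike the
// older DirectPipelineRunner, can evaluate unbounded sources such as a Pub/Sub read.
dataflowOptions.setRunner(InProcessPipelineRunner.class);
Pipeline pipeline = Pipeline.create(dataflowOptions);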

Is there any way to set numWorkers dynamically in the middle of dataflow job running?

I am using Google Dataflow at work.
While using Dataflow, I need to set the number of workers dynamically while a batch job is running.
That's mainly because of Cloud Bigtable QPS.
We are using 3 Bigtable cluster nodes, and they can't handle receiving all the traffic from 500 workers at once.
So I need to change the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to Bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove the random key). You get, again, a PCollection<Something>.
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDos following it will likely get fused together with the write to Bigtable and will thus have a parallelism of 25.
The caveat is that Dataflow is within its rights to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on number of workers, or use a larger Bigtable cluster, or both.
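A minimal sketch of the keying trick described above, assuming the Dataflow Java SDK; Something, data, and writeToBigtable stand in for the actual element type, input collection, and write transform:

PCollection<Something> limited = data
    // Pair each element with a random key in [0, 25) to cap downstream parallelism.
    .apply(ParDo.of(new DoFn<Something, KV<Integer, Something>>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(KV.of(ThreadLocalRandom.current().nextInt(25), c.element()));
      }
    }))
    // After the GroupByKey there is no parallelism within a single key, so at most
    // 25 groups can be processed at once.
    .apply(GroupByKey.<Integer, Something>create())
    // Ungroup and drop the key, recovering a PCollection<Something>.
    .apply(ParDo.of(new DoFn<KV<Integer, Iterable<Something>>, Something>() {
      @Override
      public void processElement(ProcessContext c) {
        for (Something value : c.element().getValue()) {
          c.output(value);
        }
      }
    }));

limited.apply(writeToBigtable);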
There's a lot of relevant information in the "DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP" talk from GCP/Next.
FWIW, you can increase the number of Bigtable nodes before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can scale the Bigtable cluster back down when you're done with the batch job.

Any easier way to flush aggregator to GCS at the end of google dataflow pipeline

I am using an Aggregator to log some runtime stats of a Dataflow job, and I want to flush them to either GCS or BQ when the pipeline completes (or when each transform completes).
Currently I am doing it by, in addition to using the Aggregator, also creating a side output with a TupleTag and flushing that side output PCollection.
However, I am wondering whether there might be any other handy way to flush the aggregators themselves directly?
Your method of using a side output PCollection should produce semantically equivalent results to using an Aggregator. (For example, neither Aggregators nor side outputs will include duplicate values when a bundle fails and has to be retried.) The main difference is that partial results for Aggregators are available during pipeline execution in the monitoring UI and programmatically.
Within Java, you can use PipelineResult.getAggregatorValues(). If you get the PipelineResult from the (non-blocking) DataflowPipelineRunner, that will let you query aggregators as the job runs. If you use the BlockingDataflowPipelineRunner, Pipeline.run() blocks and you won't get the PipelineResult until after the job completes.
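For example, a minimal sketch of reading an aggregator's values after running the job, assuming Dataflow SDK 1.x; pipeline and myAggregator are placeholders for your own pipeline and Aggregator<Long, Long> reference:

PipelineResult result = pipeline.run();
try {
  // Values are reported per step, keyed by the name of the step that updated the aggregator.
  AggregatorValues<Long> values = result.getAggregatorValues(myAggregator);
  for (Map.Entry<String, Long> entry : values.getValuesAtSteps().entrySet()) {
    System.out.println(entry.getKey() + " = " + entry.getValue());
  }
} catch (AggregatorRetrievalException e) {
  throw new RuntimeException("Could not retrieve aggregator values", e);
}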
There's also command-line support: gcloud alpha dataflow metrics tail JOB_ID
