dataflow pipeline design for real-time aggregation analysis - google-cloud-dataflow

I have a case as below:
1) Use Pub/Sub as input in Dataflow and load the streaming data to BigQuery.
2) Select the aggregated result from BigQuery and load it to Pub/Sub as output.
3) A client listens to Pub/Sub for display.
E.g. I have sales transactions and want to see the regional (aggregated) sales figures in real time. I know that I can use 2 pipelines: one Dataflow pipeline to load the data into BigQuery (1), and another Dataflow pipeline to get the aggregated result and push it to Pub/Sub (2).
Is there any way to do this in a single pipeline? I don't want to build an orchestration layer (i.e. after the 1st pipeline finishes, call the 2nd pipeline), and initializing a pipeline is costly.
Thanks.

I think this can be done with a single Dataflow pipeline, with Pub/Sub as the input and both BigQuery and Pub/Sub as sinks.
Basically (a minimal code sketch follows below):
1. PubsubIO -> PCollection A.
2. A -> BigQueryIO
3. A -> Window.into(...) -> PCollection B.
4. B -> GroupBy(...) -> ParDo -> C
5. C -> PubsubIO
https://beam.apache.org/get-started/mobile-gaming-example/
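A minimal sketch of that shape in the Apache Beam Java SDK is below. The topic names, the table spec, and the ParseSaleFn / KeyByRegionFn / ToJsonFn DoFns are hypothetical placeholders for your own parsing and formatting logic, and the BigQuery table is assumed to already exist:

Pipeline p = Pipeline.create(options);

PCollection<TableRow> sales = p
    .apply("ReadFromPubsub", PubsubIO.readStrings().fromTopic("projects/my-project/topics/sales"))
    .apply("ParseSale", ParDo.of(new ParseSaleFn()));  // String -> TableRow (hypothetical)

// Branch 1: stream the raw transactions into BigQuery.
sales.apply("WriteRawToBigQuery", BigQueryIO.writeTableRows()
    .to("my-project:sales.transactions")
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));  // table assumed to exist

// Branch 2: window, aggregate per region, and publish the aggregates back to Pub/Sub.
sales
    .apply("Window", Window.<TableRow>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("KeyByRegion", ParDo.of(new KeyByRegionFn()))  // TableRow -> KV<String, Double> (hypothetical)
    .apply("SumPerRegion", Sum.doublesPerKey())
    .apply("FormatAsJson", ParDo.of(new ToJsonFn()))      // KV<String, Double> -> String (hypothetical)
    .apply("WritePerRegionToPubsub", PubsubIO.writeStrings().to("projects/my-project/topics/regional-sales"));

p.run();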

If you are loading/streaming the raw transactions into BigQuery, you may also consider using BigQuery itself to build the real-time aggregates in a cost-effective way, querying over the semi-unbounded stream of rows.

Related

Update BigTable row in Apache Beam (Scio)

I have the following use case:
There is a PubSub topic with data I want to aggregate using Scio and then save those aggregates into BigTable.
In my pipeline there is a CountByKey aggregation. What I would like to do is to be able to increment a value in BigTable for a given key, preferably using ReadModifyWrite. In the scio-examples there are only updates related to setting column values, but there are none using an atomic increment.
I understand that I need to create a Mutation in order to perform any operation on BigTable, like this:
Mutations.newSetCell(
    FAMILY_NAME, COLUMN_QUALIFIER, ByteString.copyFromUtf8(value.toString), 0L)
How to create UPDATE mutation from Scio / Apache Beam transform to atomically update row in BigTable?
I don't see anything like ReadModifyWrite under the com.google.bigtable.v2.Mutation API, so I'm not sure if this is possible. I see 2 possible workarounds:
publish the events to a Pubsub topic, consume them from a backend service and increment there
use a raw Bigtable client in a custom DoFn, see BigtableDoFn for inspiration (a rough sketch of this approach follows below).
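Here is a rough Java sketch of the second workaround (the same shape applies from Scio): a DoFn that uses the raw Cloud Bigtable data client to perform an atomic increment via ReadModifyWriteRow, which the client exposes separately from the Mutation API. The project, instance, table, and column names are placeholders, and the increment assumes the target cell holds a 64-bit big-endian counter:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.ReadModifyWriteRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class IncrementCounterFn extends DoFn<KV<String, Long>, Void> {
  private transient BigtableDataClient client;

  @Setup
  public void setup() throws Exception {
    // One client per DoFn instance, reused across bundles.
    client = BigtableDataClient.create("my-project", "my-instance");
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Long> counted = c.element();
    // Atomically adds the counted value to the current cell value for this row key.
    client.readModifyWriteRow(
        ReadModifyWriteRow.create("my-table", counted.getKey())
            .increment("counters", "count", counted.getValue()));
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}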

How to navigate a tree using Apache Beam model

I have a pipeline that starts receiving a list of categories IDs.
In a ParDo I execute a DoFn that calls a REST API using those IDs as parameters and returns a PCollection of Category objects.
.apply("Read Category", ParDo.of(new DoFn<String, Category>() { ... }));
In a second ParDo I persist these Category objects, read their children attribute and return their children's IDs.
.apply("Persist Category", ParDo.of(new DoFn<Category, String>() { ... }));
I would like to repeat the first ParDo over the list of IDs returned by the second ParDo until there are no child categories left.
How can I perform this with the Apache Beam model while benefiting from parallel processing?
Apache Beam currently does not provide any primitives for iterative parallel processing. There are some workarounds you can employ, e.g. some of them are listed in this answer.
Another alternative is to write a simple Java function that will traverse the tree for a specific top-level ID (recursively fetching categories and children starting from a given ID), and use ParDo to apply that function in parallel - but, obviously, there will be no distributed parallelism within that function.
You could also partially "unroll" the iteration in the pipeline first, to get a bunch of distributed parallelism across the first few levels of the tree - e.g. build a pipeline with a couple of repetitions of the first and second ParDo in sequence, and then apply a third ParDo that applies the iterative Java function to traverse the remaining levels (a sketch of this follows below).
Note that, if you are executing on Dataflow or any other runner that supports the fusion optimization, most likely you'll need to use one of the tricks for preventing fusion.
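For illustration, a rough sketch of that unrolling, assuming hypothetical helpers: ReadCategoryFn (ID -> Category via the REST call), PersistCategoryFn (persists a Category and emits its child IDs), and TraverseRemainingFn (a DoFn that recursively walks the rest of one subtree in plain Java):

PCollection<String> topLevelIds = pipeline.apply(Create.of(initialCategoryIds));

// Levels 1 and 2: fully distributed across workers.
PCollection<String> level2Ids = topLevelIds
    .apply("Read Category L1", ParDo.of(new ReadCategoryFn()))
    .apply("Persist Category L1", ParDo.of(new PersistCategoryFn()));

PCollection<String> level3Ids = level2Ids
    .apply("Read Category L2", ParDo.of(new ReadCategoryFn()))
    .apply("Persist Category L2", ParDo.of(new PersistCategoryFn()));

// Remaining levels: each element's subtree is traversed recursively inside a single
// DoFn call, so there is parallelism across subtrees but not within one subtree.
level3Ids.apply("Traverse Remaining Levels", ParDo.of(new TraverseRemainingFn()));

If you run this on Dataflow, keep the fusion note above in mind, e.g. insert a Reshuffle between levels so that they are not fused into a single step.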

get arbitrary first N value of PCollection in google dataflow

I am wondering whether Google Dataflow can do something that is equivalent to the following SQL:
SELECT * FROM A INNER JOIN B ON A.a = B.b LIMIT 1000
I know that Dataflow has a very standard programming paradigm to do joins. However, the part I am interested in is the LIMIT 1000, since I don't need all of the joined results but only any 1000 of them. I am wondering whether I can exploit this to speed up my job (assuming the join is between very large tables and would produce a very large result on a full join).
So I assume that a very naive way to achieve the above SQL result is some template code as follows:
PCollection A = ...;
PCollection B = ...;
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<..., CoGbkResult>, ...>() { ... }))
    .apply(Sample.any(1000));
However, my concern is how this Sample transform hooked up after the ParDo is handled internally by Dataflow. Will Dataflow be able to optimize it so that it stops processing the join once it knows it will definitely have enough output? Or is there simply no optimization for this use case, so that Dataflow will just compute the full join result and then select 1000 rows from it? (In that case the Sample transform would only be overhead.)
Or, long question short: is it possible for me to exploit this use case to do a partial join in Dataflow?
EDIT:
Or, essentially, I am wondering whether the Sample.any() transform is able to hint any optimization to the upstream PCollection. For example, if I do
pipeline.apply(TextIO.Read.from("gs://path/to/my/file*"))
    .apply(Sample.any(N));
will Dataflow first load all the data in and then select N, or will it be able to take advantage of Sample.any(), do some optimization, and prune out some of the useless reads?
Currently neither Cloud Dataflow nor any of the other Apache Beam runners (as far as I'm aware) implements such an optimization.

Writing results of google dataflow pipeline into multiple sinks

I would like to write the Google Dataflow pipeline results into multiple sinks.
For example, I want to write the result using TextIO into Google Cloud Storage, as well as write the results as a table in BigQuery. How can I do that?
The structure of a Cloud Dataflow pipeline is a DAG (directed acyclic graph), and you are allowed to apply multiple transforms to the same PCollection - write transforms are no exception. You can apply multiple write transforms to the PCollection of your results, for example:
PCollection<Foo> results = p.apply(TextIO.Read.named("ReadFromGCS").from("gs://..."))
    .apply(...the rest of your pipeline...);
results.apply(TextIO.Write.named("WriteToGCS").to("gs://..."));
results.apply(BigQueryIO.Write.named("WriteToBigQuery").to(...)...);

Dataflow error - "Sources are too large. Limit is 5.00Ti"

We have a pipeline that looks like:
BigQuery -> ParDo -> BigQuery
The table has ~2B rows, and is just under 1TB.
After running for just over 8 hours, the job failed with the following error:
May 19, 2015, 10:09:15 PM
S09: (f5a951d84007ef89): Workflow failed. Causes: (f5a951d84007e064): BigQuery job "dataflow_job_17701769799585490748" in project "gdfp-xxxx" finished with error(s): job error: Sources are too large. Limit is 5.00Ti., error: Sources are too large. Limit is 5.00Ti.
Job id is: 2015-05-18_21_04_28-9907828662358367047
It's a big table, but it's not that big, and Dataflow should easily be able to handle it. Why can't it handle this use case?
Also, even though the job failed, it still shows it as successful on the diagram. Why?
I think that error means the data you are trying to write to BigQuery exceeds the 5TB limit set by BigQuery for a single import job.
One way to work around this limit might be to split your BigQuery writes into multiple jobs by having multiple Write transforms, so that no single Write transform receives more than 5TB.
Before your write transform, you could have a DoFn with N outputs. For each record, randomly assign it to one of the outputs. Each of the N outputs can then have its own BigQuery.Write transform. The write transforms can all append data to the same table, so that all of the data ends up in the same table (a sketch of this follows below).
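For illustration, here is a rough sketch of that sharding idea written against the current Apache Beam Java SDK, using the Partition transform in place of a hand-written N-output DoFn. The shard count, the table name, and the rows PCollection of TableRows are placeholders, and the destination table is assumed to already exist:

int numShards = 4;  // pick N so that each shard stays well under the import size limit

PCollectionList<TableRow> shards = rows.apply("ShardRandomly",
    Partition.of(numShards, new Partition.PartitionFn<TableRow>() {
      @Override
      public int partitionFor(TableRow row, int numPartitions) {
        // Randomly spread rows across the N outputs.
        return ThreadLocalRandom.current().nextInt(numPartitions);
      }
    }));

// Each shard gets its own write; all of them append to the same destination table.
for (int i = 0; i < numShards; i++) {
  shards.get(i).apply("WriteShard" + i, BigQueryIO.writeTableRows()
      .to("my-project:my_dataset.my_table")
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
}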
