I would like to write the results of a Google Dataflow pipeline into multiple sinks.
For example, I want to write the results into Google Cloud Storage using TextIO as well as write them as a table in BigQuery. How can I do that?
The structure of a Cloud Dataflow pipeline is a DAG (directed acyclic graph), and you are allowed to apply multiple transforms to the same PCollection - write transforms are no exception. You can apply multiple write transforms to the PCollection of your results, for example:
PCollection<Foo> results = p.apply(TextIO.Read.named("ReadFromGCS").from("gs://..."))
.apply(...the rest of your pipeline...);
results.apply(TextIO.Write.named("WriteToGCS").to("gs://..."));
results.apply(BigQueryIO.Write.named("WriteToBigQuery").to(...)...);
Related
I have the following use case:
There is a PubSub topic with data I want to aggregate using Scio and then save those aggregates into BigTable.
In my pipeline there is a CountByKey aggregation. What I would like to do is to be able to increment the value in BigTable for a given key, preferably using ReadModifyWrite. The scio-examples only contain updates that set column values; there are none that use an atomic increment.
I understand that I need to create a Mutation in order to perform any operation on BigTable, like this:
Mutations.newSetCell(
FAMILY_NAME, COLUMN_QUALIFIER, ByteString.copyFromUtf8(value.toString), 0L)
How to create UPDATE mutation from Scio / Apache Beam transform to atomically update row in BigTable?
I don't see anything like ReadModifyWrite under the com.google.bigtable.v2.Mutation API, so I'm not sure this is possible. I see 2 possible workarounds:
publish the events to a Pub/Sub topic, consume it from a backend service and do the increment there
use a raw Bigtable client in a custom DoFn, see BigtableDoFn for inspiration (a rough sketch follows below).
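For the second workaround, here is a rough Java sketch of what such a DoFn could look like, using the Cloud Bigtable data client's ReadModifyWriteRow, which performs the increment atomically on the server side. The class name, project/instance/table/column parameters and the (rowKey, delta) element shape are illustrative assumptions, not taken from scio-examples; from Scio you could apply it to the underlying Java PCollection with a plain ParDo.
import java.io.IOException;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.ReadModifyWriteRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Increments a Bigtable counter cell by the given delta for each (rowKey, delta) element.
// Note: ReadModifyWriteRow.increment expects the existing cell value to be a 64-bit
// big-endian integer (or the cell to be absent).
public class IncrementCounterFn extends DoFn<KV<String, Long>, Void> {

  private final String projectId;
  private final String instanceId;
  private final String tableId;
  private final String familyName;
  private final String columnQualifier;
  private transient BigtableDataClient client;

  public IncrementCounterFn(String projectId, String instanceId, String tableId,
                            String familyName, String columnQualifier) {
    this.projectId = projectId;
    this.instanceId = instanceId;
    this.tableId = tableId;
    this.familyName = familyName;
    this.columnQualifier = columnQualifier;
  }

  @Setup
  public void setup() throws IOException {
    // One client per DoFn instance; it is reusable across bundles.
    client = BigtableDataClient.create(projectId, instanceId);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Long> element = c.element();
    // The read-increment-write happens atomically on the Bigtable server.
    client.readModifyWriteRow(
        ReadModifyWriteRow.create(tableId, element.getKey())
            .increment(familyName, columnQualifier, element.getValue()));
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}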
I have a case as below:
1) use Pub/Sub as input in Dataflow and load the streaming data into BigQuery
2) select the aggregated result from BigQuery and publish it to Pub/Sub as output
3) a client that listens to Pub/Sub for display
E.g. I have sales transactions and want to see regional (aggregated) sales figures in real time. I know that I can use 2 pipelines: one to load the data into BigQuery (1), and another Dataflow pipeline to compute the aggregated result and push it to Pub/Sub.
Is there any way to do this in a single pipeline? I don't want to build an orchestration layer (i.e. call the 2nd pipeline after the 1st pipeline finishes), and initializing a pipeline is costly.
Thanks.
I think this can be done with a single Dataflow pipeline with Pub/Sub as input and BigQuery and Pub/Sub as sinks.
Basically:
1. PubsubIO -> PCollection A.
2. A -> BigQueryIO
3. A -> Window.into(...) -> PCollection B.
4. B -> GroupBy(...) -> ParDo -> C
5. C -> PubsubIO
https://beam.apache.org/get-started/mobile-gaming-example/
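A minimal Java sketch along the lines of steps 1-5 above (the topic names, the "region,amount" message format, and the BigQuery table/schema are assumptions for illustration):
import java.util.Arrays;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class RegionalSalesPipeline {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // 1. PubsubIO -> PCollection A (raw "region,amount" messages).
    PCollection<String> transactions =
        p.apply("ReadTransactions",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/sales"));

    // 2. A -> BigQueryIO: stream the raw transactions into BigQuery.
    transactions
        .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
            .via(line -> new TableRow()
                .set("region", line.split(",")[0])
                .set("amount", Double.parseDouble(line.split(",")[1]))))
        .apply("WriteRawToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:sales.transactions")
            .withSchema(new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("region").setType("STRING"),
                new TableFieldSchema().setName("amount").setType("FLOAT"))))
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // 3-4. A -> Window.into(...) -> aggregate -> PCollection C (regional totals per window).
    PCollection<KV<String, Double>> regionalTotals =
        transactions
            .apply("KeyByRegion", MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.doubles()))
                .via(line -> KV.of(line.split(",")[0], Double.parseDouble(line.split(",")[1]))))
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("SumPerRegion", Sum.doublesPerKey());

    // 5. C -> PubsubIO: publish the aggregates for the display client.
    regionalTotals
        .apply("FormatAggregates", MapElements.into(TypeDescriptors.strings())
            .via(kv -> kv.getKey() + "," + kv.getValue()))
        .apply("WriteAggregates",
            PubsubIO.writeStrings().to("projects/my-project/topics/regional-sales"));

    p.run();
  }
}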
In case you are loading/streaming the raw transactions into BigQuery, you may also consider using BigQuery itself to build the real-time aggregates in a cost-effective way, treating the table as a semi-unbounded stream.
I have a pipeline that starts by receiving a list of category IDs.
In a ParDo I execute a DoFn that calls a REST API using those IDs as parameters and returns a PCollection of Category objects.
.apply("Read Category", ParDo.of(new DoFn<String, Category>(){ ... }));
In a second ParDo I persist these Category objects, read their children attribute, and return the children's IDs.
.apply("Persist Category", ParDo.of(new DoFn<Category, String>(){ ... }));
I would like to repeat the first ParDo over the list of IDs returned by the second ParDo until there are no child categories left.
How can I perform this with the Apache Beam model while still benefiting from parallel processing?
Apache Beam currently does not provide any primitives for iterative parallel processing. There are some workarounds you can employ, e.g. some of them are listed in this answer.
Another alternative is to write a simple Java function that will traverse the tree for a specific top-level ID (recursively fetching categories and children starting from a given ID), and use ParDo to apply that function in parallel - but, obviously, there will be no distributed parallelism within that function.
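For illustration, here is a rough sketch of such a traversal function wrapped in a DoFn (Category comes from the question; fetchCategory() and getChildrenIds() are hypothetical placeholders for the REST call and the children attribute):
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.beam.sdk.transforms.DoFn;

// Traverses the whole category tree rooted at one top-level ID and emits every Category found.
// The traversal for a single root runs sequentially on one worker; parallelism comes from
// having many root IDs in the input PCollection.
public class TraverseCategoryTreeFn extends DoFn<String, Category> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    Deque<String> pending = new ArrayDeque<>();
    pending.push(c.element());
    while (!pending.isEmpty()) {
      Category category = fetchCategory(pending.pop()); // hypothetical REST API call
      c.output(category);
      for (String childId : category.getChildrenIds()) { // hypothetical accessor
        pending.push(childId);
      }
    }
  }

  private Category fetchCategory(String id) {
    // Placeholder for the REST call described in the question.
    throw new UnsupportedOperationException("call the category REST API here");
  }
}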
You could also partially "unroll" the iteration in the pipeline first, to get a bunch of distributed parallelism across the first few levels of the tree - e.g. build a pipeline with a couple of repetitions of the first and second ParDo in sequence, and then apply a third ParDo that applies the iterative Java function to traverse the remaining levels.
Note that, if you are executing on Dataflow or any other runner that supports the fusion optimization, most likely you'll need to use one of the tricks for preventing fusion.
I'm looking for a way to scan a huge Google Bigtable table with a filter composed dynamically based on incoming events, and to make bulk updates/deletes on a huge number of rows.
At the moment, I'm trying to combine Bigtable with a Java-based Dataflow pipeline (for intensive serverless compute power). I have reached the point where I can compose a "Scan" object with a dynamic filter based on the events, but I still can't find a way to stream the results from CloudBigtableIO.read() into the subsequent Dataflow pipeline.
Appreciate any advice.
Extend your DoFn from AbstractCloudBigtableTableDoFn. That will give you access to a getConnection() method. You'll do something like this:
try (Connection c = getConnection();
     Table t = c.getTable(YOUR_TABLE_NAME);
     ResultScanner resultScanner = t.getScanner(YOUR_SCAN)) {
  for (Result r : resultScanner) {
    Mutation m = ... // construct a Put or Delete from r
    context.output(m);
  }
}
I'm assuming that your pipeline starts with CloudBigtableIO.read(), has the AbstractCloudBigtableTableDoFn next, and then has a CloudBigtableIO.write().
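For completeness, here is a hedged sketch of how the full DoFn might look with the bigtable-hbase Beam connector, assuming its AbstractCloudBigtableTableDoFn takes a CloudBigtableConfiguration in the version you use; the row-key-prefix input and the delete-every-matching-row logic are illustrative assumptions, not part of the answer above.
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.beam.AbstractCloudBigtableTableDoFn;
import com.google.cloud.bigtable.beam.CloudBigtableConfiguration;

// For each incoming event (here: a row-key prefix), scans the table and emits a Delete
// for every matching row; the emitted Mutations can then go to the CloudBigtableIO write transform.
public class ScanAndDeleteFn extends AbstractCloudBigtableTableDoFn<String, Mutation> {

  private final String tableId;

  public ScanAndDeleteFn(CloudBigtableConfiguration config, String tableId) {
    super(config);
    this.tableId = tableId;
  }

  @ProcessElement
  public void processElement(ProcessContext context) throws Exception {
    // Compose the Scan dynamically from the event, as described in the question.
    Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes(context.element()));
    try (Connection connection = getConnection();
         Table table = connection.getTable(TableName.valueOf(tableId));
         ResultScanner scanner = table.getScanner(scan)) {
      for (Result row : scanner) {
        context.output(new Delete(row.getRow()));
      }
    }
  }
}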
I am wondering whether Google Dataflow can do something that is equivalent to SQL like
SELECT * FROM A INNER JOIN B ON A.a = B.b **LIMIT 1000**
I know that Dataflow has a very standard programming paradigm to do joins. However, the part I am interested in is this LIMIT 1000. I don't need all of the joined results, only any 1000 of them, so I am wondering whether I can exploit this to speed up my job (assuming the join is between very expensive tables and would produce a very large result on a full join).
So I assume that a very naive way to achieve the above SQL result is some template code as follows:
PCollection A = ...
PCollection B = ...
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<..., CoGbkResult>, ...>() {
      ...
    }))
    .apply(Sample.any(1000));
However, my concern is how this Sample transform, hooked up after the ParDo, is handled internally by Dataflow. Will Dataflow be able to optimize in such a way that it stops processing the join once it knows it will definitely have enough output? Or is there simply no optimization for this use case, so that Dataflow will just compute the full join result and then select 1000 rows from it? (In that case the Sample transform would only add overhead.)
Or, to put a long question short: is it possible for me to use this to do a partial join in Dataflow?
EDIT:
Or, essentially, I am wondering whether the Sample.any() transform is able to hint any optimization to the upstream PCollection. For example, if I do
pipeline.apply(TextIO.Read.from("gs://path/to/my/file*"))
.apply(Sample.any(N))
Will Dataflow first load all the data and then select N, or will it be able to take advantage of Sample.any(), do some optimization, and prune out the unnecessary reads?
Currently, neither Cloud Dataflow nor any of the other Apache Beam runners (as far as I'm aware) implements such an optimization.