Update BigTable row in Apache Beam (Scio) - google-cloud-dataflow

I have the following use case:
There is a PubSub topic with data I want to aggregate using Scio and then save those aggregates into BigTable.
In my pipeline there is a CountByKey aggregation. What I would like to do is to be able to increment a value in BigTable for a given key, preferably using ReadModifyWrite. In the scio-examples there are only updates related to setting column values, but none showing an atomic increment.
I understand that I need to create a Mutation in order to perform any operation on BigTable, like this:
Mutations.newSetCell(
FAMILY_NAME, COLUMN_QUALIFIER, ByteString.copyFromUtf8(value.toString), 0L)
How do I create an UPDATE mutation from a Scio / Apache Beam transform to atomically update a row in BigTable?

I don't see anything like ReadModifyWrite under the com.google.bigtable.v2.Mutation API, so I'm not sure this is possible. I see two possible workarounds:
publish the events to another Pub/Sub topic, consume them from a backend service and increment there
use a raw Bigtable client in a custom DoFn, see BigtableDoFn for inspiration; a sketch of this approach is below.
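For the second workaround, here is a minimal Java sketch of a custom DoFn that calls the Cloud Bigtable data client's ReadModifyWriteRow directly. The project, instance, table, family and qualifier names are placeholders, and error handling/retries are omitted:

import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.ReadModifyWriteRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical IncrementFn: consumes (rowKey, delta) pairs produced by the aggregation
// and applies a server-side atomic increment per element.
public class IncrementFn extends DoFn<KV<String, Long>, Void> {

  private transient BigtableDataClient client;

  @Setup
  public void setup() throws Exception {
    // Placeholder project/instance ids; one client per DoFn instance.
    client = BigtableDataClient.create("my-project", "my-instance");
  }

  @ProcessElement
  public void processElement(@Element KV<String, Long> element) {
    // ReadModifyWriteRow performs the read-increment-write cycle atomically on the server,
    // so concurrent workers incrementing the same key do not clobber each other.
    client.readModifyWriteRow(
        ReadModifyWriteRow.create("my-table", element.getKey())
            .increment("FAMILY_NAME", "COLUMN_QUALIFIER", element.getValue()));
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}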

Related

Compare two PCollections for removal

Every day the latest data is available in a CloudSQL table. While writing that data into another CloudSQL table, I need to compare it with the existing data and perform actions such as removing deleted records, updating existing records, and inserting new ones.
Could you please suggest the best way to handle this scenario using a Dataflow pipeline (preferably Java)?
One thing I identified is that using the upsert function in CloudSQL, we could insert/update the records with the help of jdbc.JdbcIO. But I do not know how to identify the collection for removal.
You could read the old and new tables and do a Join followed by a DoFn that compares the two and only outputs changed elements, which can then be written wherever you like.
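A minimal Java sketch of that join-and-compare approach using CoGroupByKey; the oldRows/newRows collections keyed by primary key and the String value type are assumptions for illustration:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Assumed inputs: oldRows and newRows are PCollection<KV<String, String>> keyed by primary key,
// e.g. produced by JdbcIO.read() against the existing table and the incoming table.
final TupleTag<String> oldTag = new TupleTag<String>() {};
final TupleTag<String> newTag = new TupleTag<String>() {};

PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(oldTag, oldRows)
        .and(newTag, newRows)
        .apply(CoGroupByKey.create());

// Keys that exist only in the old table are the candidates for deletion;
// the same DoFn could also emit updated or new rows to additional outputs.
PCollection<String> keysToDelete =
    joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        CoGbkResult grouped = c.element().getValue();
        boolean presentInNew = grouped.getAll(newTag).iterator().hasNext();
        if (!presentInNew) {
          c.output(c.element().getKey());
        }
      }
    }));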

Dynamic query usage while streaming with Google Dataflow?

I have a dataflow pipeline that is set up to receive information (JSON), transform it into a DTO, and then insert it into my database. This works great for inserts, but where I am running into issues is with handling delete records. The information I am receiving has a deleted tag in the JSON to specify when a record is actually being deleted. After some research/experimenting, I am at a loss as to whether this is possible.
My question: Is there a way to dynamically choose(or change) what sql statement the pipeline is using, while streaming?
To achieve this with Dataflow you need to think more in terms of water flowing through pipes than in terms of if-then-else coding.
You need to classify your records into INSERTs and DELETEs and route each set to a different sink that will do what you tell them to. You can use tags for that.
In this pipeline design example, instead of startsWithATag and startsWithBTag you can use tags for Insert and Delete.
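A minimal Java sketch of that routing, assuming a hypothetical RecordDto parsed from the JSON that exposes an isDeleted() flag:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Assumed input: records is a PCollection<RecordDto> built from the incoming JSON.
final TupleTag<RecordDto> insertTag = new TupleTag<RecordDto>() {};
final TupleTag<RecordDto> deleteTag = new TupleTag<RecordDto>() {};

PCollectionTuple tagged = records.apply(
    ParDo.of(new DoFn<RecordDto, RecordDto>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().isDeleted()) {
          c.output(deleteTag, c.element()); // additional output: DELETEs
        } else {
          c.output(c.element());            // main output: INSERTs
        }
      }
    }).withOutputTags(insertTag, TupleTagList.of(deleteTag)));

// tagged.get(insertTag) can then go to a JdbcIO.write() configured with an INSERT statement,
// and tagged.get(deleteTag) to a separate JdbcIO.write() configured with a DELETE statement.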

Reading bulk data from a database using Apache Beam

I would like to know, how JdbcIO would execute a query in parallel if my query returns millions of rows.
I have referred https://issues.apache.org/jira/browse/BEAM-2803 and the related pull requests. I couldn't understand it completely.
ReadAll expand method uses a ParDo. Hence would it create multiple connections to the database to read the data in parallel? If I restrict the number of connections that can be created to a DB in the datasource, will it stick to the connection limit?
Can anyone please help me understand how this is handled in JdbcIO? I am using Beam 2.2.0.
Update:
.apply(
    ParDo.of(
        new ReadFn<>(
            getDataSourceConfiguration(),
            getQuery(),
            getParameterSetter(),
            getRowMapper())))
The above code shows that ReadFn is applied with a ParDo. I think the ReadFn will run in parallel. If my assumption is correct, how would I use the readAll() method to read from a DB where I can establish only a limited number of connections at a time?
Thanks
Balu
The ReadAll method handles the case where you have many queries. You can store the queries as a PCollection of strings where each string is a query. Then when reading, each item is processed as a separate query in a single ParDo.
This does not work well for a small number of queries because it limits parallelism to the number of queries, but if you have many, it will perform much faster. This is the case for most of the ReadAll calls.
From the code it looks like a connection is made per worker in the setup function. This might include several queries depending on the number of workers and number of queries.
Where is the query limit set? It should behave similarly with or without ReadAll.
See the jira for more information: https://issues.apache.org/jira/browse/BEAM-2706
I am not very familiar with JdbcIO, but it seems they implemented the version suggested in the JIRA, where the PCollection can be of anything and a callback modifies the query depending on the element in the PCollection. This allows each item in the PCollection to represent a query, but is a bit more flexible than having a new query as each element.
I created a Datasource, as follows.
ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass("com.mysql.jdbc.Driver"); // loads the jdbc driver
cpds.setJdbcUrl("jdbc:mysql://<IP>:3306/employees");
cpds.setUser("root");
cpds.setPassword("root");
cpds.setMaxPoolSize(5); // caps the pool at 5 connections
There is a better way to set this driver now.
I set the database pool size to 5. While doing the JdbcIO transform, I used this DataSource to create connections.
In the pipeline, I set
option.setMaxNumWorkers(5);
option.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
I used a query which would return around 3 million records. While observing the DB connections, the number of connections gradually increased while the program was running, and it used at most 5 connections at certain points.
I think this is how we can limit the number of connections created to a DB while running a JdbcIO transform to load a bulk amount of data from a database; a sketch of wiring this DataSource into JdbcIO follows the Maven snippet below.
Maven dependency for ComboPooledDataSource
<dependency>
  <groupId>c3p0</groupId>
  <artifactId>c3p0</artifactId>
  <version>0.9.1.2</version>
</dependency>
*Please feel free to correct the answer if I missed something here.*
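For reference, a minimal sketch (assuming the pipeline object and the cpds pool built above; table and column names are placeholders) of plugging that pooled DataSource into JdbcIO.read():

import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Assumed: 'pipeline' is the Pipeline and 'cpds' is the ComboPooledDataSource shown above.
PCollection<KV<Integer, String>> rows = pipeline.apply(
    JdbcIO.<KV<Integer, String>>read()
        // Each worker builds its connection pool from this configuration, so the number
        // of connections per worker stays within the pool's maxPoolSize.
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(cpds))
        .withQuery("SELECT emp_no, first_name FROM employees")
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
        .withRowMapper(resultSet ->
            KV.of(resultSet.getInt("emp_no"), resultSet.getString("first_name"))));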
I had a similar task.
I got the count of records from the database and split it into ranges of 1000 records.
Then I applied readAll to the PCollection of ranges; a sketch of this approach is below.
Here is a description of the solution.
And thanks Balu regarding the datasource configuration.
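A minimal sketch of that range-splitting approach with JdbcIO.readAll(), assuming the total count was fetched up front (e.g. via SELECT COUNT(*)), the 'pipeline' and 'cpds' objects from above, and a database that understands LIMIT/OFFSET:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// One offset per range of 1000 records.
List<Long> offsets = new ArrayList<>();
for (long offset = 0; offset < totalCount; offset += 1000) {
  offsets.add(offset);
}

PCollection<KV<Integer, String>> rows = pipeline
    .apply(Create.of(offsets))
    .apply(JdbcIO.<Long, KV<Integer, String>>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(cpds))
        .withQuery("SELECT emp_no, first_name FROM employees LIMIT 1000 OFFSET ?")
        // Each element of the offsets PCollection becomes one parameterized query,
        // so the ranges are read in parallel across workers.
        .withParameterSetter((offset, statement) -> statement.setLong(1, offset))
        .withRowMapper(resultSet ->
            KV.of(resultSet.getInt("emp_no"), resultSet.getString("first_name")))
        .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())));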

bigtable bulk update based on dynamic filter

I'm looking for a way to scan a huge Google BigTable with a filter composed dynamically based on incoming events, and to make bulk updates/deletes on huge numbers of rows.
At the moment, I'm trying to combine BigTable with Java-based Dataflow (for intensive serverless compute power). I have reached the point where I can compose a "Scan" object with a dynamic filter based on the events, but I still can't find a way to stream results from CloudBigtableIO.read() into the subsequent Dataflow pipeline.
Appreciate any advice.
Extend your DoFn from AbstractCloudBigtableTableDoFn. That will give you access to a getConnection() method. You'll do something like this:
try (Connection c = getConnection();
     Table t = c.getTable(YOUR_TABLE_NAME);
     ResultScanner resultScanner = t.getScanner(YOUR_SCAN)) {
  for (Result r : resultScanner) {
    Mutation m = ... // construct a Put or Delete from the Result
    context.output(m);
  }
}
I'm assuming that your pipeline starts with CloudBigtableIO.read(), has the AbstractCloudBigtableTableDoFn next, and then has a CloudBigtableIO.write().

How do I write to BigQuery a schema computed during execution of the same Dataflow pipeline?

My scenario is a variation on the one discussed here:
How do I write to BigQuery using a schema computed during Dataflow execution?
In this case, the goal is the same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.
For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).
Is this possible? If so, what's the best approach?
My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.
If you use BigQueryIO.Write to create the table, then the schema needs to be known when the table is created.
Your proposed solution of not specifying the schema when you create the BigQueryIO.Write transform might work, but you might get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.
You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
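A minimal sketch of that suggestion using the Beam BigQueryIO API; the sample file path, table reference, and the 'rows' PCollection of TableRow are placeholders, and every column is naively typed as STRING:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Read only the CSV header in the main program to build the schema before launching the pipeline.
List<TableFieldSchema> fields = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new FileReader("/path/to/sample.csv"))) {
  for (String column : reader.readLine().split(",")) {
    fields.add(new TableFieldSchema().setName(column.trim()).setType("STRING"));
  }
}
TableSchema schema = new TableSchema().setFields(fields);

// The schema is now known before the pipeline runs, so the sink can create the table if needed.
rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));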
Alternatively, you could create a custom sink to write your data to BigQuery. Your sink could write the data to GCS, and its finalize method could then create a BigQuery load job. The custom sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema.
