How to apply ParDo to a PDone object - google-cloud-dataflow

My pipeline has the following 2 steps:
1. Extract data from BigQuery, then translate it and store it in Cloud Storage.
2. Extract data from Cloud Storage, then translate it and store it in S3.
After step 1 I get a PDone object, but it does not have an apply or read method.
Is there any way to attach the step 2 process after the step 1 process?

You can write the data extracted from BigQuery to several places at the same time; e.g., see this answer.
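For illustration, here is a minimal sketch of that branching idea with the Beam/Dataflow Java SDK; the transform names, table, bucket path, and the TranslateFn / TranslateForS3Fn / WriteToS3Fn DoFns are placeholders, not anything from the original pipeline:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline p = Pipeline.create(options);

    // Read from BigQuery once and translate it (TranslateFn is a hypothetical DoFn<TableRow, String>).
    PCollection<String> translated =
        p.apply("ReadFromBigQuery",
                BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))
         .apply("Translate", ParDo.of(new TranslateFn()));

    // Branch 1: write to Cloud Storage. The PDone returned here simply terminates this branch.
    translated.apply("WriteToGcs", TextIO.write().to("gs://my-bucket/step1/output"));

    // Branch 2: keep processing the same PCollection (no need to re-read it from Cloud Storage)
    // and push the result to S3 with a hypothetical writer DoFn.
    translated
        .apply("TranslateForS3", ParDo.of(new TranslateForS3Fn()))
        .apply("WriteToS3", ParDo.of(new WriteToS3Fn()));

    p.run();

The point is that a write returns a PDone only to terminate its branch; any further processing hangs off the PCollection that fed the write, not off the PDone.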

Related

Is there a way a task could iterate over measurements and aggregate data into another bucket?

I'm using a bucket for collecting tick data for multiple symbols on Binance (e.g. ETH/BTC and BNB/BTC) and storing them in different measurements (binance_ethbtc and binance_bnbbtc respectively), and that's working fine. Beyond that, I'd like to aggregate OHLC data into another bucket, just like this guy here. I've already managed to write Flux code for aggregating this data for a single measurement, but then it got me wondering: do I need to write a task for EVERY measurement I have? Isn't there a way of iterating over the measurements in a bucket and aggregating the data into another one?
Thanks to FixTestRepeat in the InfluxDB community, I've managed to do it (and iterating over measurements is not necessary). He showed me that if I remove the filter on the _measurement field, the query will yield as many series as there are measurements. More information here.

How to differentiate XML transactions in Liaison Delta and ECS

Can someone please help with the following scenario?
I have 2 XML messages (ORDERS and SHIPRTN) being placed into an SFTP location. Using ECS I am picking them up and translating them with Delta, but how do I differentiate between ORDERS and SHIPRTN and call the respective maps in Delta?
You have a few ways to do this. You can put a condition on the event rule: if the data contains one document type, call map "A"; if it contains the other, call map "B".
The other way is to set identity values in the model so that TPM can figure it out.

Write multiple files on GCS in Dataflow based on key-based grouping in the pipeline

I have a KV collection created by grouping, and the goal is to write each V to a different file (V is a list of Strings).
Referring to this question,
How to use FileIO.writeDynamic() in Apache Beam 2.6 to write to multiple output paths?,
I gather that FileIO.writeDynamic() will serve the purpose.
However, does it account for hot-key fanout in the grouping?
What is the best way to write to multiple files based on keys?
Also, the name of each file has to be the key, with its values written in it.
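Not an answer from the linked post, just a hedged sketch of what the FileIO.writeDynamic() approach could look like in Beam Java, assuming the grouped values can be emitted as one KV<String, String> per output line and that the gs:// path is a placeholder:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // One element per output line, keyed by the file it should land in.
    PCollection<KV<String, String>> keyedLines = ...;

    keyedLines.apply("WritePerKey",
        FileIO.<String, KV<String, String>>writeDynamic()
            .by(KV::getKey)                                    // destination = the key
            .withDestinationCoder(StringUtf8Coder.of())
            .via(Contextful.fn(KV::getValue), TextIO.sink())   // each value becomes one text line
            .to("gs://my-bucket/output/")                      // placeholder bucket
            .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt"))  // file name starts with the key
            .withNumShards(1));                                // exactly one file per key

Note that writeDynamic() does its own grouping by destination, so the explicit GroupByKey may not be needed at all; and hot-key fanout (withHotKeyFanout) is a Combine feature rather than something GroupByKey or FileIO expose.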

Dynamic query usage while streaming with Google Dataflow?

I have a Dataflow pipeline that is set up to receive information (JSON) and transform it into a DTO, which is then inserted into my database. This works great for inserts, but where I am running into issues is with handling delete records. In the information I am receiving, there is a deleted tag in the JSON to specify when a record is actually being deleted. After some research/experimenting, I am at a loss as to whether or not this is possible.
My question: is there a way to dynamically choose (or change) which SQL statement the pipeline is using while streaming?
To achieve this with Dataflow, you need to think more in terms of water flowing through pipes than in terms of if-then-else coding.
You need to classify your records into INSERTs and DELETEs and route each set to a different sink that will do what you tell it to. You can use tags for that.
In this pipeline design example, instead of startsWithATag and startsWithBTag, you can use tags for Insert and Delete.
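A minimal sketch of that tagging step in Beam Java, assuming a MyDto type with an isDeleted() accessor; the JdbcIO sinks, connection settings, and table/column names are placeholders:

    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    final TupleTag<MyDto> insertTag = new TupleTag<MyDto>() {};
    final TupleTag<MyDto> deleteTag = new TupleTag<MyDto>() {};

    PCollection<MyDto> dtos = ...;  // DTOs parsed from the incoming JSON

    // Route each record to the insert or delete branch based on the deleted tag.
    PCollectionTuple tagged = dtos.apply("SplitInsertsAndDeletes",
        ParDo.of(new DoFn<MyDto, MyDto>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            MyDto dto = c.element();
            if (dto.isDeleted()) {
              c.output(deleteTag, dto);   // delete branch
            } else {
              c.output(dto);              // main output = insert branch
            }
          }
        }).withOutputTags(insertTag, TupleTagList.of(deleteTag)));

    // Each branch gets its own sink, so each can run a different SQL statement.
    JdbcIO.DataSourceConfiguration config = JdbcIO.DataSourceConfiguration.create(
        "org.postgresql.Driver", "jdbc:postgresql://host:5432/mydb");  // placeholder connection

    tagged.get(insertTag).apply("WriteInserts",
        JdbcIO.<MyDto>write()
            .withDataSourceConfiguration(config)
            .withStatement("INSERT INTO my_table (id, payload) VALUES (?, ?)")
            .withPreparedStatementSetter((dto, ps) -> {
              ps.setLong(1, dto.getId());
              ps.setString(2, dto.getPayload());
            }));

    tagged.get(deleteTag).apply("RunDeletes",
        JdbcIO.<MyDto>write()
            .withDataSourceConfiguration(config)
            .withStatement("DELETE FROM my_table WHERE id = ?")
            .withPreparedStatementSetter((dto, ps) -> ps.setLong(1, dto.getId())));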

How do I write to BigQuery a schema computed during execution of the same Dataflow pipeline?

My scenario is a variation on the one discussed here:
How do I write to BigQuery using a schema computed during Dataflow execution?
In this case, the goal is the same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.
For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).
Is this possible? If so, what's the best approach?
My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.
If you use BigQueryIO.Write to create the table, then the schema needs to be known when the table is created.
Your proposed solution of not specifying the schema when you create the BigQueryIO.Write transform might work, but you might get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.
You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
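For the "read just enough in the main program" suggestion, a hedged sketch with the Beam Java SDK; the local CSV path, the table name, and typing every column as STRING are assumptions:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;
    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    // In the main program, read only the CSV header before constructing the pipeline.
    TableSchema schema;
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("/path/to/data.csv"))) {
      List<TableFieldSchema> fields = new ArrayList<>();
      for (String column : reader.readLine().split(",")) {
        fields.add(new TableFieldSchema().setName(column.trim()).setType("STRING"));
      }
      schema = new TableSchema().setFields(fields);
    }

    // Later in the pipeline: the schema is known at construction time,
    // so CREATE_IF_NEEDED can create the table.
    PCollection<TableRow> rows = ...;  // TableRows built from the CSV lines elsewhere in the pipeline
    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

This keeps the extra read down to a single header line instead of a second full pass over the file.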
Alternatively, you could create a custom sink to write your data to BigQuery. Your sink could write the data to GCS, and its finalize method could then create a BigQuery load job. The custom sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema.
