I have a pipeline that starts by receiving a list of category IDs.
In a ParDo I execute a DoFn that calls a REST API using those IDs as parameters and returns a PCollection of Category objects.
.apply("Read Category", ParDo.of(new DoFn<String, Category>() { ... }));
In a second ParDo I persist these Category objects, read their children attribute and return the children's IDs.
.apply("Persist Category", ParDo.of(new DoFn<Category, String>() { ... }));
I would like to repeat the first ParDo over the list of IDs returned by the second ParDo, until there are no child categories left.
How can I do this with the Apache Beam model while still benefiting from parallel processing?
Apache Beam currently does not provide any primitives for iterative parallel processing. There are some workarounds you can employ, some of which are listed in this answer.
Another alternative is to write a simple Java function that will traverse the tree for a specific top-level ID (recursively fetching categories and children starting from a given ID), and use ParDo to apply that function in parallel - but, obviously, there will be no distributed parallelism within that function.
You could also partially "unroll" the iteration in the pipeline first, to get a bunch of distributed parallelism across the first few levels of the tree - e.g. build a pipeline that chains a couple of repetitions of the first and second ParDo, and then apply a third ParDo that uses the iterative Java function to traverse the remaining levels (see the sketch below).
Note that, if you are executing on Dataflow or any other runner that supports the fusion optimization, most likely you'll need to use one of the tricks for preventing fusion.
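A minimal sketch of the partial-unrolling idea in the Beam Java SDK. ReadCategoryFn and PersistCategoryFn stand in for the question's two DoFns, topLevelIds is the starting PCollection<String> of IDs, and fetchCategory() / persist() are hypothetical helpers wrapping the REST call and the database write - none of these names are real library APIs.

import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Unroll the first two levels of the tree as regular pipeline steps.
PCollection<String> level1Ids = topLevelIds
    .apply("Read Category L0", ParDo.of(new ReadCategoryFn()))
    .apply("Persist Category L0", ParDo.of(new PersistCategoryFn()));

PCollection<String> level2Ids = level1Ids
    .apply("Read Category L1", ParDo.of(new ReadCategoryFn()))
    .apply("Persist Category L1", ParDo.of(new PersistCategoryFn()));

// Then traverse each remaining subtree sequentially inside a single DoFn.
// Different subtrees are still processed in parallel across workers.
level2Ids.apply("Traverse remaining levels", ParDo.of(new DoFn<String, Void>() {
  @ProcessElement
  public void processElement(ProcessContext ctx) {
    Deque<String> pending = new ArrayDeque<>();
    pending.push(ctx.element());
    while (!pending.isEmpty()) {
      Category category = fetchCategory(pending.pop()); // REST call (hypothetical helper)
      persist(category);                                // DB write (hypothetical helper)
      pending.addAll(category.getChildrenIds());
    }
  }
}));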
Related
I would like to run an asynchronous dask dataframe computation with dd.persist() and then be able to track the status of individual partitions. The goal is to get access to partial results in a non-blocking way.
Here is the desired pseudocode:
dd = dd.persist()
if dd.partitions[0].__dask_status__ == 'finished':
    # Partial non-blocking result access
    df = dd.partitions[0].compute()
Using dask futures works well, but submitting many individual partitions is very slow compared to a single dd.persist(), and having one future per partition breaks the dashboard's "groups" tab by showing too many blocks.
futures = list(map(client.compute, dd.partitions))
(Screenshot: broken dask dashboard "groups" tab)
The function you probably want is distributed.futures_of, which lists the running futures of a collection. You can either examine this list yourself, looking at the status of the futures, or use it with distributed.as_completed and a for-loop to process the partitions as they become available. The keys of the futures are like (collection-name, partition-index), so you know which partition each belongs to.
The reason dd.partitions[i] (or looping over these with list) doesn't work is that it creates a new graph for each partition, so you end up submitting much more to the scheduler than the single call to .persist().
I would like to join multiple streams on a common key and trigger a result either as soon as all of the streams have contributed at least one element or at the end of the window. CoGroupByKey seems to be the appropriate building block, but there does not seem to be a way to express the early trigger condition (count trigger applies per input collection)?
I believe CoGroupByKey is implemented as Flatten + GroupByKey under the hood. Once multiple streams are flattened into one, a data-driven trigger (or any other trigger) won't have enough control to achieve what you want.
Instead of using CoGroupByKey, you can use Flatten and a stateful DoFn that fills an object backed by State for each key. In this case, the stateful DoFn also gets to decide what to do when stream A has two elements but stream B has none yet.
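A minimal sketch of that stateful approach in the Beam Java SDK, assuming each element has already been keyed by the join key and tagged with the index of the stream it came from, encoded here as KV<joinKey, KV<streamIndex, payload>> - an assumption of this sketch, not something the model imposes:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class EmitWhenAllStreamsPresentFn
    extends DoFn<KV<String, KV<Integer, String>>, KV<String, List<KV<Integer, String>>>> {

  private final int numStreams;

  public EmitWhenAllStreamsPresentFn(int numStreams) {
    this.numStreams = numStreams;
  }

  // Per-key buffer of everything seen so far in the window.
  @StateId("buffer")
  private final StateSpec<BagState<KV<Integer, String>>> bufferSpec = StateSpecs.bag();

  // Bitmask recording which streams have contributed so far (assumes fewer than 32 streams).
  @StateId("seen")
  private final StateSpec<ValueState<Integer>> seenSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      ProcessContext ctx,
      @StateId("buffer") BagState<KV<Integer, String>> buffer,
      @StateId("seen") ValueState<Integer> seen) {
    KV<Integer, String> tagged = ctx.element().getValue();
    buffer.add(tagged);

    Integer previous = seen.read();
    int mask = (previous == null ? 0 : previous) | (1 << tagged.getKey());
    seen.write(mask);

    // Early result: every stream has contributed at least one element for this key.
    if (mask == (1 << numStreams) - 1) {
      List<KV<Integer, String>> all = new ArrayList<>();
      buffer.read().forEach(all::add);
      ctx.output(KV.of(ctx.element().getKey(), all));
    }
    // An event-time timer (not shown) could emit whatever is buffered at the
    // end of the window and then clear the state.
  }
}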
Another potential solution that comes to mind is a (stateless) DoFn that filters the CoGBK results to remove those that don't have at least one occurrence for each joined stream. For the end-of-window result (which does not have the same restriction), it would then be necessary to have a parallel CoGBK whose result does not go through the filter. I don't think there is a way to tag results with the trigger that emitted them?
I have a Dataflow pipeline that is set up to receive information (JSON), transform it into a DTO and then insert it into my database. This works great for inserts, but where I am running into issues is handling delete records. The information I receive includes a deleted tag in the JSON that specifies when a record is actually being deleted. After some research/experimenting, I am at a loss as to whether or not this is possible.
My question: Is there a way to dynamically choose(or change) what sql statement the pipeline is using, while streaming?
To achieve this with Dataflow you need to think more in terms of water flowing through pipes than in terms of if-then-else coding.
You need to classify your records into INSERTs and DELETEs and route each set to a different sink that will do what you tell it to. You can use tags for that.
In this pipeline design example, instead of startsWithATag and startsWithBTag you can use tags for Insert and Delete.
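A rough sketch of that routing with tagged outputs in the Beam Java SDK. MyDto stands in for the question's DTO and is assumed to expose an isDeleted() flag parsed from the incoming JSON; dtos is the PCollection<MyDto> produced by the existing JSON-to-DTO step, and the sinks are left as comments:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<MyDto> insertTag = new TupleTag<MyDto>() {};
final TupleTag<MyDto> deleteTag = new TupleTag<MyDto>() {};

PCollectionTuple routed = dtos.apply("Route by operation",
    ParDo.of(new DoFn<MyDto, MyDto>() {
      @ProcessElement
      public void processElement(ProcessContext ctx) {
        if (ctx.element().isDeleted()) {
          ctx.output(deleteTag, ctx.element());  // tagged output: DELETE branch
        } else {
          ctx.output(ctx.element());             // main output: INSERT branch
        }
      }
    }).withOutputTags(insertTag, TupleTagList.of(deleteTag)));

// routed.get(insertTag) can then be written with an INSERT statement and
// routed.get(deleteTag) with a DELETE statement, e.g. via two separately
// configured JdbcIO.write() sinks.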
I am wondering whether Google Dataflow can do something equivalent to the following SQL:
SELECT * FROM A INNER JOIN B ON A.a = B.b LIMIT 1000
I know that Dataflow has a very standard programming paradigm to do joins. However, the part I am interested in is the LIMIT 1000: I don't need the complete joined result, only any 1000 rows of it, and I am wondering whether I can use this to speed up my job (assuming the join is between very large tables and would produce a very large result if joined fully).
So I assume that a very naive way to achieve the above SQL result is some template code as follows:
PCollection A = ...;
PCollection B = ...;
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<..., CoGbkResult>, ...>() {
        ...
    }))
    .apply(Sample.any(1000));
However, my concern is how this Sample transform, hooked up with the ParDo, is handled internally by Dataflow. Will Dataflow be able to optimize by stopping the join as soon as it knows it will definitely have enough output? Or is there simply no such optimization, so Dataflow will compute the full join result and then select 1000 rows from it? (In that case the Sample transform would only be overhead.)
Long question short: is it possible for me to use this approach to do a partial join in Dataflow?
EDIT:
Or, essentially, I am wondering whether the Sample.any() transform is able to hint any optimization to the upstream PCollection. For example, if I do
pipeline.apply(TextIO.Read.from("gs://path/to/my/file*"))
    .apply(Sample.any(N));
Will Dataflow first load all the data in and then select N, or will it be able to take advantage of Sample.any(), do some optimization, and prune out some of the useless reads?
Currently neither Cloud Dataflow, nor any of the other Apache Beam runners (as far as I'm aware) implement such an optimization.
We have several BigQuery tables that we're reading from through Dataflow. At the moment those tables are flattened and a lot of the data is repeated. In Dataflow, all operations must be idempotent, so any output only depends on the input to the function; there's no state kept anywhere else. This is why it makes sense to first group together all the records that belong together, and in our case this probably means creating complex objects.
Example of a complex object (there are many other types like this); we can obviously have millions of instances of each type:
Customer {
  customerId
  address {
    street
    zipcode
    region
    ...
  }
  first_name
  last_name
  ...
  contactInfo: {
    "phone1": { type, number, ... },
    "phone2": { type, number, ... }
  }
}
The examples we found for Dataflow only process very simple objects and demonstrate counting, summing and averaging.
In our case, we eventually want to use Dataflow to perform more complicated processing in accordance with sets of rules. Those rules apply to the full content of a customer, invoice or order, for example, and eventually produce a whole set of indicators, sums and other items.
We considered doing this 100% in BigQuery, but this gets very messy very quickly due to the rules that apply per entity.
At this time I'm still wondering whether Dataflow is really the right tool for this job. There are almost no examples for Dataflow that demonstrate how it's used for these types of more complex objects with one or two collections. The closest I found was the use of a "LogMessage" object for log processing, but that didn't have any collections and therefore didn't do any hierarchical processing.
The biggest problem we're facing is hierarchical processing. We're reading data like this:
customerid ... street zipcode region ... phoneid type number
1          ... a      b       c      ... phone1  1    555-2424
1          ... a      b       c      ... phone2  1    555-8181
The first operation should be to group those rows together to construct a single entity, so that we can make our operations idempotent. What is the best way to do that in Dataflow? Or can you point us to an example that does it?
You can use any object as the elements in a Dataflow pipeline. The TrafficMaxLaneFlow example uses a complex object (although it doesn't have a collection).
In your example you would do a GroupByKey to group the elements. The result is a KV<K, Iterable<V>>. The KV here is just an object and has a collection-like value inside. You could then take that KV<K, Iterable<V>> and turn it into whatever kind of objects you wanted.
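A rough sketch of that pattern in the Beam Java SDK. FlatRow and Customer are hypothetical stand-ins: FlatRow is one flattened record as read from BigQuery (customer and address fields plus one phone per row), flatRows is the PCollection<FlatRow> produced by the read step, and Customer is the assembled complex object with its contactInfo collection:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<Customer> customers = flatRows
    .apply("Key by customer", ParDo.of(new DoFn<FlatRow, KV<String, FlatRow>>() {
      @ProcessElement
      public void processElement(ProcessContext ctx) {
        ctx.output(KV.of(ctx.element().getCustomerId(), ctx.element()));
      }
    }))
    .apply(GroupByKey.<String, FlatRow>create())
    .apply("Assemble customer", ParDo.of(
        new DoFn<KV<String, Iterable<FlatRow>>, Customer>() {
      @ProcessElement
      public void processElement(ProcessContext ctx) {
        Customer customer = new Customer(ctx.element().getKey());
        for (FlatRow row : ctx.element().getValue()) {
          // Address fields are repeated on every flattened row; phones differ per row.
          customer.setAddress(row.getStreet(), row.getZipcode(), row.getRegion());
          customer.addPhone(row.getPhoneId(), row.getPhoneType(), row.getNumber());
        }
        ctx.output(customer);
      }
    }));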
The only thing to be aware of is that if you have very few elements that are really big you may run into some parallelism limits. Specifically, each element needs to be small enough to be processed on a single machine.
You may also be interested in withoutResultFlattening() on BigQueryIO.Read. It only supports reading from a query (rather than a table), but it should provide the results without flattening.