Conditional iterations in Google Cloud Dataflow

I am looking into the possibility of implementing a data analysis algorithm using Google Cloud Dataflow. Mind you, I have no experience with Dataflow yet. I am just doing some research on whether it can fulfill my needs.
Part of my algorithm contains a conditional iteration, that is, it continues until some condition is met:
PCollection data = ...;
while (needsMoreWork(data)) {
    data = doAStep(data);
}
I have looked around in the documentation and, as far as I can see, I am only able to do "iterations" if I know the exact number of iterations before the pipeline starts. In that case my pipeline construction code can just create a sequential pipeline with a fixed number of steps.
The only "solution" I can think of is to run each iteration as a separate pipeline, store the intermediate data in some database, and then decide in my pipeline construction code whether or not to launch a new pipeline for the next iteration. This seems like an extremely inefficient solution!
Are there any good ways to perform this kind of conditional iteration in Google Cloud Dataflow?
Thanks!

For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches. Create a pipeline which does a few iterations (becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline multiple times until needsMoreWork is false.
We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50.
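For concreteness, a rough sketch of that driver-loop approach is below. All of the class, method, and path names here (IterativeDriver, buildIterationPipeline, needsMoreWork, the gs:// paths) are made up for illustration, and the Beam-style waitUntilFinish() call assumes you want each submission to block; treat it as a sketch of the idea rather than a drop-in solution.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class IterativeDriver {

  public static void main(String[] args) {
    String input = "gs://my-bucket/iterations/seed/*";      // hypothetical seed data
    for (int i = 0; ; i++) {
      String output = "gs://my-bucket/iterations/" + i + "/";
      Pipeline p = buildIterationPipeline(input, output);
      p.run().waitUntilFinish();                             // block until this iteration completes
      if (!needsMoreWork(output)) {
        break;                                               // condition met, stop launching pipelines
      }
      input = output + "*";                                  // next iteration reads this one's output
    }
  }

  // Builds one bounded pipeline that reads 'input', applies doAStep, and writes 'output'.
  static Pipeline buildIterationPipeline(String input, String output) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    // p.apply(TextIO.read().from(input)).apply(new DoAStep()).apply(TextIO.write().to(output));
    return p;
  }

  // Inspects the iteration's output (e.g. a small marker or count file) to decide
  // whether another pipeline should be launched.
  static boolean needsMoreWork(String output) {
    return false;  // placeholder decision
  }
}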

Related

Why one of two identical Google Sheets runs so much slower than the other

I have a Google Sheet consisting of IMPORTXML functions (among others) that serves as the decision basis for some dependent Zapier functions with a third application, and it has been running increasingly slowly. This causes downstream issues for the cascading schedule of systems I use afterwards, not to mention mis-triggers in Zapier as it tries to catch up with cell values that change while the formulas run.
I've tried to optimise what I can in it to get it to run faster. One step I took was to clone the original sheet so I could test optimisations on the clone (essentially a staging sheet).
What I've noticed in doing this is that the clone runs all the included formulas much more quickly than its identical twin, the original.
Does anyone know of any obvious reasons why this would be happening, and what I can do to get the original running as fast as the clone?

Conditional Flow - Spring Cloud Dataflow

In Spring Cloud Dataflow (Stream), what is the right, generic way to implement conditional flow based on a processor's output?
For example, in the case below, how do I have execution flow through a different path based on the product price emitted by a transformer?
[Example process diagram]
You can use the router-sink to dynamically decide where to send the data based on the upstream event-data.
There is simple SpEL support, as well as more comprehensive support via a Groovy script, to help with the decision making and conditions. See the router-sink README.
If you foresee more complex conditions and process workflows, you could alternatively build a custom processor that does the dynamic routing to various downstream named-channel destinations. This approach can also help with unit/integration testing the business logic standalone.
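If you go the custom-processor route, a minimal sketch using the annotation-based Spring Cloud Stream programming model might look like the following. The destination names and the Product type are assumptions for illustration; the key piece is BinderAwareChannelResolver, which resolves a destination name at runtime and lets the processor send to it.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.binding.BinderAwareChannelResolver;
import org.springframework.cloud.stream.messaging.Sink;
import org.springframework.messaging.support.MessageBuilder;

@EnableBinding(Sink.class)
public class PriceRoutingProcessor {

    @Autowired
    private BinderAwareChannelResolver resolver;

    // Route each product to a named destination based on its price.
    // The destination names are hypothetical; they become named channels
    // that downstream apps can bind to.
    @StreamListener(Sink.INPUT)
    public void route(Product product) {
        String destination = product.getPrice() >= 100.0
                ? "highPriceDestination"
                : "lowPriceDestination";
        resolver.resolveDestination(destination)
                .send(MessageBuilder.withPayload(product).build());
    }

    // Minimal payload type for the sketch.
    public static class Product {
        private String name;
        private double price;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public double getPrice() { return price; }
        public void setPrice(double price) { this.price = price; }
    }
}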

Using Z3 to minimize makespan in scheduling problems

I am trying to model job shop scheduling problems using Z3. Specifically, let's say I have a set of tasks, each of which may have dependencies on other tasks. I wish to minimize the time at which the last task is scheduled, i.e. the makespan.
Since there can be more than one job that has dependencies on other jobs but no forward dependencies (i.e. no job depends on it), a simple minimize operation in Z3 may not suffice. And Z3 doesn't provide a max function over a list.
To work around this, I am considering adding a fake job which depends on all such jobs and then minimizing the time at which this job is scheduled. I wonder whether this approach is scalable, as I need to add constraints to many jobs.
Is this the only approach, or are there other, more elegant means?
You can define max using a chain of ite calls yourself, assuming you know exactly how many jobs there are. See here: Use Z3 and SMT-LIB to get a maximum of two values
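To illustrate, here is a small sketch of that ite-chain encoding using the Z3 Java bindings. The job names and the placeholder constraints are made up, and exact method signatures can differ slightly between Z3 versions, so treat this as a sketch of the encoding rather than a finished model.

import com.microsoft.z3.*;

public class MakespanDemo {
    public static void main(String[] args) {
        Context ctx = new Context();

        // End times of the jobs that nothing else depends on (hypothetical names).
        IntExpr[] ends = new IntExpr[] {
            ctx.mkIntConst("end_jobA"),
            ctx.mkIntConst("end_jobB"),
            ctx.mkIntConst("end_jobC")
        };

        // Build max(ends) as a running chain of ite terms.
        ArithExpr max = ends[0];
        for (int i = 1; i < ends.length; i++) {
            max = (ArithExpr) ctx.mkITE(ctx.mkGe(ends[i], max), ends[i], max);
        }

        Optimize opt = ctx.mkOptimize();
        IntExpr makespan = ctx.mkIntConst("makespan");
        opt.Add(ctx.mkEq(makespan, max));

        // Placeholder lower bounds; a real model would add the actual start-time,
        // duration, and precedence constraints here.
        opt.Add(ctx.mkGe(ends[0], ctx.mkInt(3)),
                ctx.mkGe(ends[1], ctx.mkInt(5)),
                ctx.mkGe(ends[2], ctx.mkInt(4)));

        opt.MkMinimize(makespan);
        if (opt.Check() == Status.SATISFIABLE) {
            System.out.println(opt.getModel());
        }
    }
}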

Custom PCollectionView

There have been a number of times when I wanted to create a custom PCollectionView. Is this possible? For now, the only workaround I have is to create a PTransform, return a PCollection, and then apply a View.asSingleton() transform, but I've noticed (at least several months ago) that this is much slower than using a native PCollectionView transform such as View.asList(). And since I'll be accessing this PCollectionView millions of times, it makes a difference whether it takes a few milliseconds or, say, a second.
How do you want to view the contents of your PCollection? The answer to this question will determine how you should approach things.
Cloud Dataflow (more generally, any Apache Beam backend) has a few ways that it will materialize your PCollection to allow you to efficiently access it as a side input. So list, singleton, map, and multimap are each pretty efficient for their usual access patterns (iteration, key lookup, etc). The architecture of Dataflow (now Beam) is such that you can define custom views, but if it requires a new access pattern then it will require backend support to be efficient.
Also, you might care to know that after the first access to a singleton side input, the value will usually be cached.
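For reference, a minimal sketch of the built-in views used as side inputs (Beam-style API; the Dataflow SDK 1.x names are very similar, and the toy data is made up) looks like this:

import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputViews {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<Integer> numbers = p.apply("CreateNumbers", Create.of(1, 2, 3, 4));

    // Singleton view: combine down to one value, then view it.
    final PCollectionView<Integer> maxView =
        numbers.apply(Max.integersGlobally().asSingletonView());

    // List view: the whole collection, readable from every worker.
    final PCollectionView<List<Integer>> listView = numbers.apply(View.asList());

    numbers.apply("UseSideInputs", ParDo.of(new DoFn<Integer, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        int max = c.sideInput(maxView);               // cached after the first access
        int count = c.sideInput(listView).size();
        c.output(c.element() + " of " + count + ", max=" + max);
      }
    }).withSideInputs(maxView, listView));

    p.run();
  }
}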

A way to use "aggregator" (custom counter) within Combine step?

My team uses a lot of aggregators (custom counters) in many of the Dataflow pipelines we run, for monitoring and analysis purposes.
We mostly write DoFn classes to do so, but we sometimes use Combine.perKey(), writing our own combine class that implements SerializableFunction<Iterable<T>, S> (usually, in our case, T and S are the same). Some of the jobs we run have a small fraction of very hot keys, and we would like to use some of the features offered by Combine (such as hot key fanout), but there is one issue with this approach.
It appears that aggregators are only available within a DoFn, and I am wondering if there is a way around this, or whether this is a feature likely to be added in the future. Mostly, we use a bunch of custom counters to count the number of certain events/objects of different types for analysis and monitoring. In some cases we can probably apply another DoFn after the Combine step to do this, but in other cases we really need to count things during the combine process -- for instance, we want to know the distribution of objects over keys, to understand how many hot keys we have and where the line falls between hot keys and very hot keys. There are a few other cases that seem tricky to us.
I searched around, but I couldn't find much resource around how one can use aggregators during the Combine step, so any help will be really appreciated!
If needed, I can perhaps describe what kind of Combine step we use and what we are trying to count, but it'll take some time and I'd like to have a general solution around this.
This is not currently possible. In the future (as part of Apache Beam) it will likely be possible to define metrics (which are like aggregators) within a CombineFn, which should address this.
In the meantime, for your use case you can do as you describe. You can have a Combine.perKey(), and then have multiple steps consuming the result -- one for your actual processing and others to report various metrics.
You could also look at the methods in CombineFns which allow creating a composed CombineFn. For instance, you could use your CombineFn and a simple Count, so that the reporting DoFn can report the number of elements in each key (consuming the Count) and the actual processing DoFn can consume the result of your CombineFn.
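To make the composed-CombineFn idea concrete, here is a rough sketch (Beam-style API; the toy data and key names are made up). Sum.ofIntegers() stands in for "your CombineFn", and Count.combineFn() produces the per-key element count that a separate reporting step consumes:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.CombineFns.CoCombineResult;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class ComposedCombineExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<KV<String, Integer>> events = p.apply(Create.of(
        KV.of("keyA", 1), KV.of("keyA", 2), KV.of("keyB", 5)));

    final TupleTag<Integer> sumTag = new TupleTag<Integer>();
    final TupleTag<Long> countTag = new TupleTag<Long>();

    // Compose the "real" CombineFn (here: Sum) with a Count over the same inputs.
    PCollection<KV<String, CoCombineResult>> combined = events.apply(
        Combine.perKey(
            CombineFns.compose()
                .with(new SimpleFunction<Integer, Integer>() {
                        @Override public Integer apply(Integer x) { return x; }
                      }, Sum.ofIntegers(), sumTag)
                .with(new SimpleFunction<Integer, Integer>() {
                        @Override public Integer apply(Integer x) { return x; }
                      }, Count.<Integer>combineFn(), countTag)));

    // One consumer reports per-key counts (e.g. to study the hot-key distribution);
    // another could consume sumTag for the actual processing.
    combined.apply("ReportPerKeyCounts", ParDo.of(new DoFn<KV<String, CoCombineResult>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        long elementCount = c.element().getValue().get(countTag);
        System.out.println(c.element().getKey() + " has " + elementCount + " elements");
      }
    }));

    p.run();
  }
}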
