Dataflow fusion + windows/triggers - google-cloud-dataflow

I'm aware that Dataflow can modify a pipeline's execution graph through Fusion Optimization.
Do windows/triggers factor in at all to fusion optimization?
Does a streaming pipeline and/or unbounded sources (Pub/Sub) influence that behavior at all?

All the complex operations of the Beam programming model, including evaluation of windowing/triggering and such, end up being translated to a low-level graph of (possibly stateful) ParDo and GroupByKey operations (a.k.a. Map and Reduce :) ).
For example:
You can think of assigning windows (Window.into()) as a ParDo that takes an element and returns a list of pairs (element, window), one for each window into which the element's timestamp maps.
A GroupByKey by a key (or a Combine) in your original pipeline gets translated into a GroupByKey by a composite key (user key, window).
Evaluation of triggers happens as a stateful ParDo that is inserted immediately after any GroupByKey. It reacts to new values arriving for a given key/window by buffering the new value and deciding whether, according to the trigger, it is time to emit the accumulated values.
This is not an exact correspondence (semantics of windows is a little more complex than that), just to give you an idea.
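To make the translation above more concrete, here is a minimal plain-Java sketch (no Beam dependency) of the first two steps: assigning a fixed window from an element's timestamp, and grouping by the composite (user key, window) key. The class, the Element type, and the method names are hypothetical illustrations, not Beam APIs, and the sketch ignores triggers, merging windows, and late data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WindowLoweringSketch {
    // Hypothetical stand-in for a timestamped key/value element.
    static final class Element {
        final String key;
        final String value;
        final long timestampMillis;

        Element(String key, String value, long timestampMillis) {
            this.key = key;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // Window assignment viewed as a per-element function: map the element's
    // timestamp to the start of the fixed window that contains it.
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // GroupByKey viewed as grouping by the composite key (user key, window start).
    static Map<Map.Entry<String, Long>, List<String>> groupByKeyAndWindow(
            List<Element> input, long sizeMillis) {
        Map<Map.Entry<String, Long>, List<String>> grouped = new LinkedHashMap<>();
        for (Element e : input) {
            Map.Entry<String, Long> compositeKey =
                Map.entry(e.key, fixedWindowStart(e.timestampMillis, sizeMillis));
            grouped.computeIfAbsent(compositeKey, unused -> new ArrayList<>()).add(e.value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Element> input = List.of(
            new Element("a", "x", 1_000L),
            new Element("a", "y", 4_999L),
            new Element("a", "z", 5_000L),
            new Element("b", "w", 1_500L));
        // Elements with the same user key land in different groups when their
        // timestamps fall into different 5-second windows.
        System.out.println(groupByKeyAndWindow(input, 5_000L));
    }
}
```

Nothing in this lowered form marks the first step as "windowing"; it is just another per-element function followed by a grouping.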
Fusion operates on this low-level graph of ParDo and GroupByKey, collapsing some chains of ParDos into a single ParDo. Fusion doesn't care whether some of the ParDos play a role related to windowing, or that a GroupByKey groups by a composite key, etc.
I believe that in the Dataflow streaming runner, fusion is in practice more aggressive (it always collapses chains of ParDos) than in the batch runner (which collapses only where it seems beneficial according to data size estimates, based on the FlumeJava paper), but this can change as we make improvements to both runners.
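As a rough illustration of what collapsing a chain of ParDos means, the plain-Java sketch below models a ParDo as a function from one element to zero or more outputs and fuses two of them into one. The FusionSketch class and the fuse method are hypothetical names for this example only; a real runner works on its own graph representation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class FusionSketch {
    // Fuse two element-wise steps into a single step. Each output of the first
    // step is handed directly to the second one in memory, so no intermediate
    // collection is materialized between the two steps.
    static <A, B, C> Function<A, List<C>> fuse(
            Function<A, List<B>> first, Function<B, List<C>> second) {
        return a -> {
            List<C> out = new ArrayList<>();
            for (B b : first.apply(a)) {
                out.addAll(second.apply(b));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Function<Integer, List<Integer>> expand = x -> List.of(x, x + 1);
        Function<Integer, List<Integer>> scale = y -> List.of(y * 10);
        System.out.println(fuse(expand, scale).apply(1));
    }
}
```

The fused function does not need to know whether either step happens to be a window assignment or a trigger evaluation, which is the point made above.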

Related

What should I use as the Key for GroupIntoBatches.withShardedKey

I want to batch the calls to an external service in my streaming dataflow job for unbounded sources. I used windowing + attach a dummy key + GroupByKey as below
messages
    // 1. Windowing
    .apply("window-5-seconds",
        Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
                    .orFinally(AfterWatermark.pastEndOfWindow())))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes()
    )
    // 2. attach arbitrary key
    .apply("attach-arbitrary-key", ParDo.of(new MySink.AttachDummyKey()))
    // 3. group by key
    .apply(GroupByKey.create())
    // 4. call my service
    .apply("call-my-service",
        ParDo.of(new MySink(myClient)));
This implementation caused performance issues as I attached a dummy key to all the messages that caused the transform to not execute in parallel at all. After reading this answer, I switched to GroupIntoBatches transform as below.
messages
    // 1. Windowing
    .apply("window-5-seconds",
        Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
                    .orFinally(AfterWatermark.pastEndOfWindow())))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes()
    )
    // 2. attach sharding key
    .apply("attach-sharding-key", ParDo.of(new MySink.AttachShardingKey()))
    // 3. group by key into batches of at most 1000 elements,
    //    flushing a partial batch after 5 seconds of buffering
    .apply("group-into-batches",
        GroupIntoBatches.<String, MessageWrapper>ofSize(1000)
            .withMaxBufferingDuration(Duration.standardSeconds(5))
            .withShardedKey())
    // 4. call my service
    .apply("call-my-service",
        ParDo.of(new MySink(myClient)));
The documentation states that withShardedKey increases parallelism by spreading one key over multiple threads, but what would be a good key to use with withShardedKey?
If this truly is runner-determined sharding, would it make sense to use a single dummy key, or would the same problem occur as with GroupByKey? Currently I do not have a good key to use; I was thinking of creating a hash based on some fields of the message. If I do pick a key that evenly distributes the traffic, would it still make sense to use withShardedKey? Or might each shard then receive too little data for GroupIntoBatches to be useful?
Usually the key would be a natural key, but since you mentioned that there's no such key, I feel there are a few trade-offs to consider.
You can apply a static key, but the parallelism will just depend on the number of threads (GroupIntoBatches semantic) which is runner specific:
"Outputs batched elements associated with sharded input keys. By default, keys are sharded to such that the input elements with the same key are spread to all available threads executing the transform. Runners may override the default sharding to do a better load balancing during the execution time."
If your pipeline can afford more calls (some of them with partial batches, depending on the distribution), applying a random key from a small range instead of a static key may provide better guarantees. You would have to experiment to find the range that gives the best balance.
I recommend watching this session which provides some relevant information: Beam Summit 2021 - Autoscaling your transforms with auto-sharded GroupIntoBatches
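As a sketch of the key choices discussed above, the plain-Java helper below produces either a random key from a small range or a deterministic key hashed from message fields. The ShardKeySketch class, the NUM_SHARDS value of 8, and the "shard-" prefix are assumptions for illustration; the right range has to be found by experiment, and the logic would live inside the key-attaching DoFn.

```java
import java.util.Objects;
import java.util.concurrent.ThreadLocalRandom;

class ShardKeySketch {
    // Assumption: a small, fixed number of shards, to be tuned by experiment.
    static final int NUM_SHARDS = 8;

    // Random key in the range shard-0 .. shard-(NUM_SHARDS - 1).
    static String randomShardKey() {
        return "shard-" + ThreadLocalRandom.current().nextInt(NUM_SHARDS);
    }

    // Deterministic alternative: hash some fields of the message so that
    // equal field values always map to the same shard.
    static String hashedShardKey(int numShards, Object... fields) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return "shard-" + Math.floorMod(Objects.hash(fields), numShards);
    }

    public static void main(String[] args) {
        System.out.println(randomShardKey());
        System.out.println(hashedShardKey(NUM_SHARDS, "user-1", 42));
    }
}
```

A larger range gives more parallelism but spreads the traffic thinner, so more batches are flushed by the buffering duration before they fill up.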

Apache Beam - Parallelize Google Cloud Storage Blob Downloads While Maintaining Grouping of Blobs

I’d like to be able to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS), i.e. PCollection<Iterable<String>> --> PCollection<Iterable<String>>, where the starting PCollection is an Iterable of file paths and the resulting PCollection is an Iterable of file contents. Alternatively, PCollection<String> --> PCollection<Iterable<String>> would also work and perhaps even be preferable, where the starting PCollection is a glob pattern and the resulting PCollection is an Iterable of the contents of the files that matched the glob.
My use case is that at a point in my pipeline I have as input a PCollection<String>. Each element of the PCollection is a GCS glob pattern. It’s important that files which match the glob be grouped together, because the contents of the files, once all files in a group are read, need to be grouped downstream in the pipeline. I originally tried using FileIO.matchAll and a subsequent GroupByKey. However, the matchAll, window, and GroupByKey combination lacked any guarantee that all files matching the glob would be read and in the same window before performing the GroupByKey transform (though I may be misunderstanding windowing). It’s possible to achieve the desired results if a WindowFn with a large time span is applied, but it’s still probabilistic rather than a guarantee that all files will be read before grouping. The main goal of my pipeline is also to maintain the lowest possible latency.
So my next, and currently operational, plan was to use an AsyncHttpClient to fan out fetching file contents via GCS HTTP API. I feel like this goes against the grain in Beam and is likely sub-optimal in terms of parallelization.
So I’ve started investigating SplittableDoFn. My current plan is to allow splitting such that each entity in the input Iterable (i.e. each matched file from the glob pattern) could be processed separately. I've been able to modify FileIO#MatchFn (defined here in the Java SDK) to provide mechanics for a PCollection<String> -> PCollection<Iterable<String>> transform between an input of GCS glob patterns and an output of an Iterable of matches for the glob.
The challenge I’ve encountered is: how do I go about grouping/gathering the split invocations back into a single output value in my DoFn? I’ve tried using stateful processing with a BagState to collect file contents along the way, but I realized partway through that the ProcessElement method of a splittable DoFn may only accept ProcessContext and Restriction tuples and no other arguments, and therefore no StateId arguments referring to a StateSpec (it throws an invalid argument error at runtime).
I noticed in the FilePatternWatcher example in the official SDF proposal doc that a custom tracker was created in which FilePath objects were kept in a set and presumably added to the set via tryClaim. This seems as though it could work for my use case, but I don’t understand how to go about implementing a #SplitRestriction method using a custom RestrictionTracker.
I would be very appreciative if anyone were able to offer advice. I have no preference for any particular solution, only that I want to achieve the ability to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS).
Would joining the output PCollections help you?
PCollectionList
.of(collectionOne)
.and(collectionTwo)
.and(collectionThree)
...
.apply(Flatten.pCollections())

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5 second windows to your data is not a way to prescribe how the data should be processed, only the end result of aggregations for that processing. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way: the Beam model doesn't prescribe any specific processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements currently available before moving on and coming back to the source later. If you adjusted your source to emit elements slowly over a long period of time, you would likely see more processing in between reads from the source.

Apache Beam: why is the timestamp of aggregate value in Global Window 9223371950454775?

We migrated from Google Dataflow 1.9 to Apache Beam 0.6. We are noticing a change in the behavior of the timestamps after applying the global window. In Google Dataflow 1.9, we would get the correct timestamps in the DoFn after the windowing/combine function. Now we get a huge value for the timestamp, e.g. 9223371950454775. Did the default behavior for the global window change in Apache Beam?
input.apply(name(id, "Assign To Shard"), ParDo.of(new AssignToTest()))
.apply(name(id, "Window"), Window
.<KV<Long, ObjectNode >>into(new GlobalWindows())
.triggering(Repeatedly.forever(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))))
.discardingFiredPanes())
.apply(name(id, "Group By Shard"), GroupByKey.create())
.apply(.....) }
TL;DR: When you are combining a bunch of timestamped values, you need to choose a timestamp for the result of the aggregation. There are multiple good answers for this output timestamp. In Dataflow 1.x the default was the minimum of the input timestamps. Based on our experience with 1.x, the default in Beam was changed to the end of the window. You can restore the prior behavior by adding .withTimestampCombiner(TimestampCombiner.EARLIEST) to your Window transform.
I'll unpack this. Let's use the # sign to pair up a value and its timestamp. Focusing on just one key, you have timestamped values v1#t1, v2#t2, ..., etc. I will stick with your example of a raw GroupByKey even though this also applies to other ways of combining the values. So the output iterable of values is [v1, v2, ...] in arbitrary order.
Here are some possibilities for the timestamp:
min(t1, t2, ...)
max(t1, t2, ...)
the end of the window these elements are in (ignoring input timestamps)
All of these are correct. These are all available as options for your OutputTimeFn in Dataflow 1.x and TimestampCombiner in Apache Beam.
The timestamps have different interpretations and they are useful for different things. The output time of the aggregated value governs the downstream watermark. So choosing earlier timestamps holds the downstream watermark back more, while later timestamps allow it to move ahead.
min(t1, t2, ...) allows you to unpack the iterable and re-output v1#t1
max(t1, t2, ...) accurately models the logical time that the aggregated value was fully available. Max does tend to be the most expensive, for reasons to do with implementation details.
end of the window:
models the fact that this aggregation represents all the data for the window
is very easy to understand
allows downstream watermarks to advance as fast as possible
is extremely efficient
For all of these reasons, we switched the default from the min to end of window.
In Beam, you can restore the prior behavior by adding .withTimestampCombiner(TimestampCombiner.EARLIEST) to your Window transform. In Dataflow 1.x you can migrate to Beam's defaults by adding .withOutputTimeFn(OutputTimeFns.outputAtEndOfWindow()).
Another technicality is that the user-defined OutputTimeFn is removed and replaced by the TimestampCombiner enum, so there are only these three choices, not a whole API to write your own.
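The three candidate output timestamps can be sketched in plain Java as below. The class and method names are hypothetical. The end-of-global-window computation rests on the assumption that Beam's maximum timestamp is Long.MAX_VALUE microseconds expressed in milliseconds and that the global window ends one day before it; under that assumption the arithmetic reproduces the value reported in the question.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

class OutputTimestampSketch {
    // Option 1: the minimum of the input timestamps (the Dataflow 1.x default).
    static long earliest(List<Long> timestampsMillis) {
        return Collections.min(timestampsMillis);
    }

    // Option 2: the maximum of the input timestamps.
    static long latest(List<Long> timestampsMillis) {
        return Collections.max(timestampsMillis);
    }

    // Option 3 for the global window: the end of the window, ignoring inputs.
    // Assumption: max timestamp = Long.MAX_VALUE microseconds in millis,
    // and the global window ends one day before that.
    static long endOfGlobalWindowMillis() {
        return TimeUnit.MICROSECONDS.toMillis(Long.MAX_VALUE) - TimeUnit.DAYS.toMillis(1);
    }

    public static void main(String[] args) {
        List<Long> timestamps = List.of(30L, 10L, 20L);
        System.out.println(earliest(timestamps));
        System.out.println(latest(timestamps));
        System.out.println(endOfGlobalWindowMillis());
    }
}
```

So the huge number is not a corrupted timestamp: it is what end-of-window means when the window is the global one.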

In beam custom combine function, does serialization occur even if the object is on "same" machine?

We have a custom combine function (on Beam SDK 2.0) in which millions of objects get accumulated but do NOT necessarily get reduced. That is, they sometimes get added to a List, so that eventually the List might get quite large (hundreds of megabytes, even gigabytes).
To minimize the problem of having to "pass around" these objects (during merging of accumulators) between nodes, we've created a SINGLE giant node (of 64 cores, tonnes of RAM).
So, in "theory", dataflow does not need to serialize the List object (and any of these big objects in the List) even during "merge accumulator" operations, since all the objects are on the same node. But, does dataflow still serialize even if all the objects of interest are on the same node or is it smart enough to know that an object is on the same node vs separate nodes?
Ideally, when objects are on the same node, we can just pass around references to the objects (rather than serializing/deserializing the contents of these objects, which can be very large). I understand, of course, that when dealing with multiple nodes there's no choice but to serialize/deserialize, since the data has to be passed around somehow. But within a node, is Beam SDK 2.0 smart enough to not serialize/deserialize during these combine functions, group-bys, etc.?
The Dataflow service aggressively optimizes your pipeline to avoid needless serialization. The optimization you are interested in is fusion, described here in the Dataflow documentation. When data moves through a fused "stage" (a sequence of low-level instructions roughly corresponding to steps in your input pipeline), it is not serialized and deserialized.
However, if your CombineFn builds a list, and that list grows large, you should try to rephrase your pipeline to use a raw GroupByKey. Another important optimization is "combiner lifting" or "mapper-side combine" where your CombineFn is applied per-key locally prior to shuffling your data between machines, based on the assumption that the accumulator will be smaller than just a list of elements. So the whole list will be serialized, shuffled, and deserialized prior to completing the Combine transform. If, instead, you use a GroupByKey directly, your elements would be much more efficiently streamed, without serializing an entire list.
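The trade-off behind combiner lifting can be sketched in plain Java as follows. The class and method names are hypothetical and the "workers" are just nested lists; the sketch only counts how many values would cross the shuffle when the accumulator is a running sum versus when it is a list of every element.

```java
import java.util.ArrayList;
import java.util.List;

class CombinerLiftingSketch {
    // Lifted combine: each worker reduces its local elements to one partial
    // sum, and only the partial sums are merged after the shuffle.
    static long liftedSum(List<List<Long>> perWorker) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> local : perWorker) {
            long acc = 0L;
            for (long v : local) {
                acc += v;
            }
            partials.add(acc);
        }
        long total = 0L;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // With a sum accumulator, one value per worker crosses the shuffle.
    static int valuesShuffledWithSumAccumulator(List<List<Long>> perWorker) {
        return perWorker.size();
    }

    // With a list-building accumulator, every element still crosses the
    // shuffle, now wrapped inside serialized lists, so lifting saves nothing.
    static int valuesShuffledWithListAccumulator(List<List<Long>> perWorker) {
        int count = 0;
        for (List<Long> local : perWorker) {
            count += local.size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<Long>> perWorker = List.of(List.of(1L, 2L, 3L), List.of(4L, 5L));
        System.out.println(liftedSum(perWorker));
        System.out.println(valuesShuffledWithSumAccumulator(perWorker));
        System.out.println(valuesShuffledWithListAccumulator(perWorker));
    }
}
```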
I should note that Beam's other runners also perform standard fusion optimization, among others. These optimizations generally come from functional programming work in the late 80s / early 90s and were applied to distributed data processing in FlumeJava, circa 2010, so they are a baseline expectation now.

Resources