I have multiple custom combine functions which I call as such:
e.g. I have 'data' calculated previously in the pipeline.
cd1 = data | customCombFn1()
cd2 = data | customCombFn2()
cd3 = data | customCombFn3()
How does the pipeline work in the above case? Is 'data' evaluated again and again? Or are cd1, cd2, and cd3 evaluated as a by-product of the pipeline?
Your data object is a PCollection. Applying a combine transformation to a PCollection creates another PCollection, most often containing far fewer elements.
There would be no 're-evaluation', as you call it. A PCollection is typically produced on multiple workers and immediately consumed by the transformations that need it. If that is not possible in a given case, the PCollection will typically be stored for processing at a later point.
Generally speaking, the Cloud Dataflow service automatically applies optimizations to users' pipelines. In most cases, including this one, that lets users focus on their business logic instead of the underlying execution considerations.
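To make the "no re-evaluation" point concrete, here is a small plain-Python toy model. It is not the Beam API and not how the Dataflow service is implemented; the names produce_data, Combiner, and run_fan_out are made up for illustration. Each upstream element is produced once and handed straight to all three combiners.

```python
# Toy model only: hypothetical names, not the Beam API.
def produce_data(counter):
    """Stand-in for the upstream steps that compute `data`."""
    for x in [3, 1, 4, 1, 5, 9, 2, 6]:
        counter["produced"] += 1  # count how often an element is produced
        yield x


class Combiner:
    """Minimal accumulator: folds each incoming element into `acc`."""

    def __init__(self, fn, initial):
        self.fn = fn
        self.acc = initial

    def add(self, x):
        self.acc = self.fn(self.acc, x)


def run_fan_out(source, consumers):
    """Produce each element once and push it to every consumer."""
    for element in source:
        for consumer in consumers.values():
            consumer.add(element)
    return {name: c.acc for name, c in consumers.items()}


counter = {"produced": 0}
results = run_fan_out(
    produce_data(counter),
    {
        "cd1": Combiner(lambda a, x: a + x, 0),
        "cd2": Combiner(max, float("-inf")),
        "cd3": Combiner(min, float("inf")),
    },
)
print(results)              # {'cd1': 31, 'cd2': 9, 'cd3': 1}
print(counter["produced"])  # 8: each element was produced exactly once
```

The three results come out of a single pass over the upstream data, which is the behaviour described above.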
I’d like to be able to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS), i.e. PCollection<Iterable<String>> --> PCollection<Iterable<String>>, where the starting PCollection is an Iterable of file paths and the resulting PCollection is an Iterable of file contents. Alternatively, PCollection<String> --> PCollection<Iterable<String>> would also work and perhaps even be preferable, where the starting PCollection is a glob pattern and the resulting PCollection is an Iterable of the contents of the files which matched the glob.
My use-case is that at a point in my pipeline I have as input a PCollection<String>. Each element of the PCollection is a GCS glob pattern. It’s important that files which match the glob be grouped together because the contents of the files, once all files in a group are read, need to be grouped downstream in the pipeline. I originally tried using FileIO.matchAll and a subsequent GroupByKey. However, the matchAll, window, and GroupByKey combination lacked any guarantee that all files matching the glob would be read and placed in the same window before the GroupByKey transform was performed (though I may be misunderstanding windowing). It’s possible to achieve the desired results if a WindowFn with a large time span is applied, but that is still probabilistic rather than a guarantee that all files will be read before grouping. It’s also the main goal of my pipeline to maintain the lowest possible latency.
So my next, and currently operational, plan was to use an AsyncHttpClient to fan out fetching file contents via GCS HTTP API. I feel like this goes against the grain in Beam and is likely sub-optimal in terms of parallelization.
So I’ve started investigating SplittableDoFn. My current plan is to allow splitting such that each entity in the input Iterable (i.e. each matched file from the glob pattern) could be processed separately. I've been able to modify FileIO#MatchFn (defined here in the Java SDK) to provide the mechanics for a PCollection<String> -> PCollection<Iterable<String>> transform between an input of GCS glob patterns and an output of an Iterable of matches for each glob.
The challenge I’ve encountered is: how do I go about grouping/gathering the split invocations back into a single output value in my DoFn? I’ve tried using stateful processing with a BagState to collect file contents along the way, but I realized partway through that the ProcessElement method of a splittable DoFn may only accept ProcessContext and Restriction tuples and no other args, and therefore no StateId args referring to a StateSpec (it throws an invalid argument error at runtime).
I noticed in the FilePatternWatcher example in the official SDF proposal doc that a custom tracker was created wherein FilePath objects were kept in a set and presumably added to the set via tryClaim. This seems as though it could work for my use-case, but I don’t see/understand how to go about implementing a #SplitRestriction method using a custom RestrictionTracker.
I would be very appreciative if anyone were able to offer advice. I have no preference for any particular solution, only that I want to achieve the ability to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS).
Would joining the output PCollections help you?
PCollectionList
.of(collectionOne)
.and(collectionTwo)
.and(collectionThree)
...
.apply(Flatten.pCollections())
Input is PCollection<KV<String,String>>
I have to write one file per key, with each value of the KV group written as a line.
To group by key, I have 2 options:
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey.withHotKeyFanout --> PCollection
where value String is accumulated Strings from all pairs.
(Combine.CombineFn<String, List<String>, CustomStringObJ>)
I can have a million records per key. The collection of keyed data is optimised using windows and triggers, but it can still have thousands of entries per key.
I worry that the maximum size of a String will cause issues if Combine.perKey.withHotKeyFanout is used to create a CustomStringObJ which has a List<String> as the member to be written to the file.
If we use GroupByKey, how do we handle hot keys?
You should use the GroupByKey approach rather than using Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key, and in the output KV<K, Iterable<V>> the iterable of values is a lazy/streamed view of the elements shuffled to that key. No materialized collection is actually constructed, so this is just as good as routing each element to the worker that owns each file and writing it directly.
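As a rough illustration of why the streamed view matters, here is a plain-Python sketch (not Beam code). The helpers group_by_key and write_files_by_key are hypothetical, and the in-memory sort is only a stand-in for the shuffle. The point is that each value is written as soon as it is read, so no giant per-key string or list is ever built.

```python
import itertools
import os
import tempfile


def group_by_key(pairs):
    """Yield (key, lazy iterator of values). The sort stands in for the shuffle."""
    ordered = sorted(pairs, key=lambda kv: kv[0])  # stable: keeps per-key order
    for key, group in itertools.groupby(ordered, key=lambda kv: kv[0]):
        yield key, (value for _, value in group)


def write_files_by_key(pairs, out_dir):
    """Write one file per key, streaming one value per line."""
    paths = {}
    for key, values in group_by_key(pairs):
        path = os.path.join(out_dir, f"{key}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for line in values:  # consumed lazily, never concatenated
                f.write(line + "\n")
        paths[key] = path
    return paths


pairs = [("a", "a1"), ("b", "b1"), ("a", "a2"), ("b", "b2"), ("a", "a3")]
out_dir = tempfile.mkdtemp()
paths = write_files_by_key(pairs, out_dir)

with open(paths["a"], encoding="utf-8") as f:
    print(f.read())  # a1, a2, a3 on separate lines
```

In a real pipeline the runner performs the shuffle, and the DoFn that consumes KV<K, Iterable<V>> would play the role of the inner write loop.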
Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but most useful for aggregations where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.
This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx example, because a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting, and/or what concept am I missing that I could use to implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with some basic operations like sort and join? I'm using C#, but the answer ought to be language independent.
This paradigm is called Dataflow Programming. It works with a stream of data that is passed from instruction to instruction to be processed.
Dataflow programs can be written in textual or visual form, and besides the software you have mentioned there are many programs that include some sort of dataflow language.
To create your own dataflow language you have to:
1. Create program modules or objects that represent your processing nodes, each realizing a different sort of data processing. Processing nodes usually have one or more data inputs and one or more data outputs, and implement some data processing algorithm inside them. Nodes may also have control inputs that control how a given node processes data. A typical dataflow algorithm calculates an output data sample from the values of one or many input data streams, as FIR filters do, for example. However, a processing algorithm can also feed data values back (output values are in some way mixed with input values), as in IIR filters, or accumulate values in some way to calculate an output value.
2. Create a standard API for passing data between processing nodes. It can differ for different kinds of data and control signals, but it must be standard because processing nodes should 'understand' each other. Data is usually passed as plain values. Control signals can be plain values, events, or a more advanced control language, depending on your needs.
3. Create an arrangement to link your nodes and to pass data between them. You can build your own machinery or use standard things like pipes, message queues, etc. For example, this functionality can be implemented as a tree-like structure whose nodes are your processing nodes, each holding references to the next nodes and to the specific inputs of those nodes that consume data coming from the current node's output.
4. Create some kind of node iterator that starts from the beginning of the dataflow graph and iterates over each processing node, where it:
provides the next data input values
invokes the node's data processing methods
updates the data output value
passes the updated data output values to the inputs of downstream processing nodes
5. Create a tool for configuring node parameters and the links between nodes. It can be just a simple text file edited with a text editor, or a sophisticated visual editor with a GUI for drawing the dataflow graph.
Regarding your note about the Sort module in Alteryx: perhaps data values are just accumulated inside this module and then sorted.
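The steps above can be sketched in a few lines. This is a minimal batch-style sketch in Python (the question uses C#, but the idea carries over), assuming Python 3.9+ for graphlib. Node and run_graph are hypothetical names, and every node receives its complete upstream output, which is the simplest way to let a Sort node see the whole dataset and to let one output feed several consumers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Node:
    """A processing node: a name, a function, and the names of its upstream nodes."""

    def __init__(self, name, process, inputs=()):
        self.name = name
        self.process = process      # callable taking one record list per input
        self.inputs = list(inputs)  # upstream node names, in argument order


def run_graph(nodes):
    """Run every node after all of its upstream nodes, passing outputs along the links."""
    by_name = {n.name: n for n in nodes}
    deps = {n.name: set(n.inputs) for n in nodes}  # node -> its predecessors
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        node = by_name[name]
        args = [outputs[upstream] for upstream in node.inputs]
        outputs[name] = node.process(*args)
    return outputs


graph = [
    Node("source", lambda: [5, 3, 8, 1]),
    Node("sort", lambda rows: sorted(rows), ["source"]),               # needs all rows
    Node("double", lambda rows: [r * 2 for r in rows], ["source"]),    # same input, second consumer
    Node("join", lambda a, b: list(zip(a, b)), ["sort", "double"]),
]
outputs = run_graph(graph)
print(outputs["join"])  # [(1, 10), (3, 6), (5, 16), (8, 2)]
```

A streaming design would instead push records along the links one at a time and let blocking nodes such as Sort buffer internally until their input ends; the topological ordering idea stays the same.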
Here you can find an even more detailed description of dataflow programming languages.
I'm aware that Dataflow can modify a pipeline's execution graph through Fusion Optimization.
Do windows/triggers factor in at all to fusion optimization?
Does a streaming pipeline and/or unbounded sources (Pub/Sub) influence that behavior at all?
All the complex operations of the Beam programming model, including evaluation of windowing/triggering and such, end up being translated to a low-level graph of (possibly stateful) ParDo and GroupByKey operations (a.k.a. Map and Reduce :) ).
E.g.
You can think of assigning windows (Window.into()) as a ParDo that takes an element and returns a list of pairs (element, window) for all windows into which the element's timestamp maps.
A GroupByKey by a key (or a Combine) in your original pipeline gets translated into a GroupByKey by a composite key (user key, window).
Evaluation of triggers happens as a stateful ParDo that gets inserted immediately after any GroupByKey and reacts to new values arriving for a given key/window by buffering the new value and deciding whether, according to the trigger, it's already time to emit the accumulated values or not.
This is not an exact correspondence (semantics of windows is a little more complex than that), just to give you an idea.
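To give a feel for the first two points, here is a plain-Python sketch (not Beam code, and triggers are left out). The function names and the sliding-window parameters size and period are assumptions made for illustration. Window assignment is a flat-map that pairs each element with every window containing its timestamp, and the grouping then happens on the composite key (user key, window).

```python
from collections import defaultdict


def assign_sliding_windows(timestamp, size, period):
    """Return every [start, start + size) window that contains the timestamp."""
    last_start = timestamp - (timestamp % period)
    return [(s, s + size) for s in range(last_start, timestamp - size, -period)]


def window_into(timestamped_kvs, size, period):
    """ParDo-like step: emit ((key, window), value) once per matching window."""
    for (key, value), timestamp in timestamped_kvs:
        for window in assign_sliding_windows(timestamp, size, period):
            yield (key, window), value


def group_by_key(pairs):
    """Plain grouping; here the key happens to be the composite (key, window)."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)


# ((key, value), timestamp) triples
events = [(("user", 1), 3), (("user", 2), 7), (("user", 3), 12)]
grouped = group_by_key(window_into(events, size=10, period=5))
print(grouped[("user", (0, 10))])  # [1, 2]
print(grouped[("user", (5, 15))])  # [2, 3]
```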
Fusion operates on this low-level graph of ParDo and GroupByKey, collapsing some chains of ParDo's into a single ParDo. Fusion doesn't care whether some of the ParDos play a role related to windowing, or that a GroupByKey groups by a composite key, etc.
I believe that in the Dataflow streaming runner, fusion is in practice more aggressive (it always collapses chains of ParDos) than in the batch runner (which collapses only in cases where it seems beneficial according to data size estimates, based on the FlumeJava paper), but this can change as we make improvements to both runners.
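As a rough picture of what collapsing a chain of ParDos means, here is a toy Python sketch (not runner code). fuse and run_unfused are hypothetical helpers, and each "ParDo" is modeled as a function from one element to a list of outputs. The fused version pushes each element through the whole chain at once, while the unfused version materializes a full intermediate collection between steps.

```python
def fuse(*dofns):
    """Collapse a chain of element -> list-of-outputs functions into one."""
    def fused(element):
        current = [element]
        for fn in dofns:
            current = [out for item in current for out in fn(item)]
        return current
    return fused


def run_unfused(elements, dofns):
    """Apply one step at a time, materializing each intermediate collection."""
    for fn in dofns:
        elements = [out for e in elements for out in fn(e)]
    return elements


def split_words(line):
    return line.split()


def drop_short(word):
    return [word] if len(word) > 2 else []


def upper(word):
    return [word.upper()]


lines = ["to be or not to be", "that is the question"]
steps = [split_words, drop_short, upper]

fused = fuse(*steps)
fused_out = [out for line in lines for out in fused(line)]
unfused_out = run_unfused(lines, steps)
print(fused_out)  # ['NOT', 'THAT', 'THE', 'QUESTION']
```

Both paths give the same result; the difference is only in what has to exist between steps, which is why fusion does not care what role each ParDo plays.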
We have a custom combine function (on Beam SDK 2.0) in which millions of objects get accumulated, but they do NOT necessarily get reduced. That is, they sometimes get added to a List such that eventually the List might get quite large (hundreds of megabytes, even gigabytes).
To minimize the problem of having to "pass around" these objects (during merging of accumulators) between nodes, we've created a SINGLE giant node (of 64 cores, tonnes of RAM).
So, in "theory", dataflow does not need to serialize the List object (and any of these big objects in the List) even during "merge accumulator" operations, since all the objects are on the same node. But, does dataflow still serialize even if all the objects of interest are on the same node or is it smart enough to know that an object is on the same node vs separate nodes?
Ideally, when objects are on the same node, we can just pass around references to the objects (rather than serializing/deserializing the contents of these objects, which can be very, very large). (I understand, of course, that when dealing with multiple nodes, there's no choice but to serialize/deserialize since the data has to be passed around somehow; but within a node, is Beam SDK 2.0 smart enough to not serialize/deserialize during these combine functions, group-bys, etc.?)
The Dataflow service aggressively optimizes your pipeline to avoid needless serialization. The optimization you are interested in is fusion, described here in the Dataflow documentation. When data moves through a fused "stage" (a sequence of low-level instructions roughly corresponding to steps in your input pipeline), it is not serialized and deserialized.
However, if your CombineFn builds a list, and that list grows large, you should try to rephrase your pipeline to use a raw GroupByKey. Another important optimization is "combiner lifting" or "mapper-side combine", where your CombineFn is applied per key locally prior to shuffling your data between machines, based on the assumption that the accumulator will be smaller than just a list of elements. In your case the accumulator is the list itself, so the whole list will be serialized, shuffled, and deserialized prior to completing the Combine transform. If, instead, you use a GroupByKey directly, your elements would be streamed much more efficiently, without serializing an entire list.
I should note that Beam's other runners also perform standard fusion optimization and others. These all generally come from functional programming work in the late 80s / early 90s and were applied to distributed data processing in FlumeJava, circa 2010, so it is a baseline expectation now.
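A small plain-Python sketch may help show why combiner lifting pays off for a sum but not for a list-building accumulator. This is a toy model, not the Beam CombineFn API: lifted_combine is a hypothetical helper, and each inner list of partitions stands for the elements that happen to sit on one worker.

```python
def lifted_combine(partitions, create, add, merge):
    """Combine each partition locally, then merge the partial accumulators.

    Returns the final result plus the partial accumulators, which are what
    would have to be serialized and shuffled between machines.
    """
    partials = []
    for part in partitions:
        acc = create()
        for x in part:
            acc = add(acc, x)
        partials.append(acc)
    result = create()
    for partial in partials:
        result = merge(result, partial)
    return result, partials


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # elements per worker

# A reducing combiner: each partial accumulator is a single number.
total, sum_partials = lifted_combine(
    partitions, lambda: 0, lambda a, x: a + x, lambda a, b: a + b
)

# A list-building combiner: each partial accumulator is as big as its input.
as_list, list_partials = lifted_combine(
    partitions, list, lambda a, x: a + [x], lambda a, b: a + b
)

print(total, sum_partials)                  # 45 [6, 9, 30]
print(sum(len(p) for p in list_partials))   # 9: nothing was saved before the shuffle
```

With the sum, three small numbers cross the shuffle instead of nine elements. With the list, all nine elements still cross it, now wrapped in accumulators that must each be serialized as a whole.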