How can I emit summary data for each window even if a given window was empty? - google-cloud-dataflow

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally, which would give me a singleton per window. I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which then emits nothing if the window was empty.

Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since doing so would be cost-prohibitive in many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.
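To make the heartbeat idea concrete, here is a minimal plain-Java simulation (deliberately not Beam code, so it runs without the SDK). The class name HeartbeatSum, the method names, and the {timestamp, value} array layout are all assumptions made for illustration. It sums values per fixed window after merging in exactly one zero per window, so an empty window still produces a result of 0.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the heartbeat workaround; hypothetical names, no Beam dependency. */
final class HeartbeatSum {
    /** Start of the fixed window containing the timestamp (non-negative timestamps assumed). */
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** Sums values per fixed window over [fromMillis, toMillis); events[i] = {timestampMillis, value}. */
    static Map<Long, Integer> sumPerWindow(long[][] events, long windowSizeMillis,
                                           long fromMillis, long toMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        // Heartbeat stream: exactly one zero per enumerable window.
        for (long start = windowStart(fromMillis, windowSizeMillis);
             start < toMillis; start += windowSizeMillis) {
            sums.merge(start, 0, Integer::sum);
        }
        // Main stream, merged ("flattened") into the same per-window sum.
        for (long[] event : events) {
            sums.merge(windowStart(event[0], windowSizeMillis), (int) event[1], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long[][] events = {{1_000, 2}, {4_000, 3}, {11_000, 7}};
        // The window starting at 5000 has no real data but still reports 0.
        System.out.println(sumPerWindow(events, 5_000, 0, 15_000)); // {0=5, 5000=0, 10000=7}
    }
}
```

In a real pipeline the zeroes would come from the separate generator job described above, with timestamps chosen so that one falls into each window.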

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline
    .apply(Read.from(new SomeUnboundedSource(...))) // SomeUnboundedSource extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead, all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner, but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, that it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data does not prescribe how the data should be processed; it only determines the end result of aggregations. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline, the windowing you applied has no effect.
As to whether it is supposed to work that way: the Beam model doesn't prescribe any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements currently available before moving on and coming back to the source later. If you adjusted your source to emit elements slowly over a long period of time, you would likely see more processing in between reads from the source.
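The event-time point can be illustrated with a small plain-Java sketch (not Beam code; the class and method names are hypothetical). Elements are assigned to fixed windows purely from their timestamps, so reading all of them in one burst, in any arrival order, still yields the same window membership.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of event-time window assignment; hypothetical names. */
final class EventTimeWindows {
    /** Groups event timestamps into fixed event-time windows, keyed by window start. */
    static Map<Long, List<Long>> assign(List<Long> eventTimesMillis, long windowSizeMillis) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long t : eventTimesMillis) {
            // The window depends only on the element's own timestamp,
            // not on when or in what order the element was read.
            long start = t - (t % windowSizeMillis);
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(t);
        }
        return windows;
    }

    public static void main(String[] args) {
        // All four elements arrive "at once" and out of order.
        List<Long> arrivals = Arrays.asList(12_000L, 1_000L, 6_000L, 2_000L);
        System.out.println(assign(arrivals, 5_000)); // {0=[1000, 2000], 5000=[6000], 10000=[12000]}
    }
}
```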

Can Google dataflow GroupByKey handle hot keys?

Input is PCollection<KV<String,String>>
I have to write one file per key, with each value of the KV group as a line.
To group by key, I have two options:
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey.withHotKeyFanout --> PCollection
where the value is the accumulation of the Strings from all pairs
(Combine.CombineFn<String, List<String>, CustomStringObJ>).
I can have a million records per key. The collection of keyed data is optimised using windows and triggers, but it can still have thousands of entries per key.
I worry that the maximum size of a String will cause issues if Combine.perKey.withHotKeyFanout is used to create a CustomStringObJ, which has a List<String> member to be written to the file.
If we use GroupByKey, how to handle hot keys?
You should use the GroupByKey approach rather than using Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key, and in the output KV<K, Iterable<V>> the iterable of values is a particular lazy/streamed view on the elements shuffled to that key. No actual iterable is constructed; this is just as good as routing each element to the worker that owns each file and writing it directly.
Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but most useful for aggregations where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.
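As a rough sketch of what consuming the grouped values might look like, the plain-Java helper below writes each value as its own line while iterating, instead of first concatenating everything into one large String. The class and method names are hypothetical, and a PrintWriter stands in for whatever per-key file sink the pipeline actually uses.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;

/** Streams grouped values to a writer one line at a time; hypothetical names. */
final class GroupedLineWriter {
    /** Writes each value as its own line and returns the number of lines written. */
    static long writeLines(Iterable<String> values, PrintWriter out) {
        long count = 0;
        for (String value : values) {
            // Only one value is held at a time, so the group size never
            // has to fit into a single String or List.
            out.print(value);
            out.print('\n');
            count++;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) {
        StringWriter file = new StringWriter(); // stand-in for the per-key file
        long lines = writeLines(Arrays.asList("a", "b", "c"), new PrintWriter(file));
        System.out.println(lines + " lines written");
    }
}
```

Inside a ParDo that follows GroupByKey, the same loop would run over the Iterable<String> of the grouped element.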

Remove duplicates across window triggers/firings

Let's say I have an unbounded PCollection of sentences keyed by userid, and I want a constantly updated value for whether the user is annoying. We can calculate whether a user is annoying by passing all of the sentences they've ever said into the function isAnnoying(). Forever.
I set the window to global with a trigger afterElement(1) and accumulatingFiredPanes(), do a GroupByKey, then have a ParDo that emits userid,isAnnoying.
That works forever, keeps accumulating the state for each user, etc. Except it turns out that the vast majority of the time a new sentence does not change whether a user isAnnoying, so most of the time when the window fires and emits a userid,isAnnoying tuple, it's a redundant update and the IO was unnecessary. How do I catch these duplicate updates and drop them, while still getting an update every time a sentence comes in that does change the isAnnoying value?
Today there is no way to directly express "output only when the combined result has changed".
One approach that you may be able to apply to reduce data volume, depending on your pipeline: Use .discardingFiredPanes() and then follow the GroupByKey with an immediate filter that drops any zero values, where "zero" means the identity element of your CombineFn. I'm using the fact that associativity requirements of Combine mean you must be able to independently calculate the incremental "annoying-ness" of a sentence without reference to the history.
When BEAM-23 (cross-bundle mutable per-key-and-window state for ParDo) is implemented, you will be able to manually maintain the state and implement this sort of "only send output when the result changes" logic yourself.
However, I think this scenario likely deserves explicit consideration in the model. It blends the concepts embodied today by triggers and the accumulation mode.
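The "only send output when the result changes" logic itself is small. The plain-Java sketch below is not Beam code: a HashMap stands in for the per-key state that ParDo does not yet offer, and the class and method names are hypothetical. It remembers the last value seen per user and reports whether a new value should be emitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Suppresses redundant per-key updates; hypothetical names, in-memory stand-in for per-key state. */
final class ChangeOnlyEmitter {
    // Last value seen per key.
    private final Map<String, Boolean> lastValue = new HashMap<>();

    /** Returns true if this (userId, isAnnoying) update differs from the last one seen for the user. */
    boolean shouldEmit(String userId, boolean isAnnoying) {
        Boolean previous = lastValue.put(userId, isAnnoying);
        // Emit the first value for a key, and afterwards only when the value flips.
        return previous == null || previous != isAnnoying;
    }

    public static void main(String[] args) {
        ChangeOnlyEmitter emitter = new ChangeOnlyEmitter();
        System.out.println(emitter.shouldEmit("u1", false)); // true: first value for u1
        System.out.println(emitter.shouldEmit("u1", false)); // false: unchanged
        System.out.println(emitter.shouldEmit("u1", true));  // true: value changed
    }
}
```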

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time I know that all new data items coming with some IDs need to be discarded, and they represent a significant portion of the data (up to 40%).
Right now I have two batch pipelines:
First one just runs the rating over the data.
The second one first filters out the corrupted data and runs the analysis.
I would like to switch from batch mode (say, running every day) into an online processing mode (hope to get a delay < 10 minutes).
The second pipeline uses a global window which makes processing easy. When the corrupted data key is detected, all other records are simply discarded (also using the discarded keys from previous days as a pre-filter is easy). Additionally it makes it easier to make decisions about the output data as during the processing all historic data for a given key is available.
The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate the quality rates given to each session window I process, and if the rate sum is over X, a filter function in an earlier stage of the pipeline should filter out the malicious keys.
I know about side inputs, but I don't know whether they can change at runtime.
I'm aware that a DAG by definition cannot have a cycle, but how can I achieve the same result without one?
One idea that comes to mind is to use a side output to mark an ID as malicious and create a fake unbounded output/input. The output would dump the data to some storage, and the input would load it every hour and stream it so it can be joined.
Side inputs in the Beam programming model are windowed.
So you were on the right path: it seems reasonable to have a pipeline structured as two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as a main input, and filtering the data according to the model. This second part of the pipeline will get the model for the matching window, which seems to be exactly what you want.
In fact, this is one of the main examples in the MillWheel paper (page 2), upon which Dataflow's streaming runner is based.
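The second part of that structure can be modeled in a few lines of plain Java (not Beam code). Here a map from window start to a set of flagged IDs plays the role of the windowed side input; the class name, the {timestamp, id} record layout, and the map itself are assumptions made for illustration. Each record is checked only against the model for its own window.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Filters records using a per-window model, mimicking a windowed side input; hypothetical names. */
final class WindowedSideInputFilter {
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    /** records[i] = {timestampMillis, id}; badIdsByWindow maps window start to the IDs flagged in that window. */
    static List<long[]> filter(List<long[]> records, Map<Long, Set<Long>> badIdsByWindow,
                               long windowSizeMillis) {
        List<long[]> kept = new ArrayList<>();
        for (long[] record : records) {
            // Look up the model for the window that the record itself falls into.
            Set<Long> bad = badIdsByWindow.getOrDefault(
                windowStart(record[0], windowSizeMillis), Collections.emptySet());
            if (!bad.contains(record[1])) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Note that in this sketch an ID flagged in one window is not filtered in later windows unless the model for those windows also contains it; how the model is windowed determines that behavior.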

Why did #sideInput() method move from Context to ProcessContext in Dataflow beta

I wonder why the #sideInput() method has moved to the ProcessContext class.
Previously I could do some additional processing in the #startBundle() method and cache the result.
Doing that in #processElement() sounds less efficient. Of course I could do the preprocessing before passing the data to the view, but there still is the overhead of calling #sideInput() for each element...
Thanks,
G
Great question. The reason is that we added support for windowed PCollections as side inputs. This enables additional scenarios, including using side inputs with unbounded PCollections in streaming mode.
Before the change, we only supported side inputs that were globally windowed, and the entire side input PCollection was available while processing every element of the main input PCollection. This works fine for bounded PCollections in traditional batch-style processing, but didn't extend to windowed or unbounded PCollections.
After the change, the window of the current element you are processing in your ParDo controls what subset of the side input is visible. (And so you can't access side inputs in startBundle(), where there is no current element and hence no current window.)
For example, consider a streaming pipeline that processes your website logs and provides real-time updates to a live usage dashboard. You've got two unbounded input PCollections: one contains new user signups and the other contains user clicks. You can identify which user clicks come from new users by windowing both PCollections by hour and doing a ParDo over the user clicks that takes new user signups as a side input. Now when you process a user click which is in a given hour, you automatically see just the subset of the new user signups from the same hour. You can do different variants on this by changing the windowing functions and moving element timestamps forward in time on the side input -- like continuing to window the user clicks per hour, but using the new signups from the last 24 hours.
I do agree this change makes it harder to cache any postprocessing on your side input. We added View.asMultimap to handle a common case where you turn the Iterable into a lookup table. If your post-processing is element-wise, you can do it with a ParDo before creating the PCollectionView. For anything else right now, I'd recommend doing it lazily from within processElement. I'd be interested in hearing about other patterns that occur, so we can work on ways to make them more efficient.
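One way the lazy approach might be structured, shown here as a plain-Java sketch rather than Beam code: keep a small cache keyed by window, and run the expensive post-processing only the first time an element from that window is seen. The class name and the idea of keying the cache by window are assumptions, not something the SDK provides, and a real implementation would also need to evict entries for windows that are finished.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Caches a post-processed side-input value per window; hypothetical helper, no eviction. */
final class LazyPerWindowCache<W, R> {
    private final Map<W, R> cache = new HashMap<>();
    private int computations = 0;

    /** Returns the cached result for the window, computing it on first use only. */
    R get(W window, Function<W, R> compute) {
        return cache.computeIfAbsent(window, w -> {
            computations++; // counts how often the expensive step actually ran
            return compute.apply(w);
        });
    }

    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        LazyPerWindowCache<Long, String> cache = new LazyPerWindowCache<>();
        cache.get(0L, w -> "lookup table for window " + w);
        cache.get(0L, w -> "lookup table for window " + w); // served from the cache
        System.out.println(cache.computations()); // 1
    }
}
```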
