Count distinct values in a stream pipeline - google-cloud-dataflow

I have a pipeline that looks like
pipeline
    .apply(PubsubIO.Read.subscription("some subscription"))
    .apply(Window.<String>into(SlidingWindows.of(Duration.standardMinutes(10))
            .every(Duration.standardSeconds(20)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(20)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(RemoveDuplicates.<String>create())
    .apply(Window.discardingFiredPanes()) // this is suggested in the warnings under https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes
    .apply(Count.<String>globally().withoutDefaults());
This pipeline overcounts distinct values significantly (about 20x the expected value). Initially I suspected that the default trigger was causing the issue, so I tweaked it to use triggers that allow no lateness, discard fired panes, or use processing time; all of them show similar overcounting.
I've also tried ApproximateUnique.globally: it failed during pipeline construction with an exception that looks like
Default values are not supported in Combine.globally() if the output PCollection is not windowed by GlobalWindows. There seems to be no way to add withoutDefaults to it (as we did with Count.globally).
Is there a recommended way to do COUNT(DISTINCT) in a Dataflow/Beam streaming pipeline with reasonable precision?
P.S. I'm using Java Dataflow SDK 1.9.0.

Your code looks OK; it shouldn't overcount. Note that you are placing each element into 30 windows (a 10-minute window sliding every 20 seconds means 600 s / 20 s = 30 overlapping windows per element), so if you have a window-unaware sink (equivalent to collapsing all the sliding windows) you would expect precisely 30 times as many elements. If you could show a bit more of the pipeline or how you are observing the counts, that might help.
Beyond that, I have a few suggestions for the pipeline:
I suggest changing your trigger for RemoveDuplicates to AfterPane.elementCountAtLeast(1); this will get you the same result at lower latency, since later elements arriving will have no impact. This trigger, and your current trigger, will never fire repeatedly. So it does not actually matter whether you set accumulatingFiredPanes() or discardingFiredPanes(). This is good, because neither one would work with the rest of your pipeline.
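In code, the windowing ahead of RemoveDuplicates might then look roughly like this (my sketch against the 1.9.0-era API; discardingFiredPanes() is an arbitrary choice here, since as noted the accumulation mode does not matter for a trigger that never fires repeatedly):
.apply(Window.<String>into(SlidingWindows.of(Duration.standardMinutes(10))
        .every(Duration.standardSeconds(20)))
    // Fire as soon as the first element for a pane arrives; duplicates
    // arriving later have no effect, so there is no reason to wait.
    .triggering(AfterPane.elementCountAtLeast(1))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
.apply(RemoveDuplicates.<String>create())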
I'd install a new trigger prior to the Count. The reason is a bit technical, but I'll try to describe it:
In your current pipeline, the trigger installed there (the "continuation trigger" of the trigger for RemoveDuplicates) notes the arrival time of the first element and waits until it has received all elements that were produced at or before that processing time, as measured by the upstream worker. There is some nondeterminism because it conflates the local processing time with the processing time of other workers.
If you take my advice and switch the trigger for RemoveDuplicates, then the continuation trigger will be AfterPane.elementCountAtLeast(1), so it will always emit a count as soon as possible and then discard further data, which is very wrong.
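To make that concrete, here is one possible shape for the stretch between RemoveDuplicates and Count. The specific trigger to install is my assumption rather than something stated above, and I'm assuming the 1.x Window transform's trigger-only factory (as used on the doc page linked in the question). The point is just that it fires repeatedly and accumulates, so each firing re-counts everything deduplicated so far, and downstream consumers should treat later panes as replacements:
.apply(RemoveDuplicates.<String>create())
// Install a fresh trigger before counting instead of relying on the
// continuation trigger of the RemoveDuplicates step.
.apply(Window.<String>triggering(
        Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(20))))
    .withAllowedLateness(Duration.ZERO)
    // Accumulate so each pane carries the full deduplicated contents of the window.
    .accumulatingFiredPanes())
.apply(Count.<String>globally().withoutDefaults());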

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline
    .apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Do Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where items are stored before being passed on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data is not a way to prescribe how the data should be processed, only how the end results of aggregations over that data are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way: the Beam model doesn't specify any specific processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to emit elements slowly over a long period of time, you would likely see more processing in between reads from the source.
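One way to observe this (a debugging sketch reusing the names from the question and assuming the source emits Strings) is to log the window each element was assigned to. Elements reach this ParDo as soon as the source emits them, even though their window metadata is already attached, because nothing is actually grouped until the Combine:
pipeline
    .apply(Read.from(new SomeUnboundedSource(...)))
    .apply(Window.<String>into(FixedWindows.of(Duration.millis(5000))))
    .apply("LogWindows", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        // Runs immediately per element; the window is just metadata at this point.
        System.out.println(c.element() + " is in " + window);
        c.output(c.element());
      }
    }))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());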

Beam CoGroupByKey with fixed window and event time based trigger generates random elements

I have a pipeline in Beam that uses CoGroupByKey to combine 2 PCollections: the first one reads from a Pub/Sub subscription, and the second one is derived from the same PCollection but enriches the data by looking up additional information from a table using JdbcIO.readAll. So there is no way there would be data in the second PCollection without it being there in the first one.
There is a fixed window of 10 seconds with an event-time based trigger like the one below:
Repeatedly.forever(
    AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(40)))
        .withLateFirings(AfterPane.elementCountAtLeast(1)));
The issue I am seeing is that when I stop the pipeline using the Drain mode, it seems to be randomly generating elements for the second PCollection when there have not been any messages coming in to the input Pub/Sub topic. This also happens occasionally while the pipeline is running, but not consistently; when draining the pipeline I have been able to reproduce it consistently.
Please find the variation in input vs. output below:
You are using a non-deterministic triggering, which means the output is sensitive to the exact ordering in which events come in. Another way to look at this is that CoGBK does not wait for both sides to come in; the trigger starts ticking as soon as either side comes in.
For example, let's call your PCollections A and A' respectively, and assume they each have two elements: a1, a2, a1', and a2' (of common provenance).
Suppose a1 and a1' come into the CoGBK, 39 seconds pass, and then a2 comes in (on the same key); another 2 seconds pass, then a2' comes in. The CoGBK will output ([a1, a2], [a1']) when the 40-second mark hits, and then when the window closes ([], [a2']) will get emitted. (Even if everything is on the same key, this could happen occasionally if there is more than a 40-second walltime delay going through the longer path, and it will almost certainly happen for any late data, since each side will fire separately.)
Draining makes things worse, e.g. I think all processing time triggers fire immediately.
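If those partial and duplicate panes are a problem downstream, one option (my suggestion, not something stated above) is to drop the early and late firings and fire once per window at the watermark, trading latency for deterministic output; the element type below is just a placeholder:
Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(10)))
    // A single on-time firing per window: both sides of the CoGroupByKey are
    // complete (up to the watermark) when the pane is emitted.
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO) // late data is dropped rather than fired separately
    .discardingFiredPanes()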

Calculating periodic checkpoints for an unbounded stream in Apache Beam/DataFlow

I am using a global unbounded stream in combination with stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described in the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combine all the outputs globally to find the minimum value. However, it is not possible to use a global Combine operation because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
    .apply("StatefulTransform", ParDo.of(new StatefulTransform()))
    .apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(100),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
    .apply(Combine.globally(new MinLongFn()))
    .apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog post, https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, a Splittable DoFn might be the thing you are looking for!
You could create an SDF to fetch the stream and accept the input element as the start point.
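For a feel of what that might look like, here is a very rough sketch in the newer Beam SDF style. StreamClient and fetchAt are hypothetical placeholders for your Kinesis-reading logic, a real unbounded SDF would also want a watermark estimator, and the exact annotations vary by SDK version; treat it as the shape of the idea, not working code:
// Reads forever starting from the checkpoint timestamp it receives as input.
@DoFn.UnboundedPerElement
class ReadFromCheckpointFn extends DoFn<Long, String> {

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element Long checkpoint) {
    // Start at the persisted checkpoint and never end.
    return new OffsetRange(checkpoint, Long.MAX_VALUE);
  }

  @ProcessElement
  public ProcessContinuation process(
      RestrictionTracker<OffsetRange, Long> tracker, OutputReceiver<String> out) {
    long position = tracker.currentRestriction().getFrom();
    while (true) {
      List<String> records = StreamClient.fetchAt(position); // hypothetical client
      if (records.isEmpty()) {
        // Nothing new yet: checkpoint and ask the runner to resume later.
        return ProcessContinuation.resume().withResumeDelay(Duration.standardSeconds(10));
      }
      if (!tracker.tryClaim(position)) {
        // The runner has split off the rest of the restriction.
        return ProcessContinuation.stop();
      }
      for (String record : records) {
        out.output(record);
      }
      position++;
    }
  }
}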

Cloud Dataflow window trigger overwrites value from closed window

I'm writing a Dataflow pipeline (Beam SDK 2.0.0) that reads from Pub/Sub, counts the elements in a window, and then stores the counts in Bigtable as a time series. The windows are fixed with a duration of 1 minute.
My intention was to update the value of the current window every second using a trigger in order to get real-time updates on the current time window.
But that doesn't seem to work. The value gets correctly updated every second, but once Dataflow starts working on the next minute the previous one is overwritten with zero. So basically only my last value is correct; all the rest are zero.
Pipeline pipeline = Pipeline.create(options);

PCollection<String> live = pipeline
    .apply("Read from PubSub", PubsubIO.readStrings()
        .fromSubscription("projects/..."))
    .apply("Window per minute", Window
        .<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(1)))
            .orFinally(AfterWatermark.pastEndOfWindow()))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO));
I've tried playing with the trigger code but nothing helps. My only option right now is to remove the entire .triggering block. Has anyone experienced similar behaviour?
After I reported my issue to Google, they discovered some issues in the Beam SDK that are causing this. More details at these links:
When EOW and GC timers fire together (nonzero allowed lateness) we fail to note that it is the final pane: https://issues.apache.org/jira/browse/BEAM-2505
Processing time timers are not ignored properly if they come in with the GC timer: https://issues.apache.org/jira/browse/BEAM-2502
Processing time timers are just interpreted as GC timers, completely incorrectly comparing timestamps from different time domains: https://issues.apache.org/jira/browse/BEAM-2504

Total aggregate over an unbounded stream in Dataflow

A number of examples show aggregation over windows of an unbounded stream, but suppose we need to get a count-per-key of the entire stream seen up to some point in time. (Think word count that emits totals for everything seen so far rather than totals for each window.)
It seems like this could be a Combine.perKey and a trigger to emit panes at some interval. In this case the window is essentially global, and we emit panes for that same window throughout the life of the job. Is this safe/reasonable, or perhaps there is another way to compute a rolling, total aggregate?
Ryan, your solution of using a global window and a periodic trigger is the recommended approach. Just make sure you use accumulating mode on the trigger and not discarding mode. The Triggers page should have more information.
Let us know if you need additional help.
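For reference, a minimal sketch of that shape for the word-count flavour of the question (the one-minute firing interval and the String element type are my own choices):
PCollection<KV<String, Long>> runningTotals = words  // PCollection<String> of words
    .apply(Window.<String>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        // Accumulating mode, as recommended: each pane re-emits the running
        // total over everything seen so far, not just the delta since the last firing.
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
    .apply(Count.<String>perElement());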
