Calculating periodic checkpoints for an unbounded stream in Apache Beam/DataFlow - google-cloud-dataflow

I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combine all the outputs globally to find the minimum value. However, it is not possible to use a Global combine operation because a either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that a the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger it's timer.

I'm a newbie of the Beam, but according to this blog https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, Splittable DoFn might be the thing you are looking for!
You could create an SDF to fetch the stream and accept the input element as the start point.

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...)) // extends UnboundedSource
.apply(Window.into(FixedWindows.of(Duration.millis(5000))))
.apply(new SomeTransform())
.apply(Combine.globally(new SomeCombineFn()).withoutDefaults())
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5 second windows to your data is not a way to prescribe how the data should be processed, only the end result of aggregations for that processing. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the beam model doesn't specify any specific processing behavior so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to stream in elements slowly over a long period of time you will likely see more processing in between reading from the source.

Streaming Beam pipeline with "lookbehind"

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when the send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where the offset was the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? That this pipeline would need to maintain and scan an unbounded-ish history of count records is the part that seems maybe unsuited to Beam.
One possible approach would be to model this as two timeseries (left , right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
In order to achieve this unbounded, you will need to be working within a GlobalWindow. Important note in the Global Window there is no expiry of the state, so you will need to make sure to do Garbage Collection on your left and right streams. Also data will arrive in the onprocess unordered, so you will need to make use of Event Time timers to do the actual work.
Very roughly:
onProcess(){
Store data in BagState.
Setup Event time timer to go off
}
OnTimer(){
Do your buiss logic.
}
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam summit also has some pointers (but does not make use of OrderedListState, which was not available at the time);
State and Timer API and Timeseries

Count distinct values in a stream pipeline

I have a pipeline that looks like
pipeline.apply(PubsubIO.read.subscription("some subscription"))
.apply(Window.into(SlidingWindow.of(10 mins).every(20 seconds)
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(20 seconds))
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()))
.apply(RemoveDuplicates.create())
.apply(Window.discardingFiredPanes()) // this is suggested in the warnings under https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes
.apply(Count.<String>globally().withoutDefaults())
This pipeline overcounts distinct values significantly (20x normal value). Initially, I was suspecting that the default trigger may have caused this issue. I have tweaked to use triggers that allow no lateness/discard fired panes/use processing time, all of which have similar overcount issues.
I've also tried ApproximateUnique.globally: it failed during pipeline construction because of an exception that looks like
Default values are not supported in Combine.globally() if the output PCollection is not windowed by GlobalWindows. There seems no way to add withoutDefaults to it (like we did with Count.globally).
Is there a recommended way to do COUNT(DISTINCT) in dataflow/beam streaming pipeline with reasonable precision?
P.S. I'm using Java Dataflow SDK 1.9.0.
Your code looks OK; it shouldn't overcount. Note that you are placing each element into 30 windows, so if you have a window-unaware sink (equivalent to collapsing all the sliding windows) you would expect precisely 30 times as many elements. If you could show a bit more of the pipeline or how you are observing the counts, that might help.
Beyond that, I have a few suggestions for the pipeline:
I suggest changing your trigger for RemoveDuplicates to AfterPane.elementCountAtLeast(1); this will get you the same result at lower latency, since later elements arriving will have no impact. This trigger, and your current trigger, will never fire repeatedly. So it does not actually matter whether you set accumulatingFiredPanes() or discardingFiredPanes(). This is good, because neither one would work with the rest of your pipeline.
I'd install a new trigger prior to the Count. The reason is a bit technical, but I'll try to describe it:
In your current pipeline, the trigger installed there (the "continuation trigger" of the trigger for RemoveDuplicates) notes the arrival time of the first element and waits until it has received all elements that were produced at or before that processing time, as measured by the upstream worker. There is some nondeterminism because it puns the local processing time and the processing time of other workers.
If you take my advice and switch the trigger for RemoveDuplicates, then the continuation trigger will be AfterPane.elementCountAtLeast(1) so it will always emit a count as soon as possible and then discard further data, which is very wrong.

Serializing Flux in Reactor

Is possible to serialize Reactor Flux. For example my Flux is in some state and is currently processing some event. And suddenly service is terminated. Current state of Flux is saved to database or to file. And then on restart of aplication I just take all Flux from that file/table and subscribe them to restart processing from last state. This is possible in reactor?
No, this is not possible. Flux are not serializable and are closer to a chain of functions, they don't necessarily have a state[1] but describe what to do given an input (provided by an initial generating Flux)...
So in order to "restart" a Flux, you'd have to actually create a new one that gets fed the remaining input the original one would have received upon service termination.
Thus it would be more up to the source of your data to save the last emitted state and allow restarting a new Flux sequence from there.
[1] Although, depending on what operators you chained in, you could have it impact some external state. In that case things will get more complicated, as you'll have to also persist that state.

Google Dataflow late data

I have been reading the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.
This page:
https://cloud.google.com/dataflow/model/windowing
indicates that if you use the default window/trigger strategies, then late data will be discarded:
Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.
Yet this page:
https://cloud.google.com/dataflow/model/triggers
indicates that late data will be emitted as a single element PCollection when it arrives late:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.
So, will late data past the watermark be discarded completely? Or, will it only not be emitted with the other data it would have been windowed with had it arrived in time, and be emitted on its own instead?
The default "windowing and trigger strategies" discard late data. The WindowingStrategy is an object which consists of windowing, triggering, and a few other parameters such as allowed lateness. The default allowed lateness is 0, so any late data elements are discarded.
The default trigger handles late data. If you take the default WindowingStrategy and change only the allowed lateness, then you will receive a PCollection which contains one output pane for all of the on time data, and then a new output pane for approximately every late element.

Resources