Google Dataflow late data - google-cloud-dataflow

I have been reading the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.
This page:
https://cloud.google.com/dataflow/model/windowing
indicates that if you use the default window/trigger strategies, then late data will be discarded:
Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.
Yet this page:
https://cloud.google.com/dataflow/model/triggers
indicates that late data will be emitted as a single element PCollection when it arrives late:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.
So, will late data past the watermark be discarded completely? Or, will it only not be emitted with the other data it would have been windowed with had it arrived in time, and be emitted on its own instead?

The default "windowing and trigger strategies" discard late data. The WindowingStrategy is an object which consists of windowing, triggering, and a few other parameters such as allowed lateness. The default allowed lateness is 0, so any late data elements are discarded.
The default trigger handles late data. If you take the default WindowingStrategy and change only the allowed lateness, you will receive a PCollection that contains one output pane for all of the on-time data, and then a new output pane for approximately every late element.
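As a sketch of that change, assuming fixed one-minute windows over a PCollection<String> named input (the names and durations are illustrative, not from the question):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Keep the default event-time trigger, but allow data up to two hours late.
// Each late element then fires an additional pane instead of being dropped.
PCollection<String> windowed = input.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardHours(2))
        .accumulatingFiredPanes());
```

With accumulatingFiredPanes, each late pane repeats the previously emitted elements plus the late one; use discardingFiredPanes if downstream consumers should see only the new elements.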

Related

Dataflow to process late and out-of-order data for batch and stream messages?

My company receives both batch and stream based event data. I want to process the data using Google Cloud Dataflow over a predictable time period. However, I realize that in some instances the data arrives late or out of order. How can I use Dataflow to handle late or out-of-order data?
This is a homework question, and I would like to know which single answer below is correct.
a. Set a single global window to capture all data
b. Set sliding window to capture all the lagged data
c. Use watermark and timestamps to capture the lagged data
d. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
My reasoning: I believe 'C' is the answer. But then, the watermark is actually different from late data. Please confirm. Also, since the question mentions both batch- and stream-based data, I wonder whether 'D' could be the answer, since 'batch' (or bounded collection) mode doesn't have timestamps unless they come from the source or are set programmatically. So I am a bit confused about the answer.
Please help here. I am a non-native English speaker, so I'm not sure whether I missed some cues in the question.
How to use Dataflow to handle late or out of order
This is a big question. I will try to give some simple explanations but provide some resources that might help you understand.
Bounded data collection
You have gotten a sense of it: bounded data does not have a lateness problem. By its nature, a bounded data set can be read in full before the pipeline starts.
Unbounded data collection
Your C is correct, and the watermark is different from late data. In implementation, the watermark is a monotonically increasing timestamp. When Beam/Dataflow sees a record with an event timestamp that is earlier than the watermark, the record is treated as late data (this is only conceptual; you might want to check [1] for a detailed discussion).
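The rule above can be stated in one line of code. This is a conceptual illustration only; real runners track watermarks per stage and per source:

```java
public class WatermarkExample {
    // An element is "late" when its event timestamp is earlier than the
    // current watermark (both expressed here as epoch milliseconds).
    static boolean isLate(long eventTimestampMillis, long watermarkMillis) {
        return eventTimestampMillis < watermarkMillis;
    }
}
```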
Here are [2], [3], [4] as reference for this topic:
https://docs.google.com/document/d/12r7frmxNickxB5tbpuEh_n35_IJeVZn1peOrBrhhP6Y/edit#heading=h.7a03n7d5mf6g
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
https://www.oreilly.com/library/view/streaming-systems/9781491983867/
https://docs.google.com/presentation/d/1ln5KndBTiskEOGa1QmYSCq16YWO9Dtmj7ZwzjU7SsW4/edit#slide=id.g19b6635698_3_4
B and C may be the answer.
With sliding windows, you have the order of the data, so if you receive the data in position 9 but not the data in position 8, you know that data 8 is delayed and can wait for it. The problem is that if the latest data is delayed, you can't know it is delayed, and you lose it. https://en.wikipedia.org/wiki/Sliding_window_protocol
With a watermark, you wait a period of time for the lagged data; if that time passes and the data still hasn't arrived, you lose it.
So the answer is C, because B says "capture all the lagged data" and C omits the word "all".

Apache BEAM pipeline IllegalArgumentException - Timestamp skew?

I have an existing BEAM pipeline that is handling the data ingested (from Google Pubsub topic) by 2 routes. The 'hot' path does some basic transformation and stores them in Datastore, while the 'cold' path performs fixed hourly windowing for deeper analysis before storage.
So far the pipeline has been running fine, until I started to do some local buffering on the data before publishing to Pubsub (so data arriving at Pubsub may be a few hours 'late'). The error that gets thrown is as below:
java.lang.IllegalArgumentException: Cannot output with timestamp 2018-06-19T14:00:56.862Z. Output timestamps must be no earlier than the timestamp of the current input (2018-06-19T14:01:01.862Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.checkTimestamp(SimpleDoFnRunner.java:463)
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.outputWithTimestamp(SimpleDoFnRunner.java:429)
at org.apache.beam.sdk.transforms.WithTimestamps$AddTimestampsDoFn.processElement(WithTimestamps.java:138)
It seems to be referencing the section of my code (withTimestamps method) that performs the hourly windowing as below:
Window<KV<String, Data>> window = Window.<KV<String, Data>>into(
        FixedWindows.of(Duration.standardHours(1)))
    .triggering(Repeatedly.forever(pastEndOfWindow()))
    .withAllowedLateness(Duration.standardSeconds(10))
    .discardingFiredPanes();

PCollection<KV<String, List<Data>>> keyToDataList = eData
    .apply("Add Event Timestamp", WithTimestamps.of(new EventTimestampFunction()))
    .apply("Windowing", window)
    .apply("Group by Key", GroupByKey.create())
    .apply("Sort by date", ParDo.of(new SortDataFn()));
I'm not sure if I understand exactly what I've done wrong here. Is it because the data is arriving late that is throwing the error? As I understand, if the data arrives late past the allowed lateness, it should be discarded and not throw an error like the one I'm seeing.
Wondering if setting an unlimited timestampSkew will resolve this? The data that's late can be exempt from analysis; I just need to ensure that errors don't get thrown that will choke the pipeline. There's also nowhere else where I'm adding/changing the timestamps for the data, so I'm not sure why the errors are thrown.
It looks like your DoFn is using outputWithTimestamp, and you are trying to set a timestamp which is older than the input element's timestamp. Typically, the timestamps of output elements are derived from the inputs; this is important for ensuring the correctness of the watermark computation.
You may be able to work around this by increasing both the timestamp skew and the windowing allowed lateness; however, some data may be lost, and it is for you to determine whether such loss is acceptable in your scenario.
Another alternative is not to use outputWithTimestamp, and instead use the PubSub message timestamp to process each message. Then, output each element as a KV, where the RealTimestamp is computed in the same way you are currently computing the timestamp (just don't use it in WithTimestamps), GroupByKey, and write the KVs to Datastore.
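For reference, a minimal sketch of the skew workaround. The DoFn name, the Data type, and the extractEventTime helper are placeholders from this question, and the six-hour value is illustrative; note that overriding getAllowedTimestampSkew() is deprecated in recent Beam releases precisely because skewed output timestamps can make elements late or droppable:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Duration;

static class AddEventTimestampFn extends DoFn<Data, Data> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Move the element back to its event time, which may be hours behind
    // the Pub/Sub publish time used as the input timestamp.
    c.outputWithTimestamp(c.element(), extractEventTime(c.element()));
  }

  @Override
  public Duration getAllowedTimestampSkew() {
    // Tolerate output timestamps up to six hours behind the input.
    return Duration.standardHours(6);
  }
}
```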
Other questions you can ask yourself are:
Why are the input elements associated with a more recent timestamp than the output elements?
Do you really need to buffer that much data before publishing to PubSub?

Calculating periodic checkpoints for an unbounded stream in Apache Beam/DataFlow

I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as output when the timer triggers, and then combining all the outputs globally to find the minimum value. However, it is not possible to use a global Combine operation, because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
    .apply("StatefulTransform", ParDo.of(new StatefulTransform()))
    .apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(100),
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))))
    .apply(Combine.globally(new MinLongFn()))
    .apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog post, https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, a Splittable DoFn might be what you are looking for!
You could create an SDF that fetches the stream and accepts the input element as the starting point.

GroupByKey returns no elements in Google Cloud Dataflow

I'm new to Dataflow, so this is probably an easy question.
I want to try out the Sessions windowing strategy. According to the windowing documentation, windowing is not applied until we've done a GroupByKey, so I'm trying to do that.
However, when I look at my pipeline in Google Cloud Platform, I can see that MapElements returns elements, but no elements are returned by GroupByKey ("Elements Added: -"). What am I doing wrong when grouping by key?
Here's the most relevant part of the code:
events = events.apply(Window.named("eventsSessionsWindowing")
    .<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3))));

PCollection<KV<String, MyEvent>> eventsKV = events
    .apply(MapElements
        .via((MyEvent e) -> KV.of(ExtractKey(e), e))
        .withOutputType(new TypeDescriptor<KV<String, MyEvent>>() {}));

PCollection<KV<String, Iterable<MyEvent>>> eventsGrouped =
    eventsKV.apply(GroupByKey.<String, MyEvent>create());
A GroupByKey fires according to a triggering strategy, which determines when the system thinks that all data for this key/window has been received and it's time to group it and pass to downstream transforms. The default strategy is:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window.
Please see Default Trigger for details. You were seeing a delay of a couple of minutes that corresponded to the progression of PubSub's watermark.
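If waiting for the watermark is too slow (for example, while debugging), you can add speculative early firings so the GroupByKey emits provisional results sooner. A sketch assuming the same session windowing as in the question; the 30-second delay and the choice of accumulation mode are illustrative, not requirements:

```java
events = events.apply(Window.named("eventsSessionsWindowing")
    .<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(30))))
    .withAllowedLateness(Duration.ZERO)
    .accumulatingFiredPanes());
```

Early panes will contain partial groups, so downstream transforms must tolerate seeing the same key more than once per window.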

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally to get a singleton per window; I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which then emits nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this would be cost-prohibitive for many cases. For a use case like yours, where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
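A sketch of that workaround. The names mainValues and heartbeats are placeholders, and heartbeats is assumed to already contain exactly one zero per window with an appropriate timestamp:

```java
// Flatten the real per-record integers with a heartbeat of one zero per
// window, so Sum.integersGlobally().withoutDefaults() always has input
// and therefore emits a result even for otherwise-empty windows.
PCollection<Integer> withHeartbeat = PCollectionList
    .of(mainValues)   // the real data, one Integer per record
    .and(heartbeats)  // exactly one 0 per window (assumed to exist)
    .apply(Flatten.pCollections());

PCollection<Integer> perWindowTotals = withHeartbeat
    .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Sum.integersGlobally().withoutDefaults());
```

Since the heartbeat contributes zero to each sum, it changes no totals; it only guarantees that every window has at least one element.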
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.