Streaming Beam pipeline with "lookbehind" - google-cloud-dataflow

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when they send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where offset is the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? The part that seems possibly unsuited to Beam is that this pipeline would need to maintain and scan an unbounded-ish history of count records.

One possible approach would be to model this as two timeseries (left, right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
In order to achieve this over an unbounded stream, you will need to work within a GlobalWindow. Important note: in the GlobalWindow there is no expiry of state, so you will need to make sure to do garbage collection on your left and right streams. Also, data will arrive at @ProcessElement unordered, so you will need to make use of event-time timers to do the actual work.
Very roughly:
@ProcessElement
processElement() {
  // Store the data in BagState.
  // Set an event-time timer to go off once the watermark passes.
}

@OnTimer
onTimer() {
  // Do your business logic over the buffered data.
}
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
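To make that skeleton more concrete, here is a rough Java sketch of the pattern using OrderedListState (Beam 2.24.0+). The element shape (message type mapped to an (offset, count) pair), the state and timer names, and the flush logic are assumptions for illustration rather than code from the question; a real pipeline would also fold in the consumer-watermark stream before emitting:

import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.OrderedListState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Instant;

// Input: message type -> (offset, count); output: message type -> summed count.
class BufferAndFlushFn extends DoFn<KV<String, KV<Long, Long>>, KV<String, Long>> {

  // Counts buffered in event-time order so the timer callback can scan them cheaply.
  @StateId("counts")
  private final StateSpec<OrderedListState<KV<Long, Long>>> countsSpec =
      StateSpecs.orderedList(KvCoder.of(VarLongCoder.of(), VarLongCoder.of()));

  // Remember the key so it can be re-emitted from the timer callback.
  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      @Element KV<String, KV<Long, Long>> element,
      @Timestamp Instant timestamp,
      @StateId("counts") OrderedListState<KV<Long, Long>> counts,
      @StateId("key") ValueState<String> key,
      @TimerId("flush") Timer flush) {
    // Elements arrive unordered: buffer them and defer the work to an event-time timer.
    counts.add(TimestampedValue.of(element.getValue(), timestamp));
    key.write(element.getKey());
    flush.set(timestamp);
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext context,
      @StateId("counts") OrderedListState<KV<Long, Long>> counts,
      @StateId("key") ValueState<String> key) {
    // The watermark has passed the timer's timestamp, so everything buffered up to that
    // point is complete. Sum the buffered counts, emit, and garbage-collect what was read.
    Instant limit = context.timestamp().plus(1);
    long total = 0;
    for (TimestampedValue<KV<Long, Long>> record : counts.readRange(new Instant(0L), limit)) {
      total += record.getValue().getValue();
    }
    context.output(KV.of(key.read(), total));
    counts.clearRange(new Instant(0L), limit);
  }
}

The same structure works with BagState on older SDK versions; you just have to sort the buffered elements yourself inside the timer callback.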
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam Summit also has some pointers (but does not make use of OrderedListState, which was not available at the time):
State and Timer API and Timeseries

Related

Apache Beam pipeline metrics

I have a pipeline which runs on Google Dataflow and reads from Pub/Sub. I use metrics, mostly counters, to get an idea of how many events were processed.
The general idea is that it reads protobuf messages from Pub/Sub, deserializes them, and dumps them into different BigQuery tables.
I have a counter in the ParDo that does the deserialization and also a counter in the ParDo that does the inserts into BQ. Most of the time, the difference between the end counter and the start counter is big, around 40%. The wall time on the pipeline steps is always increasing, but the Data Freshness and System Latency for the pipeline are really low, around 30 seconds...
Why is this happening? Is it something related to counters?
You can see the size_ok metric, which is the first metric from the pipeline, and if you sum up all of the metrics below it, you don't even get close to this number.
My question is: why is there such a big difference between the size_ok metric, which is the total number of events that started to be processed, and the insert metrics? If you sum up all the insert metrics, there is no way they add up to the 6.6 million events shown by size_ok.
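For reference, this is roughly how such per-step counters are usually wired up in the Java SDK; the namespace, counter name, and the pass-through DoFn are placeholders for illustration, not the asker's actual code:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Pass-through DoFn that counts every element reaching this step; Dataflow
// aggregates the increments from all workers into the value shown in the UI.
class CountingFn<T> extends DoFn<T, T> {
  private final Counter seen;

  CountingFn(String counterName) {
    this.seen = Metrics.counter("pipeline", counterName);
  }

  @ProcessElement
  public void process(@Element T element, OutputReceiver<T> out) {
    seen.inc();
    out.output(element);
  }
}

A counter like this placed right after the Pub/Sub read and another placed right before the BigQuery write would produce the size_ok-style and insert-style numbers described above.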

Cloud Dataflow what is the exact definition of freshness and latency?

Problem:
When using Cloud Dataflow, we are presented with two metrics (see this page):
system latency
data freshness
These are also available in Stackdriver under the following names (extract from here):
system_lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
data_watermark_age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
But, these descriptions are still very vague:
what does "awaiting processing" mean? is how long a message waits in pubsub? or the total time it has to wait inside the pipeline?
the "maximum duration": after that maximum item is processed, will the metric be adjusted?
"time since event timestamp" does that mean if my event was put in pubsub at timestamp t1 and it flows out of one end of the pipeline at timestamp t2, the pipeline is at t1? I think I can assume that if the metric is at t1, everything before t1 can be assumed processed.
Question:
As these metrics coincide with the semantics of Apache Beam, I would love to see some examples, or at least clearer definitions of these metrics, to make them usable.
These metrics are notoriously tricky. An in-depth dive into how they work can be seen in this talk by a member of the Beam / Dataflow team.
Pipelines are split into a series of computations that occur in memory and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:
import apache_beam as beam
from apache_beam import Pipeline

with Pipeline() as p:
    (p
     | beam.io.ReadFromPubSub(...)
     | beam.Map(parse_data)
     | beam.Map(into_key_value_pairs)
     | beam.WindowInto(...)
     | beam.GroupByKey()
     | beam.Map(format_data)
     | beam.io.WriteToBigQuery(...))
This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.
The first stage goes from ReadFromPubSub to the GroupByKey operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey, the data needs to be written to persistent state (and therefore into a new source).
The second stage goes from GroupByKey to WriteToBigQuery. In this case, the data is read from a 'source'.
Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
--
Answering your questions:
What's awaiting processing?
Answer
It is how long an element waits in PubSub. Specifically, how long an element waits inside any source in the pipeline.
Consider a simpler pipeline:
ReadFromPubSub -> Map -> WriteToBigQuery.
This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> **Confirm to PubSub that the item has been consumed**.
Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.
This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
Does maximum duration get adjusted after processing?
Answer
That's right. For instance, consider the previous pipeline again: BQ is dead for 5 minutes. When BQ comes back, a large batch of items may be written to it, and confirmed as read from PubSub. This will drastically reduce the system latency (and data freshness) back to a few seconds.
What's time since event timestamp?
Answer
An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:
For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark could be held back if there is some input data that has not yet been processed.
This metric is, of course, heuristic. If some data point comes in very late, then the Data Freshness will be held back.
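As a concrete example of wiring up that attribute in the Java SDK (the subscription path, the attribute name, and the pipeline variable here are placeholders), you tell the Pub/Sub source which message attribute carries the event timestamp:

// Use the "event_time" message attribute as each element's event timestamp, so the
// data watermark is derived from event time rather than from Pub/Sub publish time.
PCollection<String> messages =
    pipeline.apply(
        PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-subscription")
            .withTimestampAttribute("event_time"));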
--
I'd advise you to check out the talk by Slava. It goes over all these concepts.

Calculating periodic checkpoints for an unbounded stream in Apache Beam/DataFlow

I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combining all the outputs globally to find the minimum value. However, it is not possible to use a global combine operation because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
 .apply("StatefulTransform", ParDo.of(new StatefulTransform()))
 .apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
     AfterPane.elementCountAtLeast(100),
     AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
 .apply(Combine.globally(new MinLongFn()))
 .apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog post https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, Splittable DoFn might be the thing you are looking for!
You could create an SDF that fetches the stream and accepts the input element as the starting point.

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time I know that all new data items coming with certain IDs need to be discarded, and they represent a significant portion of the data (up to 40%).
Right now I have two batch pipelines:
The first one just runs the rating over the data.
The second one first filters out the corrupted data and runs the analysis.
I would like to switch from batch mode (say, running every day) into an online processing mode (hope to get a delay < 10 minutes).
The second pipeline uses a global window, which makes processing easy. When a corrupted data key is detected, all further records for that key are simply discarded (and using the discarded keys from previous days as a pre-filter is also easy). Additionally, it makes it easier to make decisions about the output data, since all historic data for a given key is available during processing.
The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate the quality rates given to each session window I process, and if the rate sum is over X, a filter function in an earlier stage of the pipeline should filter out the malicious keys.
I know about side inputs, but I don't know if they can change during runtime.
I'm aware that a DAG by definition cannot have a cycle, but how can I achieve the same result without one?
The idea that comes to my mind is to use a side output to mark an ID as malicious and make a fake unbounded output/input. The output would dump the data to some storage, and the input would load it every hour and stream it so it can be joined.
Side inputs in the Beam programming model are windowed.
So you were on the right path: it seems reasonable to have a pipeline structured as two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as a main input, and filtering the data according to the model. This second part of the pipeline will get the model for the matching window, which seems to be exactly what you want.
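For illustration, here is a rough Java sketch of that two-part shape; the input collections, the window size, the score threshold, and the use of a summed score as the "model" are assumptions made up for this sketch, not something prescribed by the answer:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

class MaliciousKeyFilter {

  // ratings: (id, heuristic score) per record; events: (id, payload) to be filtered.
  static PCollection<KV<String, String>> filterMalicious(
      PCollection<KV<String, Double>> ratings,
      PCollection<KV<String, String>> events) {

    // Part 1: the "detection model" -- total score per id, recomputed every window.
    PCollectionView<Map<String, Double>> scoreById =
        ratings
            .apply(Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardMinutes(10))))
            .apply(Sum.doublesPerKey())
            .apply(View.asMap());

    // Part 2: filter the main input against the model for the matching window.
    return events
        .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply(
            ParDo.of(
                    new DoFn<KV<String, String>, KV<String, String>>() {
                      @ProcessElement
                      public void process(ProcessContext c) {
                        Double score = c.sideInput(scoreById).get(c.element().getKey());
                        if (score == null || score < 100.0) { // X = 100.0, illustrative
                          c.output(c.element());
                        }
                      }
                    })
                .withSideInputs(scoreById));
  }
}

Because the side input is re-read per window, the "model" can keep changing while the pipeline runs, which is the behavior the question asks about.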
In fact, this is one of the main examples in the Millwheel paper (page 2), upon which Dataflow's streaming runner is based.

Total aggregate over an unbounded stream in Dataflow

A number of examples show aggregation over windows of an unbounded stream, but suppose we need to get a count-per-key of the entire stream seen up to some point in time. (Think word count that emits totals for everything seen so far rather than totals for each window.)
It seems like this could be a Combine.perKey and a trigger to emit panes at some interval. In this case the window is essentially global, and we emit panes for that same window throughout the life of the job. Is this safe/reasonable, or perhaps there is another way to compute a rolling, total aggregate?
Ryan, your solution of using a global window and a periodic trigger is the recommended approach. Just make sure you use accumulating mode on the trigger, not discarding mode. The Triggers page should have more information.
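For example, a sketch assuming words is an unbounded PCollection<String> and a one-minute emission interval (both assumptions, not from the question):

// Global window with a repeated processing-time trigger in accumulating mode:
// every fired pane contains the running total of everything seen so far.
PCollection<KV<String, Long>> runningCounts =
    words.apply(
            Window.<String>into(new GlobalWindows())
                .triggering(
                    Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(1))))
                .accumulatingFiredPanes()
                .withAllowedLateness(Duration.ZERO))
        .apply(Count.perElement());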
Let us know if you need additional help.
