How can we implement a trigger which fires after a certain count in a PCollection - google-cloud-dataflow

I'm writing a Beam data pipeline that reads from an unbounded source such as Kafka. I am not performing any analytic functions. I would like to transform the elements and write them to the sink only after the record count of the PCollection reaches a certain threshold, in order to throttle the data being sent to the sink.
I looked at the existing triggers but couldn't figure out whether they are a good fit.

I have tested triggers and they work as expected. Here is a Scala code example:
val data: PCollection[Type] = results
  .apply(
    Window
      .into[Type](FixedWindows.of(Duration.millis(2000)))
      .withAllowedLateness(Duration.millis(1000))
      .triggering(AfterPane.elementCountAtLeast(4))
      .accumulatingFiredPanes()
  )
It waits for 4 elements and then fires the trigger for the window.
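For reference, here is a roughly equivalent sketch in the Beam Java SDK (assuming a PCollection<Type> named results, mirroring the Scala snippet above):
PCollection<Type> data = results.apply(
    Window.<Type>into(FixedWindows.of(Duration.millis(2000)))   // 2-second fixed windows
        .withAllowedLateness(Duration.millis(1000))
        .triggering(AfterPane.elementCountAtLeast(4))            // fire once at least 4 elements arrive
        .accumulatingFiredPanes());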

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data is not a way to prescribe how the data should be processed, only how the results of aggregations over that data are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the Beam model doesn't specify any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream so that elements arrive slowly over a long period of time, you would likely see more processing interleaved with reading from the source.
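To illustrate the point about aggregations, here is a hedged sketch that reuses the question's placeholders (SomeUnboundedSource is assumed to emit strings): the ParDo runs on each element as it arrives, regardless of the windowing declared upstream, and the window assignment only matters once the Combine aggregates per window.
PCollection<Long> perWindowCounts = pipeline
    .apply(Read.from(new SomeUnboundedSource(...)))
    .apply(Window.<String>into(FixedWindows.of(Duration.millis(5000))))
    // Applied element by element as data arrives; the window is just metadata here.
    .apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(@Element String e, OutputReceiver<String> out) {
        out.output(e);
      }
    }))
    // The windowing only takes effect here: one count is emitted per 5-second window.
    .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults());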

Apache Beam Streaming pipeline with sequential batches

What I am trying to do:
1. Consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline and the Dataflow runner.
2. Unmarshal the payload strings into objects. Assume 'messageId' is the unique id of an incoming message, e.g. msgid1, msgid2, etc.
3. Retrieve child records from a database for each object resulting from #2. The same child can be applicable to multiple messages. Assume 'childId' is the unique id of a child record, e.g. cid1234, cid1235, etc.
4. Group child records by their unique id as shown in the example below:
   KV.of(cid1234, Map.of(msgid1, msgid2)) and KV.of(cid1235, Map.of(msgid1, msgid2))
5. Write the grouped result at the childId level to the database.
Questions:
1. Where should the windowing be introduced? We currently have 30-minute fixed windowing after step #1.
2. How does Beam define the start and end time of a 30-minute window? Is it right after we start the pipeline, or after the first message of the batch?
3. What if steps 2 to 5 take more than 1 hour for a window and the next window's batch is ready? Would both window batches get processed in parallel?
4. How can we make the next window's messages wait until the previous window's batch is completed? If we don't do this, the result at the childId level will be overwritten by the next batches.
Code snippet:
PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
PubsubIO.readMessagesWithAttributes()
.fromSubscription("projects/project1/subscriptions/subscription1"));
PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
.of(Duration.standardMinutes(30))));
PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
ParDo.of(new JsonUnmarshallFn())
.withOutputTags(JsonUnmarshallFn.mainOutputTag,
TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));
PCollectionTuple childRecordsTuple = unmarshalResultTuple
.get(JsonUnmarshallFn.mainOutputTag)
.apply("FetchChildsFromDBAndProcess",
ParDo.of(new ChildsReadFn() )
.withOutputTags(ChildsReadFn.mainOutputTag,
TupleTagList.of(ChildsReadFn.deadLetterTag)));
// input is KV of (childId, msgids), output is mutations to write to BT
PCollectionTuple postProcessTuple = childRecordsTuple
.get(ChildsReadFn.mainOutputTag)
.apply(GroupByKey.create())
.apply("UpdateChildAssociations",
ParDo.of(new ChildsProcessorFn())
.withOutputTags(ChildsProcessorFn.mutations,
TupleTagList.of(ChildsProcessorFn.deadLetterTag)));
postProcessTuple.get(ChildsProcessorFn.mutations).apply(CloudBigtableIO.writeToTable(...));
Addressing each of your questions.
Regarding questions 1 and 2: when you use windowing within Apache Beam, you need to understand that the "windows exist before the job". What I mean is that the windows are aligned to the UNIX epoch (timestamp = 0). In other words, your data will be allocated to fixed time ranges that are anchored to the epoch, for example with fixed 60-second windows:
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));
First window: [0s, 60s), second window: [60s, 120s), and so on.
Please refer to the documentation for details.
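As a hedged illustration of that epoch alignment (the variable names below are just for the example): for a window of size S, an element with event timestamp t lands in the window [t - (t mod S), t - (t mod S) + S).
long windowSizeMillis = Duration.standardMinutes(30).getMillis();  // e.g. the question's 30-minute windows
long eventTimestampMillis = 1_600_000_123_456L;                    // example event time, ms since the UNIX epoch
long windowStartMillis = eventTimestampMillis
    - (eventTimestampMillis % windowSizeMillis);                   // aligned to the epoch, not to pipeline start
long windowEndMillis = windowStartMillis + windowSizeMillis;       // exclusive end of the window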
About question 3: the default behavior of windowing and triggering in Apache Beam is to ignore late data, although it is possible to configure the handling of late data using withAllowedLateness. In order to do so, it is necessary to understand the concept of watermarks first. The watermark is a metric of how far behind the data is. For example, with 3 seconds of allowed lateness, data that arrives 3 seconds late will still be assigned to the right window. On the other hand, for data that arrives past that, you define what happens to it: you can reprocess it or ignore it using triggers.
withAllowedLateness
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
.withAllowedLateness(Duration.standardDays(2)));
Note that this sets the amount of time that late data is allowed to arrive.
Triggering
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .accumulatingFiredPanes()); // an accumulation mode is required once a trigger is set
Notice that the window is re-processed and re-computed every time late data arrives. This trigger gives you the opportunity to react to the late data.
Finally, about question 4, which is partially explained by the concepts described above: the computations will occur within each fixed window and will be recomputed/processed every time a trigger fires. This logic guarantees that your data is in the right window.
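As a hedged sketch of that behaviour (illustrative names, not the question's pipeline): the accumulation mode decides whether each re-computed pane carries everything seen so far for the window or only what arrived since the previous firing.
PCollection<String> windowed = input.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))    // re-fire the window for late data
        .withAllowedLateness(Duration.standardHours(1))
        // use .discardingFiredPanes() instead to emit only the new elements on each firing
        .accumulatingFiredPanes());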

Calculating periodic checkpoints for an unbounded stream in Apache Beam/DataFlow

I am using a global unbounded stream in combination with stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described in the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combine all the outputs globally to find the minimum value. However, it is not possible to use a global combine operation because either a window or a trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog post https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, a Splittable DoFn might be the thing you are looking for!
You could create an SDF to fetch the stream and accept the input element as the starting point.
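For context, a minimal, hedged sketch of the Splittable DoFn shape (assuming a recent Beam Java SDK; it just emits the offsets of a claimed restriction and is illustrative only, not a Kinesis reader):
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Emits the offsets [0, element); the input element acts as the end point of the restriction.
class EmitOffsetsFn extends DoFn<Long, Long> {
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element Long element) {
    return new OffsetRange(0, element);
  }

  @ProcessElement
  public void process(ProcessContext c, RestrictionTracker<OffsetRange, Long> tracker) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
      c.output(i);  // each claimed offset becomes an output; a real SDF would read from the stream here
    }
  }
}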

Accumulation of elements in GroupByKey subtask when writing to BigQuery Apache beam v2.0

I'm implementing a Dataflow pipeline that reads messages from Pubsub and writes TableRows into BigQuery (BQ) using Apache Beam SDK 2.0.0 for Java.
This is the related portion of code:
tableRowPCollection
.apply(BigQueryIO.writeTableRows().to(this.tableId)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
This code generates a group of tasks under the hood in the Dataflow pipeline. One of these tasks is the GroupByKey. This task is accumulating elements in the pipeline, as can be seen in this screenshot:
[GBK elements accumulation image]
After reading the docs I suspect that this issue relates to the Window config. But I could not find a way of modifying the Window configuration since it is implicitly created by Window.Assign transform inside the Reshuffle task.
Is there a way of setting the window parameters and/or attaching triggers to this implicit Window or should I create my own DoFn that inserts a TableRow in BQ?
Thanks in advance!
[Update]
I left the pipeline running for approximately a day, and after that the GroupByKey subtask became faster and the numbers of elements coming in and going out got closer to each other (sometimes they were the same). Furthermore, I also noticed that the watermark got closer to the current time and was advancing faster. So the "issue" was solved.
There isn't any waiting introduced by the Reshuffle in the BigQuery sink. Rather, it is used to create the batches of rows to write to BigQuery. The number of elements coming out of the GroupByKey is smaller because each output element represents a batch (or group) of input elements.
You should be able to see the total number of elements coming out as the output of the ExpandIterable (the output of the Reshuffle).
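For intuition only, a hedged sketch of the same batching idea using the standalone GroupIntoBatches transform (this is not what BigQueryIO uses internally; keyedRows is a hypothetical PCollection<KV<String, TableRow>>):
// Many input rows are grouped so that each output element represents a whole batch,
// which is why fewer elements appear downstream of the grouping step.
PCollection<KV<String, Iterable<TableRow>>> batches =
    keyedRows.apply(GroupIntoBatches.<String, TableRow>ofSize(500));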

GroupByKey returns no elements in Google Cloud Dataflow

I'm new to Dataflow, so this is probably an easy question.
I want to try out the Sessions windowing strategy. According to the windowing documentation, windowing is not applied until we've done a GroupByKey, so I'm trying to do that.
However, when I look at my pipeline in Google Cloud Platform, I can see that MapElements returns elements, but no elements are returned by GroupByKey ("Elements Added: -"). What am I doing wrong when grouping by key?
Here's the most relevant part of the code:
events = events
.apply(Window.named("eventsSessionsWindowing")
.<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))
);
PCollection<KV<String, MyEvent>> eventsKV = events
.apply(MapElements
.via((MyEvent e) -> KV.of(ExtractKey(e), e))
.withOutputType(new TypeDescriptor<KV<String, MyEvent>>() {}));
PCollection<KV<String, Iterable<MyEvent>>> eventsGrouped = eventsKV.apply(GroupByKey.<String, MyEvent>create());
A GroupByKey fires according to a triggering strategy, which determines when the system thinks that all data for this key/window has been received and it's time to group it and pass it to downstream transforms. The default strategy is:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window.
Please see Default Trigger for details. You were seeing a delay of a couple of minutes that corresponded to the progression of PubSub's watermark.
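If waiting for the watermark is not acceptable, here is a hedged sketch (written against the current Beam Java API rather than the older Dataflow SDK used in the question) of adding early firings to the Sessions window so the GroupByKey emits speculative panes sooner:
events = events.apply("eventsSessionsWindowing",
    Window.<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10))))      // speculative pane roughly every 10s
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());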
