Google Dataflow side inputs sometimes take hours to propagate - google-cloud-dataflow

Goal
I'm using a slowly changing side input to enrich my main input in Google Cloud Dataflow using the published patterns, such as slowly updating global window side inputs. The idea is to use a global window that triggers on each new side input and discards the previous value.
Expected
I expect that when I push an update to my side input (via PubSub), it will be reflected in the enriched outputs after some small latency (less than a few minutes).
Actual
What I see is that sometimes the update propagates relatively quickly and uniformly. Other times it trickles in over hours, with some workers apparently still using the old side input and others using the new one, until eventually all enriched outputs reflect the new side input. During this time there is a possibly sporadic stream of main inputs.
Speculation
I don't know how this works under the hood, but experiments seem to show that once a worker has sampled the old side input, it retains that value until it goes idle for some time (no main inputs to process). Then when new main inputs become available, I see the new side input reflected.
Is this expected behavior? How can I get the behavior I expect?
Example
Here's a simple example that illustrates the behavior.
Publish a side input
Publish a steady stream of main inputs
While there are still main inputs keeping the pipeline busy, publish a new side input.
Observe that the input leg of the Dataflow job has ingested the new side input, but the enriched outputs do not yet reflect it.
Stop publishing main inputs and let the pipeline go idle for ~minutes.
Restart publishing main inputs.
Enriched outputs reflect new side input.
Here's a toy example that was used for this experiment.
import apache_beam as beam
from apache_beam.transforms import trigger, window

class Merger(beam.DoFn):
    def process(self, element, side_inputs):
        yield (element, side_inputs[0])

def run():
    with beam.Pipeline(options=pipeline_options) as pipeline:
        side_input = (
            pipeline
            | "SidePubSub" >> beam.io.ReadFromPubSub(topic=SIDE_TOPIC)
            | "WindowSide" >> beam.WindowInto(
                window.GlobalWindows(),
                trigger=trigger.Repeatedly(trigger.AfterProcessingTime()),
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
        )
        main_input = (
            pipeline
            | "MainPubSub" >> beam.io.ReadFromPubSub(topic=MAIN_TOPIC)
            | "Window" >> beam.WindowInto(window.FixedWindows(MAIN_WINDOW))
        )
        merged = (
            main_input
            | "Merge" >> beam.ParDo(
                Merger(), side_inputs=beam.pvalue.AsList(side_input))
        )

Related

Apache Beam Streaming pipeline with sequential batches

What I am trying to do:
Consume json messages from PubSub subscription using Apache Beam Streaming pipeline & Dataflow Runner
Unmarshal payload strings into objects.
Assume 'messageId' is the unique Id of incoming message. Ex: msgid1, msgid2, etc
Retrieve child records from a database for each object resulting from #2. The same child can be applicable to multiple messages.
Assume 'childId' as the unique Id of child record. Ex: cid1234, cid1235 etc
Group child records by their unique id as shown in example below
KV.of(cid1234,Map.of(msgid1, msgid2)) and KV.of(cid1235,Map.of(msgid1, msgid2))
Write grouped result at childId level to the database
Questions:
Where should the windowing be introduced? We currently have 30-minute fixed windowing after step #1.
How does Beam define the start and end time of a 30-minute window? Is it right after we start the pipeline, or after the first message of the batch?
What if steps 2 to 5 take more than an hour for a window and the next window's batch is ready? Would both windows' batches get processed in parallel?
How can I make the next window's messages wait until the previous window's batch is completed?
If we don't do this, the result at the childId level will be overwritten by later batches.
Code snippet:
PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
    PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/project1/subscriptions/subscription1"));

PCollection<PubsubMessage> windowedMessages = messages.apply(
    Window.into(FixedWindows.of(Duration.standardMinutes(30))));

PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
    ParDo.of(new JsonUnmarshallFn())
        .withOutputTags(JsonUnmarshallFn.mainOutputTag,
            TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

PCollectionTuple childRecordsTuple = unmarshalResultTuple
    .get(JsonUnmarshallFn.mainOutputTag)
    .apply("FetchChildsFromDBAndProcess",
        ParDo.of(new ChildsReadFn())
            .withOutputTags(ChildsReadFn.mainOutputTag,
                TupleTagList.of(ChildsReadFn.deadLetterTag)));

// input is KV of (childId, msgids), output is mutations to write to BT
PCollectionTuple postProcessTuple = childRecordsTuple
    .get(ChildsReadFn.mainOutputTag)
    .apply(GroupByKey.create())
    .apply("UpdateChildAssociations",
        ParDo.of(new ChildsProcessorFn())
            .withOutputTags(ChildsProcessorFn.mutations,
                TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

postProcessTuple.get(ChildsProcessorFn.mutations)
    .apply(CloudBigtableIO.writeToTable(...));
Addressing each of your questions.
Regarding questions 1 and 2: when you use windowing within Apache Beam, you need to understand that the "windows existed before the job". What I mean is that the windows start at the Unix epoch (timestamp = 0), so your data is allocated to fixed time ranges independently of when the pipeline started. For example, with fixed 60-second windows:
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));
First window: [0s; 60s) - second window: [60s; 120s) - and so on.
Please refer to the Beam windowing documentation for the details.
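To make the boundary arithmetic concrete, here is a small sketch (illustrative only, not the Beam implementation) of how an epoch-aligned fixed window is derived for an element, assuming no window offset:
def fixed_window_bounds(event_ts_seconds, size_seconds=60):
    # Windows are aligned to the Unix epoch, so the bounds depend only on
    # the element's timestamp, never on when the job was started.
    start = event_ts_seconds - (event_ts_seconds % size_seconds)
    return start, start + size_seconds

# An element with timestamp 125s lands in [120s, 180s).
print(fixed_window_bounds(125))  # (120, 180)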
About question 3: by default, windowing and triggering in Apache Beam ignore late data, although it is possible to configure the handling of late data using withAllowedLateness. To do so, it is necessary to first understand the concept of watermarks. A watermark is a measure of how far behind event time the processing is. For example, with 3 seconds of allowed lateness, data that arrives up to 3 seconds late is still assigned to the right window. On the other hand, if the data arrives past the watermark plus the allowed lateness, you define what happens to it: you can reprocess it or ignore it using triggers.
withAllowedLateness
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardDays(2)));
Note that this sets the amount of time that late data is allowed to arrive.
Triggering
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    .discardingFiredPanes());
Notice that the window is re-processed and re-computed every time late data arrives. This trigger gives you the opportunity to react to the late data.
Finally, about question 4, which is partially answered by the concepts described above: the computations occur within each fixed window and are recomputed/re-processed every time a trigger fires. This logic guarantees that your data is in the right window.

Beam CoGroupByKey with fixed window and event time based trigger generates random elements

I have a pipeline in Beam that uses CoGroupByKey to combine 2 PCollections: the first one reads from a Pub/Sub subscription, and the second one uses the same PCollection but enriches the data by looking up additional information from a table using JdbcIO.readAll. So there is no way there would be data in the second PCollection without it also being in the first one.
There is a fixed window of 10 seconds with an event-time-based trigger like the one below:
Repeatedly.forever(
    AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(40)))
        .withLateFirings(AfterPane.elementCountAtLeast(1))
);
The issue I am seeing is that when I stop the pipeline using Drain mode, it seems to randomly generate elements for the second PCollection even when no messages have been coming in to the input Pub/Sub topic. This also happens randomly while the pipeline is running, but not consistently; when draining the pipeline, however, I have been able to reproduce it consistently.
You are using a non-deterministic triggering, which means the output is sensitive to the exact ordering in which events come in. Another way to look at this is that CoGBK does not wait for both sides to come in; the trigger starts ticking as soon as either side comes in.
For example, let's call your PCollections A and A' respectively, and assume A has two elements a1 and a2, with corresponding elements a1' and a2' in A' (of common provenance).
Suppose a1 and a1' come into the CoGBK, 39 seconds pass, and then a2 comes in (on the same key); another 2 seconds pass, then a2' comes in. The CoGBK will output ([a1, a2], [a1']) when the 40-second mark hits, and then ([], [a2']) will get emitted when the window closes. Even if everything is on the same key, this could happen occasionally if there is more than a 40-second walltime delay going through the longer path, and it will almost certainly happen for any late data (each side will fire separately).
Draining makes things worse, e.g. I think all processing time triggers fire immediately.
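As a hedged illustration (not part of the original answer): one way to make the CoGBK output insensitive to arrival order is to fire only once, when the watermark passes the end of the window, giving up the early and late firings at the cost of latency. A Beam Python sketch, with the PCollection name assumed:
import apache_beam as beam
from apache_beam.transforms import trigger, window

# joined_input is a placeholder for the keyed PCollection feeding the CoGroupByKey.
deterministic = joined_input | beam.WindowInto(
    window.FixedWindows(10),
    trigger=trigger.AfterWatermark(),  # fire once, at the end of the window
    accumulation_mode=trigger.AccumulationMode.DISCARDING)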

Cloud Dataflow what is the exact definition of freshness and latency?

Problem:
When using Cloud Dataflow, we get presented 2 metrics (see this page):
system latency
data freshness
These are also available in Stackdriver under the following names (extract from here):
system_lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
data_watermark_age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
But, these descriptions are still very vague:
what does "awaiting processing" mean? is how long a message waits in pubsub? or the total time it has to wait inside the pipeline?
the "maximum duration": after that maximum item is processed, will the metric be adjusted?
"time since event timestamp" does that mean if my event was put in pubsub at timestamp t1 and it flows out of one end of the pipeline at timestamp t2, the pipeline is at t1? I think I can assume that if the metric is at t1, everything before t1 can be assumed processed.
Question:
As these metrics coincide with the semantics of Apache Beam, I would love to see some examples, or at least clearer definitions of these metrics, to make them usable.
These metrics are notoriously tricky. An in-depth dive into how they work can be seen in this talk by a member of the Beam / Dataflow team.
Pipelines are split into a series of computations that occur in memory, and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:
with beam.Pipeline() as p:
    (p | beam.io.ReadFromPubSub(...)
       | beam.Map(parse_data)
       | beam.Map(into_key_value_pairs)
       | beam.WindowInto(...)
       | beam.GroupByKey()
       | beam.Map(format_data)
       | beam.io.WriteToBigQuery(...))
This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.
The first stage goes from ReadFromPubSub to the GroupByKey operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey, the data needs to be written to persistent state (and therefore into a new source).
The second stage goes from GroupByKey to WriteToBigQuery. In this case, the data is read from a 'source'.
Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
--
Answering your questions:
What's awaiting processing?
Answer
It is how long an element waits in PubSub. Specifically, how long an element waits inside any source in the pipeline.
Consider a simpler pipeline:
ReadFromPubSub -> Map -> WriteToBigQuery.
This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> **Confirm to PubSub that the item has been consumed**.
Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.
This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
Does maximum duration get adjusted after processing?
Answer
That's right. For instance, consider the previous pipeline again: BQ is dead for 5 minutes. When BQ comes back, a large batch of items may be written to it, and confirmed as read from PubSub. This will drastically reduce the system latency (and data freshness) back to a few seconds.
What's time since event timestamp?
Answer
An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:
For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark could be held back if there is some input data that has not yet been processed.
This metric is, of course, heuristic. If some data point comes in very late, then the Data Freshness will be held back.
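As a concrete (assumed) example of supplying that event timestamp, the Beam Python Pub/Sub connector can be pointed at the attribute holding it; the subscription path and attribute name below are made up:
import apache_beam as beam

events = (
    pipeline
    | "ReadWithEventTime" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/my-sub",  # assumed
        with_attributes=True,
        # 'event_ts' is an assumed attribute name; its value should be
        # milliseconds since the Unix epoch or an RFC 3339 timestamp.
        timestamp_attribute="event_ts")
)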
--
I'd advise you to check out the talk by Slava; it goes over all these concepts.

Prevent fusion in Apache Beam / Dataflow streaming (python) pipelines to remove pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them into sliding windows (currently the window size is 3 seconds and the interval is 3 seconds as well). Once a window is fired we do some post-processing on the elements inside it. This post-processing step takes significantly longer than the window size, about 15 seconds.
The Apache Beam code of the pipeline:
input = (pipeline
         | beam.io.ReadFromPubSub(subscription=<subscription_path>)
         | beam.Map(process_fn))
windows = input | beam.WindowInto(beam.window.SlidingWindows(3, 3),
                                  trigger=AfterCount(30),
                                  accumulation_mode=AccumulationMode.DISCARDING)
group = windows | beam.GroupByKey()
group | beam.Map(post_processing_fn)
As you know, Dataflow tries to perform some optimizations on your pipeline steps. In our case it fuses everything together from the windowing onwards (clustered operations: 1/ processing, 2/ windowing + post-processing), which causes slow, sequential post-processing of all the windows by just one worker. We see logs every 15 seconds showing that the pipeline is processing the next window. However, we would like to have multiple workers picking up separate windows instead of the workload going to a single worker.
Therefore we were looking for ways to prevent this fusion from happening, so that Dataflow separates the windowing from the post-processing of the windows. That way we would expect Dataflow to be able to assign multiple workers to the post-processing of fired windows.
What we have tried so far:
Increasing the number of workers to 20, 30 or even 40, but without effect: only the steps before the windowing get assigned to multiple workers
Running the pipeline for 5 or 10 minutes, but we noticed no worker re-allocation to help with this larger post-processing step after the windowing
After the windowing, putting the elements back into a global window
Simulating another GroupByKey with a dummy key (as mentioned in https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#preventing-fusion), but without any success
The last two actions indeed created a third clustered operation (1/ processing, 2/ windowing, 3/ post-processing), but we noticed that the same worker still executes everything after the windowing.
Is there any solution that can resolve this problem?
The workaround we are currently considering is to build another streaming pipeline which receives the windows, so that those workers can process the windows in parallel, but it is cumbersome.
You have done the right thing to try to break fusion. I suspect there may be a different issue getting you into trouble.
For streaming, a single key always gets processed on the same worker. By any chance, are all or most of your records assigned to a single key? If so, your processing will be done on a single worker.
Something that you can do to prevent this is to make the window a part of the key, so that the elements for multiple windows can be processed in different workers even though they have the same key:
class KeyIntoKeyPlusWindow(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):
        key, values = element
        yield ((key, window), values)

group = windows | beam.ParDo(KeyIntoKeyPlusWindow()) | beam.GroupByKey()
And once you've done that, you can apply your post-processing:
group | beam.Map(post_processing_fn)

Why did #sideInput() method move from Context to ProcessContext in Dataflow beta

I wonder why the #sideInput() method has moved to the ProcessContext class?
Previously I could do some additional processing in the #startBundle() method and cache the result.
Doing that in #processElement() sounds less efficient. Of course I could do the preprocessing before passing the data to the view, but there is still the overhead of calling #sideInput() for each element...
Thanks,
G
Great question. The reason is that we added support for windowed PCollections as side inputs. This enables additional scenarios, including using side inputs with unbounded PCollections in streaming mode.
Before the change, we only supported side inputs that were globally windowed, and the entire side input PCollection was available while processing every element of the main input PCollection. This works fine for bounded PCollections in traditional batch-style processing, but it didn't extend to windowed or unbounded PCollections.
After the change, the window of the current element you are processing in your ParDo controls what subset of the side input is visible. (And so you can't access side inputs in startBundle(), where there is no current element and hence no current window.)
For example, consider an example where you have a streaming pipeline processing your website logs and providing real time updates to a live usage dashboard. You've got two unbounded input PCollections: one contains new user signups and the other contains user clicks. You can identify which user clicks come from new users by windowing both PCollections by hour and doing a ParDo over the user clicks that takes new user signups as a side input. Now when you process a user click which is in a given hour, you automatically see just the subset of the new user sign ups from the same hour. You can do different variants on this by changing the windowing functions and moving element timestamps forward in time on the side input -- like continuing to window the user clicks per hour, but using the new signups from the last 24 hours.
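For concreteness, here is a rough sketch of that clicks/signups scenario in today's Beam Python SDK (the original question concerns the Java SDK; the topic names, JSON parsing, and user_id field are all assumptions). Because both PCollections use the same hourly windowing, the side input visible while processing a click contains only the signups from that click's hour.
import json
import apache_beam as beam
from apache_beam.transforms import window

signups = (
    pipeline
    | "ReadSignups" >> beam.io.ReadFromPubSub(topic=SIGNUPS_TOPIC)
    | "ParseSignups" >> beam.Map(json.loads)
    | "WindowSignups" >> beam.WindowInto(window.FixedWindows(3600))
)
clicks = (
    pipeline
    | "ReadClicks" >> beam.io.ReadFromPubSub(topic=CLICKS_TOPIC)
    | "ParseClicks" >> beam.Map(json.loads)
    | "WindowClicks" >> beam.WindowInto(window.FixedWindows(3600))
)

def keep_new_user_clicks(click, hourly_signups):
    # hourly_signups is restricted to the signups whose hourly window
    # matches the window of this click.
    new_users = {s["user_id"] for s in hourly_signups}
    if click["user_id"] in new_users:
        yield click

new_user_clicks = clicks | "FilterNewUsers" >> beam.FlatMap(
    keep_new_user_clicks, hourly_signups=beam.pvalue.AsList(signups))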
I do agree this change makes it harder to cache any postprocessing on your side input. We added View.asMultimap to handle a common case where you turn the Iterable into a lookup table. If your post-processing is element-wise, you can do it with a ParDo before creating the PCollectionView. For anything else right now, I'd recommend doing it lazily from within processElement. I'd be interested in hearing about other patterns that occur, so we can work on ways to make them more efficient.

Resources