Google Dataflow pipeline stuck on GroupByKey - google-cloud-dataflow

There seem to be many similar questions/issues about a Beam pipeline being stuck in GroupByKey, but after a whole day of trying different settings for PubSubIO and window triggers, I still have not gotten any further.
I have a PubSub topic producing a steady stream of data:
When starting my pipeline in Dataflow, a bunch of elements is added; after that, the number of elements stays the same. The data watermark increases slowly and sits around 10-11 minutes behind the current time, which coincides with the 10-minute message retention time of the PubSub subscription. I have tried different setups for reading from PubSub: with and without attributes, adding the timestamp myself, etc. In this job it is just a vanilla read without attributes, relying on Google to compute the timestamps.
Here I am trying to run my pipeline, which doesn't do any grouping by key:
I have tried a lot of window setups. My goal is a sliding window of 30 minutes duration every 1 minute, but here I am just trying to get it to work with a FixedWindow of 1 minute. I also tried a lot of different triggers with both early and late firings added. In this job I did not specify anything, i.e. it should use the default trigger.
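For reference, the windowing in this job boils down to something like the following sketch (not my exact pipeline code; the helper method and element types are just placeholders):

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// 1-minute fixed windows with the default trigger, followed by a GroupByKey.
static PCollection<KV<String, Iterable<String>>> windowAndGroup(
    PCollection<KV<String, String>> keyedMessages) {
  return keyedMessages
      .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))))
      .apply(GroupByKey.create());
}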
The job id is 2019-11-01_06_38_28-2654303843876161133
Does anyone have any suggestions on what else I can try to get some elements through the GBK?
Update:
I have simplified my pipeline to continue troubleshooting the issue, using the hint in one of the comments to look at the watermark when reading the PubSub messages.
I have an attribute on the PubSub message that I use as the timestamp (.withTimestampAttribute(...)). Logging timestamp() in ProcessContext does give the correct timestamp and window assignment. Messages are streaming in in real time with a couple of seconds of lag, but the issue is that the data watermark stays well behind (I observed about 1.5 hours), and therefore the window is never triggered and GroupByKey does not work.
If I omit .withTimestampAttribute(...) when reading from PubSub, everything seems to work correctly, but my timestamps lag, which causes messages to be assigned to a later window in many cases.
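For clarity, the two read variants look roughly like this (the subscription path and the "ts" attribute name are placeholders, and the pipeline variable is assumed):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// Variant 1: use a message attribute as the event time.
PCollection<String> withAttribute = pipeline.apply("ReadWithTsAttr",
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-sub")
        .withTimestampAttribute("ts"));

// Variant 2: no .withTimestampAttribute(...), so elements get the Pub/Sub
// publish time; the watermark behaves, but window assignment is skewed by the lag.
PCollection<String> withPublishTime = pipeline.apply("ReadWithPublishTime",
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-sub"));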
I found a workaround by triggering in processing time instead of event time, but I haven't assessed if this is a real solution:
.triggering(AfterProcessingTime.pastFirstElementInPane().alignedTo(Duration.standardMinutes(1)).plusDelayOf(Duration.standardMinutes(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
The question is: how can I make sure the watermark is updated when reading from PubSub using the timestamp attribute?

Related

Dataflow Sliding Window vs Global window with trigger usecase?

I am designing a basket-abandonment system for an e-commerce company. The system will send a message to a user based on the rules below:
There is no interaction by the user on the site for 30 minutes.
Has added more than $50 worth of products to the basket.
Has not yet completed a transaction.
I use Google Cloud Dataflow to process the data and decide if a message should be sent. I have a couple of options:
Use a Sliding window with a duration of 30 minutes.
A global window with a time based trigger with a delay of 30 minutes.
I think a sliding window might work here. But my question is: can there be a solution based on a global window with a processing-time trigger and a delay for this use case?
As far as I understand triggers from the Apache Beam documentation:
Triggers allow Beam to emit early results, before a given window is closed. For example, emitting after a certain amount of time elapses, or after a certain number of elements arrives.
Triggers allow processing late data by triggering after the event time watermark passes the end of the window.
So, for my use case and as per the above trigger concepts, I don't think a trigger can be fired after a set delay for each and every user (it is mentioned above that a trigger can emit after a certain number of elements arrives, but I am not sure whether that count could be 1). Can you confirm?
Both answers, 1 - Sliding Windows and 2 - Global Window, are incorrect.
Sliding windows are not correct because, assuming there is one key per user, a message will be sent 30 minutes after the user first started browsing, even if they are still browsing.
A global window is not correct because it will cause messages to be sent out every 30 minutes to all users, regardless of where they are in their current session.
Even fixed windows would be incorrect in this case, because, assuming there is one key per user, a message would be sent every 30 minutes.
The correct answer would be: use a session window with a gap duration of 30 minutes.
This is correct because it will send a message per user after that user has been inactive for 30 minutes.
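A minimal sketch of that windowing, assuming the events are keyed by user id (the Double value type is just a placeholder):

import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Per-user session windows that close after 30 minutes without activity.
static PCollection<KV<String, Double>> sessionize(PCollection<KV<String, Double>> userEvents) {
  return userEvents.apply(
      Window.<KV<String, Double>>into(Sessions.withGapDuration(Duration.standardMinutes(30))));
}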
I think that sliding window is the correct approach from what you described, and I don't think you can solve this with trigger+delay. If event time sliding windowing makes sense from your business logic perspective, try to use it first, that's what it's for.
My understanding is that while you can use a trigger to produce early results, it is not guaranteed to fire at specific (server/processing) time or with exact number of elements (received so far for the window). The trigger condition enables/unblocks the runner to emit the window contents but it doesn't force it to do so.
In the case of event time this makes sense: it doesn't matter when the event arrives or when the trigger fires, because if the element has a timestamp within a window, it will be assigned to the correct window no matter when it arrives. And when the trigger fires for the window, the element is guaranteed to be in that window if it has arrived.
With processing time you can't do this. If an event arrives late, it will be accounted for at that time and emitted the next time the trigger fires. And because the trigger doesn't guarantee the exact moment it fires, you can potentially end up with unexpected data belonging to unexpectedly emitted panes. Triggers are useful for getting early results in general, but I am not sure you can reason about windowing based on that.
Also, a trigger delay only adds a firing delay (e.g. if it was supposed to fire at 12:00pm, now it will fire at 12:05pm), but it doesn't allow you to reliably stagger multiple trigger firings so that they go off at specific intervals.
You can look at the design doc for triggers here: https://s.apache.org/beam-triggers , and possibly lateness doc may be relevant as well: https://s.apache.org/beam-lateness
Other docs can be found here, if you are interested: https://beam.apache.org/contribute/design-documents/ .
Update:
Rui pointed out that this use case can be more complicated and probably not easily solvable with sliding windows. It may be worth looking into session windows or manual logic on top of keys + state + timers.
I found the state[1] and timer[2] docs of Apache Beam, which should be able to handle this specific use case without using a processing-time trigger in a global window.
Assume the incoming data are events of users' actions, and each event (action) can be keyed by user_id.
The nice property of state and timers is that they are kept on a per-key-and-window basis. So you can accumulate state for each user_id; in this case the state is the amount of money in the cart. A timer can be set the first time the amount in the cart exceeds $50, and the timer can be reset as long as the user still has shopping actions within 30 minutes of processing time.
Assume transaction completion is also a user_id-keyed event. When a transaction-completion event is seen, the timer can be deleted[3].
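A rough sketch of this idea, assuming a made-up CartEvent type keyed by user_id (class, field and threshold names are purely illustrative, not a complete solution):

import java.io.Serializable;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Hypothetical event type, purely for illustration.
class CartEvent implements Serializable {
  double amountAdded;       // dollars this action added to the basket
  boolean transactionDone;  // true if this action completed a purchase
}

class AbandonedBasketFn extends DoFn<KV<String, CartEvent>, String> {

  @StateId("cartTotal")
  private final StateSpec<ValueState<Double>> cartTotalSpec = StateSpecs.value();

  @TimerId("inactivity")
  private final TimerSpec inactivitySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("cartTotal") ValueState<Double> cartTotal,
      @TimerId("inactivity") Timer inactivity) {
    CartEvent e = c.element().getValue();
    if (e.transactionDone) {
      // Purchase completed: clear the state so a pending timer firing is ignored.
      cartTotal.clear();
      return;
    }
    Double current = cartTotal.read();
    double total = (current == null ? 0.0 : current) + e.amountAdded;
    cartTotal.write(total);
    if (total > 50.0) {
      // (Re)arm the timer: it fires 30 minutes after the latest qualifying action.
      inactivity.offset(Duration.standardMinutes(30)).setRelative();
    }
  }

  @OnTimer("inactivity")
  public void onInactivity(
      OnTimerContext c, @StateId("cartTotal") ValueState<Double> cartTotal) {
    Double total = cartTotal.read();
    if (total != null && total > 50.0) {
      // In a real pipeline you would also keep the user id in state so the
      // message can be addressed; here we just emit a placeholder.
      c.output("send basket-abandonment message");
      cartTotal.clear();
    }
  }
}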
Update:
This idea lives entirely in the processing-time domain, so it will produce false-alarm messages depending on lateness in the system. The question is how to move this idea into the event-time domain so that we have fewer false alarms. One possibility is an event-time-based timer[4]. I am not clear on what an event-time-based timer means at this moment.
[1] https://beam.apache.org/blog/2017/02/13/stateful-processing.html
[2] https://docs.google.com/document/d/1zf9TxIOsZf_fz86TGaiAQqdNI5OO7Sc6qFsxZlBAMiA/edit#
[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/state/Timers.java#L45
[4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/state/TimeDomain.java#L33

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I was going to start developing programs on Google Cloud Pub/Sub. I just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermark and lateness I have come to a conclusion that these metrics are critical in conditions where custom windowing is applied over the data being received with event based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as they arrive) using triggers. Therefore, you can no longer define data as "late" (nor as "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
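As an illustration only, a processing-time "snapshot" over a global window could look roughly like this (the per-key sum and the one-minute cadence are assumptions for the example, not a prescribed setup):

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Emit a snapshot of per-key sums roughly once a minute, in processing time.
static PCollection<KV<String, Long>> snapshotSums(PCollection<KV<String, Long>> events) {
  return events
      .apply(Window.<KV<String, Long>>into(new GlobalWindows())
          .triggering(Repeatedly.forever(
              AfterProcessingTime.pastFirstElementInPane()
                  .plusDelayOf(Duration.standardMinutes(1))))
          .withAllowedLateness(Duration.ZERO)
          .discardingFiredPanes())
      .apply(Sum.longsPerKey());
}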

Dataflow batch vs streaming: window size larger than batch size

Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se; however, one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), and BigQuery is a very good fit for that.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: in the same way that you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs the group-by-window implicitly as part of every aggregating operation).
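For example, a sketch of the 15-minutes-every-5-minutes case above (the per-element count and the String input type are just illustrative):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// 15-minute windows sliding every 5 minutes: a hit stamped 14:03 is counted
// in the 13:50..14:05, 13:55..14:10 and 14:00..14:15 windows.
static PCollection<KV<String, Long>> hitsPerWindow(PCollection<String> pageHits) {
  return pageHits
      .apply(Window.<String>into(
          SlidingWindows.of(Duration.standardMinutes(15)).every(Duration.standardMinutes(5))))
      .apply(Count.perElement());
}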
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).
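A rough sketch of the file-watching variant (the bucket path and poll interval are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Keep reading newly rotated log files from the bucket as they appear.
static PCollection<String> readLogsContinuously(Pipeline p) {
  return p.apply(
      TextIO.read()
          .from("gs://my-bucket/logs/*.log")
          .watchForNewFiles(
              Duration.standardMinutes(1),   // poll for new files every minute
              Watch.Growth.never()));        // never stop watching
}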

Watermark getting stuck

I am ingesting data via Pub/Sub into a Dataflow pipeline which is running in unbounded mode. The data are basically coordinates with timestamps captured from tracking devices. Those messages arrive in batches, where each batch might be 1..n messages. For a certain period there might be no messages arriving, and they might be resent later on (or not). We use the timestamp (in UTC) of each coordinate as an attribute of the Pub/Sub message and read it in the pipeline via a timestamp label:
pipeline.apply(PubsubIO.Read.topic("new").timestampLabel("timestamp"))
An example of coordinates and delay looks like:
36 points wait 0:02:24
36 points wait 0:02:55
18 points wait 0:00:45
05 points wait 0:00:01
36 points wait 0:00:33
36 points wait 0:00:43
36 points wait 0:00:34
A message might look like:
2013-07-07 09:34:11;47.798766;13.050133
After the first batch the watermark is empty; after the second batch I can see a watermark in the pipeline diagnostics, but it doesn't get updated, although new messages arrive. Also, according to Stackdriver logging, Pub/Sub has no undelivered or unacknowledged messages.
Shouldn't the watermark move forward as messages with new event time arrive?
According to "What is the watermark heuristic for PubsubIO running on GCD?", the watermark should also move forward every 2 minutes, which it doesn't:
[..] In the case that we have not seen data on the subscription in more
than two minutes (and there's no backlog), we advance the watermark to
near real time. [..]
Update to address Ben's questions:
Is there a job ID that we could look at?
Yes, I just restarted the whole setup at 09:52 CET, which is 07:52 UTC, with job ID 2017-05-05_00_49_11-11176509843641901704.
What version of the SDK are you using?
1.9.0
How are you publishing the messages with the timestamp labels?
We use a Python script to publish the data, which uses the Pub/Sub SDK.
A message from there might look like:
{'data': {timestamp;lat;long;ele}, 'timestamp': '2017-05-05T07:45:51Z'}
We use the timestamp attribute as the timestampLabel in Dataflow.
What is the watermark stuck at?
For this job the watermark is now stuck at 09:57:35 (I am posting this around 10:10), although new data is sent, e.g. at:
10:05:14
10:05:43
10:06:30
I can also see that it may happen that we publish data to Pub/Sub with a delay of more than 10 seconds, e.g. at 10:07:47 we publish data with a highest timestamp of 10:07:26.
After a few hours the watermark catches up, but I cannot see why it is delayed / not moving in the beginning.
This is an edge case in the PubSub watermark-tracking logic that has two workarounds (see below). Essentially, if there is no input for 2 minutes, then the watermark will advance to the current time. But if data is arriving more often than every 2 minutes, yet still at a very low QPS, then there isn't enough data to keep the estimated watermark up to date.
As I mentioned, there are two workarounds:
If you process more data, the issue will naturally be resolved.
Alternatively, if you inject extra messages (say 2 per second), that will provide enough data for the watermark to advance more quickly. These just need to have timestamps and may be immediately filtered out of the pipeline, as sketched below.
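For example, the injected messages could be dropped right after the read with something like this (the "type" attribute and "heartbeat" value are made-up markers, not anything Pub/Sub defines):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

// Drop the artificial keep-alive messages right after the read.
static PCollection<PubsubMessage> dropHeartbeats(PCollection<PubsubMessage> messages) {
  return messages.apply(
      Filter.by((PubsubMessage m) -> !"heartbeat".equals(m.getAttribute("type"))));
}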
For the record, another thing to keep in mind about the previously mentioned edge case, in a direct-runner context, is the parallelism of the runner. Higher parallelism, which is the default especially on multicore machines, seems to need even more data. In my case, setting --targetParallelism=1 helped: it basically transformed a stuck pipeline into a working one without any other intervention.

How can you replay old data into dataflow via pub/sub and maintain correct event time logic?

We're trying to use dataflow's processing-time independence to start up a new streaming job and replay all of our data into it via Pub/Sub but are running into the following problem:
The first stage of the pipeline is a group-by on a transaction id, with a session window of 10s, discarding fired panes and no allowed lateness. So if we don't specify the timestampLabel of our replay pub/sub topic, then when we replay into pub/sub all of the event timestamps are the same, and the group-by tries to group all of our archived data into transaction ids for all time. No good.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay say one day at a time into the pub/sub topic, then it works for the first day's worth of events, but as soon as those are exhausted the data watermark for the replay pub/sub somehow jumps forward to the current time, and all subsequent replayed days are dropped as late data. I don't really understand why that happens, as it seems to violate the idea that Dataflow logic is independent of processing time.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay all of it into the pub/sub topic, and then start the streaming job to consume it, the data watermark never seems to advance, and nothing ever seems to come out of the groupby. I don't really understand what's going on with that either.
Your approaches #2 and #3 are suffering from different issues:
Approach #3 (write all data, then start consuming): since data is written to the pubsub topic out of order, the watermark really cannot advance until all (or most) of the data is consumed - the watermark is a soft guarantee that "further items you receive are unlikely to have an event time earlier than this", but due to out-of-order publishing there is no correspondence whatsoever between publish time and event time. So your pipeline is effectively stuck until it finishes processing all this data.
Approach #2: technically it suffers from the same problem within each particular day, but I suppose the amount of data within one day is not that large, so the pipeline is able to process it. However, after that, the pubsub channel stays empty for a long time, and in that case the current implementation of PubsubIO will advance the watermark to real time, which is why further days of data are declared late. The documentation explains this in some more detail.
In general, quickly catching up with a large backlog, e.g. by using historic data to "seed" the pipeline and then continuing to stream in new data, is an important use case that we currently don't support well.
Meanwhile I have a couple of recommendations for you:
(better) Use a variation on approach #2, but try timing it against the streaming pipeline so that the pubsub channel doesn't stay empty.
Use approach #3, but with more workers and more disk per worker (your current job appears to be using autoscaling with max 8 workers - try something much larger, like 100? It will downscale after it catches up)
