I am ingesting data via Pub/Sub into a Dataflow pipeline that runs in unbounded mode. The data are basically coordinates with timestamps captured from tracking devices. The messages arrive in batches, where each batch might contain 1..n messages. For a certain period there might be no messages arriving; those messages might be resent later on (or not). We use the timestamp (in UTC) of each coordinate as an attribute of the Pub/Sub message, and read it in the pipeline via a timestamp label:
pipeline.apply(PubsubIO.Read.topic("new").timestampLabel("timestamp"))
An example of coordinates and delay looks like:
36 points wait 0:02:24
36 points wait 0:02:55
18 points wait 0:00:45
05 points wait 0:00:01
36 points wait 0:00:33
36 points wait 0:00:43
36 points wait 0:00:34
A message might look like:
2013-07-07 09:34:11;47.798766;13.050133
After the first batch the watermark is empty. After the second batch I can see a watermark in the pipeline diagnostics, but it doesn't get updated, although new messages arrive. Also, according to Stackdriver Logging, Pub/Sub has no undelivered or unacknowledged messages.
Shouldn't the watermark move forward as messages with new event time arrive?
According to "What is the watermark heuristic for PubsubIO running on GCD?", the watermark should also move forward every 2 minutes, which it doesn't:
[..] In the case that we have not seen data on the subscription in more
than two minutes (and there's no backlog), we advance the watermark to
near real time. [..]
Update to address Ben's questions:
Is there a job ID that we could look at?
Yes I just restarted the whole setup at 09:52 CET which is 07:52 UTC, with job ID 2017-05-05_00_49_11-11176509843641901704.
What version of the SDK are you using?
1.9.0
How are you publishing the messages with the timestamp labels?
We use a Python script to publish the data, which uses the Pub/Sub SDK.
A message from there might look like:
{'data': {timestamp;lat;long;ele}, 'timestamp': '2017-05-05T07:45:51Z'}
We use the timestamp attribute for the timestampLabel in Dataflow.
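For illustration, here is a minimal sketch of how such a message could be assembled on the publishing side. It assumes the first field of each line is a UTC timestamp in the "YYYY-MM-DD HH:MM:SS" form shown above and formats it as an RFC 3339 string, matching the attribute value in the example message. The helper name build_message is hypothetical, and the actual publish call of the Pub/Sub SDK is left out.

```python
from datetime import datetime, timezone


def build_message(line):
    """Turn 'YYYY-MM-DD HH:MM:SS;lat;long[;ele]' into (payload, attributes).

    Hypothetical helper: the payload is the raw line, and the event time
    is copied into a 'timestamp' attribute as an RFC 3339 UTC string.
    """
    first_field = line.split(";", 1)[0]
    # Assumption: the device timestamps are already in UTC.
    event_time = datetime.strptime(first_field, "%Y-%m-%d %H:%M:%S")
    event_time = event_time.replace(tzinfo=timezone.utc)
    attributes = {"timestamp": event_time.strftime("%Y-%m-%dT%H:%M:%SZ")}
    return line.encode("utf-8"), attributes


payload, attributes = build_message("2013-07-07 09:34:11;47.798766;13.050133")
print(attributes)  # {'timestamp': '2013-07-07T09:34:11Z'}
```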
What is the watermark stuck at?
For this job the watermark is now stuck at 09:57:35 (I am posting this around 10:10), although new data is sent e.g. at
10:05:14
10:05:43
10:06:30
I can also see that we sometimes publish data to Pub/Sub with a delay of more than 10 seconds, e.g. at 10:07:47 we publish data with a highest timestamp of 10:07:26.
After a few hours the watermark catches up, but I cannot see why it is delayed or not moving in the beginning.
This is an edge case in the Pub/Sub watermark tracking logic, and it has two workarounds (see below). Essentially, if there is no input for 2 minutes, the watermark advances to the current time. But if data arrives more often than every 2 minutes, yet still at a very low QPS, there isn't enough data to keep the estimated watermark up to date.
As I mentioned, there are two workarounds:
If you process more data the issue will naturally be resolved.
Alternatively, if you inject extra messages (say 2 per second) it will provide enough data for the watermark to advance more quickly. These just need to have timestamps, and may be immediately filtered out of the pipeline.
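The second workaround could look roughly like the sketch below. Everything in it is an assumption for illustration: the "heartbeat" marker attribute, the helper names, and the publish callable, which stands in for whatever publish function your Pub/Sub client provides. The only requirements taken from the answer are that each injected message carries a timestamp and can be filtered out right after the read.

```python
import time
from datetime import datetime, timezone

HEARTBEAT_ATTRIBUTE = "heartbeat"  # hypothetical marker attribute


def make_heartbeat(now=None):
    """Build a dummy message whose only job is to carry a fresh timestamp."""
    now = now or datetime.now(timezone.utc)
    attributes = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        HEARTBEAT_ATTRIBUTE: "true",
    }
    return b"heartbeat", attributes


def is_heartbeat(attributes):
    """Predicate for dropping heartbeats at the start of the pipeline."""
    return attributes.get(HEARTBEAT_ATTRIBUTE) == "true"


def send_heartbeats(publish, count, interval_s=0.5):
    """Call publish(payload, attributes) `count` times, 2 per second by default.

    `publish` is a stand-in for the real client call (an assumption).
    """
    for _ in range(count):
        payload, attributes = make_heartbeat()
        publish(payload, attributes)
        time.sleep(interval_s)
```

In the pipeline, the equivalent of is_heartbeat would be applied in a filter step directly after the Pub/Sub read, so the dummy messages never reach the real processing.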
For the record, another thing to keep in mind about the edge case mentioned above, in a direct runner context, is the parallelism of the runner. Higher parallelism, which is the default especially on multicore machines, seems to need even more data. In my case, setting --targetParallelism=1 helped: it turned a stuck pipeline into a working one without any other intervention.
Related
I have an activity function which will:
read a bunch of results from a process elsewhere in my pipeline, via azure blob storage
perform some processing on each result, using the fan-out, fan-in pattern
Combine the original results with the newly processed data, and re-upload to blob storage
When I have a large number of results (~10,000) - I'm experiencing large delays between finishing my processing, and the upload activity actually triggering.
I'm seeing the processing take ~3-4 minutes, then my "PersistResults" activity is scheduled - then 10 minutes later, the "PersistResults" activity actually runs, and takes about 20 seconds to run.
My guess is that the large payload on the "Persist" activity function is slowing it down a lot - though I've read nothing about limiting that, and the documentation certainly implies that any actual work I do (e.g. storing results) should be done in the activities to keep the orchestrator deterministic.
Actually uploading the results to storage appears to be fairly quick, given that when my activity does run, it only takes 20 seconds.
The final payload (all combined results) is roughly 50MB uncompressed - I notice that within the durable functions storage account it uses a compressed version for the activity input.
Is this significant delay expected when using durable functions like this?
Is there anything I've likely done to cause such a delay?
Is there anything I can do to prevent it?
There seem to be many similar questions/issues about a Beam pipeline being stuck in GroupByKey, but after a whole day of trying different settings for PubSubIO and Window triggers, I still have not made any progress.
I have a PubSub topic producing a steady stream of data:
When starting my pipeline in Dataflow a bunch of elements are added. Hereafter, the number of elements remains the same. The data watermark increases slowly and is around 10-11 minutes behind current time, which coincides with the message retention time of the PubSub subscription of 10 minutes. I have tried different setups for reading PubSub, with and without attributes, adding the timestamp myself, etc. In this job it is just a vanilla read without attributes relying on Google to compute the timestamps.
Here I am trying to run my pipeline, which doesn't do any grouping by key:
I tried a lot of Window setups. My goal is a sliding window of 30 minutes duration every 1 minute, but here I am just trying to get it to work with a FixedWindow of 1 minute. I also tried a lot of different triggers with both early and late firings added. In this job, I did not specify anything, i.e. it should use the default trigger.
The job id is 2019-11-01_06_38_28-2654303843876161133
Does anyone have any suggestions on what else I can try to get some elements through the GBK?
Update:
I have simplified my pipeline to continue troubleshooting the issue using the hint in one of the comments to look at the watermark at reading the PubSub messages.
I have an attribute on the PubSub message that I use as the timestamp (.withTimestampAttribute(...)). Logging the timestamp() in ProcessContext does give the correct timestamp and window assignment. Messages are streaming in real time with a couple of seconds of lag, but the issue is that the "data watermark" stays far behind (I observed about 1.5 hours), and therefore the window is never triggered and GroupByKey does not work.
If I omit .withTimestampAttribute(...) when reading from PubSub, everything seems to work correctly, but I have a lag on my timestamp, which causes messages to be assigned to a later window in many cases.
I found a workaround by triggering in processing time instead of event time, but I haven't assessed if this is a real solution:
.triggering(AfterProcessingTime.pastFirstElementInPane().alignedTo(Duration.standardMinutes(1)).plusDelayOf(Duration.standardMinutes(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
The question is: how can I make sure the watermark is updated when reading from PubSub using the timestamp attribute?
Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se. However, one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), which is a very good fit for that.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: same way as you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs group-by-window implicitly as part of every aggregating operation).
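To make the 14:03 example concrete, here is a small stand-alone sketch of the window assignment arithmetic. It is not Beam code and does not use the Beam API; the function name sliding_windows is made up for illustration, and it assumes windows are aligned to the Unix epoch.

```python
from datetime import datetime, timedelta


def sliding_windows(event_time, size, period):
    """Return the [start, end) windows of length `size`, one starting
    every `period`, that contain `event_time` (epoch-aligned)."""
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that begins at or before event_time.
    start = event_time - (event_time - epoch) % period
    windows = []
    # Walk backwards while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


event = datetime(2018, 2, 5, 14, 3)
for start, end in sliding_windows(event, timedelta(minutes=15), timedelta(minutes=5)):
    print(f"{start:%H:%M}..{end:%H:%M}")
# 13:50..14:05
# 13:55..14:10
# 14:00..14:15
```

The record is not copied or routed by your code; the sketch only shows which aggregation scopes a single timestamp contributes to.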
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).
I'm graphing my power meter with an old laptop in my barn.
This sends data using mqtt to mrtg(cacti)
Lately this laptop has begun to lock up when playing Spotify.
This is a separate issue.
However, when I reboot, all the power used in the mean time is shown as being used in a single time period, giving a huge spike, so the rest of the data is hardly visible.
Is it possible, when the data finally arrives, to interpolate it over all the missing datapoints?
The laptop sending data was down between Sat 18:00 and Sun 11:00 approx., but of course the real power meter keeps running.
I'd rather have a straight line between the two datapoints, it is still loss of data, but is more true than a spike.
Edit: Complication: as Cacti reads the data asynchronously from mqtt, it keeps getting the latest count even if the data is stale.
I guess I need to get my mqtt->cacti interface to send NaN or U if the timestamp of the data has not changed.
You have 2 options.
Add a timestamp to the message. That way you can rebuild the data as the queued messages are delivered when the laptop reconnects to the broker.
Use QoS 0 subscriptions and ensure that clean session is set to true. This means the missing readings are dropped. Zero data is probably easier to interpret from the graph than a large spike.
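If you go with option 1 and have a timestamped meter reading from just before and just after the outage, the "straight line" the question asks for is plain linear interpolation. The sketch below is only the arithmetic: the function name fill_gap and the (unix_time, counter) tuple format are assumptions, and how to feed the resulting points into Cacti is not covered here.

```python
def fill_gap(before, after, step_s):
    """Linearly interpolate counter readings across an outage.

    before/after: (unix_time_s, counter) readings on either side of the gap.
    Returns evenly spaced (time, counter) points strictly between them.
    """
    (t0, v0), (t1, v1) = before, after
    points = []
    t = t0 + step_s
    while t < t1:
        fraction = (t - t0) / (t1 - t0)
        points.append((t, v0 + fraction * (v1 - v0)))
        t += step_s
    return points


# Counter went from 100.0 to 500.0 over a 400 s gap; fill in every 100 s.
print(fill_gap((0, 100.0), (400, 500.0), 100))
# [(100, 200.0), (200, 300.0), (300, 400.0)]
```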
We're trying to use dataflow's processing-time independence to start up a new streaming job and replay all of our data into it via Pub/Sub but are running into the following problem:
The first stage of the pipeline is a group-by on a transaction ID, with a session window of 10s, discarding fired panes and no allowed lateness. So if we don't specify the timestampLabel of our replay pub/sub topic, then when we replay into pub/sub all of the event timestamps are the same, and the group-by tries to group all of our archived data into transaction IDs for all time. No good.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay say 1d at a time into the pub/sub topic then it works for the first day's worth of events, but then as soon as those are exhausted the data watermark for the replay pub/sub somehow jumps forward to the current time, and all subsequent replayed days are dropped as late data. I don't really understand why that happens, as it seems to violate the idea that dataflow logic is independent of the processing time.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay all of it into the pub/sub topic, and then start the streaming job to consume it, the data watermark never seems to advance, and nothing ever seems to come out of the groupby. I don't really understand what's going on with that either.
Your approaches #2 and #3 are suffering from different issues:
Approach #3 (write all data, then start consuming): Since data is written to the pubsub topic out of order, the watermark really cannot advance until all (or most) of the data is consumed. The watermark is a soft guarantee that "further items that you receive are unlikely to have an event time earlier than this", but due to out-of-order publishing there is no correspondence whatsoever between publish time and event time. So your pipeline is effectively stuck until it finishes processing all this data.
Approach #2: technically it suffers from the same problem within each particular day, but I suppose the amount of data within 1 day is not that large, so the pipeline is able to process it. However, after that, the pubsub channel stays empty for a long time, and in that case the current implementation of PubsubIO will advance the watermark to real time, that's why further days of data are declared late. The documentation explains this some more.
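The approach #2 failure can be reproduced with a deliberately crude toy model. This is not how PubsubIO estimates its watermark; the sketch only encodes the one rule described above (an idle subscription makes the watermark jump to real time) and borrows a two-minute idle threshold as an assumption. All names and numbers are illustrative.

```python
IDLE_TIMEOUT_S = 120  # assumed idle threshold (two minutes)


def late_events(batches):
    """batches: list of (arrival_time_s, [event_time_s, ...]), by arrival.

    Returns the event times this toy model would drop as late,
    assuming an allowed lateness of zero as in the question.
    """
    watermark = float("-inf")
    last_arrival = None
    late = []
    for arrival, event_times in batches:
        if last_arrival is not None and arrival - last_arrival > IDLE_TIMEOUT_S:
            # The subscription sat empty too long, so by the time this
            # batch shows up the watermark has jumped to real time.
            watermark = max(watermark, arrival)
        late.extend(t for t in event_times if t < watermark)
        last_arrival = arrival
    return late


now = 1_000_000  # arbitrary wall-clock origin, in seconds
batches = [
    (now, [0, 100, 200]),     # day 1, first chunk of historic events
    (now + 60, [300, 400]),   # day 1, next chunk, no idle gap
    (now + 600, [500, 600]),  # day 2, replayed after a 9-minute pause
]
print(late_events(batches))  # [500, 600]
```

As long as the gaps between replayed chunks stay under the idle threshold, nothing in this model is dropped.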
In general, quickly catching up with a large backlog, e.g. by using historic data to "seed" the pipeline and then continuing to stream in new data, is an important use case that we currently don't support well.
Meanwhile I have a couple of recommendations for you:
(better) Use a variation on approach #2, but try timing it against the streaming pipeline so that the pubsub channel doesn't stay empty.
Use approach #3, but with more workers and more disk per worker (your current job appears to be using autoscaling with max 8 workers; try something much larger, like 100. It will downscale after it catches up).