Processing with State and Timers - google-cloud-dataflow

Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0), such as limits on state size or update frequency? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.

Here is some general advice for your use case:
Aggregate multiple elements and then set a timer; don't create a timer per element, which would be excessive.
Aggregate state instead of accumulating a large amount of it, e.g. keep a running sum and count rather than storing every number when computing a mean (see the sketch below).
Consider session windows for this use case.
Note that state is not supported for merging windows on the Dataflow runner, though it is in the Beam model.
Choose state types according to your access pattern, e.g. BagState for blind writes.
Here is an informative blog post with more detail on state: "Stateful processing with Apache Beam."
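To make the aggregate-then-fire advice concrete, here is a minimal sketch of a stateful DoFn in Beam Java that keeps a running count per key and flushes it with a single processing-time timer per key. The element types, the 60-second delay, and the key/count state layout are illustrative assumptions, not a prescribed implementation:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Input is keyed by a session/user id. We keep only an aggregated count per key
// and a single processing-time timer per key, instead of a timer per element.
class SessionCountFn extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("key") ValueState<String> key,
      @StateId("count") ValueState<Long> count,
      @TimerId("flush") Timer flush) {
    key.write(c.element().getKey());
    Long current = count.read();
    count.write((current == null ? 0L : current) + c.element().getValue());
    // Aggregate first, then (re)arm one timer 60 seconds from now for the whole key.
    flush.offset(Duration.standardSeconds(60)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext c,
      @StateId("key") ValueState<String> key,
      @StateId("count") ValueState<Long> count) {
    Long total = count.read();
    if (total != null) {
      c.output(KV.of(key.read(), total));
    }
    count.clear();
    key.clear();
  }
}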

Related

Azure Durable Function getting slower and slower over time

My Azure Durable Function (Runtime V3) handles an average of 3M events per day. After it has run for two or three weeks, it gets slower and slower. When I remove the two table storages (History & Instances) used by the Durable Functions framework, performance recovers and it works as expected. I host my function app on the consumption plan. Inside my function app I also use Durable Entities, and my code uses sub-orchestrators for the fan-out mechanism.
Is this problem expected with heavy workloads? Do I need to clear those table storages from time to time, or do I need to delete the state of completed entities inside my Durable Entity function?
Someone, please help me
Yes, you should perform periodic clean-ups yourself by calling the PurgeInstanceHistoryAsync method. See a similar post on how to do this: https://stackoverflow.com/a/60894392
Also review any loops or Monitor patterns that you may have in your code.
Any looping logic (like foreach, for, or while loops) will replay from the initial startup state. While the Durable Functions replay architecture is very efficient at doing this, the code we write may not be optimised for repetitive iterations.
The Durable Monitor pattern is almost an anti-pattern. The concept is OK, but it is easily misinterpreted and open to abuse. It is designed for a low-frequency loop that polls an endpoint either for a set number of iterations, or up until a finite time, or of course until the state of the endpoint being monitored has changed. That state change will be the trigger to perform the rest of the operation.
It is NOT an example of how to use general or high-frequency looping structures in Durable Functions.
It is NOT an example of how to implement a traditional HTTP endpoint-polling monitor in an infinite-loop (while(true)) style, perhaps to record changes into a data store over time.
If your durable function logic has an iterator that may involve many iterations, consider migrating the iteration step to a sub-orchestration that uses the Eternal Orchestration pattern.

Latency Monitoring in Flink application

I'm looking for help regarding latency monitoring (Flink 1.8.0).
Let's say I have a simple streaming data flow with the following operators:
FlinkKafkaConsumer -> Map -> print.
If I want to measure the latency of record processing in my dataflow, what would be the best approach?
I want to measure the time from when an input record is received at the source until it is processed by the sink / the sink operation finishes.
I've added this to my code: env.getConfig().setLatencyTrackingInterval(100);
The latency metrics then become available, but I don't understand what exactly they are measuring. Also, the average latency values don't seem to correspond to latency as I understand it.
I've also tried using Codahale metrics to measure the duration of some methods, but that doesn't give me the latency of a record processed through my whole pipeline.
Is the solution related to LatencyMarker? If so, how can I access it in my sink operator in order to retrieve it?
Thanks,
Roey.
-- copying my answer from the mailing list for future reference
Hi Roey,
with Latency Tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (by default one histogram per source operator in each non-source operator, see metrics.latency.granularity).
LatencyMarkers are injected periodically at the sources and flow through the topology. They cannot overtake regular records. LatencyMarkers pass through functions (user code) without any delay. This means the latencies measured by latency tracking will only reflect a part of the end-to-end latency, in particular in non-backpressure scenarios. In backpressure scenarios latency markers will queue up before the slowest operator (as they cannot overtake records) and the latency will better reflect the real latency in the pipeline. In my opinion, latency markers are not the right tool to measure the "user-facing/end-to-end latency" in a Flink application. For me this is a debugging tool to find sources of latency or congested channels.
I suggest that instead of using latency tracking you add a histogram metric in the sink operator yourself, which records the difference between the current processing time and the event time, giving you a distribution of the event-time lag at the sink. If you do the same in the source (and any other points of interest) you will get a good picture of how the event-time lag changes over time.
Hope this helps.
Cheers,
Konstantin
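To illustrate the suggestion above, here is a minimal sketch of such an event-time-lag histogram in a map operator placed just before the sink. MyEvent and getEventTimestamp() are hypothetical names for your record type; DescriptiveStatisticsHistogram ships with Flink, and you could equally wrap a Codahale histogram via flink-metrics-dropwizard:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram;

// Records the difference between processing time and event time for every record.
public class EventTimeLagMeter extends RichMapFunction<MyEvent, MyEvent> {

    private transient Histogram eventTimeLag;

    @Override
    public void open(Configuration parameters) {
        // Histogram over the last 500 observed lags; tune the window size as needed.
        eventTimeLag = getRuntimeContext()
                .getMetricGroup()
                .histogram("eventTimeLag", new DescriptiveStatisticsHistogram(500));
    }

    @Override
    public MyEvent map(MyEvent event) {
        // Event-time lag = wall-clock time minus the record's event timestamp (ms).
        eventTimeLag.update(System.currentTimeMillis() - event.getEventTimestamp());
        return event;
    }
}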

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I'm about to start developing programs with Google Cloud Pub/Sub and just wanted to confirm this.
From the Beam documentation, data loss can only occur if data is declared late. Is it safe to assume that data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have come to the conclusion that these are only critical when custom windowing with event-time-based triggers is applied to the incoming data.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer define data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
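As a rough illustration of that approach in Beam Java, a global window with a repeated processing-time trigger might look like the following (the Event type, the events PCollection, and the 30-second delay are placeholders):

import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Snapshot whatever has arrived, in processing time, roughly every 30 seconds.
events.apply(Window.<Event>into(new GlobalWindows())
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(30))))
    .withAllowedLateness(Duration.ZERO) // required whenever a trigger is set explicitly
    .discardingFiredPanes());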

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing multiple event records, and each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to signal per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is broken and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
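For illustration, here is a minimal sketch of such a counting stateful DoFn in Beam Java, assuming each event carries the total number of events expected for its file (the Event type and getExpectedCount() accessor are hypothetical):

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Keyed by file_id; emits the file_id once the expected number of elements has been seen.
class FileCompletionFn extends DoFn<KV<String, Event>, String> {

  @StateId("seen")
  private final StateSpec<ValueState<Long>> seenSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void process(ProcessContext c, @StateId("seen") ValueState<Long> seen) {
    Long current = seen.read();
    long updated = (current == null ? 0L : current) + 1;
    seen.write(updated);
    // Output a completion marker for this file once every expected element has arrived.
    if (updated == c.element().getValue().getExpectedCount()) {
      c.output(c.element().getKey());
    }
  }
}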

How can you replay old data into dataflow via pub/sub and maintain correct event time logic?

We're trying to use dataflow's processing-time independence to start up a new streaming job and replay all of our data into it via Pub/Sub but are running into the following problem:
The first stage of the pipeline is a groupby on a transaction id, with a session window of 10s discarding fired panes and no allowed lateness. So if we don't specify the timestampLabel of our replay pub/sub topic then when we replay into pub/sub all of the event timestamps are the same and the groupby tries to group all of our archived data into transaction id's for all time. No good.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay say 1d at a time into the pub/sub topic then it works for the first day's worth of events, but then as soon as those are exhausted the data watermark for the replay pub/sub somehow jumps forward to the current time, and all subsequent replayed days are dropped as late data. I don't really understand why that happens, as it seems to violate the idea that dataflow logic is independent of the processing time.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay all of it into the pub/sub topic, and then start the streaming job to consume it, the data watermark never seems to advance, and nothing ever seems to come out of the groupby. I don't really understand what's going on with that either.
Your approaches #2 and #3 are suffering from different issues:
Approach #3 (write all data, then start consuming): since data is written to the Pub/Sub topic out of order, the watermark really cannot advance until all (or most) of the data is consumed, because the watermark is a soft guarantee that "further items you receive are unlikely to have an event time earlier than this", but due to out-of-order publishing there is no correspondence whatsoever between publish time and event time. So your pipeline is effectively stuck until it finishes processing all of this data.
Approach #2: technically it suffers from the same problem within each particular day, but I suppose the amount of data within 1 day is not that large, so the pipeline is able to process it. However, after that, the pubsub channel stays empty for a long time, and in that case the current implementation of PubsubIO will advance the watermark to real time, that's why further days of data are declared late. The documentation explains this some more.
In general, quickly catching up with a large backlog, e.g. by using historic data to "seed" the pipeline and then continuing to stream in new data, is an important use case that we currently don't support well.
Meanwhile I have a couple of recommendations for you:
(better) Use a variation on approach #2, but try timing it against the streaming pipeline so that the pubsub channel doesn't stay empty.
Use approach #3, but with more workers and more disk per worker (your current job appears to be using autoscaling with max 8 workers - try something much larger, like 100? It will downscale after it catches up)
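For reference, wiring up the timestamp attribute in current Beam Java looks roughly like this (the attribute name, topic, and pipeline variable are placeholders; in the older Dataflow 1.x SDK the same option is called timestampLabel):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;

// Read replayed messages, taking each element's event time from a message attribute
// that the publisher sets to the archived event's original timestamp.
pipeline.apply(PubsubIO.readStrings()
    .fromTopic("projects/my-project/topics/replay")
    .withTimestampAttribute("event_ts"));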
