What actually manages watermarks in Beam? - google-cloud-dataflow

Much of Beam's power comes from its advanced windowing capabilities, but they are also a bit confusing.
Having seen some oddities in local tests (I use RabbitMQ as an input Source) where messages were not always getting acked and fixed windows were not always closing, I started digging around StackOverflow and the Beam code base.
It seems there are Source-specific concerns about when exactly watermarks are set:
RabbitMQ watermark does not advance: Apache Beam : RabbitMqIO watermark doesn't advance
PubSub watermark does not advance for low volumes: https://issues.apache.org/jira/browse/BEAM-7322
SQS IO does not advance the watermark during periods with no new incoming messages - https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/sqs/SqsIO.java#L44
(and others). Further, there seem to be independent notions of Checkpoints (CheckpointMarks) as opposed to Watermarks.
So I suppose this is a multi-part question:
What code is responsible for moving the watermark? It seems to be some combination of the Source and the Runner, but I can't actually find it to understand it better (or tweak it for our use cases). This is a particular issue for me, as in periods of low volume the watermark never advances and messages are not acked.
I don't see much documentation about what a Checkpoint (CheckpointMark) is conceptually (the non-code Beam documentation doesn't discuss it). How does a CheckpointMark interact with a Watermark, if at all?

Each PCollection has its own watermark. The watermark indicates how complete that particular PCollection is. The source is responsible for the watermark of the PCollection it produces. Propagation of watermarks to downstream PCollections is automatic, with no additional approximation; it can be roughly understood as "the minimum over input PCollections and buffered state". So in your case, RabbitMqIO is the place to look for watermark problems. I am not familiar with this particular IO connector, but a bug report or an email to the user list would be good if you have not already done this.
A checkpoint is a source-specific piece of data that allows the source to resume reading without missed messages, as long as the checkpoint is durably persisted by the runner. Message ACKs tend to happen in checkpoint finalization, since the runner calls that method once it is known that the messages never need to be re-read.
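For illustration, a minimal sketch of what such a source-specific checkpoint can look like. This is not the actual RabbitMqIO code; the delivery-tag bookkeeping and the basicAck call are assumptions used only to show where the ACK belongs. (The watermark itself is reported separately, by the reader's getWatermark() method.)

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

// Sketch: remembers which deliveries have been read since the last checkpoint
// so they can be ACKed once the runner has durably persisted the checkpoint.
class MyCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {

  private final List<Long> pendingDeliveryTags = new ArrayList<>();

  void recordDelivery(long deliveryTag) {
    pendingDeliveryTags.add(deliveryTag);
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Called by the runner once the checkpoint is durably committed, i.e. these
    // messages will never need to be re-read, so it is now safe to ACK them,
    // e.g. channel.basicAck(deliveryTag, false) for each pending tag (hypothetical call).
    pendingDeliveryTags.clear();
  }
}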

Related

Latency Monitoring in Flink application

I'm looking for help regarding latency monitoring (Flink 1.8.0).
Let's say I have a simple streaming data flow with the following operators:
FlinkKafkaConsumer -> Map -> print.
If I want to measure the latency of record processing in my dataflow, what would be the best approach?
I want to measure the time from when an input record is received at the source until it has been processed by the sink (the sink operation has finished).
I've added my code: env.getConfig().setLatencyTrackingInterval(100);
Latency metrics then become available, but I don't understand what exactly they are measuring. The average latency values also don't seem to correspond to latency as I understand it.
I've also tried using Codahale metrics to measure the duration of some methods, but that doesn't help me get the latency of a record as it passes through my whole pipeline.
Is the solution related to LatencyMarker? If yes, how can I access it in my sink operation in order to retrieve the latency?
Thanks,
Roey.
-- copying my answer from the mailing list for future reference
Hi Roey,
with Latency Tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (by default, one histogram per source operator in each non-source operator; see metrics.latency.granularity).
LatencyMarkers are injected periodically at the sources and flow through the topology. They cannot overtake regular records. LatencyMarkers pass through functions (user code) without any delay. This means the latencies measured by latency tracking only reflect part of the end-to-end latency, particularly in non-backpressure scenarios. In backpressure scenarios, latency markers queue up in front of the slowest operator (as they cannot overtake records), so the measured latency better reflects the real latency of the pipeline. In my opinion, latency markers are not the right tool to measure the "user-facing/end-to-end latency" of a Flink application. For me, this is a debugging tool to find sources of latency or congested channels.
I suggest that, instead of using latency tracking, you add a histogram metric to the sink operator yourself that records the difference between the current processing time and the event time, giving you a distribution of the event-time lag at the sink. If you do the same in the source (and any other points of interest), you will get a good picture of how the event-time lag changes over time.
Hope this helps.
Cheers,
Konstantin
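For reference, a rough sketch of that suggestion in code. MyEvent and its eventTimestamp field are hypothetical, and DropwizardHistogramWrapper requires the flink-metrics-dropwizard dependency; treat this as an outline under those assumptions rather than a definitive implementation.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical event type, assumed to carry its event-time timestamp in epoch millis.
class MyEvent {
  long eventTimestamp;
}

class EventTimeLagSink extends RichSinkFunction<MyEvent> {

  private transient Histogram eventTimeLag;

  @Override
  public void open(Configuration parameters) {
    // Register a histogram on the sink's metric group.
    eventTimeLag = getRuntimeContext().getMetricGroup().histogram(
        "eventTimeLag",
        new DropwizardHistogramWrapper(
            new com.codahale.metrics.Histogram(
                new com.codahale.metrics.SlidingWindowReservoir(500))));
  }

  @Override
  public void invoke(MyEvent value, Context context) {
    // Event-time lag: how far behind the wall clock this record's event time is
    // by the time the record reaches the sink.
    eventTimeLag.update(System.currentTimeMillis() - value.eventTimestamp);
  }
}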

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I am about to start developing with Google Cloud Pub/Sub and just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have come to the conclusion that these are only critical when custom windowing with event-time-based triggers is applied to the incoming data.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you take snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer consider data "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows, which includes some nice visuals comparing different windowing strategies.
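For example, a minimal Beam snippet along those lines might look like the following, where the subscription name and the one-minute trigger delay are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class GlobalWindowSnapshots {
  static PCollection<String> apply(Pipeline pipeline) {
    return pipeline
        .apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))  // placeholder
        .apply(Window.<String>into(new GlobalWindows())
            // Panes are emitted on a processing-time trigger; event time is ignored,
            // so nothing is ever considered "late".
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .discardingFiredPanes());
  }
}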

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing multiple events, and each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that, since the data is shuffled by device_id and then grouped at the end by file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say that all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is broken and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally, outside the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see whether we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side input or some special value on the main input). Then it could count the number of elements it has processed, and only output that everything has been processed once it has seen that number of elements.
This assumes that the expected number of elements can be determined ahead of time, etc.
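A rough sketch of that idea, assuming each element carries the total number of events in its source file (FileEvent and its expectedCount field are made up for illustration; in practice this could also come from a side input keyed by file_id):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical element type.
class FileEvent {
  long expectedCount;  // total number of events contained in the originating GCS file
}

// Stateful DoFn keyed by file_id: counts elements seen per key and emits the
// file_id once the expected count has been reached.
class FileCompleteFn extends DoFn<KV<String, FileEvent>, String> {

  @StateId("seenCount")
  private final StateSpec<ValueState<Long>> seenCountSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      @Element KV<String, FileEvent> element,
      @StateId("seenCount") ValueState<Long> seenCount,
      OutputReceiver<String> out) {
    long seen = (seenCount.read() == null ? 0L : seenCount.read()) + 1;
    seenCount.write(seen);
    if (seen >= element.getValue().expectedCount) {
      out.output(element.getKey());  // signal that this file_id has been fully processed
    }
  }
}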

Initial state for a dataflow job

I'm trying to figure out how we "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages and want to emit a running count of messages for each topic for all time, so we have a streaming Dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical count to inform the outputs from the stream source. So we kind of need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that: reads over the file until it's exhausted and then starts reading from the stream. Not much fun, because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream, and somehow combines the two. This seems to make some sense, but I'm not sure how to ensure that the streaming job reads everything from the state source, to initialise itself, before reading from the data stream.
Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second solution: not sure how to make sure one stream is "consumed" first.
EDIT: The latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the Pub/Sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (it needs to support either updates or retractions), so I'd be interested to know what other solutions people have for seeding their window state.
You can do what you suggested in bullet point 2: run two pipelines (in the same main), with the first one populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
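A rough sketch of how that could look, with the bucket path and topic name as placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class BackfillHistory {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Pipeline 1 (batch): replay the historical file into the same Pub/Sub topic
    // that the streaming pipeline reads from.
    Pipeline backfill = Pipeline.create(options);
    backfill
        .apply("ReadHistory", TextIO.read().from("gs://my-bucket/history/*"))          // placeholder path
        .apply("PublishHistory",
            PubsubIO.writeStrings().to("projects/my-project/topics/forum-messages"));  // placeholder topic
    backfill.run().waitUntilFinish();

    // Pipeline 2 (streaming): the existing job reading the same topic, e.g. via
    // PubsubIO.readStrings().fromTopic("projects/my-project/topics/forum-messages"),
    // then sees the replayed history followed by live traffic.
  }
}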

How do I set the inputBuffer of an akka stream publisher?

I'm using Akka Streams in a context where sinks for a single source will come and go. For this reason I'm creating a publisher from a source and attaching subscribers as the need arises:
val publisher = mySource.runWith(Sink.publisher(true))
with
publisher.subscribe(subscriber1) // There will be others
Some of the subscribers will be faster than others, and I'd like to allow the faster ones to go ahead independently of the slowest, at least to the extent permitted by the input buffer of the publisher. This buffer is described by the comment on the Sink.publisher(true) method:
If fanout is true, the materialized Publisher will support multiple Subscribers and the size of the inputBuffer configured for this stage becomes the maximum number of elements that the fastest [[org.reactivestreams.Subscriber]] can be ahead of the slowest one before slowing the processing down due to back pressure.
My problem is that I don't know how to set this inputBuffer value "for this stage". The closest I have seen is described in the Dropping Broadcast section of this article, but that seems to insist on the use of the Flow DSL. I believe I can't use the DSL because of my need to continually attach new Subscribers.
As a result, my overall stream rate is held back by the slowest subscriber. A related aspect of what I am trying to do is making sure the different subscribers run on different threads (without creating explicit actors as subscribers).
It'd look something like (for Akka Streams 2.0.1):
Sink.asPublisher(true).addAttributes(Attributes.inputBuffer(initialSize, maxSize))
