Latency monitoring in a Flink application

I'm looking for help regarding latency monitoring (flink 1.8.0).
Let's say I have a simple streaming data flow with the following operators:
FlinkKafkaConsumer -> Map -> print.
If I want to measure the latency of record processing in my dataflow, what would be the best approach?
I want to measure the time from when an input record is received at the source until it has been processed by the sink / the sink operation finishes.
I've added this to my code: env.getConfig().setLatencyTrackingInterval(100);
After that, latency metrics become available, but I don't understand what exactly they are measuring. The average latency values also don't seem to correspond to latency as I understand it.
I've also tried using Codahale metrics to time individual methods, but that doesn't give me the latency of a record processed through my whole pipeline.
Is the solution related to LatencyMarker? If so, how can I access it in my sink operator in order to retrieve it?
Thanks,
Roey.

-- copying my answer from the mailing list for future reference
Hi Roey,
with Latency Tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (by default one histogram per source operator in each non-source operator, see metrics.latency.granularity).
LatencyMarkers are injected periodically at the sources and flow through the topology. They cannot overtake regular records. LatencyMarkers pass through functions (user code) without any delay. This means the latencies measured by latency tracking will only reflect part of the end-to-end latency, in particular in non-backpressure scenarios. In backpressure scenarios, latency markers will queue up in front of the slowest operator (as they cannot overtake records) and the measured latency will better reflect the real latency in the pipeline. In my opinion, latency markers are not the right tool to measure the "user-facing/end-to-end latency" in a Flink application. For me this is a debugging tool to find sources of latency or congested channels.
I suggest that, instead of using latency tracking, you add a histogram metric in the sink operator yourself, which records the difference between the current processing time and the event time, giving you a distribution of the event-time lag at the sink. If you do the same in the source (and at any other points of interest) you will get a good picture of how the event-time lag changes over time.
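For reference, a minimal sketch of that approach in a sink, assuming flink-metrics-dropwizard is on the classpath (since you already use Codahale metrics) and that records carry their event timestamp. The class name EventTimeLagSink and the record type MyEvent are placeholders of mine, not Flink API:

```java
import com.codahale.metrics.SlidingWindowReservoir;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// MyEvent is a hypothetical record type exposing getEventTimestamp() (epoch millis).
public class EventTimeLagSink extends RichSinkFunction<MyEvent> {

    private transient Histogram eventTimeLag;

    @Override
    public void open(Configuration parameters) {
        // Register a histogram on the sink's metric group; it is reported
        // through whatever metrics reporter the job is configured with.
        eventTimeLag = getRuntimeContext().getMetricGroup().histogram(
                "eventTimeLag",
                new DropwizardHistogramWrapper(
                        new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
    }

    @Override
    public void invoke(MyEvent value, Context context) {
        // Event-time lag = wall-clock now minus the event timestamp of the record.
        eventTimeLag.update(System.currentTimeMillis() - value.getEventTimestamp());
        // ... actual sink logic (e.g. print) ...
    }
}
```

The reservoir size of 500 is arbitrary; the point is that the histogram measures exactly the lag you care about, at the point in the pipeline where you care about it.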
Hope this helps.
Cheers,
Konstantin

Related

What actually manages watermarks in Beam?

Beam's big power comes from its advanced windowing capabilities, but it's also a bit confusing.
Having seen some oddities in local tests (I use RabbitMQ for an input Source) where messages were not always getting acked, and fixed windows that were not always closing, I started digging around StackOverflow and the Beam code base.
It seems there are Source-specific concerns with when exactly watermarks are set:
RabbitMQ watermark does not advance: Apache Beam : RabbitMqIO watermark doesn't advance
PubSub watermark does not advance for low volumes: https://issues.apache.org/jira/browse/BEAM-7322
SqsIO does not advance the watermark during periods with no new incoming messages: https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/sqs/SqsIO.java#L44
(and others). Further, there seems to be an independent notion of Checkpoints (CheckpointMarks) as opposed to Watermarks.
So I suppose this is a multi-part question:
What code is responsible for moving the watermark? It seems to be some combination of the Source and the Runner, but I can't seem to actually find it to understand it better (or tweak it for our use cases). This is a particular issue for me, as in periods of low volume the watermark never advances and messages are not acked.
I don't see much documentation around what a Checkpoint/Checkpoint mark is conceptually (the non-code Beam documentation doesn't discuss it). How does a CheckpointMark interact with a Watermark, if at all?
Each PCollection has its own watermark. The watermark indicates how complete that particular PCollection is. The source is responsible for the watermark of the PCollection that it produces. The propagation of watermarks to downstream PCollections is automatic with no additional approximation; it can be roughly understood as "the minimum of input PCollections and buffered state". So in your case, it is RabbitMqIO to look at for watermark problems. I am not familiar with this particular IO connector, but a bug report or email to the user list would be good if you have not already done this.
A checkpoint is a source-specific piece of data that allows it to resume reading without missed messages, as long as the checkpoint is durably persisted by the runner. Message ACK tends to happen in checkpoint finalization, since the runner calls this method when it is known that the message never needs to be re-read.
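To make that relationship concrete, here is a hedged sketch of the shape such a checkpoint mark usually takes (modeled on Beam's UnboundedSource.CheckpointMark interface; the RabbitMQ-flavoured details and the class name ExampleCheckpointMark are illustrative, not the actual RabbitMqIO code):

```java
import java.io.IOException;
import java.io.Serializable;
import java.util.List;
import com.rabbitmq.client.Channel;
import org.apache.beam.sdk.io.UnboundedSource;

// Illustrative checkpoint mark: it remembers the delivery tags read since the
// previous checkpoint and acks them only once the runner has durably persisted
// the checkpoint, so no message can be lost if the reader is restarted.
class ExampleCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {

  private final List<Long> deliveryTagsToAck;
  private transient Channel channel; // broker connection held by the live reader

  ExampleCheckpointMark(List<Long> deliveryTagsToAck, Channel channel) {
    this.deliveryTagsToAck = deliveryTagsToAck;
    this.channel = channel;
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    // Called by the runner once it knows these messages never need to be re-read.
    for (long tag : deliveryTagsToAck) {
      channel.basicAck(tag, false);
    }
  }
}
```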

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I'm about to start developing programs with Google Cloud Pub/Sub and just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have concluded that these are critical when custom windowing with event-time-based triggers is applied to the incoming data.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer define data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
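To illustrate that processing-time style, a global window that simply emits a pane of per-key counts every minute of processing time could look roughly like the snippet below; the events PCollection and the one-minute delay are placeholders of mine:

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Inside pipeline construction; "events" is a placeholder
// PCollection<KV<String, String>> read from Pub/Sub.
PCollection<KV<String, Long>> counts = events
    .apply(Window.<KV<String, String>>into(new GlobalWindows())
        // Fire a pane one minute (processing time) after the first element of the
        // pane arrives, and keep doing so forever; event time plays no role here.
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(Count.perKey());
```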

Processing with State and Timers

Are there any guidelines or limitations for using stateful processing and timers with the Beam Dataflow runner (as of v2.1.0)? Things such as limitations on the size of state or frequency of updates etc.? The candidate streaming pipeline would use state and timers extensively for user session state, with Bigtable as durable storage.
Here is some general advice for your use case:
• Please aggregate multiple elements and then set a timer.
• Please don't create a timer per element; that would be excessive.
• Try to aggregate state instead of accumulating a large amount of state, e.g. aggregate as a sum and a count instead of storing every number when trying to compute a mean.
• Please consider session windows for this use case.
• In Dataflow, state is not supported for merging windows (it is in the Beam model).
• Please use state according to your access pattern, e.g. BagState for blind writes (see the sketch below).
Here is an informative blog post with some more info on state: "Stateful processing with Apache Beam".
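To make the buffering advice above concrete, here is a hedged sketch of a stateful DoFn that blind-writes elements into a BagState and flushes them on a single processing-time timer per key, rather than one timer per element. The class name, element types, and the 30-second flush delay are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class BufferingDoFn extends DoFn<KV<String, String>, List<String>> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext ctx,
                      @StateId("buffer") BagState<String> buffer,
                      @TimerId("flush") Timer flush) {
    // Blind write: append without reading the bag back.
    buffer.add(ctx.element().getValue());
    // One timer per key, pushed out on every element; not one timer per element.
    flush.offset(Duration.standardSeconds(30)).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext ctx,
                      @StateId("buffer") BagState<String> buffer) {
    // Emit everything buffered for this key as one batch, then clear the state.
    List<String> batch = new ArrayList<>();
    for (String value : buffer.read()) {
      batch.add(value);
    }
    if (!batch.isEmpty()) {
      ctx.output(batch);
    }
    buffer.clear();
  }
}
```

Resetting the timer on every element means a key flushes 30 seconds after its last element arrives; if you prefer a fixed cadence, set the timer only once per flush interval (e.g. guarded by a small ValueState flag).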

How to discard data from the first sliding window in Dataflow?

I'd like to recognize and discard incomplete windows (independent of sliding) at the start of pipeline execution. For example:
If I'm counting the number of events hourly and I start at :55 past the hour, then I should expect ~1/12th the value in the first window and then a smooth ramp-up to the "correct" averages.
Since data could be arbitrarily late in a user-defined way, the time you start the pipeline up and the windows that are guaranteed to be missing data might be only loosely connected.
You'll need some out-of-band way of indicating which windows they are. If I were implementing such a thing, I would consider a few approaches, in roughly this order:
• Discarding outliers based on there not being enough data points. This seems robust to lots of data issues, if your data set can tolerate it (a statistician might disagree).
• Discarding outliers based on the data points not being distributed across the window (ditto).
• Discarding outliers based on some characteristic of the result instead of the input (statisticians will be even more likely to say don't do this, since you are already averaging).
• Using a custom pipeline option to indicate a minimum start/end time for interesting windows (sketched below).
One reason to choose a more robust approach than just "start time" is the case of downtime of your data producer or any intermediate system, etc. (even with delivery guarantees, the watermark may have moved on and made all that data droppable).
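A hedged sketch of the last option; the option name, the DoFn, and the hourlyCounts PCollection are all illustrative, not an established pattern from the Beam docs:

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Custom option marking the earliest window start we are willing to trust.
public interface StartupOptions extends PipelineOptions {
  @Description("Windows starting before this epoch-millis timestamp are dropped")
  long getMinWindowStartMillis();
  void setMinWindowStartMillis(long value);
}

// Somewhere after the windowed hourly count ("hourlyCounts" is a placeholder):
PCollection<KV<String, Long>> trimmed = hourlyCounts.apply("DropPartialWindows",
    ParDo.of(new DoFn<KV<String, Long>, KV<String, Long>>() {
      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window) {
        long minStart = ctx.getPipelineOptions()
            .as(StartupOptions.class).getMinWindowStartMillis();
        // Keep only results whose window started at or after the cutoff.
        if (window instanceof IntervalWindow
            && ((IntervalWindow) window).start().getMillis() >= minStart) {
          ctx.output(ctx.element());
        }
      }
    }));
```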

How to find the offset between two audio files? One is noisy and one is clean

I have a scenario in which a user captures a concert scene, recording the performer's real-time audio, while at the same time the device downloads the live stream from the audio broadcaster's device. Later I replace the real-time noisy audio (captured while recording) with the stream I have saved on the phone (good-quality audio). Right now I set the audio offset manually, by trial and error, while merging, so I can sync the audio and video activity at exactly the right position.
Now I want to automate the synchronisation of the audio. Instead of merging the video with the clear audio at a given offset, I want to merge the video with the clear audio automatically, with proper sync.
For that I need to find the offset at which I should replace the noisy audio with the clear audio. E.g. when the user starts and stops the recording, I take that sample of real-time audio, compare it with the live-streamed audio, extract the matching part of that audio, and sync it at the right time.
Does anyone have any idea how to find the offset by comparing the two audio files and syncing with the video?
Here's a concise, clear answer.
• It's not easy - it will involve signal processing and math.
• A quick Google gives me this solution, code included.
• There is more info on the above technique here.
• I'd suggest gaining at least a basic understanding before you try and port this to iOS.
• I would suggest you use the Accelerate framework on iOS for fast Fourier transforms, etc.
• I don't agree with the other answer about doing it on a server - devices are plenty powerful these days. A user wouldn't mind a few seconds of processing for something seemingly magic to happen.
Edit
As an aside, I think it's worth taking a step back for a second. While math and fancy signal processing like this can give great results, and do some pretty magical stuff, there can be outlying cases where the algorithm falls apart (hopefully not often).
What if, instead of getting complicated with signal processing, there's another way? After some thought, there might be. If you meet all the following conditions:
• You are in control of the server component (audio broadcaster device)
• The broadcaster is aware of the 'real audio' recording latency
• The broadcaster and receiver are communicating in a way that allows accurate time synchronisation
...then the task of calculating audio offset becomes reasonably trivial. You could use NTP or some other more accurate time synchronisation method so that there is a global point of reference for time. Then, it is as simple as calculating the difference between audio stream time codes, where the time codes are based on the global reference time.
This could prove to be a difficult problem, as even though the signals are of the same event, the presence of noise makes comparison harder. You could consider running some post-processing to reduce the noise, but noise reduction is itself an extensive, non-trivial topic.
Another problem could be that the signals captured by the two devices may actually differ a lot. For example, the good-quality audio (I guess the output from the live mix console?) will be fairly different from the live version (which I guess is coming out of the on-stage monitors / FOH system, captured by a phone mic?).
Perhaps the simplest possible approach to start would be to use cross correlation to do the time delay analysis.
A peak in the cross correlation function would suggest the relative time delay (in samples) between the two signals, so you can apply the shift accordingly.
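For illustration only, a naive time-domain cross correlation over two mono sample buffers could look like the sketch below (on iOS you would do the equivalent far faster with an FFT via the Accelerate framework mentioned above); the buffer names and the lag limit are placeholders:

```java
public final class OffsetEstimator {

  // Naive O(n * maxLag) cross correlation: slide one signal over the other and
  // return the lag (in samples) with the highest correlation. Real code would
  // normalise the signals and use an FFT-based correlation for speed.
  static int estimateOffsetSamples(float[] clean, float[] noisy, int maxLagSamples) {
    double best = Double.NEGATIVE_INFINITY;
    int bestLag = 0;
    for (int lag = -maxLagSamples; lag <= maxLagSamples; lag++) {
      double sum = 0;
      for (int i = 0; i < clean.length; i++) {
        int j = i + lag;
        if (j >= 0 && j < noisy.length) {
          sum += clean[i] * noisy[j];
        }
      }
      if (sum > best) {
        best = sum;
        bestLag = lag;
      }
    }
    // Positive lag: the common content appears bestLag samples later in
    // "noisy" than in "clean", so shift accordingly before merging.
    return bestLag;
  }
}
```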
I don't know a lot about the subject, but I think you are looking for "audio fingerprinting". Similar question here.
An alternative (and more error-prone) way is running both sounds through a speech-to-text library (or an API) and matching the relevant parts. This would of course not be very reliable: sentences frequently repeat in songs, and the concert may be instrumental.
Also, doing the audio processing on a mobile device may not play well (because of low performance, high battery drain, or both). I suggest you use a server if you go that way.
Good luck.
