I'd like to recognize and discard incomplete windows (independent of sliding) at the start of pipeline execution. For example:
If I'm counting the number of events hourly and I start at :55 past the hour, then I should expect ~1/12th of the value in the first window, followed by a smooth ramp-up to the "correct" averages.
Since data can be arbitrarily late in a user-defined way, the time you start the pipeline and the windows that are guaranteed to be missing data may be only loosely connected.
You'll need some out-of-band way of indicating which windows those are. If I were implementing such a thing, I would consider a few approaches, in roughly this order:
Discarding outliers based on having too few data points. This seems robust to lots of data issues, if your data set can tolerate it (a statistician might disagree).
Discarding outliers based on data points not being distributed across the window (ditto).
Discarding outliers based on some characteristic of the result instead of the input (statisticians will be even more likely to say don't do this, since you are already averaging).
Using a custom pipeline option to indicate a minimum start/end time for interesting windows (a sketch follows below).
One reason to choose more robust approaches than just a "start time" is downtime of your data producer or any intermediate system, etc. (even with delivery guarantees, the watermark may have moved on and made all that data droppable).
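To illustrate the last approach, here is a minimal sketch in Beam's Java SDK. Everything named here is hypothetical scaffolding: a StartupOptions interface holding the cutoff, and a DropIncompleteWindowsFn that filters per-key hourly counts by the window they landed in.

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Hypothetical custom option naming the earliest window end considered complete.
public interface StartupOptions extends PipelineOptions {
  @Description("Windows ending before this instant (epoch millis) are incomplete")
  Long getMinCompleteWindowEndMillis();
  void setMinCompleteWindowEndMillis(Long value);
}

// Drops per-window results whose window ended before the configured cutoff.
class DropIncompleteWindowsFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  private final Instant cutoff;

  DropIncompleteWindowsFn(long cutoffMillis) {
    this.cutoff = new Instant(cutoffMillis);
  }

  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    // maxTimestamp() is the last timestamp that belongs to this window.
    if (!window.maxTimestamp().isBefore(cutoff)) {
      c.output(c.element());
    }
  }
}
```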
I'm looking for help regarding latency monitoring (Flink 1.8.0).
Let's say I have a simple streaming data flow with the following operators:
FlinkKafkaConsumer -> Map -> print.
If I want to measure the latency of record processing in my dataflow, what would be the best approach?
I want to measure the duration from when input is received by the source until it is processed by the sink / the sink operation finishes.
I've added this to my code: env.getConfig().setLatencyTrackingInterval(100);
Latency metrics then become available, but I don't understand what exactly they are measuring. Also, the average latency values don't seem to correspond to the latency as I observe it.
I've also tried using Codahale metrics to measure the duration of some methods, but that doesn't help me get the latency of a record as it passes through my whole pipeline.
Is the solution related to LatencyMarker? If so, how can I access it in my sink operator in order to retrieve it?
Thanks,
Roey.
-- copying my answer from the mailing list for future reference
Hi Roey,
with Latency Tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (by default, one histogram per source operator in each non-source operator; see metrics.latency.granularity).
LatencyMarkers are injected periodically at the sources and flow through the topology. They cannot overtake regular records. LatencyMarkers pass through functions (user code) without any delay. This means the latencies measured by latency tracking only reflect part of the end-to-end latency, particularly in non-backpressure scenarios. In backpressure scenarios, latency markers queue up in front of the slowest operator (as they cannot overtake records), and the measured latency will better reflect the real latency in the pipeline. In my opinion, latency markers are not the right tool to measure the "user-facing/end-to-end latency" of a Flink application. For me, this is a debugging tool to find sources of latency or congested channels.
I suggest that, instead of using latency tracking, you add a histogram metric in the sink operator yourself, which records the difference between the current processing time and the event time, giving you a distribution of the event-time lag at the sink. If you do the same in the source (and at any other points of interest), you will get a good picture of how the event-time lag changes over time.
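For illustration, a minimal sketch of such a sink in Java (Flink 1.8), assuming a hypothetical event type MyEvent that carries its event-time timestamp. It wraps a Codahale histogram, which matches the metrics library you already tried:

```java
import com.codahale.metrics.SlidingWindowReservoir;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical event type carrying its event-time timestamp in epoch millis.
class MyEvent {
    public long eventTime;
    public String payload;
}

public class LagMeasuringSink extends RichSinkFunction<MyEvent> {
    private transient Histogram eventTimeLag;

    @Override
    public void open(Configuration parameters) {
        // Register a histogram metric; it shows up like any other Flink metric.
        eventTimeLag = getRuntimeContext().getMetricGroup().histogram(
            "eventTimeLag",
            new DropwizardHistogramWrapper(
                new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500))));
    }

    @Override
    public void invoke(MyEvent value, Context context) {
        // Event-time lag = processing time at the sink minus the event's own timestamp.
        eventTimeLag.update(System.currentTimeMillis() - value.eventTime);
        // ... actual sink logic (e.g., print) ...
        System.out.println(value.payload);
    }
}
```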
Hope this helps.
Cheers,
Konstantin
I was going to start developing programs with Google Cloud Pub/Sub. I just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have come to the conclusion that these metrics are critical where custom windowing is applied over the data being received, with event-based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to ignore event time completely. Instead, you will take snapshots of your data in processing time (that is, as elements arrive) using triggers. Consequently, you can no longer classify data as "late" (nor as "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, just want to group them according to the order in which they were observed. I would suggest reading this great article on streaming data processing, especially the part under When/Where: Processing-time windows, which includes some nice visuals comparing different windowing strategies.
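As an illustration, a minimal Beam (Java) sketch of this setup, assuming an unbounded PCollection<String> named events read from Pub/Sub; it snapshots the stream once per minute of processing time:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Global window: event time is ignored; panes fire on processing time alone.
PCollection<String> snapshots = events.apply(
    Window.<String>into(new GlobalWindows())
        // Emit a pane one minute (processing time) after the first element arrives,
        // repeatedly, for as long as the stream runs.
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        // "Late" has no meaning here, so no lateness budget is needed.
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());
```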
To our streaming pipeline, we want to submit unique GCS files, each containing information for multiple events, with each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (possibly as multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we can do about it? I understand that Dataflow provides exactly-once guarantees, which means all events will eventually be processed, but is there a way to set a deterministic trigger to say that all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to signal per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternate perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side input or some special value on the main input). Then it could count the number of elements it has processed, and only output that everything has been processed once it has seen that number of elements.
This assumes that the expected number of elements can be determined ahead of time; a rough sketch of the idea follows.
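A sketch in Beam's Java SDK, under those assumptions. ElementWithCount is hypothetical scaffolding carrying the payload plus the expected count, and the input is assumed to be keyed by file_id:

```java
import java.io.Serializable;
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Hypothetical carrier: the payload plus the expected element count for its file.
class ElementWithCount implements Serializable {
  String payload;
  long expectedCount;
}

// Keyed by file_id; emits the file_id once all its elements have been seen.
class FileCompletionFn extends DoFn<KV<String, ElementWithCount>, String> {

  // Per-key running count of elements processed so far.
  @StateId("seen")
  private final StateSpec<CombiningState<Long, long[], Long>> seenSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("seen") CombiningState<Long, long[], Long> seen) {
    seen.add(1L);
    if (seen.read() >= c.element().getValue().expectedCount) {
      // All elements for this file_id have been processed; notify downstream.
      c.output(c.element().getKey());
    }
  }
}
```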
In Cloud Dataflow, an element may be assigned to multiple windows when using SlidingWindows, which have a size and a period. Suppose we have a SlidingWindow with a large size and a very small period; the elements in two adjacent windows would then be almost identical, differing only by the sliding step.
So would a computation over each SlidingWindow simply load all the elements in that window and run over them? Or could adjacent windows reuse some computed results to avoid duplicate computation? And is an element copied when it is assigned to multiple windows?
Dataflow does not have any special handling for SlidingWindows like this. The element occurs in every window to which it is assigned.
We typically haven't found performance problems using regular SlidingWindows with a CombineFn afterwards. We would suggest trying that first, and following up with more details on what you're trying to compute and the specifics of your windowing if you're having problems.
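For example, a minimal sketch in Beam's Java SDK, assuming a PCollection<Double> named values with event timestamps already attached. The CombineFn (here Mean) is what keeps this cheap: the runner maintains only an accumulator per window rather than buffering every element in every overlapping window:

```java
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<Double> slidingAverages = values
    .apply(Window.<Double>into(
        // One-hour windows, a new one starting every five minutes:
        // each element lands in up to 12 overlapping windows.
        SlidingWindows.of(Duration.standardHours(1))
            .every(Duration.standardMinutes(5))))
    // Mean combines per window; withoutDefaults() skips empty windows.
    .apply(Mean.<Double>globally().withoutDefaults());
```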
Automatically doing this as an optimization doesn't work well in the presence of user-defined windowing, triggering, out-of-order data, and other optimizations already present in the system.
On each new bar/tick, my variable is re-initialized. I am trying to execute one trade per signal; the problem is that once TP (take profit) is achieved, if the same trend continues, another trade is triggered. I am thinking of storing the variable in a text file, so I'm wondering what the best way to handle such a variable would be. Sorry, I don't have code.
MT4 Global Variable objects
While MT4 supports somewhat ghost-like, semi-persistent objects called "Global Variables", which can survive between MT4 Terminal re-runs for several weeks, these ghosts are rather complicated to use for your sketched purpose. The relevant functions are:
GlobalVariableCheck()
GlobalVariableSet()
GlobalVariableSetOnCondition()
GlobalVariableGet()
FileSystem Text-File
While doable, this ought to be a last-resort option only, as it is the slowest and least manageable choice. Once you are running several, tens, or hundreds of MT4 Terminal instances in the same environment, the risk of file-I/O collisions is clearly visible.
Solution?
Try to create and maintain a singleton pattern to avoid multiple re-entries into a trend you have already put one trade into.
Also set up a clear definition of trend reversals, which stops/resets the singleton pattern once a new trend has formed. A sketch using the Global Variable functions above follows.
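A minimal MQL4 sketch of that singleton guard. IsNewBuySignal() and IsTrendReversal() are hypothetical placeholders for your own signal and reversal logic:

```mql4
// The Global Variable acts as the singleton guard: one trade per trend.
// It survives Terminal restarts (Global Variables live ~4 weeks after last use).
string gvName;

int OnInit()
{
   gvName = "EA_TrendTradeTaken_" + Symbol();
   return(INIT_SUCCEEDED);
}

void OnTick()
{
   if(IsNewBuySignal())                            // hypothetical signal check
   {
      if(!GlobalVariableCheck(gvName))
         GlobalVariableSet(gvName, 0.0);           // create the guard, disarmed
      // Atomically flip 0 -> 1: only the first signal of the trend wins.
      if(GlobalVariableSetOnCondition(gvName, 1.0, 0.0))
      {
         // place the one-and-only trade for this trend here
      }
   }
   if(IsTrendReversal())                           // hypothetical reversal check
      GlobalVariableSet(gvName, 0.0);              // re-arm for the next trend
}
```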