If I have a window like this:
.apply(Window
.<String>into(Sessions
.withGapDuration(Duration.standardSeconds(10)))
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))));
And it receives data:
a -> x (timestamp 0) (received at 20)
a -> y (timestamp 1) (received at 21)
(watermark passes 11) (at 22)
I suppose that the window would trigger at times:
20 sec, because of an early firing
21 sec, because of an early firing
22 sec, because of watermark passing the GapDuration
In a ParDo function that I apply over the windowed data, is there a way to distinguish early firing from watermark passing the GapDuration?
According to this stackoverflow question there is no way to get the watermark. If I was able to do that, I could check max(timestamp) < watermark. But since I cannot get the watermark, is there any other way to figure out that a window was triggered by watermark passing it.
You can access the PaneInfo of the elements in a DoFn after the GroupByKey by calling ProcesContext#pane() and use that to determine the Timing. This will allow you to identify if this as an "on time" firing (due to the watermark passing the end of the window) or a speculative/late firing.
Related
I have two streams, stream A and stream B. Both streams contain the same type of event which has an ID and a timestamp. For now, all i want the flink job to do is join the events that have the same ID inside of a window of 1 minute. The watermark is assigned on event.
sourceA = initialSourceA.map(parseToEvent)
sourceB = initialSourceB.map(parseToEvent)
streamA = sourceA
.assignTimestampsAndWatermarks(CustomWatermarkStrategy())
.keyBy(Event.Key)
streamB = sourceB
.assignTimestampsAndWatermarks(CustomWatermarkStrategy())
.keyBy(Event.Key)
streamA
.join(streamB)
.where(Event.Key)
.equalTo(Event.Key)
.window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
.apply(giveMePairOfEvents)
.print()
Inside my test I try to send the following:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
For parallelism = 1, I can see the events from time 0 getting joined together.
However, for parallelism = 2 the print does not display anything getting joined. To figure out the problem, I tried to print the events after the keyBy of each stream and I can see they are all running on the same instance. Placing the print after the watermarking, for obvious reasons, that the events are currently on the different instances.
This leads me to believe that I am somehow doing something incorrectly when it comes to watermarking since for a parallelism higher than 1 it doesn't increase the watermark. So here's a couple of questions i asked myself:
Is it possible that each event has a seperate watermark generator and i have to increase them specifically?
Do I run keyBy first and then watermark so that my events from each stream use the same watermarkgenerator?
Sending another set of events as follows:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
sourceA.send(Event(ID_1, 122 seconds))
sourceB.send(Event(ID_1, 122 seconds))
Ended up sending the joined first events. Further inspection showed that the third set of events used the same watermarkgenerator that the second one didn't use. Something which I am not very clear on why is happening. How can I assign and increase watermarks correctly when using a join function in Flink?
EDIT 1:
The custom watermark generator:
class CustomWaterMarkGenerator(
private val maxOutOfOrderness: Long,
private var currentMaxTimeStamp: Long = 0,
)
: WatermarkGenerator<EventType> {
override fun onEvent(event: EventType, eventTimestamp: Long, output: WatermarkOutput) {
val a = currentMaxTimeStamp.coerceAtLeast(eventTimestamp)
currentMaxTimeStamp = a
output.emitWatermark(Watermark(currentMaxTimeStamp - maxOutOfOrderness - 1));
}
override fun onPeriodicEmit(output: WatermarkOutput?) {
}
}
The watermark strategy:
class CustomWatermarkStrategy(
): WatermarkStrategy<Event> {
override fun createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context?): WatermarkGenerator<Event> {
return CustomWaterMarkGenerator(0)
}
override fun createTimestampAssigner(context: TimestampAssignerSupplier.Context?): TimestampAssigner<Event> {
return TimestampAssigner{ event: Event, _: Long->
event.timestamp
}
}
}
Custom source:
The sourceFunction is currently an rsocket connection that connects to a mockstream where i can send events through mockStream.send(event). The first thing I do with the events is parse them using a map function (from string into my event type) and then i assign my watermarks etc.
Each parallel instance of the watermark generator will operate independently, based solely on the events it observes. Doing the watermarking immediately after the sources makes sense (although even better, in general, is to do watermarking directly in the sources).
An operator with multiple input channels (such as the keyed windowed join in your application) sets its current watermark to the minimum of the watermarks it has received from its active input channels. This has the effect that any idle source instances will cause the watermarks to stall in downstream tasks -- unless those sources explicitly mark themselves as idle. (And FLINK-18934 meant that prior to Flink 1.14 idleness propagation didn't work correctly with joins.) An idle source is a likely suspect in your situation.
One strategy for debugging this sort of problem is to bring up the Flink WebUI and observe the behavior of the current watermark in all of the tasks.
To get more help, please share the rest of the application, or at least the custom source and watermark strategy.
I have a dask dataframe and want to compute some tasks that are independent. Some tasks are faster than others but I'm getting the result of each task after longer tasks have completed.
I created a local Client and use client.compute() to send tasks. Then I use future.result() to get the result of each task.
I'm using threads to ask for results at the same time and measure the time for each result to compute like this:
def get_result(future,i):
t0 = time.time()
print("calculating result", i)
result = future.result()
print("result {} took {}".format(i, time.time() - t0))
client = Client()
df = dd.read_csv(path_to_csv)
future1 = client.compute(df[df.x > 200])
future2 = client.compute(df[df.x > 500])
threading.Thread(target=get_result, args=[future1,1]).start()
threading.Thread(target=get_result, args=[future2,2]).start()
I expect the output of the above code to be something like:
calculating result 1
calculating result 2
result 2 took 10
result 1 took 46
Since the first task is larger.
But instead I got both at the same time
calculating result 1
calculating result 2
result 2 took 46.3046760559082
result 1 took 46.477620363235474
I asume that is because future2 actually computes in the background and finishes before future1, but it waits until future1 is completed to return.
Is there a way I can get the result of future2 at the moment it finishes ?
You do not need to make threads to use futures in an asynchronous fashion - they are already inherently async, and monitor their status in the background. If you want to get results in the order they are ready, you should use as_completed.
However, fo your specific situation, you may want to simply view the dashboard (or use df.visulalize()) to understand the computation which is happening. Both futures depend on reading the CSV, and this one task will be required before either can run - and probably takes the vast majority of the time. Dask does not know, without scanning all of the data, which rows have what value of x.
I'm taking a PCollection of sessions and trying to get average session duration per channel/connection. I'm doing something where my early triggers are firing for each window produced - if 60min windows sliding every 1 minute, an early trigger will fire 60 times. Looking at the timestamps on the outputs, there's a window every minute for 60minutes into the future. I'd like the trigger to fire once for the most recent window so that every 10 seconds I have an average of session durations for the last 60 minutes.
I've used sliding windows before and had the results I expected. By mixing sliding and sessions windows, I'm somehow causing this.
Let me paint you a picture of my pipeline:
First, I'm creating sessions based on active users:
.apply("Add Window Sessions",
Window.<KV<String, String>> into(Sessions.withGapDuration(Duration.standardMinutes(60)))
.withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Group Sessions", Latest.perKey())
Steps after this create a session object, compute session duration, etc. This ends with a PCollection(Session).
I create a KV of connection,duration from the Pcollection(Session).
Then I apply the sliding window and then the mean.
.apply("Apply Rolling Minute Window",
Window. < KV < String, Integer >> into(
SlidingWindows
.of(Duration.standardMinutes(60))
.every(Duration.standardMinutes(1)))
.triggering(
Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10)))
)
)
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes()
)
.apply("Get Average", Mean.perKey())
It's at this point where I'm seeing issues. What I'd like to see is a single output per key with the average duration. What I'm actually seeing is 60 outputs for the same key for each minute into the next 60 minutes.
With this log in a DoFn with C being the ProcessContext:
LOG.info(c.pane().getTiming() + " " + c.timestamp());
I get this output 60 times with timestamps 60 minutes into the future:
EARLY 2017-12-17T20:41:59.999Z
EARLY 2017-12-17T20:43:59.999Z
EARLY 2017-12-17T20:56:59.999Z
(cont)
The log was printed at Dec 17, 2017 19:35:19.
The number of outputs is always window size/slide duration. So if I did 60 minute windows every 5 minutes, I would get 12 output.
I think I've made sense of this.
Sliding windows create a new window with the .every() function. Setting early firings applies to each window so getting multiple firings makes sense.
In order to fit my use case and only output the "current window", I'm checking c.pane().isFirst() == true before outputting results and adjusting the .every() to control the frequency.
We have a DataFlow job that is subscribed to a PubSub stream of events. We have applied sliding windows of 1 hour with a 10 minute period. In our code, we perform a Count.perElement to get the counts for each element and we then want to run this through a Top.of to get the top N elements.
At a high level:
1) Read from pubSub IO
2) Window.into(SlidingWindows.of(windowSize).every(period)) // windowSize = 1 hour, period = 10 mins
3) Count.perElement
4) Top.of(n, comparisonFunction)
What we're seeing is that the window is being applied twice so data seems to be watermarked 1 hour 40 mins (instead of 50 mins) behind current time. When we dig into the job graph on the Dataflow console, we see that there are two groupByKey operations being performed on the data:
1) As part of Count.perElement. Watermark on the data from this step onwards is 50 minutes behind current time which is expected.
2) As part of the Top.of (in the Combine.PerKey). Watermark on this seems to be another 50 minutes behind the current time. Thus, data in steps below this is watermarked 1:40 mins behind.
This ultimately manifests in some downstream graphs being 50 minutes late.
Thus it seems like every time a GroupByKey is applied, windowing seems to kick in afresh.
Is this expected behavior? Anyway we can make the windowing only be applicable for the Count.perElement and turn it off after that?
Our code is something on the lines of:
final int top = 50;
final Duration windowSize = standardMinutes(60);
final Duration windowPeriod = standardMinutes(10);
final SlidingWindows window = SlidingWindows.of(windowSize).every(windowPeriod);
options.setWorkerMachineType("n1-standard-16");
options.setWorkerDiskType("compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
options.setJobName(applicationName);
options.setStreaming(true);
options.setRunner(DataflowPipelineRunner.class);
final Pipeline pipeline = Pipeline.create(options);
// Get events
final String eventTopic =
"projects/" + options.getProject() + "/topics/eventLog";
final PCollection<String> events = pipeline
.apply(PubsubIO.Read.topic(eventTopic));
// Create toplist
final PCollection<List<KV<String, Long>>> topList = events
.apply(Window.into(window))
.apply(Count.perElement()) //as eventIds are repeated
// get top n to get top events
.apply(Top.of(top, orderByValue()).withoutDefaults());
Windowing is not applied each time there is a GroupByKey. The lag you were seeing was likely the result of two issues, both of which should be resolved.
The first was that data that was buffered for later windows at the first group by key was preventing the watermark from advancing, which meant that the earlier windows were getting held up at the second group by key. This has been fixed in the latest versions of the SDK.
The second was that the sliding windows was causing the amount of data to increase significantly. A new optimization has been added which uses the combine (you mentioned Count and Top) to reduce the amount of data.
In other words, can we choose to invoke the handler at 20 times per second, or at 45 times per second, or must it be 60 times per second?
I guess one way to simulate it will be to use the timestamp given to the handler to determine if it is "time for action yet" -- if not, simply return without doing anything?
You can use the frameInterval property to invoke the handler every 60/n frames, where n is a positive integer.
You can't invoke the handler 45 times per second, but the following values are possible: 60, 30, 20, 15, 12, ...