Apache Beam - Sliding Windows outputting multiple windows - google-cloud-dataflow

I'm taking a PCollection of sessions and trying to get the average session duration per channel/connection. My early triggers are firing once for each window produced: with 60-minute windows sliding every 1 minute, an early trigger fires 60 times. Looking at the timestamps on the outputs, there's a window every minute for 60 minutes into the future. I'd like the trigger to fire once for the most recent window, so that every 10 seconds I have an average of session durations for the last 60 minutes.
I've used sliding windows before and got the results I expected. By mixing sliding and session windows, I'm somehow causing this.
Let me paint you a picture of my pipeline:
First, I'm creating sessions based on active users:
.apply("Add Window Sessions",
Window.<KV<String, String>> into(Sessions.withGapDuration(Duration.standardMinutes(60)))
.withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Group Sessions", Latest.perKey())
Steps after this create a session object, compute session duration, etc. This ends with a PCollection<Session>.
I create a KV of connection,duration from the PCollection<Session>, roughly as sketched below.
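A hypothetical sketch of that mapping (Session, getConnection(), and getDurationSeconds() are assumed names; only the KV<String, Integer> shape feeding the next step matters):
.apply("To Connection/Duration", ParDo.of(
    new DoFn<Session, KV<String, Integer>>() {
      // Map each completed Session to KV<connection, duration in seconds>.
      @ProcessElement
      public void processElement(ProcessContext c) {
        Session session = c.element();
        c.output(KV.of(session.getConnection(), session.getDurationSeconds()));
      }
    }))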
Then I apply the sliding window and then the mean.
.apply("Apply Rolling Minute Window",
Window. < KV < String, Integer >> into(
SlidingWindows
.of(Duration.standardMinutes(60))
.every(Duration.standardMinutes(1)))
.triggering(
Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10)))
)
)
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes()
)
.apply("Get Average", Mean.perKey())
It's at this point that I'm seeing issues. What I'd like is a single output per key with the average duration. What I'm actually seeing is 60 outputs for the same key, one for each minute into the next 60 minutes.
With this log statement in a DoFn, where c is the ProcessContext:
LOG.info(c.pane().getTiming() + " " + c.timestamp());
I get this output 60 times with timestamps 60 minutes into the future:
EARLY 2017-12-17T20:41:59.999Z
EARLY 2017-12-17T20:43:59.999Z
EARLY 2017-12-17T20:56:59.999Z
(cont)
The log was printed at Dec 17, 2017 19:35:19.
The number of outputs is always window size / slide duration. So with 60-minute windows sliding every 5 minutes, I get 12 outputs.

I think I've made sense of this.
SlidingWindows creates one window per .every() period, and the early-firing trigger applies to each of those windows independently, so multiple firings are expected: with a 60-minute window sliding every minute, each element lands in 60 overlapping windows, hence 60 early panes.
To fit my use case and only output the "current" window, I check c.pane().isFirst() == true before outputting results and adjust .every() to control the frequency, as in the sketch below.
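A minimal sketch of that filter, assuming the KV<String, Double> output of Mean.perKey() from the pipeline above:
.apply("Keep First Pane", ParDo.of(
    new DoFn<KV<String, Double>, KV<String, Double>>() {
      // Emit only the first pane fired for each window, so each trigger
      // cycle yields one "current" result instead of one per overlapping window.
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.pane().isFirst()) {
          c.output(c.element());
        }
      }
    }))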

Related

waitForCompletion(timeout) in Abaqus API does not actually kill the job after timeout passes

I'm doing a parametric sweep of some Abaqus simulations, so I'm using the waitForCompletion() function to prevent the script from moving on prematurely. However, occasionally the combination of parameters causes the simulation to hang on one or two of the parameters in the sweep for something like half an hour to an hour, whereas most parameter combos take only ~10 minutes. I don't need all the data points, so I'd rather sacrifice one or two results to power through more simulations in that time. Thus I tried to use waitForCompletion(timeout) as documented here. But it doesn't work: it ends up functioning just like an indefinite waitForCompletion(), regardless of how low I set the wait time. I am using Abaqus 2017, and I was wondering if anyone else has gotten this function to work, and if so, how?
While I could use a workaround like adding a custom timeout function and using the kill() function on the job, I would prefer to use the built-in functionality of the Abaqus API, so any help is much appreciated!
It seems that, starting from a certain version, the timeOut optional argument was removed from this method: compare the "Scripting Reference Manual" entry in the documentation of v6.7 and v6.14.
You have a few options:
From Abaqus API: Checking if the my_abaqus_script.023 file still exists during simulation:
import os, time

timeOut = 600
total_time = 60
time.sleep(60)
# wait until the job is completed or the timeout is exceeded
while os.path.isfile('my_job_name.023'):
    if total_time > timeOut:
        my_job.kill()  # kill the hung job once the timeout is exceeded
        break
    total_time += 60
    time.sleep(60)
From outside: launching the job using the subprocess module.
Note: don't use the interactive keyword in your command, because it blocks the execution of the script while the simulation process is active.
import subprocess, os, time

my_cmd = 'abaqus job=my_abaqus_script analysis cpus=1'
log_file = open('my_study.log', 'w')
err_file = open('my_study.err', 'w')
proc = subprocess.Popen(
    my_cmd,
    cwd=my_working_dir,
    stdout=log_file,  # Popen expects file objects, not file names
    stderr=err_file,
    shell=True
)
and checking the return code of the child process using poll() (see also returncode):
timeOut = 600
total_time = 60
time.sleep(60)
# wait until the job is completed or the timeout is exceeded
while proc.poll() is None:
    if total_time > timeOut:
        proc.terminate()  # stop the hung simulation process
        break
    total_time += 60
    time.sleep(60)
or waiting until the timeOut is reached using wait():
timeOut = 600
try:
    proc.wait(timeOut)
except subprocess.TimeoutExpired:
    print('TimeOut reached!')
Note: I know that the terminate() and wait() methods should work in theory, but I haven't tried this solution myself, so there may be additional complications (like having to find all child processes created by Abaqus using psutil.Process(proc.pid)).

How to hold a value for a period of time

New to Lua scripting, so hopefully I'm using the right terminology to describe this. Please forgive me if I'm not; I'll get better over time....
I'm writing code for a game using Lua, and I have a function that sets a value. Now I need to "hold" this value for a set period of time, after which it gets set to another value.
I'm trying to replicate this behavior:
Imagine a car-start key.
0 = off
1 = ignition on
2 = ignition on + starter -- this fires to get the engine to ignite. In real life, this "start" position 2 is HELD while the starter does its thing; once the engine ignites, you let go and it springs back to 1 to keep the ignition on while the engine keeps running.
Now imagine, instead, a start button that cannot be pushed in and held in position 2 (like the key can) for the time required for the starter to ignite the engine. Instead, once touched, the starter should run for a set period of time in position 2, and when it senses a certain RPM it stops, goes back to position 1, and leaves the engine running.
Similarly, I have a button that, when "touched", fires the starter, but that starter event, as expected, goes from 0 to 2 and back to 1 in a blink. If I just set it to 2, it fires forever.
I need to hold its phase-2 position in place for 15 seconds; then it can go back to 1.
What I have:
A button that has an animated state. (0 off, 1, 2).
If 0 then 1, if 1 then 2 using phase==2 as the spring step from 2 back to 1
At position == 2, StarterKey = 2
No syntax issues.
The StarterKey variable is set to 2 at position 2 -- BUT not for long enough for the engine to ignite. I need StarterKey = 2 to stay at 2 for 15 seconds. Or do I need to hold the entire if phase==2 stage for longer? Not sure.
What should I be looking at please? Here's a relevant snip..
function EngineStart()
    if phase == 0 then
        if ButtonState == 0 then
            ButtonState = 1
            StarterKey = 2 -- I tried adding *duration = 15* to this event but that does not work
        end
    end
end
I also have the subsequent elseif phase == 2 part for when the button is in position == 2 so it springs back to 1 -- that all works fine.
How do I, or what do I need to use to, introduce time into my events and functions? I don't want to measure time; I want to set how long an event lasts.
Thanks!
Solved this by watching a Lua tutorial video online (taught me how) and discovering that the sim has a timer function, which I then used to set the number of seconds ("time") that the starter key-press stayed in its 2 position; that fired it long enough for the igniters to do their work (what to use).
QED :)) HA!

Dataflow - Approx Unique on unbounded source

I'm getting unexpected results streaming in the cloud.
My pipeline looks like:
SlidingWindow(60min).every(1min)
    .triggering(Repeatedly.forever(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))
    ))
    .withAllowedLateness(15sec)
    .accumulatingFiredPanes()
.apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
.apply("Window hack filter", ParDo(
    if (window.maxTimestamp.isBeforeNow())
        c.output(element)
))
.toJSON()
.toPubSub()
If that filter isn't there, I get output for all 60 windows, apparently because the Pub/Sub sink isn't window-aware.
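In the Java SDK, that filter can be written by injecting the BoundedWindow into the DoFn; a sketch, assuming the KV<String, Long> output of ApproximateUnique.perKey:
.apply("Window hack filter", ParDo.of(
    new DoFn<KV<String, Long>, KV<String, Long>>() {
      // Drop panes whose window ends in the future, so only the most
      // recent completed window reaches the window-unaware Pub/Sub sink.
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        if (window.maxTimestamp().isBeforeNow()) {
          c.output(c.element());
        }
      }
    }))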
So in the examples below, if each time period is a minute, I'd expect to see the unique count grow until 60 minutes when the sliding window closes.
Using DirectRunner, I get expected results:
t1: 5
t2: 10
t3: 15
...
tx: growing unique count
In Dataflow, I get weird results:
t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count
However, if my unbounded source has older data, I'll get normal-looking results until it catches up, at which point I'll get the wrong results.
I was thinking it had to do with my window filter, but removing that didn't change the results.
If I do a Distinct() then Count.perKey(), it works, but that slows my pipeline considerably.
What am I overlooking?
[Update from the comments]
ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as with windows firing multiple times. Fix (will be in version 2.4): https://github.com/apache/beam/pull/4688
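For reference, a sketch of the exact-count workaround mentioned above (the input name keyedValues and its KV<String, String> element type are assumptions):
// Exact alternative to ApproximateUnique: deduplicate within each window,
// then count per key. Correct under multiple firings, but it materializes
// every distinct element, hence the slowdown.
PCollection<KV<String, Long>> uniqueCounts = keyedValues
    .apply("Dedupe", Distinct.<KV<String, String>>create())
    .apply("Count Unique", Count.perKey());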

Check if window is triggered by watermark passing it

If I have a window like this:
.apply(Window
.<String>into(Sessions
.withGapDuration(Duration.standardSeconds(10)))
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))));
And it receives data:
a -> x (timestamp 0) (received at 20)
a -> y (timestamp 1) (received at 21)
(watermark passes 11) (at 22)
I suppose that the window would trigger at times:
20 sec, because of an early firing
21 sec, because of an early firing
22 sec, because of watermark passing the GapDuration
In a ParDo that I apply over the windowed data, is there a way to distinguish an early firing from the watermark passing the gap duration?
According to this Stack Overflow question there is no way to get the watermark. If I were able to do that, I could check max(timestamp) < watermark. But since I cannot get the watermark, is there any other way to figure out that a window was triggered by the watermark passing it?
You can access the PaneInfo of the elements in a DoFn after the GroupByKey by calling ProcessContext#pane() and use that to determine the Timing. This will allow you to identify whether this was an "on time" firing (due to the watermark passing the end of the window) or a speculative/late firing.
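A sketch of that check, assuming a PCollection<KV<String, Iterable<String>>> downstream of the GroupByKey:
.apply(ParDo.of(
    new DoFn<KV<String, Iterable<String>>, KV<String, Iterable<String>>>() {
      // PaneInfo.Timing distinguishes EARLY (speculative firings),
      // ON_TIME (watermark passed the end of the window), and LATE panes.
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.pane().getTiming() == PaneInfo.Timing.ON_TIME) {
          c.output(c.element()); // fired by the watermark passing the window end
        }
      }
    }))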

Applying multiple GroupByKey transforms in a DataFlow job causing windows to be applied multiple times

We have a DataFlow job that is subscribed to a PubSub stream of events. We have applied sliding windows of 1 hour with a 10 minute period. In our code, we perform a Count.perElement to get the counts for each element and we then want to run this through a Top.of to get the top N elements.
At a high level:
1) Read from pubSub IO
2) Window.into(SlidingWindows.of(windowSize).every(period)) // windowSize = 1 hour, period = 10 mins
3) Count.perElement
4) Top.of(n, comparisonFunction)
What we're seeing is that the window is being applied twice, so data seems to be watermarked 1 hour 40 minutes (instead of 50 minutes) behind the current time. When we dig into the job graph on the Dataflow console, we see two GroupByKey operations being performed on the data:
1) As part of Count.perElement. The watermark on the data from this step onwards is 50 minutes behind the current time, which is expected.
2) As part of the Top.of (in the Combine.PerKey). The watermark on this seems to be another 50 minutes behind the current time. Thus, data in the steps below this is watermarked 1 hour 40 minutes behind.
This ultimately manifests in some downstream graphs being 50 minutes late.
Thus it seems like every time a GroupByKey is applied, windowing kicks in afresh.
Is this expected behavior? Is there any way to make the windowing apply only to the Count.perElement and turn it off after that?
Our code is something on the lines of:
final int top = 50;
final Duration windowSize = standardMinutes(60);
final Duration windowPeriod = standardMinutes(10);
final SlidingWindows window = SlidingWindows.of(windowSize).every(windowPeriod);
options.setWorkerMachineType("n1-standard-16");
options.setWorkerDiskType("compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
options.setJobName(applicationName);
options.setStreaming(true);
options.setRunner(DataflowPipelineRunner.class);
final Pipeline pipeline = Pipeline.create(options);
// Get events
final String eventTopic =
    "projects/" + options.getProject() + "/topics/eventLog";
final PCollection<String> events = pipeline
    .apply(PubsubIO.Read.topic(eventTopic));
// Create toplist
final PCollection<List<KV<String, Long>>> topList = events
    .apply(Window.into(window))
    .apply(Count.perElement()) // as eventIds are repeated
    // get top n to get top events
    .apply(Top.of(top, orderByValue()).withoutDefaults());
Windowing is not applied each time there is a GroupByKey. The lag you were seeing was likely the result of two issues, both of which should be resolved.
The first was that data buffered for later windows at the first GroupByKey was preventing the watermark from advancing, which meant that earlier windows were getting held up at the second GroupByKey. This has been fixed in the latest versions of the SDK.
The second was that the sliding windows were causing the amount of data to increase significantly. A new optimization has been added that uses the combine (you mentioned Count and Top) to reduce the amount of data.
