The documentation states that if you don't explicitly specify a trigger, you get the behavior described below:
If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late arriving data.
Is this behavior true for FixedWindows as well? For example, you would assume a fixed window should have a default trigger that fires repeatedly after the watermark passes the end of the window, and that discards all late data unless late data is explicitly handled. Also, where in the source code can I see the definition of the trigger for, say, a FixedWindows object?
The best doc to start with is the guide for triggers and windows (and following the links from there). In particular, it says that, even though the default trigger fires every time late data arrives, in the default configuration it still effectively fires only once, discarding the late data:
if you are using both the default windowing configuration and the default trigger, the default trigger emits exactly once, and late data is discarded. This is because the default windowing configuration has an allowed lateness value of 0. See the Handling Late Data section for information about modifying this behavior.
Details
The windowing concept in Beam encompasses several things, including assigning windows, handling triggers, and handling late data, among others. However, these things are configured and handled separately, and it gets confusing quickly from here.
How elements are assigned to a window is handled by a WindowFn, see here; for example, FixedWindows: link. That is basically the only thing that happens there (almost). Assigning a window is essentially a special case of grouping elements based on their event timestamps. You can think of the logic as being similar to manually assigning custom keys to elements based on their timestamps and then applying GroupByKey.
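A rough sketch of that analogy, assuming the Beam Java SDK (the events collection here is a hypothetical input with event timestamps already attached):

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // 'events' is a hypothetical PCollection<KV<String, Long>>.
    // Window.into only assigns each element to a one-minute window based on
    // its event timestamp; nothing is emitted at this point.
    PCollection<KV<String, Long>> windowed = events.apply(
        Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))));

    // A downstream GroupByKey then groups by (key, window) rather than by key alone.
    PCollection<KV<String, Iterable<Long>>> grouped =
        windowed.apply(GroupByKey.<String, Long>create());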
Triggering is a related but separate concept. Triggers are (roughly) just predicates to indicate when the runner is allowed to emit the data accumulated in the window so far (source). I think this is the closest thing to the original design doc for triggers: https://s.apache.org/beam-triggers
Lateness is another related part of the configuration, which is also somewhat separate (link). Even though a trigger might allow the runner to emit all the late data forever, the pipeline can be set to not allow any late data (which is the default behavior), or to only allow late data for some limited time. This leads to the default trigger behavior described above. Yes, this is confusing. Avoid using any complex triggering and lateness if you can; it likely won't work as you expect it to.
So the window classes only handle the grouping logic, i.e. which elements share the same grouping key. These classes don't care about when you will want to emit the accumulated results; that depends on your business logic (e.g. you might want to handle newly arrived elements, or you might want to discard them), and it's not part of the window. This means there are no special triggers for FixedWindows or other windows; you can use any trigger with any window (even if logically some specific trigger doesn't make sense in the context of some window).
The default trigger is just that: something that is set by default. You should assign your own trigger if it doesn't suit your needs, and it likely won't, except for some basic use cases.
Update
An example of how to use FixedWindows with triggers.
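A minimal sketch of such a combination, assuming the Beam Java SDK (the durations and the events collection are illustrative, not prescriptive):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    PCollection<KV<String, Long>> windowed = events.apply(
        Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Fire when the watermark passes the end of the window,
            // then fire again for every pane of late data...
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // ...but only accept data up to 10 minutes late. The default is
            // zero, which is why late firings never happen out of the box.
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());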
Related
I am about to start developing programs with Google Cloud Pub/Sub. I just wanted to confirm this once.
From the Beam documentation, data loss can only occur if data was declared late by Pub/Sub. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have come to the conclusion that these metrics are critical where custom windowing is applied over the received data with event-based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as they arrive) using triggers. Therefore, you can no longer define data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows, which includes some nice visuals comparing different windowing strategies.
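A minimal sketch of such a processing-time snapshot, assuming the Beam Java SDK (the one-minute delay and the messages collection are illustrative):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    PCollection<String> snapshots = messages.apply(
        Window.<String>into(new GlobalWindows())
            // Emit a pane roughly once per minute of processing time,
            // independently of the messages' event timestamps.
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());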
To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do it is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say that all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
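A minimal sketch of that stateful-DoFn idea, assuming the Beam Java SDK; the input shape (here each element carries the file's expected total count as its value) and all names are hypothetical:

    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Keyed by file_id; emits the file_id once the expected number of elements
    // has been seen. Note that state is per key and per window, so this assumes
    // a single (e.g. global) window.
    class FileCompletionFn extends DoFn<KV<String, Long>, String> {
      @StateId("seen")
      private final StateSpec<ValueState<Long>> seenSpec = StateSpecs.value();

      @ProcessElement
      public void process(
          @Element KV<String, Long> element,
          @StateId("seen") ValueState<Long> seenState,
          OutputReceiver<String> out) {
        Long previous = seenState.read();
        long seen = (previous == null ? 0L : previous) + 1;
        seenState.write(seen);
        if (seen == element.getValue()) {  // expected count reached
          out.output(element.getKey());    // signal that this file is complete
        }
      }
    }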
The Observable module uses the changed method as a way of flagging when the observers should get updates about a change of state in the subject. However, it seems redundant to me, as notify_observers(self) is going to be called intentionally anyway. Is there a situation where having changed makes a difference?
It allows for a split in timing, and for multiple changes without side effects.
The state of an object may or may not change during a sub-process of an application. If it changes, then changed can be invoked without any side effects. It is also possible to call it multiple times.
Later, notifications can be sent. As long as they are sent before any dependent actions are taken, anything that depends on updates due to the original change should be able to make its update correctly.
You might do this, for example, if processing following a change had a high cost (I/O, CPU time, network bandwidth) but only needed to be done before a second section of code, not simply every time a change occurred.
You might also do this if updates are not strictly necessary in all code paths. I.e. sometimes you care about updating due to a change, but other times it is not necessary, and any change can be ignored.
An example might be if you need to re-generate an XML document every time a content object used by the XML-creation code has a property change. You would place calls to changed in the important property setters of the content object, hiding them from the external caller. You don't want to generate a new XML document each time you call content.property= (it could be very slow); instead, you wait until you are finished making updates and place a single call to content.notify_observers after all possible changes.
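For illustration, Java's (now deprecated) java.util.Observable follows the same protocol, with setChanged playing the role of Ruby's changed and notifyObservers being a no-op unless the flag is set. A minimal sketch with hypothetical names:

    import java.util.Observable;

    // Setters only mark the object as changed; nothing is sent yet.
    class Content extends Observable {
      private String title;
      void setTitle(String title) {
        this.title = title;
        setChanged();  // cheap: just flips a flag
      }
    }

    public class Demo {
      public static void main(String[] args) {
        Content content = new Content();
        // Hypothetical observer standing in for the expensive XML regeneration.
        content.addObserver((obs, arg) -> System.out.println("regenerating XML"));
        content.setTitle("a");
        content.setTitle("b");
        content.notifyObservers();  // observers run once; the flag is cleared
        content.notifyObservers();  // no-op: nothing changed since last notify
      }
    }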
I am writing automation tests for my iOS app, and I am trying to figure out how to detect in the JavaScript script when a view controller has fully loaded and is on screen...
So right now for example the script taps on a button:
target.frontMostApp().mainWindow().buttons()["loginButton"].tap();
Then, once the app logs in (which could take a few seconds or less) I need to press another button.
Right now, I made it work by simply putting a delay:
target.delay(3);
But I want to be able to detect when the next view controller is loaded so that I know I can access the elements on the new screen just loaded.
Any suggestions?
There are some ways to achieve that:
Tune Up Library
https://github.com/alexvollmer/tuneup_js
I would really recommend checking this out. It has some useful wrapper functions for writing more advanced tests. Some methods are written to extend the UIAElement class, including a waitUntilVisible() method.
But there is a possibility that the element itself is nil. It's not a matter of visibility; it's just not on the screen yet.
UIATargetInstance.pushTimeout(delay) and UIATargetInstance.popTimeout()
Apple's documentation says:
To provide some flexibility in such cases and to give you finer control over timing, UI Automation provides for a timeout period, a period during which it will repeatedly attempt to perform the specified action before failing. If the action completes during the timeout period, that line of code returns, and your script can proceed. If the action doesn't complete during the timeout period, an exception is thrown. The default timeout period is five seconds, but your script can change that at any time.
You can also use the UIATarget.delay(delay) function, but this delays the execution of the next statement, not necessarily waiting until the element is found. setTimeout(delay) can also be used to globally set the timeout before UIAutomation decides it could not find the element.
There is Apple's guide, which explains the framework in more detail, here: http://developer.apple.com/library/IOS/#documentation/DeveloperTools/Conceptual/InstrumentsUserGuide/UsingtheAutomationInstrument/UsingtheAutomationInstrument.html
In most cases, simply identifying an element specific to the next view controller should be sufficient. However, the description of the element must be something unique to the new UI.
i.e.
target.frontMostApp().mainWindow().navigationBar()["My Next View"];
The script should delay until the element can be identified or the timeout has been reached (see pushTimeout/popTimeout).
In some cases it might be necessary to specify that the element is visible.
.withValueForKey(1, "isVisible")
It's also possible to use waitForInvalid to do the reverse (wait for something in the previous UI to go away).
Search for the tuneup library here.
Refer to the waitUntilVisible function for this. You may include other files in it for supporting functions.
Hope it helps.
I worked my way through the Prism guidance and think I have a grasp of most of its communication vehicles.
Commanding is very straightforward, so it is clear that the DelegateCommand will be used just to connect the View with its Model.
It is somewhat less clear when it comes to cross-module communication, specifically when to use EventAggregation over CompositeCommands.
The practical effect is the same e.g.
You publish an event -> all subscribers receive notice and execute code in response
You execute a composite command -> all registered commands get executed and with it their attached code
Both work along the lines of "fire and forget", that is they don't care about any responses from their subscribers after firing the event/executing the commands.
I have trouble seeing a practical difference in usage although I understand that the implementation of both (under the hood) is very different.
So should we think about what each actually means? An Event: is that when something happens (an event occurs), something the user did not directly request, like a "web request completed"?
And Command? Does that mean a user clicked something and thus issued a command to our application, requesting a service directly?
Is that it? Or are there other ways to determine when to use one of these communication vehicles over the other? The guidance, although one of the best pieces of documentation I have read, gives no specific explanation.
So I hope people involved in/using Prism can help in shedding some light on this.
There are two primary differences between these two.
CanExecute for Commands. A Command can say whether or not it is valid for execution by calling Command.RaiseCanExecuteChanged() and having its CanExecute delegate return false. If you consider the case of a "Save All" CompositeCommand compositing several "Save" commands, and one of those commands says that it can't execute, the Save All button will automatically disable (nice!).
EventAggregator is a Messaging pattern and Commands are a Commanding pattern. Although CompositeCommands are not explicitly a UI pattern, they are implicitly so (generally they are hooked up to an input action, like a Button click). EventAggregator is not this way: any part of the application can effectively raise an EventAggregator event, whether background processes, ViewModels, etc. It is a brokered avenue for messaging across your application, with support for things like filtering, background thread execution, etc.
Hope this helps explain the differences. It's more difficult to say when to use each, but generally I use the rule of thumb that if it's user interaction that raises the event, use a command; for anything else, use EventAggregator.
Hope this helps.
Additionally, there is one more important difference: With the current implementation, an event from the EventAggregator is asynchronous, while the CompositeCommand is synchronous.
If you want to implement something like "notify that event X happened; do something that relies on the event handlers for event X to have executed", you either have to do something like Application.DoEvents() or use CompositeCommands.