Processing Total Ordering of Events By Key using Apache Beam - google-cloud-dataflow

Problem Context
I am trying to generate a total (linear) order of event items per key from a real-time stream where the order is event time (derived from the event payload).
Approach
I had attempted to implement this using streaming as follows:
1) Set up non-overlapping sequential windows, e.g. with a duration of 5 minutes
2) Establish an allowed lateness - it is fine to discard late events
3) Set accumulation mode to retain all fired panes
4) Use the "AfterWatermark" trigger
5) When handling a triggered pane, only consider the pane if it is the final one
6) Use GroupBy.perKey to ensure all events in this window for this key will be processed as a unit on a single resource
While this approach ensures linear order for each key within a given window, it does not make that guarantee across multiple windows. For example, a later window of events for the key could be processed at the same time as an earlier window; this could easily happen if the first window failed and had to be retried.
I'm considering adapting this approach where the realtime stream can first be processed so that it partitions the events by key and writes them to files named by their window range.
Due to the parallel nature of Beam processing, these files will also be generated out of order.
A single process coordinator could then submit these files sequentially to a batch pipeline - only submitting the next one when it has received the previous file and that downstream processing of it has completed successfully.
The problem is that Apache Beam will only fire a pane if there was at least one element in that time window. Thus, if there are gaps in events, there could be gaps in the files that are generated, i.e. missing files. With missing files, the coordinating batch processor cannot distinguish between a time window that passed with no data and a failure, in which case it cannot proceed until the file finally arrives.
One way to force the event windows to trigger might be to somehow add dummy events to the stream for each partition and time window. However, this is tricky to do: if there are large gaps in the time sequence and these dummy events arrive surrounded by much later events, they will be discarded as late.
Are there other approaches to ensuring there is a trigger for every possible event window, even if that results in outputting empty files?
Is generating a total ordering by key from a realtime stream a tractable problem with Apache Beam? Is there another approach I should be considering?

Depending on your definition of tractable, it is certainly possible to totally order a stream per key by event timestamp in Apache Beam.
Here are the considerations behind the design:
Apache Beam does not guarantee in-order transport, so ordering is of no use within a pipeline. I will therefore assume you are doing this so you can write to an external system that can only handle events if they arrive in order.
If an event has timestamp t, you can never be certain no earlier event will arrive unless you wait until t is droppable.
So here's how we'll do it:
We'll write a ParDo that uses state and timers (blog post still under review) in the global window. This makes it a per-key workflow.
We'll buffer elements in state when they arrive, so your allowed lateness affects how efficient a data structure you need. What you need is a heap to peek and pop the minimum timestamp and element; there's no built-in heap state, so I'll just write it as a ValueState.
We'll set an event-time timer to receive a callback when an element's timestamp can no longer be contradicted.
I'm going to assume a custom EventHeap data structure for brevity. In practice, you'd want to break this up into multiple state cells to minimize the data transferred. A heap might be a reasonable addition to the primitive types of state.
I will also assume that all the coders we need are already registered and focus on the state and timers logic.
new DoFn<KV<K, Event>, Void>() {

  @StateId("heap")
  private final StateSpec<ValueState<EventHeap>> heapSpec = StateSpecs.value();

  @TimerId("next")
  private final TimerSpec nextTimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("heap") ValueState<EventHeap> heapState,
      @TimerId("next") Timer nextTimer) {
    EventHeap heap = firstNonNull(
        heapState.read(),
        EventHeap.createForKey(ctx.element().getKey()));
    heap.add(ctx.element().getValue());
    // Persist the updated heap, otherwise the buffered element is lost
    heapState.write(heap);
    // When the watermark reaches this time, no more elements
    // can show up that have earlier timestamps
    nextTimer.set(heap.nextTimestamp().plus(allowedLateness));
  }

  @OnTimer("next")
  public void onNextTimestamp(
      OnTimerContext ctx,
      @StateId("heap") ValueState<EventHeap> heapState,
      @TimerId("next") Timer nextTimer) {
    EventHeap heap = heapState.read();
    // If the timer at time t was delivered, the watermark must have passed t,
    // so only elements with timestamp + allowedLateness <= t are final.
    // isEmpty() is assumed on the custom EventHeap, like its other methods.
    while (!heap.isEmpty()
        && !heap.nextTimestamp().plus(allowedLateness).isAfter(ctx.timestamp())) {
      writeToExternalSystem(heap.pop());
    }
    heapState.write(heap);
    // Only re-arm the timer if something is still buffered
    if (!heap.isEmpty()) {
      nextTimer.set(heap.nextTimestamp().plus(allowedLateness));
    }
  }
}
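As a rough illustration of the buffer-and-release idea outside of Beam, here is a minimal plain-Java sketch. The names OrderedReleaseBuffer, Event and releaseUpTo are hypothetical and not part of any Beam API; timestamps and the watermark are simplified to millisecond longs, and the PriorityQueue stands in for the EventHeap assumed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical per-key buffer: holds events until the watermark makes them final. */
class OrderedReleaseBuffer {

    /** Minimal event type for the sketch: an event-time timestamp and a payload. */
    static final class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    // Min-heap ordered by event time, so peek() is always the earliest buffered event
    private final PriorityQueue<Event> heap =
        new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestampMillis));
    private final long allowedLatenessMillis;

    OrderedReleaseBuffer(long allowedLatenessMillis) {
        this.allowedLatenessMillis = allowedLatenessMillis;
    }

    /** Buffers an event; arrival order does not matter. */
    void add(Event event) {
        heap.add(event);
    }

    /**
     * Releases, in timestamp order, every event whose timestamp plus the allowed
     * lateness is at or before the given watermark. Later events stay buffered
     * because an earlier event could still legitimately arrive.
     */
    List<Event> releaseUpTo(long watermarkMillis) {
        List<Event> released = new ArrayList<>();
        while (!heap.isEmpty()
            && heap.peek().timestampMillis + allowedLatenessMillis <= watermarkMillis) {
            released.add(heap.poll());
        }
        return released;
    }
}
```

With an allowed lateness of 5, events at 12 and 10 buffered, and the watermark at 15, only the event at 10 is released; the event at 12 waits until the watermark reaches 17, so an event at 11 arriving in between is still emitted in order.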
This should hopefully get you started on the way towards whatever your underlying use case is.

Related

DJI Waypoint mission listeners

I need to create, upload, and start a waypoint mission with one button. When the user presses the button, the drone should move up through a certain number of points based on its current position. The user can stop the mission and then start a new one. My logic is as follows:
I initialize mission with points
Load mission
Add Listeners to mission operator
Upload mission
Mission starts on listener
missionOperator.addListener(toUploadEvent: self, with: DispatchQueue.main) { (event) in
if event.currentState == .readyToExecute {
self.startMission()
}
}
I've been reading the documentation for days and trying to understand how this works, but I'm obviously missing something. Listeners are created on the waypoint mission operator, but if I create listeners before loading the mission they are not called. If I create listeners every time I load a mission, startMission() is called multiple times (the first time it is called once, but after one mission is stopped or finished, startMission() gets called twice the next time).
So, I guess that my questions would be:
What is the right moment to add and remove listeners, given that I'm calling startMission() from a listener? And what is the appropriate way to init/upload/start a mission with one button, and to be able to do that multiple times?
You need to remove the upload listener when the upload has succeeded and the event state is readyToExecute. Also remove it when the event contains an error, or the state is readytoupload/notsupported/disconnected. Pretty much in every case except when it's still in the 'uploading' state.
When you start the mission, add a listener for execution events, and one for finished. Remove those again when the mission is stopped/cancelled, has an error, or finishes successfully.
Even though you use Swift, I suggest looking at the more complete Objective C sample code, which includes examples of several different types of missions.

How to optimize performance of Results change listeners in Realm (Swift) with a deep hierarchy?

We're using Realm (Swift binding currently in version 3.12.0) from the earliest days in our project. In some early versions before 1.0 Realm provided change listeners for Results without actually giving changeSets.
We used this a lot in order to find out if a specific Results list changed.
Later the Realm team replaced this API with methods that provide change sets. We had to switch and are now misusing this API just to find out whether anything in a specific list changed (inserts, deletions, modifications).
Together with RxSwift we wrote our own implementation of Results change listening which looks like this:
public var observable: Observable<Base> {
return Observable.create { observer in
let token = self.base.observe { changes in
if case .update = changes {
observer.onNext(self.base)
}
}
observer.onNext(self.base)
return Disposables.create(with: {
observer.onCompleted()
token.invalidate()
})
}
}
When we now want to have consecutive updates on a list we subscribe like so:
someRealm.objects(SomeObject.self).filter(<some filter>).rx.observable
.subscribe(<subscription code that gets called on every update>)
//dispose code missing
We wrote the extension on RealmCollection so that we can subscribe to List type as well.
The concept is equal to RxRealm's approach.
So now in our App we have a lot of filtered lists/results that we are subscribing to.
When data gets more and more we notice significant performance losses when it comes to seeing a change visually after writing something into the DB.
For example:
Let's say we have a Car Realm Object class with some properties and some 1-to-n and some 1-to-1 relationships. One of the properties is a Bool, namely isDriving.
Now we have a lot of cars stored in the DB and a bunch of change listeners with different filters listening to changes of the cars collection (collection observers listening for changeSets in order to find out if the list was changed).
If I take one car of some list and set the property of isDriving from false to true (important: we do writes in the background) ideally the change listener fires fast and I have the nearly immediate correct response to my write on the main thread.
Added with edit on 2019-06-19:
Let's make the scenario still a little more real:
Let's change something further down the hierarchy, say the tire manufacturer's name. Let's say a Car has a List<Tire>, a Tire has a Manufacturer, and a Manufacturer has a name.
Now we're still listening to Results collection changes with some more or less complex filters applied.
Then we change the name of a Manufacturer which is connected to one of the tires, which are connected to one of the cars in that filtered list.
Can this still be fast?
Obviously when the length of results/lists where change listeners are attached to gets longer Realm's internal change listener takes longer to calculate the differences and fires later.
So after a write we see the changes - in worst case - much later.
In our case this is not acceptable. So we are thinking through different scenarios.
One scenario would be to not use .observe on lists/results anymore and switch to Realm.observe which fires every time anything did change in the realm, which is not ideal, but it is fast because the change calculation process is skipped.
My question is: What can I do to solve this whole dilemma and make our app fast again?
The crucial thing is the threading stuff. We're always writing in the background due to our design. So the writes itself should be very fast, but then that stuff needs to synchronize to the other threads where Realms are open.
In my understanding that happens after the change detection for all Results has run through, is that right?
So when I read on another thread, the data is only fresh after the thread sync, which happens after all notifications have been sent out. I am not currently sure whether the sync happens before that, which would be even better; I have not tested it yet.

Operations on a stream produce a result, but do not modify its underlying data source

Unable to understand how "Operations on a stream produce a result, but do not modify its underlying data source" with reference to java 8 streams.
shapes.stream()
.filter(s -> s.getColor() == BLUE)
.forEach(s -> s.setColor(RED));
As per my understanding, forEach is setting the color of object from shapes then how does the top statement hold true?
The reference s isn't being altered in this example; however, no deep copy is taken, and there is nothing to stop you from altering the object it refers to.
You can alter an object via a reference in any context in Java, and there isn't anything to prevent it. You can only prevent shallow values from being altered.
NOTE: Just because you are able to do this doesn't mean it's a good idea. Altering an object inside a lambda is likely to be dangerous, as functional programming models assume you are not altering the data being processed (always creating new objects instead).
If you are going to alter an object, I suggest you use a loop (non functional style) to minimise confusion.
An example of where using a lambda to alter an object has dire consequences is the following.
map.computeIfAbsent(key, k -> {
map.computeIfAbsent(key, k -> 1);
return 2;
});
The behaviour is not deterministic: it can result in both key/value pairs being added, and for ConcurrentHashMap the call will never return.
As mentioned here:
Most importantly, a stream isn’t a data structure.
You can often create a stream from collections to apply a number of functions on a data structure, but a stream itself is not a data structure. That’s so important, I mentioned it twice! A stream can be composed of multiple functions that create a pipeline that data that flows through. This data cannot be mutated. That is to say the original data structure doesn’t change. However the data can be transformed and later stored in another data structure or perhaps consumed by another operation.
AND as per Java docs
This is possible only if we can prevent interference with the data
source during the execution of a stream pipeline.
And the reason is :
Modifying a stream's data source during execution of a stream pipeline
can cause exceptions, incorrect answers, or nonconformant behavior.
That's all theory, live examples are always good.
So here we go :
Assume we have a List<String> (say, names) and a stream of it, names.stream(). We can apply .filter(), .reduce(), .map(), etc., but we must never change the source while the pipeline runs. If you try to modify the source (names) during the pipeline, you will get a java.util.ConcurrentModificationException.
public static void main(String[] args) {
List<String> names = new ArrayList<>();
names.add("Joe");
names.add("Phoebe");
names.add("Rose");
names.stream().map((obj)->{
names.add("Monika"); //modifying the source of stream, i.e. ConcurrentModificationException
/**
* If we comment the above line, we are modifying the data(doing upper case)
* However the original list still holds the lower-case names(source of stream never changes)
*/
return obj.toUpperCase();
}).forEach(System.out::println);
}
I hope that would help!
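To put the two points from the answers above side by side, here is a small sketch. The Shape class is a hypothetical stand-in for the one in the question, with colors simplified to strings, and the helper names are made up for illustration. It shows that map() produces new values and leaves the source list alone, while forEach() with a setter changes the state of the objects the list refers to without adding or removing list entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamSourceDemo {

    /** Hypothetical mutable element type standing in for the question's Shape. */
    static final class Shape {
        private String color;

        Shape(String color) {
            this.color = color;
        }

        String getColor() {
            return color;
        }

        void setColor(String color) {
            this.color = color;
        }
    }

    /** map() builds new strings; the source list keeps its original elements. */
    static List<String> upperCased(List<String> names) {
        return names.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    /** forEach() mutates the referenced objects; the list itself is not modified. */
    static void paintBlueShapesRed(List<Shape> shapes) {
        shapes.stream()
            .filter(s -> s.getColor().equals("BLUE"))
            .forEach(s -> s.setColor("RED"));
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("Joe", "Phoebe", "Rose"));
        System.out.println(upperCased(names)); // a new list of upper-case strings
        System.out.println(names);             // the source list, still mixed case

        List<Shape> shapes = new ArrayList<>(Arrays.asList(new Shape("BLUE"), new Shape("GREEN")));
        paintBlueShapesRed(shapes);
        // Same two Shape objects in the same positions; only one object's color field changed
        System.out.println(shapes.get(0).getColor() + " " + shapes.get(1).getColor());
    }
}
```

The list in the second case still has the same size and holds the same object references, which is the sense in which the data source is not modified even though one element's state is.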
I understood the part "do not modify its underlying data source" to mean that it will not add or remove elements from the source; I think you are safe, since you alter an element but do not remove it.
You can read comments from Tagir and Brian Goetz here, where they agree that this is sort of fine.
The more idiomatic way to do what you want, would be a replace all for example:
shapes.replaceAll(x -> {
if(x.getColor() == BLUE){
x.setColor(RED);
}
return x;
});

How to wait for Angular2 to finish rendering

In case it matters, here's the reason I want to do this:
My application will have many elements in various places on the same page that each trigger individually tiny http requests for small (less than a kilobyte, sometimes only tens of bytes) pieces of data. To avoid excessive per-request overhead, I want to combine them all into one larger request. I've got the code for combining and handling the combined request and response done, but I'm not sure how to tell when the requests have stopped coming and it's time to send it.
The idea I'm using right now is to use new Future.value().whenComplete() (I'm coding in Dart) to simply wait for the event loop to run, but I don't know whether Angular2's rendering spans multiple iterations of the event loop or not. Is this enough to guarantee Angular2 has invoked every property binding on the page before my http request goes out, and if not how can I get such a guarantee?
I don't think there is a better way.
Instead of
new Future.value().whenComplete()
You can just use
new Future(() {
// delayed code here
});
or to delay a bit more
new Future.delayed(const Duration(milliseconds: 10), () {
// delayed code here
});

ASP MVC - Comet/Reverse Ajax/PUSH - Is this code thread safe?

I'm trying to implement comet-style features by polling the server for changes in data and holding the connection open until there is something to respond with.
Firstly, I have a static variable on my controller which stores the time that the data was last updated:
public static volatile DateTime lastUpdateTime = 0;
So whenever the data i'm polling changes this variable will be changed.
I then have an Action, which takes the last time that the data was retrieved as a parameter:
public ActionResult Push(DateTime lastViewTime)
{
while (lastUpdateTime <= lastViewTime)
{
System.Threading.Thread.Sleep(10000);
}
return Content("testing 1 2 3...");
}
So if lastUpdateTime is less than or equal to lastViewTime, we know that there is no new data, and we simply hold the request there in a loop, keeping the connection open, until there is new information. We can then send it back to the client, which handles the response and makes a new request, so the connection is essentially always open.
This seems to work fine, but I'm concerned about thread safety. Is this OK? Does lastUpdateTime need to be marked as volatile? Is there a better way?
Thanks
Edit: perhaps I should use a lock object when I update the time value
private static object lastUpdateTimeLock = new object();
..
lock (lastUpdateTimeLock)
{
lastUpdateTime = DateTime.Now;
}
Regarding your original question, you do have to be careful with DateTimes, since they're actual objects in the .NET runtime. Only a few data types can be natively accessed (eg ints, bools) without locking (assuming you're not using Interlocked). If you want to avoid any issues with Datetimes, you can get the ticks as a long and use the Interlocked class to manage them.
That said, if you're looking for comet capabilities in a .NET application, you're unfortunately going to have to go a lot further than what you've got here. IIS/ASP.NET won't scale with the approach you've got in place right now; you'll hit limits before you even get to 100 users. Among other things, you will have to switch to using async handlers, and implement a custom bounded thread pool for the incoming requests.
If you really want a tested solution for ASP.NET/IIS, check out WebSync, it's a full comet server designed specifically for that purpose.
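The suggestion above to keep the timestamp as a long and update it atomically can be sketched as follows. This is a Java analogue rather than .NET code: AtomicLong plays the role that the Interlocked class plays in .NET, epoch milliseconds stand in for DateTime ticks, and the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for the "last updated" time, stored as one atomic long. */
class LastUpdateClock {

    // Epoch milliseconds of the most recent data change; 0 means "never updated"
    private final AtomicLong lastUpdateMillis = new AtomicLong(0L);

    /** Records a change atomically and never moves the stored time backwards. */
    void markUpdated(long nowMillis) {
        lastUpdateMillis.accumulateAndGet(nowMillis, Math::max);
    }

    /** True if the data changed strictly after the client's last view time. */
    boolean hasNewDataSince(long lastViewMillis) {
        return lastUpdateMillis.get() > lastViewMillis;
    }
}
```

A writer would call markUpdated(System.currentTimeMillis()) whenever the data changes, and the polling code would check hasNewDataSince(lastViewMillis) instead of comparing a shared date object, so no explicit lock is needed around either the read or the write.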
Honestly my concern would be with the number of connections kept open and the empty while loop. The connections you're probably fine on, but I'd definitely want to do some load testing to be sure.
The while (lastUpdateTime <= lastViewTime) {} seems like it should have a Thread.Sleep(100) or something in there. Otherwise I'd think it would consume a lot of cpu cycles needlessly.
The lock does not seem necessary to me around lastUpdateTime = DateTime.Now since the previous value does not matter. If it were lastUpdateTime = lastUpdateTime + 1 or something, then maybe it would be.
