Reactor groupBy: What happens with remaining items after GroupedFlux is canceled?

I need to group an infinite Flux by a key with high cardinality.
For example:
the group key is a domain URL
calls to one domain must be strictly sequential (the next call starts after the previous one completes)
calls to different domains should be concurrent
the time interval between items with the same key (URL) is unknown but expected to be bursty: several items are emitted in a short period of time, then there is a long pause until the next group.
queue
    .groupBy(keyMapper, groupPrefetch)
    .flatMap(
        { group ->
            group.concatMap(
                { task -> makeSlowRemoteCall(task) },
                0
            )
                .takeUntil { remoteCallResult -> remoteCallResult == DONE }
                .timeout(groupTimeout, Mono.empty())
                .then()
        },
        concurrency
    )
I cancel the group in two cases:
the result of makeSlowRemoteCall() indicates that, with high probability, there will be no new items in this group in the near future.
the next item is not emitted within groupTimeout. I use the timeout(timeout, fallback) variant to suppress the TimeoutException and let flatMap's inner publisher complete successfully.
I want possible future items with the same key to form a new GroupedFlux and be processed by the same flatMap inner pipeline.
But what happens if the GroupedFlux has remaining unrequested items when I cancel it?
Does the groupBy operator re-queue them into a new group with the same key, or are they lost forever? If the latter, what is the proper way to solve my problem? I am also not sure whether I need to set the concatMap() prefetch to 0 in this case.
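A minimal, hypothetical experiment, not from the original post (it assumes reactor-core on the classpath and uses take(2) as a stand-in for the takeUntil/timeout cancellation), to observe what happens to a group's buffered items when the inner pipeline cancels early:

import java.time.Duration;
import reactor.core.publisher.Flux;

public class GroupCancelExperiment {
    public static void main(String[] args) {
        Flux.range(1, 10)
            .delayElements(Duration.ofMillis(10))
            .groupBy(i -> i % 2)                  // two keys: 0 and 1
            .flatMap(group -> group
                .take(2)                          // cancels the group after two items
                .map(i -> "group " + group.key() + " -> " + i))
            .doOnNext(System.out::println)
            .blockLast();
    }
}

Running it shows whether items with the cancelled group's key reappear under a fresh group, and whether anything already queued in the cancelled group is replayed or silently dropped.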

I think the groupBy() operator is not a fit for my task, with an infinite source and a lot of groups. It creates infinite groups, so idle groups have to be cancelled somehow downstream, but it is not possible to cancel a GroupedFlux with a guarantee that it has no unconsumed elements.
I think it would be great to have a groupBy variant that emits finite groups.
Something like groupBy(keyMapper, boundaryPredicate): when boundaryPredicate returns true, the current group completes, and the next element with the same key starts a new group. A rough approximation with existing operators is sketched below.
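A hedged sketch of that idea using only existing operators: keep each GroupedFlux alive and slice it into consecutive finite windows with windowUntil instead of cancelling the group (the modulo key and boundary condition are illustrative stand-ins; this avoids losing buffered items, though it does not reduce the number of live groups):

import reactor.core.publisher.Flux;

public class FiniteGroupsSketch {
    public static void main(String[] args) {
        Flux.range(1, 20)
            .groupBy(i -> i % 2)
            .flatMap(group -> group
                // Close the current "finite group" when the boundary condition
                // holds and open a new window, without cancelling the GroupedFlux.
                .windowUntil(i -> i % 5 == 0)
                .concatMap(window -> window
                    .collectList()
                    .doOnNext(items -> System.out.println(
                        "key " + group.key() + " window: " + items))),
                4)
            .blockLast();
    }
}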

Related

Reactor 3.x - limit the time of a groupBy Flux

Is there any way to force a Flux generated by groupBy() to complete after a period of time (or, similarly, to limit the maximum number of "open" groups), regardless of the completeness of the upstream? I have something like the following:
Flux<Foo> someFastPublisher;
someFastPublisher
    .groupBy(f -> f.getKey())
    .delayElements(Duration.ofSeconds(1)) // rate limit each group
    .flatMap(g -> g)                      // unwind the group
    .subscribe();
and I am running into a case where the Flux hangs, presumably because the number of groups is greater than the flatMap's concurrency. I could increase the flatMap concurrency, but there's no easy way to tell what the maximum possible size is. Instead, I know the Foos being grouped by Foo.key are going to be close to each other in time/publication order, and I would rather use some sort of time window on the groupBy Flux than flatMap concurrency (and ending up with two different groups with the same key() isn't a big deal).
I'm guessing the groupBy Fluxes won't onComplete until someFastPublisher onCompletes - i.e. the Fluxes handed off to flatMap just stay "open" (although they're not likely to ever get a new event).
I am able to work around this either by prefetching Integer.MAX_VALUE in the groupBy or by setting the concurrency to Integer.MAX_VALUE - but is there a way to control the "life" of the group?
Yes: you can apply take(Duration) to the groups in order to ensure they close early, and a new group with the same key will open after that:
source.groupBy(v -> v.intValue() % 2)
      .flatMap(group -> group
          .take(Duration.ofMillis(1000))
          .count()
          .map(c -> "group " + group.key() + " size = " + c))
      .log()
      .blockLast();

Esper very simple context and aggregation

I have a quite simple problem to model and I don't have experience with Esper, so I may be headed the wrong way; I'd like some insight.
Here's the scenario: I have one stream of events, "ParkingEvent", with two types of events, "SpotTaken" and "SpotFree". So I have an Esper context both partitioned by id and bounded by a starting event of type "SpotTaken" and an end event of type "SpotFree". The idea is to monitor a parking spot with a sensor and then aggregate the data to count the number of times the spot has been taken and also the occupation time.
That's it, no time window or anything, so it seems quite simple, but I'm struggling to aggregate the data. Here's the code I have so far:
create context ParkingSpotOccupation
    context PartitionBySource
        partition by source from SmartParkingEvent,
    context ContextBorders
        initiated by SmartParkingEvent(type = "SpotTaken") as startEvent
        terminated by SmartParkingEvent(type = "SpotFree") as endEvent;

@Name("measurement_occupation")
context ParkingSpotOccupation
insert into CreateMeasurement
select
    e.source as source,
    "ParkingSpotOccupation" as type,
    {
        "startDate", min(e.time),
        "endDate", max(e.time),
        "duration", dateDifferenceInSec(max(e.time), min(e.time))
    } as fragments
from
    SmartParkingEvent e
output
    snapshot when terminated;
I get the same data for min and max, so I'm guessing I'm doing something wrong.
When I use context.ContextBorders.startEvent.time and context.ContextBorders.endEvent.time instead of min and max, the measurement_occupation statement is not triggered.
Given that measurements have already been computed by the EPL that you provided, this counts the number of times the spot has been taken (and freed) and totals up the duration:
select source, count(*), sum(duration) from CreateMeasurement group by source
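For completeness, a hypothetical sketch of registering that statement and printing its updates (not from the original answer; it assumes the Esper 5.x-era Java API, that the CreateMeasurement stream above is already registered in the same engine, and the takenCount/totalSeconds aliases are additions for readability):

import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;

// Register the summary statement and print each update as spots are freed.
EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider();
EPStatement stmt = engine.getEPAdministrator().createEPL(
    "select source, count(*) as takenCount, sum(duration) as totalSeconds "
  + "from CreateMeasurement group by source");
stmt.addListener((newEvents, oldEvents) -> {
    if (newEvents == null) return;
    for (EventBean event : newEvents) {
        System.out.println(event.get("source") + ": taken "
            + event.get("takenCount") + " times, "
            + event.get("totalSeconds") + "s total");
    }
});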

Is Neo4j newsfeed flawed?

I am using this example, http://neo4j.com/docs/stable/cypher-cookbook-newsfeed.html, to maintain newsfeeds for my users. So I use the following to post a status update:
MATCH (me)
WHERE me.name='Bob'
OPTIONAL MATCH (me)-[r:STATUS]-(secondlatestupdate)
DELETE r
CREATE (me)-[:STATUS]->(latest_update { text:'Status',date:123 })
WITH latest_update, collect(secondlatestupdate) AS seconds
FOREACH (x IN seconds | CREATE (latest_update)-[:NEXT]->(x))
RETURN latest_update.text AS new_status
I encountered a severe flaw in this and don't know how to fix it. In a very rare scenario where two status updates are posted at nearly the same time (e.g. 10 ms apart), instead of replacing the current status, Neo4j creates two status updates. This leads to a much bigger problem where the subsequent updates are posted twice!
This looks like a race condition. To resolve it, you basically need to make sure that, at any given time, only one transaction is modifying the status for this specific user.
Neo4j's Java API does have the ability to set locks to achieve this. Cypher doesn't have an explicit feature for this, but you can e.g. remove a non-existing property to force a lock on the given node. With a lock in place, concurrent transactions need to wait until the holder of the lock has finished its transaction.
So grab a lock early in your statement:
MATCH (me)
WHERE me.name='Bob'
REMOVE me._not_existing // side effect: grab a lock early
WITH me
OPTIONAL MATCH (me)-[r:STATUS]-(secondlatestupdate)
DELETE r
CREATE (me)-[:STATUS]->(latest_update { text:'Status',date:123 })
WITH latest_update, collect(secondlatestupdate) AS seconds
FOREACH (x IN seconds | CREATE (latest_update)-[:NEXT]->(x))
RETURN latest_update.text AS new_status
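For the Java API route mentioned above, a hedged sketch (assuming an embedded Neo4j 3.x GraphDatabaseService; the :User label and property names are illustrative):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// Serialize status updates per user by taking an explicit write lock.
static void postStatus(GraphDatabaseService db, String name, String text) {
    try (Transaction tx = db.beginTx()) {
        Node me = db.findNode(Label.label("User"), "name", name);
        tx.acquireWriteLock(me); // same effect as the REMOVE trick above
        // ... delete the old :STATUS relationship and create the new one ...
        tx.success();
    }
}

Either way, the point is the same: only one writer at a time may touch a given user node.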

Esper EPL statement each time a value has increased by a multiple

I am looking for an EPL statement which fires an event each time a certain value has increased by a specified amount, with any number of events in between, for example:
Considering a stream, which continuously provides new prices.
I want to get a notification, e.g. if the price is greater than the first price + 100. Something like
select * from pattern[a=StockTick -> every b=StockTick(b.price>=a.price+100)];
But how do I arrange to get the next event(s) when the increase is >= 200, >= 300, and so forth?
Various tests with contexts and windows have not been successful so far, so I'd appreciate any help! Thanks!
The contexts would be the right way to go.
You could start by defining a start event like this:
create schema StartEvent(threshold int);
And then have context that uses the start event:
create context ThresholdContext initiated by StartEvent as se
terminated after 5 years;

context ThresholdContext
select * from pattern[a=StockTick -> every b=StockTick(b.price >= context.se.threshold)];
You can generate the StartEvent using "insert into" from the same pattern (you probably want to remove the "every"), or have the listener send in a StartEvent, or declare another pattern that fires just once to create a StartEvent.
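A hypothetical sketch of the "have the listener send in a StartEvent" option (assumes the Esper 5.x-era Java API and a map-backed event; the threshold value of 100 is illustrative):

import java.util.Collections;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;

// Create the schema once, then push a StartEvent in from application code.
EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider();
engine.getEPAdministrator().createEPL("create schema StartEvent(threshold int)");
engine.getEPRuntime().sendEvent(
    Collections.singletonMap("threshold", 100), "StartEvent");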

Filtering by aggregate function

I am trying to raise an event when the average value of a field is over a threshold for a minute. I have the object defined as:
class Heartbeat
{
public string Name;
public int Heartbeat;
}
My condition is defined as
select avg(Heartbeat), Name
from Heartbeat.std:groupwin(Name).win:time(60 sec)
having avg(Heartbeat) > 100
However, the event never gets fired despite the fact that I fire a number of events with the Heartbeat value over 100. Any suggestions on what I have done wrong?
Thanks in advance
It confuses many people, but since time is the same for all groups, you can simplify the query and remove the groupwin. The documentation note in this section explains why: http://esper.codehaus.org/esper-4.11.0/doc/reference/en-US/html_single/index.html#view-std-groupwin
The semantics with or without groupwin are the same.
I think you want group-by (and not groupwin) since group-by controls the aggregation level and groupwin controls the data window level.
New query:
select avg(Heartbeat), Name
from Heartbeat.win:time(60 sec)
group by Name
having avg(Heartbeat) > 100
