Reactor delayElements within a GroupedFlux delays elements across all groups - project-reactor

I have a use case where I want to create a number of GroupedFlux by a PartitionKey and, within each group, delay elements by 100 milliseconds. However, I want multiple groups to run at the same time, so with 3 groups I expect 3 messages emitted every 100 milliseconds. With the following code, however, I see only 1 message every 100 milliseconds.
This is the code that I was expecting to work.
final Flux<GroupedFlux<String, TData>> groupedFlux =
    flux.groupBy(Event::getPartitionKey);
groupedFlux.subscribe(g -> g.delayElements(Duration.ofMillis(100))
        .flatMap(this::doWork)
        .doOnError(throwable -> log.error("error: ", throwable))
        .onErrorResume(e -> Mono.empty())
        .subscribe());
This is the log.
21:24:29.318 [parallel-5] : GroupByKey : 2
21:24:29.424 [parallel-6] : GroupByKey : 3
21:24:29.529 [parallel-7] : GroupByKey : 1
21:24:29.634 [parallel-8] : GroupByKey : 2
21:24:29.739 [parallel-9] : GroupByKey : 3
21:24:29.844 [parallel-10] : GroupByKey : 1
21:24:29.953 [parallel-11] : GroupByKey : 2
21:24:30.059 [parallel-12] : GroupByKey : 3
21:24:30.167 [parallel-1] : GroupByKey : 1
(Note the roughly 100 ms gap between consecutive log statements; the first column is the timestamp.)

Upon further analysis I found that it was in fact working fine. My tests had incorrect data for the PartitionKey, which resulted in a single GroupedFlux.
Answering my own question in case someone ever doubts whether delayElements works differently on a GroupedFlux: it does not.
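For reference, here is a minimal, self-contained sketch (with hypothetical keys and data) showing that delayElements applies per group: with three distinct keys, three elements are emitted every 100 ms.

import java.time.Duration;
import reactor.core.publisher.Flux;

public class GroupedDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        Flux.range(0, 30)
                .groupBy(i -> "key-" + (i % 3)) // three distinct partition keys
                .subscribe(group -> group
                        .delayElements(Duration.ofMillis(100)) // delays within this group only
                        .subscribe(i -> System.out.printf("%d [%s] : %d%n",
                                System.currentTimeMillis(), group.key(), i)));
        Thread.sleep(2000); // keep the JVM alive long enough to observe the output
    }
}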

Related

Can Flux of Project Reactor process messages one by one

I'm trying to process a list of numbers, for example 1 to 10, one by one using a Reactor Flux. There is an API /double which simply doubles the incoming Integer (1 -> 2, 4 -> 8, ...); however, this API has a performance issue: it always takes 2 seconds to return the result.
Using limitRate(1), what I expected is that Reactor processes the requests one after another, like this:
2020-01-01 00:00:02 - 2
2020-01-01 00:00:04 - 4
2020-01-01 00:00:06 - 6
2020-01-01 00:00:08 - 8
2020-01-01 00:00:10 - 10
...
But actually Reactor fires all requests at once:
2020-01-01 00:00:02 - 6
2020-01-01 00:00:02 - 10
2020-01-01 00:00:02 - 2
2020-01-01 00:00:02 - 4
2020-01-01 00:00:02 - 8
...
Here is the code:
Flux.range(1, 10)
    .limitRate(1)
    .flatMap(i -> webClient.get().uri("http://localhost:10001/double?integer={int}", i).exchange()
        .flatMap(resp -> resp.bodyToMono(Integer.class)))
    .subscribe(System.out::println);
Thread.sleep(10000);
It seems limitRate is not working as I expected. What went wrong? Is there any way to process requests one after another using Reactor? Thanks in advance.
flatMap doesn't work here because it subscribes to the inner streams eagerly - that is, it won't wait for an inner stream to emit onComplete before subscribing to the next one. This is why all of your calls are made concurrently: it works in receive->dispatch->receive->dispatch mode.
Reactor provides an overloaded version of flatMap where you can specify the concurrency factor as .flatMap(innerStream, concurrency). This factor caps the number of inner streams flatMap will subscribe to. If it is, say, 5, flatMap can subscribe to at most 5 inner streams; as soon as this limit is hit, it has to wait for an inner stream to emit onComplete before it subscribes to the next one.
In your case, you can either set the concurrency to 1 or use .concatMap(), which is exactly flatMap with concurrency = 1. It basically works in receive->dispatch->wait->receive->dispatch->wait mode.
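A minimal runnable sketch of that fix (the delayed Mono below is a hypothetical stand-in for the slow /double call):

import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class SequentialDemo {
    // Stand-in for the slow /double endpoint: doubles the input after a delay.
    static Mono<Integer> slowDouble(int i) {
        return Mono.delay(Duration.ofMillis(200)).map(t -> i * 2);
    }

    public static void main(String[] args) throws InterruptedException {
        Flux.range(1, 10)
                .flatMap(SequentialDemo::slowDouble, 1) // concurrency = 1: one call at a time
                .subscribe(System.out::println);        // prints 2, 4, 6, ... roughly 200 ms apart
        Thread.sleep(3000); // keep the main thread alive for the async pipeline
    }
}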
I wrote a post some time back explaining exactly how flatMap works, because I think a lot of people use it without understanding its internals. You can refer to the article here
Consider using concatMap instead:
/**
 * Transform the elements emitted by this {@link Flux} asynchronously into Publishers,
 * then flatten these inner publishers into a single {@link Flux}, sequentially and
 * preserving order using concatenation.
 * <p>
 * There are three dimensions to this operator that can be compared with
 * {@link #flatMap(Function) flatMap} and {@link #flatMapSequential(Function) flatMapSequential}:
 * <ul>
 * <li><b>Generation of inners and subscription</b>: this operator waits for one
 * inner to complete before generating the next one and subscribing to it.</li>
 * <li><b>Ordering of the flattened values</b>: this operator naturally preserves
 * the same order as the source elements, concatenating the inners from each source
 * element sequentially.</li>
 * <li><b>Interleaving</b>: this operator does not let values from different inners
 * interleave (concatenation).</li>
 * </ul>
 * <p>
 * Errors will immediately short circuit current concat backlog.
 * <p>
 * <img class="marble" src="doc-files/marbles/concatMap.svg" alt="">
 *
 * @reactor.discard This operator discards elements it internally queued for backpressure upon cancellation.
 *
 * @param mapper the function to transform this sequence of T into concatenated sequences of V
 * @param <V> the produced concatenated type
 *
 * @return a concatenated {@link Flux}
 */
public final <V> Flux<V> concatMap(Function<? super T, ? extends Publisher<? extends V>> mapper) {
Pay attention to the phrase "sequentially and preserving order using concatenation". That seems to be exactly what you are looking for.
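Adapted to the pipeline from the question, that would look like this (a sketch assuming the same hypothetical endpoint; exchange() is kept as in the question, though retrieve() is the more modern alternative):

import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

WebClient webClient = WebClient.create();
Flux.range(1, 10)
        .concatMap(i -> webClient.get()
                .uri("http://localhost:10001/double?integer={int}", i)
                .exchange() // one request fully completes before the next starts
                .flatMap(resp -> resp.bodyToMono(Integer.class)))
        .subscribe(System.out::println);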
Inspired by Artem Bilan's answer, I found that flatMapSequential is a better fit for my case, since flatMapSequential accepts a maxConcurrency as its second parameter, which makes it possible to process messages not only one by one but also two at a time, and so on.
Thanks Artem Bilan and Prashant Pandey for your answers, they really helped.
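For example (a sketch with the same kind of hypothetical delayed stand-in for the slow call):

import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class FlatMapSequentialDemo {
    public static void main(String[] args) throws InterruptedException {
        Flux.range(1, 10)
                // Up to 2 inner calls in flight at once; output still keeps source order.
                .flatMapSequential(i -> Mono.delay(Duration.ofMillis(200)).map(t -> i * 2), 2)
                .subscribe(System.out::println);
        Thread.sleep(2000); // 10 calls, 2 at a time, ~200 ms each
    }
}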

Is there an equivalent to Akka Streams' `conflate` and/or `batch` operators in Reactor?

I am looking for an equivalent of the batch and conflate operators from Akka Streams in Project Reactor, or some combination of operators that mimic their behavior.
The idea is to aggregate upstream items in a reduce-like manner while the downstream applies backpressure.
Note that this is different from this question because the throttleLatest / conflate operator described there is different from the one in Akka Streams.
Some background regarding what I need this for:
I am watching a change stream on a MongoDB and for every change I run an aggregate query on the MongoDB to update some metric. When lots of changes come in, the queries can't keep up and I'm getting errors. As I only need the latest value of the aggregate query, it is fine to aggregate multiple change events and run the aggregate query less often, but I want the metric to be as up-to-date as possible so I want to avoid waiting a fixed amount of time when there is no backpressure.
The closest I have come so far is this:
changeStream
    .window(Duration.ofSeconds(1))
    .concatMap { it.reduce(setOf<String>(), { applicationNames, event -> applicationNames + event.body.sourceReference.applicationName }) }
    .concatMap { Flux.fromIterable(it) }
    .concatMap { taskRepository.findTaskCountForApplication(it) }
but this would always wait for 1 second regardless of backpressure.
What I would like is something like this:
changeStream
    .conflateWithSeed({ setOf(it.body.sourceReference.applicationName) }, { applicationNames, event -> applicationNames + event.body.sourceReference.applicationName })
    .concatMap { Flux.fromIterable(it) }
    .concatMap { taskRepository.findTaskCountForApplication(it) }
I assume you always run only one query at a time - no parallel execution. My idea is to buffer elements in a list (which can easily be aggregated) for as long as the query is running. As soon as the query finishes, the next list is processed.
I tested it with the following code:
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

// An AtomicBoolean instead of a plain boolean: Java lambdas cannot mutate
// local variables, and the flag is touched from different threads.
AtomicBoolean isQueryRunning = new AtomicBoolean(false);
Flux.range(0, 1000000)
    .delayElements(Duration.ofMillis(10))
    .bufferUntil(i -> !isQueryRunning.get())      // keep buffering while a query runs
    .doOnNext(integers -> isQueryRunning.set(true))
    .concatMap(integers -> Mono.fromCallable(() -> {
        int sleepTime = new Random().nextInt(10000);
        System.out.println("processing " + integers.size() + " elements. Sleep time: " + sleepTime);
        Thread.sleep(sleepTime);                   // simulates the running query
        return "";
    })
    .subscribeOn(Schedulers.elastic()))
    .doOnNext(s -> isQueryRunning.set(false))
    .subscribe();
Which prints
processing 1 elements. Sleep time: 4585
processing 402 elements. Sleep time: 2466
processing 223 elements. Sleep time: 2613
processing 236 elements. Sleep time: 5172
processing 465 elements. Sleep time: 8682
processing 787 elements. Sleep time: 6780
It is clearly visible that the size of the next batch is proportional to the previous query's execution time (the sleep time).
Note that this is not a "real" backpressure solution, just a workaround. It is also not suited for parallel execution, and it might require some tuning to prevent running queries for empty batches.

How do I properly transform missing datapoints as 0 in Prometheus?

We have an alert we want to fire based on the previous 5m of metrics (say, if the value is above 0). However, if the metric is 0 it's not written to Prometheus, and as such it's not returned for that time bucket.
The result is that we may have an example data-set of:
-60m | -57m | -21m | -9m | -3m    <<< relative time
  1  |   0  |   1  |  0  |  1     <<< data returned
which ultimately results in the alert firing every time the metric is above 0, not only when it's above 0 for 5m. I've tried appending OR on() vector() to the end of our query, but it does funny stuff with the returned dataset:
values:Array[12]
0:Array[1539021420,0.16666666666666666]
1:Array[1539021480,0]
2:Array[1539021540,0]
3:Array[1539021600,0]
4:Array[1539021660,0]
5:Array[1539021720,0]
6:Array[1539021780,0]
7:Array[1539021840,0]
8:Array[1539021900,0]
9:Array[1539021960,0]
10:Array[1539022020,0]
11:Array[1539022080,0]
For some reason it's putting the "real" data at the front of the array (even though my starting time is well before 1539021420) and continuing from that timestamp forward.
What is the proper way to have Prometheus return 0 for data-points which may not exist?
To be clear, this isn't an Alertmanager question -- I'm using a different tool for alerting on this data.

Dataflow - Approx Unique on unbounded source

I'm getting unexpected results streaming in the cloud.
My pipeline looks like:
SlidingWindow(60min).every(1min)
    .triggering(Repeatedly.forever(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30)))))
    .withAllowedLateness(15sec)
    .accumulatingFiredPanes()
    .apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
    .apply("Window hack filter", ParDo(
        if (window.maxTimestamp.isBeforeNow())
            c.output(element)))
    .toJSON()
    .toPubSub()
If that filter isn't there, I get 60 windows per output, apparently because the Pub/Sub sink isn't window-aware.
So in the examples below, if each time period is a minute, I'd expect to see the unique count grow until 60 minutes when the sliding window closes.
Using DirectRunner, I get expected results:
t1: 5
t2: 10
t3: 15
...
tx: growing unique count
In Dataflow, I get weird results:
t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count
However, if my unbounded source has older data, I'll get normal-looking results until it catches up, at which point I'll get the wrong results.
I was thinking it had to do with my window filter, but removing that didn't change the results.
If I do a Distinct() then Count().perKey(), it works, but that slows my pipeline considerably.
What am I overlooking?
[Update from the comments]
ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as with windows firing multiple times. Fix (included in version 2.4): https://github.com/apache/beam/pull/4688

Arrays and Datatypes in Z3py

I'm using Z3py for scheduling problems, and I'm trying to represent a two-processor system on which 4 processes with different execution times must be completed.
My actual data are:
Process 1 : Arrival at 0 and execution time of 4
Process 2 : Arrival at 1 and execution time of 3
Process 3 : Arrival at 3 and execution time of 5
Process 4 : Arrival at 1 and execution time of 2
I'm trying to represent each process by decomposing it into subprocesses of equal duration, so my datatypes look like this:
Pn = Datatype('Pn')
Pn.declare('1')
Pn.declare('2')
Pn.declare('3')
Pn.declare('4')
Pt = Datatype('Pt')
Pt.declare('1')
Pt.declare('2')
Pt.declare('3')
Pt.declare('4')
Pt.declare('5')
Process = Datatype('Process')
Process.declare('cons' , ('name',Pn) , ('time', Pt))
Process.declare('idle')
where Pn and Pt are the process name and the part of the process (process 1 has 4 parts, ...).
But now I don't know how to represent my processors so that I can add the 3 rules I need: unicity (each subprocess must be executed exactly once, by only one processor), arrival check (the first part of a process can't be processed before it arrives), and order (each part of a process must be processed after the preceding one).
So I was thinking of using arrays to represent my 2 processors, with this kind of declaration:
P = Array('P', IntSort(), Process)
But when I tried to execute it, I got an error message saying:
Traceback (most recent call last):
File "C:\Users\Alexis\Desktop\test.py", line 16, in <module>
P = Array('P', IntSort() , Process)
File "src/api/python\z3.py", line 3887, in Array
File "src/api/python\z3.py", line 3873, in ArraySort
File "src/api/python\z3.py", line 56, in _z3_assert
Z3Exception: 'Z3 sort expected'
And now I don't know how to handle it... Must I create a new datatype and figure out a way to add my rules? Or is there a way to use datatypes in arrays that would let me create rules like this:
unicity = ForAll([x, y], Implies(x != y, P[x] != P[y]))
Thanks in advance
There is a tutorial on using Datatypes from the Python API. A link to the tutorial is:
http://rise4fun.com/Z3Py/tutorialcontent/advanced#h22
It shows how to create a list data-type and use the "create()" method to instantiate a Sort object from the object used when declaring the data-type. For your example, it suffices to add calls to "create()" in the places where you want to use the declared type as a sort.
See: http://rise4fun.com/Z3Py/rQ7t
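A minimal sketch of that fix, reusing the declarations from the question (each datatype is finalized with create() before being used as a sort):

from z3 import Datatype, Array, IntSort

# Declare and finalize the name and part datatypes first.
Pn = Datatype('Pn')
for name in ['1', '2', '3', '4']:
    Pn.declare(name)
Pn = Pn.create()  # Pn is now a proper Z3 sort

Pt = Datatype('Pt')
for part in ['1', '2', '3', '4', '5']:
    Pt.declare(part)
Pt = Pt.create()

Process = Datatype('Process')
Process.declare('cons', ('name', Pn), ('time', Pt))
Process.declare('idle')
Process = Process.create()  # usable wherever a sort is expected

# The original declaration now works, because Process is a created sort:
P = Array('P', IntSort(), Process)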
Regarding the rest of the case study you are looking at: it is certainly possible to express the constraints you describe using quantifiers and arrays. You could also consider somewhat more efficient encodings:
Instead of using an array, use a function declaration. So P would be declared as a unary function:
P = Function('P', IntSort(), Process.create())
Using quantifiers for small finite-domain problems may be more of an overhead than a benefit. Writing down the constraints directly as a finite conjunction saves the overhead of possibly repeated quantifier instantiation. That said, some quantified axioms can also be optimized. Z3 automatically compiles axioms of the form ForAll([x,y], Implies(x != y, P(x) != P(y))) into
an axiom of the form ForAll([x], Pinv(P(x)) == x), where Pinv is a fresh function. The new axiom still enforces that P is injective but requires only a linear number of instantiations, linear in the number of occurrences of P(t) for some term 't'.
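As an illustration of the finite-conjunction alternative, here is a hypothetical stand-alone sketch (Distinct is exactly the pairwise-disequality conjunction; IntSort() stands in for the Process sort to keep it short):

from z3 import Distinct, Function, IntSort, Solver

# State injectivity of P over the 4 concrete slots directly, with no quantifier.
P = Function('P', IntSort(), IntSort())
s = Solver()
s.add(Distinct(*[P(i) for i in range(4)]))  # P(0), ..., P(3) pairwise different
print(s.check())  # sat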
Have fun!
