Can Flux of Project Reactor process messages one by one - project-reactor

I'm trying to process a list of numbers, for example 1 to 10, one by one using Reactor Flux. There is an API, /double, which simply doubles the incoming Integer (1 -> 2, 4 -> 8, ...). However, this API has a performance issue: it always takes 2 seconds to return the result.
When using limitRate(1), I expected Reactor to process the requests one after another, as follows:
2020-01-01 00:00:02 - 2
2020-01-01 00:00:04 - 4
2020-01-01 00:00:06 - 6
2020-01-01 00:00:08 - 8
2020-01-01 00:00:10 - 10
...
But actually Reactor fires all requests at once:
2020-01-01 00:00:02 - 6
2020-01-01 00:00:02 - 10
2020-01-01 00:00:02 - 2
2020-01-01 00:00:02 - 4
2020-01-01 00:00:02 - 8
...
Here is the code
Flux.range(1, 10).limitRate(1)
    .flatMap(i -> webClient.get().uri("http://localhost:10001/double?integer={int}", i).exchange()
        .flatMap(resp -> resp.bodyToMono(Integer.class)))
    .subscribe(System.out::println);
Thread.sleep(10000);
It seems limitRate is not working as I expected. What went wrong? Is there any way to process requests one after another using Reactor? Thanks in advance.

.flatMap doesn't work here because it subscribes to the inner streams eagerly: it won't wait for an inner stream to emit onComplete before subscribing to the next one. This is why all of your calls are made concurrently. It works in receive->dispatch->receive->dispatch mode. limitRate(1) only controls how many elements are requested from the source at a time; it does not make flatMap wait for an inner stream to finish.
Reactor provides an overloaded version of flatMap that takes a concurrency factor: .flatMap(mapper, concurrency). This factor caps the number of inner streams flatMap subscribes to at any one time. If it is, say, 5, flatMap subscribes to at most 5 inner streams; once this limit is hit, it waits for an inner stream to emit onComplete before subscribing to the next one.
In your case, you can either set it to 1 or use .concatMap(). For this purpose concatMap() behaves like flatMap with concurrency = 1: it works in receive->dispatch->wait->receive->dispatch->wait mode.
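The effect of a concurrency cap can be seen without Reactor. The following is only a plain-Java analogy, not Reactor code: a fixed thread pool stands in for the cap, Thread.sleep stands in for the slow /double call, and the class and method names are made up for illustration. With a cap of 1 the calls never overlap, which is the behaviour the question asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: an analogy for flatMap(mapper, concurrency), not Reactor itself.
final class ConcurrencyCapSketch {

    /** Runs `count` simulated calls with at most `concurrency` in flight; returns the peak overlap seen. */
    static int peakInFlight(int count, int concurrency, long delayMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 1; i <= count; i++) {
                final int n = i;
                results.add(pool.submit(() -> {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max); // remember the highest overlap
                    Thread.sleep(delayMillis);             // stands in for the slow /double call
                    inFlight.decrementAndGet();
                    return n * 2;
                }));
            }
            for (Future<Integer> result : results) {
                result.get();                              // wait for every call to finish
            }
            return peak.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cap 1 -> peak in flight " + peakInFlight(5, 1, 50));
        System.out.println("cap 5 -> peak in flight " + peakInFlight(5, 5, 50));
    }
}
```

In Reactor itself the cap is the second argument of flatMap, so no thread pool is needed.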
I wrote a post some time back explaining exactly how flatMap works, because I think a lot of people use it without understanding its internals. You can refer to the article here

Consider using concatMap instead:
/**
 * Transform the elements emitted by this {@link Flux} asynchronously into Publishers,
 * then flatten these inner publishers into a single {@link Flux}, sequentially and
 * preserving order using concatenation.
 * <p>
 * There are three dimensions to this operator that can be compared with
 * {@link #flatMap(Function) flatMap} and {@link #flatMapSequential(Function) flatMapSequential}:
 * <ul>
 * <li><b>Generation of inners and subscription</b>: this operator waits for one
 * inner to complete before generating the next one and subscribing to it.</li>
 * <li><b>Ordering of the flattened values</b>: this operator naturally preserves
 * the same order as the source elements, concatenating the inners from each source
 * element sequentially.</li>
 * <li><b>Interleaving</b>: this operator does not let values from different inners
 * interleave (concatenation).</li>
 * </ul>
 *
 * <p>
 * Errors will immediately short circuit current concat backlog.
 *
 * <p>
 * <img class="marble" src="doc-files/marbles/concatMap.svg" alt="">
 *
 * @reactor.discard This operator discards elements it internally queued for backpressure upon cancellation.
 *
 * @param mapper the function to transform this sequence of T into concatenated sequences of V
 * @param <V> the produced concatenated type
 *
 * @return a concatenated {@link Flux}
 */
public final <V> Flux<V> concatMap(Function<? super T, ? extends Publisher<? extends V>>
        mapper) {
Pay attention to the phrase "sequentially and preserving order using concatenation". It seems to be what you are looking for.

Inspired by Artem Bilan's answer, I found that flatMapSequential is a better fit for my case: it accepts maxConcurrency as a second parameter, so messages can be processed not only one by one but also two at a time, and so on.
Thanks Artem Bilan and Prashant Pandey for your answers, really helped.

Related

Apache Beam - Sliding Windows Only Emit Earliest Active Window

I'm trying to use Apache Beam (via Scio) to run a continuous aggregation of the last 3 days of data (processing time) from a streaming source and output results from the earliest, active window every 5 minutes. Earliest meaning the window with the earliest start time, active meaning that the end of the window hasn't yet passed. Essentially I'm trying to get a 'rolling' aggregation by dropping the non-overlapping period between sliding windows.
A visualization of what I'm trying to accomplish with an example sliding window of size 3 days and period 1 day:
early firing - ^ no firing - x
|
** stop firing from this window once time passes this point
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | | ** stop firing from this window once time passes this point
w1: +====================+^ ^ ^
x x x x x x x | | |
w2: +====================+^ ^ ^
x x x x x x x | | |
w3: +====================+
time: ----d1-----d2-----d3-----d4-----d5-----d6-----d7---->
I've tried using sliding windows (size=3 days, period=5 min), but they produce a new window for every 3 days/5 min combination in the future and are emitting early results for every window. I tried using trigger = AfterWatermark.pastEndOfWindow(), but I need early results when the job first starts. I've tried comparing the pane data (isLast, timestamp, etc.) between windows but they seem identical.
My most recent attempt, which seems somewhat of a hack, included attaching window information to each key in a DoFn, re-windowing into a fixed window, and attempting to group and reduce to the oldest window from the attached data, but the final reduceByKey doesn't seem to output anything.
DoFn to attach window information
// ValueType is just a case class I'm using for objects
type DoFnT = DoFn[KV[String, ValueType], KV[String, (ValueType, Instant)]]
class Test extends DoFnT {
  // Window.toString looks like the following:
  // [2020-05-16T23:57:00.000Z..2020-05-17T00:02:00.000Z)
  def parseWindow(window: String): Instant = {
    Instant.parse(
      window
        .stripPrefix("[")
        .stripSuffix(")")
        .split("\\.\\.")(1))
  }
  @ProcessElement
  def process(
      context: DoFnT#ProcessContext,
      window: BoundedWindow): Unit = {
    context.output(
      KV.of(
        context.element().getKey,
        (context.element().getValue, parseWindow(window.toString))
      )
    )
  }
}
sc
.pubsubSubscription(...)
.keyBy(_.key)
.withSlidingWindows(
size = Duration.standardDays(3),
period = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
allowedLateness = Duration.ZERO,
trigger = Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1)))))))
.reduceByKey(ValueType.combineFunction())
.applyPerKeyDoFn(new Test())
.withFixedWindows(
duration = Duration.standardMinutes(5),
options = WindowOptions(
accumulationMode = DISCARDING_FIRED_PANES,
trigger = AfterWatermark.pastEndOfWindow(),
allowedLateness = Duration.ZERO))
.reduceByKey((x, y) => if (x._2.isBefore(y._2)) x else y)
.saveAsCustomOutput(
TextIO.write()...
)
Any suggestions?
First, regarding processing time: If you want to window according to processing time, you should set your event time to the processing time. This is perfectly fine - it means that the event you are processing is the event of ingesting the record, not the event that the record represents.
Now you can use sliding windows off-the-shelf to get the aggregation you want, grouped the way you want.
But you are correct that it is a bit of a headache to trigger the way you want. Triggers are not easily expressive enough to say "output the last 3 day aggregation but only begin when the window is 5 minutes from over" and even less able to express "for the first 3 day period from pipeline startup, output the whole time".
I believe a stateful ParDo(DoFn) will be your best choice. State is partitioned per key and window. Since you want to have interactions across 3 day aggregations you will need to run your DoFn in the global window and manage the partitioning of the aggregations yourself. You tagged your question google-cloud-dataflow and Dataflow does not support MapState so you will need to use a ValueState that holds a map of the active 3 day aggregations, starting new aggregations as needed and removing old ones when they are done. Separately, you can easily track the aggregation from which you want to periodically output, and have a timer callback that periodically emits the active aggregation. Something like the following pseudo-Java; you can translate to Scala and insert your own types:
DoFn<> {
  @StateId("activePeriod") StateSpec<ValueState<Period>> activePeriod = StateSpecs.value();
  @StateId("accumulators") StateSpec<ValueState<Map<Period, Accumulator>>> accumulators = StateSpecs.value();
  @TimerId("nextPeriod") TimerSpec nextPeriod = TimerSpecs.timer(TimeDomain.EVENT_TIME);
  @TimerId("output") TimerSpec outputTimer = TimerSpecs.timer(TimeDomain.EVENT_TIME);
  @ProcessElement
  public void process(
      @Element element,
      @TimerId("nextPeriod") Timer nextPeriod,
      @TimerId("output") Timer output,
      @StateId("activePeriod") ValueState<Period> activePeriod,
      @StateId("accumulators") ValueState<Map<Period, Accumulator>> accumulators) {
    // Set nextPeriod if it isn't already running
    // Set output if it isn't already running
    // Set activePeriod if it isn't already set
    // Add the element to the appropriate accumulator
  }
  @OnTimer("nextPeriod")
  public void onNextPeriod(
      @TimerId("nextPeriod") Timer nextPeriod,
      @StateId("activePeriod") ValueState<Period> activePeriod) {
    // Set activePeriod to the next one
    // Clear the period we will never read again
    // Reset the timer (there's a one-time change in this logic after the first window; add a flag for this)
  }
  @OnTimer("output")
  public void onOutput(
      @TimerId("output") Timer output,
      @StateId("activePeriod") ValueState<Period> activePeriod,
      @StateId("accumulators") ValueState<Map<Period, Accumulator>> accumulators) {
    // Output the current accumulator for the active period
    // Reset the timer
  }
}
I do have some reservations about this, because the outputs we are working so hard to suppress are not comparable to the outputs that are "replacing" them. I would be interested in learning more about the use case. It is possible there is a more straightforward way to express the result you are interested in.
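For the bookkeeping of the active period, the start of the earliest window that is still open can be computed directly from the current time. Below is a minimal sketch in plain Java, assuming windows are half-open intervals [start, start + size) whose starts are aligned to multiples of the period counted from the epoch; the class and method names are hypothetical and not part of Beam or Scio.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not a Beam API: finds the earliest sliding window still containing `now`.
final class EarliestActiveWindowSketch {

    /** Start of the earliest window [start, start + size) that contains `now`. */
    static Instant earliestActiveStart(Instant now, Duration size, Duration period) {
        long periodMillis = period.toMillis();
        // A window contains `now` only if its start is strictly greater than now - size.
        long lowerBound = now.toEpochMilli() - size.toMillis();
        // Smallest multiple of the period that is strictly greater than lowerBound.
        long start = Math.floorDiv(lowerBound, periodMillis) * periodMillis + periodMillis;
        return Instant.ofEpochMilli(start);
    }

    public static void main(String[] args) {
        Instant now = Instant.EPOCH.plus(Duration.ofHours(84)); // day 3.5
        System.out.println(earliestActiveStart(now, Duration.ofDays(3), Duration.ofDays(1)));
    }
}
```

With size = 3 days and period = 5 minutes the same function yields the one window whose results the question wants to keep at any moment.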

Scheduling a build x times

I'm trying to understand how to schedule a job.
H 6 * * * -> this launches every day in the morning.
So how can I launch 3 builds every morning?
Thanks
You can click on the "?" in the GUI to get this help:
To specify multiple values for one field, the following operators are available. In the order of precedence,
- * specifies all valid values
- M-N specifies a range of values
- M-N/X or */X steps by intervals of X through the specified range or whole valid range
- A,B,...,Z enumerates multiple values
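With this help you can use
H/20 6 * * *
to get three builds between 6:00 and 6:59, 20 minutes apart. (A range with a step such as 6-7/3 would select only hour 6 and trigger a single build.)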
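The step operator from the help text can be checked by expanding a field by hand. This is a small sketch, assuming that M-N/X means "start at M and keep adding X while the value stays within N", as the help text describes; the class is made up for illustration and is not Jenkins code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that expands a cron range with a step, e.g. 6-7/3.
final class CronStepSketch {

    /** Expands M-N/X: start at `from` and add `step` while the value is still <= `to`. */
    static List<Integer> expand(int from, int to, int step) {
        List<Integer> values = new ArrayList<>();
        for (int v = from; v <= to; v += step) {
            values.add(v);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand(6, 7, 3));   // prints [6]
        System.out.println(expand(6, 8, 1));   // prints [6, 7, 8]
        System.out.println(expand(0, 59, 20)); // prints [0, 20, 40]
    }
}
```

Under that reading, an hour field of 6-7/3 contains a single hour, 6-8 contains three hours, and a minute step of 20 fires three times within one hour (with H the starting minute is a hashed offset rather than 0).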

Possible to use less/greater than operators with IF ANY?

Is it possible to use <,> operators with the if any function? Something like this:
select if (any(>10,Q1) AND any(<2,Q2 to Q10))
You definitely need to create an auxiliary variable to do this.
@Jignesh Sutar's solution is one that works fine. However there are often multiple ways in SPSS to accomplish a certain task.
Here is another solution where the COUNT command comes in handy.
It is important to note that the following solution assumes that the values of the variables are integers. If you have float values (1.5 for instance) you'll get a wrong result.
* count occurrences where Q2 to Q10 is less than 2.
COUNT #QLT2 = Q2 TO Q10 (LOWEST THRU 1).
* select if Q1>10 and
* there is at least one occurrence where Q2 to Q10 is less than 2.
SELECT IF (Q1>10 AND #QLT2>0).
There is also a variant of this solution that deals with float variables correctly, though I think it is less intuitive.
* count occurrences where Q2 to Q10 is 2 or higher.
COUNT #QGE2 = Q2 TO Q10 (2 THRU HIGHEST).
* select if Q1>10 and
* not every occurrence of (the 9 variables) Q2 to Q10 is two or higher.
SELECT IF (Q1>10 AND #QGE2<9).
Note: Variables beginning with # are temporary variables. They are not stored in the data set.
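Why the first variant needs integer values can be seen by writing both counting rules out. The sketch below restates the logic of the two COUNT solutions in plain Java (it is not SPSS syntax, the names are made up, and missing values are not modelled) and compares them with a direct "any value below 2" test.

```java
import java.util.stream.DoubleStream;

// Hypothetical restatement of the two COUNT-based selections for Q2 to Q10.
final class AnyLessThanSketch {

    /** Variant 1: counts values in LOWEST THRU 1, i.e. values <= 1. */
    static boolean anyByCountingUpToOne(double[] q) {
        return DoubleStream.of(q).filter(v -> v <= 1).count() > 0;
    }

    /** Variant 2: counts values in 2 THRU HIGHEST and checks that not all of them qualify. */
    static boolean anyByCountingTwoOrMore(double[] q) {
        return DoubleStream.of(q).filter(v -> v >= 2).count() < q.length;
    }

    /** The condition both variants try to express. */
    static boolean anyLessThanTwo(double[] q) {
        return DoubleStream.of(q).anyMatch(v -> v < 2);
    }

    public static void main(String[] args) {
        double[] q = {1.5, 3, 3, 3, 3, 3, 3, 3, 3}; // nine values, one of them fractional
        System.out.println(anyByCountingUpToOne(q));   // prints false: 1.5 is not <= 1
        System.out.println(anyByCountingTwoOrMore(q)); // prints true
        System.out.println(anyLessThanTwo(q));         // prints true
    }
}
```

A value such as 1.5 falls between the two ranges, so only the second rule agrees with the direct test.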
I don't think you can (it would be nice if you could; you can do something similar in Excel with COUNTIF and SUMIF, IIRC).
You'd have to construct a new variable which tests the multiple ANY less-than condition, as in the example below:
input program.
loop #j = 1 to 1000.
compute ID=#j.
vector Q(10).
loop #i = 1 to 10.
compute Q(#i) = trunc(rv.uniform(-20,20)).
end loop.
end case.
end loop.
end file.
end input program.
execute.
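vector Q=Q2 to Q10.
* reset the scratch flag for every case because scratch variables keep their value across cases.
compute #QLT2=0.
loop #i=1 to 9.
if (Q(#i)<2) #QLT2=1.
end loop if #QLT2=1.
select if (Q1>10 and #QLT2=1).
exe.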

Spread load evenly by using ‘H * * * *’ rather than ‘5 * * * *’

When setting up how Jenkins should pull changes from Subversion,
I checked Poll SCM and set the schedule to 5 * * * *. I get the following warning:
Spread load evenly by using ‘H * * * *’ rather than ‘5 * * * *’
I'm not sure what H means in this context and why I should use that.
H stands for Hash
To allow periodically scheduled tasks to produce even load on the
system, the symbol H (for “hash”) should be used wherever possible.
For example, using 0 0 * * * for a dozen daily jobs will cause a large
spike at midnight. In contrast, using H H * * * would still execute
each job once a day, but not all at the same time, better using
limited resources.
Click on the question-mark beside your schedule specification.
It says there:
To allow periodically scheduled tasks to produce even load on the
system, the symbol H (for “hash”) should be used wherever possible.
For example, using 0 0 * * * for a dozen daily jobs will cause a large
spike at midnight. In contrast, using H H * * * would still execute
each job once a day, but not all at the same time, better using
limited resources.
Also in the documentation worth noting is that:
The H symbol can be used with a range. For example, H H(0-7) * * * means some time between 12:00 AM (midnight) to 7:59 AM. You can also use step intervals with H, with or without ranges.
The H symbol can be thought of as a random value over a range, but it actually is a hash of the job name, not a random function, so that the value remains stable for any given project.
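The idea of a value that looks random but stays stable per project can be sketched in a few lines. This is only an illustration: it uses String.hashCode as a stand-in, which is an assumption and not necessarily the hash function Jenkins uses, and the class and method names are made up.

```java
// Hypothetical illustration of a stable, name-derived slot; not Jenkins' actual hash.
final class StableHashSketch {

    /** Maps a job name to a value in [low, high]; the same name always gives the same value. */
    static int slot(String jobName, int low, int high) {
        int span = high - low + 1;
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return low + Math.floorMod(jobName.hashCode(), span);
    }

    public static void main(String[] args) {
        // Minute and hour for something like "H H(0-7) * * *".
        System.out.println("minute " + slot("nightly-build", 0, 59));
        System.out.println("hour   " + slot("nightly-build", 0, 7));
        // A different job name usually lands in a different slot.
        System.out.println("minute " + slot("docs-build", 0, 59));
    }
}
```

Because the value depends only on the name, re-saving the job does not move its schedule, while different jobs spread out over the range.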

Schedule nightly 22-03 build using Jenkins and H, the "hash symbol"

A build that takes about three hours to complete needs to be scheduled for nightly building outside office hours: not sooner than 22:00 and not later than 3:59 next day.
I'd also like to use the "H symbol" to avoid collision with future nightly builds. From in-line help in Jenkins:
To allow periodically scheduled tasks to produce even load on the system, the symbol H (for “hash”) should be used wherever possible. For example, using 0 0 * * * for a dozen daily jobs will cause a large spike at midnight. In contrast, using H H * * * would still execute each job once a day, but not all at the same time, better using limited resources.
(How) can I schedule this using Jenkins? What I've tried was all considered invalid by Jenkins:
H H(22,23,0,1,2,3) * * *
Invalid input: "H H(22,23,0,1,2,3) * * *": line 1:7: expecting "-", found ','
H H22,23,0,1,2,3 * * *
Invalid input: "H H22,23,0,1,2,3 * * *": line 1:4: unexpected token: 22
H H(22-3) * * *
Invalid input: "H H(22-3) * * *": line 1:9: 1 is an invalid value. Must be within 1 and -18
Is it possible to achieve this without using plug-ins?
I think the closest you will get is to use:
H H(0-3) * * * This will run at some point between 0:00 and 3:59
@midnight This will run at some point between 0:00 and 2:59
The H(4-8) construct only works if the second item is larger than the first.
But you might as well fill in the hour yourself. Jenkins actually never changes the hour the job runs once it is set. It will basically create some random hour once you save the job and always run the job at that particular time.
Of course, you can also file a bug report or feature request asking to be able to specify this as H(22-3) or, better, fix the code and submit a patch ;)
There is no direct support to write the expression like this, but since there is timezone support (now), you can work around this.
# DONT COPY PASTE - THIS DOESNT WORK!
# This is what we would like to write, but is not supported
H H(22-3) * * *
The expression above means we want to build at some point between 22:00 and 03:59, which is a six-hour window, so we could write:
# Assuming we're in GMT+2 we can just shift the timezone
# so 22-03 becomes 10-15, which is 12 hours earlier, so the
# timezone is GMT-10
TZ=Etc/GMT-10
H H(10-15) * * *
I found this workaround in the comments of JENKINS-18313
UPDATE:
There is currently a bug, JENKINS-57702, and the timezone GMT-XX is not evaluated correctly. A workaround is to use an equivalent timezone, in this example the one for Hawaii:
TZ=US/Hawaii
H H(10-15) * * *
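The shifted hours can be double-checked with java.time. The sketch below assumes the local clock is at a fixed offset of UTC+2, as in the example, and uses a hypothetical helper to convert an hour in the cron timezone back to the local hour. One thing to keep in mind when picking a zone name: in the tz database the sign of the Etc/GMT names is inverted, so Etc/GMT-10 is ten hours ahead of UTC, while US/Hawaii is ten hours behind.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper for checking a TZ= shift; not part of Jenkins.
final class TimezoneShiftSketch {

    /** Hour on a local clock at a fixed UTC offset when it is `hour`:00 in `cronZone`. */
    static int localHour(int hour, String cronZone, int localUtcOffsetHours) {
        ZonedDateTime inCronZone = ZonedDateTime.of(
                LocalDate.of(2020, 1, 15), LocalTime.of(hour, 0), ZoneId.of(cronZone));
        return inCronZone
                .withZoneSameInstant(ZoneOffset.ofHours(localUtcOffsetHours))
                .getHour();
    }

    public static void main(String[] args) {
        System.out.println(localHour(10, "US/Hawaii", 2));  // prints 22
        System.out.println(localHour(15, "US/Hawaii", 2));  // prints 3
        System.out.println(localHour(10, "Etc/GMT+10", 2)); // prints 22
        System.out.println(localHour(10, "Etc/GMT-10", 2)); // prints 2
    }
}
```

So H(10-15) evaluated in US/Hawaii covers 22:00 to 03:59 on a UTC+2 clock, which is the window the question asks for.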
