Move from EmitterProcessor to Sinks.many() - project-reactor

For some time I have been creating an EmitterProcessor with a built-in sink as follows:
EmitterProcessor<String> emitter = EmitterProcessor.create();
FluxSink<String> sink = emitter.sink(FluxSink.OverflowStrategy.LATEST);
The output Flux is built from the processor using Flux.from:
Flux<String> out = Flux
.from(emitter
.log(log.getName()));
and the sink can be passed around and populated with strings simply by calling next().
Now we see that EmitterProcessor is deprecated.
It is replaced by Sinks.many(), like this:
Many<String> sink = Sinks.many().unicast().onBackpressureBuffer();
But how do I publish from that sink?

The answer was to call asFlux() on the Sinks.Many, which exposes the sink as a Flux (this is a view of the sink, not a cast):
Flux<String> out = Flux
.from(sink.asFlux()
.log(log.getName()));
The same asFlux() view is also used to handle cancellation and termination of the flux:
sink.asFlux().doOnCancel(() -> {
cancelSink(id, request);
});
/* Handle errors, eviction, expiration */
sink.asFlux().doOnTerminate(() -> {
disposeSink(id);
});
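UPDATE: The cancel and terminate hooks don't appear to work, per this question. Note that doOnCancel() and doOnTerminate() return a new Flux rather than modifying the sink, so a hook takes effect only if the Flux returned by that call is the one that gets subscribed to.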

Related

Reactor Flux - Only emit from Publisher on completion

I have some Reactor Kafka code that reads in events via a KafkaReceiver and writes 1..many downstream messages via 1 or more KafkaSenders that are concatenated into a single Publisher. Everything is working great, but what I'd like to do is only emit an event from this concatenated senders Flux when it is complete (i.e. it's done writing to all downstream topics for any given event, so it does not emit anything for each element as it writes to Kafka downstream until it's done). This way I could sample() and periodically commit offsets, knowing that whenever it is that sample() happens to trigger and I commit offsets for an incoming event that I've processed all downstream messages for each event I'm committing offsets for. It seems like I could use either pauseUntilOther() or then() somehow, but I don't quite see exactly how given my code and specific use case. Any thoughts or suggestions appreciated, thanks.
Main Publisher code:
this.kafkaReceiver.receive()
.groupBy(m -> m.receiverOffset().topicPartition())
.flatMap(partitionFlux ->
partitionFlux.publishOn(this.scheduler)
.flatMap(this::processEvent)
.sample(Duration.ofMillis(10))
.concatMap(sr -> commitReceiverOffset(sr.correlationMetadata())))
.subscribe();
Concatenated KafkaSenders returned by call to processEvent():
return Flux.concat(kafkaSenderFluxA, kafkaSenderFluxB)
.doOnComplete(() -> LOG.info("Finished writing all downstream messages for incoming event"));
Sounds like Flux.last() is what you are looking for:
return Flux.concat(kafkaSenderFluxA, kafkaSenderFluxB)
.doOnComplete(() -> LOG.info("Finished writing all downstream messages for incoming event"))
.last();
Then your .sample(Duration.ofMillis(10)) would pick up whatever is available as the last item from one or several batches sent to those brokers, and commitReceiverOffset() would then commit the offset for that last item.
See its JavaDocs for more info:
/**
* Emit the last element observed before complete signal as a {@link Mono}, or emit
* {@link NoSuchElementException} error if the source was empty.
* For a passive version use {@link #takeLast(int)}
*
* <p>
* <img class="marble" src="doc-files/marbles/last.svg" alt="">
*
* <p><strong>Discard Support:</strong> This operator discards elements before the last.
*
* @return a {@link Mono} with the last value in this {@link Flux}
*/
public final Mono<T> last() {
and marble diagram: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/doc-files/marbles/last.svg
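The semantics in that JavaDoc can be modeled in plain Java. This is only a rough sketch of the behavior, not Reactor code: LastSketch and its last() helper are hypothetical names, and an Iterable stands in for the Flux. It shows that only the final element survives, and that an empty source produces a NoSuchElementException, which matters if processEvent() can ever return a Flux with no sender results.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Plain-Java model of "emit only the last element on completion"; not the Reactor API.
class LastSketch {
    // Drains the source and returns its final element, or throws if the source was empty.
    static <T> T last(Iterable<T> source) {
        Iterator<T> it = source.iterator();
        if (!it.hasNext()) {
            throw new NoSuchElementException("source was empty");
        }
        T last = it.next();
        while (it.hasNext()) {
            last = it.next(); // earlier elements are discarded
        }
        return last;
    }

    public static void main(String[] args) {
        // Hypothetical send results for one incoming event.
        System.out.println(last(Arrays.asList("sentA", "sentB", "sentC"))); // prints sentC
        try {
            last(Collections.<String>emptyList());
        } catch (NoSuchElementException e) {
            System.out.println("empty source: " + e.getMessage());
        }
    }
}
```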

How to dynamically trigger a Window based on the number of processed elements?

I have an Apache Beam pipeline which runs on Google Cloud Dataflow. This is a streaming pipeline which receives input messages from Google Cloud PubSub; the messages are basically JSON arrays of elements to process.
Roughly speaking, the pipeline has these steps:
1. Deserialize the message into a PCollection<List<T>>.
2. Split (or explode) the array into a PCollection<T>.
3. Run a few processing steps: some elements finish before others, and some elements are cached, so they simply skip to the end without much processing at all.
4. Flatten all outputs and apply a GroupByKey (this is the problem step): it transforms the PCollection back into a PCollection<List<T>>, but it doesn't wait for all the elements.
5. Serialize to publish a PubSub message.
I cannot get the last GroupByKey to group all the elements that were received together. The published message doesn't contain the elements that had to be processed and took longer than those which skipped to the end.
I think this would be straightforward to solve if I could write a custom data-driven trigger, or even if I could dynamically set the trigger AfterPane.elementCountAtLeast() from a customized WindowFn.
It doesn't seem that I can make a custom trigger. But is it possible to somehow dynamically set the trigger for each window?
--
Here is a simplified version of the pipeline I am working on.
I have simplified the input from an array of objects T into a simple array of Integer. I have simulated the keys (or IDs) for these integers. Normally they would be part of the objects.
I also simplified the slow processing step (which really is several steps) into a single step with an artificial delay.
(complete example gist https://gist.github.com/naringas/bfc25bcf8e7aca69f74de719d75525f2 )
PCollection<String> queue = pipeline
.apply("ReadQueue", PubsubIO.readStrings().fromTopic(topic))
.apply(Window
.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.standardSeconds(3))
.triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(2)))
.discardingFiredPanes());
TupleTag<List<KV<Integer, Integer>>> tagDeserialized = new TupleTag<List<KV<Integer, Integer>>>() {};
TupleTag<Integer> tagDeserializeError = new TupleTag<Integer>() {};
PCollectionTuple imagesInputTuple = queue
.apply("DeserializeJSON", ParDo.of(new DeserializingFn()).withOutputTags(tagDeserialized, TupleTagList.of(tagDeserializeError)));
/*
This is where I think that I must adjust the custom window strategy, set the customized dynamic-trigger
*/
PCollection<KV<Integer, Integer>> images = imagesInputTuple.get(tagDeserialized)
/* I have tried many things
.apply(Window.<List<KV<Integer, Integer>>>into(new GlobalWindows()))
*/
.apply("Flatten into timestamp", ParDo.of(new DoFn<List<KV<Integer, Integer>>, KV<Integer, Integer>>() {
// Flatten and output into same ts
// like Flatten.Iterables() but I set the output window
@ProcessElement
public void processElement(@Element List<KV<Integer, Integer>> input, OutputReceiver<KV<Integer, Integer>> out, @Timestamp Instant ts, BoundedWindow w, PaneInfo p) {
Instant timestamp = w.maxTimestamp();
for (KV<Integer, Integer> el : input) {
out.outputWithTimestamp(el, timestamp);
}
}
}))
.apply(Window.<KV<Integer, Integer>>into(new GlobalWindows()));
TupleTag<KV<Integer, Integer>> tagProcess = new TupleTag<KV<Integer, Integer>>() {};
TupleTag<KV<Integer, Integer>> tagSkip = new TupleTag<KV<Integer, Integer>>() {};
PCollectionTuple preproc = images
.apply("PreProcessingStep", ParDo.of(new SkipOrNotDoFn()).withOutputTags(tagProcess, TupleTagList.of(tagSkip)));
TupleTag<KV<Integer, Integer>> tagProcessed = new TupleTag<KV<Integer, Integer>>() {};
TupleTag<KV<Integer, Integer>> tagError = new TupleTag<KV<Integer, Integer>>() {};
PCollectionTuple processed = preproc.get(tagProcess)
.apply("ProcessingStep", ParDo.of(new DummyDelasyDoFn()).withOutputTags(tagProcessed, TupleTagList.of(tagError)));
/* Here, at the "end"
the elements get grouped back
first: join into a PCollectionList and flatten it
second: GroupByKey which should but doesn't wait for all elements
lastly: serialize and publish (in this case just print out)
*/
PCollection end = PCollectionList.of(preproc.get(tagSkip)).and(processed.get(tagProcessed))
.apply("FlattenUpsert", Flatten.pCollections())
//
.apply("GroupByParentId", GroupByKey.create())
.apply("GroupedValues", Values.create())
.apply("PublishSerialize", ParDo.of(
new DoFn<Object, String>() {
@ProcessElement
public void processElement(ProcessContext pc) {
String output = GSON.toJson(pc.element());
LOG.info("DONE: {}", output);
pc.output(output);
}
}));
// "send the string to pubsub" goes here
I played around a little bit with stateful pipelines. Since you'd like to use data-driven triggers or AfterPane.elementCountAtLeast(), I assume you know the number of elements that make up the message (or, at least, that it does not change per key), so I defined NUM_ELEMENTS = 10 in my case.
The main idea of my approach is to keep track of the number of elements that I have seen so far for a particular key. Notice that I had to merge the PreProcessingStep and ProcessingStep into a single one for an accurate count. I understand this is just a simplified example so I don't know how that would translate to the real scenario.
In the stateful ParDo I defined two state variables, one BagState with all integers seen and a ValueState to count the number of errors:
// A state bag holding all elements seen for that key
@StateId("elements_seen")
private final StateSpec<BagState<Integer>> elementSpec =
StateSpecs.bag();
// A state cell holding error count
@StateId("errors")
private final StateSpec<ValueState<Integer>> errorSpec =
StateSpecs.value(VarIntCoder.of());
Then we process each element as usual but we don't output anything yet unless it's an error. In that case we update the error counter before emitting the element to the tagError side output:
errors.write(firstNonNull(errors.read(), 0) + 1);
is_error = true;
output.get(tagError).output(input);
We update the count and, for successfully processed or skipped elements (i.e. !is_error), write the new observed element into the BagState:
int count = firstNonNull(Iterables.size(state.read()), 0) + firstNonNull(errors.read(), 0);
if (!is_error) {
state.add(input.getValue());
count += 1;
}
Then, if the sum of successfully processed elements and errors is equal to NUM_ELEMENTS (we are simulating a data-driven trigger here), we flush all the items from the BagState:
if (count >= NUM_ELEMENTS) {
Iterable<Integer> all_elements = state.read();
Integer key = input.getKey();
for (Integer value : all_elements) {
output.get(tagProcessed).output(KV.of(key, value));
}
}
Note that here we can already group the values and emit just a single KV<Integer, Iterable<Integer>> instead. I just made a for loop instead to avoid changing other steps downstream.
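The counting logic above can be modeled outside Beam to check the arithmetic. The following is a minimal plain-Java sketch, not the answer's actual DoFn: CountTriggerSketch and process() are hypothetical names, two HashMaps stand in for the BagState and ValueState cells, and clearing the per-key state after a flush is an assumption added here that the original snippet does not show.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the per-key counting logic; it does not use the Beam API.
class CountTriggerSketch {
    static final int NUM_ELEMENTS = 10; // assumed fixed number of elements per key

    private final Map<Integer, List<Integer>> seen = new HashMap<>(); // stands in for BagState
    private final Map<Integer, Integer> errors = new HashMap<>();     // stands in for ValueState

    // Returns the buffered values once NUM_ELEMENTS inputs were counted for the key,
    // otherwise an empty list (nothing is emitted yet).
    List<Integer> process(int key, int value, boolean isError) {
        List<Integer> bag = seen.computeIfAbsent(key, k -> new ArrayList<>());
        if (isError) {
            errors.merge(key, 1, Integer::sum); // errors count toward the total but are not buffered
        } else {
            bag.add(value);
        }
        int count = bag.size() + errors.getOrDefault(key, 0);
        if (count >= NUM_ELEMENTS) {
            List<Integer> flushed = new ArrayList<>(bag);
            seen.remove(key);   // assumption: reset state so the key can be reused
            errors.remove(key);
            return flushed;
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        CountTriggerSketch sketch = new CountTriggerSketch();
        List<Integer> result = Collections.emptyList();
        for (int v = 1; v <= 10; v++) {
            result = sketch.process(0, v, v == 7); // element 7 simulates an error
        }
        System.out.println(result); // prints [1, 2, 3, 4, 5, 6, 8, 9, 10]
    }
}
```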
With this, I publish a message such as:
gcloud pubsub topics publish streamdemo --message "[1,2,3,4,5,6,7,8,9,10]"
And where before I got:
INFO: DONE: [4,8]
Now I get:
INFO: DONE: [1,2,3,4,5,6,8,9,10]
Element 7 is not present because it is the one that simulates errors.
Tested with DirectRunner and 2.16.0 SDK. Full code here.
Let me know if that works for your use case, keep in mind that I only did some minor tests.

Reactor. List of Monos, retry on fail

I have a List<Mono<String>>. Each Mono represents an API call where I wait on I/O for the result. The problem is that sometimes a call returns nothing (an empty String), and in that case I need to repeat it.
Now it looks like this:
val firstAskForItemsRetrieved = firstAskForItems.map {
it["statistic"] = (it["statistic"] as Mono<Map<Any, Any>>).block()
it
}
I wait for all Monos to finish; then, in case of an empty body, I repeat the request:
val secondAskForItems = firstAskForItemsRetrieved
.map {
if ((it["statistic"] as Map<Any, Any>).isEmpty()) {
// repeat request
it["statistic"] = getUserItem(userName) // return Mono
} else
it["statistic"] = Mono.just(it["statistic"])
it
}
And then block on each item again
val secondAskForItemsRetrieved = secondAskForItems.map {
it["statistic"] = (it["statistic"] as Mono<Map<Any, Any>>).block()
it
}
I see that this looks ugly.
Are there other ways to retry a call in a Mono if it fails, without doing it manually?
Is blocking on each item the right way to get them all?
How can I make the code better?
Thank you.
There are 2 operators I believe can help you:
For the "wait for all Mono" use case, have a look at the static methods when and zip.
when just cares about completion, so even if the monos are empty it will just signal an onComplete whenever all of the monos have finished. You don't get the data however.
zip cares about the values and expects all Monos to be valued. When all Monos are valued, it combines their values according to the passed Function. Otherwise it just completes empty.
To retry the empty Monos, have a look at repeatWhenEmpty. It resubscribes to an empty Mono, so if that Mono is "cold" it would restart the source (eg. make another HTTP request).
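The idea behind repeating an empty source can be illustrated without Reactor. The sketch below is plain Java under stated assumptions: RepeatWhenEmptySketch is a hypothetical helper, a Supplier<Optional<T>> stands in for a cold Mono (each get() models a fresh subscription, for example a new HTTP request), and the maxAttempts bound is an illustrative choice rather than Reactor's signature.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Plain-Java model of "resubscribe while the result is empty"; not Reactor code.
class RepeatWhenEmptySketch {
    // Calls the source again until it yields a value or maxAttempts is reached.
    static <T> Optional<T> repeatWhenEmpty(Supplier<Optional<T>> source, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<T> result = source.get(); // each call stands in for a new subscription
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty(); // still empty after all attempts
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical flaky call: empty on the first two attempts, then a value.
        Supplier<Optional<String>> flaky = () ->
                ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("stats");
        System.out.println(repeatWhenEmpty(flaky, 5)); // prints Optional[stats]
        System.out.println(calls[0]);                  // prints 3
    }
}
```

Bounding the number of attempts avoids looping forever against a source that never returns data.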

Consuming unbounded data in windows with default trigger

I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow pipeline. I use a fixed window and write the aggregates to BigQuery.
Reading and writing (without windowing and aggregation) works fine. But when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered. And thus the aggregates are not written.
Here is my word publisher (it uses kinglear.txt from the examples as input file):
public static class AddCurrentTimestampFn extends DoFn<String, String> {
@ProcessElement public void processElement(ProcessContext c) {
c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
}
}
public static class ExtractWordsFn extends DoFn<String, String> {
@ProcessElement public void processElement(ProcessContext c) {
String[] words = c.element().split("[^a-zA-Z']+");
for (String word:words){ if(!word.isEmpty()){ c.output(word); }}
}
}
// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
.apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
.apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
.apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();
Here is my windowed word counter:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
.withSchema(o.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
Window.Bound<String> w = Window
.<String>into(FixedWindows.of(Duration.standardSeconds(1)));
p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
.apply("FixedWindow", w)
.apply("CountWords", Count.<String>perElement())
.apply("CreateRows", ParDo.of(new WordCountToRowFn()))
.apply("WriteRows", tablePipe);
p.run();
The above subscriber will not work, since the window does not seem to trigger using the default trigger. However, if I manually define a trigger the code works and the counts are written to BigQuery.
Window.Bound<String> w = Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
.triggering(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes();
I would like to avoid specifying custom triggers if possible.
Questions:
Why does my solution not work with Dataflow's default trigger?
How do I have to change my publisher or subscriber to trigger windows using the default trigger?
How are you determining the trigger never fires?
Your PubsubIO.Write and PubsubIO.Read transforms should both specify a timestamp label using withTimestampLabel; otherwise the timestamps you've added will not be written to Pub/Sub and the publish times will be used instead.
Either way, the input watermark of the pipeline will be derived from the timestamps of the elements waiting in PubSub. Once all inputs have been processed, it will stay back for a few minutes (in case there was a delay in the publisher) before advancing to real time.
What you are likely seeing is that all the elements are published in the same ~1 second window (since the input file is pretty small). These are all read and processed relatively quickly, but the 1-second window they are put in will not trigger until after the input watermark has advanced, indicating that all data in that 1-second window has been consumed.
This won't happen for several minutes, which may make it look like the trigger isn't working. The trigger you wrote fires after 1 second of processing time, which is much earlier, but it gives no guarantee that all the data has been processed.
Steps to get better behavior from the default trigger:
Use withTimestampLabel on both the write and read pubsub steps.
Have the publisher spread the timestamps out further (e.g., run for several minutes and spread the timestamps out across that range).
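The watermark behavior described in this answer can be approximated with a small plain-Java model. This is a sketch under assumptions, not Beam code: DefaultTriggerSketch, windowStart() and fires() are hypothetical names, timestamps are plain milliseconds, and the default trigger is simplified to "fire once the watermark reaches the end of the window".

```java
// Plain-Java model of event-time fixed windows with a watermark-based trigger; not Beam code.
class DefaultTriggerSketch {
    // Start of the fixed window (of the given size) that contains the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Simplified default trigger: fires once the watermark reaches the end of the window.
    static boolean fires(long windowStartMillis, long sizeMillis, long watermarkMillis) {
        return watermarkMillis >= windowStartMillis + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 1_000L;                       // 1-second fixed windows
        long start = windowStart(10_250L, size);  // element stamped at t = 10.25 s
        System.out.println(start);                        // prints 10000
        // All elements were read quickly, but the watermark is still held back.
        System.out.println(fires(start, size, 10_900L));  // prints false
        // Only after the watermark advances past the window end does the pane fire.
        System.out.println(fires(start, size, 11_000L));  // prints true
    }
}
```

In this model, how quickly elements are processed has no influence on firing; only the watermark does, which is why spreading timestamps out makes the default trigger visibly fire.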

F# Start/Stop class instance at the same time

I am doing F# programming, I have some special requirements.
I have 3 class instances; each class instance has to run for one hour every day, from 9:00AM to 10:00AM. I want to control them from main program, starting them at the same time, and stop them also at the same time. The following is my code to start them at the same time, but I don’t know how to stop them at the same time.
#light
module Program
open ClassA
open ClassB
open ClassC
let A = new ClassA.A("A")
let B = new ClassB.B("B")
let C = new ClassC.C("C")
let task = [ async { return A.jobA("A") };
             async { return B.jobB("B") };
             async { return C.jobC("C") } ]
task |> Async.Parallel |> Async.RunSynchronously |> ignore
Does anyone know how to stop all 3 class instances at 10:00AM? Please show me your code.
Someone told me that I can use async with cancellation tokens, but since I am calling instance of classes in different modules, it is difficult for me to find suitable code samples.
Thanks,
The jobs themselves need to be stoppable, either by having a Stop() API of some sort, or cooperatively being cancellable via CancellationTokens or whatnot, unless you're just talking about some job that spins in a loop and you'll just thread-abort it eventually? Need more info about what "stop" means in this context.
As Brian said, the jobs themselves need to support cancellation. The programming model for cancellation that works best with F# is based on CancellationToken, because F# propagates the CancellationToken automatically through asynchronous workflows.
To implement the cancellation, your JobA methods will need to take an additional argument:
type A() =
    member x.JobA(str, cancellationToken:CancellationToken) =
        for i in 0 .. 10 do
            // Stop cooperatively as soon as cancellation has been requested
            cancellationToken.ThrowIfCancellationRequested()
            someOtherWork()
The idea is that you call ThrowIfCancellationRequested frequently during the execution of your job. If a cancellation is requested, the method throws and the operation stops. Once you do this, you can write an asynchronous workflow that gets the current CancellationToken and passes it to the JobA member when calling it:
let task =
    [ async { let! tok = Async.CancellationToken
              return A.JobA("A", tok) };
      async { let! tok = Async.CancellationToken
              return B.JobB("B", tok) } ]
Now you can create a new token using CancellationTokenSource and start the workflow. When you then cancel the token source, it will automatically stop any jobs running as part of the workflow:
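open System.Threading
let src = new CancellationTokenSource()
// Run the jobs in parallel under one token; Async.Ignore discards their results
Async.Start(task |> Async.Parallel |> Async.Ignore, cancellationToken = src.Token)
// To cancel the jobs:
src.Cancel()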
You asked this question on hubfs.net, and I'll repeat my answer here: try using Quartz.NET. You'd just implement IInterruptableJob in A, B, and C, defining how they stop. Then schedule another job at 10:00AM to stop the others.
Quartz.NET has a nice tutorial, FAQ, and lots of examples. It's pretty easy to use for simple cases like this, yet very powerful if you ever need more complex scheduling, monitoring jobs, logging, etc.
