I have some Reactor Kafka code that reads events via a KafkaReceiver and writes one or more downstream messages via one or more KafkaSenders that are concatenated into a single Publisher. Everything is working great, but I'd like this concatenated senders Flux to emit an element only when it completes (i.e. once it has finished writing to all downstream topics for a given event, rather than emitting an element for each write to Kafka). That way I could sample() and periodically commit offsets, knowing that whenever sample() triggers, every event whose offset I commit has had all of its downstream messages written. It seems like I could use either pauseUntilOther() or then() somehow, but I don't quite see how, given my code and specific use case. Any thoughts or suggestions appreciated, thanks.
Main Publisher code:
this.kafkaReceiver.receive()
    .groupBy(m -> m.receiverOffset().topicPartition())
    .flatMap(partitionFlux ->
        partitionFlux.publishOn(this.scheduler)
            .flatMap(this::processEvent)
            .sample(Duration.ofMillis(10))
            .concatMap(sr -> commitReceiverOffset(sr.correlationMetadata())))
    .subscribe();
Concatenated KafkaSenders returned by call to processEvent():
return Flux.concat(kafkaSenderFluxA, kafkaSenderFluxB)
    .doOnComplete(() -> LOG.info("Finished writing all downstream messages for incoming event"));
Sounds like Flux.last() is what you are looking for:
return Flux.concat(kafkaSenderFluxA, kafkaSenderFluxB)
    .doOnComplete(() -> LOG.info("Finished writing all downstream messages for incoming event"))
    .last();
Then your .sample(Duration.ofMillis(10)) would pick up whatever is available as the last item from one or several batches sent to those brokers. In the end, your commitReceiverOffset() would properly commit whatever was the last.
See its JavaDocs for more info:
/**
 * Emit the last element observed before complete signal as a {@link Mono}, or emit
 * {@link NoSuchElementException} error if the source was empty.
 * For a passive version use {@link #takeLast(int)}
 *
 * <p>
 * <img class="marble" src="doc-files/marbles/last.svg" alt="">
 *
 * <p><strong>Discard Support:</strong> This operator discards elements before the last.
 *
 * @return a {@link Mono} with the last value in this {@link Flux}
 */
public final Mono<T> last() {
and marble diagram: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/doc-files/marbles/last.svg
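As a rough illustration of the contract quoted in the JavaDoc above, here is a plain-Java model of what last() does to a sequence. It is only a sketch: the class name LastSemantics and the sample send results are made up for illustration, and it works on an Iterator rather than on Reactor types, so it says nothing about how Reactor implements the operator.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java model of the last() contract; not Reactor's implementation.
class LastSemantics {
    // Consumes the whole source and returns only its final element,
    // mirroring "emit the last element observed before complete signal".
    static <T> T last(Iterator<T> source) {
        if (!source.hasNext()) {
            // Mirrors the documented error for an empty source.
            throw new NoSuchElementException("source was empty");
        }
        T latest = source.next();
        while (source.hasNext()) {
            latest = source.next(); // earlier elements are dropped
        }
        return latest;
    }

    public static void main(String[] args) {
        // Hypothetical send results for one incoming event, in concat order.
        List<String> sendResults = List.of("topicA-1", "topicA-2", "topicB-1");
        System.out.println(last(sendResults.iterator())); // prints "topicB-1"
    }
}
```

The point for the use case above is that each incoming event yields exactly one element downstream, and only after every concatenated sender has finished.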
In my model, I have some agents;
"Demand" agent,
"EnergyProducer1" agent
"EnergyProducer2" agent.
When my hourly energy demands are created in the Main agent with a function, the priority for satisfying this demand belongs to the "EnergyProducer1" agent. In this agent, I have a function that calculates energy production based on some situations. Part of this function is as follows:
if (statechartA.isStateActive(Operating.busy) && main.heatLoadDemandPerHour >= heatPowerNominal) {
    producedHeatPower = heatPowerNominal;
    naturalGasConsumptionA = naturalGasConsumptionNominal;
    send("boilerWorking", boiler);
} else .....
My question is about the 4th line of the code. If agent1 fails to satisfy the hourly demand, I have to tell agent2 to satisfy the rest of the demand. If I send this message to agent2, its statechart will become active and agent2's function will run. Will all of this happen within the same hour? If not, is accessing the variables and parameters of agent2 directly a more appropriate way?
I hope I have explained my problem clearly.
Thanks in advance for your help.
As a general comment on your question: within the AnyLogic environment, sending messages is always preferable to directly accessing the variables and parameters of another agent.
Specifically, in the example presented, the send() function will schedule message delivery for the next instant after the completion of the current function.
Update: A message in AnyLogic can be any Java class. Sending strings such as "boilerWorking", as in the example, is good for general control. However, if more information needs to be shared (such as a double value), it is good practice to create a new Java class (let's call it ModelMessage, and follow these instructions) with at least two properties, msgStr and msgVal. With this new class, sending a message changes from this:
...
send("boilerWorking", boiler);
...
to this:
...
send(new ModelMessage("boilerWorking",42.0), boiler);
...
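and the transitions in the statechart have to be changed to fire "if expression is true", with the expression being msg.msgStr.equals("boilerWorking"). Use equals() here rather than ==, since == on Java strings compares object references, not content.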
More information about Agent communication is available here.
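The ModelMessage class described above could look roughly like the following. This is a minimal sketch: only the class name and the two properties msgStr and msgVal come from the text, while the constructor shape matches the send(new ModelMessage("boilerWorking", 42.0), boiler) call and the is() helper is a hypothetical convenience for transition guards.

```java
import java.util.Objects;

// Hypothetical message class; the names msgStr and msgVal follow the text above.
class ModelMessage {
    public final String msgStr; // control keyword, e.g. "boilerWorking"
    public final double msgVal; // numeric payload, e.g. the demand still to be met

    ModelMessage(String msgStr, double msgVal) {
        this.msgStr = Objects.requireNonNull(msgStr, "msgStr");
        this.msgVal = msgVal;
    }

    // Hypothetical helper for transition guards: compares content, not identity.
    boolean is(String expected) {
        return msgStr.equals(expected);
    }

    public static void main(String[] args) {
        ModelMessage msg = new ModelMessage("boilerWorking", 42.0);
        System.out.println(msg.is("boilerWorking")); // prints "true"
        System.out.println(msg.msgVal);              // prints "42.0"
    }
}
```

With a class like this, the receiving agent can read msg.msgVal in the transition's action to learn how much demand is left, instead of reaching into the sender's variables.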
For some time we have been creating an EmitterProcessor with a built-in sink as follows:
EmitterProcessor<String> emitter = EmitterProcessor.create();
FluxSink<String> sink = emitter.sink(FluxSink.OverflowStrategy.LATEST);
The sink publishes using a Flux.from call:
Flux<String> out = Flux
    .from(emitter
        .log(log.getName()));
The sink can be passed around and populated with strings simply by calling next().
Now we see that EmitterProcessor is deprecated.
It has been replaced by Sinks.many(), like this:
Many<String> sink = Sinks.many().unicast().onBackpressureBuffer();
But how do we publish from that?
The answer was to expose the sink as a Flux via asFlux():
Flux<String> out = Flux
    .from(sink.asFlux()
        .log(log.getName()));
We also use that for cancellation and termination of the flux:
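sink.asFlux().doOnCancel(() -> {
    cancelSink(id, request);
});
/* Handle errors, eviction, expiration */
sink.asFlux().doOnTerminate(() -> {
    disposeSink(id);
});
UPDATE: The cancel and terminate hooks don't appear to work, per this question. A likely cause is that operators such as doOnCancel() and doOnTerminate() return a new Flux, so hooks attached to a sink.asFlux() result that is then discarded never run; they need to be part of the chain that is actually subscribed.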
I'm trying to understand Beam/Dataflow concepts better, so pretend I have the following streaming pipeline:
pipeline
    .apply(PubsubIO.readStrings().fromSubscription("some-subscription"))
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String message = c.element();
            LOGGER.debug("Got message: {}", message);
            c.output(message);
        }
    }));
How often will the unbounded source pull messages from the subscription? Is this configurable at all (potentially based on windows/triggers)?
Since no custom windowing/triggers have been defined, and there are no sinks (just a ParDo that logs + re-outputs the message), will my ParDo still be executed immediately as messages are received, and is that setup problematic in any way (not having any windows/triggers/sinks defined)?
It will pull messages from the subscription continuously - as soon as a message arrives, it will be processed immediately (modulo network and RPC latency).
Windowing and triggers do not affect this at all - they only affect how the data gets grouped at grouping operations (GroupByKey and Combine). If your pipeline doesn't have grouping operations, windowing and triggers are basically a no-op.
The Beam model does not have the concept of a sink - writing to various storage systems (e.g. writing files, writing to BigQuery etc) is implemented as regular Beam composite transforms, made of ParDo and GroupByKey like anything else. E.g. writing each element to its own file could be implemented by a ParDo whose @ProcessElement opens the file, writes the element to it and closes the file.
Background
We have a pipeline that starts by receiving messages from PubSub, each with the name of a file. These files are exploded to line level, parsed to JSON object nodes and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to Table Rows and written to BigQuery.
It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a backlog when many messages are sent at once. This means that lines associated with a PubSub message can take some time to arrive at the decoding service. As a result, PubSub was receiving no acknowledgement and resending the message. My first attempt to remedy this was adding an attribute to each PubSub message that is passed to the Reader using withAttributeId(). However, on testing, this only prevented duplicates that arrived close together.
My second attempt was to add a fusion breaker (example) after the PubSub read. This simply performs a needless GroupByKey and then ungroups, the idea being that the GroupByKey forces Dataflow to acknowledge the PubSub message.
The Problem
The fusion breaker discussed above works in that it prevents PubSub from resending messages, but I am finding that this GroupByKey outputs more elements than it receives: See image.
To try and diagnose this I have removed parts of the pipeline to get a simple pipeline that still exhibits this behavior. The behavior remains even when
PubSub is replaced by some dummy transforms that send out a fixed list of messages with a slight delay between each one.
The Writing transforms are removed.
All Side Inputs/Outputs are removed.
The behavior I have observed is:
Some number of the received messages pass straight through the GroupByKey.
After a certain point, messages are 'held' by the GroupByKey (presumably due to the backlog after the GroupByKey).
These messages eventually exit the GroupByKey (in groups of size one).
After a short delay (about 3 minutes), the same messages exit the GroupByKey again (still in groups of size one). This may happen several times (I suspect it is proportional to the time they spend waiting to enter the GroupByKey).
Example job id is 2017-10-11_03_50_42-6097948956276262224. I have not run the pipeline on any other runner.
The Fusion Breaker is below:
@Slf4j
public class FusionBreaker<T> extends PTransform<PCollection<T>, PCollection<T>> {
    @Override
    public PCollection<T> expand(PCollection<T> input) {
        return group(window(input.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break in")))))
            .apply("Getting iterables after breaking fusion", Values.create())
            .apply("Flattening iterables after breaking fusion", Flatten.iterables())
            .apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break out")));
    }
    private PCollection<T> window(PCollection<T> input) {
        return input.apply("Windowing before breaking fusion", Window.<T>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .discardingFiredPanes());
    }
    private PCollection<KV<Integer, Iterable<T>>> group(PCollection<T> input) {
        return input.apply("Keying with random number", ParDo.of(new RandomKeyFn<>()))
            .apply("Grouping by key to break fusion", GroupByKey.create());
    }
    private static class RandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
        private Random random;
        @Setup
        public void setup() {
            random = new Random();
        }
        @ProcessElement
        public void processElement(ProcessContext context) {
            context.output(KV.of(random.nextInt(), context.element()));
        }
    }
}
The PassthroughLoggers simply log the elements passing through (I use these to confirm that elements are indeed repeated, rather than there being an issue with the counts).
I suspect this is something to do with windows/triggers, but my understanding is that elements should never be repeated when .discardingFiredPanes() is used - regardless of the windowing setup. I have also tried FixedWindows with no success.
First, the Reshuffle transform is equivalent to your Fusion Breaker, but has some additional performance improvements that should make it preferable.
Second, both counters and logging may see an element multiple times if it is retried. As described in the Beam Execution Model, an element at a step may be retried if anything that is fused into it is retried.
Have you actually observed duplicates in what is written as the output of the pipeline?
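To make the point about retries concrete, here is a toy model in plain Java of two fused steps being retried as a unit. It is an illustration only, not Beam or Dataflow code: the class FusedRetryDemo, the step names and the simulated failure are all invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of retrying a fused stage; not how a real runner is implemented.
class FusedRetryDemo {
    static int logCalls = 0;                 // side effect: counts "log" invocations
    static boolean downstreamFailed = false; // whether the simulated failure has happened

    // Upstream step: "logs" and passes the element through.
    static String logStep(String element) {
        logCalls++;
        return element;
    }

    // Downstream step fused with logStep: fails on its first invocation only.
    static String slowStep(String element) {
        if (!downstreamFailed) {
            downstreamFailed = true;
            throw new IllegalStateException("simulated transient failure");
        }
        return element.toUpperCase();
    }

    // Runs both steps as one unit; any failure retries the whole unit,
    // and output is kept only from the attempt that succeeds.
    static List<String> runFused(List<String> bundle) {
        while (true) {
            List<String> attemptOutput = new ArrayList<>();
            try {
                for (String element : bundle) {
                    attemptOutput.add(slowStep(logStep(element)));
                }
                return attemptOutput;
            } catch (IllegalStateException e) {
                // Discard partial output and retry the bundle from the start.
            }
        }
    }

    public static void main(String[] args) {
        List<String> committed = runFused(List.of("msg-1"));
        System.out.println(committed); // prints "[MSG-1]"
        System.out.println(logCalls);  // prints "2"
    }
}
```

The element is logged twice but appears once in the committed output, which is why log lines and counters alone are not evidence of duplicated pipeline output.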
I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow. I use a fixed window and write the aggregates to BigQuery.
Reading and writing (without windowing and aggregation) works fine. But when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered. And thus the aggregates are not written.
Here is my word publisher (it uses kinglear.txt from the examples as input file):
public static class AddCurrentTimestampFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
    }
}
public static class ExtractWordsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] words = c.element().split("[^a-zA-Z']+");
        for (String word : words) {
            if (!word.isEmpty()) {
                c.output(word);
            }
        }
    }
}
// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
    .apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
    .apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
    .apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();
Here is my windowed word counter:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
    .withSchema(o.getSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
Window.Bound<String> w = Window
    .<String>into(FixedWindows.of(Duration.standardSeconds(1)));
p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
    .apply("FixedWindow", w)
    .apply("CountWords", Count.<String>perElement())
    .apply("CreateRows", ParDo.of(new WordCountToRowFn()))
    .apply("WriteRows", tablePipe);
p.run();
The above subscriber will not work, since the window does not seem to trigger using the default trigger. However, if I manually define a trigger the code works and the counts are written to BigQuery.
Window.Bound<String> w = Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes();
I would like to avoid specifying custom triggers if possible.
Questions:
Why does my solution not work with Dataflow's default trigger?
How do I have to change my publisher or subscriber to trigger windows using the default trigger?
How are you determining the trigger never fires?
Your PubsubIO.Write and PubsubIO.Read transforms should both specify a timestamp label using withTimestampLabel; otherwise the timestamps you've added will not be written to PubSub and the publish times will be used instead.
Either way, the input watermark of the pipeline will be derived from the timestamps of the elements waiting in PubSub. Once all inputs have been processed, it will stay back for a few minutes (in case there was a delay in the publisher) before advancing to real time.
What you are likely seeing is that all the elements are published in the same ~1 second window (since the input file is pretty small). These are all read and processed relatively quickly, but the 1-second window they are put in will not trigger until after the input watermark has advanced, indicating that all data in that 1-second window has been consumed.
This won't happen for several minutes, which may make it look like the trigger isn't working. The trigger you wrote fires after 1 second of processing time, which is much earlier, but at that point there is no guarantee that all the data has been processed.
Steps to get better behavior from the default trigger:
Use withTimestampLabel on both the write and read pubsub steps.
Have the publisher spread the timestamps out further (e.g., run for several minutes and spread the timestamps out across that range).
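To illustrate why the default trigger appears stuck, here is a toy plain-Java model of event-time fixed windows that fire only once the watermark has passed the end of the window. It is a sketch of the behaviour described above, not Beam or Dataflow code: the class WatermarkWindowDemo, its methods and the sample timestamps are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of fixed windows with watermark-based firing; not Beam code.
class WatermarkWindowDemo {
    static final long WINDOW_MS = 1000; // mirrors the 1-second FixedWindows in the question

    // Start of the fixed window containing a non-negative event timestamp.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Counts buffered elements per window start.
    static Map<Long, Integer> assign(List<Long> timestampsMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    // Windows whose end is at or before the watermark are complete and may fire.
    static List<Long> firedWindows(Map<Long, Integer> counts, long watermarkMs) {
        List<Long> fired = new ArrayList<>();
        for (long start : counts.keySet()) {
            if (start + WINDOW_MS <= watermarkMs) {
                fired.add(start);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // All elements land in the same second, as with a small input file.
        Map<Long, Integer> counts = assign(List.of(5_000L, 5_200L, 5_900L));
        System.out.println(counts);                      // prints "{5000=3}"
        System.out.println(firedWindows(counts, 5_950)); // prints "[]"
        System.out.println(firedWindows(counts, 6_000)); // prints "[5000]"
    }
}
```

In this model nothing is emitted while the watermark sits at 5950, even though all three elements have already been counted. Output appears only once the watermark reaches the end of the window, which in the real pipeline can lag real time by several minutes.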