Log flux until mono completes - project-reactor

I have the following:
Flux<String> flux = ...
Mono<Void> mono = ...
Mono<Void> combined = operation(flux, mono);
They represent operations happening in parallel.
Now I would like to print all elements emitted by the flux to sysout until the mono completes. What's the right operator to use here?
I've tried:
final Disposable subscribe = flux.subscribe(System.out::println);
mono.doOnSuccessOrError((o, e) -> subscribe.dispose());
But it feels clumsy; I have a feeling there might be a better way to do this. Is there?

Project Reactor provides several operators for logging emitted results:
doOnNext(Consumer onNext)
subscribe(Consumer consumer)
log()
The code example below shows how they work:
Flux<String> stringFlux = Flux.just("one", "two", "three")
.doOnNext(s -> System.out.println("On next: " + s));
Mono<String> stringMono = Mono.just("four");
stringFlux = stringFlux.concatWith(stringMono)
.map(s -> s + " hundred")
.log();
stringFlux.subscribe(s -> System.out.println("On next subscriber: " + s));
Result:
15:32:18.984 [main] INFO reactor.Flux.Map.1 - onSubscribe(FluxMap.MapSubscriber)
15:32:18.986 [main] INFO reactor.Flux.Map.1 - request(unbounded)
On next: one
15:32:19.005 [main] INFO reactor.Flux.Map.1 - onNext(one hundred)
On next subscriber: one hundred
On next: two
15:32:19.005 [main] INFO reactor.Flux.Map.1 - onNext(two hundred)
On next subscriber: two hundred
On next: three
15:32:19.005 [main] INFO reactor.Flux.Map.1 - onNext(three hundred)
On next subscriber: three hundred
15:32:19.006 [main] INFO reactor.Flux.Map.1 - onNext(four hundred)
On next subscriber: four hundred
15:32:19.007 [main] INFO reactor.Flux.Map.1 - onComplete()
The first message is printed by the doOnNext of stringFlux, then the log() of the concatenated publishers logs the mapped value, and finally the subscriber prints it.
P.S. You can also log other events such as onError and onComplete.
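For instance, a minimal sketch of those hooks (this flux is just an illustration, not the code from the question):
// Each doOnXxx callback observes one signal type without changing the sequence.
Flux.just("one", "two", "three")
    .doOnSubscribe(subscription -> System.out.println("On subscribe"))
    .doOnNext(s -> System.out.println("On next: " + s))
    .doOnError(e -> System.err.println("On error: " + e))
    .doOnComplete(() -> System.out.println("On complete"))
    .subscribe();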

Related

Apache beam seems to be truncating pub sub message payload

We've created a pretty simple pipeline for Pub/Sub event processing. The Pub/Sub message payload itself is tab-separated CSV data.
After the message is read, the payload data is truncated when inflating it back into the event object. Using the direct runner and running locally, the pipeline works end to end.
It's only when running on the Google Cloud Dataflow runner that we see the message data truncated.
// Create the pipeline
Pipeline pipeline = Pipeline.create(options);
LOG.info("Reading from subscription: " + options.getInputSubscription());
//Step #1: Read from a PubSub subscription.
PCollection<PubsubMessage> pubsubMessages = pipeline.apply(
"ReadPubSubSubscription",
PubsubIO.readMessagesWithMessageId()
.fromSubscription(options.getInputSubscription())
);
//Step #2: Transform the PubsubMessages into snowplow events.
PCollection<Event> rawEvents = pubsubMessages.apply(
"ConvertMessageToEvent",
ParDo.of(new PubsubMessageEventFn())
);
// other pipeline functions.....
Here is the conversion function, where for every Pub/Sub message we're falling into the error case. Note that Event.parse() actually comes from a Scala library, but I don't see how that could matter, since the message data itself is what has been truncated between the two stages of the pipeline.
Perhaps there is an encoding issue?
public static class PubsubMessageEventFn extends DoFn<PubsubMessage, Event> {
@ProcessElement
public void processElement(ProcessContext context) {
PubsubMessage message = context.element();
Validated<ParsingError, Event> event = Event.parse(new String(message.getPayload()));
Either<ParsingError, Event> condition = event.toEither();
if (condition.isLeft()) {
ParsingError err = condition.left().get();
LOG.error("Event parsing error: " + err.toString() + " for message: " + new String(message.getPayload()));
} else {
Event e = condition.right().get();
context.output(e);
}
}
}
Here is a sample of the data that is emitted in the log message:
Event parsing error: FieldNumberMismatch(5) for message: 4f6ec25-67a7-4edf-972a-29e80320f67f web 2020-04-14 21:26:40.034 2020-04-14 21:26:39.884 2020-04-1
Note that the Pub/Sub implementation for DirectRunner is different from the implementation in Dataflow Runner as documented here - https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub#integration-features.
I believe the issue is related to encoding: message.getPayload() is of type byte[], so the code may need to be modified to use new String(message.getPayload(), StandardCharsets.UTF_8), as in the line below:
Validated<ParsingError, Event> event = Event.parse(new String(message.getPayload(), StandardCharsets.UTF_8));
Using readMessagesWithAttributesAndMessageId instead of readMessagesWithMessageId is the workaround, according to this bug report: https://issues.apache.org/jira/browse/BEAM-9483.
It does not appear to have been fixed yet.
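For reference, a minimal sketch of the workaround; it is the same read step as in the question, with only the factory method swapped:
// Workaround from BEAM-9483: read attributes along with the message id, which
// avoids the payload truncation seen with readMessagesWithMessageId() on Dataflow.
PCollection<PubsubMessage> pubsubMessages = pipeline.apply(
    "ReadPubSubSubscription",
    PubsubIO.readMessagesWithAttributesAndMessageId()
        .fromSubscription(options.getInputSubscription())
);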

When does reactor execute a subscription chain?

The reactor documentation states the following:
Nothing happens until you subscribe
If that were true, why do I see a java.lang.NullPointerException when I run the following code snippet, which has a Reactor chain without a subscription?
@Test
void test() {
String a = null;
Flux.just(a.toLowerCase())
.doOnNext(System.out::println);
}
Deepak,
"Nothing happens" means that no data will flow through the chain of your functions to your consumers until a subscription happens.
You're getting the NPE because Java evaluates the value given to the hot operator just() at the point where the Flux is defined.
You can convert just() into a cold operator with defer(), so the NPE is only raised after a subscription happens:
public Flux<String> test() {
String a = null;
return Flux.defer(() -> Flux.just(a.toLowerCase()))
.doOnNext(System.out::println);
}
Please read more about hot vs cold publishers.
Update:
A small example of cold and hot publishers: each time a new subscription happens, the cold publisher's body is re-evaluated, whereas just() keeps emitting the time that was captured only once, at definition time.
Mono<Date> currentTime = Mono.just(Calendar.getInstance().getTime());
Mono<Date> realCurrentTime = Mono.defer(() -> Mono.just(Calendar.getInstance().getTime()));
// 1 sec sleep
Thread.sleep(1000);
currentTime.subscribe(time -> System.out.println("Current Time " + time.getTime()));
realCurrentTime.subscribe(time -> System.out.println("Real current Time " + time.getTime()));
Thread.sleep(2000);
currentTime.subscribe(time -> System.out.println("Current Time " + time.getTime()));
realCurrentTime.subscribe(time -> System.out.println("Real current Time " + time.getTime()));
The output is:
Current Time 1583788755759
Real current Time 1583788756826
Current Time 1583788755759
Real current Time 1583788758833

Reactor Netty - how to send with delayed Flux

In Reactor Netty, when sending data to TCP channel via out.send(publisher), one would expect any publisher to work. However, if instead of a simple immediate Flux we use a more complex one with delayed elements, then it stops working properly.
For example, if we take this hello world TCP echo server, it works as expected:
import reactor.core.publisher.Flux;
import reactor.netty.DisposableServer;
import reactor.netty.tcp.TcpServer;
import java.time.Duration;
public class Reactor1 {
public static void main(String[] args) throws Exception {
DisposableServer server = TcpServer.create()
.port(3344)
.handle((in, out) -> in
.receive()
.asString()
.flatMap(s ->
out.sendString(Flux.just(s.toUpperCase()))
))
.bind()
.block();
server.channel().closeFuture().sync();
}
}
However, if we change out.sendString to
out.sendString(Flux.just(s.toUpperCase()).delayElements(Duration.ofSeconds(1)))
then we would expect that for each received item an output will be produced with one second delay.
However, the way the server behaves is that if it receives multiple items during the interval, it produces output only for the first item. For example, below we type aa and bb during the first second, but only AA is produced as output (after one second):
$ nc localhost 3344
aa
bb
AA <after one second>
Then, if we later type an additional line, we get output (after one second), but for the previous input:
cc
BB <after one second>
Any ideas how to make send() work as expected with a delayed Flux?
I think you shouldn't recreate the publisher passed to out.sendString(...) for every received item.
This works:
DisposableServer server = TcpServer.create()
.port(3344)
.handle((in, out) -> out
.options(NettyPipeline.SendOptions::flushOnEach)
.sendString(in.receive()
.asString()
.map(String::toUpperCase)
.delayElements(Duration.ofSeconds(1))))
.bind()
.block();
server.channel().closeFuture().sync();
Try to use concatMap. This works:
DisposableServer server = TcpServer.create()
.port(3344)
.handle((in, out) -> in
.receive()
.asString()
.concatMap(s ->
out.sendString(Flux.just(s.toUpperCase())
.delayElements(Duration.ofSeconds(1)))
))
.bind()
.block();
server.channel().closeFuture().sync();
Alternatively, delaying on the incoming traffic:
DisposableServer server = TcpServer.create()
.port(3344)
.handle((in, out) -> in
.receive()
.asString()
.timestamp()
.delayElements(Duration.ofSeconds(1))
.concatMap(tuple2 ->
out.sendString(
Flux.just(tuple2.getT2().toUpperCase() +
" " +
(System.currentTimeMillis() - tuple2.getT1())
))
))
.bind()
.block();

Odd behaviour of SlidingWindows when used with TestPipeline

I've got a simple test that demonstrates an odd behaviour of sliding windows when used with TestPipeline. Basically, a bunch of strings is fed to the input, they get accumulated in the sliding window, the sum aggregation is applied to count the duplicates, and finally the output of the aggregation function is logged. With a sliding window of 10 minutes duration and a 5 minute period, I expected only one window to be used to store all the elements (as the new one starts 5 minutes after the first one)...
public class SlidingWindowTest {
private static PipelineOptions options = PipelineOptionsFactory.create();
private static final Logger LOG = LoggerFactory.getLogger(SlidingWindowTest.class);
private static class IdentityDoFn extends DoFn<KV<String, Integer>, KV<String, Integer>>
implements DoFn.RequiresWindowAccess {
@Override
public void processElement(ProcessContext processContext) throws Exception {
KV<String, Integer> item = processContext.element();
LOG.info("~~~~~~~~~~> {} => {}", item.getKey(), item.getValue());
LOG.info("~~~~~~~~~~~ {}", processContext.window());
processContext.output(item);
}
}
@Test
public void whatsWrongWithSlidingWindow() {
Pipeline p = TestPipeline.create(options);
p.apply(Create.of("cab", "abc", "a1b2c3", "abc", "a1b2c3"))
.apply(MapElements.via((String item) -> KV.of(item, 1))
.withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
.apply(Window.<KV<String, Integer>>into(SlidingWindows.of(Duration.standardMinutes(10))
.every(Duration.standardMinutes(5))))
.apply(Sum.integersPerKey())
.apply(ParDo.of(new IdentityDoFn()));
p.run();
}
}
But I got 8 windows being fired instead. Is there something wrong with TestPipeline or with my understanding of how sliding windows are supposed to work?
12:19:04.566 [main] DEBUG c.g.c.d.sdk.coders.CoderRegistry - Default coder for com.google.cloud.dataflow.sdk.values.KV<java.lang.String, java.lang.Integer>: KvCoder(StringUtf8Coder, VarIntCoder)
12:19:04.566 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> abc => 2
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T19:50:00.000Z..-290308-12-21T20:00:00.000Z)
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> abc => 2
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T19:55:00.000Z..-290308-12-21T20:05:00.000Z)
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> a1b2c3 => 2
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T20:00:00.000Z..-290308-12-21T20:10:00.000Z)
12:19:04.567 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> cab => 1
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T19:50:00.000Z..-290308-12-21T20:00:00.000Z)
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> a1b2c3 => 2
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T19:50:00.000Z..-290308-12-21T20:00:00.000Z)
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> cab => 1
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T19:55:00.000Z..-290308-12-21T20:05:00.000Z)
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> abc => 2
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T20:00:00.000Z..-290308-12-21T20:10:00.000Z)
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~> cab => 1
12:19:04.568 [main] INFO c.q.m.core.SlidingWindowTest - ~~~~~~~~~~~ [-290308-12-21T20:00:00.000Z..-290308-12-21T20:10:00.000Z)
P.S. Dataflow SDK version: 1.8.0
The expected behavior is different from what you observe, but also different from what you expect:
First, you have three different keys, so if they all fell into a single window, then you would expect three outputs.
For sliding windows of 10 minutes with a 5 minute period, every element necessarily falls into two windows. If an element arrives at minute 1 it falls into both the window from 0 to 10 but also the window from -5 to 5. So you should expect six output values, two per key. It is a common pitfall to think of windows as something that updates as a pipeline runs, when in fact they are simply calculated properties of the input data, not a property of its arrival time or the pipeline's execution.
The Create transform will output all values with a timestamp of BoundedWindow.TIMESTAMP_MIN_VALUE so they should all fall into the same two windows.
Your example seems to indicate a real bug. It should not be possible for "a1b2c3" to be in the two disjoint windows that it falls in, nor for "abc" to fall into three windows, two of which are disjoint.
Incidentally, though, you would benefit from checking out DataflowAssert (called PAssert now in Beam) for testing the contents of a PCollection in a consistent and cross-runner way.
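A minimal sketch of such an assertion (not from the original test), assuming Beam's PAssert and a sums variable holding the PCollection produced by Sum.integersPerKey() above:
// Sketch only: assumes org.apache.beam.sdk.testing.PAssert (the Dataflow SDK 1.x
// equivalent is DataflowAssert) and that `sums` is the PCollection<KV<String, Integer>>
// produced by Sum.integersPerKey(). Three keys, each summed in two overlapping
// 10-minute windows, gives six values in total.
PAssert.that(sums).containsInAnyOrder(
    KV.of("cab", 1), KV.of("cab", 1),
    KV.of("abc", 2), KV.of("abc", 2),
    KV.of("a1b2c3", 2), KV.of("a1b2c3", 2));
p.run();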

I would like to know if Log4j2 has been built with that in mind or if things may (crash | lose logs | etc) at very high concurrency

This question has been asked for log4j but not log4j2: Is it safe to use the same log file by two different appenders
Technically, you can create multiple appenders in Log4j2 that write to the same file. This seems to work well.
Here's my OS / JDK:
Oracle JDK 7u45
Ubuntu LTE 14.04
Here's a sample configuration (in YAML):
Configuration:
  status: debug
  Appenders:
    RandomAccessFile:
      - name: TestA
        fileName: logs/TEST.log
        PatternLayout:
          Pattern: "%msg%n"
      - name: TestB
        fileName: logs/TEST.log
        PatternLayout:
          Pattern: "%msg%n"
  Loggers:
    Root:
      level: trace
      AppenderRef:
        - ref: TestA
        - ref: TestB
My Java sample:
final Logger root = LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
root.trace("!!! Trace World !!!");
root.debug("!!! Debug World !!!");
root.info("!!! Info World !!!");
root.warn("!!! Warn World !!!");
root.error("!!! Error World !!!");
My log file result:
20150513T112956,819 TRACE "!!! Trace World !!!"
20150513T112956,819 TRACE "!!! Trace World !!!"
20150513T112956,819 DEBUG "!!! Debug World !!!"
20150513T112956,819 DEBUG "!!! Debug World !!!"
20150513T112956,819 INFO "!!! Info World !!!"
20150513T112956,819 INFO "!!! Info World !!!"
20150513T112956,819 WARN "!!! Warn World !!!"
20150513T112956,819 WARN "!!! Warn World !!!"
20150513T112956,819 ERROR "!!! Error World !!!"
20150513T112956,819 ERROR "!!! Error World !!!"
I would like to know if Log4j2 has been built with that in mind or if things may (crash | lose logs | etc) at very high concurrency.
UPDATE:
I ran the benchmark test below and no log messages went missing. Still, I'm not sure this test fully settles the question:
public class Benchmark {
private static final int nbThreads = 32;
private static final int iterations = 10000;
static List<BenchmarkThread> benchmarkThreadList = new ArrayList<>(nbThreads);
private static Logger root;
static {
System.setProperty("Log4jContextSelector", "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
}
public static void main(String[] args) throws InterruptedException {
root = LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
// Create BenchmarkThreads
for (int i = 1; i <= nbThreads; i++) {
benchmarkThreadList.add(new BenchmarkThread("T" + i, iterations));
}
root.error("-----------------------------------------------------------");
root.error("---------------------- WARMUP ---------------------------");
root.error("-----------------------------------------------------------");
// Warmup loggers
doBenchmark("WARMUP", 100);
Thread.sleep(100);
root.error("-----------------------------------------------------------");
root.error("--------------------- BENCHMARK -------------------------");
root.error("-----------------------------------------------------------");
// Execute Benchmark
for (int i = 0; i < nbThreads; i++) {
benchmarkThreadList.get(i).start();
}
Thread.sleep(100);
root.error("-----------------------------------------------------------");
root.error("---------------------- FINISHED -------------------------");
root.error("-----------------------------------------------------------");
}
protected static void doBenchmark(String name, int iteration) {
for (int i = 1; i <= iteration; i++) {
root.error("{};{}", name, i);
}
}
protected static class BenchmarkThread extends Thread {
protected final int iteration;
protected final String name;
public BenchmarkThread(String name, int iteration) {
this.name = name;
this.iteration = iteration;
}
@Override
public void run() {
Benchmark.doBenchmark(name, iteration);
}
}
}
UPDATE:
I did not realize that you are already using Async Loggers. In that case this is indeed a log4j2 question. :-)
The answer is yes, log4j2 and especially Async Loggers are designed with very high concurrency in mind. Multiple loggers in multiple threads can log concurrently, and the resulting log messages are put on a lock-free queue for later processing by the background thread. There is a single background thread that calls all appenders sequentially, so even if multiple appenders write to the same file, no messages are dropped and messages from each logger thread are written fully before the next message is written (no partial writes).
In case of a crash, messages that were in the queue but have not been flushed to disk yet may be lost. This is a trade-off for performance and is the case with all asynchronous logging.
If you are logging synchronously (e.g. without using Async Loggers) it becomes a question of what file I/O atomicity guarantees the JVM and OS make.
PREVIOUS ANSWER:
This is more of a JVM/OS question than a log4j2 question. Rephrased: if multiple threads concurrently write to the same file, will the resulting file contain all messages (nothing is lost, and all messages are complete and correct)? (You may want to specify your JVM vendor and version and OS name and version.)
If you are looking for a safe log4j2 configuration, consider using Async Loggers.
With Async Loggers all appenders are called sequentially in the single shared background thread, so you are sure no corruption will occur. In addition you get nice performance benefits.
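For completeness, a minimal sketch of enabling all-async loggers; this is the same context selector the benchmark above already sets, and it additionally requires the LMAX Disruptor on the classpath:
// Select the all-async logger context before Log4j2 initializes, either programmatically:
System.setProperty("Log4jContextSelector",
    "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
// ...or as a JVM argument:
// -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector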
