Reactor flux throws IllegalArgumentException - suspected due to bufferTimeout - project-reactor

I have a Spring application which builds a reactive pipeline as follows:
buildPipeline() // returns a Flux based on changeStreamEvents or Kafka receives
    .bufferTimeout(capacity, Duration.ofSeconds(1))
    .flatMap(r -> {
        Element x = r.get(r.size() - 1);
        // some processing on the element and the batch obtained,
        // returning a Publisher from the lambda
    })
    .doOnError(e -> log.info("error occurred: " + e.toString()))
    .subscribe();
However, I see my application intermittently throwing the below error -
java.lang.IllegalArgumentException: 3.9 While the Subscription is not cancelled, Subscription.request(long n) MUST throw a java.lang.IllegalArgumentException if argument <= 0
at com.mongodb.reactivestreams.client.internal.ObservableToPublisher$1$1.request(ObservableToPublisher.java:43)
at reactor.core.publisher.FluxMap$MapSubscriber.request(FluxMap.java:155)
at reactor.core.publisher.FluxBufferTimeout$BufferTimeoutSubscriber.requestMore(FluxBufferTimeout.java:317)
I'm not able to determine what is wrong, and why the stream is terminating with this error.
Any help would be highly appreciated.
The application started throwing this error after I added "bufferTimeout" to introduce batching; before that, I had never encountered this exception.
I'm also not sure how to reproduce the issue, as it does not occur locally or in UAT, only in the production environment of the application.
Any leads would be helpful.
Thanks!

Try adding an onBackpressureBuffer(), so that in case of low demand this operator buffers the emitted elements and delivers them downstream in a controlled way.
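A minimal sketch of where the operator could go, assuming the same pipeline shape as in the question (buildPipeline, capacity, and the processing step are the question's placeholders):
buildPipeline()
    .onBackpressureBuffer()   // buffer elements while downstream demand is low
    .bufferTimeout(capacity, Duration.ofSeconds(1))
    // ... rest of the pipeline unchanged ...
    .subscribe();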

Related

Is Mono.doFinally sufficient to handle release/cleanup?

I am trying to synchronize a resource with Spring WebClient:
this.semaphore.acquire();
webClient
    .post()
    .uri("/a")
    .bodyValue(payload)
    .retrieve()
    .bodyToMono(String.class)
    // release
    .doFinally(st -> this.semaphore.release())
    .switchIfEmpty(Mono.just("a"))
    .onErrorResume(Exception.class, e -> Mono.empty())
    .doOnNext(response -> { /* ... */ })
    .subscribe();
Is doFinally sufficient to handle the release?
If not, what are the "escape" points?
This will clean up your resources if your mono is cancelled, completes, or errors out, which are all the ways in which a mono can end.
However, a Mono does not necessarily have to end, and in that case the doFinally hook will never be executed.
So it depends on how your WebClient is configured for cases where the external API fails to respond: normally there should be a timeout and a maximum number of retries, and in that case your code should be correct.
NOTE: the release may not happen on the same thread as the acquire. Depending on the resource, this might actually be a problem. For example, a ReentrantReadWriteLock must be released by the thread that acquired it. I do not know if this problem exists with your semaphore.
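If you want the acquire and the release tied to the subscription itself (so nothing is held if the Mono is never subscribed, and the release happens on complete, error, and cancel alike), Mono.usingWhen is one alternative. A minimal sketch, assuming the same WebClient call as above; wrapping the blocking acquire() in fromCallable/boundedElastic is an illustrative shortcut, not part of the original code:
Mono<String> result = Mono.usingWhen(
        // acquire the permit only when the Mono is actually subscribed
        Mono.fromCallable(() -> { semaphore.acquire(); return semaphore; })
            .subscribeOn(Schedulers.boundedElastic()),
        // do the work while the permit is held
        sem -> webClient.post()
                .uri("/a")
                .bodyValue(payload)
                .retrieve()
                .bodyToMono(String.class),
        // release on completion, error, and cancellation
        sem -> Mono.fromRunnable(sem::release));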

Impossible (?) NullPointerException - Springframework RabbitMQ, Failed to invoke afterAckCallback

I'm running a Java application that uses RabbitMQ Server 3.8.9, spring-amqp-2.2.10.RELEASE, and spring-rabbit-2.2.10.RELEASE.
My test case does something like the following:
Start the RabbitMQ Server
Start my Java application
Test and validate some functionality on my Java application
Gracefully stop my Java application
Gracefully stop the RabbitMQ Server
Repeat steps 1-5 a few more times
Everything looks fine except sometimes during one of the restarts about 10 minutes into it, I see the following error in my application's logs:
2021-02-05 12:52:46.498 UTC,ERROR,org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl,null,rabbitConnectionFactory23,runWorker():1149,Failed to invoke afterAckCallback
java.lang.NullPointerException: null
at org.springframework.amqp.rabbit.connection.PublisherCallbackChannelImpl.lambda$doHandleConfirm$1(PublisherCallbackChannelImpl.java:1027) ~[spring-rabbit.jar:2.2.10.RELEASE]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
Further analysis doesn't point to anything specific. There are no errors in the RabbitMQ log files, no restarts of the RabbitMQ server, nothing weird in the RabbitMQ logs during the time stamp above.
The code in question:
https://github.com/spring-projects/spring-amqp/blob/v2.2.10.RELEASE/spring-rabbit/src/main/java/org/springframework/amqp/rabbit/connection/PublisherCallbackChannelImpl.java#L1027
My tests are automated and run as part of a CI pipeline. The issue is intermittent and I have had trouble reproducing it locally in my sandbox.
From what I can tell, the functionality of my Java application is unaffected.
Code that creates the RabbitMQ connection factory used everywhere:
final CachingConnectionFactory connectionFactory = new CachingConnectionFactory(HOST_NAME);
connectionFactory.setChannelCacheSize(1);
connectionFactory.setPublisherConfirms(true);
It seems like a concurrency problem, but I'm not so sure on how to get to the bottom of it. For the most part, we use the RabbitTemplate and other Spring facilities to connect to RabbitMQ.
Anyone in the Spring world with some knowledge in RabbitMQ care to chime in?
Thanks
The code you are referring to looks like this:
finally {
    try {
        if (this.afterAckCallback != null && getPendingConfirmsCount() == 0) {
            this.afterAckCallback.accept(this);
            this.afterAckCallback = null;
        }
    }
    catch (Exception e) {
        this.logger.error("Failed to invoke afterAckCallback", e);
    }
}
There really could be a race condition around that this.afterAckCallback property.
We may pass the if() check in one thread, but then a different thread sets this.afterAckCallback to null, so we fail with that NPE.
We have to copy its value into a local variable, then check the copy and call accept() on it (see the sketch below).
Feel free to raise a GitHub issue against Spring AMQP project: https://github.com/spring-projects/spring-amqp/issues
We have this race condition because doHandleConfirm(), with its async logic, is called from the loop in processMultipleAck().
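A minimal sketch of the kind of change described, assuming the field is a java.util.function.Consumer<Channel> as in the linked source:
finally {
    try {
        // Copy to a local variable so a concurrent null-out between the
        // check and the call cannot produce an NPE.
        Consumer<Channel> callback = this.afterAckCallback;
        if (callback != null && getPendingConfirmsCount() == 0) {
            callback.accept(this);
            this.afterAckCallback = null;
        }
    }
    catch (Exception e) {
        this.logger.error("Failed to invoke afterAckCallback", e);
    }
}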

Processing stuck in step Write mutations to Cloud Spanner to Spanner without outputting or completing in state process

Processing stuck in step Sink-Spanner/Write mutations to Cloud Spanner/Write mutations to Spanner for at least 05m00s without outputting or completing in state process
Write to Spanner is happening, but throwing this error
PCollectionTuple spanMutTuple = pColIfEnrichMsgs.apply("CreateMutation",
ParDo.of(new SpannerMutation(options, ttSpanMutOutMsgs, erroMessage))
.withOutputTags(ttSpanMutOutMsgs, TupleTagList.of(erroMessage))
);
/*...*/
pColSpanMut.apply("Sink-Spanner",
SpannerIO.write()
.withInstanceId(options.getOutputSpannerInstanceId())
.withDatabaseId(options.getOutputSpannerDatabaseId())
.withMaxNumMutations(options.getOutputSpannerMaxMutations().get())
.withBatchSizeBytes(options.getOutputSpannerBatchSizeBytes().get() * 1048576)
.withFailureMode(FailureMode.REPORT_FAILURES)
.withProjectId(options.getOutputSpannerProjectId().get())
);
Expected: No warning/errors like it is reporting in google cloud dataflow UI.
Processing stuck in step Sink-Spanner/Write mutations to Cloud Spanner/Write mutations to Spanner for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
at com.google.api.core.AbstractApiFuture.get(AbstractApiFuture.java:56)
at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.get(GapicSpannerRpc.java:556)
at com.google.cloud.spanner.spi.v1.GapicSpannerRpc.commit(GapicSpannerRpc.java:528)
at com.google.cloud.spanner.SpannerImpl$SessionImpl$2.call(SpannerImpl.java:822)
at com.google.cloud.spanner.SpannerImpl$SessionImpl$2.call(SpannerImpl.java:819)
at com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:251)
at com.google.cloud.spanner.SpannerImpl$SessionImpl.writeAtLeastOnce(SpannerImpl.java:818)
at com.google.cloud.spanner.SessionPool$PooledSession.writeAtLeastOnce(SessionPool.java:329)
at com.google.cloud.spanner.DatabaseClientImpl.writeAtLeastOnce(DatabaseClientImpl.java:59)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteToSpannerFn.processElement(SpannerIO.java:1243)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteToSpannerFn$DoFnInvoker.invokeProcessElement(Unknown Source)
Thank you for reporting; the connector should of course do a better job of surfacing this error, and I will follow up with an internal ticket. In the meantime, are you unblocked? If so, was it perhaps due to too-high values in withBatchSizeBytes() and/or withMaxNumMutations()?
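If oversized batches turn out to be the cause, a hedged sketch of more conservative settings (the concrete numbers are illustrative, not recommendations from this answer):
pColSpanMut.apply("Sink-Spanner",
        SpannerIO.write()
                .withInstanceId(options.getOutputSpannerInstanceId())
                .withDatabaseId(options.getOutputSpannerDatabaseId())
                // illustrative, smaller limits than the pipeline options above
                .withMaxNumMutations(1000)
                .withBatchSizeBytes(1024 * 1024) // 1 MB
                .withFailureMode(FailureMode.REPORT_FAILURES)
                .withProjectId(options.getOutputSpannerProjectId().get()));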

Dataflow pipeline is dropping events during processing when using outputWithTimestamp

I have a Cloud Dataflow pipeline in which I alter the original timestamp for the event in order to simulate real world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
The problem is most likely caused by your DoFn step outputting a timestamp that is earlier than the timestamp that was received by the processing step minus the allowed timestamp skew. The exception that would be thrown can be found here in the code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, it is also documented that this might cause unpredictable issues with the watermark calculations, so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--
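For completeness, a minimal sketch of what such an override could look like with the old Dataflow 1.x SDK used in the question (MyEvent and the DoFn class name are placeholders), keeping the caveat above in mind:
static class ShiftTimestampFn extends DoFn<MyEvent, KV<String, MyEvent>> {

    private static final Random RANDOM = new Random();

    @Override
    public Duration getAllowedTimestampSkew() {
        // Must cover the maximum backwards shift applied below (15 minutes).
        return Duration.standardMinutes(15);
    }

    @Override
    public void processElement(ProcessContext c) {
        MyEvent element = c.element();
        Instant newTimestamp = c.timestamp().minus(Duration.standardMinutes(RANDOM.nextInt(15)));
        c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
    }
}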

Cloud Dataflow: java.lang.IllegalStateException: no evaluator registered for GroupedValues

I'm getting the following exception when running the pipeline locally. There is no exception when submitting for cloud execution.
Thanks,
Genady
INFO: Executing pipeline using the DirectPipelineRunner.
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for GroupedValues [GroupedValues]
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:606)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:583)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:327)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at app.Main.main(Main.java:124)
The code outline is basically this:
PCollection<KV<MyKey, Iterable<MyValue>>> groupedByMyKey = ...
PCollection<KV<MyKey, MyAggregated>> aggregated = groupedByMyKey.apply(
        Combine.<MyKey, MyValue, MyAggregated>groupedValues(new Aggregator()));
where the Aggregator class extends CombineFn<MyValue, List<MyValue>, MyAggregated>.
Can you share a code snippet that triggers this? GroupedValues is a PTransform that is often used within various combining transforms, so it might be from using something like Min, Max, etc.
The error means that the DirectPipelineRunner doesn't know how to evaluate a GroupedValues. However, that's unexpected, since that should have been expanded into a ParDo before execution.
I found the reason for this behaviour.
I was using a command line argument to run it in remote mode (--runner=BlockingDataflowPipelineRunner) and then forced it to run locally with
PipelineRunner<?> runner = DirectPipelineRunner.fromOptions(options);
runner.run(p);
After removing these lines and just using the --runner=DirectPipelineRunner argument it worked as expected.
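For reference, a minimal sketch of letting the command line flag pick the runner instead of constructing one directly (standard 1.x option parsing; args is assumed to be the main method's arguments):
// Parse --runner=DirectPipelineRunner (or any other runner) from the command line
// and let the SDK instantiate whichever runner was requested.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
// ... build the pipeline ...
p.run();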
