Error handling in a data pipeline using Project Reactor

I'm writing a data pipeline using Reactor and Reactor Kafka, and I use Spring's Message<> to save
the ReceiverOffset of the ReceiverRecord in the headers, so I can call ReceiverOffset.acknowledge() when processing finishes. I also have the out-of-order commit feature enabled.
When processing an event fails, I want to be able to log the error, write to another topic that represents all the failure events, and commit to the source topic. I'm currently solving that by returning Either<Message<Error>,Message<MyPojo>> from each processing stage; that way the stream is not stopped by exceptions, and I'm able to keep the original event headers and eventually commit the failed messages at the bottom of the pipeline.
The problem is that each step of the pipeline gets Either<> as input and needs to filter out the previous errors and apply its logic only to the Either.right, which gets cumbersome, especially when working with buffers and the operator gets List<Either<>> as input. So I would like to keep my business pipeline clean and receive only Message<MyPojo> as input, while still not missing errors that need to be handled.
I read that sending those error messages to another channel or stream is a solution for that.
Spring Integration uses that pattern for error handling, and I also read an article (link to article) that solves this problem in Akka Streams using divertTo().
I couldn't find documentation or code examples of how to implement that in Reactor.
Is there any way to use the Spring Integration error channel with Reactor, or any other ideas for implementing this?

Not familiar with Reactor per se, but you can keep the stream linear. The trick, since Vavr's Either is right-biased, is to use flatMap, which takes a function from Message<MyPojo> to Either<Message<Error>, Message<MyPojo>>. If the incoming Either is a right (i.e. a Message<MyPojo>), the function gets invoked; otherwise the left just gets passed through.
// Apologies if the Java is atrocious... haven't written Java since pre-Java 8
incomingEither.flatMap(
    myPojoMessage -> ... // compute a new Either<Message<Error>, Message<MyPojo>> from the right value
)
Presumably at some point you want to do something (publish to a dead-letter topic, tickle metrics, whatever) with the Message<Error> case, so for that, orElseRun will come in handy.
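To make that concrete, here is a minimal, hedged sketch of how the two pieces could fit together with Vavr's Either; MyPojo, process(...) and publishToErrorTopic(...) are hypothetical stand-ins for your own types and stages:

import io.vavr.control.Either;
import org.springframework.messaging.Message;

// Sketch only: process(...) computes the next stage, publishToErrorTopic(...) handles failures.
Either<Message<Error>, Message<MyPojo>> handle(Either<Message<Error>, Message<MyPojo>> incomingEither) {
    // flatMap only runs on a right; a left (error) short-circuits straight through unchanged.
    Either<Message<Error>, Message<MyPojo>> result = incomingEither.flatMap(this::process);
    // orElseRun only fires for the left case, so the happy path stays untouched.
    result.orElseRun(errorMessage -> publishToErrorTopic(errorMessage));
    return result;
}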

Related

Best practice to detect broken log sinks

I am trying to replace the in-house-written logger solution of one of my customers. Pretty much everything is straightforward, but I need to implement one sink that sends the logs to a custom log window that I cannot change (for now). It communicates using named pipes. This pipe may be broken or busy, so the current solution actually blocks on every log call, which I want to improve.
The question is what the best practice is when using Serilog: what's the best way to tell Serilog that the sink is currently broken, so it does not slow down the system? Is throwing an exception enough?
Serilog itself doesn't know (or care) when a sink is broken or not, so I'm not sure I understand your goal.
Writing to a Serilog logger is supposed to be a safe operation by design, so any exceptions that happen in your sink will automatically be caught by Serilog to make sure the app doesn't crash. Serilog makes sure these exceptions are written to the SelfLog, which developers can use to troubleshoot sink issues. See an example here.
Therefore, if your goal is to have a way that a developer can see when the sink experienced problems, the recommendation is to write error messages to the SelfLog and throw your own exceptions from within your sink.
If you can detect from within your sink, without blocking, that the named pipe is not available, then just write to SelfLog and return/short-circuit without trying to write to the pipe. It's really up to you to implement any kind of resilience policy from within your sink.
If your goal is to improve the blocking calls, you might want to consider making your sink asynchronous, with the messages sent on a separate thread, without blocking the main thread of the app.
Given you're implementing your own custom sink, an easy way to do that is to turn your sink into a Periodic Batching sink and leverage the infrastructure it provides. Alternatively, you can use Serilog.Sinks.Async wrapper sink.

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that, since the data is shuffled by device_id and then grouped at the end by file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger to say that all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side input or some special value on the main input). It could then count the number of elements it has processed, and only output that everything has been processed once it has seen that number of elements.
This assumes that the expected number of elements can be determined ahead of time.
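For illustration, a rough sketch of what such a stateful DoFn could look like with the Beam Java SDK; the EventWithCount element type and its getExpectedCount() accessor are assumptions standing in for however you carry the per-file element count:

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.DoFn.StateId;
import org.apache.beam.sdk.values.KV;

// Input is keyed by file_id; each element carries the expected element count for its file.
class FileCompletionFn extends DoFn<KV<String, EventWithCount>, String> {

    @StateId("seen")
    private final StateSpec<ValueState<Long>> seenSpec = StateSpecs.value();

    @ProcessElement
    public void processElement(ProcessContext c,
                               @StateId("seen") ValueState<Long> seen) {
        Long previous = seen.read();
        long count = (previous == null ? 0L : previous) + 1;
        seen.write(count);

        long expected = c.element().getValue().getExpectedCount(); // assumed known ahead of time
        if (count == expected) {
            // Every element for this file_id has been observed; emit the file_id as "complete".
            c.output(c.element().getKey());
        }
    }
}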

How to pass thread local variable in Project Reactor

I started using Project Reactor. Does anyone know how I can pass thread-local variables from one thread to another? I saw some methods on Hooks.java but could not figure out the recommended way of doing this. Can someone point me to some documentation or a code snippet on how to do it? Thanks.
I have a working example in this github repository based on the spring-cloud-sleuth's implementation: https://github.com/gumartinm/JavaForFun/tree/master/SpringJava/WebReactive/spring-webreactive-reactor-context-enrich
The key classes are: ContextCoreSubscriber.java, SubscriberContext.java, ThreadContextEnrichmentAutoConfiguration.java and UsernameFilter.java
ContextCoreSubscriber.java: enables you to fill the Mapped Diagnostic Context (MDC).
SubscriberContext.java: helper class for inserting data into the Reactor Context.
ThreadContextEnrichmentAutoConfiguration.java: in charge of configuring Reactor's Hooks.onEachOperator hook.
UsernameFilter.java: example where we register the username information based on some HTTP header.
Reactor doesn't guarantee that the processing done by a Flux or Mono chain of operators will stick to a single thread. On the contrary, it performs work-stealing and lets the user switch execution contexts.
As such, ThreadLocal is not well suited to Reactor.
There is currently some work being done in 3.1.0 towards providing an equivalent, at least for library authors that use Reactor, but nothing definite is in place yet.
Keep your eyes peeled for 3.1.0; that should be the main theme of that release (and will probably be the focus of the second upcoming milestone, M2).
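For reference, the mechanism that eventually shipped is the Reactor Context, which attaches values to the subscription rather than to a thread. A minimal sketch against a later Reactor version (3.4+, where subscriberContext became contextWrite/deferContextual); the "username" key and class name are just for illustration:

import reactor.core.publisher.Mono;
import reactor.util.context.Context;

public class ReactorContextExample {
    public static void main(String[] args) {
        // Read a value from the Context at subscription time instead of from a ThreadLocal.
        Mono<String> greeting = Mono.deferContextual(ctx ->
                Mono.just("Hello " + ctx.getOrDefault("username", "anonymous")));

        // contextWrite populates the Context for everything upstream of it,
        // regardless of which thread ends up executing the chain.
        greeting.contextWrite(Context.of("username", "alice"))
                .subscribe(System.out::println); // prints "Hello alice"
    }
}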

Initial state for a dataflow job

I'm trying to figure out how we "seed" the window state for some of our streaming Dataflow jobs. The scenario: we have a stream of forum messages and want to emit a running count of messages for each topic for all time, so we have a streaming Dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical count to inform the outputs from the stream source, so we essentially need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that. Reads over the file until it's exhausted and then starts reading from the stream. Not much fun because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream and somehow combines the two. This seems to make some sense, but I'm not sure how to make sure that the streaming job reads everything from the state source, to initialise itself, before reading from the data stream.
Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second solution: not sure how to make sure one stream is "consumed" first.
EDIT: The latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the Pub/Sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (it needs to either support updates or retractions), so I'd be interested to know what other solutions people have for seeding their window states.
You can do what you suggested in your second bullet point: run two pipelines (in the same main), with the first one populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
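A rough sketch of that shape with the Beam Java SDK, assuming a pre-created Pub/Sub topic; the bucket path, topic name, and the Count.perElement() stand-in for the real per-topic counting logic are all placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;

public class SeedThenStream {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        // 1) Batch pipeline: replay the historical archive into the Pub/Sub topic.
        Pipeline backfill = Pipeline.create(options);
        backfill.apply(TextIO.read().from("gs://my-bucket/history/*"))
                .apply(PubsubIO.writeStrings().to("projects/my-project/topics/forum-events"));
        backfill.run().waitUntilFinish();

        // 2) Streaming pipeline: the real job, fed by the same topic
        //    (run with --streaming=true in practice).
        Pipeline streaming = Pipeline.create(options);
        streaming.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/forum-events"))
                 .apply(Count.perElement()); // stand-in for the per-topic running-count logic
        streaming.run();
    }
}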

C# 5 .NET MVC long async task, progress report and cancel globally

I use ASP.NET MVC 5 and I have a long-running action which has to poll web services, process data, and store it in a database.
For that, I want to use the TPL library to start the task asynchronously.
But I wonder how to do three things:
I want to report the progress of this task. For this I'm thinking about SignalR.
I want to be able to leave the page where I start this task and see its progress across the website (from a panel on the left, but that part is fine).
And I want to be able to cancel this task globally (from my panel on the left).
I know a fair amount about all of the technologies involved, but I'm not sure about the best way to achieve this.
Can someone help me find the best solution?
The fact that you want to run long running work while the user can navigate away from the page that initiates the work means that you need to run this work "in the background". It cannot be performed as part of a regular HTTP request because the user might cancel his request at any time by navigating away or closing the browser. In fact this seems to be a key scenario for you.
Background work in ASP.NET is dangerous. You can certainly pull it off, but it is not easy to get right. Also, worker processes can exit for many reasons (app pool recycle, deployment, machine reboot, machine failure, a StackOverflow or OOM exception on an unrelated thread). So make sure your long-running work tolerates being aborted mid-way. You can reduce the likelihood that this happens, but never exclude the possibility.
You can make your code safe in the face of arbitrary termination by wrapping all work in a transaction. This of course only works if you don't cause non-transacted side-effects like web-service calls that change state. It is not possible to give a general answer here because achieving safety in the presence of arbitrary termination depends highly on the concrete work to be done.
Here's a possible architecture that I have used in the past:
When a job comes in you write all necessary input data to a database table and report success to the client.
You need a way to start a worker to work on that job. You could start a task immediately for that. You also need a periodic check that looks for unstarted work in case the app exits after having added the work item but before starting a task for it. Have the Windows task scheduler call a secret URL in your app once per minute that does this.
When you start working on a job, you mark it as running so that it is not accidentally picked up a second time. Work on the job, write the results, and mark it as done, all in a single transaction. If your process happens to exit mid-way, the database will roll back all the data involved.
Write job progress to a separate table row on a separate connection and separate transaction. The browser can poll the server for progress information. You could also use SignalR but I don't have experience with that and I expect it would be hard to get it to resume progress reporting in the presence of arbitrary termination.
Cancellation would be done by setting a cancel flag in the progress information row. The app needs to poll that flag.
Maybe you can make use of message queueing for job processing but I'm always wary to use it. To process a message in a transacted way you need MSDTC which is unsupported with many high-availability solutions for SQL Server.
You might think that this architecture is not very sophisticated. It makes use of polling for lots of things. Polling is a primitive technique but it works quite well. It is reliable and well-understood. It has a simple concurrency model.
If you can assume that your application never exits at inopportune times the architecture would be much simpler. But this cannot be assumed. You cannot assume that there will be no deployments during work hours and that there will be no bugs leading to crashes.
Even though using an HTTP worker is a bad way to run a long task, I have made a small example of how to manage it with SignalR.
Inside this example you can:
Start a task
See the task's progress
Cancel the task
It's based on:
Twitter Bootstrap
Knockout.js
SignalR
C# 5.0 async/await with CancellationToken and IProgress
You can find the source of this example here :
https://github.com/dragouf/SignalR.Progress
