I have an event stream as input and want to apply a series of transformations to the events, buffer them into a List, and finally perform a batched write to another system. Once the input stream has been drained, I want to stop the process manually, which requires flushing the remaining buffer into a final write. What is the best way to get this to work consistently?
Example:
Flux.from(<source>)
.map(<some transform>)
.buffer(4096)
.subscribe(<batched writer that takes a List>)
With the above, if I cancel the batched writer subscription I will lose whatever is pending in the buffer. Is there any way I can "complete" the original source, thus causing a flush/drain of the whole pipeline, to ensure no events get lost?
I read about Flink's window assigners here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#window-assigners , but I can't find a solution to my problem.
As part of my project I need windowing in which the timer starts when the first element for a key arrives, and the window is closed and made ready for processing after X minutes. For example:
the first keyA arrives at (hh:mm:ss) 00:00:02; I want all keyA events to be collected in the window until 00:01:02, and then the 1-minute timer should start again only when the next keyA arrives as input.
Is it possible to do something like that in Flink? Is there a workaround?
I hope I made this clear enough.
Implementing keyed windows that are aligned with the first event, rather than with the epoch, is quite difficult, in general, which I believe is why this isn't supported by Flink's window API. The problem is that with an out-of-order stream using event time processing, as earlier events arrive you may need to revise your notion of when the window began, and when it should end. For example, if the first keyA arrives at 00:00:02, but then some time later an event with keyA arrives with a timestamp of 00:00:01, now suddenly the window should end at 00:01:01, rather than 00:01:02. And if the out-of-orderness is large compared to the window length, handling this becomes quite complex -- imagine, for example, that the event from 00:00:01 arrives 2 minutes after the event from 00:00:02.
Rather than trying to implement this with the window API, I would use a KeyedProcessFunction. If you only need to support processing time windows, then these concerns about out-of-orderness do not apply, and the solution can be fairly simple. It suffices to keep one object in keyed state, which might be a list holding all of the events in the window, or a counter or other aggregator, depending on what you're trying to accomplish.
When an event arrives, if the state (for this key) is null, then there is no open window for this key. Initialize the state (i.e., create a new, empty list, or set the counter to zero), and create a Timer to fire at the appropriate time. Then regardless of whether the state had been null, add the incoming event to the state (i.e., append it to the list, or increment the counter).
When the timer fires, emit the window's result and reset the state to null.
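The processing-time logic described above can be sketched in plain Java. This is a simulation under stated assumptions, not Flink code: the class name FirstEventWindowSketch, the HashMap standing in for keyed state, and the TreeMap standing in for the timer service are all hypothetical. In a real KeyedProcessFunction the same steps would live in processElement and onTimer, with the timer registered through ctx.timerService().registerProcessingTimeTimer(...).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of first-event-aligned windows per key (not Flink API). */
final class FirstEventWindowSketch {
    private final long windowMillis;
    // Stand-in for keyed state: a missing entry means "no open window for this key".
    private final Map<String, List<String>> state = new HashMap<>();
    // Stand-in for the timer service: fire time -> keys whose window ends then.
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();
    private final List<String> output = new ArrayList<>();

    FirstEventWindowSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Called for every incoming event; 'now' is the current processing time. */
    public void onEvent(String key, String value, long now) {
        advanceTo(now); // fire any timers that are already due
        List<String> window = state.get(key);
        if (window == null) {
            // No open window: initialize state and register a timer.
            window = new ArrayList<>();
            state.put(key, window);
            timers.computeIfAbsent(now + windowMillis, t -> new ArrayList<>()).add(key);
        }
        // Regardless of whether the window was just opened, add the event.
        window.add(value);
    }

    /** Advances the clock; each due timer emits its window and clears the state. */
    public void advanceTo(long now) {
        while (!timers.isEmpty() && timers.firstKey() <= now) {
            for (String key : timers.pollFirstEntry().getValue()) {
                output.add(key + "=" + state.remove(key));
            }
        }
    }

    public List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        FirstEventWindowSketch w = new FirstEventWindowSketch(60_000L);
        w.onEvent("keyA", "e1", 2_000L);   // opens a window ending at 62_000
        w.onEvent("keyA", "e2", 30_000L);  // joins the open window
        w.advanceTo(62_000L);              // timer fires, window is emitted
        System.out.println(w.output());    // prints [keyA=[e1, e2]]
    }
}
```

After the timer fires, the state for keyA is gone, so the next keyA event opens a fresh window aligned to its own arrival time.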
If, on the other hand, you want to do this with event time windows, first sort the stream and then use the same approach. Note that you won't be able to handle late events, so plan your watermarking accordingly (reducing the likelihood of late events to a manageable level), or go for a more complex implementation.
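Sorting a stream by event time can be sketched as a buffer that is drained whenever a watermark arrives. The following is a minimal plain-Java illustration, not Flink code; the names WatermarkSorterSketch and Event are hypothetical, and it assumes a watermark t means that no further events with a timestamp at or below t are expected, so such events are treated as late and dropped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java sketch of watermark-driven sorting (not Flink API). */
final class WatermarkSorterSketch {
    static final class Event {
        final long timestamp;
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Buffered events, ordered by event timestamp.
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));
    private long currentWatermark = Long.MIN_VALUE;

    /** Buffers the event; returns false if it is late and was dropped. */
    public boolean onEvent(Event e) {
        if (e.timestamp <= currentWatermark) {
            return false; // late: events up to the watermark were already emitted
        }
        buffer.add(e);
        return true;
    }

    /** Emits, in timestamp order, every buffered event at or below the watermark. */
    public List<Event> onWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<Event> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= currentWatermark) {
            ready.add(buffer.poll());
        }
        return ready;
    }

    public static void main(String[] args) {
        WatermarkSorterSketch sorter = new WatermarkSorterSketch();
        sorter.onEvent(new Event(5, "b"));
        sorter.onEvent(new Event(3, "a"));
        sorter.onEvent(new Event(9, "c"));
        for (Event e : sorter.onWatermark(6)) {
            System.out.println(e.timestamp + " " + e.payload);
        }
    }
}
```

The dropped late events are the reason the watermarking has to be planned carefully: the larger the allowed out-of-orderness, the fewer events are dropped, at the cost of more buffering and latency.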
I read that, "..The ordering operator has to buffer all elements it receives. Then, when it receives a watermark it can sort all elements that have a timestamp that is lower than the watermark and emit them in the sorted order. This is correct because the watermark signals that not more elements can arrive that would be intermixed with the sorted elements..." - https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
Hence, it seems that the watermark serves as a signal to the following operator to begin processing. I guess that's also what a Trigger does. What's the difference between the two?
You can think of watermarks as special records that tell an operator what (event-) time it is. When an operator receives a watermark, it compares the watermark with its current time and other watermarks it received from different stream partitions. Depending on the comparison, the operator advances its own clock.
Some operators register timers (windows, time-based joins, custom implementations). An operator triggers a timer when the clock of the operator passes the time for which the timer was registered.
So, watermarks and timers are two different things. Watermarks tell an operator what time it is and the operator triggers a timer at the right point in time.
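To make the relationship concrete, here is a minimal plain-Java sketch of an operator clock. It is an illustration under assumptions, not Flink's implementation: the class name OperatorClockSketch is hypothetical, and it assumes the operator's event time is the minimum of the latest watermarks seen on each input partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java sketch: watermarks advance the clock, the clock fires timers. */
final class OperatorClockSketch {
    private final long[] channelWatermarks; // latest watermark per input partition
    private long currentTime = Long.MIN_VALUE;
    private final TreeSet<Long> timers = new TreeSet<>();

    OperatorClockSketch(int inputChannels) {
        channelWatermarks = new long[inputChannels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    public void registerTimer(long time) {
        timers.add(time);
    }

    /** Handles a watermark from one partition; returns the timers that fired. */
    public List<Long> onWatermark(int channel, long watermark) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        // The operator can only be as far along as its slowest input.
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        List<Long> fired = new ArrayList<>();
        if (min > currentTime) {
            currentTime = min;
            while (!timers.isEmpty() && timers.first() <= currentTime) {
                fired.add(timers.pollFirst());
            }
        }
        return fired;
    }

    public long currentTime() {
        return currentTime;
    }

    public static void main(String[] args) {
        OperatorClockSketch op = new OperatorClockSketch(2);
        op.registerTimer(10);
        System.out.println(op.onWatermark(0, 15)); // prints []
        System.out.println(op.onWatermark(1, 8));  // prints []
        System.out.println(op.onWatermark(1, 12)); // prints [10]
    }
}
```

In the usage above, the watermark of 15 on the first partition fires nothing, because the second partition has not yet reported a watermark past the timer's time.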
A Watermark can be thought of as an assertion that an event time stream is now complete up to a particular timestamp. When a Watermark is processed by an operator it will cause the firing of any relevant event time timers. The operators that use EventTimeTimers are EventTimeWindows and ProcessFunctions.
Triggers are part of the window API and define when Windows will produce results. An EventTimeTrigger wraps around an event time timer that is called when a suitably large Watermark is processed, indicating that the window is now complete.
I am using a global unbounded stream in combination with stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described in the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as its output when the timer triggers, and then combining all the outputs globally to find the minimum value. However, it is not possible to use a global combine operation because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints are only written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm new to Beam, but according to this blog post (https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html), a Splittable DoFn (SDF) might be what you are looking for.
You could create an SDF that reads the stream and accepts the starting point as its input element.
Is it possible to serialize a Reactor Flux? For example, my Flux is in some state and is currently processing an event when the service is suddenly terminated. The current state of the Flux is saved to a database or to a file. Then, when the application restarts, I load all the Flux instances from that file/table and subscribe to them to resume processing from the last state. Is this possible in Reactor?
No, this is not possible. Flux instances are not serializable; they are closer to a chain of functions. They don't necessarily have state[1] but describe what to do given an input (provided by an initial generating Flux)...
So in order to "restart" a Flux, you'd have to actually create a new one that gets fed the remaining input the original one would have received upon service termination.
Thus it would be more up to the source of your data to save the last emitted state and allow restarting a new Flux sequence from there.
[1] Although, depending on what operators you chained in, you could have it impact some external state. In that case things will get more complicated, as you'll have to also persist that state.
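The idea of letting the data source remember where it stopped can be sketched without Reactor at all. The following plain-Java example is only an illustration under assumptions: ResumableSourceSketch, OffsetStore, and the maxItems parameter (used to simulate an abrupt termination) are hypothetical names, and the in-memory store stands in for a database or file. With Reactor, the loop would be replaced by a newly created Flux over the remaining input, for instance by skipping the first store.load() elements of the source.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch: persist the source position, then restart from it. */
final class ResumableSourceSketch {
    /** Hypothetical checkpoint store; a real one would write to a database or file. */
    static final class OffsetStore {
        private int nextOffset = 0;

        int load() {
            return nextOffset;
        }

        void save(int offset) {
            nextOffset = offset;
        }
    }

    /**
     * Processes the input starting at the saved offset. Stops after maxItems
     * elements to simulate the service being terminated mid-stream.
     */
    static List<String> run(List<String> input, OffsetStore store, int maxItems) {
        List<String> processed = new ArrayList<>();
        for (int i = store.load(); i < input.size() && processed.size() < maxItems; i++) {
            processed.add(input.get(i).toUpperCase()); // stand-in for the pipeline
            store.save(i + 1); // record progress only after the element is handled
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        OffsetStore store = new OffsetStore();
        System.out.println(run(input, store, 2));  // prints [A, B]
        System.out.println(run(input, store, 10)); // prints [C, D, E]
    }
}
```

Saving the offset after each element means an element that was in flight during the crash is processed again on restart, so the downstream work should tolerate replays.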
I noticed that RxJava does not offer a buffer overload with a signature like buffer(count, skip, timespan, TimeUnit).
Is there any way to get this behaviour without writing a custom buffer operator?
This method doesn't help much. Besides, the documentation doesn't specify whether it emits overlapping buffers or not.
UPDATE
What does flatMapIterable add here? The desired behaviour is that if I have 10 items within the specified timespan, a buffer is emitted. If the timespan has elapsed and no item was emitted, I want to receive an empty buffer.
Using count and skip forces me to wait until the count is reached, and nothing is returned in the meantime.