I noticed that in RxJava we are not able to perform operations such as this method signature buffer(count, skip, timespan, TimeUnit)
Is there anyway to perform this operations without recreating another operation for buffer.
this method doesn't help a lot. Beside, in the doc there is no spec if it emits overlapping buffered items or not !??
UPDATE
What flatMapIterable adds here ? The desired behaviour is that if a have 10 items in the specified timespan then a buffered item is emitted. if the timespan has elapsed and no item is emitted i want to receive an empty buffer.
using the count and skip oblige to wait until the count is reached but returns nothing in between.
Related
i read about flink`s window assigners over here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#window-assigners , but i cant find any solution for my problem.
as part of my project i need a windowing that the timer will start given the first element of the key and will be closed and set ready for processing after X minutes. for example:
first keyA comes at (hh:mm:ss) 00:00:02, i want all keyA will be windowing until 00:01:02, and then the timer of 1 minutes will start again only when keyA will be given as input.
Is it possible to do something like that in flink? is there a workaround?
hope i made it clear enough.
Implementing keyed windows that are aligned with the first event, rather than with the epoch, is quite difficult, in general, which I believe is why this isn't supported by Flink's window API. The problem is that with an out-of-order stream using event time processing, as earlier events arrive you may need to revise your notion of when the window began, and when it should end. For example, if the first keyA arrives at 00:00:02, but then some time later an event with keyA arrives with a timestamp of 00:00:01, now suddenly the window should end at 00:01:01, rather than 00:01:02. And if the out-of-orderness is large compared to the window length, handling this becomes quite complex -- imagine, for example, that the event from 00:00:01 arrives 2 minutes after the event from 00:00:02.
Rather than trying to implement this with the window API, I would use a KeyedProcessFunction. If you only need to support processing time windows, then these concerns about out-of-orderness do not apply, and the solution can be fairly simple. It suffices to keep one object in keyed state, which might be a list holding all of the events in the window, or a counter or other aggregator, depending on what you're trying to accomplish.
When an event arrives, if the state (for this key) is null, then there is no open window for this key. Initialize the state (i.e., create a new, empty list, or set the counter to zero), and create a Timer to fire at the appropriate time. Then regardless of whether the state had been null, add the incoming event to the state (i.e., append it to the list, or increment the counter).
When the timer fires, emit the window's result and reset the state to null.
If, on the other hand, you want to do this with event time windows, first sort the stream and then use the same approach. Note that you won't be able to handle late events, so plan your watermarking accordingly (reducing the likelihood of late events to a manageable level), or go for a more complex implementation.
I read that, "..The ordering operator has to buffer all elements it receives. Then, when it receives a watermark it can sort all elements that have a timestamp that is lower than the watermark and emit them in the sorted order. This is correct because the watermark signals that not more elements can arrive that would be intermixed with the sorted elements..." - https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams
Hence, it seems that the watermark serves as a signal to the following operator, for beginning processing. I guess, that's what also a Trigger does. What's the difference between the two?
You can think of watermarks as special records that tell an operator what (event-) time it is. When an operator receives a watermark, it compares the watermark with its current time and other watermarks it received from different stream partitions. Depending on the comparison, the operator advances its own clock.
Some operators register timers (windows, time-based joins, custom implementations). An operator triggers a timer when the clock of the operator passes the time for which the timer was registered.
So, watermarks and timers are two different things. Watermarks tell an operator what time it is and the operator triggers a timer at the right point in time.
A Watermark can be thought of as an assertion that an event time stream is now complete up to a particular timestamp. When a Watermark is processed by an operator it will cause the firing of any relevant event time timers. The operators that use EventTimeTimers are EventTimeWindows and ProcessFunctions.
Triggers are part of the window API and define when Windows will produce results. An EventTimeTrigger wraps around an event time timer that is called when an suitably large Watermark is processed, indicating that the window is now complete.
I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combine all the outputs globally to find the minimum value. However, it is not possible to use a Global combine operation because a either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that a the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger it's timer.
I'm a newbie of the Beam, but according to this blog https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, Splittable DoFn might be the thing you are looking for!
You could create an SDF to fetch the stream and accept the input element as the start point.
I have seen a lot of blogs about throttle and debounce. Most of them said they are the same thing. But I get the different result form my example? Here is the example:
let disposeBag = DisposeBag()
Observable.of(1,2,3,4,5)
.debounce(1, scheduler: MainScheduler.instance)
.subscribe(onNext: {print($0)})
.addDisposableTo(disposeBag)
the result was 5. But when I used throttle, the result was 1
let disposeBag = DisposeBag()
Observable.of(1,2,3,4,5)
.throttle(1, scheduler: MainScheduler.instance)
.subscribe(onNext: {print($0)})
.addDisposableTo(disposeBag)
So,I can't understand about the throttle operator?
In earlier versions of RxSwift, throttle and debounce did the same thing, which is why you will see articles stating this. In RxSwift 3.0 they do a similar but opposite thing.
Both debounce and throttle are used to filter items emitted by an observable over time.
throttle emits only the first item emitted by the source observable in the time window.
debounce only emits an item after the specified time period has passed without another item being emitted by the source observable.
Both can be used to reduce the number of items emitted by an observable; which one you use depends on whether you want the "first" or "last" value emitted in a time period.
The term "debounce" comes from electronics and refers to the tendency of switch contacts to "bounce" between on and off very quickly when a switching action occurs. You won't notice this when you turn on a lightbulb but a microprocessor looking at an input thousands of times a second will see a rapid sequence of "ons" and "offs" before the switch settles into its final state. This is why debounce gives you the value of 5; the final item that was emitted in your time frame (1 ms). If you put a time delay into your code so that the items were emitted more slowly (more than 1ms apart) you would see a number of items emitted by debounce.
In an app you could use debounce for performing a search that is expensive (say it requires a network operation). The user is going to type a number of characters characters for their search string but you don't want to initiate a search as they enter each character, since the search is expensive and the earlier results will be obsolete by the time they return. Using debounce you can ensure that the search string is only emitted once the user stops typing for a period (say 500ms).
You might use throttle where an operation takes some time and you want to ignore further input until that time has elapsed. Say you have a button that initiates an operation. If the user taps the button multiple times in quick succession you only want to initiate the operation once. You can use throttle to ignore the subsequent taps within a specified time window. debounce could also work but would introduce a delay before the action item was emitted, while throttle allows you to react to the first action and ignore the rest.
Simple explanation is here https://medium.com/fantageek/throttle-vs-debounce-in-rxswift-86f8b303d5d4
Throttle: the original function is called at most once per specified period.
Debounce: the original function is called after the caller stops calling the decorated function after a specified period.
I have an event stream as input, want to do a series of transformations on them, then buffer them up into a List, and finally perform a batched write in another system. Once the input stream has been drained I want to manually stop this process which needs to flush the remaining buffer into a write. What is the best way to get this to work consistently?
Example:
Flux.from(<source>)
.map(<some transform>)
.buffer(4096)
.subscribe(<batched writer that takes a List>)
With the above, if I cancel the batched writer subscription I will lose whatever is pending in the buffer. Is there any way I can "complete" the original source, thus causing a flush/drain in the whole pipeline, to ensure no events gets lost?