What is the difference between Flux.create and Flux.generate? I am looking--ideally with an example use case--to understand when I should use one or the other.
In short:
Flux::create doesn't react to changes in the state of the app while Flux::generate does.
The long version
Flux::create
You will use it when you want to emit multiple (0...infinity) values that are not influenced by the state of your app or the state of your pipeline (your pipeline == the chain of operations that comes after Flux::create == downstream).
Why? Because the method you pass to Flux::create keeps emitting elements (or none) regardless of demand. The downstream determines how many elements (elements == next signals) it wants, and if it can't keep up, the elements already emitted are handled by an overflow strategy (by default they are buffered until the downstream asks for more).
The first and easiest use case is emitting values that you could, in principle, gather into a collection first and only then take each element and do something with it:
Flux<String> articlesFlux = Flux.create((FluxSink<String> sink) -> {
    // Get all the latest articles from the server (a blocking call)
    // and emit them one by one to the downstream.
    List<String> articles = getArticlesFromServer();
    articles.forEach(sink::next);
    sink.complete(); // no more articles will follow
});
As you can see, Flux.create is used here to bridge a blocking method (getArticlesFromServer) into asynchronous code.
I'm sure there are other use cases for Flux.create.
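Another common use case is bridging a callback/listener-based API into a Flux. A minimal sketch, assuming a hypothetical myEventClient with a MyEventListener callback interface (both names are illustrative, not a real library):

Flux<String> bridge = Flux.create((FluxSink<String> sink) -> {
    myEventClient.register(new MyEventListener() {
        @Override
        public void onEvent(String event) {
            sink.next(event);    // push each callback into the Flux
        }

        @Override
        public void onClosed() {
            sink.complete();     // translate the terminal callback
        }
    });
}, FluxSink.OverflowStrategy.BUFFER); // buffer if downstream demand lags behind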
Flux::generate
Flux.generate((SynchronousSink<Integer> synchronousSink) -> {
synchronousSink.next(1);
})
.doOnNext(number -> System.out.println(number))
.doOnNext(number -> System.out.println(number + 4))
.subscribe();
The output will be 1 5 1 5 1 5 ... forever
In each invocation of the method you pass to Flux::generate, synchronousSink can emit at most one onNext signal, optionally followed by an onComplete or onError signal.
It means that Flux::generate will calculate and emit values on demand. When should you use it? In cases where it's too expensive to calculate elements that may never be used downstream, or where the events you emit are influenced by the state of the app or by the state of your pipeline (your pipeline == the chain of operations that comes after Flux::generate == downstream).
For example, if you are building a torrent application, you receive blocks of data in real time. You could use Flux::generate to hand out tasks (blocks to download) to multiple threads, calculating the block to download inside Flux::generate only when some thread asks for it. That way you emit only blocks you don't already have. The same algorithm with Flux::create would fail, because Flux::create would emit all the blocks we don't have up front, and if some blocks fail to download then we have a problem: Flux::create doesn't react to changes in the state of the app, while Flux::generate does.
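A minimal sketch of that demand-driven, state-aware flavour, assuming a hypothetical missingBlocks() method that consults the current application state and returns a queue of block indexes still to fetch (not a real API):

Flux<Integer> blocksToDownload = Flux.generate((SynchronousSink<Integer> sink) -> {
    // consult the app state only at the moment a thread actually asks for work
    Integer next = missingBlocks().poll();
    if (next == null) {
        sink.complete();   // nothing left to download
    } else {
        sink.next(next);   // emit exactly one block index per invocation
    }
});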
Create:
Accepts a Consumer<FluxSink<T>>
Consumer is invoked only once per subscriber
Consumer can emit 0..N elements immediately
Publisher is not aware of the downstream state, so we need to provide an overflow strategy as an additional parameter
We can get hold of the FluxSink reference and keep emitting elements as and when required, even from multiple threads (see the sketch below)
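A minimal sketch of that pattern, storing the FluxSink in an AtomicReference so that other threads can push to it later (simplified; a real application would guard against the sink not being set yet):

AtomicReference<FluxSink<String>> sinkRef = new AtomicReference<>();

// The overflow strategy is required because the publisher does not track
// downstream demand on its own; DROP is just one possible choice.
Flux<String> hotFlux = Flux.create(sinkRef::set, FluxSink.OverflowStrategy.DROP);
hotFlux.subscribe(System.out::println);

// any thread can now push into the same Flux
new Thread(() -> sinkRef.get().next("from another thread")).start();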
Generate:
Accepts a Consumer<SynchronousSink<T>>
Consumer is invoked again and again based on the downstream demand
Consumer can emit at most one element per invocation, with an optional complete/error signal.
Publisher produces elements based on the downstream demand
We can get hold of the SynchronousSink reference, but it is of limited use since we can emit only one element per invocation (though the stateful variant sketched below lets us carry state between invocations)
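For completeness, a minimal sketch of the stateful overload Flux.generate(stateSupplier, generator), which keeps the one-signal-per-invocation contract but passes explicit state from one call to the next:

Flux<String> fib = Flux.generate(
        () -> new long[]{0, 1},                               // initial state
        (state, sink) -> {
            sink.next("fib = " + state[0]);                   // at most one onNext per call
            return new long[]{state[1], state[0] + state[1]}; // state for the next call
        });

fib.take(10).subscribe(System.out::println);                  // downstream demand drives the calls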
Related
I have a pipeline running on Dataflow that ingests files containing several thousand records. These files arrive at a steady frequency and are processed by a stateful ParDo with timers that attempts to throttle the rate of ingest by batching and holding the files until the timer fires; they are then expanded into individual record elements via a file-processing ParDo and finally written to BigQuery destinations.
On occasion, after an intermittent event such as an OOM or an autoscaling event, I have seen Dataflow re-emit the files from the stateful ParDo once the event resolves, causing duplicate record elements downstream when the file-processing ParDo reprocesses the files. I understand that bundles are retried if there is a failure, but do these retries account for duplicates?
What exactly is exactly-once processing achieving in this context, especially with regard to the State/Timer API, given that I am seeing duplicates at my destination?
Dataflow achieves exactly-once processing by ensuring that data produced by failing workers is not passed downstream (or, more precisely, if work is retried only one successful result is consumed downstream). For example, if stage A of your pipeline is producing elements and stage B is counting them, and workers in stage A fail and are retried, duplicate elements will not be counted by stage B (though of course stage B might itself have to be retried). This also applies to state and timers: a given bundle of work is either committed in its entirety (i.e. the set of inputs is marked as consumed, and the set of outputs is committed atomically with the consumption/setting of state and timers) or entirely discarded (state/timers are left unconsumed/untouched and the retry will not be influenced by what happened before).
What is not exactly once is interactions with external systems (due to the possibility of retries). These are instead at least once, and so to guarantee correctness all such interactions should be idempotent. Sinks often achieve this by assigning a unique id such that multiple writes can be deduplicated in the downstream system. For files, one can write to temporary files, and then rename the "winning" set of shards to the final destination after a barrier. It's not clear from your question what files you're emitting (or ingesting) but hopefully this should be helpful in understanding how the system works.
More specifically, say the initial state is {state: A, timers: [X, Y], inputs: [i, j, k]}. Suppose further that when processing the bundle (these timers and inputs) the state is updated to B, we emit elements m, and n downstream, and we set a timer W.
If the bundle succeeds, the new state will be {state: B, timers: [W], inputs: []} and the elements [m, n] are guaranteed to be passed downstream. Furthermore, any competing retry of this bundle would always fail.
On the other hand, if the bundle fails (even if it "emitted" some of the elements or tried to update the state) the resulting state of the system will be {state: A, timers: [X, Y], inputs: [i, j, k]} for a fresh retry and nothing that was emitted from this failed bundle will be observed downstream.
Another way to look at it is that the set {inputs consumed, timers consumed, state modifications, timers set, outputs to produce downstream} is written to the backing "database" in a single transaction. Only a single successful attempt is ever committed, failed attempts are discarded.
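To illustrate the earlier point about idempotent interactions with external systems, here is a minimal sketch in which the write key is derived deterministically from the element, so a retried bundle overwrites the same row rather than creating a duplicate (kvStore, Record, getSourceFile and getOffset are hypothetical placeholders, not Beam or Dataflow APIs):

class IdempotentWriteFn extends DoFn<Record, Void> {
    @ProcessElement
    public void process(@Element Record record) {
        // stable across retries: the same element always maps to the same key
        String dedupKey = record.getSourceFile() + ":" + record.getOffset();
        kvStore.put(dedupKey, record); // idempotent upsert in the external system
    }
}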
More details can be found at https://beam.apache.org/documentation/runtime/model/
I wonder, is there any difference in behavior/guarantees between the MonoJust and FluxJust created with exactly one argument?
From the source code of the Reactor Core 3.3.7 I am able to see that the former one is using the Operators#ScalarSubscription as its subscription object, while the latter one uses its private WeakScalarSubscription.
The only difference between the two is that ScalarSubscription has a volatile int once field (a counter) that is checked on each method call and more or less ensures that onComplete() is called exactly once. WeakScalarSubscription, on the other hand, uses a non-volatile boolean terminado flag for the same purpose, but without the "exactly once" guarantee for the onComplete() call.
Using volatile in Java has its price, which is paid, for example, when one creates a lot of these objects (with Mono.just(1) or Flux.just(1)) in highly concurrent client code. (As we do in our project inside a flatMap that runs in parallel on a dedicated thread pool.)
There's no class javadoc for MonoJust, so I wonder if my assumptions are correct: that the only difference is that FluxJust may send the completion signal more than once in some circumstances — and that's it? Or are there other subtle differences?
I think that the biggest difference is how you use Flux and Mono. Mono emits one item or error and then completes, whereas Flux can emit more than one element, error, and then completion signal.
The just() methods are meant to take an already-available element (or several, with the vararg variant for Flux) and emit it immediately. I can imagine cases where a Flux with only one element is returned.
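A small sketch of how the two read at the call site, including the conversions you typically reach for when an API expects the other shape:

Mono<Integer> one = Mono.just(1);         // exactly one value, then completion
Flux<Integer> single = Flux.just(1);      // a Flux that happens to carry a single value
Flux<Integer> many = Flux.just(1, 2, 3);  // the vararg variant

one.subscribe(System.out::println);                        // 1
many.reduce(Integer::sum).subscribe(System.out::println);  // 6

// converting between the two shapes
Mono<Integer> first = many.next();   // first element of the Flux as a Mono
Flux<Integer> widened = one.flux();  // view the Mono as a single-element Flux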
I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as its output when the timer triggers and then combining all the outputs globally to find the minimum value. However, it is not possible to use a global Combine operation because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, Splittable DoFn might be the thing you are looking for!
You could create an SDF that fetches the stream and accepts the input element as the starting point.
Is it possible to serialize a Reactor Flux? For example, my Flux is in some state and is currently processing some event, and suddenly the service is terminated. The current state of the Flux would be saved to a database or to a file, and then on restart of the application I would just load all the Fluxes from that file/table and subscribe to them to resume processing from the last state. Is this possible in Reactor?
No, this is not possible. A Flux is not serializable; it is closer to a chain of functions. It doesn't necessarily have state[1] but describes what to do given an input (provided by an initial generating Flux).
So in order to "restart" a Flux, you'd have to actually create a new one that gets fed the remaining input the original one would have received upon service termination.
Thus it would be more up to the source of your data to save the last emitted state and allow restarting a new Flux sequence from there.
[1] Although, depending on what operators you chained in, you could have it impact some external state. In that case things will get more complicated, as you'll have to also persist that state.
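If it helps, a minimal sketch of that approach, where Event, loadLastOffset(), saveOffset() and source.readFrom() are hypothetical application types/methods (not Reactor APIs):

Flux<Event> resumed = Flux.defer(() -> source.readFrom(loadLastOffset())) // rebuild from the saved position
        .doOnNext(event -> saveOffset(event.getOffset()))                 // checkpoint progress as we go
        .doOnNext(this::process);

resumed.subscribe();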
It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally to get a singleton per window; I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails: you have to use withoutDefaults, which then emits nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
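A rough sketch of the shape of that workaround, assuming heartbeatZeroes is an unbounded PCollection<Integer> carrying one zero per window (for example, produced by the generator job described next and read from Pub/Sub) and mainCounts is your main PCollection<Integer>; the fixed one-minute windowing is purely illustrative:

PCollection<Integer> perWindowSums =
    PCollectionList.of(mainCounts).and(heartbeatZeroes)
        .apply(Flatten.pCollections())
        .apply(Window.<Integer>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Sum.integersGlobally().withoutDefaults()); // every window now contains at least the zero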
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.