I have a processor and a sink that emits elements with sink.next().
The flux could fail on any one of the steps. When that happens, I want to retry only the failed element and not the rest.
In the image above, I want to re-emit only the purple element.
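To make the intent concrete, here is a minimal plain-Java sketch of "retry scoped to one element". It is not Reactor code: `PerElementRetry`, `processAll` and `maxAttempts` are hypothetical names chosen for illustration. In Reactor the same scoping is usually obtained by moving the fallible step into an inner publisher (for example inside flatMap or concatMap) and applying retry there, but whether that fits a processor/sink setup like the one above is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PerElementRetry {
    /**
     * Applies step to every element in order. When step throws for an element,
     * only that element is attempted again (up to maxAttempts times in total);
     * results already produced for earlier elements are kept.
     */
    static <T, R> List<R> processAll(List<T> elements, Function<T, R> step, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        List<R> results = new ArrayList<>();
        for (T element : elements) {
            RuntimeException lastFailure = null;
            boolean done = false;
            for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
                try {
                    results.add(step.apply(element));
                    done = true;
                } catch (RuntimeException e) {
                    lastFailure = e; // retry this element only
                }
            }
            if (!done) {
                throw lastFailure; // attempts exhausted for this element
            }
        }
        return results;
    }
}
```

With a step that fails once on "purple", the elements before and after it are processed exactly once and only "purple" is attempted twice.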
I have a pipeline running on Dataflow that ingests files, each containing several thousand records. The files arrive at a steady rate and are processed by a stateful ParDo with timers, which throttles ingestion by batching and holding the files until a timer fires. The files are then expanded into individual record elements by a file-processing ParDo and finally written to BigQuery destinations.
On occasion, after an intermittent event such as an OOM or an autoscaling event, I have seen Dataflow re-emit the files from the stateful ParDo once the event resolves. This causes duplicate record elements downstream when the file-processing ParDo reprocesses the files. I understand that bundles are retried if there is a failure, but do retries account for duplicates?
How/What is exactly-once processing achieving in this context, especially with regard to the State/Timer API, since I am seeing duplicates at my destination?
Dataflow achieves exactly-once processing by ensuring that data produced by failing workers is not passed downstream (or, more precisely, if work is retried, only one successful result is consumed downstream). For example, if stage A of your pipeline is producing elements and stage B is counting them, and workers in stage A fail and are retried, duplicate elements will not be counted by stage B (though of course stage B might itself have to be retried). This also applies to state and timers: a given bundle of work is either committed in its entirety (i.e. the set of inputs is marked as consumed, and the set of outputs is committed atomically with the consumption/setting of state and timers) or entirely discarded (state and timers are left unconsumed/untouched, and the retry will not be influenced by what happened before).
What is not exactly once is interactions with external systems (due to the possibility of retries). These are instead at least once, and so to guarantee correctness all such interactions should be idempotent. Sinks often achieve this by assigning a unique id such that multiple writes can be deduplicated in the downstream system. For files, one can write to temporary files, and then rename the "winning" set of shards to the final destination after a barrier. It's not clear from your question what files you're emitting (or ingesting) but hopefully this should be helpful in understanding how the system works.
More specifically, say the initial state is {state: A, timers: [X, Y], inputs: [i, j, k]}. Suppose further that when processing the bundle (these timers and inputs) the state is updated to B, we emit elements m, and n downstream, and we set a timer W.
If the bundle succeeds, the new state will be {state: B, timers: [W], inputs: []} and the elements [m, n] are guaranteed to be passed downstream. Furthermore, any competing retry of this bundle would always fail.
On the other hand, if the bundle fails (even if it "emitted" some of the elements or tried to update the state) the resulting state of the system will be {state: A, timers: [X, Y], inputs: [i, j, k]} for a fresh retry and nothing that was emitted from this failed bundle will be observed downstream.
Another way to look at it is that the set {inputs consumed, timers consumed, state modifications, timers set, outputs to produce downstream} is written to the backing "database" in a single transaction. Only a single successful attempt is ever committed; failed attempts are discarded.
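The all-or-nothing behaviour described above can be sketched in plain Java. This is only a toy model of the example with state A/B, timers X, Y, W and elements m, n. It is not how Dataflow is implemented, and `BundleCommitSketch`, `processBundle` and `failMidway` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BundleCommitSketch {
    // Committed view: what a fresh attempt (or downstream) is allowed to see.
    String state = "A";
    List<String> timers = new ArrayList<>(Arrays.asList("X", "Y"));
    List<String> inputs = new ArrayList<>(Arrays.asList("i", "j", "k"));
    final List<String> downstream = new ArrayList<>();

    /**
     * Runs one attempt at the pending bundle. Every effect is staged in local
     * variables and copied to the committed view only if the attempt finishes.
     * Returns true when the attempt was committed.
     */
    boolean processBundle(boolean failMidway) {
        String stagedState = state;
        List<String> stagedOutputs = new ArrayList<>();
        List<String> stagedTimers = new ArrayList<>();
        try {
            stagedState = "B";
            stagedOutputs.add("m"); // "emitted", but not yet visible downstream
            if (failMidway) {
                throw new IllegalStateException("worker lost");
            }
            stagedOutputs.add("n");
            stagedTimers.add("W");
        } catch (RuntimeException e) {
            return false; // staged effects are dropped; committed view is untouched
        }
        // Single "transaction": consume inputs and timers, update state, publish outputs.
        state = stagedState;
        timers = stagedTimers;
        inputs = new ArrayList<>();
        downstream.addAll(stagedOutputs);
        return true;
    }
}
```

A failed attempt leaves {state: A, timers: [X, Y], inputs: [i, j, k]} in place even though it had already "emitted" m, and a later successful attempt publishes [m, n] exactly once.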
More details can be found at https://beam.apache.org/documentation/runtime/model/
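The "assign a unique id so that repeated writes can be deduplicated" idea from the answer above can be sketched as follows. This is a plain-Java illustration with a hypothetical `IdempotentSink` class and an in-memory map standing in for the external system; a real sink would rely on whatever deduplication key the destination supports.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdempotentSink {
    // Stand-in for the external system, keyed by the caller-supplied unique id.
    private final Map<String, String> rowsById = new LinkedHashMap<>();

    /**
     * Writes a row under its unique id. A repeated write with the same id
     * (for example from a retried bundle) is ignored.
     * Returns true if the row was stored, false if the id was already present.
     */
    boolean write(String uniqueId, String row) {
        return rowsById.putIfAbsent(uniqueId, row) == null;
    }

    int size() {
        return rowsById.size();
    }
}
```

The id has to be derived deterministically from the data (say, file name plus record offset, which is an assumption about the records here) so that a retry produces the same id as the original attempt.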
What is the difference between Flux.create and Flux.generate? I am looking--ideally with an example use case--to understand when I should use one or the other.
In short:
Flux::create doesn't react to changes in the state of the app while Flux::generate does.
The long version
Flux::create
You will use it when you want to calculate multiple (0...infinity) values that are not influenced by the state of your app or the state of your pipeline (your pipeline == the chain of operations which comes after Flux::create == downstream).
Why? Because the method you pass to Flux::create keeps calculating elements (or none) regardless of demand. The downstream determines how many elements (elements == next signals) it wants, and if it can't keep up, the elements that have already been emitted are dropped or buffered according to an overflow strategy (by default they are buffered until the downstream asks for more).
The first and easiest use case is emitting values that you could, in theory, gather into a collection first and only then process one by one:
Flux<String> articlesFlux = Flux.create((FluxSink<String> sink) -> {
    /* Get all the latest articles from a server (blocking call) and emit them one by one to downstream. */
    List<String> articles = getArticlesFromServer();
    articles.forEach(sink::next);
    sink.complete(); // signal that no more articles will be emitted
});
As you can see, Flux.create is used here to bridge a blocking method (getArticlesFromServer) to asynchronous code.
I'm sure there are other use cases for Flux.create.
Flux::generate
Flux.generate((SynchronousSink<Integer> synchronousSink) -> {
synchronousSink.next(1);
})
.doOnNext(number -> System.out.println(number))
.doOnNext(number -> System.out.println(number + 4))
.subscribe();
The output will be 1 5 1 5 1 5................forever
In each invocation of the method you pass to Flux::generate, synchronousSink can emit at most one element, optionally followed by a terminal signal: onSubscribe onNext? (onError | onComplete)?.
This means that Flux::generate calculates and emits values on demand. When should you use it? When it is too expensive to calculate elements that may never be used downstream, or when the events you emit are influenced by the state of the app or the state of your pipeline (your pipeline == the chain of operations which comes after Flux::generate == downstream).
For example, if you are building a torrent application, you receive blocks of data in real time. You could use Flux::generate to hand out tasks (blocks to download) to multiple threads, calculating the block to download inside Flux::generate only when some thread asks for one. That way you emit only blocks you don't have yet. The same algorithm with Flux::create would fail, because Flux::create would emit all the blocks we don't have up front, and if some blocks then failed to download we would have a problem: Flux::create doesn't react to changes in the state of the app, while Flux::generate does.
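The torrent example can be made concrete with a small plain-Java analogy. This is not Reactor code: `BlockSources`, `snapshotMissing` and `nextMissing` are hypothetical names. The first method mimics the "compute everything up front" style attributed to Flux::create above, the second mimics the "compute one element per request" style of Flux::generate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class BlockSources {
    /**
     * "create"-style: decides once which blocks are missing. Later changes to
     * the set of blocks we have are not reflected in the returned list.
     */
    static List<Integer> snapshotMissing(Set<Integer> have, int totalBlocks) {
        List<Integer> missing = new ArrayList<>();
        for (int block = 0; block < totalBlocks; block++) {
            if (!have.contains(block)) {
                missing.add(block);
            }
        }
        return missing;
    }

    /**
     * "generate"-style: called once per request, so it looks at the current
     * contents of the set every time. Empty means there is nothing left to fetch.
     */
    static Optional<Integer> nextMissing(Set<Integer> have, int totalBlocks) {
        for (int block = 0; block < totalBlocks; block++) {
            if (!have.contains(block)) {
                return Optional.of(block);
            }
        }
        return Optional.empty();
    }
}
```

If block 1 arrives from elsewhere after the snapshot was taken, the snapshot still lists it as a task, while the per-request method skips straight to block 2.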
Create:
Accepts a Consumer<FluxSink<T>>
Consumer is invoked only once per subscriber
Consumer can emit 0..N elements immediately
Publisher is not aware of downstream state, so an overflow strategy can be provided as an additional parameter (buffering is the default)
We can keep a reference to the FluxSink and use it to keep emitting elements as and when required, from multiple threads
Generate:
Accepts a Consumer<SynchronousSink<T>>
Consumer is invoked again and again based on the downstream demand
Consumer can emit at most one element per invocation, with an optional complete/error signal
Publisher produces elements based on the downstream demand
We can get a reference to the SynchronousSink, but it is not very useful, as only one element can be emitted per invocation
Check this blog for more details.
I am using a global unbounded stream in combination with stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described in the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combine all the outputs globally to find the minimum value. However, it is not possible to use a global combine operation on an unbounded stream, because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
.apply("StatefulTransform", ParDo.of(new StatefulTransform()))
.apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(100),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
.apply(Combine.globally(new MinLongFn()))
.apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints are only written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
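The concern can be illustrated with a plain-Java sketch. It says nothing about how Beam actually schedules panes. It only shows why a minimum taken over "whatever has been reported so far" can be higher than a safe checkpoint while some key is still silent. `CheckpointSketch` and the idea of a known set of expected keys are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;
import java.util.Set;

class CheckpointSketch {
    // Latest "lowest processed timestamp" reported by each key.
    private final Map<String, Long> lowestPerKey = new HashMap<>();

    void report(String key, long lowestProcessedTimestamp) {
        lowestPerKey.put(key, lowestProcessedTimestamp);
    }

    /** Minimum over the keys that have reported so far; silent keys are ignored. */
    OptionalLong minOfReported() {
        return lowestPerKey.values().stream().mapToLong(Long::longValue).min();
    }

    /** Minimum only when every expected key has reported at least once. */
    OptionalLong minIfAllReported(Set<String> expectedKeys) {
        return lowestPerKey.keySet().containsAll(expectedKeys)
            ? minOfReported()
            : OptionalLong.empty();
    }
}
```

If key k1 reports 100 while k2 (whose lowest timestamp is 40) has not fired yet, the first method yields 100, which would be too late a restart point. The second method withholds a value until k2 has reported.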
I'm new to Beam, but according to this blog post, https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, Splittable DoFn might be what you are looking for.
You could create an SDF that reads the stream and accepts the starting point as its input element.
I'm implementing a Dataflow pipeline that reads messages from Pubsub and writes TableRows into BigQuery (BQ) using Apache Beam SDK 2.0.0 for Java.
This is the related portion of code:
tableRowPCollection
.apply(BigQueryIO.writeTableRows().to(this.tableId)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
This code generates a group of tasks under the hood in the Dataflow pipeline. One of these tasks is the GroupByKey. This task is accumulating elements in the pipeline as can be seen in this print screen:
GBK elements accumulation image.
After reading the docs I suspect that this issue relates to the Window config. But I could not find a way of modifying the Window configuration since it is implicitly created by Window.Assign transform inside the Reshuffle task.
Is there a way of setting the window parameters and/or attaching triggers to this implicit Window or should I create my own DoFn that inserts a TableRow in BQ?
Thanks in advance!
[Update]
I left the pipeline running for approximately a day, after which the GroupByKey subtask became faster and the numbers of elements coming in and going out converged (sometimes they were the same). I also noticed that the watermark got closer to the current date and was advancing faster. So the "issue" was solved.
There isn't any waiting introduced by the Reshuffle in the BigQuery sink. Rather, it is used to create the batches of rows to write to BigQuery. The number of elements coming out of the GroupByKey is smaller because each output element represents a batch (or group) of input elements.
You should be able to see the total number of elements coming out as the output of the ExpandIterable (the output of the Reshuffle).
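The relationship between the element counts can be sketched in plain Java. Fixed-size batching is used here purely for illustration (the actual sink groups by a key, and `BatchingSketch`, `batch` and `expand` are hypothetical names). The point is only that grouping yields fewer elements and expanding the groups gives the original count back.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingSketch {
    /** Groups rows into batches: the output has one element per batch. */
    static <T> List<List<T>> batch(List<T> rows, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            int end = Math.min(i + batchSize, rows.size());
            batches.add(new ArrayList<>(rows.subList(i, end)));
        }
        return batches;
    }

    /** Flattens the batches again: the output has one element per original row. */
    static <T> List<T> expand(List<List<T>> batches) {
        List<T> rows = new ArrayList<>();
        for (List<T> b : batches) {
            rows.addAll(b);
        }
        return rows;
    }
}
```

Ten rows batched four at a time come out of the grouping step as three elements, and as ten elements again after expansion.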
I have a pipeline that looks like
pipeline.apply(PubsubIO.Read.subscription("some subscription"))
    .apply(Window.<String>into(SlidingWindows.of(Duration.standardMinutes(10))
            .every(Duration.standardSeconds(20)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(20)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(RemoveDuplicates.<String>create())
    .apply(Window.<String>discardingFiredPanes()) // this is suggested in the warnings under https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes
    .apply(Count.<String>globally().withoutDefaults())
This pipeline overcounts distinct values significantly (20x normal value). Initially, I was suspecting that the default trigger may have caused this issue. I have tweaked to use triggers that allow no lateness/discard fired panes/use processing time, all of which have similar overcount issues.
I've also tried ApproximateUnique.globally; it failed during pipeline construction with an exception that looks like:
Default values are not supported in Combine.globally() if the output PCollection is not windowed by GlobalWindows.
There seems to be no way to add withoutDefaults to it (as we did with Count.globally).
Is there a recommended way to do COUNT(DISTINCT) in dataflow/beam streaming pipeline with reasonable precision?
P.S. I'm using Java Dataflow SDK 1.9.0.
Your code looks OK; it shouldn't overcount. Note that you are placing each element into 30 windows, so if you have a window-unaware sink (equivalent to collapsing all the sliding windows) you would expect precisely 30 times as many elements. If you could show a bit more of the pipeline or how you are observing the counts, that might help.
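The factor of 30 follows from the window parameters: a 10-minute window that slides every 20 seconds covers each timestamp 600 / 20 = 30 times. A minimal plain-Java sketch of the assignment arithmetic (`SlidingWindowSketch` and `windowStarts` are hypothetical names, not the SDK's implementation):

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {
    /**
     * Returns the start times of every sliding window [start, start + sizeMs)
     * that contains timestampMs, for windows starting at multiples of periodMs.
     */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }
}
```

For example, windowStarts(1_000_005L, 600_000L, 20_000L) returns 30 start times, from 1,000,000 down to 420,000.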
Beyond that, I have a few suggestions for the pipeline:
I suggest changing your trigger for RemoveDuplicates to AfterPane.elementCountAtLeast(1); this will get you the same result at lower latency, since later elements arriving will have no impact. This trigger, and your current trigger, will never fire repeatedly. So it does not actually matter whether you set accumulatingFiredPanes() or discardingFiredPanes(). This is good, because neither one would work with the rest of your pipeline.
I'd install a new trigger prior to the Count. The reason is a bit technical, but I'll try to describe it:
In your current pipeline, the trigger installed there (the "continuation trigger" of the trigger for RemoveDuplicates) notes the arrival time of the first element and waits until it has received all elements that were produced at or before that processing time, as measured by the upstream worker. There is some nondeterminism because it puns the local processing time and the processing time of other workers.
If you take my advice and switch the trigger for RemoveDuplicates, then the continuation trigger will be AfterPane.elementCountAtLeast(1) so it will always emit a count as soon as possible and then discard further data, which is very wrong.