I take it that watermarks can be adjusted in two ways:
by emitting them via SourceContext.emitWatermark() in the data source
by wiring a WatermarkStrategy onto the DataStream
If I am already emitting watermarks in the data source and then wire a new WatermarkStrategy after the source operator, will the first watermarks be replaced by the ones from the later watermark strategy?
Essentially I'm in a situation where I don't have control over the source events/data source, but I need to adjust the watermarks later on.
You can add a WatermarkStrategy at any point in a Flink pipeline. A downstream watermark generator will eat any incoming watermarks -- the only watermarks it emits will be those it generates. Moreover, it's not necessary that the source generate watermarks (though it is preferable).
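In Flink's Java API, overriding watermarks downstream might be sketched as follows. The `Event` type, its `timestamp` field, and `ThirdPartySource` are placeholders, not from the question; `assignTimestampsAndWatermarks` and `WatermarkStrategy.forBoundedOutOfOrderness` are the standard entry points:

```java
// Sketch: overriding upstream watermarks with a new strategy downstream.
// `Event`, `timestamp`, and `ThirdPartySource` are hypothetical placeholders.
DataStream<Event> source = env.addSource(new ThirdPartySource());

DataStream<Event> reWatermarked = source.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        .withTimestampAssigner((event, previousTimestamp) -> event.timestamp));
// From this point on, only the watermarks generated by this strategy are
// emitted downstream; any watermarks arriving from the source are absorbed.
```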
Related
I just read this article
https://medium.com/bb-tutorials-and-thoughts/how-to-create-a-streaming-job-on-gcp-dataflow-a71b9a28e432
What I am truly missing here though is: if I drop 50 files and this is a streaming job like the article says (always live), then won't the output be a windowed join of all the files?
If not, what would it look like, and how would it change to be a windowed join? I am trying to get a picture in my head of both worlds:
A windowed join in a streaming job (outputting 1 file for all the input files)
A non-windowed join in a streaming job (outputting 1 file per input file)
Can anyone shed light on that article and what would change?
I also read something about 'bounded PCollections'. In that case, perhaps windowing is not needed, as inside the stream it is sort of like a batch: until the entire PCollection is processed, we do not move to the next stage? Perhaps if the article is using bounded PCollections, then all input files map 1-to-1 with output files?
How can one tell from inside a function whether I am receiving data from a bounded or unbounded collection? Is there some other way I can tell? Are bounded collections even possible in an Apache Beam streaming job?
I'll try to answer some of your questions.
What I am truly missing here though is if I drop 50 files and this is
a streaming job like the article says(always live), then won't the
output be a windowed join of all the files?
Input (source) and output (sink) are not directly linked, so this depends on what you do in your pipeline. TextIO.watchForNewFiles is a streaming source transform that keeps observing a given file location, reading new files, and outputting the lines read from them. Hence the output of this step will be a PCollection<String> that streams lines of text read from those files.
Windowing is set next; this decides how your data will be bundled into windows. For this pipeline, they chose FixedWindows of one minute. The timestamp will be the time the file was observed.
A sink transform is applied at the end of your pipeline (sometimes sinks also produce outputs, so it might not really be the end). In this case they chose TextIO.write(), which writes lines of Strings from an input PCollection<String> to output text files.
So whether the output will include data from all input files or not depends on how your input files are processed and how they are bundled into windows within the pipeline.
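The pipeline shape described above might be sketched like this (the bucket paths, poll interval, and shard count are illustrative, not taken from the article):

```java
// Sketch: watch a location for new files, window lines into 1-minute
// fixed windows, and write windowed text output. Paths are illustrative.
Pipeline p = Pipeline.create(options);

p.apply("ReadNewFiles", TextIO.read()
        .from("gs://my-bucket/input/*.csv")              // illustrative path
        .watchForNewFiles(Duration.standardSeconds(30),  // poll interval
                          Watch.Growth.never()))         // keep watching forever
 .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
 .apply("WriteWindowedFiles", TextIO.write()
        .to("gs://my-bucket/output/result")              // illustrative path
        .withWindowedWrites()
        .withNumShards(1));

p.run();
```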
I also read something about 'Bounded PCollections'. In that case,
perhaps windowing is not needed as inside the stream it is sort of
like a batch of until we have the entire Pcollection processed, we do
not move to the next stage? Perhaps if the article is using bounded
pcollcation, then all input files map 1 to 1 with output files?
You could use bounded inputs in a streaming pipeline. In a streaming pipeline, the progression is tracked through a watermark function. If you use a bounded input (for example, a bounded source) the watermark will just go from 0 to infinity instead of progressing gradually. Hence your pipeline might just end instead of waiting for more data.
How can one tell from inside a function if I am receiving data from a
bounded or unbounded collection? Is there some other way I can tell
that? Is bounded collections even possible in apache beam streaming
job?
It is definitely possible as I mentioned above. If you have access to the input PCollection, you can use the isBounded function to determine if it is bounded. See here for an example. You have access to input PCollections when expanding PTransforms (hence during job submission). I don't believe you have access to this at runtime.
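A minimal sketch of checking boundedness while expanding a PTransform; the transform and DoFn names are made up, while PCollection.isBounded() and the IsBounded enum are the real accessors:

```java
// Sketch: inspecting boundedness at pipeline-construction (expansion) time.
// `MyTransform` and `SomeDoFn` are hypothetical names.
public class MyTransform extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> expand(PCollection<String> input) {
    if (input.isBounded() == PCollection.IsBounded.BOUNDED) {
      // Bounded input: the watermark will jump to infinity when input ends.
    }
    return input.apply(ParDo.of(new SomeDoFn()));
  }
}
```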
I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline
    .apply(Read.from(new SomeUnboundedSource(...)))  // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead, all N events are generated first, and only after that is SomeTransform applied to the data (though the windowing works as expected). Is it supposed to work like this? Do Beam and/or the runner (I'm using the Flink runner, but the Direct runner seems to exhibit the same behavior) have some sort of queue where they store items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data does not prescribe how the data should be processed, only how the results of aggregations over that data are grouped. Further, windows only take effect once an aggregation is reached, like your Combine.globally. Until that point in your pipeline, the windowing you applied has no effect.
As to whether it is supposed to work that way, the beam model doesn't specify any specific processing behavior so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to stream in elements slowly over a long period of time you will likely see more processing in between reading from the source.
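To illustrate the event-time point: window membership depends only on each element's timestamp, not on when it happens to be processed. Here is a self-contained sketch of the fixed-window assignment arithmetic (plain Java, no Beam dependency; 5-second windows as in the pipeline above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowDemo {
    // Assigns an event timestamp (millis) to its fixed window by
    // truncating the timestamp down to the window start boundary.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long size = 5000; // 5-second windows, as in the pipeline above
        // Events may arrive in any order; only the event timestamp matters.
        long[] eventTimes = {1200, 6100, 4999, 5000, 9999};
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long t : eventTimes) {
            windows.computeIfAbsent(windowStart(t, size), k -> new ArrayList<>()).add(t);
        }
        System.out.println(windows);
        // {0=[1200, 4999], 5000=[6100, 5000, 9999]}
    }
}
```

Whether the runner drains all N source elements before running SomeTransform or interleaves reading and processing, each element lands in the same window, so the Combine.globally results per window are identical.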
I am using a global unbounded stream in combination with Stateful processing and timers in order to totally order a stream per key by event timestamp. The solution is described with the answer to this question:
Processing Total Ordering of Events By Key using Apache Beam
In order to restart the pipeline after a failure or stopping for some other reason, I need to determine the lowest event timestamp at which we are guaranteed that all other events have been processed downstream. This timestamp can be calculated periodically and persisted to a datastore and used as the input to the source IO (Kinesis) so that the stream can be re-read without having to go back to the beginning. (It is ok for us to have events replayed)
I considered having the stateful transformation emit the lowest processed timestamp as the output when the timer triggers and then combining all the outputs globally to find the minimum value. However, it is not possible to use a global combine operation, because either a Window or a Trigger must be applied first.
Assuming that my stateful transform emits a Long when the timer fires which represents the smallest timestamp, I am defining the pipeline like this:
p.apply(events)
    .apply("StatefulTransform", ParDo.of(new StatefulTransform()))
    .apply(Window.<Long>configure().triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(100),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))))
    .apply(Combine.globally(new MinLongFn()))
    .apply("WriteCheckpoint", ParDo.of(new WriteCheckpoint()));
Will this ensure that the checkpoints will only be written when all of the parallel workers have emitted at least one of their panes? I am concerned that the combine operation may operate on panes from only some of the workers, e.g. there may be a worker that has either failed or is still waiting for another event to trigger its timer.
I'm a newbie to Beam, but according to this blog post https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html, a Splittable DoFn might be the thing you are looking for!
You could create an SDF that fetches the stream and accepts the input element as the start point.
I have been reading the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.
This page:
https://cloud.google.com/dataflow/model/windowing
indicates that if you use the default window/trigger strategies, then late data will be discarded:
Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.
Yet this page:
https://cloud.google.com/dataflow/model/triggers
indicates that late data will be emitted as a single element PCollection when it arrives late:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.
So, will late data past the watermark be discarded completely? Or, will it only not be emitted with the other data it would have been windowed with had it arrived in time, and be emitted on its own instead?
The default "windowing and trigger strategies" discard late data. The WindowingStrategy is an object which consists of windowing, triggering, and a few other parameters such as allowed lateness. The default allowed lateness is 0, so any late data elements are discarded.
The default trigger handles late data. If you take the default WindowingStrategy and change only the allowed lateness, then you will receive a PCollection which contains one output pane for all of the on time data, and then a new output pane for approximately every late element.
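A sketch of changing only the allowed lateness on an otherwise default strategy; the window size, lateness duration, and `MyEvent` type are illustrative:

```java
// Sketch: keep the default trigger but allow late data for up to 2 hours.
// The on-time pane fires when the watermark passes the end of the window;
// each late element within the lateness horizon then produces a late pane.
// `MyEvent` is a hypothetical element type.
input.apply(Window.<MyEvent>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardHours(2))
        .accumulatingFiredPanes())
     .apply(Count.perElement());
```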
A number of examples show aggregation over windows of an unbounded stream, but suppose we need to get a count-per-key of the entire stream seen up to some point in time. (Think word count that emits totals for everything seen so far rather than totals for each window.)
It seems like this could be a Combine.perKey and a trigger to emit panes at some interval. In this case the window is essentially global, and we emit panes for that same window throughout the life of the job. Is this safe/reasonable, or perhaps there is another way to compute a rolling, total aggregate?
Ryan, your solution of using a global window and a periodic trigger is the recommended approach. Just make sure you use accumulating mode on the trigger and not discarding mode. The Triggers page should have more information.
Let us know if you need additional help.
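The recommended setup above might be sketched as follows; the trigger interval and element type are illustrative, and the key points are GlobalWindows plus a repeated trigger with accumulatingFiredPanes():

```java
// Sketch: running totals over the whole stream, emitted periodically.
// accumulatingFiredPanes() makes each pane contain the full count so far,
// rather than only the delta since the last firing (discarding mode).
words.apply(Window.<String>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
     .apply(Count.perElement());
```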