Apache beam: wait for N minutes before processing element - google-cloud-dataflow

We use the Python Beam SDK with GCP's Dataflow. Our pipeline depends on an external system that we know has some delay. How can I write a pipeline that waits for N minutes (where N is a constant I provide when launching the job)?
Something like
pubsub -> (sleep for 1 minute) -> read data from external system
My understanding of "FixedWindow" is that it groups data into timeframes, so with a 60-second fixed window I can achieve an "up to 60 seconds" delay, but what I want here is a constant 60-second delay for all incoming data.

Since the windowing question was answered by Kenn Knowles, allow me to answer the other half.
I think you could use Stateful and Timely Processing with a one-minute Timer for every element.
Bear in mind that Timers are applied per key, so each key would need to be unique in order for this to work. I made this code sample so you can test it, reading from the topic projects/pubsub-public-data/topics/taxirides-realtime.
p
    .apply("Read From PubSub", PubsubIO.readStrings().fromTopic(options.getTopic()))
    .apply("Parse and to KV", ParDo.of(new DoFn<String, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws ParseException {
            JSONObject json = new JSONObject(c.element());
            String rideStatus = json.getString("ride_status");
            // ride_id is unique for dropoff
            String rideId = json.getString("ride_id"); // this is the session
            if (rideStatus.equals("dropoff")) {
                c.output(KV.of(rideId, "value"));
            }
        }
    }))
    // Stateful DoFns need a KV as input
    .apply("Timer", ParDo.of(new DoFn<KV<String, String>, String>() {
        private final Duration BUFFER_TIME = Duration.standardSeconds(60);

        @TimerId("timer")
        private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

        @StateId("buffer")
        // Elements will be saved here, with type String
        private final StateSpec<BagState<String>> bufferedEvents = StateSpecs.bag();

        @ProcessElement
        public void processElement(ProcessContext c,
                                   @TimerId("timer") Timer timer,
                                   @StateId("buffer") BagState<String> buffer) throws ParseException {
            // keys are unique, so no need to use a counter to guard the timer offset
            timer.offset(BUFFER_TIME).setRelative();
            buffer.add(c.element().getKey()); // add the unique id to the buffer
            LOG.info("TIMER: Adding " + c.element().getKey() +
                     " to buffer at " + Instant.now().toString());
        }

        // This method is called when the timer expires
        @OnTimer("timer")
        public void onTimer(
                OnTimerContext c,
                @StateId("buffer") BagState<String> buffer) throws IOException {
            for (String id : buffer.read()) { // there should be only one since the key is unique
                LOG.info("TIMER: Releasing " + id +
                         " from buffer at " + Instant.now().toString());
                c.output(id);
            }
            buffer.clear(); // clearing the buffer
        }
    }));
This was just a quick test, so there are probably things to improve in the code.
I am not sure, though, how this would perform with a lot of elements, since you are caching every element for one minute with its own timer. I'm currently running this pipeline in Dataflow and so far so good; I will update this if something weird happens.
The advantage of this over using sleeps is that a sleep has to be waited out for every single element in the bundle, while the timers wait in parallel. The possible disadvantage is extra shuffle, but I haven't tested this enough to be sure.
Note that in "normal" Stateful DoFns (1) keys are not expected to be unique, so more than one element may end up in the bag, and (2) a counter (or similar) is needed to know whether the timer has already been set; we didn't need one here since the keys are unique. A sketch of that more general pattern follows below.
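For reference, a minimal sketch of that more general pattern, assuming a keyed input where keys may repeat (keyedInput, the step name and the 60-second buffer are illustrative placeholders, not part of the answer above):

keyedInput // a PCollection<KV<String, String>> whose keys may repeat (placeholder)
    .apply("BufferPerKey", ParDo.of(new DoFn<KV<String, String>, String>() {
        private final Duration BUFFER_TIME = Duration.standardSeconds(60);

        @TimerId("timer")
        private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

        @StateId("buffer")
        private final StateSpec<BagState<String>> bufferedEvents = StateSpecs.bag();

        @StateId("count")
        private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

        @ProcessElement
        public void processElement(ProcessContext c,
                                   @TimerId("timer") Timer timer,
                                   @StateId("buffer") BagState<String> buffer,
                                   @StateId("count") ValueState<Long> count) {
            Long seen = count.read();
            if (seen == null || seen == 0L) {
                // first element for this key: set the timer once
                timer.offset(BUFFER_TIME).setRelative();
            }
            count.write(seen == null ? 1L : seen + 1L);
            buffer.add(c.element().getValue()); // several elements may share the key
        }

        @OnTimer("timer")
        public void onTimer(OnTimerContext c,
                            @StateId("buffer") BagState<String> buffer,
                            @StateId("count") ValueState<Long> count) {
            for (String value : buffer.read()) {
                c.output(value); // release everything buffered for this key
            }
            buffer.clear();
            count.clear();
        }
    }));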
Here you have a screenshot of the pipeline working

FixedWindows does not introduce any delay.
In Beam, windowing groups elements according to their timestamps. This is separate from when the elements arrive.
The PubsubIO transform maintains a "watermark" that tracks the timestamps of the elements still remaining in the Pubsub queue, so the watermark will lag real time by 1 minute.
If the pubsub topic becomes empty for a long time, the watermark will sync up with real time. So in that case you may need to allow late data in your pipeline.
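For illustration, allowing late data could look roughly like the sketch below on a windowed collection (the `input` collection, the trigger, and the 10-minute lateness bound are placeholders, not values from the question):

input
    .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(60)))
        // re-fire the pane whenever an element arrives after the watermark has
        // passed the end of the window, for up to 10 minutes
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))
        .accumulatingFiredPanes());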

Related

What is the proper way of increasing the watermark when using a joinFunction in Flink?

I have two streams, stream A and stream B. Both streams contain the same type of event, which has an ID and a timestamp. For now, all I want the Flink job to do is join the events that have the same ID inside a window of 1 minute. The watermark is assigned per event.
sourceA = initialSourceA.map(parseToEvent)
sourceB = initialSourceB.map(parseToEvent)

streamA = sourceA
    .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
    .keyBy(Event.Key)
streamB = sourceB
    .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
    .keyBy(Event.Key)

streamA
    .join(streamB)
    .where(Event.Key)
    .equalTo(Event.Key)
    .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
    .apply(giveMePairOfEvents)
    .print()
Inside my test I try to send the following:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
For parallelism = 1, I can see the events from time 0 getting joined together.
However, for parallelism = 2 the print does not display anything getting joined. To figure out the problem, I tried printing the events after the keyBy of each stream, and I can see they all run on the same instance. Placing the print after the watermarking shows, for obvious reasons, that the events are still on different instances.
This leads me to believe that I am somehow doing something incorrect with the watermarking, since for a parallelism higher than 1 it doesn't advance the watermark. So here are a couple of questions I asked myself:
Is it possible that each event has a separate watermark generator and I have to advance them specifically?
Should I run keyBy first and then assign watermarks, so that the events from each stream use the same watermark generator?
Sending another set of events as follows:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
sourceA.send(Event(ID_1, 122 seconds))
sourceB.send(Event(ID_1, 122 seconds))
This ended up producing the join of the first events. Further inspection showed that the third set of events used the same watermark generator that the second set didn't use, and I am not very clear on why that happens. How can I assign and advance watermarks correctly when using a join function in Flink?
EDIT 1:
The custom watermark generator:
class CustomWaterMarkGenerator(
    private val maxOutOfOrderness: Long,
    private var currentMaxTimeStamp: Long = 0,
) : WatermarkGenerator<EventType> {

    override fun onEvent(event: EventType, eventTimestamp: Long, output: WatermarkOutput) {
        val a = currentMaxTimeStamp.coerceAtLeast(eventTimestamp)
        currentMaxTimeStamp = a
        output.emitWatermark(Watermark(currentMaxTimeStamp - maxOutOfOrderness - 1))
    }

    override fun onPeriodicEmit(output: WatermarkOutput?) {
    }
}
The watermark strategy:
class CustomWatermarkStrategy : WatermarkStrategy<Event> {

    override fun createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context?): WatermarkGenerator<Event> {
        return CustomWaterMarkGenerator(0)
    }

    override fun createTimestampAssigner(context: TimestampAssignerSupplier.Context?): TimestampAssigner<Event> {
        return TimestampAssigner { event: Event, _: Long ->
            event.timestamp
        }
    }
}
Custom source:
The sourceFunction is currently an RSocket connection that connects to a mock stream where I can send events through mockStream.send(event). The first thing I do with the events is parse them using a map function (from string into my event type), and then I assign my watermarks etc.
Each parallel instance of the watermark generator will operate independently, based solely on the events it observes. Doing the watermarking immediately after the sources makes sense (although even better, in general, is to do watermarking directly in the sources).
An operator with multiple input channels (such as the keyed windowed join in your application) sets its current watermark to the minimum of the watermarks it has received from its active input channels. This has the effect that any idle source instances will cause the watermarks to stall in downstream tasks -- unless those sources explicitly mark themselves as idle. (And FLINK-18934 meant that prior to Flink 1.14 idleness propagation didn't work correctly with joins.) An idle source is a likely suspect in your situation.
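As an illustration (not taken from the question's code), Flink's built-in bounded-out-of-orderness strategy can declare a source idle so it stops holding back downstream watermarks; the 30-second idleness value and the Event field access below are placeholder assumptions:

WatermarkStrategy<Event> strategy =
    WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(0))
        // use the event's own timestamp, as in the custom strategy above
        .withTimestampAssigner((event, recordTimestamp) -> event.timestamp)
        // declare a source subtask idle after 30 seconds without events,
        // so it no longer stalls the join's watermark
        .withIdleness(Duration.ofSeconds(30));

DataStream<Event> streamA = sourceA.assignTimestampsAndWatermarks(strategy);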
One strategy for debugging this sort of problem is to bring up the Flink WebUI and observe the behavior of the current watermark in all of the tasks.
To get more help, please share the rest of the application, or at least the custom source and watermark strategy.

How to limit PCollection in Apache Beam as soon as possible?

I'm using Apache Beam 2.28.0 on Google Cloud DataFlow (with Scio SDK). I have a large input PCollection (bounded) and I want to limit / sample it to a fixed number of elements, but I want to start the downstream processing as soon as possible.
Currently, when my input PCollection has e.g. 20M elements and I want to limit it to 1M by using https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/Sample.html#any-long-
input.apply(Sample.<String>any(1000000))
it waits until all of the 20M elements are read, which takes a long time.
How can I efficiently limit the number of elements to a fixed size and start downstream processing as soon as the limit is reached, discarding the rest of the input?
OK, so my initial solution for this is to use a Stateful DoFn like this (I'm using Scio's Scala SDK, as mentioned in the question):
import java.lang.{Long => JLong}

class MyLimitFn[T](limit: Long) extends DoFn[KV[String, T], KV[String, T]] {
  @StateId("count") private val count = StateSpecs.value[JLong]()

  @ProcessElement
  def processElement(context: DoFn[KV[String, T], KV[String, T]]#ProcessContext,
                     @StateId("count") count: ValueState[JLong]): Unit = {
    // the state is empty (null) for the first element of each key
    val current: Long = Option(count.read()).map(Long2long).getOrElse(0L)
    if (current < limit) {
      count.write(current + 1L)
      context.output(context.element())
    }
  }
}
The downside of this solution is that I need to synthetically add the same key (e.g. an empty string) to all elements before using it (see the sketch below). So far, it's much faster than Sample.<>any().
I'm still looking forward to better / more efficient solutions.
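For illustration, the synthetic-key step could look roughly like this in Beam's Java SDK (a sketch only; `input` and `JavaLimitFn`, a hypothetical Java port of the DoFn above, are not from the answer):

// Give every element the same key so the stateful limiting DoFn counts
// globally, then drop the key again afterwards.
PCollection<String> limited = input
    .apply("SingleKey", WithKeys.of(""))
    .apply("LimitTo1M", ParDo.of(new JavaLimitFn<String>(1000000L)))
    .apply("DropKey", Values.create());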

Dataflow + OutOfMemoryError

I have 3 GroupBys in my Java pipeline, and after each GroupBy the program runs some computation between stages. The groups become larger and larger blocks; the only thing the program adds is a new key to each block.
The last GroupBy deals with a smaller number of large blocks. The pipeline works for a small number of items, of course, but it fails at the second or third GroupBy for a large number of items.
I played with Xms and Xmx and even chose much larger instances ('n1-standard-64'), but it didn't work. For the failed example, I'm sure the output is smaller than 5 GB, so is there any other way I can control memory in Dataflow per map/reduce task?
If Dataflow can handle the first GroupBy, it should be able to reduce the number of tasks, allocate more heap memory, and handle the large blocks in the next stage.
Any suggestion will be appreciated!
UPDATE:
.apply(ParDo.named("Sort Bins").of(
    new DoFn<KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>,
             KV<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>>() {
      @Override
      public void processElement(ProcessContext c) {
        KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>> e = c.element();
        Integer Secondary_key = e.getKey();
        ArrayList<KV<Long, Iterable<TableRow>>> records = Lists.newArrayList(e.getValue()); // Get a modifiable list.
        Collections.sort(records, BinID_COMPARATOR);
        Integer Primary_key = ...; // a simple function of the secondary key
        c.output(KV.of(Primary_key, KV.of(Secondary_key,
            (Iterable<KV<Long, Iterable<TableRow>>>) records)));
      }
    }));
Error reported for the last line (c.output).

Joining two streams

Is it possible to join two separate unbounded PubsubIO PCollections using a key present in both of them? I'm trying to accomplish the task with something like:
Read(FirstStream) & Read(SecondStream) -> Flatten -> Generate key to use in joining -> Use Session Windowing to gather them together -> Group by key, then rewindow with fixed-size windows -> AvroIO write to disk using windowing.
EDIT:
Here is the pipeline code I created. I experience two problems:
Nothing gets written to disk.
The pipeline becomes really unstable - it randomly slows down processing of certain steps, especially the group-by, and it's not able to keep up with the ingestion speed even when I use 10 Dataflow workers.
I need to handle ~10,000 sessions a second. Each session comprises 1 or 2 events and then needs to be closed.
PubsubIO.Read<String> auctionFinishedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
    .fromTopic("projects/authentic-genre-152513/topics/auction_finished");
PubsubIO.Read<String> auctionAcceptedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
    .fromTopic("projects/authentic-genre-152513/topics/auction_accepted");

PCollection<String> auctionFinishedStream = p.apply("ReadAuctionFinished", auctionFinishedReader);
PCollection<String> auctionAcceptedStream = p.apply("ReadAuctionAccepted", auctionAcceptedReader);

PCollection<String> combinedEvents = PCollectionList.of(auctionFinishedStream)
    .and(auctionAcceptedStream).apply(Flatten.pCollections());

PCollection<KV<String, String>> keyedAuctionFinishedStream = combinedEvents
    .apply("AddKeysToAuctionFinished", WithKeys.of(new GenerateKeyForEvent()));

PCollection<KV<String, Iterable<String>>> sessions = keyedAuctionFinishedStream
    .apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
        .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
    .apply(GroupByKey.create());

PCollection<SodaSession> values = sessions
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, SodaSession>() {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        c.output(new SodaSession("auctionid", "stattedat"));
      }
    }));

PCollection<SodaSession> windowedEventStream = values
    .apply("ApplyWindowing", Window.<SodaSession>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());

AvroIO.Write<SodaSession> avroWriter = AvroIO
    .write(SodaSession.class)
    .to("gs://storage/")
    .withWindowedWrites()
    .withFilenamePolicy(new EventsToGCS.PerWindowFiles("sessionsoda"))
    .withNumShards(3);

windowedEventStream.apply("WriteToDisk", avroWriter);
I've found an efficient solution. As one of my collections was disproportionate in size compared to the other one, I used a side input to speed up the grouping operation. Here is an overview of my solution:
1. Read both event streams.
2. Flatten them into a single PCollection.
3. Use a sliding window sized (closable session duration + session max length), sliding every closable session duration.
4. Partition the collections again.
5. Create a PCollectionView from the smaller PCollection.
6. Join both streams using a sideInput with the view created in the previous step (a sketch of this step follows below).
7. Write sessions to disk.
It handles joining a 4,000 events/sec stream (the larger one) with a 60 events/sec stream on 1-2 Dataflow workers, versus ~15 workers when using Session windowing along with GroupBy.
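A rough sketch of steps 5-6, assuming both collections share the same windowing and a String join key (the collection names and value types are placeholders, not the author's code):

// Turn the smaller stream into a multimap side input keyed by the join key,
// then look up matches while iterating over the larger stream.
final PCollectionView<Map<String, Iterable<String>>> smallSide =
    smallStream.apply("SmallStreamAsSideInput", View.<String, String>asMultimap());

PCollection<KV<String, String>> joined = largeStream
    .apply("JoinViaSideInput", ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Map<String, Iterable<String>> lookup = c.sideInput(smallSide);
            Iterable<String> matches = lookup.get(c.element().getKey());
            if (matches != null) {
                for (String match : matches) {
                    c.output(KV.of(c.element().getKey(), match));
                }
            }
        }
    }).withSideInputs(smallSide));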

Using a FixedWindow of an unbounded data source as side input for a ParDo?

I am reading data (GPS coordinates with timestamps) from an unbounded Pub/Sub data source and need to calculate the distance between all those points. My idea is to have, let's say, 1-minute windows and do a ParDo with the whole collection as a side input, where I use the side input to look up the next point and calculate the distance inside the ParDo.
When I run the pipeline, I can see that the View.asList step is not producing any output, and CalcDistance never produces any output either. Are there any examples of how to use a FixedWindows collection as a side input? (picture of pipeline)
Pipeline:
PCollection<Timepoint> inputWindow = pipeline.apply(PubsubIO.Read.topic(""))
    .apply(ParDo.of(new ExtractTimestamps()))
    .apply(Window.<Timepoint>into(FixedWindows.of(Duration.standardMinutes(1))));

final PCollectionView<List<Timepoint>> SideInputWindowed =
    inputWindow.apply(View.<Timepoint>asList());

inputWindow.apply(ParDo.named("Add Timestamp " + teams[i]).of(new AddTimeStampAsKey()))
    .apply(ParDo.of(new CalcDistanceTest(SideInputWindowed)).withSideInputs(SideInputWindowed));
ParDo:
static class CalcDistance extends DoFn<KV<String, Timepoint>, Timepoint> {
    private final PCollectionView<List<Timepoint>> pCollectionView;

    public CalcDistance(PCollectionView pCollectionView) {
        this.pCollectionView = pCollectionView;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        LOG.info("starting to calculate distance");
        Timepoint input = c.element().getValue();
        // doing distance calculation
        c.output(input);
    }
}
The overall issue is that the timestamp of the element is not known by Dataflow when reading from Pubsub, since for your use case it's an attribute of the data.
You'll want to ensure that when reading from Pubsub, you ingest the records with a timestamp label as discussed here.
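With the current Beam Java API, that would look roughly like the sketch below (the attribute name "ts" and the topic are placeholders; the older SDK used in the question exposes an equivalent timestamp-label option on PubsubIO.Read):

// Tell PubsubIO which message attribute carries the event timestamp, so the
// watermark is derived from it rather than from the publish time.
PCollection<String> points = pipeline.apply("ReadWithEventTimestamps",
    PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/gps-points") // placeholder topic
        .withTimestampAttribute("ts"));                      // placeholder attribute name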
Finally, the GameStats example uses a side input to find spammy users. In your case, instead of calculating a globalMeanScore per window, you'll just place all your Timepoints into the side input.
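Along those lines, reading the windowed side input inside the DoFn might look like the sketch below (assuming the main input and the View.asList view use the same FixedWindows; written with the current @ProcessElement style rather than the older SDK used in the question):

static class CalcDistance extends DoFn<KV<String, Timepoint>, Timepoint> {
    private final PCollectionView<List<Timepoint>> allPointsView;

    CalcDistance(PCollectionView<List<Timepoint>> allPointsView) {
        this.allPointsView = allPointsView;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        Timepoint current = c.element().getValue();
        // the side input is resolved against the same window as the element
        List<Timepoint> pointsInWindow = c.sideInput(allPointsView);
        for (Timepoint other : pointsInWindow) {
            // distance(current, other) left as a placeholder
        }
        c.output(current);
    }
}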
