Joining two streams - google-cloud-dataflow

Is it possible to join two separate PubsubIO unbounded PCollections using a key present in both of them? I am trying to accomplish the task with something like:
Read(FirstStream) & Read(SecondStream) -> Flatten -> Generate key to use in joining -> Use session windowing to gather them together -> Group by key, then rewindow with fixed-size windows -> AvroIO write to disk using windowing.
EDIT:
Here is the pipeline code I created. I experience two problems:
Nothing gets written to disk.
The pipeline becomes really unstable: it randomly slows down the processing of certain steps, especially the group by. It's not able to keep up with the ingestion speed even when I use 10 Dataflow workers.
I need to handle ~10,000 sessions a second. Each session consists of 1 or 2 events and then needs to be closed.
PubsubIO.Read<String> auctionFinishedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_finished");
PubsubIO.Read<String> auctionAcceptedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_accepted");
PCollection<String> auctionFinishedStream = p.apply("ReadAuctionFinished", auctionFinishedReader);
PCollection<String> auctionAcceptedStream = p.apply("ReadAuctionAccepted", auctionAcceptedReader);
PCollection<String> combinedEvents = PCollectionList.of(auctionFinishedStream)
.and(auctionAcceptedStream).apply(Flatten.pCollections());
PCollection<KV<String, String>> keyedAuctionFinishedStream = combinedEvents
.apply("AddKeysToAuctionFinished", WithKeys.of(new GenerateKeyForEvent()));
PCollection<KV<String, Iterable<String>>> sessions = keyedAuctionFinishedStream
.apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(GroupByKey.create());
PCollection<SodaSession> values = sessions
.apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, SodaSession> () {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
c.output(new SodaSession("auctionid", "stattedat"));
}
}));
PCollection<SodaSession> windowedEventStream = values
.apply("ApplyWindowing", Window.<SodaSession>into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))
))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
);
AvroIO.Write<SodaSession> avroWriter = AvroIO
.write(SodaSession.class)
.to("gs://storage/")
.withWindowedWrites()
.withFilenamePolicy(new EventsToGCS.PerWindowFiles("sessionsoda"))
.withNumShards(3);
windowedEventStream.apply("WriteToDisk", avroWriter);

I've found an efficient solution. Because one of my collections was disproportionately larger than the other, I used a side input to speed up the grouping operation. Here is an overview of my solution:
Read both event streams.
Flatten them into single PCollection.
Use a sliding window whose size is (closable session duration + maximum session length) and whose period is the closable session duration.
Partition collections again.
Create PCollectionView from smaller PCollection.
Join both streams using sideInput with the view created in the previous step.
Write sessions to disk.
It handles joining a 4000 events/sec stream (the larger one) with a 60 events/sec stream on 1-2 Dataflow workers, versus ~15 workers when I used session windowing along with GroupByKey.
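To make the window sizing in the third step concrete, here is a minimal plain-Java sketch (no Beam dependency) of how a sliding-window scheme assigns one timestamp to several overlapping windows. The class and method names are hypothetical, and the 60-second closable session duration and 60-second maximum session length are assumed values, not figures taken from the pipeline above.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    // Start times (in ms) of every window [start, start + sizeMs) that contains
    // timestampMs, for windows that begin at multiples of periodMs. Newest first.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp
        long lastStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        // Walk backwards while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long gapMs = 60_000L;        // assumed closable session duration
        long maxLengthMs = 60_000L;  // assumed maximum session length
        long sizeMs = gapMs + maxLengthMs;
        System.out.println(windowStarts(130_000L, sizeMs, gapMs));
    }
}
```

With these assumed values, an event at 130 s falls into the windows starting at 120 s and 60 s, so every event is seen together with the events of the preceding period. That overlap is what lets the side-input lookup find both halves of a session even when they straddle a window boundary.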

Related

How to limit PCollection in Apache Beam as soon as possible?

I'm using Apache Beam 2.28.0 on Google Cloud DataFlow (with Scio SDK). I have a large input PCollection (bounded) and I want to limit / sample it to a fixed number of elements, but I want to start the downstream processing as soon as possible.
Currently, when my input PCollection has e.g. 20M elements and I want to limit it to 1M by using https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/Sample.html#any-long-
input.apply(Sample.<String>any(1000000))
it waits until all of the 20M elements are read, which takes a long time.
How can I efficiently limit the number of elements to a fixed size and start downstream processing as soon as the limit is reached, discarding the rest of the input?
OK, so my initial solution for that is to use Stateful DoFn like this (I'm using Scio's Scala SDK as mentioned in the question):
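import java.lang.{Long => JLong}
import org.apache.beam.sdk.state.{StateSpecs, ValueState}
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, StateId}
import org.apache.beam.sdk.values.KV

class MyLimitFn[T](limit: Long) extends DoFn[KV[String, T], KV[String, T]] {
  // Per-key counter of the elements emitted so far
  @StateId("count") private val count = StateSpecs.value[JLong]()

  @ProcessElement
  def processElement(
      context: DoFn[KV[String, T], KV[String, T]]#ProcessContext,
      @StateId("count") count: ValueState[JLong]): Unit = {
    // read() returns null until the first write, so treat null as 0
    val current = Option(count.read()).map(_.longValue).getOrElse(0L)
    if (current < limit) {
      count.write(current + 1L)
      context.output(context.element())
    }
  }
}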
The downside of this solution is that I need to synthetically add the same key (e.g. an empty string) to all elements before using it. So far, it's much faster than Sample.<>any().
I still look forward to seeing better / more efficient solutions.

Apache beam: wait for N minutes before processing element

We use the Python Beam SDK with GCP's Dataflow. Our pipeline depends on an external system that we know has some delay. How can I write a pipeline that waits for N minutes (where N is a constant I provide when launching the job)?
Something like
pubsub -> (sleep for 1 minutes) -> read data from external system
My understanding of "FixedWindow" is that it groups data into time frames, so with a 60-second fixed window I can achieve a delay of "up to 60 seconds", but what I want here is a constant 60-second delay for all incoming data.
Since the Window question was answered by @Kenn Knowles, allow me to answer the other half.
I think you could use Stateful and Timely processing and use a Timer of one minute for every element.
Bear in mind that Timers are applied per key, so each key would need to be unique for this to work. I made this code sample so you can test this, reading from the topic projects/pubsub-public-data/topics/taxirides-realtime.
p
    .apply("Read From PubSub", PubsubIO.readStrings().fromTopic(options.getTopic()))
    .apply("Parse and to KV", ParDo.of(new DoFn<String, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws ParseException {
            JSONObject json = new JSONObject(c.element());
            String rideStatus = json.getString("ride_status");
            // ride_id is unique for dropoff
            String rideId = json.getString("ride_id"); // this is the session
            if (rideStatus.equals("dropoff")) {
                c.output(KV.of(rideId, "value"));
            }
        }
    }))
    // A stateful DoFn needs a KV as input
    .apply("Timer", ParDo.of(new DoFn<KV<String, String>, String>() {
        private final Duration BUFFER_TIME = Duration.standardSeconds(60);

        @TimerId("timer")
        private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

        // Elements will be saved here, with type String
        @StateId("buffer")
        private final StateSpec<BagState<String>> bufferedEvents = StateSpecs.bag();

        @ProcessElement
        public void processElement(ProcessContext c,
                                   @TimerId("timer") Timer timer,
                                   @StateId("buffer") BagState<String> buffer)
                throws ParseException {
            // keys are unique, so no need to use counters to trigger the offset
            timer.offset(BUFFER_TIME).setRelative();
            buffer.add(c.element().getKey()); // add to buffer the unique id
            LOG.info("TIMER: Adding " + c.element().getKey() +
                " buffer at " + Instant.now().toString());
        }

        // This method is called when the timer expires
        @OnTimer("timer")
        public void onTimer(
                OnTimerContext c,
                @StateId("buffer") BagState<String> buffer
        ) throws IOException {
            for (String id : buffer.read()) { // there should be only one since key is unique
                LOG.info("TIMER: Releasing " + id +
                    " from buffer at " + Instant.now().toString());
                c.output(id);
            }
            buffer.clear(); // clearing buffer
        }
    })
);
This was just a quick test, so probably there would be things to improve in the code.
I am not sure, though, how would this perform with a lot of elements, since you are caching all elements for one minute in individual timers. I'm currently running this pipeline in Dataflow and so far so good, will update this if something weird happens.
The advantage of this over using sleeps is that a sleep would make every single element in the bundle wait in turn, while this approach does the waiting in parallel. The disadvantage may be using too much shuffle, but I haven't tested this enough to be sure.
Note that in "normal" stateful DoFns (1) keys are not expected to be unique, in which case more than one element would be added to the bag, and (2) a counter or similar is needed to know whether the timer has already been set. In this case we didn't need one since the keys are unique.
FixedWindows does not introduce any delay.
In Beam, windowing groups elements according to their timestamps. This is separate from when the elements arrive.
The PubsubIO transform maintains a "watermark" which measures the timestamps that are still remaining in the Pubsub queue. So the watermark will lag real time by 1 minute.
If the pubsub topic becomes empty for a long time, the watermark will sync up with real time. So in that case you may need to allow late data in your pipeline.
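As a minimal illustration of the point that windowing depends on the element's timestamp and not on when it arrives, the following plain-Java sketch (no Beam dependency) computes which fixed window a timestamp falls into. The class and method names are hypothetical and the 60-second window size is just the value from the question.

```java
public class FixedWindowSketch {

    // Start (in ms) of the fixed window of length sizeMs that contains eventTimeMs.
    // Arrival time is deliberately not a parameter: it plays no part in the assignment.
    static long windowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, sizeMs);
    }

    public static void main(String[] args) {
        long sizeMs = 60_000L;        // 60-second fixed windows
        long eventTimeMs = 125_000L;  // element timestamp
        System.out.println(windowStart(eventTimeMs, sizeMs));
    }
}
```

An element stamped 125 s belongs to the window starting at 120 s whether it is processed immediately or several minutes later, which is why a fixed window by itself cannot serve as a constant delay.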

Dataflow + OutOfMemoryError

I have 3 GroupBys in my Java pipeline, and after each GroupBy the program runs some computation between stages. The groups become larger and larger blocks. The only thing the program adds is a new key to each block.
The last GroupBy deals with a smaller number of large blocks. The pipeline works for a small number of items, but it fails at the second or third GroupBy for a large number of items.
I played with Xms and Xmx and even chose much larger instances ('n1-standard-64'), but it didn't work. For the failing example, I'm sure the output is smaller than 5 GB, so is there any other way to control memory per map/reduce task in Dataflow?
If Dataflow can handle the first GroupBy, then it should be able to reduce the number of tasks, allocate more heap memory to each, and handle the large blocks in the next stage.
Any suggestion will be appreciated!
UPDATE:
.apply(ParDo.named("Sort Bins").of(
new DoFn<KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>, KV<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>>() {
@Override
public void processElement(ProcessContext c) {
KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>> e = c.element();
Integer Secondary_key = e.getKey();
ArrayList<KV<Long, Iterable<TableRow>>> records = Lists.newArrayList(e.getValue()); // Get a modifiable list.
Collections.sort(records, BinID_COMPARATOR);
Integer Primary_key= is a simple function of the secondary key;
c.output(KV.of(Primary_key, KV.of(Secondary_key, (Iterable<KV<Long, Iterable<TableRow>>>) records)));
}
}));
Error reported for the last line (c.output).

Apache Kafka Streams Materializing KTables to a topic seems slow

I'm using Kafka Streams and I'm trying to materialize a KTable into a topic.
It works, but it seems to happen only every 30 seconds or so.
How and when does Kafka Streams decide to materialize the current state of a KTable into a topic?
Is there any way to shorten this time and make it more "real-time"?
Here is the actual code I'm using
// Stream of random ints: (1,1) -> (6,6) -> (3,3)
// one record every 500ms
KStream<Integer, Integer> kStream = builder.stream(Serdes.Integer(), Serdes.Integer(), RandomNumberProducer.TOPIC);
// grouping by key
KGroupedStream<Integer, Integer> byKey = kStream.groupByKey(Serdes.Integer(), Serdes.Integer());
// same behaviour with or without the TimeWindow
KTable<Windowed<Integer>, Long> count = byKey.count(TimeWindows.of(1000L),"total");
// same behaviour with only count.to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
count.toStream().map((k,v) -> new KeyValue<>(k.key(), v)).to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
This is controlled by commit.interval.ms, which defaults to 30s. More details here:
http://docs.confluent.io/current/streams/developer-guide.html
The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earliest of commit.interval.ms or cache.max.bytes.buffering (cache pressure) hits.
and here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-63%3A+Unify+store+and+downstream+caching+in+streams
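As a hedged sketch of how the two settings named above might be lowered, the following builds a plain java.util.Properties object using the configuration keys as strings. The class and method names are hypothetical, and the values (a 1-second commit interval and a cache size of 0, which turns record caching off) are illustrative assumptions, not recommendations from the answer. A real application would also need application.id, bootstrap.servers and so on, and the cache setting may be named differently in newer Kafka releases, so check the configuration reference for your version.

```java
import java.util.Properties;

public class StreamsCacheConfigSketch {

    // Builds only the two latency-related settings discussed above.
    static Properties lowLatencyProps() {
        Properties props = new Properties();
        // Flush and forward at most one second after an update (assumed value)
        props.put("commit.interval.ms", "1000");
        // A cache size of 0 disables record caching, so each update is forwarded downstream
        props.put("cache.max.bytes.buffering", "0");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(lowLatencyProps());
    }
}
```

The trade-off is more records written downstream: with the cache disabled, every intermediate count is emitted instead of only the latest value per key per flush.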

Using a FixedWindow of an unbounded datasource as side input for a ParDo?

I am reading data (GPS coordinates with timestamps) from an unbounded Pub/Sub datasource and need to calculate the distance between all those points. My idea is to have, let's say, 1-minute windows and do a ParDo with the whole collection as a side input, where I use the side input to look up the next point and calculate the distance inside the ParDo.
If I run the pipeline I can see that the View.asList step is not producing any output. calcDistance never produces any output either. Are there any examples of how to use a FixedWindow collection as a side input?
Pipeline:
PCollection<Timepoint> inputWindow = pipeline.apply(PubsubIO.Read.topic(""))
.apply(ParDo.of(new ExtractTimestamps()))
.apply(Window.<Timepoint>into(FixedWindows.of(Duration.standardMinutes(1))));
final PCollectionView<List<Timepoint>> SideInputWindowed = inputWindow.apply(View.<Timepoint>asList());
inputWindow.apply(ParDo.named("Add Timestamp "+teams[i]).of(new AddTimeStampAsKey()))
.apply(ParDo.of(new CalcDistanceTest(SideInputWindowed)).withSideInputs(SideInputWindowed));
ParDo:
static class CalcDistance extends DoFn<KV<String,Timepoint>,Timepoint> {
private final PCollectionView<List<Timepoint>> pCollectionView;
public CalcDistance(PCollectionView pCollectionView){
this.pCollectionView = pCollectionView;
}
@Override
public void processElement(ProcessContext c) throws Exception {
LOG.info("starting to calculate distance");
Timepoint input = c.element().getValue();
//doing distance calculation
c.output(input);
}
}
The overall issue is that Dataflow does not know the timestamp of the element when reading from Pubsub, since for your use case it's an attribute of the data.
You'll want to ensure that when reading from Pubsub, you ingest the records with a timestamp label as discussed here.
Finally, the GameStats example uses a side input to find spammy users. In your case instead of calculating a globalMeanScore per window, you'll just place all your Timepoints into the side input.
