I'm currently using Apache Beam with Google Dataflow to process real-time data. The data comes from Google Pub/Sub, which is unbounded, so I'm currently running a streaming pipeline. However, it turns out that having a streaming pipeline running 24/7 is quite expensive. To reduce cost, I'm thinking of switching to a batch pipeline that runs at a fixed interval (e.g. every 30 minutes), since the processing doesn't really need to be real time for the user.
I'm wondering if it's possible to use a Pub/Sub subscription as a bounded source. My idea is that each time the job runs, it would accumulate data for one minute before triggering. So far this does not seem possible, but I've come across a class called BoundedReadFromUnboundedSource (which I have no idea how to use), so maybe there is a way?
Below is roughly what the source looks like:
PCollection<MyData> data = pipeline
    .apply("ReadData", PubsubIO
        .readMessagesWithAttributes()
        .fromSubscription(options.getInput()))
    .apply("ParseData", ParDo.of(new ParseMyDataFn()))
    .apply("Window", Window
        .<MyData>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(5))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());
I tried to do the following, but the job still runs in streaming mode:
PCollection<MyData> data = pipeline
    .apply("ReadData", PubsubIO
        .readMessagesWithAttributes()
        .fromSubscription(options.getInput()))
    .apply("ParseData", ParDo.of(new ParseMyDataFn()))
    // Is there a way to make the window trigger once and turn it into a bounded source?
    .apply("Window", Window
        .<MyData>into(new GlobalWindows())
        .triggering(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());
This is not explicitly supported in PubsubIO currently; however, you could try periodically starting a streaming job and programmatically invoking Drain on it a few minutes later.
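If you want to automate that, here is a minimal sketch of the drain call, assuming an authenticated Dataflow client built from the google-api-services-dataflow library (projectId, region, and jobId are whatever identifies the job you started):

import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import java.io.IOException;

// Sketch: ask Dataflow to drain a running streaming job. Setting
// requestedState to JOB_STATE_DRAINED is the same operation as clicking
// Drain in the console or running `gcloud dataflow jobs drain`.
static void drainJob(Dataflow dataflow, String projectId, String region, String jobId)
    throws IOException {
  Job drainRequest = new Job();
  drainRequest.setRequestedState("JOB_STATE_DRAINED");
  dataflow.projects().locations().jobs()
      .update(projectId, region, jobId, drainRequest)
      .execute();
}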
Context
Hi all, I have been using Apache Beam pipelines to generate columnar files to store in GCS. I have a data stream coming in from Kafka and a window of 1 minute.
I want to transform all the data of that 1-minute window into a columnar file (ORC in my case; it could be Parquet or anything else), and I have written a pipeline for this transformation.
Problem
I am experiencing general slowness. I suspect it could be due to the group-by-key transformation, as I have only one key. Is there really a need for it? If not, what should be done instead? I read that Combine isn't very useful here, since my pipeline isn't really aggregating the data but creating a merged file. What I need is an iterable list of objects per window, which will then be transformed into ORC files.
Pipeline Representation
input -> window -> group by key (only 1 key) -> pardo (to create DB) -> IO (to write to GCS)
What I have tried
I have tried using the profiler and scaling horizontally/vertically. Using the profiler, I saw more than 50% of the time going into the group-by-key operation. I believe the problem is hot keys, but I am unable to find a solution for what should be done. When I removed the group-by-key operation, my pipeline kept up with the Kafka lag (i.e., the issue does not seem to be on the Kafka end).
Code Snippet
p.apply("ReadLines", KafkaIO.<Long, byte[]>read().withBootstrapServers("myserver.com:9092")
.withTopic(options.getInputTopic())
.withTimestampPolicyFactory(MyTimePolicy.myTimestampPolicyFactory())
.withConsumerConfigUpdates(Map.of("group.id", "mygroup-id")).commitOffsetsInFinalize()
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(ByteArrayDeserializer.class).withoutMetadata())
.apply("UncompressSnappy", ParDo.of(new UncompressSnappy()))
.apply("DecodeProto", ParDo.of(new DecodePromProto()))
.apply("MapTSSample", ParDo.of(new MapTSSample()))
.apply(Window.<TSSample>into(FixedWindows.of(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(WithKeys.<Integer, TSSample>of(1))
.apply(GroupByKey.<Integer, TSSample>create())
.apply("CreateTSORC", ParDo.of(new CreateTSORC()))
.apply(new WriteOneFilePerWindow(options.getOutput(), 1));
Wall Time Profile
https://gist.github.com/anandsinghkunwar/4cc26f7e3da7473af66ce9a142a74c35
The problem indeed seems to be a hot-key issue. I had to change my pipeline to create a custom IO for ORC files and bump the number of shards up to 50 for my case. I removed the GroupByKey entirely. Since Beam doesn't yet auto-determine the number of shards for FileIO.write(), you'll have to manually choose a number that suits your workload.
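For reference, the shard setting looks roughly like this; a minimal sketch, where OrcSink stands in for the custom FileIO.Sink<TSSample> implementation and tsSamples for the windowed PCollection<TSSample>:

// Windowed write with an explicit shard count; FileIO.write() cannot pick
// the number of shards automatically, so it must be set by hand.
tsSamples.apply("WriteORC", FileIO.<TSSample>write()
    .via(new OrcSink())
    .to(options.getOutput())
    .withNumShards(50));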
Also, enabling Streaming Engine in Google Dataflow sped up the ingestion even more.
I'm trying to break fusion with a GroupByKey. This creates one huge window, and since my job is big I'd rather start emitting output early.
With the direct runner, using something like what I found here, it seems to work. However, when run on Cloud Dataflow it seems to batch the GBK together and not emit output until the source nodes have "succeeded".
I'm running a bounded/batch job. I'm extracting the contents of archive files and then writing them to GCS.
Everything works, except it takes longer than I expected and CPU utilization is low. I suspect this is due to fusion: my hypothesis is that the extraction is fused to the write operation, so there's a pattern of extraction (higher CPU) followed by network calls (lower CPU), and back again.
The code looks like:
.apply("Window",
Window.<MyType>into(new GlobalWindows())
.triggering(
Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Add key", MapElements...)
.apply(GroupByKey.create())
Locally I verify using debug logs, so I can see that work is being done after the GBK. The gap between the first extraction finishing and the first post-GBK operation usually reflects the 5-second delay (or whatever other value I change it to: 1, 5, 10, 20, 30).
On GCP I verify by looking at the pipeline structure: everything after the GBK is "not started", and the output collection of the GBK is empty ("-") while the input collection has millions of elements.
Edit:
- This is on Beam v2.10.0.
- The extraction is being done by a SplittableDoFn (not sure if this is relevant).
It looks like the answer you referred to was for a streaming pipeline (unbounded input). For a batch pipeline processing a bounded input, GroupByKey will not emit until all data for a given key has been processed. Please see here for more details.
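If the goal is only to break fusion in a batch pipeline, a commonly used idiom is Reshuffle rather than a manual key/GroupByKey/ungroup round-trip. A minimal sketch, where ExtractArchiveFn and WriteToGcsFn are placeholder names for your existing DoFns:

input
    .apply("Extract", ParDo.of(new ExtractArchiveFn())) // the SplittableDoFn
    // Reshuffle materializes the collection, preventing Dataflow from fusing
    // the extraction stage with the write stage below.
    .apply("BreakFusion", Reshuffle.viaRandomKey())
    .apply("WriteToGcs", ParDo.of(new WriteToGcsFn()));

Note that in batch mode the stage after the Reshuffle still only starts once the upstream stage completes; the gain is that extraction and writing no longer compete inside one fused stage.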
I'm writing a Dataflow job (Beam SDK 2.0.0) that reads from Pub/Sub, counts the elements in a window, and then stores the counts in Bigtable as a time series. The windows are fixed at a duration of 1 minute.
My intention was to update the value of the current window every second, using a trigger, in order to get real-time updates on the current time window.
But that doesn't seem to work. The value gets correctly updated every second, but once Dataflow starts working on the next minute, the first window's value is reset to zero. So basically only my last value is correct; all the rest are zero.
Pipeline pipeline = Pipeline.create(options);

PCollection<String> live = pipeline
    .apply("Read from PubSub", PubsubIO.readStrings()
        .fromSubscription("projects/..."))
    .apply("Window per minute", Window
        .<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(1)))
            .orFinally(AfterWatermark.pastEndOfWindow()))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO));
I've tried playing with the trigger code, but nothing helps. My only option right now is to remove the entire .triggering block. Has anyone experienced similar behaviour?
After reporting my issue to Google, they discovered some issues in the Beam SDK that are causing this. More details in these links:
- When EOW and GC timers fire together (nonzero allowed lateness), we fail to note that it is the final pane: https://issues.apache.org/jira/browse/BEAM-2505
- Processing-time timers are not ignored properly if they come in with the GC timer: https://issues.apache.org/jira/browse/BEAM-2502
- Processing-time timers are just interpreted as GC timers, completely incorrectly comparing timestamps from different time domains: https://issues.apache.org/jira/browse/BEAM-2504
I have a pipeline that looks like:
pipeline.apply(PubsubIO.Read.subscription("some subscription"))
    .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardSeconds(20)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(20)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(RemoveDuplicates.<String>create())
    // this is suggested in the warnings under
    // https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes
    .apply(Window.<String>configure().discardingFiredPanes())
    .apply(Count.<String>globally().withoutDefaults());
This pipeline overcounts distinct values significantly (20x the normal value). Initially, I suspected that the default trigger may have caused this issue. I have tweaked it to use triggers that allow no lateness, discard fired panes, or use processing time, all of which show similar overcounting.
I've also tried ApproximateUnique.globally; it failed during pipeline construction with an exception that looks like:
Default values are not supported in Combine.globally() if the output PCollection is not windowed by GlobalWindows.
There seems to be no way to add withoutDefaults to it (as we did with Count.globally).
Is there a recommended way to do COUNT(DISTINCT) in a Dataflow/Beam streaming pipeline with reasonable precision?
P.S. I'm using Java Dataflow SDK 1.9.0.
Your code looks OK; it shouldn't overcount. Note that you are placing each element into 30 windows (a 10-minute window sliding every 20 seconds puts each element into 600 s / 20 s = 30 windows), so if you have a window-unaware sink (equivalent to collapsing all the sliding windows) you would expect precisely 30 times as many elements. If you could show a bit more of the pipeline, or how you are observing the counts, that might help.
Beyond that, I have a few suggestions for the pipeline:
I suggest changing your trigger for RemoveDuplicates to AfterPane.elementCountAtLeast(1); this will get you the same result at lower latency, since later-arriving elements have no impact. Neither this trigger nor your current one will ever fire repeatedly, so it does not actually matter whether you set accumulatingFiredPanes() or discardingFiredPanes(). This is good, because neither one would work with the rest of your pipeline.
I'd install a new trigger prior to the Count. The reason is a bit technical, but I'll try to describe it:
In your current pipeline, the trigger installed there (the "continuation trigger" of the trigger for RemoveDuplicates) notes the arrival time of the first element and waits until it has received all elements that were produced at or before that processing time, as measured by the upstream worker. There is some nondeterminism, because it conflates the local processing time with the processing time of other workers.
If you take my advice and switch the trigger for RemoveDuplicates, then the continuation trigger will be AfterPane.elementCountAtLeast(1), so it will always emit a count as soon as possible and then discard further data, which is very wrong.
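Putting the two suggestions together, the pipeline might look roughly like this. This is a sketch, not a drop-in fix: it uses the Beam 2.x spelling Window.configure() (the 1.9 SDK builder differs slightly), and the repeated processing-time trigger before the Count is just one possible choice:

pipeline.apply(PubsubIO.Read.subscription("some subscription"))
    .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(10))
                .every(Duration.standardSeconds(20)))
        // Suggestion 1: fire once, as soon as the first element arrives.
        .triggering(AfterPane.elementCountAtLeast(1))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(RemoveDuplicates.<String>create())
    // Suggestion 2: install an explicit trigger here instead of inheriting
    // the continuation trigger. With accumulating panes, each firing of the
    // Count emits an updated running total for the window.
    .apply(Window.<String>configure()
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(20))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(Count.<String>globally().withoutDefaults());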
How low can we expect the latency from Dataflow to be in cases where we just do a simple transform on a high-traffic Google Dataflow cluster, and each "data point" is small?
We’re planning on using the Sessions windowing strategy with a gap duration of 3 seconds, if that’s relevant.
Is it realistic that the time from a data point gets into Dataflow until we have a result to output can be less than 2 seconds? Less than 1 second?
We have been running benchmarks for our application flow using a test harness, and then reverted to benchmarking the out-of-the-box Google-supplied PubSub-to-PubSub template flow (see: https://cloud.google.com/dataflow/docs/templates/overview; although not listed there, you can create it from the Console).
Our test harness generated and sent millions of JSON-formatted messages of a few hundred bytes with timestamps and compared the latencies at either end.
Very simply:
Test Publisher -> PubSub -> Dataflow -> PubSub -> Test Subscriber
For a single-instance publisher and subscriber, we varied the message rates and experimented with the windowing and trigger strategies to see if we could improve the average latency, but typically weren't able to improve much beyond 1.7 seconds end-to-end for 1,500-2,000 messages per second (our typical workload).
We then removed Dataflow from the equation and just hooked up the publisher to the subscriber directly and saw latencies typically around 20-30 milliseconds for identical message rates.
Reverting to the standard PubSub-to-PubSub Dataflow template, we saw end-to-end latencies similar to those of our application flow, around 1.5-1.7 seconds.
We sampled the timestamps at various points in the pipeline and wrote the values to a number of custom metrics. We saw that the average latency for adding a message to the initial PCollection from PubsubIO.Read was around 380 ms, with a minimum as low as 25 ms (we ignored the higher values, which were due to startup overheads). But it seems there is an overhead there that we were unable to influence.
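The sampling itself was done along these lines; a minimal sketch using Beam's Metrics API, with illustrative names (RecordLatencyFn, pubsub_read_latency_ms):

import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: record per-element latency as a custom metric. The element
// timestamp is the publish time assigned by PubsubIO, so the difference
// to now() approximates the time spent before reaching this stage.
class RecordLatencyFn extends DoFn<String, String> {
  private final Distribution readLatencyMs =
      Metrics.distribution("benchmark", "pubsub_read_latency_ms");

  @ProcessElement
  public void processElement(ProcessContext c) {
    readLatencyMs.update(System.currentTimeMillis() - c.timestamp().getMillis());
    c.output(c.element());
  }
}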
The windowing strategy we tried looked like this:
Pipeline p = Pipeline.create(options);

/*
 * Attempt to read from PubSub Topic
 */
PCollectionTuple feedInputResults =
    p.apply(feedName + ":read", PubsubIO.readStrings().fromTopic(inboundTopic))
        .apply(Window.<String>configure()
            .triggering(Repeatedly
                .forever(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(Duration.millis(windowDelay)))
                    // Fire on any late data
                    .withLateFirings(AfterPane.elementCountAtLeast(windowMinElementCount))))
            .discardingFiredPanes())
        .apply(feedName + ":parse", ParDo.of(new ParseFeedInputFn())
            // Specify the main output tag and the additional output tag, as a TupleTagList.
            .withOutputTags(validBetRecordTag, TupleTagList.of(invalidBetRecordTag)));