Google Cloud Dataflow latency for real-time processing

How low can we expect Dataflow latency to be in a case where we just do a simple transform on a high-traffic Google Dataflow cluster, and each "data point" is small?
We’re planning on using the Sessions windowing strategy with a gap duration of 3 seconds, if that’s relevant.
Is it realistic that the time from when a data point enters Dataflow until we have a result ready to output can be less than 2 seconds? Less than 1 second?
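For concreteness, the windowing we have in mind would be along these lines (a sketch; the input collection and element type are placeholders):

import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Session windows per key, closing after a 3-second gap with no new elements.
PCollection<KV<String, String>> sessioned =
    input.apply(Window.<KV<String, String>>into(
        Sessions.withGapDuration(Duration.standardSeconds(3))));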

We have been running benchmarks for our application flow using a test harness, but then reverted to benchmarking the current out-of-the-box Google-supplied PubSub-to-PubSub template flow (see: https://cloud.google.com/dataflow/docs/templates/overview; although it is not listed there, you can create it from the Console).
Our test harness generated and sent millions of JSON-formatted messages of a few hundred bytes with timestamps and compared the latencies at either end.
Very simply:
Test Publisher -> PubSub -> Dataflow -> PubSub -> Test Subscriber.
For a single-instance publisher and subscriber we varied the message rates and experimented with the windowing and trigger strategies to see if we could improve the average latency, but typically weren't able to improve much beyond 1.7 seconds end-to-end for 1,500-2,000 messages per second (our typical workload).
We then removed Dataflow from the equation and just hooked up the publisher to the subscriber directly and saw latencies typically around 20-30 milliseconds for identical message rates.
Reverting to the standard PubSub-to-PubSub Dataflow template, we saw end-to-end latencies similar to those of our application data flow, around 1.5-1.7 seconds.
We sampled the timestamps at various points in the pipeline and wrote the values to a number of custom metrics. The average latency for adding a message to the initial PCollection from the PubsubIO.Read was around 380 ms, but the minimum was as low as 25 ms (we ignored the higher values because of startup overheads). It seems there is an overhead here that we were unable to influence.
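As an illustration of the sampling approach (not our exact harness code; the metric name and the timestamp extraction are made up), each measurement point was a DoFn along these lines:

import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Records the wall-clock delta between the publish timestamp embedded in the
// JSON message and the moment the element reaches this point in the pipeline.
class RecordLatencyFn extends DoFn<String, String> {
  private final Distribution latencyMs =
      Metrics.distribution("benchmark", "read_to_pcollection_ms");

  @ProcessElement
  public void processElement(ProcessContext c) {
    long publishedAtMs = extractTimestampMs(c.element());
    latencyMs.update(System.currentTimeMillis() - publishedAtMs);
    c.output(c.element());
  }

  private static long extractTimestampMs(String json) {
    // Crude extraction for illustration only; assumes a "ts" field in epoch millis.
    return Long.parseLong(json.replaceAll(".*\"ts\":(\\d+).*", "$1"));
  }
}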
The windowing strategy we tried looked like this:
Pipeline p = Pipeline.create(options);

/*
 * Attempt to read from the PubSub topic.
 */
PCollectionTuple feedInputResults =
    p.apply(feedName + ":read", PubsubIO.readStrings().fromTopic(inboundTopic))
     .apply(Window.<String>configure()
         .triggering(Repeatedly
             .forever(AfterWatermark.pastEndOfWindow()
                 .withEarlyFirings(
                     AfterProcessingTime
                         .pastFirstElementInPane()
                         .plusDelayOf(Duration.millis(windowDelay)))
                 // Fire on any late data.
                 .withLateFirings(AfterPane.elementCountAtLeast(windowMinElementCount))))
         .discardingFiredPanes())
     .apply(feedName + ":parse", ParDo.of(new ParseFeedInputFn())
         // Main output tagged validBetRecordTag; additional outputs as a TupleTagList.
         .withOutputTags(validBetRecordTag,
             TupleTagList.of(invalidBetRecordTag)));

Related

Apache Beam pipeline metrics

I have a pipeline which runs on Google Dataflow and reads from a Pub/Sub. I use metrics, mostly counters, to get an idea of how many events were processed.
The general idea is that it reads protobuf messages from the Pub/Sub, deserializes them, and dumps them into different BigQuery tables.
I have a counter in the ParDo that does the deserialization and also a counter in the ParDo that does the inserts into BQ. Most of the time, the difference between the end counter and the start counter is big, around 40%. The wall time on the pipeline steps is always increasing, but the Data Freshness and System Latency for the pipeline are really low, around 30 seconds...
Why is this happening? Is it something related to counters?
You can see the size_ok metric, which is the first metric in the pipeline; if you sum up all of the metrics below it, you don't even get close to this number.
My question is: why is there such a big difference between the size_ok metric, which is the total number of events that started to be processed, and the insert metrics? If you sum up all the insert metrics, there is no way they add up to the 6.6 million events shown by size_ok.
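For reference, the counter pattern described above presumably looks something like this (class, metric, and type names are placeholders):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// First ParDo: deserialize and count every event entering the pipeline.
class DeserializeFn extends DoFn<byte[], MyEvent> {
  private final Counter sizeOk = Metrics.counter("pipeline", "size_ok");

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    sizeOk.inc();
    c.output(MyEvent.parseFrom(c.element())); // MyEvent: hypothetical protobuf class
  }
}

// Later ParDo: count only the events that reach the BigQuery insert step.
class InsertFn extends DoFn<MyEvent, Void> {
  private final Counter inserts = Metrics.counter("pipeline", "inserts_table_a");

  @ProcessElement
  public void processElement(ProcessContext c) {
    inserts.inc();
    // ... perform the streaming insert here ...
  }
}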

Streaming Beam pipeline with "lookbehind"

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when they send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where the offset is the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? The part that seems perhaps unsuited to Beam is that this pipeline would need to maintain and scan an unbounded-ish history of count records.
One possible approach would be to model this as two timeseries (left, right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
To achieve this unbounded, you will need to work within a GlobalWindow. Important note: in the GlobalWindow there is no expiry of state, so you will need to make sure to do garbage collection on your left and right streams. Also, data will arrive unordered in @ProcessElement, so you will need to make use of event-time timers to do the actual work.
Very roughly:
onProcess() {
    store the data in BagState
    set an event-time timer to go off
}
onTimer() {
    do your business logic
}
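Filled out a little, that skeleton could look like the following (a rough, untested sketch; CountRecord and Output stand in for your element types):

import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class BufferUntilWatermarkFn extends DoFn<KV<String, CountRecord>, Output> {

  // Unordered buffer for elements that arrive before we can process them.
  @StateId("buffer")
  private final StateSpec<BagState<CountRecord>> bufferSpec = StateSpecs.bag();

  // Event-time timer: fires once the watermark passes the timestamp it is set to.
  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<CountRecord> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(c.element().getValue());
    flush.set(c.timestamp().plus(Duration.standardSeconds(1)));
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext c,
      @StateId("buffer") BagState<CountRecord> buffer) {
    for (CountRecord r : buffer.read()) {
      // Business logic: the buffered records can now be matched in event time.
    }
    buffer.clear(); // garbage-collect: state never expires on its own in the GlobalWindow
  }
}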
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam Summit also has some pointers (it does not make use of OrderedListState, which was not available at the time):
State and Timer API and Timeseries

Cloud Dataflow: what is the exact definition of freshness and latency?

Problem:
When using Cloud Dataflow, we get presented 2 metrics (see this page):
system latency
data freshness
These are also available in Stackdriver under the following names (extract from here):
system_lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
data_watermark_age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
But, these descriptions are still very vague:
what does "awaiting processing" mean? Is it how long a message waits in Pub/Sub, or the total time it has to wait inside the pipeline?
the "maximum duration": after that maximum item is processed, will the metric be adjusted?
"time since event timestamp": does that mean that if my event was put into Pub/Sub at timestamp t1 and flows out of one end of the pipeline at timestamp t2, the pipeline is at t1? I think I can assume that if the metric is at t1, everything before t1 can be assumed processed.
Question:
As these metrics coincide with the semantics of Apache Beam, I would love to see some examples, or at least clearer definitions of these metrics, to make them usable.
These metrics are notoriously tricky. An in-depth dive into how they work can be seen in this talk by a member of the Beam / Dataflow team.
Pipelines are split in series of computations that occur in memory, and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:
with Pipeline() as p:
    p | beam.io.ReadFromPubSub(...) \
      | beam.Map(parse_data) \
      | beam.Map(into_key_value_pairs) \
      | beam.WindowInto(...) \
      | beam.GroupByKey() \
      | beam.Map(format_data) \
      | beam.io.WriteToBigQuery(...)
This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.
The first stage goes from ReadFromPubSub to the GroupByKey operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey, the data needs to be written to persistent state (and therefore into a new source).
The second stage goes from GroupByKey to WriteToBigQuery. In this case, the data is read from a 'source'.
Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
--
Answering your questions:
What's awaiting processing?
Answer
It is how long an element waits in PubSub or, more generally, how long an element waits inside any source in the pipeline.
Consider a simpler pipeline:
ReadFromPubSub -> Map -> WriteToBigQuery.
This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> Confirm to PubSub that the item has been consumed.
Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.
This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
Does maximum duration get adjusted after processing?
Answer
That's right. For instance, consider the previous pipeline again: BQ is down for 5 minutes. When BQ comes back, a large batch of items may be written to it and confirmed to PubSub as consumed. This will drastically reduce the system latency (and data freshness) back to a few seconds.
What's time since event timestamp?
Answer
An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:
For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark could be held back if there is some input data that has not yet been processed.
This metric is, of course, heuristic. If some data point comes in very late, then the Data Freshness will be held back.
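Concretely, the timestamp attribute is wired in on the Pub/Sub read. In Java it would look roughly like this (the subscription path and attribute name are placeholders; the attribute is whatever your publisher sets):

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// The element's event time is taken from this message attribute, and the
// watermark (and hence Data Freshness) is derived from it.
PCollection<String> messages = p.apply(
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-sub")
        .withTimestampAttribute("eventTimestamp"));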
--
I'd advise you to check out the talk by Slava. It goes over all these concepts.

Apache Beam: Batch Pipeline with Unbounded Source

I'm currently using Apache Beam with Google Dataflow for processing real time data. The data comes from Google PubSub, which is unbounded, so currently I'm using streaming pipeline. However, it turns out that having a streaming pipeline running 24/7 is quite expensive. To reduce cost, I'm thinking of switching to a batch pipeline that runs at a fixed time interval (e.g. every 30 minutes), since it's not really important for the processing to be real time for the user.
I'm wondering if it's possible to use a PubSub subscription as a bounded source? My idea is that each time the job runs, it will accumulate the data for 1 minute before triggering. So far it does not seem possible, but I've come across a class called BoundedReadFromUnboundedSource (which I have no idea how to use), so maybe there is a way?
Below is roughly what the source looks like:
PCollection<MyData> data = pipeline
    .apply("ReadData", PubsubIO
        .readMessagesWithAttributes()
        .fromSubscription(options.getInput()))
    .apply("ParseData", ParDo.of(new ParseMyDataFn()))
    .apply("Window", Window
        .<MyData>into(new GlobalWindows())
        .triggering(Repeatedly
            .forever(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(5))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());
I tried to do the following, but the job still runs in streaming mode:
PCollection<MyData> data = pipeline
    .apply("ReadData", PubsubIO
        .readMessagesWithAttributes()
        .fromSubscription(options.getInput()))
    .apply("ParseData", ParDo.of(new ParseMyDataFn()))
    // Is there a way to make the window trigger once, turning it into a bounded source?
    .apply("Window", Window
        .<MyData>into(new GlobalWindows())
        .triggering(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes());
This is not explicitly supported in PubsubIO currently; however, you could try periodically starting a streaming job and programmatically invoking Drain on it a few minutes later.
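Drain can be requested through the Dataflow API by updating the job's requestedState (the gcloud equivalent is gcloud dataflow jobs drain). A rough, untested sketch using the generated Java API client:

import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import java.io.IOException;

// Ask Dataflow to drain the job: it stops pulling from Pub/Sub, finishes
// in-flight work, and then terminates.
void drainJob(Dataflow dataflow, String project, String region, String jobId)
    throws IOException {
  Job drainRequest = new Job().setRequestedState("JOB_STATE_DRAINED");
  dataflow.projects().locations().jobs()
      .update(project, region, jobId, drainRequest)
      .execute();
}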

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB (10 GB uncompressed), parse them, and write them into a date-partitioned table in BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing the JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files, and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode, and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework. Is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done, the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-times at the end are 0. The picture shows the pipeline slightly before it runs "backwards".
In the worker logs I see some exceptions thrown while executing request messages, coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading, the pipeline fails on n1-highmem-2 instances with out-of-memory errors at exactly the same step (everything after GroupByKey); using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175-day (25-week) chunks, running one pipeline after the other, so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and that the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the ends of the chunks are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above after reading the gzipped files, to speed up the process and allow smoother autoscaling (see the sketch after this list)
Using DateTime instead of Instant types, as my data is not in UTC
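For readers on a current Apache Beam release, the Repartition/Reshuffle step referred to above is available as a built-in transform. A minimal sketch:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Breaking fusion right after the (non-splittable) gzip read lets the
// downstream parsing work be redistributed across all workers.
PCollection<String> lines = p
    .apply("Read logfile", TextIO.read().from(bucket))
    .apply("Reshuffle", Reshuffle.viaRandomKey());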
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run it. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" sub-chapter.
