Apache Beam pipeline metrices - google-cloud-dataflow

I have a pipeline which runs on Google Dataflow and reads from a pub/sub. I use metrices, mostly counters, to get an idea of how many events were processed.
The general idea is that it reads protobuf messages from the pubsub, deserializes them and dumps the into different BigQuery tables.
I have a counter in the ParDo that does the de-serialization and also counter in the ParDo that does the inserts into the BQ. Most of the time, the difference between the end counter and the start counter is big, like around 40%. The wall time on the pipeline steps is always increasing, but the Data Freshness and System Latency for the pipeline is really low, like around 30 sec...
Why is this happening thos? Is is something related to counters?
You can see the size_ok metric which is the first metric from the pipeline, and if you sum up all of the below metrices, you don't even get close to this number.
My question is, why is there such a big difference between size_ok metric, which is the total number of events that started to be processed and the inserts metrics. If you sum up all the insert metrics, there is no way that they sum up to 6,6 mln events as shown by size_ok

Related

Streaming Beam pipeline with "lookbehind"

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when the send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where the offset was the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? That this pipeline would need to maintain and scan an unbounded-ish history of count records is the part that seems maybe unsuited to Beam.
One possible approach would be to model this as two timeseries (left , right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
In order to achieve this unbounded, you will need to be working within a GlobalWindow. Important note in the Global Window there is no expiry of the state, so you will need to make sure to do Garbage Collection on your left and right streams. Also data will arrive in the onprocess unordered, so you will need to make use of Event Time timers to do the actual work.
Very roughly:
onProcess(){
Store data in BagState.
Setup Event time timer to go off
}
OnTimer(){
Do your buiss logic.
}
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam summit also has some pointers (but does not make use of OrderedListState, which was not available at the time);
State and Timer API and Timeseries

Cloud Dataflow what is the exact definition of freshness and latency?

Problem:
When using Cloud Dataflow, we get presented 2 metrics (see this page):
system latency
data freshness
These are also available in Stackdriver under the following names (extract from here):
system_lag: The current maximum duration that an item of data has been awaiting processing, in seconds.
data_watermark_age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
But, these descriptions are still very vague:
what does "awaiting processing" mean? is how long a message waits in pubsub? or the total time it has to wait inside the pipeline?
the "maximum duration": after that maximum item is processed, will the metric be adjusted?
"time since event timestamp" does that mean if my event was put in pubsub at timestamp t1 and it flows out of one end of the pipeline at timestamp t2, the pipeline is at t1? I think I can assume that if the metric is at t1, everything before t1 can be assumed processed.
Question:
As these metrics coincide with the semantics of Apache Beam, I would love to see some examples, or at least more clear definitions of these metrics to make them usable.
These metrics are notoriously tricky. An in-depth dive into how they work can be seen in this talk by a member of the Beam / Dataflow team.
Pipelines are split in series of computations that occur in memory, and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:
with Pipeline() as p:
p | beam.ReadFromPubSub(...) \
| beam.Map(parse_data)
| beam.Map(into_key_value_pairs) \
| beam.WindowInto(....) \
| beam.GroupByKey() \
| beam.Map(format_data) \
| beam.WriteToBigquery(...)
This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.
The first stage goes from ReadFromPubSub to the GroupByKey operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey, the data needs to be written to persistent state (and therefore into a new source).
The second stage goes from GroupByKey to WriteToBigQuery. In this case, the data is read from a 'source'.
Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
--
Answering your questions:
What's awaiting processing?
Answer
It is how long an element waits in PubSub. Specifically, how long an element waits inside any source in the pipeline.
Consider a simpler pipeline:
ReadFromPubSub -> Map -> WriteToBigQuery.
This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> **Confirm to PubSub that the item has been consumed**.
Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.
This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
Does maximum duration get adjusted after processing?
Answer
That's right. For instance, consider the previous pipeline again: BQ is dead for 5 minutes. When BQ comes back, a large batch of items may be written to it, and confirmed as read from PubSub. This will drastically reduce the system latency (and data freshness) back to a few seconds.
What's time since event timestamp?
Answer
An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:
For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark could be held back if there is some input data that has not yet been processed.
This metric is, of course, heuristic. If some data point comes in very late, then the Data Freshness will be held back.
--
I'd advice you to check out the talk by Slava. It goes over all these concepts.

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing, I have tried running the pipeline with and without it. Parsing JSON works via a Jackon ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here, it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 of n1-highmem-16 machines. I've tried streaming and batch mode and disSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, and so enabling the dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processsing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, first it's processing up to 1.5 million element/s, then the progress goes down to 0. The size of OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total). Like it is discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some exception thrown while executing request messages that are coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175 = (25 weeks) large chunks, running one pipeline after the other, to not overwhelm the system. In the loop make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.

Google Cloud Dataflow latency for real-time processing

How low can we expect the latency from Dataflow to be in cases where we just do a simple transform on a high-traffic Google Dataflow cluster, and each “data point” is small.
We’re planning on using the Sessions windowing strategy with a gap duration of 3 seconds, if that’s relevant.
Is it realistic that the time from a data point gets into Dataflow until we have a result to output can be less than 2 seconds? Less than 1 second?
We have been running benchmarks for our application flow using a test harness but then reverted to benchmarking the current out-of-the-box Google-supplied PubSub to PubSub template flow (see: https://cloud.google.com/dataflow/docs/templates/overview, although not listed here - you can create it from the Console).
Our test harness generated and sent millions of JSON-formatted messages of a few hundred bytes with timestamps and compared the latencies at either end.
Very Simply:
Test Publisher -> PubSub -> Data Flow -> PubSub -> Test Subscriber.
For single instance publisher and subscribers we varied the message rates and experimented with the windowing and Trigger Strategies to see if we could improve the average latency but typically weren't able to improve much beyond 1.7 seconds end-to-end for 1,500 - 2000 messages per second (our typical workload).
We then removed Dataflow from the equation and just hooked up the publisher to the subscriber directly and saw latencies typically around 20-30 milliseconds for identical message rates.
Reverting to using the standard PubSub to PubSub Data flow template we saw end-to-end latencies similar to that of our application data flow of around 1.5 - 1.7 seconds.
We sampled the timestamps at various points in the pipeline and written the values to a number custom metrics and have seen that the average latency for adding the message to the initial PCollection from the PubSubIO.Read was around 380msec, but the minimum was as low as 25msec, we ignored the higher values because of the startup overheads. But it seems that there was an overhead that we were unable to influence.
The windowing strategy we tried looked like this:
Pipeline p = Pipeline.create(options);
/*
* Attempt to read from PubSub Topic
*/
PCollectionTuple feedInputResults =
p.apply(feedName + ":read", PubsubIO.readStrings().fromTopic(inboundTopic))
.apply(Window.<String>configure()
.triggering(Repeatedly
.forever(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.millis(windowDelay)))
// Fire on any late data
.withLateFirings(AfterPane.elementCountAtLeast(windowMinElementCount))))
.discardingFiredPanes())
.apply(feedName + ":parse", ParDo.of(new ParseFeedInputFn())
.withOutputTags(validBetRecordTag,
// Specify the output with tag startsWithBTag, as a TupleTagList.
TupleTagList.of(invalidBetRecordTag)));

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time I know that all new data items coming with some IDs needs to be discarded and they represent a significant portion of data (up to 40%).
Right now I have two batch pipelines:
First one just runs the rating over the data.
The second one first filters out the corrupted data and runs the analysis.
I would like to switch from batch mode (say, running every day) into an online processing mode (hope to get a delay < 10 minutes).
The second pipeline uses a global window which makes processing easy. When the corrupted data key is detected, all other records are simply discarded (also using the discarded keys from previous days as a pre-filter is easy). Additionally it makes it easier to make decisions about the output data as during the processing all historic data for a given key is available.
The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate quality-rates given to each session window I process and if the rate sum is over X, some a filter function in earlier stage of pipeline should filter out malicious keys.
I know about side input, I don't know if it can change during runtime.
I'm aware that DAG by definition cannot have cycle, but how achieve same result without it?
Idea that comes to my mind is to use side output to mark ID as malicious and make fake unbounded output/input. The output would dump the data to some storage and the input would load it every hour and stream so it can be joined.
Side inputs in the Beam programming model are windowed.
So you were on the right path: it seems reasonable to have a pipeline structured as two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as a main input, and filtering the data according to the model. This second part of the pipeline will get the model for the matching window, which seems to be exactly what you want.
In fact, this is one of the main examples in the Millwheel paper (page 2), upon which Dataflow's streaming runner is based.

Resources