Dataflow pipeline pauses after scaling down with low CPU utilization - google-cloud-dataflow

In our streaming pipeline we read data from pubsub, do some validations and then group it by a key in a 10 second gap session window. Afterwards the data is processed further and written to bigtable and pubsub again.
We're using apache beam 2.28 and the dataflow streaming engine. During the day we process more data than over night and the pipeline scales up the number of workers (n2d-standard-4) automatically. Mostly it scales up from 2 workers to 4 or 5 to reduce the backlog. After that it will scale down again as the CPU utilization is too low for 4 or 5 workers.
It is at this point that the CPU utilization drops to nearly 0% for all workers and the entire pipeline starts lagging behind massively. The result is that the number of workers is scaled up to a higher number again and the pipeline processing the data further. After the backlog is reduced again, the number of workers is gradually lowered and the same issue arises.
metrics
What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.
GroupByKey throughput
I know using GroupByKey can have hotkeys, but then I would expect the CPU utilization of 1 worker to be very high while the others have nothing to do.
Does anyone know what might be causing this issue?

The issue was caused by by the combination of using the session window with a groupbykey, how the watermark for a pubsub unbounded source works and when the acknowledges are being sent to pubsub.
Our session window with a gap of 10 seconds sometimes didn't output any messages for a couple of minutes (due to no early trigger being configured and messages continuously arriving for the same key within the 10 second session gap). Because these steps are part of the first fused stage in the actual execution of our pipeline, this lead to some messages not being acknowledged to pubsub (the ack is only sent when the first fused stage is completed). The oldest unacknowledged message time on the subscription kept on rising, causing the watermark not to advance.
This issue was became more outspoken due to the acknowledgement deadline being set to 10 minutes. When the number of workers scaled down, this caused the issue described in the original question.
We were able to solve this by adding a Reshuffle before the creation of the session window (with the groupbykey) and decreasing the acknowledgement deadline.
https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization

Related

Slow Dataflow job draining

I have a dataflow job that reads from pubsub subscription and writes to a Redis instance on a fixed time windows. The job seems to be running well with 4 workers and almost 0s systems latency until I try to drain it which causes upscaling to 10 workers and takes hours to finish.
I suspect this is caused by the windowing/grouping since the output collection metric suggest that it keeps producing elements long after the drain is started.
That's the windowing that I'm using.
beam.WindowInto(
window.FixedWindows(size=120),
trigger=Repeatedly(
AfterAny(AfterCount(100), AfterProcessingTime(120))),
accumulation_mode=AccumulationMode.DISCARDING,
allowed_lateness=Duration(seconds=2 * 24 * 60 * 60))
Given your large allowed lateness, when you drain your pipeline it causes every window for every key seen in the last 48 hours to close all at once, which is probably why there's so much work and it's upscaling. This could be especially bad if your keys are not often re-used.

GCP Dataflow - Throughput gradually slows down, Workers underutilized

I have a Beam script running in GCP Dataflow. This data flow performs the below steps:
Read a number of files that are PGP encrypted. (Total size more than 100 GB, individual files are of 2 GB in size)
Decrypt the files to form a PCollection
Do a wait() on PCollection
Do some processing on each record in the PCollection before writing into an output file
Behavior seen with GCP Dataflow:
When reading the input files and decrypting the files, it starts with one workers, and then scales upto 30 workers. But, only one worker continues to be utilized, utilization in all other workers is less than 10 %
Initially, throughput was 150K records per second while decryption. So, 90% of the decryption gets completed in 1 hours, which is good. But, then the throughput slows down gradually, even to just 100 records per second. So, it takes another 1-2 hours to complete the remaining 10% of the workload.
Any idea why the workers are underutilized? If there is no utilization, why are they not scaled down? Here, I am paying unnecessarily for a large number of VM-s :-(. Second, why the throughput slows reduction towards the end, and thereby significantly increasing the time for completion?
There is an issue related to the throughput and input behavior of the Cloud Dataflow. I suggest you to track the improvements being made to the autoscaling and utilization behavior of workers here.
The default architecture for Dataflow worker processing and autoscaling is not as responsive in some cases compared to when the Dataflow Streaming Engine feature is enabled. I would recommend you to try running the relevant Dataflow pipeline with Streaming Engine enabled, since it provides a more responsive autoscaling performance based on CPU utilization for your pipeline.
I hope you find the above pieces of information useful.
Can you try to implement your solution without wait() ?
For example,
FileIO.match().filepattern() -> ParDo(DoFn to decrypt files) -> fileIO.readmatches() -> ParDo(DoFn to read files)
See the example here.
This should allow your pipeline to better parallelize.

Dataflow: system lag keeps going up

We are testing Cloud Dataflow which pulls message from Pub/Sub subscription and convert data to BigQuery TableRow and load them to BigQuery as load job in every 1 min 30 sec.
We can see the pipeline works well and can process 500,000 elements per second with 40 workers. But when trying autoscaling, the number of workers unexpectedly goes up to 40 and stay there even if we send only 50,000 messages to Pub/Sub. In this situation, no unacknowledged message and workers' CPU utilizations are bellow 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Is system lag affects autoscaling?
If so, is there any solutions or ways to debugging this problem?
Is system lag affects autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog in by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as the throughput of that stage and therefore should affect autoscaling. Then again, these two should not always be conflated. If for example a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, is there any solutions or ways to debugging this problem?
The most important tools you have at hand are the execution graph, stackdriver logging and stackdriver monitoring. From monitoring, you should consider jvm, compute and dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, see which steps are run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect that you can do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not say the full story, their size/content also matters. I think you should start by analyzing your graph, check which step has the highest lag, see what's the input/output ratio and check for messages with severity>=WARNING in your logs. If you shared any of those here maybe we could spot something more specific.

How do you determine how many resources to provision in a Google Dataflow streaming pipeline?

I'm curious how to decide on how to provision resources for Apache Beam pipelines running on Google's Dataflow platform. I've built a streaming pipeline (Beam Java 2.0.0) that takes a PubSub JSON string, transforms it to a BQ TableRow, then routes it to the correct tables. There are also two transforms within the pipeline, one with a 5 minute sliding window every minute and another window with a 1 minute fixed time duration.
For some context, each incoming message is about a 1KB JSON string, and at an extreme peak the pipeline will receive 250,000 messages in one second. My sliding time window could possibly grow to have 5,000,000 million tablerows / minute before it closes (worst case scenario, but that's what we're planning for). Our typical peak traffic usage is about 75k messages / second. However, 90% of the time our pipeline is processing only 30 messages / second.
We're running on dataflow with autoscaling enabled, and by default Google provisions 4 CPUs, 15GB, and 420gb * max_number of workers for streaming pipelines. With 10 max workers set, we're going to be paying for 4.2TB of disk usage a month. That seems a bit overkill, but I don't know what data I should be looking at to verify my theory.
Something I've been thinking about is to instead use 2 CPUs and 7.5 GB of memory with 20GB of SSD per worker, and setting the max number of workers at 50. Under this configuration, we'd have at minimum 4 workers.
Summary of my spiel:
- How do you determine the CPU, RAM, and disk space you need for your streaming pipelines?
- How do you determine that a pipeline should provision SSD resources instead of standard harddrives?
- What metric measurements can I look at to measure performance of my pipeline?
Since pipelines are very different, there is no all purpose general way to say how many workers and what sizes of disks to use. There are several approaches that do work well though:
Dataflow's horizontal scaling is very close to linear. This means
that if you run a sampled pipeline (eg by sampling 10% of your input
traffic) you can very quickly estimate the resources the full
pipeline will need, without overpaying. You can tell if the pipeline is "keeping up" with the input, if the system lag stays low, and the data watermark continues to advance. You can then estimate the
maximum number of workers that your pipeline will need at peak input rate using this strategy. Lets call this number m
Having done the above, you can then rely on autoscaling, having set the maxNumWorkers flag to a number k*m where k will effectively determine how quickly your pipeline can catch up from a backlog at peak load. Eg, at k=1 the pipeline can only keep up with peak load, so a backlog at peak load may never be drained, or wait for non-peak load to drain. at k=2 the pipeline can process 2x the peak load, so it will catch up faster. Of course this is a tradeoff for how many resources you are willing to pay for during backlog, and how much catchup latency you are willing to tolerate.
Autoscaling will also ensure that the pipeline downscales during non-peak load, so that you will not be paying for all of the resources during non-peak times.
A few other notes:
Streaming dataflow tends to perform better with 4 CPU workers vs 2 CPU workers. This is because there is some per-worker overhead, and certain tuning for work parallelism that is optimized to 4 CPU workers.
SSD use should already be enabled by default when using dataflow, as SSDs drastically improve write throughput and lead to much better performance.

Autoscaling in Google Cloud Dataflow is not working as expected

I am trying to enable autoscaling in my dataflow job as described in this article. I did that by setting the relevant algorithm via the following code:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED)
After I set this and deployed my job, it always works with the max. number of CPUs available, i.e. if I set max number of workers to 10, then it uses all 10 CPUs although average CPU usage is about 50%. How does this THROUGHPUT_BASED algorithm works and where I am making mistake?
Thanks.
Although Autoscaling tries to reduce both the backlog and CPU, backlog reduction takes priority. Specific values backlog matters, Dataflow calculates 'backlog in seconds' roughly as 'backlog / throughput' and tries to keep it below 10 seconds.
In your case, I think what is preventing downscaling from 10 is due to policy regarding persistent disks (PDs) used for pipeline execution. When max workers is 10, Dataflow uses 10 persistent disks and tries to keep the number of workers at any time such that these disks are distributed roughly equally. As a consequence when the pipeline is at its max workers of 10, it tries to downscale to 5 rather than 7 or 8. In addition, it tries to keep projected CPU after downscaling to no more than 80%.
These two factors might be effectively preventing downscaling in your case. If CPU utilization is 50% with 10 workers, the projected CPU utilization is 100% for 5 workers, so it does not downscale since it is above the target 80%.
Google Dataflow is working on new execution engine that does not depend on persistent disks and does not suffer from the limitation of amout of downscaling.
A work around for this is to set higher max_workers and your pipeline might still stay at 10 or below. But that incurs a small increase in cost for PDs.
Another remote possibility is that sometimes even after upscaling estimated 'backlog seconds' might not stay below 10 seconds even with enough CPU. This could be due to various factors (user code processing, pubsub batching, etc). Would like to hear if that is affecting your pipeline.

Resources