I have a dataflow job that reads from pubsub subscription and writes to a Redis instance on a fixed time windows. The job seems to be running well with 4 workers and almost 0s systems latency until I try to drain it which causes upscaling to 10 workers and takes hours to finish.
I suspect this is caused by the windowing/grouping since the output collection metric suggest that it keeps producing elements long after the drain is started.
That's the windowing that I'm using.
beam.WindowInto(
window.FixedWindows(size=120),
trigger=Repeatedly(
AfterAny(AfterCount(100), AfterProcessingTime(120))),
accumulation_mode=AccumulationMode.DISCARDING,
allowed_lateness=Duration(seconds=2 * 24 * 60 * 60))
Given your large allowed lateness, when you drain your pipeline it causes every window for every key seen in the last 48 hours to close all at once, which is probably why there's so much work and it's upscaling. This could be especially bad if your keys are not often re-used.
Related
I made a simple program which waits for 60 seconds. I have 300 input elements to process.
Number of threads - Batch - 1 and Streaming - 300 per this document
https://cloud.google.com/dataflow/docs/resources/faq#beam-java-sdk
In streaming mode - with 1 worker and 300 threads, job should complete in 2 to 3 minutes considering the overhead of spawning workers etc. My understanding is there will be 300 threads for each of 300 input elements and all sleep for 60 seconds and the job should get done. However, the job takes more time to complete.
Similarly, in Batch mode with 1 worker (1 Thread) and 300 input elements, it should take 300 minutes to complete.
Can someone clarify how this happens at worker level ?
There is considerable overhead in starting up and tearing down worker VMs, so it's hard to generalize from a short experiment such as this. In addition, there's not promise that there will be a given number of workers for streaming or batch, as this is an implementation-dependent parameter that my change at any time for any runner (and indeed may even be chosen dynamically).
In our streaming pipeline we read data from pubsub, do some validations and then group it by a key in a 10 second gap session window. Afterwards the data is processed further and written to bigtable and pubsub again.
We're using apache beam 2.28 and the dataflow streaming engine. During the day we process more data than over night and the pipeline scales up the number of workers (n2d-standard-4) automatically. Mostly it scales up from 2 workers to 4 or 5 to reduce the backlog. After that it will scale down again as the CPU utilization is too low for 4 or 5 workers.
It is at this point that the CPU utilization drops to nearly 0% for all workers and the entire pipeline starts lagging behind massively. The result is that the number of workers is scaled up to a higher number again and the pipeline processing the data further. After the backlog is reduced again, the number of workers is gradually lowered and the same issue arises.
metrics
What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.
GroupByKey throughput
I know using GroupByKey can have hotkeys, but then I would expect the CPU utilization of 1 worker to be very high while the others have nothing to do.
Does anyone know what might be causing this issue?
The issue was caused by by the combination of using the session window with a groupbykey, how the watermark for a pubsub unbounded source works and when the acknowledges are being sent to pubsub.
Our session window with a gap of 10 seconds sometimes didn't output any messages for a couple of minutes (due to no early trigger being configured and messages continuously arriving for the same key within the 10 second session gap). Because these steps are part of the first fused stage in the actual execution of our pipeline, this lead to some messages not being acknowledged to pubsub (the ack is only sent when the first fused stage is completed). The oldest unacknowledged message time on the subscription kept on rising, causing the watermark not to advance.
This issue was became more outspoken due to the acknowledgement deadline being set to 10 minutes. When the number of workers scaled down, this caused the issue described in the original question.
We were able to solve this by adding a Reshuffle before the creation of the session window (with the groupbykey) and decreasing the acknowledgement deadline.
https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
I have a Beam script running in GCP Dataflow. This data flow performs the below steps:
Read a number of files that are PGP encrypted. (Total size more than 100 GB, individual files are of 2 GB in size)
Decrypt the files to form a PCollection
Do a wait() on PCollection
Do some processing on each record in the PCollection before writing into an output file
Behavior seen with GCP Dataflow:
When reading the input files and decrypting the files, it starts with one workers, and then scales upto 30 workers. But, only one worker continues to be utilized, utilization in all other workers is less than 10 %
Initially, throughput was 150K records per second while decryption. So, 90% of the decryption gets completed in 1 hours, which is good. But, then the throughput slows down gradually, even to just 100 records per second. So, it takes another 1-2 hours to complete the remaining 10% of the workload.
Any idea why the workers are underutilized? If there is no utilization, why are they not scaled down? Here, I am paying unnecessarily for a large number of VM-s :-(. Second, why the throughput slows reduction towards the end, and thereby significantly increasing the time for completion?
There is an issue related to the throughput and input behavior of the Cloud Dataflow. I suggest you to track the improvements being made to the autoscaling and utilization behavior of workers here.
The default architecture for Dataflow worker processing and autoscaling is not as responsive in some cases compared to when the Dataflow Streaming Engine feature is enabled. I would recommend you to try running the relevant Dataflow pipeline with Streaming Engine enabled, since it provides a more responsive autoscaling performance based on CPU utilization for your pipeline.
I hope you find the above pieces of information useful.
Can you try to implement your solution without wait() ?
For example,
FileIO.match().filepattern() -> ParDo(DoFn to decrypt files) -> fileIO.readmatches() -> ParDo(DoFn to read files)
See the example here.
This should allow your pipeline to better parallelize.
We are testing Cloud Dataflow which pulls message from Pub/Sub subscription and convert data to BigQuery TableRow and load them to BigQuery as load job in every 1 min 30 sec.
We can see the pipeline works well and can process 500,000 elements per second with 40 workers. But when trying autoscaling, the number of workers unexpectedly goes up to 40 and stay there even if we send only 50,000 messages to Pub/Sub. In this situation, no unacknowledged message and workers' CPU utilizations are bellow 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Is system lag affects autoscaling?
If so, is there any solutions or ways to debugging this problem?
Is system lag affects autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog in by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as the throughput of that stage and therefore should affect autoscaling. Then again, these two should not always be conflated. If for example a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, is there any solutions or ways to debugging this problem?
The most important tools you have at hand are the execution graph, stackdriver logging and stackdriver monitoring. From monitoring, you should consider jvm, compute and dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, see which steps are run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect that you can do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not say the full story, their size/content also matters. I think you should start by analyzing your graph, check which step has the highest lag, see what's the input/output ratio and check for messages with severity>=WARNING in your logs. If you shared any of those here maybe we could spot something more specific.
I have a question regarding the reserved CPU time field in Google Dataflow. I don't understand why it varies so widely depending on the configuration of my run. I suspect that I am not interpreting the reserved CPU time for what it really is. To my understanding, it is the CPU time that was needed to complete the job I submitted, but based on the following evidence, it seems I may be mistaken. Is it the time that is allocated to your job, regardless of whether it is actually using the resources? If that's the case, how do I get the actual CPU time of my job?
First I ran my job with a variable sized pool of workers (max 24 workers).
The corresponding stats are as follows:
Then, I ran my script using a fixed number of workers (10):
And the stats changes to:
They went from 15 days to 7 hours? How is that possible?!
Thanks!
If you hover over the "?" next to "Reserved CPU time" a pop-up message will show and it will read: "The total time Dataflow was active on GCE instances, on a per-CPU basis." This indicates it is not the CPU-time used by the VMs. At this time Dataflow does not aggregate per-machine CPU usage stats; you may, however, be able to use the cloud monitoring API to extract those metrics yourself.