Google Cloud DataFlow Autoscaling not working - google-cloud-dataflow

I'm running a dataflow job that has 800K files to process.
The job id is 2018-08-23_07_07_46-4958738268363865409.
It reports that it has successfully listed 800K files, but for some odd reason, the autoscaler only assigned 1 worker to it. Since it's processing rate is 2/sec, this is going to take a loooong time.
I didn't touch the default scaler settings which to my knowledge means it can scale freely up to 100 workers.
Why doesn't it scale?
Thanks,
Tomer
Update:
Following Neri's suggestion, I started a new job (id 2018-08-29_13_47_04-1454220104656653184) and set autoscaling_algorithm=THROUGHPUT_BASED even though according to the documentation it should default to that anyway. Same behavior. processing speed is at 1 element per second and I have only one worker.
What's the use of running in the cloud if you cannot scale?

In order to autoscale your Dataflow Job, be sure that you use autoscalingAlgorithm = THROUGHPUT_BASED.
If you use "autoscalingAlgorithm":"NONE", then your Dataflow Job will get stuck even if it could autoscale. Otherwise, you will need to specify the number of workers you want on numWorkers.
Also, to scale to the amount of workers you want, be sure to specify (for numWorkers and maxNumWorkers) a number equal or lower to your quota, check your quota by using:
gcloud compute project-info describe

Related

Shuffle service now enabled by default in apache beam?

Without any changes on our side, our jobs using the python SDK for dataflow have started using the shuffle service:
According to the docs:
To use the service-based Dataflow Shuffle in your batch pipelines, specify the following parameter:
--experiments=shuffle_mode=service
However, we have not enabled this flag.
One major effect of this is the default size of the disk has gone from 250GB to 25GB on our workers. In one case, we ran out of disk space while the worker was starting up leading to a hung job never finishing.
Questions are:
Is this a change in the underlying dataflow environment?
Where do such changes get announced?
Is this a change in the underlying dataflow environment?
Yes, it is.
Where do such changes get announced?
I couldn't find any announcement of this change. But it should be updated here. I'll make sure it's up-to-date.
Since Oct. 2020, batch jobs have began to opt into using Dataflow Shuffle by default. To opt out of using it, specify --experiments=shuffle_mode=appliance.

Can't generate more than ~8000 RPM from Locust

I'm using Locust to load test my web servers. I'm running Locust in distributed mode. The worker nodes are written in Java, and use the Locust/Java port using locust4j. The master node and the worker nodes are containerized, our orchestrator is Kubernetes. When I want to spin up more workers, I'm doing it from there.
The problem that I'm running into is that no matter how many users I add, or worker nodes I add, I can't seem to generate more than ~8000 RPM. This is confirmed by the Locust web frontend, as well as the metrics I'm collecting from my web server.
Does anyone have any ideas why this is happening?
I've attached an image of timings I've collected. The snapshots are from running the load test for 60 seconds, I'm timing it from a stopwatch.
The usual culprit in these kinds of situations is your servers can't handle more than that. In my experience, the behavior you'll see client side as the servers get overwhelmed is you'll start to see a slow but steady increase in response times. This is one big reason why Locust includes those in the metrics it shows you.
Based on what I'm seeing in your screenshots, this is most likely the case for you. You have some very low minimum times but your average, median, and 90%iles are a lot higher than your minimums; your maximums are very significantly higher than those. Without seeing your charts I can't know for sure but that's a big red flag.
For more things to look out for, check out this question in the FAQ (especially see the list of server stats to investigate):
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps

Dataflow: system lag keeps going up

We are testing Cloud Dataflow which pulls message from Pub/Sub subscription and convert data to BigQuery TableRow and load them to BigQuery as load job in every 1 min 30 sec.
We can see the pipeline works well and can process 500,000 elements per second with 40 workers. But when trying autoscaling, the number of workers unexpectedly goes up to 40 and stay there even if we send only 50,000 messages to Pub/Sub. In this situation, no unacknowledged message and workers' CPU utilizations are bellow 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Is system lag affects autoscaling?
If so, is there any solutions or ways to debugging this problem?
Is system lag affects autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, backlog in by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as the throughput of that stage and therefore should affect autoscaling. Then again, these two should not always be conflated. If for example a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, is there any solutions or ways to debugging this problem?
The most important tools you have at hand are the execution graph, stackdriver logging and stackdriver monitoring. From monitoring, you should consider jvm, compute and dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, see which steps are run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and 40 workers, you should normally expect that you can do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not say the full story, their size/content also matters. I think you should start by analyzing your graph, check which step has the highest lag, see what's the input/output ratio and check for messages with severity>=WARNING in your logs. If you shared any of those here maybe we could spot something more specific.

Beam Runner hooks for Throughput-based autoscaling

I'm curious if anyone can point me towards greater visibility into how various Beam Runners manage autoscaling. We seem to be experiencing hiccups during both the 'spin up' and 'spin down' phases, and we're left wondering what to do about it. Here's the background of our particular flow:
1- Binary files arrive on gs://, and object notification duly notifies a PubSub topic.
2- Each file requires about 1Min of parsing on a standard VM to emit about 30K records to downstream areas of the Beam DAG.
3- 'Downstream' components include things like inserts to BigQuery, storage in GS:, and various sundry other tasks.
4- The files in step 1 arrive intermittently, usually in batches of 200-300 every hour, making this - we think - an ideal use case for autoscaling.
What we're seeing, however, has us a little perplexed:
1- It looks like when 'workers=1', Beam bites off a little more than it can chew, eventually causing some out-of-RAM errors, presumably as the first worker tries to process a few of the PubSub messages which, again, take about 60 seconds/message to complete because the 'message' in this case is that a binary file needs to be deserialized in gs.
2- At some point, the runner (in this case, Dataflow with jobId 2017-11-12_20_59_12-8830128066306583836), gets the message additional workers are needed and real work can now get done. During this phase, errors decrease and throughput rises. Not only are there more deserializers for step1, but the step3/downstream tasks are evenly spread out.
3-Alas, the previous step gets cut short when Dataflow senses (I'm guessing) that enough of the PubSub messages are 'in flight' to begin cooling down a little. That seems to come a little too soon, and workers are getting pulled as they chew through the PubSub messages themselves - even before the messages are 'ACK'd'.
We're still thrilled with Beam, but I'm guessing the less-than-optimal spin-up/spin-down phases are resulting in 50% more VM usage than what is needed. What do the runners look for beside PubSub consumption? Do they look at RAM/CPU/etc??? Is there anything a developer can do, beside ACK a PubSub message to provide feedback to the runner that more/less resources are required?
Incidentally, in case anyone doubted Google's commitment to open-source, I spoke about this very topic with an employee there yesterday, and she expressed interest in hearing about my use case, especially if it ran on a non-Dataflow runner! We hadn't yet tried our Beam work on Spark (or elsewhere), but would obviously be interested in hearing if one runner has superior abilities to accept feedback from the workers for THROUGHPUT_BASED work.
Thanks in advance,
Peter
CTO,
ATS, Inc.
Generally streaming autoscaling in Dataflow works like this :
Upscale: If the pipeline's backlog is more than a few seconds based on current throughput, pipeline is upscaled. Here CPU utilization does not directly affect the amount of upsize. Using CPU (say it is at 90%), does not help in answering the question 'how many more workers are required'. CPU does affect indirectly since pipelines fall behind when they they don't enough CPU thus increasing backlog.
Downcale: When backlog is low (i.e. < 10 seconds), pipeline is downcaled based on current CPU consumer. Here, CPU does directly influence down size.
I hope the above basic description helps.
Due to inherent delays involved in starting up new GCE VMs, the pipeline pauses for a minute or two during resizing events. This is expected to improve in near future.
I will ask specific questions about the job you mentioned in description.

Is there any way to set numWorkers dynamically in the middle of dataflow job running?

I am using google dataflow on my work.
While I am using dataflow, I need to set number of workers dynamically while dataflow batch job is running.
That's mainly because of cloud bigtable QPS.
We are using 3 bigtable cluster nodes and they can't afford to receiving all traffics from 500 number of workers instantly.
So, I gotta change number of workers(from 500 to 25) just before trying to insert all the processed data into the bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you try to can artificially limit the parallelism of your pipeline by a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDo's following it will likely get fused together with the writing to Bigtable, and will thus have a parallelism of 25.
The caveat is that Dataflow is within its right to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on number of workers, or use a larger Bigtable cluster, or both.
There's a lot of relevant information in the DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP talk from GCP/Next.
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.

Resources