Is it possible to know when the autoscaling feature on Google Cloud Dataflow is being limited by some IAM quota? I'm finding that many of my tasks, regardless of size, boot time, or anything else, grow until they hit 15 workers. It could be a coincidence, but I doubt it. I can turn autoscaling off and set the number of workers to 50 without a problem, so there isn't an explicit quota limit I'm hitting.
Quotas don't seem to be an issue, but even large tasks seem to always hit 15 workers. When I manually set 40 workers, the task finishes much faster, which I know doesn't exactly mean that autoscaling isn't working, but it is concerning.
The default maximum number of workers for autoscaling is currently 15. If you would like to allow it to scale to more workers, you can use the --maxNumWorkers= option.
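For example, assuming the Apache Beam Java SDK with the Dataflow runner (the class name and the value 50 below are just illustrative), the flag can be passed on the command line and picked up like this:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MaxWorkersExample {
  public static void main(String[] args) {
    // Launch with e.g.: --runner=DataflowRunner --project=<your-project> --maxNumWorkers=50
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... add your transforms here ...
    pipeline.run();
  }
}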
We are testing Cloud Dataflow, which pulls messages from a Pub/Sub subscription, converts the data to BigQuery TableRows, and loads them into BigQuery as a load job every 1 min 30 sec.
We can see that the pipeline works well and can process 500,000 elements per second with 40 workers. But when we try autoscaling, the number of workers unexpectedly goes up to 40 and stays there even if we send only 50,000 messages to Pub/Sub. In this situation, there are no unacknowledged messages and the workers' CPU utilization is below 60%. One thing we noticed is that the Dataflow system lag goes up slowly.
Does system lag affect autoscaling?
If so, are there any solutions or ways to debug this problem?
Does system lag affect autoscaling?
Google does not really expose the specifics of its autoscaling algorithm. Generally, though, it is based on CPU utilization, throughput and backlog. Since you're using Pub/Sub, the backlog by itself should be based on the number of unacknowledged messages. Still, the rate at which these are being consumed (i.e. the throughput at the Pub/Sub read stage) is also taken into account. Now, throughput as a whole relates to the rate at which each stage processes input bytes. As for CPU utilization, if the aforementioned don't "run smoothly", 60% usage is already too high. So, system lag at some stage could be interpreted as a sign of low throughput at that stage and therefore should affect autoscaling. Then again, the two should not always be conflated. If, for example, a worker gets stuck due to a hot key, system lag is high but there's no autoscaling, as the work is not parallelizable. So, all in all, it depends.
If so, are there any solutions or ways to debug this problem?
The most important tools you have at hand are the execution graph, Stackdriver Logging and Stackdriver Monitoring. From monitoring, you should consider the JVM, compute and Dataflow metrics. gcloud dataflow jobs describe can also be useful, mostly to see how steps are fused and, by extension, which steps run in the same worker, like so:
gcloud dataflow jobs describe --full $JOB_ID --format json | jq '.pipelineDescription.executionPipelineStage[] | {"stage_id": .id, "stage_name": .name, "fused_steps": .componentTransform }'
Stackdriver monitoring exposes all three of the main autoscaling components.
Now, how you're going to take advantage of the above obviously depends on the problem. In your case, at first glance I'd say that, if you can work without autoscaling and with 40 workers, you should normally expect to be able to do the same with autoscaling when you've set maxNumWorkers to 40. Then again, the number of messages alone does not tell the full story; their size/content also matters. I think you should start by analyzing your graph: check which step has the highest lag, see what the input/output ratio is, and check for messages with severity>=WARNING in your logs. If you shared any of those here, maybe we could spot something more specific.
Is there a limit on the number of queues you can create with AWS SQS?
This page https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-limits.html doesn't say one way or the other.
We're not looking to create thousands of the things, but we might dynamically create a good few dozen for a while and then destroy them. I've come across unexpected limits with AWS before (only 4 transcoding pipelines - why?), so I need to be sure on this.
Thanks for any advice.
AB
Indeed, there is no information about that in the AWS documentation.
I don't think there is a limit on the number of queues.
We are actually running 28 full-time queues on our infrastructure without any problem.
If you do hit a limit, a simple AWS support ticket can get it increased.
Just like the EC2 instance limit increase process.
Hope it helps
I'm running a dataflow job that has 800K files to process.
The job id is 2018-08-23_07_07_46-4958738268363865409.
It reports that it has successfully listed the 800K files, but for some odd reason the autoscaler only assigned 1 worker to it. Since its processing rate is 2/sec, this is going to take a loooong time.
I didn't touch the default autoscaler settings, which to my knowledge means it can scale freely up to 100 workers.
Why doesn't it scale?
Thanks,
Tomer
Update:
Following Neri's suggestion, I started a new job (id 2018-08-29_13_47_04-1454220104656653184) and set autoscaling_algorithm=THROUGHPUT_BASED, even though according to the documentation it should default to that anyway. Same behavior: processing speed is at 1 element per second and I still have only one worker.
What's the use of running in the cloud if you cannot scale?
In order to autoscale your Dataflow Job, be sure that you use autoscalingAlgorithm = THROUGHPUT_BASED.
If you use "autoscalingAlgorithm":"NONE", your Dataflow job will stay at the number of workers it starts with even if it could autoscale; in that case you will need to specify the number of workers you want with numWorkers.
Also, to scale to the number of workers you want, be sure to specify (for numWorkers and maxNumWorkers) a number equal to or lower than your quota. You can check your quota by using:
gcloud compute project-info describe
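For reference, here is a minimal sketch of setting these options programmatically, assuming the Apache Beam Java SDK with the Dataflow runner (the numbers are illustrative and should stay within the quota reported by the command above):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// ... inside main(), before creating the pipeline ...
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED); // NONE disables autoscaling
options.setNumWorkers(5);      // starting number of workers (illustrative)
options.setMaxNumWorkers(40);  // keep this at or below your CPU quota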
I ran the program datastorewordcount.java from the Google Cloud cookbook examples. When I observe the Dataflow monitoring console, the number of workers never exceeds one. I am using the Google Cloud one-year free trial.
Why is autoscaling never increasing number of workers?
It usually takes my pipelines a few minutes before they start upping the number of workers.
You can specify the initial number of workers with numWorkers in your pipeline options:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
I am trying to enable autoscaling in my dataflow job as described in this article. I did that by setting the relevant algorithm via the following code:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
After I set this and deployed my job, it always runs with the maximum number of workers available, i.e. if I set the max number of workers to 10, then it uses all 10 even though average CPU usage is about 50%. How does this THROUGHPUT_BASED algorithm work, and where am I making a mistake?
Thanks.
Although autoscaling tries to reduce both the backlog and CPU, backlog reduction takes priority. The specific value of the backlog matters: Dataflow calculates 'backlog in seconds' roughly as 'backlog / throughput' and tries to keep it below 10 seconds.
In your case, I think what is preventing downscaling from 10 is the policy regarding persistent disks (PDs) used for pipeline execution. When max workers is 10, Dataflow uses 10 persistent disks and tries to keep the number of workers at any time such that these disks are distributed roughly equally. As a consequence, when the pipeline is at its maximum of 10 workers, it tries to downscale to 5 rather than 7 or 8. In addition, it tries to keep the projected CPU after downscaling to no more than 80%.
These two factors might be effectively preventing downscaling in your case. If CPU utilization is 50% with 10 workers, the projected CPU utilization is 100% for 5 workers, so it does not downscale since that is above the 80% target.
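To make those two checks concrete, here is a small illustrative calculation (plain Java, not actual Dataflow code; the 10-second and 80% thresholds are the rough targets described above, and the input numbers are made up):

public class AutoscalingHeuristicsSketch {
  public static void main(String[] args) {
    // Backlog check: 'backlog in seconds' ~= backlog / throughput, target below 10 s.
    double backlogElements = 50_000;    // elements waiting to be processed (hypothetical)
    double throughputPerSec = 20_000;   // elements processed per second (hypothetical)
    double backlogSeconds = backlogElements / throughputPerSec;
    System.out.printf("Backlog seconds: %.1f (upscale pressure once this exceeds ~10)%n", backlogSeconds);

    // Downscale check: with max workers = 10 the PD layout allows 10 -> 5,
    // and the projected CPU after downscaling must stay at or below ~80%.
    double currentCpu = 0.50;           // 50% average CPU at 10 workers
    int currentWorkers = 10;
    int candidateWorkers = 5;
    double projectedCpu = currentCpu * currentWorkers / candidateWorkers;
    System.out.printf("Projected CPU at %d workers: %.0f%% (no downscale while this is above 80%%)%n",
        candidateWorkers, projectedCpu * 100);
  }
}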
Google Dataflow is working on a new execution engine that does not depend on persistent disks and does not suffer from this limitation on the amount of downscaling.
A workaround for this is to set a higher max_workers; your pipeline might still stay at 10 or below. But that incurs a small increase in the cost of PDs.
Another remote possibility is that sometimes, even after upscaling, the estimated 'backlog seconds' might not stay below 10 seconds even with enough CPU. This could be due to various factors (user code processing, Pub/Sub batching, etc.). I would like to hear if that is affecting your pipeline.