How to manage workers in Google Cloud Dataflow

I ran the program datastorewordcount.java from the Google Cloud cookbook examples. When I observe the Dataflow monitoring console, the number of workers never exceeds one. I am using the Google Cloud one-year free trial.
Why does autoscaling never increase the number of workers?

It usually takes a few minutes before my pipelines start increasing the number of workers.
You can specify the initial number of workers with numWorkers in your pipeline options:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
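For illustration, a minimal sketch of setting these options with the Apache Beam Python SDK (the Java SDK used by the cookbook example exposes the same settings, e.g. --numWorkers and --maxNumWorkers); the project, region, and bucket names below are placeholders:

# Minimal sketch (Beam Python SDK): start the job with more than one worker.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp",   # placeholder
    num_workers=3,       # initial workers, instead of the default of 1
    max_num_workers=10,  # upper bound for autoscaling
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(["example"]) | beam.Map(print)

With num_workers set, the job starts at that size right away rather than waiting for autoscaling to ramp up.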

Related

Dask workers get stuck in SLURM queue and won't start until the master hits the walltime

Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason listed as "Priority"). Meanwhile, the master would run and eventually hit the wall-time, leaving the workers to die when they finally started.
Thinking that the issue might be my requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on GitHub. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or launch new masters every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
I might be wrong on the below, but in my experience with SLURM, Dask itself won't be able to communicate with the SLURM scheduler. There is dask_jobqueue, which helps to create workers, so one option could be to launch the scheduler on a low-resource node (that presumably could be requested for longer).
There is a relatively new feature of heterogeneous jobs on SLURM (see https://slurm.schedmd.com/heterogeneous_jobs.html), and as I understand it, this will guarantee that your workers, scheduler and client launch at the same time; perhaps this is something your IT team can help with, as it is specific to SLURM (rather than Dask). Unfortunately, this will work only for non-interactive workloads.
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler. Because my Dask workers were requesting the maximum possible --time (24 hours), the backfill scheduler wasn't working effectively. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
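For reference, a rough sketch with dask_jobqueue that combines the suggestion above (keep the scheduler and client in a lightweight, long-lived process) with the accepted fix (request a walltime well below the 24-hour cap so the backfill scheduler can slot the workers in). The queue name, resources, and walltime here are placeholders:

# Rough sketch: scheduler/client stay in this process; only worker jobs go to SLURM.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="general",       # placeholder partition name
    cores=8,               # placeholder resources per worker job
    memory="32GB",
    walltime="04:00:00",   # well below the 24h cap, so backfill can place the jobs
)
cluster.scale(jobs=4)      # submit 4 worker jobs to the SLURM queue

client = Client(cluster)
# ... run the Dask workload through `client` here ...

Only the worker jobs compete in the queue; the scheduler keeps running wherever this script was launched.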

Cloud Dataflow Resource Share Pool

I wanted to check on a scenario where 30-40 jobs are running concurrently in Cloud Dataflow. Is there a setting by which the workers used by one job can be shared with other jobs, or a way to use a managed instance group as the compute option?
The reason for asking is to manage the risk of running out of compute instances or exceeding quota.
Cloud Dataflow manages the GCE instances internally. This means that it is unable to share the instances with other jobs. Please see here for more information.

Google Cloud DataFlow Autoscaling not working

I'm running a dataflow job that has 800K files to process.
The job id is 2018-08-23_07_07_46-4958738268363865409.
It reports that it has successfully listed the 800K files, but for some odd reason the autoscaler only assigned 1 worker to it. Since its processing rate is 2/sec, this is going to take a long time.
I didn't touch the default scaler settings which to my knowledge means it can scale freely up to 100 workers.
Why doesn't it scale?
Thanks,
Tomer
Update:
Following Neri's suggestion, I started a new job (id 2018-08-29_13_47_04-1454220104656653184) and set autoscaling_algorithm=THROUGHPUT_BASED, even though according to the documentation it should default to that anyway. Same behavior: processing speed is at 1 element per second and I have only one worker.
What's the use of running in the cloud if you cannot scale?
In order to autoscale your Dataflow job, be sure that you use autoscalingAlgorithm = THROUGHPUT_BASED.
If you use "autoscalingAlgorithm":"NONE", your Dataflow job will stay stuck at its initial size even though it could otherwise autoscale; in that case you need to specify the number of workers you want with numWorkers.
Also, to scale to the number of workers you want, be sure to specify (for numWorkers and maxNumWorkers) a number equal to or lower than your quota. You can check your quota by using:
gcloud compute project-info describe
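As a sketch of how those options might be set with the Beam Python SDK (the Python flag names are snake_case equivalents of the Java camelCase options above; project, region, and bucket names are placeholders):

# Sketch: enable throughput-based autoscaling and keep worker bounds within quota.
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp",   # placeholder
)
worker_options = options.view_as(WorkerOptions)
worker_options.autoscaling_algorithm = "THROUGHPUT_BASED"
worker_options.num_workers = 5        # initial workers
worker_options.max_num_workers = 50   # keep at or below the quota reported above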

Cloud DataFlow always maxes out at 15 workers?

Is it possible to know when the autoscaling feature is limited by some IAM quota on Google Cloud Dataflow? I'm finding that many of my tasks, regardless of size, boot time, or anything else, will grow until they hit 15 workers. It could be a coincidence, but I doubt it. I can turn autoscaling off and set the number of workers to 50 without a problem, so there isn't an explicit quota limit I'm hitting.
Quotas don't seem to be an issue, but even large tasks always seem to hit 15 workers. When I manually set 40 workers, the tasks finish much faster, which I know doesn't exactly mean that autoscaling isn't working, but it is concerning.
The default maximum number of workers for autoscaling is currently 15. If you would like to allow it to scale to more workers, you can use the --maxNumWorkers= option.
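For example, a minimal sketch of raising that cap (shown with the Beam Python SDK, where the flag is spelled --max_num_workers; in the Java SDK it is --maxNumWorkers; project, region, and bucket names are placeholders):

# Sketch: raise the autoscaling cap above the default of 15.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                 # placeholder
    "--region=us-central1",                 # placeholder
    "--temp_location=gs://my-bucket/temp",  # placeholder
    "--max_num_workers=50",                 # default cap of 15 applies when omitted
])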

Is it recommended to reduce my default requests quota for Cloud Dataflow?

The requests quota for Cloud Dataflow is set to 500,000 requests per 100 seconds by default. Is it recommended that I lower this value?
https://console.cloud.google.com/apis/api/dataflow.googleapis.com/quotas
The recommended minimum for requests is 500,000 per 100 seconds for the Cloud Dataflow API. Setting this value any lower has the potential to cause problems during Dataflow job execution.
Lowering this quota below 500,000 requests per 100 seconds may cause requests to fail with "429 Too Many Requests" errors in the worker logs.
The Dataflow team intends to apply a minimum floor to this quota in the future to ensure users do not lower quotas beyond what is recommended.
