Cloud Dataflow Resource Share Pool - google-cloud-dataflow

I wanted to check on a scenario where 30-40 jobs run concurrently in Cloud Dataflow. Is there a setting by which the workers used by one job can be shared across the others, or a way to use a managed instance group as the compute option?
The reason for asking is to find out whether the risk of running out of compute instances or exceeding quota can be managed.

Cloud Dataflow manages the GCE instances internally, which means the instances of one job cannot be shared with other jobs. Please see the Dataflow documentation for more information.
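Since workers cannot be shared between jobs, one practical lever for quota risk is capping how many workers each job can autoscale to. A minimal sketch, assuming the Beam Python SDK; the project, region, and bucket values are placeholders:

```python
# Sketch: cap per-job autoscaling so 30-40 concurrent jobs stay within quota.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    max_num_workers=4,                   # hard cap on autoscaling for this job
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create([1, 2, 3])
     | "Print" >> beam.Map(print))
```

With a cap like this per job, the worst-case total instance count is simply the cap multiplied by the number of concurrent jobs.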

Related

Dask workers get stuck in SLURM queue and won't start until the master hits the walltime

Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason for such being labeled as "Priority"). Meanwhile, the master would run and eventually hit the wall-time, leaving the workers to die when they finally started.
Thinking that the issue might be my requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on GitHub. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or launch new masters every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
I might be wrong on the below, but in my experience with SLURM, Dask itself won't be able to communicate with the SLURM scheduler. There is dask_jobqueue, which helps to create workers, so one option could be to launch the scheduler on a low-resource node (which presumably could be requested for longer).
There is a relatively new feature of heterogeneous jobs on SLURM (see https://slurm.schedmd.com/heterogeneous_jobs.html), and as I understand it, this will guarantee that your workers, scheduler and client launch at the same time; perhaps this is something that your IT team can help with, as it is specific to SLURM (rather than Dask). Unfortunately, this will work only for non-interactive workloads.
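For reference, a minimal dask_jobqueue sketch of that pattern: the script (and hence the scheduler) runs on a low-resource node, and each worker is submitted as its own SLURM job. The partition name and resource figures are assumptions to adapt to your cluster:

```python
# Run this from a login or low-resource node: the Dask scheduler lives in
# this process, while dask_jobqueue submits each worker as a SLURM job.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="normal",        # SLURM partition -- an assumption, adjust to your cluster
    cores=8,               # cores per worker job (assumed)
    memory="32GB",         # memory per worker job (assumed)
    walltime="24:00:00",   # per-worker wall-time limit
)
cluster.scale(jobs=4)      # ask SLURM for 4 worker jobs
client = Client(cluster)   # computations wait for workers rather than failing
```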
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler. Because my Dask workers were using the maximum possible --time (24 hours), this meant that the backfill scheduler wasn't working effectively. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
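In dask_jobqueue terms, the fix amounts to requesting a shorter wall-time for the worker jobs than the 24-hour maximum, for example:

```python
# A shorter worker wall-time lets SLURM's backfill scheduler slot the worker
# jobs into idle gaps instead of waiting for a full 24-hour window.
cluster = SLURMCluster(cores=8, memory="32GB", walltime="04:00:00")  # values assumed
```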

AWS ECS scale up and down ways

We are using the AWS ECS service with Docker containers running in it. These Docker containers run application code that continuously polls SQS, gets a single message, processes it, and then kills itself; that is the life cycle of a task.
Now we are scaling tasks and EC2 instances in the cluster based on the number of messages arriving in SQS. We are able to scale up, but it is difficult to scale down because we don't know whether any task is still processing a message, since message processing time is long due to some complex logic.
Could anybody suggest the best way to scale up and scale down in this case?
Have you considered using AWS Lambda for this use case rather than ECS (provided that your application logic runs in less than 5 minutes)? You can use SQS as a trigger for the Lambda. The AWS documentation, Using AWS Lambda with Amazon SQS, provides a comprehensive guide on how to achieve this with Lambda.
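For illustration, a minimal Python Lambda handler for an SQS trigger; the process function is a placeholder for your existing per-message logic. With the SQS event source, Lambda scales concurrent invocations with queue depth and deletes messages when the handler returns successfully:

```python
import json

def process(payload):
    # Placeholder for the existing per-message application logic.
    ...

def handler(event, context):
    # An SQS-triggered Lambda receives a batch of messages in event["Records"].
    for record in event["Records"]:
        process(json.loads(record["body"]))
    # Returning normally lets Lambda delete the batch from the queue;
    # raising an exception makes the messages visible again for a retry.
```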
The use case you have mentioned is not really a fit for ECS on EC2 instances. You should consider AWS ECS Fargate or AWS Batch. On one hand, Fargate gives you more capabilities in terms of infrastructure: tasks can run for longer periods, and tasks can be scaled according to parameters such as CPU or memory. On the other hand, you pay only for the tasks running at a given moment in your cluster.
Ref: https://aws.amazon.com/fargate/
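A rough sketch of the Fargate approach with boto3: launch one Fargate task per unit of work, so the task simply exits when processing ends and there is no scale-down decision to make. The cluster, task definition, subnet, and security-group identifiers below are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

def launch_worker():
    # One Fargate task per message/batch; the task exits when processing ends.
    return ecs.run_task(
        cluster="my-cluster",               # placeholder
        taskDefinition="sqs-worker:1",      # placeholder
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],     # placeholder
                "securityGroups": ["sg-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
    )
```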

Dataflow worker pool creation and deletion time overhead

In the execution of each Dataflow job, the job takes around 2-4 minutes for the creation and deletion of the VMs (the worker pool).
Please let me know if there is any way to minimize this.
OR
Can we create the VMs for processing before the Dataflow job executes, so that execution time can be brought down?
Dataflow is fully managed. From the documentation:
You should not attempt to manage or otherwise interact directly with your Compute Engine Managed Instance Group; the Dataflow service will take care of that for you. Manually altering any Compute Engine resources associated with your Dataflow job is an unsupported operation.
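While you cannot pre-provision the VMs yourself, you can hint at the initial pool size so the job starts at its target scale instead of ramping up from a single worker. A hedged sketch using the Beam Python SDK's standard Dataflow worker options; the values are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    num_workers=10,       # start the worker pool at this size...
    max_num_workers=10,   # ...and cap autoscaling, so start-up is predictable
)
```

Note that this does not remove the VM start-up and tear-down time itself; it only avoids the additional autoscaling ramp-up during the job.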

Can slave processes be dynamically provisioned based on load using Spring Cloud Data Flow?

We are currently using Spring Batch remote chunking for scaling the batch process. We are thinking of using Spring Cloud Data Flow, but would like to know whether slaves can be dynamically provisioned based on load.
We are deployed in Google Cloud, and hence would also want to consider Spring Cloud Data Flow's support for Kubernetes, if Spring Cloud Data Flow fits our needs.
When using the batch extensions of Spring Cloud Task (specifically the DeployerPartitionHandler), workers are dynamically launched as needed. That PartitionHandler allows you to configure a maximum number of workers; it will then process each partition as an independent worker up to that max (processing the rest of the partitions as others finish up). The "dynamic" aspect is really controlled by the number of partitions returned by the Partitioner: the more partitions returned, the more workers launched.
You can see a simple example configured to use CloudFoundry in this repo: https://github.com/mminella/S3JDBC The main difference between it and what you'd need is that you'd swap out the CloudFoundryTaskLauncher for a KubernetesTaskLauncher and its appropriate configuration.

What happens if I manually delete one of the VMs that Dataflow created?

I see the GCE instances that Dataflow created for my job in the GCE console. What happens if I delete them?
Manually altering resources provisioned by Google Cloud Dataflow is an unsupported operation. It will interfere with Dataflow’s clean-up process and might result in leftover resources and therefore extra cost. In particular, deleting the VMs of a streaming Dataflow job might leave persistent disks around, which will still be billed.
Using the Dataflow provisioned VMs or Persistent Disks for other purposes than the Dataflow job is also not supported. Do not attempt to reattach the disks to other machines, or to get the VMs to run other independent programs. The Dataflow service might get rid of these resources at any point, without warning, and any data on these resources will be lost.