By my reading of Dask-Jobqueue (https://jobqueue.dask.org/en/latest/), and by testing on our SLURM cluster, it seems that when you call cluster.scale(n) and create client = Client(cluster), none of your jobs start doing work until all n of them are able to start.
Suppose you have 999 jobs to run and a cluster with 100 nodes or slots; worse yet, suppose other people share the cluster, and some of them have long-running jobs. Admins sometimes need to do maintenance on some of the nodes, so they add and remove nodes. You never know how much parallelism you'll be able to get. You want the cluster scheduler to simply take the 999 jobs (in Slurm, these would be submitted via sbatch), run them in any order on any available nodes, store the results in a shared directory, and have a dependent job (in Slurm, that would be sbatch --dependency=) process the shared directory after all 999 jobs have completed. Is this possible with Dask somehow?
It seems like a fundamental limitation of the architecture that all the jobs are expected to run in parallel and that the user must specify the degree of parallelism.
Your understanding is not correct. Dask can run with fewer than the specified number of jobs, just as you've asked for. It will use whatever resources arrive.
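To illustrate (a minimal sketch, not taken from the dask-jobqueue docs, assuming dask_jobqueue on a SLURM cluster; the partition name, resource sizes and run_task are placeholders), you can ask for many worker jobs and let Dask use whichever ones SLURM actually starts:
# Minimal sketch, assuming dask_jobqueue is installed; queue/cores/memory
# values are placeholders for whatever your site requires.
from dask.distributed import Client, as_completed
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="normal",        # hypothetical partition name
    cores=4,
    memory="8GB",
    walltime="01:00:00",
)

# Request up to 100 worker jobs. Work begins on whichever jobs SLURM starts;
# Dask does not wait for all 100 to be running.
cluster.scale(jobs=100)
# Alternatively, let the cluster grow and shrink with the workload:
# cluster.adapt(minimum=0, maximum=100)

client = Client(cluster)

def run_task(i):
    ...                    # placeholder for your per-task work
    return i

# 999 tasks, not 999 separate SLURM submissions; results come back as
# workers become available, in any order.
futures = client.map(run_task, range(999))
results = [f.result() for f in as_completed(futures)]
# Anything placed after this line plays the role of the dependent job.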
Related: Submitting jobs on a GPU cluster managed by Slurm
I am doing some experiments, and as you know, we have to tune the hyperparameters, which means I need to run several similar scripts with different hyperparameter settings. So I wrote multiple bash scripts (say, named training_n.sh) to execute them; each script looks like this:
# training_n.sh
srun [command with specific model/training hyperparameters]
Then I use sbatch to execute these scripts; the sbatch script looks like this:
# sbatch script
bash training_1.sh
bash training_2.sh
...
bash training_n.sh
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that my remaining "srun"s can only be executed after those users' jobs have completed?
Additionally, are there any better ways to submit a batch of experiment scripts on a publicly used cluster? Since many people are using it, I want to complete all my planned experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users' jobs to finish before my next one can start.
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that my remaining "srun"s can only be executed after those users' jobs have completed?
If you submit all these single srun scripts/commands in a single sbatch script, you will only get one job. The reason is that srun works differently inside a job allocation than outside. If you run srun inside a job allocation (e.g. in an sbatch script), it does not create a new job; it just creates a job step. So in your case, you will have a single job with n job steps, which will run consecutively within your allocation.
Additionally, are there any better ways to submit a batch of experiment scripts on a publicly used cluster?
If these runs are completely independent, you should use a job array of size n. That way you create n jobs that can each run whenever resources become available (see the sketch at the end of this answer).
Since many people are using it, I want to complete all my planned experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users' jobs to finish before my next one can start.
That might not be a good idea. If these jobs are independent, you are better off submitting them as an array. That way they can take advantage of backfill scheduling and might start sooner. You likely don't gain anything by putting them into one large job.
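For example, here is a minimal sketch of submitting such an array from Python (assuming sbatch is on the PATH and your scripts are named training_1.sh ... training_n.sh as above; the partition and time values are placeholders):
# submit_array.py -- sketch: submit n independent runs as one SLURM job array
import subprocess

n = 10  # number of experiments

subprocess.run(
    [
        "sbatch",
        f"--array=1-{n}",
        "--partition=gpu",      # placeholder partition name
        "--time=04:00:00",      # placeholder walltime
        # Each array task runs its own script; $SLURM_ARRAY_TASK_ID expands
        # at run time inside the generated batch script.
        "--wrap=bash training_${SLURM_ARRAY_TASK_ID}.sh",
    ],
    check=True,
)
# Every array task is an independent job, so SLURM (and backfill) can start
# any of them as soon as resources free up, in any order.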
Lately, I've been trying to do some machine learning work with Dask on an HPC cluster which uses the SLURM scheduler. Importantly, on this cluster SLURM is configured to have a hard wall-time limit of 24h per job.
Initially, I ran my code with a single worker, but my job was running out of memory. I tried to increase the number of workers (and, therefore, the number of requested nodes), but the workers got stuck in the SLURM queue (with the reason shown as "Priority"). Meanwhile, the master would run, eventually hit the wall-time, and leave the workers to die when they finally started.
Thinking that the issue might be that I was requesting too many SLURM jobs, I tried condensing the workers into a single, multi-node job using a workaround I found on GitHub. Nevertheless, these multi-node jobs ran into the same issue.
I then attempted to get in touch with the cluster's IT support team. Unfortunately, they are not too familiar with Dask and could only provide general pointers. Their primary suggestions were to either put the master job on hold until the workers were ready, or to launch a new master every 24h until the workers could leave the queue. To help accomplish this, they cited the SLURM options --begin and --dependency. Much to my chagrin, I was unable to find a solution using either suggestion.
As such, I would like to ask if, in a Dask/SLURM environment, there is a way to force the master to not start until the workers are ready, or to launch a master that is capable of "inheriting" workers previously created by another master.
Thank you very much for any help you can provide.
I might be wrong on the below, but in my experience with SLURM, Dask itself cannot communicate with the SLURM scheduler. There is dask_jobqueue, which helps to create the workers, so one option could be to launch the Dask scheduler on a low-resource node (which presumably could be requested for a longer wall-time).
There is a relatively new SLURM feature called heterogeneous jobs (see https://slurm.schedmd.com/heterogeneous_jobs.html); as I understand it, this guarantees that your workers, scheduler and client launch at the same time, and perhaps this is something your IT team can help with, as it is specific to SLURM (rather than Dask). Unfortunately, this will work only for non-interactive workloads.
The answer to my problem turned out to be deceptively simple. Our SLURM configuration uses the backfill scheduler. Because my Dask workers were using the maximum possible --time (24 hours), this meant that the backfill scheduler wasn't working effectively. As soon as I lowered --time to the amount I believed was necessary for the workers to finish running the script, they left "queue hell"!
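For anyone who ends up using dask_jobqueue (as suggested in the other answer), the worker wall-time is just a parameter, so the same fix looks roughly like this (a sketch only; the resource values are placeholders, and the point is simply to request less than the 24h cap so backfill can slot the workers in):
# Sketch: request a walltime well below the 24h site limit so the backfill
# scheduler can fit the worker jobs into gaps in the queue.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=8,               # placeholder resources
    memory="32GB",
    walltime="04:00:00",   # what the work actually needs, not the 24h maximum
)
cluster.scale(jobs=4)      # several smaller worker jobs instead of one big one
client = Client(cluster)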
I have a scenario where I have long-running jobs that I need to move to a background process. Delayed job with a single worker would be very simple to implement, but would run very, very slowly as jobs mount up. Much of the work is slow because the thread has to sleep to wait on various remote API calls, so running multiple workers concurrently is a very obvious choice.
Unfortunately, some of these jobs are dependent on each other. I can't run two jobs belonging to the same identifier simultaneously. Order doesn't matter, only that exactly one worker can be working on a given ID's work.
My first thought was named queues, naming a queue per identifier, but the identifiers are dynamic data. We could be running ID 1 today, 5 tomorrow, 365849 and 645609 the next day, and so on. That's far too many named queues. Not only would giving each one a dedicated worker probably exceed the available system resources (as well as being incredibly wasteful, since most of them won't be active at any given time), but since workers are configured through environment variables rather than from inside the code, I'd wind up with some insane config files. And creating a sane pool of N generic workers could end up with all N workers running on the same queue if that's the only queue with work to do.
So what I need is a way to prevent two jobs sharing a given ID from running at the same time, while allowing any number of jobs not sharing IDs to run concurrently.
We are running Jenkins with lots of jobs. At the moment these jobs are loosely grouped by using "master jobs", which do nothing but start all the jobs of one group. But when one of these master jobs runs, it starts around 10 other jobs at once. Depending on the duration of these jobs and the number of build processors (at the moment 6), Jenkins is blocked for a long time (up to an hour). The other problem is that these jobs are not really suitable for such massive parallelization.
To solve this, I'm looking for a way (a plugin) to group some jobs and start them in parallel, but to limit the number of build processors used by the jobs of this group to a fixed number (e.g. 2). That way it would be possible to run a group of jobs that compile Java projects and, in parallel, another group of jobs that install test databases.
I tried the Build Flow plugin, but it's not really the right one: you must manually split the jobs into the sub-groups that run in parallel, and if a job in one sub-group fails, the following jobs of that group are not started.
So, maybe someone knows a Jenkins plugin that fits better? Thanks a lot in advance!
Frank
Throttle Concurrent Builds Plugin
Create a category, e.g. my-group.
Add all the jobs to this category.
Set Maximum Total Concurrent Builds and Maximum Concurrent Builds Per Node.
I need some advice on writing a job scheduler in Erlang that is able to distribute jobs (external OS processes) over a set of worker nodes. A job can last from a few milliseconds to a few hours. The "scheduler" should be a global registry where jobs come in, get sorted, and then get assigned to and executed on connected "worker nodes". Worker nodes should be able to register with the scheduler by telling it how many jobs they can process in parallel (slots). Worker nodes should be able to join and leave at any time.
An Example:
Scheduler has 10 jobs waiting
Worker Node A connects and is able to process 3 jobs in parallel
Worker Node B connects and is able to process 1 job in parallel
Some time later, another worker node joins that is able to process 2 jobs in parallel
I have seriously spent some time thinking about the problem, but I am still not sure which way to go. My current solution is to have a globally registered gen_server for the scheduler, which holds the jobs in its state. Every worker node spawns N worker processes and registers them with the scheduler. The worker processes then pull a job from the scheduler (which is an infinite blocking call, answered with {noreply, ...} if no jobs are currently available).
Here are some questions:
Is it a good idea to assign every new job to an existing worker, knowing that I will have to re-assign the job to another worker when new workers connect? (I think this is how the Erlang SMP scheduler does things, but reassigning jobs seems like a big headache to me.)
Should I start a process for every worker processing slot, and where should this process live: on the scheduler node or on the worker node? Should the scheduler make RPC calls to the worker nodes, or would it be better for the worker nodes to pull new jobs and then execute them on their own?
And finally: is this problem already solved, and where can I find the code for it? :-)
I already tried RabbitMQ for job scheduling, but custom job sorting and deployment add a lot of complexity.
Any advice is highly welcome!
Having read your answer in the comments, I'd still recommend using pool(3):
Spawning 100k processes is not a big deal for Erlang because spawning a process is much cheaper than in other systems.
One process per job is a very good pattern in Erlang: start a new process, run the job in that process keeping all the state in the process, and terminate the process when the job is done.
Don't bother with worker processes that process a job and then wait for a new one. That is the way to go if you are using OS processes or threads, because spawning them is expensive, but in Erlang it only adds unnecessary complexity.
The pool facility is useful as a low-level building block; the only thing it is missing for your use case is the ability to start additional nodes automatically. What I would do is start with pool and a fixed set of nodes to get the basic functionality.
Then add some extra logic that watches the load on the nodes, e.g. the same way pool does it, with statistics(run_queue). If you find that all nodes are above a certain load threshold, just slave:start/2,3 a new node on an extra machine and use pool:attach/1 to add it to your pool.
This won't rebalance old, already-running jobs, but new jobs will automatically be moved to the newly started node since it is still idle.
With this you get a fast, pool-controlled distribution of incoming jobs and a slower, totally separate way of adding and removing nodes.
If you get all this working and still find (after some real-world benchmarking, please) that you need rebalancing of jobs, you can always build something into the jobs' main loops: after receiving a rebalance message, the job can respawn itself via the pool master, passing its current state as an argument.
Most importantly, just go ahead and build something simple that works, and optimize it later.
My solution to the problem:
"distributor" - gen_server,
"worker" - gen_server.
"distributor" starts the "workers" using slave:start_link; each "worker" is started with a max_processes parameter,
"distributor" behavior:
handle_call(submit, ...)
* put the job in the queue,
* cast check_queue to itself;
handle_cast(check_queue, ...)
* gen_server:call every worker for its load (current_processes / max_processes),
* find the least busy worker,
* if the chosen worker's load is < 1, gen_server:call(submit, ...) on that worker
  with the next job (if any) and remove the job from the queue;
"worker" behavior (trap_exit = true):
handle_call(report_load, ...)
* return current_processes / max_processes;
handle_call(submit, ...)
* spawn_link the job;
handle_info({'EXIT', Pid, Reason}, ...)
* gen_server:cast check_queue to the distributor
(with trap_exit set, the exit of a linked job arrives as an ordinary message, so it is handled in handle_info rather than handle_call).
In fact it is more complex than that, as I need to track running jobs, kill them if necessary, and so on, but that is easy to implement in this architecture.
This is not a dynamic set of nodes, but you can start a new node from the distributor whenever you need one.
P.S. It looks similar to pool, but in my case I am submitting port processes, so I need to limit them and have finer control over what goes where.