In Slurm, how to submit multiple experimental runs in a batch and execute them consecutively, one by one? - machine-learning

I am submitting jobs on a GPU cluster managed by Slurm.
I am running some experiments and, as you know, the parameters have to be tuned, which means I need to run several similar scripts with different hyperparameters. So I wrote multiple bash scripts (say, named training_n.sh); each script looks like this:
# training_n.sh
srun [command with specific model/training hyperparameters]
Then I use sbatch to submit these scripts; the sbatch script looks like this:
# sbatch script
bash training_1.sh
bash training_2.sh
...
bash training_n.sh
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that my remaining "srun"s can only execute after those users' jobs have completed?
Additionally, are there any better ideas for submitting a batch of experiment scripts on a publicly used cluster? Since many people are using it, I want to complete all my planned experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users' jobs to complete before starting my next one.

If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that my remaining "srun"s can only execute after those users' jobs have completed?
If you submit all these single srun scripts/commands in a single sbatch script, you will only get one job. The reason is that srun works differently inside a job allocation than outside. If you run srun inside a job allocation (e.g. in an sbatch script), it does not create a new job; it just creates a job step. So in your case you will have a single job with n job steps, which will run consecutively within your allocation.
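For illustration, a minimal sketch of such a submission script; the job name, resource request, and the sacct check are assumptions added here for clarity, not part of the original setup:
#!/bin/bash
#SBATCH --job-name=tuning    # illustrative job name
#SBATCH --gres=gpu:1         # illustrative resource request; adjust to your cluster
# Each training_i.sh contains an srun call; inside this allocation every srun
# creates a job step rather than a new job, so "squeue" shows a single job
# and "sacct -j <jobid>" lists the individual steps.
bash training_1.sh
bash training_2.sh
bash training_n.sh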
Additionally, are there any better ideas for submitting a batch of experiment scripts on a publicly used cluster?
If these runs are completely independent, you should use a job array of size n. This way you create n jobs, each of which can run whenever resources are available.
Since many people are using it, I want to complete all my planned experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users' jobs to complete before starting my next one.
That might not be a good idea. If the runs are independent, it is better to submit them as an array, as sketched below: the individual jobs can then take advantage of backfill scheduling and will likely start sooner. You gain nothing by bundling them into one large job.
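For concreteness, here is a hedged sketch of what such an array submission could look like. The array size, resource request, the hyperparams.txt file (one set of training flags per line), and the train.py command are all assumptions for illustration:
#!/bin/bash
#SBATCH --job-name=tuning
#SBATCH --gres=gpu:1            # illustrative resource request; adjust to your cluster
#SBATCH --array=1-8             # one independent array task per hyperparameter setting
# Each array task reads the line of hyperparams.txt matching its index
# and launches one training run with those flags.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" hyperparams.txt)
srun python train.py ${PARAMS}
Submitted once with sbatch, each array task is scheduled as a job of its own, so individual runs can be backfilled into free slots independently instead of waiting behind one large allocation.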

Related

How to batch-schedule dask_jobqueue jobs in Dask instead of running them all concurrently?

By my reading of Dask-Jobqueue (https://jobqueue.dask.org/en/latest/), and from testing on our SLURM cluster, it seems that when you set cluster.scale(n) and create client = Client(cluster), none of your jobs can start until all n of your jobs are able to start.
Suppose you have 999 jobs to run, and a cluster with 100 nodes or slots; worse yet, suppose other people share the cluster, and maybe some of them have long-running jobs. Admins sometimes need to do maintenance on some of the nodes, so they add and remove nodes. You never know how much parallelism you'll be able to get. You want the cluster scheduler to simply take 999 jobs (in slurm, these would be submitted via sbatch), run them in any order on any available nodes, store results in a shared directory, and have a dependent job (in slurm, that would be sbatch --dependency=) process the shared directory after all 999 jobs completed. Is this possible with DASK somehow?
It seems to be a fundamental limitation of the architecture that all the jobs are expected to run in parallel and that the user must specify the degree of parallelism.
Your understanding is not correct. Dask can run with fewer than the specified number of jobs, just as you are asking for; it will use whatever resources become available.
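As a side note, the plain-Slurm pattern the question alludes to (an array of independent jobs plus a dependent collection step) can be expressed directly; worker.sh and collect.sh are hypothetical scripts used only for illustration:
# Submit 999 independent array tasks and capture the job ID.
jobid=$(sbatch --parsable --array=1-999 worker.sh)
# Run the collection step only after every array task has finished successfully.
sbatch --dependency=afterok:${jobid} collect.sh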

Massive-Distributed Parallel Execution of tasks

We are currently struggling with the following task. We need to run a Windows application (which only works as a single instance) 1000 times with different input parameters. One run of this application can take up to several hours. It feels like we have the same problem as any video-rendering farm, where each frame should be calculated independently and in parallel, except that our workload is not rendering.
So far we have tried to execute it with Jenkins and Pipeline jobs. We use the parallel step in the pipeline and let Jenkins queue and execute the application, and we use Jenkins label expressions to let Jenkins choose which node each job can run on.
The current limitation in Jenkins is with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains several hundred jobs, adding new jobs takes much longer, and it gets worse as the queue grows. The main problem: Jenkins only starts executing the parallel pipeline sub-jobs after it has finished adding all of them to the queue.
We have already investigated some ideas for solving this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but for a complete run like we have in Jenkins (deploy and collect results) it looks complex.
b. Bidirectional client-server communication is needed, so there is no way to bring it online through a NAT (VM server).
BOINC: https://boinc.berkeley.edu/
a. As far as we understand, we would have to extend the backend massively to get our jobs working; configuring the jobs in BOINC would require writing a lot of new automation code.
b. It currently needs a pre-deployed application that can distinguish between different inputs, and there is no equivalent of the Jenkins label expression.
Any ideas on how to solve this?
Thanks in advance.

How to increase maximum concurrent jobs?

In my newly installed Jenkins, I have four jobs but can only run two concurrently. If I trigger the build of a third job, it is placed in the queue and triggered once one of the first two finishes.
I know my server can handle more than two concurrent jobs at a time. How can I increase this default threshold of two?
If it means anything, these are not build-a-deployable-package kinds of jobs but environment-prep jobs that instantiate various DBs. Each job simply invokes a Python script on the Jenkins server; the script is the same across jobs, but each job invokes it with different input parameters. The jobs are 100% independent of one another and share no resources except the script.
Go to Manage Jenkins --> Configure System and increase the "# of executors" setting.

Using a lock in a Jenkins Workflow Job

I want to use a lock in a workflow job in order to prevent jobs from running at the same time on the same node.
I want to use the functionality of the Locks and Latches plugin to control the parallel execution of jobs: when job A starts building on a specific node, job B should wait until A is done, and only then should B run.
How can I achieve that? Or is there another solution (in case locks are not supported in workflow jobs)?
Thank you.
What exactly are you trying to prevent? The easiest way would be to configure each node with only one executor; if you do this, the node will only ever run one job at a time. Note that some flyweight tasks may still run, but these are generally insignificant and involve things like polling the remote SCM repository.
If you just mean within the same workflow, you can use various combinations of the parallel step to split work into parallel sections and then combine the results.

Rails BackgroundJob: running jobs in parallel?

I'm very happy with Bj so far, but I have this one issue:
When one process takes one or two hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when a job uploads to a server that regularly times out.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform-specific details, the design of Bj is quite simple: the main Rails application submits jobs to a table stored in the database. The act of submitting triggers exactly one of two things to occur:
1) a new long-running background runner is started
2) an existing background runner is signaled
The background runner refuses to run two copies of itself for a given hostname/rails_env combination. For example, you may only have one background runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself - you need do nothing to start, monitor, or stop it - it just works. However, some people will prefer to manage their own background process; see the 'External Runner' section below for more on this.
The runner simply processes each job in a highest-priority, oldest-in fashion, capturing stdout, stderr, exit_status, etc. and storing the information back into the database while logging its actions. When there are no jobs to run, the runner goes to sleep for 42 seconds; however, this sleep is interruptible, such as when the runner is signaled that a new job has been submitted, so under normal circumstances there will be zero lag between job submission and job running for an empty queue.
You can learn more on the project's GitHub page.
