In Slurm, how do I submit multiple experimental runs in a batch and execute them consecutively, one by one?

I am submitting jobs on a GPU cluster managed by Slurm.
My experiments require hyperparameter tuning, so I need to run several similar scripts with different hyperparameters. I wrote multiple bash scripts for this; each script looks like:
srun [command with specific model/training hyperparameters]
Then I use sbatch to execute these scripts; the sbatch script looks like:
# sbatch script
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or will other users' jobs queue directly behind the "srun" I am running, so that the remaining "srun"s can only run after those users' jobs have completed?
Additionally, are there any better ways to submit a batch of experiment scripts on a publicly used cluster? Since many people are using it, I want to complete all my designed experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users' jobs to complete before my next one starts.

If you submit all these single srun scripts/commands in a single sbatch script, you will only get one job. The reason for this is that srun works differently inside a job allocation than outside. If you run srun inside a job allocation (e.g. in an sbatch script), it will not create a new job, but just create a job step. So in your case, you will have a single job with n job steps that will run consecutively in your allocation.
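As a rough sketch of the single-job case described above, the sbatch script below runs three srun job steps one after another inside one allocation. The training command (`python train.py --lr ...`), the learning-rate values, and the resource request are assumptions for illustration and are not taken from the question. Outside a Slurm allocation, the script only prints the commands it would launch.

```shell
#!/bin/bash
#SBATCH --job-name=hparam-steps
#SBATCH --gres=gpu:1
#SBATCH --output=steps_%j.out

# Inside an allocation, launch real job steps with srun; elsewhere, only print
# the commands so the script can be dry-run without Slurm.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    LAUNCH="srun"
else
    LAUNCH="echo srun"
fi

run_steps() {
    # Hypothetical learning rates; each iteration becomes one job step.
    for lr in 0.1 0.01 0.001; do
        # srun blocks until its step finishes, so the steps run one at a time.
        $LAUNCH python train.py --lr "$lr"
    done
}

run_steps
```

Submitted once with sbatch, this shows up as one entry in squeue. Once the job starts, the allocation is held until the last step finishes, so other users' jobs cannot slip in between the steps.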
If these runs are completely independent, you should use a job array, with size n. This way you can create n jobs that can run whenever there are resources available.
That might not be a good idea. If these jobs are independent, you can rather submit them as an array. This way, they could take advantage of backfill scheduling and might run more quickly. You likely don't gain anything by putting them into a large job.
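A job array along the lines suggested in both answers could look roughly like the sketch below. The training command, the `pick_lr` helper, and the learning-rate values are hypothetical; only `--array`, the `%A`/`%a` filename patterns, and the `SLURM_ARRAY_TASK_ID` variable come from Slurm itself. Each array task picks its own hyperparameter from its task ID.

```shell
#!/bin/bash
#SBATCH --job-name=hparam-array
#SBATCH --array=0-2
#SBATCH --gres=gpu:1
#SBATCH --output=array_%A_%a.out

# Hypothetical mapping from array task ID to learning rate.
pick_lr() {
    case "$1" in
        0) echo 0.1 ;;
        1) echo 0.01 ;;
        2) echo 0.001 ;;
        *) echo "unknown task id: $1" >&2; return 1 ;;
    esac
}

# Slurm sets SLURM_ARRAY_TASK_ID in each array task; default to 0 so the
# script can be dry-run outside Slurm.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
LR="$(pick_lr "$TASK_ID")"

if [ -n "${SLURM_JOB_ID:-}" ]; then
    srun python train.py --lr "$LR"
else
    echo "dry run: task $TASK_ID would train with lr=$LR"
fi
```

Each array task is scheduled as its own job, so tasks can start independently whenever resources free up. If the runs should not overlap, `--array=0-2%1` limits the array to one running task at a time, although other users' jobs may still be scheduled between tasks.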


