Massively Distributed Parallel Execution of Tasks - Jenkins

We are currently struggling with the following task: we need to run a Windows application (which only works as a single instance) 1000 times with different input parameters. One run of this application can take up to several hours. It feels like we have the same problem as any video rendering farm – each frame of a video should be calculated independently and in parallel – except that our workload is not rendering.
So far we have tried to execute it with Jenkins and Pipeline jobs. We used the parallel step in a pipeline and let Jenkins queue and execute the application. We use the Jenkins Label Expression to let Jenkins choose which job can run on which node.
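A simplified sketch of that kind of setup (the 'windows' label and the runApp.bat wrapper are illustrative placeholders, not our actual configuration) could look like this:

// Generate one parallel branch per input parameter set.
def inputs = (1..1000).collect { "input-${it}" }
def branches = [:]
for (item in inputs) {
    def input = item              // capture the loop variable for the closure
    branches[input] = {
        node('windows') {         // the Label Expression picks a suitable agent
            bat "runApp.bat ${input}"
        }
    }
}
parallel branches                 // hundreds of branches trigger the issue below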
The current Jenkins limitation is with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains several hundred jobs, adding new jobs takes much longer, and this gets even worse as the queue grows. And the main problem: Jenkins starts executing the parallel pipeline branches only after all of them have been added to the queue.
We have already investigated some ideas for solving this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but a complete run like we have in Jenkins (deploy and collect results) looks complex
b. Bidirectional client->server communication is needed – there is no chance to bring it online through a NAT (VM server)
BOINC: https://boinc.berkeley.edu/
a. As far as we understand, we would have to extend the backend massively to get our jobs working => to configure the jobs in BOINC we would have to write a lot of new automation code
b. It currently requires a pre-deployed application that can differentiate between the different inputs => there is no equivalent of the Jenkins Label Expression
Any ideas on how to solve this?
Thanks in advance

Related

In Slurm, how to submit multiple experimental runs in a batch, and execute them consecutively one-by-one?

Submitting jobs on a GPU cluster managed by Slurm.
I am doing some experiments and, as you know, we have to tune the hyperparameters, which means I need to run several similar scripts with different hyperparameters. So I wrote multiple bash scripts (say, named training_n.sh); each script looks like this:
# training_n.sh
srun [command with specific model/training hyperparameters]
Then I use sbatch to execute these scripts; the sbatch script looks like this:
#!/bin/bash
# sbatch script
bash training_1.sh
bash training_2.sh
...
bash training_n.sh
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that the remaining "srun"s can only be executed after those users' jobs are completed?
Additionally, are there any better ideas for submitting a batch of experiment scripts on a publicly used cluster? Since many people are using it, I want to complete all my designed experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users to complete before starting my next one.
If I have a list of "srun"s in my "sbatch" script as shown above, how are they arranged in the queue (assuming I run on a single partition)? Are all these "srun"s seen as a single job, or are they seen as separate jobs?
In other words, are they queued consecutively in the "squeue" list and executed consecutively? Or, by contrast, will other users' jobs queue right behind the "srun" I am currently running, so that the remaining "srun"s can only be executed after those users' jobs are completed?
If you submit all these single srun scripts/commands in a single sbatch script, you will only get one job. The reason is that srun works differently inside a job allocation than outside. If you run srun inside a job allocation (e.g. in an sbatch script), it will not create a new job but just a job step. So in your case, you will have a single job with n job steps that will run consecutively within your allocation.
Additionally, are there any better ideas for submitting a batch of experiment scripts on a publicly used cluster?
If these runs are completely independent, you should use a job array of size n. This way you create n jobs that can run whenever resources are available.
Since many people are using it, I want to complete all my designed experiments consecutively once it's my turn, instead of finishing one "srun" and then waiting for other users to complete before starting my next one.
That might not be a good idea. If these jobs are independent, you should rather submit them as an array. That way they can take advantage of backfill scheduling and might run sooner. You likely don't gain anything by putting them into one large job.
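A minimal sketch of such a job array (assuming the scripts are named training_1.sh through training_n.sh, with n = 10 here; the resource directives are illustrative and should be adjusted to your cluster):

#!/bin/bash
#SBATCH --job-name=training
#SBATCH --array=1-10               # one independent array task per experiment
#SBATCH --gres=gpu:1               # assumption: each run needs one GPU
#SBATCH --output=training_%a.log   # %a expands to the array task ID

# Each array task runs its own training script, selected by task ID.
srun bash "training_${SLURM_ARRAY_TASK_ID}.sh"

Each array task is scheduled as a separate job, so individual tasks can slip into backfill gaps instead of waiting for one large allocation.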

Jenkins Pipeline and huge amount of parallel steps

I have searched the whole internet for two weeks now, and asked on the freenode IRC and the Jenkins user group mailing list, but got no answer. So here I am; you are my last hope (no pressure).
I have a Jenkins scripted pipeline that generates hundreds of parallel branches that have to run simultaneously on hundreds of slave nodes. At the moment it looks like the Jenkins Blue Ocean user interface is not suited for that: we reach a point where not all the steps can be displayed.
I need to provide some background so you can understand our need: we have a huge project in our company with thousands of Behat/Selenium tests, and running them sequentially now takes more than 30 hours. We implemented a basic solution some time ago where we use a queuing system (RabbitMQ) to store all the tests, with consumers that run the tests by downloading the source code from Jenkins and uploading artifacts back to Jenkins, but this is not as scalable as Jenkins native slaves and it is not maintainable enough (e.g. we don't benefit from real-time output logs and usage statistics).
I know there is an open issue that describes the problem here: https://issues.jenkins-ci.org/browse/JENKINS-41205 but, basically, I need a workaround that works by next week (our development team has been waiting for this new pipeline for a long time now).
Our pipeline looks like this at the moment:
Build --- Unit Tests --- Integration Tests --- Functional Tests ---
            |                  |                      |
          tool A            suite A           matrix-A-A-batch 0
          tool B            suite B           matrix-A-A-batch 1
          tool C                              matrix-A-A-batch 2
                                              matrix-A-A-batch 3
                                              ....
                                              "Unable to display more"
You can find a full version of our Jenkinsfile here: https://github.com/willy-ahva/pim-community-dev/blob/086e4ed48ef1a3d880ca16b6f5572f350d26eb03/Jenkinsfile (it may look complicated but, basically, the real problem is the "Functional Tests" stage).
My questions are:
Am I using parallel the right way?
Is it only a Jenkins/Blue Ocean issue, and should I contribute to the issue I linked? (If yes, how? I'm not a Java dev at all.)
Should I try to use MultiJob and parallelize jobs instead of steps?
Is there any other tool besides parallel that I can use (some kind of fork or whatever)?
Thanks a lot for your help. I love what Jenkins has become with the Pipeline and the Blue Ocean UI, and I really want to make it work for our team.
This is probably a poor way to run the parallel tasks. I would instead treat each parallel map entry as a worker and put your tests into a queue (or stack, or another data structure). Each worker thread pops tests off the queue as needed, so you don't sit there with a million tasks queued. You have to be more careful with your logging so that it is apparent which test failed, but that shouldn't be too tough.
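A minimal sketch of that worker-pool idea (the 'test' label, the suite names, and the run-test.sh launcher are illustrative assumptions, not from the original post):

// Fixed-size pool of workers draining a shared, thread-safe test queue.
import java.util.concurrent.ConcurrentLinkedQueue

def tests = new ConcurrentLinkedQueue<String>()
tests.addAll(['suiteA', 'suiteB', 'suiteC'])   // fill with the real test list

int workers = 10                               // far fewer branches than tests
def branches = [:]
for (int i = 0; i < workers; i++) {
    int id = i                                 // capture for the closure
    branches["worker-${id}"] = {
        node('test') {
            String t
            while ((t = tests.poll()) != null) {
                echo "worker-${id}: running ${t}"   // tag logs per test
                sh "./run-test.sh ${t}"             // hypothetical launcher
            }
        }
    }
}
parallel branches

With this shape the UI only ever shows a handful of branches, regardless of how many tests are in the queue.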
It's probably not something that's easy to fix, as it is as much a UI design issue as anything else. I would recommend that you give it a poke though! Who knows, maybe a solution will click for you?
Probably not. In my opinion, that would only make things muddier.
Parallel is your option for forking.
If you really want to keep doing this but don't want the UI to be so weird, you can stop defining each test as a stage. It'll be less clear which test failed when one fails, but the UI should be happier.

Jenkins pipeline job, use all nodes of a given label before locking

I've got a pipeline job that can run on 4 different nodes sharing one label. Previously I had the problem that jobs randomly tried to run on the same node, so I installed the Lockable Resources plugin and tried this:
node('TEST') {
    try {
        notifyBuild('STARTED')
        lock(env.NODE_NAME) {
            // ...
This works in general, but it seems to be random which node from the label TEST the job chooses. For example, the first two job executions can choose the same node, so the second job has to wait even if free nodes are available. Is there a way to ensure that all nodes are used before jobs have to wait?
A better solution is https://github.com/jenkinsci/throttle-concurrent-builds-plugin, which also works for pipeline jobs. Unlike the lock approach above, this plugin checks whether resources are available before it blocks them, and all nodes are used before jobs have to wait.
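A minimal sketch of the plugin's throttle step (assuming a throttle category named 'test-nodes' has already been created in the global Jenkins configuration):

// 'test-nodes' is an assumed category configured under
// Manage Jenkins -> Configure System -> Throttle Concurrent Builds.
throttle(['test-nodes']) {
    node('TEST') {
        // Only as many bodies run concurrently as the category allows,
        // so builds spread across free nodes before any job has to wait.
        sh './build.sh'            // hypothetical build step
    }
}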

Multiple pipeline jobs versus single large pipeline job

I am fairly new to Jenkins pipeline and am considering migrating an existing Jenkins batch to use pipeline script.
This may be an obvious question to those in the know, but I have not been able to find any discussion of it anywhere. If you have a fairly complex set of jobs, say a few hundred, is it best practice to end up with one job with a fairly large script, or with a small number of jobs (probably parameterized, say 5 to 10) with smaller pipeline scripts that call each other?
Having one huge job has the severe disadvantage that you can no longer easily execute single stages. On the other hand, splitting everything into different jobs has the disadvantage that many of the nice pipeline features (shared variables, shared code) can no longer be used. I do not think there is a unique answer to this.
Have a look at the following two related questions:
Jenkins Build Pipeline - Restart At Stage
Run Parts of a Pipeline as Separate Job
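For the splitting approach, a minimal sketch of one pipeline triggering another as a separate job (the job name and parameter are illustrative):

stage('Trigger child') {
    // propagate: false lets us inspect the result instead of failing at once
    def childRun = build job: 'child-pipeline',
                         parameters: [string(name: 'TARGET', value: 'integration')],
                         propagate: false,
                         wait: true
    echo "Child finished with result: ${childRun.result}"
}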

Using a lock in a Jenkins Workflow Job

I want to use a lock in a Workflow job in order to prevent jobs from running at the same time on the same node.
I want to use the functionality of the Locks and Latches plugin to control the parallel execution of jobs: when job A starts building on a specific node, job B should wait until A is done, and then B can run.
How can I achieve that? Or is there another solution (in case locks are not supported in Workflow jobs)?
Thank you.
What exactly are you trying to prevent? The easiest way would be to configure each node with only one executor. If you do this, the node will only ever run one job at a time. Note that some flyweight tasks may still run, but these are generally insignificant and involve things like polling the remote SCM repository.
If you just mean within the same Workflow, you can use various mixes of the parallel step to split parallel sections and then combine the results.
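A minimal sketch of that parallel-and-combine pattern (the script names are illustrative):

// Run two independent sections in parallel and collect their exit codes.
def results = [:]
parallel(
    unit: {
        node {
            results['unit'] = sh(script: './run-unit.sh', returnStatus: true)
        }
    },
    integration: {
        node {
            results['integration'] = sh(script: './run-integration.sh', returnStatus: true)
        }
    }
)
// Combine: fail the build if any section returned a nonzero status.
if (results.values().any { it != 0 }) {
    error "Some sections failed: ${results}"
}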
