I have searched the whole internet for two weeks now, asked on freenode IRC and on the Jenkins user group mailing list, but got no answer, so here I am; you are my last hope (no pressure).
I have a Jenkins scripted pipeline that generates hundreds of parallel branches that have to run simultaneously on hundreds of slave nodes. At the moment, it looks like the Jenkins Blue Ocean user interface is not suited for that: we reach a point where not all of the steps can be displayed.
I need to provide some background so you can understand our need: we have a huge project in our company that has thousands of Behat/Selenium tests, which take more than 30 hours to run sequentially. Some time ago we implemented a basic solution where we use a queuing system (RabbitMQ) to store all the tests, with consumers that run the tests by downloading the source code from Jenkins and uploading artifacts back to Jenkins, but this is not as scalable as Jenkins native slaves and it is not maintainable enough (e.g. we don't benefit from real-time output logs and usage statistics).
I know there is an open issue that describes the problem here: https://issues.jenkins-ci.org/browse/JENKINS-41205 but, basically, I need a workaround working by next week (our development team has been waiting for this new pipeline for a long time now).
Our pipeline looks like this at the moment:
Build --- Unit Tests --- Integration Tests --- Functional Tests ---
              |                  |                     |
            tool A            suite A         matrix-A-A-batch 0
            tool B            suite B         matrix-A-A-batch 1
            tool C                            matrix-A-A-batch 2
                                              matrix-A-A-batch 3
                                              ....
                                              "Unable to display more"
You can find a full version of our Jenkinsfile here: https://github.com/willy-ahva/pim-community-dev/blob/086e4ed48ef1a3d880ca16b6f5572f350d26eb03/Jenkinsfile (it may look complicated but, basically, the real problem is the "Functional Tests" stage).
My questions are:
Am I using parallel the right way?
Is it only a Jenkins/Blue Ocean issue, and should I contribute to the issue I linked? (If yes, how? I'm not a Java dev at all.)
Should I try to use MultiJob and parallelize jobs instead of steps?
Is there any other tool besides parallel that I can use (some kind of fork or whatever)?
Thanks a lot for your help. I love what Jenkins has become with Pipeline and the Blue Ocean UI, and I really want to make it work for our team.
This is probably a poor way to do the parallel tasks. I would instead treat each parallel map entry as a worker, and put your tests into a queue / stack / data structure. Each worker thread can then pop tests off the queue as needed, so you don't sit there with a million tasks queued (see the sketch below). You will have to be more careful with your logging so that it is apparent which test failed, but that shouldn't be too tough.
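For illustration, here is a minimal sketch of that worker-pool idea in a scripted pipeline. The node label, worker count, test names, and run-test.sh script are assumptions for the sketch, not taken from your Jenkinsfile:

import java.util.concurrent.ConcurrentLinkedQueue

// Shared queue of test identifiers; hundreds of entries in practice.
def tests = new ConcurrentLinkedQueue<String>()
tests.addAll(['suite-A', 'suite-B', 'suite-C'])

// A fixed number of parallel branches (workers), far fewer than tests.
def branches = [:]
int workerCount = 10

for (int i = 0; i < workerCount; i++) {
    int workerId = i  // capture the loop variable for the closure
    branches["worker-${workerId}"] = {
        node('test-slave') {  // hypothetical node label
            def test = tests.poll()
            while (test != null) {
                echo "worker-${workerId}: running ${test}"
                sh "./run-test.sh '${test}'"  // hypothetical test runner
                test = tests.poll()
            }
        }
    }
}
parallel branches

This keeps the number of parallel branches (and thus the number of steps the UI has to render) bounded by workerCount, no matter how many tests there are.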
It's probably not something that's easy to fix, as it is as much a UI design issue as anything else. I would recommend that you give it a poke, though! Who knows, maybe a solution will click for you?
Probably not. In my opinion, that would only make things muddier.
Parallel is your option for forking.
If you really want to keep doing this but don't want the UI to be so weird, you can stop defining each test as a stage. It'll be less clear which test failed when one fails, but the UI should be happier.
I have a Dataflow job that is not making progress, or is making very slow progress, and I do not know why. How can I start looking into why the job is slow or stuck?
The first resource you should check is the Dataflow documentation. These pages are useful:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it, separated below by which part of the system is causing the trouble:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open Stackdriver Logging and look for the worker-startup logs. These logs are written by the worker as it starts up a Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is keep the same setup, and run a very small pipeline that stages everything:

import logging

import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in Stackdriver, then you can continue to debug your setup. If you do see the log in Stackdriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the built-in transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc.).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed with a Reshuffle, by sharding your existing keys into subkeys (Beam often does processing per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there is no silver bullet for improving your pipeline's throughput; you'll need to experiment.
Data freshness or system lag is increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress, thus, it is important to be watchful of things that could cause the watermark to be held back, and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back. When this happens, you will see errors in your Dataflow console, and the error count will keep climbing as the bundle is retried.
You may have a bug when associating timestamps with your data. Make sure that the resolution of your timestamps is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
I have been tasked with setting up automated deployment and, after some research, settled on Jenkins to get the job done. Prior to this I had approximately zero knowledge of Jenkins, beyond hearing the name. I have no real knowledge of Devops beyond what I have learnt in the last couple of weeks; no formal training, no actual books, just Google searches.
We are not running a full blown/classic CI/CD process; this is a business decision. The basic requirements are:
Source code will be stored in GitHub.
Pull requests must be peer approved.
Pull requests must pass build/unit/db deploy tests.
Commits to specific branches must trigger a deployment to a related specific environment (Production, Staging or Development).
The basic functionality that I am attempting to support covers (what I currently see as) two separate processes:
On creation of a pull request, the application is built, unit tests are run, and the DB deploy is tested. Status info must be passed back to GitHub.
On commit to one of three specific branches (master, staging and dev), the application should be built and deployed to one of three environments (production, staging and dev).
I have managed to cobble together a pipeline that does the first task rather well. I am using the generic web hook trigger, and manually handling all steps using a declarative pipeline stored in source control. This works rather well so far and, after much hacking, I am quite happy with the shape of it.
I am now starting work on the next bit, automated deployment.
On to my actual question(s).
In short, how do I split this up into Jobs in Jenkins?
To my mind, there are 1, 2 or 4 Jobs to be created:
One Job to Rule them All
This seems sub-optimal to me, as the pipeline will include relatively complex conditional logic and, depending on whether the Job is triggered by a Pull Request or a Commit, different stages will be run. The historical data will be so polluted as to be near useless.
OR
One job for handling pull requests
One job for handling commits
Historical data for deployments across all environments will be intermixed. I am a little concerned that I will end up with >1 Jenkinsfile in my repository. Although I see no technical reason why I can't have >1 Jenkinsfile, every example I see uses a single file. Is it OK to have >1 Jenkinsfile (Jenkinsfile_Test & Jenkinsfile_Deploy) in the repository?
OR
One job for handling pull requests
One job for handling commits to Development
One job for handling commits to Staging
One job for handling commits to Production
This seems to have some benefit over the previous option, because historical data for deployments into each environment will not cross-pollute. But now we're well over the >1 Jenkinsfile (perceived) limit, and I will end up with four files (Jenkinsfile_Test, Jenkinsfile_Deploy_Development, Jenkinsfile_Deploy_Staging and Jenkinsfile_Deploy_Production). This approach also brings either extra complexity (common code in a shared library) or copy/paste code reuse, which I certainly want to avoid.
My primary objective is for this to be maintainable by someone other than myself, because Bus Factor. A real Devops/Jenkins person will have to update/manage all of this one day, and I would strongly prefer them not to suffer from my ignorance.
I have done countless searches, but I haven't found anything that provides the direction I need here. Searches for best practices make no mention on handling >1 Jenkinsfile, instead focusing on the contents of a single pipeline.
After further research, I have found an answer to my core question. This might not be the absolute correct answer, but it makes sense to me, and serves my needs.
While it is technically possible to have >1 Jenkinsfile in a project, that does not appear to align with best practices.
The best practice appears to be to create a separate repository for each Jenkinsfile, which maps 1:1 with a Job in Jenkins.
To support my specific use case, I removed the Jenkinsfile from my main source code repository and created 4 new repositories:
Project_Jenkinsfile_Test
Project_Jenkinsfile_Deploy_Development
Project_Jenkinsfile_Deploy_Staging
Project_Jenkinsfile_Deploy_Production
Each repository contains a single Jenkinsfile and a readme.md that, in theory, contains useful information.
This separation gives me a nice view of the historical success/failure of the Test runs as a whole, and Deployments to each environment separately.
It is highly likely that I will later create a fifth repository:
Project_Jenkinsfile_Deploy_SharedLibrary
This last repository would contain pipeline code that is shared amongst the four 'core' pipelines. Once I have the 'core' pipelines up and running properly, I will consider refactoring what I can into this shared library.
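As an aside, consuming such a shared library from the 'core' pipelines is straightforward. In this hypothetical sketch the deployToEnvironment step is an assumption, standing in for whatever common code ends up in the library:

// Load the shared library configured in Jenkins, then call a common step.
@Library('Project_Jenkinsfile_Deploy_SharedLibrary') _

pipeline {
    agent any
    stages {
        stage('Deploy') {
            steps {
                deployToEnvironment('staging')  // hypothetical shared-library step
            }
        }
    }
}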
I will not accept my own answer at this point, in the hope that more answers are forthcoming.
Here's a proposal I would try for your requirements based on the experience at my last job.
Job1: builds and runs unit tests on every commit on master or whatever your main dev branch is (checks every 20 minutes or whatever suits you); this job usually finds compile and unit test issues very fast
Job2 (optional): runs integration tests and various static code checks (e.g. clang-tidy, valgrind, cppcheck, etc.) every night, if the last run of Job1 was successful; this job usually finds lots of things, but probably takes a long time, so let it run only at night
Job3: builds and tests every pull request for release branches; that way your pull requests get some info on whether they are mature enough to be merged into the release branches
Job4: deploys to the appropriate environment on every commit on a release branch; on dev and staging you could probably trigger some more tests, if you have them
So Job1, Job2 and Job3 should run all the time. If pull requests to your release branches are approved by QA (i.e. reviews OK and tests successful) and merged into the release branches, the deployment is done automatically by Job4.
It depends on your requirements and your dev process whether you would rather trigger Job4 manually instead.
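To make Job4 concrete, here is a minimal sketch of a branch-gated deploy in a declarative pipeline, assuming a multibranch job; the build.sh/deploy.sh scripts and the branch-to-environment mapping are assumptions:

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh './build.sh'  // hypothetical build script
            }
        }
        stage('Deploy') {
            // Only deploy on commits to the three release branches.
            when {
                anyOf {
                    branch 'master'
                    branch 'staging'
                    branch 'dev'
                }
            }
            steps {
                script {
                    // Map the branch to its target environment (assumed names).
                    def target = [master: 'production', staging: 'staging', dev: 'development'][env.BRANCH_NAME]
                    sh "./deploy.sh ${target}"  // hypothetical deploy script
                }
            }
        }
    }
}

With this shape, a single Jenkinsfile can serve all three deployment targets, since the branch name selects the environment.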
I'm trying to improve some of the testing procedures at work, and since I'm not an expert on Jenkins, I was hoping you guys could point me in the right direction.
Our current situation is like this: we have a huge suite of E2E tests that run regularly. These tests rely on a pool of limited resources (AWS VMs that are used to run each test). We have two test suites: a full-blown regression that consumes, at its peak, a total of ~80% of those resources, and a much more lightweight smoke run that uses just 15% or so.
Right now I'm using the Lockable Resources plugin. When the Test Run step comes, it checks whether you are running a regression or not, and if you are, it requests the single lock. If the lock is available then all good, and if not, the job waits until it becomes available before continuing. This lets me make sure that at no point will there be more than one regression running, but it has a lot of gaps: a regression could be running while several smoke runs are triggered, which would exhaust the resource pool.
What I would like to accomplish, in a best-case scenario, would be some sort of conditional rules that decide whether the test execution step can go forward or not, based on something like this:
Only 1 regression can be running at the same time.
If a regression is running, allow only 1 smoke run in parallel.
If no regression is running, allow up to 5 or 6 smoke tests.
If 2 or more smoke tests are running, do not allow a regression to launch.
Would something like that be possible from a Jenkins pipeline? In this case I'm using the declarative pipeline with a bunch of helper Groovy code I've put together over time. My first idea is to see if there's a way to check whether a lockable resource is available (without actually requesting it yet) and then go through a bunch of if/then/else to set up the logic. But I'm not sure if there's a way to check a lockable resource's state, or how many of a kind have already been requested.
Honestly, something this complex might be outside of what Jenkins is supposed to handle, but I'm not sure, and I figured asking here would be a good start.
Thanks!
Create a declarative pipeline with steps that build the individual jobs. Don't allow people to run the jobs ad hoc or when changes are pushed to the repository; force a build schedule.
How can this solve your issue:
Only 1 regression can be running at the same time.
Put all these jobs in sequence in a declarative pipeline.
If a regression is running allow only 1 smoke run to be run in parallel.
Put the smoke tests that are related to a regression run in sequence just after that regression build, but run the smoke tests in parallel with each other, before the next regression build.
If no regression is running then allow up to 5 or 6 smoke tests.
See previous
If 2 or more smoke tests are running do not allow a regression to launch.
It will never happen if you run things in sequence.
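Putting that together, a minimal declarative sketch of the idea could look like this; the downstream job names and the cron schedule are assumptions:

pipeline {
    agent any
    triggers {
        cron('H 2 * * *')  // forced schedule instead of ad-hoc or push-triggered runs
    }
    stages {
        stage('Regression') {
            steps {
                build job: 'regression-suite'  // hypothetical downstream job
            }
        }
        stage('Smoke runs') {
            // Smoke tests run in parallel, after the regression finishes
            // and before the next scheduled regression can start.
            parallel {
                stage('Smoke 1') { steps { build job: 'smoke-test-1' } }
                stage('Smoke 2') { steps { build job: 'smoke-test-2' } }
            }
        }
    }
}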
You can create the pipeline manually, or use the coolness of Blue Ocean to get a graphical interface for putting the steps in sequence or in parallel:
https://jenkins.io/doc/tutorials/create-a-pipeline-in-blue-ocean/
The downside is that if one of those jobs fails, it will stop the build, but that is not necessarily a bad thing if the jobs are highly correlated.
Completely forgot to update this, but after reading and experimenting a bit more with the Lockable Resources plugin, I found out you can have several resources under the same label and request a set quantity whenever a specific job starts.
I defined 5 resources, and the Jenkinsfile checks whether you are running the test suite with the regression parameter or not. A full regression tries to request 4 locks, while a smoke test only requests 1. This way, when there aren't enough locks available, the job waits until either enough become available or the timeout expires.
Here's a snippet from my Jenkinsfile:
stage('Test') {
    steps {
        lock(resource: null, label: 'server-farm-tokens', quantity: getQuantityBySuiteType()) {
            // <<whatever I do to run my tests here>>
        }
    }
}
resource has to be null due to a bug in Jenkins' declarative pipeline. If you're using the scripted pipeline, you can ignore that parameter.
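For completeness, the helper can be as simple as the following sketch; the SUITE_TYPE parameter name is an assumption, standing in for however the suite type is passed to the job:

// Hypothetical helper: a full regression requests 4 of the 5 tokens,
// a smoke run requests only 1.
def getQuantityBySuiteType() {
    return params.SUITE_TYPE == 'regression' ? 4 : 1
}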
We are currently struggling with the following task: we need to run a Windows application (which only works as a single instance) 1000 times with different input parameters. One run of this application can take up to multiple hours. It feels like we have the same problem as any video rendering farm (each frame of a video can be calculated independently and in parallel), but it is not rendering.
Currently we are trying to do this with Jenkins and Pipeline jobs. We use the parallel step in the pipeline and let Jenkins queue and execute the application. We use Jenkins label expressions to let Jenkins choose which job can run on which node.
The limitation in Jenkins is currently with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains multiple hundreds of jobs, adding new jobs takes much longer, and this gets worse as the queue grows. And the main problem: Jenkins starts executing the parallel pipeline branches only after it has finished adding all of them to the queue.
We have already investigated some ideas to solve this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but a complete run like we have in Jenkins (deploy, run, and collect results) looks complex.
b. It needs bidirectional client/server communication, so there is no chance to bring it online through a NAT (VM server).
BOINC: https://boinc.berkeley.edu/
a. As far as we understand, we would have to extend the backend massively to get our jobs working; to configure the jobs in BOINC we would have to write a lot of new automation code.
b. It currently needs a pre-deployed application that can distinguish between different inputs; there is no equivalent of the Jenkins label expressions.
Any ideas how to solve this?
Thanks in advance
Does anybody know about (or have experience with) a simple continuous integration system that can be run via cron and produce a static HTML report?
Tools like Jenkins and BuildBot all seem to need their own daemon processes. (Which in my case means I need to get time from sysadmins to set up. Not gonna happen.)
Ideally, I'd like an output report that looks like:
http://buildbot.buildbot.net/#/console
That is, one row per revision and one column per build config. With a green/red status on each.
Generally, the reason why people don't run CI systems in cron is that they rapidly outgrow cron.