I have a heavily parallelized build across 45 slaves (one master that just handles launches).
The problem I am running into is that about 3% of the jobs disappear.
The project setup is a "master" job that then launches (via the parameterized job plugin) N jobs across N slaves. Most of the time, the console output for the master job is correct with regards to job numbers of the distributed build steps.
Occasionally, however, the job indicated in console actually belongs to a completely different build.
Where do I even start looking to track this down? The jenkins logs are eerily empty of any information about failed jobs or problems launching jobs.
My best guess at the moment is that the missing jobs were actually queued waiting for executors when something happened to remove them. But I have no evidence to support this.
Thoughts, suggestions, helpful links all greatly appreciated,
Here's how you can get more info: http://[jenkins_server]/log/ -> Add new log recorder -> enter a name of your choice -> OK -> Add -> enter hudson.model.Run as Logger -> set Log Level to all -> Save.
Now http://[jenkins_server]/log/[your log name]/ will provide you with more info as far as running your jobs is concerned.
As long as the bugs https://issues.jenkins-ci.org/browse/JENKINS-15156 and its linked ones are open, it will happen in certain cases. It does not matter what you use for parallel building or dependant building... it is just core problem. Leave it or Live it.
I doubt additional logging is a fix or answer to your problem.
My answer would be - debug and send patches to devs.
Related
For a while, our Jenkins experiences critical problems. We have jobs hung, our job scheduler does not trigger the builds. After the Jenkins service restart, everything is back to normal, but after some period of time all problem are return. (this period can be week or day or ever less). Any idea where we can start looking? I'll appreciate any help on this issue
Muatik has made a good point in his comment, the recommended approach is to run jobs on agents (slave) nodes. If you already do it, you can look at:
Jenkins master machine CPU, RAM and hard disk usage. Access the machine and/or use plugin like Java Melody. I have seen missing graphics in the builds test results and stuck builds due to no hard disk space. You could also have hit the limit of RAM or CPU for the slaves/jobs you are executing. You may need more heap space.
Look at Jenkins log files, start with severe exceptions. If the files are too big or you see logrotate exceptions, you can change the logging levels, so that fewer exceptions are logged. For more details see my article on this topic. Try to fix exceptions that you see logged.
Go through recently made changes that can be the cause of such behavior, for example, new plugins, changes to config files (jenkins.xml)?
Look at TCP connections. Run netstat -a Are there suspicious connections (CLOSED_WAIT status)?
Delete old builds that you do not need.
We have been facing this issue from last 4 months, and tried everything, changing resources CPU & memory, increasing desired nodes in ASG. But nothing seems worked .
Solution: 1. Go to Manage Jnekins-> System Configurationd-> Maven project
configurations
2. In "usage" field, select "Only buid Jobs with label expressions matching this nodes"
Doing this resolved it and jenkins is working as a Rocket now :)
I have searched the whole internet for 2 weeks now, asked on freenode IRC and in the Jenkins user group mailing list for that but got no answer so here I am, you are my last hope (no pressure)
I have a Jenkins scripted pipeline that generates hundreds of parallel branches that have to run simultaneously on hundreds of slaves node. At the moment it looks like Jenkins BlueOcean user interface is not suited for that. We reach a point were all the steps can't be displayed.
I need to provide some kind of background to let you understand our need: We have a huge project in our company that have thousands of Behat/Selenium and this takes more that 30h to run now if done sequentially. We implemented a basic solution some times ago were we use a queuing system (RabbitMq) to store all the tests and consumers that run the tests by downloading the source code from Jenkins and uploading artifacts back to Jenkins too, but this is not as scallable as Jenkins native slaves and it is not maintainable enough (eg. we don't benefit from real time output log and usage statistics).
I know there is an open issue that describe the problem here : https://issues.jenkins-ci.org/browse/JENKINS-41205 but, basically, I need a workaround working for the next week (Our deelopment team are waiting for this new pipeline for a long time now).
Our pippeline looks like that at the moment:
Build --- Unit Tests --- Integration Tests --- Functional Tests ---
| | |
tool A suite A matrix-A-A-batch 0
tool B suite B matrix-A-A-batch 1
tool C matrix-A-A-batch 2
matrix-A-A-batch 3
....
"Unable to display more"
You can find a full version of our Jenkinsfile here : https://github.com/willy-ahva/pim-community-dev/blob/086e4ed48ef1a3d880ca16b6f5572f350d26eb03/Jenkinsfile (It may looks complicated but, basically, the real problem is the "Functional Tests" stage)
My questions are:
Am I using parallel the good way ?
Is it only a Jenkins/BlueOcean issue and I should contribute to the issue I linked ? (If yes, how ? I'm not a Java dev at all)
Should I try to use MultiJob and parallelize jobs instead of steps ?
Is there any other tool except parallel that I can use ? (some kind of fork or whatever) ?
Thanks a lot for your help. I love what Jenkins became with the Pipeline and BlueOcean UI and I really want to make it work in our team.
This is probably a poor way to do the parallel tasks. I would instead treat each parallel map entry as a worker, and put your tests into a queue / stack / data structure. Each worker thread could pop off the queue as required, and then you wouldn't sit there with a million tasks queued. You would have to be more careful with your logging so that it is apparent which test failed, but that shouldn't be too tough.
It's probably not something that's easy to fix, as it is as much a UI design issue as anything else. I would recommend that you give it a poke though! Who knows, maybe a solution will click for you?
Probably not. In my opinion this makes this muddier
Parallel is your option for forking.
If you really want to keep doing this, but don't want the UI to be so weird, you can stop defining each test as a stage. It'll be less clear what failed when one fails, but the UI should be happier.
This is not just another question about concurrent job execution in Jenkins. The problem I have is that there are several jobs that run independently from one another. When they finish it should be possible to run a manual job. The condition though is that all those automated jobs should be in successful state. Otherwise it should not be possible to run this manual job. It should also not be possible to run or even schedule run of this manual job if those other jobs are running.
I searched for the answer everywhere and checked every possible plugin that serves synchronization. But I did not figure it out how to solve the above problem.
IMHO the delivery pipeline plugin (see https://wiki.jenkins-ci.org/display/JENKINS/Delivery+Pipeline+Plugin for the download and http://www.infoq.com/articles/orch-pipelines-jenkins for a thorough description) could do what you want.
You can run a lot of jobs (in parallel or not), and when (and only when) they succeed another job (or more). You even can add manual steps (needing a button click when your pipeline may continue).
Everything is configurable - and quite stable at this moment.
No-one should be able to manually (or otherwise) start a job that is in "waiting state" for other jobs to finish.
Regarding this question:
Otherwise it should not be possible to run this manual job. It should also not be possible to run or even schedule run of this manual job if those other jobs are running.
You can use the Throttle Concurrent Builds Plugin and create a category which will include your automated jobs and the manual jobs.
If one automated job is running, it will be impossible to launch the manual jobs.
Regarding your first question, did you have a look to the Join plugin?
Cheers
https://wiki.jenkins-ci.org/display/JENKINS/Promoted+Builds+Plugin can also be option. Setup promotions in that way that manual approval is needed and build will not fail only if automated jobs are done.
I would like to monitor the estimated time of all of my builds to catch the cases where this value is shown as 'N/A'.
In these cases the build gets stuck (probably due to network issues in my environment) and it won't start new builds for that job until killed manually.
What I am missing is how to get that data for each job, either from api or other source.
I would appreciated any suggestion.
Thanks.
For each job, you can click "Trend" on the job run history table, and it will show you the currently executing progress along with a graph of "usual" execution times.
Using the API, you can go to http://jenkins/job/<your_job_name>/<build_number>/api/xml (or /json) and the information is under <duration> and <estimatedDuration> fields.
Finally, there is a Jenkins Timeout Plugin that you can use to automatically take care of "stuck" builds
Sorry for another non programming question, but I'm using Quartz.NET, a scheduler for .NET applications, for a Windows Service which allows users to schedule transferrig of files that match a regular expression from various sources - for example the user may schedule a job to occur every day at 6pm that transfers the files from a network path to a FTP server.
The adding jobs and management is done using an ASP.NET project, and I'm creating a Dashboard to display useful info to the user. I have the following information on the dashboard so far:
Total number of jobs
Windows Service status
Time since scheduler active
I know it's a very general question, but what other snippets of info can I add to the dashboard, as it's very sparse at the moment.
I've worked as a product manager on a few schedulers. Here are some common requirements for these types of things, but I urge you to talk to some target users to find out if they are applicable to your application.
The use cases:
1. Trying to identify if the jobs are running okay.
2. If jobs are not running okay, give the user clues as to the cause. Give user tools to debug and fix.
General requirements:
1. Table with info on last N jobs:
- Time started, time completed. Status of completion (success / failure). Length of time. Any errors. User who scheduled job. Any dependencies this job has on other jobs or other events. Specific machine the job ran on (if in a cluster).
Might be nice to include links in this dashlet that would allow you to cancel a job that might be hung.
Priority of the job (if you have priorities).
Compare all jobs: %succeed, %failure. Avg time to complete job.
Compare jobs by the scheduling user: avg time, %success, %failure.
This is by no means a comprehensive list or something. Just my trying to give you a few ideas, based on what I can remember off the top of my head.
-