I have a Jenkins server that keeps running out of memory and cannot create native threads. I've upped the memory and installed the Monitoring plugin.
There are about 150 projects on the server, and I've been watching the thread count creep up all day. It is now around 990. I expect when it hits 1024, which is the user limit for threads, Jenkins will run out of memory again.
[edit]: I have hit 1016 threads and am now getting the out of memory error
Is this an appropriate number of threads for Jenkins to be running? How can I tell Jenkins to destroy threads when it is finished with them?
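For reference, an equivalent check from the shell would be something like this (a sketch; the pgrep pattern is a guess at how the process is named on my box):
ulimit -u                                             # per-user process/thread limit
ps -o nlwp= -p "$(pgrep -f jenkins.war | head -n 1)"  # live thread count of the Jenkins process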
tl;dr:
There was a post-build action running a bash script that didn't return anything via stderr or stdout to Jenkins. Therefore, every time the build ran, threads would be created and get stuck waiting. I resolved this issue by having the bash script return an exit status.
long answer
I am running Jenkins on CentOS and installed it via the RPM. In terms of modifying the Winstone servlet container, you can change its options in Jenkins' configuration file at /etc/sysconfig/jenkins. However, the handlerCount options (listed in the other answer below) only control the number of HTTP threads that are created, not the number of threads overall.
That would be a solution if my threads were hanging on accessing an HTTP API of Jenkins as part of a post-commit action. However, using the ever-handy Monitoring plugin mentioned in my question, I inspected the stuck threads.
The threads were stuck on something in the com.trilead.ssh2.channel package. The getChannelData method has a while(true) loop that looks for output on the stderr or stdout of an SSH stream. The thread was getting stuck in that loop because nothing was coming through. I learned this by reading the source on GrepCode.
This was because the post-build action was to SSH into a server and execute a bash script that would inspect a git repo. However, the git repo was misconfigured and the git command would fail, but the exit 1 status did not bubble up through the bash script (partially due to a malformed if-elif-else statement).
The script completed and the build was considered a success, but somehow the thread handling the SSH connection from Jenkins was left hanging due to this git error.
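For illustration only (the real script isn't reproduced here, so the repo path is made up), the fix boiled down to making the script print something and exit with an explicit status:
#!/bin/bash
# hypothetical sketch of the corrected post-build script
if git -C /opt/repos/myrepo fetch --dry-run; then
    echo "repo OK"
    exit 0
else
    echo "git fetch failed" >&2
    exit 1
fi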
But thank you for your help on this question!
If you run Jenkins "out of the box", it uses the Winstone servlet container. You can pass command-line arguments to it as described here. Some of those parameters can limit the number of threads:
--handlerCountStartup = set the no of worker threads to spawn at startup. Default is 5
--handlerCountMax = set the max no of worker threads to allow. Default is 300
--handlerCountMaxIdle = set the max no of idle worker threads to allow. Default is 50
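On an RPM-based install, these could for example be appended to JENKINS_ARGS in /etc/sysconfig/jenkins (the values here are only illustrative, not recommendations):
# /etc/sysconfig/jenkins (illustrative values)
JENKINS_ARGS="--handlerCountStartup=10 --handlerCountMax=100 --handlerCountMaxIdle=20"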
Now, I tried this some time ago and was not 100% convinced that it worked, so no guarantees, but it is worth a try.
Related
We are running an automated job in Jenkins that triggers an Ansible playbook. The playbook execution works about 80% of the time, but sometimes it hangs at a task without any error, as shown below:
TASK [Install utilities like vim, mlocate etc.] ***************
changed: [192.0.100.30]
changed: [192.0.100.27]
It does not always stop at the same task; it hangs at random points!
I am unable to reproduce this issue if I execute it manually.
I have faced a similar issue earlier. It can happen when the specific host is hung or high on resource utilisation; in that case the Ansible task will sit in the queue and sometimes hang for that host.
Whenever such a situation recurs, immediately check the health of the host on which Ansible is trying to apply the playbook.
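As a quick sketch (the inventory group name "webservers" is just an example), reachability and resource usage of the target hosts can be checked with ad-hoc Ansible commands while the play is hung:
# check reachability and resource usage of the target hosts
ansible webservers -m ping
ansible webservers -m shell -a "uptime; free -m"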
setup:
I have had Jenkins running on an Ubuntu server for several months with no problems until now.
problem:
For a few days now, building a job in Jenkins results in the webUI on port 8080 becoming unresponsive (ERR_CONNECTION_REFUSED or ERR_EMPTY_RESPONSE or endless loading).
There is one job whose build seemingly always kills the Jenkins webUI, and another job that only sometimes does so.
(maybe) useful information:
The Jenkins logs often include the following warnings:
2022-01-22 14:47:20.931+0000 [id=96] WARNING hudson.security.csrf.CrumbFilter#doFilter: Found invalid crumb 80e9a2cf9c3c6d86f8787587vg8f77465b9e498d818466586fb165b9430. If you are calling this URL with a script, please use the API Token instead. More information: https://www.jenkins.io/redirect/crumb-cannot-be-used-for-script
2022-01-22 14:47:20.932+0000 [id=96] WARNING hudson.security.csrf.CrumbFilter#doFilter: No valid crumb was included in request for /ajaxExecutors by <Jenkins User Id>. Returning 403.
Given these warnings, it seems to me that the crumb validation fails (if so, why, and how would I resolve this?). But I also suspect some memory issue somewhere, as the job whose build crashes the Jenkins UI downloads files from S3 (and cleans up afterwards). Reducing the number of downloaded files per chunk seemed to keep it from crashing for a short time, but now it is also crashing with the lower amount. So I am a little confused about which direction to look in.
Also, when I SSH into the server while Jenkins is down, it sometimes times out, which makes me think the whole server is overwhelmed by the execution of the Jenkins job at times (maybe due to an OOM?).
Looking at other people having similar problems, I checked for phantomjs processes:
$ ps -ef | grep phantomjs | awk '{print $2}' | xargs sudo kill -9
kill: (2876): No such process
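I also wondered whether checking the kernel log like this would confirm an OOM kill (just a guess on my side; the exact log location may differ):
$ dmesg -T | grep -i "out of memory"
$ grep -i "oom" /var/log/syslog | tail -n 20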
Thanks to anyone taking the time, I'm completely lost with this sort of problem :D
I'm using Jenkins to do a few actions on a remote server.
I have an Execute Shell command in which I do the following:
sudo ssh <remote server> 'sudo service supervisor restart'
sleep 30
When Jenkins reaches the first line I can see 'Restarting Supervisor', but after a moment I see that Jenkins closes the SSH connection and moves on to the second line.
I tried adding a 'sleep 30' after the restart command but it still doesn't work.
It seems Jenkins doesn't wait for the supervisor restart command to complete.
The problem is that it doesn't always happen, just sometimes, but it causes a lot of trouble when it fails.
I think you can never be certain that all processes started by supervisord are in a 'ready' state after a restart. Even if the restart action waited for the processes to be started, it wouldn't know whether they are 'ready'.
In docker-compose setups that need to know if a certain service is available, I've used an extra 'really ready' check for this, optionally in a loop with a sleep/wait. If the process that you are starting opens a port, you can use one of the variations of 'wait-for' for this.
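A rough sketch of such a check (the host and port are placeholders for whatever the supervised process exposes):
# wait up to 30 seconds for the service to accept connections
for i in $(seq 1 30); do
    nc -z remote-server 9001 && break
    sleep 1
done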
I'm thinking about the following high-availability solution for my environment:
A datacenter with one powered-on Jenkins master node.
A disaster-recovery datacenter with one powered-off Jenkins master node.
Datacenter one is always powered on; the second is only for disasters. My idea is to install the two Jenkins masters using the same IP but with a shared NFS. If the first one goes down, the second starts with the same IP and I still have my service available.
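To sketch what I mean (the NFS server name and export path are placeholders), both masters would mount the same JENKINS_HOME from the shared NFS:
# /etc/fstab entry on both Jenkins masters (illustrative)
nfs-server:/export/jenkins_home  /var/lib/jenkins  nfs  defaults  0  0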
My question is: can this solution work?
Thanks all for the help ;)
I don't see any particular reason why it should not work. But you still have to monitor after a switch-over, because I have faced a situation where jobs that were running when Jenkins shut down abruptly were still in the queue when the service was recovered, but they never completed afterwards; I had to manually delete the builds using the script console.
On the Jenkins forum a lot of people have reported such bugs. Most of them seem to have been fixed, but there are still cases where this might happen, because every time Jenkins is started or restarted the configuration is reloaded from disk. So there can be inconsistency between the in-memory config that existed earlier and the reloaded config.
So in your case, it might happen that an executor thread is still blocked when the service is recovered. Thus you have to make sure that everything is running fine after recovery.
After completing a Jenkins task, I execute a Linux shell script by using Jenkins' post-condition configuration section.
This Linux shell script needs to launch a standby service in the background and must NOT cause Jenkins to pause.
I tried to use "nohup+&", etc., but it does not work.
Is there a good way to do it?
Jenkins is probably waiting for some pipes to close. Your background process has inherited some file descriptors and is keeping them open for as long as it runs.
If you are lucky, the only file descriptors are 0, 1 and 2 (the standard ones). You might want to check the file descriptors of the background process using lsof -p PID, where PID is the process id of the background process.
You should make sure all of those file descriptors (both inputs and outputs) are redirected for the background process, so start it with something like:
nohup daemon </dev/null >/dev/null 2>&1 &
Feel free to direct the output to a file other than /dev/null but make sure you keep the order of the redirections. The order is important.
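For example (the script name and log path are placeholders):
nohup ./standby-service.sh </dev/null >>/var/log/standby-service.log 2>&1 &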
If you plan to start background processes from a Jenkins job, be advised that Jenkins will kill background processes when the build ends. See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller on how to prevent that.
I had a similar problem with running a shell script from Jenkins as a background process. I fixed it by using the below command:
BUILD_ID=dontKillMe nohup ./start-fitnesse.sh &