Jenkins webUI crashes on job build

setup:
I have had Jenkins running on an Ubuntu server for several months with no problems until now.
problem:
For a few days now, building a job in Jenkins results in the webUI on port 8080 becoming unresponsive (ERR_CONNECTION_REFUSED or ERR_EMPTY_RESPONSE or endless loading).
One job seemingly always kills the Jenkins webUI when built, and another job only sometimes does so.
(maybe) useful information:
The Jenkins logs often include the following warnings:
2022-01-22 14:47:20.931+0000 [id=96] WARNING hudson.security.csrf.CrumbFilter#doFilter: Found invalid crumb 80e9a2cf9c3c6d86f8787587vg8f77465b9e498d818466586fb165b9430. If you are calling this URL with a script, please use the API Token instead. More information: https://www.jenkins.io/redirect/crumb-cannot-be-used-for-script
2022-01-22 14:47:20.932+0000 [id=96] WARNING hudson.security.csrf.CrumbFilter#doFilter: No valid crumb was included in request for /ajaxExecutors by <Jenkins User Id>. Returning 403.
Given these warnings, it seems to me that the crumb validation fails (if so, why, and how would I resolve this?). But I also suspect a memory issue somewhere, as the job that crashes the Jenkins UI on build downloads files from S3 (and cleans up afterwards). Reducing the number of downloaded files per chunk seemed to keep it from crashing for a short time, but now it is also crashing with the lower amount. So I am a little confused about which direction I should look in.
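For the crumb warnings specifically, the linked Jenkins page recommends that scripted calls authenticate with an API token instead of relying on a crumb; a minimal sketch (user, token, and job name are hypothetical):
$ curl -X POST -u myuser:MY_API_TOKEN http://localhost:8080/job/myjob/build
With API-token basic authentication, Jenkins does not require a CSRF crumb for the request.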
Also, when I SSH into the server while Jenkins is down, the connection sometimes times out, which makes me think the whole server is at times overwhelmed by the execution of the Jenkins job (maybe due to an OOM?).
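One quick check for the OOM theory (assuming standard Ubuntu logging) is whether the kernel OOM killer fired around the time of the build:
$ sudo dmesg -T | grep -i -E 'out of memory|killed process'
$ grep -i oom /var/log/syslog
If Jenkins or the job's child processes show up there, the crashes are memory pressure rather than anything crumb-related.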
Looking at other people having similar problems, I checked for phantomjs processes:
$ ps -ef | grep phantomjs | awk '{print $2}' | xargs sudo kill -9
kill: (2876): No such process
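(The "No such process" output is usually just the grep process matching itself in the ps listing and being gone by the time kill runs; pgrep avoids that:
$ pgrep phantomjs | xargs -r sudo kill -9
The -r flag keeps xargs from invoking kill when no phantomjs processes exist.)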
Thanks to anyone taking the time; I'm completely lost with these sorts of problems :D

Related

docker-compose exits due to large folder var/cache/prod of Symfony project

My Symfony 5 app is running inside a Docker container. When I want to deploy an update, docker-compose shows this error:
ERROR: for app UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
I've tried running export COMPOSE_HTTP_TIMEOUT=200 before docker-compose, but the problem remains!
The only workaround is to enter the container, manually empty the var/cache/prod folder, and then run docker-compose, but that's not a clean approach!
Note that the size of var/cache/prod grows enormously and very quickly: almost 2 GB in less than 3 hours!
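For reference, the manual workaround described above looks roughly like this (assuming the service is named app, as in the error message, and the project root is the container's working directory):
$ docker-compose exec app rm -rf var/cache/prod/*
or, more cleanly, via the Symfony console:
$ docker-compose exec app php bin/console cache:clear --env=prod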

Logrotate inside docker container not letting go of files?

Technical details:
Docker versions: This has happened on docker 18, 19 and 20, at several minor versions.
19.03.12 and 20.10.3 are the most recent installs I've got running.
OS: Ubuntu 16.04/18.04/20.04, all display this behavior.
All installs use overlay2.
Description:
Running lsof | grep -i deleted reveals log files being held on to by processes running inside Docker containers. Grepping for the PID in the output of ps -aef --forest confirms the processes are running inside a Docker container.
Running the same process outside docker has thus far not reproduced this behavior.
This started to become apparent when tens of gigabytes of diskspace could not be found.
This happens both for java and nodejs processes, using logback/winston respectively.
These logger libraries take care of their own logrotating and are not being tracked by a distinct logrotate service.
I have no clue what causes this behavior, and why it doesn't happen when not dockerized.
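A quick way to quantify the held space (assuming lsof on the host; column positions can vary slightly between lsof versions) is to list open-but-deleted files, i.e. those with a link count below one, and sum their sizes:
$ sudo lsof +L1 | awk 'NR>1 {sum += $7} END {printf "%.1f MB held by deleted files\n", sum/1024/1024}'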
Googling "docker logrotate files being held" just gives results on logrotating the log files of docker itself.
I'm not finding what I'm looking for with general searching, so I came here to ask whether maybe someone recognizes this and it's actually an easy fix? Some simple thing I forgot that just has to be pointed out?
Here's hoping someone knows what's going on.

Docker compose stops printing standard out and then unexpectedly quits

I'm running two binary services (each written in C++) through docker-compose 1.22.0. There's nothing particularly special about the docker-compose.yml. Both services use host networking, have init: true, use user: "0", and mount large volumes from the host, but otherwise their commands pretty much just run binary files. (Unfortunately, for software licensing reasons, I cannot give much more information, but I know of nothing else relevant.)
What's happening is that occasionally and inconsistently (I'd say at least 75% of the time, it works fine), both services will stop printing to the console after perplexingly outputting the following:
first_service_1 | second_service_1 |
first_service_1 is colored differently than second_service_1, and they look identical to the other log lines except without a newline between them.
After that, nothing gets printed. The services (which expose services on ports) will still respond to requests, but they won't print what they're supposed to/what they normally print.
After some time (it's unclear how much exactly), first_service will quit like so:
my-dockerfiles_first_service_1 exited with code 1
If I could assume docker-compose isn't causing the exit (which I'm unsure of, because it seems to mostly crash only after it fails to print), I'd be able to debug this more easily. I know of no reason printing should ever fail, or why it should ever print a log line without a newline at the end.
Can anyone advise me on why docker compose might print that odd line or stop printing entirely?
The host environment is Ubuntu 16.04 on an aarch64 NVidia TX-2 running L4T 28.2. docker-compose was installed recently (within the last couple of weeks) through pip. Docker is on version 18.06.0-ce.
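One way to take docker-compose's log multiplexing out of the equation (a debugging sketch, using the container name from the exit message above) is to follow each container's log stream directly and check how Docker itself recorded the exit:
$ docker logs -f --timestamps my-dockerfiles_first_service_1
$ docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' my-dockerfiles_first_service_1
If the direct log stream keeps flowing while the compose output freezes, the problem is in compose's log attachment rather than in the services themselves.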

Why would a service that works locally be getting kill signals frequently in docker?

I have a universal react app hosted in a docker container in a minikube (kubernetes) dev environment. I use virtualbox and I actually have more microservices on this vm.
In this react app, I use pm2 to restart my app on changes to server code, and webpack hmr to hot-reload client code on changes to client code.
Every 15-45 seconds or so, pm2 logs the message below, indicating that the app exited due to a SIGKILL.
App [development] with id [0] and pid [299], exited with code [0] via signal [SIGKILL]
I can't for the life of me figure out why it is happening. It is relatively frequent, but not so frequent that it happens every second. It's quite annoying because each time it happens, my webpack bundle has to recompile.
What are some reasons why pm2 might receive a SIGKILL in this type of dev environment? Also, what are some possible ways of debugging this?
I noticed that my services that use pm2 to restart on server changes do NOT have this problem when they are just backend services. I.e. when they don't have webpack. In addition, I don't see these SIGKILL problems in my prod version of the app. That suggests to me there is some problem with the combination of webpack hmr setup, pm2, and minikube / docker.
I've tried the app locally (not in Docker/minikube) and it works fine without any SIGKILLs, so it can't be webpack hmr on its own. Does Kubernetes kill services that use a lot of memory? (Maybe it thinks my app is using a lot of memory.) If that's not the case, what might be some reasons Kubernetes or Docker would send a SIGKILL? Is there any way to debug this?
Any guidance is greatly appreciated. Thanks
I can't quite tell from the error message you posted, but usually this is a result of the kernel OOM Killer (Out of Memory Killer) taking out your process. This can be either because your process is just using up too much memory, or you have a cgroup setting on your container that is overly aggressive and causing it to get killed. You may also have under-allocated memory to your VirtualBox instance.
Normally you'll see Docker reporting that the container exited with code 137 in docker ps -a
dmesg or your syslogs on the node in question may show the kernel OOM killer output.
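In a minikube setup like this one, the pod status usually records the kill reason as well (a sketch; the pod name is hypothetical):
$ kubectl describe pod my-react-app | grep -A 4 'Last State'
Look for Reason: OOMKilled and Exit Code: 137. With the metrics-server addon enabled, kubectl top pod also shows whether usage is creeping toward the container's memory limit.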

How many threads should Jenkins run?

I have a Jenkins server that keeps running out of memory and cannot create native threads. I've upped the memory and installed the Monitoring plugin.
There are about 150 projects on the server, and I've been watching the thread count creep up all day. It is now around 990. I expect when it hits 1024, which is the user limit for threads, Jenkins will run out of memory again.
[edit]: I have hit 1016 threads and am now getting the out of memory error
Is this an appropriate number of threads for Jenkins to be running? How can I tell Jenkins to destroy threads when it is finished with them?
tl;dr:
There was a post-build action running a bash script that didn't return anything via stderr or stdout to Jenkins. Therefore, every time the build ran, threads would be created and get stuck waiting. I resolved this issue by having the bash script return an exit status.
long answer
I am running Jenkins on CentOS and installed it via the RPM. In terms of modifying the Winstone servlet container, you can change that in the Jenkins init configuration in /etc/sysconfig/jenkins. However, the handlerCount options listed in the answer below only control the number of HTTP threads that are created, not the number of threads overall.
That would be a solution if my threads were hanging on accessing an HTTP API of Jenkins as part of a post-commit action. However, using the ever-handy Monitoring plugin mentioned in my question, I inspected the stuck threads.
The threads were stuck on something in the com.trilead.ssh2.channel package. The getChannelData method has a while(true) loop that looks for output on the stderr or stdout of an SSH stream. The thread was getting stuck in that loop because nothing was coming through. I learned this on GrepCode.
This was because the post-build action was to execute a command via SSH onto a server and run a bash script that would inspect a git repo. However, the git repo was misconfigured and the git command would error, but the exit 1 status did not bubble up through the bash script (partially due to a malformed if-elif-else statement).
The script completed and the build was considered a success, but somehow the thread handling the SSH connection from Jenkins was left hanging due to this git error.
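A minimal sketch of the kind of fix described (the repo path is hypothetical; the point is that the git failure must surface as an explicit exit status):
#!/bin/bash
# Capture the git status explicitly instead of letting a malformed
# if/elif/else swallow the failure:
if git -C /path/to/repo status > /dev/null 2>&1; then
    echo "repo OK"
    exit 0
else
    echo "git check failed" >&2
    exit 1    # now visible to the Jenkins SSH thread, which can finish
fi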
But thank you for your help on this question!
If you run Jenkins "out of the box", it uses the Winstone servlet container. You can pass command-line arguments to it as described here. Some of those parameters can limit the number of threads:
--handlerCountStartup = set the no of worker threads to spawn at startup. Default is 5
--handlerCountMax = set the max no of worker threads to allow. Default is 300
--handlerCountMaxIdle = set the max no of idle worker threads to allow. Default is 50
Now, I tried this some time ago and was not 100% convinced that it worked, so no guarantees, but it is worth a try.
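For reference, with an RPM install like the one described above, these flags would typically go into JENKINS_ARGS in the sysconfig file (a sketch, assuming the classic sysconfig layout; newer systemd-based installs use a service override instead):
# /etc/sysconfig/jenkins
JENKINS_ARGS="--handlerCountMax=150 --handlerCountMaxIdle=25"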
