Jenkins makes a Kubernetes node hang under high CPU usage

I noticed that when launching some Jenkins builds, the node hosting Jenkins sometimes gets stuck forever. The whole node becomes unreachable and all its pods are down (shown as not ready in the dashboard).
To bring things back up I need to remove the node from the cluster and add it again (I'm on GCE, so I have to remove it from the instance group before I can delete it).
Note: for hours I'm unable to connect to the node through SSH; it is clearly out of service ^^
From my understanding, running out of memory can crash a node, but hitting maximum CPU usage should only slow the server down, not cause an outage like the one I'm experiencing. In the worst case the kubelet should be unavailable until CPU usage drops.
Can someone help me determine the origin of this issue? What could cause such a problem?
[Images not shown: Node metrics 1; Node metrics 2; Jenkins slave metrics; Node metrics from GCE]
On the other hand, after waiting for hours, I was able to access the node through SSH and ran sudo journalctl -u kubelet to see what was going on. I don't see anything specific at 7pm, but I do see recurring errors like:
Apr 04 19:00:58 nodes-s2-2g5v systemd[43508]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[43511]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Going back to older logs, I found that this kind of message starts around 5:30pm:
Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:05.168402 1841 prober.go:111] Readiness probe for "...
Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:04.021125 1841 prober.go:111] Readiness probe for "...
-- Reboot --
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1699]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
At this point the node reboots (and the kubelet with it), and this corresponds to a Jenkins build. The same pattern of high CPU usage is present. I don't know why it simply rebooted earlier, whereas around 7pm the node got stuck :/
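The journal lines above end in status=203/EXEC with "Permission denied", which is what systemd reports when it cannot execute the service binary at all. A minimal sketch for checking that on the node follows; the function name check_exec is made up for illustration, and the path is the one printed in the journal.

```shell
# Report the most basic reason a binary might fail to be executed.
check_exec() {
  bin="$1"
  if [ ! -e "$bin" ]; then
    echo "missing"
  elif [ ! -x "$bin" ]; then
    echo "not-executable"
  else
    echo "ok"
  fi
}

# Usage on the node (path taken from the journal above):
#   check_exec /home/kubernetes/bin/kubelet
#   findmnt -T /home/kubernetes/bin -o TARGET,OPTIONS   # look for "noexec"
```

This only covers file mode bits and existence; mount options and security policies have to be inspected separately.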
I'm really sorry, it's a lot of information, but I'm totally lost, and this is not the first time it has happened to me ^^
Thank you,

As mentioned by @Brandon, it was related to the resource limits applied to my Jenkins slaves.
In my case, even though the values were specified in my Helm chart YAML file, they were not applied. I had to dig deeper into the UI to set them manually.
Since this change, everything has been stable! :)
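Whether limits actually reached the agent pods can be checked on the running pod rather than in the chart. A minimal sketch, assuming kubectl access to the cluster; the function name show_pod_resources and the pod and namespace names in the usage comment are placeholders, not values from this setup.

```shell
# Print each container name and its resources stanza for one pod.
# An empty stanza means no requests or limits were applied.
show_pod_resources() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}

# Usage (hypothetical names):
#   show_pod_resources my-jenkins-agent jenkins
```

If the output shows no limits for the build container, the values from the chart were not picked up and a build can consume all CPU and memory on the node.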

Related

How to resolve Jenkins installation error: jenkins.service - Jenkins Continuous Integration Server

I was trying to install Jenkins on EC2 (Ubuntu and Amazon Linux), but I am getting this error:
jenkins.service - Jenkins Continuous Integration Server
Loaded: loaded (/lib/systemd/system/jenkins.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2022-11-27 16:29:54 UTC; 1h 8min ago
Process: 18706 ExecStart=/usr/bin/jenkins (code=exited, status=1/FAILURE)
Main PID: 18706 (code=exited, status=1/FAILURE)
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: jenkins.service: Main process exited, code=exited, status=1/FAILURE
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: jenkins.service: Failed with result 'exit-code'.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: Failed to start Jenkins Continuous Integration Server.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: jenkins.service: Scheduled restart job, restart counter is at 5.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: Stopped Jenkins Continuous Integration Server.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: jenkins.service: Start request repeated too quickly.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: jenkins.service: Failed with result 'exit-code'.
Nov 27 16:29:54 ip-172-31-19-156 systemd[1]: Failed to start Jenkins Continuous Integration Server.
root@ip-172-31-19-156:~# cd /car/usr
My first Jenkins installation was interrupted partway through. I then replaced the EC2 volume to reset the instance, and since then I have been unable to install Jenkins on EC2.

Jenkins will not start

After installing Java and Jenkins on my CentOS 7 server, I tried to start Jenkins and got the error message below.
Job for jenkins.service failed. See "systemctl status jenkins.service"
and "journalctl -xe" for details.
When I run "systemctl status jenkins.service" to see what the issue is, I get the output below:
● jenkins.service - Jenkins Continuous Integration Server
Loaded: loaded (/usr/lib/systemd/system/jenkins.service; disabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2022-08-18 14:23:02 UTC; 20s ago
Process: 8847 ExecStart=/usr/bin/jenkins (code=exited, status=0/SUCCESS)
Main PID: 8847 (code=exited, status=0/SUCCESS)
Aug 18 14:23:02 localhost.localdomain systemd[1]: Failed to start Jenkins Continuous Integration Server.
Aug 18 14:23:02 localhost.localdomain systemd[1]: Unit jenkins.service entered failed state.
Aug 18 14:23:02 localhost.localdomain systemd[1]: jenkins.service failed.
Aug 18 14:23:02 localhost.localdomain systemd[1]: jenkins.service holdoff time over, scheduling restart.
Aug 18 14:23:02 localhost.localdomain systemd[1]: Stopped Jenkins Continuous Integration Server.
Aug 18 14:23:02 localhost.localdomain systemd[1]: start request repeated too quickly for jenkins.service
Aug 18 14:23:02 localhost.localdomain systemd[1]: Failed to start Jenkins Continuous Integration Server.
Aug 18 14:23:02 localhost.localdomain systemd[1]: Unit jenkins.service entered failed state.
Aug 18 14:23:02 localhost.localdomain systemd[1]: jenkins.service failed.
I'm not quite sure how to fix this. Does anybody have a solution? Thanks
Can you please use journalctl -xe for more detailed logs?
Also, can you run Jenkins in interactive mode to see why it is failing to start, like this:
java -jar jenkins.war
You can find the command details in the /usr/bin/jenkins file.
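The status output above ends with systemd refusing further starts ("start request repeated too quickly"), so the real error is further back in the journal and the unit stays failed until the counter is reset. A small sketch for detecting that state from journal text; the function name hit_start_limit is made up for illustration.

```shell
# Succeeds (exit 0) if the journal text on stdin shows that systemd's
# start rate limit was hit.
hit_start_limit() {
  grep -qi 'start request repeated too quickly'
}

# Usage:
#   journalctl -u jenkins.service --no-pager | hit_start_limit \
#     && sudo systemctl reset-failed jenkins.service
#   journalctl -u jenkins.service --no-pager -n 100   # look for the first failure
```

Resetting the failed state only allows another start attempt; it does not fix whatever made Jenkins exit in the first place.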

Docker 17 fails to start in Centos 7

We installed Docker 17.12 on our CentOS 7.x server, and after the installation completed I am facing an error while trying to start the Docker service. Initially I tried systemctl docker start; then, to get more output, I ran journalctl, which says docker.service entered failed state.
More details below:
Docker :
17.12.1-ce , build 7390fc6
Command tried:
sudo systemctl start docker
journalctl -u docker.service
Expected Output:
Docker service should be started successfully
Actual output:
Mar 26 23:51:19 docker[16420]: See 'docker --help'
Mar 26 23:51:19 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 26 23:51:19 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:51:19 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:51:19 docker.service failed.
Mar 26 23:51:21 systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 26 23:51:21 systemd[1]: start request repeated too quickly for docker.service
Mar 26 23:51:21 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:51:21 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:51:21 systemd[1]: docker.service failed.
Mar 26 23:52:22 systemd[1]: Starting Docker Application Container Engine...
Mar 26 23:52:22 docker[16582]: docker: 'daemon' is not a docker command.
Mar 26 23:52:22 docker[16582]: See 'docker --help'
Mar 26 23:52:22 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 26 23:52:22 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:52:22 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:52:22 systemd[1]: docker.service failed.
Mar 26 23:52:24 systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 26 23:52:24 systemd[1]: Starting Docker Application Container Engine...
Mar 26 23:52:25 docker[16601]: docker: 'daemon' is not a docker command.
Mar 26 23:52:25 docker[16601]: See 'docker --help'
Mar 26 23:52:25 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 26 23:52:25 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:52:25 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:52:25 systemd[1]: docker.service failed.
Mar 26 23:52:27 systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 26 23:52:27 systemd[1]: Starting Docker Application Container Engine...
Mar 26 23:52:27 docker[16619]: docker: 'daemon' is not a docker command.
Mar 26 23:52:27 docker[16619]: See 'docker --help'
Mar 26 23:52:27 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Mar 26 23:52:27 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:52:27 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:52:27 systemd[1]: docker.service failed.
Mar 26 23:52:29 systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 26 23:52:29 systemd[1]: start request repeated too quickly for docker.service
Mar 26 23:52:29 systemd[1]: Failed to start Docker Application Container Engine.
Mar 26 23:52:29 systemd[1]: Unit docker.service entered failed state.
Mar 26 23:52:29 systemd[1]: docker.service failed.
Please look into this issue and help us resolve the Docker start problem.
There is no evidence in your log.
Would you just reinstall using the official method?
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sh get-docker.sh
Check if there's another issue with:
sudo dockerd --debug
In my situation I had invalid config in the daemon.json.
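The repeated "docker: 'daemon' is not a docker command." line in the journal suggests that the unit's ExecStart still calls the old docker daemon subcommand, which newer releases replaced with the separate dockerd binary. The unit file is not shown in the question, so this is an assumption. A sketch for checking it; the function name uses_legacy_daemon is made up for illustration.

```shell
# Succeeds (exit 0) if unit text on stdin has an ExecStart line that
# invokes "docker daemon" instead of "dockerd".
uses_legacy_daemon() {
  grep -Eq '^ExecStart=.*docker[[:space:]]+daemon([[:space:]]|$)'
}

# Usage:
#   systemctl cat docker.service | uses_legacy_daemon \
#     && echo "unit or drop-in still uses 'docker daemon'"
```

systemctl cat prints the unit together with any drop-in files, which is where a leftover ExecStart override from an older installation would show up.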

Marathon exits with status 1 on start

Jan 24 15:28:57 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 15:28:57 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Unit entered failed state.
Jan 24 15:28:57 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Failed with result 'exit-code'.
Jan 24 15:29:57 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Service hold-off time over, scheduling restart.
Jan 24 15:29:57 ivum01-HP-Pro-3330-SFF systemd[1]: Stopped Scheduler for Apache Mesos.
Jan 24 15:29:57 ivum01-HP-Pro-3330-SFF systemd[1]: Starting Scheduler for Apache Mesos...
Jan 24 15:29:57 ivum01-HP-Pro-3330-SFF systemd[1]: Started Scheduler for Apache Mesos.
Jan 24 15:29:57 ivum01-HP-Pro-3330-SFF marathon[1838]: No start hook file found ($HOOK_MARATHON_START). Proceeding with the start script.
Jan 24 15:29:59 ivum01-HP-Pro-3330-SFF marathon[1838]: [scallop] Error: Required option 'master' not found
Jan 24 15:29:59 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 15:29:59 ivum01-HP-Pro-3330-SFF systemd[1]: marathon.service: Unit entered failed state.
These are the commands I am using for Marathon:
sudo mkdir -p /etc/marathon/conf
sudo cp /etc/mesos-master/hostname /etc/marathon/conf
sudo cp /etc/mesos/zk /etc/marathon/conf/master
sudo cp /etc/marathon/conf/master /etc/marathon/conf/zk
sudo nano /etc/marathon/conf/zk
The only portion I need to modify in this file is the endpoint. Change it from /mesos to /marathon.
That’s an out of memory error. Are you sure your node has enough memory to run both Mesos Master and Marathon?
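The journal above also reports "Required option 'master' not found", so it may be worth confirming that the option files created by the commands in the question exist and are not empty. A minimal sketch; the function name missing_marathon_opts is made up, and whether the installed Marathon version reads /etc/marathon/conf at all is an assumption here.

```shell
# Print the name of each expected option file that is missing or empty.
# The directory layout is the one used by the commands in the question.
missing_marathon_opts() {
  dir="${1:-/etc/marathon/conf}"
  for opt in master zk; do
    [ -s "$dir/$opt" ] || echo "$opt"
  done
}

# Usage:
#   missing_marathon_opts                 # checks /etc/marathon/conf
#   missing_marathon_opts /some/other/dir
```

No output means both files are present and non-empty; if the error persists in that case, the start script is probably not reading this directory.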

Unable to restart Unicorn in Ubuntu 16.04

I am trying to deploy my new Rails app to an Ubuntu 16.04 DigitalOcean server. Here Unicorn is managed via systemd. This is my /etc/systemd/system/unicorn.service file:
[Unit]
Description=Skreem Application
Before=nginx.service
Requires=network.target
[Service]
Type=simple
User=rails
Group=rails
RuntimeDirectory=DigitalOceanOneClick
SyslogIdentifier=DigitalOceanRailsOneClick
# Go paranoid
PrivateTmp=true
PrivateDevices=true
ProtectSystem=full
ProtectKernelTunables=true
NoNewPrivileges=true
WorkingDirectory=/home/rails/skreem-ror
ExecStart=/bin/bash /home/rails/skreem-ror/.unicorn.sh
TimeoutSec=60s
RestartSec=10s
Restart=always
[Install]
WantedBy=multi-user.target
When I try to restart the unicorn service, I get the following error:
Failed to restart unicorn.service: Unit unicorn.service is not loaded properly: Invalid argument.
See system logs and 'systemctl status unicorn.service' for details.
Then I tried systemctl status unicorn.service and got:
Jul 03 10:05:06 skreem-dev2 systemd[1]: unicorn.service: Main process exited, code=exited, status=1/FAILURE
Jul 03 10:05:06 skreem-dev2 systemd[1]: unicorn.service: Unit entered failed state.
Jul 03 10:05:06 skreem-dev2 systemd[1]: unicorn.service: Failed with result 'exit-code'.
Jul 03 10:05:07 skreem-dev2 systemd[1]: [/etc/systemd/system/unicorn.service:18] Unknown lvalue 'ProtectKernelTunables' in section 'Service'
Jul 03 10:05:07 skreem-dev2 systemd[1]: [/etc/systemd/system/unicorn.service:32] Missing '='.
Jul 03 10:05:16 skreem-dev2 systemd[1]: unicorn.service: Service hold-off time over, scheduling restart.
Jul 03 10:05:16 skreem-dev2 systemd[1]: unicorn.service: Failed to schedule restart job: Unit unicorn.service is not loaded properly: Invalid a
Jul 03 10:05:16 skreem-dev2 systemd[1]: unicorn.service: Unit entered failed state.
Jul 03 10:05:16 skreem-dev2 systemd[1]: unicorn.service: Failed with result 'resources'.
Jul 03 11:33:51 skreem-dev2 systemd[1]: Stopped DigitalOcean Rails One-Click Application.
This output does not come from my updated unicorn.service file. Is it because my changes are not being loaded properly? Please help me solve this issue.
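The status output reports "Missing '='" at line 32, while the unit as shown above is shorter than that, which suggests systemd is still working from a different copy of the file. A rough sketch for spotting lines that would trigger that message; the function name find_bad_unit_lines is made up, and the check ignores backslash line continuations.

```shell
# Print "LINE: TEXT" for every line of a unit file that is not blank,
# not a comment, not a [Section] header, and contains no "=".
find_bad_unit_lines() {
  awk '
    /^[ \t]*$/ { next }            # blank line
    /^[ \t]*[#;]/ { next }         # comment
    /^\[[^]]+\][ \t]*$/ { next }   # section header
    !/=/ { printf "%d: %s\n", NR, $0 }
  ' "$1"
}

# Usage:
#   find_bad_unit_lines /etc/systemd/system/unicorn.service
#   sudo systemctl daemon-reload        # make systemd re-read edited units
#   sudo systemctl restart unicorn.service
```

Edits to a unit file only take effect after systemctl daemon-reload, so it is worth running that before judging whether the new file is valid.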
