Jenkins-swarm slaves go down

We have a large number of Jenkins slaves set up via the jenkins-swarm plugin (which auto-discovers the master).
Recently we started to see slaves going offline and existing jobs getting stuck. Restarting the master fixes it, but it has been happening far too frequently. Everything else seems fine: no network issues, no GC issues.
Any ideas?
On the Jenkins slaves, we see this repeated once the node becomes unavailable:
Retrying in 10 seconds
Attempting to connect to http://ci.foobar.com/jenkins/ 46185efa-0009-4281-89d2-c4018fa9240c with ID 5a0f1773
Could not obtain CSRF crumb. Response code: 404
Failed to create a slave on Jenkins CODE: 409
A slave called '10.9.201.89-5a0f1773' is already created and on-line
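The HTTP 409 here usually means the master still holds a registration for a node with the same name, so the returning client cannot re-register. As a small illustration (this is not the plugin's own code), a wrapper script could classify the failure lines above before deciding whether to simply retry or to clean up the stale node first:

```python
import re

# Hypothetical helper: classify swarm-client log lines so a wrapper script
# can decide whether to retry or to remove a stale node registration first.
def classify_failure(line: str) -> str:
    """Return 'stale-node', 'csrf', or 'other' for a swarm-client log line."""
    if re.search(r"CODE: 409", line) or "already created and on-line" in line:
        # The master still has a node with this name registered; the client
        # cannot re-register until that entry is removed.
        return "stale-node"
    if "Could not obtain CSRF crumb" in line:
        # Crumb endpoint returned 404; often transient while the master restarts.
        return "csrf"
    return "other"

log = [
    "Could not obtain CSRF crumb. Response code: 404",
    "Failed to create a slave on Jenkins CODE: 409",
    "A slave called '10.9.201.89-5a0f1773' is already created and on-line",
]
print([classify_failure(line) for line in log])
# → ['csrf', 'stale-node', 'stale-node']
```

If your swarm-client version supports it, the `-deleteExistingClients` option tells the client to replace such a stale entry when it connects, which avoids the 409 loop entirely.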

Related

FATAL: Remote call on jenkins node server failed

We are using Jenkins version 2.121.3, with different JDK versions on the master and the slave machines.
The issue occurs only for security jobs, which run on agents added as Jenkins nodes, and only occasionally: sometimes a job succeeds, sometimes it fails.
As a temporary workaround we relaunch the nodes, which works at the time, but we have been unable to find a permanent solution.

How to restart interrupted Jenkins jobs after a server or node failure/restart?

I'm running a Jenkins server and some slaves on a docker swarm that's hosted on preemptible Google instances (akin to AWS spot instances). I've got everything set up so that at any given moment there is a Jenkins master running on a single server and slaves running on every other server in the swarm. When one server gets terminated another is spun up to replace it, so Jenkins eventually comes back up on another machine even if its server was stopped, and slaves get replaced as they die.
I'm facing two problems:
My first one is when the Jenkins master dies and comes back online it tries to resume the jobs that were previously running and they end up getting stuck trying to be built. Is there any way to automatically have Jenkins restart jobs that were interrupted instead of trying to resume them?
The second is when a slave dies I'd like to automatically restart any jobs that were running on it elsewhere. Is there any way to do that?
Currently I'm dealing with both situations by having an external application retry the failed build jobs, but that's not really optimal.
Thanks!

How does a Jenkins master stay connected with a slave?

So I was messing around with Jenkins for a laugh, trying to see what a slave's tolerance was for network interruptions. In other words, how long a network outage can last without disconnecting the slave from the master.
What I found was that if I brought down the interface on the slave using ifdown, it stayed connected to the master. The master didn't complain even when I kept the interface down for 15 seconds, so 15 seconds where ping and ssh were not possible.
However, when I rebooted the slave, the master instantly reported that the slave was offline/disconnected/gone (I can't remember the exact terminology).
So why is it that bringing down the interface in this way doesn't seem to bother Jenkins? How is the connection between master and slave retained? Is there something about the ifdown command?
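A quiescent TCP connection sends no packets, so briefly downing the interface produces nothing the master can observe; it only notices when its periodic remoting ping goes unanswered. A reboot, by contrast, tears down the slave's TCP endpoint, so the master's side of the connection fails immediately. A back-of-the-envelope sketch of the detection window, assuming the commonly cited `hudson.slaves.ChannelPinger` defaults (your installation may use different values):

```python
# Worst-case time for the master to notice a silently dead agent:
# the outage starts right after a successful ping, so a full ping
# interval elapses before the next ping, which then has to time out.
# The defaults below are assumptions (ChannelPinger's
# pingIntervalSeconds / pingTimeoutSeconds); check your own setup.
def worst_case_detection_seconds(ping_interval=300, ping_timeout=240):
    return ping_interval + ping_timeout

print(worst_case_detection_seconds())  # 540 seconds, i.e. 9 minutes
```

A 15-second ifdown falls comfortably inside that window, which is why the master never complained.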

How to recover when Jenkins jobs stop due to temporary loss of connectivity?

I am working on a project with hundreds of Jenkins jobs, and some of them run automated tests for several hours (7, 8, even 10 hours). Unfortunately we've been experiencing network problems (WAN) and at least once a day we lose connectivity for several minutes. Then the jobs fail because the "Agent went offline":
Agent went offline during the build
ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2332)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2801)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
From my understanding, the Jenkins master uses the agent in each slave to request new jobs to run. In my project, once that is done all the tests are run locally and not remotely. Therefore, the slaves could (should?) continue to run the jobs in the background, and when the connectivity is restored to the master it can then catch up on the slave logs and status. I don't know the Jenkins architecture in depth, but this doesn't sound like an impossible task.
So this brings me to my questions: is it possible to do such recovery? Is it possible to keep the slave running isolated from the master for a while? I am sure other Jenkins users have had this problem before.

Jenkins Executor Starvation on CloudBees

I have set up jobs correctly using Jenkins on CloudBees, Janky, and Hubot. Hubot and Janky work and are pushing the jobs to the Jenkins server.
The job has been sitting in the Jenkins queue for over an hour now. I don't see anywhere to configure the # of executors and this is a completely default instance from Cloudbees.
Is the CloudBees service just taking a while or is something misconfigured?
This was a problem in early March caused by the build containers failing to start cleanly at times.
The root cause was a kernel oops that was occurring in the build container as it launched.
This has since been resolved, and you should not experience these long pauses waiting for an executor.
Anything more than 10 minutes is definitely a bug; typically anything more than about 5 seconds is unusual (although when many jobs are launched simultaneously, the time to get a container can be on the order of 3 minutes).
