Jenkins connection to Ansible Tower fails intermittently

We are using Jenkins (recently updated) with the Ansible plugin and the Ansible Tower plugin to connect to our AWX tower. Most of the time it works great, but lately the tower sometimes does not respond correctly to Jenkins. Again, this does not always happen, but it happens frequently enough to be a major concern.
When the issue occurs, the error messages I receive in Jenkins are along these lines:
ERROR: Failed to get job status from Tower: Unable to make tower request: Connection reset
ERROR: Failed to get job events from tower: Unexpected error code returned (503)
The normal response should be:
Tower completed the requested job
The option "Enable Debugging" is enabled for the Ansible Tower, but I have not seen any additional output in the Jenkins job logs so far.
Last time the connection failed, I went into the Jenkins settings and clicked "Test Connection" for the Ansible Tower plugin, and it worked right away.
I have not seen the web interface fail, and the jobs do complete normally. The issue lies in communication between Jenkins and AWX.
Jenkins and all the plugins were recently updated.
The person who installed AWX is no longer with us, and I don't know where else to turn to troubleshoot this.
Versions:
AWX version: 9.0.0.0
AWX install method: openshift sts
Ansible version: 2.8.5
Operating System: N/A
Web Browser: N/A
Jenkins: 2.204.2
Jenkins Ansible plugin: 1.0
Jenkins Ansible Tower plugin: 0.14.0
In the Jenkins pipeline, the following code handles the Ansible part:
wrap([$class: 'AnsiColorBuildWrapper', colorMapName: "xterm"]) {
    ansibleTower( [parameters] )
}
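For completeness, here is roughly what the expanded step would look like with the step's own verbose flag turned on and a retry wrapper around it, in case the failures are transient; the towerServer and jobTemplate values below are illustrative placeholders, not our real configuration:
wrap([$class: 'AnsiColorBuildWrapper', colorMapName: "xterm"]) {
    // Retry the Tower call a couple of times in case the 503 / connection reset is transient.
    retry(3) {
        ansibleTower(
            towerServer: 'My Tower',      // placeholder: the server name configured in the plugin
            jobTemplate: 'Deploy App',    // placeholder: the job template to run
            importTowerLogs: true,        // pull the Tower job output back into the Jenkins log
            verbose: true                 // extra step output, in addition to "Enable Debugging"
        )
    }
}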
I don't have access to Jenkins on the file system level, only the general web UI.
I'd appreciate any troubleshooting steps you could provide or advice on where else to ask.

Related

retry jenkins kubernetes agent connection

I am using the kubernetes plugin in Jenkins pipelines to create agents in Kubernetes. I am able to launch, connect, and run builds on the agents. However, when there isn't enough capacity for the agent pod, the agent bring-up fails immediately with a "forbidden: exceeded quota" error. My question is: is there a way to retry 'n' times, with a sleep in between, to bring up the agent, since other builds running on Kubernetes can finish and free up resources?
Thanks,
GD
The kubernetes plugin version I was using is 1.27.7, and apparently this is a known bug in that version (https://issues.jenkins.io/browse/JENKINS-63976). The bug seems to be fixed in kubernetes plugin version 1.28.6.
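If upgrading is not immediately possible, here is a rough workaround sketch; the label and build step are illustrative, and whether the quota failure actually surfaces as an exception you can catch depends on the plugin version:
// Retry agent bring-up a few times, sleeping between attempts so other builds
// can finish and free up quota. Label and build step are illustrative placeholders.
def maxAttempts = 3
def gotAgent = false
for (int attempt = 1; attempt <= maxAttempts && !gotAgent; attempt++) {
    try {
        podTemplate(label: 'k8s-agent') {
            node('k8s-agent') {
                gotAgent = true       // we got scheduled; failures below are real build failures
                sh 'make build'       // placeholder build step
            }
        }
    } catch (err) {
        if (gotAgent) {
            throw err                 // the build itself failed; don't retry that here
        }
        echo "Agent bring-up attempt ${attempt} failed: ${err}"
        if (attempt < maxAttempts) {
            sleep time: 60, unit: 'SECONDS'   // give other builds time to free up quota
        }
    }
}
if (!gotAgent) {
    error 'Could not get a Kubernetes agent after retries'
}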

jenkins kubernetes-plugin slaveConnectTimeout not honoured

I am running Jenkins 2.103 in Docker and have connected it to a Kubernetes-on-ARM cluster.
I have been able to manually connect the JNLP (v3.16) slave to the master; however, it appears to take around 15 minutes for it to fully connect and report as online. Once online, I can run builds as expected.
The problem is that the 'slaveConnectTimeout' setting in the podTemplate does not appear to be honoured in the pipeline configuration, and neither is the default template setting of 'Timeout in seconds for Jenkins connection' in the Pod Template section of Global Settings.
Has anyone been able to make this setting work, and does anyone have any idea what could be causing the 15-minute delay in registration?
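For reference, the setting is declared on the podTemplate like this in a scripted pipeline (the label and timeout values are illustrative):
podTemplate(label: 'arm-agent', slaveConnectTimeout: 120) {
    // slaveConnectTimeout is the per-template connect timeout, in seconds; values are illustrative.
    node('arm-agent') {
        sh 'uname -m'
    }
}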
This issue has now been raised as a bug as well: JENKINS-49281.
The issue ended up being OpenJDK and me not fully understanding what the Kubernetes timeout is all about.
The delay in agent registration is not just a Jenkins issue; I have seen the same behaviour in GoCD and other Java-based apps. It is a platform issue, not an app issue.

Jenkins-swarm slaves go down

We have a large number of Jenkins slaves set up via the jenkins-swarm plugin (which auto-discovers the master).
Recently, we started to see slaves going offline and existing jobs getting stuck. We fixed it by restarting the master; however, it has been happening too frequently. Everything else seems to be fine: no network issues, no GC issues.
Anyone?
On the Jenkins slaves, we see this repeatedly once the node becomes unavailable:
Retrying in 10 seconds
Attempting to connect to http://ci.foobar.com/jenkins/ 46185efa-0009-4281-89d2-c4018fa9240c with ID 5a0f1773
Could not obtain CSRF crumb. Response code: 404
Failed to create a slave on Jenkins CODE: 409
A slave called '10.9.201.89-5a0f1773' is already created and on-line

Jenkins docker-plugin - Job does not start (waiting for executor)

I'm trying (not hard enough, it seems) to get our Jenkins server to provision a Jenkins slave using Docker.
I have installed the Docker-plugin and configured it according to the description on the page. I have also tested the connectivity and at least this part works.
I have also configured 1 label in the plugin and in my job. I even get a nice page showing me the connected jobs for this slave.
When I then try to start a build nothing really happens. A build is scheduled, but never started - (pending—Waiting for next available executor).
From the message, it would seem like Jenkins is not able to start the slave via Docker.
I'm using docker 1.6.2 and the plugin is 0.10.1.
Any clue to what is going on would be much appreciated!
It seems the problem was that I had added the Docker version in the plugin config. That is apparently a no-go, according to this post.

how to relaunch building application after jenkins slave agent was rebooted

We have a Jenkins project. Use case:
Jenkins triggers the build
the slave agent builds the application
the server with the slave agent reboots (for any reason: a power problem, somebody rebooted it, a resource shortage, and so on)
After that, Jenkins reports a failed build. How can we automatically relaunch the application build in Jenkins once the slave agent has recovered from the failure?
There are two aspects to this issue:
The Jenkins server needs to reschedule the build that failed (when the slave machine crashed).
Install the Naginator Plugin
Set it to rebuild whatever job you have set on the problematic slave
The Jenkins slave needs to restart automatically as soon as its host is up again.
On Windows, for example, you need to set it up as a service that starts automatically.
Note that the Naginator Plugin doesn't know what caused the build to fail, so it will try to rebuild any build that fails. To solve this, scan the log for an indication that the slave crashed and set a regular expression (in the Naginator) to catch it.
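For pipeline jobs there is no Naginator step as such; a rough equivalent is to wrap the node allocation in the built-in retry step. Whether an agent reboot actually surfaces as a retryable failure depends on your Jenkins and plugin versions, so treat this as a sketch rather than a drop-in fix:
// Sketch: re-run the whole node block if it fails, for example because the agent went down.
// The label and build command are illustrative placeholders.
retry(3) {
    node('build-agent') {
        sh './build.sh'
    }
}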
Cheers
