I performed an update to Kubernetes 1.5.2 on Google Container Engine and then started getting the following errors:
Failed to count the # of live instances on Kubernetes
To resolve this, I upgraded Jenkins (to 2.32.2) and the Kubernetes plugin (to 0.10) to the latest versions.
Afterwards, I started getting the following errors:
Feb 08, 2017 9:51:52 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
WARNING: Connection #5 failed
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:213)
Feb 08, 2017 9:51:57 PM org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback call
SEVERE: Error in provisioning; slave=KubernetesSlave name: default-6126d6e4fb5, template=org.csanchez.jenkins.plugins.kubernetes.PodTemplate#47404ab7
java.lang.IllegalStateException: Containers are terminated with exit codes: {jnlp=255}
at org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback.call(KubernetesCloud.java:600)
at org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud$ProvisioningCallback.call(KubernetesCloud.java:532)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The last error was resolved by changing the slave container name to jnlp instead of default (see image). The Google documentation says the name is supposed to be default, but with these updates that no longer seems to be the right way to get this system working.
It looks like the updated kubernetes-plugin creates two containers (one with your specified image and another with the default jnlp image). If your container's name isn't jnlp, the plugin runs both containers in the slave pod, and this seems to be causing the connection issue.
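For illustration, here is a minimal sketch of the slave pod the plugin ends up creating when the container template is named jnlp, so your container replaces the plugin's default one instead of running alongside it; the label, image and working directory are placeholders, not values from the setup above:
apiVersion: v1
kind: Pod
metadata:
  labels:
    jenkins: slave                      # illustrative label only
spec:
  containers:
  - name: jnlp                          # must be exactly "jnlp" so it replaces the default jnlp container
    image: jenkins/jnlp-slave:latest    # placeholder; substitute your own agent image
    workingDir: /home/jenkins/agent     # placeholder working directory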
As #Alex mentioned, the problem is the name of the container. It has to be jnlp, otherwise you get the error below.
java.lang.IllegalStateException: Pod has terminated containers: jenkins/jnlp-42t0n (prod-slave)
Initially I named the container prod-slave, which caused the plugin to create two containers in the slave pod; the additional container was named jnlp. When I renamed the container in the container template to jnlp, everything worked perfectly, with only one container in the slave pod.
Although it seems weird to me that the naming of a container could affect the behavior of the slave!
I also faced the same issue as #Alex and #Shinto.
I used slave-agent as the container name with the Docker image jenkins/jnlp-slave:latest, and it started creating two containers with this error:
Error in provisioning; agent=KubernetesSlave name: kube-xgmd5, template=PodTemplate{id='9af2eabc-971f-42d4-8710-549942d76cbe', name='kube', label='kubepod', podRetention='On Failure', containers=[ContainerTemplate{name='slave-agent', image='jenkins/jnlp-slave:latest', workingDir='/home/jenkins/agent', command='', args='', resourceRequestCpu='', resourceRequestMemory='', resourceRequestEphemeralStorage='', resourceLimitCpu='', resourceLimitMemory='', resourceLimitEphemeralStorage='', envVars=[KeyValueEnvVar [getValue()=http://192.168.29.123:8080/jenkins/, getKey()=jenkins]], livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}]}
java.lang.IllegalStateException: Pod has terminated containers: default/kube-xgmd5 (slave-agent)
When I updated the container name to jnlp, it worked as expected.
Our application is Docker based and, since it is a web service, requires a NAT network to be created on the host machine in order to communicate. It had been working for the last 4 months and suddenly stopped. I checked and found that the Docker service was stopped. I manually tried restarting the service, but it failed to start. Below is the error from the event log:
Error:
fatal: failed to start deamon: Error initializing network controller: Error creating default network: failed during hnsCallRawResponse: hnsCall failed in Win32: There are no more endpoints available from endpoint mapper. (0x6d9)
I tried the following steps:
Deleted the hns.data file and restarted the HNS service, then restarted the Docker Engine service. The issue persists.
Tried running MOFCOMP. Same issue.
Tried removing Docker and reinstalling it. Doesn't work.
Tried creating the NAT network manually, but I get the above-mentioned error.
Can someone help here? What needs to be checked, or what could be the reason for this issue?
I have set up a Jenkins master (on a VM) and it is provisioning JNLP slaves as Kubernetes pods.
On very rare occasions, the pipeline fails with this message:
java.io.IOException: Pipe closed
at java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
at java.io.PipedInputStream.receive(PipedInputStream.java:226)
at java.io.PipedOutputStream.write(PipedOutputStream.java:149)
at java.io.OutputStream.write(OutputStream.java:75)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.setupEnvironmentVariable(ContainerExecDecorator.java:510)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:474)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:333)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
Viewing the Kubernetes logs in Stackdriver, one can see that the pod does manage to connect to the master, e.g.
Handshaking
Agent discovery successful
Trying protocol: JNLP4-Connect
Remote Identity confirmed: <some_hash_here>
Connecting to <jenkins-master-url>:49187
started container
loading plugin ...
but after a while it fails; here are the relevant logs:
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$SlaveDisconnector call
INFO: Disabled slave engine reconnects.
hudson.remoting.jnlp.Main$CuiListener status
Terminated
hudson.remoting.Request$2 run
Failed to send back a reply to the request hudson.remoting.Request$2#336ec321: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#29d0e8b2:JNLP4-connect connection to <jenkins-master-url>/<public-ip-of-jenkins-master>:49187": channel is already closed
"Processing signal 'terminated'"
.
.
.
How can I further troubleshoot this random error?
Can you take a look at the Kubernetes pod events in Stackdriver? We had similar behavior with a different CI system (GitLab CI): our builds were also randomly failing. It turned out that the JVM inside the container exceeded its memory limit and was killed by Kubernetes (OOMKilled), and the CI system recognised this as a build error.
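If memory pressure turns out to be the cause here as well, one mitigation is to give the jnlp container an explicit memory request/limit and keep the JVM heap below it. A minimal sketch, assuming you can edit the agent pod template and that your agent image passes JAVA_OPTS to the JVM (both assumptions; all values are placeholders):
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkins/jnlp-slave:latest    # placeholder agent image
    env:
    - name: JAVA_OPTS                   # assumption: the image honours JAVA_OPTS
      value: "-Xmx1g"                   # cap the heap below the container limit
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1536Mi"                # headroom above -Xmx for metaspace, threads, etc.
Running kubectl describe pod on a failed agent pod is a quick way to confirm the theory: an OOM-killed container shows OOMKilled as its last termination reason.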
Kubernetes version: v1.6.7
Network plugin: weave
I recently noticed that my entire cluster of 3 nodes went down. My initial troubleshooting revealed that /var on all nodes was 100% full.
Digging further into the logs revealed that they were flooded by kubelet with messages like:
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.636001 1220 kuberuntime_gc.go:138] Failed to stop sandbox "fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "<TROUBLING_POD>-1545236220-ds0v1_kube-system" network: CNI failed to retrieve network namespace path: Error: No such container: fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.637690 1220 docker_sandbox.go:205] Failed to stop sandbox "fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648": Error response from daemon: {"message":"No such container: fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648"}
The <TROUBLING_POD>-1545236220-ds0v1 pod was being created by a cronjob, and due to some misconfiguration, errors occurred while those pods were running and more and more pods were being spun up.
So I deleted all the jobs and their related pods. I now have a cluster with no jobs or pods related to my cronjob running, and I still see the same ERROR messages flooding the logs.
I did:
1) Restart docker and kubelet on all nodes.
2) Restart the entire control plane
and also
3) Reboot all nodes.
But the logs are still being flooded with the same error messages, even though no such pods are being spun up anymore.
So I don't know how I can stop kubelet from throwing these errors.
Is there a way for me to reset the network plugin I am using, or something else I can do?
Check if the pod directory exists under /var/lib/kubelet
You're on a very old version of Kubernetes; upgrading will fix this issue.
We run Flink in Kubernetes 1.8 in AWS. It has been fine for months.
I've set up a new k8s cluster. Everything is the same EXCEPT that we enabled Calico (instead of using only Flannel).
Just like Flannel, Calico gives us networking between containers.
Since enabling Calico, the Flink client receives this error when trying to send a jar file to the job manager:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
Caused by: java.io.IOException: PUT operation failed: Connection reset
Caused by: java.net.SocketException: Connection reset
and the job manager says:
java.lang.IllegalArgumentException: Invalid BLOB addressing for permanent BLOBs
2018-03-27 06:28:16,069 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job 11433fc332c7d76100fd08e6d1b623b4 (flink-job-connectivity-test).
2018-03-27 06:28:16,085 INFO org.apache.flink.runtime.jobmanager.JobManager - Using restart strategy NoRestartStrategy for 11433fc332c7d76100fd08e6d1b623b4.
2018-03-27 06:28:16,096 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
2018-03-27 06:28:16,105 INFO org.apache.flink.runtime.jobmanager.JobManager - Running initialization on master for job flink-job-connectivity-test (11433fc332c7d76100fd08e6d1b623b4).
2018-03-27 06:28:16,105 INFO org.apache.flink.runtime.jobmanager.JobManager - Successfully ran initialization on master in 0 ms.
2018-03-27 06:28:16,117 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to submit job 11433fc332c7d76100fd08e6d1b623b4 (ignite-flink-job-connectivity-test)
java.lang.NullPointerException at
org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:58)
at org.apache.flink.runtime.checkpoint.CheckpointStatsTracker.<init>(CheckpointStatsTracker.java:121)
at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:228)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1277)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:447)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
It looks like the file cannot be transferred from the client to the job manager. I believe the Invalid BLOB addressing error occurs because the job manager did not receive any file.
Everything is the same: it works on one cluster and does not work on the other. Ports are configured the same. Every artefact is the same.
We don't have any NetworkPolicy. But could enabling Calico have some effect on networking?
Problem solved. I added this to my Flink task manager manifest file:
- name: data
  port: 6121
- name: rpc
  port: 6122
- name: query
  port: 6125
And this in the Flink conf files:
taskmanager.data.port: 6121
So basically I pinned a data port for the task manager. I had already done that for the job manager (blob server port), and it was fine. But it looks like Calico works differently from Flannel and could not use a random data port for the task manager.
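For reference, the matching job manager side mentioned above can be sketched like this; the blob port number below is an example value, not one taken from this setup:
# flink-conf.yaml (sketch)
blob.server.port: 6124          # pin the blob server port instead of letting Flink pick a random one
taskmanager.data.port: 6121     # same idea for the task manager data port
# job manager manifest, exposing the pinned blob port
- name: blob
  port: 6124
The underlying idea is simply that a fixed port can be declared in the manifest ahead of time, whereas a randomly chosen one cannot.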
I have a Jenkins master server. I just created a new Jenkins agent and am launching it via Java Web Start on an Ubuntu host. The agent connects successfully, but after some time it says "Terminated", then after some more time it says "Connected", and it keeps repeating like this throughout.
I am not even trying to run a build/job yet.
Interestingly enough, this Ubuntu agent, this JNLP setup, and Java Web Start have been working fine for the last several weeks, even until a few hours ago. Now it has suddenly started to disconnect and reconnect repeatedly like this.
JNLP agent connected from /116.68.205.58
<===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 3.2
This is a Unix agent
ERROR: Connection terminated
java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
ERROR: Failed to install restarter
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.Request.abort(Request.java:307)
at hudson.remoting.Channel.terminate(Channel.java:888)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:92)
at ......remote call to Channel to /116.68.205.58(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1537)
at hudson.remoting.Request.call(Request.java:172)
at hudson.remoting.Channel.call(Channel.java:821)
at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller.install(JnlpSlaveRestarterInstaller.java:52)
at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller.access$000(JnlpSlaveRestarterInstaller.java:33)
at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$1.call(JnlpSlaveRestarterInstaller.java:39)
at jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$1.call(JnlpSlaveRestarterInstaller.java:36)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
JNLP agent connected from /116.68.205.58
<===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 3.2
This is a Unix agent
Check the Jenkins slave log for possible problems. Also, how is the Availability setting configured on the Jenkins node configuration page?
Jenkins >> Manage Jenkins >> Manage Nodes >> your node >> Configure
I recently had a Windows slave with the same symptom and changed the Availability from
"Take this agent online when in demand, and offline when idle"
to
"Keep this agent online as much as possible"
and it solved my problem, but you might have a different problem from the one I had. So I suggest viewing the slave logs first. If you can, post the log snippet here for further analysis.