Kubernetes version : v1.6.7
Network plugin : weave
I recently noticed that my entire cluster of 3 nodes went down. My initial troubleshooting revealed that /var on all nodes was 100% full.
Digging further into the logs revealed that they were being flooded by kubelet with errors like:
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.636001 1220 kuberuntime_gc.go:138] Failed to stop sandbox "fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "<TROUBLING_POD>-1545236220-ds0v1_kube-system" network: CNI failed to retrieve network namespace path: Error: No such container: fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.637690 1220 docker_sandbox.go:205] Failed to stop sandbox "fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648": Error response from daemon: {"message":"No such container: fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648"}
The <TROUBLING_POD>-1545236220-ds0v1 pod was being created by a cronjob, and because of some misconfiguration those pods kept failing, so more and more of them were being spun up.
So I deleted all the jobs and their related pods. That left me with a cluster that has no jobs/pods related to my cronjob, yet I still see the same ERROR messages flooding the logs.
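For reference, the cleanup I did was roughly the following (the cronjob name and the label selector are placeholders for my actual ones):
kubectl -n kube-system delete cronjob <troubling-cronjob>            # stop new jobs from being created
kubectl -n kube-system delete jobs -l run=<troubling-cronjob>        # remove the jobs it had already spawned
kubectl -n kube-system get pods | grep <troubling-cronjob>           # confirm no related pods are left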
I did:
1) Restart docker and kubelet on all nodes.
2) Restart the entire control plane.
3) Reboot all nodes.
But the logs are still being flooded with the same error messages, even though no such pods are being spun up any more.
So I don't know how to stop kubelet from throwing these errors.
Is there a way for me to reset the network plugin I am using? Or should I do something else?
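For what it's worth, the kind of manual cleanup I have been considering (completely untested; the container ID is just the one from the kubelet log above, and this assumes the weave script is available on the node) is:
docker ps -a | grep fea8c54ca834                      # check whether the stale sandbox (pause) container still exists
docker rm -f <stale-sandbox-container-id>             # remove it by hand if it does
weave reset                                           # wipe Weave's local state and let the weave DaemonSet pod recreate it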
Check if the pod directory exists under /var/lib/kubelet
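For example, on each node the directory names under /var/lib/kubelet/pods are pod UIDs, so a rough check is to compare what is on disk against what the API server still knows about:
ls /var/lib/kubelet/pods/
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}'    # UIDs of pods that still exist; anything left over on disk is stale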
You're on a very old version of Kubernetes; upgrading will fix this issue.
Related
I have deployed several microservices in the cluster and am trying to use a log shipper as a sidecar in one of the services.
When I deploy the microservices, all of them come up, but one service's pod gets stuck in CrashLoopBackOff.
The pod contains 2 containers: one for the service itself and the other for the logshipper sidecar.
The error message from the logshipper is as below:
ERROR (catatonit:6): failed to exec pid1: No such file or directory
and the pod's describe output shows:
Backoff 40s restarting failed container=logshipper
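One check that might narrow this down (pod and image names are placeholders; this assumes the sidecar container is really named logshipper, as in the backoff message): catatonit's "failed to exec pid1" usually means the command it was asked to run does not exist in the image, so compare the configured command with what the image actually contains:
kubectl describe pod <pod-name> | grep -A5 logshipper
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[?(@.name=="logshipper")].command}'
docker run --rm --entrypoint sh <logshipper-image> -c 'ls -l <path-to-the-configured-binary>'    # verify the binary (and its shebang interpreter) exists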
Our application is Docker-based and, being a web service, requires a NAT network to be created on the host machine in order to communicate. It had been working for the last 4 months and suddenly stopped working. I checked and found that the Docker service was stopped. I manually tried restarting the service, but it failed to start. Below is the error in the event log:
Error:
fatal: failed to start deamon: Error initializing network controller: Error creating default network: failed during hnsCallRawResponse: hnsCall failed in Win32: There are no more endpoints available from endpoint mapper. (0x6d9)
Tried the below steps:
Deleted hns.data and restarted the HNS service, then restarted the Docker engine service. The issue persists.
Tried running MOFCOMP. Same issue.
Tried removing Docker and reinstalling it. Doesn't work.
Tried creating the NAT network manually, but got the above-mentioned error.
Can someone help here? What needs to be checked, or what could be the reason for this issue?
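One diagnostic sketch (run from an elevated prompt; the service names are the standard Windows ones): 0x6d9 is the RPC "no more endpoints available from the endpoint mapper" error, and in my experience it often shows up when a service HNS depends on, such as the Windows Firewall service (mpssvc), is stopped or disabled:
sc.exe query hns
sc.exe query mpssvc                      # HNS relies on the Windows Firewall service; make sure it is not disabled
sc.exe config mpssvc start= auto
net start mpssvc
net stop hns
net start hns
net start docker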
I have set up a Jenkins master (on a VM), and it provisions JNLP slaves as Kubernetes pods.
On very rare occasions the pipeline fails with this message:
java.io.IOException: Pipe closed
at java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
at java.io.PipedInputStream.receive(PipedInputStream.java:226)
at java.io.PipedOutputStream.write(PipedOutputStream.java:149)
at java.io.OutputStream.write(OutputStream.java:75)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.setupEnvironmentVariable(ContainerExecDecorator.java:510)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:474)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:333)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
Viewing the Kubernetes logs in Stackdriver, one can see that the pod does manage to connect to the master, e.g.
Handshaking
Agent discovery successful
Trying protocol: JNLP4-Connect
Remote Identity confirmed: <some_hash_here>
Connecting to <jenkins-master-url>:49187
started container
loading plugin ...
but after a while it fails and here are the relevant logs:
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$SlaveDisconnector call
INFO: Disabled slave engine reconnects.
hudson.remoting.jnlp.Main$CuiListener status
Terminated
hudson.remoting.Request$2 run
Failed to send back a reply to the request hudson.remoting.Request$2#336ec321: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#29d0e8b2:JNLP4-connect connection to <jenkins-master-url>/<public-ip-of-jenkins-master>:49187": channel is already closed
"Processing signal 'terminated'"
...
How can I further troubleshoot this random error?
Can you take a look at the Kubernetes pod events in Stackdriver? We had similar behavior with a different CI system (GitLab CI). Our builds were also failing randomly. It turned out that the JVM inside the container exceeded its memory limit and was killed by Kubernetes (OOMKilled), and the CI system recognised this as a build error.
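If it is the same thing here, the OOM kill should be visible on the pod itself; something like this (pod name is a placeholder) would confirm it:
kubectl describe pod <jnlp-agent-pod> | grep -i -A5 'last state'
kubectl get pod <jnlp-agent-pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'    # prints OOMKilled if the container hit its memory limit
If it is OOMKilled, raising the container's memory limit in the pod template, or capping the JVM with -Xmx below that limit, is the usual fix.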
I am testing an OpenShift v3 starter cluster (ca-central-1) and created a project from a custom Docker image stream (from GitHub). It was running fine, but after I changed a config map, scaled the deployment down to 0 pods, and scaled it back up to 1 pod, OpenShift can no longer start any pods.
The error in the web interface (in the Events tab) is:
Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container
for pod "hass-19-98vws": Error response from daemon: grpc: the connection is unavailable.
Pod sandbox changed, it will be killed and re-created.
These messages appear in an endless loop. I tried creating a new deployment, but it gives the same logs.
What am I doing wrong?
OK, it seems that I was affected by a cluster upgrade. The issue resolved itself after 2 days.
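For anyone hitting the same loop: a quick way to see whether it is a node/infrastructure problem rather than something in your deployment (the project name is a placeholder; on the hosted starter tier you usually cannot inspect the nodes themselves) is:
oc get events -n <project> --sort-by='.lastTimestamp'
oc describe pod hass-19-98vws            # the Events section at the bottom shows whether sandbox creation keeps failing on the node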
I have a dev Kubernetes cluster set up with a minion running kube-proxy and kubelet. Both only start if they can connect to the master's apiserver, which they can. However, I am getting
error updating node status, will retry: error getting node "10.211.55.126": minion "10.211.55.126" not found
I notice that prior to that I get this: Server rejected event '&api.Event, followed by a large JSON object with mostly empty string values.
This happens repeatedly when I try running the minion's kubelet. I have it pointing to a private IP, and it is reporting that it can't find the public IP. I imagine this is an etcd issue, but I'm not sure; maybe it's flanneld?
Update 1
I managed to get past the initial error by registering the minion (node?) with the master. This allows it to receive pods from the master and run the containers; however, the minion is still not fully connected, which results in the master continuously pushing more pods to the minion. The kubelet process is reporting: Cannot get host IP: Host IP unknown; known addresses: []. Is there a flag I can run kubelet with to give it the host IP?
Currently, I have to register the minion manually before spinning up the minion instance, because there is an open issue right now that prevents the minion from self-registering in certain cases.
UPDATE
Now I'm using kube-register to register each minion/node when the kubelet service starts.
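(In case it helps anyone else: the flag I was looking for seems to be --hostname-override, though flag names have changed between kubelet versions, so check kubelet --help on yours. Something like:
kubelet --hostname-override=10.211.55.126 --api-servers=http://<master-ip>:8080    # register/report under that address instead of whatever kubelet autodetects
The --api-servers value here is just a placeholder for however your kubelet already reaches the master.)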