I have a k8s cluster with 1 master and 5 nodes. I am setting up EFK following this reference: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-elasticsearch-fluentd-and-kibana-efk-logging-stack-on-kubernetes#step-4-%E2%80%94-creating-the-fluentd-daemonset
While creating the Fluentd DaemonSet, 1 out of 5 fluentd pods is in the ImagePullBackOff state:
kubectl get all -n kube-logging -o wide    Tue Apr 21 03:49:26 2020
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS   IMAGES                                                                 SELECTOR
ds/fluentd   5         5         4       5            4           <none>          1d    fluentd      fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1   app=fluentd
NAME READY STATUS RESTARTS AGE IP NODE
po/fluentd-82h6k 1/1 Running 1 1d 100.96.15.56 ip-172-20-52-52.us-west-1.compute.internal
po/fluentd-8ghjq 0/1 ImagePullBackOff 0 17h 100.96.10.170 ip-172-20-58-72.us-west-1.compute.internal
po/fluentd-fdmc8 1/1 Running 1 1d 100.96.3.73 ip-172-20-63-147.us-west-1.compute.internal
po/fluentd-g7755 1/1 Running 1 1d 100.96.2.22 ip-172-20-60-101.us-west-1.compute.internal
po/fluentd-gj8q8 1/1 Running 1 1d 100.96.16.17 ip-172-20-57-232.us-west-1.compute.internal
admin@ip-172-20-58-79:~$ kubectl describe po/fluentd-8ghjq -n kube-logging
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal BackOff 12m (x4364 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Back-off pulling image "fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1"
Warning FailedSync 2m (x4612 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Error syncing pod
Kubelet logs on the node which is failing to run Fluentd:
admin@ip-172-20-58-72:~$ journalctl -u kubelet -f
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095334 755 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095369 755 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: W0421 03:53:53.095440 755 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Apr 21 03:53:54 ip-172-20-58-72 kubelet[755]: I0421 03:53:54.882213 755 server.go:779] GET /metrics/cadvisor: (50.308555ms) 200 [[Prometheus/2.12.0] 172.20.58.79:54492]
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: I0421 03:53:55.452951 755 kuberuntime_manager.go:500] Container {Name:fluentd Image:fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:FLUENT_ELASTICSEARCH_HOST Value:vpc-cog-01-es-dtpgkfi.ap-southeast-1.es.amazonaws.com ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_PORT Value:443 ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_SCHEME Value:https ValueFrom:nil} {Name:FLUENTD_SYSTEMD_CONF Value:disable ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:varlog ReadOnly:false MountPath:/var/log SubPath: MountPropagation:<nil>} {Name:varlibdockercontainers ReadOnly:true MountPath:/var/lib/docker/containers SubPath: MountPropagation:<nil>} {Name:fluentd-token-k8fnp ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: E0421 03:53:55.455327 755 pod_workers.go:182] Error syncing pod aa65dd30-82f2-11ea-a005-0607d7cb72ed ("fluentd-8ghjq_kube-logging(aa65dd30-82f2-11ea-a005-0607d7cb72ed)"), skipping: failed to "StartContainer" for "fluentd" with ImagePullBackOff: "Back-off pulling image \"fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1\""
Kubelet logs on the node which is running Fluentd successfully:
admin@ip-172-20-63-147:~$ journalctl -u kubelet -f
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874293 1272 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874336 1272 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: W0421 04:09:25.874453 1272 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
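One way to get past the generic back-off message is to try the pull by hand on the failing node; a quick sketch (hostname taken from the output above; the underlying error, such as registry auth, DNS, proxy, or disk, should show up in the pull output):

# SSH to the failing node and try the pull manually to surface the real error,
# since the kubelet events only report the back-off itself:
ssh admin@ip-172-20-58-72.us-west-1.compute.internal
docker pull fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1

# Disk pressure on the image filesystem also breaks pulls:
df -h /var/lib/docker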
I am trying to run a local development Kubernetes cluster in the Docker Desktop context, but it just keeps getting the following taint: node.kubernetes.io/not-ready:NoSchedule.
Manually removing the taint, i.e. kubectl taint nodes --all node.kubernetes.io/not-ready-, doesn't help, because it comes back right away.
The output of kubectl describe node is:
Name: docker-desktop
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=docker-desktop
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 07 May 2021 11:00:31 +0100
Taints: node.kubernetes.io/not-ready:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: docker-desktop
AcquireTime: <unset>
RenewTime: Fri, 07 May 2021 16:14:19 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 07 May 2021 16:14:05 +0100 Fri, 07 May 2021 11:00:31 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 07 May 2021 16:14:05 +0100 Fri, 07 May 2021 11:00:31 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 07 May 2021 16:14:05 +0100 Fri, 07 May 2021 11:00:31 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 07 May 2021 16:14:05 +0100 Fri, 07 May 2021 16:11:05 +0100 KubeletNotReady PLEG is not healthy: pleg was last seen active 6m22.485400578s ago; threshold is 3m0s
Addresses:
InternalIP: 192.168.65.4
Hostname: docker-desktop
Capacity:
cpu: 5
ephemeral-storage: 61255492Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 18954344Ki
pods: 110
Allocatable:
cpu: 5
ephemeral-storage: 56453061334
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 18851944Ki
pods: 110
System Info:
Machine ID: f4da8f67-6e48-47f4-94f7-0a827259b845
System UUID: d07e4b6a-0000-0000-b65f-2398524d39c2
Boot ID: 431e1681-fdef-43db-9924-cb019ff53848
Kernel Version: 5.10.25-linuxkit
OS Image: Docker Desktop
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.6
Kubelet Version: v1.19.7
Kube-Proxy Version: v1.19.7
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1160m (23%) 1260m (25%)
memory 1301775360 (6%) 13288969216 (68%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotReady 86m (x2 over 90m) kubelet Node docker-desktop status is now: NodeNotReady
Normal NodeReady 85m (x3 over 5h13m) kubelet Node docker-desktop status is now: NodeReady
Normal Starting 61m kubelet Starting kubelet.
Normal NodeAllocatableEnforced 61m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 61m (x8 over 61m) kubelet Node docker-desktop status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 61m (x7 over 61m) kubelet Node docker-desktop status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 61m (x8 over 61m) kubelet Node docker-desktop status is now: NodeHasSufficientPID
Normal Starting 60m kube-proxy Starting kube-proxy.
Normal NodeNotReady 55m kubelet Node docker-desktop status is now: NodeNotReady
Normal Starting 49m kubelet Starting kubelet.
Normal NodeAllocatableEnforced 49m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientPID 49m (x7 over 49m) kubelet Node docker-desktop status is now: NodeHasSufficientPID
Normal NodeHasSufficientMemory 49m (x8 over 49m) kubelet Node docker-desktop status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 49m (x8 over 49m) kubelet Node docker-desktop status is now: NodeHasNoDiskPressure
Normal Starting 48m kube-proxy Starting kube-proxy.
Normal NodeNotReady 41m kubelet Node docker-desktop status is now: NodeNotReady
Normal Starting 37m kubelet Starting kubelet.
Normal NodeAllocatableEnforced 37m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientPID 37m (x7 over 37m) kubelet Node docker-desktop status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 37m (x8 over 37m) kubelet Node docker-desktop status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 37m (x8 over 37m) kubelet Node docker-desktop status is now: NodeHasSufficientMemory
Normal Starting 36m kube-proxy Starting kube-proxy.
Normal NodeAllocatableEnforced 21m kubelet Updated Node Allocatable limit across pods
Normal Starting 21m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 21m (x8 over 21m) kubelet Node docker-desktop status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 21m (x7 over 21m) kubelet Node docker-desktop status is now: NodeHasSufficientPID
Normal NodeHasNoDiskPressure 21m (x8 over 21m) kubelet Node docker-desktop status is now: NodeHasNoDiskPressure
Normal Starting 21m kube-proxy Starting kube-proxy.
Normal NodeReady 6m16s (x2 over 14m) kubelet Node docker-desktop status is now: NodeReady
Normal NodeNotReady 3m16s (x3 over 15m) kubelet Node docker-desktop status is now: NodeNotReady
The resources allocated to Docker Desktop are quite significant, because the cluster is large as well:
CPUs: 5
Memory: 18GB
Swap: 1GB
Disk image: 60GB
Machine: Mac Core i7, 32GB RAM, 512GB SSD
I can see that the problem is with PLEG, but I need to understand what caused the Pod Lifecycle Event Generator to report an error: insufficient allocatable node resources, or something else.
Any ideas?
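A few diagnostics that might narrow this down, assuming shell access to the node (on Docker Desktop the kubelet runs inside the embedded VM, so the exact access path varies):

# PLEG periodically relists all containers; if the runtime is too slow to
# answer, the node flaps NotReady. Time the relist-equivalent directly:
time docker ps -a > /dev/null

# Hung containers or very high container counts are common culprits:
docker ps -a | wc -l

# Watch the kubelet for PLEG messages while the node flaps:
journalctl -u kubelet -f | grep -i pleg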
In my case the problem was some extremely resource-hungry pods, so I had to downscale some deployments in order to have a stable environment.
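A sketch of how to find and downscale the offenders (the deployment name below is hypothetical; kubectl top needs metrics-server, and --sort-by needs a reasonably recent kubectl):

# Rank pods by memory usage to spot the resource-hungry ones:
kubectl top pods --all-namespaces --sort-by=memory

# Downscale a hypothetical heavy deployment:
kubectl scale deployment my-heavy-app --replicas=1 -n my-namespace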
I have installed Kubernetes 1000 times, but now it does not work.
I installed kubectl, kubeadm, and kubelet, then ran:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=185.73.114.92
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
but I see that coredns is in the Pending state:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5644d7b6d9-492q4 0/1 Pending 0 13m
kube-system coredns-5644d7b6d9-cvwjg 0/1 Pending 0 13m
kube-system etcd-amghezi 1/1 Running 0 12m
kube-system kube-apiserver-amghezi 1/1 Running 0 12m
kube-system kube-controller-manager-amghezi 1/1 Running 0 12m
kube-system kube-flannel-ds-amd64-fkxnf 1/1 Running 0 12m
kube-system kube-proxy-pspw2 1/1 Running 0 13m
kube-system kube-scheduler-amghezi 1/1 Running 0 12m
Then I get the describe output for coredns:
kubectl describe pods coredns-5644d7b6d9-492q4 -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
I removed the master taint from the node with
kubectl taint nodes amghezi node-role.kubernetes.io/master-
It did not work.
In journalctl -xe I see the message:
docker: network plugin is not ready: cni config uninitialized
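That message means kubelet found nothing in its CNI config directory, so it is worth verifying what flannel actually wrote; a quick check, assuming the conventional default paths:

# kubelet's default CNI locations; flannel should have dropped a config here:
ls -l /etc/cni/net.d/
ls /opt/cni/bin/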
service docker status
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2019-09-22 17:29:45 CEST; 34min ago
Docs: https://docs.docker.com
Main PID: 987 (dockerd)
Tasks: 20
CGroup: /system.slice/docker.service
└─987 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Sep 22 17:29:45 ubuntu systemd[1]: Started Docker Application Container Engine.
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.728818467+02:00" level=info msg="API listen on /var/run/docker.sock"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.757401709+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.786776798+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.296798944+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.364459982+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:06 ubuntu dockerd[987]: time="2019-09-22T17:30:06.996299645+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633452599+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633831003+02:00" level=warning msg="d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e cleanup: failed to unmount IPC: umount /var/lib/docker/containers/d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e/mounts/shm, flags: 0x2
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.903058543+02:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
And let us see the kubelet status:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Based on the given outputs, I assume the problem comes from the kubelet node agent, since kubelet depends on a CNI network plugin being installed. To configure Pod networking automatically, kubelet invokes the CNI plugin before each Pod creation to set up the pod's network interface. Furthermore, the CoreDNS discovery service relies on the overlay container network being reachable from all cluster nodes.
Although you've used the Flannel CNI provider and the flannel Pod is up and running, kubelet can't create the container interface for the CoreDNS Pods because of the missing CNI configuration, so I would recommend resetting the kubeadm cluster and purging the leftover component folders:
$ sudo kubeadm reset
$ sudo systemctl stop docker && sudo systemctl stop kubelet
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf .kube/
$ sudo rm -rf /var/lib/kubelet/
$ sudo rm -rf /var/lib/cni/
$ sudo rm -rf /etc/cni/
$ sudo rm -rf /var/lib/etcd/
Bootstrap K8s cluster via kubeadm:
$ sudo systemctl start docker && sudo systemctl start kubelet
$ sudo kubeadm init ...
Then remove the node-role.kubernetes.io/master taint and apply the Flannel addon:
$ kubectl taint nodes --all node-role.kubernetes.io/master-
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
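After bootstrapping, a quick sanity check (a sketch) that the CNI config now exists and that coredns leaves Pending:

# Confirm flannel wrote its CNI config and coredns gets scheduled:
ls /etc/cni/net.d/
kubectl get pods -n kube-system -o wide -w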
You might also find some useful steps in the kubeadm troubleshooting guide in the official K8s documentation.
I have created a k8s cluster on RHEL7 with Kubernetes packages GitVersion:"v1.8.1". I'm trying to deploy WordPress on my custom cluster, but pod creation is always stuck in the ContainerCreating state.
[phani@k8s-master]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default wordpress-766d75457d-zlvdn 0/1 ContainerCreating 0 11m
kube-system etcd-k8s-master 1/1 Running 0 1h
kube-system kube-apiserver-k8s-master 1/1 Running 0 1h
kube-system kube-controller-manager-k8s-master 1/1 Running 0 1h
kube-system kube-dns-545bc4bfd4-bb8js 3/3 Running 0 1h
kube-system kube-proxy-bf4zr 1/1 Running 0 1h
kube-system kube-proxy-d7zvg 1/1 Running 0 34m
kube-system kube-scheduler-k8s-master 1/1 Running 0 1h
kube-system weave-net-92zf9 2/2 Running 0 34m
kube-system weave-net-sh7qk 2/2 Running 0 1h
Docker version: 1.13.1
Pod status from the describe command:
Normal Scheduled 18m default-scheduler Successfully assigned wordpress-766d75457d-zlvdn to worker1
Normal SuccessfulMountVolume 18m kubelet, worker1 MountVolume.SetUp succeeded for volume "default-token-tmpcm"
Warning DNSSearchForming 18m kubelet, worker1 Search Line limits were exceeded, some dns names have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local
Warning FailedCreatePodSandBox 14m kubelet, worker1 Failed create pod sandbox.
Warning FailedSync 25s (x8 over 14m) kubelet, worker1 Error syncing pod
Normal SandboxChanged 24s (x8 over 14m) kubelet, worker1 Pod sandbox changed, it will be killed and re-created.
From the kubelet log I observed the below error on the worker:
error: failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
But kubelet is stable, and no problems are seen on the worker.
How do I solve this problem?
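For reference, the usual fix for the cgroup driver mismatch quoted above is to make the two sides agree; a sketch, assuming the conventional kubeadm drop-in path:

# Check what each side currently uses (here: kubelet=cgroupfs, docker=systemd):
docker info 2>/dev/null | grep -i 'cgroup driver'

# Option A: point kubelet at systemd to match Docker, e.g. by setting
# --cgroup-driver=systemd in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf,
# then reload and restart:
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Option B: switch Docker to cgroupfs instead, via /etc/docker/daemon.json:
#   { "exec-opts": ["native.cgroupdriver=cgroupfs"] }
sudo systemctl restart docker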
I checked for a CNI failure, but I couldn't find anything:
~]# ls /opt/cni/bin
bridge cnitool dhcp flannel host-local ipvlan loopback macvlan noop ptp tuning weave-ipam weave-net weave-plugin-2.3.0
In the journal logs the messages below appear repeatedly; it seems like the kubelet is trying to recreate the container all the time.
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421184 14339 remote_runtime.go:115] StopPodSandbox "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421212 14339 kuberuntime_manager.go:780] Failed to stop sandbox {"docker" "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304"}
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421247 14339 kuberuntime_manager.go:580] killPodWithSyncResult failed: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421262 14339 pod_workers.go:182] Error syncing pod 7f1c6bf1-6af3-11e8-856b-fa163e3d1891 ("wordpress-766d75457d-spdrb_default(7f1c6bf1-6af3-11e8-856b-fa163e3d1891)"), skipping: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Failed create pod sandbox.
... is almost always a CNI failure; I would check on the node that all the weave containers are happy and that /opt/cni/bin is present (or its weave equivalent).
You may have to check both journalctl -u kubelet.service and the docker logs of any running containers to discover the full scope of the error on the node.
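Putting that advice into commands, roughly (pod name taken from the listing above):

# Are the weave pods on the affected node healthy?
kubectl -n kube-system get pods -o wide | grep weave
kubectl -n kube-system logs weave-net-92zf9 -c weave

# Is the CNI plumbing present on the node?
ls /opt/cni/bin /etc/cni/net.d

# Correlate kubelet-side and docker-side errors:
journalctl -u kubelet.service --since "10 min ago" | grep -i sandbox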
It seems to work after removing $KUBELET_NETWORK_ARGS in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf.
I removed $KUBELET_NETWORK_ARGS, restarted the worker node, and then the pods got deployed successfully.
As Matthew said, it's most likely a CNI failure.
First, find the node this pod is running on:
kubectl get po wordpress-766d75457d-zlvdn -o wide
Next, on the node where the pod is located, check /etc/cni/net.d; if you have more than one .conf there, you can delete one and restart the node.
Source: https://github.com/kubernetes/kubeadm/issues/578.
Note that this is only one of the possible solutions.
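A sketch of that check (the extra config file name below is hypothetical):

# On the node hosting the pod:
ls -l /etc/cni/net.d/

# If more than one config is present, the first one (lexicographically) wins;
# move the unwanted one aside and restart kubelet:
sudo mv /etc/cni/net.d/10-extra.conf /root/cni-backup/   # hypothetical file
sudo systemctl restart kubelet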
While hopefully it's no one else's problem: for me, this happened when part of my filesystem was full.
I had pods stuck in ContainerCreating on only one node in my cluster. I also had a bunch of pods which I expected to shut down, but which hadn't. Someone recommended running
sudo systemctl status kubelet -l
which showed me a bunch of lines like
Jun 18 23:19:56 worker01 kubelet[1718]: E0618 23:19:56.461378 1718 kuberuntime_manager.go:647] createPodSandbox for pod "REDACTED(2c681b9c-cf5b-11eb-9c79-52540077cc53)" failed: mkdir /var/log/pods/2c681b9c-cf5b-11eb-9c79-52540077cc53: no space left on device
I confirmed that I was out of space with
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 189G 0 189G 0% /dev
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/mapper/vg01-root 20G 7.0G 14G 35% /
/dev/mapper/vg01-tmp 4.0G 34M 4.0G 1% /tmp
/dev/mapper/vg01-home 4.0G 72M 4.0G 2% /home
/dev/mapper/vg01-varlog 10G 10G 20K 100% /var/log
/dev/mapper/vg01-varlogaudit 2.0G 68M 2.0G 4% /var/log/audit
I just had to clear out that directory (and do some manual cleanup of all the pending pods and the pods that were stuck running).
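For the cleanup itself, something along these lines, with the usual caution around live log files:

# Find what is filling /var/log:
sudo du -xh /var/log | sort -h | tail -20

# Truncating (rather than deleting) keeps the file handles of running
# processes valid; the pod log layout varies by Kubernetes version:
sudo truncate -s 0 /var/log/pods/*/*.log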
Following this guide, I'm trying to start minikube and forward a port at boot time.
My script:
#!/bin/bash
set -eux
export PATH=/usr/local/bin:$PATH
minikube status || minikube start
minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
kubectl port-forward --namespace kube-system $POD 5000:5000
Everything works fine except that kubectl port-forward says the pod does not exist on the first run:
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
error: error upgrading connection: unable to upgrade connection: pod does not exist
If I re-run:
+ minikube status
minikube: Running
cluster: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.100
+ minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
127.0.0.1 docker.for.mac.localhost
+ minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
+ minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
wait-for-it.sh: waiting 15 seconds for 127.0.1.1:10250
wait-for-it.sh: 127.0.1.1:10250 is available after 0 seconds
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
Forwarding from [::1]:5000 -> 5000
I added a debug line before forwarding:
kubectl describe pod --namespace kube-system $POD
and saw this:
+ POD=kube-registry-v0-qr2ml
+ kubectl describe pod --namespace kube-system kube-registry-v0-qr2ml
Name: kube-registry-v0-qr2ml
Namespace: kube-system
Node: minikube/192.168.99.100
Start Time: Thu, 28 Dec 2017 10:00:00 +0700
Labels: k8s-app=kube-registry
version=v0
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"kube-system","name":"kube-registry-v0","uid":"317ecc42-eb7b-11e7-a8ce-...
Status: Running
IP: 172.17.0.6
Controllers: ReplicationController/kube-registry-v0
Containers:
registry:
Container ID: docker://6e8f3f33399605758354f3f546996067d834459781235d51eef3ffa9c6589947
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:22:44 +0700
Why does kubectl say that the pod does not exist?
Update (Fri Dec 29 04:58:06 +07 2017):
Looking carefully at the events, I found something:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "image-store"
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-fs7kr"
20m 20m 1 kubelet, minikube Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Pulled Container image "registry:2.5.1" already present on machine
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Created Created container
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Started Started container
Pod sandbox changed, it will be killed and re-created.
Before:
Containers:
registry:
Container ID: docker://47c510dce00c6c2c29c9fe69665e1241c457d0666174a7723062c534e7229c58
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:47:02 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:22:44 +0700
Finished: Thu, 28 Dec 2017 13:45:18 +0700
Ready: True
Restart Count: 14
After:
Containers:
registry:
Container ID: docker://3a7da784d3d596796111348757725f5af22b47c5edd0fc29a4ffbb84f3f08956
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 19:03:04 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:47:02 +0700
Finished: Thu, 28 Dec 2017 19:00:48 +0700
Ready: True
Restart Count: 15
minikube logs:
Dec 28 22:15:41 minikube localkube[3250]: W1228 22:15:41.102038 3250 docker_sandbox.go:343] failed to read pod IP from plugin/docker: Couldn't find network status for kube-system/kube-registry-v0-qr2ml through plugin: invalid network status for
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
Be aware that using a selector is almost certainly better than using text utilities, especially with "unstructured" output from kubectl. I don't know of any promises they make about the format of the default output, which is why --output=json and friends exist. However, in your case, when you just want the name, there is a special --output=name which does what it says, with the mild caveat that the resource prefix will be in front of the name (pods/kube-registry-v0-qr2ml in your case).
Separately, I see that you have "wait-for-it," but just because a port is accepting connections doesn't mean the Pod is Ready. You'll actually want to use --output=json (or more awk scripts, I guess) to ensure the Pod is both Running and Ready, with the latter status reached when kubernetes and the Pod agree that everything is cool.
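A sketch of that Running-and-Ready gate in shell, querying the pod's conditions with jsonpath (name matching reused from the script above):

# Resolve the pod by name, then block until its Ready condition is True:
POD=$(kubectl get po --namespace kube-system --output=name | grep kube-registry-v0)
until [ "$(kubectl --namespace kube-system get "$POD" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
  sleep 2
done
kubectl --namespace kube-system port-forward "${POD#*/}" 5000:5000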
I suspect, but would have to experiment to know for sure, that the error message is just misleading; it isn't truly that kubernetes doesn't know anything about your Pod, but merely that it couldn't port-forward to it in the state it's in.
You may also experience better success by creating a Service of type: NodePort and then talking to the Node's IP on the allocated port; that side-steps this kubectl-shell mess entirely, but does not side-step the Ready part: only Pods in the Ready state will receive traffic from a Service.
As a minor, pedantic note, --namespace is an argument to kubectl, and not to port-forward, so the most correct invocation is kubectl --namespace=kube-system port-forward kube-registry-v0-qr2ml 5000:5000 to ensure the argument isn't mis-parsed
Question
What the kubelet (Kubernetes 1.8.3 on CentOS 7) error message below actually means, and how to resolve it.
Nov 19 22:32:24 master kubelet[4425]: E1119 22:32:24.269786 4425 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Nov 19 22:32:24 master kubelet[4425]: E1119 22:32:24.269802 4425 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Research
Found the same error and followed the workaround of updating the kubelet service unit as below, but it did not work:
kubelet fails to get cgroup stats for docker and kubelet services
/etc/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/
[Service]
ExecStart=/usr/bin/kubelet --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
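One thing worth double-checking with this workaround: systemd does not pick up unit file edits on its own, so the new flags only apply after a reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart kubelet
# Confirm the flags actually reached the running process:
ps aux | grep [k]ubelet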
Background
Setting up a Kubernetes cluster by following Install kubeadm. The Installing Docker section of the document says the following about aligning the cgroup driver:
Note: Make sure that the cgroup driver used by kubelet is the same as the one used by Docker. To ensure compatibility you can either update Docker, like so:
cat << EOF > /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
But doing so caused the docker service to fail to start with:
unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag...
Nov 19 16:55:56 localhost.localdomain systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE.
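That particular failure usually means the same option is passed both on the dockerd command line and in daemon.json; a sketch of locating the duplicate:

# See which flags the unit already passes to dockerd:
systemctl cat docker.service | grep ExecStart

# Remove the duplicated directive from either the unit drop-in or
# /etc/docker/daemon.json, then:
sudo systemctl daemon-reload && sudo systemctl restart docker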
The master node is in Ready state, with all system pods running.
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-master 1/1 Running 0 39m
kube-system kube-apiserver-master 1/1 Running 0 39m
kube-system kube-controller-manager-master 1/1 Running 0 39m
kube-system kube-dns-545bc4bfd4-mqqqk 3/3 Running 0 40m
kube-system kube-flannel-ds-fclcs 1/1 Running 2 13m
kube-system kube-flannel-ds-hqlnb 1/1 Running 0 39m
kube-system kube-proxy-t7z5w 1/1 Running 0 40m
kube-system kube-proxy-xdw42 1/1 Running 0 13m
kube-system kube-scheduler-master 1/1 Running 0 39m
Environment
Kubernetes 1.8.3 on CentOS with Flannel.
$ kubectl version -o json | python -m json.tool
{
"clientVersion": {
"buildDate": "2017-11-08T18:39:33Z",
"compiler": "gc",
"gitCommit": "f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
"gitTreeState": "clean",
"gitVersion": "v1.8.3",
"goVersion": "go1.8.3",
"major": "1",
"minor": "8",
"platform": "linux/amd64"
},
"serverVersion": {
"buildDate": "2017-11-08T18:27:48Z",
"compiler": "gc",
"gitCommit": "f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
"gitTreeState": "clean",
"gitVersion": "v1.8.3",
"goVersion": "go1.8.3",
"major": "1",
"minor": "8",
"platform": "linux/amd64"
}
}
$ kubectl describe node master
Name: master
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=master
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data={"VtepMAC":"86:b6:7a:d6:7b:b3"}
flannel.alpha.coreos.com/backend-type=vxlan
flannel.alpha.coreos.com/kube-subnet-manager=true
flannel.alpha.coreos.com/public-ip=10.0.2.15
node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: node-role.kubernetes.io/master:NoSchedule
CreationTimestamp: Sun, 19 Nov 2017 22:27:17 +1100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:32:24 +1100 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.99.10
Hostname: master
Capacity:
cpu: 1
memory: 3881880Ki
pods: 110
Allocatable:
cpu: 1
memory: 3779480Ki
pods: 110
System Info:
Machine ID: ca0a351004604dd49e43f8a6258ddd77
System UUID: CA0A3510-0460-4DD4-9E43-F8A6258DDD77
Boot ID: e9060efa-42be-498d-8cb8-8b785b51b247
Kernel Version: 3.10.0-693.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.8.3
Kube-Proxy Version: v1.8.3
PodCIDR: 10.244.0.0/24
ExternalID: master
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system etcd-master 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-apiserver-master 250m (25%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-controller-manager-master 200m (20%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-dns-545bc4bfd4-mqqqk 260m (26%) 0 (0%) 110Mi (2%) 170Mi (4%)
kube-system kube-flannel-ds-hqlnb 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-t7z5w 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-scheduler-master 100m (10%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
810m (81%) 0 (0%) 110Mi (2%) 170Mi (4%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 38m kubelet, master Starting kubelet.
Normal NodeAllocatableEnforced 38m kubelet, master Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 37m (x8 over 38m) kubelet, master Node master status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 37m (x8 over 38m) kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 37m (x7 over 38m) kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal Starting 37m kube-proxy, master Starting kube-proxy.
Normal Starting 32m kubelet, master Starting kubelet.
Normal NodeAllocatableEnforced 32m kubelet, master Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 32m kubelet, master Node master status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 32m kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 32m kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal NodeNotReady 32m kubelet, master Node master status is now: NodeNotReady
Normal NodeReady 32m kubelet, master Node master status is now: NodeReady
The reason for this problem is that the node's Docker version differs from the Docker version Kubernetes needs.
You can directly uninstall Docker and reinstall the specified version of Docker on each node, then restart Docker, and the node will be back online immediately.
The Docker images and pods already installed on the node will not be affected, because the physical folders are still there.
yum remove -y docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-selinux \
docker-engine-selinux \
docker-engine
yum install -y docker-ce-18.09.7 docker-ce-cli-18.09.7 containerd.io
systemctl enable docker
systemctl start docker
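Afterwards, a quick check that the runtime is the expected version and the node rejoins:

docker version --format '{{.Server.Version}}'   # expect 18.09.7
kubectl get nodes -o wide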
I had exactly the same issue. I added the parameters to ExecStart as mentioned above, but was still getting the same error. Then I did kubeadm reset and systemctl daemon-reload and recreated the cluster. The error seems to be gone. Testing now...