I have installed Kubernetes many times before, but this time it does not work.
I installed kubectl, kubeadm, and kubelet, then ran:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=185.73.114.92
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
but the CoreDNS pods stay in the Pending state:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5644d7b6d9-492q4 0/1 Pending 0 13m
kube-system coredns-5644d7b6d9-cvwjg 0/1 Pending 0 13m
kube-system etcd-amghezi 1/1 Running 0 12m
kube-system kube-apiserver-amghezi 1/1 Running 0 12m
kube-system kube-controller-manager-amghezi 1/1 Running 0 12m
kube-system kube-flannel-ds-amd64-fkxnf 1/1 Running 0 12m
kube-system kube-proxy-pspw2 1/1 Running 0 13m
kube-system kube-scheduler-amghezi 1/1 Running 0 12m
Then I described one of the CoreDNS pods:
kubectl describe pods coredns-5644d7b6d9-492q4 -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
I removed the master taint from the node with
kubectl taint nodes amghezi node-role.kubernetes.io/master-
but that did not help.
In the journal I see:
journalctl -xe
message:docker: network plugin is not ready: cni config uninitialized
service docker status
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2019-09-22 17:29:45 CEST; 34min ago
Docs: https://docs.docker.com
Main PID: 987 (dockerd)
Tasks: 20
CGroup: /system.slice/docker.service
└─987 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Sep 22 17:29:45 ubuntu systemd[1]: Started Docker Application Container Engine.
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.728818467+02:00" level=info msg="API listen on /var/run/docker.sock"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.757401709+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.786776798+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.296798944+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.364459982+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:06 ubuntu dockerd[987]: time="2019-09-22T17:30:06.996299645+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633452599+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633831003+02:00" level=warning msg="d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e cleanup: failed to unmount IPC: umount /var/lib/docker/containers/d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e/mounts/shm, flags: 0x2
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.903058543+02:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
and the kubelet status shows:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Judging by the output, the problem is on the kubelet side, since the kubelet depends on a CNI network plugin being installed. Before it creates a Pod, the kubelet calls the CNI plugin to set up the Pod's network interface, and the CoreDNS discovery service also relies on the overlay container network to be reachable from all cluster nodes.
Although you have applied Flannel and the Flannel Pod is up and running, the kubelet still reports that the CNI configuration is uninitialized, so it cannot create the network interface for the CoreDNS Pods. I would recommend resetting the kubeadm cluster and purging the leftover component directories:
$ sudo kubeadm reset
$ sudo systemctl stop docker && sudo systemctl stop kubelet
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf .kube/
$ sudo rm -rf /var/lib/kubelet/
$ sudo rm -rf /var/lib/cni/
$ sudo rm -rf /etc/cni/
$ sudo rm -rf /var/lib/etcd/
Then bootstrap the K8s cluster again via kubeadm:
$ sudo systemctl start docker && sudo systemctl start kubelet
$ sudo kubeadm init ...
Next, remove the node-role.kubernetes.io/master taint and apply the Flannel add-on:
$ kubectl taint nodes --all node-role.kubernetes.io/master-
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
You may also find useful information in the kubeadm troubleshooting guide in the official K8s documentation.
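To see which taint is actually keeping CoreDNS unscheduled, it can help to list the taints still set on the node. This is a minimal sketch, assuming kubectl is configured for the cluster; the helper name node_taint_keys is made up for illustration.

```shell
# Hypothetical helper: print the taint keys set on a node, one per line.
# Assumes kubectl is on PATH and pointed at the cluster.
node_taint_keys() {
    node="$1"
    kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}' | tr ' ' '\n'
}

# Usage (node name taken from the question):
#   node_taint_keys amghezi
```

If a key such as node.kubernetes.io/not-ready shows up, the node is probably still waiting for the network plugin, and removing the master taint alone is unlikely to make the pods schedulable.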
When trying to create Pods that can use a GPU, I get the error "exec: "nvidia-smi": executable file not found in $PATH".
To explain the error from the beginning, my main goal was to create JupyterHub environments that can use a GPU. I installed Zero to JupyterHub for Kubernetes. I followed these steps to be able to use the GPU. When I check my nodes, the GPUs seem schedulable by Kubernetes. So far everything seemed fine.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
However, when I logged in to JupyterHub and tried to open the profile using the GPU, I got an error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So, I checked the Pods and I found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, which was the beginning of the error, and I saw that the toolkit-validation container was throwing a CrashLoopBackOff error. Here is the relevant part of the log:
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then, I looked at the logs of the container and I got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, it was suggested to delete the failed Pod and deployment. However, doing these did not fix my problem. Do you have any suggestions?
I have;
Ubuntu 20.04
Kubernetes v1.21.6
Docker 20.10.10
NVIDIA-SMI 470.82.01
CUDA 11.4
CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
GPU: NVIDIA GeForce RTX 2080 Ti
Memory: 13815MiB / 48280MiB
Thanks in advance.
In case you are still having the issue: we just had the same problem on our cluster. The "dirty" fix is to do this:
rm /run/nvidia/driver
ln -s / /run/nvidia/driver
kubectl delete pod -n gpu-operator nvidia-operator-validator-xxxxx
The reason is that the init container of nvidia-operator-validator tries to execute nvidia-smi within a chroot at /run/nvidia/driver, which is a tmpfs (so it does not persist across reboots) and is not populated when the drivers are installed manually.
Hopefully NVIDIA will provide a better fix.
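Before swapping the directory for a symlink, it may be worth confirming that the chroot target really lacks nvidia-smi while the host has it. This is a rough sketch; the helper name root_has_nvidia_smi and the list of bin directories it searches are assumptions for illustration, not something taken from the GPU operator.

```shell
# Hypothetical helper: succeed if an executable nvidia-smi exists under
# the given root in one of the usual bin directories (assumed list).
root_has_nvidia_smi() {
    root="${1%/}"
    for dir in usr/bin usr/local/bin bin; do
        if [ -x "$root/$dir/nvidia-smi" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
#   root_has_nvidia_smi /run/nvidia/driver || echo "driver root is not populated"
#   root_has_nvidia_smi / && echo "host install has nvidia-smi"
```

If the first check fails and the second succeeds, the situation matches the one described above.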
I have a k8s cluster with 1 master and 5 nodes. I am setting up EFK following: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-elasticsearch-fluentd-and-kibana-efk-logging-stack-on-kubernetes#step-4-%E2%80%94-creating-the-fluentd-daemonset
While creating the Fluentd DaemonSet, 1 of the 5 Fluentd pods is stuck in the ImagePullBackOff state:
kubectl get all -n kube-logging -o wide Tue Apr 21 03:49:26 2020
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
ds/fluentd 5 5 4 5 4 <none> 1d fluentd fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1 app=fluentd
ds/fluentd 5 5 4 5 4 <none> 1d fluentd fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1 app=fluentd
NAME READY STATUS RESTARTS AGE IP NODE
po/fluentd-82h6k 1/1 Running 1 1d 100.96.15.56 ip-172-20-52-52.us-west-1.compute.internal
po/fluentd-8ghjq 0/1 ImagePullBackOff 0 17h 100.96.10.170 ip-172-20-58-72.us-west-1.compute.internal
po/fluentd-fdmc8 1/1 Running 1 1d 100.96.3.73 ip-172-20-63-147.us-west-1.compute.internal
po/fluentd-g7755 1/1 Running 1 1d 100.96.2.22 ip-172-20-60-101.us-west-1.compute.internal
po/fluentd-gj8q8 1/1 Running 1 1d 100.96.16.17 ip-172-20-57-232.us-west-1.compute.internal
admin@ip-172-20-58-79:~$ kubectl describe po/fluentd-8ghjq -n kube-logging
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal BackOff 12m (x4364 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Back-off pulling image "fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1"
Warning FailedSync 2m (x4612 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Error syncing pod
Kubelet logs on the node that fails to run Fluentd:
admin@ip-172-20-58-72:~$ journalctl -u kubelet -f
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095334 755 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095369 755 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: W0421 03:53:53.095440 755 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Apr 21 03:53:54 ip-172-20-58-72 kubelet[755]: I0421 03:53:54.882213 755 server.go:779] GET /metrics/cadvisor: (50.308555ms) 200 [[Prometheus/2.12.0] 172.20.58.79:54492]
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: I0421 03:53:55.452951 755 kuberuntime_manager.go:500] Container {Name:fluentd Image:fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:FLUENT_ELASTICSEARCH_HOST Value:vpc-cog-01-es-dtpgkfi.ap-southeast-1.es.amazonaws.com ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_PORT Value:443 ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_SCHEME Value:https ValueFrom:nil} {Name:FLUENTD_SYSTEMD_CONF Value:disable ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:varlog ReadOnly:false MountPath:/var/log SubPath: MountPropagation:<nil>} {Name:varlibdockercontainers ReadOnly:true MountPath:/var/lib/docker/containers SubPath: MountPropagation:<nil>} {Name:fluentd-token-k8fnp ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: E0421 03:53:55.455327 755 pod_workers.go:182] Error syncing pod aa65dd30-82f2-11ea-a005-0607d7cb72ed ("fluentd-8ghjq_kube-logging(aa65dd30-82f2-11ea-a005-0607d7cb72ed)"), skipping: failed to "StartContainer" for "fluentd" with ImagePullBackOff: "Back-off pulling image \"fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1\""
Kubelet logs on a node that runs Fluentd successfully:
admin@ip-172-20-63-147:~$ journalctl -u kubelet -f
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874293 1272 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874336 1272 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: W0421 04:09:25.874453 1272 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
I have created a k8s cluster on RHEL 7 with Kubernetes packages GitVersion:"v1.8.1". I'm trying to deploy WordPress on my custom cluster, but pod creation is always stuck in the ContainerCreating state.
[phani@k8s-master]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default wordpress-766d75457d-zlvdn 0/1 ContainerCreating 0 11m
kube-system etcd-k8s-master 1/1 Running 0 1h
kube-system kube-apiserver-k8s-master 1/1 Running 0 1h
kube-system kube-controller-manager-k8s-master 1/1 Running 0 1h
kube-system kube-dns-545bc4bfd4-bb8js 3/3 Running 0 1h
kube-system kube-proxy-bf4zr 1/1 Running 0 1h
kube-system kube-proxy-d7zvg 1/1 Running 0 34m
kube-system kube-scheduler-k8s-master 1/1 Running 0 1h
kube-system weave-net-92zf9 2/2 Running 0 34m
kube-system weave-net-sh7qk 2/2 Running 0 1h
Docker version: 1.13.1
Pod status from the describe command:
Normal Scheduled 18m default-scheduler Successfully assigned wordpress-766d75457d-zlvdn to worker1
Normal SuccessfulMountVolume 18m kubelet, worker1 MountVolume.SetUp succeeded for volume "default-token-tmpcm"
Warning DNSSearchForming 18m kubelet, worker1 Search Line limits were exceeded, some dns names have been omitted, the applied search line is: default.svc.cluster.local svc.cluster.local cluster.local
Warning FailedCreatePodSandBox 14m kubelet, worker1 Failed create pod sandbox.
Warning FailedSync 25s (x8 over 14m) kubelet, worker1 Error syncing pod
Normal SandboxChanged 24s (x8 over 14m) kubelet, worker1 Pod sandbox changed, it will be killed and re-created.
In the kubelet log on the worker I observed the error below:
error: failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
But the kubelet is stable, and I see no other problems on the worker.
How do I solve this problem?
I checked for a CNI failure but couldn't find anything.
~]# ls /opt/cni/bin
bridge cnitool dhcp flannel host-local ipvlan loopback macvlan noop ptp tuning weave-ipam weave-net weave-plugin-2.3.0
In the journal logs the messages below appear repeatedly. It seems like the scheduler is trying to create the container all the time.
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421184 14339 remote_runtime.go:115] StopPodSandbox "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421212 14339 kuberuntime_manager.go:780] Failed to stop sandbox {"docker" "47da29873230d830f0ee21adfdd3b06ed0c653a0001c29289fe78446d27d2304"}
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421247 14339 kuberuntime_manager.go:580] killPodWithSyncResult failed: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jun 08 11:25:22 worker1 kubelet[14339]: E0608 11:25:22.421262 14339 pod_workers.go:182] Error syncing pod 7f1c6bf1-6af3-11e8-856b-fa163e3d1891 ("wordpress-766d75457d-spdrb_default(7f1c6bf1-6af3-11e8-856b-fa163e3d1891)"), skipping: failed to "KillPodSandbox" for "7f1c6bf1-6af3-11e8-856b-fa163e3d1891" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Failed create pod sandbox.
... is almost always a CNI failure; I would check on the node that all the weave containers are happy, and that /opt/cni/bin is present (or its weave equivalent)
You may have to check both the journalctl -u kubelet.service as well as the docker logs for any containers running to discover the full scope of the error on the node.
It seems to work after removing $KUBELET_NETWORK_ARGS from /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
I removed $KUBELET_NETWORK_ARGS and restarted the worker node, and then the pods were deployed successfully.
As Matthew said it's most likely a CNI failure.
First, find the node this pod is running on:
kubectl get po wordpress-766d75457d-zlvdn -o wide
Next, on the node where the pod is located, check /etc/cni/net.d. If it contains more than one .conf file, you can delete one and restart the node.
Source: https://github.com/kubernetes/kubeadm/issues/578
Note that this is only one of the possible solutions.
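The check of /etc/cni/net.d can be scripted. The sketch below lists the config files in that directory; the helper name cni_config_files is hypothetical, and the set of extensions (.conf, .conflist, .json) is my understanding of what CNI loads, so treat it as an assumption.

```shell
# Hypothetical helper: print the CNI network config files found in a
# directory (default /etc/cni/net.d), one per line, in sorted order.
# Assumed extensions: .conf, .conflist and .json.
cni_config_files() {
    dir="${1:-/etc/cni/net.d}"
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done | sort
}

# Usage:
#   count=$(cni_config_files | wc -l)
#   [ "$count" -gt 1 ] && echo "more than one CNI config found:" && cni_config_files
```

An empty result would point at a missing network add-on rather than a conflicting one.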
While hopefully it's no one else's problem, for me, this happened when part of my filesystem was full.
I had pods stuck in ContainerCreating on only one node in my cluster. I also had a bunch of pods which I expected to shut down, but which hadn't. Someone recommended running
sudo systemctl status kubelet -l
which showed me a bunch of lines like
Jun 18 23:19:56 worker01 kubelet[1718]: E0618 23:19:56.461378 1718 kuberuntime_manager.go:647] createPodSandbox for pod "REDACTED(2c681b9c-cf5b-11eb-9c79-52540077cc53)" failed: mkdir /var/log/pods/2c681b9c-cf5b-11eb-9c79-52540077cc53: no space left on device
I confirmed that I was out of space with
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 189G 0 189G 0% /dev
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/mapper/vg01-root 20G 7.0G 14G 35% /
/dev/mapper/vg01-tmp 4.0G 34M 4.0G 1% /tmp
/dev/mapper/vg01-home 4.0G 72M 4.0G 2% /home
/dev/mapper/vg01-varlog 10G 10G 20K 100% /var/log
/dev/mapper/vg01-varlogaudit 2.0G 68M 2.0G 4% /var/log/audit
I just had to clear out that dir (and did some manual cleanup on all the pending pods and pods that were stuck running).
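To spot this condition quickly on a node, the df output can be filtered for nearly full filesystems. This is a small sketch; the helper name full_filesystems and the default threshold of 90 percent are arbitrary choices, and it assumes mount points without spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount
# points whose usage is at or above the given percentage (default 90).
full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6
    }'
}

# Usage:
#   df -P | full_filesystems 95
```

The -P flag asks df for the POSIX output format, which keeps each filesystem on a single line so the column positions stay stable.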
Following this guide, I'm trying to start minikube and forward a port at boot time.
My script:
#!/bin/bash
set -eux
export PATH=/usr/local/bin:$PATH
minikube status || minikube start
minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
kubectl port-forward --namespace kube-system $POD 5000:5000
Everything works fine, except that on the first run kubectl port-forward says the pod does not exist:
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
error: error upgrading connection: unable to upgrade connection: pod does not exist
If I re-run:
+ minikube status
minikube: Running
cluster: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.100
+ minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
127.0.0.1 docker.for.mac.localhost
+ minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
+ minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
wait-for-it.sh: waiting 15 seconds for 127.0.1.1:10250
wait-for-it.sh: 127.0.1.1:10250 is available after 0 seconds
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
Forwarding from [::1]:5000 -> 5000
I added a debug line before forwarding:
kubectl describe pod --namespace kube-system $POD
and saw this:
+ POD=kube-registry-v0-qr2ml
+ kubectl describe pod --namespace kube-system kube-registry-v0-qr2ml
Name: kube-registry-v0-qr2ml
Namespace: kube-system
Node: minikube/192.168.99.100
Start Time: Thu, 28 Dec 2017 10:00:00 +0700
Labels: k8s-app=kube-registry
version=v0
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"kube-system","name":"kube-registry-v0","uid":"317ecc42-eb7b-11e7-a8ce-...
Status: Running
IP: 172.17.0.6
Controllers: ReplicationController/kube-registry-v0
Containers:
registry:
Container ID: docker://6e8f3f33399605758354f3f546996067d834459781235d51eef3ffa9c6589947
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:22:44 +0700
Why did kubectl say that it does not exist?
Fri Dec 29 04:58:06 +07 2017
Looking carefully at the events, I found something:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "image-store"
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-fs7kr"
20m 20m 1 kubelet, minikube Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Pulled Container image "registry:2.5.1" already present on machine
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Created Created container
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Started Started container
Pod sandbox changed, it will be killed and re-created.
Before:
Containers:
registry:
Container ID: docker://47c510dce00c6c2c29c9fe69665e1241c457d0666174a7723062c534e7229c58
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:47:02 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:22:44 +0700
Finished: Thu, 28 Dec 2017 13:45:18 +0700
Ready: True
Restart Count: 14
After:
Containers:
registry:
Container ID: docker://3a7da784d3d596796111348757725f5af22b47c5edd0fc29a4ffbb84f3f08956
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 19:03:04 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:47:02 +0700
Finished: Thu, 28 Dec 2017 19:00:48 +0700
Ready: True
Restart Count: 15
minikube logs:
Dec 28 22:15:41 minikube localkube[3250]: W1228 22:15:41.102038
3250 docker_sandbox.go:343] failed to read pod IP from plugin/docker:
Couldn't find network status for kube-system/kube-registry-v0-qr2ml
through plugin: invalid network status for
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
Be aware that using a selector is almost certainly better than using text utilities, especially with "unstructured" output from kubectl. I don't know of any promises they make about the format of the default output, which is why --output=json and friends exist. However, in your case when you just want the name, there is a special --output=name which does what it says, with the mild caveat that the Resource prefix will be in front of the name (pods/kube-registry-v0-qr2ml in your case)
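A sketch of the selector approach, for illustration only: the helper name pod_names_by_label is made up, and the label k8s-app=kube-registry is taken from the describe output in the question.

```shell
# Hypothetical helper: print the names of pods matching a label selector,
# without the resource prefix (e.g. "pods/") that --output=name adds.
pod_names_by_label() {
    namespace="$1"
    selector="$2"
    kubectl --namespace="$namespace" get pods --selector="$selector" --output=name |
        sed 's|^[^/]*/||'
}

# Usage:
#   POD=$(pod_names_by_label kube-system k8s-app=kube-registry | head -n 1)
```

The sed expression only strips everything up to the first slash, so it does not depend on the column layout of the default output.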
Separately, I see that you have "wait-for-it," but just because a port is accepting connections doesn't mean the Pod is Ready. You'll actually want to use --output=json (or more awk scripts, I guess) to ensure the Pod is both Running and Ready, with the latter status reached when kubernetes and the Pod agree that everything is cool.
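One way to test for Ready without parsing JSON by hand is a jsonpath filter on the Pod's conditions. This is a sketch under assumptions: the helper names pod_is_ready and wait_for_pod_ready are hypothetical, and the one-second polling interval is arbitrary.

```shell
# Hypothetical helper: succeed only if the pod's Ready condition is "True".
pod_is_ready() {
    namespace="$1"
    pod="$2"
    status=$(kubectl --namespace="$namespace" get pod "$pod" \
        --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ]
}

# Hypothetical helper: poll once per second until the pod is Ready,
# giving up after the given number of seconds (default 60).
wait_for_pod_ready() {
    namespace="$1"
    pod="$2"
    remaining="${3:-60}"
    until pod_is_ready "$namespace" "$pod"; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Usage:
#   wait_for_pod_ready kube-system "$POD" 120 &&
#       kubectl --namespace=kube-system port-forward "$POD" 5000:5000
```

Newer kubectl releases also provide kubectl wait --for=condition=Ready, which may make the polling loop unnecessary.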
I suspect, but would have to experiment to know for sure, that the error message is just misleading; it isn't truly that kubernetes doesn't know anything about your Pod, but merely that it couldn't port-forward to it in the state it's in.
You may also experience better success by creating a Service of type: NodePort and then talk to the Node's IP on the allocated port; that side-steps this kubectl-shell mess entirely, but does not side-step the Ready part -- only Pods in the Ready state will receive traffic from a Service
As a minor, pedantic note, --namespace is an argument to kubectl, and not to port-forward, so the most correct invocation is kubectl --namespace=kube-system port-forward kube-registry-v0-qr2ml 5000:5000 to ensure the argument isn't mis-parsed
Question
What does this kubelet (1.8.3 on CentOS 7) error message actually mean, and how can it be resolved?
Nov 19 22:32:24 master kubelet[4425]: E1119 22:32:24.269786 4425 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get con
Nov 19 22:32:24 master kubelet[4425]: E1119 22:32:24.269802 4425 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get conta
Research
I found the same error reported and followed the workaround of updating the kubelet service unit as below, but it did not work.
kubelet fails to get cgroup stats for docker and kubelet services
/etc/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/
[Service]
ExecStart=/usr/bin/kubelet --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
Background
I am setting up a Kubernetes cluster by following Install kubeadm. The Installing Docker section of that document says to align the cgroup driver as below.
Note: Make sure that the cgroup driver used by kubelet is the same as the one used by Docker. To ensure compatability you can either update Docker, like so:
cat << EOF > /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
But doing so caused the docker service to fail to start with:
unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag
Nov 19 16:55:56 localhost.localdomain systemd1: docker.service: main process exited, code=exited, status=1/FAILURE.
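To compare the two cgroup drivers without editing anything, one option is to read the flag from the kubelet drop-in and ask Docker what it is using. This is only a sketch: the helper name kubelet_cgroup_driver is made up, and it assumes the kubeadm drop-in at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf carries a --cgroup-driver flag, which may not hold on every installation.

```shell
# Hypothetical helper: extract the value of --cgroup-driver from a kubelet
# systemd drop-in; prints nothing if the flag is absent.
kubelet_cgroup_driver() {
    conf="${1:-/etc/systemd/system/kubelet.service.d/10-kubeadm.conf}"
    sed -n 's/.*--cgroup-driver=\([A-Za-z]*\).*/\1/p' "$conf" | head -n 1
}

# Usage: compare against what Docker reports.
#   docker_driver=$(docker info --format '{{.CgroupDriver}}')
#   [ "$(kubelet_cgroup_driver)" = "$docker_driver" ] || echo "cgroup drivers differ"
```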
The master node is Ready and all system pods are running.
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-master 1/1 Running 0 39m
kube-system kube-apiserver-master 1/1 Running 0 39m
kube-system kube-controller-manager-master 1/1 Running 0 39m
kube-system kube-dns-545bc4bfd4-mqqqk 3/3 Running 0 40m
kube-system kube-flannel-ds-fclcs 1/1 Running 2 13m
kube-system kube-flannel-ds-hqlnb 1/1 Running 0 39m
kube-system kube-proxy-t7z5w 1/1 Running 0 40m
kube-system kube-proxy-xdw42 1/1 Running 0 13m
kube-system kube-scheduler-master 1/1 Running 0 39m
Environment
Kubernetes 1.8.3 on CentOS with Flannel.
$ kubectl version -o json | python -m json.tool
{
"clientVersion": {
"buildDate": "2017-11-08T18:39:33Z",
"compiler": "gc",
"gitCommit": "f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
"gitTreeState": "clean",
"gitVersion": "v1.8.3",
"goVersion": "go1.8.3",
"major": "1",
"minor": "8",
"platform": "linux/amd64"
},
"serverVersion": {
"buildDate": "2017-11-08T18:27:48Z",
"compiler": "gc",
"gitCommit": "f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd",
"gitTreeState": "clean",
"gitVersion": "v1.8.3",
"goVersion": "go1.8.3",
"major": "1",
"minor": "8",
"platform": "linux/amd64"
}
}
$ kubectl describe node master
Name: master
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=master
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data={"VtepMAC":"86:b6:7a:d6:7b:b3"}
flannel.alpha.coreos.com/backend-type=vxlan
flannel.alpha.coreos.com/kube-subnet-manager=true
flannel.alpha.coreos.com/public-ip=10.0.2.15
node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: node-role.kubernetes.io/master:NoSchedule
CreationTimestamp: Sun, 19 Nov 2017 22:27:17 +1100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:27:13 +1100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Sun, 19 Nov 2017 23:04:56 +1100 Sun, 19 Nov 2017 22:32:24 +1100 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.99.10
Hostname: master
Capacity:
cpu: 1
memory: 3881880Ki
pods: 110
Allocatable:
cpu: 1
memory: 3779480Ki
pods: 110
System Info:
Machine ID: ca0a351004604dd49e43f8a6258ddd77
System UUID: CA0A3510-0460-4DD4-9E43-F8A6258DDD77
Boot ID: e9060efa-42be-498d-8cb8-8b785b51b247
Kernel Version: 3.10.0-693.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.8.3
Kube-Proxy Version: v1.8.3
PodCIDR: 10.244.0.0/24
ExternalID: master
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system etcd-master 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-apiserver-master 250m (25%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-controller-manager-master 200m (20%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-dns-545bc4bfd4-mqqqk 260m (26%) 0 (0%) 110Mi (2%) 170Mi (4%)
kube-system kube-flannel-ds-hqlnb 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-t7z5w 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-scheduler-master 100m (10%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
810m (81%) 0 (0%) 110Mi (2%) 170Mi (4%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 38m kubelet, master Starting kubelet.
Normal NodeAllocatableEnforced 38m kubelet, master Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 37m (x8 over 38m) kubelet, master Node master status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 37m (x8 over 38m) kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 37m (x7 over 38m) kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal Starting 37m kube-proxy, master Starting kube-proxy.
Normal Starting 32m kubelet, master Starting kubelet.
Normal NodeAllocatableEnforced 32m kubelet, master Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 32m kubelet, master Node master status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 32m kubelet, master Node master status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 32m kubelet, master Node master status is now: NodeHasNoDiskPressure
Normal NodeNotReady 32m kubelet, master Node master status is now: NodeNotReady
Normal NodeReady 32m kubelet, master Node master status is now: NodeReady
The reason for this problem is that the Docker version on the nodes differs from the Docker version that Kubernetes requires.
You can uninstall Docker, reinstall the required version on each node, and then restart Docker; the node will come back online immediately.
The Docker images and pods already installed will not be affected, because the physical folder is still there.
yum remove -y docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-selinux \
docker-engine-selinux \
docker-engine
yum install -y docker-ce-18.09.7 docker-ce-cli-18.09.7 containerd.io
systemctl enable docker
systemctl start docker
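After the reinstall, the running version can be checked on each node. This is a minimal sketch; the helper name docker_version_matches is hypothetical, and which version prefix to expect depends on what your Kubernetes release supports.

```shell
# Hypothetical helper: succeed if the running Docker server version starts
# with the expected prefix (for example "18.09").
docker_version_matches() {
    expected="$1"
    actual=$(docker version --format '{{.Server.Version}}')
    case "$actual" in
        "$expected"*) return 0 ;;
        *) echo "docker is $actual, expected $expected" >&2; return 1 ;;
    esac
}

# Usage:
#   docker_version_matches 18.09 || echo "reinstall the required Docker version"
```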
I had exactly the same issue. I added the parameters to ExecStart as mentioned above, but still got the same error. Then I ran kubeadm reset and systemctl daemon-reload and recreated the cluster. The error seems to be gone. Testing now...