I successfully installed Kubernetes 1.1.2 on CoreOS alpha (877.1.0) with the service files at this link: https://gist.github.com/thanhson1085/5a005e92245cb2288dee
After that, I wanted to run the SkyDNS add-on for my Kubernetes cluster as service discovery, following this guide: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/dns
But it does not work:
core@coreos-1 ~/projects $ sudo kubectl exec busybox -- nslookup kubernetes.default.cluster.local
Server: 10.100.100.100
Address 1: 10.100.100.100
nslookup: can't resolve 'kubernetes.default.cluster.local'
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1
core@coreos-1 ~/projects $ sudo kubectl exec busybox -- nslookup kubernetes.default
Server: 10.100.100.100
Address 1: 10.100.100.100
nslookup: can't resolve 'kubernetes.default'
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1
core@coreos-1 ~/projects $ sudo kubectl exec busybox -- nslookup kubernetes.default.svc.cluster.local
Server: 10.100.100.100
Address 1: 10.100.100.100
nslookup: can't resolve 'kubernetes.default.svc.cluster.local'
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1
It works if I look up a public domain:
core@coreos-1 ~/projects $ sudo kubectl exec busybox -- nslookup google.com
Server: 10.100.100.100
Address 1: 10.100.100.100
Name: google.com
Address 1: 2404:6800:4003:c02::64 sc-in-x64.1e100.net
Address 2: 74.125.200.138 sa-in-f138.1e100.net
Address 3: 74.125.200.139 sa-in-f139.1e100.net
Address 4: 74.125.200.100 sa-in-f100.1e100.net
Address 5: 74.125.200.113 sa-in-f113.1e100.net
Address 6: 74.125.200.102 sa-in-f102.1e100.net
Address 7: 74.125.200.101 sa-in-f101.1e100.net
LOGS:
core@coreos-1 ~/projects $ sudo docker logs k8s_skydns.ed0ae89c_kube-dns-v9-a5afs_kube-system_b45f88f2-9684-11e5-9f5a-04018a53d701_32d98a24
2015/11/29 10:47:00 skydns: falling back to default configuration, could not read from etcd: 100: Key not found (/skydns) [2]
2015/11/29 10:47:00 skydns: ready for queries on cluster.local. for tcp://0.0.0.0:53 [rcache 0]
2015/11/29 10:47:00 skydns: ready for queries on cluster.local. for udp://0.0.0.0:53 [rcache 0]
kube2sky logs:
Failed to list *api.Service: couldn't get version/kind; json parse error: invalid character '<' looking for beginning of value
Failed to list *api.Endpoints: couldn't get version/kind; json parse error: invalid character '<' looking for beginning of value
And below is the information about my installation:
core@coreos-1 ~/projects $ sudo kubectl get endpoints -a --all-namespaces
NAMESPACE NAME ENDPOINTS AGE
default frontend 10.100.21.7:80,10.100.21.9:80 1d
default kubernetes 128.199.134.19:6443 1d
default redis-master 10.100.21.2:6379 1d
default redis-slave 10.100.21.5:6379,10.100.21.6:6379 1d
kube-system kube-dns 51m
core@coreos-1 ~/projects $ sudo kubectl get pods -a --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default busybox 1/1 Running 0 51m
default frontend-1w89q 1/1 Running 0 1d
default frontend-w44qx 1/1 Running 0 1d
default redis-master-qde8g 1/1 Running 0 1d
default redis-slave-3m0t3 1/1 Running 0 1d
default redis-slave-m6fc8 1/1 Running 0 1d
kube-system kube-dns-v9-zrd22 3/4 Running 96 52m
core@coreos-1 ~/projects $ sudo kubectl get services -a --all-namespaces
NAMESPACE NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
default frontend 10.100.192.188 nodes 80/TCP name=frontend 1d
default kubernetes 10.100.0.1 <none> 443/TCP <none> 1d
default redis-master 10.100.180.191 <none> 6379/TCP name=redis-master 1d
default redis-slave 10.100.146.91 <none> 6379/TCP name=redis-slave 1d
kube-system kube-dns 10.100.100.100 <none> 53/UDP,53/TCP k8s-app=kube-dns 52m
core@coreos-1 ~/projects $ sudo kubectl get rc -a --all-namespaces
NAMESPACE CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS AGE
default frontend php-redis gcr.io/google_samples/gb-frontend:v3 name=frontend 2 1d
default redis-master master redis name=redis-master 1 1d
default redis-slave worker gcr.io/google_samples/gb-redisslave:v1 name=redis-slave 2 1d
kube-system kube-dns-v9 etcd gcr.io/google_containers/etcd:2.0.9 k8s-app=kube-dns,version=v9 1 52m
kube2sky gcr.io/google_containers/kube2sky:1.11
skydns gcr.io/google_containers/skydns:2015-10-13-8c72f8c
healthz gcr.io/google_containers/exechealthz:1.0
See the kubelet service status:
core@coreos-1 ~ $ sudo systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2015-11-28 07:17:28 UTC; 52min ago
Main PID: 27007 (kubelet)
Memory: 2.9M
CPU: 944ms
CGroup: /system.slice/kubelet.service
├─27007 /opt/bin/kubelet --port=10250 --cluster-dns=10.100.100.100 --cluster-domain=cluster.local --hostname-override=coreos-1 --api-servers=http://127.0.0.1:8080 --logtostderr=true
└─27018 journalctl -k -f
Nov 28 08:09:26 coreos-1 kubelet[27007]: E1128 08:09:26.690587 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:27 coreos-1 kubelet[27007]: E1128 08:09:27.689255 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:28 coreos-1 kubelet[27007]: E1128 08:09:28.696916 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:29 coreos-1 kubelet[27007]: E1128 08:09:29.697705 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:30 coreos-1 kubelet[27007]: E1128 08:09:30.691816 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:31 coreos-1 kubelet[27007]: E1128 08:09:31.684655 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:32 coreos-1 kubelet[27007]: I1128 08:09:32.008987 27007 manager.go:1769] pod "kube-dns-v9-zrd22_kube-system" container "skydns" is unhealthy (probe result: failure), it will be killed and re-created.
Nov 28 08:09:32 coreos-1 kubelet[27007]: E1128 08:09:32.708600 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:33 coreos-1 kubelet[27007]: E1128 08:09:33.708741 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Nov 28 08:09:34 coreos-1 kubelet[27007]: E1128 08:09:34.685334 27007 fs.go:211] Stat fs failed. Error: no such file or directory
Please help me fix it.
Also, can I use Consul in place of SkyDNS in this case?
Isn't it just a typo? Try nslookup kubernetes.default.svc.cluster.local or kubernetes.default.
Right now, you searched for kubenetes when it should be kubernetes.
Hope it helps!
UPDATE:
It seems like you haven't configured the --cluster-dns and --cluster-domain flags in your kubelet service.
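For reference, those two flags sit directly on the kubelet command line; the values below are the ones this cluster uses, copied from the systemd status shown in the question, with the other flags elided:

/opt/bin/kubelet \
  --cluster-dns=10.100.100.100 \
  --cluster-domain=cluster.local
# (other flags omitted; --cluster-dns must match the kube-dns Service ClusterIP,
#  and --cluster-domain must match the domain SkyDNS serves)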
I found the solution from @Tim Hockin's comment.
I got these logs from the kube2sky service:
Failed to list *api.Service: couldn't get version/kind; json parse error: invalid character '<' looking for beginning of value
Failed to list *api.Endpoints: couldn't get version/kind; json parse error: invalid character '<' looking for beginning of value
With this error, I just needed to add the following to the skydns-rc.yaml file:
- -kube_master_url=http://10.130.162.60:8080 # Kubernetes API server IP
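In context, the kube2sky container in skydns-rc.yaml ends up looking roughly like this (a sketch based on the stock v9 add-on manifest; only the relevant fields are shown, and the -domain value is assumed to match the cluster domain used above):

- name: kube2sky
  image: gcr.io/google_containers/kube2sky:1.11
  args:
  # kube2sky watches Services/Endpoints on the API server and writes them
  # into etcd for SkyDNS to serve
  - -domain=cluster.local
  - -kube_master_url=http://10.130.162.60:8080

Without -kube_master_url, kube2sky tries to reach the API server through its default in-cluster address, which in this setup evidently returned HTML instead of JSON, hence the "invalid character '<'" errors above.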
For all the details of my service configuration files, please see: https://gist.github.com/thanhson1085/5a005e92245cb2288dee
After that, restart the DNS service containers. It works:
core@coreos-1 ~/projects $ sudo kubectl exec busybox -- nslookup kubernetes.default
Server: 10.100.100.100
Address 1: 10.100.100.100
Name: kubernetes.default
Address 1: 10.100.0.1
Related
When trying to create Pods that can use the GPU, I get the error: exec: "nvidia-smi": executable file not found in $PATH.
To explain the error from the beginning, my main goal was to create JupyterHub environments that can use the GPU. I installed Zero to JupyterHub for Kubernetes and followed these steps to enable GPU support. When I check my node, the GPU seems schedulable by Kubernetes. So far everything seemed fine.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
However, when I logged in to JupyterHub and tried to open the profile using the GPU, I got an error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So, I checked the Pods and found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, where the error originates, and saw that the toolkit-validation container was in a CrashLoopBackOff state. Here is the relevant part of the output:
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then, I looked at the logs of the container and I got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, deleting the failed Pod and deployment was suggested. However, doing so did not fix my problem. Do you have any suggestions?
I have;
Ubuntu 20.04
Kubernetes v1.21.6
Docker 20.10.10
NVIDIA-SMI 470.82.01
CUDA 11.4
CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
GPU: NVIDIA GeForce RTX 2080 Ti
Memory: 13815MiB / 48280MiB
Thanks in advance.
In case you're still having the issue: we just had the same issue on our cluster, and the "dirty" fix is the following:
rm /run/nvidia/driver
ln -s / /run/nvidia/driver
kubectl delete pod -n gpu-operator nvidia-operator-validator-xxxxx
The reason is that the init container of the nvidia-operator-validator tries to execute nvidia-smi within a chroot into /run/nvidia/driver, which is a tmpfs (so it doesn't persist across reboots) and is not populated when the drivers are installed manually.
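You can see the problem (and confirm the fix) by hand, since the validator effectively runs nvidia-smi chrooted into that directory; a sketch, assuming the drivers are installed on the host root:

# fails while /run/nvidia/driver is an unpopulated tmpfs,
# succeeds once it is symlinked to / as above
sudo chroot /run/nvidia/driver nvidia-smi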
I do hope for a better fix from NVIDIA.
I have a k8s cluster with 1 master and 5 nodes. I am setting up EFK with this reference: https://www.digitalocean.com/community/tutorials/how-to-set-up-an-elasticsearch-fluentd-and-kibana-efk-logging-stack-on-kubernetes#step-4-%E2%80%94-creating-the-fluentd-daemonset
While creating the Fluentd DaemonSet, 1 out of 5 fluentd Pods is in the ImagePullBackOff state:
kubectl get all -n kube-logging -o wide Tue Apr 21 03:49:26 2020
NAME       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS   IMAGES                                                                 SELECTOR
ds/fluentd 5         5         4       5            4           <none>          1d    fluentd      fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1   app=fluentd
NAME READY STATUS RESTARTS AGE IP NODE
po/fluentd-82h6k 1/1 Running 1 1d 100.96.15.56 ip-172-20-52-52.us-west-1.compute.internal
po/fluentd-8ghjq 0/1 ImagePullBackOff 0 17h 100.96.10.170 ip-172-20-58-72.us-west-1.compute.internal
po/fluentd-fdmc8 1/1 Running 1 1d 100.96.3.73 ip-172-20-63-147.us-west-1.compute.internal
po/fluentd-g7755 1/1 Running 1 1d 100.96.2.22 ip-172-20-60-101.us-west-1.compute.internal
po/fluentd-gj8q8 1/1 Running 1 1d 100.96.16.17 ip-172-20-57-232.us-west-1.compute.internal
admin@ip-172-20-58-79:~$ kubectl describe po/fluentd-8ghjq -n kube-logging
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal BackOff 12m (x4364 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Back-off pulling image "fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1"
Warning FailedSync 2m (x4612 over 17h) kubelet, ip-172-20-58-72.us-west-1.compute.internal Error syncing pod
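A quick way to separate a registry or network problem from a kubelet problem is to pull the image by hand on the failing node (a diagnostic sketch, assuming Docker is the container runtime there):

# run on ip-172-20-58-72; if this also hangs or fails,
# the problem is node-side networking or registry access
sudo docker pull fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1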
Kubelet logs on the node which is failing to run Fluentd:
admin@ip-172-20-58-72:~$ journalctl -u kubelet -f
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095334 755 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: E0421 03:53:53.095369 755 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 03:53:53 ip-172-20-58-72 kubelet[755]: W0421 03:53:53.095440 755 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Apr 21 03:53:54 ip-172-20-58-72 kubelet[755]: I0421 03:53:54.882213 755 server.go:779] GET /metrics/cadvisor: (50.308555ms) 200 [[Prometheus/2.12.0] 172.20.58.79:54492]
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: I0421 03:53:55.452951 755 kuberuntime_manager.go:500] Container {Name:fluentd Image:fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:FLUENT_ELASTICSEARCH_HOST Value:vpc-cog-01-es-dtpgkfi.ap-southeast-1.es.amazonaws.com ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_PORT Value:443 ValueFrom:nil} {Name:FLUENT_ELASTICSEARCH_SCHEME Value:https ValueFrom:nil} {Name:FLUENTD_SYSTEMD_CONF Value:disable ValueFrom:nil}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:209715200 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:varlog ReadOnly:false MountPath:/var/log SubPath: MountPropagation:<nil>} {Name:varlibdockercontainers ReadOnly:true MountPath:/var/lib/docker/containers SubPath: MountPropagation:<nil>} {Name:fluentd-token-k8fnp ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Apr 21 03:53:55 ip-172-20-58-72 kubelet[755]: E0421 03:53:55.455327 755 pod_workers.go:182] Error syncing pod aa65dd30-82f2-11ea-a005-0607d7cb72ed ("fluentd-8ghjq_kube-logging(aa65dd30-82f2-11ea-a005-0607d7cb72ed)"), skipping: failed to "StartContainer" for "fluentd" with ImagePullBackOff: "Back-off pulling image \"fluent/fluentd-kubernetes-daemonset:v1.4.2-debian-elasticsearch-1.1\""
Kubelet logs on the node which is running Fluentd successfully:
admin@ip-172-20-63-147:~$ journalctl -u kubelet -f
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874293 1272 summary.go:92] Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: E0421 04:09:25.874336 1272 summary.go:92] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Apr 21 04:09:25 ip-172-20-63-147 kubelet[1272]: W0421 04:09:25.874453 1272 helpers.go:847] eviction manager: no observation found for eviction signal allocatableNodeFs.available
I have installed Kubernetes 1000 times, but now it does not work.
I installed kubectl, kubeadm, and kubelet, then ran:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=185.73.114.92
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
but I see that coredns is in the Pending state:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5644d7b6d9-492q4 0/1 Pending 0 13m
kube-system coredns-5644d7b6d9-cvwjg 0/1 Pending 0 13m
kube-system etcd-amghezi 1/1 Running 0 12m
kube-system kube-apiserver-amghezi 1/1 Running 0 12m
kube-system kube-controller-manager-amghezi 1/1 Running 0 12m
kube-system kube-flannel-ds-amd64-fkxnf 1/1 Running 0 12m
kube-system kube-proxy-pspw2 1/1 Running 0 13m
kube-system kube-scheduler-amghezi 1/1 Running 0 12m
Then I describe the coredns Pod:
kubectl describe pods coredns-5644d7b6d9-492q4 -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
I removed the taint on the node with:
kubectl taint nodes amghezi node-role.kubernetes.io/master-
but it did not work.
In journalctl -xe I see the message:
docker: network plugin is not ready: cni config uninitialized
service docker status
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2019-09-22 17:29:45 CEST; 34min ago
Docs: https://docs.docker.com
Main PID: 987 (dockerd)
Tasks: 20
CGroup: /system.slice/docker.service
└─987 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Sep 22 17:29:45 ubuntu systemd[1]: Started Docker Application Container Engine.
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.728818467+02:00" level=info msg="API listen on /var/run/docker.sock"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.757401709+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:45 ubuntu dockerd[987]: time="2019-09-22T17:29:45.786776798+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.296798944+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:29:46 ubuntu dockerd[987]: time="2019-09-22T17:29:46.364459982+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:06 ubuntu dockerd[987]: time="2019-09-22T17:30:06.996299645+02:00" level=warning msg="failed to retrieve runc version: unknown output format: runc version spec: 1.0.1-dev\n"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633452599+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.633831003+02:00" level=warning msg="d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e cleanup: failed to unmount IPC: umount /var/lib/docker/containers/d72e19bd0e929513a1c9092ec487e5dc3f3e009bdaa4d33668b610e86cdadf9e/mounts/shm, flags: 0x2
Sep 22 17:30:41 ubuntu dockerd[987]: time="2019-09-22T17:30:41.903058543+02:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
And the kubelet status shows:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Based on the outputs given, I assume the problem comes from the kubelet node agent, since the kubelet depends on a CNI network plugin being installed. To configure networking for Pods automatically, the kubelet invokes the CNI plugin before each Pod is created in order to set up the Pod's network interface. Furthermore, the CoreDNS discovery service relies on the overlay container network being reachable from all cluster nodes.
Although you've used the Flannel CNI provider and the flannel Pod is up and running, the kubelet can't create the container interface for the CoreDNS Pods because the CNI configuration is missing. I would therefore recommend resetting the kubeadm cluster and purging the leftover component directories:
$ sudo kubeadm reset
$ sudo systemctl stop docker && sudo systemctl stop kubelet
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf .kube/
$ sudo rm -rf /var/lib/kubelet/
$ sudo rm -rf /var/lib/cni/
$ sudo rm -rf /etc/cni/
$ sudo rm -rf /var/lib/etcd/
Bootstrap K8s cluster via kubeadm:
$ sudo systemctl start docker && sudo systemctl start kubelet
$ sudo kubeadm init ...
Then remove the node-role.kubernetes.io/master taint and apply the Flannel add-on:
$ kubectl taint nodes --all node-role.kubernetes.io/master-
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
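Once the CNI config is in place, the CoreDNS Pods should move from Pending to Running within a minute or so; a quick way to watch for that (nothing assumed beyond the default kube-system namespace):

$ kubectl get pods -n kube-system -w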
You might also find some useful steps in the kubeadm troubleshooting guide in the official K8s documentation.
I used the AWS Kubernetes Quickstart to create a Kubernetes cluster in a VPC and private subnet: https://aws-quickstart.s3.amazonaws.com/quickstart-heptio/doc/heptio-kubernetes-on-the-aws-cloud.pdf. It was running fine for a while. I have Calico installed on my Kubernetes cluster. I have two nodes and a master. The calico pods on the master are running fine, but the ones on the nodes are in a CrashLoopBackOff state:
NAME READY STATUS RESTARTS AGE
calico-etcd-ztwjj 1/1 Running 1 55d
calico-kube-controllers-685755779f-ftm92 1/1 Running 2 55d
calico-node-gkjgl 1/2 CrashLoopBackOff 270 22h
calico-node-jxkvx 2/2 Running 4 55d
calico-node-mxhc5 1/2 CrashLoopBackOff 9 25m
Describing one of the crashed pods:
ubuntu@ip-10-0-1-133:~$ kubectl describe pod calico-node-gkjgl -n kube-system
Name: calico-node-gkjgl
Namespace: kube-system
Node: ip-10-0-0-237.us-east-2.compute.internal/10.0.0.237
Start Time: Mon, 17 Sep 2018 16:56:41 +0000
Labels: controller-revision-hash=185957727
k8s-app=calico-node
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP: 10.0.0.237
Controlled By: DaemonSet/calico-node
Containers:
calico-node:
Container ID: docker://d89979ba963c33470139fd2093a5427b13c6d44f4c6bb546c9acdb1a63cd4f28
Image: quay.io/calico/node:v3.1.1
Image ID: docker-pullable://quay.io/calico/node@sha256:19fdccdd4a90c4eb0301b280b50389a56e737e2349828d06c7ab397311638d29
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 18 Sep 2018 15:14:44 +0000
Finished: Tue, 18 Sep 2018 15:14:44 +0000
Ready: False
Restart Count: 270
Requests:
cpu: 250m
Liveness: http-get http://:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: http-get http://:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
ETCD_ENDPOINTS: <set to the key 'etcd_endpoints' of config map 'calico-config'> Optional: false
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: kubeadm,bgp
CALICO_DISABLE_FILE_LOGGING: true
CALICO_K8S_NODE_REF: (v1:spec.nodeName)
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_IPV4POOL_IPIP: Always
FELIX_IPV6SUPPORT: false
FELIX_IPINIPMTU: 1440
FELIX_LOGSEVERITYSCREEN: info
IP: autodetect
FELIX_HEALTHENABLED: true
Mounts:
/lib/modules from lib-modules (ro)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
install-cni:
Container ID: docker://b37e0ec7eba690473a4999a31d9f766f7adfa65f800a7b2dc8e23ead7520252d
Image: quay.io/calico/cni:v3.1.1
Image ID: docker-pullable://quay.io/calico/cni@sha256:dc345458d136ad9b4d01864705895e26692d2356de5c96197abff0030bf033eb
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Running
Started: Mon, 17 Sep 2018 17:11:52 +0000
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 17 Sep 2018 16:56:43 +0000
Finished: Mon, 17 Sep 2018 17:10:53 +0000
Ready: True
Restart Count: 1
Environment:
CNI_CONF_NAME: 10-calico.conflist
ETCD_ENDPOINTS: <set to the key 'etcd_endpoints' of config map 'calico-config'> Optional: false
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-cni-plugin-token-b7sfl (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
calico-cni-plugin-token-b7sfl:
Type: Secret (a volume populated by a Secret)
SecretName: calico-cni-plugin-token-b7sfl
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoSchedule
:NoExecute
:NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 4m (x6072 over 22h) kubelet, ip-10-0-0-237.us-east-2.compute.internal Back-off restarting failed container
The logs for the same pod:
ubuntu@ip-10-0-1-133:~$ kubectl logs calico-node-gkjgl -n kube-system -c calico-node
2018-09-18 15:14:44.605 [INFO][8] startup.go 251: Early log level set to info
2018-09-18 15:14:44.605 [INFO][8] startup.go 269: Using stored node name from /var/lib/calico/nodename
2018-09-18 15:14:44.605 [INFO][8] startup.go 279: Determined node name: ip-10-0-0-237.us-east-2.compute.internal
2018-09-18 15:14:44.609 [INFO][8] startup.go 101: Skipping datastore connection test
2018-09-18 15:14:44.610 [INFO][8] startup.go 352: Building new node resource Name="ip-10-0-0-237.us-east-2.compute.internal"
2018-09-18 15:14:44.610 [INFO][8] startup.go 367: Initialize BGP data
2018-09-18 15:14:44.614 [INFO][8] startup.go 564: Using autodetected IPv4 address on interface ens3: 10.0.0.237/19
2018-09-18 15:14:44.614 [INFO][8] startup.go 432: Node IPv4 changed, will check for conflicts
2018-09-18 15:14:44.618 [WARNING][8] startup.go 861: Calico node 'ip-10-0-0-237' is already using the IPv4 address 10.0.0.237.
2018-09-18 15:14:44.618 [WARNING][8] startup.go 1058: Terminating
Calico node failed to start
So it seems like there is a conflict finding the node IP address, or Calico seems to think the IP is already assigned to another node. Doing a quick search, I found this thread: https://github.com/projectcalico/calico/issues/1628. I see that this should be resolved by setting IP_AUTODETECTION_METHOD to can-reach=DESTINATION, which I'm assuming would be "can-reach=10.0.0.237". This config is an environment variable set on the calico/node container. I have been attempting to shell into the container itself, but kubectl tells me the container is not found:
ubuntu#ip-10-0-1-133:~$ kubectl exec calico-node-gkjgl --stdin --tty /bin/sh -c calico-node -n kube-system
error: unable to upgrade connection: container not found ("calico-node")
I suspect this is due to Calico being unable to assign IPs. So I logged onto the host and attempted to shell into the container using docker:
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"
So I guess there is no shell to execute in the container, which makes sense given that Kubernetes couldn't execute one either. I tried running commands externally to list environment variables, but I haven't been able to find any; I could be running these commands wrong, however:
root@ip-10-0-0-237:~# docker inspect -f '{{range $index, $value := .Config.Env}}{{$value}} {{end}}' k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 printenv IP_AUTODETECTION_METHOD
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"printenv\": executable file not found in $PATH"
root@ip-10-0-0-237:~# docker exec -it k8s_POD_calico-node-gkjgl_kube-system_a6998e98-ba9a-11e8-a9fa-0a97f5a48ef4_1 /bin/env
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "exec: \"/bin/env\": stat /bin/env: no such file or directory"
Okay, so maybe I am going about this the wrong way. Should I attempt to change the Calico config files using Kubernetes and redeploy it? Where can I find these on my system? I haven't been able to find where to set the environment variables.
If you look at the Calico docs, IP_AUTODETECTION_METHOD already defaults to first-found.
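If you do want to override it, the place to set it is the calico-node DaemonSet's Pod template rather than a running container, which is why shelling in got you nowhere. A hedged one-liner, with the can-reach target taken from your node's address above:

# updates the DaemonSet's Pod template; the Pods are re-created with the new variable
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=can-reach=10.0.0.237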
My guess is that the IP address is not being released by the previous 'run' of calico, or it's simply a bug in the v3.1.1 version of calico.
Try:
Delete your Calico pods that are in a CrashLoopBackOff state:
kubectl -n kube-system delete pod calico-node-gkjgl calico-node-mxhc5
Your pods will be re-created and hopefully initialize.
Upgrade Calico to v3.1.3 or the latest version, following these docs. My guess is that Heptio's Calico installation is using the etcd datastore.
Try to understand how Heptio's AWS AMIs work and see if there are any issues with them. This might take some time so you could contact their support as well.
Try a different method to install Kubernetes with Calico; it's well documented at https://kubernetes.io.
For me, what worked was to remove leftover Docker networks on the Nodes.
I had to list the current networks on each Node (docker network list) and then remove the unneeded ones (docker network rm <networkName>).
After doing that, the calico deployment pods were running fine.
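A sketch of that cleanup as a one-liner, to be run on each affected node; the grep keeps Docker's built-in bridge, host, and none networks, and everything else here is assumed to be removable leftovers:

docker network ls --format '{{.Name}}' \
  | grep -vE '^(bridge|host|none)$' \
  | xargs -r docker network rm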
Following this guide, I'm trying to start minikube and forward a port at boot time.
My script:
#!/bin/bash
set -eux
export PATH=/usr/local/bin:$PATH
minikube status || minikube start
minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
kubectl port-forward --namespace kube-system $POD 5000:5000
Everything works fine except that kubectl port-forward says the pod does not exist the first time the script runs:
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
error: error upgrading connection: unable to upgrade connection: pod does not exist
If I re-run:
+ minikube status
minikube: Running
cluster: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.100
+ minikube ssh 'grep docker.for.mac.localhost /etc/hosts || echo -e "127.0.0.1\tdocker.for.mac.localhost" | sudo tee -a /etc/hosts'
127.0.0.1 docker.for.mac.localhost
+ minikube ssh 'test -f wait-for-it.sh || curl -O https://raw.githubusercontent.com/vishnubob/wait-for-it/master/wait-for-it.sh'
+ minikube ssh 'chmod +x wait-for-it.sh && ./wait-for-it.sh 127.0.1.1:10250'
wait-for-it.sh: waiting 15 seconds for 127.0.1.1:10250
wait-for-it.sh: 127.0.1.1:10250 is available after 0 seconds
++ kubectl get po --namespace kube-system
++ awk '/kube-registry-v0/ { print $1 }'
+ POD=kube-registry-v0-qr2ml
+ kubectl port-forward --namespace kube-system kube-registry-v0-qr2ml 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
Forwarding from [::1]:5000 -> 5000
I added a debug line before forwarding:
kubectl describe pod --namespace kube-system $POD
and saw this:
+ POD=kube-registry-v0-qr2ml
+ kubectl describe pod --namespace kube-system kube-registry-v0-qr2ml
Name: kube-registry-v0-qr2ml
Namespace: kube-system
Node: minikube/192.168.99.100
Start Time: Thu, 28 Dec 2017 10:00:00 +0700
Labels: k8s-app=kube-registry
version=v0
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"kube-system","name":"kube-registry-v0","uid":"317ecc42-eb7b-11e7-a8ce-...
Status: Running
IP: 172.17.0.6
Controllers: ReplicationController/kube-registry-v0
Containers:
registry:
Container ID: docker://6e8f3f33399605758354f3f546996067d834459781235d51eef3ffa9c6589947
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:22:44 +0700
Why did kubectl say that it does not exist?
Fri Dec 29 04:58:06 +07 2017
Looking carefully at the events, I found something:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "image-store"
20m 20m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-fs7kr"
20m 20m 1 kubelet, minikube Normal SandboxChanged Pod sandbox changed, it will be killed and re-created.
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Pulled Container image "registry:2.5.1" already present on machine
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Created Created container
20m 20m 1 kubelet, minikube spec.containers{registry} Normal Started Started container
Pod sandbox changed, it will be killed and re-created.
Before:
Containers:
registry:
Container ID: docker://47c510dce00c6c2c29c9fe69665e1241c457d0666174a7723062c534e7229c58
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 13:47:02 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:22:44 +0700
Finished: Thu, 28 Dec 2017 13:45:18 +0700
Ready: True
Restart Count: 14
After:
Containers:
registry:
Container ID: docker://3a7da784d3d596796111348757725f5af22b47c5edd0fc29a4ffbb84f3f08956
Image: registry:2.5.1
Image ID: docker-pullable://registry@sha256:946480a23b33480b8e7cdb89b82c1bd6accae91a8e66d017e21e8b56551f6209
Port: 5000/TCP
State: Running
Started: Thu, 28 Dec 2017 19:03:04 +0700
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 28 Dec 2017 13:47:02 +0700
Finished: Thu, 28 Dec 2017 19:00:48 +0700
Ready: True
Restart Count: 15
minikube logs:
Dec 28 22:15:41 minikube localkube[3250]: W1228 22:15:41.102038 3250 docker_sandbox.go:343] failed to read pod IP from plugin/docker: Couldn't find network status for kube-system/kube-registry-v0-qr2ml through plugin: invalid network status for
POD=$(kubectl get po --namespace kube-system | awk '/kube-registry-v0/ { print $1 }')
Be aware that using a selector is almost certainly better than using text utilities, especially with "unstructured" output from kubectl. I don't know of any promises they make about the format of the default output, which is why --output=json and friends exist. However, in your case, when you just want the name, there is a special --output=name which does what it says, with the mild caveat that the resource prefix will be in front of the name (pods/kube-registry-v0-qr2ml in your case).
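A sketch of that combination, using the k8s-app=kube-registry label visible in your describe output (the sed strips the pods/ prefix that --output=name adds):

POD=$(kubectl get po --namespace kube-system -l k8s-app=kube-registry --output=name \
  | sed 's|^pods*/||')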
Separately, I see that you have "wait-for-it," but just because a port is accepting connections doesn't mean the Pod is Ready. You'll actually want to use --output=json (or more awk scripts, I guess) to ensure the Pod is both Running and Ready, with the latter status reached when kubernetes and the Pod agree that everything is cool.
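One way to build that readiness gate with jsonpath instead of full JSON parsing, assuming $POD holds the bare pod name from above; it polls until the Ready condition reports True:

until [ "$(kubectl get po --namespace kube-system "$POD" --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]; do
  sleep 2   # Pod exists but is not Ready yet; poll again
done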
I suspect, but would have to experiment to know for sure, that the error message is just misleading; it isn't truly that kubernetes doesn't know anything about your Pod, but merely that it couldn't port-forward to it in the state it's in.
You may also experience better success by creating a Service of type: NodePort and then talk to the Node's IP on the allocated port; that side-steps this kubectl-shell mess entirely, but does not side-step the Ready part -- only Pods in the Ready state will receive traffic from a Service
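A minimal sketch of such a Service, assuming the k8s-app=kube-registry label from your Pod and an arbitrary nodePort from the default 30000-32767 range:

apiVersion: v1
kind: Service
metadata:
  name: kube-registry-nodeport   # hypothetical name
  namespace: kube-system
spec:
  type: NodePort
  selector:
    k8s-app: kube-registry       # matches the label on the kube-registry-v0 pods
  ports:
  - port: 5000                   # Service port inside the cluster
    targetPort: 5000             # container port from your Pod spec
    nodePort: 30500              # arbitrary choice; reachable at <node-ip>:30500

With that applied, the registry is reachable at $(minikube ip):30500 with no port-forward step at all.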
As a minor, pedantic note, --namespace is an argument to kubectl, and not to port-forward, so the most correct invocation is kubectl --namespace=kube-system port-forward kube-registry-v0-qr2ml 5000:5000 to ensure the argument isn't mis-parsed.