Kubernetes cluster does not run after reboot - kube-apiserver

When I run kubectl after a reboot, I get this error:
The connection to the server x.x.x.x:6443 was refused - did you specify the right host or port?
When I check my containers with docker ps, kube-apiserver and kube-scheduler keep starting and stopping.
Why is this happening?
root@taeil-linux:/etc/systemd/system/kubelet.service.d# cd
root@taeil-linux:~# kubectl get nodes
The connection to the server 10.0.0.152:6443 was refused - did you specify the right host or port?
root@taeil-linux:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
root@taeil-linux:~# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-proxy v1.15.3 232b5c793146 2 weeks ago 82.4MB
k8s.gcr.io/kube-apiserver v1.15.3 5eb2d3fc7a44 2 weeks ago 207MB
k8s.gcr.io/kube-scheduler v1.15.3 703f9c69a5d5 2 weeks ago 81.1MB
k8s.gcr.io/kube-controller-manager v1.15.3 e77c31de5547 2 weeks ago 159MB
node carbon c83f74dcf58e 3 weeks ago 895MB
kubernetesui/dashboard v2.0.0-beta1 4640949a39e6 2 months ago 64.6MB
weaveworks/weave-kube 2.5.2 f04a043bb67a 3 months ago 148MB
weaveworks/weave-npc 2.5.2 5ce48e0d813c 3 months ago 49.6MB
kubernetesui/metrics-scraper v1.0.0 44390ebe2b73 4 months ago 36.8MB
k8s.gcr.io/coredns 1.3.1 eb516548c180 7 months ago 40.3MB
k8s.gcr.io/etcd 3.3.10 2c4adeb21b4f 9 months ago 258MB
quay.io/coreos/flannel v0.10.0-amd64 f0fad859c909 19 months ago 44.6MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 20 months ago 742kB
root@taeil-linux:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2019-09-06 14:29:25 KST; 4min 19s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 14470 (kubelet)
Tasks: 19 (limit: 4512)
CGroup: /system.slice/kubelet.service
└─14470 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --resolv-con
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.800330 14470 pod_workers.go:190] Error syncing pod 9a745ac0a776afabd0d387fd0fcb2f54 ("kube-apiserver-taeil-linux_kube-system(9a745ac0a776afabd0d387fd0fcb2f54)"), skipping: failed to "CreatePodSandbox" for "kube-apiserver-ta
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.897945 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.916566 14470 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.0.152:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dtaeil-linux&limit=500&resourceVersion=0: dia
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.998190 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.098439 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.198732 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.299052 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.399343 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.499561 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.599723 14470 kubelet.go:2248] node "taeil-linux" not found
root@taeil-linux:~# systemctl status kube-apiserver
Unit kube-apiserver.service could not be found.
If I run docker logs on the kube-apiserver container, I see:
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0906 10:54:19.636649 1 server.go:560] external host was not specified, using 10.0.0.152
I0906 10:54:19.636954 1 server.go:147] Version: v1.15.3
I0906 10:54:21.753962 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook.
I0906 10:54:21.753988 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
E0906 10:54:21.754660 1 prometheus.go:55] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754701 1 prometheus.go:68] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754787 1 prometheus.go:82] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754842 1 prometheus.go:96] failed to register workDuration metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754883 1 prometheus.go:112] failed to register unfinished metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754918 1 prometheus.go:126] failed to register unfinished metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754952 1 prometheus.go:152] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754986 1 prometheus.go:164] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755047 1 prometheus.go:176] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755104 1 prometheus.go:188] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755152 1 prometheus.go:203] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755188 1 prometheus.go:216] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0906 10:54:21.755215 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook.
I0906 10:54:21.755226 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0906 10:54:21.757263 1 client.go:354] parsed scheme: ""
I0906 10:54:21.757280 1 client.go:354] scheme "" not registered, fallback to default scheme
I0906 10:54:21.757335 1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0906 10:54:21.757402 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0906 10:54:21.757666 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I0906 10:54:22.753069 1 client.go:354] parsed scheme: ""
I0906 10:54:22.753118 1 client.go:354] scheme "" not registered, fallback to default scheme
I0906 10:54:22.753204 1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0906 10:54:22.753354 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0906 10:54:22.753855 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:22.757983 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:23.754019 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:24.430000 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:25.279869 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:26.931974 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:28.198719 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:30.825660 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:32.850511 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:36.294749 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:38.737408 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
F0906 10:54:41.757603 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry {[https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt} true 0xc00063dd40 apiextensions.k8s.io/v1beta1 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
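Reading the log above: the repeated W lines show kube-apiserver failing to reach etcd on 127.0.0.1:2379, and the final F line is the fatal exit that follows. A minimal Python sketch of how one might pull those two facts out of a saved log is shown here. The helper names are hypothetical (they are not part of docker, kubectl, or Kubernetes), and the sample lines are copied from the log above.

```python
import re

# Hypothetical helpers for illustration only.
REFUSED = re.compile(r"dial tcp (?P<addr>[\d.]+:\d+): connect: connection refused")

def refused_endpoints(log_text):
    """Return the sorted, de-duplicated host:port pairs that refused a connection."""
    return sorted({m.group("addr") for m in REFUSED.finditer(log_text)})

def died_on_storage_backend(log_text):
    """True if a fatal (F-prefixed) line reports a storage backend failure."""
    return any(
        line.startswith("F") and "Unable to create storage backend" in line
        for line in log_text.splitlines()
    )

sample = '''W0906 10:54:21.757666 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
F0906 10:54:41.757603 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry {[https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt} true 0xc00063dd40 apiextensions.k8s.io/v1beta1 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)'''

print(refused_endpoints(sample))
print(died_on_storage_backend(sample))
```

If the only refused endpoint is the etcd port, the etcd container is the next thing to check, since kube-apiserver cannot start without it.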

The answer is in the comment by @cewood:
Okay, that helps to understand what your installation is likely to look
like. Regarding the other master components, these are likely running
via the kubelet, and hence there won't be any systemd units for them,
only for the kubelet itself.
With a kubeadm install you don't see the control-plane components as services.
As root:
systemctl start docker
systemctl start kubelet
Then switch to a non-root user:
su - nonrootuser
kubectl get pods

Long time no see.
I finally figured out how to solve this problem.
If you get an error like this for no apparent reason, you can fix it by removing all containers:
docker rm $(docker ps -a -q)
My guess is that the existing Kubernetes containers were left in a bad state by the reboot, and the newly started containers crashed because of them.
watch docker ps
If you watch the containers this way, you can see kube-apiserver and the others stop within a minute of starting.
So I deleted every container listed by docker ps -a, and that fixed it.
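For anyone who would rather inspect what is about to be removed first, here is a minimal Python sketch that picks out only the non-running containers from docker ps output. It assumes the output was produced with docker ps -a --format '{{.ID}}\t{{.Status}}\t{{.Names}}' (tab-separated). The helper name is hypothetical, and the sample IDs and names are made up for illustration.

```python
# Hypothetical helper; sample IDs and names below are invented for illustration.
STOPPED_PREFIXES = ("Exited", "Created", "Dead")

def stopped_container_ids(ps_output):
    """Return IDs of containers whose status shows they are not running.

    Expects tab-separated "ID, Status, Names" lines, one container per line.
    """
    ids = []
    for line in ps_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        container_id, status, _names = parts
        if status.startswith(STOPPED_PREFIXES):
            ids.append(container_id)
    return ids

sample = (
    "aaa111\tExited (255) 5 minutes ago\tk8s_kube-apiserver_example\n"
    "bbb222\tUp 3 minutes\tk8s_POD_example\n"
    "ccc333\tCreated\tk8s_kube-scheduler_example\n"
)
print(stopped_container_ids(sample))
```

The same selection can usually be done in the shell with docker ps -aq --filter status=exited, which lists only exited containers.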

Related

Unable to use Docker, even though the daemon is active

OS: Ubuntu 22.04 LTS | Ubuntu 20.04 LTS
I had the same problem on Ubuntu 22.04 LTS, so I downgraded to 20.04. I tried the solutions from every Stack Overflow thread I could find, but none of them worked for me. I am not behind any proxy.
docker run hello-world
Unable to find image 'hello-world:latest' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": proxyconnect tcp: dial tcp: lookup http: Temporary failure in name resolution.
The Docker service seems to be working fine:
I get a similar error when trying to log in:
If any additional detail is required, let me know in the comments and I'll edit the question.
Status logs:
Aug 15 11:15:19 asus systemd[1]: Started Docker Application Container Engine.
Aug 15 11:15:19 asus dockerd[9448]: time="2022-08-15T11:15:19.531101032+05:30" level=info msg="API listen on /run/docker.sock"
Aug 15 11:15:57 asus dockerd[9448]: time="2022-08-15T11:15:57.920565157+05:30" level=warning msg="Error getting v2 registry: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:15:57 asus dockerd[9448]: time="2022-08-15T11:15:57.920602673+05:30" level=info msg="Attempting next endpoint for pull after error: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:15:57 asus dockerd[9448]: time="2022-08-15T11:15:57.922360289+05:30" level=error msg="Handler for POST /v1.41/images/create returned error: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:18:45 asus dockerd[9448]: time="2022-08-15T11:18:45.641224726+05:30" level=warning msg="Error getting v2 registry: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:18:45 asus dockerd[9448]: time="2022-08-15T11:18:45.641275059+05:30" level=info msg="Attempting next endpoint for pull after error: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:18:45 asus dockerd[9448]: time="2022-08-15T11:18:45.643035998+05:30" level=error msg="Handler for POST /v1.41/images/create returned error: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:24:07 asus dockerd[9448]: time="2022-08-15T11:24:07.919119361+05:30" level=info msg="Error logging in to endpoint, trying next endpoint" error="Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
Aug 15 11:24:07 asus dockerd[9448]: time="2022-08-15T11:24:07.919328823+05:30" level=error msg="Handler for POST /v1.41/auth returned error: Get \"https://registry-1.docker.io/v2/\": proxyconnect tcp: dial tcp: lookup http: no such host"
journalctl logs:
sudo journalctl -fu docker.service
[sudo] password for rhythm:
-- Logs begin at Mon 2022-08-15 02:04:27 IST. --
Aug 15 13:18:20 asus dockerd[12520]: time="2022-08-15T13:18:20.454180157+05:30" level=info msg="Loading containers: done."
Aug 15 13:18:20 asus dockerd[12520]: time="2022-08-15T13:18:20.463895293+05:30" level=info msg="Docker daemon" commit=a89b842 graphdriver(s)=overlay2 version=20.10.17
Aug 15 13:18:20 asus dockerd[12520]: time="2022-08-15T13:18:20.463955619+05:30" level=info msg="Daemon has completed initialization"
Aug 15 13:18:20 asus systemd[1]: Started Docker Application Container Engine.
Aug 15 13:18:20 asus dockerd[12520]: time="2022-08-15T13:18:20.481004111+05:30" level=info msg="API listen on /run/docker.sock"
Aug 15 13:19:19 asus dockerd[12520]: time="2022-08-15T13:19:19.377705065+05:30" level=warning msg="Error getting v2 registry: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Aug 15 13:19:19 asus dockerd[12520]: time="2022-08-15T13:19:19.377835945+05:30" level=info msg="Attempting next endpoint for pull after error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Aug 15 13:19:19 asus dockerd[12520]: time="2022-08-15T13:19:19.385691924+05:30" level=error msg="Handler for POST /v1.41/images/create returned error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Aug 15 13:30:26 asus dockerd[12520]: time="2022-08-15T13:30:26.166694545+05:30" level=info msg="Error logging in to endpoint, trying next endpoint" error="Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Aug 15 13:30:26 asus dockerd[12520]: time="2022-08-15T13:30:26.166963331+05:30" level=error msg="Handler for POST /v1.41/auth returned error: Get \"https://registry-1.docker.io/v2/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
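A note on reading the first set of errors: "proxyconnect tcp: dial tcp: lookup http" means the daemon was trying to reach a proxy and was resolving a host literally named "http", so some proxy setting appears to be in effect for dockerd even if none was set on purpose. One way this can happen, offered only as an assumption, is a proxy value with a doubled scheme. The Python sketch below uses the standard library URL parser to show which hostname such a value yields. The proxy address is hypothetical, and dockerd's own Go parser may not behave identically.

```python
from urllib.parse import urlsplit

def proxy_host(proxy_value):
    """Return the hostname a URL parser extracts from a proxy setting."""
    return urlsplit(proxy_value).hostname

# proxy.example.com:3128 is a made-up address used only for illustration.
print(proxy_host("http://proxy.example.com:3128"))         # well-formed value
print(proxy_host("http://http://proxy.example.com:3128"))  # doubled scheme
```

The proxy settings the daemon actually runs with can typically be inspected with systemctl show --property=Environment docker.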

Can I install Docker on Synology DS418?

Is there a way to install Docker on a Synology DS418?
I tried this with aarch64/docker-20.10.6.tgz.
Here are the steps I followed:
1. Download the static Docker binary from https://download.docker.com/linux/static/stable/aarch64/ . I chose aarch64/docker-20.10.6.tgz, but I may be mistaken here.
2. tar xzvf /path/to/.tar.gz
3. sudo cp docker/* /usr/bin/
4. Create the /etc/docker/daemon.json configuration file with the following configuration:
{
"storage-driver": "vfs",
"iptables": false,
"bridge": "none"
}
5. sudo dockerd &
I received this error at step 5:
xxx@NAS:~$ sudo dockerd &
[1] 806
xxx@NAS:~$ INFO[2021-05-04T14:35:55.752149353-05:00] Starting up
WARN[2021-05-04T14:35:55.753236211-05:00] could not change group /var/run/docker.sock to docker: group docker not found
INFO[2021-05-04T14:35:55.753833733-05:00] libcontainerd: containerd is still running pid=28644
INFO[2021-05-04T14:35:55.753946586-05:00] parsed scheme: "unix" module=grpc
INFO[2021-05-04T14:35:55.754184624-05:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2021-05-04T14:35:55.754265514-05:00] ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>} module=grpc
INFO[2021-05-04T14:35:55.754313995-05:00] ClientConn switching balancer to "pick_first" module=grpc
WARN[2021-05-04T14:35:56.754892127-05:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2021-05-04T14:35:59.191460361-05:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2021-05-04T14:36:03.215171500-05:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2021-05-04T14:36:08.582014438-05:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
failed to start containerd: timeout waiting for containerd to start
Thank you very much.

All you're missing is:
sudo synogroup --add docker $USER
If you run this command as a non-root user, that user will be able to execute docker commands without sudo.
If you don't want this, running the command as root, or running
sudo synogroup --add docker
may work, but I haven't tested either of these two approaches.

Unable to join Docker swarm because control.sock is missing?

I have an existing Docker swarm consisting of three machines. I am trying to add a new manager to this swarm. I run the command
docker swarm join --token SWMTKN-1-<...> 192.168.200.200:2377
After a while I get the error
Error response from daemon: manager stopped: can't initialize raft node: rpc error: code = Unknown desc = could not connect to prospective new cluster member using its advertised address: rpc error: code = DeadlineExceeded desc = context deadline exceeded
When I view the daemon logs using tail -f /var/log/messages | grep docker, I see this:
Mar 17 17:07:48 UAT-Blockchain dockerd: time="2021-03-17T17:07:48.575024542+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {/var/run/docker/swarm/control.sock <nil> 0 <nil>}. Err :connection error: desc= \"transport: Error while dialing dial unix /var/run/docker/swarm/control.sock: connect: no such file or directory\". Reconnecting..." module=grpc
A quick check shows that /var/run/docker/swarm/control.sock is indeed missing on this machine, but is present on the machines in the existing swarm.
What is this control.sock? How should I go about enabling/reinstating it on this current machine? Is this a problem of faulty installation?

kube-apiserver docker is restarting continuously

Sincere apologies for this lengthy post.
I have a 4-node Kubernetes cluster with 1 master and 3 worker nodes. I connect to the cluster using a kubeconfig file, but since yesterday I have not been able to connect.
kubectl get pods was giving the error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
In the kubeconfig, the server is specified as https://api.xxxxx.xxxxxxxx.com
Note:
Because the post contained too many https links, I was not able to submit the question. So I have replaced https:// with https:-- in the background analysis section to avoid the links.
I tried running kubectl from the master node and received a similar error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I then checked the kube-apiserver Docker container and found it was continuously exiting (CrashLoopBackOff).
docker logs <container-id of kube-apiserver> shows the errors below:
W0914 16:29:25.761524 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
F0914 16:29:29.319785 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (context deadline exceeded)
systemctl status kubelet was giving the errors below:
Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576 2411 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Note: ip-xxx-xx-xx-xxx is the internal IP address of the AWS EC2 instance.
Background Analysis:
It looks like there was an issue with the cluster on 7th Sep 2020, when both the kube-controller-manager and kube-scheduler containers exited and restarted. I believe that either kube-apiserver has not been running since then, or kube-apiserver is what caused those containers to restart. The kube-apiserver server certificate expired in July 2020, but access via kubectl was working until 7th Sep.
Below are the docker logs from the exited kube-scheduler container:
I0907 10:35:08.970384 1 scheduler.go:572] pod default/k8version-1599474900-hrjcn is bound successfully on node ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:40:09.286831 1 scheduler.go:572] pod default/k8version-1599475200-tshlx is bound successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:44:01.935373 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.935420 1 server.go:252] lost master
lost lease
Below are the docker logs from exited kube-controller docker container:
I0907 10:40:19.703485 1 garbagecollector.go:518] delete object [v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid: 67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy Background
I0907 10:44:01.937398 1 leaderelection.go:263] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.937506 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0907 10:44:01.937456 1 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1", ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720 stopped leading
F0907 10:44:01.937545 1 controllermanager.go:260] leaderelection lost
I0907 10:44:01.949274 1 range_allocator.go:169] Shutting down range CIDR allocator
I0907 10:44:01.949285 1 replica_set.go:194] Shutting down replicaset controller
I0907 10:44:01.949291 1 gc_controller.go:86] Shutting down GC controller
I0907 10:44:01.949304 1 pvc_protection_controller.go:111] Shutting down PVC protection controller
I0907 10:44:01.949310 1 route_controller.go:125] Shutting down route controller
I0907 10:44:01.949316 1 service_controller.go:197] Shutting down service controller
I0907 10:44:01.949327 1 deployment_controller.go:164] Shutting down deployment controller
I0907 10:44:01.949435 1 garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295] Shutting down resource quota controller
Below are the docker logs from kube-controller since the restart (7th Sep):
E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:51:40.133446 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Below are the docker logs from kube-scheduler since the restart (7th Sep):
E0915 21:52:44.703587 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.704504 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: Get https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.705471 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.706477 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.707581 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.708599 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.709687 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.710744 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.711879 1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list *v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.712903 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
kube-apiserver certificate renewal:
I found that the kube-apiserver certificate, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt, had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and events (the same copy of the certificates in both places), but I don't see these referenced in the manifest files.
I searched for steps to renew the certificates, but most of them used "kubeadm init phase" commands. I couldn't find kubeadm on the master server, and the certificate names and paths were different from my setup. So I generated a new certificate for kube-apiserver with openssl, using the existing CA cert and an openssl.cnf file that included the DNS names, the internal and external IP addresses of the EC2 instance, and the loopback IP address. I replaced the old certificate with the new one under the same name, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
After that I restarted the kube-apiserver container (which was continuously exiting) and restarted kubelet. The certificate expiry message no longer appears, but kube-apiserver is still restarting continuously, which I believe is the reason for the errors in the kube-controller-manager and kube-scheduler containers.
NOTE:
I have not restarted Docker on the master server after replacing the certificate.
NOTE: All our production pods are running on worker nodes, so they are not affected, but I can't manage them because I can't connect using kubectl.
I am not sure what the issue is or why kube-apiserver is restarting continuously.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from kube-apiserver container (which is still crashing)
F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
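The fatal line above names the etcd endpoint, the client key, the client certificate, and the CA file that kube-apiserver was started with, and the error "tls: private key does not match public key" means the certificate and key files it lists are not a matching pair. A minimal Python sketch for pulling those fields out of such a line follows. The helper name and the regular expression are my own, written against the two fatal lines shown in these threads, so other Kubernetes versions may print a different layout.

```python
import re

# Hypothetical parser, fitted to the log lines quoted in this thread.
STORAGE_FATAL = re.compile(
    r"Unable to create storage backend: config \(&\{.*?"
    r"\{\[(?P<endpoint>[^\]]+)\] (?P<key>\S+) (?P<cert>\S+) (?P<ca>[^}\s]+)\}"
    r".*\), err \((?P<err>.*)\)\s*$"
)

def storage_backend_details(line):
    """Return endpoint, key, cert, ca and err from a fatal storage line, or None."""
    match = STORAGE_FATAL.search(line)
    return match.groupdict() if match else None

line = (
    "F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: "
    "config (&{etcd3 /registry {[https:--127.0.0.1:4001] "
    "/etc/kubernetes/pki/kube-apiserver/etcd-client.key "
    "/etc/kubernetes/pki/kube-apiserver/etcd-client.crt "
    "/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 "
    "apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)"
)

details = storage_backend_details(line)
print(details["cert"])
print(details["key"])
print(details["err"])
```

With the two paths in hand, a common check (assuming an RSA key) is to compare the output of openssl x509 -noout -modulus -in on the certificate with openssl rsa -noout -modulus -in on the key; the two values should be identical for a matching pair.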
Below is the output from systemctl status kubelet
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
This cluster (along with three others) was set up using kops. The other clusters are running normally, and it looks like they have some expired certificates as well. The person who set up the clusters is not available for comment, and I have limited experience with Kubernetes, hence the request for assistance from the gurus.
Any help is very much appreciated.
Many thanks.
Update after response from Zambozo and Nepomucen:
Thanks to both of you for your responses. Based on them, I found that there were expired etcd certificates on the /mnt mount point.
I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/ and recreated the etcd certificates and keys. I verified each new certificate against a copy of the old one (from my backup folder); everything matches, and the new certificates have an expiry date of Sep 2021.
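For reference, an expiry check like the one described above could be scripted roughly as follows. This is only a sketch: it creates two throwaway certificates in a scratch directory so it can run anywhere, and the function name expires_within, the 30-day window, and the file names are illustrative assumptions. On a real master, pki_dir would be pointed at the directory that actually holds the certificates.

```shell
#!/bin/sh
# Sketch: list certificate expiry dates and flag those expiring soon.
# All names here are placeholders; nothing touches real cluster files.

# Returns 0 when the certificate expires within $2 seconds.
# Note: it also returns 0 when openssl cannot read the file at all.
expires_within() {
  if openssl x509 -in "$1" -noout -checkend "$2" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Scratch data: one certificate valid for 1 day, one valid for 365 days.
work=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=short-demo" \
  -keyout "$work/short.key" -out "$work/short.crt" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=long-demo" \
  -keyout "$work/long.key" -out "$work/long.crt" 2>/dev/null

pki_dir=$work            # assumption: replace with the real certificate directory
window=2592000           # 30 days, in seconds

for cert in "$pki_dir"/*.crt; do
  # Print the subject and the notAfter date of each certificate.
  openssl x509 -in "$cert" -noout -subject -enddate
  if expires_within "$cert" "$window"; then
    echo "  WARNING: $cert expires within 30 days (or is unreadable)"
  fi
done
```

Only the one-day certificate should trigger the warning; the check is read-only, so it can be repeated after each renewal step.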
Now I am getting a different error from the etcd containers (both etcd-manager-events and etcd-manager-main).
Note: xxx-xx-xx-xxx is the IP address of the master server
root#ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts
root#ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
Could you please suggest how to proceed from here?
Many thanks.
Generating a new certificate and key for kube-apiserver with openssl and replacing both the certificate and the key brought the kube-apiserver container to a stable state and restored access via kubectl.
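The certificate replacement can be sketched roughly as below. This is not the exact procedure that was run: the CA here is a throwaway stand-in created in a scratch directory, and the subject name (kube-apiserver), the SAN values, and the validity periods are placeholder assumptions. On a real master, the existing CA certificate and key would be used instead of the generated ones.

```shell
#!/bin/sh
# Sketch: issue a replacement certificate and key signed by an existing CA.
# Everything is written to a scratch directory; names and IPs are placeholders.
set -eu
work=$(mktemp -d)

# Stand-in CA (assumption: on a real master, use the existing CA cert and key).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=example-etcd-ca" \
  -keyout "$work/ca.key" -out "$work/ca.crt" 2>/dev/null

# Extension file carrying the SAN entries, similar in spirit to an openssl.cnf.
cat > "$work/san.cnf" <<'EOF'
subjectAltName = DNS:localhost, IP:127.0.0.1, IP:10.0.0.10
EOF

# New private key and CSR. The key and certificate must be installed together,
# otherwise TLS fails with "private key does not match public key".
openssl genrsa -out "$work/etcd-client.key" 2048 2>/dev/null
openssl req -new -key "$work/etcd-client.key" \
  -subj "/CN=kube-apiserver" -out "$work/etcd-client.csr"

# Sign the CSR with the CA and attach the SAN extension.
openssl x509 -req -in "$work/etcd-client.csr" \
  -CA "$work/ca.crt" -CAkey "$work/ca.key" -CAcreateserial \
  -days 365 -extfile "$work/san.cnf" \
  -out "$work/etcd-client.crt" 2>/dev/null

# Confirm that the new certificate chains back to the CA.
openssl verify -CAfile "$work/ca.crt" "$work/etcd-client.crt"
```

Backing up the old certificate and key before copying the new pair into place keeps a way back if the component still refuses to start.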
To resolve the etcd-manager certificate issue, I upgraded etcd-manager to kopeio/etcd-manager:3.0.20200531 for both etcd-manager-main and etcd-manager-events, as described at https://github.com/kubernetes/kops/issues/8959#issuecomment-673515269
Thank you
I think this is related to etcd. You may have renewed the certificates for the Kubernetes components, but did you do the same for etcd?
Your API server is trying to connect to etcd and failing with:
tls: private key does not match public key
As you appear to have only one etcd member (judging by the number of master nodes), I would back it up before trying to fix it.
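One way to confirm that error is to compare the public key embedded in the certificate with the one derived from the private key. The sketch below does that on throwaway files in a scratch directory; the function name cert_matches_key and the file names are illustrative, and on the master you would pass the etcd-client.crt and etcd-client.key paths from the log line instead.

```shell
#!/bin/sh
# Sketch: check whether a certificate and a private key belong together
# by comparing their public keys. File names are placeholders.

# Returns 0 on a match, 1 on a mismatch, 2 if either file cannot be parsed.
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Demo data: a matching certificate/key pair plus one unrelated key.
work=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=demo" \
  -keyout "$work/demo.key" -out "$work/demo.crt" 2>/dev/null
openssl genrsa -out "$work/other.key" 2048 2>/dev/null

if cert_matches_key "$work/demo.crt" "$work/demo.key"; then
  echo "demo.crt and demo.key match"
fi
if ! cert_matches_key "$work/demo.crt" "$work/other.key"; then
  echo "demo.crt and other.key do NOT match"
fi
```

A mismatch usually means the certificate was regenerated with a new key while the old key file was left in place, so both files need to be replaced as a pair.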

Docker fails with "failed to start containerd: timeout waiting for containerd to start"

I have docker installed on Ubuntu 18.04.2 with snap.
When I try to start docker it fails with the following error log.
2020-07-16T23:49:14Z docker.dockerd[932]: failed to start containerd: timeout waiting for containerd to start
2020-07-16T23:49:14Z systemd[1]: snap.docker.dockerd.service: Main process exited, code=exited, status=1/FAILURE
2020-07-16T23:49:14Z systemd[1]: snap.docker.dockerd.service: Failed with result 'exit-code'.
2020-07-16T23:49:14Z systemd[1]: snap.docker.dockerd.service: Service hold-off time over, scheduling restart.
2020-07-16T23:49:14Z systemd[1]: snap.docker.dockerd.service: Scheduled restart job, restart counter is at 68.
2020-07-16T23:49:14Z systemd[1]: Stopped Service for snap application docker.dockerd.
2020-07-16T23:49:14Z systemd[1]: Started Service for snap application docker.dockerd.
It goes over and over into a restart loop. What should I do to get docker working again?
In this case, Docker was waiting for containerd to start. The containerd PID file is located at /var/snap/docker/471/run/docker/containerd/containerd.pid.
The PID recorded in it no longer existed, but the file had not been deleted when the server was shut down uncleanly. Deleting this file allows the containerd process to start again, which solves the problem. I believe similar problems occur when the docker.pid file also points to a non-existent PID.
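A cautious way to do that cleanup is to delete the PID file only when the process it names is really gone. The sketch below is illustrative: remove_stale_pidfile is a made-up helper name, and the demo runs against scratch files rather than the real containerd.pid path, which you would pass in yourself (with the Docker service stopped).

```shell
#!/bin/sh
# Sketch: remove a PID file only if the recorded process is no longer running.

remove_stale_pidfile() {
  pidfile=$1
  [ -f "$pidfile" ] || return 0
  pid=$(cat "$pidfile")
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # Run as root, otherwise a live process owned by another user looks dead.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "PID $pid is alive, keeping $pidfile"
    return 1
  fi
  echo "PID '$pid' is not running, removing $pidfile"
  rm -f "$pidfile"
}

# Demo with scratch files: one stale PID file and one pointing at this shell.
work=$(mktemp -d)
sleep 0 &
dead_pid=$!
wait "$dead_pid"                      # the background process has now exited
echo "$dead_pid" > "$work/stale.pid"
echo "$$" > "$work/live.pid"

remove_stale_pidfile "$work/stale.pid"
remove_stale_pidfile "$work/live.pid" || true
```

After the stale file is removed, restarting the Docker service should let containerd come up again.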
I also hit error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout on a fresh Docker install on Arch Linux today.
I installed Docker and tried to start it:
sudo systemctl enable docker
sudo systemctl start docker
It doesn't start; sudo systemctl status docker says:
× docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2022-02-20 20:29:53 +03; 8s ago
TriggeredBy: × docker.socket
Docs: https://docs.docker.com
Process: 8368 ExecStart=/usr/bin/dockerd -H fd:// (code=exited, status=1/FAILURE)
Main PID: 8368 (code=exited, status=1/FAILURE)
CPU: 414ms
Feb 20 20:29:53 V-LINUX-087 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Feb 20 20:29:53 V-LINUX-087 systemd[1]: Stopped Docker Application Container Engine.
Feb 20 20:29:53 V-LINUX-087 systemd[1]: docker.service: Start request repeated too quickly.
Feb 20 20:29:53 V-LINUX-087 systemd[1]: docker.service: Failed with result 'exit-code'.
Feb 20 20:29:53 V-LINUX-087 systemd[1]: Failed to start Docker Application Container Engine.
I managed to get more info after executing sudo dockerd:
$ sudo dockerd
INFO[2022-02-20T20:32:05.923357711+03:00] Starting up
INFO[2022-02-20T20:32:05.924015767+03:00] libcontainerd: started new containerd process pid=8618
INFO[2022-02-20T20:32:05.924036777+03:00] parsed scheme: "unix" module=grpc
INFO[2022-02-20T20:32:05.924043494+03:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2022-02-20T20:32:05.924058420+03:00] ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>} module=grpc
INFO[2022-02-20T20:32:05.924068315+03:00] ClientConn switching balancer to "pick_first" module=grpc
containerd: /usr/lib/libc.so.6: version `GLIBC_2.34' not found (required by containerd)
ERRO[2022-02-20T20:32:05.924198775+03:00] containerd did not exit successfully error="exit status 1" module=libcontainerd
WARN[2022-02-20T20:32:06.925000686+03:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2022-02-20T20:32:09.397384787+03:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2022-02-20T20:32:13.645272915+03:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
WARN[2022-02-20T20:32:19.417671818+03:00] grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc
failed to start containerd: timeout waiting for containerd to start
So it seems like containerd could not start in my case.
I tried sudo containerd and voila:
$ sudo containerd
containerd: /usr/lib/libc.so.6: version `GLIBC_2.34' not found (required by containerd)
On my OS (Arch Linux) the solution was to update the package:
sudo pacman -S lib32-glibc
For others on Arch Linux, it may be just sudo pacman -S glibc.

Resources