kube-apiserver docker is restarting continuously

Sincere apologies for this lengthy posting.
I have a 4-node Kubernetes cluster with 1 master and 3 worker nodes. I connect to the cluster using a kubeconfig file, but since yesterday I have not been able to connect.
kubectl get pods was giving the error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
In the kubeconfig, the server is specified as https://api.xxxxx.xxxxxxxx.com
Note:
Because the post contained too many https links, I was not able to submit the question, so I have changed https:// to https:-- in the background analysis section to avoid creating links.
I tried to run kubectl from the master node and received a similar error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I then checked the kube-apiserver Docker container and found it was continuously exiting (CrashLoopBackOff).
docker logs <container-id of kube-apiserver> shows the errors below:
W0914 16:29:25.761524 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
F0914 16:29:29.319785 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (context deadline exceeded)
systemctl status kubelet --> was giving below errors
Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576 2411 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Note: ip-xxx-xx-xx-xxx --> internal IP address of aws ec2 instance.
Background Analysis:
It looks like there was an issue with the cluster on 7th Sep 2020: both the kube-controller and kube-scheduler containers exited and restarted. I believe that either kube-apiserver has not been running since then, or kube-apiserver caused those containers to restart. The kube-apiserver server certificate expired in July 2020, but access via kubectl was working until 7th Sep.
Below are the docker logs from the exited kube-scheduler docker container:
I0907 10:35:08.970384 1 scheduler.go:572] pod default/k8version-1599474900-hrjcn is bound successfully on node ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:40:09.286831 1 scheduler.go:572] pod default/k8version-1599475200-tshlx is bound successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3 nodes were found feasible
I0907 10:44:01.935373 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.935420 1 server.go:252] lost master lost lease
Below are the docker logs from exited kube-controller docker container:
I0907 10:40:19.703485 1 garbagecollector.go:518] delete object [v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid: 67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy Background
I0907 10:44:01.937398 1 leaderelection.go:263] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
E0907 10:44:01.937506 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0907 10:44:01.937456 1 event.go:209] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1", ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720 stopped leading
F0907 10:44:01.937545 1 controllermanager.go:260] leaderelection lost
I0907 10:44:01.949274 1 range_allocator.go:169] Shutting down range CIDR allocator
I0907 10:44:01.949285 1 replica_set.go:194] Shutting down replicaset controller
I0907 10:44:01.949291 1 gc_controller.go:86] Shutting down GC controller
I0907 10:44:01.949304 1 pvc_protection_controller.go:111] Shutting down PVC protection controller
I0907 10:44:01.949310 1 route_controller.go:125] Shutting down route controller
I0907 10:44:01.949316 1 service_controller.go:197] Shutting down service controller
I0907 10:44:01.949327 1 deployment_controller.go:164] Shutting down deployment controller
I0907 10:44:01.949435 1 garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295] Shutting down resource quota controller
Below are the docker logs from kube-controller since the restart (7th Sep):
E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:51:40.133446 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-controller-manager: Get https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: dial tcp 127.0.0.1:443: connect: connection refused
Below are the docker logs from kube-scheduler since the restart (7th Sep):
E0915 21:52:44.703587 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.704504 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: Get https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.705471 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.706477 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.707581 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.708599 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.709687 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.710744 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.711879 1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list *v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
E0915 21:52:44.712903 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
kube-apiserver certificate Renewal:
I found that the kube-apiserver certificate /etc/kubernetes/pki/kube-apiserver/etcd-client.crt had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and etcd-manager-events (the same copies of the certificates in both places), but I don't see them referenced in the manifest files.
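For reference, a minimal sketch of how the expiry dates of certificates like these could be listed with openssl. The function name report_cert_expiry and the default directory /etc/kubernetes/pki are assumptions for illustration; only the etcd-client.crt path comes from the post, and the etcd-manager certificates may live elsewhere (for example under /mnt).

```shell
#!/usr/bin/env bash
# Sketch: print status, expiry date and path for every *.crt under a directory.
# report_cert_expiry is a hypothetical helper name, not part of any tool.

report_cert_expiry() {
  local dir="${1:-/etc/kubernetes/pki}"   # assumed default location
  local crt status enddate
  # find also walks subdirectories such as kube-apiserver/
  find "$dir" -type f -name '*.crt' | sort | while read -r crt; do
    if ! enddate=$(openssl x509 -in "$crt" -noout -enddate 2>/dev/null); then
      # Not a parseable PEM certificate
      status="UNREADABLE"
      enddate="-"
    elif openssl x509 -in "$crt" -noout -checkend 0 >/dev/null 2>&1; then
      # -checkend 0 exits with 0 when the certificate has not expired yet
      status="valid"
    else
      status="EXPIRED"
    fi
    printf '%s\t%s\t%s\n' "$status" "$enddate" "$crt"
  done
}

# Example (run on the master node):
#   report_cert_expiry /etc/kubernetes/pki
```

Each output line has the form status, notAfter date, file path, so expired files can be picked out with grep EXPIRED.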
I searched and found steps to renew the certificates, but most of them used "kubeadm init phase" commands. I couldn't find kubeadm on the master server, and the certificate names and paths were different from my setup. So I generated a new certificate for kube-apiserver with openssl, using the existing CA cert and an openssl.cnf file that included the DNS names, the internal and external IP addresses (of the EC2 instance), and the loopback IP address. I replaced the old certificate with the new one under the same name, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
After that I restarted the kube-apiserver container (which was continuously exiting) and restarted kubelet. The certificate expiry message no longer appears, but kube-apiserver is still restarting continuously, which I believe is the reason for the errors in the kube-controller and kube-scheduler containers.
NOTE:
I have not restarted the docker on the master server after replacing the certificate.
NOTE: All our production PODs are running on worker nodes so they are not affected but I can't manage them as I can't connect using kubectl.
Now, I am not sure what is the issue and why kube-apiserver is restarting continuously.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from kube-apiserver container (which is still crashing)
F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
Below is the output from systemctl status kubelet
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
This cluster (along with 3 others) was set up using kops. The other clusters are running normally, and it looks like they have some expired certificates as well. The person who set up the clusters is not available for comment and I have limited experience with Kubernetes, hence this request for assistance from the gurus.
Any help is very much appreciated.
Many thanks.
Update after response from Zambozo and Nepomucen:
Thanks to both of you for your responses. Based on them, I found that there were expired etcd certificates on the /mnt mount point.
I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/
and recreated the etcd certificates and keys. I have compared each certificate with a copy of the old one (from my backup folder); everything matches, and the new certificates have an expiry date of Sep 2021.
Now I am getting a different error in the etcd containers (both etcd-manager-events and etcd-manager-main).
Note:xxx-xx-xx-xxx is the IP address of the master server
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
Could you please suggest on how to proceed from here?
Many thanks.

Generating a new cert using openssl for kube-apiserver and replacing the cert and key brought the kube-apiserver docker to stable state and provided access via kubectl.
To resolve etcd-manager certs issue, upgraded etcd-manager to kopeio/etcd-manager:3.0.20200531 for both etcd-manager-main and etcd-manager-events as described at https://github.com/kubernetes/kops/issues/8959#issuecomment-673515269
Thank you

I think this is related to etcd. You may have renewed the certs for the Kubernetes components, but did you do the same for etcd?
Your API server is trying to connect to etcd and reports:
tls: private key does not match public key
As you have only 1 etcd member (assuming from the number of master nodes), I would back it up before trying to fix it.
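The "private key does not match public key" error can be checked directly before restarting anything. Below is a minimal sketch, assuming openssl is available on the master; cert_matches_key is a hypothetical helper name. It extracts the public key from the certificate and from the private key and compares the two.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the certificate and the private key share a public key.
# cert_matches_key is a hypothetical helper name, not part of openssl.

cert_matches_key() {
  local cert="$1" key="$2"
  local cert_pub key_pub
  # Public key embedded in the certificate (PEM)
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 2
  # Public key derived from the private key (PEM)
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 2
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Example with the paths from the kube-apiserver log in the question:
#   cert_matches_key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
#                    /etc/kubernetes/pki/kube-apiserver/etcd-client.key \
#     && echo "match" || echo "MISMATCH or unreadable"
```

A mismatch here would be consistent with a new certificate having been issued for a different key than the etcd-client.key that is still on disk.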

Related

Hyperledger Fabric error at adding new org to channel

I'm running Fabric test network v2.2. Everything was successfully set up. I'm now trying to add an extra organization to the network.
I've generated the crypto materials and the configuration update tx, and I signed the transaction. Essentially, everything executed correctly, and I obtained the success message regarding the addition of a peer.
EDIT:
Although it seems that the peer0 from a new org (org5) was correctly added, the org5 logs show that:
2021-05-31 13:13:50.794 UTC [peer.blocksprovider] DeliverBlocks -> WARN 7b0 Could not connect to ordering service: could not dial endpoint 'orderer.example.com:7050': failed to create new connection: connection error: desc = "transport: error while dialing: dial tcp: lookup orderer.example.com on 127.0.0.11:53: no such host" channel=mychannel
peer0.org1.example.com shows similarly that:
2021-05-31 13:13:06.802 UTC [gossip.gossip] func1 -> WARN 409 Deep probe of org5.example.com:11071 failed: context deadline exceeded
2021-05-31 13:13:06.802 UTC [gossip.discovery] func1 -> WARN 40a Could not connect to Endpoint: org5.example.com:11071, InternalEndpoint: org5.example.com:11071, PKI-ID: <nil>, Metadata: : context deadline exceeded
Any ideas on how to solve this?
Logs:
Orderer logs: https://gist.github.com/RafaelAPB/bada1278a096e252060e3d117b3c5719
peer0.org1.example.com logs: https://gist.github.com/RafaelAPB/d5b6af66a62a18d9572399274a0a6aa5
org5 logs: https://gist.github.com/RafaelAPB/cddba91566e66ca45f5494dff43196a0
peer 5 docker-compose:
https://gist.github.com/RafaelAPB/b82a64d4122e103f06dd7e4b9bc9023c
I guess that your original Docker network is cactusfabrictestnetwork_test and that your original Fabric components have joined this network, so your new peer (peer0.org5.example.com) should join the same network (cactusfabrictestnetwork_test). Then all these components can find each other.
Your peer0.org5.example.com docker-compose file should then look like this: https://paste.ubuntu.com/p/G7MyWSCN6g/. Before you start your peer0.org5 container, set COMPOSE_PROJECT_NAME=cactusfabrictestnetwork so that peer0.org5 joins the network cactusfabrictestnetwork_test. For reference, see https://docs.docker.com/compose/networking/
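To confirm the guess about the network name rather than rely on it, something like the sketch below could be used. It is a different technique from the compose-based fix above: it looks up the network of an already running container and attaches the new peer to it with docker network connect. The helper names are hypothetical, and the container names are taken from the logs in the question on the assumption that they match the actual container names.

```shell
#!/usr/bin/env bash
# Sketch: find the Docker network(s) of a running container and attach
# another container to the first one. Helper names are hypothetical.

container_networks() {
  # Prints one network name per line for the given container.
  docker inspect -f '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' "$1" \
    | sed '/^$/d'
}

join_network_of() {
  local existing="$1" newcomer="$2" net
  net=$(container_networks "$existing" | head -n 1)
  [ -n "$net" ] || { echo "no network found for $existing" >&2; return 1; }
  # docker network connect NETWORK CONTAINER
  docker network connect "$net" "$newcomer"
}

# Example (container names assumed from the logs above):
#   container_networks orderer.example.com
#   join_network_of orderer.example.com peer0.org5.example.com
```

Setting the network in the compose file, as suggested above, is the more durable option, since a manual connect is lost when the container is recreated.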

Unable to join Docker swarm because control.sock is missing?

I have an existing Docker swarm consisting of three machines. I am trying to add a new manager to this swarm. I run the command
docker swarm join --token SWMTKN-1-<...> 192.168.200.200:2377
After a while I get the error
Error response from daemon: manager stopped: can't initialize raft node: rpc error: code = Unknown desc = could not connect to prospective new cluster member using its advertised address: rpc error: code = DeadlineExceeded desc = context deadline exceeded
When I view the daemon logs using tail -f /var/log/messages | grep docker, I see this:
Mar 17 17:07:48 UAT-Blockchain dockerd: time="2021-03-17T17:07:48.575024542+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {/var/run/docker/swarm/control.sock <nil> 0 <nil>}. Err :connection error: desc= \"transport: Error while dialing dial unix /var/run/docker/swarm/control.sock: connect: no such file or directory\". Reconnecting..." module=grpc
A quick check shows that /var/run/docker/swarm/control.sock is indeed missing on this machine, but is present on the machines in the existing swarm.
What is this control.sock? How should I go about enabling/reinstating it on this current machine? Is this a problem of faulty installation?

Kubernetes cluster does not run after reboot

If I use the kubectl command after a reboot, I receive an error:
x.x.x.x:6443 was refused - did you specify the right host or port?
If I check my containers with docker ps, kube-apiserver and kube-scheduler keep starting and stopping.
Why is this happening?
root@taeil-linux:/etc/systemd/system/kubelet.service.d# cd
root@taeil-linux:~# kubectl get nodes
The connection to the server 10.0.0.152:6443 was refused - did you specify the right host or port?
root@taeil-linux:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
root@taeil-linux:~# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-proxy v1.15.3 232b5c793146 2 weeks ago 82.4MB
k8s.gcr.io/kube-apiserver v1.15.3 5eb2d3fc7a44 2 weeks ago 207MB
k8s.gcr.io/kube-scheduler v1.15.3 703f9c69a5d5 2 weeks ago 81.1MB
k8s.gcr.io/kube-controller-manager v1.15.3 e77c31de5547 2 weeks ago 159MB
node carbon c83f74dcf58e 3 weeks ago 895MB
kubernetesui/dashboard v2.0.0-beta1 4640949a39e6 2 months ago 64.6MB
weaveworks/weave-kube 2.5.2 f04a043bb67a 3 months ago 148MB
weaveworks/weave-npc 2.5.2 5ce48e0d813c 3 months ago 49.6MB
kubernetesui/metrics-scraper v1.0.0 44390ebe2b73 4 months ago 36.8MB
k8s.gcr.io/coredns 1.3.1 eb516548c180 7 months ago 40.3MB
k8s.gcr.io/etcd 3.3.10 2c4adeb21b4f 9 months ago 258MB
quay.io/coreos/flannel v0.10.0-amd64 f0fad859c909 19 months ago 44.6MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 20 months ago 742kB
root@taeil-linux:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2019-09-06 14:29:25 KST; 4min 19s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 14470 (kubelet)
Tasks: 19 (limit: 4512)
CGroup: /system.slice/kubelet.service
└─14470 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --resolv-con
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.800330 14470 pod_workers.go:190] Error syncing pod 9a745ac0a776afabd0d387fd0fcb2f54 ("kube-apiserver-taeil-linux_kube-system(9a745ac0a776afabd0d387fd0fcb2f54)"), skipping: failed to "CreatePodSandbox" for "kube-apiserver-ta
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.897945 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.916566 14470 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.0.152:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dtaeil-linux&limit=500&resourceVersion=0: dia
Sep 06 14:33:44 taeil-linux kubelet[14470]: E0906 14:33:44.998190 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.098439 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.198732 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.299052 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.399343 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.499561 14470 kubelet.go:2248] node "taeil-linux" not found
Sep 06 14:33:45 taeil-linux kubelet[14470]: E0906 14:33:45.599723 14470 kubelet.go:2248] node "taeil-linux" not found
root@taeil-linux:~# systemctl status kube-apiserver
Unit kube-apiserver.service could not be found.
If I run docker logs on the kube-apiserver container, I get:
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0906 10:54:19.636649 1 server.go:560] external host was not specified, using 10.0.0.152
I0906 10:54:19.636954 1 server.go:147] Version: v1.15.3
I0906 10:54:21.753962 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook.
I0906 10:54:21.753988 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
E0906 10:54:21.754660 1 prometheus.go:55] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754701 1 prometheus.go:68] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754787 1 prometheus.go:82] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754842 1 prometheus.go:96] failed to register workDuration metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754883 1 prometheus.go:112] failed to register unfinished metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754918 1 prometheus.go:126] failed to register unfinished metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754952 1 prometheus.go:152] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.754986 1 prometheus.go:164] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755047 1 prometheus.go:176] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755104 1 prometheus.go:188] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755152 1 prometheus.go:203] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0906 10:54:21.755188 1 prometheus.go:216] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0906 10:54:21.755215 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook.
I0906 10:54:21.755226 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0906 10:54:21.757263 1 client.go:354] parsed scheme: ""
I0906 10:54:21.757280 1 client.go:354] scheme "" not registered, fallback to default scheme
I0906 10:54:21.757335 1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0906 10:54:21.757402 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0906 10:54:21.757666 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I0906 10:54:22.753069 1 client.go:354] parsed scheme: ""
I0906 10:54:22.753118 1 client.go:354] scheme "" not registered, fallback to default scheme
I0906 10:54:22.753204 1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{127.0.0.1:2379 0 <nil>}]
I0906 10:54:22.753354 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{127.0.0.1:2379 <nil>}]
W0906 10:54:22.753855 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:22.757983 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:23.754019 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:24.430000 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:25.279869 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:26.931974 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:28.198719 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:30.825660 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:32.850511 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:36.294749 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0906 10:54:38.737408 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
F0906 10:54:41.757603 1 storage_decorator.go:57] Unable to create storage backend: config (&{ /registry {[https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt} true 0xc00063dd40 apiextensions.k8s.io/v1beta1 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
The answer is in the comment by @cewood:
Okay, that helps to understand what your installation is likely to look
like. Regarding the other master components, these are likely running
via the kubelet, and hence there won't be any systemd units for them,
only for the kubelet itself.
With a kubeadm install you don't see these components as services. As root:
systemctl start docker
systemctl start kubelet
Then switch to a non-root user:
su - nonrootuser
kubectl get pods
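To see which control-plane components the kubelet is supposed to run, it can help to list the static pod manifests. The following is only a sketch: it assumes kubeadm's default manifest directory /etc/kubernetes/manifests (pass another path if your kubelet uses a different one), and the function name list_static_pod_manifests is made up for illustration.

```shell
# List the static pods the kubelet is expected to run, one name per line.
# Assumption: kubeadm's default manifest directory; override with $1.
list_static_pod_manifests() {
  local dir="${1:-/etc/kubernetes/manifests}"
  local f
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue   # directory missing or empty: print nothing
    basename "$f" .yaml
  done
}

# Example usage on the master node:
#   systemctl is-active kubelet          # the only systemd unit involved
#   list_static_pod_manifests            # e.g. etcd, kube-apiserver, ...
#   docker ps -a | grep kube-apiserver   # the container the kubelet started
```

If kube-apiserver is listed here but its container keeps exiting, the container logs (docker logs) are usually the next place to look.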
Long time no see.
I totally realized how to solve this problem!
If you get an error like this for no reason, you can fix it by:
docker rm $(docker ps -a -q)
Perhaps the existing Kubernetes containers were left in a bad state by the reboot, which caused the newly started containers to crash.
watch docker ps
If you check the container with watch, you can see that kube-apiserver and others are turned off within 1 minute.
So I decided to delete all containers appearing in docker ps -a and it's fixed!
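A narrower variant of the same idea, as a sketch rather than a tested fix: docker rm without -f refuses to remove running containers, so the command above effectively only clears stopped ones. The function below makes that explicit by asking Docker for exited containers only. The name remove_exited_containers is hypothetical.

```shell
# Remove only containers that have already exited; running ones are untouched.
remove_exited_containers() {
  local ids
  ids=$(docker ps -a -q --filter status=exited)
  [ -n "$ids" ] || { echo "no exited containers"; return 0; }
  # Word splitting on $ids is intentional: one container ID per argument.
  docker rm $ids
}

# Example usage:
#   remove_exited_containers
#   watch docker ps    # check whether kube-apiserver now stays up
```

On a kubelet-managed node the kubelet should recreate the control-plane containers from their manifests after the old ones are removed.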

Heapster fails to get container stats from Kubelet on Kubernetes cluster

I've set up a Kubernetes cluster on Ubuntu (trusty) based on the Running Kubernetes Locally via Docker guide, deployed a DNS and run Heapster with an InfluxDB backend and a Grafana UI.
Everything seems to run smoothly except for Grafana, which shows no graphs, only the message "No datapoints" in its diagrams.
After checking the Docker container logs I found out that Heapster is unable to access the kubelet API (?) and therefore no metrics are persisted into InfluxDB:
user@host:~$ docker logs e490a3ac10a8
I0701 07:07:30.829745 1 heapster.go:65] /heapster --source=kubernetes:https://kubernetes.default --sink=influxdb:http://monitoring-influxdb:8086
I0701 07:07:30.830082 1 heapster.go:66] Heapster version 1.2.0-beta.0
I0701 07:07:30.830809 1 configs.go:60] Using Kubernetes client with master "https://kubernetes.default" and version v1
I0701 07:07:30.831284 1 configs.go:61] Using kubelet port 10255
E0701 07:09:38.196674 1 influxdb.go:209] issues while creating an InfluxDB sink: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out, will retry on use
I0701 07:09:38.196919 1 influxdb.go:223] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I0701 07:09:38.197048 1 heapster.go:92] Starting with InfluxDB Sink
I0701 07:09:38.197154 1 heapster.go:92] Starting with Metric Sink
I0701 07:09:38.228046 1 heapster.go:171] Starting heapster on port 8082
I0701 07:10:05.000370 1 manager.go:79] Scraping metrics start: 2016-07-01 07:09:00 +0000 UTC, end: 2016-07-01 07:10:00 +0000 UTC
E0701 07:10:05.008785 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:10:05.009119 1 manager.go:152] ScrapeMetrics: time: 8.013178ms size: 0
I0701 07:11:05.001185 1 manager.go:79] Scraping metrics start: 2016-07-01 07:10:00 +0000 UTC, end: 2016-07-01 07:11:00 +0000 UTC
E0701 07:11:05.007130 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:11:05.007686 1 manager.go:152] ScrapeMetrics: time: 5.945236ms size: 0
W0701 07:11:25.010298 1 manager.go:119] Failed to push data to sink: InfluxDB Sink
I0701 07:12:05.000420 1 manager.go:79] Scraping metrics start: 2016-07-01 07:11:00 +0000 UTC, end: 2016-07-01 07:12:00 +0000 UTC
E0701 07:12:05.002413 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:12:05.002467 1 manager.go:152] ScrapeMetrics: time: 1.93825ms size: 0
E0701 07:12:12.309151 1 influxdb.go:150] Failed to create infuxdb: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out
I0701 07:12:12.351348 1 influxdb.go:201] Created database "k8s" on influxDB server at "monitoring-influxdb:8086"
I0701 07:13:05.001052 1 manager.go:79] Scraping metrics start: 2016-07-01 07:12:00 +0000 UTC, end: 2016-07-01 07:13:00 +0000 UTC
E0701 07:13:05.015947 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
...
I found a few issues on GitHub describing similar problems that made me understand that Heapster doesn't access the kubelet (via the node's loopback) but itself (via the container's loopback) instead. However, I fail to reproduce their solutions:
github.com/kubernetes/heapster/issues/1183
You should either use host networking for Heapster pod or configure your cluster in a way that the node has a regular name not 127.0.0.1. The current problem is that node name is resolved to Heapster localhost. Please reopen in case of more problems.
-@piosz
How do I enable "host networking" for my Heapster pod?
How do I configure my cluster/node to use a regular name not 127.0.0.1?
github.com/kubernetes/heapster/issues/744
Fixed by using better options in hyperkube, thanks for the help!
-@ddispaltro
Is there a way to solve this issue by adding/modifying kubelet's option flags in docker run? I tried setting --hostname-override=<host's eth0 IP> and --address=127.0.0.1 (as suggested in the last answer of this GitHub issue), but Heapster's container log then states:
I0701 08:23:05.000566 1 manager.go:79] Scraping metrics start: 2016-07-01 08:22:00 +0000 UTC, end: 2016-07-01 08:23:00 +0000 UTC
E0701 08:23:05.000962 1 kubelet.go:279] Node 127.0.0.1 is not ready
E0701 08:23:05.003018 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://<host's eth0 IP>:10255/stats/container/": Post http://<host's eth0 IP>/stats/container/: dial tcp <host's eth0 IP>:10255: getsockopt: connection refused
Namespace issue
Could this problem be caused by the fact that I'm running Kubernetes API in default namespace and Heapster in kube-system?
user@host:~$ kubectl get --all-namespaces pods
NAMESPACE     NAME                     READY   STATUS    RESTARTS   AGE
default       k8s-etcd-127.0.0.1       1/1     Running   0          18h
default       k8s-master-127.0.0.1     4/4     Running   1          18h
default       k8s-proxy-127.0.0.1      1/1     Running   0          18h
kube-system   heapster-lizks           1/1     Running   0          18h
kube-system   influxdb-grafana-e0pk2   2/2     Running   0          18h
kube-system   kube-dns-v10-4vjhm       4/4     Running   0          18h
OS: Ubuntu 14.04.4 LTS (trusty)
Kubernetes: v1.2.5
Docker: v1.11.2
Heapster has got the list of nodes from Kubernetes and is now trying to pull stats from the kubelet process on each node (which has a built-in cAdvisor collecting stats on the node). In this case there's only one node, and it's known to Kubernetes as 127.0.0.1. And there's the problem: the Heapster container is trying to reach the node at 127.0.0.1, which is itself, and of course finds no kubelet process to interrogate within the Heapster container.
Two things need to happen to resolve this issue.
We need to reference the kubelet worker node (our host machine running Kubernetes) by something other than the loopback address 127.0.0.1
The kubelet process needs to accept traffic from the new network interface/address
Assuming you are using the local installation guide and starting kubernetes off with
hack/local-up-cluster.sh
Changing the hostname by which the kubelet is referenced is pretty simple. You can take more elaborate approaches, but setting this to your eth0 IP worked fine for me (ifconfig eth0). The downside is that you need an eth0 interface, and the address is subject to DHCP, so your mileage may vary as to how convenient this is.
export HOSTNAME_OVERRIDE=10.0.2.15
Getting the kubelet process to accept traffic from any network interface is just as simple.
export KUBELET_HOST=0.0.0.0
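Rather than hard-coding the address, the eth0 IP can be looked up when the cluster is started. This is a rough sketch, assuming the iproute2 ip tool is available and that eth0 is the right interface on your host; the helper name first_ipv4 is made up for illustration.

```shell
# Read `ip -4 -o addr show <iface>` output on stdin and print the first
# IPv4 address without its prefix length (10.0.2.15/24 -> 10.0.2.15).
first_ipv4() {
  awk '{ split($4, a, "/"); print a[1]; exit }'
}

# Example usage before starting the local cluster:
#   export HOSTNAME_OVERRIDE="$(ip -4 -o addr show eth0 | first_ipv4)"
#   export KUBELET_HOST=0.0.0.0
#   hack/local-up-cluster.sh
```

If the interface has no IPv4 address the helper prints nothing, so it is worth checking that HOSTNAME_OVERRIDE is non-empty before starting the cluster.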
Provide the below argument to your heapster configuration to resolve the issue.
--source=kubernetes:https://kubernetes.default:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true

Error from server: dial tcp i/o timeout error when getting logs of pod

I'm working on OpenShift Origin 1.1 (which uses Kubernetes as its orchestration tool for Docker containers). I'm creating pods, but I'm unable to see the build logs.
[user@ip master]# oc get pods
NAME READY STATUS RESTARTS AGE
test-1-build 0/1 Completed 0 14m
test-1-iok8n 1/1 Running 0 12m
[user@ip master]# oc logs test-1-iok8n
Error from server: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My /var/log/messages shows:
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579794 32518 apiserver.go:440] apiserver was unable to write a JSON response: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579822 32518 errors.go:62] apiserver received an error that is not an unversioned.Status: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My versions are:
origin v1.1.0.1-1-g2c6ff4b
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
I forgot to open port 10250 (TCP) in my AWS security group.
This was the only issue for me.
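The wording of the dial error already hints at this kind of cause. "connection refused" means the host answered but nothing was listening on the port, while "i/o timeout" means the packets got no answer at all, which is typical of a firewall or security group dropping them. A small sketch that sorts a log line into these cases (the function name classify_dial_error is hypothetical):

```shell
# Classify a "dial tcp" error message by its likely cause.
classify_dial_error() {
  case "$1" in
    *"connection refused"*)
      # Host reachable, but no process is listening on that port.
      echo "refused" ;;
    *"i/o timeout"*|*"connection timed out"*)
      # No reply at all: often a firewall or security group rule.
      echo "timeout" ;;
    *)
      echo "other" ;;
  esac
}

# Example usage:
#   classify_dial_error "dial tcp 10.0.x.x:10250: i/o timeout"
```

For a timeout against port 10250, checking the network rules between the master and the node is a reasonable first step before looking at the kubelet itself.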

Resources