Nodes cannot pull image from docker registry - docker

I followed hobby-kube/guide how to set up your own cluster, but I stuck. I created an issue on repo as well but maybe here would me more people to help me with it.
I am trying to set up my cluster on Scaleway. I follow instructions one by one, and I am at the point where I installed wave as CNI, and I've got:
kube-system weave-net-dtwbj 2/2 Running 1 9d
kube-system weave-net-kmxq7 0/2 Init:ImagePullBackOff 0 9d
kube-system weave-net-pzfcj 0/2 Init:ImagePullBackOff 0 9d
So my issue is on my nodes but not on master.
I found suggestions in one of issues and this time I applied these suggestions, but the output is the same.
UFW / Firewall
I skip the part with firewall, on every VPS I've got:
> ufw status
Status: inactive
In scaleway config all my VPS have the same security policy applied. Only outbound traffic on ports [25, 465, 587] is dropping.
Internet connection
On both my nodes I've issue to download images from docker's registry and I believe that this is the real issue here
> docker pull hello-world
Using default tag: latest
Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
On master hello-world was pulled successfully.
Nodes have internet connection outside:
--- google.com ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8011ms
rtt min/avg/max/mdev = 1.008/1.139/1.258/0.073 ms
WireGuard
By the output of wg show I assume that VPN between my VPS is set up correctly
peer: 3
endpoint: 3-priv IP:51820
allowed ips: 10.0.1.3/32
latest handshake: 1 minute, 17 seconds ago
transfer: 7.50 GiB received, 6.50 GiB sent
peer: 2
endpoint: 2-priv IP:51820
allowed ips: 10.0.1.2/32
latest handshake: 1 minute, 41 seconds ago
transfer: 4.96 GiB received, 6.11 GiB sent
Could anybody help me track the issue down and help me to fix it? I can provide any kinds of logs you wish just tell me how I can get it

Related

kube-apiserver docker is restarting continuously

Sincere apologies for this lengthy posting.
I have a 4 node Kubernetes cluster with 1 x master and 3 x worker nodes. I connect to the kubernetes cluster using kubeconfig, since yesterday I was not able to connect using kubeconfig.
kubectl get pods was giving an error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
In the kubeconfig server name is specified as https://api.xxxxx.xxxxxxxx.com
Note:
Please note as there were too many https links, I was not able to post the question. So I have renamed https:// to https:-- to avoid the links in the background analysis section.
I tried to run kubectl from the master node and received similar error
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Then checked kube-apiserver docker and it was continuously exiting / Crashloopbackoff.
docker logs <container-id of kube-apiserver> shows below errors
W0914 16:29:25.761524 1 clientconn.go:1251] grpc:
addrConn.createTransport failed to connect to {127.0.0.1:4001 0
}. Err :connection error: desc = "transport: authentication
handshake failed: x509: certificate has expired or is not yet valid".
Reconnecting... F0914 16:29:29.319785 1 storage_decorator.go:57]
Unable to create storage backend: config (&{etcd3 /registry
{[https://127.0.0.1:4001]
/etc/kubernetes/pki/kube-apiserver/etcd-client.key
/etc/kubernetes/pki/kube-apiserver/etcd-client.crt
/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true
0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err
(context deadline exceeded)
systemctl status kubelet --> was giving below errors
Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576
2411 kubelet_node_status.go:385] Error updating node status, will
retry: error getting node
"ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get
https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused
Note: ip-xxx-xx-xx-xxx --> internal IP address of aws ec2 instance.
Background Analysis:
Looks there was some issue with the cluster on 7th Sep 2020 and both kube-controller and kube-scheduler dockers exited and restarted. I believe since then kube-apiserver is not running or because of kube-apiserver, those dockers restarted. The kube-apiserver server certificate expired in July 2020 but access via kubectl was working until 7th Sep.
Below are the docker logs from the exited kube-scheduler docker container:
I0907 10:35:08.970384 1 scheduler.go:572] pod
default/k8version-1599474900-hrjcn is bound successfully on node
ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3
nodes were found feasible I0907 10:40:09.286831 1
scheduler.go:572] pod default/k8version-1599475200-tshlx is bound
successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4
nodes evaluated, 3 nodes were found feasible I0907 10:44:01.935373
1 leaderelection.go:263] failed to renew lease
kube-system/kube-scheduler: failed to tryAcquireOrRenew context
deadline exceeded E0907 10:44:01.935420 1 server.go:252] lost
master lost lease
Below are the docker logs from exited kube-controller docker container:
I0907 10:40:19.703485 1 garbagecollector.go:518] delete object
[v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid:
67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy
Background I0907 10:44:01.937398 1 leaderelection.go:263] failed
to renew lease kube-system/kube-controller-manager: failed to
tryAcquireOrRenew context deadline exceeded E0907 10:44:01.937506
1 leaderelection.go:306] error retrieving resource lock
kube-system/kube-controller-manager: Get https:
--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
net/http: request canceled (Client.Timeout exceeded while awaiting
headers) I0907 10:44:01.937456 1 event.go:209]
Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system",
Name:"kube-controller-manager",
UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1",
ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason:
'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720
stopped leading F0907 10:44:01.937545 1
controllermanager.go:260] leaderelection lost I0907 10:44:01.949274
1 range_allocator.go:169] Shutting down range CIDR allocator I0907
10:44:01.949285 1 replica_set.go:194] Shutting down replicaset
controller I0907 10:44:01.949291 1 gc_controller.go:86] Shutting
down GC controller I0907 10:44:01.949304 1
pvc_protection_controller.go:111] Shutting down PVC protection
controller I0907 10:44:01.949310 1 route_controller.go:125]
Shutting down route controller I0907 10:44:01.949316 1
service_controller.go:197] Shutting down service controller I0907
10:44:01.949327 1 deployment_controller.go:164] Shutting down
deployment controller I0907 10:44:01.949435 1
garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295]
Shutting down resource quota controller
Below are the docker logs from kube-controller since the restart (7th Sep):
E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving
resource lock kube-system/kube-controller-manager: Get
https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:51:40.133446 1 leaderelection.go:306] error retrieving
resource lock kube-system/kube-controller-manager: Get
https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused
Below are the docker logs from kube-scheduler since the restart (7th Sep):
E0915 21:52:44.703587 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node:
Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial
tcp 127.0.0.1:443: connect: connection refused E0915 21:52:44.704504
1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed
to list *v1.ReplicationController: Get
https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.705471 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service:
Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.706477 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.707581 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.708599 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.709687 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.710744 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.711879 1 reflector.go:126]
k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list
*v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.712903 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused
kube-apiserver certificate Renewal:
I found the kube-apiserver certificate which is this one /etc/kubernetes/pki/kube-apiserver/etcd-client.crt had expired in July 2020. There were few other expired certificates related to etcd-manager-main and events (it is same copy of the certificates on both places) but I don't see this referenced in the manifest files.
I searched and found steps to renew the certificates but most of them were using "kubeadm init phase" commands but I couldn't find kubeadm on master server and the certificates names and paths were different to my setup. So I generated a new certificate using openssl for kube-apiserver using existing ca cert and included DNS names with internal and external IP address (ec2 instance) and loopback ip address using openssl.cnf file. I replaced the new certificate with the same name /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
After that I restarted the kube-apiserver docker (which was continuously exiting) and restarted kubelet. Now the certificate expiry message is not coming but the kube-apiserver is continuously restarting which I believe is the reason for the errors on kube-controller and kube-scheduler docker containers.
NOTE:
I have not restarted the docker on the master server after replacing the certificate.
NOTE: All our production PODs are running on worker nodes so they are not affected but I can't manage them as I can't connect using kubectl.
Now, I am not sure what is the issue and why kube-apiserver is restarting continuously.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from kube-apiserver container (which is still crashing)
F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
Below is the output from systemctl status kubelet
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
This cluster (along with 3 others) was setup using kops. The other clusters are running normally and looks like they have some expired certificates as well. The person who setup the clusters is not available for comment and I have limited experience on Kubernetes. Hence required assistance from the gurus.
Any help is very much appreciated.
Many thanks.
Update after response from Zambozo and Nepomucen:
Thanks to both of you for your response. Based that I found that there were expired etcd certificates on the /mnt mount point.
I followed workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/
and recreated etcd certificates and keys. I have verified each of the certificate with a copy of the old one (from my backup folder) and everything is matching and the new certificates has expiry date set to Sep 2021.
Now I am getting different error on etcd dockers (both etcd-manager-events and etcd-manager-main)
Note:xxx-xx-xx-xxx is the IP address of the master server
root#ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts
root#ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
Could you please suggest on how to proceed from here?
Many thanks.
Generating a new cert using openssl for kube-apiserver and replacing the cert and key brought the kube-apiserver docker to stable state and provided access via kubectl.
To resolve etcd-manager certs issue, upgraded etcd-manager to kopeio/etcd-manager:3.0.20200531 for both etcd-manager-main and etcd-manager-events as described at https://github.com/kubernetes/kops/issues/8959#issuecomment-673515269
Thank you
I think this is related to the ETCD. you may have renewed the certs for Kubernetes components but did you do the same for ETCD?
Your API server is trying to connect to the ETCD and giving:
tls: private key does not match public key)
As you have only 1 etcd(assuming on the number of master nodes) I would do a backup of it before trying to fix it.

Kubernetes v1.2.2 api-server dosen't start

I attempt to deploy Pachyderm (a docker bigdata platform) on kubernetes. Limited by Pachyderm, I have to install kubernetes v1.2.2, an old version. I follow the guide here http://kubernetes.io/docs/getting-started-guides/docker/ to deploy Kubernetes on local server via docker. The guide can work with the kubernetes >=1.3.0, but when I use it to deploy kubernetes 1.2.2, I met some problems.
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ec38ae951f09 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube apiserver" 8 seconds ago Exited (255) 7 seconds ago k8s_apiserver.78ec1de_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_d26fc24e
55c1b13bb610 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/setup-files.sh IP:1" 8 seconds ago Up 8 seconds k8s_setup.e5aa3216_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_1cb4c220
b9f0e5b3a7a9 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube scheduler" 9 seconds ago Up 8 seconds k8s_scheduler.fc12fcbe_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_e5065506
9cd613d272bc gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube apiserver" 9 seconds ago Exited (255) 8 seconds ago k8s_apiserver.78ec1de_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_c04426af
49fe2c409386 gcr.io/google_containers/etcd:2.2.1 "/usr/local/bin/etcd " 10 seconds ago Up 9 seconds k8s_etcd.7e452b0b_k8s-etcd-127.0.0.1_default_1df6a8b4d6e129d5ed8840e370203c11_a6f11fdb
5b208be18c71 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube controlle" 10 seconds ago Up 9 seconds k8s_controller-manager.70414b65_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_c377c5e9
df194f3cf663 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube proxy --m" 10 seconds ago Up 9 seconds k8s_kube-proxy.9a9f4853_k8s-proxy-127.0.0.1_default_5e5303a9d49035e9fad52bfc4c88edc8_63ec0b04
58b53ec28fbe gcr.io/google_containers/pause:2.0 "/pause" 10 seconds ago Up 9 seconds k8s_POD.6059dfa2_k8s-etcd-127.0.0.1_default_1df6a8b4d6e129d5ed8840e370203c11_21034b2e
df48fe4cdf0a gcr.io/google_containers/pause:2.0 "/pause" 10 seconds ago Up 9 seconds k8s_POD.6059dfa2_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_4867dbbc
fe6b74c2a881 gcr.io/google_containers/pause:2.0 "/pause" 10 seconds ago Up 9 seconds k8s_POD.6059dfa2_k8s-proxy-127.0.0.1_default_5e5303a9d49035e9fad52bfc4c88edc8_fad2c558
4c00ad498916 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/hyperkube kubelet -" 25 seconds ago Up 24 seconds kubelet
From the docker container table, it can be observed that my apiserver is down when deploying kubernetes1.2.2. The restart interval of kubernetes apiserver obeys expontional backoff algorithm. But never work.
Then,
sv: batch/v1
mv: extensions/__internal
I0727 06:06:27.593708 1 genericapiserver.go:82] Adding storage destination for group batch
W0727 06:06:27.593745 1 server.go:383] No RSA key provided, service account token authentication disabled
F0727 06:06:27.593767 1 server.go:410] Invalid Authentication Config: open /srv/kubernetes/basic_auth.csv: no such file or directory
Please see docker logs of kubernetes apiserver here. Note that some authentication error occurred seems that the Kubernetes does not have required key to be permitted.Also see the controller manager log here. The controller manager wait for the apiserver, however the apiserver hasn't ran ever. The controller manager is also dump.
E0727 06:07:10.604801 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:11.604832 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:12.604752 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:13.604803 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:14.604332 1 nodecontroller.go:229] Error monitoring node status: Get http://127.0.0.1:8080/api/v1/nodes: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:14.604619 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
E0727 06:07:14.604861 1 controllermanager.go:259] Failed to get api versions from server: Get http://127.0.0.1:8080/api: dial tcp 127.0.0.1:8080: connection refused
F0727 06:07:14.604957 1 controllermanager.go:263] Failed to get api versions from server: timed out waiting for the condition
So for my question, how to solve this problem? The problem has troubled me for a long time.
====================================================================
Update:
With the help of Goblin and Lukie, I find the key problem is the Setup Pods is not triggered.
See the manifest of Kubernetes,
{
"name": "controller-manager",
"/hyperkube",
"controller-manager",
"--master=127.0.0.1:8080",
"--service-account-private-key-file=/srv/kubernetes/server.key",
"--root-ca-file=/srv/kubernetes/ca.crt",
"--min-resync-period=3m",
"--v=2"
],
"volumeMounts": [
{
"name": "data",
"mountPath": "/srv/kubernetes"
}
]
}
Option --service-account-private-key-file=/srv/kubernetes/server.key has been added in the manifest file, but it doesn't work. In other words, the controller-manager cannot find this file in the file system. This assumption is supported by following command.
docker exec a82d7f6e4d7d ls -l /srv/kubernetes
ls: cannot access /srv/kubernetes: No such file or directory
Next, we check whether the Setup Pod put the file in the docker volumn. Unfortunately, we find that the Setup Pod is not triggered and worked, therefore no cert file is written in the file system.
docker ps -a | grep setup
54afdd81349e gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/setup-files.sh IP:1" About a minute ago Up About a minute k8s_setup.e5aa3216_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_a2edddca
6f714e034098 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/setup-files.sh IP:1" 4 minutes ago Exited (7) 2 minutes ago k8s_setup.e5aa3216_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_0d7dab5b
8358f6644d94 gcr.io/google_containers/hyperkube-amd64:v1.2.2 "/setup-files.sh IP:1" 6 minutes ago Exited (7) 4 minutes ago k8s_setup.e5aa3216_k8s-master-127.0.0.1_default_4c6ab43ac4ee970e1f563d76ab3d3ec9_41e4c686
Is there any method to do further debug? Or is it a bug in Kubernetes version 1.2?
F0727 06:06:27.593767 1 server.go:410] Invalid Authentication Config: open /srv/kubernetes/basic_auth.csv: no such file or directory
You are missing the basic auth file /srv/kubernetes/basic_auth.csv either createa basic auth file or remove the configuration flag.
Kubernetes authentication
in fact it is W0727 06:06:27.593745 1 server.go:383] No RSA key provided, service account token authentication disabled that is more important in my opinion.
Seems like --service-account-private-key-file is missing on controller-manager so service tokens can not be properly generated.

Heapster fails to get container stats from Kubelet on Kubernetes cluster

I've set up a Kubernetes cluster on Ubuntu (trusty) based on the Running Kubernetes Locally via Docker guide, deployed a DNS and run Heapster with an InfluxDB backend and a Grafana UI.
Everything seems to run smoothly except for Grafana, which doesn't show any graphs but the message No datapoints in its diagrams: Screenshot
After checking the Docker container logs I found out that Heapster is is unable to access the kubelet API (?) and therefore no metrics are persisted into InfluxDB:
user#host:~$ docker logs e490a3ac10a8
I0701 07:07:30.829745 1 heapster.go:65] /heapster --source=kubernetes:https://kubernetes.default --sink=influxdb:http://monitoring-influxdb:8086
I0701 07:07:30.830082 1 heapster.go:66] Heapster version 1.2.0-beta.0
I0701 07:07:30.830809 1 configs.go:60] Using Kubernetes client with master "https://kubernetes.default" and version v1
I0701 07:07:30.831284 1 configs.go:61] Using kubelet port 10255
E0701 07:09:38.196674 1 influxdb.go:209] issues while creating an InfluxDB sink: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out, will retry on use
I0701 07:09:38.196919 1 influxdb.go:223] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I0701 07:09:38.197048 1 heapster.go:92] Starting with InfluxDB Sink
I0701 07:09:38.197154 1 heapster.go:92] Starting with Metric Sink
I0701 07:09:38.228046 1 heapster.go:171] Starting heapster on port 8082
I0701 07:10:05.000370 1 manager.go:79] Scraping metrics start: 2016-07-01 07:09:00 +0000 UTC, end: 2016-07-01 07:10:00 +0000 UTC
E0701 07:10:05.008785 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:10:05.009119 1 manager.go:152] ScrapeMetrics: time: 8.013178ms size: 0
I0701 07:11:05.001185 1 manager.go:79] Scraping metrics start: 2016-07-01 07:10:00 +0000 UTC, end: 2016-07-01 07:11:00 +0000 UTC
E0701 07:11:05.007130 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:11:05.007686 1 manager.go:152] ScrapeMetrics: time: 5.945236ms size: 0
W0701 07:11:25.010298 1 manager.go:119] Failed to push data to sink: InfluxDB Sink
I0701 07:12:05.000420 1 manager.go:79] Scraping metrics start: 2016-07-01 07:11:00 +0000 UTC, end: 2016-07-01 07:12:00 +0000 UTC
E0701 07:12:05.002413 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:12:05.002467 1 manager.go:152] ScrapeMetrics: time: 1.93825ms size: 0
E0701 07:12:12.309151 1 influxdb.go:150] Failed to create infuxdb: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out
I0701 07:12:12.351348 1 influxdb.go:201] Created database "k8s" on influxDB server at "monitoring-influxdb:8086"
I0701 07:13:05.001052 1 manager.go:79] Scraping metrics start: 2016-07-01 07:12:00 +0000 UTC, end: 2016-07-01 07:13:00 +0000 UTC
E0701 07:13:05.015947 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
...
I found a few issues on GitHub describing similar problems that made me understand that Heapster doesn't access the kubelet (via the node's loopback) but itself (via the container's loopback) instead. However, I fail to reproduce their solutions:
github.com/kubernetes/heapster/issues/1183
You should either use host networking for Heapster pod or configure your cluster in a way that the node has a regular name not 127.0.0.1. The current problem is that node name is resolved to Heapster localhost. Please reopen in case of more problems.
-#piosz
How do I enable "host networking" for my Heapster pod?
How do I configure my cluster/node to use a regular name not 127.0.0.1?
github.com/kubernetes/heapster/issues/744
Fixed by using better options in hyperkube, thanks for the help!
-#ddispaltro
Is there a way to solve this issue by adding/modifying kubelet's option flags in docker run? I tried setting--hostname-override=<host's eth0 IP> and --address=127.0.0.1 (as suggested in the last answer of this GitHub issue) but Heapster's container log then states: I0701 08:23:05.000566 1 manager.go:79] Scraping metrics start: 2016-07-01 08:22:00 +0000 UTC, end: 2016-07-01 08:23:00 +0000 UTC
E0701 08:23:05.000962 1 kubelet.go:279] Node 127.0.0.1 is not ready
E0701 08:23:05.003018 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://<host's eth0 IP>:10255/stats/container/": Post http://<host's eth0 IP>/stats/container/: dial tcp <host's eth0 IP>:10255: getsockopt: connection refused
Namespace issue
Could this problem be caused by the fact that I'm running Kubernetes API in default namespace and Heapster in kube-system?
user#host:~$ kubectl get --all-namespaces pods
NAMESPACE NAME READY STATUS RESTARTS AGE
default k8s-etcd-127.0.0.1 1/1 Running 0 18h
default k8s-master-127.0.0.1 4/4 Running 1 18h
default k8s-proxy-127.0.0.1 1/1 Running 0 18h
kube-system heapster-lizks 1/1 Running 0 18h
kube-system influxdb-grafana-e0pk2 2/2 Running 0 18h
kube-system kube-dns-v10-4vjhm 4/4 Running 0 18h
OS: Ubuntu 14.04.4 LTS (trusty) |
Kubernetes: v1.2.5 |
Docker: v1.11.2
Heapster has got the list of nodes from Kubernetes and is now trying to pull stats from the kublete process on each node (which has a built in cAdvisor collecting stats on the node). In this case there's only one node and it's known by 127.0.0.1 to kubernetes. And there's the problem. The Heapster container is trying to reach the node at 127.0.0.1 which is itself and of course finding no kublete process to interrogate within the Heapster container.
Two things need to happen to resolve this issue.
We need to reference the kublete worker node (our host machine running kubernetes) by something else other than the loopback network address of 127.0.0.1
The kublete process needs to accept traffic from the new network interface/address
Assuming you are using the local installation guide and starting kubernetes off with
hack/local-up-cluster.sh
To change the hostname by which the kublete is referenced is pretty simple. You can take more elaborate approaches but setting this to your eth0 ip worked fine for me (ifconfig eth0). The downside is that you need a eth0 interface and this is subject to DHCP so your mileage may vary as to how convenient this is.
export HOSTNAME_OVERRIDE=10.0.2.15
To get the kublete process to accept traffic from any network interface is just as simple.
export KUBELET_HOST=0.0.0.0
Provide the below argument to your heapster configuration to resolve the issue.
--source=kubernetes:https://kubernetes.default:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true

Error from server: dial tcp i/o timeout error when getting logs of pod

I'm working on OpenShift Origin 1.1 (which is using kubernetes as its orchestration tool for docker containers). I'm creating pods, but I'm unable to see the build-logs.
[user#ip master]# oc get pods
NAME READY STATUS RESTARTS AGE
test-1-build 0/1 Completed 0 14m
test-1-iok8n 1/1 Running 0 12m
[user#ip master]# oc logs test-1-iok8n
Error from server: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My /var/logs/messages shows:
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579794 32518 apiserver.go:440] apiserver was unable to write a JSON response: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579822 32518 errors.go:62] apiserver received an error that is not an unversioned.Status: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My versions are:
origin v1.1.0.1-1-g2c6ff4b
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
I forgot to open port 10250 (tcp) (in my aws security group).
This was the only issue for me.

openshift origin v0.3.3 error starting docker registry pod on centos 6.6

I'm running https://github.com/openshift/origin/tree/v0.3.3 on centos 6.6. When i run:
sudo /opt/bin/openshift start
i see an error:
I0301 22:02:04.738381 18093 pod_cache.go:194] error getting pod deploy-docker-registry-16mttp status: Get http://localhost:10250/api/v1beta1/podInfo?podID=deploy-docker-registry-16mttp&podNamespace=default: dial tcp 127.0.0.1:10250: connection refused, retry later
E0301 22:02:04.738422 18093 pod_cache.go:260] Error getting info for pod default/deploy-docker-registry-16mttp: Get http://localhost:10250/api/v1beta1/podInfo?podID=deploy-docker-registry-16mttp&podNamespace=default: dial tcp 127.0.0.1:10250: connection refused
If i do:
docker ps -a | grep origin-deployer
then i see:
b207ce593385 openshift/origin-deployer:v0.3.3 "/usr/bin/openshift- 31 hours ago Exited (255) 31 hours ago k8s_deployment.6c8f5c13_deploy-docker-registry-16mttp.default.api_11ae6e53-bf85-11e4-b8b2-080027bb06ce_8c701fc0
so i run:
docker logs b207ce593385
and get:
228 20:06:37.955877 1 deployer.go:64] Get https://10.0.2.15:8443/api/v1beta1/replicationControllers/docker-registry-1?namespace=default: dial tcp 10.0.2.15:8443: no route to host
If i do:
ping 10.0.2.15
it works. If i try:
https://10.0.2.15:8443
it returns:
404 Page Not Found
so the server is responsive. If i open the OpenShift Web Console at https://10.0.2.15:8444/ and Browse the default project it shows one deploy-docker-registry-16mttp pod with a status of Failed. The "IP on node" is 172.17.0.3 and it does respond to a ping. If i run:
osc describe service docker-registry
it returns:
Name: docker-registry
Labels: docker-registry=default
Selector: docker-registry=default
Port: 5000
Endpoints: <empty>
No events.
it should be returning:
Endpoints: 172.17.0.60:5000
according to the instructions. When i try:
ping 172.17.0.60
it returns:
PING 172.17.0.60 (172.17.0.60) 56(84) bytes of data.
From 172.17.42.1 icmp_seq=2 Destination Host Unreachable
From 172.17.42.1 icmp_seq=3 Destination Host Unreachable
...
Lot of moving parts and i'm new to it so any suggestions would be appreciated. I've probably missed one of the configuration steps.
It appears to be related to Centos 6.6. When i try the same process on Centos 7 (using netinstall) there is no problem.

Resources