Heapster fails to get container stats from Kubelet on Kubernetes cluster - docker

I've set up a Kubernetes cluster on Ubuntu (trusty) based on the Running Kubernetes Locally via Docker guide, deployed a DNS and run Heapster with an InfluxDB backend and a Grafana UI.
Everything seems to run smoothly except for Grafana, which doesn't show any graphs, only the message "No datapoints" in its diagrams: Screenshot
After checking the Docker container logs I found out that Heapster is unable to access the kubelet API (?) and therefore no metrics are persisted into InfluxDB:
user@host:~$ docker logs e490a3ac10a8
I0701 07:07:30.829745 1 heapster.go:65] /heapster --source=kubernetes:https://kubernetes.default --sink=influxdb:http://monitoring-influxdb:8086
I0701 07:07:30.830082 1 heapster.go:66] Heapster version 1.2.0-beta.0
I0701 07:07:30.830809 1 configs.go:60] Using Kubernetes client with master "https://kubernetes.default" and version v1
I0701 07:07:30.831284 1 configs.go:61] Using kubelet port 10255
E0701 07:09:38.196674 1 influxdb.go:209] issues while creating an InfluxDB sink: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out, will retry on use
I0701 07:09:38.196919 1 influxdb.go:223] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I0701 07:09:38.197048 1 heapster.go:92] Starting with InfluxDB Sink
I0701 07:09:38.197154 1 heapster.go:92] Starting with Metric Sink
I0701 07:09:38.228046 1 heapster.go:171] Starting heapster on port 8082
I0701 07:10:05.000370 1 manager.go:79] Scraping metrics start: 2016-07-01 07:09:00 +0000 UTC, end: 2016-07-01 07:10:00 +0000 UTC
E0701 07:10:05.008785 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:10:05.009119 1 manager.go:152] ScrapeMetrics: time: 8.013178ms size: 0
I0701 07:11:05.001185 1 manager.go:79] Scraping metrics start: 2016-07-01 07:10:00 +0000 UTC, end: 2016-07-01 07:11:00 +0000 UTC
E0701 07:11:05.007130 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:11:05.007686 1 manager.go:152] ScrapeMetrics: time: 5.945236ms size: 0
W0701 07:11:25.010298 1 manager.go:119] Failed to push data to sink: InfluxDB Sink
I0701 07:12:05.000420 1 manager.go:79] Scraping metrics start: 2016-07-01 07:11:00 +0000 UTC, end: 2016-07-01 07:12:00 +0000 UTC
E0701 07:12:05.002413 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
I0701 07:12:05.002467 1 manager.go:152] ScrapeMetrics: time: 1.93825ms size: 0
E0701 07:12:12.309151 1 influxdb.go:150] Failed to create infuxdb: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp 10.0.0.223:8086: getsockopt: connection timed out
I0701 07:12:12.351348 1 influxdb.go:201] Created database "k8s" on influxDB server at "monitoring-influxdb:8086"
I0701 07:13:05.001052 1 manager.go:79] Scraping metrics start: 2016-07-01 07:12:00 +0000 UTC, end: 2016-07-01 07:13:00 +0000 UTC
E0701 07:13:05.015947 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused
...
I found a few issues on GitHub describing similar problems, which made me understand that Heapster doesn't access the kubelet (via the node's loopback) but itself (via the container's loopback) instead. However, I haven't been able to apply their solutions:
github.com/kubernetes/heapster/issues/1183
You should either use host networking for Heapster pod or configure your cluster in a way that the node has a regular name not 127.0.0.1. The current problem is that node name is resolved to Heapster localhost. Please reopen in case of more problems.
- @piosz
How do I enable "host networking" for my Heapster pod?
How do I configure my cluster/node to use a regular name not 127.0.0.1?
github.com/kubernetes/heapster/issues/744
Fixed by using better options in hyperkube, thanks for the help!
- @ddispaltro
Is there a way to solve this issue by adding/modifying kubelet's option flags in docker run? I tried setting --hostname-override=<host's eth0 IP> and --address=127.0.0.1 (as suggested in the last answer of this GitHub issue), but Heapster's container log then states:
I0701 08:23:05.000566 1 manager.go:79] Scraping metrics start: 2016-07-01 08:22:00 +0000 UTC, end: 2016-07-01 08:23:00 +0000 UTC
E0701 08:23:05.000962 1 kubelet.go:279] Node 127.0.0.1 is not ready
E0701 08:23:05.003018 1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://<host's eth0 IP>:10255/stats/container/": Post http://<host's eth0 IP>/stats/container/: dial tcp <host's eth0 IP>:10255: getsockopt: connection refused
Namespace issue
Could this problem be caused by the fact that I'm running Kubernetes API in default namespace and Heapster in kube-system?
user@host:~$ kubectl get --all-namespaces pods
NAMESPACE     NAME                     READY   STATUS    RESTARTS   AGE
default       k8s-etcd-127.0.0.1       1/1     Running   0          18h
default       k8s-master-127.0.0.1     4/4     Running   1          18h
default       k8s-proxy-127.0.0.1      1/1     Running   0          18h
kube-system   heapster-lizks           1/1     Running   0          18h
kube-system   influxdb-grafana-e0pk2   2/2     Running   0          18h
kube-system   kube-dns-v10-4vjhm       4/4     Running   0          18h
OS: Ubuntu 14.04.4 LTS (trusty) |
Kubernetes: v1.2.5 |
Docker: v1.11.2

Heapster has got the list of nodes from Kubernetes and is now trying to pull stats from the kubelet process on each node (which has a built-in cAdvisor collecting stats on the node). In this case there's only one node, and it's known to Kubernetes as 127.0.0.1. And there's the problem: the Heapster container is trying to reach the node at 127.0.0.1, which is itself, and of course it finds no kubelet process to interrogate within the Heapster container.
Two things need to happen to resolve this issue.
We need to reference the kubelet worker node (our host machine running Kubernetes) by something other than the loopback address 127.0.0.1.
The kubelet process needs to accept traffic from the new network interface/address.
Assuming you are using the local installation guide and starting kubernetes off with
hack/local-up-cluster.sh
Changing the hostname by which the kubelet is referenced is pretty simple. You can take more elaborate approaches, but setting this to your eth0 IP worked fine for me (ifconfig eth0). The downside is that you need an eth0 interface and it is subject to DHCP, so your mileage may vary as to how convenient this is.
export HOSTNAME_OVERRIDE=10.0.2.15
Getting the kubelet process to accept traffic from any network interface is just as simple.
export KUBELET_HOST=0.0.0.0
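As for the "host networking" option mentioned in the GitHub issue quoted above: in a pod manifest that corresponds to setting hostNetwork: true in the pod spec. A minimal sketch (names and image tag are placeholders, not taken from the question):
apiVersion: v1
kind: Pod
metadata:
  name: heapster
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: heapster
    image: kubernetes/heapster:canary
With host networking the container shares the node's network namespace, so 127.0.0.1 inside the Heapster pod really is the node's kubelet.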

Provide the argument below to your Heapster configuration to resolve the issue.
--source=kubernetes:https://kubernetes.default:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
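For reference, a hedged sketch of where that flag goes in a Heapster Deployment manifest (the container name, image tag and InfluxDB sink are placeholders based on the log at the top of the question):
spec:
  containers:
  - name: heapster
    image: kubernetes/heapster:canary
    command:
    - /heapster
    - --source=kubernetes:https://kubernetes.default:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
    - --sink=influxdb:http://monitoring-influxdb:8086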

Related

kube-apiserver docker is restarting continuously

Sincere apologies for this lengthy posting.
I have a 4 node Kubernetes cluster with 1 x master and 3 x worker nodes. I normally connect to the cluster using kubeconfig, but since yesterday I have not been able to connect that way.
kubectl get pods was giving an error "The connection to the server api.xxxxx.xxxxxxxx.com was refused - did you specify the right host or port?"
In the kubeconfig the server name is specified as https://api.xxxxx.xxxxxxxx.com
Note:
Please note as there were too many https links, I was not able to post the question. So I have renamed https:// to https:-- to avoid the links in the background analysis section.
I tried to run kubectl from the master node and received a similar error:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Then I checked the kube-apiserver docker container and it was continuously exiting / in CrashLoopBackOff.
docker logs <container-id of kube-apiserver> shows the errors below:
W0914 16:29:25.761524 1 clientconn.go:1251] grpc:
addrConn.createTransport failed to connect to {127.0.0.1:4001 0
}. Err :connection error: desc = "transport: authentication
handshake failed: x509: certificate has expired or is not yet valid".
Reconnecting... F0914 16:29:29.319785 1 storage_decorator.go:57]
Unable to create storage backend: config (&{etcd3 /registry
{[https://127.0.0.1:4001]
/etc/kubernetes/pki/kube-apiserver/etcd-client.key
/etc/kubernetes/pki/kube-apiserver/etcd-client.crt
/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true
0xc000266d80 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err
(context deadline exceeded)
systemctl status kubelet was giving the errors below:
Sep 14 16:40:49 ip-xxx-xxx-xx-xx kubelet[2411]: E0914 16:40:49.693576
2411 kubelet_node_status.go:385] Error updating node status, will
retry: error getting node
"ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal": Get
https://127.0.0.1/api/v1/nodes/ip-xxx-xxx-xx-xx.xx-xxxxx-1.compute.internal?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused
Note: ip-xxx-xx-xx-xxx is the internal IP address of the AWS EC2 instance.
Background Analysis:
It looks like there was some issue with the cluster on 7th Sep 2020: both the kube-controller and kube-scheduler containers exited and restarted. I believe kube-apiserver has not been running since then, or those containers restarted because of kube-apiserver. The kube-apiserver server certificate expired in July 2020, but access via kubectl was working until 7th Sep.
Below are the docker logs from the exited kube-scheduler docker container:
I0907 10:35:08.970384 1 scheduler.go:572] pod
default/k8version-1599474900-hrjcn is bound successfully on node
ip-xx-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4 nodes evaluated, 3
nodes were found feasible I0907 10:40:09.286831 1
scheduler.go:572] pod default/k8version-1599475200-tshlx is bound
successfully on node ip-1x-xx-xx-xx.xx-xxxxxx-x.compute.internal, 4
nodes evaluated, 3 nodes were found feasible I0907 10:44:01.935373
1 leaderelection.go:263] failed to renew lease
kube-system/kube-scheduler: failed to tryAcquireOrRenew context
deadline exceeded E0907 10:44:01.935420 1 server.go:252] lost
master lost lease
Below are the docker logs from exited kube-controller docker container:
I0907 10:40:19.703485 1 garbagecollector.go:518] delete object
[v1/Pod, namespace: default, name: k8version-1599474300-5r6ph, uid:
67437201-f0f4-11ea-b612-0293e1aee720] with propagation policy
Background I0907 10:44:01.937398 1 leaderelection.go:263] failed
to renew lease kube-system/kube-controller-manager: failed to
tryAcquireOrRenew context deadline exceeded E0907 10:44:01.937506
1 leaderelection.go:306] error retrieving resource lock
kube-system/kube-controller-manager: Get https:
--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
net/http: request canceled (Client.Timeout exceeded while awaiting
headers) I0907 10:44:01.937456 1 event.go:209]
Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system",
Name:"kube-controller-manager",
UID:"ba172d83-a302-11e9-b612-0293e1aee720", APIVersion:"v1",
ResourceVersion:"85406287", FieldPath:""}): type: 'Normal' reason:
'LeaderElection' ip-xxx-xx-xx-xxx_1dd3c03b-bd90-11e9-85c6-0293e1aee720
stopped leading F0907 10:44:01.937545 1
controllermanager.go:260] leaderelection lost I0907 10:44:01.949274
1 range_allocator.go:169] Shutting down range CIDR allocator I0907
10:44:01.949285 1 replica_set.go:194] Shutting down replicaset
controller I0907 10:44:01.949291 1 gc_controller.go:86] Shutting
down GC controller I0907 10:44:01.949304 1
pvc_protection_controller.go:111] Shutting down PVC protection
controller I0907 10:44:01.949310 1 route_controller.go:125]
Shutting down route controller I0907 10:44:01.949316 1
service_controller.go:197] Shutting down service controller I0907
10:44:01.949327 1 deployment_controller.go:164] Shutting down
deployment controller I0907 10:44:01.949435 1
garbagecollector.go:148] Shutting down garbage collector controller
I0907 10:44:01.949443 1 resource_quota_controller.go:295]
Shutting down resource quota controller
Below are the docker logs from kube-controller since the restart (7th Sep):
E0915 21:51:36.028108 1 leaderelection.go:306] error retrieving
resource lock kube-system/kube-controller-manager: Get
https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:51:40.133446 1 leaderelection.go:306] error retrieving
resource lock kube-system/kube-controller-manager: Get
https:--127.0.0.1/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s:
dial tcp 127.0.0.1:443: connect: connection refused
Below are the docker logs from kube-scheduler since the restart (7th Sep):
E0915 21:52:44.703587 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Node:
Get https://127.0.0.1/api/v1/nodes?limit=500&resourceVersion=0: dial
tcp 127.0.0.1:443: connect: connection refused E0915 21:52:44.704504
1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed
to list *v1.ReplicationController: Get
https:--127.0.0.1/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.705471 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service:
Get https:--127.0.0.1/api/v1/services?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.706477 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.ReplicaSet: Get https:--127.0.0.1/apis/apps/v1/replicasets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.707581 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.StorageClass: Get https:--127.0.0.1/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.708599 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.PersistentVolume: Get https:--127.0.0.1/api/v1/persistentvolumes?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.709687 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.StatefulSet: Get https:--127.0.0.1/apis/apps/v1/statefulsets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.710744 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1.PersistentVolumeClaim: Get https:--127.0.0.1/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.711879 1 reflector.go:126]
k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:223: Failed to list
*v1.Pod: Get https:--127.0.0.1/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused E0915
21:52:44.712903 1 reflector.go:126]
k8s.io/client-go/informers/factory.go:133: Failed to list
*v1beta1.PodDisruptionBudget: Get https:--127.0.0.1/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0:
dial tcp 127.0.0.1:443: connect: connection refused
kube-apiserver certificate Renewal:
I found that the kube-apiserver certificate, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt, had expired in July 2020. There were a few other expired certificates related to etcd-manager-main and events (the same certificates are copied in both places), but I don't see them referenced in the manifest files.
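For reference, a certificate's expiry date can be checked with openssl, e.g. against the path mentioned above:
openssl x509 -noout -enddate -in /etc/kubernetes/pki/kube-apiserver/etcd-client.crt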
I searched and found steps to renew the certificates, but most of them use "kubeadm init phase" commands; I couldn't find kubeadm on the master server, and the certificate names and paths were different from my setup. So I generated a new certificate for kube-apiserver with openssl, using the existing CA cert and an openssl.cnf file whose DNS names include the internal and external IP addresses (EC2 instance) and the loopback IP. I replaced the certificate under the same name, /etc/kubernetes/pki/kube-apiserver/etcd-client.crt.
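The rough shape of that openssl flow is sketched below; the CA key path, subject, SAN section name and validity period are assumptions, not details from the question:
openssl genrsa -out etcd-client.key 2048
openssl req -new -key etcd-client.key -subj "/CN=kube-apiserver" -out etcd-client.csr
openssl x509 -req -in etcd-client.csr \
  -CA /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt -CAkey etcd-ca.key \
  -CAcreateserial -days 365 -extfile openssl.cnf -extensions v3_req \
  -out etcd-client.crt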
After that I restarted the kube-apiserver container (which was continuously exiting) and restarted kubelet. Now the certificate expiry message no longer appears, but kube-apiserver is still restarting continuously, which I believe is the reason for the errors in the kube-controller and kube-scheduler containers.
NOTE:
I have not restarted the Docker daemon on the master server after replacing the certificate.
NOTE: All our production pods are running on worker nodes, so they are not affected, but I can't manage them as I can't connect using kubectl.
Now I am not sure what the issue is and why kube-apiserver is restarting continuously.
Update to the original question:
Kubernetes version: v1.14.1
Docker version: 18.6.3
Below are the latest docker logs from kube-apiserver container (which is still crashing)
F0916 08:09:56.753538 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry {[https:--127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt} false true 0xc00095f050 apiextensions.k8s.io/v1beta1 5m0s 1m0s}), err (tls: private key does not match public key)
Below is the output from systemctl status kubelet
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.095615 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x.compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.130377 388 kubelet.go:2170] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.147390 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https:--127.0.0.1/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.195768 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.295890 388 kubelet.go:2244] node "ip-xxx-xx-xx-xx.xx-xxxxx-x..compute.internal" not found
Sep 16 08:10:16 ip-xxx-xx-xx-xx kubelet[388]: E0916 08:10:16.347431 388 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://127.0.0.1/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
This cluster (along with 3 others) was set up using kops. The other clusters are running normally, and it looks like they have some expired certificates as well. The person who set up the clusters is not available for comment, and I have limited experience with Kubernetes, hence the request for assistance from the gurus.
Any help is very much appreciated.
Many thanks.
Update after response from Zambozo and Nepomucen:
Thanks to both of you for your responses. Based on that, I found that there were expired etcd certificates on the /mnt mount point.
I followed the workaround from https://kops.sigs.k8s.io/advisories/etcd-manager-certificate-expiration/
and recreated the etcd certificates and keys. I have verified each certificate against a copy of the old one (from my backup folder); everything matches, and the new certificates have an expiry date of Sep 2021.
Now I am getting a different error on the etcd containers (both etcd-manager-events and etcd-manager-main).
Note: xxx-xx-xx-xxx is the IP address of the master server.
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-main container> --tail 20
I0916 14:41:40.349570 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
W0916 14:41:40.351857 8221 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:41:40.351878 8221 peers.go:347] was not able to connect to peer etcd-a: map[xxx.xx.xx.xxx:3996:true]
W0916 14:41:40.351887 8221 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
I0916 14:41:41.205763 8221 controller.go:173] starting controller iteration
W0916 14:41:41.205801 8221 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-a" in list of peers []
I0916 14:41:45.352008 8221 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I0916 14:41:46.678314 8221 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:41:46.739272 8221 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:41:46.786653 8221 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-a.internal.xxxxx.xxxxxxx.com etcd-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:41:46.786724 8221 hosts.go:181] skipping update of unchanged /etc/hosts
root@ip-xxx-xx-xx-xxx:~# docker logs <etcd-manager-events container> --tail 20
W0916 14:42:40.294576 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
I0916 14:42:41.106654 8316 controller.go:173] starting controller iteration
W0916 14:42:41.106692 8316 controller.go:149] unexpected error running etcd cluster reconciliation loop: cannot find self "etcd-events-a" in list of peers []
I0916 14:42:45.294682 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:45.297094 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:45.297117 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
I0916 14:42:46.791923 8316 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0916 14:42:46.856548 8316 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0916 14:42:46.945119 8316 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-a.internal.xxxxx.xxxxxxx.com:[xxx.xx.xx.xxx xxx.xx.xx.xxx]], final=map[xxx.xx.xx.xxx:[etcd-events-a.internal.xxxxx.xxxxxxx.com etcd-events-a.internal.xxxxx.xxxxxxx.com]]
I0916 14:42:50.297264 8316 peers.go:281] connecting to peer "etcd-events-a" with TLS policy, servername="etcd-manager-server-etcd-events-a"
W0916 14:42:50.300328 8316 peers.go:325] unable to grpc-ping discovered peer xxx.xx.xx.xxx:3997: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I0916 14:42:50.300348 8316 peers.go:347] was not able to connect to peer etcd-events-a: map[xxx.xx.xx.xxx:3997:true]
W0916 14:42:50.300360 8316 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-events-a
Could you please suggest how to proceed from here?
Many thanks.
Generating a new cert for kube-apiserver using openssl and replacing the cert and key brought the kube-apiserver container to a stable state and restored access via kubectl.
To resolve the etcd-manager certs issue, I upgraded etcd-manager to kopeio/etcd-manager:3.0.20200531 for both etcd-manager-main and etcd-manager-events, as described at https://github.com/kubernetes/kops/issues/8959#issuecomment-673515269
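With kops, that image can typically be pinned in the cluster spec; a hedged sketch (the field layout follows the kops EtcdClusterSpec, and the etcdMembers entries are placeholders, not taken from this cluster):
etcdClusters:
- name: main
  manager:
    image: kopeio/etcd-manager:3.0.20200531
  etcdMembers:
  - name: a
    instanceGroup: master-a
- name: events
  manager:
    image: kopeio/etcd-manager:3.0.20200531
  etcdMembers:
  - name: a
    instanceGroup: master-a
applied with kops edit cluster followed by kops update cluster --yes.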
Thank you
I think this is related to etcd. You may have renewed the certs for the Kubernetes components, but did you do the same for etcd?
Your API server is trying to connect to etcd and giving:
tls: private key does not match public key
As you have only 1 etcd (judging by the number of master nodes), I would take a backup of it before trying to fix it.
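A quick way to confirm that kind of mismatch is to compare the modulus of the certificate and the key with openssl (the paths are placeholders):
openssl x509 -noout -modulus -in etcd-client.crt | openssl md5
openssl rsa -noout -modulus -in etcd-client.key | openssl md5
If the two hashes differ, the key does not belong to that certificate.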

Kafka on Minikube: Back-off restarting failed container

I need to bring up Kafka and Cassandra in Minikube.
Host OS is Ubuntu 16.04
$ uname -a
Linux minikuber 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Minikube started normally:
$ minikube start
Starting local Kubernetes v1.8.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Services list:
$ kubectl get services
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP   1d
ZooKeeper and Cassandra are running, but Kafka is crashing with the error "CrashLoopBackOff":
$ kubectl get pods
NAME                        READY   STATUS             RESTARTS   AGE
zookeeper-775db4cd8-lpl95   1/1     Running            0          1h
cassandra-d84d697b8-p5wcs   1/1     Running            0          1h
kafka-6d889c567-w5n4s       0/1     CrashLoopBackOff   25         1h
View logs:
kubectl logs kafka-6d889c567-w5n4s -p
Output:
waiting for kafka to be ready
...
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
...
INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
INFO EventThread shut down for session: 0x0 (org.apache.zookeeper.ClientCnxn)
FATAL Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server '' with timeout of 6000 ms
...
INFO shutting down (kafka.server.KafkaServer)
INFO shut down completed (kafka.server.KafkaServer)
FATAL Exiting Kafka. (kafka.server.KafkaServerStartable)
Can anyone help with how to solve the problem of the container restarting?
kubectl describe pod kafka-6d889c567-w5n4s
Output describe:
Name: kafka-6d889c567-w5n4s
Namespace: default
Node: minikube/192.168.99.100
Start Time: Thu, 23 Nov 2017 17:03:20 +0300
Labels: pod-template-hash=284457123
run=kafka
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"kafka-6d889c567","uid":"0fa94c8d-d057-11e7-ad48-080027a5dfed","a...
Status: Running
IP: 172.17.0.5
Created By: ReplicaSet/kafka-6d889c567
Controlled By: ReplicaSet/kafka-6d889c567
Info about Containers:
Containers:
kafka:
Container ID: docker://7ed3de8ef2e3e665ba693186f5125c6802283e1fabca8f3c85eb584f8de19526
Image: wurstmeister/kafka
Image ID: docker-pullable://wurstmeister/kafka@sha256:2aa183fd201d693e24d4d5d483b081fc2c62c198a7acb8484838328c83542c96
Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 27 Nov 2017 09:43:39 +0300
Finished: Mon, 27 Nov 2017 09:43:49 +0300
Ready: False
Restart Count: 1003
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bnz99 (ro)
Info about Conditions:
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Info about volumes:
Volumes:
default-token-bnz99:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bnz99
Optional: false
QoS Class: BestEffort
Info about events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 38m (x699 over 2d) kubelet, minikube pulling image "wurstmeister/kafka"
Warning BackOff 18m (x16075 over 2d) kubelet, minikube Back-off restarting failed container
Warning FailedSync 3m (x16140 over 2d) kubelet, minikube Error syncing pod
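For context, the log excerpt above shows the broker trying to reach ZooKeeper at localhost/127.0.0.1:2181 inside its own container and finally giving up with an empty connect string ''. With the wurstmeister/kafka image that address normally comes from the KAFKA_ZOOKEEPER_CONNECT environment variable; a hedged sketch of the container env (the service name and advertised host are assumptions, not taken from the question):
env:
- name: KAFKA_ZOOKEEPER_CONNECT
  value: zookeeper:2181
- name: KAFKA_ADVERTISED_HOST_NAME
  value: kafka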

Kubernetes cannot pull from insecure registry and cannot run container from local image on offline cluster

I am working on an offline cluster (the machines have no internet access), deploying docker images using ansible and docker compose scripts.
My servers run CentOS 7.
I have set up an insecure docker registry on the machines. We are going to change environment, and I am installing Kubernetes in order to manage my containers.
I follow this guide to install kubernetes:
https://severalnines.com/blog/installing-kubernetes-cluster-minions-centos7-manage-pods-services
After the installation, I tried to launch a test pod. Here is the yml for the pod, launched with
kubectl create -f nginx.yml
Here is the yml:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: [my_registry_addr]:[my_registry_port]/nginx:v1
    ports:
    - containerPort: 80
I used kubectl describe to get more information on what was wrong:
Name: nginx
Namespace: default
Node: [my node]
Start Time: Fri, 15 Sep 2017 11:29:05 +0200
Labels: <none>
Status: Pending
IP:
Controllers: <none>
Containers:
nginx:
Container ID:
Image: [my_registry_addr]:[my_registry_port]/nginx:v1
Image ID:
Port: 80/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Volume Mounts: <none>
Environment Variables: <none>
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
No volumes.
QoS Class: BestEffort
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 {default-scheduler } Normal Scheduled Successfully assigned nginx to [my kubernet node]
1m 1m 2 {kubelet [my kubernet node]} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "Error while pulling image: Get https://index.docker.io/v1/repositories/library/[my_registry_addr]/images: dial tcp: lookup index.docker.io on [kubernet_master_ip]:53: server misbehaving"
54s 54s 1 {kubelet [my kubernet node]} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImagePullBackOff: "Back-off pulling image \"[my_registry_addr]:[my_registry_port]\""
8s 8s 1 {kubelet [my kubernet node]} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "Network timed out while trying to connect to https://index.docker.io/v1/repositories/library/[my_registry_addr]/images. You may want to check your internet connection or if you are behind a proxy."
Then I go to my node and use journalctl -xe:
sept. 15 11:22:02 [my_node_ip] dockerd-current[9861]: time="2017-09-15T11:22:02.350930396+02:00" level=info msg="{Action=create, LoginUID=4294967295, PID=11555}"
sept. 15 11:22:17 [my_node_ip] dockerd-current[9861]: time="2017-09-15T11:22:17.351536727+02:00" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
sept. 15 11:22:17 [my_node_ip] dockerd-current[9861]: time="2017-09-15T11:22:17.351606330+02:00" level=error msg="Attempting next endpoint for pull after error: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
sept. 15 11:22:32 [my_node_ip] dockerd-current[9861]: time="2017-09-15T11:22:32.353946452+02:00" level=error msg="Not continuing with pull after error: Error while pulling image: Get https://index.docker.io/v1/repositories/library/[my_registry_ip]/images: dial tcp: lookup index.docker.io on [kubernet_master_ip]:53: server misbehaving"
sept. 15 11:22:32 [my_node_ip] kubelet[11555]: E0915 11:22:32.354309 11555 docker_manager.go:2161] Failed to create pod infra container: ErrImagePull; Skipping pod "nginx_default(8b5c40e5-99f4-11e7-98db-f8bc12456ee4)": Error while pulling image: Get https://index.docker.io/v1/repositories/library/[my_registry_ip]/images: dial tcp: lookup index.docker.io on [kubernet_master_ip]:53: server misbehaving
sept. 15 11:22:32 [my_node_ip] kubelet[11555]: E0915 11:22:32.354390 11555 pod_workers.go:184] Error syncing pod 8b5c40e5-99f4-11e7-98db-f8bc12456ee4, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "Error while pulling image: Get https://index.docker.io/v1/repositories/library/[my_registry_ip]/images: dial tcp: lookup index.docker.io on [kubernet_master_ip]:53: server misbehaving"
sept. 15 11:22:44 [my_node_ip] dockerd-current[9861]: time="2017-09-15T11:22:44.350708175+02:00" level=error msg="Handler for GET /v1.24/images/[my_registry_ip]:[my_registry_port]/json returned error: No such image: [my_registry_ip]:[my_registry_port]"
I am sure that my docker configuration is good, because I am using it every day with ansible or mesos.
docker version is 1.12.6, kubernetes version is 1.5.2
What can I do now? I didn't find any configuration key for this usage.
When I saw that pulling was failing, I manually pulled the image on all the nodes. I put a tag on it to ensure that Kubernetes would not try to pull it by default, and set imagePullPolicy: IfNotPresent.
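For reference, that field sits on the container entry in the pod spec; a minimal sketch reusing the image notation from above:
spec:
  containers:
  - name: nginx
    image: [my_registry_addr]:[my_registry_port]/nginx:v1
    imagePullPolicy: IfNotPresent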
The syntax for specifying the docker image is:
[docker_registry]/[image_name]:[image_tag]
In your manifest file you have used ":" to separate the docker repository host and the port the repository is listening on. The default port for a docker private registry is, I guess, 5000.
So change your image declaration from
Image: [my_registry_addr]:[my_registry_port]/nginx:v1
to
Image: [my_registry_addr]/nginx:v1
Also, check the network connectivity from the worker node to your docker registry by doing a ping.
ping [my_registry_addr]
If you still want to check whether port 443 is open on the registry, you can do a TCP check on that port on the host running the docker registry:
curl telnet://[my_registry_addr]:443
Hope that helps.
I finally found what the problem was.
To work, Kubernetes needs a pause container. Kubernetes was trying to find the pause container on the internet.
I deployed a custom pause image to my registry and pointed Kubernetes' pause container at that image.
After that, kubernetes is working like a charm.
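A hedged sketch of how that is usually wired up: the kubelet exposes a --pod-infra-container-image flag, which on RPM-based installs like the one in the linked guide typically goes into /etc/kubernetes/kubelet (the image path is a placeholder):
KUBELET_ARGS="--pod-infra-container-image=[my_registry_addr]:[my_registry_port]/pause:latest"
followed by systemctl restart kubelet on each node.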

Error from server: dial tcp i/o timeout error when getting logs of pod

I'm working on OpenShift Origin 1.1 (which is using kubernetes as its orchestration tool for docker containers). I'm creating pods, but I'm unable to see the build-logs.
[user@ip master]# oc get pods
NAME           READY   STATUS      RESTARTS   AGE
test-1-build   0/1     Completed   0          14m
test-1-iok8n   1/1     Running     0          12m
[user@ip master]# oc logs test-1-iok8n
Error from server: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My /var/log/messages shows:
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579794 32518 apiserver.go:440] apiserver was unable to write a JSON response: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
Dec 4 13:28:24 ip-10-0-x-x origin-master: E1204 13:28:24.579822 32518 errors.go:62] apiserver received an error that is not an unversioned.Status: Get https://ip-10-0-x-x.compute.internal:10250/containerLogs/test/test-1-iok8n/test: dial tcp 10.0.x.x:10250: i/o timeout
My versions are:
origin v1.1.0.1-1-g2c6ff4b
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
I forgot to open port 10250 (TCP) in my AWS security group.
This was the only issue for me.
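For reference, a hedged example of opening that port with the AWS CLI (the security group ID and source CIDR are placeholders):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 10250 \
  --cidr 10.0.0.0/16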

openshift origin v0.3.3 error starting docker registry pod on centos 6.6

I'm running https://github.com/openshift/origin/tree/v0.3.3 on CentOS 6.6. When I run:
sudo /opt/bin/openshift start
I see an error:
I0301 22:02:04.738381 18093 pod_cache.go:194] error getting pod deploy-docker-registry-16mttp status: Get http://localhost:10250/api/v1beta1/podInfo?podID=deploy-docker-registry-16mttp&podNamespace=default: dial tcp 127.0.0.1:10250: connection refused, retry later
E0301 22:02:04.738422 18093 pod_cache.go:260] Error getting info for pod default/deploy-docker-registry-16mttp: Get http://localhost:10250/api/v1beta1/podInfo?podID=deploy-docker-registry-16mttp&podNamespace=default: dial tcp 127.0.0.1:10250: connection refused
If I do:
docker ps -a | grep origin-deployer
then I see:
b207ce593385 openshift/origin-deployer:v0.3.3 "/usr/bin/openshift- 31 hours ago Exited (255) 31 hours ago k8s_deployment.6c8f5c13_deploy-docker-registry-16mttp.default.api_11ae6e53-bf85-11e4-b8b2-080027bb06ce_8c701fc0
so I run:
docker logs b207ce593385
and get:
228 20:06:37.955877 1 deployer.go:64] Get https://10.0.2.15:8443/api/v1beta1/replicationControllers/docker-registry-1?namespace=default: dial tcp 10.0.2.15:8443: no route to host
If I do:
ping 10.0.2.15
it works. If I try:
https://10.0.2.15:8443
it returns:
404 Page Not Found
so the server is responsive. If I open the OpenShift Web Console at https://10.0.2.15:8444/ and browse the default project, it shows one deploy-docker-registry-16mttp pod with a status of Failed. The "IP on node" is 172.17.0.3 and it does respond to a ping. If I run:
osc describe service docker-registry
it returns:
Name: docker-registry
Labels: docker-registry=default
Selector: docker-registry=default
Port: 5000
Endpoints: <empty>
No events.
it should be returning:
Endpoints: 172.17.0.60:5000
according to the instructions. When I try:
ping 172.17.0.60
it returns:
PING 172.17.0.60 (172.17.0.60) 56(84) bytes of data.
From 172.17.42.1 icmp_seq=2 Destination Host Unreachable
From 172.17.42.1 icmp_seq=3 Destination Host Unreachable
...
There are a lot of moving parts and I'm new to this, so any suggestions would be appreciated. I've probably missed one of the configuration steps.
It appears to be related to CentOS 6.6. When I try the same process on CentOS 7 (using netinstall) there is no problem.

Resources