Debugging DNS resolution in Kubernetes - Docker

I have initialized a Kubernetes v1.13.1 cluster on Ubuntu 16.04 using the command below:
sudo kubeadm init --token-ttl=0 --apiserver-advertise-address=192.168.88.142
and installed weave using:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
I have 10 Raspberry Pis acting as worker nodes connected to the cluster, and all of them run the deployment fine. These nodes run pods which try to connect to the IoT hub visdwk.azure-devices.net and publish some data. Out of the 10 nodes, only a few are able to connect; the others throw an error that they are unable to connect to the IoT hub. I did a ping test and found that the failing nodes could not ping google.com by name, although they could ping Google's public IP address.
This made me think that something is wrong with the CoreDNS pods. I followed this documentation and ran the tests below.
The pod has the following in /etc/resolv.conf:
nameserver 10.96.0.10
search visdwk.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
which looks normal to me. All the CoreDNS pods are running fine:
NAME READY STATUS RESTARTS AGE
coredns-86c58d9df4-42xqc 1/1 Running 8 1d11h
coredns-86c58d9df4-p6d98 1/1 Running 7 1d6h
I have also run nslookup kubernetes.default from a busybox container and got the proper response (a rough sketch of that check follows the logs). Below are the logs of coredns-86c58d9df4-42xqc:
.:53
2019-02-08T08:40:10.038Z [INFO] CoreDNS-1.2.6
2019-02-08T08:40:10.039Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
[INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
The above logs also look normal.
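For reference, the busybox check mentioned above was done roughly like this (a sketch; the pod name is an assumption, and busybox:1.28 is used because nslookup in newer busybox images is unreliable):
kubectl run busybox --image=busybox:1.28 --restart=Never -- sleep 3600
kubectl exec -it busybox -- nslookup kubernetes.default
# the answer should come from the cluster DNS at 10.96.0.10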
I also cannot say that the pod fails to resolve the IoT hub because of an error from Weave, because if Weave were throwing errors then I believe the pod would never start and would always be in a failed state; in reality the pod remains in the Running state. Please correct me here if I am wrong.
The DNS service also seems to be running:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 1d6h
But I still cannot figure out why a few nodes in the cluster are unable to resolve the IoT hub. Can anyone give me some suggestions here? Thanks.
Logs from a failing pod:
1550138544: New connection from 127.0.0.1 on port 1883.
1550138544: New client connected from 127.0.0.1 as 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (c1, k60).
1550138544: Sending CONNACK to 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (0, 0)
1550138544: Received PUBLISH from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138544: Received DISCONNECT from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504
1550138544: Client 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 disconnected.
1550138547: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138547: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138547: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138552: Error creating bridge: Try again.
1550138566: New connection from 127.0.0.1 on port 1883.
1550138566: New client connected from 127.0.0.1 as afb6cc2a-ee78-482e-aff0-fc595e06f86a (c1, k60).
1550138566: Sending CONNACK to afb6cc2a-ee78-482e-aff0-fc595e06f86a (0, 0)
1550138566: Received PUBLISH from afb6cc2a-ee78-482e-aff0-fc595e06f86a (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138566: Received DISCONNECT from afb6cc2a-ee78-482e-aff0-fc595e06f86a
1550138566: Client afb6cc2a-ee78-482e-aff0-fc595e06f86a disconnected.
1550138567: New connection from 127.0.0.1 on port 1883.
1550138567: New client connected from 127.0.0.1 as 01b9e135-fbc8-4d67-9962-356e8cf9f080 (c1, k60).
1550138567: Sending CONNACK to 01b9e135-fbc8-4d67-9962-356e8cf9f080 (0, 0)
1550138567: Received PUBLISH from 01b9e135-fbc8-4d67-9962-356e8cf9f080 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (755 bytes))
1550138567: Received DISCONNECT from 01b9e135-fbc8-4d67-9962-356e8cf9f080
1550138567: Client 01b9e135-fbc8-4d67-9962-356e8cf9f080 disconnected.
1550138578: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138583: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138583: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138588: Error creating bridge: Try again.
The pod runs a mosquitto container which tries to connect to visdwk.azure-devices.net and throws the error:
Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
Error creating bridge: Try again.
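For what it's worth, DNS resolution can also be checked from inside a failing pod directly, for example (the pod name is a placeholder and this assumes nslookup is available in the mosquitto image):
kubectl exec -it <failing-pod-name> -- nslookup visdwk.azure-devices.net
# on an affected node this times out or fails, while on a healthy node
# it returns the public addresses of the IoT hub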

It would appear that one of your DNS Pods is not providing DNS services.
The evidence is in the statement that only a few nodes are able to connect while the others throw errors that they are unable to reach the IoT hub.
This is a classic symptom of load-balancing with a failed node in the loop.
Try:
Remove the DNS server pod that logged the NXDOMAIN response for the hub hostname: visdwk.azure-devices.net.visdwknamespace.svc.cluster.local. udp 82 false 512" NXDOMAIN qr,aa,rd,ra 175 0.000651078s
Wait for the changes to propagate through the cluster.
Test the connections.
If this is correct they should all connect.
To confirm, add the pod back and remove the other one. Retest, they should all fail to connect.
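A rough sketch of those steps with kubectl (the pod name is taken from the listing in the question; since the Deployment recreates a deleted pod, "removing" one here means deleting it and letting a fresh replica come up):
kubectl -n kube-system delete pod coredns-86c58d9df4-42xqc
# watch the replacement CoreDNS pod come up
kubectl -n kube-system get pods -l k8s-app=kube-dns -w
# then repeat the IoT hub connection test from the previously failing nodes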

Related

kubeadm join times out on non-default NIC/IP

I am trying to configure a K8s cluster on-prem; the servers are running Fedora CoreOS with multiple NICs.
I am configuring the cluster to use a non-default NIC: a bond defined over 2 interfaces. All servers can reach each other over that interface and have HTTP + HTTPS connectivity to the internet.
kubeadm join hangs at:
I0513 13:24:55.516837 16428 token.go:215] [discovery] Failed to request cluster-info, will try again: Get https://${BOND_IP}:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The relevant kubeadm init config looks like this:
[...]
localAPIEndpoint:
  advertiseAddress: ${BOND_IP}
  bindPort: 6443
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
    node-ip: ${BOND_IP}
  criSocket: /var/run/dockershim.sock
  name: master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
[...]
The join config that I am using looks like this:
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: ${TOKEN}
    caCertHashes:
    - "${SHA}"
    apiServerEndpoint: "${BOND_IP}:6443"
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
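For context, a config like this is applied on the joining node with something along the lines of (the file name is an assumption):
sudo kubeadm join --config=join-config.yaml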
If I configure it using the default eth0, it works without issues.
This is not a connectivity issue. The port test works fine:
# nc -s ${BOND_IP_OF_NODE} -zv ${BOND_IP_OF_MASTER} 6443
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Connected to ${BOND_IP_OF_MASTER}:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
I suspect this happens because kubelet listens on eth0. If so, can I change it to use a different NIC/IP?
LE: The eth0 connection has been cut off completely (cable out, interface down, connection down).
Now, when we init, if we choose 0.0.0.0 as the advertise address for the kube-api, it defaults to the bond, which is what we wanted initially:
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
result:
[certs] apiserver serving cert is signed for DNS names [emp-prod-nl-hilv-quortex19 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.0.0.1 ${BOND_IP}]
I have even added an ACCEPT rule for port 6443 in iptables and it still times out. All my Calico pods are up and running (all pods in the kube-system namespace, for that matter).
LLE:
I have tested Calico and Weave Net and both show the same issue. The api-server is up and can be reached from the master using curl, but it times out from the nodes.
LLLE:
On the premise that the kube-api is nothing but an HTTPS server, I have tried two options from the node that cannot reach it when doing the kubeadm join:
Ran a python3 simple HTTP server on port 6443 and WAS ABLE TO CONNECT from the node.
Ran an nginx pod exposed on another port as a NodePort and WAS ABLE TO CONNECT from the node.
The node just can't reach the api-server on 6443, or any other port for that matter...
What am I doing wrong?
The cause:
The interface used was in a bond of type ACTIVE-ACTIVE. This apparently made kubeadm try the other of the two bonded interfaces, which was not in the same subnet as the advertised server IP...
Using ACTIVE-PASSIVE did the trick and I was able to join the nodes.
LE: If anyone knows why kubeadm join does not support LACP with ACTIVE-ACTIVE bond setups on Fedora CoreOS, please advise here. Otherwise, if additional configuration is required, I would very much like to know what I have missed.
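For anyone checking their own setup, inspecting the current bond mode and switching to active-backup (ACTIVE-PASSIVE) could look roughly like this, assuming the bond is managed by NetworkManager and the connection is named bond0 (both assumptions):
# show the current bonding mode
grep "Bonding Mode" /proc/net/bonding/bond0
# switch the bond to active-backup and re-activate it
sudo nmcli connection modify bond0 bond.options "mode=active-backup,miimon=100"
sudo nmcli connection up bond0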

Local Consul join K8s Consul Mac

So I'm currently running the stable/consul Helm chart on my local Kubernetes cluster (running on Docker).
$ helm install -n wet-fish --namespace consul stable/consul
This creates two services
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
wet-fish-consul ClusterIP None <none> 8500/TCP,8400/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 0s
wet-fish-consul-ui NodePort 10.110.229.223 <none> 8500:30276/TCP
So this means I can open localhost:30276 and see the Consul UI.
Now, on my local machine, I'm running:
$ consul agent -dev -config-dir=./consul.d -node=machine
$ consul join 127.0.0.1:30276
This just results in:
Error joining address '127.0.0.1:30276': Unexpected response code: 500 (1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
)
Failed to join any nodes.
and
2020/01/17 15:17:35 [WARN] agent: (LAN) couldn't join: 0 Err: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
2020/01/17 15:17:35 [ERR] http: Request PUT /v1/agent/join/127.0.0.1:30276, error: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
from=127.0.0.1:59693
There must be a way to have a local consul agent running that can connect to the k8s consul server...
This is on a Mac, so networking isn't as good....
There may be two problems here. The first is that consul agent -dev starts the agent in dev mode; by default, dev mode starts both a server and a client agent. This might be part of the reason behind the error.
The other problem could be due to localhost: the server running in Kubernetes will attempt to health-check local agents, so it needs to be able to reach the local agent. Even if you manage to join in the first step, it would probably fail health checks.
I agree about networking on Mac; it does not make things easy. One thing you will probably have to do is set the advertise address for the local (non-Kubernetes) agent. Docker for Mac has a hostname, docker.for.mac.localhost, which resolves to an IP on the local machine that is routable from a container. When starting the local agent, if you set the advertise address to the IP value of that hostname, the Kubernetes Consul server should be able to route to the locally running agent.
Potential fix:
1. Ensure the local agent starts in client mode (configure it manually, not with -dev).
2. Set the advertise address to an IP address which is routable from Kubernetes (the IP behind docker.for.mac.localhost).
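A minimal sketch of that, assuming the serf LAN port is reachable through the NodePort shown above and that docker.for.mac.localhost resolves from inside containers:
# find the IP that containers use to reach the Mac host
docker run --rm busybox nslookup docker.for.mac.localhost
# start the local agent in client mode, advertising that IP (placeholder below)
consul agent -node=machine -config-dir=./consul.d -advertise=<that-ip>
# in another terminal, join the Kubernetes Consul server
consul join 127.0.0.1:30276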
Give me a shout if that does not work for you. I have used a setup like this myself; 9 times out of 10 it is networking between Docker and the local machine.
Kind regards,
Nic

Docker Swarm: Getting connection refuse while adding worker node

I just started learning Docker and am facing the challenge below; please let me know where I am going wrong.
My use case: set up a Docker swarm manager and add a worker node to it.
Step 1: To create the swarm manager, I used the command below:
docker swarm init --advertise-addr <<ip_address>>
Step 2: Run the command below, which gives you the docker command to add a worker.
docker swarm join-token worker
After running the above command, I got this output:
docker swarm join --token SWMTKN-1-653srs28a6s48dqxnak9g9kic2cd1xyeowgnke53nf83710wfv-7u7u7u1vovahvn792814q2sts ip_address:2377
Step 3: I logged in to the worker node and ran the above docker swarm join command, but I am getting the error message below.
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp ip_address:2377: connect: connection refused"
This could well be a firewall issue; make sure ports 2377, 7946 and 4789 are open between the hosts acting as manager and worker nodes.
From the docs:
Open protocols and ports between the hosts. The following ports must be available:
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
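For example, with ufw on Ubuntu the ports could be opened like this (a sketch; adjust for firewalld or your cloud security groups if those sit in front of the hosts):
sudo ufw allow 2377/tcp   # cluster management
sudo ufw allow 7946/tcp   # node-to-node communication
sudo ufw allow 7946/udp   # node-to-node communication
sudo ufw allow 4789/udp   # overlay network traffic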

Error from server: error dialing backend: dial tcp 10.9.84.149:10250: getsockopt: connection refused

I have a Kubernetes cluster with three nodes: 10.9.84.149, 10.9.105.90 and 10.9.84.149. When my application tries to execute a command inside some pod:
kubectl exec -it <podName>
it sometimes gets an error:
Error from server: error dialing backend: dial tcp 10.9.84.149:10250: getsockopt: connection refused
As far as I could see, everything was fine with the cluster: all kube-system services and pods were running well. Besides, the error didn't appear consistently.
Can anybody help me on this issue?
I got the same kind of error as the one below:
Error from server: Get https://192.168.100.102:10250/containerLogs/default/kubia-n8nv9/kubia: dial tcp 192.168.100.102:10250: connect: no route to host
DISABLING THE FIREWALL WAS MY FIX ON ALL NODES
I figured out that the firewall on my worker nodes was not disabled. I ran the command below to fix my problem:
systemctl disable firewalld && systemctl stop firewalld
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1...
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.
It looks like your kubelet process is not running, or keeps restarting.
ss -tnpl |grep 10250
LISTEN 0 128 :::10250 :::* users:(("kubelet",pid=1102,fd=21))
Check that the kubelet process is running.
If it is running, see when it started.
Look at the /var/log/messages file for any issues with the node.
Make sure you don't have a firewall blocking the traffic.
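A quick way to check that on the affected node, assuming kubelet runs as a systemd unit:
# is kubelet running, and since when?
systemctl status kubelet
# recent kubelet logs, in case it is crash-looping
journalctl -u kubelet -n 100 --no-pager
# confirm the kubelet port is listening
ss -tnpl | grep 10250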

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master, which does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not on the master. When I check the syslogs on the nodes I see the following error:
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three-node swarm mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the node that does not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
The nodes are both properly connected to the swarm
Scaling the service up to 5 replicas (which ends up placing a task on every node) makes curl work on every node, because each node is then running the task locally.
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu#swarm-master-01:~$ sudo lsof -i :7946
ubuntu#swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu#swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network port (7946) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to the port:
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason Docker refused to open this port and listen on it any more. Here is what I did (albeit undesirable) to circumvent the issue:
Created another node with docker-machine called swarm-master-02
Joined swarm-master-02 to the cluster as a master
Demoted master-01 which set master-02 as the leader
Restarted the docker daemon on each node (might not have been necessary)
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
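In shell terms the workaround was roughly the following (a sketch; the docker-machine create flags for the AWS driver are omitted):
# on the current leader, get a manager join token for the new machine
docker swarm join-token manager
# after swarm-master-02 has joined as a manager, demote the old leader
docker node demote swarm-master-01
# restart the docker daemon on each node (might not have been necessary)
sudo systemctl restart docker
# confirm the gossip port is listening again
sudo netstat -nlt | grep 7946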
