Consul Serf Health Status - devops

I have installed on my localhost a Consul server (leader) with an IP address of 192.168.48.1 => running ok
Then I installed a Vagrant box (Ubuntu 20.04) as a Consul agent, with an IP address of 10.0.2.15, and I declared the bridge in the Vagrantfile.
The issue is:
The Consul leader sees the agent node, but the agent health status keeps failing and recovering, with the following message:
Failing serf check
This node has a failing serf node check.
And a few seconds after that, it goes back to green status, and so on and so forth.
If the leader can see the node, that means the configuration on the agent side is ok. But the health status fails at regular intervals (every few seconds).
I updated the iptables rules for the required Consul ports, but it still fails.
I checked the logs with the command "consul monitor" on the localhost (leader host), and it reports an issue with the ack:
2022-07-19T12:12:46.945+0200 [INFO] agent.server.memberlist.lan: memberlist: Suspect 10.0.2.15 has failed, no acks received
2022-07-19T12:12:50.179+0200 [ERROR] agent.server.memberlist.lan: memberlist: Push/Pull with 10.0.2.15 failed: dial tcp 10.0.2.15:8301: i/o timeout
2022-07-19T12:12:50.945+0200 [INFO] agent.server.memberlist.lan: memberlist: Marking 10.0.2.15 as failed, suspect timeout reached (0 peer confirmations)
2022-07-19T12:12:50.945+0200 [INFO] agent.server.serf.lan: serf: EventMemberFailed: 10.0.2.15 10.0.2.15
2022-07-19T12:12:50.945+0200 [INFO] agent.server: member failed, marking health critical: member=10.0.2.15 partition=default
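As an added illustration (my own example, not part of the question above), here is a small parser for the memberlist/serf lines that `consul monitor` prints. It counts suspect/failed/rejoin events per member and, more usefully, pulls out the address and port the server tried to dial and failed on. In the question's log that target is 10.0.2.15:8301 with "i/o timeout", which points at the server host being unable to reach the agent's advertised address, not at the agent being unable to reach the server. The sample below is the question's own lines plus one `EventMemberJoin` line that I added (assumed) to stand for the "back to green" recovery.

```python
import re
from collections import defaultdict

# Constructed sample: the memberlist/serf lines quoted in the question, followed by an
# EventMemberJoin line that I added (assumed) to represent the recovery to green.
SAMPLE_LOG = """\
2022-07-19T12:12:46.945+0200 [INFO] agent.server.memberlist.lan: memberlist: Suspect 10.0.2.15 has failed, no acks received
2022-07-19T12:12:50.179+0200 [ERROR] agent.server.memberlist.lan: memberlist: Push/Pull with 10.0.2.15 failed: dial tcp 10.0.2.15:8301: i/o timeout
2022-07-19T12:12:50.945+0200 [INFO] agent.server.memberlist.lan: memberlist: Marking 10.0.2.15 as failed, suspect timeout reached (0 peer confirmations)
2022-07-19T12:12:50.945+0200 [INFO] agent.server.serf.lan: serf: EventMemberFailed: 10.0.2.15 10.0.2.15
2022-07-19T12:12:50.945+0200 [INFO] agent.server: member failed, marking health critical: member=10.0.2.15 partition=default
2022-07-19T12:12:58.102+0200 [INFO] agent.server.serf.lan: serf: EventMemberJoin: 10.0.2.15 10.0.2.15
"""

# One pattern per event of interest; each captures the member name or the dial target.
SUSPECT = re.compile(r"memberlist: Suspect (\S+) has failed, no acks received")
DIAL_FAIL = re.compile(r"dial tcp (\S+?):(\d+): (.+)$")
FAILED = re.compile(r"serf: EventMemberFailed: (\S+) ")
JOINED = re.compile(r"serf: EventMemberJoin: (\S+) ")
CRITICAL = re.compile(r"marking health critical: member=(\S+)")


def new_counter():
    return {"suspects": 0, "failed": 0, "critical": 0, "rejoined": 0, "unreachable": set()}


def summarize(log_text):
    """Per-member event counters plus the set of (host, port, reason) dial failures."""
    members = defaultdict(new_counter)
    for line in log_text.splitlines():
        if m := SUSPECT.search(line):
            members[m.group(1)]["suspects"] += 1
        elif m := DIAL_FAIL.search(line):
            host, port, reason = m.groups()
            members[host]["unreachable"].add((host, int(port), reason))
        elif m := FAILED.search(line):
            members[m.group(1)]["failed"] += 1
        elif m := CRITICAL.search(line):
            members[m.group(1)]["critical"] += 1
        elif m := JOINED.search(line):
            members[m.group(1)]["rejoined"] += 1
    return dict(members)


def flapping_members(summary, min_cycles=1):
    """A member that is marked failed and then rejoins at least min_cycles times is flapping."""
    return sorted(name for name, s in summary.items()
                  if s["failed"] >= min_cycles and s["rejoined"] >= min_cycles)


def report(summary):
    """Human-readable lines: what to verify from the server side for each member."""
    lines = []
    for name, s in sorted(summary.items()):
        lines.append(f"{name}: suspects={s['suspects']} failed={s['failed']} rejoined={s['rejoined']}")
        for host, port, reason in sorted(s["unreachable"]):
            lines.append(f"  server could not reach {host}:{port} ({reason}); check that this "
                         f"address is routable from the server host and the port is open")
    return lines


if __name__ == "__main__":
    summary = summarize(SAMPLE_LOG)
    print("\n".join(report(summary)))
    print("flapping:", flapping_members(summary))
```

The dial-failure line is the one to act on: the server has to be able to open TCP (and UDP) 8301 to whatever address the agent advertises, so that reachability is the first thing to verify from the leader host before touching anything else.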

Related

Local Consul join K8s Consul Mac

So I'm currently running on my local Kubernetes cluster (running on docker) the stable/consul chart from helm.
$ helm install -n wet-fish --namespace consul stable/consul
This creates two services
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
wet-fish-consul ClusterIP None <none> 8500/TCP,8400/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 0s
wet-fish-consul-ui NodePort 10.110.229.223 <none> 8500:30276/TCP
So this means I can run localhost:30276 and see the consul ui.
Now I'm running on my local machine
$ consul agent -dev -config-dir=./consul.d -node=machine
$ consul join 127.0.0.1:30276
This just results in:
Error joining address '127.0.0.1:30276': Unexpected response code: 500 (1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
)
Failed to join any nodes.
and
2020/01/17 15:17:35 [WARN] agent: (LAN) couldn't join: 0 Err: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
2020/01/17 15:17:35 [ERR] http: Request PUT /v1/agent/join/127.0.0.1:30276, error: 1 error occurred:
* Failed to join 127.0.0.1: received invalid msgType (72), expected pushPullMsg (6) from=127.0.0.1:30276
from=127.0.0.1:59693
There must be a way to have a local consul agent running that can connect to the k8s consul server...
This is on a Mac, so networking isn't as good....
There may be two problems here. The first is that consul agent -dev starts the agent in dev mode. By default dev mode is going to start both a server and an agent. This might be part of the reason behind the error.
The other problem could be due to localhost: the server running in Kubernetes will attempt to health check local agents. It needs to be able to ping the local agent, so even if you manage to join in the first step, it would probably fail health checks.
I agree about networking on Mac, it does not make things easy. One thing you will probably have to do is set the advertise address for the local agent (non kube). Docker for mac has a host name docker.for.mac.localhost which is a routable ip to the local machine from a container. When starting the local agent, if you set the advertise address to the ip value of that host, the Kubernetes Consul server should be able to route to the locally running agent.
Potential fix:
1. Ensure local agent is starting in client mode (manually configure, not -dev)
2. Set the advertise address to an ip address which is routable from Kubernetes (docker.for.mac.localhost)
Give me a shout if that does not work for you, I have used a setup like this myself, 9/10 it is networking between Docker and the local machine.
Kind regards,
Nic
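An added example of my own (not code from either post), showing one more check that the question's output supports. The service table lists the UI service as 8500:30276/TCP, so the join went to the HTTP API port, not the Serf LAN port. The join error also carries a hint: memberlist reads a one-byte message type from the stream, and the byte it got, 72, is ASCII 'H', the first byte of an HTTP status line. The code below parses the service line into a NodePort-to-port map, decodes the error, and combines the two; the interpretation of byte 72 as an HTTP reply is my inference, not something the posts state.

```python
import re

# Constructed inputs copied from the question above.
SERVICE_LINES = [
    "wet-fish-consul ClusterIP None <none> 8500/TCP,8400/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP 0s",
    "wet-fish-consul-ui NodePort 10.110.229.223 <none> 8500:30276/TCP",
]
JOIN_ERROR = ("* Failed to join 127.0.0.1: received invalid msgType (72), "
              "expected pushPullMsg (6) from=127.0.0.1:30276")

SERF_LAN_PORT = 8301  # Consul's Serf LAN gossip port, visible in the ClusterIP service line


def node_port_map(service_lines):
    """Map (NodePort, proto) -> service port from entries like 8500:30276/TCP."""
    mapping = {}
    for line in service_lines:
        for svc_port, node_port, proto in re.findall(r"(\d+):(\d+)/(TCP|UDP)", line):
            mapping[(int(node_port), proto)] = int(svc_port)
    return mapping


def decode_join_error(message):
    """Pull the unexpected msgType byte, the expected type and the peer out of the error text."""
    m = re.search(r"received invalid msgType \((\d+)\), expected (\w+) \((\d+)\) "
                  r"from=([\d.]+):(\d+)", message)
    if not m:
        return None
    got = int(m.group(1))
    return {
        "got": got,
        "got_ascii": chr(got) if 32 <= got < 127 else None,  # printable => probably text protocol
        "expected": m.group(2),
        "expected_code": int(m.group(3)),
        "peer_host": m.group(4),
        "peer_port": int(m.group(5)),
    }


def diagnose(message, service_lines):
    """Combine both parsers into a short diagnosis string."""
    info = decode_join_error(message)
    if info is None:
        return "not a memberlist join error"
    notes = []
    if info["got_ascii"] == "H":
        # Inference: 'H' is how an HTTP status line starts, so the peer spoke HTTP, not gossip.
        notes.append("msgType byte 72 is ASCII 'H', consistent with an HTTP listener answering")
    target = node_port_map(service_lines).get((info["peer_port"], "TCP"))
    if target is not None and target != SERF_LAN_PORT:
        notes.append(f"NodePort {info['peer_port']} forwards to service port {target}, "
                     f"not the Serf LAN port {SERF_LAN_PORT}")
    return "; ".join(notes) or "no obvious port mismatch"


if __name__ == "__main__":
    print(diagnose(JOIN_ERROR, SERVICE_LINES))
```

The practical consequence is the same as Nic's advice: the agent must reach the server's 8301 (TCP and UDP), and the server must reach the agent's advertised address back, which a NodePort in front of 8500 does not provide.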

Can't resolve home dns from inside k8s pod

So I recently setup a single node kubernetes cluster on my home network. I have a dns server that runs on my router (DD-WRT, dnsmasq) that resolves a bunch of local domains for ease of use. server1.lan, for example, resolves to 192.168.1.11.
Server 1 was setup as my single node kubernetes cluster. Excited about the possibilities of local DNS, I spun up my first deployment using a docker container called netshoot which has a bunch of helpful network debugging tools bundled in. I exec'd into the container, and ran a ping and got the following...
bash-5.0# ping server1.lan
ping: server1.lan: Try again
It failed, then I tried pinging google's DNS (8.8.8.8) and that worked fine.
I tried to resolve the kubernetes default domain, it worked fine
bash-5.0# nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1
The /etc/resolv.conf file looks fine from inside the pod
bash-5.0# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
I then got to tailing the coredns logs, and I started seeing some interesting output...
2019-11-13T03:01:23.014Z [ERROR] plugin/errors: 2 server1.lan. AAAA: read udp 192.168.156.140:37521->192.168.1.1:53: i/o timeout
2019-11-13T03:01:24.515Z [ERROR] plugin/errors: 2 server1.lan. A: read udp 192.168.156.140:41964->192.168.1.1:53: i/o timeout
2019-11-13T03:01:24.515Z [ERROR] plugin/errors: 2 server1.lan. AAAA: read udp 192.168.156.140:33455->192.168.1.1:53: i/o timeout
2019-11-13T03:01:25.015Z [ERROR] plugin/errors: 2 server1.lan. AAAA: read udp 192.168.156.140:48864->192.168.1.1:53: i/o timeout
2019-11-13T03:01:25.015Z [ERROR] plugin/errors: 2 server1.lan. A: read udp 192.168.156.140:35328->192.168.1.1:53: i/o timeout
It seems like kubernetes is trying to communicate with 192.168.1.1 from inside the cluster network and failing. I guess CoreDNS uses whatever is in the resolv.conf on the host, so here is what that looks like.
nameserver 192.168.1.1
I can resolve server1.lan from everywhere else on the network, except these pods. My router IP is 192.168.1.1, and that is what is responding to DNS queries.
Any help on this would be greatly appreciated, it seems like some kind of IP routing issue between the kubernetes network and my real home network, or that's my theory anyways. Thanks in advance.
So it turns out the issue was that when I initiated the cluster, I specified a pod CIDR that conflicted with IPs on my home network. My kubeadm command was this
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-cert-extra-sans=server1.lan
Since my home network conflicted with that CIDR, and since my dns upstream was 192.168.1.1, it thought that was on the pod network and not on my home network and failed to route the DNS resolution packets appropriately.
The solution was to recreate my cluster using the following command,
sudo kubeadm init --pod-network-cidr=10.200.0.0/16 --apiserver-cert-extra-sans=server1.lan
And when I applied my calico yaml file, I made sure to replace the default 192.168.0.0/16 CIDR with the new 10.200.0.0/16 CIDR.
Hope this helps someone. Thanks.
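An added example (my own, not part of the post) that turns the lesson above into a check you can run before `kubeadm init`: does a candidate pod CIDR overlap the LAN, and does the upstream DNS address fall inside it? It uses the post's values (192.168.0.0/16, 10.200.0.0/16, router 192.168.1.1). The LAN mask (/24) and the service CIDR (kubeadm's default 10.96.0.0/12, consistent with the 10.96.0.1 and 10.96.0.10 addresses shown in the question) are my assumptions; the service-CIDR check goes a step beyond the post, which only hit the LAN conflict.

```python
import ipaddress

# Values taken from the post: the upstream DNS is the router. The LAN mask and the
# service CIDR below are assumptions (see lead-in), not stated in the post.
HOME_LAN = ipaddress.ip_network("192.168.1.0/24")
UPSTREAM_DNS = ipaddress.ip_address("192.168.1.1")
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")


def pod_cidr_conflicts(pod_cidr, lan=HOME_LAN, dns=UPSTREAM_DNS, service_cidr=SERVICE_CIDR):
    """Return a list of conflicts for a candidate pod CIDR; an empty list means it is safe."""
    pod = ipaddress.ip_network(pod_cidr, strict=True)
    problems = []
    if pod.overlaps(lan):
        problems.append(f"pod CIDR {pod} overlaps the LAN {lan}")
    if dns in pod:
        # This is exactly the failure in the post: the node treats the router as a pod
        # address and sends DNS traffic into the pod network instead of out to the LAN.
        problems.append(f"upstream DNS {dns} lies inside pod CIDR {pod}")
    if pod.overlaps(service_cidr):
        problems.append(f"pod CIDR {pod} overlaps the service CIDR {service_cidr}")
    return problems


def pick_safe_pod_cidr(candidates, **kwargs):
    """First candidate with no conflicts, or None."""
    for candidate in candidates:
        if not pod_cidr_conflicts(candidate, **kwargs):
            return candidate
    return None


if __name__ == "__main__":
    for cidr in ("192.168.0.0/16", "10.200.0.0/16"):
        print(cidr, pod_cidr_conflicts(cidr) or "ok")
    print("chosen:", pick_safe_pod_cidr(["192.168.0.0/16", "10.100.0.0/16", "10.200.0.0/16"]))
```

Remember to feed the chosen CIDR to the CNI manifest as well (the post replaces 192.168.0.0/16 in the Calico YAML), since kubeadm and the CNI each carry their own copy of it.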

Debugging DNS resolutions in kubernetes

I have initialized a kubernetes v1.13.1 cluster on Ubuntu 16.04 using the command below:
sudo kubeadm init --token-ttl=0 --apiserver-advertise-address=192.168.88.142
and installed weave using:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
I have 10 Raspberry Pis acting as worker nodes and connected to the cluster. All of them are running the deployment fine. These nodes are running pods which try to connect to the IoT hub visdwk.azure-devices.net and publish some data. Out of the 10 nodes, only a few are able to connect and the others throw the error "unable to connect to iot hub". I did a ping test and found out that they were not able to ping google by name while they could ping Google's public IP address.
This made me think that something is wrong with the coredns pod. I followed this documentation and did the test below.
Pod has below contents in /etc/resolv.conf
nameserver 10.96.0.10
search visdwk.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
which looks normal to me. All the coredns pods are running fine.
coredns-86c58d9df4-42xqc 1/1 Running 8 1d11h
coredns-86c58d9df4-p6d98 1/1 Running 7 1d6h
I have also done nslookup kubernetes.default from the busybox container and got the proper response. Below are the logs of coredns-86c58d9df4-42xqc
.:53
2019-02-08T08:40:10.038Z [INFO] CoreDNS-1.2.6
2019-02-08T08:40:10.039Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
[INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
The logs above also look normal.
I also can't say that the pod is unable to resolve the IoT hub because of an error from weave: if weave were throwing an error, I believe the pod would never start and would always be in a failed state, but in fact the pod remains in the running state. Please correct me here if I am wrong.
DNS service also seems to be in running state:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 1d6h
But still I am not able to figure out as to why few nodes in the cluster are not able to resolve the iot hub. Can anyone please give me some suggestions here. Please help. Thanks.
Logs from failed pod:
1550138544: New connection from 127.0.0.1 on port 1883.
1550138544: New client connected from 127.0.0.1 as 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (c1, k60).
1550138544: Sending CONNACK to 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (0, 0)
1550138544: Received PUBLISH from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138544: Received DISCONNECT from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504
1550138544: Client 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 disconnected.
1550138547: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138547: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138547: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138552: Error creating bridge: Try again.
1550138566: New connection from 127.0.0.1 on port 1883.
1550138566: New client connected from 127.0.0.1 as afb6cc2a-ee78-482e-aff0-fc595e06f86a (c1, k60).
1550138566: Sending CONNACK to afb6cc2a-ee78-482e-aff0-fc595e06f86a (0, 0)
1550138566: Received PUBLISH from afb6cc2a-ee78-482e-aff0-fc595e06f86a (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138566: Received DISCONNECT from afb6cc2a-ee78-482e-aff0-fc595e06f86a
1550138566: Client afb6cc2a-ee78-482e-aff0-fc595e06f86a disconnected.
1550138567: New connection from 127.0.0.1 on port 1883.
1550138567: New client connected from 127.0.0.1 as 01b9e135-fbc8-4d67-9962-356e8cf9f080 (c1, k60).
1550138567: Sending CONNACK to 01b9e135-fbc8-4d67-9962-356e8cf9f080 (0, 0)
1550138567: Received PUBLISH from 01b9e135-fbc8-4d67-9962-356e8cf9f080 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (755 bytes))
1550138567: Received DISCONNECT from 01b9e135-fbc8-4d67-9962-356e8cf9f080
1550138567: Client 01b9e135-fbc8-4d67-9962-356e8cf9f080 disconnected.
1550138578: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138583: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138583: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138588: Error creating bridge: Try again.
The pod is running a mosquitto container which tries to connect to visdwk.azure-devices.net and throws this error:
Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
Error creating bridge: Try again.
It would appear that one of your DNS Pods is not providing DNS services.
The evidence is in the statement that "only few nodes are able to connect and other throws error unable to connect to iot hub"
This is a classic symptom of load-balancing with a failed node in the loop.
Try:
Remove the DNS server pod that gave the message: visdwk.azure-devices.net.visdwknamespace.svc.cluster.local. udp 82 false 512" NXDOMAIN qr,aa,rd,ra 175 0.000651078s where visdwk.azure-devices.net
Wait for the changes to propagate through the cluster.
Test the connections.
If this is correct they should all connect.
To confirm, add the pod back and remove the other one. Retest, they should all fail to connect.
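An added example of my own (not from either thread) that explains the NXDOMAIN line quoted in the answer and the `kubernetes.default` lookup in the previous thread with one piece of logic: the resolver's search list and `ndots`. With `options ndots:5`, any name with fewer than five dots is first tried with each search suffix appended, and only then as an absolute name. So `visdwk.azure-devices.net` is first asked as `visdwk.azure-devices.net.visdwk.svc.cluster.local.`, which CoreDNS (authoritative for cluster.local) answers with NXDOMAIN, and that is expected; the query that matters is the last, absolute one. The resolv.conf text below is copied from the question; the ordering rules follow glibc and are simplified (no retry-on-error handling).

```python
# Constructed input: the pod's /etc/resolv.conf as quoted in the question above.
POD_RESOLV_CONF = """\
nameserver 10.96.0.10
search visdwk.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""


def parse_resolv_conf(text):
    """Minimal resolv.conf parser: nameserver list, search list and the ndots option."""
    nameservers, search, ndots = [], [], 1  # ndots defaults to 1 when the option is absent
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith(("#", ";")):
            continue
        key, args = parts[0], parts[1:]
        if key == "nameserver":
            nameservers.extend(args)
        elif key in ("search", "domain"):
            search = args
        elif key == "options":
            for opt in args:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return {"nameserver": nameservers, "search": search, "ndots": ndots}


def query_order(name, search, ndots):
    """Fully qualified names the stub resolver tries, in order, for `name`.

    A trailing dot marks an absolute name: it is queried as-is and nothing else.
    Otherwise names with fewer than `ndots` dots try the search suffixes first
    and the bare name last; names with at least `ndots` dots do the reverse.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{suffix}." for suffix in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


def upstream_queries(name, search, ndots, local_zone="cluster.local."):
    """Of the candidate names, only those outside the cluster zone leave the cluster
    (CoreDNS answers the cluster.local ones itself). Useful to count upstream round trips."""
    return [q for q in query_order(name, search, ndots) if not q.endswith("." + local_zone)]


if __name__ == "__main__":
    conf = parse_resolv_conf(POD_RESOLV_CONF)
    for q in query_order("visdwk.azure-devices.net", conf["search"], conf["ndots"]):
        print(q)
```

A direct consequence: pinning a pod to `ndots:2` or writing external names with a trailing dot (`visdwk.azure-devices.net.`) skips the three suffixed lookups, which also reduces the number of queries a flaky DNS pod gets to break.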

Cassandra client connection issue within Docker from an application container

batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.035 [cluster1-nio-worker-0] Connection - Connection[/172.17.0.3:9042-1, inFlight=0, closed=false] Error connecting to /172.17.0.3:9042 (connection timed out: /172.17.0.3:9042)
batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.037 [cluster1-nio-worker-0] STATES - Defuncting Connection[/172.17.0.3:9042-1, inFlight=0, closed=false] because: [/172.17.0.3:9042] Cannot connect
batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.038 [cluster1-nio-worker-0] STATES - [/172.17.0.3:9042] preventing new connections for the next 1000 ms
batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.038 [cluster1-nio-worker-0] STATES - [/172.17.0.3:9042] Connection[/172.17.0.3:9042-1, inFlight=0, closed=false] failed, remaining = 0
batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.039 [cluster1-nio-worker-0] Connection - Connection[/172.17.0.3:9042-1, inFlight=0, closed=true] closing connection
batchWorker_1 | [DEBUG] 2017-10-30 12:42:10.042 [main] ControlConnection - [Control connection] error on /172.17.0.3:9042 connection, no more host to try
batchWorker_1 | com.datastax.driver.core.exceptions.TransportException: [/172.17.0.3:9042] Cannot connect
batchWorker_1 | at com.datastax.driver.core.Connection$1.operationComplete(Connection.java:165) ~[batch_worker_server.jar:0.01]
batchWorker_1 | at com.datastax.driver.core.Connection$1.operationComplete(Connection.java:148) ~[batch_worker_server.jar:0.01]
...
I am running my application and a Cassandra container, trying to establish connection from application container to Cassandra container.
I tried with docker-compose. It throws the same error. It is able to resolve the right container IP (as you can see) but failing to connect.
I tried to run by starting the Cassandra container separately and hardcoding the IP in my application container; it still fails.
The Cassandra container works fine; if I run the same app outside, it connects.
The issue is that, it is not able to resolve the Cassandra container IP from the application container. Not sure why.
I also enabled start_rpc and exposed all cassandra related ports. Still no luck.
Issue
[Control connection] error on /IP:9042 connection, no more host to try
Solution
Silly but important check "App tries to connect before cassandra is up"
RPC_ADDRESS Default IP Issue
Containers have their own network, so every container takes its own IP. As your RPC_ADDRESS is set to the container IP, it will throw this error.
In the Cassandra configuration, change the IP address of
rpc_address (Default: localhost) The listen address for client connections (Thrift RPC service and native transport).
Valid values:
1. unset: Resolves the address using the configured hostname configuration of the node. If left unset, the hostname resolves to the IP address of this node using /etc/hostname, /etc/hosts, or DNS.
2. 0.0.0.0: Listens on all configured interfaces. You must set the broadcast_rpc_address to a value other than 0.0.0.0.
3. IP address
4. hostname
RPC_ADDRESS=0.0.0.0
Here is the docker-compose.yml file
version: '2'
services:
  cassandra:
    container_name: cassandra
    image: cassandra:3.9
    volumes:
      - /path/of/host/for/cassandra/:/var/lib/cassandra/
    ports:
      - 7000:7000
      - 7001:7001
      - 7199:7199
      - 9042:9042
      - 9160:9160
    environment:
      # unquoted on purpose: in list form the quotes would become part of the value
      - CASSANDRA_CLUSTER_NAME=cassandra-cluster
      - CASSANDRA_NUM_TOKENS=256
      - CASSANDRA_RPC_ADDRESS=0.0.0.0
    restart: always
You can create Docker network as described in documentation, and connect Cassandra & your application to same network.
You also need to check on what interfaces Cassandra is listening - is it single interface, or all?

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master who does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not the master. When I check the syslogs for the nodes I see the following error
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three node Swarm Mode cluster in AWS; one master and two workers. This is swarm mode not to be confused with docker swarm from pre 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the nodes who do not have the task running I'm getting connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
The nodes are both properly connected to the swarm
Scaling the service up to 5 (which inherently deploys the task to every node) makes curl work on every node, because the task is deployed to every node.
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network (port 7946) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to the port in the syslog.
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason docker refused to open this port and listen any more. Here is what I did (albeit undesirable) to circumvent the issue:
1. Created another node with docker-machine called swarm-master-02
2. Joined swarm-master-02 to the cluster as a master
3. Demoted master-01, which set master-02 as the leader
4. Restarted the docker daemon on each node (might not have been necessary)
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
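Two of the threads above come down to the same probe: the Cassandra answer's "silly but important check" (the app connects before Cassandra is up on 9042) and this swarm post's discovery that the manager was not listening on 7946. The added example below (mine, not from any of the posts) is a TCP reachability probe with a retry loop that serves both: `wait_for_port` is what an application entrypoint can call before opening its driver connection, and `check_ports` is a one-shot sweep over the ports a swarm manager must accept (2377 and 7946, both named in the post). `demo()` exercises it against a local listener so it runs offline; port numbers other than those quoted in the threads are not assumed.

```python
import socket
import time


def tcp_reachable(host, port, timeout=1.0):
    """One TCP connect attempt. Returns (ok, reason) so callers can distinguish
    'connection refused' (nothing listening) from 'timeout' (filtered or unroutable)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timeout"
    except OSError as exc:
        return False, exc.strerror or str(exc)


def wait_for_port(host, port, total_timeout=30.0, interval=0.5):
    """Retry until the port accepts connections or total_timeout elapses.
    Use this at application start instead of connecting once and crashing."""
    deadline = time.monotonic() + total_timeout
    while True:
        ok, _reason = tcp_reachable(host, port, timeout=min(interval, 1.0))
        if ok:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Ports quoted in the swarm post: 2377 (cluster management) and 7946 (node communication).
SWARM_MANAGER_PORTS = (2377, 7946)


def check_ports(host, ports=SWARM_MANAGER_PORTS, timeout=1.0):
    """Probe each port once; returns {port: (ok, reason)}."""
    return {port: tcp_reachable(host, port, timeout=timeout) for port in ports}


def demo():
    """Self-contained demonstration against a local listener (no real network needed):
    one port that is listening and one that was just released, so it is refused."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    open_port = listener.getsockname()[1]

    scratch = socket.socket()
    scratch.bind(("127.0.0.1", 0))
    closed_port = scratch.getsockname()[1]
    scratch.close()  # nothing listens here now, so connects are refused

    try:
        return {
            "open": wait_for_port("127.0.0.1", open_port, total_timeout=2.0, interval=0.1),
            "closed": tcp_reachable("127.0.0.1", closed_port),
            "sweep": check_ports("127.0.0.1", (open_port, closed_port)),
        }
    finally:
        listener.close()


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

Note the distinction the probe preserves: "connection refused" is what the post saw on the manager (the daemon was simply not listening), whereas an "i/o timeout" like the memberlist lines in the syslog usually means a security group, firewall or route is dropping the traffic between nodes.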
