Joining a Docker swarm

I have 2 VMs.
On the first I run:
docker swarm join-token manager
On the second I run the result from this command.
i.e.
docker swarm join --token SWMTKN-1-0wyjx6pp0go18oz9c62cda7d3v5fvrwwb444o33x56kxhzjda8-9uxcepj9pbhggtecds324a06u 192.168.65.3:2377
However, this outputs:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.65.3:2377: connect: connection refused"
Any idea what's going wrong?
If it helps I'm spinning up these VMs using Vagrant.

Just add the port to the firewall on the manager (master) side:
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --reload
Then try docker swarm join again on the second VM (the worker node).
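A quick way to confirm the rule took effect and that the worker can actually reach the manager (a small sketch; replace 192.168.65.3 with your manager's IP):
# on the manager: confirm 2377/tcp is now allowed
firewall-cmd --list-ports
# on the worker: test connectivity to the swarm management port
nc -zv 192.168.65.3 2377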

I was facing a similar issue and spent a couple of hours figuring out the root cause, so I'm sharing it for anyone who hits the same thing.
Environment:
Oracle Cloud + AWS EC2 (2 + 2)
OS: Ubuntu 20.04.2
Docker version: 20.10.8
3 dynamic public IPs + 1 elastic IP
Issues
1. Created two instances on Oracle Cloud at the beginning.
2. Instance A (manager): docker swarm init --advertise-addr succeeded.
3. Instance B (worker): docker swarm join as a worker succeeded.
4. When I tried to promote B to manager, I got the error:
   Unable to connect to remote host: No route to host
5. Mesh routing was not working properly.
Investigation
1. Suspected it was related to the network/firewall/security group/security list.
2. SSH'd to server B (worker) and ran telnet <manager> 2377, which failed with the same error:
   Unable to connect to remote host: No route to host
3. Logged in to the Oracle console and added ingress rules under the security list for all the relevant ports:
   TCP port 2377 for cluster management communications
   TCP and UDP port 7946 for communication among nodes
   UDP port 4789 for overlay network traffic
4. Tried again, but telnet still failed with the same error.
5. Checked the OS-level firewall and disabled it:
   sudo ufw disable
6. Tried again, but still got the same result.
7. Suspected something was wrong with Oracle Cloud, so I decided to try AWS with the same OS and Docker versions.
8. Added a security group allowing all the relevant ports/protocols and disabled ufw.
9. Tested with AWS instances C (leader/manager) + D (worker). It worked, D could be promoted to manager, and mesh routing also worked.
10. Confirmed the issue was with Oracle Cloud.
11. Tried to join the Oracle instance (A) to C as a worker. It worked, but A still could not be promoted to manager.
12. Used journalctl -f to investigate the logs and confirmed there were socket timeouts from A/B (Oracle instances) to the AWS instance (C).
13. Looked at A/B again and found iptables rules blocking the requests.
14. Removed all of the iptables rules:
    # remove the rules
    iptables -P INPUT ACCEPT
    iptables -P OUTPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -F
Root Cause
It was caused by a firewall, either at the cloud level (security list/WAF/ACL/security group) or at the OS level (e.g. ufw/iptables rules).
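A short checklist along those lines, run on the manager and each node (a rough sketch; <manager-ip> is a placeholder):
# OS-level firewall state
sudo ufw status
sudo iptables -L -n
# from the other node: can the swarm ports be reached? (the UDP checks are only indicative)
nc -zv <manager-ip> 2377
nc -zv <manager-ip> 7946
nc -zvu <manager-ip> 7946
nc -zvu <manager-ip> 4789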

I had already run firewall-cmd --add-port=2377/tcp --permanent and firewall-cmd --reload on the master side and was still getting the same error.
I ran telnet <master ip> 2377 on the worker node, then rebooted the master.
After that it worked fine.

It looks like your swarm manager leader is not listening on port 2377. You can check by running this command on your swarm manager leader VM; if it is working fine you will get output similar to this:
[root@host1]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
tilzootjbg7n92n4mnof0orf0 * host1 Ready Active Leader
Furthermore, you can check the listening ports on the leader swarm manager node. TCP port 2377 (cluster management communications) and TCP/UDP port 7946 (communication among nodes) should be open:
[root@host1]# netstat -ntulp | grep dockerd
tcp6 0 0 :::2377 :::* LISTEN 2286/dockerd
tcp6 0 0 :::7946 :::* LISTEN 2286/dockerd
udp6 0 0 :::7946 :::* 2286/dockerd
On the second VM, where you are configuring the second swarm manager, you will have to make sure you have connectivity to port 2377 of the leader swarm manager. You can use tools like telnet, wget, or nc to test the connectivity, as shown below:
[root@host2]# telnet <swarm manager leader ip> 2377
Trying 192.168.44.200...
Connected to 192.168.44.200.
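If telnet is not available, nc gives an equivalent check (a quick sketch; substitute your leader's IP):
[root@host2]# nc -zv 192.168.44.200 2377
A successful connection prints a "succeeded"/"Connected" message; "Connection refused" or a timeout points back at the daemon or a firewall.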

For me, I was on Linux and Windows. My Windows Docker private network was using the same subnet as my local network, so the Docker daemon couldn't find the master on its own network at the address I was giving it.
So I did:
1. Go to the Docker Desktop app.
2. Go to Settings.
3. Go to Resources.
4. Go to the Network section and change the Docker subnet address (it needs to be different from your local subnet address).
5. Apply and restart.
6. Run docker swarm join on the worker again.
Note: all these steps are performed on the node where the error appears. Also make sure that ports 2377, 7946 and 4789 are open on the master (you can use iptables or ufw).
Hope it works for you.
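For the note above, a minimal sketch of opening the swarm ports on the master with ufw (iptables equivalents work too; adjust to your environment):
sudo ufw allow 2377/tcp   # cluster management
sudo ufw allow 7946/tcp   # node-to-node communication
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp   # overlay (VXLAN) traffic
sudo ufw reload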

Related

kubeadm join times out on non-default NIC/IP

I am trying to configure a K8s cluster on-prem and the servers are running Fedora CoreOS using multiple NICs.
I am configuring the cluster to use a non-default NIC: a bond defined over 2 interfaces. All servers can reach each other over that interface and have HTTP + HTTPS connectivity to the internet.
kubeadm join hangs at:
I0513 13:24:55.516837 16428 token.go:215] [discovery] Failed to request cluster-info, will try again: Get https://${BOND_IP}:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The relevant kubeadm init config looks like this:
[...]
localAPIEndpoint:
  advertiseAddress: ${BOND_IP}
  bindPort: 6443
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
    node-ip: ${BOND_IP}
  criSocket: /var/run/dockershim.sock
  name: master
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
[...]
The join config that am using looks like this:
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: ${TOKEN}
    caCertHashes:
    - "${SHA}"
    apiServerEndpoint: "${BOND_IP}:6443"
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    runtime-cgroups: "/systemd/system.slice"
    kubelet-cgroups: "/systemd/system.slice"
If I am trying to configure it using default eth0, it works without issues.
This is not a connectivity issue. The port test works fine:
# nc -s ${BOND_IP_OF_NODE} -zv ${BOND_IP_OF_MASTER} 6443
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Connected to ${BOND_IP_OF_MASTER}:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
I suspect this happens due to kubelet listening on eth0, if so, can I change it to use a different NIC/IP?
LE: The eth0 connection has been cut off completely (cable out, interface down, connection down).
Now, when we init, if we choose 0.0.0.0 as the kube-api advertise address, it defaults to the bond, which is what we wanted initially:
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 0.0.0.0
result:
[certs] apiserver serving cert is signed for DNS names [emp-prod-nl-hilv-quortex19 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.0.0.1 ${BOND_IP}]
I have even added an ACCEPT rule for port 6443 in iptables and it still times out. All my Calico pods are up and running (all pods in the kube-system namespace, for that matter).
LLE:
I have tested calico and weavenet and both show the same issue. The api-server is up and can be reached from the master using curl but it times out from the nodes.
LLLE:
On the premise that the kube-api is nothing but an HTTPS server, I have tried two options from the node that cannot reach it when doing the kubeadm join:
Ran a Python 3 simple HTTP server on 6443 and WAS ABLE TO CONNECT from the node
Ran an nginx pod exposed on another port as a NodePort and WAS ABLE TO CONNECT from the node
The node just can't reach the api-server on 6443, or any other port for that matter.
What am I doing wrong?
The cause:
The interface used was a bond in ACTIVE-ACTIVE mode. This apparently made kubeadm use the other of the two bonded interfaces, which was not in the same subnet as the advertised server IP.
Switching to ACTIVE-PASSIVE did the trick and I was able to join the nodes.
LE: If anyone knows why kubeadm join does not work with LACP/ACTIVE-ACTIVE bond setups on Fedora CoreOS, please advise here. Otherwise, if additional configuration is required, I would very much like to know what I have missed.
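For reference, a rough sketch of what an active-backup (ACTIVE-PASSIVE) bond can look like when configured with nmcli (an assumed setup; eno1/eno2 and the connection names are placeholders, not taken from the post):
# create the bond in active-backup mode
nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=active-backup,miimon=100"
# enslave the two physical interfaces
nmcli con add type ethernet con-name bond0-port1 ifname eno1 master bond0
nmcli con add type ethernet con-name bond0-port2 ifname eno2 master bond0
nmcli con up bond0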

Docker swarm worker behind NAT

I want to have a worker node on a server that is behind NAT (i.e. it can't expose ports publicly). I thought this wasn't a problem, but it turns out to be one:
On this server behind the NAT I run:
docker swarm join --token SWMTKN-1... X.X.X.X:2377
Which in turn adds the server to the swarm. I am not sure where the "internal" IP address comes from, but in traefik I then see a new server, http://10.0.1.126:8080 (10.0.1.126 is definitely not the public IP). If I exec inside the traefik container:
docker exec -it 80f9cb33e24c sh
I can ping every server/node/worker in the list on traefik apart from the new one. Why?
When joining the swarm like this on the worker behind the vpn:
docker swarm join --advertise-addr=tun0 --token SWMTKN-1-... X.X.X.X:2377
I can see a new peer on my network from the manager:
$ docker network inspect traefik
...
"Peers": [
...
{
"Name": "c2f01f1f1452",
"IP": "12.0.0.2"
}
]
where 12.0.0.2 is the address on tun0, the VPN interface from the manager to the server behind the NAT. Unfortunately, when I then run:
$ nmap -p 2377,2376,4789,7946 12.0.0.2
Starting Nmap 7.70 ( https://nmap.org ) at 2020-05-04 11:01 EDT
Nmap scan report for 12.0.0.2
Host is up (0.017s latency).
PORT STATE SERVICE
2376/tcp closed docker
2377/tcp closed swarm
4789/tcp closed vxlan
7946/tcp open unknown
I can see that the ports are closed on the Docker worker, which is weird. Why?
Also if I use nmap -p 8080 10.0.1.0/24 inside the traefik container on the manager I get:
Nmap scan report for app.6ysph32io2l9q74g6g263wed3.mbnlnxusxv2wz0pa2njpqg2u1.traefik (10.0.1.62)
Host is up (0.00033s latency).
PORT STATE SERVICE
8080/tcp open http-proxy
on a successful swarm worker, which has the internal network IP 10.0.1.62,
but I get:
Nmap scan report for app.y7odtja923ix60fg7madydia3.jcfbe2ke7lzllbvb13dojmxzq.traefik (10.0.1.126)
Host is up (0.00065s latency).
PORT STATE SERVICE
8080/tcp filtered http-proxy
on the new swarm node. Why is it filtered? What am I doing wrong?
I'm adding this here as it's a bit longer.
I don't think it's enough for only the manager and the remote node to be able to communicate; nodes need to be able to communicate between themselves.
Try to configure the manager (which is connected to the VPN) to route packets to and from the remote worker through the VPN, and add the needed routes on all nodes (including the remote one).
Something like:
# Manager
sysctl -w net.ipv4.ip_forward=1 # if you use systemd you might need extra steps
# Remote node
ip route add LOCAL_NODES_SUBNET via MANAGER_TUN_IP dev tun0
# Local nodes
ip route add REMOTE_NODE_TUN_IP/32 via MANAGER_IP dev eth0
If the above works correctly you need to make the routing changes above permanent.
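How to persist those routes depends on the distribution; as one hedged example, with ifupdown on Debian/Ubuntu you could attach the route to the interface stanza and persist IP forwarding via sysctl (placeholders kept from above):
# /etc/network/interfaces, under the relevant iface stanza on each local node:
#   post-up ip route add REMOTE_NODE_TUN_IP/32 via MANAGER_IP dev eth0
# on the manager, persist forwarding:
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-swarm-forwarding.conf
sudo sysctl --system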
To find the IP addresses for all your nodes run this command on the manager:
for NODE in $(docker node ls --format '{{.Hostname}}'); do echo -e "${NODE} - $(docker node inspect --format '{{.Status.Addr}}' "${NODE}")"; done

Docker Swarm: Getting connection refuse while adding worker node

I just started learning Docker and I am facing the challenge below; please let me know where I am going wrong.
My use case: Set up docker swarm manager and add worker node to it.
Step 1: To create the swarm manager, I used the command below:
docker swarm init --advertise-addr <<ip_address>>
Step 2: Run the command below, which gives you the docker command to add a worker:
docker swarm join-token worker
After running the above command, I got this output:
docker swarm join --token SWMTKN-1-653srs28a6s48dqxnak9g9kic2cd1xyeowgnke53nf83710wfv-7u7u7u1vovahvn792814q2sts ip_address:2377
Step 3: I logged in to the worker node and ran the above docker swarm join command, but I am getting the error message below.
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection er
ror: desc = "transport: Error while dialing dial tcp ip_address:2377: connect: connection refused"
This could well be a firewall issue. Make sure ports 2377, 7946 and 4789 are open between the hosts acting as manager or worker nodes.
From the docs:
Open protocols and ports between the hosts. The following ports must be available:
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
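To rule out the other common cause (the manager not actually listening), something like this on the manager shows whether dockerd is bound to the swarm ports (a sketch, assuming ss is available):
sudo ss -tulpn | grep -E ':2377|:7946|:4789'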

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master, which does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not on the master. When I check the syslogs on the nodes I see the following error:
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three-node Swarm Mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl a node that does not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
The nodes are both properly connected to the swarm
Scaling the service up to 5 (which inherently deploys the task to every node) makes curl work on every node, because the task is deployed to every node.
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network port (7946) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to that port:
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason docker refused to open this port and listen any more. Here is what I did (albeit undesirable) to circumvent the issue:
Created another node with docker-machine called swarm-master-02
Joined swarm-master-02 to the cluster as a master
Demoted master-01 which set master-02 as the leader
Restarted the docker daemon on each node (might not have been necessary)
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
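The remove-and-rejoin sequence described above roughly corresponds to the following (a sketch; the token and manager address are placeholders):
# on the broken node
docker swarm leave --force
sudo systemctl restart docker
# on a healthy manager: clean up the stale entry and get a fresh manager join token
docker node rm <old-node-id>
docker swarm join-token manager
# back on the broken node
docker swarm join --token <manager-token> <healthy-manager-ip>:2377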

Docker1.12 Worker not able to join in cluster(Swarm: Pending)

Manager version: Docker version 1.12.0-rc5, build a3f2063
Worker version: Docker version 1.12.0-rc5, build a3f2063
Created the Swarm manager:
docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (3kmewyb10p8xj3ke5rpjyw4s8) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-85cwe5pf779qw0knjn6wxdbim \
172.25.30.2:4243
Then created the worker:
docker swarm join --token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq 172.25.30.2:4243
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
I checked the logs on the worker:
time="2016-08-01T00:22:47.449844174-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.449962215-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450025342-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450081950-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450142443-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450202836-07:00" level=error msg="cluster exited with error: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:23:31.351868722-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
In docker info, I saw "Swarm: Pending".
I also tried docker swarm update. Still, the worker was not able to join the cluster. So, how can I resolve this?
UPDATE 1
Uninstalled Docker, removed the config files, and then installed Docker 1.12 again (Docker version 1.12.0, build 8eab29e).
Still facing the same problem (not able to join, and "Swarm: Pending" in docker info), but with a DIFFERENT error in /var/log/upstart/docker.log:
time="2016-08-01T11:22:08.629760770-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Thanks.
The thing is, I was trying to join using the wrong port (the one shown in the docker swarm init output).
1) Before docker swarm init, the Docker daemon was listening on port 4243 only (checked with netstat -tulp | grep docker), so I advertised with that port:
root@veeru:~# netstat -tulpn | grep docker
tcp6 0 0 :::4243 :::* LISTEN 8750/dockerd
root@veeru:~# docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (exvwgj0pu4cd124ljnblt9xff) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-axhgqgo4jqw4hv38x578m44wh \
172.25.30.2:4243
2) After docker swarm init, Docker is listening on four ports, including port 2377 (netstat -tulpn | grep docker):
root@veeru:~# netstat -tulp | grep docker
tcp6 0 0 [::]:2377 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:7946 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:4243 [::]:* LISTEN 8750/dockerd
udp6 0 0 [::]:7946 [::]:* 8750/dockerd
In point 1, the output tells you to run docker swarm join with port 4243 on the worker. That's what I did at first, and it won't work!
Later I did docker swarm leave and joined using port 2377 instead. Now I am able to join:
docker swarm join --token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas 172.25.30.2:2377
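Note that 4243 here is the Docker daemon's API port, not the swarm management port. If you advertise without an explicit port, swarm falls back to its default of 2377, which avoids this confusion (a sketch using this post's manager IP):
docker swarm init --advertise-addr 172.25.30.2
# the join command printed by this init will then point at 172.25.30.2:2377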
For me it was a firewall issue too.
I could ping the manager node and got replies back, but when I checked whether the ports were open using telnet I was not able to connect, which is how I figured out it was a port issue.
If you are running CentOS, the port can be easily opened using firewalld.
Check if firewalld is running:
sudo firewall-cmd --state
Open the port you want:
sudo firewall-cmd --zone=public --add-port=2377/tcp
Change the port according to the port your node is trying to connect to.
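Note that without --permanent the rule is lost on the next firewalld reload or reboot; to keep it (same pattern as the earlier answer):
sudo firewall-cmd --zone=public --add-port=2377/tcp --permanent
sudo firewall-cmd --reload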
Just expose port 2377 on the manager and it will work.
The error clearly means the node is unable to connect to the manager, hence the timeout. To confirm this, run telnet manager-ip 2377 (don't rely on ping; it won't tell you anything about the port).
If you are facing the same error even though all firewalls are disabled on both the nodes and the manager, then try to create another manager exposing port 2377, as below:
docker-machine create --driver amazonec2 --amazonec2-open-port 2377 manager1
Now try to join the nodes to the newly created manager. The port you use to join should be 2377; if you want to use a different one, expose that port in the command above instead. Doing this worked for me. I suspect the others were using separate servers, whereas I was using the same server for both manager and nodes.
According to Docker's website, these are the ports to enable.
Run the following commands on both the swarm manager and the worker nodes:
sudo ufw enable
sudo ufw allow 22/tcp
sudo ufw allow 2376/tcp
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp
sudo ufw reload
We just opened the necessary ports. After running these commands, the docker swarm commands should now work.
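To double-check which rules are active after the reload (a quick sketch):
sudo ufw status verbose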
I was facing a similar issue; in my case the port was being blocked by a firewall rule.
I was having the same issue. I was running CoreOS VMs in Azure. I found out that all my VMs had the same private IP address and different public IP addresses. This usually happens when the VMs are part of the same security group; however, that was not the case this time. The issue was that my account had reached the maximum number of resources, so I deleted resources such as IP addresses, NSGs, networks, etc. and then re-provisioned new VMs. They had different private IPs, and when I ran the command everything was fine. My Docker version is 1.12.6.
Assuming you did so: if you get "Connection timed out", it means there is a firewall preventing you from connecting, either on the source host, on the destination host (e.g. iptables rules), or somewhere in between.
If you are running on a public cloud, make sure that the access lists (e.g. EC2 security groups) allow connections between the hosts on that port.
I was trying to connect 4 nodes (1 master, 3 workers) on EC2 Ubuntu Server AMI instances, and for me it was a firewall issue.
Check your security group's Inbound rules; mine was set to a custom source, and changing it to Anywhere made it work.
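If you prefer not to open the ports to Anywhere, a more targeted option is to allow just the swarm ports; with the AWS CLI that could look like this (a sketch; the security group ID and CIDR are placeholders, and using the security group itself as the source is usually tighter still):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 2377 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 7946 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 7946 --cidr 10.0.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 4789 --cidr 10.0.0.0/16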
