Docker swarm worker behind NAT - docker

I want to run a worker node on a server that is behind a NAT (i.e. it can't expose ports publicly). I thought this wouldn't be a problem, but it turns out to be one:
On this server behind the NAT I run:
docker swarm join --token SWMTKN-1... X.X.X.X:2377
Which in turn adds the server to the swarm. I am not sure where the "internal" IP address comes from, but Traefik then shows a new server, http://10.0.1.126:8080 (10.0.1.126 is definitely not the public IP). If I exec into the Traefik container:
docker exec -it 80f9cb33e24c sh
I can ping every server/node/worker in the list on traefik apart from the new one. Why?
When joining the swarm like this on the worker behind the vpn:
docker swarm join --advertise-addr=tun0 --token SWMTKN-1-... X.X.X.X:2377
I can see a new peer on my network from the manager:
$ docker network inspect traefik
...
"Peers": [
...
{
"Name": "c2f01f1f1452",
"IP": "12.0.0.2"
}
]
where 12.0.0.2 is the address on tun0, the VPN interface between the manager and the server behind the NAT. Unfortunately, when I then run:
$ nmap -p 2377,2376,4789,7946 12.0.0.2
Starting Nmap 7.70 ( https://nmap.org ) at 2020-05-04 11:01 EDT
Nmap scan report for 12.0.0.2
Host is up (0.017s latency).
PORT STATE SERVICE
2376/tcp closed docker
2377/tcp closed swarm
4789/tcp closed vxlan
7946/tcp open unknown
I can see that most of these ports are closed on the Docker worker, which is weird.
Also if I use nmap -p 8080 10.0.1.0/24 inside the traefik container on the manager I get:
Nmap scan report for app.6ysph32io2l9q74g6g263wed3.mbnlnxusxv2wz0pa2njpqg2u1.traefik (10.0.1.62)
Host is up (0.00033s latency).
PORT STATE SERVICE
8080/tcp open http-proxy
on a successful swarm worker, which has the internal network IP 10.0.1.62,
but I get:
Nmap scan report for app.y7odtja923ix60fg7madydia3.jcfbe2ke7lzllbvb13dojmxzq.traefik (10.0.1.126)
Host is up (0.00065s latency).
PORT STATE SERVICE
8080/tcp filtered http-proxy
on the new swarm node. Why is it filtered? What am I doing wrong?

I'm adding this here as it's a bit longer.
I don't think it's enough for only the manager and the remote node to be able to communicate; nodes need to be able to communicate between themselves.
Try to configure the manager (which is connected to the VPN) to route packets to and from the remote worker through the VPN, and add the needed routes on all nodes (including the remote one).
Something like:
# Manager
sysctl -w net.ipv4.ip_forward=1 # if you use systemd you might need extra steps
# Remote node
ip route add LOCAL_NODES_SUBNET via MANAGER_TUN_IP dev tun0
# Local nodes
ip route add REMOTE_NODE_TUN_IP/32 via MANAGER_IP dev eth0
If the above works, make these routing changes permanent so that they survive a reboot.
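One way to persist the forwarding setting is sketched below. It assumes the manager loads drop-in files from /etc/sysctl.d (typical on systemd-based distributions); the function name persist_ip_forward and the file name 99-swarm-forward.conf are made up for illustration. Persisting the ip route entries is distribution-specific and is not covered here.

```shell
#!/usr/bin/env bash
# Write a sysctl drop-in that enables IPv4 forwarding at boot.
# The target directory defaults to /etc/sysctl.d and can be overridden
# (useful for trying the function out without root).
# The file name 99-swarm-forward.conf is a hypothetical choice.
persist_ip_forward() {
    local conf_dir="${1:-/etc/sysctl.d}"
    printf 'net.ipv4.ip_forward = 1\n' > "${conf_dir}/99-swarm-forward.conf"
}
```

Run persist_ip_forward as root on the manager, then apply it without rebooting with sysctl --system.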
To find the IP addresses for all your nodes run this command on the manager:
for NODE in $(docker node ls --format '{{.Hostname}}'); do echo -e "${NODE} - $(docker node inspect --format '{{.Status.Addr}}' "${NODE}")"; done

Related

Dockerized Zabbix: Server Can't Connect to the Agents by IP

Problem:
I'm trying to configure a fully containerized Zabbix 6.0 monitoring system on Ubuntu 20.04 LTS using Zabbix's Docker-Compose repo.
The command I used to raise the Zabbix server and also a Zabbix Agent is:
docker-compose -f docker-compose_v3_ubuntu_pgsql_latest.yaml --profile all up -d
Although the Agent comes up in a broken state and shows a "red" status, when I change its IP address from 127.0.0.1 to 172.16.239.6 (the default IP Docker-Compose assigns to it) the Zabbix Server can connect and monitoring is established. However, the Zabbix Server cannot connect to any other Dockerized Zabbix Agents on REMOTE hosts, which are started with the docker run command:
docker run --add-host=zabbix-server:172.16.238.3 -p 10050:10050 -d --privileged --name DockerHost3-zabbix-agent -e ZBX_SERVER_HOST="zabbix-server" -e ZBX_PASSIVE_ALLOW="true" zabbix/zabbix-agent:ubuntu-6.0-latest
NOTE: I looked at other Stack groups to post this question, but Stackoverflow appeared to be the go-to group for these Docker/Zabbix issues having over 30 such questions.
Troubleshooting:
Comparative Analysis:
Agent Configuration:
I compared the working ("green") Agent on the same host as the Zabbix Server with the Agents on other hosts showing "red" statuses (not contactable by the Zabbix Server). The configurations have parity. To check, open a shell in the Agent container:
docker exec -u root -it (ID of agent container returned from "docker ps") bash
And then execute:
grep -Ev ^'(#|$)' /etc/zabbix/zabbix_agentd.conf
Ports:
The same ports were open on the "red" Agents as on the "green" Agent running on the same host as the Zabbix Server, according to the output of:
ss -luntu
NOTE: This command was issued from the HOST, not the Docker container for the Agent.
Firewalling:
Review of the iptables rules from the HOST (not container) using the following command didn't reveal anything of concern:
iptables -nvx -L --line-numbers
But to exclude firewalling, I nonetheless allowed everything in the iptables FORWARD chain on both the Zabbix Server host and an Agent in a "red" status used for testing.
I also allowed everything on the MikroTik GW router connecting the Zabbix Server to the different physical hosts running the Zabbix Agents.
Routing:
The Zabbix server can ping remote Agent interfaces proving there's a route to the Agents.
AppArmor:
I also stopped AppArmor to exclude it as being causal:
sudo systemctl stop apparmor
sudo systemctl status apparmor
Summary:
So everything is wide open, the Zabbix Server can route to the Agents, and the config of the "red" Agents has parity with the config of the "green" Agent living on the same host as the Zabbix Server itself.
I've set up non-containerized Zabbix installations in production environments successfully, so I'm otherwise familiar with Zabbix.
Why can't the containerized Zabbix Server connect to the containerized Zabbix Agents on different hosts?
Short Answer:
There was NOTHING wrong with the Zabbix config; this was a Docker-induced problem.
docker logs <hostname of Zabbix server> revealed that there appeared to be NAT'ing happening on the Zabbix SERVER, and indeed there was.
Docker was modifying the iptables NAT table on the host running the Zabbix Server container, causing the source address of the Zabbix Server to present as the IP of the physical host itself, not the Docker-Compose-assigned IP address of 172.16.238.3.
Thus, the agent was not expecting this address and refused the connection. My experience of Dockerized apps is that they are mostly good at modifying IP tables to create the correct connectivity, but not in this particular case ;-).
I now reviewed the NAT table by executing the following command on the HOST (not container):
iptables -t nat -nvx -L --line-numbers
This revealed that Docker was being, erm, "helpful" and NAT'ing the Zabbix Server's traffic.
I deleted the offending rules by their rule number:
iptables -t nat -D <chain> <rule #>
After which the Zabbix server's IP address was now presented correctly to the Agents who now accepted the connections and their statuses turned "green".
The problem is reproducible if you execute:
docker-compose -f docker-compose_v3_ubuntu_pgsql_latest.yaml down
If you then run the up command to raise the containers again, you'll see that the offending iptables rule is restored to the NAT table of the host running the Zabbix Server's container, breaking connectivity with the Agents.
Longer Answer:
Below are the steps required to identify and resolve the problem of the Zabbix Server NAT'ing its traffic out of the host's IP:
Identify whether the HOST of the Zabbix Server container is NAT'ing:
We need to see how the IP of the Zabbix Server's container is presenting to the Agents, so we have to get the container ID of a Zabbix AGENT to review its logs:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b2fcf38d601f zabbix/zabbix-agent:ubuntu-6.0-latest "/usr/bin/tini -- /u…" 5 hours ago Up 5 hours 0.0.0.0:10050->10050/tcp, :::10050->10050/tcp DockerHost3-zabbix-agent
Next, supply container ID for the Agent to the docker logs command:
docker logs b2fcf38d601f
Then review the rejected IP address in the log output to determine whether it is NOT the Zabbix Server's IP:
81:20220328:000320.589 failed to accept an incoming connection: connection from "NAT'ed IP" rejected, allowed hosts: "zabbix-server"
The fact that you can see this error proves that there are no routing or connectivity issues: the connection is going through, it's just being rejected by the application, NOT the firewall.
If NAT'ing is proved, continue to the next step.
On Zabbix SERVER's Host:
The remediation happens on the Zabbix Server's Host itself, not the Agents. Which is good because we can fix the problem in one place versus many.
Execute below command on the Host running the Zabbix Server's container:
iptables -t nat -nvx -L --line-numbers
Output of command:
Chain POSTROUTING (policy ACCEPT 88551 packets, 6025269 bytes)
num pkts bytes target prot opt in out source destination
1 0 0 MASQUERADE all -- * !br-abeaa5aad213 192.168.24.128/28 0.0.0.0/0
2 73786 4427208 MASQUERADE all -- * !br-05094e8a67c0 172.16.238.0/24 0.0.0.0/0
Chain DOCKER (2 references)
num pkts bytes target prot opt in out source destination
1 0 0 RETURN all -- br-abeaa5aad213 * 0.0.0.0/0 0.0.0.0/0
2 95 5700 RETURN all -- br-05094e8a67c0 * 0.0.0.0/0 0.0.0.0/0
We can see the counters are incrementing for the "POSTROUTING" and "DOCKER" chains- both rule #2 in their respective chains.
These rules are clearly matching and have effect.
Delete the offending rules on the HOST of the Zabbix Server container, which is NAT'ing its traffic to the Agents:
sudo iptables -t nat -D POSTROUTING 2
sudo iptables -t nat -D DOCKER 2
Wait a few moments and the Agents should now go "green"- assuming there are no other configuration or firewalling issues. If the Agents remain "red" after applying the fix then please work through the troubleshooting steps I documented in the Question section.
Conclusion:
I've tested and restarting the Zabbix-server container does not recreate the deleted rules. But again, please note that a docker-compose down followed by a docker-compose up WILL recreate the deleted rules and break Agent connectivity.
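Because the rules come back after a down/up cycle, and their numbers may differ each time, it can help to look the rule number up by source subnet rather than hard-coding it. The following is a minimal sketch, not part of the original fix: masquerade_rule_numbers is a hypothetical helper that reads a listing in the same column layout as the iptables -t nat -nvx -L --line-numbers output shown above and prints the numbers of the POSTROUTING MASQUERADE rules whose source matches the given subnet.

```shell
#!/usr/bin/env bash
# Print the rule numbers of POSTROUTING MASQUERADE rules for one source subnet.
# Reads a listing from stdin with the columns:
#   num pkts bytes target prot opt in out source destination
masquerade_rule_numbers() {
    local subnet="$1"
    awk -v subnet="$subnet" '
        # Track which chain the following rule lines belong to.
        $1 == "Chain" { in_chain = ($2 == "POSTROUTING"); next }
        # Column 4 is the target, column 9 is the source subnet.
        in_chain && $4 == "MASQUERADE" && $9 == subnet { print $1 }
    '
}
```

For example, sudo iptables -t nat -nvx -L --line-numbers | masquerade_rule_numbers 172.16.238.0/24 prints the number to pass to iptables -t nat -D POSTROUTING. If more than one number is printed, delete from the highest number down, since deleting a rule renumbers the ones after it.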
Hope this saves other folks wasted cycles. I'm both a Linux and a network engineer and this hurt my head, so this would be near impossible to resolve if you're not a dab hand with networking.

Joining a Docker swarm

I have 2 VMs.
On the first I run:
docker swarm join-token manager
On the second I run the result from this command.
i.e.
docker swarm join --token SWMTKN-1-0wyjx6pp0go18oz9c62cda7d3v5fvrwwb444o33x56kxhzjda8-9uxcepj9pbhggtecds324a06u 192.168.65.3:2377
However, this outputs:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.65.3:2377: connect: connection refused"
Any idea what's going wrong?
If it helps I'm spinning up these VMs using Vagrant.
Just open the port in the firewall on the master side:
firewall-cmd --add-port=2377/tcp --permanent
firewall-cmd --reload
Then try docker swarm join again on the second VM (the node side).
I was facing a similar issue and spent a couple of hours figuring out the root cause. I'm sharing it here for anyone who runs into the same problem.
Environment:
Oracle Cloud + AWS EC2 (2 +2)
OS: 20.04.2-Ubuntu
Docker version : 20.10.8
3 dynamic public IP+ 1 elastic IP
Issues
1. Created two instances on Oracle Cloud at the beginning.
2. On instance A (manager), docker swarm init --advertise-addr succeeded.
3. On instance B (worker), docker swarm join as a worker succeeded.
4. When I tried to promote B to manager, I encountered the error:
Unable to connect to remote host: No route to host
5. Mesh routing was not working properly.
Investigation
1. Suspected it was related to the network/firewall/security group/security list.
2. SSHed to the B server (worker) and ran telnet against the manager on port 2377, with the same error:
Unable to connect to remote host: No route to host
3. Logged in to the Oracle console and added ingress rules under the security list for all of the relevant ports:
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
4. Tried again, but telnet still failed with the same error.
5. Checked the OS-level firewall and disabled it where present:
sudo ufw disable
6. Tried again, but still got the same result.
7. Suspecting something was wrong with Oracle Cloud, I decided to install the same OS and Docker versions on AWS.
8. Added a security group allowing all of the relevant ports/protocols and disabled ufw.
9. Tested with AWS instances C (leader/master) and D (worker). It worked, D could be promoted to manager, and mesh routing also worked.
10. This confirmed that the issue was with the Oracle Cloud instances.
11. Tried to join the Oracle instance (A) to C as a worker. It worked, but A still could not be promoted to manager.
12. Used journalctl -f to investigate the logs and confirmed there were socket timeouts from A/B (Oracle instances) to the AWS instance (C).
13. Looked at A/B again and found iptables rules blocking the requests.
14. Removed all of the rules set up in iptables:
# remove the rules
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -F
Root Cause
It was caused by a firewall, either at the cloud level (security group/WAF/ACL) or at the OS level (firewall rules, e.g. ufw/iptables).
I had already run firewall-cmd --add-port=2377/tcp --permanent and firewall-cmd --reload on the master side and was still getting the same error.
I ran telnet <master ip> 2377 on the worker node and then rebooted the master.
After that it worked fine.
It looks like your Docker swarm manager leader is not listening on port 2377. You can check the swarm state by running this command on your swarm manager leader VM. If it is working fine, you will get output similar to this:
[root@host1]# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
tilzootjbg7n92n4mnof0orf0 * host1 Ready Active Leader
Furthermore you can check the listening ports in leader swarm manager node. It should have port tcp 2377 for cluster management communications and tcp/udp port 7946 for communication among nodes opened.
[root@host1]# netstat -ntulp | grep dockerd
tcp6 0 0 :::2377 :::* LISTEN 2286/dockerd
tcp6 0 0 :::7946 :::* LISTEN 2286/dockerd
udp6 0 0 :::7946 :::* 2286/dockerd
On the second VM, where you are configuring the second swarm manager, you have to make sure you have connectivity to port 2377 of the leader swarm manager. You can use tools like telnet, wget or nc to test the connectivity, as shown below:
[root@host2]# telnet <swarm manager leader ip> 2377
Trying 192.168.44.200...
Connected to 192.168.44.200.
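If telnet is not installed, a similar check can be sketched with bash's /dev/tcp redirection. This assumes bash was built with network redirection support and that the coreutils timeout command is available; check_tcp_port is a hypothetical helper name, not a Docker command. It only tests TCP, so it says nothing about the UDP side of ports 7946 and 4789.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port can be opened within 3 seconds,
# non-zero otherwise (connection refused, no route, or timeout).
check_tcp_port() {
    local host="$1" port="$2"
    # A child bash opens the socket on file descriptor 3 and exits at once,
    # which closes the connection again.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

Usage, with the leader address from the example above: check_tcp_port 192.168.44.200 2377 && echo reachable || echo unreachable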
In my case I was using both Linux and Windows. My Windows Docker private network used the same subnet as my local network, so the Docker daemon looked for the master inside its own network and could not find it at the address I was giving it.
So I did :
1- go to Docker Desktop app
2- go to Settings
3- go to Resources
4- go to Network section and change the Docker subnet address (it needs to be different from your local subnet address).
5- Then apply and restart.
6- use the docker join on the worker again.
Note: All these steps are performed on the node where the error appears. Make sure that ports 2377, 7946 and 4789 are open on the master (you can use iptables or ufw).
Hope it works for you.
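The subnet clash described above can be checked before changing any settings. Below is a minimal sketch in shell that is not tied to Docker Desktop in any way: ip_to_int and cidr_overlap are hypothetical helper names, the input is assumed to be well-formed IPv4 CIDR notation, and the addresses in the usage note are examples only. Two CIDR blocks overlap exactly when they agree on the bits covered by the shorter of the two prefixes.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if two CIDR blocks (e.g. 10.0.0.0/8) share any address, 1 otherwise.
cidr_overlap() {
    local ip1="${1%/*}" len1="${1#*/}" ip2="${2%/*}" len2="${2#*/}"
    # Compare both networks under the shorter (less specific) prefix.
    local shorter=$(( len1 < len2 ? len1 : len2 ))
    local mask=$(( shorter == 0 ? 0 : (0xFFFFFFFF << (32 - shorter)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip1") & mask )) -eq $(( $(ip_to_int "$ip2") & mask )) ]
}
```

Usage: cidr_overlap 192.168.65.0/24 192.168.65.0/28 && echo "subnets clash"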

Docker swarm overlay network with vxlan routing over openvpn

I have setup a docker swarm with 3 nodes (docker 18.03). These nodes use an overlay network to communicate.
node1:
laptop
host tun0 172.16.0.6 --> openvpn -> nat gateway
container n1
ip = 192.169.1.10
node2:
aws ec2
host eth2 10.0.30.62
container n2
ip = 192.169.1.9
node3:
aws ec2
host eth2 10.0.140.122
container n3
ip = 192.169.1.12
nat-gateway:
aws ec2
tun0 172.16.0.1 --> openvpn --> laptop
eth0 10.0.30.198
The scheme is partly working:
1. Containers can ping each other by name (n1, n2, n3)
2. Docker swarm commands are working, services can be deployed
The overlay is partly working. Some nodes cannot communicate with each other either using tcp/ip or udp. I tried all combinations of the 3 nodes with udp and tcp/ip:
I did a tcpdump on the nat gateway to monitor overlay vxlan network activity (port 4789):
tcpdump -l -n -i eth0 "port 4789"
tcpdump -l -n -i tun0 "port 4789"
Then I tried tcp/ip communication from node1 to node3. On node3:
nc -l -s 0.0.0.0 -p 8999
On node1:
telnet 192.169.1.12 8999
Node1 will then try to connect to node3. I see packets coming in on the nat-gateway over the tun0 interface:
on the nat-gateway eth0 interface:
it seems that the nat-gateway is not sending replies back over the tun0 interface.
The iptables configuration of the nat-gateway
The routing of the nat-gateway
Can you help me solve this issue?
I have been able to fix the issue using the following configuration on the NAT gateway:
and
No masquerading of 172.16.0.0/22 is needed. All the workers and managers will route their traffic for 172.16.0.0/22 via the NAT gateway, and it knows how to send the packets over tun0.
Masquerading of eth0 was just wrong...
All the containers can now ping and establish tcp/ip connections to each other.

Establish conversation between hello-world apps in Docker containers

I'm trying to run my hello-world apps inside Docker: the frontend needs to consume a REST API from the backend.
I run
docker run -p 1337:1337 --net=bridge me/p-dockerfile-advanced-backend:latest
docker run -p 1338:1338 --net=bridge me/p-dockerfile-advanced-frontend:latest http://127.0.0.1:1337
I am able to connect to both of them using a browser from the host OS (My desktop Windows 10 x64) :
The http://127.0.0.1:1337 parameter is needed so that the frontend application knows where the RESTful services reside. But the app cannot connect to them, and I cannot connect either:
Windows PowerShell
Copyright (C) 2016 Microsoft Corporation. All rights reserved.
PS C:\Users\user1> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4b0852253b8a me/p-dockerfile-advanced-frontend:latest "/usr/bin/java -ja..." 24 minutes ago Up 24 minutes 0.0.0.0:1338->1338/tcp laughing_noyce
e73f8a6efa24 me/p-dockerfile-advanced-backend:latest "/usr/bin/java -ja..." 26 minutes ago Up 26 minutes youthful_chandrasekhar
PS C:\Users\user1> docker exec -it 4b0852253b8a bash
root@4b0852253b8a:/# apt-get install telnet
<...>
root@4b0852253b8a:/# telnet localhost 1337
Trying 127.0.0.1...
Trying ::1...
telnet: Unable to connect to remote host: Cannot assign requested address
root@4b0852253b8a:/#
It is unable to connect, but it should, because I specified --net=bridge on both containers and the backend listens on port 1337:
root@e73f8a6efa24:/# netstat -lntu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:1337 0.0.0.0:* LISTEN
root@e73f8a6efa24:/#
PS: I spent almost all day trying to make it work before asking here.
The problem is the 127.0.0.1 address.
Each container is assigned, by default, 2 interfaces: eth0 and lo (the loopback interface with the 127.0.0.1 address). Inside the frontend container, 127.0.0.1 therefore refers to the frontend container itself, not to the backend.
You need to specify the name or address of the backend container. For this simple application you may use the --link option.
docker run -p 1337:1337 --name backend me/p-dockerfile-advanced-backend:latest
docker run -p 1338:1338 --link backend:backend me/p-dockerfile-advanced-frontend:latest http://backend:1337
Note that the --link option is deprecated as stated in:
https://docs.docker.com/engine/userguide/networking/default_network/dockerlinks/
Since these are different containers, you have to expose ports on both of them. Run the first with:
docker run -p 1337:1337 --net=bridge me/p-dockerfile-advanced-backend:latest
Note that bridge is the default network, so specifying it is redundant. Both containers will be on the same bridge network by default anyway.
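Since --link is deprecated, an alternative is a user-defined bridge network, where containers can resolve each other by container name. The sketch below is one way to do it, not the original poster's setup: the function name start_hello_world and the default network name hello-net are made up, and -d is added so the first container does not block the shell. The DOCKER variable exists only so the commands can be previewed with DOCKER=echo.

```shell
#!/usr/bin/env bash
# Set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

# Start backend and frontend on one user-defined bridge network so that the
# frontend can reach the backend by its container name.
start_hello_world() {
    local net="${1:-hello-net}"   # hypothetical network name
    "$DOCKER" network create "$net" &&
    "$DOCKER" run -d --network "$net" --name backend -p 1337:1337 \
        me/p-dockerfile-advanced-backend:latest &&
    "$DOCKER" run -d --network "$net" --name frontend -p 1338:1338 \
        me/p-dockerfile-advanced-frontend:latest http://backend:1337
}
```

The frontend is given http://backend:1337 instead of http://127.0.0.1:1337. The published ports are only needed for access from the host browser; traffic between the two containers stays on the shared network.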

Docker1.12 Worker not able to join in cluster(Swarm: Pending)

Manager Version Docker version 1.12.0-rc5, build a3f2063,
Worker version Docker version 1.12.0-rc5, build a3f2063.
Created Swarm manager:
docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (3kmewyb10p8xj3ke5rpjyw4s8) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-85cwe5pf779qw0knjn6wxdbim \
172.25.30.2:4243
Then created worker
docker swarm join --token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq 172.25.30.2:4243
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
I have checked logs in worker
time="2016-08-01T00:22:47.449844174-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.449962215-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450025342-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450081950-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450142443-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450202836-07:00" level=error msg="cluster exited with error: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:23:31.351868722-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
In docker info, I saw "Swarm: Pending"
I also ran docker swarm update. Still, the worker was not able to join the cluster. How can I resolve this?
UPDATE-1
Uninstalled Docker, removed the config files, and then installed Docker 1.12 again (Docker version 1.12.0, build 8eab29e).
Still facing the same problem (not able to join, and "Swarm: Pending" in docker info), with a DIFFERENT error in /var/logs/upstat/docker.logs
time="2016-08-01T11:22:08.629760770-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Thanks.
The thing is, I was trying to join with the wrong port (the one shown in the docker swarm init output).
1) Before docker swarm init, Docker was listening on port 4243 only; I checked with netstat -tulpn | grep docker. So I advertised with that port!
root@veeru:~# netstat -tulpn | grep docker
tcp6 0 0 :::4243 :::* LISTEN 8750/dockerd
root@veeru:~# docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (exvwgj0pu4cd124ljnblt9xff) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-axhgqgo4jqw4hv38x578m44wh \
172.25.30.2:4243
2) After docker swarm init, Docker shows four listening sockets, including port 2377 (netstat -tulp | grep docker).
root@veeru:~# netstat -tulp | grep docker
tcp6 0 0 [::]:2377 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:7946 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:4243 [::]:* LISTEN 8750/dockerd
udp6 0 0 [::]:7946 [::]:* 8750/dockerd
In point 1, the output tells you to run docker swarm join with port 4243 on the worker. That is what I ran previously, and it doesn't work!
Later I ran docker swarm leave and joined with port 2377. Now I am able to join!
docker swarm join --token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas 172.25.30.2:2377
For me it was a firewall issue too.
I pinged the manager node and it replied.
I then checked whether the ports were open using telnet, was not able to connect, and figured out it was a port issue.
If you are running CentOS, then the port can easily be opened using firewalld.
Check if firewalld is running:
sudo firewall-cmd --state
Open the port you want:
sudo firewall-cmd --zone=public --add-port=2377/tcp
Change the port to match the one your node is trying to connect to.
Just expose port 2377 of manager, it will work.
It clearly means the node is unable to connect to the manager, so the timeout happens. To confirm this, just run telnet manager-ip 2377 (don't try ping; it won't work).
And if you are facing the same error even though all firewalls are disabled in both nodes and manager, then try to create another manager exposing port 2377 as below:
docker-machine create --driver amazonec2 --amazonec2-open-port 2377 manager1
Now try to join the nodes to the newly created manager. The port you use to join should be 2377; if you are going to use a different one, expose that port in the command above. Doing this worked for me. I suspect it's because others used separate servers, whereas I'm using the same server for both the manager and the nodes.
Docker's website states which ports to enable.
Run the following commands on both the Swarm manager and worker nodes:
sudo ufw enable
sudo ufw allow 22/tcp
sudo ufw allow 2376/tcp
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp
sudo ufw reload
We just gave access to the necessary ports. After running these commands, all docker commands should now work.
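The swarm-specific part of the rule set above can be narrowed by node role. The sketch below assumes the usual split of swarm ports: TCP 2377 for cluster management is only accepted on managers, while TCP/UDP 7946 and UDP 4789 are needed on every node. swarm_ports is a hypothetical helper name; it only prints port/protocol pairs and leaves SSH (22/tcp) and 2376/tcp from the list above out.

```shell
#!/usr/bin/env bash
# Print the inbound swarm ports a node needs, one port/protocol per line.
swarm_ports() {
    case "$1" in
        manager) printf '%s\n' 2377/tcp 7946/tcp 7946/udp 4789/udp ;;
        worker)  printf '%s\n' 7946/tcp 7946/udp 4789/udp ;;
        *) echo "usage: swarm_ports manager|worker" >&2; return 1 ;;
    esac
}
```

Usage on a manager: for p in $(swarm_ports manager); do sudo ufw allow "$p"; done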
I was facing a similar issue; in my case the port was being blocked by a firewall rule.
I was having the same issue. I was running CoreOS VMs in Azure. I found out that all my VMs had the same private IP address and different public IP addresses. This usually happens when the VMs are part of the same security group; however, that was not the case this time. The issue was that my account had reached the maximum number of resources, so I deleted resources such as IP addresses, NSGs, networks, etc. and then re-provisioned new VMs. They had different private IPs, and when I ran the command everything was fine. My Docker version is 1.12.6.
Assuming you did so; if you get "Connection time out" it means that there is a firewall preventing you from connecting.
Either on the source host, or the destination host (e.g. iptables rules) or in between.
If you are running on some public cloud, make sure that access lists (e.g. EC2 security groups) allow connections between hosts on that port
I was trying to connect 4 nodes (1 master, 3 slaves) on EC2 using the Ubuntu Server AMI. For me it was a firewall issue.
Check your security groups => Inbound rules. Mine was set to custom; I changed it to anywhere and it worked.

Resources