Worker cannot connect to docker swarm manager - docker-swarm

I have set up the docker swarm manager on one machine with IP 192.168.XXX.XXX using this command:
docker swarm init --advertise-addr=192.168.XXX.XXX
and I got this message:
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0jpgak7bm7t4mzluz48gdub06f5036q8yaoo99awkjmlz48vtb-1eutz0k1vp37ztmiuxdnglka2 192.168.XXX.XXX:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
On the other machine I tried the following command:
docker swarm join --token SWMTKN-1-0jpgak7bm7t4mzluz48gdub06f5036q8yaoo99awkjmlz48vtb-1eutz0k1vp37ztmiuxdnglka2 192.168.XXX.XXX:2377
and the result was:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.XXX.XXX:2377: connect: connection refused"
Docker version :
Client: Docker Engine - Community
Version: 18.09.0
API version: 1.39
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:47:51 2018
OS/Arch: windows/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.0
API version: 1.39 (minimum version 1.12)
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:55:00 2018
OS/Arch: linux/amd64
Experimental: false

You are running into a connection refused error. Most likely, the other machine cannot reach the manager because it is not on the same network. Solution: the two machines have to be able to talk to each other. If the machines are in a cloud like Azure or AWS, create a virtual network and add both machines to it.
Debugging the connection refused error:
To confirm the machines cannot talk to each other, try pinging the manager from the other machine. It will most likely fail in your current setup.
$ ping 192.168.XXX.XXX
If it succeeds, then check whether port 2377 is open and listening on the manager. Run the following from the other machine:
$ nc -vz 192.168.XXX.XXX 2377
Another tool that can help in debugging is to run netstat on the manager:
$ netstat -tuplen
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 101 51772227 -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 0 25361290 -
tcp6 0 0 :::2377 :::* LISTEN 0 1423965 -
tcp6 0 0 :::7946 :::* LISTEN 0 1423980 -
Verify that you see :::2377 and :::7946 and that their state is LISTEN, as these ports are required by docker swarm [1].
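If port 2377 is not reachable even though the machines can reach each other, the manager's firewall may be blocking the swarm ports. A minimal sketch using ufw (assuming an Ubuntu manager; on CentOS/RHEL use firewall-cmd instead):
$ sudo ufw allow 2377/tcp   # cluster management traffic
$ sudo ufw allow 7946/tcp   # node-to-node communication
$ sudo ufw allow 7946/udp
$ sudo ufw allow 4789/udp   # overlay network (VXLAN) traffic
$ sudo ufw reload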

Related

Docker-stack. Forcing docker stack services to use ipv4

I would like a service deployed as part of a docker stack to listen on IPv4.
Currently the deployed service (rabbitmq) is listening on IPv6, and I would like it to listen on IPv4 instead.
The section of the docker compose .yaml file that I use to deploy the docker stack is the following:
rabbitmq-3-11-0:
  #image: rabbitmq:3.11.0-management
  image: "127.0.0.1:5000/bcl-sdv-rabbitmq-3-11-0:v0.1"
  ports:
    - "0.0.0.0:5672:5672/tcp"
    - "0.0.0.0:15672:15672/tcp" # 15672: HTTP API clients, management UI and rabbitmqadmin (only if the management plugin is enabled)
On deployment of the docker stack, the "rabbitmq-3-11-0" service is deployed successfully.
To test IP connectivity I issue the following commands on the docker node.
ncat -w 2 -v ::1 5672 </dev/null; echo $?
yields
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to ::1:5672.
While the command
ncat -w 2 -v 0.0.0.0 5672 </dev/null; echo $?
yields
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 0.0.0.0:5672.
Ncat: Connection reset by peer.
1
The command
ncat -w 2 -v 127.0.0.1 5672 </dev/null; echo $?
produces
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5672.
Ncat: Connection reset by peer.
1
The netstat command below
sudo netstat -tulnp|grep 5672
shows that the ports 5672 and 15672 are listening on ipv6.
tcp6 4 0 :::5672 :::* LISTEN 2527/dockerd
tcp6 0 0 :::15672 :::* LISTEN 2527/dockerd
The command to determine the docker version below
docker info|grep Version
Outputs
Server Version: 20.10.20
Cgroup Version: 1
Kernel Version: 3.10.0-1160.76.1.el7.x86_64
The Linux version command below
lsb_release
prints
LSB Version: :core-4.1-amd64:core-4.1-noarch
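For illustration only, a hedged sketch of the long port syntax (Compose file format 3.2 and later) published in host mode, which bypasses the swarm ingress mesh that makes dockerd hold the tcp6 wildcard sockets shown in the netstat output above; whether host-mode publishing fits the requirement here is an assumption:
rabbitmq-3-11-0:
  image: "127.0.0.1:5000/bcl-sdv-rabbitmq-3-11-0:v0.1"
  ports:
    - target: 5672
      published: 5672
      protocol: tcp
      mode: host   # bind directly on the node running the task instead of via the routing mesh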

Error response from daemon: service endpoint with name

I'm getting this strange error: when I try to run a container with a name, it gives me this error.
docker: Error response from daemon: service endpoint with name qc.T8 already exists.
However, there is no container with this name.
> docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
> sudo docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 3
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /ahdee/docker/aufs
Backing Filesystem: extfs
Dirs: 28
Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-101-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 480.3 GiB
Is there any way I can flush this out?
Just in case someone else needs this: as @Jmons pointed out, it was a weird networking issue, so I solved it by forcing a removal:
docker network disconnect --force bridge qc.T8
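If you're not sure which network still holds the stale endpoint, inspecting the network first shows the endpoints it knows about; for example (the format string just dumps the endpoint map):
docker network inspect bridge --format '{{json .Containers}}'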
TLDR: restart your docker daemon or restart your docker-machine (if you're using that e.g. on a mac).
Edit: The more recent posts below answer the question better than mine; the network adapter is stuck on the daemon. I'm updating mine as it's possibly 'on top' of the list and people might not scroll down.
Restarting your docker daemon / docker service / docker-machine is the easiest answer.
the better answer (via Shalabh Negi):
docker network inspect <network name>
docker network disconnect <network name> <container id/ container name>
This is also faster in practice if you can find the network, as restarting the docker machine/daemon/service is, in my experience, slow. If you use that, please scroll down and click +1 on their answer.
So the problem is probably your network adapter (virtual, docker thing, not real): have a quick peek at this: https://github.com/moby/moby/issues/23302.
Preventing it from happening again is a bit tricky. It seems there may be an issue in docker where a container that exits with a bad (non-zero) status code holds the network endpoint open, and you then can't start a new container with that endpoint.
docker network inspect <network name>
docker network disconnect <network name> <container id/ container name>
You can also try doing:
docker network prune
docker volume prune
docker system prune
these commands will help clearing zombie containers, volume and network.
If none of these commands work, then run
sudo service docker restart
and your problem should be solved.
docker network rm <network name>
Worked for me
Restarting docker solved it for me.
I created a script a while back; it should help people working with swarm who are using docker-machine.
https://gist.github.com/lcamilo15/7aaaebe71852444ea8f1da5c4c9c84b7
# Nodes, containers and networks to clean up (no commas between bash array elements)
declare -a NODE_NAMES=("node_01" "node_02")
declare -a CONTAINER_NAMES=("container_a" "container_b")
declare -a NETWORK_NAMES=("network_1" "network_2")
for x in "${NODE_NAMES[@]}"; do
    # Point the docker client at this node
    eval "$(docker-machine env "$x")"
    for CONTAINER_NAME in "${CONTAINER_NAMES[@]}"; do
        for NETWORK_NAME in "${NETWORK_NAMES[@]}"; do
            echo "Disconnecting $CONTAINER_NAME from $NETWORK_NAME"
            docker network disconnect -f "$NETWORK_NAME" "$CONTAINER_NAME"
        done
    done
done
You could try seeing if there's any network with that container name by running:
docker network ls
If there is, copy the network id then go on to remove it by running:
docker network rm network-id
This could be because an abrupt removal of a container may leave the network open for that endpoint (container-name).
Try stopping the container first before removing it.
docker stop <container-name>. Then docker rm <container-name>.
Then docker run <same-container-name>.
I think restarting the docker daemon will solve the problem.
Even a reboot did not help in my case. It turned out that port 80, which was to be assigned to the nginx container automatically, was already in use, even after the reboot. How come?
root@IONOS_2:/root/2_proxy# netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN 1378/rsync
tcp 0 0 0.0.0.0:5355 0.0.0.0:* LISTEN 1565/systemd-resolv
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 1463/nginx: master
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1742/sshd
tcp6 0 0 :::2377 :::* LISTEN 24139/dockerd
tcp6 0 0 :::873 :::* LISTEN 1378/rsync
tcp6 0 0 :::7946 :::* LISTEN 24139/dockerd
tcp6 0 0 :::5355 :::* LISTEN 1565/systemd-resolv
tcp6 0 0 :::21 :::* LISTEN 1447/vsftpd
tcp6 0 0 :::22 :::* LISTEN 1742/sshd
tcp6 0 0 :::5000 :::* LISTEN 24139/dockerd
No idea what nginx: master means or where it came from. And indeed 1463 is the PID:
root@IONOS_2:/root/2_proxy# ps aux | grep "nginx"
root 1463 0.0 0.0 43296 908 ? Ss 00:53 0:00 nginx: master process /usr/sbin/nginx
root 1464 0.0 0.0 74280 4568 ? S 00:53 0:00 nginx: worker process
root 30422 0.0 0.0 12108 1060 pts/0 S+ 01:23 0:00 grep --color=auto nginx
So I tried this:
root@IONOS_2:/root/2_proxy# kill 1463
root@IONOS_2:/root/2_proxy# ps aux | grep "nginx"
root 30783 0.0 0.0 12108 980 pts/0 S+ 01:24 0:00 grep --color=auto nginx
And the problem was gone.
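One caveat: if that nginx instance is managed as a system service, killing the master process only helps until the service manager starts it again, so stopping it via the service manager is more durable (a sketch, assuming systemd):
systemctl stop nginx
systemctl disable nginx   # keep it from claiming port 80 again at the next boot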

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master, which does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not on the master. When I check the syslogs on the nodes I see the following error:
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three-node Swarm Mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the nodes that do not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
The nodes are both properly connected to the swarm
Scaling the service up to 5 (which deploys a task to every node) makes curl work on every node, but only because the task is then running everywhere.
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. The overlay network (port 7946) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to the port in the syslog.
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason docker refused to open this port and listen any more. Here is what I did (albeit undesirable) to circumvent the issue:
Created another node with docker-machine called swarm-master-02
Joined swarm-master-02 to the cluster as a master
Demoted master-01 which set master-02 as the leader
Restarted the docker daemon on each node (might not have been necessary)
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
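Roughly, the workaround above maps onto the following commands (the docker-machine driver, machine names, and token are placeholders for illustration, not the exact commands that were run):
docker-machine create --driver amazonec2 swarm-master-02        # provision the replacement manager (driver is an assumption)
docker swarm join-token manager                                 # on the current leader; prints the manager join command
docker-machine ssh swarm-master-02 "docker swarm join --token <manager-token> 10.0.0.12:2377"
docker node demote swarm-master-01                              # run from the new manager, which then becomes the leader
sudo systemctl restart docker                                   # on each node, if needed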

Node cannot join Swarm Cluster

I have 3 VMs. They all have Docker 1.12 and are running CentOS 7.
All the ports are open and the VMs are able to ping each other.
I started my cluster with
docker swarm init --advertise-addr 192.168.140.12
Docker info showed me:
Swarm: active
NodeID: 0drcj2nku1mv8t16fxva48edxx
Is Manager: true
ClusterID: cchn0yzospwoe1h9f55d7omxx
Managers: 1
Nodes: 1
Now I try to join nodes (the other VMs) to the cluster, using the command that was recommended after starting my manager.
docker swarm join \
--token SWMTKN-1-48ythur5k6ckkz90ttlprw37p9z3ldclws51qirw5wdyfmvevr-3sb2t66b2fj6e4dhmfo1vavxx \
192.168.140.12:2377
But I got:
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
Docker info showed me:
Swarm: pending
NodeID:
Error: rpc error: code = 1 desc = context canceled
Is Manager: false
Node Address: 192.168.140.14
On manager of cluster:
# netstat -tulpn | grep docker
tcp6 0 0 :::2377 :::* LISTEN 1602/dockerd
tcp6 0 0 :::7946 :::* LISTEN 1602/dockerd
tcp6 0 0 :::8080 :::* LISTEN 3398/docker-proxy
tcp6 0 0 :::32768 :::* LISTEN 3199/docker-proxy
tcp6 0 0 :::32769 :::* LISTEN 3219/docker-proxy
tcp6 0 0 :::32770 :::* LISTEN 3341/docker-proxy
tcp6 0 0 :::32771 :::* LISTEN 3436/docker-proxy
tcp6 0 0 :::2375 :::* LISTEN 1602/dockerd
udp6 0 0 :::7946 :::* 1602/dockerd
How can I debug this issue, or did I forget to perform some important step? Do the servers need SSH access to each other? Thanks.
logs on node:
Aug 8 09:50:24 localhost dockerd: time="2016-08-08T09:50:24.393432145-04:00" level=error msg="Handler for POST /v1.24/swarm/leave returned error: This node is not part of swarm"
Aug 8 09:51:01 localhost su: (to root) worker1 on pts/1
Aug 8 09:51:34 localhost dockerd: time="2016-08-08T09:51:34.384408514-04:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Aug 8 09:51:40 localhost su: (to root) worker1 on pts/1
Aug 8 09:52:47 localhost dhclient[1277]: DHCPREQUEST on eno16777736 to 192.168.140.254 port 67 (xid=0x11f8fba8)
Aug 8 09:52:47 localhost dhclient[1277]: DHCPACK from 192.168.140.254 (xid=0x11f8fba8)
Aug 8 09:52:47 localhost NetworkManager[953]: <info> address 192.168.140.13
Aug 8 09:52:47 localhost NetworkManager[953]: <info> plen 24 (255.255.255.0)
Aug 8 09:52:47 localhost NetworkManager[953]: <info> gateway 192.168.140.2
Aug 8 09:52:47 localhost NetworkManager[953]: <info> server identifier 192.168.140.254
Aug 8 09:52:47 localhost NetworkManager[953]: <info> lease time 1800
Aug 8 09:52:47 localhost NetworkManager[953]: <info> nameserver '192.168.140.2'
Aug 8 09:52:47 localhost NetworkManager[953]: <info> domain name 'localdomain'
Aug 8 09:52:47 localhost NetworkManager[953]: <info> (eno16777736): DHCPv4 state changed bound -> bound
Aug 8 09:52:47 localhost dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug 8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug 8 09:52:47 localhost systemd: Starting Network Manager Script Dispatcher Service...
Aug 8 09:52:47 localhost dhclient[1277]: bound to 192.168.140.13 -- renewal in 713 seconds.
Aug 8 09:52:47 localhost dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug 8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug 8 09:52:47 localhost nm-dispatcher: Dispatching action 'dhcp4-change' for eno16777736
Aug 8 09:52:47 localhost systemd: Started Network Manager Script Dispatcher Service.
Sometimes warnings:
level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled
Maybe you are using an HTTP proxy.
You can use the following command to see what dockerd is doing.
# strace -Fp `pidof dockerd` 2>&1 |grep -v futex |grep -v epoll_wait |grep -v pselect
As explained by wenjianhn, make sure that you did not configure an HTTP proxy for Docker on your worker node (as described here). Swarm nodes communicate over TCP port 2377 by default, so if you configured an HTTP proxy, your worker node will route that traffic through the proxy, even if the manager node sits in your LAN.
Also, make sure that no firewall is blocking the traffic on port 2377:
user@workernode$ telnet ip-of-manager 2377
If you can't open a telnet connection on port 2377, it means that the port is blocked by a firewall (either the worker node's firewall or the manager's).
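If a proxy really is configured for the Docker daemon, one hedged way to keep swarm traffic off it is to add the manager's address to NO_PROXY in the daemon's systemd drop-in (the file path and addresses below are examples):
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1,192.168.140.12"
Then reload and restart the daemon:
user@workernode$ sudo systemctl daemon-reload
user@workernode$ sudo systemctl restart docker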
The hostnames on all my VMs were localhost.localdomain.
I changed the hostnames in /etc/hosts on each server and rebooted.
Now I'm able to create my swarm cluster and add nodes successfully.
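A hedged sketch of what that looks like with unique names (the hostname and address are examples; repeat on each VM with its own values):
sudo hostnamectl set-hostname worker-01
echo "192.168.140.13 worker-01" | sudo tee -a /etc/hosts
sudo reboot   # or restart the docker daemon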
I had the same problem, and solved it by syncing each worker node's date to the master node's date.
pi@workernode$ sudo date --set="$(ssh username@masternode date)"
After this, retry the join from the worker node, and it should work.
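If the nodes can reach a time server, keeping the clocks in sync with NTP avoids copying the date by hand; a sketch, assuming timedatectl is available:
pi@workernode$ sudo timedatectl set-ntp true
pi@workernode$ timedatectl   # check the synchronization status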
If none of the other solutions have worked, try disabling the firewall on the master and see if that helps.
I had a similar problem. I checked that there is no proxy configured and no firewall in the way but "docker swarm join" did not work. I noticed that there were managers listed in "docker info" that were no longer present. "docker node ls" did not show anything strange.
Eventually I solved my issue by doing "service docker restart" on the node addressed by "docker swarm join". Obviously some internal bookkeeping within the docker daemon got out of sync.

Docker1.12 Worker not able to join in cluster(Swarm: Pending)

Manager: Docker version 1.12.0-rc5, build a3f2063.
Worker: Docker version 1.12.0-rc5, build a3f2063.
Created the Swarm manager:
docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (3kmewyb10p8xj3ke5rpjyw4s8) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-85cwe5pf779qw0knjn6wxdbim \
172.25.30.2:4243
Then I created the worker:
docker swarm join --token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq 172.25.30.2:4243
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
I have checked logs in worker
time="2016-08-01T00:22:47.449844174-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.449962215-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450025342-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450081950-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450142443-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450202836-07:00" level=error msg="cluster exited with error: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:23:31.351868722-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
In docker info, I saw "Swarm: Pending".
I also tried docker swarm update. Still, the worker was not able to join the cluster. So, how can I resolve this?
UPDATE-1
I uninstalled Docker, removed the config files, and then installed Docker 1.12 again (Docker version 1.12.0, build 8eab29e).
I am still facing the same problem (not able to join, and "Swarm: Pending" in docker info), with a DIFFERENT error in /var/logs/upstat/docker.logs:
time="2016-08-01T11:22:08.629760770-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Thanks.
The thing is, I was trying to join with the wrong "port" (as shown in the docker swarm init output).
1) Before "docker swarm init", the docker daemon was listening on port "4243" only. I had checked with netstat -tulp | grep docker, so I advertised with that port!
root@veeru:~# netstat -tulpn | grep docker
tcp6 0 0 :::4243 :::* LISTEN 8750/dockerd
root@veeru:~# docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (exvwgj0pu4cd124ljnblt9xff) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-axhgqgo4jqw4hv38x578m44wh \
172.25.30.2:4243
2) After docker swarm init, docker is listening on 4 ports, including port 2377 (netstat -tupln | grep docker).
root@veeru:~# netstat -tulp | grep docker
tcp6 0 0 [::]:2377 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:7946 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:4243 [::]:* LISTEN 8750/dockerd
udp6 0 0 [::]:7946 [::]:* 8750/dockerd
In point 1, the output says to run docker swarm join with port 4243 on the worker. I originally ran it like that, and it won't work!
Later I did docker swarm leave and joined with port 2377 instead. Now I am able to join:
docker swarm join --token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas 172.25.30.2:2377
For me it was a firewall issue too.
I tried to ping the manager node and it pinged back.
I checked whether the ports were open using telnet, was not able to connect, and figured out it was a port issue.
If you are running CentOS, the port can easily be opened using firewalld.
Check if firewalld is running:
sudo firewall-cmd --state
Open the port you want:
sudo firewall-cmd --zone=public --add-port=2377/tcp
Change the port to whichever port your node is trying to connect to.
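Note that a rule added without --permanent is lost when firewalld reloads or the host reboots; to persist it:
sudo firewall-cmd --zone=public --add-port=2377/tcp --permanent
sudo firewall-cmd --reload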
Just expose port 2377 on the manager, and it will work.
The error clearly means the node is unable to connect to the manager, which is why the timeout happens. To confirm, just do telnet manager-ip 2377 (don't rely on ping; a successful ping doesn't prove the port is reachable).
If you are facing the same error even though all firewalls are disabled on both the nodes and the manager, then try to create another manager that exposes port 2377, as below:
docker-machine create --driver amazonec2 --amazonec2-open-port 2377 manager1
Now try to join the nodes to the newly created manager. The port you use to join should be 2377; if you want to use a different one, expose that port in the command above instead. Doing this worked for me; I suspect it's because others were using separate servers, while I was using the same server for both manager and nodes.
According to Docker's website Here, these are the ports to enable.
Run the following commands on both the Swarm manager and the worker nodes:
sudo ufw enable
sudo ufw allow 22/tcp
sudo ufw allow 2376/tcp
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp
sudo ufw reload
This gives access to the necessary ports. After running these commands, all docker commands should be working.
I was facing a similar issue; in my case the port was being blocked by a firewall rule.
I was having the same issue. I was running CoreOS VMs in Azure and found out that all my VMs had the same private IP address and different public IP addresses. This usually happens when the VMs are part of the same security group; however, that was not the case this time. The issue was that my account had reached the maximum number of resources, so I deleted resources such as IP addresses, NSGs, and networks, and then re-provisioned new VMs. They had different private IPs, and when I ran the command everything was fine. My docker version is 1.12.6.
Assuming you did so: if you get "Connection timed out", it means there is a firewall preventing you from connecting, either on the source host, on the destination host (e.g. iptables rules), or somewhere in between.
If you are running on a public cloud, make sure that access lists (e.g. EC2 security groups) allow connections between hosts on that port.
I was trying to connect 4 nodes (1 master, 3 workers) on an EC2 Ubuntu server AMI image; for me it was a firewall issue.
Check your security groups => Inbound rules. For me the source was set to custom; I changed it to anywhere and it worked.
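If you'd rather not open the rule to anywhere, a hedged AWS CLI example that allows the swarm management port only from instances in the same security group (the group id is a placeholder):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 2377 --source-group sg-0123456789abcdef0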
