Docker Swarm with Zookeeper - No elected primary cluster manager - docker

I have been tasked with building a production-ready Swarm cluster using ZooKeeper as the discovery backend. I followed the official documentation, https://docs.docker.com/swarm/install-manual/, and for the discovery backend, https://docs.docker.com/swarm/discovery/. Now I have an issue: when I try to communicate with the swarm, I get this error: No elected primary cluster manager.
This is my setup:
I'm running Ubuntu 16.04 with Docker client/server version 1.12.3 and ZooKeeper 3.4.9, which is launched on the same host as my swarm manager. I'm using a two-node architecture with one swarm manager and one swarm worker.
After installing Docker Engine on each node, I started the daemon:
$ nohup docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock &
Now on the swarm manager:
$ docker run -d -p 4000:4000 swarm manage -H :4000 --replication --advertise <swarm-manager-ip>:4000 zk://<swarm-manager-ip>/swarm
On the swarm worker:
$ docker run -d swarm join --advertise=<swarm-worker-ip>:2375 zk://<swarm-manager-ip>/swarm
When I check whether everything works, I run the command below and get the following result.
$ docker -H <swarm-manager-ip>:4000 ps -a
Error response from daemon: No elected primary cluster manager
When I just do this:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
91c3864ba6ee swarm "/swarm manage -H :40" 17 hours ago Up 19 minutes 2375/tcp, 0.0.0.0:4000->4000/tcp swarm-master
I can see the swarm master, and when I look at the logs of the swarm manager container, I see this:
$ docker logs 91c3864ba6ee
time="2016-12-09T20:29:39Z" level=info msg="Initializing discovery without TLS"
time="2016-12-09T20:29:39Z" level=info msg="Listening for HTTP" addr=":4000" proto=tcp
time="2016-12-09T20:29:39Z" level=info msg="Leader Election: Cluster leadership lost"
2016/12/09 20:29:40 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
time="2016-12-09T20:29:40Z" level=error msg="zk: could not connect to a server"
time="2016-12-09T20:29:40Z" level=error msg="zk: could not connect to a server"
time="2016-12-09T20:29:40Z" level=error msg="Discovery error: zk: could not connect to a server"
2016/12/09 20:29:42 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
time="2016-12-09T20:29:42Z" level=error msg="Discovery error: zk: could not connect to a server"
2016/12/09 20:29:44 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
time="2016-12-09T20:29:44Z" level=error msg="Discovery error: zk: could not connect to a server"
time="2016-12-09T20:29:44Z" level=error msg="Discovery error: Unexpected watch error"
2016/12/09 20:29:46 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
2016/12/09 20:29:48 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
time="2016-12-09T20:29:50Z" level=info msg="Leader Election: Cluster leadership lost"
2016/12/09 20:29:50 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
time="2016-12-09T20:29:50Z" level=error msg="zk: could not connect to a server"
time="2016-12-09T20:29:50Z" level=error msg="zk: could not connect to a server"
But a simple telnet command shows me that my ZooKeeper host is reachable. So why do I get an i/o timeout when Swarm tries to connect to the ZooKeeper discovery backend?

As mentioned in the comments, there is a new version called Swarm mode, embedded in Docker since 1.12. It includes a built-in, highly available distributed object store, so you don't have to set up an external KV store yourself.
Now regarding your issue with the first version of Swarm, one line caught my attention:
2016/12/09 20:29:50 Failed to connect to <swarm-manager-ip>:2181: dial tcp <swarm-manager-ip>:2181: i/o timeout
To me it seems that zookeeper is not running on your machine or that you didn't point to the right port.
First check that zookeeper is running on your machine with:
ps aux | grep zookeeper
You should see a process running.
If not, make sure you create a zoo.cfg file in the conf directory of your zookeeper installation specifying the right port, for example:
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
You can look at this tutorial to bootstrap ZooKeeper.
After this you can run zkServer.sh start (the script is in the bin directory of your ZooKeeper installation) to start your ZooKeeper instance, and Swarm should then be able to connect and register the Leader key.
If this still does not work, try downgrading to ZooKeeper 3.4.6, as this is the last version known to be supported since the switch to Docker Swarm mode.
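A running process and a port that actually answers are separate questions. ZooKeeper 3.4.x servers normally reply imok to the four-letter command ruok, so a minimal sketch of a liveness check could look like the following. The function name zk_ok and the 3-second timeout are illustrative choices, and it assumes a netcat (nc) that accepts -w as a timeout.

```shell
# Hypothetical helper (not part of ZooKeeper or Docker): succeeds when the
# server at host:port answers "imok" to ZooKeeper's "ruok" command.
# Assumes an nc (netcat) binary that accepts -w <seconds> as a timeout.
zk_ok() {
  local host="$1" port="${2:-2181}" reply
  reply="$(printf 'ruok' | nc -w 3 "$host" "$port" 2>/dev/null)" || return 1
  [ "$reply" = "imok" ]
}

# Example call, with the placeholder replaced by the real address:
#   zk_ok <swarm-manager-ip> 2181 && echo "ZooKeeper is answering"
```

Because the Swarm manager connects from inside a container, running the same check from a container on the manager host may give a different result than running it from the host shell.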

Related

Docker Swarm installation. Connect: no route to host error

I need to set up a 3-node (D1, D2 and D3) Docker cluster using swarm and install Elasticsearch and Kibana, with each node on its own Oracle Linux 7.4 virtual machine. D1 is the master node and D2 and D3 are worker nodes.
Once Docker Engine was installed, I followed this document to create a swarm. However, executing the join command on D2 or D3 gives the error below:
Command: sudo docker swarm join --token <Token-ID> <IP>:2377
Error: Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp <IP>:2377: connect: no route to host"
All these node servers are on the same network and there are no firewall restrictions.
sudo netstat -tulpn | grep LISTEN shows that ports 2377 and 7946 are listening, but I don't see port 4789 as mentioned here.
Please assist.
I resolved it by running the command below on the master node:
sudo systemctl stop firewalld.service
firewalld is a customizable, zone-based host firewall. The command above stops the service only for the current boot; if the service is enabled, it starts again after a reboot.
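Stopping firewalld removes all filtering on the host. A narrower alternative is to keep it running and open only the ports that Swarm mode documents (TCP 2377, TCP and UDP 7946, UDP 4789). A minimal sketch, assuming the interface used by the cluster belongs to firewalld's default zone; open_swarm_ports is a hypothetical helper name:

```shell
# Hypothetical helper: open the Swarm mode ports in firewalld permanently,
# then reload so the permanent rules become the running configuration.
# Assumes the interface used by the cluster is in firewalld's default zone.
open_swarm_ports() {
  local spec
  for spec in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
    sudo firewall-cmd --permanent --add-port="$spec" || return 1
  done
  sudo firewall-cmd --reload
}
```

Calling open_swarm_ports on each node should leave the rules in place across reboots. Port 2377 is the cluster management port, so on pure worker nodes it may be unnecessary.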

Docker Swarm: Getting connection refuse while adding worker node

I just started learning Docker and I am facing the problem below. Please let me know what I am doing wrong.
My use case: set up a Docker swarm manager and add a worker node to it.
Step 1: To create the manager, I used the command below:
docker swarm init --advertise-addr <<ip_address>>
Step 2: Run the command below, which prints the docker command for adding a worker.
docker swarm join-token worker
After running the above command, I got this output:
docker swarm join --token SWMTKN-1-653srs28a6s48dqxnak9g9kic2cd1xyeowgnke53nf83710wfv-7u7u7u1vovahvn792814q2sts ip_address:2377
Step 3: I logged in to the worker node and ran the above docker swarm join command, but I got the error message below.
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp ip_address:2377: connect: connection refused"
This could well be a firewall issue. Make sure ports 2377, 7946 and 4789 are open between the hosts acting as manager or worker nodes.
From the docs:
Open protocols and ports between the hosts. The following ports must be available:
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
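Before retrying the join, it can help to confirm from the worker that the manager's TCP ports accept connections. A minimal sketch using bash's /dev/tcp redirection and the timeout command from GNU coreutils; port_open and check_manager are hypothetical helper names, and the UDP ports (7946/udp, 4789/udp) cannot be verified this way:

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within 3 seconds. Relies on bash's /dev/tcp pseudo-device and on
# the `timeout` command from GNU coreutils.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the state of the two TCP ports listed in the docs for a manager.
check_manager() {
  local host="$1" port
  for port in 2377 7946; do
    if port_open "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
    fi
  done
}

# Example call from the worker, with the manager's real address:
#   check_manager <manager-ip>
```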

failed: port is already allocated

I use Docker for running Oracle 11g Express on macOS Sierra 10.12.2
https://github.com/wnameless/docker-oracle-xe-11g
This is my error:
Last login: Sat Jan 7 22:42:11 on ttys000
➜ ~ docker run -d -p 49160:22 -p 49161:1521 wnameless/oracle-xe-11g
docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?.
See 'docker run --help'.
➜ ~ docker run -d -p 49160:22 -p 49161:1521 wnameless/oracle-xe-11g
043d8caecbb45d6e2e5999b69a2f760c20d53ff3aa2fad78cb1eb70acb058a1f
docker: Error response from daemon: driver failed programming external connectivity on endpoint serene_lalande (08bb0bd9684c0f92db7b736986bf894d3a57a714324405823496d13e175e7491): Error starting userland proxy: Bind for 0.0.0.0:49161 failed: port is already allocated.
➜ ~
My diagnostics:
➜ ~ netstat -anp tcp | grep 49161
tcp4 0 0 192.168.1.2.49161 17.188.166.13.5223 ESTABLISHED
➜ ~
➜ ~ docker --version
Docker version 1.12.5, build 7392c3b
My Diagnostic ID: 20EB9506-CC72-4093-8A15-60E05A841ED1
I don't know why this happens. A few weeks ago it ran successfully. Recently I released my DHCP lease and got a new IP address. How can I run the Docker container with Oracle 11g Express successfully?
You can't launch the same command twice:
docker run -d -p 49160:22
This asks Docker to allocate port 49160 on the host twice, so of course the second time you get your error message. For the second run, try:
docker run -d -p 49161:22
You will need to use a different port instead of 49161. Try a port less than 49152.
You have a pre-existing connection between port 49161 on your computer and port 5223 on a remote Apple server. That port therefore cannot be used for anything else until that connection ceases to exist. Port 5223 is used for Apple's push notifications. As best as I can tell, your computer just happened to pick the random port 49161 to connect to Apple's server this time. Previously, when that Docker container worked, I would bet port 49161 on your computer was not in use.
Whenever you connect to a remote server, your own computer allocates a random port number for that connection. This time around, your computer allocated 49161 when it connected to Apple's push notifications service. Next time, it could be a completely different number. See https://en.wikipedia.org/wiki/Ephemeral_port
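The ephemeral-port explanation suggests a simple rule for picking host ports: stay out of the range the operating system draws outgoing ports from. A minimal sketch, assuming the IANA dynamic range 49152-65535 (individual systems can be configured differently); is_ephemeral is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds if the port lies in the IANA dynamic
# (ephemeral) range 49152-65535. Individual systems may use other ranges.
is_ephemeral() {
  [ "$1" -ge 49152 ] && [ "$1" -le 65535 ]
}

# Flag each candidate host port that outgoing connections might occupy.
for host_port in 49160 49161 1521; do
  if is_ephemeral "$host_port"; then
    echo "$host_port: inside the ephemeral range, may collide"
  else
    echo "$host_port: outside the ephemeral range"
  fi
done
```

Mapping the container to host ports below 49152, for instance -p 2222:22 -p 1521:1521 (arbitrary example values), should keep it clear of that range.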

Docker swarm mode load balancing not working as described

Update
I believe the culprit is the master, which does not appear to be listening on port 7946. netstat shows that 7946 is listening on the nodes, but not on the master. When I check the syslogs for the nodes I see the following error:
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
Original Post
I am running a three-node Swarm mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12.
I created all of the services with docker-machine. Each machine is running Ubuntu 15.10 with Docker 1.12.3.
Linux swarm-master-01 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:26:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Using the master node I have created a service with the following
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp I get the following output
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I curl a node that does not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
note: I scrubbed the IP of node-02
My Troubleshooting:
The nodes are both properly connected to the swarm
Scaling the service up to 5 (which inherently deploys the task to every node) makes curl work on every node, because the task is deployed to every node.
UPDATE 1
I initialized the swarm with
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377
I checked the syslogs from the nodes and I'm seeing the following errors
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
I checked to see if the ingress port was listening and it doesn't seem to be
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
I was able to get around the issue for now, but I don't know what initially caused it. Port 7946 (used for communication among nodes) wasn't listening on swarm-master-01. I figured this out with netstat -nlt. I searched the syslogs and found these errors related to the port.
Nov 8 20:28:20 ubuntu docker[23092]: time="2016-11-08T20:28:20.171385360Z" level=warning msg="2016/11/08 20:28:20 [ERR] memberlist: Failed TCP fallback ping: read tcp 10.0.0.85:54016->10.0.0.13:7946: i/o timeout"
Nov 9 18:26:17 swarm-node-01 docker[714]: time="2016-11-09T18:26:17.573441271Z" level=warning msg="2016/11/09 18:26:17 [ERR] memberlist: Failed to send indirect ping: write udp [::]:7946->10.0.0.38:7946: use of closed network connection"
For some reason Docker refused to open this port and listen on it any more. Here is what I did (albeit undesirable) to work around the issue:
Created another node with docker-machine called swarm-master-02
Joined swarm-master-02 to the cluster as a master
Demoted master-01 which set master-02 as the leader
Restarted the docker daemon on each node (might not have been necessary)
Now all of the machines are working as expected except for swarm-master-01. One task is running on swarm-node-01, and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on port 7946 and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster, restarting the Docker daemon, and joining it again as a master. Now 7946 is listening on that machine.
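The check that exposed the problem (is anything listening on 7946?) can be scripted so it is easy to repeat on every node. A minimal sketch that reads netstat -nlt style output on standard input; is_listening is a hypothetical helper name, it assumes the Linux net-tools column order, and it only covers TCP (7946/udp would need a separate check):

```shell
# Hypothetical helper: reads `netstat -nlt`-style lines on stdin and succeeds
# if some socket in LISTEN state has a local address ending in :<port>.
# Assumes Linux net-tools column order (local address in column 4); the
# state is the last column, or the one before it when -p adds a PID column.
is_listening() {
  awk -v port="$1" '
    NF >= 6 && ($NF == "LISTEN" || $(NF-1) == "LISTEN") {
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}

# Example call on a node:
#   sudo netstat -nlt | is_listening 7946 && echo "7946 is listening"
```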

Docker1.12 Worker not able to join in cluster(Swarm: Pending)

Manager version: Docker version 1.12.0-rc5, build a3f2063.
Worker version: Docker version 1.12.0-rc5, build a3f2063.
Created the swarm manager:
docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (3kmewyb10p8xj3ke5rpjyw4s8) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-85cwe5pf779qw0knjn6wxdbim \
172.25.30.2:4243
Then I joined the worker:
docker swarm join --token SWMTKN-1-5lwzvv7au6hosiqqmdwmcxvmlmhtz4ts04jsg06284fq3posn0-enq26dqnwma38ij48hymtnioq 172.25.30.2:4243
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
I checked the logs on the worker:
time="2016-08-01T00:22:47.449844174-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.449962215-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450025342-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450081950-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450142443-07:00" level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:22:47.450202836-07:00" level=error msg="cluster exited with error: rpc error: code = 1 desc = context canceled"
time="2016-08-01T00:23:31.351868722-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
In docker info, I saw "Swarm: Pending".
I also ran docker swarm update. Still, the worker was not able to join the cluster. How can I resolve this?
UPDATE-1
I uninstalled Docker, removed the config files, and then installed Docker 1.12 again (Docker version 1.12.0, build 8eab29e).
I am still facing the same problem (not able to join, and "Swarm: Pending" in docker info), with a different error in /var/logs/upstat/docker.logs:
time="2016-08-01T11:22:08.629760770-07:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Thanks.
The problem was that I was trying to join with the wrong port (the one shown in the docker swarm init output).
1) Before docker swarm init, Docker was listening only on port 4243, which I checked with netstat -tulpn | grep docker. So I advertised with that port!
root@veeru:~# netstat -tulpn | grep docker
tcp6 0 0 :::4243 :::* LISTEN 8750/dockerd
root@veeru:~# docker swarm init --advertise-addr "172.25.30.2:4243"
Swarm initialized: current node (exvwgj0pu4cd124ljnblt9xff) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas \
172.25.30.2:4243
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-axhgqgo4jqw4hv38x578m44wh \
172.25.30.2:4243
2) After docker swarm init, Docker is listening on four sockets, including port 2377 (netstat -tupln | grep docker).
root@veeru:~# netstat -tulp | grep docker
tcp6 0 0 [::]:2377 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:7946 [::]:* LISTEN 8750/dockerd
tcp6 0 0 [::]:4243 [::]:* LISTEN 8750/dockerd
udp6 0 0 [::]:7946 [::]:* 8750/dockerd
The output in point 1 tells the worker to run docker swarm join with port 4243. That is what I ran previously, and it does not work.
Later I ran docker swarm leave on the worker and joined again with port 2377. Now I am able to join!
docker swarm join --token SWMTKN-1-5j9mpo8hepue6g1sjdas33thr92w1o9hlef5auwqpbxs3glt39-6zomhgu204m9alq51f632nzas 172.25.30.2:2377
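The mistake is easy to repeat, because the docker swarm init output shown in point 1 echoes the port that was passed to --advertise-addr. A minimal sketch of a pre-flight check on the join address; check_join_addr is a hypothetical helper name, and it assumes the manager uses the default swarm port 2377 (a custom --listen-addr port would make a different value legitimate):

```shell
# Hypothetical helper: checks the port of a host:port join address against
# the default swarm management port. Assumes the manager was not started
# with a custom --listen-addr port.
check_join_addr() {
  local addr="$1" port
  port="${addr##*:}"
  if [ "$port" != "2377" ]; then
    echo "warning: $addr uses port $port, not the default swarm port 2377" >&2
    return 1
  fi
  echo "$addr uses the default swarm port"
}

# Example call before running docker swarm join:
#   check_join_addr 172.25.30.2:4243
```

Omitting the port from --advertise-addr at init time (for example --advertise-addr 172.25.30.2) should let Docker fall back to the listen port, which avoids advertising the API port by accident.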
For me it was a firewall issue too.
I pinged the manager node and it answered.
I then checked whether the ports were open using telnet, could not connect, and figured out it was a port issue.
If you are running CentOS, the port can easily be opened using firewalld.
Check whether firewalld is running:
sudo firewall-cmd --state
Open the port you need:
sudo firewall-cmd --zone=public --add-port=2377/tcp
Change the port to whichever port your node is trying to connect to.
Just expose port 2377 on the manager and it will work.
The error clearly means the node is unable to connect to the manager, so a timeout happens. To confirm this, run telnet manager-ip 2377 (don't rely on ping; it won't work for this).
If you still get the same error even though all firewalls are disabled on both the nodes and the manager, try creating another manager with port 2377 exposed, as below:
docker-machine create --driver amazonec2 --amazonec2-open-port 2377 manager1
Now try to join the nodes to the newly created manager. The port you use to join should be 2377; if you are going to use a different one, expose that port in the command above. Doing this worked for me. I suspect it is because others used separate servers, whereas I am using the same server for both manager and nodes.
According to Docker's website, these are the ports to enable.
Run the following commands on both the Swarm manager and the worker nodes:
sudo ufw enable
sudo ufw allow 22/tcp
sudo ufw allow 2376/tcp
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp
sudo ufw reload
We just gave access to the necessary ports. After running these commands, all docker commands should work.
I was facing a similar issue; in my case the port was being blocked by a firewall rule.
I was having the same issue. I was running CoreOS VMs in Azure. I found out that all my VMs had the same private IP address and different public IP addresses. This usually happens when the VMs are part of the same security group; however, that was not the case this time. The issue was that my account had reached the maximum number of resources, so I deleted resources such as IP addresses, NSGs, and networks and then provisioned new VMs. They had different private IPs, and when I ran the command everything was fine. My Docker version is 1.12.6.
Assuming you did so: if you get "Connection time out", it means that a firewall is preventing you from connecting, either on the source host or the destination host (e.g. iptables rules), or somewhere in between.
If you are running on a public cloud, make sure that access lists (e.g. EC2 security groups) allow connections between hosts on that port.
I was trying to connect 4 nodes (1 master, 3 slaves) on EC2 with an Ubuntu Server AMI. For me it was a firewall issue.
Check your security groups => Inbound rules. For me the source was set to custom; I changed it to anywhere and it worked.

Resources