Docker Swarm node fails to join after reboot. State 'pending'

I've successfully set up a Docker Swarm cluster with one manager and two worker nodes by running docker swarm init and docker swarm join respectively. docker node ls then shows that all nodes are active. However, if one of the worker nodes restarts, it is unable to rejoin. Running docker node ls on the manager then shows the restarted node in state pending. I've enabled debugging, and running systemctl status docker-latest -l on the worker that fails to join shows many of these:
level=error msg="agent: session failed" error="rpc error: code = 13 desc = connection error: desc = \"transport: tls: oversized record received with length 20527\"" module="node/agent"
OS: Red Hat Enterprise Linux Server release 7.5
Docker version 1.13.1, build 8633870/1.13.1 (installed package docker-latest from repository). I also tried the regular docker package, with no difference.
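One way to read the "oversized record received with length 20527" message, offered as a hypothesis rather than a confirmed diagnosis: a TLS record header is 5 bytes (type, two version bytes, two length bytes), so if the peer answers in plain text, the fourth and fifth bytes of that text are interpreted as the record length. The sketch below splits 20527 into its two bytes and prints them as ASCII. The function name decode_record_length is made up for this illustration.

```shell
#!/bin/sh
# Split a 16-bit TLS record length into its two bytes and print them as ASCII.
# decode_record_length is a hypothetical helper, not part of Docker.
decode_record_length() {
    len=$1
    hi=$(( len / 256 ))   # high byte of the length field
    lo=$(( len % 256 ))   # low byte of the length field
    # The inner printf produces the octal code; the outer printf turns \NNN into a character.
    printf "\\$(printf '%03o' "$hi")\\$(printf '%03o' "$lo")\n"
}

decode_record_length 20527
```

The two bytes come out as "P/", which matches the fourth and fifth characters of "HTTP/". That would be consistent with something that speaks plain HTTP (a proxy, for example) answering the worker instead of the manager's TLS endpoint, so the proxy settings of the Docker daemon on the worker may be worth checking.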

Related

Docker 19.03.12: The swarm does not have a leader after swarm upgrade

I've run into some strange trouble with Docker since the last update.
Can you help me with this?
This is not my first package upgrade, and the problem has been reproduced on a fresh stack.
Upgraded from 18.09.9 to 19.03.12
OS : Ubuntu 16.04 Server
Docker package
docker-ce=5:18.09.9~3-0~ubuntu-bionic
docker-ce-cli=5:19.03.11~3-0~ubuntu-bionic
containerd.io=1.2.13-2
Details
A problem was identified with Docker version 19.03.12.
The managers have been upgraded to version 19.03.12.
When you try to add a manager to a swarm that has an active leader, an error message appears.
The various known solutions were tried.
Case
As soon as you run the docker swarm join --token command on the non-leader managers, the leader manager becomes unavailable after a few minutes.
-> I am forced to rerun docker swarm init --force-new-cluster --advertise-addr xx.xx.xx.xx --listen-addr xx.xx.xx.xx:2377 to get an operational leader again.
The leader sees the worker nodes running version 19.03.12. There is no problem with the workers.
Restarting the docker service leads to the same result.
Error Message
The swarm does not have a leader
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
docker msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn"
References applied
https://github.com/moby/moby/issues/34384
Docker Node is Down after service restart
https://cynici.wordpress.com/2018/05/31/docker-info-rpc-error-on-manager-node/
https://gitmemory.com/issue/docker/swarmkit/2670/481951641
https://forums.docker.com/t/cant-add-third-swarm-manager-or-create-overlay-network-the-swarm-does-not-have-a-leader/50849
https://askubuntu.com/questions/935569/how-to-completely-uninstall-docker

How to resolve "Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader."

I am trying to learn Docker and Swarm. I created a swarm with 3 nodes and completed an example using VirtualBox and docker-machine. Once I restarted my machine, all nodes were shown as stopped. I started all nodes using
docker-machine start node1 node2 node3
All nodes started, but I am still not able to list the nodes, even on the master node, and I get the error below:
docker@node1:~$ docker node ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Also the docker state on node1 (master) is pending.
Swarm: pending
NodeID: c93hv5pixlfiei7q9qneuiuen
Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
I get this error every time I restart my machine. This forces me to set everything up from scratch each time.
Is there any way I can avoid setting up the cluster again and again?
Thanks
You must include the docker service start somewhere in your boot config.
Preventing
Demote the node you are going to switch off or remove from the swarm:
# find node id
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
o1iz67ehuenfzbyg2gjxayaee hostA Ready Active Reachable 20.10.6
fic857lrupfemxqie5rvq63yt * hostB Ready Active Leader 20.10.6
$ docker node demote o1iz67ehuenfzbyg2gjxayaee
Manager o1iz67ehuenfzbyg2gjxayaee demoted in the swarm.
# from now on, the node can safely leave the swarm
$ docker swarm leave --force
Reacting
Restart if there are no healthy nodes.
Stop and then start the Docker engine (do NOT use restart) and init the swarm again. Validate the firewall ruleset afterwards, as Docker overwrites it.
$ systemctl stop docker
$ systemctl start docker
Drain the node that left if there is a healthy manager node.
Reference https://cynici.wordpress.com/2018/05/31/docker-info-rpc-error-on-manager-node/
Please check the firewall on Linux:
If you want to promote a node to manager, check that port 2377 is accepting requests on that node. Only then can the node work as a manager. Otherwise you will get an error like the one below:
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Solution: open port 2377 in the firewall.
firewall-cmd --zone=public --add-port=2377/tcp --permanent
success
firewall-cmd --reload
success
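The commands above open only 2377/tcp, the cluster management port. Docker's Swarm documentation also lists 7946/tcp and 7946/udp (communication among nodes) and 4789/udp (overlay network traffic). A minimal sketch for opening those as well, assuming firewalld and the public zone as in the answer above:

```shell
# Assumes firewalld with the "public" zone, as in the answer above.
# Run as root on every node in the swarm.
firewall-cmd --zone=public --permanent --add-port=7946/tcp   # node-to-node communication
firewall-cmd --zone=public --permanent --add-port=7946/udp   # node-to-node communication
firewall-cmd --zone=public --permanent --add-port=4789/udp   # overlay network (VXLAN) traffic
firewall-cmd --reload
```

Port 2377/tcp only needs to be reachable on manager nodes, while the other three apply to managers and workers alike.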

Is it possible to use docker swarm with rootless docker?

I have successfully installed rootless docker and now I'm trying to use docker swarm with it. I'm running four GCP instances. I followed below steps:
on Node 1
docker swarm init --advertise-addr 34.93.X.X
docker swarm join-token manager gives
docker swarm join --token SWMTKN-1-21vhv6gawb9mpur1v379sq52ia2jq4n0boqes0wos10o7m833l-5935hxvsht0x21o0qjpeqykae 34.93.X.X:2377
on Node 2
docker swarm join --token SWMTKN-1-2xtpxpc18p8qf3e4kb3dvsjr4a4ae786entmwuekh6w5bbfmpz-e5rhoya81d1pajet80wx34mcv 34.93.X.X:2377 --advertise-addr 34.93.X.X
gives the error below:
Error response from daemon: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 34.93.X.X:2377: connect: connection refused"
NOTE
With rootful Docker I'm able to join the nodes.
It's not possible today. It's not Swarm's fault, it's the design of Linux. Swarm (by default) uses overlay networking that creates virtual IPs, VXLAN routes, and more in iptables, and rootless (anything) can't control Linux networking to that level as far as I know.
See https://docs.docker.com/engine/security/rootless/#known-limitations
If your goal is just to lock down Docker, I think it's much more effective to use things like User Namespaces (dockerd runs as root, but containers don't run as root), change the default user running in containers, and take the other steps I list here: https://github.com/BretFisher/ama/issues/17

Docker Swarm: Joining the node not working

I am trying to join a worker node to a manager on another machine. The former is a Mac and the latter is Windows. The worker host on the Mac gets this response:
Timeout was reached before node joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
When I typed the join-token command again, I received a response saying:
This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
When I typed this command on the manager side:
docker node ls
it only shows one node, which is the manager node.
Am I doing something wrong here?
Do you use the same docker version on all hosts?

Hyperledger - Docker swarm fails when deploying to multiple hosts

I am following this tutorial. I ran sudo docker swarm init --advertise-addr <myip> on the 1st Ubuntu machine. Then I took the manager join-token and ran it on the 2nd Ubuntu machine, which was able to join as a manager.
But the problem starts when I run docker network create --attachable --driver overlay my-net on the 1st machine. It gives me the following error:
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
If I run the above command to create network before joining the 2nd node, the network gets created successfully and the 2nd node also gets joined to the 1st swarm node. But when I do anything on the 1st Ubuntu machine, I get the same error on it.
Both Ubuntu machines are in same network and can be pinged by each other.
Ubuntu version - 17.1 64 bit
Docker version 18.03.1-ce, build 9ee9f40
Docker-compose version 1.21.2, build a133471
It seems that the tutorial is off, as you will end up with only two managers. Quorum requires more than half of the managers, so with two managers both must be reachable, and the swarm loses its leader as soon as either one is unavailable. You can either add an additional manager node or simply create a single manager (docker swarm init) and then join a single worker using the command that is output as part of the response to docker swarm init. You should SKIP the docker swarm join-token manager step from the tutorial.
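The quorum arithmetic behind this answer can be sketched in a few lines of shell. The function names quorum and tolerated_failures are made up for this illustration. The formulas are the usual majority rule: quorum is floor(N/2) + 1, and the number of managers that can be lost is floor((N-1)/2).

```shell
#!/bin/sh
# quorum and tolerated_failures are hypothetical helper names for this sketch.
# Managers that must be reachable for the swarm to have a leader.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Managers that can fail while the swarm still keeps quorum.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 5; do
    echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With two managers the quorum is 2 and no failure is tolerated, which is no better than a single manager. Three managers is the smallest count that survives the loss of one.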
Just change the IP of your Ubuntu machine.
Machine -> Settings -> Network -> set "Attached to" to Bridged Adapter.
Restart your machine.

Resources