Difference between a docker swarm node running as Leader and running as Manager

I wish to understand the difference between a docker swarm node running as a Leader and one running as a Manager.
I also understand that there can be several docker managers, but can there be multiple docker swarm Leader nodes, and if so, why?
Note: I'm aware of what a docker worker node is.

Docker swarm has the following terminology:
Manager Node (can be a leader or just a manager)
Worker Node
In a simple docker swarm mode setup there is a single manager, and the other nodes are workers. In this case the single manager is the leader.
It is possible to have more than one manager node, e.g. 2 managers (an odd number is preferred, like 1, 3, 5). In that case only one of them is the leader, and the leader is responsible for scheduling tasks on the worker nodes. The manager nodes also talk to each other to maintain a consistent cluster state. This makes the environment highly available: when the manager node that is currently the leader goes down, scheduling should not stop. At that moment another manager is automatically promoted to leader and takes over the responsibility of scheduling tasks (containers) on the worker nodes.
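To see which manager currently holds the leader role, you can check the MANAGER STATUS column of docker node ls on any manager node (a minimal sketch; IDs and hostnames are placeholders):
$ docker node ls
ID        HOSTNAME   STATUS  AVAILABILITY  MANAGER STATUS
<id1> *   manager1   Ready   Active        Leader
<id2>     manager2   Ready   Active        Reachable
<id3>     worker1    Ready   Active
Workers show an empty MANAGER STATUS, non-leader managers show Reachable, and the current leader shows Leader. A worker can be turned into a manager at any time with docker node promote <node-name>.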

Related

adding a manager back into swarm cluster

I have a swarm cluster with two nodes: 1 manager and 1 worker. I am running my application on the worker node, and to run a test case I force-removed the manager from the docker swarm cluster.
My application continues to work, but I would like to know if there is any possibility of adding the force-removed manager back into the cluster. (I don't remember the join-token and haven't copied it anywhere.)
I understand docker advises having an odd number of manager nodes to maintain quorum, but I would like to know if docker has addressed such scenarios anywhere.
docker swarm init --force-new-cluster --advertise-addr node01:2377
When you run the docker swarm init command with the --force-new-cluster flag, the Docker Engine where you run the command becomes the manager node of a single-node swarm which is capable of managing and running services. The manager has all the previous information about services and tasks, worker nodes are still part of the swarm, and services are still running. You need to add or re-add manager nodes to achieve your previous task distribution and ensure that you have enough managers to maintain high availability and prevent losing the quorum.
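Once the single-node swarm is running again, you can regenerate the lost join-tokens from it and use them to re-add nodes (a minimal sketch; the token and address shown are placeholders):
$ docker swarm join-token manager
# prints something like:
#   docker swarm join --token SWMTKN-1-<...> node01:2377
$ docker swarm join-token worker
Run the printed join command on each machine you want to add back as a manager or worker.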
How about creating a new cluster from your previous manager machine and letting the worker leave the old cluster to join the new one? That seems like the only possible solution to me: since your cluster now does not have any managers and the previous manager is not in any cluster, you can just create a new cluster and let the other workers join it.
# In previously manager machine
$ docker swarm init --advertise-addr <manager ip address>
# *copy the generated command for worker to join new cluster after this command
# In worker machine
$ docker swarm leave
# *paste and execute the copied command here

Remove terminated instances (managers) from swarm, and recover swarm state

I have a docker swarm cluster with managers running on 6 AWS instances. During some testing, we accidentally terminated 3 of the instances (which were running managers). Now the swarm state seems broken, generating errors like:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online
I tried removing the terminated managers through docker commands, but whatever command I run, like docker node ls or others, gives me the same error as above. I also tried adding a new node; while joining it to the swarm it generates the same error.
I can see all the terminated instances' IPs when I issue docker info inside one of the managers, but I can't do anything. How can I recover from this state?
Node Address: 10.80.8.195
Manager Addresses:
10.80.7.104:2377
10.80.7.213:2377
10.80.7.226:2377
10.80.7.91:2377
10.80.8.195:2377
10.80.8.219:2377
The clustering facility within the swarm is maintained by the manager nodes.
In your case, you lost the cluster quorum by deleting half of the cluster managers. With only half of the managers left, the remaining nodes cannot elect a new leader and no manager can control the swarm.
In this situation, the only way to recover your cluster is to re-initialize it, which forces the creation of a new cluster.
On a manager node, run this command:
docker swarm init --force-new-cluster
As for the other manager nodes, I don't remember whether they join the new cluster on their own or whether you need to make them leave and rejoin.
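In case they do not rejoin on their own, a hedged sketch of the manual path (token and address are placeholders) is to fetch a fresh manager token from the re-initialized node and make each remaining manager leave and rejoin with it:
# on the re-initialized manager
$ docker swarm join-token manager
# on each remaining manager that did not rejoin
$ docker swarm leave --force
$ docker swarm join --token <manager-token> <new-leader-ip>:2377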

How to recover a node in quorum?

I have set up a three-node quorum network using docker. If my network crashes and I only have the information of one of the nodes, can I recover that node using the binary? Also, the blocks in the new network should be in sync with the others. Please guide me on how this is possible.
I assume you're using docker swarm. The clustering facility within the swarm is maintained by the manager nodes. If for any reason the manager leader becomes unavailable and there are not enough remaining managers to reach quorum and elect a new leader, quorum is lost and no manager node is able to control the swarm.
In this kind of situation, it may be necessary to re-initialize the swarm and force the creation of a new cluster by running the following command on the manager leader once it is brought back online:
# docker swarm init --force-new-cluster
This removes all managers except the manager the command was run from. The good thing is that worker nodes will continue to function normally and the other manager nodes should resume functionality once the swarm has been re-initialized.
Sometimes it might be necessary to remove manager nodes from the swarm and rejoin them to the swarm.
But note that when a node rejoins the swarm, it must join the swarm via a manager node.
You can always monitor the health of manager nodes by querying the docker nodes API in JSON format through the /nodes HTTP endpoint:
# docker node inspect manager1 --format "{{ .ManagerStatus.Reachability }}"
# docker node inspect manager1 --format "{{ .Status.State }}"
Also, make it a practice to automate backups of the docker swarm state directory /var/lib/docker/swarm so that you can easily recover from a disaster.
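A minimal backup sketch along those lines (assuming systemd manages the docker service and that a /backup directory exists; paths and filenames are illustrative):
# stop the engine so the swarm state is not modified during the copy
$ sudo systemctl stop docker
$ sudo tar -czf /backup/swarm-state-$(date +%F).tar.gz -C /var/lib/docker swarm
$ sudo systemctl start docker
Restoring on a replacement node is roughly the reverse: unpack the archive into /var/lib/docker/swarm, then run docker swarm init --force-new-cluster on that node.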

Docker Swarm failover behavior seems a bit underwhelming

I am currently trying to use Docker Swarm to set up our application (consisting of both stateless and stateful services) in a highly available fashion on a three node cluster. By "highly available" I mean "can survive the failure of one of the three nodes".
We have been doing such installations (using other means, not Docker, let alone Docker Swarm) for quite a while now with good success, including acceptable failover behavior, so our application itself (or rather the services that constitute it) has proven that it can be made highly available in such a three node setup.
With Swarm, I get the application up and running successfully (with all three nodes up) and have taken care that I have each service configured redundantly, i.e., more than one instance exists for each of them, they are properly configured for HA, and not all instances of a service are located on the same Swarm node. Of course, I also took care that all my Swarm nodes joined the Swarm as manager nodes, so that any one of them can be leader of the swarm if the original leader node fails.
In this "good" state, I can reach the services on their exposed ports on any of the nodes, thanks to Swarm's Ingress networking.
Very cool.
In a production environment, we could now put a highly-available loadbalancer in front of our swarm worker nodes, so that clients have a single IP address to connect to and would not even notice if one of the nodes goes down.
So now it is time to test failover behavior...
I would expect that killing one Swarm node (i.e., hard shutdown of the VM) would leave my application running, albeit in "degraded" mode, of course.
Alas, after doing the shutdown, I cannot reach ANY of my services via their exposed (via Ingress) ports anymore for a considerable time. Some do become reachable again and indeed have recovered successfully (e.g., a three node Elasticsearch cluster can be accessed again, of course now lacking one node, but back in "green" state). But others (alas, this includes our internal LB...) remain unreachable via their published ports.
"docker node ls" shows one node as unreachable
$ docker node ls
ID                          HOSTNAME   STATUS  AVAILABILITY  MANAGER STATUS
kma44tewzpya80a58boxn9k4s * manager1   Ready   Active        Reachable
uhz0y2xkd7fkztfuofq3uqufp   manager2   Ready   Active        Leader
x4bggf8cu371qhi0fva5ucpxo   manager3   Down    Active        Unreachable
as expected.
What could I be doing wrong regarding my Swarm setup that causes these effects? Am I just expecting too much here?

In Docker swarm mode, is there a way to get managers' information from workers?

In docker swarm mode I can run docker node ls to list swarm nodes, but it does not work on worker nodes. I need a similar function for workers. I know worker nodes do not have a strongly consistent view of the cluster, but there should be a way to get the current leader or at least a reachable manager.
So is there a way to get current leader/manager on worker node on docker swarm mode 1.12.1?
You can get manager addresses by running docker info from a worker.
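For example (a small sketch; the Go-template field path is my assumption about the docker info output structure and may differ between versions):
$ docker info --format '{{ range .Swarm.RemoteManagers }}{{ .Addr }} {{ end }}'
# prints something like: 10.0.0.1:2377 10.0.0.2:2377
Note that this lists the manager addresses the worker knows about, not which of them is currently the leader.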
The docs and the error message from a worker node mention that you have to be on a manager node to execute swarm commands or view cluster state:
Error message from a worker node: "This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
After further thought:
One way you could crack this nut is to use an external key/value store like etcd or any other key/value store that Swarm supports and store the elected node there so that it can be queried by all the nodes. You can see examples of that in the Shipyard Docker management / UI project: http://shipyard-project.com/
Another simple way would be to run a redis service on the cluster and another service to announce the elected leader. This announcement service would have a constraint to only run on the manager node(s): --constraint node.role == manager
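A rough illustration of that constraint (the service name and image are hypothetical; only the --constraint flag itself is the point):
$ docker service create --name leader-announcer \
    --constraint 'node.role == manager' \
    myorg/leader-announcer:latest
With that constraint the announcer tasks are only scheduled on manager nodes, which are the ones that can see the cluster state.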
