Delay Docker Swarm Replica Rescheduling When a Node Restarts

I currently have a 3-node swarm mode cluster: 1 manager and 2 workers. I have created a service with 20 replicas. When running docker service ps <service>, I see that the replicas have been deployed evenly across the 3 nodes. I believe the default swarm placement strategy is spread rather than binpack. That's all good. The problem is when I restart one of the workers after some OS maintenance. The node takes a while to reboot, and during that time I do not want the replicas to be rescheduled onto the other 2 nodes, because I know the restarted node will soon come back online. Is there a way to delay Swarm from rescheduling replicas after a node reboot or failure? I want to give it more time, maybe 5 minutes or so, before confirming that the node has really failed.
Docker version 20.10.7
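As far as I know, Swarm does not expose a per-node grace period for this. The closest existing knob is the dispatcher heartbeat, which controls how often workers must report to the managers and therefore influences how quickly an unresponsive node is marked down. A rough sketch follows; the 5m value is purely illustrative, and treating this as a maintenance delay is an assumption, not a documented feature:

# Run on a manager node: raise the heartbeat period cluster-wide (default is 5s).
docker swarm update --dispatcher-heartbeat 5m

# Check node status before and after the worker reboots.
docker node ls

Keep in mind this is a cluster-wide setting, so a longer heartbeat also delays the detection of nodes that have genuinely failed.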

Related

Do I need to install Docker on all my nodes in swarm mode?

I know this is a basic question, but I'm new to Docker and have this query.
Do I need to install Docker on all the nodes that are part of my swarm?
If so, what are the ways to install Docker on all my nodes in one shot?
Of course you need to install Docker and its dependencies on each node. On the machine that becomes the first manager node, you initialize the swarm with docker swarm init, and then you join the other machines either as manager nodes or as worker nodes.
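A minimal sketch of that bootstrap sequence (the IP address and token below are placeholders):

# On the first manager: initialize the swarm, advertising its own address.
docker swarm init --advertise-addr 192.0.2.10

# Still on the manager: print the join command (including the token) for workers.
docker swarm join-token worker

# On each worker: run the printed command, for example:
docker swarm join --token SWMTKN-1-<token> 192.0.2.10:2377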
The number of manager nodes depends on how many node losses you need to be able to compensate for:
1 manager node: requires 1 healthy node
3 manager nodes: require 2 healthy nodes for quorum, can compensate 1 unhealthy node
5 manager nodes: require 3 healthy nodes for quorum, can compensate 2 unhealthy nodes
7 manager nodes: require 4 healthy nodes for quorum, can compensate 3 unhealthy nodes
More than 7 manager nodes is not recommended due to the overhead.
Using an even number does not provide more reliability; it is quite the opposite. If you have 2 manager nodes, the loss of either one of them renders the cluster headless. If the cluster is not able to build a quorum (which requires the majority of manager nodes to be healthy), it is headless and can not be controlled: running containers continue to run, but no new containers can be deployed, failed containers won't be redeployed, and so on.
People usually deploy a swarm configuration with a configuration management tool like Ansible, Puppet, Chef, or Salt.

Docker Swarm bringing up containers after node is down only after 1 minute

I'm testing Docker Swarm behavior for the case when one Swarm node goes down.
I've seen that Docker Swarm only starts bringing up the containers from the failed node after 50-60 seconds, which is quite long from my point of view. The containers themselves start quite fast.
So, is it possible to tune Swarm to decrease the time it takes to figure out that a node is down and that new containers need to be started?
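For what it's worth, the node-down detection window itself appears to be governed by the cluster's dispatcher heartbeat (see the first question above), while how fast a replacement task starts once a failure is detected depends on the service's restart policy and healthcheck. A hedged sketch using the service-level flags (the service name, image, and values are made up for illustration):

# Replace failed tasks immediately, and let a healthcheck flag an
# unhealthy container quickly.
docker service create --name web \
  --restart-condition any --restart-delay 0s \
  --health-cmd "wget -q -O /dev/null http://localhost/ || exit 1" \
  --health-interval 5s --health-timeout 2s --health-retries 2 \
  nginx:alpine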

How to stop and start a whole Swarm cluster

My question is simple: I need to stop all the nodes of a Swarm cluster. How do I stop it properly before shutting down the nodes?
And then, how do I restart it properly?
In my cluster (10 nodes), there are at least 60 services.
If all the services are in the same Docker stack, you can stop all the services running on the nodes with docker stack rm <stack>.
To start it again (restart), you run docker stack deploy.
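A sketch of that, assuming the stack was deployed from docker-compose.yml under the name mystack (both names are placeholders):

# Remove all services of the stack before shutting the nodes down.
docker stack rm mystack

# Once the nodes are back up, redeploy the same stack.
docker stack deploy -c docker-compose.yml mystack

If you only need to take individual nodes down for maintenance (rather than the whole cluster), docker node update --availability drain <node> moves their tasks elsewhere first.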

How to balance containers onto a newly added node with the same elastic IP?

I need help in distributing already-running containers onto a newly added Docker Swarm worker node.
I am running Docker in swarm mode on version 18.09.5. I am using AWS autoscaling to create 3 managers and 4 workers. For high availability, if one of the workers goes down, all the containers from that worker node are rebalanced onto the other workers. When autoscaling brings a new node up, I add that worker node to the current swarm setup using some automation. But Swarm is not balancing containers onto that new worker node. Even when I deploy the Docker stack again, Swarm still does not balance the containers. Is it because of a different node ID? How can I customize this? I am deploying the stack with a compose file:
docker stack deploy -c dockerstack.yml NAME
The only (current) way to force re-balancing is to force-update the services. See https://docs.docker.com/engine/swarm/admin_guide/#force-the-swarm-to-rebalance for more information.
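For example, a small shell loop over every service (a sketch only; --force restarts the tasks of each service, so expect a rolling disruption while it runs):

# Force-update each service so the scheduler spreads its tasks again
# across all currently available nodes, including the new worker.
for service in $(docker service ls -q); do
  docker service update --force "$service"
done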

Can Docker-Swarm run in fail-over-mode only?

I am facing a situation where I need to run about 100 different applications in Docker containers. It is not reasonable to run all 100 containers on a single machine, so I need to spread the applications over several machines.
As far as I understood, docker-swarm is for scaling only, which means that when I run my containers in a swarm, all 100 containers will automatically be deployed and started on every node of my docker-swarm. But this is not what I am looking for. I want to split the applications and, for example, run 50 on node1 and 50 on node2.
Question 1:
Is there a way to configure docker-swarm so that my applications are automatically dispatched to the node that has the most idle resources?
Question 2:
Is there a kind of fail-over mode in Docker Swarm which can stop a container on one node and try to start it on another in case something goes wrong?
all 100 containers will automatically be deployed and started on every node of my docker-swarm
This is not true. When you deploy 100 containers in a swarm, the containers will be distributed across the available nodes in the swarm. You will mostly get an even distribution of containers across all nodes.
Question 1: Is there a way to configure docker-swarm so that my applications are automatically dispatched to the node that has the most idle resources?
Docker Swarm does not look at the actual resource usage (memory, CPU, ...) of the nodes before deploying a container on them. The distribution of containers is balanced by the number of tasks per node, not by the load on each node. (If you declare resource reservations for your services, the scheduler will at least skip nodes that do not have enough unreserved capacity, but it still does not pick the most idle node.)
You can, however, build your own strategy for distributing containers across the nodes. With placement constraints you can restrict where a container may be deployed: for example, label the nodes that have a lot of resources and restrict the heavy containers to run only on those nodes, as sketched below.
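A sketch of that approach (the label name, node name, service name, and image are all made up for illustration):

# On a manager: label the machines that have plenty of resources.
docker node update --label-add machine_class=large node1

# Constrain a heavy service to those nodes, and declare a memory
# reservation so the scheduler accounts for it at placement time.
docker service create --name heavy-app \
  --constraint 'node.labels.machine_class==large' \
  --reserve-memory 2GB \
  my-heavy-image:latest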
Question 2: Is there a kind of fail-over mode in Docker Swarm which can stop a container on one node and try to start it on another in case something goes wrong?
If a container crashes, Docker Swarm will ensure that a new container is started. Again, which node the new container is deployed on cannot be predetermined.
