I have a set of tasks for a given service, t1, t2, ..., tk, running across nodes N1, N2, ..., Nw.
Due to lower usage, I no longer need all k tasks.
I need only l tasks (l < k).
In fact, I do not need all w nodes, so I want to start removing machines and pay less. Removing one machine at a time is fine.
Each service has its own state.
The services are started in replicated mode.
1) How can I remove a single node and force the docker swarm not to recreate the same number of tasks for the service?
Notes:
I can ensure that no work is rerouted to tasks running on a specific node, so removing the specific node is safe.
This is the easiest solution: I will end up with w - 1 nodes and l tasks, assuming the removed node was serving k - l tasks.
or
2) How can I remove specific containers (tasks) from docker swarm and keep the number of replicas of the service lower by the number of removed tasks?
Notes:
I assume that I already removed a node. The services from the node were redeployed to other nodes.
I monitor the containers (tasks) myself and know which ones serve no traffic -> no state needs to be maintained
or
3) Any other solution?
To use a concrete example, let's say you have 3 nodes and 9 tasks. You now want to go to 2 nodes and 6 tasks, without any unnecessary rescheduling (e.g. 2 nodes and 9 tasks, or 3 nodes and 6 tasks).
To scale down a service and 'drain' a node at the same time, you can do this:
docker service update --replicas 6 --constraint-add "node.hostname != node_to_be_removed_hostname" service_name
If your existing setup is balanced, this should only kill the tasks running on the host that is being removed.
After this, you can drain the node (docker node update --availability drain), remove it from the swarm, and remove the constraint that was just added.
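A minimal sketch of that follow-up, using the same placeholder names as in the command above:
docker node update --availability drain node_to_be_removed_hostname   # drain the node (it should have no tasks left by now)
docker swarm leave                                                     # run this on the node being removed
docker node rm node_to_be_removed_hostname                             # run on a manager once the node has left or is down
docker service update --constraint-rm "node.hostname != node_to_be_removed_hostname" service_name   # drop the temporary constraint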
To answer your questions:
Q1 -> You can simply drain the node in the cluster and verify that its tasks are started on other nodes. Once they are, you can safely remove the node from the swarm cluster.
docker node update --availability drain <node-name>
Q2 -> You must have specified a replica count when starting the service; you can simply scale it to a lower count.
docker service scale <service-name>=<replica-count>
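Put together, a concrete sketch of both answers (node and service names are placeholders):
docker node update --availability drain worker2   # Q1: stop scheduling on the node; its tasks are rescheduled elsewhere
docker node ps worker2                            # verify nothing is left running on it
docker node rm worker2                            # remove it from the swarm once it is down or has left
docker service scale my_service=6                 # Q2: then reduce the replica count to what you actually need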
Related
I have two Docker nodes running in a swarm, as shown below. I promoted the second node to work as a manager.
imb9cmobjk0fp7s6h5zoivfmo * Node1 Ready Active Leader 19.03.11-ol
a9gsb12wqw436zujakdpbqu5p Node2 Ready Active Reachable 19.03.11-ol
This works fine when the leader node goes to drain/pause. But as part of my test I stopped the Node1 instance; then I got the error below when trying to list the nodes (docker node ls) and the running services (docker service ls) on the second node.
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online
Also, none of the Docker processes that were running on Node1 before stopping the instance are coming up on Node2; only the existing processes are running. My expectation is that after stopping the Node1 instance, the processes that were running on Node1 would move to Node2. This works fine when a node goes to drain status.
The Raft consensus algorithm fails when it can't find a clear majority.
This means you should never run with 2 manager nodes: one node going down leaves the other with 50%, which is not a majority, so quorum cannot be reached.
In fact, avoid even numbers in general, especially when splitting managers between availability zones, as a zone split can leave you with a 50/50 partition: again no majority, no quorum, and a dead swarm.
So, valid numbers of swarm managers to try are generally 1, 3, 5, 7. Going higher than 7 generally reduces performance and doesn't help availability.
1 should only be used if you are using a 1 or 2 node swarm, and in these cases, loss of the manager node equates to loss of the swarm anyway.
3 managers is really the minimum you should aim for. If you only have 3 nodes, then prefer to use the managers as workers rather than run 1 manager and 2 workers.
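If the extra nodes are already in the swarm as workers, getting to three managers is just a promotion; a sketch, with placeholder node names:
docker node promote worker1 worker2   # run on the current manager; both nodes become additional managers
docker node ls                        # MANAGER STATUS should now show Leader plus two Reachable entries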
I have a swarm cluster containing 4 nodes : 1 Manager + 3 Workers
When restarting one worker's server, its status becomes "DOWN" when running:
docker node ls
Also, the services already deployed on this node shut down (containers exited) and cannot be restarted.
I have tried to:
recreate the cluster after each reboot (too ugly and doesn't resolve the problem)
delete the heavy file /var/lib/docker/swarm/worker/tasks.db (doesn't improve the situation)
simply wait (but it is still down after hours)
I'm using Docker 18.09 CE.
Suggestions ?
There are a few things you have to do.
Update the node availability (run the command from a manager node):
docker node update <node-name> --availability active
If the issue still persists, then try the following.
Add the worker to the swarm again using the previously generated join token (see the sketch below).
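A minimal sketch of that re-join; the manager address and token are placeholders:
docker swarm join-token worker                                    # on a manager: prints the join command with the current worker token
docker swarm leave                                                # on the affected worker
docker swarm join --token <worker-join-token> <manager-ip>:2377  # on the affected worker, using the token printed above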
If that still doesn't solve it, you might have to do the following: remove all nodes from the cluster and re-initialize it.
docker swarm init --force-new-cluster   # use with care
Recover docker swarm
Can I somehow configure how the manager node distributes services in Docker Swarm? I thought that it would look at the free resources of the worker nodes and schedule onto the "freest" node.
Currently I have the problem that services are placed on one node, which is full (90% RAM) and starts to get laggy, while at the same time the second node runs only a few services and could handle another one.
docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
wdkklpy6065zxckxyuj000ei4 * docker-master Ready Drain Leader 18.09.6
sk45rol2whdr5eh2jqozy0035 docker-node01 Ready Active Reachable 18.09.6
o4zwwbwwcrbwo4tsd00pxkfuc docker-node02 Ready Active 18.09.6
Now I have 36 (very similar) services: 28 run on docker-node01 and 8 on docker-node02. I thought the ideal state would be 18 services on each node.
Both docker nodes are identical.
How does Docker Swarm decide where to run a service? What algorithm does it use?
Is it possible to change/update the algorithm for selecting a node?
According to the swarmkit project README, the only available strategy is spread, so it schedules tasks on the least loaded nodes.
Note that the swarm won't move tasks around to maintain this strategy, so if you added node02 after node01 was already full, node02 will remain mostly empty. You could drain both nodes and then activate them again to see if the load distributes better.
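A sketch of that drain-and-reactivate cycle; note that it stops all tasks briefly, and when the nodes become active again the spread scheduler should place the pending tasks more evenly:
docker node update --availability drain docker-node01
docker node update --availability drain docker-node02
docker node update --availability active docker-node01
docker node update --availability active docker-node02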
You can find a more detailed description of the scheduling algorithm in the project documentation: scheduling-algorithm
For the older swarm manager this attribute was configurable:
https://docs.docker.com/swarm/reference/manage/#--strategy--scheduler-placement-strategy
Also I found https://docs.docker.com/swarm/scheduler/strategy/, it explains a lot about Docker swarm strategies.
We have a bare metal Docker Swarm cluster, with a lot of containers.
And recently we had a full stop of the physical server.
The main problem happened on Docker startup, where all containers tried to start at the same time.
I would like to know if there is a way to limit the number of containers starting at once?
Or if there is another way to avoid overloading the physical server.
At present, I'm not aware of an ability to limit how fast swarm mode will start containers. There is a todo entry to add an exponential backoff in the code, and various open issues in swarmkit, e.g. 1201, that may eventually help with this scenario. Ideally, you would have an HA cluster with nodes spread across different AZs, and when one node fails, the workload would migrate to another node and you would not end up with one overloaded node.
What you can use are resource constraints. You can configure each service with a minimum CPU and memory reservation. This would prevent swarm mode from scheduling more containers on a node than it could handle during a significant outage. The downside is that some services may go unscheduled during an outage and you cannot prioritize which are more important to schedule.
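A sketch of such a reservation, using a hypothetical service name and image:
docker service create --name worker --replicas 3 --reserve-cpu 0.5 --reserve-memory 256M myorg/worker:latest
With this, swarm mode will only place a task on a node that still has at least 0.5 CPU and 256 MB of memory unreserved, instead of piling everything onto one host.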
I run two servers using docker-compose and swarm
When stopping server A, the containers on server A are moved to server B.
But when server A starts up again, the containers that were moved to server B are not moved back to server A.
I want to know how to properly arrange the location of a dead server's containers when the server goes down for a moment.
First, for your Swarm to be able to re-create a task when a node goes down, you still need to have a majority of manager nodes available... so if it was only a two-node Swarm, this wouldn't work, because you'd need three managers for one to fail and another to take the leader role and re-schedule the failed replicas (just an FYI).
I think what you're asking for is "re-balancing". When a node comes back online (or a new one is added), Swarm does nothing with services that are set to the default replicated mode. Swarm doesn't "move" containers, it destroys and re-creates containers, so it considers the service still healthy on Node B and won't move it back to Node A. It wouldn't want to disrupt your active/healthy services on Node B just because Node A came back online.
If Node B does fail, then Swarm would again re-schedule the task on the next best node.
If Node B has a lot of containers, and work is unbalanced (i.e. Node A is empty and Node B has 3 tasks running), then you can force a service update, which will destroy and re-create all replicas of that service and will try to spread them out by default, which may result in one of the tasks ending up back on Node A.
docker service update --force <servicename>
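If you want to re-spread every service rather than just one, a small loop over the service list works; a sketch (each service's tasks get restarted):
for svc in $(docker service ls -q); do
  docker service update --force "$svc"
done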