I have a Swarm cluster containing 4 nodes: 1 manager + 3 workers.
When I restart one worker's server, its status becomes "Down" when running:
docker node ls
The services already deployed on this node also shut down (containers exited), and I cannot restart them.
I have tried to:
recreate the cluster after each reboot (too ugly and doesn't resolve the problem)
delete the heavy file /var/lib/docker/swarm/worker/tasks.db (doesn't improve the situation)
simply wait (but it's still down after hours)
I'm using Docker 18.09 CE.
Suggestions?
There are a few things you can try.
Update the node availability (run this command from a manager node):
docker node update <node-name> --availability active
If the issue still persists, try the following:
Remove the worker and add it to the swarm again using the join token previously generated (see the sketch below).
If that still doesn't solve it, you may have to remove all nodes from the cluster and force a new one:
docker swarm init --force-new-cluster (use with care)
See also the Docker docs on recovering a swarm.
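If you end up rejoining the worker, here is a minimal sketch of that sequence; the manager address 10.0.0.1:2377 and the token are placeholders:
# on the manager: print the current worker join token
docker swarm join-token worker
# on the affected worker: drop the stale swarm state, then rejoin
docker swarm leave --force
docker swarm join --token <worker-token> 10.0.0.1:2377
# back on the manager: the node should now show Ready / Active
docker node ls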
I'd like to upgrade the Docker engine on my Docker Swarm managed nodes (both manager and worker nodes) from 18.06 to 19.03, without causing any downtime.
I see there are many tutorials online for rolling update of a Dockerized application without downtime, but nothing related to upgrading the Docker engine on all Docker Swarm managed nodes.
Is it really not possible to upgrade the Docker daemon on Docker Swarm managed nodes without a downtime? If true, that would indeed be a pity.
Thanks in advance to the wonderful community at SO!
You can upgrade managers in place, one at a time. During this upgrade process, you drain the node with docker node update, run the upgrade of the Docker engine with the normal OS commands, and then return the node to active. What will not work is adding or removing nodes from the cluster while the managers have mixed versions, which means you cannot completely replace nodes with a from-scratch install at the same time as you upgrade the versions. All managers need to be on the same (upgraded) version before you look at rebuilding or replacing the hosts. What I've seen in the past, when replacing nodes during a mixed-version upgrade, is that the new nodes do not fully join the manager quorum, and after losing enough of the old managers you eventually lose quorum.
Once all managers are upgraded, you can upgrade the workers, either with in-place upgrades or by replacing the nodes. Until the workers have all been upgraded, do not use any new features.
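A minimal sketch of that per-manager cycle, assuming a node named manager2 and an apt-based host (both names are assumptions, not from the question):
docker node update --availability drain manager2    # stop scheduling tasks on it
# on manager2 itself: upgrade the engine with the normal OS commands
sudo apt-get update && sudo apt-get install --only-upgrade docker-ce
docker node update --availability active manager2   # return it to service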
You can drain your node, then upgrade your Docker version, and then make it Active again.
Repeat these steps for all the nodes.
Drain availability prevents a node from receiving new tasks from the swarm manager. The manager also stops tasks running on the node and launches replica tasks on a node with Active availability.
For detailed information you can refer to the drain-node tutorial: https://docs.docker.com/engine/swarm/swarm-tutorial/drain-node/
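To confirm the drain actually moved the replicas off the node before you upgrade, something like this works (the node name worker1 is a placeholder):
docker node update --availability drain worker1
docker node inspect --pretty worker1    # Availability should now read Drain
docker node ps worker1                  # its tasks should show Shutdown, with replicas rescheduled elsewhere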
Let's say we have swarm1 (1 manager and 2 workers). I am going to back up this swarm on a daily basis, so if there is a problem some day, I can restore the whole swarm to a new one (swarm2 = 1 manager and 2 workers too).
I followed what is described here, but it seems that while restoring, the new manager gets the same token as the old manager. As a result, the 2 workers get disconnected and I end up with a new swarm2 with 1 manager and 0 workers.
Any ideas / solution?
I don't recommend restoring workers. Assuming you've only lost your single manager, just run docker swarm leave on the workers, then join them again. On the manager you can always clean up the old worker entries later (it does not affect uptime) with docker node rm.
Note that if you lose the manager quorum, this doesn't mean the apps you're running go down, so you'll want to keep your workers up and serving your apps to your users until you fix your manager.
If your last manager fails or you lose quorum, then focus on restoring the raft DB so the swarm manager has quorum again. Then rejoin workers, or create new workers in parallel and only shut down old workers when the new ones are running your app. Here's a great talk by Laura Frank that goes into it at DockerCon.
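A minimal sketch of that worker-recovery path, assuming the restored manager is reachable at 10.0.0.1:2377 (the address and token are placeholders):
# on each disconnected worker: leave the old swarm and rejoin the restored one
docker swarm leave
docker swarm join --token <worker-token> 10.0.0.1:2377
# on the restored manager: remove the stale entries left behind by the old workers
docker node rm <old-node-id>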
I run two servers using docker-compose and Swarm.
When I stop server A, the container on server A is moved to server B.
But when server A starts back up, the container that moved to server B is not moved back to server A.
I want to know how to properly rebalance the containers when a server goes down for a moment and then comes back.
First, for your Swarm to be able to re-create a task when a node goes down, you need to have a majority of manager nodes still available... so if it was only a two-node Swarm, this wouldn't work, because you'd need three managers for one to fail and another to take the leader role and re-schedule the failed replicas. (Just an FYI.)
I think what you're asking for is "re-balancing". When a node comes back online (or a new one is added), Swarm does nothing with services that are set to the default replicated mode. Swarm doesn't "move" containers, it destroys and re-creates containers, so it considers the service still healthy on Node B and won't move it back to Node A. It wouldn't want to disrupt your active/healthy services on Node B just because Node A came back online.
If Node B does fail, then Swarm would again re-schedule the task on the next best node.
If Node B has a lot of containers and work is unbalanced (i.e. Node A is empty and Node B has 3 tasks running), then you can force a service update, which will destroy and re-create all replicas of that service and will try to spread them out by default, which may result in one of the tasks ending up back on Node A.
docker service update --force <servicename>
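To see where the replicas landed after the forced update, something like this helps (the service name web is a placeholder):
docker service ps web    # lists each replica and the node it is currently running on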
We have a swarm running docker 1.13 to which I need to add 3 more nodes running docker 17.04.
Is this possible or will it cause problems?
Will it be possible to update the old nodes without bringing the entire swarm down?
Thanks
I ran into this one myself yesterday and the advice from the Docker developers is that you can mix versions of docker on the swarm managers temporarily, but you cannot promote or demote nodes that don't match the version on all the other swarm managers. They also recommended upgrading all managers before upgrading workers.
According to that advice, you should upgrade the old nodes first, one at a time, to avoid bringing down the cluster. If containers are deployed to those managers, you'll want to configure the node to drain with docker node update --availability drain $node_name first. After the upgrade, you can bring it back into service with docker node update --availability active $node_name.
When trying to promote a newer node into an older swarm, what I saw was some very disruptive behavior that wasn't obvious until looking at the debugging logs. The comments on this issue go into more details on Docker's advice and problems I saw.
I've set up a docker swarm mode cluster, with two managers and one worker. This is on Centos 7. They're on machines dkr1, dkr2, dkr3. dkr3 is the worker.
I was upgrading to v1.13 the other day and wanted zero downtime. But it didn't work exactly as expected. I'm trying to work out the correct way to do it, since this is one of the main goals of having a cluster.
The service is deployed in 'global' mode, that is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, start the daemon. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
Our client/ESB hits dkr2, which does its load-balancing magic over the swarm. dkr2 is the leader; dkr1 is 'Reachable'.
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up.
But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously ok, but the server isn't responding yet. So that's when I get downtime.
You are correct: you need to specify a healthcheck to run against your app inside the container in order to make sure it is ready. Your container will not receive traffic until this healthcheck has passed.
A simple curl to an endpoint should suffice. Use the HEALTHCHECK instruction in your Dockerfile to specify a healthcheck to perform.
An example of the healthcheck line in a Dockerfile to check if an endpoint returned 200 OK would be:
HEALTHCHECK CMD curl -f 'http://localhost:8443/somepath' || exit 1
If you can't modify your Dockerfile, then you can also specify your healthcheck manually at deployment time using the compose file healthcheck format.
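A minimal compose-file sketch of the same check, assuming the service is called web and listens on port 8443 (both are placeholders):
services:
  web:
    image: myapp:latest                 # placeholder image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8443/somepath"]
      interval: 30s
      timeout: 5s
      retries: 3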
If that's not possible either and you need to update a running service, you can do a service update and use a combination of the health flags to specify your healthcheck.
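For that last case, the flags look roughly like this (the service name web is a placeholder):
docker service update \
  --health-cmd "curl -f http://localhost:8443/somepath || exit 1" \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  web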