What steps does docker swarm take when doing a rolling update with start-first? - docker

When docker swarm does a rolling update with stop-first on multiple running container instances, it takes (among others) the following steps, in order, for each container in turn:
Remove the container from its internal load balancer.
Send a SIGTERM signal to the container.
After the stop-grace-period has elapsed, send a SIGKILL signal.
Start a new container.
Add the new container to its internal load balancer.
But in which order are the steps taken when I want to do a rolling update with start-first?
Will the old and new containers be available through the load balancer at the same time (until the old one has stopped and been removed from the LB)?
Or will the new container be started first, but not added to the load balancer until the old container has been stopped and removed from the load balancer?
The latter would be necessary for processes that are bound to a specific instance of a service (container).

But in which order are the steps taken when I want to do a rolling update with start-first?
It's basically the reverse: the new container starts and is added to the LB, then the old one is removed from the LB and sent the shutdown signal.
Will the old and new containers be available through the load balancer at the same time (until the old one has stopped and been removed from the LB)?
Yes.
A reminder that most of this will not be seamless (or near zero downtime) unless you (at a minimum) have healthchecks enabled in the service. I talk about this a little in this YouTube video.
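A minimal sketch of what that can look like in a stack file (the image, healthcheck command and timings are placeholder values; the healthcheck assumes curl is available in the image):

services:
  web:
    image: example/web:latest               # placeholder image
    stop_grace_period: 30s                  # time between SIGTERM and SIGKILL on shutdown
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]   # hypothetical endpoint
      interval: 10s
      timeout: 3s
      retries: 3
    deploy:
      replicas: 2
      update_config:
        order: start-first                  # start the replacement task before stopping the old one
        parallelism: 1

With a healthcheck defined, swarm waits for the new task to report healthy before it removes the old one from the load balancer, which is what makes the handover close to seamless.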

Related

Docker swarm redirects requests to the "new" and "old" versions of the app simultaneously, during the service update

I'm using this example https://github.com/BretFisher/node-docker-good-defaults
I have a nodejs backend app and I have 1 replica of the service.
And the "deploy" section in the docker-stack.yml file looks like this:
deploy:
  update_config:
    order: start-first
When I build a new version of the app and then update the service, there is a period during the update when both the working "old" and "new" versions of the app respond.
Steps to reproduce:
build a new version of the app
update the service
send requests to the backend app every 10 ms.
Result: at some point in time I see responses from the old and the new version.
Is it expected behavior?
Yes, this is expected.
Load balancing between multiple replicas of a swarm mode service in docker is round robin. There's no priority between the replicas, not for containers running on the same node and not for a newer version; it's just round robin.
This is implemented per connection, so if you want to keep talking to the same container, you could keep a persistent connection open (e.g. http keep alive), but then you'll want to have an automatic retry in the app if the connection fails, to handle the container getting replaced.
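As a rough illustration of that per-connection behavior (the URL and endpoint are placeholders, and the app is assumed to return something identifying its container):

# one curl invocation given several URLs reuses a single keep-alive connection,
# so every request lands on the same replica until that container is replaced
curl -s http://localhost:8080/version http://localhost:8080/version http://localhost:8080/version

When the container behind that connection is replaced during an update, the next request on the connection fails, which is where the retry logic mentioned above comes in.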

How to delay Docker Swarm updating a stateful container until it's ready?

Problem domain
Imagine that a stateful container is being managed by Swarm, e.g. a database, and another container is relying on it, e.g. a service that is executing a long-running job (minutes, sometimes hours) and cannot tolerate the database (or even itself) going down while it's executing.
To give an example: a database importing a multi-GB dump.
There's also a CI/CD system in place which takes care of building new versions of the containers and deploying them to the Swarm, or pushing the image to Docker Hub which then calls a defined webhook which fires off the deployment event.
Question
Is there any way I can build my containers so that Swarm can know whether it's ok to update them or not? Similar to how HEALTHCHECK reports whether it needs to be restarted, something that would let Swarm know that 'it's safe to restart this container now'.
Or is it the CI/CD system's responsibility to check whether the stateful containers are safe to restart, and only then issue the update command to swarm?
Thanks in advance!
Docker will not check with a container if it is ready to be stopped, once you give docker the command to stop a container it will perform that action. However it performs the stop in two steps. The first step is a SIGTERM that your container can trap and gracefully handle. By default, after 10 seconds, a SIGKILL is sent that the Linux kernel immediately applies and cannot be trapped by the container. For your goals, you'll want to make sure your app knows when it's safe to exit after receiving the first signal, and you'll probably want to extend the time to much longer than 10 seconds between signals.
The healthcheck won't tell docker that your container is at a safe point to stop. It does tell swarm when your container has finished starting, or when it's misbehaving and needs to be stopped and replaced. The healthcheck defines a command to run inside your container, and the exit code is checked for whether it's 0 (healthy) or 1 (unhealthy). No other exit codes are currently valid.
If you need more than simple signal handling inside the container, then yes, you're likely moving up the stack to a CI/CD tool to manage the deployment.
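For the signal-handling side, a minimal sketch of the relevant service options in a stack file (the database image, grace period and healthcheck command are illustrative, not a recommendation):

services:
  db:
    image: postgres:15
    stop_grace_period: 2h                              # extend the SIGTERM-to-SIGKILL window well past the 10s default
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]    # exit 0 = healthy, 1 = unhealthy
      interval: 30s
      timeout: 5s
      retries: 3

The entrypoint would then trap SIGTERM and only exit once the import (or whatever long-running job it owns) has finished.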

How to reschedule containers with swarm when the server dies for a moment

I run two servers using docker-compose and swarm.
When I stop server A, the container on server A is moved to server B.
But when server A starts up again, the container that moved to server B is not moved back to server A.
I want to know how to properly restore the placement of a dead server's containers when the server only goes down for a moment.
First, for your Swarm to be able to re-create a task when a node goes down, you still need to have a majority of manager nodes available... so if it's only a two node Swarm, this wouldn't work, because you'd need three managers for one to fail and another to take the leader role and re-schedule the failed replicas. (just an FYI)
I think what you're asking for is "re-balancing". When a node comes back online (or a new one is added), Swarm does nothing with services that are set to the default replicated mode. Swarm doesn't "move" containers, it destroys and re-creates containers, so it considers the service still healthy on Node B and won't move it back to Node A. It wouldn't want to disrupt your active/healthy services on Node B just because Node A came back online.
If Node B does fail, then Swarm would again re-schedule the task on the next best node.
If Node B has a lot of containers and work is unbalanced (i.e. Node A is empty and Node B has 3 tasks running), then you can force a service update, which will destroy and re-create all replicas of that service and try to spread them out by default, which may result in one of the tasks ending up back on Node A.
docker service update --force <servicename>
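To see the effect, you can compare task placement before and after the forced update (the service name is a placeholder):

docker service ps web                 # shows which node each task is currently running on
docker service update --force web    # destroys and re-creates the tasks, spreading them by default
docker service ps web                # one of them may now be scheduled back on Node A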

Restart task in docker service after a certain time

I have a swarm with 3 nodes. On it, I want to launch one service for a database and then another, with some replicas, that runs a python application. The program takes approximately 30 minutes to finish. After that, the container shuts down and a new one starts. Sometimes, however, a problem occurs and the container does not stop. Is there any option I can use when I launch the service so that, after 1 hour, a container is automatically killed and a new one is created?
You can create an application using the Docker Remote API that automatically deploys that container to the swarm, waits for one hour, and then deletes it. This is not a feature to look for in docker itself; you have to implement it yourself using the Docker API.
You can find a complete list of docker API libraries here to help you get started.
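As a crude sketch of the idea using the docker CLI instead of the Remote API (the service name is a placeholder, and note this re-creates all of the service's tasks every hour whether they are stuck or not):

# run on a manager node, e.g. from cron or a long-lived shell loop
while true; do
  sleep 3600                              # wait one hour
  docker service update --force worker    # destroy and re-create the service's tasks
done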

Docker swarm load balancing - How to give common name to the service?

I read about the swarm routing mesh.
I created a simple service which uses a tomcat server and listens on 8080.
docker swarm init: I created a manager at node1.
docker swarm join: I used the token provided by the manager on node 2 and node 3 to create workers.
docker node ls shows 5 instances of my service: 3 running on node 1, 1 running on node 2, and another on node 3.
docker service create <image>: I created the service.
docker service scale <serviceid>=5: I scaled it to 5 replicas.
My application uses atomic number which is maintained at JVM level.
If I hit http://node1:8080/service 25 times, all requests go to node 1. How does it balance across nodes?
If I hit http://node2:8080/service, it goes to node 2.
Why is it not using round-robin?
Doubts:
Is anything wrong in the above steps?
Did I miss something?
I feel I am missing something, like a common service name (e.g. http://domain:8080/service) through which swarm would then balance in round-robin fashion.
I would like to understand only swarm mode. I am not interested in an external load balancer as of now.
How do I see swarm load balance in action?
Docker does round robin load balancing per connection to the port. As long as a connection is up, it will continue to go to the same instance.
HTTP allows a connection to be kept alive and reused. Browsers take advantage of this behavior to speed up later requests by leaving connections open. To test the round robin load balancing, you'd need to either disable that keep-alive setting or switch to a command-line tool like curl or wget.
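A quick way to watch the rotation, assuming the app returns something that identifies which container answered (the URL is the one from the question):

# each curl invocation opens a fresh connection, so requests should rotate across the replicas
for i in $(seq 1 10); do curl -s http://node1:8080/service; echo; done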
