Docker swarm mode load balancing - docker

I've set up a Docker swarm mode cluster with two managers and one worker, on CentOS 7. The machines are dkr1, dkr2, and dkr3; dkr3 is the worker.
I was upgrading to v1.13 the other day and wanted zero downtime, but it didn't work exactly as expected. I'm trying to work out the correct way to do it, since zero downtime is one of the main goals of having a cluster.
The service is deployed in 'global' mode, that is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, and start the daemon again. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
Our client/ESB hits dkr2, which does its load balancing magic over the swarm. dkr2 is the leader; dkr1 is 'reachable'.
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up.
But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously up, but the server inside it isn't responding yet. So that's when I get downtime.

You are correct: you need to specify a healthcheck to run against your app inside the container to verify that it is ready. Your container will not receive traffic until this healthcheck has passed.
A simple curl to an endpoint should suffice. Use the HEALTHCHECK instruction in your Dockerfile to specify the check to perform.
An example of a HEALTHCHECK line in a Dockerfile that checks whether an endpoint returns 200 OK would be:
HEALTHCHECK CMD curl -f 'http://localhost:8443/somepath' || exit 1
If you can't modify your Dockerfile, then you can also specify your healthcheck manually at deployment time using the compose file healthcheck format.
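As a minimal sketch of that (the service name, image, endpoint, and timings below are placeholders to adapt to your app), a version 3 compose file could carry the same check:
version: "3.1"
services:
  web:
    image: myapp:latest
    healthcheck:
      # Mark the task healthy only once the app answers over HTTP
      test: ["CMD", "curl", "-f", "http://localhost:8443/somepath"]
      interval: 10s
      timeout: 5s
      retries: 3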
If that's not possible either and you need to update a running service, you can run a service update with a combination of the health flags to specify your healthcheck.
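A rough sketch of that (the service name and endpoint here are placeholders) might be:
docker service update \
  --health-cmd "curl -f http://localhost:8443/somepath || exit 1" \
  --health-interval 10s \
  --health-retries 3 \
  my_service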

Related

Rsyslog can't start inside of a docker container

I've got a Docker container running a service, and I need that service to send logs to rsyslog. It's an Ubuntu image running a set of services in the container. However, the rsyslog service cannot start inside this container, and I cannot determine why.
Running service rsyslog start (this image uses upstart, not systemd) returns only the output start: Job failed to start. There is no further information provided, even when I use --verbose.
Furthermore, there are no error logs from this failed startup process. Because rsyslog is the service that can't start, it's obviously not running, so nothing is getting logged. I'm not finding anything relevant in Upstart's logs either: /var/log/upstart/ only contains the logs of a few things that started successfully, plus dmesg.log, which simply contains dmesg: klogctl failed: Operation not permitted. From what I can tell that is a Docker limitation that can't really be fixed, and it's unclear whether it's even related to the issue.
Here's the interesting bit: I have the exact same container running on a different host, and it's not suffering from this issue. Rsyslog is able to start and run in the container just fine on that host. So obviously the cause is some difference between the hosts, but I don't know where to begin with that: there are LOTS of differences between them (the working one is my local Windows system, the failing one is a virtual machine running in a cloud environment), so I wouldn't even know which differences could plausibly cause this issue and which couldn't.
I've exhausted everything that I know to check. My only option left is to come to Stack Overflow and ask for ideas.
Two questions here, really:
Is there any way to get more information out of the failure to start? start itself is a binary file, not a script, so I can't open it up and edit it. I'm reliant solely on the output of that command, and it's not logging anything anywhere useful.
What could possibly be different between these two hosts that could cause this issue? Are there any smoking guns or obvious candidates to check?
Regarding the container itself, unfortunately it's a container provided by a third party that I'm simply modifying. I can't really change anything fundamental about it, such as the fact that its entrypoint is /sbin/init (which is a very bad practice for Docker containers, and is the root cause of all of my troubles). This is also causing some issues with the Docker logging driver, which is why I'm stuck using syslog as the logging solution instead.
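One avenue worth trying for question 1 (assuming the rsyslogd binary and its default config path are present in this image) is to bypass upstart and run the daemon directly in the foreground with debugging enabled, which usually surfaces the underlying error:
# Check the configuration without starting the daemon
rsyslogd -N1
# Run in the foreground (-n) with debug output (-d) to see why startup fails
rsyslogd -n -d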

Docker (Compose? Swarm?) How to run a health check before exposing container

I have a web app (.NET Core) running in a Docker container. If I update it under load, it can't handle requests until there is a gap in traffic. This might be a bug in my app or in .NET; I am looking for a workaround for now. If I hit the app with a single HTTP request before exposing it to traffic, though, it works as expected.
I would like to get this behaviour (a rough shell sketch of it follows the list):
On the running server, pull the latest release of the image.
Launch the new container, detached from the network.
Run a health check on it; if the health check fails, stop.
Remove the old container.
Attach the new container and start processing traffic.
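In plain shell terms (the image name, ports, container names, and /health endpoint below are placeholders, not the actual setup), the sequence would be roughly:
# 1. Get the latest release of the image
docker pull myapp:latest
# 2. Launch the new container on a side port the proxy does not route to yet
docker run -d --name myapp_new -p 5001:80 myapp:latest
# 3. Health-check it; give up and clean up if it never becomes healthy
for i in $(seq 1 30); do curl -fs http://localhost:5001/health && healthy=1 && break; sleep 2; done
[ "$healthy" = "1" ] || { docker rm -f myapp_new; exit 1; }
# 4. Remove the old container
docker stop myapp_old && docker rm myapp_old
# 5. Repoint the proxy (or republish the main port) at myapp_new so it starts taking traffic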
I am using Compose at the moment, and have somewhat limited knowledge of the Docker infrastructure. The problem should be well understood, yet I've failed to find anything on the topic on Google.
It kind of sounds like Kubernetes at this stage, but I would like to keep things as simple as possible.
The thing I was looking for is Blue/Green deployment, and it is quite easy to search for once you have the name.
E.g.
https://github.com/Sinkler/docker-nginx-blue-green
https://coderbook.com/#marcus/how-to-do-zero-downtime-deployments-of-docker-containers/
Swarm has a feature which could be useful as well: https://docs.docker.com/engine/reference/commandline/service_update/
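For example (the service name, image, and endpoint are placeholders, and --update-order needs a reasonably recent Docker), combining a healthcheck with a start-first update order makes Swarm bring the new task up and wait for it to be healthy before stopping the old one:
docker service update \
  --update-order start-first \
  --health-cmd "curl -f http://localhost/health || exit 1" \
  --health-interval 5s \
  --image myapp:latest \
  my_service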

How to delay Docker Swarm updating a stateful container until it's ready?

Problem domain
Imagine that a stateful container is being managed by Swarm, e.g. a database, and another container relies on it, e.g. a service that is executing a long-running job (minutes, sometimes hours) and does not tolerate the database (or even itself) going down while it's executing.
To give an example: a database importing a multi-GB dump.
There's also a CI/CD system in place which takes care of building new versions of the containers and deploying them to the Swarm, or pushing the image to Docker Hub which then calls a defined webhook which fires off the deployment event.
Question
Is there any way I can build my containers so that Swarm knows whether it's OK to update them or not? Similar to how HEALTHCHECK reports whether a container needs to be restarted, something that would let Swarm know 'it's safe to restart this container now'.
Or is it the CI/CD system's responsibility to check whether the stateful containers are safe to restart, and only then issue the update command to Swarm?
Thanks in advance!
Docker will not check with a container whether it is ready to be stopped; once you give Docker the command to stop a container, it will perform that action. However, it performs the stop in two steps. The first step is a SIGTERM that your container can trap and handle gracefully. By default, after 10 seconds, a SIGKILL is sent, which the Linux kernel applies immediately and which cannot be trapped by the container. For your goals, you'll want to make sure your app knows when it's safe to exit after receiving the first signal, and you'll probably want to extend the time between the two signals to much longer than 10 seconds.
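In a compose/stack file, that window is controlled by stop_grace_period (the service and the two-hour value below are just illustrative):
version: "3.3"
services:
  db:
    image: postgres:9.6
    # Time Docker waits after SIGTERM before escalating to SIGKILL
    stop_grace_period: 2h
The same window can also be set per service with --stop-grace-period on docker service create/update, or per invocation with docker stop -t <seconds>.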
The healthcheck won't tell Docker that your container is at a safe point to stop. It does tell Swarm when your container has finished starting, or when it's misbehaving and needs to be stopped and replaced. The healthcheck defines a command to run inside your container, and its exit code is checked: 0 means healthy, 1 means unhealthy. No other exit codes are currently valid.
If you need more than simple signal handling inside the container, then yes, you're likely moving up the stack to a CI/CD tool to manage the deployment.

How can I get "docker-compose scale" to use the latest image for any additional instances created?

In my project, I have a number of micro-services that rely upon each other. I am using Docker Compose to bring everything up in the right order.
During development, when I write some new code for a container, the container will need to be restarted, so that the new code can be tried. Thus far I've simply been using a restart of the whole thing, thus:
docker-compose down && docker-compose up -d
That works fine, but bringing everything down and up again takes ~20 seconds, which will be too long for a live environment. I am therefore looking into various strategies to ensure that micro-services may be rebooted individually with no interruption at all.
My first approach, which nearly works, is to scale the service I want to reboot up from one instance to two. I then programmatically reset the reverse proxy (Traefik) to point to the new instance, and when that is happy, I docker stop the old one.
My scale command is the old variety, since I am using Compose 1.8.0. It looks like this:
docker-compose scale missive-storage-backend=2
The only problem is that if there is a new image, Docker Compose does not use it for the new instance: it stubbornly uses the same image (by hash) as the already running instance. I've checked docker-compose scale --help and there is nothing in there about forcing the use of a new image.
Now I could use an ordinary docker run, but then I'd have to replicate all the options I've set up for this service in my docker-compose.yml, and I don't know if something run outside of the Compose file would be understood as being part of that Compose application (e.g. would it be stopped with a docker-compose down despite having been started manually?).
It's also possible that later versions of Docker Compose have more options for scale (it has since been merged into up anyway).
What is the simplest way to get this feature?
(Aside: I appreciate there are a myriad of orchestration tools to do gentle reboots and other wizardry, and I will surely explore that bottomless pit when I have the time available. For now, I feel that writing a few scripts to do some deployment tasks is the quicker win.)
I've fixed this. First I tried upgrading to Compose 1.9, but that didn't seem to offer the changes I needed. I then bumped to 1.13, where scale is deprecated as a separate command and appears instead as a switch to the up command.
As a test, I have an image called missive-storage, and I add a dummy change to the Dockerfile, so docker ps reports the image of the already-running container as d4ebdee0f3e2 instead (since missive-storage:latest now points to a different image).
The ps line looks like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
45b8023f6ef1 d4ebdee0f3e2 "/usr/local/bin/du..." 4 minutes ago Up 4 minutes app_missive-storage-backend_1
I then issue this command (missive-storage-backend is the Docker Compose service name for the image missive-storage):
docker-compose up -d --no-recreate --scale missive-storage-backend=2
which results in these containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0bd6577f281a missive-storage "/usr/local/bin/du..." 2 seconds ago Up 2 seconds app_missive-storage-backend_2
45b8023f6ef1 d4ebdee0f3e2 "/usr/local/bin/du..." 4 minutes ago Up 4 minutes app_missive-storage-backend_1
As you can see, this gives me two running containers, one based on the old image, and one based on the new image. From here I can just redirect traffic by sending a configuration change to the front-end proxy, then stop the old container.
Note that --no-recreate is important - without it, Docker Compose seems liable to reboot everything, defeating the object of the exercise.
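Putting the pieces together, the full replacement sequence (the Traefik step is whatever your own proxy reconfiguration looks like) is roughly:
# Rebuild (or pull) so missive-storage:latest points at the new code
docker-compose build missive-storage-backend
# Start a second instance from the new image without touching the running one
docker-compose up -d --no-recreate --scale missive-storage-backend=2
# ...repoint Traefik at app_missive-storage-backend_2...
# Retire the container still running the old image
docker stop app_missive-storage-backend_1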

How to keep a certain number of Docker containers running the same application and add/remove them as needed?

I've been working with Docker containers. What I've done is launch 5 containers running the same application; I use HAProxy to redirect requests to them, I added a volume to preserve data, and I set the restart policy to Always.
It works (so far this is my load-balancing approach), but sometimes I need another container to join the pool when there are more requests, or maybe at first I don't need 5 containers at all.
This is provided by the Swarm Mode addition in Docker 1.12. It includes orchestration that lets you not only scale your service up or down, but also recover from an outage by automatically rescheduling the tasks to run on other nodes.
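As a sketch (the service name, image, and replica counts are placeholders), the equivalent of the five HAProxy-fronted containers in swarm mode is a replicated service that can be resized on demand:
# Create the service with 5 replicas behind swarm's built-in routing mesh
docker service create --name web --replicas 5 -p 80:80 myapp:latest
# Later, grow or shrink the pool as load changes
docker service scale web=8
docker service scale web=3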
If you don't want to use Docker 1.12 (yet!), you can also use a service-discovery tool like Consul, register your containers in it, and use a tool like Consul Template to regenerate your load balancer configuration accordingly.
I gave a talk about this 6 months ago. You can find the code and the configuration I used during my demo here: https://github.com/bargenson/dockerdemo
