Don't spin up dependent containers if dependency container healthcheck results in unhealthy

I have three services defined in docker-compose: A, B and C. B and C both depend on A (depends_on). If A's healthcheck results in unhealthy, I don't want the other two containers to be started. Currently, B and C are started only once A has started with a healthy status, which is the behaviour I expect. However, if A becomes unhealthy after it is created, I don't want the dependent containers to start. What happens instead is that when A is created and becomes unhealthy, only one of the other two containers exits (sometimes B, other times C, but never both). Here's the output when A is unhealthy:
Creating A ... done
Creating C ... done
ERROR: for B Container "1339a6d12091" is unhealthy.
ERROR: Encountered errors while bringing up the project.
As the ERROR message shows, only B reports that 1339a6d12091 (container A) is unhealthy, whereas this error should have been reported for both B and C.
docker-compose
version: '2.3'
services:
  B:
    image: base_image
    depends_on:
      A:
        condition: service_healthy
    entrypoint: bash -c "/app/entrypoint_scripts/execute_B.sh"
  C:
    image: base_image
    depends_on:
      A:
        condition: service_healthy
    entrypoint: bash -c "/app/entrypoint_scripts/execute_C.sh"
  A:
    image: base_image
    healthcheck:
      test: ["CMD-SHELL", "test -f /tmp/compliance/executedfetcher"]
      interval: 30s
      timeout: 3600s
      retries: 1
    entrypoint: bash -c "/app/entrypoint_scripts/execute_A.sh"
My expectation: B and C should wait for A to become healthy before starting (which is working fine for me). If A starts and becomes unhealthy without ever having been healthy, B and C should not start.

Container A was exiting immediately after becoming unhealthy. As a result, the status: unhealthy was not available long enough for the other containers to read it. It was visible only for a fraction of a second, and in that time only one of the containers (either B or C) was able to read it.
I added a sleep 100 statement to the execute_A.sh entrypoint script, at the point after the container becomes unhealthy, so that both B and C have time to read A's status. That fixed the issue.
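The workaround above could be sketched roughly as follows. This is not the original execute_A.sh: the helper name run_with_grace, its arguments, and the task command in the usage comment are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper for an entrypoint script such as execute_A.sh.
# It runs the real task and, if the task fails, keeps the container alive
# for a grace period so the "unhealthy" status stays visible long enough
# for every dependent service to read it.

# run_with_grace GRACE_SECONDS COMMAND [ARGS...]
run_with_grace() {
    grace_seconds=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "task failed with status $status; holding for ${grace_seconds}s" >&2
        sleep "$grace_seconds"
    fi
    return "$status"
}

# Usage inside the entrypoint (the task path is a placeholder):
#   run_with_grace 100 /path/to/real_task.sh
```

The grace period only has to outlast the moment at which Compose evaluates the depends_on condition for each dependent service; 100 seconds is simply the value used in the answer.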

Related

How do you perform a HEALTHCHECK in the Redis Docker image?

Recently we had an outage because Redis was unable to write to its file system (not sure why; it's Amazon EFS). Anyway, I noticed that there was no actual HEALTHCHECK set up for the Docker service to make sure it is running correctly. Redis was up, so I can't simply use nc -z to check whether the port is open.
Is there a command I can execute in the redis:6-alpine (or non-alpine) image that I can put in the healthcheck block of the docker-compose.yml file?
Note that I am looking for a command that is available inside the image, not an external healthcheck.

If I remember correctly that image includes redis-cli so, maybe, something along these lines:
...
healthcheck:
  test: ["CMD", "redis-cli", "ping"]

Although the ping operation from @nitrin0's answer generally works, it does not handle the case where a write operation would actually fail. So instead I perform a write that just increments the value of a key I don't plan to use.
image: redis:6
healthcheck:
  test: [ "CMD", "redis-cli", "--raw", "incr", "ping" ]

I've just noticed that there is a phase in which redis is still starting up and loading data. In this phase, redis-cli ping shows the error
LOADING Redis is loading the dataset in memory
but still returns exit code 0, which would make Redis report as healthy already.
Also, redis-cli --raw incr ping returns 0 in this phase without actually incrementing the key.
As a workaround, I check whether redis-cli ping actually prints PONG, which it only does once LOADING has finished.
services:
  redis:
    healthcheck:
      test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
      interval: 1s
      timeout: 3s
      retries: 5
This works because grep returns 0 only when the string ("PONG") is found.
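The same idea can be written as a small healthcheck script that validates the reply text instead of trusting the exit code. This is a sketch only: the function names is_pong and redis_healthy are made up for illustration, and it assumes redis-cli is on the PATH inside the image.

```shell
#!/bin/sh
# is_pong REPLY: succeed only when the reply is exactly "PONG".
# An error reply such as "LOADING Redis is loading the dataset in memory"
# does not match, so it is treated as unhealthy.
is_pong() {
    [ "$1" = "PONG" ]
}

# redis_healthy: ask the local server and check what it actually answered.
redis_healthy() {
    reply=$(redis-cli ping 2>/dev/null) || return 1
    is_pong "$reply"
}

# Usage as a healthcheck script: make this the last command, so that the
# script's exit status is the health result.
#   redis_healthy
```

Compared with grep PONG, the exact comparison also rejects replies that merely contain the word, though for a plain ping both behave the same.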

You can also add it inside the Dockerfile if you're using a Redis image that contains redis-cli:
Linux Docker
HEALTHCHECK CMD redis-cli ping || exit 1
Windows Docker
HEALTHCHECK CMD pwsh.exe -command \
    try { \
        $response = ./redis-cli ping; \
        if ($response -eq 'PONG') { exit 0 } else { exit 1 }; \
    } catch { exit 1 }

docker compose to delay container build and start

I have a couple of containers that run in sequence.
I am using depends_on to make sure the next one only starts after the current one is running.
I realized that one of the containers has a cron job that needs to finish first,
so that the next container has the proper data to import.
In this case, I cannot just rely on the depends_on parameter.
How do I delay the start of the next container, say by 5 minutes?
Sample docker compose:
test1:
  networks:
    - test
  image: test1
  ports:
    - "8115:8115"
  container_name: test1
test2:
  networks:
    - test
  image: test2
  depends_on:
    - test1
  ports:
    - "8160:8160"

You can use an entrypoint script, something like this (netcat needs to be installed):
until nc -w 1 -z test1 8115; do
    >&2 echo "Service is unavailable - sleeping"
    sleep 1
done
sleep 2
>&2 echo "Service is up - executing command"
Then run it via the command instruction of the service (in the docker-compose file) or via the CMD directive in the Dockerfile.
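A variant of the loop above that gives up after a fixed number of attempts, rather than waiting forever, might look like this. It is a sketch under assumptions: wait_for and its arguments are hypothetical names, and the probe command is passed in so that the same helper works with nc or any other check.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Retries COMMAND until it succeeds. Returns 0 on the first success and
# 1 if COMMAND is still failing after MAX_ATTEMPTS tries.
wait_for() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed - sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage with the probe from the answer (300 tries, 1 second apart):
#   wait_for 300 1 nc -w 1 -z test1 8115 || exit 1
```

Note that an open port only shows that test1 is listening. If the real requirement is that a cron job has finished, the probe should test for something that job produces instead.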

I added this in the Dockerfile (since it was just for a quick test):
CMD sleep 60 && node server.js
A 60-second sleep did the trick, since the Node.js part had been starting before a database dump init script could finish.

Docker healthcheck causes the container to crash

I have a customized rabbitmq image that I am using with docker-compose (3.7) to launch a docker cluster. This is necessary because of some peculiar issues when trying to deploy a cluster in docker swarm. The image has a shell script which runs on the primary and secondary nodes and makes the modifications needed to run a cluster. This involves stopping rabbitmq and running rabbitmqctl commands to create the cluster between the two nodes. This configuration works flawlessly until I try to add a healthcheck. I have tried adding it to the image and adding it to the compose file. Both cause the container to crash and constantly restart. I have the following shell script, which gets copied into the image:
#!/bin/bash
set -eo pipefail
# A RabbitMQ node is considered healthy if all the below are true:
# * the rabbit app finished booting & it's running
# * there are no alarms
# * there is at least 1 active listener
rabbitmqctl eval '
{ true, rabbit_app_booted_and_running } = { rabbit:is_booted(node()), rabbit_app_booted_and_running },
{ [], no_alarms } = { rabbit:alarms(), no_alarms },
[] /= rabbit_networking:active_listeners(),
rabbitmq_node_is_healthy.
' || exit 1
On an already running image this works and produces the correct result.
I tried the following in the compose file:
healthcheck:
  interval: 60s
  timeout: 60s
  retries: 10
  start_period: 600s
  test: ["CMD", "docker-healthcheck"]
It seems that the start_period is completely ignored. I can see the health status with an error right away. I have also tried the following native rabbitmq diagnostics command:
rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms
This oddly fails with an "unable to find rabbitmq-diagnostics" error, despite the fact the program is definitely in the path. I can execute the command successfully in an already running container.
If I create the container without the healthcheck and then add it in after the fact from the command line with:
docker service update --health-cmd docker-healthcheck --health-interval 60s --health-timeout 60s --health-retries 10 [container id]
it marks the container healthy. So it works, just not in a start-up configuration. It seems to me that the healthcheck should not begin until 10 minutes have passed. No matter how long I tell it to wait for everything to start up using the start_period parameter, it still causes the container to fail.
Is this a bug or is there something mysterious about the way start_period works?
Has anyone else ever had this problem?

Docker Swarm: traffic in assigned state

When I scale a service up from 1 node (Node A) to 2 nodes (Node A and Node B), I see traffic immediately being routed to both nodes (including the new Node B even though it isn't ready).
As a result, an Nginx proxy will return 502s half the time (until Node B is ready).
Any suggestions how you can delay this traffic?
Note: this isn't waiting for another container to come up as mentioned here: Docker Compose wait for container X before starting Y
This is about delaying the network connection until the container is ready.

If you do not configure a healthcheck section, docker will assume that the container is available as soon as it is started.
Note that the initial healthcheck is only done after the set interval.
So you could add something extremely basic like testing if port 80 is connectable (you need nc in your docker image):
healthcheck:
  test: nc -w 1 127.0.0.1 80 < /dev/null
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 5s
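If nc is not available in the image but bash is, a similar port probe can be sketched with bash's built-in /dev/tcp redirection. The function name port_open is an assumption for illustration, and this does not work under plain sh.

```shell
#!/bin/bash
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
# Uses bash's /dev/tcp pseudo-device; the connection is opened in a subshell
# on file descriptor 3 and closed again when the subshell exits.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Usage as a healthcheck command:
#   port_open 127.0.0.1 80 || exit 1
```

As with the nc version, this only proves that something accepts connections on the port, not that the application behind it is ready to serve requests.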

Docker healthcheck for nginx container

I have a project using the official nginx docker container from Docker Hub, launching via Docker Compose. I have healthchecks configured in Docker Compose for each of my containers, and recently the healthcheck for this nginx container has been behaving strangely; on launching with docker-compose up -d, all my containers launch, and begin running healthchecks, but the nginx container looks like it never runs the healthcheck. I can manually run the script just fine if I docker exec into the container, and the healthcheck runs normally if I restart the container.
Example output from docker ps:
CONTAINER ID   IMAGE             COMMAND                  CREATED         STATUS                            PORTS                                          NAMES
458a55ae8971   my_custom_image   "/tini -- /usr/local…"   7 minutes ago   Up 7 minutes (healthy)                                                           project_worker_1
5024781b1a73   redis:3.2         "docker-entrypoint.s…"   7 minutes ago   Up 7 minutes (healthy)            127.0.0.1:6379->6379/tcp                       project_redis_1
bd405dde8ce7   postgres:9.6      "docker-entrypoint.s…"   7 minutes ago   Up 7 minutes (healthy)            127.0.0.1:15432->5432/tcp                      project_postgres_1
93e15c18d879   nginx:mainline    "nginx -g 'daemon of…"   7 minutes ago   Up 7 minutes (health: starting)   127.0.0.1:80->80/tcp, 127.0.0.1:443->443/tcp   nginx
Example (partial, for brevity) output from docker inspect nginx:
"State": {
    "Status": "running",
    "Running": true,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 11568,
    "ExitCode": 0,
    "Error": "",
    "StartedAt": "2018-02-13T21:04:22.904241169Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Health": {
        "Status": "unhealthy",
        "FailingStreak": 0,
        "Log": []
    }
},
The portion of the docker-compose.yml defining the nginx container:
nginx:
  image: nginx:mainline
  # using container_name means there will only ever be one nginx container!
  container_name: nginx
  restart: always
  networks:
    - proxynet
  volumes:
    - /etc/nginx/conf.d
    - /etc/nginx/vhost.d
    - /usr/share/nginx/html
    - tlsdata:/etc/nginx/certs:ro
    - attachdata:/usr/share/nginx/html/uploads:ro
    - staticdata:/usr/share/nginx/html/static:ro
    - ./nginx/healthcheck.sh:/bin/healthcheck.sh
  healthcheck:
    test: ['CMD', '/bin/healthcheck.sh']
    interval: 1m
    timeout: 5s
    retries: 3
  ports:
    # Make the http/https ports available on the Docker host IPv4 loopback interface
    - '127.0.0.1:80:80'
    - '127.0.0.1:443:443'
The healthcheck.sh I am loading in as a volume:
#!/bin/bash
service nginx status || exit 1
It looks like the problem is just an issue with systemd never returning from the status check when the container initially launches, and at the same time the configured healthcheck timeout does not trigger. Everything else works, and nginx is up and responding, but it would be nice for the healthcheck to function properly without needing to manually restart each time I start up.
Is there something missing in my configuration, or a better check I can run?

I think that there is no need for a custom script in this case.
Just try changing your healthcheck test to:
test: ["CMD", "service", "nginx", "status"]
That works fine for me.
Try to use " instead of ' as well, just in case :)
EDIT
If you really want to force an exit 1, in case of failure, you could use:
test: service nginx status || exit 1

For the official Alpine nginx image you can also do:
healthcheck:
  test: ["CMD-SHELL", "wget -O /dev/null http://localhost || exit 1"]
  timeout: 10s
wget is part of the standard image. This downloads your index.html/php/whatever to nowhere (/dev/null); if the download does not succeed, the check fails or times out.

I attempted the same script and encountered the same issue. I changed the healthcheck.sh to instead run like this:
#!/bin/bash
if service nginx status; then
    exit 0
else
    exit 1
fi
Running this in the docker container resulted in successful health checks.

Over a year later, I have found a solution. First, an additional clarification on the environment, what I believe is happening, and speculation on a possible bug with the Docker Engine.
The Compose file I am using now is launching a lightly modified version of the 'official' Alpine NGINX image, which uses COPY to load in the healthcheck script and adds HEALTHCHECK explicitly in the image. This image is used for an nginx service, and is used in concert with an image running jwilder/docker-gen to use container metadata from Docker to generate NGINX configuration files. This container is running as a service named nginx-gen. When containers change, configuration is re-generated, and if there are any changes, a SIGHUP is sent to the nginx service.
What I discovered is the following:
If all services are launched together, the nginx service never runs healthchecks;
If the nginx service is restarted soon after launch, healthchecks complete normally;
If the nginx service is launched by itself, healthchecks complete normally;
If all services other than nginx-gen are launched together, healthchecks complete normally;
If all services are launched together, but nginx-gen is modified to sleep 60 before doing anything, healthchecks complete normally;
So, it appears that there is some obscure interaction with signal processing, Docker, and NGINX. If a SIGHUP is sent to an NGINX process in a container before the first healthcheck runs in that container, no healthchecks ever run.
The final iteration I came up with modifies the nginx-gen container to poll the health of the nginx container. It looks up the health status of a container with a defined label in a loop, with a short sleep. Once the nginx container reports healthy, nginx-gen proceeds to generate configuration files. I also changed the notification method to docker exec a script to explicitly test and reload configuration in the nginx container, rather than rely on SIGHUP.
End result: I can docker-compose up -d, and everything eventually reports healthy without further intervention. Success!
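The polling step described above might be sketched like this. It is not the author's script: the function names are hypothetical, and for simplicity it looks the container up by name, whereas the answer selects it by a label. The docker inspect format string reads the same Health.Status field shown in the inspect output earlier.

```shell
#!/bin/sh
# container_health NAME: print the health status Docker reports for a
# container ("starting", "healthy" or "unhealthy").
container_health() {
    docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null
}

# wait_until_healthy NAME MAX_ATTEMPTS DELAY_SECONDS
# Polls the health status with a short sleep between attempts. Returns 0 as
# soon as the container reports "healthy", 1 if it never does.
wait_until_healthy() {
    name=$1
    max_attempts=$2
    delay=$3
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if [ "$(container_health "$name")" = "healthy" ]; then
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage before generating configuration (values are placeholders):
#   wait_until_healthy nginx 60 2 || exit 1
```

Running this from inside another container requires access to the Docker API, for example through a mounted Docker socket, which a docker-gen container normally already has.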
