I have a customized rabbitmq image that I am using with docker-compose (3.7) to launch a docker cluster. This is necessary because of some peculiar issues when trying to deploy a cluster in docker swarm. The image has a shell script which runs on the primary and secondary nodes and makes the modifications needed to run a cluster. This involves stopping rabbitmq and running rabbitmqctl commands to create the cluster between the two nodes. This configuiration works flawlessly until I try to add in a healthcheck. I have tried adding it in to the image and adding it into the compose file. Both cause the image to crash and constantly restart. I have the following shell script which gets copied into the image:
#!/bin/bash
set -eo pipefail
# A RabbitMQ node is considered healthy if all the below are true:
# * the rabbit app finished booting & it's running
# * there are no alarms
# * there is at least 1 active listener
rabbitmqctl eval '
{ true, rabbit_app_booted_and_running } = { rabbit:is_booted(node()), rabbit_app_booted_and_running },
{ [], no_alarms } = { rabbit:alarms(), no_alarms },
[] /= rabbit_networking:active_listeners(),
rabbitmq_node_is_healthy.
' || exit 1
On an already running image this works and produces the correct result.
I tried the flowing in the compose file:
healthcheck:
interval: 60s
timeout: 60s
retries: 10
start_period: 600s
test: ["CMD", "docker-healthcheck"]
It seems that the start_period is completely ignored. I can see the health status with an error right away. I have also tried the following native rabbitmq diagnostics command:
rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms
This oddly fails with an "unable to find rabbitmq-diagnostics" error, despite the fact the program is definitely in the path. I can execute the command successfully in an already running container.
If I create the container without the healthcheck and then add it in after the fact from the command line with:
docker service update --health-cmd docker-healthcheck --health-interval 60s --health-timeout 60s --health-retries 10 [container id]
it marks the container healthy. So it works but just not in a start up configuration. It seems like to me that the healthcheck should not begin until 10 minutes have passed. It doesn't seem to matter how long I wait for everything to startup using the start_period parameter it still causes the container to fail.
Is this a bug or is there something mysterious about the way start_period works?
Anyone else every have this problem?
Related
Recently, we had an outage due to Redis being unable to write to a file system (not sure why it's Amazon EFS) anyway I noted that there was no actual HEALTHCHECK set up for the Docker service to make sure it is running correctly, Redis is up so I can't simply use nc -z to check if the port is open.
Is there a command I can execute in the redis:6-alpine (or non-alpine) image that I can put in the healthcheck block of the docker-compose.yml file.
Note I am looking for command that is available internally in the image. Not an external healthcheck.
If I remember correctly that image includes redis-cli so, maybe, something along these lines:
...
healthcheck:
test: ["CMD", "redis-cli","ping"]
Although the ping operation from #nitrin0 answer generally works. It does not handle the case where the write operation will actually fail. So instead I perform a change that will just increment a value to a key I don't plan to use.
image: redis:6
healthcheck:
test: [ "CMD", "redis-cli", "--raw", "incr", "ping" ]
I've just noticed that there is a phase in which redis is still starting up and loading data. In this phase, redis-cli ping shows the error
LOADING Redis is loading the dataset in memory
but stills returns the exit code 0, which would make redis already report has healthy.
Also redis-cli --raw incr ping returns 0 in this phase without actually incrementing this key successfully.
As a workaround, I'm checking whether the redis-cli ping actually prints a PONG, which it only does after the LOADING has been finished.
services:
redis:
healthcheck:
test: ["CMD-SHELL", "redis-cli ping | grep PONG"]
interval: 1s
timeout: 3s
retries: 5
This works because grep returns only 0 when the string ("PONG") is found.
You can also add it inside the Dockerfile if your using a Redis image that contains the redis-cli:
Linux Docker
HEALTHCHECK CMD redis-cli ping || exit 1
Windows Docker
HEALTHCHECK CMD pwsh.exe -command \
try { \
$response = ./redis-cli ping; \
if ($response -eq 'PONG') { return 0} else {return 1}; \
} catch { return 1 }
Background: My Docker container has a very long startup time, and it is hard to predict when it is done. And when the health check kicks in, it first may show 'unhealthy' since the startup is sometimes not finished. This may cause a restart or container removal from our automation tools.
My specific question is if I can control my Docker container so that it shows 'starting' until the setup is ready and that the health check can somehow be started immediately after that? Or is there any other recommendation on how to handle states in a good way using health checks?
Side question: I would love to get a reference to how transitions are made and determined during container startup and health check initiating. I have tried googling how to determine Docker (container) states but I can't find any good reference.
My specific question is if I can control my container so that it shows
'starting' until the setup is ready and that the health check can
somehow be started immediately after that?
I don't think that it is possible with just K8s or Docker.
Containers are not designed to communicate with Docker Daemon or Kubernetes to tell that its internal setup is done.
If the application takes a time to setup you could play with readiness and liveness probe options of Kubernetes.
You may indeed configure readynessProbe to perform the initial check after a specific delay.
For example to specify 120 seconds as initial delay :
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 120
Same thing for livenessProbe:
livenessProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: Custom-Header
value: Awesome
initialDelaySeconds: 120
periodSeconds: 3
For Docker "alone" while not so much configurable you could make it to work with the --health-start-period parameter of the docker run sub command :
--health-start-period : Start period for the container to initialize
before starting health-retries countdown
For example you could specify an important value such as :
docker run --health-start-period=120s ...
Here is my work around. First in docker-compose set long timeout, start_period+timeout should be grater than max expected starting time, eg.
healthcheck:
test: ["CMD", "python3", "appstatus.py", '500']
interval: 60s
timeout: 900s
retries: 2
start_period: 30s
and then run script which can wait (if needed) before return results. In example above it is appstatus.py. In the script is something like:
timeout = int(sys.argv[1])
t0 = time.time()
while True:
time.sleep(2)
if isReady():
sys.exit(os.EX_OK)
t = time.time() - t0
if t > timeout:
sys.exit(os.EX_SOFTWARE)
I have a project using the official nginx docker container from Docker Hub, launching via Docker Compose. I have healthchecks configured in Docker Compose for each of my containers, and recently the healthcheck for this nginx container has been behaving strangely; on launching with docker-compose up -d, all my containers launch, and begin running healthchecks, but the nginx container looks like it never runs the healthcheck. I can manually run the script just fine if I docker exec into the container, and the healthcheck runs normally if I restart the container.
Example output from docker ps:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
458a55ae8971 my_custom_image "/tini -- /usr/local…" 7 minutes ago Up 7 minutes (healthy) project_worker_1
5024781b1a73 redis:3.2 "docker-entrypoint.s…" 7 minutes ago Up 7 minutes (healthy) 127.0.0.1:6379->6379/tcp project_redis_1
bd405dde8ce7 postgres:9.6 "docker-entrypoint.s…" 7 minutes ago Up 7 minutes (healthy) 127.0.0.1:15432->5432/tcp project_postgres_1
93e15c18d879 nginx:mainline "nginx -g 'daemon of…" 7 minutes ago Up 7 minutes (health: starting) 127.0.0.1:80->80/tcp, 127.0.0.1:443->443/tcp nginx
Example (partial, for brevity) output from docker inspect nginx:
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 11568,
"ExitCode": 0,
"Error": "",
"StartedAt": "2018-02-13T21:04:22.904241169Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 0,
"Log": []
}
},
The portion of the docker-compose.yml defining the nginx container:
nginx:
image: nginx:mainline
# using container_name means there will only ever be one nginx container!
container_name: nginx
restart: always
networks:
- proxynet
volumes:
- /etc/nginx/conf.d
- /etc/nginx/vhost.d
- /usr/share/nginx/html
- tlsdata:/etc/nginx/certs:ro
- attachdata:/usr/share/nginx/html/uploads:ro
- staticdata:/usr/share/nginx/html/static:ro
- ./nginx/healthcheck.sh:/bin/healthcheck.sh
healthcheck:
test: ['CMD', '/bin/healthcheck.sh']
interval: 1m
timeout: 5s
retries: 3
ports:
# Make the http/https ports available on the Docker host IPv4 loopback interface
- '127.0.0.1:80:80'
- '127.0.0.1:443:443'
The healthcheck.sh I am loading in as a volume:
#!/bin/bash
service nginx status || exit 1
It looks like the problem is just an issue with systemd never returning from the status check when the container initially launches, and at the same time the configured healthcheck timeout does not trigger. Everything else works, and nginx is up and responding, but it would be nice for the healthcheck to function properly without needing to manually restart each time I start up.
Is there something missing in my configuration, or a better check I can run?
I think that there is no need for a custom script in this case.
Try just change your healthcheck test to
test: ["CMD", "service", "nginx", "status"]
That works fine for me.
Try to use " instead of ' as well, just in case :)
EDIT
If you really want to force an exit 1, in case of failure, you could use:
test: service nginx status || exit 1
for the official alpine nginx image you can also do:
healthcheck:
test: ["CMD-SHELL", "wget -O /dev/null http://localhost || exit 1"]
timeout: 10s
wget is part of the standard image. What this does is download your index.html/php/whatever to nowhere (/dev/null), and it should timeout and fail otherwise.
I attempted the same script and encountered the same issue. I changed the healthcheck.sh to instead run like this:
#!/bin/bash
if service nginx status; then
exit 0
else
exit 1
fi
Running this in the docker container resulted in successful health checks.
Over a year later, I have found a solution. First, an additional clarification on the environment, what I believe is happening, and speculation on a possible bug with the Docker Engine.
The Compose file I am using now is launching a lightly modified version of the 'official' Alpine NGINX image, which uses COPY to load in the healthcheck script and adds HEALTHCHECK explicitly in the image. This image is used for an nginx service, and is used in concert with an image running jwilder/docker-gen to use container metadata from Docker to generate NGINX configuration files. This container is running as a service named nginx-gen. When containers change, configuration is re-generated, and if there are any changes, a SIGHUP is sent to the nginx service.
What I discovered is the following:
If all services are launched together, the nginx service never runs healthchecks;
If the nginx service is restarted soon after launch, healthchecks complete normally;
If the nginx service is launched by itself, healthchecks complete normally;
If all services other than nginx-gen are launched together, healthchecks complete normally;
If all services are launched together, but nginx-gen is modified to sleep 60 before doing anything, healthchecks complete normally;
So, it appears that there is some obscure interaction with signal processing, Docker, and NGINX. If a SIGHUP is sent to an NGINX process in a container before the first healthcheck runs in that container, no healthchecks ever run.
The final iteration I came up with modifies the nginx-gen container to poll the health of the nginx container. It looks up the health status of a container with a defined label in a loop, with a short sleep. Once the nginx container reports healthy, nginx-gen proceeds to generate configuration files. I also changed the notification method to docker exec a script to explicitly test and reload configuration in the nginx container, rather than rely on SIGHUP.
End result: I can docker-compose up -d, and everything eventually reports healthy without further intervention. Success!
I have an issue using Docker swarm.
I have 3 replicas of a Python web service running on Gunicorn.
The issue is that when I restart the swarm service after a software update, an old running service is killed, then a new one is created and started. But in the short period of time when the old service is already killed, and the new one didn't fully start yet, network messages are already routed to the new instance that isn't ready yet, resulting in 502 bad gateway errors (I proxy to the service from nginx).
I use --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces chances of getting the 502 error (because there are always at least 2 services running, even if one of them might be still starting up).
So, following what I've proposed in comments:
Use the HEALTHCHECK feature of Dockerfile: Docs. Something like:
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
Knowing that Docker Swarm does honor this healthcheck during service updates, it's relative easy to have a zero downtime deployment.
But as you mentioned, you have a high-resource consumer health-check, and you need larger healthcheck-intervals.
In that case, I recomend you to customize your healthcheck doing the first run immediately and the successive checks at current_minute % 5 == 0, but the healthcheck itself running /30s:
HEALTHCHECK --interval=30s --timeout=3s \
CMD /service_healthcheck.sh
healthcheck.sh
#!/bin/bash
CURRENT_MINUTE=$(date +%M)
INTERVAL_MINUTE=5
[ $((a%2)) -eq 0 ]
do_healthcheck() {
curl -f http://localhost/ || exit 1
}
if [ ! -f /tmp/healthcheck.first.run ]; then
do_healhcheck
touch /tmp/healthcheck.first.run
exit 0
fi
# Run only each minute that is multiple of $INTERVAL_MINUTE
[ $(($CURRENT_MINUTE%$INTERVAL_MINUTE)) -eq 0 ] && do_healhcheck
exit 0
Remember to COPY the healthcheck.sh to /healthcheck.sh (and chmod +x)
There are some known issues (e.g. moby/moby #30321) with rolling upgrades in docker swarm with the current 17.05 and earlier releases (and doesn't look like all the fixes will make 17.06). These issues will result in connection errors during a rolling upgrade like you're seeing.
If you have a true zero downtime deployment requirement and can't solve this with a client side retry, then I'd recommend putting in some kind of blue/green switch in front of your swarm and do the rolling upgrade to the non-active set of containers until docker finds solutions to all of the scenarios.
I run many services in my docker-compse.yml. I'd like to display a message to let user know when docker-compose up is done?
I tried to echo message with command but my container exited with code 0
command: bash -c "echo Congratulation! You can use your containers now"
Is there anyway to let user know when docker-compose up is done?
Many thanks!
Anything you do in CMD will only be visible in the logs, and only per service, so you may as well just watch the logs which I assume is what you're trying to avoid. If you want it to work from the local command line you're going to need to wrap the docker-compose up command.
THE following is NOT fully tested. This is for guidance only. I use the getContainerHealth and waitContainer scripts myself, so can vouch for them. I'm fairly sure I found them on this site, but can't remember where.
Assuming you're waiting for your services to be in a usable state, you could use something similar to this compose file:
docker-compose.yml
services:
serviceOne:
// ... the usual stuff
container_name: serviceOne
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost"]
interval: 10s
timeout: 10s
retries: 3
start_period: 1s
serviceTwo:
// ... the usual stuff
container_name: serviceTwo
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost"]
interval: 10s
start_period: 1s
The commands in the test are the commands you use to define the service is up and running. More info here: https://docs.docker.com/compose/compose-file/#healthcheck
You will probably want different health checks (or at least intervals, timeouts, start_periods, etc) for production and development.
then some bash script that you use instead of docker-compose up:
getContainerHealth () {
docker inspect --format "{{json .State.Health.Status }}" $1
}
waitContainer () {
while STATUS=$(getContainerHealth $1); [ $STATUS != "\"healthy\"" ]; do
if [ $STATUS = "\"unhealthy\"" ]; then
echo "Failed!"
exit -1
fi
printf .
lf=$'\n'
sleep 1
done
printf "$lf"
}
waitContainers () {
waitContainer serviceOne
waitContainer serviceTwo
// ... for all services
}
docker-compose up
echo "In progress. Please wait"
waitContainers
echo "Done. Ready to roll."
create a shell file echo.sh that contains
echo "Your message"
In Docker file add
CMD ["/dockerRepo/echo.sh"]
Have you tried running something like the following directly in the Dockerfile itself ?
RUN echo -e "Your message"