What does the "(healthy)" string in the STATUS column stand for?
user#user:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
X X X X Up 20 hours X X
X X X X Up 21 hours (healthy) X X
That's the result of the HEALTHCHECK instruction. That instruction runs a command inside the container at a regular interval (every 30 seconds by default). If the command succeeds, the container is marked healthy. If it fails too many times in a row, it's marked unhealthy.
You can set the interval, timeout, number of retries and start delay.
The following, for example, will check that your container responds to HTTP every 5 minutes with a timeout of 3 seconds.
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
You get a health_status event when the health status changes. You can follow those and others with docker events.
https://ryaneschinger.com/blog/using-docker-native-health-checks/
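To read the health state from a script rather than by eye, the STATUS text can be parsed. This is a minimal sketch: the function name health_from_status is made up for illustration, and the container name in the trailing comments is a placeholder.

```shell
#!/usr/bin/env bash
# Extract the health state from a `docker ps` STATUS string.
# Prints "healthy", "unhealthy", "starting", or "none" when no health state is shown.
health_from_status() {
    case "$1" in
        *"(healthy)"*)          echo healthy ;;
        *"(unhealthy)"*)        echo unhealthy ;;
        *"(health: starting)"*) echo starting ;;
        *)                      echo none ;;
    esac
}

# Against a running daemon you can also ask Docker directly
# (my-container is a placeholder name):
#   docker inspect --format '{{.State.Health.Status}}' my-container
#   docker events --filter event=health_status
```

For the two rows shown in the question, `health_from_status 'Up 20 hours'` prints none and `health_from_status 'Up 21 hours (healthy)'` prints healthy.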
Normally it's something you launch with, to enable swarm or other services to check on the health of the container.
For example:
$ docker run --rm -it \
--name=elasticsearch \
--health-cmd="curl --silent --fail localhost:9200/_cluster/health || exit 1" \
--health-interval=5s \
--health-retries=12 \
--health-timeout=2s \
elasticsearch
See the health checks enabled at runtime?
It means the image uses the HEALTHCHECK instruction:
https://docs.docker.com/engine/reference/builder/#healthcheck
When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially starting. Whenever a health check passes, it becomes healthy (whatever state it was previously in). After a certain number of consecutive failures, it becomes unhealthy.
**starting** – Initial status when the container is still starting
**healthy** – If the command succeeds then the container is healthy
**unhealthy** – If a single run of the command takes longer than the specified timeout, that run counts as a failure. If a health check fails, the command is retried up to the configured number of retries, and the container is declared unhealthy if the command still fails.
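The transitions between these states can be modelled in a few lines. This is a simplified sketch, not Docker's implementation: it ignores the start period and timeouts, and the function name health_state is invented for illustration. Each argument after the retry count is the exit code of one health check run.

```shell
#!/usr/bin/env bash
# Simplified model of the health state transitions.
# Usage: health_state RETRIES EXIT_CODE...
health_state() {
    local retries=$1; shift
    local state=starting failures=0 code
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            state=healthy          # any passing check resets the failure streak
            failures=0
        else
            failures=$((failures + 1))
            if [ "$failures" -ge "$retries" ]; then
                state=unhealthy    # enough consecutive failures
            fi
        fi
    done
    echo "$state"
}

health_state 3 1 1      # prints: starting
health_state 3 1 1 1    # prints: unhealthy
health_state 3 1 1 0    # prints: healthy
```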
I am new to Scylla and I am following the instructions to try it in a container as per this page: https://hub.docker.com/r/scylladb/scylla/.
The following command ran fine.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
I see the container is running.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6c4e19ff1bd scylladb/scylla "/docker-entrypoint.…" 14 seconds ago Up 13 seconds 22/tcp, 7000-7001/tcp, 9042/tcp, 9160/tcp, 9180/tcp, 10000/tcp some-scylla
However, I'm unable to use nodetool or cqlsh. I get the following output.
$ docker exec -it some-scylla nodetool status
Using /etc/scylla/scylla.yaml as the config file
nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)
See 'nodetool help' or 'nodetool help <command>'.
and
$ docker exec -it some-scylla cqlsh
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
Any ideas?
Update
Looking at docker logs some-scylla I see some errors in the logs, the last one is as follows.
2021-10-03 07:51:04,771 INFO spawned: 'scylla' with pid 167
Scylla version 4.4.4-0.20210801.69daa9fd0 with build-id eb11cddd30e88ef39c32c847e70181b5cf786355 starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --overprovisioned --listen-address 172.17.0.2 --rpc-address 172.17.0.2 --seed-provider-parameters seeds=172.17.0.2 --blocked-reactor-notify-ms 999999999"
parsed command line options: [log-to-syslog: 0, log-to-stdout: 1, default-log-level: info, network-stack: posix, developer-mode: 1, overprovisioned, listen-address: 172.17.0.2, rpc-address: 172.17.0.2, seed-provider-parameters: seeds=172.17.0.2, blocked-reactor-notify-ms: 999999999]
ERROR 2021-10-03 07:51:05,203 [shard 6] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
2021-10-03 07:51:05,316 INFO exited: scylla (exit status 1; not expected)
2021-10-03 07:51:06,318 INFO gave up: scylla entered FATAL state, too many start retries too quickly
Update 2
The reason for the error is described on the Docker Hub page linked above. I had to start the container with the number of CPUs limited via --smp 1, as follows.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --smp 1
According to the above page:
This command will start a Scylla single-node cluster in developer mode
(see --developer-mode 1) limited by a single CPU core (see --smp).
Production grade configuration requires tuning a few kernel parameters
such that limiting number of available cores (with --smp 1) is the
simplest way to go.
Multiple cores requires setting a proper value to the
/proc/sys/fs/aio-max-nr. On many non production systems it will be
equal to 65K. ...
As you have found out, in order to be able to use additional CPU cores you'll need to increase fs.aio-max-nr kernel parameter.
You may run as root:
# sysctl -w fs.aio-max-nr=65535
This should be enough for most systems. If an error still prevents Scylla from using all of your CPU cores, increase the value further.
Do notice that the above configuration is not persistent. Edit /etc/sysctl.conf in order to make it persistent across reboots.
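Making the setting persistent can be scripted. The sketch below assumes the usual "key = value" line format of /etc/sysctl.conf; the helper name persist_sysctl is made up for illustration, and it takes the target file as an optional argument so it can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Append "KEY = VALUE" to a sysctl config file unless that exact line is already present.
# When a key appears more than once, the later entry is the one that takes effect on reload.
# Usage: persist_sysctl KEY VALUE [FILE]   (FILE defaults to /etc/sysctl.conf)
persist_sysctl() {
    local key=$1 value=$2 file=${3:-/etc/sysctl.conf}
    local line="$key = $value"
    grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# As root on the Docker host (not inside the container):
#   persist_sysctl fs.aio-max-nr 65535
#   sysctl -p              # reload /etc/sysctl.conf
#   sysctl fs.aio-max-nr   # confirm the running value
```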
I have a health check defined for my ECS Fargate Service, it works when I test locally and works with Fargate v 1.3.0.
But when I change to Fargate platform version 1.4.0 it always turns unhealthy, even though the actual service is working: I can access the service on the container's public IP.
The health check is defined as:
"CMD-SHELL", "curl --fail http://localhost || exit 1"
So we looked into this and there's an issue in platform version 1.4 where, if the health check outputs anything to stderr, a false negative occurs. We will, obviously, fix this, but in the meantime you can work around it by (in this case) running curl in silent mode or simply redirecting stderr output to /dev/null:
curl -s --fail http://localhost || exit 1
or
curl --fail http://localhost 2>/dev/null || exit 1
Should unblock you for now.
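To confirm that a health check command really is silent on stderr before deploying it, you can capture only its stderr. A small sketch; the helper name stderr_of is invented for illustration.

```shell
#!/usr/bin/env bash
# Print only what a command writes to stderr; its stdout is discarded.
# Empty output means the command is safe with respect to the stderr issue described above.
stderr_of() {
    "$@" 2>&1 >/dev/null
}

# Example, run inside the container (requires curl and a listening service):
#   stderr_of sh -c 'curl --fail http://localhost 2>/dev/null || exit 1'
```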
I wanted to collate some answers together and build on them, as follows.
I'm not being funny, but first and foremost make sure you have a healthcheck endpoint running somewhere. Note that this doesn't have to be inside your container! Let me show you what I mean:
curl -s --fail -I https://127.0.0.1:8000/ || exit 1
will only pass if you have an HTTP server running on localhost port 8000 (etc.). This can be anything that returns a 200 - over to you.
Tips:
Make sure curl is installed inside the container
-s is for silent
--fail makes curl exit with a non-zero code when the server returns an HTTP error (status 400 or above)
-I header only
If localhost doesn't work try 127.0.0.1
Now, in my case I was not running an HTTP server but rather a long-running Python script. In its error state the script exits with 1 (which terminates the task), but otherwise (after a long time) it exits with 0. To pass the healthcheck, the healthcheck call must also return 0 (otherwise it returns 1 and the task is again terminated*). [*Exit codes > 1 can be converted to 1 - see the stolen trick below.]
So I had to fake a different endpoint with the same behaviour.
Step forward, Google.
curl -s --fail -I https://www.google.com || exit 1
As before, but now hit an external endpoint kindly provided. Note the || exit 1, which converts any non-zero exit code to the 1 expected by the healthcheck.
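The effect of || exit 1 is easy to see in isolation. In this sketch the helper name health_exit_code is made up, and sh -c 'exit 22' stands in for a failing curl --fail call (22 is the code curl returns for HTTP errors with --fail).

```shell
#!/usr/bin/env bash
# Run a command the way "COMMAND || exit 1" would in a health check
# and print the exit code the health check would report.
health_exit_code() {
    ( "$@" >/dev/null 2>&1 || exit 1 )
    echo $?
}

health_exit_code sh -c 'exit 22'   # prints: 1
health_exit_code true              # prints: 0
```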
Sorry to "state the bleeding obvious", but you really do need a function running here - don't run curl on a local endpoint and expect to get a healthy status!
Remember to expose the https / http ports 443 / 80 in your docker file and in the JSON task definition spec/through the console UI.
TIP! Note that the CMD-SHELL syntax is slightly different depending.
Putting it all together, for ECS Fargate the rest is correct.
You could also try an echo rather than a curl. I am unclear whether a point-to-point call is even required.
I have an issue using Docker swarm.
I have 3 replicas of a Python web service running on Gunicorn.
The issue is that when I restart the swarm service after a software update, an old running service is killed, then a new one is created and started. But in the short period of time when the old service is already killed, and the new one didn't fully start yet, network messages are already routed to the new instance that isn't ready yet, resulting in 502 bad gateway errors (I proxy to the service from nginx).
I use --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces chances of getting the 502 error (because there are always at least 2 services running, even if one of them might be still starting up).
So, following what I've proposed in comments:
Use the HEALTHCHECK feature of Dockerfile: Docs. Something like:
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
Knowing that Docker Swarm does honor this healthcheck during service updates, it's relatively easy to have a zero downtime deployment.
But as you mentioned, you have a high-resource consumer health-check, and you need larger healthcheck-intervals.
In that case, I recommend customizing your healthcheck so that the first run happens immediately and the successive checks happen when current_minute % 5 == 0, while the healthcheck script itself is invoked every 30s:
HEALTHCHECK --interval=30s --timeout=3s \
CMD /healthcheck.sh
healthcheck.sh
#!/bin/bash
# Minute of the current hour, zero-padded (00-59)
CURRENT_MINUTE=$(date +%M)
INTERVAL_MINUTE=5
do_healthcheck() {
    curl -f http://localhost/ || exit 1
}
# Always run the real check on the first invocation
if [ ! -f /tmp/healthcheck.first.run ]; then
    do_healthcheck
    touch /tmp/healthcheck.first.run
    exit 0
fi
# Run only in minutes that are a multiple of $INTERVAL_MINUTE
# (10# forces base 10 so that "08" and "09" are not rejected as octal)
[ $((10#$CURRENT_MINUTE % INTERVAL_MINUTE)) -eq 0 ] && do_healthcheck
exit 0
Remember to COPY the healthcheck.sh to /healthcheck.sh (and chmod +x)
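The minute-gating test is the part most likely to go wrong, because date +%M prints zero-padded values such as 08 that bash arithmetic would otherwise treat as invalid octal. A sketch that isolates the test in a function (the name is_check_minute is invented for illustration):

```shell
#!/usr/bin/env bash
# Succeeds when MINUTE (as printed by `date +%M`, e.g. "08") is a multiple of INTERVAL.
# Usage: is_check_minute MINUTE INTERVAL
is_check_minute() {
    local minute=$1 interval=$2
    # 10# forces base 10; without it bash rejects "08" and "09" as octal numbers.
    [ $((10#$minute % interval)) -eq 0 ]
}

if is_check_minute "$(date +%M)" 5; then
    echo "the expensive check would run in this minute"
fi
```

Note that with a 30 second interval the script is invoked twice in each matching minute, so the expensive check runs twice per 5 minute window unless you add a further guard.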
There are some known issues (e.g. moby/moby #30321) with rolling upgrades in docker swarm with the current 17.05 and earlier releases (and it doesn't look like all the fixes will make it into 17.06). These issues will result in connection errors during a rolling upgrade like the ones you're seeing.
If you have a true zero downtime deployment requirement and can't solve this with a client side retry, then I'd recommend putting in some kind of blue/green switch in front of your swarm and do the rolling upgrade to the non-active set of containers until docker finds solutions to all of the scenarios.
In my CI chain I execute end-to-end tests after a "docker-compose up". Unfortunately my tests often fail because even if the containers are properly started, the programs contained in my containers are not.
Is there an elegant way to verify that my setup is completely started before running my tests?
You could poll the required services to confirm they are responding before running the tests.
curl has inbuilt retry logic or it's fairly trivial to build retry logic around some other type of service test.
#!/bin/bash
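await(){
    local url=${1}
    local seconds=${2:-30}
    # --retry-connrefused (curl 7.52.0+) also retries while the port is still closed
    curl --max-time 5 --retry 60 --retry-delay 1 --retry-connrefused \
        --retry-max-time "${seconds}" "${url}" \
        || exit 1
}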
docker-compose up -d
await http://container_ms1:3000
await http://container_ms2:3000
run-ze-tests
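For services that do not speak HTTP, the same idea can be written as a generic retry wrapper around any test command. A minimal sketch; the function name retry is arbitrary, and the service name and port in the trailing comment are hypothetical.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local attempts=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "gave up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (hypothetical service name and port):
#   retry 30 1 nc -z container_db 5432 || exit 1
```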
The alternate to polling is an event based system.
Have all your services push notifications to an external service: scaeda gave the example of a log file, or you could use something like Amazon SNS. Your services emit a "started" event. Then you can subscribe to those events and run whatever you need once everything has started.
Docker 1.12 did add the HEALTHCHECK build command. Maybe this is available via Docker Events?
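If the images define a HEALTHCHECK, one option is to poll the health state that Docker records for each container. This is a sketch under that assumption; the function name wait_for_healthy and the one second polling period are choices made for illustration.

```shell
#!/usr/bin/env bash
# Wait until a container reports "healthy", polling once per second.
# Usage: wait_for_healthy CONTAINER [TIMEOUT_SECONDS]
wait_for_healthy() {
    local container=$1 timeout=${2:-60} status elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
        case "$status" in
            healthy)   return 0 ;;
            unhealthy) echo "$container is unhealthy" >&2; return 1 ;;
        esac
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timed out waiting for $container" >&2
    return 1
}

# Example (container name taken from the script above):
#   docker-compose up -d
#   wait_for_healthy container_ms1 120 || exit 1
```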
If you have control over the docker engine in your CI setup you could execute docker logs [Container_Name] and read out the last line which could be emitted by your application.
RESULT=$(docker logs [Container_Name] 2>&1 | grep [Search_String])
logs output example:
Agent pid 13
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
parse specific line:
RESULT=$(docker logs ssh_jenkins_test 2>&1 | grep Enter)
result:
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
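In a CI job the log usually has to be checked repeatedly until the line shows up. A sketch of that loop, built on the same docker logs and grep pipeline; the function name wait_for_log and the default timeout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Wait until a container's log output contains a fixed string.
# Usage: wait_for_log CONTAINER TEXT [TIMEOUT_SECONDS]
wait_for_log() {
    local container=$1 text=$2 timeout=${3:-30} elapsed=0
    until docker logs "$container" 2>&1 | grep -qF -- "$text"; do
        elapsed=$((elapsed + 1))
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
    done
}

# Example with the container from above:
#   wait_for_log ssh_jenkins_test "Identity added" 60 || exit 1
```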
When I ping one site it returns "Request timed out". I want to make a little program that will inform me (with a beep or something like that) when this server is online again, no matter in which language. I think it should be a very simple script with a few lines of code. So how do I write it?
Some implementations of ping allow you to specify conditions for exiting after receipt of packets:
On Mac OS X, use ping -a -o $the_host
ping will keep trying (by default)
-a means beep when a packet is received
-o means exit when a packet is received
On Linux (Ubuntu at least), use ping -a -c 1 -w inf $the_host
-a means beep when a packet is received
-c 1 specifies the number of packets to send before exit (in this case 1)
-w inf specifies the deadline for when ping exits no matter what (in this case Infinite)
when -c and -w are used together, -c becomes number of packets received before exit
Either can be chained to perform your next command, e.g. to ssh into the server as soon as it comes up (with a gap between to allow sshd to actually start up):
# ping -a -o $the_host && sleep 3 && ssh $the_host
Don't forget the notify sound like echo"^G"! Just to be different - here's Windows batch:
C:\> more pingnotify.bat
:AGAIN
ping -n 1 %1%
IF ERRORLEVEL 1 GOTO AGAIN
sndrec32 /play /close "C:\Windows\Media\Notify.wav"
C:\> pingnotify.bat localhost
:)
One way is to run ping in a loop, e.g.
while ! ping -c 1 host; do sleep 1; done
(You can redirect the output to /dev/null if you want to keep it quiet.)
On some systems, such as Mac OS X, ping may also have the options -a -o (as per another answer) available which will cause it to keep pinging until a response is received. However, the ping on many (most?) Linux systems does not have the -o option and the kind of equivalent -c 1 -w 0 still exits if the network returns an error.
Edit: If the host does not respond to ping or you need to check the availability of service on a certain port, you can use netcat in the zero I/O mode:
while ! nc -w 5 -z host port; do sleep 1; done
The -w 5 specifies a 5 second timeout for each individual attempt. Note that with netcat you can even list multiple ports (or port ranges) to scan for when some of them become available.
Edit 2: The loops shown above keep trying until the host (or port) is reached. Add your alert command after them, e.g. beep or pop-up a window.
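Putting the loop and the alert together might look like the sketch below. The function name notify_when_up is made up, the terminal bell is only one possible alert, and the exact ping and nc options can differ between systems.

```shell
#!/usr/bin/env bash
# Block until HOST answers a ping (or accepts TCP connections on PORT, if given), then beep.
# Usage: notify_when_up HOST [PORT]
notify_when_up() {
    local host=$1 port=$2
    if [ -n "$port" ]; then
        until nc -z -w 5 "$host" "$port" >/dev/null 2>&1; do sleep 1; done
    else
        until ping -c 1 "$host" >/dev/null 2>&1; do sleep 1; done
    fi
    printf '\a'   # terminal bell
}

# Examples (host is a placeholder):
#   notify_when_up host
#   notify_when_up host 22 && ssh host
```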