My goal is to have a HEALTHCHECK command in my Dockerfile that checks whether the webserver is working by simply making a request to the website and checking that it receives a "proper response".
The problem I'm having is that the application has an authentication middleware, which causes the application to return an error (401 Unauthorized), causing CURL to fail and return curl: (7) Failed to connect to host.docker.internal port 8000: Connection refused.
If I remove the authentication middleware it doesn't return anything, which is what I'm aiming for.
The command I'm using is the following (I'm currently just using it inside a container, trying to find a solution):
curl --fail http://host.docker.internal:8000
I know I can pass the username and password to CURL, but I would rather not do that.
Having a way to tell CURL that Unauthorized (error 401) is fine, or to treat a connection refused error (curl: (7)) as the only error, would be enough. It would be even better if I could decide what CURL should and should not consider a success. Is there any way to do something like this with one or more CURL options?
Health checks are good practice when a microservice or REST service architecture is used.
Default health endpoints and check platforms need an HTTP 200 response to flag your app as healthy. Any other response is flagged as unhealthy.
Custom codes with curl
I tried, and as far as I can tell curl alone cannot be told which status codes count as success:
https://superuser.com/questions/590099/can-i-make-curl-fail-with-an-exitcode-different-than-0-if-the-http-status-code-i
You need a custom logic.
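If the only requirement is the narrower one from the question (a refused connection is a failure, any HTTP answer including 401 is fine), one option is simply to drop --fail. Without it, curl exits 0 whenever it receives an HTTP response and non-zero (7 for a refused connection) when it does not. A minimal sketch; the function name is_reachable and the URL are placeholders, not from the original post, and note that a 500 would also count as "up" here:

```shell
#!/bin/sh
# is_reachable URL
# Succeeds when the server sends any HTTP response at all (200, 401, 500, ...).
# Fails when curl cannot complete the request, e.g. exit code 7 on "connection refused".
is_reachable() {
    # No --fail here: HTTP error statuses do not change curl's exit code.
    curl --silent --output /dev/null --max-time 5 "$1"
}

# Example use in a health check (placeholder URL):
#   is_reachable http://localhost:8000 || exit 1
```

The trailing `|| exit 1` in the usage line maps curl's exit code to the 1 that HEALTHCHECK expects for "unhealthy".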
Custom health
As you are using an Ubuntu-based image, you could use a simple bash script to catch 401 codes and exit 0 so that your container is marked as healthy.
with curl
The cornerstone here is the option to retrieve just the response code from the curl invocation:
curl -o /dev/null -s -w "%{http_code}\n" http://localhost
So you can create a bash script to execute your curl invocation and return:
exit 0 just for 200 and 401
exit 1 in any other case
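#!/bin/bash
# healthcheck.sh
# Fetch only the HTTP status code; curl prints 000 if no response was received.
code=$(curl -o /dev/null -s -w "%{http_code}" http://localhost:12345)
echo "response code: $code"
# Treat 200 (OK) and 401 (Unauthorized, so the server is up) as healthy.
if [ "$code" = "200" ] || [ "$code" = "401" ]
then
    echo "success"
    exit 0
else
    echo "error"
    exit 1
fi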
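The same idea can be generalised so that the list of acceptable status codes is passed in rather than hard-coded, which is close to what the question asks for ("decide what CURL should consider a success"). This is only a sketch: the function names code_is_accepted and check_url are made up for illustration.

```shell
#!/bin/sh
# code_is_accepted CODE ACCEPTED...
# Returns 0 when CODE equals one of the ACCEPTED status codes, 1 otherwise.
code_is_accepted() {
    code=$1
    shift
    for accepted in "$@"; do
        if [ "$code" = "$accepted" ]; then
            return 0
        fi
    done
    return 1
}

# check_url URL ACCEPTED...
# Fetches only the status code and compares it against the accepted list.
# curl prints 000 when no HTTP response was received (e.g. connection refused).
check_url() {
    url=$1
    shift
    status=$(curl -o /dev/null -s -w '%{http_code}' --max-time 5 "$url")
    code_is_accepted "$status" "$@"
}

# Example (placeholder URL): healthy on 200 or 401, unhealthy otherwise.
#   check_url http://localhost:12345 200 401 || exit 1
```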
Finally, you can use this script in your healthcheck, whatever language the application is written in (PHP in your case):
FROM node
COPY server.js /
COPY healthcheck.sh /
HEALTHCHECK --interval=5s --timeout=10s --retries=3 CMD bash /healthcheck.sh
CMD [ "node", "/server.js" ]
Health feature should be public
Common health verifications cover server status, internet connection, RAM, disk, database connectivity, and any other statistic that tells you whether your app is running and OK.
Health check platforms do not allow us to register complex security flows (OAuth 1, OAuth 2, OpenID, etc.). They only allow us to register a simple HTTP endpoint. The AWS ELB health check configuration is one example.
The health feature should not expose any sensitive data, and because of that this endpoint can be public. Classic public webpages, web systems, or public APIs are examples.
Workaround
In some strict cases, privacy is required.
In this case I protected /health with a simple API key passed as a query parameter. In the controller, I validate that it is equal to an expected value. The final health endpoint is /health?apiKey=secret, which is easy to register in check platforms.
With a more complex configuration you could allow /health only from the internal private LAN and block public access. In that case your /health is secure.
Related
I have a health check defined for my ECS Fargate service. It works when I test locally, and it works with Fargate platform version 1.3.0.
But when I change to Fargate platform version 1.4.0 it always turns unhealthy, even though the actual service is working: I can access the service on the container's public IP.
The health check is defined as:
"CMD-SHELL", "curl --fail http://localhost || exit 1"
So we looked into this and there's an issue in platform version 1.4 where, if the health check outputs anything to stderr, a false negative occurs. We will, obviously, fix this, but in the meantime you can work around it by (in this case) running curl in silent mode or simply redirecting stderr output to /dev/null:
curl -s --fail http://localhost || exit 1
or
curl --fail http://localhost 2>/dev/null || exit 1
Should unblock you for now.
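If several commands are involved, the same workaround can be wrapped once. The sketch below (the helper name quiet_check is made up for illustration) discards both stdout and stderr and reduces any failure to exit status 1:

```shell
#!/bin/sh
# quiet_check COMMAND [ARGS...]
# Runs the command with stdout and stderr discarded and maps any
# non-zero exit status to 1.
quiet_check() {
    "$@" >/dev/null 2>&1 || return 1
}

# Example, using the command from the question:
#   quiet_check curl --fail http://localhost || exit 1
```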
I wanted to collate some answers together and build on them, as follows.
I'm not being funny, but first and foremost make sure you have a healthcheck endpoint running somewhere. Note that this doesn't have to be inside your container! Let me show you what I mean:
curl -s --fail -I https://127.0.0.1:8000/ || exit 1
will only pass if you have an HTTP server running on localhost port 8000 (etc.). This can be anything that returns a 200 - over to you.
Tips:
Make sure curl is installed inside the container
-s is for silent
--fail makes curl exit non-zero (22) when the HTTP response code is 400 or above
-I header only
If localhost doesn't work try 127.0.0.1
Now, in my case I was not running an HTTP server but rather a long-running python script. In its error state the script exits with 1 (which terminates the task), but otherwise (after a long time) it exits with 0. To pass the healthcheck, the healthcheck call must also return 0 (otherwise there is a 1 and the task is again terminated*). [*exit codes > 1 can be converted to a 1 - see below stolen trick.]
So I had to fake a different endpoint with the same behaviour.
Step forward, Google.
curl -s --fail -I https://www.google.com || exit 1
As before, but now hit an external endpoint kindly provided. Note the || exit 1, which converts any non-zero exit code to the 1 liked by the healthcheck.
Sorry to "state the bleeding obvious", but you really do need a function running here - don't run curl on a local endpoint and expect to get a healthy status!
Remember to expose the https / http ports 443 / 80 in your docker file and in the JSON task definition spec/through the console UI.
TIP! Note that the CMD-SHELL syntax is slightly different depending.
Putting it all together, for ECS Fargate the rest is correct.
You could also try an echo rather than a curl. I am unclear whether a point-to-point call is even required.
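For a workload like the long-running script above, which serves no HTTP at all, one alternative to calling an external site is to check the process itself. This is a rough sketch and not what the answer above does: it assumes the script writes its PID to a file, and both the path /tmp/worker.pid and the function name process_alive are hypothetical.

```shell
#!/bin/sh
# process_alive PIDFILE
# Succeeds when PIDFILE is readable and names a process that is still running.
process_alive() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    [ -n "$pid" ] || return 1
    # kill -0 sends no signal; it only tests whether the process exists
    # and whether we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null || return 1
}

# Example health check command (hypothetical PID file path):
#   process_alive /tmp/worker.pid || exit 1
```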
I recently set up Traefik v.1.7.14 in a Docker container, on a Docker Swarm enabled cluster. As a test, I created a simple service:
docker service create --name demo-nginx \
--network traefik-net \
--label traefik.app.port=80 \
--label traefik.app.frontend.auth.basic="test:$$apr1$$LG8ly.Y1$$1J9m2sDXimLGaCSlO8.T20" \
--label traefik.app.frontend.rule="Host:t.myurl.com" \
nginx
As the code above states, I am simply installing nginx on my url, at the subdomain t specified.
When this code runs, the service gets created successfully. Traefik also shows the service in the traefik api, as well as within the traefik administrator.
In the traefik api, the back-end service is reported as follows:
"frontend-Host-t-myurl-com-0": {
"entryPoints": [
"http",
"https"
],
"backend": "backend-demo-nginx",
"routes": {
"route-frontend-Host-t-myurl-com-0": {
"rule": "Host:t.myurl.com"
}
},
"passHostHeader": true,
"priority": 0,
"basicAuth": null,
"auth": {
"basic": {}
}
When I go to visit t.myurl.com, I get the authentication prompt, as expected.
However, when I type in my username/password (test:test, in this case), the login prompt just prompts me again and doesn't authenticate me.
I have checked to ensure that I am escaping the characters in the docker label by using:
echo $(htpasswd -nb test test) | sed -e s/\\$/\\$\\$/g
To generate the password.
As part of my testing, I also tried turning off the https entryPoint, as I wanted to see if this cycle was somehow being triggered by ssl. That didn't seem to have any impact on resolving this (rule: --label traefik.app.frontend.entryPoints=http). Traefik did properly respond on http upon doing this, but the password authentication still fell into the same prompting loop as before.
When I remove the traefik.app.frontend.auth.basic label, I can access my site at my url (t.myurl.com). So this issue appears to be isolated within the basic authentication functionality.
My DNS provider is Cloudflare.
If anyone has any ideas, I'd appreciate it.
Maybe you can try this:
echo $(htpasswd -nb your-user your-password);
Because you don't need two $$ in the command line.
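To make the difference concrete: inside double quotes the shell expands `$` itself (`$$` becomes the shell's PID), so on the command line the hash should be passed in single quotes with single `$` signs. Doubling is only for places where `$` is interpolated by the file format, such as a Compose file. A small sketch; the helper name double_dollars and the sample hash are made up for illustration and are not a real credential:

```shell
#!/bin/sh
# double_dollars: reads text on stdin and doubles every "$".
# Only needed where "$" is interpolated by the file format (e.g. a Compose file).
double_dollars() {
    sed -e 's/\$/$$/g'
}

# Placeholder htpasswd-style entry. Single quotes keep the "$" signs literal.
hash='test:$apr1$abcdefgh$0123456789abcdefghijkl'

# Value for "docker service create --label ..." on the command line: single "$".
cli_label="traefik.app.frontend.auth.basic=$hash"

# Value for a Compose file: every "$" doubled.
compose_label="traefik.app.frontend.auth.basic=$(printf '%s\n' "$hash" | double_dollars)"

printf '%s\n' "$cli_label" "$compose_label"
```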
I'm struggling with the following issue. We have a Java application that is running properly on Docker. Now, when we try to migrate the application to Docker Swarm--running it as a service--it always throws the following exception:
Cache - Unable to set localhost. This prevents creation of a GUID. Cause was: 39bc5cdfb3d9: 39bc5cdfb3d9: Name or service not known
java.net.UnknownHostException: 39bc5cdfb3d9: 39bc5cdfb3d9: Name or service not known
Note that 39bc5cdfb3d9 is the container ID.
I've tried the following:
curl against the DNS that we are using
updating the nginx config that the other server is back up
Setup:
3 managers
Containers run only on the 2 servers app1.dev and app2.dev, which have the constraint label=dev
Using the default ingress network
DNS: dev-ecc.toroserver.com
I run the service using this:
docker service create \
${HTTP} \
${HTTPS} \
${VOLUMES} \
${ENV_VARS} \
${LICENSE} \
${LOGS} \
--limit-memory 768mb \
--mode=global \
--constraint 'engine.labels.serverType == dev' \
--env appName="${SUB_DNS}" \
--name="${SUB_DNS}" \
--restart-condition on-failure --restart-max-attempts 5 \
--with-registry-auth \
${DOCKER_REGISTRY}/${DOCKER_USER}/${APPNAME}:${VERSION}
I also get the following every time I try to log in: my session is automatically logged out. I'm not sure whether it is related to the "Unable to set localhost" error.
2017-11-08 03:25:56,771 [ INFO] AjaxTimeoutRedirectFilter - User session expired or not logged in yet
2017-11-08 03:25:56,771 [ INFO] AjaxTimeoutRedirectFilter - User session expired or not logged in yet
2017-11-08 03:25:56,778 [ INFO] AjaxTimeoutRedirectFilter - Redirect to login page
2017-11-08 03:25:56,778 [ INFO] AjaxTimeoutRedirectFilter - Redirect to login page
2017-11-08 03:30:36,822 [ INFO] AjaxTimeoutRedirectFilter - User session expired or not logged in yet
2017-11-08 03:30:36,822 [ INFO] AjaxTimeoutRedirectFilter - User session expired or not logged in yet
Any insights will be much appreciated. Thanks.
The "Cache - unable to set localhost" looks to be a common error message from the EHCache project. Finding that in the code shows that it is the result of calling the Java net library's java.net.InetAddress.getLocalHost() method, which looks up the local hostname and then tries to DNS resolve it to an IP address.
A quick local test shows that this works for both docker run and as a service on my single-node Swarm. Given you mention testing DNS, maybe at this point more information is required about your specific Swarm setup (specifically networking) to see why you are getting different behavior. Obviously if you have your own DNS, then as per the above, the default name of the container must be resolvable by a DNS lookup or else you will continue to get the Java UnknownHostException.
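A quick way to test the same precondition from inside the failing container is to ask the system resolver whether the container's own hostname resolves. This is only a sketch: it assumes getent is available in the image, and the function name name_resolves is made up for illustration.

```shell
#!/bin/sh
# name_resolves NAME
# Succeeds when the system resolver (hosts file, DNS, ...) can map NAME to an address.
name_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Inside the container, the name to test is the container's own hostname
# (the container ID by default):
#   name_resolves "$(hostname)" || echo "hostname $(hostname) does not resolve" >&2
```

If this fails inside the Swarm task but succeeds under docker run, the difference is in name resolution rather than in the application.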
I have an issue using Docker swarm.
I have 3 replicas of a Python web service running on Gunicorn.
The issue is that when I restart the swarm service after a software update, an old running service is killed, then a new one is created and started. But in the short period of time when the old service is already killed, and the new one didn't fully start yet, network messages are already routed to the new instance that isn't ready yet, resulting in 502 bad gateway errors (I proxy to the service from nginx).
I use --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces chances of getting the 502 error (because there are always at least 2 services running, even if one of them might be still starting up).
So, following what I've proposed in comments:
Use the HEALTHCHECK feature of Dockerfile: Docs. Something like:
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
Since Docker Swarm honors this healthcheck during service updates, it's relatively easy to have a zero downtime deployment.
But as you mentioned, you have a high-resource consumer health-check, and you need larger healthcheck-intervals.
In that case, I recommend customizing your healthcheck so that the first run happens immediately and the successive checks happen when current_minute % 5 == 0, while the healthcheck command itself is invoked every 30s:
HEALTHCHECK --interval=30s --timeout=3s \
CMD /healthcheck.sh
healthcheck.sh
#!/bin/bash
# Force base 10 so minutes such as "08" and "09" are not read as invalid octal.
CURRENT_MINUTE=$((10#$(date +%M)))
INTERVAL_MINUTE=5
do_healthcheck() {
    curl -f http://localhost/ || exit 1
}
# Always run the real check on the first invocation after container start.
if [ ! -f /tmp/healthcheck.first.run ]; then
    do_healthcheck
    touch /tmp/healthcheck.first.run
    exit 0
fi
# Run only in minutes that are a multiple of $INTERVAL_MINUTE
if [ $((CURRENT_MINUTE % INTERVAL_MINUTE)) -eq 0 ]; then
    do_healthcheck
fi
exit 0
Remember to COPY the healthcheck.sh to /healthcheck.sh (and chmod +x)
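One detail of the current_minute % 5 rule is worth isolating: date +%M prints zero-padded values, and shells that treat a leading zero as octal reject "08" and "09" in arithmetic. A small sketch of the scheduling test on its own (the function name minute_is_due is made up for illustration):

```shell
#!/bin/sh
# minute_is_due MINUTE INTERVAL
# Succeeds when MINUTE (as printed by `date +%M`, e.g. "08") is a multiple of INTERVAL.
minute_is_due() {
    # Strip one leading zero so "08" and "09" are not parsed as invalid octal numbers.
    minute=${1#0}
    # "0" becomes empty after stripping; treat that as minute 0.
    minute=${minute:-0}
    [ $(( minute % $2 )) -eq 0 ]
}

# Example:
#   minute_is_due "$(date +%M)" 5 && echo "run the expensive check now"
```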
There are some known issues (e.g. moby/moby #30321) with rolling upgrades in docker swarm with the current 17.05 and earlier releases (and it doesn't look like all the fixes will make 17.06). These issues will result in connection errors during a rolling upgrade like you're seeing.
If you have a true zero downtime deployment requirement and can't solve this with a client side retry, then I'd recommend putting some kind of blue/green switch in front of your swarm and doing the rolling upgrade on the non-active set of containers until docker finds solutions to all of the scenarios.
In my CI chain I execute end-to-end tests after a "docker-compose up". Unfortunately my tests often fail because even if the containers are properly started, the programs contained in my containers are not.
Is there an elegant way to verify that my setup is completely started before running my tests ?
You could poll the required services to confirm they are responding before running the tests.
curl has inbuilt retry logic or it's fairly trivial to build retry logic around some other type of service test.
#!/bin/bash
await(){
local url=${1}
local seconds=${2:-30}
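# --retry-connrefused (curl 7.52.0+) also retries while the port is still closed,
# which plain --retry does not do.
curl --max-time 5 --retry 60 --retry-delay 1 --retry-connrefused \
--retry-max-time ${seconds} "${url}" \
|| exit 1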
}
docker-compose up -d
await http://container_ms1:3000
await http://container_ms2:3000
run-ze-tests
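For services that are not tested with curl, retry logic around "some other type of service test" could look roughly like this. It is a generic sketch; the function name retry_until is made up for illustration.

```shell
#!/bin/sh
# retry_until ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Succeeds as soon as COMMAND succeeds; fails if every attempt fails.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        # Do not sleep after the final failed attempt.
        if [ "$try" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: wait up to about 30 seconds for a service before running the tests.
#   retry_until 30 1 curl --silent --fail --max-time 5 http://container_ms1:3000 || exit 1
```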
The alternate to polling is an event based system.
In this model all your services push notifications to an external service (scaeda gave the example of a log file, or you could use something like Amazon SNS) by emitting a "started" event. You can then subscribe to those events and run whatever you need once everything has started.
Docker 1.12 did add the HEALTHCHECK build command. Maybe this is available via Docker Events?
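Independently of events, the health status that Docker records for a container with a HEALTHCHECK can be read with docker inspect and polled from the CI script. A minimal sketch; the function name wait_healthy and the container name in the usage line are placeholders:

```shell
#!/bin/sh
# wait_healthy CONTAINER [ATTEMPTS] [DELAY]
# Polls the health status of a container that defines a HEALTHCHECK.
# Succeeds once the status is "healthy"; fails on "unhealthy" or when
# ATTEMPTS polls have passed without the container becoming healthy.
wait_healthy() {
    container=$1
    attempts=${2:-30}
    delay=${3:-1}
    try=1
    while [ "$try" -le "$attempts" ]; do
        # Status is one of: starting, healthy, unhealthy.
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) \
            || status=unknown
        case $status in
            healthy)   return 0 ;;
            unhealthy) return 1 ;;
        esac
        try=$((try + 1))
        sleep "$delay"
    done
    return 1
}

# Example (placeholder container name):
#   docker-compose up -d
#   wait_healthy container_ms1 60 1 || exit 1
```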
If you have control over the docker engine in your CI setup you could execute docker logs [Container_Name] and read out the last line which could be emitted by your application.
RESULT=$(docker logs [Container_Name] 2>&1 | grep [Search_String])
logs output example:
Agent pid 13
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
parse specific line:
RESULT=$(docker logs ssh_jenkins_test 2>&1 | grep Enter)
result:
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)