Docker swarm stack service replicas zero downtime - docker

I have been trying to fine-tune the docker compose settings, but I am not satisfied with the result, and the docs are quite unspecific about the healthcheck and update_config options.
The scenario is React apps which need to run build and start during entrypoint execution. The builds cannot be done in the Dockerfile, because then I would need to tag redundant images for each environment (amongst other inconveniences).
Because of the build and run steps, after the container is deployed it takes about 30 seconds before the healthcheck gets a positive response from the node server.
Now, in a rolling-update zero-downtime scenario, what settings would I use? The thing is, I don't need more than 1 replica. The ideal config option would be something like wait_rolling_update_delay, which would make Docker never replace containers before this wait time has passed. I am playing around with healthcheck.start_period but I am not seeing a difference.
deploy:
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == worker
  labels:
    - "APP=name"
    - "traefik.http.services.name.loadbalancer.server.port=1338"
  restart_policy:
    condition: any
    delay: 10s
    max_attempts: 3
    window: 60s
  update_config:
    parallelism: 1
    delay: 10s
    monitor: 10s
    order: start-first
    failure_action: rollback
healthcheck:
  test: "curl --silent --fail http://localhost:1338/_health || exit 1"
  interval: 10s
  timeout: 120s
  retries: 10
  start_period: 120s
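For what it's worth, here is a minimal sketch of how these options are commonly understood to interact with order: start-first (the values are illustrative, not a recommendation): when a healthcheck is defined, the new task does not count as running until the check passes, and with start-first the old task is only shut down after that, so start_period mainly controls how long failed checks are tolerated during startup rather than delaying the swap by itself.

deploy:
  update_config:
    order: start-first        # start the new task before stopping the old one
    monitor: 60s              # keep watching the new task this long before declaring the update successful
    failure_action: rollback  # revert if the new task fails within the monitor window
healthcheck:
  test: "curl --silent --fail http://localhost:1338/_health || exit 1"
  interval: 10s               # probe every 10 seconds
  timeout: 5s                 # each probe may take at most 5 seconds
  retries: 10                 # 10 consecutive failures mark the task unhealthy
  start_period: 60s           # failures within the first 60s do not count towards retries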

Related

Docker deploy swarm instance on specific node matching instance index

Using docker swarm, I am trying to deploy N instances of my app on N nodes in such a way that each app is deployed on the node with the corresponding index, e.g. app1 must be deployed on node1, app2 on node2, and so on.
The below is not working; it complains Error response from daemon: rpc error: code = Unknown desc = value 'node{{.Task.Slot}}' is invalid.
Any suggestion on how to achieve this?
I also have the impression, as a long shot, that I could use something with labels, but I cannot wrap my head around it yet. Anyhow, please advise.
version: "3.8"
services:
app:
image: app:latest
hostname: "app{{.Task.Slot}}"
networks:
- app-net
volumes:
- "/data/shared/app{{.Task.Slot}}/config:/app/config"
deploy:
replicas: 5
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: any
placement:
constraints:
- "node.hostname==node{{.Task.Slot}}" <========= ERROR
Service template parameters are documented as resolving only in:
- the hostname: directive
- volume definitions
- labels
- environment variables
Placement preferences/constraints are not supported, but that would be brilliant, as it would allow simple deployments of Minio, etcd, Consul and other clustered services where you need to pin replicas to nodes.
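As a sketch of where the templates do resolve when deploying with docker stack deploy (the SLOT environment variable name is made up for illustration):

services:
  app:
    image: app:latest
    hostname: "app{{.Task.Slot}}"       # resolves: replica 1 gets hostname app1, replica 2 gets app2, ...
    environment:
      - "SLOT={{.Task.Slot}}"           # resolves: an entrypoint script could read $SLOT
    volumes:
      - "/data/shared/app{{.Task.Slot}}/config:/app/config"   # resolves in the volume source
    deploy:
      replicas: 5
      # placement constraints are NOT templated, which is exactly the error above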

Healthcheck is failing when deploying a mssql database

The healthcheck is failing when deploying an mssql database on AWS ECS.
Below is a copy of the service from the docker-compose.yml file:
sql_server_db:
  image: 'mcr.microsoft.com/mssql/server:2017-latest'
  environment:
    SA_PASSWORD: Password123#
    ACCEPT_EULA: "Y"
  labels:
    - traefik.enable=false
  deploy:
    resources:
      limits:
        cpus: '1'
        memory: 8Gb
      reservations:
        cpus: '0.5'
        memory: 4GB
  healthcheck:
    test: ["/opt/mssql-tools/bin/sqlcmd", "-U", "sa", "-P", "Password123#", "-Q", "SELECT 1"]
    interval: 1m
    retries: 10
    start_period: 60s
I had the same issue; when checking the container "inspect" output I was getting "Login fails for SA".
This was disturbing because the password was the same (I used the .env variable), but for some reason the special characters seem to mess up the check.
I simply created a one-liner script
/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P $SA_PASSWORD -Q "Select 1"
and then called it as the healthcheck:
healthcheck:
  test: ["CMD", "bash", "/healthcheck.sh"]
and it works.
I don't really like it, but I will keep it until I find a better one (I am not sure it can actually fail).
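If the wrapper script feels too fragile, one alternative sketch (assuming SA_PASSWORD is set in the container environment) is to let a shell inside the container expand the variable itself, with $$ so compose does not interpolate it on the host:

healthcheck:
  test: ["CMD-SHELL", "/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P \"$$SA_PASSWORD\" -Q 'SELECT 1' || exit 1"]
  interval: 1m
  retries: 10
  start_period: 60s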

no suitable node - Unable to match constraints services with docker swarm nodes

I execute
sudo docker node ls
And this is my output
ID                            HOSTNAME          STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
smlsbj3r6qjq7s22cl9cgi1f1 *   ip-172-30-0-94    Ready     Active         Leader           20.10.2
b5e3w8nvrw3kw8q3sg1188439     ip-172-30-0-107   Ready     Active                          20.10.2
phjkfj09ydvgzztaib2zxcfv9     ip-172-30-0-131   Ready     Active                          20.10.2
m73z9ikte16klds06upruifji     ip-172-30-0-193   Ready     Active                          20.10.2
So far so good: I know I have one manager and 3 workers. So, if a service has a constraint that matches the node.role property of the worker nodes, some of them will be elected by Docker Swarm to execute the containers related to the service itself.
The info of my current service is this:
ID:             5p4hpxmvru9kbwz9y5oymoeq0
Name:           elasbit_relay1
Service Mode:   Replicated
 Replicas:      1
Placement:
 Constraints:   [node.role!=manager]
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         inputoutput/cardano-node:latest@sha256:02779484dc23731cdbea6388920acc6ddd8e40c03285bc4f9c7572a91fe2ee08
 Args:          run --topology /configuration/testnet-topology.json --database-path /db --socket-path /db/node.socket --host-addr 0.0.0.0 --port 3001 --config /configuration/testnet-config.json
 Init:          false
Mounts:
 Target:        /configuration
  Source:       /home/ubuntu/cardano-docker-run/testnet
  ReadOnly:     true
  Type:         bind
 Target:        /db
  Source:       db
  ReadOnly:     false
  Type:         volume
Resources:
Endpoint Mode:  vip
Ports:
 PublishedPort = 12798
  Protocol = tcp
  TargetPort = 12798
  PublishMode = ingress
The key part is [node.role!=manager]. It gives me no suitable node (unsupported platform on 3 nodes; scheduling constraints ….
I tried a lot of ways:
Using the docker-compose format (yml) with a constraints list:
deploy:
  replicas: 1
  placement:
    constraints: [node.role==worker]
  restart_policy:
    condition: on-failure
Using labels on nodes (see the sketch below).
In all of them I failed. The funny part is that if I point a constraint at the manager node, it works! Do I have a typo somewhere? Well, I don't see it.
I'm using Docker version 20.10.2, build 20.10.2-0ubuntu1~18.04.2.
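For the node-labels route mentioned above, a sketch could look like this (the label name relay is made up for illustration). Note that the "unsupported platform" part of the error suggests the image itself may not provide a build for those nodes' architecture or OS, which no placement constraint will fix.

# first, on a manager: docker node update --add-label relay=true ip-172-30-0-107
deploy:
  replicas: 1
  placement:
    constraints:
      - node.labels.relay == true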

How to achieve zero downtime with docker stack

Docker updates the container, but network registration takes 10 minutes to complete, so while the new container is being registered the page returns 502 because the internal network is still pointing at the old container. How can I delay the removal of the old container until 10 minutes or so after the update to the new container? Ideally I would like to push this config with docker stack, but I'll do whatever it takes. I should also note that I am unable to use replicas right now due to certain limitations of a security package I'm being forced to use.
version: '3.7'
services:
  xxx:
    image: ${xxx}/com.xxx:${xxx}
    environment:
      - SERVICE_NAME=xxx
      - xxx
      - _xxx
      - SPRING_PROFILES_ACTIVE=${xxx}
    networks:
      - xxx${xxx}
    healthcheck:
      interval: 1m
    deploy:
      mode: replicated
      replicas: 1
      resources:
        limits:
          cpus: '3'
          memory: 1024M
        reservations:
          cpus: '0.50'
          memory: 256M
      labels:
        - com.docker.lb.hosts=xxx${_xxx}.xxx.com
        - jenkins.url=${xxx}
        - com.docker.ucp.access.label=/${xxx}/xxx
        - com.docker.lb.network=xxx${_xxx}
        - com.docker.lb.port=8080
        - com.docker.lb.service_cluster=${xxx}
        - com.docker.lb.ssl_cert=xxx.cert
        - com.docker.lb.ssl_key=xxx.key
        - com.docker.lb.redirects=http://xxx${_xxx}.xxx.com/xxx,https://xxx${_xxx}.xxx.com/xxx
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    secrets:
      - ${xxx}
networks:
  xxx${_xxx}:
    external: true
secrets:
  ${xxx}:
    external: true
  xxx.cert:
    external: true
  xxx.key:
    external: true
Use a proper healthcheck - see the reference here: https://docs.docker.com/compose/compose-file/#healthcheck
So:
You need to define a proper test so Docker knows when your new container is fully up (that goes in the test instruction of your healthcheck).
Use the start_period instruction to specify your 10 (or so) minute wait - otherwise, Docker Swarm would just kill your new container and never let it start.
Basically, once you get the healthcheck right, this should solve your issue.
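A sketch of what that could look like for a roughly ten-minute warm-up, assuming the image ships curl and exposes some health endpoint (the /actuator/health path is only a guess based on the Spring profile variable above):

healthcheck:
  test: ["CMD-SHELL", "curl --silent --fail http://localhost:8080/actuator/health || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 5
  start_period: 10m        # failed probes during the first 10 minutes do not count against retries
deploy:
  update_config:
    order: start-first     # the old task keeps serving until the new one reports healthy
    failure_action: rollback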

Add healthcheck in Keycloak Docker Swarm service

What's the best way to test the health of Keycloak configured as a cluster and deployed as a Docker Swarm service?
I tried the below healthcheck for testing availability in Keycloak service descriptor:
healthcheck:
  test: ["CMD-SHELL", "curl http://localhost:8080/auth/realms/[realm_name]"]
  interval: 30s
  timeout: 10s
  retries: 10
  start_period: 1m
Are there more things to check for? I couldn't find any documentation for this.
I prefer to check the 'master' realm directly.
Moreover, recent Keycloak versions use a different path (omitting 'auth'):
healthcheck:
  test: ["CMD", "curl", "-f", "http://0.0.0.0:8080/realms/master"]
  start_period: 10s
  interval: 30s
  retries: 3
  timeout: 5s
One can also use the /health endpoint on the Keycloak container as follows:
"healthCheck": {
"retries": 3,
"command": [
"CMD-SHELL",
"curl -f http://localhost:8080/health || exit 1"
],
"timeout": 5,
"interval": 60,
"startPeriod": 300
}
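Translated into compose/swarm syntax, that camelCase healthCheck would look roughly like this (again assuming curl is available in the image and the health endpoint is enabled):

healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
  retries: 3
  timeout: 5s
  interval: 60s
  start_period: 300s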
