I'm using Docker Swarm to deploy an application on a single node only, but I still want zero-downtime deployments.
Question: how can I tell Docker Swarm to wait for X seconds after the new container has started before switching traffic to it and taking the old container down?
I know I could add a custom healthcheck, but I'd rather simply define a time interval that holds the container back and gives it some warmup time before it is taken live.
Or maybe some kind of initial sleep?
deploy:
  replicas: 1
  update_config:
    order: start-first
    failure_action: rollback
    # this did not help, the new container goes live instantly!
    delay: 10s
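For reference, a hedged sketch of the healthcheck route mentioned above: with a healthcheck defined and order: start-first, Swarm should only treat the new task as ready once it reports healthy, so the old task stays up during the warmup. The service name, /health path and all timing values here are placeholders:

services:
  app:
    image: demo:latest   # illustrative image name
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      # grace period: failing checks during this window don't count,
      # but a passing check still marks the task healthy sooner
      start_period: 30s
    deploy:
      replicas: 1
      update_config:
        order: start-first
        failure_action: rollback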
Related
I'm currently in the process of setting up a swarm with 5 machines. I'm wondering if I can (and should) limit the swarm to only allow one active instance of a service, with all the others waiting to jump in when that service fails.
This is to prevent potential concurrency problems with MariaDB (as the nodes still write to a NAS), or hitting the connection limit of an external service (like Node-RED with Telegram).
If you're deploying with stack files you can set "replicas: 1" in the deploy section to make sure only one instance runs at a time.
If that instance fails (crashes or exits), Docker will start another one.
https://docs.docker.com/compose/compose-file/deploy/#replicas
If the service is replicated (which is the default), replicas specifies the number of containers that SHOULD be running at any given time.
services:
  frontend:
    image: awesome/webapp
    deploy:
      mode: replicated
      replicas: 6
If you want multiple instances running and only one "active" hitting the database you'll have to coordinate that some other way.
I have a Docker Swarm environment with 7 nodes (3 manager nodes and 4 worker nodes). I am trying to deploy a container, and my requirement is that at any point in time I need 2 instances of this container running, but when I scale it up the new container should be deployed to a different node than the one it is currently running on.
Ex: say one instance of the container is running on Node 4; if I scale to scale=2 it should run on any node except Node 4.
I tried this, but no luck:
deploy:
  mode: global
  placement:
    constraints:
      - node.labels.cloud.type == nodesforservice
We solved this issue with the deployment preferences configuration (under the placement section). We set a node.labels.worker label on all our worker nodes: we have 3 workers, labelled node.labels.worker = worker1, node.labels.worker = worker2 and node.labels.worker = worker3 respectively. On the Docker Compose side we then configure it as follows:
placement:
  max_replicas_per_node: 2
  constraints:
    - node.role == worker
  preferences:
    - spread: node.labels.worker
Note that this will not FORCE the replicas onto separate nodes; it will only spread them when possible. It is not a hard limit, so beware of that.
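For completeness, the per-node labels referenced above would be set with something along these lines (the node hostnames are placeholders):

# label each worker so the spread preference has something to spread over
docker node update --label-add worker=worker1 <worker-1-hostname>
docker node update --label-add worker=worker2 <worker-2-hostname>
docker node update --label-add worker=worker3 <worker-3-hostname>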
I'm trying to improve my service by setting a rollback strategy in case my changes crash the container and the tasks keep exiting.
Context:
I have a simple service that I update changing the tag.
services:
  web:
    image: AWS-Account-Id.dkr.ecr.us-east-1.amazonaws.com/my-service:1
    environment:
      - ENV=dev
    ports: ["80:80"]
I make some change in the Docker image, build, tag, and push it to ECR. Then I update the tag to 2 (for example) and run docker compose up.
Let's say that I introduce an error and the container starts but then stops (due to the error): it will keep constantly trying to start and stop the container with the error Essential container in task exited.
Is there a way in docker-compose to set a condition where, if it tries to start the web container 2 times and the tasks fail to reach and maintain the running status, it rolls back the changes or performs a CloudFormation cancel-update operation?
There is a load balancer that listens on port 80, and I also added a health check to the service:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/status"]
  interval: 1m
  timeout: 10s
  retries: 2
  start_period: 40s
But I cannot make it work: tasks keep exiting and the CloudFormation deployment keeps going.
There is no direct way to do this, but you can consider this approach:
Create a Wait Condition and a WaitCondition Handle Resource.
Calibrate how long it usually takes for the Task / Container to start and set the timeout accordingly.
Configure the application to post a success signal to the endpoint URL on successful setup.
Ensure that the Service and the waitcondition handle start updating / creating in parallel.
If the time exceeds the timeout period, the wait condition fails and the stack rolls back.
Thing to consider: on every update operation the WaitConditionHandle and WaitCondition resources need to be re-created. An easy way to do that is to modify the logical IDs of the resources: a parameters/template hash calculator can add the hash as a suffix to the wait-condition resource names, so that whenever the parameters or template change, the wait-condition resources are recreated automatically.
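A minimal sketch of the two wait resources, assuming illustrative logical IDs and a 300-second timeout:

Resources:
  # suffix these logical IDs with a template/parameter hash so they are
  # re-created on every change, as described above
  DeployWaitHandle:
    Type: AWS::CloudFormation::WaitConditionHandle
  DeployWaitCondition:
    Type: AWS::CloudFormation::WaitCondition
    Properties:
      Handle: !Ref DeployWaitHandle
      Timeout: "300"   # seconds; calibrate to the usual task startup time
      Count: 1

On successful startup the application (or a cfn-signal call) then sends the success JSON to the pre-signed URL referenced by DeployWaitHandle.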
I am trying to set up a zero-downtime deployment using docker stack deploy on a single-node Docker Swarm (localhost environment).
After building the image demo:latest, the first deployment using the command docker stack deploy --compose-file docker-compose.yml demo works: I can see 4 replicas running and can access the nginx default home page on port 8080 on my local machine. Now, after updating index.html, building the image with the same name and tag, and running the docker stack deploy command again, I get the error below and the changes are not reflected.
Deleting the deployment and recreating it works, but I am trying to see how updates can be rolled out without downtime. Please help here.
Error
Updating service demo_demo (id: wh5jcgirsdw27k0v1u5wla0x8)
image demo:latest could not be accessed on a registry to record
its digest. Each node will access demo:latest independently,
possibly leading to different nodes running different
versions of the image.
Dockerfile
FROM nginx:1.19-alpine
ADD index.html /usr/share/nginx/html/
docker-compose.yml
version: "3.7"
services:
demo:
image: demo:latest
ports:
- "8080:80"
deploy:
replicas: 4
update_config:
parallelism: 2
order: start-first
failure_action: rollback
delay: 10s
rollback_config:
parallelism: 0
order: stop-first
TLDR: push your image to a registry after you build it
Docker Swarm doesn't really work without a public or private Docker registry. Basically, all the nodes need to get their images from the same place, and the registry is the mechanism by which that information is shared. There are other ways to get images loaded on each node in the swarm, but they involve executing the same commands on every node, one at a time, to load in the image, which isn't great.
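For a single-node swarm, a throwaway local registry is usually enough. A sketch, with the registry service name and port as assumptions:

# run a registry as a swarm service
docker service create --name registry --publish published=5000,target=5000 registry:2

# push the image under a registry-qualified name
docker build -t 127.0.0.1:5000/demo:latest .
docker push 127.0.0.1:5000/demo:latest

# reference image: 127.0.0.1:5000/demo:latest in docker-compose.yml, then
docker stack deploy --compose-file docker-compose.yml demo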
Alternatively, you could use Docker configs for your configuration data and not rebuild the image every time. That works passably well without a registry, and you can swap out the config data with little to no downtime:
Rotate Configs
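A hedged sketch of that configs approach applied to the nginx example above (the config names are illustrative). Because swarm configs are immutable, rotating means adding a new config (e.g. index_html_v2), pointing the service at it, and redeploying the stack:

version: "3.7"
services:
  demo:
    image: nginx:1.19-alpine
    ports:
      - "8080:80"
    configs:
      # mount the page content instead of baking it into the image
      - source: index_html_v1
        target: /usr/share/nginx/html/index.html
configs:
  index_html_v1:
    file: ./index.html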
TL;DR: I have two almost identical services in my compose file except for the name of the service and the published ports. When deploying with docker stack deploy..., why does the first service fail with a no such image error, while the second service using the same image runs perfectly fine?
Full: I have a docker-compose file with two Apache Tomcat services pulling the same image from my private GitLab registry. The only difference between the two services in my docker-compose.yml is the name of the service (*_dev vs. *_prod) and the published ports. I deploy this docker-compose file on my swarm using GitLab CI with a gitlab-ci.yml. For the deployment of my docker-compose in this gitlab-ci.yml I use two commands:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
...
(I use a docker pull [image] command to have the image on the right node, since my --with-registry-auth is not working properly, but this is not my problem currently).
Now the strange thing is that for the first service I get a No such image: error and the service is stopped, while the second service seems to run perfectly fine. Both services are on the same worker node. This is what I get from docker service ps:
:~$ docker service ps webapp1_tomcat_dev
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx1 webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
xxx2 \_ webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
:~$ docker service ps webapp1_tomcat_prod
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx3 webapp1_tomcat_prod.1 url/repo:tag worker1 node Running Running 13 minutes ago
I have used the --no-trunc option to see that the IMAGE used by *_prod and *_dev is identical.
The restart_policy in my docker-compose explains why the first service fails three minutes after the second service started. Here is my docker-compose:
version: '3.2'
services:
  tomcat_dev:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8282:8080"
  tomcat_prod:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8283:8080"
Why does the first service fail with a no such image error? Is it for example just not possible to have two services, that use the same image, work on the same worker node?
(I cannot simply scale up one service, since I need to upload files to the webapp that differ between production and development - e.g. dev vs. prod licenses - and hence I need two distinct services.)
EDIT: Second service works because it is created first:
$ docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
Creating service webapp1_tomcat_prod
Creating service webapp1_tomcat_dev
I found a workaround by separating my services over two different docker compose files (docker-compose-prod.yml and docker-compose-dev.yml) and performing the docker stack deploy command in my gitlab-ci.yml twice:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-prod.yml webapp1 --with-registry-auth
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-dev.yml webapp1 --with-registry-auth
...
My gut says my restart_policy in my docker-compose was too strict as well (it had max_attempts: 1), and maybe because of this the image couldn't be used in time / within one restart (as suggested by #Ludo21South). Hence I allowed more attempts, but since I had already separated the services over two files (which already worked) I have not checked whether this hypothesis is true.