Docker swarm replicas on different nodes

I'm using a docker-compose file version 3 with its deploy key to run a swarm (Docker version 1.13), and I would like to replicate a service in order to make it resilient against single-node failure.
However, when I'm adding a deploy section like this:
deploy:
  replicas: 2
in my four-node cluster I sometimes end up with both replicas scheduled on the same node. What I'm missing is a constraint that schedules the two instances on different nodes.
I know that there's a global mode I could use but that would run an instance on every node, i.e. four instances in my case instead of just two.
Is there a simple way to specify this constraint in a generic way, without having to resort to a combination of global mode and node labels to keep additional instances away from certain nodes?
Edit: After trying it again, I find the containers to be scheduled on different nodes this time around. I'm beginning to wonder if I may have had a 'node.hostname == X' constraint in place.
Edit 2: After another service update, and without any placement constraints, the service is again being scheduled on the same node (as displayed by ManoMarks' Visualizer):

Extending VonC's answer: since your example uses a compose file rather than the CLI, you can add max_replicas_per_node: 1, as in
version: '3.8'
...
  yourservice:
    deploy:
      replicas: 2
      placement:
        max_replicas_per_node: 1
The compose schema version 3.8 is key here, as max_replicas_per_node is not supported in schema versions below 3.8.
This was added in https://github.com/docker/cli/pull/1410
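If you are stuck below schema 3.8, a common workaround (not part of this answer originally) is a spread placement preference over a node label that you assign a unique value per node. Here is a sketch, where the label name host and its values are assumptions:
# Label each node first, e.g.:
#   docker node update --label-add host=node1 node1
deploy:
  replicas: 2
  placement:
    preferences:
      - spread: node.labels.host
Note that spread is a preference, not a hard constraint: tasks are distributed evenly over the label's values, but if a node goes down both replicas may still land together.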

docker/cli PR 1612 seems to resolve issue 26259, and has been released in Docker 19.03:
Added new --replicas-max-per-node switch to docker service
How to verify it
Create two services and specify --replicas-max-per-node for one of them:
docker service create --detach=true --name web1 --replicas 2 nginx
docker service create --detach=true --name web2 --replicas 2 --replicas-max-per-node 1 nginx
See the difference in the command outputs:
$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
0inbv7q148nn web1 replicated 2/2 nginx:latest
9kry59rk4ecr web2 replicated 1/2 (max 1 per node) nginx:latest
$ docker service ps --no-trunc web2
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
bf90bhy72o2ry2pj50xh24cfp web2.1 nginx:latest@sha256:b543f6d0983fbc25b9874e22f4fe257a567111da96fd1d8f1b44315f1236398c limint Running Running 34 seconds ago
xedop9dwtilok0r56w4g7h5jm web2.2 nginx:latest@sha256:b543f6d0983fbc25b9874e22f4fe257a567111da96fd1d8f1b44315f1236398c Running Pending 35 seconds ago "no suitable node (max replicas per node limit exceed)"
The error message would be:
no suitable node (max replicas per node limit exceed)
Examples from Sebastiaan van Stijn:
Create a service with max 2 replicas:
docker service create --replicas=2 --replicas-max-per-node=2 --name test nginx:alpine
docker service inspect --format '{{.Spec.TaskTemplate.Placement.MaxReplicas}}' test
2
Update the service (max replicas should keep its value):
docker service update --replicas=1 test
docker service inspect --format '{{.Spec.TaskTemplate.Placement.MaxReplicas}}' test
2
Update the max replicas to 1:
docker service update --replicas-max-per-node=1 test
docker service inspect --format '{{.Spec.TaskTemplate.Placement.MaxReplicas}}' test
1
And reset to 0:
docker service update --replicas-max-per-node=0 test
docker service inspect --format '{{.Spec.TaskTemplate.Placement.MaxReplicas}}' test
0

What version of Docker are you using? According to this post, this kind of problem has been rectified in 1.13; do take a look: https://github.com/docker/docker/issues/26259#issuecomment-260899732
Hope that answers your question.
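A related tip, not from the linked issue: even after the 1.13 fix, Swarm does not proactively rebalance when a failed node comes back, so replicas can stay co-located. Forcing a no-op update makes the scheduler re-place the tasks. A minimal sketch, assuming the service is named web:
docker service update --force web
docker service ps web   # check the NODE column to confirm the spread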

Related

Updating a docker container from image; leaves old images on server

My process for updating a docker image to production (a docker swarm) is as follows:
On dev environment:
docker-compose build
docker push myrepo/name
Then on the prod server, which is a docker swarm:
docker pull myrepo/name
docker service update --image myrepo/name --with-registry-auth containername
This works perfectly; the swarm is updated with the latest image.
However, it always leaves the old image on the live servers and I'm left with something like this:
docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
myrepo/name latest abcdef 14 minutes ago 1.15GB
myrepo/name <none> bcdefg 4 days ago 1.22GB
myrepo/name <none> cdefgh 6 days ago 1.22GB
Which, over time, results in a heap of disk space being used unnecessarily.
I've read that docker system prune is not safe to run on production, especially in a swarm.
So I am having to regularly and manually remove old images, e.g.
docker image rm bcdefg cdefgh
Am I missing a step in my update process, or is it 'normal' that old images are left over to be manually removed?
Thanks in advance
Since you are using Docker Swarm, and probably a multi-node setup, you could deploy a global service which does the cleanup for you. We are using Bret Fisher's approach for this:
version: '3.9'
services:
  image-prune:
    image: internal-image-registry.org/proxy-cache/library/docker:20.10
    command: sh -c "while true; do docker image prune -af --filter \"until=4h\"; sleep 14400; done"
    networks:
      - bridge
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global
      labels:
        - "env=devops"
        - "application=cleanup-image-prune"
networks:
  bridge:
    external: true
    name: bridge
When new hosts are added, the service is deployed onto them automatically with our own base Docker image, and the cleanup then runs for us.
We have not yet found the time to look into the newer Docker service types that are scheduled on their own. It would probably be wise to move the cleanup to the replicated jobs provided by Docker instead of an infinite loop in a script; it just works for us, so we did not make swapping over a high enough priority. More info on the replicated jobs can be found in the Docker documentation.
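For reference, a job-based variant could look like the following sketch (an illustration only, not what we run; docker service create gained --mode global-job in Docker 20.10, and you would re-trigger the job externally, e.g. from cron, instead of looping):
docker service create \
  --name image-prune-job \
  --mode global-job \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  docker:20.10 \
  docker image prune -af --filter "until=4h"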

Docker compose, two services using same image: first fails with "no such image", second runs normally

TL;DR: I have two almost identical services in my compose file except for the name of the service and the published ports. When deploying with docker stack deploy..., why does the first service fail with a no such image error, while the second service using the same image runs perfectly fine?
Full: I have a docker-compose file with two Apache Tomcat services pulling the same image from my private GitLab registry. The only difference between the two services in my docker-compose.yml is the name of the service (*_dev vs. *_prod) and the published ports. I deploy this docker-compose file on my swarm using the GitLab CI with the gitlab-ci.yml. For the deployment of my docker-compose in this gitlab-ci.yml I use two commands:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
...
(I use a docker pull [image] command to get the image onto the right node, since my --with-registry-auth is not working properly, but this is not my problem currently.)
Now the strange thing is that for the first service I obtain a No such image: error and the service is stopped, while for the second service everything seems to run perfectly fine. Both services are on the same worker node. This is what I get if I run docker service ps:
:~$ docker service ps webapp1_tomcat_dev
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx1 webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
xxx2 \_ webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
:~$ docker service ps webapp1_tomcat_prod
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx3 webapp1_tomcat_prod.1 url/repo:tag worker1 node Running Running 13 minutes ago
I have used the --no-trunc option to see that the IMAGE used by *_prod and *_dev is identical.
The restart_policy in my docker-compose explains why the first service fails three minutes after the second service started. Here is my docker-compose:
version: '3.2'
services:
  tomcat_dev:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8282:8080"
  tomcat_prod:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8283:8080"
Why does the first service fail with a no such image error? Is it, for example, just not possible to have two services that use the same image running on the same worker node?
(I cannot simply scale up one service, since I need to upload files to the webapp which differ between production and development, e.g. dev vs. prod licenses, and hence I need two distinct services.)
EDIT: The second service works because it is created first:
$ docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
Creating service webapp1_tomcat_prod
Creating service webapp1_tomcat_dev
I found a workaround by separating my services over two different docker-compose files (docker-compose-prod.yml and docker-compose-dev.yml) and performing the docker stack deploy command in my gitlab-ci.yml twice:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-prod.yml webapp1 --with-registry-auth
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-dev.yml webapp1 --with-registry-auth
...
My gut says my restart_policy in my docker-compose was too strict as well (it had max_attempts: 1), and maybe because of this the image couldn't be used in time / within one restart (as suggested by @Ludo21South). Hence I allowed more attempts, but since I had already separated the services over two files (which already worked), I have not checked whether this hypothesis is true.
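For illustration, a less strict policy along the lines of that hypothesis might look like this (a sketch with assumed values, not a configuration I have verified):
deploy:
  restart_policy:
    condition: on-failure
    delay: 10s        # retry sooner than the original 60s
    max_attempts: 5   # allow several attempts instead of 1
    window: 120s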

How to determine the number of replicas of scaled service

I have a docker-compose file that exposes 2 services, a master service and a slave service. I want to be able to scale the slave service to some number of instances using
docker-compose up --scale slave=N
However, one of the options I must specify in the command run by the master service is the number of slave instances to expect. E.g. if I scale slave=10, I need to set --num-slaves=10 in the command on the master service.
Is there a way to determine the number of instances of a given service, either from the docker-compose file itself or from a customized entrypoint shell script?
The problem I'm facing is that since there is no way I've yet found to specify the number of scaled instances from within the docker-compose file format itself, I'm relying on the person running the command to enter the scale factor consistently and to have that value align with the value I need to tell the master node to expect. And trusting users to do the right thing is a recipe for disaster. If I could continue to let the user specify the scale value on the command line, I need a way to determine what that value is at runtime.
scale is no longer available in the compose file from version 3 onwards, but you may use replicas:
version: "3.7"
services:
redis:
image: redis:latest
deploy:
replicas: 1
and run it using:
docker-compose --compatibility up -d
docker-compose 1.20.0 introduces a new --compatibility flag designed
to help developers transition to version 3 more easily. When enabled,
docker-compose reads the deploy section of each service’s definition
and attempts to translate it into the equivalent version 2 parameter.
Currently, the following deploy keys are translated:
resources limits and memory reservations
replicas
restart_policy condition and max_attempts
but:
Do not use this in production!
We recommend against using --compatibility mode in production. Because
the resulting configuration is only an approximate using non-Swarm
mode properties, it may produce unexpected results.
see this
PS:
Docker container names must be unique; you cannot scale a service beyond 1 container if you have specified a custom name. Attempting to do so results in an error.
Unfortunately there is no way to define replicas for docker-compose; it only works for Docker Swarm. The documentation specifies it (link):
Tip: Alternatively, in Compose file version 3.x, you can specify replicas under the deploy key as part of a service configuration for Swarm mode. The deploy key and its sub-options (including replicas) only works with the docker stack deploy command, not docker-compose up or docker-compose run.
So if you have the deploy section in the YAML but run it with docker-compose, it will not take any effect.
version: "3.3"
services:
alpine1:
image: alpine
container_name: alpine1
command: ["/bin/sleep", "10000"]
deploy:
replicas: 4
alpine2:
image: alpine
container_name: alpine2
command: ["/bin/sleep", "10000"]
deploy:
replicas: 2
So the only way to scale up in docker-compose is by running the scale command manually:
docker-compose scale alpine1=3
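For contrast (not part of the original answer): under Swarm mode the replicas in the same file do take effect via docker stack deploy. A sketch, with the stack name test assumed; note that stack deploy ignores container_name with a warning:
docker stack deploy -c docker-compose.yml test
docker service ls    # alpine1 should report 4/4 replicas, alpine2 2/2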
Note: I had a job in which they loved docker-compose, so we had bash scripts to perform operations such as the ones you describe. For example, we would have something like ./controller-app.sh scale test_service=10 and it would run docker-compose scale test_service=10.
UPDATE
To check the number of replicas you can mount the Docker socket into your container, then run docker ps --format '{{ .Names }}' | grep $YOUR_CONTAINER_NAME.
Here is how you would mount the socket.
docker run -v /var/run/docker.sock:/var/run/docker.sock -it alpine sh
Install docker
apk update
apk add docker
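Putting the pieces together, a custom entrypoint could derive the scale factor at runtime and hand it to the master process. This is a sketch under assumptions: the compose project is named myproject, the service is called slave, and the master binary accepts --num-slaves:
#!/bin/sh
# entrypoint.sh (hypothetical): count running slave containers through the
# mounted Docker socket and pass the count to the master command.
# Compose names containers like myproject_slave_1 (newer versions use '-'
# as the separator, hence the [_-] in the pattern).
NUM_SLAVES=$(docker ps --format '{{.Names}}' | grep -c 'myproject[_-]slave')
exec master --num-slaves="$NUM_SLAVES"
In practice you would also have to make sure the slaves are up before the master counts them, for example by polling until the count stops changing.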

Docker stack deploy doesn't run container without mounted Volumes folder being present on Machine

I have the below cassandra-compose.yml file which I am trying to deploy on a docker swarm with the below command.
docker stack deploy --compose-file=cassandra-compose.yml cassandra-service
Issue:
The service is getting created but no replicas are running. When I inspected the issue, I found that it is because the mounted folder, i.e. ~/user/cassandraBackup, is not present, and that is why the container isn't started.
I tried to run it using docker-compose and it got executed successfully.
Can somebody tell me how to run it using stack deploy?
cassandra-compose.yml:
version: '3.1'
services:
  multinode:
    image: cassandra:3.9
    deploy:
      replicas: 2
    volumes:
      - ~/user/cassandraBackup/:/var/lib/cassandra/data
    ports:
      - 7000:7000
      - 7001:7001
      - 7199:7199
      - 9042:9042
      - 9160:9160
Try using this command:
docker stack deploy --compose-file=cassandra-compose.yml -f cassandra-compose.yml cassandra-service
It will get your issue resolved.
From this answer https://stackoverflow.com/a/48972713/2138959:
Docker Swarm BIND MOUNTS
If you bind mount a host path into your service’s containers, the path must exist on every swarm node. The Docker swarm mode scheduler can schedule containers on any machine that meets resource availability requirements and satisfies all constraints and placement preferences you specify.
So you have to make sure the path is available on every node in the cluster that the task can be scheduled to. But even with this approach, a container can be scheduled to a node with no data or outdated data, so it is better to pin services to concrete nodes.
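A sketch of such pinning with labels, applied to the compose file from the question (the label name cassandra and its value are assumptions):
# Label the nodes that hold the backup data first:
#   docker node update --label-add cassandra=true <node-name>
version: '3.1'
services:
  multinode:
    image: cassandra:3.9
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.labels.cassandra == true
    volumes:
      - ~/user/cassandraBackup/:/var/lib/cassandra/data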

Number of replicas in swarm doesn't start in worker node (1/4)

I started a Flask API service on a Docker Swarm cluster with 1 master and 3 worker nodes. I deployed the task using the following docker-compose file:
version: '3'
services:
  xgboost-model-api:
    image: xgboost-model-api
    ports:
      - "5000:5000"
    deploy:
      mode: global
    networks:
      - xgboost-net
networks:
  xgboost-net:
I deployed the stack using the following command:
docker stack deploy --compose-file docker-compose.yml xgboost-swarm
However, the task was started only on my master node and not on any worker node.
$ docker service ls
ID NAME MODE REPLICAS IMAGE
pgd8cktr4foz viz replicated 1/1 dockersamples/visualizer
twrpr4av4c7f xgboost-swarm_xgboost-model-api global 1/4 xgboost-model-api
xxrfn1w7eqw6 dockercloud-server-proxy global 1/1 dockercloud/server-proxy
The Dockerfile being used is here. Any thoughts on why this behavior occurs would be appreciated.
As stated in this thread (duplicate?):
If you are using a private registry, it's important to share the login and credentials with the worker nodes by using
docker stack deploy --with-registry-auth
---- UPDATE
From your compose file it doesn't look like you are using a private registry. Generally speaking, if containers can't start successfully on the workers they will end up on the manager.
Some possible reasons for this are:
Can't access the private registry (fix with --with-registry-auth)
The application requires some change on the host to run (like Elasticsearch requiring vm.max_map_count=262144; see the sketch after this list)
The healthcheck fails on the other node because of a poorly written healthcheck
Network setting issues preventing the image from being pulled
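For the host-change case above, the fix is applied on each worker node itself rather than in the service definition; a sketch using the Elasticsearch example (the value comes from its documentation):
# Run on every worker node that should host the service:
sudo sysctl -w vm.max_map_count=262144
# Persist the setting across reboots:
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf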
Try removing your stack and running it again. Then do docker service ps --no-trunc {serviceName}; this might show you tasks that should run the service on another node and why they failed.
Check out this SO thread for more troubleshooting tips.
