What happens underneath the docker service update command? - docker

When we run the following command, which of the following events does not occur?
$ docker service update --replicas=5 --detach=true nginx1
a) The state of the service is updated to 5 replicas, which is stored in the swarm's internal storage. --I believe this is True
b) Docker Swarm recognizes that the number of replicas that is scheduled now does not match the declared state of 5. --Not sure, ultimately it will check but how it happens on the timeline: immediately?/periodically?
c) This command checks aggregated logs on the updated replicas. --Don't think this is true but not sure.
d) Docker Swarm schedules 5 more tasks (containers) in an attempt to meet the declared state for the service. --I believe this is True

The event that does not occur is (c): docker service update does not check aggregated logs on the replicas. The other three all happen: the new desired state of 5 replicas is written to the swarm's internal store (a), Swarm recognizes that the number of scheduled replicas no longer matches the declared state of 5 (b), and the scheduler creates additional tasks (containers) to converge on that state (d).
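A quick way to watch that reconciliation happen (a sketch; it assumes a service named nginx1 already exists):
$ docker service update --replicas=5 --detach=true nginx1
$ docker service inspect --format '{{.Spec.Mode.Replicated.Replicas}}' nginx1   # desired state now shows 5
$ docker service ps nginx1                                                      # tasks being scheduled to reach 5 replicas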

Related

Best Docker Stack equivalent for docker-compose "--exit-code-from" option?

I have a docker-compose file with 4 services. Services 1, 2, and 3 are job executors. Service 4 is the job scheduler. After the scheduler has finished running all its jobs on the executors, it returns 0 and terminates. However, the executor services still need to be shut down. With standard docker-compose this is easy: just use the "--exit-code-from" option:
Terminate docker compose when test container finishes
However when a version 3.0+ compose file is deployed via Docker Stack, I see no equivalent way to wait for 1 service to complete and then terminate all remaining services. https://docs.docker.com/engine/reference/commandline/stack/
A few possible approaches are discussed here -
https://github.com/moby/moby/issues/30942
The solution from miltoncs seems reasonable at first:
https://github.com/moby/moby/issues/30942#issuecomment-540699206
The suggested approach is to query docker stack ps every second to get the service status, then remove all services with docker stack rm when done. I'm not sure how the constant stack ps traffic would scale with thousands of jobs running in a cluster; could it end up bogging down the ingress network?
Does anyone have experience / success with this or similar solutions?
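For reference, a minimal version of that polling approach looks roughly like this. It is only a sketch under assumptions: the stack name myjobs, the scheduler service name myjobs_scheduler, and a restart_policy of condition: none on the scheduler so it is not restarted after exiting 0.
#!/bin/sh
# Poll the scheduler service until one of its tasks reports Complete,
# then remove the whole stack.
STACK=myjobs
SCHEDULER=${STACK}_scheduler

while ! docker service ps --format '{{.CurrentState}}' "$SCHEDULER" | grep -q '^Complete'; do
  sleep 5   # lengthen this to reduce polling load on the manager
done

docker stack rm "$STACK"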

While draining a node during a Swarm update, how do you avoid a newly active node to receive all the rescheduled containers?

During an update (in place in this case) of our Swarm, we have to drain a node, update it, make it active again, drain the next node, and so on.
It works perfectly for the first node, as the load of the containers to reschedule is spread quite fairly across all the remaining nodes, but things get difficult when draining the second node: all the containers to reschedule go to the recently updated node, which has (almost) no tasks running.
The startup load is huge compared to normal operation; the node cannot keep up, and some containers may fail to start due to healthcheck constraints and the max_attempts restart policy.
Do you know of a way to reschedule while avoiding that spike and its unwanted results (priority, wait time, update strategy...)?
Cheers,
Thomas
This will need to be a manual process. You can pause scheduling on the node that is about to go down, and then gradually stop the containers on that node so their tasks migrate slowly to other nodes in the swarm cluster. E.g.
# on manager
docker node update --availability=pause node-to-stop

# on paused node
docker container ls --filter label=com.docker.swarm.task -q \
  | while read cid; do
      echo "stopping $cid"
      docker stop "$cid"
      echo "pausing"
      sleep 60
    done
Adjust the sleep command as appropriate for your environment.
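Once the node has been updated, put it back into the pool so it starts accepting new tasks again (node name reused from the example above):
docker node update --availability=active node-to-stop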

Docker Swarm - Health check on a swarm node

Is it possible for a swarm node to monitor itself, and set itself to drain under certain conditions? Much like HEALTHCHECK from Dockerfile, I'd like to specify the script that determines the node's health condition.
[Edit] For instance, this just started occurring today:
$ sudo docker run --rm hello-world
docker: Error response from daemon: failed to update the store state of sandbox:
failed to update store for object type *libnetwork.sbState: invalid character 'H'
looking for beginning of value.
I know how to fix this particular problem, but the node still reported Ready and Active, and was accepting tasks it could not run. A health check would have been able to determine the node could not run containers, and disable the node.
How else can you achieve a self-healing infrastructure?
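There is no built-in node-level health check in Swarm, so one workaround is a periodic canary script, for example run from cron on each node. This is only a sketch under assumptions: a reachable manager host named manager-vm, ssh access from the node to that manager (only managers can change node availability), and the node name matching $(hostname).
#!/bin/sh
# Canary check: if the local engine cannot run a trivial container,
# drain this node via a manager so no new tasks are scheduled on it.
MANAGER=manager-vm   # assumption: a manager reachable over ssh with docker access

if ! docker run --rm hello-world >/dev/null 2>&1; then
  ssh "$MANAGER" docker node update --availability drain "$(hostname)"
fi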

Is it possible to remove a task in docker (docker swarm)?

Suppose I had 3 replicated images:
docker service create --name redis-replica --replicas=3 redis:3.0.6
Consider that there are two nodes connected (including the manager), and running the command docker service ps redis-replica yields this:
ID             NAME             IMAGE        NODE        DESIRED STATE  CURRENT STATE          ERROR  PORTS
x1uumhz9or71   redis-replica.1  redis:3.0.6  worker-vm   Running        Running 9 minutes ago
j4xk9inr2mms   redis-replica.2  redis:3.0.6  manager-vm  Running        Running 8 minutes ago
ud4dxbxzjsx4   redis-replica.3  redis:3.0.6  worker-vm   Running        Running 9 minutes ago
As you can see all tasks are running.
I have a scenario I want to fix:
Suppose I want to remove a redis container on the worker-vm. Currently there are two, but I want to make it one.
I could do this by going into the worker-vm and removing the container with docker rm. This poses a problem, however:
Once Docker Swarm sees that one of the tasks has gone down, it will immediately start another redis container on another node (manager or worker). As a result I will always have 3 tasks.
This is not what I want. Suppose I want to force Docker not to start another container when one is removed.
Is this currently possible?
In Swarm mode, the orchestrator schedules tasks for you. The task is the unit of scheduling, and each task runs exactly one container.
What this means in practice is that you are not supposed to manage tasks manually; Swarm takes care of this for you.
You describe the desired state of your service instead. If you have placement preferences, you can use --placement-pref in docker service commands, and you can specify the number of replicas, etc. E.g.
docker service create \
  --replicas 9 \
  --name redis_2 \
  --placement-pref 'spread=node.labels.datacenter' \
  redis:3.0.6
You can limit the set of nodes where a task can be placed using placement constraints (https://docs.docker.com/engine/reference/commandline/service_create/#specify-service-constraints---constraint). Here is an example from the Docker docs:
$ docker service create \
  --name redis_2 \
  --constraint 'node.labels.type == queue' \
  redis:3.0.6
I think that's the closest you can get to controlling where tasks run.
Once you have described your placement constraints/preferences, Swarm will make sure that the actual state of your service stays in line with the desired state you declared in the create command. You are not supposed to manage any further details beyond that.
If you change the actual state, by killing a container for example, Swarm will re-align the service with your desired state again. This is what happened when you removed your container.
To change the desired state, use the docker service update command.
The key point is that tasks are not the same thing as containers. Each task does run exactly one container, but a task is more like a scheduling slot where the scheduler places a container.
The Swarm scheduler manages tasks (not you), which is why there is no docker task command. You drive the mechanism by describing the desired state.
To answer your original question: yes, it is possible to remove a task, and you do it by updating the desired state of your service.
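For the scenario in the question, that means changing the desired state instead of removing containers by hand. A couple of hedged examples (service and node names taken from the question; the zone label is an assumption):
# drop the service to 2 tasks overall
docker service update --replicas 2 redis-replica

# or keep 3 replicas and ask the scheduler to spread them by a node label
docker node update --label-add zone=a worker-vm
docker node update --label-add zone=b manager-vm
docker service update --placement-pref-add 'spread=node.labels.zone' redis-replica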

How to fix this issue "no suitable node (scheduling constraints not satisfied on 1 node)" in docker swarm while deploying registry?

I have a Docker swarm on a CentOS virtual machine with 2 cores and 4 GB of RAM.
When I deploy a private Docker registry (registry 2.6.4) in the swarm, the service status stays pending forever.
I used
docker service ps <<registry_name>>
When I inspect the task using docker inspect <<task_id>>, the message says:
"no suitable node (scheduling constraints not satisfied on 1 node)".
I tried restarting the service and redeploying.
How to fix this?
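As a side note, the full, untruncated error for each task can also be read straight from docker service ps (the service name registry below is a placeholder):
docker service ps --no-trunc --format '{{.Name}}: {{.Error}}' registry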
I often run into this problem when there is a mismatch between the node labels referenced in the compose file and the ones defined on the actual nodes, either because I set a wrong label (e.g. a typo) or simply forgot to label the nodes at all.
To label nodes:
1) For each target node do:
docker-machine ssh <manager_node_name> 'docker node update --label-add <label_name>=<label_value> <target_node_name>'
2) Make sure they match the ones defined in the compose file.
3) Restart the Docker service on the manager node.
for example:
compose file:
dummycontainer:
  image: group/dummyimage
  deploy:
    mode: replicated
    replicas: 1
    placement:
      constraints: [node.labels.dummy_label == dummy]
    restart_policy:
      condition: on-failure
Assuming that I want to deploy this replica on a node called dummy_node:
docker-machine ssh manager_node 'docker node update --label-add dummy_label=dummy dummy_node'
and then restart Docker on the manager node.
Finally, when you deploy, you should see dummycontainer running on dummy_node, assuming the label was set correctly in both places. Otherwise you can expect the error you are getting.
Best regards
I had a similar problem while deploying a service. Check the availability of your nodes with docker node ls; if any node is set to drain, set it back to active with docker node update --availability active <node-id>,
which will allow Swarm to run containers for that service on the node.
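To confirm which side of the mismatch you are on, you can compare a node's availability and labels against the service's declared constraints with standard inspect templates (registry is a placeholder service name):
docker node inspect --format '{{ .Spec.Availability }} {{ .Spec.Labels }}' <node_name>
docker service inspect --format '{{ .Spec.TaskTemplate.Placement.Constraints }}' registry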
