We are running a docker swarm cluster with 3 managers and 5 workers. Twice now we have experienced an error in the cluster where every service is restarted automatically.
This happens when the heartbeat fails. The service 4adb11869318 on the manager node and the service e7b284330420 on the worker node hit this issue very frequently.
Manager Node Logs: https://pastebin.com/YdriawA6
Worker Node Logs: https://pastebin.com/AvGCstfg
I don't know how to prevent the restart of the Docker services. Do you have any suggestions?
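One setting that may be worth trying (a sketch, assuming the restarts are triggered by missed heartbeats rather than actual node failures) is to lengthen the dispatcher heartbeat period, which defaults to 5 seconds, so brief hiccups are less likely to mark nodes as down:
docker swarm update --dispatcher-heartbeat 20s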
Related
I currently have a 3-node swarm mode cluster: 1 manager and 2 workers. I have created a service with 20 replicas. When running docker service ps <service>, I see that all replicas have been deployed evenly across the 3 nodes. I believe the default swarm placement strategy is spread rather than binpack. That's all good.
The problem is when I restart one of the workers after some OS maintenance. The node takes a while to reboot, but during that time I do not want the services to be rescheduled onto the other 2 nodes, because I know the restarted node will soon come back online. Is there a way to delay swarm from rescheduling replicas after a node reboot or failure? I want to give it more time before confirming the node has really failed, maybe 5 minutes or so.
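As a side note, a quick way to count running tasks per node and confirm the spread (the service name web is a placeholder):
docker service ps web -f desired-state=running --format '{{.Node}}' | sort | uniq -c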
Docker version 20.10.7
In our docker swarm environment, there is 1 manager node and 2 worker nodes.
We also installed Portainer and Swarmpit, along with the Portainer and Swarmpit agents, on all nodes.
Yesterday, one of the virtual servers on which a worker node is installed rebooted unexpectedly.
When we checked the Docker service, it was stopped, so we restarted it with this command:
systemctl restart docker
Then all the containers seemed to work fine on the worker node. But when we checked the containers in Portainer, which runs on a master node, they appeared stopped. Swarmpit reported that the worker node was active and ready.
What could be the problem?
[Screenshots: worker node running containers, master node running containers in Portainer, Swarmpit node status]
We found out that the firewall caused the error.
After rebooting CentOS, the firewall is enabled automatically, and it conflicted with the Docker engine, so we disabled the firewall with this command:
systemctl disable firewalld
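A less drastic alternative (a sketch, assuming a stock firewalld setup) would have been to open the ports swarm needs and reload, rather than disabling the firewall entirely:
firewall-cmd --permanent --add-port=2377/tcp
firewall-cmd --permanent --add-port=7946/tcp
firewall-cmd --permanent --add-port=7946/udp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload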
I need help distributing already-running containers onto a newly added docker swarm worker node.
I am running docker swarm mode on Docker version 18.09.5. I am using AWS autoscaling to create 3 masters and 4 workers. For high availability, if one of the workers goes down, all the containers from that worker node are rebalanced onto the other workers. When autoscaling brings a new node up, I add that worker node to the current docker swarm setup using some automation. But docker swarm does not balance any containers onto that new worker node. Even when I redeploy the docker stack, swarm still does not rebalance the containers. Is it because of a different node ID? How can I customize this? I am deploying the stack with a compose file:
docker stack deploy -c dockerstack.yml NAME
The only (current) way to force re-balancing is to force-update the services. See https://docs.docker.com/engine/swarm/admin_guide/#force-the-swarm-to-rebalance for more information.
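For example, to force a rolling restart of one service, or of every service in the swarm (the service name my-stack_web is a placeholder):
docker service update --force my-stack_web
for svc in $(docker service ls -q); do docker service update --force "$svc"; done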
I have a test environment with 2 machines, 1 manager and 1 worker, in swarm mode. I deploy a stack of 10 services to the worker machine, with 1 container for each service. The services start, and after some time some instances die; the manager then puts them back into pending state, and this keeps happening. The Spring Boot services themselves have no problems (I checked the logs). It seems to me that the worker is not able to handle the 10 instances, but I am not sure.
Are there any docker commands to find out what's going on here? For example, some command that might tell me a container was killed because it was out of memory?
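A few commands that may help here (a sketch; the service and container names are placeholders, and the inspect command has to be run on the node where the container ran): docker service ps with --no-trunc shows the task error for failed replicas, and inspecting a dead container shows whether it was OOM-killed.
docker service ps --no-trunc <service>
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>
docker events --filter event=oom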
In docker swarm mode I can run docker node ls to list swarm nodes, but it does not work on worker nodes. I need a similar function for workers. I know worker nodes do not have a strongly consistent view of the cluster, but there should be a way to get the current leader or a reachable manager.
So is there a way to get the current leader/manager from a worker node in docker swarm mode 1.12.1?
You can get manager addresses by running docker info from a worker.
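For example, on a worker (the addresses listed will be your managers' IPs on port 2377):
docker info | grep -A 3 'Manager Addresses'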
The docs and the error message from a worker node mention that you have to be on a manager node to execute swarm commands or view cluster state:
Error message from a worker node: "This node is not a swarm manager. Worker nodes can't be used to view or modify cluster state. Please run this command on a manager node or promote the current node to a manager."
After further thought:
One way you could crack this nut is to use an external key/value store like etcd or any other key/value store that Swarm supports and store the elected node there so that it can be queried by all the nodes. You can see examples of that in the Shipyard Docker management / UI project: http://shipyard-project.com/
Another simple way would be to run a redis service on the cluster and another service to announce the elected leader. This announcement service would have a constraint to only run on the manager node(s): --constraint node.role == manager
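A minimal sketch of such an announcement service (the service name and the image myorg/leader-announcer are hypothetical):
docker service create --name leader-announcer --replicas 1 --constraint 'node.role == manager' myorg/leader-announcer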