Swarm node Status down, but node should be Ready - docker

I am trying to run a service on a swarm composed of three Raspberry Pis.
I have one manager and two worker nodes.
The problem is that sometimes the status of the worker nodes is "Down" even though the nodes are switched on and connected to the network.
I just started using Docker so I might be doing something wrong, but everything seems to be correctly set.
How would you avoid that "Down" status?

It can depend on your exact version of Docker, but your issue was seen in this thread.
A possible workaround was to run docker ps, which seemed to help nodes join the swarm.

In my case, the Docker node had an invalid default route and DNS did not work. I was still able to SSH into the machine by IP address. I tested first:
ping google.com
Which did not work. Then I changed the default route:
route -n                        # inspect the current routing table
route add default gw 10.1.2.3   # add a working gateway
route del default gw 10.1.2.1   # remove the offending gateway
And finally changed the DNS server in /etc/resolv.conf.
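For reference, the change amounted to pointing resolv.conf at a DNS server that was actually reachable from the node. A minimal sketch (the nameserver address here is only a placeholder, not the one from my setup):
# /etc/resolv.conf
nameserver 8.8.8.8   # replace with a DNS server reachable from the node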
Then the node came up automatically.

I've had the same issue before. You can fix it by cleaning up /var/lib/docker/swarm/ on the problematic node and then reattaching it to the swarm.
1) on problem node
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm
2) on swarm manager
docker node rm <problem-node-name>
docker swarm join-token worker   # prints the join command to run on the worker
docker swarm join --token <token> <manager_ip>:2377
3) on problem node
sudo systemctl start docker
docker swarm join --token <token> <manager_ip>:2377
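To confirm the rejoin worked, you can list the nodes again from the manager; the rejoined node should show up as Ready:
docker node ls   # STATUS should be Ready for the rejoined node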

In my case, the (virtual) network devices had changed. I just adjusted the settings, ran docker swarm leave and docker swarm join on each of the affected nodes, and then removed the stale node entries from the manager (docker node rm ...). Everything worked without issues after that.

Related

Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

I am trying to set up Docker swarm with an overlay network. Some of my hosts are on AWS while others are laptops running Ubuntu (the same as on AWS). Every node has a static public IP. I have created an overlay network as:
docker network create --driver=overlay --attachable test-net
I have created the swarm on one of the AWS hosts, and every other node is able to join it.
However, when I run docker run -it --name alpine2 --network test-net alpine on any node not on AWS, I get the error: docker: Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded.
But if I run the same on any AWS host, then everything works fine. Is there anything more I need to do in terms of networking/ports if some nodes are on AWS while others are not?
I have opened the ports required for swarm networking on all machines.
EDIT: All the nodes are marked as "active" when listed on the manager node.
UPDATE: Solved this issue by opening the respective ports. It now works if all the nodes are Linux-based. But when I try to make a swarm with a Linux (Ubuntu) manager, macOS machines are not able to join the swarm.
Check if the node is in the drain state:
docker node inspect --format '{{ .Spec.Availability }}' <node>
If yes, then update the availability:
docker node update --availability active <node>
Here is the explanation:
Resolution
When a node is in the drain state, it is expected behavior that you should not be able to allocate swarm mode resources, such as multi-host overlay network IP addresses, to the node. However, swarm mode does not currently provide a messaging mechanism from the swarm leader, where IP address management occurs, back to the worker node that requested the IP address. So docker run fails with context deadline exceeded.
Internal engineering issue escalation/292 has been opened to provide a better error message in a future release of the Docker daemon.
source
Check that the ports below are open on both machines.
TCP port 2377
TCP and UDP port 7946
UDP port 4789
You may use ufw to allow the ports:
ufw allow 2377/tcp
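If you stay with ufw, the other swarm ports from the list above need the same treatment:
ufw allow 7946/tcp
ufw allow 7946/udp
ufw allow 4789/udp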
I had a similar issue and managed to fix it by making sure the ENGINE VERSION of the nodes was the same.
sudo docker node ls
Another common cause for this is the Ubuntu server installer installing Docker via snap, and that package is buggy. Uninstall the snap and install using apt. And reconsider Ubuntu. :-/
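A rough sketch of that swap, assuming the snap is simply named docker and the docker.io package from the Ubuntu repositories is acceptable (the official Docker apt repository is the other option):
sudo snap remove docker          # remove the snap-packaged engine
sudo apt-get update
sudo apt-get install docker.io   # or follow Docker's official apt install instructions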

Is there any way to flush the docker DNS cache (internal)?

I'm using Docker 18.03.1-ce and if I create a container, remove it and then re-create it, the internal DNS retains the old address (in addition to the new).
Is there any way to clear or flush the old entries? If I delete and re-create the network then that flushes it but I don't want to have to do that every time.
I create the network:
docker network create -d overlay --attachable --subnet 10.0.0.0/24 --gateway 10.0.0.1 --scope swarm -o parent=ens224 overlay1
Then create a container (SQL for this example)
docker container run -d --rm --network overlay1 --name sql -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Some_SA_Passw0rd' -p 1433:1433 microsoft/mssql-server-linux
If I create an Alpine container on the same network, I can nslookup sql by name and it resolves to 10.0.0.6. No problems; so far, so good.
Now, if I remove the SQL container and re-create it, then nslookup sql shows 10.0.0.6 and 10.0.0.8. The 10.0.0.6 entry is the old address, which is no longer alive but still resolves.
The nameserver my containers are using is 127.0.0.11 which is typical for a user-created network but I haven't been able to find anything that will let me clear its cache.
Maybe I'm missing something but I had assumed the DNS entries would be torn down whenever the containers get removed.
Any insight is certainly appreciated!
I have just fixed the same problem by running containers in Docker Swarm. It seems like Swarm does something to keep DNS entries up to date. I tried removing my application container manually using docker rm and scaling it up/down; in every case its hostname was correctly resolved to existing IP addresses only.
If you can't use Swarm, I guess another solution would be to run a standalone service discovery tool (maybe in another container) and configure your other containers to use it as their DNS server instead of the built-in one.
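For the Swarm route, the docker run from the question would become a service; roughly (same image, network, and environment variables assumed):
docker service create --name sql --network overlay1 \
  -e ACCEPT_EULA=Y -e SA_PASSWORD=Some_SA_Passw0rd \
  -p 1433:1433 microsoft/mssql-server-linux
Removing, recreating, or scaling then goes through docker service commands rather than docker rm.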
Like Daniele, I also have the DNS problem in Swarm mode (stack). I kill the services running on one node (making sure other instances are running on other nodes), and swarm starts recreating them. But in the meantime, the DNS gives me the wrong IP for the service name. More than that, I would expect DNS resolution to give me a different IP every time, but that is not the case: within a short time frame (a few seconds), the DNS returns the same IP for a given service, regardless of whether that IP is valid or not.
Daniele, did you file a bug report?

Docker swarm mode routing mesh not working

When I deploy a service on a swarm using:
docker service create --replicas 1 --publish published=80,target=80 tutum/hello-world
I can access the service only from the IP of the node running the container. If I scale the service to run on both nodes, I can access the service from both IPs, but a request to one node's IP is never served by the container on the other node (as confirmed by the tutum/hello-world image).
The documentation suggests that load balancing should work when it says:
Three tasks will run on up to three nodes. You don’t need to know which nodes are running the tasks; connecting to port 8080 on any of the 10 nodes will connect you to one of the three nginx tasks.
The swarm was created using swarm init and swarm join.
Using docker network ls the ingress swarm network is found on both nodes:
NETWORK ID          NAME                DRIVER              SCOPE
cv6hk9wce8bf        ingress             overlay             swarm
Edit:
Manager node runs linux, worker node runs OSX. Running modinfo ip_vs on the manager nodes returns:
filename:       /lib/modules/4.4.0-109-generic/kernel/net/netfilter/ipvs/ip_vs.ko
license:        GPL
srcversion:     D856EAE372F4DAF27045C82
depends:        nf_conntrack,libcrc32c
intree:         Y
vermagic:       4.4.0-109-generic SMP mod_unload modversions
parm:           conn_tab_bits:Set connections' hash size (int)
Running modinfo ip_vs_rr returns:
filename:       /lib/modules/4.4.0-109-generic/kernel/net/netfilter/ipvs/ip_vs_rr.ko
license:        GPL
srcversion:     F21F7372F5E2331EF5F4F73
depends:        ip_vs
intree:         Y
vermagic:       4.4.0-109-generic SMP mod_unload modversions
Edit 2:
I tried adding a Linux worker to the swarm, and it worked as advertised, so the problem appears to be related to the OSX machine.
The problem is solved for me; however, I'll let the question stay for future reference.
Ensure that 7946/tcp, 7946/udp, and 4789/udp are open and available to all nodes in the cluster BEFORE docker swarm init.
Not sure why, but if they are not open PRIOR to creating the swarm, they will not properly load balance.
https://docs.docker.com/engine/swarm/ingress/
This happened to me; it was caused by a firewall issue. So I opened the ports on every worker and manager.
sudo firewall-cmd --permanent --add-port=2377/tcp
sudo firewall-cmd --permanent --add-port=7946/tcp
sudo firewall-cmd --permanent --add-port=4789/udp
sudo firewall-cmd --reload
sudo reboot
Restart the server if that doesn't work; the Docker service may need to be restarted too.
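If a full reboot feels heavy-handed, restarting just the Docker daemon is worth trying first (assuming systemd):
sudo systemctl restart docker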

How to remove node from swarm?

I added three nodes to a swarm cluster using static file mode. I want to remove host1 from the cluster, but I can't find a docker swarm remove command:
Usage: swarm [OPTIONS] COMMAND [arg...]
Commands:
  create, c    Create a cluster
  list, l      List nodes in a cluster
  manage, m    Manage a docker cluster
  join, j      join a docker cluster
  help, h      Shows a list of commands or help for one command
How can I remove the node from the swarm?
Using Docker Version: 1.12.0, docker help offers:
➜ docker help swarm
Usage: docker swarm COMMAND
Manage Docker Swarm
Options:
  --help       Print usage
Commands:
  init         Initialize a swarm
  join         Join a swarm as a node and/or manager
  join-token   Manage join tokens
  update       Update the swarm
  leave        Leave a swarm
Run 'docker swarm COMMAND --help' for more information on a command.
So, next try:
➜ docker swarm leave --help
Usage: docker swarm leave [OPTIONS]
Leave a swarm
Options:
  --force   Force leave ignoring warnings.
  --help    Print usage
Using the swarm mode introduced in Docker Engine 1.12, you can directly do docker swarm leave.
The reference to "static file mode" implies the container-based standalone swarm that predated the current Swarm Mode that most people know as Swarm. These are two completely different "Swarm" products from Docker and are managed with completely different methods.
The other answers here focused on Swarm Mode. With Swarm Mode, docker swarm leave on the target node will cause the node to leave the swarm. And once the engine is no longer talking to the manager, docker node rm on an active manager for the specific node will clean up any lingering references inside the cluster.
With the container-based classic swarm, you would recreate the manager container with an updated static list. If you find yourself doing this a lot, an external DB for discovery would make more sense (e.g. consul, etcd, or zookeeper). Given that classic swarm is deprecated and no longer maintained, I'd suggest using either Swarm Mode or Kubernetes for any new projects.
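For the Swarm Mode case, the two-step removal described above looks roughly like this (node name is illustrative):
# on the node being removed
docker swarm leave
# on an active manager, once the node shows as Down
docker node rm <node-name>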
Try this:
docker node list # to get a list of nodes in the swarm
docker node rm <node-id>
Using the Docker CLI
I work with Docker Swarm clusters, and to remove a node from the cluster there are two options.
It depends on where you want to run the command: on the node you want to remove, or on a manager node other than the node to be removed.
The important thing is that the node must be drained before being removed, to maintain cluster integrity.
First option:
So I think the best thing to do is (following the steps in the official documentation):
Log in to one of the nodes with manager status over SSH;
Optionally list your cluster nodes;
Change the availability of the node you want to remove to drain;
And remove it;
# step 1
ssh user@node1cluster3
# step 2, list the nodes in your cluster
docker node ls
# step 3, drain one of them
docker node update --availability drain node4cluster3
# step 4, remove the drained node
docker node rm node4cluster3
Second option:
The second option needs two terminal logins, one on a manager node and one on the node you want to remove.
Perform the 3 initial steps described in the first option to drain the desired node.
Afterwards, log in to the node you want to remove and run the docker swarm leave command.
# remove from swarm using leave
docker swarm leave
# OR, if the desired node is a manager, you can use force (be careful*)
docker swarm leave --force
*For information about maintaining a quorum and disaster recovery, refer to the Swarm administration guide.
My environment information
I use Ubuntu 20.04 for nodes inside VMs;
With Docker version 20.10.9;
Swarm: active;

Checking reason behind node failure

I have a docker swarm setup with nodes node-1, node-2 and node-3. For some reason, every day one of my nodes fails; basically, it exits. I ran docker logs <container id of swarm> but the logs don't contain any info related to the node failure.
So, is there any log file where a message related to this failure can be seen? Or is this due to some problem like too little memory being allocated?
Can anyone suggest how to dig into this problem and find a proper solution? Every day I have to restart my swarm nodes.
Like most containers, Swarm containers run and exit unless you use docker run with the -d option to "daemonize" them. For example:
$ docker run -d swarm join --advertise=172.30.0.69:2375 consul://172.30.0.161:8500
On the other hand, if you used Docker Machine to create the VMs, then also use Docker Machine to create the Swarm manager and nodes. By default, Docker Machine applies TLS authentication to the Docker Engine nodes. The easiest thing to do is to also create the Swarm manager and nodes at the same time as you create the Docker Engine nodes.
For more info, check out the brand new Swarm doc.
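If the failing nodes are these swarm containers exiting, the usual container tooling can show why; a minimal sketch (container IDs are whatever docker ps -a reports on your host):
docker ps -a --filter "status=exited"   # find the exited swarm container
docker logs <container-id>              # its last output before exiting
docker inspect --format '{{ .State.ExitCode }}' <container-id>   # exit status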
