Docker swarm - cross node HTTP connection hangs

We have a swarm cluster running on 2 RHEL nodes.
When a container running on node A tries to make an HTTP connection to a container on node B, we see the following behavior:
If the requested file is smaller than about 1350 bytes, the curl command returns successfully.
If the requested file is larger than about 1350 bytes, the curl command hangs indefinitely.
We do not see this issue on CentOS. Also, if the same request is made from another container within the same node (same swarm network), it works fine.
What do I need to tweak in Docker swarm to make this problem go away?
OS: RHEL 8.7
Docker: 20.10.21
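
A hang that appears only above a payload size around 1350-1400 bytes is the classic signature of an MTU/fragmentation problem on the VXLAN overlay (VXLAN adds roughly 50 bytes of overhead, so the overlay MTU has to fit inside the path MTU between the nodes). A rough diagnostic sketch, assuming a hypothetical overlay network named my-net:

# check the MTU on the host NICs carrying swarm traffic (run on both nodes)
ip link show
# from node A, probe the path MTU towards node B with non-fragmenting pings
ping -M do -s 1400 <node-B-IP>
# if the path MTU is reduced, recreate the overlay with a matching lower MTU
# (1350 here is only a guess based on the observed threshold)
docker network create --driver overlay --opt com.docker.network.driver.mtu=1350 my-net

This is only one common cause; it does not explain by itself why the behavior differs between RHEL and CentOS.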

Related

Docker Windows master node "docker swarm init" causes worker nodes in same Virtual Network to no longer see the master node

I have strange behaviour related to docker swarm mode on windows. What I have done:
Deployed two "Windows Server 2019 Datacenter with Containers - Gen1" virtual machines in Azure
Set up RDP access from my IP to the virtual machines
Ensured they are in the same virtual network and that their subnet is associated with the virtual network
Downloaded all windows updates
Used telnet to check if worker machine sees master by running "telnet 10.0.0.4 3389". This works.
Used telnet to check if master machine sees worker by running "telnet 10.0.0.5 3389". This works.
Ensured that the Docker Swarm ports are open in Windows Firewall on both machines: 4789, 7946 (UDP) and 2377, 7946 (TCP)
Initialized docker swarm mode on master node with the command: "docker swarm init --advertise-addr 10.0.0.4"
Checked that "docker node ls" lists the master as Ready
Immediately after this tried to use "telnet 10.0.0.4 3389" from worker node to see if master is still accessible - it no longer works!
Not surprisingly, trying to join the docker swarm from the worker also fails with the usual "timeout" error.
Since telnet 10.0.0.4 3389 worked before the master node entered swarm mode but not after, it seems Docker on Windows is changing firewall priorities or rules, or changing the active network, or something along those lines... which is bonkers. I have not found a solution to this problem, which is making docker-for-windows unusable. Note: this problem only occurs in Azure. Using virtual machines in Exoscale and manually installing Docker with PowerShell scripts did not show the same issue, which makes me think the "Windows Server 2019 Datacenter with Containers - Gen1" servers may have some faulty configuration.
Edit:
I can confirm that this behaviour does not appear when manually installing docker for 2019 data centers using the following guide: https://blog.sixeyed.com/getting-started-with-docker-on-windows-server-2019/ (sixeyed is a known Docker for Windows expert). In other words "Windows Server 2019 Datacenter" image works.
So, do not use the "Windows Server 2019 Datacenter with Containers - Gen1" image. Instead, use the standard image and follow standard docker-for-windows-server-2019 installation guides to get swarm mode working.
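For reference, the firewall step listed earlier (2377/tcp, 7946/tcp+udp, 4789/udp) can be applied and re-checked from an elevated PowerShell on both nodes roughly as sketched below. The rule display names are arbitrary placeholders, and in this case the real fix turned out to be the VM image, so treat this only as a sanity check:

# hypothetical rule names; run in an elevated PowerShell on both nodes
New-NetFirewallRule -DisplayName "Docker Swarm TCP" -Direction Inbound -Protocol TCP -LocalPort 2377,7946 -Action Allow
New-NetFirewallRule -DisplayName "Docker Swarm UDP" -Direction Inbound -Protocol UDP -LocalPort 4789,7946 -Action Allow
# re-test reachability from the other node after "docker swarm init"
Test-NetConnection 10.0.0.4 -Port 2377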

Docker node level load balancing not working

I have two laptops, one running Ubuntu 14 and one a Mac (Big Sur), and both of them have Docker (with swarm support) installed.
I used Ubuntu as my Swarm manager (and) Mac as my worker node.
The Ubuntu private IP is 192.168.0.14 (and) the Mac private IP is 192.168.0.11 [Private IPs can be shared in public without any issues because every class C network has the same IPs :P]
"docker swarm init --advertise-addr" was the command I used to make my Ubuntu host, the Manager (and) I entered the join command in Mac to make the Mac node to join the Swarm as worker.
So, On a highlevel, I used docker-compose.yml (which has only 1 python webservice). Using the compose file, I started a "docker stack" and then replicated the "python webservice" instance to 5. All these actions were carried out in Manager node.
Ubuntu Manager Node (also had 2 container instance and behaved as worker) (and) Mac Node had 3 container instances of the "python webservice". I have set up "ports" to be "80:1234" which means If I hit the port 80 of the host machine, It will redirect to the "python application webservice port" which is 1234 running inside the container.
When I hit the manager IP (192.168.0.14:80) some 50 times and checked the logs of all 5 containers on both the Mac and Ubuntu,
I found that the 2 containers on Ubuntu got 25 hits each (in a round-robin fashion) BUT
I couldn't find any logs for any of the containers on the Mac machine.
Is this an expected behavior?
Only when I hit the IP address of the Mac machine (the worker, 192.168.0.11:80) directly was I able to see request hits in the logs of the containers on the Mac machine.
So, there are two types of load balancing happening here:
When I hit the IP:port of a worker/manager, only the containers present on that worker/manager machine are load balanced and served in a round-robin fashion (I can see that's the algorithm used). Let's name this load balancing type "container level load balancing".
But when I hit 192.168.0.14 (the manager IP), I expected the load to be balanced across all 5 containers deployed across the 2 nodes. Somehow this didn't work. Let's call it "node level load balancing".
I have searched a lot on Google for this but found nothing. Most sites use external technologies like Nginx or HAProxy load balancers to solve "node level load balancing".
Isn't there out-of-the-box support for this in Docker itself?
EDIT 1 - Added docker-compose.yml as Metin asked in comment section
docker-compose.yml
version: '3'
services:
  webservice:
    image: python_ws_test
    ports:
      - '80:1234'
    command: ["python", "app.py"]
The main issue was that I tried to join a Linux node and a Mac node, and Docker for Mac (only swarm mode, I think) is kind of broken, as mentioned
in this comment https://dev.to/aguedeney/comment/172d6 (and)
subsequently in the thread (https://dev.to/natterstefan/docker-tip-how-to-get-host-s-ip-address-inside-a-docker-container-5anh).
Mac Private IP is 192.168.0.11, but somehow, 192.168.65.3 is the IP taken for the Mac worker node.
How did I find out?
Point 1
=> I made the Mac my manager using the "swarm init" command without any "advertise-addr" or "listen-addr" etc. The "docker swarm join" command I got had the IP address 192.168.65.3. I don't know why, because my Mac host IP is 192.168.0.11. This is not expected behaviour.
=> I did the same on Ubuntu, making it the manager with a raw "swarm init" command, and the "docker swarm join" command I got had the IP address 192.168.0.14, which is the same IP as the Ubuntu host machine; that is the expected behaviour.
Point 2
Once the stack was deployed, I inspected the overlay network being used with "docker network inspect $networkName". The Linux manager node listed the peers as itself and 192.168.65.3, which was unreachable because my Mac node's IP is 192.168.0.11.
But somehow, when I auto-scaled using the "scale" command on the manager node (Ubuntu), the Docker manager was able to scale the containers on both the Mac and Ubuntu. This is very odd.
Default Overlay network - behaviour
Also, "docker stack deploy" by default creates an overlay ingress network irrespective of whether you mention in docker-compose.yml or deploy command. Docker manager and nodes communicate between them on top of this network.
Answer to the issue mentioned in question
Does Docker have out-of-the-box support for "node level load balancing"? Yes!
I was so frustrated by this odd behaviour on the Mac that I installed an Ubuntu 20.04 VM on my Mac and used Ubuntu 14.04 (the separate laptop / base OS) as the manager and Ubuntu 20.04 (the VM) as the worker node. Now I was able to load balance between the two nodes (I was getting hits on the worker node), even though I kept hitting the manager's IP only.
I'll update here on why the Mac is broken if I get more insight. Anyone who already knows, please share.

Running couchbase cluster with multiple nodes in docker on windows 10

I created a Couchbase 4.0 Docker container with a single node on Windows 10. I added the node IP to the host machine's loopback and forwarded the port in VirtualBox so that the Couchbase client in my app running on the host could connect to the node in the cluster. I was able to connect and do DB operations while I had a single node in the cluster.
However, when I created a multi-node cluster in Docker on Windows 10, I was not able to do DB operations. In the Golang app running on the host I got the message "unable to complete action after 6 attempts" on get and set operations.
How do I run a multi-node Couchbase cluster in Docker on the same Windows host so that I can connect to the cluster and do DB operations from an app running on the host machine?
If your app is not running inside of the Docker host, as far as I know, you can't do this (I would LOVE to be proven wrong by a Docker expert).
Couchbase clients need access to every node in the cluster, and with Docker you can only forward one container to a given port outside the host. (FYI, there is a tool called SDK Doctor which you can use to diagnose connectivity/networking issues.)
I would suggest running your golang app inside of the Docker host (using docker-compose is the way this is typically done).
Also, I would highly suggest upgrading to a more recent version of Couchbase.
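To illustrate the suggestion of running the app alongside the cluster (typically via docker-compose), here is a rough sketch with plain docker commands; the image and container names are made up for the example:

# put the Couchbase nodes and the Go app on one user-defined network
docker network create cb-net
docker run -d --name cb1 --network cb-net couchbase:community
docker run -d --name cb2 --network cb-net couchbase:community
# hypothetical image name for the Go app; it can reach the nodes as "cb1" / "cb2" by container name
docker run -d --name myapp --network cb-net my-golang-app

That way the client sees every node directly instead of going through a single forwarded host port.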

Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

I am trying to set up docker swarm with an overlay network. I have some hosts on AWS while others are laptops running Ubuntu (same as on AWS). Every node has a static public IP. I have created an overlay network as:
docker network create --driver=overlay --attachable test-net
I have created a swarm on one of the AWS hosts. Every other node is able to join that swarm.
However when I run docker run -it --name alpine2 --network test-net alpine on any node not on aws, I get the error: docker: Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded.
But if I run the same on any AWS host, everything works fine. Is there anything more I need to do in terms of networking/ports if some nodes are on AWS while others are not?
I have opened the ports required for swarm networking on all machines.
EDIT: All the nodes are marked as "active" when listing in the manager node.
UPDATE: Solved this issue by opening the respective ports. It now works if all the nodes are Linux-based. But when I try to make a swarm with a Linux (Ubuntu) manager, macOS machines are not able to join the swarm.
Check if the node is in drain state:
docker node inspect --format {{.Spec.Availability}} node
If yes, then update its availability:
docker node update --availability active node
here is the explanation:
Resolution
When a node is in drain state, it is expected behavior that you should
not be able to allocate swarm mode resources such as multi-host
overlay network IP addresses to the node. However, swarm mode does not
currently provide a messaging mechanism from the swarm leader (where
IP address management occurs) back to the worker node that requested
the IP address. So docker run fails with context deadline exceeded.
Internal engineering issue escalation/292 has been opened to provide a
better error message in a future release of the Docker daemon.
source
Check that the ports below are open on both machines:
TCP port 2377
TCP and UDP port 7946
UDP port 4789
You may use ufw to allow the ports:
ufw allow 2377/tcp
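Presumably the remaining swarm ports can be opened the same way:
ufw allow 7946/tcp
ufw allow 7946/udp
ufw allow 4789/udp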
I had a similar issue, managed to fix it by making sure the ENGINE VERSION of the nodes were the same.
sudo docker node ls
Another common cause of this is the Ubuntu server installer installing Docker via snap, and that package is buggy. Uninstall the snap and install Docker using apt. And reconsider Ubuntu. :-/
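If that is the case, the switch looks roughly like this (docker.io is Ubuntu's own package; Docker's docker-ce repository is the other common option):

# remove the snap-installed engine, then install from apt
sudo snap remove docker
sudo apt-get update
sudo apt-get install docker.io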

Hyperledger - Docker swarm fails when deploying to multiple hosts

I am following this tutorial. I ran sudo docker swarm init --advertise-addr <myip> on the 1st Ubuntu machine. Then I took the manager join-token and ran it on the 2nd Ubuntu machine, and it was able to join as a manager.
But the problem starts when I run docker network create --attachable --driver overlay my-net on the 1st machine; it gives me the following error:
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
If I run the above command to create the network before joining the 2nd node, the network gets created successfully and the 2nd node also joins the 1st swarm node. But then, when I do anything on the 1st Ubuntu machine, I get the same error.
Both Ubuntu machines are in same network and can be pinged by each other.
Ubuntu version - 17.1 64 bit
Docker version 18.03.1-ce, build 9ee9f40
Docker-compose version 1.21.2, build a133471
It seems that the tutorial is off as you will only end up with two managers and that is not enough to form a quorum. You can either add an additional manager node or simply create a single manager (docker swarm init) and then join a single worker using the command that is output as part of the response to docker swarm init. You should SKIP the docker swarm join-token manager step from the tutorial.
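A minimal sketch of that flow (the IP and token are placeholders; the join command is printed by the init command):

# on the first machine: single manager
docker swarm init --advertise-addr <ip-of-first-machine>
# on the second machine: join as a worker using the command printed by the init above
docker swarm join --token <worker-token> <ip-of-first-machine>:2377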
Just change the IP of your Ubuntu Machine.
Machine -> Settings -> Network -> select "Attached to: Bridged Adapter".
Restart your machine.
