Docker & Quagga BGP

I bundled the Quagga bgpd and nsm binaries into a Docker image based on Ubuntu (3.13.0-36-generic kernel). I run multiple of these containers on a Linux server, instantiated through docker-py.
I used pipework to create an Ethernet interface on each container and assigned it a 172.17.xx.xx style address. My BGP configuration fully meshes the containers, i.e. the BGP instance in each container peers with the BGP instance in every other container.
BGP sessions get established and the BGP routes are absolutely fine. However, once the number of containers exceeds 30, I can no longer connect to bgpd. "top" doesn't show much CPU usage, memory is within limits, there is not much network activity, and I don't expect much processing inside the BGP process itself.
When I take a tcpdump inside a container, this is what it looks like:
9 2014-09-26 18:17:54.997861 0a:60:4a:3b:56:31 ARP 44 Who has 172.17.1.32? Tell 172.17.1.6
When I run 40 containers, I see 40 such ARP requests as shown above, each followed by one ARP reply.
This happens continuously, resulting in roughly 1600 (40*40) such messages in a short span of time. I believe this is what prevents me from connecting to the local bgpd with the "telnet localhost bgpd" command.
I don't think this is specific to either Quagga or BGP; I suspect it is something to do with Docker networking. Has anybody hit such an issue, or any idea how to fix it or what the root cause is?

I finally found the root cause and fixed it. The problem comes from the combination of the number of container instances, the number of MAC (neighbor) entries created on each container, and the default ARP cache size on my Linux server, which is 1024.
In my case the number of MAC entries ends up being (number of containers * number of Ethernet interfaces created through pipework). The problem occurs once the number of MAC entries on each container exceeds the default limit of 1024.
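For example (hypothetical numbers, not the exact ones from my setup): 40 containers with 30 pipework interfaces each gives 40 * 30 = 1200 neighbor entries per container, which is already above the 1024 default and leads to constant ARP cache churn.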
Adding the lines below to the end of /etc/sysctl.conf (or modifying the existing entries to these values if they are already present) solved the issue:
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 12288
net.ipv4.neigh.default.gc_thresh3 = 16384
After modifying the file, run "sysctl -p" to apply the changes.
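To check how close you are to the limit, the current thresholds can be compared against the size of the neighbor table; a rough check along these lines (the container name below is just a placeholder):
# Current neighbor-table GC thresholds
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# Number of IPv4 neighbor (ARP) entries currently known on the host
ip -4 neigh show | wc -l
# The same count inside one of the containers (name is a placeholder)
docker exec bgp_container_1 ip -4 neigh show | wc -l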

Related

DNS Resolution times out in Golang Docker image

I am currently building a DNS enumeration tool in Go, but when I dockerize the service some issues arise with certain lookups.
I am running multiple DNS lookups against different hosts and printing any errors. At the beginning of the scan the resolutions return correct results, but after some time elapses the requests start returning i/o timeout.
I am currently using the golang 1.19.1 docker image.
I have already changed the DNS server (that's not the problem).
The app runs fine when it is not in a Docker container.
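No answer is recorded here, but this "works at first, then everything times out" pattern often points at a resource limit in the container's network path rather than at the resolver itself. A few hedged things one could check while the scan degrades (the container name is an example, not a detail from the original post):
# Which resolver is the container actually using? (Docker's embedded DNS is 127.0.0.11)
docker exec dns-enum cat /etc/resolv.conf
# Is the host's conntrack table filling up during the scan?
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
# File-descriptor limit inside the container
docker exec dns-enum sh -c 'ulimit -n'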

Containers: high cpu usage in %soft (soft IRQ) for network-intensive workloads

I'm trying to debug some performance issues on a RHEL8.3 server.
The server is actually a Kubernetes worker node and hosts several Redis containers (pods).
These containers are doing a lot of network I/O (iptraf-ng reports about 500 kPPS and 1.5 Gbps).
The server is a high-end Dell machine with 104 CPUs and 10 Gbps NICs.
The issue I'm trying to debug is related to soft IRQs. In short: despite my attempts to set the IRQ affinity of the NIC to a specific range of dedicated CPUs, the utility "mpstat" still reports a lot of CPU time spent in "%soft" on all the CPUs where the "redis-server" process is running (even though redis-server has been pinned with taskset to a non-overlapping range of dedicated CPU cores).
For more details consider the attached screenshot redis_server_and_mpstat:
the "redis-server" with PID 3592506 can run only on CPU 80 (taskset -pc 3592506 returns 80 only)
as can be seen from the "mpstat" output, it's running close to 100%, with 25-28% of the time spent in "%soft" time
In an attempt to address this problem, I've been using the Mellanox IRQ affinity script (https://github.com/Mellanox/mlnx-tools/blob/master/ofed_scripts/set_irq_affinity.sh) to "move" all IRQs related to the NICs onto a separate set of CPUs (namely CPUs 1,3,5,7,9,11,13,15,17, which belong to NUMA1) for both NICs (eno1np0, eno2np1) that compose the "bond0" bonded interface used by the server; see the screenshot set_irq_affinity. Moreover, the "irqbalance" daemon has been stopped and disabled.
The result is that mpstat now reports significant CPU usage in "%soft" on CPUs 1,3,5,7,9,11,13,15,17, but at the same time redis-server is still spending 25-28% of its time in the "%soft" column (i.e. nothing has changed for redis-server).
This pattern repeats for all instances of "redis-server" running on that server (there is more than one), while the other CPUs, which have no redis-server scheduled, are 100% idle.
Finally, in a different environment based on RHEL 7.9 (kernel 3.10.0) with a non-containerized deployment of Redis, I see that running the "set_irq_affinity.sh" script to move IRQs away from the Redis CPUs brings the Redis "%soft" column down to zero.
Can you help me understand why, when Redis runs in a Kubernetes container (with kernel 4.18.0), the redis-server process continues to spend a significant amount of time in %soft handling, despite the NIC IRQs being affined to different CPUs?
Is it possible that the time the redis-server process spends in soft IRQ handling is due to the veth virtual Ethernet device created by the containerization layer (in this case the Kubernetes CNI is Flannel, with all default settings)?
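For reference, one way to narrow this down would be to check where the NET_RX softirqs are actually being raised and whether the pod's veth has RPS steering packets onto the Redis CPU; a rough diagnostic sketch (the IRQ number and veth name below are placeholders):
# Confirm the NIC IRQs really are pinned where expected
grep eno1np0 /proc/interrupts
cat /proc/irq/IRQ_NUMBER/smp_affinity_list
# Watch which CPUs accumulate NET_RX softirqs over time
watch -n1 'grep -E "CPU|NET_RX" /proc/softirqs'
# Check whether RPS is enabled on the pod's veth (all zeros means disabled)
cat /sys/class/net/VETH_NAME/queues/rx-0/rps_cpus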
Thanks

Should Docker release all memory when all containers are closed?

I am debugging a possible memory leak in a web service I have running as a Docker network. The service has a JavaScript front end, a Flask REST API, a Dask worker pool, the spaCy natural language toolkit...the works. I see intermittent out-of-memory problems and I'm trying to get a handle on what could be going on.
I can run this system on my laptop, a MacBook Pro with 16 GB of memory where I am using Docker Desktop. When there are no containers running, Activity Monitor shows com.docker.hyperkit using about 12 GB. Then I launch the Docker network, which ultimately runs 14 containers to house the various components. I perform a fairly large batch job in the Docker network. It runs for an hour, during which time com.docker.hyperkit's memory creeps up to around 18 GB. This is not surprising--this is a memory intensive service. But when I stop all the containers in the network, I would expect com.docker.hyperkit's memory usage to drop back to 12 GB. Instead it stays at 18 GB. The only way I can get it back to 12 GB is to restart the Docker Desktop.
Is this expected behavior? It looks like a memory leak in Docker.
No, it should not release the memory, and yes, this is expected behavior.
There is no way to run Docker containers natively on macOS, so you run them inside a virtual machine. A VM gets memory assigned to it, which it assigns to processes running inside that VM. When those processes inside the VM exit, the resources are released back to the VM, but not back to the parent macOS. That's just how VMs work, and the fact that it didn't take all of the memory up to the limit specified in the Docker preferences immediately on startup is an impressive feat in itself.
The containers themselves are processes running within this VM, and they will release all of their memory back to the VM upon exit. If you run something like docker run --rm busybox free you'll likely see the memory being used and freed within the VM.
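For instance, something along these lines shows memory from the VM's point of view (a minimal illustration; the numbers will be whatever your VM happens to report):
# Free memory as seen inside the Linux VM that runs the containers
docker run --rm busybox free
# Total memory assigned to the VM, in bytes, as reported by the Docker daemon
docker info --format '{{.MemTotal}}'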
For more details on this, there are several extensive threads in the GitHub issues. Most of the comments on these threads appear to be from users assuming macOS is running the containers, rather than a VM that runs the containers. Even completely idle, that VM will use some resources to run the kernel, container runtime daemons, volume sharing code, port forwarding code, etc. There's a lot of magic under the covers to make Docker not look like a VM to the user, so that you can just pass paths and connect to ports on the macOS side. The most helpful comment in the thread to me is here: https://github.com/moby/hyperkit/issues/231#issuecomment-448416559

Single docker container slightly outperforming its host in cpu performance: Why?

I ran an experiment to compare the CPU performance of a docker container against the CPU performance of the host it is running on.
Cases
A: Benchmark program run on Host machine (Intel i5, 2.6 GHz, 2 processors, 2 cores)
B: Benchmark program run on Docker container running on the same host machine.
(No resource limiting is done for the container in B, i.e. the container has all 1024 CPU shares to itself. No other container is running.)
Benchmark program: Numerical Integration
Numerical integration is a standard example of a massively parallel program. A standard numerical integration program written in C++ using the OpenMP library is used (and has already been tested for correctness). The program is run 11 times, varying the number of available threads from 1 to 11. These 11 runs are done for each of cases A and B, so a total of 22 runs are done: 11 on the host and 11 in the container (a rough sketch of such a sweep follows the axis description below).
X axis: Number of threads available in the program
Y axis: performance, i.e. the inverse of run time (calculated by multiplying the inverse of the program's run time by a constant).
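For context, the thread sweep described above could be scripted roughly like this (a sketch under assumptions: the binary name integrate, the image name bench-img, and using OMP_NUM_THREADS to set the thread count are all placeholders, not details from the original experiment):
# Case A: build and sweep 1-11 threads on the host
g++ -fopenmp -O3 -o integrate integrate.cpp
for t in $(seq 1 11); do OMP_NUM_THREADS=$t ./integrate; done
# Case B: the same sweep inside the container (assumes the image contains the same binary)
docker run --rm bench-img sh -c 'for t in $(seq 1 11); do OMP_NUM_THREADS=$t ./integrate; done'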
Result
Observation
The Docker container running on the host slightly outperforms the host machine. This experiment was repeated 4-5 times across 2 different hosts, and every time the container's performance curve was slightly above the host's.
Question
How is the container performance higher than the host machine when the docker container is running on the host itself?
Possible reason: higher priority of the Docker cgroup processes?
I am hypothesizing that the processes within the container's cgroup might be getting a higher process priority, leading to higher performance for the program running within the container compared to running it directly on the host. Does this sound like a possible explanation?
Thanks to the comments of @miraculixx and @Zboson, which helped me understand that the container is not really outperforming the host. The strange results (plot in the question) were caused by different compiler versions being used on the host and in the container while performing the experiment. When cases A and B are run again with the same compiler version in both container and host, these are the results:
Without optimization flag
With optimization flag -O3
Observation
It can be observed that the container has the same or slightly lower performance than the host, which makes sense intuitively. (Without optimization there are a couple of discrepancies.)
P.S. Apologies for the misleading question title. I wasn't aware that the performance discrepancy could be due to the different compiler versions until the comments were posted.
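Since the discrepancy came down to compiler versions, a quick sanity check before rerunning a benchmark like this is to confirm the toolchain matches on both sides (a sketch; bench-img is a placeholder image name):
# Compiler on the host
g++ --version
# Compiler inside the container image used for case B
docker run --rm bench-img g++ --version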

Reset IP counter used by docker

I start, stop and remove containers as part of a continuous build process. Each time the build runs, the containers get new IPs.
I'm already at 172.17.0.95 after starting at 172.17.0.2 an hour ago.
Since I remove the old containers on each build, I would also like to reset the IP counter, so that I don't have a time bomb where I run out of IP addresses after, say, a few hundred builds.
Please let me know how I can tell the responsible entity (a DHCP server?) that an IP address is free again, and how to reset the counter.
Thanks in advance SO community!
Docker seems to default to using 172.17.0.0/16 for the docker0 interface. That's 2^16 = 65,536 addresses (roughly 65,534 usable), and if you use 100 every hour you'll run through them all in just over 27 days. I think Docker is just being conservative in not recycling them faster, but it will loop around when it reaches the end.
If you need a bigger or different address space, you can use the --bip and --fixed-cidr flags on the Docker daemon to choose your own CIDR. See the Docker documentation on networking for details.
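As an illustration, on a current Docker Engine these settings typically go in /etc/docker/daemon.json rather than on the command line (the address ranges below are only examples):
# /etc/docker/daemon.json
{
  "bip": "10.10.0.1/16",
  "fixed-cidr": "10.10.0.0/17"
}
# restart the daemon so the new bridge settings take effect
sudo systemctl restart docker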
If you really just want to reset the counter, you would need to restart the docker server. This will have the side-effect of terminating all your running containers.
