High dockerd sys CPU usage - docker

We are in the process of trying to migrate a Windows desktop app to Docker. We have created a lightweight Ubuntu-based container with Wine + VNC, and the app is running well.
We need to run a large quantity of these apps on a given host, circa 500-600 per host. The host itself is high spec, with 4 x 8-core CPUs.
When testing under load, dockerd is using a very high amount of sys CPU, and by high I mean that for every 1% of user CPU in use, it is using around 1% of sys CPU.
The problem this causes is that, compared to running the same app under Windows / Hyper-V, we can only run about 50% as many instances of the application, which is clearly an issue. If we factor out the sys CPU load, the two are pretty much equal.
Networking wise, we are using MACVLAN, where each container has its own IP address that is mapped directly into the network.
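For context, the per-container network attachment looks roughly like the sketch below (using docker-py purely for illustration; the parent interface, subnet, and image name are placeholders, not our real values):

import docker

client = docker.from_env()

# Create the macvlan network once; containers attached to it get their own
# IP address directly on the physical network via the parent interface.
net = client.networks.create(
    "apps-macvlan",
    driver="macvlan",
    options={"parent": "eth0"},  # placeholder host interface
    ipam=docker.types.IPAMConfig(
        pool_configs=[docker.types.IPAMPool(subnet="10.0.0.0/16", gateway="10.0.0.1")]
    ),
)

# Each app container is attached to that network and receives an address
# from the pool above.
client.containers.run("wine-vnc-app", detach=True, network="apps-macvlan")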
First of all, is this normal for dockerd to be using so much CPU?
Cheers in advance!

Related

Parallel Docker Container Creation

I am using a Docker Setup that consists of 14 different containers. Every container gets a cpu_limit of 2 and a mem_limit of 2g.
To create and run these containers, I've written a Python script that uses the docker-py library. As of now, the containers are created sequentially, which takes approximately 2 minutes.
Now I'm thinking about parallelizing the process. So instead of doing (it's pseudocode):
for container in containers_to_start:
    create_container(container)
I do
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
pool.map(create_container, containers_to_start)
As a result, the 14 containers are created 2x faster. BUT: the applications within the containers take significantly longer to boot. At the end of the day I don't really gain much; the time until every application is reachable is more or less the same, with or without multithreading.
But I don't really know why. Every container gets the same amount of CPU and memory resources, so I would expect the same boot time no matter how many containers are starting at the same time. Clearly this is not the case. Maybe I'm missing some knowledge here; any explanation would be greatly appreciated.
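For reference, create_container is essentially a wrapper around docker-py's run call, along these lines (a sketch, not the exact script; the image/name fields and the use of nano_cpus/mem_limit to express the limits are assumptions):

import docker

client = docker.from_env()

def create_container(container):
    # container is assumed to be a dict such as {"name": "svc1", "image": "svc1:latest"}
    return client.containers.run(
        container["image"],
        name=container["name"],
        detach=True,
        nano_cpus=2_000_000_000,  # cpu_limit of 2 CPUs
        mem_limit="2g",           # mem_limit of 2g
    )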
System Specs
CPU: Intel i7 @ 2.90 GHz
32GB RAM
I am using Windows 10 with Docker running on the WSL2 backend.

Containers: high cpu usage in %soft (soft IRQ) for network-intensive workloads

I'm trying to debug some performance issues on a RHEL8.3 server.
The server is actually a Kubernetes worker node and hosts several Redis containers (pods).
These containers are doing a lot of network I/O (iptraf-ng reports about 500 kPPS and 1.5Gbps).
The server is a high-end Dell server with 104 CPUs and 10 Gbps NICs.
The issue I'm trying to debug is related to soft IRQs. In short: despite my attempts to set the IRQ affinity of the NIC to a specific range of dedicated CPUs, the "mpstat" utility is still reporting a lot of CPU time spent in "%soft" on all the CPUs where the "redis-server" process is running (even though redis-server has been moved using taskset to a non-overlapping range of dedicated CPU cores).
For more details consider the attached screenshot redis_server_and_mpstat:
the "redis-server" with PID 3592506 can run only on CPU 80 (taskset -pc 3592506 returns 80 only)
as can be seen from the "mpstat" output, it's running close to 100%, with 25-28% of the time spent in "%soft" time
In an attempt to address this problem, I've been using the Mellanox IRQ affinity script (https://github.com/Mellanox/mlnx-tools/blob/master/ofed_scripts/set_irq_affinity.sh) to "move" all IRQs related to the NICs onto a separate set of CPUs (namely CPUs 1,3,5,7,9,11,13,15,17, which belong to NUMA node 1) for both NICs (eno1np0, eno2np1) that make up the "bond0" bonded interface used by the server; see the screenshot set_irq_affinity. Moreover, the "irqbalance" daemon has been stopped and disabled.
The result is that mpstat now reports considerable CPU usage in "%soft" on CPUs 1,3,5,7,9,11,13,15,17, but at the same time the redis-server is still spending 25-28% of its time in the "%soft" column (i.e. nothing has changed for redis-server).
This pattern is repeated for all instances of "redis-server" running on that server (there's more than one), while other CPUs with no redis-server scheduled are 100% idle.
Finally, in a different environment based on RHEL 7.9 (kernel 3.10.0) with a non-containerized deployment of Redis, I see that when running the "set_irq_affinity.sh" script to move IRQs away from the Redis CPUs, the Redis "%soft" column goes down to zero.
Can you help me understand why, when running Redis in a Kubernetes container (with kernel 4.18.0), the redis-server process continues to spend a significant amount of time in %soft handling, despite the NIC IRQs having affinity to different CPUs?
Is it possible that the time the redis-server process spends in "soft IRQ" handling is due to the veth virtual ethernet device created by the containerization technology (in this case the Kubernetes CNI is Flannel, using all default settings)?
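For anyone trying to reproduce this, here is a minimal sketch that diffs the NET_RX row of /proc/softirqs over a short interval, to see on which CPUs the network softirqs are actually being accounted (the interval and output format are arbitrary choices):

import time

def net_rx_counts():
    # /proc/softirqs: first line is the CPU header, then one row per softirq type.
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    cpus = lines[0].split()  # e.g. ['CPU0', 'CPU1', ...]
    for line in lines[1:]:
        fields = line.split()
        if fields[0] == "NET_RX:":
            return dict(zip(cpus, map(int, fields[1:])))
    raise RuntimeError("NET_RX row not found")

before = net_rx_counts()
time.sleep(5)
after = net_rx_counts()

for cpu in before:
    delta = after[cpu] - before[cpu]
    if delta:
        print(f"{cpu}: {delta} NET_RX softirqs in 5s")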
Thanks

How is the host machine's CPU utilized by Docker containers and other applications running on the host?

I am running a microservice application in a Docker container and have to test it using the JMeter tool. So I am running JMeter on my host machine, and my host machine has 4 cores. I allocate 2 cores to the container using the --cpus=2 flag when running the container, so it can use up to 2 cores as it needs while running. I leave the remaining 2 cores for JMeter, other applications, and system usage.
Here I need clarification on what will happen if JMeter and other applications need more than 2 cores and the container also needs its allocated 2 cores fully.
Is there any way to allocate 2 cores exclusively to the container (meaning no other application or the system can use those 2 cores)?
Thank you in advance.
The answer is most probably "no"; the explanation will differ depending on your operating system.
You can try to implement this by playing with CPU affinity; however, CPU is not the only metric you should be looking at, and I would be equally concerned about RAM and disk usage.
In general, having the load generator and the application under test on the same physical machine is a very bad idea, because they are both very resource-intensive. Consider using 2 separate machines for this; otherwise both will suffer from context switches, and you will not be able to monitor the resource usage of JMeter and the application under test using the JMeter PerfMon Plugin.
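To illustrate the CPU affinity point: you can restrict the container to specific cores (Docker's cpuset option, shown below via docker-py; the image name and core list are placeholders), but this only keeps the container on those cores, it does not stop JMeter or the OS from scheduling work on them as well:

import docker

client = docker.from_env()

# Equivalent to: docker run --cpuset-cpus="0,1" ...
client.containers.run(
    "my-microservice:latest",  # placeholder image name
    detach=True,
    cpuset_cpus="0,1",
)

# Reserving cores 0-1 exclusively would additionally require pinning JMeter and
# other host processes away from them (e.g. with taskset or cgroup cpusets).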

Docker stats, memory usage, big difference between OSX and Ubuntu, why?

I have a C program running in an Alpine Docker container. The image size is 10M on both OSX and Ubuntu.
On OSX, when I run this image, 'docker stats' shows it using 1M of RAM, so in the Docker Compose file I allocate a max of 5M within my swarm.
However, on Ubuntu 16.04.4 LTS the image is also 10M, but when running it uses about 9M of RAM, and I have to increase the max allocated memory in my compose file.
Why is there a such a difference in RAM usage between OSX and Ubuntu?
Even though we have different OSes, I would have thought that once you are running inside a framework you would behave similarly on different machines, so memory usage should be comparable.
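One way to see what 'docker stats' is derived from is to pull the raw stats from the API; a sketch with docker-py follows (the container name is a placeholder, and the exact keys under memory_stats differ between Docker versions and cgroup v1/v2, so treat them as assumptions):

import docker

client = docker.from_env()

container = client.containers.get("my-c-program")  # placeholder container name
stats = container.stats(stream=False)              # one snapshot instead of a stream

mem = stats["memory_stats"]
print("usage:", mem.get("usage"))
print("limit:", mem.get("limit"))
# On cgroup v1 hosts there is usually a nested 'stats' dict (with e.g. 'cache'),
# which can make 'usage' look much larger than the program's own RSS.
print("detail:", mem.get("stats", {}))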
Update:
Thanks for the comments. So 'stats' may be inaccurate, there are differences, and it's best to baseline on Linux. As an aside, but I think interesting: the reason for asking this question is to understand what happens 'under the hood' in order to tune my setup for a large number of deployed programs. Originally, when I tested, I tried to allocate the smallest amount of maximum RAM on Ubuntu; this resulted in a lot of disk thrashing, something I didn't see or hear on my MacBook (no hard disks!).
Some numbers, which are entirely specific to my setup but I think are interesting:
1000 docker containers, 1 C program each, 20M RAM MAX per container, Server load of 98, Server runs 4K processes in total, [1000 C programs total]
20 docker containers, 100 C programs each, 200M RAM MAX per container, Server load of 5 to 50, Server runs 2.3K processes in total, [2000 C programs total].
This all points to giving your Docker containers a good amount of MAX RAM, and it is nicer to your server to have fewer Docker containers running.

Single docker container slightly outperforming its host in cpu performance: Why?

I ran an experiment to compare the CPU performance of a docker container against the CPU performance of the host it is running on.
Cases
A: Benchmark program run on Host machine (Intel i5, 2.6 GHz, 2 processors, 2 cores)
B: Benchmark program run on Docker container running on the same host machine.
(No resource limiting is done for the container in B, i.e. the container has all 1024 CPU shares to itself. No other container is running.)
Benchmark program: Numerical Integration
Numerical integration is a standard example of a massively parallel program. A standard numerical integration example program written in C++ using the OpenMP library is used (it has already been tested for correctness). The program is run 11 times, varying the number of threads available to the program from 1 to 11. These 11 runs are done for each of cases A and B, so a total of 22 runs are done: 11 for the host and 11 for the container.
X axis: number of threads available to the program
Y axis: performance, which is the inverse of time (calculated by multiplying the inverse of the program's run time by a constant)
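A sketch of how such a sweep can be scripted (./integrate, the scaling constant, and the use of OMP_NUM_THREADS to vary the thread count are assumptions about the setup, not the exact harness used):

import os
import subprocess
import time

CONSTANT = 1000.0  # arbitrary scaling constant for the performance metric

for threads in range(1, 12):
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(["./integrate"], env=env, check=True)  # placeholder binary
    elapsed = time.perf_counter() - start
    print(f"threads={threads} time={elapsed:.3f}s performance={CONSTANT / elapsed:.1f}")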
Result
Observation
The Docker container running on the host slightly outperforms the host machine. This experiment was repeated 4-5 times across 2 different hosts, and every time the container's performance curve was slightly above the host's performance curve.
Question
How is the container performance higher than the host machine when the docker container is running on the host itself?
Possible reason: higher priority of the Docker cgroup's processes?
I am hypothesizing that the processes within the container's cgroup might be getting a higher process priority, leading to higher performance of the program running within the container compared to when the program runs directly on the host machine. Does this sound like a possible explanation?
Thanks to the comments of @miraculixx and @Zboson, which helped me understand that the container is not really outperforming the host. The strange results (the plot in the question) were caused by different compiler versions being used on the host and in the container while performing the experiment. When cases A & B are run again after updating to the same compiler version in both the container and the host, these are the results:
Without optimization flag
With optimization flag -O3
Observation
It can be observed that the container has the same or slightly lower performance than the host, which makes sense intuitively. (Without the optimization flag there are a couple of discrepancies.)
P.S. Apologies for the misleading question title. I wasn't aware that the performance discrepancy could be due to the different compiler versions until the comments were posted.
