Kubernetes: High CPU usage and zombie processes when "Disk Pressure" event - docker

We run a small on-premise K8s cluster (based on the RKE stack): 1x etcd/control node and 2x worker nodes. The components are:
OS: CentOS 7
Docker version: 19.3.9
K8s: 1.17.2
Another important fact: we're using a Rook-Ceph storage cluster on both worker nodes (rook: v1.2.4, ceph version 14.2.7).
When one of the OS mounts reaches 90%+ usage (for example /var), K8s reports "Disk Pressure" and disables the node, which is fine. But when this happens, the CPU load starts climbing into the dozens (for example 30+ or 40+ on a machine with 4 vCPUs), many container processes (children of containerd-shim) go into the zombie (defunct) state, and the whole K8s cluster collapses.
At first we thought it was a Rook-Ceph problem with XFS storage (described at https://github.com/rook/rook/issues/3132#issuecomment-580508760), so we switched to EXT4 (because we cannot upgrade the kernel to 5.6+). But it happened again last weekend, and we are now sure that it is related to the Disk Pressure event. The last contact with the (already) dead node was on 21-01 at 13:50, but the load started growing at 13:07 and quickly reached 30.5.
/var usage went from 89.97% to 90%+ at exactly 13:07 that day.
Can you point us to what we need to check in the K8s configuration, logs, or anywhere else to find out what is going on? Why does K8s collapse during such a normal event?
(For clarification: we know we're using quite old versions, but we'll do a full upgrade of the environment within a few weeks.)
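To correlate the load spike with the zombie processes described above, one option is to sample both from /proc at short intervals while the node approaches the 90% threshold. The following is a minimal sketch, assuming a Linux node with a standard /proc layout; the helper names (proc_state, count_zombies, load_average) are hypothetical and not part of any Kubernetes or Docker tooling.

```python
import os


def proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the state is the first field after the
    last ')'.
    """
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes in state 'Z' (zombie/defunct) under proc_root."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                if proc_state(fh.read()) == "Z":
                    zombies += 1
        except OSError:
            # The process exited between listdir() and open().
            continue
    return zombies


def load_average(path="/proc/loadavg"):
    """Return the 1-, 5- and 15-minute load averages as floats."""
    with open(path) as fh:
        return tuple(float(x) for x in fh.read().split()[:3])
```

Logging count_zombies() and load_average() once per minute next to the /var usage would show whether the zombies appear before or after the load starts to climb.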

Related

Parallel Docker Container Creation

I am using a Docker Setup that consists of 14 different containers. Every container gets a cpu_limit of 2 and a mem_limit of 2g.
To create and run these containers, I've written a Python script that uses the docker-py library. As of now, the containers are created sequentially, which takes approximately 2 minutes.
Now I'm thinking about parallelizing the process. So instead of doing this (it's pseudocode):
for container in containers_to_start:
    create_container(container)
I do this:
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
pool.map(create_container, containers_to_start)
As a result, the 14 containers are created 2x faster. BUT: the applications within the containers take significantly longer to boot. In the end I don't gain much: the time until every application is reachable is more or less the same, with or without multithreading.
But I don't really know why, because every container gets the same amount of CPU and memory resources, so I would expect the same boot time no matter how many containers are starting at the same time. Clearly this is not the case. Maybe I'm missing some knowledge here, any explanation would be greatly appreciated.
System Specs
CPU: Intel i7 @ 2.90 GHz
32GB RAM
I am using Windows 10 with Docker installed in WSL2 backend.
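For reference, the thread-pool variant described above can also be written with concurrent.futures from the standard library. This is only a sketch: create_container here is a hypothetical stand-in that sleeps briefly instead of calling docker-py, and the pool size of 4 is taken from the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def create_container(name):
    """Hypothetical stand-in for the real docker-py creation call."""
    time.sleep(0.01)  # placeholder for the time the Docker API call takes
    return name


def start_all(names, workers=4):
    """Create containers concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(create_container, names))


containers_to_start = [f"app-{i}" for i in range(14)]
started = start_all(containers_to_start)
```

Note that a cpu_limit is a ceiling rather than a reservation, so containers that boot at the same time still compete for the host's physical cores. That may be one reason the total time until every application is reachable barely changes.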

Containers: high cpu usage in %soft (soft IRQ) for network-intensive workloads

I'm trying to debug some performance issues on a RHEL8.3 server.
The server is actually a Kubernetes worker node and hosts several Redis containers (pods).
These containers are doing a lot of network I/O (iptraf-ng reports about 500 kPPS and 1.5Gbps).
The server is a high-end Dell server with 104 CPUs and 10Gbps NICs.
The issue I'm trying to debug is related to soft IRQs. In short: despite my attempts to set IRQ affinity of the NIC on a specific range of dedicated CPUs, the utility "mpstat" is still reporting a lot of CPU spent in "soft%" on all the CPUs where the "redis-server" process is running (even if redis-server has been moved using taskset to a non-overlapping range of dedicated CPU cores).
For more details consider the attached screenshot redis_server_and_mpstat:
the "redis-server" with PID 3592506 can run only on CPU 80 (taskset -pc 3592506 returns 80 only)
as can be seen from the "mpstat" output, it's running close to 100%, with 25-28% of the time spent in "%soft" time
In an attempt to address this problem, I've been using the Mellanox IRQ affinity script (https://github.com/Mellanox/mlnx-tools/blob/master/ofed_scripts/set_irq_affinity.sh) to "move" all IRQs related to the NICs to a separate set of CPUs (namely CPUs 1,3,5,7,9,11,13,15,17, which belong to NUMA1) for both NICs (eno1np0, eno2np1) that compose the "bond0" bonded interface used by the server; see the screenshot set_irq_affinity. Moreover, the "irqbalance" daemon has been stopped and disabled.
The result is that mpstat now reports consistent CPU usage in "%soft" on CPUs 1,3,5,7,9,11,13,15,17, but at the same time redis-server still spends 25-28% of its time in "%soft" (i.e. nothing has changed for redis-server).
This pattern is repeated for all instances of "redis-server" running on that server (there's more than 1), while the other CPUs, which have no redis-server scheduled, are 100% idle.
Finally in a different environment based on RHEL7.9 (kernel 3.10.0) and a non-containerized deployment of Redis, I see that, when running the "set_irq_affinity.sh" script to move IRQs away from Redis CPUs, Redis %soft column goes down to zero.
Can you help me understand why, when running Redis in a Kubernetes container (with kernel 4.18.0), the redis-server process continues to spend a consistent amount of time in %soft handling, despite the NIC IRQs having affinity to different CPUs?
Is it possible that the time the redis-server process spends in "soft IRQ" handling is due to the veth virtual ethernet device created by the containerization technology (in this case the Kubernetes CNI is Flannel, using all default settings) ?
Thanks
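One way to see which CPUs actually process network softirqs, independently of mpstat, is to take two snapshots of /proc/softirqs and compare the NET_RX and NET_TX rows per CPU. The sketch below assumes the usual /proc/softirqs layout (a header row of CPU names followed by one row per softirq type); the function names are hypothetical.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs content into {softirq_name: {cpu: count}}."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = [int(x) for x in rest.split()]
        table[name.strip()] = dict(zip(cpus, counts))
    return table


def softirq_delta(before, after, kind="NET_RX"):
    """Per-CPU increase of one softirq type between two parsed snapshots."""
    return {cpu: after[kind][cpu] - before[kind][cpu] for cpu in after[kind]}


# Made-up sample data in the /proc/softirqs layout, for illustration only.
SAMPLE_BEFORE = """                    CPU0       CPU1
      NET_TX:         10         20
      NET_RX:        100        200
"""
SAMPLE_AFTER = """                    CPU0       CPU1
      NET_TX:         10         25
      NET_RX:        100        900
"""

delta = softirq_delta(parse_softirqs(SAMPLE_BEFORE), parse_softirqs(SAMPLE_AFTER))
```

If the NET_RX counters keep growing on CPU 80 after the NIC IRQs have been moved away, that would suggest the softirq work is being raised on that CPU by something other than the physical NIC interrupts.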

Docker desktop eats all memory and crashes

I'm using Docker Desktop (19.03.13) with 6 containers on Windows 10, with 16GB RAM.
In docker stats each container consumes 20-500 MB; all together they consume ~1 GB.
But in Task Manager, Docker eats ~10 GB and crashes from lack of system memory.
How can I check what consumes so much memory in Docker?
And how can I prevent this?
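To put a number on the gap described above, the per-container figures can be summed and compared with what Task Manager shows. This is a rough sketch that assumes the input lines come from `docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}"` and that usage is printed with binary units such as MiB or GiB; the helper names are hypothetical.

```python
import re

# Binary unit multipliers as printed in the MemUsage column.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}


def mem_used_bytes(mem_usage):
    """Convert the 'used' half of a '20MiB / 1.944GiB' string to bytes."""
    used = mem_usage.split("/")[0].strip()
    match = re.fullmatch(r"([\d.]+)\s*([A-Za-z]+)", used)
    if match is None:
        raise ValueError(f"unrecognised memory value: {used!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def total_used_bytes(lines):
    """Sum container memory from tab-separated 'name<TAB>usage' lines."""
    return sum(
        mem_used_bytes(line.split("\t")[1]) for line in lines if line.strip()
    )
```

If the total is around 1 GB while Task Manager reports around 10 GB, the difference is held by the VM that hosts the containers rather than by the containers themselves, which is what the .wslconfig limits in the answers below act on.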
Try to create a .wslconfig file at the root of your User folder C:\Users\<my-user> to adjust how much memory & processors Docker will use.
This is the content of the .wslconfig file.
[wsl2]
memory=2GB # Limits VM memory in WSL 2 up to 2GB
processors=2 # Makes the WSL 2 VM use two virtual processors
Then restart the computer. You will find that the Vmmem process only takes the amount of RAM you defined above.
You can learn more here.
I guess you are using the new WSL 2 based engine. Try switching the Docker engine back to Hyper-V by opening Docker settings -> General -> uncheck "Use WSL 2 based Engine".
To explain:
I noticed this started happening to me when the WSL 2 engine was introduced. I switched to it automatically since it was the new engine, and the memory issues started then.
Restarting or closing Docker did not free the memory, and I noticed in Task Manager that Vmmem was the process eating all the memory, so I had to force-close it (which caused Docker to stop working).
The last thing I did was switch the Docker engine back to Hyper-V, which solved my high memory usage.
If you are using WSL 2, set the memory value in .wslconfig to half of your RAM. I don't know why, but I had the same problem with 8GB RAM.
This is my .wslconfig
[wsl2]
memory=4GB # I have 8GB RAM
processors=2
The result was good: memory consumption is now reasonable. At the moment I am running Docker with 8 images.
Although this problem is already marked as SOLVED, there is another possible cause in recently updated versions.
You might have allocated too many resources to Docker's HyperKit VM.
Go to Settings -> Resources -> Advanced
and check whether you allocated too many resources there.
My Docker now takes less than 2% CPU.
After updating .wslconfig to be:
[wsl2]
memory=8GB
swap=2000
processors=4
... and then restarting Docker, the CPU consumption was still over 80% and there were 5 Docker Desktop processes (each taking 17-18%) in Windows Task Manager. I reset Docker to factory defaults and the CPU was still pegged at 80% or more.
I then deleted the .docker folder (on Windows the path is %USERPROFILE%/.docker) as suggested by jmichalek-fp. I took care to do a Shift-DEL so as not to move it to the Recycle Bin, because I remember that in the past recycled items were still found by processes that held a link to the file.
After the factory reset, increasing the .wslconfig resources, deleting the .docker folder, and restarting Docker, it now runs only one Docker Desktop process and, with a Node.js app running in it, consumes between 0.5% and 2% CPU.
I found "delete .docker folder" in this github issue: https://github.com/docker/for-win/issues/12266
As far as I know, docker stats does not show RAM reservations. Try setting RAM limits using the -m flag. Here is some information on how to control resources in Docker:
https://docs.docker.com/config/containers/resource_constraints/?spm=a2c41.12663380.0.0.59ed566dAqUZPu
I am guessing on Windows there is something similar to what exists on MacOS.
Open your docker app and go to the dashboard
Click any container
Click Stats
You will get information about CPU, RAM usage, disk reads and writes, and network usage.
When I had memory issues, which used to happen frequently, I would set up alias scripts that I could chain together to stop, kill, and restart containers and do whatever setup I needed on them.
There is no way to prevent Docker from behaving the way it does unless you want to start contributing and making pull requests. This isn't an uncommon issue. Docker is a free service, so I recommend working around its shortcomings.

Using k8s node resources out of k8s

What would happen with Kubernetes scheduling if I have a Kubernetes node, but I use the container (Docker) engine for some other stuff, outside of the context of Kubernetes?
For example, suppose I manually SSH to the node and do docker run something. Would Kubernetes scheduling take into account the fact that this node is busy running other stuff, and that it might not be able to host any other containers now?
What would happen in the following scenario:
Node with 8 GB RAM
running a pod with resource request 2 GB, limit 4 GB, and current usage 3 GB
ssh on node and docker run a container with 5 GB, using all
P.S. Please skip the "why would you go and run docker run directly on the node" questions. I don't want to, but reasons.
I'm pretty sure Kubernetes's scheduling only considers (a) pods it knows about and not other resources, and (b) only their resource requests.
In the situation you describe, with exactly that resource utilization, things will work fine. The pod can be scheduled on the node because the total resource requests using it are 2 GB out of 8 GB. The total memory usage doesn't exceed the physical memory size either, so you're okay.
Say the pod allocated a little bit more memory. Now the system as a whole is above its physical memory capacity, so the Linux kernel will arbitrarily kill something off. This is often the largest thing. You'll typically see an exit code of 137 (matching SIGKILL) in whichever system manages it.
This behavior is the same even if you run your side job in something like a DaemonSet. It requests 2 GB of RAM, so both pods fit on the same node [4 GB/8 GB], but if it has a resource limit of 6 GB RAM, something will get killed off.
The place where things are different is if you can predict the high memory use. Say your pod requests 3 GB/limits 6 GB of RAM, and your side process will predictably also use 6 GB. If you just docker run it something will definitely get OOM-killed. If you run it as a DaemonSet declaring a 6 GB memory request, the Kubernetes scheduler will know the pod doesn't fit and won't place it there (it may get stuck in "Pending" state if it can't be scheduled anywhere).
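The distinction drawn above, between what the scheduler compares (declared requests) and what the kernel reacts to (actual usage), can be illustrated with a little arithmetic. This is a simplified sketch, not the scheduler's real algorithm: it uses raw node capacity instead of allocatable, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def fits_by_requests(node_capacity, existing_requests, new_request):
    """Scheduler-style check: only declared requests count."""
    return sum(existing_requests) + new_request <= node_capacity


def exceeds_physical_memory(node_capacity, actual_usages):
    """Kernel-style check: actual usage of everything on the host counts."""
    return sum(actual_usages) > node_capacity


node = 8 * GIB

# Question scenario: pod using 3 GB plus a manual `docker run` using 5 GB.
question_case = exceeds_physical_memory(node, [3 * GIB, 5 * GIB])

# Same scenario after the pod allocates a little more memory.
pod_grows = exceeds_physical_memory(node, [3.5 * GIB, 5 * GIB])

# Side job declared as a DaemonSet requesting 6 GB next to a 3 GB request.
daemonset_fits = fits_by_requests(node, [3 * GIB], 6 * GIB)
```

Here question_case is False (exactly at capacity), pod_grows is True (something gets OOM-killed), and daemonset_fits is False (the pod would stay Pending instead of being placed).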
Kubernetes won't see other processes running on the host. However, you can tell the kubelet on that host how much of the host's resources to reserve for the host itself, which prevents Kubernetes from scheduling pods that would exceed the host capacity. See the --system-reserved flag that you can pass to the kubelet:
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
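As a rough illustration of the effect of this flag: the memory the scheduler treats as available on a node is the capacity minus the reserved amounts and the hard eviction threshold. The sketch below applies that subtraction to the 8 GB node from the question; the reservation values are illustrative assumptions, not recommendations, and the function name is hypothetical.

```python
def allocatable_mib(capacity_mib, system_reserved_mib=0,
                    kube_reserved_mib=0, eviction_hard_mib=0):
    """Memory left for pods after reservations and the eviction threshold."""
    return (capacity_mib - system_reserved_mib
            - kube_reserved_mib - eviction_hard_mib)


# 8 GiB node, reserving 5 GiB for the side job started with `docker run`
# (assumed value chosen to match the scenario in the question).
left_for_pods = allocatable_mib(8192, system_reserved_mib=5120)
```

With these numbers left_for_pods is 3072 MiB, so the sum of pod memory requests placed on the node could not exceed 3 GiB.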

Kubernetes OOM pod killed because kernel memory grows too much

I am working on a Java service that basically creates files in a network file system to store data. It runs in a k8s cluster on Ubuntu 18.04 LTS.
When we began to limit the memory in Kubernetes (limits: memory: 3Gi), the pods began to be OOMKilled by Kubernetes.
At first we thought it was a memory leak in the Java process, but after analyzing more deeply we noticed that the problem is kernel memory.
We validated that looking at the file /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes
We isolated the case to only create files (without java) with the DD command like this:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done
With the dd command we saw the same thing happen (the kernel memory grew until OOM).
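The same reproduction can be scripted so that the kernel memory counter is sampled before and after the files are created. This is a minimal sketch, assuming it runs inside the container on a cgroup v1 system where the counter is at the path given above; read_counter and create_files are hypothetical helper names.

```python
import os

# Path taken from the question; it only exists under cgroup v1.
KMEM_USAGE = "/sys/fs/cgroup/memory/memory.kmem.usage_in_bytes"


def read_counter(path=KMEM_USAGE):
    """Read a single integer counter from a cgroup file."""
    with open(path) as fh:
        return int(fh.read().strip())


def create_files(directory, count, size=4096):
    """Write `count` files of `size` random bytes, like the dd loop."""
    for i in range(1, count + 1):
        with open(os.path.join(directory, f"file{i}"), "wb") as fh:
            fh.write(os.urandom(size))
    return count
```

Calling read_counter() before and after create_files(target_dir, 50000) gives the kernel memory growth attributable to the file creation alone.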
After k8s restarted the pod, a describe pod showed:
Last State:Terminated
Reason: OOMKilled
Exit Code: 143
Creating files causes the kernel memory to grow, and deleting those files causes it to decrease. But our service stores data, so it creates a lot of files continuously, until the pod is OOMKilled and restarted.
We tested limiting the kernel memory using standalone Docker with the --kernel-memory parameter, and it worked as expected: the kernel memory grew to the limit and did not rise any further. But we did not find any way to do that in a Kubernetes cluster.
Is there a way to limit the kernel memory in a K8s environment?
Why does creating files cause the kernel memory to grow without being released?
Thanks for all this info, it was very useful!
On my app, I solved this by creating a new side container that runs a cron job, every 5 minutes with the following command:
echo 3 > /proc/sys/vm/drop_caches
(note that you need the side container to run in privileged mode)
It works nicely and has the advantage of being predictable: every 5 minutes, your memory cache will be cleared.

Resources