Kubernetes OOM pod killed because kernel memory grows too much - docker

I am working on a Java service that basically creates files on a network file system to store data. It runs in a k8s cluster on Ubuntu 18.04 LTS.
When we began to limit the memory in Kubernetes (limits: memory: 3Gi), the pods began to be OOMKilled by Kubernetes.
At first we thought it was a memory leak in the Java process, but on closer analysis we noticed that the problem is kernel memory.
We confirmed this by looking at the file /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes
We reduced the case to only creating files (without Java) with the dd command, like this:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done
With the dd command we saw the same thing happen (the kernel memory grew until OOM).
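The reproduction above can be instrumented to print the kernel memory counter while the files are being created. This is a minimal sketch, assuming the cgroup v1 path quoted in the question; the function names kmem_mib and create_files_and_sample are made up for this example.

```shell
# Print a cgroup v1 byte counter in whole MiB.
# The default is the kernel memory counter named in the question.
kmem_mib() (
    file="${1:-/sys/fs/cgroup/memory/memory.kmem.usage_in_bytes}"
    bytes=$(cat "$file")
    echo $(( bytes / 1048576 ))
)

# Create COUNT 4 KiB files in the current directory and print the
# kernel memory counter every STEP files.
create_files_and_sample() (
    count="${1:-50000}"
    step="${2:-5000}"
    counter="${3:-/sys/fs/cgroup/memory/memory.kmem.usage_in_bytes}"
    i=1
    while [ "$i" -le "$count" ]; do
        dd if=/dev/urandom bs=4096 count=1 of="file$i" 2>/dev/null
        if [ $(( i % step )) -eq 0 ]; then
            echo "$i files: $(kmem_mib "$counter") MiB kernel memory"
        fi
        i=$(( i + 1 ))
    done
)
```

Running create_files_and_sample with no arguments inside the container repeats the 50000-file test and prints a reading every 5000 files.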
After k8s restarted the pod, describe pod showed:
Last State:Terminated
Reason: OOMKilled
Exit Code: 143
Creating files causes the kernel memory to grow, and deleting those files causes it to decrease. But our service stores data, so it creates a lot of files continuously until the pod is OOMKilled and restarted.
We tested limiting the kernel memory using standalone Docker with the --kernel-memory parameter and it worked as expected. The kernel memory grew to the limit and did not rise any further. But we did not find any way to do that in a Kubernetes cluster.
Is there a way to limit the kernel memory in a K8s environment?
Why does creating files cause the kernel memory to grow without being released?

Thanks for all this info, it was very useful!
In my app, I solved this by creating a new side container that runs a cron job every 5 minutes with the following command:
echo 3 > /proc/sys/vm/drop_caches
(note that you need the side container to run in privileged mode)
It works nicely and has the advantage of being predictable: every 5 minutes, your memory cache will be cleared.
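If cron is not available in the side container, a plain loop gives the same effect. This is a sketch rather than the exact setup used above: the function names are made up, and the optional path argument exists only so the write can be tried against an ordinary file first.

```shell
# Write "3" to the drop_caches control file once.
# The first argument overrides the target path (useful for a dry run).
drop_caches_once() (
    target="${1:-/proc/sys/vm/drop_caches}"
    sync                  # flush dirty pages first so more cache is reclaimable
    echo 3 > "$target"    # 3 = drop the page cache plus dentries and inodes
)

# Repeat forever at a fixed interval in seconds (default 300 = 5 minutes).
drop_caches_loop() {
    interval="${1:-300}"
    while true; do
        drop_caches_once
        sleep "$interval"
    done
}
```

Calling drop_caches_loop as the container's entrypoint command needs the same privileged mode noted above, since the write to /proc/sys/vm/drop_caches requires root on the node.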

Related

Kubernetes: High CPU usage and zombie processes when "Disk Pressure" event

We run a small on-premise K8s cluster (based on the RKE stack): 1x etcd/control node, 2x worker nodes. Components are:
OS: CentOS 7
Docker version: 19.3.9
K8s: 1.17.2
Another important fact: we're using a Rook-Ceph storage cluster on both worker nodes (rook: v1.2.4, ceph version 14.2.7).
When one of the OS mounts reaches 90%+ usage (for example: /var), K8s reports "Disk Pressure" and disables the node, which is OK. But when this happens, the CPU load starts growing into the dozens (for example 30+ or 40+ on a machine with 4 vCPUs), many of the container processes (children of containerd-shim) go into the zombie (defunct) state, and the whole k8s cluster collapses.
At first we thought it was a Rook-Ceph problem with XFS storage (described at https://github.com/rook/rook/issues/3132#issuecomment-580508760), so we switched to EXT4 (because we cannot upgrade the kernel to 5.6+), but last weekend it happened again, and we are now sure that it is related to the Disk Pressure event. The last contact with the (already) dead node was on 21-01 at 13:50, but the load started growing at 13:07 and quickly went to 30.5:
/var usage went from 89.97% to 90%+ at exactly 13:07 that day:
Can you point us to what we need to check in the k8s configuration, logs, or anywhere else to find out what is going on? Why does k8s collapse during such a normal event?
(For clarification: we know that we're using quite old versions, but we'll do a full upgrade of the environment within a few weeks.)

Docker desktop eats all memory and crashes

I am using Docker Desktop (19.03.13) with 6 containers on Windows 10, with 16 GB of RAM.
In docker stats each container consumes 20-500 MB; all together they consume ~1 GB.
But in Task Manager Docker eats ~10 GB and crashes from lack of system memory.
How can I check what consumes so much memory in Docker?
And how can I prevent this?
Try creating a .wslconfig file in the root of your user folder C:\Users\<my-user> to adjust how much memory and how many processors Docker will use.
This is the content of the .wslconfig file.
[wsl2]
memory=2GB # Limits VM memory in WSL 2 to 2GB
processors=2 # Makes the WSL 2 VM use two virtual processors
Then restart the computer. You will find that the Vmmem process only takes the amount of RAM you defined.
You can learn more here.
I guess you are using the new WSL 2 based engine. Try switching the Docker engine back to Hyper-V by opening Docker settings -> General -> uncheck "Use WSL 2 based engine".
To explain:
I noticed it started happening to me when the WSL 2 engine was introduced. I switched to it automatically since it was the new engine, and the memory issues started then.
Restarting or closing Docker did not free the memory, and I noticed in Task Manager that Vmmem was the process eating all the memory, so I had to force close it (which caused Docker to stop working).
In the end, switching the Docker engine back to Hyper-V solved my high memory usage.
If you are using WSL2, set the memory in .wslconfig to half of your RAM. I don't know why, but I had the same problem with 8GB RAM.
This is my .wslconfig
[wsl2]
memory=4GB # I have 8GB RAM
processors=2
The result was good: memory consumption is now reasonable. At the moment I am running Docker with 8 images:
Although this problem is already marked as SOLVED, there is another possible cause in recently updated versions.
You might have allocated too many resources to Docker's hyperkit.
Go to Settings - Resources - Advanced
and check whether you allocated too many resources there.
I have my docker taking less than 2% cpu now.
After updating .wslconfig to be:
[wsl2]
memory=8GB
swap=2000
processors=4
... and then restarting Docker, the CPU consumption was still over 80% and there were 5 Docker Desktop processes (each taking 17-18%) in Windows Task Manager. I reset Docker to factory defaults and the CPU was still pegged at 80% or more.
I then deleted the .docker folder (on Windows the path is %USERPROFILE%/.docker) as suggested by jmichalek-fp. I took care to do a Shift-DEL so as not to move it to the Recycle Bin, because I remember that in the past recycled items were still found by processes that held a link to the file.
After the factory reset, then increasing the .wslconfig resources, then deleting the .docker folder, and then restarting Docker, it now runs only one Docker Desktop process and, with a Node.js app running in it, consumes between 0.5% and 2% CPU.
I found "delete .docker folder" in this github issue: https://github.com/docker/for-win/issues/12266
As far as I know, docker stats does not show RAM reservations. Try setting RAM limits with the -m flag. There is some information on how to control resources with Docker:
https://docs.docker.com/config/containers/resource_constraints/?spm=a2c41.12663380.0.0.59ed566dAqUZPu
I am guessing that on Windows there is something similar to what exists on macOS.
Open your Docker app and go to the dashboard
Click any container
Click Stats
You will get information about CPU, RAM usage, disk reads and writes, and network usage.
When I had memory issues, which used to happen frequently, I would set up alias scripts that I could chain together to stop/kill/restart the containers and do whatever setup I needed on them.
There is no way to prevent Docker from behaving the way it does unless you want to start contributing and making pull requests. This isn't an uncommon issue. Docker is a free service, so I recommend working around its shortcomings.

Should Docker release all memory when all containers are closed?

I am debugging a possible memory leak in a web service I have running as a Docker network. The service has a Javascript front end, Flask REST API, Dask worker pool, the spaCy natural language toolkit...the works. I see intermittent running-out-of memory problems and I'm trying to get a handle on what could be going on.
I can run this system on my laptop, a MacBook Pro with 16 GB of memory where I am using Docker Desktop. When there are no containers running, Activity Monitor shows com.docker.hyperkit using about 12 GB. Then I launch the Docker network, which ultimately runs 14 containers to house the various components. I perform a fairly large batch job in the Docker network. It runs for an hour, during which time com.docker.hyperkit's memory creeps up to around 18 GB. This is not surprising--this is a memory intensive service. But when I stop all the containers in the network, I would expect com.docker.hyperkit's memory usage to drop back to 12 GB. Instead it stays at 18 GB. The only way I can get it back to 12 GB is to restart the Docker Desktop.
Is this expected behavior? It looks like a memory leak in Docker.
No it should not release the memory, and yes it is expected behavior.
There is no way to run Docker containers natively on macOS, so you run them inside a virtual machine. A VM gets memory assigned to it, which it assigns to processes running inside that VM. When those processes exit, the resources are released back to the VM, but not back to the host macOS. That's just how VMs work, and the fact that it didn't take all of the memory up to the limit specified in the Docker preferences immediately on startup is an impressive feat in itself.
The containers themselves are processes running within this VM, and they will release all of their memory back to the VM upon exit. If you run something like docker run --rm busybox free you'll likely see the memory being used and freed within the VM.
For more details on this, there are several extensive threads in the GitHub issues. Most of the comments on these threads appear to be from users assuming macOS is running the containers, rather than a VM that runs containers. Even completely idle, that VM will use some resources to run the kernel, container runtime daemons, volume sharing code, port forwarding code, etc. There's a lot of magic under the covers to make Docker not look like a VM to the user, so that you can just pass paths and connect to ports on the macOS side. The most helpful comment in the thread to me is here: https://github.com/moby/hyperkit/issues/231#issuecomment-448416559

Nifi 1.6.0 memory leak

We're running Docker containers of NiFi 1.6.0 in production and have come across a memory leak.
Once started, the app runs just fine; however, after a period of 4-5 days, the memory consumption on the host keeps increasing. In the NiFi cluster UI, JVM heap usage is only around 30%, but memory usage at the OS level goes to 80-90%.
On running the docker stats command, we found that the NiFi Docker container is the one consuming the memory.
After collecting the JMX metrics, we found that the RSS memory keeps growing. What could be the potential cause of this? In the JVM tab of the cluster dialog, young GC also seems to be happening in a timely manner, with old GC counts shown as 0.
How do we go about identifying what's causing the RSS memory to grow?
You need to replicate that in a non-Docker environment, because with Docker, memory is known to rise.
As I explained in "Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container", docker has some bugs (like issue 10824 and issue 15020) which prevent an accurate report of the memory consumed by a Java process within a Docker container.
That is why a plugin like signalfx/docker-collectd-plugin mentions (two weeks ago) in its PR (pull request) 35 the need to "deduct the cache figure from the memory usage percentage metric":
Currently the calculation for memory usage of a container/cgroup being returned to SignalFX includes the Linux page cache.
This is generally considered to be incorrect, and may lead people to chase phantom memory leaks in their application.
For a demonstration on why the current calculation is incorrect, you can run the following to see how I/O usage influences the overall memory usage in a cgroup:
docker run --rm -ti alpine
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
dd if=/dev/zero of=/tmp/myfile bs=1M count=100
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
You should see that the usage_in_bytes value rises by 100MB just from creating a 100MB file. That file hasn't been loaded into anonymous memory by an application, but because it's now in the page cache, the container memory usage is appearing to be higher.
Deducting the cache figure in memory.stat from the usage_in_bytes shows that the genuine use of anonymous memory hasn't risen.
The signalFX metric now differs from what is seen when you run docker stats which uses the calculation I have here.
It seems like knowing the page cache use for a container could be useful (though I am struggling to think of when), but knowing it as part of an overall percentage usage of the cgroup isn't useful, since it then disguises your actual RSS memory use.
In a garbage collected application with a max heap size as large, or larger than the cgroup memory limit (e.g the -Xmx parameter for java, or .NET core in server mode), the tendency will be for the percentage to get close to 100% and then just hover there, assuming the runtime can see the cgroup memory limit properly.
If you are using the Smart Agent, I would recommend using the docker-container-stats monitor (to which I will make the same modification to exclude cache memory).
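The deduction described in the quoted PR text can be sketched as a small shell function. This is only an illustration, assuming the cgroup v1 file layout used in the commands above (memory.usage_in_bytes and the "cache" line of memory.stat); the function name is made up for this example.

```shell
# Print memory usage minus page cache, in bytes, for a cgroup v1
# memory controller directory. Pass another directory to override
# the default, which matches the paths used inside the container above.
cgroup_usage_without_cache() (
    dir="${1:-/sys/fs/cgroup/memory}"
    usage=$(cat "$dir/memory.usage_in_bytes")
    # memory.stat holds one "name value" pair per line;
    # the "cache" entry is the page cache in bytes.
    cache=$(awk '$1 == "cache" { print $2 }' "$dir/memory.stat")
    echo $(( usage - cache ))
)
```

Running it before and after the dd step in the demonstration should give roughly the same number both times, even though usage_in_bytes itself rises by about 100MB.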
Yes, NiFi in Docker has memory issues: usage shoots up after a while and it restarts on its own. On the other hand, the non-Docker version works absolutely fine.
Details:
Docker:
Run with a 3 GB heap size, it consumes around 2 GB immediately after startup. Run some processors and the machine's fan runs heavily, and it restarts after a while.
Non-Docker:
Run with a 3 GB heap size, it takes 900 MB and runs smoothly. (jconsole)

Kubernetes pods restart issue anomaly

My Java microservices are running in a k8s cluster hosted on AWS EC2 instances.
I have around 30 microservices (a good mix of Node.js and Java 8) running in the cluster. I am facing a challenge where my Java application pods restart unexpectedly, which leads to an increase in the application's 5xx count.
To debug this, I started a New Relic agent in the pod along with the application and found the following graph:
There I can see that my Xmx value is 6 GB and my usage peaks at 5.2 GB.
This clearly shows that the JVM is not exceeding the Xmx value.
But when I describe the pod and look at the last state, it says "Reason:Error" with "Exit code: 137".
On further investigation I found that my pod's average memory usage is close to its limit all the time (allocated 9 GiB, uses ~9 GiB). I cannot understand why memory usage is so high in the pod even though I have only one process running (the JVM), and that is restricted to 6 GiB by Xmx.
When I log in to my worker nodes and check the status of the Docker containers, I can see the last container of that application in the Exited state, and it says "Container exits with non-zero exit code 137".
I can see the worker node kernel logs as:
which show that the kernel is terminating my process running inside the container.
I can see that I have a lot of free memory on my worker node.
I am not sure why my pods restart again and again: is this k8s behaviour, or is something wrong in my infrastructure? This is forcing me to move my application from containers back to VMs, as it leads to an increase in the 5xx count.
EDIT: I am still getting OOM after increasing memory to 12 GB.
I am not sure why the pod is getting killed because of OOM even though the JVM Xmx is only 6 GB.
Need help!
Some older Java versions (prior to the Java 8 u131 release) don’t recognize that they are running in a container. So unless you set the maximum heap size explicitly with -Xmx, the JVM derives its default maximum heap size from the host’s total memory instead of the memory available to the container. Then, when the process tries to allocate memory beyond its limit (defined in the pod/deployment spec), your container gets OOMKilled.
These problems might not show up when running your Java apps in a local K8s cluster, because the difference between the pod memory limit and the total memory of the local machine isn’t big. But when you run in production on nodes with more memory available, the JVM may go over your container memory limit and be OOMKilled.
Starting from Java 8 (u131 release) it is possible to make the JVM “container-aware” so that it recognizes constraints set by container control groups (cgroups).
For Java 8 (from the u131 release) and Java 9 you can set these experimental JVM flags:
-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
This sets the heap size based on your container's cgroup memory limit, which is defined as "resources: limits" in the container definition part of the pod/deployment spec.
There can still be cases where the JVM’s off-heap memory grows in Java 8, so monitor that too; the experimental flags only control how the heap is sized.
From Java 10, this behavior is the default and is enabled or disabled with this flag:
-XX:+UseContainerSupport
-XX:-UseContainerSupport
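As a rough summary of the version rules in this answer, the helper below prints the flags for a given Java version. The function name and its two arguments are hypothetical; it only encodes what is stated above and ignores any later backports or removals of these flags.

```shell
# Print the container-awareness flags described above.
# Arguments: Java major version (e.g. 8, 9, 10) and, for Java 8,
# the update number (e.g. 131). The update defaults to 0.
jvm_container_flags() (
    major="$1"
    update="${2:-0}"
    if [ "$major" -ge 10 ]; then
        # Container support is on by default from Java 10;
        # the flag is printed only to make the choice explicit.
        echo "-XX:+UseContainerSupport"
    elif [ "$major" -eq 9 ] || { [ "$major" -eq 8 ] && [ "$update" -ge 131 ]; }; then
        echo "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
    else
        # Older JVMs have no cgroup awareness: print nothing and
        # size the heap explicitly with -Xmx instead.
        echo ""
    fi
)
```

For example, java $(jvm_container_flags 8 131) -jar app.jar would start a Java 8u131 JVM with both experimental flags (app.jar is a placeholder).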
Since you have limited the maximum memory usage of your pod to 9Gi, it will be terminated automatically when its memory usage reaches 9Gi.
In GCloud App Engine you can specify a max CPU usage threshold, e.g. 0.6, meaning that if CPU usage reaches 60%, a new instance will spawn.
I did not come across such a setting, but maybe a Kubernetes pod/deployment has a similar configuration parameter, meaning that if the pod's RAM usage reaches 0.6 of the limit, the pod is terminated. In your case that would be 60% of 9GB = ~5.4GB. Just some food for thought.

Resources