Kubernetes pods restart anomaly - docker

My Java microservices run in a k8s cluster hosted on AWS EC2 instances.
I have around 30 microservices (a good mix of Node.js and Java 8) running in the cluster. I am facing a challenge where my Java application pods get restarted unexpectedly, which leads to an increase in the application 5xx count.
To debug this, I started a New Relic agent in the pod alongside the application and found the following graph:
There I can see that I have an Xmx value of 6GB and my usage peaks at 5.2GB.
This clearly shows that the JVM is not crossing the Xmx value.
But when I describe the pod and look at the last state, it says "Reason: Error" with "Exit Code: 137".
On further investigation I found that my pod's average memory usage is close to its limit all the time (9GiB allocated, ~9GiB used). I am not able to understand why the memory usage is so high in the pod even though I have only one process running (the JVM), and that one is restricted to a 6GiB Xmx.
When I log in to my worker nodes and check the status of the Docker containers, I can see the last container of that application in the Exited state with "Container exited with non-zero exit code 137".
I can see the worker node kernel logs as:
which show the kernel terminating my process running inside the container.
I can see I have a lot of free memory on my worker node.
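(For reference, the checks above correspond to commands along these lines; pod and container names are illustrative:)
kubectl describe pod <java-app-pod> | grep -A5 "Last State"        # shows Reason and Exit Code
docker ps -a | grep <java-app>                                      # on the worker node: the exited container
dmesg -T | grep -iE "out of memory|oom-killer|killed process"       # kernel OOM messages
free -h                                                             # overall node memory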
I am not sure why my pods get restarted again and again. Is this k8s behaviour or something fishy in my infrastructure? It is forcing me to move my application from containers back to VMs, as it leads to an increase in the 5xx count.
EDIT: I am still getting OOM after increasing the memory to 12GB.
I am not sure why the pod is getting killed because of OOM even though the JVM Xmx is only 6GB.
Need help!

Some older Java versions (prior to the Java 8u131 release) don't recognize that they are running in a container. So the JVM sizes itself based on the host's total memory instead of the memory available to the container, even if you cap the heap with -Xmx, and when the process then allocates memory over the limit defined in the pod/deployment spec, your container gets OOMKilled.
These problems might not show up when running your Java apps in a K8s cluster locally, because the difference between the pod memory limit and the total local machine memory isn't big. But when you run it in production on nodes with much more memory available, the JVM may go over your container memory limit and be OOMKilled.
Starting from Java 8 (the u131 release) it is possible to make the JVM "container-aware" so that it recognizes constraints set by container control groups (cgroups).
For Java 8 (from u131) and Java 9 you can set these experimental flags on the JVM:
-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
They set the heap size based on your container's cgroup memory limit, which is defined as "resources: limits" in the container definition of your pod/deployment spec.
There can still be cases of the JVM's off-heap memory growing in Java 8, so you may want to monitor that, but overall these experimental flags should handle it as well.
From Java 10 this container awareness is the default and is enabled/disabled with this flag:
-XX:+UseContainerSupport
-XX:-UseContainerSupport
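A quick way to check what the JVM will actually use inside a given memory limit (a sketch; the image tags and the 512m limit are illustrative):
docker run -m 512m openjdk:8u131 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XshowSettings:vm -version
docker run -m 512m openjdk:11 java -XshowSettings:vm -version
With the experimental flags (or with Java 10+ and its default -XX:+UseContainerSupport), the reported Max. Heap Size should track the 512m container limit rather than the host's total memory.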

Since you have limited the maximum memory usage of your pod to 9Gi, it will be terminated automatically when its memory usage reaches 9Gi.
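That limit comes from the resources section of the container spec, roughly like this (the limit value follows the question; the request value is illustrative):
resources:
  requests:
    memory: "9Gi"
  limits:
    memory: "9Gi"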

In Google Cloud App Engine you can specify a maximum CPU usage threshold, e.g. 0.6, meaning that if CPU reaches 0.6 of 100% (60%) a new instance will spawn.
I did not come across such a setting, but maybe Kubernetes Pods/Deployments have a similar configuration parameter, i.e. if the RAM of the pod reaches 0.6 of 100%, terminate the pod. In your case that would be 60% of 9GB, roughly 5.4GB. Just some food for thought.

Related

Limit MarkLogic memory consumption in docker container

The project in which I am working develops a Java service that uses MarkLogic 9 in the backend.
We are running a Jenkins build server that executes (amongst others) several tests in MarkLogic written in XQuery.
For those tests MarkLogic is running in a docker container on the Jenkins host (which is running Ubuntu Linux).
The Jenkins host has 12 GB of RAM and 8 GB of swap configured.
Recently I have noticed that the MarkLogic instance running in the container uses a huge amount of RAM (up to 10 GB).
As there are often other build jobs running in parallel, the Jenkins host starts to swap, sometimes even eating up all the swap, so that MarkLogic reports it cannot get more memory.
Obviously, this situation leads to failed builds quite often.
To analyse this further I made some tests on my PC running Docker for Windows and found out that the MarkLogic tests
can be run successfully with 5-6 GB RAM. The MarkLogic logs show that it sees all the host memory and wants to use everything.
But as we have other build processes running on that host this behaviour is not desirable.
My question: is there any possibility to tell the MarkLogic to not use so much memory?
We are preparing the docker image during the build, so we could modify some configuration, but it has to be scripted somehow.
The issue of the container not detecting the memory limit correctly has been identified and should be addressed in a forthcoming release.
In the meantime, you might be able to mitigate the issue by:
changing the group cache sizing from automatic to manual and setting cache sizes appropriate for the allocated resources. There are a variety of ways to set these configs, whether deploying the settings from an ml-gradle project, making your own Manage API REST calls (a sketch follows after this list), or programmatically:
admin:group-set-cache-sizing
admin:group-set-compressed-tree-cache-partitions
admin:group-set-compressed-tree-cache-size
admin:group-set-expanded-tree-cache-partitions
admin:group-set-expanded-tree-cache-size
admin:group-set-list-cache-partitions
admin:group-set-list-cache-size
reducing the in-memory-limit
The in-memory limit specifies the maximum number of fragments in an in-memory stand. An in-memory stand contains the latest version of any new or changed fragments. Periodically, in-memory stands are written to disk as a new stand in the forest. Also, if a stand accumulates a number of fragments beyond this limit, it is automatically saved to disk by a background thread.
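For the Manage API route, a hedged sketch of a REST call against the group properties endpoint could look like the following; the group name Default, port 8002, the credentials, the cache sizes (in MB) and the exact property names are assumptions that should be checked against the MarkLogic 9 documentation:
curl --anyauth -u admin:admin -X PUT \
  -H "Content-Type: application/json" \
  -d '{
        "cache-sizing": "manual",
        "list-cache-size": 512,
        "compressed-tree-cache-size": 256,
        "expanded-tree-cache-size": 512
      }' \
  http://localhost:8002/manage/v2/groups/Default/properties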

Using k8s node resources outside of k8s

What would happen with Kubernetes scheduling if I have a Kubernetes node but use the container (Docker) engine for some other stuff, outside of the context of Kubernetes?
For example, if I manually SSH to the respective node and do a docker run of something, would Kubernetes scheduling take into account the fact that this node is busy running other stuff and might not be able to host any other containers now?
What would happen in the following scenario:
Node with 8 GB RAM
running a pod with a resource request of 2 GB, a limit of 4 GB, and current usage of 3 GB
SSH to the node and docker run a container with a 5 GB limit, using all of it
P.S. Please skip the "why would you go and run docker run directly on the node" questions. I don't want to, but reasons.
I'm pretty sure Kubernetes's scheduling only considers (a) pods it knows about and not other resources, and (b) only their resource requests.
In the situation you describe, with exactly that resource utilization, things will work fine. The pod can be scheduled on the node because the total resource requests on it are 2 GB out of 8 GB. The total memory usage doesn't exceed the physical memory size either, so you're okay.
Say the pod allocated a little bit more memory. Now the system as a whole is above its physical memory capacity, so the Linux kernel will arbitrarily kill something off, often the process using the most memory. You'll typically see an exit code of 137 (matching SIGKILL) in whichever system manages it.
This behavior is the same even if you run your side job in something like a DaemonSet. It requests 2 GB of RAM, so both pods fit on the same node [4 GB/8 GB], but if it has a resource limit of 6 GB RAM, something will get killed off.
The place where things are different is if you can predict the high memory use. Say your pod requests 3 GB/limits 6 GB of RAM, and your side process will predictably also use 6 GB. If you just docker run it something will definitely get OOM-killed. If you run it as a DaemonSet declaring a 6 GB memory request, the Kubernetes scheduler will know the pod doesn't fit and won't place it there (it may get stuck in "Pending" state if it can't be scheduled anywhere).
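A minimal sketch of declaring the side job as a DaemonSet with an explicit memory request, so the scheduler can account for it (the name and image are illustrative placeholders):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: side-job                 # hypothetical name
spec:
  selector:
    matchLabels:
      app: side-job
  template:
    metadata:
      labels:
        app: side-job
    spec:
      containers:
      - name: side-job
        image: busybox           # placeholder for the actual workload image
        command: ["sh", "-c", "sleep infinity"]
        resources:
          requests:
            memory: "6Gi"        # makes the scheduler account for the side job
          limits:
            memory: "6Gi"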
Kubernetes won't see other processes running on the host; however, you can tell the kubelet on that host how much of the host's resources to reserve for the host itself, preventing Kubernetes from scheduling pods that would exceed the host's capacity. See the --system-reserved flag that you can pass to the kubelet:
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
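The same reservation can also be expressed in the kubelet configuration file; a sketch with illustrative values:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi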

Kubernetes OOM pod killed because kernel memory grows too much

I am working on a Java service that basically creates files in a network file system to store data. It runs in a k8s cluster on Ubuntu 18.04 LTS.
When we began to limit the memory in Kubernetes (limits: memory: 3Gi), the pods began to be OOMKilled by Kubernetes.
At the beginning we thought it was a memory leak in the Java process, but analyzing it more deeply we noticed that the problem is the kernel memory.
We validated that by looking at the file /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes
We isolated the case to only creating files (without Java), with the dd command, like this:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done
And with the dd command we saw the same thing happen (the kernel memory grew until OOM).
After k8s restarted the pod, describing the pod showed:
Last State: Terminated
Reason: OOMKilled
Exit Code: 143
Creating files causes the kernel memory to grow; deleting those files causes the memory to decrease. But our service stores data, so it creates a lot of files continuously until the pod is killed and restarted because of the OOMKill.
We tested limiting the kernel memory using standalone Docker with the --kernel-memory parameter and it worked as expected: the kernel memory grew to the limit and did not rise any more. But we did not find any way to do that in a Kubernetes cluster.
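The standalone Docker test was along these lines (image and sizes are illustrative; note that --kernel-memory requires cgroup v1 kernel memory accounting and has been deprecated in newer Docker releases):
docker run -it --rm -m 3g --kernel-memory 512m ubuntu:18.04 bash
# inside the container, reproduce the file-creation workload:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done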
Is there a way to limit the kernel memory in a K8s environment?
Why does creating files cause the kernel memory to grow, and why is it not released?
Thanks for all this info, it was very useful!
In my app, I solved this by creating a sidecar container that runs a cron job every 5 minutes with the following command:
echo 3 > /proc/sys/vm/drop_caches
(note that you need the sidecar container to run in privileged mode)
It works nicely and has the advantage of being predictable: every 5 minutes, your memory cache will be cleared.
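A hedged sketch of the relevant fragment of such a pod spec (container names and images are illustrative; a simple sleep loop stands in for cron):
containers:
- name: app
  image: my-java-service:latest           # placeholder for your application container
- name: drop-caches                        # hypothetical sidecar
  image: busybox
  securityContext:
    privileged: true                       # required to write to /proc/sys/vm/drop_caches
  command: ["sh", "-c", "while true; do sleep 300; echo 3 > /proc/sys/vm/drop_caches; done"]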

Kubernetes throwing OOM for pods running a JVM

I am running Docker containers containing a JVM (Java 8u31). These containers are deployed as pods in a Kubernetes cluster. Often I get OOMs for the pods, and Kubernetes kills and restarts them. I am having issues finding the root cause of these OOMs as I am new to Kubernetes.
Here are the JVM parameters
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Xms700M -Xmx1000M -XX:MaxRAM=1536M -XX:MaxMetaspaceSize=250M
These containers are deployed as a stateful set and the following is the resource allocation:
resources:
  requests:
    memory: "1.5G"
    cpu: 1
  limits:
    memory: "1.5G"
    cpu: 1
so the total memory allocated to the container matches MaxRAM.
If I use -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/etc/opt/jmx/java_pid%p.hprof, that doesn't help, because the pod is killed, recreated and restarted as soon as there is an OOM, so everything within the pod is lost.
The only way to get a thread or heap dump is to SSH into the pod, which I am also not able to do in time, because the pod is recreated after an OOM, so I don't get the memory footprint at the time of the OOM. I SSH in after the OOM, which is not of much help.
I also profiled the code using VisualVM and jHat but couldn't find a substantial memory footprint that could lead to a conclusion of too much memory consumption by the threads running within the JVM or a probable leak.
Any help is appreciated to resolve the OOM thrown by Kubernetes.
When your application in a pod reaches the memory limit you set via resources.limits.memory or the namespace limit, Kubernetes restarts the pod.
The Kubernetes part of limiting resources is described in the following articles:
Kubernetes best practices: Resource requests and limits
Resource Quotas
Admission control plugin: ResourceQuota
Assign Memory Resources to Containers and Pods
The memory consumed by a Java application is not limited to the size of the heap, which you can set by specifying the options:
-Xms<size> Specifies the initial heap size.
-Xmx<size> Specifies the maximum heap size.
A Java application needs additional memory for metaspace, class space and thread stacks, and the JVM itself needs even more memory to do its tasks like garbage collection, JIT optimization, off-heap allocations and JNI code.
It is hard to predict total memory usage of JVM with reasonable precision, so the best way is to measure it on the real deployment with usual load.
I would recommend setting the Kubernetes pod limit to double the Xmx size, checking that you no longer get OOMs, and then gradually decreasing it to the point where you start getting OOMs again. The final value should be in the middle between these points.
You can get a more precise value from memory usage statistics in a monitoring system like Prometheus.
On the other hand, you can try to limit Java memory usage by specifying a number of the available options, like the following (a combined example is sketched after the list):
-Xms<heap size>[g|m|k] -Xmx<heap size>[g|m|k]
-XX:MaxMetaspaceSize=<metaspace size>[g|m|k]
-Xmn<young size>[g|m|k]
-XX:SurvivorRatio=<ratio>
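For example, a combined invocation could look like this (the sizes and the app.jar name are illustrative and must be tuned to your workload and container limit):
java -Xms512m -Xmx1g -XX:MaxMetaspaceSize=256m -Xmn256m -XX:SurvivorRatio=8 -jar app.jar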
More details on that can be found in these articles:
Properly limiting the JVM’s memory usage (Xmx isn’t enough)
Why does my Java process consume more memory than Xmx
The second way to limit JVM memory usage is to calculate the heap size based on the amount of RAM (or MaxRAM). There is a good explanation of how it works in the article:
The default sizes are based on the amount of memory on a machine, which can be set with the -XX:MaxRAM=N flag.
Normally, that value is calculated by the JVM by inspecting the amount of memory on the machine.
However, the JVM limits MaxRAM to 1 GB for the client compiler, 4 GB for 32-bit server compilers, and 128 GB for 64-bit compilers.
The maximum heap size is one-quarter of MaxRAM.
This is why the default heap size can vary: if the physical memory on a machine is less than MaxRAM, the default heap size is one-quarter of that.
But even if hundreds of gigabytes of RAM are available, the most the JVM will use by default is 32 GB: one-quarter of 128 GB. The default maximum heap calculation is actually this:
Default Xmx = MaxRAM / MaxRAMFraction
Hence, the default maximum heap can also be set by adjusting the value of the -XX:MaxRAMFraction=N flag, which defaults to 4.
Finally, just to keep things interesting, the -XX:ErgoHeapSizeLimit=N flag can also be set to a maximum default value that the JVM should use.
That value is 0 by default (meaning to ignore it); otherwise, that limit is used if it is smaller than MaxRAM / MaxRAMFraction.
The initial heap size choice is similar, though it has fewer complications. The initial heap size value is determined like this:
Default Xms = MaxRAM / InitialRAMFraction
As can be concluded from the default minimum heap sizes, the default value of the InitialRAMFraction flag is 64.
The one caveat here occurs if that value is less than 5 MB, or, strictly speaking, less than the values specified by -XX:OldSize=N (which defaults to 4 MB) plus -XX:NewSize=N (which defaults to 1 MB).
In that case, the sum of the old and new sizes is used as the initial heap size.
This article gives you a good point to start tuning your JVM for web-oriented application:
Java VM Options You Should Always Use in Production
If you are able to run on Java 11 (or 10) instead of 8, the memory limit options have been much improved (plus the JVM is cgroups-aware). Just use -XX:MaxRAMPercentage (range 0.0 to 100.0):
$ docker run -m 1GB openjdk:11 java -XshowSettings:vm -XX:MaxRAMPercentage=80 -version
VM settings:
Max. Heap Size (Estimated): 792.69M
Using VM: OpenJDK 64-Bit Server VM
openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment (build 11.0.1+13-Debian-2)
OpenJDK 64-Bit Server VM (build 11.0.1+13-Debian-2, mixed mode, sharing)
That way, you can easily specify 80% of available container memory for the heap, which wasn't possible with the old options.
Thanks @VAS for your comments. Thanks for the Kubernetes links.
After a few tests I think that it's not a good idea to specify Xmx if you are using -XX:+UseCGroupMemoryLimitForHeap, since Xmx overrides it. I am still doing some more tests and profiling.
Since my requirement is running a JVM inside a Docker container, I did a few tests as mentioned in the posts by @Eugene. Considering that every app running inside a JVM needs heap plus some native memory, I think we need to specify -XX:+UnlockExperimentalVMOptions, -XX:+UseCGroupMemoryLimitForHeap and -XX:MaxRAMFraction=1 (considering only the JVM is running inside the container; at the same time it's risky), plus -XX:MaxRAM (I think we should specify this if MaxRAMFraction is 1, so that you leave some memory for native allocations).
A few tests:
As per the Docker configuration below, the container is allocated 1 GB, considering you only have the JVM running inside the container. Given Docker's allocation of 1G, and since I also want to allocate some memory to the process/native memory, I think I should use MaxRAM=700M so that I have 300 MB left for native memory.
$ docker run -m 1GB openjdk:8u131 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=1 -XX:MaxRAM=700M -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 622.50M
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
Now, specifying -XX:MaxRAMFraction=1 might get the container killed:
references: https://twitter.com/csanchez/status/940228501222936576?lang=en
Is -XX:MaxRAMFraction=1 safe for production in a containered environment?
The following would be better; please note I have removed MaxRAM since MaxRAMFraction > 1:
$ docker run -m 1GB openjdk:8u131 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 455.50M
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
This leaves the remaining ~500M for native memory, which could e.g. be used for Metaspace by specifying -XX:MaxMetaspaceSize:
$ docker run -m 1GB openjdk:8u131 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -XX:MaxMetaspaceSize=200M -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 455.50M
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
Logically, and also as per the above references, it makes sense to specify -XX:MaxRAMFraction > 1. This also depends on the application profiling done.
I am still doing some more tests and will update or post those results. Thanks!
Recently I've also come across a similar issue:
Java 11.0.11+9 + Kubernetes running Docker containers in pods
similar config as the OP:
resources:
  requests:
    memory: "1G"
    cpu: 400m
  limits:
    memory: "1G"
with -XX:MaxRAMPercentage=60.0
Our service uploads and downloads a lot of data, so direct memory is used, and in this issue I found that MaxDirectMemorySize defaults to the heap size. So if we calculate the memory usage, it could go beyond the 1G limit (1G * 0.6 * 2). In this case we increased the memory to 1.5G and changed -XX:MaxRAMPercentage=35.0 so that we have enough space for heap + direct memory and even for some OS-related tasks. Be cautious when you set MaxRAMPercentage or Xmx in a container environment.
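An alternative (a sketch; the values and app.jar are illustrative) is to cap direct memory explicitly instead of letting it default to the heap size, so heap plus direct memory stays predictable within the container limit:
java -XX:MaxRAMPercentage=60.0 -XX:MaxDirectMemorySize=256m -jar app.jar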

Nifi 1.6.0 memory leak

We're running Docker containers of NiFi 1.6.0 in production and have come across a memory leak.
Once started, the app runs just fine; however, after a period of 4-5 days, the memory consumption on the host keeps increasing. When checked in the NiFi cluster UI, the JVM heap usage is hardly around 30%, but the memory at the OS level goes to 80-90%.
On running the docker stats command, we found that the NiFi Docker container is consuming the memory.
After collecting the JMX metrics, we found that the RSS memory keeps growing. What could be the potential cause of this? In the JVM tab of the cluster dialog, young GC also seems to be happening in a timely manner, with old GC counts shown as 0.
How do we go about identifying in what's causing the RSS memory to grow?
You need to replicate that in a non-Docker environment, because with Docker the reported memory is known to rise.
As I explained in "Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container", docker has some bugs (like issue 10824 and issue 15020) which prevent an accurate report of the memory consumed by a Java process within a Docker container.
That is why a plugin like signalfx/docker-collectd-plugin mentions (two weeks ago, at the time of writing) in its PR (pull request) 35 that it will "deduct the cache figure from the memory usage percentage metric":
Currently the calculation for memory usage of a container/cgroup being returned to SignalFX includes the Linux page cache.
This is generally considered to be incorrect, and may lead people to chase phantom memory leaks in their application.
For a demonstration on why the current calculation is incorrect, you can run the following to see how I/O usage influences the overall memory usage in a cgroup:
docker run --rm -ti alpine
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
dd if=/dev/zero of=/tmp/myfile bs=1M count=100
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
You should see that the usage_in_bytes value rises by 100MB just from creating a 100MB file. That file hasn't been loaded into anonymous memory by an application, but because it's now in the page cache, the container memory usage is appearing to be higher.
Deducting the cache figure in memory.stat from the usage_in_bytes shows that the genuine use of anonymous memory hasn't risen.
The signalFX metric now differs from what is seen when you run docker stats which uses the calculation I have here.
It seems like knowing the page cache use for a container could be useful (though I am struggling to think of when), but knowing it as part of an overall percentage usage of the cgroup isn't useful, since it then disguises your actual RSS memory use.
In a garbage collected application with a max heap size as large, or larger than the cgroup memory limit (e.g the -Xmx parameter for java, or .NET core in server mode), the tendency will be for the percentage to get close to 100% and then just hover there, assuming the runtime can see the cgroup memory limit properly.
If you are using the Smart Agent, I would recommend using the docker-container-stats monitor (to which I will make the same modification to exclude cache memory).
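Following that logic, a rough way to see the "genuine" usage inside the container (assuming the cgroup v1 paths used in the demonstration above) is:
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
cache=$(awk '/^cache / {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo $(( usage - cache ))   # approximates the anonymous (RSS-like) memory in bytes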
Yes, the NiFi Docker container has memory issues; memory shoots up after a while and it restarts on its own. On the other hand, the non-Docker deployment works absolutely fine.
Details:
Docker:
Run it with a 3GB heap size and immediately after startup it consumes around 2GB. Run some processors and the machine's fan runs heavily; it restarts after a while.
Non-Docker:
Run it with a 3GB heap size and it takes 900MB and runs smoothly (jconsole).
