Using k8s node resources outside of k8s - docker

What would happen with Kubernetes scheduling if I have a Kubernetes node, but I use the container (Docker) engine for some other stuff, outside of the context of Kubernetes?
For example, if I manually SSH to the node and do docker run something, would Kubernetes scheduling take into account the fact that this node is busy running other stuff and might not be able to host any other containers now?
What would happen in the following scenario:
Node with 8 GB RAM
running a pod with resource request 2 GB, limit 4 GB, and current usage 3 GB
SSH to the node and docker run a container with 5 GB, using all of it
P.S. Please skip the "why would you go and run docker run directly on the node" questions. I don't want to, but reasons.

I'm pretty sure Kubernetes's scheduling only considers (a) pods it knows about and not other resources, and (b) only their resource requests.
In the situation you describe, with exactly that resource utilization, things will work fine. The pod can be scheduled on the node because the total resource requests on it are 2 GB of 8 GB. The total memory usage doesn't exceed the physical memory size either, so you're okay.
Say the pod allocated a little bit more memory. Now the system as a whole is above its physical memory capacity, so the Linux kernel OOM killer will kill something off, often the process using the most memory. You'll typically see an exit code of 137 (128 + SIGKILL) in whichever system manages it.
This behavior is the same even if you run your side job in something like a DaemonSet. If it requests 2 GB of RAM, both pods fit on the same node (4 GB of 8 GB requested), but if it has a resource limit of 6 GB of RAM and actually uses that much, total usage can exceed physical memory and something will get killed off.
The place where things are different is if you can predict the high memory use. Say your pod requests 3 GB/limits 6 GB of RAM, and your side process will predictably also use 6 GB. If you just docker run it something will definitely get OOM-killed. If you run it as a DaemonSet declaring a 6 GB memory request, the Kubernetes scheduler will know the pod doesn't fit and won't place it there (it may get stuck in "Pending" state if it can't be scheduled anywhere).
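To make the scheduler's view concrete: it compares the sum of the pods' requests on the node against the node's allocatable memory, and a container started with a bare docker run appears in neither number. A rough sketch with the official Kubernetes Python client (the node name and the simplistic quantity parsing are assumptions for illustration):

from kubernetes import client, config

def gib(quantity):
    # very rough parser for Gi/Mi/Ki quantities; plain byte values fall through
    units = {"Gi": 1, "Mi": 1.0 / 1024, "Ki": 1.0 / (1024 * 1024)}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity) / (1024 ** 3)

config.load_kube_config()
v1 = client.CoreV1Api()

node_name = "my-node"  # hypothetical node name
node = v1.read_node(node_name)
allocatable = gib(node.status.allocatable["memory"])

pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
requested = sum(
    gib(c.resources.requests["memory"])
    for pod in pods.items
    for c in pod.spec.containers
    if c.resources and c.resources.requests and "memory" in c.resources.requests
)

# the scheduler only compares requested vs. allocatable; a container started with
# plain `docker run` on the node shows up in neither number
print(f"requested {requested:.1f} Gi of {allocatable:.1f} Gi allocatable on {node_name}")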

Kubernetes won't see other processes running on the host; however, you can tell the kubelet on that host how much of the host's resources to reserve for the host itself, preventing Kubernetes from scheduling pods that would exceed the host's capacity. See the --system-reserved flag that you can pass to the kubelet:
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
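For example, if you know roughly how much the out-of-band docker workload needs, you could reserve it up front (values purely illustrative, not a recommendation):
--system-reserved=cpu=500m,memory=5Gi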

Related

Parallel Docker Container Creation

I am using a Docker setup that consists of 14 different containers. Every container gets a cpu_limit of 2 and a mem_limit of 2g.
To create and run these containers, I've written a Python script that uses the docker-py library. As of now, the containers are created sequentially, which takes approximately 2 minutes.
Now I'm thinking about parallelizing the process. So instead of doing (it's pseudocode):
for container in containers_to_start:
    create_container(container)
I do
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)
pool.map(create_container, containers_to_start)
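Filled out, the threaded version looks roughly like the following sketch (the image and container names are placeholders, and mapping the cpu_limit of 2 onto nano_cpus in the high-level docker SDK is an assumption about how the limits are applied):

import docker
from multiprocessing.dummy import Pool as ThreadPool

client = docker.from_env()

containers_to_start = [
    {"name": f"app-{i}", "image": "myorg/myapp:latest"}  # hypothetical names/images
    for i in range(14)
]

def create_container(spec):
    # detach so the call returns as soon as the container is created and started;
    # mem_limit/nano_cpus mirror the 2g / 2-CPU limits described above
    return client.containers.run(
        spec["image"],
        name=spec["name"],
        detach=True,
        mem_limit="2g",
        nano_cpus=2 * 10**9,
    )

pool = ThreadPool(4)
pool.map(create_container, containers_to_start)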
And as a result the 14 containers are created 2x faster. BUT: the applications within the containers take significantly longer to boot. At the end of the day I don't really gain much; the time until every application is reachable is more or less the same, with or without multithreading.
But I don't really know why, because every container gets the same amount of CPU and memory resources, so I would expect the same boot time no matter how many containers are starting at the same time. Clearly this is not the case. Maybe I'm missing some knowledge here; any explanation would be greatly appreciated.
System Specs
CPU: Intel i7 @ 2.90 GHz
32GB RAM
I am using Windows 10 with Docker running on the WSL2 backend.

Kubernetes: High CPU usage and zombie processes when "Disk Pressure" event

We run a small on-premise K8s cluster (based on the RKE stack): 1x etcd/control node, 2x worker nodes. Components are:
OS: CentOS 7
Docker version: 19.3.9
K8s: 1.17.2
Another important fact: we're using a Rook-Ceph storage cluster on both worker nodes (Rook v1.2.4, Ceph version 14.2.7).
When one of the OS mounts reaches 90%+ usage (for example /var), K8s reports "Disk Pressure" and disables the node, which is fine. But when this happens, the load starts growing to dozens (for example 30+ or 40+ on a machine with 4 vCPUs), many of the container processes (children of containerd-shim) go into the zombie (defunct) state, and the whole K8s cluster collapses.
At first we thought it was a Rook-Ceph problem with XFS storage (described at https://github.com/rook/rook/issues/3132#issuecomment-580508760), so we switched to EXT4 (because we cannot upgrade the kernel to 5.6+), but during the last weekend it happened again, and we are sure this case is related to the Disk Pressure event. The last contact with the (by then) dead node was on 21-01 around 13:50, but the load started growing at 13:07 and quickly reached 30.5:
/var usage went from 89.97% to 90%+ exactly at 13:07 that day:
Can you point us to what we need to check in the K8s configuration, logs, or anywhere else to find out what is going on? Why is K8s collapsing during a fairly normal event?
(For clarification: we know that we're using quite old versions, but we'll do a complete upgrade of the environment within a few weeks.)

Google Kubernetes logs

Memory cgroup out of memory: Kill process 545486 (python3) score 2016 or sacrifice child Killed process 545486 (python3) total-vm:579096kB, anon-rss:518892kB, file-rss:16952kB
The node logs this, and my container keeps restarting randomly. I'm running a Python container with 4 replicas.
The Python application uses a socket with Flask. The Docker image is based on python3.5:slim.
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-XXXXXXX-cluster-highmem-pool-gen2-f2743e02-msv2 682m 17% 11959Mi 89%
This morning the node log showed: 0/1 nodes are available: 1 Insufficient cpu.
But the node's CPU usage is only 17%.
There is not much running inside the pod.
Have a look at the best practices and try to adjust resource requests and limits for CPU and memory. If your app starts hitting its CPU limits, Kubernetes starts throttling your container. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated (and restarted). So using suitable limits should help you solve the problem with your containers restarting.
If your container's resource requests exceed what the node can allocate, Kubernetes reports an error similar to the one you have ("Insufficient cpu") and won't schedule the container; scheduling is based on requests rather than on current usage, which is why the node can be full even though its measured CPU usage is only 17%.
After adjusting the limits, you could use a monitoring system (like Stackdriver) to find the cause of a potential memory leak.
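If it helps, requests and limits can also be applied without editing the manifest by hand, e.g. with kubectl set resources (the deployment and container names below are placeholders):
kubectl set resources deployment python-app -c=python-app --requests=cpu=250m,memory=512Mi --limits=cpu=500m,memory=1Gi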

Kubernetes pods restart issue anomaly

My Java microservices are running in a K8s cluster hosted on AWS EC2 instances.
I have around 30 microservices (a good mix of Node.js and Java 8) running in the cluster. I am facing a challenge where my Java application pods get restarted unexpectedly, which leads to an increase in the application 5xx count.
To debug this, I started a New Relic agent in the pod along with the application and found the following graph:
There I can see that the Xmx value is 6 GB and my usage peaks at 5.2 GB.
This clearly shows that the JVM is not crossing the Xmx value.
But when I describe the pod and look at the last state, it says "Reason: Error" with "Exit code: 137".
On further investigation I found that my pod's average memory usage is close to its limit all the time (9 GiB allocated, ~9 GiB used). I am not able to understand why memory usage is so high in the pod even though I have only one process running (the JVM), and that is restricted to a 6 GiB Xmx.
When I log in to my worker nodes and check the status of the Docker containers, I can see the last container of that application in the Exited state, saying "Container exits with non-zero exit code 137".
The worker node kernel logs show the kernel terminating my process running inside the container.
I can see I have a lot of free memory on my worker node.
I am not sure why my pods get restarted again and again; is this K8s behaviour or something fishy in my infrastructure? It is forcing me to move my application from containers back to VMs, as it leads to an increase in the 5xx count.
EDIT: I am still getting OOM after increasing memory to 12 GB.
I am not sure why the pod is getting OOM-killed even though the JVM Xmx is only 6 GB.
Need help!
Some older Java versions (prior to the Java 8u131 release) don't recognize that they are running in a container. So even if you specify a maximum heap size for the JVM with -Xmx, the JVM will set the maximum heap size based on the host's total memory instead of the memory available to the container, and then, when a process tries to allocate memory over its limit (defined in the pod/deployment spec), your container gets OOMKilled.
These problems might not show up when running your Java apps in a K8s cluster locally, because the difference between the pod memory limit and the total local machine memory isn't big. But when you run it in production on nodes with more memory available, the JVM may go over your container memory limit and be OOMKilled.
Starting from Java 8 (the u131 release) it is possible to make the JVM "container-aware" so that it recognizes constraints set by container control groups (cgroups).
For Java 8 (from the u131 release) and Java 9 you can set these experimental flags on the JVM:
-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
They set the heap size based on your container's cgroup memory limit, which is defined as "resources: limits" in the container definition part of your pod/deployment spec.
There can still be cases of the JVM's off-heap memory use increasing in Java 8, so you might want to monitor that, but overall those experimental flags should handle it as well.
From Java 10 this container awareness is the default, and it is enabled/disabled with these flags:
-XX:+UseContainerSupport
-XX:-UseContainerSupport
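For instance, a container entrypoint on Java 10+ might look like this (the jar name is a placeholder, and -XX:MaxRAMPercentage, which caps the heap at a fraction of the cgroup memory limit, is an optional extra rather than something required by the flags above):
java -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -jar app.jar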
Since you have limited the maximum memory usage of your pod to 9Gi, it will be terminated automatically when its memory usage gets to 9Gi.
In GCloud App Engine you can specify a max CPU usage threshold, e.g. 0.6, meaning that if CPU reaches 0.6 of 100% - 60% - a new instance will spawn.
I did not come across such a setting, but maybe a Kubernetes Pod/Deployment has a similar configuration parameter; meaning, if the RAM of the pod reaches 0.6 of 100%, terminate the pod. In your case that would be 60% of 9 GB = ~5.4 GB. Just some food for thought.

What defines what containers in Pods 'see' in terms of their limits and requests?

When a container in a Pod is created in a Kubernetes cluster with a limit and request set, how aware can that container be of those limits and requests? Would an application running inside the container be able to get these limits and requests to, for example, reduce the amount of resources it uses if the limits and requests were particularly low?
Kubernetes version: 1.8
Container runtime: Docker
Docker version: 1.12.6
See "Check mem_limit within a docker container"; the tl;dr is that
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
will show the limit. Presumably the requests value is the amount of memory the container started with, but I would need to verify that assumption.
I personally don't even understand the unit used for limits: cpu:, so I certainly wouldn't know how to verify that value.
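For instance, an application could read that memory limit file itself at startup; a minimal sketch (this assumes the cgroup v1 path quoted above; cgroup v2 hosts expose the limit elsewhere):

# read the memory limit imposed on this container (cgroup v1 layout)
with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
    limit_bytes = int(f.read().strip())
print(f"memory limit: {limit_bytes / (1024 ** 2):.0f} MiB")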
The Downward API can be used to pass the requests and limits to the container process as environment variables.
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
To get information about resource usage, you need a monitoring system configured for your cluster (Heapster, Prometheus, etc.). Requests and limits can then be adjusted according to this data, manually or automatically.
One possible way to automate this process is to create a dedicated microservice that watches resource usage (by collecting and analyzing data from the monitoring system), generates manifests with new limits, and sends requests to the kube API to recreate the pods.
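As a rough sketch of that last step, the new limits could be applied as a strategic-merge patch through the Kubernetes Python client (the deployment, namespace, container name, and values are all placeholders):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# strategic-merge patch: the container is matched by name
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "my-app",  # placeholder container name
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "512Mi"},
                            "limits": {"cpu": "500m", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

# patching the pod template triggers a normal rolling update, recreating the pods
apps.patch_namespaced_deployment(name="my-app", namespace="default", body=patch)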
