Google Kubernetes logs - docker

Memory cgroup out of memory: Kill process 545486 (python3) score 2016 or sacrifice child Killed process 545486 (python3) total-vm:579096kB, anon-rss:518892kB, file-rss:16952kB
This node logs and my container is continuously restarting randomly. Running python cotnainer with 4 replicas.
Python application contains socket with a flask. Docker image contain of python3.5:slim
Kubectl get nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-XXXXXXX-cluster-highmem-pool-gen2-f2743e02-msv2 682m 17% 11959Mi 89%
Today morning node log : 0/1 nodes are available: 1 Insufficient cpu.
But node CPU usage is 17% only
There not much running inside pod.

Have a look at the best practices and try to adjust resource requests and limits for CPU and memory. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated (and restarted). So, using suitable limits should help you to solve your problem with restarts of your containers.
In case request of your container exceeded limits, Kubernetes will throw an error, similar to one you have, and won’t let you run the container.
After adjusting limits, you could use some monitoring system (like Stackdriver) to find the cause of potential memory leak.

Related

Containers: high cpu usage in %soft (soft IRQ) for network-intensive workloads

I'm trying to debug some performance issues on a RHEL8.3 server.
The server is actually a Kubernetes worker nodes and hosts several Redis containers (PODs).
These containers are doing a lot of network I/O (iptraf-ng reports about 500 kPPS and 1.5Gbps).
The server is an high-end Dell server with 104 cpus and 10Gbps NICs.
The issue I'm trying to debug is related to soft IRQs. In short: despite my attempts to set IRQ affinity of the NIC on a specific range of dedicated CPUs, the utility "mpstat" is still reporting a lot of CPU spent in "soft%" on all the CPUs where the "redis-server" process is running (even if redis-server has been moved using taskset to a non-overlapping range of dedicated CPU cores).
For more details consider the attached screenshot redis_server_and_mpstat:
the "redis-server" with PID 3592506 can run only on CPU 80 (taskset -pc 3592506 returns 80 only)
as can be seen from the "mpstat" output, it's running close to 100%, with 25-28% of the time spent in "%soft" time
In the attempt to address this problem, I've been using the Mellanox IRQ affinity script (https://github.com/Mellanox/mlnx-tools/blob/master/ofed_scripts/set_irq_affinity.sh) to "move" all IRQs related to the NICs on a separate set of CPUs (namely CPUs 1,3,5,7,9,11,13,15,17 that belong to NUMA1) for both NICs (eno1np0, eno2np1) that compose the "bond0" bonded interface used by the server, see the screenshot set_irq_affinity. Moreover the "irqbalance" daemon has been stopped and disabled.
The result is that mpstat is now reporting a consistent CPU usage from CPUs 1,3,5,7,9,11,13,15,17 in "%soft" time, but at the same time the redis-server is still spending 25-28% of its time spent in "%soft" column (i.e. nothing has changed for redis-server).
This pattern is repeated for all instances of "redis-server" running on that server (there's more than 1), while other CPUs having no redis-server scheduled, are 100% idle.
Finally in a different environment based on RHEL7.9 (kernel 3.10.0) and a non-containerized deployment of Redis, I see that, when running the "set_irq_affinity.sh" script to move IRQs away from Redis CPUs, Redis %soft column goes down to zero.
Can you help me to understand why running redis into a Kubernetes container (with kernel 4.18.0), the redis-server process will continue to spend a consistent amount of time in %soft handling, despite NIC IRQs having affinity on different CPUs ?
Is it possible that the time the redis-server process spends in "soft IRQ" handling is due to the veth virtual ethernet device created by the containerization technology (in this case the Kubernetes CNI is Flannel, using all default settings) ?
Thanks

Using k8s node resources out of k8s

What would happen with kubernetes scheduling if I have a kubernetes node, but I use the container (docker) engine for some other stuff, outside of the context of kubernetes.
For example if I manually SSH to the respective node and I do docker run something. Would kubernetes scheduling take into account the fact that this node is busy running other stuff, and it might not be able to host any other containers now?
What would happen in the following scenario:
Node with 8 GB RAM
running a pod with resource request 2 GB, limit 4 GB, and current usage 3 GB
ssh on node and docker run a container with 5 GB, using all
P.S. Please skip the "why would you go and run docker run directly on the node" questions. I don't want to, but reasons.
I'm pretty sure Kubernetes's scheduling only considers (a) pods it knows about and not other resources, and (b) only their resource requests.
In the situation you describe, with exactly that resource utilization, things will work fine. The pod can be scheduled on the node because the total resource requests using it are 2 GB out of 8 GB. The total memory usage doesn't exceed the physical memory size either, so you're okay.
Say the pod allocated a little bit more memory. Now the system as a whole is above its physical memory capacity, so the Linux kernel will arbitrarily kill something off. This is often the largest thing. You'll typically see an exit code of 137 (matching SIGKILL) in whichever system manages it.
This behavior is the same even if you run your side job in something like a DaemonSet. It requests 2 GB of RAM, so both pods fit on the same node [4 GB/8 GB], but if it has a resource limit of 6 GB RAM, something will get killed off.
The place where things are different is if you can predict the high memory use. Say your pod requests 3 GB/limits 6 GB of RAM, and your side process will predictably also use 6 GB. If you just docker run it something will definitely get OOM-killed. If you run it as a DaemonSet declaring a 6 GB memory request, the Kubernetes scheduler will know the pod doesn't fit and won't place it there (it may get stuck in "Pending" state if it can't be scheduled anywhere).
Kubernetes won't see other processes running on the host, however you can tell the kubelet on that host how much of the host resources to reserve for the host itself, preventing Kubernetes from scheduling pods that would exceed the host capacity. See the --system-reserved flag that you can pass to the kubelet:
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

Fail docker container after cpu limit

I am running a java process as a docker swarm service. But that service is hogging my CPU, eventually. I tried with CPU limit as 1, and docker stats showing that container to be consistent 100%, but I want to fail that container in 95% and recreated. Is there any way I can accomplish this?
Thanks in advance.
CPU is a compressible resource, unlike memory. When memory requests exceed the limit, the kernel will kill the app. When CPU exceeds the limit, the kernel simply gives that process less time on the CPU and it runs slower.
There's no built in capability to change this behavior. You would need to implement some form of external monitoring with the ability to kill the container when a threshold is exceeded.
More than likely, what you actually want is to setup a healthcheck for your container that detects the application becoming unresponsive. You will need to run the container using swarm mode to automatically recreate the container with the failing healthcheck.

What defines what containers in Pods 'see' in terms of their limits and requests?

When a container in a Pod is created in a Kubernetes cluster with a limit and request set, how aware can that container be of those limits and requests? Would an application running inside the container be able to get these limits and requests to, for example, reduce the amount of resources it uses if the limits and requests were particularly low?
Kubernetes version: 1.8
Container runtime: Docker
Docker version: 1.12.6
Check mem_limit within a docker container with the tl;dr of
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
will show the limit, and then presumably the requests value is the allocated memory the container started with, but I would need to verify that assumption
I personally don't even understand the unit when trying to apply limits: cpu: so I for sure wouldn't know how to verify that value
The Downwards API can be used to pass the requests and limits to the container process as environment variables
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
To get the information about resource usage, you need a monitoring system, configured for your cluster (heapster, prometheus, etc). Requests and limits can be adjusted according to this data manually or automatically.
One of possible ways to automate this process is to create a dedicated microservice, that will watch resources usage (by collecting and analyzing data from monitors), generate manifests with new limits and send requests kube api to recreate pods.

How does the container use more memory than the limit?

My project might be an overcommitted system, and I have to improve the reliability by specifying an appropriate container mem limit, by which the total mem of the node should not be divided. But I'm confused with the following statements in the Kubernetes v1.1 doc Resource of Qos:
Incompressible Resource Guarantees
if they exceed their memory request, they could be killed (if some other container needs memory)
Containers will be killed if they use more memory than their limit.
and the command docker stats shows a "LIMIT" for each container:
I think it means that containers will not use mems more than the "LIMIT" since I've met sometimes the MEM% stays at 100% for a while, so how and when the containers are killed?
Update
I think OOM Killer is enabled with the default value 0.
> cat /proc/sys/vm/oom_kill_allocating_task
0
Cgroup memory limit feature is used, so I recommend to read cgroup doc:
Tasks that attempt to consume more memory than they are allowed are
immediately killed by the OOM killer.
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

Resources