Docker: resource issues inside containers

I have a CentOS 7.4 container running. Inside this container I am doing a variety of things, and I am noticing very frequent issues such as:
clang-5.0: error: unable to execute command: posix_spawn failed: Resource temporarily unavailable.
And
There is insufficient memory for the Java Runtime Environment to continue.
Cannot create GC thread. Out of system resources.
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
And
make[1]: vfork: Resource temporarily unavailable
And
fork() failed: Command could not be run:
If I execute the same "variety of things" on the same Docker host but outside of the container, I don't run into these issues. I am trying to understand whether this is a Docker-specific thing that I am not aware of. Docker by default inherits the host's resources, correct? i.e. number of cores and memory.
I see this when I run docker stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
8cbc44773def my-rh6 1.98% 48.59MiB / 251.6GiB 0.02% 33.8GB / 31.6GB 65.9MB / 11.5MB 26
683313a4e70e my-rh7 85.99% 73.26MiB / 251.6GiB 0.03% 21.1GB / 1.31GB 269MB / 26.6MB 6
My host has 72 cores and 250G of memory, so the memory limit seems to match the host. I have no idea how Java would run out of memory with 250G available...
When I execute the "variety of things" I do see CPU % go above 1000% at times, which I am not sure is normal. If I run the same thing outside of the container, sar doesn't show CPU consumption anywhere near 1000% (70% max at times).
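One thing worth comparing, since fork/vfork failing with "Resource temporarily unavailable" is often about process/thread caps rather than RAM, is the limits the container actually sees versus the host's. A rough sketch (container name taken from the stats output above; the values on the last line are placeholders, not recommendations):

# inside the container: max user processes and the pids cgroup cap, if any
docker exec -it my-rh7 sh -c 'ulimit -u; cat /sys/fs/cgroup/pids/pids.max 2>/dev/null'
# on the host, for comparison
ulimit -u
# if they differ, the per-container limits can be raised at run time, e.g.
docker run --ulimit nproc=65535:65535 --pids-limit=-1 ...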

Related

docker container hanging on run - how to debug

I am trying to run Screaming Frog in a Docker container. As a starting point, I used this GitHub project:
https://github.com/iihnordic/screamingfrog-docker
After building, I ran the container with the following command:
docker run -v /<my-path>/screamingfrog-crawls:/home/crawls screamingfrog --crawl https://<my-domain> --headless --save-crawl --output-folder /home/crawls
It worked the first time, but after multiple attempts, it seems that the process hangs 8 out of 10 times with no error, always hanging at a different stage in the process.
I assumed the most likely reason is memory, but despite significantly increasing the docker memory and also increasing the Screaming Frog Memory to 16GB the same issue persists.
How can I go about debugging my container when no errors are thrown and the container just hangs indefinitely?
As suggested by @Ralle, I checked docker stats, and while memory usage is staying well below 10%, the CPU is always at 100%.
Try docker stats. It returns something like this, so at least you can see the behaviour of memory and CPU:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
9949a4ee1238 nest-api-1 0.87% 290MiB / 3.725GiB 7.60% 2.14MB / 37.2kB 156kB / 2.06MB 33
96fe43dba2b0 postgres 0.00% 29MiB / 3.725GiB 0.76% 7.46kB / 6.03kB 1.17MB / 67.8MB 7
ff570659e917 redis 0.30% 3.004MiB / 3.725GiB 0.08% 2.99kB / 0B 614kB / 4.1kB 5
Also, docker top shows you the PIDs.
I don't know your application, but also check whether the issue could be related to the volumes themselves each time the container restarts.
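If it helps, a minimal debugging pass could look like the following (this assumes the container was started with --name screamingfrog; otherwise take the ID from docker ps):

docker ps                                # find the container ID / name
docker stats --no-stream screamingfrog   # one-shot CPU / memory / PIDs snapshot
docker top screamingfrog                 # processes inside the container and their PIDs
docker logs --tail 100 screamingfrog     # anything written to stdout/stderr before the hang
docker exec -it screamingfrog sh         # look around inside while it is hung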

What is real CPU usage of container from docker stats command

I am running postgres timescaledb on my Docker Swarm node. I set the CPU limit to 4 and the memory limit to 32G. When I check docker stats, I can see this output:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c6ce29c7d1a4 pg-timescale.1.6c1hql1xui8ikrrcsuwbsoow5 341.33% 20.45GiB / 32GiB 63.92% 6.84GB / 5.7GB 582GB / 172GB 133
CPU% is oscillating around 400%. The node has 6 CPUs and the average load has been 1-2 (1-minute load average), so, as I understand it, with my CPU limit of 4 the load should be oscillating around 6 at most. My current load is 20 (1-minute load average), and the output of the top command from inside of postgres shows 50-60%.
My service configuration limit:
deploy:
  resources:
    limits:
      cpus: '4'
      memory: 32G
I am confused; all the values are different, so what is the real CPU usage of postgres and how do I limit it? My server load is pushed to the maximum even though the postgres limit is set to 4. Inside postgres I can see from htop that there are 6 cores and 64G of memory, so it looks like it has all the resources of the host. From docker stats, the maximum CPU is 400%, which correlates with the limit of 4 CPUs.
Load average from commands like top in Linux refer to the number of processes running or waiting to run on average over some time period. CPU limits used by docker specify the number of CPU cycles over some timeframe permitted for processes inside of a cgroup. These aren't really measuring the same thing, especially when you factor in things like I/O waiting. You can have a process waiting for a read from disk, that wants to run but is blocked on that I/O call, increasing your load measurements on the host, but not using any CPU cycles.
When calculating how much CPU to allocate to a cgroup, not only do you need to factor in the I/O and other system needs of the process, but you should also consider queuing theory as you approach saturation on the CPU. The closer you get to 100% utilization of the CPU, the longer the queue of processes ready to run will likely be, resulting in significant jumps in load measurements.
Setting these limits correctly will likely require trial and error because not all processes are the same, and not all workloads on the host are the same. A batch processing job that kicks off at irregular intervals and saturates the drives and network will have a very different impact on the host from a scientific computation that is heavily CPU and memory bound.
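If it helps, you can also read the quota Docker actually applied straight from the cgroup inside the container. The paths below assume cgroup v1 (under cgroup v2 the same information lives in /sys/fs/cgroup/cpu.max):

# inside the container: quota and period, both in microseconds
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us    # e.g. 400000 for a 4-CPU limit
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us   # usually 100000
# quota / period = number of CPUs the container may use per period,
# which is why docker stats tops out around 400% here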

Google Kubernetes logs

Memory cgroup out of memory: Kill process 545486 (python3) score 2016 or sacrifice child Killed process 545486 (python3) total-vm:579096kB, anon-rss:518892kB, file-rss:16952kB
These are the node logs, and my container is continuously restarting at random. I am running a Python container with 4 replicas.
The Python application uses a socket with Flask. The Docker image is based on python3.5:slim.
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-XXXXXXX-cluster-highmem-pool-gen2-f2743e02-msv2 682m 17% 11959Mi 89%
This morning the node log shows: 0/1 nodes are available: 1 Insufficient cpu.
But node CPU usage is only 17%, and there is not much running inside the pod.
Have a look at the best practices and try to adjust resource requests and limits for CPU and memory. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated (and restarted). So, using suitable limits should help you to solve your problem with restarts of your containers.
If your container's resource requests exceed what the node can provide, Kubernetes will throw an error similar to the one you have ("Insufficient cpu") and won't schedule the container.
After adjusting limits, you could use some monitoring system (like Stackdriver) to find the cause of potential memory leak.
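As an illustration only (the values are placeholders, not a recommendation), requests and limits are set per container in the pod/deployment spec:

resources:
  requests:
    cpu: "250m"     # what the scheduler reserves; "Insufficient cpu" means the requests don't fit on the node
    memory: "512Mi"
  limits:
    cpu: "500m"     # exceeding this gets the container throttled
    memory: "1Gi"   # exceeding this gets the container OOM-killed and restarted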

Kubernetes pods restart issue anomaly

My Java microservices are running in a K8s cluster hosted on AWS EC2 instances.
I have around 30 microservices (a good mix of Node.js and Java 8) running in the K8s cluster. I am facing a challenge where my Java application pods get restarted unexpectedly, which leads to an increase in the application's 5xx count.
To debug this, I started a New Relic agent in the pod along with the application and found the following graph:
There I can see that I have an Xmx value of 6GB and my usage is at most 5.2GB.
This clearly shows that the JVM is not crossing the Xmx value.
But when I describe the pod and look at the last state, it says "Reason: Error" with "Exit code: 137".
Then on further investigation I found that my pod's average memory usage is close to its limit all the time (allocated 9GiB, using ~9GiB). I am not able to understand why memory usage is so high in the pod even though I have only one process running (the JVM), and that is restricted to a 6GiB Xmx.
When I log in to my worker nodes and check the status of the Docker containers, I can see the last container of that application in the Exited state, and it says "Container exits with non-zero exit code 137".
I can see from the worker node kernel logs that the kernel is terminating my process running inside the container, even though I can see a lot of free memory on my worker node.
I am not sure why my pods get restarted again and again. Is this K8s behaviour or something suspicious in my infrastructure? This is forcing me to move my application from containers back to VMs, as it leads to an increase in the 5xx count.
EDIT: I am still getting OOM kills after increasing the memory to 12GB. I am not sure why the pod is getting killed with OOM even though the JVM Xmx is only 6GB.
Need help!
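For reference, the last-state check described above can be done along these lines (the pod name is a placeholder):

kubectl describe pod <pod-name>    # look at "Last State", "Reason" and "Exit Code"
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled" here means the container hit its cgroup memory limit, regardless of what -Xmx says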
Some older Java versions (prior to the Java 8u131 release) don't recognize that they are running in a container. So even if you specify a maximum heap size for the JVM with -Xmx, the JVM will set the maximum heap size based on the host's total memory instead of the memory available to the container, and then, when a process tries to allocate memory over its limit (defined in the pod/deployment spec), your container gets OOMKilled.
These problems might not pop up when running your Java apps in a K8s cluster locally, because the difference between the pod memory limit and the total local machine memory isn't big. But when you run it in production on nodes with more memory available, the JVM may go over your container memory limit and will be OOMKilled.
Starting from Java 8(u131 release) it is possible to make JVM be “container-aware” so that it recognizes constraints set by container control groups (cgroups).
For Java 8 (from the u131 release) and Java 9 you can set these experimental flags for the JVM:
-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
It will set the heap size based on your container cgroups memory limit, which is defined as "resources: limits" in your container definition part of the pod/deployment spec.
There can still be cases of the JVM's off-heap memory growing in Java 8, so you might want to monitor that, but overall these experimental flags handle the heap limit as well.
From Java 10, container awareness is the default and is enabled/disabled with this flag:
-XX:+UseContainerSupport
-XX:-UseContainerSupport
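One way to wire these flags in without rebuilding the image is the JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically. A minimal sketch for the container section of the deployment spec (flag choice depends on your Java version, as above):

env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
    # on Java 10+ this can simply be "-XX:+UseContainerSupport", which is already the default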
Since you have limited the maximum memory usage of your pod to 9Gi, it will be terminated automatically when the memory usage gets to 9Gi.
In Google Cloud App Engine you can specify a maximum CPU usage threshold, e.g. 0.6, meaning that if CPU reaches 0.6 of 100% (60%), a new instance will spawn.
I did not come across such a setting, but maybe the Kubernetes pod/deployment has a similar configuration parameter, i.e. if the RAM of the pod reaches 0.6 of 100%, terminate the pod. In your case that would be 60% of 9GB ≈ 5.4GB. Just some food for thought.
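For what it's worth, the closest Kubernetes analogue I'm aware of is the HorizontalPodAutoscaler, which scales out (rather than terminating) when average utilization crosses a target; a minimal sketch with placeholder names and values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app            # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # roughly the 0.6 threshold mentioned above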

percent of cpu quota (actually) used by a container

I'm relatively new to docker. I'm trying to get the percent of cpu quota (actually) used by a container. Is there a default metric emitted by one of the endpoints or is it something that I will have to calculate with other metrics? Thanks!
docker stats --no-stream
CONTAINER ID NAME CPU % (rest of line truncated)
949e2a3724e6 practical_shannon 8.32% (truncated)
As mentioned in the comment from @asuresh4 above, docker stats appears to give the ACTUAL CPU utilization, not the configured values. The output here is from Docker version 17.12.1-ce, build 7390fc6.
--no-stream means run stats once, not continuously as it normally does. As you might guess, you can also ask for stats on a single container (specify the container name or id).
In addition to CPU %, MEM USAGE / LIMIT, MEM %, NET I/O, and BLOCK I/O are also shown.
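docker stats alone doesn't relate that percentage to the configured quota, but the two can be combined. A rough sketch, assuming the limit was set with --cpus (which populates HostConfig.NanoCpus; limits set via --cpu-quota/--cpu-period show up under CpuQuota/CpuPeriod instead):

CONTAINER=practical_shannon   # container from the example above
LIMIT_CPUS=$(docker inspect --format '{{.HostConfig.NanoCpus}}' "$CONTAINER" | awk '{print $1 / 1e9}')
USED_PCT=$(docker stats --no-stream --format '{{.CPUPerc}}' "$CONTAINER" | tr -d '%')
# percent of quota = observed CPU% / limit in CPUs (e.g. 341% / 4 CPUs ≈ 85% of quota)
awk -v used="$USED_PCT" -v limit="$LIMIT_CPUS" \
  'BEGIN { if (limit > 0) printf "%.1f%% of quota\n", used / limit; else print "no --cpus limit set" }'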
