Kubernetes pod CPU usage calculation method for HPA - docker

Can someone explain how CPU usage is calculated inside pods with multiple containers, for use with a Horizontal Pod Autoscaler?
Is it the mean value, and how is it calculated?
For example:
If we have 2 containers:
Container1 requests 0.5 CPU and uses 0 CPU
Container2 requests 1 CPU and uses 2 CPU
If we calculate each container separately and take the mean: (0% + 200%) / 2 = 100% usage?
If we divide the total usage by the total request: 2 / 1.5 ≈ 133% usage?
Or is my logic way off?

As of Kubernetes 1.9, the HPA calculates pod CPU utilization as the total CPU usage of all containers in the pod divided by the total request. So in your example the calculated usage would be 133%. I don't think this is specified anywhere in the docs, but the relevant code is here: https://github.com/kubernetes/kubernetes/blob/v1.9.0/pkg/controller/podautoscaler/metrics/utilization.go#L49
However, I would consider this an implementation detail, so it could easily change in future versions.
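That calculation (total usage over total request) can be sketched in a few lines of Python; the numbers are the ones from the question, and the function name is just illustrative:

```python
# Sketch of the pod CPU utilization calculation described above:
# total CPU usage of all containers divided by the total CPU request.

def pod_cpu_utilization(containers):
    """containers: list of (cpu_request, cpu_usage) pairs, in cores."""
    total_request = sum(request for request, _ in containers)
    total_usage = sum(usage for _, usage in containers)
    return 100.0 * total_usage / total_request

# The question's example: (0.5 requested, 0 used) and (1 requested, 2 used).
print(round(pod_cpu_utilization([(0.5, 0.0), (1.0, 2.0)])))  # 133
```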

In the Horizontal Pod Autoscaling design documentation it's clearly written that it takes the arithmetic mean of the pods' CPU utilization to compare against the target value. Here is the text:
The autoscaler is implemented as a control loop. It periodically queries pods described by Status.PodSelector of the Scale subresource, and collects their CPU utilization. Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in Spec.CPUUtilization, and adjusts the replicas of the Scale if needed to match the target (preserving the condition: MinReplicas <= Replicas <= MaxReplicas).
The target number of pods is calculated from the following formula:
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
For further detail: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md
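That formula is simple enough to sketch directly in Python (the utilization figures in the example are made up for illustration):

```python
import math

def target_replicas(pod_utilizations, target):
    """TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)"""
    return math.ceil(sum(pod_utilizations) / target)

# Three pods at 90%, 80%, and 70% utilization with a 60% target:
print(target_replicas([90, 80, 70], 60))  # 4
```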

Related

Behavior of docker compose v3's deploy resources limits 'cpus' parameter setting (is it an absolute number or a percentage of available cores)

Folks,
With regard to docker compose v3's 'cpus' parameter setting (under 'deploy' > 'resources' > 'limits') to limit the CPUs available to a service: is it an absolute number that specifies a count of CPUs, or is it a (more useful) percentage of the available CPUs?
From what I read it appears to be an absolute number: say a host has 4 CPUs and two services in the compose file are each set to 0.5; then the two services combined can only use a maximum of 1 CPU (0.5 each), leaving the remaining 3 CPUs idle.
But thinking out loud, it seems to me it would be nicer if this were a percentage-of-available-cores setting. In the same example, each service could then use up to 2 CPUs, and the two combined could use all 4 when needed. That way, when I increase or decrease the available cores, the relative settings would not need modifying again.
EDIT(09/10/21):
On reading this, it appears that the above can be achieved with the 'cpu-shares' setting instead of 'cpus'. Is my understanding correct?
The doc for 'cpu-shares', however, includes the following cautionary note:
"It does not guarantee or reserve any specific CPU access."
If the above is achieved with this setting, what does it mean (what do we lose) not to have a guarantee or reservation?
EDIT(09/13/21):
Just to summarize,
The 'cpus' parameter setting is an absolute number that refers to the number of CPUs a service has reserved for it to use at all times. Correct?
The 'cpu-shares' parameter setting is a relative weight number the value of which is used to compute/determine the percentage of total available CPU that a service can use only when there is contention. Correct?
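If that second reading is right, the cpu-shares behavior under contention could be sketched with a toy model (the service names and share values below are made up, and this is a simplification, not how the kernel scheduler literally works):

```python
# Toy model of cpu-shares under contention: each busy service gets CPU in
# proportion to its share weight; an idle competitor frees up the whole host.

def cpu_under_contention(busy_shares, total_cpus):
    """busy_shares: {service_name: share_weight} for services that want CPU."""
    total_weight = sum(busy_shares.values())
    return {name: total_cpus * weight / total_weight
            for name, weight in busy_shares.items()}

# Two equally weighted services on a 4-CPU host: 2 CPUs each when both are
# busy, but either one alone may use all 4 (unlike an absolute 'cpus' cap).
print(cpu_under_contention({"svc_a": 512, "svc_b": 512}, 4))
print(cpu_under_contention({"svc_a": 512}, 4))
```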

Duplicate entries in prometheus

I'm using the Prometheus plugin for Jenkins in order to pass data to the Prometheus server and subsequently have it displayed in Grafana.
With the default setup I can see the metrics at http://:8080/prometheus
But in the list I also find some duplicate entries for the same job:
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api/com.xxxxxx.yyy:yyy-web",repo="NA",} 217191.0
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api",repo="NA",} 526098.0
Both entries refer to the same Jenkins job spring_api, but the metrics have different values. Why do I see two entries for the same metric?
Possibly one is a subset of the other.
In the Kubernetes world you will have the resource consumption for each container in a pod, and the pod's overall resource usage.
Suppose I query the metric "container_cpu_usage_seconds_total" for {pod="X"}.
Pod X has 2 containers, so I'll get back four series:
{pod="X",container="container1"}
{pod="X",container="container2"}
{pod="X",container="POD"} <- some weird "pause" image with very low usage
{pod="X"} <- sum of container1 and container2
There might also be a discrepancy where the series with no container label is greater than the sum of the per-container consumption. That might be some unaccounted-for overhead, like pod DNS lookups or something; I'm not sure.
I guess my point is that prometheus will often use combinations of labels and omissions of labels to show how a metric is broken down.
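As an illustration of that breakdown-by-label idea (the sample values below are invented, not taken from the question's Jenkins setup):

```python
# Invented cAdvisor-style samples keyed by (pod, container label); None means
# the label is omitted, i.e. the pod-level aggregate series.
samples = {
    ("X", "container1"): 10.0,
    ("X", "container2"): 5.0,
    ("X", "POD"): 0.01,   # the "pause" container, negligible usage
    ("X", None): 15.2,    # no container label: the pod-level total
}

per_container = sum(value for (pod, container), value in samples.items()
                    if pod == "X" and container not in (None, "POD"))
pod_level = samples[("X", None)]

# The pod-level series is slightly larger: unaccounted-for overhead.
print(pod_level - per_container)
```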

Why Kubernetes pods are failing randomly when their limits overlap?

I have a single-node Kubernetes cluster which shows 10 Gi, 3 CPU as available (of a total 16 Gi, 4 CPU) for running pods after cluster startup. I am trying two different scenarios:
Scenario-1.
Running 3 pods individually with configs (Request, Limit) as:
Pod-A: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
Pod-B: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
Pod-C: (1 Gi,3.3Gi) and (1 cpu,1 cpu)
In this scenario, the apps come up fine in their corresponding pods and work as expected.
Scenario-2.
Running 3 pods individually with configs (Request, Limit) as:
Pod-A: (1 Gi,10 Gi) and (1 cpu,3 cpu)
Pod-B: (1 Gi,10 Gi) and (1 cpu,3 cpu)
Pod-C: (1 Gi,10 Gi) and (1 cpu,3 cpu)
In the second scenario, the apps come up in their corresponding pods but fail randomly after some load is put on any of them, i.e. sometimes Pod-A goes down, at other times Pod-B or Pod-C. At no point am I able to run all three pods together.
The only event I can see in the failed pod is the warning below, from the node logs:
Warning CheckLimitsForResolvConf 1m (x32 over 15m) kubelet, xxx.net Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
Having only this information in the logs, I am not able to figure out the actual reason for the random pod failures.
Can anyone help me understand if there is anything wrong with the configs or is there something else I am missing?
Thanks
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on.
Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
Note: Although actual memory or CPU resource usage on nodes may be very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example during a daily peak in request rate.
So, after scheduling, if a container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
Refer to the default hard eviction threshold values.
The kubelet has the following default hard eviction thresholds:
memory.available<100Mi
nodefs.available<10%
nodefs.inodesFree<5%
imagefs.available<15%
You should track your node conditions while the load is running.
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition reflecting that the node is under pressure, i.e. MemoryPressure or DiskPressure.
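The scheduling half of this can be sketched as a toy capacity check (the numbers mirror the question's node; this is not real scheduler code):

```python
# Toy version of the scheduler's capacity check: only *requests* must fit the
# node's allocatable capacity; limits may oversubscribe it, which is exactly
# what exposes pods to eviction when actual usage grows (scenario 2 above).

def fits_on_node(allocatable, existing_requests, new_request):
    return sum(existing_requests) + new_request <= allocatable

# 10 Gi allocatable memory; three pods each requesting 1 Gi all schedule,
# even though their 10 Gi limits would oversubscribe the node.
print(fits_on_node(10, [1, 1], 1))   # True
print(fits_on_node(10, [4, 4], 3))   # False
```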

How to use the data of cadvisor to calculate the cpu usage and load in Prometheus

When I use cAdvisor to get information about the CPU in a Docker container, I get back raw counter data.
My question is: how do I calculate the CPU usage and load from the information cAdvisor returns, the same way Prometheus does? How does Prometheus calculate CPU usage?
The algorithm that Prometheus uses for rate() is a little intricate due to its handling of issues like alignment and counter resets, as explained in Counting with Prometheus.
The short version is to subtract the first value from the last value and divide by the time range they span. It's probably easiest to use Prometheus rather than doing this yourself.
The query below should return the top 10 containers consuming the most CPU time:
topk(10, sum(irate(container_cpu_usage_seconds_total{container_label_com_docker_swarm_node_id=~".+", id=~"/docker/.*"}[$interval])) by (name)) * 100
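Still, the subtract-and-divide idea (with naive counter-reset handling) looks roughly like this in Python. This is a simplified sketch: real rate() also extrapolates to the window boundaries, which is skipped here.

```python
# Simplified rate calculation over counter samples: sum the increases
# (treating any drop as a counter reset) and divide by the elapsed time.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), time-ordered."""
    total_increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:              # counter reset: it restarted from zero
            total_increase += value
        else:
            total_increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed

# A counter climbs 100 -> 160, resets, then climbs to 30, over 60 seconds:
print(simple_rate([(0, 100), (30, 160), (45, 5), (60, 30)]))  # 1.5
```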

Docker service Limits and Reservations

Docker v1.12 services come with four flags for setting resource limits on a service:
--limit-cpu value Limit CPUs (default 0.000)
--limit-memory value Limit Memory (default 0 B)
--reserve-cpu value Reserve CPUs (default 0.000)
--reserve-memory value Reserve Memory (default 0 B)
What is the difference between limit and reserve in this context?
What does the cpu value mean here? Does it mean a number of cores? A CPU share? What is the unit?
Reserve holds those resources on the host so they are always available for the container. Think dedicated resources.
Limit prevents the binary inside the container from using more than that. Think of it as controlling runaway processes in the container.
Based on my limited testing with stress, --limit-cpu is measured in cores (fractions allowed); if there are multiple threads, it will spread them across cores and seems to attempt to keep the total near what you'd expect.
In the screenshot (not reproduced here), from left to right, the runs were --limit-cpu 4, then 2.5, then 2, then 1. All of those tests had stress set to 4 CPU worker threads.
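That observation amounts to a simple division (a sketch only; the thread counts and limits are the ones from the tests described above):

```python
# If --limit-cpu caps total CPU time at N cores' worth, then with T busy
# worker threads each thread averages at most min(N / T, 1) of a core.

def per_thread_cpu(limit_cpus, busy_threads):
    return min(limit_cpus / busy_threads, 1.0)

# stress with 4 worker threads under the limits tested above:
for limit in (4, 2.5, 2, 1):
    print(limit, per_thread_cpu(limit, 4))
```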
