Why are Kubernetes pods failing randomly when their limits overlap?

I have a single-node Kubernetes cluster which shows 10 Gi memory and 3 CPU available (of a total of 16 Gi and 4 CPU) for running pods after cluster startup. I am trying two different scenarios:
Scenario-1.
Running 3 pods individually with (request, limit) configs as:
Pod-A: memory (1 Gi, 3.3 Gi), CPU (1, 1)
Pod-B: memory (1 Gi, 3.3 Gi), CPU (1, 1)
Pod-C: memory (1 Gi, 3.3 Gi), CPU (1, 1)
In this scenario, the apps come up fine in their corresponding pods and work as expected.
Scenario-2.
Running 3 pods individually with (request, limit) configs as:
Pod-A: memory (1 Gi, 10 Gi), CPU (1, 3)
Pod-B: memory (1 Gi, 10 Gi), CPU (1, 3)
Pod-C: memory (1 Gi, 10 Gi), CPU (1, 3)
In the second scenario, the apps come up in their corresponding pods but fail randomly once some load is put on any of them, i.e. sometimes Pod-A goes down, at other times Pod-B or Pod-C. At no point am I able to keep all three pods running together.
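For clarity, the Scenario-2 resources block for each of these pods looks roughly like this (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pod-a             # placeholder name
spec:
  containers:
  - name: app
    image: my-app:latest  # placeholder image
    resources:
      requests:
        memory: "1Gi"
        cpu: "1"
      limits:
        memory: "10Gi"
        cpu: "3"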
The only event I can see for the failed pod is the following warning from the node logs:
Warning CheckLimitsForResolvConf 1m (x32 over 15m) kubelet, xxx.net Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
With only this information in the logs, I am not able to figure out the actual reason for the random pod failures.
Can anyone help me understand if there is anything wrong with the configs or is there something else I am missing?
Thanks

When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on.
Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
Note: Although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example during a daily peak in request rate.
So, after scheduling, if a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
Refer to the default hard eviction threshold values.
The kubelet has the following default hard eviction thresholds (a config-file sketch follows this list):
memory.available<100Mi
nodefs.available<10%
nodefs.inodesFree<5%
imagefs.available<15%
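These defaults can be overridden via the kubelet configuration file; a minimal sketch that simply restates the defaults above, assuming the v1beta1 KubeletConfiguration schema:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard eviction thresholds; the values below are just the defaults written out explicitly.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"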
You should track your node conditions while the load is running (see the commands sketched below).
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition that reflects that the node is under pressure, i.e. MemoryPressure or DiskPressure.
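A rough way to watch this while the load runs (the node name is a placeholder):

# The Conditions section shows MemoryPressure / DiskPressure; also compare Capacity vs. Allocatable.
kubectl describe node <your-node-name>

# Recent cluster events, useful for spotting evictions and OOM kills.
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp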

Related

Duplicate entries in prometheus

I'm using the Prometheus plugin for Jenkins in order to pass data to the Prometheus server and subsequently have it displayed in Grafana.
With the default setup I can see the metrics at http://:8080/prometheus
But in the list I also find some duplicate entries for the same job:
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api/com.xxxxxx.yyy:yyy-web",repo="NA",} 217191.0
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api",repo="NA",} 526098.0
Both entries refer to the same Jenkins job spring_api, but the metrics have different values. Why do I see two entries for the same metric?
Possibly one is a subset of the other.
In the Kubernetes world you will have the resource consumption for each container in a pod, and the pod's overall resource usage.
Suppose I query the metric "container_cpu_usage_seconds_total" for {pod="X"}.
Pod X has 2 containers so I'll get back four metrics.
{pod="X",container="container1"}
{pod="X",container="container2"}
{pod="X",container="POD"} <- some weird "pause" image with very low usage
{pod="X"} <- sum of container1 and container2
There might also be a discrepancy where the series with no container label is greater than the sum of the per-container consumption. That might be some "not accounted for" overhead, like pod DNS lookups or something; I'm not sure.
I guess my point is that Prometheus will often use combinations of labels, and omissions of labels, to show how a metric is broken down.
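As an illustration with the cAdvisor metric above, you can keep only the per-container series and do the summing explicitly in PromQL (the pod name "X" is a placeholder):

# Per-container CPU rate, excluding the aggregate series (empty container label)
# and the "pause" container.
sum by (pod, container) (
  rate(container_cpu_usage_seconds_total{pod="X", container!="", container!="POD"}[5m])
)

# Pod-level total, reconstructed from the per-container series.
sum by (pod) (
  rate(container_cpu_usage_seconds_total{pod="X", container!="", container!="POD"}[5m])
)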

Tricks to tuning/speeding up Elasticsearch 5.3.3 snapshotting?

I've been tasked with tuning the snapshotting process. I'm dealing with 3 master node instances and 9 data node instances. We are using S3 as the store for the repositories.
There are 28 indices and a total of 189 shards in this particular cluster. The only snapshot tuning parameters I can find are chunk_size and max_snapshot_bytes_per_sec.
I've left chunk_size at the default (unlimited) and changed max_snapshot_bytes_per_sec from the default (40mb) to 100mb and then 500mb.
As a baseline, it takes 3 hours to snapshot the entire cluster using the default settings; after the two experiments changing just max_snapshot_bytes_per_sec, the snapshot process still takes 3 hours.
To me this sounds like the process is either CPU- or network-bound, or am I missing something? I'm not sure what other parameters I can change.
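For reference, both parameters are applied when registering the snapshot repository; a sketch of such a request (repository and bucket names are placeholders, values are examples):

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "chunk_size": "1gb",
    "max_snapshot_bytes_per_sec": "100mb"
  }
}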

How to set limit for kubelet of kubernetes?

I am using a Kubernetes cluster with 1 master node and 2 workers, each with a 4-core CPU and 256 MB RAM. I wanted to know how much CPU and RAM is needed for the kubelet.
Is there any way to set limits (CPU, memory) for the kubelet? I searched the documentation but I only found the worker node requirements.
I think you should first understand what the kubelet does. This can be found in the kubelet documentation.
The kubelet is the primary “node agent” that runs on each node. It can register the node with the apiserver using one of: the hostname; a flag to override the hostname; or specific logic for a cloud provider.
The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object that describes a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms (primarily through the apiserver) and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn’t manage containers which were not created by Kubernetes.
Other than a PodSpec from the apiserver, there are three ways that a container manifest can be provided to the kubelet (a sketch of the file-based option follows this list).
File: Path passed as a flag on the command line. Files under this path will be monitored periodically for updates. The monitoring period is 20s by default and is configurable via a flag.
HTTP endpoint: HTTP endpoint passed as a parameter on the command line. This endpoint is checked every 20 seconds (also configurable with a flag).
HTTP server: The kubelet can also listen for HTTP and respond to a simple API (underspec’d currently) to submit a new manifest.
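For example, with the file-based mechanism a static pod is just an ordinary pod manifest dropped into the watched directory; a minimal sketch (the path is the common kubeadm default, and the pod itself is a placeholder):

# /etc/kubernetes/manifests/static-web.yaml -- picked up directly by the kubelet
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80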
There are several flags that you could use with the kubelet, but most of them are DEPRECATED and the parameters should instead be set via the config file specified by the kubelet's --config flag. This is explained in Set Kubelet parameters via a config file (a sketch of such a file follows the flag list below).
The flags that might be interesting for you are:
--application-metrics-count-limit int
Max number of application metrics to store (per container) (default 100) (DEPRECATED)
--cpu-cfs-quota
Enable CPU CFS quota enforcement for containers that specify CPU limits (default true) (DEPRECATED)
--event-qps int32
If > 0, limit event creations per second to this value. If 0, unlimited. (default 5) (DEPRECATED)
--event-storage-age-limit string
Max length of time for which to store events (per type). Value is a comma separated list of key values, where the keys are event types (e.g.: creation, oom) or "default" and the value is a duration. Default is applied to all non-specified event types (default "default=0") (DEPRECATED)
--event-storage-event-limit string
Max number of events to store (per type). Value is a comma separated list of key values, where the keys are event types (e.g.: creation, oom) or "default" and the value is an integer. Default is applied to all non-specified event types (default "default=0") (DEPRECATED)
--log-file-max-size uint
Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--pods-per-core int32
Number of Pods per core that can run on this Kubelet. The total number of Pods on this Kubelet cannot exceed max-pods, so max-pods will be used if this calculation results in a larger number of Pods allowed on the Kubelet. A value of 0 disables this limit. (DEPRECATED)
--registry-qps int32
If > 0, limit registry pull QPS to this value. If 0, unlimited. (default 5) (DEPRECATED)
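As a rough sketch, a few of the flags above expressed in such a config file would look like this (the field names are the KubeletConfiguration equivalents I would expect for those flags; the values shown are just the defaults):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuCFSQuota: true     # --cpu-cfs-quota
eventRecordQPS: 5     # --event-qps
registryPullQPS: 5    # --registry-qps
podsPerCore: 0        # --pods-per-core (0 disables the per-core limit)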

Kubernetes pod cpu usage calculation method for HPA

Can someone explain how the CPU usage is calculated inside pods with multiple containers, for use with a Horizontal Pod Autoscaler?
Is it the mean value and how is this calculated?
For example:
If we have 2 containers:
Container1 requests 0.5 cpu and uses 0 cpu
Container2 requests 1 cpu and uses 2 cpu
If we calculate both separately and take the mean: (0% + 200%) / 2 = 100% usage?
If we take the ratio of the sums: 2 / 1.5 = 133% usage?
Or is my logic way off?
As of Kubernetes 1.9, the HPA calculates pod CPU utilization as the total CPU usage of all containers in the pod divided by the total request. So in your example the calculated usage would be (0 + 2) / (0.5 + 1) = 133%. I don't think that's specified anywhere in the docs, but the relevant code is here: https://github.com/kubernetes/kubernetes/blob/v1.9.0/pkg/controller/podautoscaler/metrics/utilization.go#L49
However, I would consider this an implementation detail. As such it can easily change in future versions.
In the Horizontal Pod Autoscaling design documentation it is clearly written that it takes the arithmetic mean of the pods' CPU utilization to compare against the target value. Here is the text:
The autoscaler is implemented as a control loop. It periodically queries pods described by Status.PodSelector of the Scale subresource, and collects their CPU utilization. Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in Spec.CPUUtilization, and adjusts the replicas of the Scale if needed to match the target (preserving the condition: MinReplicas <= Replicas <= MaxReplicas).
The target number of pods is calculated from the following formula:
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
For further detail: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md
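For example, with a single pod at the 133% utilization computed above and a target of 50%, TargetNumOfPods = ceil(1.33 / 0.50) = 3. A minimal autoscaling/v1 HPA that sets such a target would look roughly like this (all names are placeholders):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 50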

How does the detection of terminated nodes work in Erlang? How does net_ticktime influence the control of node liveness in Erlang?

I set the net_ticktime value to 600 seconds.
net_kernel:set_net_ticktime(600)
In the Erlang documentation for net_ticktime = TickTime:
Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are ticked (if anything else has been written to a node) and if nothing has been received from another node within the last four (4) tick times that node is considered to be down. This ensures that nodes which are not responding, for reasons such as hardware errors, are considered to be down.
The time T, in which a node that is not responding is detected:
MinT < T < MaxT where:
MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4
TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.
Note: Normally, a terminating node is detected immediately.
My Problem:
My TickTime is 600 (seconds). Thus, 450 seconds (7.5 minutes) < T < 750 seconds (12.5 minutes). However, when I set net_ticktime to 600 on all of the distributed Erlang nodes and one of them fails (e.g. when I close its Erlang shell), the other nodes get the message immediately, not according to the definition of the tick time.
It is noted that normally a terminating node is detected immediately, but I could not find an explanation (in the Erlang documentation, the Erlang ebook, or other Erlang-based sources) of this immediate-response principle for node termination in distributed Erlang. Are nodes in a distributed environment pinged periodically at smaller intervals than net_ticktime, or does the terminating node send some kind of message to the other nodes before it terminates? If it does send a message, are there any scenarios in which, upon termination, a node cannot send this message and must be pinged to check its liveness?
It is also noted in the Erlang documentation that distributed Erlang is not very scalable for clusters larger than 100 nodes, as every node keeps links to all other nodes in the cluster. Is the algorithm for checking the liveness of nodes (pinging, announcing termination) modified as the size of the cluster increases?
When two Erlang nodes connect, a TCP connection is made between them. The failure you are inducing would cause the underlying OS to close the connection, effectively notifying the other node very quickly.
The network tick is used to detect a connection to a distant node that appears to be up but is not actually passing traffic, such as may occur when a network event isolates a node.
If you want to simulate a failure that would require a tick to detect, use a firewall to block the traffic on the connection created when the nodes first ping.
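To observe this behaviour directly, here is a minimal sketch using net_kernel:monitor_nodes/1 (the module name and output format are arbitrary): run it in a shell on one node, then stop or firewall the other node and note how quickly the nodedown message arrives.

-module(node_watch).
-export([start/0]).

%% Subscribe to node up/down events and print them as they arrive.
start() ->
    net_kernel:monitor_nodes(true),
    loop().

loop() ->
    receive
        {nodeup, Node} ->
            io:format("node up: ~p~n", [Node]),
            loop();
        {nodedown, Node} ->
            io:format("node down: ~p~n", [Node]),
            loop()
    end.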
