GKE extended memory and Kubernetes memory allocatable - docker

When creating a cluster on GKE it's possible to create custom machine types. When adding 8 GB of memory to an n1-standard-1, Kubernetes only shows allocatable memory of 6.37 GB. Why is this?
The requested memory includes all the pods in the kube-system namespace, so where is this extra memory going?

Quoting from the documentation:
Node allocatable resources
Note that some of a node's resources are required to run the Kubernetes Engine and Kubernetes resources necessary to make that node function as part of your cluster. As such, you may notice a disparity between your node's total resources (as specified in the machine type documentation) and the node's allocatable resources in Kubernetes Engine.
Note: As larger machine types tend to run more containers (and by extension, Kubernetes pods), the amount of resources that Kubernetes Engine reserves for cluster processes scales upward for larger machines.
Caution: In Kubernetes Engine node versions prior to 1.7.6, reserved resources were not counted against a node's total allocatable resources.
If your nodes have recently upgraded to version 1.7.6, they might appear to have fewer resources available, as Kubernetes Engine now displays allocatable resources. This can potentially lead to your cluster's nodes appearing overcommitted, and you might want to resize your cluster as a result.
For example, performing some tests, you can double-check this:
Machine type              Memory (GB)   Allocatable (GB)   CPU (cores)   Allocatable (cores)
g1-small                  1.7           1.2                0.5           0.47
n1-standard-1 (default)   3.75          2.7                1             0.94
n1-standard-2             7.5           5.7                2             1.93
n1-standard-4             15            12                 4             3.92
n1-standard-8             30            26.6               8             7.91
n1-standard-16            60            54.7               16            15.89
Note: The values listed for allocatable resources do not account for the resources used by kube-system pods, the amount of which varies with each Kubernetes release. These system pods generally occupy an additional 400m CPU and 400Mi of memory on each node (values are approximate). It is recommended that you directly inspect your cluster if you require an exact accounting of usable resources on each node.
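To do that inspection yourself, kubectl can show the gap directly (the node name below is a placeholder):
kubectl describe node <node-name> | grep -A 6 -E "Capacity|Allocatable"
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.memory}{"\n"}{end}'
The first command prints the node's total Capacity block next to its Allocatable block; the second prints just the allocatable memory of every node.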
UPDATE
There is also an official explanation in the Kubernetes documentation regarding why these resources are used:
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes. This performance dashboard exposes cpu and memory usage profiles of kubelet and docker engine at multiple levels of pod density. This blog post explains how the dashboard can be interpreted to come up with a suitable kube-reserved reservation.
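For reference, on a self-managed kubelet this reservation is controlled by the --kube-reserved flag (GKE sets it for you; the values below are purely illustrative):
--kube-reserved=cpu=100m,memory=1.7Gi,ephemeral-storage=1Gi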
I would suggest you go through this page if you are interested in learning more.

Related

Kubernetes: High CPU usage and zombie processes when "Disk Pressure" event

We run a small on-premises K8s cluster (based on the RKE stack): 1x etcd/control plane node, 2x worker nodes. Components are:
OS: Centos 7
Docker version: 19.3.9
K8s: 1.17.2
Another important fact: we're using a Rook-Ceph storage cluster on both worker nodes (Rook v1.2.4, Ceph 14.2.7).
When one of the OS mounts reaches 90%+ usage (for example /var), K8s reports "Disk Pressure" and disables the node, which is fine. But when this happens, the CPU load starts growing into the dozens (for example 30+ or 40+ on a machine with 4 vCPUs), many container processes (children of containerd-shim) go into the zombie (defunct) state, and the whole K8s cluster collapses.
At first we thought it was a Rook-Ceph problem with XFS storage (described at https://github.com/rook/rook/issues/3132#issuecomment-580508760), so we switched to EXT4 (because we cannot upgrade the kernel to 5.6+), but during the last weekend it happened again, and we are sure this case is related to the Disk Pressure event. The last contact with the (already) dead node was on 21-01 at 13:50, but the load started growing at 13:07 and quickly reached 30.5:
/var usage went from 89.97% to over 90% at exactly 13:07 that day:
Can you point us to what we need to check in the K8s configuration, logs, or anywhere else to find out what is going on? Why does K8s collapse during such a fairly normal event?
(For clarification: we know we're using quite old versions, but we'll do a comprehensive upgrade of the environment within a few weeks.)

Airflow with mysql_to_gcp negsignal.sigkill

I'm using Airflow with Composer (GCP) to extract data from Cloud SQL to GCS and then from GCS to BigQuery. I have some tables between 100 MB and 10 GB. My DAG has two tasks to do what I mentioned above. With the smaller tables the DAG runs smoothly, but with slightly larger tables the Cloud SQL extraction task fails after a few seconds without producing any logs except "Negsignal.SIGKILL". I have already tried increasing the Composer capacity, among other things, but nothing has worked yet.
I'm using the mysql_to_gcs and gcs_to_bigquery operators.
The first thing you should check when you get Negsignal.SIGKILL is your Kubernetes resources. This is almost surely a problem with resource limits.
I think you should monitor your Kubernetes Cluster Nodes. Inside GCP, go to Kubernetes Engine > Clusters. You should have a cluster containing the environment that Cloud Composer uses.
Now, head to the nodes of your cluster. Each node provides you with metrics about CPU, memory, and disk usage. You will also see the limit for the resources that each node uses, as well as the pods that each node is running.
If you are not very familiar with K8s, let me explain this briefly. Airflow uses Pods inside nodes to run your Airflow tasks. These pods are called airflow-worker-[id]. That way you can identify your worker pods inside the Node.
Check your pod list. If you have evicted airflow-worker pods, then Kubernetes is stopping your workers for some reason. Since Composer uses the CeleryExecutor, an evicted airflow-worker points to a problem. This would not be the case with the KubernetesExecutor, but that is not available in Composer yet.
If you click on an evicted pod, you will see the reason for the eviction. That should give you the answer.
If you don't see a problem with your pod evictions, don't panic, you still have some options. From that point on, your best friend will be logs. Be sure to check your pod logs, node logs, and cluster logs, in that order.
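If you prefer the command line over the console, a couple of kubectl commands surface the same information (namespace and pod id are placeholders):
kubectl get pods --all-namespaces | grep -i evicted
kubectl describe pod airflow-worker-<id> -n <namespace>
The describe output contains Status, Reason, and Message fields near the top that spell out why the pod was evicted (for example, the node running out of memory).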

Using k8s node resources out of k8s

What would happen with Kubernetes scheduling if I have a Kubernetes node, but I use the container (Docker) engine for some other stuff, outside of the context of Kubernetes?
For example, if I manually SSH to the respective node and do docker run something, would Kubernetes scheduling take into account the fact that this node is busy running other stuff, and that it might not be able to host any other containers now?
What would happen in the following scenario:
Node with 8 GB RAM
running a pod with a resource request of 2 GB, a limit of 4 GB, and current usage of 3 GB
SSH to the node and docker run a container with 5 GB, using all of it
P.S. Please skip the "why would you go and run docker run directly on the node" questions. I don't want to, but reasons.
I'm pretty sure Kubernetes's scheduling only considers (a) pods it knows about and not other resources, and (b) only their resource requests.
In the situation you describe, with exactly that resource utilization, things will work fine. The pod can be scheduled on the node because the total resource requests on it are 2 GB out of 8 GB. The total memory usage doesn't exceed the physical memory size either, so you're okay.
Say the pod allocated a little bit more memory. Now the system as a whole is above its physical memory capacity, so the Linux kernel will arbitrarily kill something off. This is often the largest thing. You'll typically see an exit code of 137 (matching SIGKILL) in whichever system manages it.
This behavior is the same even if you run your side job in something like a DaemonSet. If it requests 2 GB of RAM, both pods fit on the same node (4 GB of 8 GB requested), but if it has a resource limit of 6 GB RAM and actually uses that much, something will get killed off.
The place where things are different is if you can predict the high memory use. Say your pod requests 3 GB/limits 6 GB of RAM, and your side process will predictably also use 6 GB. If you just docker run it, something will definitely get OOM-killed. If you run it as a DaemonSet declaring a 6 GB memory request, the Kubernetes scheduler will know the pod doesn't fit and won't place it there (it may get stuck in "Pending" state if it can't be scheduled anywhere).
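A minimal sketch of that DaemonSet approach, assuming the side job can be packaged as a container image (the name, image, and sizes are only placeholders):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: side-job
spec:
  selector:
    matchLabels:
      app: side-job
  template:
    metadata:
      labels:
        app: side-job
    spec:
      containers:
      - name: side-job
        image: registry.example.com/side-job:latest   # placeholder image
        resources:
          requests:
            memory: "6Gi"      # declared, so the scheduler counts it
          limits:
            memory: "6Gi"
Because the 6 GiB is declared as a request, the scheduler subtracts it from the node's allocatable memory up front instead of discovering the usage only after an OOM kill.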
Kubernetes won't see other processes running on the host; however, you can tell the kubelet on that host how much of the host's resources to reserve for the host itself, preventing Kubernetes from scheduling pods that would exceed the host's capacity. See the --system-reserved flag that you can pass to the kubelet:
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

Apache Hadoop Yarn vs. Kubernetes

Since version 2.6, Apache Hadoop YARN handles Docker containers. Basically, it distributes the requested number of containers across a Hadoop cluster, restarts failed containers, and so on.
Kubernetes seems to do the same.
Where are the major differences?
Kubernetes was developed almost from a clean slate to extend the Docker container engine into a platform. Kubernetes development has taken a bottom-up approach. It is well optimized for specifying per-container/pod resource requirements, but it lacks an effective global scheduler that can partition resources into logical groupings. The Kubernetes design allows multiple schedulers to run in the cluster, each managing resources within its own pods. However, a Kubernetes cluster can suffer from instability when applications demand more resources than the physical systems can handle; it works best when infrastructure capacity exceeds application demands. The Kubernetes scheduler will attempt to fill up idle nodes with incoming application requests and terminate low-priority and starved containers to improve resource utilization. Kubernetes containers can integrate with external storage systems like S3 to provide resilience for data. The Kubernetes framework uses etcd to store cluster data. The etcd cluster nodes and the Hadoop NameNode are both single points of failure in their respective platforms. Etcd can have more replicas than the NameNode, so from a reliability point of view Kubernetes seems to be favored in theory. However, Kubernetes security is open by default unless RBAC is defined with fine-grained role bindings and the security context is set correctly for pods. If omitted, the primary group of the pod defaults to root, which can be problematic for system administrators trying to secure the infrastructure.
Apache Hadoop YARN was developed to run isolated Java processes for big data workloads and was later improved to support Docker containers. YARN provides global-level resource management, such as capacity queues for partitioning physical resources into logical units. Each business unit can be assigned a percentage of the cluster resources. The capacity-sharing system is designed in favor of guaranteed resource availability for enterprise priorities, rather than squeezing out every available physical resource. YARN does score more points on security: there are more security features around Kerberos, access control for privileged/non-privileged containers, trusted Docker images, and placement policy constraints. Most Docker-related security is closed by default, and the system admin needs to manually turn on flags to grant more power to containers. Large enterprises tend to run Hadoop more than Kubernetes because securing the system costs less. There are more distributed SQL engines built on top of YARN, including Hive, Impala, SparkSQL, and IBM BigSQL. The database options make YARN an attractive choice because of the ability to run online transaction processing in containers and online analytical processing as batch workloads. The Hadoop developer toolchain can be overwhelming: MapReduce, Hive, Pig, Spark, and so on each have their own style of development. The user experience is inconsistent, and it takes a while to learn them all. Kubernetes feels less obstructive by comparison because it only deploys Docker containers. With the introduction of YARN Services to run Docker container workloads, YARN can feel less wordy than Kubernetes.
If your plan is to outsource IT operations to a public cloud, pick Kubernetes. If your plan is to build a private/hybrid/multi-cloud, pick Apache YARN.
While this question and answer isn't exactly what you are asking, it does touch on a number of the same points.
Last I saw, YARN was just a resource-sharing mechanism, whereas Kubernetes is an entire platform, encompassing ConfigMaps, declarative environment management, Secret management, volume mounts, a very well designed API for interacting with all of those things, and role-based access control. Kubernetes is also in widespread use, meaning one can very easily find both candidates to hire and tools to buy.
A blog post I found cited a master's thesis that describes some of the fascinating trade-offs between the different schedulers' views of the world. It's a lot of words, so if you're looking for a tl;dr answer, that link may not be it, but if you're looking for actual research on the topic, it seems sound.

What defines what containers in Pods 'see' in terms of their limits and requests?

When a container in a Pod is created in a Kubernetes cluster with a limit and request set, how aware can that container be of those limits and requests? Would an application running inside the container be able to get these limits and requests to, for example, reduce the amount of resources it uses if the limits and requests were particularly low?
Kubernetes version: 1.8
Container runtime: Docker
Docker version: 1.12.6
The question "Check mem_limit within a docker container" covers this; the tl;dr is that
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
will show the limit. Presumably the requests value is then the memory the container was started with, but I would need to verify that assumption.
I personally don't even understand the unit when trying to apply limits: cpu:, so I for sure wouldn't know how to verify that value.
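For what it's worth, one plausible way to verify the CPU limit from inside the container (assuming cgroup v1, which Docker 1.12 uses) is to read the CFS quota and period; the limit in cores is the quota divided by the period:
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
For example, a quota of 50000 with a period of 100000 corresponds to limits: cpu: 500m (half a core).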
The Downward API can be used to pass the requests and limits to the container process as environment variables.
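A sketch of what that can look like, with an arbitrary pod name, image, and resource values, using resourceFieldRef to inject the container's own memory request and limit:
apiVersion: v1
kind: Pod
metadata:
  name: resource-aware-app
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "env | grep MEM_ && sleep 3600"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    env:
    - name: MEM_REQUEST              # visible to the app as an env var (bytes)
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: requests.memory
    - name: MEM_LIMIT
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: limits.memory
The application can then read MEM_REQUEST and MEM_LIMIT at startup and size its caches or worker pools accordingly.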
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-requests-are-scheduled
If a Container exceeds its memory limit, it might be terminated. If it is restartable, the kubelet will restart it, as with any other type of runtime failure.
If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
To get information about actual resource usage, you need a monitoring system configured for your cluster (Heapster, Prometheus, etc.). Requests and limits can then be adjusted according to this data, either manually or automatically.
One possible way to automate this process is to create a dedicated microservice that watches resource usage (by collecting and analyzing data from the monitors), generates manifests with new limits, and sends requests to the kube API to recreate the pods.
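For a quick manual check, assuming a metrics backend is installed (Heapster on those older versions, metrics-server today), kubectl can already show current usage:
kubectl top node
kubectl top pod --all-namespaces
Comparing these numbers against the configured requests and limits is usually the first step before automating anything.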
