As far as I know, docker stop stops containers one by one even if you pass multiple containers to it at once. Does the kubelet behave the same way?
Pod termination is described in detail on the Pod lifecycle page:
1. User sends command to delete Pod, with default grace period (30s).
2. The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period.
3. Pod shows up as “Terminating” when listed in client commands.
4. (simultaneous with 3) When the Kubelet sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the Pod shutdown process.
   - If one of the Pod’s containers has defined a preStop hook, it is invoked inside of the container. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
   - The container is sent the TERM signal. Note that not all containers in the Pod will receive the TERM signal at the same time and may each require a preStop hook if the order in which they shut down matters.
5. (simultaneous with 3) Pod is removed from the endpoints list for the service, and is no longer considered part of the set of running Pods for replication controllers. Pods that shut down slowly cannot continue to serve traffic, as load balancers (like the service proxy) remove them from their rotations.
6. When the grace period expires, any processes still running in the Pod are killed with SIGKILL.
7. The Kubelet will finish deleting the Pod on the API server by setting grace period 0 (immediate deletion). The Pod disappears from the API and is no longer visible from the client.
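For reference, here is a minimal sketch of a Pod spec that combines an explicit grace period with a preStop hook (the name, image and sleep duration are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo        # illustrative name
spec:
  terminationGracePeriodSeconds: 30   # the default; raise it if shutdown takes longer
  containers:
  - name: app
    image: nginx:1.25                 # illustrative image
    lifecycle:
      preStop:
        exec:
          # runs inside the container before TERM is sent to the main process
          command: ["sh", "-c", "sleep 5"]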
Related
Assume that there are some pods from Deployments/StatefulSet/DaemonSet, etc. running on a Kubernetes node.
Then I restart the node directly, and afterwards start docker and start kubelet with the same parameters.
What would happen to those pods?
Are they recreated from metadata saved locally by the kubelet? Or from info retrieved from the api-server? Or recovered by the OCI runtime as if nothing happened?
Is it that only stateless pods (no --local-data) can be recovered normally? If any of them has a local PV/dir, would it be connected back normally?
What if I do not restart the node for a long time? Would the api-server assign other nodes to recreate those pods? What is the default timeout value, and how can I configure it?
As far as I know:
apiserver
    ^
    | (sync)
    v
kubelet
    ^
    | (sync)
    v
---------------
| CRI plugin  |  (like an api)
| containerd  |  (like an api-server)
| runc        |  (low-level binary which manages containers)
| c' runtime  |  (container runtime where containers run)
---------------
When the kubelet receives a PodSpec from the kube-apiserver, it calls the CRI like a remote service; the steps are roughly:
1. create the PodSandbox (a.k.a. the 'pause' image, always 'stopped')
2. create container(s)
3. run container(s)
So my guess is that after the node and docker are restarted, steps 1 and 2 have already been done and the containers are in the 'stopped' state; then, once the kubelet is restarted, it pulls the latest info from the kube-apiserver, finds that the container(s) are not in the 'running' state, and calls the CRI to run the container(s), after which everything is back to normal.
Please help me confirm.
Thank you in advance~
Good questions. A few things first: a Pod is not pinned to a certain node. The nodes are mostly seen as a "server farm" that Kubernetes can use to run its workload. You give Kubernetes a set of nodes and you also give it a set of e.g. Deployments - that is, the desired state of the applications that should run on your servers. Kubernetes is responsible for scheduling these Pods and also for keeping them running when something in the cluster changes.
Standalone Pods are not managed by anything, so if such a Pod crashes it is not recovered. You typically want to deploy your stateless apps as Deployments, which then create ReplicaSets that manage a set of Pods - e.g. 4 Pods - instances of your app.
Your desired state - a Deployment with e.g. replicas: 4 - is saved in the etcd database within the Kubernetes control plane.
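As a sketch, that desired state could be expressed like this (names and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # illustrative name
spec:
  replicas: 4                 # desired number of Pod instances
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0     # illustrative image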
Then a set of controllers for Deployments and ReplicaSets is responsible for keeping 4 replicas of your app alive. E.g. if a node becomes unresponsive (or dies), new pods will be created on other nodes, provided they are managed by the ReplicaSet controller.
The kubelet receives the PodSpecs that are scheduled to its node, and then keeps these pods alive with regular health checks.
Is it that only stateless pods (no --local-data) can be recovered normally?
Pods should be seen as ephemeral - i.e. they can disappear - but they are recovered by the controller that manages them, unless they are deployed as standalone Pods. So don't store local data within the pod.
There are also StatefulSet pods; those are meant for stateful workloads - but distributed stateful workloads, typically e.g. 3 pods that use Raft to replicate data. The etcd database is an example of a distributed database that uses Raft.
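A minimal sketch of such a StatefulSet (the headless Service name, app name and image are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                     # illustrative name
spec:
  serviceName: my-db-headless     # headless Service giving each Pod a stable DNS name
  replicas: 3                     # e.g. a 3-member Raft group
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
      - name: db
        image: my-db:1.0          # illustrative image
  volumeClaimTemplates:           # each replica gets its own PersistentVolumeClaim
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi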
The correct answer: it depends.
Imagine you've got a 3-node cluster, where you created a Deployment with 3 replicas, and 3-5 standalone pods.
Pods are created and scheduled to nodes.
Everything is up and running.
Let's assume that worker node node1 has got 1 deployment replica and 1 or more standalone pods.
The general sequence of the node restart process is as follows:
1. The node gets restarted, e.g. using sudo reboot.
2. After the restart, the node starts all OS processes in the order specified by systemd dependencies.
3. When dockerd is started it does nothing. At this point all previous containers have the Exited status.
4. When kubelet is started it requests from the cluster apiserver the list of Pods whose node property equals its node name.
5. After getting the reply from the apiserver, kubelet starts containers for all Pods described in the apiserver reply using the Docker CRI.
6. When the pause container starts for each Pod from the list, it gets a new IP address configured by the CNI binary, deployed by the network addon DaemonSet's Pod.
7. After the kube-proxy Pod is started on the node, it updates the iptables rules to implement the Kubernetes Services' desired configuration, taking into account the new Pods' IP addresses.
Now things become a bit more complicated.
Depending on the apiserver, kube-controller-manager and kubelet configuration, they react to the fact that the node is not responding with some delay.
If the node restarts fast enough, kube-controller-manager doesn't evict the Pods and they all remain scheduled on the same node, increasing their RESTARTS count once their new containers become Ready.
Example 1.
The cluster is created using Kubeadm with Flannel network addon on Ubuntu 18.04 VM created in GCP.
Kubernetes version is v1.18.8
Docker version is 19.03.12
After the node is restarted, all Pods' containers are started on the node with new IP addresses. Pods keep their names and location.
If the node is stopped for a long time, the pods on that node stay in the Running state, but connection attempts obviously time out.
If the node remains stopped, after approximately 5 minutes the pods scheduled on that node are evicted by kube-controller-manager and terminated. If I start the node before that eviction, all pods remain on the node.
In case of eviction, standalone Pods disappear forever; Deployments and similar controllers create the necessary number of pods to replace the evicted Pods, and kube-scheduler puts them on appropriate nodes. If a new Pod can't be scheduled on another node, for example due to a lack of required volumes, it remains in the Pending state until the scheduling requirements are satisfied.
On a cluster created using an Ubuntu 18.04 Vagrant box and the Virtualbox hypervisor with a host-only adapter dedicated to Kubernetes networking, pods on the stopped node remain Running, but with Readiness: false, even after two hours, and are never evicted. After starting the node two hours later, all containers were restarted successfully.
This configuration's behavior is the same all the way from Kubernetes v1.7 till the latest v1.19.2.
Example 2.
The cluster is created in Google cloud (GKE) with the default kubenet network addon:
Kubernetes version is 1.15.12-gke.20
Node OS is Container-Optimized OS (cos)
After the node is restarted (it takes around 15-20 seconds), all pods are started on the node with new IP addresses. Pods keep their names and location (same as in example 1).
If the node is stopped, after a short period of time (T1, around 30-60 seconds) all pods on the node change status to Terminating. A couple of minutes later they disappear from the Pods list. Pods managed by a Deployment are rescheduled on other nodes with new names and IP addresses.
If the node pool is created with Ubuntu nodes, the apiserver terminates the Pods later; T1 is around 2-3 minutes.
The examples show that the situation after a worker node gets restarted differs between clusters, and it's better to run the experiment on your specific cluster to check whether you get the expected results.
How to configure those timeouts:
How Can I Reduce Detecting the Node Failure Time on Kubernetes?
Kubernetes recreate pod if node becomes offline timeout
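One relevant knob, as a hedged sketch: with taint-based evictions, each Pod carries tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds: 300 by default (which matches the ~5 minutes observed above), and you can override them per Pod. The name, image and 60-second value below are just examples:

apiVersion: v1
kind: Pod
metadata:
  name: quick-evict-demo          # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.25             # illustrative image
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60         # evict after 60s instead of the default 300s
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60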
When the node is restarted and there are pods scheduled on it that are managed by a Deployment or ReplicaSet, those controllers will take care of scheduling the desired number of replicas on another, healthy node. So if you have 2 replicas running on the restarted node, they will be terminated and scheduled on another node.
Before restarting a node you should use kubectl cordon to mark the node as unschedulable and give Kubernetes time to reschedule the pods.
Stateless pods will not be rescheduled on any other node; they will be terminated.
The Problem
When one of our locally hosted bare-metal k8s (1.18) nodes is powered-on, pods are scheduled, but struggle to reach 'Ready' status - almost entirely due to a land-rush of disk IO from 30-40 pods being scheduled simultaneously on the node.
This often results in a cascade of Deployment failures:
IO requests on the node stack up in the IOWait state as pods deploy.
Pod startup times skyrocket from (normal) 10-20 seconds to minutes.
livenessProbes fail.
Pods are re-scheduled, compounding the problem as more IO requests stack up.
Repeat.
FWIW Memory and CPU are vastly over-provisioned on the nodes, even in the power-on state (<10% usage).
Although we do have application NFS volume mounts (that would normally be suspect WRT IO issues), the disk activity and restriction at pod startup is almost entirely in the local docker container filesystem.
Attempted Solutions
As disk IO is not a limitable resource, we are struggling to find a solution for this. We have tuned our docker images to write to disk as little as possible at startup, and this has helped some.
One basic solution involves lowering the number of pods scheduled per node by increasing the number of nodes in the cluster. This isn't ideal for us, as they are physical machines, and once the nodes DO start up, the cluster is significantly over-resourced.
As we are bare-metal/local, we do not have an automated way to provision extra nodes in startup situations and scale them back down as the cluster stabilizes.
Applying priorityClasses at first glance seemed to be a solution. We created priorityClasses and applied them accordingly; however, as stated in the documentation:
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
tl;dr: Pods will still all be "schedulable" simultaneously at power-on, as no configurable resource limits are being exceeded.
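For context, a priorityClass of the kind described above looks roughly like this (the name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-apps             # illustrative name
value: 1000000                    # higher value wins preemption
globalDefault: false
description: "Pods that should win preemption during the power-on land-rush."
# The Pod templates of the important Deployments then reference it via:
#   spec:
#     priorityClassName: critical-apps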
Question(s)
Is there a method to limit the scheduling of pods on a node based on its current number of non-Ready pods? This would allow priority classes to evict non-priority pods and schedule the higher-priority ones first.
Aside from increasing the number of cluster nodes, is there a method we have not thought of to otherwise manage this disk IO land-rush?
While I am also interested to see smart people answer the question, here is my probably "just OK" idea:
Configure the new node with a Taint that will prevent your "normal" pods from being scheduled to it.
Create a deployment of do-nothing pods (sketched after the Dockerfile below) with:
A "reasonably large" memory request, eg: 1GB.
A number of replicas high enough to "fill" the node.
A toleration for the above Taint.
Remove the Taint from the now-"full" node.
Scale down the do-nothing deployment at whatever rate you feel is appropriate to avoid the "land rush".
Here's a Dockerfile for the do-nothing "noop" image I use for testing/troubleshooting:
FROM alpine:3.9
CMD sh -c 'while true; do sleep 5; done'
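Assuming the new node was tainted with something like kubectl taint nodes <node> startup=landrush:NoSchedule, a sketch of the do-nothing Deployment from the steps above (the taint key, replica count, image name and memory size are all illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: landrush-filler               # illustrative name
spec:
  replicas: 30                        # enough 1Gi requests to "fill" the node
  selector:
    matchLabels:
      app: landrush-filler
  template:
    metadata:
      labels:
        app: landrush-filler
    spec:
      tolerations:
      - key: "startup"                # matches the illustrative taint on the fresh node
        operator: "Equal"
        value: "landrush"
        effect: "NoSchedule"
      containers:
      - name: noop
        image: my-registry/noop:1.0   # built from the Dockerfile above
        resources:
          requests:
            memory: "1Gi"             # "reasonably large" request per pod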
Kubernetes Startup Probes might mitigate the problem of Pods being killed due to livenessProbe timeouts: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
If you configure them appropriately, the I/O "land rush" will still happen, but the pods have enough time to settle instead of being killed.
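A minimal sketch of what that could look like on an affected container (the path, port and thresholds are illustrative):

containers:
- name: app
  image: my-app:1.0              # illustrative image
  startupProbe:                  # gives the app up to 30 * 10s = 5 minutes to start
    httpGet:
      path: /healthz             # illustrative endpoint
      port: 8080
    failureThreshold: 30
    periodSeconds: 10
  livenessProbe:                 # only takes effect after the startupProbe succeeds
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10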
Deployed OpenWhisk on Rancher with kubernetesContainerFactory and the Invoker Agent enabled. Invoked the hello.js action, and wsk created the pods wskowdev-invoker-00-1-prewarm-nodejs10 and wskowdev-invoker-00-2-prewarm-nodejs10. Pod 1 got removed after 15 minutes, but pod 2 never gets removed by the invoker agent. My understanding is that after a specified period of time the pods should be suspended. Please clarify how the invoker agent and pod suspension/removal work.
One of the pre-warmed containers was removed because it was used. The other is still pristine and remains around, not reclaimed. This is called a stem cell container. A container becomes subject to garbage collection later, once it is specialized (initialized with user code).
The default idle timeout per this configuration is 10 minutes though, not 15, unless overridden:
https://github.com/apache/incubator-openwhisk/blob/2f0155fb750ce8b5eef6d5b0f4e2e2db40e5a037/core/invoker/src/main/resources/application.conf#L103-L110
The number of stemcells is specific to a runtime and determined by the runtime manifest. For example: https://github.com/apache/incubator-openwhisk/blob/ce45d54c824ef6c3e5d98ce0b220b924c81e688b/ansible/files/runtimes.json#L45-L50
Once a stem cell container is used, a nanny process will replace it with a new container so that the number of pre-warmed containers should generally be constant.
I have a Kubernetes Pod created by a StatefulSet (not sure if that matters). There are two containers in this pod. When one of the two containers fails and I use the get pods command, 1/2 containers are Ready and the status is "Error". The second container never attempts a restart, and I am unable to destroy the pod except by using the --grace-period=0 --force flags. A typical delete leaves the pod hanging in a "Terminating" state either forever or for a very, very long time. What could be causing this behavior, and how would I go about debugging it?
I encounter a similar problem on a node in my k8s 1.6 cluster, especially when the node has been running for a couple of weeks. It can happen to any node. When this happens, I restart the kubelet on the node and the errors go away.
It's not the best thing to do, but it always solves the problem. It's also not detrimental to the cluster if you restart kubelet because the running pods continue to stay up.
kubectl get po -o wide will likely reveal to you that the errant pods are running on one node. SSH to that node and restart kubelet.
When deleting a pod, Kubernetes first deletes it in etcd via the apiserver, then the controllers and the kubelet do their work based on the changes to the objects stored in etcd - am I right?
So here comes the question: after a pod has been deleted in etcd, the endpoints controller and the kubelet should both react, but which one will complete first? If the pod has actually been killed by the kubelet on the node but the endpoint has not been removed yet, some requests to the service will be lost. Is that right?
thanks!
I think you are right. It's possible that a Pod is deleted before the endpoints are updated. Once a pod's deletionTimestamp (https://github.com/kubernetes/kubernetes/blob/master/pkg/client/unversioned/pods.go#L71) is set, the kubelet will signal the container(s) to stop, and the endpoints controller will start updating the relevant endpoints. It's random which one finishes first.
[Update]
When deleting a pod, Kubernetes first deletes it in etcd via the apiserver, then the controllers and the kubelet do their work based on the changes to the objects stored in etcd - am I right?
If a grace period (30s by default) is specified, the pod will be removed from the list of endpoints first, and deleted from the api server after the grace period expires.
If the pod has actually been killed by the kubelet on the node but the endpoint has not been removed yet, some requests to the service will be lost. Is that right?
This section in the docs is particularly useful for this topic:
https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/pods.md#termination-of-pods
Because the processes in the Pod being deleted are sent the TERM signal, they have a chance to finish serving ongoing requests. So if you delete the pod with a grace period (30s by default), and the processes handle the TERM signal properly, the pending requests won't be lost.