what would happen if i restart a node with some pods running - docker

Assume that there are some pods from Deployments/StatefulSet/DaemonSet, etc. running on a Kubernetes node.
Then I restarted the node directly, and then start docker, start kubelet with the same parameters.
What would happen to those pods?
Are they recreated with metadata saved locally from kubelet? Or use info retrieved from api-server? Or recovered from OCI runtime and behaves like nothing happened?
Is it that only stateless pod(no --local-data) can be recovered normally? If any of them has a local PV/dir, would they be connected back normally?
What if I did not restart the node for a long time? Would api-server assign other nodes to create those pods? What is the default timeout value? How can I configure this?
As far as I know:
apiserver
^
|(sync)
V
kubelet
^
|(sync)
V
-------------
| CRI plugin |(like api)
| containerd |(like api-server)
| runc |(low-level binary which manages container)
| c' runtime |(container runtime where containers run)
-------------
When kubelet received a PodSpec from kube-api-server, it calls CRI like a remote service, the steps be like:
create PodSandbox(a.k.a 'pause' image, always 'stopped')
create container(s)
run container(s)
So I guess that as the node and docker being restarted, steps 1 and 2 are already done, containers are at 'stopped' status; Then as kubelet being restarted, it pulls latest info from kube-api-server, found out that container(s) are not in 'running' state, so it calls CRI to run container(s), then everything are back to normal.
Please help me confirm.
Thank you in advance~

Good questions. A few things first; a Pod is not pinned to a certain node. The nodes is mostly seen as a "server farm" that Kubernetes can use to run its workload. E.g. you give Kubernetes a set of nodes and you also give a set of e.g. Deployment - that is desired state of applications that should run on your servers. Kubernetes is responsible for scheduling these Pods and also keep them running when something in the cluster is changed.
Standalone pods is not managed by anything, so if a Pod crashes it is not recovered. You typically want to deploy your stateless apps as Deployments, that then initiates ReplicaSets that manage a set of Pods - e.g. 4 Pods - instances of your app.
Your desired state; a Deployment with e.g. replicas: 4 is saved in the etcd database within the Kubernetes control plane.
Then a set of controllers for Deployment and ReplicaSet is responsible for keeping 4 replicas of your app alive. E.g. if a node becomes unresponsible (or dies), new pods will be created on other Nodes, if they are managed by the controllers for ReplicaSet.
A Kubelet receives a PodSpecs that are scheduled to the node, and then keep these pods alive by regularly health checks.
Is it that only stateless pod(no --local-data) can be recovered normally?
Pods should be seen as emphemeral - e.g. can disappear - but is recovered by a controller that manages them - unless deployed as standalone Pod. So don't store local data within the pod.
There is also StatefulSet pods, those are meant for stateful workload - but distributed stateful workload, typically e.g. 3 pods, that use Raft to replicate data. The etcd database is an example of distributed database that uses Raft.

The correct answer: it depends.
Imagine, you've got 3 nodes cluster, where you created a Deployment with 3 replicas, and 3-5 standalone pods.
Pods are created and scheduled to nodes.
Everything is up and running.
Let's assume that worker node node1 has got 1 deployment replica and 1 or more standalone pods.
The general sequence of node restart process as follows:
The node gets restarted, for ex. using sudo reboot
After restart, the node starts all OS processes in the order specified by systemd dependencies
When dockerd is started it does nothing. At this point all previous containers has Exited state.
When kubelet is started it requests the cluster apiserver for the list of Pods with node property equals its node name.
After getting the reply from apiserver, kubelet starts containers for all pods described in the apiserver reply using Docker CRI.
When pause container starts for each Pod from the list, it gets new IP address configured by CNI binary, deployed by Network addon Daemonset's Pod.
After kube-proxy Pod is started on the node, it updates iptables rules to implement Kubernetes Services desired configuration, taking to account new Pods' IP addresses.
Now things become a bit more complicated.
Depending on apiserver, kube-controller-manager and kubelet configuration, they reacts on the fact that node is not responding with some delay.
If the node restarts fast enough, kube-controller-manager doesn't evict the Pods and they all remain scheduled on the same node increasing their RESTARTS number after their new containers become Ready.
Example 1.
The cluster is created using Kubeadm with Flannel network addon on Ubuntu 18.04 VM created in GCP.
Kubernetes version is v1.18.8
Docker version is 19.03.12
After the node is restarted, all Pods' containers are started on the node with new IP addresses. Pods keep their names and location.
If node is stopped for a long time, the pods on the node stays in Running state, but connection attempts are obviously timed out.
If node remains stopped, after approximately 5 minutes pods scheduled on that node were evicted by kube-controller-manager and terminated. If I would start node before that eviction all pods were remained on the node.
In case of eviction, standalone Pods disappear forever, Deployments and similar controllers create necessary number of pods to replace evicted Pods and kube-scheduler puts them to appropriate nodes. If new Pod can't be scheduled on another node, for ex. due to lack of required volumes it will remain in Pending state, until the scheduling requirements were satisfied.
On a cluster created using Ubuntu 18.04 Vagrant box and Virtualbox hypervisor with host-only adapter dedicated for Kubernetes networking, pods on stopped node remains in the Running, but with Readiness: false state even after two hours, and were never evicted. After starting the node in 2 hours all containers were restarted successfully.
This configuration's behavior is the same all the way from Kubernetes v1.7 till the latest v1.19.2.
Example 2.
The cluster is created in Google cloud (GKE) with the default kubenet network addon:
Kubernetes version is 1.15.12-gke.20
Node OS is Container-Optimized OS (cos)
After the node is restarted (it takes around 15-20 seconds) all pods are started on the node with new IP addresses. Pods keep their names and location. (same with example 1)
If the node is stopped, after short period of time (T1 equals around 30-60 seconds) all pods on the node change status to Terminating. Couple minutes later they disappear from the Pods list. Pods managed by Deployment are rescheduled on other nodes with new names and ip addresses.
If the node pool is created with Ubuntu nodes, apiserver terminates Pods later, T1 equals around 2-3 minutes.
The examples show that the situation after worker node gets restarted is different for different clusters, and it's better to run the experiment on a specific cluster to check if you can get the expected results.
How to configure those timeouts:
How Can I Reduce Detecting the Node Failure Time on Kubernetes?
Kubernetes recreate pod if node becomes offline timeout

When the node is restarted and there are pods scheduled on it, managed by Deployment or ReplicaSet, those controllers will take care of scheduling desired number of replicas on another, healthy node. So if you have 2 replicas running on restarted node, they will be terminated and scheduled on other node.
Before restarting a node you should use kubectl cordon to mark the node as unschedulable and give kubernetes time to reschedule pods.
Stateless pods will not be rescheduled on any other node, they will be terminated.

Related

Jenkins slave pods on Kubernetes disappear when their is an influx of running pods

I have a Kubernetes cluster running Jenkins master in a single pod and each build running in a separate slave pod. When there are many builds running, there are many pods being spun up and down and often I will see an error in a job like this:
Cannot contact slave-jenkins-0g9p0: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#197b6a38:JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418": Remote call on JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418 failed. The channel is closing down or has closed down
Could not connect to slave-jenkins-0g9p0 to send interrupt signal to process
The pod, for example slave-jenkins-0g9p0, just disappears. There is no trace that it existed. While watching information like kubectl describe pod slave-jenkins-0g9p0, there is no error message, it simply stops existing.
I have a feeling that because there are multiple pods spinning up and down that Kubernetes attempts to balance the load on the nodes and reschedule the pod but after killing it, it cannot spin up the pod on another node. I cannot be sure though. Maybe there is a way to tell K8s to tie a pod to a node until it exits itself? Im not really sure what/how to debug this case.
Kuberentes version: v1.16.13-eks-2ba888 on AWS EKS
Jenkins version: 2.257
Kubernetes plugin version 1.27.2
Any advise would be appreciated
Thanks
UPDATE:
I have uploaded three slave pod manifest examples here where you can see the resources allocated. The above issue occurs in each of these running pods.
The node pool is controlled by the Kubernetes autoscaler (v1.14.6) and use AWS t3a.large (2 CPU, 8GB mem) instances.
UPDATE 2:
I believe that I have found the cause of the problem. I disabled the cluster-autoscaler](https://github.com/kubernetes/autoscaler) (v1.14.6) and the problem stopped.
So what is seems is happening is that the autoscaler is removing the node that the slave pd is running on. I know that taints can be used to tell the autoscaler not to remove a node but is there a way to do this dynamically that it wont remove a node if a certain pod is running on it. Without having to develop something new.

What happens if master node dies in kubernetes? How to resolve the issue?

I've started learning kubernetes with docker and I've been thinking, what happens if master node dies/fails. I've already read the answers here. But it doesn't answer the remedy for it.
Who is responsible to bring it back? And how to bring it back? Can there be a backup master node to avoid this? If yes, how?
Basically, I'm asking a recommended way to handle master failure in kubernetes setup.
You should have multiple VMs serving as master node to avoid single point of failure.An odd number of 3 or 5 master nodes are recommended for quorum. Have a load balancer in-front of all the VMs serving as master node which can do load balancing and in case one master node dies loadbalancer should remove the VMs IP and make it as unhealthy and not send traffic to it.
Also ETCD cluster is the brain of a kubernetes cluster. So you should have multiple VMs serving as ETCD nodes. Those VMs can be same VMs as of master node or for reduced blast radius you can have separate VMs for ETCD. Again the odd number of VMs should should be 3 or 5. Make sure to take periodic backup of ETCD nodes data so that you can restore the cluster state to pervious state in case of a disaster.
Check the official doc on how to install a HA kubernetes cluster using Kubeadm.
In short, for Kubernetes you should keep master nodes to function properly all the time. There are different methods to make copies of master node, so it is available on failure. As example check this - https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/
Abhishek, you can run master node in high availability, you should set up the control plane aka master node behind Load balancer as first step. If you have plans to upgrade a single control-plane kubeadm cluster to high availability you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
By default because of security reasons the master node does not host PODS and if you want to enable hosting PODS on master node you can run the following command to do so.
kubectl taint nodes --all node-role.kubernetes.io/master
If you want to manually restore the master make sure you back up the etcd directory /var/lib/etcd. You can restore this on the new master and it should work. Read about high availability kubernetes over here.

How to kill a multi-container pod if one container fails?

I'm using Jenkins Kubernetes Plugin which starts Pods in a Kubernetes Cluster which serve as Jenkins agents. The pods contain 3 containers in order to provide the slave logic, a Docker socket as well as the gcloud command line tool.
The usual workflow is that the slave does its job and notifies the master that it completed. Then the master terminates the pod. However, if the slave container crashes due to a lost network connection, the container terminates with error code 255, the other two containers keep running and so does the pod. This is a problem because the pods have large CPU requests and setup is cheap with the slave running only when they have to, but having multiple machines running for 24h or over the weekend is a noticable financial damage.
I'm aware that starting multiple containers in the same pod is not fine Kubernetes arts, however ok if I know what I'm doing and I assume I do. I'm sure it's hard to solve this differently given the way the Jenkins Kubernetes Plugin works.
Can I make the pod terminate if one container fails without it respawn? As solution with a timeout is acceptable as well, however less preferred.
Disclaimer, I have a rather limited knowledge of kubernetes, but given the question:
Maybe you can run the forth container that exposes one simple endpoint of "liveness"
It can run ps -ef or any other way to contact 3 existing containers just to make sure they're alive.
This endpoint could return "OK" only if all the containers are running, and "ERROR" if at least one of them was detected as "crushed"
Then you could setup a liveness probe of kubernetes so that it would stop the pod upon the error returned from that forth container.
Of course if this 4th process will crash by itself for any reason (well it shouldn't unless there is a bug or something) then the liveness probe won't respond and kubernetes is supposed to stop the pod anyway, which is probably what you really want to achieve.

Kubernetes POD Failover

I am toying around with Kubernetes and have managed to deploy a statefull application (jenkins instance) to a single node.
It uses a PVC to make sure that I can persist my jenkins data (jobs, plugins etc).
Now I would like to experiment with failover.
My cluster has 2 digital ocean droplets.
Currently my jenkins pod is running on just one node.
When that goes down, Jenkins becomes unavailable.
I am now looking on how to accomplish failover in a sense that, when the jenkins pod goes down on my node, it will spin up on the other node. (so short downtime during this proces is ok).
Of course it has to use the same PVC, so that my data remains intact.
I believe, when reading, that a StatefulSet kan be used for this?
Any pointers are much appreciated!
Best regards
Digital Ocean's Kubernetes service only supports ReadWriteOnce access modes for PVCs (see here). This means the volume can only be attached to one node at a time.
I came across this blogpost which, while focused on Jenkins on Azure, has the same situation of only supporting ReadWriteOnce. The author states:
the drawback for me though lies in the fact that the access mode for Azure Disk persistent volumes is ReadWriteOnce. This means that an Azure disk can be attached to only one cluster node at a time. In the event of a node failure or update, it could take anywhere between 1-5 minutes for the Azure disk to get detached and attached to the next available node.
Note, Pod failure and node failures are different things. Since DO only supports ReadWriteOnce, there's no benefit to trying anything more sophisticated than what you have right now in terms of tolerance to node failure. Since it's ReadWriteOnce the volume will need to be unmounted from the failing node and re-mounted to the new node, and then a new Pod will get scheduled on the new node. Kubernetes will do this for you, and there's not much you can do to optimize it.
For Pod failure, you could use a Deployment since you want to read and write the same data, you don't want different PVs attached to the different replicas. There may be very limited benefit to this, you will have multiple replicas of the Pod all running on the same node, so it depends on how the Jenkins process scales and if it can support that type of scale horizontal out model while all writing to the same volume (as opposed to simply vertically scaling memory or CPU requests).
If you really want to achieve higher availability in the face of node and/or Pod failures, and the Jenkins workload you're deploying has a hard requirement on local volumes for persistent state, you will need to consider an alternative volume plugin like NFS, or moving to a different cloud provider like GKE.
Yes, you would use a Deployment or StatefulSet depending on the use case. For Jenkins, a StatefulSet would be appropriate. If the running pod becomes unavailable, the StatefulSet controller will see that and spawn a new one.
What you are describing is the default behaviour of Kubernetes for Pods that are managed by a controller, such as a Deployment.
You should deploy any application as a Deployment (or another controller) even if it consists just of a single Pod. You never really deploy Pods directly to Kubernetes. So, in this case, there's nothing special you need to do to get this behaviour.
When one of your nodes dies, the Pod dies too. This is detected by the Deployment controller, which creates a new Pod. This is in turn detected by the scheduler, which assigns the new Pod to a node. Since one of the nodes is down, it will assign the Pod to the other node that is still running. Once the Pod is assigned to this node, the kubelet of this node will run the container(s) of this Pod on this node.
Ok, let me try to anwser my own question here.
I think Amit Kumar Gupta came the closest to what I believe is going on here.
Since I am using a Deployment and my PVC in ReadWriteOnce, I am basically stuck with one pod, running jenkins, on one node.
weibelds answer made me realise that I was asking questions to about a concept that Kubernetes performs by default.
If my pod goes down (in my case i am shutting down a node on purpose by doing a hard power down to simulate a failure), the cluster (controller?) will detect this and spawn a new pod on another node.
All is fine so far, but then I noticed that my new pod as stuck in ContainerCreating state.
Running a describe on my new pod (the one in ContainerCreating state) showed this
Warning FailedAttachVolume 16m attachdetach-controller Multi-Attach error for volume "pvc-cb772fdb-492b-4ef5-a63e-4e483b8798fd" Volume is already used by pod(s) jenkins-deployment-6ddd796846-dgpnm
Warning FailedMount 70s (x7 over 14m) kubelet, cc-pool-bg6u Unable to mount volumes for pod "jenkins-deployment-6ddd796846-wjbkl_default(93747d74-b208-421c-afa4-8d467e717649)": timeout expired waiting for volumes to attach or mount for pod "default"/"jenkins-deployment-6ddd796846-wjbkl". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home default-token-wd6p7]
Then it started to hit me, this makes sense.
It's a pitty, but it makes sense.
Since I did a hard power down on the node, the PV went down with it.
So now the controller tries to start a new pod, on a new node but it cant transfer the PV, since the one on the previous pod became unreachable.
As I read more on this, I read that DigitalOcean only supports ReadWriteOnce , which now leaves me wondering, how the hell can I achieve a simple failover for a stateful application on a Kubernetes Cluster on Digital Ocean that consists of just a couple of simple droplets?

Trying to understand what a Kubernetes Worker Node and Pod is compared to a Docker "Service"

I'm trying to learn Kubernetes to push up my microservices solution to some Kubernetes in the Cloud (e.g. Azure Kubernetes Service, etc)
As part of this, I'm trying to understand the main concepts, specifically around Pods + Workers and (in the yml file) Pods + Services. To do this, I'm trying to compare what I have inside my docker-compose file against the new concepts.
Context
I currently have a docker-compose.yml file which contains about 10 images. I've split the solution up into two 'networks': frontend and backend. The backend network contains 3 microservices and cannot be accessed at all via a browser. The frontend network contains a reverse-proxy (aka. Traefik, which is just like nginx) which is used to route all requests to the appropriate backend microservice and a simple SPA web app. All works 100% awesome.
Each backend Microservice has at least one of these:
Web API host
Background tasks host
So this means, I could scale out the WebApi hosts, if required .. but I should never scale out the background tasks hosts.
Here's a simple diagram of the solution:
So if the SPA app tries to request some data with the following route:
https://api.myapp.com/account/1 this will hit the reverse-proxy and match a rule to then forward onto <microservice b>/account/1
So it's from here, I'm trying to learn how to write up an Kubernetes deployment file based on these docker-compose concepts.
Questions
Each 'Pod' has it's own IP so I should create a Pod per container. (Yes, a Pod can have multiple containers and to me, that's like saying 'install these software products on the same machine')
A 'Worker Node' is what we replicate/scale out, so we should put our Pods into a Node based on the scaling scenario. For example, the Background Task hosts should go into one Node because they shouldn't be scaled. Also, the hardware requirements for that node are really small. While the Web Api's should go into another Node so they can be replicated/scaled out
If I'm on the right path with the understanding above, then I'll have a lot of nodes and pods ... which feels ... weird?
The pod is the unit of Workload, and has one or more containers. Exactly one container is normal. You scale that workload by changing the number of Pod Replicas in a ReplicaSet (or Deployment).
A Pod is mostly an accounting construct with no direct parallel to base docker. It's similar to docker-compose's Service. A pod is mostly immutable after creation. Like every resource in kubernetes, a pod is a declaration of desired state - containers to be running somewhere. All containers defined in a pod are scheduled together and share resources (IP, memory limits, disk volumes, etc).
All Pods within a ReplicaSet are both fungible and mortal - a request can be served by any pod in the ReplicaSet, and any pod can be replaced at any time. Each pod does get its own IP, but a replacement pod will probably get a different IP. And if you have multiple replicas of a pod they'll all have different IPs. You don't want to manage or track pod IPs. Kubernetes Services provide discovery (how do I find those pods' IPs) and routing (connect to any Ready pod without caring about its identity) and load balancing (round robin over that group of Pods).
A Node is the compute machine (VM or Physical) running a kernel and a kubelet and a dockerd. (This is a bit of a simplification. Other container runtimes than just dockerd exist, and the virtual-kubelet project aims to turn this assumption on its head.)
All pods are Scheduled on Nodes. When a pod (with containers) is scheduled on a node, the kubelet responsible for & running on that node does things. The kubelet talks to dockerd to start containers.
Once scheduled on a node, a pod is not moved to another node. Nodes are fungible & mortal too, though. If a node goes down or is being decommissioned, the pod will be evicted/terminated/deleted. If that pod was created by a ReplicaSet (or Deployment) then the ReplicaSet Controller will create a new replica of that pod to be scheduled somewhere else.
You normally start many (1-100) pods+containers on the same node+kubelet+dockerd. If you have more pods than that (or they need a lot of cpu/ram/io), you need more nodes. So the nodes are also a unit of scale, though very much indirectly wrt the web-app.
You do not normally care which Node a pod is scheduled on. You let kubernetes decide.

Resources