What happens if master node dies in kubernetes? How to resolve the issue?

I've started learning Kubernetes with Docker and I've been wondering: what happens if the master node dies or fails? I've already read the answers here, but they don't cover the remedy for it.
Who is responsible for bringing it back? And how do you bring it back? Can there be a backup master node to avoid this? If yes, how?
Basically, I'm asking for the recommended way to handle master failure in a Kubernetes setup.

You should have multiple VMs serving as master nodes to avoid a single point of failure. An odd number of master nodes, 3 or 5, is recommended for quorum. Put a load balancer in front of all the VMs serving as master nodes; if one master node dies, the load balancer should mark that VM's IP as unhealthy, remove it from rotation, and stop sending traffic to it.
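As a rough sketch of what such a load balancer could look like (assuming HAProxy; the three master IPs 10.0.0.11-13 are hypothetical placeholders):

frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend kube-masters

backend kube-masters
    mode tcp
    balance roundrobin
    option tcp-check
    # the health check removes a dead master from rotation automatically
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check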
Also, the etcd cluster is the brain of a Kubernetes cluster, so you should have multiple VMs serving as etcd nodes. Those can be the same VMs as the master nodes or, for a reduced blast radius, you can run etcd on separate VMs. Again, use an odd number of VMs, 3 or 5. Make sure to take periodic backups of the etcd data so that you can restore the cluster to a previous state in case of a disaster.
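For example, a periodic backup could be taken with etcdctl like this (a sketch; the endpoint and certificate paths are the kubeadm defaults and may differ in your setup):

# save a point-in-time snapshot of the etcd keyspace
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key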
Check the official doc on how to install an HA Kubernetes cluster using kubeadm.

In short, Kubernetes needs its master nodes to function properly at all times. There are different methods to replicate the master node so that it stays available on failure. As an example check this - https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/

Abhishek, you can run the master node in high availability; as a first step, you should set up the control plane (aka master node) behind a load balancer. If you plan to upgrade a single control-plane kubeadm cluster to high availability, you should specify --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load balancer.
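When bootstrapping with kubeadm, that could look like the following (a sketch; LOAD_BALANCER_DNS is a placeholder for your load balancer's DNS name or IP):

# on the first control-plane node; --upload-certs shares the certificates
# so additional control-plane nodes can join later
kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:6443" --upload-certs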
By default, for security reasons, the master node does not host pods. If you want to enable hosting pods on the master node, you can remove its taint by running the following command (note the trailing dash, which removes the taint):
kubectl taint nodes --all node-role.kubernetes.io/master-
If you want to manually restore the master, make sure you back up the etcd data directory /var/lib/etcd. You can restore this on the new master and it should work. Read about highly available Kubernetes here.
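A restore from an etcd snapshot might then look like this (a sketch; the paths are illustrative, and the restored data directory must match what your etcd manifest points at):

# materialize the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored
# then point etcd at the restored directory (or move it to /var/lib/etcd) and restart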

Related

Kubernetes - is it ok to have 1 Control Plane and 1 worker Node for dev/test purposes?

We need to initiate a Kubernetes cluster and start our development.
Is it OK to have 1 master Control Plane node and 1 worker node with our containers to start the development?
We can afford for services to be unavailable in case of upgrades, scaling and so on; I'm just worried whether I am lacking some more important info.
I was planning to have 8 CPUs and 64 GB, since those are similar resources to what we have on one of our VMs running the same apps without containers.
We will deploy the cluster with Azure Kubernetes Service.
Thank you
Sure, you can also have single-node clusters. Just as you said, that means if that one node goes down, the cluster is unavailable.

What would happen if I restart a node with some pods running

Assume that there are some pods from Deployments/StatefulSets/DaemonSets, etc. running on a Kubernetes node.
Then I restart the node directly, and afterwards start docker and start kubelet with the same parameters.
What would happen to those pods?
Are they recreated with metadata saved locally by kubelet? Or with info retrieved from the api-server? Or recovered from the OCI runtime so that they behave like nothing happened?
Is it that only stateless pods (no --local-data) can be recovered normally? If any of them has a local PV/dir, would they be connected back normally?
What if I did not restart the node for a long time? Would the api-server assign other nodes to create those pods? What is the default timeout value? How can I configure this?
As far as I know:
apiserver
    ^
    | (sync)
    v
kubelet
    ^
    | (sync)
    v
---------------
| CRI plugin  |  (like an api)
| containerd  |  (like an api-server)
| runc        |  (low-level binary which manages containers)
| c' runtime  |  (container runtime where containers run)
---------------
When kubelet receives a PodSpec from kube-api-server, it calls the CRI like a remote service; the steps are roughly:
create the PodSandbox (a.k.a. the 'pause' image, always 'stopped')
create container(s)
run container(s)
So I guess that after the node and docker are restarted, steps 1 and 2 are already done and the containers are in 'stopped' status; then, as kubelet is restarted, it pulls the latest info from kube-api-server, finds out that the container(s) are not in 'running' state, and calls the CRI to run the container(s), after which everything is back to normal.
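For reference, I believe these steps map roughly onto the crictl debugging CLI (a sketch; pod-config.json and container-config.json are placeholder config files):

# 1. create the PodSandbox (the 'pause' container)
POD_ID=$(crictl runp pod-config.json)
# 2. create the container inside that sandbox
CTR_ID=$(crictl create "$POD_ID" container-config.json pod-config.json)
# 3. start the container
crictl start "$CTR_ID"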
Please help me confirm.
Thank you in advance~
Good questions. A few things first: a Pod is not pinned to a certain node. The nodes are mostly seen as a "server farm" that Kubernetes can use to run its workload. E.g. you give Kubernetes a set of nodes and you also give it a set of e.g. Deployments - that is, the desired state of applications that should run on your servers. Kubernetes is responsible for scheduling these Pods and also for keeping them running when something in the cluster changes.
Standalone Pods are not managed by anything, so if a Pod crashes it is not recovered. You typically want to deploy your stateless apps as Deployments, which then initiate ReplicaSets that manage a set of Pods - e.g. 4 Pods - instances of your app.
Your desired state - a Deployment with e.g. replicas: 4 - is saved in the etcd database within the Kubernetes control plane.
Then a set of controllers for Deployments and ReplicaSets is responsible for keeping 4 replicas of your app alive. E.g. if a node becomes unresponsive (or dies), new Pods will be created on other nodes, provided they are managed by the controllers for a ReplicaSet.
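To make this concrete, such a desired state might look like the following Deployment manifest (a minimal sketch; the name and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4          # the controllers keep 4 Pods of this app alive
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.0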
A kubelet receives the PodSpecs that are scheduled to its node, and then keeps these Pods alive with regular health checks.
Is it that only stateless pod(no --local-data) can be recovered normally?
Pods should be seen as ephemeral - e.g. they can disappear - but they are recovered by the controller that manages them - unless deployed as standalone Pods. So don't store local data within the Pod.
There are also StatefulSet Pods; those are meant for stateful workloads - but distributed stateful workloads, typically e.g. 3 Pods that use Raft to replicate data. The etcd database is an example of a distributed database that uses Raft.
The correct answer: it depends.
Imagine you've got a 3-node cluster, where you created a Deployment with 3 replicas, and 3-5 standalone pods.
Pods are created and scheduled to nodes.
Everything is up and running.
Let's assume that worker node node1 has got 1 deployment replica and 1 or more standalone pods.
The general sequence of the node restart process is as follows:
The node gets restarted, for ex. using sudo reboot
After the restart, the node starts all OS processes in the order specified by systemd dependencies
When dockerd is started it does nothing. At this point all previous containers are in the Exited state.
When kubelet is started it requests from the cluster apiserver the list of Pods whose node property equals its node name.
After getting the reply from the apiserver, kubelet starts containers for all Pods described in the reply using the Docker CRI.
When the pause container starts for each Pod from the list, it gets a new IP address configured by the CNI binary, deployed by the network addon DaemonSet's Pod.
After the kube-proxy Pod is started on the node, it updates iptables rules to implement the Kubernetes Services' desired configuration, taking into account the new Pods' IP addresses.
Now things become a bit more complicated.
Depending on the apiserver, kube-controller-manager and kubelet configuration, they react to the fact that the node is not responding with some delay.
If the node restarts fast enough, kube-controller-manager doesn't evict the Pods and they all remain scheduled on the same node, increasing their RESTARTS number after their new containers become Ready.
Example 1.
The cluster is created using kubeadm with the Flannel network addon on an Ubuntu 18.04 VM in GCP.
Kubernetes version is v1.18.8
Docker version is 19.03.12
After the node is restarted, all Pods' containers are started on the node with new IP addresses. Pods keep their names and location.
If the node is stopped for a long time, the pods on the node stay in the Running state, but connection attempts obviously time out.
If the node remains stopped, after approximately 5 minutes the pods scheduled on that node are evicted by kube-controller-manager and terminated. If I start the node before that eviction, all pods remain on the node.
In case of eviction, standalone Pods disappear forever, while Deployments and similar controllers create the necessary number of pods to replace the evicted ones, and kube-scheduler puts them on appropriate nodes. If a new Pod can't be scheduled on another node, for ex. due to lack of required volumes, it remains in the Pending state until the scheduling requirements are satisfied.
On a cluster created using an Ubuntu 18.04 Vagrant box and the Virtualbox hypervisor with a host-only adapter dedicated to Kubernetes networking, pods on a stopped node remain in the Running state, but with Readiness: false, even after two hours, and are never evicted. After starting the node 2 hours later, all containers restarted successfully.
This configuration's behavior is the same all the way from Kubernetes v1.7 to the latest v1.19.2.
Example 2.
The cluster is created in Google Cloud (GKE) with the default kubenet network addon:
Kubernetes version is 1.15.12-gke.20
Node OS is Container-Optimized OS (cos)
After the node is restarted (it takes around 15-20 seconds), all pods are started on the node with new IP addresses. Pods keep their names and location. (Same as example 1.)
If the node is stopped, after a short period of time (T1, around 30-60 seconds) all pods on the node change status to Terminating. A couple of minutes later they disappear from the Pods list. Pods managed by a Deployment are rescheduled on other nodes with new names and IP addresses.
If the node pool is created with Ubuntu nodes, the apiserver terminates the Pods later; T1 is around 2-3 minutes.
The examples show that the situation after a worker node gets restarted is different for different clusters, and it's better to run the experiment on your specific cluster to check if you get the expected results.
How to configure those timeouts:
How Can I Reduce Detecting the Node Failure Time on Kubernetes?
Kubernetes recreate pod if node becomes offline timeout
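In broad strokes, the knobs involved look like this (a sketch of commonly cited defaults; exact flags vary by version, and newer releases use taint-based eviction with tolerationSeconds instead of pod-eviction-timeout):

# kubelet: how often the node reports its status to the apiserver
kubelet --node-status-update-frequency=10s ...

# kube-controller-manager: how long before a silent node is marked NotReady,
# and how long after that before its pods are evicted
kube-controller-manager \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s ...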
When the node is restarted and there are pods scheduled on it that are managed by a Deployment or ReplicaSet, those controllers will take care of scheduling the desired number of replicas on another, healthy node. So if you have 2 replicas running on the restarted node, they will be terminated and scheduled on another node.
Before restarting a node you should use kubectl cordon to mark the node as unschedulable and give Kubernetes time to reschedule the pods.
Standalone pods (not managed by a controller) will not be rescheduled on any other node; they will be terminated.
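A typical pre-restart sequence could look like this (a sketch; node1 is a placeholder node name):

kubectl cordon node1                      # stop new pods from being scheduled here
kubectl drain node1 --ignore-daemonsets   # evict running pods; managed ones get rescheduled
sudo reboot                               # restart the node
kubectl uncordon node1                    # allow scheduling on the node again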

Single node kubernetes cluster scale up

I have a single-node Kubernetes cluster running on GKE. All the load is running on a single node, separated by namespaces.
Now I would like to implement auto-scaling. Is it possible to scale microservices onto a new node while keeping single-replica pods running on my main node only?
What I am thinking:
Main node: running everything with 1-pod availability (Redis, Elasticsearch)
Scaled-up node: scaled-up replicas of stateless microservices only
So is there any way I can implement this using the node autoscaler or using affinity?
The issue is that right now I am running graylog, elasticsearch, redis and rabbitmq on a single node as statefulsets backed by volumes; I would have to redeploy everything and edit the YAML files to add affinity to all of them.
I'm not sure that I understand your question correctly, but if I do, then you may try to use taints and tolerations together with node affinity. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. All the details are available in the documentation here.
Assuming the issue you have is that the persistent volumes bound to your StatefulSets are only accessible from one node, you can use the nodeAffinity field to constrain where the StatefulSet Pods can be scheduled. As mentioned in the documentation:
A PV can specify node affinity to define constraints that limit what nodes this volume can be accessed from. Pods that use a PV will only be scheduled to nodes that are selected by the node affinity.
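For illustration, a PV pinned to a single node looks roughly like this (a sketch; the PV name, size, path and node name are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  local:
    path: /mnt/disks/vol1
  nodeAffinity:                  # pods using this PV can only land on this node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - main-node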

3 tier architecture on kubernetes

I have 1 Kubernetes master server and 9 nodes. Of those, I want to run the backend on 2 nodes, the frontend on 2 nodes, and the DB on 3 nodes.
For the backend, frontend and DB I have Docker images ready.
How do I run an image using Kubernetes on only the desired nodes (2 or 3)?
Please share some ideas to achieve this.
The Kubernetes scheduler will, most of the time, do a good job of distributing the pods across the cluster. You may want to delegate that responsibility to the scheduler unless you have very specific requirements.
If you want to control this, you can use:
Node selectors
Node Affinity or Anti-Affinity
Directly specify the node name in the deployment spec
From these three, the recommended approach is to use node affinity or anti-affinity due to its flexibility.
Run the front end as a Deployment with the desired replica count and let Kubernetes manage it for you.
Run the backend as a Deployment with the desired number of replicas and Kubernetes will figure out how to run it. Use node selectors if you prefer specific nodes.
Run the DB as a Deployment or StatefulSet; Kubernetes will figure out how to run it.
https://kubernetes.io/docs/tutorials/stateful-application/mysql-wordpress-persistent-volume/
Use network policies to restrict traffic.
You may use labels and nodeSelector. Here it is:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
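As a quick illustration of that approach (a sketch; the node names and the tier label are placeholders), label the nodes intended for a tier:

kubectl label nodes node1 node2 tier=backend

and then reference the label in the backend Deployment's pod template:

spec:
  nodeSelector:
    tier: backend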

How services are distributed in docker swarm

Can I somehow configure how the master node distributes services in docker swarm? I thought that it would look at the free resources of the worker nodes and assign the service to the least-loaded node.
Currently I have the problem that a service is placed on one node which is full (90% RAM) and starts to get laggy, while at the same time the second node runs few services and could handle another one.
docker node ls
ID                          HOSTNAME        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
wdkklpy6065zxckxyuj000ei4 * docker-master   Ready    Drain          Leader           18.09.6
sk45rol2whdr5eh2jqozy0035   docker-node01   Ready    Active         Reachable        18.09.6
o4zwwbwwcrbwo4tsd00pxkfuc   docker-node02   Ready    Active                          18.09.6
Now I have 36 (very similar) services: 28 run on docker-node01 and 8 on docker-node02. I thought the ideal state would be 18 services on each node.
Both docker-nodes are the same.
How does docker swarm know where to run a service? What algorithm does it use?
Is it possible to change/update the algorithm for selecting a node?
According to the swarmkit project README, the only available strategy is spread, so it schedules tasks on the least loaded nodes.
Note that the swarm won't move tasks around to maintain this strategy, so if you added node02 after node01 was full, then node02 will remain mostly empty. You could drain both nodes and then activate them to see if it distributes the load better.
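The drain/activate cycle could look like this (a sketch; tasks on a drained node get rescheduled onto the remaining active nodes):

docker node update --availability drain docker-node01    # move its tasks elsewhere
docker node update --availability active docker-node01   # allow scheduling again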
You can find a more detailed description of the scheduling algorithm in the project documentation: scheduling-algorithm
For the older swarm manager this attribute was configurable:
https://docs.docker.com/swarm/reference/manage/#--strategy--scheduler-placement-strategy
Also I found https://docs.docker.com/swarm/scheduler/strategy/, which explains a lot about Docker swarm strategies.
