How to detect an exception that occurred in a Pod in Kubernetes? - docker

I have a multi-node Kubernetes cluster. Multiple services are deployed as Pods. They communicate with each other via RabbitMQ, which also runs as a Pod in the cluster.
Problem Scenario:
Services often fail to connect to the required queue in RabbitMQ. Logs for these failures show up in the RabbitMQ pod as well as in the service pods. The problem is caused primarily by connectivity issues and is intermittent. When it happens, functionality breaks, yet because it is NOT a crash, the pod stays in the Running state in Kubernetes. To fix it we have to go and restart the pod manually.
I want to create a liveness probe for every pod. But how should it work to catch the exception? Since many processes in a service may be using the connection, any one of them can fail.

I'd suggest implementing an HTTP endpoint for the liveness probe that checks the state of the connection to RabbitMQ, or alternatively failing hard and exiting the whole process when the RabbitMQ connection does not work (a probe sketch follows this answer).
But... the best solution would be to retry the connection indefinitely when it fails, so that a temporary networking issue is recovered from transparently. A well-written service should wait for its dependencies to become operational instead of cascading the failure up the stack.
Imagine you have a liveness check like the one you describe on 20 services that use that RabbitMQ or some other shared service. That service goes down for a while, and what you end up with is a cluster with 20+ services in CrashLoopBackOff state due to the incremental backoff on failure. That means your cluster will take some time to recover once the originally failing service is back, and the picture will be messy enough to make it harder to understand what happened at first glance.
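As a rough illustration only (the /healthz path, port and timings below are assumptions, not something from the question), such a probe could be wired into the pod spec like this:

# Hypothetical pod-spec fragment. The /healthz endpoint is assumed to be
# implemented by the service and to return a non-200 status when its
# RabbitMQ connection is broken.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed service port
  initialDelaySeconds: 30   # give the service time to establish the connection
  periodSeconds: 15
  failureThreshold: 3       # restart only after repeated failures

With a failureThreshold greater than 1, a short network blip does not immediately restart the pod, which fits the retry-first advice above.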

Related

How to check failed container logs in Kubernetes

Before I can check the logs, the pods fail and are removed by Jenkins, so I am unable to see the logs.
How can I check the logs of pods that have been removed?
Is there any simple way to save the logs in Kubernetes?
I don't have any logging system for my Kubernetes cluster.
Within a fraction of a second the pod keeps being created and deleted because of some error, and I want to find out what that error is. Before I can check the logs, the container name has already changed.
Thanks,
Most probably you meant "pods are failing and removed by Kubernetes and I am unable to see the logs." It is Kubernetes itself that manages API objects, not Jenkins.
Answering your question directly: you are not able to fetch logs from any of your containers once the related pod has been deleted. Deleting a pod means wiping all of its containers, data included, so the logs are gone the moment your pod is terminated.
By default, if a container restarts, the kubelet keeps one terminated container with its logs. If a pod is evicted from the node, all corresponding containers are also evicted, along with their logs.
If your pod were still alive, you would be able to use the --previous flag (kubectl logs <pod> --previous) to check the logs of the last terminated container, but unfortunately that's not your case.
There are a lot of similar questions, and the main suggestion is always the same: set up a log aggregation system that stores logs separately. That way you won't lose them and will at least be able to check them afterwards (a sketch of a node-level collector follows the links below).
Logging at the node level
Cluster-level logging architectures
How to see logs of terminated pods
How to access Logs of Pods in Kubernetes after its deletion
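For reference, the node-level collector pattern usually boils down to a DaemonSet that mounts the node's log directory and ships it to some backend. The sketch below is only an outline; the image tag and the backend wiring are assumptions you would need to adapt:

# Minimal sketch of a node-level log collector. It only shows the shape of
# the DaemonSet; a real setup also needs collector configuration and an
# aggregation backend (Elasticsearch, Loki, etc.).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: log-collector
  template:
    metadata:
      labels:
        name: log-collector
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch  # assumed image tag
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log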

How to kill a multi-container pod if one container fails?

I'm using the Jenkins Kubernetes Plugin, which starts pods in a Kubernetes cluster that serve as Jenkins agents. The pods contain 3 containers in order to provide the slave logic, a Docker socket, as well as the gcloud command line tool.
The usual workflow is that the slave does its job and notifies the master that it has completed, and the master then terminates the pod. However, if the slave container crashes due to a lost network connection, the container terminates with exit code 255 while the other two containers keep running, and so does the pod. This is a problem because the pods have large CPU requests: the setup is only cheap if the slaves run just as long as they have to, and having multiple machines running for 24h or over the weekend is noticeable financial damage.
I'm aware that running multiple containers in the same pod is not fine Kubernetes art, but it is OK if I know what I'm doing, and I assume I do. I'm sure it's hard to solve this differently given the way the Jenkins Kubernetes Plugin works.
Can I make the pod terminate if one container fails, without it being respawned? A solution with a timeout is acceptable as well, although less preferred.
Disclaimer: I have rather limited knowledge of Kubernetes, but given the question:
Maybe you can run a fourth container that exposes one simple "liveness" endpoint.
It could run ps -ef or use any other way to contact the 3 existing containers, just to make sure they're alive.
This endpoint would return "OK" only if all the containers are running, and "ERROR" if at least one of them is detected as crashed.
Then you could set up a Kubernetes liveness probe so that it stops the pod when that fourth container reports an error (a sketch follows this answer).
Of course, if this fourth process crashes by itself for any reason (it shouldn't, unless there is a bug or something), the liveness probe won't respond and Kubernetes is supposed to stop the pod anyway, which is probably what you really want to achieve.
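A rough sketch of what the added container could look like in the pod template; the image, port and /healthz path are hypothetical, and the jnlp/docker/gcloud containers are left out because they come from the existing plugin configuration:

# Hypothetical "watchdog" sidecar: it is assumed to return an error on
# /healthz once any sibling container has exited.
containers:
- name: watchdog
  image: example/pod-watchdog:latest   # hypothetical image
  livenessProbe:
    httpGet:
      path: /healthz                   # assumed endpoint
      port: 9000
    periodSeconds: 10
    failureThreshold: 1                # fail fast once a sibling is gone

Note that, strictly speaking, a failed liveness probe restarts only the probed container, so you may also need to look at the pod's restartPolicy (or the plugin's pod retention settings) for the whole pod to actually go away.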

Self healing in Kubernetes - Can we regenerate the pod completely?

I am new to Kubernetes.
I have seen pods automatically restart in case of failure.
When a node failure happens, a new pod is regenerated on another node.
In both cases,
what happens when a pod fails in the middle of a process (say, an HTTP session)? Can we provide the same session to the already logged-in user?
Please forgive if the question is irrelevant.
Yes, you can use health checks like readiness and liveness probes for your pod. No traffic will be routed to the pod until the readiness check passes, and the pod will be restarted if the liveness check fails. These checks can be added to your pod-spec (a minimal example follows this answer).
Session management is not handled by k8s; it must be done by the application itself.
Anyhow, if you want to persist some data you can use a PV and PVC and bind the volume to your pod.
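A minimal pod-spec fragment with both checks might look like the sketch below; the image, paths and port are placeholders, not anything from the question:

# Sketch only: the /ready and /healthz endpoints are assumed to be exposed
# by the application container.
containers:
- name: app
  image: example/app:1.0      # hypothetical image
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5          # traffic is withheld until this passes
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 10         # the container is restarted if this keeps failing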
Yes, the normal way to create pods is through one of the higher-level controllers like Deployments or StatefulSets. These automatically detect when the right number of pods is not running and start replacements. As for showing the user the same login session, that's not usually tied to the running pod: the login session on a website is usually stored in a cookie of some kind and references state in the database, not the web server.
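For completeness, a bare-bones Deployment along these lines (names and image are hypothetical) is what keeps the declared number of replicas running and replaces pods that are lost with their node:

# Minimal Deployment sketch: the controller recreates pods until the
# observed replica count matches spec.replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: example/webapp:1.0   # hypothetical image
        ports:
        - containerPort: 8080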

Is there a best practice to reboot a cluster

I followed Alex Ellis' excellent tutorial that uses kubeadm to spin-up a K8s cluster on Raspberry Pis. It's unclear to me what the best practice is when I wish to power-cycle the Pis.
I suspect sudo systemctl reboot is going to result in problems. I'd prefer not to delete and recreate the cluster each time starting with kubeadm reset.
Is there a way that I can shutdown and restart the machines without deleting the cluster?
Thanks!
This question is quite old but I imagine others may eventually stumble upon it so I thought I would provide a quick answer because there is, in fact, a best practice around this operation.
The first thing you're going to want to ensure is that you have a highly available cluster. This consists of at least 3 masters and 3 worker nodes. Why 3? So that at any given time a majority of them is still available to form a quorum.
Now that you have an HA Kubernetes cluster, you're going to have to go through every single one of your application manifests and ensure that you have specified resource requests and limits. This is so that a pod will never be scheduled onto a node without the required resources. Furthermore, in the event that a pod has a bug that causes it to consume a highly abnormal amount of resources, the limit will prevent it from taking down your cluster.
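As an illustration (the numbers are placeholders, not recommendations), the relevant container fragment looks like this: requests drive scheduling decisions, limits cap what a misbehaving pod can consume.

# Illustrative resource settings for a single container.
containers:
- name: app
  image: example/app:1.0   # hypothetical image
  resources:
    requests:              # used by the scheduler to pick a node with capacity
      cpu: "250m"
      memory: "256Mi"
    limits:                # hard cap enforced at runtime
      cpu: "500m"
      memory: "512Mi"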
Now that that is out of the way, you can begin the process of rebooting the cluster. The first thing you're going to do is reboot your masters. So run kubectl drain $MASTER against one of your (at least) three masters. The node will be cordoned so no new pods are scheduled on it, and any pods already scheduled there will be evicted and their workloads recreated on your other nodes.
Use kubectl describe node $MASTER to monitor the node until all pods have been removed. Now you can safely connect to it and reboot it. Once it has come back up, run kubectl uncordon $MASTER and the API Server will once again begin scheduling pods to it. Keep using kubectl describe node $MASTER until you have confirmed that all pods are READY.
Repeat this process for all of the masters. After the masters have been rebooted, you can safely repeat it for all three (or more) worker nodes. If you perform this operation properly, all of your applications will maintain 100% availability, provided they use multiple pods per service and have a proper deployment strategy configured.

Openshift PaaS/Kubernetes Docker Container Monitoring and Orchestration

Kubernetes deployments and replication controllers provide self-healing by ensuring that a minimum number of replicas is present.
The auto-scaling feature also allows increasing the number of replicas once a specific CPU threshold is crossed.
Are there tools available that would provide more flexibility in the auto-healing and auto-scaling features?
Example :
Auto-adjust number of replicas during peak hours or days.
When a pod dies due to external issues, prevent the system from re-creating the container and wait for a condition to succeed, e.g. a ping or telnet test.
You can block pod startup by waiting for external services in an entrypoint script or init container. That's the closest that exists today to waiting for external conditions.
There is no time-based autoscaler today, although it would be possible to script one fairly easily to run on a schedule.
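For the second example above (waiting for an external condition before the container comes up again), an init container can gate pod startup; the dependency name and port below are placeholders:

# Hypothetical pod-spec fragment: the main containers will not start until
# this init container exits successfully, i.e. until the dependency answers
# on its port.
initContainers:
- name: wait-for-dependency
  image: busybox:1.36
  command:
  - sh
  - -c
  - |
    until nc -z my-dependency 5672; do   # placeholder host and port
      echo "waiting for my-dependency:5672"
      sleep 5
    done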
In OpenShift, you can easily scale your app by running this command from a cron job (a CronJob sketch follows below).
Scale command
oc scale dc app --replicas=5
And of course, you can scale it back down by changing the number of replicas.
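If you prefer to keep the schedule inside the cluster instead of an external crontab, a CronJob can run the same command. This is only a sketch: it assumes an image that ships the oc client and a service account allowed to scale the deployment config.

# Hypothetical CronJob that scales the "app" deployment config up on
# weekday mornings; a second CronJob with a different schedule and
# --replicas value would scale it back down.
apiVersion: batch/v1                      # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: scale-up-app
spec:
  schedule: "0 8 * * 1-5"                 # before peak hours on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler      # hypothetical SA with edit rights
          restartPolicy: OnFailure
          containers:
          - name: scale
            image: quay.io/openshift/origin-cli:latest   # assumed image
            command: ["oc", "scale", "dc", "app", "--replicas=5"]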
Autoscale
This is what OpenShift for Developers says about autoscaling:
OpenShift also supports automatic scaling, defining upper and lower thresholds for CPU usage by pod.
If the upper threshold is consistently exceeded by the running pods for your application, a new instance of your application will be started. When CPU usage drops back below the lower threshold, because your application is no longer working as hard, the number of instances will be scaled back again.
I think Kubernetes has now released version 1.3, which supports autoscaling, but it is not yet integrated into OpenShift.
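The Kubernetes side of that is the HorizontalPodAutoscaler. A CPU-based sketch looks like the following; the API version, target name, replica bounds and threshold are placeholders to adapt to your cluster:

# Sketch of a CPU-based autoscaler for a hypothetical "app" workload.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                    # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # placeholder upper threshold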
Health Check
When it comes to health checks, OpenShift has:
Readiness check: runs the test you configure before the router starts to send traffic to the pod.
Liveness probe: run periodically once traffic has been switched to an instance of your application, to ensure it is still behaving correctly. If the liveness probe fails, OpenShift will automatically shut down that instance of your application and replace it with a new one.
You can perform these kinds of tests (HTTP check, container execution check and TCP socket check); a sketch of each handler type follows below.
So with these tools I guess you can create readiness and liveness checks to ensure that your pod is running properly; if not, a new deployment will be triggered until the readiness status comes back OK.
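For reference, the three handler types map onto the probe spec like this (paths, command and ports are placeholders; you pick one handler per probe):

# Illustrative probe handlers for the three check types mentioned above.
readinessProbe:
  httpGet:                 # HTTP check
    path: /ready
    port: 8080
livenessProbe:
  exec:                    # container execution check
    command: ["cat", "/tmp/healthy"]
# TCP socket check (alternative handler for either probe):
#   tcpSocket:
#     port: 5672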
