During an image upgrade, a few of the pods are stuck in the ContainerCreating state.
kubectl get events shows the error below: FailedSync kubelet,
10.102.10.34 Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod
"default"/"ob-service-1124355621-1th47". list of unattached/unmounted
volumes=[timezone default-token-3x1x9]
Docker logs:
ERRO[240242] Handler for DELETE /v1.22/containers/749d05b355e2b80bffb90d207232d37e3ebc5ff57942c46ce0a2b4ca5950ed0e returned error: Driver devicemapper failed to remove root filesystem 749d05b355e2b80bffb90d207232d37e3ebc5ff57942c46ce0a2b4ca5950ed0e: Device is Busy
ERRO[240242] Error saving dying container to disk: open /var/lib/docker/containers/5d01db2c31a3073cc7fb68f2be5acc45c34583d5f2ae0c0879ec064f90da6943/config.v2.json: no such file or directory
ERRO[240263] Error removing mounted layer 5d01db2c31a3073cc7fb68f2be5acc45c34583d5f2ae0c0879ec064f90da6943: Device is Busy
It's a bit hard to debug with only the information you provided, but the general direction you should be looking at is the resources of your cluster.
A failed sync usually means the pods can't fit onto any of the workers (adding more might help), or, as your error suggests, you are trying to attach volumes that are busy and can't accept the connection, which causes the pod to fail.
Again, details are lacking, but let's assume you're on AWS and a volume didn't detach cleanly; trying to attach it again would produce pretty much the result above, so you'll need to detach the volume before the new pod can use it.
If some pods running the same image are fine, it means you either don't have enough volumes or some of the existing volumes are not available to accept a new connection (perhaps they weren't detached properly when the old pods were deleted). A sketch of how to check this follows below.
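As a rough sketch of what that check could look like, assuming AWS EBS volumes and treating the instance and volume IDs below as placeholders:

# Look at the Events section to see which volume is stuck
kubectl describe pod ob-service-1124355621-1th47

# On AWS, list volumes still attached to the node that ran the old pod
# (instance ID is a placeholder)
aws ec2 describe-volumes --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].{Id:VolumeId,State:Attachments[0].State}'

# Detach the stale volume so the new pod can attach it (volume ID is a placeholder)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force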
Related
Pods are failing and being removed by Jenkins before I can check the logs, so I am unable to see them.
How can I check the logs of pods that have already been removed?
Is there any simple way to save the logs in Kubernetes?
I don't have any logging system set up for my Kubernetes cluster.
The pod keeps being created and deleted within a fraction of a second because of some error. I want to find out what that error is, but before I can check the logs the container name has already changed.
Thanks,
Most probably you meant "pods are failing and removed by Kubernetes and I am unable to see the logs." It is Kubernetes itself that manages API objects, not Jenkins.
Answering your question directly: you cannot fetch logs from any of your containers once the related Pod has been deleted. Deleting a pod means wiping all of its containers together with their data, so the logs were gone the moment your pod was terminated.
By default, if a container restarts, the kubelet keeps one terminated
container with its logs. If a pod is evicted from the node, all
corresponding containers are also evicted, along with their logs.
If your pod were still alive, you would be able to use the --previous flag to check the logs of the previous container instance, but unfortunately that's not your case.
There are a lot of similar questions, and the main suggestion is to set up a log aggregation system that stores the logs separately. That way you won't lose them and will at least be able to check them.
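As a quick stop-gap until you have log aggregation in place, a minimal sketch (pod and container names are placeholders):

# Logs of the previously crashed container instance -- only works while the pod object still exists
kubectl logs my-pod -c my-container --previous

# Stream logs into a file so they survive the pod's deletion
kubectl logs -f my-pod -c my-container > my-pod.log

# Watch events to catch the error that keeps recreating the pod
kubectl get events --watch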
Logging at the node level
Cluster-level logging architectures
How to see logs of terminated pods
How to access Logs of Pods in Kubernetes after its deletion
I have a Kubernetes cluster running Jenkins master in a single pod and each build running in a separate slave pod. When there are many builds running, there are many pods being spun up and down and often I will see an error in a job like this:
Cannot contact slave-jenkins-0g9p0: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#197b6a38:JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418": Remote call on JNLP4-connect connection from 10.10.3.90/10.10.3.90:54418 failed. The channel is closing down or has closed down
Could not connect to slave-jenkins-0g9p0 to send interrupt signal to process
The pod, for example slave-jenkins-0g9p0, just disappears; there is no trace that it ever existed. While watching output such as kubectl describe pod slave-jenkins-0g9p0, there is no error message; it simply stops existing.
I have a feeling that because there are multiple pods spinning up and down, Kubernetes attempts to balance the load on the nodes and reschedule the pod, but after killing it, it cannot spin the pod up on another node. I cannot be sure though. Maybe there is a way to tell K8s to tie a pod to a node until it exits on its own? I'm not really sure what or how to debug in this case.
Kubernetes version: v1.16.13-eks-2ba888 on AWS EKS
Jenkins version: 2.257
Kubernetes plugin version 1.27.2
Any advice would be appreciated
Thanks
UPDATE:
I have uploaded three slave pod manifest examples here where you can see the resources allocated. The above issue occurs in each of these running pods.
The node pool is controlled by the Kubernetes autoscaler (v1.14.6) and uses AWS t3a.large (2 CPU, 8 GB mem) instances.
UPDATE 2:
I believe that I have found the cause of the problem. I disabled the cluster-autoscaler (https://github.com/kubernetes/autoscaler) (v1.14.6) and the problem stopped.
So what seems to be happening is that the autoscaler is removing the node that the slave pod is running on. I know that taints can be used to tell the autoscaler not to remove a node, but is there a way to do this dynamically, so that it won't remove a node if a certain pod is running on it, without having to develop something new?
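One approach that may help, assuming your cluster-autoscaler version honours the safe-to-evict annotation: mark the Jenkins agent pods as not safe to evict, and the autoscaler will not scale down a node while such a pod is running on it. A sketch (in practice you would put the annotation into the Jenkins Kubernetes plugin pod template rather than annotating pods by hand):

# Tell the cluster-autoscaler not to remove the node this pod is running on
kubectl annotate pod slave-jenkins-0g9p0 \
  cluster-autoscaler.kubernetes.io/safe-to-evict=false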
I tried the simple PVC example from here with nginx claiming an azure-managed-disk and I am getting an 'unable to mount' error, see below. I also can't remove the created PV with 'kubectl delete pv pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988'.
$kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
10m Warning FailedMount Pod MountVolume.WaitForAttach failed for volume "pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988" : azureDisk - WaitForAttach failed within timeout node (aks-agentpool-10844952-2) diskId:(kubernetes-dynamic-pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988) lun:(1)
22s Warning FailedMount Pod Unable to mount volumes for pod "nginx_default(bd16b9c8-97b2-11e9-9018-eaa2ea1705c5)": timeout expired waiting for volumes to attach or mount for pod "default"/"nginx". list of unmounted volumes=[volume]. list of unattached volumes=[volume default-token-92rj6]
My managed AKS cluster is running v1.12.8, and the SP has the Contributor role (the Owner role doesn't help either). The storage class 'managed-premium' is used in the YAML from my simple nginx example (link provided).
For your issue, there are not enough details to judge the exact reason, so I'll just list the possible reasons here:
1. It's a transient error in the API call to Azure. If so, you just need to delete the resources and recreate them.
2. The node the pod runs on already has too many Azure disks attached. If so, you need to schedule the pod onto another node that does not have as many disks attached.
3. The Azure disk cannot be unmounted or detached from the old node. This means the PV is in use and attached to another node. If so, you need to create another dynamic PV for your pod that is not in use.
You can check carefully against these reasons. In my opinion, the third reason is the most likely one. Of course, it all depends on the actual situation. For more details about similar errors, see How to Understand & Resolve “Warning Failed Attach Volume” and “Warning Failed Mount” Errors in Kubernetes on Azure.
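As a hedged sketch for narrowing this down (the PV name comes from your events; the resource group is a placeholder, and the az commands assume you have the Azure CLI configured):

# Which node was the pod scheduled on, and what do its events say?
kubectl describe pod nginx

# Inspect the PV; spec.azureDisk and claimRef show the backing disk and the claim
kubectl describe pv pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988

# On the Azure side, check whether the managed disk is still attached to a VM
az disk show --name kubernetes-dynamic-pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988 \
  --resource-group <node-resource-group> --query managedBy

# If 'kubectl delete pv' hangs on a finalizer, it can be cleared -- but only
# once the disk is really detached
kubectl patch pv pvc-3f3c3c78-9779-11e9-a7eb-1aafd0e2f988 -p '{"metadata":{"finalizers":null}}'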
On one node of my k8s cluster, GC is trying to remove images that are in use by a container.
This behaviour seems strange to me.
Here the logs:
kubelet: I1218 12:44:19.925831 11177 image_gc_manager.go:334] [imageGCManager]: Removing image "sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2" to free 746888 bytes
kubelet: E1218 12:44:19.928742 11177 remote_image.go:130] RemoveImage "sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2" from image service failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 99e59f495ffa (cannot be forced) - image is being used by running container 6f236a385a8e
kubelet: E1218 12:44:19.928793 11177 kuberuntime_image.go:126] Remove image "sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2" failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 99e59f495ffa (cannot be forced) - image is being used by running container 6f236a385a8e
kubelet: W1218 12:44:19.928821 11177 eviction_manager.go:435] eviction manager: unexpected error when attempting to reduce nodefs pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 99e59f495ffa (cannot be forced) - image is being used by running container 6f236a385a8e
Any suggestions?
Could manually removing Docker images and stopped containers on a node cause such a problem?
Thank you in advance.
What you've encountered is not the regular Kubernetes garbage collection that deletes orphaned API resource objects, but the kubelet's image garbage collection.
Whenever a node experiences disk pressure, the kubelet daemon will desperately try to reclaim disk space by deleting (supposedly) unused images. Reading the source code shows that the kubelet sorts the images to remove by how long ago they were last used for creating a Pod; if all images are in use, the kubelet will try to delete them anyway and fail (which is probably what happened to you).
You can use the Kubelet's --minimum-image-ttl-duration flag to specify a minimum age that an image needs to have before the Kubelet will ever try to remove it (although this will not prevent the Kubelet from trying to remove used images altogether). Alternatively, see if you can provision your nodes with more disk space for images (or build smaller images).
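To confirm that disk pressure is what triggers the image GC and to see how close the node is to the thresholds, something along these lines may help (the node name is a placeholder; the flags in the comments are the standard kubelet image GC settings with their defaults):

# Is the node reporting DiskPressure?
kubectl describe node my-node | grep -A8 'Conditions:'

# On the node itself, how full is the image/container filesystem?
df -h /var/lib/docker

# Relevant kubelet flags:
#   --image-gc-high-threshold=85      start image GC above 85% disk usage
#   --image-gc-low-threshold=80       free space until usage drops below 80%
#   --minimum-image-ttl-duration=2m   never remove images younger than this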
As I understand it, Kubernetes has a garbage collector whose purpose is to remove objects that are no longer needed and free up resources.
If an object does not belong to any owner, it is orphaned. There is a pattern in Kubernetes known as ownership.
For instance, if you apply a Deployment object, it will create a ReplicaSet object, and the ReplicaSet will in turn create Pod objects.
So the ownership flow is:
Deployment <== ReplicaSet <== Pod
Now if you delete the Deployment object, the ReplicaSet no longer has an owner, so the garbage collector will remove it; the Pods then have no owner either, so the GC will remove them as well.
There is a field called ownerReferences which describes the relationship between these objects (Deployment, ReplicaSet, Pod, etc.).
There are 3 ways to delete objects in Kubernetes (see the example after this list):
Foreground: if you delete the Deployment, the Pods are deleted first, then the ReplicaSet, and finally the Deployment itself is removed.
Background: if you delete the Deployment, the Deployment is deleted first and the GC then removes the ReplicaSets and Pods.
Orphan: if you delete the Deployment, the ReplicaSet (and its Pods) are left orphaned; the GC does not remove them and they keep running without an owner.
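As a rough illustration of the three policies (assuming a Deployment called my-deploy in the default namespace; the propagation policy goes into the DeleteOptions sent to the API server):

# Start a local proxy to the API server
kubectl proxy --port=8001 &

# Foreground: dependents (ReplicaSet, Pods) are removed before the Deployment itself
curl -X DELETE localhost:8001/apis/apps/v1/namespaces/default/deployments/my-deploy \
  -H 'Content-Type: application/json' \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}'

# Orphan: the Deployment is deleted, but its ReplicaSet and Pods keep running without an owner
curl -X DELETE localhost:8001/apis/apps/v1/namespaces/default/deployments/my-deploy \
  -H 'Content-Type: application/json' \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}'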
Solutions to your issues
It seems to me that your pod (and its containers) is orphaned, and therefore the GC is making sure that it is removed from the cluster.
If you want to check the ownerReferences status:
kubectl get pod $PODNAME -o yaml
In the metadata section you will find the relevant information.
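If you only want that one field, a jsonpath query works as well:

kubectl get pod $PODNAME -o jsonpath='{.metadata.ownerReferences}'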
I have attached references for further research.
garbage-collection
garbage-collection-k8s
I have a Kubernetes Pod created by a StatefulSet (not sure if that matters). There are two containers in this pod. When one of the two containers fails and I use the get pods command, 1/2 containers are Ready and the status is "Error". The second container never attempts a restart, and I am unable to destroy the pod except by using the --grace-period=0 --force flags. A regular delete leaves the pod hanging in a "Terminating" state either forever or for a very, very long time. What could be causing this behavior, and how can I go about debugging it?
I encounter a similar problem on nodes in my k8s 1.6 cluster, especially when a node has been running for a couple of weeks. It can happen to any node. When this happens, I restart the kubelet on the node and the errors go away.
It's not the best thing to do, but it always solves the problem. It's also not detrimental to the cluster if you restart kubelet because the running pods continue to stay up.
kubectl get po -o wide will likely reveal to you that the errant pods are running on one node. SSH to that node and restart kubelet.
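A minimal sketch of that workaround (node and pod names are placeholders; the systemctl command assumes a systemd-managed kubelet):

# Find the node the stuck pod is running on
kubectl get po -o wide

# On that node, restart the kubelet; running pods stay up
ssh <node>
sudo systemctl restart kubelet

# If the pod is still stuck in Terminating afterwards, force-delete it
kubectl delete pod <pod-name> --grace-period=0 --force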