Azure AKS: Attempting to reclaim ephemeral-storage

All my pods got into an Evicted state, and when I look at the describe output of the five nodes, all of them say: Attempting to reclaim ephemeral-storage
How can I solve this?
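No answer was posted for this one, but as a general starting point (a sketch of the usual diagnosis, not an AKS-specific fix; the resource values are placeholders): the kubelet evicts pods when a node runs low on ephemeral storage (container logs, writable layers, emptyDir volumes). Check which nodes are under disk pressure and which pods were evicted, then declare ephemeral-storage requests and limits so the scheduler accounts for that usage:

kubectl describe node <node-name> | grep -i diskpressure
kubectl get pods --all-namespaces --field-selector=status.phase=Failed

resources:
  requests:
    ephemeral-storage: 1Gi
  limits:
    ephemeral-storage: 2Gi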

Related

How can I get an insufficient cpu error inside a GKE cluster with autopilot mode?

I created a cluster with autopilot mode. When I try to install an app inside this cluster using helm, workloads fail with this error: Does not have minimum availability. If I click on this error, I get Cannot schedule pods: Insufficient cpu and Cannot schedule pods: Insufficient memory.
If I do kubectl describe node <name> I find 0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu.
Isn't GKE autopilot mode supposed to allocate sufficient memory and cpu?
I found where my mistake was. It had nothing to do with CPU or memory. It was a mistake inside my YAML file (a wrong host for the database).

AKS Horizontal Pod Autoscaling - missing request for cpu

I have been trying to set up a Kubernetes 1.13 AKS deployment to use HPA, and I keep running into a problem:
NAME          REFERENCE                TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
dev-hpa-poc   Deployment/dev-hpa-poc   <unknown>/50%   1         4         2          65m
Describing the HPA gives me these events:
Events:
  Type     Reason                        Age                   From                       Message
  ----     ------                        ----                  ----                       -------
  Warning  FailedComputeMetricsReplicas  5m4s (x200 over 55m)  horizontal-pod-autoscaler  failed to get cpu utilization: missing request for cpu
  Warning  FailedGetResourceMetric       3s (x220 over 55m)    horizontal-pod-autoscaler  missing request for cpu
It doesn't appear to be able to actually retrieve CPU usage. I have specified cpu and memory usage in the deployment YAML:
resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: 800m
    memory: 1024Mi
The system:metrics-server is running and healthy, too, so that's not it. I can monitor pod health and CPU usage from the Azure portal. Any ideas as to what I'm missing? Could this potentially be a permissions issue?
For missing request for [x], make sure that all the containers in the pod have requests declared.
In my case the reason was that another deployment didn't have resource requests declared. You should add resources for each pod and deployment in the namespace.
Adding to nakamume's answer, make sure to double-check sidecar containers.
For me, I forgot to declare requests for the GCP cloud-sql-proxy sidecar, which had me pulling my hair out for a couple of hours.
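To illustrate the sidecar point (a minimal sketch, not the poster's actual manifest: the app image and the cloud-sql-proxy image tag are placeholders, and the proxy's command-line flags are omitted): the HPA computes CPU utilization against the sum of requests across all containers in the pod, so every container, sidecars included, needs a cpu request.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-hpa-poc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dev-hpa-poc
  template:
    metadata:
      labels:
        app: dev-hpa-poc
    spec:
      containers:
      - name: app                        # main container already had requests declared
        image: myregistry/dev-hpa-poc:latest
        resources:
          requests:
            cpu: 250m
            memory: 128Mi
          limits:
            cpu: 800m
            memory: 1024Mi
      - name: cloud-sql-proxy            # without this requests block, the HPA reports "missing request for cpu"
        image: gcr.io/cloudsql-docker/gce-proxy:1.33.2
        resources:
          requests:
            cpu: 100m
            memory: 64Mi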

Google Kubernetes logs

Memory cgroup out of memory: Kill process 545486 (python3) score 2016 or sacrifice child Killed process 545486 (python3) total-vm:579096kB, anon-rss:518892kB, file-rss:16952kB
These are the node logs, and my container is continuously restarting randomly. I'm running a Python container with 4 replicas.
The Python application contains a socket with Flask, and the Docker image is based on python3.5:slim.
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-XXXXXXX-cluster-highmem-pool-gen2-f2743e02-msv2 682m 17% 11959Mi 89%
This morning the node log showed: 0/1 nodes are available: 1 Insufficient cpu.
But node CPU usage is only 17%, and there is not much running inside the pod.
Have a look at the best practices and try to adjust resource requests and limits for CPU and memory. If your app starts hitting its CPU limits, Kubernetes starts throttling your container. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated (and restarted). So, using suitable limits should help you solve your problem with container restarts.
The Insufficient cpu error is about requests, not actual usage: the scheduler sums the CPU requests of every pod on the node, and if your pod's request does not fit into the node's remaining allocatable CPU, the pod won't be scheduled, even when real utilization is only 17%.
After adjusting the limits, you could use a monitoring system (like Stackdriver) to find the cause of a potential memory leak.
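To see why the scheduler considers a node full even at low utilization (the node name below is the one from the kubectl top output above), compare the node's allocatable CPU with the requests already allocated on it:

kubectl describe node gke-XXXXXXX-cluster-highmem-pool-gen2-f2743e02-msv2 | grep -A 8 "Allocated resources"

If the CPU requests line is close to 100%, the node is full from the scheduler's point of view, regardless of what kubectl top reports.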

Kubernetes POD Failover

I am toying around with Kubernetes and have managed to deploy a stateful application (a Jenkins instance) to a single node.
It uses a PVC to make sure that I can persist my jenkins data (jobs, plugins etc).
Now I would like to experiment with failover.
My cluster has 2 digital ocean droplets.
Currently my jenkins pod is running on just one node.
When that goes down, Jenkins becomes unavailable.
I am now looking at how to accomplish failover, in the sense that, when the Jenkins pod goes down on my node, it will spin up on the other node (a short downtime during this process is OK).
Of course it has to use the same PVC, so that my data remains intact.
I believe, from what I have read, that a StatefulSet can be used for this?
Any pointers are much appreciated!
Best regards
Digital Ocean's Kubernetes service only supports ReadWriteOnce access modes for PVCs (see here). This means the volume can only be attached to one node at a time.
I came across this blogpost which, while focused on Jenkins on Azure, has the same situation of only supporting ReadWriteOnce. The author states:
the drawback for me though lies in the fact that the access mode for Azure Disk persistent volumes is ReadWriteOnce. This means that an Azure disk can be attached to only one cluster node at a time. In the event of a node failure or update, it could take anywhere between 1-5 minutes for the Azure disk to get detached and attached to the next available node.
Note, Pod failure and node failures are different things. Since DO only supports ReadWriteOnce, there's no benefit to trying anything more sophisticated than what you have right now in terms of tolerance to node failure. Since it's ReadWriteOnce the volume will need to be unmounted from the failing node and re-mounted to the new node, and then a new Pod will get scheduled on the new node. Kubernetes will do this for you, and there's not much you can do to optimize it.
For Pod failure, you could use a Deployment: since you want to read and write the same data, you don't want different PVs attached to the different replicas. There may be very limited benefit to this, as you will have multiple replicas of the Pod all running on the same node, so it depends on how the Jenkins process scales and whether it can support that kind of horizontal scale-out model while all replicas write to the same volume (as opposed to simply scaling memory or CPU requests vertically).
If you really want to achieve higher availability in the face of node and/or Pod failures, and the Jenkins workload you're deploying has a hard requirement on local volumes for persistent state, you will need to consider an alternative volume plugin like NFS, or moving to a different cloud provider like GKE.
Yes, you would use a Deployment or StatefulSet depending on the use case. For Jenkins, a StatefulSet would be appropriate. If the running pod becomes unavailable, the StatefulSet controller will see that and spawn a new one.
What you are describing is the default behaviour of Kubernetes for Pods that are managed by a controller, such as a Deployment.
You should deploy any application as a Deployment (or another controller) even if it consists just of a single Pod. You never really deploy Pods directly to Kubernetes. So, in this case, there's nothing special you need to do to get this behaviour.
When one of your nodes dies, the Pod dies too. This is detected by the Deployment controller, which creates a new Pod. This is in turn detected by the scheduler, which assigns the new Pod to a node. Since one of the nodes is down, it will assign the Pod to the other node that is still running. Once the Pod is assigned to this node, the kubelet of this node will run the container(s) of this Pod on this node.
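To make this concrete (a minimal sketch, assuming a PVC named jenkins-home already exists; the image tag is illustrative): a single-replica Deployment with strategy: Recreate kills the old pod before starting the new one, so the ReadWriteOnce volume is free to be re-attached instead of hitting a Multi-Attach error during updates.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins
spec:
  replicas: 1
  strategy:
    type: Recreate                  # stop the old pod first, so the RWO volume can move
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins
    spec:
      containers:
      - name: jenkins
        image: jenkins/jenkins:lts
        volumeMounts:
        - name: jenkins-home
          mountPath: /var/jenkins_home
      volumes:
      - name: jenkins-home
        persistentVolumeClaim:
          claimName: jenkins-home   # the existing PVC holding jobs, plugins, etc.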
Ok, let me try to answer my own question here.
I think Amit Kumar Gupta came the closest to what I believe is going on here.
Since I am using a Deployment and my PVC is ReadWriteOnce, I am basically stuck with one pod, running Jenkins, on one node.
weibeld's answer made me realise that I was asking about a concept that Kubernetes performs by default.
If my pod goes down (in my case I am shutting down a node on purpose with a hard power-off to simulate a failure), the cluster (controller?) will detect this and spawn a new pod on another node.
All was fine so far, but then I noticed that my new pod was stuck in the ContainerCreating state.
Running a describe on my new pod (the one in ContainerCreating state) showed this
Warning FailedAttachVolume 16m attachdetach-controller Multi-Attach error for volume "pvc-cb772fdb-492b-4ef5-a63e-4e483b8798fd" Volume is already used by pod(s) jenkins-deployment-6ddd796846-dgpnm
Warning FailedMount 70s (x7 over 14m) kubelet, cc-pool-bg6u Unable to mount volumes for pod "jenkins-deployment-6ddd796846-wjbkl_default(93747d74-b208-421c-afa4-8d467e717649)": timeout expired waiting for volumes to attach or mount for pod "default"/"jenkins-deployment-6ddd796846-wjbkl". list of unmounted volumes=[jenkins-home]. list of unattached volumes=[jenkins-home default-token-wd6p7]
Then it started to hit me: this makes sense.
It's a pity, but it makes sense.
Since I did a hard power-off on the node, the volume was never cleanly detached from it.
So now the controller tries to start a new pod on a new node, but it can't attach the PV, since it is still marked as attached to the node that became unreachable.
As I read more on this, I read that DigitalOcean only supports ReadWriteOnce, which now leaves me wondering: how the hell can I achieve simple failover for a stateful application on a Kubernetes cluster on DigitalOcean that consists of just a couple of simple droplets?
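A manual recovery that is sometimes used in exactly this hard-power-off situation (a hedged sketch, not DigitalOcean-specific advice; the pod name is the one from the Multi-Attach error above): force-delete the pod stuck on the dead node so Kubernetes no longer considers the volume in use, and if the attachment record itself is stale, delete it so the attach/detach controller can attach the volume to the new node.

kubectl delete pod jenkins-deployment-6ddd796846-dgpnm --force --grace-period=0
kubectl get volumeattachments                       # find the entry still pointing at the dead node
kubectl delete volumeattachment <stale-attachment-name>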

GKE Container is taking more memory than allocated to it

I have a GKE cluster with multiple services/containers running on it. My containers have requests and limits of 512Mi each, but I can see usage exceeding 700Mi (in Stackdriver and in Workloads). This has been happening for a while. Can anyone suggest what the possible issue could be and how we can fix it?
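No answer was posted here, but one thing worth ruling out first (an assumption about what the metric shows, not a confirmed diagnosis): some monitoring views report total memory usage including reclaimable page cache, which can sit above the limit without triggering an OOM kill, since the kernel reclaims cache before the cgroup limit is enforced. Comparing the Stackdriver number with the working set reported by the metrics server can confirm this:

kubectl top pod <pod-name> --containers   # shows working-set memory per container

If the working set stays under 512Mi while Stackdriver shows 700Mi, the difference is likely cache rather than a leak.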
