Any way to prevent k8s pod eviction? - docker

I have a set of daemons that I need to run. They generally do not consume much memory or CPU, and I have set their limits to cpu: 150m and memory: 150m.
Occasionally they spike well above this, which seems to be causing evictions and an unstable node.
It is critical that the daemons remain running 24/7, even if they are throttled on CPU and/or memory when they spike. Is it possible to prevent their eviction and still cap their resources?
As I understand it, CPU usage is throttled but excess memory use results in an OOM eviction. Is there any way to prevent this eviction?

As of 1.11, you can set pod priorities.
Create a priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
Set the priority in the pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
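Priority influences scheduling and the order in which the kubelet evicts pods under node pressure, but it does not stop a container that exceeds its own memory limit from being OOM-killed. A minimal sketch that combines the priority class with equal requests and limits (which puts the pod in the Guaranteed QoS class) could look like the following. The DaemonSet name, labels, image, and sizes are assumptions. Note that a memory quantity written as `150m` means 0.15 bytes (the `m` suffix is "milli"), so `150Mi` is probably what was intended in the question.

```yaml
# Sketch only: names, image, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemon
spec:
  selector:
    matchLabels:
      app: my-daemon
  template:
    metadata:
      labels:
        app: my-daemon
    spec:
      priorityClassName: high-priority   # the PriorityClass from the answer
      containers:
      - name: my-daemon
        image: example/my-daemon:1.0     # hypothetical image
        resources:
          # requests == limits on every container => Guaranteed QoS class
          requests:
            cpu: 150m
            memory: 150Mi
          limits:
            cpu: 150m
            memory: 150Mi
```

CPU above the limit is throttled, while memory above the limit gets the container killed and restarted, so the memory limit still has to be large enough to cover the spikes.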

It sounds like you need to track resource consumption trends with something like Prometheus + Grafana to see what sort of spikes to expect from your DaemonSets.
Then you can either allocate more resources to these pods or remove the limits (which, by default, leaves the pods unbounded). Of course, you don't want to risk crashing the whole node, so you can also consider tweaking your eviction thresholds:
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds
More details:
https://kubernetes-v1-4.github.io/docs/admin/limitrange/
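As a rough illustration of what tweaking the eviction thresholds means, here is a sketch of a kubelet configuration file that reserves resources for system daemons and sets hard eviction thresholds. All values are assumptions, not recommendations, and on managed clusters the kubelet configuration may not be directly editable.

```yaml
# Sketch of a kubelet config file (passed to the kubelet with --config).
# Every value below is an illustrative assumption.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:          # set aside for OS daemons
  cpu: 200m
  memory: 512Mi
kubeReserved:            # set aside for Kubernetes node components
  cpu: 200m
  memory: 512Mi
evictionHard:            # kubelet starts evicting pods below these levels
  memory.available: "200Mi"
  nodefs.available: "10%"
```

Reserved resources are subtracted from what the scheduler considers allocatable, which lowers the chance that pod spikes starve the node itself.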

Related

how memory management happens when we set jvm arguments and memory requests and limits on container

I have set up the pod definition as shown below. I set both the heap size and the memory limits on the container:
spec:
  containers:
  - command:
    - sh
    - '-c'
    - >-
      exec java -XX:+UseG1GC -Xms512m -Xmx512m -XX:MaxRAM=640m
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    - containerPort: 8443
      name: https
      protocol: TCP
    - containerPort: 8081
      name: management
      protocol: TCP
    resources:
      limits:
        cpu: 200m
        memory: 950Mi
      requests:
        cpu: 100m
        memory: 128Mi
But the pod frequently gets killed with OOM. In that case, which values should I change: the resources section or the heap size?
I would also like to know how memory set through JVM arguments and memory set through resources work together.
First of all, your configuration seems fine. I don't think you need "-XX:MaxRAM=640m". If you are using Java 10+ you don't need these flags at all, and Java 8 has a flag that lets you drop them as well.
I think your problem is that the actual resources on the nodes are insufficient. The pod isn't stuck in the Pending state, which means some node had at least 128Mi of unreserved memory to satisfy the request, but a reservation does not guarantee that memory beyond it is actually free. There are two likely causes:
1: Your burst headroom (200m CPU, 950Mi memory) isn't enough and your app crashes while starting. This is a common problem with Java-based apps, especially with Spring Boot. To check, remove the memory limit from the configuration and see whether you still get OOM kills. If that fixes the problem, find the sweet spot for the memory limit your app needs.
2: Your nodes are running at near full capacity. Your app has only 128Mi guaranteed and not much beyond that, since other bursting apps may be running above their own requests. You can monitor this with "free -h" on the nodes. This is why some people consider it best practice to set requests and limits to the same value for stability.
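To illustrate how the two settings interact: the JVM heap is only one part of the process memory that counts against the container limit (metaspace, thread stacks, the code cache, and direct buffers are extra). With -Xmx512m and a 950Mi limit there is roughly 438Mi of headroom for everything else. A hedged sketch, assuming a container-aware JVM (Java 10+ or 8u191+) where `-XX:MaxRAMPercentage` is available, and using a hypothetical container name and image:

```yaml
# Sketch only: container name, image, and percentages are assumptions.
spec:
  containers:
  - name: app
    image: example/java-app:1.0
    env:
    - name: JAVA_TOOL_OPTIONS        # picked up by the JVM at startup
      # Heap sized as a share of the container limit instead of a fixed -Xmx
      value: "-XX:+UseG1GC -XX:MaxRAMPercentage=50.0"
    resources:
      # Equal requests and limits, as suggested in point 2 above
      requests:
        cpu: 200m
        memory: 950Mi
      limits:
        cpu: 200m
        memory: 950Mi
```

With this setup, changing the memory limit also moves the heap size, so the two values cannot drift apart. An explicit -Xmx, if present, takes precedence over the percentage.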

Trying to understand what values to use for resources and limits of multiple container deployment

I am trying to set up a HorizontalPodAutoscaler for my app, alongside the automatic Cluster Autoscaling of DigitalOcean.
I will add my deployment YAML below. I have also deployed metrics-server as per the guide in the link above. At the moment I am struggling to figure out what values to use for my cpu and memory requests and limits fields, mainly because of the variable replica count: do I need to account for the maximum number of replicas each using their resources, or for the deployment in general? Do I plan on a per-pod basis or for each container individually?
For some context I am running this on a cluster that can have up to two nodes, each node has 1 vCPU and 2GB of memory (so total can be 2 vCPUs and 4 GB of memory).
As it is now my cluster is running one node and my kubectl top statistics for pods and nodes look as follows:
kubectl top pods
NAME                       CPU(cores)   MEMORY(bytes)
graphql-85cc89c874-cml6j   5m           203Mi
graphql-85cc89c874-swmzc   5m           176Mi
kubectl top nodes
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
skimitar-dev-pool-3cpbj   62m          6%     1151Mi          73%
I have tried various combinations of cpu and memory resources, but when I deploy my file the deployment is either stuck in a Pending state or keeps restarting until it gets terminated. My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe that is because I removed resources from my deployment, as it was not working.
Considering the deployment below, what should I look at in order to determine the best values for the requests and limits of my resources?
The following YAML is cleaned of things like env variables and services. It works as is, but results in the issues mentioned above when the resources fields are uncommented.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql
spec:
  replicas: 2
  selector:
    matchLabels:
      app: graphql
  template:
    metadata:
      labels:
        app: graphql
    spec:
      containers:
      - name: graphql-hasura
        image: hasura/graphql-engine:v1.2.1
        ports:
        - containerPort: 8080
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
        # resources:
        #   requests:
        #     memory: "150Mi"
        #     cpu: "100m"
        #   limits:
        #     memory: "200Mi"
        #     cpu: "150m"
      - name: graphql-actions
        image: my/nodejs-app:1
        ports:
        - containerPort: 4040
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /healthz
            port: 4040
        readinessProbe:
          httpGet:
            path: /healthz
            port: 4040
        # resources:
        #   requests:
        #     memory: "150Mi"
        #     cpu: "100m"
        #   limits:
        #     memory: "200Mi"
        #     cpu: "150m"
# Disruption budget
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: graphql-disruption-budget
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: graphql
# Horizontal auto scaling
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: graphql-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: graphql
  minReplicas: 2
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
"How to determine what values to use for my cpu and memory requests and limits fields. Mainly due to variable replica count, i.e. do I need to account for maximum number of replicas each using their resources or for deployment in general, do I plan it per pod basis or for each container individually"
Requests and limits are the mechanisms Kubernetes uses to control resources such as CPU and memory. Both are set per container; the scheduler adds up the requests of all containers in a pod when placing it.
Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource.
Limits, on the other hand, make sure a container never goes above a certain value. A container that hits its CPU limit is throttled; a container that exceeds its memory limit is killed (OOM) and restarted.
The number of replicas is set on the Deployment (and its ReplicaSet) by the autoscaler.
"when I deploy my file my deployment is either stuck in a Pending state, or keeps restarting multiple times until it gets terminated."
A Pending state means that no node has enough unreserved resources to schedule the new pods.
Restarts may be triggered by other issues; I'd suggest debugging them after solving the scaling issues.
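To make the per-pod versus per-container point concrete: requests and limits are written per container, the scheduler adds them up per pod, and the cluster needs room for that sum times the number of replicas. Below is a minimal sketch of the containers section of the pod template, with values that are assumptions rather than measured recommendations. Each pod then requests 200m CPU and 256Mi memory, so three replicas (the HPA maximum in the question) request 600m CPU and 768Mi in total.

```yaml
# Fragment of spec.template.spec in the Deployment; all values are assumptions.
containers:
- name: graphql-hasura
  image: hasura/graphql-engine:v1.2.1
  resources:
    requests:          # used by the scheduler and by the HPA percentage
      cpu: 100m
      memory: 128Mi
    limits:            # CPU is throttled here; memory above this is OOM-killed
      cpu: 200m
      memory: 256Mi
- name: graphql-actions
  image: my/nodejs-app:1
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi
```

Running `kubectl top pods --containers` shows usage per container, which helps when picking these numbers. A memory limit below a container's real usage gets it killed and restarted, which is one possible explanation for a restart loop.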
"My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe it is due to me removing resources from my deployment, as it was not working."
You are correct. If you don't set resource requests, the utilization percentage remains unknown and the autoscaler cannot scale up or down.
Here you can see the algorithm responsible for that.
The Horizontal Pod Autoscaler adds pods based on usage as a percentage of the pods' requests. In this case, whenever average CPU usage across the pods exceeds 80% of the requested CPU, it adds pods up to the specified maximum.
For a good HPA example, check this link: Horizontal Pod Autoscale Walkthrough
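The scaling rule documented for the HPA is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, 2 replicas averaging 120% of their CPU request against an 80% target give ceil(2 × 120 / 80) = 3 replicas. The sketch below expresses the autoscaler from the question in the `autoscaling/v2` API, assuming a cluster recent enough to serve it (v1.23 or later); the `v2beta1` field `targetAverageUtilization` becomes a nested `target` block there.

```yaml
# Sketch assuming the autoscaling/v2 API is available on the cluster.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: graphql-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: graphql
  minReplicas: 2
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # percent of the pods' CPU requests
```

Utilization can only be computed when every container in the pod has a CPU request, which is why the target shows as unknown once the resources sections are commented out.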
But how does the Horizontal Pod Autoscaler work with the Cluster Autoscaler?
The Horizontal Pod Autoscaler changes the number of replicas of the Deployment or ReplicaSet based on the current CPU load. If the load increases, HPA will create new replicas, for which there may or may not be enough space in the cluster.
If there are not enough resources, CA will try to bring up some nodes so that the HPA-created pods have a place to run. If the load decreases, HPA will stop some of the replicas. As a result, some nodes may become underutilized or completely empty, and CA will then terminate such unneeded nodes.
NOTE: The key is to set the maximum replicas for HPA at the cluster level, according to the number of nodes (and the budget) available for your app. You can start with a very high maximum number of replicas, monitor, and then adjust it according to usage metrics and predictions of future load.
Take a look at How to Enable the Cluster Autoscaler for a DigitalOcean Kubernetes Cluster to enable it properly as well.
If you have any question let me know in the comments.

kubernetes scheduling for expensive resources

We have a Kubernetes cluster.
Now we want to expand it with GPU nodes (these would be the only nodes in the Kubernetes cluster that have GPUs).
We'd like to prevent Kubernetes from scheduling pods on those nodes unless they require GPUs.
Not all of our pipelines can use GPUs. The vast majority are still CPU-heavy only.
The servers with GPUs can be very expensive (for example, an Nvidia DGX could cost as much as $150k per server).
If we just add DGX nodes to the Kubernetes cluster, Kubernetes would schedule non-GPU workloads there too, which would be a waste of resources (e.g. jobs that are scheduled later and do need GPUs may find the other, non-GPU resources there, such as CPU and memory, already exhausted, so they would have to wait for the non-GPU jobs/containers to finish).
Is there a way to customize GPU resource scheduling in Kubernetes so that it only schedules pods on those expensive nodes if they require GPUs? If they don't, they may have to wait for other non-GPU resources like CPU and memory to become available on the non-GPU servers...
Thanks.
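You can use labels and label selectors for this.
Kubernetes docs
Update: example
apiVersion: v1
kind: Pod
metadata:
  name: with-gpu-antiaffinity   # pod names must be lowercase
spec:
  affinity:
    podAntiAffinity:
      # "required" terms take labelSelector and topologyKey directly;
      # weight/podAffinityTerm belong to the "preferred" form only.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: resources
            operator: In
            values:
            - cpu-only
        topologyKey: kubernetes.io/hostname
  containers:
  - name: your-gpu-workload
    image: mygpuimage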
Using labels and label selectors for your nodes is right. But you need to use NodeAffinity on your pods.
Something like this:
apiVersion: v1
kind: Pod
metadata:
  name: run-with-gpu
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/node-type
            operator: In
            values:
            - gpu
  containers:
  - name: your-gpu-workload
    image: mygpuimage
Also, attach the label to your GPU nodes:
$ kubectl label nodes <node-name> kubernetes.io/node-type=gpu
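Node affinity pulls the GPU pods onto the GPU nodes, but on its own it does not keep other pods away from them. A common complement is a taint on the GPU nodes plus a matching toleration on the GPU pods. The sketch below assumes a taint key and value of `dedicated=gpu`, which is an illustrative choice and not something from the answers above.

```yaml
# First taint the GPU nodes so pods without a matching toleration are
# not scheduled there (key and value are assumptions):
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: run-with-gpu
spec:
  tolerations:                  # lets this pod onto the tainted nodes
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  affinity:
    nodeAffinity:               # and keeps it off the non-GPU nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/node-type
            operator: In
            values:
            - gpu
  containers:
  - name: your-gpu-workload
    image: mygpuimage
```

Pods without the toleration stay Pending until capacity frees up on the untainted nodes, which matches the waiting behaviour described in the question.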

Kubernetes scaling based on network utilization or requests per second

Is there any way to scale Kubernetes nodes based on network utilization and not based on memory or CPU?
Let's say, for example, you are sending thousands of requests to a couple of nodes behind a load balancer. Neither the CPU nor the memory is struggling, but because there are thousands of requests per second you would need additional nodes to serve them. How can you do this in Google Cloud Kubernetes?
I have been researching around but I can't seem to find any references to this type of scaling, and I am guessing I am not the only one to come across this problem. So I am wondering if any of you knows of any best practice solutions.
I guess the ideal solution would be to have one pod per node receiving requests and creating more nodes based on more requests and scale up or down based on this.
This is possible: you have to use the Prometheus Adapter and configure custom rules to generate custom metrics.
This link has more details on how to set up Prometheus, install the adapter, and apply a configuration with custom metrics.
I've implemented this on my GKE cluster using these custom metrics.
This is an example of my HPA configuration:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-name
  namespace: your-namespace
  annotations:
    metric-config.external.prometheus-query.prometheus/interval: 30s
    metric-config.external.prometheus-query.prometheus/prometheus-server: http://your-prometheus-server-ip
    metric-config.external.prometheus-query.prometheus/istio-requests-total: |
      sum(rate(istio_requests_total{reporter="destination", destination_workload="deployment-name", destination_service_namespace="your-namespace"}[2m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: prometheus-query
        selector:
          matchLabels:
            query-name: istio-requests-total
      target:
        type: AverageValue
        averageValue: 7
I think HPA(Horizontal Pod Autoscaler) along with Cluster Autoscaler will do the magic.
Have a look at this - https://medium.com/google-cloud/kubernetes-autoscaling-with-istio-metrics-76442253a45a
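The question's idea of one pod per node can be approximated with a required pod anti-affinity rule on the hostname: each new replica the HPA creates then has to land on a node without an existing replica, and the Cluster Autoscaler adds a node when none is free. A minimal sketch, where the label, container name, and image are hypothetical:

```yaml
# Sketch only: label, container name, and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deployment-name
  template:
    metadata:
      labels:
        app: deployment-name
    spec:
      affinity:
        podAntiAffinity:
          # At most one pod with this label per node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: deployment-name
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: example/app:1.0
```

The trade-off is that every scale-up beyond the current node count waits for a new node to boot, so this suits workloads that can tolerate a slower reaction.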

distribute docker containers evenly with kubectl

If I create 3 nodes in a cluster, how do I distribute the docker containers evenly across the nodes? For example, if I create a cluster of 3 nodes with 8 CPUs on each node, I've determined through performance profiling that I get the best performance when I run one container per CPU.
gcloud container clusters create mycluster --num-nodes 3 --machine-type n1-standard-8
kubectl run myapp --image=gcr.io/myproject/myapp -r 24
When I ran kubectl above, it put 11 containers on the first node, 10 on the second, and 3 on the third. How do I make it so that it is 8 each?
Both your and jpapejr's solutions seem like they'd work, but using a nodeSelector to force scheduling to a single node has the downside of requiring multiple RCs for a single application and making that application less resilient to a node failure. The idea of a custom scheduler is nice but has the downside of the amount of work to write and maintain that code.
I think another possible solution would be to set runtime constraints in your pod spec, which might get you near to what you want. Based on this newly merged doc with examples of runtime constraints, I think you could set resources.requests.cpu in the pod spec part of the RC and get close to one CPU per pod:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: myregistry/myapp:v1
    resources:
      requests:
        cpu: "1000m"
That doc has other good examples of how requests and limits differ and interact. There may be a combination that gives you what you want and also keeps your application at proper capacity when an individual node fails.
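On newer clusters (the feature has been stable since Kubernetes 1.19), an even spread can also be requested directly with topology spread constraints, which postdates the answers here. A minimal sketch for the 24-replica case follows; the `app: myapp` label is an assumption, and the CPU request is set slightly below one full core on the assumption that part of each 8-CPU node is reserved for system components, so that eight pods still fit per node.

```yaml
# Sketch only: label and CPU request are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 24
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                          # pod counts per node differ by at most 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: myapp
        image: myregistry/myapp:v1
        resources:
          requests:
            cpu: "900m"
```

Unlike the nodeSelector approach, this keeps a single workload object, and the replicas can be rescheduled onto the remaining nodes if one fails (subject to the skew rule and available CPU).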
If I'm not mistaken, what you see is the expected behavior. If you want finer-grained control over pod placement, you probably want a custom scheduler.
In my case, I want to put a fixed number of containers on each node. I am able to do this by labeling each node and then using a nodeSelector in the config. Ignore the fact that I mislabeled the 3rd node; here is my setup:
kubectl label nodes gke-n3c8-7d9f8163-node-dol5 node=1
kubectl label nodes gke-n3c8-7d9f8163-node-hmbh node=2
kubectl label nodes gke-n3c8-7d9f8163-node-kdc4 node=3
That can be automated doing:
kubectl get nodes --no-headers | awk '{print NR " " $1}' | xargs -l bash -c 'kubectl label nodes $1 node=$0'
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 8
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        node: "1"
      containers:
      - name: nginx
        image: nginx
