kubernetes scheduling for expensive resources - docker

We have a Kubernetes cluster.
Now we want to expand that with GPU nodes (so that would be the only nodes in the Kubernetes cluster that have GPUs).
We'd like to prevent Kubernetes from scheduling pods on those nodes unless they require GPUs.
Not all of our pipelines can use GPUs; the vast majority are still CPU-only.
The servers with GPUs can be very expensive (for example, an Nvidia DGX can cost as much as $150k per server).
If we just add DGX nodes to the Kubernetes cluster, Kubernetes would schedule non-GPU workloads there too, which would be a waste of resources: jobs scheduled later that do need GPUs might find the node's other resources (CPU and memory) exhausted by non-GPU workloads, so they would have to wait for those non-GPU jobs/containers to finish.
Is there a way to customize GPU resource scheduling in Kubernetes so that it only schedules pods on those expensive nodes if they require GPUs? If they don't, they may have to wait for availability of CPU and memory on the non-GPU servers...
Thanks.

You can use labels and label selectors for this.
Kubernetes docs
Update: example
apiVersion: v1
kind: Pod
metadata:
  name: with-gpu-antiaffinity
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: resources
                  operator: In
                  values:
                    - cpu-only
            topologyKey: kubernetes.io/hostname

Using labels and label selectors for your nodes is right. But you need to use NodeAffinity on your pods.
Something like this:
apiVersion: v1
kind: Pod
metadata:
  name: run-with-gpu
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/node-type
                operator: In
                values:
                  - gpu
  containers:
    - name: your-gpu-workload
      image: mygpuimage
Also, attach the label to your GPU nodes:
$ kubectl label nodes <node-name> kubernetes.io/node-type=gpu
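Note that nodeAffinity only steers GPU pods onto those nodes; it does not keep other pods off them. If you also want to repel non-GPU workloads, a common complement is to taint the GPU nodes (the key and value here are just an example):
$ kubectl taint nodes <node-name> node-type=gpu:NoSchedule
GPU pods then need a matching toleration in their spec:
tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"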

Related

Trying to understand what values to use for resources and limits of a multi-container deployment

I am trying to set up a HorizontalPodAutoscaler for my app, alongside DigitalOcean's automatic cluster autoscaling.
I will add my deployment YAML below; I have also deployed metrics-server as per the guide linked above. At the moment I am struggling to figure out what values to use for my CPU and memory requests and limits fields, mainly because of the variable replica count, i.e. do I need to account for the maximum number of replicas, each using its own resources, or for the deployment in general? Do I plan it on a per-pod basis or for each container individually?
For some context I am running this on a cluster that can have up to two nodes, each node has 1 vCPU and 2GB of memory (so total can be 2 vCPUs and 4 GB of memory).
As it is now my cluster is running one node and my kubectl top statistics for pods and nodes look as follows:
kubectl top pods
NAME                       CPU(cores)   MEMORY(bytes)
graphql-85cc89c874-cml6j   5m           203Mi
graphql-85cc89c874-swmzc   5m           176Mi
kubectl top nodes
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
skimitar-dev-pool-3cpbj   62m          6%     1151Mi          73%
I have tried various combinations of CPU and memory values, but when I deploy my file the deployment is either stuck in a Pending state, or keeps restarting multiple times until it gets terminated. My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe that is due to me removing resources from my deployment, as it was not working.
Considering deployment below, what should I look at / consider in order to determine best values for requests and limits of my resources?
The following YAML is cleaned up of things like env variables/services; it works as is, but results in the above-mentioned issues when the resources fields are uncommented.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql
spec:
  replicas: 2
  selector:
    matchLabels:
      app: graphql
  template:
    metadata:
      labels:
        app: graphql
    spec:
      containers:
        - name: graphql-hasura
          image: hasura/graphql-engine:v1.2.1
          ports:
            - containerPort: 8080
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          # resources:
          #   requests:
          #     memory: "150Mi"
          #     cpu: "100m"
          #   limits:
          #     memory: "200Mi"
          #     cpu: "150m"
        - name: graphql-actions
          image: my/nodejs-app:1
          ports:
            - containerPort: 4040
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4040
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4040
          # resources:
          #   requests:
          #     memory: "150Mi"
          #     cpu: "100m"
          #   limits:
          #     memory: "200Mi"
          #     cpu: "150m"
# Disruption budget
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: graphql-disruption-budget
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: graphql
# Horizontal auto scaling
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: graphql-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: graphql
  minReplicas: 2
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 80
How to determine what values to use for my CPU and memory requests and limits fields, mainly due to variable replica count, i.e. do I need to account for the maximum number of replicas, each using its own resources, or for the deployment in general? Do I plan it on a per-pod basis or for each container individually?
Requests and limits are the mechanisms Kubernetes uses to control resources such as CPU and memory.
Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource.
Limits, on the other hand, make sure a container never goes above a certain value. The container is only allowed to go up to the limit, and then it is restricted.
The number of replicas will be determined by the autoscaler acting on the deployment's ReplicaSet.
when I deploy my file my deployment is either stuck in a Pending state, or keeps restarting multiple times until it gets terminated.
The Pending state means that there are no resources available to schedule new pods.
Restarting may be triggered by other issues; I'd suggest debugging it after solving the scaling issues.
My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe it is due to me removing resources from my deployment, as it was not working.
You are correct: if you don't set the resource requests, the desired % will remain unknown and the autoscaler won't be able to trigger scaling up or down.
Here you can see the algorithm responsible for that.
The Horizontal Pod Autoscaler adds pods based on the percentage of the requested resources that the pods actually use. In this case, whenever average usage reaches 80% of the requested value, it will add pods up to the specified maximum (desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)).
For a good HPA example, check this link: Horizontal Pod Autoscale Walkthrough
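As a rough starting point only (a sketch based on the kubectl top output above; the right values depend on your real load), per-container requests slightly above observed usage, with limits that allow some headroom, could look like:
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
On 1 vCPU / 2 GB nodes, keep the sum of requests across all containers and replicas below what a single node can actually allocate; otherwise pods will sit in Pending until the cluster autoscaler adds a node.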
But how does the Horizontal Pod Autoscaler work with the Cluster Autoscaler?
Horizontal Pod Autoscaler changes the deployment's or replicaset's number of replicas based on the current CPU load. If the load increases, HPA will create new replicas, for which there may or may not be enough space in the cluster.
If there are not enough resources, CA will try to bring up some nodes, so that the HPA-created pods have a place to run. If the load decreases, HPA will stop some of the replicas. As a result, some nodes may become underutilized or completely empty, and then CA will terminate such unneeded nodes.
NOTE: The key is to set the HPA's maximum replicas at the cluster level, according to the number of nodes (and budget) available for your app. You can start with a fairly high maximum, monitor, and then adjust it according to usage metrics and predicted future load.
Take a look at How to Enable the Cluster Autoscaler for a DigitalOcean Kubernetes Cluster in order to properly enable it as well.
If you have any questions, let me know in the comments.

Kubernetes Pod Warning: 1 node(s) had volume node affinity conflict

I am trying to set up a Kubernetes cluster. I have a PersistentVolume, PersistentVolumeClaim and StorageClass all set up and running, but when I want to create a pod from a deployment, the pod is created but hangs in the Pending state. After describing it, I only get this warning: "1 node(s) had volume node affinity conflict." Can somebody tell me what I am missing in my volume configuration?
apiVersion: v1
kind: PersistentVolume
metadata:
  creationTimestamp: null
  labels:
    io.kompose.service: mariadb-pv0
  name: mariadb-pv0
spec:
  volumeMode: Filesystem
  storageClassName: local-storage
  local:
    path: "/home/gtcontainer/applications/data/db/mariadb"
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  claimRef:
    namespace: default
    name: mariadb-claim0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/cvl-gtv-42.corp.globaltelemetrics.eu
              operator: In
              values:
                - master
status: {}
The error "volume node affinity conflict" happens when the persistent volume claims that the pod is using, are scheduled on different zones, rather than on one zone, and so the actual pod was not able to be scheduled because it cannot connect to the volume from another zone. To check this, you can see the details of all the Persistent Volumes.
To check that, first get your PVCs:
$ kubectl get pvc -n <namespace>
Then get the details of the Persistent Volumes (not Volume claims)
$ kubectl get pv
Find the PVs that correspond to your PVCs and describe them:
$ kubectl describe pv <pv1> <pv2>
You can check the Source.VolumeID for each of the PVs; most likely they will be in different availability zones, which is why your pod gets the affinity error.
To fix this, create a storageclass for a single zone and use that storageclass in your PVC.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: region1storageclass
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true" # if encryption required
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - eu-west-2b # this is the availability zone, will depend on your cloud provider
            # multi-az can be added, but that defeats the purpose in our scenario
0. If you didn't find the solution in other answers...
In our case the error happened on an AWS EKS cluster freshly provisioned with Pulumi (see the full source here). The error drove me nuts, since I didn't change anything, just created a PersistentVolumeClaim as described in the Buildpacks Tekton docs:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: buildpacks-source-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
I didn't change anything else from the default EKS configuration and also didn't add/change any PersistentVolume or StorageClass (in fact I didn't even know how to do that). As the default EKS setup seems to rely on 2 nodes, I got the error:
0/2 nodes are available: 2 node(s) had volume node affinity conflict.
Reading through Sownak Roy's answer I got a first clue what to do - but didn't know how to do it. So for the folks interested, here are all my steps to resolve the error:
1. Check EKS nodes failure-domain.beta.kubernetes.io labels
As described in the section "Stateful applications" in this post, the two nodes are provisioned in different AWS availability zones from the persistent volume (PV), which is created by applying our PersistentVolumeClaim described above.
To check that, you need to look into/describe your nodes with kubectl get nodes:
$ kubectl get nodes
NAME                                             STATUS   ROLES    AGE     VERSION
ip-172-31-10-186.eu-central-1.compute.internal   Ready    <none>   2d16h   v1.21.5-eks-bc4871b
ip-172-31-20-83.eu-central-1.compute.internal    Ready    <none>   2d16h   v1.21.5-eks-bc4871b
and then have a look at the Label section using kubectl describe node <node-name>:
$ kubectl describe node ip-172-77-88-99.eu-central-1.compute.internal
Name: ip-172-77-88-99.eu-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t2.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-77-88-99.eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=t2.medium
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
Annotations: node.alpha.kubernetes.io/ttl: 0
...
In my case the node ip-172-77-88-99.eu-central-1.compute.internal has failure-domain.beta.kubernetes.io/region defined as eu-central-1 and the az defined with failure-domain.beta.kubernetes.io/zone as eu-central-1b.
And the other node defines failure-domain.beta.kubernetes.io/zone as eu-central-1a:
$ kubectl describe nodes ip-172-31-10-186.eu-central-1.compute.internal
Name: ip-172-31-10-186.eu-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t2.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-31-10-186.eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=t2.medium
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1a
Annotations: node.alpha.kubernetes.io/ttl: 0
...
2. Check PersistentVolume's topology.kubernetes.io field
Now we should check the PersistentVolume automatically provisioned after we manually applied our PersistentVolumeClaim. Use kubectl get pv:
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                           STORAGECLASS   REASON   AGE
pvc-93650993-6154-4bd0-bd1c-6260e7df49d3   1Gi        RWO            Delete           Bound    default/buildpacks-source-pvc   gp2                     21d
followed by kubectl describe pv <pv-name>
$ kubectl describe pv pvc-93650993-6154-4bd0-bd1c-6260e7df49d3
Name: pvc-93650993-6154-4bd0-bd1c-6260e7df49d3
Labels: topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1c
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
...
The PersistentVolume was configured with the label topology.kubernetes.io/zone in az eu-central-1c, which makes our Pods complain about not finding their volume - since they are in a completely different az!
3. Add allowedTopologies to StorageClass
As stated in the Kubernetes docs, one solution to the problem is to add an allowedTopologies configuration to the StorageClass. If you already provisioned an EKS cluster like me, you need to retrieve your already defined StorageClass with
kubectl get storageclasses gp2 -o yaml
Save it to a file called storage-class.yml and add an allowedTopologies section that matches your nodes' failure-domain.beta.kubernetes.io labels, like this:
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - eu-central-1a
          - eu-central-1b
The allowedTopologies configuration defines that the failure-domain.beta.kubernetes.io/zone of the PersistentVolume must be either in eu-central-1a or eu-central-1b - not eu-central-1c!
The full storage-class.yml looks like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - eu-central-1a
          - eu-central-1b
Apply the enhanced StorageClass configuration to your EKS cluster with
kubectl apply -f storage-class.yml
4. Delete PersistentVolumeClaim, add storageClassName: gp2 to it and re-apply it
In order to get things working again, we need to delete the PersistentVolumeClaim first.
To map the PersistentVolumeClaim to our previously defined StorageClass, we need to add storageClassName: gp2 to the PersistentVolumeClaim definition in our pvc.yml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: buildpacks-source-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
  storageClassName: gp2
Finally re-apply the PersistentVolumeClaim with kubectl apply -f pvc.yml. This should resolve the error.
There are a few things that can cause this error:
The node isn't labeled properly. I had this issue on AWS when my worker node didn't have appropriate labels (the master had them though) like these:
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2c
After patching the node with the labels, the “1 node(s) had volume node affinity conflict” error was gone, so PV, PVC with a pod were deployed successfully.
The value of these labels is cloud provider specific. Basically, it is the job of the cloud provider (with the --cloud-provider option defined in kube-controller-manager, kube-apiserver, and kubelet) to set those labels. If appropriate labels aren't set, then check that your cloud provider integration is correct. I used kubeadm, so it is cumbersome to set up, but with other tools, kops for instance, it works right away.
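If you need to set them manually (the node name and zone values below are just placeholders), a label command like this patches them on:
kubectl label nodes <node-name> failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2c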
Based on your PV definition and the usage of the nodeAffinity field, you are trying to use a local volume (read the local volume description here, official docs). Make sure that you set the nodeAffinity field like this (it worked in my case on AWS):
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - my-node # it must be the name of your node (kubectl get nodes)
So that after creating the resource and running describe on it, it will show up like this:
Required Terms:
Term 0: kubernetes.io/hostname in [your node name]
The StorageClass definition (named local-storage, which is not posted here) must be created with volumeBindingMode set to WaitForFirstConsumer for local storage to work properly. Refer to the example in the storage class local description (official docs) to understand the reason behind that.
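For reference, a minimal sketch of such a StorageClass (the name matches the local-storage referenced in the PV above; local volumes use the no-provisioner placeholder):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer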
The "1 node(s) had volume node affinity conflict" error is created by the scheduler because it can't schedule your pod to a node that conforms with the persistenvolume.spec.nodeAffinity field in your PersistentVolume (PV).
In other words, you say in your PV that a pod using this PV must be scheduled to a node with a label of kubernetes.io/cvl-gtv-42.corp.globaltelemetrics.eu = master, but this isn't possible for some reason.
There may be various reasons why your pod can't be scheduled to such a node:
The pod has node affinities, pod affinities, etc. that conflict with the target node
The target node is tainted
The target node has reached its "max pods per node" limit
There exists no node with the given label
The place to start looking for the cause is the definition of the node and the pod.
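For example (the names are placeholders):
kubectl describe node <node-name>
kubectl describe pod <pod-name>
kubectl get pv <pv-name> -o yaml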
Great answer by Sownak Roy. I've had the same case of a PV being created in a different zone compared to the node that was supposed to use it. The solution I applied was based on Sownak's answer, only in my case it was enough to specify the storage class without the "allowedTopologies" list, like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cloud-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
After some headache-inducing investigation, there are a few things that need to be checked:
Azure:
Does your cluster have more than one zone selected? (zone 1, 2, 3)
Does your default storage class have the correct storage provider?
(ZRS Zone-Redundant-Storage)
If not:
change the storage class to use the correct provider
create backup of PV data
stop the deployment that is using the PVC (set replicas to 0)
delete the PVC and confirm that the associated PV is deleted.
re-apply the PVC config yaml (without reference to the old storageclass name)
start the deployment that is using the PVC (set replicas to 1)
manually import the backup data
Example storageclass for AKS:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zone-redundant-storage
parameters:
  skuname: StandardSSD_ZRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
GKE:
Does your cluster have more than one zone selected? (Zone A, B, C)
Does your default storage class have replication-type parameter? (replication-type: regional-pd)
If not:
change the storage class to use the correct parameters
create backup of PV data
stop the deployment that is using the PVC (set replicas to 0)
delete the PVC and confirm that the associated PV is deleted.
re-apply the PVC config yaml (without reference to the old storageclass name)
start the deployment that is using the PVC (set replicas to 1)
manually import the backup data
Example storageclass for GKE:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-regional-pd-storage
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
After that, PVs will have redundancy across the selected zones, allowing a pod to access the PV from nodes in different zones.
In my case, the root cause was that the persistent volume was in us-west-2c and the new worker nodes were relaunched in us-west-2a and us-west-2b. The solution is to either have more worker nodes so they cover more zones, or remove/widen the node affinity for the application so that more worker nodes qualify to be bound to the persistent volume.
Make sure the Kubernetes node has the required label. You can verify the node labels using:
kubectl get nodes --show-labels
One of the Kubernetes nodes should show the name/label of the persistent volume, and your pod should be scheduled on the same node.
Make sure the requested size in the PersistentVolumeClaim matches the size of the PersistentVolume. If the size does not match, either correct resources.requests.storage in the PersistentVolumeClaim or delete the old PersistentVolume and create a new one with the correct size.
Verification steps:
Describe your persistent volume:
kubectl describe pv postgres-br-proxy-pv-0
Output:
...
Node Affinity:
Required Terms:
Term 0: postgres-br-proxy in [postgres-br-proxy-pv-0]
...
Show node labels:
kubectl get nodes --show-labels
Output:
NAME    STATUS   ROLES    AGE   VERSION   LABELS
node3   Ready    <none>   19d   v1.17.6   postgres-br-proxy=postgres-br-proxy-pv-0
If you are not getting the persistent volume label on the node that your pod is using, then the pod won't get scheduled.
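In that case you can label the node yourself (reusing the example label above) so the scheduler can place the pod:
kubectl label nodes node3 postgres-br-proxy=postgres-br-proxy-pv-0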
For me, this happened on GKE after upgrading to k8s v1.25. In my case, none of the above worked, so I looked into cloning the volume as I didn't want to lose the data.
This post led me to enable the Compute Engine persistent disk CSI Driver, which, once enabled, fixed my issue.
A different case, from GCP GKE. Assume that you are using a regional cluster and you created two PVCs. They were created in different zones (and you didn't notice).
In the next step you try to run a pod that mounts both PVCs. That pod has to be scheduled to a specific node in a specific zone, but because your volumes are in different zones, k8s won't be able to schedule it and you will get this error.
For example, two simple PVCs on a regional cluster (nodes in different zones):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: disk-a
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: disk-b
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
Next, a simple pod:
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  containers:
    - name: debug
      image: pnowy/docker-tools:latest
      command: [ "sleep" ]
      args: [ "infinity" ]
      volumeMounts:
        - name: disk-a
          mountPath: /disk-a
        - name: disk-b
          mountPath: /disk-b
  volumes:
    - name: disk-a
      persistentVolumeClaim:
        claimName: disk-a
    - name: disk-b
      persistentVolumeClaim:
        claimName: disk-b
Finally, as a result, it can happen that k8s won't be able to schedule the pod because the volumes are in different zones.
On AWS EKS, you may also get this problem if you forget to install the aws-ebs-csi-driver EKS addon prior to upgrading your Kubernetes cluster from 1.22 to 1.23.
You can also install the addon after the upgrade (although with some service interruption).
Make sure to check the AWS FAQ on this: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi-migration-faq.html
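For reference, a sketch of installing the addon with the AWS CLI (the cluster name is a placeholder, and depending on your setup the addon may also need an IAM role for its service account):
aws eks create-addon --cluster-name <cluster-name> --addon-name aws-ebs-csi-driver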
Almost the same problem is described here...
https://github.com/kubernetes/kubernetes/issues/61620
"If you're using local volumes, and the node crashes, your pod cannot be rescheduled to a different node. It must be scheduled to the same node. That is the caveat of using local storage, your Pod becomes bound forever to one specific node."
Most likely you just reduced the number of nodes in your Kubernetes cluster and some "regions" (zones) are not available anymore...
Something worth mentioning: if your pod ends up in a different zone than the persistent volume, then:
your disk access times will increase significantly (the storage is not local anymore; even with Amazon's / Google's fast fiber links it is still traffic across data centers)
you will be paying for "cross-regional network" traffic (on your AWS bill it goes into "EC2-Other", and only after drilling down into the bill can you spot it)
One cause of this is when you have a definition like the one below (Kafka ZooKeeper in this example) which uses multiple PVCs for one container. If they land on different nodes, you will get something like: ...volume node affinity conflict. The solution here is to use one PVC definition and use subPath on the volumeMount.
Problem
...
    volumeMounts:
      - mountPath: /data
        name: kafka-zoo-data
      - mountPath: /datalog
        name: kafka-zoo-datalog
  restartPolicy: Always
  volumes:
    - name: kafka-zoo-data
      persistentVolumeClaim:
        claimName: "zookeeper-data"
    - name: kafka-zoo-datalog
      persistentVolumeClaim:
        claimName: "zookeeper-datalog"
Resolved
...
    volumeMounts:
      - mountPath: /data
        subPath: data
        name: kafka-zoo-data
      - mountPath: /datalog
        subPath: datalog
        name: kafka-zoo-data
  restartPolicy: Always
  volumes:
    - name: kafka-zoo-data
      persistentVolumeClaim:
        claimName: "zookeeper-data"
In my case, I was working with minikube on Docker Desktop on Windows, and my example was using only docker-desktop as the node name, so the setup is pretty important.
I have added minikube as I was using a single node; there might be more if additional nodes are added, such as minikube-m02.
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - minikube
kubectl get node should be enough to give node names.
In my case I just deleted the PersistentVolumeClaim associated with the conflicting Pod and then recreated the pod.
Another reason for this error to occur is if you have a mix of nodes utilising taints. In some releases the DaemonSet component of the EBS CSI driver does not tolerate all taints by default; if you're trying to schedule a Pod onto a node with a taint and because of that taint it doesn't have the ebs-csi-node Pod running, you get this error.
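A quick way to check whether the tainted node actually runs the driver pod (the namespace may differ depending on how the driver was installed):
kubectl get pods -n kube-system -o wide | grep ebs-csi-node
kubectl describe node <node-name> | grep Taints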

Any way to prevent k8s pod eviction?

I have a set of daemons I need to run. Generally, they do not consume much memory or CPU, and I have set their limits to cpu: 150m and memory: 150m.
Occasionally they will spike quite a bit higher than this, and this seems to be causing evictions and an unstable node.
It is critical that the daemons remain running 24/7, even if they are throttled by CPU and/or memory when they spike. Is it possible to prevent their eviction and to cap their resources?
As I understand it, CPU usage is throttled, but going over the memory limit results in an OOM eviction. Is there any way to prevent this eviction?
As of 1.11, you can set pod priorities.
create priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
set priority in pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
    - name: nginx
      image: nginx
      imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
Sounds like you need to track the resource consumption trends with something like Prometheus + Grafana to check what sort of spikes to expect from your DaemonSets.
Then you can allocate more resources to these pods or remove this config (which, by default, will leave them unbounded). But, of course, you don't want to risk a full node/host crash, so consider tweaking your eviction thresholds:
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds
More details:
https://kubernetes-v1-4.github.io/docs/admin/limitrange/
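If you do tune eviction thresholds, a minimal KubeletConfiguration sketch looks like this (the values are illustrative, not recommendations):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"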

Kubernetes scaling based on network utilization or requests per second

Is there any way to scale Kubernetes nodes based on network utilization and not based on memory or CPU?
Let's say, for example, you are sending thousands of requests to a couple of nodes behind a load balancer. Neither the CPU nor the memory is struggling, but because there are thousands of requests per second you would need additional nodes to serve them. How can you do this in Google Cloud Kubernetes?
I have been researching around but I can't seem to find any references to this type of scaling, and I am guessing I am not the only one to come across this problem. So I am wondering if any of you know of any best-practice solutions.
I guess the ideal solution would be to have one pod per node receiving requests and creating more nodes based on more requests and scale up or down based on this.
This is possible, and you have to use the Prometheus Adapter to configure custom rules that generate custom metrics.
This link has more details on how to set up Prometheus, install the adapter and apply the configuration with custom metrics.
I've implemented this on my GKE cluster using these custom metrics.
This is an example of my HPA configuration:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-name
  namespace: your-namespace
  annotations:
    metric-config.external.prometheus-query.prometheus/interval: 30s
    metric-config.external.prometheus-query.prometheus/prometheus-server: http://your-prometheus-server-ip
    metric-config.external.prometheus-query.prometheus/istio-requests-total: |
      sum(rate(istio_requests_total{reporter="destination", destination_workload="deployment-name", destination_service_namespace="your-namespace"}[2m]))
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deployment-name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: prometheus-query
          selector:
            matchLabels:
              query-name: istio-requests-total
        target:
          type: AverageValue
          averageValue: 7
I think HPA (Horizontal Pod Autoscaler) along with the Cluster Autoscaler will do the magic.
Have a look at this - https://medium.com/google-cloud/kubernetes-autoscaling-with-istio-metrics-76442253a45a

distribute docker containers evenly with kubectl

If I create 3 nodes in a cluster, how do I distribute the docker containers evenly across the nodes? For example, if I create a cluster of 3 nodes with 8 CPUs on each node, I've determined through performance profiling that I get the best performance when I run one container per CPU.
gcloud container clusters create mycluster --num-nodes 3 --machine-type n1-standard-8
kubectl run myapp --image=gcr.io/myproject/myapp -r 24
When I ran kubectl above, it put 11 containers on the first node, 10 on the second, and 3 on the third. How do I make it so that it is 8 on each?
Both your and jpapejr's solutions seem like they'd work, but using a nodeSelector to force scheduling to a single node has the downside of requiring multiple RCs for a single application and making that application less resilient to a node failure. The idea of a custom scheduler is nice but has the downside of the amount of work to write and maintain that code.
I think another possible solution would be to set runtime constraints in your pod spec that might get you near to what you want. Based on this newly merged doc with examples of runtime constraints, I think you could set resources.requests.cpu in the pod spec part of the RC and get close to a CPU-per-pod:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myregistry/myapp:v1
      resources:
        requests:
          cpu: "1000m"
That doc has other good examples of how requests and limits differ and interact. There may be a combination that gives you what you want and also keeps your application at proper capacity when an individual node fails.
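For illustration only (the values are made up), a container that is guaranteed one CPU but capped at two, with memory handled the same way, would look like:
resources:
  requests:
    cpu: "1000m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "1Gi"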
If I'm not mistaken, what you see is the expected behavior. If you want finer-grained control over pod placement, you probably want a custom scheduler.
In my case, I want to put a fixed number of containers on each node. I am able to do this by labeling each node and then using a nodeSelector with a config. Ignore the fact that I mislabeled the 3rd node; here is my setup:
kubectl label nodes gke-n3c8-7d9f8163-node-dol5 node=1
kubectl label nodes gke-n3c8-7d9f8163-node-hmbh node=2
kubectl label nodes gke-n3c8-7d9f8163-node-kdc4 node=3
That can be automated by doing:
kubectl get nodes --no-headers | awk '{print NR " " $1}' | xargs -l bash -c 'kubectl label nodes $1 node=$0'
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 8
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        node: "1"
      containers:
        - name: nginx
          image: nginx
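On newer clusters, topologySpreadConstraints in the pod template spec are another option for spreading pods evenly across nodes without labeling each node individually; a minimal sketch reusing the app: nginx label from the example above:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx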
