Noob here. I want a Dask install with a worker pool that can grow and shrink based on current demands. I followed the instructions in Zero to JupyterHub to install on GKE, and then went through the install instructions for dask-kubernetes: https://kubernetes.dask.org/en/latest/.
I originally ran into some permissions issues, so I created a service account with all permissions and changed my config.yaml to use this service account. That got rid of the permissions issues, but now when I run this script, with the default worker-spec.yml, I get no workers:
from dask_kubernetes import KubeCluster
import distributed

cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale_up(4)  # request 4 workers explicitly
client = distributed.Client(cluster)
client
Cluster
Workers: 0
Cores: 0
Memory: 0 B
When I list my pods, I see a lot of workers in the pending state:
patrick_mineault#cloudshell:~ (neuron-264716)$ kubectl get pod --namespace jhub
NAME READY STATUS RESTARTS AGE
dask-jovyan-24034fcc-22qw7w 0/1 Pending 0 45m
dask-jovyan-24034fcc-25h89q 0/1 Pending 0 45m
dask-jovyan-24034fcc-2bpt25 0/1 Pending 0 45m
dask-jovyan-24034fcc-2dthg6 0/1 Pending 0 45m
dask-jovyan-25b11132-52rn6k 0/1 Pending 0 26m
...
And when I describe each pod, I see an insufficient CPU/memory error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 69s (x22 over 30m) default-scheduler 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
Do I need to manually create a new autoscaling pool in GKE or something? I only have one pool now, the one that runs JupyterLab, and that pool is already fully committed. I can't figure out which piece of configuration tells Dask which pool to put the workers in.
I indeed needed to create a flexible, scalable worker pool to host the workers - there's an example of this in the Pangeo setup guide: https://github.com/pangeo-data/pangeo/blob/master/gce/setup-guide/1_create_cluster.sh. This is the relevant line:
gcloud container node-pools create worker-pool --zone=$ZONE --cluster=$CLUSTER_NAME \
--machine-type=$WORKER_MACHINE_TYPE --preemptible --num-nodes=$MIN_WORKER_NODES
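To get the grow-and-shrink behavior described above, the same pool can also be created with GKE cluster autoscaling enabled. This is only a sketch; the min/max values are illustrative and the variables are the same ones used in the Pangeo script:
# sketch: worker pool with cluster autoscaling enabled (min/max values are illustrative)
gcloud container node-pools create worker-pool --zone=$ZONE --cluster=$CLUSTER_NAME \
    --machine-type=$WORKER_MACHINE_TYPE --preemptible --num-nodes=$MIN_WORKER_NODES \
    --enable-autoscaling --min-nodes=0 --max-nodes=10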
Related
I have a Kubernetes cluster of around 18 nodes, some with 4 cores and 16 GB RAM and some with 16 cores and 64 GB RAM, with around 25-30 applications running on the cluster.
Each application is configured with request and limit parameters of roughly 2-3 cores and 4-8 GB RAM.
How do I get a current utilization report showing how many cores and how much RAM I have left in the cluster before deploying any new application?
I tried using the below commands:
kubectl top no; kubectl describe no [node-name]
These do not give me the exact number of cores or amount of RAM I have left.
Any leads would help a lot.
Note: I am using version 1.19 of Kubernetes.
You can use a kubectl plugin to view resource capacity, usage, etc.
Here are a few related plugins:
etopeter/kubectl-view-utilization
davidB/kubectl-view-allocations
robscott/kube-capacity
You can use krew to install those plugins.
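A rough sketch of installing and running one of them with krew; the plugin name and flag below are assumptions, so check each project's README for the exact invocation:
# sketch, assuming krew is already installed
kubectl krew install resource-capacity    # assumed krew name for robscott/kube-capacity
kubectl resource-capacity --util          # assumed flag: show requests, limits and utilization per node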
kubectl describe node <insert-node-name-here>
You should see something like this:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1130m (59%) 3750m (197%)
memory 4836Mi (90%) 7988Mi (148%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-azure-disk 0 0
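The requests under "Allocated resources" are what the scheduler counts against each node's Allocatable capacity, so what is left is roughly Allocatable minus the request totals. A rough way to see both sections side by side for every node, using only standard kubectl and grep:
# rough sketch: print each node's Allocatable section next to its current request totals
kubectl describe nodes | grep -A 7 -E "^(Allocatable:|Allocated resources:)"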
I created a cluster with autopilot mode. When I try to install an app inside this cluster using helm, workloads fail with this error Does not have minimum availability. If I click on this error, I get Cannot schedule pods: Insufficient cpu and Cannot schedule pods: Insufficient memory.
If I do kubectl describe node <name> I find 0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu.
Isn't GKE autopilot mode supposed to allocate sufficient memory and cpu?
I found where my mistake was. It had nothing to do with CPU or memory. It was a mistake inside my YAML file (wrong host for the database).
I am trying to pull and run images larger than 1 GB in Kubernetes pods. After 2 minutes the pods get destroyed and recreated, and the same issue happens in a loop.
Below is what I get when I run kubectl get pods:
NAME READY STATUS RESTARTS AGE
dynamic-agent-fe7b9f8e-bce6-471b-a6b9-26cc52b02bba-4nxcd-psngk 0/3 ContainerCreating 0 <invalid>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 92s default-scheduler Successfully assigned default/dynamic-agent-fe7b9f8e-bce6-471b-a6b9-26cc52b02bba-4nxcd-qvg8c to k8sworker1-test
Normal Pulling 41s kubelet, k8sworker1-test pulling image "docker-all.xxx.xxx.net/sap/ppiper/node-browsers:v3"
After the above step, it continues pulling for 120 seconds, after which Kubernetes kills the pod and recreates it, and the same pulling process repeats...
I have been trying to set up a Kubernetes 1.13 AKS deployment to use HPA, and I keep running into a problem:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
dev-hpa-poc Deployment/dev-hpa-poc <unknown>/50% 1 4 2 65m
Describing the HPA gives me these events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedComputeMetricsReplicas 5m4s (x200 over 55m) horizontal-pod-autoscaler failed to get cpu utilization: missing request for cpu
Warning FailedGetResourceMetric 3s (x220 over 55m) horizontal-pod-autoscaler missing request for cpu
It doesn't appear to be able to actually retrieve CPU usage. I have specified cpu and memory usage in the deployment YAML:
resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    cpu: 800m
    memory: 1024Mi
The system:metrics-server is running and healthy, too, so that's not it. I can monitor pod health and CPU usage from the Azure portal. Any ideas as to what I'm missing? Could this potentially be a permissions issue?
For missing request for [x], make sure that all the containers in the pod have requests declared.
In my case the reason was that another deployment didn't have resource limits. You should add resources for each pod and deployment in the namespace.
Adding to #nakamume's answer, make sure to double check sidecar containers.
For me, I forgot to declare requests for the GCP cloud-sql-proxy sidecar, which had me pulling my hair out for a couple of hours.
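A minimal sketch of what that looks like in a pod spec; the container names, image values, and numbers below are illustrative placeholders, not taken from the question:
# illustrative pod spec: every container, including the sidecar, declares requests,
# otherwise the HPA reports "missing request for cpu"
spec:
  containers:
  - name: app                            # hypothetical main container
    image: <your-app-image>              # placeholder
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 800m
        memory: 1024Mi
  - name: cloud-sql-proxy                # the sidecar needs requests too
    image: <your-cloud-sql-proxy-image>  # placeholder
    resources:
      requests:
        cpu: 100m
        memory: 64Mi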
I've successfully deployed AKS with virtual nodes, where it automatically creates Azure Container Instances to support the number of pods requested, and I can manually scale up with:
kubectl scale --replicas=20 deployment.apps/blah
And sure enough I see 20 container instances get created in a special resource group and they're all running my app. When I scale it down, they disappear. This is great.
So then I try setting up autoscaling. I set limits/requests for CPU in my yaml and I say:
kubectl autoscale deployment blah --min=1 --max=20 --cpu-percent=50
But no new pods get created. To find out more I say:
kubectl describe hpa
And I get:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 3s (x12 over 2m49s) horizontal-pod-autoscaler unable to get metrics for resource cpu: no metrics returned from resource metrics API
Warning FailedComputeMetricsReplicas 3s (x12 over 2m49s) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
According to these docs the metrics-server is automatically available in AKS since 1.8 (mine is 1.12, newly created).
Is this specifically a problem with virtual nodes, i.e. do they lack the ability to expose resource utilization via metrics-server in the way required by autoscale? Or is there something else I need to set up?
The Metrics Server should be able to gather metrics from the virtual kubelet (ACI).
Here's an example repo that shows that HPA with ACI is possible:
https://github.com/Azure-Samples/virtual-node-autoscale
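One way to confirm that the resource metrics API is actually serving data before blaming the virtual nodes, using standard kubectl commands that are not specific to this setup:
# if these return per-pod CPU/memory numbers, metrics-server is working and the HPA can read them
kubectl top pods
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"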