distribute docker containers evenly with kubectl - docker

If I create 3 nodes in a cluster, how do I distribute the docker containers evenly across the containers? For example, if I create a cluster of 3 nodes with 8 cpus on each node, I've determined through performance profiling that I get the best performance when I run one container per cpu.
gcloud container clusters create mycluster --num-nodes 3 --machine-type n1-standard-8
kubectl run myapp --image=gcr.io/myproject/myapp -r 24
When I ran kubectl above, it put 11 containers on the first node, 10 on the second, and 3 on the third. How to I make it so that it is 8 each?

Both your and jpapejr's solutions seem like they'd work, but using a nodeSelector to force scheduling to a single node has the downside of requiring multiple RCs for a single application and making that application less resilient to a node failure. The idea of a custom scheduler is nice but has the downside of the amount of work to write and maintain that code.
I think another possible solution would be to set runtime constraints in your pod spec that might get you near to what you want. Based on this newly merged doc with examples of runtime contraints, I think you could set resources.requests.cpu in the pod spec part of the RC and get close to a CPU-per-pod:
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
containers:
- name: myapp
image: myregistry/myapp:v1
resources:
requests:
cpu: "1000m"
That docs has other good examples of how requests and limits differ and interact. There may be a combination that gives you what you want and also keeps your application at proper capacity when an individual node fails.

If I'm not mistaken, what you see is the expectation. If you want finer grained control over pod placement you probably want a customer scheduler.

In my case, I want to put a fixed number of containers in each node. I am able to do this by labeling each node and then using a nodeSelector with a config. Ignore that fact that I mislabeled the 3rd node, here is my setup:
kubectl label nodes gke-n3c8-7d9f8163-node-dol5 node=1
kubectl label nodes gke-n3c8-7d9f8163-node-hmbh node=2
kubectl label nodes gke-n3c8-7d9f8163-node-kdc4 node=3
That can be automated doing:
kubectl get nodes --no-headers | awk '{print NR " " $1}' | xargs -l bash -c 'kubectl label nodes $1 node=$0'
apiVersion: v1
kind: ReplicationController
metadata:
name: nginx
spec:
replicas: 8
selector:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
nodeSelector:
node: "1"
containers:
- name: nginx
image: nginx

Related

Trying to understand what values to use for resources and limits of multiple container deployment

I am trying to set up HorizontalPodAutoscaler autoscaler for my app, alongside automatic Cluster Autoscaling of DigitalOcean
I will add my deployment yaml below, I have also deployed metrics-server as per guide in link above. At the moment I am struggling to figure out how to determine what values to use for my cpu and memory requests and limits fields. Mainly due to variable replica count, i.e. do I need to account for maximum number of replicas each using their resources or for deployment in general, do I plan it per pod basis or for each container individually?
For some context I am running this on a cluster that can have up to two nodes, each node has 1 vCPU and 2GB of memory (so total can be 2 vCPUs and 4 GB of memory).
As it is now my cluster is running one node and my kubectl top statistics for pods and nodes look as follows:
kubectl top pods
NAME CPU(cores) MEMORY(bytes)
graphql-85cc89c874-cml6j 5m 203Mi
graphql-85cc89c874-swmzc 5m 176Mi
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
skimitar-dev-pool-3cpbj 62m 6% 1151Mi 73%
I have tried various combinations of cpu and resources, but when I deploy my file my deployment is either stuck in a Pending state, or keeps restarting multiple times until it gets terminated. My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe it is due to me removing resources from my deployment, as it was not working.
Considering deployment below, what should I look at / consider in order to determine best values for requests and limits of my resources?
Following yaml is cleaned up from things like env variables / services, it works as is, but results in above mentioned issues when resources fields are uncommented.
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: graphql
spec:
replicas: 2
selector:
matchLabels:
app: graphql
template:
metadata:
labels:
app: graphql
spec:
containers:
- name: graphql-hasura
image: hasura/graphql-engine:v1.2.1
ports:
- containerPort: 8080
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
# resources:
# requests:
# memory: "150Mi"
# cpu: "100m"
# limits:
# memory: "200Mi"
# cpu: "150m"
- name: graphql-actions
image: my/nodejs-app:1
ports:
- containerPort: 4040
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: 4040
readinessProbe:
httpGet:
path: /healthz
port: 4040
# resources:
# requests:
# memory: "150Mi"
# cpu: "100m"
# limits:
# memory: "200Mi"
# cpu: "150m"
# Disruption budget
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: graphql-disruption-budget
spec:
minAvailable: 1
selector:
matchLabels:
app: graphql
# Horizontal auto scaling
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: graphql-autoscaler
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: graphql
minReplicas: 2
maxReplicas: 3
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 80
How to determine what values to use for my cpu and memory requests and limits fields. Mainly due to variable replica count, i.e. do I need to account for maximum number of replicas each using their resources or for deployment in general, do I plan it per pod basis or for each container individually
Requests and limits are the mechanisms Kubernetes uses to control resources such as CPU and memory.
Requests are what the container is guaranteed to get. If a container requests a resource, Kubernetes will only schedule it on a node that can give it that resource.
Limits, on the other hand, make sure a container never goes above a certain value. The container is only allowed to go up to the limit, and then it is restricted.
The number of replicas will be determined by the autoscaler on the ReplicaController.
when I deploy my file my deployment is either stuck in a Pending state, or keeps restarting multiple times until it gets terminated.
pending state means that there is not resources available to schedule new pods.
restarting may be triggered by other issues, I'd suggest you to debug it after solving the scaling issues.
My horizontal pod autoscaler also reports targets as <unknown>/80%, but I believe it is due to me removing resources from my deployment, as it was not working.
You are correct, if you don't set the request limit, the % desired will remain unknown and the autoscaler won't be able to trigger scaling up or down.
Here you can see algorithm responsible for that.
Horizontal Pod Autoscaler will trigger new pods based on the request % of usage on the pod. In this case whenever the pod reachs 80% of the max request value it will trigger new pods up to the maximum specified.
For a good HPA example, check this link: Horizontal Pod Autoscale Walkthrough
But How does Horizontal Pod Autoscaler works with Cluster Autoscaler?
Horizontal Pod Autoscaler changes the deployment's or replicaset's number of replicas based on the current CPU load. If the load increases, HPA will create new replicas, for which there may or may not be enough space in the cluster.
If there are not enough resources, CA will try to bring up some nodes, so that the HPA-created pods have a place to run. If the load decreases, HPA will stop some of the replicas. As a result, some nodes may become underutilized or completely empty, and then CA will terminate such unneeded nodes.
NOTE: The key is to set the maximum replicas for HPA thinking on a cluster level according to the amount of nodes (and budget) available for your app, you can start setting a very high max number of replicas, monitor and then change it according to the usage metrics and prediction of future load.
Take a look at How to Enable the Cluster Autoscaler for a DigitalOcean Kubernetes Cluster in order to properly enable it as well.
If you have any question let me know in the comments.

How to expose low-numbered ports in the kubernetes mini-cluster that comes with Docker Desktop

I'm using the kubernetes cluster built in to Docker Desktop to develop my application.
I would like to expose services inside the cluster as ports on localhost.
I can do so using kubectl expose deployment foobar --type=NodePort --port=30088, which creates a service like this:
apiVersion: v1
kind: Service
metadata:
labels:
role: web
name: foobar
spec:
externalTrafficPolicy: Cluster
ports:
- nodePort: 30088
port: 80
protocol: TCP
targetPort: 80
selector:
role: web
type: NodePort
But it only works for very high numbered ports. If I try something lower I get:
The Service "kafka-external" is invalid: spec.ports[0].nodePort: Invalid value: 9092: provided port is not in the valid range. The range of valid ports is 30000-32767
It seems there is a kubernetes apiserver setting called ServiceNodePortRange which would allow me to override this restriction, but I can't figure out how to set it on Docker's builtin cluster.
So my question is: how do I expose a specific, low-numbered port (like 9092) on Docker's kubernetes cluster? Is there a way to override that setting? Or a better way to expose the service than NodePort?
NodePort is intended to be a building block for load-balancers or other
ingress modes. This means it didn't matter which port you got as long as
you got one. This makes it a little clunky to use directly - you can't
have just any port. You can change the port range, but you run the risk of
conflicts with real things running on your nodes and with any pod HostPorts.
The default range is indeed 30000-32767 but it can be changed by setting the --service-node-port-range Update the file /etc/kubernetes/manifests/kube-apiserver.yaml and add the line --service-node-port-range=xxxxx-yyyyy.
In the Kubernetes cluster there is a kube-apiserver.yaml file which is in the directory - /etc/kubernetes/manifests/kube-apiserver.yaml but not on the kube-apiserver container/pod but on the master itself.
Login to Docker VM:
Add the following line to the pod spec:
spec:
containers:
- command:
- kube-apiserver
...
- --service-node-port-range=xxxxx-yyyyy # <-- add this line
...
Save and exit. Pod kube-apiserver will be restarted with new parameters.
Exit Docker VM (for screen: Ctrl-a,k , for container: Ctrl-d )
Check the results:
$ kubectl get pod kube-apiserver-docker-desktop -o yaml -n kube-system | less
Take a look: service-pod-range, changing pod range, changing-nodeport-range.

minikube how to connect from one pod to another using hostnames?

I am running a cluster in default namespace with all the pods in Running state.
I have an issue, I am trying to telnet from one pod to another pod using the pod hostname 'abcd-7988b76669-lgp8l' but I am not able to connect. although it works if I use pods internal ip. Why does the dns is not resolved?
I looked at
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-5lpfd 1/1 Running 0 12h
coredns-6955765f44-9cvnb 1/1 Running 0 12h
Anybody has any idea how to connect from one pod to another using hostname resolution ?
First of all it is worth mentioning that typically you won't connect to individual Pods using their domain names. One good reason for that is their ephemeral nature. Note that typically you don't create plain Pods but controller such as Deployment which manages your Pods and ensures that specific number of Pods of a certain kind is constantly up and running. Pods may be often deleted and recreated hence you should never rely on their domain names in your applications. Typically you will expose them to another apps e.g. running in other Pods via Service.
Although using invididual Pod's domain name is not recommended, it is still possible. You can do it just for fun or learning/experimenting purposes.
As #David already mentioned you would help us much more in providing you a comprehensive answer if you EDIT your question and provide a few important details, showing what you've tried already such as your Pods and Services definitions in yaml format.
Answering literally to your question posted in the title:
minikube how to connect from one pod to another using hostnames?
You won't be able to connect to a Pod using simply its hostname. You can e.g. ping your backend Pods exposed via ClusterIP Service by simply pinging the <service-name> (provided it is in the same namespace as the Pod your pinging from).
Keep in mind however that it doesn't work for Pods - neither Pods names nor their hostnames are resolvable by cluster DNS.
You should be able to connect to an individual Pod using its fully quallified domain name (FQDN) provided you have configured everything properly. Just make sure you didn't overlook any of the steps described here:
Make sure you've created a simple Headless Service which may look like this:
apiVersion: v1
kind: Service
metadata:
name: default-subdomain
spec:
selector:
name: busybox
clusterIP: None
Make sure that your Pods definitions didn't lack any important details:
apiVersion: v1
kind: Pod
metadata:
name: busybox1
labels:
name: busybox
spec:
hostname: busybox-1
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
---
apiVersion: v1
kind: Pod
metadata:
name: busybox2
labels:
name: busybox
spec:
hostname: busybox-2
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
Speaking about important details, pay special attention that you correctly defined hostname and subdomain in Pod specification and that labels used by Pods match the labels used by Service's selector.
Once everything is configured properly you will be able to attach to Pod busybox1 and ping Pod busybox2 by using its FQDN like in the example below:
$ kubectl exec -ti busybox1 -- /bin/sh
/ # ping busybox-2.default-subdomain.default.svc.cluster.local
PING busybox-2.default-subdomain.default.svc.cluster.local (10.16.0.109): 56 data bytes
64 bytes from 10.16.0.109: seq=0 ttl=64 time=0.051 ms
64 bytes from 10.16.0.109: seq=1 ttl=64 time=0.082 ms
64 bytes from 10.16.0.109: seq=2 ttl=64 time=0.081 ms
I hope this helps.

container labels in kubernetes

I am building my docker image with jenkins using:
docker build --build-arg VCS_REF=$GIT_COMMIT \
--build-arg BUILD_DATE=`date -u +"%Y-%m-%dT%H:%M:%SZ"` \
--build-arg BUILD_NUMBER=$BUILD_NUMBER -t $IMAGE_NAME\
I was using Docker but I am migrating to k8.
With docker I could access those labels via:
docker inspect --format "{{ index .Config.Labels \"$label\"}}" $container
How can I access those labels with Kubernetes ?
I am aware about adding those labels in .Metadata.labels of my yaml files but I don't like it that much because
- it links those information to the deployment and not the container itself
- can be modified anytime
...
kubectl describe pods
Thank you
Kubernetes doesn't expose that data. If it did, it would be part of the PodStatus API object (and its embedded ContainerStatus), which is one part of the Pod data that would get dumped out by kubectl get pod deployment-name-12345-abcde -o yaml.
You might consider encoding some of that data in the Docker image tag; for instance, if the CI system is building a tagged commit then use the source control tag name as the image tag, otherwise use a commit hash or sequence number. Another typical path is to use a deployment manager like Helm as the principal source of truth about deployments, and if you do that there can be a path from your CD system to Helm to Kubernetes that can pass along labels or annotations. You can also often set up software to know its own build date and source control commit ID at build time, and then expose that information via an informational-only API (like an HTTP GET /_version call or some such).
I'll add another option.
I would suggest reading about the Recommended Labels by K8S:
Key Description
app.kubernetes.io/name The name of the application
app.kubernetes.io/instance A unique name identifying the instance of an application
app.kubernetes.io/version The current version of the application (e.g., a semantic version, revision hash, etc.)
app.kubernetes.io/component The component within the architecture
app.kubernetes.io/part-of The name of a higher level application this one is part of
app.kubernetes.io/managed-by The tool being used to manage the operation of an application
So you can use the labels to describe a pod:
apiVersion: apps/v1
kind: Pod # Or via Deployment
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
And use the downward api (which works in a similar way to reflection in programming languages).
There are two ways to expose Pod and Container fields to a running Container:
1 ) Environment variables.
2 ) Volume Files.
Below is an example for using volumes files:
apiVersion: v1
kind: Pod
metadata:
name: kubernetes-downwardapi-volume-example
labels:
version: 4.5.6
component: database
part-of: etl-engine
annotations:
build: two
builder: john-doe
spec:
containers:
- name: client-container
image: k8s.gcr.io/busybox
command: ["sh", "-c"]
args: # < ------ We're using the mounted volumes inside the container
- while true; do
if [[ -e /etc/podinfo/labels ]]; then
echo -en '\n\n'; cat /etc/podinfo/labels; fi;
if [[ -e /etc/podinfo/annotations ]]; then
echo -en '\n\n'; cat /etc/podinfo/annotations; fi;
sleep 5;
done;
volumeMounts:
- name: podinfo
mountPath: /etc/podinfo
volumes: # < -------- We're mounting in our example the pod's labels and annotations
- name: podinfo
downwardAPI:
items:
- path: "labels"
fieldRef:
fieldPath: metadata.labels
- path: "annotations"
fieldRef:
fieldPath: metadata.annotations
Notice that in the example we accessed the labels and annotations that were passed and mounted to the /etc/podinfo path.
Beside labels and annotations, the downward API exposed multiple additional options like:
The pod's IP address.
The pod's service account name.
The node's name and IP.
A Container's CPU limit , CPU request , memory limit, memory request.
See full list in here.
(*) A nice blog discussing the downward API.
(**) You can view all your pods labels with
$ kubectl get pods --show-labels
NAME ... LABELS
my-app-xxx-aaa pod-template-hash=...,run=my-app
my-app-xxx-bbb pod-template-hash=...,run=my-app
my-app-xxx-ccc pod-template-hash=...,run=my-app
fluentd-8ft5r app=fluentd,controller-revision-hash=...,pod-template-generation=2
fluentd-fl459 app=fluentd,controller-revision-hash=...,pod-template-generation=2
kibana-xyz-adty4f app=kibana,pod-template-hash=...
recurrent-tasks-executor-xaybyzr-13456 pod-template-hash=...,run=recurrent-tasks-executor
serviceproxy-1356yh6-2mkrw app=serviceproxy,pod-template-hash=...
Or viewing only specific label with $ kubectl get pods -L <label_name>.

Partially Rollout Kubernetes Pods

I have 1 node with 3 pods. I want to rollout a new image in 1 of the three pods and the other 2 pods stay with the old image. Is it possible?
Second question. I tried rolling out a new image that contains error and I already define the maxUnavailable. But kubernetes still rollout all pods. I thought kubernetes will stop rolling out the whole pods, once kubernetes discover an error in the first pod. Do we need to manually stop the rollout?
Here is my deployment script.
# Service setup
apiVersion: v1
kind: Service
metadata:
name: semantic-service
spec:
ports:
- port: 50049
selector:
app: semantic
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: semantic-service
spec:
selector:
matchLabels:
app: semantic
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
template:
metadata:
labels:
app: semantic
spec:
containers:
- name: semantic-service
image: something/semantic-service:v2
As #David Maze wrote in the comment, you can consider using canary where it is possible distinguish deployments of different releases or configurations of the same component with multiple labels and then track these labels and point to different releases, more information about Canary deployments can be find here. Another way how to achieve your goal can be Blue/Green deployment in case if you want to use two different environments identical as possible with a comprehensive way to switch between Blue/Green environments and rollback deployments at any moment of time.
Answering the second question depends on what kind of error a given image contains and how Kubernetes identifies this issue in the Pod, as maxUnavailable: 1 parameter states maximum number of Pods that can be unavailable during update. In the process of Deployment update within a cluster deployment controller creates a new Pod and then delete the old one assuming that the number of available Pods matches rollingUpdate strategy parameters.
Additionally, Kubernetes uses liveness/readiness probes to check whether the Pod is ready (alive) during deployment update and leave the old Pod running until probes have been successful on the new replica. I would suggest checking probes to identify the status of the Pods when deployment tries rolling out updates across you cluster Pods.
Regarding question 1:
I have 1 node with 3 pods. I want to rollout a new image in 1 of the
three pods and the other 2 pods stay with the old image. Is it
possible?
Answer:
Change your maxSurge in your strategy to 0:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 <------ From the 3 replicas - 1 can be unavailable
maxSurge: 0 <------ You can't have more then the 3 replicas of pods at a time
Regarding question 2:
I tried rolling out a new image that contains error and I already
define the maxUnavailable. But kubernetes still rollout all pods. I
thought kubernetes will stop rolling out the whole pods, once
kubernetes discover an error in the first pod. Do we need to manually
stop the rollout?
A ) In order for kubernetes to stop rolling out the whole pods - Use minReadySeconds to specify how much time the pod that was created should be considered ready (use liveness / readiness probes like #Nick_Kh suggested).
If one of the probes had failed before the interval of minReadySeconds has finished then the all rollout will be blocked.
So with a combination of maxSurge = 0 and the setup of minReadySeconds and liveness / readiness probes you can achieve your desired state: 3 pods: 2 with the old image and 1 pod with the new image.
B ) In case of A - you don't need to stop the rollout manually.
But in cases when you will have to do that, you can run:
$ kubectl rollout pause deployment <name>
Debug the non functioning pods and take the relevant action.
If you decide to revert the rollout you can run:
$ kubectl rollout undo deployment <name> --to-revision=1
(View revisions with: $ kubectl rollout history deployment <name>).
Notice that after you paused the rollout you need to resume it with:
$ kubectl rollout resume deployment <name>
Even if you decide to undo and return to previous revision.

Resources