Kubernetes is affecting cpu usage of pods - docker

In my environment, one Kubernetes pod, let's call it P1, is connected to the outside of the cluster via a message-oriented middleware (MOM). The latter is publicly exposed through the following Service:
apiVersion: v1
kind: Service
metadata:
  name: my-mom-svc
spec:
  externalIPs:
  - aaa.bbb.ccc.ddd
  selector:
    app: my-mom
  ports:
  - port: pppp
    name: my-port-name
Clients are outside the k8s cluster and connect to the MOM through this Service. P1 processes the messages that the clients send via the MOM. My goal is to maximize the CPU used by P1.
I defined a LimitRange so that P1 can use all the available CPUs on a worker node.
However, in my test environment it does not use all of them; in fact, the more pods like P1 I create, the less CPU each of them uses (note that there is only one pod like P1 per worker node).
I also tried to define a ResourceQuota with a huge CPU maximum, but the result does not change.
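A LimitRange and ResourceQuota of this kind look roughly as follows (the names, namespace and CPU figures are illustrative, not the exact manifests used here):
apiVersion: v1
kind: LimitRange
metadata:
  name: p1-limit-range        # illustrative name
  namespace: my-namespace     # illustrative namespace
spec:
  limits:
  - type: Container
    max:
      cpu: "16"               # allow a container up to all CPUs of the worker node
    default:
      cpu: "16"               # default limit applied when none is set
    defaultRequest:
      cpu: "1"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: p1-quota              # illustrative name
  namespace: my-namespace
spec:
  hard:
    requests.cpu: "100"       # deliberately huge, as described above
    limits.cpu: "100"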
In desperation I exec'd into the pod and ran 'stress --cpu x', and in that case the pod does use all x CPUs.
I tried the same test using 'raw' Docker containers, i.e. running my environment without Kubernetes and using only Docker containers. In this case the containers use all the available CPUs.
Are there any default Kubernetes limits or behaviors that could be throttling the pod? How can I modify them?
Thanks!

A few things to note:
The fact that you were able to fully stress the CPU with stress --cpu x after logging into your pod's container is evidence that k8s hands the requested resources on the worker node to the pod correctly. So, resource requests and limits are working.
You should consider whether the network traffic that one P1 pod is handling is actually enough to generate high CPU utilisation. Typically, you need to generate a very high amount of network traffic before a service uses a lot of CPU, since such a workload is network-latency centric rather than compute centric.
You describe that when you increase the number of P1 pods, the load per pod decreases; that is because your Service object is doing its job. Service objects are responsible for load balancing incoming traffic evenly across all the pods that serve it. The fact that the CPU load per pod drops is evidence that, with more pods available to serve the incoming traffic, the load is naturally spread across them by the Service abstraction.
When you define a very large number for your resource request/quota, two things can happen:
a. If there is no admission control in your cluster (an application that processes all incoming API requests and performs actions on them, like validation/compliance/security checks), your pod will be stuck in the Pending state, since there will be no Node big enough for the scheduler to fit your pod onto.
b. If an admission controller is set up, it will try to enforce a maximum allowable quota by overriding the value in your manifest before the object is persisted by the API server. So even if you request 10 vCPUs, if the admission controller has a rule that doesn't allow more than 2 vCPUs, it will be changed to 2 by the controller. You can verify this by printing your Pod and inspecting its resource fields; if they are the same as the ones you specified when applying, you probably do not have such an admission controller in your cluster.
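For example, the fields to compare are the per-container resources in the Pod spec. If you applied something like the fragment below but kubectl get pod -o yaml shows smaller values, an admission controller has mutated them (names and values are illustrative):
spec:
  containers:
  - name: p1
    image: my-p1-image          # illustrative image name
    resources:
      requests:
        cpu: "10"               # what you specified when applying
      limits:
        cpu: "10"               # compare these fields in the live Pod object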
I would suggest a better way to approach the problem: test your Pod with a realistic maximum amount of traffic that you expect on one node and record the CPU and memory usage. Then, instead of trying to get the Pod to use more CPU, resize your node to a smaller one; this way your pod will have less CPU available and hence better utilisation :)
This is a very common design pattern (especially for scenarios like yours with 1 pod per worker node). It allows light-weight, easy scale-out architectures that perform really well together with node autoscaling.

Related

Scheduling and scaling pods in kubernetes

I am running a k8s cluster on GKE.
It has 4 node pools with different configurations:
Node pool : 1 (Single node, cordoned status)
Running Redis & RabbitMQ
Node pool : 2 (Single node, cordoned status)
Running Monitoring & Prometheus
Node pool : 3 (Single large node)
Application pods
Node pool : 4 (Single node with auto-scaling enabled)
Application pods
Currently, I am running a single replica of each service on GKE,
except for 3 replicas of the main service, which manages almost everything.
When scaling this main service with the HPA, I have sometimes seen the node crash, or the kubelet restart frequently and the pods go into an Unknown state.
How should I handle this scenario? If the node crashes, GKE takes time to auto-repair it, which causes service downtime.
Question : 2
Node pools 3-4 run the application pods. Inside the application there are 3-4 memory-intensive microservices, and I am thinking of using a node selector to pin them to one node,
while a small node pool runs only the main service, with the HPA and node autoscaling working for that node pool.
However, I feel a node selector is not the best way to do it.
It's always best to run more than one replica of each service, but currently we are running only a single replica of each service, so please take that into account in your suggestions.
As Patrick W rightly suggested in his comment:
if you have a single node, you leave yourself with a single point of failure. Also keep in mind that autoscaling takes time to kick in and is based on resource requests. If your node suffers OOM because of memory intensive workloads, you need to readjust your memory requests and limits – Patrick W, Oct 10
You may need to redesign your infrastructure a bit so that you have more than a single node in every node pool, as well as readjust memory requests and limits.
You may want to take a look at the following sections in the official kubernetes docs and Google Cloud blog:
Managing Resources for Containers
Assign CPU Resources to Containers and Pods
Configure Default Memory Requests and Limits for a Namespace
Resource Quotas
Kubernetes best practices: Resource requests and limits
How should I handle this scenario? If the node crashes, GKE takes time to auto-repair it, which causes service downtime.
That's why having more than just one node in a single node pool can be a much better option. It greatly reduces the likelihood that you'll end up in the situation described above. GKE's auto-repair feature needs to take its time (usually a few minutes), and if this is your only node, you cannot do much about it and have to accept possible downtime.
Node pools 3-4 run the application pods. Inside the application there are 3-4 memory-intensive microservices, and I am thinking of using a node selector to pin them to one node,
while a small node pool runs only the main service, with the HPA and node autoscaling working for that node pool.
However, I feel a node selector is not the best way to do it.
You may also take a look at node affinity and anti-affinity as well as taints and tolerations.
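For illustration, a minimal sketch combining a node-affinity rule with a toleration; the label key/values and the taint are assumptions, not something taken from your cluster:
apiVersion: v1
kind: Pod
metadata:
  name: memory-heavy-service
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: pool                  # illustrative node label
            operator: In
            values:
            - big-memory-pool
  tolerations:
  - key: dedicated                     # illustrative taint applied to those nodes
    operator: Equal
    value: memory-intensive
    effect: NoSchedule
  containers:
  - name: app
    image: my-app:latest               # illustrative image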

Do the nodes of a Kubernetes cluster share memory

We want to deploy an application that uses an in-memory cache, using Docker and Kubernetes with horizontal pod autoscaling, but we have no idea whether the containerized application inside the pods would use the same cache, since it is not guaranteed that the pods will end up on the same node when scaled by the autoscaler.
I've tried searching for information about cache memory in Kubernetes clusters, and all I found was a statement in a Medium article that states
the CPU and RAM resources of all nodes are effectively pooled and managed by the cluster
and a sentence in a Mirantis blog
Containers in a Pod share the same IPC namespace, which means they can also communicate with each other using standard inter-process communications such as SystemV semaphores or POSIX shared memory.
But I can't find anything about pods on different nodes having access to the same cache. And these are all third-party sites, not the official Kubernetes site.
I'm expecting the cache to be shared between all pods in all nodes, but I just want confirmation regarding the matter.
No, separate pods do not generally share anything, even if they run on the same physical node. There are ways around this if you are very, very careful and fancy, but the idea is for pods to be independent of each other anyway. Within a single pod it's easier: containers can use normal shared memory (shmem), but this is pretty rare since there usually isn't much reason to do it.
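If you do need shared memory between containers of the same pod, one common pattern is a memory-backed emptyDir volume mounted into both containers; a minimal sketch with illustrative names:
apiVersion: v1
kind: Pod
metadata:
  name: shared-cache-demo
spec:
  volumes:
  - name: shm
    emptyDir:
      medium: Memory                   # tmpfs in RAM, shared by both containers
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /dev/shm/msg && sleep 3600"]
    volumeMounts:
    - name: shm
      mountPath: /dev/shm
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 5 && cat /dev/shm/msg && sleep 3600"]
    volumeMounts:
    - name: shm
      mountPath: /dev/shm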

Keep a Kubernetes Pod in service when in a state of not ready

I am working on a project currently migrating a legacy application towards becoming cloud-compliant. We are using Kubernetes, OpenShift and Docker for this. The application has one particular type of "back-end pod" (let's call it BEP) whose responsibility is to process incoming transactions. In this pod we have several interdependent containers, but only one container which actually does the "real processing" (call it BEC). This legacy application processes several thousand transactions/sec and will need to continue to do so in the cloud.
To achieve this scale, we were thinking of duplicating the BEC inside the pod instead of replicating the BEP (and thus also replicating all the other unnecessary containers that come along with it). We might need X replicas of this BEC, whereas we would not need to scale its interdependent containers at all. It would thus be wasteful to run X replicas of the BEP instead.
However, this solution poses a problem. Once one BEC is down, the entire pod is flagged as "Not Ready" by Kubernetes (even if there are 100 other BECs which are up and ready to process), upon which the pod's endpoint is removed, cutting off traffic to the entire pod.
I guess this is a classical example of defining some sort of "minimum running requirement" for the pod.
I thus have two questions:
Is there a way to flag a pod as still functioning even if not all of its containers are in a "ready" state? I.e., achieving this minimum running requirement by defining a lower threshold on the number of containers that must be "ready" for the pod to be considered functioning?
Is there a way to flag the service that the pod provides to still send traffic even if the pod is not in a ready state? I have seen a property called publishNotReadyAddresses (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#servicespec-v1-core#publishNotReadyAddresses), but I am unsure whether it does what we require.
If the answer to both of these questions is no: do you have any idea or approach for this problem, short of a major architectural refactoring of this legacy application? We cannot split the interdependent containers from the BEC; they need to run in the same pod... unfortunately.
Thanks in advance for any help/advice!
/Alex
Is there a way to flag a pod as still functioning even if not all of its containers are in a "ready" state? I.e., achieving this minimum running requirement by defining a lower threshold on the number of containers that must be "ready" for the pod to be considered functioning?
No, this is not possible.
Is there a way to flag the service that the pod provides to still send traffic even if the pod is not in a ready state? I have seen a property called publishNotReadyAddresses (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#servicespec-v1-core#publishNotReadyAddresses), but I am unsure whether it does what we require.
You can use annotations:
metadata:
  name: name
  labels:
    app: app
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
But the community is already working on this issue/bug; you can follow #58662 and #49239.
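On newer API versions the field mentioned in the question can also be set directly on the Service spec instead of the annotation; a minimal sketch (the name, selector and port are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: bep-svc                        # placeholder name
spec:
  publishNotReadyAddresses: true       # keep endpoints even when the pod is not Ready
  selector:
    app: bep                           # placeholder selector
  ports:
  - port: 8080                         # placeholder port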

Kubernetes scaling pods using custom algorithm

Our cloud application consists of 3 tightly coupled Docker containers: Nginx, Web and Mongo. Currently we run these containers on a single machine. However, as our users are increasing we are looking for a solution to scale. Using Kubernetes we would form a multi-container pod. If we are to replicate, we need to replicate all 3 containers as a unit. Our cloud application is consumed by mobile app users. Our app can only handle approx. 30000 users per worker node, and we intend to place a single pod on a single worker node. Once a mobile device is connected to a worker node, it must continue to use only that machine (unique IP address).
We plan on using Kubernetes to manage the containers. Load balancing doesn't work for our use case as a mobile device needs to be tied to a single machine once assigned and each Pod works independently with its own persistent volume. However we need a way of spinning up new Pods on worker nodes if the number of users goes over 30000 and so on.
The idea is we have some sort of custom scheduler which assigns a mobile device a Worker Node ( domain/ IPaddress) depending on the number of users on that node.
Is Kubernetes a good fit for this design, and how could we implement a custom pod scaling algorithm?
Thanks
Piggy-backing on Jonah Benton's answer:
While this is technically possible, your problem is not with Kubernetes, it's with your application! Let me point out the problems:
Our cloud application consists of 3 tightly coupled Docker containers, Nginx, Web, and Mongo.
Here is your first problem: if you can only deploy these three containers together and not independently, you cannot scale one without the others!
While MongoDB can be scaled to handle insane loads, if it's bundled with your web server and web application, it won't be able to...
So the first step for you is to break up these three components so they can be managed independently of each other. Next:
Currently we run these containers on a single machine.
While not strictly a problem in itself, I have serious doubts about what it would mean to scale your application and about the challenges that come with scalability!
Once a mobile device is connected to a worker node, it must continue to use only that machine (unique IP address).
Now, this IS a problem. You're looking to run an application on Kubernetes, but I do not think you understand the consequences of doing that: Kubernetes orchestrates your resources. This means it will move pods (by killing and recreating them) between nodes (and, if necessary, onto the same node). It does this fully autonomously (which is awesome and gives you a good night's sleep). If you're relying on clients sticking to a single node's IP, you're going to be woken up in the middle of the night because Kubernetes tried to correct for a node failure and moved your pod, which is now gone, and your users can't connect anymore. You need to leverage the load-balancing features (Services) in Kubernetes. Only they are able to handle the dynamic changes that happen in Kubernetes clusters.
Using Kubernetes we would form a multi-container pod.
And we have another winner: no! You're trying to treat Kubernetes as if it were your on-premise infrastructure! If you keep doing so, you're going to fail and curse Kubernetes in the process!
Now that I have told you some of the things you're getting wrong, what kind of person would I be if I did not offer some advice on how to make this work:
In Kubernetes your three applications should not run in one pod! They should run in separate pods:
Your web server's work should be done by an Ingress, and since you're already familiar with nginx, this is probably the ingress controller you are looking for!
Your web application should be a simple Deployment, exposed to the Ingress through a Service (see the sketch after this list).
Your database should be a separate deployment, which you can manage either manually through a StatefulSet or (more advanced) through an operator, and also expose to the web application through a Service.
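A minimal sketch of what the web-application part could look like as a Deployment exposed through a Service; the names, image and ports are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: my-registry/web:1.0     # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080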
Feel free to ask if you have any more questions!
Building a custom scheduler and running multiple schedulers at the same time is supported:
https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
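For reference, a pod opts into a custom scheduler through the schedulerName field in its spec; a minimal sketch, where the scheduler name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  schedulerName: my-custom-scheduler   # must match the name your scheduler registers with
  containers:
  - name: app
    image: my-app:latest               # placeholder image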
That said, to the question of whether Kubernetes is a good fit for this design, my answer is: not really.
K8s can be difficult to operate, with the payoff being the level of automation and resiliency that it provides out of the box for whole classes of workloads.
This workload is not one of those. In order to gain any benefit you would have to write a scheduler to handle the edge failure and error cases this application has (what happens when you lose a node for a short period of time...) in a way that makes sense for k8s. And you would have to come up to speed with normal k8s operations.
With the information provided, I am hard pressed to see why one would use k8s for this workload over just running Docker on some VMs and scripting some of the automation.

How many containers should exist per host in production? How should services be split?

I'm trying to understand the benefits of Docker better and I am not really understanding how it would work in production.
Let's say I have a web frontend, a rest api backend and a db. That makes 3 containers.
Let's say that I want 3 of the front end, 5 of the backend and 7 of the db. (Minor question: does it ever make sense to have fewer DBs than backend servers?)
Now, given the above scenario, if I package them all on the same host then I gain the benefit of efficiently using the resources of the host, but then I am DOA when that machine fails or has a network partition.
If I separate them into 1 full application (ie 1 FE, 1 BE & 1 DB) per host, and put extra containers on their own host, I get some advantages of using resources efficiently, but it seems to me that I still lose significantly when I have a network partition since it will take down multiple services.
Hence I'm almost leaning toward the conclusion that I should be putting 1 container per host, but then that means I am using my resources pretty inefficiently, and then what are the benefits of containers in production? I mean, an OS might add an extra couple of gigs per machine in storage size, but most cloud providers give you a minimum of 10 gigs of storage. And let's face it, a REST API backend or a web front end is not gonna even come close to the 10 gigs... even including the OS.
So, after all that, I'm trying to figure out whether I'm missing the point of containers. Are the benefits of keeping all containers of an application on 1 host mostly tied to testing and development?
I know there are benefits from moving containers amongst different providers/machines easily, but for the most part, I don't see that as a huge gain personally since that was doable with images...
Are there any other benefits for containers in production that I am missing? Or are the main benefits for testing and development? (Am I thinking about containers in production wrong)?
Note: The question is very broad and could fill an entire book but I'll shed some light.
Benefits of containers
The exciting part about containers is not their use on a single host, but their use across hosts connected in a large cluster. Do not look at your machines as independent Docker hosts, but as a pool of resources to host your containers.
Containers alone are not ground-breaking (i.e. Docker's CTO stating at the last DockerCon that "nobody cares about containers"), but coupled with state-of-the-art schedulers and container orchestration frameworks, they become a very powerful abstraction for handling production-grade software.
As to the argument that this also applies to virtual machines: yes, it does, but containers have some technical advantages (see: How is Docker different from a normal virtual machine) over VMs that make them more convenient to use.
On a Single host
On a single host, the benefits you can get from containers are (amongst many others):
Use as a development environment mimicking the behavior on a real production cluster.
Reproducible builds independent of the host (convenient for sharing)
Testing new software without bloating your machine with packages you won't use daily.
Extending from a single host to a pool of machines (cluster)
When time comes to manage a production cluster, there are two approaches:
Create a couple of Docker hosts and run/connect containers together "manually" through scripts or using solutions like docker-compose. Monitoring the lifetime of your services/containers is your responsibility, and you should be prepared to handle service downtime.
Let a container orchestrator deal with everything and monitor the lifetime of your services to better cope with failures.
There are plenty of container orchestrators: Kubernetes, Swarm, Mesos, Nomad, Cloud Foundry, and probably many others. They power many large-scale companies and infrastructures, like eBay, so they surely found a benefit in using them.
Pick the right replication strategy
A container is best used as a disposable resource, meaning you can stop and restart the DB independently and it shouldn't impact the backend (other than throwing an error because the DB is down). As such, you should be able to handle any kind of network partition as long as your services are properly replicated across several hosts.
You need to pick a proper replication strategy, to make sure your service stays up and running. You can for example replicate your DB across Cloud provider Availability Zones so that when an entire zone goes down, your data remains available.
Using Kubernetes for example, you can put each of your containers (1 FE, 1 BE & 1 DB) in a pod. Kubernetes will deal with replicating this pod on many hosts and will monitor that these pods are always up and running; if not, a new pod will be created to cope with the failure.
If you want to mitigate the effect of network partitions, specify node affinities, hinting the scheduler to place containers on the same subset of machines and replicate on an appropriate number of hosts.
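As one illustration of such a placement hint, a pod anti-affinity rule that spreads the replicas of a component across availability zones could look like this; the labels and image are assumptions:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
spec:
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: db
            topologyKey: topology.kubernetes.io/zone   # at most one replica per zone
      containers:
      - name: db
        image: my-db:latest            # placeholder image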
How many containers per host?
It really depends on the number of machines you use and the resources they have.
The rule is that you shouldn't bloat a host with too many containers if you don't specify any resource constraints (in terms of CPU or memory). Otherwise, you risk compromising the host and exhausting its resources, which in turn will impact all the other services on the machine. A good replication strategy is not only important at the single-service level, but also to ensure good health for the pool of services that share a host.
Resource constraints should be set depending on the type of workload: a DB will probably use more resources than your front-end container, so you should size accordingly.
As an example, using Swarm, you can explicitly specify the number of CPUs or the amount of memory you need for a given service (see the docker service documentation). There are many possibilities, and you can also give upper/lower bounds in terms of CPU or memory usage. Depending on the values chosen, the scheduler will pin the service to the right machine with available resources.
Kubernetes works pretty much the same way and you can specify limits for your pods (See documentation).
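A minimal sketch of such per-container requests and limits in Kubernetes (the values are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: db
    image: my-db:latest                # placeholder image
    resources:
      requests:
        cpu: "500m"                    # used by the scheduler to pick a node
        memory: "1Gi"
      limits:
        cpu: "2"                       # hard ceilings enforced at runtime
        memory: "2Gi"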
Mesos has more fine-grained resource management policies, with frameworks (for specific workloads like Hadoop, Spark, and many more) and over-committing capabilities. Mesos is especially convenient for Big Data kinds of workloads.
How should services be split?
It really depends on the orchestration solution:
In Docker Swarm, you would create a service for each component (FE, BE, DB) and set the desired replication number for each service.
In Kubernetes, you can either create a pod encompassing the entire application (FE, BE, DB and the volume attached to the DB) or create separate pods for the FE, BE, DB+volume.
Generally: use one service per type of container. Regarding groups of containers, evaluate whether it is more convenient to scale the entire group of containers (as an atomic unit, i.e. a pod) than to manage them separately.
Sum up
Containers are better used with an orchestration framework/platform. There are plenty of available solutions to deal with container scheduling and resource management. Pick one that might fit your use case, and learn how to use it. Always pick an appropriate replication strategy, keeping in mind possible failure modes. Specify resource constraints for your containers/services when possible to avoid resource exhaustion which could potentially lead to bringing a host down.
This depends on the type of application you run in your containers. Off the top of my head, I can think of a couple of different ways to look at this:
is your application disk-space heavy?
do you need the application to be failsafe across multiple machines?
can you run multiple instances of different applications on the same host without decreasing their performance?
do you use software like Kubernetes or Swarm to handle your machines?
I think most of these questions are interesting to answer even without containers. Containers might free you from thinking about single hosts, but you still have to decide and measure the load of your host machines yourself.
Minor question: does it ever make sense to have fewer DBs than backend servers?
Yes.
Consider cases where you issue normal SQL select statements (without many joins) to get data from the database, but your business logic demands a lot of computation. In those cases you might consider keeping your back-end service count high and your database service count low.
It all depends on the use case being solved.
The number of containers per host depends on the design ratio of the host and the workload ratio of the containers. Both are throughput/capacity ratios. In the old days this was called E/B, for execution/bandwidth: execution was CPU, bandwidth was I/O, and solutions were said to be CPU-bound or I/O-bound.
Today memories are very large, so the critical factor is usually CPU/nest capacity. We describe workloads as CPU-intense or nest-intense. A useful proxy for nest capacity is the size of the highest-level cache. A useful design-ratio estimator is (clock x cores)/cache. For the same core count, the machine with the lower design ratio will hold more containers, in part because the machine with more cache will scale better and see less saturation at higher utilization.
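As a rough illustration of the estimator with made-up numbers: a 3.0 GHz, 16-core host with a 32 MB last-level cache has a design ratio of (3.0 x 16)/32 = 1.5, while a 2.5 GHz, 16-core host with a 64 MB cache comes out at (2.5 x 16)/64 ≈ 0.6; by this heuristic the second machine would be expected to hold more containers before saturating.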
