What is a simple explanation of resource allocation and its definitions in Kubernetes?
What does it mean to allocate "1000m" CPU units and 1024Mi of memory?
(tried to write it in simpler language than the official docs)
CPU
In Kubernetes, CPU is allocated in units of one "millicore": one virtual core (on a virtual machine) can be divided into 1000 shares of 1 millicore each. Allocating 1000 millicores (written 1000m) gives a pod one full CPU. Allocating more only helps if the code in the pod is able to utilize more than one core.
Memory
Very simple: each megabyte you allocate is reserved for the pod. Note that the Mi suffix means mebibytes (2^20 bytes), so 1024Mi is exactly 1 GiB.
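To make the units concrete, here is a minimal sketch that converts quantity strings such as "1000m" and "1024Mi" into cores and bytes. The function names parse_cpu and parse_memory are made up for this example, and only a subset of the suffixes Kubernetes accepts is handled.

```python
from decimal import Decimal

# Subset of the suffixes Kubernetes accepts for memory quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu(quantity: str) -> Decimal:
    """Return CPU cores for a string such as '1000m' or '0.5' (hypothetical helper)."""
    if quantity.endswith("m"):
        # 'm' means millicores: 1000m == 1 core.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes for a string such as '1024Mi' or '1G' (hypothetical helper)."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain number of bytes


print(parse_cpu("1000m"))      # one full core
print(parse_memory("1024Mi"))  # 1 GiB expressed in bytes
```

So "1000m" is one whole core, "500m" is half a core, and "1024Mi" is 1 GiB.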
Requests
Minimal resources that are guaranteed to be given to the pod. If there are not enough resources to start a pod on any node it will remain in "Pending" state.
Limits
CPU limit: causes the pod to be throttled when it hits the limit.
Memory limit: when a pod has used all of its memory and asks for more than the limit, it is treated as if it had a memory leak: the container is killed (OOMKilled) and the pod gets restarted.
Target (defined in the Horizontal Pod Autoscaler)
Can be applied to CPU, memory and other custom metrics (which are more complicated to define).
It might be a good idea to set resources for a pod in sizes A, B and C where A < B < C, with requests = A, target = B and limits = C.
Just remember that on a fully loaded node, pods might never reach their "target", and so they will never trigger a scale-up.
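As a rough illustration of how the target drives scaling, the sketch below applies the basic formula from the Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas * current metric / target metric). The function name and the sample values (request A = 200m, target B = 400m, limit C = 600m) are assumptions for the example; the real controller also applies a tolerance, stabilization windows and min/max replica bounds, which are ignored here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Basic HPA scaling formula, without tolerance or min/max bounds (sketch)."""
    return math.ceil(current_replicas * current_value / target_value)


# Assumed sizes in millicores: request A = 200, target B = 400, limit C = 600.
# Two replicas averaging 600m each are well above the 400m target.
print(desired_replicas(2, current_value=600, target_value=400))

# Four replicas averaging 200m each are below target, so the HPA would scale in.
print(desired_replicas(4, current_value=200, target_value=400))
```

If a full node keeps usage pinned below B, the ratio never exceeds 1 and the replica count never grows, which is the pitfall described above.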
Related
In my environment, one Kubernetes pod, let's call it P1, is connected outside the cluster via a message-oriented middleware (MOM). The latter is publicly exposed through the following Service:
apiVersion: v1
kind: Service
metadata:
  name: my-mom-svc
spec:
  externalIPs:
  - aaa.bbb.ccc.ddd
  selector:
    app: my-mom
  ports:
  - port: pppp
    name: my-port-name
Clients are outside the k8s cluster and connect to the MOM thanks to this service. P1 processes messages coming from the MOM and sent by the clients. My goal is to maximize the CPU used by P1.
I defined a LimitRange so that P1 can use all the available CPUs on a worker node.
However, in my test environment it does not use all of them. Indeed, the more pods like P1 I create, the less CPU each of them uses (note that there is only one pod like P1 per worker node).
I tried to define a ResourceQuota with a huge max CPU number, but the result does not change.
In desperation I entered the pod and executed the command 'stress --cpu x', and there the pod does use all x CPUs.
I tried the same test using 'raw' Docker containers, that is, running my environment without Kubernetes and only using Docker containers. In this case the containers use all the available CPUs.
Are there any default Kubernetes limitations or behaviors limiting something? How can I modify them?
Thanks!
A few things to note:
The fact that you were able to stress the CPU fully with stress --cpu x after logging into your pod's container is evidence that k8s is functioning correctly when a pod requests resources on the worker node. So, resource requests and limits are functional.
You should consider whether the network traffic that one P1 pod is handling is actually enough to generate high CPU utilisation. Typically, you need to generate a VERY HIGH amount of network traffic to get a service to utilize a lot of CPU, since such a workload is network-latency-centric and not compute-centric.
You describe that when you increase the number of P1 pods, the load per pod decreases. That is because your Service object is doing a great job. Service objects are responsible for load balancing incoming traffic across all the pods that are serving the traffic. The fact that the CPU load per pod goes down is evidence that, since there are more pods to serve the incoming traffic, the load is naturally distributed across them by the Service abstraction.
When you define a very large number for your request quota two things can happen:
a. If there is no admission control in your cluster (an application that processes all the incoming API requests and performs actions on it, like validation/compliance/security checks), your pod will be stuck in Pending state, since there will be no Node big enough for the scheduler to be able to fit your pod.
b. If there is an admission controller set up, it will try to enforce a maximum allowable quota by overriding the value in your manifest while the API server processes it. So, even if you specify 10 in your vCPU request, if the admission controller has a rule which doesn't allow more than 2 vCPUs, the value will be changed to 2 by the controller. You can verify whether this is the case by printing your Pod and looking at the resource fields: if they are the same as the ones you specified when applying, you might not have such an admission controller in your cluster.
I would suggest that a better way to approach the problem is to test your Pod with a reasonable/realistic maximum amount of traffic that you expect on 1 node and then record the CPU usage and memory usage. Then, instead of attempting to get the Pod to use more CPU, you can resize your node to a smaller one; this way, your pod will have less CPU available and hence better utilisation :)
This is a very common design pattern (especially for scenarios like yours where you have 1 pod per worker node). It allows you to have lightweight, easy scale-out architectures which can perform really well along with autoscaling of nodes.
If my pod is exceeding the memory requested (but is still under the limit), is more memory available at runtime on request, or does the pod need to restart to allocate more memory?
Memory up to the limit is available to the pod (as long as the host has enough free) at all times. The "request" amount is used for scheduling, and the limit is used for actually restricting the pod to an amount. I would caution against setting too high of a gap between request and limit, as it can result in the node itself exhausting memory.
There is no need to restart the pod in order to allocate more memory. This happens automatically. However, you need to remember two values: request and limit. The request is used when the pod is scheduled: the node must have enough unreserved memory (the amount specified by the request) for the pod to be placed on it. It is also the minimum amount of memory that must be available for a given pod. The limit, on the other hand, is the maximum amount of memory that a given pod can use. The memory between the request and the limit is allocated based on the resources available on the node. You can read more about it here.
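A small sketch of why a large gap between request and limit can exhaust a node: scheduling only looks at the sum of requests, while actual usage may grow up to the sum of limits. The function names and the numbers (a node with 4096 MiB allocatable, two pods with 1024 MiB requested and a 3072 MiB limit) are assumptions for illustration.

```python
def fits_on_node(pods, allocatable_mib):
    """Scheduling view: only the sum of memory requests has to fit (sketch)."""
    return sum(request for request, _limit in pods) <= allocatable_mib


def limit_overcommit_mib(pods, allocatable_mib):
    """How far the sum of memory limits exceeds what the node can provide."""
    return max(0, sum(limit for _request, limit in pods) - allocatable_mib)


# Each pod is (request_mib, limit_mib); values are made up for the example.
pods = [(1024, 3072), (1024, 3072)]

print(fits_on_node(pods, allocatable_mib=4096))          # both pods get scheduled
print(limit_overcommit_mib(pods, allocatable_mib=4096))  # MiB that cannot be honoured if both burst
```

Both pods are scheduled, yet if each grows to its limit the node is short by 2048 MiB, and something will have to be killed.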
What do I get by running multiple nodes on a single host? I am not getting availability, because if the host is down, the whole cluster goes with it. Does it make sense regarding performance? Doesn't one instance of ES take as many resources from the host as it needs?
Generally no, but if you have machines with ridiculous amounts of CPU and memory, you might want that to properly utilize the available resources. Avoiding big heaps with Elasticsearch is a good thing generally since garbage collection on bigger heaps can become a problem and in any case above 32 GB you lose the benefit of pointer compression. Mostly you should not need big heaps with ES. Most of the memory that ES uses is through memory mapped files, which relies on the OS cache. So just because you aren't assigning memory to the heap doesn't mean it is not being used: more memory available for caching means you'll be able to handle bigger shards or more shards.
So if you run more nodes, that advantage goes away and you waste memory on redundant heaps, and you'll have nodes competing for resources. Mostly, you should base these decisions on actual memory, cache, and cpu usage of course.
It depends on your host and how you configure your nodes.
For example, Elastic recommends allocating up to 32GB of RAM to Elasticsearch (because of how Java compresses pointers) and leaving another 32GB for the operating system (mostly for disk caching).
Assuming you have more than 64GB of RAM on your host, let's say 128GB, it makes sense to have two nodes running on the same machine, each configured with 32GB of RAM, leaving another 64GB for the operating system.
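The rule of thumb in this answer can be written down as a small calculation. This is only a sketch of that rule (at most 32GB per node, and as much again left for the OS cache); the function name plan_nodes is made up, and as the other answer points out, real decisions should be based on measured memory, cache and CPU usage.

```python
def plan_nodes(host_ram_gb: int, max_heap_gb: int = 32):
    """Return (nodes, heap_gb_per_node, os_cache_gb) under the 50/50 rule of thumb."""
    # One node per (heap + equal OS cache) slice of RAM, but always at least one node.
    nodes = max(1, host_ram_gb // (2 * max_heap_gb))
    # Never give a node more than half of its slice, nor more than the cap.
    heap = min(max_heap_gb, host_ram_gb // (2 * nodes))
    os_cache = host_ram_gb - nodes * heap
    return nodes, heap, os_cache


print(plan_nodes(128))  # the 128GB host from the example above
print(plan_nodes(64))
print(plan_nodes(16))
```

For the 128GB host this gives two nodes with 32GB each and 64GB left for the operating system, matching the example.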
I am digging into Kubernetes resource restrictions and have a hard time understanding what CPU limits are for. I know Kubernetes passes requests and limits down to the (in my case) Docker runtime.
Example: I have 1 Node with 1 CPU and 2 Pods, each with CPU requests: 500m and limits: 800m. In Docker, this results in (500m -> 0.5 * 1024 = 512) --cpu-shares=512 and (800m -> 800 * 100) --cpu-quota=80000. The pods get placed by the Kubernetes scheduler because the sum of the requests does not exceed 100% of the node's capacity; in terms of limits the node is overcommitted.
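The arithmetic in this example can be sketched as follows. The function names are made up, and the code only mirrors the conversion described above (1024 shares per core, quota in microseconds per 100ms period); any minimum values the kubelet may enforce are not modelled.

```python
CFS_PERIOD_US = 100_000  # default CFS period: 100ms, expressed in microseconds


def cpu_shares(request_millicores: int) -> int:
    """CPU request -> --cpu-shares, with 1024 shares per full core (sketch)."""
    return request_millicores * 1024 // 1000


def cpu_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU limit -> --cpu-quota, microseconds of CPU time per period (sketch)."""
    return limit_millicores * period_us // 1000


print(cpu_shares(500))    # requests: 500m
print(cpu_quota_us(800))  # limits: 800m
```

With requests of 500m and limits of 800m this yields 512 shares and a quota of 80000 microseconds, the values quoted above.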
The above allows each container to get 80ms of CPU time per 100ms period (the default). As soon as CPU usage reaches 100%, the CPU time is shared between the containers based on their weight, expressed in CPU shares. That would be 50% for each container, given the base value of 1024 and a share of 512 for each. At this point, in my understanding, the limits have no more relevance because neither container can get its 80ms anymore. They would both get 50ms. So no matter what limits I define, when usage reaches the critical 100%, it's partitioned by requests anyway.
This makes me wonder: why should I define CPU limits in the first place, and does overcommitment make any difference at all? Requests, on the other hand, in the sense of "how much share do I get when everything is in use", are completely understandable.
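To check this reasoning, here is a toy model of one 100ms period on a single CPU. It hands out time in proportion to shares, caps each container at its quota, and redistributes whatever a capped container could not take. It assumes every listed container is busy the whole time and is not how the kernel actually schedules; the function name is made up.

```python
def cpu_time_per_period(containers, capacity_ms=100.0):
    """containers: list of (cpu_shares, quota_ms) for busy containers.

    Returns the CPU milliseconds each one gets in one period (toy model).
    """
    alloc = [0.0] * len(containers)
    active = set(range(len(containers)))
    remaining = float(capacity_ms)
    while active and remaining > 1e-9:
        total_shares = sum(containers[i][0] for i in active)
        capped = set()
        handed_out = 0.0
        for i in active:
            shares, quota_ms = containers[i]
            offer = remaining * shares / total_shares  # weight-based share
            room = quota_ms - alloc[i]                 # what the limit still allows
            grant = min(offer, room)
            alloc[i] += grant
            handed_out += grant
            if grant >= room - 1e-9:
                capped.add(i)                          # this container hit its quota
        remaining -= handed_out
        if not capped:
            break
        active -= capped                               # redistribute among the rest
    return alloc


print(cpu_time_per_period([(512, 80), (512, 80)]))  # both containers busy
print(cpu_time_per_period([(512, 80)]))             # the other container is idle
```

With both containers busy each gets 50ms and the limit never bites. With only one busy, it stops at 80ms even though 20ms of CPU sit idle, which is the one place the limit shows up in this model.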
One reason to set CPU limits is that, if you set CPU request == limit and memory request == limit, your pod is assigned a Quality of Service class = Guaranteed, which makes it less likely to be OOMKilled if the node runs out of memory. Here I quote from the Kubernetes doc Configure Quality of Service for Pods:
For a Pod to be given a QoS class of Guaranteed:
Every Container in the Pod must have a memory limit and a memory request, and they must be the same.
Every Container in the Pod must have a CPU limit and a CPU request, and they must be the same.
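The rules can be sketched as a small classifier. This is a simplified illustration, not the actual Kubernetes code: the function name is made up, quantities are compared as plain strings (so "1" and "1000m" would count as different), and a missing request is treated as equal to the limit, which is how Kubernetes defaults it.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' mappings."""
    def get(container, kind, resource):
        return container.get(kind, {}).get(resource)

    # BestEffort: no container sets any CPU or memory request or limit.
    anything_set = any(
        get(c, kind, res) is not None
        for c in containers for kind in ("requests", "limits") for res in RESOURCES
    )
    if not anything_set:
        return "BestEffort"

    # Guaranteed: every container has CPU and memory limits, and each request
    # equals its limit (an omitted request defaults to the limit).
    def matches(container, res):
        limit = get(container, "limits", res)
        request = get(container, "requests", res)
        return limit is not None and (request is None or request == limit)

    if all(matches(c, res) for c in containers for res in RESOURCES):
        return "Guaranteed"
    return "Burstable"


pod = [{"requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "500m", "memory": "1Gi"}}]
print(qos_class(pod))
```

The pod from the question above (requests 500m, limits 800m) comes out as Burstable under these rules.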
Another benefit of using the Guaranteed QoS class is that it allows you to lock exclusive CPUs for the pod, which is critical for certain kinds of low-latency programs. Quote from Control CPU Management Policies:
The static CPU management policy allows containers in Guaranteed pods with integer CPU requests access to exclusive CPUs on the node. ... Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs.
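The condition in that quote boils down to two checks, sketched below. The function name is made up, the QoS class is passed in as a plain string, and the sketch assumes the static policy has already been enabled on the node.

```python
def gets_exclusive_cpus(qos: str, cpu_request_millicores: int) -> bool:
    """True if a container would qualify for exclusive CPUs under the static policy (sketch)."""
    # An integer CPU request means a whole number of cores: 1000m, 2000m, ...
    is_integer = cpu_request_millicores > 0 and cpu_request_millicores % 1000 == 0
    return qos == "Guaranteed" and is_integer


print(gets_exclusive_cpus("Guaranteed", 2000))  # two whole cores
print(gets_exclusive_cpus("Guaranteed", 1500))  # fractional request
print(gets_exclusive_cpus("Burstable", 2000))   # wrong QoS class
```

Only the first case qualifies; a fractional request or a non-Guaranteed pod stays in the shared pool.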
According to the Motivation for CPU Requests and Limits section of the Assign CPU Resources to Containers and Pods Kubernetes walkthrough:
By having a CPU limit that is greater than the CPU request, you accomplish two things:
The Pod can have bursts of activity where it makes use of CPU resources that happen to be available.
The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
I guess that might leave us wondering why we care about limiting the burst to "some reasonable amount", since the very fact that it can burst seems to suggest there are no other processes contending for CPU at that time. But I find myself dissatisfied with that line of reasoning...
So first off I checked out the command line help for the docker flags you mentioned:
--cpu-quota int Limit CPU CFS (Completely Fair Scheduler) quota
-c, --cpu-shares int CPU shares (relative weight)
The reference to the Linux Completely Fair Scheduler means that in order to understand the value of a CPU limit/quota we need to understand how the underlying process scheduling algorithm works. Makes sense, right? My intuition is that it's not as simple as time-slicing CPU execution according to the CPU shares/requests and allocating whatever is left over at the end of some fixed timeslice on a first-come, first-served basis.
I found this old Linux Journal article snippet which seems to be a legit description of how CFS works:
The CFS tries to keep track of the fair share of the CPU that would
have been available to each process in the system. So, CFS runs a fair
clock at a fraction of real CPU clock speed. The fair clock's rate of
increase is calculated by dividing the wall time (in nanoseconds) by
the total number of processes waiting. The resulting value is the
amount of CPU time to which each process is entitled.
As a process waits for the CPU, the scheduler tracks the amount of
time it would have used on the ideal processor. This wait time,
represented by the per-task wait_runtime variable, is used to rank
processes for scheduling and to determine the amount of time the
process is allowed to execute before being preempted. The process with
the longest wait time (that is, with the gravest need of CPU) is
picked by the scheduler and assigned to the CPU. When this process is
running, its wait time decreases, while the time of other waiting
tasks increases (as they were waiting). This essentially means that
after some time, there will be another task with the largest wait time
(in gravest need of the CPU), and the currently running task will be
preempted. Using this principle, CFS tries to be fair to all tasks and
always tries to have a system with zero wait time for each
process—each process has an equal share of the CPU (something an
“ideal, precise, multitasking CPU” would have done).
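To get a feel for the quoted description, here is a toy simulation of it. It is only my reading of the article, not the kernel's implementation: every tick, each task is owed an equal slice of the CPU, the task with the largest accumulated wait runs, and its wait drops by the full tick it consumed. The names are made up and all tasks are assumed to be CPU-bound with equal weight.

```python
def simulate(num_tasks: int, ticks: int):
    """Return how many ticks each task ran under the 'largest wait runs next' rule."""
    # Wait is kept in units of 1/num_tasks of a tick so the arithmetic stays integral.
    wait = [0] * num_tasks
    ran = [0] * num_tasks
    for _ in range(ticks):
        # Pick the task in gravest need of CPU (ties go to the lowest index).
        current = max(range(num_tasks), key=lambda i: wait[i])
        ran[current] += 1
        for i in range(num_tasks):
            wait[i] += 1            # everyone is owed 1/num_tasks of this tick
        wait[current] -= num_tasks  # the running task used the whole tick
    return ran


print(simulate(3, 300))
```

With three equal tasks over 300 ticks each one ends up running for 100 ticks, which is the "equal share of the CPU" the article talks about.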
While I haven't gone as far as to dive into the Linux kernel source to see how this algorithm actually works, I do have some guesses I would like to put forth as to how shares/requests and quotas/limits play into this CFS algorithm.
First off, my intuition leads me to believe that different processes/tasks accumulate wait_runtime at different relative rates based on their assigned CPU shares/requests since Wikipedia claims that CFS is an implementation of weighted fair queuing and this seems like a reasonable way to achieve a shares/request based weighting in the context of an algorithm that attempts to minimize the wait_runtime for all processes/tasks. I know this doesn't directly speak to the question that was asked, but I want to be sure that my explanation as a whole has a place for both concepts of shares/requests and quotas/limits.
Second, with regard to quotas/limits, I intuit that these would be applicable in situations where a process/task has accumulated a disproportionately large wait_runtime while waiting on I/O. Remember that, in the quoted description above, CFS prioritizes the process/task with the largest wait_runtime? If there were no quota/limit on a given process/task, then it seems to me that a burst of CPU usage by that process/task would block all other processes/tasks from execution for as long as it takes for its wait_runtime to shrink enough that another task is allowed to preempt it.
So in other words, CPU quotas/limits in Docker/Kubernetes land are a mechanism that allows a given container/pod/process to burst in CPU activity to catch up with other processes after waiting on I/O (rather than CPU), without unfairly blocking other processes from also doing work in the course of doing so.
There is no upper bound with just CPU shares. If there are free cycles, you are free to use them. A limit is imposed so that one rogue process cannot hold on to the resource forever.
There should be some fair scheduling. CFS imposes that using the CPU quota and CPU period, via the limit attribute configured here.
To conclude, this kind of property ensures that when I schedule your task you get a minimum of 50 milliseconds to finish it. If you need more time, and no one is waiting in the queue, I will let you run for a few more, but not more than 80 milliseconds.
I think it's correct that, during periods where the Node's CPU is being fully utilized, it's the requests (CPU shares) that will determine how much CPU time each container gets, rather than the limits (which are effectively moot at that point). In that sense, a rogue process can't do unlimited damage (by depriving another of its requests).
However, there are still two broad uses for limits:
If you don't want a container to be able to use more than a fixed amount of CPU even if extra CPU is available on the Node. It might seem weird that you wouldn't want excess CPU to be utilized, but there are use cases for this. Some that I've heard:
You're charging customers for the right to use up to x amount of compute resources (a limit), so you don't want to give them more sometimes for free (which might dissuade them from paying for a higher tier on your service).
You're trying to figure out how a service will perform under load, but this gets complicated/unpredictable, because the performance during your load testing depends on how much spare CPU is lying around that the service is able to utilize (which might be a lot more than the spare CPU that'll actually be on the Node during a real-world high-load situation). This is mentioned here as a big risk.
If the requests on all the containers aren't set especially accurately (as is often the case; devs might set the values upfront and forget to update them as the service evolves, or not even set them very carefully initially). In these cases, things sometimes still function well enough if there's enough slack on the Node; limits can then be useful to prevent a buggy workload from eating all the slack and forcing the other pods back to their incorrectly-set(!) requested amounts.
I know you can set a memory restriction per container in docker via run -m <x>, but is it possible to set an aggregate restriction across all containers, rather than each container individually?
For example, if I have 5 containers and 2GB of RAM, is it possible to configure docker so that it can allocate in total no more than 1GB, meaning the sum of memory allocated to containers may not pass 1GB?
For now, Kubernetes only enforces limits at the container level, via the resources: limits parameter, and only for CPU and memory.
You can control how much memory/CPU a pod uses, since you define the pod. So, if you assign a specific maximum to each container, the pod will not be able to use more resources than the sum of the individual ones.
This is not ideal, because you may want to let each container use as much memory as needed while keeping the pod as a whole below a certain threshold. They have an issue open for what you want here.
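As a sketch of that workaround for the example in the question (5 containers, 1GB in total), you can derive per-container limits from the overall budget and check that their sum stays within it. The function names are made up, and the even split is just one possible policy.

```python
def even_split_mib(total_budget_mib: int, containers: int) -> int:
    """Per-container limit if the budget is divided evenly (rounded down)."""
    return total_budget_mib // containers


def within_budget(limits_mib, total_budget_mib: int) -> bool:
    """True if the per-container limits cannot add up to more than the budget."""
    return sum(limits_mib) <= total_budget_mib


per_container = even_split_mib(1024, 5)
print(per_container)                              # MiB to give each container
print(within_budget([per_container] * 5, 1024))   # the sum respects the 1GB budget
```

Each container would then get a limit of 204 MiB (for example via docker run -m 204m), at the cost of no container being able to borrow memory the others are not using.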