Blog

Kubernetes resources under the hood – Part 1

I’m sure we’re all familiar with the ‘resources’ block of containers in a pod. But do we really know what Kubernetes uses them for under the hood?

One of the very first things that we are taught by the community when starting to use Kubernetes is always to set requests and limits for CPU and memory on every container in our pods.

When you specify a Pod, you can optionally specify how much of each resource a container needs. The most common resources you’ll specify are CPU and memory (RAM); there are others.

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"

If a container specifies its own resource limit but does not specify a resource request, then Kubernetes automatically assigns a resource request that matches the specified limit.
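For instance, with only limits set (the pod name here is hypothetical), Kubernetes fills in matching requests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-only        # hypothetical example
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        limits:
          memory: "128Mi"
          cpu: "500m"
        # no requests block: Kubernetes defaults requests.memory
        # to 128Mi and requests.cpu to 500m, matching the limits
```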

[source]

However, after years of experience with many use cases and many investigations of resource-related issues, we have discovered that Kubernetes resource management is a lot more complex than it seems.

Let’s start from the beginning:

Kubernetes is a container orchestrator that deploys workloads (pods) over a pool of resources (nodes). Of course, this is a huge simplification since Kubernetes is a lot more complex and schedules pods using many different parameters, but what I want to dig into in this article (if it’s not already obvious) is how Kubernetes manages container resources.

So which resources can Kubernetes manage? Containers consume many kinds of resources. The obvious ones are resources like CPU and Memory, but they can also consume other resources such as disk space, disk time (I/O), network bandwidth, process IDs, host ports, IP addresses, GPU, power, and more!

First, let’s take a deep dive into containers

So, what are containers really?

In a nutshell, containers are a set of Linux namespaces.

So, what are Linux namespaces?

Linux namespaces are a Linux kernel functionality that partitions kernel resources such that a process or set of processes in the same Linux namespace can see a set of kernel resources and are isolated from processes in other namespaces. Some examples of these namespaces are PID, UID, Cgroups & IPC (see the complete list in the wiki).

Another thing to know about namespaces is that they can be nested, meaning namespaces can exist inside other namespaces. Child namespaces are isolated from their parent namespaces, but the parent namespaces can see everything within the child namespaces.

Technically speaking, when running a Linux machine, you are already inside a container (since you are in the first set of namespaces). We utilize the isolation advantages of containers when creating another set of namespaces in the same system.

So, when spinning up a container, it creates a set of these namespaces and runs your application inside them. This is also why, inside a container, you will usually see the PID of your application as 1 (or another low number, depending on what you’re running), while outside the container (in the main PID namespace) the PID of the same application will be far larger. It is the same process, but the PID in the container is mapped to the higher PID in the main namespace and isolated from it and from any other set of namespaces (other containers).

Namespaces give us the ability to isolate processes from each other, but what about resource consumption? If all of our containers think they are operating in isolation, couldn’t they consume too much of the resources and impact the others? This phenomenon is known as noisy neighbors.

So how can we deal with noisy neighbors? One approach is to limit the resources each process can consume, and (surprise, surprise) the Linux kernel has another feature up its sleeve that can do just this, called Control groups (Cgroups). These are configured for each process to limit, account for, and isolate the resources they each consume. Using this functionality, Kubernetes can limit the resource usage of containers.

Currently, Kubernetes uses Cgroups v1, but another player has entered the arena (over the last five years): Cgroups v2! Their first use in Kubernetes is the Memory Quality of Service (QoS) feature, in alpha since 1.22, which opens a whole new world of possibilities. You can read all about it here.

What resources are currently managed by Kubernetes?

Kubernetes by itself currently only manages a fraction of the resources present. First of all, it lists the capacity for each resource on its nodes.

# list each node’s resource capacity
kubectl get node -ojson | jq '.items[].status.capacity' 
{
  "cpu": "2",
  "ephemeral-storage": "52416492Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8003232Ki",
  "pods": "110"
}

Then, it calculates the allocatable amount used for pod scheduling. The allocatable resources of a node are calculated by subtracting a buffer of reserved resources for the Linux system, the kubelet, and the eviction threshold from the node’s total capacity. As of 1.21, the kubelet only calculates allocatable resources for CPU, memory, huge pages, and ephemeral storage.

kubectl get node -ojson | jq '.items[].status.allocatable' 
{
  "cpu": "1930m",
  "ephemeral-storage": "47233297124",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "7313056Ki",
  "pods": "110"
}
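Those reservations come from the kubelet’s configuration. A minimal sketch of the relevant fields (the values below are illustrative, not defaults):

```yaml
# allocatable = capacity - kubeReserved - systemReserved - evictionHard
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:              # reserved for the kubelet and container runtime
  cpu: "50m"
  memory: "300Mi"
systemReserved:            # reserved for OS system daemons
  cpu: "20m"
  memory: "200Mi"
evictionHard:              # kubelet starts evicting pods below this threshold
  memory.available: "100Mi"
```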

Each allocatable resource is a vector that the Kubernetes scheduler uses for scheduling decisions.

Requests

When scheduling pods, the scheduler only considers the pod’s container requests against the node’s allocatable resources (each scheduled pod lowers the remaining allocatable amount, so subsequent pods have less room for their requests). It does not consider the actual resource usage on the node (i.e. containers that use resources over or below their requests).

If the containers in my pod have no requests assigned, Kubernetes can schedule them on any node (if, of course, there are no other scheduling restrictions).
By default, Kubernetes can schedule up to 110 pods per node.
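The scheduler’s fit check boils down to request accounting. A minimal sketch of the arithmetic, with made-up numbers based on the allocatable CPU from the node above:

```shell
# Illustrative scheduler fit check: only requests count, never actual usage.
allocatable_cpu=1930      # node allocatable CPU in millicores
already_requested=1500    # sum of CPU requests of pods already on the node
new_pod_request=500       # the incoming pod's CPU request

if [ $((already_requested + new_pod_request)) -le "$allocatable_cpu" ]; then
  echo "pod fits on this node"
else
  echo "pod does not fit; the scheduler tries another node"
fi
```

Even if the running pods actually use only a fraction of their 1500m of requests, the new pod still does not fit.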

Since Kubernetes 1.21, the main resources you can request from Kubernetes are CPU, Memory, Ephemeral storage, and HugePages. In addition, you can accomplish scheduling by requesting custom resources using extended resources (which can also be applied with controllers such as the Nvidia controller for GPU).
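For example, a pod requesting one GPU through an extended resource (assuming a device plugin, such as Nvidia’s, advertises `nvidia.com/gpu` on the node; the pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod              # hypothetical example
spec:
  containers:
    - name: cuda-app
      image: images.my-company.example/cuda-app:v1   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1  # extended resources are set via limits;
                             # requests default to the same value
```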

Note that you can also limit PID consumption per pod at the node level.
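That PID cap is a kubelet setting; a sketch of the relevant field (the value is illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024   # maximum number of PIDs any single pod may consume
```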

So we’ve learned that resource requests are important for scheduling, where all of the requested resources must be available on the node, including extended resources.
We’ll dive deeper into the other effects of CPU requests in the second part of this blog post.

Limits

Resources are considered both at scheduling time and at runtime. To keep our containers from overloading the node and consuming too many resources, Kubernetes utilizes Cgroups: it uses the container limits to configure the Cgroups that cap each container’s resource consumption.
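As a sketch, here is the arithmetic behind the Cgroups v1 values derived from container limits (the formulas are the standard CFS and memory mappings; exact paths and values can vary by runtime). The numbers match the example pod above (cpu: "500m", memory: "128Mi"):

```shell
# How a CPU limit maps to CFS bandwidth control:
cpu_limit_millicores=500
cfs_period_us=100000                                   # default CFS period: 100ms
cfs_quota_us=$((cpu_limit_millicores * cfs_period_us / 1000))
echo "cpu.cfs_quota_us = $cfs_quota_us"                # 50000us of CPU per 100ms period

# How a memory limit maps to a hard byte cap:
memory_limit_mib=128
memory_limit_bytes=$((memory_limit_mib * 1024 * 1024))
echo "memory.limit_in_bytes = $memory_limit_bytes"     # exceeding this triggers the OOM killer
```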

Compressible vs. incompressible resources

I want to take a step back for a moment to talk about the two different types of resources, compressible and incompressible.

A compressible resource means that if the usage of this resource reaches its maximum, the processes that require this resource will have to wait until the resource becomes free. In other words, throttling the processes.

Think of it as a water dam; when the outlet pipes of the dam are full, and the flowing water arriving at the dam exceeds these pipes’ capacity, the water inside the dam will fill up. Usually, we measure compressible resources by time.

CPU is a compressible resource, meaning if the CPU usage is at 100%, a process that requires CPU will need to wait until it receives CPU time.

On the other hand, a resource being incompressible means processes cannot wait for it; either they cannot run, or something else has to stop and release resources for the new process.

Think of it like putting boxes on shelves: once the shelves are full, you cannot put another box on them. You either have to make room by removing boxes from a shelf, or not place the box at all. Memory is an incompressible resource, meaning if you are out of memory and want to allocate memory for a new or existing process, you have to either kill a process that is taking up memory space or let the allocating process crash.

For Kubernetes, the only compressible resource that it manages is CPU. The other resources Kubernetes manages (memory, HugePages, Ephemeral storage, and PIDs) are all incompressible.

When you specify limits for compressible resources like CPU, Kubernetes makes sure to throttle containers when they try to consume more than their allowed levels. For incompressible resources, on the other hand, Kubernetes has to enforce limits by terminating processes or evicting pods. We will dig into this in the upcoming blog posts.

Requests vs. Limits

So we know we use resource requests as a “manual” guide for the Kubernetes scheduler, describing the minimum amount of resources we need to ensure for our workload when it makes scheduling decisions.

We can also use resource limits as instructions to Kubernetes for which Cgroups it should configure for our containers and their thresholds.

When using extended resources, Kubernetes will use the requests for scheduling but will not use the limits to set any Cgroups or restrict those resources’ usage.

Not really the bottom line

We can specify resource requests and limits for the containers in our pod; based on those parameters Kubernetes also assigns a QoS class to our pods.

kubectl get pods -A -o=jsonpath='{range .items[*]}{.metadata.namespace}{" : "}{.metadata.name}{" --QoS--> "}{.status.qosClass}{"\n"}{end}'

As good as it sounds, quality of service does not have the last word on pod priority. This parameter is visible to us, as Kubernetes users, to estimate the probable priority of our pod during high resource stress and eviction events. There is a lot more to it: pods with a so-called lower QoS class might survive eviction events while pods with a higher QoS class are terminated.

By the end of this blog series, you will know everything you need to know about the implication of QoS.

First of all, there are three classes of QoS:

  • Guaranteed
  • Burstable
  • BestEffort

For a Pod to have a QoS class of Guaranteed, every container in the Pod must have both memory and CPU requests and limits, and the requests must equal the limits.

A Pod has a QoS class of Burstable if it does not meet the criteria for Guaranteed and has at least one container with a memory or CPU request.

For a Pod to have a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.

Note that only CPU and memory are used for calculating the QoS class of the pod.
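For instance, a resources block like the following on every container (the values are illustrative) yields Guaranteed; dropping or lowering the requests relative to the limits makes the pod Burstable, and omitting the block entirely on all containers makes it BestEffort:

```yaml
resources:
  requests:
    memory: "128Mi"   # requests equal to limits on every
    cpu: "500m"       # container => Guaranteed
  limits:
    memory: "128Mi"
    cpu: "500m"
```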

Regarding the usage of QoS, you should be aware:

  • It’s used to set the oom_score_adj parameter – more on that in part 4
  • It’s used to set QoS Cgroups – which so far have no effect and is a future QoS feature

To summarize:

So that was a lot of information to go through, and this first part was just getting the basics out of the way. In our next blog posts, we will start digging into the whys and whats of these features to understand exactly how Kubernetes uses them and what you should be putting in your resource requests and limits!

Stay tuned for part 2, where we get into the bits and bytes of CPU requests and see how fairly Linux shares its CPU!