Kubernetes Security — Part 3: Security Context

Renate Schosser

October 19, 2023

In this blog post series, we deepen our knowledge of Kubernetes security by discussing selected security topics in more detail.
In Part 1 of the series, we started with Role Based Access Control (RBAC). In Part 2 of the series, we discussed Kubernetes Network Policies.

In this part of the series, we will talk about the Kubernetes Security Context, which can be used to define privilege and access controls for a Pod or Container. If you would like to learn more about the Kubernetes Security Context settings, this blog post is for you.

What is a Security Context and why is it needed?

In the previous blog posts, we discussed how to use RBAC to restrict permissions on the Kubernetes API and how to limit connections between pods with Network Policies. But what about the workloads themselves? Shouldn't we restrict the privileges of a workload as well? The answer is a clear "yes": a workload with too many privileges can endanger the whole Kubernetes cluster. To restrict privileges and add access control settings, a Security Context can be used.

The following Security Context settings will be discussed in more detail in this blog post:

capabilities
privileged
runAsNonRoot and runAsUser
allowPrivilegeEscalation
readOnlyRootFilesystem
seccomp

Additionally, we will discuss these settings that should be considered to enhance the security of a pod further:

hostNetwork, hostPID and hostIPC
AppArmor
resource requests and limits
hostPath

Additional information about the Kubernetes Security Context can be found in the official Kubernetes documentation:

Configure a Security Context for a Pod or Container

Security Context definition in Kubernetes resources

You can define a Security Context on the pod level, which means that the settings apply to all containers in the pod, or on the level of the individual containers in the pod.
Please be aware that not every Security Context setting can be assigned on both levels. Some can only be assigned to the pod, some only to containers, and others to both. If conflicting settings are defined on the pod level and on the container level, the container-level settings take precedence.

Here is an example definition of a pod, where you can see how the Security Context can be defined:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo-pod
spec:
  # pod level:
  securityContext:
    # <settings>
  containers:
    - name: demo
      image: image-name
      # container level:
      securityContext:
        # <settings>
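
As a minimal sketch of the precedence rule described above, the following manifest sets "runAsUser" on the pod level and overrides it on the container level. The pod name, container name and user IDs are illustrative assumptions and not taken from a real setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: precedence-demo          # hypothetical name
spec:
  # pod level: applies to every container that does not override it
  securityContext:
    runAsUser: 1000
  containers:
    - name: demo
      image: alpine:3.18.0
      command: ['/bin/sleep', '1d']
      # container level: overrides the pod-level value for this container only
      securityContext:
        runAsUser: 2000
```

With this definition, running "id" inside the container should report the user ID 2000, because the container-level value wins over the pod-level value.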

Hint

The level (pod and/or container) on which a Security Context setting can be defined can be found here:

Also, the documentation of the Pod Security Standards provides a good overview of which Security Context setting can be defined on which level. Just search for the setting you are interested in and check the "Restricted Fields" section.

Security Context settings

The "capabilities" setting

The Linux capabilities feature has existed for quite a while; it was introduced with kernel 2.2 in 1999. With Linux capabilities, it is no longer necessary to give a process either all permissions or no permissions at all. Instead, it is possible to assign a process a defined set of permissions, each of which is called a "capability".
But how does this relate to Kubernetes? Well, a container in Kubernetes is basically a process running on the host system. So, we can limit the permissions of a container with the use of Linux capabilities. Maybe now you wonder what happens if you do not define capabilities for your pod? Good question! Let's find out.

In the following example, we first create a pod, then we log into the running pod by using the "kubectl exec" command. After finding out the process ID of the container, we check the capabilities of the pod. To decode the displayed capabilities, we use the program "capsh".

# create a new pod ("k" is assumed to be a shell alias for "kubectl")
$ k run test-pod --image=nginx
pod/test-pod created
# log into the created pod
$ k exec -it test-pod -- bash
# get the PID of the current shell
root@test-pod:/# echo $$
45
# get the effective capabilities of that process
root@test-pod:/# cat /proc/45/status | grep CapEff
CapEff: 00000000a80425fb
root@test-pod:/# exit
exit
# decode the capabilities
$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

We can see that a container gets quite a lot of capabilities by default (the exact set depends on the container runtime used), and probably not all of them are needed. So how can we limit the capabilities of the container? This is where the "capabilities" setting in the Security Context comes in.

Best practice: Drop all capabilities

The best practice with regard to capabilities is to drop all capabilities in the pod definition if possible. Here is an example of what this looks like:

# pod definition with all capabilities dropped
apiVersion: v1
kind: Pod
metadata:
  name: no-caps
  labels:
    app: no-caps
spec:
  containers:
    - name: no-caps
      image: my-image
      securityContext:
        capabilities:
          drop:
            - ALL   # uppercase, the spelling the Pod Security Standards check for

$ k exec -it no-caps -- sh
/ # echo $$
7
/ # cat /proc/7/status | grep CapEff
CapEff: 0000000000000000

When we check the assigned capabilities of the pod, we can see that this pod does not have any. But what if the pod needs one or more capabilities? In that case, we can drop all capabilities and then add only the needed ones:

# pod definition with only the "cap_net_bind_service" capability
apiVersion: v1
kind: Pod
metadata:
  name: cap-net-bind-service
  labels:
    app: cap-net-bind-service
spec:
  containers:
    - name: cap-net-bind-service
      image: image-name
      securityContext:
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]

$ k exec -it cap-net-bind-service -- sh
/ # echo $$
7
/ # cat /proc/7/status | grep CapEff
CapEff: 0000000000000400
/ # exit
$ capsh --decode=0000000000000400
0x0000000000000400=cap_net_bind_service

Here we see that this pod only has the capability "cap_net_bind_service" assigned.

The "privileged" setting

We talked about capabilities in the previous section, but how are they connected with the "privileged" setting in the Security Context?
The first question is: which capabilities does a pod have if we do not set "privileged"? If "privileged" is not set, it defaults to "false", which is good news with regard to security. But does this mean the pod has no capabilities assigned at all? And what if we set "privileged: true"? Let's check.

apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      privileged: true

$ kubectl exec -it privileged-pod -- bash
root@privileged-pod:/# echo $$
72
root@privileged-pod:/# cat /proc/72/status | grep CapEff
CapEff: 0000003fffffffff
root@privileged-pod:/# exit
exit
$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read

We can see that a pod with "privileged: true" has all capabilities assigned. In terms of capabilities, this is the same as adding all capabilities via the "capabilities" setting in the Security Context. Running your pod with "privileged: true" means that the container has full root privileges on the host system and can use all capabilities provided by the kernel. This basically disables most security mechanisms and should be avoided if possible.

Best practice: "privileged: false"

It is best practice to set "privileged: false". Now let's see what capabilities a pod with the setting "privileged: false" has.

apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-pod
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      privileged: false

$ kubectl exec -it unprivileged-pod -- bash
root@unprivileged-pod:/# echo $$
45
root@unprivileged-pod:/# cat /proc/45/status | grep CapEff
CapEff: 00000000a80425fb
root@unprivileged-pod:/# exit
$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

We can see that even with "privileged: false", the pod has the same capabilities as the pod with the default settings in the "capabilities" section above. Therefore, combine "privileged: false" with dropping as many capabilities as possible to avoid having a pod with too many privileges.
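
As a minimal sketch of this combination, the snippet below shows only the "securityContext" part of a container definition. Which capabilities, if any, have to be added back depends on the workload and is not covered here.

```yaml
securityContext:
  privileged: false
  capabilities:
    drop: ["ALL"]   # add back individual capabilities only if the workload needs them
```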

The "runAsNonRoot" and "runAsUser" settings

Before we get into the details, let us look at what it means to run a container as root. Does the user automatically have root rights on the host system? Fortunately not, because the container runtime (e.g., Docker) provides isolation mechanisms. Nevertheless, unless user namespaces are used, root inside a container is the same user (UID 0) as root on the host system. This means that an attacker who breaks out of the container would be root on the host. A container breakout could happen because of a vulnerability in your application, in the container runtime, or in the Linux kernel.

Therefore, it is strongly recommended to run containers as non-root whenever possible.

Best practice: Run container as non-root

It is best practice to set "runAsNonRoot: true" and also to define the user by setting "runAsUser" to a user ID other than 0 (the user ID of root). If "runAsUser" is not defined, the container process runs as the default user of the image.
Please be aware that if you configure "runAsNonRoot: true" without defining a user with "runAsUser", and the default user of the image is root, the container will not start.

In the following example, we set "runAsNonRoot: true" and define a user other than root with "runAsUser". Then we check the user id of the container.

apiVersion: v1
kind: Pod
metadata:
  name: run-as-user
spec:
  containers:
  - name: run-as-user
    image: alpine:3.18.0
    command: ['/bin/sleep', '1d']
    securityContext:
      runAsUser: 1000
      runAsNonRoot: true

$ k exec -it run-as-user -- sh
~ $ id
uid=1000 gid=0(root) groups=0(root)
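
The failure case mentioned above ("runAsNonRoot: true" without "runAsUser" and an image whose default user is root) can be sketched as follows. The pod name is a hypothetical choice; the nginx image is used because it runs as root by default.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: run-as-non-root-only     # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx                 # the default user of this image is root
    securityContext:
      runAsNonRoot: true         # no runAsUser set, so the image default (root) would be used
```

The API server accepts this pod, but the kubelet should refuse to start the container. In that case, "kubectl describe pod" typically shows an error along the lines of "container has runAsNonRoot and image will run as root".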

The "allowPrivilegeEscalation" setting

Another important setting is "allowPrivilegeEscalation". If we do not define this setting, it defaults to "true". This is a problem, because it allows a process in the pod to gain more privileges than its parent process.

Also, if your container runs as "privileged" or has the "cap_sys_admin" capability assigned, "allowPrivilegeEscalation" is always "true".

Best practice: set "allowPrivilegeEscalation" to false

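Set "allowPrivilegeEscalation" to "false" to prevent processes in the container from gaining additional privileges, for example by executing setuid or setgid binaries that change their effective user or group ID. Under the hood, this setting controls whether the "no_new_privs" flag is set on the container process.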

apiVersion: v1
kind: Pod
metadata:
  name: privilege-escalation-not-allowed
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      allowPrivilegeEscalation: false

The "readOnlyRootFilesystem" setting

The "readOnlyRootFilesystem" setting defines whether a process is allowed to modify the root filesystem of the container it runs in. By default, a process can change the filesystem of its container, because this setting defaults to "false".

Best practice: Set "readOnlyRootFilesystem" to "true"

Best practice is to set "readOnlyRootFilesystem: true". If the process needs to write data somewhere, instead of setting "readOnlyRootFilesystem" to "false", consider mounting an emptyDir to write data to. A process can write to an "emptyDir", even if "readOnlyRootFilesystem" is set to "true".

apiVersion: v1
kind: Pod
metadata:
  name: read-only-root-fs
spec:
  containers:
  - name: alpine
    image: alpine:3.18.0
    command: ['/bin/sleep', '1d']
    securityContext:
      readOnlyRootFilesystem: true

$ k exec -it read-only-root-fs -- sh
/ # touch testfile
touch: testfile: Read-only file system
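
The "emptyDir" approach mentioned above can be sketched as follows. This is a minimal example under assumptions: the pod name, the volume name "tmp" and the mount path "/tmp" are illustrative, so adjust them to the directory your process actually writes to.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: read-only-root-fs-tmp    # hypothetical name
spec:
  containers:
  - name: alpine
    image: alpine:3.18.0
    command: ['/bin/sleep', '1d']
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp                  # assumed volume name
      mountPath: /tmp            # assumed path the process writes to
  volumes:
  - name: tmp
    emptyDir: {}                 # writable scratch space that lives as long as the pod
```

In such a pod, "touch /tmp/testfile" should succeed, while "touch /testfile" should still fail with "Read-only file system".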

Tip

If you want to have pods that are immutable, which means that they cannot be changed during runtime, combine the following two settings in the Security Context:

securityContext:
  readOnlyRootFilesystem: true
  privileged: false

The "seccomp" setting

You can use seccomp to restrict the system calls available to your application and thereby decrease its attack surface. Seccomp is short for "secure computing mode" and is a built-in Linux security feature.
Seccomp is configured in the Security Context. Be aware that seccomp cannot be used if your containers run as privileged.

The following three seccomp types can be defined in the Security Context:

Unconfined

If you do not want to use seccomp (which is not recommended, by the way), you can select the type "Unconfined". "Unconfined" is also the default type if you do not specify seccomp in the Security Context (unless the kubelet has been configured to apply "RuntimeDefault" by default).

RuntimeDefault

If you want to use seccomp without the need to define custom profiles, "RuntimeDefault" is the right choice for you. As the name already suggests, "RuntimeDefault" uses the default seccomp profile of your Container Runtime, e.g., the Docker default seccomp profile or the containerd default seccomp profile.

Localhost

If you want to use a custom profile, you can create your own seccomp profiles and reference them with the type "Localhost". The profiles have to be stored in the "/var/lib/kubelet/seccomp/" directory on the node, and "localhostProfile" is given as a path relative to that directory.

Seccomp is also supported on EKS, GKE and AKS.

Best practice: Use seccomp type "RuntimeDefault" or "Localhost"

# RuntimeDefault profile
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: nginx
    image: nginx

# Localhost profile
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/audit.json

Overview of all Security Context settings

Here is an overview of all security settings we discussed so far and their default values.

An overview of the default values for the Security Context settings can be found here: Pod Security Context v1 core

Additional Pod Security Settings

In the sections above, we talked about settings in the Kubernetes Security Context. But there are additional settings that have to be considered to secure a Kubernetes workload, although they are not part of the Security Context.

Therefore, in this section we will talk about hostNetwork, hostPID and hostIPC, AppArmor, resource requests and limits, and the "hostPath" setting.

hostNetwork, hostPID and hostIPC

Host namespaces, like the network namespace (hostNetwork), the Process ID namespace (hostPID) or the Inter-process communication namespace (hostIPC), allow access to shared information on the underlying host. Let's see what this means in regard to Kubernetes.

If hostNetwork is allowed, the pod can use the node's network namespace. This gives the pod access to the loopback device and to services listening on localhost, and it could be used to snoop on the network activity of other pods on the same node.
If hostPID is allowed, the pod's containers share the host's process ID namespace. This allows the pod, for example, to see all processes running on the host, including processes running inside other pods, and to view the environment variables of those processes.
If hostIPC is allowed, the pod can use the host's inter-process communication mechanisms (shared memory, semaphore arrays, message queues, etc.). This enables the pod to read from and write to these mechanisms whenever another process on the host uses them.

As we saw above, from a security perspective hostNetwork, hostPID and hostIPC should be disabled. The default for all host namespaces is "false", so not defining them is sufficient, though they can also be explicitly set to "false" if desired.

spec:
   hostNetwork: false
   hostPID: false
   hostIPC: false

AppArmor

AppArmor is a Linux security module that can be used to restrict a program to a set of files, capabilities and network access.
If AppArmor is enabled on the node, a profile can be referenced in the pod definition. You can define your own AppArmor profile, or you can use the "runtime/default" AppArmor profile. Since AppArmor support is in beta in Kubernetes at the time of writing, it has to be configured via an annotation.

Additional information about AppArmor on Kubernetes can be found here: https://kubernetes.io/docs/tutorials/security/apparmor/

# AppArmor profiles are specified per container
annotations:
  container.apparmor.security.beta.kubernetes.io/<container_name>: runtime/default
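
To show where the annotation goes in a plain pod definition, here is a minimal sketch. The pod name and the container name "demo" are hypothetical; the important part is that the last segment of the annotation key matches the container name.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo            # hypothetical name
  annotations:
    # the key ends with the container name ("demo" below)
    container.apparmor.security.beta.kubernetes.io/demo: runtime/default
spec:
  containers:
  - name: demo
    image: alpine:3.18.0
    command: ['/bin/sleep', '1d']
```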

Resource requests and limits

Since availability is also an important part of security, it is best practice to set resource requests and limits for your containers.

For CPU, it is recommended to set only the resource request, but no limit.
There is a great blog post explaining why it is not a good idea to set a CPU limit: https://home.robusta.dev/blog/stop-using-cpu-limits
For memory, it is recommended to set both the resource request and the limit. Additionally, the memory request and the memory limit should be equal if possible (see https://kubernetes.io/docs/concepts/security/security-checklist/#pod-security for more information).

Additional information about requests and limits can be found here: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"

hostPath

With the "hostPath" setting, it is possible to mount a file or a directory from the host filesystem into your pod. If the host filesystem is mounted and the pod also has write access, there is a high risk of privilege escalation. A pod with such a configuration is able to traverse the host filesystem outside the specified path, to read data from other containers, and to abuse credentials of system services like the kubelet.

Therefore, it is best practice to avoid using hostPath whenever possible. If you really need access to the filesystem of the underlying host, do not mount the root directory, and set "readOnly: true" if possible.

   volumeMounts:
   - mountPath: "/path/to/directory"
     name: demopath
     readOnly: true
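
The fragment above only shows the mount. A minimal sketch of a complete pod, including the matching "hostPath" volume, could look like this. The pod name and the host directory "/var/log/demo" are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-read-only       # hypothetical name
spec:
  containers:
  - name: demo
    image: alpine:3.18.0
    command: ['/bin/sleep', '1d']
    volumeMounts:
    - mountPath: "/path/to/directory"
      name: demopath
      readOnly: true             # the container cannot write to the host directory
  volumes:
  - name: demopath
    hostPath:
      path: /var/log/demo        # assumed host directory; avoid "/" and other sensitive paths
      type: Directory            # the directory must already exist on the node
```

Keeping the host path as narrow as possible and mounting it read-only limits what a compromised container can reach on the node.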

Best practice Kubernetes resource definition

In this example Kubernetes Deployment, we bring together all the best practices discussed in this blog post:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-sec-context
  labels:
    app: sec-context
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sec-context
  template:
    metadata:
      labels:
        app: sec-context
      annotations:
        container.apparmor.security.beta.kubernetes.io/sec-context: runtime/default
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
       - name: sec-context
         image: demo-image
         securityContext:
           privileged: false
           allowPrivilegeEscalation: false
           readOnlyRootFilesystem: true
           runAsNonRoot: true
           runAsUser: 1001
           capabilities:
             drop: ["ALL"]
         resources:
           requests:
             memory: "128Mi"
             cpu: "250m"
           limits:
             memory: "128Mi"

Summary

In this blog post we covered a lot of topics about the Kubernetes Security Context. In detail we discussed the following:

What a Security Context is and why we need it

How a Security Context can be defined

The recommended Security best practices for the following Security Context settings:

capabilities
privileged
runAsNonRoot and runAsUser
allowPrivilegeEscalation
readOnlyRootFilesystem
seccomp

Additional settings to secure a workload:

hostNetwork, hostPID and hostIPC
AppArmor
Resource requests and limits
hostPath

A best practice Kubernetes resource definition

This blog post should have helped you to not only understand what you should do in regard to the Kubernetes Security Context, but also why these settings are important.

That’s it for the post about the Kubernetes Security Context. Thanks for reading and have fun with securing your Kubernetes workloads!

Kubernetes Security — Part 3: Security Context was originally published in Dynatrace Engineering on Medium.
