Containers

Namespaces

Containers rely on namespaces in Linux. There are 8 kinds of namespaces:

cgroups
mount
process ID
network
IPC
UNIX time-sharing system (UTS) - hostname, domain name
users, groups
time

Each container gets its own new namespaces of these types, which gives its processes the illusion of running on a separate machine.

Different processes may also share some namespace types while keeping others separate.
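As a rough illustration of sharing some namespaces but not others, the sketch below compares the namespace identifiers of two processes. It assumes a Linux host where each entry under /proc/<pid>/ns is a symlink whose target looks like "net:[4026531992]"; the function names are hypothetical and chosen only for this example.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def read_namespaces(pid="self"):
    """Return {entry name: inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_kinds(a, b):
    """Namespace entries for which both processes report the same inode."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])
```

Two processes are in the same namespace of a given kind when the inode numbers match. Calling shared_kinds(read_namespaces(), read_namespaces(1)) compares the current process with PID 1; reading another process's entries may require elevated privileges.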

Networking

Here’s how network namespacing works. When a new container is started, it receives a set of network interfaces that are placed in a new network namespace.

cgroups

cgroups (control groups) are another feature of the Linux kernel. They limit the system resources assigned to a process (CPU time, CPU cores, RAM, disk, network bandwidth).

When we set restrictions for a container (e.g. --memory="100m"), the container engine actually uses cgroups to limit the process.
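To make the link between the flag and the cgroup concrete, here is a minimal sketch that converts a value such as "100m" into bytes and reads the limit back from inside a container. It assumes binary units (1m = 1024 * 1024 bytes), a cgroup v2 host, and that the limit is visible at /sys/fs/cgroup/memory.max; both function names are hypothetical.

```python
# Suffix multipliers, assuming binary units.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory_limit(text):
    """Convert a limit such as "100m" into a number of bytes."""
    text = text.strip().lower()
    if not text:
        raise ValueError("empty memory limit")
    if text[-1] in _UNITS:
        number, factor = text[:-1], _UNITS[text[-1]]
    else:
        # No suffix: the value is already in bytes.
        number, factor = text, 1
    return int(number) * factor


def read_cgroup_memory_max(path="/sys/fs/cgroup/memory.max"):
    """Read the cgroup v2 memory limit; None means no limit is set."""
    with open(path) as handle:
        raw = handle.read().strip()
    return None if raw == "max" else int(raw)
```

Under these assumptions, a container started with --memory="100m" would be expected to report the same number from read_cgroup_memory_max() as parse_memory_limit("100m") returns.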

Capabilities

Containers should not be able to invoke syscalls that may affect other containers (like changing the system time or loading kernel modules).

Docker has a --privileged flag that gives special permissions to containers. It’s not ideal, because it grants ALL permissions.

Another option is to use Linux capabilities. There are many of them, each giving granular access to specific operations.
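Capabilities are stored as a bitmask per process, so one way to see what a container is allowed to do is to decode that mask. The sketch below assumes Linux, where /proc/self/status contains a hexadecimal CapEff line, and uses the kernel's bit numbers for CAP_SYS_MODULE (16) and CAP_SYS_TIME (25), the two operations mentioned above. The helper names are hypothetical.

```python
# Bit positions as defined in the kernel's capability header.
CAP_SYS_MODULE = 16  # load and unload kernel modules
CAP_SYS_TIME = 25    # set the system clock


def has_capability(mask, cap_bit):
    """Return True if the given capability bit is set in the mask."""
    return bool((mask >> cap_bit) & 1)


def read_effective_caps(status_path="/proc/self/status"):
    """Return the effective capability mask of the current process."""
    with open(status_path) as handle:
        for line in handle:
            if line.startswith("CapEff:"):
                # The value is a hexadecimal string without a 0x prefix.
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff line not found")
```

Inside a container, has_capability(read_effective_caps(), CAP_SYS_TIME) would show whether the process could change the system time.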

Another option is seccomp (Secure Computing Mode). A custom profile (JSON file) can be applied to a container, listing the syscalls that it can make.
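A profile of that kind can be sketched as a small JSON document: deny everything by default, then allow a named list of syscalls. The example below generates one in the layout used by Docker-style seccomp profiles. The three syscalls are purely illustrative and far too few for a real workload, and build_seccomp_profile is a hypothetical helper.

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Build an allow-list profile: every other syscall fails with an error."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative list only; a real container needs many more syscalls.
profile = build_seccomp_profile(["read", "write", "exit_group"])
profile_json = json.dumps(profile, indent=2)
```

The resulting JSON could be saved to a file and passed to the container engine, for example with --security-opt seccomp=profile.json.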

Further hardening may be achieved with AppArmor or SELinux (MAC).

Rootless Containers

Podman popularized the idea of rootless containers. Previously, it was common to run Docker containers with root privileges, which translated to root access on the host as well (although Docker also supports rootless containers!).

Rootless containers make use of user namespaces. These give us UID mapping and allow us to run the container with any UID, even UID 0, while on the host system that UID is mapped to a “normal”, non-root user.

An easy way to test that is to run podman unshare id. It runs the id program in a user namespace and reports UID 0. A non-root user on the host (typically UID 1000) translates to root in the namespace. We can see the mapping offsets in the /etc/subuid file:

mnj:100000:65536

This entry means that UID 1 in a container is mapped to UID 100000 on the host, UID 2 is mapped to 100001, and so on. Up to 65536 subordinate UIDs may be mapped (that value can be changed; it’s not a hard limit).
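The mapping arithmetic can be written down as a short sketch. It assumes the rootless layout described here: container UID 0 maps to the invoking host user, and container UIDs from 1 upward map onto the subordinate range from /etc/subuid. The function names and the host UID of 1000 are assumptions for illustration.

```python
def parse_subuid_line(line):
    """Parse an /etc/subuid entry such as "mnj:100000:65536"."""
    user, start, count = line.strip().split(":")
    return user, int(start), int(count)


def container_to_host_uid(container_uid, host_uid, start, count):
    """Translate a UID inside the container to the UID seen on the host."""
    if container_uid == 0:
        # Root in the container is the unprivileged user on the host.
        return host_uid
    if 1 <= container_uid <= count:
        # UID 1 takes the first subordinate UID, UID 2 the second, and so on.
        return start + container_uid - 1
    raise ValueError(f"container UID {container_uid} is not mapped")


user, start, count = parse_subuid_line("mnj:100000:65536")
```

With these values, container_to_host_uid(1, 1000, start, count) gives 100000 and container_to_host_uid(2, 1000, start, count) gives 100001, matching the description above.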

Volumes and SELinux

Mounting volumes in a rootless context is different from rootful environments. In the latter case, whatever you mount will probably just work without further tinkering. In rootless Podman, you will often run into SELinux issues (if your system uses SELinux MAC). One way around that is to add the :z (shared) or :Z (private, additionally uses MCS) option to the volume definition. That applies the right SELinux label to the files being shared as a volume (this is only needed when attaching a host directory, not when creating a volume entity).

Containers run in the container_t SELinux domain. They are allowed to access files typed container_file_t and container_ro_file_t. The :z/:Z options apply container_file_t to the mounted files.

Another issue could be due to traditional DAC permissions. The user mapping also applies to volumes, so UID 0 in a container maps to the host user's UID (e.g. 1000). Such a container can therefore access files owned by UID 1000 on the host.

Standardization

Docker was the first container platform to make containers popular. There’s also CRI (Container Runtime Interface), the API that Kubernetes uses to talk to container runtimes and that those runtimes implement.

References

“Kubernetes in Action (Second Edition)” book
man namespaces
Rootless Containers
container_selinux
User namespaces with Podman (Red Hat)
