Containers
Namespaces
Containers rely on namespaces in Linux. There are 8 kinds of namespaces:
- cgroups
- mount
- process ID
- network
- IPC
- UNIX time-sharing system (UTS) - hostname, domain name
- users, groups
- time
Each container is assigned with new namespaces of these types, creating an illusion for the process that it runs on a separate machine.
Diferent processes may also share some namespace types, but not others:
Networking
Here’s how network namespacing works. When a new container is started, it receives a set of interfaces that are placed in new namespace:
cgroups
cgroups are another feature of Linux kernel. It allows to limit system resources assigned to a process (CPU time, CPU cores, RAM, disk, network bandwidth).
When we’re setting restrictions for containers (e.g. --memory="100m"
),
container engine actualy uses cgroups to limit the process.
Capabilities
Containers should not be able to invoke sys-calls that may break other containers (like changing time, or loading kernel modules).
Docker has --privileged
flag that gives special permissions to containers.
It’s not ideal, because it gives ALL permissions.
Another option is to use Linux capabilities. There are many of them giivng granular access to specific operations.
Another option is seccomp (Secure Computing Mode). A custom profile (JSON file) can be applied to a containers listing sys-calls that it can make.
Further hardening may be achieved with AppArmor or SELinux (MAC).
Rootless Containers
Podman popularized the idea of using rootless containers. Previously, it was common to run containers with Docker with root privileges, which translated to root access on the host as well (although Docker does support rootless containers as well!).
Rootless containers make use of user namespaces. It gives us access to user mapping and allows us to run the container with any UID, even UID = 0, while on the host system that UID would be mapped to a “normal”, non-root user.
An easy way to test that is to run podman unshare id
. It’s going to run id
program in a user namespace. It will print “0” as a result. A non-root user on a
host (typically 1000) translates to a root in the namespace.
We can see the mapping offsets in the /etc/subuid
file:
The output means, that the user 1 in a container will be mapped to user 100000 on a host. User 2 would be mapped to 100001, and so on. A maximum of 65536 users may be mapped (that value may be modified, it’s not a hard limit).
Volumes and SELinux
Mounting volumes in rootles context is different than it is with rootful
environments. In the latter case, whatever you mount, it will probably just work
without any further tinkering. In rootless podman, you will often experience
issues with SELinux (if your system uses SELinux MAC). One of
the ways to get around that is to apply the :z
or :Z
(private, additionally
uses MCS) to volume definition. That will apply the right SELinux labeling to
the files being shared as a volume (only if we’re attaching host dir, it is not
needed when creating a volume entity).
Containers run with the container_t
SELinux domain. They are allowed to access
the container_file_t
and container_ro_file_t
typed files. The :z
/:Z
parameters apply the container_file_t
to the mounted files.
Another issue could be due to traditional DAC permissions. The user mapping also works for volumes, so a UID 0 in a container will map to UID 1000 on a host. So, a container will be able to access files of UID 1000 on the host.
Standarization
Docker was the first container platform to make them popular. There’s CRI (Container Runtime Interface) that container platforms adhere to.
References
- “Kubernetes in Action (Second Edition)” book
- man namespaces
- Rootless Containers
- container_selinux
- User namespaces with Podman (Red Hat)