Containers rely on Linux namespaces. There are 8 kinds of namespaces, including:
- process ID
- UNIX time-sharing system (UTS) - hostname, domain name
- users, groups
Each container is assigned new namespaces of these types, creating the illusion for its processes that they run on a separate machine.
Different processes may also share some namespace types, but not others.
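To make the sharing idea concrete, here is a minimal Python sketch that reads the namespace identifiers the kernel exposes under /proc/<pid>/ns and reports which types two processes have in common. The helper names (read_namespaces, shared_namespaces) are made up for this illustration, and reading another process's entries may fail without sufficient permissions, in which case the sketch simply returns fewer entries.

```python
import os


def read_namespaces(pid="self"):
    """Return {namespace type: identifier} for a process.

    Each entry in /proc/<pid>/ns is a symlink whose target looks like
    "pid:[4026531836]". Two processes are in the same namespace when
    the targets are equal.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no such process, or no permission
    for name in entries:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # this entry is not readable for us, skip it
    return result


def shared_namespaces(a, b):
    """Sorted namespace types whose identifiers match in both mappings."""
    return sorted(name for name in a if name in b and a[name] == b[name])


if __name__ == "__main__":
    mine = read_namespaces("self")
    parent = read_namespaces(os.getppid())
    print("shared with parent process:", shared_namespaces(mine, parent))
```

Run on a plain host shell, the process and its parent would typically share every namespace type; comparing a process inside a container with one outside should show fewer shared types.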
Here’s how network namespacing works. When a new container is started, it receives a set of network interfaces that are placed in a new namespace.
cgroups (control groups) are another feature of the Linux kernel. They make it possible to limit the system resources assigned to a process (CPU time, CPU cores, RAM, disk, network bandwidth).
When we set resource restrictions for containers, the container engine actually uses cgroups to limit the process.
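The limits an engine sets end up as plain files in the cgroup filesystem. The following Python sketch assumes cgroup v2 mounted at /sys/fs/cgroup (an assumption, as older hosts use a different cgroup v1 layout) and reads the cpu.max and memory.max files. The function names are invented for this example.

```python
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max: "<quota> <period>" in microseconds.

    Returns the limit as a number of CPUs, or None when the quota is
    "max" (no limit).
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_memory_max(text):
    """Parse cgroup v2 memory.max: a byte count, or "max" for no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_limits(cgroup_dir="/sys/fs/cgroup"):
    """Read the limits of the cgroup rooted at cgroup_dir.

    None means "no limit" or "file not available" (cgroup v1, the root
    cgroup, or a non-Linux system); the sketch does not tell them apart.
    """
    base = Path(cgroup_dir)
    parsers = (("cpu.max", parse_cpu_max), ("memory.max", parse_memory_max))
    limits = {}
    for filename, parser in parsers:
        try:
            limits[filename] = parser((base / filename).read_text())
        except (OSError, ValueError):
            limits[filename] = None
    return limits


if __name__ == "__main__":
    print(read_limits())
```

Inside a container started with CPU and memory limits, this would be expected to print the configured values; on an unrestricted process it prints None for both.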
Containers should not be able to invoke sys-calls that may break other containers (like changing time, or loading kernel modules).
Container engines such as Docker and Podman provide a --privileged flag that gives special permissions to containers.
It’s not ideal, because it grants ALL permissions.
Another option is to use Linux capabilities. There are many of them, giving granular access to specific operations.
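The capabilities a process holds are visible as a hexadecimal bitmask in the CapEff field of /proc/<pid>/status. As a rough sketch, the Python below decodes that mask for a handful of capabilities. The table is deliberately partial (the bit numbers come from the kernel's capability.h header), and the function names are chosen for this example only.

```python
# Partial table: bit number -> capability name. Only a few of the
# existing capabilities are listed, to keep the example short.
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",   # loading kernel modules
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",     # changing the system clock
}


def decode_capabilities(hex_mask):
    """Return the known capability names set in a CapEff-style hex mask."""
    mask = int(hex_mask, 16)
    return [
        name
        for bit, name in sorted(CAPABILITY_BITS.items())
        if mask & (1 << bit)
    ]


def effective_mask(status_path="/proc/self/status"):
    """Return the CapEff hex string of a process, or None if unavailable."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass
    return None


if __name__ == "__main__":
    mask = effective_mask()
    if mask is not None:
        print(mask, decode_capabilities(mask))
```

Comparing the output inside a default container with the output inside a privileged one should show the difference between a reduced capability set and a full one.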
Another option is seccomp (Secure Computing Mode). A custom profile (a JSON file) can be applied to a container, listing the sys-calls that it can make.
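For illustration, a profile of that kind can be generated with a few lines of Python. The sketch below follows the JSON layout used by Docker and Podman seccomp profiles (a default action plus a list of allowed sys-call names). The sys-call list is a toy example and far too short for a real program, and build_seccomp_profile is a hypothetical helper name.

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Build an allow-list profile: everything not listed fails with an error."""
    return {
        # Action taken for any sys-call that is not matched below.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            },
        ],
    }


# Toy allow-list; a real workload needs many more sys-calls than this.
profile = build_seccomp_profile(["read", "write", "exit_group", "write"])
print(json.dumps(profile, indent=2))
```

Saved to a file, such a profile can be passed to the engine with --security-opt seccomp=<path to the file>.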
Further hardening may be achieved with AppArmor or SELinux (mandatory access control, MAC).
Podman popularized the idea of rootless containers. Previously, it was common to run Docker containers with root privileges, which translated to root access on the host as well (although Docker also supports rootless containers!).
Rootless containers make use of user namespaces. They give us access to user mapping and allow us to run the container with any UID, even UID = 0, while on the host system that UID is mapped to a “normal”, non-root user.
An easy way to test that is to run podman unshare id. It runs the id program in a user namespace and reports UID 0 as a result. A non-root user on the host (typically UID 1000) translates to root in the namespace.
We can see the mapping offsets in the
The output means that user 1 in a container will be mapped to user 100000 on the host. User 2 would be mapped to 100001, and so on. A maximum of 65536 users may be mapped (that value may be modified; it’s not a hard limit).
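The arithmetic behind that mapping is simple enough to sketch in Python. The example assumes the three-column format the kernel uses in /proc/<pid>/uid_map (ID inside the namespace, ID outside, range length) and uses sample values matching the numbers above: host UID 1000 and a subordinate range starting at 100000. The function names and the sample text are illustrative, not taken from a real system.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_id(container_id, entries):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, count in entries:
        if inside <= container_id < inside + count:
            return outside + (container_id - inside)
    return None


# Sample mapping (assumed values): container root is the invoking user,
# every other ID comes from the subordinate range.
EXAMPLE_UID_MAP = """\
         0       1000          1
         1     100000      65536
"""

entries = parse_id_map(EXAMPLE_UID_MAP)
for uid in (0, 1, 2):
    print(f"container UID {uid} -> host UID {to_host_id(uid, entries)}")
```

An ID outside every range has no host counterpart, which is why to_host_id returns None for it.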
Volumes and SELinux
Mounting volumes in a rootless context is different from rootful environments. In the latter case, whatever you mount will probably just work without any further tinkering. In rootless podman, you will often experience issues with SELinux (if your system uses SELinux MAC). One of the ways to get around that is to add the :Z option (private, additionally uses MCS) to the volume definition. That applies the right SELinux labeling to the files being shared as a volume (this is only needed when attaching a host directory, not when creating a volume entity).
Containers run with the container_t SELinux domain. They are allowed to access container_ro_file_t typed files. Volume parameters such as :Z apply the container_file_t type to the mounted files.
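To check which label a mounted file actually carries, the SELinux context can be read from the security.selinux extended attribute. The Python sketch below splits a context into its user, role, type and level parts. The sample label is a made-up example of what a privately labeled file might look like, and the function names are hypothetical.

```python
import os


def parse_selinux_context(context):
    """Split "user:role:type[:level]" into its parts.

    The level part (MLS/MCS, e.g. "s0:c123,c456") may itself contain a
    colon, so the string is split at most three times.
    """
    parts = context.rstrip("\x00").split(":", 3)
    user, role, type_ = parts[:3]
    level = parts[3] if len(parts) > 3 else ""
    return {"user": user, "role": role, "type": type_, "level": level}


def read_selinux_context(path):
    """Return the parsed SELinux context of a file, or None if unavailable."""
    try:
        raw = os.getxattr(path, "security.selinux")
    except (OSError, AttributeError):
        # No SELinux label on this file, or os.getxattr is not
        # available on this platform.
        return None
    return parse_selinux_context(raw.decode())


if __name__ == "__main__":
    # Made-up label for illustration.
    sample = "system_u:object_r:container_file_t:s0:c123,c456"
    print(parse_selinux_context(sample))
    print(read_selinux_context("."))
```

The type field is the part to compare against container_file_t, while the category pair in the level field is the MCS part that keeps a private volume separate from other containers.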
Another issue could be due to traditional DAC permissions. The user mapping also works for volumes, so UID 0 in a container maps to UID 1000 on the host. As a result, the container is able to access files owned by UID 1000 on the host.
Docker was the first platform to make containers popular. There’s also CRI (Container Runtime Interface), which container platforms adhere to.
- “Kubernetes in Action (Second Edition)” book
- man namespaces
- Rootless Containers
- User namespaces with Podman (Red Hat)