Everyone knows about Docker, but few people are aware of the technologies underlying it. In this blog post we will analyze one of the most fundamental and powerful technologies hidden behind Docker: runc.
Brief container history: from a mess to standards and proper architecture
Container technologies first appeared after the invention of cgroups and namespaces. Two of the first well-known projects trying to combine them to achieve isolated process environments were LXC and LMCTFY. The latter, in typical Google style, tried to provide a stable abstraction over the Linux internals via an intuitive API.
In 2013 a technology named Docker, built on top of LXC, was created. The Docker team introduced the notions of container “packaging” (called images) and “portability” of these images between different machines. In other words, Docker tried to create an independent software component following the Unix philosophy (minimalism, modularity, interoperability).
In 2014 a library called libcontainer was created. Its objective was to spawn processes in isolated environments and manage their lifecycle. Also in 2014, Kubernetes was announced at DockerCon, and that is when a lot of things started to happen in the container world.
The OCI (Open Container Initiative) was then created in response to the need for standardization and structured governance. The OCI project ended up with two specifications - the Runtime Specification (runtime-spec) and the Image Specification (image-spec). The former defined a detailed API for the developers of runtimes to follow. The libcontainer project was donated to OCI and the first standardized runtime following the runtime-spec was created - runc. It represents a fully compatible API on top of libcontainer allowing users to directly spawn and manage containers.
Today container runtimes are often divided into two categories: low-level (runc, gVisor, Firecracker) and high-level (containerd, Docker, CRI-O, podman). The difference lies in how much of the OCI specifications they implement and in the additional features they provide.
Runc is a standardized runtime for spawning and running containers on Linux according to the OCI specification. However, it doesn’t implement the image-spec part of the OCI.
There are other, higher-level runtimes, like Docker and containerd, which implement this specification on top of runc. By doing so, they address several disadvantages of using runc alone, namely image integrity, bundle history logging and efficient container root file system usage. Be prepared: we are going to look into these more evolved runtimes in the next article!
Runc
runc - Open Container Initiative runtime. runc is a command line client for running applications packaged according to the Open Container Initiative (OCI) format and is a compliant implementation of the Open Container Initiative specification.
Runc is a so-called “container runtime”.
As mentioned above, runc is an OCI compliant runtime - a software component responsible for the creation, configuration and management of isolated Linux processes also called containers. Formally, runc is a client wrapper around libcontainer. As runc follows the OCI specification for container runtimes it requires two pieces of information:
- OCI configuration - a JSON file containing container process information like namespaces, capabilities, environment variables, etc.
- Root filesystem directory - the directory which is going to be used as the root file system (chroot) of the container process.
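Together, these two pieces form what the OCI calls a bundle. A minimal sketch of its on-disk layout (the names follow the defaults used later in this article):
bundle/
├── config.json    # the OCI runtime configuration
└── rootfs/        # the root file system of the container
    ├── bin/
    ├── etc/
    └── ...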
Let’s now inspect how we can use runc and the above-mentioned specification components.
Generating an OCI configuration
Runc comes out of the box with a feature to generate a default OCI configuration:
cryptonite@qb:~$ runc spec && cat config.json
{
    "ociVersion": "1.0.2-dev",
    "process": {
        "terminal": true,
        "user": {
            "uid": 0,
            "gid": 0
        },
        "args": [
            "sh"
        ],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "cwd": "/",
        "capabilities": {
            "bounding": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "effective": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "inheritable": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "permitted": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ],
            "ambient": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ]
        },
        "rlimits": [
            {
                "type": "RLIMIT_NOFILE",
                "hard": 1024,
                "soft": 1024
            }
        ],
        "noNewPrivileges": true
    },
    "root": {
        "path": "rootfs",
        "readonly": true
    },
    "hostname": "runc",
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ]
        }
    ],
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                }
            ]
        },
        "namespaces": [
            { "type": "pid" },
            { "type": "network" },
            { "type": "ipc" },
            { "type": "uts" },
            { "type": "mount" }
        ],
        "maskedPaths": [
            "/proc/acpi",
            "/proc/asound",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/sys/firmware",
            "/proc/scsi"
        ],
        "readonlyPaths": [
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}
From the above code snippet, one can see that the config.json file contains information about:
- the container process: arguments, environment variables, working directory, user, capabilities and resource limits (rlimits);
- the root file system: its path and whether it is read-only;
- the mounts to perform inside the container (/proc, /dev, /sys/fs/cgroup);
- the Linux-specific settings: the namespaces the process will be placed in, device access rules, and the masked and read-only paths.
All of the above configuration is almost completely sufficient for libcontainer to create an isolated Linux process (container). One thing is still missing to spawn a container: the process root directory. Let’s download one and make it available for runc.
# download an alpine fs
cryptonite@qb:~$ wget http://dl-cdn.alpinelinux.org/alpine/v3.10/releases/x86_64/alpine-minirootfs-3.10.1-x86_64.tar.gz
...
cryptonite@qb:~$ mkdir rootfs && tar -xzf \
    alpine-minirootfs-3.10.1-x86_64.tar.gz -C rootfs
cryptonite@qb:~$ ls rootfs
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
We have downloaded the root file system of an Alpine Linux image - the missing piece of our OCI bundle. Now we can create a container process executing a sh shell.
cryptonite@qb:~$ runc create --help
NAME:
   runc create - create a container
...
DESCRIPTION:
   The create command creates an instance of a container for a bundle. The
   bundle is a directory with a specification file named "config.json" and a
   root filesystem.
Okay, let’s put everything together in this bundle and spawn a new baby container! We will distinguish two types of containers:
- root - containers running under UID=0;
- rootless - containers running under a UID different from 0.
Creating a root container
To be able to create a root container (a process running under UID=0), one has to change the ownership of the freshly downloaded root filesystem (if the user is not already root).
cryptonite@qb:~$ sudo su -
# move the config.json file and the rootfs to the bundle directory
root@qb:~# mkdir bundle && mv config.json ./bundle && mv rootfs ./bundle
# change the ownership to root
root@qb:~# chown -R $(id -u) bundle
Almost everything is ready. We have one last detail to take care of: since we would like to detach our root container from runc and its file descriptors while still being able to interact with it, we have to create a TTY socket and connect to it from both sides (from the container side and from our terminal). We are going to use recvtty, which is part of the official runc project.
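recvtty is not installed together with runc; it lives in the runc sources under contrib/cmd/recvtty and can be built with a Go toolchain (the clone and install steps below are assumptions about the setup). We leave it listening in a dedicated terminal:
# build recvtty from the runc sources (assumed paths)
cryptonite@qb:~$ git clone https://github.com/opencontainers/runc
cryptonite@qb:~$ cd runc/contrib/cmd/recvtty && go build . && sudo cp recvtty /usr/local/bin/ && cd ~
# listen on the console socket; a shell will drop here once the container starts
cryptonite@qb:~$ recvtty tty.sock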
Now we switch to another terminal to create the container via runc.
# in another terminal
cryptonite@qb:~$ sudo runc create -b bundle --console-socket $(pwd)/tty.sock container-crypt0n1t3
cryptonite@qb:~$ sudo runc list
ID                     PID      STATUS    BUNDLE     CREATED                          OWNER
container-crypt0n1t3   86087    created   ~/bundle   2022-03-15T15:46:41.562034388Z   root
cryptonite@qb:~$ ps aux | grep 86087
root       86087  0.0  0.0 1086508 11640 pts/0   Ssl+ 16:46   0:00 runc init
...
Our baby container is now created, but it is not yet running the sh program defined in the JSON file. That is because the runc init process is still holding the process which has to load the sh program inside its namespaced (isolated) environment, waiting for the start command. Let’s inspect what is going on inside runc init.
# inspect namespaces of runc init process
cryptonite@qb:~$ sudo ls -al /proc/86087/ns
total 0
dr-x--x--x 2 root root 0 mars 15 16:57 .
dr-xr-xr-x 9 root root 0 mars 15 16:46 ..
lrwxrwxrwx 1 root root 0 mars 15 16:57 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 ipc -> 'ipc:[4026532681]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 mnt -> 'mnt:[4026532663]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 net -> 'net:[4026532685]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 pid -> 'pid:[4026532682]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 pid_for_children -> 'pid:[4026532682]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 mars 15 16:57 uts -> 'uts:[4026532676]'
# inspect current shell namespaces
cryptonite@qb:~$ sudo ls -al /proc/$$/ns
total 0
dr-x--x--x 2 cryptonite cryptonite 0 mars 15 16:57 .
dr-xr-xr-x 9 cryptonite cryptonite 0 mars 15 16:16 ..
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 net -> 'net:[4026532008]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 time -> 'time:[4026531834]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 user -> 'user:[4026531837]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 15 16:57 uts -> 'uts:[4026531838]'
We can confirm that, except for the user and time namespaces, the runc init process is using a different set of namespaces than our regular shell. But why hasn’t it launched our shell process yet?
This is a good moment to repeat that runc is a pretty low-level runtime which does not handle a lot of configuration compared to higher-level runtimes. For example, it doesn’t configure any networking interface for the isolated process. To allow further configuration of the container process, runc keeps it in a containment facility (runc init) where additional setup can be performed. To illustrate in a more practical way why this containment can be useful, we are going to refer to a previous article and configure the network namespace of our baby container.
Configuring runc init with a network interface
# enter the container network namespace
cryptonite@qb:~$ sudo nsenter --target 86087 --net
root@qb:~# ifconfig -a
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
# no network interface ;(
We see that there are indeed no interfaces in the network namespace of the runc init process; it is cut off from the world. Let’s bring it back!
# create a virtual ethernet pair and bring one of the ends up
cryptonite@qb:~$ sudo ip link add veth0 type veth peer name ceth0
cryptonite@qb:~$ sudo ip link set veth0 up
# assign an IP address
cryptonite@qb:~$ sudo ip addr add 172.12.0.11/24 dev veth0
# and put the other end in the net namespace of runc init
cryptonite@qb:~$ sudo ip link set ceth0 netns /proc/86087/ns/net
We can now inspect once again what is going on in the network namespace.
cryptonite@qb:~$ sudo nsenter --target 86087 --net
root@qb:~# ifconfig -a
ceth0: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 8a:4f:1c:61:74:f4  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

# and there it is !
# let's configure it to be functional
root@qb:~# ip link set lo up
root@qb:~# ip link set ceth0 up
root@qb:~# ip addr add 172.12.0.12/24 dev ceth0
root@qb:~# ping -c 1 172.12.0.11
PING 172.12.0.11 (172.12.0.11) 56(84) bytes of data.
64 bytes from 172.12.0.11: icmp_seq=1 ttl=64 time=0.180 ms

--- 172.12.0.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.180/0.180/0.180/0.000 ms
We have now configured a virtual interface pair and connected the network namespace of the runc init process with the host network namespace. This configuration is not mandatory: the container process executing a sh shell could be started without any connection to the host. However, it is useful for processes delivering network functionalities. Meanwhile, the sh program is still not running inside the container. Now that the configuration is done, we can finally start the container and run the predefined program.
Starting the root container
cryptonite@qb:~$ sudo runc start container-crypt0n1t3
cryptonite@qb:~$ ps aux | grep 86087
root       86087  0.0  0.0   1632   892 pts/0    Ss+  16:46   0:00 /bin/sh
cryptonite@qb:~$ sudo runc list
ID                     PID      STATUS    BUNDLE     CREATED                          OWNER
container-crypt0n1t3   86087    running   ~/bundle   2022-03-15T15:46:41.562034388Z   root
The container is finally running under the same PID, but it has swapped its executable for /bin/sh. Meanwhile, in the terminal holding recvtty, a shell pops out - yay! We can interact with our newly spawned container.
cryptonite@qb:~$ recvtty tty.sock
# the shell drops here
/ # ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
/ # ifconfig -a
ceth0     Link encap:Ethernet  HWaddr 8A:4F:1C:61:74:F4
          inet addr:172.12.0.12  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::884f:1cff:fe61:74f4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:43 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6259 (6.1 KiB)  TX bytes:1188 (1.1 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # id
uid=0(root) gid=0(root)
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   19 root      0:00 ps aux
For those familiar with other container technologies, this shell will look familiar ;). Okay, runc is really cool, but what more can we do with it? Well, pretty much everything concerning the lifecycle management of this container.
Writable storage inside a container
By default, when creating a container, runc mounts its root file system as read-only. This is problematic because our process cannot write to the file system in which it is “chrooted”.
# in recvtty terminal
/ # touch hello
touch: hello: Read-only file system
/ # mount
/dev/mapper/vgubuntu-root on / type ext4 (ro,relatime,errors=remount-ro)
...
# ;((
To change that, one can modify the config.json file.
"root": { "path": "rootfs", "readonly": false },
cryptonite@qb:~$ sudo runc create -b bundle --console-socket $(pwd)/tty.sock wcontainer-crypt0n1t3
cryptonite@qb:~$ sudo runc start wcontainer-crypt0n1t3
# in the recvtty shell
/ # touch hello
/ # ls
bin   etc    home  media  opt   rootfs  sbin  sys  usr
dev   hello  lib   mnt    proc  run     srv   tmp  var
This solution is rather coarse-grained: it makes the whole root file system writable. From here, it is better to define more specific rules for the different subdirectories under the root file system.
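For example, one can keep the root read-only and add a dedicated writable mount to the mounts array of config.json (a sketch; the /data destination is an arbitrary choice for illustration):
"root": {
    "path": "rootfs",
    "readonly": true
},
"mounts": [
    {
        "destination": "/data",
        "type": "tmpfs",
        "source": "tmpfs",
        "options": [ "nosuid", "nodev", "mode=755", "size=65536k" ]
    }
]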
Pause and resume a container
# pause and inspect the process state
cryptonite@qb:~$ sudo runc pause container-crypt0n1t3
cryptonite@qb:~$ sudo runc list
ID                     PID      STATUS    BUNDLE     CREATED                          OWNER
container-crypt0n1t3   86087    paused    ~/bundle   2022-03-15T15:46:41.562034388Z   root
# investigate the system state of the process
cryptonite@qb:~$ ps aux | grep 86087
root       86087  0.0  0.0   1632   892 pts/0    Ds+  16:46   0:00 /bin/sh
From the above code snippet, one can see that after pausing a container, its process is put in the Ds+ state, where D translates to uninterruptible sleep (a state normally reserved for I/O). In this state the process receives almost none of the OS signals and can’t be debugged (ptrace). Let’s resume it and inspect its state again.
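A quick check (output trimmed to the relevant line):
cryptonite@qb:~$ sudo runc resume container-crypt0n1t3
cryptonite@qb:~$ ps aux | grep 86087
root       86087  0.0  0.0   1632   892 pts/0    Ss+  16:46   0:00 /bin/sh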
One can see that after resuming, the process has changed its state to Ss+ (interruptible sleep). In this state it can again receive signals and be debugged. Let’s investigate deeper how the pause and resume are performed at the kernel level.
cryptonite@qb:~$ strace sudo runc pause container-crypt0n1t3
...
rt_sigaction(SIGRTMIN, {sa_handler=0x7f198de71bf0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f198de7f3c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7f198de71c90, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f198de7f3c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
...
cryptonite@qb:~$ strace sudo runc resume container-crypt0n1t3
...
rt_sigaction(SIGRTMIN, {sa_handler=0x7fcb0f28cbf0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fcb0f29a3c0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7fcb0f28cc90, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fcb0f29a3c0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
...
At first glance, the rt_sigaction calls above suggest that pause and resume are driven by real-time signals (SIGRTMIN, SIGRT_1). These handlers, however, are installed by sudo and the Go runtime rather than by the pause mechanism itself: under the hood, runc pauses and resumes containers through the cgroup freezer subsystem, writing FROZEN or THAWED to the container’s freezer.state file. The freezer is also what puts the process in the uninterruptible D state we observed.
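We can verify this through the freezer cgroup (a quick check; the exact path varies with the cgroup layout of your system, here assumed to mirror the cpuset path shown later in this article):
# while the container is paused
cryptonite@qb:~$ cat /sys/fs/cgroup/freezer/container-crypt0n1t3/freezer.state
FROZEN
# after runc resume
cryptonite@qb:~$ cat /sys/fs/cgroup/freezer/container-crypt0n1t3/freezer.state
THAWED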
Inspect the current state of a container
Runc also allows you to inspect the state of a running container on the fly. It shows the usage of different OS resources and their limits (CPU usage per core, RAM pages and page faults, I/O, swap, network interfaces) at the time of execution of the container:
# stopped and deleted old container - started with a new default one
cryptonite@qb:~$ sudo runc events container-crypt0n1t3
{"type":"stats","id":"container-crypt0n1t3","data":{"cpu":{"usage":{
"total":22797516,"percpu":[1836602,9757585,1855859,494447,2339327,3934238,
139074,2440384],"percpu_kernel":[0,0,391015,0,0,0,0,0],
"percpu_user":[1836602,9757585,1464844,494447,2339327,3934238,139074,2440384],
"kernel":0,"user":0},"throttling":{}},"cpuset":{"cpus":[0,1,2,3,4,5,6,7],
"cpu_exclusive":0,"mems":[0],"mem_hardwall":0,"mem_exclusive":0,
"memory_migrate":0,"memory_spread_page":0,"memory_spread_slab":0,
"memory_pressure":0,"sched_load_balance":1,"sched_relax_domain_level":-1},
"memory":{"usage":{"limit":9223372036854771712,"usage":348160,"max":3080192,
"failcnt":0},"swap":{"limit":9223372036854771712,"usage":348160,"max":3080192,
"failcnt":0},"kernel":{"limit":9223372036854771712,"usage":208896,"max":512000,
"failcnt":0},"kernelTCP":{"limit":9223372036854771712,"failcnt":0},"raw":
{"active_anon":4096,"active_file":0,"cache":0,"dirty":0,
"hierarchical_memory_limit":9223372036854771712,"hierarchical_memsw_limit":9223372036854771712,
"inactive_anon":135168,"inactive_file":0,"mapped_file":0,
"pgfault":1063,"pgmajfault":0,"pgpgin":657,"pgpgout":623,"rss":139264,
"rss_huge":0,"shmem":0,"swap":0,
"total_active_anon":4096,"total_active_file":0,"total_cache":0,
"total_dirty":0,"total_inactive_anon":135168,"total_inactive_file":0,
"total_mapped_file":0,"total_pgfault":1063,"total_pgmajfault":0,"total_pgpgin":657,
"total_pgpgout":623,"total_rss":139264,"total_rss_huge":0,"total_shmem":0,"total_swap":0,
"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0}},
"pids":{"current":1},"blkio":{},"hugetlb":{"1GB":{"failcnt":0},"2MB":{"failcnt":0}},
"intel_rdt":{},"network_interfaces":null}}
Some of the above limits represent the hard limits imposed by the cgroups Linux feature.
cryptonite@qb:~$ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/container-crypt0n1t3/memory.limit_in_bytes
9223372036854771712
# same memory limit
cryptonite@qb:~$ cat /sys/fs/cgroup/cpuset/container-crypt0n1t3/cpuset.cpus
0-7
# same set of cpus used
This feature gives a really precise idea of what is going on with the process in terms of consumed resources in real time. Such a log can be really useful for forensic investigations.
By default, runc also keeps global information about each container in a state file under /run/runc/<container-id>/state.json.
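The same information can be queried on demand with runc state (output abridged and illustrative):
cryptonite@qb:~$ sudo runc state container-crypt0n1t3
{
  "ociVersion": "1.0.2-dev",
  "id": "container-crypt0n1t3",
  "pid": 86087,
  "status": "running",
  "bundle": "/home/cryptonite/bundle",
  ...
}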
Checkpoint a container
Checkpointing is another interesting feature of runc. It allows you to snapshot the current in-memory state of a container and save it as a set of files. This state includes open file descriptors, memory content (pages in RAM), registers, mount points, etc. The process can later be resumed from this saved state. This can be really useful when one wants to transport a container from one host to another without losing its internal state (live migration). It can also be used to revert a process to a previous stable state (debugging). Runc does the checkpointing with the help of the CRIU software. However, the latter does not come out of the box with runc and has to be installed separately and made available in runc’s search path (e.g. /usr/local/sbin) in order to work properly. To illustrate checkpointing, we are going to stop a printer process and resume it afterwards. The config.json file will contain this:
"args": [ "/bin/sh", "-c", "i=0; while true; do echo $i;i=$(expr $i + 1); sleep 1; done" ],
Here we also demonstrate another way to run a container: runc run creates and starts the container in a single step, skipping the pause at the configuration phase (runc init). Let’s run it.
cryptonite@qb:~$ sudo runc run -b bundle -d --console-socket $(pwd)/tty.sock container-printer
# in the recvtty shell
recvtty tty.sock
0
1
2
3
4
Now let’s stop it.
cryptonite@qb:~$ sudo runc checkpoint --image-path $(pwd)/image-checkpoint \
    container-printer
# inspect what was produced by criu
cryptonite@qb:~$ ls image-checkpoint
cgroup.img        fs-1.img            pagemap-176.img          tmpfs-dev-73.tar.gz.img
core-176.img      ids-176.img         pagemap-1.img            tmpfs-dev-74.tar.gz.img
core-1.img        ids-1.img           pages-1.img              tmpfs-dev-75.tar.gz.img
descriptors.json  inventory.img       pages-2.img              tmpfs-dev-76.tar.gz.img
fdinfo-2.img      ipcns-var-10.img    pstree.img               tmpfs-dev-86.tar.gz.img
fdinfo-3.img      mm-176.img          seccomp.img              tty-info.img
files.img         mm-1.img            tmpfs-dev-69.tar.gz.img  utsns-11.img
fs-176.img        mountpoints-12.img  tmpfs-dev-71.tar.gz.img
From the above snippet one can see that the checkpoint is materialized as a set of files in the img format (CRIU image file v1.1). The contents of these files are hardly readable - a mix of binary and text content. Let’s now resume our printer process, which at the time of stopping was at 84.
cryptonite@qb:~$ sudo runc restore --detach --image-path $(pwd)/image-checkpoint \
    -b bundle --console-socket $(pwd)/tty.sock container-printer-restore
Let’s go back to the recvtty shell.
recvtty tty.sock
85
86
87
88
89
90
91
92
The process resumed at its previous state as if nothing had happened. We have spawned an exact copy of the container (file descriptors, processes, etc.) in the previously saved state (the state at the moment the checkpoint was taken).
- Note: there is an interesting option, --leave-running, which takes the checkpoint without stopping the process. Also, once a container is stopped, it can't be started again.
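A sketch of how that looks (same checkpoint command as before, only the flag differs):
cryptonite@qb:~$ sudo runc checkpoint --leave-running \
    --image-path $(pwd)/image-checkpoint container-printer
# the container keeps printing in the recvtty shell while the snapshot is taken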
Executing a new process in an existing container
Runc offers the possibility to execute a new process inside an existing container. This amounts to creating a new process and applying the same set of isolation mechanisms to it as to another process, hence putting them in the same “container”.
cryptonite@qb:~$ sudo runc exec container-crypt0n1t3-restore sleep 120
# in a new terminal
# by default runc allocates a pseudo tty and connects it with the exec terminal
cryptonite@qb:~$ sudo runc list
ID                             PID     STATUS    BUNDLE     CREATED                          OWNER
container-crypt0n1t3           0       stopped   ~/bundle   2022-03-16T08:44:37.440444742Z   root
container-crypt0n1t3-restore   13712   running   ~/bundle   2022-03-16T10:24:03.925419212Z   root
cryptonite@qb:~$ sudo runc ps container-crypt0n1t3-restore
UID    PID    PPID   C  STIME  TTY    TIME      CMD
root   13712  2004   0  11:24  pts/0  00:00:00  /bin/sh
root   14405  14393  0  11:36  ?      00:00:00  sleep 120
# the sleep process is part of the crypt0n1t3 container
Let’s see what is going on from the container’s point of view.
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
   47 root      0:00 sleep 120
   53 root      0:00 ps aux
# the shell process sees the sleep process
# check the new process namespaces
/ # ls /proc/47/ns -al
total 0
dr-x--x--x 2 root root 0 Mar 16 10:36 .
dr-xr-xr-x 9 root root 0 Mar 16 10:36 ..
lrwxrwxrwx 1 root root 0 Mar 16 10:36 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 ipc -> ipc:[4026532667]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 mnt -> mnt:[4026532665]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 net -> net:[4026532670]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 pid -> pid:[4026532668]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 pid_for_children -> pid:[4026532668]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 time -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 time_for_children -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar 16 10:36 uts -> uts:[4026532666]
# check own namespaces
/ # ls /proc/self/ns -al
total 0
dr-x--x--x 2 root root 0 Mar 16 10:35 .
dr-xr-xr-x 9 root root 0 Mar 16 10:35 ..
lrwxrwxrwx 1 root root 0 Mar 16 10:35 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 ipc -> ipc:[4026532667]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 mnt -> mnt:[4026532665]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 net -> net:[4026532670]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 pid -> pid:[4026532668]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 pid_for_children -> pid:[4026532668]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 time -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 time_for_children -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar 16 10:35 uts -> uts:[4026532666]
The new sleep process has indeed inherited the same namespaces as the shell process already inside the container. The same applies to other attributes such as capabilities, the current working directory, etc. These can be changed with the help of runc.
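For instance, runc exec accepts flags such as --user, --cwd and --env to override these attributes (an illustrative invocation):
# run a process in the same container, but as UID 1000, in /tmp, with a custom env
cryptonite@qb:~$ sudo runc exec --user 1000 --cwd /tmp --env FOO=bar \
    container-crypt0n1t3-restore id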
Hooks
Runc allows you to execute commands at defined points of a container’s lifecycle. This feature was designed to facilitate the setup and cleanup of the container environment. There exist several types of hooks, which we’ll discuss separately.
CreateRuntime
The CreateRuntime hooks are called after the container environment has been created (namespaces, cgroups, capabilities). However, the process executing the hook is not enclosed in this environment and hence has access to all resources of the current context. The process is not chrooted and its current working directory is the bundle directory. The executable path is also resolved in the runtime namespace (runc’s set of namespaces). The CreateRuntime hooks are useful for initial container configuration (e.g. configuring the network namespace).
CreateContainer
The CreateContainer hooks are called after the CreateRuntime hooks. These hooks are executed inside the container namespaces (after nsenter), but the executable path is resolved in the runtime namespace. In other words, the process has entered the container’s set of namespaces but is not yet “chrooted”; its current working directory is, however, the container rootfs directory.
This functionality can be useful to report the status of the environment configuration back to the user.
StartContainer
The StartContainer hooks are called before the user-specified program is executed, as part of the start operation. Such a hook can be used to add functionality to the execution context (e.g. load an additional library).
The hook executable path is resolved in the container namespace, and the hook is executed there as well.
PostStart
The Poststart hooks are called after the user-specified process has been started but before the start operation returns. For example, such a hook can notify the user that the container process has spawned.
The executable path is resolved in the runtime namespace, and the hook is executed there as well.
PostStop
The PostStop hooks are called after the container is deleted (or its process exits) but before the delete operation returns. Cleanup or debugging functions are typical examples of such hooks.
The executable path is resolved in the runtime namespace, and the hook is executed there as well.
The syntax for defining hooks in the config.json is the following:
"hooks":{ "createRuntime": [ { "path": "/bin/bash", "args": ["/bin/bash", "-c", "../scripts-hooks/runtimeCreate.sh"] } ], "createContainer": [ { "path": "/bin/bash", "args": ["/bin/bash", "-c", "./home/tmpfssc/containerCreate.sh"] } ], "poststart": [ { "path": "/bin/bash", "args": ["/bin/bash", "-c", "../scripts-hooks/postStart.sh"] } ], "startContainer": [ { "path": "/bin/sh", "args": ["/bin/sh", "-c", "/home/tmpfssc/startContainer.sh"] } ], "poststop": [ { "path": "/bin/bash", "args": ["/bin/bash", "-c", "./scripts-hooks/postStop.sh"] } ] },
To truly understand this feature, a small POC was developed. Here are its main elements:
- runtimeCreate.sh - initializes the network namespace;
- containerCreate.sh - tests the above configuration;
- postStart.sh - runs an http server on the host after the initialization is completed;
- postStop.sh - cleans up the network namespace and stops the http server.
Each of these scripts outputs information about its environment.
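For reference, here is a minimal sketch of what such a hook script can look like. The interface is defined by the OCI spec: the container state (id, pid, bundle path, ...) arrives as JSON on the hook's stdin; the use of jq below is an assumption about the host.
#!/bin/bash
# read the OCI container state from stdin
state=$(cat)
# extract the container's PID to reach its namespaces from the host
pid=$(echo "$state" | jq -r '.pid')
# example createRuntime action: move a virtual interface into the
# container's network namespace (ceth0 as in the earlier demo)
ip link set ceth0 netns "/proc/${pid}/ns/net"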
Information about the defined hooks of a running container can be found in its state file under /run/runc/<container-id>/state.json (the base path is customizable with the --root flag).
Updating container resource limits
The runtime also allows you to modify the cgroup resource limits of a container on the fly. This can be really useful for scaling and improving performance, but it can just as well degrade performance or deny service to programs sharing, or depending hierarchically on, the cgroups of the current process. By default runc creates a sub-cgroup of the root one for each controller (e.g. under /sys/fs/cgroup/memory/user.slice/, as seen earlier). For this paragraph we are going to run the following program in the container:
"args": [ "/bin/sh", "-c", "i=0; while true; do echo $i;i=$(expr $i + 1); sleep 1; done" ],
cryptonite@qb:~$ runc update --help
...
   --blkio-weight value        Specifies per cgroup weight, range is from 10 to 1000 (default: 0)
   --cpu-period value          CPU CFS period to be used for hardcapping (in usecs). 0 to use system default
   --cpu-quota value           CPU CFS hardcap limit (in usecs). Allowed cpu time in a given period
   --cpu-share value           CPU shares (relative weight vs. other containers)
   --cpu-rt-period value       CPU realtime period to be used for hardcapping (in usecs). 0 to use system default
   --cpu-rt-runtime value      CPU realtime hardcap limit (in usecs). Allowed cpu time in a given period
   --cpuset-cpus value         CPU(s) to use
   --cpuset-mems value         Memory node(s) to use
   --memory value              Memory limit (in bytes)
   --memory-reservation value  Memory reservation or soft_limit (in bytes)
   --memory-swap value         Total memory usage (memory + swap); set '-1' to enable unlimited swap
   --pids-limit value          Maximum number of pids allowed in the container (default: 0)
...
Let’s start a container and update its hardware limitations.
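We start it under the name container-spammer, reusing the run invocation shown earlier (a sketch of the assumed setup):
cryptonite@qb:~$ sudo runc run -b bundle -d --console-socket $(pwd)/tty.sock container-spammer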
Now we will impose a new limit on RAM consumption. This limit maps to the memory cgroup controller.
# after some adjustments to the process current memory usage
# define an upper bound on the RAM usage: 300kB
cryptonite@qb:~$ sudo runc update --memory 300000 container-spammer
If we look in our virtual tty, the process suddenly freezes.
Let’s inspect further what happened.
cryptonite@qb:~$ sudo runc list
ID                  PID   STATUS    BUNDLE     CREATED                          OWNER
container-spammer   0     stopped   ~/bundle   2022-03-17T10:05:16.623692849Z   root
# stopped?
cryptonite@qb:~$ sudo tail /var/log/kern.log
...
Mar 17 11:06:32 qb kernel: [ 3772.833645] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=container-spammer,mems_allowed=0,oom_memcg=/user.slice/user-1000.slice/user@1000.service/container-spammer,task_memcg=/user.slice/user-1000.slice/user@1000.service/container-spammer,task=sh,pid=13681,uid=0
Mar 17 11:06:32 qb kernel: [ 3772.833684] Memory cgroup out of memory: Killed process 13681 (sh) total-vm:1624kB, anon-rss:0kB, file-rss:880kB, shmem-rss:0kB, UID:0 pgtables:36kB oom_score_adj:0
By updating the container’s memory cgroup limit, we actually forced the kernel’s Out Of Memory (OOM) killer to intervene and kill the container process. An update of the CPU cgroup can likewise increase or decrease the performance of the container. Cgroups are a Swiss army knife, but they require careful configuration.
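For instance, CPU time can be hard-capped on the fly through the CFS quota and period flags listed above (an illustrative invocation):
# allow the container ~10% of one core: 10ms of CPU time per 100ms period
cryptonite@qb:~$ sudo runc update --cpu-period 100000 --cpu-quota 10000 container-spammer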
Creating a rootless container
Until now, all the manipulations we demonstrated were performed on container processes running as root on the host. But what if we want to harden security and run container processes with a UID different from zero? Separately, we would also like system users without root privileges to be able to run containers in a safe manner. In this article we use the term “rootless container” for a container which makes use of the user namespace. By default, runc instantiates the container with the UID of the user triggering the command, and the default OCI configuration is generated without a user namespace, hence it relates to UID 0 on the host.
cryptonite@qb:~$ runc create -b bundle --console-socket $(pwd)/tty.sock rootless-crypt0n1t3
ERRO[0000] rootless container requires user namespaces
If you go back to the default JSON configuration file, you’ll notice that there is no user namespace in it. If you are familiar with our previous article, you will understand what the problem is and how to circumvent it: the user namespace remaps a range of UIDs inside the container onto unprivileged UIDs on the host.
Runc comes with a means to generate rootless configuration files.
cryptonite@qb:~$ runc spec --rootless
# inspect what is different with the previous root config
cryptonite@qb:~$ diff config.json ./bundle/config.json
...
135,148c135,142
<         "uidMappings": [
<             {
<                 "containerID": 0,
<                 "hostID": 1000,
<                 "size": 1
<             }
<         ],
<         "gidMappings": [
<             {
<                 "containerID": 0,
<                 "hostID": 1000,
<                 "size": 1
<             }
<         ],
...
We can see that runc added two fields to the file, indicating how to remap the user and group of the process inside the container. By default it maps them to the UID/GID of the user running the command. Let’s now create a new container with the new runtime specification.
# first change the ownership of the bundle files
cryptonite@qb:~$ sudo chown -R $(id -u) bundle
# overwrite the old specification
cryptonite@qb:~$ mv config.json bundle/config.json
cryptonite@qb:~$ runc create -b bundle --console-socket $(pwd)/tty.sock rootless-crypt0n1t3
# no error - yay; let's run it
cryptonite@qb:~$ runc start rootless-crypt0n1t3
Now we can move to the recvtty terminal and inspect in detail what runc has created.
/ # id
uid=0(root) gid=0(root) groups=65534(nobody),65534(nobody),65534(nobody),65534(nobody),65534(nobody),65534(nobody),65534(nobody),65534(nobody),65534(nobody),0(root)
/ # ls -al
total 64
drwx------   19 root    root    4096 Jul 11  2019 .
drwx------   19 root    root    4096 Jul 11  2019 ..
drwxr-xr-x    2 root    root    4096 Jul 11  2019 bin
drwxr-xr-x    5 root    root     360 Mar 16 11:54 dev
drwxr-xr-x   15 root    root    4096 Jul 11  2019 etc
drwxr-xr-x    2 root    root    4096 Jul 11  2019 home
drwxr-xr-x    5 root    root    4096 Jul 11  2019 lib
drwxr-xr-x    5 root    root    4096 Jul 11  2019 media
drwxr-xr-x    2 root    root    4096 Jul 11  2019 mnt
drwxr-xr-x    2 root    root    4096 Jul 11  2019 opt
dr-xr-xr-x  389 nobody  nobody     0 Mar 16 11:54 proc
drwx------    2 root    root    4096 Mar 16 10:25 root
drwxr-xr-x    2 root    root    4096 Jul 11  2019 run
drwxr-xr-x    2 root    root    4096 Jul 11  2019 sbin
drwxr-xr-x    2 root    root    4096 Jul 11  2019 srv
dr-xr-xr-x   13 nobody  nobody     0 Mar 16 08:35 sys
drwxrwxr-x    2 root    root    4096 Jul 11  2019 tmp
drwxr-xr-x    7 root    root    4096 Jul 11  2019 usr
drwxr-xr-x   11 root    root    4096 Jul 11  2019 var
This looks really similar to the user namespace setup shown in the previous article, right? Let’s check the namespaces and user IDs from both “points of view”.
# check within the container
/ # ls -al /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Mar 16 11:59 .
dr-xr-xr-x 9 root root 0 Mar 16 11:59 ..
lrwxrwxrwx 1 root root 0 Mar 16 11:59 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 ipc -> ipc:[4026532672]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 mnt -> mnt:[4026532669]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 net -> net:[4026532008]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 pid -> pid:[4026532673]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 pid_for_children -> pid:[4026532673]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 time -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 time_for_children -> time:[4026531834]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 user -> user:[4026532663]
lrwxrwxrwx 1 root root 0 Mar 16 11:59 uts -> uts:[4026532671]
# check from the root user namespace
cryptonite@qb:~$ ls /proc/$$/ns -al
total 0
dr-x--x--x 2 cryptonite cryptonite 0 mars 16 13:02 .
dr-xr-xr-x 9 cryptonite cryptonite 0 mars 16 11:36 ..
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 net -> 'net:[4026532008]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 time -> 'time:[4026531834]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 user -> 'user:[4026531837]'
lrwxrwxrwx 1 cryptonite cryptonite 0 mars 16 13:02 uts -> 'uts:[4026531838]'
# check uid of the process in the container (owner field)
cryptonite@qb:~$ runc list
ID                    PID     STATUS    BUNDLE                                                          CREATED                          OWNER
rootless-crypt0n1t3   19104   running   /home/cryptonite/docker-security/internals/playground/bundle   2022-03-16T11:54:10.930816557Z   cryptonite
# double check
cryptonite@qb:~$ ps aux | grep 19104
crypton+   19104  0.0  0.0   1632  1124 ?        Ss+  12:54   0:00 sh
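The remapping can also be confirmed from the host through the uid_map file (a quick check; the PID comes from the listing above, and the three columns read: ID inside the namespace, ID on the host, range length):
cryptonite@qb:~$ cat /proc/19104/uid_map
         0       1000          1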
We can see that the classical user namespace mechanics apply as expected. If you wish to learn more about this interesting namespace, please refer to our previous article. All of the manipulations described in the root containers part apply to rootless containers as well. The main and most important difference is that system-wide operations are performed with a non-privileged UID, hence there is less security risk for the host system.
A word on security
Runc is a powerful tool, usually driven by higher-level runtimes. Because it is a pretty low-level runtime, it doesn’t enable many security enhancement features out of the box, such as seccomp, SELinux or AppArmor. Nevertheless, runc has native support for these mechanisms; they are simply not part of the default configuration. Higher-level runtimes running on top of runc usually take care of this type of configuration.
# no AppArmor
cryptonite@qb:~$ cat /proc/19104/attr/current
unconfined
# no Seccomp
cryptonite@qb:~$ cat /proc/19104/status | grep Seccomp
Seccomp:            0
Seccomp_filters:    0
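Enabling seccomp, for instance, is only a matter of adding a section to config.json. A minimal sketch (the allow-list below is deliberately tiny and would have to be extended for any real workload):
"linux": {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                "names": ["read", "write", "exit", "exit_group", "rt_sigreturn"],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }
}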
Greetings
I would like to thank some key members of the team for helping me to structure and write this article (mahé, pappy, sébastien, erynian). I would also like to thank angie and mahé for proofreading, and last but not least, pappy for guiding me and giving me the chance to grow.
References
- A nice runc introductory video
- The tool that really runs your containers: deep dive into runc and OCI specifications (KIRILL SHIRINKIN)
- Demystifying containers part II: container runtimes (Sascha Grunert)
- Container live migration with CRIU and runc (Adrian Reber)
- Checkpointing and restoring Docker containers with CRIU (Hirokuni Kim)
- A nice post on Linux Control Groups and Process Isolation (Petros Koutoupis)
- Don't Let Linux Control Groups Run Uncontrolled (Zhenyun Zhuang)
- Managing Containers in runc (Andrej Yemelianov)
- Simple rootless containers with runc on CentOS and RedHat (Murat Kilic)
- The original OCI runtime configuration specification