In this post, I'll show you how Docker works behind the scenes, how containers are spawned using containerd with runc as the main runtime, and how the program defined in ENTRYPOINT gets started.
Hello, world! So far you have been learning about Docker and how to exploit misconfigurations in containerized environments. Recently, I watched the @LiveOverflow video on containers and namespaces, and it inspired me to dig deeper and learn how containers actually work under the hood. Have some caffeine ready, since this post is going to be a bit long. 😀
👉
Disclaimer!!
The following experiment was done with Docker version 20.10.16 on Arch Linux. The output may vary on other operating systems or Docker versions, but the concepts will be almost the same.
If you are on this blog for the first time, I recommend reading Understanding the Container Architecture first, as this is an advanced article.
At a high level, when a container is started, containerd and runc create the required namespaces using the unshare syscall (the same one the unshare command uses under the hood), switch the root filesystem to a new directory containing all the required files using pivot_root, and finally execute the entry point defined in the image.
From the website of containerd – https://containerd.io/
It manages the complete container lifecycle of its host system, from image transfer and storage to container execution and supervision to low-level storage to network attachments and beyond.
When dockerd receives a request, it formats the information and sends it to containerd, which then sets up a directory for runc containing files such as config.json and options.json.
You will notice that this rootfs directory is empty. The actual root filesystem location is given in the root key of the config.json file. I haven't configured rootless mode with Docker yet, so the engine is using the overlay2 storage driver.
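To make the role of the root key a bit more concrete, here is a minimal Python sketch that reads a config.json-style document and works out where the root filesystem lives. The sample JSON, the bundle directory and the overlay2 path with its `<layer-id>` placeholder are made up for illustration and are not taken from a real container. The one rule it relies on is from the OCI runtime spec: root.path is either absolute or relative to the bundle directory.

```python
import json
import os

# Hypothetical, heavily trimmed config.json; a real one has many more keys.
SAMPLE_CONFIG = """
{
  "ociVersion": "1.0.2",
  "process": {"args": ["/bin/sh"], "cwd": "/"},
  "root": {"path": "/var/lib/docker/overlay2/<layer-id>/merged"}
}
"""


def resolve_rootfs(config_text, bundle_dir):
    """Return the directory the runtime would use as the container's root."""
    config = json.loads(config_text)
    root_path = config["root"]["path"]
    # A relative root.path is interpreted relative to the bundle directory.
    if not os.path.isabs(root_path):
        root_path = os.path.join(bundle_dir, root_path)
    return os.path.normpath(root_path)


if __name__ == "__main__":
    # "/run/hypothetical-bundle" is a placeholder bundle directory.
    print(resolve_rootfs(SAMPLE_CONFIG, "/run/hypothetical-bundle"))
```

This is why an empty rootfs directory inside the bundle is not a problem: with an absolute root.path, the bundle's own rootfs directory is simply never consulted.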
"Merged" here means that all the layers listed in the fsLayers property of the image manifest are combined into one directory, aka the rootfs. So this directory contains everything under the root (/) path of the container.
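As a toy model of what "merged" means, the sketch below stacks layers represented as plain dictionaries, lowest layer first. It is not how overlay2 is implemented (the kernel's overlayfs does the real work at mount time); it only illustrates the rule that upper layers shadow lower ones. The `.wh.` whiteout prefix is borrowed from the OCI image layer format, and the simplification that a whiteout hides only a single entry is mine.

```python
WHITEOUT_PREFIX = ".wh."  # deletion marker used in OCI image layer archives


def merge_layers(layers):
    """Merge layers (lowest first) into a single {path: content} view."""
    merged = {}
    for layer in layers:
        for path, content in layer.items():
            directory, _, name = path.rpartition("/")
            if name.startswith(WHITEOUT_PREFIX):
                # A whiteout hides the same-named entry from lower layers.
                hidden = f"{directory}/{name[len(WHITEOUT_PREFIX):]}"
                merged.pop(hidden, None)
            else:
                # A later (upper) layer overrides an earlier (lower) one.
                merged[path] = content
    return merged


if __name__ == "__main__":
    base = {"/bin/sh": "busybox", "/etc/motd": "hello"}
    app = {"/app/main": "binary", "/etc/.wh.motd": ""}
    print(sorted(merge_layers([base, app])))
```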
Our containerd is busy, so it cannot sit and wait on a single container. Instead, it starts a shim process (you can think of it as a worker) and stays ready for the next container operation. This newly created shim is responsible for running and monitoring the container and for reporting back to containerd (the manager) if the container stops for any reason, so that cleanup can be done. In this case, the shim process name is containerd-shim-runc-v2.
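The division of labour can be sketched in a few lines of Python. This is only an analogy, not the real shim: the actual containerd-shim-runc-v2 talks to containerd over an RPC API, whereas here the hypothetical `report` callback stands in for that channel and an ordinary child process stands in for the container.

```python
import subprocess
import sys


def run_and_report(argv, report):
    """Start a child process, wait for it, then report its exit status."""
    child = subprocess.Popen(argv)
    exit_code = child.wait()  # only the shim blocks here, not the daemon
    report({"pid": child.pid, "exit_code": exit_code})
    return exit_code


if __name__ == "__main__":
    events = []
    # A stand-in "container" that exits immediately with status 3.
    run_and_report([sys.executable, "-c", "raise SystemExit(3)"], events.append)
    print(events)
```

Because the waiting happens in a separate per-container process, the manager stays free to handle other requests and only hears about the container again when an exit event arrives.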
When it comes to actually running the container, runc carries most of the workload and yet is not rewarded with enough discussion, poor dude 😅.
From the GitHub of runc – https://github.com/opencontainers/runc
It is a tool for spawning and running containers on Linux according to the Open Container Initiative specification. Therefore, it is responsible for interpreting the config.json file.
The very first step is to set up a cgroup for each container, identified by the container's hex ID.
It then creates the mount, UTS, System V IPC, PID and network namespaces. After that, it creates the cgroup namespace, which virtualizes the container's view of its cgroups for security and confinement. It also eases tasks such as container migration.
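To connect the namespace names with the syscall level, here is a small Python sketch that turns a list shaped like the linux.namespaces entries in config.json into the flag bitmask that unshare or clone would take. The flag values are the CLONE_NEW* constants from the Linux headers; the helper name and the sample list are my own, and this does not mirror how runc is actually written.

```python
# CLONE_NEW* values from the Linux headers (<linux/sched.h>),
# keyed by the namespace type names used in the OCI runtime spec.
CLONE_FLAGS = {
    "mount": 0x00020000,    # CLONE_NEWNS
    "cgroup": 0x02000000,   # CLONE_NEWCGROUP
    "uts": 0x04000000,      # CLONE_NEWUTS
    "ipc": 0x08000000,      # CLONE_NEWIPC
    "user": 0x10000000,     # CLONE_NEWUSER
    "pid": 0x20000000,      # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
}


def namespace_flags(namespaces):
    """OR together the clone flags for a list of {"type": ...} entries."""
    flags = 0
    for entry in namespaces:
        flags |= CLONE_FLAGS[entry["type"]]
    return flags


if __name__ == "__main__":
    # Hypothetical list in the shape of config.json's linux.namespaces.
    wanted = [{"type": t} for t in ("mount", "uts", "ipc", "pid", "network")]
    print(hex(namespace_flags(wanted)))
```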
Now comes the main part, where runc goes to the "rootfs" directory of the container, performs pivot_root and then starts the program, which is usually the one defined by the image's ENTRYPOINT.
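Which program actually gets started depends on how ENTRYPOINT and CMD combine. The sketch below models the exec-form rule from the Dockerfile reference: CMD supplies default arguments to ENTRYPOINT, and arguments passed after the image name on docker run replace CMD. The function name is hypothetical, and shell-form values and the --entrypoint override are left out.

```python
def build_argv(entrypoint, cmd, run_args=None):
    """Combine exec-form ENTRYPOINT and CMD into the final argument vector."""
    # Arguments given after the image name on `docker run` replace CMD.
    tail = list(run_args) if run_args else list(cmd or [])
    return list(entrypoint or []) + tail


if __name__ == "__main__":
    print(build_argv(["ping"], ["localhost"]))
    print(build_argv(["ping"], ["localhost"], ["example.com"]))
```

The resulting list is what ends up in process.args of config.json, so by the time runc executes the program it no longer has to know anything about ENTRYPOINT or CMD.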
Now if you stop the container, the containerd-shim process reports this to containerd, and runc is used to delete the container.
So we have understood that containers are nothing but confined namespaces whose root is changed by pivot_root. Since the syscalls that runc relies on to set up containers are available only on Linux, a Linux VM is installed on Windows and macOS when setting up Docker Desktop.
The processes in a container run on the host machine, either as the host's root user or as the current user (in rootless mode). Therefore it is not actual virtualization; it is process-level isolation, mainly thanks to the PID namespace.
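One way to see this from the host is to compare namespace identifiers under /proc. The sketch below is a minimal Python helper for that, assuming a Linux host; the function names are mine, and reading the links of another user's process usually needs root, so unreadable entries are reported as None instead of raising.

```python
import os


def namespace_id(pid, kind):
    """Return the link target such as 'pid:[<inode>]', or None if unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/ns/{kind}")
    except OSError:
        return None


def shares_namespace(pid_a, pid_b, kind="pid"):
    """True or False when both IDs are readable, None otherwise."""
    id_a, id_b = namespace_id(pid_a, kind), namespace_id(pid_b, kind)
    if id_a is None or id_b is None:
        return None
    return id_a == id_b


if __name__ == "__main__":
    # Compare this process with PID 1 on the host.
    print(shares_namespace(os.getpid(), 1))
```

Run against the host PID of a containerized process, a False result for "pid" would indicate that the process is an ordinary host process that simply lives in a different PID namespace.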
- https://www.youtube.com/watch?v=-YnMr1lj4Z8&list=PLhixgUqwRTjxtDt2ABuejRxrIFSroqyEY&index=4
- https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html#NOTES
- https://www.threatstack.com/blog/deep-dive-runtimes-kubernetes-cri-and-shims
- https://nanikgolang.netlify.app/post/containers/
- https://iximiuz.com/en/posts/implementing-container-runtime-shim/