I recently built a Docker clone from scratch in Go. This made me wonder: how hard would it be to do the same, step by step, in a terminal? Let’s find out!
Safety warning
If you do decide to follow along, I’d highly recommend setting up a Linux virtual machine. I’ll be running a bunch of privileged commands, and I’d like to avoid unintentionally nuking my readers’ systems.
With the warning out of the way, let’s get into it!
Container filesystem
I’ll keep this section brief; for a deeper explanation of container filesystems, especially overlayFS, check out my previous post. In essence, we create a directory structure for our container, download the Alpine minirootfs, and mount it with overlayFS.
# create folder structure in a temporary directory
mkdir -p /tmp/container-1/{lower,upper,work,merged}
cd /tmp/container-1
# download alpine minirootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C lower
# mount overlayFS, our container root will be in /tmp/container-1/merged
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
After we run this, we should have a directory structure like this:
michal@michal-lg:/tmp/container-1$ ls
alpine-minirootfs-3.20.3-x86_64.tar.gz lower merged upper work
The container itself will use /tmp/container-1/merged as the root of its filesystem:
michal@michal-lg:/tmp/container-1/merged$ ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
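As a quick sanity check, any file we create in merged/ should appear in the upper (writable) layer while lower/ stays untouched. The file name here is just a throwaway example:
# writes to the merged view land in the upper layer
sudo touch merged/hello-overlay
ls upper/    # should list hello-overlay; lower/ is unchanged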
Control groups
Let’s restrict the resource consumption of this container to, say, 100m CPU (a tenth of a core) and 500 MiB of memory.
Setting up cgroups is super easy:
# make a new cgroup slice and a child cgroup for our container
sudo mkdir -p /sys/fs/cgroup/toydocker.slice/container-1
cd /sys/fs/cgroup/toydocker.slice/
# enable modifying cpu and memory for the child cgroup
sudo -- sh -c 'echo "+memory +cpu" > cgroup.subtree_control'
cd container-1
# set max cpu usage to 10%
sudo -- sh -c 'echo "10000 100000" > cpu.max'
# set memory limit to 500 MiB
sudo -- sh -c 'echo "500M" > memory.max'
# Disable swap
sudo -- sh -c 'echo "0" > memory.swap.max'
The cpu.max syntax is a bit unusual, but it means that out of every 100 000 time units (microseconds), this cgroup can consume 10 000. If we instead wanted to limit the cgroup to 2 CPUs, it would be 200 000 out of 100 000.
Interestingly, the cpu.max rule doesn’t restrict the process to a single physical core. So on a 4 core machine, it’s fine if a process uses 2500 time units on each of cores 0, 1, 2, 3, since the total is 10 000. For limiting which physical cores may be used, cpusets can be used.
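For example, a rough sketch of pinning our container to cores 0 and 1 could look like this (assuming the cpuset controller is available and enabled for the parent cgroup):
# enable the cpuset controller for children of the slice
sudo -- sh -c 'echo "+cpuset" > /sys/fs/cgroup/toydocker.slice/cgroup.subtree_control'
# restrict container-1 to physical cores 0 and 1
sudo -- sh -c 'echo "0-1" > /sys/fs/cgroup/toydocker.slice/container-1/cpuset.cpus'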
We can see that when we created the cgroup, its control files were automatically created with default values.
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ ls
cgroup.controllers cpu.pressure memory.numa_stat
cgroup.events cpu.stat memory.oom.group
cgroup.freeze cpu.stat.local memory.peak
cgroup.kill cpu.uclamp.max memory.pressure
cgroup.max.depth cpu.uclamp.min memory.reclaim
cgroup.max.descendants cpu.weight memory.stat
cgroup.pressure cpu.weight.nice memory.swap.current
cgroup.procs io.pressure memory.swap.events
cgroup.stat memory.current memory.swap.high
cgroup.subtree_control memory.events memory.swap.max
cgroup.threads memory.events.local memory.swap.peak
cgroup.type memory.high memory.zswap.current
cpu.idle memory.low memory.zswap.max
cpu.max memory.max memory.zswap.writeback
cpu.max.burst memory.min
Let’s check that the ones we modified took effect:
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat cpu.max
10000 100000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.max
524288000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.swap.max
0
Looks good; memory.max reports our limit in bytes (500 × 1024 × 1024 = 524288000). Next, let’s see how we can put a process into the cgroup and further isolate it via namespaces.
Namespaces
Let’s first answer the why of namespaces, then we can see how they are used.
If cgroups are the main mechanism for restricting resource usage, namespaces are the main mechanism for isolating resources themselves.
Let’s take filesystem mounts as an example. When we mount a new filesystem on the host, it’s visible to all processes, so we need to be aware of what other mounts exist on the system to avoid clashes. With a mount namespace, each process can make filesystem changes as it wishes without affecting any process outside of that namespace.
The same idea extends to other resources: networking, inter-process communication, process ids, users, etc.
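You can actually see which namespaces a process belongs to under /proc, where each namespace type shows up as a symlink:
# namespaces of the current shell (cgroup, ipc, mnt, net, pid, uts, ...)
ls -l /proc/$$/ns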
With motivation out of the way, let’s see it in action.
# enter interactive root
sudo -i
# Add current process to cgroup
echo $$ > /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs
# Create new namespaces
unshare \
--uts \
--pid \
--mount \
--mount-proc \
--net \
--ipc \
--cgroup \
--fork \
/bin/bash
This piece of code is a little arcane, mostly because I want to keep everything in a single terminal.
First, we enter an interactive root shell, because the next two commands have to run from the same shell and both need root privileges.
michal@michal-lg:~$ # enter interactive root
sudo -i
[sudo] password for michal:
root@michal-lg:~#
The second command adds the current shell process to the cgroup we created earlier. Any children of this process will also automatically be added to the cgroup.
root@michal-lg:~# echo $$
28156
root@michal-lg:~# echo $$ > /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs
When we do this, the current shell is part of the cgroup and all the CPU and memory restrictions we set up earlier already apply.
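To double-check, /proc/self/cgroup should now point at our cgroup:
# verify the shell has joined the cgroup
cat /proc/self/cgroup    # expect something like 0::/toydocker.slice/container-1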
Next, unshare creates the new namespaces, forks the current process, and runs a bash shell inside them. You can learn more about the unshare command in its man pages.
root@michal-lg:~# unshare \
--uts \
--pid \
--mount \
--mount-proc \
--net \
--ipc \
--cgroup \
--fork \
/bin/bash
root@michal-lg:~#
This looks unremarkable, but we have essentially created a container through cgroup and namespace isolation. Let’s test that the UTS namespace is working correctly by changing the hostname and seeing that it doesn’t change on the host.
Container terminal:
root@michal-lg:~# hostname
michal-lg
root@michal-lg:~# hostname mycontainer
root@michal-lg:~# hostname
mycontainer
root@michal-lg:~#
Host terminal:
michal@michal-lg:~$ hostname
michal-lg
Since we also used the `pid` namespace, `/bin/bash` should now have a process id of 1. Let’s verify from the container:
root@michal-lg:~# ps
PID TTY TIME CMD
1 pts/1 00:00:00 bash
32 pts/1 00:00:00 ps
And let’s see what the real process id is from the host’s perspective.
michal@michal-lg:~$ ps -ef | grep -i /bin/bash
root 8952 8932 0 16:10 pts/1 00:00:00 unshare --uts --pid --mount --mount-proc --net --ipc --cgroup --fork /bin/bash
root 8953 8952 0 16:10 pts/1 00:00:00 /bin/bash
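We can also inspect from the host which namespaces that bash process ended up in, using the pid from the ps output above:
# list the namespaces of the container's shell from the host's perspective
sudo lsns -p 8953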
There are some post-processing steps that a container runtime would do at this point before launching the user’s application. Let’s go through those next.
Container-side setup
First, we isolate the container from the host filesystem by changing the root using the pivot_root command.
pivot_root is a safer equivalent to chroot /tmp/container-1/merged, used by container runtimes to avoid breakout exploits. Security is not my expertise, so I’ll link to this article explaining how these exploits work and how pivot_root prevents them.
root@michal-lg:~# cd /tmp/container-1/merged
mount --make-rprivate /
mkdir old_root
pivot_root . old_root
umount -l /old_root
rm -rf /old_root
root@michal-lg:/tmp/container-1/merged#
Making the root mount private prevents the container’s mount changes from propagating to the host’s mount table, which could again be used for exploits.
In my terminal, I need to run “cd ..” to refresh the shell’s state after deleting the old root. Since the old root is gone, binaries resolved through the host’s PATH (like /usr/bin/ls) no longer exist.
But since we are now in the `/tmp/container-1/merged` directory and this filesystem is based on Alpine minirootfs, we have basic utilities in the `bin` directory.
root@michal-lg:/tmp/container-1/merged# cd ..
root@michal-lg:/# ls
bash: /usr/bin/ls: No such file or directory
root@michal-lg:/# /bin/ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Let’s also set up the basic devices we’ll need later and mount some useful virtual filesystems:
# create basic device nodes
mknod -m 666 dev/null c 1 3
mknod -m 666 dev/zero c 1 5
mknod -m 666 dev/tty c 5 0
# mount pseudo-terminals, shared memory, sysfs, a tmpfs for /run, and procfs
/bin/mkdir -p dev/{pts,shm}
/bin/mount -t devpts devpts dev/pts
/bin/mount -t tmpfs tmpfs dev/shm
/bin/mount -t sysfs sysfs sys/
/bin/mount -t tmpfs tmpfs run/
/bin/mount -t proc proc proc/
For instance, if we didn’t mount `proc`, we wouldn’t have access to process information and running commands that depend on reading process info would fail:
root@michal-lg:/# top
top: no process info in /proc
After the mount, things work correctly again.
Mem: 7560280K used, 8661696K free, 161756K shrd, 135464K buff, 2364264K cached
CPU: 0% usr 0% sys 0% nic 98% idle 0% io 0% irq 0% sirq
Load average: 0.30 0.38 0.37 1/1233 64
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 12896 0% 6 0% /bin/bash
64 1 root R 1624 0% 9 0% top
At this point, we could configure networking, export environment variables, etc. For our minimal purposes, we are done and it’s time to launch the user’s application!
Let’s suppose the user wanted to run a simple interactive shell. We can launch it like this:
exec /bin/busybox sh
I use busybox since it provides a minimal shell and basic utilities, and it ships in the Alpine minirootfs. Using exec replaces the current process with the new one, since we don’t need to keep the outer shell around.
root@michal-lg:/# exec /bin/busybox sh
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ #
Right now, we are roughly where we’d be if we ran the following docker command (setting --memory-swap equal to --memory is how docker disables swap for the container):
michal@michal-lg:~$ docker run -it --cpus="0.1" --memory="500m" --memory-swap="500m" --entrypoint /bin/sh --rm alpine
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ #
Using the container
As a final step, let’s check that the cgroup limits we set earlier actually work.
First, I’ll run a CPU-intensive task that should use 100% of a single CPU core
/ # while true; do true; done
and open a terminal on the host to see the real CPU utilization of this process. First, I find the process id:
michal@michal-lg:~$ ps -ef | grep -i busybox
root 8953 8952 0 16:10 pts/1 00:00:07 /bin/busybox sh
And then use `top` to verify that CPU usage doesn’t exceed 10%.
michal@michal-lg:~$ top -p 8953
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8953 root 20 0 1696 1024 896 R 10.0 0.0 0:15.25 busybox
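Another way to see the limit kick in is the cgroup’s own accounting: while the busy loop runs, the throttling counters in cpu.stat keep growing.
# from the host: nr_throttled / throttled_usec should keep increasing
cat /sys/fs/cgroup/toydocker.slice/container-1/cpu.stat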
Similarly, for memory, I run tail to keep reading from /dev/zero. tail reads into an in-memory buffer that will shortly exceed our 500 MiB memory limit, at which point the cgroup memory controller will kill the process.
/ # tail /dev/zero
Killed
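From the host, we can confirm it was the cgroup’s memory controller that did the killing: the oom_kill counter in memory.events should have incremented.
# from the host: the cgroup records the OOM kill
cat /sys/fs/cgroup/toydocker.slice/container-1/memory.events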
We can now exit the container and clean up by unmounting the overlay at /tmp/container-1/merged:
michal@michal-lg:/tmp/container-1$ sudo umount merged
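Optionally, we can also remove the cgroups we created; an empty cgroup directory can be removed with rmdir once it no longer contains any processes:
sudo rmdir /sys/fs/cgroup/toydocker.slice/container-1
sudo rmdir /sys/fs/cgroup/toydocker.slice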
And that’s it! We’ve created a container from scratch in a terminal.
Conclusion
The main takeaway should be that containers aren’t magic. They are not virtual machines. They are an awesome feature baked into the Linux kernel for isolating processes. They achieve this isolation through cgroups and namespaces.
You can see the full list of commands on my GitHub.
I hope you learned something new! If you did, consider subscribing! I’m also always happy to connect with readers on LinkedIn and BlueSky.
If you enjoyed the post, chances are you’ll enjoy my other writing too! All my posts are backed by a substantial deep-dive into the given problem space.
Perhaps check out my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my introduction to CUDA programming?