I recently built a Docker clone from scratch in Go. This made me wonder: how hard would it be to do the same, step by step, in a terminal? Let’s find out!
Safety warning
If you do decide to follow along, I’d highly recommend setting up a Linux virtual machine. I’ll be running a bunch of privileged commands, and I’d like to avoid unintentionally nuking my readers’ systems.
With the warning out of the way, let’s get into it!
Container filesystem
I’ll keep this section brief; for a deeper explanation of container filesystems, especially overlayFS, check out my previous post. In essence, we create a directory structure for our container, download the Alpine minirootfs, and mount it with overlayFS.
# create folder structure in a temporary directory
mkdir -p /tmp/container-1/{lower,upper,work,merged}
cd /tmp/container-1
# download alpine minirootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C lower
# mount overlayFS, our container root will be in /tmp/container-1/merged
sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
After we run this, we should have a directory structure like this:
michal@michal-lg:/tmp/container-1$ ls
alpine-minirootfs-3.20.3-x86_64.tar.gz lower merged upper work
The container itself will use /tmp/container-1/merged as the root of its filesystem:
michal@michal-lg:/tmp/container-1/merged$ ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
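As a quick sanity check, any file we create in merged/ should appear in the upper (writable) layer while lower/ stays untouched. The file name here is just a throwaway example:
# writes to the merged view land in the upper layer
sudo touch merged/hello-overlay
ls upper/    # should list hello-overlay; lower/ is unchanged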
Control groups
Let’s restrict the resource consumption of this container to, say, 100m CPU (a tenth of a core) and 500 MiB of memory.
Setting up cgroups is super easy:
# make a new cgroup slice and a child cgroup for our container
sudo mkdir -p /sys/fs/cgroup/toydocker.slice/container-1
cd /sys/fs/cgroup/toydocker.slice/
# enable modifying cpu and memory for the child cgroup
sudo -- sh -c 'echo "+memory +cpu" > cgroup.subtree_control'
cd container-1
# set max cpu usage to 10%
sudo -- sh -c 'echo "10000 100000" > cpu.max'
# set memory limit to 500 MiB
sudo -- sh -c 'echo "500M" > memory.max'
# Disable swap
sudo -- sh -c 'echo "0" > memory.swap.max'
The cpu.max syntax is a bit unusual, but it means that out of every 100 000 time units (microseconds), this cgroup can consume 10 000. If we instead wanted to limit the cgroup to 2 CPUs, it would be 200 000 out of 100 000.
Interestingly, the cpu.max rule doesn’t restrict the process to a single physical core. So on a 4 core machine, it’s fine if a process uses 2500 time units on each of cores 0, 1, 2, 3, since the total is 10 000. For limiting which physical cores may be used, cpusets can be used.
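For example, a rough sketch of pinning our container to cores 0 and 1 could look like this (assuming the cpuset controller is available and enabled for the parent cgroup):
# enable the cpuset controller for children of the slice
sudo -- sh -c 'echo "+cpuset" > /sys/fs/cgroup/toydocker.slice/cgroup.subtree_control'
# restrict container-1 to physical cores 0 and 1
sudo -- sh -c 'echo "0-1" > /sys/fs/cgroup/toydocker.slice/container-1/cpuset.cpus'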
We can see that when we created the cgroup, its control files were automatically created with default values.
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ ls
cgroup.controllers cpu.pressure memory.numa_stat
cgroup.events cpu.stat memory.oom.group
cgroup.freeze cpu.stat.local memory.peak
cgroup.kill cpu.uclamp.max memory.pressure
cgroup.max.depth cpu.uclamp.min memory.reclaim
cgroup.max.descendants cpu.weight memory.stat
cgroup.pressure cpu.weight.nice memory.swap.current
cgroup.procs io.pressure memory.swap.events
cgroup.stat memory.current memory.swap.high
cgroup.subtree_control memory.events memory.swap.max
cgroup.threads memory.events.local memory.swap.peak
cgroup.type memory.high memory.zswap.current
cpu.idle memory.low memory.zswap.max
cpu.max memory.max memory.zswap.writeback
cpu.max.burst memory.min
Let’s check that the ones we modified took effect:
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat cpu.max
10000 100000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.max
524288000
michal@michal-lg:/sys/fs/cgroup/toydocker.slice/container-1$ cat memory.swap.max
0
Looks good; memory.max reports our limit in bytes (500 × 1024 × 1024 = 524288000). Next, let’s see how we can put a process into the cgroup and further isolate it via namespaces.
Namespaces
Let’s first answer the why of namespaces, then we can see how they are used.
If cgroups are the main mechanism for restricting resource usage, namespaces are the main mechanism for isolating resources themselves.
Let’s take filesystem mounts as an example. When we mount a new filesystem on the host, it’s visible to all processes, so we need to be aware of what other mounts exist on the system to avoid clashes. With a mount namespace, each process can make filesystem changes as it wishes without affecting any process outside of that namespace.
The same idea extends to other resources: networking, inter-process communication, process ids, users, etc.
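You can actually see which namespaces a process belongs to under /proc, where each namespace type shows up as a symlink:
# namespaces of the current shell (cgroup, ipc, mnt, net, pid, uts, ...)
ls -l /proc/$$/ns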
With motivation out of the way, let’s see it in action.
# enter interactive root
sudo -i
# Add current process to cgroup
echo $$ > /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs
# Create new namespaces
unshare \
--uts \
--pid \
--mount \
--mount-proc \
--net \
--ipc \
--cgroup \
--fork \
/bin/bash
This piece of code is a little arcane, mostly because I want to keep everything in a single terminal.
First, we enter an interactive root shell, because the next two commands have to run from the same shell and both need root privileges.
michal@michal-lg:~$ # enter interactive root
sudo -i
[sudo] password for michal:
root@michal-lg:~#
The second command adds the current shell process to the cgroup we created earlier. Any children of this process will also automatically be added to the cgroup.
root@michal-lg:~# echo $$
28156
root@michal-lg:~# echo $$ > /sys/fs/cgroup/toydocker.slice/container-1/cgroup.procs
When we do this, the current shell is part of the cgroup and all the CPU and memory restrictions we set up earlier already apply.
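To double-check, /proc/self/cgroup should now point at our cgroup:
# verify the shell has joined the cgroup
cat /proc/self/cgroup    # expect something like 0::/toydocker.slice/container-1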
Next, unshare creates the new namespaces, forks the current process, and runs a bash shell inside them. You can learn more about the unshare command in its man pages.
root@michal-lg:~# unshare \
--uts \
--pid \
--mount \
--mount-proc \
--net \
--ipc \
--cgroup \
--fork \
/bin/bash
root@michal-lg:~#
This looks unremarkable, but we have essentially created a container through cgroup and namespace isolation. Let’s test that the UTS namespace is working correctly by changing the hostname and seeing that it doesn’t change on the host.
Container terminal:
root@michal-lg:~# hostname
michal-lg
root@michal-lg:~# hostname mycontainer
root@michal-lg:~# hostname
mycontainer
root@michal-lg:~#
Host terminal:
michal@michal-lg:~$ hostname
michal-lg
Since we also used the `pid` namespace, `/bin/bash` should now have a process id of 1. Let’s verify from the container:
root@michal-lg:~# ps
PID TTY TIME CMD
1 pts/1 00:00:00 bash
32 pts/1 00:00:00 ps
And let’s see what the real process id is from the host’s perspective.
michal@michal-lg:~$ ps -ef | grep -i /bin/bash
root 8952 8932 0 16:10 pts/1 00:00:00 unshare --uts --pid --mount --mount-proc --net --ipc --cgroup --fork /bin/bash
root 8953 8952 0 16:10 pts/1 00:00:00 /bin/bash
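We can also inspect from the host which namespaces that bash process ended up in, using the pid from the ps output above:
# list the namespaces of the container's shell from the host's perspective
sudo lsns -p 8953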
There are some post-processing steps that a container runtime would do at this point before launching the user’s application. Let’s go through those next.
Container-side setup
First, we isolate the container from the host filesystem by changing the root using the pivot_root command.
pivot_root is a safer equivalent to chroot /tmp/container-1/merged, used by container runtimes to avoid breakout exploits. Security is not my expertise, so I’ll link to this article explaining how these exploits work and how pivot_root prevents them.
root@michal-lg:~# cd /tmp/container-1/merged
mount --make-rprivate /
mkdir old_root
pivot_root . old_root
umount -l /old_root
rm -rf /old_root
root@michal-lg:/tmp/container-1/merged#
Making the root mount private prevents the container’s mount changes from propagating to the host’s mount table, which could again be used for exploits.
In my terminal, I need to run “cd ..” to refresh the shell’s state after deleting the old root. Since the old root is gone, binaries resolved through the host’s PATH (like /usr/bin/ls) no longer exist.
But since we are now in the `/tmp/container-1/merged` directory and this filesystem is based on Alpine minirootfs, we have basic utilities in the `bin` directory.
root@michal-lg:/tmp/container-1/merged# cd ..
root@michal-lg:/# ls
bash: /usr/bin/ls: No such file or directory
root@michal-lg:/# /bin/ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Let’s also set up the basic devices we’ll need later and mount some useful virtual filesystems:
# create basic device nodes
mknod -m 666 dev/null c 1 3
mknod -m 666 dev/zero c 1 5
mknod -m 666 dev/tty c 5 0
# mount pseudo-terminals, shared memory, sysfs, a tmpfs for /run, and procfs
/bin/mkdir -p dev/{pts,shm}
/bin/mount -t devpts devpts dev/pts
/bin/mount -t tmpfs tmpfs dev/shm
/bin/mount -t sysfs sysfs sys/
/bin/mount -t tmpfs tmpfs run/
/bin/mount -t proc proc proc/
For instance, if we didn’t mount `proc`, we wouldn’t have access to process information and running commands that depend on reading process info would fail:
root@michal-lg:/# top
top: no process info in /proc
After the mount, things work correctly again.
Mem: 7560280K used, 8661696K free, 161756K shrd, 135464K buff, 2364264K cached
CPU: 0% usr 0% sys 0% nic 98% idle 0% io 0% irq 0% sirq
Load average: 0.30 0.38 0.37 1/1233 64
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 12896 0% 6 0% /bin/bash
64 1 root R 1624 0% 9 0% top
At this point, we could configure networking, export environment variables, etc. For our minimal purposes, we are done and it’s time to launch the user’s application!
Let’s suppose the user wanted to run a simple interactive shell. We can launch it like this:
exec /bin/busybox sh
I use busybox since it provides a minimal shell and basic utilities, and it ships in the Alpine minirootfs. Using exec replaces the current process with the new one, since we don’t need to keep the outer shell around.
root@michal-lg:/# exec /bin/busybox sh
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ #
Right now, we are roughly where we’d be if we ran the following docker command (setting --memory-swap equal to --memory is how docker disables swap for the container):
michal@michal-lg:~$ docker run -it --cpus="0.1" --memory="500m" --memory-swap="500m" --entrypoint /bin/sh --rm alpine
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ #
Using the container
As a final step, let’s check that the cgroup limits we set earlier actually work.
First, I’ll run a CPU-intensive task that should use 100% of a single CPU core
/ # while true; do true; done
and open a terminal on the host to see the real CPU utilization of this process. First, I find the process id:
michal@michal-lg:~$ ps -ef | grep -i busybox
root 8953 8952 0 16:10 pts/1 00:00:07 /bin/busybox sh
And then use `top` to verify that CPU usage doesn’t exceed 10%.
michal@michal-lg:~$ top -p 8953
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8953 root 20 0 1696 1024 896 R 10.0 0.0 0:15.25 busybox
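Another way to see the limit kick in is the cgroup’s own accounting: while the busy loop runs, the throttling counters in cpu.stat keep growing.
# from the host: nr_throttled / throttled_usec should keep increasing
cat /sys/fs/cgroup/toydocker.slice/container-1/cpu.stat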
Similarly, for memory, I run tail to keep reading from /dev/zero. tail reads into an in-memory buffer that will shortly exceed our 500 MiB memory limit, at which point the cgroup memory controller will kill the process.
/ # tail /dev/zero
Killed
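From the host, we can confirm it was the cgroup’s memory controller that did the killing: the oom_kill counter in memory.events should have incremented.
# from the host: the cgroup records the OOM kill
cat /sys/fs/cgroup/toydocker.slice/container-1/memory.events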
We can now exit the container and clean up by unmounting the overlay at /tmp/container-1/merged:
michal@michal-lg:/tmp/container-1$ sudo umount merged
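Optionally, we can also remove the cgroups we created; an empty cgroup directory can be removed with rmdir once it no longer contains any processes:
sudo rmdir /sys/fs/cgroup/toydocker.slice/container-1
sudo rmdir /sys/fs/cgroup/toydocker.slice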
And that’s it! We’ve created a container from scratch in a terminal.
Conclusion
The main takeaway should be that containers aren’t magic. They are not virtual machines. They are an awesome feature baked into the Linux kernel for isolating processes. They achieve this isolation through cgroups and namespaces.
You can see the full list of commands on my GitHub.
I hope you learned something new! If you did, consider subscribing! I’m also always happy to connect with readers on LinkedIn and BlueSky.
If you enjoyed the post, chances are you’ll enjoy my other writing too! All my posts are backed by a substantial deep-dive into the given problem space.
Perhaps check out my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my introduction to CUDA programming?