Primer on Linux container filesystems

Building a container filesystem by hand

Nov 16, 2024

I spent the weekend building a toy Docker clone. One question going into this project was how each container gets its own filesystem. Let’s first reverse-engineer what Docker does and then replicate it ourselves.

I’ll begin by starting a shell in a Docker container using the Alpine image.

michal@michal-lg:~$ docker run -it --entrypoint /bin/sh --rm --name "alpine-container" alpine
/ # ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var
/ # hostname
a7cbf0aea1ad
/ # cd home && ls

We are in a separate filesystem and it’s pretty empty. Let’s make a file.

/ # echo -e "Hello there\nGeneral Kenobi" > /home/hello_there.txt
/ # cat /home/hello_there.txt 
Hello there
General Kenobi

What do you think, can we access this file from the host?
…

…

…
We can - let’s find it!

By default, Docker stores its data under /var/lib/docker, so we can start looking there from a second terminal:

root@michal-lg:/var/lib/docker# find -name hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/diff/home/hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/merged/home/hello_there.txt
root@michal-lg:/var/lib/docker#

Curiously, we found the file twice in two different directories. Let’s see what else is in the diff and merged directories.

root@michal-lg:/var/lib/docker/.../diff# ls
home  root
root@michal-lg:/var/lib/docker/.../merged# ls
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

The diff directory contains only an empty root directory and a home directory with the file we created. The contents of merged exactly match the container’s filesystem.
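Instead of searching with find, we can also ask Docker where these directories live. This is a minimal sketch, assuming the overlay2 storage driver, whose inspect output lists LowerDir, UpperDir, MergedDir and WorkDir under GraphDriver.Data. The helper name container_layer_dir is my own invention, not part of Docker.

```shell
# container_layer_dir NAME KEY
# Prints one of the overlay directories Docker reports for a container.
# KEY is one of: LowerDir, UpperDir, MergedDir, WorkDir.
# Hypothetical helper name; assumes the overlay2 storage driver.
container_layer_dir() {
    docker inspect --format "{{ .GraphDriver.Data.$2 }}" "$1"
}

# Usage, while the container from above is still running:
#   container_layer_dir alpine-container UpperDir    # the "diff" directory
#   container_layer_dir alpine-container MergedDir   # the "merged" directory
```

Reading the directories still requires root, just like the find command above.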

Overlayfs

Docker uses the overlayfs filesystem. Overlayfs lets us combine two file trees, “lower” and “upper”, into a combined view, “merged”. Docker calls the “upper” file tree “diff”, which is perhaps more fitting, so that is what I’ll call it from here on.

Usage of union filesystems like overlayfs comes from an interesting observation: Often we want to run multiple containers on a single host. Chances are, these containers might share the lower layer - be it Alpine, Ubuntu, or a more specialized one like golang.

Because the lower layer is read-only, multiple containers can share it. Changes are only written to the upper layer. Let’s walk through what happened when I created hello_there.txt earlier.

When we created hello_there.txt in /home, it was written to diff, and overlayfs presented the combined view in merged.
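To make the lookup rule concrete, here is a user-space illustration of how a path in the combined view is resolved: the diff copy wins, otherwise we fall back to the lower layer. This is only a sketch of the idea, not how the kernel implements it. It ignores whiteouts and directory merging, and overlay_resolve and the demo paths are made-up names.

```shell
# overlay_resolve LOWER UPPER RELPATH
# Prints the path that backs RELPATH in the combined view:
# the upper (diff) copy wins, otherwise we fall back to the lower layer.
# User-space illustration only; ignores whiteouts and directory merging.
overlay_resolve() {
    lower=$1 upper=$2 rel=$3
    if [ -e "$upper/$rel" ] || [ -L "$upper/$rel" ]; then
        printf '%s\n' "$upper/$rel"
    elif [ -e "$lower/$rel" ] || [ -L "$lower/$rel" ]; then
        printf '%s\n' "$lower/$rel"
    else
        return 1
    fi
}

# Demo with throwaway directories standing in for the two layers.
demo=$(mktemp -d)
mkdir -p "$demo/lower/etc" "$demo/lower/home" "$demo/upper/home"
echo "from the image" > "$demo/lower/etc/motd"
echo "Hello there" > "$demo/upper/home/hello_there.txt"

overlay_resolve "$demo/lower" "$demo/upper" etc/motd              # backed by lower
overlay_resolve "$demo/lower" "$demo/upper" home/hello_there.txt  # backed by upper
```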

What if we modify something in the lower layer? Let’s rename /bin/echo to /bin/echo.old.

/ # cd bin
/bin # mv echo echo.old
/bin # ls
...
chgrp          echo.old       gzip           ln             mount          printenv       setserial      umount
...

As mentioned, the lower layer is read-only so the only place where the file is actually modified is in diff.

root@michal-lg:/var/lib/docker/overlay2/.../diff/bin# ls -l
total 0
c--------- 1 root root 0, 0 Nov 11 23:06 echo
lrwxrwxrwx 1 root root   12 Sep  6 13:34 echo.old -> /bin/busybox

There are two files! One for the no-longer-existing echo and one for echo.old. Overlayfs uses special whiteout files to record the deletion of a file: a character device with device number 0/0, which is what the leading c and the 0, 0 in the listing show. When overlayfs sees a whiteout, it knows not to include that name in the merged view.

The second file is much less interesting: it’s the renamed echo, which turns out to be just a symbolic link to busybox.
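If you want to spot whiteouts in a diff directory yourself, a small check like the one below does the job. It is a sketch that assumes GNU coreutils stat, where %t and %T print the major and minor device numbers in hex. is_whiteout is a hypothetical helper name.

```shell
# is_whiteout PATH
# Succeeds if PATH looks like an overlayfs whiteout: a character device
# whose major and minor device numbers are both 0.
# Assumes GNU coreutils stat (%t = major, %T = minor, both in hex).
is_whiteout() {
    [ -c "$1" ] && [ "$(stat -c '%t,%T' "$1")" = "0,0" ]
}

# Usage (run as root against a container's diff directory):
#   find diff -type c | while IFS= read -r f; do
#       is_whiteout "$f" && echo "deleted in container: $f"
#   done
```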

Creating container filesystem

Finally, let’s see how Docker uses Overlayfs under the hood to create a new filesystem.

First, let’s create temporary directories in /tmp/ where we’ll set up a container filesystem manually.

michal@michal-lg:/tmp$ mkdir -p /tmp/container-demo/{diff,merged,work}
michal@michal-lg:/tmp$ ls container-demo/
diff  merged  work

We’ve seen diff and merged before. work is used by overlayfs as a scratchpad, and we don’t need to care about it beyond giving it an empty directory on the same filesystem as diff.

Next, download the Alpine minirootfs for your CPU architecture and extract it to /tmp/container-demo/. I’ll rename the extracted folder to “alpine”.
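If you prefer to do the download from the terminal, something like the following should work. It is a sketch built on assumptions: the URL layout is the one I believe the Alpine CDN uses, the version and architecture are only examples, and minirootfs_url is a made-up helper.

```shell
# minirootfs_url VERSION ARCH
# Builds a download URL for an Alpine minirootfs tarball.
# Assumption: releases are published under
#   /alpine/v<major.minor>/releases/<arch>/ on dl-cdn.alpinelinux.org.
minirootfs_url() {
    version=$1 arch=$2
    branch=${version%.*}   # "3.20.3" -> "3.20"
    printf 'https://dl-cdn.alpinelinux.org/alpine/v%s/releases/%s/alpine-minirootfs-%s-%s.tar.gz\n' \
        "$branch" "$arch" "$version" "$arch"
}

# Usage (example version and architecture; pick the ones you need):
#   url=$(minirootfs_url 3.20.3 x86_64)
#   curl -fsSLO "$url"
#   mkdir alpine && tar -xzf "${url##*/}" -C alpine
```

Extracting with -C alpine puts the files straight into a folder with the name we want, so there is nothing to rename afterwards.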

michal@michal-lg:/tmp/container-demo$ ls
alpine  diff  merged  work

Now that everything is set up, we can mount the overlayfs filesystem.

michal@michal-lg:/tmp/container-demo$ sudo mount -t overlay  overlay  -o lowerdir=alpine,upperdir=diff,workdir=work  merged
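To double-check that the mount succeeded, and to clean up afterwards, a small helper along these lines can be handy. It is a sketch that reads /proc/self/mounts. is_overlay_mount is a hypothetical name, and it expects an absolute path without spaces.

```shell
# is_overlay_mount ABSOLUTE_PATH
# Succeeds if an overlay filesystem is mounted exactly at ABSOLUTE_PATH.
# /proc/self/mounts fields are: device, mount point, filesystem type, ...
# Limitation: spaces in mount points are escaped there (\040) and won't match.
is_overlay_mount() {
    awk -v target="$1" '
        $2 == target && $3 == "overlay" { found = 1 }
        END { exit !found }
    ' /proc/self/mounts
}

# Usage:
#   is_overlay_mount /tmp/container-demo/merged && echo "overlay is mounted"
#   sudo umount /tmp/container-demo/merged    # tear it down when finished
```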

Now if we list the contents of merged, we’ll see the Alpine filesystem:

michal@michal-lg:/tmp/container-demo/merged$ ls
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr

And if we create a file in there, it will be written to the diff folder.

michal@michal-lg:/tmp/container-demo/merged$ echo hello > hello.txt
michal@michal-lg:/tmp/container-demo/merged$ ls
bin  etc        home  media  opt   root  sbin  sys  usr
dev  hello.txt  lib   mnt    proc  run   srv   tmp  var
michal@michal-lg:/tmp/container-demo/merged$ ls ../diff/
hello.txt

As a final step, we can start a new shell process with merged as its root directory using chroot. This is why a process in a container can only see its own filesystem. Any process spawned by this shell inherits the root, so it is confined to this filesystem too.

michal@michal-lg:/tmp/container-demo$ sudo chroot merged /bin/sh
/ # ls
bin        hello.txt  media      proc       sbin       tmp
dev        home       mnt        root       srv        usr
etc        lib        opt        run        sys        var
/ # cd ..
/ #

A full implementation would take advantage of namespaces for additional isolation. You can learn more about those in the Linux man-pages.
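As a taste of what that looks like, the unshare tool from util-linux can start the same chroot shell inside fresh namespaces. This is a rough sketch, not what Docker does internally. run_contained is a hypothetical helper, and the choice of mount, UTS and PID namespaces is just an example.

```shell
# run_contained ROOTFS
# Starts /bin/sh with ROOTFS as its root directory, inside new mount,
# UTS (hostname) and PID namespaces. Hypothetical helper; needs root.
run_contained() {
    sudo unshare --mount --uts --pid --fork chroot "$1" /bin/sh
}

# Usage:
#   run_contained /tmp/container-demo/merged
```

Inside that shell, echo $$ prints 1 because the shell is the first process in its PID namespace, and changing the hostname no longer affects the host.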

Conclusion

Containers are a black box for the vast majority of engineers. You should now have a solid understanding of how their filesystems work under the hood!

If you are interested in a more complete implementation of Docker from scratch, consider checking out my Golang Docker clone, which is around 200 lines of code.

I hope you learned something new! If you did, consider subscribing and/or following me on LinkedIn.

You might also enjoy some of my other posts linked below. Perhaps my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my series building an ML inference engine from scratch?

Thanks for reading Michal’s Deep Dives! This post is public so feel free to share it.

How does SQLite store data?

Michal Pitr

Mar 17

Recently I’ve been implementing a subset of SQLite (the world’s most used database, btw) from scratch in Go. I’ll share what I’ve learned about how SQLite stores data on disk which will help us understand key database concepts.

MapReduce from Scratch

Michal Pitr

Apr 28

Over the last couple of weeks, I’ve been building MapReduce from scratch.

Build Your Own Inference Engine: From Scratch to "7"

Michal Pitr

Aug 4

I like to keep things practical. Let’s train a simple neural network, save the model, and write an inference engine that can execute inputs against the model. Sounds like a fun time to me!
