I spent the weekend building a toy Docker clone. One question going into this project was how each container gets its own filesystem. Let’s first reverse-engineer what Docker does and then replicate it ourselves.
I’ll start by opening a shell in a Docker container using the Alpine image.
michal@michal-lg:~$ docker run -it --entrypoint /bin/sh --rm --name "alpine-container" alpine
/ # ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/ # hostname
a7cbf0aea1ad
/ # cd home && ls
We are in a separate filesystem and it’s pretty empty. Let’s make a file.
/ # echo -e "Hello there\nGeneral Kenobi" > /home/hello_there.txt
/ # cat /home/hello_there.txt
Hello there
General Kenobi
What do you think, can we access this file from the host?
…
…
…
We can - let’s find it!
By default, Docker stores its data under /var/lib/docker, so we can start looking there from a second terminal:
root@michal-lg:/var/lib/docker# find -name hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/diff/home/hello_there.txt
./overlay2/1557145fe40a1595d090eeafa72c39a7b54cca4791ae9e3ffafabff06466125c/merged/home/hello_there.txt
root@michal-lg:/var/lib/docker#
Curiously, we found the file twice in two different directories. Let’s see what else is in the diff and merged directories.
root@michal-lg:/var/lib/docker/.../diff# ls
home root
root@michal-lg:/var/lib/docker/.../merged# ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
The diff directory only contains an empty root directory and a home directory with the file we created. The contents of merged exactly match the container’s filesystem.
Overlayfs
Docker uses the overlayfs filesystem. Overlayfs lets us combine two file trees, “lower” and “upper”, into a combined view, “merged”. Docker calls the “upper” file tree “diff”, which is perhaps more fitting, and I’ll refer to it as such.
The use of union filesystems like overlayfs comes from an interesting observation: we often want to run multiple containers on a single host. Chances are these containers share the same lower layer, be it Alpine, Ubuntu, or a more specialized one like golang.
By making the lower layer read-only, multiple containers can share it. Changes are only written to the upper layer. Let’s illustrate what happened when I earlier created hello_there.txt. I’m not expanding all folders to avoid clutter.
When we created hello_there.txt in /home, it was written to diff, and overlayfs constructed a combined view in merged.
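The lookup rule behind the merged view can be sketched in a few lines of shell. This is a simplified model for single files, not how the kernel implements it: resolve_path is a hypothetical helper, and it ignores directory merging and deletions.

```shell
# Simplified model of an overlayfs lookup for a single file.
# resolve_path is a hypothetical helper, not part of Docker or overlayfs.
#   $1 = lower dir (read-only image), $2 = upper dir ("diff"), $3 = relative path
resolve_path() {
    lower=$1 upper=$2 rel=$3
    if [ -e "$upper/$rel" ] || [ -L "$upper/$rel" ]; then
        # The container created or modified this file: the upper copy wins.
        printf '%s\n' "$upper/$rel"
    elif [ -e "$lower/$rel" ] || [ -L "$lower/$rel" ]; then
        # Untouched file: served straight from the shared lower layer.
        printf '%s\n' "$lower/$rel"
    else
        return 1  # not present in either layer
    fi
}

# Example (illustrative paths):
#   resolve_path /tmp/lower /tmp/upper home/hello_there.txt
```

For hello_there.txt this would print the path under diff, while an untouched file such as /etc/os-release would resolve to the shared lower layer.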
What if we modify something in the lower layer? Let’s rename /bin/echo to /bin/echo.old.
/ # cd bin
/bin # mv echo echo.old
/bin # ls
...
chgrp echo.old gzip ln mount printenv setserial umount
...
As mentioned, the lower layer is read-only, so the only place where the file is actually modified is in diff.
root@michal-lg:/var/lib/docker/overlay2/.../diff/bin# ls -l
total 0
c--------- 1 root root 0, 0 Nov 11 23:06 echo
lrwxrwxrwx 1 root root 12 Sep 6 13:34 echo.old -> /bin/busybox
There are two files! One for the no-longer-existing echo and one for echo.old. Overlayfs uses special whiteout files to record the deletion of a file. A whiteout is a character device with device number 0/0, which is why ls shows it as c--------- with 0, 0. When overlayfs sees such a file in diff, it knows not to include the file of that name in the merged view.
The second file is much less interesting: it’s the renamed echo, which turns out to be just a symbolic link to busybox.
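The mode c--------- and the device numbers 0, 0 in the listing above are what mark echo as a whiteout. A minimal check could look like the sketch below. It assumes GNU coreutils stat (for the -c option), and is_whiteout is a hypothetical helper name.

```shell
# Return success if $1 is an overlayfs whiteout: a character device
# whose major and minor device numbers are both 0.
# is_whiteout is a hypothetical helper; assumes GNU stat's -c option.
is_whiteout() {
    # %t and %T print the device major and minor numbers in hex.
    [ -c "$1" ] && [ "$(stat -c '%t,%T' "$1")" = "0,0" ]
}

# Example, run as root from the container's diff/bin directory:
#   is_whiteout echo && echo "echo is a whiteout"
```

An ordinary character device such as /dev/null does not match, because its device numbers are not 0/0.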
Creating container filesystem
Finally, let’s see how Docker uses Overlayfs under the hood to create a new filesystem.
First, let’s create temporary directories in /tmp/ where we’ll set up a container filesystem manually.
michal@michal-lg:/tmp$ mkdir -p /tmp/container-demo/{diff,merged,work}
michal@michal-lg:/tmp$ ls container-demo/
diff merged work
We’ve seen diff and merged before. work is used by overlayfs as an internal scratchpad, and we don’t need to care about it beyond keeping it on the same filesystem as diff.
Next, download the Alpine minirootfs for your CPU architecture and extract it to /tmp/container-demo/. I’ll rename the extracted folder to “alpine”.
michal@michal-lg:/tmp/container-demo$ ls
alpine diff merged work
Now that everything is set up, we can mount the overlayfs filesystem.
michal@michal-lg:/tmp/container-demo$ sudo mount -t overlay overlay -o lowerdir=alpine,upperdir=diff,workdir=work merged
Now if we list the contents of merged, we’ll see the Alpine filesystem:
michal@michal-lg:/tmp/container-demo/merged$ ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
And if we create a file in there, it will be written to the diff folder.
michal@michal-lg:/tmp/container-demo/merged$ echo hello > hello.txt
michal@michal-lg:/tmp/container-demo/merged$ ls
bin etc home media opt root sbin sys usr
dev hello.txt lib mnt proc run srv tmp var
michal@michal-lg:/tmp/container-demo/merged$ ls ../diff/
hello.txt
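The manual steps above can be collected into a pair of small helpers. This is only a sketch: overlay_up and overlay_down are hypothetical names, the functions must run as root, and paths containing commas or colons would break the option string.

```shell
# Mount an overlay. Hypothetical helper; needs root to succeed.
#   $1 = read-only lower dir (e.g. the extracted Alpine rootfs)
#   $2 = base dir that will hold diff/, work/ and merged/
overlay_up() {
    lower=$1 base=$2
    mkdir -p "$base/diff" "$base/work" "$base/merged" || return 1
    mount -t overlay overlay \
        -o "lowerdir=$lower,upperdir=$base/diff,workdir=$base/work" \
        "$base/merged"
}

# Unmount the merged view. diff/ keeps whatever was written through it.
overlay_down() {
    umount "$1/merged"
}

# Example, as root:
#   overlay_up /tmp/container-demo/alpine /tmp/container-demo
#   overlay_down /tmp/container-demo
```

After overlay_down, the lower directory is unchanged and everything written through merged is still in diff, so mounting again brings the same view back.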
As a final point, we can create a new shell process and set merged as its root directory. This is, in essence, how Linux containers see only their own filesystem. Any process spawned by this shell inherits the root, so it is also confined to this filesystem.
michal@michal-lg:/tmp/container-demo$ sudo chroot merged /bin/sh
/ # ls
bin hello.txt media proc sbin tmp
dev home mnt root srv usr
etc lib opt run sys var
/ # cd ..
/ #
A full implementation would take advantage of namespaces for additional isolation. You can learn more about those in the Linux man-pages, starting with namespaces(7).
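As a rough sketch of that next step, the unshare tool from util-linux can start the chrooted shell inside new namespaces. contained_shell is a hypothetical helper name, and this only covers mount, UTS, and PID namespaces, which is far from everything Docker sets up.

```shell
# Start a shell whose root is $1, in new mount, UTS and PID namespaces.
# contained_shell is a hypothetical helper; assumes util-linux unshare
# and must run as root.
contained_shell() {
    # --fork is needed with --pid so the shell runs as PID 1
    # of the new PID namespace.
    unshare --mount --uts --pid --fork chroot "$1" /bin/sh
}

# Example, as root:
#   contained_shell /tmp/container-demo/merged
```

Inside that shell, changing the hostname no longer affects the host, because the UTS namespace is separate.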
Conclusion
Containers are a black box for the vast majority of engineers. You should now have a solid understanding of how container filesystems work under the hood!
If you are interested in a more complete implementation of Docker from scratch, consider checking out my Golang Docker clone that’s around 200 lines of code.
I hope you learned something new! If you did, consider subscribing and/or following me on LinkedIn.
You might also enjoy some of my other posts linked below. Perhaps my deep-dive into SQLite storage format, my implementation of MapReduce from scratch, or my series building an ML inference engine from scratch?